SOME CONTRIBUTIONS TO DIMENSIONALITY REDUCTION By Wei Tong A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Computer Science 2010 ABSTRACT SOME CONTRIBUTIONS TO DIMENSIONALITY REDUCTION By Wei Tong Dimensionality reduction is a long standing challenging problem in the fields of statistical learning, pattern recognition and computer vision. Numerous algorithms have been proposed and studied in the past decades. In this dissertation we address several challenging problems emerged in recent studies of dimensionality reduction. We first explore the dimensionality reduction method for semi-supervised classification via the idea of mixed label propagation in which we attempt to find the best one dimensional embedding of the data in which data points in different classes can be well separated and the class labels are obtained by simply thresholding the one dimensional representation. In the next, we explore the dimensionality reduction methods for non-vector data representations. We first look into the problem in which a datum is represented by a matrix. We give a convex formulation to the problem of dimensionality reduction for matrices and developed an efficient approximating algorithm to solve the associated semi-definite programming problem. In the last, we studied the problem of dimensionality reduction with even more challenging data representation, i.e., each datum is described as a different number of unordered vectors. We convert the problem into the functional data dimensionality reduction and study it in the context of large scale image retrieval. ACKNOWLEDGMENTS Time flies! It has been six years since I started my Ph. D. study at MSU. Looking back, I am surprised and at the same time very grateful for all I have received throughout these years. My first, and most earnest, acknowledgment must go to my advisor and the chair of my Committee Dr. Rong Jin. With his enthusiasm, his inspiration, and his great efforts to explain things clearly and simply, he helped to make mathematics easy for me. Throughout my thesis-writing period, he provided encouragement, sound advice, good teaching, good company, and lots of good ideas. I would have been lost without him. In every sense, none of this work would have been possible without him. Many thanks also to committee members Dr. Anil Jain, Dr. Joyce Chai and Dr. Selin Aviyente. Many thanks to my previous advisor Dr. Juyang Weng with whom I worked in my first two years of study. I would like to present my sincere appreciation for his valuable support during those years. Many people on the faculty and staff of the CSE department at MSU assisted and encouraged me in various ways during my studies. I am especially grateful to Prof. George Stockman, Dr. Eric Torng, Linda Moore, Norma Teague, Adam Pitcher and Debbie Kruch. My Ph. D. study would not have been the same without the great support from my friends from EI, PRIP and LINKS labs. I am particularly thankful to Xiao Huang, Nan Zhang, Shuqing Zeng, Zhengping Ji, Feilong Chen, Xiaoguang Lv, Hong Chen, Yi Chen, Yi Liu, Tianbao Yang, Yang Zhou, Jinfeng Yi and Fengjie Li. I also owe a huge debt of gratitude to all the people from the Institute of Water Research at MSU where I worked as a system administrator for more than three years. They are, Jon Bartholic, Jeremiah Asher, Yi Shi, Cindy Brewbaker, Stephanie Smith, Glenn O’Neil, Debra Porter, Ruth Kline-Robach. 
iii My final, and the most heartfelt, acknowledgment must go to my family for their unlimited and unconditional encouragement, support and love. Learning to love you and to receive your love make me a better person. You have my everlasting love! iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 Literature Review of Dimensionality Reduction . . . . 2.1 PCA and Related Methods . . . . . . . . . . . . . . . . . . 2.1.1 Principle Components Analysis (PCA) . . . . . . . 2.1.2 Factor Analysis . . . . . . . . . . . . . . . . . . . . 2.1.3 Probabilistic PCA . . . . . . . . . . . . . . . . . . 2.1.4 Metric Multidimensional Scaling . . . . . . . . . . . 2.1.5 Kernel PCA . . . . . . . . . . . . . . . . . . . . . . 2.2 Graph-Based Method . . . . . . . . . . . . . . . . . . . . . 2.2.1 Isomap . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2 Local Linear Embedding (LLE) . . . . . . . . . . . 2.2.3 Laplacian Eigenmaps . . . . . . . . . . . . . . . . . 2.2.4 Maximum Variance Unfolding . . . . . . . . . . . . 2.2.5 From the Kernel Point of View . . . . . . . . . . . 2.3 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . 2.3.1 Low Rank Approximation . . . . . . . . . . . . . . 2.3.2 Singular Value Decomposition . . . . . . . . . . . . 2.3.3 Non-negative Matrix Factorization . . . . . . . . . 2.3.4 Maximum Margin Matrix Factorization . . . . . . . 2.4 Supervised and Semi-Supervised Dimensionality Reduction 2.4.1 Linear Discriminate Analysis . . . . . . . . . . . . . 2.4.2 Semi-Supervised LLE and ISOMAP . . . . . . . . . 2.5 Functional Data Dimensionality Reduction . . . . . . . . . 2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Semi-Supervised Learning by Mixed Label Propagation . . . . 3.1 Graph-Based Semi-Supervised Learning . . . . . . . . . . . . . . . 3.2 Mixed Label Propagation . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 The Framework of Mixed Label Propagation . . . . . . . . 3.2.2 An Efficient Algorithm for Mixed Label Propagation . . . 3.2.3 Application to Collaborative Filtering . . . . . . . . . . . . 3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.1 Experiment (I): Effectiveness of Mixed Label Propagation . . . . . . . . . . . . . . . . . . . . 51 . . . . 51 . . . . 53 . . . . 54 . . . . 55 . . . . 58 . . . . 59 . . . . 62 v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 5 5 8 11 14 15 18 18 19 23 26 29 32 32 33 34 38 42 42 44 46 50 3.4 3.5 3.3.2 Experiment (II): Empirical Values for λ . . . . . . . . . . . . . . . . . Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Second-Order PCA-Style Dimensionality Reduction Algorithm by SemiDefinite Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Second-Order PCA-Style Algorithms . . . . . . . . 
. . . . . . . . . . . . . . 4.2 A Convex Formulation for SOPCA . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Low Norm Approximation of a Single Matrix . . . . . . . . . . . . . . 4.2.2 Low Norm Approximation for Multiple Matrices . . . . . . . . . . . . 4.2.3 An SDP Formulation for SOPCA . . . . . . . . . . . . . . . . . . . . 4.3 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 63 64 69 69 72 72 75 77 80 81 84 5 Dimensionality Reduction for Functional Data Representation and Its Application to Large-Scale Image Retrieval by Local Features . . . . . . . . 86 5.1 Content-Based Image Retrieval with Local Image Features . . . . . . . . . . 87 5.2 Dimensionality for Functional Data Representation . . . . . . . . . . . . . . 89 5.3 Efficient Image Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.4 Comparing to The-State-Of-The-Art Methods . . . . . . . . . . . . . . . . . 98 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.5.2 Implementation and Baselines . . . . . . . . . . . . . . . . . . . . . . 102 5.5.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.5.4 Results on The Tattoo Image Dataset . . . . . . . . . . . . . . . . . . 103 5.5.5 Parameter λ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.5.6 Parameter ρ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.5.7 Number of Random Centers . . . . . . . . . . . . . . . . . . . . . . . 105 5.5.8 Results on Oxford Building and Oxford Building + Flickr Datasets . 105 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 A Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 B Schur Complement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 C Semi-Definite Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 120 D Trace Norm And Ky Fan k-Norm . . . . . . . . . . . . . . . . . . . . . . . 122 vi BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 vii LIST OF TABLES 3.1 Average precision for the Mixed Label Propagation (MLP), Label Propagation (LP), Pearson Correlation Coefficient (Pearson) and Maximum-Margin Matrix Factorization(MMMF) using 10 training users. . . . . . . . . . . . . . . . . . 59 Average precision for the Mixed Label Propagation (MLP), Label Propagation (LP), Pearson Correlation Coefficient (Pearson) and Maximum-Margin Matrix Factorization(MMMF) using 20 training users. . . . . . . . . . . . . . . . . . 60 4.1 Statistics of data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.2 The classification errors of 1-NN classifier using the proposed dimensionality reduction algorithm for SOPCA, GLRAM and 2-D SVD. For each method, the first line is the mean error rates of the ten-fold cross validation and the second line is the standard deviations of the error rates. k represents the target dimension. The best performance of each case is highlighted by the bold font. . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.2 5.1 Statistics of the datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.2 The preprocessing and retrieval time of the two methods on tattoo dataset. . 104 5.3 mAP results of the proposed method and BoW method for Oxford5K building data set and Oxford5K+Flickr1M data set. . . . . . . . . . . . . . . . . . . 106 5.4 Preprocessing and retrieval times of the two methods with one million cluster/random centers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 viii LIST OF FIGURES 3.1 3.2 The Average rating of the Mixed Label Propagation (MLP), Label Propagation (LP), Pearson Correlation Coefficient (Pearson) and Maximum-Margin Matrix Factorization(MMMF) using 10 training users. . . . . . . . . . . . . . 66 The Average rating of the Mixed Label Propagation (MLP), Label Propagation (LP), Pearson Correlation Coefficient (Pearson) and Maximum-Margin Matrix Factorization(MMMF) using 20 training users. . . . . . . . . . . . . . 67 3.3 The λs computed for the 100 movies with 20 training users and 10 given ratings. 68 5.1 The CMC scores for tattoo image retrieval with one million cluster/random centers. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation. . 109 5.2 Results of the proposed method for tattoo image retrieval with different value ¯ of λ base on 1 million random centers with ρ = 0.6d . . . . . . . . . . . . . . 110 5.3 Results of the proposed method for tattoo image retrieval with different value of ρ base on 1 million random centers . . . . . . . . . . . . . . . . . . . . . . 111 5.4 Results of the proposed method for tattoo image retrieval with different number of centers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.5 Examples of two queries (the first column) and the first six retrieved images. The first two rows give the retrieved results for the Oxford5K building database, and the next two rows give the retrieved results for Oxford5K+Flickr1M database. The correctly retrieved results are outlined in green and while irrelevant images are marked by the red color. . . . . . . . 113 ix CHAPTER 1 Introduction The curse of dimensionality[8] is a well known problem in the fields of machine learning and statistical pattern recognition. It refers to the fact that, in the absence of simplified assumption, the sample size needed to estimate a function of multiple variables to a given degree of accuracy grows exponentially in the number of variables. In other words, when the number of variables – the dimensionality of data – increases, the number of data points needed to determine an accurate estimation could quickly exceed one’s processing ability. To illustrate the challenge of handling high dimensional data, consider a face recognition system based on grey scale images. Assume the size of each image is 256 × 256, which can also be treated as a vector of 65536 dimension. In order to train a perceptron, a simple linear classifier for face recognition, it requires solving an optimization problem with 65537 parameters. The size of the recognition problem will increase exponentially when considering a multi-layer perceptron, making it impractical unless certain simplifying assumptions are made. 
The challenge of handling high dimensional data calls for the techniques of dimensionality reduction, whose goal is to significantly reduce the number of features that are used for data description. More specifically, suppose we have a high dimensional data set D = {x1 , x2 , · · · , xn }, where xi ∈ Rm , i = 1, · · · n, are input patterns of m features. The objective of dimensionality reduction is to compute n output patterns yi ∈ Rr , one for each data point in D, that provide a ”faithful” low dimensional representation of the original data set with r m. By faithful, we mean that the mapping from x to y preserves the geometric relationship among data points. Namely, if two data points are close to each other in the 1 original space Rm measured by the Euclidean distance, they should also be close to each other in the mapped space Rr , and vice versa. Without loss of generality we assume, if not specified, the input patterns are centered at the origin, i.e., n i=1 xi = 0. Numerous algorithms have been proposed and studied in the past decades, for example, the classical PCA family algorithms, graph based methods and matrix factorization based methods. However, there are two limitations of these methods. The first one is that most of these methods are either supervised or unsupervised methods, there are few algorithms developed for semi-supervised dimensionality reduction in which information from both labeled and unlabeled data can be explored. In this dissertation, we try to develop such an algorithm for semi-supervised dimensionality reduction via the idea of mixed label propagation. We approximate a data classification problem into a dimensionality reduction one by finding the best one dimensional embedding of the data in which data points in different classes can be well separated. We then obtain the class labels by simply thresholding the one dimensional representation. The second limitation of the existing dimensionality reduction methods is that each input data must be a vector with fixed length. However, in many real applications, the natural representation of each datum could be versatile. For example, an image is represented as a matrix of pixels, a text document can be represented as a set of vectors with different length. In those situations, in order to apply the dimensionality reduction methods, one has to convert the natural representation of each datum into the vector representation and this conversion may cause serious problems. For instance, by converting an image from the matrix presentation into the vector representation, all the neighborhood information between pixels are lost. This also makes the dimension of the resulting vector to the extremely high and often leads to very high computational cost when applying dimensionality reduction methods. In order to overcome this limitation, we study the problem of dimensionality reduction when each datum has a non-vector representation. More specifically, We first look into the problem in which a datum is represented by a matrix. We give a convex formulation to the problem 2 of dimensionality reduction for matrices and developed an efficient approximating algorithm to solve the associated semi-definite programming problem. Second, we studied the problem of dimensionality reduction with each datum is described as a different number of unordered vectors. We convert the problem into the functional data dimensionality reduction and study it in the context of large scale image retrieval. 
The rest of the dissertation is organized as the following: in the next chapter, we give an literature review of the methods for dimensionality reduction which have close relationship to the problem we are going to solve in this dissertation. In the third chapter, we address the problem of dimensionality reduction for semi-supervised learning by mixed label propagation. In the fourth chapter, we study the problem when an input data is represented as a matrix instead of a vector. In the fifth chapter, we address the problem of how to reduced the dimensionality for functional data representation. The conclusion is presented in the last chapter. 3 CHAPTER 2 Literature Review of Dimensionality Reduction In this chapter we start our review in Section 2.1 with one of the most widely used methods for dimensionality reduction, principal component analysis (PCA), followed by metric multidimensional scaling (MDS) and several unsupervised dimensionality reduction algorithms that are closely related to PCA and MDS. Although these classical methods were traditionally classified as a linear dimensionality reduction that transforms the input patterns by a linear projection, they can be extended to nonlinear dimensionality reduction by the introduction of kernel functions. Section 2.2 describes several graph-based methods for dimensionality reduction. They are non-linear dimensionality reduction methods and are particularly useful for analyzing high dimensional data that are sampled from a low dimensional manifold. Although these methods are based on different geometric intuitions and use different computational procedures, they share similar computational structures. Section 2.3 reviews dimensionality reduction methods based on matrix factorization that are not covered in most recent surveys of dimensionality reduction [135, 19, 45, 21]. In Section 2.4 we review both supervised and semi-supervised methods for dimensionality reduction. Lastly, in Section 2.5, we review the functional dimensionality reduction in which the data are presented by functions instead of vectors. 4 2.1 PCA and Related Methods Principal component analysis (PCA) and its related linear methods, such as multidimensional scaling (MDS), are possibly the most widely used dimensionality reduction techniques. These algorithms are computationally efficient and easily implemented. As a result, these techniques have found a large number of applications as well as a number of different derivations. It is also important to note that despite the simplicity, the basic geometric intuition of PCA and MDS has inspired many algorithms for non-linear dimensionality reduction. 2.1.1 Principle Components Analysis (PCA) Principle Components Analysis [67, 72] is an unsupervised dimensionality reduction method. The key idea of PCA is to identify the low dimensional representation of a high dimensional data set that most faithfully preserves its covariance structure. PCA has been referenced by many names in different fields. For example, in statistics and signal processing it is also known as the Karhunen-Lo`ve transformation, the Hotelling transformation, and the e empirical orthogonal function(EOF). Below we discuss several ways of deriving the algorithm of PCA based on different principles. Reconstruction with Minimum Squared Error Given a data set D = {x1 , x2 , · · · , xn }, where xi ∈ Rm , i = 1, · · · n, PCA projects the input data xi onto a r-dimensional subspace such that the reconstruction error is minimized. 
In other words, we need to find a projection matrix P with rank r < m that minimizes the following objective function: n xi − P xi 2 min P (2.1) i=1 The optimal projection matrix can be factorized as P = U U , where U ∈ Rm×r has orthogonal columns. The r-dimensional representations are given as: yi = U xi , i = 1, · · · , n 5 (2.2) The solution to (2.1) is obtained from the eigen-decomposition of the covariance matrix C= n i=1 xi xi . Let C be expressed as C = m i=1 λi ui ui , where λi is the i-th largest eigen- value of C and ui is the associated eigenvector. The optimal low dimensional representation can be computed using the equation(2.2) with U = [u1 · · · ur ]. Finding an Informative Direction PCA can also be derived from the point of view of finding the most informative direction [19]. Suppose we aim to find a direction u ∈ Rm such that the projection xi · u gives a good one dimensional representation of the original data (i.e., the projection should lose as little information of the original data as possible). Assume that the data in fact lie along a line embedded in Rm , i.e., xi = µ + θi v, where µ is the mean of the samples, v ∈ Rm is a vector with unit length and θi ∈ R. The sample variance of the projection along v is varv = 1 n n ((xi − µ) · v)2 = i=1 1 n n 2 θi i=1 The variance along other unit direction v is 1 varv = n Since v · v 2 n (xi − µ) · v i=1 2 1 = n n 2 θi v · v i=1 = cos2 φ, where φ is the angle between the two directions v and v , projected variance is maximized if and only if v = v. Therefore, the most informative direction for the projection is the one for which the projected variance is maximized. It can be shown that the problem in (2.1) is equivalent to the following optimization problem: n max yi s. t. i=1 1 yi 2 = yi − yj 2 2n U U = I, (2.3) yi = U xi which is equivalent to finding the projection for which the projected variance is maximized. 6 PCA Decorrelates the Samples If we use the full set of orthonormal eigenvectors as the basis, then in this basis the data are represented as Y = U X, where U is the m × m matrix obtained in Section 2.1.1. Thus, the sample covariance matrix in the new space CY is: CY = = 1 n 1 n n yi yi i=1 n U xi U xi i=1 = U CX U = diag (λ1 , · · · , λm ) which means in the new basis, the sample covariance matrix is diagonalized. This implies the samples are uncorrelated in the space specified by matrix U . It is worth emphasizing that the notion of correlation is basis-dependent which is to say that data can be correlated in one basis but uncorrelated in another. Limitation of PCA and Its Extensions Although PCA is proved to be efficient and effective for dimensionality reduction in many applications, it however has its own limitations. The first limitation of PCA is that it is a linear dimensionality reduction method and only identifies the linear subspace that best preserves the covariance structure of data. However, in many real world applications data essentially lie in nonlinear manifold whose structure is usually difficult to be discovered by PCA. A number of extensions have been developed to adapt PCA for non-linear dimensionality reduction. The most widely used approach is kernel PCA [136, 137], which extends the linear PCA by the introduction of kernel functions. We will discuss kernel PCA in Section 2.1.5. Another limitation of PCA is the interpretation of the principal components can be difficult at times. 
Although the dimensions identified by PCA are uncorrelated variables constructed 7 as linear combination of the original variables and have some desirable properties, they do not necessarily correspond to meaningful physical quantities. In some cases, such loss of interpretability is unsatisfactory to the domain scientists. Non-negative matrix factorization [87] was introduced to give meaningful approximation of a non-negative matrix by non-negative low rank factorization. We will discuss this in Section 2.3.3, The third limitation of PCA is that the identified dimensions may not be informative to discriminate data points of different classes. This is because the objective of PCA is to find dimensions that preserve the overall structure of data points, not to differentiate data points of different classes. For example, given images of printed uppercase letters O and Q, PCA might discover the gross features that characterize Os and Qs, but might ignore the tail that distinguishes an O and a Q. Linear discriminant analysis (LDA), which will be discussed in Section 2.4.1, addresses this shortcoming of PCA. It seeks the directions that are efficient to distinguish between letter O and letter Q. 2.1.2 Factor Analysis Factor analysis [51] is another popular method for dimensionality reduction. Like PCA, factor analysis can be viewed as an attempt to approximate the covariance matrix. The essential purpose of factor analysis is to describe, if possible, the covariance relationships among multiple variables in terms of a few underlying, but unobservable, random quantities called factors. The factor model is motivated by the assumption that variables can be grouped by their correlations. That is, all variables within the same group are highly correlated among themselves but have relatively small correlations with variables in a different group. It is conceivable that each group of variables represents a single underlying construction, or factor, that is responsible for the observed correlations. 8 The Orthogonal Factor Model Suppose the observable random vector x, with m components/attributes, has mean µ and covariance matrix Σ. The factor model postulates that all the attributes in x are linearly independent given the unobservable random variables y1 , · · · , yr , called common factors, and m additional sources of variation 1 , · · · , m , called errors or specific factors. In matrix notation, the factor model is expressed as: x = Wy + µ + (2.4) The m×r matrix W is call the matrix of factor loadings. The motivation is that, with r < m, the unobservable variables will offer a more parsimonious explanation of the dependencies between the observations. Furthermore, we introduce the following assumptions about the random vector y and , E(y) = 0, E( ) = 0, and that y and Cov(y) = I (2.5) Cov( ) = Ψ = diag(ψi , · · · , ψm ) are independent, Cov(y, ) = 0r×m These assumptions and the relation in (2.4) constitute the orthogonal factor model and the covariance structure for x can be inferred as: Σ = Cov(x) (2.6) = E (x − µ) (x − µ) = E (W y + ) (W y + ) = WW +Ψ Apparently there exist an arbitrary rotation matrix R, such that W and W R generate the same covariance matrix Σ. 9 The analysis of the factor model proceeds by imposing conditions that allow one to uniquely estimate W and Ψ. The loading matrix is then rotated, where the rotation is determined by some ’easy-of-interpretation’ criterion [71]. 
Once the loadings and the specific variances are obtained, factors are identified and estimated values for the factors themselves are constructed. Two popular methods of parameter estimation are the Principal Component method and Maximum Likelihood method[71]. Comparisons Between Factor Analysis and PCA First, it’s clear that a major distinction between factor analysis and PCA is that the factor analysis defines a specific generative model for the observed data patterns x, while PCA does not. Second, both factor analysis and PCA can be thought of as trying to preserve the covariance matrix Σ (or correlation matrix) as well as possible. The difference between them is that PCA concentrates on the diagonal elements, whereas in factor analysis the interest is in the off-diagonal elements [72]. To justify this statement, we first consider PCA. The objective is to maximize r i=1 var(yi ) = r i=1 λi , which accounts for as much as possible the sum of diagonal elements of Σ. Turning now to factor analysis, consider the factor model (2.4) and the corresponding equation (2.7) for Σ. It can be seen that since Ψ is diagonal, the common factor term W y in (2.4) accounts completely for the off-diagonal elements of Σ in the perfect factor model, but there is no compulsion for the diagonal elements to be well explained by the common factors. The elements, ψj , j = 1, · · · , m, of Ψ will be low if all of the variables have considerable common variation, but if a variable xj is almost independent of all other variables, then ψj = var( j ) will be almost as large as var(xj ). Thus, factor analysis concentrates on explaining only the off-diagonal elements of Σ by a small number of factors whereas conversely PCA concentrates on the diagonal elements of Σ. The third difference between the two techniques is the number of dimensions r that is 10 sufficient to represent the data in m dimensional space [72]. In PCA, if one variable is independent from other variables, it usually corresponds to a principal component. In PCA it is usually necessary to include such a principal component as part of the space with reduced dimensionality because they provide information that is not available from other variables. In contrast, in factor analysis, an independent attribute is usually accounted for by the noise model and therefore is not explained by the hidden factor. Thus, for most data sets the number of factors required by factor analysis is usually no larger – and may be strictly smaller – than the number of principal components required to account for the variation in the data. One more difference between PCA and factor analysis is that it usually has more impact on factor analysis than it does on PCA [72] by varying the value of dimensionality of the model. In particular, in PCA when r is increased from r1 to r2 , an additional (r2 − r1 ) principal components are extracted and included to generate the subspace of the reduced dimensionality. In contrast, this is not true in factor analysis. When r is increased from r1 to r2 , we often find that the common factors found by factor analysis could be completely different from the factors that are identified when r = r1 . 2.1.3 Probabilistic PCA Probabilistic PCA (PPCA) was first introduced by Topping and Bishop [155, 154], where they showed how the principal axes of a set of observed data vectors may be determined through maximum likelihood estimation of parameters in a latent variable model closely related to factor analysis. 
An EM algorithm was also introduced [155, 154] to efficiently solve PPCA without having to compute the eigenvectors of a matrix. Although PPCA was originally designed to interpret PCA from a viewpoint of probabilistic model, it has several advantages when compared to the original PCA. First, the PPCA can be applied to estimate the principal axes even when some of the elements in the data matrix are missing [155, 131]. Second, PPCA provides a natural way to construct more complex projection methods, for 11 instance, by combining multiple PCA models [155]. The Probabilistic Model In the case of isotropic noise ∼ N (0, σ 2 I), equation (2.4) implies the conditional probability distribution of x given y is expressed as: x|y ∼ N W y + µ, σ 2 I (2.7) Assuming a Gaussian prior for the latent variable y defined by y ∼ N (0, I) , (2.8) we can obtain the marginal distribution of x by integrating out the latent variables. The resulting distribution for x is again Gaussian, and is expressed as: x ∼ N (µ, C) , where C = W W (2.9) + σ 2 I. We now try to derive the conditional distribution of the hidden factor y given x. Using Bayes’ rule, the posterior distribution of the latent variables y given observed x is again Gaussian: y|x ∼ N M −1 W (x − µ) , σ 2 M −1 (2.10) where M = W W + σ 2 I, and note that M is of size r × r while C is m × m The log-likelihood of the observed data under this model is: n L= ln (p (xi )) = − i=1 where S= 1 n n d ln (2π) + ln |c| + tr C −1 S 2 n (xi − µ) (xi − µ) i=1 is the sample covariance matrix of the observation. 12 (2.11) The Maximum-likelihood Estimators Estimates for W and σ 2 may be obtained by EM algorithm [155] or may be obtained explicitly with C which is W W + σ 2 I. Given σ, the noise level, the optimal solution to the loading matrix W that maximizes the log-likelihood L, is maximized when: 1 WM L = Ur Λr − σ 2 I 2 R (2.12) where the r column vectors in the m × r matrix Ur are the principal eigenvectors of S, with corresponding eigenvalues λ1 , · · · , λr in the r × r diagonal matrix Λr , and R is an arbitrary r × r orthogonal rotation matrix. It may also be shown that, for W = WM L , the maximum-likelihood estimation for σ 2 is given by: 1 2 σM L = m−r m λj (2.13) j=r+1 which is a clear interpretation as the variance ’lost’ in the projection, averaged over the lost dimensions. In practice, to find the most likely model given S, we would first estimate σ 2 from (2.13), and then WM L from (2.12). For simplicity we would effectively ignore the rotation matrix R, i.e. choose R = I [153]. Alternatively, an EM algorithm was also developed in [153] and closely related works can be found in [47, 131]. Zero Noise Limit In conventional PCA, the reduced-dimensionality transformation of a data point xi is given ˆ by yi = U (xi − µ) and its reconstruction by xi = U yi + µ. However, in the probabilistic framework of PPCA, the probabilistic analogue of the dimensionality reduction process of conventional PCA would be to invert the conditional distribution p(x|y) using Bayes’ rule, to give p(y|x). In this case, each data point xi is represented in the latent space not by a single vector, but by the Gaussian posterior distribution defined by (2.10). As an alternative to 13 the standard PCA projection, a convenient summary of this distribution and representation of yi would be the posterior mean yi |xi = M −1 WM L (xi − µ) However, it is worth emphasizing that WM L xi +µ is not the optimal linear reconstruction of xi (in the squared reconstruction-error sense). 
This may be seen from the fact that for σ 2 > 0, M −1 WM L is not an orthogonal projection of xi . If we consider the limit as σ 2 → 0, 1 then WM L = U Λ 2 (if we ignore the rotation matrix R) and the projection WM L xn = WM L WM L WM L −1 WM L (xi − µ) does become orthogonal and is equivalent to conventional PCA, but then the density model is singular and thus undefined. 2.1.4 Metric Multidimensional Scaling Classical Multidimensional Scaling Classical multidimensional scaling(MDS) [28] computes the low dimensional representations that most faithfully preserve the inner products between the high dimensional data points. It is usually cast into the following optimization problem: (xi xj − yi yj )2 = G − K F min y (2.14) ij where · F represents the Frobenius norm. G and K are the Gram matrix of the inputs and outputs respectively, with Gij = xi xj and Kij = yi yj . In many applications, the relationship between two data points is usually represented by their distance instead of their pair-wise similarity (i.e., dot product). Let Dij = xi − xj 2 and D be the matrix of squared pair-wise Euclidean distances. It is easy to derive matrix G from matrix D as: 1 1 1 G = − (I − 11 )D(I − 11 ) 2 n n 14 where 1 denotes the vector of all ones. The solution to MDS is obtained from the eigen-decomposition of the Gram matrix G. Suppose G = n k=1 λk vk vk , where λk is the k-th largest eigenvalue of G and vk is the corresponding eigenvector. The r dimensional outputs of MDS are given by yi = λi (v1 )i · · · λr (vr )i , i = 1, · · · n Relationships Between Classical scaling and PCA Essentially, MDS and PCA produce the same outputs but in different formats. Recall that in PCA we write the covariance matrix as C = XX while in MDS the Gram matrix is G = X X, where X = [x1 , · · · , xn ]. The equivalence of their outputs can be easily established using the singular value decomposition which is: Λpca = Λmds , V pca = XV mds 1 Y pca = (Λpca ) 2 Y mds Here, V pca and V mds are eigen matrices containing eigenvectors of C and G as columns respectively; Λpca and Λmds are diagonal matrices whose diagonal elements are the corresponding eigenvalues of C and G. 2.1.5 Kernel PCA In recent years, we have witnessed an explosion of work on kernel methods [138]. The basic idea of the kernel methods is to use the “kernel trick” that map a point x in the original r dimensional space X to a point Φ(x) in a NF -dimensional feature space F, where Φ(x) = φ1 (x), φ2 (x), · · · , φN (x) F . We can think of each function φi (x) as a non-linear mapping. The central idea of the kernel trick is based on the observation that for many algorithms, the key quantity is in the form Φ(xi )·Φ(xj ). Hence, instead of explicitly defining the mapping functions Φ(xi ), we only need to define the kernel function that assesses the dot product between Φ(xi ) and Φ(xj ), i.e., k(xi , xj ) = Φ(xi ) · Φ(xj ). Kernel PCA [136, 137] 15 applies the kernel idea to perform PCA in the feature space F. It is important to note that since the projection is performed in the space NF whose dimension can be significantly larger than m, the number of ”reduced” dimensions can actually exceed m. Performing PCA in Dot Product Space F Given data points x1 , x2 , · · · , xn , we assume them to be in the vector space X . Kernel PCA computes the principal components of the mapped points Φ(x1 ), Φ(x2 ), · · · , Φ(xn ) in space F. Since F may be infinite dimensional, the PCA problem needs to be transformed into a problem that can be solved in terms of the kernel k(·, ·). 
To this end, we consider the covariance matrix in F, which is: 1 C= n n (Φi − µ) (Φi − µ) i=1 n i=1 Φi . We are looking for eigenvector solutions v of 1 where Φi = Φ(xi ) and µ = n Cv = λv. Since this can be written as 1 n n (Φi − µ) [(Φi − µ) · v] = λv i=1 the eigenvectors v lie in the span of the Φi − µ ’s, e.g., n v= αi (Φi − µ) , i=1 for some αi . Thus, we can have m equations (Φi − µ) Cv = λ (Φi − µ) v. Now, consider the ”kernel matrix” Kij , the matrix of dot product in F : Kij = Φi · 1 Φj , i, j = 1, · · · , m. Having the ”centering matrix” H = I − n 11 , the above equation becomes ¯ K c K c α = λK c α 16 ¯ where K c = HKH, α ∈ Rm and λ = mλ. It is also straightforward to show we only need to consider the following equation ¯ K c α = λα Finally, to use the eigenvectors v to compute principal components in F, we need v to be 1 ¯ normalized, that is v · v = 1 = λα · α, so the α must be normalized to have length √ ¯ . λ Thus, for a given test point Φ(x), the p-th eigenvector in feature space F is calculated by: n n 1 1 p p vp , Φ(x) = √ αi (Φ (xi ) · Φ (x)) = √ αi k (xi , x) ¯ ¯ λ i=1 λ i=1 A Relationship Between Kernel PCA and Metric MDS A kernel function is stationary if k xi , xj depends only on the vector xi − xj . A stationary covariance function is isotropic if k xi , xj depends only on the distance d2 = xi − xj · ij xi − xj , so that we can write k xi , xj = r dij . Assume that the kernel matrix is scaled so that r(0) = 1. An example of an isotropic kernel is the RBF kernel k xi , xj = −θ xi −xj e xi −xj , for some parameter θ > 0. ˜ Φi − Φj . With an Consider the Euclidean distance in feature space d2 = Φi − Φj ij ˜ isotropic kernel, this can be re-expressed as d2 = 2 1 − r dij . Thus, the matrix A has ij the elements aij = r dij − 1 which can be written as A = K − 11 . It can be easily verified that the centering matrix H annihilate 11 , so that HAH = HKH. We see that the configuration of points derived from performing classical scaling on K ˜ aims to approximat the feature space distance computed as dij = If non-stationary kernels (for example, k xi , xj = 1 + xi · xj 2 1 − r dij m . with integer m) are used,we can again show the kernel MDS procedure operations on the matrix HKH. However, ˜ the distance dij in the feature space is not a function of dij . For a more detailed analysis, please refer to [165]. 17 2.2 Graph-Based Method Graph-based methods have recently emerged as a powerful tool for analyzing high dimensional data that are sampled from a low dimensional manifold and for revealing highly nonlinear structures [135]. These methods share the similar idea of using a spare graph (for simplicity, we assume that the graph is connected) for data representation, in which each node represents an input pattern and an edge represents a neighborhood relation to discretely approximate the submanifold sampled by the input patterns. From these graphs one can then construct matrices whose spectral decompositions reveal the low dimensional structure of the submanifold. In what follows, we review four widely used graph-based algorithms for manifold learning: Isomap [152], locally linear embedding [130, 134], Laplacian eigenmaps [5] and maximum variance unfolding [161, 150] 2.2.1 Isomap The key idea behind Isomap [152] is to compute the low dimensional representation of a high dimensional data set that most faithfully preservers the pair-wise distances between input patterns as measured along the submanifolds from which they were sampled [135]. 
Isomap is built on the classical MDS but seeks to preserve the intrinsic geometry of the data which is captured in the geodesic manifold distances between all pairs of data points. The crux is estimating the geodesic distance between faraway points given only input-space distance. For neighboring points, input-space distance provides a good approximation to geodesic distance. For faraway points, geodesic distance can be approximated by adding up a sequence of ”short hops” between neighboring points. These approximations are computed efficiently by finding the shortest path in the graph with edges connecting neighboring points. The algorithm has three steps. The first step is to compute the k-nearest neighbors (k is pre-determined) of each input point and to construct a graph whose vertices represent input data points and whose edges connect k-nearest neighbors. In the second step, the edges are 18 assigned weights based on the Euclidean distances dij between all vertices along the shortest paths through the graph. One efficient algorithm to do this is Djikstra’s method [35]. In the third step, the pair-wise distances dij from Djikstra’s algorithm are fed as the input to classical MDS as described in Section 2.1.4, yielding low dimensional output yi ∈ Rr for which yi − yj 2 ≈ dij . The value of r required for a faithful low dimensional representation can be estimated by the number of significant eigenvalues in the Gram matrix constructed by MDS. Some theoretical analysis of Isomap can be found in [10, 39]. In [10] the authors proved that the two distance metrics, one along the submanifold and the other along the path of the graph, approximate each other arbitrarily close when the density of data points tends toward infinity. In [39] the authors further studied the conditions for the successful recovery of co-ordinates. Several valuable extensions of Isomap include Conformal Isomap(C-Isomap) [33], Landmark Isomap(L-Isomap) [33], and the incremental Isomap [85] [84]. C-Isomap is capable of learning the structure of certain curved manifolds. This extension comes at the cost of making a uniform sampling assumption about the data. L-Isomap tries to address the two computational bottlenecks, the high cost in calculating the shortest-path distance matrix and the MSD eigen-decomposition, using a relatively small number of landmark points. The incremental Isoamp does not require all data points be available during training but gives us a way to update the low dimensional representation of data points gradually as more and more samples are collected. A similar extension that can handle out-of-sample data points can be found in [9] 2.2.2 Local Linear Embedding (LLE) LLE [130, 134] is based on computing the low dimensional representation of a high dimensional data set that most faithfully preserves the local linear structure of nearby input patterns [135]. The algorithm differs significantly from Isomap in that its outputs are derived 19 from the smallest eigenvectors of a sparse matrix, as opposed to the largest eigenvectors of a (dense) Gram matrix. The algorithm has three steps. The first step, as Isomap, is to compute the k-nearest neighbors of each high dimensional input point xi . In LLE however, we construct a directed graph which may or may not be symmetric. In this case, the set of edges E consists of ordered pairs (i, j) meaning that j is a neighbor of i, and we use Ni = {j|(i, i) ∈ E} to denote the set of neighbors of i. 
The second step of the algorithm assigns weights wij to the edges in this graph by solving the following least-squares problem: n min wij s. t. wij xj 2 xi − i j∈Ni wij = 0, if j ∈ Ni / wij = 1, i = 1, · · · , n i∈Ni Here, LLE appeals to the intuition that each input pattern and its k-nearest neighbors can be viewed as samples from a small linear ”patch” on a low dimensional submanifold and weights wi,j are computed by reconstructing each input point xi from its k-nearest neighbors. The weights thus constitute a sparse matrix W which encodes local geometric properties of the data set by specifying the relation of each input pattern xi to its k-nearest neighbors. Note that the constrained weights that minimize these reconstruction errors obey an important symmetry: for any particular data point they are invariant to rotations, rescalings, and translations [134]. A consequence of these symmetries is that the reconstruction weights characterize geometric properties that do not depend on a particular frame of reference. In the final step, LLE derives the output yi ∈ Rr that keeps the relations to its k-nearest neighbors which is another least-square problem: 20 n min yi wij yi 2 yi − i s.t. (2.15) j∈Ni yi = 0 i 1 n yi yi = I i It turns out that the solution to (2.15) can be obtained by computing the bottom r + 1 eigenvectors of the matrix (I − W ) (I − W ). Let these normalized eigenvectors be vn , vn−1 , · · · , vn−r , associated with the bottom eigenvalues 0 = λn < λn−1 ≤ · · · ≤ λn−r . 1 We discard vn = ( √n )1 associated with λn = 0, and use the next r eigenvectors to form the outputs as: yi = [(vn−1 )i · · · (vn−r )i ] , i = 1, · · · , n Free Parameters in LLE In LLE, there are two free parameters: the number of nearest neighbors k and the target dimensionality r. The authors of [81, 80] suggested a hierarchical method for automatic selection of an optimal k. For r, [121] suggested to choose r by the number of eigenvalues comparable in magnitude to the smallest nonzero eigenvalue of the cost matrix (I −W ) (I − W ). However, the authors of [134] found that in practice it is not reliable and works only for contrived examples such as data lying on an essentially linear manifold, or data sampled in an especially uniform way so that the lowest nonzero eigenvalues are equal or nearly equal due to symmetry. Instead, they found it is more useful to rely on classical methods [118] for estimating the intrinsic dimensionality r of a data set. A similar idea can also be found in [127]. 21 Extensions Some useful extensions of LLE include LLE from pair-wise distances [134], convex reconstructions [134], out-of-sample extensions [9], and supervised LLE [32, 127, 128]. The LLE from pair-wise distance solves the problem when the user may not have access to data in the form of high dimensional vectors xi , but only the measurements of dissimilarity or distance between data points. In the convex reconstructions, the author borrowed the idea from the non-negative matrix factorization [87] and additionally constrained the weights wij to be nonnegative, thus forcing the reconstruction of each data point to lie within the convex hull of its neighbors. Such a constraint has both advantages and disadvantages. On one hand, it tends to increase the robustness of linear fits to outliers. On the other hand, it can degrade the reconstruction of data points that lie on the boundary of a manifold and outside the convex hull of their neighbors. For such points negative weights may be helpful. 
Supervised LLE [32, 127, 128] was introduced to deal with labeled data points. The intuition is that in order to obtain disjointed embeddings for individual classes, we would expect the local neighborhood of any sample xi in class yi ∈ {1, . . . , c} should be composed of samples belonging to the same class. This can be achieved by artificially increasing the pre-calculated distances between samples belonging to different classes, but leaving them unchanged if samples are from the same class, which is: D = D + α max(D)∆, α ∈ [0, 1] where ∆ij = 1 if the data point xi and xj are in the same class, otherwise it is 0. When α = 0, one obtains unsupervised LLE; when α = 1, the result is a fully supervised LLE. In fact, one can further show that this supervised version of LLE behaves as a nonlinear Fisher mapping (or linear discriminant analysis, LDA). 22 2.2.3 Laplacian Eigenmaps Laplacian Eigenmaps [4, 5] is a nonlinear dimensionality reduction method. Similar to the other graph-based approach, LE constructs a neighborhood graph for given data that incorporates the pairwise relationship among the data. Using the notion of graph Laplacian, LE derives a low dimensional representation for a given data set that preserves the local neighborhood of individual data points. The low dimensional mapping generated by LE can also be viewed as a discrete approximation to a continuous mapping that naturally arises from the geometry of the manifold. Alrgorithm Given a data set D = {x1 , x2 , · · · , xn }, where xi ∈ Rm , i = 1, · · · n, we first compute the k-nearest neighbors of each data point xi and construct a symmetric undirected graph. In the graph, each node corresponds to an individual data point. Two nodes are connected by an edge if one node is a local neighbor of the other node. In the second step, we assign positive weights Wij to the edges of the constructed graph. According to the LE algorithm, the weight is computed by a heat kernel that is defined as follows: if nodes i and j are connected by an edge in the graph, we compute the weight for the edge, denoted by Wij , as xi − xj 2 Wij = exp − σ2 , otherwise, Wij = 0 if node i and j are disconnected. In the last step of the algorithm, we derive the low dimensional representation of the data points by computing the eigenvalues and the eigenvectors of the following generalized eigenvector problem: Lv = λDv where D is a diagonal matrix with diagonal elements Dii computed as Dii = j Wij . Matrix L = D − W is often referred to as the Laplacian matrix, and is a well known quantity in the Graph theory for characterizing the properties of a weighted graph. Let 23 0 = λ0 ≤ λ1 ≤ · · · ≤ λn−1 be the eigenvalues of the graph Laplacian L and v0 , · · · , vn−1 be the corresponding eigenvectors. We construct the r dimensional embedding of the original data set by using the first r+1 eigenvectors of L, i.e., v1 , . . . , vr . In particular, the embedding vector of data point xi , denoted by yi , is computed as yi = ([v1 ]i , . . . , [vr ]i ). Note that we did not use v0 in the construction of the low dimensional embedding since its corresponding eigenvalue is zero. Optimal Embeddings As illustrated in [5], we use the following example to demostrate why the embedding generated by the Laplacian Eigenmap algorithm preserves the local neighborhood structure. Given a data set and its weighted graph G = (V, E) constructed by the LE algorithm, we consider the problem of mapping the weighted graph G to a line (i.e., one dimensional embedding). We denote by y = (y1 , . . 
. , yn ) the image of this one dimensional mapping. The objective of this mapping is to ensure that the connected points in the graph G should stay as close to each other as possible after the mapping. We can formulate this criterion into the following optimization problem (yi − yj )2 Wij . min y ij Specifically, the objective function with the choice of weights Wij incurs a penalty if neighboring points are mapped far apart. It turns out that for any y, the objective 1 2 2 ij (yi − yj ) Wij can be further written as y Ly. The above problem thus can be expressed as: min y Ly s. t. y Dy = 1 y y D1 = 0. Note that we introduce two constraints into the optimization problem. The first constraint y Dy = 1 is introduced to remove an arbitrary scaling factor in the embedding; the second 24 constraint y D1 = 0 is introduced to eliminate the trivial solution that collapses all vertices of G into a single number. We can also interpret the second constraint as the removal of a translation invariance in the mapping y. We can generalize the above discussion to the problem of embedding a graph into rdimensional Euclidean space. Let Y = (y1 · · · yr ) be the resulting mapping, where y(i) , i.e., the ith row of matrix Y , is the low dimensional representation of data point xi . Similar to the above discussion we cast it into the following optimization problem: y(i) − y(j) 2 wij = tr(Y LY ) min Y s. t. ij Y DY = I. The constraint Y DY = I ensures that the dimensionality of the resulting subspace is r. The solution of the above optimization problem is given by the first r+1 smallest eigenvectors of the generalized eigenvalue problem Ly = λDy. Laplacian Eigenmap and LLE Laplacian eiganmap and LLE are similar in that both methods try to preserve the local neighborhood structure and the low dimensional embeddings are derived by the smallest eigenvectors of certain matrices. In [5], the authors analyzed the connection between the two methods. Recall that we use the semi-definite matrix (I − W ) (I − W ) in LLE and the Laplacian matrix L in Lalacian eigenmap. Both matrices can be treated as operators acting in functions defined in the data points. In [5] the authors showed that under certain conditions, the generalized eigenvector problem of LLE can be expressed as L2 v = λv. Since L2 and L share the same eigen spectrum, we have that LLE and LE share the same low dimensional embedding. 25 Extensions Several extensions of Laplacian eigenmap have recently been developed. In [58], the author introduced an linear version of Laplacian eigenmap called Locality Preserving Projections (LPP), for which a linear projection y = a X is preferred in the third step of the Laplacian eigenmap algorithm, where a is a transformation vector. As a result, the generalized eigenvector problem becomes XLX a = λXDX a. The key advantage of LPP in comparison to LE is that it is computationally more efficient and can easily handle the out-of-sample problem. In [4, 6], the authors extended the idea of Laplacian eigenmap to semi-supervised learning, to which they refered as manifold regularization. The central idea of manifold regularization is that classification functions are naturally defined only on the submanifold in question rather than the total ambient space. Using the Laplace-Beltrami operator, one produces a basis (the Laplacian Eigenmaps) for a Hilbert space of square integrable functions on the submanifold. To recover such a basis only unlabeled examples are required. 
Once such a basis is obtained, training can be performed using the labeled data set. 2.2.4 Maximum Variance Unfolding Maximum variance unfolding(MVU)[161, 162, 135] is also known as semi-definite embedding(SDE) [159, 160, 163], which was first proposed by Weinberger and Saul in 2004. The algorithm tries to ”unfold” the manifold by pulling the data points apart as far as possible, while faithfully preserving the local distances and angles between nearby input data points. The MVU Algorithm Similar to the other graph-based approaches, the first step of MVU is to compute the knearest neighbors of each input pattern. A neighborhood-indicator matrix η is defined as ηij = 1 if and only if one input pattern is the k-nearest neighbor of another, or both input patterns are the k-nearest neighbors of some data point. In other words, we create a fully 26 connected clique of size k + 1 out of every input xi and its k nearest neighbors. In order to preserve both the distances and angles between the k-nearest neighbors, we introduce the following objective function to measure the quality of the embedding y = (y1 , . . . , yn ) yi − yj 2 = xi − xj 2 . (2.16) Note that although the objective in the above only aims to preserve the distance, it also preserves the angle between points in the k-nearest neighbors. This is because the graph is constructed based on the clique of the nearest neighbors. To eliminate a translational degree of freedom in the low dimensional representation, the outputs are also constrained to be centered on the origin leading to the following constraint (2.17) yi = 0. i Finally, since the algorithm attempts to “unfold” the input patterns, it aims to maximize the variance of the mapping, i.e., n 2 i=1 yi 2 . By putting both the objective and the constraints together, we have the following optimization problem for MVU algorithm: max Y i n s. t. 1 yi 2 = 2n yi − yj 2 (2.18) ij yi = 0 i=1 yi − yj 2 = xi − xj 2 , for ηij = 1 The optimization problem listed above is not convex due to the non-convex constraints. However, it can be converted into a convex optimization when using semi-definite embedding [156](see Appendix C), i.e., max Tr K (2.19) K s. t. K 0, Kij = 0 ij Kii + Kjj − 2Kij = xi − xj 2 , for all (i, j) with ηij = 1 27 where, Kij = yi · yj denotes the Gram matrix of the output. After obtaining the Gram matrix K by solving the above optimization problem, similar to MDS, we derive the low dimensional embedding by computing the principle eigenvectors of the Gram matrix K. Extensions For real-world problems, it may be desirable to relax the constraints in (2.19) such that the local distances are preserved only to some approximation. One trick often used is to replace the strict equalities by inequalities [162]. More specifically, we use Kii + Kjj − 2Kij ≤ xi − xj 2 for all (i, j) with ηij = 1 to replace the corresponding equality constraint in (2.19). Interestingly, the dual of this relaxed SDP problem is equivalent to the problem studied by Sun and Xiao in the calculation of the fastest mixing Markov processes on a weighted graph [150, 167]. Additional relaxations of the SDP problem in (2.19) can be found in [162, 14]. One major concern with the MVU algorithm is that in practice it scales poorly to large data sets because it requires solving an SDP problem over a matrix of size n × n, where n is the number of examples. 
In order to address this problem, [159] suggested to approximate the Gram matrix K by K = QLQ , where Q is a n × k matrix that transforms Gram matrix K of size n × n into a landmark matrix L of size k × k. This approximation allows us to solve an SDP problem of significantly smaller size, making the MVU scale well even to large data sets. The transformation matrix Q can be derived by solving a sparse set of linear equations as suggested in [159]. A similar approach is proposed in [164] except that factorized matrix Q is derived by expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. Connecting to the Other Methods It is evident that the objective function of MVU is closely connected to PCA in that both methods are related to the maximization of data variance. MVU is also closely related to MDS since both methods aim to keep the pair-wise distance between nearby data points. 28 Isomap can be interpreted as an approximation to the MVU problem. Since the Euclidean distance between any two points on a manifold is always smaller than their geodesic distance, the Euclidean distance provides an upper bound for the corresponding geodesic distance. In addition, maximizing the variance is equivalent to maximizing the total pair-wise Euclidean distance in the embedding space. Therefore, in this sense Isomap attempts to maximize the variance by directly using the geodesic distances. This interpretation becomes accurate in the limit with increasing sampling density (n → ∞), if the sub-manifold is isometric to a convex subset of the Euclidean space. In particular, this condition guarantees the asymptotic convergence of Isomap algorithm [10, 39]. In this case, the pair-wise geodesic distances become feasible to the MVU problem, and the solution to MVU approaches its upper bound obtained by Isomap, thus MVU converges to the same limit as Isomap. If the above condition is not satisfied, then Isomap and MVU could behave quite differently [160]. More general conditions for the asymptotic convergence of MVU remains an open question. For the connections between MVU LLE and Laplacian eigenmaps, [167] gives detailed discussions in which the authors showed the dual problem of MVU is closely connected to LLE and Laplacian eigenmaps. 2.2.5 From the Kernel Point of View All of the algorithms presented in this section can be viewed as instances of kernel PCA [56, 135], with the kernel matrices that are derived from sparse weighted graphs rather than a pre-defined kernel function. Often, these kernels are described as data-dependent kernels because they are derived from graphs that encode the neighborhood relations of the input patterns in the training set. The foundation of Isomap algorithm is MDS. As pointed out in Section 2.1.5, the Gram matrix constructed in MDS from these geodesic distances can be viewed as a kernel matrix so as to Isomap. For finite data sets however, this matrix in Isomap is not guaranteed to be positive semi-definite. It should therefore be projected onto the cone of positive semi-definite 29 matrices before it is used as a kernel matrix in other settings. The graph Laplacian in the Laplacian eigenmap arises in the description of diffusion on the graph and can be related to Green’s function and heat kernels [78]. The smallest eigenvectors of Laplacian L are equal to the top eigenvectors of the pseudo-inverse of the Laplacian, L† , which can further be viewed as a (centered) kernel matrix for kernel PCA. 
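The following short sketch (Python/NumPy, assuming a symmetric non-negative weight matrix W of a connected graph) illustrates this view: it uses the pseudo-inverse of the graph Laplacian directly as a kernel matrix and extracts its top eigenvectors, as kernel PCA would.

    import numpy as np

    def laplacian_kernel_embedding(W, r=2):
        """Sketch: treat the pseudo-inverse of the graph Laplacian as a
        (centered) kernel matrix and take its top eigenvectors."""
        D = np.diag(W.sum(axis=1))
        L = D - W
        K = np.linalg.pinv(L)                 # L-dagger: centered kernel matrix
        vals, vecs = np.linalg.eigh(K)
        order = np.argsort(vals)[::-1][:r]    # top eigenvectors of L-dagger
        return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

Because the top eigenvectors of L† are the eigenvectors of L with the smallest non-zero eigenvalues, this essentially recovers the Laplacian eigenmap coordinates (up to the D-weighting used in Section 2.2.3).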
Moreover, viewing † † † † † the elements Lij as inner products, the squared distances defined by Lii + Ljj − Lij − Lji are in fact proportional to the round-trip commute times of the continuous-time Markov chain with transition rate matrix L. The commute times are nonnegative, symmetric, and satisfy the triangle inequality thus, Laplacian eigenmaps can be alternately viewed as MDS on the metric induced by these graph commute times. The matrix diagonalized by LLE can also be interpreted as an operator on graphs whose pseudo-inverse corresponds to a kernel matrix. The operator does not generate a simple diffusive process, but in certain cases it acts similarly to the square of the graph Laplacian [56]. For MVU, if we treat the Gram matrix K in MVU as a kernel matrix, then the whole procedure of MVU can be thought as applying the kernel PCA on the learned kernel K. However, the kernel matrix K in MVU is interested in a very different aspect which is, in kernel PCA, the input data are typically mapped from the low dimensional space to a much higher dimensional feature space, while in MVU it is opposite, the kernel matrix K is used to map the inputs into a lower dimensional space. Another useful point is that in MVU, the kernel is learned without supervised information. Summary In this section we presented several graph-based algorithms for dimensionality reduction. Comparing to the linear methods, such as PCA and MDS presented in the previous section, the graph-based methods belong to the category of non-linear dimensionality reduction. Furthermore, Isomap can be viewed as a “global” method since it attempts to preserve all 30 geodesic distance; by contrast, LLE, Laplacian Eigenmap and MVU are “local” methods since only local neighbor structures are preserved through the mapping. Although these methods start from quite different geometrical consideration, they all look similar under the hood and some strong connections have been found among those methods. In terms of scalability, LLE and Laplacian eigenmap are efficient and can scale to modestly large data sets (i.e., n < 10000), provided that one uses special-purpose eigensolvers that are optimized for sparse matrices. MVU is the most inefficient among all the graph-based approaches due to the expense of solving semi-definite programs over n × n matrices. In summary, research on graph-based methods for dimensionality reduction continues at a rapid pace. Motivation for ongoing work include the handling of manifolds with more complex geometries, the need for robustness to noise and outliers, and the scalability to large data sets. 31 2.3 2.3.1 Matrix Factorization Low Rank Approximation The task of dimensionality reduction can also be viewed naturally from the point of matrix factorization. Consider the data set is organized in the observed matrix X ∈ Rm×n , where the columns of X are data vectors xi . Our goal is to approximate X by a product of two matrices U V , where U ∈ Rm×r and V ∈ Rn×r . For each data vector xi , we have xi ∼ U vi , where vi is the ith column of matrix V , implying that each data vector is approximated by a linear combination of column vectors in matrix U and the combination coefficients are given by matrix V . Geometrically speaking, each data vector xi ∈ Rm is approximated by its projection in a r-dimensional subspace that is spanned by column vectors of U . For the convenience of discussion we denote Y = U V as the approximate matrix. Evidently the rank of Y is at most r. 
The most common, and in many ways the simplest, method to measure how well the model Y “approximate” the data X, is the sum squared error or Frobenius distance between X and Y : X −Y 2 = F m n (Xij − Yij )2 i=1 j=1 Minimizing the Frobenius distance can also be seen as the maximum likelihood estimation in the presence of the additive i.i.d. Gaussian noise with a fixed variance in terms of a probabilistic model. In particular, we assume that the observed matrix X is generated by the following stochastic process X = Y + Z, where Y is a rank r matrix and each element in Z is independent and follows a Gaussian distribution with zero mean and constant variance σ 2 . Then, the log-likelihood of Y given 32 the observed matrix X is log Pr(Y |X) = − = − mn ln(2πσ 2 ) − 2 ij (Xij − Yij )2 2σ 2 1 X − Y F + const 2σ 2 As indicated by the above equation, finding Y by maximizing the likelihood is equivalent to minimizing the Frobenius distance between X and Y . It is worth pointing out that besides Frobenius distance, there are other approaches for measuring the discrepancy between data matrix X and the underlying model Y , such as matrix Bregman divergence [2]. More information can be found in [148]. In some applications, besides the rank constraint, additional constraints of factorized matrices are introduced to ensure the appropriateness of the resulting embedding. The most well known one is non-negative matrix factorization [87], which will be introduced in Section 2.3.3 2.3.2 Singular Value Decomposition Singular value decomposition (SVD) is one of the most widely used low rank approximation techniques in dimensionality reduction. PCA, introduced in the Section 2.1, can be viewed as a special case of SVD. Let X be an m × n matrix with rank k. We can write X as X = U ΣV , where D 0 is a matrix of size m × n, where D = diag(σ1 , . . . , σk ) is a diagonal 0 0 matrix whose diagonal entries are the first k singular values of X; • Σ= • U is an orthogonal matrix of size m × m, and V is an orthogonal matrix of size n × n. The columns of U are eigenvectors of XX and are called left singular vectors of X, the columns of V are eigenvectors of X X and are called right singular vectors of X. 33 Note that since only D (i.e., a matrix of size k × k) in Σ is nonzero, it is sufficient to compute the first k column vectors of matrices U and V in order to reconstruct the original data matrix X, i.e., X = Um×k Σk×k Vk×n = Um×k DVk×n In the context of dimensionality reduction, we often need to solve the problem of approximating a matrix X of rank k with another matrix Y which has a lower rank r(i.e., r < k). For example, when the approximation is based on the minimization of the Frobenius norm of the difference between X and Y under the constraint that the rank of Y is r, the solution is given by the SVD of X, i.e., Y = Um×r diag(σ1 , . . . , σr )Vn×r . 2.3.3 Non-negative Matrix Factorization Directly applying SVD will result in factorized matrices that have both positive and negative elements which often contradict the physical meaning of the result. For example, if the data matrix X is used to represent the intensities of pixels in a gray-scale image, it is difficult to interpret the reconstructed matrix Y for a gray-scale image that has negative elements. In this case it is more appealing to find the reduced rank nonnegative factors to approximate a given nonnegative data matrix. This approach is often referred to as nonnegative matrix factorization (NMF) [87]. 
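A small numeric illustration of this sign problem is given below (Python/NumPy, with a randomly generated non-negative matrix): the best rank-r approximation produced by the SVD almost always contains negative entries even though X itself does not.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((20, 12))                  # a non-negative data matrix

    # best rank-r approximation in Frobenius norm via truncated SVD
    r = 3
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Y = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

    print(np.linalg.norm(X - Y, 'fro'))       # approximation error
    print((Y < 0).sum())                      # typically > 0: negative entries appear
    print((U[:, :r] < 0).sum())               # the factors themselves have mixed signs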
Definitions Given a m × n data matrix X with Xij ≥ 0 and a pre-determined positive integer r min(m, n), NMF finds two non-negative matrices U ∈ Rm×r and V ∈ Rn×r , such that X ≈ UV A common way to find U and V is to minimize the Euclidean distance between X and U V : 1 min f (U, V ) = 2 U,V m n Xij − (U V )ij i=1 j=1 s. t. Uia ≥ 0, Vjb ≥ 0, ∀i, a, b, j. 34 2 = 1 X − UV 2 2 F (2.20) Besides the Frobenius norm, [88] provided an information theoretic formulation based on the Kullback-Leibler divergence of X from U V , leading to the following optimization problem: m n min f (U, V ) = U,V Xij log i=1 j=1 Xij (U V )ij − Xij + (U V )ij (2.21) s. t. Uia ≥ 0, Vjb ≥ 0, ∀i, a, b, j. Other alternative objective functions that measure the divergence between X and its reconstruction U V can be found in [27, 157, 54, 34]. It is worth pointing out that in [37], the non-negative constraints of U and V are relaxed by only requiring V to be non-negative extending the application of NMF methods. Algorithms In most cases NMF algorithms can be classified into three general categories[11]: multiplicative update algorithms, gradient descent algorithms, and alternating least squares algorithms. [26] recently created a library (NMFLAB) of MATLAB routines that include most of the existing algorithms for NMF algorithms. Below we briefly review algorithms of these three categories. Multiplicative update algorithms The Multiplicative update algorithm was first presented in [87]. It employs the following equations for updating the solution for U and V U X Vbj ← Vbj U UV bj Uia ← Uia bj (XV )ia (U V V )ia (2.22) Usually, a very small number such as 10−9 is added to both of the denominators in (2.22) to avoid division by zero. In [88], Lee and Seung claimed that the above algorithm converges to a local minimum which was later shown to be incorrect[91, 43, 50, 24]! To understand why the updating equation in (2.22) may not lead to local optimal, we consider the Karush- 35 Kuhn-Tucker (KKT) conditions for solution U and V : U ≥ 0 V − X)V (U V ≥ 0 ≥ 0 U (U V (U V U (U V − X) ≥ 0 − X)V • U = 0 − X) • V = 0, where • represents the element-wise product between two matrices. It’s easy to show that if the optimal point(U ∗ , V ∗ ) does not have any elements equal to 0, then it converges to a local minimum. However, if certain elements in U ∗ or V ∗ become zero after several iterations, ∂f ∗ for instance if Vij is zero, one has to show ∂V (U ∗ , V ∗ ) ≥ 0 in order to ensure the local optimality of the solution. It is not evident that the multiplicative update rules in (2.22) are able to deliver this guarantee. Because of this issue, the local optimality of the solution found by the multiplicative method is not always guaranteed. In [90], the authors propose an improved version of the Lee and Seung algorithm that is guaranteed to converge to a stationary point, although it requires slightly more work per iteration than the original Lee and Seung algorithm. Despite the convergency problem, the Lee and Seung multiplicative update algorithm has become a common approach for solving NMF. Another shortcoming of the Lee and Seung algorithm is that it is notoriously slow to converge. Several modification have been proposed to address this problem [50, 34, 90]. Gradient descent algorithms The Gradient descent algorithms use the following equations for updating U and V ∂f ∂f (2.23) Vbj ← Vbj − V Uia ← Uia − U ∂Uia ∂Vbj where U and V are the step sizes. 
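Before turning to the choice of step sizes, it is useful to have a concrete reference for the multiplicative updates in (2.22). The following is a minimal sketch in Python/NumPy, including the small constant added to the denominators; it is illustrative only and uses the factorization X ≈ U Vᵀ with U of size m × r and V of size n × r.

    import numpy as np

    def nmf_multiplicative(X, r, n_iter=200, eps=1e-9, seed=0):
        """Sketch of the Lee-Seung multiplicative updates for
        min ||X - U V^T||_F^2 with U >= 0 and V >= 0 (X is m x n)."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        U = rng.random((m, r))
        V = rng.random((n, r))
        for _ in range(n_iter):
            V *= (X.T @ U) / (V @ (U.T @ U) + eps)   # update V with U fixed
            U *= (X @ V) / (U @ (V.T @ V) + eps)     # update U with V fixed
        return U, V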
In fact, the multiplicative update algorithms can also be considered as a gradient based method [24, 88]. The trick to making these algorithms efficient 36 comes with choosing appropriate values for the step sizes U and V . Various algorithms have been proposed that use different strategies for choosing the step size parameters U and V . For instance, some algorithms initially set these step size values to 1, then multiply them by one-half at each subsequent iteration [63]. Despite the simplicity, these algorithms suffer from the problem that they do not guarantee the elements of the updated matrices U and V to be non-negative. A common practice employed by many gradient descent algorithms is a projection step that projects the obtained solution to the non-negative orthant [140, 63, 24, 116]. Without a careful choice of U and V , it is difficult to analyze the convergence of the gradient descent methods. Alternating least squares algorithms As suggested by the name, this method solves the optimization problem in (2.20) by optimizing variables alternately. Particularly, in each iteration of the algorithm, with one of the two variables (i.e., U and V ) fixed, we optimize the other variable by solving a least square problem. Alternating least square algorithms were first used in [114]. This algorithm exploits the fact that while the optimization problem in (2.20) is not convex in both U and V , it is however convex in either U or V . Thus, given one matrix fixed, the other matrix can be found efficiently by a simple least square computation. The basic idea of alternating least squares algorithms is summarized as follows. • Initialize U as a random dense matrix • for i = 1 to max iter 1. Solve for V in matrix equation U U V 2. Set all negative elements in V =U X to 0 3. Solve for U in matrix equation V V U =V X 4. Set all negative elements in U to 0 In the above procedure, the nonnegativity is ensured by using the simplest projection step which sets all negative elements resulting from the least squares computation to 0. 37 This simple projection step often results in a sparse solution for U and V . Moreover, it allows additional flexibility during the iteration that is not available in other algorithms as pointed out in [11]. For instance, one drawback of the multiplicative algorithms is that once an element in U or V becomes 0, it will remain as 0 forever. This locking of 0 elements is restrictive, meaning that once the algorithm starts heading down toward a fixed point, even if it is a poor fixed point, it must continue in that vein. The alternating least squares algorithms are more flexible since it allows the iterative process to escape from a poor path. Another appealing feature of alternating least squares algorithm is if the algorithm is implemented appropriately, they can be much faster than the other two types of algorithms. A discussion of the convergency of the alternating least squares algorithms can be found in [11]. 2.3.4 Maximum Margin Matrix Factorization The idea of Maximum Margin Matrix Factorization (MMMF) [148, 149, 125] arises from the study of collaborative filtering [17], in which the goal is to complete the missing elements in the user-item rating matrix. Unlike the matrix factorization methods discussed in the previous sections where the focus is to approximate an existing data matrix by a matrix with a lower rank, MMMF searches for the optimal factorized matrices that minimizes the trace norm of the approximate matrix. 
More specifically, given the factorization X = U V , MMMF aims to find matrices U and V with the least norms. The idea of minimizing the norms of the factorized matrices arises by viewing matrix factorization as a problem of feature extraction. Unlike the formulation of matrix factorization presented in the previous sections that often result in a non-convex optimization problem, MMMF can be cast into a convex optimization, making it computationally more desirable. Collaborative Filtering and Matrix Factorization The objective of collaborative filtering is to predict the utility of items for a given user based on a database of user votes from a sample or population of other users [17]. Collaborative 38 filtering can be formalized as a matrix completion problem: Let Y be the user-item rating matrix, with each row corresponding to an user, each column corresponding to an item, and every element Yi,j representing the rating of user i for item j. Since each user only provides ratings for a subset of items, or in other words, only a part of the matrix is observed, the objective of predicting users’ ratings for the remaining items can be naturally viewed as the completion of the matrix Y . It is straightforward to extend the idea of low-rank matrix factorization for collaborative prediction [62, 100]. Given a partially observed matrix Y , we aim to find the matrices (U, V ) that minimizes the discrepancy between the observed entries YS and the corresponding entries in X = U V . With the estimated U and V , we can reconstruct the entire matrix Y , leading to the prediction of missing ratings in Y . Matrix Factorization and Linear Prediction In order to motivate the formulation for maximum margin matrix factorization, below we try to link the problem of matrix factorization and linear prediction. Suppose one of the factor matrices, say U , is fixed and only the other factor matrix V needs to be learned. Then, fitting each column of the target matrix Y by U V , is a separate linear prediction problem; e.g., each row of U is a ”feature vector”; and each column of V is a linear predictor that predicts the entries in the corresponding column of Y based on the ”features” in U . According to the maximum margin principle [18], the optimal V is found by minimizing the reconstruction error as well as the square of Frobenius norm of U . It is the similarity between matrix factorization and linear prediction that leads to the maximum margin principle for matrix factorization. In collaborative prediction, both U and V are unknown and need to be estimated. This can be thought of learning both the feature vectors (i.e., U ) and the linear predictor (i.e., V ) for Y simultaneously. Although a natural approach is to solve this problem by some alternating method, below we will review the maximum margin matrix factorization method 39 that allows us to determine both U and V at the same time. Maximum-margin Low Trace Norm Matrix Factorization The trace norm X tr is defined as the sum of the singular values of X [148] (See Appendix D for details), which can also be written as: X tr = min X=U V U F V F = min X=U V 1 2 U 2 + V 2 F F It is important to note that the trace norm X tr is a convex function of X. This fact is the key for casting the maximum margin matrix factorization problem into a convex optimization problem. To simplify presentation, we focus on the problem with binary labels, i.e., every observed elements in Y is either +1 or −1. 
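The trace-norm identity above can be checked numerically. The short Python/NumPy sketch below (with randomly generated factors) verifies that the sum of the singular values is never larger than ½(‖U‖²F + ‖V‖²F) for a given factorization, and that the bound is attained by the balanced factorization obtained from the SVD.

    import numpy as np

    rng = np.random.default_rng(0)
    U = rng.standard_normal((30, 5))
    V = rng.standard_normal((40, 5))
    X = U @ V.T

    trace_norm = np.linalg.svd(X, compute_uv=False).sum()   # sum of singular values
    bound = 0.5 * (np.linalg.norm(U, 'fro')**2 + np.linalg.norm(V, 'fro')**2)
    print(trace_norm <= bound + 1e-8)                        # True for any factorization

    # the bound is tight for the balanced factorization from the SVD
    Us, s, Vts = np.linalg.svd(X, full_matrices=False)
    A = Us * np.sqrt(s)               # A @ B.T reconstructs X
    B = Vts.T * np.sqrt(s)
    print(np.isclose(trace_norm,
                     0.5 * (np.linalg.norm(A, 'fro')**2 + np.linalg.norm(B, 'fro')**2)))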
In the hard-margin matrix factorization framework, we seek the completed matrix X that (1) provides a good approximation for all the observed elements in Y , and (2) has the least trace norm. We can cast the above requirements into the following optimization problem: min X (2.24) X tr s. t. Yia Xia ≤ 1, for all ia ∈ S where S includes the indices of all the observed elements in Y . Similarly, the soft-margin matrix factorization, where we minimize a trade-off between the trace norm and its hinge-loss relative to YS , is expressed as follows: min X X tr + c max(0, 1 − Yia Xia ) (2.25) ia∈S It is worth emphasizing that there is an inverse dependence between the norm and the margin: fixing the margin and minimizing the trace norm is equivalent to fixing the trace norm and maximizing the margin. The geometric interpretation of the maximum matrix factorization can be seen clearly from constraining all rows of U and V to have small L2 norm. More clearly, instead of 40 constraining the norm of rows in U and V on average, we replace the trace norm with X max = min max Ui 2 i X=U V max Va 2 a where Ui and Va are row vectors of U and V . Then the hard-margin low-max-norm prediction corresponds to mapping the users and items to points and hyper planes in a high-dimensional unit sphere such that each users hyperplane separates his positive and negative items with a large-margin (the margin being the inverse of the max norm). The soft-margin matrix factorization (2.25) can be further formulated as a semi-definite optimization problem [148, 156] as: min s. t. 1 (trA + trB) + c 2 A X X B ξia (2.26) ia∈S ≥ 0, Yia Xia ≥ 1 − ξia , ξia ≥ 0, ∀ia ∈ S By introducing a dual variable Qia for each constraint on Xia , the dual of (2.26)[148]is given as: max Qia (2.27) ia∈S s. t. I (−Q ⊗ Y ) (−Q ⊗ Y ) I ≥ 0, 0 ≤ Qia ≤ c where Q ⊗ Y denotes the sparse matrix with (Q ⊗ Y )ia = Qia Yia for ia ∈ S and zeros elsewhere. Since the prime problem is strictly feasible, the duality gap between the prime problem and the dual problem is zero. Hence, both the prime problem and the dual formulation give the same solution. Note that unlike typical prime and dual problems, in SDP, recovering the optimal solution to the prime problem directly from a dual optimal solution is in fact non-trivial. However, at least for the hard-margin problem this is possible to derive the solution for the prime problem from the solution to the dual problem by exploring the complementary slackness conditions[148]. 41 2.4 Supervised and Semi-Supervised Dimensionality Reduction All the dimensionality reduction methods discussed in the previous sections are unsupervised, i.e., no supervision information is provided other than the input patterns of data points. The methods reviewed in this section belong to the category of supervised and semi-supervised dimensionality reduction. These algorithms are inspired by the fact that in many cases additional information, such as class labels and pairwise constraints, is provided besides the input patterns of data. The objective of supervised and semi-supervised dimensionality reduction is to exploit the supervisory information in order to either guide or regulate the search of the space with reduced dimensions. In this section, we will review several the most representative methods to illustrate the general idea of the supervised and semi-supervised dimensionality reduction. 
It is worth pointing out that the methods reviewed in this section have a very close relationship to the methods developed in this thesis, and therefore they serve as a good starting point for the remaining chapters. Finally, it is also worth emphasizing, the supervised and semi-supervised dimensionality reduction is closely related to distance metric learning and kernel learning. We refer the audience to [169, 177] for more information. 2.4.1 Linear Discriminate Analysis As discussed in Section 2.1.1, although PCA finds components that are representative for a given dataset, these components however may not be useful for distinguishing data from different classes. Unlike PCA, Linear Discriminate Analysis (LDA) [44] aims to identify the low dimensional space that has the maximum class discrimination. It achieves this goal by projecting data onto a low dimensional space in which the separation of data points from different classes is maximized while the dispersion of data from the same class is simultaneously minimized. 42 Assume the data set D is grouped as: D = {X1 , X2 , . . . , Xk } where k is the number of classes and Xi ∈ Rm×ni is the data matrix of the ith class. ni is k i=1 ni = n. the number of data points of the ith class and In LDA three matrices are defined as follows which are: within-class, between-class, and total scatter matrices: k 1 Sw = n (x − ci )(x − ci ) i=1 x∈Xi k 1 Sb = n St = (2.28) ni (ci − c)(ci − c) (2.29) (xi − c)(xi − c) (2.30) i=1 n 1 n i=1 where ci is the sample mean of the ith class and c is the mean of the whole data set. It can be shown that the three matrices satisfy the following relationship: St = Sw + Sb It is evident that the trace of Sw , denoted by tr(Sw ), measures the overall distance between each data point and its class center (mean). Similarly, the trace of Sb , denoted by tr(Sb ), measures the separation between each class center and the center of all data points. In the low dimensional space where each data point is transformed by a projection matrix G ∈ Rm×(k−1) , the three scatter matrices become: L L L Sw = G Sw G, Sb = G Sb G, St = G St G In order to ensure that in the space of reduced dimensions (a) all the classes are well separated and (b) the data points in the same class are close to each other, LDA cast the dimensionality reduction problem into the following optimization problem: max tr G Sb G G tr G Sw G s.t. G G=I 43 (2.31) The solution of LDA comes from the result of the generalized eigenvalue problem: −1 Sw Sb vi = λi vi (2.32) and the eigenvectors corresponding to the k − 1 largest eigenvalues form the columns of the projection matrix G. Extensions of LDA One well know problem in LDA is the so called singularity or undersampled problem. It arises when the number of data points is smaller than the dimensionality, leading to singular scatter matrices singular and consequentially a ill defined eigenvalue problem in (2.32). Many attempts have been made to deal with the singularity problem. One category of approaches is to reduce the dimensionality before LDA is applied. PCA+LDA [3, 173, 101] is a representative approach in this category. It first applies PCA to the given data set to identify a space of p dimension in which all the scatter matrices are nonsingular. It then applied the standard LDA to data patterns in the reduced space. The parameter p is usually estimated by cross validation. Another popular method to deal with the singularity of scatter matrices is regularization, which is known as RLDA [55]. 
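A minimal sketch of the LDA procedure is given below (Python/SciPy, assuming the rows of X are samples and labels is an array of class ids); the small diagonal term added to Sw anticipates the regularized variant discussed next and also keeps the generalized eigenproblem in (2.32) well defined.

    import numpy as np
    from scipy.linalg import eigh

    def lda_projection(X, labels, reg=1e-6):
        """Sketch of LDA: returns the m x (k-1) projection matrix G."""
        classes = np.unique(labels)
        n, m = X.shape
        c = X.mean(axis=0)
        Sw = np.zeros((m, m))
        Sb = np.zeros((m, m))
        for cls in classes:
            Xi = X[labels == cls]
            ci = Xi.mean(axis=0)
            Sw += (Xi - ci).T @ (Xi - ci) / n          # within-class scatter (2.28)
            Sb += len(Xi) * np.outer(ci - c, ci - c) / n  # between-class scatter (2.29)
        Sw += reg * np.eye(m)                 # ridge term: the RLDA-style fix for singular Sw
        vals, vecs = eigh(Sb, Sw)             # generalized eigenproblem S_b v = lambda S_w v
        order = np.argsort(vals)[::-1][:len(classes) - 1]
        return vecs[:, order]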
In RLDA, a small positive value is added to the diagonal elements of the scatter matrices, which makes them to be positive definite thus nonsingular [49]. Other methods, such as null space LDA (NLDA) [23], orthogonal LDA (OLDA) [171], and uncorrelated LDA (ULDA) [171], have also been proposed and applied successfully to resolve the singularity issue in different domains. An unified framework is presented in [69] that unifies these variants of LDA methods. 2.4.2 Semi-Supervised LLE and ISOMAP Semi-supervised LLE [170] and ISOMAP [170] can be viewed as a natural extension of their unsupervised versions which were reviewed in the Section 2.2. In semi-supervised settings, the prior information of the exact mapping of certain data points is known and is used to guide the procedure of dimensionality reduction. Without loss of generality, we assume that 44 for nl data points, we know their low dimensional representation. By rearranging the data matrix D, we can partition D into two parts as D = {Xl , Xu }, where Xl represents the data points for which we know their low dimensional representation and Xu are other data points. Accordingly, we can partition the low dimensional representation Y into two parts as Y = {Yl , Yu }, where Yl is given and Yu needs to be computed. Semi-Supervised LLE From (2.15), the objective of LLE is to minimize j∈Ni wij yi 2 which is equivalent to minimize tr(Y M Y n i yi − ), where M = (I − W ) (I − W ). The M matrix can also be partitioned in a similar way as: M = Mll Mlu Mlu Muu Replacing Y and M with a partitioned version, the objective function in LLE can be rewritten as : min tr Yl Y u Yu Mll Mlu Mlu Muu Yl Yu (2.33) which is equivalent to min tr Yu Muu Yu + 2Yl Mlu Yu Yu (2.34) The solution to the above objective function can be computed by solving a linear system of equations as: Muu Yu = Mlu Yl (2.35) Semi-supervised ISOMAP For the semi-supervise version of ISOMAP [170], we first restate the ISOMAP problem as: max tr Y AY Y s.t. YY =I where A is the matrix as: 1 1 1 A = − (I − 11 ) ∆(I − 11 ) 2 n n 45 (2.36) in which, ∆ is the matrix of squared geodesic distances. Let A = QΛQ is the eigen decomposition of A. Λ = diag(λ1 , λ2 , . . . , λn ) is a diagonal matrix where λi is the eigenvalue with λ1 ≥ λ2 ≥ . . . ≥ λn . Matrix Q consists of Q = [q1 , . . . , qn ] and qi is the eigenvector corresponding to λi . Define matrix M as: r M = λ1 I − A − (λ1 − λi )qi qi − λ1 11 /n (2.37) i=2 It can be shown that matrix M has r + 1 zero eigenvalues and the null space is given by span([q1 , q2 , . . . , qr , 1]). Therefore, (2.36) can be rewritten as: min tr Y M Y (2.38) Y n s. t. Y Y = I, yi = 0 i=0 Using the same idea of partitioning as semi-supervised LLE, we can incorporate the supervised information into the optimization problem in (2.38). As a result, this yields the same format as (2.33), except the matrix M is replaced by (2.37). 2.5 Functional Data Dimensionality Reduction The functional data dimensionality reduction [74] is inspired by the Functional Data Analysis (FDA) [123]. The FDA approach is based on the assumption that multivariate data are discrete samples of some underlying functions and further analysis is carried out with the functional representations instead of the original data. The advantage is that, provided that the data is smooth enough, the functional representation will be of smaller dimension. 
One typical approach to obtain the functional representation is to construct a finite dimensional function space and then represent the data as the linear combination of a set of basis functions which span that space. The dimension of the functional representation of the data thus becomes the dimension of the combination weighting vectors which could be lower than the original data dimension if only a limited number of basis functions are used. However, 46 the choice of basis is not trivial because in many real-world applications the basis has a severe impact on the overall performance, therefore it is often desirable to construct problem specific basis functions to get good representation with a small number of basis functions while maintaining most of the original information. The shortcoming of the functional data dimensionality reduction is that it is more restricted because one has to assume that the multidimensional data can be represented in functional form. The central idea of functional data analysis[123] is to treat high dimensional data as continuous functions. In general, the functions lie in a space with infinite dimensions and in order to make the computation tractable, the functions are usually approximated with finite dimensional representations. More specifically, it assumes that for each datum xi = [x1 , x2 , . . . , xm ] , there exists a function fi (z) belonging to certain function space L such i i i that xh = fi (zh )+ h for all h = 1, . . . , m, where h is the noise and the argument zh where the i i i observation are made are problem specific. The idea is to estimate the underlying function fi based on the available data and conduct further analysis with the function representation instead of the original discrete data. In general, the function space where fi is defined is of infinite dimensions making it very difficult to work with the function fi directly. In practice, it is very often to approximate fi with a finite dimensional representation wi which is defined in some r dimensional subspace A ⊂ L. Since the goal of this whole procedure is to reduce the dimensionality, a natural requirement is that the dimension of the subspace r is less than the original data dimension which is m. The subspace is defined by a set of basis functions ψj (z), j = 1, . . . , r which span A. Given the basis, any function fi in A can be represented by the weight vector 1 2 r wi = [wi , wi , . . . , wi ] : fi (z) = wi ψ(z), (2.39) where ψ(z) = [ψ1 (z), ψ2 (z), . . . , ψr (z)] . One typical approach to obtain the weights is to 47 minimize the square fitting error: m min wi wi ψ(zl ) − xl i 2 (2.40) . l=1 In order to obtain more stable models, regularization is often applied to (2.40), for example, a penalty term λ wi 2 can be added to (2.40) where λ is introduced to balance between the regularization and the regression error. In general, orthogonal basis, such as Fourier basis or wavelets are often used, however, there are no free parameters and thus the basis cannot be optimized for a specific task. It is sometimes appealing to use non-orthogonal functions as basis so that a small number of weights are needed for representing the data. For instance, we can use Gaussian basis functions: − z − rj 2 ψj (z) = exp 2 σj , j = 1, . . . , m (2.41) where rj is the location and σj is the width of the function. Clearly, these functions are not orthonormal. 
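As an illustration, the following Python/NumPy sketch fits the weight vectors for a fixed Gaussian basis by regularized least squares as in (2.40); it assumes the common exp(−‖z − r_j‖²/(2σ_j²)) form of the basis and evenly spaced centers, both of which are choices made here only for the example.

    import numpy as np

    def gaussian_design(z, centers, widths):
        """Psi[h, j] = psi_j(z_h) for Gaussian basis functions."""
        return np.exp(-(z[:, None] - centers[None, :])**2 / (2 * widths[None, :]**2))

    def fit_weights(X, z, centers, widths, lam=1e-3):
        """Fit one weight vector per datum by regularized least squares (2.40).
        X is n x m: each row is a datum observed at the m locations in z."""
        Psi = gaussian_design(z, centers, widths)       # m x r design matrix
        A = Psi.T @ Psi + lam * np.eye(Psi.shape[1])
        return np.linalg.solve(A, Psi.T @ X.T).T        # n x r weight matrix

    # example: curves sampled at m = 100 locations reduced to r = 10 basis weights
    z = np.linspace(0, 1, 100)
    centers = np.linspace(0, 1, 10)
    widths = np.full(10, 0.1)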
The basis is differentiable with respect to the parameters r_j and \sigma_j, and thus it can be optimized for an accurate fit using standard gradient-based optimization algorithms. For example, we can choose the Gaussian basis functions and minimize the squared fitting error over all the functions i = 1, ..., n:

    \min \sum_{i=1}^n (\Psi w_i - x_i)^T (\Psi w_i - x_i) = \sum_{i=1}^n ( w_i^T \Psi^T \Psi w_i - 2 x_i^T \Psi w_i + x_i^T x_i ),   (2.42)

where the columns of \Psi are \Psi_j = [\psi_j(z_1), \psi_j(z_2), ..., \psi_j(z_m)]^T. The derivatives of \Psi_j with respect to r_j and \sigma_j are:

    \Psi_j^{(r)} = [ ((z_1 - r_j)/\sigma_j^2) \psi_j(z_1), ..., ((z_m - r_j)/\sigma_j^2) \psi_j(z_m) ]^T,   (2.43)

    \Psi_j^{(\sigma)} = [ ((z_1 - r_j)^2/\sigma_j^3) \psi_j(z_1), ..., ((z_m - r_j)^2/\sigma_j^3) \psi_j(z_m) ]^T.   (2.44)

With this notation, it is easy to derive:

    \partial (x_i^T \Psi w_i) / \partial r_j = x_i^T \Psi_j^{(r)} w_i^j,   (2.45)

    \partial (w_i^T \Psi^T \Psi w_i) / \partial r_j = 2 w_i^T \Psi^T \Psi_j^{(r)} w_i^j,   (2.46)

where w_i^j denotes the jth element of w_i, and finally we get:

    \partial / \partial r_j \sum_{i=1}^n (\Psi w_i - x_i)^T (\Psi w_i - x_i) = 2 \sum_{i=1}^n (\Psi w_i - x_i)^T \Psi_j^{(r)} w_i^j.   (2.47)

Similarly, we can derive:

    \partial / \partial \sigma_j \sum_{i=1}^n (\Psi w_i - x_i)^T (\Psi w_i - x_i) = 2 \sum_{i=1}^n (\Psi w_i - x_i)^T \Psi_j^{(\sigma)} w_i^j.   (2.48)

After the gradient is computed, we can obtain the optimal r_j and \sigma_j using standard unconstrained non-linear optimization.

2.6 Summary

Dimensionality reduction is a broad area and our review is far from complete. Here we focus only on the methods of dimensionality reduction that are most widely used. There are many other well known algorithms for dimensionality reduction that are not included in the discussion, such as Projection Pursuit [73], Principal Curves [57], Topologically Continuous Surfaces [21], Neural Networks [68], and Independent Component Analysis [65]. For reviews of these methods please refer to [45, 21]. Recent studies have revealed interesting connections between dimensionality reduction and data clustering. For example, studies have shown that PCA is closely related to K-means [34], and that non-negative matrix factorization is closely related to spectral clustering [35]. Also, studies [4] have shown that Laplacian eigenmaps [4] and spectral clustering [142], although not mathematically equivalent, share very similar ideas and differ only in the last step: the last step of clustering thresholds the elements y_i in order to divide the data points into two disjoint sets, while in dimensionality reduction the obtained matrix Y becomes the reduced representation of the original data patterns. In [36] the authors further formulate the problem of multi-way spectral clustering as the dimensionality reduction problem of finding the best one dimensional embedding, thus bridging the gap between clustering and dimensionality reduction. Although many methods have been proposed and successfully applied to real-world problems in reducing dimensionality, several open problems need further study. For example, most non-linear methods are not applicable to very large scale data sets (n > 100,000). Another issue is that the "target" dimension r often needs to be predetermined; however, determining the right target dimension is sometimes an even more difficult problem than dimensionality reduction itself.

CHAPTER 3
Semi-Supervised Learning by Mixed Label Propagation

In this chapter, we aim to explore the dimensionality reduction method for semi-supervised classification via the idea of mixed label propagation. In particular, our work is inspired by [36], where a clustering problem is approximated by a low dimensional embedding problem.
We extend this idea by approximating a data classification problem into a dimensionality reduction problem: we attempt to find the best one dimensional embedding of the data in which data points in different classes can be well separated, and the class labels obtained by simply thresholding the one dimensional representation. We furthermore extend LDA, a supervised dimensionality reduction method, to the semi-supervised learning setting. In this chapter, we first briefly review the existing work on graph-based semi-supervised learning, which is closely related to our work. We then present the proposed framework of mixed label propagation that essentially extend dimensionality reduction methods to semi-supervised classification. 3.1 Graph-Based Semi-Supervised Learning Recent studies have shown a promising performance of graph-based approaches for semisupervised learning [1, 7, 25, 59, 70, 174, 175, 176]. The key idea behind most graphbased approaches is to explore the pair-wise similarity between examples in determining the class labels for unlabeled examples. In particular, the class assignments of unlabeled examples need to be consistent with both the example similarity and the class labels of 51 training examples. Graph-based approaches can often be interpreted as propagating the label information of the training examples to the unlabeled examples through the pair-wise similarity between examples. This process is sometimes referred to as label propagation [175]. One key component to most graph-based approaches is how to measure the inconsistency between the class assignments and the example similarity. For instance, in the harmonic function approach the inconsistency between the class assignment y = (y1 , y2 , . . . , yn ) and similarity Sij ≥ 0 is measured by the energy function: n E(S, y) = Si,j yi − yj 2 = y Ly (3.1) i,j=1 where L is the Laplacian matrix and is defined as L = D − S. Here, D = diag(D1 , D2 , . . . , Dn ) is a diagonal matrix with its diagonal elements defined as Di = n j=1 Si,j . Note that (3.1) is almost the same as the objective function of Laplacian Eigen- map in Section 2.2.3. The only difference is, here the variable y is class labels, while in 2.2.3 it is the low dimensional representation of data. ˆ Given the class labels yl = (ˆ1 , y2 , . . . , ynl ) for the first nl examples, the optimal class y ˆ ˆ assignment y is found by minimizing the above energy function, i.e., min E(S, y) y (3.2) ˆ s. t. yl = yl where yl stands for the first nl elements of y. The optimal class labels assigned to the unlabeled examples, denoted by yu , are computed as: ˆ yu = −[Lu,u ]−1 Lu,l yl (3.3) where the super indices u and l stand for the parts of the Lapacian matrix that are related to the labeled and the unlabeled examples, respectively. Similar strategy is also employed for semi-supervised dimensionality reduction reviewed in section 2.4.2 It is important to note that the pair-wise similarity Sij in the above energy function must be non-negative. This is because E(S, y) could become negatively unbounded when certain 52 pair-wise similarity Si,j is negative, which implies the optimal solution to (3.3) does not exist. 3.2 Mixed Label Propagation Despite the success, most graph-based approaches are limited to exploring positive similarity, which can be interpreted as the confidence of assigning two different examples to the same class. In many cases we may run into dissimilarity or negative similarity that expresses the confidence of assigning two examples to different classes. 
For instance, if we measure the similarity between two examples by their correlation coefficient, we could have both positive and negative similarity. One application of negative similarity is collaborative filtering [17], in which a negative similarity between two users indicates that the two users share different interests and therefore tend to give opposite ratings for the same items. Another application of negative similarity is semi-supervised data clustering, in which one has side information of must-link pairs and must-not-link pairs. One way to explore the side information is to associate every must-link pair with a positive similarity, and every must-not-link pair with a negative similarity [82]. It is important to note that most existing graph-based approaches are unapplicable to negative similarity because the dissimilar relations are non-transitive and therefore cannot be propagated directly. This can also be understood from the viewpoint of optimization. The energy function, i.e., the objective function employed by most graph-based approaches to measure the inconsistency between the class assignments and the example similarity, could be negatively unbounded when similarity is negative, thus no optimal solution can be found to minimize the objective function. To address this problem we propose a new framework of label propagation for semi-supervised learning, termed as mixed label propagation, which can effectively explore both negative and positive similarity simultaneously. Mixed label propagation measures two quantities: the inconsistency between the class assignments and 53 the positive similarity, and the consistency between the class assignments and the negative similarity. The optimal class assignments are found by minimizing the ratio between the inconsistency and the consistency. 3.2.1 The Framework of Mixed Label Propagation To incorporate negative similarity into the framework of label propagation, we consider constructing two energy functions: the energy function E+ that measures the inconsistency between the class assignments and the positive similarity, and the energy function E− that measures the consistency between the class assignments and the negative similarity. In order to minimize the inconsistency E+ and maximize the consistency E− simultaneously, we follow the idea of Linear Discriminative Analysis (LDA) [44] by minimizing the ratio between E+ and E− . More specifically, given the pair-wise similarity S, we construct the positive similarity matrix S+ and the negative similarity matrix S− as follows: [S+ ]i,j = Si,j Si,j > 0 0 otherwise [S− ]i,j = |Si,j | Si,j < 0 0 otherwise Evidently, we have S = S+ − S− . We then construct two energy functions E+ (S+ , y) and E− (S− , y) based on the two similarity matrices S+ and S− using Eqn. (3.1). Finally, ˆ given the class labels yl ∈ {−1, +1}nl for the first nl training examples, the optimal class assignment y is determined by minimizing the ratio between E+ and E− , i.e., min y∈Rn s. t. y L+ y E+ (S+ , y) = E− (S− , y) y L− y yi = yi , i = 1, 2, . . . , nl ˆ (3.4) where L+ and L− are graph Laplacians for similarity matrices S+ and S− , respectively. Note that without the linear constraints yi = yi , i = 1, 2, . . . , n, the above optimization problem ˆ is identical to the optimization problem in LDA [44]. Hence, the optimal solution y to (3.4) † is the minimum eigenvector of matrix L− L+ where † stands for the pseudo inverse. 
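This unconstrained case can be sketched in a few lines of Python/NumPy: split the signed similarity matrix into its positive and negative parts, build the two Laplacians, and take the minimum eigenvector of L−†L+. The label constraints, which are what make the problem difficult, are deliberately ignored here, so the sketch is only a reference point for the discussion that follows.

    import numpy as np

    def split_similarity(S):
        """Split a signed similarity matrix into its positive and negative parts."""
        S_pos = np.where(S > 0, S, 0.0)
        S_neg = np.where(S < 0, -S, 0.0)
        return S_pos, S_neg

    def laplacian(S):
        return np.diag(S.sum(axis=1)) - S

    def unconstrained_mlp(S):
        """Minimum eigenvector of pinv(L_-) @ L_+, i.e., the solution of the
        ratio problem (3.4) when the label constraints are ignored."""
        S_pos, S_neg = split_similarity(S)
        L_pos, L_neg = laplacian(S_pos), laplacian(S_neg)
        M = np.linalg.pinv(L_neg) @ L_pos
        vals, vecs = np.linalg.eig(M)      # M need not be symmetric
        # note: without the label constraints this can return a trivial direction
        return np.real(vecs[:, np.argmin(np.real(vals))])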
The 54 challenges of solving the optimization problem in (3.4) arises from the linear constraints. In the next subsection we present an efficient algorithm to solve this optimization problem. 3.2.2 An Efficient Algorithm for Mixed Label Propagation For the convenience of presentation we rewrite class assignments y as y = (yl , yu ), where yl represents the class labels of the first nl examples and yu represents the labels for the next ˆ nu = n − nl examples. According to the constraints in (3.4), we have yl = yl . To solve the problem in (3.4), we first follow the idea of LDA by converting the problem of optimizing a ratio into a constrained optimization problem, i.e., min y L+ y s. t. y L− y ≥ 1, β∈R,yu ∈Rnu (3.5) β≥0 where l,l l,u ˆ y L+ y = (β yl , yu ) L+ L + u,l u,u L+ L + ˆ β yl yu l,l l,u ˆ y L− y = (β yl , yu ) L− L− u,l u,u L− L − ˆ β yl yu ˆ Note that in (3.5) we introduce a scaling factor β for yl . This is because we introduce the ˆ ˆ constraint y L− y ≥ 1, and therefore have to convert yl = yl to yl ∝ yl . β is introduced to ˆ account for the scaling factor between yl and yl . We take the alternating optimization strategy to solve (3.5). More specifically, we first optimize yu by fixing β and then optimize β by fixing yu . However, the problem in (3.5) is a non-convex programming problem for both β and yu because of the non-convex constraint y L− y ≥ 1. To resolve this problem we resort to the following theorem of the alternative [15]: Theorem 3.1. The implication x F1 x + 2g1 x + h1 ≤ 0 =⇒ x F2 x + 2g2 x + h2 ≤ 0, 55 where Fi is symmetric n × n matrix, holds if and only if there exists λ ≥ 0 such that F2 g2 g2 h2 Optimize y with fixed β F1 g1 g1 h1 λ Using the above theorem, to compute the optimal yu with fixed β, we turn the problem in (3.5) into its dual form, i.e., max λa22 − β 2 a21 a−1 a12 11 (3.6) λ s. t. λ ≥ 0 u,u u,u u,l u,l a11 = L+ − λL− , a12 = L+ − λL− ˆ a21 = yl l,u l,u L+ − λL− ˆ yl l,l ˆ ˆ , a22 = 1 − β 2 yl L− yl Given the solution for λ, we can compute the solution for yu using the Karush-Kuhn-Tucker (KKT) conditions [15], i.e., u,u u,u −1 yu = −β L+ − λL− u,l u,l L+ − λL− yl (3.7) It is interesting to note that the above solution for yu is equivalent to the solution by the harmonic function (in Eqn. (3.3)) if we use L = L+ − λL− as the graph Laplacian matrix. Thus, the parameter λ weighs the importance between the two energy functions E+ and E− . The advantage of the proposed approach is that it automatically determines λ by making the optimal tradeoff between the inconsistency measure E+ and the consistency measure E− . This is particularly important for semi-supervised learning when the number of labeled examples is limited and is insufficient to determine λ by cross validation. We will also show in our empirical study that the value of λ varies significantly from one case to another, and therefore it is suboptimal to replace λ with a fixed constant. The problem in (3.6) can be further turned into a Semi-Definite Programming (SDP) [156](See Appendix C) problem as 56 follows: max γ (3.8) λ,γ a11 a12 a21 (λa22 − γ) /β 2 s. t. u,u 0, λ ≥ 0 u,u u,l u,l a11 = L+ − λL− , a12 = L+ − λL− ˆ a21 = yl l,u l,u L+ −λL− ˆ yl l,l ˆ ˆ , a22 = 1 − β 2 yl L− yl In (3.8) we introduce the slack variable γ ≤ λa22 −β 2 a21 a−1 a12 , which can be further turned 11 into a Linear Matrix Inequality (LMI) a11 a12 a21 (λa22 − γ) /β 2 0 using the Schur complement [15](see Appendix B). 
Since the problem in (3.8) belongs to the family of semi-definitive programming, it can be solved effectively using the standard packages such as SeDuMi1. Optimize β with fixed y Similarly, using the theorem 1 we have the following optimization problem for finding optimal β with fixed yu (3.9) max γ λ,γ s. t. λ ≥ 0, b11 b12 b21 b22 0 l,l ˆ l,l ˆ ˆ y b11 = yl L+ yl − λˆ l L− yl l,u ˆ l,u b12 = b21 = yl L+ yu − λˆ l L− yu y u,u b22 = −λ yu L− yu − 1 − γ The corresponding solution for β is l,u ˆ l,u y yl L+ yu − λˆ l L− yu β = − l,l ˆ l,l ˆ ˆ yl L+ yl − λˆ l L− yl y 1 http://sedumi.mcmaster.ca/ 57 (3.10) In summary, we start with a large value for β, then find the optimal yu by solving the problem in (3.8) with a fixed β, and find the optimal β by solving the problem in (3.9) with a fixed yu . We alternate these two steps iteratively until the solution converges to the local maximum. 3.2.3 Application to Collaborative Filtering In order to verify the efficacy of the proposed algorithms for semi-supervised learning, we apply it to collaborative filtering. We emphasize that we choose collaborative filtering for the test bed because we are able to compute both similarity and dissimilarity in the task of collaborative filtering, a key requirement for the proposed method. As discussed in Section 2.3.4, the goal of collaborative filtering is to predict the utility of items to a user based on ratings by other users [126]. This method differs from the traditional feature based approach where predictions are made based on the features of data. For example, in a collaborative filtering movie recommendation system2, features of the movies (e.g., genre, year, actors, external reviews) and the users (e.g., age, gender, explicitly specified preferences) are not used by the system. Inputs to the system are user ratings on movies they have already seen. When making predictions of user ratings on movies they have not yet seen, the system assume that users ”collaborate” by sharing their ratings, e.g., to predict the ratings of a movie by user u, most collaborative filtering algorithms first identify a subset of users who share similar interests to u, and then combine the ratings of these similar users as the ratings by u. The most well known algorithms for collaborative filtering include the Pearson correlation coefficient [126], personality diagnosis [117], matrix factorization [147], graphical models [17, 61], and ordinal regression [25]. In order to apply the proposed approach to collaborative filtering, the key is to estimate the matrix S for user similarity. To this end, we employ the Pearson correlation coefficient [126] to estimate user similarity. It measures the linear correlation between the ratings of two 2 http://movielens.org/ 58 MLP LP Pearson MMMF MLP LP Pearson MMMF Given 10 Rated Items Kc = 1 Kc = 5 0.8318 ± 2.7E-04 0.7368 ± 1.2E-04 0.8184 ± 8.6E-04 0.7316 ± 1.3E-04 0.7766 ± 6.8E-04 0.6749 ± 4.0E-04 0.7827 ± 9.2E-05 0.6974 ± 3.6E-05 Given 15 Rated Items Kc = 1 Kc = 5 0.8599 ± 3.4E-04 0.7704 ± 1.2E-04 0.8526 ± 5.3E-04 0.7689 ± 1.4E-04 0.8170 ± 1.0E-03 0.7222 ± 3.8E-04 0.8189 ± 2.0E-04 0.7221 ± 8.4E-05 Table 3.1. Average precision for the Mixed Label Propagation (MLP), Label Propagation (LP), Pearson Correlation Coefficient (Pearson) and Maximum-Margin Matrix Factorization(MMMF) using 10 training users. users. More specifically, given two users ui and uj , let Oi,j denote the set of items that are rated by both users. 
The similarity between ui and uj is measured as: Si,j = |Oi,j | ¯ ¯ k=1 (ri (k) − ri )(rj (k) − rj ) |Oi,j | |Oi,j | ¯ 2 ¯ 2 k=1 (ri (k) − ri ) k=1 (rj (k) − rj ) where ri (k) stands for the rating of the kth item by user ui , and ri is the average rating of ¯ user ui . Since the Pearson correlation coefficient can be both negative and positive, we can apply the proposed method to collaborative filtering. 3.3 Experiments We evaluated the effectiveness of the proposed mixed label propagation approach by the task of collaborative filtering and used a subset of the MovieLens database (http://movielens. umn.edu/login) as the test bed. In particular, we selected the 100 most popular movies and randomly selected 200 users who had provided at least 25 ratings for the 100 selected movies. 59 MLP LP Pearson MMMF MLP LP Pearson MMMF Given 10 Rated Items Kc = 1 Kc = 5 0.8325 ± 3.7E-04 0.7449 ± 1.3E-04 0.8228 ± 1.3E-04 0.7408 ± 4.7E-05 0.7998 ± 3.5E-04 0.6938 ± 3.4E-04 0.7946 ± 1.5E-04 0.7036 ± 1.2E-04 Given 15 Rated Items Kc = 1 Kc = 5 0.8661 ± 8.8E-05 0.7732 ± 5.2E-05 0.8592 ± 4.7E-04 0.7727 ± 1.4E-04 0.8177 ± 3.3E-04 0.7291 ± 1.9E-04 0.8128 ± 2.3E-05 0.7219 ± 1.5E-04 Table 3.2. Average precision for the Mixed Label Propagation (MLP), Label Propagation (LP), Pearson Correlation Coefficient (Pearson) and Maximum-Margin Matrix Factorization(MMMF) using 20 training users. In contrast to the binary label requirement in (3.4), the labels here were the ratings ranging from 1 to 5, and the mixed label propagation algorithm was applied to each movie separately to predict the ratings by all users. To acquire a full spectrum of the performance, we varied the number of training users and the number of movies whose ratings were provided by the test users. More specifically, 10 and 20 users were used as the training users. For each test user, 10 and 15 movies were randomly selected and their ratings by the test user were given. Each experiment was conducted ten times, and the results averaged over ten trials were reported in our study. Three baseline models were used in our study: the Pearson correlation coefficient method (Pearson) [126], the Maximum-Margin Matrix Factorization method (MMMF) [124], and the Label Propagation method (LP) [176] that is based on the harmonic function and only uses the positive similarity. By comparing to the Pearson correlation method and the maximummargin matrix factorization method, we are able to observe if the proposed method is effective for collaborative filtering. By comparing to the label propagation method, we are able to 60 observe if the proposed method is effective in exploiting the negative similarity. For the maximum margin matrix factorization method, we used coding from the website http:// people.csail.mit.edu/nati/mmmf/code.html and used a maximum norm with C = 0.001 for all the experiments following the suggestion in [147]. To examine the quality of different collaborative filtering algorithms, we focused on evaluating how well each method was able to rank items for users. For each user we ranked items in the descending order of their estimated ratings. We then evaluated the ranked items by comparing to the items ranked by the true ratings. More specifically, for a test user u, we u used le = (i1 , i2 , . . . , im ) to denote the list of items ordered by the estimated ratings, and ri to denote the true rating of the ith item by user u. Then, the quality of the ranked items was evaluated by the following two metrics: • Average Precision [46] (AP). 
To measure the average precision, we assume that the first k items with the highest ratings by user u, denoted by Mu (k), are the “good” items for user u. Then, the average precision for user u at the cutoff rank Kc is computed as: 1 APu (Kc ) = Kc Kc k=1 |Mu (k) ∩ {i1 , · · · , ik }| k (3.11) where | · | outputs the length of the set. • Average Rating (AR). This computes the average rating of the first Kr ranked items in the list le . More specifically, the average rating for user u at the rank Kr is computed as follows 1 ARu (Kr ) = Kr Kr k=1 u ri k (3.12) Finally, we averaged both metrics over all test users, and reported the average precision at a different cutoff rank Kc and the average rating at a different ranking position Kr . Note that we did not use the Mean Average Error (MAE) [17] because MAE requires an additional step to calibrate scores into rating, and therefore does not directly reflect the quality of collaborative filtering algorithms. 61 3.3.1 Experiment (I): Effectiveness of Mixed Label Propagation The average precision of the four methods for 10 and 20 training users are reported in Table 3.1 and 3.2, respectively. Figure 3.1 and 3.2 show the average ratings of the four methods for 10 and 20 training users, respectively. First, we observe that for all of the four methods, both the average precision and the average rating are improved with the larger number of training users and given more rated items. This is consistent with our expectation: the more the training examples, the better the performance. Second, according to the average precision metric, we observe that compared to other methods the Pearson correlation coefficient method achieves almost the worst performance. It is also surprising to observe that the maximum margin matrix factorization (i.e., MMMF) does not improve the prediction accuracy in comparison to the Pearson correlation coefficient. We believe that this may be attributed to the small number of training users in our study while most of the previous studies of MMMF focused on large numbers of training users. Third, the label propagation method based on the harmonic function performs better than both the Pearson correlation coefficient method and the maximum-margin matrix factorization method according to the average precision metric. Finally, according to both metrics, the mixed label propagation method outperforms the other three methods considerably in all experiments and the improvements are statistical significant under the student-t test with 95% confidence interval. 3.3.2 Experiment (II): Empirical Values for λ As described before, the solution for yu in Eqn. (3.7) is essentially equivalent to the solution by the harmonic function (in Eqn. (3.3)) if we use L = L+ − λL− as the graph Laplacian matrix. Since we have to solve the mixed label propagation problem for each movie, we compute a different λ for each movie. Figure 3.3 shows the λs that were computed for the 100 movies with 20 training users and 10 given ratings. We clearly see the large variance in λ across different movies. For some movies the optimal λ can be as high as 0.3. For other movies the optimal λ can be as low as 10−4 . Given the empirical values of λ shown 62 in Figure 3.3, it is unlikely to find a fixed λ that can fit in well with all movies. This is also confirmed by our empirical study with fixed λ. 3.4 Discussion A closely related work [48] was published almost the same time as our mixed label propagation, here we refer it to the mixed-graph method. 
In the mixed-graph method, the authors incorporate both similarity and dissimilarity into semi-supervised classification problems. More specifically, they define a mixed graph over n nodes by two n × n matrices S and W. The matrix S specifies the edge type: S_{ij} = 1 if there is a similarity edge between the ith and jth nodes, and S_{ij} = −1 if there is a dissimilarity edge. Non-negative weights w_{ij} ≥ 0 represent the weight of the edge between the ith and jth nodes, regardless of its type. They then define a matrix M as an analog of the graph Laplacian L:

M = L + (\mathbf{1}\mathbf{1}^\top − S) \circ W

where \circ stands for the elementwise product. Like the Laplacian matrix, M is positive semi-definite, and if the graph has no dissimilarity edges, M degenerates to L. The matrix M is then used in the framework of manifold regularization [7], in which the discriminant function is obtained by solving:

\min_{f \in \mathcal{H}} \sum_{i=1}^{l} c\bigl(y_i, f(x_i)\bigr) + \lambda_1 \|f\|_{\mathcal{H}}^2 + \lambda_2 \mathbf{f}^\top M \mathbf{f}   (3.13)

where \mathcal{H} is the Reproducing Kernel Hilbert Space of a kernel K, c(·, ·) is an arbitrary loss function, e.g., the hinge loss for the Support Vector Machine (SVM) [18] or the squared loss for Regularized Least Squares classifiers (RLS) [41], and \mathbf{f} is the vector of discriminant function values on the n points. The first two terms are the same as in supervised learning, while the third term is the additional regularization term for mixed-graph semi-supervised learning.

Compared to our mixed label propagation, which constructs energy functions for similarity and dissimilarity separately, the mixed-graph method seeks a unified way to express both similarity and dissimilarity, resulting in a mixed version of the standard Laplacian matrix, namely the matrix M. This brings both advantages and disadvantages. On the positive side, the matrix M is still positive semi-definite, which makes the optimization problem (3.13) convex. On the negative side, the relative importance of similarity and dissimilarity cannot be balanced automatically: one has to resort to other techniques, such as cross validation, to decide how important the dissimilarity is before constructing the matrix M. This can be seen clearly in the experimental part of the mixed-graph study, in which the weights of the dissimilarity edges are manually varied to examine the importance of dissimilarity. Our method, in contrast, has the benefit of automatically deciding the relative importance of similarity and dissimilarity. As discussed when explaining (3.7), our method essentially uses L = L_+ − λL_−, and can automatically determine λ by making the optimal tradeoff between the inconsistency measure E_+ and the consistency measure E_−.

3.5 Summary

In this chapter, we proposed a mixed label propagation framework for semi-supervised learning. Unlike existing graph-based approaches, which are only applicable to the positive similarity of examples, our framework is able to exploit both positive and negative similarity simultaneously. The key idea behind the proposed framework is very similar to that of LDA: minimize the inconsistency between the class assignments and the positive similarity of examples, and maximize the consistency between the class assignments and the negative similarity of examples. We presented an efficient learning algorithm for mixed label propagation based on an alternating optimization strategy and semi-definite programming.
Our empirical study with collaborative filtering showed that the proposed algorithm is effective in exploiting negative similarity and outperforms both the label propagation approach and state-of-the-art approaches for collaborative filtering.

Figure 3.1. The average rating of Mixed Label Propagation (MLP), Label Propagation (LP), the Pearson Correlation Coefficient (Pearson) and Maximum-Margin Matrix Factorization (MMMF) using 10 training users: (a) 10 given rated items; (b) 15 given rated items.

Figure 3.2. The average rating of Mixed Label Propagation (MLP), Label Propagation (LP), the Pearson Correlation Coefficient (Pearson) and Maximum-Margin Matrix Factorization (MMMF) using 20 training users: (a) 10 given rated items; (b) 15 given rated items.

Figure 3.3. The λ values computed for the 100 movies with 20 training users and 10 given ratings.

CHAPTER 4

Second-Order PCA-Style Dimensionality Reduction Algorithm by Semi-Definite Programming

PCA is one of the most widely used methods for dimensionality reduction. However, when handling very high dimensional data such as images, PCA can be computationally expensive since it requires computing the principal eigenvectors of a very large matrix. Recently, researchers have proposed Second-Order PCA-Style (SOPCA) algorithms [168, 172, 38, 20, 12, 95] that aim to alleviate the high computational cost of PCA. Unlike PCA, which represents each object, such as an image, by a vector, these methods represent each datum by a matrix, a natural representation for images, which is the key to reducing the computational cost. Despite their success, almost all SOPCA methods require solving a non-convex optimization problem and, as a result, are only able to identify a locally optimal solution, either by an alternating method or by an approximate approach. In this chapter we aim to alleviate this problem by presenting a convex optimization framework for SOPCA.

4.1 Second-Order PCA-Style Algorithms

PCA, or more generally PCA-style methods, are based on the vector space model, in which each datum is a vector and the collection of data is a matrix. There are two primary drawbacks in applying the vector space model to image data analysis. First, by converting each image into a vector, the spatial relationship between neighboring pixels is completely lost, leading to algorithms that are unable to fully utilize the spatial correlation of pixels. Second, due to the large number of pixels, each image has to be represented by a vector in a very high dimensional space. As a result, running PCA-style algorithms on images can be expensive in both computation and storage, owing to the time and space complexities of singular value decomposition (SVD) for large matrices. Early work aimed to reduce the computational complexity of PCA by incremental algorithms, such as [16, 53]. More recently, researchers have proposed second-order PCA-style (SOPCA) algorithms [168, 172, 38, 20, 12, 95] to alleviate the computational complexity of PCA for images. These algorithms represent an image by a matrix and directly conduct dimensionality reduction over the collection of matrices. By avoiding the vector representation of images, they achieve substantially better computational efficiency than PCA-style algorithms for image data analysis. Both theoretical and experimental results have shown that these methods have substantially lower computational cost than classical PCA-style algorithms when applied to high dimensional data.
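To make the computational advantage concrete, the following toy sketch (dimensions and data invented for illustration only) contrasts the size of the covariance matrix that vector-space PCA would need with the two small covariances used by second-order methods such as the 2-D SVD described next.

```python
import numpy as np

# Toy illustration: n images of size p x q (values are arbitrary).
p, q, n = 112, 92, 100
images = np.random.rand(n, p, q)

# Vector-space PCA: each image becomes a (p*q)-dimensional vector, so the
# covariance matrix has (p*q) x (p*q) = 10304 x 10304 entries (~10^8 numbers).
print("PCA covariance size:", (p * q, p * q))

# Second-order (matrix) representation: only a p x p and a q x q covariance
# are needed, e.g., the row-row and column-column covariances used by 2-D SVD.
M_bar = images.mean(axis=0)
F = sum((M - M_bar) @ (M - M_bar).T for M in images)   # p x p
G = sum((M - M_bar).T @ (M - M_bar) for M in images)   # q x q
print("Second-order covariance sizes:", F.shape, G.shape)
```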
One of the most representative works on SOPCA is the Generalized Low Rank Approximations of Matrices (GLRAM) [172]. It reduces the dimensionality of a matrix by multiplying it with left and right projection matrices. More specifically, we denote by M = (M_1, M_2, . . . , M_n) the collection of matrices for dimensionality reduction, where each M_i ∈ R^{p×q}. The goal is to identify two matrices L ∈ R^{p×k_1} and R ∈ R^{q×k_2} such that each matrix M_i can be well approximated by L A_i R^\top, where A_i ∈ R^{k_1×k_2}. In practice we often set k_1 = k_2 = k, where k is the target dimension. The matrices L and R are obtained by solving the following optimization problem:

\min_{L, R, \{A_i\}} \sum_{i=1}^{n} \|M_i - L A_i R^\top\|_F^2   (4.1)
s.t. L^\top L = R^\top R = I_k.

It is well known that (4.1) is a non-convex optimization problem, and an iterative procedure is usually employed to obtain a locally optimal solution. More specifically, an initial L is chosen first, for example an identity matrix. Then, the columns of the projection matrix R are computed as the eigenvectors corresponding to the k largest eigenvalues of the matrix \sum_{i=1}^{n} M_i^\top L L^\top M_i. With the computed R, we can then update L by computing the eigenvectors of the matrix \sum_{i=1}^{n} M_i R R^\top M_i^\top and forming L from the eigenvectors corresponding to the k largest eigenvalues. The procedure is repeated until convergence.

Inspired by GLRAM, [38] extends the traditional Singular Value Decomposition (SVD) to a 2-D form. In 2-D SVD, the row-row and column-column covariance matrices are first computed as:

F = \sum_{i=1}^{n} (M_i - \bar{M})(M_i - \bar{M})^\top, \qquad G = \sum_{i=1}^{n} (M_i - \bar{M})^\top (M_i - \bar{M}),

where \bar{M} = \frac{1}{n}\sum_{i=1}^{n} M_i is the mean of all matrices. The left and right projection matrices are then computed from the leading eigenvectors of F and G. It can be shown that 2-D PCA [168] is a special case of 2-D SVD. Similar to 2-D SVD, Tensor-PCA [20] views each image as a point in a tensor space and tries to find the optimal tensor subspace in which the variance of the projected data is maximized. The optimization problem of Tensor-PCA, like that of GLRAM, is non-convex. In [20] the authors present an approximate algorithm that in fact gives the same solution as 2-D SVD [66]. In [12] the authors extend Probabilistic PCA [13], a probabilistic interpretation of PCA reviewed in Section 2.1.3, to matrix and tensor dimensionality reduction; they present a family of probabilistic models, termed Probabilistic Higher-Order PCA, that explicitly specify the generative process for higher-order objects such as matrices. Studies in [77, 95] also consider the extension of second-order PCA to higher-order tensor objects.

4.2 A Convex Formulation for SOPCA

The primary difficulty in identifying the left and right projection matrices L and R in GLRAM arises from their interdependency: the solution for L depends on the solution for R and vice versa. The key observation motivating this work is that if we can factorize each matrix M_i in advance into a product of a left matrix U_i and a right matrix V_i, then the left and right projection matrices L and R can be roughly computed by examining the eigenvectors of the sums of the left and right matrices, respectively. Based on this motivation, we introduce a Semi-definite Programming (SDP) [156] formulation for SOPCA, which is a convex optimization problem and is therefore guaranteed to find the globally optimal solution.

4.2.1 Low Norm Approximation of a Single Matrix

The key question now is how to determine the appropriate factorization of M_i.
In order to motivate the right formulation, we first consider the case with a single matrix X. Assume X = U ΣV is the singular value decomposition of X. According to the definition of the trace norm of a matrix [136](See Appendix D), the optimal solution to the following optimization problem: min ˜˜ ˜˜ tr(U U ) + tr(V V ) s. t. ˜˜ X = UV ˜ ˜ U ,V (4.2) satisfies the condition: ˜˜ UU = U ΣU ˜˜ VV = V ΣV This provides us with a good staring point for the factorization of a matrix X. In the next step, we extend (4.2) from extracting all singular vectors to only the singular vectors corresponding to the k largest singular values, which is summarized in the following theorem: 72 Theorem 4.1. The optimal value to the following optimization problem min Sk (P ) + Sk (Q) p P ∈S+ q Q∈S+ P X s. t. is 2 X Q (4.3) 0 k i=1 σi , where Sk (·) is the sum of the k largest singular value of a matrix, and σi , i = 1, . . . , k are the largest k singular values of X. The first k eigenvectors P and Q are the first k left and right singular vectors of matrix X. Proof. First, we could write Sk (P ) as an optimization problem [158], i.e., Sk (P ) = max {tr(ZP P ) : 0 p ZP ZP ∈S+ Ip , tr(ZP ) = k} For the convenience of presentation, we denote p ∆p (k) = {Z ∈ S+ : 0 Z Ip , tr(Z) = k}, and therefore represent Sk (P ) as Sk (P ) = max ZP ∈∆p (k) tr(ZP P ). Using the above notations, the problem in (4.3) can be transferred into a min-max optimization problem, which is: min max p P ∈R+ ZP ∈∆p (k) q Z ∈∆q (k) Q∈R+ Q tr(ZP P ) + tr(ZQ Q) P X s. t. X Q (4.4) 0 Using the Von Newmann’s lemma [111], we could switch minimization and maximization. 73 Furthermore, by constructing the dual problem for P and Q, we have (4.4) rewritten as: max ZP ,ZQ ,ZX tr(ZX X + ZX X ) ZP Z X ZX ZQ s. t. 0 0 ZP Ip , tr(ZP ) = k 0 ZQ Iq , tr(ZQ ) = k We write X in its singular value decomposition, i.e., X = U ΣV , and the objective function becomes 2tr (U ZX V + V ZX U )Σ . We furthermore replace the LMI constraint as U ZP ZX ZX ZQ −V U −V 0 which leads to the following LMI constraint U ZP U + V ZQ V − V Z X U − U ZX V 0 Define A = U ZP U, B = V ZQ V, C = V ZX U + U ZX V we finally simplify the original problem as: max tr(CΣ) s. t. C A+B 0 A, B A,B,C Im tr(A) = tr(B) = k Evidently, the optimal solution for C is 2Ik , when A and B are Im (k), where Im (k) is a m × m diagonal matrix with first k diagonal elements being one, and zeros for the remaining diagonal elements. Note that since we enlarged the solution space by relaxing the LMI constraint, what we obtained here is the upper bound of the optimal value of problem (4.3). However, the solutions to C, A and B are exactly in the original space, therefore, the optimal 74 value of problem (4.3) can reach its upper bound which is twice the sum of the first k singular values of X. Since A and B are Im (k), we have ZP and ZQ as constructed by the top k column vectors of U and V , i.e., the first k left and right singular vectors of X. 4.2.2 Low Norm Approximation for Multiple Matrices The result in Theorem 5.2 indicates an alternative way of extracting the top singular vectors from a given matrix. In order to extract the optimal projection matrices from a collection of matrices, it is natural to extend (4.3) to multiple matrices. In particular, we propose the following optimization formulation for SOPCA. n n min p Pi ∈S+ q Qi ∈S+ Sk Pi i=1 Pi Mi s. t. Qi + Sk (4.5) i=1 Mi Qi 0, i = 1, 2, . . . , n where Pi and Qi represent a particular factorization of matrix Mi . 
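Before moving on to the analysis of the multi-matrix formulation, a small numerical check of the single-matrix factorization property behind (4.2) may be helpful. The numpy snippet below is only a sketch with made-up dimensions, not part of the original derivation; it verifies that the balanced factorization X = \tilde{U}\tilde{V}^\top built from the SVD satisfies \tilde{U}\tilde{U}^\top = U\Sigma U^\top and \tilde{V}\tilde{V}^\top = V\Sigma V^\top, and attains the objective value 2\sum_i \sigma_i.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Balanced factorization X = U_t @ V_t.T suggested by the trace-norm identity (4.2)
U_t = U * np.sqrt(s)        # scales the i-th left singular vector by sqrt(sigma_i)
V_t = Vt.T * np.sqrt(s)     # scales the i-th right singular vector by sqrt(sigma_i)

print(np.allclose(X, U_t @ V_t.T))                                  # True: X = U_t V_t^T
print(np.allclose(U_t @ U_t.T, U @ np.diag(s) @ U.T))               # True: U_t U_t^T = U Sigma U^T
print(np.allclose(V_t @ V_t.T, Vt.T @ np.diag(s) @ Vt))             # True: V_t V_t^T = V Sigma V^T
print(np.trace(U_t.T @ U_t) + np.trace(V_t.T @ V_t), 2 * s.sum())   # objective equals twice the trace norm
```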
To facilitate our analysis, the following theorem presents an alternative form of GLRAM in (4.1). Theorem 4.2. The optimal solution to (4.1) is equivalent to the following optimization problem n max p ZP ∈S+ q ZQ ∈S+ s. t. tr(Mi ZP Mi ZQ ) i=1 0 ZP Ip , tr(ZP ) = k 0 ZQ Iq , tr(ZQ ) = k Proof. As already shown in [172], GLRAM in (4.1) is equivalent to n tr(Mi LL Mi RR ) max L∈Rp×k R∈Rq×k s. t. i=1 L L = R R = Ik 75 (4.6) By defining ZP = LL and ZQ = RR , we have the problem in (4.6). The following theorem reveals the relationship between GLRAM in (4.6) and the proposed formulation in (4.5) Theorem 4.3. Let the optimal value of (4.5) be denoted by f1 , and the optimal value of (4.6) be denoted by f2 . We have the following relationship between f1 and f2 , 1 2 f2 ≤ f1 2 (4.7) Proof. Let Ui and Vi be a factorization of matrix Mi , i.e., Mi = Ui Vi . We can rewrite the objective function as n tr(Ui ZP Ui Vi ZQ Vi ) i=1 Note that both Ui ZP Ui and Vi ZQ Vi are positive semi-definite matrices and the above function is upper bounded as: n tr(Ui ZP Ui Vi ZQ Vi ) i=1 n ≤ tr(Ui ZP Ui Ui ZP Ui )tr(Vi ZQ Vi Vi ZQ Vi ) i=1 n ≤ ≤ 1 2 1 2 1 ≤ 2 tr(Ui ZP Ui Ui ZP Ui ) + tr(Vi ZQ Vi Vi ZQ Vi ) i=1 n [tr(Ui ZP Ui )]2 + [tr(Vi ZQ Vi )]2 i=1 n 2 tr(Ui ZP Ui ) + tr(Vi ZQ Vi ) i=1 The first step of the above derivation follows the Cauchy’s inequality. We define Pi = Ui Ui 76 and Qi = Vi Vi , and have the above inequality simplified as n tr(Ui ZP Ui Vi ZQ Vi ) i=1 1 ≤ 2 ≤ 1 2 n tr ZP Pi + tr ZQ i=1 n Pi + Sk i=1 Qi i=1 2 n Sk 2 n Qi i=1 Finally, using the property derived in [136], i.e., P = UU , Q = V V , X = UV ⇐⇒ P X X Q 0, we have the result in the theorem. As revealed by Theorem 4.3, the optimal value of (4.5) provides an upper bound for the optimal value of (4.6), which to some degree justifies the usage of (4.5) for finding the optimal projection for matrices. 4.2.3 An SDP Formulation for SOPCA Solving the optimization problem in (4.5) is difficult due to function Sk (·). The theorem below shows an SDP formulation that is equivalent to (4.5). Theorem 4.4. The problem in (4.5) is equivalent to the following optimization problem: n tr(Zi Mi ) max p Tp ∈S+ q TQ ∈S+ Zi ∈Rp×q (4.8) i=1 TP Zi s. t. TP Zi TQ I, TQ 0, i = 1, 2, . . . , n I tr(TP ) ≤ k, tr(TQ ) ≤ k 77 Proof. Similar to the analysis of Theorem 5.2, we first rewrite Sk ( n n Sk Pi = i=1 p where ∆p (k) = {T ∈ R+ : 0 n i=1 Pi ) as T max TP ∈∆p (k) tr TP Pi i=1 Ip , tr(T ) = k}. Hence, the optimization problem in (4.5) becomes n min max tr p Pi ∈S+ TP ∈∆p (k) q T ∈∆q (k) Qi ∈S+ Q n Pi TP + tr i=1 Pi Mi s. t. Qi TQ i=1 Mi Qi 0, i = 1, . . . , n Since the above problem is a convex-concave optimization problem, we could switch minimization with maximization. In the next step, we will turn the minimization over Pi and Qi into a maximization problem by deriving its dual form. In particular, we aim to derive the dual form for the following minimization problem n min tr Pi ,Qi n Pi TP + tr Qi TQ i=1 Pi Mi s. t. i=1 Mi Qi 0, i = 1, . . . , n To this end, we construct the Lagrangian function L n L = tr n Pi TP + tr i=1 n − Qi TQ i=1 tr(Pi Ai ) + tr(Qi Bi ) + 2tr(Zi Mi ) i=1 where Ai Zi Zi Bi 0 is the Lagrangian multiplier introduced for each LMI constraint. By setting the derivative of L with respect to Pi and Qi to be zero, we have TP = Ai and TQ = Bi . Hence, the resulting 78 dual problem becomes n max tr(Zi Mi ) Zi i=1 TP Zi s. t. Zi TQ 0, i = 1, . . . 
, n Combining the above maximization problem with the maximization over TP and TQ , we have the theorem result. The formulation in (4.8) reveals an interesting aspect of the new formulation for SOPCA that is summarized by the following theorem. Theorem 4.5. The problem in (4.8) is equivalent to the following optimization problem if rank(TP ) = rank(TQ ) = k n max A∈Rp×k B∈Rq×k s. t. where TP = AA (4.9) A Mi B tr i=1 A A = B B = Ik and TQ = BB . Proof. First, since rank(TP ) = rank(TQ ) = k, we could rewrite TP and TQ as k λP ai ai = AΣP A i TP = i=1 k TQ = Q λi bi bi = BΣQ B i=1 where A = (a1 , . . . , ak ) and B = (b1 , . . . , bk ) are eigenvectors of TP and TQ , and {λP }k i i=1 Q and {λi }k are the eigenvalues of TP and TQ , respectively. Since 0 i=1 TP , TQ I, we have Q 0 ≤ λP , λi ≤ 1. Since the objective is to maximize tr(Zi Mi ), and a larger TP and TQ will i Q lead to a larger Zi , we therefore have λP = λi = 1, i = 1, . . . k. Thus, TP and TQ can be i simplified as TP = AA and TQ = BB . Due to the LMI constraint, we have Zi written as 79 Zi = AWi B where Wi ∈ Rk×k . Using Wi , we rewrite the LMI constraint as Ik Wi Wi Ik 0 The above constraint is also equivalent to Wi Wi Ik . Hence, we could write Wi in its singular value decomposition form, i.e., k Wi = i γli ui vl l l=1 where 0 ≤ γli ≤ 1 due to Wi Wi Ik . As a result, tr(Zi Mi ) is upper bounded by tr(Zi Mi ) = tr(Wi A Mi B) ≤ A Mi B tr i where the equality is taken when ui and vl are singular value decomposition of A Mi B. l We thus could replace the problem in (4.8) with the one that involves A and B, which leads to the theorem result. Using the above theorem result, it is easy to identify the relationship between (4.9) and (4.6). To make the comparison more obvious, we rewrite A Mi B tr as A Mi B tr = tr By letting ZP = AA A Mi BB Mi A and ZQ = BB , we have the following relationship [136]: 1 tr(Mi ZP Mi ZQ ) 2 ≤ A Mi B tr 1 1 r 2 tr(Mi ZP Mi ZQ ) 2 ≥ A Mi B tr , where r is the rank of A Mi B, which clearly reveals the relationship between (4.6) and (4.9). 4.3 The Algorithm Optimization problem involved in (4.8) is convex, thus the global optimal exists. We can resort to the standard SDP solving packages to solve (4.8). Directly solving (4.8) for large80 Dataset USPS ORL AR Size 4000 400 1638 Dim 16 ×16=256 112×92=10304 400×350=140000 # of classes 10 40 126 Table 4.1. Statistics of data sets size data sets however, could be computationally expensive. Below, we give an alternative algorithm that obtains an approximate solution to (4.8). Note that in (4.9), n i=1 tr the trace norm n i=1 A Mi B tr can be expressed as A MiB [MiB ] A , where MiB = Mi B. Using the following inequality: A (MiB [MiB ] )A 1 1 = A (MiB [MiB ] ) 2 (MiB [MiB ] ) 2 A 1 A (MiB [MiB ] ) 2 A 1 A (MiB [MiB ] ) 2 A , we have the following inequality holds: tr A (MiB [MiB ] )A ≥ tr A MiB [MiB ] A Following the above inequality, we relax the problem in (4.9) as follows: n max A∈Rp×k s. t. tr A MiB [MiB ] A (4.10) i=1 A A = Ik Clearly, the solution to (4.10) is the top k eigenvectors of B can be computed from the top eigenvectors of matrix n i=1 n i=1 MiB [MiB ] . Similarly, MiA [MiA ] , where MiA = Mi A. Algorithm 1 summarizes the iterative algorithm. 4.4 Experiments We evaluated the efficacy of the proposed method by using a task of image classification, and compared it to GLRAM and 2-D SVD, two state-of-the-art algorithms for SOPCA. 
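For concreteness, the alternating procedure of Section 4.3 (summarized in Algorithm 1 below) can be sketched in a few lines of numpy. This is only an illustration of the update rules derived from (4.10), not the exact implementation used in the experiments; the toy data, initialization, and fixed iteration count are assumptions made for the sketch.

```python
import numpy as np

def sopca_approx(Ms, k, n_iter=20, seed=0):
    """Alternately update A from sum_i M_i B B^T M_i^T and B from sum_i M_i^T A A^T M_i."""
    p, q = Ms[0].shape
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((p, k)))   # initial orthonormal A_0
    for _ in range(n_iter):                            # fixed iterations stand in for a convergence test
        MR = sum(M.T @ A @ A.T @ M for M in Ms)        # q x q
        B = np.linalg.eigh(MR)[1][:, -k:]              # top-k eigenvectors (eigh returns ascending order)
        ML = sum(M @ B @ B.T @ M.T for M in Ms)        # p x p
        A = np.linalg.eigh(ML)[1][:, -k:]
    Ds = [A.T @ M @ B for M in Ms]                     # low dimensional representations D_i = A^T M_i B
    return A, B, Ds

# toy usage with invented data
Ms = [np.random.rand(16, 16) for _ in range(50)]
A, B, Ds = sopca_approx(Ms, k=4)
print(A.shape, B.shape, Ds[0].shape)                   # (16, 4) (16, 4) (4, 4)
```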
Algorithm 1 Second-Order PCA-Style Algorithm by Semi-Definite Programming
1: INPUT: a collection of matrices {M_i}_{i=1}^n, target dimension k
2: OUTPUT: left and right projection matrices A, B, and the low dimensional representations {D_i}_{i=1}^n
3: Obtain an initial A_0 and set i = 0
4: while not converged do
5:   i = i + 1
6:   compute M_R = \sum_{j=1}^{n} M_j^\top A_{i-1} A_{i-1}^\top M_j
7:   compute the top k eigenvectors {u_l}_{l=1}^k of M_R
8:   set B_i = [u_1, · · · , u_k]
9:   compute M_L = \sum_{j=1}^{n} M_j B_i B_i^\top M_j^\top
10:  compute the top k eigenvectors {v_l}_{l=1}^k of M_L
11:  set A_i = [v_1, · · · , v_k]
12: end while
13: set A = A_i and B = B_i
14: for i = 1 to n do
15:   D_i = A^\top M_i B
16: end for

More specifically, we represented each image by a matrix and applied the algorithms to the training images to compute the left and right projection matrices L and R. The low dimensional representation A_i was then computed for each test matrix X_i (i.e., image) as A_i = L^\top X_i R. Finally, we predicted the class labels of the test images using the 1-nearest-neighbor classifier, in which the distance between two images X_a and X_b was computed by the Frobenius norm \|A_a − A_b\|_F, where A_a and A_b are the reduced representations of X_a and X_b. A 10-fold cross validation was used in the experiments, and the classification accuracy averaged over the ten runs is reported. It is worth noting that the purpose of this experiment is simply to demonstrate the advantages of the proposed convex formulation over the non-convex formulations of GLRAM and 2-D SVD; better classification results can be achieved if more sophisticated processing procedures and state-of-the-art classifiers are applied.

Three image data sets were employed as the test bed: ORL [133], AR [102], and USPS [64]. For the ORL and USPS data sets, we used the original image size. For the AR data set, each image was cropped from row 100 to 500 and column 200 to 550, the same as in [172]. The details of these data sets are summarized in Table 4.1. By design, the difficulty of the task increases in two respects from the USPS data set to the AR data set. First, the dimensionality of the data increases dramatically: for USPS it is only 256, while for AR it is 140,000, more than 500 times higher. By switching between the data sets, we gradually vary the difficulty of dimensionality reduction, which gives a clear picture of how these algorithms perform. Second, the ORL and AR data sets consist of human face images, which makes them a more challenging task for dimensionality reduction than the USPS digit images. Moreover, some faces in the AR data set vary in facial expression, illumination conditions, and occlusion (sun glasses and scarf). Combining the high dimensionality and the large variance of the face images, the AR data set is the most difficult of the three.

Table 4.2 shows the classification errors of the three algorithms for different values of k (i.e., the target dimension). First, we observe that all three methods have a similar performance level on the USPS data set. We believe this may be due to the low dimensionality of the data points in USPS: each image is only 16 × 16 = 256 dimensional, so even a very low dimensional representation is sufficient to capture most of the information in the original data. In addition, the large number of samples in USPS makes it an even easier task for dimensionality reduction.
Second, for the ORL data set, all of the three methods still perform similarly except for k = 2, where the proposed algorithm yields a significantly better performance than the two baseline algorithms. Finally, for AR data set, the proposed method outperforms the two baseline methods significantly for almost all ks. We also observe that 2-D SVD performs significantly worse than the other two algorithms. In summary, we can conclude that the proposed algorithm is more effective than the two state-of-the-art algorithms for SOPCA, especially when the target dimension is very low. 83 4.5 Summary In this chapter, we proposed aa convex formulation for the SOPCA problem. This contrasts to most existing algorithms for SOPCA which require solving non-convex optimization problems. We showed that the proposed formulation could be converted into a Semi-definite Programming Problem, we also presented an approximate algorithm to efficiently solve the related SDP problem and evaluated the proposed algorithm by image classification. Our experimental study shows the performance of the proposed algorithm is better than or comparable to GLRAM and 2-D SVD, two state-of-the-art algorithms for SOPCA. 84 USPS data set k=2 k=6 k=7 k=8 k=9 k = 10 0.3553 0.0995 0.0485 0.0385 0.0375 0.0377 0.0398 0.0405 0.0428 0.0003 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.3565 0.0978 0.0505 0.0385 0.0375 0.0380 0.0403 0.0408 0.0423 0.0003 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.3553 0.1108 0.0573 0.0428 0.0413 0.0380 0.0400 0.0408 0.0425 0.0004 2-D SVD k=5 0.0002 GLRAM k=4 0.0001 Proposed k=3 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 ORL data set k=2 k=6 k=7 k=8 k=9 k = 10 0.1725 0.0550 0.0275 0.0150 0.0200 0.0150 0.0150 0.0200 0.0175 0.0012 0.0008 0.0007 0.0008 0.0004 0.0006 0.0005 0.0003 0.3000 0.0425 0.0275 0.0175 0.0175 0.0175 0.0150 0.0200 0.0175 0.0014 0.0008 0.0004 0.0006 0.0007 0.0006 0.0005 0.0003 0.2225 0.0400 0.0300 0.0250 0.0200 0.0150 0.0150 0.0200 0.0200 0.0041 2-D SVD k=5 0.0036 GLRAM k=4 0.0019 Proposed k=3 0.0016 0.0007 0.0007 0.0005 0.0006 0.0006 0.0005 0.0005 AR data set k=2 k=6 k=7 k=8 k=9 k = 10 0.6590 0.4385 0.3128 0.4667 0.3974 0.3744 0.3333 0.3179 0.3077 0.0095 0.0036 0.0031 0.0033 0.0034 0.0052 0.0049 0.0044 0.6797 0.6154 0.5128 0.4974 0.4538 0.4026 0.3667 0.3308 0.3154 0.0084 0.0044 0.0043 0.0038 0.0044 0.0034 0.0037 0.0030 0.9282 0.9128 0.8718 0.8359 0.0877 0.8385 0.8231 0.8231 0.8256 0.0018 2-D SVD k=5 0.0041 GLRAM k=4 0.0023 Proposed k=3 0.0023 0.0037 0.0063 0.0054 0.0045 0.0065 0.0018 0.0028 Table 4.2. The classification errors of 1-NN classifier using the proposed dimensionality reduction algorithm for SOPCA, GLRAM and 2-D SVD. For each method, the first line is the mean error rates of the ten-fold cross validation and the second line is the standard deviations of the error rates. k represents the target dimension. The best performance of each case is highlighted by the bold font. 85 CHAPTER 5 Dimensionality Reduction for Functional Data Representation and Its Application to Large-Scale Image Retrieval by Local Features In the previous chapter, we extended the data representation from vectors to matrices. In this chapter, we will study the problem of dimensionality reduction with even more challenging data representation, i.e., each datum is described as a different number of unordered vectors, which sometimes is referred to as the bag-of-features model. 
One way to deal with the bag-of-features representation is to associate with each datum a function from which the set of vectors are generated. The challenge of using functions to represent data is that the similarity between two data points is evaluated by the similarity between the associated functions, which is usually computationally expensive. It is therefore important to identify a low dimensional representation for those functions to speed up the computation of similarity between two functions and to facilitate the search of nearby neighbors, leading to the research of functional data dimensionality reduction. We study the problem of representing functions in a low dimensional space in the context of content-based image retrieval (CBIR) with images represented by the bag-of-features. In this chapter, we first briefly review the contentbased image retrieval, and present how the problem of image matching can be cast into the problem of matching two different density functions. We then describe our approach 86 of functional data dimensionality reduction, which embeds each density function in a low dimensional space. 5.1 Content-Based Image Retrieval with Local Image Features Content-based image retrieval (CBIR) is a long standing challenging problem in computer vision and multimedia. The earliest use of the term content-based image retrieval in the literature seems to have been by [60], to describe his experiments into automatic retrieval of images from a database by color and shape feature. The term has since been widely used to describe the process of retrieving desired images from a large collection on the basis of features that can be automatically extracted from the images themselves [40]. In the early ages, global features such as color [139, 109, 108], texture [22, 151, 92, 97, 96] and shape [112, 103] were widely used to represent images for CBIR. Recent studies [119, 145, 113, 89, 141, 75] have shown that local image features (e.g. SIFT descriptor [94]), often referred to as keypoints, are effective for identifying images with similar visual content. The key idea is to represent each image by a set of “interesting” patches extracted from the image. By representing every image patch with a high dimensional feature vector, each image is essentially represented by a bag of feature vectors, which is often referred to as the bag-of-features representation [29]. In the bag-of-features representation, the number of keypoints usually varies from image to image and the order among the keypoints is usually ignored in representing the visual content of images although they may be important. Given the bag-of-features representation, the similarity between two images are measured based on the“overlap” between the two sets of keypoints associated with the images. The challenge of applying the bag-of-features representation to image retrieval is how to efficiently measure the similarity between two unordered set of vectors with different number of vectors in each set. A straightforward method to compute the similarity between two sets of keypoints is the 87 best partial matching which finds the best mapping between the keypoints in the two images that has the overall shortest distance [94, 104]. The main shortcoming of the optimal partial matching is its high computational cost since we need to compute the distance between every possible pair of keypoints of the two images. 
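As an illustration of why the optimal partial matching is expensive, the following sketch computes a best one-to-one matching between two bags of descriptors with the Hungarian algorithm; it needs the full pairwise distance matrix, which is exactly the costly step. The data and dimensions are invented, and the snippet illustrates only the general idea, not the precise matching criterion of [94, 104].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def best_partial_matching_cost(F1, F2):
    """Cost of the best one-to-one matching between two bags of local descriptors
    (one descriptor per row); min(len(F1), len(F2)) pairs are matched."""
    D = cdist(F1, F2)                       # all pairwise distances: the expensive step
    rows, cols = linear_sum_assignment(D)   # Hungarian algorithm on the rectangular cost matrix
    return D[rows, cols].sum()

# two images with different numbers of 128-dimensional SIFT-like descriptors
rng = np.random.default_rng(0)
F1 = rng.random((40, 128))
F2 = rng.random((55, 128))
print(best_partial_matching_cost(F1, F2))
```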
Several methods have been proposed to improve the computational efficiency of the optimal partial matching [146, 52, 79, 106]. The central ideal of these methods is that instead of using the bag-of-features representation directly, they manage to convert the bag-of-features into certain consistent form and then standard techniques for similarity measurement can be easily applied. For example, in [146] and [52], the keypoints are quantized into “visual words” or “multi-resolution bins” resulting a histogram or multi-resolution histogram representation of the images. These methods, however, suffer from the information loss during the quantization procedure [120] which often leads to suboptimal results. One possible approach to alleviate the problem mentioned above is to assume that the observed feature vectors are sampled from an unknown distribution. This assumption leads to a natural representation that represents each image by a density function of keypoints [79, 106, 83, 30]. The similarity between two images is then computed by the distance between two density functions derived from the observed keypionts. More specifically, let G = {I1 , . . . , IC } be the collection of C images, and each image Ii be represented by a set of ni keypoints {xi , . . . , xi }, where each key point xi ∈ Rd is a d ni 1 dimensional vector. We assume that the keypoints of image Ii are randomly sampled from an unknown distribution p(x|Ii ). Furthermore, we can assume the distribution of p(x|Ii ) has a parametric form, for example, Gaussian Mixture Model (GMM) [106]. The parameter of the model can be estimated by maximizing the log likelihood of the observed keypoints of the image. After obtaining the generative models p(x|Ii ) of the images, the similarity between images Ii and Ij is computed as the model distance between p(x|Ii ) and p(x|Ij ). One of the most widely used methods is the Kullback-Leibler divergence [106]: D(p(x|Ii ), p(x|Ij )) = p(x|Ii ) log p(x|Ii ) dx p(x|Ij ) (5.1) The disadvantage of fitting keypoints by a parametric model is that the resulting model 88 could be significantly constrained by the family of parametric models, and consequently leads to a suboptimal performance for image retrieval. Besides the parametric models, we can also use nonparametric methods, for instance, kernel density estimation [115] to estimate the underlying density function p(x|I), i.e., p(x|I) = 1 n n κ(x, xi ) (5.2) i=1 where κ(·, ·) : Rd ×Rd → R+ is the kernel density function that is normalized as dzκ(x, z) = 1. Given the density function in (5.2), we follow the work of statistical language models [122] for document retrieval, and measure the similarity of I to the query image Q by the logarithm of the query likelihood p(Q|I), i.e., m m log  log p(qi |I) = log p(Q|I) = i=1  i=1 1 n n  κ(xj , qi ) j=1 Although the idea of modeling a bag-of-features by a density function and model queryto-image similarity by a log-likelihood have been studied by multiple authors (e.g., [166, 76, 86, 107, 79]), there are one computational challenge that makes this approach difficult to scale to image retrieval problems with large databases, i.e., an naive implementation of this approach requires a linear scan of the entire database before finding the images with the largest similarity. In this chapter, we will address this challenge by developing dimensionality reduction methods for functional data representation. 
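To make the density-based view concrete, here is a minimal sketch of the kernel density representation in (5.2) and the query log-likelihood described above, using an isotropic Gaussian kernel as one possible choice of κ; the dimensions, bandwidth, and data are invented for illustration.

```python
import numpy as np

def gaussian_kernel(x, X, h=0.2):
    """kappa(x, x_i) for every row x_i of X: an isotropic Gaussian density in R^d."""
    d = X.shape[1]
    sq_dist = np.sum((X - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * h ** 2)) / ((2.0 * np.pi * h ** 2) ** (d / 2.0))

def log_query_likelihood(Q, I, h=0.2):
    """log p(Q|I) = sum_i log( (1/n) sum_j kappa(q_i, x_j) ) for keypoint sets Q and I."""
    return sum(np.log(gaussian_kernel(q, I, h).mean()) for q in Q)

rng = np.random.default_rng(0)
I = rng.random((300, 8))    # an image represented by 300 keypoints (toy 8-d descriptors)
Q = rng.random((20, 8))     # a query image with 20 keypoints
print(log_query_likelihood(Q, I))
```

Ranking a large gallery this way requires evaluating the likelihood against every image, which is exactly the scalability problem addressed in the remainder of the chapter.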
5.2 Dimensionality for Functional Data Representation In order to make an efficient similarity measurement between functions, we consider an alternative approach of estimating the density function for image I. We assume that for any image I in the gallery G, its density function p(x|I) is expressed as a weighted mixture 89 models: N p(x|I) = αi κ(x, ci ) (5.3) i=1 where ci ∈ Rd , i = 1, . . . , N is a collection of N points (centers) that are randomly selected from all the keypoints observed in G and α = (α1 , . . . , αN ) is a probability distribution used to combine different kernel functions. The choice of randomly selected centers, although may seem to be naive at the first glance, is in fact strongly supported by the consistency results of kernel density estimation [115]. In particular, the kernel density function constructed by randomly selected centers is almost “optimal” when the number of centers is very large. The number of centers N is usually chosen to be very large, in order to cover the diverse visual content of images. It is important to note that unlike (5.2), the weights α in (5.3) are unknown and need to be determined for each image. As will be shown later, with an appropriate choice of kernel function κ(·, ·), the resulting weights α will be sparse with most of the elements being zero. This is ensured by the fact that in a high dimensional space, almost any two randomly selected data points are far away from each other. It is the sparsity of α that makes it possible to efficiently compute the similarity between functions and identify images that are visually similar to the query without having to scan the entire image database. In order to use the density function in (5.3), we need to efficiently estimate the combination weights α. By assuming keypoints x1 , . . . , xn are randomly sampled from p(x|I), our first attempt is to estimate α by a maximum likelihood estimation, i.e.,   n α = arg max L(I, α) = α∈∆ where ∆ = {α ∈ [0, 1]C : N log  i=1 αj κ(xi , cj ) (5.4) j=1 C i=1 αi = 1} defines a simplex of probability distributions. It is easy to verify that the problem in (5.4) is convex and has a global optimal solution. Although we can directly apply the standard optimization approaches to find the optimal solution α for (5.4), it is in general computationally expensive because 90 • We have to solve (5.4) for every image. Even if the optimization algorithm is efficient and can solve the problem within one second, for a database with a million of images, it will take more than 277 hours to complete the computation. • The number of weights α to be determined is very large. To achieve the desired performance of image retrieval, we often need a very large number of centers, for example one million. As a result, it requires solving an optimization problem with million variables even for a single optimization problem in (5.4). In order to address the computational challenge, we choose the following local kernel function for this study κ(x, c) ∝ I(|x − c|2 ≤ ρ) (5.5) where I(z) is an indicator function that outputs 1 if z is true and zero otherwise. The parameter ρ > 0 is a predefined constant that defines the locality of the kernel function and its value is determined empirically. The proposition below shows the sparsity of the solution α for (5.4). Proposition 1. 
Given the local kernel function defined in (5.5), for the optimal solution α to (5.4), we have αj = 0 for center cj if max |cj − xi |2 > ρ 1≤i≤n Proposition 1 follows directly from the fact that κ(cj , xi ) = 0, i = 1, . . . , n if max1≤i≤n |cj − xi |2 > ρ. As implied by Proposition 1, αj will be nonzero only if the center cj is within a distance ρ of some keypoints. By setting ρ to a small value, we will only have a small number of non-zero αj . We can quickly identify the subset of centers with non-zero αj by an efficient range search, for example using k-d tree [93]. In our study, this step reduces the number of variables from 1 million to about 1, 000. Although Proposition 1 allows us to reduce the number of variables dramatically, we still have to find a way to solve (5.4) efficiently. To this end, we resort to the bound optimization strategy that leads to a simple iterative algorithm for optimizing (5.4): we denote by α the current solution and by α the updated solution for (5.4). It is straightforward to show that 91 {L(I, α) − L(I, α )} is bounded as follows n L(I, α) − L(I, α ) = log i=1 n N αj κ(xi , cj ) ≥ i=1 j=1 N l=1 αj κ(xi , cl ) N j=1 αj κ(xi , cj ) N j=1 αj κ(xi , cj ) log αj αj (5.6) By maximizing the lower bound in (5.6), we have the following updating rule for α 1 αj = Z n i=1 αj κ(xi , cj ) (5.7) N l=1 αl κ(xi , cl ) where Z is the normalization factor ensuring N j=1 αj = 1. Note that α obtained by iteratively running the updating equation in (5.7) is indeed globally optimal because the optimization problem in (5.4) is convex. We can further simplify the computation of α as: we first initialize αj = 1/N, i = 1, . . . , N , and then obtain the solution α by only running the iteration once, i.e., 1 αj = n n i=1 κ(xi , cj ) N l=1 κ(xi , cl ) (5.8) We emphasize that although the solution in (5.8) is approximated in only one update, it is however the exact optimal solution when the keypoints {xi }N are far apart from each i=1 other, as shown by the following theorem. Theorem 5.1. Let the kernel function be (5.5). Assume that all the keypoints x1 , . . . , xn are separated by at least 2ρ. The solution α in (5.8) optimizes the problem in (5.4). Proof. When any two keypoints xi and xj are separated by at least 2ρ, we have κ(xi , ck )κ(xj , ck ) = 0 for any center ck . This implies that no key point could make contribution to the estimation of weight αk simultaneously for two different centers in 5.7 . As a 92 result, the expression in 5.7 could be rewritten as αj = = = 1 Z 1 Z 1 Z n I(|xi − cj | ≤ ρ) i=1 n I(|xi − cj | ≤ ρ) i=1 n αj N l=1 αl κ(xi , cl ) αj αj κ(xi , cj ) I(|xi − cj | ≤ ρ) i=1 As a result, the updating equation will give the fixed solution, which is the global optimal solution. In algorithm 2, we summarize the procedure of computing αi for each image Ii in the image collection. The key step of computing each αi is how to efficiently compute the value of kernel function in (5.5). In the algorithm, we resort to the k-d tree based range search to achieve the goal. More specifically, we first build a k-d tree for all the key points of images in the collection. For each center, we then search the keypints which are within the distance ρ of that center using the k-d tree. For all the keypoints that are within the distance ρ of that center, their values to the kernel function (5.5) for this center is 1 and for other keypoints the value is 0. 
After we conduct the range search for every centers, we obtain the value of (5.5) for every pair of keypoints and centers for all the image in the collection which can be used directly for computing αi for each image. Regularization Although the sparse solution resulting from the local kernel is computationally efficient, the sparse solution may lead to a poor estimation of query-likelihood, as demonstrated in the study of statistical language model [99]. To address this challenge, we g g introduce αg = (α1 , . . . , αN ), a global set of weights used for kernel density function. αg plays the same role as the background langauge model in statistical language models [99]. We defer the discussion of how to compute αg to the end of this section. Given the global set of weights αg , we introduce KL(αg α), the Kullback-Leibler divergence [106] between 93 Algorithm 2 Compute weight vector αi of each image Ii in the image collection 1: INPUT: • Image collection G = {I1 , . . . , IC } with each image Ii represented by a set of keypoints • Number of random centers N • Distance threshold ρ 2: Randomly select N keypoints as the centers z1 , . . . , zN from X that consists of all the 3: 4: 5: 6: 7: 8: 9: 10: keypoints detected from images in G Construct a randomized k-d tree T for all the keypoints in X for l = 1 to N do Using k-d tree T , search keypoints which are within the distance ρ of the center zl . For all the returned keypoints, set their values of the kernel function (5.5) to be 1. For all other keypoints, set the value to be 0. end for for i = 1 to C do Compute each element in the weight vector αi of image Ii using (5.8) end for αg and α, as a regularizer in (5.4), i.e., α = arg max L(I, α) − λKL(αg α) (5.9) α∈∆ where λ > 0 is introduced to weight the importance of the regularizer. As indicated in (5.9), by introducing the KL divergence as the regularizer, we prefer the solution α that is similar to αg . Note that (5.9) is equivalent to the MAP estimation of α by introducing a Dirichlet prior Dir(α) ∝ g N β i=1 [αi ] i , where βi = λαi . Similar to the bound optimization strategy used for solving (5.4), we have the following approximate solution for (5.9) αj = 1 n+λ g n λαj + i=1 κ(xi , cj ) N l=1 κ(xi , cj ) (5.10) It is important to note that, according to (5.10), the solution for α is no longer sparse if αg is not sparse, which could potentially lead to a high computational cost in image matching. We will discuss a method in the next section that explicitly addresses this computational challenge. 94 The remaining question is how to estimate αg , the global set of weights. To this end, we search for the weight αg that can explain all the keypoints observed in all the images of gallery G, i.e., C g = arg max α L(Ii , αg ) g ∈∆ α i=1 (5.11) Although we can employ the same bound optimization strategy to estimate αg , we describe below a simple approach that directly utilizes the solution α for individual images i i to construct αg . We denote by αi = (α1 , . . . , αN ) the optimal solution that is obtained by maximizing the log-likelihood L(Ii , αi ) of the keypoints observed in image Ii . Given αi that maximizes L(Ii , αi ), we have 1 L(Ii , αg ) ≈ L(Ii , αi ) + (αg−αi ) 2 S Hessian matrix 2 L(I , α) is computed as i (5.12) ni k k k k=1 ui [ui ] , where ui ∈ 2 L(I , α) = − i N i l=1 αj κ(xk , cj )). The lemma below allows RN is a vector defined as [uk ]j = κ(xi , cj )/( i k 2 L(I , αi ). i us to bound the Hessian matrix Lemma 1. N I 2 L(I , αi )(αg−αi ) i − 2 L(Ii , αi ). Proof. 
To bound the maximum eigenvalue − 2 L(Ii , αi ), we consider the quantity γ 2 L(I , αi )γ with |γ| = 1. 2 i γ ≤ Define ηj = |γj |/( ni N i 2 j=1 γj κ(xk , cj )] N i 2 k=1 [ j=1 αj κ(xk , cj )] 2 ni N i j=1 |γj |κ(xk , cj ) N i j=1 αj κ(xk , cj ) k=1 2 L(I , αi )γ = i [ N j=1 |γj |) and η = (η1 , . . . , ηN ). Define t = γ 2 L(I , αi )γ i ni ≤ t2 k=1 N i j=1 ηj κ(xk , cj ) N i j=1 αj κ(xk , cj ) Since αi maximizes L(Ii , α), we have (η − αi ) L(Ii , α) ≤ 0, 95 N j=1 |γj |. We have which implies Since t ≤ √ ni N i j=1 ηj κ(xk , cj ) ≤1 N i j=1 αj κ(xk , cj ) k=1 N , we have 2 L(I , αi ) i −N I. Using the result in Lemma 5.1, the objective function in (5.11) can be approximated as C i=1 L(Ii , αg ) ≈ C C N L(Ii , αi ) − 2 i=1 |αi − αg |2 2 (5.13) i=1 1 The global weights αg maximizing (5.13) is αg = C C i g i=1 α which shows that α can be computed as an average of {αi }C that are optimized for individual images. i=1 5.3 Efficient Image Search Given the kernel density function p(x|Ii ) for each image in gallery G and a query Q, the next question is how to efficiently identify the subset of images that are likely to be visually similar to the query Q and furthermore rank those images in the descending order of their similarity. Following the framework of statistical language models for text retrieval, we estimate the similarity by the likelihood of generating the keypoints {qi }m observed in the query Q, i=1 i.e.,  m N log  log p(Q|Ii ) = k=1  i αj κ(qk , cj ) (5.14) j=1 i i where αi = (α1 , . . . , αN ) are the weights for constructing the kernel density function for image Ii . Clearly, a naive implementation will require a linear scan of all the images in the database before the subset of similar ones is found. To achieve the efficient image retrieval, we need to exploit the sparse structure of α in (5.10). We define 1 i αj = ni ni k=1 κ(xi , cj ) k N i l=1 κ(xk , cl ) 96 (5.15) i We then write αj as i αj = λ ni g αj + αi ni + λ ni + λ j (5.16) i i Note that although αj is sparse, αj is not. Our goal is to effectively explore the sparsity of i αj for efficient image retrieval. Using the expression in (5.16), we have log p(Q|Ii ) expressed as m   N λ ni g αl + αi κ(xj , cl ) ni + λ ni + λ l log  log p(Q|Ii ) = j=1 m = j=1 l=1 N i l=1 αl κ(xj , cl ) g N l=1 αl κ(xj , cl ) n log 1 + i λ where m log sQ = j=1 λ ni + λ m  N j=1 + sQ  g αl κ(xj , cl ) log  + (5.17) (5.18) l=1 Note that (i) the second term of sQ is independent of the individual images for the same query, and (ii) log p(Q|Ii ) ≥ sQ for any image Ii . Given the above facts, our goal is to efficiently find the subset of images whose query log-likelihood is strictly larger than sQ , i.e., log p(Q|Ii ) > sQ . To this end, we consider the following procedure: • Finding the relevant centers CQ for a given query Q. Given a query image Q with keypoints q1 , . . . , qm , we first identify the subset of centers, denoted by CQ , that are within distance ρ of the keypoints in Q, i.e., CQ = cj : ∃qk ∈ Q s. t. |qk − cj |2 ≤ ρ . • Finding the candidates of similar images using the relevant centers. Given the relevant i centers in CQ , we find the subset of images that have at least one non-zero αj for the centers in CQ , i.e.,       i >0 RQ = Ii ∈ G : αj     cj ∈CQ (5.19) Theorem 5.2 shows that all the images with query log-likelihood larger than sQ belong to RQ . 97 Theorem 5.2. Let SQ denote the set of images with query log-likelihood larger than sQ , i.e., SQ = {Ii ∈ G : log p(Q|Ii ) > sQ }. We have SQ = RQ . 
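To make the two-step candidate-generation procedure concrete, the toy sketch below (invented dimensions and data) combines the one-step weight estimate (5.8), an inverted index over the centers, and the construction of C_Q and R_Q; the k-d tree range search plays the role described in the surrounding text, and the whole sketch is an illustration rather than the exact indexing pipeline used in the experiments.

```python
from collections import defaultdict
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
centers = rng.random((1000, 8))          # hypothetical random centers c_j (toy 8-d descriptors)
rho = 0.3                                # range threshold of the local kernel (5.5)
tree = cKDTree(centers)

inverted = defaultdict(set)              # V_j: images with a non-zero weight on center j
for img in range(50):                    # index a toy gallery of 50 images
    keypoints = rng.random((100, 8))
    w = defaultdict(float)
    for x in keypoints:
        hits = tree.query_ball_point(x, rho)          # centers within rho of this keypoint
        for j in hits:                                # keypoints far from every center are ignored
            w[j] += 1.0 / (len(hits) * len(keypoints))
    for j in w:
        inverted[j].add(img)

# Query time: relevant centers C_Q, then the candidate set R_Q as a union of inverted lists.
query = rng.random((20, 8))
C_Q = set()
for q in query:
    C_Q.update(tree.query_ball_point(q, rho))
R_Q = set().union(*(inverted[j] for j in C_Q))
print(len(C_Q), "relevant centers;", len(R_Q), "candidate images")
```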
It is easy to verify the above theorem. In order to efficiently construct RQ (or SQ ) for a given query Q, we exploit the technique of invert indexing [99]: we preprocess the images to i obtain a list for each cj , denoted Vj , that includes all the images Ii with αj > 0. Clearly, we have RQ = Vj (5.20) cj ∈CQ Algorithm 3 summarizes the procedure of efficient image retrieval. 5.4 Comparing to The-State-Of-The-Art Methods Bag-of-words model [146] is one the most successful and widely used methods for large scale image retrieval. Motivated by the success in text information retrieval [132], the bag-of-words model first quantizes image features to a vocabulary of “visual words”, and represents each image by the counts of visual words or a histogram. Standard text retrieval techniques can then be applied to identify the images that share similar visual content as the query image. The quantization is typically achieved by grouping all the keypoints into a specified number of clusters using a clustering algorithm. A number of studies have shown promising performance of the bag-of-words approach for image/object retrieval [119, 145, 113, 89, 143, 141, 75] To better understand the proposed method in (5.3), we compare it to the bag-of-words model. More specifically, we can view each random center ci as a different visual word and each α as a histogram vector. One computational advantage of the proposed method is that, while the bag-of-words model requires clustering all the keypoints into a large number of clusters, the proposed method only needs to randomly select a number of points from the date which is computationally efficient. Although recent progress on approximate nearest neighbor search [93, 31, 89, 144, 110] has made it feasible to group billions of keypoints into 98 Algorithm 3 Efficient image retrieval algorithm 1: INPUT: • A query image Q with keypoints q1 , . . . , qm • Inverted indices Vj , j = 1, . . . , N • Number of images to be retrieved, k 2: OUTPUT: • k images sorted descendingly by their similarity to the query image 3: Construct CQ of the query Q as CQ = cj : ∃qk ∈ Q s. t. |qk − cj |2 ≤ ρ 4: Construct candidate image set RQ of query Q as Vj RQ = cj ∈CQ 5: for every image Ii in RQ do 6: Compute the likelihood log p(Q|Ii ) using (5.17) 7: end for 8: Sort images in RQ based on their likelihood log p(Q|Ii ) 9: Return the first k images in RQ millions of clusters, the computational cost is still very high. We will see this clearly later in our empirical study. Second, in the bag-of-words model, we need to map each keypoint to the closest visual word(s). Since the computational cost of this procedure is linear in the number of keypoints, it is time consuming when the number of keypoints is very large; The proposed method, however, only needs to conduct a range search for every randomly selected centers which is in general significantly smaller than the number of key points, for example, one million centers v.s. on billion keypoints. This computational saving makes the proposed method more suitable for large image databases than the bag-of-words model. 99 Third, in the bag-of-words model, the radius of clusters (i.e., the maximum distance between the keypoints in a cluster and its center) could vary significantly from cluster to cluster. 
As a result, for cluster with large radius, two keypoints can be mapped to the same visual word even if they differ significantly in visual features, leading to an inconsistent criterion for keypoints quantization and potentially suboptimal performance in retrieval; On the contrary, the proposed method uses a range search for each center which ensures that only “similar” keypoints, which are within the distance of r to the center, will contribute to the corresponding element in the weight α of that center. Lastly, a keypoint is ignored by the proposed method if its distances to all the centers are larger than the threshold. The underlying rationale is that if a keypoint is far away from all centers, it is very likely to be an outlier and therefore should be ignored; While in the bag-of-words model, every keypoint must be mapped to a cluster center even if the keypoint is far away from all the cluster centers. We will see this advantage of the proposed method clearly demonstrated in the experiments. 5.5 5.5.1 Experiments Datasets To evaluate the proposed method for large-scale image search, we conduct experiments on three benchmark data sets: (1) tattoo image dataset (tattoo) with about 100, 000 images. (2) Oxford building dataset with 5, 000 images (Oxford5K) [119] and (3) Oxford building dataset plus one million Flickr images (Oxford5K+Flickr1M). Table 5.1 shows the details of the three datasets. Tattoo image dataset (tattoo) Tattoos have been commonly used in forensics and law enforcement agencies to assist in human identification. The tattoo image database used in our study consist of 101, 745 images, among which 61, 745 are tattoo images and the remaining 100 Data set # images # features Size of descriptors tattoo Oxford5K Oxford5K+Flickr1M 101,745 5,062 1,002,805 10,843,145 14,972,956 823,297,045 3.4GB 4.7GB 252.7GB Table 5.1. Statistics of the datasets 40, 000 images are randomly selected from the ESP dataset1. The purpose of adding images from the ESP dataset is to verify the capacity of the algorithms in distinguishing tattoo images from the other images. On average, about 100 Harris-Laplacian interesting points are detected for each image, and each key point is described by a 128-dimensional SIFT descriptor. Oxford building dataset (Oxford5K) The Oxford building dataset consists of 5, 062 images. Although it is a small data set, we use it for evaluating the proposed algorithm for image retrieval mainly because it is one of the widely used benchmark datasets. When detecting keypoints for each image, we use both Harris-Laplacian and Hessian-Affine interesting point detectors and each key point is described by a 128-dimensional SIFT descriptor. Since the algorithms perform similarly with keypoints detected by the two methods, we only report the results based on the Harris-Laplacian detector. On average, about 3, 000 keypoints are detected for each image. Oxford building dataset plus one million Flickr images (Oxford5K+Flickr1M) In this dataset, we first crawled Flickr.com to find about one million images of medium resolution and then added them into the Oxford building dataset. The same procedure is applied to extract keypoints from the crawled Flickr images. 1 http://www.gwap.com/gwap/gamesPreview/espgame/ 101 5.5.2 Implementation and Baselines For the implementation of the proposed method, the kernel function (5.5) is used. The centers for the kernel are randomly selected from the datasets. We employ the FLANN library 2 to perform the efficient range search. 
We select the bag-of-words model (BoW) [119] as our baseline which is a popular method for large scale image retrieval. It constructs the visual vocabulary by a hierarchical k-means algorithm that deploys FLANN library for efficient nearest neighbor search. We set the branching factor of hierarchical k-means to be 10 based on our experience. A forest of 8 randomized k-d trees is used in all experiments. We initialize cluster centers by randomly selecting a number of keypoints in the dataset. For the bag-of-words model, a state-of-the-art text retrieval method, Okapi BM25 [129] is used to compute the similarity between a query image and gallery images based on their bag-of-words representations. The inverted indices for both Okapi BM25 and the proposed retrieval model are stored in memory to make the retrieval procedure efficient. 5.5.3 Evaluation In order to exam the efficiency of the proposed method, we measure the time spent on preprocessing as well as retrieval stage of the retrieval systems. For the proposed method, the preprocessing stage consists of three steps, i.e., randomly selecting a number of centers, identifying keypionts within the predefined range of the selected centers and computing weights α of every image; For the baseline method, it consists of two steps, constructing visual vocabulary by clustering and mapping keypoints to visual words. For retrieval time, we report the averaged retrieving time of one query for the two methods. We emphasize that besides the retrieval time, the preprocessing time is also very important for an image retrieval system when it comes to a large collection of millions of images and the image collection is updated frequently. Take Flickr.com as an example, which is one of the most 2 http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN 102 popular online photo sharing web sites, there are about 900,000 new images uploaded every day [98]. These images must be preprocessed in time in order to be used for retrieval, which requires the preprocessing of an algorithm be very efficient. To evaluate the retrieval accuracy of the proposed method, we use two different metrics for the datasets. For tattoo image dataset, the retrieval accuracy is evaluated based on whether a system could retrieve images that share the tattoo symbol as in the query image. We adapt the evaluation metric termed Cumulative Matching Characteristics (CMC) score [105]in this study. For a given rank position k, its CMC score is computed as the percentage of queries whose matched images are found in the first k retrieved images. The CMC score is similar to recall, a common metric used in Information Retrieval. We use CMC score on the tattoo database because it is the most widely used evaluation metric in face recognition and forensic analysis. For the Oxford building dataset and the Oxford building plus Flickr dataset, we follow [119] and evaluate the retrieval performance by Average Precision (AP) which is computed as the area under the precision-recall curve. In particular, an average precision score is computed for each of the 5 queries from a landmark specified in the Oxford building dataset, and these results are averaged to obtain the mean Average Precision (mAP) for each landmark. 5.5.4 Results on The Tattoo Image Dataset We select 995 images as queries, and manually identify the gallery images that have the same tattoo symbols as the query images. We randomly select 100 images among the 995 query images and use them to train the optimal values for both λ and ρ. 
5.5.4 Results on the Tattoo Image Dataset

We select 995 images as queries and manually identify the gallery images that have the same tattoo symbols as the query images. We randomly select 100 images among the 995 query images and use them to tune the optimal values of both λ and ρ. The learned parameters λ and ρ are used in the subsequent experiments, and the remaining images are used for testing. We first show the retrieval results of both the proposed method and the bag-of-words method with the parameters tuned to achieve the best performance, and then show the sensitivity of the proposed algorithm to the choice of parameters. Figure 5.1 gives the retrieval performance of the two methods as CMC curves for the first 100 retrieved images. It is clear that the proposed algorithm significantly outperforms the bag-of-words based approach. The efficiency of the two methods is listed in Table 5.2.

Table 5.2. The preprocessing and retrieval time of the two methods on the tattoo dataset

           Preprocessing Time    Retrieval Time
Proposed   1.0h                  0.02s
BoW        8.8h                  0.009s

In terms of preprocessing time, the proposed method is about 8 times more efficient than the baseline method. In terms of retrieval time, the proposed method is significantly slower than the bag-of-words method. After carefully checking the implementation, we found that the difference in retrieval time arises because the logarithm function used by the proposed method (5.17) takes significantly longer to compute than the simple additions and multiplications used by the BM25 model. In a real retrieval system, however, this disadvantage can be overcome by some engineering tricks. For example, a logarithm lookup table can be built in advance, so that computing the logarithm of a value is reduced to checking the lookup table. In fact, this trick is commonly used in the implementation of automatic speech recognition systems.
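The logarithm lookup table mentioned above can be sketched as follows. The table size, the assumed range of the argument, and the nearest-bin approximation are illustrative choices for this example, not the settings of any particular system.

```python
import numpy as np

# Precompute log values on a fixed grid once; at retrieval time each call to
# log() is replaced by a cheap table lookup (nearest-bin approximation).
TABLE_SIZE = 1_000_000
X_MIN, X_MAX = 1e-6, 100.0            # assumed range of the scores being logged
_grid = np.linspace(X_MIN, X_MAX, TABLE_SIZE)
_log_table = np.log(_grid)
_step = _grid[1] - _grid[0]

def fast_log(x):
    """Approximate log(x) by looking up the nearest precomputed grid value."""
    idx = np.clip(np.rint((x - X_MIN) / _step).astype(int), 0, TABLE_SIZE - 1)
    return _log_table[idx]

x = np.random.uniform(0.5, 99.0, size=10)
print(np.max(np.abs(fast_log(x) - np.log(x))))   # small approximation error
```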
5.5.5 Parameter λ

Figure 5.2 shows the CMC curves of the proposed method with λ varied from 0.01n̄ to 100n̄, where n̄ is the average number of keypoints in an image. In this experiment, we set the number of random centers to one million and ρ to 0.6d̄, where d̄ is the average distance between any two keypoints, estimated from 1,000 randomly sampled keypoints from the collection. The results show that the performance of the proposed method is overall not sensitive to the choice of λ.

5.5.6 Parameter ρ

Figure 5.3 shows the CMC curves of the proposed method with ρ varied from 0.3d̄ to 1.1d̄. In this experiment, we again fixed the number of centers to one million. From the figure we observe that, with the exception of the smallest radius (i.e., ρ = 0.3d̄), the retrieval system achieves similar performance for different values of ρ. This indicates that the proposed algorithm is in general insensitive to the choice of ρ as long as ρ is large enough compared to the average inter-point distance between keypoints. This result can be understood by the fact that in a high-dimensional space most data points are far from each other; as a result, unless we change the radius ρ dramatically, we do not expect the set of points within distance ρ of the centers to change significantly.

5.5.7 Number of Random Centers

Figure 5.4 shows the performance of the proposed method with different numbers of randomly selected centers. The parameters λ and ρ are selected to maximize the performance for each given number of centers. We clearly observe a significant increase in retrieval accuracy when the number of centers is increased from 10K to 1M. This is not surprising, because a large number of random centers usually results in better discrimination between different SIFT keypoints and consequently leads to an improvement in the detection of similar images. A similar observation is also made when we run our retrieval system using the bag-of-words approach, which is consistent with the observation in [119].

5.5.8 Results on Oxford Building and Oxford Building + Flickr Datasets

Based on the observations from the tattoo image retrieval experiments and the similar observation in [119], we use one million cluster/random centers in this experiment. The parameters of the proposed method are set as follows, based on our experiments with tattoo images: we set ρ = 0.6d̄, where d̄ is the average inter-point distance estimated from 1,000 randomly sampled pairs of keypoints, and we set λ = 10n̄, where n̄ is the average number of keypoints in an image. The mAP results of the proposed method and the BoW method are listed in Table 5.3. In Figure 5.5, we show two examples of queries and the retrieved images. Note that for the Oxford5K+Flickr1M dataset, we follow the experimental protocol in [119] by only using the cluster/random centers obtained from the images in the Oxford5K dataset.

Table 5.3. mAP results of the proposed method and the BoW method for the Oxford5K dataset and the Oxford5K+Flickr1M dataset

           Oxford5K    Oxford5K+Flickr1M
Proposed   0.561       0.472
BoW        0.521       0.330

The results clearly show that the proposed method outperforms the bag-of-words model. As expected, the performance of the proposed method drops slightly when the one million Flickr images are added to the Oxford5K dataset. In contrast, the bag-of-words based method suffers a significant loss in performance when the one million images are included. We believe this difference in performance is due to the fact that the visual content of the one million Flickr images is significantly different from that of the Oxford5K images, i.e., the keypoints extracted from the Flickr images are generally far away from those in the Oxford5K images. As discussed before, the proposed method is robust to outlying keypoints, which makes it less sensitive to the inclusion of the Flickr1M dataset than the bag-of-words model. To verify this, we measure the distances between keypoints and centers for both the Oxford5K dataset and the Oxford5K+Flickr1M dataset. We find that for the Oxford building images, about 8% of the keypoints are separated from every center by a distance larger than ρ = 0.6d̄. This percentage increases to about 24% for the Flickr images, indicating that a large portion of the keypoints from the Flickr images are significantly different from the keypoints from the Oxford building images.
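The two measurements used in this analysis, the average inter-point distance d̄ estimated from randomly sampled keypoint pairs and the fraction of keypoints lying farther than ρ from every center, can be sketched as follows. The function names and the use of a k-d tree for the nearest-center lookup are illustrative assumptions for this example.

```python
import numpy as np
from scipy.spatial import cKDTree

def average_pair_distance(keypoints, n_pairs=1000, seed=0):
    """Estimate d-bar from randomly sampled pairs of keypoints."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(keypoints), n_pairs)
    j = rng.integers(0, len(keypoints), n_pairs)
    return float(np.mean(np.linalg.norm(keypoints[i] - keypoints[j], axis=1)))

def outlier_fraction(keypoints, centers, rho):
    """Fraction of keypoints whose nearest center is farther than rho away;
    such keypoints are ignored by the range-search quantization."""
    nearest_dist, _ = cKDTree(centers).query(keypoints, k=1)
    return float(np.mean(nearest_dist > rho))

# Toy usage with random descriptors.
keypoints = np.random.rand(5000, 128)
centers = keypoints[np.random.choice(len(keypoints), 200, replace=False)]
d_bar = average_pair_distance(keypoints)
print(outlier_fraction(keypoints, centers, rho=0.6 * d_bar))
```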
Table 5.4. Preprocessing and retrieval times of the two methods with one million cluster/random centers

                     Proposed                      BoW
                     preprocessing   retrieval     preprocessing   retrieval
Oxford5K             1.09h           0.8s          11.4h           0.08s
Oxford5K+Flickr1M    95h             1.3s          685h            0.9s

The preprocessing and retrieval times of the two algorithms are shown in Table 5.4. For preprocessing, we split the Oxford5K+Flickr1M dataset into 82 subsets, each containing about 10,000,000 keypoints. These 82 subsets are processed separately on multiple machines, and the results are aggregated later to obtain the final keypoint quantization. The preprocessing time for the Oxford5K+Flickr1M dataset is estimated from the average processing time of the 82 subsets. The table clearly shows that, for both datasets, the proposed method is significantly more efficient in preprocessing, e.g., about 10 times faster, than the bag-of-words method. Combining this with the preprocessing results from the previous experiment on the tattoo image dataset, it is clear that the proposed method is significantly more efficient than the BoW method in terms of preprocessing time, which makes it more applicable to large-scale image retrieval. For the retrieval time, we first compare the proposed algorithm to a naive retrieval method using (5.14), in which a linear scan of the whole database is required. On average, it takes more than 1,100 seconds to answer a query with the naive retrieval method, while the proposed efficient retrieval method only takes 0.8 seconds per query. When compared to the BoW method, the proposed method is still significantly slower, for the reason mentioned earlier.

5.6 Summary

In this chapter, we study the problem of dimensionality reduction when the data representation is extended from a single vector to a set of unordered vectors. By describing each datum with a function estimated from its associated vectors, the task of dimensionality reduction becomes finding an appropriate representation of each function that facilitates the computation of similarity between two data points and, furthermore, speeds up nearest neighbor search in a very large database. We study the problem in the context of large-scale image retrieval, with each image represented by its local features. There are two main challenges in designing efficient search algorithms: one is how to efficiently estimate the density function of each image from its local features; the other is how to efficiently identify a subset of candidate images using the derived function representation of the images. We have developed a statistical modeling approach and an efficient learning and retrieval algorithm for these two challenges. Our empirical results on three large-scale image retrieval tasks show that, compared to the popular bag-of-words method, the proposed method is significantly more effective and more efficient in indexing images. The only disadvantage of the proposed method is that it takes longer to retrieve images than the BoW model, due to the high cost of computing the logarithm function.

Figure 5.1. The CMC scores for tattoo image retrieval with one million cluster/random centers. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Figure 5.2. Results of the proposed method for tattoo image retrieval with different values of λ, based on 1 million random centers with ρ = 0.6d̄.

Figure 5.3. Results of the proposed method for tattoo image retrieval with different values of ρ, based on 1 million random centers.

Figure 5.4. Results of the proposed method for tattoo image retrieval with different numbers of centers.

Figure 5.5. Examples of two queries (the first column) and the first six retrieved images.
The first two rows give the retrieved results for the Oxford5K building database, and the next two rows give the retrieved results for the Oxford5K+Flickr1M database. Correctly retrieved results are outlined in green, while irrelevant images are outlined in red.

CHAPTER 6

Conclusion

People have been working on the problem of dimensionality reduction for more than a century, and numerous algorithms have been proposed and studied. Recent research on dimensionality reduction has been fueled by the fast-paced development of statistical analysis and machine learning, as well as by advances in convex optimization and the constant growth in sheer processing power. This dissertation continues along this line of research and offers several novel contributions to the field:

• Mixed label propagation for semi-supervised learning. Unlike the existing graph-based approaches, which are only applicable to positive similarities between examples, our framework is able to exploit both positive and negative similarities simultaneously. The challenge of incorporating negative similarities comes from the fact that negative similarity is non-transitive and therefore cannot be propagated directly. In addition, the objective functions employed by most graph-based approaches can be unbounded below when the similarity is negative, so no optimal solution can be found by minimizing the objective function. The key idea behind the proposed framework is to explore an LDA-like criterion, which minimizes the inconsistency between the class assignments and the positive similarities of examples and maximizes the consistency between the class assignments and the negative similarities of examples. We present an efficient learning algorithm for mixed label propagation that is based on an alternating optimization strategy and semi-definite programming. Our empirical study on collaborative filtering shows that the proposed algorithm is effective in exploiting negative similarities and outperforms both the label propagation approach and state-of-the-art approaches for collaborative filtering.

• A semi-definite formulation for second-order PCA-style algorithms. Most existing second-order PCA-style algorithms require solving non-convex optimization problems, often leading to suboptimal performance. The key challenge in solving the optimization problem arises from the fact that the left and right projection matrices must be identified simultaneously and are dependent on each other. In our method, we avoid this difficulty by first factorizing each matrix into left and right parts, and then extracting the left and right projection matrices from the sums of the left and right parts of all matrices. We begin with the factorization of a single matrix, which leads to the trace norm of a matrix. We then generalize it to the case of multiple matrices by extracting the largest common singular vectors of multiple matrices, and finally present a semi-definite programming formulation. The relationship between the proposed SDP formulation and that of GLRAM is also studied. Finally, we present an approximate algorithm to efficiently solve the related SDP problem and evaluate the proposed algorithm on image classification. Our experimental study shows that the performance of the proposed algorithm is better than or comparable to the state-of-the-art algorithms for SOPCA.
• Dimensionality reduction for functional data representations and its application to large-scale image retrieval by local features. Since the number of local features used to represent an image varies from one image to another, it is in general difficult to represent all images in a common vector space. One way to deal with the bag-of-features representation is to associate with each image a function that is estimated from its local features. The challenge of using functions to represent data is the high computational cost of estimating the similarity between functions. In this dissertation, we study the problem of dimensionality reduction for the functional representation of data. More specifically, we seek a representation of the functions that leads to an efficient search for similar images in very large image databases. We have developed a statistical modeling approach and an efficient retrieval algorithm for these challenges, which can efficiently estimate the function of each image from its local features and further use the derived functions to efficiently identify a subset of candidate images that are visually similar to a given query. Our empirical studies on large-scale image retrieval tasks show that the proposed method is robust to the choice of parameters and is significantly more effective and more efficient in indexing images than the popular bag-of-words method.

Finally, beyond the studies conducted in this dissertation, there are a number of challenging problems that deserve further investigation. For example, both mixed label propagation and the proposed convex formulation of SOPCA require solving SDP problems, which is computationally expensive. We may exploit special structures of the involved SDP problems so that problem-specific SDP solvers can be built to solve these problems more efficiently. For functional data dimensionality reduction, besides the simple basis functions used in the proposed methods, a more intelligent choice of basis functions might further improve the overall performance.

APPENDICES

APPENDIX A

Notation

Some specific sets
R                   Real numbers
R^n                 Real n-vectors
R^{m×n}             Real m × n matrices
R_+, R_{++}         Nonnegative, positive real numbers
R^n_+, R^n_{++}     Nonnegative, positive real n-vectors
S^n                 Symmetric n × n matrices
S^n_+, S^n_{++}     Symmetric positive semi-definite, positive definite n × n matrices

Vectors and matrices
1                   Vector with all components one
I                   Identity matrix
X^T                 Transpose of matrix X
tr(X)               Trace of matrix X
X†                  Moore-Penrose or pseudo-inverse of matrix X
rank(X)             Rank of matrix X

Norms and distances
‖·‖                 A norm
‖x‖_2               Euclidean (or ℓ2-) norm of vector x
‖X‖_F               Frobenius norm of matrix X
‖X‖_tr              Trace norm of matrix X

Generalized inequalities
x ⪯ y               Componentwise inequality between vectors x and y
x ≺ y               Strict componentwise inequality between vectors x and y
X ⪯ Y               Matrix inequality between matrices X and Y
X ≺ Y               Strict matrix inequality between matrices X and Y

APPENDIX B

Schur Complement

Consider a matrix X ∈ S^n partitioned as

X = [ A    B ]
    [ B^T  C ]

where A ∈ S^k. If det A ≠ 0, then the matrix S = C − B^T A^{-1} B is called the Schur complement of A in X.
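As a quick numerical illustration of Appendix B (not part of the dissertation's experiments), the sketch below forms a random symmetric block matrix with a positive definite block A and checks the standard fact that, for A ≻ 0, the full matrix X is positive semi-definite exactly when its Schur complement S is.

```python
import numpy as np

def schur_complement(A, B, C):
    """S = C - B^T A^{-1} B, the Schur complement of A in X = [[A, B], [B^T, C]]."""
    return C - B.T @ np.linalg.solve(A, B)

rng = np.random.default_rng(0)
k, m = 3, 2
A = rng.standard_normal((k, k)); A = A @ A.T + k * np.eye(k)   # positive definite block
B = rng.standard_normal((k, m))
C = rng.standard_normal((m, m)); C = C @ C.T

X = np.block([[A, B], [B.T, C]])
S = schur_complement(A, B, C)

# For A > 0, X is PSD if and only if S is PSD, so the two flags below always agree.
print(np.all(np.linalg.eigvalsh(X) >= -1e-9),
      np.all(np.linalg.eigvalsh(S) >= -1e-9))
```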
APPENDIX C

Semi-Definite Programming

Semi-definite programming is linear programming over the cone of semi-definite matrices. In comparison with standard linear programming, the vector variable x ∈ R^n_+ is replaced by a matrix variable X ∈ S^n_+. Given m matrices A_1, ..., A_m, where each matrix A_i ∈ S^n, we define

𝒜X = ( tr(A_1 X), ..., tr(A_m X) )^T.

With this notation, the standard formulation of a semi-definite programming problem is

min_{X ∈ S^n_+}  tr(CX)
s.t.  𝒜X = b,  X ⪰ 0.        (C.1)

Similarly, if we define

𝒜^T y = Σ_{i=1}^m y_i A_i,

the corresponding dual of the semi-definite programming problem (C.1) is given as

max_{y ∈ R^m, Z ∈ S^n_+}  b^T y
s.t.  𝒜^T y + Z = C,  Z ⪰ 0.        (C.2)

In (C.1) and (C.2), the inequalities X ⪰ 0 and Z ⪰ 0 are often called linear matrix inequalities (LMI).

APPENDIX D

Trace Norm and Ky Fan k-Norm

The trace norm of a matrix X is defined as the sum of all singular values of X. It can also be defined through an optimization problem over all possible factorizations of X, as shown in the following definition:

Definition 1. The trace norm ‖X‖_tr of a matrix is given by any of the three quantities:
1. min_{U,V} { ‖U‖_F ‖V‖_F : X = UV^T }
2. min_{U,V} { (1/2)(‖U‖_F^2 + ‖V‖_F^2) : X = UV^T }
3. the sum of the singular values of X.

Furthermore, if X = UΣV^T is the singular value decomposition of X, then the matrices U√Σ and V√Σ minimize the first two quantities.

A very important property of the trace norm is that it is the convex envelope of the function φ(X) = rank(X) on C = {X ∈ R^{m×n} | ‖X‖ ≤ 1} [42, 136]. Because of this property, the trace norm is often used as a convex surrogate for the rank function in rank minimization [42]. Minimizing the trace norm can be converted to an SDP problem by the following lemma [42]:

Lemma 2. For X ∈ R^{p×q} and t ∈ R, we have ‖X‖_tr ≤ t if and only if there exist matrices P ∈ R^{p×p} and Q ∈ R^{q×q} such that

[ P    X ]
[ X^T  Q ]  ⪰ 0,   tr(P) + tr(Q) ≤ 2t.

Using the above lemma, minimizing the trace norm of X is equivalent to the following problem:

min_{P ∈ R^{p×p}, Q ∈ R^{q×q}}  tr(P) + tr(Q)
s.t.  [ P    X ]
      [ X^T  Q ]  ⪰ 0.        (D.1)

It is straightforward to show that an optimal solution to (D.1) is P = UΣU^T and Q = VΣV^T, where U and V are obtained from the singular value decomposition of X, i.e., X = UΣV^T. The Ky Fan k-norm S_k(X) generalizes the trace norm by computing the sum of the k largest singular values of matrix X. When X is a symmetric positive semi-definite matrix, S_k(X) is the sum of the k largest eigenvalues of X.

BIBLIOGRAPHY

[1] Y. Altun, D. A. McAllester, and M. Belkin. Margin semi-supervised learning for structured variables. In NIPS, 2005.
[2] Arindam Banerjee, Srujana Merugu, Inderjit Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, (6):1705–1749, 2005.
[3] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell., 19(7):711–720, 1997.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Neural Information Processing Systems 14 (NIPS 2001), 2001.
[5] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[6] M. Belkin and P. Niyogi. Semi-supervised learning on manifolds. Machine Learning, 56:209–239, 2004.
[7] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: a geometric framework for learning from examples. Technical Report TR-2004-06, University of Chicago Computer Science, 2006.
[8] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, 1961.
[9] Y. Bengio, J-F. Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering.
In NIPS, 2004. [10] M. Bernstein, V. de Silva, J. Langford, , and J. Tenenbaum. Graph approximations to geodesics on embedded manifolds. Technical report, Department of Psychology, Stanford University, 2000. [11] M. Berry, M. Browne, A. Langville, and etc al. Algorithms and applications for approximate nonnegative matrix factorization. Submitted to Computational Statistics and Data Analysis, 2006. [12] S. Yu J. Bi and J. Ye. Probabilistic interpretations and extensions for a family of 2d pca-style algorithms. In KDD’2008 Workshop on Data Mining using Matrices and Tensors, 2008. 125 [13] Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999. [14] J. Blitzer, K. Q. Weinberger, L. K. Saul, and F. C. N. Pereira. Hierarchical distributed representations for statistical language modeling. In NIPS, 2006. [15] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [16] Matthew Brand and Matthew Brand. Incremental singular value decomposition of uncertain data with missing values. In ECCV, pages 707–720, 2002. [17] John S. Breese, David Heckerman, and Carl Kadie. Empirical analysis of predictive algorithms for collaborative filtering. Technical Report MSR-TR-98-12, Microsoft Research, 1998. [18] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998. [19] Christopher J.C. Burges. Geometric methods for feature extraction and dimensional reduction: A guided tour. Technical Report MSR-TR-2004-55, Microsoft Research, 2004. [20] Deng Cai, Xiaofei He, and Jiawei Han. Subspace learning based on tensor analysis. Technical Report UIUCCDCS-R-2005-2572, Department of Computer Science, UIUC, 2005. ´ [21] Miguel A. Carreira-Perpi˜´. A review of dimensional reduction techniques. Technical na Report CS-96-09, Department of Computer Science, University of Sheffield, 1997. [22] Chad Carson, Serge Belongie, Hayit Greenspan, and Jitendra Malik. Region-based image querying. In Proceedings of IEEE Workshop on Content-Based Access of Image and Video Libraries, pages 42–49, 1997. [23] L. F. Chen, H. Y. M. Liao, J. C. Lin, M. D. Kao, and G. J. Yu. A new lda-based face recognition system which can solve the small sample size problem. Pattern Recognit., 33(10):1713–1726, 2000. [24] M. Chu, F. Diele, and S. Plemmons, R.and Ragni. Optimality, computation, and interpretations of nonnegative matrix factorization, 2004. [25] W. Chu and Z. Ghahramani. Preference learning with gaussian processes. In ICML’05, 2005. [26] A. Cichocki, S. Amari, K. Siwek, T. Tanaka, and et al. Icalab toolboxes. http: //www.bsp.brain.riken.jp/ICALAB. 126 [27] A. Cichocki, R. Zdunek, and S. Amari. Csiszars divergences for non-negative matrix factorization: Family of new algorithms. In Proc. 6th International Conference on Independent Component Analysis and Blind Signal Separation, 2006. [28] T. Cox and M. Cox. Multidimensional Scaling. Chapman And Hall, London, 1994. [29] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004. [30] M. Cuturi and J-P Vert. Semigroup kernels on finite sets. In Advances in Neural Information Processing Systems, 2005. [31] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. 
In Proceedings of the twentieth annual symposium on Computational geometry, 2004. [32] D. de Ridder, M. Loog, and M.J.T. Reinders. Local fisher embedding. In 17th International Conference on Pattern Recognition (ICPR2004), 2004. [33] V. de Silva and J. B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Neural Information Processing Systems (NIPS), pages 705–712, 2002. [34] I. Dhillon and S. Sra. Generalized nonnegative matrix approximations with bregman divergences. In Proceeding of the Neural Information Processing Systems (NIPS), 2005. [35] E.W. Dijkstra. A note on two problems in connexion with graphs. Numerische mathematik, 1:269–271, 1959. [36] Chris Ding and Xiaofeng He. K-means clustering via principal component analysis. In ICML ’04: Proceedings of the 23rd international conference on Machine learning, 2004. [37] Chris Ding, Tao Li, and Michael I. Jordan. Convex and semi-nonnegative matrix factorizations. Technical Report 60428, Lawrence Berkeley National Laboratory, 2006. [38] Chris Ding and Jieping Ye. Two-dimensional singular value decomposition (2dsvd) for 2d maps and images. In SDM, 2005. [39] D. L. Donoho and C. Grimes. When does isomap recover natural parameterization of families of articulated images? Technical Report 2002-27, Department of Statistics, Stanford University, 2002. [40] John Eakins and Margaret Graham. University of northumbria at newcastle. Technical Report 39, IBM Research, 1999. 127 [41] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advanced In Computational Mathematics, 13:1C50, 2000. [42] M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings American Control Conference, 2001. [43] L. Finesso and P. Spreij. Approximate nonnegative matrix factorization via alternating minimization. In Proc. 16th International Symposium on Mathematical Theory of Networks and Systems, 2004. [44] R. A. Fisher. The use of multiple measurements in taxonomic problems. Ann. Eugenics, 7:179–188, 1936. [45] Imola K. Fodor. A surveyof dimensional reduction techniques, 2002. [46] Y. Freund, R. D. Lyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In ICML’98, 1998. [47] Zoubin Ghahramani and Geoffrey E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, 21 1996. [48] Andrew Goldberg, Xiaojin Zhu, and Stephen Wright. Dissimilarity in graph-based semi-supervised classification. In AISTATS, 2007. [49] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996. [50] E. Gonzalez and Y. Zhang. Accelerating the lee-seung algorithm for nonnegative matrix factorization. Technical Report TR-05-02, Rice University, 2005. [51] R. L. Gorsuch. Factor analysis (2nd. ed.). Hillsdale, N.J.: Erlbaum, 1983. [52] Kristen Grauman and Trevor Darrell. The pyramid match kernel: Efficient learning with sets of features. J. Mach. Learn. Res., 8:725–760, 2007. [53] Ming Gu, Stanley, C. Eisenstat, and In O. A stable and fast algorithm for updating the singular value decomposition. Technical report, Department of Computer Science, Yale University, 1994. [54] D. Guillamet, M. Bressan, and J. Vitria. A weighted non-negative matrix factorization for local representations. In Proc. 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition., 2001. [55] Y. Guo, T. Hastie, and R. Tibshirani. 
Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8(1):86–100, 2007. 128 [56] J. Ham, D. D. Lee, S. Mika, and B. Scholk¨pf. A kernel view of the dimensionality o reduction of manifolds. Technical Report TR-110, Max Planck Institute for Biological Cybernetics, 2003. [57] T. J. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84:502–516, 1989. [58] X. He and P. Niyogi. Locality preserving projections. In Neural Information Processing Systems 16 (NIPS’2003), 2003. [59] M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In ICML ’05, 2005. [60] K Hirata and T Kato. Query by visual example - content-based image retrieval. In Third International Conference on Extending Database Technology, 1992. [61] T. Hofmann. Gaussian latent semantic models for collaborative filtering. In SIGIR’03, 2003. [62] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89–115, 2004. [63] P. Hoyer. Non-negative matrix factorization with sparseness constraints. J. of Machine Learning Research, 5:1457–1469, 2004. [64] J. J. Hull. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell., 16(5):550–554, 1994. [65] Aapo Hyv¨rinen. Independent component analysis. Neural Computing Surveys, 2, a 2001. [66] K. Inoue and K. Urahama. Equivalence of non-iterative algorithms for simultaneous low rank approximations of matrices. In CVPR ’06, pages 154–159, 2006. [67] J.E. Jackson. A Users Guide to Principal Components. New York: John Wiley and Sons, 1991. [68] A K Jain, J Mao, and K Mohiuddin. Artificial neural networks: A tutorial. Computer, (29), 1996. [69] Shuiwang Ji and Jieping Ye. Generalized linear discriminant analysis: A unified framework and efficient model selection. Neural Networks, IEEE Transactions on, 19:1768– 1782, 2008. [70] T. Joachims. Transductive learning via spectral graph partitioning. In ICML’03, 2003. 129 [71] Richard A. Johnson and Dean W. Wichern. Applied Multivariate Statistical Analysis(5th Edition). Pearson Education, 2001. [72] I.T. Jolliffe. Principal Component Analysis, Second Edition. Springer-Verlag, 2002. [73] M. C. Jones and R. Sibson. What is projection pursuit? Journal of the Royal Statistical Society, 19:1–18, 1987. [74] Tuomas K¨rn¨. Functional data dimensionality reduction for machine learning. Masa a ter’s thesis, Department of Electrical Communications Engineering, Helsinki University of Techonology, 2007. [75] Yan Ke, Rahul Sukthankar, and Larry Huston. Efficient near-duplicate detection and sub-image retrieval. In Proceeding of ACM Multimedia, pages 869–876, 2004. [76] J.J. Kivinen, E.B. Sudderth, and M.I. Jordan. Learning multiscale representations of natural scenes using dirichlet processes. In ICCV, 2007. [77] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review. to appear (accepted June 2008). [78] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In In Proceedings of the Twenty First International Conference on Machine Learning(ICML02), 2002. [79] Risi Imre Kondor and Tony Jebara. A kernel between sets of vectors. In ICML, 2003. [80] O Kouropteva, O Okun, A Hadid, M Soriano, S Marcos, and M Pietik¨inen. Beyond a locally linear embedding algorithm. Technical Report MVG-01-2002, University of Oulu, Machine Vision Group, Information Processing Laboratory, 2002. [81] O Kouropteva, O Okun, and M Pietik¨inen. 
Selection of the optimal parameter value a for the locally linear embedding algorithm. In Proc. of the 1st International Conference on Fuzzy Systems and Knowledge Discovery (FSKD02), 2002. [82] B. Kulis, S. Basu, I. Dhillon, and R. J. Mooney. Semi-supervised graph clustering: A kernel approach. In ICML’05, 2005. [83] John Lafferty and Guy Lebanon. Information diffusion kernels. In Advances in Neural Information Processing Systems 15, 2002. [84] M. H. Law and A. K. Jain. Incremental nonlinear dimensionality reduction by manifold learning. IEEE Transactions of Pattern Analysis and Machine Intelligence, 28(3):377– 391, 2006. [85] M. H. Law and A. K. Jain N. Zhang. Nonlinear manifold learning for data stream. In Proceedings of SIAM Data Mining, pages 33–44, 2004. 130 [86] S. Lazebnik and et al. A sparse texture representation using affine-invariant regions. In CVPR, 2003. [87] D. Lee and H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999. [88] D. Lee and H. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001. [89] Vincent Lepetit, Pascal Lagger, and Pascal Fua. Randomized trees for real-time keypoint recognition. In CVPR, 2005. [90] C.-J. Lin. On the convergence of multiplicative update algorithms for non-negative matrix factorization. Technical report, Department of Computer Science, National Taiwan University., 2005. [91] C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Technical Report ISSTECH-95-013, Department of Computer Science, National Taiwan University., 2005. [92] F Liu and R W Picard. Periodicity, directionality and randomness: Wold features for image modelling and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996. [93] T. Liu, A. Moore, A. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In NIPS, 2004. [94] D. Lowe. Distinctive image features form scale-invariant keypoints. In International Journal of Computer Vision, 2004. [95] Haiping Lu, K.N Plataniotis, and A.N. Venetsanopoulos. Mpca: Multilinear principal component analysis of tensor objects. Neural Networks, IEEE Transactions on, 19:18– 39, 2008. [96] KAPLAN L. M., MURENZI R., and NAMUDURI K. R. Fast texture database retrieval using extended fractal features. In Storage and Retrieval for Image and Video Databases VI, pages 162–173, 1998. [97] W Y Ma and Manjunath B S. A texture thesaurus for browsing large aerial photographs. Journal of the American Society for Information Science, 1998. [98] Pavan K. Mallapragada, Rong Jin, and A. K. Jain. Online visual vocabulary pruning using pairwise constraints. In CVPR, 2010. [99] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schtze. Introduction to Information Retrieval,. Cambridge University Press, 2008. 131 [100] Benjamin Marlin. Modeling user rating profiles for collaborative filtering. In n Advances in Neural Information Processing Systems, 2004. [101] A. M. Martinez and M. Zhu. where are linear feature extraction methods applicable? IEEE Trans. Pattern Anal. Mach. Intell, 27(12):1934–1944, 2005. [102] A.M. Martinez and R. Benavente. The ar face database. Technical Report 24, Computer Vision Center at the U.A.B, 1998. [103] R Mehrotra and J E Gary. Similar-shape retrieval in shape data management. IEEE Computer, 1995. [104] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. In International Journal of Computer Vision, 2004. [105] H. Moon and P. J. Phillips. 
Computational and performance aspects of pca-based face recognition algorithms. Perception, 30:303–321, 2001. [106] P. Moreno, P. Ho, and N. Vasconcelos. A kullback-leibler divergence based kernel for svm classification in multimedia applications. In Neural Information Processing Systems, 2003. [107] Pedro J. Moreno and et al. A kullback-leibler divergence based kernel for svm classification in multimedia applications. In NIPS, 2003. [108] M.Stricker and D.Ballard. Similarity of color images. 1995. [109] M.Swain and D. Ballard. Color indexing. Journal of Computer Vision, 1991. [110] Marius Muja and David G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In International Conference on Computer Vision Theory and Applications, 2009. [111] John Von Neumann. Zur theorie der gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928. [112] Carlton W. Niblack, Ron Barber, Will Equitz, Myron D. Flickner, Eduardo H. Glasman, Dragutin Petkovic, Peter Yanker, Christos Faloutsos, and Gabriel Taubin. The qbic project: querying images by color, texture and shape. Technical Report RJ-9203, IBM Research, 1993. [113] David Nister and Henrik Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006. [114] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994. 132 [115] E. Parzen. On estimation of a probability density function and mode. Ann. Math. Stat, 1962. [116] V. Pauca, F. Shahnaz, and M. Berry. Text mining using non-negative matrix factorizations. In Proceedings of the Fourth SIAM International Conference on Data Mining, 2004. [117] D. M. Pennock, E. Horvitz, S. Lawrence, and C. L. Giles. Collaborative filtering by personality diagnosis: A hybrid memory- and model-based approach. In Proc. UAI, 2000. [118] K. Pettis, T. Bailey, A. Jain, and R. Dubes. An intrinsic dimensionality estimator from near-neighbor information. IEEE Transactions on Pattern Analysis and Machine Intelligence(PAMI), 1(1):25–37, 1979. [119] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007. [120] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008. [121] M. Polito and P. Perona. Grouping and dimensionality reduction by locally linear embedding. In Neural Information Processing Systems(NIPS), 2001. [122] J M Ponte and W B Croft. A language modeling approach to information retrieval. Research and Development in Information Retrieval, 1998. [123] J. O. Ramsay and B.W. Silverman. Functional data analysis 2nd ed. Springer, 2005. [124] J. Rennie and N. Srebro. Fast maximun margin matrix factorization for collaborative prediction. In ICML’05, 2005. [125] Jason Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In 22nd International Conference on Machine Learning (ICML), 2005. [126] P. Resnick, N. Iacovou, M.Sushak, P. Bergstrom, and J.Riedl. Grouplense: An open architecture for collaborative filtering of netnews. In Proc. Computer Supported Collaborative Work Conference, 1994. [127] D. Ridder and Duin R.P.W. Locally linear embedding for classification. Technical Report PH-2002-01, Pattern Recognition Group, Dept. of Imaging Science And Technology, Delft University of Technology, 2002. 
133 [128] D Ritter, O Kouropteva, O Okun, M Pietik¨inen, and RPW Duin. Supervised loa cally linear embedding. In Artificial Neural Networks and Neural Information Processing(ICANN/ICONIP), 2003. [129] Stephen E. Robertson, Steve Walker, and Micheline Hancock-Beaulieu. Okapi at trec-7. In Proceedings of the Seventh Text REtrieval Conference, 1998. [130] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323C–2326, 2000. [131] Sam Roweis. EM algorithms for PCA and SPCA. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998. [132] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., 1986. [133] F. Samaria and A. Harter. Parameterisation of a stochastic model for human face identification. In 2nd IEEE Workshop on Applications of Computer Vision, 1994. [134] L. K. Saul and S. T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003. [135] L. K. Saul, K. Q. Weinberger, J. H. Ham, F. Sha, and D. D. Lee. Spectral methods for dimensionality reduction. In O. Chapelle B. Schoelkopf and A. Zien, editors, Semisupervised Learning. MIT Press, 2006. [136] B. Sch¨lkopf, A. Smola, and K. R. M¨ller. Nonlinear component analysis as a kernel o u eigenvalue problem. Technical Report TR-44, Max Planck Institute for Biological Cybernetics, 1996. [137] B. Sch¨lkopf, A. Smola, and K. R. M¨ller. Nonlinear component analysis as a kernel o u eigenvalue problem. Neural Computation, 10:1299–1319, 1998. [138] Bernhard Scholk¨pf and Alexander J. Smola. Learning with Kernels: Support Vector o Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. [139] N. SEBE and LEW. Color based retrieval. Pattern Recognition Letters, 2001. [140] F. Shahnaz, M. Berry, V. Pauca, and R. Plemmons. Document clustering using nonnegative matrix factorization. In Information Processing And Management, 2006. [141] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, 2006. 134 [142] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. [143] C. Silpa-Anan and R. Hartley. Localization using an imagemap. In Proceedings of the 2004 Australasian Conference on Robotics & Automation, 2004. [144] Chanop Silpa-Anan and Richard Hartley. Optimised kd-trees for fast image descriptor matching. In CVPR, 2008. [145] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003. [146] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision, 2003. [147] N. Srebro, J.D. Rennie, and T.S. Jaakkola. Maximun-margin matrix factorization. In NIPS, 2005. [148] Nathan Srebro. Learning with Matrix Factorization. PhD thesis, Massachusetts Institute of Technology, 2004. [149] Nathan Srebro, Jason D. M. Rennie, and Tommi S. Jaakola. Maximum-margin matrix factorization. In Proceeding of the Neural Information Processing Systems (NIPS), 2005. [150] J. Sun, S. Boyd, L. Xiao, and P. Diaconis. The fastest mixing markov process on a graph and a connection to a maximum variance unfolding problem. SIAM Review, submitted, 2006. 
[151] Hideyuki Tamura, Shunji Mori, and Takashi Yamawaki. Textural features corresponding to visual perception. EEE Transactions on Systems, Man and Cybernetics, 1978. [152] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000. [153] M. E. Topping and C. M. Bishop. Probablistic principal component analysis. Technical Report NCRG/97/010, Neural Computing Research Group, Dept of Computer Science And Applied Mathematics, Aston University, 1997. [154] M. E. Topping and C. M. Bishop. Mixtures of probablistic principal component analyzers. Neural Computation, 11(2):443–482, 1999. [155] M. E. Topping and C. M. Bishop. Probablistic principal component analysis. Journal of the royal Statistical Society, 61(3), 1999. [156] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, 1996. 135 [157] Y. Wang, Y. Jiar, C. Hu, and M. Turk. Fisher non-negative matrix factorization for learning local features. In Asian Conference on Computer Vision., 2004. [158] G A Watson. On matrix approximation problems with ky fan k norms. Numerical Algorithms, (5):5–263, 1993. [159] K. Q. Weinberger, B. D. Packer, and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, January 2005. [160] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), volume 2, pages 988–995, Washington D.C., 2004. [161] K. Q. Weinberger and L. K. Saul. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In AAAI, Boston, August 2006. [162] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77–90, 2006. [163] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twenty First International Conference on Machine Learning (ICML-04), pages 839–846, Banff, Canada, 2004. [164] Kilian Weinberger, Fei Sha, Qihui Zhu, and Lawrence Saul. Graph laplacian methods for large-scale semidefinite programming, with an application to sensor localization. In B. Sch¨lkopf, J. Platt, and Thomas Hofmann, editors, Advances in Neural Information o Processing Systems 19, Cambridge, MA, 2007. MIT Press. [165] Christopher K. I. Williams. On a connection between kernel PCA and metric multidimensional scaling. In NIPS, pages 675–681, 2000. [166] J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In ICCV, 2005. [167] Lin Xiao, Jun Sun, and Stephen Boyd. A duality view of spectral methods for dimensionality reduction. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 1041–1048, New York, NY, USA, 2006. ACM Press. [168] J. Yang, D. Zhang, A.F. Frangi, and JY. Yang. Two-dimensional pca: a new approach to appearance-based face representation and recognition. PAMI, 26(1):131–137, 2004. [169] Liu Yang. Distance metric learning: A comprehensive survey, 2007. 136 [170] Xin Yang, Haoying Fu, Hongyuan Zha, and Jesse Barlow. Semi-supervised nonlinear dimensionality reduction. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, 2006. [171] J. Ye. 
Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. J. Mach. Learn. Res., 6:483–502, 2005. [172] J. Ye. Generalized low rank approximations of matrices. Machine Learning, 61:167– 191, 2005. [173] W. Zhao, R. Chellappa, and P. Phillips. Subspace linear discriminant analysis for face recognition. Technical Report CAR-TR-914, Cntr. Autom. Res., Univ. Maryland, 1999. [174] D. Zhou, B. Sch¨lkopf, and T. Hofmann. Semisupervised learning on directed graphs. o In NIPS, 2005. [175] D.Y. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and Bernhard Sch¨lkopf. Learning o with local and global consistency. In NIPS, 2004. [176] X. Zhu, Z. Ghahramani, and J.D. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML’03, 2003. [177] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report TR 1530, Computer Sciences, University of Wisconsin Madison, 2008. 137