This is to certify that the dissertation entitled

Clustering, Dimensionality Reduction, and Side Information

presented by Hiu Chung Law has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science and Engineering.

Major Professor's Signature

Date

MSU is an Affirmative Action/Equal Opportunity Institution

CLUSTERING, DIMENSIONALITY REDUCTION, AND SIDE INFORMATION

By

Hiu Chung Law

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science & Engineering

2006

ABSTRACT

CLUSTERING, DIMENSIONALITY REDUCTION, AND SIDE INFORMATION

By Hiu Chung Law

Recent advances in sensing and storage technology have created many high-volume, high-dimensional data sets in pattern recognition, machine learning, and data mining. Unsupervised learning can provide generic tools for analyzing and summarizing these data sets when there is no well-defined notion of classes. The purpose of this thesis is to study some of the open problems in two main areas of unsupervised learning, namely clustering and (unsupervised) dimensionality reduction. Instance-level constraints on objects, an example of side-information, are also considered to improve the clustering results.

Our first contribution is a modification to the isometric feature mapping (ISOMAP) algorithm when the input data, instead of being all available simultaneously, arrive sequentially from a data stream. ISOMAP is representative of a class of nonlinear dimensionality reduction algorithms that are based on the notion of a manifold. Both the standard ISOMAP and the landmark version of ISOMAP are considered. Experimental results on synthetic data as well as real world images demonstrate that the modified algorithm can maintain an accurate low-dimensional representation of the data in an efficient manner.

We study the problem of feature selection in model-based clustering when the number of clusters is unknown. We propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm for its estimation. By using the minimum message length (MML) model selection criterion, the saliency of irrelevant features is driven towards zero, which corresponds to performing feature selection. The use of MML can also determine the number of clusters automatically by pruning away the weak clusters. The proposed algorithm is validated on both synthetic data and data sets from the UCI machine learning repository.

We have also developed a new algorithm for incorporating instance-level constraints in model-based clustering. Its main idea is that we require the cluster label of an object to be determined only by its feature vector and the cluster parameters. In particular, the constraints should not have any direct influence.
This consideration leads to a new objective function that considers both the fit to the data and the satisfaction of the constraints simultaneously. The line-search Newton algorithm is used to find the cluster parameter vector that optimizes this objective function. This approach is extended to simultaneously perform feature extraction and clustering under constraints. Comparison of the proposed algorithm with competitive algorithms over eighteen data sets from different domains, including text categorization, low-level image segmentation, appearance-based vision, and benchmark data sets from the UCI machine learning repository, shows the superiority of the proposed approach.

© Copyright 2006 by Hiu Chung Law
All Rights Reserved

To My Lord Jesus Christ

ACKNOWLEDGMENTS

For the LORD gives wisdom, and from his mouth come knowledge and understanding. Proverbs 2:6

To our God and Father be glory for ever and ever. Amen. Philippians 4:20

Time flies and I shall leave Michigan State soon, a place I shall cherish long after my graduation. There are so many people who have been so kind and so helpful to me during all these years; all of you have made a mark in my life!

First and foremost, I want to express my greatest gratitude to my thesis supervisor Dr. Anil Jain. He is such a wonderful advisor, mentor, and motivator. Under his guidance, I have learned a lot in different aspects of conducting research, including finding a good research problem, writing a convincing technical paper, and prioritizing different research tasks, to name a few. Of course I shall never forget all the good times when we "Prippies" partied in his awesome house.

I am also very thankful to the rest of my thesis guidance committee, including Dr. John Wong, Dr. Bill Punch, and Dr. Sarat Dass. Their advice and suggestions have been very helpful.

I am also grateful to several other researchers who have mentored me during my various stages as a research student. Dr. Mario Figueiredo from Instituto Superior Técnico in Lisbon is a very nice person and his research insight has been an eye-opener. His intelligent use of the EM algorithm is truly remarkable. I feel fortunate that I have had the chance to work under the supervision of Dr. Paul Viola at Microsoft Research. Interaction with him has not only led to a much deeper appreciation of boosting, but also has sharpened my thoughts on how to formalize a research problem. It is also a great pleasure that I could work under Dr. Tin Kam Ho at Bell Labs. Discussions with her have led to a new perspective towards different tools in pattern recognition and machine learning. The emphasis of Dr. Joachim Buhmann at ETH Zurich on correct modeling has impacted me on how to design a solution for any research problem. I am particularly grateful to Dr. Buhmann for his invitation to spend a month at ETH Zurich, and the hospitality he showed while I was there. The chance to work closely with Tilman Lange is definitely memorable. It is so interesting when two minds from different cultures and research heritages meet and conduct research together. Despite our differences, we have so much in common, and the friendship with Tilman is probably the most valuable "side-product" of the research conducted during my Ph.D. study.

I want to thank Dr. Yunhong Wang for providing the NLPR database that is used in chapters four and five of this thesis. The work in chapter five of this thesis has benefited from the discussions I had during my stay at ETH Zurich.
Special thanks go to ONR (grant nos. N00014-01-1-0266 and N00014-04-1-0183) for its financial support during my Ph.D. studies.

On a more personal side, I am grateful to all the new friends that I have made during the past few years. I am especially grateful for the hospitality shown by the couples Steve & Liana, John (Ho) & Agnes, Ellen & her husband Mr. Yip, and John (Bankson) & Bonnie towards an international student like me. They have set up such a great example for me to imitate wherever I go. All the people in the Cantonese group of the Lansing Chinese Christian Ministry have given a "lab-bound" graduate student like me the possibility of a social life. The support shown by the group, including Simon, Paul, Kok, Tom, Timothy, Anthony, Twinsen, Dennis, Josie, Mitzi, Melody, Karen, Esther, Janni, Bean, Lok, Christal, Janice, and many more, has helped me to survive the tough times. All of my labmates in the PRIP lab, including Arun, Anoop, Umut, Xiaoguang, Hong, Dirk, Miguel, Yi, Unsang, Karthik, Pavan, Meltem, Sasha, Steve, ..., have been so valuable to me. In addition to learning from them professionally, their emotional and social support is something I shall never forget.

Let me reserve my final appreciation for the most important people in my life. Without the nurturing, care, and love from my father and mother, I definitely could not have completed my doctoral degree. They have provided such a wonderful environment for me and my two brothers growing up. It is such a great achievement for my parents that all their three sons have completed at least a master's degree. I am also proud of my two brothers. I miss you, dad, mum, Pong, and Fai!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

1 Introduction
1.1 Data Analysis
1.1.1 Types of Data
1.1.2 Types of Features
1.1.3 Types of Analysis
1.2 Dimensionality Reduction
1.2.1 Prevalence of High Dimensional Data
1.2.2 Advantages of Dimensionality Reduction
1.2.3 Techniques for Dimensionality Reduction
1.3 Data Clustering
1.3.1 A Taxonomy of Clustering
1.3.2 A Brief History of Cluster Analysis
1.3.3 Examining Some Clustering Algorithms
1.4 Side-Information
1.5 Overview

2 A Survey of Nonlinear Dimensionality Reduction Algorithms
2.1 Overview
2.2 Preliminary
2.3 Sammon's Mapping
2.4 Auto-associative Neural Network
2.5 Kernel PCA
2.5.1 Recap of SVM
2.5.2 Kernel PCA
2.6 ISOMAP
2.7 Locally Linear Embedding
2.8 Laplacian Eigenmap
2.9 Global Co-ordinates via Local Co-ordinates
2.9.1 Global Co-ordination
2.9.2 Charting
2.9.3 LLC
2.10 Experiments
2.11 Summary

3 Incremental Nonlinear Dimensionality Reduction by Manifold Learning
3.1 Details of ISOMAP
3.2 Incremental Version of ISOMAP
3.2.1 Incremental ISOMAP: Basic Version
3.2.2 ISOMAP with Landmark Points
3.2.3 Vertex Contraction
3.3 Experiments
3.3.1 Incremental ISOMAP: Basic Version
3.3.2 Experiments on Landmark ISOMAP
3.3.3 Vertex Contraction
3.3.4 Incorporating Variance by Incremental Learning
3.4 Discussion
3.4.1 Variants of the Main Algorithms
3.4.2 Comparison with Out-of-sample Extension
3.4.3 Implementation Details
3.5 Summary

4 Simultaneous Feature Selection and Clustering
4.1 Clustering and Feature Selection
4.2 Related Work
4.3 EM Algorithm for Feature Saliency
4.3.1 Mixture Densities
4.3.2 Feature Saliency
4.3.3 Model Selection
4.3.4 Post-processing of Feature Saliency
4.4 Experimental Results
4.4.1 Synthetic Data
4.4.2 Real Data
4.5 Discussion
4.5.1 Complexity
4.5.2 Relation to Shrinkage Estimate
4.5.3 Limitation of the Proposed Algorithm
4.5.4 Extension to Semi-supervised Learning
4.5.5 A Note on Maximizing the Posterior Probability
4.6 Summary

5 Clustering with Constraints
5.0.1 Related Work
5.0.2 The Hypothesis Space
5.1 Preliminaries
5.1.1 Exponential Family
5.1.2 Instance-level Constraints
5.2 An Illustrative Example
5.2.1 An Explanation of the Anomaly
5.3 Proposed Approach
5.3.1 Loss Function for Constraint Violation
5.4 Optimizing the Objective Function
5.4.1 Unconstrained Optimization Algorithms
5.4.2 Algorithm Details
5.4.3 Specifics for a Mixture of Gaussians
5.5 Feature Extraction and Clustering with Constraints
5.5.1 The Algorithm
5.6 Experiments
5.6.1 Experimental Results on Synthetic Data
5.6.2 Experimental Results on Real World Data
5.6.3 Experiments on Feature Extraction
5.7 Discussion
5.7.1 Time Complexity
5.7.2 Discriminative versus Generative
5.7.3 Drawback of the Proposed Approach
5.7.4 Some Implementation Details
5.8 Summary

6 Summary
6.1 Contributions
6.2 Future Work

APPENDICES

A Details of Incremental ISOMAP
A.1 Update of Neighborhood Graph
A.2 Update of Geodesic Distances: Edge Deletion
A.2.1 Finding Vertex Pairs for Update
A.2.2 Propagation Step
A.2.3 Performing the Update
A.2.4 Order for Performing Update
A.3 Update of Geodesic Distances: Edge Insertion
A.4 Geodesic Distance Update: Overall Time Complexity

B Calculations for Clustering with Constraints
B.1 First Order Information
B.1.1 Computing the Differential
B.1.2 Gradient Computation
B.1.3 Derivative for Gaussian Distribution
B.2 Second Order Information
B.2.1 Second-order Differential
B.2.2 Obtaining the Hessian Matrix
B.2.3 Hessian of the Gaussian Probability Density Function

BIBLIOGRAPHY

LIST OF TABLES

Worldwide generation of original data, if stored digitally, in terabytes (TB) circa 2002
A comparison of nonlinear mapping algorithms
Run time (seconds) for batch and incremental ISOMAP
Run time (seconds) for executing batch and incremental ISOMAP once for different numbers of points (n)
Run time (seconds) for batch and incremental landmark ISOMAP
Run time (seconds) for executing batch and incremental landmark ISOMAP once for different numbers of points (n)
Real world data sets used in the experiment
Results of the algorithm over 20 random data splits and algorithm initializations
Different algorithms for clustering with constraints
Summary of the real world data sets used in the experiments
Performance of different clustering algorithms in the absence of constraints
Performance of clustering under constraints algorithms when the constraint level is 1%
Performance of clustering under constraints algorithms when the constraint level is 2%
Performance of clustering under constraints algorithms when the constraint level is 3%
Performance of clustering under constraints algorithms when the constraint level is 5%
Performance of clustering under constraints algorithms when the constraint level is 10%
Performance of clustering under constraints algorithms when the constraint level is 15%

LIST OF FIGURES

1.1 Comparing feature vector, dissimilarity matrix, and a discrete structure on a set of artificial objects
1.2 An example of dimensionality reduction
1.3 The three well-separated clusters can be easily detected by most clustering algorithms
1.4 Diversity of clusters
1.5 A taxonomy of clustering algorithms
2.1 An example of a manifold
2.2 An example of a geodesic
2.3 Example of an auto-associative neural network
2.4 Example of neighborhood graph and geodesic distance approximation
2.5 Data sets used in the experiments for nonlinear mapping
2.6 Results of nonlinear mapping algorithms on the parabolic data set
2.7 Results of nonlinear mapping algorithms on the swiss roll data set
2.8 Results of nonlinear mapping algorithms on the S-curve data set
2.9 Results of nonlinear mapping algorithms on the face images
3.1 The edge e(a, b) is to be deleted from the neighborhood graph
3.2 Effect of edge insertion
3.3 Snapshots of "Swiss Roll" for incremental ISOMAP
3.4 Approximation error (En) between the co-ordinates estimated by the basic incremental ISOMAP and the basic batch ISOMAP for different numbers of data points (n) for the five data sets
3.5 Evolution of the estimated co-ordinates for Swiss roll to their final values
3.6 Example images from the rendered face image data set
3.7 Example "2" digits from the MNIST database
3.8 Example face images from ethn database
3.9 Classification performance on ethn database for basic ISOMAP
3.10 Snapshots of "Swiss roll" for incremental landmark ISOMAP
3.11 Approximation error between the co-ordinates estimated by the incremental landmark ISOMAP and the batch landmark ISOMAP for different numbers of data points
3.12 Classification performance on ethn database, landmark ISOMAP
3.13 Utility of vertex contraction
3.14 Sum of residue square for 1032 images at 15 rotation angles
4.1 An irrelevant feature makes it difficult for the Gaussian mixture learning algorithm in [81] to recover the two underlying clusters
4.2 The number of clusters is inter-related with the feature subset used
4.3 Deficiency of variance-based method for feature selection
4.4 An example graphical model for the probability model in Equation (4.5)
4.5 An example graphical model showing the mixture density in Equation (4.6)
4.6 An example execution of the proposed algorithm
4.7 Feature saliencies for the synthetic data used in Figure 4.6(a) and the Trunk data set
4.8 A figure showing the clustering result on the image data set
4.9 Image maps of feature saliency for different data sets
5.1 Different classification/clustering settings: supervised, unsupervised, and intermediate
5.2 An example contrasting parametric and non-parametric clustering
5.3 A simple example of clustering under constraints that illustrates the limitation of hidden Markov random field (HMRF) based approaches
5.4 The result of running different clustering under constraints algorithms for the synthetic data set shown in Figure 5.3(a)
5.5 Example face images in the ethnicity classification problem for the data set ethn
5.6 The Mondrian image used for the data set Mondrian
5.7 F-score and NMI for different algorithms for clustering under constraints for the data sets ethn, Mondrian, and ion
5.8 F-score and NMI for different algorithms for clustering under constraints for the data sets script, derm, and vehicle
5.9 F-score and NMI for different algorithms for clustering under constraints for the data set wdbc
5.10 F-score and NMI for different algorithms for clustering under constraints for the data sets UCI-seg, heart and austra
5.11 F-score and NMI for different algorithms for clustering under constraints for the data sets german, Sim-300 and diff-300
5.12 F-score and NMI for different algorithms for clustering under constraints for the data sets sat and digits
5.13 F-score and NMI for different algorithms for clustering under constraints for the data sets mfeat-fou, same-300 and texture
5.14 The result of simultaneously performing feature extraction and clustering with constraints on the data set in Figure 5.3(a)
5.15 An example of learning the subspace and the clusters simultaneously
A.1 Example of T(u; b) and T(a; b)

LIST OF ALGORITHMS

3.1 ConstructFab: F(a,b), the set of vertex pairs whose shortest paths are invalidated when e(a, b) is deleted, is constructed
3.2 ModifiedDijkstra: the geodesic distances from the source vertex u to the set of vertices C(u) are updated
3.3 OptimalOrder: a greedy algorithm to remove the vertex with the smallest degree in the auxiliary graph B
3.4 UpdateInsert: given that v_a -> v_{n+1} -> v_b is a better shortest path between v_a and v_b after the insertion of v_{n+1}, its effect is propagated to other vertices
3.5 InitializeEdgeWeightIncrease for T(a)
3.6 InitializeEdgeWeightDecrease for T(a)
3.7 Rebuild T(a) for those vertices in the priority queue Q that need to be updated
4.1 The unsupervised feature saliency algorithm

Chapter 1

Introduction

The most important characteristic of the information age is the abundance of data. Advances in computer technology, in particular the Internet, have led to what some people call "data explosion": the amount of data available to any person has increased so much that it is more than he or she can handle. According to a recent study conducted at UC Berkeley (http://www.sims.berkeley.edu/research/projects/how-much-info-2003/), the amount of new data stored on paper, film, magnetic, and optical media is estimated to have grown 30% per year between 1999 and 2002. In the year 2002 alone, about 5 exabytes of new data were generated. (One exabyte is about 10^18 bytes, or 1,000,000 terabytes.) Most of the original data are stored in electronic devices like hard disks (Table 1.1). This increase in both the volume and the variety of data calls for advances in methodology to understand, process, and summarize the data. From a more technical point of view, understanding the structure of large data sets arising from the data explosion is of fundamental importance in data mining, pattern recognition, and machine learning. In this thesis, we focus on

Table 1.1: Worldwide production of original data, if stored digitally, in terabytes (TB) circa 2002.
Upper estimates (denoted by "upper") assume the data are digitally scanned, while lower estimates (denoted by "lower") assume the digital contents have been compressed. It is taken from Table 1.2 in http://www.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm. The precise definitions of "paper," "film," "magnetic," and "optical" can be found in the web report.

Storage medium | Upper, 2002 | Lower, 2002 | Upper, 1999-2000 | Lower, 1999-2000 | % change (upper)
Paper          | 1,634       | 327         | 1,200            | 240              | 36%
Film           | 420,254     | 74,202      | 431,690          | 58,209           | -3%
Magnetic       | 5,187,130   | 3,416,230   | 2,779,760        | 2,073,760        | 87%
Optical        | 103         | 51          | 81               | 29               | 28%
Total          | 5,609,121   | 3,416,281   | 3,212,731        | 2,132,238        | 74.5%

two important techniques for data analysis in pattern recognition: dimensionality reduction and clustering. We also investigate how the addition of constraints, an example of side-information, can assist in data clustering.

1.1 Data Analysis

The word "data," as simple as it seems, is not easy to define precisely. We shall adopt a pattern recognition perspective and regard data as the description of a set of objects or patterns that can be processed by a computer. The objects are assumed to have some commonalities, so that the same systematic procedure can be applied to all the objects to generate the description.

1.1.1 Types of Data

Data can be classified into different types. Most often, an object is represented by the results of measurements of its various properties. A measurement result is called "a feature" in pattern recognition or "a variable" in statistics. The concatenation of all the features of a single object forms the feature vector. By arranging the feature vectors of different objects in different rows, we get a pattern matrix (also called "data matrix") of size n by d, where n is the total number of objects and d is the number of features. This representation is very popular because it converts different kinds of objects into a standard representation. If all the features are numerical, an object can be represented as a point in R^d. This enables a number of mathematical tools to be used to analyze the objects.

Alternatively, the similarity or dissimilarity between pairs of objects can be used as the data description. Specifically, a dissimilarity (similarity) matrix of size n by n can be formed for the n objects, where the (i, j)-th entry of the matrix corresponds to a quantitative assessment of how dissimilar (similar) the i-th and the j-th objects are. Dissimilarity representation is useful in applications where domain knowledge suggests a natural comparison function, such as the Hausdorff distance for geometric shapes. Examples of using dissimilarity for classification can be seen in [132], and more recently in [202]. Pattern matrix, on the other hand, can be easier to obtain than dissimilarity matrix. The system designer can simply list all the interesting attributes of the objects to obtain the pattern matrix, while a good dissimilarity measure with respect to the task can be difficult to design.

Similarity/dissimilarity matrix can be regarded as more generic than pattern matrix, because given the feature vectors of a set of objects, a dissimilarity matrix of these objects can be generated by computing the distances among the data points represented by these feature vectors. A similarity matrix can be generated either by subtracting the distances from a pre-specified number, or by exponentiating the negative of the distances.
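As a small illustration of this conversion, the following sketch (not part of the thesis; an assumed example in Python with NumPy, with illustrative function names) builds an n-by-n dissimilarity matrix from an n-by-d pattern matrix using Euclidean distances, and then derives a similarity matrix by exponentiating the negated distances, as described above.

import numpy as np

def dissimilarity_matrix(X):
    # X is an n-by-d pattern matrix; return the n-by-n matrix of
    # pairwise Euclidean distances D[i, j] = ||x_i - x_j||.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(sq_dists, 0.0))   # clip tiny negatives from round-off

def similarity_matrix(D, scale=1.0):
    # One of the two conversions mentioned in the text: S = exp(-D / scale).
    # (The other option is S = c - D for a pre-specified constant c.)
    return np.exp(-D / scale)

if __name__ == "__main__":
    X = np.random.rand(6, 3)                       # 6 objects, 3 features (toy data)
    D = dissimilarity_matrix(X)
    S = similarity_matrix(D, scale=np.median(D[D > 0]))
    print(D.shape, S.shape)                        # both (6, 6)

Note how the pattern matrix of size 6-by-3 turns into 6-by-6 proximity matrices, which is the size comparison (O(nd) versus O(n^2)) discussed next.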
Pattern matrix, on the other hand, can be more flexible because the user can adjust the distance function according to the task. It is easier to incorporate new information by creating additional features than by modifying the similarity/dissimilarity measure. Also, in the common scenario where there are a large number of patterns and a moderate number of features, the size of the pattern matrix, O(nd), is smaller than the size of the similarity/dissimilarity matrix, O(n^2).

A third possibility to represent an object is by discrete structures, such as parse trees, ranked lists, or general graphs. Objects such as chemical structures, web pages with hyperlinks, DNA sequences, computer programs, or customer preferences for certain products have a natural discrete structure representation. Graph-related representations have also been used in various computer vision tasks, such as object recognition [145] and shape-from-shading [217]. Representing structural objects using a vector of attributes can discard important information on the relationship between different parts of the objects. On the other hand, coming up with an appropriate dissimilarity or similarity measure for such objects is often difficult. New algorithms that can handle discrete structures directly have been developed. An example is seen in [154], where a kernel function (diffusion kernel) is defined on different vertices in a graph, leading to improved classification performance for categorical data. Learning with structural data is sometimes called "learning with relational data," and several workshops have been organized on this theme, including a NIPS workshop in 2002 (http://mlg.anu.edu.au/unrealdata/) and several ICML workshops (2004: http://www.cs.umd.edu/projects/srl2004/; 2002: http://demo.cs.brandeis.edu/icml02ws/; 2000: http://www.informatik.uni-freiburg.de/ml/icml2000_workshop.html) on how to learn with structural or relational data.

Figure 1.1: Comparing feature vector, dissimilarity matrix, and a discrete structure on a set of artificial objects. (Left) Extracting different features (color, area, and shape in this case) leads to a pattern matrix. (Center) A dissimilarity measure on the objects can be used to compare different pairs of objects, leading to a dissimilarity matrix. (Right) If the user can provide relational properties on the objects, a discrete structure like a directed graph can be created.

In Figure 1.1, we provide a simple illustration contrasting feature vector, dissimilarity matrix, and discrete structure representations for a set of artificial objects. Each of the representations corresponds to a different view of the objects. In practice, the system designer has to choose the representation that he or she thinks is the most relevant to the task.

In this thesis, we focus on the feature vector representation, though dissimilarity/similarity information in the form of instance-level constraints is also considered.
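Before moving on, the graph-kernel idea mentioned above can be made a little more concrete. The sketch below (not from the thesis; an assumed illustration in Python with NumPy/SciPy) computes a diffusion-style kernel on a small undirected graph by taking the matrix exponential of the negated graph Laplacian, in the spirit of the diffusion kernels of [154]; the entry K[i, j] can then be read as a similarity between vertices i and j.

import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, beta=0.5):
    # A: symmetric 0/1 adjacency matrix of an undirected graph.
    # L = D - A is the graph Laplacian; the kernel is taken as exp(-beta * L).
    L = np.diag(A.sum(axis=1)) - A
    return expm(-beta * L)

if __name__ == "__main__":
    # A 4-vertex path graph: 0 - 1 - 2 - 3
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    K = diffusion_kernel(A, beta=0.5)
    print(np.round(K, 3))   # vertices closer in the graph get larger similarity values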
1.1.2 Types of Features

Even within the feature vector representation, descriptions of an object can be classified into different types. A feature is essentially a measurement, and the "scale of measurement" [244] proposed by Stevens can be used to classify features into different categories. They are:

Nominal: discrete, unordered. Examples: "apple," "orange," and "banana."
Ordinal: discrete, ordered. Examples: "conservative," "moderate," and "liberal."
Interval: continuous, no absolute zero, can be negative. Example: temperature in Fahrenheit.
Ratio: continuous, with absolute zero, positive. Examples: length, weight.

This classification scheme, however, is not perfect [256]. One problem is that a measurement may not fit well into any of the categories listed in this scheme. An example of this is given in chapter 5 of [191], which considers the following types of measurements:

Grades: ordered labels such as Freshman, Sophomore, Junior, Senior.
Ranks: starting from 1, which may be the largest or the smallest.
Counted fractions: bounded by zero and one. This includes percentages, for example.
Counts: non-negative integers.
Amounts: non-negative real numbers.
Balances: unbounded, positive or negative values.

Most people would agree that these six types of data are different, yet all but the third and the last would be "ordinal" in the scheme by Stevens. "Counted fractions" also do not fit well into any of the categories proposed by Stevens. Consideration of different types of features can help us to design appropriate algorithms for handling different types of data arising from different domains.

1.1.3 Types of Analysis

The analysis to be performed on the data can also be classified into different types. It can be exploratory/descriptive, meaning that the investigator does not have a specific goal and only wants to understand the general characteristics or structure of the data. It can be confirmatory/inferential, meaning that the investigator wants to confirm the validity of a hypothesis/model or a set of assumptions using the available data. Many statistical techniques have been proposed to analyze the data, such as analysis of variance (ANOVA), linear regression, canonical correlation analysis (CCA), multidimensional scaling (MDS), factor analysis (FA), or principal component analysis (PCA), to name a few. A useful overview is given in [245].

In pattern recognition, most of the data analysis is concerned with predictive modeling: given some existing data ("training data"), we want to predict the behavior of the unseen data ("testing data"). This is often called "machine learning" or simply "learning."
This scenario happens frequently in applica- tions, since data collection and feature extraction can Often be automated, whereas the labeling of patterns or objects has to be done manually and this is expensive both in time and cost. In Chapter 5 we shall consider another hybrid scenario where instance-level constraints, which can be viewed as a “relaxed” version of labels, are available on some of the data points. 1.2 Dimensionality Reduction Dimensionality reduction deals with the transformation of a high dimensional data set into a low dimensional space, while retaining most of the useful structure in the original data. An example application Of dimensionality reduction with face images can be seen in Figure 1.2. Dimensionality reduction has become increasingly important due to the emergence of many data sets with a large number of features. The underlying assumption for dimensionality reduction is that the data points do Face images 3; , .. j.3 Comatenafing the pixels (31118118101181 feature vectors Dimensionality . ‘ reduction Low dimensional . CED CED CED CED CED representation Figure 1.2: An example of dimensionality reduction. The face images are converted into a high dimensional feature vector by concatenating the pixels. Dimensionality reduction is then used to create a set of more manageable low-dimensional feature vectors, which can then be used as the input to various classifiers. not lie randomly in the high-dimensional space; rather, there is a certain structure in the locations of the data points that can be exploited, and the useful information in high dimensional data can be summarized by a. small number of attributes. 1.2.1 Prevalence of High Dimensional Data High dimensional data have become prevalent in different applications in pattern recognition, machine learning, and data mining. The definition of “high dimensional” has also changed from tens of features to hundreds or even tens of thousands of features [101]. Some recent applications involving high dimensional data sets include: (i) text categorization, the representation Of a text document or a web page using the pop- ular bag-Of-words model can lead to thousands of features [277, 254], where each feature corresponds to the occurrence of a keyword or a key-term in the document; (ii) appearance—based computer vision approaches interpret each pixel as a feature [253, 22]. Images of handwritten digits can be recognized using the pixel values by neural networks [170] or support vector machines [255]. Evert for a small image with size 64 by 64, such representation leads to more than 4,000 features; (iii) hyperspectral images3 in remote sensing lead to high dimensional data sets: each pixel can contain more than 200 spectral measurements in different wavelengths; (iv) the characteris- tics of a chemical compound recorded by a mass spectrometer can be represented by hundreds of features, where each feature corresponds to the reading in a particular range; (v) microarray technology enables us to measure the expression levels of thou- sands Of genes simultaneously for different subjects with different treatments [6, 273]. Analyzing microarray data is particularly challenging, because the number of data points (subjects in this case) is much smaller than the number of features (expression levels in this case). High dimensional data can also be derived in applications where the initial num- ber of features is moderate. In an image processing task, the user can apply different filters with different parameters to extract a. 
The features are then summarized by applying a dimensionality reduction algorithm that matches the task at hand. This (relatively) automatic procedure contrasts with the traditional approach, where the user hand-crafts a small number of salient features manually, often with great effort. Creating a large feature set and then summarizing the features is advantageous when the domain is highly variable and robust features are hard to obtain, such as the occupant classification problem in [78].

1.2.2 Advantages of Dimensionality Reduction

Why should we reduce the dimensionality of a data set? In principle, the more information we have about each pattern, the better a learning algorithm is expected to perform. This seems to suggest that we should use as many features as possible for the task at hand. However, this is not the case in practice. Many learning algorithms perform poorly in a high dimensional space given a small number of learning samples. Often some features in the data set are just "noise" and thus do not contribute to (and sometimes even degrade) the learning process. This difficulty in analyzing data sets with many features and a small number of samples is known as the curse of dimensionality [211].

Dimensionality reduction can circumvent this problem by reducing the number of features in the data set before the training process. This can also reduce the computation time, and the resulting classifiers take less space to store. Models with a small number of variables are often easier for domain experts to interpret. Dimensionality reduction is also invaluable as a visualization tool, where the high dimensional data set is transformed into two or three dimensions for display purposes. This can give the system designer additional insight into the problem at hand.

The main drawback of dimensionality reduction is the possibility of information loss. When done poorly, dimensionality reduction can discard useful instead of irrelevant information. No matter what subsequent processing is performed, there is no way to recover this information loss.

1.2.2.1 Alternatives to Dimensionality Reduction

In the context of predictive modeling, (explicit) dimensionality reduction is not the only approach to handle high dimensional data. The naive Bayes classifier has found empirical success in classifying high dimensional data sets like webpages (the WebKB project in [50]). Regularized classifiers such as support vector machines have achieved good accuracy on high dimensional data sets in the domain of text categorization [135]. Some learning algorithms have built-in feature selection abilities and thus (in theory) do not require explicit dimensionality reduction. For example, boosting [90] can use each feature as a "weak" classifier and construct an overall classifier by selecting the appropriate features and combining them [261].

Despite the apparent robustness of these learning algorithms on high dimensional data sets, it can still be beneficial to reduce the dimensionality first. Noisy features can degrade the performance of support vector machines because the values of the kernel function (in particular the RBF kernel, which depends on inter-point Euclidean distances) become less reliable if many features are irrelevant.
It is beneficial to adjust the kernel to ignore those features [156], effectively performing dimensionality reduction. Concerns related to the efficiency and storage requirements of a classifier also suggest the use of dimensionality reduction as a preprocessing step.

The important lesson is: dimensionality reduction is useful for most applications, yet the tolerance for the amount of information discarded should be subject to the judgement of the system designer. In general, a more conservative dimensionality reduction strategy should be employed if a classifier that is more robust to high dimensionality (such as support vector machines) is used. The dimensionality of the data may still be somewhat large, but at least little useful information is lost. On the other hand, if a more traditional and easier-to-understand classifier (like quadratic discriminant analysis) is to be used, we should reduce the dimensionality of the data set more aggressively to a smaller number, so that the classifier can competently handle the data.

1.2.3 Techniques for Dimensionality Reduction

Dimensionality reduction techniques can be broadly divided into several categories: (i) feature selection and feature weighting, (ii) feature extraction, and (iii) feature grouping.

1.2.3.1 Feature Selection and Feature Weighting

Feature selection, also known as variable selection or subset selection in the statistics (particularly regression) literature, deals with the selection of a subset of features that is most appropriate for the task at hand. A feature is either selected (because it is relevant) or discarded (because it is irrelevant). Feature weighting [271], on the other hand, assigns weights (usually between zero and one) to different features to indicate the saliencies of the individual features. Most of the literature on feature selection/weighting pertains to supervised learning (both classification [122, 151, 26, 101] and regression [186]).

Filters, Wrappers, and Embedded Algorithms

Feature selection/weighting algorithms can be broadly divided into three categories [26, 151, 101]. The filter approaches evaluate the relevance of each feature (subset) using the data set alone, regardless of the subsequent learning task. RELIEF [147] and its enhancement [155] are representatives of this class, where the basic idea is to assign feature weights based on the consistency of the feature value in the k nearest neighbors of every data point.

Wrapper algorithms, on the other hand, invoke the learning algorithm to evaluate the quality of each feature (subset). Specifically, a learning algorithm (e.g., a nearest neighbor classifier, a decision tree, a naive Bayes method) is run using a feature subset, and the feature subset is assessed by some estimate related to the classification accuracy. Often the learning algorithm is regarded as a "black box" in the sense that the wrapper algorithm operates independently of the internal mechanism of the classifier. An example is [212], which used genetic search to adjust the feature weights for the best performance of the k nearest neighbor classifier. In the third approach (called embedded in [101]), the learning algorithm is modified to have the ability to perform feature selection. There is no longer an explicit feature selection step; the algorithm automatically builds a classifier with a small number of features. LASSO (least absolute shrinkage and selection operator) [250] is a good example in this category.
LASSO modifies ordinary least squares by including a constraint on the L1 norm of the weight coefficients. This has the effect of preferring sparse regression coefficients (a formal statement of this is proved in [65, 64]), effectively performing feature selection. Another example is MARS (multivariate adaptive regression splines) [91], where choosing the variables used in the polynomial splines effectively performs variable selection. Automatic relevance determination in neural networks [177] is another example, which uses a Bayesian approach to estimate the weights in the neural network as well as the relevancy parameters that can be interpreted as feature weights.

Filter approaches are generally faster because they are classifier-independent and only require the computation of simple quantities. They scale well with the number of features, and many of them can comfortably handle thousands of features. Wrapper approaches, on the other hand, can be superior in accuracy when compared with filters, which ignore the properties of the learning task at hand [151]. They are, however, computationally more demanding, and do not scale very well with the number of features. This is because training and evaluating a classifier with many features can be slow, and the performance of a traditional classifier with a large number of features may not be reliable enough to estimate the utilities of individual features. To get the best results from filters and wrappers, the user can apply a filter-type technique as preprocessing to cut down the feature set to a moderate size, and then use a wrapper algorithm to determine a small yet discriminative feature subset. Some state-of-the-art feature selection algorithms indeed adopt this approach, as observed in [102]. "Embedded" algorithms are highly specialized and it is difficult to compare them in general with filter and wrapper approaches.

Quality of a Feature Subset

Feature selection/weighting algorithms can also be classified according to the definition of "relevance" or how the quality of a feature subset is assessed. Five definitions of relevance are given in [26]. Information-theoretic methods are often used to evaluate features, because the mutual information between a relevant feature and the class labels should be high [15]. Non-parametric methods can be used to estimate the probability density function of a continuous feature, which in turn is used to compute the mutual information [159, 251]. Correlation is also used frequently to evaluate features [278, 104]. A feature can be declared irrelevant if it is conditionally independent of the class labels given other features. The concept of a Markov blanket is used to formalize this notion of irrelevancy in [153]. RELIEF [147, 155] uses the consistency of the feature value in the k nearest neighbors of every data point to quantify the usefulness of a feature.

Optimization Strategy

Given a definition of feature relevancy, a feature selection algorithm can search for the most relevant feature subset. Because of the lack of monotonicity (with respect to the features) of many feature relevancy criteria, a combinatorial search through the space of all possible feature subsets is needed. Usually, heuristic (non-exhaustive) methods have to be adopted, because the size of this space is exponential in the number of features. In this case, one generally loses any guarantee of optimality of the selected feature subset.
Different types of heuristics, such as sequential forward or backward searches, floating search, beam search, bi-directional search, and genetic search, have been suggested [36, 151, 209, 275]. A comparison of some of these search heuristics can be found in [211]. In the context of linear regression, sequential forward search is often known as stepwise regression. Forward stagewise regression is a generalization of stepwise regression, where a feature is only "partially" selected by increasing the corresponding regression coefficient by a fixed amount. It is closely related to LASSO [250], and this relationship was established via least angle regression (LARS), another interesting algorithm in its own right, in [72].

Wrapper algorithms generally include a heuristic search, as is the case for filter algorithms with feature quality criteria dependent on the features selected so far. Note that feature weighting algorithms do not involve a heuristic search because the weights for all features are computed simultaneously. However, the computation of the weights may be expensive. Embedded approaches also do not require any heuristic search. The optimal parameter is often estimated by optimizing a certain objective function. Depending on the form of the objective function, different optimization strategies can be used. In the case of LASSO, for example, a general quadratic programming solver, the homotopy method [198], a modified version of LARS [72], or the EM algorithm [80] can be used to estimate the parameters.

1.2.3.2 Feature Extraction

In feature extraction, a small set of new features is constructed by a general mapping from the high dimensional data. The mapping often involves all the available features. Many techniques for feature extraction have been proposed. In this section, we describe some of the linear feature extraction methods, i.e., those where the extracted features can be written as linear combinations of the original features. Nonlinear feature extraction techniques are more sophisticated. In Chapter 2 we shall examine some of the recent nonlinear feature extraction algorithms in more detail. The readers may also find two recent surveys [284, 34] useful in this regard.

Unsupervised Techniques

"Unsupervised" here refers to the fact that these feature extraction techniques are based only on the data (pattern matrix), without pattern label information. Principal component analysis (PCA), also known as the Karhunen-Loeve transform or simply the KL transform, is arguably the most popular feature extraction method. PCA finds a hyperplane such that, upon projection to the hyperplane, the data variance is best preserved. The optimal hyperplane is spanned by the principal components, which are the leading eigenvectors of the sample covariance matrix. Features extracted by PCA consist of the projections of the data points onto different principal components. When the features extracted by PCA are used for linear regression, this is sometimes called "principal component regression". Recently, sparse variants of PCA have also been proposed [137, 291, 52], where each principal component only has a small number of non-zero coefficients.

Factor analysis (FA) can also be used for feature extraction. FA assumes that the observed high dimensional data points are the results of a linear function (expressed by the factor loading matrix) on a few unobserved random variables, together with uncorrelated zero-mean noise.
After estimating the factor loading matrix and the variance of the noise, the factor scores for different patterns can be estimated and serve as a low-dimensional representation of the data.

Supervised Techniques

Labels in classification and response variables in regression can be used together with the data to extract more relevant features. Linear discriminant analysis (LDA) finds the projection direction such that the ratio of between-class variance to within-class variance is the largest. When there are more than two classes, multiple discriminant analysis (MDA) finds a sequence of projection directions that maximizes a similar criterion. Features are extracted by projecting the data points onto these directions.

Partial least squares (PLS) can be viewed as the regression counterpart of LDA. Instead of extracting features by retaining maximum data variance as in principal component regression, PLS finds projection directions that can best explain the response variable. Canonical correlation analysis (CCA) is a closely related technique that finds projection directions that maximize the correlation between the response variables and the features extracted by projection.

1.2.3.3 Feature Grouping

In feature grouping, new features are constructed by combining several existing features. Feature grouping can be useful in scenarios where it is more meaningful to combine features due to the characteristics of the domain. For example, in a text categorization task different words can have similar meanings and combining them into a single word class is more appropriate. Another example is the use of the power spectrum for classification, where each feature corresponds to the energy in a certain frequency range. The preset boundaries of the frequency ranges can be sub-optimal, and the sum of features from adjacent frequency ranges can lead to a more meaningful feature by capturing the energy in a wider frequency range. For gene expression data, genes that are similar may share a common biological pathway, and the grouping of predictive genes can be of interest to biologists [108, 230, 59].

The most direct way to perform feature grouping is to cluster the features (instead of the objects) of a data set. Feature clustering is not new; the SAS/STAT procedure "varclus" for variable clustering was written before 1990 [225]. It is performed by applying a hierarchical clustering method on a similarity matrix of different features, which is derived from, say, Pearson's correlation coefficient. This scheme was probably first proposed in [124], which also suggested summarizing one group of features by a single feature in order to achieve dimensionality reduction. Recently, feature clustering has been applied to boost the performance in text categorization. Techniques based on distribution clustering [4], mutual information [62], and information bottleneck [238] have also been proposed.

Features can also be clustered together with the objects. As mentioned in [201], this idea has been known under different names in the literature, including "bi-clustering" [41, 150], "co-clustering" [63, 61], "double-clustering" [73], "coupled clustering" [95], and "simultaneous clustering" [208]. A bipartite graph can be used to represent the relationship between objects and features, and the partitioning of the graph can be used to cluster the objects and the features simultaneously [281, 61]. Information bottleneck can also be used for this task [237].
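A minimal sketch of the basic feature-clustering scheme described above (not part of the thesis; assumed to be written in Python with NumPy/SciPy, and with an illustrative function name) is given below: the features are hierarchically clustered using one minus the absolute Pearson correlation as the dissimilarity, and each group is then summarized by the average of its members, in the spirit of the suggestion in [124].

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_features(X, n_groups):
    # X: n-by-d pattern matrix. Cluster the d features using
    # 1 - |Pearson correlation| as the dissimilarity between features.
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_groups, criterion="maxclust")
    # Summarize each group of features by its average, giving one new feature per group.
    grouped = np.column_stack([X[:, labels == g].mean(axis=1)
                               for g in np.unique(labels)])
    return grouped, labels

if __name__ == "__main__":
    X = np.random.randn(100, 12)                            # toy data: 100 objects, 12 features
    X[:, 6:] = X[:, :6] + 0.1 * np.random.randn(100, 6)     # six noisy copies of the first six
    Xg, labels = group_features(X, n_groups=6)
    print(Xg.shape, labels)                                 # correlated feature pairs tend to share a label

In this toy example the correlated feature pairs are merged, reducing the twelve original features to roughly six grouped features.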
In the context of regression, feature grouping can be achieved indirectly by favoring similar features to have similar coefficients. This can be done by combining ridge regression with LASSO, leading to the elastic net regression algorithm [290].

Figure 1.3: The three well-separated clusters can be easily detected by most clustering algorithms. (a) Original data. (b) Clustering result. Images in this thesis/dissertation are presented in color.

1.3 Data Clustering

The goal of (data) clustering, also known as cluster analysis, is to discover the "natural" grouping(s) of a set of patterns, points, or objects. Webster (http://www.m-w.com/) defines cluster analysis as "a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics." An example of clustering can be seen in Figure 1.3. The unlabeled data set in Figure 1.3(a) is assigned labels by a clustering procedure in order to discover the natural grouping of the three groups as shown in Figure 1.3(b).

Cluster analysis is prevalent in any discipline that involves the analysis of multivariate data. It is difficult to exhaustively list the numerous uses of clustering techniques. Image segmentation, an important problem in computer vision, can be formulated as a clustering problem [94, 128, 234]. Documents can be clustered [120] to generate topical hierarchies for information access [221] or retrieval [20]. Clustering is also used to perform market segmentation [3, 39] as well as to study genome data [6] in biology.

Figure 1.4: Diversity of clusters. The seven clusters in this data set (denoted by the seven different colors), though easily identified by a human, are difficult to detect automatically. The clusters are of different shapes, sizes, and densities. The presence of background noise makes the clustering task even more difficult. (a) Original data. (b) Clustering result.

Clustering, unfortunately, is difficult for most data sets. A non-trivial example of clustering is shown in Figure 1.4. Unlike the three well-separated, spherical clusters in Figure 1.3, the seven clusters in Figure 1.4 have diverse shapes: globular, circular, and spiral in this case. The densities and the sizes of the clusters are also different. The presence of background noise makes the detection of the clusters even more difficult. This example also illustrates the fundamental difficulty of clustering. The diversity of "good" clusters in different scenarios makes it virtually impossible to provide a universal definition of "good" clusters.
In fact, it has been proved in [149] that it is impossible for any clustering algorithm to achieve some fairly basic goals simultaneously. Therefore, it is not surprising that many clustering algorithms have been proposed to address the different needs of "good clusters" in different scenarios.

In this section, we attempt to provide a taxonomy of the major clustering techniques, present a brief history of cluster analysis, and present the basic ideas of some popular clustering algorithms in the pattern recognition community.

1.3.1 A Taxonomy of Clustering

Many clustering algorithms have been proposed in different application scenarios. Perhaps the most important way to classify clustering algorithms is hierarchical versus partitional. Hierarchical clustering creates a tree of objects, where branches merging at the lower levels correspond to higher similarity. Partitional clustering, on the other hand, aims at creating a "flat" partition of the set of objects, with each object belonging to one and only one group. Clustering algorithms can also be classified by the type of input data used (pattern matrix or similarity matrix), or by the type of the features, e.g., numerical, categorical, or special data structures, such as rank data, strings, graphs, etc. (See Section 1.1.1 for information on different types of data.) Alternatively, a clustering algorithm can be characterized by the probability model used, if any, or by the core search (optimization) process used to find the clusters. Hierarchical clustering algorithms can be described by the clustering direction, either agglomerative or divisive.

In Figure 1.5, we provide one possible hierarchy of partitional clustering algorithms (modified from [131]). Heuristic-based techniques refer to clustering algorithms that optimize a certain notion of "good" clusters. The goodness function is constructed by the user in a heuristic manner. Model-based clustering assumes that there are underlying (usually probabilistic) models that govern the clusters. Density-based algorithms attempt to estimate the data density and utilize that to construct the clusters. One may further sub-divide heuristic-based techniques depending on the input type. If a pattern matrix is used, the algorithm is usually prototype-based, i.e., each cluster is represented by the most typical "prototype." The k-means and the k-medoids algorithms [79] are probably the best known in this category. If a dissimilarity or similarity matrix is used as the input, two sub-categories are possible: those based on linkage (single-link, average-link, complete-link, and CHAMELEON [142]), and those inspired by graph theory, such as min-cut [272] and spectral clustering [234, 194]. Model-based algorithms often refer to clustering by using a finite mixture distribution [184], with each mixture component interpreted as a cluster. Spatial clustering can involve a probabilistic model of the point process. For density-based methods, the mean-shift algorithm [45] finds the modes of the data density by the mean-shift operation, and the cluster label of a point is determined by the "basin of convergence" in which the point is located. DENCLUE [111] utilizes a kernel (non-parametric) estimate of the data density to find the clusters.
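To make the mode-seeking idea concrete, the following is a minimal sketch of the mean-shift iteration with a Gaussian kernel: each point is repeatedly moved to the kernel-weighted mean of the data, and points whose iterations converge to approximately the same mode receive the same label. The bandwidth, tolerance, and mode-merging threshold are illustrative assumptions, not settings taken from the cited papers.

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, tol=1e-4, max_iter=200):
    """Minimal mean-shift clustering sketch (Gaussian kernel)."""
    modes = X.astype(float).copy()
    for i in range(len(X)):
        x = modes[i]
        for _ in range(max_iter):
            # Gaussian kernel weights of all data points relative to x.
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
            x_new = w @ X / w.sum()          # mean-shift update: weighted mean
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        modes[i] = x
    # Points whose modes nearly coincide share the same basin of convergence.
    labels, centers = np.zeros(len(X), dtype=int), []
    for i, m in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(m - center) < bandwidth / 2:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
print(np.bincount(mean_shift(X, bandwidth=1.5)))
```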
Figure 1.5: A taxonomy of clustering algorithms. Partitional algorithms are divided into heuristic-based, model-based, and density-based approaches. Heuristic-based methods take either a pattern matrix (prototype-based methods: k-means, k-medoid) or a proximity matrix (linkage methods: single-link, complete-link, CHAMELEON; graph-theoretic methods: MST, spectral clustering, min-cut). Model-based methods include spatial clustering and mixture models (Gaussian mixture, latent class). Density-based methods include kernel-based methods (DENCLUE) and mode seeking (mean-shift).

1.3.2 A Brief History of Cluster Analysis

According to the scholarly journal archive JSTOR (http://www.jstor.org), the first appearance of the word "cluster" in the title of a scholarly article was in 1739 [11]: "A Letter from John Bartram, M. D. to Peter Collinson, F. R. S. concerning a Cluster of Small Teeth Observed by Him at the Root of Each Fang or Great Tooth in the Head of a Rattle-Snake, upon Dissecting It". The word "cluster" here, though, was used only in its general sense to denote a group. The phrase "cluster analysis" first appeared in 1954, when it was suggested as a tool to understand anthropological data [43]. In its early days, cluster analysis was sometimes referred to as grouping [48, 85], and biologists called it "numerical taxonomy" [242].

Early research on hierarchical clustering was mainly done by biologists, because these techniques helped them to create a hierarchy of different species for analyzing their relationships systematically. According to [242], single-link clustering [240], complete-link clustering [213], and average-link clustering [241] first appeared in 1957, 1948, and 1958, respectively. Ward's method [266] was proposed in 1963. Partitional clustering, on the other hand, is closely related to data compression and vector quantization. This link is not surprising because the cluster labels assigned by a partitional clustering algorithm can be viewed as a compressed version of the data. The most popular partitional clustering algorithm, k-means, has been proposed several times in the literature: Steinhaus in 1955 [243], Lloyd in 1957 [174], and MacQueen in 1967 [178]. The ISODATA algorithm by Ball and Hall in 1965 [8] can be regarded as an adaptive version of k-means that adjusts the number of clusters. The k-means algorithm is also attributed to Forgy (as in [140] and [99]), though the reference for this [88] only contains an abstract and it is not clear what Forgy exactly proposed. The historical account of vector quantization given in [99] also presents the history of some of the partitional clustering algorithms. In 1971, Zahn proposed a graph-theoretic clustering method [280], which is closely related to single-link clustering. The EM algorithm, which is the standard algorithm for estimating a finite mixture model for mixture-based clustering, is attributed to Dempster et al. in 1977 [58]. Interest in mean-shift clustering was revived in 1995 by Cheng [40], and Comaniciu and Meer further popularized it in [45]. Hofmann and Buhmann considered the use of deterministic annealing for pairwise clustering [115], and Fischer and Buhmann modified the connectedness idea in single-link clustering, which led to path-based clustering [84]. The normalized cut algorithm by Shi and Malik [233] in 1997 is often regarded as the first spectral clustering algorithm, though similar ideas were considered by spectral graph theorists earlier. A summary of the important results in spectral graph theory can be found in the 1997 book by Chung [42]. The emergence of data
mining has led to a new line of clustering research that emphasizes efficiency when dealing with huge databases. DBSCAN by Ester et al. [77] for density-based clustering and CLIQUE by Agrawal et al. [2] for subspace clustering are two well-known algorithms in this community.

The current literature on cluster analysis is vast, and hundreds of clustering algorithms have been proposed. It would require a tremendous effort to list and summarize all the major clustering algorithms. The reader is encouraged to refer to a survey like [130] or [79] for an overview of different clustering algorithms.

1.3.3 Examining Some Clustering Algorithms

In this section, we will examine two very important clustering algorithms used in the pattern recognition community: the k-means algorithm and the EM algorithm. Other clustering algorithms that are used regularly in pattern recognition include the mean-shift algorithm [45, 44, 40], pairwise clustering [115, 116], path-based clustering [84, 83], and spectral clustering [234, 139, 269, 194, 258, 42].

Let $\{y_1, \ldots, y_n\}$ be the set of $n$ $d$-dimensional data points to be clustered. The cluster label of $y_i$ is denoted by $z_i$. The goal of (partitional) clustering is to recover $z_i$, with $z_i \in \{1, \ldots, k\}$, where $k$ denotes the number of clusters specified by the user. The set of $y_i$ with $z_i = j$ is referred to as the $j$-th cluster.

1.3.3.1 The k-means algorithm

The k-means algorithm is probably the best known clustering algorithm. In this algorithm, the $j$-th cluster is represented by the "cluster prototype" $\mu_j$ in $\mathbb{R}^d$. Clustering is done by finding $z_i$ and $\mu_j$ that minimize the following cost function:
$$J_{k\text{-means}} = \sum_{i=1}^{n} \|y_i - \mu_{z_i}\|^2 = \sum_{i=1}^{n}\sum_{j=1}^{k} I(z_i = j)\,\|y_i - \mu_j\|^2. \qquad (1.1)$$
Here, $I(z_i = j)$ denotes the indicator function, which is one if the condition $z_i = j$ is true, and zero otherwise. To optimize $J_{k\text{-means}}$, we first assume that all $\mu_j$ are specified. The values of $z_i$ that minimize $J_{k\text{-means}}$ are given by
$$z_i = \arg\min_{j} \|y_i - \mu_j\|^2. \qquad (1.2)$$
On the other hand, if $z_i$ is fixed, the optimal $\mu_j$ can be found by differentiating $J_{k\text{-means}}$ with respect to $\mu_j$ and setting the derivatives to zero, leading to
$$\mu_j = \frac{\sum_{i=1}^{n} I(z_i = j)\,y_i}{\sum_{i=1}^{n} I(z_i = j)} = \frac{\sum_{i:\,z_i = j} y_i}{\text{number of } i \text{ with } z_i = j}. \qquad (1.3)$$
Starting from an initial guess of $\mu_j$, the k-means algorithm iterates between Equations (1.2) and (1.3), which is guaranteed to decrease the k-means objective function until a local minimum is reached. In this case, $\mu_j$ and $z_i$ remain unchanged after the iteration, and the k-means algorithm is said to have converged. The resulting $z_i$ and $\mu_j$ constitute the clustering solution. In practice, one can stop if the change in successive values of $J_{k\text{-means}}$ is less than a threshold.

The k-means algorithm is easy to understand and is also easy to implement. However, k-means has problems in discovering clusters that are not spherical in shape. It also encounters some difficulties when different clusters have significantly different numbers of points. k-means also requires a good initialization to avoid getting trapped in a poor local minimum. In many cases, the user does not know the number of clusters in advance, which is required by k-means. The problem of determining the value of $k$ automatically still does not have a very satisfactory solution. Some heuristics have been described in [125], and a recent paper on this is [106].
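The alternation between Equations (1.2) and (1.3) can be written down in a few lines. The following is a minimal NumPy sketch of this iteration, assuming a random-sample initialization of the prototypes and a simple change-in-cost stopping rule; it is meant only to illustrate the two update equations, not to serve as a production implementation.

```python
import numpy as np

def kmeans(Y, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: alternate Eq. (1.2) and Eq. (1.3)."""
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(len(Y), size=k, replace=False)]    # initial prototypes
    prev_cost = np.inf
    for _ in range(max_iter):
        # Eq. (1.2): assign each point to the nearest prototype.
        dists = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        cost = dists[np.arange(len(Y)), z].sum()
        # Eq. (1.3): each prototype becomes the mean of its assigned points.
        for j in range(k):
            if np.any(z == j):                           # leave empty clusters unchanged
                mu[j] = Y[z == j].mean(axis=0)
        if prev_cost - cost < tol:                       # stop when J_k-means stabilizes
            break
        prev_cost = cost
    return z, mu

Y = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [6, 0], [0, 6])])
labels, centers = kmeans(Y, k=3)
print(centers)
```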
Because the k-means algorithm alternates between the two conditions of optimality, it is an example of alternating optimization. The k-means clustering result can be interpreted as a solution to vector quantization, with a codebook of size $k$ and a square error loss function. Each $\mu_j$ is a codeword in this case. The k-means algorithm can also be viewed as a special case of fitting a Gaussian mixture, with the covariance matrices of all the mixture components fixed to $\sigma^2 I$ and $\sigma$ tending to zero (for the "hard" cluster assignment). The k-medoid algorithm is similar to k-means, except that $\mu_j$ is restricted to be one of the given patterns $y_i$.

There is also an online version of k-means. When the $i$-th data point $y_i$ is observed, the cluster center $\mu_j$ that is nearest to $y_i$ is found. $\mu_j$ is then updated by
$$\mu_j^{\text{new}} = \mu_j + \alpha\,(y_i - \mu_j), \qquad (1.4)$$
where $\alpha$ is the learning rate. This learning rule is an example of "winner-take-all" in competitive learning, because only the cluster that "wins" the data point can learn from it.

1.3.3.2 Clustering by Fitting a Finite Mixture Model

The k-means algorithm is an example of "hard" clustering, where a data point is assigned to only one cluster. In many cases, it is beneficial to consider "soft" clustering, where a point is assigned to different clusters with different degrees of certainty. This can be done either by fuzzy clustering or by mixture-based clustering. We prefer the latter because it has a more rigorous foundation.

In mixture-based clustering, a finite mixture model is fitted to the data. Let $Y$ and $Z$ be the random variables for a data point and a cluster label, respectively. Each cluster is represented by the component distribution $p(Y|\theta_j)$, where $\theta_j$ denotes the parameter of the $j$-th cluster. Data points from the $j$-th cluster are assumed to follow this distribution, i.e., $p(Y|Z = j) = p(Y|\theta_j)$. The component distribution $p(Y|\theta_j)$ is often assumed to be a Gaussian when $Y$ is continuous, and the corresponding mixture model is called "a mixture of Gaussians". If $Y$ is categorical, a multinomial distribution can be used for $p(Y|\theta_j)$. Let $\alpha_j = P(Z = j)$ be the prior probability of the $j$-th cluster. The key idea of a mixture model is
$$p(Y|\Theta) = \sum_{j=1}^{k} P(Z = j)\,p(Y|Z = j) = \sum_{j=1}^{k} \alpha_j\,p(Y|\theta_j), \qquad (1.5)$$
where $\Theta = \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_k\}$ contains all the model parameters. The mixture model can be understood as a two-stage data generation process. First, the hidden cluster label $Z$ is sampled from a multinomial distribution with parameters $(\alpha_1, \ldots, \alpha_k)$. The data point $Y$ is then generated according to the component distribution determined by $Z$, i.e., $Y$ is sampled from $p(Y|\theta_j)$ if $Z = j$. The degree of membership of $y_i$ in the $j$-th cluster is determined by the posterior probability of $Z$ equal to $j$ given $y_i$, i.e.,
$$P(Z = j\,|\,Y = y_i) = \frac{\alpha_j\,p(y_i|\theta_j)}{\sum_{l=1}^{k} \alpha_l\,p(y_i|\theta_l)}. \qquad (1.6)$$
If a "hard" clustering is needed, $y_i$ can be assigned to the cluster with the highest posterior probability $P(Z|Y = y_i)$.

The parameter $\Theta$ can be determined using the maximum likelihood principle. We seek $\Theta$ that minimizes the negative log-likelihood:
$$J_{\text{mixture}} = -\sum_{i=1}^{n} \log \sum_{j=1}^{k} \alpha_j\,p(y_i|\theta_j). \qquad (1.7)$$
For brevity of notation, we write $p(y_i|\theta_j)$ to denote $p(Y = y_i|\theta_j)$. The EM algorithm can be used to optimize $J_{\text{mixture}}$. EM is a powerful technique for parameter estimation when some of the data are missing. In the context of a finite mixture model, the missing data are the cluster labels. Starting with an initial guess of the parameters, the EM algorithm alternates between the "E-step" and the "M-step".
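Before deriving these two steps in detail below, it may help to see them concretely for a mixture of Gaussians. The sketch that follows alternates the posterior computation of Equation (1.6) (E-step) with the standard weighted-mean and weighted-covariance updates of the component parameters (M-step). The initialization, the diagonal regularization of the covariances, and the fixed number of iterations are illustrative assumptions only, not part of the derivation that follows.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(Y, k, n_iter=50, seed=0):
    """Minimal EM sketch for a mixture of Gaussians (full covariances)."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    alpha = np.full(k, 1.0 / k)                       # mixing weights alpha_j
    mu = Y[rng.choice(n, size=k, replace=False)]      # component means
    sigma = np.stack([np.cov(Y.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: posteriors P(Z = j | y_i), as in Eq. (1.6).
        dens = np.column_stack([alpha[j] * multivariate_normal.pdf(Y, mu[j], sigma[j])
                                for j in range(k)])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update alpha_j, mu_j, Sigma_j from the posteriors.
        nj = r.sum(axis=0)
        alpha = nj / n
        mu = (r.T @ Y) / nj[:, None]
        for j in range(k):
            diff = Y - mu[j]
            sigma[j] = (r[:, j, None] * diff).T @ diff / nj[j] + 1e-6 * np.eye(d)
    labels = r.argmax(axis=1)                         # "hard" assignment if needed
    return labels, alpha, mu, sigma

Y = np.vstack([np.random.randn(150, 2), np.random.randn(150, 2) + 4])
labels, alpha, mu, sigma = gmm_em(Y, k=2)
print(alpha, mu)
```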
Let $r_{ij} = P(Z = j\,|\,Y = y_i, \Theta^{\text{old}})$, where $\Theta^{\text{old}}$ is the current parameter estimate. In the E-step, we compute the expected complete data log-likelihood, also known as the Q-function:
$$Q(\Theta) = E_{Z|Y,\Theta^{\text{old}}}\big[\log p(Y, Z|\Theta)\big] = \sum_{Z} p(Z|Y, \Theta^{\text{old}})\,\log p(Y, Z|\Theta).$$
For the finite mixture model, this becomes $Q(\Theta) = \sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij}\big(\log\alpha_j + \log p(y_i|\theta_j)\big)$. In the M-step, we find $\Theta^{\text{new}}$ that maximizes the Q-function. To see that this improves the observed-data log-likelihood, write
$$Q(\Theta^{\text{new}}) - Q(\Theta^{\text{old}}) = \sum_{Z} p(Z|Y, \Theta^{\text{old}})\,\log\frac{p(Y, Z|\Theta^{\text{new}})}{p(Y, Z|\Theta^{\text{old}})} = \log p(Y|\Theta^{\text{new}}) - \log p(Y|\Theta^{\text{old}}) + \sum_{Z} p(Z|Y, \Theta^{\text{old}})\,\log\frac{p(Z|Y, \Theta^{\text{new}})}{p(Z|Y, \Theta^{\text{old}})},$$
and note that the last term is non-positive:
$$\sum_{Z} p(Z|Y, \Theta^{\text{old}})\,\log\frac{p(Z|Y, \Theta^{\text{new}})}{p(Z|Y, \Theta^{\text{old}})} \le \log\sum_{Z} p(Z|Y, \Theta^{\text{new}}) = 0.$$
The inequality is due to the concavity of the logarithm, and the fact that $p(Z|Y, \Theta^{\text{old}})$ can be viewed as "weights" because they are non-negative and $\sum_Z p(Z|Y, \Theta^{\text{old}}) = 1$. Since $Q(\Theta^{\text{new}}) - Q(\Theta^{\text{old}}) \ge 0$, the above implies $\log p(Y|\Theta^{\text{new}}) - \log p(Y|\Theta^{\text{old}}) \ge 0$. So, the update of the parameter from $\Theta^{\text{old}}$ to $\Theta^{\text{new}}$ indeed improves the log-likelihood of the observed data. When $\Theta^{\text{old}} = \Theta^{\text{new}}$, the inequality becomes an equality, and we have reached a stationary point of $\log p(Y|\Theta)$, i.e., a local minimum of $J_{\text{mixture}}$. Note that the above argument holds as long as $Q(\Theta^{\text{new}}) - Q(\Theta^{\text{old}}) \ge 0$. Thus it suffices to increase (instead of maximize) the expected complete log-likelihood in the M-step. The resulting algorithm that only increases the expected complete log-likelihood is known as the generalized EM algorithm.

It is interesting to note a variant of the EM algorithm used in [80] for Bayesian parameter estimation. The goal is to find $\Theta$ that maximizes $\log p(\Theta|Y)$. Since the missing data in [80] are continuous, the expectation is performed by integration instead of summation. The E-step computes $\int p(Z|\Theta^{\text{old}}, Y)\,\log p(\Theta|Z, Y)\,dZ$, and the M-step solves $\Theta^{\text{new}} = \arg\max_{\Theta}\int p(Z|\Theta^{\text{old}}, Y)\,\log p(\Theta|Z, Y)\,dZ$. The correctness of this variant of the EM algorithm can be seen from the following:
$$\int p(Z|\Theta^{\text{old}}, Y)\,\log p(\Theta^{\text{new}}|Z, Y)\,dZ - \int p(Z|\Theta^{\text{old}}, Y)\,\log p(\Theta^{\text{old}}|Z, Y)\,dZ$$
$$= \int p(Z|\Theta^{\text{old}}, Y)\Big(\log p(\Theta^{\text{new}}|Y) + \log p(Z|\Theta^{\text{new}}, Y) - \log p(Z|Y) - \log p(\Theta^{\text{old}}|Y) - \log p(Z|\Theta^{\text{old}}, Y) + \log p(Z|Y)\Big)\,dZ$$
$$= \log p(\Theta^{\text{new}}|Y) - \log p(\Theta^{\text{old}}|Y) + \int p(Z|\Theta^{\text{old}}, Y)\,\log\frac{p(Z|\Theta^{\text{new}}, Y)}{p(Z|\Theta^{\text{old}}, Y)}\,dZ \;\le\; \log p(\Theta^{\text{new}}|Y) - \log p(\Theta^{\text{old}}|Y).$$
Note that $p(\Theta|Z, Y) = p(\Theta|Y)\,p(Z|\Theta, Y)/p(Z|Y)$.

Our second proof of the EM algorithm is to regard it as a special case of the variational method. Here, we follow the presentation in [205]. Let $T(Z)$ be an unknown variational distribution on the missing data $Z$. Since $p(Y|\Theta) = p(Y, Z|\Theta)/p(Z|Y, \Theta)$, we have
$$\log p(Y|\Theta) = \sum_{Z} T(Z)\,\log\frac{p(Y, Z|\Theta)}{T(Z)} + D_{KL}\big(T(Z)\,\big\|\,p(Z|Y, \Theta)\big).$$

An example of a "curved" manifold with the data points lying on it can be seen in Figure 2.1. This manifold assumption is reasonable because many real world phenomena are driven by a small number of latent factors. The high dimensional feature vectors observed are the results of applying a (usually unknown) mapping to the latent factors, followed by the introduction of noise.
Consequently, high dimensional vectors in practice lie approximately on a low dimensional manifold.

Strictly speaking, what we refer to as a "manifold" in this thesis should properly be called a "Riemannian manifold." A Riemannian manifold is smooth and differentiable, and contains the notion of length. We leave the precise definition of a Riemannian manifold to encyclopedias like MathWorld (http://mathworld.wolfram.com) and Wikipedia (http://en2.wikipedia.org/), and describe only some of its properties here. Every $y$ in the manifold $M$ has a neighborhood $N(y)$ that is homeomorphic (two topological spaces are homeomorphic if there exists a continuous and invertible function between the two spaces, and the inverse function is also continuous) to a set $S$, where $S$ is either an open subset of $\mathbb{R}^d$, or an open subset of the closed half of $\mathbb{R}^d$. This mapping $\phi_y : N(y) \mapsto S$ is called a co-ordinate chart, and $\phi_y(y)$ is called the "co-ordinate" of $y$. A collection of co-ordinate charts that covers the entire $M$ is called an atlas. If $y$ is in two co-ordinate charts $\phi_{y_1}$ and $\phi_{y_2}$, $y$ will have two (local) co-ordinates $\phi_{y_1}(y)$ and $\phi_{y_2}(y)$. These two co-ordinates should be "consistent" in the sense that there is a map to convert between $\phi_{y_1}(y)$ and $\phi_{y_2}(y)$, and the map is continuous for any path in $N(y_1) \cap N(y_2)$.

For any $y_i$ and $y_j$ in $M$, there can be many paths in $M$ that connect $y_i$ and $y_j$. The shortest of such paths is called the geodesic between $y_i$ and $y_j$. (Strictly speaking, geodesics are curves with zero covariant derivatives of their velocity vectors along the curve. A shortest curve must be a geodesic, whereas a geodesic might not be a shortest curve.) For example, the geodesic between two points on a sphere is an arc of a "great circle": a circle whose center coincides with the center of the sphere (Figure 2.2). The length of the geodesic between $y_i$ and $y_j$ is the geodesic distance between $y_i$ and $y_j$.

Figure 2.2: An example of a geodesic. For two points A and B on the sphere, many lines (the dash-dot lines) can be drawn to connect them. However, the shortest of these lines, which is the solid line joining A and B, is called the geodesic between A and B. In the case of a sphere, the geodesic is simply the great circle.

To perform nonlinear mapping, one can assume that there exists a mapping $\phi_{\text{global}}(\cdot)$ that maps all points on $M$ to $\mathbb{R}^d$. The "global co-ordinate" of $y$, denoted by $x = \phi_{\text{global}}(y)$, is regarded as the low dimensional representation of $y$. In general, such a mapping may not exist. (For example, there is no such map (homeomorphism) between all points on a sphere and $\mathbb{R}^2$. However, if we exclude the north pole of the sphere, we can construct such a mapping.) In that case, a mapping that preserves a certain property of the manifold can be constructed to obtain $x$.

Many of the nonlinear mapping algorithms that are manifold-based require a concrete definition of $N(y_i)$, the neighborhood of $y_i$. Two definitions are commonly used. In the $\epsilon$-neighborhood, $y_j \in N(y_i)$ if $\|y_i - y_j\| < \epsilon$, where the norm is the Euclidean distance in $\mathbb{R}^D$. In the knn-neighborhood, $y_j \in N(y_i)$ if $y_j$ is one of the $k$ nearest neighbors of $y_i$ in $\mathcal{Y}$, or vice versa. In both cases, $\epsilon$ or $k$ is a user-defined parameter. The knn neighborhood has the advantage that it is independent of the scale of the data, though it can lead to too small a neighborhood when the number of data points is large. Note that the neighborhood can be defined in a data-driven manner [29] instead of being specified by a user.
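Since the knn-neighborhood construction just described is the entry point for ISOMAP, LLE, and Laplacian eigenmap later in this chapter, a small sketch may be useful. The following builds a symmetric k-nearest-neighbor graph stored as a weighted adjacency matrix of Euclidean edge lengths; the use of SciPy's cKDTree and the dense matrix representation are implementation choices made here for brevity, not requirements of any of the cited algorithms.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph(Y, k=12):
    """Symmetric knn-neighborhood graph.

    Returns an (n, n) matrix W where W[i, j] = ||y_i - y_j|| if y_j is among the
    k nearest neighbors of y_i or vice versa, and 0 otherwise (0 = no edge).
    """
    n = len(Y)
    tree = cKDTree(Y)
    # Query k+1 neighbors because the nearest neighbor of a point is itself.
    dist, idx = tree.query(Y, k=k + 1)
    W = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):
            W[i, j] = d
            W[j, i] = d          # symmetrize: "or vice versa"
    return W

Y = np.random.rand(500, 3)
W = knn_graph(Y, k=12)
print((W > 0).sum(axis=1).mean())   # average degree, at least k
```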
2.3 Sammon's mapping

Sammon's mapping [223], which is an example of metric least square scaling [49], is perhaps the most well-known algorithm for nonlinear mapping. Sammon's mapping is an algorithm for multidimensional scaling: it maps a set of $n$ items into a Euclidean space based on the dissimilarity values. This problem is related to the metric embedding problem considered by theoretical computer scientists [119]. Sammon's mapping can be used for dimensionality reduction if the dissimilarity matrix is based on the Euclidean distance between the data points in the high dimensional space.

Given an $n$ by $n$ matrix of dissimilarity values $\{\delta_{ij}\}$, where $\delta_{ij}$ denotes the dissimilarity between the $i$-th and the $j$-th items, we want to map the $n$ items to $n$ points $\{x_1, \ldots, x_n\}$ in a low dimensional space, such that the distance $d_{ij} = \|x_i - x_j\|$ between $x_i$ and $x_j$ is as "close" to $\delta_{ij}$ as possible. Many different definitions of closeness have been proposed, with the "Sammon stress", defined by Sammon, being the most popular. The Sammon stress $S$ is defined by
$$S = \frac{1}{\sum_{i<j}\delta_{ij}} \sum_{i<j} \frac{(d_{ij} - \delta_{ij})^2}{\delta_{ij}}. \qquad (2.1)$$
Sammon proposed to minimize $S$ by a pseudo-Newton iteration that updates each co-ordinate $x_{uk}$ according to
$$x_{uk} \leftarrow x_{uk} - \alpha\,\frac{\partial S/\partial x_{uk}}{\big|\partial^2 S/\partial x_{uk}^2\big|}, \qquad (2.2)$$
where $\alpha$ is a step size, and the first and second partial derivatives of $S$ are
$$\frac{\partial S}{\partial x_{uk}} = -\frac{2}{\sum_{i<j}\delta_{ij}} \sum_{j \neq u} \frac{\delta_{uj} - d_{uj}}{\delta_{uj}\,d_{uj}}\,(x_{uk} - x_{jk}), \qquad (2.3)$$
$$\frac{\partial^2 S}{\partial x_{uk}^2} = -\frac{2}{\sum_{i<j}\delta_{ij}} \sum_{j \neq u} \frac{1}{\delta_{uj}\,d_{uj}} \Big( (\delta_{uj} - d_{uj}) - \frac{(x_{uk} - x_{jk})^2}{d_{uj}}\Big(1 + \frac{\delta_{uj} - d_{uj}}{d_{uj}}\Big) \Big). \qquad (2.4)$$
One can use a nonlinear optimization algorithm other than Equation (2.2) to minimize $S$. It is also possible to implement Sammon's mapping by a feed-forward neural network [180] or in an incremental manner [129]. Note that Sammon's mapping is "global" and considers all the interpoint distances between the $n$ items. This can be a drawback for data like the Swiss roll data set, where Euclidean distances between pairs of points that are far away from each other do not reveal the true structure of the data.

2.4 Auto-associative neural network

A special type of feed-forward neural network, the "auto-associative neural network" [7, 57], can be used for nonlinear dimensionality reduction. An example of such a network is shown in Figure 2.3. The idea is to model the functional relationship between $x_i$ and $y_i$ by a neural network. If $x_i$ is a good representation of $y_i$, it should contain sufficient information to reconstruct $y_i$ via a neural network (the decoding network), with the "decoding layer" as its hidden layer. To obtain $x_i$ from $y_i$, another neural network (the encoding network) is needed, with the "encoding layer" as its hidden layer. The encoding network and the decoding network are connected so that the output of the encoding network is used as the input of the decoding network, and both of them correspond to $x_i$. The high-dimensional data points $y_i$ are used as both the input and the target for training this neural network. The sum of square error can be used as the objective function for training. Note that the neural network in Figure 2.3 is just an example; alternative architectures can be used. For example, multiple hidden layers can be used, and the number of neurons in the encoding and decoding layers can also be different.

The advantage of this approach is that mapping a new $y$ to the corresponding $x$ is easy: just feed $y$ to the neural network and extract the output of the encoding layer. Also, there exist a number of software packages for training neural networks. The drawback is that it is difficult to determine the
appropriate network architecture to best reduce the dimension for any given data set. Also, training a neural network involves an optimization problem that is considerably more difficult than the eigen-decomposition required by some other nonlinear mapping methods like ISOMAP, LLE, or Laplacian eigenmap, which we shall examine later in this chapter.

Figure 2.3: Example of an auto-associative neural network (input layer, encoding layer, "middle" layer, decoding layer, and output layer). This network extracts $x_i$ with 3 features from the given data $y_i$ with 8 features.

2.5 Kernel PCA

The basic idea of kernel principal component analysis (KPCA) is to transform the input patterns nonlinearly to an even higher dimensional space and then perform principal component analysis in the new space. It is inspired by the success of support vector machines (SVM) [189].

2.5.1 Recap of SVM

Consider a mapping $\phi : \mathbb{R}^D \mapsto H$, where $H$ is a Hilbert space. $H$ can be, for example, a (very) high dimensional Euclidean space. By convention, $\mathbb{R}^D$ and $H$ are called the input space and the feature space, respectively. The point $y_i$ in $\mathbb{R}^D$ is first transformed into the Hilbert space $H$ by $\phi(y_i)$. SVM assumes a suitable transformation $\phi(\cdot)$ such that the transformed data set is more linearly separable in $H$ than in $\mathbb{R}^D$, and a large margin classifier in $H$ is trained to separate the transformed data. It turns out that the large margin classifier can be trained by using only the inner products between the transformed data, $\langle\phi(y_i), \phi(y_j)\rangle$, without knowing $\phi(\cdot)$ explicitly. Therefore, in practice, the kernel function $K(y_i, y_j)$ is specified instead of $\phi(\cdot)$, where $K(y_i, y_j) = \langle\phi(y_i), \phi(y_j)\rangle$. Specifying the kernel function $K(\cdot,\cdot)$ instead of the mapping $\phi(\cdot)$ has the advantage of computational efficiency when $H$ is of high dimension. Also, this allows us to generalize to an infinite dimensional $H$, which happens when the radial basis function kernel is used. This use of a kernel function to replace an explicit mapping is often called "the kernel trick". Intuitively, the kernel function, being an inner product, represents the similarity between $y_i$ and $y_j$.

The kernel trick can be illustrated by the following example with $D = 2$. Let $\phi(y_i) \equiv (y_{i1}^2, \sqrt{2}\,y_{i1}y_{i2}, y_{i2}^2)^T$, where $y_i = (y_{i1}, y_{i2})^T$. The kernel function corresponding to this $\phi(\cdot)$ is $K(y_i, y_j) = (y_{i1}y_{j1} + y_{i2}y_{j2})^2$, because
$$K(y_i, y_j) = (y_{i1}y_{j1} + y_{i2}y_{j2})^2 = y_{i1}^2 y_{j1}^2 + 2\,y_{i1}y_{j1}y_{i2}y_{j2} + y_{i2}^2 y_{j2}^2 = (y_{i1}^2, \sqrt{2}\,y_{i1}y_{i2}, y_{i2}^2)\,(y_{j1}^2, \sqrt{2}\,y_{j1}y_{j2}, y_{j2}^2)^T = \phi(y_i)^T\phi(y_j).$$
Many different kernel functions have been proposed. The polynomial kernel, defined as $K(y_i, y_j) = (y_i^T y_j + 1)^r$ with $r$ as the parameter (degree) of the kernel, corresponds to a polynomial decision boundary in the input space. The radial basis function (RBF) kernel is defined by $K(y_i, y_j) = \exp(-\|y_i - y_j\|^2/w)$, where $w$ is the width parameter. SVM classifiers using the RBF kernel are related to RBF neural networks, except that for SVM, the centers of the basis functions and the corresponding weights are estimated by the quadratic programming solver simultaneously [229]. The choice of the appropriate kernel function in an application is difficult in general. This is still an active research area, with many principles being proposed [121, 154, 227].
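The algebra above can be checked numerically. The short sketch below compares the explicit feature map $\phi$ of the example with the kernel value $(y_i^Ty_j)^2$ for random 2-D points; it is only a sanity check of the identity, and the helper names are of course not part of any SVM library.

```python
import numpy as np

def phi(y):
    """Explicit feature map of the degree-2 example: R^2 -> R^3."""
    return np.array([y[0] ** 2, np.sqrt(2) * y[0] * y[1], y[1] ** 2])

def kernel(yi, yj):
    """Kernel value computed in the input space, without using phi."""
    return float(yi @ yj) ** 2

rng = np.random.default_rng(0)
yi, yj = rng.standard_normal(2), rng.standard_normal(2)
print(phi(yi) @ phi(yj))   # inner product in the feature space
print(kernel(yi, yj))      # same number, computed with the kernel trick
```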
2.5.2 Kernel PCA

One important lesson we can learn from SVM is that a linear algorithm in the feature space corresponds to a nonlinear algorithm in the input space. Different types of nonlinearity can be achieved by different kernel functions. Kernel PCA [228] utilizes this to generalize PCA to become nonlinear. For ease of notation, we shall assume $H$ is of finite dimension. (The case of infinite dimensional $H$ is similar, with operators replacing matrices and eigenfunctions replacing eigenvectors.)

KPCA follows the steps of the standard PCA, except the data set under consideration is $\{\phi(y_1), \ldots, \phi(y_n)\}$. Let $\tilde\phi(y_i)$ be the "centered" version of $\phi(y_i)$,
$$\tilde\phi(y_i) = \phi(y_i) - \frac{1}{n}\sum_{l}\phi(y_l).$$
The covariance matrix $C$ is given by
$$C = \frac{1}{n}\sum_{i}\tilde\phi(y_i)\,\tilde\phi(y_i)^T.$$
The eigenvalue problem $\lambda v = Cv$ is solved to find the (kernel) principal component $v$. Because
$$v = \frac{1}{\lambda n}\sum_{i}\tilde\phi(y_i)\big(\tilde\phi(y_i)^Tv\big),$$
$v$ is in the subspace spanned by the $\tilde\phi(y_i)$, and it can be written as $v = \sum_j \alpha_j\tilde\phi(y_j)$. Denote $\alpha = (\alpha_1, \ldots, \alpha_n)^T$. Let $\tilde K$ be the symmetric matrix such that its $(i,j)$-th entry $\tilde K_{ij}$ is $\tilde\phi(y_i)^T\tilde\phi(y_j)$. Rewrite $\lambda v = Cv$ as
$$\lambda\sum_{j}\alpha_j\,\tilde\phi(y_j) = \frac{1}{n}\sum_{j}\sum_{i}\alpha_j\tilde K_{ij}\,\tilde\phi(y_i). \qquad (2.5)$$
By multiplying both sides with $\tilde\phi(y_l)^T$, we have
$$\lambda\sum_{j}\alpha_j\tilde K_{lj} = \frac{1}{n}\sum_{ij}\alpha_j\tilde K_{ij}\tilde K_{li} \quad \forall\, l, \qquad (2.6)$$
which, in matrix form, can be written as
$$\lambda n\,\tilde K\alpha = \tilde K^2\alpha. \qquad (2.7)$$
Since $\tilde K$ is symmetric, $\tilde K$ and $\tilde K^2$ have the same set of eigenvectors. This set of eigenvectors is also the solution to the generalized eigenvalue problem in Equation (2.7). Therefore, $\alpha$, and hence $v$, can be found by solving $\lambda\alpha = \tilde K\alpha$. For projection purposes, it is customary to normalize $v$ to norm one. Since $\|v\|^2 = \alpha^T\tilde K\alpha$, we should divide $\alpha$ by $\sqrt{\alpha^T\tilde K\alpha}$.

To perform dimensionality reduction for $y$, it is first mapped to the feature space as $\tilde\phi(y)$, and its projection on $v$ is given by
$$\tilde\phi(y)^Tv = \sum_{i}\alpha_i\,\tilde\phi(y)^T\tilde\phi(y_i) = \alpha^T\tilde k_y, \qquad (2.8)$$
where $\tilde k_y = \big(\tilde\phi(y)^T\tilde\phi(y_1), \ldots, \tilde\phi(y)^T\tilde\phi(y_n)\big)^T$. Finally, by rewriting the relationship
$$\tilde K_{ij} = \tilde\phi(y_i)^T\tilde\phi(y_j) = \phi(y_i)^T\phi(y_j) - \frac{1}{n}\sum_{l=1}^{n}\phi(y_i)^T\phi(y_l) - \frac{1}{n}\sum_{l=1}^{n}\phi(y_l)^T\phi(y_j) + \frac{1}{n^2}\sum_{l=1}^{n}\sum_{k=1}^{n}\phi(y_l)^T\phi(y_k)$$
in matrix form, we have
$$\tilde K = H_nKH_n, \qquad (2.9)$$
where $H_n = I - \frac{1}{n}1_{n,n}$ is a centering matrix with $1_{n,n}$ denoting a matrix of size $n$ by $n$ with all entries one, and $K$ is the kernel matrix with its $(i,j)$-th entry given by $K(y_i, y_j)$. A similar expression can be derived for $\tilde\phi(y)^T\tilde\phi(y_j)$.

KPCA solves the eigenvalue problem of an $n$ by $n$ matrix, which may be larger than the $D$ by $D$ matrix considered by PCA. Recall that $D$ is the dimension of $y_i$. The number of possible features to be extracted by KPCA can therefore be larger than $D$. This contrasts with the standard PCA, where at most $D$ features can be extracted.

An interesting problem related to KPCA is how to map $z$, the projection of $\phi(y)$ into the subspace spanned by the first few kernel principal components, back to the input space. This can be useful for, say, image denoising with KPCA [185]. The search for the "best" $y'$ such that $\phi(y') \approx z$ is known as the pre-image problem, and different solutions have been proposed [160, 5].

In summary, KPCA consists of the following steps.

1. Let $K$ be the kernel matrix, where $K_{ij} = K(y_i, y_j)$. Compute $\tilde K$ by $\tilde K = H_nKH_n$.

2. Solve the eigenvalue problem $\lambda\alpha = \tilde K\alpha$ and find the eigenvectors corresponding to the largest few eigenvalues.

3. Normalize $\alpha$ by dividing it by $\sqrt{\alpha^T\tilde K\alpha}$.

4. For any $y$, its projection onto a principal component can be found by $\alpha^T\tilde k_y$, where $\tilde k_y = H_n\big(k_y - \frac{1}{n}K1_{n,1}\big)$, $k_y = \big(K(y, y_1), \ldots, K(y, y_n)\big)^T$, and $1_{n,1}$ is an $n$ by 1 vector with all entries equal to one.
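A compact NumPy sketch of these four steps is given below for the RBF kernel. The kernel choice, its width, and the dense eigen-solver are illustrative assumptions; the normalization follows the $\sqrt{\alpha^T\tilde K\alpha}$ convention above.

```python
import numpy as np

def rbf_kernel(A, B, w=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / w)

def kpca_fit(Y, d=2, w=1.0):
    """Steps 1-3 of KPCA: center the kernel matrix and find the alpha vectors."""
    n = len(Y)
    K = rbf_kernel(Y, Y, w)
    H = np.eye(n) - np.ones((n, n)) / n
    K_tilde = H @ K @ H                                  # step 1
    eigval, eigvec = np.linalg.eigh(K_tilde)             # step 2 (ascending order)
    idx = np.argsort(eigval)[::-1][:d]
    alphas = eigvec[:, idx]
    for c in range(d):                                   # step 3: scale each alpha
        a = alphas[:, c]
        alphas[:, c] = a / np.sqrt(a @ K_tilde @ a)
    return K, alphas

def kpca_project(Y, Ynew, K, alphas, w=1.0):
    """Step 4: project points onto the kernel principal components."""
    n = len(Y)
    H = np.eye(n) - np.ones((n, n)) / n
    k_new = rbf_kernel(Ynew, Y, w)                       # each row is k_y^T
    k_tilde = (k_new - K.sum(axis=1) / n) @ H            # row-wise H_n (k_y - K 1 / n)
    return k_tilde @ alphas

Y = np.random.rand(300, 3)
K, alphas = kpca_fit(Y, d=2, w=0.5)
X = kpca_project(Y, Y, K, alphas, w=0.5)
print(X.shape)   # (300, 2)
```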
2.6 ISOMAP

The basic idea of the isometric feature map (ISOMAP) [248] is to find a mapping that best preserves the geodesic distances between any two points on a manifold. Recall that the geodesic distance between two points on a manifold is defined as the length of the shortest path on the manifold that connects the two points. ISOMAP constructs a mapping from $y_i$ to $x_i$ ($x_i \in \mathbb{R}^d$) such that the Euclidean distance between $x_i$ and $x_j$ in $\mathbb{R}^d$ is as close as possible to the geodesic distance between $y_i$ and $y_j$ on the manifold.

Geodesic distances are hard enough to find when the manifold is known, let alone in the current case where the manifold is unknown and only points on the manifold are given. So, ISOMAP approximates the geodesic distances by first constructing a neighborhood graph to represent the manifold. The vertex $v_i$ in the neighborhood graph $G = (V, E)$ corresponds to the high dimensional data point $y_i$. An edge $e(i, j)$ between $v_i$ and $v_j$ exists if and only if $y_i$ is in the neighborhood of $y_j$, $N(y_j)$, and the weight of this edge is $\|y_i - y_j\|$. Details of $N(y_j)$ are described in Section 2.2. An example of a neighborhood graph is shown in Figure 2.4(b) for the data shown in Figure 2.4(a).

ISOMAP approximates a path on the manifold by a path in the neighborhood graph. The geodesic between $y_i$ and $y_j$ corresponds to the shortest path between $v_i$ and $v_j$. The problem of estimating the geodesic distances between all pairs of points $y_i$ and $y_j$ thus becomes the all-pairs shortest path problem in the neighborhood graph. It can be solved [46] either by the Floyd-Warshall algorithm, or by Dijkstra's algorithm with different source vertices. The latter is more efficient because the neighborhood graph is sparse. An example of how the shortest path approximates the geodesic is shown in Figure 2.4(c). It can be shown that the shortest path distances converge to the geodesic distances asymptotically [18].

Figure 2.4: Example of neighborhood graph and geodesic distance approximation. (a) Input data. (b) The neighborhood graph and an example of the shortest path. (c) This is the same as (b), except the manifold is flattened. The true geodesic (blue line) is approximated by the shortest path (red line).

The next step of ISOMAP finds $x_i$ that best preserve the geodesic distances. Let $g_{ij}$ denote the estimated geodesic distance between $y_i$ and $y_j$, and write $G = \{g_{ij}\}$ as the geodesic distance matrix. The optimal $x_i$ can be found by applying classical scaling [49], a simple multi-dimensional scaling technique. Let $d_{ij} = \|x_i - x_j\|$. Without loss of generality, assume $\sum_i x_i = 0$. We have the following:
$$d_{ij}^2 = (x_i - x_j)^T(x_i - x_j) = \|x_i\|^2 + \|x_j\|^2 - 2\,x_i^Tx_j,$$
$$\sum_{i} d_{ij}^2 = \sum_{i}\|x_i\|^2 + n\|x_j\|^2, \qquad \sum_{j} d_{ij}^2 = n\|x_i\|^2 + \sum_{j}\|x_j\|^2,$$
$$\sum_{ij} d_{ij}^2 = 2n\sum_{i}\|x_i\|^2, \quad\text{so}\quad \sum_{i}\|x_i\|^2 = \frac{1}{2n}\sum_{ij}d_{ij}^2, \qquad \|x_i\|^2 = \frac{1}{n}\sum_{j}d_{ij}^2 - \frac{1}{2n^2}\sum_{jk}d_{jk}^2,$$
and
$$2\,x_i^Tx_j = \frac{1}{n}\sum_{l}d_{il}^2 + \frac{1}{n}\sum_{l}d_{lj}^2 - \frac{1}{n^2}\sum_{lk}d_{lk}^2 - d_{ij}^2. \qquad (2.10)$$
If we replace $d_{ij}$ with the estimated geodesic distance $g_{ij}$ in Equation (2.10), then $b_{ij}$, the target inner product between $x_i$ and $x_j$, is given by
$$b_{ij} = \frac{1}{2}\Big(\frac{1}{n}\sum_{l}g_{il}^2 + \frac{1}{n}\sum_{l}g_{lj}^2 - \frac{1}{n^2}\sum_{lk}g_{lk}^2 - g_{ij}^2\Big). \qquad (2.11)$$
Let $A = \{a_{ij}\}$ with $a_{ij} = -\frac{1}{2}g_{ij}^2$. Equation (2.11) means that $B = H_nAH_n$, where $B = \{b_{ij}\}$, $H_n = I - \frac{1}{n}1_{n,n}$, and $1_{n,n}$ denotes an $n$ by $n$ matrix with all entries one.
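The double-centering $B = H_nAH_n$ and the subsequent eigen-decomposition (described in the next paragraphs) can be sketched in a few lines. The code below takes a matrix of approximate geodesic distances, forms $B$, and returns a $d$-dimensional classical-scaling embedding; computing the geodesic distances themselves is assumed to have been done already, for instance with Dijkstra's algorithm on the neighborhood graph of Section 2.2 (e.g., via scipy.sparse.csgraph.shortest_path).

```python
import numpy as np

def classical_scaling(G, d=2):
    """Embed points from a (geodesic) distance matrix G via classical scaling.

    G : (n, n) symmetric matrix of pairwise distances g_ij
    Returns X of shape (d, n), i.e., X = [sqrt(l_1) v_1, ..., sqrt(l_d) v_d]^T.
    """
    n = len(G)
    A = -0.5 * G ** 2                       # a_ij = -g_ij^2 / 2
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix H_n
    B = H @ A @ H                           # target inner products, Eq. (2.11)
    eigval, eigvec = np.linalg.eigh(B)      # ascending eigenvalues
    idx = np.argsort(eigval)[::-1][:d]      # keep the d largest
    lam, V = eigval[idx], eigvec[:, idx]
    lam = np.clip(lam, 0.0, None)           # guard against tiny negative values
    return (V * np.sqrt(lam)).T             # rows are sqrt(lambda_i) * v_i^T

# Toy usage with plain Euclidean distances (so the result reduces to ordinary MDS).
Y = np.random.rand(100, 3)
G = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
X = classical_scaling(G, d=2)
print(X.shape)   # (2, 100)
```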
Computing $H_nAH_n$ is effectively a centering operation on $A$, i.e., each column is subtracted by its corresponding column mean, and each row is subtracted by its corresponding row mean. Because multiplication by $H_n$ has this effect of "zeroing" the means of the different rows and columns, $H_n$ is often referred to as the centering matrix. The centering operation is also seen in other embedding algorithms such as KPCA (Section 2.5).

Since $B$ is the matrix of target inner products, we have $B = X^TX$, where $X = [x_1, \ldots, x_n]$. We recover $X$ by finding the best rank-$d$ approximation of $B$, which can be obtained via the eigen-decomposition of $B$. Let $\lambda_1, \ldots, \lambda_d$ be the $d$ largest eigenvalues of $B$ with corresponding eigenvectors $v_1, \ldots, v_d$. We have $X = [\sqrt{\lambda_1}v_1, \ldots, \sqrt{\lambda_d}v_d]^T$. Here, we assume $\lambda_i > 0$ for all $i = 1, \ldots, d$. Unlike Sammon's mapping, the objective function for the optimal $X$ is less explicit: it is the sum of square errors (squared Frobenius norm) between the target inner products ($b_{ij}$) and the actual inner products ($x_i^Tx_j$).

One drawback of ISOMAP is the $O(n^2)$ memory requirement for storing the dense matrix of geodesic distances. Also, solving the eigenvalue problem of a large dense matrix is relatively slow. To reduce both the computational and memory requirements, landmark ISOMAP [55] sets apart a subset of $\mathcal{Y}$ as landmark points and preserves only the geodesic distances from $y_i$ to these landmark points. A similar idea has been applied to Sammon's mapping before [25]. A continuum version of ISOMAP has also been proposed [282]. ISOMAP can fail when there is a "hole" in the manifold [66]. We also want to note that an exact isometric mapping of a manifold is theoretically possible only when the manifold is "flat", i.e., when the curvature tensor is zero, as pointed out in [16].

To summarize, ISOMAP consists of the following steps:

1. Construct a neighborhood graph using either the $\epsilon$ neighborhood or the knn neighborhood.

2. Solve the all-pairs shortest path problem on the neighborhood graph to obtain an estimate of the geodesic distances $g_{ij}$.

3. Compute $A = \{a_{ij}\}$, where $a_{ij} = -\frac{1}{2}g_{ij}^2$, and $B = H_nAH_n$.

4. Find the $d$ largest eigenvalues and the corresponding eigenvectors of $B$, and set $X = [\sqrt{\lambda_1}v_1, \ldots, \sqrt{\lambda_d}v_d]^T$.

2.7 Locally Linear Embedding

In locally linear embedding (LLE) [219, 226], each local region on a manifold is approximated by a linear hyperplane. LLE maps the high dimensional data points into a low dimensional space so that the local geometric properties, represented by the reconstruction weights, are best preserved. Specifically, $y_i$ is reconstructed by its projection $\hat y_i$ on the hyperplane $H$ passing through its neighbors $N(y_i)$ (defined in Section 2.2). Mathematically,
$$y_i \approx \hat y_i = \sum_{j} w_{ij}\,y_j,$$
with the constraint $\sum_j w_{ij} = 1$ to reflect the translational invariance of the reconstruction. By minimizing the sum of square errors of this approximation, we can also achieve invariance to rotation and scaling. The weights $w_{ij}$ reflect the local geometric properties of $y_i$. This interpretation of $w_{ij}$, however, is reasonable only when $y_i$ is well approximated by $\hat y_i$, i.e., when $y_i$ is close to $H$. The weights are found by solving the following optimization problem:
$$\min_{\{w_{ij}\}} \sum_{i}\Big\|y_i - \sum_{j}w_{ij}\,y_j\Big\|^2 \quad\text{subject to}\quad \sum_{j}w_{ij} = 1,\;\; w_{ij} = 0 \text{ if } y_j \notin N(y_i), \text{ for all } i. \qquad (2.12)$$
Now, write $N(y_i) = \{y_{\tau_1}, \ldots, y_{\tau_L}\}$ and denote $z_j = y_{\tau_j}$. Note that $y_i \notin N(y_i)$. The optimization problem (2.12) can be solved efficiently by first constructing an $L$ by $L$ matrix $F$ such that $f_{jk} = (z_j - y_i)^T(z_k - y_i)$.
Equivalently, $F = (Z - y_i1_{1,L})^T(Z - y_i1_{1,L})$, where $F = \{f_{jk}\}$, $1_{1,L}$ is a 1 by $L$ vector with all entries one, and $Z = [z_1, \ldots, z_L]$. The next step is to solve the equation
$$Fu = 1_{L,1}, \qquad (2.13)$$
and then normalize the solution $u$ by $\tilde u_j = u_j/\sum_j u_j$. (The normalization is valid because $\sum_j u_j = 1_{1,L}F^{-1}1_{L,1}$, and hence $\sum_j u_j$ cannot be zero, by the positive definiteness of $F^{-1}$.) The values of $\tilde u_j$ are assigned to the corresponding $w_{i\tau_j}$, i.e., $w_{i\tau_j} = \tilde u_j$, and the rest of the $w_{ij}$ are set to zero.

Sometimes, $F$ can be singular. This can happen when the neighborhood size $L$ is larger than $D$, the dimension of $y$. In this case, a small regularization term $\epsilon I_L$ is added to $F$ before solving Equation (2.13). This regularization has the effect of preferring values of $w_{ij}$ with small $\sum_j w_{ij}^2$. Finding $u_j$ is efficient because only small linear systems of equations are solved. Note that $u_j$ can be negative and $\hat y_i$ can be outside the convex hull of $N(y_i)$.

In the second phase of LLE, we seek $X = [x_1, \ldots, x_n]$ such that $x_i \approx \sum_j w_{ij}x_j$, with $x_i \in \mathbb{R}^d$. To make the problem well-defined, the additional constraints $\sum_i x_i = 0$ and $\sum_i x_ix_i^T = I_d$ are needed. The second constraint has the effect of both fixing the scale and enforcing different features in $x_i$ to carry independent information, by requiring the sample covariances between different variables in $x_i$ to be zero. The optimization problem is now
$$\min_{\{x_i\}} \sum_{i}\Big\|x_i - \sum_{j}w_{ij}\,x_j\Big\|^2 \quad\text{subject to}\quad \sum_{i}x_i = 0 \;\text{ and }\; \sum_{i}x_ix_i^T = I_d. \qquad (2.14)$$
Note the similarity between Equations (2.12) and (2.14). Let $x^{(i)}$ denote the $i$-th row of $X$. Equation (2.14) can be rewritten as
$$\min_{X}\;\mathrm{trace}\big(X(I - W)^T(I - W)X^T\big) \quad\text{subject to}\quad x^{(i)T}1_{n,1} = 0 \;\text{ and }\; x^{(i)T}x^{(j)} = \delta_{ij}. \qquad (2.15)$$
This can be solved by eigen-decomposition of $M = (I - W)^T(I - W)$. Note that $M$ is positive semi-definite. Let $v_j$ be the eigenvector corresponding to the $(j+1)$-th smallest eigenvalue. The optimal $X$ is given by $X = [v_1, \ldots, v_d]^T$. The first constraint is automatically satisfied because $1_{n,1}$ is the eigenvector of $M$ with eigenvalue 0. This eigenvalue problem is relatively easy because $M$ is sparse and can be represented as a product of the sparser matrices $(I - W)^T$ and $(I - W)$.

The above exposition of LLE assumes the pattern matrix as input. LLE can be modified to work with a dissimilarity matrix [226]. There is also a supervised extension of LLE [53, 54], which uses the class labels to modify the neighborhood structure. The kernel trick can also be applied to LLE to visualize the data points in the feature space [56]. The case when LLE is applied to data sets with a natural clustering structure has been examined in [206].

In summary, LLE includes the following steps:

1. Find the neighbors of each $y_i$ according to either the $\epsilon$-neighborhood or the knn neighborhood.

2. For each $y_i$, form the matrix $F$ and solve the equation $Fu = 1_{L,1}$. After normalizing $u$ by $\tilde u_j = u_j/\sum_j u_j$, set $w_{i\tau_j} = \tilde u_j$ and the remaining $w_{ij}$ to zero.

3. Find the second to the $(d+1)$-th smallest eigenvalues of $(I - W)^T(I - W)$ by a sparse eigenvalue solver and let $\{v_1, \ldots, v_d\}$ be the corresponding eigenvectors.

4. Obtain the reduced dimension representation by $X = [v_1, \ldots, v_d]^T$.

2.8 Laplacian Eigenmap

The approach taken by Laplacian eigenmap [16] for nonlinear mapping is different from those of ISOMAP and LLE. Laplacian eigenmap constructs orthogonal smooth functions defined on the manifold based on the Laplacian of the neighborhood graph. It has its roots in spectral graph theory [42]. As in ISOMAP, a
neighborhood graph $G = (V, E)$ is first constructed. Unlike ISOMAP, where the weight $w_{ij}$ of the edge $(v_i, v_j)$ represents the distance between $v_i$ and $v_j$, the weight in Laplacian eigenmap represents the similarity between $v_i$ and $v_j$. The weight $w_{ij}$ can be set by
$$w_{ij} = \exp\Big(-\frac{\|y_i - y_j\|^2}{4t}\Big), \qquad (2.16)$$
with $t$ as an algorithmic parameter, or it can be simply set to one. The use of the exponential function to transform a distance value into a similarity value can be justified by its relationship to the heat kernel [16].

The nonlinear mapping problem is recast as the graph embedding problem that maps the vertices in the neighborhood graph $G$ to $\mathbb{R}^d$. The first step is to find a "good" function $f(\cdot): V \mapsto \mathbb{R}$ that maps the vertices in $G$ to a real number. Since the domain of $f(\cdot)$ is finite, $f(\cdot)$ can be represented by a vector $u$, with $f(v_i) = u_i$. According to spectral graph theory, the smoothness of $f$ can be defined by
$$S \equiv \frac{1}{2}\sum_{ij}w_{ij}(u_i - u_j)^2. \qquad (2.17)$$
The intuition of $S$ is that, for large $w_{ij}$, the vertices $v_i$ and $v_j$ are "similar" and hence the difference between $f(v_i)$ and $f(v_j)$ should be small if $f(\cdot)$ is smooth. A smooth mapping $f(\cdot)$ is desirable because a faithful embedding of the graph should assign similar values to $v_i$ and $v_j$ when they are close. We can rewrite $S$ as
$$S = \frac{1}{2}\sum_{ij}w_{ij}\big(u_i^2 + u_j^2 - 2u_iu_j\big) = \frac{1}{2}\Big(\sum_i u_i^2\sum_j w_{ij} + \sum_j u_j^2\sum_i w_{ij} - 2\sum_{ij}w_{ij}u_iu_j\Big) = \sum_i u_i^2 d_{ii} - \sum_{ij}w_{ij}u_iu_j = u^TLu, \qquad (2.18)$$
where $L$ is the graph Laplacian defined by $L = D - W$, $W = \{w_{ij}\}$ is the graph weight matrix, and $D$ is a diagonal matrix with $d_{ii} = \sum_j w_{ij}$. The matrix $L$ can be thought of as the Laplacian operator on functions defined on the graph. Since $d_{ii}$ can be interpreted as the importance of $v_i$, the natural inner product between two functions $f_1(\cdot)$ and $f_2(\cdot)$ defined on the graph is $\langle f_1, f_2\rangle = u_1^TDu_2$. Because the constant function is the smoothest and is uninteresting, we seek $f(\cdot)$ to be as smooth as possible while being orthogonal to the constant function. The norm of $f(\cdot)$ is constrained to be one to make the problem well-defined. Thus we want to solve
$$\min_{u}\; u^TLu \quad\text{subject to}\quad u^TDu = 1 \;\text{ and }\; u^TD1_{n,1} = 0. \qquad (2.19)$$
This can be done by solving the generalized eigenvalue problem
$$Lu = \lambda Du, \qquad (2.20)$$
after noting that $1_{n,1}$ is a solution to Equation (2.20) with $\lambda = 0$. Here, $1_{n,1}$ denotes an $n$ by 1 vector with all entries one. As $L$ is positive semi-definite, the eigenvector corresponding to the second smallest eigenvalue of Equation (2.20) yields the desired $f(\cdot)$.

In general, $d$ orthogonal functions $\{f_1(\cdot), \ldots, f_d(\cdot)\}$ that are as smooth as possible are sought to map the vertices to $\mathbb{R}^d$. (Orthogonality is preferred as it suggests the independence of information; also, in PCA, each of the extracted features is orthogonal to the others.) The functions can be obtained from the eigenvectors corresponding to the second to the $(d+1)$-th smallest eigenvalues in Equation (2.20). The low dimensional representation of $y_i$ is then given by $x_i = (f_1(v_i), f_2(v_i), \ldots, f_d(v_i))^T$. In matrix form, $X = [u_1, \ldots, u_d]^T$.

The embedding problem of the neighborhood graph and the embedding problem of the points in the manifold are related in the following way. A smooth function $f(\cdot)$ that maps the point $y_i$ in the manifold to $x_i \in \mathbb{R}^d$ is preferable, because a faithful mapping should give similar values (small $\|x_i - x_j\|$) to $y_i$ and $y_j$ when $\|y_i - y_j\|$ is small. A small $\|y_i - y_j\|$ corresponds to a large $w_{ij}$ in the graph. Thus, intuitively, a smooth function defined on the graph corresponds to a smooth function defined
on the manifold. In fact, this relationship can be made more rigorous, because the graph Laplacian is closely related to the Laplace-Beltrami operator on the manifold, which in turn is related to the smoothness of a function defined on the manifold. The eigenvectors of the graph Laplacian correspond to the eigenfunctions of the Laplace-Beltrami operator, and the eigenfunctions with small eigenvalues provide a "smooth" basis of the functions defined on the manifold. The neighborhood graph used in Laplacian eigenmap can thus be viewed as a discretization tool for computation on the manifold.

There is also a close relationship between Laplacian eigenmap and spectral clustering. In fact, the spectral clustering algorithm in [194] is almost the same as first performing Laplacian eigenmap and then applying k-means clustering on the low dimensional feature vectors. The manifold structure discovered by Laplacian eigenmap can also be used to train a classifier in a semi-supervised setting [182]. The Laplacian of a graph can also lead to an interesting kernel function (as in SVM) for vertices in a graph [154]. This idea of nonlinear mapping via graph embedding has also been extended to learn a linear mapping [109] as well as generalized to the case when a vector is associated with each vertex in the graph [31].

To sum up, the steps for Laplacian eigenmap include:

1. Construct a neighborhood graph of $\mathcal{Y}$ by either the $\epsilon$-neighborhood or the knn neighborhood.

2. Compute the edge weight $w_{ij}$ by either $\exp\big(-\|y_i - y_j\|^2/(4t)\big)$, or simply set $w_{ij}$ to 1.

3. Compute $D$ and the graph Laplacian $L$.

4. Find the second to the $(d+1)$-th smallest eigenvalues in the generalized eigenvalue problem $Lu = \lambda Du$ and denote the eigenvectors by $u_1, \ldots, u_d$. The low dimensional feature vectors are given by $X = [u_1, \ldots, u_d]^T$.

2.9 Global Co-ordinates via Local Co-ordinates

Recall that in Section 2.2, an atlas of a manifold $M$ is defined as a collection of co-ordinate charts that covers the entire $M$, and overlapping charts can be "connected" smoothly. This idea has inspired several nonlinear mapping algorithms [220, 29, 247] which construct different local charts and join them together. There are two stages in these types of algorithms. First, different local models are fitted to the data, usually by means of a mixture model. Each local model gives rise to a local co-ordinate system. A local model can be, for example, a Gaussian or a factor analyzer. Let $z_{is}$ be the local co-ordinate given to $y_i$ by the $s$-th local co-ordinate system. Let $r_{is}$ denote the suitability of using the $s$-th local model for $y_i$. We require $r_{is} \ge 0$ and $\sum_s r_{is} = 1$. The introduction of $r_{is}$ can represent the fact that only a small number of local models are meaningful for each $y_i$. Typically, $r_{is}$ is obtained as the posterior probability of the $s$-th local model given $y_i$.

In the second stage, different local co-ordinates of $y_i$ are combined to give a global co-ordinate. Let $g_{is}$ be the global co-ordinate of $y_i$ due to the $s$-th local model, and let $g_i \in \mathbb{R}^d$ be the corresponding "combined" global co-ordinate. In the three papers considered here, $g_{is}$ is simply an affine transform of the local co-ordinate, $g_{is} = L_s\tilde z_{is}$. Here, $\tilde z_{is}$ is the "augmented" $z_{is}$, $\tilde z_{is} = [z_{is}^T, 1]^T$. $L_s$ is the (unknown) transformation matrix with $d$ rows for the $s$-th local model.
Note that it is desirable for neighboring local models to be "similar" so that the global co-ordinates are more consistent. An important characteristic of the algorithms in this section is that, unlike ISOMAP, LLE, or Laplacian eigenmap, extension to a point $y$ that is outside the training data $\mathcal{Y}$ is easy after computing $z_s$ and $r_s$ for the different $s$.

2.9.1 Global Co-ordination

In the global co-ordination algorithm in [220], the first and the second stages are performed simultaneously by the variational method. The first stage is done by fitting a mixture of factor analyzers. Under the $s$-th local model, a data point is modeled by
$$y_i = \mu_s + A_sz_{is} + \epsilon_{is}, \qquad (2.21)$$
where $\mu_s$ is the mean, $A_s$ is the factor loading matrix, and $\epsilon_{is}$ is the noise that follows $\mathcal{N}(0, \Psi_s)$, a multivariate Gaussian with mean $0$ and covariance $\Psi_s$. By the definition of a factor analyzer, $\Psi_s$ is diagonal. The hidden variable $z_{is}$ is assumed to follow $\mathcal{N}(0, I)$. The scale of $z_{is}$ is unimportant because it can be absorbed by the factor loading matrix. Let $\alpha_s$ be the prior probability of the $s$-th factor analyzer. The parameters are $\{\alpha_s, \mu_s, A_s, \Psi_s\}$, and the data density is given by
$$p(y_i) = \sum_s \alpha_s\int p(y_i|z_{is}, s)\,p(z_{is})\,dz_{is} = \sum_s \alpha_s\,(2\pi)^{-D/2}\big(\det(A_sA_s^T + \Psi_s)\big)^{-1/2}\exp\Big(-\frac{1}{2}(y_i - \mu_s)^T(A_sA_s^T + \Psi_s)^{-1}(y_i - \mu_s)\Big). \qquad (2.22)$$
We define $r_{is}$ as the posterior probability of the $s$-th local model given $y_i$, $P(s|y_i)$, and it can be computed based on Equation (2.22). Equation (2.22) also gives rise to $p(z_{is}|s, y_i)$ and hence $p(g_{is}|s, y_i)$, because $g_{is}$ is a function of $z_{is}$ and $L_s$. The posterior probability of the global co-ordinate is defined as
$$p(g_i|y_i) = \sum_s p(g_{is}|s, y_i)\,P(s|y_i). \qquad (2.23)$$
Equation (2.23) assumes that the overall global co-ordinate $g_i$ is selected among the different $g_{is}$, with $s$ stochastically selected according to the posterior probability of the $s$-th model. In the case where $y_i$ is likely to be generated either by the $j$-th or the $k$-th local model, the corresponding global co-ordinates $g_{ij}$ and $g_{ik}$ should be similar. This implies that the posterior density $p(g_i|y_i)$ should be unimodal. Enforcing the unimodality of $p(g_i|y_i)$ directly is difficult. So, the authors in [220] instead drive $p(g_i|y_i)$ to be as similar to a Gaussian distribution as possible by adding an extra term to the log-likelihood objective function to be maximized:
$$\Phi = \sum_i \log p(y_i) - \sum_i D_{KL}\big(q(g_i, s|y_i)\,\big\|\,p(g_i, s|y_i)\big). \qquad (2.24)$$
Here, $D_{KL}(Q\|P)$ is the Kullback-Leibler divergence defined as
$$D_{KL}(Q\|P) = \int Q(y)\,\log\frac{Q(y)}{P(y)}\,dy, \qquad (2.25)$$
and $q(g_i, s|y_i)$ is assumed to be factorized as $q(g_i, s|y_i) = q_1(g_i|y_i)\,q_2(s|y_i)$, with $q_1(g_i|y_i)$ a Gaussian and $q_2(s|y_i)$ a multinomial distribution. This addition of a divergence term between a posterior distribution and a factorized distribution is commonly seen in the literature on the variational method. The objective function in Equation (2.24) can be maximized by an EM-type algorithm, which estimates the parameters $\{\alpha_s, \mu_s, A_s, \Psi_s, L_s\}$ as well as the parameters of $q_1(g_i|y_i)$ and $q_2(s|y_i)$. Since the first and the second stages are carried out simultaneously, local models that lead to consistent global co-ordinates are implicitly favored.

2.9.2 Charting

For the charting algorithm in [29], the first and the second stages are performed separately. This decoupling decreases the complexity of the optimization problem and can reduce the chance of getting trapped in poor local minima. In the first stage, a mixture of Gaussians is fitted to the data,
$$p(y) = \sum_s \alpha_s\,\mathcal{N}(y; \mu_s, \Sigma_s), \qquad (2.26)$$
with the constraint that two adjacent Gaussians should be "similar".
This is achieved by using a prior distribution on the mean vectors and the covariance matrices that encourages the similarity of adjacent Gaussians:
$$p(\{\mu_s\}, \{\Sigma_s\}) \propto \exp\Big(-\sum_s\sum_{j\neq s}\Lambda_s(\mu_j)\,D_{KL}\big(\mathcal{N}(\mu_s, \Sigma_s)\,\big\|\,\mathcal{N}(\mu_j, \Sigma_j)\big)\Big), \qquad (2.27)$$
where $\Lambda_s(\mu_j)$ measures the closeness between the locations of the $s$-th and the $j$-th Gaussian components. It is set to $\Lambda_s(\mu_j) \propto \exp\big(-\|\mu_s - \mu_j\|^2/(2\sigma^2)\big)$, where $\sigma$ is a width parameter determined according to the neighborhood structure. The prior distribution also makes the parameter estimation problem better conditioned. In practice, $n$ Gaussian components are used, with the center $\mu_i$ of the $i$-th component set to $y_i$ and the weight of each component set to $1/n$. The only parameters to be estimated are the covariance matrices. The MAP estimates of the covariance matrices can be shown to satisfy a set of constrained linear equations, and they are obtained by solving this set of equations.

In the second stage, the local co-ordinate $z_{is}$ is first obtained as $z_{is} = V_s^T(y_i - \mu_s)$, where $V_s$ consists of the $d$ leading eigenvectors of $\Sigma_s$. We can regard $z_{is}$ as the feature extracted from $y_i$ using PCA on the $s$-th local model. The local model weight $r_{is}$ is, once again, set to the posterior probability of the $s$-th local model given $y_i$. The transformation matrices $L_s$ are found by solving the following weighted least square problem:
$$\min_{\{L_s\}}\;\sum_{i,j,k} r_{ij}r_{ik}\,\|L_j\tilde z_{ij} - L_k\tilde z_{ik}\|_F^2. \qquad (2.28)$$
Here, $\|X\|_F^2$ denotes the square of the Frobenius norm, $\|X\|_F \equiv \sqrt{\mathrm{trace}(X^TX)}$. Intuitively, we want to find the transformation matrices such that the global co-ordinates due to different local models are the most consistent in the least square sense, weighted by the importance of the different local models.

Equation (2.28) can be solved as follows. Let $K$ and $h$ be the number of local models and the length of the augmented local co-ordinate $\tilde z_{is}$, respectively. Define $\tilde Z_s = [\tilde z_{1s}, \ldots, \tilde z_{ns}]$ as the $h$ by $n$ matrix of augmented local co-ordinates using the $s$-th local model for all the data points. Define the $Kh$ by $n$ matrix $T_s$ by $T_s = [0_{n,(s-1)h}, \tilde Z_s^T, 0_{n,(K-s)h}]^T$, where $0_{n,m}$ denotes a zero matrix of size $n$ by $m$. Let $P_s$ be an $n$ by $n$ diagonal matrix whose $(i,i)$-th entry is $r_{is}$. The solution to Equation (2.28) is given by the $d$ trailing eigenvectors of the $Kh$ by $Kh$ matrix $QQ^T$, where $Q = \sum_{j=1}^{K}\sum_{k=1}^{K}(T_j - T_k)P_jP_k$. Note that the second stage is independent of the first stage. In particular, an alternative collection of local models can be used, as long as $z_{is}$ and $r_{is}$ can be calculated.

2.9.3 LLC

The LLC algorithm described in [247] concerns the second stage only. Given the local co-ordinates $z_{is}$ and the model confidences $r_{is}$ computed from the first stage, the LLC algorithm finds the best $L_s$ such that the local geometric properties are best preserved in the sense of the LLE loss function. The global co-ordinate $g_i$ is assumed to be a weighted sum of the form
$$g_i = \sum_s r_{is}\,g_{is} = \sum_s r_{is}\,L_s\tilde z_{is}. \qquad (2.29)$$
Suppose there are $K$ local models, each of which gives a local co-ordinate $z_{is}$ in an $(h-1)$-dimensional space. (In general, different local models can give local co-ordinates with different lengths, as emphasized in [247]. Here we assume a common $h$ for ease of notation.) We stack $r_{is}\tilde z_{is}$ for the different $s$ to get a vector of length $Kh$, $u_i = (r_{i1}\tilde z_{i1}^T, r_{i2}\tilde z_{i2}^T, \ldots, r_{iK}\tilde z_{iK}^T)^T$, and concatenate the different $L_s$ to form a $d$ by $Kh$ matrix $J = [L_1, L_2, \ldots, L_K]$. (Each $L_s$ is of size $d$ by $h$.) Equation (2.29) can be rewritten as $g_i = Ju_i$. The global co-ordinate matrix, $G = (g_1, \ldots, g_n)$, is thus given by $G = JU$, where $U$ is a $Kh$ by $n$ matrix $U = [u_1, \ldots, u_n]$. Denote the $i$-th row of $J$ by $j^{(i)}$.
If we substitute $G$ for $X$ in the LLE objective function in Equation (2.15), we have
$$\min_{J}\;\mathrm{trace}\big(JU(I - W)^T(I - W)U^TJ^T\big) \quad\text{subject to}\quad j^{(i)}U1_{n,1} = 0 \;\text{ and }\; j^{(i)}UU^Tj^{(j)T} = \delta_{ij}, \qquad (2.30)$$
where $W$ is defined in the same way (the neighborhood reconstruction weights) as in Section 2.7, and $1_{n,1}$ denotes an $n$ by 1 vector with all entries one. Note that obtaining $W$ is efficient (see Section 2.7 for details). The value of $j^{(i)}$ can be obtained as the solution of the generalized eigenvalue problem $\big(U(I - W)^T(I - W)U^T\big)v = \lambda\,(UU^T)\,v$. The authors in [247] claim that the $j^{(i)}$ thus obtained satisfies the constraint $j^{(i)}U1_{n,1} = 0$ automatically, because $U1_{n,1}$ is an eigenvector of the generalized eigenvalue problem with eigenvalue 0. However, this is not true in general. In any case, the authors in [247] use the eigenvectors corresponding to the second to the $(d+1)$-th smallest eigenvalues as the solution for $J$. Note that this generalized eigenvalue problem involves a $Kh$ by $Kh$ matrix, instead of a large $n$ by $n$ matrix as in the original LLE. After finding the $j^{(i)}$, $J$ and hence the $L_s$ are reconstructed. The global co-ordinates are obtained via Equation (2.29).

The idea of this algorithm is somewhat analogous to the locality preserving projection (LPP) algorithm [109]. LPP simplifies the eigenvalue problem by the extra information that the projection should be linear, whereas the current algorithm simplifies the eigenvalue problem by the given mixture model.

2.10 Experiments

We applied some of these algorithms to three synthetic 3D data sets. The data manifolds and the data points can be seen in Figure 2.5. The first data set, parabolic, consists of 2000 randomly sampled data points lying on a paraboloid. It is an example of a nonlinear manifold with a simple analytic form, a second degree polynomial in the co-ordinates in this case. The second data set swiss roll and the third data set S-curve are commonly used for validating manifold learning algorithms. Again, 2000 points are randomly sampled from the "Swiss roll" and the S-shaped surface to create the data sets, respectively.

KPCA, ISOMAP, LLE, and Laplacian eigenmap were run on these 3D data sets to project the data to 2D. We have implemented KPCA and Laplacian eigenmap ourselves, while the implementations of ISOMAP (http://isomap.stanford.edu) and LLE (http://www.cs.toronto.edu/~roweis/lle/) were downloaded from their respective web sites. For ISOMAP, LLE, and Laplacian eigenmap, the knn neighborhood with k = 12 is used. The edge weight is set to one for Laplacian eigenmap. For KPCA, a polynomial kernel with degree 2 is used. For comparison, the standard PCA and Sammon's mapping were also performed on these data sets. Sammon's mapping is initialized by the result of PCA. The results of these algorithms can be seen in Figures 2.6, 2.7, and 2.8. The data points are colored differently to visualize their locations on the manifold. We intentionally omit the "goodness-of-fit" or "error" of the projection results, because the criteria used by the different algorithms (Sammon's stress in Sammon's mapping, correlation of distances in ISOMAP, reconstruction error in LLE, residual variance in PCA and KPCA, to name a few) are very different and it can be misleading to compare them.
For the parabolic data set, we can see in Figures 2.6(b) and 2.6(c) that both ISOMAP and LLE recover the intrinsic co—ordinates very well, because the changes in the color of the data points after embedding are smooth. Since this manifold is quadratic, we expect that KPCA with a. quadratic kernel function should also recover the true structure of the data. It turns out that the first two kernel principal components cannot lead to a clean mapping of the data points. Instead, the second and the third kernel principal components extract the structure of the data (Figure 2.6(a)). The first two features extracted by Laplacian eigenmap cannot recover the 10ISOMAP web site: http://stanford. isomap.edu 11LLE web site: http://www.cs.toronto.edu/~roweis/lle/ 75 20 \ ‘ T i 1, 'Fl 7 ' 7' ‘7‘, o 0 ° -30 —20 -1o 0 ~60 -50 —4o -30 -20 —10 —so -50 —40 (a) parabolic, the manifold (b) parabolic, the data -15 —10 -5 o 5 $_,_ 10 15 20 0 (c) swiss roll, the manifold -0.5 o 0.5 1 o 0-5; 1 o (e) S-curve, the manifold (f) S—curve, the data Figure 2.5: Data sets used in the experiments for nonlinear mapping. The manifold and the data points are shown. The data points are colored according to the major structure of the data as perceived by human. 76 desired trend in the data. The target structure with slight distortion can be recovered if the second and the third extracted features are used instead (Figure 2.6(d)). PCA and Sammon’s mapping cannot recover the structure of this data set (Figures 2.6(c) and 2.6(f)). The similarity of the results of PCA and Sammon’s mapping can be attributed to the fact that Saminons mapping is initialized by the PCA solution. The initial PCA solution is already a good solution with respect to Sammon’s stress for this low-dimensional data set. For the data set swiss roll, we can see from Figures 2.7(b) and 2.7(c) that ISOMAP and LLE performed a good job “unfolding” the manifold. For Laplacian eigenmap, once again, the first two extracted features cannot be interpreted easily, though the structure of the data set is revealed if the second and the third features are used (Figure 2.7(d)). KPCA cannot recover the intrinsic structure of the data set no matter which kernel principal component is used. An example of the poor result of KPCA is shown in Figure 2.7(a). PCA and Sammon’s mapping also cannot recover the underlying structure (Figures 2.7(c) and 2.7(f)). The results for the third data set S-curve (Figure 2.8) are similar to those of swiss roll, with the exception that Laplacian eigenmap can recover the desired structure using the first two extracted features. In addition to these synthetic data sets, we have also tested these nonlinear map- ping algorithms on a high—dimensional real world data set: the face images used in [175] The task here is to classify a 64 by 64 face image in this data set as either the “Asian class” or the “non-Asian class”. This data set will be described in more details in Section 3.3. The results of mapping these 4096B data points to 3D can be 77 seen in Figure 2.9. Data. points from the two classes are shown in different colors. The (training) error rates using quadratic discriminant analysis are also computed for different mappings. As we can see from Figures 2.9(a), 2.9(d), 2.9(e) and 2.9(f), the mapping results by Laplacian eigenmap, KPCA, PCA and Sammon’s mapping are not very useful. The two classes are not well-separated, and the error rates are also high. ISOMAP maps the two classes more separately and has smaller error rates (Figure 2.9(b)). 
For LLE (Figure 2.9(e)), although the mapping results look a bit unnatural, the error rate turns out to be the smallest, indicating the two classes are reasonably separated. It should be noted that the intrinsic dimensionality of this data set is probably higher than 3. So. mapping the data to 3D, while good for visualization, can lose some information and is suboptimal for classification. From these experiments, we can see that both ISOMAP and LLE recover the intrinsic structure of the data sets well. The performance of Laplacian eigenmap is less satisfactory. We have attempted to set the edge weight by the exponential function of distances (Equation (2.16)) instead of one, but the preliminary results suggest that a good choice of the width parameter t is hard to obtain. The standard PCA and Sammon’s mapping cannot recover the target structure of the data. It is not surprising, because PCA is a linear algorithm and the underlying structure of the data cannot be reflected by any linear function of the features. For Sammon’s mapping, it does not give very good results because Sammon’s mapping is “global”, meaning that the relationship between all pairs of data points in the 3D space is considered. Local properties of the manifold cannot be modeled. The reason for the failure of KPCA is that the parametric representation of the manifold for swiss r011 78 and S-curve and the face images is hard to obtain, and is certainly not quadratic. So, the assumption in KPCA is violated and this leads to poor results. 2.11 Summary In this chapter, we have described different approaches for nonlinear mapping based on fairly different principles. The algorithms ISOMAP, LLE, and Laplacian eigenmap are non-iterative and require mainly eigen-decomposition, which is well understood with many off-the-shelf algorithms available. ISOMAP, LLE, and Laplacian eigen— map are basically non-parametric algorithms. While this provides extra flexibility to model the manifold, more data points are needed to give a good estimate of the low dimensional vector. The basic version of some of the algorithms (Sammon’s mapping, ISOMAP, LLE, and Laplacian eigenmap) cannot generalize the mapping to patterns outside the training set y, though an out-of-sample extension has been proposed [17]. There are interesting connections between some of these algorithms. ISOMAP, LLE, and Laplacian Eigenmap can be shown to be the special cases of KPCA [105]. The matrix M in LLE can be shown to be related to the square of the Laplacian Beltrami operator [16], an important concept in Laplacian eigenmap. While these techniques have been successfully applied to high dimensional data sets like face images, digit images, texture images, motion data, and textual data, the relative merits of these algorithms in practice are still not clear. More comparative studies like the one in [196] would be helpful. 79 KPCA, 2nd and 3rd . . 60 0.08[ 5 , - .5 0.06- ‘ 0.04» l 0.02- ol -o.02r i. I -o.o4l " ‘ —0.06' '68 -0‘1 -0 05 o 065 2360 -4‘0 4? 0 f 20 40 60 80 (a) KPCA, 2nd and 3rd (b) ISOMAP LLE m E Laplacian Eigenmap, 2nd and 3rd ‘9 ' ' 0.UIJ ' ' ' ' ' , \ 3] . " 0.01 2] I 0.005. 4 1 : , 0 0] —0.005: -1 - , -2» '; —0.01> 1 -3. I —o.o15- '34 —2 0 2 4 ‘oflfoz 4.615 001 —o.005 0 0.005 0.01 (c) LLE (d) Laplacian Eigenmap, 2nd and 3rd PCA Sammon 3G . . ’ 4c . . l < 30. ..... ~“f:."v.“.:~ I 20' Lg.- ’ 10* . .; '3. 0- ' "°' 52' . —2o- -20 ..ttg-é‘xfiflm. _ . 
' —30* “W —20 0 20 4o 60 "1‘60 —40 —20 0 20 (e) Standard PCA (f) Sammon’s mapping Figure 2.6: Results of nonlinear mapping algorithms on the parabolic data set. “2nd and 3rd” in the captions means that we are showing the second and the third com— ponents, instead of the first two. 80 ISOMAP M KPCA. 1st and 2nd M Ma ' ' ‘V 0.06 1 ‘15L 004* . l 10’ 0.02- 5- 0+ ' 0- -0.02L ' —5' -0.04» " - l —10- -0.06’ < —15- _o.no I A A | _2I\ A 306 —0.04 —0.02 0 0.02 J60 —40 —20 0 (b) ISOMAP Laplacian Eigenmap. 2nd and 3rd 0.015 ' v .. A , "f". ‘ 0.01’ _ 1".) -. ’ 0.005- {11.777 ’ 0' if" :5 -0005- ‘3“‘a-‘zlvu: . i: "Z -0.01- "0"110301 —0.005 0 0.0 5 0.01 (d) Laplacian Eigenmap, 2nd and 3rd Sammon 15‘ f 15 . forfli' ‘.~‘;'?9’ft‘,'n"=$f ’3‘. 9.1%} .fi‘f’. F? 1ol:‘”“”'. . ' z- :1 :‘. 5 1'0 1‘5 20 '1—‘75 40 -5 0 5 (e) Standard PCA (f) Sammon‘s mapping Figure 2.7: Results of nonlinear mapping algorithms on the swiss r011 data set. “2nd and 3rd” in the captions means that we are showing the second and the third com- ponents, instead of the first two. 81 -2 ISOMAP 35 0 5 (b) ISOMAP Laplacian Eigenmap 0.01 s . ‘ n:,._ "w Kai... ‘ 0.005 3?] :L.'.' ‘ 0- .;-,- '5 if ‘ i ‘ —0.005 if 7. ‘ '\:__ , 2 “0911.01 —0.005 0.005 0.01 (d) Laplacian Eigenmap Sammon .J 2. 2‘ ‘ 1- 3'13 tiniest ‘f" .- . 0. vi -1. _2- 4 —2 1 o 2 3 ‘34 -3 -2 —i 0 i 2 3 (e) Standard PCA (f) Sammon’s mapping Figure 2.8: Results of nonlinear mapping algorithms on the S—curve data set. 82 °-‘ -0.06 —0.04 -0.02 0 0.02 0.04 0.06 (a) KPCA,35.2% error : 0W‘ 0 4~« 0‘ 3.. -0.(X)5* 2— -0.01 1‘ -0.015~ 0“ .0.02~ 5 _1_ L o 43.0%- 10 n 0.01 " I / I / I 0 4 -4 -3 -2 —1 0 1 2 '5 ‘°‘°2 -0.00 —5 m (c) LLE, 9.2% error (d) Laplacian Eigenmap, 41.5% error -0..‘. 0.3- —0.6— 02- -0.8“‘ o - 0.1— . -1q 0~ i -0.1— 42‘ -02~ -1.4~ '0'3‘ “05 -10— 413.? .4 4.8 r I 1 l . ° 02 0.4 "0‘5 —0.2 0 0.2 0.4 0.6 0.8 1 1324 5 (e) Standard PCA, 36.4% error (f) Sammon’s mapping, 39.3% error Figure 2.9: Results of nonlinear mapping algorithms on the face images. The two classes (Asians and non-Asians) are shown in two different colors. The (training) error rates by applying quadratic discriminant analysis on the low dimensional data points are shown in the captions. 83 Chapter 3 Incremental Nonlinear Dimensionality Reduction By Manifold Learning In Chapter 2, we discussed different algorithms to achieve dimensionality reduction by nonlinear mapping. Most of these nonlinear mapping algorithms operate in a batch model, meaning that all the data points need to be available during train- ing. In applications like surveillance, where (image) data are collected sequentially, batch method is computationally demanding: repeatedly running the “batch” ver- sion whenever new data points become available takes a long time. It is wasteful to discard previous computation results. Data accumulation is particularly beneficial to manifold learning algorithms due to their non-parametric nature. Another reason for 1Sammon’s mapping can be implemented by a feed-forward neural network [180] and hence can be made online if an online training rule is used. 84 developing incremental (non-batch) methods is that the gradual changes in the data manifold can be visualized. As more and more data points are obtained, the evolution of the data manifold can reveal interesting properties of the data stream. Incremental learning can also help us to decide when we should stop collecting data: if there is no noticeable change in the learning result with the additional data collected, there is no point in continuing. 
The intermediate result produced by an incremental algo— rithm can prompt us about the existence of any “problematic” region: we can focus the remaining data collection effort on that region. An incremental algorithm can be easily modified to incorporating “forgetting”, i.e., the old data points gradually ‘ lose their significance. The algorithm can then adjust the manifold in the presence of the drifting of data characteristics. Incremental learning is also useful when there is an unbounded stream of possible data to learn from. This situation can arise when a continuous invariance transformation is applied to a finite set of training data to create additional data to reflect pattern invariance. In this chapter, we describe a modification of the ISOMAP algorithm so that it can update the low dimensional representation of data points efficiently as additional samples become available. Both the original ISOMAP algorithm [248] and its land— mark points version [55] are considered. We are interested in ISOMAP because it is intuitive, well understood, and produces good mapping results [133, 276]. Fur- thermore, there are theoretical studies supporting the use of ISOMAP, such as its convergence proof [18] and the conditions for successful recovery of co-ordinates [66]. There is also a continuum extension of ISOMAP [282] as well as a spatio—temporal extension [133]. However, the motivation of our work is applicable to other mapping 85 algorithms as well. The main contributions of this chapter include: 1. An algorithm that efficiently updates the solution of the all-pairs shortest path problems. This contrasts with previous work like [193], where different shortest path trees are updated independently. 2. More accurate mappings for new points by a superior estimate of the inner products. 3. An incremental eigen-decomposition problem with increasing matrix size is solved by subspace iteration with Ritz acceleration. This differs from previ- ous work [270] where the matrix size is assumed to be constant. 4. A vertex contraction procedure that improves the geodesic distance estimate without additional memory. The rest of this chapter is organized as follows. After a recap of ISOMAP in section 3.1, the proposed incremental methods are described in section 3.2. Experimental results are presented in section 3.3, followed by discussions in section 3.4. Finally, in section 3.5 we conclude and describe some topics for future work. 3.1 Details of ISOMAP The basic idea of the ISOMAP algorithm was presented in Section 2.6. It maps a high dimensional data set y1, . . . , yn in RD to its low dimensional counterpart x1, . . . ,xn in R“, in such a way that the geodesic distance between y,- and y j on the data manifold 86 is as close to the Euclidean distance between xi and xJ- in IR“ as possible. In this section, we provide more algorithmic details on how the mapping is done. This also defines the notation that we are going to use throughout this chapter. The ISOMAP algorithm has three stages. First, a neighborhood graph is con- structed. Let AU be the (Euclidean) distance between y,- and yj. A weighted undi— rected neighborhood graph 9 = (V, E) with the vertex v,- E V corresponding to yz- is constructed. An edge e(i, j) between vi and vj exists if and only if y,- is a neighbor of yj, i.e., yz- E N(yj). The weight of e(i,j), denoted by wij, is set to Aij. The set of indices of the vertices adjacent to v,- in Q is denoted by adj (2) ISOMAP proceeds with the estimation of geodesic distances. 
Let gij denote the length of the shortest path sp(z', 3') between vi and vj. The shortest paths are found by the Dijkstra’s algorithm with different source vertices. The shortest paths can be stored efficiently by the predecessor matrix wij, where 7n]- = k if vk is immediately before '03- in sp(i, j ) If there is no path from v,- to U], 7%- is set to O. Conceptually, however, it is useful to imagine a shortest path tree T(i), where the root node is v,- and sp(z', 3') consists of the tree edges from v, to vj. The subtree of T(Z) rooted at va is denoted by T(z'; a). Since gij is the approximate geodesic distance between yz- and yj, we shall call gij the “geodesic distance”. Note that G = {gij} is a symmetric matrix. Finally, ISOMAP recovers x,- by using the classical scaling [49] on the geodesic distance. Define X = [x1,...,xn]. Compute B = —1/2HCH, where H = {bl-j}, hij = 6ij —1/n and 613- is the delta function, i.e., dij = 1 iii = j and 0 otherwise. The entries 5”),- - of C are simply 9.2.. We seek XTX to be as close to B as possible in the least .7 2] 87 square sense. This is done by setting X = [\//\1v1 . . . \/)\dvd]T, where /\1, . . . , Ad are the (1 largest eigenvalues of B, with corresponding eigenvectors v1, . . . ,vd. 3.2 Incremental Version of ISOMAP The key computation in ISOMAP involves solving an all-pairs shortest path problem and an eigen-decomposition problem. As new data arrive, these quantities usually do not change much: a new vertex in the graph often changes the shortest paths among only a subset of the vertices, and the simple eigenvectors and eigenvalues of a slightly perturbed real symmetric matrix stay close to their original values. This justifies the reuse of the current geodesic distance and co—ordinate estimates for update. we restrict our attention to knn neighborhood, since e-neighborhood is awkward for incre- mental learning: the neighborhood size should be constantly decreasing as additional data points become available. The problem of incremental ISOMAP can be stated as follows. Assume that the low dimensional co-ordinates x,- of yi for the first n points are given. We observe the new sample yn+1. How should we update the existing set of x, and find xn+17 Our solution consists of three stages. The geodesic distances gij are first updated in view of the change of neighborhood graph due to on“. The geodesic distances of the new point to the existing points are then used to estimate xn+1. Finally, all xi are updated in view of the change in gij. In section 3.2.1, we shall describe the modification of the original ISOMAP for incremental updates. A variant of ISOMAP that utilizes the geodesic distances from 88 a fixed set of points (landmark points) [55] is modified to become incremental in section 3.2.2. Because ISOMAP is non-parametric, the data points themselves need to be stored. Section 3.2.3 describes a. vertex contraction procedure, which improves the geodesic distance estimate with the arrival of new data without storing the new data. This procedure can be applied to both the variants of ISOMAP. Throughout this section we assume (1 (dimensionality of the projected space) is fixed. This can be estimated by analyzing either the spectrum of the target inner product matrix or the residue of the low rank approximation as in [248], or by other methods to estimate the intrinsic dimensionality of a manifold [143, 171, 47, 35, 33, 259, 207]. 
3.2.1 Incremental ISOMAP: Basic Version We shall modify the original ISOMAP algorithm [248] (summarized in section 3.1) to become incremental. Details of the algorithms as well as an analysis of their time complexity are given in Appendix A. Throughout this section, the shortest paths are represented by the more economical predecessor matrix, instead of multiple shortest path trees T(i). 3.2.1.1 Updating the Neighborhood Graph Let A and ’D denote the set of edges to be added and deleted after inserting on“ to the neighborhood graph, respectively. An edge e(i, n + 1) should be added if (i) v,- is one of the k: nearest neiehbors of v , , or ii 22 re )laces an existing vertex and o n+1 n+1 o 89 becomes one of the k nearest neighbors of 11,-. In other words, A = {C(i, n + 1) 3 An+1,i S An+1,‘rn+1 0r Az'.n+1 S Ain’t-l1 (3'1) where ”r,- is the index of the k-th nearest neighbor of 22,-. For D, note that a necessary condition to delete the edge e(2', j) is that vn+1 replaces v,- (vj) as one of the k nearest neighbors of vj (12,-). So, all the edges to be deleted must be in the form e(z', Ti) with Ai,n+1 g ALT," The deletion should proceed if v,- is not one of the k nearest neighbors of ”Ti after inserting vn+1. Therefore, D = {(30%) 2 Am,- > A21.n+1 and ATM > ATM}, (3-2) where L,- is the index of the k-th nearest neighbor of “Ti after inserting vn+1 in the graph. Note that we have assumed there is no tie in the distances. If there are ties, random perturbation can be applied to break the ties. 3.2.1.2 Updating the Geodesic Distances The deleted edges can break existing shortest paths, while the added edges can create improved shortest paths. This is much more involved than it appears, because the change of a single edge can modify the shortest paths among multiple vertices. Consider e(a,b) E D. If sp(a, b) is not simply e(a,b), deletion of e(a, b) has no effect on the geodesic distances. Hence, we shall suppose that sp(a, 1)) consists of the single edge e(a, b). We propagate the effect of the removal of e(a,b) to the set of 90 1: Input: e(a, b), the edge to be removed; {$11)}: {7W} 2: Output: p(a.b)1 set of “affected” vertex pairs 3: Rab := (Z); Q.enqueue(a); 4: while Q.notEmpty do 5: t3: Q-popiRab = Rab U It}? 6: for all u E adj(t) do 7: If 7711b = a, enqucue u to Q; 8: end for 9: end while{Construction of Rab finishes when the loop ends} 10: F(a,b) :2 0; 11: Initialize ’1", the expanded part of T(a; b), to contain vb only; 12: for all u 6 Rob do 13: Q.enqueue(b) 14: while Q.notEmpty do 15: t := Q.pop; 16: if not = 7r“, then 171 F(a,b) = F(a.b) U {(1% t)}; 18: if v; is a leaf node in T’ then 19: for all Us in adj(t) do 20: Insert vs as a child of wt in T’ if was 2 t 21: end for 22: end if 23: Insert all the children of m in 7" to the queue Q; 24: end if 25: end while 26: end for{V uERab,V sET(u; b), sp(u, 3) uses e(a. b).} Algorithm 3.1: ConstructFab: F(a,b), the set of vertex pairs whose shortest paths are invalidated when e(a, b) is deleted, is constructed. Rab is the set of vertices such that if u E Rab, the shortest path between a and u contains e(a, b). vertices Rab (Figure 3.1). Rab is used in turn to construct 11101,), the set of all (i, j) pairs with e(a,b) in sp(z',j). 
This is done by ConstructFab (Algorithm 3.1), which finds all the vertices of under T(a; b) such that sp(u, t) contains vb, where u E Rab- The set of vertex pairs whose shortest paths are invalidated due to the removal of edges in D is thus F = U€(a.,b)EDF(a-.b)' The shortest path distances between these vertex pairs are updated by AllodifiedDijkstra (Algorithm 3.2) with source vertex vu and destination vertices C (11). It is similar to the Dijkstra’s algorithm, except that only the geodesic distances from 2.7., to C (u) (instead of all the vertices) are unknown. 91 1: Input: u; C(11); {911}; {we} 2: Output: the updated geodesic distances {guv} 3: for all j 6 C(11) do 4: H := adj(j) fl (V/C(u)); 5: 6(3) = minke” (guic + win), or 00 if H = 0; 6: Insert 6( j) to a heap with index j; 7: end for 8: while the heap is not empty do 9: k := the index of the entry by “Extract Min” on the heap; 10= 0(a) == C(U)/{k};guk == 6(k);gku := 506); 11: for all j E adj(k) 0 C(11) do 12: dist 1: guk + wk]; 13: If guk + wkj < 6( j ), perform “Decrease Key” on 6( j) to become dist; 14: end for 15: end while Algorithm 3.2: ModifiedDijkstra: The geodesic distances from the source vertex 0. to the set of vertices C (u) are updated. (a) An example of neighborhood graph (b) The shortest path-tree T(a) and Rab Figure 3.1: The edge e(a,b) is to be deleted from the neighborhood graph shown in (a). The shortest path tree 7(a) is shown as directed arrows in (b). Rab (c.f. Algorithm 3.1) consists of all the vertices on such that sp(b, u) contains e(a, b), i.e., ’ITub = a. Note that both on and C (u) are derived from F. The order of the source vertex in invoking A’IodifiedDijkstra can impact the run time significantly. An approximately optimal order is found by interpreting F as an auxiliary graph 8 (the undirected edge e(z', j ) is in 8 iff (2', j ) E F), and removing the vertices in B with the smallest degree in a greedy manner (OptimalOrder, Algorithm 3.3). When on is removed from B, ModifiedDijkstra is called with source vertex 1)., and C(u) as the neighbors of on in B. 92 1: Input: Auxiliary graph 8 2: Output: None. The geodesic distances are updated as a side-effect 3: [[2] := an empty linked list, for i = 1, . . . ,n; 4: for all on E 8 do 5: z: degree of vu in 8. Insert on to l[f]; 6: end for 7: pos := 1; 8: foriz=1tondo 9 If l[pos] is empty, increment pos one by one and until [bios] is not empty; 10: Remove 12“, a vertex in l[pos], from the graph 8; 11: Call ModifiedDijkstra(u, adj(u) in B); 12: for all 2)]: that is a neighbor of v.“ in 3 do 13: Find f such that vj 6 l [f] by an indexing array; 14: Remove vj from l[f] if f = 1, and move 22,- from l[f] to l[f — 1] otherwise; 15: pos = min(pos,f — 1); 16: end for 17: end for Algorithm 3.3: OptimalOrder“. a greedy algorithm to remove the vertex with the smallest degree in the auxiliary graph 8. The removal of on corresponds to the execution of ModifiedDijsktra (Algorithm 3.2) with u as the source vertex. The next stage of the algorithm finds the geodesic distances between vn+1 and the other vertices. Since all the edges in A (edges to be inserted) are incident on en+1 , we have -= = min :-+ui-. . Vi. 3.3 gn+1,z 92,rz.+1 j such that (91] J,n+1) ( ) e(n+1.j)€A S: the set of of v, with sp(b,t) improved by Vnn Figure 3.2: Effect of edge insertion. T (a) before the insertion of vn+1 is represented by the arrows between vertices. The introduction of vn+1 creates a better path between va and vb. S denotes the set of vertices such that t E 5' iff sp(b, t) is improved by on“. 
Note that vt must be in T(n+1; a). For each u E S, UpdateInsert (Algorithm 3.4) finds t such that sp(u, t) is improved by vn+1, starting with t = b. 93 1: Input: a; b; {91.}; {...,} 2: Output: {9,3} are updated because of the new shortest path 00 -—+ 1.1,,“ -—> vb. 3: S := 0; Q.enqueue(a); 4: while Q.notEmpty do 5: t 2: Q.pop;S :2 S U {t}; 6: for all on that are children of v; in T(n + 1) do 7: if gu,n+1 + ’wn-H,b < gu,b then 8: Q.enqueue( u); 9: end if 10: end for 11: end while{S has been constructed} 12: for all u E S do 13: Q.enqueue(b); 14: while Q.notEmpty do 151 t 3: 62-901); gut 1: gtu 3: Qu,n+1 + 9n+1,t; 16: for all US that are children of U, in T(n + 1) do 171 if gs.n+1 'l' wn+l.rz < 93,0 then 18: Q.enqueue( s); 19: end if 20: end for 21: end while 22: end for{V u E S, update sp(u.,t) if on“ helps} Algorithm 3.4: UpdateInsert: given that va -+ vn+1 ——> vb is a better shortest path between va and ”b after the insertion of 'vn+1, its effect is propagated to other vertices. Finally, we consider how A can shorten other geodesic distances. This is done by first locating all the vertex pairs (00, vb), both adjacent to vn+1, such that vb ——> vn+1 —> va is a better shortest path between va and vb. Starting from ed and vb, UpdateInsert (Algorithm 3.4) searches for all the vertex pairs that can use the new edge for a better shortest path, based on the updated graph. For all the priority queues in this section, binary heap is used instead of the asymptotically faster F ibonacci’s heap. Since the size of our heap is typically small, binary heap, with a smaller time constant, is likely to be more efficient. 94 3.2.1.3 Finding the Co—ordinates of the New Sample The co—ordinate xn+1 is found by matching its inner product with x,- to the values derived from the geodesic distances. This approach is in the same spirit as the classical scalin [49] used in ISOMAP Define "-- - ||x — x-|[2 — “X“2 + “X“2 — 2xTx- ( g a - ' 7L] _ Z J _ 1 J 2 .7 Since 2?:1 x,- = 0, summation over j and then over 2' for ii]- leads to 1 ~ llxz-IIQ = #212:- lelefi). 1 j Zuxjn2 = 3221... J 1.7 Similarly, if we define '7,- 2 [[xi — xn+1||2, we have 1 TI. TI. 2 2 llxn+1ll ——- #237.- — Dis-H ). ' i=1 i=1 T 1 2 2 . xn+1x7 : _§(’72 — llxn+1ll — llx7ll ) V2. If we approximate 31-]: by gig]- and 7,: by 92-2 T, +1, the target inner product f,- between xn+1 and x,- can be estimated by 2 2 2 ~ 2]“ 91'1“ le glj + 2197,72le 2 N — _ 2 n 77, TI. xn+1 is obtained by solving XTxn+1 = f in the least-square sense, where f 2 (f1,..., fn)T. One way to interpret the least square solution is by noting that X = (\/)\1v1 . . . ‘//\dvd)T, where (M, v,) is an eigenpair of the target inner product 95 matrix. The least square solution can be written as 1 T 1 T T x,,+1=(fiv1f,m,—\/-A—_dvdf) . (3.5) The same estimate is obtained if Nystrom approximation [89] is used. A similar procedure is used to compute the out-of-sample extension of ISOMAP in [55, 17]. However, there is an important difference: in these studies, the inner product between the new sample and the existing points is estimated by n 9.2. " 7 2f,- = 2 :71] ._ gin“. (3.6) 1:1 It is unclear how this estimate is derived. This estimate is different from that in Equation (3.4) because 2:) 912,11+1/n — sz glzj/n2 does not vanish in general; in fact, most of the time this is a large number. Empirical comparisons indicate that our inner product estimate given in Equation (3.4) is much more accurate than the one in Equation (3.6). 
Finally, the new mean is subtracted from :r,,z' = 1,...,(n + l), to ensure 23:11 x,- = 0, in order to conform to the convention in the standard ISOMAP. 3.2.1.4 Updating the Co-ordinates The co—ordinates x,- should be updated in view of the modified geodesic distance matrix Gnew. This can be viewed as an incremental eigenvalue problem, as x,- can be obtained by eigen—decomposition. However, since the size of the geodesic distance 96 f . matrix is increasing, traditional methods (such as those described in [270] or [30]) cannot be applied directly. We update X by finding the eigenvalues and eigenvectors of Bnew by an iterative scheme. Note that gradient descent can be used instead [168]. A good initial guess for the subspace of dominant eigenvectors of Bnew is the column space of XT. Subspace iteration together with Rayleigh-Ritz acceleration [96] is used to find a better eigen-space: 1. Compute Z = BnewV and perform QR decomposition on Z, i.e., we write Z = QR and let V = Q. 2. Form Z = VTBnewV and perform eigen-decomposition of the d by (1 matrix Z. Let A; and u,- be the i-th eigenvalue and the corresponding eigenvector. 3. Vnew = V[u1 . . . ud] is the improved set of eigenvectors of Bnew. Since at is small, the time for eigen-decomposition of Z is negligible. We do not use any variant of inverse iteration because Bnew is not sparse and its inversion takes 0(n3) time. 3.2. l .5 Complexity In Appendix A.4, we show that the overall complexity of the geodesic distance update can be written as O(q(|F|+lH|)+n1/ log u+ IAI2), where F and H contain vertex pairs whose geodesic distances are lengthened and shortened because of vn+1, respectively, q is the maximum degree of the vertices in the graph, n is the number of vertices with non-zero degree in B, and I/ = maxim. Here, n,- is the degree of the i-th vertex removed from the auxiliary graph 8 in Algorithm 3.3. We conjecture that u, 97 ’f-‘ - on average, is of the order 0(log 11.). Note that ,u. g 2|F|. The complexity is thus O(q([F| + |H|) + [1. log 11. log log n + |A|2). In practice, the first two terms dominate, leading to the effective complexity O(q(|F| + [H I). We also want to point out that Algorithm 3.2 is fairly efficient; its complexity to solve the all-pairs shortest path by updating all geodesic distances is 0(n2logn+n2q). This is the same as the complexity of the best known algorithm for the all-pairs shortest path problem of a sparse graph, which involves running Dijkstra’s algorithm multiple times with different source vertices. For the update of co-ordinates, subspace iteration takes 0(722) time because of the matrix multiplication. 3.2.2 ISOMAP With Landmark Points One drawback of the original ISOMAP is its quadratic memory requirement: the geodesic distance matrix is dense and is of size ()(n2), making ISOMAP infeasible for large data sets. Landmark ISOMAP was proposed in [55] to reduce the memory requirement while lowering the computation cost. Instead of all the pairwise geodesic distances, landmark ISOMAP finds a mapping that preserves the geodesic distances originating from a small set of “landmark points”. This idea is not entirely new, and the authors in [25] refer to it as the “reference point approach” in the context of embedding. Without loss of generality, let the first m points, i.e., y1,. . . ,ym, be the land- mark points. 
After constructing the neighborhood graph as in the original ISOMAP, landmark ISOMAP uses the Dijkstra’s algorithm to compute the m X n landmark 98 ¥—__ _ geodesic distance matrix C = {gij}, where gij is the length of the shortest path between v,- (a landmark point) and vj. In [55] the authors suggest that X can be found by first embedding the landmark points and then embedding the remaining points with respect to the landmark points. This is similar to the modification of the Sammon’s mapping made by Biswas et al. in [25] to cope with large data sets. How- ever, our preliminary experiments indicate that this is not very robust, particularly when the number of landmark points is small. Instead, we follow the implementation of landmark ISOMAP2 and decompose B = HmCHn by singular value decompo— sition, B = USVT = (U(S)1/2)(V(S)1/2)T, where UTU and VTV are identity matrices of corresponding sizes, and S is a diagonal matrix of singular values. The vectors corresponding to the largest d singular values are used to construct a low-rank approximation, B a: QTX. 3.2.2.1 Incremental Landmark ISOMAP After updating the neighborhood graph, the incremental version for landmark ISOMAP proceeds with the update of geodesic distances. Since only the shortest paths from a small number of source vertices are maintained, the computation that can be shared among different shortest path trees is limited. Therefore, we update the shortest path trees independently by adopting the algorithm I presented in [193], instead of the algorithm in section 3.2.1.2. First, Algorithm 3.5 is called to initialize the edge weight increase, which includes edge deletion as a. special case. Algorithm 2We are referring to the “official” implementation by the authors of ISOMAP in http : //isomap. stanford.edu. 99 I := (0; for all (73-, 3,, 1112”“, my”) in the input do Swap 7‘, and 3,- if 12,, is a child of vs, in T(a); if '03, is a child of or, in T(a) then J := {vs,}U descendent of v3I in T(a); gaj = gaj + w?” — w?“ W E .7; IzIUJ; end if end for for all j E J do b 2: minkeadflj) gak + 'wkj; {Find a new path to vj} Q.enqueue(j, arg minkeadjm gak + wkj, b) if b < gaj end for Algorithm 3.5: InitializeEdgeWeightIncrease for the shortest path tree from va, T a . The inputs are the four tuples r-,s:,w91d,w‘~‘ew , meaning the weiO‘ht of 2 ’l 1 2 C) e(ri,rj) should increase from w?“ to wzflew. Q is the queue of vertices to be pro— cessed in Algorithm 3.7. 3.7 is then executed to rebuild the shortest path tree. Algorithm 3.6 is then called to initialize the edge weight decrease, which includes edge insertion as a special case. Algorithm 3.7 is again called to rebuild the tree. Deletion of edges is done before the addition of edges because this is more efficient in practice. The co—ordinate of the new point xn+1 is determined by solving a least-square problem similar to that in section 3.2.1.3. The difference is that the columns of Q, instead of X, are used. So, QTxn+1 = f is solved in the least-square sense. Finally, we use subspace iteration together with Ritz acceleration [236] to improve singular vector estimates. The steps are 1. Perform SVD on the matrix BX, U181V{ = BX 2. Perform SVD on the matrix BTUl, U282V§ = BTU1 3. 
Set xnew = U2(S2)1/2 and Qnew = U1(S2)1/2 As far as time complexity is concerned, the time to update one shortest path tree 100 I := Q); for all (Ti, 3,, 112?“, my”) in the input do Swap 1', and 3; if gay, > 90‘s,; diff := gay, + 111.?“ — 911.3,; if diff < 0 then Move vs, to be a child of 1)., in T(a); J :2 {’03,}U descendent of 0,, in T(a); 903' = 903' + diff W E .7; I = I U J; end if end for for all j E .7 do for all k E adj(j) do Q.enqueue(k,j,gaj + wjk) if 903- + wjk < gak end for [ end for algorithm 3.6: InitializeEdgeVVeightDecrease for the shortest path tree from va, 7(a). The inputs are the four tuples (ri,si,wfld,w?ew), meaning the weight of e(rbrj) should decrease from in?“ to urgew. Q is the queue of vertices to be pro- cessed in Algorithm 3.7.. is 0(6d log 6d + (16d), where (id is the minimum number of nodes that must change their distance or parent attributes or both [193], and q is the maximum degree of vertices in the neighborhood graph. The complexity of updating the singular vectors is 0(nm), which is linear in 77., because the number of landmark points m is fixed. 3.2.3 Vertex Contraction Owing to the non-parametric nature of ISOMAP, the data points collected need to be stored in the memory in order to refine the estimation of the geodesic distances 92‘3- and the co-ordinates xi. This can be undesirable if we have an arbitrarily large data stream. One simple solution is to discard the oldest data point when a pre—determined I1leber of data points has been accumulated. This has the additional advantage of 101 I A while Q.notEmpty do (i,j,d) :2 “Extract Min” on Q; de=d—%fi if diff < 0 then Move vi to be a child of v]- in T(a): gai = d; for all k E adj(i) do new = 9m- + wiki Q.enqueue(k,i,newd) if newd < gak; end for end if end while Algorithm 3.7: Rebuild T(a.) for those vertices in the priority queue Q that need to be updated. making the algorithm adaptive to drifting in data characteristics. The deletion should take place after the completion of all the updates due to the new point. Deleting the vertex v,- is easy: the edge deletion procedure is used to delete all the edges incident on v; for both ISOMAP and landmark ISOMAP. We can do better than deletion, however. A vertex contraction heuristic can be used to record the improvement in geodesic distance estimate without storing additional points. Most of the information the new vertex vn+1 contains about the geodesic distance estimate is represented by the shortest paths passing through vn+1. Suppose sp(a,b) can be written as ea w v,- ——> 22n+1 ——> vb. The geodesic distance between va and vb can be preserved by introducing a new edge e(z’, b) with weight (102-[n+1 + urn+1,b), even though on“ is deleted. Both the shortest path tree T(a.) alld the graph are updated in view of this new edge. This procedure cannot create irlConsistency in any shortest path trees, because the subpath of any shortest path is also a shortest path. This heuristic increases the density of the edges in the graph, 11 ()wever. 192 W'hich vertex should be contracted? A simple choice is to contract the new vertex vn+1 after adjusting for the change of geodesic distances. Alternatively, we can delete the vertices that are most “crowded” so that the points are spread more evenly along the manifold. This can be done by contracting the non-landmark point whose nearest neighbor is the closest to itself. 3.3 Experiments we have implemented our main algorithm in Matlab, with the graph theoretic parts written in C++. 
The running time is measured on a Pentium IV 3.2 GHz PC with 512MB memory running Windows XP, using the profiler of Matlab with the java virtual machine turned off. 3.3.1 Incremental ISOMAP: Basic Version We evaluated the accuracy and the efficiency of our incremental algorithm on sev- eral data sets. The first experiment was on the Swiss roll data set. It is a typical benchmark for manifold learning. Because of its “roll” nature, geodesic distances are more appropriate in understanding the structure of this data set than Euclidean distances. Initialization was done by finding the co—ordinate estimate x,- for 100 ran- dOInly selected points using the “batch” ISOMAP, with a [run neighborhood of size 6‘ Random points from the Swiss roll data set were added one by one, until 1500 I)Qims were accumulated. The incremental algorithm described in section 3.2.1 was Ilged to update the co-ordinates. The first two dimensions of x,- corresponded to the 103 2s 2.... 8.2 m3 :3 was <2 <2 <2 was 8.1.. was Se 2.... was 82 was as was So was a: <\z <2 <2 :s as was is mom was 82 8.0 as a: 8.0 8.0 was mos sec 8.0 3.0 8.0 see as was 8.0 can ..85 58am $5 ..85 nopem .30 ..85 58mm .pma ..85 nopwm .EQ .55 58am SEQ : 530 m 997:2 85 coamcamm o>§u-m :8 mmmam .mpfiom : 2: :e a8 coflmusmfioo 85358 a8 was 2: on mwcoamotou 2.5.5: .65 350a mo .8555 usoamtwc a5 mono age/HOE EEQEQSE was 585* massomxm .25 Amwsoommv 25» 55 ”Nm @3er. 104 is . 3m. . as . is . 4. . ..x was a A: 38 we 2% 3 is we cos. ammo 3% Ex assimzwco can ems: is ©me we at: 2... man: was 3:2 828% 2880 S mama 3 was so mg 2 3mm 3 0.0mm ism eooioeaez ..85 58mm ..85 :ofiwm ..85 zouem ..85 58mm ..85 nepwm 53¢ m 9922 8.3 convened 9.50% :8 mmrsm .oES SE B5288 one Eco mm 823 0283 .aofimwou Enacted was “X we .5553: use 5+5“ mo 2033:9200 mag/HOE 5.3m: 8m m o ' o o o D .(6 2°00 '°¢§-.29...°" O o 0% 9000990 401 o o o .9 -10 o d, 0° 9 ' do .a e0 "d9 0 ° :.5 ° . 8 '00 o ‘90 '50. . go? _o -20. . 0° . -. c. o _20 o ‘5 - 09' .° 50 '0 ..0. Go . .0 0° 0 o o w o % at: o :1 ago D.0..y?)c3-TP v.10 9 I}? (Do-.390. -30' . -30* q, ' . . .. ~ - _-.. . . . -60 —40 40 60 -60 —40 ~20 0 60 (81) Initial n — 100 (b) n = 300 '. ' ' -' ‘- I -. “mm--2 30’ r. ' ...". b \ 4.0.. 30’. : 2 ":5 " $.6:.',.l'. “96’ 2 ’ _ ‘bQo' . y .. g. - '9 ; own a " ° -' 0051-. 39,50 @313, 3%388'0‘.‘ ¢e_3fice,dlg-. P08 °° 93° , ° ‘3‘" _ 093:5 .- _"' 20~o . mm. o ., o r - m- 20 o .6 ° . 430003.111; « 3;..5,’ 0&‘Q°_ ° 0 .50%€§~'.o q-gao °. was: 4%? ‘51.: _ . <11 ' wfi ’ :3 ' "o -_ ' ‘ ' W '3 ‘ 0“ 'd': 10 a $23.0 . £0 <5 ‘2 ”1“. 0° g. “gig 10 .0 we :1 5£°::°. o ijm . ‘3’ 1 g ‘- .0- f .- I 0 . ' ' O —figke§$*5o .60 -40 —20 20 (e) n = 01200 0 f —20 20 (r) Final, 71 = 1500 Figure 3.5: Evolution of the estimated co—ordinates for Swiss roll to their final values. The black dots denote the co—ordinates estimated with different number of samples, Whereas red circles show the co—ordinates estimated with all the 1500 points. The (lo-ordinates have been re—scaled to better observe the trend. Figure 3.6: Example images from the rendered face image data set. This data set can be found at the ISOMAP web-site. 111 Figure 3.7: Example “2” digits from the MN IST database. The MN IST database can be found at http://yann. lecun.com/exdb/mnist/. 1NN LOO error rate in % 5 6 No. of features Figure 3.9: Classification performance on ethn database for basic ISOMAP. 112 the data linearly to the best hyperplane by PCA and also evaluate the corresponding leave-one-out error rate. Figure 3.9 shows the result. 
The representation recovered by ISOMAP leads to a smaller error rate than PCA. Note that the performance of PCA can be improved by rescaling each feature so that all of them have equal variance, though the rescaling is essentially a post-processing step, not required by ISOMAP. 3.3.2 Experiments on Landmark ISOMAP A similar experimental procedure was applied to the incremental landmark ISOMAP described in section 3.2.2 for Swiss roll, S-curve, rendered face, MNIST digit 2, and ethn data sets. Starting with 200 randomly selected points from the data set, random 9 accumulated. Forty points from the points were added until a total of 5000 points initial 200 points were chosen randomly to be the landmark points. Snapshots corn- Daring the (to—ordinates estimated by the batch version and the incren’iental version for Swiss roll are shown in Figure 3.10. The approximation error and the computa- tlion time are shown in Figure 3.11 and Table 3.3, respectively. The time to run the batch version only once is listed in Table 3.4. Once again, the co—ordinates estimated by the incremental version are accurate with respect to the batch version, and the COIIlputation time is much less. We also consider the classification accuracy using Iandrnark ISOMAP on all the 2630 images in the ethn data set. The result is shown in Figure 3.12. The co—ordinates estimated by landmark ISOMAP again lead to a Slllaller error rate than those based on PCA. The difference is more pronounced when \ 9When the data set has less than 5000 points, the experiment stopped after all the points have Gen used. 113 <2 <2 <2 88 28mm :38 <2 <2 <2 was meet 22mm mod :82 $28 8% <2 <2 <2 30 a: 8.3 <2 <2 <2 8.0 5.2 a: 8.0 8.: as comm 8.0 $2 on: So a? was <2 <2 <2 So a: 02 3o 83 mm; 88 8.0 a; a: 8.0 a; was So as 3.0 mod as 3o :5 Es 8.0 8... .85 88m .85 .85 88m 8me .85 88m .85 .85 88mm .85 .85 88m ...EQ 58 m BmHZE 88 888de madam :8 $me .mufioa : 85 :8 8* 28332588 88358 5 8:5 2: 3 5288888 :85: .AE 2an m0 838:: ”:88wa 88 8:0 Lag/HOE xaeEcsfl 8888885 98 889 magnowxm 8“ $5883 883 55 Jam 8388 . . . . . . ax mafia a Whoa 9% saw. @on mm oé “New wéom Mmoo 5.2: 112x msg#89290 8 05m 3: 33 8 we mom 2% 3m 3% 8:28 22.88 on 0.82 E 3% ea mom 3 28w 8 33m ism 8928862 .85 88m .85 88m .85 88m .85 88m .85 88m 58 m BmHZE 88 885:5 958m :8 mmtsm .883 5: 8:558 28 Eco m5 885 88m 823888 58:28:85 m8 ex .8 Maggi: 98 H+2x mo 2053:9208 mega/HOE 88h 8m mag/HOE x8585: 5388883 58 88a 88 32888 was and 5m 855. 114 88me 8m: 8 mag/HOE .8 86:? 58:58:65 5: 8:: E88 .m.m 8:me o: 8:86 mm 5 888858 :8 88:32: 86.55% on... 833 :85 .5288 woofionnwmm: was. 88:85: 5:58 BE: 95. wig/HOE E 88:58 89538.8 885:8: 80:: 23 mo 88> 2: 3 889888 8:8: 8 mo :28 5: 9853 £83 838m 888:8 :858 8:88 58 585880.: 5:58de 8:88:85 8:: 8:: 8.85 on: :3 88858 5:89 838 538: 2: mo 88:88.8 2: 30:8 8288 8:: 88:8 .23. 58388on :22?» 8.89883 8:: :8 :88. on: 23 88838 88:88.8 :8. “88:8: 8830: 8:: 5 m8: 5:: 88.8 8:: £838 85 23 5 max—20mm 58:53:85 5385885 :8 Ea: mmrsm: mo muoamme:m ”SM 8:me 82 n 2 9 88 n z E 82 u g E n c m1 o? m7 owl ..." ... . . . £72.. .. .. assume. ._ «u com H g .855 8v com H : .855 A3 0 o m1 owl 31 our . . n": . 0.. . . . 9 o 0. i 1 :30... “a one. c “To"... 0 o .- o ‘00 a .- Km... 0 O C O . O o a I .o 0 c 0’. .0. .0 . . .m o I .0 o O O O .0 I. .0. ' l O. o o a e O... C O. O. O 0‘ OF I . o .,- 115 88 n : 42E 8 8% n z .35: E ‘ I . t‘ ... I.- 2... WW x €0::5:8v ofim oSME 8% n 2 .35: 3 . 4 m 0 ml owl n7 owl v- 116 0x ‘04 Average: 0.000010 “x 10-5 Average: 0.000003 1" 1 Y 1 v Y T T a 1 1 v v r r r 8. 
1 1 7» 0.8- 6 “J: 0.6- 4 0.4* l 0.2 ~ A A A A 1 AL 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 n n (a) Swiss roll (b) S-curve -3 Average: 0.000212 Average: 0.001685 x 10 2 v 1 v f r Y 0.04 V T v Y r r 1 1 1.8 1.6 ”400 500 6006560100015002000250930003500400045005000 2(c) Renderned Faces (d) MNIST digit 2 Average: 0.001990 0.04 . w T r 0.035 r 4 0.03 r 0.025 m: 0.02 g : 0.015 l 0.01 l j l ‘ . ‘ l 0.005 ‘ _ { c A A -.. MA AAA ..-. 500 1000 n 1500 2000 2500 (e) ethn Figure 3.11: Approximation error (8n) between the co—ordinates estimated by the incremental landmark ISOMAP and the batch landmark ISOMAP for different num- bers of data points (n). It is similar to Figure 3.4, except that incremental landmark ISOMAP is used instead of the basic. ISOMAP. 117 A O .4 \‘ I I T r L r .\ -—~ PCA 35‘.\“~\ -- ~ - ISOMAP “ o\° 30’ h, .2 . \\ a, . \ a; 20" \ 8 \ _r 15“ \\\ 2 \‘\~ 2 \N‘ '— 1ob \\ \F“\\ 5'- .1..\_.__.._1--_ o ‘ ' ' I No. of features Figure 3.12: Classification performance on ethn database, landmark ISOMAP. the number of dimensions is small (less than five). 3.3.3 Vertex Contraction The utility of vertex contraction is illustrated in the following experiment. Consider a manifold of a 3-dimensional unit hemisphere embedded in a lO-dimensional space. The geodesic on this manifold is simply the great circle, and the geodesic distance between x1 and x2 on the manifold is given by COS—I(XCITXZ). Data points lying on this manifold are randomly generated. With K = 6, 40 landmark points and 1000 points in memory, vertex contraction is executed until 10000 points are examined. The geodesic distances between the landmark points XL and the points in memory X M are compared with the ground-truth, and the discrepancy is shown by the solid line in Figure 3.13. As more points are encountered, the error decreases, indicating that vertex contraction indeed improves the geodesic distance estimate. There is, 118 0,16 1 r r r r 1* ———-—- With contraction 0.14_ , . - - - Without contraction 012* 0.1 rms 0.08 0.06 ~ 0.04 _ 0.0 I L L L l l l I 12000 2000 3000 4000 5000 6000 7000 8000 9000 10000 number of points Figure 3.13: Utility of vertex contraction. Solid line: the root-mean-square error (when compared with the ground truth) of the geodesic distance estimate for points currently held in memory when vertex contraction is used. Dash-dot line: the cor— responding root—mean—square error when the new points are stored in the memory instead of being contracted. however, a lower limit (around 0.03) on the achievable accuracy, because of the finite size of samples retained in the memory. When additional points are kept in the memory instead of being contracted, the improvement of geodesic distance estimate is significantly slower (the dash-dot line in Figure 3.13). We can see that vertex contraction indeed improves the geodesic distance estimate, partly because it spreads the data points more evenly, and partly because more points are included in the neighborhood effectively. 3.3.4 Incorporating Variance By Incremental Learning One interesting use of incremental learning is to incorporate invariance by “hallucinat— ,. I . ing” training data. Given a training sample yi, additional training data yzll), yzf?) 119 can be created by applying different invariance transformations on yi. The amount of training data can be unbounded, because the number of possible invariance transfor- mations is infinite. 
This unboundedness calls for an incremental algorithm, which can accumulate the effect of the data generated. This idea has been exploited in [235] for improving the accuracy in digit classification. Given a digit image, simple distortions like translation, rotation, and skewing are applied to create additional training data for improving the invariance property of a neural network. We tested a similar idea using the proposed incremental ISOMAP. The training data were generated by first randomly selecting an image from 500 digit “2” images in the MNIST training set. The image was then rotated randomly by 6 degree, where 6 was uniformly distributed in [—30, 30]. The image was used as the input for the incremental landmark ISOMAP with 40 landmarks and a memory size of 10000, with vertex contraction enabled. The training was stopped when 60000 training images were generated. We wanted to investigate how well the rotation angle is recovered by the nonlinear mapping. This was done by using an independent set of digit “2” images from the MNIST testing set, which was of size 1032. For each image y“), it was rotated by 15 different angles: 30j/7 for j = —7, . . . ,7. The mappings of these 15 images, 291,. .. ,xgi), were found using the out—of—sample extension of ISOMAP. If ISOMAP can discover the rotation angle, there should exist a linear projection direction h such that thEi) % c,- +l for all z' and l, where Ci is a constant specific to 5,0). This is equivalent to hT (if) — 5(8)) 3 r, (3.8) 320 ---SOMAP '~ISOMAPH , ISOMAPHI 8 o .>\ —~«PCA '\ ...... \ .0 PCA" ° ' °'\.\ V'VPCAHI N (I) on v VYY O N 8 O E —————————————————————————————— B 0 square root 01 the sum 01 residue square N o O -A a: O ' 2 l I 1 l i J # a) O Y 4 ... M (a) A ‘1». CD ‘0 8 5 6 No. of features Figure 3.14: Sum of residue square for 1032 images in 15 rotation angles. The larger the residue, the worse the representation. “PCA” and “ISOMAP” correspond to the nonlinear mapping obtained by PCA and ISOMAP when 10000 generated images are used for training, respectively. “ISOMAP II”/ “PCA II” and “ISOMAP III”/ “PCA III” correspond to the result when the learning stops after 20000 and 50000 images are generated, respectively. which is an over-determined linear system. The goodness of the mapping it“) in terms of how well the rotation angle is recovered can thus be quantified by the residue of the above equation. For comparison, a similar procedure was applied for PCA using the first 10000 generated images. Figure 3.14 shows the result. We can see that the residue for ISOMAP is smaller than PCA, indicating that ISOMAP recovers the rotation angle better. The residue is even smaller when additional images are generated to improve the mapping. 3.4 Discussion We have presented algorithms to incrementally update the co—ordinates produced by ISOMAP. Our approach can be extended to other manifold learning algorithms; for example, creating an incremental version of Laplacian eigenmap requires the update 121 of the neighborhood graph and the leading eigenvectors of a matrix (graph Laplacian) derived from the neighborhood graph. The convergence of geodesic distance is guaranteed since the geodesic distances are maintained exactly. Subspace iteration used in co-ordinate update is provably convergent if a sufficient number of iterations is used, assuming all eigenvalues are simple, which is generally the case. The fact that we only run subspace iteration once can be interpreted as trading off guaranteed convergence with empirical efficiency. 
Since the change in target inner product matrix is often small, the eigenvector im- provement due to subspace iterations with different number of points is aggregated, leading to the low approximation error as shown in Figures 3.4 and 3.11. W'hile running the proposed incremental ISOMAP is much faster than running. the batch version repeatedly, it is more efficient to run the batch version once using all the data points if only the final solution is desired (compare Tables 3.1 and 3.2, as well as Tables 3.3 and 3.4). It is because maintaining intermediate geodesic distances and co—ordinates accurately requires extra computation. The incremental algorithm can be made faster if the geodesic distances are updated upon seeing p subsequent points, p > 1. We first embed yn+1, . . . , yn+p independently by the method in section 3.2.1.3. The geodesic distances among the existing points are not updated, and the same set of x, is used to find xn+1, . . . ,Xn+p- After that, all the geodesic distances are updated, followed by the update of x1, . . . ,xn+p by subspace iteration. This strategy makes the incremental algorithm almost p—times faster, because the time to embed the new points is very small (see the time for “computing xn+1” in Tables 3.1 and 3.3). On the other hand, the quality of the embedding will deteriorate because the 122 embedding of the existing points cannot benefit from the new points. This strategy is particularly attractive with large 71., because the effect of yn+1, . . . ,yn+p on yn+p+1 is small. Also, for a fixed amount of memory, the solution obtained by the incremental version can be superior to that of the batch version. This is because the incremental version can perform vertex contraction, thereby obtaining a better geodesic distance estimate. The incremental version can be easily adopted to an unbounded data stream when training data are generated by applying invariance transformation, too. 3.4.1 Variants of the Main Algorithms Our incremental algorithm can be modified to cope with variable neighborhood def- inition, if the user is willing to do some tedious book-keeping. We can, for example, use c-neighborhood with the value of 6 re-adjusted whenever, say, 200 data points have arrived. This can be easily achieved by first calculating the edges that need to be deleted or added because of the new neighborhood definition. The algorithms in sections 3.2.1 and 3.2.2 are then used to update the geodesic distances. The embedded co—ordinates can then be updated accordingly. The supervised ISOMAP algorithm in [276], which utilizes a criterion similar to the Fisher discriminant for embedding, can also be converted to become incremental. The only change is that the subspace iteration method for solving a generalized eigenvalue problem is used instead. The proposed incremental ISOMAP can be easily converted to incremental conformal ISOMAP [55]. In conformal ISOMAP, the edge weight wij is 123 Aij / \/ 111 (2)1110), where 111(2) denotes the distance of y,- from its k nearest neighbors. The computation of the shortest path distances and eigen-decomposition remains the same. To convert this to its incremental counterpart, we need to maintain the sum of the weights of the kt nearest neighbors of different vertices. The change in the edge weights due to the insertion and deletion of edges as a new point comes can be easily tracked. The target inner product matrix is updated, and subspace iteration can be used to update the embedding. 
3.4.2 Comparison With Out-of-sample Extension One problem closely related to incremental nonlinear mapping is the “out—of-sample extension” [17]: given the embedding x1, . . . ,xn for a “training set” yl, . . . , yn, what is the embedding result (xn+1) for a “testing” point yn+1? This is effectively the problem considered in section 3.2.1.3. In incremental learning, however, we go beyond obtaining xn+1z the co—ordinate estimates x1, . . . ,xn of the existing points are also improved by yn+1. In the case of incremental ISOMAP, this amounts to updating the geodesic distances and then applying subspace iteration. The out-of-sample extension is faster because it skips the improvement step. How- ever, it is less accurate, and cannot provide intermediate embedding with good quality as points are accumulated. Incremental ISOMAP, on the other hand, utilizes the new samples to continuously improve the co—ordinate estimates. Out—of—sample extension may be more appealing when a large number of samples have been accumulated and the geodesic distances and x1, . . . ,xn are reasonably accurate. Even in this case, 124 though, the strategy of updating x1, . . . ,xn after p new points (with p > 1) have been embedded works equally as well. The updating of geodesic distances and co— ordinates occurs infrequently in this case, and its amortized computational cost is very low. Incremental ISOMAP is also preferable to out-of-sample extension when there is a drifting of data characteristics. In out—of—sample extension, the n points collected are assumed to be representative of all future data points that are likely to be ob- served. There is no way to capture the change of data characteristics. In incremental ISOMAP, however, we can easily maintain an embedding using a window of the points recently encountered. Changes in data characteristics are captured as the geodesic distances and co—ordinate estimates are updated. Vertex contraction should be turned off if incremental ISOMAP is run in this mode, to ensure that the effect of old data points is erased. 3.4.3 Implementation Details The subspace iteration in section 3.2.1.4 requires that the eigenvalues corresponding to the leading eigenvectors have the largest absolute values. This can be violated if the target inner product matrix has a large negative eigenvalue. To tackle this, we shift the spectrum and find the eigenvectors of (B + 01) instead of B. Subspace iteration on (B + 01) can proceed in almost the same manner, because (B + aI)v = B + av. ‘While a large value of a guarantees that all shifted eigenvalues are positive, this has the adverse effect of reducing the rate of convergence of the eigenvectors, 125 because the shift reduces the ratio between adjacent eigenvalues. We empirically set a = max(—0.7Amin(B) — 0-3’\d-th(B).~.0)- where Amin(B) and Ad,th(B) denote the smallest (most negative) and the d—th largest eigenvalues, respectively. The later is being maintained by the incremental algorithm, while the former can be found by, say, residual norm bounds or Gerschgoren disk bounds. In practice, Amin(B) is found at the initialization stage. This estimate is updated only when a large number of data points have been accumulated. During the incremental learning, the neighborhood graph may be temporarily dis- connected. A simple solution is to embed only the largest graph component. The excluded vertices are added back for embedding again when they become reconnected as additional data points are encountered. 
3.5 Summary

Nonlinear dimensionality reduction is an important problem with applications in pattern recognition, computer vision, and machine learning. We have developed an algorithm for the incremental nonlinear mapping problem by modifying the well-known ISOMAP algorithm. The core idea is to efficiently update the geodesic distances (a graph theoretic problem) and re-estimate the eigenvectors (a numerical analysis problem) using the previous computation results. Our experiments on synthetic data as well as real world images validate that the proposed method is almost as accurate as running the batch version, while saving significant computation time. Our algorithm can also be easily adapted to other manifold learning methods to produce their incremental versions.

Chapter 4

Simultaneous Feature Selection and Clustering

Hundreds of clustering algorithms have been proposed in the literature for clustering in different applications. In this chapter, we examine a different aspect of clustering that is often neglected: the issue of feature selection. Our focus will be on partitional clustering by a mixture of Gaussians, though the method presented here can be easily generalized to other types of mixtures. We are interested in mixture-based clustering because its statistical nature gives us a solid foundation for analyzing its behavior; it also leads to good results in many cases. We propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of mixture-based clustering. We adopt the minimum message length (MML) model selection criterion, so the saliency of irrelevant features is driven towards zero, which corresponds to performing feature selection. The MML criterion and the EM algorithm are then extended to simultaneously estimate the feature saliencies and the number of clusters.

The remainder of this chapter is organized as follows. We discuss the challenge of feature selection in the unsupervised domain in Section 4.1. In Section 4.2, we review previous attempts to solve the feature selection problem in unsupervised learning. The details of our approach are presented in Section 4.3. Experimental results are reported in Section 4.4, followed by comments on the proposed algorithm in Section 4.5. Finally, we conclude in Section 4.6.

4.1 Clustering and Feature Selection

Clustering, like supervised classification and regression, can benefit from using a good subset of the available features. A simple example illustrating the corrupting influence of irrelevant features can be seen in Figure 4.1, where the irrelevant feature makes it hard for the algorithm in [81] to discover the two underlying clusters. Feature selection has been widely studied in the context of supervised learning (see [101, 26, 122, 151, 153] and references therein, and also Section 1.2.3.1), where the ultimate goal is to select features that can achieve the highest accuracy on unseen data. Feature selection has received comparatively little attention in unsupervised learning or clustering. One important reason is that it is not at all clear how to assess the relevance of a subset of features without resorting to class labels.
The problem is made even more challenging when the number of clusters is unknown, since the optimal number of clusters and the optimal feature subset are inter-related, as illustrated in Figure 4.2 (taken from [69]). Note that methods based on variance (such as principal components analysis) need not select good features for clustering, as features with large variance can be independent of the intrinsic grouping of the data (see Figure 4.3). Another important problem in clustering is the determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue.

Figure 4.1: An irrelevant feature (x_2) makes it difficult for the Gaussian mixture learning algorithm in [81] to recover the two underlying clusters. Gaussian mixture fitting finds seven clusters when both features are used, but identifies only two clusters when only the feature x_1 is used. The curves along the horizontal and vertical axes of the figure indicate the marginal distributions of x_1 and x_2, respectively.

Most feature selection algorithms (such as [36, 151, 209]) involve a combinatorial search through the space of all feature subsets. Usually, heuristic (non-exhaustive) methods have to be adopted, because the size of this space is exponential in the number of features. In this case, one generally loses any guarantee of optimality of the selected feature subset. We propose a solution to the feature selection problem in unsupervised learning by casting it as an estimation problem, thus avoiding any combinatorial search. Instead of selecting a subset of features, we estimate a set of real-valued (actually in [0, 1]) quantities, one for each feature, which we call the feature saliencies. This estimation is carried out by an EM algorithm derived for the task. Since we are in the presence of a model-selection-type problem, it is necessary to avoid the situation where all the features are completely salient. This is achieved by adopting a minimum message length (MML, [264, 265]) penalty, as was done in [81] to select the number of clusters. The MML criterion encourages the saliencies of the irrelevant features to go to zero, allowing us to prune the feature set. Finally, we integrate the process of feature saliency estimation into the mixture fitting algorithm proposed in [81], thus obtaining a method that is able to simultaneously perform feature selection and determine the number of clusters. This chapter is based on our journal publication in [163].

Figure 4.2: The number of clusters is inter-related with the feature subset used. The optimal feature subsets for identifying 3, 2, and 1 clusters in this data set are {x_1, x_2}, {x_1}, and {x_2}, respectively. Conversely, the optimal numbers of clusters for the feature subsets {x_1, x_2}, {x_1}, and {x_2} are also 3, 2, and 1, respectively.

Figure 4.3: Deficiency of variance-based methods for feature selection. Feature x_1, although it explains more of the data variance than feature x_2, is spurious for the identification of the two clusters in this data set.

4.2 Related Work

Most of the literature on feature selection pertains to supervised learning (see Section 1.2.3.1). Comparatively, not much work has been done on feature selection in unsupervised learning. Of course, any method conceived for supervised learning that does not use the class labels could be used for unsupervised learning; this is the case for methods that measure feature similarity to detect redundant features, using,
e.g., mutual information [221] or a maximum information compression index [188]. In [70, 71], the normalized log-likelihood and cluster separability are used to evaluate the quality of clusters obtained with different feature subsets. Different feature subsets and different numbers of clusters, for multinomial model-based clustering, are evaluated using marginal likelihood and cross-validated likelihood in [254]. The algorithm described in [218] uses a LASSO-based idea to select the appropriate features. In [51], the clustering tendency of each feature is assessed by an entropy index. A genetic algorithm is used in [146] for feature selection in k-means clustering. In [246], feature selection for symbolic data is addressed by assuming that irrelevant features are uncorrelated with the relevant features. Reference [60] describes the notion of "category utility" for feature selection in a conceptual clustering task. The CLIQUE algorithm [2] is popular in the data mining community; it finds hyper-rectangular shaped clusters using a subset of attributes for a large database. The wrapper approach can also be adopted to select features for clustering; this has been explored in our earlier work [82, 165].

All the methods referred to above perform "hard" feature selection (a feature is either selected or not). There are also algorithms that assign weights to different features to indicate their significance. In [190], weights are assigned to different groups of features for k-means clustering based on a score related to the Fisher discriminant. Feature weighting for k-means clustering is also considered in [187], but the goal there is to find the best description of the clusters after they are identified. The method described in [204] can be classified as learning feature weights for conditional Gaussian networks. An EM algorithm based on Bayesian shrinking is proposed in [100] for unsupervised learning.

4.3 EM Algorithm for Feature Saliency

In this section, we propose an EM algorithm for performing mixture-based (or model-based) clustering with feature selection. In mixture-based clustering, each data point is modelled as having been generated by one of a set of probabilistic models [125, 183]. Clustering is then done by learning the parameters of these models and the associated probabilities; each pattern is assigned to the mixture component that most likely generated it. Although the derivations below refer to Gaussian mixtures, they can be generalized to other types of mixtures.

4.3.1 Mixture Densities

A finite mixture density with k components is defined by

p(y|\theta) = \sum_{j=1}^{k} \alpha_j\, p(y|\theta_j),    (4.1)

where \alpha_j \geq 0 and \sum_j \alpha_j = 1; each \theta_j is the set of parameters of the j-th component (all components are assumed to have the same form, e.g., Gaussian); and \theta \equiv \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_k\} denotes the full parameter set. The goal of mixture estimation is to infer \theta from a set of n data points \mathcal{Y} = \{y_1, \ldots, y_n\}, assumed to be samples of a distribution with density given by (4.1). Each y_i is a d-dimensional feature vector [y_{i1}, \ldots, y_{id}]^T. In the sequel, we will use the indices i, j, and l to run through data points (1 to n), mixture components (1 to k), and features (1 to d), respectively. As is well known, neither the maximum likelihood (ML) estimate,

\hat{\theta}_{ML} = \arg\max_{\theta} \{ \log p(\mathcal{Y}|\theta) \},

nor the maximum a posteriori (MAP) estimate (given some prior p(\theta)),

\hat{\theta}_{MAP} = \arg\max_{\theta} \{ \log p(\mathcal{Y}|\theta) + \log p(\theta) \},

can be found analytically.
The usual choice is the EM algorithm, which finds local maxima of these criteria [183]. This algorithm is based on a set \mathcal{Z} = \{z_1, \ldots, z_n\} of n missing (latent) labels, where z_i = [z_{i1}, \ldots, z_{ik}], with z_{ij} = 1 and z_{ip} = 0 for p \neq j, meaning that y_i is a sample of p(\cdot|\theta_j). For brevity of notation, we sometimes write z_i = j for such a z_i. The complete data log-likelihood, i.e., the log-likelihood if \mathcal{Z} were observed, is

\log p(\mathcal{Y}, \mathcal{Z}|\theta) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \log [\alpha_j\, p(y_i|\theta_j)].    (4.2)

The EM algorithm produces a sequence of estimates \{\hat{\theta}(t), t = 0, 1, 2, \ldots\} using two alternating steps:

- E-step: Compute \mathcal{W} = E[\mathcal{Z} | \mathcal{Y}, \hat{\theta}(t)], the expected value of the missing data given the current parameter estimate, and plug it into \log p(\mathcal{Y}, \mathcal{Z}|\theta), yielding the so-called Q-function Q(\theta, \hat{\theta}(t)) = \log p(\mathcal{Y}, \mathcal{W}|\theta). Since the elements of \mathcal{Z} are binary, we have

w_{ij} \equiv E[z_{ij} | \mathcal{Y}, \hat{\theta}(t)] = \Pr[z_{ij} = 1 | y_i, \hat{\theta}(t)] = \frac{\hat{\alpha}_j(t)\, p(y_i|\hat{\theta}_j(t))}{\sum_{j'=1}^{k} \hat{\alpha}_{j'}(t)\, p(y_i|\hat{\theta}_{j'}(t))}.    (4.3)

Notice that \alpha_j is the a priori probability that z_{ij} = 1 (i.e., that y_i belongs to cluster j), while w_{ij} is the corresponding a posteriori probability, after observing y_i.

- M-step: Update the parameter estimates,

\hat{\theta}(t+1) = \arg\max_{\theta} \{ Q(\theta, \hat{\theta}(t)) + \log p(\theta) \},

in the case of MAP estimation, or without \log p(\theta) in the ML case.

4.3.2 Feature Saliency

In this section we define the concept of feature saliency and derive an EM algorithm to estimate its value. We assume that the features are conditionally independent given the (hidden) component label, that is,

p(y|\theta) = \sum_{j=1}^{k} \alpha_j\, p(y|\theta_j) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} p(y_l|\theta_{jl}),    (4.4)

where p(\cdot|\theta_{jl}) is the pdf of the l-th feature in the j-th component. This assumption enables us to utilize the power of the EM algorithm. In the particular case of Gaussian mixtures, the conditional independence assumption is equivalent to adopting diagonal covariance matrices, which is a common choice for high-dimensional data, such as in naive Bayes classifiers and latent class models, as well as in the emission densities of continuous hidden Markov models.

Among the different definitions of feature irrelevancy (proposed for supervised learning), we adopt the one suggested in [210, 254], which is suitable for unsupervised learning: the l-th feature is irrelevant if its distribution is independent of the class labels, i.e., if it follows a common density, denoted by q(y_l|\lambda_l).

Figure 4.4: An example graphical model for the probability model in Equation (4.5) for the case of four features (d = 4) with different indicator variables: (a) \phi_1 = 1, \phi_2 = 1, \phi_3 = 0, \phi_4 = 1; (b) \phi_1 = 0, \phi_2 = 1, \phi_3 = 1, \phi_4 = 0. \phi_l = 1 corresponds to the existence of an arc from z to y_l, and \phi_l = 0 corresponds to its absence.

Let \Phi = (\phi_1, \ldots, \phi_d) be a set of d binary parameters, such that \phi_l = 1 if feature l is relevant and \phi_l = 0 otherwise. The mixture density in (4.4) can then be re-written as

p(y | \Phi, \{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} [p(y_l|\theta_{jl})]^{\phi_l} [q(y_l|\lambda_l)]^{1-\phi_l}.    (4.5)

A related model for feature selection in supervised learning has been considered in [197, 210]. Intuitively, \Phi determines which edges exist between the hidden label z and the individual features y_l in the graphical model illustrated in Figure 4.4, for the case d = 4. Our notion of feature saliency is summarized in the following steps: (i) we treat the \phi_l's as missing variables; (ii) we define the feature saliency as \rho_l = p(\phi_l = 1), the probability that the l-th feature is relevant. This definition makes sense, as it is difficult to know for sure that a certain feature is irrelevant in unsupervised learning.
The resulting model (likelihood function) is written as

p(y|\theta) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \big[ \rho_l\, p(y_l|\theta_{jl}) + (1-\rho_l)\, q(y_l|\lambda_l) \big],    (4.6)

where \theta = \{\{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}, \{\rho_l\}\} is the set of all the parameters of the model.

Equation (4.6) can be derived as follows. We treat \rho_l = p(\phi_l = 1) as a set of parameters to be estimated (the feature saliencies). We assume the \phi_l's are mutually independent and also independent of the hidden component label z for any pattern y. Thus,

p(y, \Phi) = p(y|\Phi)\, p(\Phi)
 = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} (p(y_l|\theta_{jl}))^{\phi_l} (q(y_l|\lambda_l))^{1-\phi_l} \prod_{l=1}^{d} \rho_l^{\phi_l} (1-\rho_l)^{1-\phi_l}
 = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} (\rho_l\, p(y_l|\theta_{jl}))^{\phi_l} ((1-\rho_l)\, q(y_l|\lambda_l))^{1-\phi_l}.    (4.7)

The marginal density for y is

p(y) = \sum_{\Phi} p(y, \Phi) = \sum_{j=1}^{k} \alpha_j \sum_{\Phi} \prod_{l=1}^{d} (\rho_l\, p(y_l|\theta_{jl}))^{\phi_l} ((1-\rho_l)\, q(y_l|\lambda_l))^{1-\phi_l}
 = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \sum_{\phi_l=0}^{1} (\rho_l\, p(y_l|\theta_{jl}))^{\phi_l} ((1-\rho_l)\, q(y_l|\lambda_l))^{1-\phi_l}
 = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \big[ \rho_l\, p(y_l|\theta_{jl}) + (1-\rho_l)\, q(y_l|\lambda_l) \big],    (4.8)

which is just Equation (4.6). Another way to see how Equation (4.6) is obtained is to notice that the conditional density of y_l given z = j and \phi_l, namely [p(y_l|\theta_{jl})]^{\phi_l}[q(y_l|\lambda_l)]^{1-\phi_l}, can be written as \phi_l\, p(y_l|\theta_{jl}) + (1-\phi_l)\, q(y_l|\lambda_l), because \phi_l is binary. Taking the expectation with respect to \phi_l and z leads to Equation (4.6).

The form of q(\cdot|\cdot) reflects our prior knowledge about the distribution of the non-salient features. In principle it can be any 1-D distribution (e.g., a Gaussian, a Student t, or even a mixture). We shall limit q(\cdot|\cdot) to be a Gaussian, since this leads to reasonable results in practice.

Figure 4.5: An example graphical model showing the mixture density in Equation (4.6). The variables z, \phi_1, \phi_2, \phi_3, \phi_4 are "hidden" and only y_1, y_2, y_3, y_4 are observed.

Equation (4.6) has a generative interpretation. As in a standard finite mixture, we first select the component label j by sampling from a multinomial distribution with parameters (\alpha_1, \ldots, \alpha_k). Then, for each feature l = 1, \ldots, d, we flip a biased coin whose probability of getting a head is \rho_l; if we get a head, we use the mixture component p(\cdot|\theta_{jl}) to generate the l-th feature; otherwise, the common component q(\cdot|\lambda_l) is used. A graphical model representation of Equation (4.6) is shown in Figure 4.5 for the case d = 4.

4.3.2.1 EM Algorithm

By treating \mathcal{Z} (the hidden class labels) and \Phi (the feature indicators) as hidden variables, one can derive an EM algorithm for parameter estimation. The complete-data likelihood for the model in Equation (4.6) is

p(y_i, z_i = j, \Phi) = \alpha_j \prod_{l=1}^{d} (\rho_l\, p(y_{il}|\theta_{jl}))^{\phi_l} ((1-\rho_l)\, q(y_{il}|\lambda_l))^{1-\phi_l}.    (4.9)

Define the following quantities:

w_{ij} = P(z_i = j | y_i),  u_{ijl} = P(\phi_l = 1, z_i = j | y_i),  v_{ijl} = P(\phi_l = 0, z_i = j | y_i).

They are calculated using the current parameter estimate \theta^{now}. Note that u_{ijl} + v_{ijl} = w_{ij} and \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij} = n. The expected complete data log-likelihood based on \theta^{now} is

E_{\theta^{now}}[\log p(\mathcal{Y}, \mathcal{Z}, \Phi)] = \sum_{i,j} w_{ij} \log \alpha_j + \sum_{i,j,l} u_{ijl} \log p(y_{il}|\theta_{jl}) + \sum_{i,j,l} v_{ijl} \log q(y_{il}|\lambda_l)
 + \sum_{l} \Big( \log \rho_l \sum_{i,j} u_{ijl} + \log(1-\rho_l) \sum_{i,j} v_{ijl} \Big).    (4.10)

The four parts in the equation above can be maximized separately. Recall that the densities p(\cdot) and q(\cdot) are univariate Gaussian and are characterized by their means and variances. As a result, maximizing the expected complete data log-likelihood leads to the M-step in Equations (4.18)-(4.23).
For the E-step, observe that

P(\phi_l = 1 | z_i = j, y_i) = \frac{P(\phi_l = 1, y_i | z_i = j)}{p(y_i | z_i = j)}
 = \frac{\rho_l\, p(y_{il}|\theta_{jl}) \prod_{l' \neq l} \big( \rho_{l'} p(y_{il'}|\theta_{jl'}) + (1-\rho_{l'}) q(y_{il'}|\lambda_{l'}) \big)}{\prod_{l'} \big( \rho_{l'} p(y_{il'}|\theta_{jl'}) + (1-\rho_{l'}) q(y_{il'}|\lambda_{l'}) \big)}
 = \frac{\rho_l\, p(y_{il}|\theta_{jl})}{\rho_l\, p(y_{il}|\theta_{jl}) + (1-\rho_l)\, q(y_{il}|\lambda_l)} = \frac{a_{ijl}}{c_{ijl}}.

Therefore, Equation (4.16) follows because

u_{ijl} = P(\phi_l = 1 | z_i = j, y_i)\, P(z_i = j | y_i) = \frac{a_{ijl}}{c_{ijl}}\, w_{ij}.    (4.11)

So, the EM algorithm is

- E-step: Compute the following quantities:

a_{ijl} = p(\phi_l = 1, y_{il} | z_i = j) = \rho_l\, p(y_{il}|\theta_{jl})    (4.12)
b_{ijl} = p(\phi_l = 0, y_{il} | z_i = j) = (1-\rho_l)\, q(y_{il}|\lambda_l)    (4.13)
c_{ijl} = p(y_{il} | z_i = j) = a_{ijl} + b_{ijl}    (4.14)
w_{ij} = P(z_i = j | y_i) = \frac{\alpha_j \prod_l c_{ijl}}{\sum_{j'} \alpha_{j'} \prod_l c_{ij'l}}    (4.15)
u_{ijl} = P(\phi_l = 1, z_i = j | y_i) = \frac{a_{ijl}}{c_{ijl}}\, w_{ij}    (4.16)
v_{ijl} = P(\phi_l = 0, z_i = j | y_i) = w_{ij} - u_{ijl}    (4.17)

- M-step: Re-estimate the parameters according to the following expressions:

\hat{\alpha}_j = \frac{\sum_i w_{ij}}{\sum_{i,j} w_{ij}} = \frac{\sum_i w_{ij}}{n}    (4.18)
\text{Mean in } \hat{\theta}_{jl} = \frac{\sum_i u_{ijl}\, y_{il}}{\sum_i u_{ijl}}    (4.19)
\text{Var in } \hat{\theta}_{jl} = \frac{\sum_i u_{ijl}\, (y_{il} - \text{Mean in } \hat{\theta}_{jl})^2}{\sum_i u_{ijl}}    (4.20)
\text{Mean in } \hat{\lambda}_l = \frac{\sum_i (\sum_j v_{ijl})\, y_{il}}{\sum_i \sum_j v_{ijl}}    (4.21)
\text{Var in } \hat{\lambda}_l = \frac{\sum_i (\sum_j v_{ijl})\, (y_{il} - \text{Mean in } \hat{\lambda}_l)^2}{\sum_i \sum_j v_{ijl}}    (4.22)
\hat{\rho}_l = \frac{\sum_{i,j} u_{ijl}}{\sum_{i,j} u_{ijl} + \sum_{i,j} v_{ijl}} = \frac{\sum_{i,j} u_{ijl}}{n}    (4.23)

In these equations, the variable u_{ijl} measures how important the i-th pattern is to the j-th component when the l-th feature is used. It is thus natural that the estimates of the mean and the variance in \theta_{jl} are weighted sums with weights u_{ijl}. A similar relationship exists between \sum_j v_{ijl} and \lambda_l. The term \sum_{i,j} u_{ijl} can be interpreted as how likely it is that \phi_l equals one, explaining why the estimate of \rho_l is proportional to \sum_{i,j} u_{ijl}.
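To make one full EM sweep concrete, the sketch below implements Equations (4.12)-(4.23) for univariate Gaussian p(.|theta_jl) and q(.|lambda_l) using vectorized numpy operations. It is a minimal illustration of the update rules rather than the thesis code; the array shapes and the small constant added for numerical safety are assumptions of this sketch.

```python
import numpy as np

def gauss_pdf(y, mean, var):
    """Univariate Gaussian density, broadcast over arrays."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_step(Y, alpha, mu, var, mu_q, var_q, rho, eps=1e-9):
    """One E-step + M-step of the feature-saliency EM (Eqs. 4.12-4.23).

    Y         : (n, d) data matrix.
    alpha     : (k,) component weights.
    mu, var   : (k, d) means/variances of p(y_l | theta_jl).
    mu_q, var_q : (d,) mean/variance of the common density q(y_l | lambda_l).
    rho       : (d,) feature saliencies.
    """
    n, d = Y.shape
    # E-step: a, b, c, u, v have shape (n, k, d); w has shape (n, k).
    a = rho * gauss_pdf(Y[:, None, :], mu, var)                   # (4.12)
    b = (1 - rho) * gauss_pdf(Y, mu_q, var_q)[:, None, :]         # (4.13)
    c = a + b                                                     # (4.14)
    w = alpha * np.exp(np.log(c + eps).sum(axis=2))               # numerator of (4.15)
    w /= w.sum(axis=1, keepdims=True)                             # (4.15)
    u = (a / (c + eps)) * w[:, :, None]                           # (4.16)
    v = w[:, :, None] - u                                         # (4.17)

    # M-step.
    alpha = w.sum(axis=0) / n                                     # (4.18)
    su = u.sum(axis=0) + eps
    mu = (u * Y[:, None, :]).sum(axis=0) / su                     # (4.19)
    var = (u * (Y[:, None, :] - mu) ** 2).sum(axis=0) / su + eps  # (4.20)
    sv = v.sum(axis=(0, 1)) + eps
    mu_q = (v.sum(axis=1) * Y).sum(axis=0) / sv                   # (4.21)
    var_q = (v.sum(axis=1) * (Y - mu_q) ** 2).sum(axis=0) / sv    # (4.22)
    rho = u.sum(axis=(0, 1)) / n                                  # (4.23)
    return alpha, mu, var, mu_q, var_q, rho
```

In Algorithm 4.1 below, the updates (4.18) and (4.23) are replaced by their pruned versions (4.29) and (4.30), derived in the next subsection.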
The size of 1(0) is (k + (1+ kdr + (15), where r and s are the number of parameters in 6]) and A), respectively. Note that (p[(1 — pl))_1 is the Fisher information of a Bernoulli distribution with parameter p1. Thus we can write (1 k (1 1041140)! = 1.1g1({a,))+ 210ng,) + T 2: Dog (am) 1:1 j:11 .__1 k d d d (4.27) +ZZlogI(6jl)+s:10g(1—pl) +Zlogl()\l) j=11=1 1:1 [:1 For the prior densities of the parameters, we assume that different groups of parame- ters are independent. Specifically, {(1')}, pl (for different values of l), 63-, (for different values of j and l) and A) (for different values Of 1) are independent. Furthermore, since we have no knowledge about. the parameters, we adopt non-informative Jeffrey’s pri- ors (see [81] for details and references), which are proportional to the square root of the determinant of the corresponding information matrices. When we substitute p(0) and [I(0)| into equation (4.25), and drop the order-one term, we obtain our final criterion, which is equation (4.24). From a parameter estimation viewpoint, Equation (4.24) is equivalent to a massi- mum a posterior? (MAP) estimate, k. d d . d "k 0 = arg mgx{logp(y|9) — :2— 2 log aJ- — g E log(1 — pl) — 1—2— 2 log pl}, (4.28) 144 with the following (Dirichlet-type, but improper) priors 011 the o-j’s and pl’s: 44-.., 4. 4 11 71/2, —k2(1 _ P(p11-~1p(1)0( Hp r/ ()1) 3/2- Since these priors are conjugate with respect to the complete data likelihood, the EM algorithm undergoes a. minor modification: the M—step Equations (4.18) and (4.23) are replaced by , l ’5 _ max(zi 'wij — :29, O) (4 29) “3‘2 (2' ---"’ o1 ‘ J max , 111,] —2—, max<2.-,,u.,1— 4 0) rnax(Z,-J um - 525, 0) + max(Z,-J vifl —— 3 0) 17) = (4.30) 111 addition to the log—likelihood, the other terms in Equation (4.24) have simple ii'iterpretations. The term—2— k+d log n is a standard MDL type [215] parameter code- length corresponding to k 03' values and d p) values. For the l—th feature in the j-th component, the “effective” number of data points for estimating (9]) is najpl. Since there are 7‘ parameters in each 6]), the corresponding code-length is glog(nplaj). Similarly, for the l—th feature in the common component, the number of effective data points for estimation is 71(1 — pl). Thus, there is a term 3 log(n(1 — p))) in (4.24) for each feature. One key property of Equations (4.29) and (4.30) is their pruning behavior, forcing some of the aj to go to zero and some of the pl to go to zero or one. This pruning 145 behavior also has the indirect benefit of protecting us from almost singular covariance matrix in a mixture component: the weight of such a component is usually very small, and the component is likely to be pruned in the next few iterations. Concerns that the message length in (4.24) may become invalid at these boundary values can be circumvented by the arguments in [81]: when p) goes to zero, the l-th feature is no longer salient and p) and 611, . . . ,0)“ are removed; when Pl goes to 1, A) and pl are dropped. Finally, since the model selection algorithm determines the number of components, it can be initialized with a. large value of k, thus alleviating the need for a good initialization, as shown in [81]. Because of this, a component-wise version of EM can be adopted [37, 81]. The algorithm is summarized in Algorithm 4.1. Input: Training data y = {y1, . . . 
4.3.4 Post-processing of Feature Saliency

The feature saliencies generated by Algorithm 4.1 attempt to find the best way to model the data, using different component densities. Alternatively, we can consider feature saliencies that best discriminate between different components. This can be more appropriate if the ultimate goal is to discover well-separated clusters. If the components are well-separated, each pattern is likely to be generated by one component only. Therefore, one quantitative measure of the separability of the clusters is

J = \sum_{i=1}^{n} \log P(z_i = t_i | y_i),    (4.31)

where t_i = \arg\max_j P(z_i = j | y_i). Intuitively, J is the sum of the logarithms of the posterior probabilities of the data, assuming that each data point was indeed generated by the component with maximum posterior probability (an implicit assumption in mixture-based clustering). J can then be maximized by varying \rho_l while keeping the other parameters fixed. Unlike the MML criterion, J cannot be optimized by an EM algorithm. However, by defining

h_{ilj} = \frac{p(y_{il}|\theta_{jl}) - q(y_{il}|\lambda_l)}{\rho_l\, p(y_{il}|\theta_{jl}) + (1-\rho_l)\, q(y_{il}|\lambda_l)},   g_{il} = \sum_{j=1}^{k} w_{ij}\, h_{ilj},

we can show that

\frac{\partial}{\partial\rho_l}\log P(z_i = j | y_i) = h_{ilj} - g_{il},
\frac{\partial^2 J}{\partial\rho_l\,\partial\rho_m} = \sum_{i=1}^{n}\Big( g_{il}\, g_{im} - \sum_{j=1}^{k} w_{ij}\, h_{ilj}\, h_{imj} \Big) \quad \text{for } l \neq m,
\frac{\partial^2 J}{\partial\rho_l^2} = \sum_{i=1}^{n}\big( g_{il}^2 - h_{ilt_i}^2 \big).

The gradient and Hessian of J can then be calculated accordingly, if we ignore the dependence of t_i on \rho_l. We can then use any constrained non-linear optimization software to find the optimal values of \rho_l in [0, 1]. We have used the MATLAB optimization toolbox in our experiments. After obtaining the set of optimized \rho_l, we fix them and estimate the remaining parameters using the EM algorithm.
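The sketch below evaluates J and its gradient with respect to rho from the quantities defined above. It is an illustrative fragment under the notation of this section (the per-feature density values are assumed to be pre-computed and passed in as arrays); the dependence of t_i on rho is ignored, as in the text.

```python
import numpy as np

def saliency_objective(alpha, P, Q, rho, eps=1e-12):
    """Criterion J of Eq. (4.31) and its gradient with respect to rho.

    P : (n, k, d) values of p(y_il | theta_jl); Q : (n, d) values of
    q(y_il | lambda_l); alpha : (k,) component weights; rho : (d,) saliencies.
    """
    n, k, d = P.shape
    c = rho * P + (1 - rho) * Q[:, None, :]              # rho_l p + (1 - rho_l) q
    w = alpha * np.exp(np.log(c + eps).sum(axis=2))      # unnormalized posteriors
    w /= w.sum(axis=1, keepdims=True)
    t = w.argmax(axis=1)                                 # t_i = argmax_j P(z_i = j | y_i)
    J = np.log(w[np.arange(n), t] + eps).sum()           # Eq. (4.31)
    h = (P - Q[:, None, :]) / (c + eps)                  # h_{ilj}, shape (n, k, d)
    g = np.einsum('ij,ijl->il', w, h)                    # g_{il} = sum_j w_ij h_ilj
    grad = (h[np.arange(n), t, :] - g).sum(axis=0)       # dJ/drho_l
    return J, grad
```

The gradient (and, if desired, the Hessian given above) can be handed to any box-constrained optimizer that keeps each rho_l in [0, 1], for example by minimizing -J with scipy.optimize.minimize under bound constraints, in place of the MATLAB toolbox mentioned above.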
4.4 Experimental Results

4.4.1 Synthetic Data

The first synthetic data set consisted of 800 data points from a mixture of four equiprobable Gaussians N(m_i, I), i \in \{1, 2, 3, 4\}, whose well-separated mean vectors m_1, \ldots, m_4 can be seen in Figure 4.6(a). Eight "noisy" features (sampled from a N(0, 1) density) were then appended to this data, yielding a set of 800 10-dimensional patterns. We ran the proposed algorithm 10 times, each time initialized with k = 30; the common component was initialized to cover the entire set of data, and the feature saliency values were initialized at 0.5. A local minimum was declared if the change in description length between two iterations was less than 10^{-7}. A typical run of the algorithm is shown in Figure 4.6. In all the ten random runs with this mixture, the four components were always correctly identified.

Figure 4.6: An example execution of the proposed algorithm: (a) the data set; (b) initialization (iteration 1, k = 30); (c) a snapshot (iteration 35, k = 13); (d) \rho_2 is "pruned" to 1 (iteration 40, k = 10); (e) a local minimum (iteration 99, k = 5); (f) the best local minimum (iteration 182, k = 4). The solid ellipses represent the Gaussian mixture components; the dotted ellipse represents the common density. The number in parentheses along each axis label is the feature saliency; when it reaches 1, the common component is no longer applicable to that feature. Thus, in (d), the common component degenerates to a line; when the feature saliency for feature 1 also becomes 1, as in (f), the common density degenerates to a point at (0, 0).

The saliencies of all the ten features, together with their standard deviations (error bars), are shown in Figure 4.7(a). We can conclude that, in this case, the algorithm successfully locates the true clusters and correctly assigns the feature saliencies.

In the second experiment, we considered the Trunk data [122, 252], consisting of two 20-dimensional Gaussians N(m_1, I) and N(m_2, I), where m_1 = (1, 1/\sqrt{2}, \ldots, 1/\sqrt{20}) and m_2 = -m_1. Data were obtained by sampling 5000 points from each of these two Gaussians. Note that the features are arranged in descending order of relevance. As above, the stopping threshold was set to 10^{-7} and the initial value of k was set to 30. In all the 10 runs performed, the two components were always detected. The feature saliencies are shown in Figure 4.7(b). The lower the feature number, the more important the feature. We can see the general trend that, as the feature number increases, the saliency decreases, in accordance with the true characteristics of the data.

Figure 4.7: Feature saliencies for (a) the 10-D 4-Gaussian data set used in Figure 4.6(a), and (b) the Trunk data set. The mean values plus and minus one standard deviation over ten runs are shown. Recall that features 3 to 10 of the 4-Gaussian data set are the noisy features.
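For reference, data of the same form as the two synthetic experiments can be generated as sketched below. The specific 2-D means used for the four-Gaussian set are illustrative placeholders chosen only to be well separated; they are not the exact values used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def four_gaussians_with_noise(n_per=200, noise_dim=8):
    """800 points from 4 equiprobable 2-D Gaussians N(m_i, I) plus noisy features.

    The means below are hypothetical, chosen only for good separation.
    """
    means = np.array([[0.0, 4.0], [1.0, 10.0], [6.0, 4.0], [7.0, 10.0]])
    X = np.vstack([rng.normal(m, 1.0, size=(n_per, 2)) for m in means])
    labels = np.repeat(np.arange(4), n_per)
    noise = rng.normal(0.0, 1.0, size=(X.shape[0], noise_dim))  # irrelevant features 3-10
    return np.hstack([X, noise]), labels

def trunk_data(n_per=5000, dim=20):
    """Trunk data: N(m1, I) vs N(-m1, I) with m1 = (1, 1/sqrt(2), ..., 1/sqrt(dim))."""
    m1 = 1.0 / np.sqrt(np.arange(1, dim + 1))
    X = np.vstack([rng.normal(m1, 1.0, size=(n_per, dim)),
                   rng.normal(-m1, 1.0, size=(n_per, dim))])
    labels = np.repeat([0, 1], n_per)
    return X, labels
```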
4.4.2 Real Data

We tested our algorithm on several data sets with different characteristics (Table 4.1). The wine recognition data set (wine) contains results of chemical analysis of wines grown in different cultivars; the goal is to predict the type of a wine based on its chemical composition. It has 178 data points, 13 features, and 3 classes. The Wisconsin diagnostic breast cancer data set (wdbc) was used to obtain a binary diagnosis (benign or malignant) based on 30 features extracted from cell nuclei presented in an image; it has 569 data points. The image segmentation data set (image) contains 2320 data points with 19 features from seven classes; each pattern consists of features extracted from a 3 x 3 region taken from 7 types of outdoor images: brickface, sky, foliage, cement, window, path, and grass. The texture data set (texture) consists of 4000 19-dimensional Gabor filter features from a collage of four Brodatz textures [127]. A data set (zernike) of 47 Zernike moments extracted from images of handwritten numerals (as in [126]) is also used; there are 200 images for each digit, totaling 2000 patterns. The data sets wine, wdbc, image, and zernike are from the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLSummary.html). This repository has been extensively used in pattern recognition and machine learning studies. Normalization to zero mean and unit variance is performed for all but the texture data set, so as to make the contributions of different features roughly equal a priori. We do not normalize the texture data set because it is already approximately normalized.

Table 4.1: Real world data sets used in the experiments. Each data set has n data points with d features from c classes. One feature with a constant value in image is discarded. Normalization is not needed for texture because the features have comparable variances.

Abbr.     Full name                              n     d   c   Normalized?
wine      wine recognition                       178   13  3   yes
wdbc      Wisconsin diagnostic breast cancer     569   30  2   yes
image     image segmentation                     2320  18  7   yes
texture   Texture data set                       4000  19  4   no
zernike   Zernike moments of digit images        2000  47  10  yes

Since these data sets were collected for supervised classification, the class labels are not involved in our experiments, except for the evaluation of the clustering results. Each data set was first randomly divided into two halves: one for training, another for testing. Algorithm 4.1 was run on the training set, and the feature saliency values can be refined as described in Section 4.3.4. We evaluated the results by interpreting the fitted Gaussian components as clusters and comparing them with the ground truth labels. Each data point in the test set was assigned to the component that most likely generated it, and the pattern was classified to the class represented by that component. We then computed the error rates on the test data. For comparison, we also ran the mixture of Gaussians algorithm in [81] using all the features, with the number of classes of the data set as a lower bound on the number of components. This gives us a fair ground for comparing Gaussian mixtures with and without feature saliency. In order to ensure that we had enough data with respect to the number of features for the algorithm in [81], the covariance matrices of the mixture components were restricted to be diagonal, but were different for different components.
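The evaluation just described (assign each test point to its most likely component, associate each component with a class, and report the test error) can be sketched as follows. The majority-vote mapping from components to classes is one reasonable reading of "the class represented by the component" and is an assumption of this sketch.

```python
import numpy as np

def evaluate_clustering(post_train, y_train, post_test, y_test):
    """Error rate of a fitted mixture used as a clusterer.

    post_* : (n, k) posterior probabilities P(z = j | y) for each split.
    y_*    : ground-truth class labels, used only for evaluation.
    """
    comp_train = post_train.argmax(axis=1)
    k = post_train.shape[1]
    # Map each component to the majority class among its training points.
    comp_to_class = {}
    for j in range(k):
        members = y_train[comp_train == j]
        comp_to_class[j] = np.bincount(members).argmax() if members.size else -1
    pred = np.array([comp_to_class[j] for j in post_test.argmax(axis=1)])
    return float(np.mean(pred != y_test))
```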
The entire procedure was repeated 20 times with different splits of the data and different initializations of the algorithm. The results are shown in Table 4.2. We also show the feature saliency values of different features in different runs as gray-level image maps in Figure 4.9. For illustrative purposes, we contrast the clusters obtained for the image data set with the true class labels in Figure 4.8, after using PCA to project the data into 3D.

Table 4.2: Results of the algorithm over 20 random data splits and algorithm initializations. "Error" is the mean of the error rates on the test set when the clustering results are compared with the ground truth labels; k-hat denotes the estimated number of Gaussian components. Note that the post-processing does not change the number of Gaussian components. The numbers in parentheses are the standard deviations of the corresponding quantities.

           Algorithm 4.1              With post-processing   Using all the features
           error (in %)   k-hat       error (in %)           error (in %)   k-hat
wine       6.61 (3.91)    3.1 (0.31)    6.61 (3.23)          8.06 (3.73)    3 (0)
wdbc       9.55 (1.99)    5.65 (0.75)   9.35 (2.07)          10.09 (2.00)   2.70 (0.57)
image      20.19 (1.54)   23.1 (1.74)   20.28 (1.60)         32.84 (5.1)    13.8 (1.94)
texture    4.04 (0.76)    36.17 (1.19)  4.02 (0.74)          4.85 (0.98)    31.42 (2.81)
zernike    52.09 (2.52)   11.3 (0.98)   51.99 (2.32)         56.42 (3.62)   10 (0)

From Table 4.2, we can see that the proposed algorithm reduces the error rates when compared with using all the features, for all five data sets. The improvement is more significant for the image data set, but this may be due to the increased number of components estimated. The high error rate for zernike is due to the fact that digit images are inherently more difficult to cluster: for example, a "4" can be written in a manner very similar to a "9", and it is difficult for any unsupervised learning algorithm to distinguish between them. The post-processing can increase the "contrast" of the feature saliencies, as the image maps in Figure 4.9 show, without deteriorating the accuracy. It is easier to perform "hard" feature selection using these post-processed feature saliencies, if this is required for the application.

Figure 4.8: Clustering results on the image data set, with the data points projected to 3D by PCA; only the labels for the testing data are shown. (a) The true class labels. (b) The clustering results of Algorithm 4.1. (c) The clustering result using all the features. A cluster is matched to its majority class before plotting. The error rates for the proposed algorithm and the algorithm using all the features in this particular run are 22% and 30%, respectively.

4.5 Discussion

4.5.1 Complexity

The major computational load in the proposed algorithm is in the E-step and the M-step. Each E-step iteration computes O(ndk) quantities. As each quantity can be computed in constant time, the time complexity of the E-step is O(ndk). Similarly, the M-step takes O(ndk) time. The total amount of computation depends on the number of iterations required for convergence.

At first sight, the amount of computation seems demanding. However, a closer examination reveals that each iteration (E-step and M-step) of the standard EM algorithm also takes O(ndk) time. The value of k in the standard EM, though, is usually smaller, because the proposed algorithm starts with a larger number of components. The number of iterations required for our algorithm is also in general larger because of the increase in the number of parameters. Therefore, it is true that the proposed algorithm takes more time than the standard EM algorithm with one parameter setting.
However, the proposed algorithm can determine the number of clusters as well as the feature subset. If we wanted to achieve the same goal with the standard EM algorithm using a wrapper approach, we would need to re-run EM multiple times with different numbers of components and different feature subsets. The computational demand of that approach is much heavier than that of the proposed algorithm, even with a heuristic search to guide the selection of feature subsets. Another strength of the proposed algorithm is that, by initializing with a large number of Gaussian components, it is less sensitive to the local minimum problem than the standard EM algorithm. We can further reduce the complexity by adopting optimization techniques applicable to standard EM for Gaussian mixtures, such as sampling the data, compressing the data [28], or using efficient data structures [203, 224].

For the post-processing step in Section 4.3.4, each computation of the quantity J and its gradient and Hessian takes O(ndk) time. The number of iterations is difficult to predict, as it depends on the optimization routine. However, we can always put an upper bound on the number of iterations and trade speed for the optimality of the results.

4.5.2 Relation to Shrinkage Estimate

One interpretation of Equation (4.6) is that we "regularize" the distribution of each feature in the different components by the common distribution. This is analogous to the shrinkage estimator for covariance matrices of class-conditional densities [68], which is a weighted sum of an estimate of the class-specific covariance matrix and the "global" covariance matrix estimate. In Equation (4.6), the pdf of the l-th feature is also a weighted sum of a component-specific pdf and a common density. An important difference here is that the weight \rho_l is estimated from the data, using the MML principle, instead of being set heuristically, as is commonly done. As shrinkage estimators have found empirical success in combating data scarcity, this "regularization" viewpoint is an alternative explanation for the usefulness of the proposed algorithm.

4.5.3 Limitation of the Proposed Algorithm

A limitation of the proposed algorithm is the feature independence assumption (conditioned on the mixture component). While, empirically, violating the independence assumption usually does not affect the accuracy of a classifier (as in supervised learning) or the quality of clusters (as in unsupervised learning), it does have some negative influence on the feature selection problem. Specifically, a feature that is redundant, because its distribution is independent of the component label given another feature, cannot be modelled under the feature independence assumption. As a result, both features are kept. This explains why, in general, the feature saliencies are somewhat high. The post-processing in Section 4.3.4 can cope with this problem because it considers the posterior distribution and can therefore directly discard features that do not help in identifying the clusters.

4.5.4 Extension to Semi-supervised Learning

Sometimes, we may have some knowledge of the class labels of different Gaussian components. This can happen when, say, we adopt a procedure to combine different Gaussian components to form a cluster (e.g., as in [216]), or in a semi-supervised learning scenario, where we can use a small amount of labelled data to help us identify which Gaussian component belongs to which class.
This additional information can suggest the combination of several Gaussian components to form a single class/cluster, thereby allowing the identification of non-Gaussian clusters. The post-processing step can take advantage of this information.

Suppose we know there are C classes, and that the posterior probability that pattern y_i belongs to the c-th class, denoted \tau_{ic}, can be computed as \tau_{ic} = \sum_{j=1}^{k} \beta_{cj} P(z_i = j | y_i). For example, if we know that the components 4, 6, and 10 are from class 2, we can set \beta_{2,4} = \beta_{2,6} = \beta_{2,10} = 1/3 and the other \beta_{2,j} to zero. The post-processing is modified accordingly: redefine t_i in Equation (4.31) as t_i = \arg\max_c \tau_{ic}, i.e., it becomes the class label for y_i in view of the extra information, and replace \log P(z_i = t_i | y_i) in Equation (4.31) by \log \tau_{it_i}. The gradient and Hessian can still be computed easily after noting that

\frac{\partial}{\partial\rho_l}\log w_{ij} = h_{ilj} - g_{il},
\frac{\partial}{\partial\rho_l}\log\tau_{ic} = \frac{1}{\tau_{ic}}\sum_{j=1}^{k}\beta_{cj}\frac{\partial w_{ij}}{\partial\rho_l} = \sum_{j=1}^{k}\frac{\beta_{cj}\, w_{ij}}{\tau_{ic}}\,(h_{ilj} - g_{il}).    (4.32)

We can then optimize the modified J in Equation (4.31) to carry out the post-processing.

4.5.5 A Note on Maximizing the Posterior Probability

The sum of the logarithms of the maximum posterior probabilities considered in the post-processing in Section 4.3.4 can be regarded as the sample estimate of an unorthodox type of entropy (see [141]) for the posterior distribution. It can be regarded as the limit of Renyi's entropy H_{\alpha}(p) when \alpha tends to infinity, where

H_{\alpha}(p) = \frac{1}{1-\alpha}\log\sum_{j=1}^{k} p_j^{\alpha}.    (4.33)

When this entropy is used for parameter estimation under the maximum entropy framework, the corresponding procedure is closely related to minimax inference. Other functions of the posterior probabilities can also be used, such as the Shannon entropy of the posterior distribution. A preliminary study shows that the use of different types of entropy does not affect the results significantly.

4.6 Summary

Given n points in d dimensions, we have presented an EM algorithm to estimate the saliencies of individual features and the best number of components for Gaussian-mixture clustering. The proposed algorithm avoids running EM many times with different numbers of components and different feature subsets, and can achieve better performance than using all the available features for clustering. By initializing with a large number of mixture components, our EM algorithm is less prone to the problem of poor local minima. The usefulness of the algorithm was demonstrated on both synthetic and benchmark real data sets.

Figure 4.9: Image maps of feature saliency for the different data sets with and without the post-processing procedure: (a)/(b) wine, (c)/(d) wdbc, (e)/(f) image, (g)/(h) texture, and (i)/(j) zernike, shown for the proposed algorithm and after post-processing, respectively. A feature saliency of 1 (0) is shown as a pixel of gray level 255 (0). The vertical and horizontal axes correspond to the feature number and the trial number, respectively.

Chapter 5

Clustering With Constraints

In Section 1.4, we introduced instance-level constraints as a type of side-information for clustering.
In this chapter, we shall examine the drawbacks of the existing clus- tering under constraints algorithms, and propose a new algorithm that can remedy the defects. Recall that there are two types of instance-level constraints: a must-link/ positive constraint requires two or more objects to be put in the same cluster, whereas a must- not-link/negative constraint requires two or more objects to be placed in different clusters. Often, the constraints are pairwise, though one can extend them to multiple objects [231, 167]. Constraints are particularly appropriate in a clustering scenario, because there is no clear notion of the target classes. On the other hand, the user can suggest if two or more objects should be included in the same cluster or not. This can be done in an interactive manner, if desired. Side-information can improve the robustness of a clustering algorithm towards model mismatch, because it provides additional clues for the desirable clusters other than the shape of the clusters, as 161 suggested by the parametric model. Side-information has also been found to alleviate the problem of local minima of the clustering objective function. Clustering with instance—level constraints is different from learning with partially- labeled data, also known as transductive learning or semi-supervised learning [136, 288, 157, 169, 289, 287, 98, 195], where the class labels of some of the objects are provided. Constraints only reveal the relationship among the labels, not the labels themselves. Indeed, if the “absolute” labels can be specified, the user is no longer facing a clustering task, and a supervised method should be adopted instead. We contrast different learning settings according to the type of information avail- able in Figure 5.1. At one end of the spectrum, we have supervised learning (Fig- ure 5.1(a)), where the labels of all the objects are known. At the other end of the spectrum, we have unsupervised learning (Figure 5.1(d)), where the label information is absent. In between, we can have partially labeled data (Figure 5.1(b)), where the true class labels of some of the objects are known. The main scenario considered in this paper is depicted in Figure 5.1(c): there is no label information, but must-link and must-not-link constraints (represented by solid and dashed lines, respectively) are provided. Note that the settings exemplified in Figures 5.1(a) and 5.1(b) are classification—oriented because there is a clear definition of different classes. On the other hand, the setups in Figures 5.1(c) and 5.1(d) are clustering-oriented, because no precise definitions of classes are given. The clustering algorithm needs to discover the classes. 162 5.0. 1 Related Work Different algorithms have been proposed for clustering under instance—level con- straints. In [262], the four primary operators in COBVVEB were modified in view of the constraints. The k-means algorithm was modified in [263] to avoid violating the constraints when different objects are assigned to different clusters. However, the algorithm can fail even when a solution exists. Positive constraints served as “short- cuts” in [148] to modify the dissimilarity measure for complete—link clustering. There can be catastrophic consequences if a single constraint is incorrect, because the dis- similarity matrix can be greatly distorted by a wrong constraint. Spectral clustering was modified in [138] to work with constraints, which augmented the affinity matrix. 
Constraints were incorporated into image segmentation algorithms by solving the constrained version of the corresponding normalized cut problem, with smoothness of the cluster labels explicitly incorporated in the formulation [279]. A hidden Markov random field was used in [14] for k-means clustering with constraints. Constraints have also been used for metric-learning [274]; in fact, the problems of metric-learning and k-means clustering with constraints were considered simultaneously in [21]. Because the problem of k-means with metric-learning is related to EM clustering with a common covariance matrix, the work in [21] may be viewed as related to EM clustering with constraints. The work in [158] extended the work in [21] by studying the relationship between constraints and the kernel k-means algorithms. Ideas based on hidden Markov random fields have also been used for model-based clustering with constraints [14, 176, 161]; the difference between these three methods lies in how the inference is conducted. In particular, the method in [14] used iterated conditional modes (ICM), the method in [176] used Gibbs sampling, and the method in [161] used a mean-field approximation. The approach in [286] is similar to [161], since both used a mean-field approximation; however, the authors of [286] also considered the case where each class is modeled by more than one component. A related idea was presented in [231], which uses a graphical model for generating the data with constraints. A fairly different route to clustering under constraints was taken by the authors in [10] under the name "correlation clustering", which used only the positive and negative constraints (and no information on the objects) for clustering; the number of clusters can be determined by the constraints. Table 5.1 provides a summary of these algorithms for clustering under constraints.

Table 5.1: Different algorithms for clustering with constraints (approach, key idea, and example references).
Distance editing: modify the distance/proximity matrix according to the constraints [148, 138].
Constraints on labels: the cluster labels are inferred under the restriction that the constraints are always satisfied [262, 263, 279].
Hidden Markov random field: the cluster labels constitute a hidden Markov random field; the feature vectors are assumed to be independent of each other given the cluster labels [14, 21, 12, 158, 176, 161, 286].
Modify generation model: the generation process of the data points that participate in constraints is modified [231, 166, 167].
Constraints resolution: the clustering solution is obtained by resolving the constraints only [10].

In most of these approaches, clustering with constraints has been shown to improve the quality of clustering in different domains. Example applications include text classification [14], image segmentation [161], and video retrieval [231].
In a non-parametric clustering algorithm such as pairwise clustering [114] and methods based on graph-cut [234, 272], there is no restriction on this hypothesis space. A particular non-parametric clustering algorithm selects the best clustering solution in the space according to some criterion function. In other words, if a poor criterion function is used (perhaps due to the influence of constraints), one can obtain a counter-intuitive clustering solution such as the one in Figure 5.3(c), where very similar objects can be assigned different cluster labels. Note that objects in non-parametric clustering, unlike in parametric clustering, may not have a feature vector representation. They can be represented, for example, by pairwise affinity or dissimilarity measure with higher order [1] The hypothesis space in parametric clustering is typically much smaller, because the parametric assumption imposes restrictions on the cluster boundaries. While these restrictions are generally perceived as a drawback, they become advantageous when they prevent counter-intuitive clustering solutions such as the one in F ig- ure 5.3((:) from appearing. These clustering solutions are simply outside the hy- 165 pothesis space of parametric clustering, and are never attainable irrespective of how the constraints modify the clustering objective function. An example contrasting parametric and non-parametric clustering is shown in Figure 5.2. The particular parametric family considered in this example is a Gaussian distribution with common covariance matrix, resulting in linear cluster boundaries. 5.0.2.1 Inconsistent Hypothesis Space in Existing Approaches The basic idea of most of the existing parametric clustering with instance-level con- straints algorithms [263, 14, 21, 12, 158, 176, 161, 286] is to use some variants of hidden Markov random fields to model the cluster labels and the feature vectors. Given the cluster label of the object, its feature vector is assumed to be independent ' of the feature vectors and the cluster labels of all the other objects. The cluster labels, which are hidden (unknown), form a Markov random field, with the potential function in this random field related to the satisfiability of the constraints based on the cluster labels. There is an unfortunate consequence of adopting the hidden Markov random field, however. For objects participating in the constraints, their cluster labels are deter- mined by the cluster parameters, associated feature vectors and the constraints. On the other hand, for data points without constraints, the cluster labels are determined by only the cluster parameters and associated feature vectors. We can thus see that there is an inconsistency in how the objects obtain their cluster labels. In other words, two identical objects, one with a constraint and one without, can be assigned different cluster labels! This is the underlying reason for the problem illustrated in 166 Figure 5.3(d), where two objects with almost identical feature vectors are assigned different labels due to the constraints. From a generative viewpoint, the above inconsistency is caused by the difference in how data points with and without constraints are generated. For the data points without constraint, each of them is generated in an identical and independent manner according to the current cluster parameter value. 
On the other, all the data points with constraints are generated simultaneously by first choosing the cluster labels according to the hidden Markov random field, followed by the generation of the feature vectors based on the cluster labels. It is a dubious modeling assumption that “posterior" knowledge such as the set of instance—level constraints, which are solicited from the user after observing the data, should control how the data are generated in the first place. Note that this inconsistency does not exist if all the objects to be clustered are involved in some constraints determined by the properties of the objects. This is commonly encountered in image segmentation [128], where pixel attributes (e.g. in- tensities or filter responses) and spatial coherency based on the locations of the pixels are considered simultaneously to decide the segment label. In this case, the clus- ter labels of all the objects are determined by both the constraints and the feature vectors. 5.0.2.2 Proposed Solution We propose to eliminate the problem of inconsistent hypothesis space by enforcing a uniform way to determine the cluster label of an object. we use the same hypothesis 167 space of standard parametric clusterng for parametric clustering under constraints. The constraints are only used to bias the search of a clustering solution within this hypothesis space. Since each clustering solution in this hypothesis space can be represented by the cluster parameters, the constraints play no role in determining the cluster labels, given the cluster parameters. The quality of the cluster parameters with respect to the constraints is computed by examining how well the cluster labels (determined by the cluster parameters) satisfy the constraints. However, cluster parameters that fit the constraints well may not fit the data well. We need a tradeoff between these two goals. This can be done by maximizing a weighted sum of the data log-likelihood and a constraint fit term. The details will be presented in Section 5.3. 5.1 Preliminaries Given a set of 71 objects y = {y1, . . . ,yn}, (probabilistic) parametric partitional clus- tering discovers the cluster structure of the data under the assumption that data in a cluster are generated according to a certain probabilistic model p(yldj), with 6]- representing the parameter vector for the j—th cluster. For simplicity, the number of clusters, k, is assumed to be specified by the user, though model selection strategy (such as minimum description length [81] and stability [162]) can be applied to de- termine k, if desired. The distribution of the data can be written as a. finite mixture distribution, i.e., k p(y) = Zp(y|z)p(Z) = Zajmylflj). (5-1) z j:1 168 Here, 2 denotes the cluster label, ozj denotes p(z = j) (the prior probability of cluster j), and p(yIOJ) corresponds to p(ylz = j). Clustering is performed by estimating the model parameter 9, defined by 0 = (01, . . .,Ct'k,(91,...,6k). By applying the maximum likelihood principle, 9 can be estimated as 6 = argmaxgz £(6’; y), where the log—likelihood £(6; y) is defined as n n k w. y) = Z Iogpm) = 2: log Zajp(yl6j)- (52) 2'21 i=1 j=1 This maximization is often done by the EM algorithm [58] by regarding Z, (the cluster label of y,) as the missing data. The posterior probability p(z = j [y) represents how likely it is that y belongs to the j-th cluster. 
If a hard cluster assignment is desired, the MAP (maximum a posteriori) rule can be applied based on the model in Equation (5.1), i.e., the object $y$ is assigned to the $j$-th cluster ($z = j$) if $\alpha_j\, p(y|\theta_j) \ge \alpha_{j'}\, p(y|\theta_{j'})$ for all $j'$, i.e.,

$$z = \arg\max_{j}\; \alpha_j\, p(y|\theta_j). \qquad (5.3)$$

5.1.1 Exponential Family

While there are many possibilities for the form of the probability distribution $p(y|\theta_j)$, it is very common to assume that $p(y|\theta_j)$ belongs to the exponential family. The distribution $p(y|\theta_j)$ is in the exponential family if it satisfies the following two criteria: the support of $p(y|\theta_j)$ (the set of $y$ with non-zero probability) is independent of the value of $\theta_j$, and $p(y|\theta_j)$ can be written in the form

$$p(y|\theta_j) = \exp\Big(\eta(\theta_j)^T u(y) - A(\theta_j)\Big). \qquad (5.4)$$

Here, $u(y)$ transforms the data $y$ to become the "sufficient statistics", meaning that $u(y)$ encompasses all the relevant information of $y$ in the computation of $p(y|\theta_j)$. The function $A(\theta_j)$, also known as the log-partition function, normalizes the density so that it integrates to one over all $y$. The function $\eta(\theta_j)$ transforms the parameter and enables us to adopt different parameterizations of the same density. When $\eta(\cdot)$ is the identity mapping, the density is said to be in natural parameterization, and $\theta_j$ is known as the natural parameter of the distribution. The function $A(\theta_j)$ then becomes the cumulant generating function, and the derivatives of $A(\theta_j)$ generate the cumulants of the sufficient statistics. For example, the gradient and Hessian of $A(\theta_j)$ (with respect to $\theta_j$) lead to the expected value and the covariance matrix of the sufficient statistics, respectively. Note that $A(\theta_j)$ is a convex function, and the domain of $\theta_j$ where the density is well-defined under natural parameterization is also convex.

As an example, consider a multivariate Gaussian density with mean vector $\mu$ and covariance matrix $\Sigma$. Its pdf is given by

$$p(y) = \exp\left(-\frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det\Sigma^{-1} - \frac{1}{2}(y-\mu)^T\Sigma^{-1}(y-\mu)\right), \qquad (5.5)$$

where $d$ is the dimension of the feature vector $y$. If we define $T = \Sigma^{-1}$ and $V = \Sigma^{-1}\mu$, the above can be rewritten as

$$p(y) = \exp\left(\mathrm{trace}\Big(-\frac{1}{2}yy^T T\Big) + y^T V - \frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det T - \frac{1}{2}V^T T^{-1} V\right). \qquad (5.6)$$

From this, we can see that the sufficient statistics consist of $-\frac{1}{2}yy^T$ and $y$. The set of natural parameters is given by $(T, V)$. The parameter $V$ can take any value in $\mathbb{R}^d$, whereas $T$ can only assume values in the positive-definite cone of $d$ by $d$ symmetric matrices. Both of these sets are convex, as expected. The log-cumulant function is given by

$$A(\theta) = \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det T + \frac{1}{2}V^T T^{-1} V, \qquad (5.7)$$

which can be shown to be convex within the domain of $T$ and $V$ where the density is well-defined.

It is interesting to note that the exponential family is closely related to Bregman divergence [27]. For any Bregman divergence $D_\rho(\cdot,\cdot)$ derived from a strictly convex function $\rho(\cdot)$, one can construct a function $f_\rho$ such that

$$p(y) = \exp\big(-D_\rho(y, \mu)\big)\, f_\rho(y)$$

is a member of the exponential family. Here, $\mu$ is the moment parameter, meaning that it is the expected value of the sufficient statistics.¹ The cumulant generating function of the density is given by the Legendre dual of $\rho(\cdot)$. One important consequence of this relationship is that soft-clustering (clustering where an object can be partially assigned to a cluster) based on any Bregman divergence can be done by fitting a mixture of the corresponding distribution in the exponential family, as argued in [9]. Since Bregman divergence includes many useful distance measures² as special cases (such as Euclidean distance and Kullback-Leibler divergence; see [9] for more), a mixture density, with each component density in the exponential family, covers many interesting clustering scenarios.

¹The strict convexity of $A(\cdot)$ implies that there is a one-to-one correspondence between the moment parameter and the natural parameter. While the existence of such a mapping is easy to show, constructing such a mapping can be difficult in general.

²Strictly speaking, Bregman divergence can be asymmetric and hence is not really a distance function.
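As a quick sanity check of the natural parameterization in Equations (5.6) and (5.7), the following sketch (illustrative only; the helper names are ours) evaluates the Gaussian log-density through the natural parameters and compares it with a direct evaluation:

    import numpy as np
    from scipy.stats import multivariate_normal

    def natural_params(mu, Sigma):
        # (T, V) = (Sigma^{-1}, Sigma^{-1} mu), the natural parameters in Equation (5.6)
        T = np.linalg.inv(Sigma)
        return T, T @ mu

    def log_partition(T, V):
        # A(theta) from Equation (5.7)
        d = V.shape[0]
        return 0.5 * d * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(T)[1] \
               + 0.5 * V @ np.linalg.solve(T, V)

    def gaussian_logpdf_natural(y, T, V):
        # inner product of natural parameters with sufficient statistics, minus A(theta)
        return np.trace(-0.5 * np.outer(y, y) @ T) + y @ V - log_partition(T, V)

    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
    T, V = natural_params(mu, Sigma)
    y = np.array([0.5, 0.5])
    assert np.isclose(gaussian_logpdf_natural(y, T, V),
                      multivariate_normal.logpdf(y, mu, Sigma))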
5.1.2 Instance-level Constraints

We assume that the user has provided side-information in the form of a set of instance-level constraints (denoted by $\mathcal{C}$). The set of must-link constraints, denoted by $\mathcal{C}^+$, is represented by the indicator variables $\tilde{a}_{hi}$, such that $\tilde{a}_{hi} = 1$ iff $y_i$ participates in the $h$-th must-link constraint. For example, if the user wants to state that the pair $(y_2, y_8)$ participates in the fifth must-link constraint, the user sets $\tilde{a}_{5,2} = 1$, $\tilde{a}_{5,8} = 1$, and $\tilde{a}_{5,i} = 0$ for all other $i$. This formulation, while less explicit than the formulation in [161], which specifies the pairs of points participating in the constraints directly, allows easy generalization to group constraints [166]: we simply set $\tilde{a}_{hi}$ to one for all $y_i$ that are involved in the $h$-th group constraint. We also define $a_{hi} = \tilde{a}_{hi} / \sum_{i'} \tilde{a}_{hi'}$, where $a_{hi}$ can be perceived as the "normalized" indicator matrix, in the sense that $\sum_i a_{hi} = 1$. The set of must-not-link constraints, denoted by $\mathcal{C}^-$, is represented similarly by the variables $\tilde{b}_{hi}$ and $b_{hi}$. Specifically, $\tilde{b}_{hi} = 1$ if $y_i$ participates in the $h$-th must-not-link constraint, and $b_{hi} = \tilde{b}_{hi} / \sum_{i'} \tilde{b}_{hi'}$. Note that $\{a_{hi}\}$ and $\{b_{hi}\}$ are highly sparse, because each constraint provided by the user involves only a small number of points (two if all the constraints are pairwise).

5.2 An Illustrative Example

In this section, we describe a simple example to illustrate an important shortcoming of parametric clustering under constraints methods based on hidden Markov random fields, the approach common in the literature [263, 14, 21, 12, 158, 176, 161, 286]. In Figure 5.3, there are altogether 400 data points generated by four different Gaussian distributions. The task is to split this data into two clusters. Suppose the user, perhaps due to domain knowledge, prefers a "left" and a "right" cluster (as shown in Figure 5.3(c)) to the more natural solution of a "top" and a "bottom" cluster (as shown in Figure 5.3(b)). This preference can be expressed via the introduction of two must-link constraints, represented by the solid lines in Figure 5.3(a).

When we apply an algorithm based on hidden Markov random fields to discover the two clusters in this example, we can get the solution shown in Figure 5.3(d). While the cluster labels of the points involved in the constraints are modified by the constraints, there is virtually no difference in the resulting cluster structure when compared with the natural solution in Figure 5.3(b). This is because the change in the cluster labels of the small number of points in constraints does not significantly affect the cluster parameters.
Not only are the clusters not what the user seeks, but the clustering solution is also counter-intuitive: the cluster labels of the points involved in the constraints are different from those of their neighbors (see the big cross and plus in Figure 5.3(d); the symbols are enlarged for clarity).

Similar phenomena of "non-smooth" clustering solutions have been observed in [279] in the context of normalized cut clustering with constraints. A variation of the same problem has been used as a motivation for "space-level" instead of "instance-level" constraints in [148]. One way to understand the cause of this problem is that the use of the hidden Markov random field effectively puts an upper bound on the maximum influence of a constraint, irrespective of how large the penalty for constraint violation is. So, the adjustment of the tradeoff parameters cannot circumvent this problem. Since this problem is not caused by the violation of any constraints, the inclusion of negative constraints cannot help.

5.2.1 An Explanation of the Anomaly

In order to have a better understanding of why the "unnatural" solution depicted in Figure 5.3(d) is obtained, let us examine the hidden Markov random field approach for clustering under constraints in more detail. In this approach, the distribution of the cluster labels (represented by $z_i$) and the feature vectors (represented by $y_i$) can be written as

$$p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \Theta) = \prod_{i} p(y_i \mid z_i),$$

while the cluster labels themselves follow a Markov random field with potential function $H(z_1, \ldots, z_n, \mathcal{C}^+, \mathcal{C}^-)$. One typical choice of the potential function $H(z_1, \ldots, z_n, \mathcal{C}^+, \mathcal{C}^-)$ of the cluster labels is to count the number of constraint violations:

$$H(z_1, \ldots, z_n, \mathcal{C}^+, \mathcal{C}^-) = \lambda^+ \sum_{(i,j) \in \mathcal{C}^+} I(z_i \neq z_j) + \lambda^- \sum_{(i,j) \in \mathcal{C}^-} I(z_i = z_j), \qquad (5.8)$$

where $\lambda^+$ and $\lambda^-$ are the penalty parameters for the violation of the must-link and must-not-link constraints, respectively. This potential function can be derived [161] by the maximum entropy principle, with constraints (as in constrained optimization) on the number of violations of the two types of instance-level constraints. The assignment of points to different clusters is determined by the posterior probability $p(z_1, \ldots, z_n \mid y_1, \ldots, y_n, \Theta)$. Clustering is performed by searching for the parameters that maximize the likelihood $p(y_1, \ldots, y_n \mid \Theta)$. Because

$$p(y_1, \ldots, y_n \mid \Theta) = \sum_{z_1, \ldots, z_n} p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \Theta)\, p(z_1, \ldots, z_n \mid \Theta) \approx \max_{z_1, \ldots, z_n} p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \Theta)\, p(z_1, \ldots, z_n \mid \Theta), \qquad (5.9)$$

the result of maximizing $p(y_1, \ldots, y_n \mid \Theta)$ is often similar to the result of maximizing the "hard assignment log-likelihood", defined by $\max_{z_1, \ldots, z_n} p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \Theta)\, p(z_1, \ldots, z_n \mid \Theta)$. This illustrates the relationship between "hard" clustering under constraints approaches (such as in [263]) and the "soft" approaches (such as in [161] and [14]).

For ease of illustration, assume that $p(y|z = j)$ is a Gaussian with mean vector $\mu_j$ and identity covariance matrix. The maximization of $p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \Theta)\, p(z_1, \ldots, z_n \mid \Theta)$ for the clustering under constraints example in Figure 5.3 is equivalent to the minimization of

$$\sum_{j=1}^{2} \sum_{i:\, z_i = j} \|y_i - \mu_j\|^2 + \lambda^+ \sum_{(i,j) \in \mathcal{C}^+} I(z_i \neq z_j),$$

where the potential function of the Markov random field is as defined in Equation (5.8), and $\mathcal{C}^+$ contains the two must-link constraints. Note that the first term, the sum of squared Euclidean distances between data points and the corresponding cluster centers, is the cost function for standard k-means clustering.
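The following sketch evaluates this penalized k-means cost for a fixed label assignment; the data generation, constraint indices, and centers below are hypothetical values chosen only to mirror the example, and this is not the code used in the thesis. The cost is the quantity compared for the two configurations discussed next:

    import numpy as np

    def penalized_kmeans_cost(Y, labels, centers, must_links, lam_plus):
        """Sum of squared distances plus lambda^+ times the number of violated must-links."""
        sq_err = sum(np.sum((Y[labels == j] - centers[j]) ** 2) for j in range(len(centers)))
        violations = sum(labels[i] != labels[j] for (i, j) in must_links)
        return sq_err + lam_plus * violations

    rng = np.random.default_rng(0)
    # four unit-variance Gaussian clouds at (+-2, +-8), 100 points each (as in Figure 5.3(a))
    clouds = [rng.normal(c, 1.0, size=(100, 2)) for c in [(2, 8), (-2, 8), (2, -8), (-2, -8)]]
    Y = np.vstack(clouds)

    # "TB" labeling (sign of the vertical coordinate) vs "LR" labeling (sign of the horizontal one)
    tb_labels, tb_centers = (Y[:, 1] > 0).astype(int), np.array([[0.0, -8.0], [0.0, 8.0]])
    lr_labels, lr_centers = (Y[:, 0] > 0).astype(int), np.array([[-2.0, 0.0], [2.0, 0.0]])

    must_links = [(10, 210), (110, 310)]   # two illustrative pairs linking top and bottom clouds
    for lam in (1.0, 250.0):
        print(lam,
              penalized_kmeans_cost(Y, tb_labels, tb_centers, must_links, lam),
              penalized_kmeans_cost(Y, lr_labels, lr_centers, must_links, lam))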
“left" and a “right” cluster, can be represented by ”ER 2 (—2, 0) and ugR = (2,0), and this corresponds to the partition sought by the user in Fig- ure 5.3(c). The configuration “TB”, which consists of a “top” and a “bottom” cluster, can be represented by [.1ng = (0,~8) and ”2TB = (0,8), and this corre- sponds to the “natural” solution shown in Figure 5.3(b). When X“ is very small, the natural solution “TB” is preferable to “LR”, because the points, on average, are closer to the cluster centers in “TB”, and the penalty for constraint viola- tion is negligible. As /\+ increases, the cost for selecting “TB” increases. When (A+ + lle‘ — MEBHQ) > My, — ”TB”? (y,- is the point under constraint in the upper left point clouds), switching the cluster label of y, from “x” to “+” leads to a lower cost for the “TB” configuration. This switching of cluster label affects the cluster centers in the “TB” configuration. However, its influence is minimal because there is only one such point, and the sum of the square error term in the objective func- 176 F. tion is dominated by the remaining points that are not involved in constraints. As a result, the sum of square term is minimized when the cluster centers are effectively unmodified from the “TB” configuration. This leads to the counter—intuitive cluster— ing solution in Figure 5.3(c), where the constraints are satisfied, but the cluster labels are “discontinuous” in the sense that the cluster label of an object in the middle of a dense point cloud can assume a cluster label different from those of its neighbors. A related argument has been used to motivate “space-level” constraints in preference to “instance-level” constraints in [148]: the influence of instance-level constraints may fail to propagate to the surrounding points. This problem may also be attributed to the problem of the inconsistent hypothesis space discussed in Section 5.0.2.1, be- cause the cluster labels of points under constraints are determined in a way that is different from the points without constraints. When A+ increases further, the cost for this counter-intuitive configuration remains the same, because no constraints are violated. Let C denote the cost of this counter-intuitive configuration. We are now in a position to understand why it is not possible to attain the de- sirable configuration “LR”. By pushing the vertical and horizontal point clouds away from each other, we can arbitrarily increase the cost of the “LR” configuration, while keeping the cost of the “TB” configuration the same. While the cost for the counter- intuitive configuration also increases when the two point clouds are pushed apart, such an increase is very slow because only the distance of one point (as in the term Hy,- — #111,)”2) is affected. Consequently, the cost of “LR” configuration can be made larger than C, which is indeed the case for the example in Figure 5.3. Therefore, assuming that the clustering under constraints algorithm finds the clustering solu- 177 tion that minimizes the cost function, the desired “LR” configuration can never be recovered. Note that specifying additional constraints (either must-link or must—not-link) on points already participating in the constraints cannot solve the problem, because none of the constraints are violated in the counter-intuitive configuration. This problem remained unnoticed in previous studies, because it is a consequence of a small number of constraints. 
When there are a large number of data points involved in constraints, the sum of squared error is no longer dominated by the data points not involved in constraints. The enforcement of constraints changes the cluster labels, which in turn modifies the cluster centers significantly during the minimization of the sum of error. The counter-intuitive configuration is no longer optimal, and the "LR" configuration will be generated because of its smaller cost. Note that this problem is independent of the probabilistic model chosen to represent each cluster: the same problem can arise if there is no restriction on the covariance matrix, for example.

There are several ways to circumvent this problem. One possibility is to increase the number of constraints so that the constraints involve a large number of data points. However, clustering under constraints is most useful when there are few constraints, because the creation of constraints often requires a significant effort on the part of the user. Instead of soliciting additional constraints from the user, the system should provide the user an option to increase arbitrarily the influence of the existing constraints, something the hidden Markov random field approach fails to do. One may also try to initialize the cluster parameters intelligently [13] so that a desired local minimum (the "LR" configuration in Figure 5.3(c)) is obtained, instead of the global minimum (the counter-intuitive configuration in Figure 5.3(d) or the "TB" configuration in Figure 5.3(b), depending on the value of $\lambda^+$). However, this approach is heuristic. Indeed, the discussion above reveals a problem in the objective function itself, and we should specify a more appropriate objective function to reflect what the user really desires. The solution in [161] is to introduce a parameter (in addition to $\lambda^+$ and $\lambda^-$) that can increase the influence of data points in constraints. However, this approach introduces an additional parameter, and it is also heuristic. An alternative potential function for use in the hidden Markov random field has been proposed in [14] to try to circumvent the problem.

Because the main problem lies in the objective function itself, we propose a principled solution to this problem by specifying an alternative objective function for clustering under constraints.

5.3 Proposed Approach

Our approach begins by requiring the hypothesis space (see Section 5.0.2) used by parametric clustering under constraints to be the same as the hypothesis space used by parametric clustering without constraints. This means that the cluster label of an object should be determined by its feature vector and the cluster parameters according to the MAP rule in Equation (5.3), based on the standard finite mixture model in Equation (5.1). The constraints should play no role in deciding the cluster labels. This contrasts with the hidden Markov random field approaches (see Section 5.2), where both the cluster labels and the cluster parameters can freely vary to minimize the cost function.

Figure 5.1: Supervised, unsupervised, and intermediate. In this figure, dots correspond to points without any labels. Points with labels are denoted by circles, asterisks and crosses. In (c), the must-link and must-not-link constraints are denoted by solid and dashed lines, respectively.
Figure 5.2: An example contrasting parametric and non-parametric clustering. The particular parametric family considered here is a mixture of Gaussians with a common covariance matrix. This is reflected by the linear cluster boundaries. The clustering solutions in (a) to (c) are in the hypothesis space induced by this model assumption, and the clustering solutions in (d) to (f) are outside the hypothesis space, and thus can never be obtained, no matter which objective function is used. On the other hand, all of these six solutions are within the hypothesis space of non-parametric clustering. It is possible that the clustering solutions depicted in (d), (e), and (f) may be obtained if a poor clustering objective function is used.

Figure 5.3: A simple example of clustering under constraints that illustrates the limitation of hidden Markov random field (HMRF) based approaches. Panels: (b) natural partition in 2 clusters; (c) desired 2-cluster solution; (d) solution by HMRF.

The desirable cluster parameters should (i) result in cluster labels that satisfy the constraints, and (ii) explain the data well. These two goals, however, may conflict with each other, and a compromise is made by the use of tradeoff parameters. Formally, we seek the parameter vector $\Theta$ that maximizes an objective function $\mathcal{J}(\Theta; \mathcal{Y}, \mathcal{C})$, defined by

$$\mathcal{J}(\Theta; \mathcal{Y}, \mathcal{C}) = \mathcal{L}(\Theta; \mathcal{Y}) + \mathcal{F}(\Theta; \mathcal{C}), \qquad (5.10)$$

$$\mathcal{F}(\Theta; \mathcal{C}) = -\sum_{h=1}^{m^+} \lambda_h^+ f^+(\Theta; \mathcal{C}_h^+) - \sum_{h=1}^{m^-} \lambda_h^- f^-(\Theta; \mathcal{C}_h^-), \qquad (5.11)$$

where $\mathcal{F}(\Theta; \mathcal{C})$ denotes how well the clusters specified by $\Theta$ satisfy the constraints in $\mathcal{C}$. It consists of two types of terms: $f^+(\Theta; \mathcal{C}_h^+)$ and $f^-(\Theta; \mathcal{C}_h^-)$. The loss functions $f^+(\Theta; \mathcal{C}_h^+)$ and $f^-(\Theta; \mathcal{C}_h^-)$ correspond to the violation of the $h$-th must-link constraint (denoted by $\mathcal{C}_h^+$) and the $h$-th must-not-link constraint (denoted by $\mathcal{C}_h^-$), respectively. There are altogether $m^+$ must-link constraints and $m^-$ must-not-link constraints, i.e., $|\mathcal{C}^+| = m^+$ and $|\mathcal{C}^-| = m^-$. The log-likelihood term $\mathcal{L}(\Theta; \mathcal{Y})$, which corresponds to the fit of the data $\mathcal{Y}$ by the model parameter $\Theta$, is the same as the log-likelihood of the finite mixture model used in standard parametric clustering (Equation (5.2)). The parameters $\lambda_h^+$ and $\lambda_h^-$ give us the flexibility to assign different weights to the constraints. In practice, they are set to a common value $\lambda$. The value of $\lambda$ can either be specified by the user, or it can be estimated by a cross-validation type of procedure. For brevity, we sometimes drop the dependence of $\mathcal{J}$ on $\Theta$, $\mathcal{Y}$ and $\mathcal{C}$ and write $\mathcal{J}$ as the objective function.

How can this approach be superior to the HMRF approaches? A counter-intuitive clustering solution such as the one depicted in Figure 5.3(d) is no longer attainable. The cluster boundaries are determined solely by the cluster parameters. So, in the example in Figure 5.3(d), the top-left "big plus" point will assume the cluster label of "x", whereas the bottom-right "big cross" point will assume the cluster label of "+", based on the values of the cluster parameters as shown in the figure. The second benefit is that the effect of the instance-level constraints is propagated to the surrounding points automatically, thereby achieving the effect of the desirable space-level constraints. This is because parametric cluster boundaries divide the data space into different contiguous regions.
Another advantage of the proposed approach is that it can obtain clustering solutions unattainable by HMRF approaches. For example, the "TB" configuration in Figure 5.3(b) can be made to have an arbitrarily high cost by increasing the value of the constraint penalty parameter $\lambda^+$. Since the cost of the "LR" configuration is not affected by $\lambda^+$, the "LR" configuration will have a smaller cost than the "TB" configuration for a large $\lambda^+$. When the cost function is minimized, the "LR" configuration sought by the user will be returned.

5.3.1 Loss Function for Constraint Violation

What should be the form of the loss functions $f^+(\Theta; \mathcal{C}_h^+)$ and $f^-(\Theta; \mathcal{C}_h^-)$? Suppose the points $y_i$ and $y_j$ participate in a must-link constraint. This must-link constraint is violated if the cluster labels $z_i$ (for $y_i$) and $z_j$ (for $y_j$), determined by the MAP rule, are different. Define $\mathbf{z}_i$ to be a vector of length $k$, such that its $l$-th entry is one if $z_i = l$, and zero otherwise. The number of constraint violations can be represented by $d(\mathbf{z}_i, \mathbf{z}_j)$ if $d$ is a distance measure such that $d(\mathbf{z}_i, \mathbf{z}_j) = 1$ if $z_i \neq z_j$ and zero otherwise. Similarly, the violation of a must-not-link constraint between $y_{i^*}$ and $y_{j^*}$ can be represented by $1 - d(\mathbf{z}_{i^*}, \mathbf{z}_{j^*})$, where $y_{i^*}$ and $y_{j^*}$ are involved in a must-not-link constraint.

Adopting such a distance function $d(\cdot,\cdot)$ as the loss functions $f^+(\cdot)$ and $f^-(\cdot)$ is, however, not a good idea, because $d(\mathbf{z}_i, \mathbf{z}_j)$ is a discontinuous function of $\Theta$, due to the presence of $\arg\max$ in Equation (5.3). In order to construct an easier optimization problem, we "soften" $\mathbf{z}_i$ and define a new vector $\mathbf{s}_i$ by

$$s_{il} = \frac{\big(\alpha_l\, p(y_i|\theta_l)\big)^{\tau}}{\sum_{l'} \big(\alpha_{l'}\, p(y_i|\theta_{l'})\big)^{\tau}} = \frac{q_{il}^{\tau}}{\sum_{l'} q_{il'}^{\tau}}, \qquad (5.12)$$

where $q_{il} = \alpha_l\, p(y_i|\theta_l)$, and $\tau$ is the smoothness parameter. When $\tau$ goes to infinity, $\mathbf{s}_i$ approaches $\mathbf{z}_i$, whereas a small value of $\tau$ leads to a smooth loss function, which, in general, has a less severe local optima problem.

Another issue is the choice of the distance function $d(\mathbf{s}_i, \mathbf{s}_j)$. Since $s_{il} \geq 0$ and $\sum_l s_{il} = 1$, $\mathbf{s}_i$ has a probabilistic interpretation. A divergence is therefore more appropriate than a common distance measure such as the Minkowski distance for comparing $\mathbf{s}_i$ and $\mathbf{s}_j$. We adopt the Jensen-Shannon divergence $D_{JS}(\mathbf{s}_i, \mathbf{s}_j)$ with a uniform class prior as the distance measure:

$$D_{JS}(\mathbf{s}_i, \mathbf{s}_j) = \frac{1}{2}\sum_{l=1}^{k} s_{il} \log\frac{s_{il}}{t_l} + \frac{1}{2}\sum_{l=1}^{k} s_{jl} \log\frac{s_{jl}}{t_l}, \qquad (5.13)$$

where $t_l = \frac{1}{2}(s_{il} + s_{jl})$. There are several desirable properties of the Jensen-Shannon divergence. It is symmetric, well-defined for all $\mathbf{s}_i$ and $\mathbf{s}_j$, and its square root can be shown to be a metric [76, 199]. The minimum value of 0 for $D_{JS}(\cdot,\cdot)$ is attained only when $\mathbf{s}_i = \mathbf{s}_j$. It is upper-bounded by a constant ($\log 2$), and this bound is attained only when $\mathbf{s}_i$ and $\mathbf{s}_j$ are farthest apart, i.e., when $s_{il} = 1$ and $s_{jh} = 1$ with $l \neq h$. Because $\frac{1}{\log 2} D_{JS}(\mathbf{z}_i, \mathbf{z}_j) = 1$ if $z_i \neq z_j$ and 0 otherwise, the Jensen-Shannon divergence satisfies (up to a multiplicative constant) the desirable property of a distance measure as described earlier in this section. Note that the Kullback-Leibler divergence can become unbounded when $\mathbf{s}_i$ and $\mathbf{s}_j$ have different supports, and thus it is not an appropriate choice.
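A minimal sketch of these two quantities, the softened assignment of Equation (5.12) and the pairwise Jensen-Shannon divergence of Equation (5.13); the helper names are ours, and a small constant is added only to avoid log(0) in the sketch:

    import numpy as np

    def soft_assignment(q_i, tau):
        """Softened label vector s_i of Equation (5.12); q_i[l] = alpha_l * p(y_i | theta_l)."""
        w = q_i ** tau
        return w / w.sum()

    def js_divergence(s_i, s_j, eps=1e-12):
        """Pairwise Jensen-Shannon divergence of Equation (5.13) with uniform prior."""
        t = 0.5 * (s_i + s_j)
        kl = lambda p, q: np.sum(p * np.log((p + eps) / (q + eps)))
        return 0.5 * kl(s_i, t) + 0.5 * kl(s_j, t)

    q_i = np.array([0.7, 0.2, 0.1])
    q_j = np.array([0.1, 0.1, 0.8])
    for tau in (0.25, 1.0, 4.0):
        s_i, s_j = soft_assignment(q_i, tau), soft_assignment(q_j, tau)
        print(tau, js_divergence(s_i, s_j))   # approaches log(2) as tau grows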
The Jensen-Shannon divergence has an additional appealing property: it can be generalized to measure the difference between more than two distributions. This gives a very natural extension to constraints at the group level [231, 166]. Suppose $e$ objects participate in the $h$-th group-level must-link constraint. This is denoted by the variables $a_{hi}$ introduced in Section 5.1.2, where $a_{hi} = 1/e$ if $y_i$ participates in this constraint, and zero otherwise. The Jensen-Shannon divergence for the $h$-th must-link constraint, $D_{JS}^+(h)$, is defined as

$$D_{JS}^+(h) = \sum_{i=1}^{n} a_{hi} \sum_{l=1}^{k} s_{il} \log\frac{s_{il}}{t_{hl}^+} = \sum_{i=1}^{n} a_{hi} \sum_{l=1}^{k} s_{il} \log s_{il} - \sum_{l=1}^{k} t_{hl}^+ \log t_{hl}^+, \quad \text{where } t_{hl}^+ = \sum_{i=1}^{n} a_{hi}\, s_{il}. \qquad (5.14)$$

Similarly, the Jensen-Shannon divergence for the $h$-th must-not-link constraint, $D_{JS}^-(h)$, is defined as

$$D_{JS}^-(h) = \sum_{i=1}^{n} b_{hi} \sum_{l=1}^{k} s_{il} \log\frac{s_{il}}{t_{hl}^-} = \sum_{i=1}^{n} b_{hi} \sum_{l=1}^{k} s_{il} \log s_{il} - \sum_{l=1}^{k} t_{hl}^- \log t_{hl}^-, \quad \text{where } t_{hl}^- = \sum_{i=1}^{n} b_{hi}\, s_{il}. \qquad (5.15)$$

Here, $b_{hi}$ denotes the must-not-link constraint indicator as discussed in Section 5.1.2. The proposed objective function in Equation (5.10) can be rewritten as

$$\mathcal{J} = \mathcal{L}(\Theta; \mathcal{Y}) + \mathcal{F}(\Theta; \mathcal{C}) = \mathcal{L}_{\mathrm{annealed}}(\Theta; \mathcal{Y}, \gamma) - \sum_{h=1}^{m^+} \lambda_h^+ D_{JS}^+(h) + \sum_{h=1}^{m^-} \lambda_h^- D_{JS}^-(h), \qquad (5.16)$$

where the annealed log-likelihood $\mathcal{L}_{\mathrm{annealed}}(\Theta; \mathcal{Y}, \gamma)$, defined in Equation (B.2), is a generalization of the log-likelihood intended for deterministic annealing. When $\gamma = 1$, $\mathcal{L}_{\mathrm{annealed}}(\Theta; \mathcal{Y}, \gamma)$ equals $\mathcal{L}(\Theta; \mathcal{Y})$. Note that both $D_{JS}^+(h)$ and $D_{JS}^-(h)$ are functions of $\Theta$.

5.4 Optimizing the Objective Function

The proposed objective function (Equation (5.16)) is more difficult to optimize than the log-likelihood (Equation (5.2)) used in standard parametric clustering. We cannot derive any efficient convex relaxation for $\mathcal{J}$, meaning that a bound-optimization procedure such as the EM algorithm cannot be applied. We resort to general nonlinear optimization algorithms to optimize the objective function. In Section 5.4.1, we shall present the general idea of these algorithms. After describing some details of the algorithms in Section 5.4.2, we present the specific equations used for a mixture of Gaussians in Section 5.4.3. Note that these algorithms are often presented in the literature as minimization algorithms. Therefore, we minimize $-\mathcal{J}$ rather than maximizing $\mathcal{J}$ in practice.

5.4.1 Unconstrained Optimization Algorithms

Different algorithms have been attempted to optimize the proposed objective function $\mathcal{J}$. They include conjugate gradient, quasi-Newton, preconditioned conjugate gradient, and line-search Newton. Because these algorithms are fairly well-documented in the literature [87, 23], we shall only describe their general ideas here. All of these algorithms are iterative and require an initial parameter vector $\Theta^{(0)}$.

5.4.1.1 Nonlinear Conjugate Gradient

The key idea of nonlinear conjugate gradient is to maintain the descent directions $d^{(t)}$ in different iterations, so that different $d^{(t)}$ are orthogonal (conjugate) to each other with respect to some approximation of the Hessian matrix. This can prevent the inefficient "zig-zag" behavior encountered in steepest descent, which always uses the negative gradient for descent. Initially, $d^{(0)}$ equals the negative gradient of the function to be minimized. At iteration $t$, a line-search is performed along $d^{(t)}$, i.e., we seek $\eta$ such that the objective function evaluated at $\Theta^{(t)} + \eta d^{(t)}$ is minimized, where $\Theta^{(t)}$ is the current parameter estimate. The parameter is then updated by $\Theta^{(t+1)} = \Theta^{(t)} + \eta d^{(t)}$. The next direction of descent, $d^{(t+1)}$, is found by computing a vector that is (approximately) conjugate to the previous descent directions. Many different schemes have been proposed for this; we follow the suggestion given in the tutorial [232] and adopt the Polak-Ribière method with restarting to update $d^{(t+1)}$:

$$\beta^{(t+1)} = \max\left(\frac{\big(r^{(t+1)}\big)^T \big(r^{(t+1)} - r^{(t)}\big)}{\big(r^{(t)}\big)^T r^{(t)}},\; 0\right), \qquad d^{(t+1)} = r^{(t+1)} + \beta^{(t+1)} d^{(t)},$$

where $r^{(t)}$ denotes the negative gradient at iteration $t$.
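For illustration, a bare-bones version of this update loop follows (a sketch only: it assumes a generic objective f with gradient grad_f, and uses a simple backtracking line search in place of the stricter line search discussed below):

    import numpy as np

    def backtracking_line_search(f, x, d, g, eta=1.0, rho=0.5, c=1e-4):
        # shrink the step until a sufficient-decrease (Armijo) condition holds
        while eta > 1e-12 and f(x + eta * d) > f(x) + c * eta * (g @ d):
            eta *= rho
        return eta

    def nonlinear_cg(f, grad_f, x0, n_iter=200, tol=1e-10):
        x = x0.copy()
        r = -grad_f(x)                       # residual = negative gradient
        d = r.copy()
        for _ in range(n_iter):
            if np.sqrt(r @ r) < tol:
                break
            eta = backtracking_line_search(f, x, d, -r)
            x = x + eta * d
            r_new = -grad_f(x)
            beta = max(r_new @ (r_new - r) / (r @ r), 0.0)   # Polak-Ribiere with restart
            d = r_new + beta * d
            r = r_new
        return x

    # example: minimize a simple convex quadratic
    A = np.array([[3.0, 0.5], [0.5, 1.0]])
    f = lambda x: 0.5 * x @ A @ x
    grad_f = lambda x: A @ x
    print(nonlinear_cg(f, grad_f, np.array([5.0, -3.0])))   # approaches the minimizer at the origin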
Note that the line-search in conjugate gradient should be reasonably accurate, in order to ensure that the search directions $d^{(t)}$ are indeed approximately conjugate (see the discussion in Chapter 7 of [23]). The main strength of conjugate gradient is that its memory usage is only linear with respect to the number of variables, thereby making it attractive for large scale problems. Conjugate gradient has also found empirical success in fitting a mixture of Gaussians [222], and is shown to be more efficient than the EM algorithm when the clusters are highly overlapping.

5.4.1.2 Quasi-Newton

Consider the second-order Taylor expansion of a real-valued function $f(x)$, which is

$$f(x) \approx f(x_0) + (x - x_0)^T g(x_0) + \frac{1}{2}(x - x_0)^T H(x_0)(x - x_0), \qquad (5.17)$$

where $g(x_0)$ and $H(x_0)$ denote the gradient and the Hessian of the function $f(\cdot)$ evaluated at $x = x_0$. For brevity, we shall drop the reference to $x_0$ for both $g$ and $H$. Assuming that $H$ is positive definite, the right-hand side of the above approximation can be minimized by $x = x_0 - H^{-1}g$. The quasi-Newton algorithm does not require explicit knowledge of the Hessian $H$, which can sometimes be tricky to obtain. Instead, it maintains an approximate Hessian $\tilde{H}$, which should satisfy the quasi-Newton condition:

$$\Theta^{(t+1)} - \Theta^{(t)} = \tilde{H}^{-1}\big(g^{(t+1)} - g^{(t)}\big).$$

Since the inversion of the Hessian can be computationally expensive, $G^{(t)}$, the inverse of the Hessian, is approximated instead. While different schemes to update $G^{(t)}$ are possible, the de facto standard is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) procedure. Below is its description, taken from [23]:

$$p = \Theta^{(t+1)} - \Theta^{(t)}, \qquad v = g^{(t+1)} - g^{(t)},$$

$$G^{(t+1)} = G^{(t)} + \frac{p\,p^T}{p^T v} - \frac{G^{(t)} v\, v^T G^{(t)}}{v^T G^{(t)} v} + \big(v^T G^{(t)} v\big)\, u\, u^T, \qquad \text{where } u = \frac{p}{p^T v} - \frac{G^{(t)} v}{v^T G^{(t)} v}.$$

Given that $G^{(t)}$ is positive-definite and the round-off error is negligible, the above update guarantees that $G^{(t+1)}$ is positive-definite. The initial value of the approximated inverse Hessian, $G^{(0)}$, is often set to the identity matrix. Note that an alternative approach to implement quasi-Newton is to maintain the Cholesky decomposition of the approximated Hessian instead. This has the advantage that the approximated Hessian is guaranteed to be positive definite even when the round-off error cannot be ignored.

In practice, the quasi-Newton algorithm is accompanied by a line-search procedure to cope with the error in the Taylor approximation in Equation (5.17) when $\Theta$ is far away from $\Theta^{(t)}$. The descent direction used is $-\tilde{H}^{-1}g^{(t)}$. Note that if $\tilde{H}$ is positive definite, $-g^{(t)T}\tilde{H}^{-1}g^{(t)}$ will always be negative and $-\tilde{H}^{-1}g^{(t)}$ will be a valid descent direction. The main drawback of the quasi-Newton method is its memory requirement. The approximate inverse Hessian requires $O(|\Theta|^2)$ memory, where $|\Theta|$ is the number of variables in $\Theta$. This can be slow for high-dimensional $\Theta$, which is the case when the data $y_i$ are of high dimensionality.
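A compact sketch of this update (illustrative; the variable names mirror the equations above and the fixed damped step stands in for a real line search):

    import numpy as np

    def bfgs_inverse_update(G, theta_new, theta_old, g_new, g_old):
        """One BFGS update of the approximate inverse Hessian G, following the equations above."""
        p = theta_new - theta_old
        v = g_new - g_old
        pv = p @ v
        Gv = G @ v
        vGv = v @ Gv
        u = p / pv - Gv / vGv
        return G + np.outer(p, p) / pv - np.outer(Gv, Gv) / vGv + vGv * np.outer(u, u)

    # example on a quadratic f(x) = 0.5 x^T A x, whose true inverse Hessian is A^{-1}
    A = np.array([[4.0, 1.0], [1.0, 2.0]])
    grad = lambda x: A @ x
    G = np.eye(2)
    theta = np.array([1.0, 1.0])
    for _ in range(10):
        step = -G @ grad(theta)          # quasi-Newton descent direction
        theta_new = theta + 0.5 * step   # fixed damped step in place of a real line search
        G = bfgs_inverse_update(G, theta_new, theta, grad(theta_new), grad(theta))
        theta = theta_new
    print(G, np.linalg.inv(A))           # G should approximate A^{-1} after a few updates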
5.4.1.3 Preconditioned Conjugate Gradient

Both conjugate gradient and quasi-Newton require only the gradient information of the function to be minimized. Faster convergence is possible if we incorporate the analytic form of the Hessian matrix into the optimization procedure. However, what really can help is not the Hessian, but the inverse of the Hessian. Since the inversion of the Hessian can be slow, it is common to adopt some approximation of the Hessian matrix so that its inversion can be done quickly. Preconditioned conjugate gradient (PCG) uses an approximation to the inverse Hessian to speed up conjugate gradient. The approximation, also known as the preconditioner, is denoted by $M$. PCG essentially creates an optimization problem that has $M^{-1/2} H M^{-1/2}$ as the "effective" Hessian matrix and applies conjugate gradient to it, where $H$ is the Hessian matrix of the original optimization problem. If the "effective" Hessian matrix is close to the identity, conjugate gradient can converge very fast. We refer the reader to the appendix in [232] for the exact algorithm for PCG. Practical implementation of PCG does not require the computation of $M^{-1/2}$; only multiplication by $M^{-1}$ is needed. Note that the preconditioner should be positive definite, or the descent direction computed may not decrease the objective function.

We can see that there are three requirements for a good preconditioner: positive definiteness, efficient inversion, and good approximation of the Hessian. The first and the third requirements can contradict each other, because the true Hessian is often not positive-definite unless the objective function is convex. Finding a good preconditioner is an art, and often requires insights into the problem at hand. However, general procedures for creating a preconditioner also exist, which can be based on incomplete Cholesky factorization, for example.

5.4.1.4 Line-search Newton

Line-search Newton is almost the same as the quasi-Newton algorithm, except that the Hessian is provided by the user instead of being approximated by the gradients. There is, however, a catch here. The true Hessian may not be positive-definite, meaning that the minimization problem on the right-hand side of Equation (5.17) does not have a solution. Therefore, it is common to replace the true Hessian with some approximated version that is positive-definite. Since $H^{-1}g$ is to be computed, such an approximation should admit efficient inversion, or at least multiplication by its inverse should be fast. There are two possible ways to obtain such an approximation. We can either add $\delta I$ to the true Hessian, where $\delta$ is some positive number determined empirically, or we can "repair" $H$ by adding some terms to it to convert it to a positive-definite matrix. Note that for both line-search Newton and PCG, the approximated inverse of the Hessian, which takes $O(|\Theta|^2)$ memory, need not be formed explicitly. The only thing needed is the ability to multiply by the approximated inverse.
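As a sketch of the first option, adding $\delta I$ (the starting value of $\delta$ and the growth factor below are arbitrary illustrations, not values from this thesis), one can simply increase $\delta$ until a Cholesky factorization succeeds and then solve for the Newton direction:

    import numpy as np

    def newton_direction(H, g, delta0=1e-6):
        """Return -(H + delta I)^{-1} g with the smallest tried delta making H + delta I positive definite."""
        delta = 0.0
        while True:
            try:
                L = np.linalg.cholesky(H + delta * np.eye(H.shape[0]))
                break
            except np.linalg.LinAlgError:
                delta = delta0 if delta == 0.0 else 10.0 * delta
        # solve (L L^T) d = -g by two triangular solves
        y = np.linalg.solve(L, -g)
        return np.linalg.solve(L.T, y)

    H = np.array([[2.0, 0.0], [0.0, -1.0]])   # an indefinite Hessian
    g = np.array([1.0, 1.0])
    print(newton_direction(H, g))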
5.4.2.2 Common Precision Matrix A common practice of fitting a mixture of Gaussian is to assume a common precision matrix, i.e., the precision matrices of all the k Gaussian components are restricted to be the same, i.e., T1 = = Tk = T. Instead of the gradient with respect to different Tj, we need the gradient of ,7 with respect to T. This can be done easily because ”8,7 j=1 er,- 194 Consequently, Equation (B.12) should be modified to 0 1 6—T—‘7z— 2:617)”le +Z(-12-ujp]T +—%‘T1)Zc,j (5.19) 2' whereas Equation (B20) should be modified to t} 1 _ 1 —,,T.7= gr 12%- — 525.6.- —u,~)T. (5.20) ij ij The case for Cholesky parameterization is similar. We set F1 = = Fk = F, and Equation (B.15) should be modified to 0 DFJ:_ Z(iniyiF+Z(Hjuj +T1)FZCzja (5'21) i] .7 and Equation (821) should be modified to (‘9 - 51:37 = T 1F200— Zcijf)’i - Mj)(yz' - MleF- (5-22) 5.4.2.3 Line Search Algorithm The line-search algorithm we used is based on the implementation in Matlab, which is in turn based on section 2.6 in [86]. Its basic idea is to perform a cubic interpolation based on the value of the function and the gradient evaluated at two parameter values. The line search terminates when the VVolfe’s condition is satisfied. Following the advice in Chapter 7 of [23], the line-starch is stricter for both conjugate gradient and preconditioned conjugate gradient in order to ensure conjugacy. Note that when 195 the Gaussians are parameterized by their precision matrices, the line search procedure disallows any parameter vector that has non-positive definite precision matrices. 5.4.2.4 Annealing the Objective Function The algorithms described in Section 5.4.1 find only the local minima of .7 based on the initial parameter estimate 6(0). One strategy to alleviate this problem is to adopt a deterministic-annealing type of procedure and use a “smoother” version of the objective function. The solution of this “smoother” optimization problem is used as the initial guess for the actual objective function to be optimized. Specifically, we adjust the two temperature—like parameters 7 and T in J defined in Equation (5.16). When 7 and T are small, .,7 is smooth and is almost convex, therefore it is easy to optimize. The annealing stops when 7 reaches one and T reaches a pre-specified value Tfinal , which is set to four in our experiment. This is, however, a fairly insensitive parameter. Any number between one and sixteen leads to similar clustering results. 5.4.3 Specifics for a Mixture of Gaussians All the algorithms described in Section 5.4.1 require the gradient information of the objective function. In Appendix B.1, we have derived the gradient information with the assumption that each mixture component is a Gaussian distribution. Recall that qz-j = log p(yil6j), and sij has been defined in Equation (5.12). Define the following: 7 - qij Zj’ qij, 196 1111'] = E Afahl— Zh /\— bhi 81] log 823' [2:1 "1+ m— — 3”- Z ,\+ (1),,- log 2;] =2: Agbh, iogtgj h=1 C16 = 7”,") - T(Il’ij — Sij 2w”). [=1 The partial derivative of ,7 with respect to 63- is (7 0‘37]: 2‘ C21“ (.11 ngj 2' f!- (5.23) Under the natural parameterization u) and T) for the parameters of the l-th cluster, we have g_l;7: ZC Cile" “[ZC C2[ 59!- = A Z Cuyryf + % (#sz + 22) Z 022- 2' 2' (9T) 2 If the Cholesky parameterization F1 is used instead of T1, we have (9.7 :r T O—F] = - Zcztlyiyi F1 + (mm + 231) Fl 2021- i i If the moment parameterization #1 and T) are used instead, we have 0.7 ._— = r E : -— i 197 (5.24) (5.25) (5.26) (5.27) 297,: El: 6‘2"! 
5.5 Feature Extraction and Clustering with Constraints

It turns out that the objective function introduced in Section 5.3 can be modified to simultaneously perform feature extraction and clustering with constraints. There are three reasons why we are interested in performing these two tasks together. First, the proposed algorithm does not perform well for small data sets with a large number of features (denoted by $d$), because the $d$ by $d$ covariance matrix is estimated from the available data. In other words, we are suffering from the curse of dimensionality. The standard solution is to preprocess the data by reducing the dimensionality using methods like principal component analysis. However, the resulting low-dimensional representation may not be optimal for clustering with the given set of constraints. It is desirable to incorporate the constraints in seeking a good low-dimensional representation.

The second reason is from a modeling perspective. One can argue that it is inappropriate to model the two desired clusters shown in Figure 5.3(c) by two Gaussians, because the distribution of the data points is very "non-Gaussian": there are no data points in the central regions of the two Gaussians, which are supposed to have the highest data densities! If the data points are projected to the one-dimensional subspace of the x-axis, the resulting two clusters follow the Gaussian assumption well while satisfying the constraints. Note that PCA selects a projection direction that is predominantly based on the y-axis, because the data variance in that direction is large. However, the clusters formed after such a projection will violate the constraints. In general, it is quite possible that, given a high-dimensional data set, there exists a low-dimensional subspace such that the clusters after projection are Gaussian-like, and the constraints are satisfied by those clusters.

The third reason is that the projection can be combined with the kernel trick to achieve clusters with arbitrary shapes. A nonlinear transformation is applied to the data set to embed the data in a high-dimensional feature space. A linear subspace of the given feature space is sought such that the Gaussian clusters formed in that subspace are consistent with the given set of constraints. Because of the nonlinear transformation, linear cluster boundaries in that subspace correspond to nonlinear boundaries in the original input space. The exact form of the nonlinear boundaries is controlled by the type of the nonlinear transformation applied. Note that such a transformation need not be performed explicitly because of the kernel trick (see Section 2.5.1). In practice, kernel PCA is first performed on the data in order to extract the main structure in the high-dimensional feature space.
Let PTuj and T be the cluster center of the j-th Gaussian component and the common covariance matrix, respectively. Let R be the Cholesky decomposition of T, i.e., T = RRT. We have 1 p(X2|22 = j) = (2.5-(172(th 101/2 exp (—-,—(x.- — PTpJ-iTrtx. — PTMfi) . (5.30) Because T = RTPTPR, we can rewrite the above as losp(xz'|22 = j) I I = —% log(27r) + glog det T — -2—(y,- — pj)TPTPT(y,- — 22]), (5-31) I (l 1 1 2 _§ log(27r) + —2— log det FTF — 5m — uj)TFFT(y2 - 1a)), where F = PR. Note the similarity between this expression and that of log p(yilzi = j) if we adopt the parameterization T = FFT as discussed in Section 5.4.2.1. We have 6 . 5’71 losp(xilz-i = J) = FFTin — m). (5.32) 200 a ' I _ U—F 1()gP(Xi[3i : J) : F(FTF) 1 — (Yi _ Hj)(y1'— Hj)TF- (533) While P has an orthogonality constraint, there is no constraint on F, and thus we cast our optimization problem in terms of F. The parameters F, pj and 63- can be found by optimizing ,7, after substituting Equation (5.31) as log qij into Equation (5.16). In practice, the quasi-Newton algorithm is used to find the parameters that minimize the objective function, because it is difficult to inverse the Hessian efficiently. It is interesting to point out that this subspace learning procedure is related to linear discriminant analysis if the data points y,- are standardized to have equal vari- ance. If we fix T to be the identity matrix, maximizing the log-likelihood is the same as minimizing (y, -— pj)TPTP(y,- — 22]). This is the within-class scatter of the j-th cluster. Since the sum of between-class scatter and the within-class scatter is the total data scatter, which is constant because of the standardization, maximization of the within-class scatter is the same as maximizing the ratio of between-class scatter to the within-class scatter. This is what linear discriminant analysis does. 5.6 Experiments To verify the effectiveness of the proposed approach, we have applied our algorithm to both synthetic and real world data sets. we compare the proposed algorithm with two state-of—the-art algorithms for clustering under constraints. The first one, denoted by Shental, is the algorithm proposed by Shental et al. in [231]. It uses “chunklets” to represent the cluster labels of the objects involved in must—link con- 201 straints, and a l\-"‘Iarkov network to represent the cluster labels of objects participating in must-not—link constraints. The EM algorithm is used for parameter estimation, and the E-step is done by computations within the Markov network. It is not clear from the paper the precise algorithm used for the inference in the E—step, though the Matlab implementation3 provided by the authors seems to use the junction tree algorithm. This can take exponential time when the constraints are highly coupled. This potential high time complexity is the motivation for the mean-field approxi- mation used in the E-step of [161]. The second algorithm, denoted by Basu, is the constrained k-means algorithm with metric learning4 described in [21]. It is based on the idea of hidden Markov random field, and it uses the constraints to adjust the metrics between different data points. A parameter is needed for the strength of the constraints. Note that we do not compare our approach with the algorithm in [161], because its implementation is no longer available. 
5.6 Experiments

To verify the effectiveness of the proposed approach, we have applied our algorithm to both synthetic and real world data sets. We compare the proposed algorithm with two state-of-the-art algorithms for clustering under constraints. The first one, denoted by Shental, is the algorithm proposed by Shental et al. in [231]. It uses "chunklets" to represent the cluster labels of the objects involved in must-link constraints, and a Markov network to represent the cluster labels of objects participating in must-not-link constraints. The EM algorithm is used for parameter estimation, and the E-step is done by computations within the Markov network. The precise algorithm used for the inference in the E-step is not clear from the paper, though the Matlab implementation³ provided by the authors seems to use the junction tree algorithm. This can take exponential time when the constraints are highly coupled. This potential high time complexity is the motivation for the mean-field approximation used in the E-step of [161]. The second algorithm, denoted by Basu, is the constrained k-means algorithm with metric learning⁴ described in [21]. It is based on the idea of hidden Markov random fields, and it uses the constraints to adjust the metrics between different data points. A parameter is needed for the strength of the constraints. Note that we do not compare our approach with the algorithm in [161], because its implementation is no longer available.

³The url is http://www.cs.huji.ac.il/~tomboy/code/ConstrainedEM_plusBNT.zip.
⁴Its implementation is available at http://www.cs.utexas.edu/users/ml/risc/code/.

5.6.1 Experimental Result on Synthetic Data

Our first experiment is based on the example in Figure 5.3(a), which contains 400 points generated by four Gaussians centered at $(2, 8)^T$, $(-2, 8)^T$, $(2, -8)^T$ and $(-2, -8)^T$, each with identity covariance matrix. Recall that the goal is to group this data set into two clusters, a "left" and a "right" cluster, based on the two must-link constraints. Specifically, points with negative and positive horizontal co-ordinates are intended to be in two different clusters. Note that this synthetic example differs from the similar one in [161] in that the vertical separation between the top and bottom point clouds is larger. This increases the difference between the goodness of the "left/right" and "top/bottom" clustering solutions, so that a small number of constraints is no longer powerful enough to bias one clustering solution over the other as in [161].

The results of running the algorithms Shental and Basu are shown in Figures 5.4(a) and 5.4(b), respectively. For Shental, the two Gaussians estimated are also shown. Not only did both algorithms fail to recover the desired cluster structure, but the cluster assignments found were also counter-intuitive. This failure is due to the fact that these two approaches represent the constraints by imposing prior distributions on the cluster labels, as explained earlier in Section 5.2.

The result of applying the proposed algorithm to this data set with $\lambda = 250$ is shown in Figure 5.4(c). The two desired clusters have been almost perfectly recovered, when we compare the solution visually with the desired cluster structure in Figure 5.3(c). A more careful comparison is done in Figure 5.4(d), where the cluster boundary obtained by the proposed algorithm (the gray dotted line) is compared with the ground-truth (the solid green line). We can see that these two boundaries are very close to each other, indicating that the proposed algorithm discovered a good cluster boundary. This compares with the similar example in [167], where the cluster boundary (as inferred from the Gaussians shown) is quite different⁵ from the desired cluster boundary.

⁵Note that the synthetic data example in [167] is fitted with a mixture model with different covariance matrices per class. Therefore, comparing it with the proposed algorithm may not be the most fair.
This is because data points not in a particular cluster can still contribute, though to a smaller extent, to the covariance of the Gaussian distributions due to the soft-assignment implied in the mixture model. This is not the case when the covariance matrix is estimated based on the ground-truth labels. While the proposed algorithm is the only clustering under constraints algorithm we know that can return the two desired clusters, we want to note that a sufficiently large A is needed for its success. If A = 50, for example, the result of the proposed algorithm is shown in Figure 5.4(f). This is virtually identical to the clustering solution without any constraints (Figure 5.3(b)). While the constraints are violated, the clustering solution is more “reasonable” than the solutions shown in Figures 5.4(a) and 5.4(b). Note that it is easy to detect that A is too small in this example, because the constraints are violated. We should increase A until this is no longer the case. The resulting clustering solution will effectively be identical to the desired solution 204 shown in Figure 5.4(c). 5.6.2 Experimental Results on Real World Data We have also compared the proposed algorithm with the algorithms Shental and Basu based on real world data sets obtained from different domains. The label information in these data sets is used only for the creation of the constraints and for performance evaluation. In 1;)articular, the labels are not used by the clustering algorithms. 5.6.2.1 Data Sets Used Table 5.2 summarizes the characteristics of the data sets used. The following prepro- cessing has been applied to the data. whenever necessary. If a data set has a nominal feature that can assume 0 possible values with c > 2, that feature is converted into 0 continuous features. The 2-th such feature is set to one when the nominal feature assumes the 2-th possible value, and the remaining c— 1 continuous features are set to zero. If the variances of the features of a data set are very different, standardization is applied to all the features, so that the variances or the ranges of the preprocessed features become the same. If the number of features is too large when compared with the number of data points 72, principal component analysis (PCA) is applied to reduce the dimensionality. The number of reduced dimension d is determined by finding the largest d that satisfies n > 3d”, while the principal components with negligible eigen- values are also discarded. The difficulty of the classification tasks associated with these data sets can be seen by the values of the F -score and the normalized mutual information (to be defined in Section 5.6.2.3) computed using the ground—truth labels, 205 0 _________ O 10‘ ,t - ‘ ‘ I 3 ‘‘‘‘‘‘ ’0‘ ~ - 10’ o o O o - 00,9 6° 0 ' ’ f ~‘ 0. 0 o .0 o o . '3 xiii-xx ' x.) . . v t'stui .W 1 ‘~-_‘L’. .9 t—!_>_'.-" ‘... 00 0: '.‘ 5 O ----------- 5. 0 - 0r 0. -5 ,ffxt—‘f‘ -------------- 13””: 7“ ., fig“ 5.: - 5’s, ..eae as. £55.35) , sex ewes - . --y __________________ . '1? 6 ‘15 o (a) Result of the algorithm Shental (b) Result of the algorithm Basu 15 1’, ’ xx Xx, “.\ 1 15» __We 4 I ‘ \ 10 ‘ x § ‘\,". g 0 Y 10' .,.;fi . “313.5...“ '- ‘r‘ 2"” ”Fifi-ii“. 71". e. e? .3: 1‘ :-' .2 5. I! X I" (I 5. i .' I I 0' I I l l 07 I z: : . " I I _5_ _ . . :s‘ , ...,: ...-4: , l '1: .1. ,0, " «15 "r o 5 (d) The cluster boundaries in (c) 10 ° ] u.» x-qisrryiitomes-aw ~:- 1 .1! .. o ... IQ” o r) “‘{y ,—.—.-..2zi'_‘.._t-3-e'-:“‘.' 5. o o- . -5 x _______ _,,r— «19 ' I «ate "”2! 
Data Sets from UCI

The following data sets are obtained from the UCI machine learning repository.⁶ The list below includes most of the data sets in the repository that have mostly continuous features and have relatively balanced class sizes.

The dermatology database (derm) contains 366 cases with 34 features. The goal is to determine the type of Erythemato-Squamous disease based on the features extracted. The age attribute, which has missing values, is removed. PCA is performed to reduce the resulting 33-dimensional data to 11 features. The sizes of the six classes are 112, 61, 72, 49, 52 and 20.

The optical recognition of handwritten digits data set (digits) is based on normalized bitmaps of handwritten digits extracted from a preprinted form. The 32x32 bitmaps are divided into non-overlapping blocks of 4x4, and the number of pixels is counted in each block. Thus 64 features are obtained for each digit. The training and testing sets are combined, leading to 5620 patterns. PCA is applied to reduce the dimensionality to 42 to preserve 99% of the total variance. The sizes of the ten classes are 554, 571, 557, 572, 568, 558, 558, 566, 554, and 562.

The ionosphere data set (ion) consists of 351 radar readings returned from the ionosphere. Seventeen pulse numbers are extracted from each reading. The real part and the imaginary part of the complex pulse numbers constitute the 34 features per pattern. There are two classes: "good" radar returns (225 patterns) are those showing evidence of some type of structure in the ionosphere, and "bad" returns (126 patterns) are those that do not; their signals pass through the ionosphere. PCA is applied to reduce the dimensionality to 10.

⁶The url is http://www.ics.uci.edu/mlearn/MLRepository.html
The training and testing sets are combined to form a data set with 2310 patterns. After standardizing each feature to have variance one, PCA is applied to reduce the dimensionality of the data to 10. The seven classes correspond to brick-face, sky, foliage, cement, window, path, and grass. Each of the classes has 330 patterns. 208 Data Sets from Statlog in UCI The following five data sets are taken from the Statlog section7 in the UCI machine learning repository. The Australian credit approval data set (austra) has 690 instances with 14 at- tributes. The two classes are of size 383 and 307. The continuous features are stan- dardized to have standard deviation 0.5. Four of the features are non-binary nominal features, and they are converted to multiple continuous features. PCA is then applied to reduce the dimensionality of the concatenated feature vector to 15. The German credit data (german) contains 1000 records with 24 features. The version with numerical attributes is used in our experiments. PCA is used to reduce the dimensionality of the data to 18, after standardizing the features so that all of them lie between zero and one. The two classes have 700 and 300 records. The heart data set (heart) has 270 observations with 13 raw features in two classes with 150 and 120 data points. The three nominal features are converted into continuous features. The continuous features are standardized to have standard deviation 0.5, before applying PCA to reduce the data set to 9 features. The satellite image data set (sat) consists of the multi-spectral values of pixels in 3x3 neighborhoods in a satellite image. The aim is to classify the class associated with the central pixel, which can be “red soil”, “cotton crop”, “grey soil”, “damp grey soil”, “soil with vegetation stubble” or “very damp grey soil”. The training and the testing sets are combined to yield a data set of size 6435. There are 36 features altogether. The classes are of size 1533, 703, 1358, 626, 707 and 1508. 7The url is http://www. ics.uci.edu/mlearn/databases/statlog/ 209 The vehicle silhouettes data set (vehicle) contains a set of features extracted from the silhouette of a. vehicle. The goal is to classify a vehicle as one of the four types (Opel, Saab, bus, or van) based on the silhouette. There are altogether 846 patterns in the four classes with sizes of the four classes as 212, 217, 218, and 199. The features are first standardized to have standard deviation one, before applying PCA to reduce the dimensionality to 16. Other Data Sets We have also experimented the proposed algorithm with data sets from other sources. The texture classification data set (texture) is taken from [127]. It consists of 4000 patterns with four different types of textures. The 19 features are based on Gabor filter responses. The four classes are of sizes 987, 999, 1027, and 987. The online handwritten script data set (script), taken from [192], is about a problem that classifies words and lines in an online handwritten document into one of the six major scripts: Arabic, Cyrillic, Devnagari, Han, Hebrew, and Roman. Eleven spatial and temporal features are extracted from the strokes of the words. There are altogether 12938 patterns, and the sizes of the six classes are 1190, 3173, 1773, 3539, 1002, and 2261. The ethnicity recognition data set (ethn) was originally used in [175]. The goal is to classify a 64x64 face image into two classes: “Asian” (1320 images) and “non- Asian” (1310 images). 
It includes the PF 01 database⁸, the Yale database⁹, the AR database [181], and the non-public NLPR database¹⁰. Some example images are shown in Figure 5.5. 30 eigenface coefficients are extracted to represent the images.

8 http://nova.postech.ac.kr/archives/imdb.html
9 http://cvc.yale.edu/projects/yalefaces/yalefaces.html
10 Provided by Dr. Yunhong Wang, National Laboratory for Pattern Recognition, Beijing.

Figure 5.5: Example face images in the ethnicity classification problem for the data set ethn. (a) Asians; (b) Non-Asians.

The clustering under constraints algorithm is also tested on an image segmentation task based on the Mondrian image shown in Figure 5.6, which has five distinct segments. The image is divided into 101 by 101 sites. Twelve histogram features and twelve Gabor filter responses of four orientations at three different scales are extracted. Because the histogram features always sum to one, PCA is performed to reduce the dimension from 24 to 23. The resulting data set Mondrian contains 10201 patterns with 23 features in 5 classes. The sizes of the classes are 2181, 2284, 2145, 2323, and 1268.

Figure 5.6: The Mondrian image used for the data set Mondrian. It contains 5 segments. Three of the segments are best distinguished by Gabor filter responses, whereas the remaining two are best distinguished by their gray-level histograms.

The 3-newsgroup database¹¹ is about the classification of Usenet articles from different newsgroups. It has been used previously to demonstrate the effectiveness of clustering under constraints in [14]. It consists of three classification tasks (diff-300, sim-300, same-300), each of which contains roughly 300 documents from three different topics. The topics are regarded as the classes to be discovered. The three classification tasks are of different difficulties: the sets of three topics in diff-300, sim-300, and same-300 respectively have increasing similarities. Latent semantic indexing is applied to the tf-idf normalized word features to convert each newsgroup article into a feature vector of dimension 10. The three classes in diff-300 are all of size 100, whereas the numbers of patterns in the three classes of sim-300 are 96, 97, and 98. The sizes of the classes in same-300 are 99, 98, and 100.

11 It can be downloaded at http://www.cs.utexas.edu/users/ml/risc/.

Notice that the data sets ethn, Mondrian, diff-300, sim-300, and same-300 have been used in the previous work [161]. The same preprocessing is applied for both ethn and Mondrian as in [161], though we reduce the dimensionality from 20 to 10 for the diff-300, sim-300, and same-300 data sets based on our "n > 3d²" rule.

[Table 5.2: Summary of the data sets used in the experiments, including the number of data points and the number of features of each data set.]
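Most of the data sets summarized in Table 5.2 are prepared in the same two steps: the features are standardized, and PCA is then applied, either to a fixed target dimension or to the smallest dimension that preserves a given fraction of the total variance (95% or 99% above). The following minimal sketch illustrates this preprocessing; it assumes scikit-learn is available, and the function name `preprocess` is ours rather than part of the thesis software.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def preprocess(X, n_components):
    """Standardize the columns of X (n_samples x n_features) and reduce them with PCA.

    `n_components` may be an integer (a target dimension, e.g. 11 for derm) or a
    float in (0, 1), interpreted as the fraction of the total variance to preserve
    (e.g. 0.99 for digits, 0.95 for mfeat-fou)."""
    X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X_std)

# Example (hypothetical variable): reduce a 64-dimensional digits matrix so that
# 99% of the total variance is preserved.
# X_reduced = preprocess(X_digits, 0.99)
```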
5.6.2.2 Experimental Procedure

For each data set listed in Table 5.2, a constraint was specified by first generating a random point pair (y_i, y_j). If the ground-truth class labels of y_i and y_j were the same, a must-link constraint was created between y_i and y_j; otherwise, a must-not-link constraint was created. Different numbers of constraints were created as a percentage of the number of points in the data set: 1%, 2%, 3%, 5%, 10%, and 15%. Note that the constraints were generated in a "cumulative" manner: the set of "1%" constraints was included in the set of "2%" constraints, and so on.

The line-search Newton algorithm was used to optimize the objective function J in the proposed approach. The Gaussians were represented by their natural parameters, with a common precision matrix shared among the different Gaussian components. This particular choice of optimization algorithm was made based on a preliminary efficiency study, where this approach was found to be the most efficient among all the algorithms described in Section 5.4.1. Because the gradient is available in line-search Newton, convergence was declared when the norm of the gradient fell below a threshold relative to the norm of the initial gradient. Note that this is a stricter and more reasonable convergence criterion than the one typically used in the EM algorithm, which is based on the relative change of the log-likelihood. However, in order to safeguard against round-off error, we also declare convergence when the relative change of the objective function is very small: 10⁻¹⁰, to be precise. Starting with a random initialization, line-search Newton was run with γ = 1 and τ = 0.25, with the convergence threshold set to 10⁻². Line-search Newton was run again after increasing τ to 1, with the convergence threshold tightened to 10⁻³. Finally, τ and the convergence threshold were set to 4 and 10⁻⁴, respectively. The optimization algorithm was also stopped if convergence was not achieved within 5000 Newton iterations. Fifteen random initializations were attempted, and the solution with the best objective function value was regarded as the solution found by the proposed algorithm.

The above procedure, however, assumes the constraint strength λ is known. The value of λ was determined using a set of validation constraints. The constraints for the training set and the constraints for the validation set were obtained using the following rules. Given a data set, if the number of constraints was less than 3k, k being the number of clusters, all the available constraints were used for both training and validation. This procedure, while risking overfitting, is necessary because a too small set of constraints is poor both for training the clusters and for estimating λ. When the number of constraints was between 3k and 6k, the numbers of training constraints and validation constraints were both set to 3k. So, the training constraints overlapped with the validation constraints.
When the number of constraints was larger than 6k, half of the constraints were used for training and the other half were used for validation.

Starting with λ = 0.1, we increased λ by multiplying it by √10. For each λ, the proposed algorithm was executed. A better value of λ was encountered if the number of violations of the validation constraints was smaller than the current best. If there was a tie, the decision was made on the number of violations of the training constraints. If the best value of λ did not change for four iterations, we assumed that the optimal value of λ had been found. The proposed algorithm was then executed again using all the available constraints and the λ value just determined. The resulting solution was compared with the solution obtained using only the training constraints, and the one with the smaller total number of constraint violations was regarded as our final clustering solution. If there was a tie, the solution obtained with the training constraints only was selected.

The algorithms Shental and Basu were run using the same set of data and constraints as input. For Shental, we modified the initialization strategy in their software, which involved a two-step process. First, five random parameter vectors were generated, and the one with the highest log-likelihood was selected as the initial value of the EM algorithm. Convergence was achieved if the relative change in the log-likelihood was less than a threshold, which is 10⁻⁶ by default. This process was repeated 15 times, and the parameter vector with the highest log-likelihood was regarded as the solution. For easier comparison, we also assumed a common covariance matrix among the different Gaussian components. For the algorithm Basu, the authors provided their own initialization strategy, which is based on the set of constraints provided. The algorithm was run 15 times, and the solution with the best objective function value was picked. The algorithm Basu requires a constraint penalty parameter. In our experiments, a wide range of values was tried: 1, 2, 4, 8, 16, 32, 64, 128, 256, 500, 1000, 2000, 4000, 8000, and 16000. We only report the results with the best possible penalty values. As a result, the performance of Basu reported here might be inflated.

5.6.2.3 Performance criteria

A clustering under constraints algorithm is said to perform well on a data set if the clusters obtained are similar to the ground-truth classes. Consider the k by k "contingency matrix" $\{c_{ij}\}$, where $c_{ij}$ denotes the number of data points that originally come from the i-th class and are assigned to the j-th cluster. If the clusters match the true classes perfectly, there should be only one non-zero entry in each row and each column of the contingency matrix. Following the common practice in the literature, we summarize the contingency matrix by the F-score and the normalized mutual information (NMI).

Consider the "recall matrix" $\{r_{ij}\}$, in which the entries are defined by $r_{ij} = c_{ij} / \sum_{j'} c_{ij'}$. Intuitively, $r_{ij}$ denotes the proportion of the i-th class that is "recalled" by the j-th cluster. The "precision matrix" $\{p_{ij}\}$, on the other hand, is defined by $p_{ij} = c_{ij} / \sum_{i'} c_{i'j}$. It represents how "pure" the j-th cluster is with respect to the i-th class. Entries in the F-score matrix $\{f_{ij}\}$ are simply the harmonic means of the corresponding entries in the precision and recall matrices, i.e., $f_{ij} = 2 r_{ij} p_{ij} / (r_{ij} + p_{ij})$.
The F-score of the i-th class, $F_i$, is obtained by assuming that the i-th class matches¹² with the best cluster, i.e., $F_i = \max_j f_{ij}$. The overall F-score is computed as the weighted sum of the individual $F_i$ according to the sizes of the true classes, i.e.,

$$\text{F-score} = \sum_{i=1}^{k} \frac{\sum_{j=1}^{k} c_{ij}}{n} \, F_i. \qquad (5.34)$$

12 Here, we do not require that one cluster can only match to one class. If this one-to-one correspondence is desired, the Hungarian algorithm should be used to perform the matching instead of the max operation used to compute $F_i$.

Note that the precision of an empty cluster is undefined. This problem can be circumvented if we require that empty clusters, if any, do not contribute to the overall F-score.

The computation of normalized mutual information interprets the true class label and the cluster label as two random variables U and V. The contingency table, after dividing by n (the number of objects), forms the joint distribution of U and V. The mutual information (MI) between U and V can be computed based on this joint distribution. Since the range of the mutual information depends on the sizes of the true classes and the sizes of the clusters, we normalize the MI by the average of the entropies of U and V (denoted by H(U) and H(V)) so that the resulting value lies between zero and one. Formally, we have

$$
\begin{aligned}
H(U) &= -\sum_{i=1}^{k} \frac{\sum_{j=1}^{k} c_{ij}}{n} \log \frac{\sum_{j=1}^{k} c_{ij}}{n}, \\
H(V) &= -\sum_{j=1}^{k} \frac{\sum_{i=1}^{k} c_{ij}}{n} \log \frac{\sum_{i=1}^{k} c_{ij}}{n}, \\
H(U,V) &= -\sum_{i=1}^{k} \sum_{j=1}^{k} \frac{c_{ij}}{n} \log \frac{c_{ij}}{n}, \\
\text{MI} &= H(U) + H(V) - H(U,V), \qquad \text{NMI} = \frac{\text{MI}}{\bigl(H(U) + H(V)\bigr)/2}.
\end{aligned} \qquad (5.35)
$$

For both the F-score and the NMI, the higher the value, the better the match between the clusters and the true classes. For a perfect match, both NMI and F-score take the value of 1. When the cluster labels are completely independent of the class labels, NMI takes its smallest value of 0. The minimum value of the F-score depends on the sizes of the true classes. If all the classes are of equal size, the lower bound of the F-score is 1/k. In general, the lower bound of the F-score is higher, and it can be more than 0.5 if there is a dominant class.

5.6.2.4 Results

The results of clustering the data sets mentioned in Section 5.6.2.1 when there are no constraints are shown in Table 5.3. In the absence of constraints, both the proposed algorithm and Shental effectively find the cluster parameter vector that maximizes the log-likelihood, whereas Basu is the same as the k-means algorithm. One may be surprised to discover from Table 5.3 that, even though the proposed algorithm and Shental optimize the same objective function, their results are different. This is understandable when we notice that the line-search Newton algorithm used by the proposed approach and the EM algorithm used by Shental can locate different local optima. It is sometimes argued that maximizing the mixture log-likelihood globally is inappropriate, as it can go to infinity when one of the Gaussian components has an almost singular covariance matrix. However, this is not the case here, because the covariance matrices all have small condition numbers, as seen in Table 5.3. Therefore, among the two solutions produced by the proposed approach and by Shental, we take the one with the larger log-likelihood. In the remaining experiments, the no-constraint solutions found by the proposed algorithm were also used as the initial values for Shental, because we are interested in locating the best possible local optima for the objective functions.
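Before turning to the tables, the following sketch shows how the criteria of Section 5.6.2.3 can be computed from the two label vectors. It is an illustrative NumPy implementation of Equations (5.34) and (5.35), not the code used to produce the results; class and cluster labels are assumed to be integers in {0, ..., k-1}, every class is assumed to be non-empty, and natural logarithms are used in the entropies.

```python
import numpy as np

def contingency(classes, clusters, k):
    """k x k contingency matrix: c[i, j] = number of points of class i in cluster j."""
    c = np.zeros((k, k))
    for i, j in zip(classes, clusters):
        c[i, j] += 1
    return c

def f_score(c):
    """Overall F-score of Eq. (5.34); empty clusters contribute nothing, as required."""
    n = c.sum()
    class_sizes = c.sum(axis=1, keepdims=True)     # one row per true class
    cluster_sizes = c.sum(axis=0, keepdims=True)   # one column per cluster
    recall = c / class_sizes                                            # r_ij
    precision = np.divide(c, cluster_sizes,
                          out=np.zeros_like(c), where=cluster_sizes > 0)  # p_ij
    f = np.divide(2 * recall * precision, recall + precision,
                  out=np.zeros_like(c), where=(recall + precision) > 0)   # f_ij
    return float((class_sizes.ravel() / n * f.max(axis=1)).sum())

def nmi(c):
    """Normalized mutual information of Eq. (5.35): MI divided by the average entropy."""
    n = c.sum()
    pu, pv, puv = c.sum(axis=1) / n, c.sum(axis=0) / n, c / n
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    mi = h(pu) + h(pv) - h(puv)
    return float(mi / (0.5 * (h(pu) + h(pv))))
```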
[Table 5.3: Performance of the different algorithms in the absence of constraints. The headings n, F, NMI, log-lik, and κ show the number of data points, the F-score, the normalized mutual information, the log-likelihood, and the condition number of the common covariance matrix, respectively.]

The results of running our proposed algorithm, Shental, and Basu with the 1%, 2%, 3%, 5%, 10%, and 15% constraint levels are shown in Tables 5.4, 5.5, 5.6, 5.7, 5.8, and 5.9, respectively. In these tables, the columns under "Proposed" correspond to the performance of the proposed algorithm. The heading λ denotes the value of the constraint strength as determined by the validation procedure. The heading "Shental, default init" corresponds to the performance when the algorithm Shental is initialized by its default strategy, whereas "Shental, special init" corresponds to the result when Shental is initialized by the no-constraint solution found by the proposed approach. The heading "log-lik" shows the log-likelihood of the resulting parameter vector. Among these two solutions of Shental, the one with the higher log-likelihood is selected, and its performance is shown under the heading "Shental, combine".

From these tables, we can see that Shental with the default initialization often yields a higher performance than Shental with the special initialization. However, the log-likelihood of Shental with the default initialization is sometimes smaller. By the principle of maximum likelihood, such a solution, though it has a higher F-score and/or NMI, should not be accepted. This observation has the implication that the good performance of Shental reported in comparative work such as [161] might be due to the initialization strategy instead of the model used. The fact that we are more interested in comparing the model used in Shental with that used in the proposed approach, instead of the strategy for initialization, is the reason why we run Shental with the special initialization.
We have also tried to do something similar with Basu, but its initialization routine is integrated with the main clustering routine 221 so that it is non-trivial to modify the initialization strategy. The numbers listed in Tables 5.3 to 5.9 are visualized in Figures 5.7 to 5.13. For each data set, we draw the F-score and the N MI with an increasing number of constraints. The horizontal axis corresponds to different constraint levels in terms of the percentages of the number of data points, whereas the vertical axis corresponds to the F-score or the NMI. The results of the proposed algorithm, Shental, and Basu are shown by the (red) solid lines, (blue) dotted lines, and (black) dashed lines, respectively. For comparison, the (gray) dashdot lines in the figures show the F-score and the NMI due to a classifier trained using the labels of all the objects in the data set under the assumption that the class conditional densities are Gaussian with a common covariance matrix. The data sets. are grouped according to the performance of the proposed algorithms. The proposed algorithm outperformed both Shental and Basu for the data sets shown in Figures 5.7 to 5.9. The performance of the proposed algorithm is comparable to its competitors for the data sets shown in Figures 5.10 to 5.12. For the data sets shown in Figures 5.13, the proposed algorithm is slightly inferior to one of its competitors. We shall examine the performance on individual data sets later. Perhaps the first observation from these figures is that the performance is not monotonic, i.e., the F-score and the N MI can actually decrease when there are ad- ditional constraints. This is counter-intuitive, because one expects improved results when more information (in the form of constraints) is fed as the input to the algo— rithms. Note that this lack of monotonicity is observed for all the three algorithms. There are three reasons for this. First, the additional constraints can be based on 222 data points that are erroneously labeled (errors in the ground truth), or they are “outlier” in the sense that they would be nus-classified by most reasonable super- vised classifiers trained with all the labels known. The additional constraints in this case serve as “mis—information”, and it can hurt the performance of the clustering under constraints algorithms. This effect is more severe for the proposed approach when there are only a small number of constraints, because the influence of each of the constraints may be magnified by a large value of A. The second reason is that an algorithm may locate a poor local optima. In general, the larger the number of constraints, the greater the number of local optima in the energy landscape. So, the proposed algorithm as well as Shentaland Basu is more likely to get trapped in poor local optima. This trend is the most obvious for Basu, as the performance at 10% and 15% constraint levels dropped for more than half of the data sets. This is not surpris- ing, because the iterative conditional mode used by Basu is greedy and it is likely to get trapped in local optima. The third reason is specific to the proposed approach. It is due to the random nature of the partitioning of the constraints into training set and validation set. If we have an unfavorable split, the value of A found by minimizing the number of violations on the set of validation constraints can be suboptimal. 
In fact, we observe that whenever there is a significant drop in the F-score and NMI, there often exists a better value of /\ than the one found by the validation procedure. Performance on Individual Data Sets The result on the ethn data set can be seen in Figures 5.7(a) and 5.7(b). The performance of the proposed algorithm im- proves with additional constraints, and it outperforms Shental and Basu at all con- 223 straint levels. A similar phenomenon occurs for the. Mondrian data set (Figures 5.7(c) and 5.7(d)) and the ion data set (Figures 5.7(e) and 5.7(f)). For Mondrian, note that 1% constraint level is already sufficient to bias the cluster parameter to match the re— sult using the ground-truth labels. Additional constraints only help marginally. The performance of the proposed algorithm for the script data set (Figures 5.8(a) and 5.8(b)) is better than Shental and Basu for all constraint levels except 1%, where the proposed algorithm is inferior to the result of Basu. However, given how much better the k-means algorithm is when compared with the EM algorithm in the absence of constraints, it is fair to say that the proposed algorithm is doing a decent job. For the data set derm, the clustering solution without any constraints is pretty good: that solution, in fact, satisfies all the constraints when the constraint levels are 1% and 2%. Therefore, it is natural that the performance does not improve with the provision of the constraints. However, when the constraint level is higher than 2%, the proposed algorithm again outperforms Shental and Basu (Figures 5.8(c) and 5.8(d)). The per- formance of the proposed algorithm on the vehicle data set is superior to Shental and Basu for all constraint levels except 5%, where the performance of Shental is slightly superior. For the data set wdbc, the performance of the proposed algorithm (Figures 5.9(a) and 5.9(b)) is better than Shental at all constraint levels except 5%. The proposed algorithm outperforms Basu when the constraint level is higher than 1%. The F-score of the proposed algorithm on the UCI-seg data (Figures 5.10(a)) is superior to Shental at three constraint levels and is superior to Basu at all but 1% constraint level. On the other hand, if N MI is used (Figure 5.10(b)), the proposed 224 1 algorithm does not do as well as the others. For the heart data set, the proposed algorithm is superior to Shental at all constraint levels, but it is superior to Basu at only 3% constraint level (Figures 5.10(c) and 5.10(d)). Note that the performance of Basu might be inflated because we only report its best results among all possible values of constraint penalty in this algorithm. we can regard the performance of the proposed algorithm on the austra data set (Figures 510(0) and 5.10(f)) as a tie with Shental and Basu, because the proposed algorithm outperforms Shental and Basuat three out of six possible constraint levels. For the german data set, the proposed algorithm performs the best in terms of NMI (Figure 5.11(b)), though the performances of all three algorithms are not that good. Apparently, this is a difficult data set. The performance of the proposed algorithm is less impressive when F—score is used, however (Figure 5.11(a)). The proposed algorithm is superior to Shental in performance for the Sim-300 data set (Figures 5.11(e) and 5.11(d)). While the proposed algorithm has a tie in performance when compared with Basu based on the F-score, Basu outperforms the proposed algorithm on this data set when N MI is used. 
The result of the diff-300 data set (Figures 5.11(e) and 5.11(f)) is somewhat similar: the proposed algorithm outperforms Shental at all constraint levels, but it is inferior to Basu. Given the fact that the k-means algorithm is much better than EM in the absence of constraints for this data set, the proposed algorithm is not as bad as it first seems. For the sat data set (Figures 5.12(a) and 5.12(b)), the proposed algorithm outperforms Shental and Basu significantly in terms of F-score when the constraint levels are 10% and 15%. The improvement in NMI is less significant, though the proposed method is still the best at three constraint levels. The result of the digits data set (Figures 5.12(c) and 5.12(d)) is similar: the proposed method is superior to its competitors at three and four constraint levels if the F-score and the NMI are used as the evaluation criteria, respectively.

It is difficult to draw any conclusion on the performance of the three algorithms on the mfeat-fou data set (Figures 5.13(a) and 5.13(b)). The performances of all three algorithms go up and down with an increasing number of constraints. Apparently this data set is fairly noisy, and clustering with constraints is not appropriate for it. For the data set same-300, the proposed algorithm does not perform well: it has a tie with Shental, but it is inferior to Basu at all constraint levels, as seen in Figures 5.13(c) and 5.13(d). The performance of the proposed algorithm is better than Shental only at the 15% constraint level for the data set texture (Figures 5.13(e) and 5.13(f)). The proposed algorithm is superior to Basu for this data set, though this is probably due to the better performance of the EM algorithm in the absence of constraints. Note that this is a relatively easy data set for model-based clustering: both k-means and EM have an F-score higher than 0.95 when no constraints are used.

5.6.3 Experiments on Feature Extraction

We have also tested the idea of learning the low-dimensional subspace and the clusters simultaneously in the presence of constraints. Our first experiment in this regard is based on the data set shown in Figure 5.3. The two features were standardized to variance one before applying the algorithm described in Section 5.5 with the two must-link constraints.
[Tables 5.4 to 5.9: Performance of the clustering under constraints algorithms at the 1%, 2%, 3%, 5%, 10%, and 15% constraint levels, respectively. For each data set, the tables report the number of data points n, the number of constraints, and the F-score, NMI, and log-likelihood obtained by the proposed algorithm (together with the constraint strength λ selected by the validation procedure), by Shental with the default initialization, the special initialization, and their combination, and by Basu.]
Figure 5.7: F-score and NMI for different algorithms for clustering under constraints for the data sets ethn, Mondrian, and ion. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line.
The horizontal axis shows the number of constraints as the percentage of the number of data points.

Figure 5.8: F-score and NMI for different algorithms for clustering under constraints for the data sets script, derm, and vehicle. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.

Figure 5.9: F-score and NMI for different algorithms for clustering under constraints for the data set wdbc. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.

Figure 5.10: F-score and NMI for different algorithms for clustering under constraints for the data sets UCI-seg, heart, and austra. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.

Figure 5.11: F-score and NMI for different algorithms for clustering under constraints for the data sets german, sim-300, and diff-300. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.

Figure 5.12: F-score and NMI for different algorithms for clustering under constraints for the data sets sat and digits. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.
Figure 5.13: F-score and NMI for different algorithms for clustering under constraints for the data sets mfeat-fou, same-300, and texture. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.

Based on the result shown in Figure 5.14(a), we can see that a good projection direction was found by the proposed algorithm. The projected data follow the Gaussian distribution well, as evident from Figure 5.14(b).

Our second experiment concerns the combination of feature extraction and the kernel trick to detect clusters with general shapes. The two-ring data set (Figure 5.15(a)) considered in [158], which used a hidden Markov random field approach for clustering with constraints in kernel k-means, was used. As in [158], we applied the RBF kernel to transform this data set of 200 points nonlinearly. The kernel width was set to 0.2, which was the 20th percentile of all the pairwise distances. Unlike [158], we applied kernel PCA to this data set and extracted 20 features. The algorithm described in Section 5.5 was used to learn a good projection of these 20 features into a 2D space while clustering the data into two groups simultaneously in the presence of 60 randomly generated constraints. The result shown in Figure 5.15(b) indicates that the algorithm successfully found a 2D subspace in which the two clusters were Gaussian-like and all the constraints were satisfied. When we plot the cluster labels of the original two-ring data set, we can see that the desired clusters (the "inner" and the "outer" rings) were recovered perfectly (Figure 5.15(c)). Note that the algorithm described in [158] required at least 450 constraints to identify the two clusters perfectly, whereas we have only used 60 constraints. For comparison, the spectral clustering algorithm in [194] was applied to this data set using the same kernel matrix as the similarity. The two desired clusters could not be recovered (Figure 5.15(d)). In fact, the two desired clusters were never recovered even when we tried other values of the kernel width.

Figure 5.14: The result of simultaneously performing feature extraction and clustering with constraints on the data set in Figure 5.3(a). (a) Clustering result and the axis to be projected; (b) clustering result after projection to the axis. The blue line in (a) corresponds to the projection direction found by the algorithm. The projected data points (which are 1D), together with the cluster labels and the two Gaussians, are shown in (b).
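The kernel preprocessing of the two-ring experiment can be sketched as follows: the RBF kernel width is taken to be the 20th percentile of all pairwise Euclidean distances, and kernel PCA then yields the 20 features fed to the algorithm of Section 5.5. The snippet below is an illustration using SciPy and scikit-learn, and it assumes the parameterization k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); the exact kernel convention used in the thesis software may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import KernelPCA

def two_ring_features(X, n_components=20, percentile=20):
    """Build an RBF kernel whose width is a percentile of the pairwise distances,
    then extract kernel-PCA features from the precomputed kernel matrix."""
    d = pdist(X)                             # condensed pairwise Euclidean distances
    sigma = np.percentile(d, percentile)     # kernel width (about 0.2 for the two-ring data)
    K = np.exp(-squareform(d) ** 2 / (2.0 * sigma ** 2))
    kpca = KernelPCA(n_components=n_components, kernel="precomputed")
    return kpca.fit_transform(K)             # n x 20 feature matrix for constrained clustering
```

From this point on, the constrained clustering and the learning of the 2D projection proceed exactly as in the Gaussian case.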
Figure 5.15: An example of learning the subspace and the clusters simultaneously. (a) The original data and the constraints, where solid (dotted) lines correspond to must-link (must-not-link) constraints. (b) Clustering result of projecting the 20 features extracted by kernel PCA to a 2D space. (c) Clustering solution. (d) Result of applying spectral clustering [194] to this data set with two clusters, using the same kernel used for kernel PCA.

5.7 Discussion

5.7.1 Time Complexity

The computation of the objective function and its gradient requires the calculation of r_ij, s_ij, and w_ij, and the weighted sums of different sufficient statistics with r_ij and w_ij as weights. When compared with the EM algorithm for standard model-based clustering, the extra computation in the proposed algorithm is due to s_ij, w_ij, and the accumulation of the corresponding sufficient statistics. These take O(kd(m⁺ + m⁻ + n*)) time, where k, d, m⁺, m⁻, and n* denote the number of clusters, the dimension of the feature vector, the number of must-link constraints, the number of must-not-link constraints, and the number of data points involved in any constraint, respectively. This is smaller than the O(kdn) time required for one iteration of the EM algorithm, with n indicating the total number of data points. Multiplication by the inverse of the Hessian can be done in O(d³) time without forming the Hessian explicitly. Each Newton iteration can also require more than one objective function evaluation because of the line-search.

Each iteration in the algorithm Shental is similar to that in the standard EM algorithm. The difference is in the E-step, in which Shental involves an inference for a Markov network. This can take exponential time with respect to the number of constraints in the worst case. The per-iteration computation cost in Basu is in general smaller than for both Shental and the proposed algorithm, because it is fundamentally the k-means algorithm. However, the use of iterative conditional mode to solve for the cluster labels in the hidden Markov random field, as well as the metric learning based on the constraints, becomes the overhead due to the constraints.

In practice, the proposed algorithm is slower than the other two because of the cross-validation procedure used to determine the optimal λ. Even when λ is fixed, however, the proposed algorithm is still slower because (i) the optimization problem considered by the proposed algorithm is more difficult than those considered by Shental and Basu, and (ii) the convergence criterion based on the relative norm of the gradient is stricter.
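The stricter stopping rule mentioned in (ii) is the one from Section 5.6.2.2: the gradient norm must fall below a fraction of the initial gradient norm, with a small relative change of the objective as a round-off safeguard and a cap on the number of Newton iterations. A hypothetical sketch of this test, whose names and default thresholds are illustrative rather than taken from the thesis software, is:

```python
import numpy as np

def converged(grad, grad0, J_prev, J_curr, it,
              rel_grad_tol=1e-4, rel_obj_tol=1e-10, max_iter=5000):
    """Stopping test applied after each Newton step (illustrative).

    - Main criterion: gradient norm below `rel_grad_tol` times the norm of the
      initial gradient (the schedule in Section 5.6.2.2 uses 1e-2, 1e-3, 1e-4).
    - Safeguard: relative change of the objective below 1e-10.
    - Hard cap of 5000 Newton iterations.
    """
    if it >= max_iter:
        return True
    if np.linalg.norm(grad) < rel_grad_tol * np.linalg.norm(grad0):
        return True
    return abs(J_curr - J_prev) < rel_obj_tol * max(1.0, abs(J_prev))
```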
5.7.2 Discriminative versus Generative One way to view the difference between the proposed algorithm and the algorithms Shental and Basu is that both Shental and Basu are generative, whereas the pro- posed approach is a combination of generative and discriminative. In supervised learning, a classifier is “generative” if it assumes a certain model on how the data from different classes are generated via the specification of the class conditional den- 243 sities, whereas a “discriminative” classifier is built by optimizing some error measure, without any regard to the class conditional densities. Discriminative approaches are often superior to generative approaches when the actual class conditional densities differ from their assumed forms. On the other hand, incorporation of prior knowl- edge is easier for generative approaches because one can construct a generative model based on the domain knowledge. Discriminative approaches are also more prone to overfitting. In the context of clustering under constraints, Shental and Basu can be regarded as generative because they specify a hidden Markov random field to describe how the data are generated. The constraint violation term f(6;C) used by the pro— posed algorithm is discriminative, because it effectively counts the number of vio- lated constraints, which are analogous to the number of misclassified samples. The log-likelihood term £(6; y) in the proposed objective function is generative because it is based on how the data are generated by a finite mixture model. Therefore, the proposed approach is both generative and discriminative, with the tradeoff parameter /\ controlling the relative importance of these two properties. One can think that the discriminative component enables the proposed algorithm to have a higher perfor- mance, whereas the generative component acts as a regularization term to prevent overfitting in the discriminative component. This discussion provides a new perspective in viewing the example in Figure 5.3. Shental and Basu, being generative, failed to recover the two desired clusters because their forms differ significantly from what Shental and Basu assume about a cluster. On the other hand, the discriminative property of the proposed algorithm can locate 244 the desired vertical cluster boundary, which can satisfy the constraints. The discriminative nature of the proposed algorithm is also the reason why the proposed algorithm, using constraints only, can outperform the generative classifier using all the labels. This is surprising at. first, because, after all, constraints carry less information than labels. Incorporating the constraints on only some of the objects therefore should not outperform the case when the labels of all objects are available. However, this is only true when all possible classifiers are considered. When we restrict ourself to the generative classifier that. assumes a Gaussian distribution with common covariance matrix as the class conditional density, it is possible for a discriminative algorithm to outperform the generative classifier if the class conditional densities are non-Gaussians. In fact, for the data sets ethn, Mondrian, script, wdbc, and texture, we observed that the proposed algorithm can have a higher F-score or NMI than that estimated using all the class labels. The difference is more noticeable for script and wdbc. 
Note that for the data set austra, the generative algorithm Shental can also out-perform the classifier trained using all the labels, though the difference is very small and it may be due to the noisy nature of this data set. 5.7 .3 Drawback of the Proposed Approach There are two main drawbacks of the proposed approach. The optimization problem considered, while accurately representing the goal of clustering with constraints, is more difficult. This has several consequences. First, a more sophisticated algorithm (line—search Newton) is needed instead of the simpler EM algorithm. The landscape 245 of the proposed objective function is more “rugged”. So, it it is more likely to get trapped in poor local optima. It also takes more iterations to reach a local optimum. Because we are initializing randomly, this also means that the proposed algorithm is not very stable if we have an insufficient number of random initializations. The second difficulty is the determination of A. (Note that the algorithm Basu has a similar parameter.) In our experiments, we adopted a cross-validation procedure to determine A, which is computationally expensive. Cross-validation may yield a suboptimal A when the number of informative constraints in the validation set is too small, or when too many constraints are erroneous due to the noise in the data. Here, a constraint is informative if it provides “useful” information to the clustering process. So, a must—link constraint between two points close to each other is not very informative because they are likely to be in the same cluster anyway. Another problem is that we may encounter an unfavorable split of the training and validation constraints when the set of available constraints is too small. When this happens, the number of violations for the validation constraints is significantly larger than that of the training constraints. Increasing the value of A cannot reduce the violation of the validation constraints, leading to an optimal constraint strength of zero. When this happens, we should try a different split of the constraints for training and validation. 246 5.7 .4 Some Implementation Details We have. incorporated some heuristics in our optimization algorithm. During the optimization process, a cluster may become almost empty. This is detected when 2, fij/n. falls below a threshold, which is set to 4 x 10‘3/k. The empty cluster is removed, and the largest cluster that can result in the increase in the .7 value is split to maintain the same number of clusters. If no such cluster exists, the one that can lead to the smallest decrease in j is split. Another heuristic is that we lower- bound aj by 10-8, no matter what the values of {flj} are. This is used to improve the numerical stability of the proposed algorithm. The ozj are then renormalized to ensure that they sum to one. 5.8 Summary We have presented an algorithm that handles instance-level constraints for model- based clustering. The key assumption in our approach is that the cluster labels are determined based on the feature vectors and the cluster parameters; the set of con- straints has no influence here. This contrasts with previous approaches like [231] and [21] which impose prior distribution on the cluster labels directly to reflect the con- straints. This is the fundamental reason for the anomaly described in Section 5.2. The actual clustering is performed by the line-search Newton algorithm under the natural parameterization of the Gaussian distributions. 
The strength of the constraints is determined by a hold—out set of validation constraints. The proposed approach can be extended to handle simultaneously feature extraction and clustering under con— 247 straints. The effectiveness of the proposed approach has been demonstrated on both synthetic data sets and real-world data sets from different. domain. In particular, we notice that the discriminative nature of the proposed algorithm can lead to superior performance when compared with a generative classifier trained using the labels of all the objects. 248 Chapter 6 Summary The primary objective of the work presented in this dissertation is to advance the state-of-the-art in unsupervised learning. Unsupervised learning is challenging be- cause its objective is often ill—defined. Instead of providing yet another new unsuper- vised learning algorithm, we are more interested in studying issues that are generic to different unsupervised learning tasks. This is the motivation behind the study of various topics in this dissertation, including the modification of the batch version of an algorithm to become incremental, the selection of the appropriate data representa- tion (feature selection), and the incorporation of side—information in an unsupervised learning task. 6. 1 Contributions The results in this thesis have contributed to the field of unsupervised learning in several ways, and has led to the publication of two journal articles [163, 164]. Several 249 conference papers [168, 161, 167, 165, 82] have also been published at different stages of the research conducted in this thesis. The incremental ISOMAP algorithm described in Chapter 3 has made the follow- ing contributions: 0 Framework for incremental manifold learning: The proposed incremental ISOMAP algorithm can serve as a general framework for converting a mani- fold learning algorithm to become incremental: the neighborhood graph is first updated, followed by the update of the low-dimensional representation, which is often an incren‘iental eigenvalue problem similar to our case. 0 Solution of the all-pairs shortest path problems: One component in the incre— mental algorithm is to update the all-pairs shortest path distances in view of the change in the neighborhood graph due to the new data points. We have devel- oped a new algorithm that performs such an update efficiently. Our algorithm updates the shortest path distances from multiple source vertices simultane- ously. This contrasts with previous work like [193], where different shortest path trees are updated independently. 0 Improved embedding for new data points: We have derived an improved esti- mate of the inner product between the low-dimensional representation of the new point and the low—dimensional representations of the existing points. This leads to an improved embedding for the new point. 0 Algorithm for incremental eigen-decomposition with increasing matrix size: The problem of updating the low-dimensional representation of the data points 250 is essentially an incremental eigen-decomposition problem. Unlike the previous work [270], however, the size of the matrix we considered is increasing. 0 Vertex contraction to memorize the effect of data points: A vertex contrac- tion procedure that improves the geodesic distance estimate without additional memory is proposed. 
Our work on estimating the feature saliency and the number of clusters simultaneously in Chapter 4 has made the following contributions:

• Feature saliency in unsupervised learning: The problem of feature selection / feature saliency estimation is rarely studied for unsupervised learning. We tackle this problem by introducing a notion of feature saliency, which describes the difference between the distributions of a feature among different clusters. The saliency is estimated efficiently by the EM algorithm.

• Automatic feature saliency estimation and determination of the number of clusters: The algorithm in [81], which utilizes the minimum message length to select the number of clusters automatically, is extended to estimate the feature saliency.

The clustering under constraints algorithm proposed in Chapter 5 has made the following contributions:

• New objective function for clustering under constraints: We have proposed a new objective function for clustering under constraints under the assumption that the constraints do not have any direct influence on the cluster labels. Extensive experimental evaluations reveal that this objective function is superior to the other state-of-the-art algorithms in most cases. It is also easy to extend the proposed objective function to handle group constraints that involve more than two data points.

• Avoidance of counter-intuitive clustering results: The proposed objective function can avoid the pitfall of previous clustering under constraints algorithms like [231] and [21], which are based on hidden Markov random fields. Specifically, clustering solutions that assign to a data point a cluster label different from all of its neighbors are possible for the previous algorithms, a situation avoided by the proposed algorithm.

• Robustness to model mismatch: The proposed objective function for clustering under constraints is a combination of generative and discriminative terms. The discriminative term, which is based on the satisfaction of the constraints, improves the robustness of the proposed algorithm towards mismatch in the cluster shape. This leads to an improvement in the overall performance. The improvement can sometimes be so significant that the proposed algorithm, using constraints only, outperforms a generative supervised classifier trained using all the labels.

• Feature extraction and clustering with constraints: The proposed algorithm has been extended to perform feature extraction and clustering with constraints simultaneously by locating the best low-dimensional subspace, such that the Gaussian clusters formed satisfy the given set of constraints as well as they can. This allows the proposed algorithm to handle data sets with higher dimensionality. The combination of this notion of feature extraction and the kernel trick allows us to extract clusters with general shapes.

• Efficient implementation of the line-search Newton algorithm: The proposed objective function is optimized by the line-search Newton algorithm. The multiplication by the inverse of the Hessian for the case of a Gaussian mixture can be done efficiently with time complexity O(d^3), without forming the O(d^2)-by-O(d^2) Hessian matrix explicitly. Here, d denotes the number of features. A naive approach of inverting the Hessian would require O(d^6) time.

6.2 Future work

The study conducted in this dissertation leads to several interesting new research possibilities.
• Improvement in the efficiency of the incremental ISOMAP algorithm: There are several possibilities for improving the efficiency of the proposed incremental ISOMAP algorithm. Data structures such as kd-trees, ball-trees, and cover-trees [19] can be used to speed up the search for the k nearest neighbors. The update strategy for the geodesic distances and co-ordinates can be more aggressive; we can sacrifice the theoretical convergence property in favor of empirical efficiency. For example, the geodesic distances can be updated approximately using a scheme analogous to the distance-vector protocol in the network routing literature. The co-ordinate update can be made faster if only a subset of the co-ordinates (such as those close to the new point) is updated at each iteration. The co-ordinates of every point would eventually be updated if the new points came from different regions of the manifold.

• Incrementalization of other manifold learning algorithms: The algorithm in Chapter 3 modifies the ISOMAP algorithm to become incremental. We can also modify similar algorithms, such as locally linear embedding or Laplacian eigenmaps, to become incremental.

• Feature dependency in dimensionality reduction and unsupervised learning: The algorithm in Chapter 4 assumes that the features are conditionally independent of each other when the cluster labels are known. This assumption, however, is generally not true in practice. A new algorithm needs to be designed to cope with the situation when features are highly correlated in this setting.

• Feature selection and constraints: The main difficulty of feature selection in clustering is the ill-posed nature of the problem. A possible way to make the problem more well-defined is to introduce instance-level constraints. In Section 5.5, we described an algorithm for performing feature extraction and clustering under constraints simultaneously. One can apply a similar idea and use the constraints to assist in feature selection for clustering.

• More efficient algorithms for clustering with constraints: The use of the line-search Newton algorithm for optimizing the objective function in Chapter 5 is relatively efficient when compared with alternative approaches. Unfortunately, the objective function, which effectively uses the Jensen-Shannon divergence to count the number of violated constraints, is difficult to optimize. It is similar to the direct minimization of the number of classification errors in supervised learning, which is generally perceived as difficult. Often, the number of errors is approximated by some quantity that is easier to optimize, such as the distances of mis-classified points from the separating hyperplane in the case of support vector machines. In the current context, we may want to approximate the number of violated constraints by some quantity that is easier to optimize. A difficulty can arise, however, when both must-link and must-not-link constraints are considered. If the violation of a must-link constraint is approximated by a convex function g(.), the violation of a must-not-link constraint is naturally approximated by -g(.), which is concave. Their combination leads to a function that is neither concave nor convex, which is difficult to optimize. Techniques like DC (difference of convex functions) programming [117] can be adopted for global optimization.

• Number of clusters for clustering with constraints: The algorithm described in Chapter 5 assumes that the number of clusters is known.
It would be desirable if the number of clusters could be estimated automatically from the data, and the presence of constraints should be helpful in this process. In fact, correlation clustering [10] considers must-link and must-not-link constraints only, without any regard to the feature vectors, and it can infer the optimal number of clusters by minimizing the number of constraint violations.

APPENDICES

Appendix A

Details of Incremental ISOMAP

In this appendix, we present the proofs of correctness of the algorithms in Chapter 3 and analyze their time complexity.

A.1 Update of Neighborhood Graph

The procedure to update the neighborhood graph has been described in Section 3.2.1.1, where A, the set of edges to be added, and D, the set of edges to be deleted, are constructed upon insertion of v_{n+1} into the neighborhood graph.

Time Complexity: For each i, the conditions in Equations (3.1) and (3.2) can be checked in constant time. So, the construction of A and D takes O(n) time. The calculation of L_i for all i can be done in O(Σ_i deg(v_i) + |A|), or O(|E| + |A|), time by examining the neighbors of the different vertices. Here, deg(v_i) denotes the degree of v_i. The complexity of the update of the neighborhood graph can therefore be bounded by O(nq), where q is the maximum degree of the vertices in the graph after inserting v_{n+1}. Note that L_i becomes r_i for the updated neighborhood graph.

A.2 Update of Geodesic Distances: Edge Deletion

A.2.1 Finding Vertex Pairs For Update

In this section, we examine how the geodesic distances should be updated upon edge deletion. Consider an edge e(a,b) ∈ D that is to be deleted. If π_ab ≠ a, the shortest path between v_a and v_b does not contain e(a,b). Deletion of e(a,b) does not affect sp(a,b) and hence none of the existing shortest paths are affected. Therefore, we have

Lemma A.1. If π_ab ≠ a, deletion of e(a,b) does not affect any of the existing shortest paths, and therefore no geodesic distance g_ij needs to be updated.

We now consider the case π_ab = a. This implies π_ba = b because the graph is undirected. The next lemma is an easy consequence of this assumption.

Lemma A.2. For any vertex v_i, sp(i,b) passes through v_a iff sp(i,b) contains e(a,b) iff π_ib = a.

Before we proceed further, recall the definitions of T(b) and T(b;a) in Section 3.1: T(b) is the shortest path tree of v_b, where the root node is v_b and sp(b,j) consists of the tree edges from v_b to v_j, and T(b;a) is the subtree of T(b) rooted at v_a. Let R_ab ≡ {i : π_ib = a}. Intuitively, R_ab contains the vertices whose shortest paths to v_b include e(a,b). We shall first construct R_ab, and then "propagate" from R_ab to obtain the geodesic distances that require an update.

Because sp(t,b) passes through the vertices that are the ancestors of v_t in T(b), plus v_t itself, we have

Lemma A.3. R_ab = { vertices in T(b;a) }.

Proof. v_t ∈ T(b;a) ⟺ v_a is an ancestor of v_t in T(b), or v_a = v_t ⟺ sp(t,b) passes through v_a ⟺ π_tb = a (Lemma A.2) ⟺ t ∈ R_ab. □

If v_t is a child of v_u in T(b), v_u is the vertex in sp(b,t) just before v_t. Thus, we have the lemma below.

Lemma A.4. The set of children of v_u in T(b) = {v_t : v_t is a neighbor of v_u and π_bt = u}.

Consequently, we can examine all the neighbors of v_u to find the node's children in T(b) based on the predecessor matrix. Note that the shortest path trees are not stored explicitly; only the predecessor matrix is maintained.
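In code, Lemma A.4 amounts to a one-line filter over the adjacency list. The small illustrative Python sketch below (our own, assuming a dictionary layout pred[b][t] = parent of v_t in T(b)) shows how the children of a node are recovered from the predecessor matrix alone.

    def children_in_T(adj, pred, b, u):
        """Children of vertex u in the shortest path tree T(b) (Lemma A.4):
        the neighbors t of u whose stored predecessor on sp(b, t) is u."""
        return [t for t in adj[u] if pred[b][t] == u]

    # Tiny example: path graph 0-1-2-3 with unit weights, rooted at b = 0.
    adj  = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    pred = {0: {0: None, 1: 0, 2: 1, 3: 2}}    # pred[b][t] = parent of t in T(b)
    print(children_in_T(adj, pred, b=0, u=1))  # -> [2]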
The first nine lines in Algorithm 3.1 perform a tree traversal that extracts all the vertices in T(b;a) to form R_ab, using Lemma A.4 to find all the children of a node in the tree.

Time Complexity: At any time, the queue Q contains vertices in the subtree T(b;a) that have been examined. The while-loop is executed |R_ab| times, because a new vertex is added to R_ab in each iteration. The inner for-loop is executed a total of Σ_{t ∈ R_ab} deg(v_t) times, which can be bounded loosely by q|R_ab|. Therefore, a loose bound for the first nine lines in Algorithm 3.1 is O(q|R_ab|).

A.2.2 Propagation Step

Define F_(a,b) ≡ {(i,j) : sp(i,j) contains e(a,b)}. Here, (a,b) denotes the unordered pair of a and b. So, F_(a,b) is indexed by the unordered pair (a,b), and its elements are also unordered pairs. Intuitively, F_(a,b) contains the vertex pairs whose geodesic distances need to be recomputed when the edge e(a,b) is deleted. Starting from v_b for each of the vertices in R_ab, we construct F_(a,b) by a search.

Lemma A.5. If (i,j) ∈ F_(a,b), either i or j is in R_ab.

Proof. (i,j) ∈ F_(a,b) is equivalent to sp(i,j) containing e(a,b). The shortest path sp(i,j) can be written either as sp(i,j) = v_i ⇝ v_a → v_b ⇝ v_j, or sp(i,j) = v_i ⇝ v_b → v_a ⇝ v_j, where ⇝ denotes a path between the two vertices. Because a subpath of a shortest path is also a shortest path, either sp(i,b) or sp(j,b) passes through v_a. By Lemma A.2, either π_ib = a or π_jb = a. Hence either i or j is in R_ab. □

Lemma A.6. F_(a,b) = ⋃_{u ∈ R_ab} {(u,t) : v_t is in T(u;b)}.

Proof. By Lemma A.5, (u,t) ∈ F_(a,b) implies either u or t is in R_ab. Without loss of generality, suppose u ∈ R_ab. So, sp(u,t) can be written as v_u ⇝ v_a → v_b ⇝ v_t. Thus v_t must be in T(u;b). On the other hand, for any vertex v_t in the subtree T(u;b), sp(u,t) goes through v_b. Since sp(u,b) goes through v_a (because u ∈ R_ab), sp(u,t) must also go through v_a and hence use e(a,b). □

[Figure A.1: Example of T(u;b) and T(a;b). All the nodes and edges shown constitute T(a;b), whereas only the part of the subtree above the line constitutes T(u;b). This example illustrates the relationship of T(u;b) and T(a;b) as proved in Lemma A.7.]

Direct application of the above lemma to compute F_(a,b) requires the construction of T(u;b) for different u. This is not necessary, however, because for all u ∈ R_ab, T(u;b) must be a part of T(a;b) in the sense exemplified in Figure A.1. This relationship aids the construction of T(u;b) in Algorithm 3.1 (the variable T′), because we only need to expand the vertices in T(a;b) that are also in T(u;b).

Lemma A.7. Consider u ∈ R_ab. The subtree T(u;b) is non-empty; let v_t be any vertex in this subtree, and let v_s be a child of v_t in T(u;b), if any. We have the following:
1. v_t is in the subtree T(a;b).
2. v_s is a child of v_t in the subtree T(a;b).
3. π_us = π_as = t.

Proof. The subtree T(u;b) is not empty because v_b is in this subtree. For any v_t in this subtree, sp(u,t) passes through v_b. Hence sp(u,b) is a subpath of sp(u,t). Because u ∈ R_ab, sp(u,b) passes through v_a. So, we can write sp(u,t) as v_u ⇝ v_a → v_b ⇝ v_t. So, sp(a,t) contains v_b, and this implies that v_t is in T(a;b). Now, if v_s is a child of v_t in T(u;b), sp(u,s) can be written as v_u ⇝ v_a → v_b ⇝ v_t → v_s. So, π_us = t. Because any subpath of a shortest path is also a shortest path, sp(a,s) is simply v_a → v_b ⇝ v_t → v_s, which implies that v_s is also a child of v_t in T(a;b), and π_as = t. Therefore, we have π_us = π_as = t. □
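Lemmas A.3 to A.6 can be checked numerically on small graphs. The following illustrative Python/SciPy sketch (ours, not the thesis code) builds the predecessor matrix with scipy.sparse.csgraph.shortest_path, extracts R_ab and the subtrees T(u;b) by the traversal of Lemma A.4, and verifies that the union in Lemma A.6 coincides with a brute-force enumeration of the pairs whose shortest paths use e(a,b).

    import numpy as np
    from collections import deque
    from scipy.sparse.csgraph import shortest_path

    def subtree(adj, pred, root_src, root):
        """Vertices of T(root_src; root): subtree of T(root_src) rooted at `root`."""
        out, queue = {root}, deque([root])
        while queue:
            u = queue.popleft()
            for t in adj[u]:
                if t not in out and pred[root_src, t] == u:   # Lemma A.4
                    out.add(t)
                    queue.append(t)
        return out

    def uses_edge(pred, i, j, a, b):
        """Does the stored shortest path from i to j traverse the edge e(a, b)?"""
        path, v = [j], j
        while v != i:
            v = pred[i, v]
            path.append(v)
        return frozenset((a, b)) in {frozenset(e) for e in zip(path, path[1:])}

    # Small weighted graph (symmetric adjacency matrix); 0 means "no edge".
    W = np.array([[0, 1, 0, 0, 10],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [10, 0, 0, 1, 0]], dtype=float)
    adj = {u: np.nonzero(W[u])[0].tolist() for u in range(len(W))}
    _, pred = shortest_path(W, directed=False, return_predecessors=True)

    a, b = 1, 2                                  # the edge to be "deleted"
    assert pred[a, b] == a                       # the case covered by Lemma A.3
    R_ab = subtree(adj, pred, b, a)              # Lemma A.3: vertices of T(b; a)
    F_union = {frozenset((u, t)) for u in R_ab for t in subtree(adj, pred, u, b)}
    F_brute = {frozenset((i, j)) for i in range(len(W))
               for j in range(i + 1, len(W)) if uses_edge(pred, i, j, a, b)}
    print(F_union == F_brute)                    # Lemma A.6 holds on this example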
Let F be the set of unordered pairs (i,j) such that a new shortest path from v_i to v_j is needed when the edges in D are removed. So, F = ⋃_{e(a,b) ∈ D} F_(a,b). For each e(a,b) ∈ D, the set R_ab constructed in the first nine lines of Algorithm 3.1 is used to construct F_(a,b) from line 11 until the end of Algorithm 3.1. At each iteration of the while-loop starting at line 15, the subtree T(a;b) is traversed, using the condition π_us = π_as to check whether v_s is in T(u;b) or not. The subtree T(a;b) is expanded only when necessary, using the variable T′.

Time Complexity: If we ignore the time to construct T′, the complexity of the construction of F is proportional to the number of vertices examined. If the maximum degree of T′ is q′, this is bounded by O(q′|F|). Note that q′ ≤ q, where q is the maximum degree of the vertices in the neighborhood graph. The time to expand T′ is proportional to the number of vertices actually expanded plus the number of edges incident on those vertices. This is bounded by q times the size of the tree, and the size of the tree is at most O(|F_(a,b)|). Usually, the time is much less, because different u in R_ab can reuse the same T′. The time complexity to construct F_(a,b) can thus be bounded by O(q|F_(a,b)|) in the worst case. The overall time complexity to construct F, which is the union of F_(a,b) for all e(a,b) ∈ D, is O(q|F|), assuming the number of duplicate pairs in F_(a,b) for different (a,b) is O(1). Empirically, there are at most several such pairs; most of the time, there is no duplicate pair at all.

A.2.3 Performing The Update

Let G′ = (V, E \ D) be the graph after deleting the edges in D. Let B be an auxiliary undirected graph with the same vertices as G, but with edges based on F. In other words, there is an edge between v_i and v_j in the graph B if and only if (i,j) is in F. Because F contains all the vertex pairs whose geodesic distances need to be updated, an edge in B corresponds to a geodesic distance value that needs to be revised. To update the geodesic distances, we first pick a v_u in B with at least one edge incident on it. Define C(u) = {i : e(u,i) is an edge of B}. So, the geodesic distance g_ui needs to be updated if and only if i ∈ C(u). These geodesic distances are updated by the modified Dijkstra's algorithm (Algorithm 3.2), with v_u as the source vertex and C(u) as the set of "unprocessed vertices", i.e., the set of vertices whose shortest paths from v_u are invalid. Recall that the basic idea of Dijkstra's algorithm is that, starting with an empty set of "processed vertices" (vertices whose shortest paths have been found), different vertices are added one by one to this set in an ascending order of estimated shortest path distances. The ascending order guarantees the optimality of the shortest paths. Algorithm 3.2 does something similar, except that the set of "processed vertices" begins with V \ C(u) instead of the empty set. The first for-loop estimates the shortest path distances for j ∈ C(u) if sp(u,j) is "one edge away" from the processed vertices, i.e., sp(u,j) can be written as v_u ⇝ v_a → v_j with a ∈ V \ C(u). In the while-loop, the vertex v_k (k ∈ C(u)) with the smallest estimated shortest path distance is examined and transferred into the set of processed vertices. The estimates of the shortest path distances between v_u and the vertices adjacent to v_k are relaxed (updated) accordingly. This repeats until C(u) becomes empty, i.e., all vertices have been processed.
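A minimal sketch of this modified Dijkstra step is given below (Python, written from the description above rather than copied from Algorithm 3.2). Here dist plays the role of the geodesic-distance row g_{u,·}, C is the set of unprocessed vertices C(u), and adj[v] lists (neighbor, edge weight) pairs.

    import heapq
    import math

    def repair_distances_from(u, C, dist, adj):
        """Recompute dist[t] (= g_{u,t}) for t in C, treating the vertices
        outside C as already processed (their dist values are assumed valid).
        `adj[v]` is a list of (neighbor, weight) pairs; dist is updated in place."""
        C = set(C)
        # Initial estimates: paths that are one edge away from a processed vertex.
        est = {t: math.inf for t in C}
        for t in C:
            for v, w in adj[t]:
                if v not in C:                    # v is processed, dist[v] is valid
                    est[t] = min(est[t], dist[v] + w)
        heap = [(d, t) for t, d in est.items()]
        heapq.heapify(heap)
        while C:
            d, t = heapq.heappop(heap)
            if t not in C or d > est[t]:
                continue                          # stale heap entry
            C.remove(t)
            dist[t] = d
            for v, w in adj[t]:                   # relax the edges out of t
                if v in C and d + w < est[v]:
                    est[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return dist

The only difference from the textbook algorithm is the initialization: the processed set starts as the complement of C rather than the empty set, so only |C| extract-min operations are required, which is what makes the per-source cost depend on |C(u)| rather than on n.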
When the modified Dijkstra's algorithm with v_u as the source vertex finishes, all geodesic distances involving v_u have been updated. Since an edge in B corresponds to a geodesic distance estimate requiring update, we should remove all edges incident on v_u in B. We then select another vertex v_u′ with at least one edge incident on it in B, and call the modified Dijkstra's algorithm again, but with v_u′ as the source vertex. This repeats until B becomes an empty graph.

Time Complexity: The for-loop in Algorithm 3.2 takes at most O(q|C(u)|) time. In the while-loop, there are |C(u)| ExtractMin operations, and the number of DecreaseKey operations depends on how many edges there are among the vertices in C(u); an upper bound for this is q|C(u)|. By using a Fibonacci heap, ExtractMin can be done in O(log|C(u)|) time while DecreaseKey can be done in O(1) time, on average. Thus the complexity of Algorithm 3.2 is O(|C(u)| log|C(u)| + q|C(u)|). If a binary heap is used instead, the complexity is O(q|C(u)| log|C(u)|).

A.2.4 Order for Performing Update

How do we select v_u in B to be eliminated and to act as the source vertex for the modified Dijkstra's algorithm (Algorithm 3.2)? We seek an elimination order that minimizes the time complexity of all the updates. Let f_i be the degree of v_{κ_i}, the i-th vertex removed from B, so that f_i = |C(κ_i)|. The overall time for running the modified Dijkstra's algorithm (with a Fibonacci heap) for all the vertices in B with at least one incident edge is O(T), with

T = Σ_i (f_i log f_i + q f_i).    (A.1)

Because Σ_i f_i is a constant (twice the number of edges in B) with respect to different elimination orders, the vertices should be eliminated in an order that minimizes Σ_i f_i log f_i. If a binary heap is used, the time complexity is O(T*), with

T* = q Σ_i f_i log f_i.    (A.2)

In both cases, we should minimize Σ_i f_i log f_i. Finding an order that minimizes this sum is unfortunately difficult. Since the sum is dominated by the largest f_i, we instead minimize max_i f_i. This minimization is achieved by a greedy algorithm that removes the vertex in B with the smallest degree. The correctness of this greedy approach can be seen from the following argument. Suppose the greedy algorithm is wrong. Then at some point the algorithm makes a mistake, i.e., the removal of v_t instead of v_u leads to an increase of max_i f_i. This can only happen when deg(v_t) > deg(v_u), and we get a contradiction, since the algorithm always removes the vertex with the smallest degree.

Because the degree of each vertex is an integer, an array of linked lists can be used to implement the greedy search (Algorithm 3.3) efficiently without an explicit search. At any instant, the linked list l[i] is empty for i < pos, so the vertices in l[pos] have the smallest degree in B. The for-loop in lines 10 to 18 removes all the edges incident on v_j in B by reducing the degree of all vertices adjacent to v_j by one, and moving pos back by one if necessary.

Time Complexity: The first for-loop in Algorithm 3.3 takes O(|F|) time, because |F| is the number of edges in B. In the second for-loop, pos is incremented at most 2n times, because it can move backwards at most n steps. The inner for-loop is executed altogether O(|F|) times. Therefore, the overall time complexity of Algorithm 3.3 (excluding the time for executing the modified Dijkstra's algorithm) is O(|F|).
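The array-of-lists idea can be prototyped directly with an array of buckets indexed by degree. The sketch below (Python, illustrative, with sets in place of linked lists) yields the vertices of B in the greedy smallest-degree-first order together with their degrees f_i at removal time.

    def min_degree_elimination(edges, n):
        """Greedy smallest-degree-first elimination order for the auxiliary graph B.

        `edges` is an iterable of vertex pairs over 0..n-1.  Yields (vertex, degree
        at removal time); the yielded degrees are the f_i of Equation (A.1).
        Buckets of vertices indexed by degree play the role of the linked lists l[i].
        """
        adj = [set() for _ in range(n)]
        for i, j in edges:
            adj[i].add(j)
            adj[j].add(i)
        deg = [len(a) for a in adj]
        buckets = [set() for _ in range(n + 1)]   # buckets[d] = vertices of degree d
        for v in range(n):
            if deg[v] > 0:
                buckets[deg[v]].add(v)
        active = sum(1 for d in deg if d > 0)
        pos = 1                                   # analogue of the pointer `pos`
        while active:
            while not buckets[pos]:               # advance to first non-empty bucket
                pos += 1
            v = buckets[pos].pop()
            f_v, deg[v] = deg[v], 0               # remove v from B
            active -= 1
            yield v, f_v
            for u in adj[v]:                      # delete the edges incident on v
                if deg[u] == 0:                   # u already removed (or isolated)
                    continue
                buckets[deg[u]].remove(u)
                deg[u] -= 1
                if deg[u] == 0:
                    active -= 1
                else:
                    buckets[deg[u]].add(u)
                    pos = min(pos, deg[u])        # "move pos back" when necessary

    if __name__ == "__main__":
        E = [(0, 1), (0, 2), (0, 3), (2, 3), (3, 4)]
        for v, f in min_degree_elimination(E, 5):
            print("remove vertex", v, "with degree", f)

Vertices whose degree drops to zero while other vertices are being processed never appear in the output, which mirrors the fact that they need not serve as a source for the modified Dijkstra's algorithm.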
A.3 Update of Geodesic Distances: Edge Insertion

In Equation (3.3), we describe how the geodesic distance between the new vertex v_{n+1} and v_i is computed, after updating the geodesic distances in view of the edge deletion. Since all the edges in A, the set of edges inserted into the neighborhood graph, are incident on v_{n+1}, any improvement in an existing shortest path must involve v_{n+1}. Let L ≡ {(i,j) : g_{i,n+1} + w_{n+1,j} < g_ij}. Intuitively, L is the set of unordered pairs, adjacent to v_{n+1}, with improved shortest paths due to the insertion of v_{n+1}. For each (a,b) ∈ L, Algorithm 3.4 is used to propagate the effect of the improvement in sp(a,b) to the vertices near v_a and v_b. First, lines 1 to 9 construct a set S_ab that is similar to R_ab in Algorithm 3.1; it consists of the vertices whose shortest paths to v_b have been improved. For each vertex v_i in S_ab, lines 11 to 22 search for other shortest paths starting from v_i that can be improved, and update the geodesic distances according to the improved shortest paths just discovered. The idea is analogous to the construction of F_(a,b) in Algorithm 3.1, except that now sp(a,b) is improved instead of destroyed as in the case of F_(a,b). The correctness of Algorithm 3.4 can be seen by the following argument. Without loss of generality, the improved shortest path between v_i and v_j can be written as v_i ⇝ v_a → v_{n+1} → v_b ⇝ v_j. So, v_i is a vertex in T(n+1;a), and v_j must be in both T(i;b) and T(n+1;b). If v_l is a child of v_j in T(i;b), v_l is also a child of v_j in T(n+1;b), and (g_{i,n+1} + g_{n+1,l}) < g_il should be satisfied. In other words, the relationship between T(i;b) and T(n+1;b) here is similar to the relationship between T(u;b) and T(a;b) depicted in Figure A.1. The proof of these properties is similar to the proof given for the relationship between F_(a,b) and R_ab, and hence is not repeated.

Time Complexity: The set L can be constructed in O(|A|^2) time. Let H = {(i,j) : a better shortest path appears between v_i and v_j because of v_{n+1}}. By an argument similar to the complexity of constructing F, the complexity of finding H and revising the corresponding geodesic distances in Algorithm 3.4 is O(q|H| + |A|^2).

A.4 Geodesic Distance Update: Overall Time Complexity

Updating the neighborhood graph takes O(nq) time. The construction of R_ab and F_(a,b) (Algorithm 3.1) takes O(q|R_ab|) and O(q|F_(a,b)|) time, respectively. Since |F_(a,b)| ≥ |R_ab|, these steps take O(q|F_(a,b)|) time together. As a result, F can be constructed in O(q|F|) time. The time to run the modified Dijkstra's algorithm (Algorithm 3.2) is difficult to estimate. Let μ be the number of vertices in B with at least one edge incident on it, and let ν ≡ max_i f_i, with f_i defined in Appendix A.2.4. In the highly unlikely worst case, ν can be as large as μ. The time of running Algorithm 3.2 can be rewritten as O(μν log ν + qμν). The typical value of ν can be estimated using concepts from random graph theory. It is easy to see that

ν = max{l : B has an l-regular sub-graph},    (A.3)

where an l-regular sub-graph is defined as a subgraph in which every vertex has degree l. Unfortunately, we were unable to locate an exact result on the behavior of the largest l-regular sub-graph in random graph theory. On the other hand, the largest l-complete sub-graph, i.e., a clique of size l, of a random graph has been well studied.
The clique number (the size of the largest clique in a graph) of almost ev- ery graph is “close” to 0(log u) [200], assuming the average degree of vertices is a constant and u is the number of vertices in the graph. Based on our empirical observations in the experiments, we conjecture that, on average, I/ is also of the or- der O(log ,u). 'With this conjecture, the total time to run the Dijkstra’s algorithm can be bounded by ()(uloguloglogu + qul). Finally, the time complexity of al- gorithm 3.4 is 0(qIHI + IAI2). So, the overall time complexity can be written as 0(qIFl + qul + u logu log log u + [A|2). Note that u s 2|F|. In practice, the first two terms dominate, and the complexity can be written as O(q(|F I + |H|)). 270 Appendix B Calculations for Clustering with Constraints The purpose of this appendix is to derive the results in Chapter 5, some of which are relatively involved. B.1 First Order Information In this appendix, we shall derive the gradient of the objective function ,7. The differential of a variable or a function a: will be denoted by “d :17”. We shall first compute the differential of J, followed by the conversion of the differentials into the derivatives with respect to the cluster parameters. 271 B.1.1 Computing the Differential The differential of the log-likelihood can be derived as follows: k k 1 .. dl .. (l .C( 6; 3?): Ed (10g Zexp( IOgCIijl> = ZZCXP( 08Q2J)( qutyl 3'21 i=lj=1 Zj'eXpa quii’ ’) n k (3.1) = 22% (d logql-j). i=1j=1 Here, rij = exp(log(1,]~)/ 2]»: exp(log (lift) = (“j/2ft gift is the usual posterior probability for the j-th cluster given the point y,. The annealing version of the log-likelihood, which is needed if we want to apply a deterministic-annealing type of pro) j l : 72(52'3'10832'1" Sij ZS.” 10g Sil) (d lOng’j) j l + 0, + _ + +) (12tthOothj-— Zlogthj( (d thj) .7 = T Zest/132 ahisij (d 105 (In - Z Sit (‘1 10% (1:1 l) j i l :2 T Z ahisij (10g t3“). -— 2 Sit log flit) (d log (1,-j) ij 1 (1 Z tgj log tgj = r: bhz‘Sij (log tgj — :8” log thl) (d log qij) j ij t The differential for the Jensen-Shannon divergence term is then given by d DJS (h): ZahZZsU locrsij— Zia; 100% = Tzahisij (log 8” — :81] log Sit) (d logqij) ij 1 — T 2 arms (10517:,- - Z 8a 105 till) (d log (lij) ij 1 = T Z “hisij (logs t—ij - 28 9,1 log ff 8&1) (d log (fij) ij hj [ hl dD — — —erh,-s,-j (100— — szlloa—“)(108(Iijld thl 274 The differential of the loss functions of constraint violation can thus be written as m+ m— (1f(6;C) = d (3121 Agojsuz) + Z A;D;S(h)) = —TZ(Z A, (zmsiJ- (loggi—j W: ,1 lorr Ell) 2] h J 1 hl —;A;bh:8ij(10:j—_§;52110”—))—:d logqul thl = JZ(:Z+:},)‘ aIiiSiJIO iJ gthj “7%: Ah b,,,-s,-,1 (B-5) gtIJ’ _3iJZEAh ”his 8,1100:- l+ l h: 1 thl + “v 28:5 b,,,-s 5,,1og— till ((1 longJ) l h=1 —TZ (u’ij - SiJZU’il) (d 103 (11'3“) iJ’ t where we define 771+ . + . + u’ij = Z “\h a,,, ‘SiJ lOg Sij - Sij Z All ah, log thj h=1 _ m— ~WZ Ah bhi siJ- loor gsiJ- + siJ 2 Ah bh, log thj h=1 h: 1 (B6) m— : Z, A+ a,,,-— ZA, bk, 3,,- log 3,,- Izzl m— 771+ + . —s,—,~ ZAha,,,-1ogth.— ;Ahbl,,loghtj h=1 275 It is interesting to note that n k 2: Z z—ljzl n k m— :2 (::AEL a;us,leogs,J--— ZAJjth-SJ-J-logsJ-J i:lj=1 h=1 m+ m — Aza'hisij log {7;}- + Z AgbhiS-ij log :Ifj) (8'7) h=1 h=1 =23A 22% log: — -2: A 232% log— h=1i=1j=l thj — i=1j=1thj m+ m __ +— + — — ._ . _. Ah 12,502) — ZAh 12,502.) —f(6,C) [2:1 [2:1 Therefore, summing all 21..r,-J- provides a way to compute the loss function for constraint violation. 
We are now ready to write down the differential of j: n k I; (I J = ZZ<flj — T(wij —- Sij 21011)) (d log (Iij) (13.8) [:1 i=1j=1 B.1.2 Gradient Computation Since the only differentials in Equation (8.8) are (d log qu), the gradient of j can be obtained by converting these differentials into derivatives. Recall that ql-J- : (VJ-p(yz- I0). So, ______1 a J. :1 x21 Blogaj Ooqll (J ), 276 where I () is the indicator function, and is one if the argument is true and zero otherwise. To enforce the restriction that 01- > 0 and Zj aj = 1, we introduce new variables (33' and express 03' in terms of {3}}: exp(,BJ-) 013' = k I, . 2.1/:1 €Xp(,t3jl) We then have 0 ('9 . . exptfil) T—r—loga'zr— fl—log ex [j- =1 =l — ‘ 0W! J (Ml L] g p( 3’) (J ) Zjl GXPWJ-I) =IU=0-0z k , a 0100‘ q l 0100' (1771 , .._—7—10g_: bl {0 = E [[2711 17112 —O (Adj (Ill "1:1 810g a," a’flj m ( ) ( ( J) J) =Io=n—aj If p(yz-Wj) falls into the exponential family (Section 5.1.1), and 03- is the natural parameter, the derivative of log qt- j with respect to 61 can be written as 8 (9 aiqu=nr=00mo-5Ema0. chm Note that Qb(y.,-) -— Egg/4(6)!) is zero when the sufficient statistics of the observed data (represented by ¢(y,j)) equal to its expected value (represented by 5%A(91)). In this case, the convexity of A(Ol) guarantees that the log-likelihood is maximized. Before going into the special case of the Gaussian distribution, we want to note 277 that for any number cij, we have 0 . E“ Cij'.0_3l 1000(1ij: E_:Cij(1(l:])_al):§:Cil—alzczj i ij 2.7 2] (‘9 a , (9 %:Cij_‘66110g0(1ij: ;L~11567110gq.i1 = 21:0“ (((d)/i)“ 676—114(01)) , 8 = Z Cil‘pb’i) — 551/1091) :02! i i The gradient of J can be computed by substituting cij = fij — T(wij — sij 2L1 wig). B.1.3 Derivative for Gaussian distribution Consider the special case that p(yz-IGZ) is a Gaussian distribution. Based on Equa- tion (5.6), we can see that the natural parameters are T1 and V1, the sufficient statistics consist of yz- and —%yiy;-F, and the log-cumulant function A(61) is given by Equation (5.7). In this case, we have Buljz :0 CzlYi" “1:021 (B'll) 1 T l T 1 5%..7— — _§ ZCiIYiyi + (EHH‘I + 52!) 262'! (B-12) 2' 2' Note that the above computation implicitly assumes that Tl is symmetric. To ex- plicitly enforce the constraints that T1 is symmetric and positive definite, we can re-pararneterize by its Cholesky decomposition: r1 = F,F,T, (13.13) 278 Note, however, with this set. of parameters, the density is no longer in its natural form. The gradient with respect to V1 remains unchanged, and it is not hard to Show that 0 . T T 517110;)(113‘ = 1(J = 1) —Yiyi + mm + 231 Fl- (B-14) O—FZJ — Z 011Y1ygF1 + (#1H1T+ 21) F1 2 Cu (315) 1 1 Alternatively, the Gaussian distribution can be parameterized by the mean #3- and the precision matrix Tj as in Equation (5.5). Because (3 ——lU,--=I"=lT -— B.16 dill ()0 (1U (J ) [(3% M) ( ) 3 . 1 1 T 69—1? 10% (Iij = 10 =1) (2‘21 — g(yz' - #1)(yz' - #1) ) (B-17) (‘3 , T 513:! 10% <11} = 1(J =1) (St-(Y1 - mm - H1) )an (318) the corresponding gradient of ,7 is ———J= T y: — p ) (B19) 0H1 1122:0111d 1 1 g‘fij—z — 1212621 _ :26 011(3’ MDT (B'ZO) apl =(EIZ Cil ZiCzlmyiT)F1-u1) (B21) 279 B.2 Second Order Information The second-order information (Hessian) of the proposed objective function J can be derived in a manner similar to the first order information. We shall first compute the second-order differentials and then convert them to the Hessian matrix. Let d2 2: denote the second-order differential of the variable :13. 
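The gradients derived above are easy to validate numerically. The sketch below (Python/NumPy with SciPy, arbitrary test values, not the thesis code) checks the constraint-free special case of Equation (B.19), i.e., the gradient of the Gaussian-mixture log-likelihood with respect to the mean of one component, in which the weight c_il reduces to the usual posterior probability, against a central finite difference.

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_lik(X, alphas, mus, covs):
        """Gaussian-mixture log-likelihood: sum_i log sum_j alpha_j N(y_i | mu_j, Sigma_j)."""
        dens = np.column_stack([a * multivariate_normal.pdf(X, m, S)
                                for a, m, S in zip(alphas, mus, covs)])
        return np.log(dens.sum(axis=1)).sum()

    def grad_mu(X, alphas, mus, covs, l):
        """Analytic gradient of the log-likelihood w.r.t. mu_l:
        sum_i r_il * Sigma_l^{-1} (y_i - mu_l), with r_il the posterior of cluster l."""
        dens = np.column_stack([a * multivariate_normal.pdf(X, m, S)
                                for a, m, S in zip(alphas, mus, covs)])
        r = dens / dens.sum(axis=1, keepdims=True)
        return np.linalg.solve(covs[l], (r[:, l][:, None] * (X - mus[l])).sum(axis=0))

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = rng.normal(size=(50, 2))
        alphas = np.array([0.4, 0.6])
        mus = [np.array([0.5, -0.3]), np.array([-1.0, 0.8])]
        covs = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
        g = grad_mu(X, alphas, mus, covs, l=0)
        # Central finite-difference check of the first component of the gradient.
        eps = 1e-6
        mus_p = [mus[0] + np.array([eps, 0.0]), mus[1]]
        mus_m = [mus[0] - np.array([eps, 0.0]), mus[1]]
        fd = (log_lik(X, alphas, mus_p, covs) - log_lik(X, alphas, mus_m, covs)) / (2 * eps)
        print(g[0], fd)   # the two values should agree to several decimal places

The same check can be applied to the full objective once the constraint weights are included, which is a useful safeguard when implementing the more involved second-order expressions of the next section.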
B.2.1 Second-order Differential By taking the differential on both sides of Equation (B4), we have d2 Cammw(9;y,‘1') = 2W 7‘13“) (d 10g (113') + Zfztj (d2 109; (113') - (B22) 13' 13' To compute d 731-, we take the differentials of the logarithm of both sides of Equa- tion (B3): k d log fij = d log (13 — longg [=1 1 k = '7' d log (1,-j — —7€——§— Z: q?! d log q?! (B23) 1’21 “111’ 1:1 k = 7 d 103011 — 27711 d 10% (111 1:1 Because of the identity that d .T = a: d logs: for :1? > O, we have (1 fij = fij d log 7:1} (13.24) 280 Substituting Equation (B24) into Equation (B22), we have ([2 fiannealedw; y, ,7) = :71]. ((12 logqij) + V’Zfij(d log (jij)(d logqij) 1} iJ' -VZZf11d 1030112f1j(11<1g(11j i 1 j = Z 73-1 (d2 10g (111') + v 235,, — 1.1111111 log (1.3-11d log (11). z] ijl (B25) Here, (SJ-l is the delta function, and it is one if j = l and zero otherwise. The definition of 5i] in Equation (5.12) implies the following: (1 82'} = 31'] d log 513 (8.26) k. d log sij = 'r d log (1,-j — Z 8,) d log (1,-1 (B27) (=1 Note the similarity between the definitions of sij and 17,-]. Because for any 2', Zflwij — 5%,-j :1 111,1) 2 0 and Zj d 3,-1- = d Zj sij = 0, Equation (BS) can be rewritten as (1 f(9,C) = —TZZ(U1U -— Sij Zwil)(d log (11]) i J' l = —T: u “U U U Here, 72 (Zj (11);]- -— sij ::le 11.1”)(61-1, — 3iu)(5jv — and) is the (2', i)-th element of the n. by 11 diagonal matrix Em. Let ahju. denote a vector of length n such that its i-th e11- try is given by Tahz‘5ij(5ju — Siul Because d t3]. 2: 2,- “hi (1 5,-3' = Z,- Ohisij d log 3,], we have (NI-:7- 810g S‘i] , = Zahisz‘je— — TZahiSijwju_3zu)1/Jiu: ‘I’uahju (991, 2' 06a . + (31,]. 811311 = T Z ahisijwju - Sin) 2 11.nahju i This means that + T m k A+ 01+ &+ 227: 37: if ‘ 11:21,!“ + :ahJuath ‘1’; h=1j=1th h=1j=1hj T = quuL+Av \Ilf, where we concatenate different ahju to form a n by km+ matrix Au, defined by A“ : [31,1411 32.1,“? ' ’ ° ’am+,1,u’ 21112911” ’ ' ’am+,2,u’ ° ' ‘ ’ alakdt’ ' ' ' ’am‘l’Jcml' Note that An has similar sparsity pattern as the matrix {am}. The diagonal matrix /\+ L+ is of si7e km+ by Am..+ Its diagonal entries are given by—— ,and the ordering of hi these diagonal entries matches the ordering of ahju in Au. By similar reasoning, we 287 have "1+ k + + T )‘h . 8th. J _ + T T Z Z— t+ (Olga) (39v) — 11,71AUL Av ‘I’v h=1j=1hJ m+ k + )‘h 8th J _ + T T $251; (33”) (83:) _ anAuL Avllfl h=1j=1hj The case for tI—fj’ which corresponds to to must-not—link constraints, is similar. So, we define bhju to consist of Tl);,,-.9,-j(6ju —— sin) for different i, and concatenate bhju A- to form B“. L‘ is a diagonal matrix with entries {in Substituting all these into the hj result derived in Equation (B29), we have 0 0 d—6u~ ; UJU 59; log 3,-1- 02 8 T :Tguijd—ifludf): log (111+; —siij,-l)(a ulog 35])(0—9U10g 8,3) 0 a gutt) —t )(— a —t )T —§Zi\% _u'thj)( — +2: Jean M 09v M h: 1 j - ~ 02 log - _ = 701w 2 JLJJV,J————-a:2:“ + w,,E,,vw{ — \IJUA,,L+A{\IJ{ + \IIuBuL 33,"qu 2' ll. _ ‘~. 02 log (1m. + T _ T T — 7611.1) “"211—363— + ‘I'U Euv — AUL Av + BUL Bv ‘I’v 'lt Similarly, we have r) . — V 618211 2 100061? 10g Sij 11,71 (Euv AUL+AU + BU BU ) ‘I’IJ; . ij . I; 288 (9 i) (9210?,qu 071;“le ””2?” 0131thqu 11,n (E1111 '_ AUL+A$ + BULTBS) 1in Let HC denote the “expected” hessian of the complete data log-likelihood due to the constraints, i.e., H0 = b1k-diag(0, T Z alum-1,. ..,7- : w,,,H,-,,.). (13.33) 1' 2' Note that there are no Hessian terms corresponding to the 53- because Zj 1172']- = 0. Let E be a 71k by 711; matrix. We partition it into k. 
by k blocks, such that the (11,v)-th block is EW. Let A be a 11k by km+ matrix and B be a 71k by km- matrix, such that L- a- L _1 we are now ready to state the Hessian term corresponding to the constraints: 771+ m {:33ng )+Z,\;D;S(h) [1:1 2 _HC — AEAT + AAL+ATAT — ABL‘BTAT (3.34) Note that the sum of each of the columns of A is 0, because Em Tron-SUM)” — 3,“) = Z,- Tahisij 21,051,, — Sin) 2 0. Combine Equation (B34) with Equation (8.32), we 289 have the Hessian of the objective function j in matrix form: ._2 ~ ~ .35 = HL — H6 + ADAT — AEAT + AAL+ ATAT — ABL“BTAT W (3%) = H“ + A (D — E + AL+AT — BL—BT) AT. Here, H56 = H5 — H6 is the combined expected Hessian. B.2.3 Hessian of the Gaussian Probability Density Function Computation of HLZC requires Hij, which is the result of differentiating log p(yilélj) with respect to the parameter 03- twice. We shall derive the explicit form of Hij when log p(yil6j) is the Gaussian pdf. For simplicity, we shall omit the reference to the object index i and the cluster index j in our derivation. We shall need some notations in matrix calculus [179] in our derivation. Let vec X - denote a. vector of length pq formed by stacking the columns of a p by q matrix X. Let Y be a 'r by .9 matrix. The Kronecker product X®Y is a pr by qs matrix defined by _ - :EllY :13ng . . . (13qu :5le :rggY . . . :rqu X®Y= . . (3%) prY . . . 1:qu The precedence of the operator 69 is defined to be lower than matrix multiplication, i.e., XY 6'9 Z is the same as (XY) 63> Z. The following identity is used frequently in 290 'A this section: vec(XYZ) = (zT <59 X) vec Y. (13.37) Let Kd denote a permutation matrix of size (12 by d2, such that Kd vec Z : vec ZT, (B38) where Z is a d by (1 matrix. Note that K5 2 K51 2 Kd. B.2.3.1 Natural Parameter When the density is parameterized by its natural parameter as in Equation (5.6), we have 0 1 _ U, ___ __ __ -l —T auloopm y 2 (T +r )u 3 , _ 1 T 1 —T 1 —:r T —:r 8T10°p(y) — 2yy + 2T + 2T 1111 T 5% log p(y) = —nyF + F“T + T—TVVTT‘TF, where T-T denotes the transpose of T—l. Therefore, 82 517 10%]’(Y) = — 02 Ovec T 01/ 10gp(y) 291 02 ———— (ri' = T —1 "A _T T —1 —T dvecF 0111001)“) F T @T V+F T ”(8T = (F_1®E)(Id®u+l’®ld) The last term in the Hessian matrix requires more work. We first take the differential with respect to T: 1 d 5%.— logp(y) = —§T'T(d YT)T_T 1 1 — ér'TuuTrTw TT)‘I“‘T — Ear-Tm TT)T_TVVTT'T. By using the identity in Equation (8.37), the Hessian term can be obtained as 82 _ 1 —1 —T _ g (r—1 69 T—TVVTT—T) Kd — é- (T—IVVTT—l ® T—T) Kd Similarly, the Hessian term corresponding to F can be obtained if we note that a (1 5? log p(y) = —ny(d F) — F‘T(d FT)F‘T — T—TuuTF_T(d FT)F"‘T — T‘T(F(d FT) +(dF)FT)T_T1/VTF‘T 02 M 10$ p(y) = -Id 9’9 ny — (F"1 8) F4) Kd - (F"1 <8 MMTF) Kd —— (FTppT 69 FT) Kd -— FTupTF a 2 292 In the special case that T is always symmetric, we can have a simpler Hessian term. This amounts to assuming that TT 2 T and (alT)T 2 ((1T). We have 82 1 1 T 1 T _l 0‘ r: ——2 )l 2 — _2 _ — 2 0(vec r)2 GDP”) 2 g 2 QM“ 2”“ ® 1 2 _5 ((23 + muT) e (23 + ##T) - (n ® (QUIT ® HTl) B.2.3.2 Moment Parameter When moment parameter is used as in Equation (5.6) for the density, we have a 1 T alosp(y)—§(T+T )(y—u) a ., _ 1 T 1 _T filosmyb 2(y (My u) +2T 8 Ef loamy) = —(y - u)(y - MTF + F‘T The second—order terms include a? _ 1 :r W103P()’)——§ (TWLT ) a? 1m( )—1 I I aveCTap oopy —2(d®(y-u)+(y-u)® d) 1 = 5(1d2 + Kd) (Id ® (3' - Ill) = (FT e Idiadg + K.» (I. e (y — u» 293 As in the case of natural parameter, we have a 1 —T T —T +1 a! 
= _— (10 ogp(y) 2r ((lT )T a? 1 _1 _T ___—__1 or = __ ., 0(vec r)? 0" pm 2 (T 3 T )Kd a _ ._ d 55103210!) = —F T <92 . . _ T —1 -—T WIOOMY) - ‘Id ’3) (Y — u)(y — H) — (F ‘59 F )Kd If we assume both T and d T are always symmetric, we have (92 1 W lOgP(Y) = "—"'2 ® 2 2 294 BIBLIOGRAPHY 295 Bibliography ill [2] [3] [4] l5] [6] [7] [8] [9} [10] S. Agarwal, J. Lim, L. Zelnik—Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages II—838—II—845. IEEE Com- puter Society, 2005. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM SIGM OD International Conference on Management of Data, pages 94—- 105, June 1998. P. Arabie and L. Hubert. Cluster analysis in marketing research. In Advanced Methods of Marketing Research, pages 160—189. Oxford: Blackwell, 1994. L.D. Baker and AK. McCallum. Distributional clustering of words for text classification. In Proc. SICIR-98, 215t ACM International Conference on Re- search and Development in Information Retrieval, pages 96—103. ACM Press, New York, US, 1998. CH. Bakir, J. Weston, and B. Schoelkopf. Learning to find pre-images. In Advances in Neural Information Processing Systems 16. MIT Press, 2004. P. Baldi and G.W. Hatfield. DNA Microar'rays and Gene Expression. Cambridge University Press, 2002. P. Baldi and K. Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 2:53—58, 1989. CH. Ball and DJ. Hall. Isodata, a novel method of data analysis and pattern classification. Technical Report NTIS AD 699616, Stanford Research Institute, Stanford, CA, 1965. A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergence. In Proc. SIAM International Conference on Data Mining, pages 234—245, 2004. N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1-3):89—113, 2004. 296 1111 [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] J. Bartram. A letter from John Bartram, MD. to Peter Collinson, F.R.S. concerning a cluster of small teeth observed by him at the root of each fang or great tooth in the head of a rattle-snake, upon dissecting it. Philosophical Transactions (1683—1775), 411358-359, 1739—1741. S. Basu, A. Banerjee, and R.J. Mooney. Active semi-supervision for pairwise constrained clustering. In Proc. the SIAM International Conference on Data Mining, pages 333—344, 2004. S. Basu, A. Banerjee, and R.J. Mooney. Semi-supervised clustering by seed- ing. In Proc. 19th International Conference on Machine Learning, pages 19—26, 2005. S. Basu. M. Bilenko, and R.J. Mooney. A probabilistic framework for semi— supervised clustering. In Proc. 10th ACM SICKDD, International Conference on Knowledge Discovery and Data Mining, pages 59—68, 2004. R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537~550, July 1994. M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373—1396, June 2003. Y. Bengio, J .-F . Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. In Advances in Neural In- formation Processing Systems 16, pages 177—184. MIT Press, 2004. M. Bernstein, V. de Silva, J. Langford, and J. Tenenbaum. 
Graph approxi- mations to geodesics on embedded manifolds. Technical report, Department of Psychology, Stanford University, 2000. A. Beygelzimer, SM. Kakade, and J. Langford. Cover trees for nearest neighbor. Technical report, 2005. http://www. cis.upenn.edu/~skakade/papers/ml/ cover_tree.pdf. SK. Bhatia and J .S. Deogun. Conceptual clustering in information retrieval. IEEE Transactions on Systems. Man and Cybernetics, Part B, 28(3):427—436, June 1998. . M. Bilenko, S. Basu, and R.J. Mooney. Integrating constraints and metric learn- ing in semi-supervised clustering. In Proc. 213t International Conference on Machine Learning, 2004. http://doi.acm.org/10. 1145/1015330. 1015360. J. Bins and B. Draper. Feature selection from huge feature sets. In Proc. 8th IEEE International Conference on Computer Vision, pages 159465, 2001. C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1995. 297 [24] CM. Bishop, M. Svensen, and C.K.I. Williams. GTM: the generative topo- graphic mapping. Neural Computation, 10:215-234, 1998. [25] G. Biswas, R. Dubes, and AK. Jain. Evaluation of projection algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3:702—708, 1981. [26] AL. Blum and P. Langley. Selection of relevant features and examples in ma- chine learning. Artificial Intelligence, 97(1-2):245—271, 1997. [27] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [28] P. Bradley, U. Fayyad, and C. Reina. Clustering very large database using EM mixture models. In Proc. 15th International Conference on Pattern Recognition, pages 76—80, September 2000. [29] M. Brand. Charting a manifold. In Advances in Neural Information Processing Systems 15, pages 961—968. MIT Press, 2003. [30] M. Brand. Fast online‘SVD revisions for lightweight recommender systems. In Proc. SIAM International Conference on Data Mining, 2003. http://www. siam.org/meetings/sdmO3/proceedings/sdm03_04.pdf. [31] M. Brand. Nonlinear dimensionality reduction by kernel eigenmaps. In Proc. 18th International Joint Conference on Artificial Intelligence, pages 547-552, August 2003. [32] A. Brun, H-J. Park, H. Knutsson, and Carl-Fredrik Westin. Coloring of DT- MRI fiber traces using Laplacian eigenmaps. In Proc. the Ninth International Conference on Computer Aided Systems Theory, volume 2809, February 2003. [33] J. Bruske and G. Sommer. Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 20(5):572—575, 1998. [34] C.J.C. Burges. Geometric methods for feature extraction and dimensional re- duction. In L. Rokach and O. Maimon, editors, Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, 2005. [35] F. Camastra and A. Vinciarelli. Estimating the intrinsic dimension of data with a fractal-based method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(10):1404—1407, October 2002. [36] R. Caruana and D. Freitag. Greedy attribute selection. In Proc. 11th Inter- national Conferenee on Machine Learning, pages 2836. Morgan Kaufmann, 1994. 298 [37] G. Celeux, S. Chrétien, F. Forbes, and A. Mkhadri. A component-wise EM algorithm for mixtures. Journal of Computational and Graphical Statistics, 10:699—712, 2001. [38] Y. Chang, C. H11, and M. Matthew Turk. Probabilistic expression analysis on manifolds. In Proc. 
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 520—527, 2004. [39] A. Chaturvedi and J .D. Carroll. A feature-based approach to market segmen- tation via overlapping k-centroids clustering. Journal of Marketing Research, 34(3):370—377, August 1997. [40] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:790—799, 1995. [41] Y. Cheng and GM. Church. Biclustering of expression data. In Proc. of the Eighth International Conference on Intelligent Systems for Molecular Biology, 2000. [42] F. Chung. Spectral Graph Theory. American Mathematical Society, 1997. [43] Forrest E. Clements. Use of cluster analysis with anthropological data. Amer- ican Anthropologist, New Series, Part 1, 56(2):180—199, April 1954. [44] D. Comaniciu. An algorithm for data-driven bandwidth selection. IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 25(2):281—288, 2003. [45] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603—619, 2002. [46] TH. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. MIT Press, 1990. [47] J. Costa and AC. Hero. Manifold learning using euclidean k-nearest neighbor graphs. In Proc. IEEE International Conference on Acoustic Speech and Signal Processing, volume 3, pages 988—991, Montreal, 2004. [48] DR. Cox. Note on grouping. Journal of the American Statistical Association, 52(280):543—547, December 1957. [49] T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman & Hall, 2001. [50] M. Craven, D. DiPasquo, D. Freitag, A.K. McCallum, T.M. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118(1/2):69—113, 2000. [51] M. Dash and H. Liu. Feature selection for clustering. In Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining 2000, 2000. 299 [5‘2] [53] [57] [58] [59] [60] [61] 1621 [63] A. d’Aspremont, L. E1 Ghaoui, MI. Jordan, and G.R.G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In Advances in Neural Information Processing Systems 17. MIT Press, 2005. D. de Ridder, O. Kouropteva, O. Okun, M. Pietikinen, and R.P.W. Duin. Super- vised locally linear embedding. In Proc. Artificial Neural Networks and Neural Information Processing, pages 333—341. Springer, 2003. D. de Ridder, M. Loog, and M.J.T. Reinders. Local fisher embedding. In Proc. 17th International Conference on Pattern Recognition, pages II—295——II— 298, 2004. V. de Silva and J.B. Tenenbaum. Global versus local approaches to nonlin- ear dimensionality reduction. In Advances in Neural Information Processing Systems 15, pages 705—712. MIT Press, 2003. D. DeCoste. Visualizing Mercer kernel feature spaces via kernelized locally- linear embeddings. In Proc. 8th International Conference on Neural Informa- tion Processing, November 2001. Available at http://www. cse . cuhk.edu.hk/ ~apnna/proceedings/iconip2001/index.htm. D. DeMers and G. Cottrell. Non—linear dimensionality reduction. In Advances in Neural Information Processing Systems 5, pages 580—587. Morgan Kaufmann, 1993. AP. Dempster, N.M. Laird, and DB. Rubin. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 3921—38, 1977. M. Dettling and P. Biihlmann. Finding predictive gene groups from microarray data. 
Journal of Multivariate Analysis, 90:106—131, 2004. M. Devaney and A. Ram. Efficient feature selection in conceptual clustering. In Proc. 14th International Conference on Machine Learning, pages 92—97. Morgan Kaufmann, 1997. LS. Dhillon. Co—clustering documents and words using bipartite spectral graph partitioning. In Knowledge Discovery and Data Mining, pages 269—274, 2001. IS. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic fea- ture clustering algorithm for text classification. Journal of Machine Learning Research, 321265—1287, March 2003. LS. Dhillon, S. Mallela, and D.S. Modha. Information-theoretic co-clustering. In Proc. of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 89—98, 2003. 300 [64] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] D. Donoho. For most large underdetermined systems of linear equations, the minimal 1 1-norm near-solution approximates the sparsest near-solution. Tech- nical report, Department of Statistics, Stanford University, 2004. D. Donoho. For most large underdetermined systems of linear equations, the minimal 1 1-norm solution is also the sparsest solution. Technical report, De- partment of Statistics, Stanford University, 2004. D.L. Donoho and C. Grimes. When does isomap recover natural parameteriza- tion of families of articulated images? Technical Report 2002-27, Department of Statistics, Stanford University, August 2002. D.L. Donoho and C. Grimes. Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Technical Report TR-2003-08, Depart- ment of Statistics, Stanford University, 2003. R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, New York, 2nd edition, 2001. J .G. Dy and CE. Brodley. Feature subset selection and order identification for unsupervised learning. In Proc. 17th International Conference on. Machine Learning, pages 247—-254. Morgan Kaufmann, 2000. J .G. Dy and CE. Brodley. Feature selection for unsupervised learning. Journal of Machine Learning Research, 5:845—889, August 2004. J .G. Dy, C.E. Brodley, A. Kak, L.S. Broderick, and A.M. Aisen. Unsupervised feature selection applied to content-based retrieval of lung images. IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 25(3):373—378, March 2003. B. Efron, T. Hastie, I.M. Johnstone, and R. Tibshirani. Least angle regression (with discussion). Annals of Statistics, 32(2):407—499, 2004. R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning. In Advances in Neural Information Processing Systems 14, pages 1025-1032. MIT Press, 2002. A. Elgannnal and CS. Lee. Inferring 3D body pose from silhouettes using activity manifold learning. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 681—688, 2004. A. Elgammal and CS. Lee. Separating style and content on a nonlinear man- ifold. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 478489, 2004. D.M. Endres and J .E. Schindelin. A new metric for probability distributions. IEEE Transactions on. Information Theory, 491858—1860, September 2003. 301 [77] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining, pages 226—231, 1996. [78] M. Farmer and A. Jain. 