This is to certify that the dissertation entitled

Clustering, Dimensionality Reduction, and Side Information

presented by Hiu Chung Law has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science and Engineering.

Major Professor's Signature

Date

MSU is an Affirmative Action/Equal Opportunity Institution

CLUSTERING, DIMENSIONALITY REDUCTION, AND SIDE INFORMATION

By

Hiu Chung Law

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science & Engineering

2006

ABSTRACT

CLUSTERING, DIMENSIONALITY REDUCTION, AND SIDE INFORMATION

By Hiu Chung Law

Recent advances in sensing and storage technology have created many high-volume, high-dimensional data sets in pattern recognition, machine learning, and data mining. Unsupervised learning can provide generic tools for analyzing and summarizing these data sets when there is no well-defined notion of classes. The purpose of this thesis is to study some of the open problems in two main areas of unsupervised learning, namely clustering and (unsupervised) dimensionality reduction. Instance-level constraints on objects, an example of side-information, are also considered to improve the clustering results.

Our first contribution is a modification to the isometric feature mapping (ISOMAP) algorithm when the input data, instead of being all available simultaneously, arrive sequentially from a data stream. ISOMAP is representative of a class of nonlinear dimensionality reduction algorithms that are based on the notion of a manifold. Both the standard ISOMAP and the landmark version of ISOMAP are considered. Experimental results on synthetic data as well as real world images demonstrate that the modified algorithm can maintain an accurate low-dimensional representation of the data in an efficient manner.

We study the problem of feature selection in model-based clustering when the number of clusters is unknown. We propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm for its estimation. By using the minimum message length (MML) model selection criterion, the saliency of irrelevant features is driven towards zero, which corresponds to performing feature selection. The use of MML can also determine the number of clusters automatically by pruning away the weak clusters. The proposed algorithm is validated on both synthetic data and data sets from the UCI machine learning repository.

We have also developed a new algorithm for incorporating instance-level constraints in model-based clustering. Its main idea is that we require the cluster label of an object to be determined only by its feature vector and the cluster parameters. In particular, the constraints should not have any direct influence.
This consideration leads to a new objective function that considers both the fit to the data and the satisfaction of the constraints simultaneously. The line-search Newton algorithm is used to find the cluster parameter vector that optimizes this objective function. This approach is extended to simultaneously perform feature extraction and clustering under constraints. Comparison of the proposed algorithm with competitive algorithms over eighteen data sets from different domains, including text categorization, low-level image segmentation, appearance-based vision, and benchmark data sets from the UCI machine learning repository, shows the superiority of the proposed approach.

© Copyright 2006 by Hiu Chung Law
All Rights Reserved

To My Lord Jesus Christ

ACKNOWLEDGMENTS

For the LORD gives wisdom, and from his mouth come knowledge and understanding. Proverbs 2:6

To our God and Father be glory for ever and ever. Amen. Philippians 4:20

Time flies and I shall leave Michigan State soon, a place I shall cherish long after my graduation. There are so many people who have been so kind and so helpful to me during all these years; all of you have made a mark in my life!

First and foremost, I want to express my greatest gratitude to my thesis supervisor Dr. Anil Jain. He is such a wonderful advisor, mentor, and motivator. Under his guidance, I have learned a lot in different aspects of conducting research, including finding a good research problem, writing a convincing technical paper, and prioritizing different research tasks, to name a few. Of course I shall never forget all the good times when we "Prippies" partied in his awesome house.

I am also very thankful to the rest of my thesis guidance committee, including Dr. John Wong, Dr. Bill Punch, and Dr. Sarat Dass. Their advice and suggestions have been very helpful.

I am also grateful to several other researchers who have mentored me during my various stages as a research student. Dr. Mario Figueiredo from Instituto Superior Técnico in Lisbon is a very nice person and his research insight has been an eye-opener. His intelligent use of the EM algorithm is truly remarkable. I feel fortunate that I have had the chance to work under the supervision of Dr. Paul Viola at Microsoft Research. Interaction with him has not only led to a much deeper appreciation of boosting, but also has sharpened my thoughts on how to formalize a research problem. It is also a great pleasure that I could work under Dr. Tin Kam Ho at Bell Labs. Discussions with her have led to a new perspective towards different tools in pattern recognition and machine learning. The emphasis of Dr. Joachim Buhmann at ETH Zurich on correct modeling has impacted me on how to design a solution for any research problem. I am particularly grateful to Dr. Buhmann for his invitation to spend a month at ETH Zurich, and the hospitality he showed while I was there. The chance to work closely with Tilman Lange is definitely memorable. It is so interesting when two minds from different cultures and research heritages meet and conduct research together. Despite our differences, we have so much in common, and the friendship with Tilman is probably the most valuable "side-product" of the research conducted during my Ph.D. study.

I want to thank Dr. Yunhong Wang for providing the NLPR database that is used in chapters four and five of this thesis. The work in chapter five of this thesis has benefited from the discussions I had during my stay at ETH Zurich.
Special thanks go to ONR (grant nos. N00014-01-1-0266 and N00014-04-1-0183) for its financial support during my Ph.D. studies.

On a more personal side, I am grateful to all the new friends that I have made during the past few years. I am especially grateful for the hospitality shown by the couples Steve & Liana, John (Ho) & Agnes, Ellen & her husband Mr. Yip, and John (Bankson) & Bonnie towards an international student like me. They have set up such a great example for me to imitate wherever I go. All the people in the Cantonese group of the Lansing Chinese Christian Ministry have given a "lab-bound" graduate student like me the possibility of a social life. The support shown by the group, including Simon, Paul, Kok, Tom, Timothy, Anthony, Twinsen, Dennis, Josie, Mitzi, Melody, Karen, Esther, Janni, Bean, Lok, Christal, Janice, and many more, has helped me to survive the tough times. All of my labmates in the PRIP lab, including Arun, Anoop, Umut, Xiaoguang, Hong, Dirk, Miguel, Yi, Unsang, Karthik, Pavan, Meltem, Sasha, Steve, ..., have been so valuable to me. In addition to learning from them professionally, their emotional and social support is something I shall never forget.

Let me reserve my final appreciation for the most important people in my life. Without the nurturing, care, and love from my father and mother, I definitely could not have completed my doctoral degree. They have provided such a wonderful environment for me and my two brothers growing up. It is such a great achievement for my parents that all their three sons have completed at least a master's degree. I am also proud of my two brothers. I miss you, dad, mum, Pong, and Fai!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

1 Introduction
1.1 Data Analysis
1.1.1 Types of Data
1.1.2 Types of Features
1.1.3 Types of Analysis
1.2 Dimensionality Reduction
1.2.1 Prevalence of High Dimensional Data
1.2.2 Advantages of Dimensionality Reduction
1.2.3 Techniques for Dimensionality Reduction
1.3 Data Clustering
1.3.1 A Taxonomy of Clustering
1.3.2 A Brief History of Cluster Analysis
1.3.3 Examining Some Clustering Algorithms
1.4 Side-Information
1.5 Overview

2 A Survey of Nonlinear Dimensionality Reduction Algorithms
2.1 Overview
2.2 Preliminary
2.3 Sammon's Mapping
2.4 Auto-associative Neural Network
2.5 Kernel PCA
2.5.1 Recap of SVM
2.5.2 Kernel PCA
2.6 ISOMAP
2.7 Locally Linear Embedding
2.8 Laplacian Eigenmap
2.9 Global Co-ordinates via Local Co-ordinates
2.9.1 Global Co-ordination
2.9.2 Charting
2.9.3 LLC
2.10 Experiments
2.11 Summary

3 Incremental Nonlinear Dimensionality Reduction by Manifold Learning
3.1 Details of ISOMAP
3.2 Incremental Version of ISOMAP
3.2.1 Incremental ISOMAP: Basic Version
3.2.2 ISOMAP with Landmark Points
3.2.3 Vertex Contraction
3.3 Experiments
3.3.1 Incremental ISOMAP: Basic Version
3.3.2 Experiments on Landmark ISOMAP
3.3.3 Vertex Contraction
3.3.4 Incorporating Variance by Incremental Learning
3.4 Discussion
3.4.1 Variants of the Main Algorithms
3.4.2 Comparison with Out-of-sample Extension
3.4.3 Implementation Details
3.5 Summary

4 Simultaneous Feature Selection and Clustering
4.1 Clustering and Feature Selection
4.2 Related Work
4.3 EM Algorithm for Feature Saliency
4.3.1 Mixture Densities
4.3.2 Feature Saliency
4.3.3 Model Selection
4.3.4 Post-processing of Feature Saliency
4.4 Experimental Results
4.4.1 Synthetic Data
4.4.2 Real Data
4.5 Discussion
4.5.1 Complexity
4.5.2 Relation to Shrinkage Estimate
4.5.3 Limitation of the Proposed Algorithm
4.5.4 Extension to Semi-supervised Learning
4.5.5 A Note on Maximizing the Posterior Probability
4.6 Summary

5 Clustering with Constraints
5.0.1 Related Work
5.0.2 The Hypothesis Space
5.1 Preliminaries
5.1.1 Exponential Family
5.1.2 Instance-level Constraints
5.2 An Illustrative Example
5.2.1 An Explanation of the Anomaly
5.3 Proposed Approach
5.3.1 Loss Function for Constraint Violation
5.4 Optimizing the Objective Function
5.4.1 Unconstrained Optimization Algorithms
5.4.2 Algorithm Details
5.4.3 Specifics for a Mixture of Gaussians
5.5 Feature Extraction and Clustering with Constraints
5.5.1 The Algorithm
5.6 Experiments
5.6.1 Experimental Results on Synthetic Data
5.6.2 Experimental Results on Real World Data
5.6.3 Experiments on Feature Extraction
5.7 Discussion
5.7.1 Time Complexity
5.7.2 Discriminative versus Generative
5.7.3 Drawback of the Proposed Approach
5.7.4 Some Implementation Details
5.8 Summary

6 Summary
6.1 Contributions
6.2 Future Work

APPENDICES

A Details of Incremental ISOMAP
A.1 Update of Neighborhood Graph
A.2 Update of Geodesic Distances: Edge Deletion
A.2.1 Finding Vertex Pairs for Update
A.2.2 Propagation Step
A.2.3 Performing the Update
A.2.4 Order for Performing Update
A.3 Update of Geodesic Distances: Edge Insertion
A.4 Geodesic Distance Update: Overall Time Complexity

B Calculations for Clustering with Constraints
B.1 First Order Information
B.1.1 Computing the Differential
B.1.2 Gradient Computation
B.1.3 Derivative for Gaussian Distribution
B.2 Second Order Information
B.2.1 Second-order Differential
B.2.2 Obtaining the Hessian Matrix
B.2.3 Hessian of the Gaussian Probability Density Function

BIBLIOGRAPHY

LIST OF TABLES

Worldwide generation of original data, if stored digitally, in terabytes (TB) circa 2002
A comparison of nonlinear mapping algorithms
Run time (seconds) for batch and incremental ISOMAP
Run time (seconds) for executing batch and incremental ISOMAP once for different numbers of points (n)
Run time (seconds) for batch and incremental landmark ISOMAP
Run time (seconds) for executing batch and incremental landmark ISOMAP once for different numbers of points (n)
Real world data sets used in the experiment
Results of the algorithm over 20 random data splits and algorithm initializations
Different algorithms for clustering with constraints
Summary of the real world data sets used in the experiments
Performance of different clustering algorithms in the absence of constraints
Performance of clustering under constraints algorithms when the constraint level is 1%
Performance of clustering under constraints algorithms when the constraint level is 2%
Performance of clustering under constraints algorithms when the constraint level is 3%
Performance of clustering under constraints algorithms when the constraint level is 5%
Performance of clustering under constraints algorithms when the constraint level is 10%
Performance of clustering under constraints algorithms when the constraint level is 15%

LIST OF FIGURES

1.1 Comparing feature vector, dissimilarity matrix, and a discrete structure on a set of artificial objects
1.2 An example of dimensionality reduction
1.3 The three well-separated clusters can be easily detected by most clustering algorithms
1.4 Diversity of clusters
1.5 A taxonomy of clustering algorithms
2.1 An example of a manifold
2.2 An example of a geodesic
2.3 Example of an auto-associative neural network
2.4 Example of neighborhood graph and geodesic distance approximation
2.5 Data sets used in the experiments for nonlinear mapping
2.6 Results of nonlinear mapping algorithms on the parabolic data set
2.7 Results of nonlinear mapping algorithms on the swiss roll data set
2.8 Results of nonlinear mapping algorithms on the S-curve data set
2.9 Results of nonlinear mapping algorithms on the face images
3.1 The edge e(a, b) is to be deleted from the neighborhood graph
3.2 Effect of edge insertion
3.3 Snapshots of "Swiss Roll" for incremental ISOMAP
3.4 Approximation error (En) between the co-ordinates estimated by the basic incremental ISOMAP and the basic batch ISOMAP for different numbers of data points (n) for the five data sets
3.5 Evolution of the estimated co-ordinates for Swiss roll to their final values
3.6 Example images from the rendered face image data set
3.7 Example "2" digits from the MNIST database
3.8 Example face images from ethn database
3.9 Classification performance on ethn database for basic ISOMAP
3.10 Snapshots of "Swiss roll" for incremental landmark ISOMAP
3.11 Approximation error between the co-ordinates estimated by the incremental landmark ISOMAP and the batch landmark ISOMAP for different numbers of data points
3.12 Classification performance on ethn database, landmark ISOMAP
3.13 Utility of vertex contraction
3.14 Sum of residue square for 1032 images at 15 rotation angles
4.1 An irrelevant feature makes it difficult for the Gaussian mixture learning algorithm in [81] to recover the two underlying clusters
4.2 The number of clusters is inter-related with the feature subset used
4.3 Deficiency of variance-based method for feature selection
4.4 An example graphical model for the probability model in Equation (4.5)
4.5 An example graphical model showing the mixture density in Equation (4.6)
4.6 An example execution of the proposed algorithm
4.7 Feature saliencies for the synthetic data used in Figure 4.6(a) and the Trunk data set
4.8 A figure showing the clustering result on the image data set
4.9 Image maps of feature saliency for different data sets
5.1 Different classification/clustering settings: supervised, unsupervised, and intermediate
5.2 An example contrasting parametric and non-parametric clustering
5.3 A simple example of clustering under constraints that illustrates the limitation of hidden Markov random field (HMRF) based approaches
5.4 The result of running different clustering under constraints algorithms for the synthetic data set shown in Figure 5.3(a)
5.5 Example face images in the ethnicity classification problem for the data set ethn
5.6 The Mondrian image used for the data set Mondrian
5.7 F-score and NMI for different algorithms for clustering under constraints for the data sets ethn, Mondrian, and ion
5.8 F-score and NMI for different algorithms for clustering under constraints for the data sets script, derm, and vehicle
5.9 F-score and NMI for different algorithms for clustering under constraints for the data set wdbc
5.10 F-score and NMI for different algorithms for clustering under constraints for the data sets UCI-seg, heart and austra
5.11 F-score and NMI for different algorithms for clustering under constraints for the data sets german, Sim-300 and diff-300
5.12 F-score and NMI for different algorithms for clustering under constraints for the data sets sat and digits
5.13 F-score and NMI for different algorithms for clustering under constraints for the data sets mfeat-fou, same-300 and texture
5.14 The result of simultaneously performing feature extraction and clustering with constraints on the data set in Figure 5.3(a)
5.15 An example of learning the subspace and the clusters simultaneously
A.1 Example of T(u; b) and T(a; b)

LIST OF ALGORITHMS

3.1 ConstructFab: F(a,b), the set of vertex pairs whose shortest paths are invalidated when e(a, b) is deleted, is constructed
3.2 ModifiedDijkstra: the geodesic distances from the source vertex u to the set of vertices C(u) are updated
3.3 OptimalOrder: a greedy algorithm to remove the vertex with the smallest degree in the auxiliary graph B
3.4 UpdateInsert: given that v_a -> v_{n+1} -> v_b is a better shortest path between v_a and v_b after the insertion of v_{n+1}, its effect is propagated to other vertices
3.5 InitializeEdgeWeightIncrease for T(a)
3.6 InitializeEdgeWeightDecrease for T(a)
3.7 Rebuild T(a) for those vertices in the priority queue Q that need to be updated
4.1 The unsupervised feature saliency algorithm

Chapter 1

Introduction

The most important characteristic of the information age is the abundance of data. Advances in computer technology, in particular the Internet, have led to what some people call "data explosion": the amount of data available to any person has increased so much that it is more than he or she can handle. According to a recent study conducted at UC Berkeley (http://www.sims.berkeley.edu/research/projects/how-much-info-2003/), the amount of new data stored on paper, film, magnetic, and optical media is estimated to have grown 30% per year between 1999 and 2002. In the year 2002 alone, about 5 exabytes of new data were generated. (One exabyte is about 10^18 bytes, or 1,000,000 terabytes.) Most of the original data are stored in electronic devices like hard disks (Table 1.1). This increase in both the volume and the variety of data calls for advances in methodology to understand, process, and summarize the data. From a more technical point of view, understanding the structure of large data sets arising from the data explosion is of fundamental importance in data mining, pattern recognition, and machine learning. In this thesis, we focus on

Table 1.1: Worldwide production of original data, if stored digitally, in terabytes (TB) circa 2002.
Upper estimates (denoted by "upper") assume the data are digitally scanned, while lower estimates (denoted by "lower") assume the digital contents have been compressed. It is taken from Table 1.2 in http://www.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm. The precise definitions of "paper," "film," "magnetic," and "optical" can be found in the web report.

Storage medium | Upper, 2002 | Lower, 2002 | Upper, 1999-2000 | Lower, 1999-2000 | % change (upper)
Paper          | 1,634       | 327         | 1,200            | 240              | 36%
Film           | 420,254     | 74,202      | 431,690          | 58,209           | -3%
Magnetic       | 5,187,130   | 3,416,230   | 2,779,760        | 2,073,760        | 87%
Optical        | 103         | 51          | 81               | 29               | 28%
Total          | 5,609,121   | 3,416,281   | 3,212,731        | 2,132,238        | 74.5%

two important techniques for data analysis in pattern recognition: dimensionality reduction and clustering. We also investigate how the addition of constraints, an example of side-information, can assist in data clustering.

1.1 Data Analysis

The word "data," as simple as it seems, is not easy to define precisely. We shall adopt a pattern recognition perspective and regard data as the description of a set of objects or patterns that can be processed by a computer. The objects are assumed to have some commonalities, so that the same systematic procedure can be applied to all the objects to generate the description.

1.1.1 Types of Data

Data can be classified into different types. Most often, an object is represented by the results of measurements of its various properties. A measurement result is called "a feature" in pattern recognition or "a variable" in statistics. The concatenation of all the features of a single object forms the feature vector. By arranging the feature vectors of different objects in different rows, we get a pattern matrix (also called "data matrix") of size n by d, where n is the total number of objects and d is the number of features. This representation is very popular because it converts different kinds of objects into a standard representation. If all the features are numerical, an object can be represented as a point in R^d. This enables a number of mathematical tools to be used to analyze the objects.

Alternatively, the similarity or dissimilarity between pairs of objects can be used as the data description. Specifically, a dissimilarity (similarity) matrix of size n by n can be formed for the n objects, where the (i, j)-th entry of the matrix corresponds to a quantitative assessment of how dissimilar (similar) the i-th and the j-th objects are. Dissimilarity representation is useful in applications where domain knowledge suggests a natural comparison function, such as the Hausdorff distance for geometric shapes. Examples of using dissimilarity for classification can be seen in [132], and more recently in [202]. Pattern matrix, on the other hand, can be easier to obtain than dissimilarity matrix. The system designer can simply list all the interesting attributes of the objects to obtain the pattern matrix, while a good dissimilarity measure with respect to the task can be difficult to design.

Similarity/dissimilarity matrix can be regarded as more generic than pattern matrix, because given the feature vectors of a set of objects, a dissimilarity matrix of these objects can be generated by computing the distances among the data points represented by these feature vectors. A similarity matrix can be generated either by subtracting the distances from a pre-specified number, or by exponentiating the negative of the distances.
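As a small illustration of this conversion, the following sketch (not part of the thesis; an assumed example in Python with NumPy, with illustrative function names) builds an n-by-n dissimilarity matrix from an n-by-d pattern matrix using Euclidean distances, and then derives a similarity matrix by exponentiating the negated distances, as described above.

import numpy as np

def dissimilarity_matrix(X):
    # X is an n-by-d pattern matrix; return the n-by-n matrix of
    # pairwise Euclidean distances D[i, j] = ||x_i - x_j||.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(sq_dists, 0.0))   # clip tiny negatives from round-off

def similarity_matrix(D, scale=1.0):
    # One of the two conversions mentioned in the text: S = exp(-D / scale).
    # (The other option is S = c - D for a pre-specified constant c.)
    return np.exp(-D / scale)

if __name__ == "__main__":
    X = np.random.rand(6, 3)                       # 6 objects, 3 features (toy data)
    D = dissimilarity_matrix(X)
    S = similarity_matrix(D, scale=np.median(D[D > 0]))
    print(D.shape, S.shape)                        # both (6, 6)

Note how the pattern matrix of size 6-by-3 turns into 6-by-6 proximity matrices, which is the size comparison (O(nd) versus O(n^2)) discussed next.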
Pattern matrix, on the other hand, can be more flexible because the user can adjust the distance function according to the task. It is easier to incorporate new information by creating additional features than by modifying the similarity/dissimilarity measure. Also, in the common scenario where there are a large number of patterns and a moderate number of features, the size of the pattern matrix, O(nd), is smaller than the size of the similarity/dissimilarity matrix, O(n^2).

A third possibility to represent an object is by discrete structures, such as parse trees, ranked lists, or general graphs. Objects such as chemical structures, web pages with hyperlinks, DNA sequences, computer programs, or customer preferences for certain products have a natural discrete structure representation. Graph-related representations have also been used in various computer vision tasks, such as object recognition [145] and shape-from-shading [217]. Representing structural objects using a vector of attributes can discard important information on the relationship between different parts of the objects. On the other hand, coming up with an appropriate dissimilarity or similarity measure for such objects is often difficult. New algorithms that can handle discrete structures directly have been developed. An example is seen in [154], where a kernel function (diffusion kernel) is defined on different vertices in a graph, leading to improved classification performance for categorical data. Learning with structural data is sometimes called "learning with relational data," and several workshops have been organized on this theme, including a NIPS workshop in 2002 (http://mlg.anu.edu.au/unrealdata/) and several ICML workshops (2004: http://www.cs.umd.edu/projects/srl2004/; 2002: http://demo.cs.brandeis.edu/icml02ws/; 2000: http://www.informatik.uni-freiburg.de/ml/icml2000_workshop.html) on how to learn with structural or relational data.

Figure 1.1: Comparing feature vector, dissimilarity matrix, and a discrete structure on a set of artificial objects. (Left) Extracting different features (color, area, and shape in this case) leads to a pattern matrix. (Center) A dissimilarity measure on the objects can be used to compare different pairs of objects, leading to a dissimilarity matrix. (Right) If the user can provide relational properties on the objects, a discrete structure like a directed graph can be created.

In Figure 1.1, we provide a simple illustration contrasting feature vector, dissimilarity matrix, and discrete structure representations for a set of artificial objects. Each of the representations corresponds to a different view of the objects. In practice, the system designer has to choose the representation that he or she thinks is the most relevant to the task.

In this thesis, we focus on the feature vector representation, though dissimilarity/similarity information in the form of instance-level constraints is also considered.
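Before moving on, the graph-kernel idea mentioned above can be made a little more concrete. The sketch below (not from the thesis; an assumed illustration in Python with NumPy/SciPy) computes a diffusion-style kernel on a small undirected graph by taking the matrix exponential of the negated graph Laplacian, in the spirit of the diffusion kernels of [154]; the entry K[i, j] can then be read as a similarity between vertices i and j.

import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, beta=0.5):
    # A: symmetric 0/1 adjacency matrix of an undirected graph.
    # L = D - A is the graph Laplacian; the kernel is taken as exp(-beta * L).
    L = np.diag(A.sum(axis=1)) - A
    return expm(-beta * L)

if __name__ == "__main__":
    # A 4-vertex path graph: 0 - 1 - 2 - 3
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    K = diffusion_kernel(A, beta=0.5)
    print(np.round(K, 3))   # vertices closer in the graph get larger similarity values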
1.1.2 Types of Features

Even within the feature vector representation, descriptions of an object can be classified into different types. A feature is essentially a measurement, and the "scale of measurement" [244] proposed by Stevens can be used to classify features into different categories. They are:

Nominal: discrete, unordered. Examples: "apple," "orange," and "banana."
Ordinal: discrete, ordered. Examples: "conservative," "moderate," and "liberal."
Interval: continuous, no absolute zero, can be negative. Example: temperature in Fahrenheit.
Ratio: continuous, with absolute zero, positive. Examples: length, weight.

This classification scheme, however, is not perfect [256]. One problem is that a measurement may not fit well into any of the categories listed in this scheme. An example of this is given in chapter 5 of [191], which considers the following types of measurements:

Grades: ordered labels such as Freshman, Sophomore, Junior, Senior.
Ranks: starting from 1, which may be the largest or the smallest.
Counted fractions: bounded by zero and one. This includes percentages, for example.
Counts: non-negative integers.
Amounts: non-negative real numbers.
Balances: unbounded, positive or negative values.

Most people would agree that these six types of data are different, yet all but the third and the last would be "ordinal" in the scheme by Stevens. "Counted fractions" also do not fit well into any of the categories proposed by Stevens. Consideration of different types of features can help us to design appropriate algorithms for handling different types of data arising from different domains.

1.1.3 Types of Analysis

The analysis to be performed on the data can also be classified into different types. It can be exploratory/descriptive, meaning that the investigator does not have a specific goal and only wants to understand the general characteristics or structure of the data. It can be confirmatory/inferential, meaning that the investigator wants to confirm the validity of a hypothesis/model or a set of assumptions using the available data. Many statistical techniques have been proposed to analyze the data, such as analysis of variance (ANOVA), linear regression, canonical correlation analysis (CCA), multidimensional scaling (MDS), factor analysis (FA), or principal component analysis (PCA), to name a few. A useful overview is given in [245].

In pattern recognition, most of the data analysis is concerned with predictive modeling: given some existing data ("training data"), we want to predict the behavior of the unseen data ("testing data"). This is often called "machine learning" or simply "learning."
This scenario happens frequently in applica- tions, since data collection and feature extraction can Often be automated, whereas the labeling of patterns or objects has to be done manually and this is expensive both in time and cost. In Chapter 5 we shall consider another hybrid scenario where instance-level constraints, which can be viewed as a “relaxed” version of labels, are available on some of the data points. 1.2 Dimensionality Reduction Dimensionality reduction deals with the transformation of a high dimensional data set into a low dimensional space, while retaining most of the useful structure in the original data. An example application Of dimensionality reduction with face images can be seen in Figure 1.2. Dimensionality reduction has become increasingly important due to the emergence of many data sets with a large number of features. The underlying assumption for dimensionality reduction is that the data points do Face images 3; , .. j.3 Comatenafing the pixels (31118118101181 feature vectors Dimensionality . ‘ reduction Low dimensional . CED CED CED CED CED representation Figure 1.2: An example of dimensionality reduction. The face images are converted into a high dimensional feature vector by concatenating the pixels. Dimensionality reduction is then used to create a set of more manageable low-dimensional feature vectors, which can then be used as the input to various classifiers. not lie randomly in the high-dimensional space; rather, there is a certain structure in the locations of the data points that can be exploited, and the useful information in high dimensional data can be summarized by a. small number of attributes. 1.2.1 Prevalence of High Dimensional Data High dimensional data have become prevalent in different applications in pattern recognition, machine learning, and data mining. The definition of “high dimensional” has also changed from tens of features to hundreds or even tens of thousands of features [101]. Some recent applications involving high dimensional data sets include: (i) text categorization, the representation Of a text document or a web page using the pop- ular bag-Of-words model can lead to thousands of features [277, 254], where each feature corresponds to the occurrence of a keyword or a key-term in the document; (ii) appearance—based computer vision approaches interpret each pixel as a feature [253, 22]. Images of handwritten digits can be recognized using the pixel values by neural networks [170] or support vector machines [255]. Evert for a small image with size 64 by 64, such representation leads to more than 4,000 features; (iii) hyperspectral images3 in remote sensing lead to high dimensional data sets: each pixel can contain more than 200 spectral measurements in different wavelengths; (iv) the characteris- tics of a chemical compound recorded by a mass spectrometer can be represented by hundreds of features, where each feature corresponds to the reading in a particular range; (v) microarray technology enables us to measure the expression levels of thou- sands Of genes simultaneously for different subjects with different treatments [6, 273]. Analyzing microarray data is particularly challenging, because the number of data points (subjects in this case) is much smaller than the number of features (expression levels in this case). High dimensional data can also be derived in applications where the initial num- ber of features is moderate. In an image processing task, the user can apply different filters with different parameters to extract a. 
The features are then summarized by applying a dimensionality reduction algorithm that matches the task at hand. This (relatively) automatic procedure contrasts with the traditional approach, where the user hand-crafts a small number of salient features manually, often with great effort. Creating a large feature set and then summarizing the features is advantageous when the domain is highly variable and robust features are hard to obtain, such as the occupant classification problem in [78].

1.2.2 Advantages of Dimensionality Reduction

Why should we reduce the dimensionality of a data set? In principle, the more information we have about each pattern, the better a learning algorithm is expected to perform. This seems to suggest that we should use as many features as possible for the task at hand. However, this is not the case in practice. Many learning algorithms perform poorly in a high dimensional space given a small number of learning samples. Often some features in the data set are just "noise" and thus do not contribute to (and sometimes even degrade) the learning process. This difficulty in analyzing data sets with many features and a small number of samples is known as the curse of dimensionality [211].

Dimensionality reduction can circumvent this problem by reducing the number of features in the data set before the training process. This can also reduce the computation time, and the resulting classifiers take less space to store. Models with a small number of variables are often easier for domain experts to interpret. Dimensionality reduction is also invaluable as a visualization tool, where the high dimensional data set is transformed into two or three dimensions for display purposes. This can give the system designer additional insight into the problem at hand.

The main drawback of dimensionality reduction is the possibility of information loss. When done poorly, dimensionality reduction can discard useful instead of irrelevant information. No matter what subsequent processing is performed, there is no way to recover this information loss.

1.2.2.1 Alternatives to Dimensionality Reduction

In the context of predictive modeling, (explicit) dimensionality reduction is not the only approach to handle high dimensional data. The naive Bayes classifier has found empirical success in classifying high dimensional data sets like webpages (the WebKB project in [50]). Regularized classifiers such as support vector machines have achieved good accuracy on high dimensional data sets in the domain of text categorization [135]. Some learning algorithms have built-in feature selection abilities and thus (in theory) do not require explicit dimensionality reduction. For example, boosting [90] can use each feature as a "weak" classifier and construct an overall classifier by selecting the appropriate features and combining them [261].

Despite the apparent robustness of these learning algorithms on high dimensional data sets, it can still be beneficial to reduce the dimensionality first. Noisy features can degrade the performance of support vector machines because the values of the kernel function (in particular the RBF kernel, which depends on inter-point Euclidean distances) become less reliable if many features are irrelevant.
It is beneficial to adjust the kernel to ignore those features [156], effectively performing dimensionality reduction. Concerns related to the efficiency and storage requirements of a classifier also suggest the use of dimensionality reduction as a preprocessing step.

The important lesson is: dimensionality reduction is useful for most applications, yet the tolerance for the amount of information discarded should be subject to the judgement of the system designer. In general, a more conservative dimensionality reduction strategy should be employed if a classifier that is more robust to high dimensionality (such as support vector machines) is used. The dimensionality of the data may still be somewhat large, but at least little useful information is lost. On the other hand, if a more traditional and easier-to-understand classifier (like quadratic discriminant analysis) is to be used, we should reduce the dimensionality of the data set more aggressively to a smaller number, so that the classifier can competently handle the data.

1.2.3 Techniques for Dimensionality Reduction

Dimensionality reduction techniques can be broadly divided into several categories: (i) feature selection and feature weighting, (ii) feature extraction, and (iii) feature grouping.

1.2.3.1 Feature Selection and Feature Weighting

Feature selection, also known as variable selection or subset selection in the statistics (particularly regression) literature, deals with the selection of a subset of features that is most appropriate for the task at hand. A feature is either selected (because it is relevant) or discarded (because it is irrelevant). Feature weighting [271], on the other hand, assigns weights (usually between zero and one) to different features to indicate the saliencies of the individual features. Most of the literature on feature selection/weighting pertains to supervised learning (both classification [122, 151, 26, 101] and regression [186]).

Filters, Wrappers, and Embedded Algorithms

Feature selection/weighting algorithms can be broadly divided into three categories [26, 151, 101]. The filter approaches evaluate the relevance of each feature (subset) using the data set alone, regardless of the subsequent learning task. RELIEF [147] and its enhancement [155] are representatives of this class, where the basic idea is to assign feature weights based on the consistency of the feature value in the k nearest neighbors of every data point.

Wrapper algorithms, on the other hand, invoke the learning algorithm to evaluate the quality of each feature (subset). Specifically, a learning algorithm (e.g., a nearest neighbor classifier, a decision tree, a naive Bayes method) is run using a feature subset, and the feature subset is assessed by some estimate related to the classification accuracy. Often the learning algorithm is regarded as a "black box" in the sense that the wrapper algorithm operates independently of the internal mechanism of the classifier. An example is [212], which used genetic search to adjust the feature weights for the best performance of the k nearest neighbor classifier. In the third approach (called embedded in [101]), the learning algorithm is modified to have the ability to perform feature selection. There is no longer an explicit feature selection step; the algorithm automatically builds a classifier with a small number of features. LASSO (least absolute shrinkage and selection operator) [250] is a good example in this category.
LASSO modifies ordinary least squares by including a constraint on the L1 norm of the weight coefficients. This has the effect of preferring sparse regression coefficients (a formal statement of this is proved in [65, 64]), effectively performing feature selection. Another example is MARS (multivariate adaptive regression splines) [91], where choosing the variables used in the polynomial splines effectively performs variable selection. Automatic relevance determination in neural networks [177] is another example, which uses a Bayesian approach to estimate the weights in the neural network as well as the relevancy parameters that can be interpreted as feature weights.

Filter approaches are generally faster because they are classifier-independent and only require the computation of simple quantities. They scale well with the number of features, and many of them can comfortably handle thousands of features. Wrapper approaches, on the other hand, can be superior in accuracy when compared with filters, which ignore the properties of the learning task at hand [151]. They are, however, computationally more demanding, and do not scale very well with the number of features. This is because training and evaluating a classifier with many features can be slow, and the performance of a traditional classifier with a large number of features may not be reliable enough to estimate the utilities of individual features. To get the best results from filters and wrappers, the user can apply a filter-type technique as preprocessing to cut down the feature set to a moderate size, and then use a wrapper algorithm to determine a small yet discriminative feature subset. Some state-of-the-art feature selection algorithms indeed adopt this approach, as observed in [102]. "Embedded" algorithms are highly specialized and it is difficult to compare them in general with filter and wrapper approaches.

Quality of a Feature Subset

Feature selection/weighting algorithms can also be classified according to the definition of "relevance" or how the quality of a feature subset is assessed. Five definitions of relevance are given in [26]. Information-theoretic methods are often used to evaluate features, because the mutual information between a relevant feature and the class labels should be high [15]. Non-parametric methods can be used to estimate the probability density function of a continuous feature, which in turn is used to compute the mutual information [159, 251]. Correlation is also used frequently to evaluate features [278, 104]. A feature can be declared irrelevant if it is conditionally independent of the class labels given other features. The concept of a Markov blanket is used to formalize this notion of irrelevancy in [153]. RELIEF [147, 155] uses the consistency of the feature value in the k nearest neighbors of every data point to quantify the usefulness of a feature.

Optimization Strategy

Given a definition of feature relevancy, a feature selection algorithm can search for the most relevant feature subset. Because of the lack of monotonicity (with respect to the features) of many feature relevancy criteria, a combinatorial search through the space of all possible feature subsets is needed. Usually, heuristic (non-exhaustive) methods have to be adopted, because the size of this space is exponential in the number of features. In this case, one generally loses any guarantee of optimality of the selected feature subset.
Different types of heuristics, such as sequential forward or backward searches, floating search, beam search, bi-directional search, and genetic search, have been suggested [36, 151, 209, 275]. A comparison of some of these search heuristics can be found in [211]. In the context of linear regression, sequential forward search is often known as stepwise regression. Forward stagewise regression is a generalization of stepwise regression, where a feature is only "partially" selected by increasing the corresponding regression coefficient by a fixed amount. It is closely related to LASSO [250], and this relationship was established via least angle regression (LARS), another interesting algorithm in its own right, in [72].

Wrapper algorithms generally include a heuristic search, as is the case for filter algorithms with feature quality criteria dependent on the features selected so far. Note that feature weighting algorithms do not involve a heuristic search because the weights for all features are computed simultaneously. However, the computation of the weights may be expensive. Embedded approaches also do not require any heuristic search. The optimal parameter is often estimated by optimizing a certain objective function. Depending on the form of the objective function, different optimization strategies can be used. In the case of LASSO, for example, a general quadratic programming solver, the homotopy method [198], a modified version of LARS [72], or the EM algorithm [80] can be used to estimate the parameters.

1.2.3.2 Feature Extraction

In feature extraction, a small set of new features is constructed by a general mapping from the high dimensional data. The mapping often involves all the available features. Many techniques for feature extraction have been proposed. In this section, we describe some of the linear feature extraction methods, i.e., those where the extracted features can be written as linear combinations of the original features. Nonlinear feature extraction techniques are more sophisticated. In Chapter 2 we shall examine some of the recent nonlinear feature extraction algorithms in more detail. The readers may also find two recent surveys [284, 34] useful in this regard.

Unsupervised Techniques

"Unsupervised" here refers to the fact that these feature extraction techniques are based only on the data (pattern matrix), without pattern label information. Principal component analysis (PCA), also known as the Karhunen-Loeve transform or simply the KL transform, is arguably the most popular feature extraction method. PCA finds a hyperplane such that, upon projection to the hyperplane, the data variance is best preserved. The optimal hyperplane is spanned by the principal components, which are the leading eigenvectors of the sample covariance matrix. Features extracted by PCA consist of the projections of the data points onto different principal components. When the features extracted by PCA are used for linear regression, this is sometimes called "principal component regression". Recently, sparse variants of PCA have also been proposed [137, 291, 52], where each principal component only has a small number of non-zero coefficients.

Factor analysis (FA) can also be used for feature extraction. FA assumes that the observed high dimensional data points are the results of a linear function (expressed by the factor loading matrix) on a few unobserved random variables, together with uncorrelated zero-mean noise.
After estimating the factor loading matrix and the variance of the noise, the factor scores for different patterns can be estimated and serve as a low-dimensional representation of the data.

Supervised Techniques

Labels in classification and response variables in regression can be used together with the data to extract more relevant features. Linear discriminant analysis (LDA) finds the projection direction such that the ratio of between-class variance to within-class variance is the largest. When there are more than two classes, multiple discriminant analysis (MDA) finds a sequence of projection directions that maximizes a similar criterion. Features are extracted by projecting the data points onto these directions.

Partial least squares (PLS) can be viewed as the regression counterpart of LDA. Instead of extracting features by retaining maximum data variance as in principal component regression, PLS finds projection directions that can best explain the response variable. Canonical correlation analysis (CCA) is a closely related technique that finds projection directions that maximize the correlation between the response variables and the features extracted by projection.

1.2.3.3 Feature Grouping

In feature grouping, new features are constructed by combining several existing features. Feature grouping can be useful in scenarios where it is more meaningful to combine features due to the characteristics of the domain. For example, in a text categorization task different words can have similar meanings and combining them into a single word class is more appropriate. Another example is the use of the power spectrum for classification, where each feature corresponds to the energy in a certain frequency range. The preset boundaries of the frequency ranges can be sub-optimal, and the sum of features from adjacent frequency ranges can lead to a more meaningful feature by capturing the energy in a wider frequency range. For gene expression data, genes that are similar may share a common biological pathway, and the grouping of predictive genes can be of interest to biologists [108, 230, 59].

The most direct way to perform feature grouping is to cluster the features (instead of the objects) of a data set. Feature clustering is not new; the SAS/STAT procedure "varclus" for variable clustering was written before 1990 [225]. It is performed by applying a hierarchical clustering method on a similarity matrix of different features, which is derived from, say, Pearson's correlation coefficient. This scheme was probably first proposed in [124], which also suggested summarizing one group of features by a single feature in order to achieve dimensionality reduction. Recently, feature clustering has been applied to boost the performance in text categorization. Techniques based on distribution clustering [4], mutual information [62], and information bottleneck [238] have also been proposed.

Features can also be clustered together with the objects. As mentioned in [201], this idea has been known under different names in the literature, including "bi-clustering" [41, 150], "co-clustering" [63, 61], "double-clustering" [73], "coupled clustering" [95], and "simultaneous clustering" [208]. A bipartite graph can be used to represent the relationship between objects and features, and the partitioning of the graph can be used to cluster the objects and the features simultaneously [281, 61]. Information bottleneck can also be used for this task [237].
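A minimal sketch of the basic feature-clustering scheme described above (not part of the thesis; assumed to be written in Python with NumPy/SciPy, and with an illustrative function name) is given below: the features are hierarchically clustered using one minus the absolute Pearson correlation as the dissimilarity, and each group is then summarized by the average of its members, in the spirit of the suggestion in [124].

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_features(X, n_groups):
    # X: n-by-d pattern matrix. Cluster the d features using
    # 1 - |Pearson correlation| as the dissimilarity between features.
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_groups, criterion="maxclust")
    # Summarize each group of features by its average, giving one new feature per group.
    grouped = np.column_stack([X[:, labels == g].mean(axis=1)
                               for g in np.unique(labels)])
    return grouped, labels

if __name__ == "__main__":
    X = np.random.randn(100, 12)                            # toy data: 100 objects, 12 features
    X[:, 6:] = X[:, :6] + 0.1 * np.random.randn(100, 6)     # six noisy copies of the first six
    Xg, labels = group_features(X, n_groups=6)
    print(Xg.shape, labels)                                 # correlated feature pairs tend to share a label

In this toy example the correlated feature pairs are merged, reducing the twelve original features to roughly six grouped features.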
In the context of regression, feature grouping can be achieved indirectly by favoring similar features to have similar coefficients. This can be done by combining ridge regression with LASSO, leading to the elastic net regression algorithm [290].

Figure 1.3: The three well-separated clusters can be easily detected by most clustering algorithms. (a) Original data. (b) Clustering result. Images in this thesis/dissertation are presented in color.

1.3 Data Clustering

The goal of (data) clustering, also known as cluster analysis, is to discover the "natural" grouping(s) of a set of patterns, points, or objects. Webster (http://www.m-w.com/) defines cluster analysis as "a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics." An example of clustering can be seen in Figure 1.3. The unlabeled data set in Figure 1.3(a) is assigned labels by a clustering procedure in order to discover the natural grouping of the three groups as shown in Figure 1.3(b).

Cluster analysis is prevalent in any discipline that involves the analysis of multivariate data. It is difficult to exhaustively list the numerous uses of clustering techniques. Image segmentation, an important problem in computer vision, can be formulated as a clustering problem [94, 128, 234]. Documents can be clustered [120] to generate topical hierarchies for information access [221] or retrieval [20]. Clustering is also used to perform market segmentation [3, 39] as well as to study genome data [6] in biology.

Figure 1.4: Diversity of clusters. The seven clusters in this data set (denoted by the seven different colors), though easily identified by a human, are difficult to detect automatically. The clusters are of different shapes, sizes, and densities. The presence of background noise makes the clustering task even more difficult. (a) Original data. (b) Clustering result.

Clustering, unfortunately, is difficult for most data sets. A non-trivial example of clustering is shown in Figure 1.4. Unlike the three well-separated, spherical clusters in Figure 1.3, the seven clusters in Figure 1.4 have diverse shapes: globular, circular, and spiral in this case. The densities and the sizes of the clusters are also different. The presence of background noise makes the detection of the clusters even more difficult. This example also illustrates the fundamental difficulty of clustering. The diversity of "good" clusters in different scenarios makes it virtually impossible to provide a universal definition of "good" clusters.
In fact, it has been proved in [149] that it is impossible for any clustering algorithm to achieve some fairly basic goals simultaneously. Therefore, it is not surprising that many clustering algorithms have been proposed to address the different needs of "good clusters" in different scenarios.

In this section, we attempt to provide a taxonomy of the major clustering techniques, present a brief history of cluster analysis, and present the basic ideas of some popular clustering algorithms in the pattern recognition community.

1.3.1 A Taxonomy of Clustering

Many clustering algorithms have been proposed in different application scenarios. Perhaps the most important way to classify clustering algorithms is hierarchical versus partitional. Hierarchical clustering creates a tree of objects, where branches merging at the lower levels correspond to higher similarity. Partitional clustering, on the other hand, aims at creating a "flat" partition of the set of objects, with each object belonging to one and only one group. Clustering algorithms can also be classified by the type of input data used (pattern matrix or similarity matrix), or by the type of the features, e.g., numerical, categorical, or special data structures, such as rank data, strings, graphs, etc. (See Section 1.1.1 for information on different types of data.) Alternatively, a clustering algorithm can be characterized by the probability model used, if any, or by the core search (optimization) process used to find the clusters. Hierarchical clustering algorithms can be described by the clustering direction, either agglomerative or divisive.

In Figure 1.5, we provide one possible hierarchy of partitional clustering algorithms (modified from [131]). Heuristic-based techniques refer to clustering algorithms that optimize a certain notion of "good" clusters. The goodness function is constructed by the user in a heuristic manner. Model-based clustering assumes that there are underlying (usually probabilistic) models that govern the clusters. Density-based algorithms attempt to estimate the data density and utilize that to construct the clusters. One may further sub-divide heuristic-based techniques depending on the input type. If a pattern matrix is used, the algorithm is usually prototype-based, i.e., each cluster is represented by the most typical "prototype." The k-means and the k-medoids algorithms [79] are probably the best known in this category. If a dissimilarity or similarity matrix is used as the input, two sub-categories are possible: those based on linkage (single-link, average-link, complete-link, and CHAMELEON [142]), and those inspired by graph theory, such as min-cut [272] and spectral clustering [234, 194]. Model-based algorithms often refer to clustering by using a finite mixture distribution [184], with each mixture component interpreted as a cluster. Spatial clustering can involve a probabilistic model of the point process. For density-based methods, the mean-shift algorithm [45] finds the modes of the data density by the mean-shift operation, and the cluster label of a point is determined by the "basin of convergence" in which the point is located. DENCLUE [111] utilizes a kernel (non-parametric) estimate of the data density to find the clusters.
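To make the mode-seeking idea concrete, the following is a minimal sketch of the mean-shift iteration with a Gaussian kernel: each point is repeatedly moved to the kernel-weighted mean of the data, and points whose iterations converge to approximately the same mode receive the same label. The bandwidth, tolerance, and mode-merging threshold are illustrative assumptions, not settings taken from the cited papers.

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, tol=1e-4, max_iter=200):
    """Minimal mean-shift clustering sketch (Gaussian kernel)."""
    modes = X.astype(float).copy()
    for i in range(len(X)):
        x = modes[i]
        for _ in range(max_iter):
            # Gaussian kernel weights of all data points relative to x.
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
            x_new = w @ X / w.sum()          # mean-shift update: weighted mean
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        modes[i] = x
    # Points whose modes nearly coincide share the same basin of convergence.
    labels, centers = np.zeros(len(X), dtype=int), []
    for i, m in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(m - center) < bandwidth / 2:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
print(np.bincount(mean_shift(X, bandwidth=1.5)))
```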
Figure 1.5: A taxonomy of clustering algorithms. Partitional algorithms are divided into heuristic-based, model-based, and density-based approaches. Heuristic-based methods take either a pattern matrix (prototype-based methods: k-means, k-medoid) or a proximity matrix (linkage methods: single-link, complete-link, CHAMELEON; graph-theoretic methods: MST, spectral clustering, min-cut). Model-based methods include spatial clustering and mixture models (Gaussian mixture, latent class). Density-based methods include kernel-based methods (DENCLUE) and mode seeking (mean-shift).

1.3.2 A Brief History of Cluster Analysis

According to the scholarly journal archive JSTOR (http://www.jstor.org), the first appearance of the word "cluster" in the title of a scholarly article was in 1739 [11]: "A Letter from John Bartram, M. D. to Peter Collinson, F. R. S. concerning a Cluster of Small Teeth Observed by Him at the Root of Each Fang or Great Tooth in the Head of a Rattle-Snake, upon Dissecting It". The word "cluster" here, though, was used only in its general sense to denote a group. The phrase "cluster analysis" first appeared in 1954, when it was suggested as a tool to understand anthropological data [43]. In its early days, cluster analysis was sometimes referred to as grouping [48, 85], and biologists called it "numerical taxonomy" [242].

Early research on hierarchical clustering was mainly done by biologists, because these techniques helped them to create a hierarchy of different species for analyzing their relationships systematically. According to [242], single-link clustering [240], complete-link clustering [213], and average-link clustering [241] first appeared in 1957, 1948, and 1958, respectively. Ward's method [266] was proposed in 1963. Partitional clustering, on the other hand, is closely related to data compression and vector quantization. This link is not surprising because the cluster labels assigned by a partitional clustering algorithm can be viewed as a compressed version of the data. The most popular partitional clustering algorithm, k-means, has been proposed several times in the literature: Steinhaus in 1955 [243], Lloyd in 1957 [174], and MacQueen in 1967 [178]. The ISODATA algorithm by Ball and Hall in 1965 [8] can be regarded as an adaptive version of k-means that adjusts the number of clusters. The k-means algorithm is also attributed to Forgy (as in [140] and [99]), though the reference for this [88] only contains an abstract and it is not clear what Forgy exactly proposed. The historical account of vector quantization given in [99] also presents the history of some of the partitional clustering algorithms. In 1971, Zahn proposed a graph-theoretic clustering method [280], which is closely related to single-link clustering. The EM algorithm, which is the standard algorithm for estimating a finite mixture model for mixture-based clustering, is attributed to Dempster et al. in 1977 [58]. Interest in mean-shift clustering was revived in 1995 by Cheng [40], and Comaniciu and Meer further popularized it in [45]. Hofmann and Buhmann considered the use of deterministic annealing for pairwise clustering [115], and Fischer and Buhmann modified the connectedness idea in single-link clustering, which led to path-based clustering [84]. The normalized cut algorithm by Shi and Malik [233] in 1997 is often regarded as the first spectral clustering algorithm, though similar ideas were considered by spectral graph theorists earlier. A summary of the important results in spectral graph theory can be found in the 1997 book by Chung [42]. The emergence of data
mining has led to a new line of clustering research that emphasizes efficiency when dealing with huge databases. DBSCAN by Ester et al. [77] for density-based clustering and CLIQUE by Agrawal et al. [2] for subspace clustering are two well-known algorithms in this community.

The current literature on cluster analysis is vast, and hundreds of clustering algorithms have been proposed. It would require a tremendous effort to list and summarize all the major clustering algorithms. The reader is encouraged to refer to a survey like [130] or [79] for an overview of different clustering algorithms.

1.3.3 Examining Some Clustering Algorithms

In this section, we will examine two very important clustering algorithms used in the pattern recognition community: the k-means algorithm and the EM algorithm. Other clustering algorithms that are used regularly in pattern recognition include the mean-shift algorithm [45, 44, 40], pairwise clustering [115, 116], path-based clustering [84, 83], and spectral clustering [234, 139, 269, 194, 258, 42].

Let $\{y_1, \ldots, y_n\}$ be the set of $n$ $d$-dimensional data points to be clustered. The cluster label of $y_i$ is denoted by $z_i$. The goal of (partitional) clustering is to recover $z_i$, with $z_i \in \{1, \ldots, k\}$, where $k$ denotes the number of clusters specified by the user. The set of $y_i$ with $z_i = j$ is referred to as the $j$-th cluster.

1.3.3.1 The k-means algorithm

The k-means algorithm is probably the best known clustering algorithm. In this algorithm, the $j$-th cluster is represented by the "cluster prototype" $\mu_j$ in $\mathbb{R}^d$. Clustering is done by finding $z_i$ and $\mu_j$ that minimize the following cost function:
$$J_{k\text{-means}} = \sum_{i=1}^{n} \|y_i - \mu_{z_i}\|^2 = \sum_{i=1}^{n}\sum_{j=1}^{k} I(z_i = j)\,\|y_i - \mu_j\|^2. \qquad (1.1)$$
Here, $I(z_i = j)$ denotes the indicator function, which is one if the condition $z_i = j$ is true, and zero otherwise. To optimize $J_{k\text{-means}}$, we first assume that all $\mu_j$ are specified. The values of $z_i$ that minimize $J_{k\text{-means}}$ are given by
$$z_i = \arg\min_{j} \|y_i - \mu_j\|^2. \qquad (1.2)$$
On the other hand, if $z_i$ is fixed, the optimal $\mu_j$ can be found by differentiating $J_{k\text{-means}}$ with respect to $\mu_j$ and setting the derivatives to zero, leading to
$$\mu_j = \frac{\sum_{i=1}^{n} I(z_i = j)\,y_i}{\sum_{i=1}^{n} I(z_i = j)} = \frac{\sum_{i:\,z_i = j} y_i}{\text{number of } i \text{ with } z_i = j}. \qquad (1.3)$$
Starting from an initial guess of $\mu_j$, the k-means algorithm iterates between Equations (1.2) and (1.3), which is guaranteed to decrease the k-means objective function until a local minimum is reached. In this case, $\mu_j$ and $z_i$ remain unchanged after the iteration, and the k-means algorithm is said to have converged. The resulting $z_i$ and $\mu_j$ constitute the clustering solution. In practice, one can stop if the change in successive values of $J_{k\text{-means}}$ is less than a threshold.

The k-means algorithm is easy to understand and is also easy to implement. However, k-means has problems in discovering clusters that are not spherical in shape. It also encounters some difficulties when different clusters have significantly different numbers of points. k-means also requires a good initialization to avoid getting trapped in a poor local minimum. In many cases, the user does not know the number of clusters in advance, which is required by k-means. The problem of determining the value of $k$ automatically still does not have a very satisfactory solution. Some heuristics have been described in [125], and a recent paper on this is [106].
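The alternation between Equations (1.2) and (1.3) can be written down in a few lines. The following is a minimal NumPy sketch of this iteration, assuming a random-sample initialization of the prototypes and a simple change-in-cost stopping rule; it is meant only to illustrate the two update equations, not to serve as a production implementation.

```python
import numpy as np

def kmeans(Y, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: alternate Eq. (1.2) and Eq. (1.3)."""
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(len(Y), size=k, replace=False)]    # initial prototypes
    prev_cost = np.inf
    for _ in range(max_iter):
        # Eq. (1.2): assign each point to the nearest prototype.
        dists = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        cost = dists[np.arange(len(Y)), z].sum()
        # Eq. (1.3): each prototype becomes the mean of its assigned points.
        for j in range(k):
            if np.any(z == j):                           # leave empty clusters unchanged
                mu[j] = Y[z == j].mean(axis=0)
        if prev_cost - cost < tol:                       # stop when J_k-means stabilizes
            break
        prev_cost = cost
    return z, mu

Y = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [6, 0], [0, 6])])
labels, centers = kmeans(Y, k=3)
print(centers)
```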
Because the k-means algorithm alternates between the two conditions of optimality, it is an example of alternating optimization. The k-means clustering result can be interpreted as a solution to vector quantization, with a codebook of size $k$ and a square error loss function. Each $\mu_j$ is a codeword in this case. The k-means algorithm can also be viewed as a special case of fitting a Gaussian mixture, with the covariance matrices of all the mixture components fixed to $\sigma^2 I$ and $\sigma$ tending to zero (for the "hard" cluster assignment). The k-medoid algorithm is similar to k-means, except that $\mu_j$ is restricted to be one of the given patterns $y_i$.

There is also an online version of k-means. When the $i$-th data point $y_i$ is observed, the cluster center $\mu_j$ that is nearest to $y_i$ is found. $\mu_j$ is then updated by
$$\mu_j^{\text{new}} = \mu_j + \alpha\,(y_i - \mu_j), \qquad (1.4)$$
where $\alpha$ is the learning rate. This learning rule is an example of "winner-take-all" in competitive learning, because only the cluster that "wins" the data point can learn from it.

1.3.3.2 Clustering by Fitting a Finite Mixture Model

The k-means algorithm is an example of "hard" clustering, where a data point is assigned to only one cluster. In many cases, it is beneficial to consider "soft" clustering, where a point is assigned to different clusters with different degrees of certainty. This can be done either by fuzzy clustering or by mixture-based clustering. We prefer the latter because it has a more rigorous foundation.

In mixture-based clustering, a finite mixture model is fitted to the data. Let $Y$ and $Z$ be the random variables for a data point and a cluster label, respectively. Each cluster is represented by the component distribution $p(Y|\theta_j)$, where $\theta_j$ denotes the parameter of the $j$-th cluster. Data points from the $j$-th cluster are assumed to follow this distribution, i.e., $p(Y|Z = j) = p(Y|\theta_j)$. The component distribution $p(Y|\theta_j)$ is often assumed to be a Gaussian when $Y$ is continuous, and the corresponding mixture model is called "a mixture of Gaussians". If $Y$ is categorical, a multinomial distribution can be used for $p(Y|\theta_j)$. Let $\alpha_j = P(Z = j)$ be the prior probability of the $j$-th cluster. The key idea of a mixture model is
$$p(Y|\Theta) = \sum_{j=1}^{k} P(Z = j)\,p(Y|Z = j) = \sum_{j=1}^{k} \alpha_j\,p(Y|\theta_j), \qquad (1.5)$$
where $\Theta = \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_k\}$ contains all the model parameters. The mixture model can be understood as a two-stage data generation process. First, the hidden cluster label $Z$ is sampled from a multinomial distribution with parameters $(\alpha_1, \ldots, \alpha_k)$. The data point $Y$ is then generated according to the component distribution determined by $Z$, i.e., $Y$ is sampled from $p(Y|\theta_j)$ if $Z = j$. The degree of membership of $y_i$ in the $j$-th cluster is determined by the posterior probability of $Z$ equal to $j$ given $y_i$, i.e.,
$$P(Z = j\,|\,Y = y_i) = \frac{\alpha_j\,p(y_i|\theta_j)}{\sum_{l=1}^{k} \alpha_l\,p(y_i|\theta_l)}. \qquad (1.6)$$
If a "hard" clustering is needed, $y_i$ can be assigned to the cluster with the highest posterior probability $P(Z|Y = y_i)$.

The parameter $\Theta$ can be determined using the maximum likelihood principle. We seek $\Theta$ that minimizes the negative log-likelihood:
$$J_{\text{mixture}} = -\sum_{i=1}^{n} \log \sum_{j=1}^{k} \alpha_j\,p(y_i|\theta_j). \qquad (1.7)$$
For brevity of notation, we write $p(y_i|\theta_j)$ to denote $p(Y = y_i|\theta_j)$. The EM algorithm can be used to optimize $J_{\text{mixture}}$. EM is a powerful technique for parameter estimation when some of the data are missing. In the context of a finite mixture model, the missing data are the cluster labels. Starting with an initial guess of the parameters, the EM algorithm alternates between the "E-step" and the "M-step".
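Before deriving these two steps in detail below, it may help to see them concretely for a mixture of Gaussians. The sketch that follows alternates the posterior computation of Equation (1.6) (E-step) with the standard weighted-mean and weighted-covariance updates of the component parameters (M-step). The initialization, the diagonal regularization of the covariances, and the fixed number of iterations are illustrative assumptions only, not part of the derivation that follows.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(Y, k, n_iter=50, seed=0):
    """Minimal EM sketch for a mixture of Gaussians (full covariances)."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    alpha = np.full(k, 1.0 / k)                       # mixing weights alpha_j
    mu = Y[rng.choice(n, size=k, replace=False)]      # component means
    sigma = np.stack([np.cov(Y.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: posteriors P(Z = j | y_i), as in Eq. (1.6).
        dens = np.column_stack([alpha[j] * multivariate_normal.pdf(Y, mu[j], sigma[j])
                                for j in range(k)])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update alpha_j, mu_j, Sigma_j from the posteriors.
        nj = r.sum(axis=0)
        alpha = nj / n
        mu = (r.T @ Y) / nj[:, None]
        for j in range(k):
            diff = Y - mu[j]
            sigma[j] = (r[:, j, None] * diff).T @ diff / nj[j] + 1e-6 * np.eye(d)
    labels = r.argmax(axis=1)                         # "hard" assignment if needed
    return labels, alpha, mu, sigma

Y = np.vstack([np.random.randn(150, 2), np.random.randn(150, 2) + 4])
labels, alpha, mu, sigma = gmm_em(Y, k=2)
print(alpha, mu)
```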
Let $r_{ij} = P(Z = j\,|\,Y = y_i, \Theta^{\text{old}})$, where $\Theta^{\text{old}}$ is the current parameter estimate. In the E-step, we compute the expected complete data log-likelihood, also known as the Q-function:
$$Q(\Theta) = E_{Z|Y,\Theta^{\text{old}}}\big[\log p(Y, Z|\Theta)\big] = \sum_{Z} p(Z|Y, \Theta^{\text{old}})\,\log p(Y, Z|\Theta).$$
For the finite mixture model, this becomes $Q(\Theta) = \sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij}\big(\log\alpha_j + \log p(y_i|\theta_j)\big)$. In the M-step, we find $\Theta^{\text{new}}$ that maximizes the Q-function. To see that this improves the observed-data log-likelihood, write
$$Q(\Theta^{\text{new}}) - Q(\Theta^{\text{old}}) = \sum_{Z} p(Z|Y, \Theta^{\text{old}})\,\log\frac{p(Y, Z|\Theta^{\text{new}})}{p(Y, Z|\Theta^{\text{old}})} = \log p(Y|\Theta^{\text{new}}) - \log p(Y|\Theta^{\text{old}}) + \sum_{Z} p(Z|Y, \Theta^{\text{old}})\,\log\frac{p(Z|Y, \Theta^{\text{new}})}{p(Z|Y, \Theta^{\text{old}})},$$
and note that the last term is non-positive:
$$\sum_{Z} p(Z|Y, \Theta^{\text{old}})\,\log\frac{p(Z|Y, \Theta^{\text{new}})}{p(Z|Y, \Theta^{\text{old}})} \le \log\sum_{Z} p(Z|Y, \Theta^{\text{new}}) = 0.$$
The inequality is due to the concavity of the logarithm, and the fact that $p(Z|Y, \Theta^{\text{old}})$ can be viewed as "weights" because they are non-negative and $\sum_Z p(Z|Y, \Theta^{\text{old}}) = 1$. Since $Q(\Theta^{\text{new}}) - Q(\Theta^{\text{old}}) \ge 0$, the above implies $\log p(Y|\Theta^{\text{new}}) - \log p(Y|\Theta^{\text{old}}) \ge 0$. So, the update of the parameter from $\Theta^{\text{old}}$ to $\Theta^{\text{new}}$ indeed improves the log-likelihood of the observed data. When $\Theta^{\text{old}} = \Theta^{\text{new}}$, the inequality becomes an equality, and we have reached a stationary point of $\log p(Y|\Theta)$, i.e., a local minimum of $J_{\text{mixture}}$. Note that the above argument holds as long as $Q(\Theta^{\text{new}}) - Q(\Theta^{\text{old}}) \ge 0$. Thus it suffices to increase (instead of maximize) the expected complete log-likelihood in the M-step. The resulting algorithm that only increases the expected complete log-likelihood is known as the generalized EM algorithm.

It is interesting to note a variant of the EM algorithm used in [80] for Bayesian parameter estimation. The goal is to find $\Theta$ that maximizes $\log p(\Theta|Y)$. Since the missing data in [80] are continuous, the expectation is performed by integration instead of summation. The E-step computes $\int p(Z|\Theta^{\text{old}}, Y)\,\log p(\Theta|Z, Y)\,dZ$, and the M-step solves $\Theta^{\text{new}} = \arg\max_{\Theta}\int p(Z|\Theta^{\text{old}}, Y)\,\log p(\Theta|Z, Y)\,dZ$. The correctness of this variant of the EM algorithm can be seen from the following:
$$\int p(Z|\Theta^{\text{old}}, Y)\,\log p(\Theta^{\text{new}}|Z, Y)\,dZ - \int p(Z|\Theta^{\text{old}}, Y)\,\log p(\Theta^{\text{old}}|Z, Y)\,dZ$$
$$= \int p(Z|\Theta^{\text{old}}, Y)\Big(\log p(\Theta^{\text{new}}|Y) + \log p(Z|\Theta^{\text{new}}, Y) - \log p(Z|Y) - \log p(\Theta^{\text{old}}|Y) - \log p(Z|\Theta^{\text{old}}, Y) + \log p(Z|Y)\Big)\,dZ$$
$$= \log p(\Theta^{\text{new}}|Y) - \log p(\Theta^{\text{old}}|Y) + \int p(Z|\Theta^{\text{old}}, Y)\,\log\frac{p(Z|\Theta^{\text{new}}, Y)}{p(Z|\Theta^{\text{old}}, Y)}\,dZ \;\le\; \log p(\Theta^{\text{new}}|Y) - \log p(\Theta^{\text{old}}|Y).$$
Note that $p(\Theta|Z, Y) = p(\Theta|Y)\,p(Z|\Theta, Y)/p(Z|Y)$.

Our second proof of the EM algorithm is to regard it as a special case of the variational method. Here, we follow the presentation in [205]. Let $T(Z)$ be an unknown variational distribution on the missing data $Z$. Since $p(Y|\Theta) = p(Y, Z|\Theta)/p(Z|Y, \Theta)$, we have
$$\log p(Y|\Theta) = \sum_{Z} T(Z)\,\log\frac{p(Y, Z|\Theta)}{T(Z)} + D_{KL}\big(T(Z)\,\big\|\,p(Z|Y, \Theta)\big).$$

An example of a "curved" manifold with the data points lying on it can be seen in Figure 2.1. This manifold assumption is reasonable because many real world phenomena are driven by a small number of latent factors. The high dimensional feature vectors observed are the results of applying a (usually unknown) mapping to the latent factors, followed by the introduction of noise.
Consequently, high dimensional vectors in practice lie approximately on a low dimensional manifold.

Strictly speaking, what we refer to as a "manifold" in this thesis should properly be called a "Riemannian manifold." A Riemannian manifold is smooth and differentiable, and contains the notion of length. We leave the precise definition of a Riemannian manifold to encyclopedias like MathWorld (http://mathworld.wolfram.com) and Wikipedia (http://en2.wikipedia.org/), and describe only some of its properties here. Every $y$ in the manifold $M$ has a neighborhood $N(y)$ that is homeomorphic (two topological spaces are homeomorphic if there exists a continuous and invertible function between the two spaces, and the inverse function is also continuous) to a set $S$, where $S$ is either an open subset of $\mathbb{R}^d$, or an open subset of the closed half of $\mathbb{R}^d$. This mapping $\phi_y : N(y) \mapsto S$ is called a co-ordinate chart, and $\phi_y(y)$ is called the "co-ordinate" of $y$. A collection of co-ordinate charts that covers the entire $M$ is called an atlas. If $y$ is in two co-ordinate charts $\phi_{y_1}$ and $\phi_{y_2}$, $y$ will have two (local) co-ordinates $\phi_{y_1}(y)$ and $\phi_{y_2}(y)$. These two co-ordinates should be "consistent" in the sense that there is a map to convert between $\phi_{y_1}(y)$ and $\phi_{y_2}(y)$, and the map is continuous for any path in $N(y_1) \cap N(y_2)$.

For any $y_i$ and $y_j$ in $M$, there can be many paths in $M$ that connect $y_i$ and $y_j$. The shortest of such paths is called the geodesic between $y_i$ and $y_j$. (Strictly speaking, geodesics are curves with zero covariant derivatives of their velocity vectors along the curve. A shortest curve must be a geodesic, whereas a geodesic might not be a shortest curve.) For example, the geodesic between two points on a sphere is an arc of a "great circle": a circle whose center coincides with the center of the sphere (Figure 2.2). The length of the geodesic between $y_i$ and $y_j$ is the geodesic distance between $y_i$ and $y_j$.

Figure 2.2: An example of a geodesic. For two points A and B on the sphere, many lines (the dash-dot lines) can be drawn to connect them. However, the shortest of these lines, which is the solid line joining A and B, is called the geodesic between A and B. In the case of a sphere, the geodesic is simply the great circle.

To perform nonlinear mapping, one can assume that there exists a mapping $\phi_{\text{global}}(\cdot)$ that maps all points on $M$ to $\mathbb{R}^d$. The "global co-ordinate" of $y$, denoted by $x = \phi_{\text{global}}(y)$, is regarded as the low dimensional representation of $y$. In general, such a mapping may not exist. (For example, there is no such map (homeomorphism) between all points on a sphere and $\mathbb{R}^2$. However, if we exclude the north pole of the sphere, we can construct such a mapping.) In that case, a mapping that preserves a certain property of the manifold can be constructed to obtain $x$.

Many of the nonlinear mapping algorithms that are manifold-based require a concrete definition of $N(y_i)$, the neighborhood of $y_i$. Two definitions are commonly used. In the $\epsilon$-neighborhood, $y_j \in N(y_i)$ if $\|y_i - y_j\| < \epsilon$, where the norm is the Euclidean distance in $\mathbb{R}^D$. In the knn-neighborhood, $y_j \in N(y_i)$ if $y_j$ is one of the $k$ nearest neighbors of $y_i$ in $\mathcal{Y}$, or vice versa. In both cases, $\epsilon$ or $k$ is a user-defined parameter. The knn neighborhood has the advantage that it is independent of the scale of the data, though it can lead to too small a neighborhood when the number of data points is large. Note that the neighborhood can be defined in a data-driven manner [29] instead of being specified by a user.
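Since the knn-neighborhood construction just described is the entry point for ISOMAP, LLE, and Laplacian eigenmap later in this chapter, a small sketch may be useful. The following builds a symmetric k-nearest-neighbor graph stored as a weighted adjacency matrix of Euclidean edge lengths; the use of SciPy's cKDTree and the dense matrix representation are implementation choices made here for brevity, not requirements of any of the cited algorithms.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph(Y, k=12):
    """Symmetric knn-neighborhood graph.

    Returns an (n, n) matrix W where W[i, j] = ||y_i - y_j|| if y_j is among the
    k nearest neighbors of y_i or vice versa, and 0 otherwise (0 = no edge).
    """
    n = len(Y)
    tree = cKDTree(Y)
    # Query k+1 neighbors because the nearest neighbor of a point is itself.
    dist, idx = tree.query(Y, k=k + 1)
    W = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):
            W[i, j] = d
            W[j, i] = d          # symmetrize: "or vice versa"
    return W

Y = np.random.rand(500, 3)
W = knn_graph(Y, k=12)
print((W > 0).sum(axis=1).mean())   # average degree, at least k
```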
2.3 Sammon's mapping

Sammon's mapping [223], which is an example of metric least square scaling [49], is perhaps the most well-known algorithm for nonlinear mapping. Sammon's mapping is an algorithm for multidimensional scaling: it maps a set of $n$ items into a Euclidean space based on the dissimilarity values. This problem is related to the metric embedding problem considered by theoretical computer scientists [119]. Sammon's mapping can be used for dimensionality reduction if the dissimilarity matrix is based on the Euclidean distance between the data points in the high dimensional space.

Given an $n$ by $n$ matrix of dissimilarity values $\{\delta_{ij}\}$, where $\delta_{ij}$ denotes the dissimilarity between the $i$-th and the $j$-th items, we want to map the $n$ items to $n$ points $\{x_1, \ldots, x_n\}$ in a low dimensional space, such that the distance $d_{ij} = \|x_i - x_j\|$ between $x_i$ and $x_j$ is as "close" to $\delta_{ij}$ as possible. Many different definitions of closeness have been proposed, with the "Sammon stress", defined by Sammon, being the most popular. The Sammon stress $S$ is defined by
$$S = \frac{1}{\sum_{i<j}\delta_{ij}} \sum_{i<j} \frac{(d_{ij} - \delta_{ij})^2}{\delta_{ij}}. \qquad (2.1)$$
Sammon proposed to minimize $S$ by a pseudo-Newton iteration that updates each co-ordinate $x_{uk}$ according to
$$x_{uk} \leftarrow x_{uk} - \alpha\,\frac{\partial S/\partial x_{uk}}{\big|\partial^2 S/\partial x_{uk}^2\big|}, \qquad (2.2)$$
where $\alpha$ is a step size, and the first and second partial derivatives of $S$ are
$$\frac{\partial S}{\partial x_{uk}} = -\frac{2}{\sum_{i<j}\delta_{ij}} \sum_{j \neq u} \frac{\delta_{uj} - d_{uj}}{\delta_{uj}\,d_{uj}}\,(x_{uk} - x_{jk}), \qquad (2.3)$$
$$\frac{\partial^2 S}{\partial x_{uk}^2} = -\frac{2}{\sum_{i<j}\delta_{ij}} \sum_{j \neq u} \frac{1}{\delta_{uj}\,d_{uj}} \Big( (\delta_{uj} - d_{uj}) - \frac{(x_{uk} - x_{jk})^2}{d_{uj}}\Big(1 + \frac{\delta_{uj} - d_{uj}}{d_{uj}}\Big) \Big). \qquad (2.4)$$
One can use a nonlinear optimization algorithm other than Equation (2.2) to minimize $S$. It is also possible to implement Sammon's mapping by a feed-forward neural network [180] or in an incremental manner [129]. Note that Sammon's mapping is "global" and considers all the interpoint distances between the $n$ items. This can be a drawback for data like the Swiss roll data set, where Euclidean distances between pairs of points that are far away from each other do not reveal the true structure of the data.

2.4 Auto-associative neural network

A special type of feed-forward neural network, the "auto-associative neural network" [7, 57], can be used for nonlinear dimensionality reduction. An example of such a network is shown in Figure 2.3. The idea is to model the functional relationship between $x_i$ and $y_i$ by a neural network. If $x_i$ is a good representation of $y_i$, it should contain sufficient information to reconstruct $y_i$ via a neural network (the decoding network), with the "decoding layer" as its hidden layer. To obtain $x_i$ from $y_i$, another neural network (the encoding network) is needed, with the "encoding layer" as its hidden layer. The encoding network and the decoding network are connected so that the output of the encoding network is used as the input of the decoding network, and both of them correspond to $x_i$. The high-dimensional data points $y_i$ are used as both the input and the target for training this neural network. The sum of square error can be used as the objective function for training. Note that the neural network in Figure 2.3 is just an example; alternative architectures can be used. For example, multiple hidden layers can be used, and the number of neurons in the encoding and decoding layers can also be different.

The advantage of this approach is that mapping a new $y$ to the corresponding $x$ is easy: just feed $y$ to the neural network and extract the output of the encoding layer. Also, there exist a number of software packages for training neural networks. The drawback is that it is difficult to determine the
appropriate network architecture to best reduce the dimension for any given data set. Also, training a neural network involves an optimization problem that is considerably more difficult than the eigen-decomposition required by some other nonlinear mapping methods like ISOMAP, LLE, or Laplacian eigenmap, which we shall examine later in this chapter.

Figure 2.3: Example of an auto-associative neural network (input layer, encoding layer, "middle" layer, decoding layer, and output layer). This network extracts $x_i$ with 3 features from the given data $y_i$ with 8 features.

2.5 Kernel PCA

The basic idea of kernel principal component analysis (KPCA) is to transform the input patterns nonlinearly to an even higher dimensional space and then perform principal component analysis in the new space. It is inspired by the success of support vector machines (SVM) [189].

2.5.1 Recap of SVM

Consider a mapping $\phi : \mathbb{R}^D \mapsto H$, where $H$ is a Hilbert space. $H$ can be, for example, a (very) high dimensional Euclidean space. By convention, $\mathbb{R}^D$ and $H$ are called the input space and the feature space, respectively. The point $y_i$ in $\mathbb{R}^D$ is first transformed into the Hilbert space $H$ by $\phi(y_i)$. SVM assumes a suitable transformation $\phi(\cdot)$ such that the transformed data set is more linearly separable in $H$ than in $\mathbb{R}^D$, and a large margin classifier in $H$ is trained to separate the transformed data. It turns out that the large margin classifier can be trained by using only the inner products between the transformed data, $\langle\phi(y_i), \phi(y_j)\rangle$, without knowing $\phi(\cdot)$ explicitly. Therefore, in practice, the kernel function $K(y_i, y_j)$ is specified instead of $\phi(\cdot)$, where $K(y_i, y_j) = \langle\phi(y_i), \phi(y_j)\rangle$. Specifying the kernel function $K(\cdot,\cdot)$ instead of the mapping $\phi(\cdot)$ has the advantage of computational efficiency when $H$ is of high dimension. Also, this allows us to generalize to an infinite dimensional $H$, which happens when the radial basis function kernel is used. This use of a kernel function to replace an explicit mapping is often called "the kernel trick". Intuitively, the kernel function, being an inner product, represents the similarity between $y_i$ and $y_j$.

The kernel trick can be illustrated by the following example with $D = 2$. Let $\phi(y_i) \equiv (y_{i1}^2, \sqrt{2}\,y_{i1}y_{i2}, y_{i2}^2)^T$, where $y_i = (y_{i1}, y_{i2})^T$. The kernel function corresponding to this $\phi(\cdot)$ is $K(y_i, y_j) = (y_{i1}y_{j1} + y_{i2}y_{j2})^2$, because
$$K(y_i, y_j) = (y_{i1}y_{j1} + y_{i2}y_{j2})^2 = y_{i1}^2 y_{j1}^2 + 2\,y_{i1}y_{j1}y_{i2}y_{j2} + y_{i2}^2 y_{j2}^2 = (y_{i1}^2, \sqrt{2}\,y_{i1}y_{i2}, y_{i2}^2)\,(y_{j1}^2, \sqrt{2}\,y_{j1}y_{j2}, y_{j2}^2)^T = \phi(y_i)^T\phi(y_j).$$
Many different kernel functions have been proposed. The polynomial kernel, defined as $K(y_i, y_j) = (y_i^T y_j + 1)^r$ with $r$ as the parameter (degree) of the kernel, corresponds to a polynomial decision boundary in the input space. The radial basis function (RBF) kernel is defined by $K(y_i, y_j) = \exp(-\|y_i - y_j\|^2/w)$, where $w$ is the width parameter. SVM classifiers using the RBF kernel are related to RBF neural networks, except that for SVM, the centers of the basis functions and the corresponding weights are estimated by the quadratic programming solver simultaneously [229]. The choice of the appropriate kernel function in an application is difficult in general. This is still an active research area, with many principles being proposed [121, 154, 227].
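The algebra above can be checked numerically. The short sketch below compares the explicit feature map $\phi$ of the example with the kernel value $(y_i^Ty_j)^2$ for random 2-D points; it is only a sanity check of the identity, and the helper names are of course not part of any SVM library.

```python
import numpy as np

def phi(y):
    """Explicit feature map of the degree-2 example: R^2 -> R^3."""
    return np.array([y[0] ** 2, np.sqrt(2) * y[0] * y[1], y[1] ** 2])

def kernel(yi, yj):
    """Kernel value computed in the input space, without using phi."""
    return float(yi @ yj) ** 2

rng = np.random.default_rng(0)
yi, yj = rng.standard_normal(2), rng.standard_normal(2)
print(phi(yi) @ phi(yj))   # inner product in the feature space
print(kernel(yi, yj))      # same number, computed with the kernel trick
```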
2.5.2 Kernel PCA

One important lesson we can learn from SVM is that a linear algorithm in the feature space corresponds to a nonlinear algorithm in the input space. Different types of nonlinearity can be achieved by different kernel functions. Kernel PCA [228] utilizes this to generalize PCA to become nonlinear. For ease of notation, we shall assume $H$ is of finite dimension. (The case of infinite dimensional $H$ is similar, with operators replacing matrices and eigenfunctions replacing eigenvectors.)

KPCA follows the steps of the standard PCA, except the data set under consideration is $\{\phi(y_1), \ldots, \phi(y_n)\}$. Let $\tilde\phi(y_i)$ be the "centered" version of $\phi(y_i)$,
$$\tilde\phi(y_i) = \phi(y_i) - \frac{1}{n}\sum_{l}\phi(y_l).$$
The covariance matrix $C$ is given by
$$C = \frac{1}{n}\sum_{i}\tilde\phi(y_i)\,\tilde\phi(y_i)^T.$$
The eigenvalue problem $\lambda v = Cv$ is solved to find the (kernel) principal component $v$. Because
$$v = \frac{1}{\lambda n}\sum_{i}\tilde\phi(y_i)\big(\tilde\phi(y_i)^Tv\big),$$
$v$ is in the subspace spanned by the $\tilde\phi(y_i)$, and it can be written as $v = \sum_j \alpha_j\tilde\phi(y_j)$. Denote $\alpha = (\alpha_1, \ldots, \alpha_n)^T$. Let $\tilde K$ be the symmetric matrix such that its $(i,j)$-th entry $\tilde K_{ij}$ is $\tilde\phi(y_i)^T\tilde\phi(y_j)$. Rewrite $\lambda v = Cv$ as
$$\lambda\sum_{j}\alpha_j\,\tilde\phi(y_j) = \frac{1}{n}\sum_{j}\sum_{i}\alpha_j\tilde K_{ij}\,\tilde\phi(y_i). \qquad (2.5)$$
By multiplying both sides with $\tilde\phi(y_l)^T$, we have
$$\lambda\sum_{j}\alpha_j\tilde K_{lj} = \frac{1}{n}\sum_{ij}\alpha_j\tilde K_{ij}\tilde K_{li} \quad \forall\, l, \qquad (2.6)$$
which, in matrix form, can be written as
$$\lambda n\,\tilde K\alpha = \tilde K^2\alpha. \qquad (2.7)$$
Since $\tilde K$ is symmetric, $\tilde K$ and $\tilde K^2$ have the same set of eigenvectors. This set of eigenvectors is also the solution to the generalized eigenvalue problem in Equation (2.7). Therefore, $\alpha$, and hence $v$, can be found by solving $\lambda\alpha = \tilde K\alpha$. For projection purposes, it is customary to normalize $v$ to norm one. Since $\|v\|^2 = \alpha^T\tilde K\alpha$, we should divide $\alpha$ by $\sqrt{\alpha^T\tilde K\alpha}$.

To perform dimensionality reduction for $y$, it is first mapped to the feature space as $\tilde\phi(y)$, and its projection on $v$ is given by
$$\tilde\phi(y)^Tv = \sum_{i}\alpha_i\,\tilde\phi(y)^T\tilde\phi(y_i) = \alpha^T\tilde k_y, \qquad (2.8)$$
where $\tilde k_y = \big(\tilde\phi(y)^T\tilde\phi(y_1), \ldots, \tilde\phi(y)^T\tilde\phi(y_n)\big)^T$. Finally, by rewriting the relationship
$$\tilde K_{ij} = \tilde\phi(y_i)^T\tilde\phi(y_j) = \phi(y_i)^T\phi(y_j) - \frac{1}{n}\sum_{l=1}^{n}\phi(y_i)^T\phi(y_l) - \frac{1}{n}\sum_{l=1}^{n}\phi(y_l)^T\phi(y_j) + \frac{1}{n^2}\sum_{l=1}^{n}\sum_{k=1}^{n}\phi(y_l)^T\phi(y_k)$$
in matrix form, we have
$$\tilde K = H_nKH_n, \qquad (2.9)$$
where $H_n = I - \frac{1}{n}1_{n,n}$ is a centering matrix with $1_{n,n}$ denoting a matrix of size $n$ by $n$ with all entries one, and $K$ is the kernel matrix with its $(i,j)$-th entry given by $K(y_i, y_j)$. A similar expression can be derived for $\tilde\phi(y)^T\tilde\phi(y_j)$.

KPCA solves the eigenvalue problem of an $n$ by $n$ matrix, which may be larger than the $D$ by $D$ matrix considered by PCA. Recall that $D$ is the dimension of $y_i$. The number of possible features to be extracted by KPCA can therefore be larger than $D$. This contrasts with the standard PCA, where at most $D$ features can be extracted.

An interesting problem related to KPCA is how to map $z$, the projection of $\phi(y)$ into the subspace spanned by the first few kernel principal components, back to the input space. This can be useful for, say, image denoising with KPCA [185]. The search for the "best" $y'$ such that $\phi(y') \approx z$ is known as the pre-image problem, and different solutions have been proposed [160, 5].

In summary, KPCA consists of the following steps.

1. Let $K$ be the kernel matrix, where $K_{ij} = K(y_i, y_j)$. Compute $\tilde K$ by $\tilde K = H_nKH_n$.

2. Solve the eigenvalue problem $\lambda\alpha = \tilde K\alpha$ and find the eigenvectors corresponding to the largest few eigenvalues.

3. Normalize $\alpha$ by dividing it by $\sqrt{\alpha^T\tilde K\alpha}$.

4. For any $y$, its projection onto a principal component can be found by $\alpha^T\tilde k_y$, where $\tilde k_y = H_n\big(k_y - \frac{1}{n}K1_{n,1}\big)$, $k_y = \big(K(y, y_1), \ldots, K(y, y_n)\big)^T$, and $1_{n,1}$ is an $n$ by 1 vector with all entries equal to one.
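A compact NumPy sketch of these four steps is given below for the RBF kernel. The kernel choice, its width, and the dense eigen-solver are illustrative assumptions; the normalization follows the $\sqrt{\alpha^T\tilde K\alpha}$ convention above.

```python
import numpy as np

def rbf_kernel(A, B, w=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / w)

def kpca_fit(Y, d=2, w=1.0):
    """Steps 1-3 of KPCA: center the kernel matrix and find the alpha vectors."""
    n = len(Y)
    K = rbf_kernel(Y, Y, w)
    H = np.eye(n) - np.ones((n, n)) / n
    K_tilde = H @ K @ H                                  # step 1
    eigval, eigvec = np.linalg.eigh(K_tilde)             # step 2 (ascending order)
    idx = np.argsort(eigval)[::-1][:d]
    alphas = eigvec[:, idx]
    for c in range(d):                                   # step 3: scale each alpha
        a = alphas[:, c]
        alphas[:, c] = a / np.sqrt(a @ K_tilde @ a)
    return K, alphas

def kpca_project(Y, Ynew, K, alphas, w=1.0):
    """Step 4: project points onto the kernel principal components."""
    n = len(Y)
    H = np.eye(n) - np.ones((n, n)) / n
    k_new = rbf_kernel(Ynew, Y, w)                       # each row is k_y^T
    k_tilde = (k_new - K.sum(axis=1) / n) @ H            # row-wise H_n (k_y - K 1 / n)
    return k_tilde @ alphas

Y = np.random.rand(300, 3)
K, alphas = kpca_fit(Y, d=2, w=0.5)
X = kpca_project(Y, Y, K, alphas, w=0.5)
print(X.shape)   # (300, 2)
```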
2.6 ISOMAP

The basic idea of the isometric feature map (ISOMAP) [248] is to find a mapping that best preserves the geodesic distances between any two points on a manifold. Recall that the geodesic distance between two points on a manifold is defined as the length of the shortest path on the manifold that connects the two points. ISOMAP constructs a mapping from $y_i$ to $x_i$ ($x_i \in \mathbb{R}^d$) such that the Euclidean distance between $x_i$ and $x_j$ in $\mathbb{R}^d$ is as close as possible to the geodesic distance between $y_i$ and $y_j$ on the manifold.

Geodesic distances are hard enough to find when the manifold is known, let alone in the current case where the manifold is unknown and only points on the manifold are given. So, ISOMAP approximates the geodesic distances by first constructing a neighborhood graph to represent the manifold. The vertex $v_i$ in the neighborhood graph $G = (V, E)$ corresponds to the high dimensional data point $y_i$. An edge $e(i, j)$ between $v_i$ and $v_j$ exists if and only if $y_i$ is in the neighborhood of $y_j$, $N(y_j)$, and the weight of this edge is $\|y_i - y_j\|$. Details of $N(y_j)$ are described in Section 2.2. An example of a neighborhood graph is shown in Figure 2.4(b) for the data shown in Figure 2.4(a).

ISOMAP approximates a path on the manifold by a path in the neighborhood graph. The geodesic between $y_i$ and $y_j$ corresponds to the shortest path between $v_i$ and $v_j$. The problem of estimating the geodesic distances between all pairs of points $y_i$ and $y_j$ thus becomes the all-pairs shortest path problem in the neighborhood graph. It can be solved [46] either by the Floyd-Warshall algorithm, or by Dijkstra's algorithm with different source vertices. The latter is more efficient because the neighborhood graph is sparse. An example of how the shortest path approximates the geodesic is shown in Figure 2.4(c). It can be shown that the shortest path distances converge to the geodesic distances asymptotically [18].

Figure 2.4: Example of neighborhood graph and geodesic distance approximation. (a) Input data. (b) The neighborhood graph and an example of the shortest path. (c) This is the same as (b), except the manifold is flattened. The true geodesic (blue line) is approximated by the shortest path (red line).

The next step of ISOMAP finds $x_i$ that best preserve the geodesic distances. Let $g_{ij}$ denote the estimated geodesic distance between $y_i$ and $y_j$, and write $G = \{g_{ij}\}$ as the geodesic distance matrix. The optimal $x_i$ can be found by applying classical scaling [49], a simple multi-dimensional scaling technique. Let $d_{ij} = \|x_i - x_j\|$. Without loss of generality, assume $\sum_i x_i = 0$. We have the following:
$$d_{ij}^2 = (x_i - x_j)^T(x_i - x_j) = \|x_i\|^2 + \|x_j\|^2 - 2\,x_i^Tx_j,$$
$$\sum_{i} d_{ij}^2 = \sum_{i}\|x_i\|^2 + n\|x_j\|^2, \qquad \sum_{j} d_{ij}^2 = n\|x_i\|^2 + \sum_{j}\|x_j\|^2,$$
$$\sum_{ij} d_{ij}^2 = 2n\sum_{i}\|x_i\|^2, \quad\text{so}\quad \sum_{i}\|x_i\|^2 = \frac{1}{2n}\sum_{ij}d_{ij}^2, \qquad \|x_i\|^2 = \frac{1}{n}\sum_{j}d_{ij}^2 - \frac{1}{2n^2}\sum_{jk}d_{jk}^2,$$
and
$$2\,x_i^Tx_j = \frac{1}{n}\sum_{l}d_{il}^2 + \frac{1}{n}\sum_{l}d_{lj}^2 - \frac{1}{n^2}\sum_{lk}d_{lk}^2 - d_{ij}^2. \qquad (2.10)$$
If we replace $d_{ij}$ with the estimated geodesic distance $g_{ij}$ in Equation (2.10), then $b_{ij}$, the target inner product between $x_i$ and $x_j$, is given by
$$b_{ij} = \frac{1}{2}\Big(\frac{1}{n}\sum_{l}g_{il}^2 + \frac{1}{n}\sum_{l}g_{lj}^2 - \frac{1}{n^2}\sum_{lk}g_{lk}^2 - g_{ij}^2\Big). \qquad (2.11)$$
Let $A = \{a_{ij}\}$ with $a_{ij} = -\frac{1}{2}g_{ij}^2$. Equation (2.11) means that $B = H_nAH_n$, where $B = \{b_{ij}\}$, $H_n = I - \frac{1}{n}1_{n,n}$, and $1_{n,n}$ denotes an $n$ by $n$ matrix with all entries one.
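The double-centering $B = H_nAH_n$ and the subsequent eigen-decomposition (described in the next paragraphs) can be sketched in a few lines. The code below takes a matrix of approximate geodesic distances, forms $B$, and returns a $d$-dimensional classical-scaling embedding; computing the geodesic distances themselves is assumed to have been done already, for instance with Dijkstra's algorithm on the neighborhood graph of Section 2.2 (e.g., via scipy.sparse.csgraph.shortest_path).

```python
import numpy as np

def classical_scaling(G, d=2):
    """Embed points from a (geodesic) distance matrix G via classical scaling.

    G : (n, n) symmetric matrix of pairwise distances g_ij
    Returns X of shape (d, n), i.e., X = [sqrt(l_1) v_1, ..., sqrt(l_d) v_d]^T.
    """
    n = len(G)
    A = -0.5 * G ** 2                       # a_ij = -g_ij^2 / 2
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix H_n
    B = H @ A @ H                           # target inner products, Eq. (2.11)
    eigval, eigvec = np.linalg.eigh(B)      # ascending eigenvalues
    idx = np.argsort(eigval)[::-1][:d]      # keep the d largest
    lam, V = eigval[idx], eigvec[:, idx]
    lam = np.clip(lam, 0.0, None)           # guard against tiny negative values
    return (V * np.sqrt(lam)).T             # rows are sqrt(lambda_i) * v_i^T

# Toy usage with plain Euclidean distances (so the result reduces to ordinary MDS).
Y = np.random.rand(100, 3)
G = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
X = classical_scaling(G, d=2)
print(X.shape)   # (2, 100)
```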
Computing $H_nAH_n$ is effectively a centering operation on $A$, i.e., each column is subtracted by its corresponding column mean, and each row is subtracted by its corresponding row mean. Because multiplication by $H_n$ has this effect of "zeroing" the means of the different rows and columns, $H_n$ is often referred to as the centering matrix. The centering operation is also seen in other embedding algorithms such as KPCA (Section 2.5).

Since $B$ is the matrix of target inner products, we have $B = X^TX$, where $X = [x_1, \ldots, x_n]$. We recover $X$ by finding the best rank-$d$ approximation of $B$, which can be obtained via the eigen-decomposition of $B$. Let $\lambda_1, \ldots, \lambda_d$ be the $d$ largest eigenvalues of $B$ with corresponding eigenvectors $v_1, \ldots, v_d$. We have $X = [\sqrt{\lambda_1}v_1, \ldots, \sqrt{\lambda_d}v_d]^T$. Here, we assume $\lambda_i > 0$ for all $i = 1, \ldots, d$. Unlike Sammon's mapping, the objective function for the optimal $X$ is less explicit: it is the sum of square errors (squared Frobenius norm) between the target inner products ($b_{ij}$) and the actual inner products ($x_i^Tx_j$).

One drawback of ISOMAP is the $O(n^2)$ memory requirement for storing the dense matrix of geodesic distances. Also, solving the eigenvalue problem of a large dense matrix is relatively slow. To reduce both the computational and memory requirements, landmark ISOMAP [55] sets apart a subset of $\mathcal{Y}$ as landmark points and preserves only the geodesic distances from $y_i$ to these landmark points. A similar idea has been applied to Sammon's mapping before [25]. A continuum version of ISOMAP has also been proposed [282]. ISOMAP can fail when there is a "hole" in the manifold [66]. We also want to note that an exact isometric mapping of a manifold is theoretically possible only when the manifold is "flat", i.e., when the curvature tensor is zero, as pointed out in [16].

To summarize, ISOMAP consists of the following steps:

1. Construct a neighborhood graph using either the $\epsilon$ neighborhood or the knn neighborhood.

2. Solve the all-pairs shortest path problem on the neighborhood graph to obtain an estimate of the geodesic distances $g_{ij}$.

3. Compute $A = \{a_{ij}\}$, where $a_{ij} = -\frac{1}{2}g_{ij}^2$, and $B = H_nAH_n$.

4. Find the $d$ largest eigenvalues and the corresponding eigenvectors of $B$, and set $X = [\sqrt{\lambda_1}v_1, \ldots, \sqrt{\lambda_d}v_d]^T$.

2.7 Locally Linear Embedding

In locally linear embedding (LLE) [219, 226], each local region on a manifold is approximated by a linear hyperplane. LLE maps the high dimensional data points into a low dimensional space so that the local geometric properties, represented by the reconstruction weights, are best preserved. Specifically, $y_i$ is reconstructed by its projection $\hat y_i$ on the hyperplane $H$ passing through its neighbors $N(y_i)$ (defined in Section 2.2). Mathematically,
$$y_i \approx \hat y_i = \sum_{j} w_{ij}\,y_j,$$
with the constraint $\sum_j w_{ij} = 1$ to reflect the translational invariance of the reconstruction. By minimizing the sum of square errors of this approximation, we can also achieve invariance to rotation and scaling. The weights $w_{ij}$ reflect the local geometric properties of $y_i$. This interpretation of $w_{ij}$, however, is reasonable only when $y_i$ is well approximated by $\hat y_i$, i.e., when $y_i$ is close to $H$. The weights are found by solving the following optimization problem:
$$\min_{\{w_{ij}\}} \sum_{i}\Big\|y_i - \sum_{j}w_{ij}\,y_j\Big\|^2 \quad\text{subject to}\quad \sum_{j}w_{ij} = 1,\;\; w_{ij} = 0 \text{ if } y_j \notin N(y_i), \text{ for all } i. \qquad (2.12)$$
Now, write $N(y_i) = \{y_{\tau_1}, \ldots, y_{\tau_L}\}$ and denote $z_j = y_{\tau_j}$. Note that $y_i \notin N(y_i)$. The optimization problem (2.12) can be solved efficiently by first constructing an $L$ by $L$ matrix $F$ such that $f_{jk} = (z_j - y_i)^T(z_k - y_i)$.
Equivalently, $F = (Z - y_i1_{1,L})^T(Z - y_i1_{1,L})$, where $F = \{f_{jk}\}$, $1_{1,L}$ is a 1 by $L$ vector with all entries one, and $Z = [z_1, \ldots, z_L]$. The next step is to solve the equation
$$Fu = 1_{L,1}, \qquad (2.13)$$
and then normalize the solution $u$ by $\tilde u_j = u_j/\sum_j u_j$. (The normalization is valid because $\sum_j u_j = 1_{1,L}F^{-1}1_{L,1}$, and hence $\sum_j u_j$ cannot be zero, by the positive definiteness of $F^{-1}$.) The values of $\tilde u_j$ are assigned to the corresponding $w_{i\tau_j}$, i.e., $w_{i\tau_j} = \tilde u_j$, and the rest of the $w_{ij}$ are set to zero.

Sometimes, $F$ can be singular. This can happen when the neighborhood size $L$ is larger than $D$, the dimension of $y$. In this case, a small regularization term $\epsilon I_L$ is added to $F$ before solving Equation (2.13). This regularization has the effect of preferring values of $w_{ij}$ with small $\sum_j w_{ij}^2$. Finding $u_j$ is efficient because only small linear systems of equations are solved. Note that $u_j$ can be negative and $\hat y_i$ can be outside the convex hull of $N(y_i)$.

In the second phase of LLE, we seek $X = [x_1, \ldots, x_n]$ such that $x_i \approx \sum_j w_{ij}x_j$, with $x_i \in \mathbb{R}^d$. To make the problem well-defined, the additional constraints $\sum_i x_i = 0$ and $\sum_i x_ix_i^T = I_d$ are needed. The second constraint has the effect of both fixing the scale and enforcing different features in $x_i$ to carry independent information, by requiring the sample covariances between different variables in $x_i$ to be zero. The optimization problem is now
$$\min_{\{x_i\}} \sum_{i}\Big\|x_i - \sum_{j}w_{ij}\,x_j\Big\|^2 \quad\text{subject to}\quad \sum_{i}x_i = 0 \;\text{ and }\; \sum_{i}x_ix_i^T = I_d. \qquad (2.14)$$
Note the similarity between Equations (2.12) and (2.14). Let $x^{(i)}$ denote the $i$-th row of $X$. Equation (2.14) can be rewritten as
$$\min_{X}\;\mathrm{trace}\big(X(I - W)^T(I - W)X^T\big) \quad\text{subject to}\quad x^{(i)T}1_{n,1} = 0 \;\text{ and }\; x^{(i)T}x^{(j)} = \delta_{ij}. \qquad (2.15)$$
This can be solved by eigen-decomposition of $M = (I - W)^T(I - W)$. Note that $M$ is positive semi-definite. Let $v_j$ be the eigenvector corresponding to the $(j+1)$-th smallest eigenvalue. The optimal $X$ is given by $X = [v_1, \ldots, v_d]^T$. The first constraint is automatically satisfied because $1_{n,1}$ is the eigenvector of $M$ with eigenvalue 0. This eigenvalue problem is relatively easy because $M$ is sparse and can be represented as a product of the sparser matrices $(I - W)^T$ and $(I - W)$.

The above exposition of LLE assumes the pattern matrix as input. LLE can be modified to work with a dissimilarity matrix [226]. There is also a supervised extension of LLE [53, 54], which uses the class labels to modify the neighborhood structure. The kernel trick can also be applied to LLE to visualize the data points in the feature space [56]. The case when LLE is applied to data sets with a natural clustering structure has been examined in [206].

In summary, LLE includes the following steps:

1. Find the neighbors of each $y_i$ according to either the $\epsilon$-neighborhood or the knn neighborhood.

2. For each $y_i$, form the matrix $F$ and solve the equation $Fu = 1_{L,1}$. After normalizing $u$ by $\tilde u_j = u_j/\sum_j u_j$, set $w_{i\tau_j} = \tilde u_j$ and the remaining $w_{ij}$ to zero.

3. Find the second to the $(d+1)$-th smallest eigenvalues of $(I - W)^T(I - W)$ by a sparse eigenvalue solver and let $\{v_1, \ldots, v_d\}$ be the corresponding eigenvectors.

4. Obtain the reduced dimension representation by $X = [v_1, \ldots, v_d]^T$.

2.8 Laplacian Eigenmap

The approach taken by Laplacian eigenmap [16] for nonlinear mapping is different from those of ISOMAP and LLE. Laplacian eigenmap constructs orthogonal smooth functions defined on the manifold based on the Laplacian of the neighborhood graph. It has its roots in spectral graph theory [42]. As in ISOMAP, a
neighborhood graph $G = (V, E)$ is first constructed. Unlike ISOMAP, where the weight $w_{ij}$ of the edge $(v_i, v_j)$ represents the distance between $v_i$ and $v_j$, the weight in Laplacian eigenmap represents the similarity between $v_i$ and $v_j$. The weight $w_{ij}$ can be set by
$$w_{ij} = \exp\Big(-\frac{\|y_i - y_j\|^2}{4t}\Big), \qquad (2.16)$$
with $t$ as an algorithmic parameter, or it can be simply set to one. The use of the exponential function to transform a distance value into a similarity value can be justified by its relationship to the heat kernel [16].

The nonlinear mapping problem is recast as the graph embedding problem that maps the vertices in the neighborhood graph $G$ to $\mathbb{R}^d$. The first step is to find a "good" function $f(\cdot): V \mapsto \mathbb{R}$ that maps the vertices in $G$ to a real number. Since the domain of $f(\cdot)$ is finite, $f(\cdot)$ can be represented by a vector $u$, with $f(v_i) = u_i$. According to spectral graph theory, the smoothness of $f$ can be defined by
$$S \equiv \frac{1}{2}\sum_{ij}w_{ij}(u_i - u_j)^2. \qquad (2.17)$$
The intuition of $S$ is that, for large $w_{ij}$, the vertices $v_i$ and $v_j$ are "similar" and hence the difference between $f(v_i)$ and $f(v_j)$ should be small if $f(\cdot)$ is smooth. A smooth mapping $f(\cdot)$ is desirable because a faithful embedding of the graph should assign similar values to $v_i$ and $v_j$ when they are close. We can rewrite $S$ as
$$S = \frac{1}{2}\sum_{ij}w_{ij}\big(u_i^2 + u_j^2 - 2u_iu_j\big) = \frac{1}{2}\Big(\sum_i u_i^2\sum_j w_{ij} + \sum_j u_j^2\sum_i w_{ij} - 2\sum_{ij}w_{ij}u_iu_j\Big) = \sum_i u_i^2 d_{ii} - \sum_{ij}w_{ij}u_iu_j = u^TLu, \qquad (2.18)$$
where $L$ is the graph Laplacian defined by $L = D - W$, $W = \{w_{ij}\}$ is the graph weight matrix, and $D$ is a diagonal matrix with $d_{ii} = \sum_j w_{ij}$. The matrix $L$ can be thought of as the Laplacian operator on functions defined on the graph. Since $d_{ii}$ can be interpreted as the importance of $v_i$, the natural inner product between two functions $f_1(\cdot)$ and $f_2(\cdot)$ defined on the graph is $\langle f_1, f_2\rangle = u_1^TDu_2$. Because the constant function is the smoothest and is uninteresting, we seek $f(\cdot)$ to be as smooth as possible while being orthogonal to the constant function. The norm of $f(\cdot)$ is constrained to be one to make the problem well-defined. Thus we want to solve
$$\min_{u}\; u^TLu \quad\text{subject to}\quad u^TDu = 1 \;\text{ and }\; u^TD1_{n,1} = 0. \qquad (2.19)$$
This can be done by solving the generalized eigenvalue problem
$$Lu = \lambda Du, \qquad (2.20)$$
after noting that $1_{n,1}$ is a solution to Equation (2.20) with $\lambda = 0$. Here, $1_{n,1}$ denotes an $n$ by 1 vector with all entries one. As $L$ is positive semi-definite, the eigenvector corresponding to the second smallest eigenvalue of Equation (2.20) yields the desired $f(\cdot)$.

In general, $d$ orthogonal functions $\{f_1(\cdot), \ldots, f_d(\cdot)\}$ that are as smooth as possible are sought to map the vertices to $\mathbb{R}^d$. (Orthogonality is preferred as it suggests the independence of information; also, in PCA, each of the extracted features is orthogonal to the others.) The functions can be obtained from the eigenvectors corresponding to the second to the $(d+1)$-th smallest eigenvalues in Equation (2.20). The low dimensional representation of $y_i$ is then given by $x_i = (f_1(v_i), f_2(v_i), \ldots, f_d(v_i))^T$. In matrix form, $X = [u_1, \ldots, u_d]^T$.

The embedding problem of the neighborhood graph and the embedding problem of the points in the manifold are related in the following way. A smooth function $f(\cdot)$ that maps the point $y_i$ in the manifold to $x_i \in \mathbb{R}^d$ is preferable, because a faithful mapping should give similar values (small $\|x_i - x_j\|$) to $y_i$ and $y_j$ when $\|y_i - y_j\|$ is small. A small $\|y_i - y_j\|$ corresponds to a large $w_{ij}$ in the graph. Thus, intuitively, a smooth function defined on the graph corresponds to a smooth function defined
on the manifold. In fact, this relationship can be made more rigorous, because the graph Laplacian is closely related to the Laplace-Beltrami operator on the manifold, which in turn is related to the smoothness of a function defined on the manifold. The eigenvectors of the graph Laplacian correspond to the eigenfunctions of the Laplace-Beltrami operator, and the eigenfunctions with small eigenvalues provide a "smooth" basis of the functions defined on the manifold. The neighborhood graph used in Laplacian eigenmap can thus be viewed as a discretization tool for computation on the manifold.

There is also a close relationship between Laplacian eigenmap and spectral clustering. In fact, the spectral clustering algorithm in [194] is almost the same as first performing Laplacian eigenmap and then applying k-means clustering on the low dimensional feature vectors. The manifold structure discovered by Laplacian eigenmap can also be used to train a classifier in a semi-supervised setting [182]. The Laplacian of a graph can also lead to an interesting kernel function (as in SVM) for vertices in a graph [154]. This idea of nonlinear mapping via graph embedding has also been extended to learn a linear mapping [109] as well as generalized to the case when a vector is associated with each vertex in the graph [31].

To sum up, the steps for Laplacian eigenmap include:

1. Construct a neighborhood graph of $\mathcal{Y}$ by either the $\epsilon$-neighborhood or the knn neighborhood.

2. Compute the edge weight $w_{ij}$ by either $\exp\big(-\|y_i - y_j\|^2/(4t)\big)$, or simply set $w_{ij}$ to 1.

3. Compute $D$ and the graph Laplacian $L$.

4. Find the second to the $(d+1)$-th smallest eigenvalues in the generalized eigenvalue problem $Lu = \lambda Du$ and denote the eigenvectors by $u_1, \ldots, u_d$. The low dimensional feature vectors are given by $X = [u_1, \ldots, u_d]^T$.

2.9 Global Co-ordinates via Local Co-ordinates

Recall that in Section 2.2, an atlas of a manifold $M$ is defined as a collection of co-ordinate charts that covers the entire $M$, and overlapping charts can be "connected" smoothly. This idea has inspired several nonlinear mapping algorithms [220, 29, 247] which construct different local charts and join them together. There are two stages in these types of algorithms. First, different local models are fitted to the data, usually by means of a mixture model. Each local model gives rise to a local co-ordinate system. A local model can be, for example, a Gaussian or a factor analyzer. Let $z_{is}$ be the local co-ordinate given to $y_i$ by the $s$-th local co-ordinate system. Let $r_{is}$ denote the suitability of using the $s$-th local model for $y_i$. We require $r_{is} \ge 0$ and $\sum_s r_{is} = 1$. The introduction of $r_{is}$ can represent the fact that only a small number of local models are meaningful for each $y_i$. Typically, $r_{is}$ is obtained as the posterior probability of the $s$-th local model given $y_i$.

In the second stage, different local co-ordinates of $y_i$ are combined to give a global co-ordinate. Let $g_{is}$ be the global co-ordinate of $y_i$ due to the $s$-th local model, and let $g_i \in \mathbb{R}^d$ be the corresponding "combined" global co-ordinate. In the three papers considered here, $g_{is}$ is simply an affine transform of the local co-ordinate, $g_{is} = L_s\tilde z_{is}$. Here, $\tilde z_{is}$ is the "augmented" $z_{is}$, $\tilde z_{is} = [z_{is}^T, 1]^T$. $L_s$ is the (unknown) transformation matrix with $d$ rows for the $s$-th local model.
Note that it is desirable for neighboring local models to be "similar" so that the global co-ordinates are more consistent. An important characteristic of the algorithms in this section is that, unlike ISOMAP, LLE, or Laplacian eigenmap, extension to a point $y$ that is outside the training data $\mathcal{Y}$ is easy after computing $z_s$ and $r_s$ for the different $s$.

2.9.1 Global Co-ordination

In the global co-ordination algorithm in [220], the first and the second stages are performed simultaneously by the variational method. The first stage is done by fitting a mixture of factor analyzers. Under the $s$-th local model, a data point is modeled by
$$y_i = \mu_s + A_sz_{is} + \epsilon_{is}, \qquad (2.21)$$
where $\mu_s$ is the mean, $A_s$ is the factor loading matrix, and $\epsilon_{is}$ is the noise that follows $\mathcal{N}(0, \Psi_s)$, a multivariate Gaussian with mean $0$ and covariance $\Psi_s$. By the definition of a factor analyzer, $\Psi_s$ is diagonal. The hidden variable $z_{is}$ is assumed to follow $\mathcal{N}(0, I)$. The scale of $z_{is}$ is unimportant because it can be absorbed by the factor loading matrix. Let $\alpha_s$ be the prior probability of the $s$-th factor analyzer. The parameters are $\{\alpha_s, \mu_s, A_s, \Psi_s\}$, and the data density is given by
$$p(y_i) = \sum_s \alpha_s\int p(y_i|z_{is}, s)\,p(z_{is})\,dz_{is} = \sum_s \alpha_s\,(2\pi)^{-D/2}\big(\det(A_sA_s^T + \Psi_s)\big)^{-1/2}\exp\Big(-\frac{1}{2}(y_i - \mu_s)^T(A_sA_s^T + \Psi_s)^{-1}(y_i - \mu_s)\Big). \qquad (2.22)$$
We define $r_{is}$ as the posterior probability of the $s$-th local model given $y_i$, $P(s|y_i)$, and it can be computed based on Equation (2.22). Equation (2.22) also gives rise to $p(z_{is}|s, y_i)$ and hence $p(g_{is}|s, y_i)$, because $g_{is}$ is a function of $z_{is}$ and $L_s$. The posterior probability of the global co-ordinate is defined as
$$p(g_i|y_i) = \sum_s p(g_{is}|s, y_i)\,P(s|y_i). \qquad (2.23)$$
Equation (2.23) assumes that the overall global co-ordinate $g_i$ is selected among the different $g_{is}$, with $s$ stochastically selected according to the posterior probability of the $s$-th model. In the case where $y_i$ is likely to be generated either by the $j$-th or the $k$-th local model, the corresponding global co-ordinates $g_{ij}$ and $g_{ik}$ should be similar. This implies that the posterior density $p(g_i|y_i)$ should be unimodal. Enforcing the unimodality of $p(g_i|y_i)$ directly is difficult. So, the authors in [220] instead drive $p(g_i|y_i)$ to be as similar to a Gaussian distribution as possible by adding an extra term to the log-likelihood objective function to be maximized:
$$\Phi = \sum_i \log p(y_i) - \sum_i D_{KL}\big(q(g_i, s|y_i)\,\big\|\,p(g_i, s|y_i)\big). \qquad (2.24)$$
Here, $D_{KL}(Q\|P)$ is the Kullback-Leibler divergence defined as
$$D_{KL}(Q\|P) = \int Q(y)\,\log\frac{Q(y)}{P(y)}\,dy, \qquad (2.25)$$
and $q(g_i, s|y_i)$ is assumed to be factorized as $q(g_i, s|y_i) = q_1(g_i|y_i)\,q_2(s|y_i)$, with $q_1(g_i|y_i)$ a Gaussian and $q_2(s|y_i)$ a multinomial distribution. This addition of a divergence term between a posterior distribution and a factorized distribution is commonly seen in the literature on the variational method. The objective function in Equation (2.24) can be maximized by an EM-type algorithm, which estimates the parameters $\{\alpha_s, \mu_s, A_s, \Psi_s, L_s\}$ as well as the parameters of $q_1(g_i|y_i)$ and $q_2(s|y_i)$. Since the first and the second stages are carried out simultaneously, local models that lead to consistent global co-ordinates are implicitly favored.

2.9.2 Charting

For the charting algorithm in [29], the first and the second stages are performed separately. This decoupling decreases the complexity of the optimization problem and can reduce the chance of getting trapped in poor local minima. In the first stage, a mixture of Gaussians is fitted to the data,
$$p(y) = \sum_s \alpha_s\,\mathcal{N}(y; \mu_s, \Sigma_s), \qquad (2.26)$$
with the constraint that two adjacent Gaussians should be "similar".
This is achieved by using a prior distribution on the mean vectors and the covariance matrices that encourages the similarity of adjacent Gaussians:
$$p(\{\mu_s\}, \{\Sigma_s\}) \propto \exp\Big(-\sum_s\sum_{j\neq s}\Lambda_s(\mu_j)\,D_{KL}\big(\mathcal{N}(\mu_s, \Sigma_s)\,\big\|\,\mathcal{N}(\mu_j, \Sigma_j)\big)\Big), \qquad (2.27)$$
where $\Lambda_s(\mu_j)$ measures the closeness between the locations of the $s$-th and the $j$-th Gaussian components. It is set to $\Lambda_s(\mu_j) \propto \exp\big(-\|\mu_s - \mu_j\|^2/(2\sigma^2)\big)$, where $\sigma$ is a width parameter determined according to the neighborhood structure. The prior distribution also makes the parameter estimation problem better conditioned. In practice, $n$ Gaussian components are used, with the center $\mu_i$ of the $i$-th component set to $y_i$ and the weight of each component set to $1/n$. The only parameters to be estimated are the covariance matrices. The MAP estimates of the covariance matrices can be shown to satisfy a set of constrained linear equations, and they are obtained by solving this set of equations.

In the second stage, the local co-ordinate $z_{is}$ is first obtained as $z_{is} = V_s^T(y_i - \mu_s)$, where $V_s$ consists of the $d$ leading eigenvectors of $\Sigma_s$. We can regard $z_{is}$ as the feature extracted from $y_i$ using PCA on the $s$-th local model. The local model weight $r_{is}$ is, once again, set to the posterior probability of the $s$-th local model given $y_i$. The transformation matrices $L_s$ are found by solving the following weighted least square problem:
$$\min_{\{L_s\}}\;\sum_{i,j,k} r_{ij}r_{ik}\,\|L_j\tilde z_{ij} - L_k\tilde z_{ik}\|_F^2. \qquad (2.28)$$
Here, $\|X\|_F^2$ denotes the square of the Frobenius norm, $\|X\|_F \equiv \sqrt{\mathrm{trace}(X^TX)}$. Intuitively, we want to find the transformation matrices such that the global co-ordinates due to different local models are the most consistent in the least square sense, weighted by the importance of the different local models.

Equation (2.28) can be solved as follows. Let $K$ and $h$ be the number of local models and the length of the augmented local co-ordinate $\tilde z_{is}$, respectively. Define $\tilde Z_s = [\tilde z_{1s}, \ldots, \tilde z_{ns}]$ as the $h$ by $n$ matrix of augmented local co-ordinates using the $s$-th local model for all the data points. Define the $Kh$ by $n$ matrix $T_s$ by $T_s = [0_{n,(s-1)h}, \tilde Z_s^T, 0_{n,(K-s)h}]^T$, where $0_{n,m}$ denotes a zero matrix of size $n$ by $m$. Let $P_s$ be an $n$ by $n$ diagonal matrix whose $(i,i)$-th entry is $r_{is}$. The solution to Equation (2.28) is given by the $d$ trailing eigenvectors of the $Kh$ by $Kh$ matrix $QQ^T$, where $Q = \sum_{j=1}^{K}\sum_{k=1}^{K}(T_j - T_k)P_jP_k$. Note that the second stage is independent of the first stage. In particular, an alternative collection of local models can be used, as long as $z_{is}$ and $r_{is}$ can be calculated.

2.9.3 LLC

The LLC algorithm described in [247] concerns the second stage only. Given the local co-ordinates $z_{is}$ and the model confidences $r_{is}$ computed from the first stage, the LLC algorithm finds the best $L_s$ such that the local geometric properties are best preserved in the sense of the LLE loss function. The global co-ordinate $g_i$ is assumed to be a weighted sum of the form
$$g_i = \sum_s r_{is}\,g_{is} = \sum_s r_{is}\,L_s\tilde z_{is}. \qquad (2.29)$$
Suppose there are $K$ local models, each of which gives a local co-ordinate $z_{is}$ in an $(h-1)$-dimensional space. (In general, different local models can give local co-ordinates with different lengths, as emphasized in [247]. Here we assume a common $h$ for ease of notation.) We stack $r_{is}\tilde z_{is}$ for the different $s$ to get a vector of length $Kh$, $u_i = (r_{i1}\tilde z_{i1}^T, r_{i2}\tilde z_{i2}^T, \ldots, r_{iK}\tilde z_{iK}^T)^T$, and concatenate the different $L_s$ to form a $d$ by $Kh$ matrix $J = [L_1, L_2, \ldots, L_K]$. (Each $L_s$ is of size $d$ by $h$.) Equation (2.29) can be rewritten as $g_i = Ju_i$. The global co-ordinate matrix, $G = (g_1, \ldots, g_n)$, is thus given by $G = JU$, where $U$ is a $Kh$ by $n$ matrix $U = [u_1, \ldots, u_n]$. Denote the $i$-th row of $J$ by $j^{(i)}$.
If we substitute $G$ for $X$ in the LLE objective function in Equation (2.15), we have
$$\min_{J}\;\mathrm{trace}\big(JU(I - W)^T(I - W)U^TJ^T\big) \quad\text{subject to}\quad j^{(i)}U1_{n,1} = 0 \;\text{ and }\; j^{(i)}UU^Tj^{(j)T} = \delta_{ij}, \qquad (2.30)$$
where $W$ is defined in the same way (the neighborhood reconstruction weights) as in Section 2.7, and $1_{n,1}$ denotes an $n$ by 1 vector with all entries one. Note that obtaining $W$ is efficient (see Section 2.7 for details). The value of $j^{(i)}$ can be obtained as the solution of the generalized eigenvalue problem $\big(U(I - W)^T(I - W)U^T\big)v = \lambda\,(UU^T)\,v$. The authors in [247] claim that the $j^{(i)}$ thus obtained satisfies the constraint $j^{(i)}U1_{n,1} = 0$ automatically, because $U1_{n,1}$ is an eigenvector of the generalized eigenvalue problem with eigenvalue 0. However, this is not true in general. In any case, the authors in [247] use the eigenvectors corresponding to the second to the $(d+1)$-th smallest eigenvalues as the solution for $J$. Note that this generalized eigenvalue problem involves a $Kh$ by $Kh$ matrix, instead of a large $n$ by $n$ matrix as in the original LLE. After finding the $j^{(i)}$, $J$ and hence the $L_s$ are reconstructed. The global co-ordinates are obtained via Equation (2.29).

The idea of this algorithm is somewhat analogous to the locality preserving projection (LPP) algorithm [109]. LPP simplifies the eigenvalue problem by the extra information that the projection should be linear, whereas the current algorithm simplifies the eigenvalue problem by the given mixture model.

2.10 Experiments

We applied some of these algorithms to three synthetic 3D data sets. The data manifolds and the data points can be seen in Figure 2.5. The first data set, parabolic, consists of 2000 randomly sampled data points lying on a paraboloid. It is an example of a nonlinear manifold with a simple analytic form, a second degree polynomial in the co-ordinates in this case. The second data set swiss roll and the third data set S-curve are commonly used for validating manifold learning algorithms. Again, 2000 points are randomly sampled from the "Swiss roll" and the S-shaped surface to create the data sets, respectively.

KPCA, ISOMAP, LLE, and Laplacian eigenmap were run on these 3D data sets to project the data to 2D. We have implemented KPCA and Laplacian eigenmap ourselves, while the implementations of ISOMAP (http://isomap.stanford.edu) and LLE (http://www.cs.toronto.edu/~roweis/lle/) were downloaded from their respective web sites. For ISOMAP, LLE, and Laplacian eigenmap, the knn neighborhood with k = 12 is used. The edge weight is set to one for Laplacian eigenmap. For KPCA, a polynomial kernel with degree 2 is used. For comparison, the standard PCA and Sammon's mapping were also performed on these data sets. Sammon's mapping is initialized by the result of PCA. The results of these algorithms can be seen in Figures 2.6, 2.7, and 2.8. The data points are colored differently to visualize their locations on the manifold. We intentionally omit the "goodness-of-fit" or "error" of the projection results, because the criteria used by the different algorithms (Sammon's stress in Sammon's mapping, correlation of distances in ISOMAP, reconstruction error in LLE, residual variance in PCA and KPCA, to name a few) are very different and it can be misleading to compare them.
For the parabolic data set, we can see in Figures 2.6(b) and 2.6(c) that both ISOMAP and LLE recover the intrinsic co—ordinates very well, because the changes in the color of the data points after embedding are smooth. Since this manifold is quadratic, we expect that KPCA with a. quadratic kernel function should also recover the true structure of the data. It turns out that the first two kernel principal components cannot lead to a clean mapping of the data points. Instead, the second and the third kernel principal components extract the structure of the data (Figure 2.6(a)). The first two features extracted by Laplacian eigenmap cannot recover the 10ISOMAP web site: http://stanford. isomap.edu 11LLE web site: http://www.cs.toronto.edu/~roweis/lle/ 75 20 \ ‘ T i 1, 'Fl 7 ' 7' ‘7‘, o 0 ° -30 —20 -1o 0 ~60 -50 —4o -30 -20 —10 —so -50 —40 (a) parabolic, the manifold (b) parabolic, the data -15 —10 -5 o 5 $_,_ 10 15 20 0 (c) swiss roll, the manifold -0.5 o 0.5 1 o 0-5; 1 o (e) S-curve, the manifold (f) S—curve, the data Figure 2.5: Data sets used in the experiments for nonlinear mapping. The manifold and the data points are shown. The data points are colored according to the major structure of the data as perceived by human. 76 desired trend in the data. The target structure with slight distortion can be recovered if the second and the third extracted features are used instead (Figure 2.6(d)). PCA and Sammon’s mapping cannot recover the structure of this data set (Figures 2.6(c) and 2.6(f)). The similarity of the results of PCA and Sammon’s mapping can be attributed to the fact that Saminons mapping is initialized by the PCA solution. The initial PCA solution is already a good solution with respect to Sammon’s stress for this low-dimensional data set. For the data set swiss roll, we can see from Figures 2.7(b) and 2.7(c) that ISOMAP and LLE performed a good job “unfolding” the manifold. For Laplacian eigenmap, once again, the first two extracted features cannot be interpreted easily, though the structure of the data set is revealed if the second and the third features are used (Figure 2.7(d)). KPCA cannot recover the intrinsic structure of the data set no matter which kernel principal component is used. An example of the poor result of KPCA is shown in Figure 2.7(a). PCA and Sammon’s mapping also cannot recover the underlying structure (Figures 2.7(c) and 2.7(f)). The results for the third data set S-curve (Figure 2.8) are similar to those of swiss roll, with the exception that Laplacian eigenmap can recover the desired structure using the first two extracted features. In addition to these synthetic data sets, we have also tested these nonlinear map- ping algorithms on a high—dimensional real world data set: the face images used in [175] The task here is to classify a 64 by 64 face image in this data set as either the “Asian class” or the “non-Asian class”. This data set will be described in more details in Section 3.3. The results of mapping these 4096B data points to 3D can be 77 seen in Figure 2.9. Data. points from the two classes are shown in different colors. The (training) error rates using quadratic discriminant analysis are also computed for different mappings. As we can see from Figures 2.9(a), 2.9(d), 2.9(e) and 2.9(f), the mapping results by Laplacian eigenmap, KPCA, PCA and Sammon’s mapping are not very useful. The two classes are not well-separated, and the error rates are also high. ISOMAP maps the two classes more separately and has smaller error rates (Figure 2.9(b)). 
For LLE (Figure 2.9(e)), although the mapping results look a bit unnatural, the error rate turns out to be the smallest, indicating the two classes are reasonably separated. It should be noted that the intrinsic dimensionality of this data set is probably higher than 3. So. mapping the data to 3D, while good for visualization, can lose some information and is suboptimal for classification. From these experiments, we can see that both ISOMAP and LLE recover the intrinsic structure of the data sets well. The performance of Laplacian eigenmap is less satisfactory. We have attempted to set the edge weight by the exponential function of distances (Equation (2.16)) instead of one, but the preliminary results suggest that a good choice of the width parameter t is hard to obtain. The standard PCA and Sammon’s mapping cannot recover the target structure of the data. It is not surprising, because PCA is a linear algorithm and the underlying structure of the data cannot be reflected by any linear function of the features. For Sammon’s mapping, it does not give very good results because Sammon’s mapping is “global”, meaning that the relationship between all pairs of data points in the 3D space is considered. Local properties of the manifold cannot be modeled. The reason for the failure of KPCA is that the parametric representation of the manifold for swiss r011 78 and S-curve and the face images is hard to obtain, and is certainly not quadratic. So, the assumption in KPCA is violated and this leads to poor results. 2.11 Summary In this chapter, we have described different approaches for nonlinear mapping based on fairly different principles. The algorithms ISOMAP, LLE, and Laplacian eigenmap are non-iterative and require mainly eigen-decomposition, which is well understood with many off-the-shelf algorithms available. ISOMAP, LLE, and Laplacian eigen— map are basically non-parametric algorithms. While this provides extra flexibility to model the manifold, more data points are needed to give a good estimate of the low dimensional vector. The basic version of some of the algorithms (Sammon’s mapping, ISOMAP, LLE, and Laplacian eigenmap) cannot generalize the mapping to patterns outside the training set y, though an out-of-sample extension has been proposed [17]. There are interesting connections between some of these algorithms. ISOMAP, LLE, and Laplacian Eigenmap can be shown to be the special cases of KPCA [105]. The matrix M in LLE can be shown to be related to the square of the Laplacian Beltrami operator [16], an important concept in Laplacian eigenmap. While these techniques have been successfully applied to high dimensional data sets like face images, digit images, texture images, motion data, and textual data, the relative merits of these algorithms in practice are still not clear. More comparative studies like the one in [196] would be helpful. 79 KPCA, 2nd and 3rd . . 60 0.08[ 5 , - .5 0.06- ‘ 0.04» l 0.02- ol -o.02r i. I -o.o4l " ‘ —0.06' '68 -0‘1 -0 05 o 065 2360 -4‘0 4? 0 f 20 40 60 80 (a) KPCA, 2nd and 3rd (b) ISOMAP LLE m E Laplacian Eigenmap, 2nd and 3rd ‘9 ' ' 0.UIJ ' ' ' ' ' , \ 3] . " 0.01 2] I 0.005. 4 1 : , 0 0] —0.005: -1 - , -2» '; —0.01> 1 -3. I —o.o15- '34 —2 0 2 4 ‘oflfoz 4.615 001 —o.005 0 0.005 0.01 (c) LLE (d) Laplacian Eigenmap, 2nd and 3rd PCA Sammon 3G . . ’ 4c . . l < 30. ..... ~“f:."v.“.:~ I 20' Lg.- ’ 10* . .; '3. 0- ' "°' 52' . —2o- -20 ..ttg-é‘xfiflm. _ . 
' —30* “W —20 0 20 4o 60 "1‘60 —40 —20 0 20 (e) Standard PCA (f) Sammon’s mapping Figure 2.6: Results of nonlinear mapping algorithms on the parabolic data set. “2nd and 3rd” in the captions means that we are showing the second and the third com— ponents, instead of the first two. 80 ISOMAP M KPCA. 1st and 2nd M Ma ' ' ‘V 0.06 1 ‘15L 004* . l 10’ 0.02- 5- 0+ ' 0- -0.02L ' —5' -0.04» " - l —10- -0.06’ < —15- _o.no I A A | _2I\ A 306 —0.04 —0.02 0 0.02 J60 —40 —20 0 (b) ISOMAP Laplacian Eigenmap. 2nd and 3rd 0.015 ' v .. A , "f". ‘ 0.01’ _ 1".) -. ’ 0.005- {11.777 ’ 0' if" :5 -0005- ‘3“‘a-‘zlvu: . i: "Z -0.01- "0"110301 —0.005 0 0.0 5 0.01 (d) Laplacian Eigenmap, 2nd and 3rd Sammon 15‘ f 15 . forfli' ‘.~‘;'?9’ft‘,'n"=$f ’3‘. 9.1%} .fi‘f’. F? 1ol:‘”“”'. . ' z- :1 :‘. 5 1'0 1‘5 20 '1—‘75 40 -5 0 5 (e) Standard PCA (f) Sammon‘s mapping Figure 2.7: Results of nonlinear mapping algorithms on the swiss r011 data set. “2nd and 3rd” in the captions means that we are showing the second and the third com- ponents, instead of the first two. 81 -2 ISOMAP 35 0 5 (b) ISOMAP Laplacian Eigenmap 0.01 s . ‘ n:,._ "w Kai... ‘ 0.005 3?] :L.'.' ‘ 0- .;-,- '5 if ‘ i ‘ —0.005 if 7. ‘ '\:__ , 2 “0911.01 —0.005 0.005 0.01 (d) Laplacian Eigenmap Sammon .J 2. 2‘ ‘ 1- 3'13 tiniest ‘f" .- . 0. vi -1. _2- 4 —2 1 o 2 3 ‘34 -3 -2 —i 0 i 2 3 (e) Standard PCA (f) Sammon’s mapping Figure 2.8: Results of nonlinear mapping algorithms on the S—curve data set. 82 °-‘ -0.06 —0.04 -0.02 0 0.02 0.04 0.06 (a) KPCA,35.2% error : 0W‘ 0 4~« 0‘ 3.. -0.(X)5* 2— -0.01 1‘ -0.015~ 0“ .0.02~ 5 _1_ L o 43.0%- 10 n 0.01 " I / I / I 0 4 -4 -3 -2 —1 0 1 2 '5 ‘°‘°2 -0.00 —5 m (c) LLE, 9.2% error (d) Laplacian Eigenmap, 41.5% error -0..‘. 0.3- —0.6— 02- -0.8“‘ o - 0.1— . -1q 0~ i -0.1— 42‘ -02~ -1.4~ '0'3‘ “05 -10— 413.? .4 4.8 r I 1 l . ° 02 0.4 "0‘5 —0.2 0 0.2 0.4 0.6 0.8 1 1324 5 (e) Standard PCA, 36.4% error (f) Sammon’s mapping, 39.3% error Figure 2.9: Results of nonlinear mapping algorithms on the face images. The two classes (Asians and non-Asians) are shown in two different colors. The (training) error rates by applying quadratic discriminant analysis on the low dimensional data points are shown in the captions. 83 Chapter 3 Incremental Nonlinear Dimensionality Reduction By Manifold Learning In Chapter 2, we discussed different algorithms to achieve dimensionality reduction by nonlinear mapping. Most of these nonlinear mapping algorithms operate in a batch model, meaning that all the data points need to be available during train- ing. In applications like surveillance, where (image) data are collected sequentially, batch method is computationally demanding: repeatedly running the “batch” ver- sion whenever new data points become available takes a long time. It is wasteful to discard previous computation results. Data accumulation is particularly beneficial to manifold learning algorithms due to their non-parametric nature. Another reason for 1Sammon’s mapping can be implemented by a feed-forward neural network [180] and hence can be made online if an online training rule is used. 84 developing incremental (non-batch) methods is that the gradual changes in the data manifold can be visualized. As more and more data points are obtained, the evolution of the data manifold can reveal interesting properties of the data stream. Incremental learning can also help us to decide when we should stop collecting data: if there is no noticeable change in the learning result with the additional data collected, there is no point in continuing. 
The intermediate result produced by an incremental algo— rithm can prompt us about the existence of any “problematic” region: we can focus the remaining data collection effort on that region. An incremental algorithm can be easily modified to incorporating “forgetting”, i.e., the old data points gradually ‘ lose their significance. The algorithm can then adjust the manifold in the presence of the drifting of data characteristics. Incremental learning is also useful when there is an unbounded stream of possible data to learn from. This situation can arise when a continuous invariance transformation is applied to a finite set of training data to create additional data to reflect pattern invariance. In this chapter, we describe a modification of the ISOMAP algorithm so that it can update the low dimensional representation of data points efficiently as additional samples become available. Both the original ISOMAP algorithm [248] and its land— mark points version [55] are considered. We are interested in ISOMAP because it is intuitive, well understood, and produces good mapping results [133, 276]. Fur- thermore, there are theoretical studies supporting the use of ISOMAP, such as its convergence proof [18] and the conditions for successful recovery of co-ordinates [66]. There is also a continuum extension of ISOMAP [282] as well as a spatio—temporal extension [133]. However, the motivation of our work is applicable to other mapping 85 algorithms as well. The main contributions of this chapter include: 1. An algorithm that efficiently updates the solution of the all-pairs shortest path problems. This contrasts with previous work like [193], where different shortest path trees are updated independently. 2. More accurate mappings for new points by a superior estimate of the inner products. 3. An incremental eigen-decomposition problem with increasing matrix size is solved by subspace iteration with Ritz acceleration. This differs from previ- ous work [270] where the matrix size is assumed to be constant. 4. A vertex contraction procedure that improves the geodesic distance estimate without additional memory. The rest of this chapter is organized as follows. After a recap of ISOMAP in section 3.1, the proposed incremental methods are described in section 3.2. Experimental results are presented in section 3.3, followed by discussions in section 3.4. Finally, in section 3.5 we conclude and describe some topics for future work. 3.1 Details of ISOMAP The basic idea of the ISOMAP algorithm was presented in Section 2.6. It maps a high dimensional data set y1, . . . , yn in RD to its low dimensional counterpart x1, . . . ,xn in R“, in such a way that the geodesic distance between y,- and y j on the data manifold 86 is as close to the Euclidean distance between xi and xJ- in IR“ as possible. In this section, we provide more algorithmic details on how the mapping is done. This also defines the notation that we are going to use throughout this chapter. The ISOMAP algorithm has three stages. First, a neighborhood graph is con- structed. Let AU be the (Euclidean) distance between y,- and yj. A weighted undi— rected neighborhood graph 9 = (V, E) with the vertex v,- E V corresponding to yz- is constructed. An edge e(i, j) between vi and vj exists if and only if y,- is a neighbor of yj, i.e., yz- E N(yj). The weight of e(i,j), denoted by wij, is set to Aij. The set of indices of the vertices adjacent to v,- in Q is denoted by adj (2) ISOMAP proceeds with the estimation of geodesic distances. 
Let gij denote the length of the shortest path sp(z', 3') between vi and vj. The shortest paths are found by the Dijkstra’s algorithm with different source vertices. The shortest paths can be stored efficiently by the predecessor matrix wij, where 7n]- = k if vk is immediately before '03- in sp(i, j ) If there is no path from v,- to U], 7%- is set to O. Conceptually, however, it is useful to imagine a shortest path tree T(i), where the root node is v,- and sp(z', 3') consists of the tree edges from v, to vj. The subtree of T(Z) rooted at va is denoted by T(z'; a). Since gij is the approximate geodesic distance between yz- and yj, we shall call gij the “geodesic distance”. Note that G = {gij} is a symmetric matrix. Finally, ISOMAP recovers x,- by using the classical scaling [49] on the geodesic distance. Define X = [x1,...,xn]. Compute B = —1/2HCH, where H = {bl-j}, hij = 6ij —1/n and 613- is the delta function, i.e., dij = 1 iii = j and 0 otherwise. The entries 5”),- - of C are simply 9.2.. We seek XTX to be as close to B as possible in the least .7 2] 87 square sense. This is done by setting X = [\//\1v1 . . . \/)\dvd]T, where /\1, . . . , Ad are the (1 largest eigenvalues of B, with corresponding eigenvectors v1, . . . ,vd. 3.2 Incremental Version of ISOMAP The key computation in ISOMAP involves solving an all-pairs shortest path problem and an eigen-decomposition problem. As new data arrive, these quantities usually do not change much: a new vertex in the graph often changes the shortest paths among only a subset of the vertices, and the simple eigenvectors and eigenvalues of a slightly perturbed real symmetric matrix stay close to their original values. This justifies the reuse of the current geodesic distance and co—ordinate estimates for update. we restrict our attention to knn neighborhood, since e-neighborhood is awkward for incre- mental learning: the neighborhood size should be constantly decreasing as additional data points become available. The problem of incremental ISOMAP can be stated as follows. Assume that the low dimensional co-ordinates x,- of yi for the first n points are given. We observe the new sample yn+1. How should we update the existing set of x, and find xn+17 Our solution consists of three stages. The geodesic distances gij are first updated in view of the change of neighborhood graph due to on“. The geodesic distances of the new point to the existing points are then used to estimate xn+1. Finally, all xi are updated in view of the change in gij. In section 3.2.1, we shall describe the modification of the original ISOMAP for incremental updates. A variant of ISOMAP that utilizes the geodesic distances from 88 a fixed set of points (landmark points) [55] is modified to become incremental in section 3.2.2. Because ISOMAP is non-parametric, the data points themselves need to be stored. Section 3.2.3 describes a. vertex contraction procedure, which improves the geodesic distance estimate with the arrival of new data without storing the new data. This procedure can be applied to both the variants of ISOMAP. Throughout this section we assume (1 (dimensionality of the projected space) is fixed. This can be estimated by analyzing either the spectrum of the target inner product matrix or the residue of the low rank approximation as in [248], or by other methods to estimate the intrinsic dimensionality of a manifold [143, 171, 47, 35, 33, 259, 207]. 
3.2.1 Incremental ISOMAP: Basic Version We shall modify the original ISOMAP algorithm [248] (summarized in section 3.1) to become incremental. Details of the algorithms as well as an analysis of their time complexity are given in Appendix A. Throughout this section, the shortest paths are represented by the more economical predecessor matrix, instead of multiple shortest path trees T(i). 3.2.1.1 Updating the Neighborhood Graph Let A and ’D denote the set of edges to be added and deleted after inserting on“ to the neighborhood graph, respectively. An edge e(i, n + 1) should be added if (i) v,- is one of the k: nearest neiehbors of v , , or ii 22 re )laces an existing vertex and o n+1 n+1 o 89 becomes one of the k nearest neighbors of 11,-. In other words, A = {C(i, n + 1) 3 An+1,i S An+1,‘rn+1 0r Az'.n+1 S Ain’t-l1 (3'1) where ”r,- is the index of the k-th nearest neighbor of 22,-. For D, note that a necessary condition to delete the edge e(2', j) is that vn+1 replaces v,- (vj) as one of the k nearest neighbors of vj (12,-). So, all the edges to be deleted must be in the form e(z', Ti) with Ai,n+1 g ALT," The deletion should proceed if v,- is not one of the k nearest neighbors of ”Ti after inserting vn+1. Therefore, D = {(30%) 2 Am,- > A21.n+1 and ATM > ATM}, (3-2) where L,- is the index of the k-th nearest neighbor of “Ti after inserting vn+1 in the graph. Note that we have assumed there is no tie in the distances. If there are ties, random perturbation can be applied to break the ties. 3.2.1.2 Updating the Geodesic Distances The deleted edges can break existing shortest paths, while the added edges can create improved shortest paths. This is much more involved than it appears, because the change of a single edge can modify the shortest paths among multiple vertices. Consider e(a,b) E D. If sp(a, b) is not simply e(a,b), deletion of e(a, b) has no effect on the geodesic distances. Hence, we shall suppose that sp(a, 1)) consists of the single edge e(a, b). We propagate the effect of the removal of e(a,b) to the set of 90 1: Input: e(a, b), the edge to be removed; {$11)}: {7W} 2: Output: p(a.b)1 set of “affected” vertex pairs 3: Rab := (Z); Q.enqueue(a); 4: while Q.notEmpty do 5: t3: Q-popiRab = Rab U It}? 6: for all u E adj(t) do 7: If 7711b = a, enqucue u to Q; 8: end for 9: end while{Construction of Rab finishes when the loop ends} 10: F(a,b) :2 0; 11: Initialize ’1", the expanded part of T(a; b), to contain vb only; 12: for all u 6 Rob do 13: Q.enqueue(b) 14: while Q.notEmpty do 15: t := Q.pop; 16: if not = 7r“, then 171 F(a,b) = F(a.b) U {(1% t)}; 18: if v; is a leaf node in T’ then 19: for all Us in adj(t) do 20: Insert vs as a child of wt in T’ if was 2 t 21: end for 22: end if 23: Insert all the children of m in 7" to the queue Q; 24: end if 25: end while 26: end for{V uERab,V sET(u; b), sp(u, 3) uses e(a. b).} Algorithm 3.1: ConstructFab: F(a,b), the set of vertex pairs whose shortest paths are invalidated when e(a, b) is deleted, is constructed. Rab is the set of vertices such that if u E Rab, the shortest path between a and u contains e(a, b). vertices Rab (Figure 3.1). Rab is used in turn to construct 11101,), the set of all (i, j) pairs with e(a,b) in sp(z',j). 
This is done by ConstructFab (Algorithm 3.1), which finds all the vertices of under T(a; b) such that sp(u, t) contains vb, where u E Rab- The set of vertex pairs whose shortest paths are invalidated due to the removal of edges in D is thus F = U€(a.,b)EDF(a-.b)' The shortest path distances between these vertex pairs are updated by AllodifiedDijkstra (Algorithm 3.2) with source vertex vu and destination vertices C (11). It is similar to the Dijkstra’s algorithm, except that only the geodesic distances from 2.7., to C (u) (instead of all the vertices) are unknown. 91 1: Input: u; C(11); {911}; {we} 2: Output: the updated geodesic distances {guv} 3: for all j 6 C(11) do 4: H := adj(j) fl (V/C(u)); 5: 6(3) = minke” (guic + win), or 00 if H = 0; 6: Insert 6( j) to a heap with index j; 7: end for 8: while the heap is not empty do 9: k := the index of the entry by “Extract Min” on the heap; 10= 0(a) == C(U)/{k};guk == 6(k);gku := 506); 11: for all j E adj(k) 0 C(11) do 12: dist 1: guk + wk]; 13: If guk + wkj < 6( j ), perform “Decrease Key” on 6( j) to become dist; 14: end for 15: end while Algorithm 3.2: ModifiedDijkstra: The geodesic distances from the source vertex 0. to the set of vertices C (u) are updated. (a) An example of neighborhood graph (b) The shortest path-tree T(a) and Rab Figure 3.1: The edge e(a,b) is to be deleted from the neighborhood graph shown in (a). The shortest path tree 7(a) is shown as directed arrows in (b). Rab (c.f. Algorithm 3.1) consists of all the vertices on such that sp(b, u) contains e(a, b), i.e., ’ITub = a. Note that both on and C (u) are derived from F. The order of the source vertex in invoking A’IodifiedDijkstra can impact the run time significantly. An approximately optimal order is found by interpreting F as an auxiliary graph 8 (the undirected edge e(z', j ) is in 8 iff (2', j ) E F), and removing the vertices in B with the smallest degree in a greedy manner (OptimalOrder, Algorithm 3.3). When on is removed from B, ModifiedDijkstra is called with source vertex 1)., and C(u) as the neighbors of on in B. 92 1: Input: Auxiliary graph 8 2: Output: None. The geodesic distances are updated as a side-effect 3: [[2] := an empty linked list, for i = 1, . . . ,n; 4: for all on E 8 do 5: z: degree of vu in 8. Insert on to l[f]; 6: end for 7: pos := 1; 8: foriz=1tondo 9 If l[pos] is empty, increment pos one by one and until [bios] is not empty; 10: Remove 12“, a vertex in l[pos], from the graph 8; 11: Call ModifiedDijkstra(u, adj(u) in B); 12: for all 2)]: that is a neighbor of v.“ in 3 do 13: Find f such that vj 6 l [f] by an indexing array; 14: Remove vj from l[f] if f = 1, and move 22,- from l[f] to l[f — 1] otherwise; 15: pos = min(pos,f — 1); 16: end for 17: end for Algorithm 3.3: OptimalOrder“. a greedy algorithm to remove the vertex with the smallest degree in the auxiliary graph 8. The removal of on corresponds to the execution of ModifiedDijsktra (Algorithm 3.2) with u as the source vertex. The next stage of the algorithm finds the geodesic distances between vn+1 and the other vertices. Since all the edges in A (edges to be inserted) are incident on en+1 , we have -= = min :-+ui-. . Vi. 3.3 gn+1,z 92,rz.+1 j such that (91] J,n+1) ( ) e(n+1.j)€A S: the set of of v, with sp(b,t) improved by Vnn Figure 3.2: Effect of edge insertion. T (a) before the insertion of vn+1 is represented by the arrows between vertices. The introduction of vn+1 creates a better path between va and vb. S denotes the set of vertices such that t E 5' iff sp(b, t) is improved by on“. 
Note that vt must be in T(n+1; a). For each u E S, UpdateInsert (Algorithm 3.4) finds t such that sp(u, t) is improved by vn+1, starting with t = b. 93 1: Input: a; b; {91.}; {...,} 2: Output: {9,3} are updated because of the new shortest path 00 -—+ 1.1,,“ -—> vb. 3: S := 0; Q.enqueue(a); 4: while Q.notEmpty do 5: t 2: Q.pop;S :2 S U {t}; 6: for all on that are children of v; in T(n + 1) do 7: if gu,n+1 + ’wn-H,b < gu,b then 8: Q.enqueue( u); 9: end if 10: end for 11: end while{S has been constructed} 12: for all u E S do 13: Q.enqueue(b); 14: while Q.notEmpty do 151 t 3: 62-901); gut 1: gtu 3: Qu,n+1 + 9n+1,t; 16: for all US that are children of U, in T(n + 1) do 171 if gs.n+1 'l' wn+l.rz < 93,0 then 18: Q.enqueue( s); 19: end if 20: end for 21: end while 22: end for{V u E S, update sp(u.,t) if on“ helps} Algorithm 3.4: UpdateInsert: given that va -+ vn+1 ——> vb is a better shortest path between va and ”b after the insertion of 'vn+1, its effect is propagated to other vertices. Finally, we consider how A can shorten other geodesic distances. This is done by first locating all the vertex pairs (00, vb), both adjacent to vn+1, such that vb ——> vn+1 —> va is a better shortest path between va and vb. Starting from ed and vb, UpdateInsert (Algorithm 3.4) searches for all the vertex pairs that can use the new edge for a better shortest path, based on the updated graph. For all the priority queues in this section, binary heap is used instead of the asymptotically faster F ibonacci’s heap. Since the size of our heap is typically small, binary heap, with a smaller time constant, is likely to be more efficient. 94 3.2.1.3 Finding the Co—ordinates of the New Sample The co—ordinate xn+1 is found by matching its inner product with x,- to the values derived from the geodesic distances. This approach is in the same spirit as the classical scalin [49] used in ISOMAP Define "-- - ||x — x-|[2 — “X“2 + “X“2 — 2xTx- ( g a - ' 7L] _ Z J _ 1 J 2 .7 Since 2?:1 x,- = 0, summation over j and then over 2' for ii]- leads to 1 ~ llxz-IIQ = #212:- lelefi). 1 j Zuxjn2 = 3221... J 1.7 Similarly, if we define '7,- 2 [[xi — xn+1||2, we have 1 TI. TI. 2 2 llxn+1ll ——- #237.- — Dis-H ). ' i=1 i=1 T 1 2 2 . xn+1x7 : _§(’72 — llxn+1ll — llx7ll ) V2. If we approximate 31-]: by gig]- and 7,: by 92-2 T, +1, the target inner product f,- between xn+1 and x,- can be estimated by 2 2 2 ~ 2]“ 91'1“ le glj + 2197,72le 2 N — _ 2 n 77, TI. xn+1 is obtained by solving XTxn+1 = f in the least-square sense, where f 2 (f1,..., fn)T. One way to interpret the least square solution is by noting that X = (\/)\1v1 . . . ‘//\dvd)T, where (M, v,) is an eigenpair of the target inner product 95 matrix. The least square solution can be written as 1 T 1 T T x,,+1=(fiv1f,m,—\/-A—_dvdf) . (3.5) The same estimate is obtained if Nystrom approximation [89] is used. A similar procedure is used to compute the out-of-sample extension of ISOMAP in [55, 17]. However, there is an important difference: in these studies, the inner product between the new sample and the existing points is estimated by n 9.2. " 7 2f,- = 2 :71] ._ gin“. (3.6) 1:1 It is unclear how this estimate is derived. This estimate is different from that in Equation (3.4) because 2:) 912,11+1/n — sz glzj/n2 does not vanish in general; in fact, most of the time this is a large number. Empirical comparisons indicate that our inner product estimate given in Equation (3.4) is much more accurate than the one in Equation (3.6). 
Finally, the new mean is subtracted from :r,,z' = 1,...,(n + l), to ensure 23:11 x,- = 0, in order to conform to the convention in the standard ISOMAP. 3.2.1.4 Updating the Co-ordinates The co—ordinates x,- should be updated in view of the modified geodesic distance matrix Gnew. This can be viewed as an incremental eigenvalue problem, as x,- can be obtained by eigen—decomposition. However, since the size of the geodesic distance 96 f . matrix is increasing, traditional methods (such as those described in [270] or [30]) cannot be applied directly. We update X by finding the eigenvalues and eigenvectors of Bnew by an iterative scheme. Note that gradient descent can be used instead [168]. A good initial guess for the subspace of dominant eigenvectors of Bnew is the column space of XT. Subspace iteration together with Rayleigh-Ritz acceleration [96] is used to find a better eigen-space: 1. Compute Z = BnewV and perform QR decomposition on Z, i.e., we write Z = QR and let V = Q. 2. Form Z = VTBnewV and perform eigen-decomposition of the d by (1 matrix Z. Let A; and u,- be the i-th eigenvalue and the corresponding eigenvector. 3. Vnew = V[u1 . . . ud] is the improved set of eigenvectors of Bnew. Since at is small, the time for eigen-decomposition of Z is negligible. We do not use any variant of inverse iteration because Bnew is not sparse and its inversion takes 0(n3) time. 3.2. l .5 Complexity In Appendix A.4, we show that the overall complexity of the geodesic distance update can be written as O(q(|F|+lH|)+n1/ log u+ IAI2), where F and H contain vertex pairs whose geodesic distances are lengthened and shortened because of vn+1, respectively, q is the maximum degree of the vertices in the graph, n is the number of vertices with non-zero degree in B, and I/ = maxim. Here, n,- is the degree of the i-th vertex removed from the auxiliary graph 8 in Algorithm 3.3. We conjecture that u, 97 ’f-‘ - on average, is of the order 0(log 11.). Note that ,u. g 2|F|. The complexity is thus O(q([F| + |H|) + [1. log 11. log log n + |A|2). In practice, the first two terms dominate, leading to the effective complexity O(q(|F| + [H I). We also want to point out that Algorithm 3.2 is fairly efficient; its complexity to solve the all-pairs shortest path by updating all geodesic distances is 0(n2logn+n2q). This is the same as the complexity of the best known algorithm for the all-pairs shortest path problem of a sparse graph, which involves running Dijkstra’s algorithm multiple times with different source vertices. For the update of co-ordinates, subspace iteration takes 0(722) time because of the matrix multiplication. 3.2.2 ISOMAP With Landmark Points One drawback of the original ISOMAP is its quadratic memory requirement: the geodesic distance matrix is dense and is of size ()(n2), making ISOMAP infeasible for large data sets. Landmark ISOMAP was proposed in [55] to reduce the memory requirement while lowering the computation cost. Instead of all the pairwise geodesic distances, landmark ISOMAP finds a mapping that preserves the geodesic distances originating from a small set of “landmark points”. This idea is not entirely new, and the authors in [25] refer to it as the “reference point approach” in the context of embedding. Without loss of generality, let the first m points, i.e., y1,. . . ,ym, be the land- mark points. 
After constructing the neighborhood graph as in the original ISOMAP, landmark ISOMAP uses the Dijkstra’s algorithm to compute the m X n landmark 98 ¥—__ _ geodesic distance matrix C = {gij}, where gij is the length of the shortest path between v,- (a landmark point) and vj. In [55] the authors suggest that X can be found by first embedding the landmark points and then embedding the remaining points with respect to the landmark points. This is similar to the modification of the Sammon’s mapping made by Biswas et al. in [25] to cope with large data sets. How- ever, our preliminary experiments indicate that this is not very robust, particularly when the number of landmark points is small. Instead, we follow the implementation of landmark ISOMAP2 and decompose B = HmCHn by singular value decompo— sition, B = USVT = (U(S)1/2)(V(S)1/2)T, where UTU and VTV are identity matrices of corresponding sizes, and S is a diagonal matrix of singular values. The vectors corresponding to the largest d singular values are used to construct a low-rank approximation, B a: QTX. 3.2.2.1 Incremental Landmark ISOMAP After updating the neighborhood graph, the incremental version for landmark ISOMAP proceeds with the update of geodesic distances. Since only the shortest paths from a small number of source vertices are maintained, the computation that can be shared among different shortest path trees is limited. Therefore, we update the shortest path trees independently by adopting the algorithm I presented in [193], instead of the algorithm in section 3.2.1.2. First, Algorithm 3.5 is called to initialize the edge weight increase, which includes edge deletion as a. special case. Algorithm 2We are referring to the “official” implementation by the authors of ISOMAP in http : //isomap. stanford.edu. 99 I := (0; for all (73-, 3,, 1112”“, my”) in the input do Swap 7‘, and 3,- if 12,, is a child of vs, in T(a); if '03, is a child of or, in T(a) then J := {vs,}U descendent of v3I in T(a); gaj = gaj + w?” — w?“ W E .7; IzIUJ; end if end for for all j E J do b 2: minkeadflj) gak + 'wkj; {Find a new path to vj} Q.enqueue(j, arg minkeadjm gak + wkj, b) if b < gaj end for Algorithm 3.5: InitializeEdgeWeightIncrease for the shortest path tree from va, T a . The inputs are the four tuples r-,s:,w91d,w‘~‘ew , meaning the weiO‘ht of 2 ’l 1 2 C) e(ri,rj) should increase from w?“ to wzflew. Q is the queue of vertices to be pro— cessed in Algorithm 3.7. 3.7 is then executed to rebuild the shortest path tree. Algorithm 3.6 is then called to initialize the edge weight decrease, which includes edge insertion as a special case. Algorithm 3.7 is again called to rebuild the tree. Deletion of edges is done before the addition of edges because this is more efficient in practice. The co—ordinate of the new point xn+1 is determined by solving a least-square problem similar to that in section 3.2.1.3. The difference is that the columns of Q, instead of X, are used. So, QTxn+1 = f is solved in the least-square sense. Finally, we use subspace iteration together with Ritz acceleration [236] to improve singular vector estimates. The steps are 1. Perform SVD on the matrix BX, U181V{ = BX 2. Perform SVD on the matrix BTUl, U282V§ = BTU1 3. 
Set xnew = U2(S2)1/2 and Qnew = U1(S2)1/2 As far as time complexity is concerned, the time to update one shortest path tree 100 I := Q); for all (Ti, 3,, 112?“, my”) in the input do Swap 1', and 3; if gay, > 90‘s,; diff := gay, + 111.?“ — 911.3,; if diff < 0 then Move vs, to be a child of 1)., in T(a); J :2 {’03,}U descendent of 0,, in T(a); 903' = 903' + diff W E .7; I = I U J; end if end for for all j E .7 do for all k E adj(j) do Q.enqueue(k,j,gaj + wjk) if 903- + wjk < gak end for [ end for algorithm 3.6: InitializeEdgeVVeightDecrease for the shortest path tree from va, 7(a). The inputs are the four tuples (ri,si,wfld,w?ew), meaning the weight of e(rbrj) should decrease from in?“ to urgew. Q is the queue of vertices to be pro- cessed in Algorithm 3.7.. is 0(6d log 6d + (16d), where (id is the minimum number of nodes that must change their distance or parent attributes or both [193], and q is the maximum degree of vertices in the neighborhood graph. The complexity of updating the singular vectors is 0(nm), which is linear in 77., because the number of landmark points m is fixed. 3.2.3 Vertex Contraction Owing to the non-parametric nature of ISOMAP, the data points collected need to be stored in the memory in order to refine the estimation of the geodesic distances 92‘3- and the co-ordinates xi. This can be undesirable if we have an arbitrarily large data stream. One simple solution is to discard the oldest data point when a pre—determined I1leber of data points has been accumulated. This has the additional advantage of 101 I A while Q.notEmpty do (i,j,d) :2 “Extract Min” on Q; de=d—%fi if diff < 0 then Move vi to be a child of v]- in T(a): gai = d; for all k E adj(i) do new = 9m- + wiki Q.enqueue(k,i,newd) if newd < gak; end for end if end while Algorithm 3.7: Rebuild T(a.) for those vertices in the priority queue Q that need to be updated. making the algorithm adaptive to drifting in data characteristics. The deletion should take place after the completion of all the updates due to the new point. Deleting the vertex v,- is easy: the edge deletion procedure is used to delete all the edges incident on v; for both ISOMAP and landmark ISOMAP. We can do better than deletion, however. A vertex contraction heuristic can be used to record the improvement in geodesic distance estimate without storing additional points. Most of the information the new vertex vn+1 contains about the geodesic distance estimate is represented by the shortest paths passing through vn+1. Suppose sp(a,b) can be written as ea w v,- ——> 22n+1 ——> vb. The geodesic distance between va and vb can be preserved by introducing a new edge e(z’, b) with weight (102-[n+1 + urn+1,b), even though on“ is deleted. Both the shortest path tree T(a.) alld the graph are updated in view of this new edge. This procedure cannot create irlConsistency in any shortest path trees, because the subpath of any shortest path is also a shortest path. This heuristic increases the density of the edges in the graph, 11 ()wever. 192 W'hich vertex should be contracted? A simple choice is to contract the new vertex vn+1 after adjusting for the change of geodesic distances. Alternatively, we can delete the vertices that are most “crowded” so that the points are spread more evenly along the manifold. This can be done by contracting the non-landmark point whose nearest neighbor is the closest to itself. 3.3 Experiments we have implemented our main algorithm in Matlab, with the graph theoretic parts written in C++. 
The running time is measured on a Pentium IV 3.2 GHz PC with 512MB memory running Windows XP, using the profiler of Matlab with the java virtual machine turned off. 3.3.1 Incremental ISOMAP: Basic Version We evaluated the accuracy and the efficiency of our incremental algorithm on sev- eral data sets. The first experiment was on the Swiss roll data set. It is a typical benchmark for manifold learning. Because of its “roll” nature, geodesic distances are more appropriate in understanding the structure of this data set than Euclidean distances. Initialization was done by finding the co—ordinate estimate x,- for 100 ran- dOInly selected points using the “batch” ISOMAP, with a [run neighborhood of size 6‘ Random points from the Swiss roll data set were added one by one, until 1500 I)Qims were accumulated. The incremental algorithm described in section 3.2.1 was Ilged to update the co-ordinates. The first two dimensions of x,- corresponded to the 103 2s 2.... 8.2 m3 :3 was <2 <2 <2 was 8.1.. was Se 2.... was 82 was as was So was a: <\z <2 <2 :s as was is mom was 82 8.0 as a: 8.0 8.0 was mos sec 8.0 3.0 8.0 see as was 8.0 can ..85 58am $5 ..85 nopem .30 ..85 58mm .pma ..85 nopwm .EQ .55 58am SEQ : 530 m 997:2 85 coamcamm o>§u-m :8 mmmam .mpfiom : 2: :e a8 coflmusmfioo 85358 a8 was 2: on mwcoamotou 2.5.5: .65 350a mo .8555 usoamtwc a5 mono age/HOE EEQEQSE was 585* massomxm .25 Amwsoommv 25» 55 ”Nm @3er. 104 is . 3m. . as . is . 4. . ..x was a A: 38 we 2% 3 is we cos. ammo 3% Ex assimzwco can ems: is ©me we at: 2... man: was 3:2 828% 2880 S mama 3 was so mg 2 3mm 3 0.0mm ism eooioeaez ..85 58mm ..85 :ofiwm ..85 zouem ..85 58mm ..85 nepwm 53¢ m 9922 8.3 convened 9.50% :8 mmrsm .oES SE B5288 one Eco mm 823 0283 .aofimwou Enacted was “X we .5553: use 5+5“ mo 2033:9200 mag/HOE 5.3m: 8m m o ' o o o D .(6 2°00 '°¢§-.29...°" O o 0% 9000990 401 o o o .9 -10 o d, 0° 9 ' do .a e0 "d9 0 ° :.5 ° . 8 '00 o ‘90 '50. . go? _o -20. . 0° . -. c. o _20 o ‘5 - 09' .° 50 '0 ..0. Go . .0 0° 0 o o w o % at: o :1 ago D.0..y?)c3-TP v.10 9 I}? (Do-.390. -30' . -30* q, ' . . .. ~ - _-.. . . . -60 —40 40 60 -60 —40 ~20 0 60 (81) Initial n — 100 (b) n = 300 '. ' ' -' ‘- I -. “mm--2 30’ r. ' ...". b \ 4.0.. 30’. : 2 ":5 " $.6:.',.l'. “96’ 2 ’ _ ‘bQo' . y .. g. - '9 ; own a " ° -' 0051-. 39,50 @313, 3%388'0‘.‘ ¢e_3fice,dlg-. P08 °° 93° , ° ‘3‘" _ 093:5 .- _"' 20~o . mm. o ., o r - m- 20 o .6 ° . 430003.111; « 3;..5,’ 0&‘Q°_ ° 0 .50%€§~'.o q-gao °. was: 4%? ‘51.: _ . <11 ' wfi ’ :3 ' "o -_ ' ‘ ' W '3 ‘ 0“ 'd': 10 a $23.0 . £0 <5 ‘2 ”1“. 0° g. “gig 10 .0 we :1 5£°::°. o ijm . ‘3’ 1 g ‘- .0- f .- I 0 . ' ' O —figke§$*5o .60 -40 —20 20 (e) n = 01200 0 f —20 20 (r) Final, 71 = 1500 Figure 3.5: Evolution of the estimated co—ordinates for Swiss roll to their final values. The black dots denote the co—ordinates estimated with different number of samples, Whereas red circles show the co—ordinates estimated with all the 1500 points. The (lo-ordinates have been re—scaled to better observe the trend. Figure 3.6: Example images from the rendered face image data set. This data set can be found at the ISOMAP web-site. 111 Figure 3.7: Example “2” digits from the MN IST database. The MN IST database can be found at http://yann. lecun.com/exdb/mnist/. 1NN LOO error rate in % 5 6 No. of features Figure 3.9: Classification performance on ethn database for basic ISOMAP. 112 the data linearly to the best hyperplane by PCA and also evaluate the corresponding leave-one-out error rate. Figure 3.9 shows the result. 
The representation recovered by ISOMAP leads to a smaller error rate than PCA. Note that the performance of PCA can be improved by rescaling each feature so that all of them have equal variance, though the rescaling is essentially a post-processing step, not required by ISOMAP. 3.3.2 Experiments on Landmark ISOMAP A similar experimental procedure was applied to the incremental landmark ISOMAP described in section 3.2.2 for Swiss roll, S-curve, rendered face, MNIST digit 2, and ethn data sets. Starting with 200 randomly selected points from the data set, random 9 accumulated. Forty points from the points were added until a total of 5000 points initial 200 points were chosen randomly to be the landmark points. Snapshots corn- Daring the (to—ordinates estimated by the batch version and the incren’iental version for Swiss roll are shown in Figure 3.10. The approximation error and the computa- tlion time are shown in Figure 3.11 and Table 3.3, respectively. The time to run the batch version only once is listed in Table 3.4. Once again, the co—ordinates estimated by the incremental version are accurate with respect to the batch version, and the COIIlputation time is much less. We also consider the classification accuracy using Iandrnark ISOMAP on all the 2630 images in the ethn data set. The result is shown in Figure 3.12. The co—ordinates estimated by landmark ISOMAP again lead to a Slllaller error rate than those based on PCA. The difference is more pronounced when \ 9When the data set has less than 5000 points, the experiment stopped after all the points have Gen used. 113 <2 <2 <2 88 28mm :38 <2 <2 <2 was meet 22mm mod :82 $28 8% <2 <2 <2 30 a: 8.3 <2 <2 <2 8.0 5.2 a: 8.0 8.: as comm 8.0 $2 on: So a? was <2 <2 <2 So a: 02 3o 83 mm; 88 8.0 a; a: 8.0 a; was So as 3.0 mod as 3o :5 Es 8.0 8... .85 88m .85 .85 88m 8me .85 88m .85 .85 88mm .85 .85 88m ...EQ 58 m BmHZE 88 888de madam :8 $me .mufioa : 85 :8 8* 28332588 88358 5 8:5 2: 3 5288888 :85: .AE 2an m0 838:: ”:88wa 88 8:0 Lag/HOE xaeEcsfl 8888885 98 889 magnowxm 8“ $5883 883 55 Jam 8388 . . . . . . ax mafia a Whoa 9% saw. @on mm oé “New wéom Mmoo 5.2: 112x msg#89290 8 05m 3: 33 8 we mom 2% 3m 3% 8:28 22.88 on 0.82 E 3% ea mom 3 28w 8 33m ism 8928862 .85 88m .85 88m .85 88m .85 88m .85 88m 58 m BmHZE 88 885:5 958m :8 mmtsm .883 5: 8:558 28 Eco m5 885 88m 823888 58:28:85 m8 ex .8 Maggi: 98 H+2x mo 2053:9208 mega/HOE 88h 8m mag/HOE x8585: 5388883 58 88a 88 32888 was and 5m 855. 114 88me 8m: 8 mag/HOE .8 86:? 58:58:65 5: 8:: E88 .m.m 8:me o: 8:86 mm 5 888858 :8 88:32: 86.55% on... 833 :85 .5288 woofionnwmm: was. 88:85: 5:58 BE: 95. wig/HOE E 88:58 89538.8 885:8: 80:: 23 mo 88> 2: 3 889888 8:8: 8 mo :28 5: 9853 £83 838m 888:8 :858 8:88 58 585880.: 5:58de 8:88:85 8:: 8:: 8.85 on: :3 88858 5:89 838 538: 2: mo 88:88.8 2: 30:8 8288 8:: 88:8 .23. 58388on :22?» 8.89883 8:: :8 :88. on: 23 88838 88:88.8 :8. “88:8: 8830: 8:: 5 m8: 5:: 88.8 8:: £838 85 23 5 max—20mm 58:53:85 5385885 :8 Ea: mmrsm: mo muoamme:m ”SM 8:me 82 n 2 9 88 n z E 82 u g E n c m1 o? m7 owl ..." ... . . . £72.. .. .. assume. ._ «u com H g .855 8v com H : .855 A3 0 o m1 owl 31 our . . n": . 0.. . . . 9 o 0. i 1 :30... “a one. c “To"... 0 o .- o ‘00 a .- Km... 0 O C O . O o a I .o 0 c 0’. .0. .0 . . .m o I .0 o O O O .0 I. .0. ' l O. o o a e O... C O. O. O 0‘ OF I . o .,- 115 88 n : 42E 8 8% n z .35: E ‘ I . t‘ ... I.- 2... WW x €0::5:8v ofim oSME 8% n 2 .35: 3 . 4 m 0 ml owl n7 owl v- 116 0x ‘04 Average: 0.000010 “x 10-5 Average: 0.000003 1" 1 Y 1 v Y T T a 1 1 v v r r r 8. 
1 1 7» 0.8- 6 “J: 0.6- 4 0.4* l 0.2 ~ A A A A 1 AL 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 n n (a) Swiss roll (b) S-curve -3 Average: 0.000212 Average: 0.001685 x 10 2 v 1 v f r Y 0.04 V T v Y r r 1 1 1.8 1.6 ”400 500 6006560100015002000250930003500400045005000 2(c) Renderned Faces (d) MNIST digit 2 Average: 0.001990 0.04 . w T r 0.035 r 4 0.03 r 0.025 m: 0.02 g : 0.015 l 0.01 l j l ‘ . ‘ l 0.005 ‘ _ { c A A -.. MA AAA ..-. 500 1000 n 1500 2000 2500 (e) ethn Figure 3.11: Approximation error (8n) between the co—ordinates estimated by the incremental landmark ISOMAP and the batch landmark ISOMAP for different num- bers of data points (n). It is similar to Figure 3.4, except that incremental landmark ISOMAP is used instead of the basic. ISOMAP. 117 A O .4 \‘ I I T r L r .\ -—~ PCA 35‘.\“~\ -- ~ - ISOMAP “ o\° 30’ h, .2 . \\ a, . \ a; 20" \ 8 \ _r 15“ \\\ 2 \‘\~ 2 \N‘ '— 1ob \\ \F“\\ 5'- .1..\_.__.._1--_ o ‘ ' ' I No. of features Figure 3.12: Classification performance on ethn database, landmark ISOMAP. the number of dimensions is small (less than five). 3.3.3 Vertex Contraction The utility of vertex contraction is illustrated in the following experiment. Consider a manifold of a 3-dimensional unit hemisphere embedded in a lO-dimensional space. The geodesic on this manifold is simply the great circle, and the geodesic distance between x1 and x2 on the manifold is given by COS—I(XCITXZ). Data points lying on this manifold are randomly generated. With K = 6, 40 landmark points and 1000 points in memory, vertex contraction is executed until 10000 points are examined. The geodesic distances between the landmark points XL and the points in memory X M are compared with the ground-truth, and the discrepancy is shown by the solid line in Figure 3.13. As more points are encountered, the error decreases, indicating that vertex contraction indeed improves the geodesic distance estimate. There is, 118 0,16 1 r r r r 1* ———-—- With contraction 0.14_ , . - - - Without contraction 012* 0.1 rms 0.08 0.06 ~ 0.04 _ 0.0 I L L L l l l I 12000 2000 3000 4000 5000 6000 7000 8000 9000 10000 number of points Figure 3.13: Utility of vertex contraction. Solid line: the root-mean-square error (when compared with the ground truth) of the geodesic distance estimate for points currently held in memory when vertex contraction is used. Dash-dot line: the cor— responding root—mean—square error when the new points are stored in the memory instead of being contracted. however, a lower limit (around 0.03) on the achievable accuracy, because of the finite size of samples retained in the memory. When additional points are kept in the memory instead of being contracted, the improvement of geodesic distance estimate is significantly slower (the dash-dot line in Figure 3.13). We can see that vertex contraction indeed improves the geodesic distance estimate, partly because it spreads the data points more evenly, and partly because more points are included in the neighborhood effectively. 3.3.4 Incorporating Variance By Incremental Learning One interesting use of incremental learning is to incorporate invariance by “hallucinat— ,. I . ing” training data. Given a training sample yi, additional training data yzll), yzf?) 119 can be created by applying different invariance transformations on yi. The amount of training data can be unbounded, because the number of possible invariance transfor- mations is infinite. 
This unboundedness calls for an incremental algorithm, which can accumulate the effect of the data generated. This idea has been exploited in [235] for improving the accuracy in digit classification. Given a digit image, simple distortions like translation, rotation, and skewing are applied to create additional training data for improving the invariance property of a neural network. We tested a similar idea using the proposed incremental ISOMAP. The training data were generated by first randomly selecting an image from 500 digit “2” images in the MNIST training set. The image was then rotated randomly by 6 degree, where 6 was uniformly distributed in [—30, 30]. The image was used as the input for the incremental landmark ISOMAP with 40 landmarks and a memory size of 10000, with vertex contraction enabled. The training was stopped when 60000 training images were generated. We wanted to investigate how well the rotation angle is recovered by the nonlinear mapping. This was done by using an independent set of digit “2” images from the MNIST testing set, which was of size 1032. For each image y“), it was rotated by 15 different angles: 30j/7 for j = —7, . . . ,7. The mappings of these 15 images, 291,. .. ,xgi), were found using the out—of—sample extension of ISOMAP. If ISOMAP can discover the rotation angle, there should exist a linear projection direction h such that thEi) % c,- +l for all z' and l, where Ci is a constant specific to 5,0). This is equivalent to hT (if) — 5(8)) 3 r, (3.8) 320 ---SOMAP '~ISOMAPH , ISOMAPHI 8 o .>\ —~«PCA '\ ...... \ .0 PCA" ° ' °'\.\ V'VPCAHI N (I) on v VYY O N 8 O E —————————————————————————————— B 0 square root 01 the sum 01 residue square N o O -A a: O ' 2 l I 1 l i J # a) O Y 4 ... M (a) A ‘1». CD ‘0 8 5 6 No. of features Figure 3.14: Sum of residue square for 1032 images in 15 rotation angles. The larger the residue, the worse the representation. “PCA” and “ISOMAP” correspond to the nonlinear mapping obtained by PCA and ISOMAP when 10000 generated images are used for training, respectively. “ISOMAP II”/ “PCA II” and “ISOMAP III”/ “PCA III” correspond to the result when the learning stops after 20000 and 50000 images are generated, respectively. which is an over-determined linear system. The goodness of the mapping it“) in terms of how well the rotation angle is recovered can thus be quantified by the residue of the above equation. For comparison, a similar procedure was applied for PCA using the first 10000 generated images. Figure 3.14 shows the result. We can see that the residue for ISOMAP is smaller than PCA, indicating that ISOMAP recovers the rotation angle better. The residue is even smaller when additional images are generated to improve the mapping. 3.4 Discussion We have presented algorithms to incrementally update the co—ordinates produced by ISOMAP. Our approach can be extended to other manifold learning algorithms; for example, creating an incremental version of Laplacian eigenmap requires the update 121 of the neighborhood graph and the leading eigenvectors of a matrix (graph Laplacian) derived from the neighborhood graph. The convergence of geodesic distance is guaranteed since the geodesic distances are maintained exactly. Subspace iteration used in co-ordinate update is provably convergent if a sufficient number of iterations is used, assuming all eigenvalues are simple, which is generally the case. The fact that we only run subspace iteration once can be interpreted as trading off guaranteed convergence with empirical efficiency. 
Since the change in target inner product matrix is often small, the eigenvector im- provement due to subspace iterations with different number of points is aggregated, leading to the low approximation error as shown in Figures 3.4 and 3.11. W'hile running the proposed incremental ISOMAP is much faster than running. the batch version repeatedly, it is more efficient to run the batch version once using all the data points if only the final solution is desired (compare Tables 3.1 and 3.2, as well as Tables 3.3 and 3.4). It is because maintaining intermediate geodesic distances and co—ordinates accurately requires extra computation. The incremental algorithm can be made faster if the geodesic distances are updated upon seeing p subsequent points, p > 1. We first embed yn+1, . . . , yn+p independently by the method in section 3.2.1.3. The geodesic distances among the existing points are not updated, and the same set of x, is used to find xn+1, . . . ,Xn+p- After that, all the geodesic distances are updated, followed by the update of x1, . . . ,xn+p by subspace iteration. This strategy makes the incremental algorithm almost p—times faster, because the time to embed the new points is very small (see the time for “computing xn+1” in Tables 3.1 and 3.3). On the other hand, the quality of the embedding will deteriorate because the 122 embedding of the existing points cannot benefit from the new points. This strategy is particularly attractive with large 71., because the effect of yn+1, . . . ,yn+p on yn+p+1 is small. Also, for a fixed amount of memory, the solution obtained by the incremental version can be superior to that of the batch version. This is because the incremental version can perform vertex contraction, thereby obtaining a better geodesic distance estimate. The incremental version can be easily adopted to an unbounded data stream when training data are generated by applying invariance transformation, too. 3.4.1 Variants of the Main Algorithms Our incremental algorithm can be modified to cope with variable neighborhood def- inition, if the user is willing to do some tedious book-keeping. We can, for example, use c-neighborhood with the value of 6 re-adjusted whenever, say, 200 data points have arrived. This can be easily achieved by first calculating the edges that need to be deleted or added because of the new neighborhood definition. The algorithms in sections 3.2.1 and 3.2.2 are then used to update the geodesic distances. The embedded co—ordinates can then be updated accordingly. The supervised ISOMAP algorithm in [276], which utilizes a criterion similar to the Fisher discriminant for embedding, can also be converted to become incremental. The only change is that the subspace iteration method for solving a generalized eigenvalue problem is used instead. The proposed incremental ISOMAP can be easily converted to incremental conformal ISOMAP [55]. In conformal ISOMAP, the edge weight wij is 123 Aij / \/ 111 (2)1110), where 111(2) denotes the distance of y,- from its k nearest neighbors. The computation of the shortest path distances and eigen-decomposition remains the same. To convert this to its incremental counterpart, we need to maintain the sum of the weights of the kt nearest neighbors of different vertices. The change in the edge weights due to the insertion and deletion of edges as a new point comes can be easily tracked. The target inner product matrix is updated, and subspace iteration can be used to update the embedding. 
3.4.2 Comparison With Out-of-sample Extension One problem closely related to incremental nonlinear mapping is the “out—of-sample extension” [17]: given the embedding x1, . . . ,xn for a “training set” yl, . . . , yn, what is the embedding result (xn+1) for a “testing” point yn+1? This is effectively the problem considered in section 3.2.1.3. In incremental learning, however, we go beyond obtaining xn+1z the co—ordinate estimates x1, . . . ,xn of the existing points are also improved by yn+1. In the case of incremental ISOMAP, this amounts to updating the geodesic distances and then applying subspace iteration. The out-of-sample extension is faster because it skips the improvement step. How- ever, it is less accurate, and cannot provide intermediate embedding with good quality as points are accumulated. Incremental ISOMAP, on the other hand, utilizes the new samples to continuously improve the co—ordinate estimates. Out—of—sample extension may be more appealing when a large number of samples have been accumulated and the geodesic distances and x1, . . . ,xn are reasonably accurate. Even in this case, 124 though, the strategy of updating x1, . . . ,xn after p new points (with p > 1) have been embedded works equally as well. The updating of geodesic distances and co— ordinates occurs infrequently in this case, and its amortized computational cost is very low. Incremental ISOMAP is also preferable to out-of-sample extension when there is a drifting of data characteristics. In out—of—sample extension, the n points collected are assumed to be representative of all future data points that are likely to be ob- served. There is no way to capture the change of data characteristics. In incremental ISOMAP, however, we can easily maintain an embedding using a window of the points recently encountered. Changes in data characteristics are captured as the geodesic distances and co—ordinate estimates are updated. Vertex contraction should be turned off if incremental ISOMAP is run in this mode, to ensure that the effect of old data points is erased. 3.4.3 Implementation Details The subspace iteration in section 3.2.1.4 requires that the eigenvalues corresponding to the leading eigenvectors have the largest absolute values. This can be violated if the target inner product matrix has a large negative eigenvalue. To tackle this, we shift the spectrum and find the eigenvectors of (B + 01) instead of B. Subspace iteration on (B + 01) can proceed in almost the same manner, because (B + aI)v = B + av. ‘While a large value of a guarantees that all shifted eigenvalues are positive, this has the adverse effect of reducing the rate of convergence of the eigenvectors, 125 because the shift reduces the ratio between adjacent eigenvalues. We empirically set a = max(—0.7Amin(B) — 0-3’\d-th(B).~.0)- where Amin(B) and Ad,th(B) denote the smallest (most negative) and the d—th largest eigenvalues, respectively. The later is being maintained by the incremental algorithm, while the former can be found by, say, residual norm bounds or Gerschgoren disk bounds. In practice, Amin(B) is found at the initialization stage. This estimate is updated only when a large number of data points have been accumulated. During the incremental learning, the neighborhood graph may be temporarily dis- connected. A simple solution is to embed only the largest graph component. The excluded vertices are added back for embedding again when they become reconnected as additional data points are encountered. 
3.5 Summary

Nonlinear dimensionality reduction is an important problem with applications in pattern recognition, computer vision, and machine learning. We have developed an algorithm for the incremental nonlinear mapping problem by modifying the well-known ISOMAP algorithm. The core idea is to efficiently update the geodesic distances (a graph theoretic problem) and re-estimate the eigenvectors (a numerical analysis problem) using the previous computation results. Our experiments on synthetic data as well as real world images validate that the proposed method is almost as accurate as running the batch version, while saving significant computation time. Our algorithm can also be easily adapted to other manifold learning methods to produce their incremental versions.

Chapter 4

Simultaneous Feature Selection and Clustering

Hundreds of clustering algorithms have been proposed in the literature for clustering in different applications. In this chapter, we examine a different aspect of clustering that is often neglected: the issue of feature selection. Our focus will be on partitional clustering by a mixture of Gaussians, though the method presented here can be easily generalized to other types of mixtures. We are interested in mixture-based clustering because its statistical nature gives us a solid foundation for analyzing its behavior; it also leads to good results in many cases. We propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of mixture-based clustering. We adopt the minimum message length (MML) model selection criterion, so the saliency of irrelevant features is driven towards zero, which corresponds to performing feature selection. The MML criterion and the EM algorithm are then extended to simultaneously estimate the feature saliencies and the number of clusters.

The remainder of this chapter is organized as follows. We discuss the challenge of feature selection in the unsupervised domain in Section 4.1. In Section 4.2, we review previous attempts to solve the feature selection problem in unsupervised learning. The details of our approach are presented in Section 4.3. Experimental results are reported in Section 4.4, followed by comments on the proposed algorithm in Section 4.5. Finally, we conclude in Section 4.6.

4.1 Clustering and Feature Selection

Clustering, like supervised classification and regression, can benefit from using a good subset of the available features. A simple example illustrating the corrupting influence of irrelevant features can be seen in Figure 4.1, where the irrelevant feature makes it hard for the algorithm in [81] to discover the two underlying clusters. Feature selection has been widely studied in the context of supervised learning (see [101, 26, 122, 151, 153] and references therein, and also Section 1.2.3.1), where the ultimate goal is to select features that can achieve the highest accuracy on unseen data. Feature selection has received comparatively little attention in unsupervised learning or clustering. One important reason is that it is not at all clear how to assess the relevance of a subset of features without resorting to class labels.
The problem is made even more challenging when the number of clusters is unknown, since the optimal number of clusters and the optimal feature subset are inter-related, as illustrated in Figure 4.2 (taken from [69]). Note that methods based on variance (such as principal components analysis) need not select good features for clustering, as features with large variance can be independent of the intrinsic grouping of the data (see Figure 4.3). Another important problem in clustering is the determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue.

Figure 4.1: An irrelevant feature (x_2) makes it difficult for the Gaussian mixture learning algorithm in [81] to recover the two underlying clusters. Gaussian mixture fitting finds seven clusters when both features are used, but identifies only two clusters when only the feature x_1 is used. The curves along the horizontal and vertical axes of the figure indicate the marginal distributions of x_1 and x_2, respectively.

Most feature selection algorithms (such as [36, 151, 209]) involve a combinatorial search through the space of all feature subsets. Usually, heuristic (non-exhaustive) methods have to be adopted, because the size of this space is exponential in the number of features. In this case, one generally loses any guarantee of optimality of the selected feature subset. We propose a solution to the feature selection problem in unsupervised learning by casting it as an estimation problem, thus avoiding any combinatorial search. Instead of selecting a subset of features, we estimate a set of real-valued (actually in [0, 1]) quantities, one for each feature, which we call the feature saliencies. This estimation is carried out by an EM algorithm derived for the task. Since we are in the presence of a model-selection-type problem, it is necessary to avoid the situation where all the features are completely salient. This is achieved by adopting a minimum message length (MML, [264, 265]) penalty, as was done in [81] to select the number of clusters. The MML criterion encourages the saliencies of the irrelevant features to go to zero, allowing us to prune the feature set. Finally, we integrate the process of feature saliency estimation into the mixture fitting algorithm proposed in [81], thus obtaining a method that is able to simultaneously perform feature selection and determine the number of clusters. This chapter is based on our journal publication in [163].

Figure 4.2: The number of clusters is inter-related with the feature subset used. The optimal feature subsets for identifying 3, 2, and 1 clusters in this data set are {x_1, x_2}, {x_1}, and {x_2}, respectively. Conversely, the optimal numbers of clusters for the feature subsets {x_1, x_2}, {x_1}, and {x_2} are also 3, 2, and 1, respectively.

Figure 4.3: Deficiency of variance-based methods for feature selection. Feature x_1, although it explains more of the data variance than feature x_2, is spurious for the identification of the two clusters in this data set.

4.2 Related Work

Most of the literature on feature selection pertains to supervised learning (see Section 1.2.3.1). Comparatively, not much work has been done on feature selection in unsupervised learning. Of course, any method conceived for supervised learning that does not use the class labels could be used for unsupervised learning; this is the case for methods that measure feature similarity to detect redundant features, using,
e.g., mutual information [221] or a maximum information compression index [188]. In [70, 71], the normalized log-likelihood and cluster separability are used to evaluate the quality of clusters obtained with different feature subsets. Different feature subsets and different numbers of clusters, for multinomial model-based clustering, are evaluated using marginal likelihood and cross-validated likelihood in [254]. The algorithm described in [218] uses a LASSO-based idea to select the appropriate features. In [51], the clustering tendency of each feature is assessed by an entropy index. A genetic algorithm is used in [146] for feature selection in k-means clustering. In [246], feature selection for symbolic data is addressed by assuming that irrelevant features are uncorrelated with the relevant features. Reference [60] describes the notion of "category utility" for feature selection in a conceptual clustering task. The CLIQUE algorithm [2] is popular in the data mining community; it finds hyper-rectangular shaped clusters using a subset of attributes for a large database. The wrapper approach can also be adopted to select features for clustering; this has been explored in our earlier work [82, 165].

All the methods referred to above perform "hard" feature selection (a feature is either selected or not). There are also algorithms that assign weights to different features to indicate their significance. In [190], weights are assigned to different groups of features for k-means clustering based on a score related to the Fisher discriminant. Feature weighting for k-means clustering is also considered in [187], but the goal there is to find the best description of the clusters after they are identified. The method described in [204] can be classified as learning feature weights for conditional Gaussian networks. An EM algorithm based on Bayesian shrinking is proposed in [100] for unsupervised learning.

4.3 EM Algorithm for Feature Saliency

In this section, we propose an EM algorithm for performing mixture-based (or model-based) clustering with feature selection. In mixture-based clustering, each data point is modelled as having been generated by one of a set of probabilistic models [125, 183]. Clustering is then done by learning the parameters of these models and the associated probabilities; each pattern is assigned to the mixture component that most likely generated it. Although the derivations below refer to Gaussian mixtures, they can be generalized to other types of mixtures.

4.3.1 Mixture Densities

A finite mixture density with k components is defined by

p(y|\theta) = \sum_{j=1}^{k} \alpha_j\, p(y|\theta_j),    (4.1)

where \alpha_j \geq 0 and \sum_j \alpha_j = 1; each \theta_j is the set of parameters of the j-th component (all components are assumed to have the same form, e.g., Gaussian); and \theta \equiv \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_k\} denotes the full parameter set. The goal of mixture estimation is to infer \theta from a set of n data points \mathcal{Y} = \{y_1, \ldots, y_n\}, assumed to be samples of a distribution with density given by (4.1). Each y_i is a d-dimensional feature vector [y_{i1}, \ldots, y_{id}]^T. In the sequel, we will use the indices i, j, and l to run through data points (1 to n), mixture components (1 to k), and features (1 to d), respectively. As is well known, neither the maximum likelihood (ML) estimate,

\hat{\theta}_{ML} = \arg\max_{\theta} \{ \log p(\mathcal{Y}|\theta) \},

nor the maximum a posteriori (MAP) estimate (given some prior p(\theta)),

\hat{\theta}_{MAP} = \arg\max_{\theta} \{ \log p(\mathcal{Y}|\theta) + \log p(\theta) \},

can be found analytically.
The usual choice is the EM algorithm, which finds local maxima of these criteria [183]. This algorithm is based on a set \mathcal{Z} = \{z_1, \ldots, z_n\} of n missing (latent) labels, where z_i = [z_{i1}, \ldots, z_{ik}], with z_{ij} = 1 and z_{ip} = 0 for p \neq j, meaning that y_i is a sample of p(\cdot|\theta_j). For brevity of notation, we sometimes write z_i = j for such a z_i. The complete data log-likelihood, i.e., the log-likelihood if \mathcal{Z} were observed, is

\log p(\mathcal{Y}, \mathcal{Z}|\theta) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \log [\alpha_j\, p(y_i|\theta_j)].    (4.2)

The EM algorithm produces a sequence of estimates \{\hat{\theta}(t), t = 0, 1, 2, \ldots\} using two alternating steps:

- E-step: Compute \mathcal{W} = E[\mathcal{Z} | \mathcal{Y}, \hat{\theta}(t)], the expected value of the missing data given the current parameter estimate, and plug it into \log p(\mathcal{Y}, \mathcal{Z}|\theta), yielding the so-called Q-function Q(\theta, \hat{\theta}(t)) = \log p(\mathcal{Y}, \mathcal{W}|\theta). Since the elements of \mathcal{Z} are binary, we have

w_{ij} \equiv E[z_{ij} | \mathcal{Y}, \hat{\theta}(t)] = \Pr[z_{ij} = 1 | y_i, \hat{\theta}(t)] = \frac{\hat{\alpha}_j(t)\, p(y_i|\hat{\theta}_j(t))}{\sum_{j'=1}^{k} \hat{\alpha}_{j'}(t)\, p(y_i|\hat{\theta}_{j'}(t))}.    (4.3)

Notice that \alpha_j is the a priori probability that z_{ij} = 1 (i.e., that y_i belongs to cluster j), while w_{ij} is the corresponding a posteriori probability, after observing y_i.

- M-step: Update the parameter estimates,

\hat{\theta}(t+1) = \arg\max_{\theta} \{ Q(\theta, \hat{\theta}(t)) + \log p(\theta) \},

in the case of MAP estimation, or without \log p(\theta) in the ML case.

4.3.2 Feature Saliency

In this section we define the concept of feature saliency and derive an EM algorithm to estimate its value. We assume that the features are conditionally independent given the (hidden) component label, that is,

p(y|\theta) = \sum_{j=1}^{k} \alpha_j\, p(y|\theta_j) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} p(y_l|\theta_{jl}),    (4.4)

where p(\cdot|\theta_{jl}) is the pdf of the l-th feature in the j-th component. This assumption enables us to utilize the power of the EM algorithm. In the particular case of Gaussian mixtures, the conditional independence assumption is equivalent to adopting diagonal covariance matrices, which is a common choice for high-dimensional data, such as in naive Bayes classifiers and latent class models, as well as in the emission densities of continuous hidden Markov models.

Among the different definitions of feature irrelevancy (proposed for supervised learning), we adopt the one suggested in [210, 254], which is suitable for unsupervised learning: the l-th feature is irrelevant if its distribution is independent of the class labels, i.e., if it follows a common density, denoted by q(y_l|\lambda_l).

Figure 4.4: An example graphical model for the probability model in Equation (4.5) for the case of four features (d = 4) with different indicator variables: (a) \phi_1 = 1, \phi_2 = 1, \phi_3 = 0, \phi_4 = 1; (b) \phi_1 = 0, \phi_2 = 1, \phi_3 = 1, \phi_4 = 0. \phi_l = 1 corresponds to the existence of an arc from z to y_l, and \phi_l = 0 corresponds to its absence.

Let \Phi = (\phi_1, \ldots, \phi_d) be a set of d binary parameters, such that \phi_l = 1 if feature l is relevant and \phi_l = 0 otherwise. The mixture density in (4.4) can then be re-written as

p(y | \Phi, \{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} [p(y_l|\theta_{jl})]^{\phi_l} [q(y_l|\lambda_l)]^{1-\phi_l}.    (4.5)

A related model for feature selection in supervised learning has been considered in [197, 210]. Intuitively, \Phi determines which edges exist between the hidden label z and the individual features y_l in the graphical model illustrated in Figure 4.4, for the case d = 4. Our notion of feature saliency is summarized in the following steps: (i) we treat the \phi_l's as missing variables; (ii) we define the feature saliency as \rho_l = p(\phi_l = 1), the probability that the l-th feature is relevant. This definition makes sense, as it is difficult to know for sure that a certain feature is irrelevant in unsupervised learning.
The resulting model (likelihood function) is written as

p(y|\theta) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \big[ \rho_l\, p(y_l|\theta_{jl}) + (1-\rho_l)\, q(y_l|\lambda_l) \big],    (4.6)

where \theta = \{\{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}, \{\rho_l\}\} is the set of all the parameters of the model.

Equation (4.6) can be derived as follows. We treat \rho_l = p(\phi_l = 1) as a set of parameters to be estimated (the feature saliencies). We assume the \phi_l's are mutually independent and also independent of the hidden component label z for any pattern y. Thus,

p(y, \Phi) = p(y|\Phi)\, p(\Phi)
 = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} (p(y_l|\theta_{jl}))^{\phi_l} (q(y_l|\lambda_l))^{1-\phi_l} \prod_{l=1}^{d} \rho_l^{\phi_l} (1-\rho_l)^{1-\phi_l}
 = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} (\rho_l\, p(y_l|\theta_{jl}))^{\phi_l} ((1-\rho_l)\, q(y_l|\lambda_l))^{1-\phi_l}.    (4.7)

The marginal density for y is

p(y) = \sum_{\Phi} p(y, \Phi) = \sum_{j=1}^{k} \alpha_j \sum_{\Phi} \prod_{l=1}^{d} (\rho_l\, p(y_l|\theta_{jl}))^{\phi_l} ((1-\rho_l)\, q(y_l|\lambda_l))^{1-\phi_l}
 = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \sum_{\phi_l=0}^{1} (\rho_l\, p(y_l|\theta_{jl}))^{\phi_l} ((1-\rho_l)\, q(y_l|\lambda_l))^{1-\phi_l}
 = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \big[ \rho_l\, p(y_l|\theta_{jl}) + (1-\rho_l)\, q(y_l|\lambda_l) \big],    (4.8)

which is just Equation (4.6). Another way to see how Equation (4.6) is obtained is to notice that the conditional density of y_l given z = j and \phi_l, namely [p(y_l|\theta_{jl})]^{\phi_l}[q(y_l|\lambda_l)]^{1-\phi_l}, can be written as \phi_l\, p(y_l|\theta_{jl}) + (1-\phi_l)\, q(y_l|\lambda_l), because \phi_l is binary. Taking the expectation with respect to \phi_l and z leads to Equation (4.6).

The form of q(\cdot|\cdot) reflects our prior knowledge about the distribution of the non-salient features. In principle it can be any 1-D distribution (e.g., a Gaussian, a Student t, or even a mixture). We shall limit q(\cdot|\cdot) to be a Gaussian, since this leads to reasonable results in practice.

Figure 4.5: An example graphical model showing the mixture density in Equation (4.6). The variables z, \phi_1, \phi_2, \phi_3, \phi_4 are "hidden" and only y_1, y_2, y_3, y_4 are observed.

Equation (4.6) has a generative interpretation. As in a standard finite mixture, we first select the component label j by sampling from a multinomial distribution with parameters (\alpha_1, \ldots, \alpha_k). Then, for each feature l = 1, \ldots, d, we flip a biased coin whose probability of getting a head is \rho_l; if we get a head, we use the mixture component p(\cdot|\theta_{jl}) to generate the l-th feature; otherwise, the common component q(\cdot|\lambda_l) is used. A graphical model representation of Equation (4.6) is shown in Figure 4.5 for the case d = 4.

4.3.2.1 EM Algorithm

By treating \mathcal{Z} (the hidden class labels) and \Phi (the feature indicators) as hidden variables, one can derive an EM algorithm for parameter estimation. The complete-data likelihood for the model in Equation (4.6) is

p(y_i, z_i = j, \Phi) = \alpha_j \prod_{l=1}^{d} (\rho_l\, p(y_{il}|\theta_{jl}))^{\phi_l} ((1-\rho_l)\, q(y_{il}|\lambda_l))^{1-\phi_l}.    (4.9)

Define the following quantities:

w_{ij} = P(z_i = j | y_i),  u_{ijl} = P(\phi_l = 1, z_i = j | y_i),  v_{ijl} = P(\phi_l = 0, z_i = j | y_i).

They are calculated using the current parameter estimate \theta^{now}. Note that u_{ijl} + v_{ijl} = w_{ij} and \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij} = n. The expected complete data log-likelihood based on \theta^{now} is

E_{\theta^{now}}[\log p(\mathcal{Y}, \mathcal{Z}, \Phi)] = \sum_{i,j} w_{ij} \log \alpha_j + \sum_{i,j,l} u_{ijl} \log p(y_{il}|\theta_{jl}) + \sum_{i,j,l} v_{ijl} \log q(y_{il}|\lambda_l)
 + \sum_{l} \Big( \log \rho_l \sum_{i,j} u_{ijl} + \log(1-\rho_l) \sum_{i,j} v_{ijl} \Big).    (4.10)

The four parts in the equation above can be maximized separately. Recall that the densities p(\cdot) and q(\cdot) are univariate Gaussian and are characterized by their means and variances. As a result, maximizing the expected complete data log-likelihood leads to the M-step in Equations (4.18)-(4.23).
For the E-step, observe that

P(\phi_l = 1 | z_i = j, y_i) = \frac{P(\phi_l = 1, y_i | z_i = j)}{p(y_i | z_i = j)}
 = \frac{\rho_l\, p(y_{il}|\theta_{jl}) \prod_{l' \neq l} \big( \rho_{l'} p(y_{il'}|\theta_{jl'}) + (1-\rho_{l'}) q(y_{il'}|\lambda_{l'}) \big)}{\prod_{l'} \big( \rho_{l'} p(y_{il'}|\theta_{jl'}) + (1-\rho_{l'}) q(y_{il'}|\lambda_{l'}) \big)}
 = \frac{\rho_l\, p(y_{il}|\theta_{jl})}{\rho_l\, p(y_{il}|\theta_{jl}) + (1-\rho_l)\, q(y_{il}|\lambda_l)} = \frac{a_{ijl}}{c_{ijl}}.

Therefore, Equation (4.16) follows because

u_{ijl} = P(\phi_l = 1 | z_i = j, y_i)\, P(z_i = j | y_i) = \frac{a_{ijl}}{c_{ijl}}\, w_{ij}.    (4.11)

So, the EM algorithm is

- E-step: Compute the following quantities:

a_{ijl} = p(\phi_l = 1, y_{il} | z_i = j) = \rho_l\, p(y_{il}|\theta_{jl})    (4.12)
b_{ijl} = p(\phi_l = 0, y_{il} | z_i = j) = (1-\rho_l)\, q(y_{il}|\lambda_l)    (4.13)
c_{ijl} = p(y_{il} | z_i = j) = a_{ijl} + b_{ijl}    (4.14)
w_{ij} = P(z_i = j | y_i) = \frac{\alpha_j \prod_l c_{ijl}}{\sum_{j'} \alpha_{j'} \prod_l c_{ij'l}}    (4.15)
u_{ijl} = P(\phi_l = 1, z_i = j | y_i) = \frac{a_{ijl}}{c_{ijl}}\, w_{ij}    (4.16)
v_{ijl} = P(\phi_l = 0, z_i = j | y_i) = w_{ij} - u_{ijl}    (4.17)

- M-step: Re-estimate the parameters according to the following expressions:

\hat{\alpha}_j = \frac{\sum_i w_{ij}}{\sum_{i,j} w_{ij}} = \frac{\sum_i w_{ij}}{n}    (4.18)
\text{Mean in } \hat{\theta}_{jl} = \frac{\sum_i u_{ijl}\, y_{il}}{\sum_i u_{ijl}}    (4.19)
\text{Var in } \hat{\theta}_{jl} = \frac{\sum_i u_{ijl}\, (y_{il} - \text{Mean in } \hat{\theta}_{jl})^2}{\sum_i u_{ijl}}    (4.20)
\text{Mean in } \hat{\lambda}_l = \frac{\sum_i (\sum_j v_{ijl})\, y_{il}}{\sum_i \sum_j v_{ijl}}    (4.21)
\text{Var in } \hat{\lambda}_l = \frac{\sum_i (\sum_j v_{ijl})\, (y_{il} - \text{Mean in } \hat{\lambda}_l)^2}{\sum_i \sum_j v_{ijl}}    (4.22)
\hat{\rho}_l = \frac{\sum_{i,j} u_{ijl}}{\sum_{i,j} u_{ijl} + \sum_{i,j} v_{ijl}} = \frac{\sum_{i,j} u_{ijl}}{n}    (4.23)

In these equations, the variable u_{ijl} measures how important the i-th pattern is to the j-th component when the l-th feature is used. It is thus natural that the estimates of the mean and the variance in \theta_{jl} are weighted sums with weights u_{ijl}. A similar relationship exists between \sum_j v_{ijl} and \lambda_l. The term \sum_{i,j} u_{ijl} can be interpreted as how likely it is that \phi_l equals one, explaining why the estimate of \rho_l is proportional to \sum_{i,j} u_{ijl}.
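To make one full EM sweep concrete, the sketch below implements Equations (4.12)-(4.23) for univariate Gaussian p(.|theta_jl) and q(.|lambda_l) using vectorized numpy operations. It is a minimal illustration of the update rules rather than the thesis code; the array shapes and the small constant added for numerical safety are assumptions of this sketch.

```python
import numpy as np

def gauss_pdf(y, mean, var):
    """Univariate Gaussian density, broadcast over arrays."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_step(Y, alpha, mu, var, mu_q, var_q, rho, eps=1e-9):
    """One E-step + M-step of the feature-saliency EM (Eqs. 4.12-4.23).

    Y         : (n, d) data matrix.
    alpha     : (k,) component weights.
    mu, var   : (k, d) means/variances of p(y_l | theta_jl).
    mu_q, var_q : (d,) mean/variance of the common density q(y_l | lambda_l).
    rho       : (d,) feature saliencies.
    """
    n, d = Y.shape
    # E-step: a, b, c, u, v have shape (n, k, d); w has shape (n, k).
    a = rho * gauss_pdf(Y[:, None, :], mu, var)                   # (4.12)
    b = (1 - rho) * gauss_pdf(Y, mu_q, var_q)[:, None, :]         # (4.13)
    c = a + b                                                     # (4.14)
    w = alpha * np.exp(np.log(c + eps).sum(axis=2))               # numerator of (4.15)
    w /= w.sum(axis=1, keepdims=True)                             # (4.15)
    u = (a / (c + eps)) * w[:, :, None]                           # (4.16)
    v = w[:, :, None] - u                                         # (4.17)

    # M-step.
    alpha = w.sum(axis=0) / n                                     # (4.18)
    su = u.sum(axis=0) + eps
    mu = (u * Y[:, None, :]).sum(axis=0) / su                     # (4.19)
    var = (u * (Y[:, None, :] - mu) ** 2).sum(axis=0) / su + eps  # (4.20)
    sv = v.sum(axis=(0, 1)) + eps
    mu_q = (v.sum(axis=1) * Y).sum(axis=0) / sv                   # (4.21)
    var_q = (v.sum(axis=1) * (Y - mu_q) ** 2).sum(axis=0) / sv    # (4.22)
    rho = u.sum(axis=(0, 1)) / n                                  # (4.23)
    return alpha, mu, var, mu_q, var_q, rho
```

In Algorithm 4.1 below, the updates (4.18) and (4.23) are replaced by their pruned versions (4.29) and (4.30), derived in the next subsection.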
The size of 1(0) is (k + (1+ kdr + (15), where r and s are the number of parameters in 6]) and A), respectively. Note that (p[(1 — pl))_1 is the Fisher information of a Bernoulli distribution with parameter p1. Thus we can write (1 k (1 1041140)! = 1.1g1({a,))+ 210ng,) + T 2: Dog (am) 1:1 j:11 .__1 k d d d (4.27) +ZZlogI(6jl)+s:10g(1—pl) +Zlogl()\l) j=11=1 1:1 [:1 For the prior densities of the parameters, we assume that different groups of parame- ters are independent. Specifically, {(1')}, pl (for different values of l), 63-, (for different values of j and l) and A) (for different values Of 1) are independent. Furthermore, since we have no knowledge about. the parameters, we adopt non-informative Jeffrey’s pri- ors (see [81] for details and references), which are proportional to the square root of the determinant of the corresponding information matrices. When we substitute p(0) and [I(0)| into equation (4.25), and drop the order-one term, we obtain our final criterion, which is equation (4.24). From a parameter estimation viewpoint, Equation (4.24) is equivalent to a massi- mum a posterior? (MAP) estimate, k. d d . d "k 0 = arg mgx{logp(y|9) — :2— 2 log aJ- — g E log(1 — pl) — 1—2— 2 log pl}, (4.28) 144 with the following (Dirichlet-type, but improper) priors 011 the o-j’s and pl’s: 44-.., 4. 4 11 71/2, —k2(1 _ P(p11-~1p(1)0( Hp r/ ()1) 3/2- Since these priors are conjugate with respect to the complete data likelihood, the EM algorithm undergoes a. minor modification: the M—step Equations (4.18) and (4.23) are replaced by , l ’5 _ max(zi 'wij — :29, O) (4 29) “3‘2 (2' ---"’ o1 ‘ J max , 111,] —2—, max<2.-,,u.,1— 4 0) rnax(Z,-J um - 525, 0) + max(Z,-J vifl —— 3 0) 17) = (4.30) 111 addition to the log—likelihood, the other terms in Equation (4.24) have simple ii'iterpretations. The term—2— k+d log n is a standard MDL type [215] parameter code- length corresponding to k 03' values and d p) values. For the l—th feature in the j-th component, the “effective” number of data points for estimating (9]) is najpl. Since there are 7‘ parameters in each 6]), the corresponding code-length is glog(nplaj). Similarly, for the l—th feature in the common component, the number of effective data points for estimation is 71(1 — pl). Thus, there is a term 3 log(n(1 — p))) in (4.24) for each feature. One key property of Equations (4.29) and (4.30) is their pruning behavior, forcing some of the aj to go to zero and some of the pl to go to zero or one. This pruning 145 behavior also has the indirect benefit of protecting us from almost singular covariance matrix in a mixture component: the weight of such a component is usually very small, and the component is likely to be pruned in the next few iterations. Concerns that the message length in (4.24) may become invalid at these boundary values can be circumvented by the arguments in [81]: when p) goes to zero, the l-th feature is no longer salient and p) and 611, . . . ,0)“ are removed; when Pl goes to 1, A) and pl are dropped. Finally, since the model selection algorithm determines the number of components, it can be initialized with a. large value of k, thus alleviating the need for a good initialization, as shown in [81]. Because of this, a component-wise version of EM can be adopted [37, 81]. The algorithm is summarized in Algorithm 4.1. Input: Training data y = {y1, . . . 
4.3.4 Post-processing of Feature Saliency

The feature saliencies generated by Algorithm 4.1 attempt to find the best way to model the data, using different component densities. Alternatively, we can consider feature saliencies that best discriminate between different components. This can be more appropriate if the ultimate goal is to discover well-separated clusters. If the components are well-separated, each pattern is likely to be generated by one component only. Therefore, one quantitative measure of the separability of the clusters is

J = \sum_{i=1}^{n} \log P(z_i = t_i | y_i),    (4.31)

where t_i = \arg\max_j P(z_i = j | y_i). Intuitively, J is the sum of the logarithms of the posterior probabilities of the data, assuming that each data point was indeed generated by the component with maximum posterior probability (an implicit assumption in mixture-based clustering). J can then be maximized by varying \rho_l while keeping the other parameters fixed. Unlike the MML criterion, J cannot be optimized by an EM algorithm. However, by defining

h_{ilj} = \frac{p(y_{il}|\theta_{jl}) - q(y_{il}|\lambda_l)}{\rho_l\, p(y_{il}|\theta_{jl}) + (1-\rho_l)\, q(y_{il}|\lambda_l)},   g_{il} = \sum_{j=1}^{k} w_{ij}\, h_{ilj},

we can show that

\frac{\partial}{\partial\rho_l}\log P(z_i = j | y_i) = h_{ilj} - g_{il},
\frac{\partial^2 J}{\partial\rho_l\,\partial\rho_m} = \sum_{i=1}^{n}\Big( g_{il}\, g_{im} - \sum_{j=1}^{k} w_{ij}\, h_{ilj}\, h_{imj} \Big) \quad \text{for } l \neq m,
\frac{\partial^2 J}{\partial\rho_l^2} = \sum_{i=1}^{n}\big( g_{il}^2 - h_{ilt_i}^2 \big).

The gradient and Hessian of J can then be calculated accordingly, if we ignore the dependence of t_i on \rho_l. We can then use any constrained non-linear optimization software to find the optimal values of \rho_l in [0, 1]. We have used the MATLAB optimization toolbox in our experiments. After obtaining the set of optimized \rho_l, we fix them and estimate the remaining parameters using the EM algorithm.
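The sketch below evaluates J and its gradient with respect to rho from the quantities defined above. It is an illustrative fragment under the notation of this section (the per-feature density values are assumed to be pre-computed and passed in as arrays); the dependence of t_i on rho is ignored, as in the text.

```python
import numpy as np

def saliency_objective(alpha, P, Q, rho, eps=1e-12):
    """Criterion J of Eq. (4.31) and its gradient with respect to rho.

    P : (n, k, d) values of p(y_il | theta_jl); Q : (n, d) values of
    q(y_il | lambda_l); alpha : (k,) component weights; rho : (d,) saliencies.
    """
    n, k, d = P.shape
    c = rho * P + (1 - rho) * Q[:, None, :]              # rho_l p + (1 - rho_l) q
    w = alpha * np.exp(np.log(c + eps).sum(axis=2))      # unnormalized posteriors
    w /= w.sum(axis=1, keepdims=True)
    t = w.argmax(axis=1)                                 # t_i = argmax_j P(z_i = j | y_i)
    J = np.log(w[np.arange(n), t] + eps).sum()           # Eq. (4.31)
    h = (P - Q[:, None, :]) / (c + eps)                  # h_{ilj}, shape (n, k, d)
    g = np.einsum('ij,ijl->il', w, h)                    # g_{il} = sum_j w_ij h_ilj
    grad = (h[np.arange(n), t, :] - g).sum(axis=0)       # dJ/drho_l
    return J, grad
```

The gradient (and, if desired, the Hessian given above) can be handed to any box-constrained optimizer that keeps each rho_l in [0, 1], for example by minimizing -J with scipy.optimize.minimize under bound constraints, in place of the MATLAB toolbox mentioned above.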
4.4 Experimental Results

4.4.1 Synthetic Data

The first synthetic data set consisted of 800 data points from a mixture of four equiprobable Gaussians N(m_i, I), i \in \{1, 2, 3, 4\}, whose well-separated mean vectors m_1, \ldots, m_4 can be seen in Figure 4.6(a). Eight "noisy" features (sampled from a N(0, 1) density) were then appended to this data, yielding a set of 800 10-dimensional patterns. We ran the proposed algorithm 10 times, each time initialized with k = 30; the common component was initialized to cover the entire set of data, and the feature saliency values were initialized at 0.5. A local minimum was declared if the change in description length between two iterations was less than 10^{-7}. A typical run of the algorithm is shown in Figure 4.6. In all the ten random runs with this mixture, the four components were always correctly identified.

Figure 4.6: An example execution of the proposed algorithm: (a) the data set; (b) initialization (iteration 1, k = 30); (c) a snapshot (iteration 35, k = 13); (d) \rho_2 is "pruned" to 1 (iteration 40, k = 10); (e) a local minimum (iteration 99, k = 5); (f) the best local minimum (iteration 182, k = 4). The solid ellipses represent the Gaussian mixture components; the dotted ellipse represents the common density. The number in parentheses along each axis label is the feature saliency; when it reaches 1, the common component is no longer applicable to that feature. Thus, in (d), the common component degenerates to a line; when the feature saliency for feature 1 also becomes 1, as in (f), the common density degenerates to a point at (0, 0).

The saliencies of all the ten features, together with their standard deviations (error bars), are shown in Figure 4.7(a). We can conclude that, in this case, the algorithm successfully locates the true clusters and correctly assigns the feature saliencies.

In the second experiment, we considered the Trunk data [122, 252], consisting of two 20-dimensional Gaussians N(m_1, I) and N(m_2, I), where m_1 = (1, 1/\sqrt{2}, \ldots, 1/\sqrt{20}) and m_2 = -m_1. Data were obtained by sampling 5000 points from each of these two Gaussians. Note that the features are arranged in descending order of relevance. As above, the stopping threshold was set to 10^{-7} and the initial value of k was set to 30. In all the 10 runs performed, the two components were always detected. The feature saliencies are shown in Figure 4.7(b). The lower the feature number, the more important the feature. We can see the general trend that, as the feature number increases, the saliency decreases, in accordance with the true characteristics of the data.

Figure 4.7: Feature saliencies for (a) the 10-D 4-Gaussian data set used in Figure 4.6(a), and (b) the Trunk data set. The mean values plus and minus one standard deviation over ten runs are shown. Recall that features 3 to 10 of the 4-Gaussian data set are the noisy features.
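For reference, data of the same form as the two synthetic experiments can be generated as sketched below. The specific 2-D means used for the four-Gaussian set are illustrative placeholders chosen only to be well separated; they are not the exact values used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def four_gaussians_with_noise(n_per=200, noise_dim=8):
    """800 points from 4 equiprobable 2-D Gaussians N(m_i, I) plus noisy features.

    The means below are hypothetical, chosen only for good separation.
    """
    means = np.array([[0.0, 4.0], [1.0, 10.0], [6.0, 4.0], [7.0, 10.0]])
    X = np.vstack([rng.normal(m, 1.0, size=(n_per, 2)) for m in means])
    labels = np.repeat(np.arange(4), n_per)
    noise = rng.normal(0.0, 1.0, size=(X.shape[0], noise_dim))  # irrelevant features 3-10
    return np.hstack([X, noise]), labels

def trunk_data(n_per=5000, dim=20):
    """Trunk data: N(m1, I) vs N(-m1, I) with m1 = (1, 1/sqrt(2), ..., 1/sqrt(dim))."""
    m1 = 1.0 / np.sqrt(np.arange(1, dim + 1))
    X = np.vstack([rng.normal(m1, 1.0, size=(n_per, dim)),
                   rng.normal(-m1, 1.0, size=(n_per, dim))])
    labels = np.repeat([0, 1], n_per)
    return X, labels
```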
4.4.2 Real Data

We tested our algorithm on several data sets with different characteristics (Table 4.1). The wine recognition data set (wine) contains results of chemical analysis of wines grown in different cultivars; the goal is to predict the type of a wine based on its chemical composition. It has 178 data points, 13 features, and 3 classes. The Wisconsin diagnostic breast cancer data set (wdbc) was used to obtain a binary diagnosis (benign or malignant) based on 30 features extracted from cell nuclei presented in an image; it has 569 data points. The image segmentation data set (image) contains 2320 data points with 19 features from seven classes; each pattern consists of features extracted from a 3 x 3 region taken from 7 types of outdoor images: brickface, sky, foliage, cement, window, path, and grass. The texture data set (texture) consists of 4000 19-dimensional Gabor filter features from a collage of four Brodatz textures [127]. A data set (zernike) of 47 Zernike moments extracted from images of handwritten numerals (as in [126]) is also used; there are 200 images for each digit, totaling 2000 patterns. The data sets wine, wdbc, image, and zernike are from the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLSummary.html). This repository has been extensively used in pattern recognition and machine learning studies. Normalization to zero mean and unit variance is performed for all but the texture data set, so as to make the contributions of different features roughly equal a priori. We do not normalize the texture data set because it is already approximately normalized.

Table 4.1: Real world data sets used in the experiments. Each data set has n data points with d features from c classes. One feature with a constant value in image is discarded. Normalization is not needed for texture because the features have comparable variances.

Abbr.     Full name                              n     d   c   Normalized?
wine      wine recognition                       178   13  3   yes
wdbc      Wisconsin diagnostic breast cancer     569   30  2   yes
image     image segmentation                     2320  18  7   yes
texture   Texture data set                       4000  19  4   no
zernike   Zernike moments of digit images        2000  47  10  yes

Since these data sets were collected for supervised classification, the class labels are not involved in our experiments, except for the evaluation of the clustering results. Each data set was first randomly divided into two halves: one for training, another for testing. Algorithm 4.1 was run on the training set, and the feature saliency values can be refined as described in Section 4.3.4. We evaluated the results by interpreting the fitted Gaussian components as clusters and comparing them with the ground truth labels. Each data point in the test set was assigned to the component that most likely generated it, and the pattern was classified to the class represented by that component. We then computed the error rates on the test data. For comparison, we also ran the mixture of Gaussians algorithm in [81] using all the features, with the number of classes of the data set as a lower bound on the number of components. This gives us a fair ground for comparing Gaussian mixtures with and without feature saliency. In order to ensure that we had enough data with respect to the number of features for the algorithm in [81], the covariance matrices of the mixture components were restricted to be diagonal, but were different for different components.
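The evaluation just described (assign each test point to its most likely component, associate each component with a class, and report the test error) can be sketched as follows. The majority-vote mapping from components to classes is one reasonable reading of "the class represented by the component" and is an assumption of this sketch.

```python
import numpy as np

def evaluate_clustering(post_train, y_train, post_test, y_test):
    """Error rate of a fitted mixture used as a clusterer.

    post_* : (n, k) posterior probabilities P(z = j | y) for each split.
    y_*    : ground-truth class labels, used only for evaluation.
    """
    comp_train = post_train.argmax(axis=1)
    k = post_train.shape[1]
    # Map each component to the majority class among its training points.
    comp_to_class = {}
    for j in range(k):
        members = y_train[comp_train == j]
        comp_to_class[j] = np.bincount(members).argmax() if members.size else -1
    pred = np.array([comp_to_class[j] for j in post_test.argmax(axis=1)])
    return float(np.mean(pred != y_test))
```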
The entire procedure was repeated 20 times with different splits of the data and different initializations of the algorithm. The results are shown in Table 4.2. We also show the feature saliency values of different features in different runs as gray-level image maps in Figure 4.9. For illustrative purposes, we contrast the clusters obtained for the image data set with the true class labels in Figure 4.8, after using PCA to project the data into 3D.

Table 4.2: Results of the algorithm over 20 random data splits and algorithm initializations. "Error" is the mean of the error rates on the test set when the clustering results are compared with the ground truth labels; k-hat denotes the estimated number of Gaussian components. Note that the post-processing does not change the number of Gaussian components. The numbers in parentheses are the standard deviations of the corresponding quantities.

           Algorithm 4.1              With post-processing   Using all the features
           error (in %)   k-hat       error (in %)           error (in %)   k-hat
wine       6.61 (3.91)    3.1 (0.31)    6.61 (3.23)          8.06 (3.73)    3 (0)
wdbc       9.55 (1.99)    5.65 (0.75)   9.35 (2.07)          10.09 (2.00)   2.70 (0.57)
image      20.19 (1.54)   23.1 (1.74)   20.28 (1.60)         32.84 (5.1)    13.8 (1.94)
texture    4.04 (0.76)    36.17 (1.19)  4.02 (0.74)          4.85 (0.98)    31.42 (2.81)
zernike    52.09 (2.52)   11.3 (0.98)   51.99 (2.32)         56.42 (3.62)   10 (0)

From Table 4.2, we can see that the proposed algorithm reduces the error rates when compared with using all the features, for all five data sets. The improvement is more significant for the image data set, but this may be due to the increased number of components estimated. The high error rate for zernike is due to the fact that digit images are inherently more difficult to cluster: for example, a "4" can be written in a manner very similar to a "9", and it is difficult for any unsupervised learning algorithm to distinguish between them. The post-processing can increase the "contrast" of the feature saliencies, as the image maps in Figure 4.9 show, without deteriorating the accuracy. It is easier to perform "hard" feature selection using these post-processed feature saliencies, if this is required for the application.

Figure 4.8: Clustering results on the image data set, with the data points projected to 3D by PCA; only the labels for the testing data are shown. (a) The true class labels. (b) The clustering results of Algorithm 4.1. (c) The clustering result using all the features. A cluster is matched to its majority class before plotting. The error rates for the proposed algorithm and the algorithm using all the features in this particular run are 22% and 30%, respectively.

4.5 Discussion

4.5.1 Complexity

The major computational load in the proposed algorithm is in the E-step and the M-step. Each E-step iteration computes O(ndk) quantities. As each quantity can be computed in constant time, the time complexity of the E-step is O(ndk). Similarly, the M-step takes O(ndk) time. The total amount of computation depends on the number of iterations required for convergence.

At first sight, the amount of computation seems demanding. However, a closer examination reveals that each iteration (E-step and M-step) of the standard EM algorithm also takes O(ndk) time. The value of k in the standard EM, though, is usually smaller, because the proposed algorithm starts with a larger number of components. The number of iterations required for our algorithm is also in general larger because of the increase in the number of parameters. Therefore, it is true that the proposed algorithm takes more time than the standard EM algorithm with one parameter setting.
However, the proposed algorithm can determine the number of clusters as well as the feature subset. If we wanted to achieve the same goal with the standard EM algorithm using a wrapper approach, we would need to re-run EM multiple times with different numbers of components and different feature subsets. The computational demand of that approach is much heavier than that of the proposed algorithm, even with a heuristic search to guide the selection of feature subsets. Another strength of the proposed algorithm is that, by initializing with a large number of Gaussian components, it is less sensitive to the local minimum problem than the standard EM algorithm. We can further reduce the complexity by adopting optimization techniques applicable to standard EM for Gaussian mixtures, such as sampling the data, compressing the data [28], or using efficient data structures [203, 224].

For the post-processing step in Section 4.3.4, each computation of the quantity J and its gradient and Hessian takes O(ndk) time. The number of iterations is difficult to predict, as it depends on the optimization routine. However, we can always put an upper bound on the number of iterations and trade speed for the optimality of the results.

4.5.2 Relation to Shrinkage Estimate

One interpretation of Equation (4.6) is that we "regularize" the distribution of each feature in the different components by the common distribution. This is analogous to the shrinkage estimator for covariance matrices of class-conditional densities [68], which is a weighted sum of an estimate of the class-specific covariance matrix and the "global" covariance matrix estimate. In Equation (4.6), the pdf of the l-th feature is also a weighted sum of a component-specific pdf and a common density. An important difference here is that the weight \rho_l is estimated from the data, using the MML principle, instead of being set heuristically, as is commonly done. As shrinkage estimators have found empirical success in combating data scarcity, this "regularization" viewpoint is an alternative explanation for the usefulness of the proposed algorithm.

4.5.3 Limitation of the Proposed Algorithm

A limitation of the proposed algorithm is the feature independence assumption (conditioned on the mixture component). While, empirically, violating the independence assumption usually does not affect the accuracy of a classifier (as in supervised learning) or the quality of clusters (as in unsupervised learning), it does have some negative influence on the feature selection problem. Specifically, a feature that is redundant, because its distribution is independent of the component label given another feature, cannot be modelled under the feature independence assumption. As a result, both features are kept. This explains why, in general, the feature saliencies are somewhat high. The post-processing in Section 4.3.4 can cope with this problem because it considers the posterior distribution and can therefore directly discard features that do not help in identifying the clusters.

4.5.4 Extension to Semi-supervised Learning

Sometimes, we may have some knowledge of the class labels of different Gaussian components. This can happen when, say, we adopt a procedure to combine different Gaussian components to form a cluster (e.g., as in [216]), or in a semi-supervised learning scenario, where we can use a small amount of labelled data to help us identify which Gaussian component belongs to which class.
This additional information can suggest the combination of several Gaussian components to form a single class/cluster, thereby allowing the identification of non-Gaussian clusters. The post-processing step can take advantage of this information.

Suppose we know there are C classes, and that the posterior probability that pattern y_i belongs to the c-th class, denoted \tau_{ic}, can be computed as \tau_{ic} = \sum_{j=1}^{k} \beta_{cj} P(z_i = j | y_i). For example, if we know that the components 4, 6, and 10 are from class 2, we can set \beta_{2,4} = \beta_{2,6} = \beta_{2,10} = 1/3 and the other \beta_{2,j} to zero. The post-processing is modified accordingly: redefine t_i in Equation (4.31) as t_i = \arg\max_c \tau_{ic}, i.e., it becomes the class label for y_i in view of the extra information, and replace \log P(z_i = t_i | y_i) in Equation (4.31) by \log \tau_{it_i}. The gradient and Hessian can still be computed easily after noting that

\frac{\partial}{\partial\rho_l}\log w_{ij} = h_{ilj} - g_{il},
\frac{\partial}{\partial\rho_l}\log\tau_{ic} = \frac{1}{\tau_{ic}}\sum_{j=1}^{k}\beta_{cj}\frac{\partial w_{ij}}{\partial\rho_l} = \sum_{j=1}^{k}\frac{\beta_{cj}\, w_{ij}}{\tau_{ic}}\,(h_{ilj} - g_{il}).    (4.32)

We can then optimize the modified J in Equation (4.31) to carry out the post-processing.

4.5.5 A Note on Maximizing the Posterior Probability

The sum of the logarithms of the maximum posterior probabilities considered in the post-processing in Section 4.3.4 can be regarded as the sample estimate of an unorthodox type of entropy (see [141]) for the posterior distribution. It can be regarded as the limit of Renyi's entropy H_{\alpha}(p) when \alpha tends to infinity, where

H_{\alpha}(p) = \frac{1}{1-\alpha}\log\sum_{j=1}^{k} p_j^{\alpha}.    (4.33)

When this entropy is used for parameter estimation under the maximum entropy framework, the corresponding procedure is closely related to minimax inference. Other functions of the posterior probabilities can also be used, such as the Shannon entropy of the posterior distribution. A preliminary study shows that the use of different types of entropy does not affect the results significantly.

4.6 Summary

Given n points in d dimensions, we have presented an EM algorithm to estimate the saliencies of individual features and the best number of components for Gaussian-mixture clustering. The proposed algorithm avoids running EM many times with different numbers of components and different feature subsets, and can achieve better performance than using all the available features for clustering. By initializing with a large number of mixture components, our EM algorithm is less prone to the problem of poor local minima. The usefulness of the algorithm was demonstrated on both synthetic and benchmark real data sets.

Figure 4.9: Image maps of feature saliency for the different data sets with and without the post-processing procedure: (a)/(b) wine, (c)/(d) wdbc, (e)/(f) image, (g)/(h) texture, and (i)/(j) zernike, shown for the proposed algorithm and after post-processing, respectively. A feature saliency of 1 (0) is shown as a pixel of gray level 255 (0). The vertical and horizontal axes correspond to the feature number and the trial number, respectively.

Chapter 5

Clustering With Constraints

In Section 1.4, we introduced instance-level constraints as a type of side-information for clustering.
In this chapter, we shall examine the drawbacks of the existing clus- tering under constraints algorithms, and propose a new algorithm that can remedy the defects. Recall that there are two types of instance-level constraints: a must-link/ positive constraint requires two or more objects to be put in the same cluster, whereas a must- not-link/negative constraint requires two or more objects to be placed in different clusters. Often, the constraints are pairwise, though one can extend them to multiple objects [231, 167]. Constraints are particularly appropriate in a clustering scenario, because there is no clear notion of the target classes. On the other hand, the user can suggest if two or more objects should be included in the same cluster or not. This can be done in an interactive manner, if desired. Side-information can improve the robustness of a clustering algorithm towards model mismatch, because it provides additional clues for the desirable clusters other than the shape of the clusters, as 161 suggested by the parametric model. Side-information has also been found to alleviate the problem of local minima of the clustering objective function. Clustering with instance—level constraints is different from learning with partially- labeled data, also known as transductive learning or semi-supervised learning [136, 288, 157, 169, 289, 287, 98, 195], where the class labels of some of the objects are provided. Constraints only reveal the relationship among the labels, not the labels themselves. Indeed, if the “absolute” labels can be specified, the user is no longer facing a clustering task, and a supervised method should be adopted instead. We contrast different learning settings according to the type of information avail- able in Figure 5.1. At one end of the spectrum, we have supervised learning (Fig- ure 5.1(a)), where the labels of all the objects are known. At the other end of the spectrum, we have unsupervised learning (Figure 5.1(d)), where the label information is absent. In between, we can have partially labeled data (Figure 5.1(b)), where the true class labels of some of the objects are known. The main scenario considered in this paper is depicted in Figure 5.1(c): there is no label information, but must-link and must-not-link constraints (represented by solid and dashed lines, respectively) are provided. Note that the settings exemplified in Figures 5.1(a) and 5.1(b) are classification—oriented because there is a clear definition of different classes. On the other hand, the setups in Figures 5.1(c) and 5.1(d) are clustering-oriented, because no precise definitions of classes are given. The clustering algorithm needs to discover the classes. 162 5.0. 1 Related Work Different algorithms have been proposed for clustering under instance—level con- straints. In [262], the four primary operators in COBVVEB were modified in view of the constraints. The k-means algorithm was modified in [263] to avoid violating the constraints when different objects are assigned to different clusters. However, the algorithm can fail even when a solution exists. Positive constraints served as “short- cuts” in [148] to modify the dissimilarity measure for complete—link clustering. There can be catastrophic consequences if a single constraint is incorrect, because the dis- similarity matrix can be greatly distorted by a wrong constraint. Spectral clustering was modified in [138] to work with constraints, which augmented the affinity matrix. 
Constraints were incorporated into image segmentation algorithms by solving the constrained version of the corresponding normalized cut problem, with smoothness of the cluster labels explicitly incorporated in the formulation [279]. A hidden Markov random field was used in [14] for k-means clustering with constraints. Constraints have also been used for metric-learning [274]; in fact, the problems of metric-learning and k-means clustering with constraints were considered simultaneously in [21]. Because the problem of k-means with metric-learning is related to EM clustering with a common covariance matrix, the work in [21] may be viewed as related to EM clustering with constraints. The work in [158] extended the work in [21] by studying the relationship between constraints and the kernel k-means algorithms. Ideas based on hidden Markov random fields have also been used for model-based clustering with constraints [14, 176, 161]; the difference between these three methods lies in how the inference is conducted. In particular, the method in [14] used iterated conditional modes (ICM), the method in [176] used Gibbs sampling, and the method in [161] used a mean-field approximation. The approach in [286] is similar to [161], since both used a mean-field approximation; however, the authors of [286] also considered the case where each class is modeled by more than one component. A related idea was presented in [231], which uses a graphical model for generating the data with constraints. A fairly different route to clustering under constraints was taken by the authors in [10] under the name "correlation clustering", which used only the positive and negative constraints (and no information on the objects) for clustering; the number of clusters can be determined by the constraints. Table 5.1 provides a summary of these algorithms for clustering under constraints.

Table 5.1: Different algorithms for clustering with constraints (approach, key idea, and example references).
Distance editing: modify the distance/proximity matrix according to the constraints [148, 138].
Constraints on labels: the cluster labels are inferred under the restriction that the constraints are always satisfied [262, 263, 279].
Hidden Markov random field: the cluster labels constitute a hidden Markov random field; the feature vectors are assumed to be independent of each other given the cluster labels [14, 21, 12, 158, 176, 161, 286].
Modify generation model: the generation process of the data points that participate in constraints is modified [231, 166, 167].
Constraints resolution: the clustering solution is obtained by resolving the constraints only [10].

In most of these approaches, clustering with constraints has been shown to improve the quality of clustering in different domains. Example applications include text classification [14], image segmentation [161], and video retrieval [231].
In a non-parametric clustering algorithm such as pairwise clustering [114] and methods based on graph-cut [234, 272], there is no restriction on this hypothesis space. A particular non-parametric clustering algorithm selects the best clustering solution in the space according to some criterion function. In other words, if a poor criterion function is used (perhaps due to the influence of constraints), one can obtain a counter-intuitive clustering solution such as the one in Figure 5.3(c), where very similar objects can be assigned different cluster labels. Note that objects in non-parametric clustering, unlike in parametric clustering, may not have a feature vector representation. They can be represented, for example, by pairwise affinity or dissimilarity measure with higher order [1] The hypothesis space in parametric clustering is typically much smaller, because the parametric assumption imposes restrictions on the cluster boundaries. While these restrictions are generally perceived as a drawback, they become advantageous when they prevent counter-intuitive clustering solutions such as the one in F ig- ure 5.3((:) from appearing. These clustering solutions are simply outside the hy- 165 pothesis space of parametric clustering, and are never attainable irrespective of how the constraints modify the clustering objective function. An example contrasting parametric and non-parametric clustering is shown in Figure 5.2. The particular parametric family considered in this example is a Gaussian distribution with common covariance matrix, resulting in linear cluster boundaries. 5.0.2.1 Inconsistent Hypothesis Space in Existing Approaches The basic idea of most of the existing parametric clustering with instance-level con- straints algorithms [263, 14, 21, 12, 158, 176, 161, 286] is to use some variants of hidden Markov random fields to model the cluster labels and the feature vectors. Given the cluster label of the object, its feature vector is assumed to be independent ' of the feature vectors and the cluster labels of all the other objects. The cluster labels, which are hidden (unknown), form a Markov random field, with the potential function in this random field related to the satisfiability of the constraints based on the cluster labels. There is an unfortunate consequence of adopting the hidden Markov random field, however. For objects participating in the constraints, their cluster labels are deter- mined by the cluster parameters, associated feature vectors and the constraints. On the other hand, for data points without constraints, the cluster labels are determined by only the cluster parameters and associated feature vectors. We can thus see that there is an inconsistency in how the objects obtain their cluster labels. In other words, two identical objects, one with a constraint and one without, can be assigned different cluster labels! This is the underlying reason for the problem illustrated in 166 Figure 5.3(d), where two objects with almost identical feature vectors are assigned different labels due to the constraints. From a generative viewpoint, the above inconsistency is caused by the difference in how data points with and without constraints are generated. For the data points without constraint, each of them is generated in an identical and independent manner according to the current cluster parameter value. 
On the other, all the data points with constraints are generated simultaneously by first choosing the cluster labels according to the hidden Markov random field, followed by the generation of the feature vectors based on the cluster labels. It is a dubious modeling assumption that “posterior" knowledge such as the set of instance—level constraints, which are solicited from the user after observing the data, should control how the data are generated in the first place. Note that this inconsistency does not exist if all the objects to be clustered are involved in some constraints determined by the properties of the objects. This is commonly encountered in image segmentation [128], where pixel attributes (e.g. in- tensities or filter responses) and spatial coherency based on the locations of the pixels are considered simultaneously to decide the segment label. In this case, the clus- ter labels of all the objects are determined by both the constraints and the feature vectors. 5.0.2.2 Proposed Solution We propose to eliminate the problem of inconsistent hypothesis space by enforcing a uniform way to determine the cluster label of an object. we use the same hypothesis 167 space of standard parametric clusterng for parametric clustering under constraints. The constraints are only used to bias the search of a clustering solution within this hypothesis space. Since each clustering solution in this hypothesis space can be represented by the cluster parameters, the constraints play no role in determining the cluster labels, given the cluster parameters. The quality of the cluster parameters with respect to the constraints is computed by examining how well the cluster labels (determined by the cluster parameters) satisfy the constraints. However, cluster parameters that fit the constraints well may not fit the data well. We need a tradeoff between these two goals. This can be done by maximizing a weighted sum of the data log-likelihood and a constraint fit term. The details will be presented in Section 5.3. 5.1 Preliminaries Given a set of 71 objects y = {y1, . . . ,yn}, (probabilistic) parametric partitional clus- tering discovers the cluster structure of the data under the assumption that data in a cluster are generated according to a certain probabilistic model p(yldj), with 6]- representing the parameter vector for the j—th cluster. For simplicity, the number of clusters, k, is assumed to be specified by the user, though model selection strategy (such as minimum description length [81] and stability [162]) can be applied to de- termine k, if desired. The distribution of the data can be written as a. finite mixture distribution, i.e., k p(y) = Zp(y|z)p(Z) = Zajmylflj). (5-1) z j:1 168 Here, 2 denotes the cluster label, ozj denotes p(z = j) (the prior probability of cluster j), and p(yIOJ) corresponds to p(ylz = j). Clustering is performed by estimating the model parameter 9, defined by 0 = (01, . . .,Ct'k,(91,...,6k). By applying the maximum likelihood principle, 9 can be estimated as 6 = argmaxgz £(6’; y), where the log—likelihood £(6; y) is defined as n n k w. y) = Z Iogpm) = 2: log Zajp(yl6j)- (52) 2'21 i=1 j=1 This maximization is often done by the EM algorithm [58] by regarding Z, (the cluster label of y,) as the missing data. The posterior probability p(z = j [y) represents how likely it is that y belongs to the j-th cluster. 
If a hard cluster assignment is desired, the MAP (maximum a posteriori) rule can be applied based on the model in Equation (5.1), i.e., the object $y$ is assigned to the $j$-th cluster ($z = j$) if $\alpha_j\, p(y|\theta_j) \ge \alpha_{j'}\, p(y|\theta_{j'})$ for all $j'$, i.e.,

$$z = \arg\max_{j}\; \alpha_j\, p(y|\theta_j). \qquad (5.3)$$

5.1.1 Exponential Family

While there are many possibilities for the form of the probability distribution $p(y|\theta_j)$, it is very common to assume that $p(y|\theta_j)$ belongs to the exponential family. The distribution $p(y|\theta_j)$ is in the exponential family if it satisfies the following two criteria: the support of $p(y|\theta_j)$ (the set of $y$ with non-zero probability) is independent of the value of $\theta_j$, and $p(y|\theta_j)$ can be written in the form

$$p(y|\theta_j) = \exp\Big(\eta(\theta_j)^T u(y) - A(\theta_j)\Big). \qquad (5.4)$$

Here, $u(y)$ transforms the data $y$ to become the "sufficient statistics", meaning that $u(y)$ encompasses all the relevant information of $y$ in the computation of $p(y|\theta_j)$. The function $A(\theta_j)$, also known as the log-partition function, normalizes the density so that it integrates to one over all $y$. The function $\eta(\theta_j)$ transforms the parameter and enables us to adopt different parameterizations of the same density. When $\eta(\cdot)$ is the identity mapping, the density is said to be in natural parameterization, and $\theta_j$ is known as the natural parameter of the distribution. The function $A(\theta_j)$ then becomes the cumulant generating function, and the derivatives of $A(\theta_j)$ generate the cumulants of the sufficient statistics. For example, the gradient and Hessian of $A(\theta_j)$ (with respect to $\theta_j$) lead to the expected value and the covariance matrix of the sufficient statistics, respectively. Note that $A(\theta_j)$ is a convex function, and the domain of $\theta_j$ where the density is well-defined under natural parameterization is also convex.

As an example, consider a multivariate Gaussian density with mean vector $\mu$ and covariance matrix $\Sigma$. Its pdf is given by

$$p(y) = \exp\left(-\frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det\Sigma^{-1} - \frac{1}{2}(y-\mu)^T\Sigma^{-1}(y-\mu)\right), \qquad (5.5)$$

where $d$ is the dimension of the feature vector $y$. If we define $T = \Sigma^{-1}$ and $V = \Sigma^{-1}\mu$, the above can be rewritten as

$$p(y) = \exp\left(\mathrm{trace}\Big(-\frac{1}{2}yy^T T\Big) + y^T V - \frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det T - \frac{1}{2}V^T T^{-1} V\right). \qquad (5.6)$$

From this, we can see that the sufficient statistics consist of $-\frac{1}{2}yy^T$ and $y$. The set of natural parameters is given by $(T, V)$. The parameter $V$ can take any value in $\mathbb{R}^d$, whereas $T$ can only assume values in the positive-definite cone of $d$ by $d$ symmetric matrices. Both of these sets are convex, as expected. The log-cumulant function is given by

$$A(\theta) = \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det T + \frac{1}{2}V^T T^{-1} V, \qquad (5.7)$$

which can be shown to be convex within the domain of $T$ and $V$ where the density is well-defined.

It is interesting to note that the exponential family is closely related to Bregman divergence [27]. For any Bregman divergence $D_\rho(\cdot,\cdot)$ derived from a strictly convex function $\rho(\cdot)$, one can construct a function $f_\rho$ such that

$$p(y) = \exp\big(-D_\rho(y, \mu)\big)\, f_\rho(y)$$

is a member of the exponential family. Here, $\mu$ is the moment parameter, meaning that it is the expected value of the sufficient statistics.¹ The cumulant generating function of the density is given by the Legendre dual of $\rho(\cdot)$. One important consequence of this relationship is that soft-clustering (clustering where an object can be partially assigned to a cluster) based on any Bregman divergence can be done by fitting a mixture of the corresponding distribution in the exponential family, as argued in [9]. Since Bregman divergence includes many useful distance measures² as special cases (such as Euclidean distance and Kullback-Leibler divergence; see [9] for more), a mixture density, with each component density in the exponential family, covers many interesting clustering scenarios.

¹The strict convexity of $A(\cdot)$ implies that there is a one-to-one correspondence between the moment parameter and the natural parameter. While the existence of such a mapping is easy to show, constructing such a mapping can be difficult in general.

²Strictly speaking, Bregman divergence can be asymmetric and hence is not really a distance function.
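As a quick sanity check of the natural parameterization in Equations (5.6) and (5.7), the following sketch (illustrative only; the helper names are ours) evaluates the Gaussian log-density through the natural parameters and compares it with a direct evaluation:

    import numpy as np
    from scipy.stats import multivariate_normal

    def natural_params(mu, Sigma):
        # (T, V) = (Sigma^{-1}, Sigma^{-1} mu), the natural parameters in Equation (5.6)
        T = np.linalg.inv(Sigma)
        return T, T @ mu

    def log_partition(T, V):
        # A(theta) from Equation (5.7)
        d = V.shape[0]
        return 0.5 * d * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(T)[1] \
               + 0.5 * V @ np.linalg.solve(T, V)

    def gaussian_logpdf_natural(y, T, V):
        # inner product of natural parameters with sufficient statistics, minus A(theta)
        return np.trace(-0.5 * np.outer(y, y) @ T) + y @ V - log_partition(T, V)

    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
    T, V = natural_params(mu, Sigma)
    y = np.array([0.5, 0.5])
    assert np.isclose(gaussian_logpdf_natural(y, T, V),
                      multivariate_normal.logpdf(y, mu, Sigma))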
5.1.2 Instance-level Constraints

We assume that the user has provided side-information in the form of a set of instance-level constraints (denoted by $\mathcal{C}$). The set of must-link constraints, denoted by $\mathcal{C}^+$, is represented by the indicator variables $\tilde{a}_{hi}$, such that $\tilde{a}_{hi} = 1$ iff $y_i$ participates in the $h$-th must-link constraint. For example, if the user wants to state that the pair $(y_2, y_8)$ participates in the fifth must-link constraint, the user sets $\tilde{a}_{5,2} = 1$, $\tilde{a}_{5,8} = 1$, and $\tilde{a}_{5,i} = 0$ for all other $i$. This formulation, while less explicit than the formulation in [161], which specifies the pairs of points participating in the constraints directly, allows easy generalization to group constraints [166]: we simply set $\tilde{a}_{hi}$ to one for all $y_i$ that are involved in the $h$-th group constraint. We also define $a_{hi} = \tilde{a}_{hi} / \sum_{i'} \tilde{a}_{hi'}$, where $a_{hi}$ can be perceived as the "normalized" indicator matrix, in the sense that $\sum_i a_{hi} = 1$. The set of must-not-link constraints, denoted by $\mathcal{C}^-$, is represented similarly by the variables $\tilde{b}_{hi}$ and $b_{hi}$. Specifically, $\tilde{b}_{hi} = 1$ if $y_i$ participates in the $h$-th must-not-link constraint, and $b_{hi} = \tilde{b}_{hi} / \sum_{i'} \tilde{b}_{hi'}$. Note that $\{a_{hi}\}$ and $\{b_{hi}\}$ are highly sparse, because each constraint provided by the user involves only a small number of points (two if all the constraints are pairwise).

5.2 An Illustrative Example

In this section, we describe a simple example to illustrate an important shortcoming of parametric clustering under constraints methods based on hidden Markov random fields, the approach common in the literature [263, 14, 21, 12, 158, 176, 161, 286]. In Figure 5.3, there are altogether 400 data points generated by four different Gaussian distributions. The task is to split this data into two clusters. Suppose the user, perhaps due to domain knowledge, prefers a "left" and a "right" cluster (as shown in Figure 5.3(c)) to the more natural solution of a "top" and a "bottom" cluster (as shown in Figure 5.3(b)). This preference can be expressed via the introduction of two must-link constraints, represented by the solid lines in Figure 5.3(a).

When we apply an algorithm based on hidden Markov random fields to discover the two clusters in this example, we can get the solution shown in Figure 5.3(d). While the cluster labels of the points involved in the constraints are modified by the constraints, there is virtually no difference in the resulting cluster structure when compared with the natural solution in Figure 5.3(b). This is because the change in the cluster labels of the small number of points in constraints does not significantly affect the cluster parameters.
Not only are the clusters not what the user seeks, but the clustering solution is also counter-intuitive: the cluster labels of the points involved in the constraints are different from those of their neighbors (see the big cross and plus in Figure 5.3(d); the symbols are enlarged for clarity).

Similar phenomena of "non-smooth" clustering solutions have been observed in [279] in the context of normalized cut clustering with constraints. A variation of the same problem has been used as a motivation for "space-level" instead of "instance-level" constraints in [148]. One way to understand the cause of this problem is that the use of the hidden Markov random field effectively puts an upper bound on the maximum influence of a constraint, irrespective of how large the penalty for constraint violation is. So, the adjustment of the tradeoff parameters cannot circumvent this problem. Since this problem is not caused by the violation of any constraints, the inclusion of negative constraints cannot help.

5.2.1 An Explanation of the Anomaly

In order to have a better understanding of why the "unnatural" solution depicted in Figure 5.3(d) is obtained, let us examine the hidden Markov random field approach for clustering under constraints in more detail. In this approach, the distribution of the cluster labels (represented by $z_i$) and the feature vectors (represented by $y_i$) can be written as

$$p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \Theta) = \prod_{i} p(y_i \mid z_i),$$

while the cluster labels themselves follow a Markov random field with potential function $H(z_1, \ldots, z_n, \mathcal{C}^+, \mathcal{C}^-)$. One typical choice of the potential function $H(z_1, \ldots, z_n, \mathcal{C}^+, \mathcal{C}^-)$ of the cluster labels is to count the number of constraint violations:

$$H(z_1, \ldots, z_n, \mathcal{C}^+, \mathcal{C}^-) = \lambda^+ \sum_{(i,j) \in \mathcal{C}^+} I(z_i \neq z_j) + \lambda^- \sum_{(i,j) \in \mathcal{C}^-} I(z_i = z_j), \qquad (5.8)$$

where $\lambda^+$ and $\lambda^-$ are the penalty parameters for the violation of the must-link and must-not-link constraints, respectively. This potential function can be derived [161] by the maximum entropy principle, with constraints (as in constrained optimization) on the number of violations of the two types of instance-level constraints. The assignment of points to different clusters is determined by the posterior probability $p(z_1, \ldots, z_n \mid y_1, \ldots, y_n, \Theta)$. Clustering is performed by searching for the parameters that maximize the likelihood $p(y_1, \ldots, y_n \mid \Theta)$. Because

$$p(y_1, \ldots, y_n \mid \Theta) = \sum_{z_1, \ldots, z_n} p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \Theta)\, p(z_1, \ldots, z_n \mid \Theta) \approx \max_{z_1, \ldots, z_n} p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \Theta)\, p(z_1, \ldots, z_n \mid \Theta), \qquad (5.9)$$

the result of maximizing $p(y_1, \ldots, y_n \mid \Theta)$ is often similar to the result of maximizing the "hard assignment log-likelihood", defined by $\max_{z_1, \ldots, z_n} p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \Theta)\, p(z_1, \ldots, z_n \mid \Theta)$. This illustrates the relationship between "hard" clustering under constraints approaches (such as in [263]) and the "soft" approaches (such as in [161] and [14]).

For ease of illustration, assume that $p(y|z = j)$ is a Gaussian with mean vector $\mu_j$ and identity covariance matrix. The maximization of $p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \Theta)\, p(z_1, \ldots, z_n \mid \Theta)$ for the clustering under constraints example in Figure 5.3 is equivalent to the minimization of

$$\sum_{j=1}^{2} \sum_{i:\, z_i = j} \|y_i - \mu_j\|^2 + \lambda^+ \sum_{(i,j) \in \mathcal{C}^+} I(z_i \neq z_j),$$

where the potential function of the Markov random field is as defined in Equation (5.8), and $\mathcal{C}^+$ contains the two must-link constraints. Note that the first term, the sum of squared Euclidean distances between data points and the corresponding cluster centers, is the cost function for standard k-means clustering.
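The following sketch evaluates this penalized k-means cost for a fixed label assignment; the data generation, constraint indices, and centers below are hypothetical values chosen only to mirror the example, and this is not the code used in the thesis. The cost is the quantity compared for the two configurations discussed next:

    import numpy as np

    def penalized_kmeans_cost(Y, labels, centers, must_links, lam_plus):
        """Sum of squared distances plus lambda^+ times the number of violated must-links."""
        sq_err = sum(np.sum((Y[labels == j] - centers[j]) ** 2) for j in range(len(centers)))
        violations = sum(labels[i] != labels[j] for (i, j) in must_links)
        return sq_err + lam_plus * violations

    rng = np.random.default_rng(0)
    # four unit-variance Gaussian clouds at (+-2, +-8), 100 points each (as in Figure 5.3(a))
    clouds = [rng.normal(c, 1.0, size=(100, 2)) for c in [(2, 8), (-2, 8), (2, -8), (-2, -8)]]
    Y = np.vstack(clouds)

    # "TB" labeling (sign of the vertical coordinate) vs "LR" labeling (sign of the horizontal one)
    tb_labels, tb_centers = (Y[:, 1] > 0).astype(int), np.array([[0.0, -8.0], [0.0, 8.0]])
    lr_labels, lr_centers = (Y[:, 0] > 0).astype(int), np.array([[-2.0, 0.0], [2.0, 0.0]])

    must_links = [(10, 210), (110, 310)]   # two illustrative pairs linking top and bottom clouds
    for lam in (1.0, 250.0):
        print(lam,
              penalized_kmeans_cost(Y, tb_labels, tb_centers, must_links, lam),
              penalized_kmeans_cost(Y, lr_labels, lr_centers, must_links, lam))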
“left" and a “right” cluster, can be represented by ”ER 2 (—2, 0) and ugR = (2,0), and this corresponds to the partition sought by the user in Fig- ure 5.3(c). The configuration “TB”, which consists of a “top” and a “bottom” cluster, can be represented by [.1ng = (0,~8) and ”2TB = (0,8), and this corre- sponds to the “natural” solution shown in Figure 5.3(b). When X“ is very small, the natural solution “TB” is preferable to “LR”, because the points, on average, are closer to the cluster centers in “TB”, and the penalty for constraint viola- tion is negligible. As /\+ increases, the cost for selecting “TB” increases. When (A+ + lle‘ — MEBHQ) > My, — ”TB”? (y,- is the point under constraint in the upper left point clouds), switching the cluster label of y, from “x” to “+” leads to a lower cost for the “TB” configuration. This switching of cluster label affects the cluster centers in the “TB” configuration. However, its influence is minimal because there is only one such point, and the sum of the square error term in the objective func- 176 F. tion is dominated by the remaining points that are not involved in constraints. As a result, the sum of square term is minimized when the cluster centers are effectively unmodified from the “TB” configuration. This leads to the counter—intuitive cluster— ing solution in Figure 5.3(c), where the constraints are satisfied, but the cluster labels are “discontinuous” in the sense that the cluster label of an object in the middle of a dense point cloud can assume a cluster label different from those of its neighbors. A related argument has been used to motivate “space-level” constraints in preference to “instance-level” constraints in [148]: the influence of instance-level constraints may fail to propagate to the surrounding points. This problem may also be attributed to the problem of the inconsistent hypothesis space discussed in Section 5.0.2.1, be- cause the cluster labels of points under constraints are determined in a way that is different from the points without constraints. When A+ increases further, the cost for this counter-intuitive configuration remains the same, because no constraints are violated. Let C denote the cost of this counter-intuitive configuration. We are now in a position to understand why it is not possible to attain the de- sirable configuration “LR”. By pushing the vertical and horizontal point clouds away from each other, we can arbitrarily increase the cost of the “LR” configuration, while keeping the cost of the “TB” configuration the same. While the cost for the counter- intuitive configuration also increases when the two point clouds are pushed apart, such an increase is very slow because only the distance of one point (as in the term Hy,- — #111,)”2) is affected. Consequently, the cost of “LR” configuration can be made larger than C, which is indeed the case for the example in Figure 5.3. Therefore, assuming that the clustering under constraints algorithm finds the clustering solu- 177 tion that minimizes the cost function, the desired “LR” configuration can never be recovered. Note that specifying additional constraints (either must-link or must—not-link) on points already participating in the constraints cannot solve the problem, because none of the constraints are violated in the counter-intuitive configuration. This problem remained unnoticed in previous studies, because it is a consequence of a small number of constraints. 
When there are a large number of data points involved in constraints, the sum of squared error is no longer dominated by the data points not involved in constraints. The enforcement of constraints changes the cluster labels, which in turn modifies the cluster centers significantly during the minimization of the sum of error. The counter-intuitive configuration is no longer optimal, and the "LR" configuration will be generated because of its smaller cost. Note that this problem is independent of the probabilistic model chosen to represent each cluster: the same problem can arise if there is no restriction on the covariance matrix, for example.

There are several ways to circumvent this problem. One possibility is to increase the number of constraints so that the constraints involve a large number of data points. However, clustering under constraints is most useful when there are few constraints, because the creation of constraints often requires a significant effort on the part of the user. Instead of soliciting additional constraints from the user, the system should provide the user an option to increase arbitrarily the influence of the existing constraints, something the hidden Markov random field approach fails to do. One may also try to initialize the cluster parameters intelligently [13] so that a desired local minimum (the "LR" configuration in Figure 5.3(c)) is obtained, instead of the global minimum (the counter-intuitive configuration in Figure 5.3(d) or the "TB" configuration in Figure 5.3(b), depending on the value of $\lambda^+$). However, this approach is heuristic. Indeed, the discussion above reveals a problem in the objective function itself, and we should specify a more appropriate objective function to reflect what the user really desires. The solution in [161] is to introduce a parameter (in addition to $\lambda^+$ and $\lambda^-$) that can increase the influence of data points in constraints. However, this approach introduces an additional parameter, and it is also heuristic. An alternative potential function for use in the hidden Markov random field has been proposed in [14] to try to circumvent the problem.

Because the main problem lies in the objective function itself, we propose a principled solution to this problem by specifying an alternative objective function for clustering under constraints.

5.3 Proposed Approach

Our approach begins by requiring the hypothesis space (see Section 5.0.2) used by parametric clustering under constraints to be the same as the hypothesis space used by parametric clustering without constraints. This means that the cluster label of an object should be determined by its feature vector and the cluster parameters according to the MAP rule in Equation (5.3), based on the standard finite mixture model in Equation (5.1). The constraints should play no role in deciding the cluster labels. This contrasts with the hidden Markov random field approaches (see Section 5.2), where both the cluster labels and the cluster parameters can freely vary to minimize the cost function.

Figure 5.1: Supervised, unsupervised, and intermediate. In this figure, dots correspond to points without any labels. Points with labels are denoted by circles, asterisks and crosses. In (c), the must-link and must-not-link constraints are denoted by solid and dashed lines, respectively.
Figure 5.2: An example contrasting parametric and non-parametric clustering. The particular parametric family considered here is a mixture of Gaussians with a common covariance matrix. This is reflected by the linear cluster boundaries. The clustering solutions in (a) to (c) are in the hypothesis space induced by this model assumption, and the clustering solutions in (d) to (f) are outside the hypothesis space, and thus can never be obtained, no matter which objective function is used. On the other hand, all of these six solutions are within the hypothesis space of non-parametric clustering. It is possible that the clustering solutions depicted in (d), (e), and (f) may be obtained if a poor clustering objective function is used.

Figure 5.3: A simple example of clustering under constraints that illustrates the limitation of hidden Markov random field (HMRF) based approaches. Panels: (b) natural partition in 2 clusters; (c) desired 2-cluster solution; (d) solution by HMRF.

The desirable cluster parameters should (i) result in cluster labels that satisfy the constraints, and (ii) explain the data well. These two goals, however, may conflict with each other, and a compromise is made by the use of tradeoff parameters. Formally, we seek the parameter vector $\Theta$ that maximizes an objective function $\mathcal{J}(\Theta; \mathcal{Y}, \mathcal{C})$, defined by

$$\mathcal{J}(\Theta; \mathcal{Y}, \mathcal{C}) = \mathcal{L}(\Theta; \mathcal{Y}) + \mathcal{F}(\Theta; \mathcal{C}), \qquad (5.10)$$

$$\mathcal{F}(\Theta; \mathcal{C}) = -\sum_{h=1}^{m^+} \lambda_h^+ f^+(\Theta; \mathcal{C}_h^+) - \sum_{h=1}^{m^-} \lambda_h^- f^-(\Theta; \mathcal{C}_h^-), \qquad (5.11)$$

where $\mathcal{F}(\Theta; \mathcal{C})$ denotes how well the clusters specified by $\Theta$ satisfy the constraints in $\mathcal{C}$. It consists of two types of terms: $f^+(\Theta; \mathcal{C}_h^+)$ and $f^-(\Theta; \mathcal{C}_h^-)$. The loss functions $f^+(\Theta; \mathcal{C}_h^+)$ and $f^-(\Theta; \mathcal{C}_h^-)$ correspond to the violation of the $h$-th must-link constraint (denoted by $\mathcal{C}_h^+$) and the $h$-th must-not-link constraint (denoted by $\mathcal{C}_h^-$), respectively. There are altogether $m^+$ must-link constraints and $m^-$ must-not-link constraints, i.e., $|\mathcal{C}^+| = m^+$ and $|\mathcal{C}^-| = m^-$. The log-likelihood term $\mathcal{L}(\Theta; \mathcal{Y})$, which corresponds to the fit of the data $\mathcal{Y}$ by the model parameter $\Theta$, is the same as the log-likelihood of the finite mixture model used in standard parametric clustering (Equation (5.2)). The parameters $\lambda_h^+$ and $\lambda_h^-$ give us the flexibility to assign different weights to the constraints. In practice, they are set to a common value $\lambda$. The value of $\lambda$ can either be specified by the user, or it can be estimated by a cross-validation type of procedure. For brevity, we sometimes drop the dependence of $\mathcal{J}$ on $\Theta$, $\mathcal{Y}$ and $\mathcal{C}$ and write $\mathcal{J}$ as the objective function.

How can this approach be superior to the HMRF approaches? A counter-intuitive clustering solution such as the one depicted in Figure 5.3(d) is no longer attainable. The cluster boundaries are determined solely by the cluster parameters. So, in the example in Figure 5.3(d), the top-left "big plus" point will assume the cluster label of "x", whereas the bottom-right "big cross" point will assume the cluster label of "+", based on the values of the cluster parameters as shown in the figure. The second benefit is that the effect of the instance-level constraints is propagated to the surrounding points automatically, thereby achieving the effect of the desirable space-level constraints. This is because parametric cluster boundaries divide the data space into different contiguous regions.
Another advantage of the proposed approach is that it can obtain clustering solutions unattainable by HMRF approaches. For example, the "TB" configuration in Figure 5.3(b) can be made to have an arbitrarily high cost by increasing the value of the constraint penalty parameter $\lambda^+$. Since the cost of the "LR" configuration is not affected by $\lambda^+$, the "LR" configuration will have a smaller cost than the "TB" configuration for a large $\lambda^+$. When the cost function is minimized, the "LR" configuration sought by the user will be returned.

5.3.1 Loss Function for Constraint Violation

What should be the form of the loss functions $f^+(\Theta; \mathcal{C}_h^+)$ and $f^-(\Theta; \mathcal{C}_h^-)$? Suppose the points $y_i$ and $y_j$ participate in a must-link constraint. This must-link constraint is violated if the cluster labels $z_i$ (for $y_i$) and $z_j$ (for $y_j$), determined by the MAP rule, are different. Define $\mathbf{z}_i$ to be a vector of length $k$, such that its $l$-th entry is one if $z_i = l$, and zero otherwise. The number of constraint violations can be represented by $d(\mathbf{z}_i, \mathbf{z}_j)$ if $d$ is a distance measure such that $d(\mathbf{z}_i, \mathbf{z}_j) = 1$ if $z_i \neq z_j$ and zero otherwise. Similarly, the violation of a must-not-link constraint between $y_{i^*}$ and $y_{j^*}$ can be represented by $1 - d(\mathbf{z}_{i^*}, \mathbf{z}_{j^*})$, where $y_{i^*}$ and $y_{j^*}$ are involved in a must-not-link constraint.

Adopting such a distance function $d(\cdot,\cdot)$ as the loss functions $f^+(\cdot)$ and $f^-(\cdot)$ is, however, not a good idea, because $d(\mathbf{z}_i, \mathbf{z}_j)$ is a discontinuous function of $\Theta$, due to the presence of $\arg\max$ in Equation (5.3). In order to construct an easier optimization problem, we "soften" $\mathbf{z}_i$ and define a new vector $\mathbf{s}_i$ by

$$s_{il} = \frac{\big(\alpha_l\, p(y_i|\theta_l)\big)^{\tau}}{\sum_{l'} \big(\alpha_{l'}\, p(y_i|\theta_{l'})\big)^{\tau}} = \frac{q_{il}^{\tau}}{\sum_{l'} q_{il'}^{\tau}}, \qquad (5.12)$$

where $q_{il} = \alpha_l\, p(y_i|\theta_l)$, and $\tau$ is the smoothness parameter. When $\tau$ goes to infinity, $\mathbf{s}_i$ approaches $\mathbf{z}_i$, whereas a small value of $\tau$ leads to a smooth loss function, which, in general, has a less severe local optima problem.

Another issue is the choice of the distance function $d(\mathbf{s}_i, \mathbf{s}_j)$. Since $s_{il} \geq 0$ and $\sum_l s_{il} = 1$, $\mathbf{s}_i$ has a probabilistic interpretation. A divergence is therefore more appropriate than a common distance measure such as the Minkowski distance for comparing $\mathbf{s}_i$ and $\mathbf{s}_j$. We adopt the Jensen-Shannon divergence $D_{JS}(\mathbf{s}_i, \mathbf{s}_j)$ with a uniform class prior as the distance measure:

$$D_{JS}(\mathbf{s}_i, \mathbf{s}_j) = \frac{1}{2}\sum_{l=1}^{k} s_{il} \log\frac{s_{il}}{t_l} + \frac{1}{2}\sum_{l=1}^{k} s_{jl} \log\frac{s_{jl}}{t_l}, \qquad (5.13)$$

where $t_l = \frac{1}{2}(s_{il} + s_{jl})$. There are several desirable properties of the Jensen-Shannon divergence. It is symmetric, well-defined for all $\mathbf{s}_i$ and $\mathbf{s}_j$, and its square root can be shown to be a metric [76, 199]. The minimum value of 0 for $D_{JS}(\cdot,\cdot)$ is attained only when $\mathbf{s}_i = \mathbf{s}_j$. It is upper-bounded by a constant ($\log 2$), and this bound is attained only when $\mathbf{s}_i$ and $\mathbf{s}_j$ are farthest apart, i.e., when $s_{il} = 1$ and $s_{jh} = 1$ with $l \neq h$. Because $\frac{1}{\log 2} D_{JS}(\mathbf{z}_i, \mathbf{z}_j) = 1$ if $z_i \neq z_j$ and 0 otherwise, the Jensen-Shannon divergence satisfies (up to a multiplicative constant) the desirable property of a distance measure as described earlier in this section. Note that the Kullback-Leibler divergence can become unbounded when $\mathbf{s}_i$ and $\mathbf{s}_j$ have different supports, and thus it is not an appropriate choice.
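A minimal sketch of these two quantities, the softened assignment of Equation (5.12) and the pairwise Jensen-Shannon divergence of Equation (5.13); the helper names are ours, and a small constant is added only to avoid log(0) in the sketch:

    import numpy as np

    def soft_assignment(q_i, tau):
        """Softened label vector s_i of Equation (5.12); q_i[l] = alpha_l * p(y_i | theta_l)."""
        w = q_i ** tau
        return w / w.sum()

    def js_divergence(s_i, s_j, eps=1e-12):
        """Pairwise Jensen-Shannon divergence of Equation (5.13) with uniform prior."""
        t = 0.5 * (s_i + s_j)
        kl = lambda p, q: np.sum(p * np.log((p + eps) / (q + eps)))
        return 0.5 * kl(s_i, t) + 0.5 * kl(s_j, t)

    q_i = np.array([0.7, 0.2, 0.1])
    q_j = np.array([0.1, 0.1, 0.8])
    for tau in (0.25, 1.0, 4.0):
        s_i, s_j = soft_assignment(q_i, tau), soft_assignment(q_j, tau)
        print(tau, js_divergence(s_i, s_j))   # approaches log(2) as tau grows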
The Jensen-Shannon divergence has an additional appealing property: it can be generalized to measure the difference between more than two distributions. This gives a very natural extension to constraints at the group level [231, 166]. Suppose $e$ objects participate in the $h$-th group-level must-link constraint. This is denoted by the variables $a_{hi}$ introduced in Section 5.1.2, where $a_{hi} = 1/e$ if $y_i$ participates in this constraint, and zero otherwise. The Jensen-Shannon divergence for the $h$-th must-link constraint, $D_{JS}^+(h)$, is defined as

$$D_{JS}^+(h) = \sum_{i=1}^{n} a_{hi} \sum_{l=1}^{k} s_{il} \log\frac{s_{il}}{t_{hl}^+} = \sum_{i=1}^{n} a_{hi} \sum_{l=1}^{k} s_{il} \log s_{il} - \sum_{l=1}^{k} t_{hl}^+ \log t_{hl}^+, \quad \text{where } t_{hl}^+ = \sum_{i=1}^{n} a_{hi}\, s_{il}. \qquad (5.14)$$

Similarly, the Jensen-Shannon divergence for the $h$-th must-not-link constraint, $D_{JS}^-(h)$, is defined as

$$D_{JS}^-(h) = \sum_{i=1}^{n} b_{hi} \sum_{l=1}^{k} s_{il} \log\frac{s_{il}}{t_{hl}^-} = \sum_{i=1}^{n} b_{hi} \sum_{l=1}^{k} s_{il} \log s_{il} - \sum_{l=1}^{k} t_{hl}^- \log t_{hl}^-, \quad \text{where } t_{hl}^- = \sum_{i=1}^{n} b_{hi}\, s_{il}. \qquad (5.15)$$

Here, $b_{hi}$ denotes the must-not-link constraint indicator as discussed in Section 5.1.2. The proposed objective function in Equation (5.10) can be rewritten as

$$\mathcal{J} = \mathcal{L}(\Theta; \mathcal{Y}) + \mathcal{F}(\Theta; \mathcal{C}) = \mathcal{L}_{\mathrm{annealed}}(\Theta; \mathcal{Y}, \gamma) - \sum_{h=1}^{m^+} \lambda_h^+ D_{JS}^+(h) + \sum_{h=1}^{m^-} \lambda_h^- D_{JS}^-(h), \qquad (5.16)$$

where the annealed log-likelihood $\mathcal{L}_{\mathrm{annealed}}(\Theta; \mathcal{Y}, \gamma)$, defined in Equation (B.2), is a generalization of the log-likelihood intended for deterministic annealing. When $\gamma = 1$, $\mathcal{L}_{\mathrm{annealed}}(\Theta; \mathcal{Y}, \gamma)$ equals $\mathcal{L}(\Theta; \mathcal{Y})$. Note that both $D_{JS}^+(h)$ and $D_{JS}^-(h)$ are functions of $\Theta$.

5.4 Optimizing the Objective Function

The proposed objective function (Equation (5.16)) is more difficult to optimize than the log-likelihood (Equation (5.2)) used in standard parametric clustering. We cannot derive any efficient convex relaxation for $\mathcal{J}$, meaning that a bound-optimization procedure such as the EM algorithm cannot be applied. We resort to general nonlinear optimization algorithms to optimize the objective function. In Section 5.4.1, we shall present the general idea of these algorithms. After describing some details of the algorithms in Section 5.4.2, we present the specific equations used for a mixture of Gaussians in Section 5.4.3. Note that these algorithms are often presented in the literature as minimization algorithms. Therefore, we minimize $-\mathcal{J}$ rather than maximizing $\mathcal{J}$ in practice.

5.4.1 Unconstrained Optimization Algorithms

Different algorithms have been attempted to optimize the proposed objective function $\mathcal{J}$. They include conjugate gradient, quasi-Newton, preconditioned conjugate gradient, and line-search Newton. Because these algorithms are fairly well-documented in the literature [87, 23], we shall only describe their general ideas here. All of these algorithms are iterative and require an initial parameter vector $\Theta^{(0)}$.

5.4.1.1 Nonlinear Conjugate Gradient

The key idea of nonlinear conjugate gradient is to maintain the descent directions $d^{(t)}$ in different iterations, so that different $d^{(t)}$ are orthogonal (conjugate) to each other with respect to some approximation of the Hessian matrix. This can prevent the inefficient "zig-zag" behavior encountered in steepest descent, which always uses the negative gradient for descent. Initially, $d^{(0)}$ equals the negative gradient of the function to be minimized. At iteration $t$, a line-search is performed along $d^{(t)}$, i.e., we seek $\eta$ such that the objective function evaluated at $\Theta^{(t)} + \eta d^{(t)}$ is minimized, where $\Theta^{(t)}$ is the current parameter estimate. The parameter is then updated by $\Theta^{(t+1)} = \Theta^{(t)} + \eta d^{(t)}$. The next direction of descent, $d^{(t+1)}$, is found by computing a vector that is (approximately) conjugate to the previous descent directions. Many different schemes have been proposed for this; we follow the suggestion given in the tutorial [232] and adopt the Polak-Ribière method with restarting to update $d^{(t+1)}$:

$$\beta^{(t+1)} = \max\left(\frac{\big(r^{(t+1)}\big)^T \big(r^{(t+1)} - r^{(t)}\big)}{\big(r^{(t)}\big)^T r^{(t)}},\; 0\right), \qquad d^{(t+1)} = r^{(t+1)} + \beta^{(t+1)} d^{(t)},$$

where $r^{(t)}$ denotes the negative gradient at iteration $t$.
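For illustration, a bare-bones version of this update loop follows (a sketch only: it assumes a generic objective f with gradient grad_f, and uses a simple backtracking line search in place of the stricter line search discussed below):

    import numpy as np

    def backtracking_line_search(f, x, d, g, eta=1.0, rho=0.5, c=1e-4):
        # shrink the step until a sufficient-decrease (Armijo) condition holds
        while eta > 1e-12 and f(x + eta * d) > f(x) + c * eta * (g @ d):
            eta *= rho
        return eta

    def nonlinear_cg(f, grad_f, x0, n_iter=200, tol=1e-10):
        x = x0.copy()
        r = -grad_f(x)                       # residual = negative gradient
        d = r.copy()
        for _ in range(n_iter):
            if np.sqrt(r @ r) < tol:
                break
            eta = backtracking_line_search(f, x, d, -r)
            x = x + eta * d
            r_new = -grad_f(x)
            beta = max(r_new @ (r_new - r) / (r @ r), 0.0)   # Polak-Ribiere with restart
            d = r_new + beta * d
            r = r_new
        return x

    # example: minimize a simple convex quadratic
    A = np.array([[3.0, 0.5], [0.5, 1.0]])
    f = lambda x: 0.5 * x @ A @ x
    grad_f = lambda x: A @ x
    print(nonlinear_cg(f, grad_f, np.array([5.0, -3.0])))   # approaches the minimizer at the origin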
Note that the line-search in conjugate gradient should be reasonably accurate, in order to ensure that the search directions $d^{(t)}$ are indeed approximately conjugate (see the discussion in Chapter 7 of [23]). The main strength of conjugate gradient is that its memory usage is only linear with respect to the number of variables, thereby making it attractive for large scale problems. Conjugate gradient has also found empirical success in fitting a mixture of Gaussians [222], and is shown to be more efficient than the EM algorithm when the clusters are highly overlapping.

5.4.1.2 Quasi-Newton

Consider the second-order Taylor expansion of a real-valued function $f(x)$, which is

$$f(x) \approx f(x_0) + (x - x_0)^T g(x_0) + \frac{1}{2}(x - x_0)^T H(x_0)(x - x_0), \qquad (5.17)$$

where $g(x_0)$ and $H(x_0)$ denote the gradient and the Hessian of the function $f(\cdot)$ evaluated at $x = x_0$. For brevity, we shall drop the reference to $x_0$ for both $g$ and $H$. Assuming that $H$ is positive definite, the right-hand side of the above approximation can be minimized by $x = x_0 - H^{-1}g$. The quasi-Newton algorithm does not require explicit knowledge of the Hessian $H$, which can sometimes be tricky to obtain. Instead, it maintains an approximate Hessian $\tilde{H}$, which should satisfy the quasi-Newton condition:

$$\Theta^{(t+1)} - \Theta^{(t)} = \tilde{H}^{-1}\big(g^{(t+1)} - g^{(t)}\big).$$

Since the inversion of the Hessian can be computationally expensive, $G^{(t)}$, the inverse of the Hessian, is approximated instead. While different schemes to update $G^{(t)}$ are possible, the de facto standard is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) procedure. Below is its description, taken from [23]:

$$p = \Theta^{(t+1)} - \Theta^{(t)}, \qquad v = g^{(t+1)} - g^{(t)},$$

$$G^{(t+1)} = G^{(t)} + \frac{p\,p^T}{p^T v} - \frac{G^{(t)} v\, v^T G^{(t)}}{v^T G^{(t)} v} + \big(v^T G^{(t)} v\big)\, u\, u^T, \qquad \text{where } u = \frac{p}{p^T v} - \frac{G^{(t)} v}{v^T G^{(t)} v}.$$

Given that $G^{(t)}$ is positive-definite and the round-off error is negligible, the above update guarantees that $G^{(t+1)}$ is positive-definite. The initial value of the approximated inverse Hessian, $G^{(0)}$, is often set to the identity matrix. Note that an alternative approach to implement quasi-Newton is to maintain the Cholesky decomposition of the approximated Hessian instead. This has the advantage that the approximated Hessian is guaranteed to be positive definite even when the round-off error cannot be ignored.

In practice, the quasi-Newton algorithm is accompanied by a line-search procedure to cope with the error in the Taylor approximation in Equation (5.17) when $\Theta$ is far away from $\Theta^{(t)}$. The descent direction used is $-\tilde{H}^{-1}g^{(t)}$. Note that if $\tilde{H}$ is positive definite, $-g^{(t)T}\tilde{H}^{-1}g^{(t)}$ will always be negative and $-\tilde{H}^{-1}g^{(t)}$ will be a valid descent direction. The main drawback of the quasi-Newton method is its memory requirement. The approximate inverse Hessian requires $O(|\Theta|^2)$ memory, where $|\Theta|$ is the number of variables in $\Theta$. This can be slow for high-dimensional $\Theta$, which is the case when the data $y_i$ are of high dimensionality.
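A compact sketch of this update (illustrative; the variable names mirror the equations above and the fixed damped step stands in for a real line search):

    import numpy as np

    def bfgs_inverse_update(G, theta_new, theta_old, g_new, g_old):
        """One BFGS update of the approximate inverse Hessian G, following the equations above."""
        p = theta_new - theta_old
        v = g_new - g_old
        pv = p @ v
        Gv = G @ v
        vGv = v @ Gv
        u = p / pv - Gv / vGv
        return G + np.outer(p, p) / pv - np.outer(Gv, Gv) / vGv + vGv * np.outer(u, u)

    # example on a quadratic f(x) = 0.5 x^T A x, whose true inverse Hessian is A^{-1}
    A = np.array([[4.0, 1.0], [1.0, 2.0]])
    grad = lambda x: A @ x
    G = np.eye(2)
    theta = np.array([1.0, 1.0])
    for _ in range(10):
        step = -G @ grad(theta)          # quasi-Newton descent direction
        theta_new = theta + 0.5 * step   # fixed damped step in place of a real line search
        G = bfgs_inverse_update(G, theta_new, theta, grad(theta_new), grad(theta))
        theta = theta_new
    print(G, np.linalg.inv(A))           # G should approximate A^{-1} after a few updates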
5.4.1.3 Preconditioned Conjugate Gradient

Both conjugate gradient and quasi-Newton require only the gradient information of the function to be minimized. Faster convergence is possible if we incorporate the analytic form of the Hessian matrix into the optimization procedure. However, what really can help is not the Hessian, but the inverse of the Hessian. Since the inversion of the Hessian can be slow, it is common to adopt some approximation of the Hessian matrix so that its inversion can be done quickly. Preconditioned conjugate gradient (PCG) uses an approximation to the inverse Hessian to speed up conjugate gradient. The approximation, also known as the preconditioner, is denoted by $M$. PCG essentially creates an optimization problem that has $M^{-1/2} H M^{-1/2}$ as the "effective" Hessian matrix and applies conjugate gradient to it, where $H$ is the Hessian matrix of the original optimization problem. If the "effective" Hessian matrix is close to the identity, conjugate gradient can converge very fast. We refer the reader to the appendix in [232] for the exact algorithm for PCG. Practical implementation of PCG does not require the computation of $M^{-1/2}$; only multiplication by $M^{-1}$ is needed. Note that the preconditioner should be positive definite, or the descent direction computed may not decrease the objective function.

We can see that there are three requirements for a good preconditioner: positive definiteness, efficient inversion, and good approximation of the Hessian. The first and the third requirements can contradict each other, because the true Hessian is often not positive-definite unless the objective function is convex. Finding a good preconditioner is an art, and often requires insights into the problem at hand. However, general procedures for creating a preconditioner also exist, which can be based on incomplete Cholesky factorization, for example.

5.4.1.4 Line-search Newton

Line-search Newton is almost the same as the quasi-Newton algorithm, except that the Hessian is provided by the user instead of being approximated by the gradients. There is, however, a catch here. The true Hessian may not be positive-definite, meaning that the minimization problem on the right-hand side of Equation (5.17) does not have a solution. Therefore, it is common to replace the true Hessian with some approximated version that is positive-definite. Since $H^{-1}g$ is to be computed, such an approximation should admit efficient inversion, or at least multiplication by its inverse should be fast. There are two possible ways to obtain such an approximation. We can either add $\delta I$ to the true Hessian, where $\delta$ is some positive number determined empirically, or we can "repair" $H$ by adding some terms to it to convert it to a positive-definite matrix. Note that for both line-search Newton and PCG, the approximated inverse of the Hessian, which takes $O(|\Theta|^2)$ memory, need not be formed explicitly. The only thing needed is the ability to multiply by the approximated inverse.
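As a sketch of the first option, adding $\delta I$ (the starting value of $\delta$ and the growth factor below are arbitrary illustrations, not values from this thesis), one can simply increase $\delta$ until a Cholesky factorization succeeds and then solve for the Newton direction:

    import numpy as np

    def newton_direction(H, g, delta0=1e-6):
        """Return -(H + delta I)^{-1} g with the smallest tried delta making H + delta I positive definite."""
        delta = 0.0
        while True:
            try:
                L = np.linalg.cholesky(H + delta * np.eye(H.shape[0]))
                break
            except np.linalg.LinAlgError:
                delta = delta0 if delta == 0.0 else 10.0 * delta
        # solve (L L^T) d = -g by two triangular solves
        y = np.linalg.solve(L, -g)
        return np.linalg.solve(L.T, y)

    H = np.array([[2.0, 0.0], [0.0, -1.0]])   # an indefinite Hessian
    g = np.array([1.0, 1.0])
    print(newton_direction(H, g))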
5.4.2.2 Common Precision Matrix A common practice of fitting a mixture of Gaussian is to assume a common precision matrix, i.e., the precision matrices of all the k Gaussian components are restricted to be the same, i.e., T1 = = Tk = T. Instead of the gradient with respect to different Tj, we need the gradient of ,7 with respect to T. This can be done easily because ”8,7 j=1 er,- 194 Consequently, Equation (B.12) should be modified to 0 1 6—T—‘7z— 2:617)”le +Z(-12-ujp]T +—%‘T1)Zc,j (5.19) 2' whereas Equation (B20) should be modified to t} 1 _ 1 —,,T.7= gr 12%- — 525.6.- —u,~)T. (5.20) ij ij The case for Cholesky parameterization is similar. We set F1 = = Fk = F, and Equation (B.15) should be modified to 0 DFJ:_ Z(iniyiF+Z(Hjuj +T1)FZCzja (5'21) i] .7 and Equation (821) should be modified to (‘9 - 51:37 = T 1F200— Zcijf)’i - Mj)(yz' - MleF- (5-22) 5.4.2.3 Line Search Algorithm The line-search algorithm we used is based on the implementation in Matlab, which is in turn based on section 2.6 in [86]. Its basic idea is to perform a cubic interpolation based on the value of the function and the gradient evaluated at two parameter values. The line search terminates when the VVolfe’s condition is satisfied. Following the advice in Chapter 7 of [23], the line-starch is stricter for both conjugate gradient and preconditioned conjugate gradient in order to ensure conjugacy. Note that when 195 the Gaussians are parameterized by their precision matrices, the line search procedure disallows any parameter vector that has non-positive definite precision matrices. 5.4.2.4 Annealing the Objective Function The algorithms described in Section 5.4.1 find only the local minima of .7 based on the initial parameter estimate 6(0). One strategy to alleviate this problem is to adopt a deterministic-annealing type of procedure and use a “smoother” version of the objective function. The solution of this “smoother” optimization problem is used as the initial guess for the actual objective function to be optimized. Specifically, we adjust the two temperature—like parameters 7 and T in J defined in Equation (5.16). When 7 and T are small, .,7 is smooth and is almost convex, therefore it is easy to optimize. The annealing stops when 7 reaches one and T reaches a pre-specified value Tfinal , which is set to four in our experiment. This is, however, a fairly insensitive parameter. Any number between one and sixteen leads to similar clustering results. 5.4.3 Specifics for a Mixture of Gaussians All the algorithms described in Section 5.4.1 require the gradient information of the objective function. In Appendix B.1, we have derived the gradient information with the assumption that each mixture component is a Gaussian distribution. Recall that qz-j = log p(yil6j), and sij has been defined in Equation (5.12). Define the following: 7 - qij Zj’ qij, 196 1111'] = E Afahl— Zh /\— bhi 81] log 823' [2:1 "1+ m— — 3”- Z ,\+ (1),,- log 2;] =2: Agbh, iogtgj h=1 C16 = 7”,") - T(Il’ij — Sij 2w”). [=1 The partial derivative of ,7 with respect to 63- is (7 0‘37]: 2‘ C21“ (.11 ngj 2' f!- (5.23) Under the natural parameterization u) and T) for the parameters of the l-th cluster, we have g_l;7: ZC Cile" “[ZC C2[ 59!- = A Z Cuyryf + % (#sz + 22) Z 022- 2' 2' (9T) 2 If the Cholesky parameterization F1 is used instead of T1, we have (9.7 :r T O—F] = - Zcztlyiyi F1 + (mm + 231) Fl 2021- i i If the moment parameterization #1 and T) are used instead, we have 0.7 ._— = r E : -— i 197 (5.24) (5.25) (5.26) (5.27) 297,: El: 6‘2"! 
5.5 Feature Extraction and Clustering with Constraints

It turns out that the objective function introduced in Section 5.3 can be modified to simultaneously perform feature extraction and clustering with constraints. There are three reasons why we are interested in performing these two tasks together. First, the proposed algorithm does not perform well for small data sets with a large number of features (denoted by $d$), because the $d$ by $d$ covariance matrix is estimated from the available data. In other words, we are suffering from the curse of dimensionality. The standard solution is to preprocess the data by reducing the dimensionality using methods like principal component analysis. However, the resulting low-dimensional representation may not be optimal for clustering with the given set of constraints. It is desirable to incorporate the constraints in seeking a good low-dimensional representation.

The second reason is from a modeling perspective. One can argue that it is inappropriate to model the two desired clusters shown in Figure 5.3(c) by two Gaussians, because the distribution of the data points is very "non-Gaussian": there are no data points in the central regions of the two Gaussians, which are supposed to have the highest data densities! If the data points are projected to the one-dimensional subspace of the x-axis, the resulting two clusters follow the Gaussian assumption well while satisfying the constraints. Note that PCA selects a projection direction that is predominantly based on the y-axis, because the data variance in that direction is large. However, the clusters formed after such a projection will violate the constraints. In general, it is quite possible that, given a high-dimensional data set, there exists a low-dimensional subspace such that the clusters after projection are Gaussian-like, and the constraints are satisfied by those clusters.

The third reason is that the projection can be combined with the kernel trick to achieve clusters with arbitrary shapes. A nonlinear transformation is applied to the data set to embed the data in a high-dimensional feature space. A linear subspace of the given feature space is sought such that the Gaussian clusters formed in that subspace are consistent with the given set of constraints. Because of the nonlinear transformation, linear cluster boundaries in that subspace correspond to nonlinear boundaries in the original input space. The exact form of the nonlinear boundaries is controlled by the type of the nonlinear transformation applied. Note that such a transformation need not be performed explicitly because of the kernel trick (see Section 2.5.1). In practice, kernel PCA is first performed on the data in order to extract the main structure in the high-dimensional feature space.
Let PTuj and T be the cluster center of the j-th Gaussian component and the common covariance matrix, respectively. Let R be the Cholesky decomposition of T, i.e., T = RRT. We have 1 p(X2|22 = j) = (2.5-(172(th 101/2 exp (—-,—(x.- — PTpJ-iTrtx. — PTMfi) . (5.30) Because T = RTPTPR, we can rewrite the above as losp(xz'|22 = j) I I = —% log(27r) + glog det T — -2—(y,- — pj)TPTPT(y,- — 22]), (5-31) I (l 1 1 2 _§ log(27r) + —2— log det FTF — 5m — uj)TFFT(y2 - 1a)), where F = PR. Note the similarity between this expression and that of log p(yilzi = j) if we adopt the parameterization T = FFT as discussed in Section 5.4.2.1. We have 6 . 5’71 losp(xilz-i = J) = FFTin — m). (5.32) 200 a ' I _ U—F 1()gP(Xi[3i : J) : F(FTF) 1 — (Yi _ Hj)(y1'— Hj)TF- (533) While P has an orthogonality constraint, there is no constraint on F, and thus we cast our optimization problem in terms of F. The parameters F, pj and 63- can be found by optimizing ,7, after substituting Equation (5.31) as log qij into Equation (5.16). In practice, the quasi-Newton algorithm is used to find the parameters that minimize the objective function, because it is difficult to inverse the Hessian efficiently. It is interesting to point out that this subspace learning procedure is related to linear discriminant analysis if the data points y,- are standardized to have equal vari- ance. If we fix T to be the identity matrix, maximizing the log-likelihood is the same as minimizing (y, -— pj)TPTP(y,- — 22]). This is the within-class scatter of the j-th cluster. Since the sum of between-class scatter and the within-class scatter is the total data scatter, which is constant because of the standardization, maximization of the within-class scatter is the same as maximizing the ratio of between-class scatter to the within-class scatter. This is what linear discriminant analysis does. 5.6 Experiments To verify the effectiveness of the proposed approach, we have applied our algorithm to both synthetic and real world data sets. we compare the proposed algorithm with two state-of—the-art algorithms for clustering under constraints. The first one, denoted by Shental, is the algorithm proposed by Shental et al. in [231]. It uses “chunklets” to represent the cluster labels of the objects involved in must—link con- 201 straints, and a l\-"‘Iarkov network to represent the cluster labels of objects participating in must-not—link constraints. The EM algorithm is used for parameter estimation, and the E-step is done by computations within the Markov network. It is not clear from the paper the precise algorithm used for the inference in the E—step, though the Matlab implementation3 provided by the authors seems to use the junction tree algorithm. This can take exponential time when the constraints are highly coupled. This potential high time complexity is the motivation for the mean-field approxi- mation used in the E-step of [161]. The second algorithm, denoted by Basu, is the constrained k-means algorithm with metric learning4 described in [21]. It is based on the idea of hidden Markov random field, and it uses the constraints to adjust the metrics between different data points. A parameter is needed for the strength of the constraints. Note that we do not compare our approach with the algorithm in [161], because its implementation is no longer available. 
5.6 Experiments

To verify the effectiveness of the proposed approach, we have applied our algorithm to both synthetic and real world data sets. We compare the proposed algorithm with two state-of-the-art algorithms for clustering under constraints. The first one, denoted by Shental, is the algorithm proposed by Shental et al. in [231]. It uses "chunklets" to represent the cluster labels of the objects involved in must-link constraints, and a Markov network to represent the cluster labels of objects participating in must-not-link constraints. The EM algorithm is used for parameter estimation, and the E-step is done by computations within the Markov network. The precise algorithm used for the inference in the E-step is not clear from the paper, though the Matlab implementation³ provided by the authors seems to use the junction tree algorithm. This can take exponential time when the constraints are highly coupled. This potential high time complexity is the motivation for the mean-field approximation used in the E-step of [161]. The second algorithm, denoted by Basu, is the constrained k-means algorithm with metric learning⁴ described in [21]. It is based on the idea of hidden Markov random fields, and it uses the constraints to adjust the metrics between different data points. A parameter is needed for the strength of the constraints. Note that we do not compare our approach with the algorithm in [161], because its implementation is no longer available.

³The url is http://www.cs.huji.ac.il/~tomboy/code/ConstrainedEM_plusBNT.zip.
⁴Its implementation is available at http://www.cs.utexas.edu/users/ml/risc/code/.

5.6.1 Experimental Result on Synthetic Data

Our first experiment is based on the example in Figure 5.3(a), which contains 400 points generated by four Gaussians centered at $(2, 8)^T$, $(-2, 8)^T$, $(2, -8)^T$ and $(-2, -8)^T$, each with identity covariance matrix. Recall that the goal is to group this data set into two clusters, a "left" and a "right" cluster, based on the two must-link constraints. Specifically, points with negative and positive horizontal co-ordinates are intended to be in two different clusters. Note that this synthetic example differs from the similar one in [161] in that the vertical separation between the top and bottom point clouds is larger. This increases the difference between the goodness of the "left/right" and "top/bottom" clustering solutions, so that a small number of constraints is no longer powerful enough to bias one clustering solution over the other as in [161].

The results of running the algorithms Shental and Basu are shown in Figures 5.4(a) and 5.4(b), respectively. For Shental, the two Gaussians estimated are also shown. Not only did both algorithms fail to recover the desired cluster structure, but the cluster assignments found were also counter-intuitive. This failure is due to the fact that these two approaches represent the constraints by imposing prior distributions on the cluster labels, as explained earlier in Section 5.2.

The result of applying the proposed algorithm to this data set with $\lambda = 250$ is shown in Figure 5.4(c). The two desired clusters have been almost perfectly recovered, when we compare the solution visually with the desired cluster structure in Figure 5.3(c). A more careful comparison is done in Figure 5.4(d), where the cluster boundary obtained by the proposed algorithm (the gray dotted line) is compared with the ground-truth (the solid green line). We can see that these two boundaries are very close to each other, indicating that the proposed algorithm discovered a good cluster boundary. This compares with the similar example in [167], where the cluster boundary (as inferred from the Gaussians shown) is quite different⁵ from the desired cluster boundary.

⁵Note that the synthetic data example in [167] is fitted with a mixture model with different covariance matrices per class. Therefore, comparing it with the proposed algorithm may not be the most fair.
This is because data points not in a particular cluster can still contribute, though to a smaller extent, to the covariance of the Gaussian distributions due to the soft-assignment implied in the mixture model. This is not the case when the covariance matrix is estimated based on the ground-truth labels. While the proposed algorithm is the only clustering under constraints algorithm we know that can return the two desired clusters, we want to note that a sufficiently large A is needed for its success. If A = 50, for example, the result of the proposed algorithm is shown in Figure 5.4(f). This is virtually identical to the clustering solution without any constraints (Figure 5.3(b)). While the constraints are violated, the clustering solution is more “reasonable” than the solutions shown in Figures 5.4(a) and 5.4(b). Note that it is easy to detect that A is too small in this example, because the constraints are violated. We should increase A until this is no longer the case. The resulting clustering solution will effectively be identical to the desired solution 204 shown in Figure 5.4(c). 5.6.2 Experimental Results on Real World Data We have also compared the proposed algorithm with the algorithms Shental and Basu based on real world data sets obtained from different domains. The label information in these data sets is used only for the creation of the constraints and for performance evaluation. In 1;)articular, the labels are not used by the clustering algorithms. 5.6.2.1 Data Sets Used Table 5.2 summarizes the characteristics of the data sets used. The following prepro- cessing has been applied to the data. whenever necessary. If a data set has a nominal feature that can assume 0 possible values with c > 2, that feature is converted into 0 continuous features. The 2-th such feature is set to one when the nominal feature assumes the 2-th possible value, and the remaining c— 1 continuous features are set to zero. If the variances of the features of a data set are very different, standardization is applied to all the features, so that the variances or the ranges of the preprocessed features become the same. If the number of features is too large when compared with the number of data points 72, principal component analysis (PCA) is applied to reduce the dimensionality. The number of reduced dimension d is determined by finding the largest d that satisfies n > 3d”, while the principal components with negligible eigen- values are also discarded. The difficulty of the classification tasks associated with these data sets can be seen by the values of the F -score and the normalized mutual information (to be defined in Section 5.6.2.3) computed using the ground—truth labels, 205 0 _________ O 10‘ ,t - ‘ ‘ I 3 ‘‘‘‘‘‘ ’0‘ ~ - 10’ o o O o - 00,9 6° 0 ' ’ f ~‘ 0. 0 o .0 o o . '3 xiii-xx ' x.) . . v t'stui .W 1 ‘~-_‘L’. .9 t—!_>_'.-" ‘... 00 0: '.‘ 5 O ----------- 5. 0 - 0r 0. -5 ,ffxt—‘f‘ -------------- 13””: 7“ ., fig“ 5.: - 5’s, ..eae as. £55.35) , sex ewes - . --y __________________ . '1? 6 ‘15 o (a) Result of the algorithm Shental (b) Result of the algorithm Basu 15 1’, ’ xx Xx, “.\ 1 15» __We 4 I ‘ \ 10 ‘ x § ‘\,". g 0 Y 10' .,.;fi . “313.5...“ '- ‘r‘ 2"” ”Fifi-ii“. 71". e. e? .3: 1‘ :-' .2 5. I! X I" (I 5. i .' I I 0' I I l l 07 I z: : . " I I _5_ _ . . :s‘ , ...,: ...-4: , l '1: .1. ,0, " «15 "r o 5 (d) The cluster boundaries in (c) 10 ° ] u.» x-qisrryiitomes-aw ~:- 1 .1! .. o ... IQ” o r) “‘{y ,—.—.-..2zi'_‘.._t-3-e'-:“‘.' 5. o o- . -5 x _______ _,,r— «19 ' I «ate "”2! 
Data Sets from UCI

The following data sets are obtained from the UCI machine learning repository.⁶ The list below includes most of the data sets in the repository that have mostly continuous features and have relatively balanced class sizes.

The dermatology database (derm) contains 366 cases with 34 features. The goal is to determine the type of Erythemato-Squamous disease based on the features extracted. The age attribute, which has missing values, is removed. PCA is performed to reduce the resulting 33-dimensional data to 11 features. The sizes of the six classes are 112, 61, 72, 49, 52 and 20.

The optical recognition of handwritten digits data set (digits) is based on normalized bitmaps of handwritten digits extracted from a preprinted form. The 32x32 bitmaps are divided into non-overlapping blocks of 4x4, and the number of pixels is counted in each block. Thus 64 features are obtained for each digit. The training and testing sets are combined, leading to 5620 patterns. PCA is applied to reduce the dimensionality to 42 to preserve 99% of the total variance. The sizes of the ten classes are 554, 571, 557, 572, 568, 558, 558, 566, 554, and 562.

The ionosphere data set (ion) consists of 351 radar readings returned from the ionosphere. Seventeen pulse numbers are extracted from each reading. The real part and the imaginary part of the complex pulse numbers constitute the 34 features per pattern. There are two classes: "good" radar returns (225 patterns) are those showing evidence of some type of structure in the ionosphere, and "bad" returns (126 patterns) are those that do not; their signals pass through the ionosphere. PCA is applied to reduce the dimensionality to 10.

⁶The url is http://www.ics.uci.edu/mlearn/MLRepository.html
The training and testing sets are combined to form a data set with 2310 patterns. After standardizing each feature to have variance one, PCA is applied to reduce the dimensionality of the data to 10. The seven classes correspond to brick-face, sky, foliage, cement, window, path, and grass. Each of the classes has 330 patterns. 208 Data Sets from Statlog in UCI The following five data sets are taken from the Statlog section7 in the UCI machine learning repository. The Australian credit approval data set (austra) has 690 instances with 14 at- tributes. The two classes are of size 383 and 307. The continuous features are stan- dardized to have standard deviation 0.5. Four of the features are non-binary nominal features, and they are converted to multiple continuous features. PCA is then applied to reduce the dimensionality of the concatenated feature vector to 15. The German credit data (german) contains 1000 records with 24 features. The version with numerical attributes is used in our experiments. PCA is used to reduce the dimensionality of the data to 18, after standardizing the features so that all of them lie between zero and one. The two classes have 700 and 300 records. The heart data set (heart) has 270 observations with 13 raw features in two classes with 150 and 120 data points. The three nominal features are converted into continuous features. The continuous features are standardized to have standard deviation 0.5, before applying PCA to reduce the data set to 9 features. The satellite image data set (sat) consists of the multi-spectral values of pixels in 3x3 neighborhoods in a satellite image. The aim is to classify the class associated with the central pixel, which can be “red soil”, “cotton crop”, “grey soil”, “damp grey soil”, “soil with vegetation stubble” or “very damp grey soil”. The training and the testing sets are combined to yield a data set of size 6435. There are 36 features altogether. The classes are of size 1533, 703, 1358, 626, 707 and 1508. 7The url is http://www. ics.uci.edu/mlearn/databases/statlog/ 209 The vehicle silhouettes data set (vehicle) contains a set of features extracted from the silhouette of a. vehicle. The goal is to classify a vehicle as one of the four types (Opel, Saab, bus, or van) based on the silhouette. There are altogether 846 patterns in the four classes with sizes of the four classes as 212, 217, 218, and 199. The features are first standardized to have standard deviation one, before applying PCA to reduce the dimensionality to 16. Other Data Sets We have also experimented the proposed algorithm with data sets from other sources. The texture classification data set (texture) is taken from [127]. It consists of 4000 patterns with four different types of textures. The 19 features are based on Gabor filter responses. The four classes are of sizes 987, 999, 1027, and 987. The online handwritten script data set (script), taken from [192], is about a problem that classifies words and lines in an online handwritten document into one of the six major scripts: Arabic, Cyrillic, Devnagari, Han, Hebrew, and Roman. Eleven spatial and temporal features are extracted from the strokes of the words. There are altogether 12938 patterns, and the sizes of the six classes are 1190, 3173, 1773, 3539, 1002, and 2261. The ethnicity recognition data set (ethn) was originally used in [175]. The goal is to classify a 64x64 face image into two classes: “Asian” (1320 images) and “non- Asian” (1310 images). 
It includes the PF 01 database⁸, the Yale database⁹, the AR database [181], and the non-public NLPR database¹⁰. Some example images are shown in Figure 5.5. 30 eigenface coefficients are extracted to represent the images.

8 http://nova.postech.ac.kr/archives/imdb.html
9 http://cvc.yale.edu/projects/yalefaces/yalefaces.html
10 Provided by Dr. Yunhong Wang, National Laboratory for Pattern Recognition, Beijing.

Figure 5.5: Example face images in the ethnicity classification problem for the data set ethn. (a) Asians; (b) Non-Asians.

The clustering under constraints algorithm is also tested on an image segmentation task based on the Mondrian image shown in Figure 5.6, which has five distinct segments. The image is divided into 101 by 101 sites. Twelve histogram features and twelve Gabor filter responses of four orientations at three different scales are extracted. Because the histogram features always sum to one, PCA is performed to reduce the dimension from 24 to 23. The resulting data set Mondrian contains 10201 patterns with 23 features in 5 classes. The sizes of the classes are 2181, 2284, 2145, 2323, and 1268.

Figure 5.6: The Mondrian image used for the data set Mondrian. It contains 5 segments. Three of the segments are best distinguished by Gabor filter responses, whereas the remaining two are best distinguished by their gray-level histograms.

The 3-newsgroup database¹¹ is about the classification of Usenet articles from different newsgroups. It has been used previously to demonstrate the effectiveness of clustering under constraints in [14]. It consists of three classification tasks (diff-300, sim-300, same-300), each of which contains roughly 300 documents from three different topics. The topics are regarded as the classes to be discovered. The three classification tasks are of different difficulties: the sets of three topics in diff-300, sim-300, and same-300 respectively have increasing similarities. Latent semantic indexing is applied to the tf-idf normalized word features to convert each newsgroup article into a feature vector of dimension 10. The three classes in diff-300 are all of size 100, whereas the numbers of patterns in the three classes of sim-300 are 96, 97, and 98. The sizes of the classes in same-300 are 99, 98, and 100.

11 It can be downloaded at http://www.cs.utexas.edu/users/ml/risc/.

Notice that the data sets ethn, Mondrian, diff-300, sim-300, and same-300 have been used in the previous work [161]. The same preprocessing is applied for both ethn and Mondrian as in [161], though we reduce the dimensionality from 20 to 10 for the diff-300, sim-300, and same-300 data sets based on our "n > 3d²" rule.

[Table 5.2: Summary of the data sets used in the experiments, including the number of data points and the number of features of each data set.]
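Most of the data sets summarized in Table 5.2 are prepared in the same two steps: the features are standardized, and PCA is then applied, either to a fixed target dimension or to the smallest dimension that preserves a given fraction of the total variance (95% or 99% above). The following minimal sketch illustrates this preprocessing; it assumes scikit-learn is available, and the function name `preprocess` is ours rather than part of the thesis software.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def preprocess(X, n_components):
    """Standardize the columns of X (n_samples x n_features) and reduce them with PCA.

    `n_components` may be an integer (a target dimension, e.g. 11 for derm) or a
    float in (0, 1), interpreted as the fraction of the total variance to preserve
    (e.g. 0.99 for digits, 0.95 for mfeat-fou)."""
    X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X_std)

# Example (hypothetical variable): reduce a 64-dimensional digits matrix so that
# 99% of the total variance is preserved.
# X_reduced = preprocess(X_digits, 0.99)
```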
5.6.2.2 Experimental Procedure

For each data set listed in Table 5.2, a constraint was specified by first generating a random point pair (y_i, y_j). If the ground-truth class labels of y_i and y_j were the same, a must-link constraint was created between y_i and y_j; otherwise, a must-not-link constraint was created. Different numbers of constraints were created as a percentage of the number of points in the data set: 1%, 2%, 3%, 5%, 10%, and 15%. Note that the constraints were generated in a "cumulative" manner: the set of "1%" constraints was included in the set of "2%" constraints, and so on.

The line-search Newton algorithm was used to optimize the objective function J in the proposed approach. The Gaussians were represented by their natural parameters, with a common precision matrix shared among the different Gaussian components. This particular choice of optimization algorithm was made based on a preliminary efficiency study, where this approach was found to be the most efficient among all the algorithms described in Section 5.4.1. Because the gradient is available in line-search Newton, convergence was declared when the norm of the gradient fell below a threshold relative to the norm of the initial gradient. Note that this is a stricter and more reasonable convergence criterion than the one typically used in the EM algorithm, which is based on the relative change of the log-likelihood. However, in order to safeguard against round-off error, we also declare convergence when the relative change of the objective function is very small: 10⁻¹⁰, to be precise. Starting with a random initialization, line-search Newton was run with γ = 1 and τ = 0.25, with the convergence threshold set to 10⁻². Line-search Newton was run again after increasing τ to 1, with the convergence threshold tightened to 10⁻³. Finally, τ and the convergence threshold were set to 4 and 10⁻⁴, respectively. The optimization algorithm was also stopped if convergence was not achieved within 5000 Newton iterations. Fifteen random initializations were attempted, and the solution with the best objective function value was regarded as the solution found by the proposed algorithm.

The above procedure, however, assumes the constraint strength λ is known. The value of λ was determined using a set of validation constraints. The constraints for the training set and the constraints for the validation set were obtained using the following rules. Given a data set, if the number of constraints was less than 3k, k being the number of clusters, all the available constraints were used for both training and validation. This procedure, while risking overfitting, is necessary because a too small set of constraints is poor both for training the clusters and for estimating λ. When the number of constraints was between 3k and 6k, the numbers of training constraints and validation constraints were both set to 3k. So, the training constraints overlapped with the validation constraints.
When the number of constraints was larger than 6k, half of the constraints were used for training and the other half were used for validation.

Starting with λ = 0.1, we increased λ by multiplying it by √10. For each λ, the proposed algorithm was executed. A better value of λ was encountered if the number of violations of the validation constraints was smaller than the current best. If there was a tie, the decision was made on the number of violations of the training constraints. If the best value of λ did not change for four iterations, we assumed that the optimal value of λ had been found. The proposed algorithm was then executed again using all the available constraints and the λ value just determined. The resulting solution was compared with the solution obtained using only the training constraints, and the one with the smaller total number of constraint violations was regarded as our final clustering solution. If there was a tie, the solution obtained with the training constraints only was selected.

The algorithms Shental and Basu were run using the same set of data and constraints as input. For Shental, we modified the initialization strategy in their software, which involved a two-step process. First, five random parameter vectors were generated, and the one with the highest log-likelihood was selected as the initial value of the EM algorithm. Convergence was achieved if the relative change in the log-likelihood was less than a threshold, which is 10⁻⁶ by default. This process was repeated 15 times, and the parameter vector with the highest log-likelihood was regarded as the solution. For easier comparison, we also assumed a common covariance matrix among the different Gaussian components. For the algorithm Basu, the authors provided their own initialization strategy, which is based on the set of constraints provided. The algorithm was run 15 times, and the solution with the best objective function value was picked. The algorithm Basu requires a constraint penalty parameter. In our experiments, a wide range of values was tried: 1, 2, 4, 8, 16, 32, 64, 128, 256, 500, 1000, 2000, 4000, 8000, and 16000. We only report the results with the best possible penalty values. As a result, the performance of Basu reported here might be inflated.

5.6.2.3 Performance criteria

A clustering under constraints algorithm is said to perform well on a data set if the clusters obtained are similar to the ground-truth classes. Consider the k by k "contingency matrix" $\{c_{ij}\}$, where $c_{ij}$ denotes the number of data points that originally come from the i-th class and are assigned to the j-th cluster. If the clusters match the true classes perfectly, there should be only one non-zero entry in each row and each column of the contingency matrix. Following the common practice in the literature, we summarize the contingency matrix by the F-score and the normalized mutual information (NMI).

Consider the "recall matrix" $\{r_{ij}\}$, in which the entries are defined by $r_{ij} = c_{ij} / \sum_{j'} c_{ij'}$. Intuitively, $r_{ij}$ denotes the proportion of the i-th class that is "recalled" by the j-th cluster. The "precision matrix" $\{p_{ij}\}$, on the other hand, is defined by $p_{ij} = c_{ij} / \sum_{i'} c_{i'j}$. It represents how "pure" the j-th cluster is with respect to the i-th class. Entries in the F-score matrix $\{f_{ij}\}$ are simply the harmonic means of the corresponding entries in the precision and recall matrices, i.e., $f_{ij} = 2 r_{ij} p_{ij} / (r_{ij} + p_{ij})$.
The F-score of the i-th class, $F_i$, is obtained by assuming that the i-th class matches¹² with the best cluster, i.e., $F_i = \max_j f_{ij}$. The overall F-score is computed as the weighted sum of the individual $F_i$ according to the sizes of the true classes, i.e.,

$$\text{F-score} = \sum_{i=1}^{k} \frac{\sum_{j=1}^{k} c_{ij}}{n} \, F_i. \qquad (5.34)$$

12 Here, we do not require that one cluster can only match to one class. If this one-to-one correspondence is desired, the Hungarian algorithm should be used to perform the matching instead of the max operation used to compute $F_i$.

Note that the precision of an empty cluster is undefined. This problem can be circumvented if we require that empty clusters, if any, do not contribute to the overall F-score.

The computation of normalized mutual information interprets the true class label and the cluster label as two random variables U and V. The contingency table, after dividing by n (the number of objects), forms the joint distribution of U and V. The mutual information (MI) between U and V can be computed based on this joint distribution. Since the range of the mutual information depends on the sizes of the true classes and the sizes of the clusters, we normalize the MI by the average of the entropies of U and V (denoted by H(U) and H(V)) so that the resulting value lies between zero and one. Formally, we have

$$
\begin{aligned}
H(U) &= -\sum_{i=1}^{k} \frac{\sum_{j=1}^{k} c_{ij}}{n} \log \frac{\sum_{j=1}^{k} c_{ij}}{n}, \\
H(V) &= -\sum_{j=1}^{k} \frac{\sum_{i=1}^{k} c_{ij}}{n} \log \frac{\sum_{i=1}^{k} c_{ij}}{n}, \\
H(U,V) &= -\sum_{i=1}^{k} \sum_{j=1}^{k} \frac{c_{ij}}{n} \log \frac{c_{ij}}{n}, \\
\text{MI} &= H(U) + H(V) - H(U,V), \qquad \text{NMI} = \frac{\text{MI}}{\bigl(H(U) + H(V)\bigr)/2}.
\end{aligned} \qquad (5.35)
$$

For both the F-score and the NMI, the higher the value, the better the match between the clusters and the true classes. For a perfect match, both NMI and F-score take the value of 1. When the cluster labels are completely independent of the class labels, NMI takes its smallest value of 0. The minimum value of the F-score depends on the sizes of the true classes. If all the classes are of equal size, the lower bound of the F-score is 1/k. In general, the lower bound of the F-score is higher, and it can be more than 0.5 if there is a dominant class.

5.6.2.4 Results

The results of clustering the data sets mentioned in Section 5.6.2.1 when there are no constraints are shown in Table 5.3. In the absence of constraints, both the proposed algorithm and Shental effectively find the cluster parameter vector that maximizes the log-likelihood, whereas Basu is the same as the k-means algorithm. One may be surprised to discover from Table 5.3 that, even though the proposed algorithm and Shental optimize the same objective function, their results are different. This is understandable when we notice that the line-search Newton algorithm used by the proposed approach and the EM algorithm used by Shental can locate different local optima. It is sometimes argued that maximizing the mixture log-likelihood globally is inappropriate, as it can go to infinity when one of the Gaussian components has an almost singular covariance matrix. However, this is not the case here, because the covariance matrices all have small condition numbers, as seen in Table 5.3. Therefore, among the two solutions produced by the proposed approach and by Shental, we take the one with the larger log-likelihood. In the remaining experiments, the no-constraint solutions found by the proposed algorithm were also used as the initial values for Shental, because we are interested in locating the best possible local optima for the objective functions.
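Before turning to the tables, the following sketch shows how the criteria of Section 5.6.2.3 can be computed from the two label vectors. It is an illustrative NumPy implementation of Equations (5.34) and (5.35), not the code used to produce the results; class and cluster labels are assumed to be integers in {0, ..., k-1}, every class is assumed to be non-empty, and natural logarithms are used in the entropies.

```python
import numpy as np

def contingency(classes, clusters, k):
    """k x k contingency matrix: c[i, j] = number of points of class i in cluster j."""
    c = np.zeros((k, k))
    for i, j in zip(classes, clusters):
        c[i, j] += 1
    return c

def f_score(c):
    """Overall F-score of Eq. (5.34); empty clusters contribute nothing, as required."""
    n = c.sum()
    class_sizes = c.sum(axis=1, keepdims=True)     # one row per true class
    cluster_sizes = c.sum(axis=0, keepdims=True)   # one column per cluster
    recall = c / class_sizes                                            # r_ij
    precision = np.divide(c, cluster_sizes,
                          out=np.zeros_like(c), where=cluster_sizes > 0)  # p_ij
    f = np.divide(2 * recall * precision, recall + precision,
                  out=np.zeros_like(c), where=(recall + precision) > 0)   # f_ij
    return float((class_sizes.ravel() / n * f.max(axis=1)).sum())

def nmi(c):
    """Normalized mutual information of Eq. (5.35): MI divided by the average entropy."""
    n = c.sum()
    pu, pv, puv = c.sum(axis=1) / n, c.sum(axis=0) / n, c / n
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    mi = h(pu) + h(pv) - h(puv)
    return float(mi / (0.5 * (h(pu) + h(pv))))
```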
[Table 5.3: Performance of the different algorithms in the absence of constraints. The headings n, F, NMI, log-lik, and κ show the number of data points, the F-score, the normalized mutual information, the log-likelihood, and the condition number of the common covariance matrix, respectively.]

The results of running our proposed algorithm, Shental, and Basu with the 1%, 2%, 3%, 5%, 10%, and 15% constraint levels are shown in Tables 5.4, 5.5, 5.6, 5.7, 5.8, and 5.9, respectively. In these tables, the columns under "Proposed" correspond to the performance of the proposed algorithm. The heading λ denotes the value of the constraint strength as determined by the validation procedure. The heading "Shental, default init" corresponds to the performance when the algorithm Shental is initialized by its default strategy, whereas "Shental, special init" corresponds to the result when Shental is initialized by the no-constraint solution found by the proposed approach. The heading "log-lik" shows the log-likelihood of the resulting parameter vector. Among these two solutions of Shental, the one with the higher log-likelihood is selected, and its performance is shown under the heading "Shental, combine".

From these tables, we can see that Shental with the default initialization often yields a higher performance than Shental with the special initialization. However, the log-likelihood of Shental with the default initialization is sometimes smaller. By the principle of maximum likelihood, such a solution, though it has a higher F-score and/or NMI, should not be accepted. This observation has the implication that the good performance of Shental reported in comparative work such as [161] might be due to the initialization strategy instead of the model used. The fact that we are more interested in comparing the model used in Shental with that used in the proposed approach, instead of the strategy for initialization, is the reason why we run Shental with the special initialization.
We have also tried to do something similar with Basu, but its initialization routine is integrated with the main clustering routine 221 so that it is non-trivial to modify the initialization strategy. The numbers listed in Tables 5.3 to 5.9 are visualized in Figures 5.7 to 5.13. For each data set, we draw the F-score and the N MI with an increasing number of constraints. The horizontal axis corresponds to different constraint levels in terms of the percentages of the number of data points, whereas the vertical axis corresponds to the F-score or the NMI. The results of the proposed algorithm, Shental, and Basu are shown by the (red) solid lines, (blue) dotted lines, and (black) dashed lines, respectively. For comparison, the (gray) dashdot lines in the figures show the F-score and the NMI due to a classifier trained using the labels of all the objects in the data set under the assumption that the class conditional densities are Gaussian with a common covariance matrix. The data sets. are grouped according to the performance of the proposed algorithms. The proposed algorithm outperformed both Shental and Basu for the data sets shown in Figures 5.7 to 5.9. The performance of the proposed algorithm is comparable to its competitors for the data sets shown in Figures 5.10 to 5.12. For the data sets shown in Figures 5.13, the proposed algorithm is slightly inferior to one of its competitors. We shall examine the performance on individual data sets later. Perhaps the first observation from these figures is that the performance is not monotonic, i.e., the F-score and the N MI can actually decrease when there are ad- ditional constraints. This is counter-intuitive, because one expects improved results when more information (in the form of constraints) is fed as the input to the algo— rithms. Note that this lack of monotonicity is observed for all the three algorithms. There are three reasons for this. First, the additional constraints can be based on 222 data points that are erroneously labeled (errors in the ground truth), or they are “outlier” in the sense that they would be nus-classified by most reasonable super- vised classifiers trained with all the labels known. The additional constraints in this case serve as “mis—information”, and it can hurt the performance of the clustering under constraints algorithms. This effect is more severe for the proposed approach when there are only a small number of constraints, because the influence of each of the constraints may be magnified by a large value of A. The second reason is that an algorithm may locate a poor local optima. In general, the larger the number of constraints, the greater the number of local optima in the energy landscape. So, the proposed algorithm as well as Shentaland Basu is more likely to get trapped in poor local optima. This trend is the most obvious for Basu, as the performance at 10% and 15% constraint levels dropped for more than half of the data sets. This is not surpris- ing, because the iterative conditional mode used by Basu is greedy and it is likely to get trapped in local optima. The third reason is specific to the proposed approach. It is due to the random nature of the partitioning of the constraints into training set and validation set. If we have an unfavorable split, the value of A found by minimizing the number of violations on the set of validation constraints can be suboptimal. 
In fact, we observe that whenever there is a significant drop in the F-score and NMI, there often exists a better value of /\ than the one found by the validation procedure. Performance on Individual Data Sets The result on the ethn data set can be seen in Figures 5.7(a) and 5.7(b). The performance of the proposed algorithm im- proves with additional constraints, and it outperforms Shental and Basu at all con- 223 straint levels. A similar phenomenon occurs for the. Mondrian data set (Figures 5.7(c) and 5.7(d)) and the ion data set (Figures 5.7(e) and 5.7(f)). For Mondrian, note that 1% constraint level is already sufficient to bias the cluster parameter to match the re— sult using the ground-truth labels. Additional constraints only help marginally. The performance of the proposed algorithm for the script data set (Figures 5.8(a) and 5.8(b)) is better than Shental and Basu for all constraint levels except 1%, where the proposed algorithm is inferior to the result of Basu. However, given how much better the k-means algorithm is when compared with the EM algorithm in the absence of constraints, it is fair to say that the proposed algorithm is doing a decent job. For the data set derm, the clustering solution without any constraints is pretty good: that solution, in fact, satisfies all the constraints when the constraint levels are 1% and 2%. Therefore, it is natural that the performance does not improve with the provision of the constraints. However, when the constraint level is higher than 2%, the proposed algorithm again outperforms Shental and Basu (Figures 5.8(c) and 5.8(d)). The per- formance of the proposed algorithm on the vehicle data set is superior to Shental and Basu for all constraint levels except 5%, where the performance of Shental is slightly superior. For the data set wdbc, the performance of the proposed algorithm (Figures 5.9(a) and 5.9(b)) is better than Shental at all constraint levels except 5%. The proposed algorithm outperforms Basu when the constraint level is higher than 1%. The F-score of the proposed algorithm on the UCI-seg data (Figures 5.10(a)) is superior to Shental at three constraint levels and is superior to Basu at all but 1% constraint level. On the other hand, if N MI is used (Figure 5.10(b)), the proposed 224 1 algorithm does not do as well as the others. For the heart data set, the proposed algorithm is superior to Shental at all constraint levels, but it is superior to Basu at only 3% constraint level (Figures 5.10(c) and 5.10(d)). Note that the performance of Basu might be inflated because we only report its best results among all possible values of constraint penalty in this algorithm. we can regard the performance of the proposed algorithm on the austra data set (Figures 510(0) and 5.10(f)) as a tie with Shental and Basu, because the proposed algorithm outperforms Shental and Basuat three out of six possible constraint levels. For the german data set, the proposed algorithm performs the best in terms of NMI (Figure 5.11(b)), though the performances of all three algorithms are not that good. Apparently, this is a difficult data set. The performance of the proposed algorithm is less impressive when F—score is used, however (Figure 5.11(a)). The proposed algorithm is superior to Shental in performance for the Sim-300 data set (Figures 5.11(e) and 5.11(d)). While the proposed algorithm has a tie in performance when compared with Basu based on the F-score, Basu outperforms the proposed algorithm on this data set when N MI is used. 
The result of the diff-300 data set (Figures 5.11(e) and 5.11(f)) is somewhat similar: the proposed algorithm outperforms Shental at all constraint levels, but it is inferior to Basu. Given the fact that the k-means algorithm is much better than EM in the absence of constraints for this data set, the proposed algorithm is not as bad as it first seems. For the sat data set (Figures 5.12(a) and 5.12(b)), the proposed algorithm outperforms Shental and Basu significantly in terms of F-score when the constraint levels are 10% and 15%. The improvement in NMI is less significant, though the proposed method is still the best at three constraint levels. The result of the digits data set (Figures 5.12(c) and 5.12(d)) is similar: the proposed method is superior to its competitors at three and four constraint levels if the F-score and the NMI are used as the evaluation criteria, respectively.

It is difficult to draw any conclusion on the performance of the three algorithms on the mfeat-fou data set (Figures 5.13(a) and 5.13(b)). The performances of all three algorithms go up and down with an increasing number of constraints. Apparently this data set is fairly noisy, and clustering with constraints is not appropriate for it. For the data set same-300, the proposed algorithm does not perform well: it has a tie with Shental, but it is inferior to Basu at all constraint levels, as seen in Figures 5.13(c) and 5.13(d). The performance of the proposed algorithm is better than Shental only at the 15% constraint level for the data set texture (Figures 5.13(e) and 5.13(f)). The proposed algorithm is superior to Basu for this data set, though this is probably due to the better performance of the EM algorithm in the absence of constraints. Note that this is a relatively easy data set for model-based clustering: both k-means and EM have an F-score higher than 0.95 when no constraints are used.

5.6.3 Experiments on Feature Extraction

We have also tested the idea of learning the low-dimensional subspace and the clusters simultaneously in the presence of constraints. Our first experiment in this regard is based on the data set shown in Figure 5.3. The two features were standardized to variance one before applying the algorithm described in Section 5.5 with the two must-link constraints.
[Tables 5.4 to 5.9: Performance of the clustering under constraints algorithms at the 1%, 2%, 3%, 5%, 10%, and 15% constraint levels, respectively. For each data set, the tables report the number of data points n, the number of constraints, and the F-score, NMI, and log-likelihood obtained by the proposed algorithm (together with the constraint strength λ selected by the validation procedure), by Shental with the default initialization, the special initialization, and their combination, and by Basu.]
Figure 5.7: F-score and NMI for different algorithms for clustering under constraints for the data sets ethn, Mondrian, and ion. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line.
The horizontal axis shows the number of constraints as the percentage of the number of data points.

Figure 5.8: F-score and NMI for different algorithms for clustering under constraints for the data sets script, derm, and vehicle. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.

Figure 5.9: F-score and NMI for different algorithms for clustering under constraints for the data set wdbc. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.

Figure 5.10: F-score and NMI for different algorithms for clustering under constraints for the data sets UCI-seg, heart, and austra. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.

Figure 5.11: F-score and NMI for different algorithms for clustering under constraints for the data sets german, sim-300, and diff-300. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.

Figure 5.12: F-score and NMI for different algorithms for clustering under constraints for the data sets sat and digits. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.
Figure 5.13: F-score and NMI for different algorithms for clustering under constraints for the data sets mfeat-fou, same-300, and texture. The results of the proposed algorithm, Shental, and Basu are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier trained using all the labels is shown by the gray dashdot line. The horizontal axis shows the number of constraints as the percentage of the number of data points.

Based on the result shown in Figure 5.14(a), we can see that a good projection direction was found by the proposed algorithm. The projected data follow the Gaussian distribution well, as evident from Figure 5.14(b).

Our second experiment concerns the combination of feature extraction and the kernel trick to detect clusters with general shapes. The two-ring data set (Figure 5.15(a)) considered in [158], which used a hidden Markov random field approach for clustering with constraints in kernel k-means, was used. As in [158], we applied the RBF kernel to transform this data set of 200 points nonlinearly. The kernel width was set to 0.2, which was the 20th percentile of all the pairwise distances. Unlike [158], we applied kernel PCA to this data set and extracted 20 features. The algorithm described in Section 5.5 was used to learn a good projection of these 20 features into a 2D space while clustering the data into two groups simultaneously in the presence of 60 randomly generated constraints. The result shown in Figure 5.15(b) indicates that the algorithm successfully found a 2D subspace in which the two clusters were Gaussian-like and all the constraints were satisfied. When we plot the cluster labels of the original two-ring data set, we can see that the desired clusters (the "inner" and the "outer" rings) were recovered perfectly (Figure 5.15(c)). Note that the algorithm described in [158] required at least 450 constraints to identify the two clusters perfectly, whereas we have only used 60 constraints. For comparison, the spectral clustering algorithm in [194] was applied to this data set using the same kernel matrix as the similarity. The two desired clusters could not be recovered (Figure 5.15(d)). In fact, the two desired clusters were never recovered even when we tried other values of the kernel width.

Figure 5.14: The result of simultaneously performing feature extraction and clustering with constraints on the data set in Figure 5.3(a). (a) Clustering result and the axis to be projected; (b) clustering result after projection to the axis. The blue line in (a) corresponds to the projection direction found by the algorithm. The projected data points (which are 1D), together with the cluster labels and the two Gaussians, are shown in (b).
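The kernel preprocessing of the two-ring experiment can be sketched as follows: the RBF kernel width is taken to be the 20th percentile of all pairwise Euclidean distances, and kernel PCA then yields the 20 features fed to the algorithm of Section 5.5. The snippet below is an illustration using SciPy and scikit-learn, and it assumes the parameterization k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); the exact kernel convention used in the thesis software may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import KernelPCA

def two_ring_features(X, n_components=20, percentile=20):
    """Build an RBF kernel whose width is a percentile of the pairwise distances,
    then extract kernel-PCA features from the precomputed kernel matrix."""
    d = pdist(X)                             # condensed pairwise Euclidean distances
    sigma = np.percentile(d, percentile)     # kernel width (about 0.2 for the two-ring data)
    K = np.exp(-squareform(d) ** 2 / (2.0 * sigma ** 2))
    kpca = KernelPCA(n_components=n_components, kernel="precomputed")
    return kpca.fit_transform(K)             # n x 20 feature matrix for constrained clustering
```

From this point on, the constrained clustering and the learning of the 2D projection proceed exactly as in the Gaussian case.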
Figure 5.15: An example of learning the subspace and the clusters simultaneously. (a) The original data and the constraints, where solid (dotted) lines correspond to must-link (must-not-link) constraints. (b) Clustering result of projecting the 20 features extracted by kernel PCA to a 2D space. (c) Clustering solution. (d) Result of applying spectral clustering [194] to this data set with two clusters, using the same kernel used for kernel PCA.

5.7 Discussion

5.7.1 Time Complexity

The computation of the objective function and its gradient requires the calculation of r_ij, s_ij, and w_ij, and the weighted sums of different sufficient statistics with r_ij and w_ij as weights. When compared with the EM algorithm for standard model-based clustering, the extra computation in the proposed algorithm is due to s_ij, w_ij, and the accumulation of the corresponding sufficient statistics. These take O(kd(m⁺ + m⁻ + n*)) time, where k, d, m⁺, m⁻, and n* denote the number of clusters, the dimension of the feature vector, the number of must-link constraints, the number of must-not-link constraints, and the number of data points involved in any constraint, respectively. This is smaller than the O(kdn) time required for one iteration of the EM algorithm, with n indicating the total number of data points. Multiplication by the inverse of the Hessian can be done in O(d³) time without forming the Hessian explicitly. Each Newton iteration can also require more than one objective function evaluation because of the line-search.

Each iteration in the algorithm Shental is similar to that in the standard EM algorithm. The difference is in the E-step, in which Shental involves an inference for a Markov network. This can take exponential time with respect to the number of constraints in the worst case. The per-iteration computation cost in Basu is in general smaller than for both Shental and the proposed algorithm, because it is fundamentally the k-means algorithm. However, the use of iterative conditional mode to solve for the cluster labels in the hidden Markov random field, as well as the metric learning based on the constraints, becomes the overhead due to the constraints.

In practice, the proposed algorithm is slower than the other two because of the cross-validation procedure used to determine the optimal λ. Even when λ is fixed, however, the proposed algorithm is still slower because (i) the optimization problem considered by the proposed algorithm is more difficult than those considered by Shental and Basu, and (ii) the convergence criterion based on the relative norm of the gradient is stricter.
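The stricter stopping rule mentioned in (ii) is the one from Section 5.6.2.2: the gradient norm must fall below a fraction of the initial gradient norm, with a small relative change of the objective as a round-off safeguard and a cap on the number of Newton iterations. A hypothetical sketch of this test, whose names and default thresholds are illustrative rather than taken from the thesis software, is:

```python
import numpy as np

def converged(grad, grad0, J_prev, J_curr, it,
              rel_grad_tol=1e-4, rel_obj_tol=1e-10, max_iter=5000):
    """Stopping test applied after each Newton step (illustrative).

    - Main criterion: gradient norm below `rel_grad_tol` times the norm of the
      initial gradient (the schedule in Section 5.6.2.2 uses 1e-2, 1e-3, 1e-4).
    - Safeguard: relative change of the objective below 1e-10.
    - Hard cap of 5000 Newton iterations.
    """
    if it >= max_iter:
        return True
    if np.linalg.norm(grad) < rel_grad_tol * np.linalg.norm(grad0):
        return True
    return abs(J_curr - J_prev) < rel_obj_tol * max(1.0, abs(J_prev))
```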
5.7.2 Discriminative versus Generative One way to view the difference between the proposed algorithm and the algorithms Shental and Basu is that both Shental and Basu are generative, whereas the pro- posed approach is a combination of generative and discriminative. In supervised learning, a classifier is “generative” if it assumes a certain model on how the data from different classes are generated via the specification of the class conditional den- 243 sities, whereas a “discriminative” classifier is built by optimizing some error measure, without any regard to the class conditional densities. Discriminative approaches are often superior to generative approaches when the actual class conditional densities differ from their assumed forms. On the other hand, incorporation of prior knowl- edge is easier for generative approaches because one can construct a generative model based on the domain knowledge. Discriminative approaches are also more prone to overfitting. In the context of clustering under constraints, Shental and Basu can be regarded as generative because they specify a hidden Markov random field to describe how the data are generated. The constraint violation term f(6;C) used by the pro— posed algorithm is discriminative, because it effectively counts the number of vio- lated constraints, which are analogous to the number of misclassified samples. The log-likelihood term £(6; y) in the proposed objective function is generative because it is based on how the data are generated by a finite mixture model. Therefore, the proposed approach is both generative and discriminative, with the tradeoff parameter /\ controlling the relative importance of these two properties. One can think that the discriminative component enables the proposed algorithm to have a higher perfor- mance, whereas the generative component acts as a regularization term to prevent overfitting in the discriminative component. This discussion provides a new perspective in viewing the example in Figure 5.3. Shental and Basu, being generative, failed to recover the two desired clusters because their forms differ significantly from what Shental and Basu assume about a cluster. On the other hand, the discriminative property of the proposed algorithm can locate 244 the desired vertical cluster boundary, which can satisfy the constraints. The discriminative nature of the proposed algorithm is also the reason why the proposed algorithm, using constraints only, can outperform the generative classifier using all the labels. This is surprising at. first, because, after all, constraints carry less information than labels. Incorporating the constraints on only some of the objects therefore should not outperform the case when the labels of all objects are available. However, this is only true when all possible classifiers are considered. When we restrict ourself to the generative classifier that. assumes a Gaussian distribution with common covariance matrix as the class conditional density, it is possible for a discriminative algorithm to outperform the generative classifier if the class conditional densities are non-Gaussians. In fact, for the data sets ethn, Mondrian, script, wdbc, and texture, we observed that the proposed algorithm can have a higher F-score or NMI than that estimated using all the class labels. The difference is more noticeable for script and wdbc. 
Note that for the data set austra, the generative algorithm Shental can also out-perform the classifier trained using all the labels, though the difference is very small and it may be due to the noisy nature of this data set. 5.7 .3 Drawback of the Proposed Approach There are two main drawbacks of the proposed approach. The optimization problem considered, while accurately representing the goal of clustering with constraints, is more difficult. This has several consequences. First, a more sophisticated algorithm (line—search Newton) is needed instead of the simpler EM algorithm. The landscape 245 of the proposed objective function is more “rugged”. So, it it is more likely to get trapped in poor local optima. It also takes more iterations to reach a local optimum. Because we are initializing randomly, this also means that the proposed algorithm is not very stable if we have an insufficient number of random initializations. The second difficulty is the determination of A. (Note that the algorithm Basu has a similar parameter.) In our experiments, we adopted a cross-validation procedure to determine A, which is computationally expensive. Cross-validation may yield a suboptimal A when the number of informative constraints in the validation set is too small, or when too many constraints are erroneous due to the noise in the data. Here, a constraint is informative if it provides “useful” information to the clustering process. So, a must—link constraint between two points close to each other is not very informative because they are likely to be in the same cluster anyway. Another problem is that we may encounter an unfavorable split of the training and validation constraints when the set of available constraints is too small. When this happens, the number of violations for the validation constraints is significantly larger than that of the training constraints. Increasing the value of A cannot reduce the violation of the validation constraints, leading to an optimal constraint strength of zero. When this happens, we should try a different split of the constraints for training and validation. 246 5.7 .4 Some Implementation Details We have. incorporated some heuristics in our optimization algorithm. During the optimization process, a cluster may become almost empty. This is detected when 2, fij/n. falls below a threshold, which is set to 4 x 10‘3/k. The empty cluster is removed, and the largest cluster that can result in the increase in the .7 value is split to maintain the same number of clusters. If no such cluster exists, the one that can lead to the smallest decrease in j is split. Another heuristic is that we lower- bound aj by 10-8, no matter what the values of {flj} are. This is used to improve the numerical stability of the proposed algorithm. The ozj are then renormalized to ensure that they sum to one. 5.8 Summary We have presented an algorithm that handles instance-level constraints for model- based clustering. The key assumption in our approach is that the cluster labels are determined based on the feature vectors and the cluster parameters; the set of con- straints has no influence here. This contrasts with previous approaches like [231] and [21] which impose prior distribution on the cluster labels directly to reflect the con- straints. This is the fundamental reason for the anomaly described in Section 5.2. The actual clustering is performed by the line-search Newton algorithm under the natural parameterization of the Gaussian distributions. 
The strength of the constraints is determined by a hold—out set of validation constraints. The proposed approach can be extended to handle simultaneously feature extraction and clustering under con— 247 straints. The effectiveness of the proposed approach has been demonstrated on both synthetic data sets and real-world data sets from different. domain. In particular, we notice that the discriminative nature of the proposed algorithm can lead to superior performance when compared with a generative classifier trained using the labels of all the objects. 248 Chapter 6 Summary The primary objective of the work presented in this dissertation is to advance the state-of-the-art in unsupervised learning. Unsupervised learning is challenging be- cause its objective is often ill—defined. Instead of providing yet another new unsuper- vised learning algorithm, we are more interested in studying issues that are generic to different unsupervised learning tasks. This is the motivation behind the study of various topics in this dissertation, including the modification of the batch version of an algorithm to become incremental, the selection of the appropriate data representa- tion (feature selection), and the incorporation of side—information in an unsupervised learning task. 6. 1 Contributions The results in this thesis have contributed to the field of unsupervised learning in several ways, and has led to the publication of two journal articles [163, 164]. Several 249 conference papers [168, 161, 167, 165, 82] have also been published at different stages of the research conducted in this thesis. The incremental ISOMAP algorithm described in Chapter 3 has made the follow- ing contributions: 0 Framework for incremental manifold learning: The proposed incremental ISOMAP algorithm can serve as a general framework for converting a mani- fold learning algorithm to become incremental: the neighborhood graph is first updated, followed by the update of the low-dimensional representation, which is often an incren‘iental eigenvalue problem similar to our case. 0 Solution of the all-pairs shortest path problems: One component in the incre— mental algorithm is to update the all-pairs shortest path distances in view of the change in the neighborhood graph due to the new data points. We have devel- oped a new algorithm that performs such an update efficiently. Our algorithm updates the shortest path distances from multiple source vertices simultane- ously. This contrasts with previous work like [193], where different shortest path trees are updated independently. 0 Improved embedding for new data points: We have derived an improved esti- mate of the inner product between the low-dimensional representation of the new point and the low—dimensional representations of the existing points. This leads to an improved embedding for the new point. 0 Algorithm for incremental eigen-decomposition with increasing matrix size: The problem of updating the low-dimensional representation of the data points 250 is essentially an incremental eigen-decomposition problem. Unlike the previous work [270], however, the size of the matrix we considered is increasing. 0 Vertex contraction to memorize the effect of data points: A vertex contrac- tion procedure that improves the geodesic distance estimate without additional memory is proposed. 
Our work on estimating the feature saliency and the number of clusters simultaneously in Chapter 4 has made the following contributions:

• Feature saliency in unsupervised learning: The problem of feature selection / feature saliency estimation is rarely studied for unsupervised learning. We tackle this problem by introducing a notion of feature saliency, which describes the difference between the distributions of a feature among different clusters. The saliency is estimated efficiently by the EM algorithm.

• Automatic feature saliency estimation and determination of the number of clusters: The algorithm in [81], which utilizes the minimum message length to select the number of clusters automatically, is extended to estimate the feature saliency.

The clustering under constraints algorithm proposed in Chapter 5 has made the following contributions:

• New objective function for clustering under constraints: We have proposed a new objective function for clustering under constraints under the assumption that the constraints do not have any direct influence on the cluster labels. Extensive experimental evaluations reveal that this objective function is superior to the other state-of-the-art algorithms in most cases. It is also easy to extend the proposed objective function to handle group constraints that involve more than two data points.

• Avoidance of counter-intuitive clustering results: The proposed objective function can avoid the pitfall of previous clustering under constraints algorithms like [231] and [21], which are based on hidden Markov random fields. Specifically, clustering solutions that assign to a data point a cluster label different from all of its neighbors are possible for the previous algorithms, a situation avoided by the proposed algorithm.

• Robustness to model mismatch: The proposed objective function for clustering under constraints is a combination of generative and discriminative terms. The discriminative term, which is based on the satisfaction of the constraints, improves the robustness of the proposed algorithm towards mismatch in the cluster shape. This leads to an improvement in the overall performance. The improvement can sometimes be so significant that the proposed algorithm, using constraints only, outperforms a generative supervised classifier trained using all the labels.

• Feature extraction and clustering with constraints: The proposed algorithm has been extended to perform feature extraction and clustering with constraints simultaneously by locating the best low-dimensional subspace, such that the Gaussian clusters formed satisfy the given set of constraints as well as they can. This allows the proposed algorithm to handle data sets with higher dimensionality. The combination of this notion of feature extraction and the kernel trick allows us to extract clusters with general shapes.

• Efficient implementation of the line-search Newton algorithm: The proposed objective function is optimized by the line-search Newton algorithm. The multiplication by the inverse of the Hessian for the case of a Gaussian mixture can be done efficiently with time complexity O(d^3), without forming the O(d^2)-by-O(d^2) Hessian matrix explicitly. Here, d denotes the number of features. A naive approach of inverting the Hessian would require O(d^6) time.

6.2 Future work

The study conducted in this dissertation leads to several interesting new research possibilities.
• Improvement in the efficiency of the incremental ISOMAP algorithm: There are several possibilities for improving the efficiency of the proposed incremental ISOMAP algorithm. Data structures such as kd-trees, ball-trees, and cover-trees [19] can be used to speed up the search for the k nearest neighbors. The update strategy for the geodesic distances and co-ordinates can be more aggressive; we can sacrifice the theoretical convergence property in favor of empirical efficiency. For example, the geodesic distances can be updated approximately using a scheme analogous to the distance-vector protocol in the network routing literature. The co-ordinate update can be made faster if only a subset of the co-ordinates (such as those close to the new point) is updated at each iteration. The co-ordinates of every point would eventually be updated if the new points came from different regions of the manifold.

• Incrementalization of other manifold learning algorithms: The algorithm in Chapter 3 modifies the ISOMAP algorithm to become incremental. We can also modify similar algorithms, such as locally linear embedding or Laplacian eigenmaps, to become incremental.

• Feature dependency in dimensionality reduction and unsupervised learning: The algorithm in Chapter 4 assumes that the features are conditionally independent of each other when the cluster labels are known. This assumption, however, is generally not true in practice. A new algorithm needs to be designed to cope with the situation when features are highly correlated in this setting.

• Feature selection and constraints: The main difficulty of feature selection in clustering is the ill-posed nature of the problem. A possible way to make the problem more well-defined is to introduce instance-level constraints. In Section 5.5, we described an algorithm for performing feature extraction and clustering under constraints simultaneously. One can apply a similar idea and use the constraints to assist in feature selection for clustering.

• More efficient algorithms for clustering with constraints: The use of the line-search Newton algorithm for optimizing the objective function in Chapter 5 is relatively efficient when compared with alternative approaches. Unfortunately, the objective function, which effectively uses the Jensen-Shannon divergence to count the number of violated constraints, is difficult to optimize. It is similar to the direct minimization of the number of classification errors in supervised learning, which is generally perceived as difficult. Often, the number of errors is approximated by some quantity that is easier to optimize, such as the distances of mis-classified points from the separating hyperplane in the case of support vector machines. In the current context, we may want to approximate the number of violated constraints by some quantity that is easier to optimize. A difficulty can arise, however, when both must-link and must-not-link constraints are considered. If the violation of a must-link constraint is approximated by a convex function g(.), the violation of a must-not-link constraint is naturally approximated by -g(.), which is concave. Their combination leads to a function that is neither concave nor convex, which is difficult to optimize. Techniques like DC (difference of convex functions) programming [117] can be adopted for global optimization.

• Number of clusters for clustering with constraints: The algorithm described in Chapter 5 assumes that the number of clusters is known.
It would be desirable if the number of clusters could be estimated automatically from the data, and the presence of constraints should be helpful in this process. In fact, correlation clustering [10] considers must-link and must-not-link constraints only, without any regard to the feature vectors, and it can infer the optimal number of clusters by minimizing the number of constraint violations.

APPENDICES

Appendix A

Details of Incremental ISOMAP

In this appendix, we present the proofs of correctness of the algorithms in Chapter 3 and analyze their time complexity.

A.1 Update of Neighborhood Graph

The procedure to update the neighborhood graph has been described in Section 3.2.1.1, where A, the set of edges to be added, and D, the set of edges to be deleted, are constructed upon insertion of v_{n+1} into the neighborhood graph.

Time Complexity: For each i, the conditions in Equations (3.1) and (3.2) can be checked in constant time. So, the construction of A and D takes O(n) time. The calculation of L_i for all i can be done in O(Σ_i deg(v_i) + |A|), or O(|E| + |A|), time by examining the neighbors of the different vertices. Here, deg(v_i) denotes the degree of v_i. The complexity of the update of the neighborhood graph can therefore be bounded by O(nq), where q is the maximum degree of the vertices in the graph after inserting v_{n+1}. Note that L_i becomes r_i for the updated neighborhood graph.

A.2 Update of Geodesic Distances: Edge Deletion

A.2.1 Finding Vertex Pairs For Update

In this section, we examine how the geodesic distances should be updated upon edge deletion. Consider an edge e(a,b) ∈ D that is to be deleted. If π_ab ≠ a, the shortest path between v_a and v_b does not contain e(a,b). Deletion of e(a,b) does not affect sp(a,b) and hence none of the existing shortest paths are affected. Therefore, we have

Lemma A.1. If π_ab ≠ a, deletion of e(a,b) does not affect any of the existing shortest paths, and therefore no geodesic distance g_ij needs to be updated.

We now consider the case π_ab = a. This implies π_ba = b because the graph is undirected. The next lemma is an easy consequence of this assumption.

Lemma A.2. For any vertex v_i, sp(i,b) passes through v_a iff sp(i,b) contains e(a,b) iff π_ib = a.

Before we proceed further, recall the definitions of T(b) and T(b;a) in Section 3.1: T(b) is the shortest path tree of v_b, where the root node is v_b and sp(b,j) consists of the tree edges from v_b to v_j, and T(b;a) is the subtree of T(b) rooted at v_a. Let R_ab ≡ {i : π_ib = a}. Intuitively, R_ab contains the vertices whose shortest paths to v_b include e(a,b). We shall first construct R_ab, and then "propagate" from R_ab to obtain the geodesic distances that require an update.

Because sp(t,b) passes through the vertices that are the ancestors of v_t in T(b), plus v_t itself, we have

Lemma A.3. R_ab = { vertices in T(b;a) }.

Proof. v_t ∈ T(b;a) ⟺ v_a is an ancestor of v_t in T(b), or v_a = v_t ⟺ sp(t,b) passes through v_a ⟺ π_tb = a (Lemma A.2) ⟺ t ∈ R_ab. □

If v_t is a child of v_u in T(b), v_u is the vertex in sp(b,t) just before v_t. Thus, we have the lemma below.

Lemma A.4. The set of children of v_u in T(b) = {v_t : v_t is a neighbor of v_u and π_bt = u}.

Consequently, we can examine all the neighbors of v_u to find the node's children in T(b) based on the predecessor matrix. Note that the shortest path trees are not stored explicitly; only the predecessor matrix is maintained.
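In code, Lemma A.4 amounts to a one-line filter over the adjacency list. The small illustrative Python sketch below (our own, assuming a dictionary layout pred[b][t] = parent of v_t in T(b)) shows how the children of a node are recovered from the predecessor matrix alone.

    def children_in_T(adj, pred, b, u):
        """Children of vertex u in the shortest path tree T(b) (Lemma A.4):
        the neighbors t of u whose stored predecessor on sp(b, t) is u."""
        return [t for t in adj[u] if pred[b][t] == u]

    # Tiny example: path graph 0-1-2-3 with unit weights, rooted at b = 0.
    adj  = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    pred = {0: {0: None, 1: 0, 2: 1, 3: 2}}    # pred[b][t] = parent of t in T(b)
    print(children_in_T(adj, pred, b=0, u=1))  # -> [2]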
The first nine lines in Algorithm 3.1 perform a tree traversal that extracts all the vertices in T(b;a) to form R_ab, using Lemma A.4 to find all the children of a node in the tree.

Time Complexity: At any time, the queue Q contains vertices in the subtree T(b;a) that have been examined. The while-loop is executed |R_ab| times, because a new vertex is added to R_ab in each iteration. The inner for-loop is executed a total of Σ_{t ∈ R_ab} deg(v_t) times, which can be bounded loosely by q|R_ab|. Therefore, a loose bound for the first nine lines in Algorithm 3.1 is O(q|R_ab|).

A.2.2 Propagation Step

Define F_(a,b) ≡ {(i,j) : sp(i,j) contains e(a,b)}. Here, (a,b) denotes the unordered pair of a and b. So, F_(a,b) is indexed by the unordered pair (a,b), and its elements are also unordered pairs. Intuitively, F_(a,b) contains the vertex pairs whose geodesic distances need to be recomputed when the edge e(a,b) is deleted. Starting from v_b for each of the vertices in R_ab, we construct F_(a,b) by a search.

Lemma A.5. If (i,j) ∈ F_(a,b), either i or j is in R_ab.

Proof. (i,j) ∈ F_(a,b) is equivalent to sp(i,j) containing e(a,b). The shortest path sp(i,j) can be written either as sp(i,j) = v_i ⇝ v_a → v_b ⇝ v_j, or sp(i,j) = v_i ⇝ v_b → v_a ⇝ v_j, where ⇝ denotes a path between the two vertices. Because a subpath of a shortest path is also a shortest path, either sp(i,b) or sp(j,b) passes through v_a. By Lemma A.2, either π_ib = a or π_jb = a. Hence either i or j is in R_ab. □

Lemma A.6. F_(a,b) = ⋃_{u ∈ R_ab} {(u,t) : v_t is in T(u;b)}.

Proof. By Lemma A.5, (u,t) ∈ F_(a,b) implies either u or t is in R_ab. Without loss of generality, suppose u ∈ R_ab. So, sp(u,t) can be written as v_u ⇝ v_a → v_b ⇝ v_t. Thus v_t must be in T(u;b). On the other hand, for any vertex v_t in the subtree T(u;b), sp(u,t) goes through v_b. Since sp(u,b) goes through v_a (because u ∈ R_ab), sp(u,t) must also go through v_a and hence use e(a,b). □

[Figure A.1: Example of T(u;b) and T(a;b). All the nodes and edges shown constitute T(a;b), whereas only the part of the subtree above the line constitutes T(u;b). This example illustrates the relationship of T(u;b) and T(a;b) as proved in Lemma A.7.]

Direct application of the above lemma to compute F_(a,b) requires the construction of T(u;b) for different u. This is not necessary, however, because for all u ∈ R_ab, T(u;b) must be a part of T(a;b) in the sense exemplified in Figure A.1. This relationship aids the construction of T(u;b) in Algorithm 3.1 (the variable T′), because we only need to expand the vertices in T(a;b) that are also in T(u;b).

Lemma A.7. Consider u ∈ R_ab. The subtree T(u;b) is non-empty; let v_t be any vertex in this subtree, and let v_s be a child of v_t in T(u;b), if any. We have the following:
1. v_t is in the subtree T(a;b).
2. v_s is a child of v_t in the subtree T(a;b).
3. π_us = π_as = t.

Proof. The subtree T(u;b) is not empty because v_b is in this subtree. For any v_t in this subtree, sp(u,t) passes through v_b. Hence sp(u,b) is a subpath of sp(u,t). Because u ∈ R_ab, sp(u,b) passes through v_a. So, we can write sp(u,t) as v_u ⇝ v_a → v_b ⇝ v_t. So, sp(a,t) contains v_b, and this implies that v_t is in T(a;b). Now, if v_s is a child of v_t in T(u;b), sp(u,s) can be written as v_u ⇝ v_a → v_b ⇝ v_t → v_s. So, π_us = t. Because any subpath of a shortest path is also a shortest path, sp(a,s) is simply v_a → v_b ⇝ v_t → v_s, which implies that v_s is also a child of v_t in T(a;b), and π_as = t. Therefore, we have π_us = π_as = t. □
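Lemmas A.3 to A.6 can be checked numerically on small graphs. The following illustrative Python/SciPy sketch (ours, not the thesis code) builds the predecessor matrix with scipy.sparse.csgraph.shortest_path, extracts R_ab and the subtrees T(u;b) by the traversal of Lemma A.4, and verifies that the union in Lemma A.6 coincides with a brute-force enumeration of the pairs whose shortest paths use e(a,b).

    import numpy as np
    from collections import deque
    from scipy.sparse.csgraph import shortest_path

    def subtree(adj, pred, root_src, root):
        """Vertices of T(root_src; root): subtree of T(root_src) rooted at `root`."""
        out, queue = {root}, deque([root])
        while queue:
            u = queue.popleft()
            for t in adj[u]:
                if t not in out and pred[root_src, t] == u:   # Lemma A.4
                    out.add(t)
                    queue.append(t)
        return out

    def uses_edge(pred, i, j, a, b):
        """Does the stored shortest path from i to j traverse the edge e(a, b)?"""
        path, v = [j], j
        while v != i:
            v = pred[i, v]
            path.append(v)
        return frozenset((a, b)) in {frozenset(e) for e in zip(path, path[1:])}

    # Small weighted graph (symmetric adjacency matrix); 0 means "no edge".
    W = np.array([[0, 1, 0, 0, 10],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [10, 0, 0, 1, 0]], dtype=float)
    adj = {u: np.nonzero(W[u])[0].tolist() for u in range(len(W))}
    _, pred = shortest_path(W, directed=False, return_predecessors=True)

    a, b = 1, 2                                  # the edge to be "deleted"
    assert pred[a, b] == a                       # the case covered by Lemma A.3
    R_ab = subtree(adj, pred, b, a)              # Lemma A.3: vertices of T(b; a)
    F_union = {frozenset((u, t)) for u in R_ab for t in subtree(adj, pred, u, b)}
    F_brute = {frozenset((i, j)) for i in range(len(W))
               for j in range(i + 1, len(W)) if uses_edge(pred, i, j, a, b)}
    print(F_union == F_brute)                    # Lemma A.6 holds on this example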
Let F be the set of unordered pairs (i,j) such that a new shortest path from v_i to v_j is needed when the edges in D are removed. So, F = ⋃_{e(a,b) ∈ D} F_(a,b). For each e(a,b) ∈ D, the set R_ab constructed in the first nine lines of Algorithm 3.1 is used to construct F_(a,b) from line 11 until the end of Algorithm 3.1. At each iteration of the while-loop starting at line 15, the subtree T(a;b) is traversed, using the condition π_us = π_as to check whether v_s is in T(u;b) or not. The subtree T(a;b) is expanded only when necessary, using the variable T′.

Time Complexity: If we ignore the time to construct T′, the complexity of the construction of F is proportional to the number of vertices examined. If the maximum degree of T′ is q′, this is bounded by O(q′|F|). Note that q′ ≤ q, where q is the maximum degree of the vertices in the neighborhood graph. The time to expand T′ is proportional to the number of vertices actually expanded plus the number of edges incident on those vertices. This is bounded by q times the size of the tree, and the size of the tree is at most O(|F_(a,b)|). Usually, the time is much less, because different u in R_ab can reuse the same T′. The time complexity to construct F_(a,b) can thus be bounded by O(q|F_(a,b)|) in the worst case. The overall time complexity to construct F, which is the union of F_(a,b) for all e(a,b) ∈ D, is O(q|F|), assuming the number of duplicate pairs in F_(a,b) for different (a,b) is O(1). Empirically, there are at most several such pairs; most of the time, there is no duplicate pair at all.

A.2.3 Performing The Update

Let G′ = (V, E \ D) be the graph after deleting the edges in D. Let B be an auxiliary undirected graph with the same vertices as G, but with edges based on F. In other words, there is an edge between v_i and v_j in the graph B if and only if (i,j) is in F. Because F contains all the vertex pairs whose geodesic distances need to be updated, an edge in B corresponds to a geodesic distance value that needs to be revised. To update the geodesic distances, we first pick a v_u in B with at least one edge incident on it. Define C(u) = {i : e(u,i) is an edge of B}. So, the geodesic distance g_ui needs to be updated if and only if i ∈ C(u). These geodesic distances are updated by the modified Dijkstra's algorithm (Algorithm 3.2), with v_u as the source vertex and C(u) as the set of "unprocessed vertices", i.e., the set of vertices whose shortest paths from v_u are invalid. Recall that the basic idea of Dijkstra's algorithm is that, starting with an empty set of "processed vertices" (vertices whose shortest paths have been found), different vertices are added one by one to this set in an ascending order of estimated shortest path distances. The ascending order guarantees the optimality of the shortest paths. Algorithm 3.2 does something similar, except that the set of "processed vertices" begins with V \ C(u) instead of the empty set. The first for-loop estimates the shortest path distances for j ∈ C(u) if sp(u,j) is "one edge away" from the processed vertices, i.e., sp(u,j) can be written as v_u ⇝ v_a → v_j with a ∈ V \ C(u). In the while-loop, the vertex v_k (k ∈ C(u)) with the smallest estimated shortest path distance is examined and transferred into the set of processed vertices. The estimates of the shortest path distances between v_u and the vertices adjacent to v_k are relaxed (updated) accordingly. This repeats until C(u) becomes empty, i.e., all vertices have been processed.
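A minimal sketch of this modified Dijkstra step is given below (Python, written from the description above rather than copied from Algorithm 3.2). Here dist plays the role of the geodesic-distance row g_{u,·}, C is the set of unprocessed vertices C(u), and adj[v] lists (neighbor, edge weight) pairs.

    import heapq
    import math

    def repair_distances_from(u, C, dist, adj):
        """Recompute dist[t] (= g_{u,t}) for t in C, treating the vertices
        outside C as already processed (their dist values are assumed valid).
        `adj[v]` is a list of (neighbor, weight) pairs; dist is updated in place."""
        C = set(C)
        # Initial estimates: paths that are one edge away from a processed vertex.
        est = {t: math.inf for t in C}
        for t in C:
            for v, w in adj[t]:
                if v not in C:                    # v is processed, dist[v] is valid
                    est[t] = min(est[t], dist[v] + w)
        heap = [(d, t) for t, d in est.items()]
        heapq.heapify(heap)
        while C:
            d, t = heapq.heappop(heap)
            if t not in C or d > est[t]:
                continue                          # stale heap entry
            C.remove(t)
            dist[t] = d
            for v, w in adj[t]:                   # relax the edges out of t
                if v in C and d + w < est[v]:
                    est[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return dist

The only difference from the textbook algorithm is the initialization: the processed set starts as the complement of C rather than the empty set, so only |C| extract-min operations are required, which is what makes the per-source cost depend on |C(u)| rather than on n.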
When the modified Dijkstra's algorithm with v_u as the source vertex finishes, all geodesic distances involving v_u have been updated. Since an edge in B corresponds to a geodesic distance estimate requiring update, we should remove all edges incident on v_u in B. We then select another vertex v_u′ with at least one edge incident on it in B, and call the modified Dijkstra's algorithm again, but with v_u′ as the source vertex. This repeats until B becomes an empty graph.

Time Complexity: The for-loop in Algorithm 3.2 takes at most O(q|C(u)|) time. In the while-loop, there are |C(u)| ExtractMin operations, and the number of DecreaseKey operations depends on how many edges there are among the vertices in C(u); an upper bound for this is q|C(u)|. By using a Fibonacci heap, ExtractMin can be done in O(log|C(u)|) time while DecreaseKey can be done in O(1) time, on average. Thus the complexity of Algorithm 3.2 is O(|C(u)| log|C(u)| + q|C(u)|). If a binary heap is used instead, the complexity is O(q|C(u)| log|C(u)|).

A.2.4 Order for Performing Update

How do we select v_u in B to be eliminated and to act as the source vertex for the modified Dijkstra's algorithm (Algorithm 3.2)? We seek an elimination order that minimizes the time complexity of all the updates. Let f_i be the degree of v_{κ_i}, the i-th vertex removed from B, so that f_i = |C(κ_i)|. The overall time for running the modified Dijkstra's algorithm (with a Fibonacci heap) for all the vertices in B with at least one incident edge is O(T), with

T = Σ_i (f_i log f_i + q f_i).    (A.1)

Because Σ_i f_i is a constant (twice the number of edges in B) with respect to different elimination orders, the vertices should be eliminated in an order that minimizes Σ_i f_i log f_i. If a binary heap is used, the time complexity is O(T*), with

T* = q Σ_i f_i log f_i.    (A.2)

In both cases, we should minimize Σ_i f_i log f_i. Finding an order that minimizes this sum is unfortunately difficult. Since the sum is dominated by the largest f_i, we instead minimize max_i f_i. This minimization is achieved by a greedy algorithm that removes the vertex in B with the smallest degree. The correctness of this greedy approach can be seen from the following argument. Suppose the greedy algorithm is wrong. Then at some point the algorithm makes a mistake, i.e., the removal of v_t instead of v_u leads to an increase of max_i f_i. This can only happen when deg(v_t) > deg(v_u), and we get a contradiction, since the algorithm always removes the vertex with the smallest degree.

Because the degree of each vertex is an integer, an array of linked lists can be used to implement the greedy search (Algorithm 3.3) efficiently without an explicit search. At any instant, the linked list l[i] is empty for i < pos, so the vertices in l[pos] have the smallest degree in B. The for-loop in lines 10 to 18 removes all the edges incident on v_j in B by reducing the degree of all vertices adjacent to v_j by one, and moving pos back by one if necessary.

Time Complexity: The first for-loop in Algorithm 3.3 takes O(|F|) time, because |F| is the number of edges in B. In the second for-loop, pos is incremented at most 2n times, because it can move backwards at most n steps. The inner for-loop is executed altogether O(|F|) times. Therefore, the overall time complexity of Algorithm 3.3 (excluding the time for executing the modified Dijkstra's algorithm) is O(|F|).
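The array-of-lists idea can be prototyped directly with an array of buckets indexed by degree. The sketch below (Python, illustrative, with sets in place of linked lists) yields the vertices of B in the greedy smallest-degree-first order together with their degrees f_i at removal time.

    def min_degree_elimination(edges, n):
        """Greedy smallest-degree-first elimination order for the auxiliary graph B.

        `edges` is an iterable of vertex pairs over 0..n-1.  Yields (vertex, degree
        at removal time); the yielded degrees are the f_i of Equation (A.1).
        Buckets of vertices indexed by degree play the role of the linked lists l[i].
        """
        adj = [set() for _ in range(n)]
        for i, j in edges:
            adj[i].add(j)
            adj[j].add(i)
        deg = [len(a) for a in adj]
        buckets = [set() for _ in range(n + 1)]   # buckets[d] = vertices of degree d
        for v in range(n):
            if deg[v] > 0:
                buckets[deg[v]].add(v)
        active = sum(1 for d in deg if d > 0)
        pos = 1                                   # analogue of the pointer `pos`
        while active:
            while not buckets[pos]:               # advance to first non-empty bucket
                pos += 1
            v = buckets[pos].pop()
            f_v, deg[v] = deg[v], 0               # remove v from B
            active -= 1
            yield v, f_v
            for u in adj[v]:                      # delete the edges incident on v
                if deg[u] == 0:                   # u already removed (or isolated)
                    continue
                buckets[deg[u]].remove(u)
                deg[u] -= 1
                if deg[u] == 0:
                    active -= 1
                else:
                    buckets[deg[u]].add(u)
                    pos = min(pos, deg[u])        # "move pos back" when necessary

    if __name__ == "__main__":
        E = [(0, 1), (0, 2), (0, 3), (2, 3), (3, 4)]
        for v, f in min_degree_elimination(E, 5):
            print("remove vertex", v, "with degree", f)

Vertices whose degree drops to zero while other vertices are being processed never appear in the output, which mirrors the fact that they need not serve as a source for the modified Dijkstra's algorithm.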
A.3 Update of Geodesic Distances: Edge Insertion

In Equation (3.3), we describe how the geodesic distance between the new vertex v_{n+1} and v_i is computed, after updating the geodesic distances in view of the edge deletion. Since all the edges in A, the set of edges inserted into the neighborhood graph, are incident on v_{n+1}, any improvement in an existing shortest path must involve v_{n+1}. Let L ≡ {(i,j) : g_{i,n+1} + w_{n+1,j} < g_ij}. Intuitively, L is the set of unordered pairs, adjacent to v_{n+1}, with improved shortest paths due to the insertion of v_{n+1}. For each (a,b) ∈ L, Algorithm 3.4 is used to propagate the effect of the improvement in sp(a,b) to the vertices near v_a and v_b. First, lines 1 to 9 construct a set S_ab that is similar to R_ab in Algorithm 3.1; it consists of the vertices whose shortest paths to v_b have been improved. For each vertex v_i in S_ab, lines 11 to 22 search for other shortest paths starting from v_i that can be improved, and update the geodesic distances according to the improved shortest paths just discovered. The idea is analogous to the construction of F_(a,b) in Algorithm 3.1, except that now sp(a,b) is improved instead of destroyed as in the case of F_(a,b). The correctness of Algorithm 3.4 can be seen by the following argument. Without loss of generality, the improved shortest path between v_i and v_j can be written as v_i ⇝ v_a → v_{n+1} → v_b ⇝ v_j. So, v_i is a vertex in T(n+1;a), and v_j must be in both T(i;b) and T(n+1;b). If v_l is a child of v_j in T(i;b), v_l is also a child of v_j in T(n+1;b), and (g_{i,n+1} + g_{n+1,l}) < g_il should be satisfied. In other words, the relationship between T(i;b) and T(n+1;b) here is similar to the relationship between T(u;b) and T(a;b) depicted in Figure A.1. The proof of these properties is similar to the proof given for the relationship between F_(a,b) and R_ab, and hence is not repeated.

Time Complexity: The set L can be constructed in O(|A|^2) time. Let H = {(i,j) : a better shortest path appears between v_i and v_j because of v_{n+1}}. By an argument similar to the complexity of constructing F, the complexity of finding H and revising the corresponding geodesic distances in Algorithm 3.4 is O(q|H| + |A|^2).

A.4 Geodesic Distance Update: Overall Time Complexity

Updating the neighborhood graph takes O(nq) time. The construction of R_ab and F_(a,b) (Algorithm 3.1) takes O(q|R_ab|) and O(q|F_(a,b)|) time, respectively. Since |F_(a,b)| ≥ |R_ab|, these steps take O(q|F_(a,b)|) time together. As a result, F can be constructed in O(q|F|) time. The time to run the modified Dijkstra's algorithm (Algorithm 3.2) is difficult to estimate. Let μ be the number of vertices in B with at least one edge incident on it, and let ν ≡ max_i f_i, with f_i defined in Appendix A.2.4. In the highly unlikely worst case, ν can be as large as μ. The time of running Algorithm 3.2 can be rewritten as O(μν log ν + qμν). The typical value of ν can be estimated using concepts from random graph theory. It is easy to see that

ν = max{l : B has an l-regular sub-graph},    (A.3)

where an l-regular sub-graph is defined as a subgraph in which every vertex has degree l. Unfortunately, we were unable to locate an exact result on the behavior of the largest l-regular sub-graph in random graph theory. On the other hand, the largest l-complete sub-graph, i.e., a clique of size l, of a random graph has been well studied.
The clique number (the size of the largest clique in a graph) of almost ev- ery graph is “close” to 0(log u) [200], assuming the average degree of vertices is a constant and u is the number of vertices in the graph. Based on our empirical observations in the experiments, we conjecture that, on average, I/ is also of the or- der O(log ,u). 'With this conjecture, the total time to run the Dijkstra’s algorithm can be bounded by ()(uloguloglogu + qul). Finally, the time complexity of al- gorithm 3.4 is 0(qIHI + IAI2). So, the overall time complexity can be written as 0(qIFl + qul + u logu log log u + [A|2). Note that u s 2|F|. In practice, the first two terms dominate, and the complexity can be written as O(q(|F I + |H|)). 270 Appendix B Calculations for Clustering with Constraints The purpose of this appendix is to derive the results in Chapter 5, some of which are relatively involved. B.1 First Order Information In this appendix, we shall derive the gradient of the objective function ,7. The differential of a variable or a function a: will be denoted by “d :17”. We shall first compute the differential of J, followed by the conversion of the differentials into the derivatives with respect to the cluster parameters. 271 B.1.1 Computing the Differential The differential of the log-likelihood can be derived as follows: k k 1 .. dl .. (l .C( 6; 3?): Ed (10g Zexp( IOgCIijl> = ZZCXP( 08Q2J)( qutyl 3'21 i=lj=1 Zj'eXpa quii’ ’) n k (3.1) = 22% (d logql-j). i=1j=1 Here, rij = exp(log(1,]~)/ 2]»: exp(log (lift) = (“j/2ft gift is the usual posterior probability for the j-th cluster given the point y,. The annealing version of the log-likelihood, which is needed if we want to apply a deterministic-annealing type of pro) j l : 72(52'3'10832'1" Sij ZS.” 10g Sil) (d lOng’j) j l + 0, + _ + +) (12tthOothj-— Zlogthj( (d thj) .7 = T Zest/132 ahisij (d 105 (In - Z Sit (‘1 10% (1:1 l) j i l :2 T Z ahisij (10g t3“). -— 2 Sit log flit) (d log (1,-j) ij 1 (1 Z tgj log tgj = r: bhz‘Sij (log tgj — :8” log thl) (d log qij) j ij t The differential for the Jensen-Shannon divergence term is then given by d DJS (h): ZahZZsU locrsij— Zia; 100% = Tzahisij (log 8” — :81] log Sit) (d logqij) ij 1 — T 2 arms (10517:,- - Z 8a 105 till) (d log (lij) ij 1 = T Z “hisij (logs t—ij - 28 9,1 log ff 8&1) (d log (fij) ij hj [ hl dD — — —erh,-s,-j (100— — szlloa—“)(108(Iijld thl 274 The differential of the loss functions of constraint violation can thus be written as m+ m— (1f(6;C) = d (3121 Agojsuz) + Z A;D;S(h)) = —TZ(Z A, (zmsiJ- (loggi—j W: ,1 lorr Ell) 2] h J 1 hl —;A;bh:8ij(10:j—_§;52110”—))—:d logqul thl = JZ(:Z+:},)‘ aIiiSiJIO iJ gthj “7%: Ah b,,,-s,-,1 (B-5) gtIJ’ _3iJZEAh ”his 8,1100:- l+ l h: 1 thl + “v 28:5 b,,,-s 5,,1og— till ((1 longJ) l h=1 —TZ (u’ij - SiJZU’il) (d 103 (11'3“) iJ’ t where we define 771+ . + . + u’ij = Z “\h a,,, ‘SiJ lOg Sij - Sij Z All ah, log thj h=1 _ m— ~WZ Ah bhi siJ- loor gsiJ- + siJ 2 Ah bh, log thj h=1 h: 1 (B6) m— : Z, A+ a,,,-— ZA, bk, 3,,- log 3,,- Izzl m— 771+ + . —s,—,~ ZAha,,,-1ogth.— ;Ahbl,,loghtj h=1 275 It is interesting to note that n k 2: Z z—ljzl n k m— :2 (::AEL a;us,leogs,J--— ZAJjth-SJ-J-logsJ-J i:lj=1 h=1 m+ m — Aza'hisij log {7;}- + Z AgbhiS-ij log :Ifj) (8'7) h=1 h=1 =23A 22% log: — -2: A 232% log— h=1i=1j=l thj — i=1j=1thj m+ m __ +— + — — ._ . _. Ah 12,502) — ZAh 12,502.) —f(6,C) [2:1 [2:1 Therefore, summing all 21..r,-J- provides a way to compute the loss function for constraint violation. 
We are now ready to write down the differential of j: n k I; (I J = ZZ<flj — T(wij —- Sij 21011)) (d log (Iij) (13.8) [:1 i=1j=1 B.1.2 Gradient Computation Since the only differentials in Equation (8.8) are (d log qu), the gradient of j can be obtained by converting these differentials into derivatives. Recall that ql-J- : (VJ-p(yz- I0). So, ______1 a J. :1 x21 Blogaj Ooqll (J ), 276 where I () is the indicator function, and is one if the argument is true and zero otherwise. To enforce the restriction that 01- > 0 and Zj aj = 1, we introduce new variables (33' and express 03' in terms of {3}}: exp(,BJ-) 013' = k I, . 2.1/:1 €Xp(,t3jl) We then have 0 ('9 . . exptfil) T—r—loga'zr— fl—log ex [j- =1 =l — ‘ 0W! J (Ml L] g p( 3’) (J ) Zjl GXPWJ-I) =IU=0-0z k , a 0100‘ q l 0100' (1771 , .._—7—10g_: bl {0 = E [[2711 17112 —O (Adj (Ill "1:1 810g a," a’flj m ( ) ( ( J) J) =Io=n—aj If p(yz-Wj) falls into the exponential family (Section 5.1.1), and 03- is the natural parameter, the derivative of log qt- j with respect to 61 can be written as 8 (9 aiqu=nr=00mo-5Ema0. chm Note that Qb(y.,-) -— Egg/4(6)!) is zero when the sufficient statistics of the observed data (represented by ¢(y,j)) equal to its expected value (represented by 5%A(91)). In this case, the convexity of A(Ol) guarantees that the log-likelihood is maximized. Before going into the special case of the Gaussian distribution, we want to note 277 that for any number cij, we have 0 . E“ Cij'.0_3l 1000(1ij: E_:Cij(1(l:])_al):§:Cil—alzczj i ij 2.7 2] (‘9 a , (9 %:Cij_‘66110g0(1ij: ;L~11567110gq.i1 = 21:0“ (((d)/i)“ 676—114(01)) , 8 = Z Cil‘pb’i) — 551/1091) :02! i i The gradient of J can be computed by substituting cij = fij — T(wij — sij 2L1 wig). B.1.3 Derivative for Gaussian distribution Consider the special case that p(yz-IGZ) is a Gaussian distribution. Based on Equa- tion (5.6), we can see that the natural parameters are T1 and V1, the sufficient statistics consist of yz- and —%yiy;-F, and the log-cumulant function A(61) is given by Equation (5.7). In this case, we have Buljz :0 CzlYi" “1:021 (B'll) 1 T l T 1 5%..7— — _§ ZCiIYiyi + (EHH‘I + 52!) 262'! (B-12) 2' 2' Note that the above computation implicitly assumes that Tl is symmetric. To ex- plicitly enforce the constraints that T1 is symmetric and positive definite, we can re-pararneterize by its Cholesky decomposition: r1 = F,F,T, (13.13) 278 Note, however, with this set. of parameters, the density is no longer in its natural form. The gradient with respect to V1 remains unchanged, and it is not hard to Show that 0 . T T 517110;)(113‘ = 1(J = 1) —Yiyi + mm + 231 Fl- (B-14) O—FZJ — Z 011Y1ygF1 + (#1H1T+ 21) F1 2 Cu (315) 1 1 Alternatively, the Gaussian distribution can be parameterized by the mean #3- and the precision matrix Tj as in Equation (5.5). Because (3 ——lU,--=I"=lT -— B.16 dill ()0 (1U (J ) [(3% M) ( ) 3 . 1 1 T 69—1? 10% (Iij = 10 =1) (2‘21 — g(yz' - #1)(yz' - #1) ) (B-17) (‘3 , T 513:! 10% <11} = 1(J =1) (St-(Y1 - mm - H1) )an (318) the corresponding gradient of ,7 is ———J= T y: — p ) (B19) 0H1 1122:0111d 1 1 g‘fij—z — 1212621 _ :26 011(3’ MDT (B'ZO) apl =(EIZ Cil ZiCzlmyiT)F1-u1) (B21) 279 B.2 Second Order Information The second-order information (Hessian) of the proposed objective function J can be derived in a manner similar to the first order information. We shall first compute the second-order differentials and then convert them to the Hessian matrix. Let d2 2: denote the second-order differential of the variable :13. 
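The gradients derived above are easy to validate numerically. The sketch below (Python/NumPy with SciPy, arbitrary test values, not the thesis code) checks the constraint-free special case of Equation (B.19), i.e., the gradient of the Gaussian-mixture log-likelihood with respect to the mean of one component, in which the weight c_il reduces to the usual posterior probability, against a central finite difference.

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_lik(X, alphas, mus, covs):
        """Gaussian-mixture log-likelihood: sum_i log sum_j alpha_j N(y_i | mu_j, Sigma_j)."""
        dens = np.column_stack([a * multivariate_normal.pdf(X, m, S)
                                for a, m, S in zip(alphas, mus, covs)])
        return np.log(dens.sum(axis=1)).sum()

    def grad_mu(X, alphas, mus, covs, l):
        """Analytic gradient of the log-likelihood w.r.t. mu_l:
        sum_i r_il * Sigma_l^{-1} (y_i - mu_l), with r_il the posterior of cluster l."""
        dens = np.column_stack([a * multivariate_normal.pdf(X, m, S)
                                for a, m, S in zip(alphas, mus, covs)])
        r = dens / dens.sum(axis=1, keepdims=True)
        return np.linalg.solve(covs[l], (r[:, l][:, None] * (X - mus[l])).sum(axis=0))

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = rng.normal(size=(50, 2))
        alphas = np.array([0.4, 0.6])
        mus = [np.array([0.5, -0.3]), np.array([-1.0, 0.8])]
        covs = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
        g = grad_mu(X, alphas, mus, covs, l=0)
        # Central finite-difference check of the first component of the gradient.
        eps = 1e-6
        mus_p = [mus[0] + np.array([eps, 0.0]), mus[1]]
        mus_m = [mus[0] - np.array([eps, 0.0]), mus[1]]
        fd = (log_lik(X, alphas, mus_p, covs) - log_lik(X, alphas, mus_m, covs)) / (2 * eps)
        print(g[0], fd)   # the two values should agree to several decimal places

The same check can be applied to the full objective once the constraint weights are included, which is a useful safeguard when implementing the more involved second-order expressions of the next section.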
B.2.1 Second-order Differential By taking the differential on both sides of Equation (B4), we have d2 Cammw(9;y,‘1') = 2W 7‘13“) (d 10g (113') + Zfztj (d2 109; (113') - (B22) 13' 13' To compute d 731-, we take the differentials of the logarithm of both sides of Equa- tion (B3): k d log fij = d log (13 — longg [=1 1 k = '7' d log (1,-j — —7€——§— Z: q?! d log q?! (B23) 1’21 “111’ 1:1 k = 7 d 103011 — 27711 d 10% (111 1:1 Because of the identity that d .T = a: d logs: for :1? > O, we have (1 fij = fij d log 7:1} (13.24) 280 Substituting Equation (B24) into Equation (B22), we have ([2 fiannealedw; y, ,7) = :71]. ((12 logqij) + V’Zfij(d log (jij)(d logqij) 1} iJ' -VZZf11d 1030112f1j(11<1g(11j i 1 j = Z 73-1 (d2 10g (111') + v 235,, — 1.1111111 log (1.3-11d log (11). z] ijl (B25) Here, (SJ-l is the delta function, and it is one if j = l and zero otherwise. The definition of 5i] in Equation (5.12) implies the following: (1 82'} = 31'] d log 513 (8.26) k. d log sij = 'r d log (1,-j — Z 8,) d log (1,-1 (B27) (=1 Note the similarity between the definitions of sij and 17,-]. Because for any 2', Zflwij — 5%,-j :1 111,1) 2 0 and Zj d 3,-1- = d Zj sij = 0, Equation (BS) can be rewritten as (1 f(9,C) = —TZZ(U1U -— Sij Zwil)(d log (11]) i J' l = —T: u “U U U Here, 72 (Zj (11);]- -— sij ::le 11.1”)(61-1, — 3iu)(5jv — and) is the (2', i)-th element of the n. by 11 diagonal matrix Em. Let ahju. denote a vector of length n such that its i-th e11- try is given by Tahz‘5ij(5ju — Siul Because d t3]. 2: 2,- “hi (1 5,-3' = Z,- Ohisij d log 3,], we have (NI-:7- 810g S‘i] , = Zahisz‘je— — TZahiSijwju_3zu)1/Jiu: ‘I’uahju (991, 2' 06a . + (31,]. 811311 = T Z ahisijwju - Sin) 2 11.nahju i This means that + T m k A+ 01+ &+ 227: 37: if ‘ 11:21,!“ + :ahJuath ‘1’; h=1j=1th h=1j=1hj T = quuL+Av \Ilf, where we concatenate different ahju to form a n by km+ matrix Au, defined by A“ : [31,1411 32.1,“? ' ’ ° ’am+,1,u’ 21112911” ’ ' ’am+,2,u’ ° ' ‘ ’ alakdt’ ' ' ' ’am‘l’Jcml' Note that An has similar sparsity pattern as the matrix {am}. The diagonal matrix /\+ L+ is of si7e km+ by Am..+ Its diagonal entries are given by—— ,and the ordering of hi these diagonal entries matches the ordering of ahju in Au. By similar reasoning, we 287 have "1+ k + + T )‘h . 8th. J _ + T T Z Z— t+ (Olga) (39v) — 11,71AUL Av ‘I’v h=1j=1hJ m+ k + )‘h 8th J _ + T T $251; (33”) (83:) _ anAuL Avllfl h=1j=1hj The case for tI—fj’ which corresponds to to must-not—link constraints, is similar. So, we define bhju to consist of Tl);,,-.9,-j(6ju —— sin) for different i, and concatenate bhju A- to form B“. L‘ is a diagonal matrix with entries {in Substituting all these into the hj result derived in Equation (B29), we have 0 0 d—6u~ ; UJU 59; log 3,-1- 02 8 T :Tguijd—ifludf): log (111+; —siij,-l)(a ulog 35])(0—9U10g 8,3) 0 a gutt) —t )(— a —t )T —§Zi\% _u'thj)( — +2: Jean M 09v M h: 1 j - ~ 02 log - _ = 701w 2 JLJJV,J————-a:2:“ + w,,E,,vw{ — \IJUA,,L+A{\IJ{ + \IIuBuL 33,"qu 2' ll. _ ‘~. 02 log (1m. + T _ T T — 7611.1) “"211—363— + ‘I'U Euv — AUL Av + BUL Bv ‘I’v 'lt Similarly, we have r) . — V 618211 2 100061? 10g Sij 11,71 (Euv AUL+AU + BU BU ) ‘I’IJ; . ij . I; 288 (9 i) (9210?,qu 071;“le ””2?” 0131thqu 11,n (E1111 '_ AUL+A$ + BULTBS) 1in Let HC denote the “expected” hessian of the complete data log-likelihood due to the constraints, i.e., H0 = b1k-diag(0, T Z alum-1,. ..,7- : w,,,H,-,,.). (13.33) 1' 2' Note that there are no Hessian terms corresponding to the 53- because Zj 1172']- = 0. Let E be a 71k by 711; matrix. We partition it into k. 
by k blocks, such that the (11,v)-th block is EW. Let A be a 11k by km+ matrix and B be a 71k by km- matrix, such that L- a- L _1 we are now ready to state the Hessian term corresponding to the constraints: 771+ m {:33ng )+Z,\;D;S(h) [1:1 2 _HC — AEAT + AAL+ATAT — ABL‘BTAT (3.34) Note that the sum of each of the columns of A is 0, because Em Tron-SUM)” — 3,“) = Z,- Tahisij 21,051,, — Sin) 2 0. Combine Equation (B34) with Equation (8.32), we 289 have the Hessian of the objective function j in matrix form: ._2 ~ ~ .35 = HL — H6 + ADAT — AEAT + AAL+ ATAT — ABL“BTAT W (3%) = H“ + A (D — E + AL+AT — BL—BT) AT. Here, H56 = H5 — H6 is the combined expected Hessian. B.2.3 Hessian of the Gaussian Probability Density Function Computation of HLZC requires Hij, which is the result of differentiating log p(yilélj) with respect to the parameter 03- twice. We shall derive the explicit form of Hij when log p(yil6j) is the Gaussian pdf. For simplicity, we shall omit the reference to the object index i and the cluster index j in our derivation. We shall need some notations in matrix calculus [179] in our derivation. Let vec X - denote a. vector of length pq formed by stacking the columns of a p by q matrix X. Let Y be a 'r by .9 matrix. The Kronecker product X®Y is a pr by qs matrix defined by _ - :EllY :13ng . . . (13qu :5le :rggY . . . :rqu X®Y= . . (3%) prY . . . 1:qu The precedence of the operator 69 is defined to be lower than matrix multiplication, i.e., XY 6'9 Z is the same as (XY) 63> Z. The following identity is used frequently in 290 'A this section: vec(XYZ) = (zT <59 X) vec Y. (13.37) Let Kd denote a permutation matrix of size (12 by d2, such that Kd vec Z : vec ZT, (B38) where Z is a d by (1 matrix. Note that K5 2 K51 2 Kd. B.2.3.1 Natural Parameter When the density is parameterized by its natural parameter as in Equation (5.6), we have 0 1 _ U, ___ __ __ -l —T auloopm y 2 (T +r )u 3 , _ 1 T 1 —T 1 —:r T —:r 8T10°p(y) — 2yy + 2T + 2T 1111 T 5% log p(y) = —nyF + F“T + T—TVVTT‘TF, where T-T denotes the transpose of T—l. Therefore, 82 517 10%]’(Y) = — 02 Ovec T 01/ 10gp(y) 291 02 ———— (ri' = T —1 "A _T T —1 —T dvecF 0111001)“) F T @T V+F T ”(8T = (F_1®E)(Id®u+l’®ld) The last term in the Hessian matrix requires more work. We first take the differential with respect to T: 1 d 5%.— logp(y) = —§T'T(d YT)T_T 1 1 — ér'TuuTrTw TT)‘I“‘T — Ear-Tm TT)T_TVVTT'T. By using the identity in Equation (8.37), the Hessian term can be obtained as 82 _ 1 —1 —T _ g (r—1 69 T—TVVTT—T) Kd — é- (T—IVVTT—l ® T—T) Kd Similarly, the Hessian term corresponding to F can be obtained if we note that a (1 5? log p(y) = —ny(d F) — F‘T(d FT)F‘T — T—TuuTF_T(d FT)F"‘T — T‘T(F(d FT) +(dF)FT)T_T1/VTF‘T 02 M 10$ p(y) = -Id 9’9 ny — (F"1 8) F4) Kd - (F"1 <8 MMTF) Kd —— (FTppT 69 FT) Kd -— FTupTF a 2 292 In the special case that T is always symmetric, we can have a simpler Hessian term. This amounts to assuming that TT 2 T and (alT)T 2 ((1T). We have 82 1 1 T 1 T _l 0‘ r: ——2 )l 2 — _2 _ — 2 0(vec r)2 GDP”) 2 g 2 QM“ 2”“ ® 1 2 _5 ((23 + muT) e (23 + ##T) - (n ® (QUIT ® HTl) B.2.3.2 Moment Parameter When moment parameter is used as in Equation (5.6) for the density, we have a 1 T alosp(y)—§(T+T )(y—u) a ., _ 1 T 1 _T filosmyb 2(y (My u) +2T 8 Ef loamy) = —(y - u)(y - MTF + F‘T The second—order terms include a? _ 1 :r W103P()’)——§ (TWLT ) a? 1m( )—1 I I aveCTap oopy —2(d®(y-u)+(y-u)® d) 1 = 5(1d2 + Kd) (Id ® (3' - Ill) = (FT e Idiadg + K.» (I. e (y — u» 293 As in the case of natural parameter, we have a 1 —T T —T +1 a! 
= _— (10 ogp(y) 2r ((lT )T a? 1 _1 _T ___—__1 or = __ ., 0(vec r)? 0" pm 2 (T 3 T )Kd a _ ._ d 55103210!) = —F T <92 . . _ T —1 -—T WIOOMY) - ‘Id ’3) (Y — u)(y — H) — (F ‘59 F )Kd If we assume both T and d T are always symmetric, we have (92 1 W lOgP(Y) = "—"'2 ® 2 2 294 BIBLIOGRAPHY 295 Bibliography ill [2] [3] [4] l5] [6] [7] [8] [9} [10] S. Agarwal, J. Lim, L. Zelnik—Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages II—838—II—845. IEEE Com- puter Society, 2005. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM SIGM OD International Conference on Management of Data, pages 94—- 105, June 1998. P. Arabie and L. Hubert. Cluster analysis in marketing research. In Advanced Methods of Marketing Research, pages 160—189. Oxford: Blackwell, 1994. L.D. Baker and AK. McCallum. Distributional clustering of words for text classification. In Proc. SICIR-98, 215t ACM International Conference on Re- search and Development in Information Retrieval, pages 96—103. ACM Press, New York, US, 1998. CH. Bakir, J. Weston, and B. Schoelkopf. Learning to find pre-images. In Advances in Neural Information Processing Systems 16. MIT Press, 2004. P. Baldi and G.W. Hatfield. DNA Microar'rays and Gene Expression. Cambridge University Press, 2002. P. Baldi and K. Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 2:53—58, 1989. CH. Ball and DJ. Hall. Isodata, a novel method of data analysis and pattern classification. Technical Report NTIS AD 699616, Stanford Research Institute, Stanford, CA, 1965. A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergence. In Proc. SIAM International Conference on Data Mining, pages 234—245, 2004. N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1-3):89—113, 2004. 296 1111 [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] J. Bartram. A letter from John Bartram, MD. to Peter Collinson, F.R.S. concerning a cluster of small teeth observed by him at the root of each fang or great tooth in the head of a rattle-snake, upon dissecting it. Philosophical Transactions (1683—1775), 411358-359, 1739—1741. S. Basu, A. Banerjee, and R.J. Mooney. Active semi-supervision for pairwise constrained clustering. In Proc. the SIAM International Conference on Data Mining, pages 333—344, 2004. S. Basu, A. Banerjee, and R.J. Mooney. Semi-supervised clustering by seed- ing. In Proc. 19th International Conference on Machine Learning, pages 19—26, 2005. S. Basu. M. Bilenko, and R.J. Mooney. A probabilistic framework for semi— supervised clustering. In Proc. 10th ACM SICKDD, International Conference on Knowledge Discovery and Data Mining, pages 59—68, 2004. R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537~550, July 1994. M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373—1396, June 2003. Y. Bengio, J .-F . Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. In Advances in Neural In- formation Processing Systems 16, pages 177—184. MIT Press, 2004. M. Bernstein, V. de Silva, J. Langford, and J. Tenenbaum. 
Graph approxi- mations to geodesics on embedded manifolds. Technical report, Department of Psychology, Stanford University, 2000. A. Beygelzimer, SM. Kakade, and J. Langford. Cover trees for nearest neighbor. Technical report, 2005. http://www. cis.upenn.edu/~skakade/papers/ml/ cover_tree.pdf. SK. Bhatia and J .S. Deogun. Conceptual clustering in information retrieval. IEEE Transactions on Systems. Man and Cybernetics, Part B, 28(3):427—436, June 1998. . M. Bilenko, S. Basu, and R.J. Mooney. Integrating constraints and metric learn- ing in semi-supervised clustering. In Proc. 213t International Conference on Machine Learning, 2004. http://doi.acm.org/10. 1145/1015330. 1015360. J. Bins and B. Draper. Feature selection from huge feature sets. In Proc. 8th IEEE International Conference on Computer Vision, pages 159465, 2001. C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1995. 297 [24] CM. Bishop, M. Svensen, and C.K.I. Williams. GTM: the generative topo- graphic mapping. Neural Computation, 10:215-234, 1998. [25] G. Biswas, R. Dubes, and AK. Jain. Evaluation of projection algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3:702—708, 1981. [26] AL. Blum and P. Langley. Selection of relevant features and examples in ma- chine learning. Artificial Intelligence, 97(1-2):245—271, 1997. [27] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [28] P. Bradley, U. Fayyad, and C. Reina. Clustering very large database using EM mixture models. In Proc. 15th International Conference on Pattern Recognition, pages 76—80, September 2000. [29] M. Brand. Charting a manifold. In Advances in Neural Information Processing Systems 15, pages 961—968. MIT Press, 2003. [30] M. Brand. Fast online‘SVD revisions for lightweight recommender systems. In Proc. SIAM International Conference on Data Mining, 2003. http://www. siam.org/meetings/sdmO3/proceedings/sdm03_04.pdf. [31] M. Brand. Nonlinear dimensionality reduction by kernel eigenmaps. In Proc. 18th International Joint Conference on Artificial Intelligence, pages 547-552, August 2003. [32] A. Brun, H-J. Park, H. Knutsson, and Carl-Fredrik Westin. Coloring of DT- MRI fiber traces using Laplacian eigenmaps. In Proc. the Ninth International Conference on Computer Aided Systems Theory, volume 2809, February 2003. [33] J. Bruske and G. Sommer. Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 20(5):572—575, 1998. [34] C.J.C. Burges. Geometric methods for feature extraction and dimensional re- duction. In L. Rokach and O. Maimon, editors, Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, 2005. [35] F. Camastra and A. Vinciarelli. Estimating the intrinsic dimension of data with a fractal-based method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(10):1404—1407, October 2002. [36] R. Caruana and D. Freitag. Greedy attribute selection. In Proc. 11th Inter- national Conferenee on Machine Learning, pages 2836. Morgan Kaufmann, 1994. 298 [37] G. Celeux, S. Chrétien, F. Forbes, and A. Mkhadri. A component-wise EM algorithm for mixtures. Journal of Computational and Graphical Statistics, 10:699—712, 2001. [38] Y. Chang, C. H11, and M. Matthew Turk. Probabilistic expression analysis on manifolds. In Proc. 
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 520—527, 2004. [39] A. Chaturvedi and J .D. Carroll. A feature-based approach to market segmen- tation via overlapping k-centroids clustering. Journal of Marketing Research, 34(3):370—377, August 1997. [40] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:790—799, 1995. [41] Y. Cheng and GM. Church. Biclustering of expression data. In Proc. of the Eighth International Conference on Intelligent Systems for Molecular Biology, 2000. [42] F. Chung. Spectral Graph Theory. American Mathematical Society, 1997. [43] Forrest E. Clements. Use of cluster analysis with anthropological data. Amer- ican Anthropologist, New Series, Part 1, 56(2):180—199, April 1954. [44] D. Comaniciu. An algorithm for data-driven bandwidth selection. IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 25(2):281—288, 2003. [45] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603—619, 2002. [46] TH. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. MIT Press, 1990. [47] J. Costa and AC. Hero. Manifold learning using euclidean k-nearest neighbor graphs. In Proc. IEEE International Conference on Acoustic Speech and Signal Processing, volume 3, pages 988—991, Montreal, 2004. [48] DR. Cox. Note on grouping. Journal of the American Statistical Association, 52(280):543—547, December 1957. [49] T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman & Hall, 2001. [50] M. Craven, D. DiPasquo, D. Freitag, A.K. McCallum, T.M. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118(1/2):69—113, 2000. [51] M. Dash and H. Liu. Feature selection for clustering. In Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining 2000, 2000. 299 [5‘2] [53] [57] [58] [59] [60] [61] 1621 [63] A. d’Aspremont, L. E1 Ghaoui, MI. Jordan, and G.R.G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In Advances in Neural Information Processing Systems 17. MIT Press, 2005. D. de Ridder, O. Kouropteva, O. Okun, M. Pietikinen, and R.P.W. Duin. Super- vised locally linear embedding. In Proc. Artificial Neural Networks and Neural Information Processing, pages 333—341. Springer, 2003. D. de Ridder, M. Loog, and M.J.T. Reinders. Local fisher embedding. In Proc. 17th International Conference on Pattern Recognition, pages II—295——II— 298, 2004. V. de Silva and J.B. Tenenbaum. Global versus local approaches to nonlin- ear dimensionality reduction. In Advances in Neural Information Processing Systems 15, pages 705—712. MIT Press, 2003. D. DeCoste. Visualizing Mercer kernel feature spaces via kernelized locally- linear embeddings. In Proc. 8th International Conference on Neural Informa- tion Processing, November 2001. Available at http://www. cse . cuhk.edu.hk/ ~apnna/proceedings/iconip2001/index.htm. D. DeMers and G. Cottrell. Non—linear dimensionality reduction. In Advances in Neural Information Processing Systems 5, pages 580—587. Morgan Kaufmann, 1993. AP. Dempster, N.M. Laird, and DB. Rubin. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 3921—38, 1977. M. Dettling and P. Biihlmann. Finding predictive gene groups from microarray data. 
Journal of Multivariate Analysis, 90:106—131, 2004. M. Devaney and A. Ram. Efficient feature selection in conceptual clustering. In Proc. 14th International Conference on Machine Learning, pages 92—97. Morgan Kaufmann, 1997. LS. Dhillon. Co—clustering documents and words using bipartite spectral graph partitioning. In Knowledge Discovery and Data Mining, pages 269—274, 2001. IS. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic fea- ture clustering algorithm for text classification. Journal of Machine Learning Research, 321265—1287, March 2003. LS. Dhillon, S. Mallela, and D.S. Modha. Information-theoretic co-clustering. In Proc. of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 89—98, 2003. 300 [64] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] D. Donoho. For most large underdetermined systems of linear equations, the minimal 1 1-norm near-solution approximates the sparsest near-solution. Tech- nical report, Department of Statistics, Stanford University, 2004. D. Donoho. For most large underdetermined systems of linear equations, the minimal 1 1-norm solution is also the sparsest solution. Technical report, De- partment of Statistics, Stanford University, 2004. D.L. Donoho and C. Grimes. When does isomap recover natural parameteriza- tion of families of articulated images? Technical Report 2002-27, Department of Statistics, Stanford University, August 2002. D.L. Donoho and C. Grimes. Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Technical Report TR-2003-08, Depart- ment of Statistics, Stanford University, 2003. R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, New York, 2nd edition, 2001. J .G. Dy and CE. Brodley. Feature subset selection and order identification for unsupervised learning. In Proc. 17th International Conference on. Machine Learning, pages 247—-254. Morgan Kaufmann, 2000. J .G. Dy and CE. Brodley. Feature selection for unsupervised learning. Journal of Machine Learning Research, 5:845—889, August 2004. J .G. Dy, C.E. Brodley, A. Kak, L.S. Broderick, and A.M. Aisen. Unsupervised feature selection applied to content-based retrieval of lung images. IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 25(3):373—378, March 2003. B. Efron, T. Hastie, I.M. Johnstone, and R. Tibshirani. Least angle regression (with discussion). Annals of Statistics, 32(2):407—499, 2004. R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning. In Advances in Neural Information Processing Systems 14, pages 1025-1032. MIT Press, 2002. A. Elgannnal and CS. Lee. Inferring 3D body pose from silhouettes using activity manifold learning. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 681—688, 2004. A. Elgammal and CS. Lee. Separating style and content on a nonlinear man- ifold. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 478489, 2004. D.M. Endres and J .E. Schindelin. A new metric for probability distributions. IEEE Transactions on. Information Theory, 491858—1860, September 2003. 301 [77] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining, pages 226—231, 1996. [78] M. Farmer and A. Jain. 