DATA CLUSTERING WITH PAIRWISE CONSTRAINTS

By

Jinfeng Yi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2014

ABSTRACT

DATA CLUSTERING WITH PAIRWISE CONSTRAINTS

By Jinfeng Yi

Classical unsupervised clustering is an ill-posed problem due to the absence of a unique clustering criterion. This issue can be addressed by introducing additional supervised information, usually cast in the form of pairwise constraints, into the clustering procedure. Depending on their sources, most pairwise constraints can be classified into two categories: (i) pairwise constraints collected from a set of non-expert crowd workers, which leads to the problem of crowdclustering, and (ii) pairwise constraints collected from an oracle or from experts, which leads to the problem of semi-supervised clustering. In both cases, collecting pairwise constraints can be expensive, so it is important to identify the minimal number of pairwise constraints needed to accurately recover the underlying true data partition, a question known as the sample complexity problem.

In this thesis, we first analyze the sample complexity of crowdclustering. To this end, we propose a novel crowdclustering approach based on the theory of matrix completion. Unlike the existing crowdclustering algorithm, which is based on a Bayesian generative model, the proposed approach is more desirable since it needs far fewer crowdsourced pairwise annotations to accurately cluster all the objects. Our theoretical analysis shows that in order to accurately cluster N objects, only O(N log² N) randomly sampled pairs need to be annotated by crowd workers. To further reduce the sample complexity, we then introduce a semi-crowdsourced clustering framework that is able to effectively incorporate the low-level features of the objects to be clustered. In this framework, we only need to sample a subset of n << N objects and generate their pairwise constraints via crowdsourcing. After completing an n x n similarity matrix using the proposed crowdclustering algorithm, we can further recover an N x N similarity matrix by applying a regression-based distance metric learning algorithm to the completed smaller similarity matrix. This enables us to reliably cluster N objects with only O(n log² n) crowdsourced pairwise constraints.

Next, we study the problem of sample complexity in semi-supervised clustering. To this end, we propose a novel convex semi-supervised clustering approach based on the theory of matrix completion. In order to reduce the number of pairwise constraints needed, we adopt a natural assumption that the feature representations of the objects are able to reflect the similarities between objects. This enables us to utilize only O(log N) pairwise constraints to perfectly recover the data partition of N objects.

Lastly, in addition to sample complexity, which relates to labeling costs, we also consider the computational costs of semi-supervised clustering. Specifically, we study the problem of efficiently updating clustering results when the pairwise constraints are generated sequentially, a common case in various real-world applications such as social networks. To address this issue, we develop a dynamic semi-supervised clustering algorithm that casts the clustering problem into a search problem in a feasible convex space, i.e., a convex hull with its extreme points being an ensemble of multiple data partitions.
Unlike classical semi-supervised clustering algorithms that need to re-optimize their objective functions when new pairwise constraints are generated, the proposed method only needs to update a low-dimensional vector, and its time complexity is independent of the number of data points to be clustered. This enables us to update large-scale clustering results in an extremely efficient way.

To my family.

ACKNOWLEDGMENTS

First of all, I want to express my sincere gratitude to my thesis advisor, Prof. Rong Jin, for his continuous support, encouragement, and help during my Ph.D. study. Five years ago, I was merely an undergraduate student who knew almost nothing about research. It was his help and advice that opened the door to the world of scientific research for me. Throughout the years, he has always inspired me to tackle interesting research problems and encouraged me to work on any problems that interest me. From him, I have learned a lot not only about modeling techniques, but also about how to formalize and solve a problem by myself. More importantly, his passion for research has always motivated me and helped me understand how enjoyable it is to learn and to do research. Without Prof. Jin, this thesis would not have been possible.

I also want to express my gratitude to Prof. Anil K. Jain. We worked together on multiple interesting research problems in the areas of crowdsourcing and data clustering, which account for a significant portion of this thesis. I really appreciate him for pointing me in the right directions, asking me good questions, and encouraging me. I have learned a great deal from him about how to find a valuable research problem and how to write a convincing technical paper. In addition, I have greatly benefited from his insightful career advice.

I would also like to thank my committee members, Dr. Joyce Chai, Dr. Pang-Ning Tan, and Dr. Selin Aviyente, who offered much useful advice and many suggestions for this thesis.

I am also fortunate enough to have been mentored by many good researchers during internships. I did my first summer internship at Stroz Friedberg LLC, which was also my first industrial experience. I am grateful to my manager, Dr. Michael Sperling, for guiding me to conduct project-oriented research. This was a great learning experience for me. I also spent a wonderful summer at IBM Thomas J. Watson Research Center, mentored by Dr. Jun Wang and Dr. Aleksandra Mojsilovic. I enjoyed the discussions with them on research problems, and I want to thank both of them for making my last summer memorable and productive.

In the last several years, I have greatly enjoyed the collaborations and friendships with my labmates: Tianbao Yang, Wei Tong, Lijun Zhang, Mehrdad Mahdavi, Yang Zhou, Qi Qian, and many others. I wish you all the best in your future careers. I also appreciate the help provided by the staff at the CSE department of MSU, especially Linda Moore and Norma Teague.

Finally and most importantly, I would like to thank my family for their unconditional love and constant encouragement. To them I owe a lot, and words cannot describe my love for them.

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . ix
LIST OF FIGURES . . . . . . . . . . x
Chapter 1 Introduction . . . . . . . . . . 1
1.1 Crowdclustering . . . . . . . . . . 4
1.2 Semi-supervised Clustering . . . . . . . . . . 5
1.3 Main Contributions . . . . . . . . . . 7
Chapter 2 Literature Survey . . . . . . . . . . 12
2.1 Crowdclustering . . . . . . . . . . 12
2.2 Ensemble Clustering . . . . . . . . . . 14
2.3 Semi-supervised Clustering . . . . . . . . . . 15
2.3.1 Constrained Clustering . . . . . . . . . . 15
2.3.2 Distance Metric Learning . . . . . . . . . . 17
2.3.3 Dynamic Clustering . . . . . . . . . . 18
2.4 Evaluation Metrics . . . . . . . . . . 19
2.5 Matrix Completion . . . . . . . . . . 21
2.5.1 Matrix Completion with Noiseless Entries . . . . . . . . . . 22
2.5.2 Matrix Completion with Noisy Entries . . . . . . . . . . 23
Chapter 3 Crowdclustering with Sparse Pairwise Labels . . . . . . . . . . 25
3.1 Introduction . . . . . . . . . . 26
3.2 Crowdclustering by Matrix Completion . . . . . . . . . . 28
3.2.1 Filtering Entries with Unlabeled and Uncertain Data Pairs . . . . . . . . . . 29
3.2.2 Completing the Partially Observed Matrix . . . . . . . . . . 31
3.2.3 Theoretical Analysis . . . . . . . . . . 33
3.3 Experiments . . . . . . . . . . 35
3.3.1 Data Sets . . . . . . . . . . 36
3.3.2 Baseline and Parameter Selection . . . . . . . . . . 38
3.3.3 Experimental results with full annotations . . . . . . . . . . 39
3.3.4 Experimental results with sampled annotations . . . . . . . . . . 42
3.4 Conclusion and Discussion . . . . . . . . . . 43
Chapter 4 Semi-Crowdsourced Clustering by Distance Metric Learning . . . . . . . . . . 45
4.1 Introduction . . . . . . . . . . 46
4.2 Semi-Crowdsourced Clustering by Robust Distance Metric Learning . . . . . . . . . . 49
4.2.1 Problem Definition and Framework . . . . . . . . . . 50
4.2.2 Learning a Distance Metric from a Set of Noisy Similarity Matrices . . . . . . . . . . 51
4.2.3 Theoretical Analysis . . . . . . . . . . 54
4.3 Experiments . . . . . . . . . . 56
4.3.1 Data Sets . . . . . . . . . . 57
4.3.2 Baselines . . . . . . . . . . 58
4.3.3 Parameter Selection and Evaluation . . . . . . . . . . 59
4.3.4 Experimental Results . . . . . . . . . . 59
4.4 Conclusions . . . . . . . . . . 62
Chapter 5 Semi-supervised Clustering by Input Pattern Assisted Pairwise Similarity Matrix Completion . . . . . . . . . . 64
5.1 Introduction . . . . . . . . . . 65
5.2 Semi-supervised Clustering by Input Pattern Assisted Matrix Completion . . . . . . . . . . 66
5.2.1 A Matrix Completion Framework for Semi-supervised Clustering . . . . . . . . . . 67
5.2.2 Input Pattern Assisted Matrix Completion . . . . . . . . . . 68
5.3 Theoretical Analysis . . . . . . . . . . 74
5.4 Experiments . . . . . . . . . . 79
5.4.1 Baselines and Parameter Settings . . . . . . . . . . 79
5.4.2 Experiment with Synthesized Data . . . . . . . . . . 80
5.4.3 Experiment with Benchmark Datasets . . . . . . . . . . 82
5.5 Conclusion and Discussion . . . . . . . . . . 84
Chapter 6 A Constant-Time Algorithm for Dynamic Semi-Supervised Clustering . . . . . . . . . . 87
6.1 Introduction . . . . . . . . . . 88
6.2 Semi-supervised Clustering with Dynamic Constraints . . . . . . . . . . 91
6.2.1 Semi-supervised clustering . . . . . . . . . . 91
6.2.2 A Constant Time Algorithm for Dynamic Semi-supervised Clustering . . . . . . . . . . 94
6.2.3 Offline Step: Ensemble Clustering . . . . . . . . . . 97
6.2.4 Online Step: Efficient Updating Simplex Vector . . . . . . . . . . 100
6.3 Experiments . . . . . . . . . . 105
6.3.1 Experimental Setup . . . . . . . . . . 106
6.3.2 Parameter Sensitivity of m . . . . . . . . . . 108
6.3.3 Comparison with Other Baseline Algorithms . . . . . . . . . . 110
6.4 Conclusions . . . . . . . . . . 112
Chapter 7 Conclusions and Future Directions . . . . . . . . . . 114
7.1 Summary of Main Results . . . . . . . . . . 114
7.2 Future Directions . . . . . . . . . . 117
APPENDIX . . . . . . . . . . 119
BIBLIOGRAPHY . . . . . . . . . . 125

LIST OF TABLES

Table 3.1 Clustering performance and running time of the proposed algorithm (i.e., matrix completion) and the baseline algorithm (i.e., Bayesian method) on four data sets . . . . . . . . . . 40
Table 3.2 Performance of the proposed clustering algorithm as a function of different threshold values, and the percentage of 1 entries in the matrix Ã that are consistent with the cluster assignments, for the Scenes data set . . . . . . . . . . 42
Table 4.1 CPU time (in s) for learning the distance metrics . . . . . . . . . . 62
Table 5.1 Running time (in s) for recovering synthetic data of different sizes . . . . . . . . . . 82
Table 5.2 Description of Datasets . . . . . . . . . . 83
Table 5.3 Average clustering performance of the proposed semi-supervised clustering algorithm (MCCC) and the baseline algorithms (Base, MPCKmeans (MPCK) [12], CCSKL [88], PMMC [144], DCA [63], LMNN [130], and ITML [36]) on nine datasets with 2,000, 4,000 and 6,000 randomly sampled pairwise constraints (PCs) . . . . . . . . . . 86
Table 6.1 Average CPU time (in s) for updating the partition in each tier. (N/A means that the clustering task cannot be accomplished by the algorithms within 5 hours.) . . . . . . . . . . 111

LIST OF FIGURES

Figure 1.1 A clustering of 4 face images into 2 clusters. It is possible to cluster these face images in two ways, with both of them equally valid. They can be clustered based on gender as shown in (b), or clustered based on age as shown in (c). Without further information from the user, it is not possible to determine the correct partition . . . . . . . . . . 2
Figure 3.1 Some sample images from the 13 categories in the Scenes data set . . . . . . . . . . 36
Figure 3.2 Some sample images from the three categories in the Tattoo data set . . . . . . . . . . 36
Figure 3.3 Some sample images from the seven categories in the ImageNet data set . . . . . . . . . . 37
Figure 3.4 Some sample images from the five categories in the PASCAL 07 data set . . . . . . . . . . 37
Figure 3.5 Sample image pairs that are grouped into the same cluster by more than 50% of the workers but are assigned to different clusters according to the ground truth . . . . . . . . . . 41
Figure 3.6 NMI values as a function of number of workers and percentage of annotations for four data sets . . . . . . . . . . 43
Figure 4.1 The proposed framework for semi-crowdsourced clustering. The given N objects (o1, o2, ..., oN) need to be clustered, but only a small subset of the N objects (o1, o2, ..., on), n << N, have been annotated by crowdsourcing . . . . . . . . . . 47
Figure 4.2 The proposed framework of learning a distance metric from noisy manual annotations . . . . . . . . . . 49
Figure 4.3 NMI vs. no. of sampled images (n) used in crowdlabeling . . . . . . . . . . 60
Figure 4.4 Sample image pairs that are incorrectly clustered by the Base method but correctly clustered by the proposed method (the similarity of our method is based on the normalized distance metric Ms) . . . . . . . . . . 61
Figure 5.1 The plot of the smallest number of pairwise constraints (PCs) needed for perfect recovery. The correlation coefficient computed by the linear fit is 0.992, indicating a linear dependence of sample complexity on log n . . . . . . . . . . 81
Figure 6.1 The offline step: generate a convex hull using the technique of ensemble clustering . . . . . . . . . . 99
Figure 6.2 The online step: efficiently update clustering results when new pairwise constraints are generated . . . . . . . . . . 105
Figure 6.3 NMI vs. the number of ensemble partitions m = {10, 20, 30, 40, 50} with different numbers of pairwise constraints . . . . . . . . . . 109
Figure 6.4 Average clustering performance of the proposed dynamic semi-supervised clustering algorithm (CCCS) and the baseline algorithms (MPCKmeans (MPCK) [12], CCSKL [88], PMMC [144], RCA [6], DCA [63], LMNN [130], and ITML [36]) from tier t1 to t10 on nine datasets . . . . . . . . . . 113

Chapter 1

Introduction

Data clustering is a method of grouping similar objects together. Given a representation of N objects, the goal of data clustering is to find r groups based on a measure of similarity, such that objects within the same group are alike while objects in different groups are not alike [71]. Data clustering is an important problem that has found numerous applications in different domains, including computer vision [55], information retrieval [11, 91], bioinformatics [85, 140], recommender systems [83], etc.
Generally speaking, data clustering is considered an unsupervised learning technique in which the input data contain only the data points themselves, without any additional information. This unsupervised nature makes data clustering an ill-posed problem, since data can usually be partitioned in many equally valid ways depending on the user's intent and goal. Consider a simple clustering task that groups 4 face images into 2 clusters, as shown in Figure 1.1. The 4 face images are taken from a young man, a young woman, an old man, and an old woman, respectively. These images can be clustered either based on the gender of the people, as shown in Figure 1.1(b), or based on the age of the people, as shown in Figure 1.1(c). Both of these partitions are equally valid, implying that it is not possible to determine the correct partition without further information. Such information, usually cast in the form of pairwise constraints, can be used to tune the clustering algorithm towards finding the data partition sought by the user.

[Figure 1.1: A clustering of 4 face images into 2 clusters: (a) the 4 face images to be grouped into 2 clusters, (b) a partition of the face images based on gender, and (c) a partition based on age. Both partitions are equally valid; without further information from the user, it is not possible to determine the correct one.]

Pairwise constraints were introduced by Wagstaff et al. [123] to specify the relationship between the class assignments of two objects, also known as must-link and cannot-link information. A must-link constraint between two data points implies that these two data points tend to be assigned to the same cluster, while a cannot-link constraint between two data points implies that they tend to be assigned to different clusters. These links thus provide constraints on the clustering results. Must-link and cannot-link constraints have some interesting properties. Must-link constraints form an equivalence relation and hence are symmetric, reflexive, and transitive. For example, if object x and object y are connected by a must-link, and object y and object z are connected by a must-link, then object x and object z should also be connected by a must-link. Similarly, must-link constraints can give rise to cannot-link constraints. For example, if x and y are connected by a must-link and y and z are connected by a cannot-link, then x and z should also be connected by a cannot-link. Despite their simple definition, must-link and cannot-link constraints are powerful and widely used for facilitating data clustering [32, 124].
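The propagation rules above are straightforward to operationalize. The following minimal sketch (our own illustration, not part of the proposed algorithms; class and method names are hypothetical) maintains must-link groups with a union-find structure and derives the implied cannot-link relations between groups.

```python
# A minimal sketch of must-link / cannot-link propagation (illustrative only).
# Must-links are merged via union-find; cannot-links are checked against the
# resulting groups, which also exposes infeasible constraint sets.
class ConstraintStore:
    def __init__(self, n):
        self.parent = list(range(n))   # union-find parents for must-link groups
        self.cannot = []               # cannot-link pairs, kept as original indices

    def find(self, i):
        while self.parent[i] != i:     # path halving
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def add_must_link(self, i, j):
        for a, b in self.cannot:       # a must-link may contradict a cannot-link
            if {self.find(a), self.find(b)} == {self.find(i), self.find(j)}:
                raise ValueError("infeasible: must-link contradicts a cannot-link")
        self.parent[self.find(i)] = self.find(j)

    def add_cannot_link(self, i, j):
        if self.find(i) == self.find(j):
            raise ValueError("infeasible: cannot-link inside a must-link group")
        self.cannot.append((i, j))

    def must_same(self, i, j):         # True if the objects are forced together
        return self.find(i) == self.find(j)

    def must_differ(self, i, j):       # True if a cannot-link propagates to (i, j)
        return any({self.find(a), self.find(b)} == {self.find(i), self.find(j)}
                   for a, b in self.cannot)


store = ConstraintStore(4)
store.add_must_link(0, 1)        # x and y must link
store.add_cannot_link(1, 2)      # y and z cannot link
print(store.must_same(0, 1))     # True: transitivity through the must-link group
print(store.must_differ(0, 2))   # True: the cannot-link propagates to x and z
```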
In this thesis, we address two main issues regarding the incorporation of pairwise constraints in the data clustering problem: (i) how are the pairwise constraints obtained and specified, and (ii) how does the clustering performance improve with an increasing number of pairwise constraints, an issue usually referred to as sample complexity in supervised learning [7]. In the sequel, we discuss both of these questions.

Depending on their sources, most pairwise constraints can be classified into two categories. In the first category, the pairwise constraints are collected from a set of non-expert crowd workers through crowdsourcing tools such as Amazon's Mechanical Turk [14]. The problem of incorporating all the crowdsourced pairwise constraints into a data partitioning is then denoted as crowdsourced clustering, or crowdclustering [60, 141] for short. Since the pairwise constraints collected via crowdsourcing usually have a high noise level, crowdclustering can be considered a problem of clustering with noisy pairwise constraints. In the second category, a small number of pairwise constraints is collected from an oracle or from experts. The problem of searching for the optimal data partition that is consistent with both the given pairwise constraints and the input data points to be clustered then leads to the problem of semi-supervised clustering. Since most semi-supervised clustering frameworks assume that the provided pairwise constraints are perfect, it can be considered a problem of clustering with noiseless pairwise constraints. Below, we describe each of them in detail.

1.1 Crowdclustering

Crowdclustering is a technique that utilizes human power to acquire similarities between the objects that need to be clustered. The main focus of crowdsourced clustering is to combine a set of crowdsourced pairwise constraints into a data partitioning of the entire data set. In more detail, given a collection of objects to be clustered, a subset of objects is first sampled in each Human Intelligence Task (HIT), and a crowd worker is asked to annotate the subset of objects in the HIT based on their own opinion. The annotation task can either be grouping objects based on their similarities or describing individual objects by multiple keywords; the annotation results can then be summarized in the form of pairwise constraints. Keyword annotations are transformed into binary pairwise constraints by checking whether two objects share common annotated keywords (see the short sketch at the end of this section). The results of each HIT, which can be considered a partial local clustering of the objects in that HIT, are then combined to form a data partitioning of the entire data set. Note that crowdclustering is motivated by practical considerations: if we have a large number of objects, it may not be realistic to expect a single person to look at all the objects and form an opinion as to how to group them.

Crowdclustering addresses one important issue in data clustering, namely the inconsistency between the similarity computed from data points' attributes and the similarity in human perception. This is because crowdclustering obtains similarity measures between objects from manual annotations, which capture the human perception of similarity among objects.

Two major problems play an important role in crowdclustering. First, since crowd workers are paid according to the number of crowdsourcing tasks they work on, it is important to analyze the minimal number of crowdsourced pairwise constraints needed to recover a sufficiently accurate data partition. The second major problem of crowdclustering lies in the high noise levels of crowdsourced pairwise constraints. This is because most crowd workers are untrained non-experts. They sometimes assign low-quality labels, for example because they do not understand the labeling criteria or do not look at the instances carefully when labeling. As pointed out by [141], more than 80% of crowdsourced pairwise labels can be inconsistent with the true cluster assignment. Thus, how to identify the true data partition from the noisy crowdsourced pairwise constraints becomes one main challenge in crowdclustering.
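To make the annotation-to-constraint conversion concrete, the following small sketch (our own illustration; the data layout is an assumption, not part of the proposed framework) turns keyword annotations from one HIT into binary pairwise constraints.

```python
# A small sketch (assumed data layout; illustrative only) of turning keyword
# annotations from one HIT into binary pairwise constraints: two objects are
# linked if their annotated keyword sets share at least one keyword.
from itertools import combinations

def keywords_to_pairs(annotations):
    """annotations: dict object_id -> set of keywords, for the objects in one HIT."""
    pairs = {}
    for i, j in combinations(sorted(annotations), 2):
        pairs[(i, j)] = 1 if annotations[i] & annotations[j] else 0
    return pairs

hit = {
    "img1": {"beach", "sea"},
    "img2": {"sea", "boat"},
    "img3": {"mountain"},
}
print(keywords_to_pairs(hit))
# {('img1', 'img2'): 1, ('img1', 'img3'): 0, ('img2', 'img3'): 0}
```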
1.2 Semi-supervised Clustering

In semi-supervised clustering, the pairwise constraints are collected from an oracle or from experts. In contrast to crowdclustering, most semi-supervised clustering frameworks assume that the provided pairwise constraints are perfect. These pairwise constraints, also known as side information, can be incorporated into the clustering process to attain better clustering performance. Several mechanisms have been developed to exploit the side information to improve clustering performance. Most semi-supervised clustering algorithms can be classified into two categories [12]: constrained clustering and distance metric based semi-supervised clustering. Constrained clustering employs the side information to restrict the solution space, and only considers feasible data partitions that are consistent with the pairwise constraints. Distance metric based semi-supervised clustering attempts to find and apply a transformation of the data such that (a) the data points in must-link constraints are separated by small distances, and (b) the data points in cannot-link constraints are separated by larger distances; a toy sketch of this idea follows this section.

Generally speaking, the larger the number of pairwise constraints provided, the better the clustering performance that can be achieved by semi-supervised clustering. However, one major issue in incorporating pairwise constraints into the data clustering problem is how the clustering performance improves with an increasing number of pairwise constraints, also known as the sample complexity problem in supervised learning [7]. Since collecting pairwise constraints from an oracle or from experts is usually very expensive, it becomes crucial to discover the minimal number of pairwise constraints required to perfectly recover the underlying true data partition in semi-supervised clustering.

In addition to sample complexity, which relates to labeling costs, computational cost is also an important issue in semi-supervised clustering. In this thesis, we also study the problem of efficiently updating the clustering results when the pairwise constraints are generated sequentially, a common case in various real-world applications. For example, in social networks, we can treat user attributes such as gender, educational background, nationality, and interests as features, and social connections such as friendship and common community membership as pairwise constraints. Hence, the task of grouping users in social networks is essentially a semi-supervised clustering problem. Since the pairwise constraints in social networks change all the time, a clustering algorithm that is able to cope with dynamic pairwise information is needed to solve this dynamic semi-supervised clustering problem. However, given a set of new pairwise constraints, classical semi-supervised clustering algorithms need to re-optimize the objective function over all of the data points subject to all the received connections, making them computationally infeasible for updating large-scale data clustering results. Thus, how to develop an efficient dynamic semi-supervised clustering framework becomes an interesting and important problem.
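The following toy sketch (our own illustration, not one of the published metric learning algorithms surveyed in Chapter 2) shows the distance-metric view: a diagonal Mahalanobis metric is adjusted by simple gradient steps so that must-linked pairs become close and cannot-linked pairs are pushed beyond a margin.

```python
# A toy illustration of the metric-learning view of semi-supervised clustering
# (illustrative only; parameter values are assumptions). We learn a diagonal
# Mahalanobis metric d_w(x, y) = sum_k w_k (x_k - y_k)^2 that shrinks must-link
# distances and pushes cannot-link distances above a margin.
import numpy as np

def learn_diag_metric(X, must, cannot, margin=1.0, lr=0.1, epochs=200):
    w = np.ones(X.shape[1])
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for i, j in must:                      # pull must-linked pairs together
            grad += (X[i] - X[j]) ** 2
        for i, j in cannot:                    # push cannot-linked pairs apart
            sq = (X[i] - X[j]) ** 2
            if w @ sq < margin:
                grad -= sq
        w = np.maximum(w - lr * grad / (len(must) + len(cannot)), 0.0)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
w = learn_diag_metric(X, must=[(0, 1), (2, 3)], cannot=[(0, 4), (1, 5)])
sq_dist = lambda i, j: float(w @ (X[i] - X[j]) ** 2)
print(sq_dist(0, 1), sq_dist(0, 4))   # must-linked pair vs. cannot-linked pair
```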
1.3 Main Contributions

In this thesis, we focus on the problem of data clustering with pairwise constraints. In particular, this thesis addresses a number of important issues in both the crowdclustering problem and the problem of semi-supervised clustering. Below, we briefly summarize each of our contributions.

• Crowdclustering by Matrix Completion. The first problem we consider in this thesis is to reduce the number of pairwise annotations needed in crowdclustering. In particular, we propose a novel crowdclustering approach based on the theory of matrix completion [20]. The basic idea of the proposed algorithm is to construct a partially observed similarity matrix based on a subset of pairwise annotation labels that are agreed upon by most crowd workers. It then deploys the matrix completion algorithm to complete the similarity matrix and obtains the final data partition by applying a spectral clustering algorithm to the completed similarity matrix. Unlike the existing work on crowdclustering [60], which is based on a Bayesian generative model, the main advantage of the proposed approach is that far fewer crowdsourced pairwise annotations are needed to accurately cluster all the objects. This is because the Bayesian approach requires a sufficiently large number of manual annotations to discover the hidden factors behind clustering decisions, resulting in high cost, both in computation and annotation, which limits its scalability to clustering large data sets. In contrast, the proposed algorithm can significantly reduce the demand for manual annotations due to a key observation, i.e., the complete similarity matrix for all the objects should be of low rank [73]. According to matrix completion theory [20], when an N × N matrix is of low rank, it can be perfectly recovered given only a very small fraction of its entries (i.e., O(log² N / N)). This guarantees that the proposed crowdclustering algorithm can accurately discover the underlying data partition with only a small number of crowdsourced annotations. Another advantage of the proposed crowdclustering algorithm is that, by filtering out the uncertain data pairs, it is less sensitive to noisy labels, leading to more robust clustering.

• Semi-Crowdsourced Clustering by Distance Metric Learning. As our second problem, we study the crowdclustering framework when most of the objects are not manually annotated by crowd workers. This question is motivated by the practical consideration of clustering large-scale data sets. Since it is very expensive to hire a large number of crowd workers for annotation, it is not feasible to have every object manually annotated. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering, that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure based on the low-level features of objects and on the manual annotations of only a small portion of the data to be clustered. In more detail, the proposed algorithm for clustering N objects consists of three steps: (i) in the first step, we randomly sample a subset of n << N objects and obtain their manual annotations by crowdsourcing. Then we filter noisy pairwise similarities for the n objects by keeping only object pairs whose pairwise similarities are agreed upon by many workers.
The result of this step is a partially observed n × n similarity matrix with most of its entries removed or unobserved; (ii) we then recover the n × n similarity matrix from the partially observed entries by exploiting the matrix completion algorithm; and (iii) in the third step, we apply a regression algorithm to learn a distance metric from the recovered similarity matrix, and then cluster all N objects based on the N × N pairwise similarities induced by the learned distance metric. The main advantage of the proposed approach is that, in order to cluster N objects, we only need a small subset of them to be annotated by crowdsourcing.

• Semi-supervised Clustering by Pattern Assisted Matrix Completion. We then turn to the problem of semi-supervised clustering. In particular, we address two main shortcomings of the existing semi-supervised clustering algorithms. First, most semi-supervised clustering algorithms have to deal with non-convex optimization problems, leading to clustering results that are only locally optimal and sensitive to the initialization. Second, although many computational algorithms have been proposed for semi-supervised learning, none of them is equipped with a theoretical guarantee on clustering performance. We address these limitations by developing a novel convex semi-supervised clustering approach based on the theory of matrix completion. Instead of penalizing the violations of pairwise constraints, which usually leads to non-convexity, we treat them as the subset of observed entries in the similarity matrix of the N objects to be clustered. Namely, we treat a must-link as an observed similarity of 1 and a cannot-link as an observed similarity of 0. We then deploy the matrix completion method to complete the partially observed similarity matrix under the key observation that the complete similarity matrix should be of low rank. To reduce the number of pairwise constraints needed, we adopt a natural assumption that the feature representations of the objects are good enough, i.e., they are able to reflect the similarities between objects. In more detail, we assume that the singular vectors of the similarity matrix lie in the subspace spanned by the first k eigenvectors derived from the feature representations of the objects. Our analysis shows that under this assumption, only O(log N) pairwise constraints are needed to accurately recover the true cluster partition. This logarithmic dependence on the number of objects being clustered makes the proposed algorithm particularly suitable for clustering large-scale data sets.

• A Constant-Time Algorithm for Dynamic Semi-Supervised Clustering. Another problem we address in this thesis is how to efficiently update clustering results when the pairwise constraints are dynamic. Our motivation stems from the observation that in numerous real-world applications the pairwise constraints are not fixed, which differs from the assumptions made by most classical semi-supervised clustering algorithms. A typical example is social network analysis. If we treat user profiles as features and connections between users as pairwise constraints, then the task of grouping user communities is essentially a dynamic semi-supervised clustering problem. Given a set of newly generated pairwise constraints, classical semi-supervised clustering algorithms need to re-optimize their objective functions over all the data points and constraints, preventing them from efficiently updating the data partitions.
To address this issue, we propose an efficient dynamic semi-supervised clustering framework that casts the clustering problem into a search problem in a feasible convex space, i.e., a convex hull with its extreme points being an ensemble of multiple data partitions. According to the principle of ensemble clustering, the optimal partition lies in this convex hull and can be uniquely represented by a low-dimensional simplex vector. This enables us to carry out dynamic semi-supervised clustering as an updating procedure for the simplex vector based on the newly received pairwise constraints. Using this idea, we derive a constant-time algorithm for updating the simplex vector (clustering result), which enables us to update large-scale clustering results in an extremely efficient way.

The remainder of this thesis is organized as follows. In Chapter 2, we provide a survey of background material on crowdclustering, ensemble clustering, semi-supervised clustering, and matrix completion. In Chapter 3, we present the framework of crowdclustering by matrix completion. This is based on our work published in the 4th Human Computation Workshop, held in conjunction with AAAI (HCOMP) [141]. Chapter 4 introduces the proposed semi-crowdsourced clustering framework. The content of Chapter 4 follows our paper that appeared in Advances in Neural Information Processing Systems (NIPS) [142]. We then present the proposed semi-supervised clustering method in Chapter 5. The material in Chapter 5 comes from our work published in the International Conference on Machine Learning (ICML) [143]. In Chapter 6, we focus on the problem of efficiently updating clustering results when the pairwise constraints are dynamic. Finally, in Chapter 7, we conclude the thesis and discuss some future directions that can be explored.

Chapter 2

Literature Survey

The goal of this chapter is to give an overview of the material related to the work done in this thesis. In particular, we survey the problems of crowdclustering, ensemble clustering, and semi-supervised clustering, the evaluation metrics of data clustering, and the technique of matrix completion.

2.1 Crowdclustering

The idea of crowdclustering was first proposed in [60]. With the advent of crowdsourcing services such as Amazon's Mechanical Turk, it has become much more convenient to purchase Human Intelligence Tasks from large groups of anonymous crowd workers. Since the similarity between two objects computed using their attributes may not reflect the human perception of inter-object similarity, crowdsourcing provides an easy way to address this issue by utilizing human power to acquire pairwise similarities between objects. Generally speaking, crowdclustering is a divide-and-conquer procedure with two steps. In the first step, the problem of clustering N objects is reduced to a number of small problems. Each small problem, denoted as a human intelligence task (HIT), contains a set of objects of reasonable size. The human intelligence tasks are then assigned to a large pool of human workers for annotation. The annotation tasks can be either grouping objects based on their similarities, or describing individual objects by multiple keywords. The annotation results can then be summarized in the form of pairwise constraints. In the second step, a model needs to be developed to automatically aggregate the human annotations and yield a partition of the N objects into clusters.
Crowdclustering is a challenging problem for the following reasons: (i) each worker has only a partial view of the data, (ii) different workers may have different clustering criteria and may also produce different numbers of clusters, and (iii) the annotation quality varies significantly among workers, since some workers are spammers or malicious users. To address these issues, the authors in [60] proposed a Bayesian generative model for crowdclustering. In more detail, the objects to be clustered are represented in a Euclidean space and workers are modeled as pairwise binary classifiers in this space. The clusters are then obtained by clustering the inferred points using a Dirichlet process mixture model [99]. However, one limitation of the Bayesian approach for crowdclustering is that, in order to discover the hidden factors behind clustering decisions, it requires a sufficiently large number of manual annotations. This leads to high annotation cost for clustering large-scale data sets.

2.2 Ensemble Clustering

The main idea behind ensemble clustering [54, 116] is to combine multiple partitions of a dataset into a single data partition, termed the consensus partition, in the hope of exploiting the strengths of different clustering algorithms while compensating for their limitations. This problem is related to the crowdclustering problem since both aim to combine multiple partitions of data into a single clustering result. According to [121], ensemble clustering methods can be classified into two categories: median partition based approaches and object co-occurrence based approaches. In the median partition based approaches, ensemble clustering is cast as an optimization problem that finds the partition maximizing the within-cluster similarity, where various similarity measures have been proposed, such as the Mirkin distance [126], the Jaccard coefficient [10], utility functions [117], and normalized mutual information [113]. Among the approaches based on object co-occurrence, one major category is the relabeling/voting based methods [5, 41, 44, 50, 119, 131]. The basic idea is to first find the corresponding cluster labels between different partitions, and then obtain the consensus partition through a voting process. The second group of approaches in this category is based on the co-association/similarity matrix [54, 69, 87, 120, 128]. These methods use the similarity measure to combine multiple partitions, thus avoiding the label correspondence problem. The third group of approaches in this category is the graph based methods [49, 113]. These methods construct a weighted graph to represent the multiple partitions in the ensemble and find the optimal partition of the data by minimizing the cut of the graph.

While some ensemble clustering methods can work with partial input clusterings, most have not been demonstrated in situations where the input clusterings involve only a small subset of the objects to be clustered, which is the case in the problem of crowdclustering. In addition, since different human workers in crowdclustering may have different clustering criteria, they may produce very different partial clustering results, which introduces a significant amount of noise and inter-worker variation. As a consequence, there may exist a large number of uncertain data pairs for which about half of the human workers put them into the same cluster while the other half do the opposite. These uncertain data pairs can mislead ensemble clustering algorithms into creating inappropriate data partitions.
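The co-association family described above is simple to prototype. The sketch below (our own illustration; not one of the cited algorithms) averages several label vectors into a co-association matrix and re-clusters it with spectral clustering on the precomputed affinity.

```python
# A minimal sketch of co-association-based ensemble clustering (illustrative
# only). Each input partition is a length-N label vector; entries of the
# co-association matrix record how often two objects are co-clustered.
import numpy as np
from sklearn.cluster import SpectralClustering

def co_association(partitions):
    """partitions: list of length-N integer label arrays -> N x N matrix in [0, 1]."""
    C = np.zeros((len(partitions[0]), len(partitions[0])))
    for labels in partitions:
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(partitions)

partitions = [
    np.array([0, 0, 1, 1, 2, 2]),
    np.array([1, 1, 0, 0, 0, 2]),   # a noisier partition of the same 6 objects
    np.array([0, 0, 1, 1, 1, 1]),
]
C = co_association(partitions)
consensus = SpectralClustering(n_clusters=3, affinity="precomputed",
                               random_state=0).fit_predict(C)
print(consensus)                    # the consensus partition of the 6 objects
```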
2.3 Semi-supervised Clustering

There are two major approaches to clustering with semi-supervised information: constrained clustering and approaches based on distance metric learning. In this section, we review the existing work on both of them, followed by two categories of dynamic clustering: clustering based on user feedback, and dynamic network clustering.

2.3.1 Constrained Clustering

Constrained clustering employs the side information to restrict the solution space, and only considers feasible data partitions that are consistent with the pairwise constraints. Among these methods, hard constraint based approaches [2, 30, 31, 33, 34, 108, 125] only consider cluster assignments in which all the constraints are strictly satisfied. In [125], the cluster centers are first initialized randomly. Each data point is then assigned to the nearest cluster center, ensuring that no constraints are violated, and the cluster centers are updated by finding the mean of the points assigned to each cluster, as in the k-means algorithm (a simplified sketch of this assignment step is given below). In [2], the authors modified the k-means and self-organizing map algorithms, respectively, to adjust the cluster memberships to be consistent with the given pairwise constraints. In [108], the generalized Expectation Maximization (EM) algorithm was modified to ensure that only mixture models consistent with the constraints are considered. One problem with treating the side information as hard constraints is that such methods may lead to counter-intuitive clustering solutions, or even render the clustering problem infeasible [31]. To overcome this problem, a number of studies view the side information as soft constraints [8, 9, 30, 80, 93]. Instead of trying to satisfy all the constraints, the key idea behind such methods is to satisfy as many constraints as possible and to introduce a penalty term to account for constraints that cannot be satisfied. In [8, 9, 93], probabilistic models are applied to the semi-supervised clustering problem, treating the pairwise constraints as Bayesian priors for learning the model. In [74], the pairwise constraints are added as an additional penalty term in the objective of spectral learning. In [80], the authors proposed a graphical model that treats the pairwise constraints as random variables; an EM-based algorithm was then developed to model the uncertainty of the constraints. In [79], the authors enhanced the performance of the k-means algorithm by constructing a kernel matrix that incorporates the given pairwise constraints. More discussion on constrained clustering can be found in [35] and the references therein.
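The sketch below is our own simplified illustration of the hard-constrained assignment step attributed to [125] above (it is not the authors' code): each point goes to the nearest center whose current assignment does not violate any must-link or cannot-link constraint.

```python
# A simplified sketch of hard-constrained k-means-style assignment (illustrative
# only). Points are assigned to the nearest center that does not conflict with
# constraints involving already-assigned points.
import numpy as np

def violates(i, c, labels, must, cannot):
    for a, b in must:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] not in (-1, c):   # must-linked partner elsewhere
            return True
    for a, b in cannot:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] == c:             # cannot-linked partner here
            return True
    return False

def constrained_assign(X, centers, must, cannot):
    labels = np.full(len(X), -1)
    for i, x in enumerate(X):
        for c in np.argsort(np.linalg.norm(centers - x, axis=1)):   # nearest first
            if not violates(i, int(c), labels, must, cannot):
                labels[i] = int(c)
                break
        if labels[i] == -1:
            raise ValueError("no feasible assignment for point %d" % i)
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
print(constrained_assign(X, centers, must=[(0, 1)], cannot=[(1, 2)]))
# e.g. [0 0 1 1]; the centers would then be recomputed as cluster means, as in k-means
```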
Many algorithms [59, 64, 109, 129, 135, 139] have been developed for distance metric learning, such as distance metric learning by convex optimization [135], relevance component analysis [109], discriminative component analysis [64], nearest neighbor component analysis [59], local distance metric learning [139], large margin nearest neighbor classifier [129], information theoretic metric learning [37], distance function learning [115], and learning a Bregman distance function [133]. In [135], the authors formulated the problem of distance metric learning into a PSD constrained convex programming problem. The authors in [129] proposed a nearest-neighbor classifier to enforce that the examples from different classes were separated by a large margin. In [37], an informationtheoretic method was proposed to learn a Mahalanobis distance function. In [6], relevant component analysis learns a distance metric by assigning large weights to relevant dimensions while low weights to irrelevant dimensions. It was further improved in [64] to effectively explore both the must-link and the cannot-link constraints simultaneously. More work on distance metric learning can be found in survey [137] and references therein. One problem with approaches based on distance metric learning is that they usually need to deal with the positive semi-definite constraint and this makes them computationally expensive. Despite the progress, there are two main shortcomings with the exisiting algorithms for semi-supervised clustering. First, most semi-supervised clustering algorithms have to deal 17 with non-convex optimization problems, leading to the clustering results that are sensitive to the initializations. Second, although many computational algorithms have been proposed for semi-supervised learning, none of the exisiting semi-supervised clustering algorithms analyze the important sample complexity problem, namely how the clustering performance is improved with increasing number of pairwise constraints. 2.3.3 Dynamic Clustering In recent years, an increasing amount of literature studied the problem of clustering evolving. Among them, clustering based on user feedbacks [13, 28] attracted considerable attentions in the last decade. As one of the earliest work, Cohn et al. [28] considered a scenario when users can iteratively provide different types of feedbacks regarding the clustering quality. Then an EM-based scheme was developed to update the distance metric for achieving better clustering results. The algorithm proposed in [13] is a variant of the complete-link hierarchical clustering. It combines the feedback pairwise constraints with spatial constraints to learn a new distance metric that only satisfies local pairwise constraints. Huang and Mitchell [67] proposed a probabilistic generative model for text clustering. This model enables users to provide four types of user feedback to further enhance the clustering performance. As a related topic, dynamic network clustering [21, 21, 26, 26, 77, 89, 114] studies how a community evolves when a network to be clustered is changing continuously. Evolutionary clustering [21, 26] ensures that the output partition achieves high accuracies on the new data, while still consistent with the historical clustering results. Chakrabarti et al. [21] developed evolutionary versions of both k-means and agglomerative hierarchical clustering algorithms. 18 Chi et al. [26] then extended this idea to evolutionary spectral clustering algorithm. 
In [114], a parameter-free algorithm called GraphScope was proposed to mine time-evolving graphs using the principle of Minimum Description Length (MDL). FacetNet [89] used probabilistic community membership models to identify dynamic communities within the graph. Kim and Han [77] further allowed a changing number of communities and proposed a particle-and-density based algorithm to form new communities or dissolve existing ones.

2.4 Evaluation Metrics

In this section, we review four widely used metrics for evaluating the quality of clustering results: purity [95], normalized mutual information (NMI for short) [29], pairwise F-measure (PWF for short) [132], and the Rand index [105].

Purity focuses on the frequency of the most common category in each cluster, and rewards clustering solutions that introduce less noise in each cluster [3]. Given the ground truth partition $\mathcal{C} = \{C_1, C_2, \dots, C_r\}$ and the partition $\mathcal{C}' = \{C'_1, C'_2, \dots, C'_{r'}\}$ generated by a clustering algorithm, the purity is computed by taking the weighted average of maximal precision values:

$$\mathrm{Purity}(\mathcal{C}, \mathcal{C}') = \frac{1}{N} \sum_{j=1}^{r'} \max_{i=1,\dots,r} |C'_j \cap C_i|,$$

where N is the total number of objects being clustered. One problem with purity is that a high purity can be easily achieved when the number of clusters is large. Thus purity is not appropriate for evaluating clustering performance with a large number of clusters. This problem can be addressed by normalized mutual information. Given the ground truth partition $\mathcal{C}$ and the partition $\mathcal{C}'$ generated by a clustering algorithm, the mutual information is computed as

$$\mathrm{MI}(\mathcal{C}, \mathcal{C}') = \sum_{C_k \in \mathcal{C}} \sum_{C'_l \in \mathcal{C}'} p(C_k, C'_l) \log \frac{p(C_k, C'_l)}{p(C_k)\, p(C'_l)},$$

where $p(C_k)$ denotes the probability that a randomly selected object belongs to cluster $C_k$, and $p(C_k, C'_l)$ indicates the joint probability that a randomly selected object belongs to both cluster $C_k$ and cluster $C'_l$. The normalized mutual information is then given by

$$\mathrm{NMI}(\mathcal{C}, \mathcal{C}') = \frac{2\,\mathrm{MI}(\mathcal{C}, \mathcal{C}')}{H(\mathcal{C}) + H(\mathcal{C}')},$$

where $H(\mathcal{C}) = \sum_k p(C_k) \log \frac{1}{p(C_k)}$ represents the Shannon entropy of partition $\mathcal{C}$.

The pairwise F-measure is another commonly used measure for evaluating clustering algorithms. Let A be the set of data pairs that share the same class label according to the ground truth, and let B be the set of data pairs that are assigned to the same cluster by a clustering algorithm. The pairwise precision and recall are defined as

$$\mathrm{precision} = \frac{|A \cap B|}{|B|}, \qquad \mathrm{recall} = \frac{|A \cap B|}{|A|},$$

and the pairwise F-measure is computed as the harmonic mean of precision and recall, i.e.,

$$\mathrm{PWF} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$

An alternative way to evaluate clustering performance, known as the Rand index, views the clustering result as a set of decisions over the N(N − 1)/2 object pairs. Four measurements are then introduced to compute the Rand index: (i) a true positive (TP) decision assigns two similar objects to the same cluster, (ii) a true negative (TN) decision assigns two dissimilar objects to different clusters, (iii) a false positive (FP) decision assigns two dissimilar objects to the same cluster, and (iv) a false negative (FN) decision assigns two similar objects to different clusters. The Rand index then measures the percentage of decisions that are correct, given by

$$\mathrm{RI} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}.$$

All four of the discussed measurements lie in the range [0, 1], where a value of 1 indicates a perfect match between the partition obtained by a clustering algorithm and the ground truth partition, and 0 indicates a complete mismatch.
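All four metrics can be computed directly from two label vectors. The sketch below is our own illustrative implementation of the definitions above (scikit-learn is used only for NMI, whose default arithmetic-mean normalization matches 2 MI / (H + H')).

```python
# Illustrative implementations of the evaluation metrics defined above (our own
# sketch). `truth` and `pred` are length-N integer label vectors.
import numpy as np
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score

def purity(truth, pred):
    total = 0
    for c in np.unique(pred):                    # for each predicted cluster
        members = truth[pred == c]
        total += np.bincount(members).max()      # size of its dominant true class
    return total / len(truth)

def pairwise_counts(truth, pred):
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t, same_p = truth[i] == truth[j], pred[i] == pred[j]
        tp += same_t and same_p
        tn += (not same_t) and (not same_p)
        fp += (not same_t) and same_p
        fn += same_t and (not same_p)
    return tp, tn, fp, fn

def pairwise_f_and_rand(truth, pred):
    tp, tn, fp, fn = pairwise_counts(truth, pred)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    pwf = 2 * precision * recall / (precision + recall)
    rand = (tp + tn) / (tp + tn + fp + fn)
    return pwf, rand

truth = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 1, 1, 1, 1])
print(purity(truth, pred), normalized_mutual_info_score(truth, pred),
      *pairwise_f_and_rand(truth, pred))
```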
2.5 Matrix Completion

Since all the proposed approaches utilize the technique of matrix completion [15], in this section we review the existing work on matrix completion. Matrix completion was originally proposed for collaborative filtering [57], where the goal is to predict the ratings of users for all the items given the ratings for a subset of randomly sampled items. Generally speaking, matrix completion aims to reconstruct a data matrix from a small subset of observed entries under the assumption that the data matrix is of low rank. The observed entries can be either noiseless or noisy. In the noiseless setting, the observed entries are exactly sampled from the underlying true matrix, while in the noisy setting, the observed entries are perturbed from the true entries by some random noise. In the following, we discuss the matrix completion problem under both scenarios.

2.5.1 Matrix Completion with Noiseless Entries

When the sampled entries of an $n_1 \times n_2$ matrix A are revealed without any noise, we can try to recover it by solving the following optimization problem:

$$\min_{A'} \ \mathrm{rank}(A') \quad \mathrm{s.t.} \quad P_\Delta(A') = P_\Delta(A), \qquad (2.1)$$

where rank(·) is the rank of a matrix and $P_\Delta : \mathbb{R}^{n_1 \times n_2} \rightarrow \mathbb{R}^{n_1 \times n_2}$ is a matrix projection operator that takes a matrix A as input and outputs a new matrix $P_\Delta(A) \in \mathbb{R}^{n_1 \times n_2}$ as

$$[P_\Delta(A)]_{ij} = \begin{cases} A_{ij} & (i, j) \in \Delta \\ 0 & \text{otherwise.} \end{cases} \qquad (2.2)$$

This projection operator ensures that only the observed entries of the matrix are retained when we apply matrix completion. One problem with (2.1) is that it is non-convex because rank(·) is a non-convex function [20]. This makes (2.1) NP-hard, and it is therefore computationally challenging to find the optimal solution. To address this challenge, we follow [20] and replace $\mathrm{rank}(A')$ in (2.1) with its convex surrogate $\|A'\|_*$, the trace norm of the matrix. This allows us to relax (2.1) into the following convex optimization problem:

$$\min_{A'} \ \|A'\|_* \quad \mathrm{s.t.} \quad P_\Delta(A') = P_\Delta(A). \qquad (2.3)$$

One important question regarding problem (2.3) is the minimal number of observed entries needed to perfectly reconstruct the matrix A. The authors in [19] show that, under the assumptions of the incoherence property and uniform sampling, the matrix A can be correctly recovered with high probability by solving problem (2.3) if the number of observed entries is at least $O(n^{6/5} \log n)$, with $n = \max(n_1, n_2)$. This bound was later tightened to $O(n \log^2 n)$ with contributions from [20, 62, 75, 106].
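Convex trace-norm programs such as (2.3) are typically handled by dedicated solvers. As a lightweight illustration of the low-rank completion idea only (not the exact algorithm analyzed above), the sketch below alternates a rank-r projection via truncated SVD with re-imposing the observed entries.

```python
# A lightweight illustration of low-rank matrix completion (our own sketch; a
# rank-constrained alternating-projection heuristic, not the convex program (2.3)).
import numpy as np

def complete_low_rank(A_obs, mask, rank=1, iters=200):
    """A_obs: observed values (anything elsewhere); mask: True where observed."""
    X = np.where(mask, A_obs, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[rank:] = 0.0                       # project onto rank-`rank` matrices
        X = (U * s) @ Vt
        X = np.where(mask, A_obs, X)         # keep the observed entries fixed
    return X

# Toy example: a rank-1 matrix with roughly 60% of its entries observed.
rng = np.random.default_rng(0)
u = rng.normal(size=(20, 1))
A = u @ u.T
mask = rng.random(A.shape) < 0.6
A_hat = complete_low_rank(A, mask)
print(np.linalg.norm(A_hat - A) / np.linalg.norm(A))   # relative error, typically small
```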
2.5.2 Matrix Completion with Noisy Entries

To handle the more general setting in which the sampled entries are noisy, several robust algorithms have been developed to accurately recover the underlying matrix. Usually, two types of errors and corruptions are considered in the literature. In the first case, the locations of the noisy entries are assumed to be spread out, meaning that no single column or row has too many corrupted entries. This problem was first studied in [24], which assumes that the received partially observed matrix is the sum of an unknown low-rank matrix and an unknown sparse error matrix. Several works have extended the results of [24] under different settings. One line of work [65] provides worst-case guarantees for arbitrary corruption of the entries. Another avenue [16, 56, 76, 86, 100] provides probabilistic guarantees for the setting where the locations of the noisy entries are chosen uniformly at random but the values of the errors are arbitrary. In addition, approximate recovery in the presence of stochastic additive noise has also been studied in [1, 17].

In contrast, the second case allows all entries of some columns to be entirely corrupted. As a result, it is in general impossible to recover these corrupted columns, and the goal thus becomes recovering the other, uncorrupted columns. This setting is important in many applications, such as collaborative filtering [57] with malicious users. This problem has been studied in [1, 136], which use a convex relaxation that minimizes the rank plus the column support to distinguish between authentic and corrupted columns.

Chapter 3

Crowdclustering with Sparse Pairwise Labels

In this chapter, we propose a novel approach for crowdclustering that exploits the technique of matrix completion. Instead of using all the crowdsourced annotations, the proposed algorithm constructs a partially observed similarity matrix based on a subset of pairwise annotation labels that are agreed upon by most annotators. It then deploys the matrix completion algorithm to complete the similarity matrix and obtains the final data partition by applying a spectral clustering algorithm to the completed similarity matrix. The main advantage of the proposed algorithm is that only a small number of crowdsourced pairwise annotations are needed to accurately cluster all the objects. Our analysis reveals that, by exploiting the technique of matrix completion, we can perfectly recover the underlying true partition of N objects given only a very small fraction of reliable pairwise constraints (i.e., O(log² N / N)). Another advantage of the proposed crowdclustering algorithm is that, by filtering out the uncertain data pairs, the proposed algorithm is robust to noisy crowdsourced labels.

The remainder of the chapter is organized as follows. In Section 3.1, we introduce the background and discuss the motivation of the proposed crowdclustering framework. Section 3.2 presents the proposed crowdclustering algorithm and the related analysis. We summarize the results of our empirical studies in Section 3.3. Section 3.4 concludes the chapter, and the omitted proofs are given in the appendix.

3.1 Introduction

In data clustering problems, crowdsourcing helps to address one of the key challenges, namely how to define the similarity measure between objects. A typical clustering algorithm measures the similarity between two data points based on their attributes. However, these pairwise similarities may not reflect the human perception of inter-object similarity in the unsupervised setting. In contrast, crowdclustering utilizes human power to acquire pairwise similarities by asking each worker to perform clustering on a subset of objects, thereby defining a similarity measure between pairs of objects based on the percentage of workers who put them into the same cluster.

The core of crowdclustering is to combine the partial clustering results generated by individual workers into a complete data partition. One way to address this challenge is ensemble clustering [53, 113], as suggested in [60]. There are, however, two special challenges in applying ensemble clustering to the crowdclustering problem. First, since each worker deals with only a subset of the entire dataset (because the task of partitioning a large dataset is too complex for individual workers), only partial clustering results are available in the ensemble for combination.
This is in contrast to most ensemble clustering studies that require a 26 clustering of the complete dataset from individual partitions. Second, there is a significant amount of noise and inter-worker variations in the partial clustering results generated by individual workers. As a consequence, we often observe a large number of uncertain data pairs for which about half of the human workers put them into the same cluster while the other half do the opposite. These uncertain data pairs can mislead the ensemble clustering algorithms to create inappropriate data partitions. To address the potentially large variations in the pairwise annotation labels provided by different workers (i.e. whether or not two objects should be assigned to the same cluster), a Bayesian generative model was proposed for crowdclustering in [60]. It explicitly models the hidden factors that are deployed by individual workers to group objects into the same cluster. The empirical study in [60] shows encouraging results in comparison to the ensemble clustering methods. However, one limitation of the Bayesian approach for crowdclustering is that in order to discover the hidden factors for clustering decision, it requires a sufficiently large number of manual annotations, or HITs. This results in high cost, both in computation and annotation, which limits the scalability to clustering large data sets. To overcome the limitation of the Bayesian approach, we propose a novel crowdclustering approach based on the theory of matrix completion [20]. The basic idea is to first compute a partially observed similarity matrix based only on the reliable pairwise annotation labels, or in other words, the labels that are in agreement with a sufficiently large percentage of the workers. It then completes the partially observed similarity matrix using a matrix completion algorithm, and obtains the final data partition by applying a spectral clustering algorithm [102] to the completed similarity matrix. 27 The main advantage of the matrix completion approach is that only a small number of pairwise annotations are needed to construct the partially observed similarity matrix. Therefore, we can obtain a clustering accuracy similar to the Bayesian methods, with a substantial reduction in the number of workers and/or the number of HITs performed by individual workers. The high efficiency of the proposed algorithm in exploiting manual annotations arises from a key observation, i.e. the complete similarity matrix for all the objects should be of low rank [73]. According to the theory of matrix completion [20], when an N × N matrix is of low rank, it can be perfectly recovered given only O(N log2 N ) entries. Another advantage of the proposed crowclustering algorithm is that by filtering out the uncertain data pairs, the proposed algorithm is less sensitive to the noisy labels, making the clustering results more robust. 3.2 Crowdclustering by Matrix Completion The key idea of the proposed crowdclustering algorithm is to derive a partially observed similarity matrix from the partial clustering results generated by individual workers, where the entries associated with the uncertain data pairs are marked as unobserved. A matrix completion algorithm is applied to complete the partially observed similarity matrix by filtering out the unobserved entries. Finally, a spectral clustering algorithm [102] is applied to the completed similarity matrix to obtain the final clustering. 
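The final clustering step admits a compact implementation. The sketch below follows the plain reading used in this chapter, k-means on the top-r eigenvectors of the completed similarity matrix; the normalization variants of spectral clustering are omitted, and the availability of scikit-learn is assumed.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_partition(A_completed, r, seed=0):
    """Cluster N objects from a completed N x N similarity matrix by running
    k-means on the rows of the top-r eigenvector embedding."""
    S = (A_completed + A_completed.T) / 2.0   # symmetrize against numerical noise
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    embedding = eigvecs[:, -r:]               # top-r eigenvectors as the embedding
    return KMeans(n_clusters=r, n_init=10, random_state=seed).fit_predict(embedding)
```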
Below, we describe in detail the two key steps of the proposed algorithm, i.e., the filtering step that removes the entries associated with the uncertain data pairs from the similarity matrix, and the matrix completion step that completes the partially observed similarity matrix. 28 The notations described below will be used throughout the paper. Let N be the total number of objects that need to be clustered, and m be the number of HITs. We assume that the true number of clusters in the data is known a priori1 . Given the partial clustering result from the k-th HIT, we define a similarity matrix W k ∈ RN ×N such that Wijk = 1 if objects i and j are assigned to the same cluster, 0 if they are assigned to different clusters, and −1 if the pairwise label for the two objects can not be derived from the partial clustering result (i.e. neither object i nor object j is used in the HIT). Finally, given a subset of object pairs ∆ ⊂ {(i, j), i, j = 1, . . . N }, we define a matrix projection operator P∆ : RN ×N → RN ×N that takes a matrix E as the input and outputs a new matrix P∆ (E) ∈ RN ×N as [P∆ (E)]ij =    Eij (i, j) ∈ ∆   0 (3.1) otherwise. This projection operator guarantees that only the reliable entries in the matrix can be projected into the space where we apply matrix completion. 3.2.1 Filtering Entries with Unlabeled and Uncertain Data Pairs The purpose of the filtering step is to remove the uncertain data pairs from the manual annotations. To this end, given the m similarity matrices {W k }m k=1 obtained from individual 1 We can relax this requirement by estimating the number of clusters via some heuristic, eg., considering the number of clusters as the rank of the completed matrix A. 29 workers, we first compute matrix A = [Aij ] ∈ RN ×N as the average of {W k }m k=1 , i.e., Aij =          m k=1 k I(W k ≥0) Wij ij m I(W l ≥0) ij l=1 −1 m l l=1 I(Wij ≥ 0) > 0, otherwise where I(z) is an indicator function that outputs 1 when z is true and zero, otherwise. We introduce the indicator function I(Wijk ≥ 0) in the above equation so that only the labeled pairs of objects will be counted in computing A. Since Aij ∈ [0, 1] for a labeled data pair (i.e. Aij ≥ 0) measures the percentage of HITs that assign objects i and j to the same cluster, it can be used as the basis for the uncertainty measure. In particular, we define the set of reliable data pairs whose labelings are agreed upon by the percentage of workers as ∆ = {(i, j) ∈ [N ] × [N ] : Aij ≥ 0, Aij ∈ / (d0 , d1 )} where d0 < d1 ∈ [0, 1] are two thresholds that will be determined depending on the quality of the annotations. We then construct the partially observed similarity matrix A˜ as follows    1 (i, j) ∈ ∆, Aij ≥ d1     A˜ij = 0 (i, j) ∈ ∆, Aij ≤ d0       unobserved (i, j) ∈ /∆ 30 (3.2) 3.2.2 Completing the Partially Observed Matrix The second step of the algorithm is to reconstruct the full similarity matrix A∗ ∈ RN ×N ˜ To this end, we need to make several reasonable based on the partially observed matrix A. assumptions about the relationship between A˜ and A∗ . A simple approach is to assume A˜ij = A∗ij , ∀(i, j) ∈ ∆; in other words, assume that all the observed entries in matrix A˜ are correct. This, however, is unrealistic because A˜ is constructed from the partial clustering results generated by different workers, and we expect a significant amount of noise in individual clustering results. Thus, a more realistic assumption is A˜ij = A∗ij for most of the observed entries in ∆. 
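The construction of the partially observed matrix Ã in Section 3.2.1, which determines which entries end up in the reliable set ∆, reduces to a few array operations. The sketch below assumes the per-HIT matrices W^k are stored as N × N arrays with entries in {1, 0, −1}; the helper name and default thresholds are illustrative.

```python
import numpy as np

def filter_annotations(W_list, d0=0.0, d1=0.9):
    """Average the per-HIT similarity matrices and keep only the reliable pairs.
    Returns (A_tilde, mask): mask marks the set Delta, and A_tilde holds 1 for
    confidently-same pairs, 0 for confidently-different pairs, NaN elsewhere."""
    W = np.stack(W_list).astype(float)          # shape (m, N, N), entries in {1, 0, -1}
    labeled = W >= 0                            # which HITs labeled each pair
    counts = labeled.sum(axis=0)
    sums = np.where(labeled, W, 0.0).sum(axis=0)
    A = np.where(counts > 0, sums / np.maximum(counts, 1), -1.0)  # average, -1 if unlabeled
    mask = (A >= 0) & ((A <= d0) | (A >= d1))                     # the reliable set Delta
    A_tilde = np.full(A.shape, np.nan)
    A_tilde[mask & (A >= d1)] = 1.0
    A_tilde[mask & (A <= d0)] = 0.0
    return A_tilde, mask
```

The pair (A_tilde, mask) is exactly the input required by the completion step that follows.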
We introduce the matrix E ∈ RN ×N to ˜ i.e., capture the noise in A, ˜ P∆ (A∗ + E) = P∆ (A), (3.3) where P∆ is a matrix projection operator defined in (3.1). Under this assumption, we expect E to be a sparse matrix with most of its entries being zero. The assumption specified in equation (3.3) is insufficient to recover the full similarity A∗ as we can fill the unobserved entries (i.e., (i, j) ∈ / ∆) in A∗ with any values. An additional assumption is needed to make it possible to recover the full matrix from a partially observed one. To this end, we follow the theory of matrix completion [20] by assuming the full similarity A∗ to be of low rank. It was shown in [73] that when the similarity matrix A∗ is constructed from a given clustering (i.e. A∗ij = 1 when objects i and j are assigned to the same cluster and zero, otherwise), its rank is equal to the number of clusters. As a result, when the number 31 of clusters is relatively small, it is reasonable to assume A∗ to be of low rank. Combining the two assumptions together leads to the following approach, to recover the full ˜ We decompose A˜ into the sum similarity matrix A∗ from the partially observed matrix A. of two matrices E and A∗ , where E is a sparse matrix that captures the noise in A˜ and A∗ is a low rank matrix that gives the similarity between any two objects. Based on this idea, we cast the matrix recovery problem into the following optimization problem min rank(A ) + C E 1 (3.4) A ,E s.t. where X 1 = ij ˜ P∆ (A + E) = P∆ (A) |Xij | is the 1 norm of matrix X that measures the sparsity of X. Pa- rameter C > 0 is introduced to balance the two objectives, i.e., finding a low rank similarity matrix A and a sparse matrix E for noise. We will discuss in Section 3.3.2 about how to automatically determine the value of C. One problem with the objective function in (3.4) is that it is non-convex because rank(·) is a non-convex function [20]. It is therefore computationally challenging to find the optimal solution for (3.4). To address this challenge, we follow [20] and replace rank(A ) in (3.4) with its convex surrogate |A |∗ , the trace norm of matrix A . This allows us to relax (3.4) into the following convex optimization problem min A ,E s. t. |A |∗ + C E 1 ˜ P∆ (A + E) = P∆ (A). 32 (3.5) We then use the efficient first order algorithm developed in [90] to solve the optimization problem in (3.5). Given the completed similarity matrix A∗ obtained from (3.5), we apply the spectral clustering algorithm [102] to compute the final data partition, which is essentially an application of k-means algorithm [94] to the data projected into the space of the top r eigenvectors of A∗ . 3.2.3 Theoretical Analysis A theoretical question is whether the similarity matrix obtained by (3.5) is close to the true similarity matrix A∗ . Our theoretical analysis gives a positive answer to this question. In the following, we show that, under appropriate conditions about the eigenvectors of A∗ , A∗ can be perfectly recovered by (3.5) if the number of noisy data pairs is significantly smaller than the number of observed data pairs. First, we need to make a few assumptions about A∗ besides its low rank. Let A∗ be a low-rank matrix of rank r, with a singular value decompsition A∗ = U ΣV , where U = (u1 , . . . , ur ) ∈ RN ×r and V = (v1 , . . . , vr ) ∈ RN ×r are the left and right eigenvectors of A∗ , satisfying the following incoherence assumptions. 
• A1: The row and column spaces of A∗ have coherence bounded above by some positive number µ0 , i.e., µ r max PU (ei ) 22 ≤ 0 , N i∈[N ] 33 µ r max PV (ei ) 22 ≤ 0 N i∈[N ] where ei is the standard basis vector. √ µ1 r • A2: The matrix E = U V has a maximum entry bounded by in absolute value N √ µ r for some positive µ1 , i.e. |Ei,j | ≤ 1 , ∀(i, j) ∈ [N ] × [N ], N where PU and PV denote the orthogonal projections on the column space and row space of A∗ , respectively, i.e. PU = U U , PV = V V Remark The assumptions A1 and A2 essentially assume that for both the left and right eigenvectors, the values are spread over all the entries. In other words, no entry in those eigenvectors has its value dominated by the other entries of the same eigenvector. This is the key property that makes it possible to recover the full matrix from a small number of observed entries. To state our theorem, we need to introduce a few notations. Let ξ(A ) and µ(A ) denote the low-rank and sparsity incoherence of matrix A defined by [22], i.e. ξ(A ) = max E∈T (A ), E ≤1 µ(A ) = E ∞ (3.6) E (3.7) max E∈Ω(A ), E ∞ ≤1 where T (A ) denotes the space spanned by the elements of the form uk y and xvk , for 1 ≤ k ≤ r, Ω(A ) denotes the space of matrices that have the same support to A , denotes the spectral norm and · · ∞ denotes the largest entry in magnitude. Then the following theorem states the theoretical guarantee of the proposed algorithm. 34 Theorem 1. Let A∗ ∈ RN ×N be a similarity matrix of rank r obeying the incoherence properties (A1) and (A2), with µ = max(µ0 , µ1 ). Suppose we observe m1 entries of A∗ recorded in A˜ with locations sampled uniformly at random, denoted by S. Under the assumption that m0 entries randomly sampled from m1 observed entries are corrupted, denoted by Ω, i.e. ˜ = PS (A∗ + E ∗ ), where E ∗ corresponds to the corrupted A∗ij = A˜ij , (i, j) ∈ Ω. Given PS (A) entries in Ω. With µ(E ∗ )ξ(A∗ ) ≤ 1 , 4r + 5 m1 − m0 ≥ C1 µ4 n(log n)2 , and C1 is a constant, we have, with a probability at least 1 − N −3 , the solution (A , E) = (A∗ , E ∗ ) is the unique optimizer to (4.2) provided that ξ(A∗ ) − (2r − 1)ξ 2 (A∗ )µ(E ∗ ) 1 − (4r + 5)ξ(A∗ )µ(E ∗ ) < λ < 1 − 2(r + 1)ξ(A∗ )µ(E ∗ ) (r + 2)µ(E ∗ ) The proof can be found in the appendix. As indicated by Theorem 1, we have a good chance to recover the full similarity matrix A∗ if the number of observed correct entries (i.e., m1 ) is significantly larger than the number of observed noisy entries (i.e., m0 ). 3.3 Experiments In this section, we first demonstrate empirically that the proposed algorithm can achieve similar or better clustering performance as the Bayesian approach for crowdclustering [60] with significantly lower running time. We further show that, as we reduce the number of 35 (a) Bedroom (b) Suburb (c) Kitchen (d) Living (e) Coast (f) Forest Room (h) Inside (i) Mountain City (j) Open Coun- (k) (g) High- way Street try (l) Tall Build- (m) Office ing Figure 3.1 Some sample images from the 13 categories in the Scenes data set (a) Human (b) Animal (c) Plant Figure 3.2 Some sample images from the three categories in the Tattoo data set pairwise labels, either by reducing the number of workers, or by reducing the number of HITs performed by each worker, the proposed algorithm significantly outperforms the Bayesian approach. 3.3.1 Data Sets Four image data sets are used in our experiments. They are: • Scenes Data Set: This is a subset of the larger Scenes image data set [46] which has been used in the previous study on crowdclustering [60]. 
It is comprised of 1, 001 images belonging to 13 categories. Figure 6.3 shows sample images of each category from this data set. To obtain the crowdsourced labels, 131 workers were employed to perform HITs. In each HIT, the worker was asked to group images into multiple clusters, where the number of clusters was determined by individual workers. Pairwise labels between 36 (a) Tractor (b) Horse (c) Bench cart (d) Black- (e) Violin berry (f) Saxo- (g) Hammer phone Figure 3.3 Some sample images from the seven categories in the ImageNet data set (a) Car (b) Dog (c) Chair (d) Cat (e) Bird Figure 3.4 Some sample images from the five categories in the PASCAL 07 data set images are derived from the partial clustering results generated in HITs. The data we used, including the subset of images and the output of HITs, were provided by the authors of [60]. • Tattoo Data Set: This is a subset of the Tattoo image database [72]. It contains 3, 000 images that are evenly distributed over three categories: human, animal and plant. Some sample images of each category in the Tattoo data set are shown in Figure 6.4. Unlike the Scenes data set where the objective of HIT was to group the images into clusters, the workers here were asked to annotate tatoo images with keywords of their choice. On average, each image is annotated by three different workers. Pairwise labels between images are derived by comparing the number of matched keywords between images to a threshold (which is set to 1 in our study). • ImageNet data set: This is a subset of the larger ImageNet database [39]. This subset contains 6, 408 images belonging to 7 categories: tractor, horse cart, bench, blackberry, violin, saxophone, and hammer. Figure 3.3 shows some sample images of each category in this data set. 37 • PASCAL 07 data set: This is a subset of the PASCAL Visual Object Classes Challenge 2007 database [45]. The subset contains 2, 989 images belonging to five classes: car, dog, chair, cat and bird. Some sample images of each category in the PASCAL 07 data set are shown in Figure 3.4. Similar to the Tattoo data set, the crowd workers were asked to annotate images from both ImageNet and PASCAL 07 data sets with at most 3 keywords. On average, each image is annotated by five different workers. Pairwise labels between images are derived based on whether two images share at least one common keyword. 3.3.2 Baseline and Parameter Selection Studies in [60] have shown that the Bayesian approach performs significantly better than the ensemble clustering algorithm [113], and Non-negative Matrix Factorization (NMF) [84] in the crowdclustering setting. Hence, we use the Bayesian approach for crowdclustering as the baseline in our study. Parameter C in (3.5) plays an important role in deciding the final similarity matrix. Since no ground truth information (true cluster labels) is available to determine C, we present a heuristic for estimating the value of C. We assume that the N objects to be clustered are roughly evenly distributed across clusters; a similar assumption was adopted in normalized cut algorithm [110]. Based on this assumption, we propose to choose a value of C that leads to the most balanced distribution of objects over different clusters. To this end, we measure the imbalance of data distribution over clusters by computing N i,j=1 Ai,j = 1 A 1, where 1 is a vector of all ones. Our heuristic is to choose a value for C that minimizes 1 A 1. 
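Operationally, this heuristic is a small search over candidate values of C, keeping the one whose completed matrix minimizes 1'A1. In the sketch below the candidate grid is illustrative, and complete_fn stands for any solver of problem (3.5) that returns the completed similarity matrix.

```python
import numpy as np

def choose_C(A_tilde, mask, complete_fn, candidates=(0.01, 0.03, 0.1, 0.3, 1.0)):
    """Balance heuristic for C: complete the matrix for each candidate value and
    keep the C whose completed matrix A minimizes the imbalance score 1' A 1."""
    best_C, best_score = None, np.inf
    for C in candidates:
        A = complete_fn(A_tilde, mask, C)   # completed similarity matrix for this C
        score = float(A.sum())              # 1' A 1, i.e. the sum of all entries
        if score < best_score:
            best_C, best_score = C, score
    return best_C
```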
The 38 rationale behind the imbalance measurement 1 A 1 is the following: Let N1 , · · · , Nr be the number of objects in the r clusters. Since 1 A 1 = r 2 k=1 Nk and r k=1 Nk = N , without any further constraint, the optimal solution that minimizes 1 A 1 is Ni = N/r, i = 1, . . . , r, the most balanced data distribution. Hence, 1 A 1, to some degree, measures the imbalance of data distribution over clusters. The experimental results show that this heuristic works well. It usually helps us to find a good enough C among all the candidates. We use normalized mutual information (NMI) and pairwise F-measure (PWF) to evaluate the clustering performance. Besides clustering accuracy, we also evaluate the efficiency of both algorithms by measuring their running time. The code of the baseline algorithm was provided by the authors of [60]. Both the baseline algorithm and the proposed algorithm were implemented in MATLAB and run on an Intel Xeon 2.40 GHz processor with 64.0 GB of main memory. 3.3.3 Experimental results with full annotations To evaluate the clustering performance of the proposed algorithm, our first experiment is performed on the four image data sets using all the pairwise labels derived from the manual annotation process. For all data sets, we set d0 to 0 and d1 to 0.9. Two criteria are deployed in determining the value for d1 : d1 should be large enough to ensure that most of the selected pairwise labels are consistent with the cluster assignments, and should be small enough to obtain enough number of entries with value 1 in the partially observed matrix ˜ Table 3.1 summarizes the clustering performance and running time (CPU time) of both A. algorithms. 39 Table 3.1 Clustering performance and running time of the proposed algorithm (i.e. matrix completion) and the baseline algorithm (i.e. Bayesian method) on four data sets Datasets Scenes Data Set Tattoo Data Set ImageNet data set PASCAL 07 data set NMI PWF CPU time NMI PWF CPU time NMI PWF CPU time NMI PWF CPU time (s) (s) (s) (s) Matrix Completion 0.738 0.584 4.31 × 102 0.398 0.595 4.75 × 103 0.631 0.734 2.62 × 104 0.394 0.462 9.23 × 103 Bayesian Method 0.764 0.618 4.84 × 103 0.292 0.524 5.44 × 104 0.615 0.718 7.48 × 104 0.388 0.439 2.36 × 104 We observed that for the Scenes data set, the proposed algorithm yields similar performance as the Bayesian crowdclustering algorithm but with much lower running time. For the Tattoo, ImageNet and PASCAL 07 data sets, the proposed algorithm outperforms the Bayesian crowdclustering algorithm in both accuracy and efficiency. The higher efficiency of the proposed algorithm is because that the proposed algorithm only needs to handle a subset of reliable pairwise labels while the Bayesian crowdclustering algorithm needs to explore all the pairwise labels derived from manual annotation. For example, for the Scenes data set, only less than 13% of image pairs satisfy the specified condition of “reliable pairs”. The small ˜ and consequently a high efficiency percentage of reliable pairs results in a sparse matrix A, in solving the matrix completion problem in (3.4). We also examine how well the conditions specified in our theoretical analysis are satisfied for the two image data sets. 
Besides the technical conditions that are difficult to verify, the most important condition used in our analysis is that a majority of the reliable pairwise 40 (a) Highway and Inside (b) Bedroom and Kitchen city (c) Mountain and Open country (d) Tall building and Street Figure 3.5 Sample image pairs that are grouped into the same cluster by more than 50% of the workers but are assigned to different clusters according to the ground truth. labels derived from manual annotation should be consistent with the cluster assignments (i.e. m1 −m0 ≥ O(N log2 N )). We found that for the Scenes data set, 95% of the reliable pairwise labels identified by the proposed algorithm are consistent with the cluster assignments, and for the Tattoo data set, this percentage is 71%. We finally evaluate the significance of the filtering step for the proposed algorithm. First, we observe that a large portion of pairwise labels derived from the manual annotation process are inconsistent with the cluster assignment. In particular, more than 80% of pairwise labels are inconsistent with the cluster assignment for the Scenes data set. Figure 3.5 shows some example image pairs that are grouped into the same cluster by more than 50% of the workers but belong to different clusters according to the ground truth. To observe how the noisy labels affect the proposed algorithm, we fix the threshold d0 to be 0, and vary the threshold d1 used to determine the reliable pairwise labels from 0.1 to 0.9. Table 3.2 summarizes the clustering performance of the proposed algorithm for the Scenes data set with different thresholds and the percentage of resulting reliable pairwise labels that are consistent with the cluster assignments. Overall, we observe that the higher the percentage of consistent pairwise labels, the better the clustering performance. 41 Table 3.2 Performance of the proposed clustering algorithm as a function of different threshold values and the percentage of 1 entries in the matrix A˜ that are consistent with the cluster assignments for the Scenes data set Threshold d1 0.1 0.3 0.5 0.7 0.9 Consistency percentage 18.02% 28.10% 35.53% 43.94% 61.79% NMI 0.507 0.646 0.678 0.700 0.738 PWF 0.327 0.412 0.431 0.445 0.584 3.3.4 Experimental results with sampled annotations The objective of the second experiment is to verify that the proposed algorithm is able to obtain an accurate clustering result even with a significantly smaller number of manual annotations. To this end, we use two different methods to sample the annotations: for the Scenes data set, we use the annotations provided by 20, 10, 7 and 5 randomly sampled workers, and for the remaining three data sets whose pairwise constraints are generated by keywords matching, we randomly sample 10%, 5%, 2% and 1% of all the annotations. Then we run both the baseline and the proposed algorithm on the sampled annotations. All the experiments in this study are repeated five times, and the performance averaged over the five trials is reported in Figure 3.6. As expected, reducing the number of annotations deteriorates the clustering performance for both the algorithms. However, the proposed algorithm appears to be more robust and performs better than the baseline algorithm for all levels of random sampling. The robustness of the proposed algorithm can be attributed to the fact that according to our analysis, to perfectly recover the cluster assignment matrix, the proposed algorithm only requires a small number of reliable pairwise labels (i.e. O(N log2 /N )). 
In contrast, the Bayesian crowdclustering algorithm requires a large number of manual annotations to overcome the noisy labels 42 (a) (c) (b) Scenes data set (d) ImageNet data set Tattoo data set Pascal 07 data set Figure 3.6 NMI values as a function of number of workers and percentage of annotations for four data sets and to make reliable inference about the hidden factors used by different workers to group the images. As a consequence, we observe a significant reduction in the clustering performance of the Bayesian approach as the number of manual annotations is decreased. 3.4 Conclusion and Discussion In this chapter, we present a matrix completion framework for crowdclustering. The key to the proposed algorithm is to identify a subset of data pairs with reliable pairwise labels. These reliable data pairs are used as the seed for a matrix completion algorithm to derive the full similarity matrix, which forms the foundation for data clustering. Currently, we identify these reliable data pairs based on the disagreement among workers, and as a result, a sufficient number of workers are needed to determine which data pairs are reliable. An 43 alternative approach is to improve the quality of manual annotations. Given that our matrix completion approach needs only a small number of high quality labels, we believe that combining appropriately designed incentive mechanisms with our matrix completion algorithm will lead to greatly improved performance. In [107], the authors discussed different incentive mechanisms to improve the quality of work submitted via HITs. In particular, they studied a number of incentive mechanisms and their affect on eliciting high quality work on Turk. They find that a mechanism based on accurately reporting peers’ responses is the most effective in improving the performance of Turkers. As part of our future work, we plan to investigate the conjunction of appropriate incentive mechanisms with clustering algorithms for this problem. 44 Chapter 4 Semi-Crowdsourced Clustering by Distance Metric Learning As discussed in Chapter 3, crowdclustering addresses the challenge of defining appropriate similarity measures between objects by using the manual annotations obtained through crowdsourcing. However, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, in this chapter we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a large amount of noise in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency. 45 The remainder of the chapter is organized as follows: In Section 4.1, we describe the settings and introduce the motivation of the proposed semi-crowdsourced clustering framework. 
Section 4.2 presents the proposed semi-crowdsourced clustering algorithm and the related theoretical analysis. We summarize the results of our empirical studies in Section 4.3. Section 4.4 concludes the chapter. 4.1 Introduction Despite the encouraging results obtained via crowdclustering, a main shortcoming of crowdclustering is that it can only cluster objects for which manual annotations are available, significantly limiting its application to large scale clustering problems. For instance, when clustering hundreds of thousands of objects, it is not feasible to have each object manually annotated by multiple workers. To address this limitation, we study the problem of semicrowdsourced clustering, where given the annotations obtained through crowdsourcing for a small subset of the objects, the objective is to cluster the entire collection of objects. Figure 4.1 depicts the proposed framework. Given a set of N objects to be clustered, the objective of the proposed framework is to learn a pairwise similarity measure from the crowdsourced labels of n objects (n N ) and the feature representations of N objects to be clustered. The key to semi-crowdsourced clustering is to define an appropriate similarity measure for the subset of objects that do not have manual annotations (i.e., N − n objects). To this end, we propose to learn a similarity function, based on the object features, from the pairwise 46 Figure 4.1 The proposed framework for semi-crowdsourced clustering. The given N objects (o1 , o2 , . . . , oN ) need to be clustered, but only a small subset of the N objects (o1 , o2 , · · · on ) have been annotated by crowdsourcing, n N. similarities derived from the manual annotations for the subset of n objects; we then apply the learned similarity function to compute the similarity between any two objects, and perform data clustering based on the computed similarities. In this study, for computational simplicity, we restrict ourselves to a linear similarity function, i.e. given two objects oi and oj and their feature representation xi and xj , respectively, their similarity sim(Oi , Oj ) is given by sim(Oi , Oj ) = xi M xj , where M 0 is the learned distance metric. Learning a linear similarity function from given pairwise similarities (sometimes referred to as pairwise constraints when similarities are binary) is known as distance metric learning, which has been reviewed in the Section 2.3.2. The key challenge of distance metric learning in semi-crowdsourced clustering arises due to the noise in the pairwise similarities obtained from manual annotations. According to our previous observation in Section 3.3, large dis- 47 agreements are often observed among human workers in specifying pairwise similarities. As a result, pairwise similarities based on the majority voting among human workers often disagree with the true cluster assignments of objects. As an example, we show in Section 3.3 that for the Scenes data set [46], more than 80% of the pairwise labels obtained from human workers are inconsistent with the true cluster assignment. This large noise in the pairwise similarities due to crowdsourcing could seriously misguide the distance metric learning and lead to a poor prediction performance, as already demonstrated in [66] as well as in our empirical study. We propose a metric learning algorithm that explicitly addresses the presence of noise in pairwise similarities obtained via crowdsourcing. 
The proposed algorithm uses the matrix completion technique [20] to rectify the noisy pairwise similarities, and regression analysis to efficiently learn a distance metric from the restored pairwise similarities. More specifically, the proposed algorithm for clustering N objects consists of three components: (i) filtering noisy pairwise similarities for n objects by only keeping object pairs whose pairwise similarities are agreed by many workers (not majority of the workers). The result of the filtering step is a partially observed n × n similarity matrix (n N ) with most of its entries re- moved/unobserved; (ii) recovering the n × n similarity matrix from the partially observed entries by using the matrix completion algorithm; (iii) applying a regression algorithm to learn a distance metric from the recovered similarity matrix, and clustering the N × N pairwise similarities based on the learned distance metric. Figure 6.2 shows the basic steps of the proposed algorithm. Compared to the existing approaches of distance metric learning [138], the proposed algorithm has the following three advantages: (i) by exploring the matrix completion technique, 48 Figure 4.2 The proposed framework of learning a distance metric from noisy manual annotations the proposed algorithm is robust to a large amount of noise in the pairwise similarities; (ii) by utilizing regression analysis, the proposed algorithm is computationally efficient and does not have to handle the positive semi-definite constraint, a key computational bottleneck for most distance metric learning algorithms; (iii) the learned distance metric, with high probability, is close to the optimal metric learned from the perfect or true similarities (i.e. similarity of 1 when two objects are in the same cluster and 0, otherwise) for arbitrarily large n. 4.2 Semi-Crowdsourced Clustering by Robust Distance Metric Learning In this section, we first present the problem and a general framework for semi-crowdsourced clustering. We then describe the proposed algorithm for learning distance metric from a small set of noisy pairwise similarities that are derived from manual annotations. 49 4.2.1 Problem Definition and Framework Let D = {O1 , . . . , ON } be the set of N objects to be clustered, and let X = (x1 , . . . , xN ) be their feature representation, where xi ∈ Rd is a vector of d dimensions. We randomly sample a subset of n N objects from the collection D, denoted by D = {O1 , . . . , On }, and obtain their manual annotations by crowdsourcing. Let m be the number of HITs used by crowsourcing. Given the manual annotations collected from the k-th HIT, we define a similarity matrix Ak ∈ Rn×n such that Aki,j = 1 if objects Oi and Oj share common annotations (i.e. share common annotated keywords or assigned to the same cluster by the worker), zero if they don’t, and −1 if either of the two objects is not annotated by the kth HIT (i.e. unlabeled pair). Note that we only consider a binary similarity measure in this study because our goal is to perfectly reconstruct the ideal pairwise similarities based on the true cluster assignments (i.e. 1 when both objects are assigned to the same cluster and zero, otherwise). The objective of semi-crowdsourced clustering is to cluster all the N objects in D based on the features in X and the m × m similarity matrices {Ak }m k=1 for the objects in D. Throughout this paper, we assume that the number of clusters, denoted by r, is given a priori1 . 
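Before the formal development, the third component, learning the metric by regression from the completed n × n similarity matrix, can be sketched as a single regularized closed form; the exact smoothed estimator is derived in Section 4.2.2, and the regularization scaling below is only illustrative.

```python
import numpy as np

def rdml(X_sub, A_completed, lam=1.0):
    """Regression-based distance metric: fit x_i' M x_j to the completed
    similarities A_ij over the n crowd-labeled objects via ridge-style regression.
    X_sub: d x n feature matrix of the labeled subset; A_completed: n x n matrix."""
    d, n = X_sub.shape
    G = X_sub @ X_sub.T + lam * n * np.eye(d)   # regularized Gram matrix; the exact
                                                # scaling of the smoothing term follows
                                                # the smoothed estimator in Section 4.2.2
    G_inv = np.linalg.inv(G)
    return G_inv @ X_sub @ A_completed @ X_sub.T @ G_inv

def all_pairs_similarity(X_all, M):
    """Generalize to all N objects: sim(o_i, o_j) = x_i' M x_j for every pair."""
    return X_all.T @ M @ X_all                  # N x N similarity matrix
```

The N × N matrix produced by all_pairs_similarity is then handed to a spectral clustering step to obtain the final partition of all N objects.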
To generalize the pairwise similarities from the subset D to the entire collection of objects D, we propose to first learn a distance metric from the similarity matrices {Ak }m k=1 , and then compute the pairwise similarity for all the N objects in D using the learned distance metric. The challenge is how to learn an appropriate distance metric from a set of similarity 1 Similar to the Chapter 3, we may relax this requirement by estimating the number of clusters via some heuristic, e.g. considering the number of clusters as the rank of the completed matrix A. 50 matrices {Ak }m k=1 . A straightforward approach is to combine multiple similarity matrices into a single similarity matrix by computing their average. More specifically, let A ∈ Rn×n be the average similarity matrix. We have m Ai,j = 1 I(Aki,j ≥ 0)Ai,j m k k=1 I(Ai,j ≥ 0) k=1 where Aki,j < 0 indicates that the pair (Oi , Oj ) is not labeled by the kth HIT (i.e. either object Oi or Oj is not annotated by the kth worker) and I(z) is an indicator function that outputs 1 when z is true and zero, otherwise. We then learn a distance metric M from A. The main problem with this simple strategy is that due to the large disagreements among workers in determining the pairwise similarities, the average similarities do not correlate well with the true cluster assignments. In the next subsection, we develop an efficient and robust algorithm that learns a distance metric from a set of noisy similarity matrices. 4.2.2 Learning a Distance Metric from a Set of Noisy Similarity Matrices As illustrated in Figure 6.2, the proposed algorithm consists of three steps, i.e. filtering step, matrix completion step and distance metric learning step. For the first two steps, namely the data preprocessing steps, we follow the idea proposed in Chapter 3. Filtering step. To filter out the uncertain object pairs, we introduce two thresholds d0 and ˜ Since any similarity measure d1 (1 ≥ d1 > d0 ≥ 0) into the average similarity matrix A. smaller than d0 indicates that most workers put the corresponding object pair into different 51 clusters, we simply set it as 0. Similarly, we set the similarity measure larger than d1 as 1. For object pairs with similarity measure in the range between d0 and and d1 , they are treated as uncertain object pairs and are discarded (i.e. marked as unobserved) from the similarity matrix. The resulting partially observed similarity matrix A is given by    1 A˜i,j ∈ [d1 , 1]     Ai,j = 0 A˜i,j ∈ [0, d0 ]       unobserved Otherwise (4.1) We also define ∆ as the set of observed entries in Ai,j ∆ = {(i, j) ∈ [N ] × [N ] : A˜ij ≥ 0, A˜ij ∈ / (d0 , d1 )} Matrix completion step. Since A is constructed from the partial clustering results generated by different workers, we expect some of the binary similarity measures in A to be incorrect. We introduce the matrix E ∈ Rn×n to capture the incorrect entries in A. If A∗ is the perfect similarity matrix, we have P∆ (A∗ + E) = P∆ (A), where P∆ outputs a matrix with [P∆ (B)]i,j = Bi,j if (i, j) ∈ ∆ and zero, otherwise. With appropriately chosen thresholds d0 and d1 , we expect most of the observed entries in A to be correct and as a result, E to be a sparse matrix. To reconstruct the perfect similarity matrix A∗ from A, following the matrix completion theory [20], we solve the following optimization problem min |A|∗ + C|E|1 A,E s. t. P∆ (A + E) = P∆ (A), 52 (4.2) where |A|∗ is the nuclear norm of matrix A and |E|1 = i,j |Ei,j | is the 1 norm of E. 
Using the facts that E is a sparse matrix and A is of low rank [73], under the two assumptions made in Section 3.3, with a high probability, we have A∗ = A, where A is the optimal solution for (4.2). Distance metric learning step. This step learns a distance metric from the completed similarity matrix A. A common problem shared by most distance metric learning algorithms is their high computational cost due to the constraint that a distance metric has to be positive semi-definite. In this study, we develop an efficient algorithm for distance metric learning that does not have to deal with the positive semi-definite constraint. Our algorithm is based on the key observation that with a high probability, the completed similarity matrix A is positive semi-definite. This is because according to the Theorem 1, with a probability at least 1 − N −3 , A = Y Y , where Y ∈ {0, 1}N ×r is the true cluster assignment. This property guarantees the resulting distance metric to be positive semi-definite. The proposed distance metric learning algorithm is based on a standard regression algorithm [97]. Given the similarity matrix A, the optimal distance metric M is given by a regression problem n min M ∈Rd×d L(M ) = (xi M xj − Ai,j )2 = |X M X − A|2F (4.3) i,j=1 where xi is the feature vector for the sampled object Oi and X = (x1 , . . . , xn ). The optimal solution to (4.3), denoted by M , is given by M = (X X )−1 X AX (X X )−1 53 (4.4) where Z −1 is pseudo inverse of Z. It is straightforward to verify that M 0 if A 0. Directly using the solution in (4.4) could result in the overfitting of similarity matrix A because of the potential singularity of X X . We address this challenge by a smoothing technique, i.e. Ms = (X X + λmI)−1 X AX (X X + λmI)−1 (4.5) where I is the identity matrix of size d × d and λ > 0 is a smoothing parameter used to address the overfitting and the curse of dimensionality. 4.2.3 Theoretical Analysis We now state the theoretical property of Ms . Let A(Oi , Oj ) be the perfect similarity that outputs 1 when Oi and Oj belong to the same cluster and zero, otherwise. It is straightforward to see that A(Oi , Oj ) = yi yj , where yi ∈ {0, 1}r is the cluster assignment for object Oi . To learn an ideal distance metric from the perfect similarity measure A(Oi , Oj ), we generalize the regression problem in (4.3) as follows min M ∈Rd×d L(M ) = Exi ,xj (xi M xj − A(Oi , Oj ))2 (4.6) −1 BB C −1 , where C = E [x x ] and B = The solution to (4.6) is given by M = CX xi i i X X Exi [xi yi ]. Let Ms be the smoothed version of the ideal distance metric M , i.e. M = (CX + λI)−1 BB (CX + λI)−1 . The following theorem shows that with a high probability, the difference between Ms and Ms is small if both λ and n are not too small. 54 Theorem 2. Assume |x|2 ≤ 1 for the feature representation of any object. Then, with a probability 1 − 3n−3 , we have |Ms − Ms |2 = O ln n √ λ2 n where |Z|2 stands for the spectral norm of matrix Z. Proof. To prove Theorem 2, we need the following theorem for matrix concentration. Lemma 1. (Lemma 2 from [111]) Let H be a Hilbert space and ξ be a random variable on (Z, ρ) with values in H. Assume ξ ≤ M < ∞ almost surely. Denote σ 2 (ξ) = E( ξ 2 ). Let {zi }m i=1 be independent random drawers of ρ. 
For any 0 < δ < 1, with confidence 1 − δ, 1 m m (ξi − E[ξi ]) ≤ i=1 4M ln(2/δ) √ m Using the assumption that |x|2 ≤ 1 and Lemma 1, we have, with a probability 1 − N −3 , 12 ln n 1 X X − CX ≤ √ m n 2 and therefore −1 1 X X + λI − (CX + λI)−1 m ≤ 2 12 ln n √ λ n Second, according to Theorem 1, with a probability 1 − N −3 , we have A = Y Y therefore X AX and = XY Y X . Again, using the matrix concentration theory, we have, 55 with a probability 1 − N −3 , 1 12 ln n XY − B ≤ √ m n 2 Finally, we rewrite |Ms − Ms |2 as Ms − Ms 2 ≤ Ms − −1 1 X X + λI BB CX m 1 X X + λI m −1 1 X X + λI m −1 BB CX − BB −1 XY 1 X X + λI B m m + 2 1 X X + λI BB m 1 X X + λI m −1 − −1 1 X X + λI m 1 X X + λI m −1 XY B m + 2 −1 1 X X + λI + m −1 1 − Ms X X + λI m It is easy to see that with a probability 1 − 3n−3 , each term on the right hand side of the √ n , leading to the result of the theorem. above inequality is bounded by 122ln λ n Given the learned distance metric Ms , we construct a similarity matrix S = X Ms X and then apply a spectral clustering algorithm [110] to compute the final data partition for N objects. 4.3 Experiments In this section, we demonstrate empirically that the proposed semi-crowdsourced clustering algorithm is both effective and efficient. 56 4.3.1 Data Sets Two real-world image data sets are used in our experiments: (i) ImageNet data set is a subset of the larger ImageNet database [39]. The subset contains 6, 408 images belonging to 7 categories: tractor, horse cart, bench, blackberry, violin, saxophone, and hammer. (ii) PASCAL 07 data set is a subset of the PASCAL Visual Object Classes Challenge 2007 database [45]. The subset contains 2, 989 images belonging to five classes: car, dog, chair, cat and bird. We choose these specific image categories because they yield relatively low classification performance in ImageNet competition and PASCAL VOC Challenge, indicating that it could be difficult to cluster these images using low level features without side information. The image features for these datasets were downloaded from the homepages of the ImageNet database1 and the research group of Learning and Recognition in Vision (LEAR)2 , respectively. To perform crowdlabeling, we follow [141], and ask human workers to annotate images with keywords of their choice in each HIT. A total of 249 and 332 workers were employed using the Amazon’s Mechanical Turk [70] to annotate images from ImageNet and PASCAL datasets, respectively. On average, each image is annotated by five different workers, with three keywords from each individual worker. For every HIT, the pairwise similarity between two images (i.e. Aki,j used in Section 4.2.1) is set to 1 if the two images share at least one common annotated keyword and zero, otherwise3 . We note that the crowdsouced annotations of these two data sets are also used to evaluate our crowdclustering algorithm proposed in Chapter 3. 1 http://www.image-net.org/download-features 2 http://lear.inrialpes.fr/people/guillaumin/data.php 3 We tried several other similarity measures (e.g. cosine similarity measure and tf.idf weighting) and found that none of them yielded better performance than the simple similarity measure used in this work 57 4.3.2 Baselines Two baseline methods are used as reference points in our study: (a) the Base method that clusters images directly using image features without distance metric learning, and (b) the Raw method that runs the proposed algorithm against the average similarity matrix A without filtering and matrix completion steps. 
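As a small illustration of how the per-HIT entries A^k_{ij} are derived from the collected keywords, the hypothetical helper below encodes the rule described above: at least one shared keyword yields 1, no shared keyword yields 0, and a missing annotation yields −1.

```python
def pairwise_label(keywords_i, keywords_j, threshold=1):
    """Per-HIT pairwise label A^k_ij from crowdsourced keyword sets:
    1 if the two images share at least `threshold` keywords, 0 if they do not,
    and -1 if either image was not annotated in this HIT."""
    if not keywords_i or not keywords_j:
        return -1
    shared = len(set(keywords_i) & set(keywords_j))
    return 1 if shared >= threshold else 0
```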
The comparison to the Base method allows us to examine the effect of distance metric learning in semi-crowdsourced clustering, and the comparison to the Raw method reveals the effect of filtering and matrix completion steps in distance metric learning. We compare the proposed algorithm for distance metric learning to the following five stateof-the-art distance metric learning algorithms: (a) GDM, the global distance metric learning algorithm [134], (b) RCA, the relevant component analysis [6], (c) DCA, the discriminative component analysis [63], (d) ITML, the information theoretic metric learning algorithm [38], and (e) LMNN, the large margin nearest neighbor classifier [130]. Some of the other stateof-the-art distance metric learning algorithms (e.g. the neighborhood components analysis (NCA) [58]) were excluded from the comparison because they can only work with class assignments, instead of pairwise similarities, and therefore are not applicable in our case. The code for the baseline algorithms was provided by their respective authors (In LMNN, Principal Component Analysis (PCA) is used at first to reduce the data to lower dimensions). For a fair comparison, all distance metric learning algorithms are applied to the pairwise constraints derived from A, the n × n pairwise similarity matrix reconstructed by the matrix completion algorithm. We refer to the proposed distance metric learning algorithm as Regression based Distance Metric Learning, or RDML for short, and the proposed 58 semi-crowdsourced clustering algorithm as Semi-Crowd. 4.3.3 Parameter Selection and Evaluation Similar to the Section 3.4, two criteria are used in determining the values for d0 and d1 in (4.1). First, d0 (d1 ) should be small (large) enough to ensure that most of the retained pairwise similarities are consistent with the cluster assignments. Second, d0 (d1 ) should be large (small) enough to obtain a sufficient number of observed entries in the partially observed matrix A. For both data sets, we set d0 to 0 and d1 to 0.8. Besides, we follow the heuristic proposed in Section 3.4 to determine the parameter C in (4.2), which is selected to generate balanced clustering results. Parameter λ in (4.5) is set to 1. We varied λ from 0.5 to 5 and found that the clustering results essentially remain unchanged. Normalized mutual information (NMI for short) is used to measure the coherence between the inferred clustering and the ground truth categorization. The number of sampled images is varied from 100, 300, 600 to 1, 000. All the experiments are performed on a PC with Intel Xeon 2.40 GHz processor and 16.0 GB of main memory. Each experiment is repeated five times, and the performance averaged over the five trials is reported. 4.3.4 Experimental Results First, we examine the effect of distance metric learning algorithm on semi-crowdsourced clustering. Figure 6.4 compares the clustering performance with six different metric learning algorithms with that of the Base method that does not learn a distance metric. We observed 59 (a) ImageNet data set (b) PASCAL 07 data set Figure 4.3 NMI vs. no. of sampled images (n) used in crowdlabeling. that four of the distance metric learning algorithms (i.e. GDM, ITML, LMNN and the proposed RDML) outperform the Base method, while RCA and DCA fail to improve the clustering performance of Base. We conjecture that the failure of RCA and DCA methods is due to their sensitivity to the noisy pairwise similarities. 
In fact, RCA and DCA can yield better performance than the Base method if all the pairwise similarities are consistent with the cluster assignments. Compared to all the baseline distance metric learning algorithms, RDML, the proposed distance metric learning algorithm, yields the best clustering results for both the data sets and for all values of n (i.e. the number of annotated images) considered here. Furthermore, the performance of RDML gradually stabilizes as the number of sampled images increases. This is consistent with our theoretical analysis in Theorem 2, and implies that only a modest number of annotated images is needed by the proposed algorithm to learn an appropriate distance metric. This observation is particularly useful for crowdclustering as it is expensive to reliably label a very large number of images. Figure 4.4 shows some example image pairs for which the Base method fails to make correct cluster assignments, but the proposed RDML method successfully corrects these mistakes with the learned distance metric. Our next experiment evaluates the impact of filtering and matrix completion steps. In Fig- 60 (a) Two images incorrectly placed (b) Two images incorrectly placed (c) Two images incorrectly grouped in different clusters by the Base in different clusters by the Base into the same cluster by the Base method (similarity 0.16) but correctly method (similarity 0.31) but cor- method (similarity 0.72) but correctly grouped into the same cluster by the rectly grouped into the same cluster clustered to different clusters by the proposed method (similarity 0.66). by the proposed method (similarity proposed method (similarity 0.22) 0.85) Figure 4.4 Sample image pairs that are incorrectly clustered by the Base method but correctly clustered by the proposed method (the similarity of our method is based on the normalized distance metric Ms ). ure 6.4, we compare the clustering results of the proposed algorithm for semi-crowdsourced clustering (i.e. Filtering+Matrix-Completion+RDML) to the Raw method that runs the proposed distance metric algorithm RDML without the filtering and matrix completion steps. Based on these experiments, we can make the following observations: (i) the proposed distance metric learning algorithms performs better than the Raw method, particularly when the number of annotated images is small; (ii) the gap between the proposed semi-crowdsourced clustering method and the Raw method decreases as the sample size increases. These results indicate the importance of filtering and matrix completion steps for the crowdsourced data in semi-crowdsourced clustering. Finally, it is interesting to observe that the Raw method still outperforms all the baseline methods, which further verifies the effectiveness of the proposed algorithm for distance metric learning. Finally, we evaluate the computational efficiency of the proposed distance metric learning algorithm. Table 4.1 shows that the proposed distance metric learning algorithm is significantly more efficient than the baseline approaches evaluated here. The last row of Table 4.1 indicates the run time for the matrix completion step. Since all the distance metric learning algorithms 61 Table 4.1 CPU time (in s) for learning the distance metrics. 
CPU time (s) Sample sizes (n) RDML (proposed) GDM [134] LMNN [130] ITML [38] DCA [63] RCA [6] Matrix Completion ImageNet 100 300 4.2 6.3 11384 14706 59.8 157 2128 2376 8.5 9.2 9.7 13.5 12.4 74.2 Data Set PASCAL 07 Data Set 600 1,000 100 300 600 1,000 8.0 11.2 27.4 34.2 41.7 47.3 18140 25155 26346 36795 44237 53468 330 629 55.1 124 277 527 2692 3081 5311 5721 6104 6653 14.5 20.7 51.2 64.1 72.7 82.3 18.6 23.6 71.4 92.7 103 122 536 1916 12.8 86.6 615 1873 are applied to the similarity matrix recovered by the matrix completion algorithm, the computational cost of matrix completion is shared by all distance metric learning algorithms used in our evaluation. One interesting observation comes from a comparison of Table 4.1 and Table 3.1, which summarizes the running time of doing standard crowdclustering. We observe that, to cluster the same ImageNet Data Set, the proposed semi-crowdsourced clustering algorithm is more than 1, 000 times more efficient than the crowdclustering algorithm proposed in Chapter 3. 4.4 Conclusions In this chapter, we present a semi-crowdsourced clustering framework that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The proposed framework overcomes the limitation of the classical crowdclustering problem, which can only cluster objects when their manual annotations are available. In addition, the proposed semi-crowdsourced clustering method provides a much more efficient way to solve the crowdclustering problem, comparing to the approach 62 proposed in chapter 3. Furthermore, our work also addresses the challenge of learning a reliable distance metric from noisy pairwise constraints. Although many studies on distance metric learning have been reported, one limitation of these earlier studies is that they can only work with a relatively small number (typically less than 30%) of noisy pairwise constraints. In contrast, the proposed distance metric learning approach can handle a significantly larger percentage of pairwise similarities (as many as 80%) that are inconsistent with the true cluster assignments. 63 Chapter 5 Semi-supervised Clustering by Input Pattern Assisted Pairwise Similarity Matrix Completion In this chapter, we focus on the semi-supervised clustering problem. Many semi-supervised clustering algorithms have been proposed to improve the clustering accuracy by effectively exploring the available side information that is usually in the form of pairwise constraints. However, there are two main shortcomings of the existing semi-supervised clustering algorithms. First, they have to deal with non-convex optimization problems, leading to clustering results that are sensitive to the initialization. Second, none of these algorithms is equipped with theoretical guarantee regarding the clustering performance. We address these limitations by developing a framework for semi-supervised clustering based on input pattern assisted matrix completion. The key idea is to cast clustering into a matrix completion problem, and solve it efficiently by exploiting the correlation between input patterns and cluster assignments. Our analysis shows that under appropriate conditions, only O(log N ) pairwise constraints are needed to accurately recover the true cluster partition of N objects. 64 The remainder of the chapter is organized as follows: In Section 5.1, we describe the background and introduce the motivation of the proposed semi-supervised clustering framework. 
Section 5.2 presents the proposed framework for semi-supervised clustering and efficient computational algorithm. Theoretical analysis for the proposed algorithm for semi-supervised learning is presented in Section 5.3. We summarize the results of our empirical studies in Section 5.4 and Section 5.5 concludes with the discussion about the relationship between the proposed semi-supervised clustering approach and the semi-crowdsourced clustering approach proposed in Chapter 4. 5.1 Introduction The objective of semi-supervised clustering algorithms is to search for the optimal data partition that is consistent with both the given pairwise constraints and the input data points to be clustered. Despite the progress, there are two main shortcomings with the available semi-supervised clustering algorithms. First, most semi-supervised clustering algorithms have to deal with non-convex optimization problems, leading to clustering results that are only locally optimal and sensitive to the initialization. Second, although many computational algorithms have been proposed for semi-supervised learning, none of them is equipped with a theoretical guarantee on clustering performance. In particular, it is unknown how the clustering performance is improved with increasing number of pairwise constraints, an issue that is usually referred to as sample complexity in supervised learning [7]. In this chapter, we aim to address these limitations by developing a new framework for semi- 65 supervised learning based on the theory of matrix completion [15]. The proposed framework aims to reconstruct the pairwise similarity matrix, that gives 1 for any two data points in the same cluster and 0 otherwise, based on the given constraints and the input patterns of the objects to be clustered. The proposed framework results in a convex optimization problem and, consequentially, globally optimal solutions. More importantly, the proposed work is equipped with a strong theoretical guarantee: with a high probability, the proposed algorithm can accurately recover the true data partition provided (i) the cluster membership vectors can be well approximated by the top singular vectors of the data matrix, and (ii) the number of pairwise constraints is sufficiently large. In particular, we show that under appropriate conditions, the true data partition can be perfectly recovered by the proposed algorithm with O(rk log N ) pairwise constraints, where N is the number of data points to be clustered, r is the number of clusters, and k is the number of singular vectors used to approximate the cluster memberships. The logarithmic dependence on N makes the proposed algorithm particularly suitable for large-scale data clustering problem. 5.2 Semi-supervised Clustering by Input Pattern Assisted Matrix Completion In this section, we first present a matrix completion based framework for semi-supervised clustering. We then present the proposed algorithm for semi-supervised clustering. 66 5.2.1 A Matrix Completion Framework for Semi-supervised Clustering Let D = {O1 , . . . , ON } be the set of N objects to be clustered, and let X = (x1 , . . . , xN ) be their feature representation, where xi ∈ Rd is a vector of d dimensions. Let M denote the set of must-link constraints where (i, j) ∈ M implies that xi and xj should be in the same cluster, and C denote the set of cannot-link constraints, where (i, j) ∈ C implies that xi and xj belong to different clusters. 
For the convenience of presentation, we also define the set Ω = M ∪ C to include all the pairwise constraints. Let r be the number of clusters, and N_min be the size of the smallest cluster. The objective of semi-supervised clustering is to partition the N data points into r clusters that are consistent with (i) the pairwise constraints in M and C, and (ii) the data matrix X, such that data points with similar input patterns are put into the same cluster. Let u_i ∈ {0, 1}^N be the membership vector of the i-th cluster, where u_{i,j} = 1 if x_j is assigned to the i-th cluster and zero otherwise. Define the pairwise similarity matrix S ∈ {0, 1}^{N×N} as

S = Σ_{i=1}^{r} u_i u_i^⊤

Evidently, S_{i,j} = 1 if x_i and x_j are assigned to the same cluster, and zero otherwise. It is easy to verify that the rank of matrix S is r. The given must-links in M and cannot-links in C provide partial observations of S, i.e. S_{i,j} = 1 if (i, j) ∈ M and S_{i,j} = 0 if (i, j) ∈ C. Since finding the best data partition is equivalent to recovering the binary similarity matrix S, following [73, 141], we cast the semi-supervised clustering problem into a matrix completion problem, i.e. filling in the missing entries of the binary similarity matrix S based on the pairwise constraints in M and C (i.e. the partial observations of S) and the data matrix X. Similar to the standard theory for matrix completion [15], we can accurately recover the binary similarity matrix S because S is of low rank. We, however, note that the matrix completion problem discussed in this work is different from the previous studies of using matrix completion for clustering [73, 141] in that we aim to complete the binary similarity matrix S by utilizing both the observed entries in S and the input patterns in X. It will be shown later, both theoretically and empirically, that by effectively exploring the input patterns in X, the proposed algorithm is able to reduce the sample complexity for matrix completion from O(N [log N]^2) to O(log N), making it possible to apply the proposed algorithm to cluster very large data sets.

5.2.2 Input Pattern Assisted Matrix Completion

In this subsection, we first present input pattern assisted matrix completion for semi-supervised clustering. We then describe an efficient algorithm for solving the related convex optimization problem. In the standard matrix completion theory [15], to reconstruct a matrix Q of size N × N from a subset of observed entries in ∆ ⊆ [N] × [N], we solve the following optimization problem

min_{Q̂ ∈ R^{N×N}}  |Q̂|_tr    s.t.  P_∆(Q̂) = P_∆(Q)        (5.1)

where |·|_tr is the trace norm, and P_∆ : R^{N×N} → R^{N×N} is a linear operator that maps a matrix A to a new matrix P_∆(A) given by

[P_∆(A)]_{i,j} = A_{i,j} if (i, j) ∈ ∆, and 0 if (i, j) ∉ ∆

According to [15], with a high probability, the matrix Q can be perfectly recovered by solving the optimization problem in (5.1) if the number of observed entries in ∆ is O(µ(Q)^2 r(Q) N [log N]^2), where r(Q) is the rank of Q and µ(Q) is the coherence measure of Q. In the case of the binary similarity matrix S, it is easy to verify that the coherence measure µ(S) is bounded by N/[N_min r] and the rank of S equals the number of clusters r. As a result, the number of pairwise constraints required for perfectly recovering the binary similarity matrix S is O(κN [log N]^2), where κ = N/N_min.
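To make the observation operator above concrete, the following minimal NumPy sketch (not part of the thesis; the function names are chosen for illustration) assembles the partially observed binary similarity matrix and the observation mask Ω from must-link and cannot-link index pairs, so that an operator like P_Ω in (5.1) can be applied.

import numpy as np

def observed_similarity(N, must_links, cannot_links):
    # Partial observations of S: S_ij = 1 for (i, j) in M, S_ij = 0 for (i, j) in C.
    S_obs = np.zeros((N, N))
    Omega = np.zeros((N, N), dtype=bool)
    for i, j in must_links:
        S_obs[i, j] = S_obs[j, i] = 1.0
        Omega[i, j] = Omega[j, i] = True
    for i, j in cannot_links:
        S_obs[i, j] = S_obs[j, i] = 0.0
        Omega[i, j] = Omega[j, i] = True
    return S_obs, Omega

def P_Omega(A, Omega):
    # Linear operator that keeps the observed entries of A and zeros out the rest.
    return np.where(Omega, A, 0.0)

For large N the dense mask would of course be replaced by a sparse list of observed index pairs; the dense form is used here only to mirror the notation.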
Even when the data points are evenly distributed over clusters, the number of pairwise constraints required by standard matrix completion thus increases at least linearly in the number of data points to be clustered, making it unscalable to large data sets. We address this limitation by developing a matrix completion approach that explicitly incorporates the data matrix X into the matrix completion process. Let Z = (z_1, . . . , z_k) include the first k left singular vectors of X, where k ≥ r. We make the following crucial assumption about the relationship between X and S:

A3: {u_i}_{i=1}^{r} lie in the subspace spanned by {z_i}_{i=1}^{k}, an assumption similar to the one used by the spectral clustering algorithm [101].

Using assumption A3, we can write S as S = Z M Z^⊤, where M ∈ R^{k×k}. Following the theory of matrix completion, we obtain the optimal M by solving the following optimization problem:

min_{M ∈ R^{k×k}}  |M|_tr    s.t.  P_Ω(Z M Z^⊤) = P_Ω(S)        (5.2)

where Ω ⊆ [N] × [N] includes all the observed entries in S derived from the pairwise constraints in M and C. The following theorem shows the perfect recovery result for (5.2).

Theorem 3. Let µ(Z) be the coherence measure for matrix Z given by

µ(Z) = max_{1≤i≤N} (N/k) [Z Z^⊤]_{i,i}        (5.3)

Define

µ_0 = max( µ(Z), N/(r N_min) ).        (5.4)

For fixed β > 2, define a and B as

a = (1/2)(1 + log_2 k − log_2 r)        (5.5)
B = (512β/3) µ_0 r k ln N        (5.6)

Then, under assumption A3, with a probability 1 − 4(a + 1)N^{−β+1} − 2aN^{−β+2}, M_* = Z^⊤ S Z is the unique optimizer of (5.2) provided |Ω| ≥ aB.

Remark: Compared to the standard matrix completion theory, the sample complexity of input pattern assisted matrix completion is reduced from O(rN [log N]^2) to O(rk log N) if µ_0 = O(1). Thus, if k = Ω(r) and the number of clusters r is small, Theorem 3 implies that O(log N) pairwise constraints are needed in order to obtain the perfect clustering result, provided assumption A3 holds and the coherence measure µ_0 is small.

Evidently, A3 is a strong assumption that usually does not hold in real world applications. We thus relax this assumption by assuming that the cluster membership vectors {u_i}_{i=1}^{r} can be well approximated by the top k singular vectors of X. More specifically, we define the projection operator P_k as P_k = Z Z^⊤, and the projection error for the cluster membership vectors as

E^2 = max_{1≤i≤r} (1/N) ||u_i − P_k u_i||_F^2        (5.7)

Instead of assuming E = 0 as in assumption A3, we assume that E is small enough to allow for an accurate recovery of the binary similarity matrix S. Under this assumption, we modify the optimization problem in (5.2) as follows

min_{M ∈ R^{k×k}}  |M|_tr + (C/2) ||P_Ω(Z M Z^⊤) − P_Ω(S)||_F^2        (5.8)

where the parameter C > 0 is introduced to balance the tradeoff between finding a low rank matrix M and fitting the observed pairwise constraints. The following theorem shows that the binary similarity matrix S can be accurately recovered by (5.8) if (i) the approximation error E is small and (ii) |Ω|, the number of pairwise constraints, is sufficiently large.

Theorem 4. Let M̂ be the optimal solution to (5.8) and Ŝ = Z M̂ Z^⊤ be the reconstructed similarity matrix. For a fixed β > 2, with a probability 1 − 4(a + 1)N^{−β+1} − 2aN^{−β+2}, we have

||Ŝ − S||_F ≤ ν(k, r) E,   where   ν(k, r) = 6 (√(2k) + 4√r)(3 + √r)

provided |Ω| ≥ aB and C ≥ 1/[√r E].

As indicated by the above theorem, with a sufficiently large number of pairwise constraints, we have ||Ŝ − S||_F ∝ E, implying a small difference between Ŝ and S when the cluster membership vectors can be well approximated by the top k singular vectors of X.
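The quantities Z and E that drive Theorem 4 are simple to compute in practice. The sketch below is an illustration rather than the thesis code: it treats X as an N × d array whose rows are the data points, and it uses the 1/N normalization of (5.7) as reconstructed above; the function names are ours.

import numpy as np

def left_singular_subspace(X_rows, k):
    # Z in R^{N x k}: first k left singular vectors of the N x d data matrix.
    Z, _, _ = np.linalg.svd(X_rows, full_matrices=False)
    return Z[:, :min(k, Z.shape[1])]

def projection_error(U, Z):
    # E of (5.7): worst-case distance of a membership vector u_i from span(Z).
    N = U.shape[0]
    P_k_U = Z @ (Z.T @ U)                          # P_k u_i = Z Z^T u_i
    errs = np.sum((U - P_k_U) ** 2, axis=0) / N    # (1/N) ||u_i - P_k u_i||^2
    return float(np.sqrt(errs.max()))

A small value of E indicates that the relaxed assumption holds, in which case Theorem 4 suggests choosing C on the order of 1/(√r E) in (5.8).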
Let M̂ be the optimal solution to (5.8). The estimated binary similarity matrix is given by Ŝ = Z M̂ Z^⊤. Since ||Ŝ − S||_F is small and the eigenvectors of S correspond to the cluster membership vectors, we expect the first r eigenvectors of Ŝ to reveal the clustering structure of the data. As a result, we apply the spectral clustering algorithm to find the best data partition, i.e. we first compute the top r eigenvectors of Ŝ, and then run the k-means algorithm over the computed eigenvectors. To improve the computational efficiency, we apply the spectral clustering algorithm proposed in [25], which reduces the computational cost by the matrix sparsification technique [122] and the Nyström approximation [51].

We finally discuss how to efficiently solve the optimization problem in (5.8). We exploit the fast stochastic subgradient descent (FSGD) method developed in [4]. Define

L(M) = (C/2) ||P_Ω(Z M Z^⊤) − P_Ω(S)||_F^2.

Algorithm 1 Efficient Stochastic Subgradient Descent for Solving the Optimization Problem (5.8)
1: Input:
   • Z ∈ R^{N×k}: first k left singular vectors of X
   • C > 0: loss function parameter
   • r: number of clusters
   • T: number of iterations
   • η_t: step size
2: Initialization: U_0 = 0_{k×r}, Σ_0 = 0_{r×r}, V_0 = 0_{k×r}
3: for t = 0, . . . , T − 1 do
4:    Generate a k × r probing matrix H
5:    Set Û_{t+1} = [U_t Σ_t, B_t], where B_t = (U_t V_t^⊤ + C · Z^⊤ (P_Ω(Z M_t Z^⊤ − S)) Z) H
6:    Set V̂_{t+1} = [V_t, −η_t H]
7:    QR factorization of Û_{t+1}: Û_{t+1} = Q_U R_U
8:    QR factorization of V̂_{t+1}: V̂_{t+1} = Q_V R_V
9:    Compute K = R_U R_V^⊤
10:   SVD decomposition of K: K = M̃ Σ̄_{t+1} Ñ^⊤
11:   Set Ū_{t+1} = Q_U M̃ and V̄_{t+1} = Q_V Ñ
12:   U_{t+1} = Ū_{t+1}(1:k, 1:r)
13:   Σ_{t+1} = Σ̄_{t+1}(1:r, 1:r)
14:   V_{t+1} = V̄_{t+1}(1:k, 1:r)
15:   M_{t+1} = Π(U_{t+1} Σ_{t+1} V_{t+1}^⊤)
16: end for

At each iteration, the proposed algorithm samples a subset of rows from the binary similarity matrix S by introducing a probing matrix H. It then computes an unbiased estimate of the gradient ∇L(M_t), denoted by ∇̃L(M_t), based on the sampled rows. Given the unbiased estimate of the gradient, the solution M_t is updated by M_{t+1} = Π(M_t − η_t ∇̃L(M_t)). Here, Π(A) is a soft thresholding function defined as

Π(A) = Σ_{i=1}^{r} max(λ_i − 1, 0) a_i a_i^⊤,

where (a_i, λ_i), i = 1, . . . , r, are the top r eigenvectors and eigenvalues of A. Algorithm 1 shows the detailed steps of the proposed algorithm, where the notation U(1:k, 1:r) represents the sub-matrix of U that includes the first k rows and the first r columns of U.

5.3 Theoretical Analysis

In this analysis, we will focus on the result for the noisy case, namely where the cluster membership vectors can be well approximated by the top k singular vectors of X although they do not lie exactly in the subspace spanned by those singular vectors. This is the more general case, and the perfect recovery result in Theorem 3 follows immediately from Theorem 4 by setting E = 0. We need to define a few notations before presenting our analysis. We define two linear operators P_T : R^{N×N} → R^{N×N} and P_{T⊥} : R^{N×N} → R^{N×N} as follows:

P_T(A) = P_U A + A P_U − P_U A P_U        (5.9)
P_{T⊥}(A) = (P_k − P_U) A (P_k − P_U)        (5.10)

where P_U = U U^⊤ and P_k = Z Z^⊤. The coherence measure µ for the binary similarity matrix S is given by

µ(S) = (N/r) max_{1≤i≤N} ||P_U e_i||^2 ≤ N/(r N_min)        (5.11)

As a result, we have the following inequality for µ_0 defined in (5.4):

µ_0 = max( µ(Z), N/(r N_min) ) ≥ max( µ(Z), µ(S) )

Our strategy is to first identify the deterministic conditions under which the optimal solution M̂ of (5.8) is close to M_* = Z^⊤ S Z, and then confirm that these deterministic conditions hold with high probability.
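Before turning to the analysis, it may help to see the two post-processing pieces of the above procedure in code: the soft-thresholding operator Π used in step 15 of Algorithm 1, and the spectral rounding of Ŝ = Z M̂ Z^⊤ into r clusters. This is a plain, unoptimized sketch; it uses exact eigendecompositions and scikit-learn's k-means rather than the approximate spectral clustering of [25], and the function names and random seed are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

def soft_threshold(A, r):
    # Pi(A) = sum_i max(lambda_i - 1, 0) a_i a_i^T over the top r eigenpairs of A.
    lam, vecs = np.linalg.eigh((A + A.T) / 2.0)
    idx = np.argsort(lam)[::-1][:r]
    shrunk = np.maximum(lam[idx] - 1.0, 0.0)
    return (vecs[:, idx] * shrunk) @ vecs[:, idx].T

def round_to_partition(Z, M_hat, r, seed=0):
    # Because Z has orthonormal columns, the leading eigenvectors of S_hat = Z M_hat Z^T
    # are Z w for the leading eigenvectors w of M_hat, so only a k x k problem is solved.
    lam, W = np.linalg.eigh((M_hat + M_hat.T) / 2.0)
    idx = np.argsort(lam)[::-1][:r]
    V = Z @ W[:, idx]
    return KMeans(n_clusters=r, n_init=10, random_state=seed).fit_predict(V)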
Theorem 5. Under the assumptions 1. the number of pairwise constraints is sufficiently large, i.e. |Ω| > 512µ20 r(k − r) ln N 3 (5.12) 2. there exists a dual matrix Y ∈ RN ×m satisfied the following condition PΩ (Y ) = Y, PT (Y ) − U U PT ⊥ (Y ) ≤ ≤ r , 2k (5.13) 1 2 3. for any nonzero F ∈ RN ×N satisfying F = Pk F Pk , we have PT (F ) F ≤ γ PT ⊥ (F ) F + 2 PΩ (F ) F , (5.14) where γ is given by γ = 4µ0 (k − r) 2 log N 3|Ω| (5.15) Then, by setting C = √1rE ,, we have |S − S|F ≤ 6 √ √ 2k + 4 r 3+ √ r E Proof. Define S∗ = Z M∗ Z and F = Z M Z − S∗ . Evidently, we have F = PZ F PZ . Using 75 the condition in (5.14), we have PT (F ) F ≤ γ PT ⊥ (F ) F + 2 PΩ (F ) F Let U⊥ be the eigenvectors of PT ⊥ (Z). Evidently, column vectors in U⊥ are orthogonal to the column vectors in U . We have |M |tr = |S|tr ≥ S∗ + Z, U U + U⊥ U⊥ ≥ |S|tr − |S∗ − S|tr + Z, −Y + U U + U⊥ U⊥ √ ≥ |S∗ |tr − 2 2r S∗ − S F − PT (Y ) + U⊥ U⊥ − PT ⊥ (Y ) √ ≥ |M∗ |tr − 2 2rE + PT (F ) F U U − PT (Y ) F + F, U U + (1 − PT (Y ) ) PT ⊥ (F ) F 1 ≥ |M∗ |tr + PT ⊥ (F ) F −γ 2 √ r −2 PΩ (F ) F − 2 2rE 2k r 2k When |Ω| > 512µ20 r(k − r) log N 3 we have |M |tr ≥ |M∗ |tr + −2 PT ⊥ (F ) F 4 √ r PΩ (F ) F − 2 2rE 2k 76 Since C P (Z M Z − S) 2F 2 Ω C ≥ ( PΩ (S∗ − S) F − PΩ (Z) F )2 2 L(M ) = and 1 C≥√ , rE it is easy to verify that √ √ PΩ (Z) F ≤ (12 + 2 r) rE and therefore PT ⊥ (Z) √ r C PΩ (Z) F + 2 2rE + rE 2 2k 2 √ √ ≤ 24 rE 3 + r ≤ 4 2 As a result, we have Z F ≤ PT (Z) F + PT ⊥ (Z) F ≤ (γ + 1) PT ⊥ (Z) F + 2 PΩ (Z) F √ √ √ 2k + 4 r 3 + r E ≤ 6 Thus finishes the proof. The following two theorems are developed to confirm that the conditions specified in Theorem 5 hold with a high probability. 77 Theorem 6. With a probability 1 − 4N −β+1 ,for any Z = 0 satisfying Z = PU ZPU , we have PT (Z) F ≤ γ PT ⊥ (Z) F + 2 PΩ (Z) F where γ is given in (5.15), provided |Ω| ≥ Ω0 and |Ω1 | ≤ Ω0 . To verify if there exists a matrix Y that satisfies the condition in (5.14), we follow [18] and construct Y as follows. We randomly select qΩ0 entries from Ω, and divide the selected entries into q subsets of equal size, denoted by Ω1 , . . . , Ωq , with |Ωi | = Ω0 , i = 1, . . . , q. We generate a sequence of Yt , t = 1, . . . , q as follows N2 Yt = Ω0 where W1 = U U t PΩi (Wi ) i=1 and Wt+1 is defined inductively as Wt+1 = PT (U U − Yt ) N2 P P (Wt ) Ω0 T Ωt N2 PT − P P P Wt Ω0 T Ωt T = Wt − = We construct Y as the last element of the sequence, i.e. Y = Yq . Evidently, we have Y = PΩ (Y ). The following theorems show that Y satisfies the other properties specified in (5.14) 78 Theorem 7. With a probability 1 − 2qN −β+1 , we have PT (Y ) − U U ≤ r 2k if q ≥ a. Theorem 6 and Theorem 7 follows directly from the analysis from [106]. Thus we omit the proofs in this thesis. 5.4 Experiments In this section, we first conduct a simulated study to verify our theoretical claim, i.e. the sample complexity of the proposed semi-supervised clustering algorithm is only logarithmic dependence on N . We then compare the proposed algorithm to the state-of-the-art algorithms for semi-supervised clustering on several benchmark datasets. 5.4.1 Baselines, and Parameter Settings Baselines. We compare the proposed semi-supervised clustering algorithm to the following six state-of-the-art algorithms for semi-supervised clustering, including three constrained clustering algorithms and three distance metric learning algorithms. 
The three constrained clustering algorithms are (a) MPCK-means, the metric pairwise constrained k-means algorithm [12], (b) CCSKL, constrained clustering by spectral kernel learning [88], and (c) PMMC, pairwise constrained maximum margin clustering [144]. The three state-of-the-art distance metric learning algorithms are (d) DCA, the discriminative component analysis [63], (e) LMNN, the large margin nearest neighbor classifier [130], and (f) ITML, the information theoretic metric learning algorithm [36]. In order to examine the effectiveness of pairwise constraints for clustering, we also include the baseline method, referred to as Base, that directly applies the spectral clustering algorithm to cluster the data points without any constraints. We refer to the proposed semi-supervised clustering algorithm as Matrix Completion based Constraint Clustering, or MCCC for short.

Evaluation and Parameter Settings. Normalized mutual information (NMI for short) [29] is used to measure the coherence between the inferred clustering and the ground truth categorization. To determine the parameter C in (5.8), we follow the heuristic used in Sections 3.3 and 4.3 and choose the C that results in the most balanced cluster distribution. Two criteria are used in determining the value of k. First, k should be small enough to make Algorithm 1 efficient. Second, k should be reasonably large to make the projection error relatively small. In our experiments, we set k = min(100, d), where d is the dimensionality of the dataset.

5.4.2 Experiment with Synthesized Data

We first conduct experiments with simulated data to verify that, under assumption A3, the proposed semi-supervised clustering algorithm can perfectly recover the true data partition with only O(log N) sampled pairwise constraints. To this end, for a fixed N, the number of data points to be clustered, we create a partition of five clusters of equal size. Let u_i ∈ {0, 1}^N, i = 1, . . . , 5 represent the cluster membership vectors. The target matrix to be recovered is S = Σ_{i=1}^{5} u_i u_i^⊤. We construct the input pattern matrix X^syn by first generating a Gaussian random matrix G ∈ R^{5×15}, with G_{i,j} drawn independently from a Gaussian distribution N(0, 1), and setting X^syn = U G, where U = (u_1, . . . , u_5). We vary N in the range {5,000, 10,000, 20,000, 50,000, 100,000}. For each N, we search for the smallest number of pairwise constraints that results in the perfect partition (i.e. NMI = 1).

Figure 5.1 The plot of the smallest number of pairwise constraints (PCs) needed for perfect recovery. The correlation coefficient computed by the linear fit is 0.992, indicating a linear dependence of sample complexity on log N.

Figure 5.1 shows that the number of required constraints increases linearly in log N, thus verifying that the sample complexity is logarithmic in the number of data points to be clustered. Another advantage of the proposed algorithm is its scalability to large datasets, since it only requires solving an optimization problem involving a small (k × k, k = Ω(r)) matrix. Table 5.1 summarizes the running time of recovering the synthetic data X^syn of different sizes, with the number of observed pairwise entries set to the minimum required for perfect recovery. We observe that even for N = 100,000, it takes the proposed semi-supervised clustering algorithm less than an hour.
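The synthetic setup described above is easy to reproduce. The following sketch is illustrative only (it assumes N is a multiple of five and uses scikit-learn's NMI implementation); it builds the membership matrix U, the Gaussian factor G, and the input patterns X^syn = U G, after which NMI against the ground-truth labels detects perfect recovery.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def make_synthetic(N, r=5, d=15, seed=0):
    # Five equal-size clusters; X_syn = U G with G_ij ~ N(0, 1), as in Section 5.4.2.
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(r), N // r)
    U = np.eye(r)[labels]                  # N x r cluster membership matrix
    G = rng.standard_normal((r, d))
    X_syn = U @ G                          # N x d input pattern matrix
    return X_syn, labels

def is_perfect_recovery(labels_true, labels_pred):
    # Perfect recovery of the partition corresponds to NMI = 1.
    return normalized_mutual_info_score(labels_true, labels_pred) >= 1.0 - 1e-12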
Table 5.1 Running time (in s) for recovering synthetic data of different size

N          5K      10K     20K     50K      100K
CPU time   24.0    77.1    217     1,086    3,429

5.4.3 Experiment with Benchmark Datasets

We then evaluate the proposed semi-supervised clustering algorithm on multiple benchmark datasets. They are (i) the Mushrooms database 1, which contains 8,124 mushrooms belonging to 2 classes: poisonous or edible; (ii) the RCV1 M2 database, a subset of the RCV1 corpus [82], which is comprised of 4,923 documents belonging to the categories "C15" and "GCAT"; (iii) the COIL5 database, a subset of the larger COIL100 database [98], which is comprised of 360 images belonging to 5 objects; (iv) the Segment database 2, which contains 2,310 random segmentations of 7 outdoor images; (v) the USPS M5 and L5 databases, two subsets of images from the USPS handwritten dataset [68]. Among them, "USPS M5" consists of the first five categories of the USPS dataset and has a total of 5,427 images, while "USPS L5" consists of the last five categories and includes 3,871 images in total; (vi) the MNIST4k database, a subset of the MNIST handwritten digits data set [81]. The subset contains the widely used first 4,000 images, which belong to 10 classes; (vii) the 20 Newsgroups database 3, which contains 18,774 documents belonging to 20 news categories; and (viii) the ImageNet data set, a subset of the larger ImageNet database [39], which is comprised of 6,408 images belonging to 7 categories (i.e. tractor, horse cart, bench, blackberry, violin, saxophone, and hammer). Details of these nine datasets are given in Table 5.2.

1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
2 http://archive.ics.uci.edu/ml/datasets/Image+Segmentation/
3 http://qwone.com/~jason/20Newsgroups/

Table 5.2 Description of Datasets

Name             #Instances   #Features   #Clusters
Mushrooms        8,124        112         2
RCV1 M2          4,923        29,992      2
COIL5            360          1,024       5
Segment          2,310        19          7
USPS M5          5,427        256         5
USPS L5          3,871        256         5
MNIST4k          4,000        784         10
20 Newsgroups    18,774       61,188      20
ImageNet         6,408        1,000       7

We vary the number of randomly sampled pairwise constraints from 2,000 and 4,000 to 6,000 for each data set. We note that we did not run experiments with smaller numbers of pairwise constraints because our theoretical analysis shows that the proposed algorithm is effective only when the number of constraints is sufficiently large. All the experiments are performed on a PC with a Xeon 2.40 GHz processor and 64.0 GB memory. Each experiment is repeated five times, and the performance averaged over the five trials is reported.

Table 5.3 summarizes the performance of the proposed semi-supervised clustering algorithm and the baseline algorithms. We first observe that, although all the semi-supervised clustering algorithms significantly outperform the Base method with sufficiently large numbers of pairwise constraints, generally speaking, the distance metric based algorithms outperform the constrained clustering algorithms. We conjecture that this may be due to the fact that the number of pairwise constraints is large enough to learn a good distance metric such that data points of the same class are separated by a small distance and data points from different classes are separated by a large distance. For the Mushrooms, RCV1 M2, COIL5 and USPS M5 databases, we observe that MCCC, the proposed semi-supervised clustering algorithm, achieves very high NMI values (> 0.9) and outperforms all the baseline methods when the number of constraints is relatively large (i.e. 4,000 and 6,000).
Since the Base method can achieve high NMI scores (> 0.5) on these datasets without utilizing any pairwise constraints, we conjecture that these datasets satisfy our assumption, i.e., the cluster membership vectors can be well approximated by the top eigenvectors of the data matrix. Among such data sets, the experimental results for Mushrooms dataset is very encouraging since only 4, 000 pairwise constraints are needed to achieve more than 0.99 NMI. This only accounts for about 0.012% of all possible pairwise constraints. In contrast, for the datasets 20 Newsgroups and ImageNet, the Base method yields very low NMI scores (i.e. 0.221 for 20 Newsgroups and only 0.148 for ImageNet). Thus it is very likely that the cluster membership vectors of them do not lie in the space spanned by the top singular vectors of their features. For these two challenging datasets, the proposed algorithm yields similar clustering performance as the baseline methods when the number of constraints is relatively small (i.e. 2, 000 and 4, 000) and starts to significantly outperform the baseline methods when the number of constraints reaches 6, 000. This result demonstrates that even when the assumption is violated, the proposed algorithm is still able to yield good clustering performance with sufficiently large numbers of constraints. 5.5 Conclusion and Discussion In this chapter, we propose a framework for semi-supervised clustering based on input pattern assisted matrix completion. The key idea is to cast clustering into a matrix completion problem, and solve it efficiently by exploiting the correlation between input patterns and class assignments. A stochastic subgradient descend method is employed to optimize the convex 84 optimization problem. Since k is usually very small (≤ 100), directly computing the SVD of k × k matrix M is also very fast. This meets our observation that a standard subgradient descend method can also be very efficient. Under the assumption that cluster membership vectors can be well approximated by the top few singular vectors of the data matrix, we show that with an overwhelming probability, the proposed algorithm can accurately recover the true data partition with only O(log N ) randomly sampled pairwise constraints. Our empirical study verifies the effectiveness of the proposed algorithm. We note that the proposed semi-supervised clustering framework is related to the semicrowdsourced clustering framework proposed in Chapter 4. Both of them address the problem of combining features of objects with the pairwise constraints to enhance the clustering performance. The major difference between them is that the semi-crowdsourced clustering uses human annotations that are collected from a small subset of n (n N ) objects. As a result, it only needs O(n log2 n) reliable pairwise constraints to accurately recover the true data partitions. In contrast, the proposed semi-supervised clustering approach exploits a small amount of noiseless pairwise constraints sampled from all of N objects. Then the true data partition can be almost perfectly recovered with only O(log N ) randomly sampled pairwise constraints. The proposed semi-supervised clustering can also be applied to the setting of semi-crowdsourced clustering. By randomly sample a set of object pairs and ask crowd workers for annotation, we can generate a N ×N partially-observed pairwise similarity matrix with all the observed entries to be reliable pairwise constraints that are agreed upon by many crowd workers. 
Then the proposed semi-supervised clustering framework can be used to solve the problem of semi-crowdsourced clustering.

Table 5.3 Average clustering performance of the proposed semi-supervised clustering algorithm (MCCC) and the baseline algorithms (Base, MPCKmeans (MPCK) [12], CCSKL [88], PMMC [144], DCA [63], LMNN [130], and ITML [36]) on nine datasets with 2,000, 4,000 and 6,000 randomly sampled pairwise constraints (PCs)

Datasets        #PCs    MCCC    Base    MPCK    CCSKL   PMMC    DCA     LMNN    ITML
Mushrooms       2,000   0.982   0.540   0.645   0.652   0.876   0.873   0.980   0.971
                4,000   0.991   0.540   0.684   0.786   0.898   0.977   0.982   0.981
                6,000   0.998   0.540   0.713   0.754   0.923   0.988   0.983   0.984
RCV1 M2         2,000   0.844   0.702   0.792   0.814   0.821   0.811   0.802   0.815
                4,000   0.897   0.702   0.846   0.884   0.895   0.867   0.874   0.883
                6,000   0.932   0.702   0.903   0.894   0.886   0.917   0.928   0.924
COIL5           2,000   0.914   0.582   0.909   0.624   0.897   0.818   0.925   0.931
                4,000   0.973   0.582   0.970   0.668   0.955   0.824   0.970   0.968
                6,000   1.000   0.582   1.000   0.737   0.992   0.846   0.998   1.000
Segment         2,000   0.750   0.651   0.693   0.721   0.718   0.723   0.714   0.706
                4,000   0.755   0.651   0.701   0.695   0.734   0.741   0.744   0.740
                6,000   0.774   0.651   0.718   0.684   0.748   0.760   0.751   0.743
USPS M5         2,000   0.901   0.681   0.864   0.869   0.793   0.872   0.890   0.884
                4,000   0.923   0.681   0.883   0.886   0.831   0.896   0.919   0.914
                6,000   0.944   0.681   0.901   0.910   0.887   0.916   0.928   0.931
USPS L5         2,000   0.811   0.521   0.792   0.798   0.789   0.793   0.802   0.808
                4,000   0.830   0.521   0.809   0.817   0.804   0.821   0.819   0.825
                6,000   0.862   0.521   0.833   0.838   0.820   0.848   0.860   0.858
MNIST4k         2,000   0.784   0.435   0.769   0.773   0.760   0.783   0.789   0.785
                4,000   0.817   0.435   0.794   0.802   0.785   0.803   0.809   0.811
                6,000   0.841   0.435   0.823   0.821   0.805   0.829   0.831   0.832
20 Newsgroups   2,000   0.244   0.221   0.243   0.235   0.254   0.164   0.213   0.225
                4,000   0.293   0.221   0.246   0.280   0.291   0.252   0.267   0.270
                6,000   0.323   0.221   0.301   0.313   0.311   0.289   0.302   0.299
ImageNet        2,000   0.196   0.148   0.192   0.099   0.171   0.213   0.194   0.218
                4,000   0.255   0.148   0.172   0.191   0.214   0.258   0.249   0.255
                6,000   0.279   0.148   0.202   0.213   0.256   0.271   0.269   0.265

Chapter 6 A Constant-Time Algorithm for Dynamic Semi-Supervised Clustering

One condition that is often overlooked by existing semi-supervised clustering algorithms is that side information can be generated sequentially and dynamically, a practical setting in numerous real-world applications such as social network and E-commerce system analysis. Given a set of new pairwise constraints, classical semi-supervised clustering algorithms need to re-optimize their objective functions over all the data points and constraints, preventing them from efficiently updating the data partitions. In this chapter, we propose an efficient dynamic semi-supervised clustering framework that casts the clustering problem into a searching problem in a feasible convex space, i.e., a convex hull with its extreme points being an ensemble of multiple data partitions. According to the principle of ensemble clustering, the optimal partition lies in that convex hull and it can be uniquely represented by a low-dimensional simplex vector. This enables us to carry out the dynamic semi-supervised clustering problem as an updating procedure for the simplex vector based on the newly received pairwise constraints. We then derive an efficient algorithm that is able to update the simplex vector (clustering result) in constant time.
Our empirical studies with multiple real-world benchmark datasets show that the proposed algorithm outperforms several state-of-the-art semi-supervised clustering approaches with visible performance gain and significantly reduced running time. The remainder of the paper is organized as follows. In Section 6.1, we describe the settings and introduce the motivation of the proposed dynamic semi-supervised clustering framework. Section 6.2 presents the proposed framework for dynamic semi-supervised clustering followed by an efficient solution using convex searching. We summarize the results of our empirical studies in Section 6.3. Section 6.4 concludes with the future work. 6.1 Introduction Despite the progress of semi-supervised clustering, one issue that is often overlooked is how to efficiently update the clustering results when the pairwise constraints are dynamic, i.e., new pairwise constraints are generated sequentially. This condition is closely related to numerous real-world applications. For example, one common application in social network analysis is to group user communities based on users’ profiles as well as their social connections. If we treat user profiles as features, and connections between users as pairwise constraints, then this application is essentially a semi-supervised clustering problem. Note that user connections in social networks are changing all over the time, it requires an efficient updating of user groups given newly generated connection links. In addition, similar situation also occurs in various real-world E-commerce platforms, which usually group items or customers according to their 88 attributes (features) and dynamic co-purchasing histories (pairwise constraints). We note that although the problem of clustering evolving has been extensively studied, no previous study has investigated the efficiency issue in the dynamic semi-supervised clustering setting. For example, [13, 28] studied the problem of updating clustering results based on various user feedbacks. However, these algorithms are less scalable since they often need to learn distance metrics iteratively. In addition, dynamic network clustering [21, 21, 26, 26, 77, 89, 114] studied the problem of community evolving when a network to be clustered is changing continuously. However, they only use link information to guide clustering procedure and ignore the important attributes of the data points. To address the issue of clustering updating, we propose an efficient dynamic semi-supervised clustering framework for large scale applications. The key idea is to cast the large-scale semisupervised clustering problem into a problem of searching in a low-dimensional convex space. More specifically, the proposed algorithm consists of two components: (i) an offline step for constructing a low-dimensional convex space, and (ii) an online step for efficiently updating clustering results when new pairwise constraints are generated. In the first component, we employ the ensemble clustering technique [113] to generate m ensemble partitions of all the data points to be clustered. Note that the m ensemble partitions can form a convex hull [61] with m extreme points. According to the principle of ensemble clustering [52, 113, 117], the optimal data partition can be approximated by a linear combination of the m ensemble partitions, indicating that the optimal partition should lie in the inner space of that convex hull. 
Since the inner space of a convex hull can be spanned by the linear combinations of the extreme points, the problem of finding the optimal data partition is equivalent to deriving the 89 combination weights, denoted as a m-dimensional simplex vector γ 1 . Given new pairwise constraints generated at time t, this enables us to efficiently update data partitions by computing an updated simplex vector γ t from γ t−1 . Therefore, in the second component of the proposed framework, we design an efficient updating scheme that is able to update the simplex vector in a constant time. Note that the new data partition should be somewhat similar to the old data partition. We cast the simplex updating problem into a learning problem that aims to learn a vector γ t that is close to γ t−1 , and also consistent with the new pairwise constraints. We then present an efficient solver for our learning problem. Compared to the existing approaches of semi-supervised clustering, the proposed algorithm has the following two advantages: (i) by exploring the ensemble clustering technique, the proposed algorithm is able to exploit the strength of different ensemble partitions and at the same time, compensate for their limitations; (ii) by casting the problem of clustering n data points into the problem of learning a m-dimensional vector γ, the proposed algorithm is computationally efficient and the time complexity of updating γ is irrelevant to the number of data points to be clustered. This enables us to update large-scale clustering results in an extremely efficient way. To evaluate the performance of the proposed algorithm, we conduct empirical studies on several large-scale real-world datasets. The experimental results and the comparison with peer methods verify both the efficiency and effectiveness of the proposed algorithm. 1A simplex vector is a vector whose elements are non-negative and the summation of all the elements equals to 1 90 6.2 Semi-supervised Clustering with Dynamic Constraints In this section, we first present a more general framework for semi-supervised clustering, then discuss the proposed efficient dynamic semi-supervised clustering algorithm. 6.2.1 Semi-supervised clustering Let X = (x1 , . . . , xN ) be a set of N data points to be clustered, where each data point xi ∈ Rd , i ∈ [N ] is a vector of d dimensions. Let Mt be the set of must-link constraints generated till time t, where each must-link pair (xi , xj ) ∈ Mt implies that xi and xj should be in the same cluster. Similarly, let Ct be the set of cannot-link constraints generated till time t, where each cannot-link pair (xi , xj ) ∈ Ct implies that xi and xj should belong to different clusters. For the ease of presentation, we also define Ωt = Mt ∪ Ct to include all pairwise constraints generated till time t. Similar to most studies on data clustering, we assume that the number of clusters r is given as a priori. Throughout this paper, we use a binary matrix F ∈ {0, 1}N ×r to represent the result of partitioning N data points into r clusters, where Fij = 1 indicates that xi is associated with the j-th cluster. We further denote F as the set of all possible clustering results F = {F ∈ {0, 1}N ×r : F∗i F∗j = 0 ∀i = j, Fk∗ = 1 ∀k}. Let κ(x, x ) be a kernel function used to measure the similarity between two data points x ×N be the kernel matrix, and let K and x . Let K = [κ(xi , xj )] ∈ RN i,∗ be the i-th row + vector of K. 
The goal of semi-supervised learning is to find the clustering result that is 91 consistent with both the kernel similarities in K and pairwise constraints in Ωt . To measure the discrepancy between the kernel similarity K and a clustering result F , we define the distance between K and F as N Fi,∗ K −1 Fj,∗ = tr F K −1 F d(K, F ) = (6.1) i=1 As indicated by the above measure, the smaller the distance d(K, F ), the better the consistency between the clustering result F and the similarity matrix K. We note that an alternative approach is to measure the distance by tr(F LF ) where L = diag(K1) − K is the graph Laplacian. To measure the inconsistency between the clustering result F and pairwise constraints, we introduce two loss functions, one for must-links and one for cannot-links. More specifically, given a must-link (xi , xj ) ∈ Mt , we define the loss function − (Fi,∗ , Fj,∗ ) as − (Fi,∗ , Fj,∗ ) = Fi,∗ − Fj,∗ 22 (6.2) Similarly, given a cannot-link (xi , xj ) ∈ Ct , we define the loss function + (Fi,∗ , Fj,∗ ) as + (Fi,∗ , Fj,∗ ) = Fi,∗ + Fj,∗ 22 (6.3) We note that a loss function similar to (6.3) has been used in label propagation with cannotlinks [92]. The justification of using loss function + (·, ·) for cannot-links is provided in the following proposition. Proposition 1. Let a ∈ Rd+ be a fixed vector. Let b∗ be the minimizer to the following 92 optimization problem min b∈Rd + , b q ≤R + (a, b) where q ≥ 1. Then we have b∗ ⊥ a. As indicated by Proposition 1, by minimizing the loss function + (Fi,∗ , Fj,∗ ), the resulted solution, under no other constraints, will satisfy Fj,∗ ⊥ Fi,∗ , implying that xi and xj are assigned to different clusters. Using the distance measure d(K, F ) and the loss functions, we can cast semi-supervised clustering into the following optimization problem + (Fi,∗ , Fj,∗ ) + min F ∈F (xi ,xj )∈Ct (xi ,xj )∈Mt s. t. d(F, K) ≤ ε, where the threshold − (Fi,∗ , Fj,∗ ) (6.4) decides the level of consistency between the clustering results and the kernel similarity. The main challenge of dynamic semi-supervised clustering arises from the fact that pairwise constraints are dynamically updated over time. A naive approach is to solve the optimization problem in (6.4) from the scratch whenever pairwise constraints are updated. A more efficient approach is to explore the fact that only a small portion of pairwise constraints are updated at each time, leading to a small change in the clustering result. Based on this intuition, we can use the existing solution as an initial solution to the optimization problem with updated constraints. Despite the simplicity, this approach can significantly reduce the running time, and has been widely used in clustering social networks with dynamic up93 dates [21, 26]. Although this simple approach reduces the number of iterations due to the appropriate initialization, it still needs to solve the optimization problem in (6.4), which is computationally expensive when the number of data points N is very large. To address this challenging issue, we propose an efficient dynamic semi-supervised clustering algorithm that it highly attractive for clustering large-scale data sets. 6.2.2 A Constant Time Algorithm for Dynamic Semi-supervised Clustering The proposed algorithm is based on the key observation that the number of different clustering results F in the subspace ∆ = {F ∈ F : d(K, F ) ≤ ε} is not large when ε is sufficiently small and the eigenvalues of K follow a skewed distribution, where d(K, F ) = tr(F K −1 F ). 
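To keep the ingredients of formulation (6.4) concrete before the analysis that follows, the sketch below spells out the consistency measure d(K, F) = tr(F^⊤ K^{-1} F) and the two constraint losses (6.2) and (6.3). It is only an illustration of these definitions (the function names are ours), not the thesis implementation.

import numpy as np

def kernel_distance(K, F):
    # d(K, F) = tr(F^T K^{-1} F): small when the partition F agrees with the kernel K.
    return float(np.trace(F.T @ np.linalg.solve(K, F)))

def must_link_loss(F, i, j):
    # l^-(F_i, F_j) = ||F_i - F_j||^2: zero iff x_i and x_j share a cluster.
    return float(np.sum((F[i] - F[j]) ** 2))

def cannot_link_loss(F, i, j):
    # l^+(F_i, F_j) = ||F_i + F_j||^2: minimized when F_i and F_j are orthogonal.
    return float(np.sum((F[i] + F[j]) ** 2))

def constraint_loss(F, must_links, cannot_links):
    # Total inconsistency of F with the pairwise constraints, as in objective (6.4).
    return (sum(must_link_loss(F, i, j) for i, j in must_links)
            + sum(cannot_link_loss(F, i, j) for i, j in cannot_links))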
To this end, we denote by λ1 , . . . , λN the eigenvalues of K ranked in the descending order, and by v1 , . . . , vN the corresponding eigenvectors. {λi } follows a q-power law if there exists constant c such that λk ≤ ck −q , where q > 2. The following lemma summarizes an important property of K when its eigenvalues follow a q-power law. Lemma 2. Define V = (v1 , . . . , vN ) and ξ = V x for a unit vector |x| = 1. If x K −1 x ≤ ε, we have |ξ|1 √ ≤ εN |ξ|2 1+ 2 q−2 provides that eigenvalues of K follow a q-power law. 94 Proof. Since N 2 λ−1 i ξi ≤ ε, x K −1 x = i=1 we have ξk ≤ √ cεk −q/2 , k = 1, . . . , N and therefore |ξ|1 ≤ √ N k −q/2 ≤ cε √ εc 1 + k=1 2 q−2 We complete the proof by using the fact |x|2 = 1 and c ≤ n. The above lemma shows that when the eigenvalues of K follow a power law, V x is a 1 sparse vector if x K −1 x ≤ ε. The observation provides the key foundation for our analysis. In order to show that the number of significantly different partitions in ∆ is small, we first consider the simple case where the number of classes is 2. In this case, we can simplify the domain ∆ as ∆2 = v ∈ {−1, +1}N : v K −1 v ≤ N ε We define θN (ρ) the maximum number of partitions in ∆2 such that the difference between any two partition vectors v1 and v2 is at least ρN . The theorem below bound θN (ρ). Theorem 1. θN (ρ) ≤ 2d Cs/[2ρ] s where C is an universal constant and √ s= εN 1+ 95 2 q−2 (6.5) Proof. We first notice that |v|2 = √ N . As the first step, we relax ∆2 into ∆2 as follows ∆2 = x ∈ Rd : |x|2 ≤ 1, x K −1 x ≤ ε Define γN (K, δ) the maximum number of vectors such that the distance between any two vectors is at least δ. Since the distance between any two partitions that differs by at least ρN entries is at least √ 2ρ θN ≤ γN ∆2 , 2ρ Using Lemma 3.4 from [104] to bound, we have 2 γn ∆2 , √ N ≤ exp 2d Cs log 2ρ s = 2d Cs/[2ρ] s where s is defined in (6.5). It is straightforward to extend the above result for two-way clustering into multiple-way clustering. Define θN (ρ; r) be the maximum number of partitions in ∆ for r-way clustering such that the difference between any two partitions is at least ρN . Then we have θN (ρ; r) ≤ Cs(r−1)/[2ρ] 2d (r − 1)s Based on the above results, there is a relatively small number of significantly different clustering results in the subspace ∆. Hence, to improve the computational efficiency of dynamic semi-supervised clustering, a natural thought is to pre-compute all the possible clustering results in ∆, and find the best clustering result in ∆ that is consistent with most of the 96 dynamically updated pairwise constraints. The main shortcoming of this approach is that it is computationally infeasible to identify all the different clustering results in ∆. To address this problem, we propose to construct a convex hull ∆ ⊂ F to approximate different clustering results in ∆. The key advantage of using convex hull approximation is that all the solutions in ∆ can be represented by a linear combination of the extreme points of the convex hull. As a result, instead of searching for the best clustering result, we only need to compute the linear combination weights. Since the number of combination weights to be determined equals to the number of extreme points, which is much smaller than the number of data points to be clustered, it is significantly more efficient to compute combination weights than to directly estimate the best clustering result. More specifically, the proposed learning process is comprised of an offline step and an online step. 
In the offline step, we compute multiple partitions for the data points in X using the ensemble clustering technique. These clustering results will be used to construct the convex hull ∆ that approximates the solutions in ∆. In the online step, an efficient learning algorithm is developed to update the combination weights based on the newly received pairwise constraints. Below, we describe in detail the two key steps of the proposed algorithm for dynamic semi-supervised clustering. 6.2.3 Offline Step: Ensemble Clustering In this step, we generate a convex hull ∆ ⊂ F using the technique of ensemble clustering [113]. Ensemble clustering is motivated by the fact that different clustering algorithms have their own merits, as well as their own limitations. Thus no single clustering algorithm 97 is universally better than others for all types of data. Using the idea of ensemble clustering, we will create m different partitioning results of the data, and construct a convex hull based on the computed data partitions to approximate the feasible clustering results in ∆. According to [54], multiple partitions of a same dataset can be generated by (i) running different clustering algorithms [48], (ii) running the same algorithm with different initializations and parameters [47, 116], (iii) clustering via sub-sampling the data repeatedly [96, 118], and (iv) clustering via projecting the data onto different subspaces [42,48]. In order to efficiently generate m ensemble partitions for a large-scale data set, we follow the last approach by first randomly selecting m different subsets of attributes in X and then applying the approximate kernel k-means algorithm [27] to each subset of selected features to create m data partitions. We denote by P = {P1 , P2 , . . . , Pm } the m realigned ensemble partitions processed by the Hungarian algorithm [78], where each partition Pl ∈ F, l ∈ [m] divides X into r disjoint subsets. Using m partitions of dataset X , we then construct a convex hull m ∆= F| F = γi Pi , γ 0, 1 γ = 1 , i=1 where each clustering solution F is represented by a simplex vector γ ∈ Rm . Two key observations are made here: (i) by generating multiple partitions using a kernel k-means algorithm, which is shown to be closely related to spectral clustering [40], all the ensemble partitions should satisfy the condition d(K, Pi ) ≤ ε, ∀i ∈ [m]. This indicates that the constructed convex hull ∆ is a subspace of ∆; and (ii) according to the 98 Figure 6.1 The offline step: generate a convex hull using the technique of ensemble clustering widely used median partition based ensemble clustering approaches [52,113,117], the optimal data partition should be similar to all the ensemble partitions P1 , . . . , Pm , indicating that the optimal partition should lie in the inner space of ∆ and it can be uniquely represented by a m-dimensional simplex vector γ. Using the convex hull representation, we cast a problem of clustering N data points into the problem of finding a m-dimensional simplex vector γ. More specifically, instead of finding the best F , we will solve the following optimization problem to find the optimal γ min γ∈Rm + + (Fi,∗ , Fj,∗ ) + (xi ,xj )∈Mt − (Fi,∗ , Fj,∗ ) (xi ,xj )∈Ct m s. t. γ 1 = 1, F = γi Pi (6.6) i=1 Since m = O(1) is a small number independent of the data size N , the optimization problem in (6.6) can be solved efficiently with a constant running time 1 . 
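As a concrete illustration of this offline step, the sketch below generates m partitions by clustering random feature subsets with plain k-means (the thesis uses the approximate kernel k-means of [27]) and realigns their cluster labels with the Hungarian algorithm; any simplex vector γ then yields a candidate solution F = Σ_l γ_l P_l inside the convex hull. The function names, the subsampling fraction, and the use of scikit-learn and SciPy are illustrative assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def align_to_reference(P, P_ref):
    # Permute the columns of the N x r partition matrix P to best match P_ref
    # (Hungarian algorithm on the cluster co-occurrence counts).
    overlap = P_ref.T @ P
    _, col = linear_sum_assignment(-overlap)
    return P[:, col]

def ensemble_partitions(X_rows, r, m, frac=0.1, seed=0):
    # Offline step sketch: m realigned partitions P_1, ..., P_m from random feature subsets.
    rng = np.random.default_rng(seed)
    N, d = X_rows.shape
    d_sub = max(1, int(frac * d))
    partitions = []
    for t in range(m):
        feats = rng.choice(d, size=d_sub, replace=False)
        labels = KMeans(n_clusters=r, n_init=5, random_state=t).fit_predict(X_rows[:, feats])
        P = np.eye(r)[labels]
        partitions.append(P if not partitions else align_to_reference(P, partitions[0]))
    return partitions   # F(gamma) = sum_l gamma_l * partitions[l] for any simplex gamma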
When no pairwise constraints 1 We note that the ensemble partitions are computed offline and therefore do not affect the efficiency of online computation, which is the main concern of this work 99 is generated, the clustering problem reduces to a standard ensemble clustering problem. The 1 ,..., 1 ) clustering result in terms of simplex γ, denoted as γ 0 , can be simply set to ( m m or computed by some existing ensemble clustering algorithms such as [113, 117]. Figure 6.1 shows the conceptual framework of the offline step. In the next subsection, we introduce the online step for efficiently updating the data partition (i.e. the simplex vector) when pairwise constraints are dynamically updated over time. 6.2.4 Online Step: Efficient Updating Simplex Vector Although the formulation in (6.6) already significantly reduces the overall computational cost, it is still undesirable to solve the optimization problem in (6.6) from the scratch whenever new pairwise constraints are introduced. This is particularly true if for most of the time, only a few constraints are added to the system. In this subsection, we develop an online step to efficiently update the simplex vector when the number of newly added constraints is small. To simplify the presentation, we divide Mt , the set of must-link constraints received till time t, into two subsets: Mat that includes all the must-link constraints received before time t and Mbt that includes the new must-link constraints added at time t. Similarly, we divide the cannot-link set Ct into Cta and Ctb . We also denote by γ 1 , . . . , γ T the sequence of simplex vectors computed based on the updated constraints. 100 Using the above notation, we rewrite the optimization problem in (6.6) as min γ∈Rm + Lat (F (γ)) + Lbt (F (γ)) m s. t. γ 1 = 1, F = γi Pi (6.7) i=1 where Lat (F ) = + (Fi,∗ , Fj,∗ ) + − (Fi,∗ , Fj,∗ ) (xi ,xj )∈Ma t Lbt (F ) = (xi ,xj )∈Cta − (Fi,∗ , Fj,∗ ) + (Fi,∗ , Fj,∗ ) + (xi ,xj )∈Ctb (xi ,xj )∈Mbt Since γ t−1 approximately minimizes the objective Lat (F (γ)), we can approximate Lat (F (γ)) as Lat (F (γ)) ≈ Lat (F (γ t−1 )) + λ γ − γ t−1 2 and as a result, the optimization problem in (6.7) is further simplified as min γ∈Rm + Lbt (F (γ)) + λ γ − γ t−1 2 m s. t. γ 1 = 1, F = γ i Pi , (6.8) i=1 where parameter λ is introduced to balance between two objectives, i.e., ensuring that the learned γ is close to γ t−1 , and also consistent with the new pairwise constraints. Compared to (6.7), the main advantage of using (6.8) is that it only involves the new constraints that are added to the system at time t and does not need to store and work with the constraints 101 received before time t. In the following, we discuss how to efficiently update the simplex vector with the new pairwise constraints. By representing F as a linear combination of {Pi }m i=1 , we rewrite the optimization problem (6.8) as m + γ k [Pk (i, :) − Pk (j, :)]||2 || minγ∈Rm f (γ) = (xi ,xj )∈Mbt m γ k [Pk (i, :) + Pk (j, :)]||2 || + k=1 (xi ,xj )∈Ctb k=1 +λ||γ − γ t−1 ||2 s. t. (6.9) 1 γ = 1, where Pk (i, :) and Pk (j, :) represent the i-th and the j-th row of the matrix Pk . The optimization problem (6.9) can be efficiently solved by a gradient descend method. Specifically, we update the γ by γ = P∆ (γ t−1 − η∇f (γ t−1 )), where η is a step size and P∆ : Rm → Rm is a projection operator that takes a m-dimensional vector as input and outputs its projected simplex vector, as described in [43]. 
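The online step can be summarized in a few lines. The sketch below is again illustrative rather than the thesis implementation: it takes a single projected-gradient step at γ^{t-1}, in which case the proximity term λ||γ − γ^{t-1}||^2 contributes nothing to the gradient; the simplex projection follows the standard sorting procedure in the spirit of [43], the gradient matches the closed form stated in the next paragraph, and the final function performs the rounding to a hard partition described later in this subsection.

import numpy as np

def project_to_simplex(v):
    # Euclidean projection of v onto {gamma : gamma >= 0, 1^T gamma = 1} by sorting.
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, v.size + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def update_gamma(gamma_prev, partitions, new_must, new_cannot, eta):
    # One projected-gradient step for (6.9), with the gradient evaluated at gamma^{t-1}.
    grad = np.zeros_like(gamma_prev)
    for i, j in new_must:
        U_ij = np.stack([P[i] - P[j] for P in partitions])   # m x r, rows P_k(i,:) - P_k(j,:)
        grad += 2.0 * (U_ij @ (U_ij.T @ gamma_prev))
    for i, j in new_cannot:
        V_ij = np.stack([P[i] + P[j] for P in partitions])   # m x r, rows P_k(i,:) + P_k(j,:)
        grad += 2.0 * (V_ij @ (V_ij.T @ gamma_prev))
    return project_to_simplex(gamma_prev - eta * grad)

def round_partition(gamma, partitions):
    # Soft labeling matrix P = sum_l gamma_l P_l; row-wise argmax gives the hard partition.
    P_soft = sum(g * P for g, P in zip(gamma, partitions))
    return np.argmax(P_soft, axis=1)

Every quantity touched by update_gamma is m- or (m × r)-dimensional, which is what makes the per-update cost independent of the number of data points N.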
Note that ∇f (γ t−1 ) has a closed-form solution as ∇f (γ t−1 ) = 2( γ t−1 U (ij) U (ij) + (xi ,xj )∈Mbt γ t−1 V (ij) V (ij) (xi ,xj )∈Ctb +λγ t−1 − λγ t−2 ), (6.10) 102 where U (ij) and V (ij) are two m × r matrices, satisfying U (ij) (k, :) = Pk (i, :) − Pk (j, :), V (ij) (k, :) = Pk (i, :) + Pk (j, :). Then the optimal solution of γ is given by γ = P∆ {γ t−1 [Im − 2η( U (ij) U (ij) + (xi ,xj )∈Mbt −2λη(γ t−1 − γ t−2 )}, V (ij) V (ij) )] (xi ,xj )∈Ctb (6.11) where Im is the m×m identity matrix. Since we can precompute U (ij) U (ij) and V (ij) V (ij) for each pair (i, j) offline and store the results in a server, we can efficiently update γ using equation (6.11). Despite a low time complexity of the above updating scheme, it however requires a space complexity of O(N 2 m2 ) to store all the matrices, which is expensive when the number of objects N is large. In the following, we discuss how to reduce the expensive storage cost by relaxing the optimization procedure. Note that the k-th row of the matrix U (ij) should equal to either of these two cases: (i) containing all zero entries if the ensemble partition Pk put object i and object j in the same cluster, or (ii) containing one positive entry (=2) , and one negative entry (=−2) if the ensemble partition Pk put object i and object j in different clusters. Then the diagonal elements of the matrix U (ij) U (ij) either equal to 0 or equal to a positive value (=8). Thus 103 the matrix U (ij) U (ij) Im − 2η (xi ,xj )∈Mbt essentially suggests us to assign less weights to the ensemble partitions that mistakenly assign the object i and object j in different clusters when they share a must-link connection. Likewise, the matrix V (ij) V (ij) Im − 2η (xi ,xj )∈Ctb essentially suggests us to assign less weights to the ensemble partitions that mistakenly assign the object i and object j in the same cluster when they share a cannot-link constraint. After updating γ from γ t−1 , the ensemble partitions that are consistent with the new pairwise constraints are assigned with larger weights, while the ensemble partitions that are not consistent with the new pairwise constraints are assigned with smaller weights. This leads to a relaxed updating procedure γ = P∆ {(1 − 2λη)γ t−1 + 2ληγ t−2 − ηγ t−1 ◦ a(ij) } (6.12) (xi ,xj )∈ {Mbt ∪Ctb } where a(ij) is a m-dimensional vector with the k-th element equaling to 1 if the ensemble partition Pk is not consistent with the pairwise constraints (xi , xj ), and 0 otherwise. Given the learned simplex vector γ, we can compute a soft labeling matrix as a linear combination of the ensemble partitions P = γ 1 P 1 + γ 2 P 2 + . . . + γ m Pm . 104 Figure 6.2 The online step: efficiently update clustering results when new pairwise constraints are generated Then the hard partition can be easily obtained by rounding, i.e., assign the i-th data point to the k-th cluster if Pik is the largest entry in the i-th row of the matrix P . The time complexity to update the simplex vector γ is only O(m), which is a constant time irrelevant to the number of data points. Although we need to go through each row of the matrix P for final partitioning, it is still very efficient since we only need to find the largest element in each row of P . Figure 6.2 depicts the steps of online updating. 6.3 Experiments In this section, we present extensive empirical evaluations of the proposed dynamic semisupervised clustering algorithm, namely Constrained Clustering by Convex Searching (CCCS for short). 
In particular, we aim to address the following questions in our study: 1. How does the number of extreme points m influence our performance? 2. Does the proposed semi-supervised clustering method outperform the state-of-the-art algorithms for semi-supervised clustering? 105 3. Is the proposed semi-supervised clustering method significantly more efficient than the state-of-the-art algorithms for semi-supervised clustering? 6.3.1 Experimental Setup Datasets. In order to examine the effectiveness of the proposed dynamic semi-supervised clustering algorithm, nine public benchmark datasets are used in the experiments: • USPS [68] is a widely used handwritten digit database containing 9, 298 handwritten images. Each image is a 256 dimensional vector that belongs to one of 10 classes. • RCV1 is a subset of text documents from the RCV1 corpus [82]. It contains 193, 844 documents that belongs to one of 103 classes. • 20 News 1 is a well-known database that contains 18, 774 documents belonging to 20 news categories. • BlogCatalog is a social blog directory dataset that was crawled from BlogCatalog2 by Wang et. al [127]. This dataset contains a total of 19, 664 bloggers that are grouped into 60 categories. Each blogger is represented as a 5, 413-dimensional vector with each dimension denoting one semantic tag. • TDT2 3 is a text dataset that contains 9, 394 documents belonging to one of 30 classes. 1 http://qwone.com/ ˜jason/20Newsgroups/ 2 http://www.blogcatalog.com 3 http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html 106 • MNIST is also a well-known handwritten digits database1 . It contains a total of 70, 000 handwritten digits which belong to one of 10 classes. • YouTube [103] is a dataset retrieved using the YouTube Data API2 from Oct. 31st, 2011 to Jan. 17th, 2012. This dataset has 2, 860, 264 users and their comments on 6, 407 YouTube videos. All the users are belong to one of the two imbalanced classes: 177, 542 spam users or 2, 682, 722 non-spam users. • BBC 3 is a database consisting of BBC news during the years of 2004 and 2005. It contains 2, 225 news with 9, 636 distinct words, categorized into five topics: business, politics, tech, sport, and entertainment. • Network Intrusion [112] is a database containing 4, 897, 988 patterns representing TCP dump data of network traffic for a local-area network (LAN). Each pattern is a 50-dimensional vector that belongs to one of 10 different classes. Parameter Selection. In order to generate m ensemble partitions, we need to randomly sample d˜ out of d features in each time. Two criteria are used in determining the values ˜ First, d˜ should be small enough to make the ensemble partitions diverse. Second, d˜ of d. should be reasonably large to get good enough ensemble partitions since the quality of the starting point γ 0 is rely on the quality of m ensemble partitions. In our experiments, we set d˜ = [d/10]. The parameter λ in problem (6.9) is introduced to balance the tradeoff between the change of the simplex vector and the fitness over the new pairwise constraints. Note that 1 http://yann.lecun.com/exdb/mnist/ 2 http://code.google.com/apis/youtube/getting_started.html 3 http://mlg.ucd.ie/datasets/bbc.html 107 the term ||γ − γ t−1 ||2 is upper bounded by 2, while the term m γ k [Pk (i, :) − Pk (j, :)]||2 || (xi ,xj )∈Mbt k=1 and m γ k [Pk (i, :) + Pk (j, :)]||2 || (xi ,xj )∈Ctb k=1 are upper bounded by 2 × |Mbt | and 2 × |Ctb |, respectively. 
In order to make two terms comparable, we set λ to be |Mbt ∪ Ctb |, the number of pairwise constraints generated at time t. 6.3.2 Parameter Sensitivity of m A key factor that influences the performance of the proposed algorithm is the number of ensemble partitions m, which introduces a trade-off between clustering quality and efficiency. As m increases, the clustering quality tend to improve at the cost of increased computational time. In order to analyze how the clustering performance is influenced by different m, we first conduct experiments with two benchmark datasets, BBC and Network Intrusion. We choose these two datasets since they are the smallest and the largest testbeds in our study. By this means, we can also analyze whether the selection of m is influenced by the data size N . To this end, for both of the two datasets, we begin with the unsupervised data partition γ 0 generated from m = {10, 20, 30, 40, 50} different ensemble partitions. Then we randomly generate 100 pairwise constraints based on the ground truth information of the data sets in each tier, and apply the proposed CCCS algorithm to update the partition. We repeat this 108 (a) BBC data set (b) Network Intrusion data set Figure 6.3 NMI vs. the number of ensemble partitions m = {10, 20, 30, 40, 50} with different number of pairwise constraints. step for 10 times, until the total number of pairwise constraints achieves 1, 000. Figure 6.3 shows how the performance of the proposed dynamic semi-supervised clustering algorithm is changed with different m and different number of pairwise constraints. From this figure, we first observe that the clustering performance keeps increasing with more and more pairwise constraints being provided. In addition, the clustering performance was improved when m increases. We conjecture that the performance gain may be due to the fact that a larger m indicates a larger searching space and also a larger overlap between the the ˜ In addition, a larger number of ensemble true searching space ∆ and the relaxed space ∆. partitions usually provide a more diverse clustering results, leading to a higher chance to find data partitions that are consistent with most of the pairwise constraints. Furthermore, the performance of the proposed semi-supervised clustering algorithm gradually stabilizes as m increases to 50. This is also consistent with the intuition since when the number of ensemble partitions are already large enough, adding more partitions cannot provide more information since it is likely that the new data partitions are coincide with some existing partitions. This provides us a guidance to appropriately choose m: on one hand, m should be reasonably large to provide a diverse and large enough convex searching space, while on 109 the other hand, m should be relatively small to reduce the computational cost to update the partitions. In our experiments, we set m = 50 for all the remaining experiments. 6.3.3 Comparison with Other Baseline Algorithms To examine the effectiveness and efficiency of the proposed semi-supervised clustering algorithms, we compare it to the following seven state-of-the-art algorithms for semi-supervised clustering, including three constrained clustering algorithms and four distance metric learning algorithms. The three constrained clustering algorithms are (a) MPCK-means, the metric pairwise constrained k-means algorithm [12], (b) CCSKL, constrained clustering by spectral kernel learning [88], and (c) PMMC, pairwise constrained maximum margin clustering [144]. 
6.3.3 Comparison with Other Baseline Algorithms

To examine the effectiveness and efficiency of the proposed semi-supervised clustering algorithm, we compare it to the following seven state-of-the-art algorithms for semi-supervised clustering, including three constrained clustering algorithms and four distance metric learning algorithms. The three constrained clustering algorithms are (a) MPCK-means, the metric pairwise constrained k-means algorithm [12], (b) CCSKL, constrained clustering by spectral kernel learning [88], and (c) PMMC, pairwise constrained maximum margin clustering [144]. The four state-of-the-art distance metric learning algorithms are (d) RCA, relevant component analysis [6], (e) DCA, discriminative component analysis [63], (f) LMNN, the large margin nearest neighbor classifier [130], and (g) ITML, the information-theoretic metric learning algorithm [36]. The code for the baseline algorithms was provided by their respective authors. For a fair comparison, all the parameters used in the baseline algorithms were set to their suggested values (if applicable).

In our experiments, we begin with 100 randomly generated pairwise constraints, denoted as tier t1. In each of the following tiers, another set of 100 randomly sampled pairwise constraints is generated, and all the semi-supervised clustering algorithms are called to update the data partition given the newly generated pairwise constraints. Specifically, we rerun all the baseline semi-supervised clustering algorithms by combining the new pairwise constraints with the old ones. We repeat such steps from tier t1 to tier t10, finally leading to a total of 1,000 randomly sampled pairwise constraints. All the experiments are performed on a server with an Intel Xeon 2.4 GHz processor and 64 GB of main memory. Each experiment is repeated ten times, and the average clustering performances are reported in Figure 6.4.

We first observe that, compared to all the baseline algorithms, the proposed dynamic semi-supervised clustering algorithm CCCS yields the best performance on most of the datasets (USPS, RCV1, 20 News, BlogCatalog, MNIST, YouTube, and Network Intrusion). Specifically, when given a small number of pairwise constraints, our method outperforms the compared methods with a significant performance gain. This is because, by generating a convex hull from a set of ensemble partitions, we dramatically reduce the possible searching space, and it is expected that all the inner points of that convex hull correspond to reasonably good data partitions. Also, since the starting point $\gamma^0$ is computed based on the idea of ensemble clustering, it is not surprising that $\gamma^0$ should already be close to the optimal solution. Thus a simple local search is good enough to recover the optimal partition.

In addition to its superior performance, the proposed CCCS algorithm is also extremely efficient, requiring far less running time than all the baseline algorithms. Table 6.1 summarizes the average running time to update the data partitions on the four largest datasets, namely 20 News, RCV1, YouTube, and Network Intrusion. We mark the results as N/A if an algorithm cannot output the results within 5 hours.

Table 6.1 Average CPU time (in seconds) for updating the partition in each tier. (N/A means that the clustering task cannot be accomplished by the algorithm within 5 hours.)

Datasets            CCCS   MPCK   CCSKL   PMMC   RCA   DCA   LMNN   ITML
20 News             0.6    9.7    N/A     N/A    489   377   N/A    N/A
RCV1                1.3    64     N/A     N/A    926   601   N/A    N/A
YouTube             1.7    49     N/A     N/A    177   132   N/A    N/A
Network Intrusion   2.8    31     N/A     N/A    148   99    N/A    N/A

By comparing the results in Table 6.1, we observe that the running time of the proposed dynamic semi-supervised clustering algorithm to update the partition is significantly less than that of all the baseline algorithms. In particular, the results show that even for updating the partition of about 5 million data points, the proposed algorithm takes less than 3 seconds.
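The efficiency gap in Table 6.1 comes from the fact that the proposed method only updates the low-dimensional simplex vector on the new batch of constraints, while the baselines must be re-trained on all constraints accumulated so far. A rough timing harness of the kind that could produce such per-tier numbers is sketched below; cccs_update and rerun_baseline are hypothetical placeholders for the proposed update and a baseline re-optimization, respectively.

```python
import time

def timed_tiers(partitions, gamma0, tiers, cccs_update, rerun_baseline):
    """Compare per-tier update time of the incremental method vs. re-running a baseline.

    `tiers` is a list of (must_links, cannot_links) batches, one batch per tier.
    """
    gamma = gamma0.copy()
    all_must, all_cannot = [], []               # baselines see all constraints collected so far
    records = []
    for must, cannot in tiers:
        all_must.extend(must)
        all_cannot.extend(cannot)

        t0 = time.perf_counter()
        gamma = cccs_update(gamma, partitions, must, cannot)   # only the new batch
        t_cccs = time.perf_counter() - t0

        t0 = time.perf_counter()
        rerun_baseline(all_must, all_cannot)                   # full re-optimization
        t_base = time.perf_counter() - t0

        records.append((t_cccs, t_base))
    return records
```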
6.4 Conclusions

In this chapter, we propose a dynamic semi-supervised clustering algorithm that is able to efficiently update the clustering results when new pairwise constraints are generated. The key idea is to cast the clustering process into the problem of searching in a feasible clustering space, i.e., a convex hull generated from multiple ensemble partitions. Based on ensemble clustering, the optimal partition should lie in the interior of this convex hull. Since every inner point of a convex hull can be uniquely represented by a simplex vector, the dynamic semi-supervised clustering problem can be reduced to the problem of learning a low-dimensional vector. Given a new set of pairwise constraints, we derive an efficient updating scheme that is able to learn an optimal simplex vector in constant time. Our empirical studies conducted on multiple real-world datasets verify both the effectiveness and efficiency of the proposed algorithm.
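Because the feasible set is the probability simplex over the m ensemble partitions, an update of this kind typically ends with a Euclidean projection of the simplex vector back onto the simplex. The sketch below shows a standard O(m log m) simplex projection in the spirit of Duchi et al. [43]; it is included only as an illustration of how such a constant-dimension update can be carried out and is not claimed to be the exact routine used in this chapter.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex {x : x >= 0, sum(x) = 1}."""
    v = np.asarray(v, dtype=float)
    m = v.size
    u = np.sort(v)[::-1]                                  # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, m + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# Example: a (hypothetical) gradient-style step on the simplex vector, then projection.
gamma = np.array([0.5, 0.3, 0.2])
step = np.array([0.10, -0.05, 0.02])                      # hypothetical descent direction
gamma_new = project_to_simplex(gamma + step)
assert abs(gamma_new.sum() - 1.0) < 1e-9 and (gamma_new >= 0).all()
```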
[Figure 6.4: Average clustering performance of the proposed dynamic semi-supervised clustering algorithm (CCCS) and the baseline algorithms (MPCK-means (MPCK) [12], CCSKL [88], PMMC [144], RCA [6], DCA [63], LMNN [130], and ITML [36]) from tier t1 to t10 on nine datasets: (a) USPS, (b) RCV1, (c) 20 News, (d) BlogCatalog, (e) TDT2, (f) MNIST, (g) YouTube, (h) BBC, (i) Network Intrusion.]

Chapter 7

Conclusions and Future Directions

In this thesis, we study the issue of data clustering with pairwise constraints. Three important problems are considered, namely crowdclustering, semi-crowdsourced clustering, and semi-supervised clustering. In this chapter, we first review our main contributions to each problem, and then point out a few possible directions for future research.

7.1 Summary of Main Results

In the previous chapters, we presented a matrix completion based approach for addressing the problem of crowdclustering. In addition, we solved the problem of semi-crowdsourced clustering based on the proposed robust distance metric learning approach. Furthermore, we developed an input pattern assisted matrix completion method to address the semi-supervised clustering problem. Finally, we proposed a semi-supervised clustering algorithm that can efficiently update clustering results when new pairwise constraints are generated. Below, we summarize each of these contributions.

• In Chapter 3, we present a matrix completion based framework to solve the problem of crowdclustering. To address the issue of high noise levels in crowdsourced annotations, we first identify a subset of pairwise constraints that are agreed upon by most crowd workers. We then use these reliable pairwise constraints as the seed to derive the full similarity matrix by exploiting a matrix completion algorithm. The final data partition is then obtained by applying a spectral clustering algorithm to the completed similarity matrix (a schematic sketch of this pipeline is given after this list). Unlike the previous Bayesian crowdclustering approach, which requires a sufficiently large number of manual annotations to discover the hidden factors, the proposed algorithm needs only a small number of manual annotations to obtain an accurate data partition. In addition, by filtering out the uncertain pairwise constraints collected from crowd workers, the proposed crowdclustering algorithm yields more robust clustering results.

• In Chapter 4, we focus on the problem of semi-crowdsourced clustering. This is motivated by the practical consideration that the classical crowdclustering problem can only cluster objects whose manual annotations are available. This significantly limits its application to large-scale clustering problems, since it is not feasible to have each object manually annotated by multiple workers when the number of objects is large. To address this issue, we first randomly sample a small subset of objects for human annotation and construct a partially observed similarity matrix based on the reliable annotations. We then exploit a matrix completion algorithm to recover the similarity matrix, and finally learn a linear similarity function to compute the similarities between all the objects. To the best of our knowledge, it is the first semi-crowdsourced clustering approach to be proposed.

• In Chapter 5, we aim to address two main shortcomings of the existing semi-supervised clustering algorithms. First, most semi-supervised clustering algorithms suffer from non-convex optimization problems, leading to only locally optimal clustering results. Second, none of the existing approaches analyze the theoretical guarantee on the sample complexity problem, i.e., what is the minimal number of pairwise constraints needed to accurately recover the underlying true partitions? To address these issues, we propose a framework for semi-supervised clustering based on input pattern assisted matrix completion. Under the assumption that the cluster membership vectors can be well approximated by the top few singular vectors of the data matrix, we cast the problem of semi-supervised clustering into a convex matrix completion problem. We then solve it efficiently by exploiting the stochastic subgradient descent method. Our theoretical analysis shows that only O(log N) pairwise constraints are needed to accurately recover the true cluster partition of N objects. The logarithmic dependence on N makes the proposed algorithm particularly suitable for clustering large-scale datasets.

• In Chapter 6, we focus on the problem of efficiently updating data partitions upon receiving newly generated pairwise constraints, which is a common case in numerous real-world applications. Traditional semi-supervised clustering algorithms are not suitable for this problem, since they need to re-optimize their objective functions over all the data points subject to all the pairwise constraints, leading to high computational costs when the datasets are large. To address this issue, we proposed an efficient algorithm that casts the problem of semi-supervised clustering into a searching problem in a convex hull whose extreme points are an ensemble of multiple data partitions. We show that the time complexity of this algorithm is independent of the number of data points to be clustered. This enables us to update large-scale clustering results in an extremely efficient way.
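As referenced in the Chapter 3 bullet above, the following sketch illustrates the overall crowdclustering pipeline: keep only the pairwise labels that most workers agree on, complete the resulting partially observed similarity matrix, and cluster the completed matrix spectrally. The filtering thresholds and the completion step shown here (a simple iterative low-rank SVD imputation) are stand-ins, not the matrix completion algorithm developed in Chapter 3.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def reliable_similarity(votes, min_votes=3, agreement=0.8):
    """Build a partially observed similarity matrix from crowd votes.

    `votes` maps a pair (i, j) to a list of binary worker labels
    (1 = same cluster, 0 = different). Only pairs on which most
    workers agree are kept; the rest are treated as unobserved (NaN).
    """
    N = 1 + max(max(i, j) for i, j in votes)
    S = np.full((N, N), np.nan)
    np.fill_diagonal(S, 1.0)
    for (i, j), labels in votes.items():
        if len(labels) >= min_votes:
            p = np.mean(labels)
            if p >= agreement or p <= 1 - agreement:
                S[i, j] = S[j, i] = round(p)
    return S

def complete_similarity(S, rank, n_iters=100):
    """Fill in unobserved entries by simple iterative low-rank (SVD) imputation.

    This is only a stand-in for the matrix completion algorithm of Chapter 3.
    """
    observed = ~np.isnan(S)
    X = np.where(observed, S, 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(observed, S, low_rank)     # keep observed entries fixed
    return np.clip(X, 0.0, 1.0)

def crowdcluster(votes, n_clusters):
    S = reliable_similarity(votes)
    S_full = complete_similarity(S, rank=n_clusters)
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(S_full)
```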
7.2 Future Directions

In addition to the sample complexity and time complexity discussed in this thesis, both crowdclustering and semi-supervised clustering still leave many interesting open questions that are worth considering in the future. We discuss three such research directions.

• Spammer and multi-objective detection in crowdclustering. Note that the quality of crowdsourced annotations is usually very difficult to control. As indicated by [141], more than 80% of crowdsourced pairwise labels can be inconsistent with the true cluster assignment. Very often, the crowdsourced annotations can be dominated by spammers, who tend to assign labels randomly. A large number of spammers can significantly increase the labeling costs and also degrade the overall quality of the crowdsourced annotations. Thus, how to separate spammers from good users is a very useful and important problem. Along this line, even among the good users, people may not share a single view of how to partition the objects. Instead, they can have multiple objectives for clustering the data, all of which are meaningful and valid. A typical example is the clustering of face images illustrated in Chapter 1: the clustering criterion based on age and the one based on gender are equally valid, although their clustering results are orthogonal to each other. Compared to spammer detection, learning the multiple objectives of workers is probably an even more challenging problem. The reason is that spammers tend to assign random labels since they usually annotate instances without truly looking at them, which enables us to detect spammers by considering the consistency of their annotations. In contrast, the objectives of crowd workers are hidden and unknown, and they need to be learned from the crowd annotations. However, since the number of pairwise constraints provided by each worker is usually limited, it becomes a problem of learning multiple hidden variables from sparse labels. This is a challenging but very interesting research problem.

• Active data clustering by matrix completion. In all the aforementioned matrix completion based clustering algorithms, we assume that the pairwise constraints are generated randomly. This is consistent with the classical matrix completion theory [20], in which the observed entries are randomly sampled. However, it is expected that by actively selecting the most informative pairs of objects for querying, we have a chance to achieve an even lower sample complexity. Research in this direction may lead to a profound development in the theory of matrix completion.

• Semi-supervised clustering with dynamic instances. In Chapter 6, we presented a dynamic semi-supervised clustering algorithm that is able to efficiently update clustering results given new pairwise constraints. However, in many real-world applications, both the instances and the pairwise constraints can be dynamic. Take Facebook as an example: on average, 5 new accounts and hundreds of connections are created every second (http://www.iacpsocialmedia.org/Resources/FunFacts.aspx). This demands a semi-supervised clustering algorithm that is able to update clustering results with both new instances and new pairwise constraints.

APPENDIX

In order to prove Theorem 1, we aim to construct $Q$ that satisfies the following conditions: (a) $Q = P_S(Q)$, (b) $P_T(Q) = UV^\top$, (c) $\|P_{T^\perp}(Q)\| < 1$, (d) $P_\Omega(Q) = \lambda\,\mathrm{sgn}(P_\Omega(A))$, and (e) $|P_{\Omega^c}(Q)|_\infty < \lambda$. We first provide the following proposition.

Proposition 1. Let $\Omega$ be a set of $m$ entries sampled uniformly from $[N]\times[N]$ according to the Bernoulli model, and let $P_\Omega(Z)$ project $Z$ onto the subset $\Omega$. Let $\beta > 1$. Assume $m > m_0$, where $m_0 = C_R\,\mu_0^2\, r N \beta \log N$ and $C_R$ is some positive constant. Then, with probability $1 - 3N^{-\beta}$, for any $Z \in T$ with $P_\Omega(Z) = 0$, we have $Z = 0$.

Let $T$ be the space spanned by the elements of the form $u_k y^\top$ and $x v_k^\top$, for $1 \le k \le r$, where $x$ and $y$ are arbitrary, and let $T^\perp$ be the orthogonal complement of the space $T$.
First, according to Theorem 2 of [23], when $N > m_0$, where $m_0$ is defined in Proposition 1, with a probability at least $1 - 3N^{-\beta}$, the mapping $P_T P_S P_T(Z): T \rightarrow T$ is a one-to-one mapping, and therefore its inverse mapping, denoted by $(P_T P_S P_T)^{-1}$, is well defined. Similar to the proof of Theorem 2 in [23], we construct the dual certificate $Q$ as follows

$$Q = \lambda\,\mathrm{sgn}(N^*) + \epsilon_\Omega + P_S P_T (P_T P_S P_T)^{-1}\big(UV^\top + \epsilon_T\big)$$

where $\epsilon_T \in T$ and $\epsilon_\Omega = P_\Omega(\epsilon_\Omega)$. We further define

$$H = P_S P_T (P_T P_S P_T)^{-1}\big(UV^\top\big), \qquad F = P_S P_T (P_T P_S P_T)^{-1}\big(\epsilon_T\big)$$

Evidently, we have $P_S(Q) = Q$ since $\Omega \subset S$, and therefore condition (a) is satisfied. To satisfy conditions (b)-(e), we need

$$P_T(Q) = UV^\top \;\Rightarrow\; \epsilon_T = -P_T\big(\lambda\,\mathrm{sgn}(N^*) + \epsilon_\Omega\big) \qquad (1)$$

$$\|P_{T^\perp}(Q)\| < 1 \;\Rightarrow\; \mu(N^*)\big(\lambda + |\epsilon_\Omega|_\infty\big) + \|P_{T^\perp}(H)\| + \|P_{T^\perp}(F)\| < 1 \qquad (2)$$

$$P_\Omega(Q) = \lambda\,\mathrm{sgn}(N^*) \;\Rightarrow\; \epsilon_\Omega = -P_\Omega(H + F) \qquad (3)$$

$$|P_{\Omega^c}(Q)|_\infty < \lambda \;\Rightarrow\; \xi(M^*)\big(1 + \|\epsilon_T\|\big) < \lambda \qquad (4)$$

Below, we will first show that there exist solutions $\epsilon_T \in T$ and $\epsilon_\Omega$ that satisfy conditions (1) and (3). We will then bound $|\epsilon_\Omega|_\infty$, $\|\epsilon_T\|$, $\|P_{T^\perp}(H)\|$, and $\|P_{T^\perp}(F)\|$ to show that, with sufficiently small $\mu(N^*)$ and $\xi(M^*)$ and appropriately chosen $\lambda$, conditions (2) and (4) can be satisfied.

First, we show the existence of $\epsilon_\Omega$ and $\epsilon_T$ that obey the relationships in (1) and (3). It is equivalent to show that there exists $\epsilon_T$ that satisfies the following relation

$$\epsilon_T = -P_T\big(\lambda\,\mathrm{sgn}(N^*)\big) + P_T P_\Omega(H) + P_T P_\Omega P_T (P_T P_S P_T)^{-1}(\epsilon_T)$$

or

$$P_T P_{S\setminus\Omega} P_T (P_T P_S P_T)^{-1}(\epsilon_T) = -P_T\big(\lambda\,\mathrm{sgn}(N^*)\big) + P_T P_\Omega(H)$$

Similar to the previous argument, when $|S \setminus \Omega| = N - m > m_0$, with a probability $1 - 3N^{-\beta}$, $P_T P_{S\setminus\Omega} P_T(Z): T \rightarrow T$ is a one-to-one mapping, and therefore $(P_T P_{S\setminus\Omega} P_T)^{-1}$ is well defined. Using this result, we have the following solution to the above equation

$$\epsilon_T = P_T P_S P_T (P_T P_{S\setminus\Omega} P_T)^{-1}\Big(-P_T\big(\lambda\,\mathrm{sgn}(N^*)\big) + P_T P_\Omega(H)\Big)$$

We now bound $\|\epsilon_T\|$ and $|\epsilon_\Omega|_\infty$. Since $\|\epsilon_T\| \le |\epsilon_T|_F$, we will bound $|\epsilon_T|_F$. First, according to Corollary 3.5 in [15], under the assumption of this theorem, when $\beta = 4$ and with a probability $1 - N^{-3}$, for any $Z \in T$, we have $\|P_{T^\perp} P_S P_T (P_T P_S P_T)^{-1}(Z)\|_F \le |Z|_F$. Using this result, we have

$$|\epsilon_\Omega|_\infty \le \xi(M^*)\big(\|H\| + \|F\|\big) \le \xi(M^*)\big(1 + |P_{T^\perp}(H)|_F + \|\epsilon_T\| + |P_{T^\perp}(F)|_F\big) \le \xi(M^*)\big(2 + \|\epsilon_T\| + |\epsilon_T|_F\big) \le \xi(M^*)\big(2 + (2r+1)\|\epsilon_T\|\big)$$

In the last step, we use the fact that $\mathrm{rank}(\epsilon_T) \le 2r$ if $\epsilon_T \in T$. We then proceed to bound $\|\epsilon_T\|$ as follows

$$\|\epsilon_T\| \le \mu(N^*)\big(\lambda + |\epsilon_\Omega|_\infty\big)$$

Combining the above two inequalities together, we have

$$\|\epsilon_T\| \le \xi(M^*)\mu(N^*)(2r+1)\|\epsilon_T\| + 2\xi(M^*)\mu(N^*) + \lambda\mu(N^*)$$

$$|\epsilon_\Omega|_\infty \le \xi(M^*)\big(2 + (2r+1)\mu(N^*)(\lambda + |\epsilon_\Omega|_\infty)\big)$$

leading to

$$\|\epsilon_T\| \le \frac{\lambda\mu(N^*) + 2\xi(M^*)\mu(N^*)}{1 - (2r+1)\xi(M^*)\mu(N^*)}, \qquad |\epsilon_\Omega|_\infty \le \frac{2\xi(M^*) + (2r+1)\lambda\xi(M^*)\mu(N^*)}{1 - (2r+1)\xi(M^*)\mu(N^*)}$$

Using the bounds for $|\epsilon_\Omega|_\infty$ and $\|\epsilon_T\|$, we now check condition (2):

$$1 > \mu(N^*)\big(\lambda + |\epsilon_\Omega|_\infty\big) + \frac{1}{2} + \frac{r}{2}\|\epsilon_T\|$$

or

$$\lambda < \frac{1 - \xi(M^*)\mu(N^*)(4r+5)}{\mu(N^*)(r+2)}$$

For condition (4), we have $\lambda > \xi(M^*) + \xi(M^*)\|\epsilon_T\|$, or

$$\lambda > \frac{\xi(M^*) - (2r-1)\xi^2(M^*)\mu(N^*)}{1 - 2(r+1)\xi(M^*)\mu(N^*)}$$

To ensure that there exists $\lambda \ge 0$ satisfying the above two conditions, we need

$$1 - 5(r+1)\xi(M^*)\mu(N^*) + (10r^2 + 21r + 8)\big[\xi(M^*)\mu(N^*)\big]^2 > 0$$

and

$$1 - \xi(M^*)\mu(N^*)(4r+5) \ge 0$$

Since the first condition is guaranteed to be satisfied for $r \ge 1$, we have

$$\mu(N^*)\,\xi(M^*) \le \frac{1}{4r+5}$$

BIBLIOGRAPHY

[1] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. In ICML, pages 1129–1136, 2011.

[2] Kais Allab and Khalid Benabdeslem. Constraint selection for semi-supervised topological clustering. In ECML/PKDD, pages 28–43, 2011.

[3] Javier Artiles, Julio Gonzalo, and Satoshi Sekine.
The semeval-2007 weps evaluation: Establishing a benchmark for the web people search task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). ACL, 2007. [4] H Avron, S Kale, S Kasiviswanathan, and V Sindhwani. Efficient and practical stochastic subgradient descent for nuclear norm regularization. In ICML, 2012. [5] Hanan Ayad and Mohamed S. Kamel. Cumulative voting consensus method for partitions with variable number of clusters. IEEE Trans. Pattern Anal. Mach. Intell., 30(1):160–173, 2008. [6] Aharon Bar-Hillel, Tomer Hertz, Noam Shental, and Daphna Weinshall. Learning a mahalanobis metric from equivalence constraints. JMLR, 6:937–965, 2005. [7] P. Bartlett. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998. [8] Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. A probabilistic framework for semi-supervised clustering. In KDD, pages 59–68, 2004. [9] R. Bekkerman and M. Sahami. Semi-supervised clustering using combinatorial MRFs. In ICML Workshop on Learning in Structured Output Spaces, 2006. [10] Asa Ben-Hur, Andr´e Elisseeff, and Isabelle Guyon. A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing, pages 6–17, 2002. [11] S. K. Bhatia and J. S. Deogun. Conceptual clustering in information retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 28(3):427–436, 1998. [12] M Bilenko, S Basu, and R J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In ICML, 2004. [13] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string 126 similarity measures. In KDD, pages 39–48, 2003. [14] Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. Amazon’s mechanical turk a new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5, 2011. [15] E J. Cand`es and T Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010. [16] Emmanuel J. Cand`es, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J. ACM, 58(3):11, 2011. [17] Emmanuel J. Cand`es and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010. [18] Emmanuel J. Cand`es and Benjamin Recht. Simple bounds for low-complexity model reconstruction. CoRR, abs/1106.1474, 2011. [19] Emmanuel J. Cand`es and Benjamin Recht. Exact matrix completion via convex optimization. Commun. ACM, 55(6):111–119, 2012. [20] Emmanuel J. Cand`es and Terence Tao. The power of convex relaxation: near-optimal matrix completion. volume 56, pages 2053–2080, 2010. [21] D. Chakrabarti, R. Kumar, and A. Tomkins. Evolutionary clustering. In KDD, pages 554–560, 2006. [22] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Ranksparsity incoherence for matrix decomposition. volume 21, pages 572–596, 2011. [23] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011. [24] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, S. Willsky, and Alan. Ranksparsity incoherence for matrix decomposition. CoRR, 2009. [25] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang. Parallel spectral clustering in distributed systems. 
PAMI, 33(3):568–586, 2011. [26] Y. Chi, X. Song, D. Zhou, K. Hino, and B. L. Tseng. Evolutionary spectral clustering by incorporating temporal smoothness. In KDD, pages 153–162, 2007. [27] R. Chitta, R. Jin, T. C. Havens, and A. K. Jain. Approximate kernel k-means: solution to large scale kernel clustering. In KDD, pages 895–903, 2011. 127 [28] D. Cohn, R. Caruana, and A. McCallum. Semi-supervised clustering with user feedback. Constrained Clustering: Advances in Algorithms, Theory, and Applications, 4(1):17–32, 2003. [29] T. Cover and J. Thomas. Elements of Information Theory (2nd ed.). Wiley, 2006. [30] Ian Davidson and S. S. Ravi. Agglomerative hierarchical clustering with constraints: Theoretical and empirical results. In PKDD, pages 59–70, 2005. [31] Ian Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In SDM, 2005. [32] Ian Davidson and S. S. Ravi. Hierarchical clustering with constraints: Theory and practice. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 59–70, 2005. [33] Ian Davidson and S. S. Ravi. Identifying and generating easy sets of constraints for clustering. In AAAI, pages 336–341, 2006. [34] Ian Davidson and S. S. Ravi. The complexity of non-hierarchical clustering with instance and cluster level constraints. Data Min. Knowl. Discov., 14(1):25–61, 2007. [35] Ian Davidson and Basu Sugato. A survey of clustering with instance level constraints. Technical report, 2007. [36] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007. [37] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th Annual International Conference on Machine Learning, pages 209–216, 2007. [38] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007. [39] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009. [40] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In KDD, pages 551–556, 2004. [41] Evgenia Dimitriadou, Andreas Weingessel, and Kurt Hornik. A combination scheme for fuzzy clustering. pages 332–338, 2002. [42] Carlotta Domeniconi and Muna Al-Razgan. Weighted cluster ensembles: Methods and 128 analysis. TKDD, 2(4), 2009. [43] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1 -ball for learning in high dimensions. In ICML, pages 272–279, 2008. [44] Sandrine Dudoit and Jane Fridlyand. Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9):1090–1099, 2003. [45] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html. [46] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning natural scene categories. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 524–531, 2005. [47] Xiaoli Zhang Fern and Carla E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In ICML, pages 186–193, 2003. [48] Xiaoli Zhang Fern and Carla E. Brodley. 
Solving cluster ensemble problems by bipartite graph partitioning. In ICML, 2004. [49] X.Z. Fern and C.E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In ICML, page 36. ACM, 2004. [50] Bernd Fischer and Joachim M. Buhmann. Bagging for path-based clustering. IEEE Trans. Pattern Anal. Mach. Intell., 25(11):1411–1415, 2003. [51] Charless Fowlkes, Serge Belongie, Fan R. K. Chung, and Jitendra Malik. Spectral grouping using the nystr¨om method. PAMI, 26(2):214–225, 2004. [52] Lucas Franek and Xiaoyi Jiang. Ensemble clustering by means of clustering embedding in vector spaces. Pattern Recognition, 47(2):833–842, 2014. [53] Ana L. N. Fred and Anil K. Jain. Data clustering using evidence accumulation. In International Conference on Pattern Recognition, volume 4, pages 276–280, 2002. [54] Ana L. N. Fred and Anil K. Jain. Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Mach. Intell., 27(6):835–850, 2005. [55] Hichem Frigui and Raghu Krishnapuram. A robust competitive clustering algorithm with applications in computer vision. PAMI, 21(5):450–465, 1999. [56] Arvind Ganesh, John Wright, Xiaodong Li, Emmanuel J. Cand`es, and Yi Ma. Dense error correction for low-rank matrices via principal component pursuit. In ISIT, pages 129 1513–1517, 2010. [57] D. Goldberg, D. A. Nichols, B. M. Oki, and D. B. Terry. Using collaborative filtering to weave an information tapestry. Commun. ACM, 35(12):61–70, 1992. [58] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004. [59] Jacob Goldberger, Sam Roweis, Geoff Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. In Proceedings of the 18nd Annual Conference on Neural Information Processing Systems, pages 513–520, 2004. [60] R. Gomes, P. Welinder, A. Krause, and P. Perona. Crowdclustering. In NIPS, 2011. [61] Ronald L. Graham. An efficient algorith for determining the convex hull of a finite planar set. Information processing letters, 1(4):132–133, 1972. [62] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011. [63] S.C.H. Hoi, W. Liu, M.R. Lyu, and W.Y. Ma. Learning distance metrics with contextual constraints for image retrieval. In CVPR, pages 2072–2078, 2006. [64] Steven C. H. Hoi, Wei Liu, Michael R. Lyu, and Wei-Ying Ma. Learning distance metrics with contextual constraints for image retrieval. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2072–2078, 2006. [65] Daniel Hsu, Sham M. Kakade, and Tong Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 57(11):7221–7234, 2011. [66] Kaizhu Huang, Rong Jin, Zenglin Xu, and Cheng-Lin Liu. Robust metric learning by smooth optimization. In UAI, 2010. [67] Yifen Huang and Tom M. Mitchell. Text clustering with extended user feedback. In SIGIR, pages 413–420, 2006. [68] J.J. Hull. A database for handwritten text recognition research. PAMI, 16(5):550–554, 1994. [69] Natthakan Iam-on, Tossapon Boongoen, and Simon Garrett. Refining pairwise similarity matrix for cluster ensemble problem with cluster relations. In Discovery Science, pages 222–233, 2008. 130 [70] Panagiotis G. Ipeirotis. Analyzing the amazon mechanical turk marketplace. ACM Crossroads, 17(2):16–21, 2010. [71] Anil K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010. [72] Anil K. 
Jain, Jung-Eun Lee, and Rong Jin. Tattoo-ID: Automatic tattoo image retrieval for suspect and victim identification. In Advances in Multimedia Information Processing–PCM, pages 256–265, 2007. [73] Ali Jalali, Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering partially observed graphs via convex optimization. In ICML, pages 1001–1008, 2011. [74] S D. Kamvar, D Klein, and C D. Manning. Spectral learning. In IJCAI, pages 561–566, 2003. [75] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010. [76] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11:2057–2078, 2010. [77] Min-Soo Kim and Jiawei Han. A particle-and-density based evolutionary clustering method for dynamic networks. PVLDB, 2(1):622–633, 2009. [78] H. W. Kuhn and Bryn Yaw. The hungarian method for the assignment problem. Naval Res. Logist. Quart, pages 83–97, 1955. [79] Brian Kulis, Sugato Basu, Inderjit S. Dhillon, and Raymond J. Mooney. Semisupervised graph clustering: a kernel approach. Machine Learning, 74(1):1–22, 2009. [80] M Law, A P. Topchy, and A K. Jain. Model-based clustering with probabilistic constraints. In SDM, 2005. [81] Y. LeCun and C. Cortes. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann. lecun. com/exdb/mnist, 1998. [82] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. Rcv1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004. [83] Q Li and B Kim. Clustering approach for hybrid recommender system. In Web Intelligence, pages 33–38, 2003. [84] Tao Li, Chris H. Q. Ding, and Michael I. Jordan. Solving consensus and semisupervised clustering problems using nonnegative matrix factorization. In Seventh 131 IEEE International Conference on Data Mining, pages 577–582, 2007. [85] Weizhong Li, Lukasz Jaroszewski, and Adam Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17(3):282–283, 2001. [86] Xiaodong Li. Compressed sensing and matrix completion with constant proportion of corruptions. CoRR, abs/1104.1041, 2011. [87] Yan Li, Jian Yu, Pengwei Hao, and Zhulin Li. Clustering ensembles based on normalized edges. pages 664–671, 2007. [88] Zhenguo Li and Jianzhuang Liu. Constrained clustering by spectral kernel learning. In ICCV, pages 421–427, 2009. [89] Yu-Ru Lin, Y. Chi, S. Zhu, H. Sundaram, and B. L. Tseng. Facetnet: a framework for analyzing communities and their evolutions in dynamic networks. In WWW, pages 685–694, 2008. [90] Z. Lin, M. Chen, L. Wu, and Y. Ma. The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. Arxiv preprint arXiv:1009.5055, 2010. [91] Xiaoyong Liu and W. Bruce Croft. Cluster-based retrieval using language models. In SIGIR, pages 186–193, 2004. ´ Carreira-Perpi˜ [92] Z. Lu and M. A. na´n. Constrained spectral clustering through affinity propagation. In CVPR, 2008. [93] Z. Lu and T. K. Leen. Semi-supervised learning with penalized probabilistic clustering. In NIPS, 2004. [94] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, page 14. California, USA, 1967. [95] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch¨ utze. Introduction to information retrieval. Cambridge University Press, 2008. 
[96] Behrouz Minaei-Bidgoli, Alexander P. Topchy, and William F. Punch. Ensembles of partitions via data resampling. pages 188–192, 2004. [97] D.C. Montgomery, E.A. Peck, and G.G. Vining. Introduction to Linear Regression Analysis, volume 49. John Wiley & Sons, 2007. [98] Nayar and H. Murase. Columbia object image library: COIL-100. Technical Report 132 CUCS-006-96, Department of Computer Science, Columbia University, February 1996. [99] Radford M Neal. Markov chain sampling methods for dirichlet process mixture models. Journal of computational and graphical statistics, 9(2):249–265, 2000. [100] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13:1665–1697, 2012. [101] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2001. [102] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2001. [103] D. O’Callaghan, M. Harrigan, J. Carthy, and P. Cunningham. Network analysis of recurring youtube spam campaigns. In ICWSM, 2012. [104] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. CoRR, abs/1202.1212, 2012. [105] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971. [106] Benjamin Recht. A simpler approach to matrix completion. JMLR, 12:3413–3430, 2011. [107] Aaron D. Shaw, John J. Horton, and Daniel L. Chen. Designing incentives for inexpert human raters. In ACM Conference on Computer Supported Cooperative Work [107], pages 275–284. [108] Noam Shental, Aharon Bar-Hillel, Tomer Hertz, and Daphna Weinshall. Computing gaussian mixture models with em using equivalence constraints. In NIPS, 2003. [109] Noam Shental, Tomer Hertz, Daphna Weinshall, and Misha Pavel. Adjustment learning and relevant component analysis. In Proceedings of the 17th European Conference on Computer Vision, pages 776–792, 2002. [110] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, 2000. [111] Steve Smale and Ding-Xuan Zhou. Geometry on probability spaces. Constr Approx, 30:311–323, 2009. [112] S. J Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. K Chan. Cost-based modeling for 133 fraud and intrusion detection: results from the jam project. In DARPA Information Survivability Conference and Exposition, 2000. DISCEX’00. Proceedings, volume 2, pages 130–144. IEEE, 2000. [113] Alexander Strehl and Joydeep Ghosh. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. JMLR, 3:583–617, 2002. [114] J. Sun, C. Faloutsos, S. Papadimitriou, and P. S. Yu. Graphscope: parameter-free mining of large time-evolving graphs. In KDD, pages 687–696, 2007. [115] Tomer Hertz Tomboy, Aharon Bar-hillel, and Daphna Weinshall. Boosting margin based distance functions for clustering. In Proceedings of the 21st Annual International Conference on Machine Learning, 2004. [116] Alexander P. Topchy, Anil K. Jain, and William F. Punch. A mixture model for clustering ensembles. In SDM, 2004. [117] Alexander P. Topchy, Anil K. Jain, and William F. Punch. Clustering ensembles: Models of consensus and weak partitions. IEEE Trans. Pattern Anal. Mach. Intell., 27(12):1866–1881, 2005. [118] Alexander P. Topchy, Behrouz Minaei-Bidgoli, Anil K. 
Jain, and William F. Punch. Adaptive clustering ensembles. In ICPR, pages 272–275, 2004. [119] Kagan Tumer and Adrian K. Agogino. Ensemble clustering with voting active clusters. Pattern Recognition Letters, 29(14):1947–1953, 2008. [120] Sandro Vega-Pons and Jos´e Ruiz-Shulcloper. Clustering ensemble method for heterogeneous partitions. In CIARP, pages 481–488, 2009. [121] Sandro Vega-Pons and Jos´e Ruiz-Shulcloper. A survey of clustering ensemble algorithms. IJPRAI, 25(3):337–372, 2011. [122] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007. [123] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In Proceedings of the 17th Annual International Conference on Machine Learning, pages 1103–1110, 2000. [124] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schr¨odl. Constrained k-means clustering with background knowledge. In Proceedings of the 18th Annual International Conference on Machine Learning, 2001. [125] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schr¨odl. Constrained k-means 134 clustering with background knowledge. In ICML, pages 577–584, 2001. [126] Y. Wakabayashi. Aggregation of binary relations: algorithmic and polyhedral investigations. PhD thesis, 1986. [127] X. Wang, L. Tang, H. Gao, and H. Liu. Discovering overlapping groups in social media. In ICDM, pages 569–578, 2010. [128] X. Wang, C. Yang, and J. Zhou. Clustering aggregation by probability accumulation. Pattern Recognition, 42(5):668–675, 2009. [129] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Proceedings of the 20nd Annual Conference on Neural Information Processing Systems, pages 207–244, 2006. [130] K.Q. Weinberger, J. Blitzer, and L.K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2006. [131] A. Weingessel, E. Dimitriadou, and K. Hornik. An ensemble method for clustering. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing. Citeseer, 2003. [132] JD Maitland Wright. Measures with values in a partially ordered vector space. Proceedings of the London Mathematical Society, 3(4):675–688, 1972. [133] Lei Wu, Rong Jin, Steven C.H. Hoi, Jinfeng Zhuang, and Nenghai Yu. Simplenpkl: Simple non-parametric kernel learning. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, 2009. [134] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. volume 15, pages 505–512, 2002. [135] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In Proceedings of the 17th Annual Conference on Neural Information Processing Systems, pages 505–512, 2003. [136] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust pca via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012. [137] Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Technical report, 2006. [138] Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 135 2006. [139] Liu Yang, Rong Jin, Rahul Sukthankar, and Yi Liu. An efficient algorithm for local distance metric learning. In Proceedings of the 21st National Conference on Aartifical Intelligence, pages 543–548, 2006. 
[140] Ka Yee Yeung, Chris Fraley, Alejandro Murua, Adrian E Raftery, and Walter L Ruzzo. Model-based clustering and data transformations for gene expression data. Bioinformatics, 17(10):977–987, 2001. [141] J. Yi, R. Jin, A. K. Jain, and S. Jain. Crowdclustering with sparse pairwise labels: A matrix completion approach. In AAAI Workshop on Human Computation, 2012. [142] J. Yi, R. Jin, A. K. Jain, S. Jain, and T. Yang. Semi-crowdsourced clustering: Generalizing crowd labeling by robust distance metric learning. In NIPS, pages 1781–1789, 2012. [143] J. Yi, L. Zhang, R. Jin, Q. Qian, and A. K. Jain. Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion. In ICML (3), pages 1400–1408, 2013. [144] H. Zeng and Y. Cheung. Semi-supervised maximum margin clustering with pairwise constraints. TKDE, 24(5):926–939, 2012. 136