Search results
(1–2 of 2)
 Title
 Kernel-based clustering of big data
 Creator
 Chitta, Radha
 Date
 2015
 Collection
 Electronic Theses & Dissertations
 Description

There has been a rapid increase in the volume of digital data in recent years. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (10^21 bytes) of digital data by the year 2020. Analysis of these massive amounts of data, popularly known as big data, necessitates highly scalable data analysis techniques. Clustering is an exploratory data analysis tool used to discover the underlying groups in the data. The state-of-the-art algorithms for clustering big data sets are linear clustering algorithms, which assume that the data is linearly separable in the input space, and use measures such as the Euclidean distance to define inter-point similarities. Though efficient, linear clustering algorithms do not achieve high cluster quality on real-world data sets, which are not linearly separable. Kernel-based clustering algorithms employ nonlinear similarity measures to define the inter-point similarities. As a result, they are able to identify clusters of arbitrary shapes and densities. However, kernel-based clustering techniques suffer from two major limitations: (i) their running time and memory complexity increase quadratically with the size of the data set, so they cannot scale up to data sets containing billions of data points; and (ii) their performance is highly sensitive to the choice of the kernel similarity function. Ad hoc approaches, relying on prior domain knowledge, are currently employed to choose the kernel function, and it is difficult to determine the appropriate kernel similarity function for a given data set.

In this thesis, we develop scalable approximate kernel-based clustering algorithms using random sampling and matrix approximation techniques.
They can cluster big data sets containing billions of high-dimensional points not only as efficiently as linear clustering algorithms but also as accurately as classical kernel-based clustering algorithms.

Our first contribution is based on the premise that the similarity matrices corresponding to big data sets can usually be well approximated by low-rank matrices built from a subset of the data. We develop an approximate kernel-based clustering algorithm which uses a low-rank approximate kernel matrix, constructed from a uniformly sampled small subset of the data, to perform clustering. We show that the proposed algorithm has linear running time complexity and low memory requirements, and also achieves high cluster quality when provided with a sufficient number of data samples. We also demonstrate that the proposed algorithm can be easily parallelized to handle distributed data sets. We then employ nonlinear random feature maps to approximate the kernel similarity function, and design clustering algorithms which improve the efficiency of kernel-based clustering, as well as label assignment for previously unseen data points.

Our next contribution is an online kernel-based clustering algorithm that can cluster potentially unbounded streaming data in real time. It intelligently samples the data stream and finds the cluster labels using these sampled points. The proposed scheme is more effective than current kernel-based and linear stream clustering techniques, in terms of both efficiency and cluster quality.

We finally address the issues of high dimensionality and scalability to data sets containing a large number of clusters. Under the assumption that the kernel matrix is sparse when the number of clusters is large, we modify the above online kernel-based clustering scheme to perform clustering in a low-dimensional space spanned by the top eigenvectors of the sparse kernel matrix. The combination of sampling and sparsity further reduces the running time and memory complexity.
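The first contribution rests on a low-rank kernel approximation built from a uniformly sampled subset. A minimal sketch of that general idea (a Nyström-style approximation, with hypothetical function names; the thesis's exact algorithm may differ) is:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF (Gaussian) kernel matrix between rows of A and rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def nystrom_features(X, m=100, gamma=1.0, seed=0):
    """Low-rank feature map from a uniformly sampled subset of m points.

    With C the N x m cross-kernel block and W the m x m sampled block,
    K ~= C W^+ C.T, so the rows of C W^{-1/2} act as approximate
    kernel-space coordinates that a linear k-means can then cluster
    in O(N m) memory instead of O(N^2).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    C = rbf_kernel(X, X[idx], gamma)        # N x m cross-kernel block
    W = rbf_kernel(X[idx], X[idx], gamma)   # m x m sampled block
    # Eigendecompose W and form W^{-1/2}, dropping tiny eigenvalues
    # for numerical stability (a pseudo-inverse square root).
    vals, vecs = np.linalg.eigh(W)
    keep = vals > 1e-10
    W_inv_sqrt = vecs[:, keep] @ np.diag(vals[keep] ** -0.5) @ vecs[:, keep].T
    return C @ W_inv_sqrt                    # N x m approximate embedding
```

Running an ordinary linear k-means on the returned embedding then plays the role of approximate kernel k-means, which is how such a scheme attains linear running time in N.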
The proposed clustering algorithms can be applied in a number of real-world applications. We demonstrate the efficacy of our algorithms using several large benchmark text and image data sets. For instance, the proposed batch kernel clustering algorithms were used to cluster large image data sets (e.g. Tiny) containing up to 80 million images. The proposed stream kernel clustering algorithm was used to cluster over a billion tweets from Twitter for hashtag recommendation.
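The abstract also mentions approximating the kernel function with nonlinear random feature maps. A common construction of that kind is random Fourier features for the RBF kernel; the sketch below (an illustration, not the thesis's specific design) shows why clustering the resulting low-dimensional features approximates kernel-based clustering:

```python
import numpy as np

def rff_features(X, D=200, gamma=1.0, seed=0):
    """Random Fourier features approximating the RBF kernel
    k(x, y) = exp(-gamma ||x - y||^2), in the sense that
    z(x) . z(y) ~= k(x, y) in expectation.

    Linear k-means on the D-dim features then approximates
    kernel k-means while storing only O(N D) numbers.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies sampled from the kernel's spectral density
    # (Gaussian with standard deviation sqrt(2 * gamma)).
    W = rng.normal(0.0, np.sqrt(2 * gamma), (d, D))
    b = rng.uniform(0, 2 * np.pi, D)        # random phase offsets
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

The approximation error shrinks as O(1/sqrt(D)), so a few hundred to a few thousand features typically suffice.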
 Title
 Data clustering with pairwise constraints
 Creator
 Yi, Jinfeng
 Date
 2014
 Collection
 Electronic Theses & Dissertations
 Description

Classical unsupervised clustering is an ill-posed problem due to the absence of a unique clustering criterion. This issue can be addressed by introducing additional supervised information, usually cast in the form of pairwise constraints, into the clustering procedure. Depending on the source, most pairwise constraints can be classified into two categories: (i) pairwise constraints collected from a set of non-expert crowd workers, which leads to the problem of crowdclustering, and (ii) pairwise constraints collected from an oracle or experts, which leads to the problem of semi-supervised clustering. In both cases, the cost of collecting pairwise constraints can be high, so it is important to identify the minimal number of pairwise constraints needed to accurately recover the underlying true data partition; this is known as the sample complexity problem.

In this thesis, we first analyze the sample complexity of crowdclustering. We begin by proposing a novel crowdclustering approach based on the theory of matrix completion. Unlike the existing crowdclustering algorithm, which is based on a Bayesian generative model, the proposed approach needs far fewer crowdsourced pairwise annotations to accurately cluster all the objects. Our theoretical analysis shows that in order to accurately cluster $N$ objects, only $O(N\log^2 N)$ randomly sampled pairs need to be annotated by crowd workers. To further reduce the sample complexity, we then introduce a semi-crowdsourced clustering framework that can effectively incorporate the low-level features of the objects to be clustered. In this framework, we only need to sample a subset of $n \ll N$ objects and generate their pairwise constraints via crowdsourcing.
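The matrix-completion view of crowdclustering treats the ideal cluster-membership similarity matrix (1 if two objects share a cluster, 0 otherwise) as a low-rank matrix of which only randomly sampled entries are annotated. A minimal sketch, using simple iterative SVD projection as a stand-in for the convex nuclear-norm program the thesis builds on (function names are hypothetical):

```python
import numpy as np

def complete_similarity(S_obs, mask, rank=2, iters=300):
    """Rank-constrained completion of a partially observed 0/1
    cluster-similarity matrix.

    S_obs holds the annotated entries, mask marks which entries were
    observed. Alternates between projecting onto rank-`rank` matrices
    (via truncated SVD) and re-imposing the observed entries - a
    simple "hard-impute" loop, not the convex solver itself.
    """
    S = np.where(mask, S_obs, 0.5)               # init unknowns at 0.5
    S_low = S
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        S_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-r projection
        S = np.where(mask, S_obs, S_low)         # keep observed entries
    return np.clip(S_low, 0.0, 1.0)
```

For $k$ true clusters the ideal matrix has rank $k$, which is why a number of sampled pairs that is nearly linear in $N$ (the $O(N\log^2 N)$ rate above) can suffice to recover it.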
After completing the $n \times n$ similarity matrix using the proposed crowdclustering algorithm, we can recover the full $N \times N$ similarity matrix by applying a regression-based distance metric learning algorithm to the completed smaller similarity matrix. This enables us to reliably cluster $N$ objects with only $O(n\log^2 n)$ crowdsourced pairwise constraints.

Next, we study the problem of sample complexity in semi-supervised clustering. To this end, we propose a novel convex semi-supervised clustering approach based on the theory of matrix completion. In order to reduce the number of pairwise constraints needed to achieve a perfect data partitioning, we apply the natural assumption that the feature representations of the objects reflect the similarities between them. This enables us to utilize only $O(\log N)$ pairwise constraints to perfectly recover the data partition of $N$ objects.

Lastly, in addition to sample complexity, which relates to labeling costs, we also consider the computational costs of semi-supervised clustering. Specifically, we study the problem of efficiently updating clustering results when pairwise constraints are generated sequentially, a common case in real-world applications such as social networks. To address this issue, we develop a dynamic semi-supervised clustering algorithm that casts the clustering problem as a search problem in a feasible convex space, i.e., a convex hull whose extreme points are an ensemble of multiple data partitions. Unlike classical semi-supervised clustering algorithms, which need to re-optimize their objective functions when new pairwise constraints are generated, the proposed method only needs to update a low-dimensional vector, and its time complexity is independent of the number of data points to be clustered.
This enables us to update large-scale clustering results in an extremely efficient way.
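The "update a low-dimensional vector" idea can be made concrete: if the current solution is a convex combination of $K$ fixed candidate partitions, each new pairwise constraint only has to adjust the $K$ weights. The sketch below uses a simple multiplicative-weights update as an illustrative stand-in for the thesis's optimization (names and the update rule are assumptions, not the actual algorithm):

```python
import numpy as np

def update_weights(w, partitions, i, j, must_link, lr=1.0):
    """One online update of the K-dim weight vector over an ensemble
    of candidate partitions, given a new constraint on points (i, j).

    Each partition is a label array; the update upweights partitions
    consistent with the constraint. Cost is O(K), independent of the
    number of data points N.
    """
    agree = np.array([p[i] == p[j] for p in partitions], dtype=float)
    satisfied = agree if must_link else 1.0 - agree
    w = w * np.exp(lr * satisfied)   # reward consistent partitions
    return w / w.sum()               # renormalize onto the simplex
```

Because the weight vector lives on the simplex, the implied solution always stays inside the convex hull of the ensemble partitions, matching the "searching in a feasible convex space" formulation above.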