- Kernel-based clustering of big data
- Chitta, Radha
- Electronic Theses & Dissertations
There has been a rapid increase in the volume of digital data in recent years. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (44 × 10^21 bytes) of digital data by the year 2020. Analysis of these massive amounts of data, popularly known as big data, necessitates highly scalable data analysis techniques. Clustering is an exploratory data analysis tool used to discover the underlying groups in the data. The state-of-the-art algorithms for clustering big data sets are linear clustering algorithms, which assume that the data are linearly separable in the input space and use measures such as the Euclidean distance to define inter-point similarities. Though efficient, linear clustering algorithms do not achieve high cluster quality on real-world data sets, which are often not linearly separable. Kernel-based clustering algorithms employ non-linear similarity measures to define inter-point similarities. As a result, they are able to identify clusters of arbitrary shapes and densities. However, kernel-based clustering techniques suffer from two major limitations: (i) their running time and memory complexity grow quadratically with the size of the data set, so they cannot scale to data sets containing billions of data points; and (ii) their performance is highly sensitive to the choice of the kernel similarity function. Ad hoc approaches relying on prior domain knowledge are currently employed to choose the kernel function, and it is difficult to determine the appropriate kernel similarity function for a given data set.

In this thesis, we develop scalable approximate kernel-based clustering algorithms using random sampling and matrix approximation techniques.
They can cluster big data sets containing billions of high-dimensional points not only as efficiently as linear clustering algorithms but also as accurately as classical kernel-based clustering algorithms.

Our first contribution is based on the premise that the similarity matrices corresponding to big data sets can usually be well approximated by low-rank matrices built from a subset of the data. We develop an approximate kernel-based clustering algorithm that uses a low-rank approximate kernel matrix, constructed from a uniformly sampled small subset of the data, to perform clustering. We show that the proposed algorithm has linear running time complexity and low memory requirements, and that it achieves high cluster quality when provided with a sufficient number of data samples. We also demonstrate that it can be easily parallelized to handle distributed data sets. We then employ non-linear random feature maps to approximate the kernel similarity function, and design clustering algorithms that improve the efficiency of both kernel-based clustering and label assignment for previously unseen data points.

Our next contribution is an online kernel-based clustering algorithm that can cluster potentially unbounded streaming data in real time. It intelligently samples the data stream and finds the cluster labels using these sampled points. The proposed scheme is more effective than current kernel-based and linear stream clustering techniques, in terms of both efficiency and cluster quality.

We finally address the issues of high dimensionality and scalability to data sets containing a large number of clusters. Under the assumption that the kernel matrix is sparse when the number of clusters is large, we modify the above online kernel-based clustering scheme to perform clustering in a low-dimensional space spanned by the top eigenvectors of the sparse kernel matrix.
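The low-rank idea behind the first contribution can be illustrated with a generic Nyström-style approximation: sample m landmark points, build the small m × m kernel block and the n × m cross block, and use them to form an explicit feature matrix whose inner products approximate the full kernel. This is a minimal sketch of the general technique, not the thesis's exact algorithm; the function names and the parameters m and gamma are illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom_features(X, m=60, gamma=0.5, seed=0):
    """Rank-m Nystrom feature map built from m uniformly sampled landmarks.
    The full kernel K is approximated by K_nm K_mm^{-1} K_nm^T, so the rows
    of F = K_nm K_mm^{-1/2} can be clustered with plain linear k-means."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    K_mm = rbf_kernel(X[idx], X[idx], gamma)
    K_nm = rbf_kernel(X, X[idx], gamma)
    # Symmetric inverse square root of K_mm (small jitter for stability).
    vals, vecs = np.linalg.eigh(K_mm + 1e-8 * np.eye(m))
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
    return K_nm @ inv_sqrt

# Toy data: the n x m feature matrix stands in for the full n x n kernel.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 2))
F = nystrom_features(X)      # 120 x 60 instead of 120 x 120
K_approx = F @ F.T           # low-rank approximation of the kernel matrix
```

Clustering the rows of F with linear k-means costs O(nmk) per iteration rather than the O(n^2) of operating on the full kernel matrix, which is what makes the sampled low-rank route scalable.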
The combination of sampling and sparsity further reduces the running time and memory complexity. The proposed clustering algorithms can be applied in a number of real-world applications. We demonstrate the efficacy of our algorithms on several large benchmark text and image data sets. For instance, the proposed batch kernel clustering algorithms were used to cluster large image data sets (e.g., the Tiny Images data set) containing up to 80 million images, and the proposed stream kernel clustering algorithm was used to cluster over a billion tweets from Twitter for hashtag recommendation.
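The random-feature route mentioned in the abstract can likewise be sketched with the standard random Fourier feature construction of Rahimi and Recht: map each point through an explicit non-linear feature map whose inner products approximate the RBF kernel, then run ordinary Lloyd's k-means on the mapped points. This is a hedged, generic sketch, not the thesis's specific algorithm; the dimensions D, k, and gamma below are arbitrary illustrative values.

```python
import numpy as np

def random_fourier_features(X, D=2000, gamma=1.0, seed=0):
    """Random Fourier features approximating the RBF kernel
    exp(-gamma * ||x - y||^2), i.e. z(x) . z(y) ~= k(x, y)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def lloyd_kmeans(Z, k, iters=25, seed=0):
    """Plain Lloyd's k-means; run on the feature matrix Z, it
    approximates kernel k-means at linear cost in n."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
Z = random_fourier_features(X)   # explicit non-linear feature map
labels = lloyd_kmeans(Z, k=3)    # linear k-means in feature space
```

Because the kernel is replaced by an explicit map, previously unseen points can be labeled by mapping them through the same features and assigning the nearest center, which matches the abstract's point about efficient label assignment for new data.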