- Hierarchical learning for large multi-class classification in network data
- Liu, Lei
- Electronic Theses & Dissertations
Multi-class learning from network data is an important but challenging problem with many applications, including malware detection in computer networks, user modeling in social networks, and protein function prediction in biological networks. Despite extensive research on large multi-class learning, numerous issues remain insufficiently addressed, such as the efficiency of model testing, the interpretability of the induced models, and the ability to handle imbalanced classes. To overcome these challenges, there has been growing interest in recent years in hierarchical learning methods for large multi-class problems. Unfortunately, none of these methods were designed for classification of network data. In addition, very few studies have been devoted to the classification of heterogeneous networks, in which the nodes may have different feature sets. This thesis aims to overcome these limitations with the following contributions.

First, because the number of classes in big data applications can be very large, ranging from thousands to possibly millions, two hierarchical learning schemes are proposed to deal with so-called extreme multi-class learning problems. The first approach, recursive non-negative matrix factorization (RNMF), is designed to achieve sublinear runtime when classifying test data. Although RNMF reduces the test time significantly, it may assign the same class to multiple leaf nodes, which hampers the interpretability of the model as a concept hierarchy over the classes. Furthermore, since RNMF employs a greedy strategy to partition the classes, there is no theoretical guarantee that the partitions generated by the tree lead to a globally optimal solution.

To address the limitations of RNMF, an alternative hierarchical learning method, the matrix factorization tree (MF-Tree), is proposed. Unlike RNMF, MF-Tree is designed to optimize a global objective function while learning its taxonomy structure.
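Both RNMF and MF-Tree route a test example through a learned hierarchy over the classes, so prediction cost grows with the tree depth rather than the number of classes. The following is a minimal, hypothetical sketch of that idea only, not the thesis's algorithms: the crude single-dimension split below stands in for the NMF-based partitioning, and all names and data are illustrative.

```python
import numpy as np

def build_label_tree(centroids, labels):
    """Recursively split a set of class centroids into a binary label tree.

    The split heuristic (sort on the highest-variance dimension) is a
    placeholder for the matrix-factorization-based partitioning in the thesis.
    """
    if len(labels) == 1:
        return labels[0]  # leaf node: a single class
    dim = np.argmax(centroids.var(axis=0))       # dimension with largest spread
    order = np.argsort(centroids[:, dim])        # order classes along it
    half = len(order) // 2
    left, right = order[:half], order[half:]
    return {
        "left_mean": centroids[left].mean(axis=0),
        "right_mean": centroids[right].mean(axis=0),
        "left": build_label_tree(centroids[left], [labels[i] for i in left]),
        "right": build_label_tree(centroids[right], [labels[i] for i in right]),
    }

def predict(tree, x):
    """Route a test point down the tree: O(depth) comparisons, not O(#classes)."""
    while isinstance(tree, dict):
        d_left = np.linalg.norm(x - tree["left_mean"])
        d_right = np.linalg.norm(x - tree["right_mean"])
        tree = tree["left"] if d_left <= d_right else tree["right"]
    return tree

rng = np.random.default_rng(0)
C, d = 8, 5                                      # 8 synthetic classes in 5 dims
centroids = rng.normal(size=(C, d)) * 3
tree = build_label_tree(centroids, list(range(C)))
print(predict(tree, centroids[3]))               # routes through ~log2(C) nodes
```

With a balanced split, a test point touches about log2(C) internal nodes, which is the source of the sublinear test time; the RNMF issue noted above (one class reachable at multiple leaves) does not arise in this simplified sketch because each class appears in exactly one leaf.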
A formal proof is provided to show the equivalence between the objective function of MF-Tree and the Hilbert-Schmidt Independence Criterion (HSIC). Furthermore, to improve training efficiency, a fast algorithm for inducing an approximate MF-Tree is also developed.

Next, an extension of MF-Tree to network data is proposed. This approach seamlessly integrates both the link structure and node attribute information into a unified learning framework. To the best of our knowledge, this is the first study to automatically construct a taxonomy structure for large multi-class prediction in network classification. Empirical results suggest that the approach can effectively capture the relationships between classes and generate class taxonomy structures that resemble those produced by human experts. The approach is also easily parallelizable and has been implemented in a MapReduce framework.

Finally, we introduce a network learning task known as co-classification, which classifies heterogeneous nodes in multiple networks. Unlike existing node classification problems, the goal of co-classification is to learn the classifiers for multiple networks jointly, rather than classifying each network independently. The framework proposed in this thesis can exploit prior information about the relationships between classes in different networks to improve its prediction accuracy.
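For reference, the HSIC mentioned above measures the dependence between two sets of variables through kernel matrices. Its standard biased empirical estimator on n paired samples (the form commonly used in such equivalence proofs; the thesis's exact formulation may differ) is

```latex
\mathrm{HSIC}(X, Y) = \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H),
\qquad H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top},
```

where K and L are kernel matrices on the inputs and the labels, respectively (K_ij = k(x_i, x_j), L_ij = l(y_i, y_j)), and H is the centering matrix. Maximizing this criterion encourages class partitions that are maximally dependent on the input features.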