HIERARCHICAL LEARNING FOR LARGE MULTI-CLASS CLASSIFICATION IN NETWORK DATA

By

Lei Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2015

ABSTRACT

HIERARCHICAL LEARNING FOR LARGE MULTI-CLASS CLASSIFICATION IN NETWORK DATA

By

Lei Liu

Multi-class learning from network data is an important but challenging problem with many applications, including malware detection in computer networks, user modeling in social networks, and protein function prediction in biological networks. Despite the extensive research on large multi-class learning, there are still numerous issues that have not been sufficiently addressed, such as the efficiency of model testing, the interpretability of the induced models, and the ability to handle imbalanced classes. To overcome these challenges, there has been increasing interest in recent years in developing hierarchical learning methods for large multi-class problems. Unfortunately, none of them were designed for classification of network data. In addition, there are very few studies devoted to classification of heterogeneous networks, where the nodes may have different feature sets. This thesis aims to overcome these limitations with the following contributions. First, as the number of classes in big data applications can be very large, ranging from thousands to possibly millions, two hierarchical learning schemes are proposed to deal with the so-called extreme multi-class learning problems [1]. The first approach, known as recursive non-negative matrix factorization (RNMF), is designed to achieve sublinear runtime in classifying test data. Although RNMF reduces the test time significantly, it may also assign the same class to multiple leaf nodes, which hampers the interpretability of the model as a concept hierarchy for the classes. Furthermore, since RNMF employs a greedy strategy to partition the classes, there is no theoretical guarantee that the partitions generated by the tree would lead to a globally optimal solution. To address the limitations of RNMF, an alternative hierarchical learning method known as matrix factorization tree (MF-Tree) is proposed. Unlike RNMF, MF-Tree is designed to optimize a global objective function while learning its taxonomy structure. A formal proof is provided to show the equivalence between the objective function of MF-Tree and the Hilbert-Schmidt Independence Criterion (HSIC). Furthermore, to improve its training efficiency, a fast algorithm for inducing an approximate MF-Tree is also developed. Next, an extension of MF-Tree to network data is proposed. This approach can seamlessly integrate both the link structure and node attribute information into a unified learning framework. To the best of our knowledge, this is the first study that automatically constructs a taxonomy structure to predict large multi-class problems for network classification. Empirical results suggest that the approach can effectively capture the relationship between classes and generate class taxonomy structures that resemble those produced by human experts. The approach is also easily parallelizable and has been implemented in a MapReduce framework. Finally, we introduce a network learning task known as co-classification to classify heterogeneous nodes in multiple networks.
Unlike existing node classification problems, the goal of co-classification is to learn the classifiers in multiple networks jointly, instead of learning to classify each network independently. The framework proposed in this thesis can utilize prior information about the relationship between classes in different networks to improve its prediction accuracy.

ACKNOWLEDGMENTS

The past five years of PhD study at Michigan State University have been both a painful and an enjoyable life experience, much like climbing a mountain: although every step forward was hard work, I enjoyed the journey and realized that I could not have reached the mountain top to enjoy the beautiful view without the help of the people around me. Although words are not enough to express my gratitude to everyone who helped me, I would still like to give them my most sincere thanks.

First and foremost, I would like to thank my PhD advisor, Dr. Pang-Ning Tan, who generously funded my PhD study with assistantships and internship opportunities. He offered me so much advice, supervised me patiently, and always guided me toward the right and meaningful research directions. He has also provided many insightful discussions through our weekly individual meetings, group studies, and the data mining class. His guidance in both research and career has been priceless. Without his support and help, it is hard to imagine that I could have finished this thesis. Beyond research, his hard-working nature and easy-going personality have inspired me greatly, and I believe they will continue to benefit me throughout the rest of my life's journey. Besides my advisor, I would like to thank the rest of my PhD committee members, Prof. Rong Jin, Prof. Joyce Chai, and Dr. Selin Aviyente, for their support and guidance throughout my PhD program. I would like to specially thank Dr. Jin for shaping my thinking during the Machine Learning class in Spring 2010. I also want to thank Prof. Joyce Chai and Dr. Selin Aviyente for providing brilliant comments and suggestions on my thesis. It was a great honor to have worked with each of my PhD committee members, and I sincerely thank them for all their help and guidance. I look forward to collaborating with each of my committee members in the future. I want to thank present and past labmates in the Michigan State University LINKS Lab: Prakash Mandayam Comar, Jianpeng Xu, Zubin Abraham, Courtland VanDam, Xi Liu, Shuai Yuan, and Ding Wang. Special thanks to Prakash Mandayam Comar for keeping my lab and internship hours interesting and for insightful discussions on various research projects. My special thanks to our graduate director, Eric Torng, for his tremendous help and support during my time at MSU. I would like to thank our department secretaries, Linda Moore and Norma Teague, for all the administrative and conference travel work. I also want to give my sincere thanks to all my friends for keeping me happy at MSU. Last but not least, special thanks to my family; words cannot express how grateful I am to my parents and sister. My hard-working parents sacrificed much to support me and my sister through our PhD studies. Without their support, I would not have had the chance to study abroad and complete this thesis. Finally, I would like to express my appreciation to my beloved wife, Sang Ni, who has brought me joy and always cheered me up throughout my PhD study.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1 Introduction
  1.1 Applications of Multi-class Network Learning
    1.1.1 Prediction of User Interest in Social Networks
    1.1.2 Classification in Bibliographic Networks
    1.1.3 Detecting Malware in Computer Network
    1.1.4 Function Prediction in Protein Interaction Network
  1.2 Challenges of Multi-class Network Learning
    1.2.1 Extreme Multi-Class Network Learning
      1.2.1.1 Training Accuracy
      1.2.1.2 Testing Efficiency
      1.2.1.3 Imbalanced Class Distributions
      1.2.1.4 Evolving Classes
      1.2.1.5 Model Interpretability
    1.2.2 Classification on Heterogeneous Networks
  1.3 Thesis Contributions
  1.4 Thesis Outline

Chapter 2 Background & Related Work
  2.1 Problem Definition
  2.2 Node Classification
  2.3 Classification on Heterogeneous Networks
  2.4 Multi-class Learning
    2.4.1 Non-Hierarchical Methods
    2.4.2 Hierarchical Methods
  2.5 Taxonomy Learning
  2.6 Summary

Chapter 3 Recursive NMF for Extreme Multi-Class Learning
  3.1 Label Tree Structure for RNMF
  3.2 RNMF Algorithm for Label Tree Construction
  3.3 Experimental Evaluation
  3.4 New Class Detection
  3.5 Conclusion

Chapter 4 Matrix Factorization Tree for Extreme Multi-Class Learning
  4.1 Preliminaries
    4.1.1 Problem Definition
    4.1.2 Hilbert-Schmidt Independence Criterion (HSIC)
    4.1.3 Tree-Structured Covariance Matrix
  4.2 MF-Tree: Proposed Method
    4.2.1 Global Objective Function of MF-Tree
    4.2.2 Relationship between Global Objective Function and HSIC
    4.2.3 Additive Property of HSIC
    4.2.4 MF-Tree Construction
    4.2.5 Optimization
    4.2.6 MF-Tree Induction Algorithm
  4.3 Approximate MF-Tree
  4.4 Experimental Evaluation
    4.4.1 Experimental Setup
    4.4.2 Experimental Results
      4.4.2.1 Overall Results Comparison
      4.4.2.2 Effect of Parameter Tuning
    4.4.3 Comparison of Taxonomy Structure
    4.4.4 Results for Approximate MF-Tree
  4.5 Conclusion

Chapter 5 Hierarchical Tree Learning for Large Multi-class Network
  5.1 Methodology
    5.1.1 Global Objective Function
    5.1.2 Relationship between Proposed Method and HSIC
    5.1.3 Additive Property of HSIC
    5.1.4 Hierarchical Tree Construction from Network Data
    5.1.5 Parallelization of Hierarchical Tree Learning
  5.2 Experimental Evaluation
    5.2.1 Network Data
      5.2.1.1 U.S. Patent Data
      5.2.1.2 Wikipedia Data
    5.2.2 Evaluation Metric
      5.2.2.1 Micro-average F1
      5.2.2.2 Macro-average F1
    5.2.3 Experimental Results
      5.2.3.1 Results on U.S. Patent Data
      5.2.3.2 Results on Wikipedia Data 1
      5.2.3.3 Parallel Results on Wikipedia Data 2
  5.3 Conclusion

Chapter 6 Co-classification of Heterogeneous Networks
  6.1 Preliminaries
  6.2 Methodology
    6.2.1 Independent Classification of Wikipedia Users and Articles
    6.2.2 Co-Classification of Users and Articles
  6.3 Experimental Evaluation
    6.3.1 Data Set
  6.4 Conclusion

Chapter 7 Future Work
  7.1 Joint Label Tree Construction and Link Prediction for Network with Large Number of Classes
  7.2 Classification of Documents into Reading Positions

REFERENCES
LIST OF TABLES

Table 3.1  Comparison between RNMF and baseline algorithms on Wiki1 data set.
Table 3.2  Comparison between RNMF and baseline algorithms on Wiki2 data set.
Table 3.3  Average F1-score of the new class when applied to the Wiki1 data set.
Table 4.1  Summary of data used for our experiments.
Table 4.2  Experimental Results.
Table 4.3  Approximate MF-Tree on Wiki2.
Table 5.1  22 General Categories of U.S. Patent Data.
Table 5.2  Micro-average F1 Comparison on U.S. Patent Data.
Table 5.3  Macro-average F1 Comparison on U.S. Patent Data.
Table 5.4  Tree Depth Comparison on U.S. Patent Data.
Table 5.5  Performance Comparison on Wikipedia Data 1.
Table 5.6  Results on Wikipedia Data 2 using the parallel HTL method.
Table 6.1  Prior category relation matrix Ep(ku, ka).
Table 6.2  Category relation matrix E′t from training data.
Table 6.3  Four categories average results.

LIST OF FIGURES

Figure 1.1  Fraction of neighboring nodes that belong to the same class as the number of classes is varied.
Figure 3.1  Proposed label tree structure using the RNMF algorithm.
Figure 3.2  Average F1 score for RNMF as the branching factor (p) and maximum depth (d) are varied on Wiki1 data set.
Figure 3.3  Test time for RNMF as the branching factor (p) and maximum depth (d) are varied on Wiki1 data set.
Figure 3.4  Example of DAGSVM Structure for 5 classes.
Figure 3.5  New Class Detection Performance for individual Class on Wiki1.
Figure 4.1  Example Tree Structure with 8 classes.
Figure 4.2  Performance Comparison on Caltech1 Data.
Figure 4.3  Performance Comparison on Caltech2 Data.
Figure 4.4  Performance Comparison on Wiki1 Data.
Figure 4.5  Performance Comparison on Wiki2 Data.
Figure 4.6  MSE comparison of the tree produced by MF-Tree against the one produced by CMTL.
Figure 4.7  Example of Mapping Tree Structure to Adjacency Matrix.
Figure 4.8  Illustration of tree structure generated.
Figure 5.1  Class Distribution of U.S. Patent and Wikipedia Data 1.
Figure 5.2  Confusion Matrix.
Figure 5.3  F1 Score for each individual class in the U.S. Patent Data (where the classes are sorted in increasing class size).
Figure 5.4  A snippet of the class hierarchy structure (up to depth 4) generated by the proposed method for the 22 classes of the U.S. Patent Data.
Figure 5.5  Testing Efficiency Comparison on U.S. Patent Data.
Figure 5.6  F1 Score for each individual class in Wikipedia Data 1 (where the classes are sorted in increasing class size).
Figure 6.1  A Sample of User-Article network.
Figure 6.2  Comparison of F1 measure for each Article category between LS-SVM and Co-LS-SVM algorithms.
Figure 6.3  Comparison of F1 measure for each User category between LS-SVM and Co-LS-SVM algorithms.

LIST OF ALGORITHMS

Algorithm 1  Recursive NMF at a node v
Algorithm 2  MF-Tree Construction
Algorithm 3  Testing with MF-Tree
Algorithm 4  Approximate MF-Tree
Algorithm 5  Hierarchical Tree Construction from Network Data
Algorithm 6  Testing with Hierarchical Tree
Algorithm 7  Co-Classification Algorithm

Chapter 1 Introduction

The rapid proliferation of network data in various domains such as computer science [2], computational biology [3], and social science [4] has attracted considerable interest in applying data mining techniques to study complex networks. Examples of network mining tasks include link prediction [5], community detection [6], and influence propagation [7]. The focus of this thesis is on predicting the class of each node in a network, a problem that is known as node classification [5], link-based classification [8], or collective classification [9] in the literature. Node classification in a network differs from traditional classification, which assumes that the objects to be classified are independent and identically distributed (i.i.d.). The i.i.d. assumption is inappropriate for network data because the class of a node depends not only on its attributes, but also on its relationship to other nodes in the network. Furthermore, the number of classes in a network can be extremely large, ranging from thousands to millions, a problem that is known as extreme multi-class learning [1]. Although there has been considerable research on extreme multi-class learning for i.i.d. data, none of the existing methods were specifically designed for network data. We refer to the latter setting as multi-class network learning in this thesis.

1.1 Applications of Multi-class Network Learning

Below are four potential applications of multi-class learning in network data.

1.1.1 Prediction of User Interest in Social Networks

The growth of online social media has brought together billions of users who are actively engaged in daily conversations and information sharing. The widespread use of such websites is a marketing dream for advertisers who seek to predict the interest of users based on their demographic profiles and friendship relations. Given the wide variety of user interests, this is a clear example of an extreme multi-class network learning problem.

1.1.2 Classification in Bibliographic Networks

With the massive number of scientific papers written in different research disciplines, finding relevant publications from the literature has become an increasingly challenging problem. An information system that can effectively organize the research papers is needed. Such a system must be capable of automatically classifying the millions of papers published annually into their respective categories based on the keywords used and the citations among the papers.
1.1.3 Detecting Malware in Computer Network

Malicious software (or malware) automatically spreads from one device to another by covertly executing a software program without authorization from users. The infected devices can be used to steal sensitive information, send spam messages, launch denial-of-service attacks, and perform other malicious activities. Although many intrusion detection and prevention systems have been developed to monitor such activities, various techniques can be employed by hackers to evade detection. Despite such evasion techniques, one unavoidable aspect that is difficult to camouflage is that the infected devices must communicate back to their command and control servers. This has led to considerable interest in applying node classification techniques to detect infected devices in a networked environment. As there are many variants of computer viruses available, malware detection is another example of an extreme multi-class network learning problem. In a previous study, I developed a two-phase score propagation algorithm to predict the class label of each client IP in an HTTP communication network [2]. By jointly utilizing the IP-address connectivity graph ("who talks to whom") and flow-level information, the proposed technique achieves a precision rate of about 90% when applied to a subnet of a large internet service provider.

1.1.4 Function Prediction in Protein Interaction Network

Proteins are the most essential macromolecules of life. Knowing the functionality of proteins is critical for developing new drugs, better crops, and biofuels [10]. Since a protein usually interacts with other proteins in the biological system, one can model the relationship between proteins as an interaction network, where the nodes represent proteins and the links represent evidence of a shared function. Protein function prediction can thus be cast into a multi-class network learning problem.

1.2 Challenges of Multi-class Network Learning

This section presents some of the challenges in multi-class network learning that motivate the research directions of this thesis.

[Figure 1.1: Fraction of neighboring nodes that belong to the same class as the number of classes is varied (y-axis: fraction of neighboring nodes in the same class; x-axis: number of classes).]

1.2.1 Extreme Multi-Class Network Learning

Networks with a large number of classes have become increasingly prevalent in recent years. For example, Wikipedia contains roughly 2.4 million articles from more than 600,000 classes. Building a classifier that can handle such a massive number of classes is challenging for the following reasons.

1.2.1.1 Training Accuracy

Learning a global classifier that accurately predicts all the classes is a challenging problem. Such a classifier must be extremely flexible and capable of fitting complex nonlinear decision surfaces, which in turn makes it susceptible to the model overfitting problem. Furthermore, as the number of classes increases, the standard assumption that "nearby nodes in a network have similar classes" no longer holds, as illustrated in Figure 1.1. The figure was generated from the US Patent network data, where each node corresponds to a patent and each link corresponds to a citation between patents. The patents are assigned to classes whose relationships are represented in a hierarchical structure. The number of classes can be varied by assigning the patents to classes at a particular depth in the hierarchy. The results suggest that the percentage of neighboring nodes that belong to the same class decreases from 56% to 4.6% as the number of classes increases from 22 to 140,116.
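The neighborhood class-agreement statistic plotted in Figure 1.1 can be computed directly from an edge list and a node-to-class assignment. The following is a minimal sketch of that measurement; the function name and the toy edge list are hypothetical placeholders, not taken from the US Patent data.

```python
# Minimal sketch: fraction of linked node pairs that share the same class label.
# The edge list and label dictionary below are hypothetical placeholders.

def same_class_fraction(edges, labels):
    """edges: iterable of (u, v) pairs; labels: dict mapping node -> class id."""
    same = total = 0
    for u, v in edges:
        if u in labels and v in labels:
            total += 1
            same += int(labels[u] == labels[v])
    return same / total if total else 0.0

if __name__ == "__main__":
    toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]      # citation links
    toy_labels = {1: "A", 2: "A", 3: "B", 4: "B"}     # class at a chosen hierarchy depth
    print(same_class_fraction(toy_edges, toy_labels)) # 0.5 for this toy example
```

Repeating this computation with labels taken from different depths of the class hierarchy yields the kind of curve shown in Figure 1.1.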
1.2.1.2 Testing Efficiency

Many traditional classifiers, such as logistic regression, support vector machines, and boosting, are primarily designed for binary class problems. A popular strategy to extend these methods to multi-class classification is to transform the data into multiple binary classification problems, using techniques such as one-versus-one and one-versus-all. In practice, these techniques are not scalable for handling extreme multi-class learning problems. For example, with the one-versus-one strategy, one has to apply O(k^2) classifiers to predict the label of each test example, where k is the number of classes. The one-versus-all approach requires only O(k) classifiers for testing, which is still expensive when the number of classes is extremely large. How to reduce the testing cost to sublinear time while maintaining good classification accuracy is one of the major challenges that must be addressed.

1.2.1.3 Imbalanced Class Distributions

Most classification methods work well when there are representative examples from each class in the training data. For imbalanced class problems, standard classifiers tend to be biased toward accurate prediction of the larger classes instead of the rare classes, which are often more interesting. Unfortunately, network data with imbalanced classes is very common in the real world. For example, the largest category in the U.S. Patent citation network data contains 25,727 patents, while the smallest category has only 8 patents. How to deal with highly imbalanced classes in network classification is another challenge that must be addressed.

1.2.1.4 Evolving Classes

Extreme multi-class learning problems typically arise in application domains where new classes are continuously generated as more data is collected (e.g., in malware detection or image classification). Unfortunately, most of the existing multi-class learning methods would assign their test examples to one of the existing classes in the training set. Developing multi-class learning methods capable of detecting new classes remains an open research problem.

1.2.1.5 Model Interpretability

Classification methods are often evaluated in terms of their model accuracies. For some applications, model interpretability is another important criterion, where it is insufficient to generate accurate models without providing an explanation of how the inference was derived. This is typically achieved by using sparse learning methods [11, 12] to identify the most relevant features characterizing the different classes. For extreme multi-class problems, many of the classes may not be independent of each other. In this case, another useful model interpretability criterion would be the model's ability to convey information about the relationships among the large number of classes. How to design an accurate classifier that produces such information is another key challenge that must be addressed.

1.2.2 Classification on Heterogeneous Networks

Most existing network classification tasks have focused on developing algorithms for a single network. The drawback of this approach is that information from a single network may not be sufficient to build an accurate classifier, especially when the network has noisy attributes and incomplete link structure.
This has led to the following question: Can we improve the performance of a classifier by integrating data from other related networks? A previous study on spam detection in social media by Chen et al. [13] has shown that, instead of categorizing the social media postings alone (as spam or non-spam), the detection rate can be improved using information about the users who generated the posts. Another example is in collective classification for the Wikipedia editor network [14], where the classification accuracy was found to improve when the data was augmented with the Wikipedia article network. Both of these applications are examples of heterogeneous network classification, where the nodes in the network have different types (e.g., humans and Web pages), each with their own set of attributes. The links of the networks also have different types, which makes the problem more challenging since each link type carries a varying degree of information about the class similarity. Designing a technique that can handle different types of attributes and links is a challenge that must be addressed in heterogeneous network classification.

1.3 Thesis Contributions

This section summarizes the thesis contributions in terms of addressing the challenges due to the extreme multi-class and heterogeneous network learning problems described in the previous section. First, two novel hierarchical learning methods were developed for extreme multi-class classification. Hierarchical learning [15, 16, 17, 18, 19] is an effective strategy for dealing with large multi-class problems for several reasons. Not only does it yield a hierarchical structure that captures the relationships among the classes, it also helps to reduce testing time and alleviates the class imbalance problem by allowing the rare classes to be combined with examples from other classes when building the classification model. Although there are some domains where the subclass-superclass relationships are known a priori, e.g., Wikipedia categorization, the class hierarchy is typically unavailable or incomplete. As a result, there has been growing interest in recent years in developing classification methods that automatically induce a hierarchy from data. For example, Platt et al. [15] employ a directed acyclic graph (DAG) to combine multiple binary classifiers into a multi-class classifier. Some methods, such as the filter tree [17], impose an arbitrary tree structure on the classes and then learn a classifier to partition the training examples to follow their appropriate paths until they reach their assigned leaf nodes. Since the tree is arbitrarily constructed, it does not convey meaningful information about the relationships among the classes. Other methods, such as top-down decision trees, employ a greedy strategy to split the data into smaller subsets [20]. The models produced by such methods are also difficult to interpret because a class may be assigned to multiple leaf nodes in the hierarchy. This thesis presents two novel hierarchical learning methods to address the challenges of extreme multi-class learning problems. The first method, called recursive non-negative matrix factorization (RNMF, in short), was designed to speed up the computation time for model testing. RNMF builds a structure known as a label tree by performing a soft partitioning of the classes in a way that minimizes an information-theoretic loss function. A unique aspect of the proposed approach is that RNMF employs multiple one-class SVM models at its leaf nodes.
This allows the classifier to detect new classes that are not present in its training data, a unique option that is not available in any existing hierarchical learning method. By allowing the leaf nodes to contain more than one SVM model, this also helps to shorten the depth of the tree without significantly degrading its accuracy. The tree also allows the model to be applied in sublinear time. One major limitation of RNMF is that there is no global objective function optimized by its tree-growing procedure. Instead, RNMF employs a greedy, divide-and-conquer approach to partition the classes based on a local objective function when constructing its tree. In addition, it allows a class to be split and assigned to multiple leaf nodes, which limits its interpretability as a class hierarchy. To address this issue, this thesis develops a novel method known as matrix factorization tree (MF-Tree, in short), which organizes the classes into a binary tree by solving a global objective function. A theoretical proof is given to demonstrate the equivalence between the objective function and a recently proposed criterion for taxonomy learning called the Hilbert-Schmidt Independence Criterion (HSIC) [21, 22, 23, 24]. This thesis demonstrates how HSIC can be adapted to a supervised multi-class learning setting and shows the additive property of the function, which allows us to recursively decompose the optimization problem into multiple subproblems that can be organized into a binary tree structure. Despite these advantages, solving the subproblems at the top levels of the tree can be computationally prohibitive, especially when the numbers of training examples and classes are extremely large. To address the scalability issue, an approximate MF-Tree induction algorithm is proposed, where the class partitions at the higher levels of the tree are determined using a simpler approach instead of solving the more expensive matrix factorization problem. Experimental results suggest that the accuracy of the approximate MF-Tree approach is slightly lower than that of the original MF-Tree, but it is between 20% and 40% faster (depending on how far down the tree the approximation scheme is adopted). Building a hierarchical learning method for network data is challenging because the tree may partition the network into a large number of disjoint components. As the network becomes more disconnected, the link information is no longer useful for node classification. Another contribution of this thesis is to extend the MF-Tree approach to handle network data. The proposed framework incorporates both the node attribute similarity and the link connectivity into the matrix factorization framework. Although alternative network classification methods such as label propagation are available, these methods are ill-suited for extreme multi-class learning problems since the neighborhood assumption no longer holds for many of the nodes, as shown in Figure 1.1. The proposed MF-Tree framework for network data is also trivially parallelizable. A MapReduce implementation of the framework was developed and tested on a large-scale Wikipedia network data set that contains more than 600K classes. To the best of my knowledge, this is the data set with the largest number of classes ever reported in the literature. The final contribution of this thesis is the development of a novel multi-class learning algorithm for heterogeneous networks.
The approach jointly classifies a pair of networks that have different types of attributes and links but share similar characteristics in terms of their class information. The effectiveness of the proposed co-classification method was demonstrated using the Wikipedia article and editor network data. Empirical results showed that the proposed multi-class co-classification method is more effective than learning to classify the multiple classes in each network independently.

1.4 Thesis Outline

The remainder of this thesis is organized as follows. Chapter 2 presents the background and related work of this thesis. Chapter 3 introduces the recursive non-negative matrix factorization (RNMF) approach for building a label tree in extreme multi-class learning problems [20]. Chapter 4 presents the matrix factorization tree (MF-Tree) approach and provides a theoretical proof to demonstrate the equivalence between the proposed objective function and the Hilbert-Schmidt Independence Criterion [21, 22, 23, 24]. The MF-Tree approach is extended to network data in Chapter 5. Chapter 6 presents a novel multi-class co-classification framework for heterogeneous network data [14]. Finally, Chapter 7 summarizes the contributions of the thesis and presents the future directions of this research.

Chapter 2 Background & Related Work

This chapter introduces the research problem to be addressed in this thesis and presents a literature review on node classification, multi-class learning, and taxonomy learning.

2.1 Problem Definition

Consider a network, N, which is a graph of interconnected nodes. Each node represents an entity that is characterized by a set of attributes (or features). Each node is also assumed to have a class label. Assuming the labels are known for a subset of nodes in the network, the node classification task is to assign a label to each of the remaining unlabeled nodes in the network. More formally, the problem can be defined as follows:

Definition 1 (Node Classification) Let X ⊆ ℜ^d be the set of attribute values and Y = {1, 2, ..., c} be the set of discrete-valued class labels. Consider a network data set N = (V, E, D), where V = {v_1, v_2, ..., v_n} is the set of nodes, E ⊆ V × V is the set of links, and D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} ⊆ (X × Y)^n. Let V_l ⊂ V be the set of nodes whose labels are known and V_u = V − V_l be the set of unlabeled nodes. The goal of node classification is to infer a function f : X → Y that accurately assigns each node to its corresponding class label.

When the number of classes c > 2, this is also known as a multi-class network learning problem. We will use the terms multi-class node classification and multi-class network learning interchangeably throughout this thesis. As the number of classes considered in this thesis can be very large, it would be useful to derive a hierarchical structure that captures the relationships among the classes.

Definition 2 (Hierarchical Multi-class Network Learning) Let Z = {1, 2, ..., p} be the set of possible partitions of a class. Given a network data set G = (V, E, D), where D ⊆ (X × Y)^n, the goal of hierarchical multi-class network learning is to infer a set of classification functions Φ = {f | f : X → Z}, which are organized in a hierarchical structure induced from the given data set D.
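To make the node classification setting of Definition 1 concrete, the sketch below represents a partially labeled attributed network using plain Python containers. The class and field names are illustrative assumptions, not notation used elsewhere in the thesis.

```python
# Minimal sketch of the node classification setting in Definition 1.
# Names (NetworkData, features, labels) are illustrative, not from the thesis.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

import numpy as np

@dataclass
class NetworkData:
    nodes: List[int]                          # V = {v_1, ..., v_n}
    edges: List[Tuple[int, int]]              # E, a subset of V x V
    features: Dict[int, np.ndarray]           # x_i in R^d for each node
    labels: Dict[int, Optional[int]] = field(default_factory=dict)  # y_i in {1,...,c}, None if unknown

    def labeled_nodes(self) -> List[int]:     # V_l
        return [v for v in self.nodes if self.labels.get(v) is not None]

    def unlabeled_nodes(self) -> List[int]:   # V_u = V - V_l
        return [v for v in self.nodes if self.labels.get(v) is None]
```

The node classification task then amounts to learning a function that fills in the missing label entries from the node features, the link structure, and the labeled subset.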
The hierarchical structure is often represented in the form of a directed acyclic graph or a rooted tree. This thesis focuses on a special type of hierarchical structure known in the literature as a label tree [18].

Definition 3 (Label Tree) A label tree [18] is a hierarchical structure T = ⟨V, E, Φ, Λ⟩, where V is the set of nodes, E ⊂ V × V is the set of directed links connecting each node to its children, Φ is the set of classification functions, and Λ is the set of class labels. Let v_0 ∈ V denote the root node, which has no incoming links, and V_leaf ⊂ V denote the leaf nodes, which have no outgoing links. Furthermore, V_int = V − V_leaf are the internal nodes of the tree. Each internal node is associated with a classification function f ∈ Φ, while each leaf node is associated with a class label y ∈ Λ. If (v_i, v_j) ∈ E, then v_i is a superclass of v_j (or equivalently, v_j is a subclass of v_i).

2.2 Node Classification

Node classification in network data [25, 26, 27, 8] is an active research problem that has received considerable attention from both industry and academia. The problem is also known as collective classification in the network mining literature [28]. Previous methods developed for node classification can be divided into three categories: attribute-based, link-based, and hybrid methods. Attribute-based methods employ a set of attributes that characterize each node to learn their classification functions. The attributes are either explicit properties of the nodes themselves or derived from their neighborhood information. One of the earliest examples of using attribute-based methods is in Web page categorization, where the goal was to automatically classify Web pages based on the document content, including their HTML tags [29]. Furnkranz [30] also investigated the use of anchor text, along with the words near the anchor text, to predict the class label of web pages. Attribute-based methods are also popular for node classification in security-related applications. For example, Ma et al. [31] combined lexical and host-level features to build statistical models for classifying malicious URLs, whereas Le et al. [32] relied on lexical features to detect phishing Web sites. One practical advantage of using attribute-based methods is that we can apply existing classification techniques, such as support vector machines (SVM), decision tree classifiers, and logistic regression, to the node attributes to determine their classes. The drawback of these methods is that the link structure is not explicitly included in the learning scheme. Link-based methods, on the other hand, incorporate the link structure directly into the modeling approach. For example, the HITS algorithm [33] performs web page classification by exploiting the hyperlink topology. PageRank-based link analysis methods [34] have also been widely used in web page classification and for malicious node detection [35]. Graph-based semi-supervised methods [36, 37, 38, 35] are another popular family of link-based methods for node classification, using partially labeled data to label the remaining nodes. These methods are designed to optimize the following two criteria:

• The predicted class of labeled nodes in the network should agree with their true classes.
• Two nodes that are connected should share similar labels.

Zhu and Ghahramani [37] introduced the label propagation approach to transmit the labels from an initial set of seed nodes to other nodes in the graph. A link-based method has also been proposed for spam detection in [38], where an initial set of seed nodes were labeled as benign (non-malicious) and their labels were propagated to the rest of the graph until convergence. Nodes with low (benign) scores are then predicted as spam. A similar approach was developed in [38] to propagate "malicious" scores from known malicious nodes to other nodes in the entire graph.
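As an illustration of this family of methods, the following is a minimal sketch of iterative label propagation in the spirit of Zhu and Ghahramani [37]. It is not their exact formulation; the variable names, the symmetric-adjacency assumption, and the fixed iteration count are simplifications.

```python
import numpy as np

def propagate_labels(A, Y0, labeled_mask, n_iter=100):
    """Simple label propagation sketch.

    A            : (n, n) symmetric adjacency matrix of the network
    Y0           : (n, c) initial label matrix; one-hot rows for labeled nodes, zeros otherwise
    labeled_mask : (n,) boolean array marking the seed (labeled) nodes
    """
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1e-12)          # row-normalized transition matrix
    Y = Y0.astype(float).copy()
    for _ in range(n_iter):
        Y = P @ Y                           # each node averages its neighbors' label scores
        Y[labeled_mask] = Y0[labeled_mask]  # clamp the seed nodes to their known labels
    return Y.argmax(axis=1)                 # predicted class per node
```

Thresholding or taking the argmax of the propagated scores corresponds to the final step described above, where nodes with low benign scores are flagged as spam.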
Finally, there have been extensive studies on hybrid methods that combine attribute- and link-based information to improve node classification. For example, Yang et al. [39] showed that extracting metadata from related web sites is useful for improving the accuracy of hypertext classification. Zhu et al. [40] developed a matrix factorization algorithm that exploits both the content and the linkage information to address the Web page classification problem. There has also been quite extensive work on using graph regularization methods to combine link structure with content information. For example, Zhang et al. [25] developed a risk minimization formulation for learning from both text and graph structures. In [41], Lu and Getoor designed a structured logistic regression model to capture both content and link information. The authors of [27] introduced a matrix-alignment-based approach to integrate the node attributes and link information of a network, which weights the attributes and links according to their influence on the classification decision. Gan and Suel [42] proposed a two-stage approach to improve a Web spam classifier's performance using the link structure.

2.3 Classification on Heterogeneous Networks

Although node classification is a well-studied problem, there has been very little research focusing on the classification of nodes in heterogeneous networks. The key assumption in heterogeneous network classification is that there is useful information in other related networks that may aid in the classification of nodes in a given network. For example, consider the problem of classifying Wikipedia articles. Although it is possible to develop attribute-based, link-based, and hybrid methods for classifying the articles based on the Wikipedia article network alone, there are other related networks, e.g., the Wikipedia editor network, that could be used to enhance the classification accuracy. Current research on node classification in heterogeneous networks can be classified into two categories. The first category learns a unified classification function for the combined networks. For example, Ji et al. [43] employed a link-based approach to classify nodes in heterogeneous networks using a graph-based algorithm similar to [36]. Using network data from a bibliographic network (DBLP), the authors showed that classification accuracy can be significantly improved if we incorporate data from the entire heterogeneous information network. The second category learns a separate classification function for each underlying network but trains the multiple functions simultaneously. This strategy is equivalent to a multi-task learning approach for solving multiple related prediction problems [44]. For example, Chen et al. [13] presented a co-classification approach to simultaneously classify spammers in a user network and spam in a Web page graph. Their approach was inspired by the fact that the Web spam and spammer detection tasks are intuitively related, as spam content is more likely to be contributed by spammers than by non-spammers. Nevertheless, the method proposed in [13] is for binary class problems only. One of the contributions of this thesis is to extend the co-classification framework to multi-class network learning problems.
The framework proposed in this thesis also allows the incorporation of side information (in the form of a class prior matrix) to bias the learning algorithm.

2.4 Multi-class Learning

Classification with a large number of classes is an important but challenging problem, and it has received increasing attention in recent years [45, 46, 47, 19, 48, 49]. Existing methods typically employ a collection of binary classifiers, organized either in a flat (non-hierarchical) or a hierarchical (tree) structure.

2.4.1 Non-Hierarchical Methods

Traditional non-hierarchical learning methods, including one-versus-all [50], one-versus-one [51], and error-correcting output coding (ECOC) [52], require invocation of all the binary classifiers during test time in order to make their final prediction. Both one-versus-one and one-versus-all methods are inefficient when the number of classes is large, due to the quadratic and linear number of binary classifiers, respectively, that must be invoked during testing. The ECOC method assigns a unique m-bit codeword to each class label (where m > log_2 c) and trains a binary classifier to predict each bit of the codeword. Given a test instance, we must generate an m-bit vector by applying the m binary classifiers and then compute its distance to the codewords of all c classes, which is an expensive operation, to determine its predicted class. Compressed-sensing-based ECOC methods [53, 54] have been proposed as an alternative. Such methods exploit the sparsity of the output label space to ease the burden of large-scale multi-label learning. The authors concluded that, by incorporating compressed sensing into ECOC, the number of subproblems to be solved is logarithmic in the number of classes, thereby reducing the number of binary classifiers to be invoked during testing. Nonetheless, these methods are non-hierarchical as they do not generate a concept hierarchy to relate the classes. They also do not consider issues such as new class detection and the class imbalance problem, which are typically encountered in extreme multi-class learning problems.

2.4.2 Hierarchical Methods

Hierarchical methods [15, 16, 17, 18, 19] help to reduce the number of classifiers that need to be invoked during testing by organizing the classifiers in a rooted, hierarchical structure. Their runtime complexity for testing depends on the maximum depth of the hierarchy. Existing hierarchical methods can be divided into two categories, depending on how the hierarchical structure is created. In the first category, a hierarchical structure is arbitrarily imposed on the classes. Each intermediate node in the hierarchy contains a binary classifier, which is trained to partition the classes in a way that is consistent with the imposed structure. One example is the filter tree method developed by Beygelzimer et al. [17], which learns the tree structure in a bottom-up fashion. This method was adapted in [55] to build a conditional probability tree for the labels. Another example is the Decision Directed Acyclic Graph (DDAG) [15] approach, which arranges a set of one-versus-one classifiers into a rooted binary DAG in order to reduce the test complexity from O(c(c − 1)/2) to O(c). In all three cases, there is no hierarchical learning involved when the models are trained. In the second category, the hierarchical structure is automatically inferred when building the multi-class models. For example, Bengio et al. [18] proposed a label embedding tree approach that learns c one-vs-all classifiers to obtain an initial confusion matrix, which is subsequently used as the affinity matrix for applying spectral clustering to partition the classes into smaller subgroups.
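The class-partitioning step of such confusion-matrix-driven approaches can be sketched as follows. This is a simplified illustration of the general idea rather than the exact procedure of [18]; it assumes scikit-learn's SpectralClustering with a precomputed affinity, and the toy confusion matrix is hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def partition_classes(confusion, n_groups=2):
    """Group classes that are frequently confused with one another.

    confusion : (c, c) confusion matrix from an initial one-vs-all classifier,
                where entry (i, j) counts class-i examples predicted as class j.
    """
    # Symmetrize the counts so they can serve as an affinity matrix between classes.
    affinity = 0.5 * (confusion + confusion.T)
    np.fill_diagonal(affinity, 0.0)
    model = SpectralClustering(n_clusters=n_groups, affinity="precomputed")
    return model.fit_predict(affinity)       # cluster id for each of the c classes

# Example: classes 0-1 are mutually confused, as are classes 2-3, so the two
# pairs end up in separate groups that become children of the root.
toy_confusion = np.array([[50,  9,  1,  0],
                          [ 8, 51,  0,  1],
                          [ 1,  0, 49, 10],
                          [ 0,  2,  9, 49]], dtype=float)
print(partition_classes(toy_confusion, n_groups=2))
```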
Deng et al. [19] presented an approach that simultaneously learns the class partitioning and the weights of the linear classifiers by optimizing a joint objective function. However, the approach requires solving an integer programming problem, which is NP-hard. Instead, they relaxed the optimization problem into a linear program in order to find a polynomial-time solution. Shi et al. [56] also introduced a relaxed hierarchical learning approach for large-scale visual recognition. At each node, the classes are colored as positive or negative by a binary classifier while a subset of the confusing classes is ignored. The coloring of the classes and the training of the binary classifiers are performed simultaneously using a principled max-margin optimization method.

2.5 Taxonomy Learning

Taxonomy learning, which aims to encode the conceptual relationships among the classes using an unsupervised learning approach, is another area related to this thesis. A classical method for taxonomy learning is to use agglomerative hierarchical clustering methods, such as single-link and complete-link, to derive the hierarchical relationships. Another approach is to construct the taxonomy with a divisive (top-down) partitioning approach. For example, Kuang et al. [57] developed a hierarchical clustering method that recursively partitions the data into two subsets using a rank-2 nonnegative matrix factorization approach. Their proposed method also performs outlier detection to avoid generating too many small leaf nodes. Although the method resembles the matrix factorization based methods developed in this thesis, it is unsupervised and there is no global objective function solved by their algorithm. The Hilbert-Schmidt Independence Criterion (HSIC) is another recently proposed criterion for unsupervised taxonomy learning [21, 22, 23, 24]. Inspired by this metric, this thesis demonstrates how HSIC can be applied to a supervised multi-class network learning problem.

2.6 Summary

This chapter presented a formal definition of the multi-class network learning problem and reviewed some of the previous work related to this research. Although there have been quite extensive studies on node classification in networks, they have mostly focused on binary classification problems and were applied to a single network. The primary goal of this thesis is to address these limitations and expand the use of hierarchical multi-class learning methods to network data.

Chapter 3 Recursive NMF for Extreme Multi-Class Learning

As noted in Chapter 1, extreme multi-class learning is a challenging problem due to issues such as testing complexity, imbalanced class distributions, the presence of new classes in the test set, and the interpretability of the models. To overcome these challenges, this chapter describes a novel label tree learning algorithm called recursive non-negative matrix factorization (RNMF) for extreme multi-class learning problems. Key advantages of RNMF include:

1. Its testing complexity is sublinear in the number of classes.
2. It can detect the presence of new classes in the test data.
RNMF is a label tree construction algorithm that recursively partitions the training data into smaller subsets based on their classes and attribute values until a stopping criterion is satisfied. At each recursive step, it learns a classifier to predict to which partition a given test example should be assigned. RNMF allows for soft partitioning of the classes by minimizing an information-theoretic loss function, which can be cast into a regularized non-negative matrix factorization problem. The regularization term introduced by our framework is unique in that it minimizes the entropy of the partitions. Unlike other label tree learning methods, each leaf node at the bottom of the tree contains multiple one-class SVM models to distinguish instances from different classes. Experimental results suggest that the induced tree achieves accuracy comparable to more complex tree learning methods but is considerably shorter, thus reducing its overall testing time.

3.1 Label Tree Structure for RNMF

Let D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} denote a set of training examples sampled from a distribution P over X × Y, where X = ℜ^d is the d-dimensional feature space and Y = {1, 2, ..., c} is the set of classes. The goal of multi-class classification is to learn a function f that minimizes the classification error of any instance drawn from P, i.e., Pr_{(x,y)∼P}[f(x) ≠ y]. A typical approach for multi-class learning is to reduce the problem into multiple binary classification tasks. However, when the number of classes is large, applying all the binary classifiers to determine the predicted class becomes computationally infeasible. RNMF provides a systematic approach to construct the binary classifiers and arrange them in a label tree structure (see Definition 3). An illustration of the label tree structure is shown in Figure 3.1.

[Figure 3.1: Proposed label tree structure using the RNMF algorithm.]

The label tree obtained from RNMF differs from conventional decision trees in the following ways. First, conventional decision tree classifiers employ a decision stump (i.e., a classifier consisting of only one splitting attribute) at each internal node of the tree, whereas RNMF considers a linear combination of the attributes to partition the input space. The weights of the attribute combination are obtained by solving a regularized non-negative matrix factorization problem. A regularization term is introduced to prevent the tree from distributing examples from one class to more than one child during the tree construction process. Second, each leaf node of a conventional decision tree classifier contains a class label, whereas each leaf node of RNMF contains one or more one-class SVM classifiers [58]. By allowing the leaf nodes to contain multiple classifiers, this helps to reduce the depth of the tree without significantly degrading its accuracy. In a one-class SVM, the training examples from each class are encapsulated in a hypersphere constructed in an appropriate feature space. Once the enclosing hyperspheres for all the classes at a leaf node have been constructed, the test examples are classified based on their distance to the center of each hypersphere. Furthermore, the one-class SVM models can be used to detect the presence of new classes. The label tree structure generated by RNMF is parameterized by its branching factor (p) and maximum depth (d). As will be described in the next section, the two parameters can be varied by the user to alter the size of the label tree. For example, choosing a branching factor larger than 2 may result in purer splits among the children, thus reducing the maximum depth needed to achieve high accuracy. Decreasing the maximum depth parameter, on the other hand, may result in leaf nodes that contain examples belonging to a larger number of classes. This, in turn, may increase the number of one-class SVM models needed to achieve high accuracy.
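A label tree of this kind can be represented by a small recursive data structure in which internal nodes store the learned feature weight matrix W and leaf nodes store the per-class one-class SVM models. The sketch below is a hypothetical illustration (the class and attribute names are not from the thesis); its routing rule uses the partition membership vector π described at the end of Section 3.2.

```python
# Hypothetical container for the RNMF label tree described in Section 3.1.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np

@dataclass
class LabelTreeNode:
    W: Optional[np.ndarray] = None                 # d x p feature weight matrix (internal nodes)
    children: List["LabelTreeNode"] = field(default_factory=list)
    leaf_models: Dict[int, object] = field(default_factory=dict)  # class id -> one-class model

    def is_leaf(self) -> bool:
        return not self.children

    def route(self, x: np.ndarray) -> "LabelTreeNode":
        # Partition membership vector pi = x^T W (W^T W)^{-1}; follow the largest entry.
        pi = x @ self.W @ np.linalg.inv(self.W.T @ self.W)
        return self.children[int(np.argmax(pi))]

def descend_to_leaf(root: LabelTreeNode, x: np.ndarray) -> LabelTreeNode:
    node = root
    while not node.is_leaf():
        node = node.route(x)
    return node   # the leaf's one-class models decide the final label (or flag a new class)
```

Since a test example follows a single root-to-leaf path, the number of classifier evaluations grows with the tree depth rather than with the number of classes, which is the source of the sublinear testing cost.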
For example, by choosing a branching factor larger than 2, this may result in purer splits among the children, thus reducing the maximum depth needed to achieve high accuracy. Decreasing the maximum depth parameter, on the other hand, may result in leaf nodes that contain examples that belong to a larger number of classes. This, in turn, may increase the number of one-class SVM models needed to achieve high accuracy. 23 3.2 RNMF Algorithm for Label Tree Construction Let X be an n × d data matrix containing the feature vector of the training examples and Y be an n × c matrix that represents their corresponding class labels, i.e., Yij = 1 if the ith training example belongs to the j th class, and zero otherwise. Similar to existing decision tree classifiers, RNMF seeks to recursively partition the training examples into purer subsets such that examples that belong to the same class should be assigned to the same partition. Specifically, at the each node v, our goal is to learn a partition matrix L of size n × p, where n is the number of training examples assigned to the node and p is the branching factor, as well as the associated feature weight matrix W, which is a d × p matrix that indicates the importance of each feature in partitioning the training examples. Each element Ljq represents the probability that the j th training example should be assigned to the q th partition at the given node v. To ensure L can be interpreted as probabilities, each column in L must sum up to 1. This can be achieved by adding the constraint LT 1n = 1p into the objective function, where 1p denote a p-dimensional column vector of all 1s. Furthermore, a regularization term is introduced to prevent training examples from the same class being assigned to different partitions. The partitions created at each node v are obtained by minimizing the following objective function: min L≥0,W≥0 s.t. D(X ∥ LWT ) + λH(LT Y) LT 1n = 1p (3.1) where D(X||LWT ) refers to the Kullback-Leilber divergence between the input data matrix X and the product of its latent factors LWT . The regularizer term H(LT Y) corresponds 24 to the entropy of the class distribution of each partition and is needed to avoid splitting the data from the same class to different partitions, i.e., H(LT Y) = − where (LT Y)i+ = ∑ p c ∑ (LT Y)ij (LT Y)i+ ∑ (LT Y)ij log , (LT Y)++ (LT Y)i+ (LT Y)i+ i=1 j=1 ∑ T T T ij (L Y)ij . Note that entropy increases j (L Y)ij and (L Y)++ = when the class distribution within each partition becomes more uniform and decreases when the distribution skews towards a small subset of the classes. Furthermore, since Y1k = 1n , therefore, LT Y1k = LT 1n = 1p (for a feasible solution that satisfies the constraint given in (3.1)). As a result, ∀i : (LT Y)i+ = 1 and (LT Y)++ = p. The entropy calculation can be simplified as follows: H(LT Y) = − 1∑ T (L Y)ij log(LT Y)ij p i,j Note that λ is a user-specified parameter that controls the trade off between optimal partition of X and class purity of each partition1 . The Lagrange formulation of (3.1) is given by: L = ∑ [Xij log i,j − λ ∑ Xij (LW T ) − Xij + (LW T )ij ] ij (LT Y )ij log(LT Y )ij − ∑ s ij ∑ µs ( LTsm − 1) m The objective function is minimized by taking its partial derivative with respect to the 1 λ is set to 1 in our experiments, but it can also be determined via cross-validation. 
The objective function is minimized by taking the partial derivatives of the Lagrangian with respect to the parameters W and L and setting them to zero:

\frac{\partial \mathcal{L}}{\partial W_{pq}} = -\sum_{i} \frac{X_{ip} L_{iq}}{(LW^T)_{ip}} + \sum_{i} L_{iq} = 0 \qquad (3.2)

\frac{\partial \mathcal{L}}{\partial L_{pq}} = -\sum_{j} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} + \sum_{j} W_{jq} - \lambda \sum_{j} Y_{pj} \log (L^T Y)_{qj} - \lambda \sum_{j} Y_{pj} - \mu_q = 0 \qquad (3.3)

By summing over the variable p in Equation (3.3), we have

\mu_q = -\frac{1}{n} \sum_{jp} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} + \sum_{j} W_{jq} - \lambda \sum_{jp} \frac{Y_{pj}}{n} \log (L^T Y)_{qj} - \lambda \qquad (3.4)

Replacing (3.4) back into (3.3), the partial derivative can be rewritten as follows:

\frac{\partial \mathcal{L}}{\partial L_{pq}} = -\sum_{j} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} - \lambda \sum_{j} Y_{pj} \log (L^T Y)_{qj} + \frac{1}{n} \sum_{jp} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} + \lambda \sum_{jp} \frac{Y_{pj}}{n} \log (L^T Y)_{qj} \qquad (3.5)

The objective function can be solved using a gradient descent approach. Following the approach used in [59], the gradient descent formula can be transformed into a multiplicative update formula as follows:

W_{pq} = W_{pq} \frac{(X^T L)_{pq}}{(W L^T L)_{pq}} \qquad (3.6)

L_{pq} = L_{pq} \frac{\alpha}{\beta} \qquad (3.7)

where

\alpha = \sum_{j} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} + \lambda \sum_{j} Y_{pj} \log (L^T Y)_{qj}, \qquad \beta = \frac{1}{n} \sum_{jp} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} + \lambda \sum_{jp} \frac{Y_{pj}}{n} \log (L^T Y)_{qj}

Algorithm 1 Recursive NMF at a node v
Input: Matrices X_v, Y_v, and MaxIter
Output: Matrices W_v and L_v
Initialize: W_v^old and L_v^old to random non-negative matrices
for j = 1 to MaxIter do
  update W_v^new using (3.6)
  update L_v^new using (3.7)
  set W_v^old <- W_v^new; L_v^old <- L_v^new
end for

By initializing W and L to be non-negative matrices, the update formulas ensure that the solution will also be non-negative. Algorithm 1 summarizes the key steps of the proposed recursive non-negative matrix factorization (RNMF) algorithm for finding the partitions at any given node in the tree. The recursive partitioning step continues until one of the following stopping criteria is met: (1) all the training examples in a partition belong to the same class, (2) the node contains fewer than minleaf training examples, or (3) only one class contains more than minclass examples (we set minleaf = 15 and minclass = 5 in our experiments).

Once a leaf node is encountered, the algorithm constructs a one-class SVM model for each class i by learning a hypersphere that encapsulates all class i training examples that reach the particular leaf node. The hypersphere for class i is constructed by first labeling the instances belonging to class i as +1 and those that belong to other classes as -1. We then construct two concentric hyperspheres such that the inner sphere encloses as many instances from class i as possible, while the outer sphere is constructed so that instances that do not belong to class i lie outside of it. The radial distance between the two hyperspheres is called the classifier's margin, which defines the objective function to be optimized by the 1-class SVM learning algorithm. For example, the hypersphere for predicting class k is obtained by solving the following [60]:

\min_{R_k, a_k, d_k, \xi_i, \xi_l} \; R_k^2 - M d_k^2 + \frac{C}{N_k} \sum_{i: y_i = k} \xi_i + \frac{C}{N_{\bar{k}}} \sum_{l: y_l \ne k} \xi_l
\text{subject to} \; \|x_i - a_k\|^2 \le R_k^2 + \xi_i, \; \forall i: y_i = k
\|x_l - a_k\|^2 \ge R_k^2 + d_k^2 - \xi_l, \; \forall l: y_l \ne k
\xi_i \ge 0, \; \forall i: y_i = k; \quad \xi_l \ge 0, \; \forall l: y_l \ne k \qquad (3.8)

where N_k and N_{\bar{k}} are the numbers of instances that belong to the k-th class and its complement, respectively. Here, the positive examples are enumerated by indices i and the negative examples by indices l. The xi_i and xi_l are slack variables, while C >= 0 controls the penalty for misclassification errors. The variables a_k and R_k represent the center and radius of the hypersphere constructed for the k-th class, while \sqrt{R_k^2 + d_k^2} defines the radius of the outer hypersphere. It can be shown that the margin width is given by \sqrt{R_k^2 + d_k^2} - R_k.
To maximize the margin, we need to maximize d_k^2 and minimize R_k^2 simultaneously; the parameter M >= 0 controls the trade-off between these two terms. The Wolfe dual form of the preceding optimization problem is given below:

\max_{\alpha} \; \sum_{i: y_i = k} \alpha_i K(x_i, x_i) - \sum_{l: y_l \ne k} \alpha_l K(x_l, x_l) - \sum_{i,j: y_i, y_j = k} \alpha_i \alpha_j K(x_i, x_j) - \sum_{l,m: y_l, y_m \ne k} \alpha_l \alpha_m K(x_l, x_m) + 2 \sum_{i,l: y_i = k,\, y_l \ne k} \alpha_i \alpha_l K(x_i, x_l)
\text{subject to} \; \sum_{i: y_i = k} \alpha_i = 1 + M, \; 0 \le \alpha_i \le \frac{C}{N_k}; \quad \sum_{l: y_l \ne k} \alpha_l = M, \; 0 \le \alpha_l \le \frac{C}{N_{\bar{k}}} \qquad (3.9)

where K(x, x') is a kernel function that measures the similarity between the pair of instances x and x'. Widely used kernel functions include the RBF kernel, K(x, x') = \exp\big(-\frac{\|x - x'\|^2}{2\sigma^2}\big), and the polynomial kernel, K(x, x') = (1 + x^T x')^q. The optimization problem given in (3.9) can be solved using quadratic programming.

After the label tree is constructed, we can apply the tree to classify unlabeled examples in the following way. Starting from the root node of the tree, we apply the feature weight matrix W to determine the partition to which the unlabeled example x_test should be sent. Specifically, we compute a partition membership vector pi and assign the test example to the partition j with the largest magnitude, i.e., j = \arg\max_k \pi_k, where \pi = x_{test}^T W_v (W_v^T W_v)^{-1}. The procedure is repeated until the test example reaches one of the leaf nodes, where we apply the 1-class SVM models associated with the leaf node to predict its label. Specifically, if the test example is encapsulated in the hypersphere for class i, then the test example is assigned to that class. If the test example is located within more than one hypersphere, we compute the distance between the test example and the hypersphere centers and assign it to the hypersphere with the smallest distance. Finally, if the test example is not encapsulated by any hypersphere, we declare it as belonging to a new class.
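The routing step just described can be sketched in a few lines of Python/NumPy. The node attributes used below (W, children, is_leaf) are illustrative assumptions about how an implementation might store the tree; the leaf-level 1-class SVM decision is left as a comment.

```python
import numpy as np

def route_to_leaf(x_test, root):
    """Route a test example down the RNMF label tree as described above.
    Each internal node is assumed to hold its feature-weight matrix W
    (d x p) and a list of child nodes; leaves hold the 1-class SVM models."""
    node = root
    while not node.is_leaf:
        W = node.W
        # Partition membership vector: pi = x^T W (W^T W)^{-1}
        pi = x_test @ W @ np.linalg.inv(W.T @ W)
        node = node.children[int(np.argmax(pi))]
    return node   # at the leaf, apply its 1-class SVM models to predict the label
```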
3.3 Experimental Evaluation

We conducted our experiments on the Wikipedia data dump generated on October 9, 2009. After preprocessing, there were about 3.6 million Wikipedia articles in the data, which were classified into 336K categories. We created two subsets of the data for our experiment, denoted as Wiki1 and Wiki2, respectively. Wiki1 contains 24,379 articles from the largest 214 categories, whereas Wiki2 contains 65,156 articles chosen from the largest 1,618 categories of the Wikipedia dump. We compared the performance of RNMF against two baseline hierarchical learning algorithms proposed in the literature, namely, DAGSVM [15] and the confusion matrix approach (CM) described in [18]. For a fair comparison, we set the branching factor (p) for our algorithm to 2. The performance of the classifiers was evaluated according to their average F1 scores over all classes. All the results reported in this study were based on 5-fold cross validation. The results for the two data sets are summarized in Tables 3.1 and 3.2.

Table 3.1 Comparison between RNMF and baseline algorithms on the Wiki1 data set.
                 CM [18]            DAGSVM [15]        RNMF (p=2)
F1               0.4826 +/- 0.0017  0.5309 +/- 0.002   0.5116 +/- 0.0012
Test time (s)    195.5 +/- 9.2      942.9 +/- 17.6     132.2 +/- 3.2
Depth            10                 214                8

Table 3.2 Comparison between RNMF and baseline algorithms on the Wiki2 data set.
                 CM [18]            DAGSVM [15]        RNMF (p=2)
F1               0.2518 +/- 0.0025  0.3012 +/- 0.0019  0.2892 +/- 0.0036
Test time (s)    336.2 +/- 13.2     1524.6 +/- 16.5    289.3 +/- 6.1
Depth            12                 1618               10

The results suggest that RNMF outperforms the CM approach both in terms of its F1 score and its test efficiency. When compared against DAGSVM, which requires building a quadratic number of 1-vs-1 classifiers, the F1-score of RNMF is slightly lower, but its test time is at least 7 times faster since the depth of RNMF's label tree is considerably shorter. Note that the performance of RNMF depends on the branching factor (p) and the maximum depth (d) of the tree. We evaluated how these two parameters affect the testing time and classification accuracy on the Wiki1 data set. If the tree has only a single node, which is equivalent to a flat (non-hierarchical) model of c one-class SVM models, the average F1 score and test times of RNMF are comparable to DAGSVM (see Figures 3.2 and 3.3). As the depth of the tree increases, both the testing time and the F1 score of RNMF decrease. The testing time decreases because fewer 1-class SVM models need to be invoked to predict the label of a test example at a given leaf node as the tree grows deeper. On the other hand, the F1-score is lower for two reasons. First, as the tree is extended, the additional partitioning may divide the training examples from the same class into multiple partitions. Second, fewer training examples are available to build the 1-class SVM models at the leaf nodes as the tree grows deeper, which makes the models more susceptible to overfitting.

[Figure 3.2: Average F1 score for RNMF as the branching factor (p) and maximum depth (d) are varied on the Wiki1 data set.]

[Figure 3.3: Test time for RNMF as the branching factor (p) and maximum depth (d) are varied on the Wiki1 data set.]

Furthermore, if we compare the average F1 scores at the same maximum depth d, the tree with a branching factor equal to 2 is better than the tree with a branching factor equal to 5 (when d > 1). This is because the tree with the smaller branching factor contains fewer nodes (since their maximum depths are the same), which means its training examples are split into fewer branches. This results in a better classifier, as more training examples become available at each leaf node to train the 1-class SVM models.

3.4 New Class Detection

As previously noted, one of the key advantages of RNMF is that it can detect new classes that are not present in the training set. To evaluate the performance of RNMF in terms of its ability to detect new classes, we performed the following leave-one-class-out experiment on Wiki1. Specifically, we eliminate one of the 214 classes from the training set and use the data instances belonging to the remaining 213 classes for training purposes. During testing, we apply the induced label tree to a test set that contains all 214 classes and treat the eliminated class as the new class. We repeat the experiment 214 times, each time leaving out one of the classes from the training set. The average F1 score of the new class is given in Table 3.3 (these results were reported after repeating the experiment 3 times). We compared the performance of the algorithm against DAGSVM. DAGSVM [15] constructs a series of binary SVM classifiers and organizes them in a directed acyclic graph. An example of the DAGSVM structure for 5 classes is shown in Figure 3.4.
Starting from the root node, DAGSVM eliminates one class at a time until it reaches a leaf node, which contains only a single class. To extend DAGSVM to detect new classes, we terminate the DAG one level earlier, so that the leaf nodes contain training examples from two classes. A binary SVM classifier is then trained to discriminate between the two classes. When a test example reaches a leaf node, we apply the binary classifier to predict its class. If the confidence of the SVM prediction is high, we assign the test example to its predicted class. However, if the confidence of the SVM prediction is low (i.e., when the test example is located within the geometric margin of the classifier), we predict it to be a new class.

[Figure 3.4: Example of the DAGSVM structure for 5 classes.]

Table 3.3 Average F1-score of the new class on the Wiki1 data set.
           # classes (Training)   # classes (Testing)   F1
RNMF       213                    214                   0.3192 +/- 0.016
DAGSVM     213                    214                   0.2833 +/- 0.012

Based on the results in Table 3.3, RNMF clearly outperforms DAGSVM in terms of its ability to detect new classes. We have also compared the F1-score of each class that was omitted from the training set in our new class detection experiment. The results are shown in Figure 3.5, where the horizontal axis corresponds to the F1 score of the DAGSVM algorithm and the vertical axis corresponds to the F1 score of RNMF. There are 214 data points in the plot, one for each class. If a point is located above the diagonal, the F1-score of RNMF is better than that of DAGSVM for the particular omitted class. Since 82% of the points lie above the diagonal, this suggests that RNMF outperforms DAGSVM in detecting the new class for the majority of the classes present in the Wiki1 data.

[Figure 3.5: New class detection performance for individual classes on Wiki1 (DAGSVM F1 score vs. RNMF F1 score).]

3.5 Conclusion

This chapter presented a novel label tree learning algorithm for extreme multi-class problems called recursive non-negative matrix factorization (RNMF). The algorithm creates soft partitions of the classes by solving a regularized non-negative matrix factorization problem and uses a set of 1-class SVM models to predict the classes at the leaf nodes of the tree. Experimental results on the Wikipedia data suggest that RNMF can achieve accuracy comparable to some of the state-of-the-art label tree algorithms but is considerably faster (since it produces shorter trees). In addition, it can detect new classes that are not present in the training data. However, RNMF has two major limitations. First, there is no global objective function optimized by the algorithm; instead, it greedily partitions the training data to minimize a local objective function. Second, the classes can be fragmented across multiple leaf nodes in the tree. The next chapter presents an alternative method known as MF-Tree that overcomes these limitations.

Chapter 4 Matrix Factorization Tree for Extreme Multi-Class Learning

Hierarchical learning methods [15, 16, 17, 18, 19] are designed to overcome the challenges of extreme multi-class classification by employing a set of binary classifiers and arranging them into a directed acyclic graph or rooted tree structure. These methods have several advantages. First, they can lead to sublinear test complexity since only the binary classifiers located along the path from the root node to a leaf node are invoked when classifying a test example.
Second, one could alleviate the class imbalance problem, particularly at the top levels of the hierarchy, by grouping related classes together (instead of learning to discriminate one class from another, some hierarchical methods are designed to distinguish one group of classes from another group at the top levels of the tree). Finally, the resulting concept hierarchy may offer interesting insights into the conceptual relationships among the large number of categories.

Over the years, numerous hierarchical classification methods have been developed, from greedy top-down decision tree classifiers, which use simple attribute tests as internal nodes of the tree, to more advanced methods that use binary classifiers at their internal nodes to distinguish the different classes. These methods have different pros and cons. While decision tree classifiers are easy to construct, they are susceptible to the class imbalance and data fragmentation problems. In addition, each class can be distributed across multiple branches of the tree, resulting in an unnecessarily large tree. This affects both the test efficiency and the ability of the classifier to generate a meaningful taxonomy of the conceptual relationships among the classes. Alternative methods, such as filter trees [17] and their variants, impose a random tree structure and learn an appropriate binary classifier at each internal node of the tree. Because of the arbitrary manner in which the classes are assigned to the nodes, the resulting tree might be suboptimal and loses descriptive information about the class relationships. More importantly, neither class of methods is designed to optimize a global objective function.

To address these problems, we propose a novel method known as matrix factorization tree (MF-Tree, in short), which organizes the classes into a binary tree by optimizing a global objective function based on a squared error loss. We first provide a theoretical proof to demonstrate the equivalence between the proposed squared error loss function and the Hilbert-Schmidt Independence Criterion (HSIC) [21][22][23][24]. HSIC is a recently proposed criterion for taxonomy learning in an unsupervised learning setting. In this chapter, we demonstrate how HSIC can be applied to a supervised multi-class learning problem. We also show the additive property of the objective function, which allows us to decompose the problem into a hierarchical binary classification task that can be organized in a binary tree structure. Nevertheless, for massive classification problems involving hundreds of thousands of classes or more, factorizing the matrices is computationally expensive, especially at the top levels of the tree. To address this problem, we present an approximate algorithm for inducing MF-Tree, in which the initial assignment of classes to the child nodes at the top levels of the hierarchy is determined from the class sizes and the distances between class centroids instead of solving the more expensive matrix factorization problem. The factorization can then be applied at the lower levels of the tree, where there are fewer classes and training examples to deal with. We show that the performance of the approximate MF-Tree is comparable to that of MF-Tree, with a significant improvement in its training time. In short, the main contributions of this chapter are as follows:

1. We introduce a novel hierarchical classification method for large multi-class learning problems known as MF-Tree.
The tree is designed to optimize a global loss function.

2. We provide a theoretical proof showing the equivalence between optimizing the squared error loss function of a matrix factorization problem and the Hilbert-Schmidt Independence Criterion (HSIC) by setting appropriate constraints on the latent factors of the factorization.

3. We demonstrate the additive property of HSIC that enables the design of a hierarchical tree learning framework.

4. We present an approximate algorithm for learning MF-Tree to speed up the tree induction process.

5. We provide extensive experiments to evaluate and compare the performance of the proposed method against several state-of-the-art baselines.

4.1 Preliminaries

This section presents a formal definition of the multi-class learning problem. We also introduce the concepts of the Hilbert-Schmidt Independence Criterion and the tree-structured covariance matrix. The connections between these concepts and the proposed MF-Tree method will be shown in Section 4.2.

4.1.1 Problem Definition

Let D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} denote a set of training instances sampled from an unknown distribution P over X x Y, where X is a subset of R^d, the d-dimensional feature space, and Y = {1, 2, ..., c} is the set of permissible categories. For large multi-class learning problems, we assume c >> 2. The goal of multi-class learning is to infer a target function f that minimizes the expected risk of misclassifying any instance drawn from P, i.e., \min_f E_{(x,y)\sim P}[L(f(x), y)], where L is known as the loss function. A standard approach for multi-class learning is to reduce the problem into multiple binary classification tasks. However, when the number of classes c is too large, applying all the binary classifiers to predict the class of a test instance is computationally infeasible. Towards this end, various hierarchical labeled tree algorithms have been developed for large multi-class learning problems. A hierarchical labeled tree is a 4-tuple T = <V, E, Phi, Lambda>, where V is the set of nodes, E is the set of edges connecting parent nodes to their corresponding child nodes, Phi = {f_v(x) | v in V} is the set of classifiers associated with the internal nodes of T, and Lambda = {y_v | y_v in Y, v in V} is the set of labels associated with the leaf nodes. In this chapter, we restrict our framework to binary rooted trees, even though, in principle, it is applicable to rooted trees with a branching factor greater than 2. Our objective in this work is to learn an optimal tree T for large multi-class learning problems.

4.1.2 Hilbert-Schmidt Independence Criterion (HSIC)

This section introduces the Hilbert-Schmidt Independence Criterion. Our presentation follows closely the definitions given in [21][23][22][24]. Let (X, Y) be a pair of random variables drawn from a joint distribution Pr_{X,Y}. The Hilbert-Schmidt Independence Criterion (HSIC) [21] measures the dependence between X and Y by calculating the norm of their cross-covariance operator C_{xy} in Hilbert space. It has been shown that this norm vanishes if and only if X and Y are independent. A larger value of HSIC also indicates a stronger dependence between the two variables, with respect to the choice of kernels [23]. Let F be a Reproducing Kernel Hilbert Space (RKHS) of functions from X to R with a feature map phi, such that <phi(x), phi(x')>_F = k_X(x, x'), where k_X : X x X -> R is a positive definite kernel.
Likewise, let G be the RKHS on Y with a feature map psi, such that <psi(y), psi(y')>_G = k_Y(y, y'), where k_Y : Y x Y -> R is the corresponding kernel for the space Y. The cross-covariance operator C_{xy} is defined as

C_{xy} = E_{x,y}\big[(\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y)\big]

where mu_x = E[phi(x)], mu_y = E[psi(y)], and the symbol \otimes denotes a tensor product. The Hilbert-Schmidt Independence Criterion is defined as the squared Hilbert-Schmidt norm of the cross-covariance operator C_{xy}, which can be written as follows:

HSIC_{X,Y} = \|C_{xy}\|^2_{HS} = E_{x,x',y,y'}\big[k_X(x, x') k_Y(y, y')\big] + E_{x,x'}\big[k_X(x, x')\big] E_{y,y'}\big[k_Y(y, y')\big] - 2 E_{x,y}\Big[E_{x'}\big[k_X(x, x')\big] E_{y'}\big[k_Y(y, y')\big]\Big] \qquad (4.1)

Let {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be the set of training instances drawn from Pr_{XY}. HSIC can be empirically estimated from the training data as follows [21, 22]:

HSIC = \frac{1}{(n-1)^2} \mathrm{tr}\big[H_n K H_n L\big] \qquad (4.2)

where tr[.] denotes the trace of a matrix, H_n = I - \frac{1}{n} 1_n 1_n^T is a centering matrix, I is the identity matrix, 1_n is a column vector of all ones, K is an n x n kernel matrix for x, and L is an n x n kernel matrix for the class labels y. The matrix product H_n K H_n corresponds to the computation of a centered kernel matrix. If K-hat denotes the centered kernel matrix, then HSIC simplifies to (n - 1)^{-2} tr[K-hat L]. Note that the time complexity for estimating HSIC from the data is O(c^3 + nc^2 + n^2 c), which is expensive when the number of classes c is large.

4.1.3 Tree-Structured Covariance Matrix

Given a rooted binary tree T on the label set Y, we can represent it as a tree-structured covariance matrix [22] as follows.

Definition 4 (Tree-Structured Covariance Matrix) A matrix C in R^{c x c} is a covariance matrix associated with the label tree structure T, where C_{ij} is a measure of similarity between the i-th and j-th classes computed based on the path length from the root node to the closest common ancestor of the leaf nodes for i and j in T.

The larger the value of C_{ij}, the closer the relationship between the two classes. Figure 4.1 shows an example of a labeled tree structure for 4 classes, c1, c2, c3, and c4. Let a, b, c, d, e, f denote the edge weights between the nodes. Based on this structure, the tree-structured covariance matrix is the 4 x 4 matrix shown on the right-hand side of the diagram. For example, C_{1,2} = a since the closest common ancestor of the two classes has a path length equal to a from the root node. Similarly, C_{3,3} = b + e, which is the path length from the root to the leaf node for c3. Note that the edge weights a, b, c, d, e, f determine the similarity between the different categories. For brevity, we consider the edge weights to be 1 in the remainder of this chapter.

[Figure 4.1: Example tree structure for four classes and its tree-structured covariance matrix.]

4.2 MF-Tree: Proposed Method

This section presents an overview of the proposed method. We first introduce the global objective function to be optimized by MF-Tree. We then show its connection to the HSIC measure described in the previous section. We also illustrate the additive property of HSIC, which allows us to develop a hierarchical tree-based induction algorithm.

4.2.1 Global Objective Function of MF-Tree

Let X denote an n x d centered design matrix, whose rows correspond to the data instances and whose columns correspond to the features. An uncentered design matrix can always be centered by applying the operation H_n X, where H_n is the centering matrix defined in Section 4.1.2.
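Because the centered kernel and the empirical HSIC estimate of Equation (4.2) recur throughout the rest of this chapter, the following minimal Python/NumPy sketch shows how they could be computed. The toy data, the linear kernel, and the plain label kernel YY^T (rather than the tree-structured YCY^T introduced later) are illustrative assumptions.

```python
import numpy as np

def empirical_hsic(K, L):
    """Empirical HSIC estimate of Eq. (4.2): tr[H K H L] / (n-1)^2,
    where K and L are n x n kernel matrices for the inputs and labels."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H_n
    return np.trace(H @ K @ H @ L) / (n - 1) ** 2

# Toy check with a linear kernel on the inputs and the label kernel Y Y^T
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
Y = np.eye(2)[[0, 0, 0, 0, 1, 1, 1, 1]]      # one-hot labels, 2 classes
print(empirical_hsic(X @ X.T, Y @ Y.T))
```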
MF-Tree is designed to minimize the following regularized squared error loss function:

J = \|X - ZW^T\|_F^2 + \lambda \, \mathrm{tr}\big[W A W^T\big]
  = \mathrm{tr}\big[XX^T - XWZ^T - ZW^TX^T + ZW^TWZ^T\big] + \lambda \, \mathrm{tr}\big[W A W^T\big] \qquad (4.3)

where Z is an n x c matrix that represents the class membership of each data instance, W is a d x c matrix that represents the feature vector of each class, A is a symmetric, positive-definite c x c class covariance matrix, and lambda is the regularization parameter. The Frobenius norm is denoted by ||.||_F and the superscript T denotes the transpose operator. The intuition behind the objective function given in Equation (4.3) is to factorize the centered design matrix X into a product of latent factors Z and W.

4.2.2 Relationship between Global Objective Function and HSIC

To demonstrate the relationship between the objective function and HSIC, let us take the partial derivative of J given in Equation (4.3) with respect to W and set it to zero:

\frac{\partial J}{\partial W} = -2X^TZ + 2WZ^TZ + 2\lambda WA = 0 \;\;\Longrightarrow\;\; W = X^TZ\big[Z^TZ + \lambda A\big]^{-1}

Replacing this back into Equation (4.3) yields the following:

J = \mathrm{tr}\Big[XX^T - 2XX^TZ\big[Z^TZ + \lambda A\big]^{-1}Z^T + Z\big[Z^TZ + \lambda A\big]^{-1}Z^TXX^TZ\big[Z^TZ + \lambda A\big]^{-1}Z^T\Big] + \lambda\,\mathrm{tr}\Big[X^TZ\big[Z^TZ + \lambda A\big]^{-1}A\big[Z^TZ + \lambda A\big]^{-1}Z^TX\Big]
  = \mathrm{tr}\Big[XX^T - 2XX^TZ\big[Z^TZ + \lambda A\big]^{-1}Z^T\Big] + \mathrm{tr}\Big[X^TZ\big[Z^TZ + \lambda A\big]^{-1}\big[Z^TZ + \lambda A\big]\big[Z^TZ + \lambda A\big]^{-1}Z^TX\Big]
  = \mathrm{tr}\Big[XX^T - XX^TZ\big[Z^TZ + \lambda A\big]^{-1}Z^T\Big]

Thus, finding a solution for Z that minimizes J is equivalent to finding one that maximizes the following objective function:

\max_Z \; \mathrm{tr}\Big[XX^TZ\big[Z^TZ + \lambda A\big]^{-1}Z^T\Big]

Let K-hat = XX^T be the centered kernel matrix associated with the training instances X. During training, we enforce the following constraint:

Z_{train}\big[Z_{train}^TZ_{train} + \lambda A\big]^{-1}Z_{train}^T = YCY^T \qquad (4.4)

where Y is an n x c matrix of class labels and C is the tree-structured covariance matrix. This can be achieved by setting Z_{train} = Y and C = \big[Z_{train}^TZ_{train} + \lambda A\big]^{-1}.

Theorem 1 If A is a symmetric, positive-definite matrix and lambda > 0, then Z_{train}^TZ_{train} + \lambda A is also positive-definite.

Proof 1 Given any real vector v, v^T\big[Z_{train}^TZ_{train} + \lambda A\big]v = v^TZ_{train}^TZ_{train}v + \lambda v^TAv = \|Z_{train}v\|^2 + \lambda v^TAv > 0 since A is a positive definite matrix.

Since Z_{train}^TZ_{train} + \lambda A is positive-definite, it is invertible. The equality constraint in Equation (4.4) ensures that an appropriate lambda and A can be chosen to generate a viable tree, in which the entries of the matrix C must have a specific structure. Furthermore, let L = YCY^T denote the kernel matrix associated with the class labels. The objective function can now be simplified to:

\max_C \; \mathrm{tr}\big[\hat{K}L\big] \quad \text{s.t.} \quad L = YCY^T \qquad (4.5)

which is equivalent to the optimization of HSIC.
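As a sanity check on the derivation above, the short sketch below (Python/NumPy, with random toy data and an identity class covariance A, both illustrative assumptions) plugs the closed-form W = X^T Z (Z^T Z + lambda A)^{-1} back into Equation (4.3) and confirms that the result matches tr[XX^T - XX^T Z (Z^T Z + lambda A)^{-1} Z^T].

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, lam = 10, 5, 3, 0.5
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                      # center the design matrix
Z = np.eye(c)[rng.integers(0, c, n)]     # class-membership matrix (here Z = Y)
A = np.eye(c)                            # symmetric positive-definite class covariance

M = Z.T @ Z + lam * A
W = X.T @ Z @ np.linalg.inv(M)           # closed-form minimizer over W

J = np.linalg.norm(X - Z @ W.T, "fro") ** 2 + lam * np.trace(W @ A @ W.T)
J_reduced = np.trace(X @ X.T - X @ X.T @ Z @ np.linalg.inv(M) @ Z.T)
print(np.isclose(J, J_reduced))          # True: the two expressions agree
```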
4.2.3 Additive Property of HSIC

In the previous subsection, we showed that the global objective function for MF-Tree is equivalent to HSIC by setting appropriate constraints on A and lambda. This enables us to define a kernel matrix for the class labels as L = YCY^T, where C is the tree-structured covariance matrix. In this section, we demonstrate the additive property of HSIC, which enables us to decompose the optimization problem into a hierarchical binary classification task. First, we show that the tree-structured covariance matrix can be written as a sum of block diagonal matrices. We illustrate this with a simple example. Consider the following covariance matrix for a labeled tree that contains 8 leaf nodes (assuming the edge weights are equal to 1):

C_8 = \begin{bmatrix} 3 & 2 & 1 & 1 & 0 & 0 & 0 & 0 \\ 2 & 3 & 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 3 & 2 & 0 & 0 & 0 & 0 \\ 1 & 1 & 2 & 3 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 3 & 2 & 1 & 1 \\ 0 & 0 & 0 & 0 & 2 & 3 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 & 3 & 2 \\ 0 & 0 & 0 & 0 & 1 & 1 & 2 & 3 \end{bmatrix} \qquad (4.6)

We can rewrite this matrix as follows:

C_8 = I_2 \otimes 1_4 1_4^T + I_4 \otimes 1_2 1_2^T + I_8 \otimes 1 \qquad (4.7)

where \otimes denotes a Kronecker product. Here, I_2 \otimes 1_4 1_4^T is a block-diagonal matrix with two 4 x 4 all-ones blocks, I_4 \otimes 1_2 1_2^T is a block-diagonal matrix with four 2 x 2 all-ones blocks, and I_8 \otimes 1 is simply the 8 x 8 identity matrix. The first term on the right-hand side of Equation (4.7) partitions the first 4 rows and columns of the matrix into one cluster and the last 4 rows and columns into a second cluster. The second term partitions each cluster into 2 smaller subclusters, while the third term puts each subcluster as a cluster by itself (this is at the leaf nodes of the tree). So, we can simply write

C_{2^m} = B_2 + B_4 + \cdots + B_{2^m} \qquad (4.8)

where B_{2^j} = I_{2^j} \otimes 1_{2^{m-j}} 1_{2^{m-j}}^T, 1_{2^0} = 1, and m is the depth of the binary tree. We can further expand the decomposition as follows:

C_8 = B_2 + B_4 + B_8 = B_{2,1} + \begin{bmatrix} B_{4,1} & 0_{4\times4} \\ 0_{4\times4} & B_{4,2} \end{bmatrix} + \begin{bmatrix} B_{8,1} & 0_{2\times2} & 0_{2\times2} & 0_{2\times2} \\ 0_{2\times2} & B_{8,2} & 0_{2\times2} & 0_{2\times2} \\ 0_{2\times2} & 0_{2\times2} & B_{8,3} & 0_{2\times2} \\ 0_{2\times2} & 0_{2\times2} & 0_{2\times2} & B_{8,4} \end{bmatrix}

where each diagonal block has the form

B_{2^i,j} = \begin{bmatrix} 11^T & 0 \\ 0 & 11^T \end{bmatrix} \qquad (4.9)

and the dimension of each of the submatrices of 0s and 11^Ts in B_{2^i,j} is 2^{m-i} x 2^{m-i}. Furthermore, given a block matrix B_k, we have

Y B_k Y^T = Y_1 B_{k,1} Y_1^T + Y_2 B_{k,2} Y_2^T + \cdots + Y_{k/2} B_{k,k/2} Y_{k/2}^T \qquad (4.10)

where Y = [\,Y_1 \; \cdots \; Y_j \; \cdots \; Y_{k/2}\,] is the corresponding column-block partition of Y,

B_k = \mathrm{blockdiag}\big(B_{k,1}, B_{k,2}, \ldots, B_{k,k/2}\big) \qquad (4.11)

and Y^T is the matching stack of blocks [\,Y_1^T; \cdots; Y_j^T; \cdots; Y_{k/2}^T\,] \qquad (4.12).

As each B_{k,j} corresponds to a block-diagonal matrix of the form given in Equation (4.9), we can write the label kernel as

L = YCY^T = \sum_{i=1}^{\log_2 k} Y B_{2^i} Y^T = \sum_{i=1}^{\log_2 k} \sum_{j=1}^{2^{i-1}} Y_j B_{2^i,j} Y_j^T

This decomposition enables us to solve the optimization problem with a hierarchical, tree-like decomposition. The outer sum corresponds to the decomposition at each depth of the tree, while the inner sum corresponds to expanding the internal nodes associated with that depth. Instead of learning C directly by maximizing the HSIC trace (see Equation (4.5)), we can learn the set of B_{2^i,j} iteratively, each of which is associated with an internal node in the tree structure.
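The Kronecker decomposition of Equation (4.8) is easy to verify numerically. The sketch below (Python/NumPy, function name illustrative) builds the tree-structured covariance matrix for a complete binary tree of depth m with unit edge weights and, for m = 3, reproduces the matrix in Equation (4.6).

```python
import numpy as np

def tree_covariance(m):
    """Tree-structured covariance matrix C_{2^m} for a complete binary tree
    of depth m with unit edge weights, built term by term from Eq. (4.8):
    C = sum_j I_{2^j} (x) 1_{2^{m-j}} 1_{2^{m-j}}^T."""
    k = 2 ** m
    C = np.zeros((k, k))
    for j in range(1, m + 1):
        C += np.kron(np.eye(2 ** j), np.ones((2 ** (m - j), 2 ** (m - j))))
    return C

C8 = tree_covariance(3)
print(C8.astype(int))          # matches the 8 x 8 matrix of Eq. (4.6)
```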
4.2.4 MF-Tree Construction

To construct the tree, we need to solve the objective function given in (4.5). Based on the discussion in the previous subsection, the objective function can be simplified as follows:

\max_{\{B\}} \; \sum_{i,j} \mathrm{tr}\big[K_{j,j} Y_j B_{i,j} Y_j^T\big] \qquad (4.13)

This allows us to solve the optimization problem one node at a time. Let D_j = Y_j^T K_{j,j} Y_j, which is simply a c x c similarity matrix of the classes, computed from their attributes X and the known ground-truth labels associated with the block B_{i,j}. At each node of the tree construction process, we need to find a matrix B_{i,j} that maximizes its alignment with D_j, i.e.,

\max_{B_{i,j}} \; \mathrm{tr}\big[D_j B_{i,j}\big]

For notational convenience, we drop the subscripts i and j in the remainder of the discussion. Note that B is simply a co-association matrix, whose elements are [B]_{i,j} = 1 if classes i and j are in the same partition, and 0 otherwise. To create binary partitions, let G be a c x 2 matrix, where [G]_{i,j} = 1 if class i belongs to partition j, and 0 otherwise. (Here, we assume the number of classes associated with the given node is c. As the tree grows deeper, the number of classes decreases, which in turn reduces the number of rows in G.) It is easy to show that B = GG^T. Thus,

\mathrm{tr}\big[DB\big] = \mathrm{tr}\big[DGG^T\big] = \mathrm{tr}\big[G^TDG\big]

This objective function is equivalent to finding a clustering of the classes that maximizes the within-cluster similarity.

4.2.5 Optimization

Note that the following optimization problem at each node,

\max_G \; \mathrm{tr}\big[G^TDG\big] \quad \text{s.t.} \quad G\mathbf{1}_2 = \mathbf{1}_c,

is not feasible unless we enforce a non-negativity constraint on G [61]. Furthermore, if we relax the condition that the elements of G must be either 0 or 1 and replace it with a constraint that G must be an orthogonal matrix,

\max_G \; \mathrm{tr}\big[G^TDG\big] \quad \text{s.t.} \quad G^TG = I_2,

we can solve the problem easily by finding the two eigenvectors corresponding to the largest eigenvalues of D [62, 63, 64]. Let G\Lambda G^T be the eigendecomposition of the matrix D. By definition, D = Y^TKY. Following the eigendecomposition of D, we have Y^TKY = G\Lambda G^T. Since G^TG = I,

G^TY^TKYG = \Lambda

Let P = YG, which is an n x 2 matrix that determines the assignment of each data instance to the corresponding child nodes. Thus, P^TKP = \Lambda. During the testing phase, let k-hat be an n-dimensional column vector that contains the similarity between the test instance and all the training instances. We may compute the 2-dimensional partition vector p-hat as follows:

\hat{p} = P^T\hat{k} = G^TY^T\hat{k}

After computing p-hat, we assign the test instance to the partition with the larger value, i.e., \arg\max_i (P^T\hat{k})_i.

Algorithm 2 MF-Tree Construction
Input: X_train in R^{n x d}, Y_train in R^{n x c}
Output: Tree structure, T
1. root = TreeGrowth(X_train, Y_train)
2. Insert root into T
3. return T

function TreeGrowth(X, Y)
1. v = createNode()
2. v.index = getIndex(X)
3. if number of classes in Y < 2 then
4.   v.class = unique(Y)
5. else
6.   (X^(l), Y^(l), X^(r), Y^(r), v.G) = Partition(X, Y)
7.   v.left = TreeGrowth(X^(l), Y^(l))
8.   v.right = TreeGrowth(X^(r), Y^(r))
9. end if
10. return v

function Partition(X, Y)
1. Compute the centered kernel K from X.
2. Compute D = Y^T K Y
3. Compute the first two eigenvectors g^(1), g^(2) of D
4. Let G = [g^(1), g^(2)]
5. (X^(l), Y^(l)) = {(x_i, y_i) : y_i = k, arg max_j G_{kj} = 1}
6. (X^(r), Y^(r)) = {(x_i, y_i) : y_i = k, arg max_j G_{kj} = 2}
7. return (X^(l), Y^(l), X^(r), Y^(r), G)
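The following Python/NumPy sketch fills in the eigendecomposition details of the Partition function in Algorithm 2. It assumes a linear kernel and one-hot labels, returns the class index sets for the two children rather than the data subsets, and uses the magnitude of the eigenvector entries to decide the side, all of which are illustrative choices rather than the exact thesis implementation.

```python
import numpy as np

def partition_classes(X, Y):
    """Split the classes at a node into two groups using the top-2
    eigenvectors of D = Y^T K Y, with K a centered linear kernel."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ (X @ X.T) @ H                      # centered (linear) kernel
    D = Y.T @ K @ Y                            # c x c class similarity matrix
    eigvals, eigvecs = np.linalg.eigh(D)       # eigenvalues in ascending order
    G = eigvecs[:, -2:]                        # two leading eigenvectors
    side = np.argmax(np.abs(G), axis=1)        # dominant column per class
    left_classes = np.where(side == 0)[0]
    right_classes = np.where(side == 1)[0]
    return left_classes, right_classes, G
```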
4.2.6 MF-Tree Induction Algorithm

The pseudocode for constructing MF-Tree is summarized in Algorithm 2. The algorithm recursively partitions each node by calling the TreeGrowth function. Each invocation of the function returns the root node of a subtree that partitions the classes and their associated training instances into 2 groups. Each node is assumed to have a composite structure, where v.index contains the indices of the training instances assigned to the node, v.class represents the class label if v is a leaf node, v.G is the discriminant function if v is an internal node, and v.left and v.right are its left and right children.

The Partition function decides how to partition each class and its training examples into the different branches of the tree. The partitioning is done based on the magnitude of the matrix G, as described in the previous section. If a class k is assigned to the left child of a node, all the training instances that belong to class k are also propagated to the left child. The algorithm continues to expand a node until all the training instances associated with the node belong to the same class. To determine its time complexity, note that at each node v, we need to compute the kernel matrix K, an operation that requires O(n^2 d) time. Computing D from the kernel matrix K and the class labels Y takes O(n^2 c + c^2 n) time, while finding its first two eigenvectors requires, in the worst case, O(c^3) time. Thus, the computational cost at each node is O(n^2 c + c^2 n + c^3). The cost reduces significantly as the tree grows deeper, since the sizes of the matrices K and D decrease substantially. Once the tree structure is induced from the training set, we can predict the label for any test instance using the method presented in Algorithm 3.

Algorithm 3 Testing with MF-Tree
Input: X_test, X_train, Y_train, T
Output: Y_test
1. for each x in X_test do
2.   v = T.root
3.   while v.class = empty do
4.     k_hat = Similarity(x, X_train, v.index)
5.     Y_hat = Subset(Y_train, v.index)
6.     G_hat = v.G
7.     p_hat = G_hat^T Y_hat^T k_hat
8.     v = getChild(p_hat, v)
9.   end while
10.  Insert y = v.class into Y_test
11. end for
12. return Y_test
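A minimal Python/NumPy sketch of the routing loop in Algorithm 3 is shown below. The node fields (index, classes, cls, G, left, right) mirror the node structure described above; `classes` (the columns of the label matrix present at the node) and `cls` (standing in for v.class, a reserved word in Python) are naming assumptions added here so the shapes line up, and a linear kernel stands in for the generic Similarity function.

```python
import numpy as np

def predict(x, X_train, Y_train, root):
    """Route a single test instance down an induced MF-Tree and return the
    class label stored at the leaf it reaches."""
    v = root
    while v.cls is None:                            # not yet at a leaf
        k_hat = X_train[v.index] @ x                # similarities to the node's training instances
        Y_hat = Y_train[np.ix_(v.index, v.classes)] # labels restricted to the node's classes
        p_hat = v.G.T @ Y_hat.T @ k_hat             # 2-dimensional partition vector
        v = v.left if np.argmax(p_hat) == 0 else v.right
    return v.cls
```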
4.3 Approximate MF-Tree

As shown in the previous section, the time complexity for constructing MF-Tree can be quite expensive, especially at the top levels of the hierarchy. This is due to the large size of the matrices that must be constructed and factorized according to their largest two eigenvalues. However, as we traverse down the tree, the number of classes and training instances to be dealt with decreases considerably, which in turn speeds up the computations. Thus, the bottleneck of our computation lies in the first few iterations of the MF-Tree algorithm. In this section, we present an approach to speed up the tree construction process by learning an approximation of the G matrix during the first few iterations. Once the matrices become small enough, we proceed with the original TreeGrowth function shown in Algorithm 2. Our proposed method attempts to balance the trade-off between accuracy and efficiency of the labeled tree construction process. While efficiency is a concern at the top levels of the tree, accuracy is an issue at the lower levels of the tree. Even if the class partitioning at the top levels is suboptimal, one might still be able to recover good partitions at subsequent iterations when refining the tree. Based on this rationale, we develop the following two-step approach to improve training efficiency. First, we apply a simple method to calculate G at the top levels of the tree. When the depth of the tree exceeds some threshold tau, we revert to the original TreeGrowth function. We term this approach Approximate MF-Tree.

Consider the root node of the labeled tree. Let n be the number of training instances and c be the number of classes. We first sort the classes in decreasing order of their class size. We then assign the largest class (c1) to the left child of the root and the second largest class (c2) to the right child. The sorting operation requires only O(c log c) computations. For the remaining classes, we compute their corresponding mean vectors, an operation that requires O(nd) time. We then assign each class to the left or right child based on the distance of its mean vector to the mean vectors of the two largest classes. For example, if the mean vector for c3 is closer to the mean vector for c1 than to that of c2, we assign c3 to the left child of the root node. Otherwise, we assign it to the right child. This process is repeated until depth tau, and it reduces the time complexity of each node from O(c^3) to O(c log c + nd). As will be shown in our experimental results section, a substantial improvement in training time can be achieved by setting tau to 3 or 4 without losing significant accuracy.

Algorithm 4 Approximate MF-Tree
Input: X_train in R^{n x d}, Y_train in R^{n x c}, tau
Output: Tree structure T
1. Set depth, d = 0
2. root = TreeGrowthApprox(X_train, Y_train, d, tau)
3. Insert root into T
4. return T

function TreeGrowthApprox(X, Y, d, tau)
1. v = createNode()
2. v.index = getIndex(X)
3. if d < tau and number of classes in Y >= 2 then
4.   (X^(l), Y^(l), X^(r), Y^(r)) = Partition2(X, Y)
5.   v.left = TreeGrowthApprox(X^(l), Y^(l), d + 1, tau)
6.   v.right = TreeGrowthApprox(X^(r), Y^(r), d + 1, tau)
7. else
8.   v = TreeGrowth(X, Y)
9. end if
10. return v

function Partition2(X, Y)
1. classes = Sort(Y)
2. (X^(l), Y^(l)) = {(x_i, y_i) : y_i = c1 (largest class)}
3. (X^(r), Y^(r)) = {(x_i, y_i) : y_i = c2 (second largest class)}
4. meanvectors = Mean(X, Y)
5. for each remaining class c in classes do
6.   m1 = distance(meanvectors, c, c1)
7.   m2 = distance(meanvectors, c, c2)
8.   if m1 <= m2 then
9.     (X^(l), Y^(l)) = (X^(l), Y^(l)) union {(x_i, y_i) : y_i = c}
10.  else
11.    (X^(r), Y^(r)) = (X^(r), Y^(r)) union {(x_i, y_i) : y_i = c}
12.  end if
13. end for
14. return (X^(l), Y^(l), X^(r), Y^(r))
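The inexpensive Partition2 step of Algorithm 4 can be sketched directly in Python/NumPy. The sketch below assumes integer class labels in a vector y and Euclidean distance between class mean vectors; it returns the two lists of class labels rather than the data subsets, which is an illustrative simplification.

```python
import numpy as np

def approx_partition(X, y):
    """Seed the two children with the two largest classes, then assign every
    other class to the side whose seed centroid is nearer."""
    classes, counts = np.unique(y, return_counts=True)
    order = classes[np.argsort(-counts)]                # classes by decreasing size
    means = {c: X[y == c].mean(axis=0) for c in classes}
    left, right = [order[0]], [order[1]]                # two largest classes
    for c in order[2:]:
        d1 = np.linalg.norm(means[c] - means[order[0]])
        d2 = np.linalg.norm(means[c] - means[order[1]])
        (left if d1 <= d2 else right).append(c)
    return left, right
```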
4.4 Experimental Evaluation

In this section, we compare the performance of our proposed MF-Tree algorithm against several state-of-the-art baseline algorithms, including DAGSVM [15], Discriminative Relaxed Hierarchy Learning (DRHL) [56], the confusion matrix based label embedding tree learning approach (CMTL) [18], and Recursive Non-negative Matrix Factorization (RNMF) [20]. In addition to these methods, we have also implemented a recursive version of the HSIC clustering by dependence maximization (RDM) algorithm [23]. The latter partitions the data instances into p groups at each depth of the tree until a stopping criterion is met (i.e., when there are no other classes with more than 5 instances at the leaf node). We construct a binary classifier at each internal node of the tree as well as 1-class SVM classifiers at the leaf nodes to assign the class labels.

4.4.1 Experimental Setup

To illustrate the effectiveness of the proposed method, our experiments were performed on two real-world data sets: (1) Caltech-256 (http://authors.library.caltech.edu/7694/), a standard multi-class classification data set with 256 categories containing 30,607 images and at least 80 images per class, and (2) Wiki, a collection of Wikipedia articles from the dump of October 9, 2009. We generated four subsets with varying numbers of classes for our experiments. The data sets are denoted as Wiki1, Wiki2, Caltech1, and Caltech2. Caltech1 and Caltech2 are subsets of image data from the Caltech-256 collection. Caltech1 is generated using the same approach as described in [56], in which we randomly sample 80 images from each class to create the data set. We use a repeated holdout method to create our training and test sets. Specifically, half of the sampled data were reserved for training while the remaining half were used for testing. We repeat the sampling process five times to create five different versions of the Caltech1 data set. We then applied all the algorithms to the five data sets and computed their average F1-scores. The F1-score is computed as the harmonic mean of the precision (positive predictive value) and recall (sensitivity) for each class. Results are reported based on the average F1-score over all classes and over the five versions of the data sets. In addition, we generated a larger subset called Caltech2 by randomly choosing 40 images from each class for training and using all the remaining images in each class for testing. This process is again repeated five times to generate five versions of Caltech2. For both Caltech1 and Caltech2, we extracted SIFT (scale-invariant feature transform) features to represent each image. Wiki1 is a data set generated with the same setting as that described in [20]. Here, we choose the largest 214 categories of the Wikipedia articles; the resulting data set contains 24,378 articles. In addition, we also created a larger sample data set from the Wikipedia dump, which covers the largest 1,618 categories with 65,156 articles. Similar to Caltech1, half of the sampled data were used for training and the remaining half for testing. The sampling process is also repeated five times to generate different versions of the data sets. A summary of the data sets used for our experiments is given in Table 4.1. All the experiments were conducted on a single PC with an Intel Core i7 CPU at 2.6 GHz and 8 GB of memory, running the Windows 7 operating system.

Table 4.1 Summary of data used for our experiments
Data set    # Classes    # Data instances
Caltech1    256          20480
Caltech2    256          30607
Wiki1       214          24378
Wiki2       1618         65156

4.4.2 Experimental Results

This section presents summary results of our experiments. As previously noted, we use the F1-score to measure the accuracy of the different approaches. In addition, we compare the testing time of each method along with the size of the induced tree.

4.4.2.1 Overall Results Comparison

We first compare the performance of our MF-Tree algorithm against all the baseline methods. The results for the data sets Caltech1, Caltech2, Wiki1, and Wiki2 are summarized in Table 4.2. The experimental results suggest that the proposed MF-Tree algorithm consistently outperforms the other state-of-the-art baselines in terms of F1-score and testing efficiency on all 4 data sets evaluated in this study. The size of the hierarchy generated by MF-Tree is comparable to some of the best algorithms. Even though RNMF generates a shorter tree than MF-Tree, its test time is higher because the leaf nodes of RNMF may contain multiple classifiers that must all be invoked in order to predict the final class. In contrast, each leaf node of MF-Tree has a single class label, which allows us to predict the class efficiently. Among all the competing methods, DAGSVM creates the largest tree. This is because it eliminates one class at a time at each level of the hierarchy.
Consequently, the depth of the DAGSVM tree is equal to the number of classes. Meanwhile, DRHL and RDM also create larger trees because they both allow a single class to reside in more than one leaf node. This affects the test time of these algorithms.

4.4.2.2 Effect of Parameter Tuning

Unlike MF-Tree, which is parameter-free, the performance of several baseline algorithms depends on the values of their parameters. For example, the branching factor p in the RNMF method determines the structure and depth of the tree. If p = 1, RNMF produces a decision stump consisting of c 1-class SVM models, which is equivalent to the one-versus-all approach. If p = 2, it produces a binary tree. The choice of p also affects the depth of the tree: as the depth increases, the testing time decreases significantly at the expense of lower F1 values. Similarly, tree construction parameters are also needed in the other baseline methods. To provide a fair comparison, we vary the parameters of each method based on the suggestions of their original papers. Specifically, for RNMF [20] we set p in {2, 3, 4, 5}, for RDM we set p in {2, 3, 4, 5}, and for DRHL [56] we set rho in {0.5, 0.6, 0.7, 0.8}. We plotted the results in Figures 4.4, 4.5, 4.2, and 4.3. In each plot, the x-axis corresponds to the test time while the y-axis corresponds to the F1-score. Ideally, we seek a classifier with the lowest test time and the highest F1-score. As can be seen from these plots, MF-Tree is consistently better than the baseline methods on all four data sets.

Table 4.2 Experimental Results

Caltech1     F1                   Test Time (s)      Depth of Tree
DAGSVM       0.369 +/- 0.0019     436.8 +/- 10.2     256
RNMF         0.360 +/- 0.0022     91.2 +/- 6.4       9
DRHL         0.375 +/- 0.003      312.76 +/- 9.6     196
CMTL         0.348 +/- 0.006      122.18 +/- 10.4    10
RDM          0.331 +/- 0.0103     198.2 +/- 11.2     18
MF-Tree      0.3816 +/- 0.03      77.5 +/- 4.5       10

Caltech2     F1                   Test Time (s)      Depth of Tree
DAGSVM       0.356 +/- 0.0017     751.68 +/- 11.8    256
RNMF         0.353 +/- 0.0017     157.66 +/- 7.9     9
DRHL         0.365 +/- 0.0021     523.7 +/- 12.2     196
CMTL         0.343 +/- 0.005      192.8 +/- 11.2     10
RDM          0.323 +/- 0.0098     367.4 +/- 10.5     18
MF-Tree      0.376 +/- 0.029      135.6 +/- 5.6      10

Wiki1        F1                   Test Time (s)      Depth of Tree
DAGSVM       0.5309 +/- 0.002     942.9 +/- 17.6     214
RNMF         0.5227 +/- 0.0017    188.3 +/- 9.3      8
DRHL         0.5217 +/- 0.0019    457.5 +/- 16.9     64
CMTL         0.4826 +/- 0.0017    195.5 +/- 9.2      10
RDM          0.4229 +/- 0.0135    256.6 +/- 14.7     33
MF-Tree      0.5406 +/- 0.0216    152.7 +/- 4.2      9

Wiki2        F1                   Test Time (s)      Depth of Tree
DAGSVM       0.3106 +/- 0.0021    1618.6 +/- 17.6    1618
RNMF         0.2998 +/- 0.0017    351.8 +/- 13.6     10
DRHL         0.3089 +/- 0.0016    772.5 +/- 16.9     116
CMTL         0.2459 +/- 0.0029    312.2 +/- 11.3     12
RDM          0.2296 +/- 0.022     456.6 +/- 12.5     52
MF-Tree      0.3318 +/- 0.016     267.7 +/- 6.9      11

[Figure 4.2: Performance comparison on the Caltech1 data (F1-score vs. testing time).]
[Figure 4.3: Performance comparison on the Caltech2 data (F1-score vs. testing time).]
[Figure 4.4: Performance comparison on the Wiki1 data (F1-score vs. testing time).]
[Figure 4.5: Performance comparison on the Wiki2 data (F1-score vs. testing time).]

4.4.3 Comparison of Taxonomy Structure

In addition to their F1-scores, it is useful to compare the trees generated by the different methods against their ground truth structure. To do this, we need an evaluation measure for comparing the similarity between two trees.
In this chapter, we apply the edit distance measure, which was originally used for string comparison, to compare the ordered labeled trees (see [65] for a review). Ordered labeled trees are trees in which the left-to-right order among siblings is significant. The distance between two trees is computed by considering the optimal mapping between the two trees. Specifically, the distance is given by the minimum cost of the elementary operations needed to convert one tree into the other. An alternative way to map and edit the trees is by using tree alignment [66]. To quantify the difference between two trees based on their pairwise class ordering, we first construct an adjacency matrix A for a tree structure using the following formula:

A_{ij} = \frac{1}{\mathrm{numhops}(c_i \rightarrow c_j)}

where A_{ij} is the proximity between classes c_i and c_j and numhops(c_i -> c_j) is the number of hops needed to traverse from class c_i to class c_j. To measure the difference between two tree structures represented by the matrices A and A-hat, we compute the mean squared error (MSE), which is defined as follows:

MSE(A, \hat{A}) = \frac{1}{n} \sum_{i,j=1}^{n} (A_{ij} - \hat{A}_{ij})^2

Figure 4.7 shows an example of using MSE to compare tree structures. Let A be the adjacency matrix induced from the ground truth structure, while A-hat_result1 and A-hat_result2 are the corresponding adjacency matrices for two competing tree structures, result1 and result2. Since MSE(A, A-hat_result1) = 0.03125 > MSE(A, A-hat_result2) = 0.013825, this suggests that the tree structure result2 is closer to the ground truth structure than result1. This result makes sense because the structure of result2 is more similar to the ground truth: only class c2 is misplaced, whereas both c2 and c3 are misplaced in result1. For example, A(c3, c4) = 0.5 in the ground truth structure, as the number of hops from category c3 to c4 is equal to two. However, A-hat_result1(c3, c4) = 0.25 for result1, since class c3 is 4 hops away from c4. In contrast, A-hat_result2(c3, c4) = 0.5, which is the same as the distance according to the ground truth, which means the tree structure of result2 was able to maintain the proximity between c3 and c4.

[Figure 4.7: Example of mapping tree structures to adjacency matrices.]
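The two quantities used above are straightforward to compute. The Python/NumPy sketch below assumes the pairwise hop counts between leaf classes have already been extracted from the tree (that extraction step is omitted here); the function names are illustrative.

```python
import numpy as np

def adjacency_from_hops(hops):
    """Proximity matrix A_ij = 1 / numhops(c_i -> c_j) from a matrix of
    pairwise hop counts between leaf classes (zero entries are left at 0)."""
    A = np.zeros_like(hops, dtype=float)
    mask = hops > 0
    A[mask] = 1.0 / hops[mask]
    return A

def tree_mse(A, A_hat):
    """MSE between two tree structures represented by adjacency matrices,
    normalized by n as in the definition above."""
    n = A.shape[0]
    return np.sum((A - A_hat) ** 2) / n
```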
We compared the MSE score of MF-Tree against the confusion matrix based label embedding tree learning (CMTL) method. We chose this baseline method for two reasons. First, the size of its tree is comparable to MF-Tree. Second, similar to MF-Tree, this approach produces a tree in which each class can only reside in a single leaf node (whereas other methods allow a class to be assigned to multiple leaf nodes). Trees that restrict each class to a single leaf node are more interpretable, as they may be used to define a concept hierarchy for the application domain. Our experimental results, shown in Figure 4.6, suggest that MF-Tree produces a tree structure that is closer to the ground truth compared to CMTL. Note that Caltech1 and Caltech2 use the same set of classes in their training data to construct the taxonomy, so we report them together as Caltech1&2 in Figure 4.6.

[Figure 4.6: MSE comparison of the tree produced by MF-Tree against the one produced by CMTL.]

[Figure 4.8: Illustration of a subset of the tree structure generated by MF-Tree on the Caltech1 data set.]

Figure 4.8 shows a subset of the tree structure generated by MF-Tree on the Caltech1 data set. Observe that the tree was able to capture the relationships among many of the classes quite well. For example, some of the animal classes (dog, horse, chimp, bear) were assigned to the same branch. Although there were exceptions and misclassifications (e.g., bats and crabs), the related classes are often quite close together at the lower levels of the tree.

4.4.4 Results for Approximate MF-Tree

Despite its lower test time compared to the other approaches, MF-Tree can still be expensive if the number of classes is very large. To overcome this problem, we proposed an approximate MF-Tree learning algorithm. To justify the effectiveness of the approximate MF-Tree approach, we perform an experiment on the Wiki2 data set. We choose this data set because it has the largest number of classes. As previously mentioned in Section 4.3, the approximate MF-Tree method uses an inexpensive partitioning method to assign classes to their child nodes at the top levels of the hierarchy. The algorithm requires a single parameter, tau, which determines the maximum depth at which to apply the inexpensive partitioning method. The results comparing Approximate MF-Tree against MF-Tree for different thresholds tau are summarized in Table 4.3. For this experiment, we vary tau from 3 to 6 and compare the average F1-score as well as the training time of the algorithms. Recall that the average F1-score for MF-Tree on the Wiki2 data set is 0.3318. When tau = 3, the F1-score reduces slightly to 0.3289; however, there was a 22.6% improvement in the training time of the algorithm. As tau increases, the F1-score gradually decreases while the training time continues to improve. Note that the tree generated by the original MF-Tree algorithm has a depth equal to 11. When tau = 6, the improvement in training time appears to taper off while the F1-score has decreased to 0.2868. This clearly shows a trade-off between accuracy and training time efficiency. In practice, we could set the tau threshold based on the specific needs of the problem. If our priority is lower training cost, we could set a higher tau value. tau can also be set based on the computational resources available: if memory is limited, tau can be set so that the resulting data matrices fit into the available memory.

Table 4.3 Approximate MF-Tree on Wiki2
Threshold                        tau=3    tau=4    tau=5    tau=6
F1                               0.3289   0.3169   0.3034   0.2868
% improvement in training time   22.6%    32.3%    37.8%    39.5%

4.5 Conclusion

In this chapter, we proposed a novel hierarchical method for large-scale multi-class learning called MF-Tree. Our proposed algorithm is driven by a global objective function. We demonstrated the equivalence between the global objective function and the HSIC metric and showed that it has an additive property that can be exploited to design a hierarchical tree learning algorithm.
Experimental results comparing the method against five baseline methods demonstrated the effectiveness of our method in terms of both accuracy and test efficiency.

Chapter 5 Hierarchical Tree Learning for Large Multi-class Network

As noted in Chapter 1, classification of network data is an important problem across many disciplines, including computer science, biology, chemistry, and social science. Previous work on network classification can be broadly categorized into three approaches.

Classification with node attributes only [30][31]. The methods in this category simply apply existing classifiers, such as support vector machines, rule-based classifiers, and logistic regression, to the attributes of each node in the network (e.g., the HTML tags in Web pages). Many efforts have been made to enhance the classification accuracy by constructing a better feature matrix for the nodes [30][31][67]. A major drawback of these methods is that they do not consider the link information when constructing the classifier.

Classification based on link analysis. This includes PageRank-based methods [34][35], HITS [33], local and global consistency [36], score propagation [37], hard label propagation [68], trust rank [38], anti-trust rank [35], minimum cut [69][70], spectral clustering [71], etc. This approach is based on the fundamental assumption that nodes that are connected should share similar labels. As a result, the network classification task can be formulated as a regularized optimization problem. The drawback of link-analysis-based approaches is that a classification that relies only on the link structure is susceptible to missing link information.

Classification based on the use of both content and link information. This includes the hybrid methods developed in [39][72][40][41][27][42][25]. Most of these studies have shown that hybrid methods are generally better than using node attributes or link information alone.

Despite the extensive research on node classification for network data, most of these approaches were not designed for large multi-class learning problems. They encounter challenges such as class imbalance, high testing cost, and model interpretability problems when applied to networks with a large number of classes. Furthermore, as noted in Chapter 1, the assumption that nearby nodes have similar classes is less likely to hold when the number of classes is large. The size of the network also increases considerably as the number of classes increases, all of which makes large multi-class network learning an extremely challenging problem. As discussed in the previous two chapters, hierarchical learning methods have been developed [15, 16, 17, 18, 19] to overcome the challenges of extreme multi-class classification problems. However, as all of these methods were designed for non-network data, extending them to a hybrid network classification approach (using both node attributes and link structure information) is not an easy task. Furthermore, although open source tools are available for large-scale network analysis, they are not designed for node classification tasks. Pegasus [73] (http://www.cs.cmu.edu/ pegasus/index.htm), for example, was designed to analyze networks with billions of nodes and edges by providing functionalities to compute the diameter of the graph and the PageRank scores of nodes, as well as to find connected components using techniques such as spectral clustering. The techniques developed in Pegasus rely on the link topology only, which overlooks the importance of node attributes in the analysis.
As a result, there is an urgent need for new techniques to address the challenges in large multi-class network learning.

In the previous chapter, we presented a framework for multi-class learning known as MF-Tree. This chapter presents an extension of the framework for large multi-class learning in network data. The modified framework systematically combines node attributes with link information to learn the hierarchical structure of the classes from network data. Specifically, we recursively optimize a global objective function to learn a binary discriminant function that partitions the classes based on the node attributes and link structure until each partition contains only a single class. During testing, a test example is routed from the root to one of the leaf nodes in the tree structure, and the label associated with that leaf node is assigned as the predicted class label. To address the efficiency issue of large-scale network learning, our framework offers two options. First, it has an approximate version, which allows the partitions at the top levels of the tree to be determined without requiring the expensive matrix factorization computation. Similar to the approach described in the previous chapter, the framework allows us to choose the proper parameter setting (τ) based on the specific needs of the problem. If our priority is to achieve high training efficiency, we could apply the approximate MF-Tree with a larger τ. Otherwise, we could use the regular MF-Tree to obtain the best classification accuracy. Second, the proposed approach is trivially parallelizable. A MapReduce implementation of the framework for network data has been developed to improve its scalability to larger networks. As proof of concept, the MapReduce framework was applied to the Wikipedia network data, which contains more than 600K classes. To the best of our knowledge, this is the largest multi-class network learning problem that has ever been reported in the literature.

5.1 Methodology

Let N = (V, E, D) be a network of order n, where V = {v_1, v_2, ..., v_n} is the set of nodes, E ⊆ V × V is the set of edges, and D is the set of data instances associated with V. Each instance in D is characterized by the tuple (x, y), where x ∈ ℜ^d is a d-dimensional feature vector (known as the node attribute) and y ∈ {1, 2, ..., c} is its class label. The goal of multi-class network learning is to induce a labeling function f(V, E, D) that minimizes the expected misclassification rate for any instance drawn from network N, i.e., \min_f \mathbb{E}_{(x,y)\sim D}[\ell(f(x), y)], where \ell is the loss function. The objective function used as an estimate of the regularized empirical loss function is described in the next subsection.

5.1.1 Global Objective Function

Let X be an n × d centered design matrix, whose rows correspond to the data instances and whose columns correspond to the node attributes. An uncentered design matrix \hat{X} can be easily centered by applying the matrix operation H_n \hat{X}, where H_n is the centering matrix defined as H_n = I - \frac{1}{n} 1_n 1_n^T. Let A denote the n × n adjacency matrix for the link structure of the network.
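For concreteness, the following is a minimal sketch (not taken from the thesis implementation; the function name and toy data are illustrative) of how an uncentered design matrix can be centered with H_n before it is used in the factorization below.

import numpy as np

# Minimal sketch: center an uncentered design matrix X_hat using the
# centering matrix H_n = I - (1/n) 1_n 1_n^T defined above.
def center_design_matrix(X_hat):
    n = X_hat.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ X_hat                     # equivalent to subtracting each column's mean

X_hat = np.random.rand(5, 3)             # toy data: 5 nodes with 3 attributes
X = center_design_matrix(X_hat)
assert np.allclose(X.mean(axis=0), 0.0)  # every column of X now has zero mean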
Our MF-Tree for networks is then designed to optimize the following regularized loss function for an undirected graph:

J = \|X - LW^T\|_F^2 + \lambda_1 \|A - LPL^T\|_F^2 + \lambda_2 \|W\|_F^2 + \lambda_3 \|PL^T\|_F^2
  = \mathrm{tr}\big[ XX^T - XWL^T - LW^TX^T + LW^TWL^T \big]
    + \lambda_1 \mathrm{tr}\big[ AA^T - ALPL^T - LPL^TA^T + LPL^TLPL^T \big]
    + \lambda_2 \mathrm{tr}\big[ W^TW \big] + \lambda_3 \mathrm{tr}\big[ PL^TLP^T \big]        (5.1)

where L is an n × c matrix, which indicates the class membership of each data instance, W is a d × c matrix, which represents the feature vector of each class, and P is a c × c class covariance matrix, which represents the correlation between classes. \lambda_1, \lambda_2 and \lambda_3 are the regularization parameters. Intuitively, the objective function seeks to approximate the matrices X and A by the low-rank approximations given by the matrix products LW^T and LPL^T, respectively. The first two terms in the objective function measure the error in the approximation, while the last two terms are regularizers for the matrices W and PL^T. The purpose of optimizing this objective function is to jointly factorize the centered data matrix X and the adjacency matrix A to learn L and W. For a directed graph, we can write PL^T = U^T and reduce the objective function as follows:

J = \|X - LW^T\|_F^2 + \lambda_1 \|A - LU^T\|_F^2 + \lambda_2 \|W\|_F^2 + \lambda_3 \|U\|_F^2
  = \mathrm{tr}\big[ XX^T - XWL^T - LW^TX^T + LW^TWL^T \big]
    + \lambda_1 \mathrm{tr}\big[ AA^T - AUL^T - LU^TA^T + LU^TUL^T \big]
    + \lambda_2 \mathrm{tr}\big[ W^TW \big] + \lambda_3 \mathrm{tr}\big[ U^TU \big]        (5.2)

5.1.2 Relationship between Proposed Method and HSIC

As discussed in the previous chapter, the Hilbert-Schmidt Independence Criterion (HSIC) [21][22][23][24] is a recently proposed unsupervised learning measure for taxonomy learning. We have provided a theoretical proof to demonstrate the equivalence between the objective function of MF-Tree and HSIC. Similarly, in this section, we first show that the network extension of MF-Tree is also equivalent to HSIC. Given the additive property of HSIC, the task of large multi-class network learning can be decomposed into a set of binary classification tasks, which can be organized in a hierarchical tree structure.
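As a concrete reference for the quantity being minimized, the following is a minimal sketch (assumed toy shapes and random data, not the thesis code) that evaluates the regularized loss of Equation (5.1) for given factors L, W and P.

import numpy as np

# Minimal sketch: evaluate the regularized loss in Equation (5.1)
# for given factor matrices L (n x c), W (d x c) and P (c x c).
def mf_tree_network_objective(X, A, L, W, P, lam1, lam2, lam3):
    term_attr = np.linalg.norm(X - L @ W.T, 'fro') ** 2      # ||X - L W^T||_F^2
    term_link = np.linalg.norm(A - L @ P @ L.T, 'fro') ** 2  # ||A - L P L^T||_F^2
    reg_W = np.linalg.norm(W, 'fro') ** 2                    # ||W||_F^2
    reg_PL = np.linalg.norm(P @ L.T, 'fro') ** 2             # ||P L^T||_F^2
    return term_attr + lam1 * term_link + lam2 * reg_W + lam3 * reg_PL

# Toy dimensions: n = 6 nodes, d = 4 attributes, c = 3 classes.
rng = np.random.default_rng(0)
n, d, c = 6, 4, 3
X = rng.standard_normal((n, d))
A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
A = A + A.T                                                  # undirected adjacency matrix
L, W, P = rng.random((n, c)), rng.random((d, c)), rng.random((c, c))
print(mf_tree_network_objective(X, A, L, W, P, lam1=1.0, lam2=0.1, lam3=0.1))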
To demonstrate the relationship between our global objective function and the Hilbert-Schmidt Independence Criterion (HSIC), we first take the partial derivatives of J with respect to W and U and set them to zero:

\frac{\partial J}{\partial W} = -2X^TL + 2WL^TL + 2\lambda_2 W = 0
  \;\Rightarrow\; W = X^TL\,[L^TL + \lambda_2 I]^{-1}

\frac{\partial J}{\partial U} = \lambda_1(-2A^TL + 2UL^TL) + 2\lambda_3 U = 0
  \;\Rightarrow\; U\,[\lambda_1 L^TL + \lambda_3 I] = \lambda_1 A^TL
  \;\Rightarrow\; U = A^TL\,\big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1}

Substituting these back into Equation (5.2), we have:

J = \mathrm{tr}\Big\{ XX^T - 2XX^TL[L^TL + \lambda_2 I]^{-1}L^T + L[L^TL + \lambda_2 I]^{-1}L^TXX^TL[L^TL + \lambda_2 I]^{-1}L^T \Big\}
    + \lambda_1 \mathrm{tr}\Big\{ AA^T - 2AA^TL\big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1}L^T + L\big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1}L^TAA^TL\big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1}L^T \Big\}
    + \lambda_2 \mathrm{tr}\Big\{ [L^TL + \lambda_2 I]^{-1}L^TXX^TL[L^TL + \lambda_2 I]^{-1} \Big\}
    + \lambda_3 \mathrm{tr}\Big\{ \big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1}L^TAA^TL\big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1} \Big\}

Using the cyclic property of the trace to combine the quadratic terms with the regularization terms, and assuming \lambda_1 = 1 and \lambda_2 = \lambda_3 = \lambda, J can be further simplified as:

J = \mathrm{tr}\Big\{ XX^T - XX^TL[L^TL + \lambda I]^{-1}L^T \Big\} + \mathrm{tr}\Big\{ AA^T - AA^TL[L^TL + \lambda I]^{-1}L^T \Big\}
  = \mathrm{tr}\Big\{ \big(XX^T + AA^T\big) - \big(XX^T + AA^T\big)\,L[L^TL + \lambda I]^{-1}L^T \Big\}

The term XX^T + AA^T can be regarded as the joint kernel matrix of both the content and the link structure, which is already known given the design matrix and the adjacency matrix. Thus, finding a solution that minimizes J is equivalent to solving the following trace maximization problem:

\max \; \mathrm{tr}\big[ \big(XX^T + AA^T\big)\,L[L^TL + \lambda I]^{-1}L^T \big]        (5.3)

Let K = XX^T + AA^T. (To save space and running time, we keep K as a sparse matrix; in our implementation, we only maintain the elements of XX^T with value larger than 0.3, and this threshold can be adjusted based on the specific needs of the problem.) We can enforce the following constraint during training:

L_{\mathrm{train}}\,[L_{\mathrm{train}}^TL_{\mathrm{train}} + \lambda I]^{-1}L_{\mathrm{train}}^T = YCY^T

where Y ∈ ℜ^{n×c} is the class label matrix and C ∈ ℜ^{c×c} is the tree-structure covariance matrix. We can then simplify Equation (5.3) as follows:

\max_{C} \; \mathrm{tr}\big[ KYCY^T \big]        (5.4)

where the covariance matrix C associated with the label tree structure is jointly estimated from the link structure and the node attribute information of the network. C_{ij} is a measure of similarity between the i-th and j-th classes; the larger the value of C_{ij}, the closer the relationship between the two classes. Thus, we have shown that optimizing our regularized objective function for network learning is equivalent to the optimization of HSIC.

5.1.3 Additive Property of HSIC

As shown in the previous chapter for MF-Tree, the tree-structure covariance matrix has an additive property, which can be formally written as follows:

C_{2^m} = B_2 + B_4 + \cdots + B_{2^m},        (5.5)

where B_{2^j} = I_{2^j} \otimes 1_{2^{m-j}} 1_{2^{m-j}}^T, 1_{2^0} = 1, and m is the depth of the binary tree. Thus, we can write:

YCY^T = \sum_{i=1}^{\log_2 k} Y B_{2^i} Y^T = \sum_{i=1}^{\log_2 k} \sum_{j=1}^{2^{i-1}} Y_j B_{2^i, j} Y_j^T

This additive property enables us to solve the optimization problem with a hierarchical tree-like decomposition.
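To make the additive structure tangible, the following is a minimal sketch (toy sizes only, not the thesis code) that builds the block matrices B_{2^j} of Equation (5.5) for a complete binary tree over k = 2^m classes and checks that Y C Y^T decomposes level by level into the corresponding sum, which is the property exploited in the next subsection.

import numpy as np

# Minimal sketch: construct B_{2^j} = I_{2^j} kron (1 1^T) for a complete
# binary tree of depth m over k = 2^m classes (Equation 5.5).
def tree_block(j, m):
    ones_block = np.ones((2 ** (m - j), 2 ** (m - j)))
    return np.kron(np.eye(2 ** j), ones_block)

m = 3                                    # depth of the binary tree
k = 2 ** m                               # number of classes
blocks = [tree_block(j, m) for j in range(1, m + 1)]
C = sum(blocks)                          # C_{2^m} = B_2 + B_4 + ... + B_{2^m}

# Y C Y^T then splits into one term per tree level, so the objective can be
# optimized one level (and, further, one internal node) at a time.
rng = np.random.default_rng(1)
n = 20
Y = np.eye(k)[rng.integers(0, k, size=n)]   # one-hot label matrix, n x k
assert np.allclose(Y @ C @ Y.T, sum(Y @ B @ Y.T for B in blocks))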
Instead of learning C directly by maximizing the objective function in (5.4), we can learn a set of block matrices B_{2^i, j} iteratively, each of which is associated with an internal node of the tree structure.

5.1.4 Hierarchical Tree Construction from Network Data

To construct the tree, we need to solve the objective function given in (5.4). Using the tree decomposition approach given in Equation (5.5), we simplify this to

\max_{\{B\}} \sum_{i,j} \mathrm{tr}\big[ K_{j,j} Y_j B_{i,j} Y_j^T \big]        (5.6)

where the optimization problem can be solved iteratively, one block matrix B at a time. Similar to MF-Tree, by setting C_j = Y_j^T K_{j,j} Y_j, our tree construction step is equivalent to finding a matrix B_{i,j} that maximizes its alignment to C_j:

\max_{\{B\}} \sum_{i,j} \mathrm{tr}\big[ C_j B_{i,j} \big]        (5.7)

For brevity, we omit the subscript (i, j) in the remainder of this section. Since B is a symmetric matrix, let B = GG^T, where G ∈ ℜ^{c×2}. It is easy to show that

\mathrm{tr}\big[ CB \big] = \mathrm{tr}\big[ CGG^T \big] = \mathrm{tr}\big[ G^TCG \big].

Following the same approach as MF-Tree, we can relax the condition that the elements of G must be either 0 or 1 and instead require G to be an orthogonal matrix:

\max_{G} \; \mathrm{tr}\big[ G^TCG \big] \quad \text{s.t.} \quad G^TG = I_2

This optimization problem can be easily solved by finding the first two eigenvectors of C. After performing the eigenvalue decomposition of C, we can write:

C = G\Lambda G^T        (5.8)

Once the matrix G is known, we can apply the following equation to determine the partition each test example should follow:

\hat{p} = G^T Y^T \hat{k},

where \hat{k} is the kernel similarity between the test example and the training examples. This is equivalent to the formula used for predicting the branch to follow in the MF-Tree framework described in the previous section. After computing \hat{p}, we assign the test node to the partition with the largest value. We summarize the hierarchical tree construction procedure in Algorithm 5 and the node testing procedure in Algorithm 6.

Algorithm 5 Hierarchical Tree Construction from Network Data
Input: G = (V, E, D), (X_train, Y_train) ⊂ D, X_train ∈ R^{n×d}, Y_train ∈ R^{n×k}
Output: Tree structure, T
1. Construct A from G
2. root = TreeGrowth(X_train, Y_train, A)
3. Insert root into T
4. return T

function TreeGrowth(X, Y, A)
1. v = createNode()
2. v.index = getIndex(X)
3. if number of classes in Y < 2 then
4.     v.class = unique(Y)
5. else
6.     (X^(l), Y^(l), X^(r), Y^(r), v.g) = Partition(X, Y, A)
7.     v.left = TreeGrowth(X^(l), Y^(l), A^(l))
8.     v.right = TreeGrowth(X^(r), Y^(r), A^(r))
9. end if
10. return v

function Partition(X, Y, A)
1. Compute the centered kernel K from X and A
2. Compute C = Y^T K Y
3. Compute the first two eigenvectors g^(1), g^(2) of C
4. Let g = [g^(1), g^(2)]
5. (X^(l), Y^(l)) = {(x_i, y_i) | y_i = k, argmax_j g_{kj} = 1}
6. (X^(r), Y^(r)) = {(x_i, y_i) | y_i = k, argmax_j g_{kj} = 2}
7. return (X^(l), Y^(l), X^(r), Y^(r), g)

Algorithm 6 Testing with Hierarchical Tree
Input: X_test, X_train, Y_train, A, T
Output: Y_test
1. for each x ∈ X_test do
2.     v = T.root
3.     while v.class = ∅ do
4.         k̂ = Similarity(x, X_train, A, v.index)
5.         Ŷ = Subset(Y_train, v.index)
6.         ĝ = v.g
7.         p̂ = ĝ^T Ŷ^T k̂
8.         v = getChild(p̂, v)
9.     end while
10.    Insert y = v.class into Y_test
11. end for
12. return Y_test

5.1.5 Parallelization of Hierarchical Tree Learning

Our framework can be easily parallelized. As shown in Algorithms 5 and 6, the key step in the optimization of our objective function is to solve a matrix multiplication problem and to find the eigenvectors corresponding to the two largest eigenvalues of the tree-structure covariance matrix C.
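The following is a minimal serial sketch (dense toy data, not the MapReduce implementation) of the Partition step in Algorithm 5 and of the branch-prediction rule p̂ = ĝ^T Ŷ^T k̂ used in Algorithm 6.

import numpy as np

# Minimal sketch of Partition in Algorithm 5: split the classes at an internal
# node using the first two eigenvectors of C = Y^T K Y, where K is the centered
# joint kernel built from the node attributes X and the adjacency matrix A.
def partition_classes(X, A, Y):
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ (X @ X.T + A @ A.T) @ H      # centered joint kernel
    C = Y.T @ K @ Y                      # c x c matrix
    _, eigvecs = np.linalg.eigh(C)       # eigenvalues returned in ascending order
    G = eigvecs[:, -2:]                  # eigenvectors of the two largest eigenvalues
    side = np.argmax(G, axis=1)          # class k goes to branch 0 (left) or 1 (right)
    return G, side

# Branch prediction for a node (Algorithm 6): p_hat = g^T Y^T k_hat, where
# k_hat holds the kernel similarities between the node and the training nodes.
def predict_branch(G, Y, k_hat):
    p_hat = G.T @ (Y.T @ k_hat)
    return int(np.argmax(p_hat))

rng = np.random.default_rng(0)
n, d, c = 30, 5, 4
X = rng.standard_normal((n, d))
A = np.triu((rng.random((n, n)) < 0.1).astype(float), 1)
A = A + A.T
Y = np.eye(c)[rng.integers(0, c, size=n)]     # one-hot labels
G, side = partition_classes(X, A, Y)
k_hat = X @ X[0] + A @ A[0]                   # similarity of one node to the training nodes
print("class-to-branch assignment:", side)
print("predicted branch:", predict_branch(G, Y, k_hat))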
To perform the matrix multiplication and extract the eigenvectors, we apply the HEIGEN algorithm [74], which provides efficient matrix-matrix multiplication under the Hadoop/MapReduce distributed programming framework. HEIGEN includes an eigensolver for billion-scale matrices that computes the top k eigenvectors using the Hadoop/MapReduce framework. We evaluated our parallelization approach for MF-Tree on a cluster of machines running MapReduce with 128 worker nodes, each with 8 cores and 32 GB of RAM. The results are reported in the next section.

5.2 Experimental Evaluation

This section presents the results of our experiments to evaluate the performance of the enhanced MF-Tree framework for network data.

5.2.1 Network Data

We evaluated the performance of the proposed framework on two real-world network data sets: the U.S. Patent and Wikipedia data sets. Details of the data sets are given below.

5.2.1.1 U.S. Patent Data

We extracted the U.S. patents granted between January 1963 and December 1999 from a website called PatentGenius (http://www.patentgenius.com/), along with the citations made to these patents between 1975 and 1999 (over 16 million). For each patent, we extracted the title, abstract, category, and citation information from its Web page. We ended up with 3,774,768 patents with 16,518,948 citation links. There are 22 general categories in the U.S. Patent data, as shown in Table 5.1. Each general category has its own subcategories, which produces 426 classes under which a patent can be categorized. These classes are further divided into more subclasses, which in turn produces a hierarchical structure between classes. For this experiment, we tested our algorithm using 3,139 U.S. patent sub-categories. The class distribution of the U.S. Patent data is highly skewed, as shown in Figure 5.1, with the largest class containing 25,727 patents and the smallest class containing only 8 patents. A patent citation network is constructed from the data, where each patent is a node while its title and abstract represent the node attributes. The edge set contains citations between pairs of patents.

Table 5.1 22 General Categories of U.S. Patent Data

Class                                  No. of Patents
Business                               26790
Chemistry                              441687
Clothing, Fashion & Accessories        52895
Communications                         281045
Construction                           112621
Electrical & Energy                    383168
Engineering                            84506
Entertainment & Recreation             36342
Firearms & Weapons                     5968
Food                                   20896
Health & Medicine                      227264
Household                              96166
Industrial                             393949
Information Technology                 479869
Material Science                       418611
Optical & Optics                       147182
Packaging & Containers                 93128
Papers, Printing & Office Supplies     52285
Physics                                159269
Plants, Animals & Agriculture          25913
Tools & Hardware                       82137
Transportation & Automotive            153077

5.2.1.2 Wikipedia Data

We created network data from the Wikipedia dump dated October 9, 2009, where each Wikipedia article is a node and its title and content form the node attributes. The link set contains citations between pairs of articles. We generated two versions of the network data for evaluation purposes.

1. Wikipedia Data 1, which contains articles from all the subclasses of the biology category. There are altogether 98,068 articles belonging to 2,924 classes, with 331,916 hyperlinks between articles.

2. Wikipedia Data 2, which contains the entire 2,365,436 articles belonging to 614,428 classes.

Similar to the U.S. Patent data, the class distribution for Wikipedia Data 1 is also skewed, as shown in Figure 5.1.
Due to the imbalanced class distribution, accuracy is not a good metric for evaluating the performance of the framework. We will discuss the evaluation metric used in the next section.

Figure 5.1 Class Distribution of U.S. Patent and Wikipedia Data 1

5.2.2 Evaluation Metric

Accuracy is a widely used evaluation metric for classification tasks. The metric can be computed from a confusion matrix that summarizes the total number of examples correctly and incorrectly predicted by a classification method. Figure 5.2 shows an example of a confusion matrix for a binary classification problem, where the columns are the predicted classes and the rows are the actual classes. The two classes are denoted as the positive and negative class, respectively. In the confusion matrix, true positive (TP) is the number of positive examples correctly predicted, true negative (TN) is the number of negative examples correctly predicted, false positive (FP) is the number of negative examples incorrectly classified as positive, while false negative (FN) is the number of positive examples incorrectly classified as negative.

Figure 5.2 Confusion Matrix

Accuracy is then formally defined as

\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}.

Unfortunately, accuracy is not an appropriate metric for data sets with imbalanced classes. For example, if we have 99 instances belonging to the negative class and 1 instance from the positive class, then a classifier that predicts all 100 instances as the negative class will have an accuracy of 99% even though it fails to detect any instance from the positive class. Instead of accuracy, we report the experimental results using both micro- and macro-averaged F1 scores [75]. The metrics are formally defined as follows:

5.2.2.1 Micro-average F1

Let TP_k, FP_k, and FN_k be the true positives, false positives, and false negatives of class k ∈ C. The micro-average F1 score is defined as:

\text{Precision} = \frac{\sum_{k \in C} TP_k}{\sum_{k \in C} (TP_k + FP_k)}, \qquad
\text{Recall} = \frac{\sum_{k \in C} TP_k}{\sum_{k \in C} (TP_k + FN_k)}

\text{Micro-average } F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Note that micro-average F1 is computed based on the average precision and recall values of all the test instances.

5.2.2.2 Macro-average F1

The macro-average F1 score is defined as:

\text{Precision}_k = \frac{TP_k}{TP_k + FP_k}, \qquad
\text{Recall}_k = \frac{TP_k}{TP_k + FN_k}

\text{Macro-average } F_1 = \frac{1}{|C|} \sum_{k \in C} \frac{2 \times \text{Precision}_k \times \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}

The metric is computed by averaging the F1 scores of all the classes. When the class distribution is skewed, the macro-average F1 score is highly influenced by the F1 scores of the rare classes. In contrast, the micro-average F1 score is dominated by the performance of the classifier on instances from the larger categories.

5.2.3 Experimental Results

This section describes the results of our experiments on both the U.S. Patent and Wikipedia network data sets. We compared the hierarchical tree learning framework (denoted as HTL) against a well-known label propagation algorithm [36] and the decision tree classifier (applied to node attributes only). For this experiment, we apply the holdout method, where half of the network data was used for training and the remaining half was reserved for testing purposes.

5.2.3.1 Results on U.S. Patent Data

Tables 5.2 and 5.3 show the micro- and macro-average F1 scores of the various algorithms on the U.S. Patent data.
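Before turning to the detailed results, the following is a minimal sketch (with hypothetical per-class counts, not the experimental pipeline) of how the micro- and macro-average F1 scores defined above are computed.

import numpy as np

# Minimal sketch of the micro- and macro-average F1 scores in Section 5.2.2,
# given per-class true positive, false positive and false negative counts.
def micro_macro_f1(tp, fp, fn):
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    # Micro-average: pool the counts over all classes first.
    micro_p = tp.sum() / (tp.sum() + fp.sum())
    micro_r = tp.sum() / (tp.sum() + fn.sum())
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    # Macro-average: compute per-class F1, then average over the classes.
    p_k = tp / (tp + fp)
    r_k = tp / (tp + fn)
    macro_f1 = np.mean(2 * p_k * r_k / (p_k + r_k))
    return micro_f1, macro_f1

# Hypothetical counts for three classes: one large class and two rare ones.
# The rare classes pull macro-F1 down far more than micro-F1, which is the
# behaviour discussed above.
tp = [900, 5, 3]
fp = [50, 20, 15]
fn = [40, 30, 25]
print(micro_macro_f1(tp, fp, fn))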
For this experiment, we vary the number of classes by mapping the patents to their corresponding classes at different levels of the class hierarchy. Recall that the top level of the class hierarchy contains 22 classes, the second level contains 426 classes, and the subsequent level contains 3,139 classes.

Table 5.2 Micro-average F1 Comparison on U.S. Patent Data

No. Classes    Label Propagation (Link only)    Label Propagation (Link + Content)    Decision Tree    HTL
k = 22         0.582                            0.592                                 0.569            0.622
k = 426        0.500                            0.511                                 0.490            0.512
k = 3139       0.312                            0.330                                 0.302            0.328

Table 5.3 Macro-average F1 Comparison on U.S. Patent Data

No. Classes    Label Propagation (Link only)    Label Propagation (Link + Content)    Decision Tree    HTL
k = 22         0.426                            0.457                                 0.408            0.466
k = 426        0.338                            0.376                                 0.316            0.396
k = 3139       0.257                            0.261                                 0.238            0.276

The following observations can be made based on the results shown in the tables:

1. Hybrid approaches are more effective than attribute-only and link-only node classification methods. For example, the performance of the proposed hierarchical tree learning (HTL) method is superior to that of a simple decision tree classifier constructed on the node attributes only, in terms of both micro- and macro-averaged F1 scores. In addition, the hybrid label propagation approach, which uses both the link structure and the Web content of the patents, consistently outperforms label propagation with link-only information.

2. In terms of macro-average F1 scores, the proposed HTL method is better than the hybrid label propagation method in all three cases (k = 22, 426, and 3139). Since macro-average F1 is mainly influenced by the performance on the rare classes, this result demonstrates the effectiveness of the proposed method for handling imbalanced class problems compared to the other baseline methods.

3. In terms of micro-average F1 scores, the proposed HTL method is better than the hybrid label propagation method in 2 out of 3 cases (k = 22 and 426). For k = 3139, our results are slightly worse. This suggests that the hybrid label propagation method is slightly better at classifying instances from the larger categories.

To support our previous assertion that HTL is more effective at classifying the rare classes, consider the plot shown in Figure 5.3. The horizontal axis of the plot corresponds to the ID of the individual classes in the U.S. Patent data, sorted by increasing class size. The vertical axis corresponds to the F1 score of each class. The results clearly show that the F1 scores of HTL on the rare classes are higher than those obtained using the label propagation and decision tree methods. HTL also outperforms the decision tree classifier on all the classes, as its curve lies above the curve for the decision tree classifier.

Figure 5.3 F1 Score for each individual class in the U.S. Patent Data (where the classes are sorted in increasing class size).

The figure also shows that the label propagation methods generally work better than HTL on the larger classes. This result is consistent with the micro- and macro-average F1 scores shown in Tables 5.2, 5.3, and 5.5. Although the hybrid label propagation method is also quite effective, it does not produce a hierarchical tree structure that represents the conceptual relationship between classes.
Decision tree classifiers, on the other hand, produce trees that (1) may not include all the classes in their leaf nodes, and (2) may contain classes that are split into multiple leaf nodes of the tree. In contrast, the binary tree generated by the proposed HTL method can be interpreted as a class hierarchy, as it includes all the classes in the data and assigns each class to exactly one leaf node of its induced tree. Figure 5.4 provides a snippet of the tree structure obtained after applying HTL to the U.S. Patent data with 22 classes. The tree structure (up to depth 4) shows that we can capture the relationships among the classes very well.

Figure 5.4 A snippet of the class hierarchy structure (up to depth 4) generated by the proposed method for the 22 classes of the U.S. Patent Data

For example, chemistry-related classes (chemistry, firearms, health & medicine, papers, printing & office supplies) were assigned to the same branch. Although there was some uncertainty in the assignments (e.g., food with material science and clothing), in general, the classes located near the bottom levels of the tree are quite related to each other.

Testing Efficiency Comparison. To compare test efficiency, we compared our method against the decision tree classifier on the U.S. Patent data and plotted the results in Figure 5.5. The horizontal axis in the figure corresponds to the test time while the vertical axis corresponds to the macro-average or micro-average F1 score. We applied each method to the different variants of the U.S. Patent data (with k = 22, 426, and 3139). As can be seen from the figure, our method consistently achieves higher average F1 scores but is slower in terms of test efficiency, even though the depth of the tree produced by the HTL method is smaller than that of a decision tree (see Table 5.4 for a comparison of the maximum depths of the trees). We believe this is because our method needs to perform more complex matrix computations at the internal nodes of the tree compared to a decision tree classifier, which uses simple attribute test conditions to determine which branch to follow.

Table 5.4 Tree Depth Comparison on U.S. Patent Data

                 k = 22    k = 426    k = 3139
Decision Tree    56        212        635
HTL              5         10         13

Figure 5.5 Testing Efficiency Comparison on U.S. Patent Data

5.2.3.2 Results on Wikipedia Data 1

In this section, we tested our algorithm on Wikipedia Data 1, which contains 98,068 articles belonging to 2,924 subclasses of biology. The results in terms of both micro-average F1 and macro-average F1 are shown in Table 5.5.

Table 5.5 Performance Comparison on Wikipedia Data 1

                                      Micro-average F1    Macro-average F1
Label Propagation (Link only)         0.413               0.351
Label Propagation (Link + Content)    0.425               0.365
Decision Tree                         0.416               0.349
HTL                                   0.427               0.371

The results suggest that the proposed method is slightly better than all three baselines in terms of both evaluation criteria. Since its macro-average F1 score is slightly higher than that of hybrid label propagation, this agrees with our previous observation on the U.S. Patent data that the proposed method is better at dealing with rare classes. This conclusion is supported when we analyze the F1 scores of the individual classes.
As shown in Figure 5.6, the proposed HTL method has the highest F1 scores when the class size is small. However, it is slightly worse than the label propagation methods for some of the larger classes.

Figure 5.6 F1 Score for each individual class in Wikipedia Data 1 (where the classes are sorted in increasing class size).

Testing Efficiency Comparison. The test time for the decision tree classifier on Wikipedia Data 1 is 427.9 seconds, while the proposed HTL method takes 516.8 seconds. Although our method is slightly slower, it consistently outperforms the decision tree classifier in terms of both its micro-average and macro-average F1 scores. In addition, unlike decision trees, our method has the ability to generate a meaningful taxonomy structure that reveals the class relationships. For the 2,924 classes in Wikipedia Data 1, the decision tree classifier generates a tree with a maximum depth of 516, while the proposed HTL method generates a tree with a maximum depth of 13.

5.2.3.3 Parallel Results on Wikipedia Data 2

To evaluate the performance of our parallel HTL method, we use Wikipedia Data 2, which contains more than 600K classes and 2 million articles. The results are shown in Table 5.6. Since none of the competing methods are applicable to a data set of this size, we report only the performance of our method in this section. Unlike Wikipedia Data 1, the number of classes in this data set is 200 times larger, which explains its lower F1 scores. It is also computationally more expensive, requiring 27,153 seconds for testing.

Table 5.6 Results on Wikipedia Data 2 using the parallel HTL method.

Micro-average F1    0.3436
Macro-average F1    0.1738

5.3 Conclusion

In this chapter, we proposed an extension of our MF-Tree framework for large multi-class network classification. We demonstrated that the proposed method can systematically combine both the node attributes and the link structure to effectively learn a hierarchical structure for large multi-class network learning problems. Experimental results confirmed the effectiveness of our method.

Chapter 6 Co-classification of Heterogeneous Networks

Most existing network learning techniques have focused on the analysis of a single network. One drawback of such methods is that information from a single network may not be sufficient to effectively learn an accurate classification model, especially when the network has noisy attributes or an incomplete link structure. In this chapter, we introduce a framework that can jointly classify heterogeneous networks that have different types of attributes and links but share similar characteristics in terms of their class information. The proposed framework is known as co-classification. As an example of a co-classification task, consider the network of editors and articles in Wikipedia. Wikipedia has become the largest online knowledge repository, with approximately 15 million articles written in more than 270 languages, among which more than 3 million are in English. These articles are written collaboratively by voluntary users around the world.
Given its massive amount of information and the fact that most of its content can be edited freely by any user with access to the Internet, it is not surprising that Wikipedia has become a rich domain for a variety of interesting classification problems, such as topic detection of Wikipedia articles, categorization of Wikipedia editors as humans or bots, and classification of Wikipedia content as featured articles, controversial/inflammatory content, spam, or vandalism. These tasks can be cast as network classification problems, in which the goal is to assign a class label to each node in the network (e.g., an article or a user) based on the node features and link information [76][77][78][40]. Most of the previous works have focused on classifying nodes in a single network. Since Wikipedia articles and users are inextricably related, knowing the class label of an article often aids in the classification of users, and vice versa. For example, in topic detection of Wikipedia articles, the category of an article provides evidence for the interest area of a user who edited the article's content. Similarly, in spam detection, we expect spam content to be more likely contributed by spammers than by non-spammers. Assigning class labels to both articles and users in Wikipedia not only increases the amount of labeled data available, it is often a natural process during the creation of the training data itself (e.g., to label a user as a spammer or a vandal, one has to know whether the edited content should be considered a form of spam or vandalism). Despite the obvious relationship between their classes, none of the previous works on Wikipedia classification [79][80][81] were designed to co-classify the article and user networks.

A co-classification method was previously developed in [13] to detect spam and spammers in social bookmarking Web sites. The method is based on extending the least-squares support vector machine (LS-SVM) formulation to network data, taking into account the correspondences between the classes in different networks. Nevertheless, the method assumes a binary class problem, where the nodes in each network are assigned to one of two possible classes (spam/non-spam or spammer/non-spammer). However, there are many Wikipedia classification problems in which the nodes in the user and article networks may have more than two classes. Though it may be possible to transform the multi-class problem into multiple binary classification problems (using the 1-vs-1 or 1-vs-rest strategies), this approach may not be effective because each binary classifier is trained independently of the others. It is also inapplicable when the numbers of classes in the article and user networks are different. In this chapter, we present an extension of the LS-SVM formulation for network data to the multi-class problem. Specifically, we formalize the joint classification tasks as a constrained optimization problem, in which the relationships between the classes of nodes in two different networks are modeled as graph regularization constraints. Unlike the previous binary-class formulation in [13], the formulation proposed here allows us to incorporate prior knowledge about the potential relationships between classes in different networks to avoid overfitting.

6.1 Preliminaries

We begin with a brief discussion of the terminology and notation used. The terminology is given in the context of the user and article networks in Wikipedia.
User: A registered user is identified by a username and has an associated user page on Wikipedia. Unregistered users are also allowed to edit Wikipedia pages; these users are identified by their corresponding IP addresses. In this study, we consider only registered Wikipedia users. Let U be the set of such users.

Article: Articles in Wikipedia have a title and are organized by categories. Some articles are considered Wikipedia stubs, which have been created as placeholders for a particular topic but do not contain any useful text information. In this study, we consider only non-stub Wikipedia articles. Let A be the set of such articles.

User-Article matrix: The edits made by users on Wikipedia articles are represented in a user-article matrix M, where M(i, j) is the total number of edits made by user i on article j.

Category relation matrix: Let E_t be a category relation matrix estimated from the training data. The matrix represents the number of links between each user category and article category. For instance, E_t(y^{(u)}, y^{(a)}) is the total number of edits made by users from category y^{(u)} on articles from category y^{(a)}, where y^{(u)} ∈ {1, 2, ..., k_u} and y^{(a)} ∈ {1, 2, ..., k_a}. k_u and k_a are the numbers of user and article categories, respectively.

Our proposed framework assumes that there is a correspondence between the user and article categories (classes). For example, suppose we are interested in classifying the topic areas of Wikipedia users and articles. We expect users who are classified as "computer science" editors to edit more articles classified as "computer science" than articles from other areas such as "biology". As will be shown in Section 6.2, this assumption can be enforced using a graph regularization constraint in our proposed co-classification framework.

Before describing our methodology, we first formalize the co-classification problem. Suppose we are given:

1. A set of l_u labeled users, L^{(u)} = {(x_1^{(u)}, y_1^{(u)}), (x_2^{(u)}, y_2^{(u)}), ..., (x_{l_u}^{(u)}, y_{l_u}^{(u)})}, where x_i^{(u)} is a d_u-dimensional feature vector for the i-th user and y_i^{(u)} ∈ {1, 2, ..., k_u} is its class label.

2. A set of n_u − l_u unlabeled users, U^{(u)} = {x_{l_u+1}^{(u)}, x_{l_u+2}^{(u)}, ..., x_{n_u}^{(u)}}.

3. A set of l_a labeled articles, L^{(a)} = {(x_1^{(a)}, y_1^{(a)}), (x_2^{(a)}, y_2^{(a)}), ..., (x_{l_a}^{(a)}, y_{l_a}^{(a)})}, where x_j^{(a)} is a d_a-dimensional feature vector for the j-th article and y_j^{(a)} ∈ {1, 2, ..., k_a} is its class label.

4. A set of n_a − l_a unlabeled articles, U^{(a)} = {x_{l_a+1}^{(a)}, x_{l_a+2}^{(a)}, ..., x_{n_a}^{(a)}}.

5. A set of user-article pairs, M, where (x_i^{(u)}, x_j^{(a)}) ∈ M if user x_i^{(u)} edits article x_j^{(a)}.

Definition 5 (Co-Classification). Given L^{(u)}, L^{(a)}, and M, the goal of co-classification is to learn a pair of classifiers: (1) f_u : ℜ^{d_u} → {1, 2, ..., k_u} that accurately maps the feature vector of a user x^{(u)} to its class label y^{(u)}, and (2) f_a : ℜ^{d_a} → {1, 2, ..., k_a} that accurately maps the feature vector of an article x^{(a)} to its class label y^{(a)}.

6.2 Methodology

This section presents the details of our proposed co-classification framework. We first present a maximum margin classification approach that classifies the Wikipedia users and articles independently. We then describe the proposed co-classification approach.

6.2.1 Independent Classification of Wikipedia Users and Articles

The framework developed in this study is an extension of the LS-SVM formulation for network data developed in [13].
Before describing the framework, we first consider the task of classifying Wikipedia users and articles independently. Given a set of labeled users L^{(u)}, we may construct an LS-SVM classifier for users by minimizing the following objective function:

L_u(w, b, e, \alpha) = \frac{1}{2}\sum_{m=1}^{k_u} w_m^{(u)T} w_m^{(u)} + \frac{\gamma_1}{2}\sum_{i=1}^{l_u}\sum_{m \ne y_i} e_{im}^{(u)2}
    - \sum_{i=1}^{l_u}\sum_{m \ne y_i} \alpha_{im}\big(w_{y_i}^{(u)} x_i^{(u)} + b_{y_i}^{(u)} - w_m^{(u)} x_i^{(u)} - b_m^{(u)} - 2 + e_{im}^{(u)}\big)        (6.1)

where {\alpha_{im}} is the set of Lagrange multipliers. Similarly, the corresponding Lagrangian formulation for classifying articles can be expressed as follows:

L_a(w, b, \varepsilon, \beta) = \frac{1}{2}\sum_{n=1}^{k_a} w_n^{(a)T} w_n^{(a)} + \frac{\gamma_2}{2}\sum_{j=1}^{l_a}\sum_{n \ne y_j} \varepsilon_{jn}^{(a)2}
    - \sum_{j=1}^{l_a}\sum_{n \ne y_j} \beta_{jn}\big(w_{y_j}^{(a)} x_j^{(a)} + b_{y_j}^{(a)} - w_n^{(a)} x_j^{(a)} - b_n^{(a)} - 2 + \varepsilon_{jn}^{(a)}\big)        (6.2)

where {w_n^{(a)}} is the set of parameters for the article classifier, \gamma_2 is a parameter that controls the penalty of misclassifying articles, and \varepsilon_{jn}^{(a)} is the error of classifying the j-th article.

6.2.2 Co-Classification of Users and Articles

Instead of solving the optimization problems for classifying users and articles independently, we propose a co-classification algorithm that utilizes the link structure between users and articles through the two relation matrices E_t and M described in Section 6.1. The category relation matrix E_t is computed as follows:

E_t(y^{(u)}, y^{(a)}) = \sum_{i:\, y_i^{(u)} = y^{(u)}}\;\sum_{j:\, y_j^{(a)} = y^{(a)}} M(x_i^{(u)}, x_j^{(a)})

where x_i^{(u)} ∈ L^{(u)}, y^{(u)} ∈ {1, 2, ..., k_u}, x_j^{(a)} ∈ L^{(a)}, and y^{(a)} ∈ {1, 2, ..., k_a}.

Figure 6.1 A Sample of User-Article network

Figure 6.1 shows an example of the links between Wikipedia users and articles. Suppose both the users and articles are divided into four categories, represented by nodes with the following colors: white, blue, gray, and cyan. The edge between a user and an article corresponds to the number of edits, which is captured by the user-article matrix M. For example, M(u6, a2) = 4 means user u6 has edited article a2 four times. Each entry of the category relation matrix E_t(y^{(u)}, y^{(a)}) corresponds to the sum of the weights of the edges between users from category y^{(u)} and articles from category y^{(a)}. For example, E_t(white, gray) = M(u6, a2) + M(u4, a3) = 10.

One limitation of using E_t is that if the number of labeled users connected to the labeled articles is small, then the relationship between the user and article classes may not be accurately represented by E_t. Furthermore, the presence of noisy links in the Wikipedia data may also have an adverse effect that degrades classification performance. In many cases, we may have some prior knowledge about the probability of a link between nodes of two classes in the user and article networks. This information can be encoded using a prior matrix E_p to avoid overfitting the model to the observed relationships between classes in the different networks. Based on these two considerations, we construct an adjusted category relation matrix as follows:

E(y^{(u)}, y^{(a)}) = \alpha E_t(y^{(u)}, y^{(a)}) + (1 - \alpha) E_p(y^{(u)}, y^{(a)}),        (6.3)

where 0 ≤ α ≤ 1 is a tuning parameter that controls the trade-off between E_t and E_p. Its value can be determined via cross-validation or set to a constant, say 0.5, for equal weighting. The prior matrix used in our experiments is shown in Table 6.1, where the rows and columns correspond to the user and article categories, respectively. The values in E_t should be normalized for two reasons. First, the values in E_t and E_p do not have the same scale.
The larger values in E_t would tend to dominate the overall value of E, thus losing information about our prior knowledge. Second, it is useful to normalize the values in E_t so that they range between [-1, 1]. As will be seen shortly, this helps to enforce a penalty or reward scheme in our regularization constraint.

Table 6.1 Prior category relation Matrix E_p(k_u, k_a)

       Ca1    Ca2    Ca3    Ca4
Cu1    1      -1     -1     -1
Cu2    -1     1      -1     -1
Cu3    -1     -1     1      -1
Cu4    -1     -1     -1     1

Based on these two considerations, we normalize E_t as follows:

E_t''(y_i^{(u)}, y_j^{(a)}) = E_t(y_i^{(u)}, y_j^{(a)}) \Big/ \sum_{k_a} E_t(y_i^{(u)}, y_j^{(a)})

E_t'(y_i^{(u)}, y_j^{(a)}) = E_t''(y_i^{(u)}, y_j^{(a)}) - \frac{\sum_{k_a} E_t''(y_i^{(u)}, y_j^{(a)})}{k_a}

After normalization, if y_i^{(u)} and y_j^{(a)} are related categories, then E_t'(y_i^{(u)}, y_j^{(a)}) is positive-valued; otherwise, E_t'(y_i^{(u)}, y_j^{(a)}) is negative-valued. The normalized value of E_t will be used to estimate E in (6.3).

The category relation matrix will be used to enforce a constraint between the user and article classifiers. Specifically, the assumption described in Section 6.1 states that Wikipedia users are unlikely to edit articles in non-related categories. The following graph regularization term is therefore introduced to penalize models in which users edit articles that belong to non-related categories:

-\gamma_3 \sum_{i=1}^{l_u} \sum_{\substack{j=1 \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}}^{l_a} E'(y_i^{(u)}, y_j^{(a)}) \big(w_{y_i}^{(u)} x_i^{(u)} + b_{y_i}^{(u)} - w_{y_j}^{(a)} x_j^{(a)} - b_{y_j}^{(a)}\big)

The overall objective function of our co-classification framework for users and articles can be written as follows:

L = \frac{1}{2}\sum_{m=1}^{k_u} w_m^{(u)T} w_m^{(u)} + \frac{\gamma_1}{2}\sum_{i=1}^{l_u}\sum_{m \ne y_i} e_{im}^{(u)2}
    + \frac{1}{2}\sum_{n=1}^{k_a} w_n^{(a)T} w_n^{(a)} + \frac{\gamma_2}{2}\sum_{j=1}^{l_a}\sum_{n \ne y_j} \varepsilon_{jn}^{(a)2}
    - \sum_{i=1}^{l_u}\sum_{m \ne y_i} \alpha_{im}\big(w_{y_i}^{(u)} x_i^{(u)} + b_{y_i}^{(u)} - w_m^{(u)} x_i^{(u)} - b_m^{(u)} - 2 + e_{im}^{(u)}\big)
    - \sum_{j=1}^{l_a}\sum_{n \ne y_j} \beta_{jn}\big(w_{y_j}^{(a)} x_j^{(a)} + b_{y_j}^{(a)} - w_n^{(a)} x_j^{(a)} - b_n^{(a)} - 2 + \varepsilon_{jn}^{(a)}\big)
    - \gamma_3 \sum_{\substack{i,j \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}} E'(y_i^{(u)}, y_j^{(a)}) \big(w_{y_i}^{(u)} x_i^{(u)} + b_{y_i}^{(u)} - w_{y_j}^{(a)} x_j^{(a)} - b_{y_j}^{(a)}\big)        (6.4)

where (w^{(u)}, w^{(a)}, e, \varepsilon, \alpha, \beta) are the model parameters to be estimated from the training data and (\gamma_1, \gamma_2, \gamma_3) are user-specified parameters. For our experiments, we set \gamma_1 = \gamma_2 = 1, whereas \gamma_3 is estimated from the data via cross-validation. The objective function is solved by taking the derivative of L with respect to each of the model parameters and setting them to zero. This leads to a set of linear equations, which can be expressed in matrix notation as follows:

\begin{bmatrix} C^{(u)} & 0 \\ 0 & C^{(a)} \end{bmatrix}
\begin{bmatrix} \chi^{(u)} \\ \chi^{(a)} \end{bmatrix}
=
\begin{bmatrix} b^{(u)} \\ b^{(a)} \end{bmatrix}        (6.5)

where C^{(u)} is a matrix that contains x_i^{(u)} and E'(m, y_j^{(a)}), the vector \chi^{(u)} = (w_1^{(u)}, ..., w_{k_u}^{(u)}, b_1^{(u)}, ..., b_{k_u}^{(u)}, e, \alpha)^T, and the vector b^{(u)} = (p_1, q_1, 0_{l_u \times k_u}, 2_{l_u \times k_u})^T, where 0_{l_u \times k_u} is an (l_u × k_u)-dimensional vector of 0s and 2_{l_u \times k_u} is an (l_u × k_u)-dimensional vector of 2s.
Analogously, C^{(a)}, \chi^{(a)}, and b^{(a)} = (p_2, q_2, 0_{l_a \times k_a}, 2_{l_a \times k_a})^T are the corresponding matrix and vectors for solving the parameters of the article classifier, where:

p_1(m) = \gamma_3 \sum_{\substack{i,j:\; y_i^{(u)} = m \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}} E'(m, y_j^{(a)})\, x_i^{(u)}, \qquad
q_1(m) = \gamma_3 \sum_{\substack{i,j:\; y_i^{(u)} = m \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}} E'(m, y_j^{(a)})

p_2(n) = \gamma_3 \sum_{\substack{i,j:\; y_j^{(a)} = n \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}} E'(y_i^{(u)}, n)\, x_j^{(a)}, \qquad
q_2(n) = \gamma_3 \sum_{\substack{i,j:\; y_j^{(a)} = n \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}} E'(y_i^{(u)}, n)

The block structure of the matrix equation in (6.5) suggests that the system of linear equations can be decoupled into two subproblems, one for learning the parameters of the user classifier and the other for the article classifier:

C^{(u)} \chi^{(u)} = b^{(u)}        (6.6)

C^{(a)} \chi^{(a)} = b^{(a)}        (6.7)

It is worth noting that the solutions of the two equations are not entirely independent, because the terms p_1, p_2, q_1 and q_2 depend on the link structure of M and E. It is sufficient to solve Equations (6.6) and (6.7) for (\alpha, b_m^{(u)}, \beta, b_n^{(a)}) in order to classify the unlabeled users and articles. Algorithm 7 summarizes a high-level overview of our co-classification algorithm. The linearsolver function solves the set of linear equations to obtain the model parameters. The Classify function takes the model parameters and the unlabeled data as input and generates their predicted class labels as output.

Algorithm 7 Co-Classification Algorithm
Input: U^{(a)}, U^{(u)}, L^{(a)}, L^{(u)}, M, E' and γ
Output: (y_i^{(u)} | i = l_u + 1, ..., n_u), (y_j^{(a)} | j = l_a + 1, ..., n_a)
1. (α, β, b^{(u)}, b^{(a)}) ← linearsolver(L^{(a)}, L^{(u)}, M, E', γ)
2. y^{(a)} ← Classify(U^{(a)}, α, b^{(a)})
3. y^{(u)} ← Classify(U^{(u)}, β, b^{(u)})

6.3 Experimental Evaluation

This section presents the experiments performed to evaluate the performance of the proposed co-classification algorithm.

6.3.1 Data Set

To evaluate the performance of our algorithm, we downloaded the Wikipedia dump dated Oct-09-2009. After preprocessing, we chose the following four topics from each network as our ground truth classes (k_u = k_a = 4): biology, natural science, computer science, and political science. The label for each user is assigned based on the majority class of the articles written by that user. We created a sample that contains 5,361 users and 6,403 articles for our experiments.

The experimental results reported in this section are obtained using 5-fold cross-validation. For each run, we use 4 of the folds for training and the remaining fold as test data. We repeated the experiment ten times, each time using a random division of the data into five folds. Table 6.2 shows the category relation matrix E_t' obtained from the training data.

Table 6.2 Category relation Matrix from training data E_t'

       Ca1        Ca2        Ca3        Ca4
Cu1    0.6658     -0.2225    -0.2274    -0.2159
Cu2    -0.2186    0.6983     -0.2385    -0.2412
Cu3    -0.2342    -0.2271    0.5430     -0.0816
Cu4    -0.2423    -0.2462    -0.2398    0.7283

Our experimental results are summarized in Table 6.3. The table shows a comparison between the performance of our proposed multi-class LS-SVM framework and that of the LS-SVM algorithm applied independently to the user and article networks. For the latter case, we employ the 1-vs-rest strategy to train the independent LS-SVM models. Note that Pre denotes the average precision over all four categories, Rec denotes the average recall over all four categories, and F1 is the harmonic mean of precision and recall. Clearly, the F1 measure for our proposed framework is higher than that for LS-SVM.
A more detailed analysis can be seen in Figures 6.2 and 6.3, where we show the F1 values for each class in the article and user networks. Notice that the improvement in F1 is observed across all the classes. These results suggest that our multi-class graph regularization framework, along with the addition of the prior matrix, enables our algorithm to outperform LS-SVM models that are trained independently.

Table 6.3 Four categories average results, 5-fold cross-validation

                    LS-SVM     Co-LS-SVM
Article   Rec       49.6420    53.5612
          Pre       48.8169    50.1656
          F1        49.2260    51.8078
User      Rec       40.6410    46.2185
          Pre       41.5724    42.8800
          F1        41.1014    44.4867

Figure 6.2 Comparison of F1 measure for each Article category between LS-SVM and Co-LS-SVM algorithms

Figure 6.3 Comparison of F1 measure for each User category between LS-SVM and Co-LS-SVM algorithms

Despite the observed improvement, our overall F1 measure is around 50%, which is not very high. This is probably because of the presence of noisy links in Wikipedia, which adversely affect our classification results. Furthermore, we use the content of the user page in Wikipedia to represent the feature vector of each user. Since some of the user pages contain only a few uninformative words or are even empty, their feature vectors may not be effective for classification. As a next step in our work, we may consider removing the noisy links and exploring additional features to represent the users.

6.4 Conclusion

In this chapter, we proposed a co-classification framework for the Wikipedia user and article networks, which utilizes the relational information between the two networks and learns the classifier models simultaneously. Experimental results have justified the effectiveness of our method.

Chapter 7 Future Work

In this thesis, I have presented my research to address some of the key challenges of large multi-class network learning. First, a hierarchical learning technique known as recursive non-negative matrix factorization was developed to address the test efficiency of large multi-class problems. Another unique advantage of the approach is that it can accommodate the detection of new classes in the test data. Second, I presented another approach known as matrix factorization tree (MF-Tree), which was designed to optimize a global objective function. A proof was also given to demonstrate the equivalence between the proposed regularized loss function and the Hilbert-Schmidt Independence Criterion. We also showed the additive property of the objective function, which allows us to decompose the large multi-class learning problem into a series of binary classification tasks that can be organized in a binary tree structure. Third, we extended the MF-Tree approach to deal with network data. The enhanced MF-Tree approach systematically combines both the node attributes and the link structure information to perform large multi-class node classification and class hierarchy generation from the network data. Finally, we introduced an approach to jointly classify heterogeneous nodes in multiple networks in order to improve classification accuracy.
For future work, we plan to investigate the following two problems as extensions of this thesis:

7.1 Joint Label Tree Construction and Link Prediction for Networks with a Large Number of Classes

Since the nodes and link structure are interleaved with each other in a network, node classification and link prediction are two related learning tasks. This is because, in most networks, more links exist between nodes of the same class than between nodes of different classes. Thus, we conjecture that the learning of one task could aid the learning of the other. The hierarchical network learning method proposed in Chapter 5 builds a hierarchical structure by optimizing an objective function to partition the data into subgroups at each internal node. The partitioning may break links between pairs of nodes that should be linked together, which may degrade classification performance. Designing an effective hierarchical learning method that can jointly perform node classification and link prediction would be an interesting subject for future research.

7.2 Classification of Documents into Reading Positions

From an application perspective, hierarchical learning methods can potentially be used to provide a navigational interface to content, i.e., a way to access the content of a document collection. Currently, a popular way to access a large collection of documents is through a search interface that allows users to submit a query and view a dynamically generated ranked list of relevant documents. Such an approach may work fine when a user is trying to locate a specific document in the collection, but it is insufficient when users need to access the pertinent documents in some logical order, e.g., for learning, research, or editorial purposes. Search engines hide document relationships, while navigational interfaces tend to capture
For example, to determine the rank among documents d1 , d2 , and d3 , we can transform it into classifying whether d1 should precede d2 , d1 should precede d3 and d2 should precede d3 . 110 REFERENCES 111 REFERENCES [1] A. Choromanska, A. Agrawal, and J. Langford, “Extreme multi class classification,” in Neural Information Processing Systems Conference (NIPS) Workshop: Extreme Classification: Multi-Class & Multi-Label Learning with Millions of Categories, (Lake Tahoe, NV), 2013. [2] L. Liu, S. Saha, R. Torres, J. Xu, P.-N. Tan, A. Nucci, and M. Mellia, “Detecting malicious clients in isp networks using http connectivity graph and flow information,” in Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on, pp. 150–157, 2014. [3] P. Zhao and J. Han, “On graph query optimization in large networks,” Proc. VLDB Endow., vol. 3, pp. 340–351, Sept. 2010. [4] M. Adedoyin-Olowe, M. M. Gaber, and F. Stahl, “A survey of data mining techniques for social media analysis,” CoRR, 2013. [5] J. Scripps, R. Nussbaum, P. Tan, and A. Esfahanian, “Link-based network mining,” in Structural Analysis of Complex Networks (M. Dehmer, ed.), Birkhauser, 2010. [6] P. Mandayam-Comar, P.-N. Tan, and A. Jain, “Identifying cohesive subgroups and their correspondences in multiple related networks,” in Proc of the 2010 IEEE/WIC/ACM Int’l Conf on Web Intelligence (WI-2010), (Toronto, Canada), 2010. [7] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proc of the ACM International Conference on Knowledge Discovery and Data Mining, 2003. [8] Q. Lu and L. Getoor, “Link-based classification,” in Proceedings of the International Conference on Machine Learning (ICML), 2003. [9] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, “Collective classification in network data,” AI Magazine, vol. 29, no. 3, pp. 93–106, 2008. [10] G. Pandey, V. Kumar, and M. Steinbach, “Computational approaches for protein function prediction: A survey,” Tech. Rep. 06-028, Department of Computer Science and Engineering, University of Minnesota, Twin Cities, 2006. 112 [11] M. E. Tipping, “Sparse bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001. [12] D. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006. [13] F. Chen, P. Tan, and A. Jain, “A co-classification framework for detecting web spam and spammers in social media web sites,” in Proceeding Of ACM CIKM International Conference On Information And Knowledge Management, (Hong Kong,China), pp. 1807–1810, 2009. [14] L. Liu and P.-N. Tan, “A framework for co-classification of articles and users in wikipedia,” in Web Intelligence’10, pp. 212–215, 2010. [15] J. C. Platt, N. Cristianini, and J. Shawe-taylor, “Large margin dags for multiclass classification,” in Advances in Neural Information Processing Systems, pp. 547–553, MIT Press, 2000. [16] V. Vural and J. G. Dy, “A hierarchical method for multi-class support vector machines,” in Proceedings of the twenty-first international conference on Machine learning, pp. 105– 112, 2004. [17] A. Beygelzimer, J. Langford, and P. Ravikumar, “Multiclass classification with filter trees,” in Information Theory and Application Workshop, 2007. [18] S. Bengio, J. Weston, and D. Grangier, “Label embedding trees for large multi-class tasks.,” in Neural Information Processing Systems, pp. 163–171, 2010. [19] J. Deng, S. Satheesh, A. 
C. Berg, and F. F. F. Li, “Fast and balanced: Efficient label tree learning for large scale object recognition,” in Advances in Neural Information Processing Systems 24 (J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds.), pp. 567–575, 2011. [20] L. Liu, P. M. Comar, S. Saha, P.-N. Tan, and A. Nucci, “Recursive nmf: Efficient label tree learning for large multi-class problems,” in Proceedings of the 21st International Conference on Pattern Recognition (ICPR’2012), pp. 2148–2151, 2012. [21] A. Gretton, O. Bousquet, A. Smola, and B. Sch¨olkopf, “Measuring statistical dependence with hilbert-schmidt norms,” in Proceedings of the 16th International Conference on Algorithmic Learning Theory (ALT’2005), pp. 63–77, 2005. 113 [22] M. B. Blaschko and A. Gretton, “Learning taxonomies by dependence maximization,” in Proceedings of the Advances in Neural Information (NIPS’2008), pp. 153–160, 2008. [23] L. Song, A. J. Smola, A. Gretton, and K. M. Borgwardt, “A dependence maximization view of clustering,” in Proceedings of the 24th International Conference on Machine Learning (ICML’2007), pp. 815–822, 2007. [24] M. B. Blaschko, W. Zaremba, and A. Gretton, “Taxonomic prediction with treestructured covariances,” in Machine Learning and Knowledge Discovery in DatabasesEuropean Conference (ECML/PKDD’2013), pp. 304–319, 2013. [25] T. Zhang, A. Popescul, and B. Dom, “Linear prediction models with graph regularization for web-page categorization,” in Proceeding of ACM SIGKDD International Conf on Data Mining, (Philadelphia,PA), pp. 812–826, 2006. [26] R. Heatherly, M. Kantarcioglu, and B. Thuraisingham, “Social Network Classification Incorporating Link Type Values,” in IEEE Intelligence and Security Informatics, (, Dallas, Texas,), June 2009. [27] J. Scripps, P.-N. Tan, F. Chen, and A.-H. Esfahanian, “A matrix alignment approach for collective classification,” in ASONAM, pp. 155–159, 2009. [28] J. Scripps, P.-N. Tan, F. Chen, and A.-H. Esfahanian, “A matrix alignment approach for link prediction,” in ICPR, pp. 1–4, 2008. [29] O.-W. Kwon and J.-H. Lee, “Web page classification based on k-nearest neighbor approach,” in Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages (IRAL), pp. 9–15, 2000. [30] J. F¨ urnkranz, “Exploiting structural information for text classification on the www,” in Proceedings of the Third International Symposium on Advances in Intelligent Data Analysis, IDA ’99, pp. 487–498, 1999. [31] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Beyond blacklists: learning to detect malicious web sites from suspicious urls,” in KDD, 2009. [32] A. Le, A. Markopoulou, and M. Faloutsos, “Phishdef: Url names say it all,” INFOCOM, 2011. [33] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, pp. 604–632, Sept. 1999. 114 [34] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.,” technical report, 1999. [35] V. K. Stanford and V. Krishnan, “Web spam detection with anti-trust rank,” 2006. [36] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency,” in Proc. of Advances in Neural Information Processing Systems, pp. 321–328, 2003. [37] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation,” tech. rep., 2002. [38] Z. Gy¨ongyi, H. Garcia-Molina, and J. 
[39] Y. Yang, S. Slattery, and R. Ghani, “A study of approaches to hypertext categorization,” Journal of Intelligent Information Systems, vol. 18, pp. 219–241, 2002.
[40] S. Zhu, K. Yu, Y. Chi, and Y. Gong, “Combining content and link for classification using matrix factorization,” in Proceedings of ACM SIGIR, (The Netherlands), pp. 487–494, 2007.
[41] Q. Lu and L. Getoor, “Link-based classification,” in Proceedings of the Twentieth International Conference on Machine Learning (ICML’2003), (Washington, DC), pp. 496–503, 2003.
[42] Q. Gan and T. Suel, “Improving web spam classifiers using link structure,” in AIRWeb ’07: Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, (New York, NY, USA), pp. 17–20, 2007.
[43] M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao, “Graph regularized transductive classification on heterogeneous information networks,” in Proceedings of ECML/PKDD, pp. 570–586, 2010.
[44] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp. 41–75, 1997.
[45] P. M. Comar, L. Liu, S. Saha, A. Nucci, and P.-N. Tan, “Weighted linear kernel with tree transformed features for malware detection,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM’2012), pp. 2287–2290, 2012.
[46] P. M. Comar, L. Liu, S. Saha, P.-N. Tan, and A. Nucci, “Combining supervised and unsupervised learning for zero-day malware detection,” in Proceedings of the 33rd IEEE International Conference on Computer Communications (INFOCOM’2013), pp. 2022–2030, 2013.
[47] L. Liu, P. M. Comar, A. Nucci, S. Saha, and P.-N. Tan, “Missing or inapplicable: Treatment of incomplete continuous-valued features in supervised learning,” in Proceedings of the SIAM International Conference on Data Mining (SDM’2013), pp. 46–54, 2013.
[48] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014.
[49] R. Ghani, “Using error-correcting codes for efficient text classification with a large number of categories,” Master’s thesis, Carnegie Mellon University, 2001.
[50] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.
[51] T. Hastie and R. Tibshirani, “Classification by pairwise coupling,” The Annals of Statistics, vol. 26, no. 2, pp. 451–471, 1998.
[52] J. Langford and A. Beygelzimer, “Sensitive error correcting output codes,” in Conference on Learning Theory, pp. 158–172, 2005.
[53] D. Hsu, S. M. Kakade, J. Langford, and T. Zhang, “Multi-label prediction via compressed sensing,” CoRR, vol. abs/0902.1284, 2009.
[54] J. Huang, T. Zhang, and D. N. Metaxas, “Learning with structured sparsity,” in Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), (Montreal, Quebec, Canada), pp. 417–424, 2009.
[55] A. Beygelzimer, J. Langford, Y. Lifshits, G. Sorkin, and A. Strehl, “Conditional probability tree estimation analysis and algorithms,” in Proceedings of UAI, pp. 51–58, 2009.
[56] T. Gao and D. Koller, “Discriminative learning of relaxed hierarchy for large-scale visual recognition,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV’2011), pp. 2072–2079, 2011.
[57] D. Kuang and H. Park, “Fast rank-2 nonnegative matrix factorization for hierarchical document clustering,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2013), pp. 739–747, 2013.
[58] P.-Y. Hao, J.-H. Chiang, and Y.-H. Lin, “A new maximal margin spherical structured multi-class support vector machine,” Applied Intelligence, 2009.
[59] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Neural Information Processing Systems, pp. 556–562, 2001.
[60] P.-Y. Hao, J.-H. Chiang, and Y.-H. Lin, “A new maximal-margin spherical-structured multi-class support vector machine,” Applied Intelligence, vol. 30, pp. 98–111, 2009.
[61] D. Chen and R. J. Plemmons, “Nonnegativity constraints in numerical analysis,” in Symposium on the Birth of Numerical Analysis (A. Bultheel and R. Cools, eds.), World Scientific Publishing, 2010.
[62] T. Ngo, M. Bellalij, and Y. Saad, “The trace ratio optimization problem,” SIAM Review, vol. 54, no. 3, pp. 545–569, 2012.
[63] W. Jiang and F. Chung, “A trace ratio maximization approach to multiple kernel-based dimensionality reduction,” Neural Networks, vol. 49, pp. 96–106, 2014.
[64] D. Boley, “Principal direction divisive partitioning,” Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 325–344, 1998.
[65] E. Tanaka and K. Tanaka, “The tree-to-tree editing problem,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 2, no. 2, pp. 221–224, 1988.
[66] T. Jiang, L. Wang, and K. Zhang, “Alignment of trees - an alternative to tree edit,” in Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching (CPM’1994), pp. 75–86, 1994.
[67] P. Mandayam Comar, L. Liu, S. Saha, A. Nucci, and P.-N. Tan, “Weighted linear kernel with tree transformed features for malware detection,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, pp. 2287–2290, 2012.
[68] U. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E, vol. 76, no. 3, p. 036106, 2007.
[69] L. Hagen and A. B. Kahng, “New spectral methods for ratio cut partitioning and clustering,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 11, no. 9, pp. 1074–1085, 1992.
[70] A. Blum and S. Chawla, “Learning from labeled and unlabeled data using graph mincuts,” in Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, (San Francisco, CA, USA), pp. 19–26, 2001.
[71] U. von Luxburg, “A tutorial on spectral clustering,” CoRR, vol. abs/0711.0189, 2007.
[72] D. Achlioptas, A. Fiat, A. Karlin, and F. McSherry, “Web search via hub synthesis,” in Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, FOCS ’01, 2001.
[73] U. Kang, C. E. Tsourakakis, and C. Faloutsos, “Pegasus: A peta-scale graph mining system implementation and observations,” in Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM ’09, pp. 229–238, 2009.
[74] U. Kang, B. Meeder, E. E. Papalexakis, and C. Faloutsos, “Heigen: Spectral analysis for billion-scale graphs,” IEEE Trans. Knowl. Data Eng., pp. 350–362, 2014.
[75] Y. Yang, “An evaluation of statistical approaches to text categorization,” Inf. Retr., vol. 1, no. 1-2, pp. 69–90, 1999.
[76] J. Abernethy, O. Chapelle, and C. Castillo, “Web spam identification through content and hyperlinks,” in Proceedings of the SIGIR Workshop on Adversarial Information Retrieval on the Web (AIRWeb’08), (Beijing, China), pp. 41–44, 2008.
[77] G. Attardi, A. Gullì, and F. Sebastiani, “Automatic web page categorization by link and context analysis,” in Proceedings of THAI’99, First European Symposium on Telematics, Hypermedia and Artificial Intelligence, (Varese, IT), pp. 105–119, 1999.
[78] D. A. Cohn and T. Hofmann, “The missing link - a probabilistic model of document content and hypertext connectivity,” in NIPS, pp. 430–436, 2000.
[79] K. Smets, B. Goethals, and B. Verdonk, “Automatic vandalism detection in wikipedia: Towards a machine learning approach,” in Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI), pp. 43–48, AAAI Press, 2008.
[80] M. Potthast, B. Stein, and R. Gerling, “Automatic vandalism detection in wikipedia,” Advances in Information Retrieval, pp. 663–668, 2008.
[81] G. Druck, G. Miklau, and A. McCallum, “Learning to predict the quality of contributions to wikipedia,” in Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI 08), pp. 7–12, 2008.
[82] L. Liu, G. Koutrika, and S. Simske, “Generating reading orders over document collections,” in Proceedings of the 31st IEEE International Conference on Data Engineering (ICDE’15), 2015.