HIERARCHICAL LEARNING FOR LARGE MULTI-CLASS CLASSIFICATION IN NETWORK DATA

By

Lei Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2015

ABSTRACT

HIERARCHICAL LEARNING FOR LARGE MULTI-CLASS CLASSIFICATION IN NETWORK DATA

By

Lei Liu

Multi-class learning from network data is an important but challenging problem with many applications, including malware detection in computer networks, user modeling in social networks, and protein function prediction in biological networks. Despite the extensive research on large multi-class learning, there are still numerous issues that have not been sufficiently addressed, such as the efficiency of model testing, the interpretability of the induced models, and the ability to handle imbalanced classes. To overcome these challenges, there has been increasing interest in recent years in developing hierarchical learning methods for large multi-class problems. Unfortunately, none of them were designed for classification of network data. In addition, there are very few studies devoted to classification of heterogeneous networks, where the nodes may have different feature sets. This thesis aims to overcome these limitations with the following contributions. First, as the number of classes in big data applications can be very large, ranging from thousands to possibly millions, two hierarchical learning schemes are proposed to deal with the so-called extreme multi-class learning problems [1]. The first approach, known as recursive non-negative matrix factorization (RNMF), is designed to achieve sublinear runtime in classifying test data. Although RNMF reduces the test time significantly, it may also assign the same class to multiple leaf nodes, which hampers the interpretability of the model as a concept hierarchy for the classes. Furthermore, since RNMF employs a greedy strategy to partition the classes, there is no theoretical guarantee that the partitions generated by the tree would lead to a globally optimal solution. To address the limitations of RNMF, an alternative hierarchical learning method known as matrix factorization tree (MF-Tree) is proposed. Unlike RNMF, MF-Tree is designed to optimize a global objective function while learning its taxonomy structure. A formal proof is provided to show the equivalence between the objective function of MF-Tree and the Hilbert-Schmidt Independence Criterion (HSIC). Furthermore, to improve its training efficiency, a fast algorithm for inducing an approximate MF-Tree is also developed. Next, an extension of MF-Tree to network data is proposed. This approach can seamlessly integrate both the link structure and node attribute information into a unified learning framework. To the best of our knowledge, this is the first study that automatically constructs a taxonomy structure to predict large multi-class problems for network classification. Empirical results suggest that the approach can effectively capture the relationship between classes and generate class taxonomy structures that resemble those produced by human experts. The approach is also easily parallelizable and has been implemented in a MapReduce framework. Finally, we introduce a network learning task known as co-classification to classify heterogeneous nodes in multiple networks.
Unlike existing node classification problems, the goal of co-classification is to learn the classifiers in multiple networks jointly, instead of learning to classify each network independently. The framework proposed in this thesis can utilize prior information about the relationship between classes in different networks to improve its prediction accuracy.

ACKNOWLEDGMENTS

The past five years of PhD study at Michigan State University have been both a painful and an enjoyable life experience, much like climbing a mountain: although every step forward was hard work, I enjoyed the journey and realized that I could not have reached the mountain top to enjoy the beautiful view without the help of the people around me. Although words are not enough to express my gratitude to everyone who helped me, I would still like to give them my most sincere thanks.

First and foremost, I would like to thank my PhD advisor, Dr. Pang-Ning Tan, who generously funded my PhD study with assistantships and internship opportunities. He offered me so much advice, supervised me patiently, and always guided me toward the right and meaningful research directions. He has also provided many insightful discussions through our weekly individual meetings, group studies, and the data mining class. His guidance in both research and career has been priceless. Without his support and help, it is hard to imagine that I could have finished this thesis. Beyond research, his hard-working nature and easy-going personality have inspired me greatly, and I believe they will continue to benefit me throughout the rest of my life's journey. Besides my advisor, I would like to thank the rest of my PhD committee members, Prof. Rong Jin, Prof. Joyce Chai, and Dr. Selin Aviyente, for their support and guidance throughout my PhD program. I would like to specially thank Dr. Jin for shaping my thinking during the Machine Learning class in Spring 2010. I also want to thank Prof. Joyce Chai and Dr. Selin Aviyente for providing brilliant comments and suggestions on my thesis. It was a great honor to have worked with each of my PhD committee members, and I sincerely thank them for all their help and guidance. I look forward to collaborating with each of my committee members in the future. I want to thank present and past labmates in the Michigan State University LINKS Lab: Prakash Mandayam Comar, Jianpeng Xu, Zubin Abraham, Courtland VanDam, Xi Liu, Shuai Yuan, and Ding Wang. Special thanks to Prakash Mandayam Comar for keeping my lab and internship hours interesting and for insightful discussions on various research projects. My special thanks to our graduate director, Eric Torng, for his tremendous help and support during my time at MSU. I would like to thank our department secretaries, Linda Moore and Norma Teague, for all the administrative and conference travel work. I also want to give my sincere thanks to all my friends for keeping me happy at MSU. Last but not least, special thanks to my family; words cannot express how grateful I am to my parents and sister. My hard-working parents sacrificed much to support me and my sister through our PhD studies. Without their support, I would not have had the chance to study abroad and complete this thesis. Finally, I would like to express my appreciation to my beloved wife, Sang Ni, who has brought me joy and always cheered me up throughout my PhD study.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1 Introduction
  1.1 Applications of Multi-class Network Learning
    1.1.1 Prediction of User Interest in Social Networks
    1.1.2 Classification in Bibliographic Networks
    1.1.3 Detecting Malware in Computer Network
    1.1.4 Function Prediction in Protein Interaction Network
  1.2 Challenges of Multi-class Network Learning
    1.2.1 Extreme Multi-Class Network Learning
      1.2.1.1 Training Accuracy
      1.2.1.2 Testing Efficiency
      1.2.1.3 Imbalanced Class Distributions
      1.2.1.4 Evolving Classes
      1.2.1.5 Model Interpretability
    1.2.2 Classification on Heterogeneous Networks
  1.3 Thesis Contributions
  1.4 Thesis Outline

Chapter 2 Background & Related Work
  2.1 Problem Definition
  2.2 Node Classification
  2.3 Classification on Heterogeneous Networks
  2.4 Multi-class Learning
    2.4.1 Non-Hierarchical Methods
    2.4.2 Hierarchical Methods
  2.5 Taxonomy Learning
  2.6 Summary

Chapter 3 Recursive NMF for Extreme Multi-Class Learning
  3.1 Label Tree Structure for RNMF
  3.2 RNMF Algorithm for Label Tree Construction
  3.3 Experimental Evaluation
  3.4 New Class Detection
  3.5 Conclusion

Chapter 4 Matrix Factorization Tree for Extreme Multi-Class Learning
  4.1 Preliminaries
    4.1.1 Problem Definition
    4.1.2 Hilbert-Schmidt Independence Criterion (HSIC)
    4.1.3 Tree-Structured Covariance Matrix
  4.2 MF-Tree: Proposed Method
    4.2.1 Global Objective Function of MF-Tree
    4.2.2 Relationship between Global Objective Function and HSIC
    4.2.3 Additive Property of HSIC
    4.2.4 MF-Tree Construction
    4.2.5 Optimization
    4.2.6 MF-Tree Induction Algorithm
  4.3 Approximate MF-Tree
  4.4 Experimental Evaluation
    4.4.1 Experimental Setup
    4.4.2 Experimental Results
      4.4.2.1 Overall Results Comparison
      4.4.2.2 Effect of Parameter Tuning
    4.4.3 Comparison of Taxonomy Structure
    4.4.4 Results for Approximate MF-Tree
  4.5 Conclusion

Chapter 5 Hierarchical Tree Learning for Large Multi-class Network
  5.1 Methodology
    5.1.1 Global Objective Function
    5.1.2 Relationship between Proposed Method and HSIC
    5.1.3 Additive Property of HSIC
    5.1.4 Hierarchical Tree Construction from Network Data
    5.1.5 Parallelization of Hierarchical Tree Learning
  5.2 Experimental Evaluation
    5.2.1 Network Data
      5.2.1.1 U.S. Patent Data
      5.2.1.2 Wikipedia Data
    5.2.2 Evaluation Metric
      5.2.2.1 Micro-average F1
      5.2.2.2 Macro-average F1
    5.2.3 Experimental Results
      5.2.3.1 Results on U.S. Patent Data
      5.2.3.2 Results on Wikipedia Data 1
      5.2.3.3 Parallel Results on Wikipedia Data 2
  5.3 Conclusion

Chapter 6 Co-classification of Heterogeneous Networks
  6.1 Preliminaries
  6.2 Methodology
    6.2.1 Independent Classification of Wikipedia Users and Articles
    6.2.2 Co-Classification of Users and Articles
  6.3 Experimental Evaluation
    6.3.1 Data Set
  6.4 Conclusion

Chapter 7 Future Work
  7.1 Joint Label Tree Construction and Link Prediction for Network with Large Number of Classes
  7.2 Classification of Documents into Reading Positions

REFERENCES
LIST OF TABLES

Table 3.1  Comparison between RNMF and baseline algorithms on Wiki1 data set.
Table 3.2  Comparison between RNMF and baseline algorithms on Wiki2 data set.
Table 3.3  Average F1-score of the new class when applied to the Wiki1 data set.
Table 4.1  Summary of data used for our experiments.
Table 4.2  Experimental Results.
Table 4.3  Approximate MF-Tree on Wiki2.
Table 5.1  22 General Categories of U.S. Patent Data.
Table 5.2  Micro-average F1 Comparison on U.S. Patent Data.
Table 5.3  Macro-average F1 Comparison on U.S. Patent Data.
Table 5.4  Tree Depth Comparison on U.S. Patent Data.
Table 5.5  Performance Comparison on Wikipedia Data 1.
Table 5.6  Results on Wikipedia Data 2 using the parallel HTL method.
Table 6.1  Prior category relation matrix Ep(ku, ka).
Table 6.2  Category relation matrix E′t from training data.
Table 6.3  Four categories average results.

LIST OF FIGURES

Figure 1.1  Fraction of neighboring nodes that belong to the same class as the number of classes is varied.
Figure 3.1  Proposed label tree structure using the RNMF algorithm.
Figure 3.2  Average F1 score for RNMF as the branching factor (p) and maximum depth (d) are varied on Wiki1 data set.
Figure 3.3  Test time for RNMF as the branching factor (p) and maximum depth (d) are varied on Wiki1 data set.
Figure 3.4  Example of DAGSVM Structure for 5 classes.
Figure 3.5  New Class Detection Performance for individual Class on Wiki1.
Figure 4.1  Example Tree Structure with 8 classes.
Figure 4.2  Performance Comparison on Caltech1 Data.
Figure 4.3  Performance Comparison on Caltech2 Data.
Figure 4.4  Performance Comparison on Wiki1 Data.
Figure 4.5  Performance Comparison on Wiki2 Data.
Figure 4.6  MSE comparison of the tree produced by MF-Tree against the one produced by CMTL.
Figure 4.7  Example of Mapping Tree Structure to Adjacency Matrix.
Figure 4.8  Illustration of tree structure generated.
Figure 5.1  Class Distribution of U.S. Patent and Wikipedia Data 1.
Figure 5.2  Confusion Matrix.
Figure 5.3  F1 Score for each individual class in the U.S. Patent Data (where the classes are sorted in increasing class size).
Figure 5.4  A snippet of the class hierarchy structure (up to depth 4) generated by the proposed method for the 22 classes of the U.S. Patent Data.
Figure 5.5  Testing Efficiency Comparison on U.S. Patent Data.
Figure 5.6  F1 Score for each individual class in Wikipedia Data 1 (where the classes are sorted in increasing class size).
Figure 6.1  A Sample of User-Article network.
Figure 6.2  Comparison of F1 measure for each Article category between LS-SVM and Co-LS-SVM algorithms.
Figure 6.3  Comparison of F1 measure for each User category between LS-SVM and Co-LS-SVM algorithms.

LIST OF ALGORITHMS

Algorithm 1  Recursive NMF at a node v
Algorithm 2  MF-Tree Construction
Algorithm 3  Testing with MF-Tree
Algorithm 4  Approximate MF-Tree
Algorithm 5  Hierarchical Tree Construction from Network Data
Algorithm 6  Testing with Hierarchical Tree
Algorithm 7  Co-Classification Algorithm

Chapter 1 Introduction

The rapid proliferation of network data in various domains such as computer science [2], computational biology [3], and social science [4] has attracted considerable interest in applying data mining techniques to study complex networks. Examples of network mining tasks include link prediction [5], community detection [6], and influence propagation [7]. The focus of this thesis is on predicting the class of each node in a network, a problem that is known as node classification [5], link-based classification [8], or collective classification [9] in the literature. Node classification in a network differs from traditional classification, which assumes that the objects to be classified are independent and identically distributed (i.i.d.). The i.i.d. assumption is inappropriate for network data because the class of a node depends not only on its attributes, but also on its relationship to other nodes in the network. Furthermore, the number of classes in a network can be extremely large, ranging from thousands to millions, a problem that is known as extreme multi-class learning [1]. Although there has been considerable research on extreme multi-class learning for i.i.d. data, none of the existing methods were specifically designed for network data. We refer to the latter setting as multi-class network learning in this thesis.

1.1 Applications of Multi-class Network Learning

Below are four potential applications of multi-class learning in network data.

1.1.1 Prediction of User Interest in Social Networks

The growth of online social media has brought together billions of users who are actively engaged in daily conversations and information sharing. The widespread use of such websites is a marketing dream for advertisers who seek to predict the interest of users based on their demographic profiles and friendship relations. Given the wide variety of user interests, this is a clear example of an extreme multi-class network learning problem.

1.1.2 Classification in Bibliographic Networks

With the massive number of scientific papers written in different research disciplines, finding relevant publications from the literature has become an increasingly challenging problem. An information system that can effectively organize the research papers is needed. Such a system must be capable of automatically classifying the millions of papers published annually into their respective categories based on the keywords used and the citations among the papers.
1.1.3 Detecting Malware in Computer Network

Malicious software (or malware) automatically spreads from one device to another by covertly executing a software program without authorization from users. The infected devices can be used to steal sensitive information, send spam messages, launch denial-of-service attacks, and perform other malicious activities. Although many intrusion detection and prevention systems have been developed to monitor such activities, various techniques can be employed by hackers to evade detection. Despite such evasion techniques, one unavoidable aspect that is difficult to camouflage is that the infected devices must communicate back to their command and control servers. This has led to considerable interest in applying node classification techniques to detect infected devices in a networked environment. As there are many variants of computer viruses available, malware detection is another example of an extreme multi-class network learning problem. In a previous study, I developed a two-phase score propagation algorithm to predict the class label of each client IP in an HTTP communication network [2]. By jointly utilizing the IP-address connectivity graph ("who talks to whom") and flow-level information, the proposed technique achieves a precision rate of about 90% when applied to a subnet of a large internet service provider.

1.1.4 Function Prediction in Protein Interaction Network

Proteins are the most essential macromolecules of life. Knowing the functionality of proteins is critical for developing new drugs, better crops, and biofuels [10]. Since a protein usually interacts with other proteins in the biological system, one can model the relationship between proteins as an interaction network, where the nodes represent proteins and the links represent evidence of a shared function. Protein function prediction can thus be cast into a multi-class network learning problem.

1.2 Challenges of Multi-class Network Learning

This section presents some of the challenges in multi-class network learning that motivate the research directions of this thesis.

[Figure 1.1: Fraction of neighboring nodes that belong to the same class as the number of classes is varied (y-axis: fraction of neighboring nodes in the same class; x-axis: number of classes).]

1.2.1 Extreme Multi-Class Network Learning

Networks with a large number of classes have become increasingly prevalent in recent years. For example, Wikipedia contains roughly 2.4 million articles from more than 600,000 classes. Building a classifier that can handle such a massive number of classes is challenging for the following reasons.

1.2.1.1 Training Accuracy

Learning a global classifier that accurately predicts all the classes is a challenging problem. Such a classifier must be extremely flexible and capable of fitting complex nonlinear decision surfaces, which in turn makes it susceptible to the model overfitting problem. Furthermore, as the number of classes increases, the standard assumption that "nearby nodes in a network have similar classes" no longer holds, as illustrated in Figure 1.1. The figure was generated from the US Patent network data, where each node corresponds to a patent and each link corresponds to a citation between patents. The patents are assigned to classes whose relationships are represented in a hierarchical structure. The number of classes can be varied by assigning the patents to classes at a particular depth in the hierarchy. The results suggest that the percentage of neighboring nodes that belong to the same class decreases from 56% to 4.6% as the number of classes increases from 22 to 140,116.
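The neighborhood class-agreement statistic plotted in Figure 1.1 can be computed directly from an edge list and a node-to-class assignment. The following is a minimal sketch of that measurement; the function name and the toy edge list are hypothetical placeholders, not taken from the US Patent data.

```python
# Minimal sketch: fraction of linked node pairs that share the same class label.
# The edge list and label dictionary below are hypothetical placeholders.

def same_class_fraction(edges, labels):
    """edges: iterable of (u, v) pairs; labels: dict mapping node -> class id."""
    same = total = 0
    for u, v in edges:
        if u in labels and v in labels:
            total += 1
            same += int(labels[u] == labels[v])
    return same / total if total else 0.0

if __name__ == "__main__":
    toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]      # citation links
    toy_labels = {1: "A", 2: "A", 3: "B", 4: "B"}     # class at a chosen hierarchy depth
    print(same_class_fraction(toy_edges, toy_labels)) # 0.5 for this toy example
```

Repeating this computation with labels taken from different depths of the class hierarchy yields the kind of curve shown in Figure 1.1.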
1.2.1.2 Testing Efficiency

Many traditional classifiers, such as logistic regression, support vector machines, and boosting, are primarily designed for binary class problems. A popular strategy to extend these methods to multi-class classification is to transform the data into multiple binary classification problems, using techniques such as one-versus-one and one-versus-all. In practice, these techniques are not scalable for handling extreme multi-class learning problems. For example, with the one-versus-one strategy, one has to apply O(k^2) classifiers to predict the label of each test example, where k is the number of classes. The one-versus-all approach requires only O(k) classifiers for testing, which is still expensive when the number of classes is extremely large. How to reduce the testing cost to sublinear time while maintaining good classification accuracy is one of the major challenges that must be addressed.

1.2.1.3 Imbalanced Class Distributions

Most classification methods work well when there are representative examples from each class in the training data. For imbalanced class problems, standard classifiers tend to be biased toward accurate prediction of the larger classes instead of the rare classes, which are often more interesting. Unfortunately, network data with imbalanced classes is very common in the real world. For example, the largest category in the U.S. Patent citation network data contains 25,727 patents, while the smallest category has only 8 patents. How to deal with highly imbalanced classes in network classification is another challenge that must be addressed.

1.2.1.4 Evolving Classes

Extreme multi-class learning problems typically arise in application domains where new classes are continuously generated as more data is collected (e.g., in malware detection or image classification). Unfortunately, most of the existing multi-class learning methods would assign their test examples to one of the existing classes in the training set. Developing multi-class learning methods capable of detecting new classes remains an open research problem.

1.2.1.5 Model Interpretability

Classification methods are often evaluated in terms of their model accuracies. For some applications, model interpretability is another important criterion, where it is insufficient to generate accurate models without providing an explanation of how the inference was derived. This is typically achieved by using sparse learning methods [11, 12] to identify the most relevant features characterizing the different classes. For extreme multi-class problems, many of the classes may not be independent of each other. In this case, another useful model interpretability criterion would be the model's ability to convey information about the relationships among the large number of classes. How to design an accurate classifier that produces such information is another key challenge that must be addressed.

1.2.2 Classification on Heterogeneous Networks

Most existing network classification tasks have focused on developing algorithms for a single network. The drawback of this approach is that information from a single network may not be sufficient to build an accurate classifier, especially when the network has noisy attributes and incomplete link structure.
This has led to the following question: Can we improve the performance of a classifier by integrating data from other related networks? A previous study on spam detection in social media by Chen et al. [13] has shown that, instead of categorizing the social media postings alone (as spam or non-spam), the detection rate can be improved using information about the users who generated the posts. Another example is in collective classification for the Wikipedia editor network [14], where the classification accuracy was found to improve when the data was augmented with the Wikipedia article network. Both of these applications are examples of heterogeneous network classification, where the nodes in the network have different types (e.g., humans and Web pages), each with their own set of attributes. The links of the networks also have different types, which makes the problem more challenging since each link type carries a varying degree of information about the class similarity. Designing a technique that can handle different types of attributes and links is a challenge that must be addressed in heterogeneous network classification.

1.3 Thesis Contributions

This section summarizes the thesis contributions in terms of addressing the challenges due to the extreme multi-class and heterogeneous network learning problems described in the previous section. First, two novel hierarchical learning methods were developed for extreme multi-class classification. Hierarchical learning [15, 16, 17, 18, 19] is an effective strategy for dealing with large multi-class problems for several reasons. Not only does it yield a hierarchical structure that captures the relationships among the classes, it also helps to reduce testing time and alleviates the class imbalance problem by allowing the rare classes to be combined with examples from other classes when building the classification model. Although there are some domains where the subclass-superclass relationships are known a priori, e.g., Wikipedia categorization, the class hierarchy is typically unavailable or incomplete. As a result, there has been growing interest in recent years in developing classification methods that automatically induce a hierarchy from data. For example, Platt et al. [15] employ a directed acyclic graph (DAG) to combine multiple binary classifiers into a multi-class classifier. Some methods, such as the filter tree [17], impose an arbitrary tree structure on the classes and then learn a classifier to partition the training examples to follow their appropriate paths until they reach their assigned leaf nodes. Since the tree is arbitrarily constructed, it does not convey meaningful information about the relationships among the classes. Other methods, such as top-down decision trees, employ a greedy strategy to split the data into smaller subsets [20]. The models produced by such methods are also difficult to interpret because a class may be assigned to multiple leaf nodes in the hierarchy. This thesis presents two novel hierarchical learning methods to address the challenges of extreme multi-class learning problems. The first method, called recursive non-negative matrix factorization (RNMF, in short), was designed to speed up the computation time for model testing. RNMF builds a structure known as a label tree by performing a soft partitioning of the classes in a way that minimizes an information-theoretic loss function. A unique aspect of the proposed approach is that RNMF employs multiple one-class SVM models at its leaf nodes.
This allows the classifier to detect new classes that are not present in its training data, a unique option that is not available in any existing hierarchical learning method. By allowing the leaf nodes to contain more than one SVM model, this also helps to shorten the depth of the tree without significantly degrading its accuracy. The tree also allows the model to be applied in sublinear time. One major limitation of RNMF is that there is no global objective function optimized by its tree-growing procedure. Instead, RNMF employs a greedy, divide-and-conquer approach to partition the classes based on a local objective function when constructing its tree. In addition, it allows a class to be split and assigned to multiple leaf nodes, which limits its interpretability as a class hierarchy. To address this issue, this thesis develops a novel method known as matrix factorization tree (MF-Tree, in short), which organizes the classes into a binary tree by solving a global objective function. A theoretical proof is given to demonstrate the equivalence between the objective function and a recently proposed criterion for taxonomy learning called the Hilbert-Schmidt Independence Criterion (HSIC) [21, 22, 23, 24]. This thesis demonstrates how HSIC can be adapted to a supervised multi-class learning setting and shows the additive property of the function, which allows us to recursively decompose the optimization problem into multiple subproblems that can be organized into a binary tree structure. Despite these advantages, solving the subproblems at the top levels of the tree can be computationally prohibitive, especially when the numbers of training examples and classes are extremely large. To address the scalability issue, an approximate MF-Tree induction algorithm is proposed, where the class partitions at the higher levels of the tree are determined using a simpler approach instead of solving the more expensive matrix factorization problem. Experimental results suggest that the accuracy of the approximate MF-Tree approach is slightly lower than that of the original MF-Tree, but it is between 20% and 40% faster (depending on how far down the tree the approximation scheme is adopted). Building a hierarchical learning method for network data is challenging because the tree may partition the network into a large number of disjoint components. As the network becomes more disconnected, the link information is no longer useful for node classification. Another contribution of this thesis is to extend the MF-Tree approach to handle network data. The proposed framework incorporates both the node attribute similarity and the link connectivity into the matrix factorization framework. Although alternative network classification methods such as label propagation are available, these methods are ill-suited for extreme multi-class learning problems since the neighborhood assumption no longer holds for many of the nodes, as shown in Figure 1.1. The proposed MF-Tree framework for network data is also trivially parallelizable. A MapReduce implementation of the framework was developed and tested on a large-scale Wikipedia network data set that contains more than 600K classes. To the best of my knowledge, this is the data set with the largest number of classes ever reported in the literature. The final contribution of this thesis is the development of a novel multi-class learning algorithm for heterogeneous networks.
The approach jointly classifies a pair of networks that have different types of attributes and links but share similar characteristics in terms of their class information. The effectiveness of the proposed co-classification method was demonstrated using the Wikipedia article and editor network data. Empirical results showed that the proposed multi-class co-classification method is more effective than learning to classify the multiple classes in each network independently.

1.4 Thesis Outline

The remainder of this thesis is organized as follows. Chapter 2 presents the background and related work of this thesis. Chapter 3 introduces the recursive non-negative matrix factorization (RNMF) approach for building a label tree in extreme multi-class learning problems [20]. Chapter 4 presents the matrix factorization tree (MF-Tree) approach and provides a theoretical proof to demonstrate the equivalence between the proposed objective function and the Hilbert-Schmidt Independence Criterion [21, 22, 23, 24]. The MF-Tree approach is extended to network data in Chapter 5. Chapter 6 presents a novel multi-class co-classification framework for heterogeneous network data [14]. Finally, Chapter 7 summarizes the contributions of the thesis and presents the future directions of this research.

Chapter 2 Background & Related Work

This chapter introduces the research problem to be addressed in this thesis and presents a literature review on node classification, multi-class learning, and taxonomy learning.

2.1 Problem Definition

Consider a network, N, which is a graph of interconnected nodes. Each node represents an entity that is characterized by a set of attributes (or features). Each node is also assumed to have a class label. Assuming the labels are known for a subset of nodes in the network, the node classification task is to assign a label to each of the remaining unlabeled nodes in the network. More formally, the problem can be defined as follows:

Definition 1 (Node Classification) Let X ⊆ ℜ^d be the set of attribute values and Y = {1, 2, ..., c} be the set of discrete-valued class labels. Consider a network data set N = (V, E, D), where V = {v_1, v_2, ..., v_n} is the set of nodes, E ⊆ V × V is the set of links, and D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} ⊆ (X × Y)^n. Let V_l ⊂ V be the set of nodes whose labels are known and V_u = V − V_l be the set of unlabeled nodes. The goal of node classification is to infer a function f : X → Y that accurately assigns each node to its corresponding class label.

When the number of classes c > 2, this is also known as a multi-class network learning problem. We will use the terms multi-class node classification and multi-class network learning interchangeably throughout this thesis. As the number of classes considered in this thesis can be very large, it would be useful to derive a hierarchical structure that captures the relationships among the classes.

Definition 2 (Hierarchical Multi-class Network Learning) Let Z = {1, 2, ..., p} be the set of possible partitions of a class. Given a network data set G = (V, E, D), where D ⊆ (X × Y)^n, the goal of hierarchical multi-class network learning is to infer a set of classification functions Φ = {f | f : X → Z}, which are organized in a hierarchical structure induced from the given data set D.
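To make the node classification setting of Definition 1 concrete, the sketch below represents a partially labeled attributed network using plain Python containers. The class and field names are illustrative assumptions, not notation used elsewhere in the thesis.

```python
# Minimal sketch of the node classification setting in Definition 1.
# Names (NetworkData, features, labels) are illustrative, not from the thesis.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

import numpy as np

@dataclass
class NetworkData:
    nodes: List[int]                          # V = {v_1, ..., v_n}
    edges: List[Tuple[int, int]]              # E, a subset of V x V
    features: Dict[int, np.ndarray]           # x_i in R^d for each node
    labels: Dict[int, Optional[int]] = field(default_factory=dict)  # y_i in {1,...,c}, None if unknown

    def labeled_nodes(self) -> List[int]:     # V_l
        return [v for v in self.nodes if self.labels.get(v) is not None]

    def unlabeled_nodes(self) -> List[int]:   # V_u = V - V_l
        return [v for v in self.nodes if self.labels.get(v) is None]
```

The node classification task then amounts to learning a function that fills in the missing label entries from the node features, the link structure, and the labeled subset.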
The hierarchical structure is often represented in the form of a directed acyclic graph or a rooted tree. This thesis focuses on a special type of hierarchical structure known in the literature as a label tree [18].

Definition 3 (Label Tree) A label tree [18] is a hierarchical structure T = ⟨V, E, Φ, Λ⟩, where V is the set of nodes, E ⊂ V × V is the set of directed links connecting each node to its children, Φ is the set of classification functions, and Λ is the set of class labels. Let v_0 ∈ V denote the root node, which has no incoming links, and V_leaf ⊂ V denote the leaf nodes, which have no outgoing links. Furthermore, V_int = V − V_leaf are the internal nodes of the tree. Each internal node is associated with a classification function f ∈ Φ, while each leaf node is associated with a class label y ∈ Λ. If (v_i, v_j) ∈ E, then v_i is a superclass of v_j (or equivalently, v_j is a subclass of v_i).

2.2 Node Classification

Node classification in network data [25, 26, 27, 8] is an active research problem that has received considerable attention from both industry and academia. The problem is also known as collective classification in the network mining literature [28]. Previous methods developed for node classification can be divided into three categories: attribute-based, link-based, and hybrid methods. Attribute-based methods employ a set of attributes that characterize each node to learn their classification functions. The attributes are either explicit properties of the nodes themselves or derived from their neighborhood information. One of the earliest examples of using attribute-based methods is in Web page categorization, where the goal was to automatically classify Web pages based on the document content, including their HTML tags [29]. Furnkranz [30] also investigated the use of anchor text, along with the words near the anchor text, to predict the class label of web pages. Attribute-based methods are also popular for node classification in security-related applications. For example, Ma et al. [31] combined lexical and host-level features to build statistical models for classifying malicious URLs, whereas Le et al. [32] relied on lexical features to detect phishing Web sites. One practical advantage of using attribute-based methods is that we can apply existing classification techniques, such as support vector machines (SVM), decision tree classifiers, and logistic regression, to the node attributes to determine their classes. The drawback of these methods is that the link structure is not explicitly included in the learning scheme. Link-based methods, on the other hand, incorporate the link structure directly into the modeling approach. For example, the HITS algorithm [33] performs web page classification by exploiting the hyperlink topology. PageRank-based link analysis methods [34] have also been widely used in web page classification and for malicious node detection [35]. Graph-based semi-supervised methods [36, 37, 38, 35] are another popular family of link-based methods for node classification, using partially labeled data to label the remaining nodes. These methods are designed to optimize the following two criteria:

• The predicted class of labeled nodes in the network should agree with their true classes.
• Two nodes that are connected should share similar labels.

Zhu and Ghahramani [37] introduced the label propagation approach to transmit the labels from an initial set of seed nodes to other nodes in the graph. A link-based method has also been proposed for spam detection in [38], where an initial set of seed nodes were labeled as benign (non-malicious) and their labels were propagated to the rest of the graph until convergence. Nodes with low (benign) scores are then predicted as spam. A similar approach was developed in [38] to propagate "malicious" scores from known malicious nodes to other nodes in the entire graph.
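As an illustration of this family of methods, the following is a minimal sketch of iterative label propagation in the spirit of Zhu and Ghahramani [37]. It is not their exact formulation; the variable names, the symmetric-adjacency assumption, and the fixed iteration count are simplifications.

```python
import numpy as np

def propagate_labels(A, Y0, labeled_mask, n_iter=100):
    """Simple label propagation sketch.

    A            : (n, n) symmetric adjacency matrix of the network
    Y0           : (n, c) initial label matrix; one-hot rows for labeled nodes, zeros otherwise
    labeled_mask : (n,) boolean array marking the seed (labeled) nodes
    """
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1e-12)          # row-normalized transition matrix
    Y = Y0.astype(float).copy()
    for _ in range(n_iter):
        Y = P @ Y                           # each node averages its neighbors' label scores
        Y[labeled_mask] = Y0[labeled_mask]  # clamp the seed nodes to their known labels
    return Y.argmax(axis=1)                 # predicted class per node
```

Thresholding or taking the argmax of the propagated scores corresponds to the final step described above, where nodes with low benign scores are flagged as spam.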
Finally, there have been extensive studies on hybrid methods that combine attribute- and link-based information to improve node classification. For example, Yang et al. [39] showed that extracting metadata from related web sites is useful for improving the accuracy of hypertext classification. Zhu et al. [40] developed a matrix factorization algorithm that exploits both the content and the linkage information to address the Web page classification problem. There has also been quite extensive work on using graph regularization methods to combine link structure with content information. For example, Zhang et al. [25] developed a risk minimization formulation for learning from both text and graph structures. In [41], Lu and Getoor designed a structured logistic regression model to capture both content and link information. The authors of [27] introduced a matrix-alignment-based approach to integrate the node attributes and link information of a network, which weights the attributes and links according to their influence on the classification decision. Gan and Suel [42] proposed a two-stage approach to improve a Web spam classifier's performance using the link structure.

2.3 Classification on Heterogeneous Networks

Although node classification is a well-studied problem, there has been very little research focusing on the classification of nodes in heterogeneous networks. The key assumption in heterogeneous network classification is that there is useful information in other related networks that may aid in the classification of nodes in a given network. For example, consider the problem of classifying Wikipedia articles. Although it is possible to develop attribute-based, link-based, and hybrid methods for classifying the articles based on the Wikipedia article network alone, there are other related networks, e.g., the Wikipedia editor network, that could be used to enhance the classification accuracy. Current research on node classification in heterogeneous networks can be classified into two categories. The first category learns a unified classification function for the combined networks. For example, Ji et al. [43] employed a link-based approach to classify nodes in heterogeneous networks using a graph-based algorithm similar to [36]. Using network data from a bibliographic network (DBLP), the authors showed that classification accuracy can be significantly improved if we incorporate data from the entire heterogeneous information network. The second category learns a separate classification function for each underlying network but trains the multiple functions simultaneously. This strategy is equivalent to a multi-task learning approach for solving multiple related prediction problems [44]. For example, Chen et al. [13] presented a co-classification approach to simultaneously classify spammers in a user network and spam in a Web page graph. Their approach was inspired by the fact that the Web spam and spammer detection tasks are intuitively related, as spam content is more likely to be contributed by spammers than by non-spammers. Nevertheless, the method proposed in [13] is for binary class problems only. One of the contributions of this thesis is to extend the co-classification framework to multi-class network learning problems.
The framework proposed in this thesis also allows the incorporation of side information (in the form of a class prior matrix) to bias the learning algorithm.

2.4 Multi-class Learning

Classification with a large number of classes is an important but challenging problem, and it has received increasing attention in recent years [45, 46, 47, 19, 48, 49]. Existing methods typically employ a collection of binary classifiers, organized either in a flat (non-hierarchical) or a hierarchical (tree) structure.

2.4.1 Non-Hierarchical Methods

Traditional non-hierarchical learning methods, including one-versus-all [50], one-versus-one [51], and error-correcting output coding (ECOC) [52], require invocation of all the binary classifiers during test time in order to make their final prediction. Both one-versus-one and one-versus-all methods are inefficient when the number of classes is large, due to the quadratic and linear number of binary classifiers, respectively, that must be invoked during testing. The ECOC method assigns a unique m-bit codeword to each class label (where m > log_2 c) and trains a binary classifier to predict each bit of the codeword. Given a test instance, we must generate an m-bit vector by applying the m binary classifiers and then compute its distance to the codewords of all c classes, which is an expensive operation, to determine its predicted class. Compressed-sensing-based ECOC methods [53, 54] have been proposed as an alternative. Such methods exploit the sparsity of the output label space to ease the burden of large-scale multi-label learning. The authors concluded that, by incorporating compressed sensing into ECOC, the number of subproblems to be solved is logarithmic in the number of classes, thereby reducing the number of binary classifiers to be invoked during testing. Nonetheless, these methods are non-hierarchical as they do not generate a concept hierarchy to relate the classes. They also do not consider issues such as new class detection and the class imbalance problem, which are typically encountered in extreme multi-class learning problems.

2.4.2 Hierarchical Methods

Hierarchical methods [15, 16, 17, 18, 19] help to reduce the number of classifiers that need to be invoked during testing by organizing the classifiers in a rooted, hierarchical structure. Their runtime complexity for testing depends on the maximum depth of the hierarchy. Existing hierarchical methods can be divided into two categories, depending on how the hierarchical structure is created. In the first category, a hierarchical structure is arbitrarily imposed on the classes. Each intermediate node in the hierarchy contains a binary classifier, which is trained to partition the classes in a way that is consistent with the imposed structure. One example is the filter tree method developed by Beygelzimer et al. [17], which learns the tree structure in a bottom-up fashion. This method was adapted in [55] to build a conditional probability tree for the labels. Another example is the Decision Directed Acyclic Graph (DDAG) [15] approach, which arranges a set of one-versus-one classifiers into a rooted binary DAG in order to reduce the test complexity from O(c(c − 1)/2) to O(c). In all three cases, there is no hierarchical learning involved when the models are trained. In the second category, the hierarchical structure is automatically inferred when building the multi-class models. For example, Bengio et al. [18] proposed a label embedding tree approach that learns c one-vs-all classifiers to obtain an initial confusion matrix, which is subsequently used as the affinity matrix for applying spectral clustering to partition the classes into smaller subgroups.
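The class-partitioning step of such confusion-matrix-driven approaches can be sketched as follows. This is a simplified illustration of the general idea rather than the exact procedure of [18]; it assumes scikit-learn's SpectralClustering with a precomputed affinity, and the toy confusion matrix is hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def partition_classes(confusion, n_groups=2):
    """Group classes that are frequently confused with one another.

    confusion : (c, c) confusion matrix from an initial one-vs-all classifier,
                where entry (i, j) counts class-i examples predicted as class j.
    """
    # Symmetrize the counts so they can serve as an affinity matrix between classes.
    affinity = 0.5 * (confusion + confusion.T)
    np.fill_diagonal(affinity, 0.0)
    model = SpectralClustering(n_clusters=n_groups, affinity="precomputed")
    return model.fit_predict(affinity)       # cluster id for each of the c classes

# Example: classes 0-1 are mutually confused, as are classes 2-3, so the two
# pairs end up in separate groups that become children of the root.
toy_confusion = np.array([[50,  9,  1,  0],
                          [ 8, 51,  0,  1],
                          [ 1,  0, 49, 10],
                          [ 0,  2,  9, 49]], dtype=float)
print(partition_classes(toy_confusion, n_groups=2))
```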
Deng et al. [19] presented an approach that simultaneously learns the class partitioning and the weights of the linear classifiers by optimizing a joint objective function. However, the approach requires solving an integer programming problem, which is NP-hard. Instead, they relaxed the optimization problem into a linear program in order to find a polynomial-time solution. Shi et al. [56] also introduced a relaxed hierarchical learning approach for large-scale visual recognition. At each node, the classes are colored as positive or negative by a binary classifier while a subset of the confusing classes is ignored. The coloring of the classes and the training of the binary classifiers are performed simultaneously using a principled max-margin optimization method.

2.5 Taxonomy Learning

Taxonomy learning, which aims to encode the conceptual relationships among the classes using an unsupervised learning approach, is another area related to this thesis. A classical method for taxonomy learning is to use agglomerative hierarchical clustering methods, such as single-link and complete-link, to derive the hierarchical relationships. Another approach is to construct the taxonomy with a divisive (top-down) partitioning approach. For example, Kuang et al. [57] developed a hierarchical clustering method that recursively partitions the data into two subsets using a rank-2 nonnegative matrix factorization approach. Their proposed method also performs outlier detection to avoid generating too many small leaf nodes. Although the method resembles the matrix factorization based methods developed in this thesis, it is unsupervised and there is no global objective function solved by their algorithm. The Hilbert-Schmidt Independence Criterion (HSIC) is another recently proposed criterion for unsupervised taxonomy learning [21, 22, 23, 24]. Inspired by this metric, this thesis demonstrates how HSIC can be applied to a supervised multi-class network learning problem.

2.6 Summary

This chapter presented a formal definition of the multi-class network learning problem and reviewed some of the previous work related to this research. Although there have been quite extensive studies on node classification in networks, they have mostly focused on binary classification problems and were applied to a single network. The primary goal of this thesis is to address these limitations and expand the use of hierarchical multi-class learning methods to network data.

Chapter 3 Recursive NMF for Extreme Multi-Class Learning

As noted in Chapter 1, extreme multi-class learning is a challenging problem due to issues such as testing complexity, imbalanced class distributions, the presence of new classes in the test set, and the interpretability of the models. To overcome these challenges, this chapter describes a novel label tree learning algorithm called recursive non-negative matrix factorization (RNMF) for extreme multi-class learning problems. Key advantages of RNMF include:

1. Its testing complexity is sublinear in the number of classes.
2. It can detect the presence of new classes in the test data.
RNMF is a label tree construction algorithm that recursively partitions the training data into smaller subsets based on their classes and attribute values until a stopping criterion is satisfied. At each recursive step, it learns a classifier to predict to which partition a given test example should be assigned. RNMF allows for soft partitioning of the classes by minimizing an information-theoretic loss function, which can be cast into a regularized non-negative matrix factorization problem. The regularization term introduced by our framework is unique in that it minimizes the entropy of the partitions. Unlike other label tree learning methods, each leaf node at the bottom of the tree contains multiple one-class SVM models to distinguish instances from different classes. Experimental results suggest that the induced tree achieves accuracy comparable to more complex tree learning methods but is considerably shorter, thus reducing its overall testing time.

3.1 Label Tree Structure for RNMF

Let D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} denote a set of training examples sampled from a distribution P over X × Y, where X = ℜ^d is the d-dimensional feature space and Y = {1, 2, ..., c} is the set of classes. The goal of multi-class classification is to learn a function f that minimizes the classification error of any instance drawn from P, i.e., Pr_{(x,y)∼P}[f(x) ≠ y]. A typical approach for multi-class learning is to reduce the problem into multiple binary classification tasks. However, when the number of classes is large, applying all the binary classifiers to determine the predicted class becomes computationally infeasible. RNMF provides a systematic approach to construct the binary classifiers and arrange them in a label tree structure (see Definition 3). An illustration of the label tree structure is shown in Figure 3.1.

[Figure 3.1: Proposed label tree structure using the RNMF algorithm.]

The label tree obtained from RNMF differs from conventional decision trees in the following ways. First, conventional decision tree classifiers employ a decision stump (i.e., a classifier consisting of only one splitting attribute) at each internal node of the tree, whereas RNMF considers a linear combination of the attributes to partition the input space. The weights of the attribute combination are obtained by solving a regularized non-negative matrix factorization problem. A regularization term is introduced to prevent the tree from distributing examples from one class to more than one child during the tree construction process. Second, each leaf node of a conventional decision tree classifier contains a class label, whereas each leaf node of RNMF contains one or more one-class SVM classifiers [58]. By allowing the leaf nodes to contain multiple classifiers, this helps to reduce the depth of the tree without significantly degrading its accuracy. In a one-class SVM, the training examples from each class are encapsulated in a hypersphere constructed in an appropriate feature space. Once the enclosing hyperspheres for all the classes at a leaf node have been constructed, the test examples are classified based on their distance to the center of each hypersphere. Furthermore, the one-class SVM models can be used to detect the presence of new classes. The label tree structure generated by RNMF is parameterized by its branching factor (p) and maximum depth (d). As will be described in the next section, the two parameters can be varied by the user to alter the size of the label tree. For example, choosing a branching factor larger than 2 may result in purer splits among the children, thus reducing the maximum depth needed to achieve high accuracy. Decreasing the maximum depth parameter, on the other hand, may result in leaf nodes that contain examples belonging to a larger number of classes. This, in turn, may increase the number of one-class SVM models needed to achieve high accuracy.
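A label tree of this kind can be represented by a small recursive data structure in which internal nodes store the learned feature weight matrix W and leaf nodes store the per-class one-class SVM models. The sketch below is a hypothetical illustration (the class and attribute names are not from the thesis); its routing rule uses the partition membership vector π described at the end of Section 3.2.

```python
# Hypothetical container for the RNMF label tree described in Section 3.1.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np

@dataclass
class LabelTreeNode:
    W: Optional[np.ndarray] = None                 # d x p feature weight matrix (internal nodes)
    children: List["LabelTreeNode"] = field(default_factory=list)
    leaf_models: Dict[int, object] = field(default_factory=dict)  # class id -> one-class model

    def is_leaf(self) -> bool:
        return not self.children

    def route(self, x: np.ndarray) -> "LabelTreeNode":
        # Partition membership vector pi = x^T W (W^T W)^{-1}; follow the largest entry.
        pi = x @ self.W @ np.linalg.inv(self.W.T @ self.W)
        return self.children[int(np.argmax(pi))]

def descend_to_leaf(root: LabelTreeNode, x: np.ndarray) -> LabelTreeNode:
    node = root
    while not node.is_leaf():
        node = node.route(x)
    return node   # the leaf's one-class models decide the final label (or flag a new class)
```

Since a test example follows a single root-to-leaf path, the number of classifier evaluations grows with the tree depth rather than with the number of classes, which is the source of the sublinear testing cost.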
For example, by choosing a branching factor larger than 2, this may result in purer splits among the children, thus reducing the maximum depth needed to achieve high accuracy. Decreasing the maximum depth parameter, on the other hand, may result in leaf nodes that contain examples that belong to a larger number of classes. This, in turn, may increase the number of one-class SVM models needed to achieve high accuracy. 23 3.2 RNMF Algorithm for Label Tree Construction Let X be an n × d data matrix containing the feature vector of the training examples and Y be an n × c matrix that represents their corresponding class labels, i.e., Yij = 1 if the ith training example belongs to the j th class, and zero otherwise. Similar to existing decision tree classifiers, RNMF seeks to recursively partition the training examples into purer subsets such that examples that belong to the same class should be assigned to the same partition. Specifically, at the each node v, our goal is to learn a partition matrix L of size n × p, where n is the number of training examples assigned to the node and p is the branching factor, as well as the associated feature weight matrix W, which is a d × p matrix that indicates the importance of each feature in partitioning the training examples. Each element Ljq represents the probability that the j th training example should be assigned to the q th partition at the given node v. To ensure L can be interpreted as probabilities, each column in L must sum up to 1. This can be achieved by adding the constraint LT 1n = 1p into the objective function, where 1p denote a p-dimensional column vector of all 1s. Furthermore, a regularization term is introduced to prevent training examples from the same class being assigned to different partitions. The partitions created at each node v are obtained by minimizing the following objective function: min L≥0,W≥0 s.t. D(X ∥ LWT ) + λH(LT Y) LT 1n = 1p (3.1) where D(X||LWT ) refers to the Kullback-Leilber divergence between the input data matrix X and the product of its latent factors LWT . The regularizer term H(LT Y) corresponds 24 to the entropy of the class distribution of each partition and is needed to avoid splitting the data from the same class to different partitions, i.e., H(LT Y) = − where (LT Y)i+ = ∑ p c ∑ (LT Y)ij (LT Y)i+ ∑ (LT Y)ij log , (LT Y)++ (LT Y)i+ (LT Y)i+ i=1 j=1 ∑ T T T ij (L Y)ij . Note that entropy increases j (L Y)ij and (L Y)++ = when the class distribution within each partition becomes more uniform and decreases when the distribution skews towards a small subset of the classes. Furthermore, since Y1k = 1n , therefore, LT Y1k = LT 1n = 1p (for a feasible solution that satisfies the constraint given in (3.1)). As a result, ∀i : (LT Y)i+ = 1 and (LT Y)++ = p. The entropy calculation can be simplified as follows: H(LT Y) = − 1∑ T (L Y)ij log(LT Y)ij p i,j Note that λ is a user-specified parameter that controls the trade off between optimal partition of X and class purity of each partition1 . The Lagrange formulation of (3.1) is given by: L = ∑ [Xij log i,j − λ ∑ Xij (LW T ) − Xij + (LW T )ij ] ij (LT Y )ij log(LT Y )ij − ∑ s ij ∑ µs ( LTsm − 1) m The objective function is minimized by taking its partial derivative with respect to the 1 λ is set to 1 in our experiments, but it can also be determined via cross-validation. 
The objective function is minimized by taking the partial derivatives of the Lagrangian with respect to the parameters W and L and setting them to zero:

\frac{\partial \mathcal{L}}{\partial W_{pq}} = -\sum_{i} \frac{X_{ip} L_{iq}}{(LW^T)_{ip}} + \sum_{i} L_{iq} = 0 \qquad (3.2)

\frac{\partial \mathcal{L}}{\partial L_{pq}} = -\sum_{j} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} + \sum_{j} W_{jq} - \lambda \sum_{j} Y_{pj} \log (L^T Y)_{qj} - \lambda \sum_{j} Y_{pj} - \mu_q = 0 \qquad (3.3)

By summing over the variable p in Equation (3.3), we have

\mu_q = -\frac{1}{n} \sum_{jp} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} + \sum_{j} W_{jq} - \lambda \sum_{jp} \frac{Y_{pj}}{n} \log (L^T Y)_{qj} - \lambda \qquad (3.4)

Replacing (3.4) back into (3.3), the partial derivative can be rewritten as follows:

\frac{\partial \mathcal{L}}{\partial L_{pq}} = -\sum_{j} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} - \lambda \sum_{j} Y_{pj} \log (L^T Y)_{qj} + \frac{1}{n} \sum_{jp} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} + \lambda \sum_{jp} \frac{Y_{pj}}{n} \log (L^T Y)_{qj} \qquad (3.5)

The objective function can be solved using a gradient descent approach. Following the approach used in [59], the gradient descent formula can be transformed into a multiplicative update formula as follows:

W_{pq} = W_{pq} \frac{(X^T L)_{pq}}{(W L^T L)_{pq}} \qquad (3.6)

L_{pq} = L_{pq} \frac{\alpha}{\beta} \qquad (3.7)

where

\alpha = \sum_{j} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} + \lambda \sum_{j} Y_{pj} \log (L^T Y)_{qj}, \qquad \beta = \frac{1}{n} \sum_{jp} \frac{X_{pj} W_{jq}}{(LW^T)_{pj}} + \lambda \sum_{jp} \frac{Y_{pj}}{n} \log (L^T Y)_{qj}

Algorithm 1 Recursive NMF at a node v
Input: Matrices X_v, Y_v, and MaxIter
Output: Matrices W_v and L_v
Initialize: W_v^old and L_v^old to random non-negative matrices
for j = 1 to MaxIter do
  update W_v^new using (3.6)
  update L_v^new using (3.7)
  set W_v^old <- W_v^new; L_v^old <- L_v^new
end for

By initializing W and L to be non-negative matrices, the update formulas ensure that the solution will also be non-negative. Algorithm 1 summarizes the key steps of the proposed recursive non-negative matrix factorization (RNMF) algorithm for finding the partitions at any given node in the tree. The recursive partitioning step continues until one of the following stopping criteria is met: (1) all the training examples in a partition belong to the same class, (2) the node contains fewer than minleaf training examples, or (3) only one class contains more than minclass examples (we set minleaf = 15 and minclass = 5 in our experiments).

Once a leaf node is encountered, the algorithm constructs a one-class SVM model for each class i by learning a hypersphere that encapsulates all class i training examples that reach the particular leaf node. The hypersphere for class i is constructed by first labeling the instances belonging to class i as +1 and those that belong to other classes as -1. We then construct two concentric hyperspheres such that the inner sphere encloses as many instances from class i as possible, while the outer sphere is constructed so that instances that do not belong to class i lie outside of it. The radial distance between the two hyperspheres is called the classifier's margin, which defines the objective function to be optimized by the 1-class SVM learning algorithm. For example, the hypersphere for predicting class k is obtained by solving the following [60]:

\min_{R_k, a_k, d_k, \xi_i, \xi_l} \; R_k^2 - M d_k^2 + \frac{C}{N_k} \sum_{i: y_i = k} \xi_i + \frac{C}{N_{\bar{k}}} \sum_{l: y_l \ne k} \xi_l
\text{subject to} \; \|x_i - a_k\|^2 \le R_k^2 + \xi_i, \; \forall i: y_i = k
\|x_l - a_k\|^2 \ge R_k^2 + d_k^2 - \xi_l, \; \forall l: y_l \ne k
\xi_i \ge 0, \; \forall i: y_i = k; \quad \xi_l \ge 0, \; \forall l: y_l \ne k \qquad (3.8)

where N_k and N_{\bar{k}} are the numbers of instances that belong to the k-th class and its complement, respectively. Here, the positive examples are enumerated by indices i and the negative examples by indices l. The xi_i and xi_l are slack variables, while C >= 0 controls the penalty for misclassification errors. The variables a_k and R_k represent the center and radius of the hypersphere constructed for the k-th class, while \sqrt{R_k^2 + d_k^2} defines the radius of the outer hypersphere. It can be shown that the margin width is given by \sqrt{R_k^2 + d_k^2} - R_k.
To maximize the margin, we need to maximize d_k^2 and minimize R_k^2 simultaneously; the parameter M >= 0 controls the trade-off between these two terms. The Wolfe dual form of the preceding optimization problem is given below:

\max_{\alpha} \; \sum_{i: y_i = k} \alpha_i K(x_i, x_i) - \sum_{l: y_l \ne k} \alpha_l K(x_l, x_l) - \sum_{i,j: y_i, y_j = k} \alpha_i \alpha_j K(x_i, x_j) - \sum_{l,m: y_l, y_m \ne k} \alpha_l \alpha_m K(x_l, x_m) + 2 \sum_{i,l: y_i = k,\, y_l \ne k} \alpha_i \alpha_l K(x_i, x_l)
\text{subject to} \; \sum_{i: y_i = k} \alpha_i = 1 + M, \; 0 \le \alpha_i \le \frac{C}{N_k}; \quad \sum_{l: y_l \ne k} \alpha_l = M, \; 0 \le \alpha_l \le \frac{C}{N_{\bar{k}}} \qquad (3.9)

where K(x, x') is a kernel function that measures the similarity between the pair of instances x and x'. Widely used kernel functions include the RBF kernel, K(x, x') = \exp\big(-\frac{\|x - x'\|^2}{2\sigma^2}\big), and the polynomial kernel, K(x, x') = (1 + x^T x')^q. The optimization problem given in (3.9) can be solved using quadratic programming.

After the label tree is constructed, we can apply the tree to classify unlabeled examples in the following way. Starting from the root node of the tree, we apply the feature weight matrix W to determine the partition to which the unlabeled example x_test should be sent. Specifically, we compute a partition membership vector pi and assign the test example to the partition j with the largest magnitude, i.e., j = \arg\max_k \pi_k, where \pi = x_{test}^T W_v (W_v^T W_v)^{-1}. The procedure is repeated until the test example reaches one of the leaf nodes, where we apply the 1-class SVM models associated with the leaf node to predict its label. Specifically, if the test example is encapsulated in the hypersphere for class i, then the test example is assigned to that class. If the test example is located within more than one hypersphere, we compute the distance between the test example and the hypersphere centers and assign it to the hypersphere with the smallest distance. Finally, if the test example is not encapsulated by any hypersphere, we declare it as belonging to a new class.
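The routing step just described can be sketched in a few lines of Python/NumPy. The node attributes used below (W, children, is_leaf) are illustrative assumptions about how an implementation might store the tree; the leaf-level 1-class SVM decision is left as a comment.

```python
import numpy as np

def route_to_leaf(x_test, root):
    """Route a test example down the RNMF label tree as described above.
    Each internal node is assumed to hold its feature-weight matrix W
    (d x p) and a list of child nodes; leaves hold the 1-class SVM models."""
    node = root
    while not node.is_leaf:
        W = node.W
        # Partition membership vector: pi = x^T W (W^T W)^{-1}
        pi = x_test @ W @ np.linalg.inv(W.T @ W)
        node = node.children[int(np.argmax(pi))]
    return node   # at the leaf, apply its 1-class SVM models to predict the label
```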
3.3 Experimental Evaluation

We conducted our experiments on the Wikipedia data dump generated on October 9, 2009. After preprocessing, there were about 3.6 million Wikipedia articles in the data, which were classified into 336K categories. We created two subsets of the data for our experiment, denoted as Wiki1 and Wiki2, respectively. Wiki1 contains 24,379 articles from the largest 214 categories, whereas Wiki2 contains 65,156 articles chosen from the largest 1,618 categories of the Wikipedia dump. We compared the performance of RNMF against two baseline hierarchical learning algorithms proposed in the literature, namely, DAGSVM [15] and the confusion matrix approach (CM) described in [18]. For a fair comparison, we set the branching factor (p) for our algorithm to 2. The performance of the classifiers was evaluated according to their average F1 scores over all classes. All the results reported in this study were based on 5-fold cross validation. The results for the two data sets are summarized in Tables 3.1 and 3.2.

Table 3.1 Comparison between RNMF and baseline algorithms on the Wiki1 data set.
                 CM [18]            DAGSVM [15]        RNMF (p=2)
F1               0.4826 +/- 0.0017  0.5309 +/- 0.002   0.5116 +/- 0.0012
Test time (s)    195.5 +/- 9.2      942.9 +/- 17.6     132.2 +/- 3.2
Depth            10                 214                8

Table 3.2 Comparison between RNMF and baseline algorithms on the Wiki2 data set.
                 CM [18]            DAGSVM [15]        RNMF (p=2)
F1               0.2518 +/- 0.0025  0.3012 +/- 0.0019  0.2892 +/- 0.0036
Test time (s)    336.2 +/- 13.2     1524.6 +/- 16.5    289.3 +/- 6.1
Depth            12                 1618               10

The results suggest that RNMF outperforms the CM approach both in terms of its F1 score and its test efficiency. When compared against DAGSVM, which requires building a quadratic number of 1-vs-1 classifiers, the F1-score of RNMF is slightly lower, but its test time is at least 7 times faster since the depth of RNMF's label tree is considerably shorter. Note that the performance of RNMF depends on the branching factor (p) and the maximum depth (d) of the tree. We evaluated how these two parameters affect the testing time and classification accuracy on the Wiki1 data set. If the tree has only a single node, which is equivalent to a flat (non-hierarchical) model of c one-class SVM models, the average F1 score and test times of RNMF are comparable to DAGSVM (see Figures 3.2 and 3.3). As the depth of the tree increases, both the testing time and the F1 score of RNMF decrease. The testing time decreases because fewer 1-class SVM models need to be invoked to predict the label of a test example at a given leaf node as the tree grows deeper. On the other hand, the F1-score is lower for two reasons. First, as the tree is extended, the additional partitioning may divide the training examples from the same class into multiple partitions. Second, fewer training examples are available to build the 1-class SVM models at the leaf nodes as the tree grows deeper, which makes the models more susceptible to overfitting.

[Figure 3.2: Average F1 score for RNMF as the branching factor (p) and maximum depth (d) are varied on the Wiki1 data set.]

[Figure 3.3: Test time for RNMF as the branching factor (p) and maximum depth (d) are varied on the Wiki1 data set.]

Furthermore, if we compare the average F1 scores at the same maximum depth d, the tree with a branching factor equal to 2 is better than the tree with a branching factor equal to 5 (when d > 1). This is because the tree with the smaller branching factor contains fewer nodes (since their maximum depths are the same), which means its training examples are split into fewer branches. This results in a better classifier, as more training examples become available at each leaf node to train the 1-class SVM models.

3.4 New Class Detection

As previously noted, one of the key advantages of RNMF is that it can detect new classes that are not present in the training set. To evaluate the performance of RNMF in terms of its ability to detect new classes, we performed the following leave-one-class-out experiment on Wiki1. Specifically, we eliminate one of the 214 classes from the training set and use the data instances belonging to the remaining 213 classes for training purposes. During testing, we apply the induced label tree to a test set that contains all 214 classes and treat the eliminated class as the new class. We repeat the experiment 214 times, each time leaving out one of the classes from the training set. The average F1 score of the new class is given in Table 3.3 (these results were reported after repeating the experiment 3 times). We compared the performance of the algorithm against DAGSVM. DAGSVM [15] constructs a series of binary SVM classifiers and organizes them in a directed acyclic graph. An example of the DAGSVM structure for 5 classes is shown in Figure 3.4.
Starting from the root node, DAGSVM eliminates one class at a time until it reaches a leaf node, which contains only a single class. To extend DAGSVM to detect new classes, we terminate the DAG one level earlier, so that the leaf nodes contain training examples from two classes. A binary SVM classifier is then trained to discriminate between the two classes. When a test example reaches a leaf node, we apply the binary classifier to predict its class. If the confidence of the SVM prediction is high, we assign the test example to its predicted class. However, if the confidence of the SVM prediction is low (i.e., when the test example is located within the geometric margin of the classifier), we predict it to be a new class.

[Figure 3.4: Example of the DAGSVM structure for 5 classes.]

Table 3.3 Average F1-score of the new class on the Wiki1 data set.
           # classes (Training)   # classes (Testing)   F1
RNMF       213                    214                   0.3192 +/- 0.016
DAGSVM     213                    214                   0.2833 +/- 0.012

Based on the results in Table 3.3, RNMF clearly outperforms DAGSVM in terms of its ability to detect new classes. We have also compared the F1-score of each class that was omitted from the training set in our new class detection experiment. The results are shown in Figure 3.5, where the horizontal axis corresponds to the F1 score of the DAGSVM algorithm and the vertical axis corresponds to the F1 score of RNMF. There are 214 data points in the plot, one for each class. If a point is located above the diagonal, the F1-score of RNMF is better than that of DAGSVM for the particular omitted class. Since 82% of the points lie above the diagonal, this suggests that RNMF outperforms DAGSVM in detecting the new class for the majority of the classes present in the Wiki1 data.

[Figure 3.5: New class detection performance for individual classes on Wiki1 (DAGSVM F1 score vs. RNMF F1 score).]

3.5 Conclusion

This chapter presented a novel label tree learning algorithm for extreme multi-class problems called recursive non-negative matrix factorization (RNMF). The algorithm creates soft partitions of the classes by solving a regularized non-negative matrix factorization problem and uses a set of 1-class SVM models to predict the classes at the leaf nodes of the tree. Experimental results on the Wikipedia data suggest that RNMF can achieve accuracy comparable to some of the state-of-the-art label tree algorithms but is considerably faster (since it produces shorter trees). In addition, it can detect new classes that are not present in the training data. However, RNMF has two major limitations. First, there is no global objective function optimized by the algorithm; instead, it greedily partitions the training data to minimize a local objective function. Second, the classes can be fragmented across multiple leaf nodes in the tree. The next chapter presents an alternative method known as MF-Tree that overcomes these limitations.

Chapter 4 Matrix Factorization Tree for Extreme Multi-Class Learning

Hierarchical learning methods [15, 16, 17, 18, 19] are designed to overcome the challenges of extreme multi-class classification by employing a set of binary classifiers and arranging them into a directed acyclic graph or rooted tree structure. These methods have several advantages. First, they can lead to sublinear test complexity since only the binary classifiers located along the path from the root node to a leaf node are invoked when classifying a test example.
Second, one could alleviate the class imbalance problem, particularly at the top levels of the hierarchy, by grouping related classes together (instead of learning to discriminate one class from another, some hierarchical methods are designed to distinguish one group of classes from another group at the top levels of the tree). Finally, the resulting concept hierarchy may offer interesting insights into the conceptual relationships among the large number of categories.

Over the years, numerous hierarchical classification methods have been developed, from greedy top-down decision tree classifiers, which use simple attribute tests as internal nodes of the tree, to more advanced methods that use binary classifiers at their internal nodes to distinguish the different classes. These methods have different pros and cons. While decision tree classifiers are easy to construct, they are susceptible to the class imbalance and data fragmentation problems. In addition, each class can be distributed across multiple branches of the tree, resulting in an unnecessarily large tree. This affects both the test efficiency and the ability of the classifier to generate a meaningful taxonomy of the conceptual relationships among the classes. Alternative methods, such as filter trees [17] and their variants, impose a random tree structure and learn an appropriate binary classifier at each internal node of the tree. Because of the arbitrary manner in which the classes are assigned to the nodes, the resulting tree might be suboptimal and loses descriptive information about the class relationships. More importantly, neither class of methods is designed to optimize a global objective function.

To address these problems, we propose a novel method known as matrix factorization tree (MF-Tree, in short), which organizes the classes into a binary tree by optimizing a global objective function based on a squared error loss. We first provide a theoretical proof to demonstrate the equivalence between the proposed squared error loss function and the Hilbert-Schmidt Independence Criterion (HSIC) [21][22][23][24]. HSIC is a recently proposed criterion for taxonomy learning in an unsupervised learning setting. In this chapter, we demonstrate how HSIC can be applied to a supervised multi-class learning problem. We also show the additive property of the objective function, which allows us to decompose the problem into a hierarchical binary classification task that can be organized in a binary tree structure. Nevertheless, for massive classification problems involving hundreds of thousands of classes or more, factorizing the matrices is computationally expensive, especially at the top levels of the tree. To address this problem, we present an approximate algorithm for inducing MF-Tree, in which the initial assignment of classes to the child nodes at the top levels of the hierarchy is determined from the class sizes and the distances between class centroids instead of solving the more expensive matrix factorization problem. The factorization can then be applied at the lower levels of the tree, where there are fewer classes and training examples to deal with. We show that the performance of the approximate MF-Tree is comparable to that of MF-Tree, with a significant improvement in its training time. In short, the main contributions of this chapter are as follows:

1. We introduce a novel hierarchical classification method for large multi-class learning problems known as MF-Tree.
The tree is designed to optimize a global loss function.

2. We provide a theoretical proof showing the equivalence between optimizing the squared error loss function of a matrix factorization problem and the Hilbert-Schmidt Independence Criterion (HSIC) by setting appropriate constraints on the latent factors of the factorization.

3. We demonstrate the additive property of HSIC that enables the design of a hierarchical tree learning framework.

4. We present an approximate algorithm for learning MF-Tree to speed up the tree induction process.

5. We provide extensive experiments to evaluate and compare the performance of the proposed method against several state-of-the-art baselines.

4.1 Preliminaries

This section presents a formal definition of the multi-class learning problem. We also introduce the concepts of the Hilbert-Schmidt Independence Criterion and the tree-structured covariance matrix. The connections between these concepts and the proposed MF-Tree method will be shown in Section 4.2.

4.1.1 Problem Definition

Let D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} denote a set of training instances sampled from an unknown distribution P over X x Y, where X is a subset of R^d, the d-dimensional feature space, and Y = {1, 2, ..., c} is the set of permissible categories. For large multi-class learning problems, we assume c >> 2. The goal of multi-class learning is to infer a target function f that minimizes the expected risk of misclassifying any instance drawn from P, i.e., \min_f E_{(x,y)\sim P}[L(f(x), y)], where L is known as the loss function. A standard approach for multi-class learning is to reduce the problem into multiple binary classification tasks. However, when the number of classes c is too large, applying all the binary classifiers to predict the class of a test instance is computationally infeasible. Towards this end, various hierarchical labeled tree algorithms have been developed for large multi-class learning problems. A hierarchical labeled tree is a 4-tuple T = <V, E, Phi, Lambda>, where V is the set of nodes, E is the set of edges connecting parent nodes to their corresponding child nodes, Phi = {f_v(x) | v in V} is the set of classifiers associated with the internal nodes of T, and Lambda = {y_v | y_v in Y, v in V} is the set of labels associated with the leaf nodes. In this chapter, we restrict our framework to binary rooted trees, even though, in principle, it is applicable to rooted trees with a branching factor greater than 2. Our objective in this work is to learn an optimal tree T for large multi-class learning problems.

4.1.2 Hilbert-Schmidt Independence Criterion (HSIC)

This section introduces the Hilbert-Schmidt Independence Criterion. Our presentation follows closely the definitions given in [21][23][22][24]. Let (X, Y) be a pair of random variables drawn from a joint distribution Pr_{X,Y}. The Hilbert-Schmidt Independence Criterion (HSIC) [21] measures the dependence between X and Y by calculating the norm of their cross-covariance operator C_{xy} in Hilbert space. It has been shown that this norm vanishes if and only if X and Y are independent. A larger value of HSIC also indicates a stronger dependence between the two variables, with respect to the choice of kernels [23]. Let F be a Reproducing Kernel Hilbert Space (RKHS) of functions from X to R with a feature map phi, such that <phi(x), phi(x')>_F = k_X(x, x'), where k_X : X x X -> R is a positive definite kernel.
Likewise, let G be the RKHS on Y with a feature map psi, such that <psi(y), psi(y')>_G = k_Y(y, y'), where k_Y : Y x Y -> R is the corresponding kernel for the space Y. The cross-covariance operator C_{xy} is defined as

C_{xy} = E_{x,y}\big[(\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y)\big]

where mu_x = E[phi(x)], mu_y = E[psi(y)], and the symbol \otimes denotes a tensor product. The Hilbert-Schmidt Independence Criterion is defined as the squared Hilbert-Schmidt norm of the cross-covariance operator C_{xy}, which can be written as follows:

HSIC_{X,Y} = \|C_{xy}\|^2_{HS} = E_{x,x',y,y'}\big[k_X(x, x') k_Y(y, y')\big] + E_{x,x'}\big[k_X(x, x')\big] E_{y,y'}\big[k_Y(y, y')\big] - 2 E_{x,y}\Big[E_{x'}\big[k_X(x, x')\big] E_{y'}\big[k_Y(y, y')\big]\Big] \qquad (4.1)

Let {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be the set of training instances drawn from Pr_{XY}. HSIC can be empirically estimated from the training data as follows [21, 22]:

HSIC = \frac{1}{(n-1)^2} \mathrm{tr}\big[H_n K H_n L\big] \qquad (4.2)

where tr[.] denotes the trace of a matrix, H_n = I - \frac{1}{n} 1_n 1_n^T is a centering matrix, I is the identity matrix, 1_n is a column vector of all ones, K is an n x n kernel matrix for x, and L is an n x n kernel matrix for the class labels y. The matrix product H_n K H_n corresponds to the computation of a centered kernel matrix. If K-hat denotes the centered kernel matrix, then HSIC simplifies to (n - 1)^{-2} tr[K-hat L]. Note that the time complexity for estimating HSIC from the data is O(c^3 + nc^2 + n^2 c), which is expensive when the number of classes c is large.

4.1.3 Tree-Structured Covariance Matrix

Given a rooted binary tree T on the label set Y, we can represent it as a tree-structured covariance matrix [22] as follows.

Definition 4 (Tree-Structured Covariance Matrix) A matrix C in R^{c x c} is a covariance matrix associated with the label tree structure T, where C_{ij} is a measure of similarity between the i-th and j-th classes computed based on the path length from the root node to the closest common ancestor of the leaf nodes for i and j in T.

The larger the value of C_{ij}, the closer the relationship between the two classes. Figure 4.1 shows an example of a labeled tree structure for 4 classes, c1, c2, c3, and c4. Let a, b, c, d, e, f denote the edge weights between the nodes. Based on this structure, the tree-structured covariance matrix is the 4 x 4 matrix shown on the right-hand side of the diagram. For example, C_{1,2} = a since the closest common ancestor of the two classes has a path length equal to a from the root node. Similarly, C_{3,3} = b + e, which is the path length from the root to the leaf node for c3. Note that the edge weights a, b, c, d, e, f determine the similarity between the different categories. For brevity, we consider the edge weights to be 1 in the remainder of this chapter.

[Figure 4.1: Example tree structure for four classes and its tree-structured covariance matrix.]

4.2 MF-Tree: Proposed Method

This section presents an overview of the proposed method. We first introduce the global objective function to be optimized by MF-Tree. We then show its connection to the HSIC measure described in the previous section. We also illustrate the additive property of HSIC, which allows us to develop a hierarchical tree-based induction algorithm.

4.2.1 Global Objective Function of MF-Tree

Let X denote an n x d centered design matrix, whose rows correspond to the data instances and whose columns correspond to the features. An uncentered design matrix can always be centered by applying the operation H_n X, where H_n is the centering matrix defined in Section 4.1.2.
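Because the centered kernel and the empirical HSIC estimate of Equation (4.2) recur throughout the rest of this chapter, the following minimal Python/NumPy sketch shows how they could be computed. The toy data, the linear kernel, and the plain label kernel YY^T (rather than the tree-structured YCY^T introduced later) are illustrative assumptions.

```python
import numpy as np

def empirical_hsic(K, L):
    """Empirical HSIC estimate of Eq. (4.2): tr[H K H L] / (n-1)^2,
    where K and L are n x n kernel matrices for the inputs and labels."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H_n
    return np.trace(H @ K @ H @ L) / (n - 1) ** 2

# Toy check with a linear kernel on the inputs and the label kernel Y Y^T
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
Y = np.eye(2)[[0, 0, 0, 0, 1, 1, 1, 1]]      # one-hot labels, 2 classes
print(empirical_hsic(X @ X.T, Y @ Y.T))
```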
MF-Tree is designed to minimize the following regularized squared error loss function:

J = \|X - ZW^T\|_F^2 + \lambda \, \mathrm{tr}\big[W A W^T\big]
  = \mathrm{tr}\big[XX^T - XWZ^T - ZW^TX^T + ZW^TWZ^T\big] + \lambda \, \mathrm{tr}\big[W A W^T\big] \qquad (4.3)

where Z is an n x c matrix that represents the class membership of each data instance, W is a d x c matrix that represents the feature vector of each class, A is a symmetric, positive-definite c x c class covariance matrix, and lambda is the regularization parameter. The Frobenius norm is denoted by ||.||_F and the superscript T denotes the transpose operator. The intuition behind the objective function given in Equation (4.3) is to factorize the centered design matrix X into a product of latent factors Z and W.

4.2.2 Relationship between Global Objective Function and HSIC

To demonstrate the relationship between the objective function and HSIC, let us take the partial derivative of J given in Equation (4.3) with respect to W and set it to zero:

\frac{\partial J}{\partial W} = -2X^TZ + 2WZ^TZ + 2\lambda WA = 0 \;\;\Longrightarrow\;\; W = X^TZ\big[Z^TZ + \lambda A\big]^{-1}

Replacing this back into Equation (4.3) yields the following:

J = \mathrm{tr}\Big[XX^T - 2XX^TZ\big[Z^TZ + \lambda A\big]^{-1}Z^T + Z\big[Z^TZ + \lambda A\big]^{-1}Z^TXX^TZ\big[Z^TZ + \lambda A\big]^{-1}Z^T\Big] + \lambda\,\mathrm{tr}\Big[X^TZ\big[Z^TZ + \lambda A\big]^{-1}A\big[Z^TZ + \lambda A\big]^{-1}Z^TX\Big]
  = \mathrm{tr}\Big[XX^T - 2XX^TZ\big[Z^TZ + \lambda A\big]^{-1}Z^T\Big] + \mathrm{tr}\Big[X^TZ\big[Z^TZ + \lambda A\big]^{-1}\big[Z^TZ + \lambda A\big]\big[Z^TZ + \lambda A\big]^{-1}Z^TX\Big]
  = \mathrm{tr}\Big[XX^T - XX^TZ\big[Z^TZ + \lambda A\big]^{-1}Z^T\Big]

Thus, finding a solution for Z that minimizes J is equivalent to finding one that maximizes the following objective function:

\max_Z \; \mathrm{tr}\Big[XX^TZ\big[Z^TZ + \lambda A\big]^{-1}Z^T\Big]

Let K-hat = XX^T be the centered kernel matrix associated with the training instances X. During training, we enforce the following constraint:

Z_{train}\big[Z_{train}^TZ_{train} + \lambda A\big]^{-1}Z_{train}^T = YCY^T \qquad (4.4)

where Y is an n x c matrix of class labels and C is the tree-structured covariance matrix. This can be achieved by setting Z_{train} = Y and C = \big[Z_{train}^TZ_{train} + \lambda A\big]^{-1}.

Theorem 1 If A is a symmetric, positive-definite matrix and lambda > 0, then Z_{train}^TZ_{train} + \lambda A is also positive-definite.

Proof 1 Given any real vector v, v^T\big[Z_{train}^TZ_{train} + \lambda A\big]v = v^TZ_{train}^TZ_{train}v + \lambda v^TAv = \|Z_{train}v\|^2 + \lambda v^TAv > 0 since A is a positive definite matrix.

Since Z_{train}^TZ_{train} + \lambda A is positive-definite, it is invertible. The equality constraint in Equation (4.4) ensures that an appropriate lambda and A can be chosen to generate a viable tree, in which the entries of the matrix C must have a specific structure. Furthermore, let L = YCY^T denote the kernel matrix associated with the class labels. The objective function can now be simplified to:

\max_C \; \mathrm{tr}\big[\hat{K}L\big] \quad \text{s.t.} \quad L = YCY^T \qquad (4.5)

which is equivalent to the optimization of HSIC.
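As a sanity check on the derivation above, the short sketch below (Python/NumPy, with random toy data and an identity class covariance A, both illustrative assumptions) plugs the closed-form W = X^T Z (Z^T Z + lambda A)^{-1} back into Equation (4.3) and confirms that the result matches tr[XX^T - XX^T Z (Z^T Z + lambda A)^{-1} Z^T].

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, lam = 10, 5, 3, 0.5
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                      # center the design matrix
Z = np.eye(c)[rng.integers(0, c, n)]     # class-membership matrix (here Z = Y)
A = np.eye(c)                            # symmetric positive-definite class covariance

M = Z.T @ Z + lam * A
W = X.T @ Z @ np.linalg.inv(M)           # closed-form minimizer over W

J = np.linalg.norm(X - Z @ W.T, "fro") ** 2 + lam * np.trace(W @ A @ W.T)
J_reduced = np.trace(X @ X.T - X @ X.T @ Z @ np.linalg.inv(M) @ Z.T)
print(np.isclose(J, J_reduced))          # True: the two expressions agree
```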
4.2.3 Additive Property of HSIC

In the previous subsection, we showed that the global objective function for MF-Tree is equivalent to HSIC by setting appropriate constraints on A and lambda. This enables us to define a kernel matrix for the class labels as L = YCY^T, where C is the tree-structured covariance matrix. In this section, we demonstrate the additive property of HSIC, which enables us to decompose the optimization problem into a hierarchical binary classification task. First, we show that the tree-structured covariance matrix can be written as a sum of block diagonal matrices. We illustrate this with a simple example. Consider the following covariance matrix for a labeled tree that contains 8 leaf nodes (assuming the edge weights are equal to 1):

C_8 = \begin{bmatrix} 3 & 2 & 1 & 1 & 0 & 0 & 0 & 0 \\ 2 & 3 & 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 3 & 2 & 0 & 0 & 0 & 0 \\ 1 & 1 & 2 & 3 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 3 & 2 & 1 & 1 \\ 0 & 0 & 0 & 0 & 2 & 3 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 & 3 & 2 \\ 0 & 0 & 0 & 0 & 1 & 1 & 2 & 3 \end{bmatrix} \qquad (4.6)

We can rewrite this matrix as follows:

C_8 = I_2 \otimes 1_4 1_4^T + I_4 \otimes 1_2 1_2^T + I_8 \otimes 1 \qquad (4.7)

where \otimes denotes a Kronecker product. Here, I_2 \otimes 1_4 1_4^T is a block-diagonal matrix with two 4 x 4 all-ones blocks, I_4 \otimes 1_2 1_2^T is a block-diagonal matrix with four 2 x 2 all-ones blocks, and I_8 \otimes 1 is simply the 8 x 8 identity matrix. The first term on the right-hand side of Equation (4.7) partitions the first 4 rows and columns of the matrix into one cluster and the last 4 rows and columns into a second cluster. The second term partitions each cluster into 2 smaller subclusters, while the third term puts each subcluster as a cluster by itself (this is at the leaf nodes of the tree). So, we can simply write

C_{2^m} = B_2 + B_4 + \cdots + B_{2^m} \qquad (4.8)

where B_{2^j} = I_{2^j} \otimes 1_{2^{m-j}} 1_{2^{m-j}}^T, 1_{2^0} = 1, and m is the depth of the binary tree. We can further expand the decomposition as follows:

C_8 = B_2 + B_4 + B_8 = B_{2,1} + \begin{bmatrix} B_{4,1} & 0_{4\times4} \\ 0_{4\times4} & B_{4,2} \end{bmatrix} + \begin{bmatrix} B_{8,1} & 0_{2\times2} & 0_{2\times2} & 0_{2\times2} \\ 0_{2\times2} & B_{8,2} & 0_{2\times2} & 0_{2\times2} \\ 0_{2\times2} & 0_{2\times2} & B_{8,3} & 0_{2\times2} \\ 0_{2\times2} & 0_{2\times2} & 0_{2\times2} & B_{8,4} \end{bmatrix}

where each diagonal block has the form

B_{2^i,j} = \begin{bmatrix} 11^T & 0 \\ 0 & 11^T \end{bmatrix} \qquad (4.9)

and the dimension of each of the submatrices of 0s and 11^Ts in B_{2^i,j} is 2^{m-i} x 2^{m-i}. Furthermore, given a block matrix B_k, we have

Y B_k Y^T = Y_1 B_{k,1} Y_1^T + Y_2 B_{k,2} Y_2^T + \cdots + Y_{k/2} B_{k,k/2} Y_{k/2}^T \qquad (4.10)

where Y = [\,Y_1 \; \cdots \; Y_j \; \cdots \; Y_{k/2}\,] is the corresponding column-block partition of Y,

B_k = \mathrm{blockdiag}\big(B_{k,1}, B_{k,2}, \ldots, B_{k,k/2}\big) \qquad (4.11)

and Y^T is the matching stack of blocks [\,Y_1^T; \cdots; Y_j^T; \cdots; Y_{k/2}^T\,] \qquad (4.12).

As each B_{k,j} corresponds to a block-diagonal matrix of the form given in Equation (4.9), we can write the label kernel as

L = YCY^T = \sum_{i=1}^{\log_2 k} Y B_{2^i} Y^T = \sum_{i=1}^{\log_2 k} \sum_{j=1}^{2^{i-1}} Y_j B_{2^i,j} Y_j^T

This decomposition enables us to solve the optimization problem with a hierarchical, tree-like decomposition. The outer sum corresponds to the decomposition at each depth of the tree, while the inner sum corresponds to expanding the internal nodes associated with that depth. Instead of learning C directly by maximizing the HSIC trace (see Equation (4.5)), we can learn the set of B_{2^i,j} iteratively, each of which is associated with an internal node in the tree structure.
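The Kronecker decomposition of Equation (4.8) is easy to verify numerically. The sketch below (Python/NumPy, function name illustrative) builds the tree-structured covariance matrix for a complete binary tree of depth m with unit edge weights and, for m = 3, reproduces the matrix in Equation (4.6).

```python
import numpy as np

def tree_covariance(m):
    """Tree-structured covariance matrix C_{2^m} for a complete binary tree
    of depth m with unit edge weights, built term by term from Eq. (4.8):
    C = sum_j I_{2^j} (x) 1_{2^{m-j}} 1_{2^{m-j}}^T."""
    k = 2 ** m
    C = np.zeros((k, k))
    for j in range(1, m + 1):
        C += np.kron(np.eye(2 ** j), np.ones((2 ** (m - j), 2 ** (m - j))))
    return C

C8 = tree_covariance(3)
print(C8.astype(int))          # matches the 8 x 8 matrix of Eq. (4.6)
```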
4.2.4 MF-Tree Construction

To construct the tree, we need to solve the objective function given in (4.5). Based on the discussion in the previous subsection, the objective function can be simplified as follows:

\max_{\{B\}} \; \sum_{i,j} \mathrm{tr}\big[K_{j,j} Y_j B_{i,j} Y_j^T\big] \qquad (4.13)

This allows us to solve the optimization problem one node at a time. Let D_j = Y_j^T K_{j,j} Y_j, which is simply a c x c similarity matrix of the classes, computed from their attributes X and the known ground-truth labels associated with the block B_{i,j}. At each node of the tree construction process, we need to find a matrix B_{i,j} that maximizes its alignment with D_j, i.e.,

\max_{B_{i,j}} \; \mathrm{tr}\big[D_j B_{i,j}\big]

For notational convenience, we drop the subscripts i and j in the remainder of the discussion. Note that B is simply a co-association matrix, whose elements are [B]_{i,j} = 1 if classes i and j are in the same partition, and 0 otherwise. To create binary partitions, let G be a c x 2 matrix, where [G]_{i,j} = 1 if class i belongs to partition j, and 0 otherwise. (Here, we assume the number of classes associated with the given node is c. As the tree grows deeper, the number of classes decreases, which in turn reduces the number of rows in G.) It is easy to show that B = GG^T. Thus,

\mathrm{tr}\big[DB\big] = \mathrm{tr}\big[DGG^T\big] = \mathrm{tr}\big[G^TDG\big]

This objective function is equivalent to finding a clustering of the classes that maximizes the within-cluster similarity.

4.2.5 Optimization

Note that the following optimization problem at each node,

\max_G \; \mathrm{tr}\big[G^TDG\big] \quad \text{s.t.} \quad G\mathbf{1}_2 = \mathbf{1}_c,

is not feasible unless we enforce a non-negativity constraint on G [61]. Furthermore, if we relax the condition that the elements of G must be either 0 or 1 and replace it with a constraint that G must be an orthogonal matrix,

\max_G \; \mathrm{tr}\big[G^TDG\big] \quad \text{s.t.} \quad G^TG = I_2,

we can solve the problem easily by finding the two eigenvectors corresponding to the largest eigenvalues of D [62, 63, 64]. Let G\Lambda G^T be the eigendecomposition of the matrix D. By definition, D = Y^TKY. Following the eigendecomposition of D, we have Y^TKY = G\Lambda G^T. Since G^TG = I,

G^TY^TKYG = \Lambda

Let P = YG, which is an n x 2 matrix that determines the assignment of each data instance to the corresponding child nodes. Thus, P^TKP = \Lambda. During the testing phase, let k-hat be an n-dimensional column vector that contains the similarity between the test instance and all the training instances. We may compute the 2-dimensional partition vector p-hat as follows:

\hat{p} = P^T\hat{k} = G^TY^T\hat{k}

After computing p-hat, we assign the test instance to the partition with the larger value, i.e., \arg\max_i (P^T\hat{k})_i.

Algorithm 2 MF-Tree Construction
Input: X_train in R^{n x d}, Y_train in R^{n x c}
Output: Tree structure, T
1. root = TreeGrowth(X_train, Y_train)
2. Insert root into T
3. return T

function TreeGrowth(X, Y)
1. v = createNode()
2. v.index = getIndex(X)
3. if number of classes in Y < 2 then
4.   v.class = unique(Y)
5. else
6.   (X^(l), Y^(l), X^(r), Y^(r), v.G) = Partition(X, Y)
7.   v.left = TreeGrowth(X^(l), Y^(l))
8.   v.right = TreeGrowth(X^(r), Y^(r))
9. end if
10. return v

function Partition(X, Y)
1. Compute the centered kernel K from X.
2. Compute D = Y^T K Y
3. Compute the first two eigenvectors g^(1), g^(2) of D
4. Let G = [g^(1), g^(2)]
5. (X^(l), Y^(l)) = {(x_i, y_i) : y_i = k, arg max_j G_{kj} = 1}
6. (X^(r), Y^(r)) = {(x_i, y_i) : y_i = k, arg max_j G_{kj} = 2}
7. return (X^(l), Y^(l), X^(r), Y^(r), G)
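The following Python/NumPy sketch fills in the eigendecomposition details of the Partition function in Algorithm 2. It assumes a linear kernel and one-hot labels, returns the class index sets for the two children rather than the data subsets, and uses the magnitude of the eigenvector entries to decide the side, all of which are illustrative choices rather than the exact thesis implementation.

```python
import numpy as np

def partition_classes(X, Y):
    """Split the classes at a node into two groups using the top-2
    eigenvectors of D = Y^T K Y, with K a centered linear kernel."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ (X @ X.T) @ H                      # centered (linear) kernel
    D = Y.T @ K @ Y                            # c x c class similarity matrix
    eigvals, eigvecs = np.linalg.eigh(D)       # eigenvalues in ascending order
    G = eigvecs[:, -2:]                        # two leading eigenvectors
    side = np.argmax(np.abs(G), axis=1)        # dominant column per class
    left_classes = np.where(side == 0)[0]
    right_classes = np.where(side == 1)[0]
    return left_classes, right_classes, G
```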
4.2.6 MF-Tree Induction Algorithm

The pseudocode for constructing MF-Tree is summarized in Algorithm 2. The algorithm recursively partitions each node by calling the TreeGrowth function. Each invocation of the function returns the root node of a subtree that partitions the classes and their associated training instances into 2 groups. Each node is assumed to have a composite structure, where v.index contains the indices of the training instances assigned to the node, v.class represents the class label if v is a leaf node, v.G is the discriminant function if v is an internal node, and v.left and v.right are its left and right children.

The Partition function decides how to partition each class and its training examples into the different branches of the tree. The partitioning is done based on the magnitude of the matrix G, as described in the previous section. If a class k is assigned to the left child of a node, all the training instances that belong to class k are also propagated to the left child. The algorithm continues to expand a node until all the training instances associated with the node belong to the same class. To determine its time complexity, note that at each node v, we need to compute the kernel matrix K, an operation that requires O(n^2 d) time. Computing D from the kernel matrix K and the class labels Y takes O(n^2 c + c^2 n) time, while finding its first two eigenvectors requires, in the worst case, O(c^3) time. Thus, the computational cost at each node is O(n^2 c + c^2 n + c^3). The cost reduces significantly as the tree grows deeper, since the sizes of the matrices K and D decrease substantially. Once the tree structure is induced from the training set, we can predict the label for any test instance using the method presented in Algorithm 3.

Algorithm 3 Testing with MF-Tree
Input: X_test, X_train, Y_train, T
Output: Y_test
1. for each x in X_test do
2.   v = T.root
3.   while v.class = empty do
4.     k_hat = Similarity(x, X_train, v.index)
5.     Y_hat = Subset(Y_train, v.index)
6.     G_hat = v.G
7.     p_hat = G_hat^T Y_hat^T k_hat
8.     v = getChild(p_hat, v)
9.   end while
10.  Insert y = v.class into Y_test
11. end for
12. return Y_test
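A minimal Python/NumPy sketch of the routing loop in Algorithm 3 is shown below. The node fields (index, classes, cls, G, left, right) mirror the node structure described above; `classes` (the columns of the label matrix present at the node) and `cls` (standing in for v.class, a reserved word in Python) are naming assumptions added here so the shapes line up, and a linear kernel stands in for the generic Similarity function.

```python
import numpy as np

def predict(x, X_train, Y_train, root):
    """Route a single test instance down an induced MF-Tree and return the
    class label stored at the leaf it reaches."""
    v = root
    while v.cls is None:                            # not yet at a leaf
        k_hat = X_train[v.index] @ x                # similarities to the node's training instances
        Y_hat = Y_train[np.ix_(v.index, v.classes)] # labels restricted to the node's classes
        p_hat = v.G.T @ Y_hat.T @ k_hat             # 2-dimensional partition vector
        v = v.left if np.argmax(p_hat) == 0 else v.right
    return v.cls
```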
4.3 Approximate MF-Tree

As shown in the previous section, the time complexity for constructing MF-Tree can be quite expensive, especially at the top levels of the hierarchy. This is due to the large size of the matrices that must be constructed and factorized according to their largest two eigenvalues. However, as we traverse down the tree, the number of classes and training instances to be dealt with decreases considerably, which in turn speeds up the computations. Thus, the bottleneck of our computation lies in the first few iterations of the MF-Tree algorithm. In this section, we present an approach to speed up the tree construction process by learning an approximation of the G matrix during the first few iterations. Once the matrices become small enough, we proceed with the original TreeGrowth function shown in Algorithm 2. Our proposed method attempts to balance the trade-off between accuracy and efficiency of the labeled tree construction process. While efficiency is a concern at the top levels of the tree, accuracy is an issue at the lower levels of the tree. Even if the class partitioning at the top levels is suboptimal, one might still be able to recover good partitions at subsequent iterations when refining the tree. Based on this rationale, we develop the following two-step approach to improve training efficiency. First, we apply a simple method to calculate G at the top levels of the tree. When the depth of the tree exceeds some threshold tau, we revert to the original TreeGrowth function. We term this approach Approximate MF-Tree.

Consider the root node of the labeled tree. Let n be the number of training instances and c be the number of classes. We first sort the classes in decreasing order of their class size. We then assign the largest class (c1) to the left child of the root and the second largest class (c2) to the right child. The sorting operation requires only O(c log c) computations. For the remaining classes, we compute their corresponding mean vectors, an operation that requires O(nd) time. We then assign each class to the left or right child based on the distance of its mean vector to the mean vectors of the two largest classes. For example, if the mean vector for c3 is closer to the mean vector for c1 than to that of c2, we assign c3 to the left child of the root node. Otherwise, we assign it to the right child. This process is repeated until depth tau, and it reduces the time complexity of each node from O(c^3) to O(c log c + nd). As will be shown in our experimental results section, a substantial improvement in training time can be achieved by setting tau to 3 or 4 without losing significant accuracy.

Algorithm 4 Approximate MF-Tree
Input: X_train in R^{n x d}, Y_train in R^{n x c}, tau
Output: Tree structure T
1. Set depth, d = 0
2. root = TreeGrowthApprox(X_train, Y_train, d, tau)
3. Insert root into T
4. return T

function TreeGrowthApprox(X, Y, d, tau)
1. v = createNode()
2. v.index = getIndex(X)
3. if d < tau and number of classes in Y >= 2 then
4.   (X^(l), Y^(l), X^(r), Y^(r)) = Partition2(X, Y)
5.   v.left = TreeGrowthApprox(X^(l), Y^(l), d + 1, tau)
6.   v.right = TreeGrowthApprox(X^(r), Y^(r), d + 1, tau)
7. else
8.   v = TreeGrowth(X, Y)
9. end if
10. return v

function Partition2(X, Y)
1. classes = Sort(Y)
2. (X^(l), Y^(l)) = {(x_i, y_i) : y_i = c1 (largest class)}
3. (X^(r), Y^(r)) = {(x_i, y_i) : y_i = c2 (second largest class)}
4. meanvectors = Mean(X, Y)
5. for each remaining class c in classes do
6.   m1 = distance(meanvectors, c, c1)
7.   m2 = distance(meanvectors, c, c2)
8.   if m1 <= m2 then
9.     (X^(l), Y^(l)) = (X^(l), Y^(l)) union {(x_i, y_i) : y_i = c}
10.  else
11.    (X^(r), Y^(r)) = (X^(r), Y^(r)) union {(x_i, y_i) : y_i = c}
12.  end if
13. end for
14. return (X^(l), Y^(l), X^(r), Y^(r))
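The inexpensive Partition2 step of Algorithm 4 can be sketched directly in Python/NumPy. The sketch below assumes integer class labels in a vector y and Euclidean distance between class mean vectors; it returns the two lists of class labels rather than the data subsets, which is an illustrative simplification.

```python
import numpy as np

def approx_partition(X, y):
    """Seed the two children with the two largest classes, then assign every
    other class to the side whose seed centroid is nearer."""
    classes, counts = np.unique(y, return_counts=True)
    order = classes[np.argsort(-counts)]                # classes by decreasing size
    means = {c: X[y == c].mean(axis=0) for c in classes}
    left, right = [order[0]], [order[1]]                # two largest classes
    for c in order[2:]:
        d1 = np.linalg.norm(means[c] - means[order[0]])
        d2 = np.linalg.norm(means[c] - means[order[1]])
        (left if d1 <= d2 else right).append(c)
    return left, right
```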
4.4 Experimental Evaluation

In this section, we compare the performance of our proposed MF-Tree algorithm against several state-of-the-art baseline algorithms, including DAGSVM [15], Discriminative Relaxed Hierarchy Learning (DRHL) [56], the confusion matrix based label embedding tree learning approach (CMTL) [18], and Recursive Non-negative Matrix Factorization (RNMF) [20]. In addition to these methods, we have also implemented a recursive version of the HSIC clustering by dependence maximization (RDM) algorithm [23]. The latter partitions the data instances into p groups at each depth of the tree until a stopping criterion is met (i.e., when there are no other classes with more than 5 instances at the leaf node). We construct a binary classifier at each internal node of the tree as well as 1-class SVM classifiers at the leaf nodes to assign the class labels.

4.4.1 Experimental Setup

To illustrate the effectiveness of the proposed method, our experiments were performed on two real-world data sets: (1) Caltech-256 (http://authors.library.caltech.edu/7694/), a standard multi-class classification data set with 256 categories containing 30,607 images and at least 80 images per class, and (2) Wiki, a collection of Wikipedia articles from the dump of October 9, 2009. We generated four subsets with varying numbers of classes for our experiments. The data sets are denoted as Wiki1, Wiki2, Caltech1, and Caltech2. Caltech1 and Caltech2 are subsets of image data from the Caltech-256 collection. Caltech1 is generated using the same approach as described in [56], in which we randomly sample 80 images from each class to create the data set. We use a repeated holdout method to create our training and test sets. Specifically, half of the sampled data were reserved for training while the remaining half were used for testing. We repeat the sampling process five times to create five different versions of the Caltech1 data set. We then applied all the algorithms to the five data sets and computed their average F1-scores. The F1-score is computed as the harmonic mean of the precision (positive predictive value) and recall (sensitivity) for each class. Results are reported based on the average F1-score over all classes and over the five versions of the data sets. In addition, we generated a larger subset called Caltech2 by randomly choosing 40 images from each class for training and using all the remaining images in each class for testing. This process is again repeated five times to generate five versions of Caltech2. For both Caltech1 and Caltech2, we extracted SIFT (scale-invariant feature transform) features to represent each image. Wiki1 is a data set generated with the same setting as that described in [20]. Here, we choose the largest 214 categories of the Wikipedia articles; the resulting data set contains 24,378 articles. In addition, we also created a larger sample data set from the Wikipedia dump, which covers the largest 1,618 categories with 65,156 articles. Similar to Caltech1, half of the sampled data were used for training and the remaining half for testing. The sampling process is also repeated five times to generate different versions of the data sets. A summary of the data sets used for our experiments is given in Table 4.1. All the experiments were conducted on a single PC with an Intel Core i7 CPU at 2.6 GHz and 8 GB of memory, running the Windows 7 operating system.

Table 4.1 Summary of data used for our experiments
Data set    # Classes    # Data instances
Caltech1    256          20480
Caltech2    256          30607
Wiki1       214          24378
Wiki2       1618         65156

4.4.2 Experimental Results

This section presents summary results of our experiments. As previously noted, we use the F1-score to measure the accuracy of the different approaches. In addition, we compare the testing time of each method along with the size of the induced tree.

4.4.2.1 Overall Results Comparison

We first compare the performance of our MF-Tree algorithm against all the baseline methods. The results for the data sets Caltech1, Caltech2, Wiki1, and Wiki2 are summarized in Table 4.2. The experimental results suggest that the proposed MF-Tree algorithm consistently outperforms the other state-of-the-art baselines in terms of F1-score and testing efficiency on all 4 data sets evaluated in this study. The size of the hierarchy generated by MF-Tree is comparable to some of the best algorithms. Even though RNMF generates a shorter tree than MF-Tree, its test time is higher because the leaf nodes of RNMF may contain multiple classifiers that must all be invoked in order to predict the final class. In contrast, each leaf node of MF-Tree has a single class label, which allows us to predict the class efficiently. Among all the competing methods, DAGSVM creates the largest tree. This is because it eliminates one class at a time at each level of the hierarchy.
Consequently, the depth of the DAGSVM tree is equal to the number of classes. Meanwhile, DRHL and RDM also create larger trees because they both allow a single class to reside in more than one leaf node. This affects the test time of these algorithms.

4.4.2.2 Effect of Parameter Tuning

Unlike MF-Tree, which is parameter-free, the performance of several baseline algorithms depends on the values of their parameters. For example, the branching factor p in the RNMF method determines the structure and depth of the tree. If p = 1, RNMF produces a decision stump consisting of c 1-class SVM models, which is equivalent to the one-versus-all approach. If p = 2, it produces a binary tree. The choice of p also affects the depth of the tree: as the depth increases, the testing time decreases significantly at the expense of lower F1 values. Similarly, tree construction parameters are also needed in the other baseline methods. To provide a fair comparison, we vary the parameters of each method based on the suggestions of their original papers. Specifically, for RNMF [20] we set p in {2, 3, 4, 5}, for RDM we set p in {2, 3, 4, 5}, and for DRHL [56] we set rho in {0.5, 0.6, 0.7, 0.8}. We plotted the results in Figures 4.4, 4.5, 4.2, and 4.3. In each plot, the x-axis corresponds to the test time while the y-axis corresponds to the F1-score. Ideally, we seek a classifier with the lowest test time and the highest F1-score. As can be seen from these plots, MF-Tree is consistently better than the baseline methods on all four data sets.

Table 4.2 Experimental Results

Caltech1     F1                   Test Time (s)      Depth of Tree
DAGSVM       0.369 +/- 0.0019     436.8 +/- 10.2     256
RNMF         0.360 +/- 0.0022     91.2 +/- 6.4       9
DRHL         0.375 +/- 0.003      312.76 +/- 9.6     196
CMTL         0.348 +/- 0.006      122.18 +/- 10.4    10
RDM          0.331 +/- 0.0103     198.2 +/- 11.2     18
MF-Tree      0.3816 +/- 0.03      77.5 +/- 4.5       10

Caltech2     F1                   Test Time (s)      Depth of Tree
DAGSVM       0.356 +/- 0.0017     751.68 +/- 11.8    256
RNMF         0.353 +/- 0.0017     157.66 +/- 7.9     9
DRHL         0.365 +/- 0.0021     523.7 +/- 12.2     196
CMTL         0.343 +/- 0.005      192.8 +/- 11.2     10
RDM          0.323 +/- 0.0098     367.4 +/- 10.5     18
MF-Tree      0.376 +/- 0.029      135.6 +/- 5.6      10

Wiki1        F1                   Test Time (s)      Depth of Tree
DAGSVM       0.5309 +/- 0.002     942.9 +/- 17.6     214
RNMF         0.5227 +/- 0.0017    188.3 +/- 9.3      8
DRHL         0.5217 +/- 0.0019    457.5 +/- 16.9     64
CMTL         0.4826 +/- 0.0017    195.5 +/- 9.2      10
RDM          0.4229 +/- 0.0135    256.6 +/- 14.7     33
MF-Tree      0.5406 +/- 0.0216    152.7 +/- 4.2      9

Wiki2        F1                   Test Time (s)      Depth of Tree
DAGSVM       0.3106 +/- 0.0021    1618.6 +/- 17.6    1618
RNMF         0.2998 +/- 0.0017    351.8 +/- 13.6     10
DRHL         0.3089 +/- 0.0016    772.5 +/- 16.9     116
CMTL         0.2459 +/- 0.0029    312.2 +/- 11.3     12
RDM          0.2296 +/- 0.022     456.6 +/- 12.5     52
MF-Tree      0.3318 +/- 0.016     267.7 +/- 6.9      11

[Figure 4.2: Performance comparison on the Caltech1 data (F1-score vs. testing time).]
[Figure 4.3: Performance comparison on the Caltech2 data (F1-score vs. testing time).]
[Figure 4.4: Performance comparison on the Wiki1 data (F1-score vs. testing time).]
[Figure 4.5: Performance comparison on the Wiki2 data (F1-score vs. testing time).]

4.4.3 Comparison of Taxonomy Structure

In addition to their F1-scores, it is useful to compare the trees generated by the different methods against their ground truth structure. To do this, we need an evaluation measure for comparing the similarity between two trees.
In this chapter, we apply the edit distance measure, which was originally used for string comparison, to compare the ordered labeled trees (see [65] for a review). Ordered labeled trees are trees in which the left-to-right order among siblings is significant. The distance between two trees is computed by considering the optimal mapping between the two trees. Specifically, the distance is given by the minimum cost of the elementary operations needed to convert one tree into the other. An alternative way to map and edit the trees is by using tree alignment [66]. To quantify the difference between two trees based on their pairwise class ordering, we first construct an adjacency matrix A for a tree structure using the following formula:

A_{ij} = \frac{1}{\mathrm{numhops}(c_i \rightarrow c_j)}

where A_{ij} is the proximity between classes c_i and c_j and numhops(c_i -> c_j) is the number of hops needed to traverse from class c_i to class c_j. To measure the difference between two tree structures represented by the matrices A and A-hat, we compute the mean squared error (MSE), which is defined as follows:

MSE(A, \hat{A}) = \frac{1}{n} \sum_{i,j=1}^{n} (A_{ij} - \hat{A}_{ij})^2

Figure 4.7 shows an example of using MSE to compare tree structures. Let A be the adjacency matrix induced from the ground truth structure, while A-hat_result1 and A-hat_result2 are the corresponding adjacency matrices for two competing tree structures, result1 and result2. Since MSE(A, A-hat_result1) = 0.03125 > MSE(A, A-hat_result2) = 0.013825, this suggests that the tree structure result2 is closer to the ground truth structure than result1. This result makes sense because the structure of result2 is more similar to the ground truth: only class c2 is misplaced, whereas both c2 and c3 are misplaced in result1. For example, A(c3, c4) = 0.5 in the ground truth structure, as the number of hops from category c3 to c4 is equal to two. However, A-hat_result1(c3, c4) = 0.25 for result1, since class c3 is 4 hops away from c4. In contrast, A-hat_result2(c3, c4) = 0.5, which is the same as the distance according to the ground truth, which means the tree structure of result2 was able to maintain the proximity between c3 and c4.

[Figure 4.7: Example of mapping tree structures to adjacency matrices.]
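The two quantities used above are straightforward to compute. The Python/NumPy sketch below assumes the pairwise hop counts between leaf classes have already been extracted from the tree (that extraction step is omitted here); the function names are illustrative.

```python
import numpy as np

def adjacency_from_hops(hops):
    """Proximity matrix A_ij = 1 / numhops(c_i -> c_j) from a matrix of
    pairwise hop counts between leaf classes (zero entries are left at 0)."""
    A = np.zeros_like(hops, dtype=float)
    mask = hops > 0
    A[mask] = 1.0 / hops[mask]
    return A

def tree_mse(A, A_hat):
    """MSE between two tree structures represented by adjacency matrices,
    normalized by n as in the definition above."""
    n = A.shape[0]
    return np.sum((A - A_hat) ** 2) / n
```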
We compared the MSE score of MF-Tree against the confusion matrix based label embedding tree learning (CMTL) method. We chose this baseline method for two reasons. First, the size of its tree is comparable to MF-Tree. Second, similar to MF-Tree, this approach produces a tree in which each class can only reside in a single leaf node (whereas other methods allow a class to be assigned to multiple leaf nodes). Trees that restrict each class to a single leaf node are more interpretable, as they may be used to define a concept hierarchy for the application domain. Our experimental results, shown in Figure 4.6, suggest that MF-Tree produces a tree structure that is closer to the ground truth compared to CMTL. Note that Caltech1 and Caltech2 use the same set of classes in their training data to construct the taxonomy, so we report them together as Caltech1&2 in Figure 4.6.

[Figure 4.6: MSE comparison of the tree produced by MF-Tree against the one produced by CMTL.]

[Figure 4.8: Illustration of a subset of the tree structure generated by MF-Tree on the Caltech1 data set.]

Figure 4.8 shows a subset of the tree structure generated by MF-Tree on the Caltech1 data set. Observe that the tree was able to capture the relationships among many of the classes quite well. For example, some of the animal classes (dog, horse, chimp, bear) were assigned to the same branch. Although there were exceptions and misclassifications (e.g., bats and crabs), the related classes are often quite close together at the lower levels of the tree.

4.4.4 Results for Approximate MF-Tree

Despite its lower test time compared to the other approaches, MF-Tree can still be expensive if the number of classes is very large. To overcome this problem, we proposed an approximate MF-Tree learning algorithm. To justify the effectiveness of the approximate MF-Tree approach, we perform an experiment on the Wiki2 data set. We choose this data set because it has the largest number of classes. As previously mentioned in Section 4.3, the approximate MF-Tree method uses an inexpensive partitioning method to assign classes to their child nodes at the top levels of the hierarchy. The algorithm requires a single parameter, tau, which determines the maximum depth at which to apply the inexpensive partitioning method. The results comparing Approximate MF-Tree against MF-Tree for different thresholds tau are summarized in Table 4.3. For this experiment, we vary tau from 3 to 6 and compare the average F1-score as well as the training time of the algorithms. Recall that the average F1-score for MF-Tree on the Wiki2 data set is 0.3318. When tau = 3, the F1-score reduces slightly to 0.3289; however, there was a 22.6% improvement in the training time of the algorithm. As tau increases, the F1-score gradually decreases while the training time continues to improve. Note that the tree generated by the original MF-Tree algorithm has a depth equal to 11. When tau = 6, the improvement in training time appears to taper off while the F1-score has decreased to 0.2868. This clearly shows a trade-off between accuracy and training time efficiency. In practice, we could set the tau threshold based on the specific needs of the problem. If our priority is lower training cost, we could set a higher tau value. tau can also be set based on the computational resources available: if memory is limited, tau can be set so that the resulting data matrices fit into the available memory.

Table 4.3 Approximate MF-Tree on Wiki2
Threshold                        tau=3    tau=4    tau=5    tau=6
F1                               0.3289   0.3169   0.3034   0.2868
% improvement in training time   22.6%    32.3%    37.8%    39.5%

4.5 Conclusion

In this chapter, we proposed a novel hierarchical method for large-scale multi-class learning called MF-Tree. Our proposed algorithm is driven by a global objective function. We demonstrated the equivalence between the global objective function and the HSIC metric and showed that it has an additive property that can be exploited to design a hierarchical tree learning algorithm.
Experimental results comparing the method against five baseline methods demonstrated the effectiveness of our method in terms of both accuracy and test efficiency.

Chapter 5 Hierarchical Tree Learning for Large Multi-class Network

As noted in Chapter 1, classification of network data is an important problem across many disciplines, including computer science, biology, chemistry, and social science. Previous work on network classification can be broadly categorized into three approaches.

Classification with node attributes only [30][31]. The methods in this category simply apply existing classifiers, such as support vector machines, rule-based classifiers, and logistic regression, to the attributes of each node in the network (e.g., the HTML tags in Web pages). Many efforts have been made to enhance the classification accuracy by constructing a better feature matrix for the nodes [30][31][67]. A major drawback of these methods is that they do not consider the link information when constructing the classifier.

Classification based on link analysis. This includes PageRank-based methods [34][35], HITS [33], local and global consistency [36], score propagation [37], hard label propagation [68], trust rank [38], anti-trust rank [35], minimum cut [69][70], spectral clustering [71], etc. This approach is based on the fundamental assumption that nodes that are connected should share similar labels. As a result, the network classification task can be formulated as a regularized optimization problem. The drawback of link-analysis-based approaches is that a classification that relies only on the link structure is susceptible to missing link information.

Classification based on the use of both content and link information. This includes the hybrid methods developed in [39][72][40][41][27][42][25]. Most of these studies have shown that hybrid methods are generally better than using node attributes or link information alone.

Despite the extensive research on node classification for network data, most of these approaches were not designed for large multi-class learning problems. They encounter challenges such as class imbalance, high testing cost, and model interpretability problems when applied to networks with a large number of classes. Furthermore, as noted in Chapter 1, the assumption that nearby nodes have similar classes is less likely to hold when the number of classes is large. The size of the network also increases considerably as the number of classes increases, all of which makes large multi-class network learning an extremely challenging problem. As discussed in the previous two chapters, hierarchical learning methods have been developed [15, 16, 17, 18, 19] to overcome the challenges of extreme multi-class classification problems. However, as all of these methods were designed for non-network data, extending them to a hybrid network classification approach (using both node attributes and link structure information) is not an easy task. Furthermore, although open source tools are available for large-scale network analysis, they are not designed for node classification tasks. Pegasus [73] (http://www.cs.cmu.edu/ pegasus/index.htm), for example, was designed to analyze networks with billions of nodes and edges by providing functionalities to compute the diameter of the graph and the PageRank scores of nodes, as well as to find connected components using techniques such as spectral clustering. The techniques developed in Pegasus rely on the link topology only, which overlooks the importance of node attributes in the analysis.
As a result, there is an urgent need for new techniques to address the challenges in large multi-class network learning.

In the previous chapter, we presented a framework for multi-class learning known as MF-Tree. This chapter presents an extension of the framework for large multi-class learning in network data. The modified framework systematically combines node attributes with link information to learn the hierarchical structure of the classes from network data. Specifically, we recursively optimize a global objective function to learn a binary discriminant function that partitions the classes based on the node attributes and link structure until each partition contains only a single class. During testing, a test example is routed from the root to one of the leaf nodes in the tree structure, and the label associated with that leaf node is assigned as the predicted class label. To address the efficiency issue of large-scale network learning, our framework offers two options. First, it has an approximate version, which allows the partitions at the top levels of the tree to be determined without requiring the expensive matrix factorization computation. Similar to the approach described in the previous chapter, the framework allows us to choose the proper parameter setting (τ) based on the specific needs of the problem. If our priority is to achieve high training efficiency, we could apply the approximate MF-Tree with a larger τ. Otherwise, we could use the regular MF-Tree to obtain the best classification accuracy. Second, the proposed approach is trivially parallelizable. A MapReduce implementation of the framework for network data has been developed to improve its scalability to larger networks. As proof of concept, the MapReduce framework was applied to the Wikipedia network data, which contains more than 600K classes. To the best of our knowledge, this is the largest multi-class network learning problem that has ever been reported in the literature.

5.1 Methodology

Let N = (V, E, D) be a network of order n, where V = {v_1, v_2, ..., v_n} is the set of nodes, E ⊆ V × V is the set of edges, and D is the set of data instances associated with V. Each instance in D is characterized by the tuple (x, y), where x ∈ ℜ^d is a d-dimensional feature vector (known as the node attribute) and y ∈ {1, 2, ..., c} is its class label. The goal of multi-class network learning is to induce a labeling function f(V, E, D) that minimizes the expected misclassification rate for any instance drawn from network N, i.e., \min_f \mathbb{E}_{(x,y)\sim D}[\ell(f(x), y)], where \ell is the loss function. The objective function used as an estimate of the regularized empirical loss function is described in the next subsection.

5.1.1 Global Objective Function

Let X be an n × d centered design matrix, whose rows correspond to the data instances and whose columns correspond to the node attributes. An uncentered design matrix \hat{X} can be easily centered by applying the matrix operation H_n \hat{X}, where H_n is the centering matrix defined as H_n = I - \frac{1}{n} 1_n 1_n^T. Let A denote the n × n adjacency matrix for the link structure of the network.
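For concreteness, the following is a minimal sketch (not taken from the thesis implementation; the function name and toy data are illustrative) of how an uncentered design matrix can be centered with H_n before it is used in the factorization below.

import numpy as np

# Minimal sketch: center an uncentered design matrix X_hat using the
# centering matrix H_n = I - (1/n) 1_n 1_n^T defined above.
def center_design_matrix(X_hat):
    n = X_hat.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ X_hat                     # equivalent to subtracting each column's mean

X_hat = np.random.rand(5, 3)             # toy data: 5 nodes with 3 attributes
X = center_design_matrix(X_hat)
assert np.allclose(X.mean(axis=0), 0.0)  # every column of X now has zero mean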
Our MF-Tree for networks is then designed to optimize the following regularized loss function for an undirected graph:

J = \|X - LW^T\|_F^2 + \lambda_1 \|A - LPL^T\|_F^2 + \lambda_2 \|W\|_F^2 + \lambda_3 \|PL^T\|_F^2
  = \mathrm{tr}\big[ XX^T - XWL^T - LW^TX^T + LW^TWL^T \big]
    + \lambda_1 \mathrm{tr}\big[ AA^T - ALPL^T - LPL^TA^T + LPL^TLPL^T \big]
    + \lambda_2 \mathrm{tr}\big[ W^TW \big] + \lambda_3 \mathrm{tr}\big[ PL^TLP^T \big]        (5.1)

where L is an n × c matrix, which indicates the class membership of each data instance, W is a d × c matrix, which represents the feature vector of each class, and P is a c × c class covariance matrix, which represents the correlation between classes. \lambda_1, \lambda_2 and \lambda_3 are the regularization parameters. Intuitively, the objective function seeks to approximate the matrices X and A by the low-rank approximations given by the matrix products LW^T and LPL^T, respectively. The first two terms in the objective function measure the error in the approximation, while the last two terms are regularizers for the matrices W and PL^T. The purpose of optimizing this objective function is to jointly factorize the centered data matrix X and the adjacency matrix A to learn L and W. For a directed graph, we can write PL^T = U^T and reduce the objective function as follows:

J = \|X - LW^T\|_F^2 + \lambda_1 \|A - LU^T\|_F^2 + \lambda_2 \|W\|_F^2 + \lambda_3 \|U\|_F^2
  = \mathrm{tr}\big[ XX^T - XWL^T - LW^TX^T + LW^TWL^T \big]
    + \lambda_1 \mathrm{tr}\big[ AA^T - AUL^T - LU^TA^T + LU^TUL^T \big]
    + \lambda_2 \mathrm{tr}\big[ W^TW \big] + \lambda_3 \mathrm{tr}\big[ U^TU \big]        (5.2)

5.1.2 Relationship between Proposed Method and HSIC

As discussed in the previous chapter, the Hilbert-Schmidt Independence Criterion (HSIC) [21][22][23][24] is a recently proposed unsupervised learning measure for taxonomy learning. We have provided a theoretical proof to demonstrate the equivalence between the objective function of MF-Tree and HSIC. Similarly, in this section, we first show that the network extension of MF-Tree is also equivalent to HSIC. Given the additive property of HSIC, the task of large multi-class network learning can be decomposed into a set of binary classification tasks, which can be organized in a hierarchical tree structure.
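As a concrete reference for the quantity being minimized, the following is a minimal sketch (assumed toy shapes and random data, not the thesis code) that evaluates the regularized loss of Equation (5.1) for given factors L, W and P.

import numpy as np

# Minimal sketch: evaluate the regularized loss in Equation (5.1)
# for given factor matrices L (n x c), W (d x c) and P (c x c).
def mf_tree_network_objective(X, A, L, W, P, lam1, lam2, lam3):
    term_attr = np.linalg.norm(X - L @ W.T, 'fro') ** 2      # ||X - L W^T||_F^2
    term_link = np.linalg.norm(A - L @ P @ L.T, 'fro') ** 2  # ||A - L P L^T||_F^2
    reg_W = np.linalg.norm(W, 'fro') ** 2                    # ||W||_F^2
    reg_PL = np.linalg.norm(P @ L.T, 'fro') ** 2             # ||P L^T||_F^2
    return term_attr + lam1 * term_link + lam2 * reg_W + lam3 * reg_PL

# Toy dimensions: n = 6 nodes, d = 4 attributes, c = 3 classes.
rng = np.random.default_rng(0)
n, d, c = 6, 4, 3
X = rng.standard_normal((n, d))
A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
A = A + A.T                                                  # undirected adjacency matrix
L, W, P = rng.random((n, c)), rng.random((d, c)), rng.random((c, c))
print(mf_tree_network_objective(X, A, L, W, P, lam1=1.0, lam2=0.1, lam3=0.1))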
To demonstrate the relationship between our global objective function and the Hilbert-Schmidt Independence Criterion (HSIC), we first take the partial derivatives of J with respect to W and U and set them to zero:

\frac{\partial J}{\partial W} = -2X^TL + 2WL^TL + 2\lambda_2 W = 0
  \;\Rightarrow\; W = X^TL\,[L^TL + \lambda_2 I]^{-1}

\frac{\partial J}{\partial U} = \lambda_1(-2A^TL + 2UL^TL) + 2\lambda_3 U = 0
  \;\Rightarrow\; U\,[\lambda_1 L^TL + \lambda_3 I] = \lambda_1 A^TL
  \;\Rightarrow\; U = A^TL\,\big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1}

Substituting these back into Equation (5.2), we have:

J = \mathrm{tr}\Big\{ XX^T - 2XX^TL[L^TL + \lambda_2 I]^{-1}L^T + L[L^TL + \lambda_2 I]^{-1}L^TXX^TL[L^TL + \lambda_2 I]^{-1}L^T \Big\}
    + \lambda_1 \mathrm{tr}\Big\{ AA^T - 2AA^TL\big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1}L^T + L\big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1}L^TAA^TL\big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1}L^T \Big\}
    + \lambda_2 \mathrm{tr}\Big\{ [L^TL + \lambda_2 I]^{-1}L^TXX^TL[L^TL + \lambda_2 I]^{-1} \Big\}
    + \lambda_3 \mathrm{tr}\Big\{ \big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1}L^TAA^TL\big[L^TL + \tfrac{\lambda_3}{\lambda_1} I\big]^{-1} \Big\}

Using the cyclic property of the trace to combine the quadratic terms with the regularization terms, and assuming \lambda_1 = 1 and \lambda_2 = \lambda_3 = \lambda, J can be further simplified as:

J = \mathrm{tr}\Big\{ XX^T - XX^TL[L^TL + \lambda I]^{-1}L^T \Big\} + \mathrm{tr}\Big\{ AA^T - AA^TL[L^TL + \lambda I]^{-1}L^T \Big\}
  = \mathrm{tr}\Big\{ \big(XX^T + AA^T\big) - \big(XX^T + AA^T\big)\,L[L^TL + \lambda I]^{-1}L^T \Big\}

The term XX^T + AA^T can be regarded as the joint kernel matrix of both the content and the link structure, which is already known given the design matrix and the adjacency matrix. Thus, finding a solution that minimizes J is equivalent to solving the following trace maximization problem:

\max \; \mathrm{tr}\big[ \big(XX^T + AA^T\big)\,L[L^TL + \lambda I]^{-1}L^T \big]        (5.3)

Let K = XX^T + AA^T. (To save space and running time, we keep K as a sparse matrix; in our implementation, we only maintain the elements of XX^T with value larger than 0.3, and this threshold can be adjusted based on the specific needs of the problem.) We can enforce the following constraint during training:

L_{\mathrm{train}}\,[L_{\mathrm{train}}^TL_{\mathrm{train}} + \lambda I]^{-1}L_{\mathrm{train}}^T = YCY^T

where Y ∈ ℜ^{n×c} is the class label matrix and C ∈ ℜ^{c×c} is the tree-structure covariance matrix. We can then simplify Equation (5.3) as follows:

\max_{C} \; \mathrm{tr}\big[ KYCY^T \big]        (5.4)

where the covariance matrix C associated with the label tree structure is jointly estimated from the link structure and the node attribute information of the network. C_{ij} is a measure of similarity between the i-th and j-th classes; the larger the value of C_{ij}, the closer the relationship between the two classes. Thus, we have shown that optimizing our regularized objective function for network learning is equivalent to the optimization of HSIC.

5.1.3 Additive Property of HSIC

As shown in the previous chapter for MF-Tree, the tree-structure covariance matrix has an additive property, which can be formally written as follows:

C_{2^m} = B_2 + B_4 + \cdots + B_{2^m},        (5.5)

where B_{2^j} = I_{2^j} \otimes 1_{2^{m-j}} 1_{2^{m-j}}^T, 1_{2^0} = 1, and m is the depth of the binary tree. Thus, we can write:

YCY^T = \sum_{i=1}^{\log_2 k} Y B_{2^i} Y^T = \sum_{i=1}^{\log_2 k} \sum_{j=1}^{2^{i-1}} Y_j B_{2^i, j} Y_j^T

This additive property enables us to solve the optimization problem with a hierarchical tree-like decomposition.
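To make the additive structure tangible, the following is a minimal sketch (toy sizes only, not the thesis code) that builds the block matrices B_{2^j} of Equation (5.5) for a complete binary tree over k = 2^m classes and checks that Y C Y^T decomposes level by level into the corresponding sum, which is the property exploited in the next subsection.

import numpy as np

# Minimal sketch: construct B_{2^j} = I_{2^j} kron (1 1^T) for a complete
# binary tree of depth m over k = 2^m classes (Equation 5.5).
def tree_block(j, m):
    ones_block = np.ones((2 ** (m - j), 2 ** (m - j)))
    return np.kron(np.eye(2 ** j), ones_block)

m = 3                                    # depth of the binary tree
k = 2 ** m                               # number of classes
blocks = [tree_block(j, m) for j in range(1, m + 1)]
C = sum(blocks)                          # C_{2^m} = B_2 + B_4 + ... + B_{2^m}

# Y C Y^T then splits into one term per tree level, so the objective can be
# optimized one level (and, further, one internal node) at a time.
rng = np.random.default_rng(1)
n = 20
Y = np.eye(k)[rng.integers(0, k, size=n)]   # one-hot label matrix, n x k
assert np.allclose(Y @ C @ Y.T, sum(Y @ B @ Y.T for B in blocks))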
Instead of learning C directly by maximizing the objective function in (5.4), we can learn a set of block matrices B_{2^i, j} iteratively, each of which is associated with an internal node of the tree structure.

5.1.4 Hierarchical Tree Construction from Network Data

To construct the tree, we need to solve the objective function given in (5.4). Using the tree decomposition approach given in Equation (5.5), we simplify this to

\max_{\{B\}} \sum_{i,j} \mathrm{tr}\big[ K_{j,j} Y_j B_{i,j} Y_j^T \big]        (5.6)

where the optimization problem can be solved iteratively, one block matrix B at a time. Similar to MF-Tree, by setting C_j = Y_j^T K_{j,j} Y_j, our tree construction step is equivalent to finding a matrix B_{i,j} that maximizes its alignment to C_j:

\max_{\{B\}} \sum_{i,j} \mathrm{tr}\big[ C_j B_{i,j} \big]        (5.7)

For brevity, we omit the subscript (i, j) in the remainder of this section. Since B is a symmetric matrix, let B = GG^T, where G ∈ ℜ^{c×2}. It is easy to show that

\mathrm{tr}\big[ CB \big] = \mathrm{tr}\big[ CGG^T \big] = \mathrm{tr}\big[ G^TCG \big].

Following the same approach as MF-Tree, we can relax the condition that the elements of G must be either 0 or 1 and instead require G to be an orthogonal matrix:

\max_{G} \; \mathrm{tr}\big[ G^TCG \big] \quad \text{s.t.} \quad G^TG = I_2

This optimization problem can be easily solved by finding the first two eigenvectors of C. After performing the eigenvalue decomposition of C, we can write:

C = G\Lambda G^T        (5.8)

Once the matrix G is known, we can apply the following equation to determine the partition each test example should follow:

\hat{p} = G^T Y^T \hat{k},

where \hat{k} is the kernel similarity between the test example and the training examples. This is equivalent to the formula used for predicting the branch to follow in the MF-Tree framework described in the previous section. After computing \hat{p}, we assign the test node to the partition with the largest value. We summarize the hierarchical tree construction procedure in Algorithm 5 and the node testing procedure in Algorithm 6.

Algorithm 5 Hierarchical Tree Construction from Network Data
Input: G = (V, E, D), (X_train, Y_train) ⊂ D, X_train ∈ R^{n×d}, Y_train ∈ R^{n×k}
Output: Tree structure, T
1. Construct A from G
2. root = TreeGrowth(X_train, Y_train, A)
3. Insert root into T
4. return T

function TreeGrowth(X, Y, A)
1. v = createNode()
2. v.index = getIndex(X)
3. if number of classes in Y < 2 then
4.     v.class = unique(Y)
5. else
6.     (X^(l), Y^(l), X^(r), Y^(r), v.g) = Partition(X, Y, A)
7.     v.left = TreeGrowth(X^(l), Y^(l), A^(l))
8.     v.right = TreeGrowth(X^(r), Y^(r), A^(r))
9. end if
10. return v

function Partition(X, Y, A)
1. Compute the centered kernel K from X and A
2. Compute C = Y^T K Y
3. Compute the first two eigenvectors g^(1), g^(2) of C
4. Let g = [g^(1), g^(2)]
5. (X^(l), Y^(l)) = {(x_i, y_i) | y_i = k, argmax_j g_{kj} = 1}
6. (X^(r), Y^(r)) = {(x_i, y_i) | y_i = k, argmax_j g_{kj} = 2}
7. return (X^(l), Y^(l), X^(r), Y^(r), g)

Algorithm 6 Testing with Hierarchical Tree
Input: X_test, X_train, Y_train, A, T
Output: Y_test
1. for each x ∈ X_test do
2.     v = T.root
3.     while v.class = ∅ do
4.         k̂ = Similarity(x, X_train, A, v.index)
5.         Ŷ = Subset(Y_train, v.index)
6.         ĝ = v.g
7.         p̂ = ĝ^T Ŷ^T k̂
8.         v = getChild(p̂, v)
9.     end while
10.    Insert y = v.class into Y_test
11. end for
12. return Y_test

5.1.5 Parallelization of Hierarchical Tree Learning

Our framework can be easily parallelized. As shown in Algorithms 5 and 6, the key step in the optimization of our objective function is to solve a matrix multiplication problem and to find the eigenvectors corresponding to the two largest eigenvalues of the tree-structure covariance matrix C.
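The following is a minimal serial sketch (dense toy data, not the MapReduce implementation) of the Partition step in Algorithm 5 and of the branch-prediction rule p̂ = ĝ^T Ŷ^T k̂ used in Algorithm 6.

import numpy as np

# Minimal sketch of Partition in Algorithm 5: split the classes at an internal
# node using the first two eigenvectors of C = Y^T K Y, where K is the centered
# joint kernel built from the node attributes X and the adjacency matrix A.
def partition_classes(X, A, Y):
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ (X @ X.T + A @ A.T) @ H      # centered joint kernel
    C = Y.T @ K @ Y                      # c x c matrix
    _, eigvecs = np.linalg.eigh(C)       # eigenvalues returned in ascending order
    G = eigvecs[:, -2:]                  # eigenvectors of the two largest eigenvalues
    side = np.argmax(G, axis=1)          # class k goes to branch 0 (left) or 1 (right)
    return G, side

# Branch prediction for a node (Algorithm 6): p_hat = g^T Y^T k_hat, where
# k_hat holds the kernel similarities between the node and the training nodes.
def predict_branch(G, Y, k_hat):
    p_hat = G.T @ (Y.T @ k_hat)
    return int(np.argmax(p_hat))

rng = np.random.default_rng(0)
n, d, c = 30, 5, 4
X = rng.standard_normal((n, d))
A = np.triu((rng.random((n, n)) < 0.1).astype(float), 1)
A = A + A.T
Y = np.eye(c)[rng.integers(0, c, size=n)]     # one-hot labels
G, side = partition_classes(X, A, Y)
k_hat = X @ X[0] + A @ A[0]                   # similarity of one node to the training nodes
print("class-to-branch assignment:", side)
print("predicted branch:", predict_branch(G, Y, k_hat))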
To perform the matrix multiplication and extract the eigenvectors, we apply the HEIGEN algorithm [74], which provides efficient matrix-matrix multiplication under the Hadoop/MapReduce distributed programming framework. HEIGEN includes an eigensolver for billion-scale matrices that computes the top k eigenvectors using the Hadoop/MapReduce framework. We evaluated our parallelization approach for MF-Tree on a cluster of machines running MapReduce with 128 worker nodes, each with 8 cores and 32 GB of RAM. The results are reported in the next section.

5.2 Experimental Evaluation

This section presents the results of our experiments to evaluate the performance of the enhanced MF-Tree framework for network data.

5.2.1 Network Data

We evaluated the performance of the proposed framework on two real-world network data sets: the U.S. Patent and Wikipedia data sets. Details of the data sets are given below.

5.2.1.1 U.S. Patent Data

We extracted the U.S. patents granted between January 1963 and December 1999 from a website called PatentGenius (http://www.patentgenius.com/), along with the citations made to these patents between 1975 and 1999 (over 16 million). For each patent, we extracted the title, abstract, category, and citation information from its Web page. We ended up with 3,774,768 patents with 16,518,948 citation links. There are 22 general categories in the U.S. Patent data, as shown in Table 5.1. Each general category has its own subcategories, which produces 426 classes under which a patent can be categorized. These classes are further divided into more subclasses, which in turn produces a hierarchical structure between classes. For this experiment, we tested our algorithm using 3,139 U.S. patent sub-categories. The class distribution of the U.S. Patent data is highly skewed, as shown in Figure 5.1, with the largest class containing 25,727 patents and the smallest class containing only 8 patents. A patent citation network is constructed from the data, where each patent is a node while its title and abstract represent the node attributes. The edge set contains citations between pairs of patents.

Table 5.1 22 General Categories of U.S. Patent Data

Class                                  No. of Patents
Business                               26790
Chemistry                              441687
Clothing, Fashion & Accessories        52895
Communications                         281045
Construction                           112621
Electrical & Energy                    383168
Engineering                            84506
Entertainment & Recreation             36342
Firearms & Weapons                     5968
Food                                   20896
Health & Medicine                      227264
Household                              96166
Industrial                             393949
Information Technology                 479869
Material Science                       418611
Optical & Optics                       147182
Packaging & Containers                 93128
Papers, Printing & Office Supplies     52285
Physics                                159269
Plants, Animals & Agriculture          25913
Tools & Hardware                       82137
Transportation & Automotive            153077

5.2.1.2 Wikipedia Data

We created network data from the Wikipedia dump dated October 9, 2009, where each Wikipedia article is a node and its title and content form the node attributes. The link set contains citations between pairs of articles. We generated two versions of the network data for evaluation purposes.

1. Wikipedia Data 1, which contains articles from all the subclasses of the biology category. There are altogether 98,068 articles belonging to 2,924 classes, with 331,916 hyperlinks between articles.

2. Wikipedia Data 2, which contains the entire 2,365,436 articles belonging to 614,428 classes.

Similar to the U.S. Patent data, the class distribution for Wikipedia Data 1 is also skewed, as shown in Figure 5.1.
Due to the imbalanced class distribution, accuracy is not a good metric for evaluating the performance of the framework. We will discuss the evaluation metric used in the next section.

Figure 5.1 Class Distribution of U.S. Patent and Wikipedia Data 1

5.2.2 Evaluation Metric

Accuracy is a widely used evaluation metric for classification tasks. The metric can be computed from a confusion matrix that summarizes the total number of examples correctly and incorrectly predicted by a classification method. Figure 5.2 shows an example of a confusion matrix for a binary classification problem, where the columns are the predicted classes and the rows are the actual classes. The two classes are denoted as the positive and negative class, respectively. In the confusion matrix, true positive (TP) is the number of positive examples correctly predicted, true negative (TN) is the number of negative examples correctly predicted, false positive (FP) is the number of negative examples incorrectly classified as positive, while false negative (FN) is the number of positive examples incorrectly classified as negative.

Figure 5.2 Confusion Matrix

Accuracy is then formally defined as

\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}.

Unfortunately, accuracy is not an appropriate metric for data sets with imbalanced classes. For example, if we have 99 instances belonging to the negative class and 1 instance from the positive class, then a classifier that predicts all 100 instances as the negative class will have an accuracy of 99% even though it fails to detect any instance from the positive class. Instead of accuracy, we report the experimental results using both micro- and macro-averaged F1 scores [75]. The metrics are formally defined as follows:

5.2.2.1 Micro-average F1

Let TP_k, FP_k, and FN_k be the true positives, false positives, and false negatives of class k ∈ C. The micro-average F1 score is defined as:

\text{Precision} = \frac{\sum_{k \in C} TP_k}{\sum_{k \in C} (TP_k + FP_k)}, \qquad
\text{Recall} = \frac{\sum_{k \in C} TP_k}{\sum_{k \in C} (TP_k + FN_k)}

\text{Micro-average } F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Note that micro-average F1 is computed based on the average precision and recall values of all the test instances.

5.2.2.2 Macro-average F1

The macro-average F1 score is defined as:

\text{Precision}_k = \frac{TP_k}{TP_k + FP_k}, \qquad
\text{Recall}_k = \frac{TP_k}{TP_k + FN_k}

\text{Macro-average } F_1 = \frac{1}{|C|} \sum_{k \in C} \frac{2 \times \text{Precision}_k \times \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}

The metric is computed by averaging the F1 scores of all the classes. When the class distribution is skewed, the macro-average F1 score is highly influenced by the F1 scores of the rare classes. In contrast, the micro-average F1 score is dominated by the performance of the classifier on instances from the larger categories.

5.2.3 Experimental Results

This section describes the results of our experiments on both the U.S. Patent and Wikipedia network data sets. We compared the hierarchical tree learning framework (denoted as HTL) against a well-known label propagation algorithm [36] and the decision tree classifier (applied to node attributes only). For this experiment, we apply the holdout method, where half of the network data was used for training and the remaining half was reserved for testing purposes.

5.2.3.1 Results on U.S. Patent Data

Tables 5.2 and 5.3 show the micro- and macro-average F1 scores of the various algorithms on the U.S. Patent data.
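Before turning to the detailed results, the following is a minimal sketch (with hypothetical per-class counts, not the experimental pipeline) of how the micro- and macro-average F1 scores defined above are computed.

import numpy as np

# Minimal sketch of the micro- and macro-average F1 scores in Section 5.2.2,
# given per-class true positive, false positive and false negative counts.
def micro_macro_f1(tp, fp, fn):
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    # Micro-average: pool the counts over all classes first.
    micro_p = tp.sum() / (tp.sum() + fp.sum())
    micro_r = tp.sum() / (tp.sum() + fn.sum())
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    # Macro-average: compute per-class F1, then average over the classes.
    p_k = tp / (tp + fp)
    r_k = tp / (tp + fn)
    macro_f1 = np.mean(2 * p_k * r_k / (p_k + r_k))
    return micro_f1, macro_f1

# Hypothetical counts for three classes: one large class and two rare ones.
# The rare classes pull macro-F1 down far more than micro-F1, which is the
# behaviour discussed above.
tp = [900, 5, 3]
fp = [50, 20, 15]
fn = [40, 30, 25]
print(micro_macro_f1(tp, fp, fn))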
For this experiment, we vary the number of classes by mapping the patents to their corresponding classes at different levels of the class hierarchy. Recall that the top level of the class hierarchy contains 22 classes, the second level contains 426 classes, and the subsequent level contains 3,139 classes.

Table 5.2 Micro-average F1 Comparison on U.S. Patent Data

No. Classes    Label Propagation (Link only)    Label Propagation (Link + Content)    Decision Tree    HTL
k = 22         0.582                            0.592                                 0.569            0.622
k = 426        0.500                            0.511                                 0.490            0.512
k = 3139       0.312                            0.330                                 0.302            0.328

Table 5.3 Macro-average F1 Comparison on U.S. Patent Data

No. Classes    Label Propagation (Link only)    Label Propagation (Link + Content)    Decision Tree    HTL
k = 22         0.426                            0.457                                 0.408            0.466
k = 426        0.338                            0.376                                 0.316            0.396
k = 3139       0.257                            0.261                                 0.238            0.276

The following observations can be made based on the results shown in the tables:

1. Hybrid approaches are more effective than attribute-only and link-only node classification methods. For example, the performance of the proposed hierarchical tree learning (HTL) method is superior to that of a simple decision tree classifier constructed on the node attributes only, in terms of both micro- and macro-averaged F1 scores. In addition, the hybrid label propagation approach, which uses both the link structure and the Web content of the patents, consistently outperforms label propagation with link-only information.

2. In terms of macro-average F1 scores, the proposed HTL method is better than the hybrid label propagation method in all three cases (k = 22, 426, and 3139). Since macro-average F1 is mainly influenced by the performance on the rare classes, this result demonstrates the effectiveness of the proposed method for handling imbalanced class problems compared to the other baseline methods.

3. In terms of micro-average F1 scores, the proposed HTL method is better than the hybrid label propagation method in 2 out of 3 cases (k = 22 and 426). For k = 3139, our results are slightly worse. This suggests that the hybrid label propagation method is slightly better at classifying instances from the larger categories.

To support our previous assertion that HTL is more effective at classifying the rare classes, consider the plot shown in Figure 5.3. The horizontal axis of the plot corresponds to the ID of the individual classes in the U.S. Patent data, sorted by increasing class size. The vertical axis corresponds to the F1 score of each class. The results clearly show that the F1 scores of HTL on the rare classes are higher than those obtained using the label propagation and decision tree methods. HTL also outperforms the decision tree classifier on all the classes, as its curve lies above the curve for the decision tree classifier.

Figure 5.3 F1 Score for each individual class in the U.S. Patent Data (where the classes are sorted in increasing class size).

The figure also shows that the label propagation methods generally work better than HTL on the larger classes. This result is consistent with the micro- and macro-average F1 scores shown in Tables 5.2, 5.3, and 5.5. Although the hybrid label propagation method is also quite effective, it does not produce a hierarchical tree structure that represents the conceptual relationship between classes.
Decision tree classifiers, on the other hand, produce trees that (1) may not include all the classes in their leaf nodes, and (2) may contain classes that are split into multiple leaf nodes of the tree. In contrast, the binary tree generated by the proposed HTL method can be interpreted as a class hierarchy, as it includes all the classes in the data and assigns each class to exactly one leaf node of its induced tree. Figure 5.4 provides a snippet of the tree structure obtained after applying HTL to the U.S. Patent data with 22 classes. The tree structure (up to depth 4) shows that we can capture the relationships among the classes very well.

Figure 5.4 A snippet of the class hierarchy structure (up to depth 4) generated by the proposed method for the 22 classes of the U.S. Patent Data

For example, chemistry-related classes (chemistry, firearms, health & medicine, papers, printing & office supplies) were assigned to the same branch. Although there was some uncertainty in the assignments (e.g., food with material science and clothing), in general, the classes located near the bottom levels of the tree are quite related to each other.

Testing Efficiency Comparison. To compare test efficiency, we compared our method against the decision tree classifier on the U.S. Patent data and plotted the results in Figure 5.5. The horizontal axis in the figure corresponds to the test time while the vertical axis corresponds to the macro-average or micro-average F1 score. We applied each method to the different variants of the U.S. Patent data (with k = 22, 426, and 3139). As can be seen from the figure, our method consistently achieves higher average F1 scores but is slower in terms of test efficiency, even though the depth of the tree produced by the HTL method is smaller than that of a decision tree (see Table 5.4 for a comparison of the maximum depths of the trees). We believe this is because our method needs to perform more complex matrix computations at the internal nodes of the tree compared to a decision tree classifier, which uses simple attribute test conditions to determine which branch to follow.

Table 5.4 Tree Depth Comparison on U.S. Patent Data

                 k = 22    k = 426    k = 3139
Decision Tree    56        212        635
HTL              5         10         13

Figure 5.5 Testing Efficiency Comparison on U.S. Patent Data

5.2.3.2 Results on Wikipedia Data 1

In this section, we tested our algorithm on Wikipedia Data 1, which contains 98,068 articles belonging to 2,924 subclasses of biology. The results in terms of both micro-average F1 and macro-average F1 are shown in Table 5.5.

Table 5.5 Performance Comparison on Wikipedia Data 1

                                      Micro-average F1    Macro-average F1
Label Propagation (Link only)         0.413               0.351
Label Propagation (Link + Content)    0.425               0.365
Decision Tree                         0.416               0.349
HTL                                   0.427               0.371

The results suggest that the proposed method is slightly better than all three baselines in terms of both evaluation criteria. Since its macro-average F1 score is slightly higher than that of hybrid label propagation, this agrees with our previous observation on the U.S. Patent data that the proposed method is better at dealing with rare classes. This conclusion is supported when we analyze the F1 scores of the individual classes.
As shown in Figure 5.6, the proposed HTL method has the highest F1 scores when the class size is small. However, it is slightly worse than the label propagation methods for some of the larger classes.

Figure 5.6 F1 Score for each individual class in Wikipedia Data 1 (where the classes are sorted in increasing class size).

Testing Efficiency Comparison. The test time for the decision tree classifier on Wikipedia Data 1 is 427.9 seconds, while the proposed HTL method takes 516.8 seconds. Although our method is slightly slower, it consistently outperforms the decision tree classifier in terms of both its micro-average and macro-average F1 scores. In addition, unlike decision trees, our method has the ability to generate a meaningful taxonomy structure that reveals the class relationships. For the 2,924 classes in Wikipedia Data 1, the decision tree classifier generates a tree with a maximum depth of 516, while the proposed HTL method generates a tree with a maximum depth of 13.

5.2.3.3 Parallel Results on Wikipedia Data 2

To evaluate the performance of our parallel HTL method, we use Wikipedia Data 2, which contains more than 600K classes and 2 million articles. The results are shown in Table 5.6. Since none of the competing methods are applicable to a data set of this size, we report only the performance of our method in this section. Unlike Wikipedia Data 1, the number of classes in this data set is 200 times larger, which explains its lower F1 scores. It is also computationally more expensive, requiring 27,153 seconds for testing.

Table 5.6 Results on Wikipedia Data 2 using the parallel HTL method.

Micro-average F1    0.3436
Macro-average F1    0.1738

5.3 Conclusion

In this chapter, we proposed an extension of our MF-Tree framework for large multi-class network classification. We demonstrated that the proposed method can systematically combine both the node attributes and the link structure to effectively learn a hierarchical structure for large multi-class network learning problems. Experimental results confirmed the effectiveness of our method.

Chapter 6 Co-classification of Heterogeneous Networks

Most existing network learning techniques have focused on the analysis of a single network. One drawback of such methods is that information from a single network may not be sufficient to effectively learn an accurate classification model, especially when the network has noisy attributes or an incomplete link structure. In this chapter, we introduce a framework that can jointly classify heterogeneous networks that have different types of attributes and links but share similar characteristics in terms of their class information. The proposed framework is known as co-classification. As an example of a co-classification task, consider the network of editors and articles in Wikipedia. Wikipedia has become the largest online knowledge repository, with approximately 15 million articles written in more than 270 languages, among which more than 3 million are in English. These articles are written collaboratively by voluntary users around the world.
Given its massive amount of information and the fact that most of its content can be edited freely by any user with access to the Internet, it is not surprising that Wikipedia has become a rich domain for a variety of interesting classification problems, such as topic detection of Wikipedia articles, categorization of Wikipedia editors as humans or bots, and classification of Wikipedia content as featured articles, controversial/inflammatory content, spam, or vandalism. These tasks can be cast as network classification problems, in which the goal is to assign a class label to each node in the network (e.g., an article or a user) based on the node features and link information [76][77][78][40]. Most of the previous works have focused on classifying nodes in a single network. Since Wikipedia articles and users are inextricably related, knowing the class label of an article often aids in the classification of users, and vice versa. For example, in topic detection of Wikipedia articles, the category of an article provides evidence for the interest area of a user who edited the article's content. Similarly, in spam detection, we expect spam content to be more likely contributed by spammers than by non-spammers. Assigning class labels to both articles and users in Wikipedia not only increases the amount of labeled data available, it is often a natural process during the creation of the training data itself (e.g., to label a user as a spammer or a vandal, one has to know whether the edited content should be considered a form of spam or vandalism). Despite the obvious relationship between their classes, none of the previous works on Wikipedia classification [79][80][81] were designed to co-classify the article and user networks.

A co-classification method was previously developed in [13] to detect spam and spammers in social bookmarking Web sites. The method is based on extending the least-squares support vector machine (LS-SVM) formulation to network data, taking into account the correspondences between the classes in different networks. Nevertheless, the method assumes a binary class problem, where the nodes in each network are assigned to one of two possible classes (spam/non-spam or spammer/non-spammer). However, there are many Wikipedia classification problems in which the nodes in the user and article networks may have more than two classes. Though it may be possible to transform the multi-class problem into multiple binary classification problems (using the 1-vs-1 or 1-vs-rest strategies), this approach may not be effective because each binary classifier is trained independently of the others. It is also inapplicable when the numbers of classes in the article and user networks are different. In this chapter, we present an extension of the LS-SVM formulation for network data to the multi-class problem. Specifically, we formalize the joint classification tasks as a constrained optimization problem, in which the relationships between the classes of nodes in two different networks are modeled as graph regularization constraints. Unlike the previous binary-class formulation in [13], the formulation proposed here allows us to incorporate prior knowledge about the potential relationships between classes in different networks to avoid overfitting.

6.1 Preliminaries

We begin with a brief discussion of the terminology and notation used. The terminology is given in the context of the user and article networks in Wikipedia.
User: A registered user is identified by a username and has an associated user page on Wikipedia. Unregistered users are also allowed to edit Wikipedia pages; these users are identified by their corresponding IP addresses. In this study, we consider only registered Wikipedia users. Let U be the set of such users.

Article: Articles in Wikipedia have a title and are organized by categories. Some articles are considered Wikipedia stubs, which have been created as placeholders for a particular topic but do not contain any useful text information. In this study, we consider only non-stub Wikipedia articles. Let A be the set of such articles.

User-Article matrix: The edits made by users on Wikipedia articles are represented in a user-article matrix M, where M(i, j) is the total number of edits made by user i on article j.

Category relation matrix: Let E_t be a category relation matrix estimated from the training data. The matrix represents the number of links between each user category and article category. For instance, E_t(y^{(u)}, y^{(a)}) is the total number of edits made by users from category y^{(u)} on articles from category y^{(a)}, where y^{(u)} ∈ {1, 2, ..., k_u} and y^{(a)} ∈ {1, 2, ..., k_a}. k_u and k_a are the numbers of user and article categories, respectively.

Our proposed framework assumes that there is a correspondence between the user and article categories (classes). For example, suppose we are interested in classifying the topic areas of Wikipedia users and articles. We expect users who are classified as "computer science" editors to edit more articles classified as "computer science" than articles from other areas such as "biology". As will be shown in Section 6.2, this assumption can be enforced using a graph regularization constraint in our proposed co-classification framework.

Before describing our methodology, we first formalize the co-classification problem. Suppose we are given:

1. A set of l_u labeled users, L^{(u)} = {(x_1^{(u)}, y_1^{(u)}), (x_2^{(u)}, y_2^{(u)}), ..., (x_{l_u}^{(u)}, y_{l_u}^{(u)})}, where x_i^{(u)} is a d_u-dimensional feature vector for the i-th user and y_i^{(u)} ∈ {1, 2, ..., k_u} is its class label.

2. A set of n_u − l_u unlabeled users, U^{(u)} = {x_{l_u+1}^{(u)}, x_{l_u+2}^{(u)}, ..., x_{n_u}^{(u)}}.

3. A set of l_a labeled articles, L^{(a)} = {(x_1^{(a)}, y_1^{(a)}), (x_2^{(a)}, y_2^{(a)}), ..., (x_{l_a}^{(a)}, y_{l_a}^{(a)})}, where x_j^{(a)} is a d_a-dimensional feature vector for the j-th article and y_j^{(a)} ∈ {1, 2, ..., k_a} is its class label.

4. A set of n_a − l_a unlabeled articles, U^{(a)} = {x_{l_a+1}^{(a)}, x_{l_a+2}^{(a)}, ..., x_{n_a}^{(a)}}.

5. A set of user-article pairs, M, where (x_i^{(u)}, x_j^{(a)}) ∈ M if user x_i^{(u)} edits article x_j^{(a)}.

Definition 5 (Co-Classification). Given L^{(u)}, L^{(a)}, and M, the goal of co-classification is to learn a pair of classifiers: (1) f_u : ℜ^{d_u} → {1, 2, ..., k_u} that accurately maps the feature vector of a user x^{(u)} to its class label y^{(u)}, and (2) f_a : ℜ^{d_a} → {1, 2, ..., k_a} that accurately maps the feature vector of an article x^{(a)} to its class label y^{(a)}.

6.2 Methodology

This section presents the details of our proposed co-classification framework. We first present a maximum margin classification approach that classifies the Wikipedia users and articles independently. We then describe the proposed co-classification approach.

6.2.1 Independent Classification of Wikipedia Users and Articles

The framework developed in this study is an extension of the LS-SVM formulation for network data developed in [13].
Before describing the framework, we first consider the task of classifying Wikipedia users and articles independently. Given a set of labeled users L^{(u)}, we may construct an LS-SVM classifier for users by minimizing the following objective function:

L_u(w, b, e, \alpha) = \frac{1}{2}\sum_{m=1}^{k_u} w_m^{(u)T} w_m^{(u)} + \frac{\gamma_1}{2}\sum_{i=1}^{l_u}\sum_{m \ne y_i} e_{im}^{(u)2}
    - \sum_{i=1}^{l_u}\sum_{m \ne y_i} \alpha_{im}\big(w_{y_i}^{(u)} x_i^{(u)} + b_{y_i}^{(u)} - w_m^{(u)} x_i^{(u)} - b_m^{(u)} - 2 + e_{im}^{(u)}\big)        (6.1)

where {\alpha_{im}} is the set of Lagrange multipliers. Similarly, the corresponding Lagrangian formulation for classifying articles can be expressed as follows:

L_a(w, b, \varepsilon, \beta) = \frac{1}{2}\sum_{n=1}^{k_a} w_n^{(a)T} w_n^{(a)} + \frac{\gamma_2}{2}\sum_{j=1}^{l_a}\sum_{n \ne y_j} \varepsilon_{jn}^{(a)2}
    - \sum_{j=1}^{l_a}\sum_{n \ne y_j} \beta_{jn}\big(w_{y_j}^{(a)} x_j^{(a)} + b_{y_j}^{(a)} - w_n^{(a)} x_j^{(a)} - b_n^{(a)} - 2 + \varepsilon_{jn}^{(a)}\big)        (6.2)

where {w_n^{(a)}} is the set of parameters for the article classifier, \gamma_2 is a parameter that controls the penalty of misclassifying articles, and \varepsilon_{jn}^{(a)} is the error of classifying the j-th article.

6.2.2 Co-Classification of Users and Articles

Instead of solving the optimization problems for classifying users and articles independently, we propose a co-classification algorithm that utilizes the link structure between users and articles through the two relation matrices E_t and M described in Section 6.1. The category relation matrix E_t is computed as follows:

E_t(y^{(u)}, y^{(a)}) = \sum_{i:\, y_i^{(u)} = y^{(u)}}\;\sum_{j:\, y_j^{(a)} = y^{(a)}} M(x_i^{(u)}, x_j^{(a)})

where x_i^{(u)} ∈ L^{(u)}, y^{(u)} ∈ {1, 2, ..., k_u}, x_j^{(a)} ∈ L^{(a)}, and y^{(a)} ∈ {1, 2, ..., k_a}.

Figure 6.1 A Sample of User-Article network

Figure 6.1 shows an example of the links between Wikipedia users and articles. Suppose both the users and articles are divided into four categories, represented by nodes with the following colors: white, blue, gray, and cyan. The edge between a user and an article corresponds to the number of edits, which is captured by the user-article matrix M. For example, M(u6, a2) = 4 means user u6 has edited article a2 four times. Each entry of the category relation matrix E_t(y^{(u)}, y^{(a)}) corresponds to the sum of the weights of the edges between users from category y^{(u)} and articles from category y^{(a)}. For example, E_t(white, gray) = M(u6, a2) + M(u4, a3) = 10.

One limitation of using E_t is that if the number of labeled users connected to the labeled articles is small, then the relationship between the user and article classes may not be accurately represented by E_t. Furthermore, the presence of noisy links in the Wikipedia data may also have an adverse effect that degrades classification performance. In many cases, we may have some prior knowledge about the probability of a link between nodes of two classes in the user and article networks. This information can be encoded using a prior matrix E_p to avoid overfitting the model to the observed relationships between classes in the different networks. Based on these two considerations, we construct an adjusted category relation matrix as follows:

E(y^{(u)}, y^{(a)}) = \alpha E_t(y^{(u)}, y^{(a)}) + (1 - \alpha) E_p(y^{(u)}, y^{(a)}),        (6.3)

where 0 ≤ α ≤ 1 is a tuning parameter that controls the trade-off between E_t and E_p. Its value can be determined via cross-validation or set to a constant, say 0.5, for equal weighting. The prior matrix used in our experiments is shown in Table 6.1, where the rows and columns correspond to the user and article categories, respectively. The values in E_t should be normalized for two reasons. First, the values in E_t and E_p do not have the same scale.
The larger values in E_t would tend to dominate the overall value of E, thus losing information about our prior knowledge. Second, it is useful to normalize the values in E_t so that they range between [-1, 1]. As will be seen shortly, this helps to enforce a penalty or reward scheme in our regularization constraint.

Table 6.1 Prior category relation Matrix E_p(k_u, k_a)

       Ca1    Ca2    Ca3    Ca4
Cu1    1      -1     -1     -1
Cu2    -1     1      -1     -1
Cu3    -1     -1     1      -1
Cu4    -1     -1     -1     1

Based on these two considerations, we normalize E_t as follows:

E_t''(y_i^{(u)}, y_j^{(a)}) = E_t(y_i^{(u)}, y_j^{(a)}) \Big/ \sum_{k_a} E_t(y_i^{(u)}, y_j^{(a)})

E_t'(y_i^{(u)}, y_j^{(a)}) = E_t''(y_i^{(u)}, y_j^{(a)}) - \frac{\sum_{k_a} E_t''(y_i^{(u)}, y_j^{(a)})}{k_a}

After normalization, if y_i^{(u)} and y_j^{(a)} are related categories, then E_t'(y_i^{(u)}, y_j^{(a)}) is positive-valued; otherwise, E_t'(y_i^{(u)}, y_j^{(a)}) is negative-valued. The normalized value of E_t will be used to estimate E in (6.3).

The category relation matrix will be used to enforce a constraint between the user and article classifiers. Specifically, the assumption described in Section 6.1 states that Wikipedia users are unlikely to edit articles in non-related categories. The following graph regularization term is therefore introduced to penalize models in which users edit articles that belong to non-related categories:

-\gamma_3 \sum_{i=1}^{l_u} \sum_{\substack{j=1 \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}}^{l_a} E'(y_i^{(u)}, y_j^{(a)}) \big(w_{y_i}^{(u)} x_i^{(u)} + b_{y_i}^{(u)} - w_{y_j}^{(a)} x_j^{(a)} - b_{y_j}^{(a)}\big)

The overall objective function of our co-classification framework for users and articles can be written as follows:

L = \frac{1}{2}\sum_{m=1}^{k_u} w_m^{(u)T} w_m^{(u)} + \frac{\gamma_1}{2}\sum_{i=1}^{l_u}\sum_{m \ne y_i} e_{im}^{(u)2}
    + \frac{1}{2}\sum_{n=1}^{k_a} w_n^{(a)T} w_n^{(a)} + \frac{\gamma_2}{2}\sum_{j=1}^{l_a}\sum_{n \ne y_j} \varepsilon_{jn}^{(a)2}
    - \sum_{i=1}^{l_u}\sum_{m \ne y_i} \alpha_{im}\big(w_{y_i}^{(u)} x_i^{(u)} + b_{y_i}^{(u)} - w_m^{(u)} x_i^{(u)} - b_m^{(u)} - 2 + e_{im}^{(u)}\big)
    - \sum_{j=1}^{l_a}\sum_{n \ne y_j} \beta_{jn}\big(w_{y_j}^{(a)} x_j^{(a)} + b_{y_j}^{(a)} - w_n^{(a)} x_j^{(a)} - b_n^{(a)} - 2 + \varepsilon_{jn}^{(a)}\big)
    - \gamma_3 \sum_{\substack{i,j \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}} E'(y_i^{(u)}, y_j^{(a)}) \big(w_{y_i}^{(u)} x_i^{(u)} + b_{y_i}^{(u)} - w_{y_j}^{(a)} x_j^{(a)} - b_{y_j}^{(a)}\big)        (6.4)

where (w^{(u)}, w^{(a)}, e, \varepsilon, \alpha, \beta) are the model parameters to be estimated from the training data and (\gamma_1, \gamma_2, \gamma_3) are user-specified parameters. For our experiments, we set \gamma_1 = \gamma_2 = 1, whereas \gamma_3 is estimated from the data via cross-validation. The objective function is solved by taking the derivative of L with respect to each of the model parameters and setting them to zero. This leads to a set of linear equations, which can be expressed in matrix notation as follows:

\begin{bmatrix} C^{(u)} & 0 \\ 0 & C^{(a)} \end{bmatrix}
\begin{bmatrix} \chi^{(u)} \\ \chi^{(a)} \end{bmatrix}
=
\begin{bmatrix} b^{(u)} \\ b^{(a)} \end{bmatrix}        (6.5)

where C^{(u)} is a matrix that contains x_i^{(u)} and E'(m, y_j^{(a)}), the vector \chi^{(u)} = (w_1^{(u)}, ..., w_{k_u}^{(u)}, b_1^{(u)}, ..., b_{k_u}^{(u)}, e, \alpha)^T, and the vector b^{(u)} = (p_1, q_1, 0_{l_u \times k_u}, 2_{l_u \times k_u})^T, where 0_{l_u \times k_u} is an (l_u × k_u)-dimensional vector of 0s and 2_{l_u \times k_u} is an (l_u × k_u)-dimensional vector of 2s.
Analogously, C^{(a)}, \chi^{(a)}, and b^{(a)} = (p_2, q_2, 0_{l_a \times k_a}, 2_{l_a \times k_a})^T are the corresponding matrix and vectors for solving the parameters of the article classifier, where:

p_1(m) = \gamma_3 \sum_{\substack{i,j:\; y_i^{(u)} = m \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}} E'(m, y_j^{(a)})\, x_i^{(u)}, \qquad
q_1(m) = \gamma_3 \sum_{\substack{i,j:\; y_i^{(u)} = m \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}} E'(m, y_j^{(a)})

p_2(n) = \gamma_3 \sum_{\substack{i,j:\; y_j^{(a)} = n \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}} E'(y_i^{(u)}, n)\, x_j^{(a)}, \qquad
q_2(n) = \gamma_3 \sum_{\substack{i,j:\; y_j^{(a)} = n \\ M(x_i^{(u)}, x_j^{(a)}) \ne 0}} E'(y_i^{(u)}, n)

The block structure of the matrix equation in (6.5) suggests that the system of linear equations can be decoupled into two subproblems, one for learning the parameters of the user classifier and the other for the article classifier:

C^{(u)} \chi^{(u)} = b^{(u)}        (6.6)

C^{(a)} \chi^{(a)} = b^{(a)}        (6.7)

It is worth noting that the solutions of the two equations are not entirely independent, because the terms p_1, p_2, q_1 and q_2 depend on the link structure of M and E. It is sufficient to solve Equations (6.6) and (6.7) for (\alpha, b_m^{(u)}, \beta, b_n^{(a)}) in order to classify the unlabeled users and articles. Algorithm 7 summarizes a high-level overview of our co-classification algorithm. The linearsolver function solves the set of linear equations to obtain the model parameters. The Classify function takes the model parameters and the unlabeled data as input and generates their predicted class labels as output.

Algorithm 7 Co-Classification Algorithm
Input: U^{(a)}, U^{(u)}, L^{(a)}, L^{(u)}, M, E' and γ
Output: (y_i^{(u)} | i = l_u + 1, ..., n_u), (y_j^{(a)} | j = l_a + 1, ..., n_a)
1. (α, β, b^{(u)}, b^{(a)}) ← linearsolver(L^{(a)}, L^{(u)}, M, E', γ)
2. y^{(a)} ← Classify(U^{(a)}, α, b^{(a)})
3. y^{(u)} ← Classify(U^{(u)}, β, b^{(u)})

6.3 Experimental Evaluation

This section presents the experiments performed to evaluate the performance of the proposed co-classification algorithm.

6.3.1 Data Set

To evaluate the performance of our algorithm, we downloaded the Wikipedia dump dated Oct-09-2009. After preprocessing, we chose the following four topics from each network as our ground truth classes (k_u = k_a = 4): biology, natural science, computer science, and political science. The label for each user is assigned based on the majority class of the articles written by that user. We created a sample that contains 5,361 users and 6,403 articles for our experiments.

The experimental results reported in this section are obtained using 5-fold cross-validation. For each run, we use 4 of the folds for training and the remaining fold as test data. We repeated the experiment ten times, each time using a random division of the data into five folds. Table 6.2 shows the category relation matrix E_t' obtained from the training data.

Table 6.2 Category relation Matrix from training data E_t'

       Ca1        Ca2        Ca3        Ca4
Cu1    0.6658     -0.2225    -0.2274    -0.2159
Cu2    -0.2186    0.6983     -0.2385    -0.2412
Cu3    -0.2342    -0.2271    0.5430     -0.0816
Cu4    -0.2423    -0.2462    -0.2398    0.7283

Our experimental results are summarized in Table 6.3. The table shows a comparison between the performance of our proposed multi-class LS-SVM framework and that of the LS-SVM algorithm applied independently to the user and article networks. For the latter case, we employ the 1-vs-rest strategy to train the independent LS-SVM models. Note that Pre denotes the average precision over all four categories, Rec denotes the average recall over all four categories, and F1 is the harmonic mean of precision and recall. Clearly, the F1 measure for our proposed framework is higher than that for LS-SVM.
A more detailed analysis can be seen in Figures 6.2 and 6.3, where we show the F1 values for each class in the article and user networks. Notice that the improvement in F1 is observed across all the classes. These results suggest that our multi-class graph regularization framework, along with the addition of the prior matrix, enables our algorithm to outperform LS-SVM models that are trained independently.

Table 6.3 Four categories average results, 5-fold cross-validation

                    LS-SVM     Co-LS-SVM
Article   Rec       49.6420    53.5612
          Pre       48.8169    50.1656
          F1        49.2260    51.8078
User      Rec       40.6410    46.2185
          Pre       41.5724    42.8800
          F1        41.1014    44.4867

Figure 6.2 Comparison of F1 measure for each Article category between LS-SVM and Co-LS-SVM algorithms

Figure 6.3 Comparison of F1 measure for each User category between LS-SVM and Co-LS-SVM algorithms

Despite the observed improvement, our overall F1 measure is around 50%, which is not very high. This is probably because of the presence of noisy links in Wikipedia, which adversely affect our classification results. Furthermore, we use the content of the user page in Wikipedia to represent the feature vector of each user. Since some of the user pages contain only a few uninformative words or are even empty, their feature vectors may not be effective for classification. As a next step in our work, we may consider removing the noisy links and exploring additional features to represent the users.

6.4 Conclusion

In this chapter, we proposed a co-classification framework for the Wikipedia user and article networks, which utilizes the relational information between the two networks and learns the classifier models simultaneously. Experimental results have justified the effectiveness of our method.

Chapter 7 Future Work

In this thesis, I have presented my research to address some of the key challenges of large multi-class network learning. First, a hierarchical learning technique known as recursive non-negative matrix factorization was developed to address the test efficiency of large multi-class problems. Another unique advantage of the approach is that it can accommodate the detection of new classes in the test data. Second, I presented another approach known as matrix factorization tree (MF-Tree), which was designed to optimize a global objective function. A proof was also given to demonstrate the equivalence between the proposed regularized loss function and the Hilbert-Schmidt Independence Criterion. We also showed the additive property of the objective function, which allows us to decompose the large multi-class learning problem into a series of binary classification tasks that can be organized in a binary tree structure. Third, we extended the MF-Tree approach to deal with network data. The enhanced MF-Tree approach systematically combines both the node attributes and the link structure information to perform large multi-class node classification and class hierarchy generation from the network data. Finally, we introduced an approach to jointly classify heterogeneous nodes in multiple networks in order to improve classification accuracy.
For future work, we plan to investigate the following two problems as extensions of this thesis:

7.1 Joint Label Tree Construction and Link Prediction for Networks with a Large Number of Classes

Since the nodes and link structure are interleaved with each other in a network, node classification and link prediction are two related learning tasks. This is because, in most networks, more links exist between nodes of the same class than between nodes of different classes. Thus, we conjecture that the learning of one task could aid the learning of the other. The hierarchical network learning method proposed in Chapter 5 builds a hierarchical structure by optimizing an objective function to partition the data into subgroups at each internal node. The partitioning may break links between pairs of nodes that should be linked together, which may degrade classification performance. Designing an effective hierarchical learning method that can jointly perform node classification and link prediction would be an interesting subject for future research.

7.2 Classification of Documents into Reading Positions

From an application perspective, hierarchical learning methods can potentially be used to provide a navigational interface to content, i.e., a way to access the content of a document collection. Currently, a popular way to access a large collection of documents is through a search interface that allows users to submit a query and view a dynamically generated ranked list of relevant documents. Such an approach may work fine when a user is trying to locate a specific document in the collection, but it is insufficient when users need to access the pertinent documents in some logical order, e.g., for learning, research, or editorial purposes. Search engines hide document relationships, while navigational interfaces tend to capture
For example, to determine the rank among documents d1 , d2 , and d3 , we can transform it into classifying whether d1 should precede d2 , d1 should precede d3 and d2 should precede d3 . 110 REFERENCES 111 REFERENCES [1] A. Choromanska, A. Agrawal, and J. Langford, “Extreme multi class classification,” in Neural Information Processing Systems Conference (NIPS) Workshop: Extreme Classification: Multi-Class & Multi-Label Learning with Millions of Categories, (Lake Tahoe, NV), 2013. [2] L. Liu, S. Saha, R. Torres, J. Xu, P.-N. Tan, A. Nucci, and M. Mellia, “Detecting malicious clients in isp networks using http connectivity graph and flow information,” in Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on, pp. 150–157, 2014. [3] P. Zhao and J. Han, “On graph query optimization in large networks,” Proc. VLDB Endow., vol. 3, pp. 340–351, Sept. 2010. [4] M. Adedoyin-Olowe, M. M. Gaber, and F. Stahl, “A survey of data mining techniques for social media analysis,” CoRR, 2013. [5] J. Scripps, R. Nussbaum, P. Tan, and A. Esfahanian, “Link-based network mining,” in Structural Analysis of Complex Networks (M. Dehmer, ed.), Birkhauser, 2010. [6] P. Mandayam-Comar, P.-N. Tan, and A. Jain, “Identifying cohesive subgroups and their correspondences in multiple related networks,” in Proc of the 2010 IEEE/WIC/ACM Int’l Conf on Web Intelligence (WI-2010), (Toronto, Canada), 2010. [7] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proc of the ACM International Conference on Knowledge Discovery and Data Mining, 2003. [8] Q. Lu and L. Getoor, “Link-based classification,” in Proceedings of the International Conference on Machine Learning (ICML), 2003. [9] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, “Collective classification in network data,” AI Magazine, vol. 29, no. 3, pp. 93–106, 2008. [10] G. Pandey, V. Kumar, and M. Steinbach, “Computational approaches for protein function prediction: A survey,” Tech. Rep. 06-028, Department of Computer Science and Engineering, University of Minnesota, Twin Cities, 2006. 112 [11] M. E. Tipping, “Sparse bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001. [12] D. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006. [13] F. Chen, P. Tan, and A. Jain, “A co-classification framework for detecting web spam and spammers in social media web sites,” in Proceeding Of ACM CIKM International Conference On Information And Knowledge Management, (Hong Kong,China), pp. 1807–1810, 2009. [14] L. Liu and P.-N. Tan, “A framework for co-classification of articles and users in wikipedia,” in Web Intelligence’10, pp. 212–215, 2010. [15] J. C. Platt, N. Cristianini, and J. Shawe-taylor, “Large margin dags for multiclass classification,” in Advances in Neural Information Processing Systems, pp. 547–553, MIT Press, 2000. [16] V. Vural and J. G. Dy, “A hierarchical method for multi-class support vector machines,” in Proceedings of the twenty-first international conference on Machine learning, pp. 105– 112, 2004. [17] A. Beygelzimer, J. Langford, and P. Ravikumar, “Multiclass classification with filter trees,” in Information Theory and Application Workshop, 2007. [18] S. Bengio, J. Weston, and D. Grangier, “Label embedding trees for large multi-class tasks.,” in Neural Information Processing Systems, pp. 163–171, 2010. [19] J. Deng, S. Satheesh, A. 
C. Berg, and F. F. F. Li, “Fast and balanced: Efficient label tree learning for large scale object recognition,” in Advances in Neural Information Processing Systems 24 (J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds.), pp. 567–575, 2011. [20] L. Liu, P. M. Comar, S. Saha, P.-N. Tan, and A. Nucci, “Recursive nmf: Efficient label tree learning for large multi-class problems,” in Proceedings of the 21st International Conference on Pattern Recognition (ICPR’2012), pp. 2148–2151, 2012. [21] A. Gretton, O. Bousquet, A. Smola, and B. Sch¨olkopf, “Measuring statistical dependence with hilbert-schmidt norms,” in Proceedings of the 16th International Conference on Algorithmic Learning Theory (ALT’2005), pp. 63–77, 2005. 113 [22] M. B. Blaschko and A. Gretton, “Learning taxonomies by dependence maximization,” in Proceedings of the Advances in Neural Information (NIPS’2008), pp. 153–160, 2008. [23] L. Song, A. J. Smola, A. Gretton, and K. M. Borgwardt, “A dependence maximization view of clustering,” in Proceedings of the 24th International Conference on Machine Learning (ICML’2007), pp. 815–822, 2007. [24] M. B. Blaschko, W. Zaremba, and A. Gretton, “Taxonomic prediction with treestructured covariances,” in Machine Learning and Knowledge Discovery in DatabasesEuropean Conference (ECML/PKDD’2013), pp. 304–319, 2013. [25] T. Zhang, A. Popescul, and B. Dom, “Linear prediction models with graph regularization for web-page categorization,” in Proceeding of ACM SIGKDD International Conf on Data Mining, (Philadelphia,PA), pp. 812–826, 2006. [26] R. Heatherly, M. Kantarcioglu, and B. Thuraisingham, “Social Network Classification Incorporating Link Type Values,” in IEEE Intelligence and Security Informatics, (, Dallas, Texas,), June 2009. [27] J. Scripps, P.-N. Tan, F. Chen, and A.-H. Esfahanian, “A matrix alignment approach for collective classification,” in ASONAM, pp. 155–159, 2009. [28] J. Scripps, P.-N. Tan, F. Chen, and A.-H. Esfahanian, “A matrix alignment approach for link prediction,” in ICPR, pp. 1–4, 2008. [29] O.-W. Kwon and J.-H. Lee, “Web page classification based on k-nearest neighbor approach,” in Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages (IRAL), pp. 9–15, 2000. [30] J. F¨ urnkranz, “Exploiting structural information for text classification on the www,” in Proceedings of the Third International Symposium on Advances in Intelligent Data Analysis, IDA ’99, pp. 487–498, 1999. [31] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Beyond blacklists: learning to detect malicious web sites from suspicious urls,” in KDD, 2009. [32] A. Le, A. Markopoulou, and M. Faloutsos, “Phishdef: Url names say it all,” INFOCOM, 2011. [33] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, pp. 604–632, Sept. 1999. 114 [34] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.,” technical report, 1999. [35] V. K. Stanford and V. Krishnan, “Web spam detection with anti-trust rank,” 2006. [36] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency,” in Proc. of Advances in Neural Information Processing Systems, pp. 321–328, 2003. [37] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation,” tech. rep., 2002. [38] Z. Gy¨ongyi, H. Garcia-Molina, and J. 
[39] Y. Yang, S. Slattery, and R. Ghani, “A study of approaches to hypertext categorization,” Journal of Intelligent Information Systems, vol. 18, pp. 219–241, 2002.
[40] S. Zhu, K. Yu, Y. Chi, and Y. Gong, “Combining content and link for classification using matrix factorization,” in Proceedings of ACM SIGIR, (The Netherlands), pp. 487–494, 2007.
[41] Q. Lu and L. Getoor, “Link-based classification,” in Proceedings of the Twentieth International Conference on Machine Learning (ICML’2003), (Washington, DC), pp. 496–503, 2003.
[42] Q. Gan and T. Suel, “Improving web spam classifiers using link structure,” in AIRWeb ’07: Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, (New York, NY, USA), pp. 17–20, 2007.
[43] M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao, “Graph regularized transductive classification on heterogeneous information networks,” in Proceedings of ECML/PKDD, pp. 570–586, 2010.
[44] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp. 41–75, 1997.
[45] P. M. Comar, L. Liu, S. Saha, A. Nucci, and P.-N. Tan, “Weighted linear kernel with tree transformed features for malware detection,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM’2012), pp. 2287–2290, 2012.
[46] P. M. Comar, L. Liu, S. Saha, P.-N. Tan, and A. Nucci, “Combining supervised and unsupervised learning for zero-day malware detection,” in Proceedings of the 33rd IEEE International Conference on Computer Communications (INFOCOM’2013), pp. 2022–2030, 2013.
[47] L. Liu, P. M. Comar, A. Nucci, S. Saha, and P.-N. Tan, “Missing or inapplicable: Treatment of incomplete continuous-valued features in supervised learning,” in Proceedings of the SIAM International Conference on Data Mining (SDM’2013), pp. 46–54, 2013.
[48] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014.
[49] R. Ghani, “Using error-correcting codes for efficient text classification with a large number of categories,” Master’s thesis, Carnegie Mellon University, 2001.
[50] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.
[51] T. Hastie and R. Tibshirani, “Classification by pairwise coupling,” The Annals of Statistics, vol. 26, no. 2, pp. 451–471, 1998.
[52] J. Langford and A. Beygelzimer, “Sensitive error correcting output codes,” in Conference on Learning Theory, pp. 158–172, 2005.
[53] D. Hsu, S. M. Kakade, J. Langford, and T. Zhang, “Multi-label prediction via compressed sensing,” CoRR, vol. abs/0902.1284, 2009.
[54] J. Huang, T. Zhang, and D. N. Metaxas, “Learning with structured sparsity,” in Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), (Montreal, Quebec, Canada), pp. 417–424, 2009.
[55] A. Beygelzimer, J. Langford, Y. Lifshits, G. Sorkin, and A. Strehl, “Conditional probability tree estimation analysis and algorithms,” in Proceedings of UAI, pp. 51–58, 2009.
[56] T. Gao and D. Koller, “Discriminative learning of relaxed hierarchy for large-scale visual recognition,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV’2011), pp. 2072–2079, 2011.
[57] D. Kuang and H. Park, “Fast rank-2 nonnegative matrix factorization for hierarchical document clustering,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2013), pp. 739–747, 2013.
[58] P.-Y. Hao, J.-H. Chiang, and Y.-H. Lin, “A new maximal margin spherical structured multi-class support vector machine,” Applied Intelligence, 2009.
[59] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Neural Information Processing Systems, pp. 556–562, 2001.
[60] P.-Y. Hao, J.-H. Chiang, and Y.-H. Lin, “A new maximal-margin spherical-structured multi-class support vector machine,” Applied Intelligence, vol. 30, pp. 98–111, 2009.
[61] D. Chen and R. J. Plemmons, “Nonnegativity constraints in numerical analysis,” in Symposium on the Birth of Numerical Analysis (A. Bultheel and R. Cools, eds.), World Scientific Publishing, 2010.
[62] T. Ngo, M. Bellalij, and Y. Saad, “The trace ratio optimization problem,” SIAM Review, vol. 54, no. 3, pp. 545–569, 2012.
[63] W. Jiang and F. Chung, “A trace ratio maximization approach to multiple kernel-based dimensionality reduction,” Neural Networks, vol. 49, pp. 96–106, 2014.
[64] D. Boley, “Principal direction divisive partitioning,” Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 325–344, 1998.
[65] E. Tanaka and K. Tanaka, “The tree-to-tree editing problem,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 2, no. 2, pp. 221–224, 1988.
[66] T. Jiang, L. Wang, and K. Zhang, “Alignment of trees - an alternative to tree edit,” in Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching (CPM’1994), pp. 75–86, 1994.
[67] P. Mandayam Comar, L. Liu, S. Saha, A. Nucci, and P.-N. Tan, “Weighted linear kernel with tree transformed features for malware detection,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, pp. 2287–2290, 2012.
[68] U. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E, vol. 76, no. 3, p. 036106, 2007.
[69] L. Hagen and A. B. Kahng, “New spectral methods for ratio cut partitioning and clustering,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 11, no. 9, pp. 1074–1085, 1992.
[70] A. Blum and S. Chawla, “Learning from labeled and unlabeled data using graph mincuts,” in Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, (San Francisco, CA, USA), pp. 19–26, 2001.
[71] U. von Luxburg, “A tutorial on spectral clustering,” CoRR, vol. abs/0711.0189, 2007.
[72] D. Achlioptas, A. Fiat, A. Karlin, and F. McSherry, “Web search via hub synthesis,” in Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, FOCS ’01, 2001.
[73] U. Kang, C. E. Tsourakakis, and C. Faloutsos, “Pegasus: A peta-scale graph mining system implementation and observations,” in Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM ’09, pp. 229–238, 2009.
[74] U. Kang, B. Meeder, E. E. Papalexakis, and C. Faloutsos, “Heigen: Spectral analysis for billion-scale graphs,” IEEE Trans. Knowl. Data Eng., pp. 350–362, 2014.
[75] Y. Yang, “An evaluation of statistical approaches to text categorization,” Inf. Retr., vol. 1, no. 1-2, pp. 69–90, 1999.
[76] J. Abernethy, O. Chapelle, and C. Castillo, “Web spam identification through content and hyperlinks,” in Proceedings of the SIGIR Workshop on Adversarial Information Retrieval on the Web (AIRWeb’08), (Beijing, China), pp. 41–44, 2008.
[77] G. Attardi, A. Gullì, and F. Sebastiani, “Automatic web page categorization by link and context analysis,” in Proceedings of THAI’99, First European Symposium on Telematics, Hypermedia and Artificial Intelligence, (Varese, IT), pp. 105–119, 1999.
[78] D. A. Cohn and T. Hofmann, “The missing link - a probabilistic model of document content and hypertext connectivity,” in NIPS, pp. 430–436, 2000.
[79] K. Smets, B. Goethals, and B. Verdonk, “Automatic vandalism detection in wikipedia: Towards a machine learning approach,” in Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI), pp. 43–48, AAAI Press, 2008.
[80] M. Potthast, B. Stein, and R. Gerling, “Automatic vandalism detection in wikipedia,” Advances in Information Retrieval, pp. 663–668, 2008.
[81] G. Druck, G. Miklau, and A. McCallum, “Learning to predict the quality of contributions to wikipedia,” in Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI 08), pp. 7–12, 2008.
[82] L. Liu, G. Koutrika, and S. Simske, “Generating reading orders over document collections,” in Proceedings of the 31st IEEE International Conference on Data Engineering (ICDE’15), 2015.