EXPLOITING MULTIPLE DATA SOURCES FOR NETWORK MINING

By

Prakash Mandayam Comar

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2013

ABSTRACT

EXPLOITING MULTIPLE DATA SOURCES FOR NETWORK MINING

By Prakash Mandayam Comar

Network mining is an active research area with applications to diverse fields including computer science, social science, and the biological sciences. However, previous studies have focused mostly on developing algorithms for mining data from a single network. Such algorithms are susceptible to imperfections in the network data such as noisy links and node attribute values. The focus of this thesis is on exploiting multiple data sources to enhance the performance of network mining algorithms for community detection, node classification, and link prediction tasks.

The first contribution of this thesis is the development of a joint matrix factorization framework for mining multiple networks. The framework offers a principled way to perform community detection simultaneously across multiple related networks. It is also highly flexible, allowing the link structure, node attributes, and any prior knowledge about the relationship between communities in different networks to be seamlessly integrated under a unified formulation. The framework is then extended to a multi-task learning setting where one could perform community detection on one network and node classification on the other. Multi-task learning is natural for networks considering the intimate relation between the link structure and node attributes of the networks. However, designing a framework for multi-task network learning requires a joint objective function that can be used for various network mining tasks while accommodating some of the existing objective functions (such as the well-known modularity measure for community detection).

As its second contribution, this thesis presents a novel cost-sensitive loss function that enables joint learning for link prediction and community detection on one or more networks. The loss function addresses the class skewness and degree skewness problems inherent in most link prediction tasks. A formal proof is provided to show the equivalence between the proposed loss function and the modularity measure used in community detection. To enhance the scalability of the approach, a divide-and-conquer scheme was developed in which the learning algorithm is applied to smaller partitions of a network and their results are systematically combined using the boosting framework.

Acquiring reliable labels is crucial for network learning tasks such as link prediction and node classification. While for the most part the labels can be gleaned from the network itself, they are often incomplete and noisy, thus requiring alternative mechanisms to solicit more label information. This thesis explores the viability of using crowdsourcing technology as an external source for obtaining additional labeled data for network mining tasks. Adopting crowdsourcing for network data is non-trivial due to the difficulty in designing a human intelligence task (HIT) that can be easily handled by non-experts (i.e., the crowd). To overcome this problem, this thesis proposes an approach for transforming network data into a set of images that can be easily labeled by non-experts. The conditions under which the transformation preserves the original network data were also examined.
To the best of our knowledge, this is the first study to examine the use of crowdsourcing for acquiring labels in network learning tasks.

This thesis is a step forward towards resolving some of the fundamental challenges in performing multi-source network mining. Though the methods described in this thesis were designed for network mining, some of them (e.g., the methodology to transform network data into image data) are applicable to non-network learning problems.

Copyright by
PRAKASH MANDAYAM COMAR
2013

ACKNOWLEDGMENTS

This PhD dissertation is the end product of a long and delightful journey during which I was helped by several wonderful personalities, many of whom have left a lasting impression on my mind. I would like to sincerely acknowledge each of their contributions in shaping this dissertation. With utmost sincerity and gratitude, I thank my advisor Dr. Pang-Ning Tan for all the technical guidance, support, and encouragement he showered on me. He generously funded my PhD programme with assistantships and internship opportunities. He was always very approachable and kind, and encouraged me to pursue problems of my liking and interest. He always stressed the importance of mathematical representation and analysis of problems, instead of solving them by heuristics and intuition. The data mining course offered by Dr. Tan, the biweekly meetings, seminars, paper reviews, and generous funding for conference travel have kept me up to date with the most recent technical advances in my field of research. In all, he made my stay here at Michigan State University very comfortable, memorable, and enlightening. I will certainly cherish my learnings from his association for a long time to come.

Next, I would like to thank my PhD committee members, Prof. Anil K. Jain, Dr. Rong Jin, and Prof. Ramamoorthi, for their support and guidance throughout my PhD programme. Dr. Jain took special interest in my work and advised me to pursue research associating social networks with crowdsourcing. I was delighted with every meeting and conversation I had with him, and Chapter 6 of this dissertation is a result of the impetus given by him. I would also like to specially thank Dr. Jin for shaping my thinking during the weekly seminar on convex optimization in fall 2010. His machine learning course and seminars had an influential impact on my research work. Finally, I thank Prof. Ramamoorthi for being instrumental in me joining Michigan State University. It was a great honor to have worked with each of my PhD committee members, and I sincerely thank them for all the help and guidance they offered me. I earnestly look forward to continuing my research collaborations with each of my committee members in the future.

Further, I heartily thank my peer group here in the LINKS lab, Lei Liu, Zubin Abraham, Jianpeng, Shai, and Xi, especially for their help in the manual labeling of images, which otherwise would have been a very costly and time consuming endeavor. I especially thank my colleague Mehrdad Mahdavi for keeping my lab hours interesting and informative with his updates on recent technical happenings in the optimization field. My special thanks to our department secretaries Linda Moore and Norma Teague for helping with all the administrative and travel related work. My sincere thanks to all my friends and roommates for keeping my stay in East Lansing a happy and fun filled affair. Finally, I thank my parents for the unconditional love, support, and encouragement they showered on me throughout my life.
I dedicate this work to Sri Sathya Sai, who has been my inspiration and guide all through my life.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1  Introduction
  1.1  Thesis Statement
  1.2  Why Multi-Source Network Mining?
  1.3  Why Multi-Task Network Mining?
  1.4  Label Acquisition for Network Mining
  1.5  Thesis Contributions

Chapter 2  Background and Related Work
  2.1  Categorization of Network Types
  2.2  Learning Tasks on Networks
  2.3  Multi-Network Mining
  2.4  Multi-task Learning on Networks
  2.5  Summary

Chapter 3  Joint Community Detection Across Multiple Networks
  3.1  Preliminaries
  3.2  Joint Clustering Framework
    3.2.1  Joint Clustering of Multiple Networks
    3.2.2  Incorporating Prior Information
    3.2.3  Computational Complexity
    3.2.4  Semi-Supervised Learning
  3.3  Experimental Evaluation
    3.3.1  Baseline Algorithms and Evaluation Metrics
    3.3.2  Synthetic Data Set
    3.3.3  Complexity Verification
      3.3.3.1  Effect of noise in one network
      3.3.3.2  Effect of noise in between the networks
    3.3.4  Wikipedia Dataset
    3.3.5  Digg Data Set
  3.4  Summary

Chapter 4  Joint Clustering and Classification on Multiple Networks
  4.1  Joint Learning Framework
  4.2  Joint Learning vs Independent Learning
    4.2.1  Joint Factorization vs Label Propagation
    4.2.2  Joint Factorization vs Graph Cuts
  4.3  Experimental Evaluation
    4.3.1  Baseline Algorithms and Evaluation Metrics
    4.3.2  Synthetic Data
      4.3.2.1  Varying Noisy Links Within a Network
      4.3.2.2  Varying Noisy Links Between Networks
      4.3.2.3  Effect of Unequal Number of Clusters and Classes
    4.3.3  Wikipedia Data
      4.3.3.1  Number of Iterations
    4.3.4  Digg Data
  4.4  Summary

Chapter 5  Joint Community Detection and Link Prediction
  5.1  Approaches for Link Prediction
  5.2  Loss Function and Risk
  5.3  Variable Cost Loss Function for Link Prediction
  5.4  Modularity
  5.5  Boosting Approach for Link Prediction
    5.5.1  Estimating αt
    5.5.2  Weak Learners
    5.5.3  Scalability
  5.6  Experimental Evaluations
    5.6.1  Baseline Algorithms
    5.6.2  Data Sets for Inferring Missing Links
    5.6.3  Data Sets for Predicting Future Links
    5.6.4  Links Within Community
    5.6.5  Missing and Future Links
    5.6.6  Low Degree Nodes
  5.7  Summary

Chapter 6  Crowdsourcing for Network Mining
  6.1  Transforming Network Data into Images
  6.2  Preliminaries
  6.3  Proposed Framework
  6.4  Surrogate Mapping
    6.4.1  Reconstruction Error Analysis
    6.4.2  Parameter Estimation
  6.5  Experimental Evaluations
    6.5.1  Evaluation Methodology and Baseline
    6.5.2  Synthetic Data
    6.5.3  Biology Article Corpus
    6.5.4  Wiki Editor Networks
  6.6  Conclusions

Chapter 7  Future Work

APPENDIX

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1  Data Category and Sub Category
Table 3.2  The link distribution between different author clusters in the Wikipedia data set
Table 3.3  Average Cluster NMI on the Wikipedia dataset
Table 3.4  Confusion matrix of the article network using the Ncut algorithm on the multi-graph G, which gave NMI 0.55
Table 3.5  Confusion matrix of the article network using the joint clustering method, which gave NMI 0.40
Table 3.6  Confusion matrix on the Digg user and Wikipedia editor networks using Ncut on the multi-graph
Table 3.7  Confusion matrix of Digg users and Wikipedia editors using the proposed joint clustering method
Table 4.1  Summary of notations used in the chapter
Table 4.2  Parameters of the multi-network generator
Table 4.3  Configurations of various synthetic data
Table 4.4  Clustering results of Wikipedia editors. Here (ME) refers to the run with minimum error
Table 4.5  Classification results of Wikipedia articles. Here (ME) refers to the run with minimum error
Table 4.6  Confusion matrix of the article network using the joint algorithm
Table 4.7  Confusion matrix for Digg clustering. The top panel shows the result of Ncut on the Digg user network. The bottom panel shows the clustering result on the multi-graph consisting of both Digg users and Wikipedia editors. Only two dominant clusters were found on both
Table 4.8  Confusion matrix for Wikipedia classification. The left panel gives the confusion matrix on applying LGC on the Wikipedia editor network and the bottom panel gives the confusion matrix on applying LGC on the multi-graph consisting of both the Digg user network and the Wikipedia editor network
Table 4.9  Confusion matrix for Joint with prior information for the run with minimum error
Table 5.1  Link prediction: the AUC of predicted missing links and future links in each of the four data sets
Table 6.1  F-measure for each of the four article categories. Here Wi denotes the ith worker in the crowd. Each independent worker performs better than SVM with rbf kernel (σ = 0.1). The Crowd + SVM approach performed better than individual workers in the crowd
Table 6.2  AUC values of the SVM and Crowd + SVM approaches for the link prediction problem on two different Wikipedia editor networks. The proposed Crowd approach outperforms the baseline SVM by 8% on the Computer Science network and fares slightly better than SVM on the Natural Science editor network

LIST OF FIGURES

Figure 3.1  Multi-network clustering: graph G1 = (V1, E1), graph G2 = (V2, E2), and the bipartite graph capturing the relationship between the nodes of the networks G1 and G2. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation
Figure 3.2  Variation of computation time with increase in the number of links in the network (due to increase in nodes)
Figure 3.3  Effect of noisy links in Network B on the clustering performance of Network A (p1 = 0.4 and p2 = 0.35)
Figure 3.4  Effect of noisy links between networks on the clustering performance of Network A (p1 = 0.4 and p2 = 0.35)
Figure 3.5  Spy plot for the Article (left) and User (right) networks in Wikipedia
Figure 3.6  Adjacency matrix plot for the Digg user and Wikipedia editor networks (best viewed in color)
Figure 4.1  Sample multi-network to illustrate the functioning of graph cut based partitioning techniques. Left: the link density in the bipartite graph G12 is sparse. As a result, the graph cut based algorithm would first disintegrate the multi-network into individual networks and then continue to split individual networks based on their link densities. Right: the link density is very different between network G1 and G2; as a result, one of them may get split many times while the other remains intact. Such problems do not arise in the proposed joint factorization
Figure 4.2  Effect of varying the between-cluster link probabilities (P12) on G1. Top: accuracy on G1. Bottom: NMI on G2
Figure 4.3  Effect of varying the between-cluster link probabilities (P12) on G1. Top: accuracy on G2. Bottom: NMI on G1
Figure 4.4  Effect of varying the between-cluster link probabilities (q2) of the bipartite network G12 with the same number of clusters/classes in both G1 and G2. Top: accuracy on G1. Bottom: NMI on G2
Figure 4.5  Effect of varying the between-cluster link probabilities (q2) of the bipartite network G12 with an uneven number of clusters/classes. Top: accuracy on G1. Bottom: NMI on G2
Figure 4.6  Plot showing the variation of classification accuracy and clustering NMI for every 10 iterations (best viewed in color)
Figure 4.7  Adjacency matrix plot for the Digg user (right) and Wikipedia editor (left) networks (best viewed in color)
Figure 5.1  Proportion of within-community links (good links) as a function of top-K values
Figure 5.2  ROC curves comparing the performance of different link prediction algorithms
Figure 5.3  ROC curves comparing the performance of different link prediction algorithms
Figure 5.4  ROC curves comparing the performance of different missing link prediction algorithms on the subgraph induced by the low degree nodes in the citation networks
Figure 6.1  A toy example consisting of target network data (A) and source handwritten digit data (B). We map each distinct labeled target node to a unique source image and learn the transformation between target and source data. When this transformation is applied to all the target nodes, we get the transformed target data (C). The blurry images are interpreted in Section 6.1
Figure 6.2  The left panel shows the distribution of three classes (in three colors) with respect to the first and second principal components. The top and bottom right panels give the transformed data using 4 and 12 features respectively. The blurry images are formed because of poor transformation quality. This is discussed in detail in Section 6.4.1
Figure 6.3  Plot of the two normal distributions used in generating synthetic data
Figure 6.4  The transformed data from two normal distributions. The text above each image is not meant to be readable but is for visual reference only. The number above each image is the actual sample value. The interpretation of the images is illustrated in Section 6.5.2
Figure 6.5  Wikipedia article corpus: the left figure shows the decreasing error with each iteration for 10 random initializations. The right figure shows that the rank of the source data almost decreases with successive iterations. The lowest error is achieved when the rank of the source falls below the target rank

LIST OF ALGORITHMS

Algorithm 1  Multi-network Clustering
Algorithm 2  LinkBoost
Algorithm 3  Surrogate Mapping Algorithm

Chapter 1

Introduction

The rapid growth of online social media services such as Facebook, Twitter, and Wikipedia has triggered a wave of interest in applying data mining and machine learning techniques to the study of complex networks. The field of social network analysis (SNA) has focused on a variety of issues, from inferring the formation of links in a social network to understanding how social phenomena such as homophily, social influence, and communities emerge from the interactions between individuals in a network. Significant progress was achieved in the 1990s following the seminal works of Watts et al. [104] and Barabasi et al. [8], who showed that the structural properties of many social, biological, and physical networks can be characterized by the same mathematical principles. This has led to increasing research into the development and application of network mining techniques to other areas beyond social science, including systems biology (e.g., for modeling protein and gene interaction networks) and geo-sciences (e.g., for modeling teleconnections among climate variables and ecological processes).

A great deal of research in network mining has focused on the development of algorithms for mining data from a single network. This includes algorithms for solving network link prediction, node classification, and community detection tasks. As more diverse sources of network data become available, the need for multi-source network mining has grown in recent years. However, developing a robust algorithm that can effectively combine information from other networks is a challenge. The algorithm needs to be applied not only to one network, but also simultaneously to other related networks. This requires a flexible computational framework that can utilize potentially noisy information from other sources and produce consistent solutions across all networks. Furthermore, the learning tasks applied to the multiple networks may not be the same. For example, one might be interested in predicting the formation of links in a network based on the community structure found in another network. Previous works, which are mostly limited to performing the same task on one or more networks, must therefore be extended to deal with multi-task network mining problems. Finally, acquiring labeled data for network analysis is an important but understudied problem.
Although most networks already contain some label information (e.g., the partial links in a network can be used to train a link prediction model), the labels are usually incomplete and noisy. This raises an interesting question of whether it is possible to design an approach for acquiring additional labeled data from alternative sources to enhance the performance of network learning algorithms. These are the key issues investigated in this thesis.

1.1 Thesis Statement

This thesis focuses on multi-source and multi-task network mining problems. The techniques developed in this thesis are based on the following two conjectures. First, augmenting data from alternative sources (e.g., from crowdsourcing and other related networks) is expected to improve the performance of network mining algorithms. Second, given the inherent interdependencies between the link structure and node attributes, network learning tasks such as link prediction, community detection, and node classification are mutually related, which makes multi-task learning a natural fit for analyzing network data.

1.2 Why Multi-Source Network Mining?

The need for multi-source network mining has grown in recent years due to the following reasons. First, network data from the target domain alone may not be sufficient to yield high-quality results. In particular, any imperfections in the network data such as noisy links or missing node attribute values can have an adverse effect on the performance of the network learning algorithm. Second, network data has become more diverse. For example, it has become increasingly common for individuals to create user profiles on multiple social media Web sites. According to a recent report by the Pew Internet and American Life Project, about 51% of social network users have two or more online profiles. Each of the social media sites often maintains different aspects of information about the users. For example, Twitter provides information about user opinions and other personal communications while FourSquare (www.foursquare.com) contains a trove of user mobility information. The users may also have a LinkedIn (www.linkedin.com) profile containing information about their professional networks. Many of these Web sites provide easy-to-use application programming interfaces (APIs) to facilitate access to their public data. Similarly, advances in high throughput genomic technologies have enabled the modeling of complex biological systems using diverse types of networks, including protein interaction networks, metabolic networks, gene interaction networks, and gene regulatory networks. There has also been increasing interest in analyzing climate networks, which can be constructed using climate variables such as temperature, precipitation, and atmospheric pressure measured at different heights.

Despite the availability of these diverse sources of network data, there are very few studies that consider integrating them to improve the analysis of complex networks. This thesis aims to develop new learning formulations that can effectively harness network data from multiple sources.

1.3 Why Multi-Task Network Mining?

A network consists of an inter-connected set of nodes, whose properties are represented by the node attributes. There is often a tight coupling between the link structure of a network and some of the attributes of the nodes.
For example, individuals are more likely to befriend others who work in the same organization or attend the same school compared to those who work at or attend different ones. Due to these inter-dependent relationships, many of the network learning tasks are mutually related to each other. For example, consider the product recommendation problem at an online retail store such as Amazon. Here, a "link" can be established between customers if they had bought similar products in the past. The product recommendation problem can be cast as a link prediction task on the user-product bipartite graph. It would be natural for the recommendation algorithm to first identify segments of customers who share similar purchasing behavior before making its recommendations. Identifying different customer segments can be modeled as a clustering or community detection problem. There is a close relationship between the network link prediction and community detection tasks. A good link prediction algorithm should take into consideration the community structure present in the network. Conversely, the link prediction algorithm could be used as a pre-processing step to enhance the within-community links before applying a community detection algorithm [34]. This example illustrates the need to design network learning algorithms that can simultaneously solve a collection of learning tasks on one or more networks, instead of performing only a single task.

A key challenge in performing multi-task network learning is in designing a joint objective function that simultaneously performs all the related learning tasks. The shared objective function should be designed in such a way that each learning task is aided by the partial solutions obtained from other related learning tasks, and the partial solutions must be intelligently combined in order to attain an optimal solution. Despite its promise, to the best of our knowledge, there has been no significant research work on multi-task network learning.

1.4 Label Acquisition for Network Mining

Adequate labeled data is the key requirement for training supervised learning algorithms for link prediction or node classification tasks. Even unsupervised learning tasks such as community detection and anomaly detection can benefit from utilizing partially labeled data (via the semi-supervised learning paradigm). In both cases, domain experts are often needed to manually peruse the data and categorize them into different labels. This is both a tedious and time consuming process, and may not always generate enough labeled data for the effective mining of large-scale networks. Annotating network data is cumbersome as the label information depends not only on the attributes of a node, but also on the attributes and labels of its neighboring nodes. Labeling a non-network data instance is easier as its label does not depend on the label assignment for other instances. In fact, there is a subset of labeling problems known as human intelligence tasks (HITs), where it is easy (or cheap) to train non-experts to provide reasonably accurate labels on the given data. Typical examples of HITs include digit, letter, or text recognition, image classification, and object identification in image and video streams. It would be useful to develop a framework for exploiting labeled data (especially from HITs or other non-related domains) in order to generate additional labeled examples for the network mining problem.
1.5 Thesis Contributions

As mentioned earlier, the focus of the thesis is on developing network mining algorithms that can exploit data from multiple sources. This is akin to network learning with side-information [57, 62]. However, previous studies on learning with side information (also known as semi-supervised learning) have focused primarily on independent and identically distributed (i.i.d.) data or network data obtained from a single source. These approaches assume there is a target data set for which a learning algorithm is designed, with auxiliary information provided by the alternative sources (e.g., in the form of must-link and cannot-link constraints for clustering). In contrast, our approach for learning from multiple networks assumes each network is equally important and has its own learning task to be solved. In fact, the networks may have different learning parameters (e.g., the number of communities in the different networks may not be the same) or learning tasks (node classification on one network and link prediction in another).

The main contributions of this thesis are as follows. In Chapter 3, we provide a matrix factorization based framework to perform joint community detection across multiple related networks [63, 64]. In Chapter 4, we further extend the above framework to perform multi-task learning where we simultaneously perform clustering and classification on the different networks [21, 65]. We found that the matrix factorization approach has a significant advantage over graph partitioning methods such as normalized cut, especially when combining networks with different link densities. Another advantage of the framework is that it can systematically combine both content and link information from the multiple related networks. Finally, prior information about the relationships between communities in the different networks can also be incorporated into the framework. In Chapter 5, we present a framework to perform joint learning for link prediction and community detection [22]. Here, we have designed a novel cost-sensitive loss function that addresses both the class skewness and degree skewness problems that are prevalent in most link prediction tasks. With a proper choice of the cost parameters, the proposed loss function can be theoretically shown to be equivalent to the well-known modularity measure used in community detection. We have employed a divide-and-conquer scheme for constructing the model, where the learning algorithm is initially applied to smaller partitions of the given network and the results obtained from each partition are systematically combined using the boosting framework. In Chapter 6, we consider the problem of acquiring additional labeled data for supervised network mining tasks such as link prediction and node classification using crowdsourcing technology. Due to the difficulty in designing a HIT that can be easily solved, even by non-experts, we present a generic framework that transforms the network data into an image set, thereby giving a distinct visual representation of the network data for non-experts to label. Such an approach is shown to produce reliable labeled data that can be augmented to boost the performance of network mining algorithms.

Chapter 2

Background and Related Work

The study of complex networks has leaped to the forefront of data mining research, spurred by the rapid proliferation of relational data generated from various physical, biological, and socio-information systems.
Substantial progress has been made over the past decade to address fundamental questions such as: How are links established in a network? How are communities formed and sustained over time? What are the most influential nodes in the network? How can missing attributes or links in a network be inferred? To provide answers to these and many other questions, innovative computational solutions have been developed to mine the rapidly growing repositories of network data. In this chapter we present a detailed review of advances in mining network data.

Ever since Euler first applied graph-theoretic principles to solve the Königsberg Bridge Problem [41], the study of networks has become increasingly popular, from the analysis of social and biological systems to the modeling of disease outbreaks, supply chains, power grids, and transportation networks. A network is typically represented as a graph of interconnected nodes, where each node represents an entity that is characterized by a set of attributes. Formally, let N = (V, E, X) be a network, where V is the vertex (node) set, E ⊆ V × V is the edge (link) set, and X is a matrix of nodal attributes.

2.1 Categorization of Network Types

Networks can be characterized based on the types of nodes and links they contain. A network is homogeneous if all the nodes in the network are of the same type. Otherwise, it is called a heterogeneous network. Furthermore, a network is mono-relational or uniplex if all the links are of the same type, and multi-relational or multiplex if it contains links of different types [68]. For example, a co-authorship network is a uniplex network since each link indicates that a pair of authors have written an article together. An online social network can often be treated as a multiplex network because two people who are linked together could be friends, relatives, fans (followers), or even complete strangers. Networks can also be characterized as static or dynamic, depending on the temporal properties of their link structure. The former is represented by a graph with fixed connectivity whereas the latter is represented by a sequence of network snapshots, each indexed by its corresponding time stamp, i.e., NT = {(Vt, Et, Xt)}, t = 1, . . . , T.

In this thesis we present algorithms for mining multiple networks. Here we assume the availability of a collection of homogeneous networks, {N1, N2, · · · , Nk}, each of which is obtained from a different data source. As used herein, the term "data source" broadly refers to any repository that houses the network data or a specific approach used to generate the data for constructing the network. For example, there are many data sources available for studying online social networks, including Facebook, Twitter, Digg, and YouTube, each of which has its own application programming interface (API) to facilitate searching and downloading data in a standardized format such as XML or JSON. Furthermore, the nodes in different networks may be connected by a set of inter-network links. In this thesis, we refer to the collection of such multiple networks along with their links as a multi-network.

Definition 1 A multi-network N is a collection of homogeneous networks {N1, N2, · · · , Nk}, where each homogeneous network Ni is an attributed graph (Vi, Ei, Xi) and the different networks are connected by a set of inter-network links, Ê = {(vp, vq) | vp ∈ Vi, vq ∈ Vj, i ≠ j}.
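For illustration, Definition 1 can be read as the following minimal Python sketch. This is an illustration only; the class and field names are hypothetical and not part of the dissertation. It also tests for the two special cases discussed below.

```python
# Illustrative sketch of a multi-network (Definition 1); names and structure
# are assumptions for illustration, not code from the dissertation.
from dataclasses import dataclass, field

@dataclass
class AttributedGraph:
    nodes: frozenset                 # V_i
    edges: set                       # E_i: (u, v) pairs within the network
    attributes: dict = field(default_factory=dict)   # X_i: node -> attribute vector

@dataclass
class MultiNetwork:
    networks: dict                   # network id -> AttributedGraph
    inter_links: set                 # {(i, v_p, j, v_q) : v_p in V_i, v_q in V_j, i != j}

    def is_multiplex(self) -> bool:
        # Multiplex special case: identical node sets and no inter-network links.
        node_sets = {g.nodes for g in self.networks.values()}
        return len(node_sets) == 1 and not self.inter_links

    def is_k_partite(self) -> bool:
        # Heterogeneous k-partite special case: no within-network links,
        # only inter-network links.
        return all(not g.edges for g in self.networks.values()) and bool(self.inter_links)
```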
With this definition, it is easy to verify that multiplex networks and heterogeneous k-partite graphs are special cases of multi-networks. For example, a multi-network with homogeneous nodes and no inter-network links (i.e., ∀i : Vi = V and Ê = ∅) is equivalent to a multiplex network, whereas a multi-network with disjoint node sets (∀i, j : Vi ∩ Vj = ∅) and no within-network links (i.e., ∀i : Ei = ∅ but Ê ≠ ∅) is a heterogeneous k-partite graph.

2.2 Learning Tasks on Networks

Network mining research can be broadly divided into the following tasks: network generation and characterization, link prediction, node classification, community finding, and rank analysis. We briefly review these tasks below. A more detailed exposition can be found in the following books and survey articles [35, 87, 90].

Link Prediction: Link prediction attempts to uncover previously unknown relationships in a network. It can be used, for example, to identify covert ties in an adversarial network or to predict regulatory interactions in a biological network. There are two common approaches to solve the link prediction problem: (1) using a generative modeling approach [70, 77, 98, 103] to learn a joint probability model of the network components (node attributes and link structure) and marginalize the distribution to make a prediction, or (2) using a discriminative approach to learn a target function that directly maps an input pair to its corresponding class [42, 47, 55]. Discriminative approaches are often preferred for several reasons. First, generative approaches require specifications of the dependence relationships among the network components via a graphical model and assumptions about the parametric forms of the probability distributions. Second, estimation of the model parameters can be very expensive, thus requiring approximation techniques such as Markov Chain Monte Carlo (MCMC) and variational methods [27].

Node Classification: The node classification problem attempts to assign a class label to all the nodes in a network using the partial class information available on a few selected nodes in the network. This is similar to the conventional pattern classification problem, with the difference being that the objects to be classified are not independent and identically distributed (i.i.d.). Instead, the label of each node is dependent on the labels of its neighboring nodes. Due to such dependency, any node classification method should take into consideration the link structure as well as node attribute values when assigning class labels to the nodes. It has been previously shown that augmenting the nodal attributes of the neighbors may yield poor classification performance [14]. However, if the labels of the neighboring nodes are incorporated into the feature vector, this will improve the overall classification accuracy [14, 61]. Another important consideration for node classification is the presence of different types of regularities in the networks [110]. Here, nodes of a similar class label tend to form more links among themselves than with the rest of the network. Researchers in the past have also used probabilistic models [96] and matrix factorization [86] methods to perform the node classification task.

Rank Analysis: Ranking is the process of assigning an ordering among the nodes in a network. The rank of a node reflects the measurement of some particular structural property of the network, which conveys a semantic meaning such as importance, popularity, authority, etc. As an end in itself, rankings can also be used to look for well-connected or central nodes in a network. In network mining, ranking is often performed using centrality
As an end in itself, rankings can also be used to look for well-connected or central nodes in a network. In network mining, ranking is often performed using centrality 11 measures [103] such as degree, closeness, and betweenness. A popular ranking method for large directed networks such as the World Wide Web is the PageRank algorithm [75], which employs a random walk with restart approach to compute the dominant eigenvector of a matrix. Other eigenvector-based algorithms, similar to PageRank, include HITS [50] and SALSA [53]. Recently, there has been considerable interest in assigning rank values to nodes based on their community belongingness. Guimera et al. [40] introduced a metric called participation coefficient, which measures to what degree a node participates in other communities. Scripps et al. [84] introduced an alternative metric called rawComm for assigning ranks and roles to nodes without applying a community finding algoruthm. Community Finding: Identifying communities in a network is closely related to the data clustering problem. Traditional clustering methods seek to find groups of objects with highly similar attribute values. In contrast, the technique of community finding places nodes in a network into cohesive subgroups in such a way that the nodes within a group are highly connected to each other and disconnected from nodes in other groups [63, 108]. Community finding is an ill-posed problem; there is no agreed-upon metric for evaluation. A popular graph-based metric from Newman and Girvan [71], called modularity, is based on the fraction of links within a community to those between communities. Some community finding methods do not try to completely cluster the entire network. Instead they form communities from a given seed set of nodes [30, 36]. More recently, progress in community finding has focused along finding communities in dynamic networks, where the nodes, links, and attributes change over time. Backstrom et al. [5] studied how the structural features of communities affect how nodes join and leave communities. Tantipathananadth et al. [95] proposed a new framework for tracking community changes in dynamic networks by modeling it as a graph coloring problem. Communities are then identified by approximately solving a 12 combinatorial optimization problem using dynamic programming. 2.3 Multi-Network Mining As mentioned in Section 2.1, a multi-network consists of a collection of networks that are inter-connected with each other. Previous works have focused mostly on mining special cases of such networks, either as multiplex networks or heterogeneous k-partite graphs. For example, a multiplex network of articles (or documents) can be constructed by defining different types of links between the articles (e.g., based on their co-citations, similarity of keywords in abstracts or titles, and similarity of authors). There have been recent efforts to identify communities in multiplex networks [12]. Zhou et al. [115] has used the different link types to create similarity matrices between nodes which are then used to cluster the multiplex network. Researchers have also customized the well known matrix factorization techniques to cluster the multiplex networks. For example, Tang et al. [94] has proposed a linked matrix factorization approach for fusing information from multiple graphs. Lin et al. [56] also investigated a similar problem using a relational hypergraph factorization approach to detect communities of users based on various social contexts and interactions. 
Another type of multi-network is a heterogeneous k-partite network, where links exist only between nodes of different types. This type of graph has been successfully used to model relationships such as documents-words, products-users, blogs-bloggers, etc. Clustering bipartite and k-partite graphs is often referred to as co-clustering or multi-way clustering in the literature. Long et al. [58] investigated the problem of co-clustering as a matrix factorization problem and derived multiplicative update formulas for identifying the clusters. Dhillon et al. [25] presented a framework for co-clustering that minimizes the loss in mutual information between the original joint distribution of related data and the corresponding joint distribution of the clustered data. Long et al. [59, 60] also provided a unified framework for attributes-based clustering, semi-supervised clustering, co-clustering, and graph clustering using a probabilistic framework.

The focus of this thesis is on mining multi-network data that can be conceived as a fusion of a multiplex network and a heterogeneous k-partite network. Research on mining such types of networks has flourished recently. Narayanan et al. [69] propose simultaneous clustering of multiple networks as a framework to integrate large-scale datasets on the interactions among and activities of cellular components. They present an algorithm called JointCluster that finds sets of genes that cluster well in multiple networks of interest, such as co-expression networks summarizing correlations among the expression profiles of genes and physical networks describing protein-protein and protein-DNA interactions among genes or gene-products. Chen et al. [16] recently presented a co-classification approach for detecting Web spam and spammers in social media applications. They formalized the joint detection tasks as a constrained optimization problem, in which the relationships between users and their submitted Web content are represented as constraints in the form of graph regularization. A pair of classifiers for detecting Web spam (URL nodes) and spammers (user nodes) is simultaneously trained taking into consideration the URL-URL and user-user links. They demonstrated that the co-classification strategy is more effective than training the pair of classifiers independently.

2.4 Multi-task Learning on Networks

Multi-task learning is a machine learning paradigm that simultaneously learns a collection of related problems using a shared problem representation. Such an approach is desirable when the solution to one of the learning tasks can be used in learning other related tasks. It leads to a better model for each learning task, because it allows the learner to harness the commonality between the learning tasks. The merits of multi-task learning over single-task learning have been discussed in detail by Caruana [13]. As mentioned in [13], if the tasks can share what they learn, the learner may find it easier to learn them together than in isolation. Thus, if we simultaneously train a classifier to recognize object outlines, shapes, edges, regions, subregions, textures, reflections, highlights, shadows, text, orientation, size, distance, etc., it may learn to recognize complex objects in the real world better than by learning them independently. Multi-task learning can be considered a form of transfer learning [76].
Given a source domain DS and learning task TS, and a target domain DT and learning task TT, transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT or TS ≠ TT. Typically, in multi-task learning the domains are identical (DS = DT) but the learning tasks are different. One such example is identifying the hair color and eye color of a given face image. These are two separate tasks applied to the same data (image corpus). Most of the previous works on multi-task learning have focused on learning multiple, related classification tasks [28, 74, 106]. There have been some recent efforts to extend the formulation to learning different tasks such as classification and regression [1, 111] or regression and ranking [88]. However, none of these previous works are designed for relational data, which is the focus of this thesis.

Definition 2 (Multi-task Multi-network Learning) Given a multi-network N, the multi-task multi-network learning problem is to solve k learning tasks, where a learning task is associated with a subnetwork Ni = (Vi, Ei, Xi) of N.

The learning tasks stated in the preceding definition may correspond to the same class of learning problem or to different classes of problems (e.g., classification, community detection, or link prediction). For example, one might be interested in classifying the nodes in one network while detecting communities in another. Furthermore, we only consider learning tasks that are mutually related; i.e., the classes or communities in one network are dependent on the classes or communities in another.

2.5 Summary

This chapter reviewed previous work on network mining. Past research has mostly focused on learning from a single network, temporal networks, k-partite graphs, or multiplex networks. There has been very little work on combining multiple related networks or on multi-task learning on multi-networks, which is the focus of this thesis.

Chapter 3

Joint Community Detection Across Multiple Networks

Community detection in networks is an algorithmic approach to partition the nodes in a network into cohesive groups (known as communities) in such a way that the nodes within a group are highly related (connected) with each other and are mostly unrelated (disconnected) from nodes in other groups. Most of the previous work has focused on finding communities in a single network or in a bipartite graph formed on heterogeneous nodes (e.g., co-clustering of articles and authors in a bibliographic network) [73, 83, 109, 112]. This chapter investigates the problem of combining link information from more than one network to improve the efficiency of the community detection task.

Our work on clustering multiple networks of heterogeneous nodes is motivated by its many potential applications. For example, it can be used to simultaneously find clusters of scientific papers and clusters of authors working in the same research areas. Similarly, it can also be used to perform joint clustering on Wikipedia articles and editors of the Wikipedia pages. The multiple networks may also represent relational data from different domains. For example, one could perform joint clustering of Wikipedia editors and Digg users (Digg.com is a social networking web site that allows users to share news stories with other users), where the links between Wikipedia editors and Digg users are established based on the content similarity between the edited Wikipedia pages and the news stories submitted in Digg.
The advantages of multi-network clustering are that:

• The attribute set of nodes in an individual network may not be rich enough for the purpose of clustering.

• An individual network may have noisy or partially observed links. In such a case the link structure may be enhanced by considering information from other associated networks.

A naive approach for multi-network clustering is to partition each network separately. Such an approach is useful when the link structure and subgroup information in different networks are independent of each other. However, if the networks are related to each other, then the link structures in the individual networks are not only characteristic of the respective networks but often contain implicit information about the underlying clusters of the other related networks. In such a scenario, we expect a joint clustering to enhance the performance of the clustering algorithm. The proposed framework is equally applicable to clustering multiple networks created from heterogeneous nodes of the same source (e.g., Wikipedia articles and editors) or nodes from different sources (e.g., Wikipedia editors and Digg users), as long as the corresponding links between nodes in the different networks can be established. One possible motivation for jointly identifying communities across different network domains is that it allows us to compare the characteristics of similar communities present across different domains. For example, the political science community in Wikipedia may consist of many academic professionals and university students, whereas a similar community in digg.com may consist of tech-savvy netizens with political interests. Another motivation is that the joint clustering allows us to incorporate auxiliary information from multiple data sources that are related to a given network.

[Figure 3.1: Multi-network clustering. Graph G1 = (V1, E1), graph G2 = (V2, E2), and the bipartite graph capturing the relationship between the nodes of the networks G1 and G2. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.]

3.1 Preliminaries

In this thesis, we formulate the multi-network clustering problem on a pair of related networks. Our formulation can be extended to more than two networks. Let G1(V1, E1) and G2(V2, E2) be a pair of graphs associated with two networks. The objective of multi-network clustering is to create sets of partitions {P1j} (j = 1, . . . , k1) and {P2j} (j = 1, . . . , k2) such that V1 = ∪j P1j and V2 = ∪j P2j. We seek a pair of functions g1 : V1 → [0, 1]^k1 and g2 : V2 → [0, 1]^k2 such that gi(vj) = (c1, c2, . . . , cki), where each cj ∈ [0, 1] is the degree of membership with which node vj belongs to cluster partition Pij. Figure 3.1 depicts the different graphs and their relationships.

In the independent clustering approach, the cluster membership functions g1 and g2 are learnt separately using their corresponding adjacency matrices. To simplify the notation, let A be the adjacency matrix associated with graph G1, where Aij = 1 if (vi, vj) ∈ E1. Similarly, B is the adjacency matrix for graph G2, where Bij = 1 if (vi, vj) ∈ E2. In addition, we assume there is a third set of edges E3 ⊆ V1 × V2 connecting the nodes between G1 and G2. We denote the adjacency matrix for these edges as C, i.e., Cij = 1 if (vi, vj) ∈ E3.
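As a concrete illustration of this setup, the three matrices can be assembled from edge lists as sparse 0/1 matrices. The sketch below is illustrative only; the edge lists and node counts are toy values, not data from the dissertation.

```python
# Toy construction of the matrices A, B, and C used in Section 3.1.
# Illustrative sketch; sizes and edges are made up.
import numpy as np
from scipy.sparse import csr_matrix

def adjacency(edges, n_rows, n_cols):
    """Build a 0/1 adjacency matrix from a list of (i, j) index pairs."""
    if not edges:
        return csr_matrix((n_rows, n_cols))
    rows, cols = zip(*edges)
    return csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n_rows, n_cols))

E1 = [(0, 1), (1, 2), (2, 3)]     # links within G1 (4 nodes)
E2 = [(0, 1), (1, 2)]             # links within G2 (3 nodes)
E3 = [(0, 0), (3, 2)]             # inter-network links (node in G1, node in G2)

A = adjacency(E1, 4, 4)           # A_ij = 1 iff (v_i, v_j) in E1
B = adjacency(E2, 3, 3)           # B_ij = 1 iff (v_i, v_j) in E2
C = adjacency(E3, 4, 3)           # C_ij = 1 iff (v_i, v_j) in E3
```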
In this study, the cluster partitions are obtained by decomposing the adjacency matrix representation of a graph into a product of its latent factors. In particular, we seek to minimize the distance function D(A ∥ B) between the adjacency matrix A and the product of latent factors B, where:

$$D(A \,\|\, B) = \sum_{ij}\left(A_{ij}\log\frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij}\right) \qquad (3.1)$$

Note that if $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$, the distance function reduces to the Kullback-Leibler divergence measure.

Depending on the application domain, the adjacency matrix C is either readily available as part of the data or needs to be estimated from the data. For example, consider the document-document network (A) and the author-author network (B). Both networks are naturally linked by a document-author bipartite graph C. In another example, consider two networks constructed from Wikipedia editors and Digg users. Suppose we want to simultaneously cluster these two user networks such that each cluster represents users with interest in certain topics like science, sports, entertainment, etc. Here there is no natural link matrix C that is readily available to perform joint clustering. Nevertheless, it can be estimated from the data. Two users from different networks are linked if the Wikipedia pages one of them has edited have high similarity values to the news stories submitted by the Digg user.

3.2 Joint Clustering Framework

This section outlines our proposed framework for identifying cohesive subgroups in multiple networks. Our framework uses the non-negative matrix factorization technique given in [26, 52]. Let $A \in \mathbb{R}_+^{n \times n}$ and $B \in \mathbb{R}_+^{m \times m}$ be the adjacency matrices of the graphs G1 and G2, respectively, whereas $C \in \mathbb{R}_+^{n \times m}$ is the adjacency matrix for the links between nodes in G1 and G2. Here $\mathbb{R}_+$ represents the set of non-negative real numbers. Note that our framework is applicable to both directed and undirected graphs.

3.2.1 Joint Clustering of Multiple Networks

To simultaneously cluster the networks, we minimize the following objective function with respect to X, Y, U, V, and W:

$$J = D(A \,\|\, XUX^T) + D(B \,\|\, YWY^T) + D(C \,\|\, XVY^T) \qquad (3.2)$$

where $X \in \mathbb{R}_+^{N \times k_1}$ and $Y \in \mathbb{R}_+^{M \times k_2}$ are the corresponding cluster membership matrices for the two networks. The decomposition of A into a 3-factor $XUX^T$ instead of a 2-factor $XX^T$ enables the framework to deal with directed links [26, 117]. For each node i in graph G1, Xij indicates the cluster membership of node i to cluster j. The cluster membership values are not necessarily probabilities. A large value of Xij indicates greater affinity for node i to be a member of cluster j. Similarly, we can interpret Yij. The matrix V reflects the correspondence between the subgroups derived from the two networks.

The objective function can be written as follows:

$$\min_{X,Y,U,V,W} J = \sum_{ij}\left(A_{ij}\log\frac{A_{ij}}{[XUX^T]_{ij}} - A_{ij} + [XUX^T]_{ij}\right) + \sum_{ik}\left(C_{ik}\log\frac{C_{ik}}{[XVY^T]_{ik}} - C_{ik} + [XVY^T]_{ik}\right) + \sum_{ks}\left(B_{ks}\log\frac{B_{ks}}{[YWY^T]_{ks}} - B_{ks} + [YWY^T]_{ks}\right) \qquad (3.3)$$

Taking the partial derivatives of J with respect to X and Y yields the following (the partial derivatives with respect to U, V, and W can be similarly derived):

$$\frac{\partial J}{\partial X_{ij}} = \sum_{a=1}^{N}\left[-\frac{A_{ia}[XU^T]_{aj}}{[XUX^T]_{ia}} + [XU^T]_{aj} - \frac{A_{ai}[XU]_{aj}}{[XUX^T]_{ai}} + [XU]_{aj}\right] + \sum_{a=1}^{M}\left[-\frac{C_{ia}[YV^T]_{aj}}{[XVY^T]_{ia}} + [YV^T]_{aj}\right]$$

$$\frac{\partial J}{\partial Y_{ij}} = \sum_{a=1}^{M}\left[-\frac{B_{ia}[YW^T]_{aj}}{[YWY^T]_{ia}} + [YW^T]_{aj} - \frac{B_{ai}[YW]_{aj}}{[YWY^T]_{ai}} + [YW]_{aj}\right] + \sum_{a=1}^{N}\left[-\frac{C_{ai}[XV]_{aj}}{[XVY^T]_{ai}} + [XV]_{aj}\right] \qquad (3.4)$$

Given the objective function (3.3) and its partial derivatives, one can solve for X, Y, U, V, and W using a gradient descent approach.
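For concreteness, the following minimal NumPy sketch evaluates the distance (3.1) and the joint objective (3.3). It is an illustration only, assuming dense arrays and a small constant EPS to avoid division by zero and the logarithm of zero; neither assumption comes from the text.

```python
# Sketch of evaluating the generalized KL divergence (3.1) and objective (3.3).
import numpy as np

EPS = 1e-12  # numerical guard; an implementation convenience, not from the text

def gen_kl(A, B):
    """D(A || B) = sum_ij ( A_ij * log(A_ij / B_ij) - A_ij + B_ij )."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float) + EPS
    ratio = np.where(A > 0, A / B, 1.0)   # treat 0 * log(0) as 0
    return np.sum(A * np.log(ratio) - A + B)

def joint_objective(A, B, C, X, Y, U, V, W):
    """Objective (3.3): D(A || XUX^T) + D(B || YWY^T) + D(C || XVY^T)."""
    return (gen_kl(A, X @ U @ X.T)
            + gen_kl(B, Y @ W @ Y.T)
            + gen_kl(C, X @ V @ Y.T))
```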
Here, we give converging iterative matrix-factorization-based update formulas for the unknown factors:

X_{ij} = X_{ij} \, \frac{\sum_{a=1}^{N} \left( \frac{A_{ia}[XU^T]_{aj}}{[XUX^T]_{ia}} + \frac{A_{ai}[XU]_{aj}}{[XUX^T]_{ai}} \right) + \sum_{a=1}^{M} \frac{C_{ia}[YV^T]_{aj}}{[XVY^T]_{ia}}}{\sum_{a=1}^{N} [XU + XU^T]_{aj} + \sum_{a=1}^{M} [YV^T]_{aj}}    (3.5)

Y_{ij} = Y_{ij} \, \frac{\sum_{a=1}^{M} \left( \frac{B_{ia}[YW^T]_{aj}}{[YWY^T]_{ia}} + \frac{B_{ai}[YW]_{aj}}{[YWY^T]_{ai}} \right) + \sum_{a=1}^{N} \frac{C_{ai}[XV]_{aj}}{[XVY^T]_{ai}}}{\sum_{a=1}^{M} [YW + YW^T]_{aj} + \sum_{a=1}^{N} [XV]_{aj}}    (3.6)

V_{ij} = V_{ij} \, \frac{\sum_{a=1}^{N} \sum_{b=1}^{M} \frac{C_{ab}}{[XVY^T]_{ab}} X_{ai} Y_{bj}}{\sum_{a=1}^{N} \sum_{b=1}^{M} X_{ai} Y_{bj}}    (3.7)

U_{ij} = U_{ij} \, \frac{\sum_{a=1}^{N} \sum_{b=1}^{N} \frac{A_{ab}}{[XUX^T]_{ab}} X_{ai} X_{bj}}{\sum_{a=1}^{N} \sum_{b=1}^{N} X_{ai} X_{bj}}    (3.8)

W_{ij} = W_{ij} \, \frac{\sum_{a=1}^{M} \sum_{b=1}^{M} \frac{B_{ab}}{[YWY^T]_{ab}} Y_{ai} Y_{bj}}{\sum_{a=1}^{M} \sum_{b=1}^{M} Y_{ai} Y_{bj}}    (3.9)

Theorem 1. For fixed Y, U, V and W, the update formula for X given in Equation (3.5) monotonically decreases the objective function defined in (3.3).

The proof of the theorem is given in the Appendix.

Algorithm 1 Multi-network Clustering Algorithm
Input: Matrices A, B, C, and maximum iteration T.
Output: Matrices X, Y, and V.
Initialize X^old, Y^old, U^old, V^old, and W^old to random matrices.
for i = 1 to T do
    update X^new using (3.5)
    update Y^new using (3.6) with X^new
    update V^new using (3.7) with X^new and Y^new
    update U^new using (3.8) with X^new
    update W^new using (3.9) with Y^new
    set X^old ← X^new; Y^old ← Y^new; U^old ← U^new; W^old ← W^new; V^old ← V^new
end for

3.2.2 Incorporating Prior Information

Many times, we may have additional information about the correspondence between clusters in multiple networks. In what follows we give a motivation for incorporating this prior information into the objective function.

Example 1. Consider a citation network between research articles and a co-authorship network between researchers. Suppose the articles are grouped into the following topics: Algorithms, Artificial Intelligence, Databases, Cell Biology, and Genetics. Typically, an author may work on multiple related topics; therefore the article partitions are not reflected as-is in the author network. Rather, they are further grouped into coarser clusters, namely, Computer Science and Biotechnology. The author cluster Computer Science is related to the first three article clusters, while Biotechnology is related to Cell Biology and Genetics. We expect such prior information will enhance the joint clustering results. This information can be encoded in a 5 × 2 prior matrix:

P = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}

where the rows are the article clusters and the columns are the author clusters.

To incorporate the prior, we first need to interpret the role of the V matrix in the objective function. As mentioned earlier, V is the 'between-network' cluster correspondence matrix; its elements reflect the relationship between the clusters of the two networks. If we have prior knowledge about the proportion P of links between the clusters in both networks, then it can be incorporated into the objective function (3.3) as follows:

J = D(A \| XUX^T) + D(B \| YWY^T) + D(C \| X(\lambda V + (1 - \lambda)P)Y^T)    (3.10)

where λ is a parameter provided by the user. We call V' = λV + (1 − λ)P the adjusted correspondence matrix. The λ parameter in V' controls the tradeoff between fitting V' directly to the data and fitting V' to the prior matrix P. If λ = 0, then the correspondence between clusters is given by the prior matrix. If λ = 1, then the formulation reduces to the joint clustering framework given in Equation (3.3).
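For concreteness, the following is a dense NumPy sketch of the multiplicative updates (3.5)-(3.9) used inside Algorithm 1 (assumed implementation, not the author's code). A small epsilon guards against division by zero; a sparse implementation would evaluate the ratios A/(XUX^T), B/(YWY^T), and C/(XVY^T) only on observed links, as discussed in Section 3.2.3.

    # Sketch of Algorithm 1 with the multiplicative updates (3.5)-(3.9).
    import numpy as np

    EPS = 1e-10

    def joint_clustering(A, B, C, k1, k2, T=200, seed=0):
        rng = np.random.default_rng(seed)
        n, m = A.shape[0], B.shape[0]
        X = rng.random((n, k1)); Y = rng.random((m, k2))
        U = rng.random((k1, k1)); W = rng.random((k2, k2)); V = rng.random((k1, k2))
        for _ in range(T):
            # Update X (Eq. 3.5)
            RA = A / (X @ U @ X.T + EPS)          # A_ij / [XUX^T]_ij
            RC = C / (X @ V @ Y.T + EPS)          # C_ik / [XVY^T]_ik
            num = RA @ (X @ U.T) + RA.T @ (X @ U) + RC @ (Y @ V.T)
            den = (X @ (U + U.T)).sum(0) + (Y @ V.T).sum(0) + EPS
            X *= num / den
            # Update Y (Eq. 3.6)
            RB = B / (Y @ W @ Y.T + EPS)
            RC = C / (X @ V @ Y.T + EPS)
            num = RB @ (Y @ W.T) + RB.T @ (Y @ W) + RC.T @ (X @ V)
            den = (Y @ (W + W.T)).sum(0) + (X @ V).sum(0) + EPS
            Y *= num / den
            # Update V, U, W (Eqs. 3.7-3.9)
            RC = C / (X @ V @ Y.T + EPS)
            V *= (X.T @ RC @ Y) / (np.outer(X.sum(0), Y.sum(0)) + EPS)
            RA = A / (X @ U @ X.T + EPS)
            U *= (X.T @ RA @ X) / (np.outer(X.sum(0), X.sum(0)) + EPS)
            RB = B / (Y @ W @ Y.T + EPS)
            W *= (Y.T @ RB @ Y) / (np.outer(Y.sum(0), Y.sum(0)) + EPS)
        return X, Y, U, V, W

Incorporating the prior of Eq. (3.10) amounts to replacing V with λV + (1 − λ)P wherever the product XVY^T is formed.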
In situations where the proportion of data scattered between the clusters in the two networks is unknown, we can use a non-informative prior in which P_ij is 0 or 1, indicating whether the i-th article cluster is related to the j-th author cluster (see Example 1).

3.2.3 Computational Complexity

This section presents an analysis of the computational complexity of our algorithm. To begin with, consider the clustering of a single network. The objective function for this would be D(A ‖ XUX^T) and the update formula for the cluster membership matrix X is given by

X_{ij} = X_{ij} \, \frac{\sum_{a=1}^{N} \left( \frac{A_{ia}[XU^T]_{aj}}{[XUX^T]_{ia}} + \frac{A_{ai}[XU]_{aj}}{[XUX^T]_{ai}} \right)}{\sum_{a=1}^{N} [XU + XU^T]_{aj}}    (3.11)

Let k be the number of clusters we wish to identify. Each X_ij is multiplied by an update factor that involves several matrix multiplications. The products XU and XU^T can be computed in O(nk^2), while XUX^T requires an additional computation of order O(n^2 k). Since A is a sparse matrix, the element-by-element division between the matrices A and XUX^T can be computed in O(|e|_A), where |e|_A is the number of edges in matrix A. This reduces the computation of the term A/(XUX^T) to O(|e|_A k). Note that the update factor needs to be computed only once at each iteration and then applied, element-by-element, to each X_ij. Thus, the overall complexity for clustering a single network into k clusters is O(T |e|_A k), where T is the total number of iterations.

Next we consider the case of two networks of comparable sizes having the same number of clusters. The joint optimization is given by Equation (3.3) and the update formula for X is given by Equation (3.5). This differs from the update formula for a single network by an additional term that represents the relation between the two networks. If both networks have comparable link densities, then this additional term incurs an additional overhead of O(|e|_A kT). By a similar argument, each of the five update formulas given in Algorithm 1 requires a time complexity of O(ekT), where e denotes the average number of links in a network. Putting it altogether, this implies the proposed algorithm has an overall complexity of O(5ekT), where T is the maximum number of iterations.

3.2.4 Semi-Supervised Learning

Before we end this section, we discuss the similarity between the proposed joint community detection problem and well-known semi-supervised learning models. The paradigm of learning from multiple related networks can be modeled as semi-supervised network learning, where the additional data from a related network is used as side information in order to perform supervised or unsupervised learning. One way to incorporate the side information from related networks would be to add knowledge derived from it as constraints to the original learning problem. For example, depending on the nature and quality of the side information, a set of must-link and cannot-link constraints on node pairs can be extracted from the side information and included in the objective function [9, 10, 57, 62, 100]. A drawback here is that it requires a good heuristic to extract the must-link and cannot-link constraints, which may not be readily available. Also, it may not be possible to get a feasible clustering solution that satisfies all the constraints extracted from the side information [24]. One way to avoid using heuristics to extract constraints from the side information is to directly use all the available information and let the algorithm extract the most suitable information from the data.
In the context of multiple related networks, we combine all the networks to form one giant multi-network. This is accomplished by constructing a single adjacency matrix G for the entire network as follows:

G_{ij} = \begin{cases} A_{ij} & 1 \le i, j \le n \\ C_{i,\,j-n} & 1 \le i \le n,\; n+1 \le j \le n+m \\ C_{j,\,i-n} & n+1 \le i \le n+m,\; 1 \le j \le n \\ B_{i-n,\,j-n} & n+1 \le i, j \le n+m \end{cases}

We then apply any manifold learning algorithm, such as label propagation or the normalized cut [92] algorithm, on the entire multi-network G. It should be noted that the proposed joint factorization framework can be thought of as a novel approach to incorporate side information as well as prior knowledge into the learning framework for performing clustering on networks. In the next section, we present results comparing the performance of the proposed algorithm against its manifold learning counterpart.

3.3 Experimental Evaluation

We have evaluated the effectiveness of the proposed algorithm on both synthetic and real-world network data. The details of the real-world and synthetic networks are given below.

3.3.1 Baseline Algorithms and Evaluation Metrics

As a baseline, we used three clustering algorithms. The first one is the normalized cut (Ncut) algorithm by [92]. The other two algorithms are derived from parts of the objective function given in (3.3): the first term in (3.3) corresponds to independent clustering of a single network, while the first two terms correspond to co-clustering of a single network using the bipartite graph between the two networks. We compare our joint clustering algorithm against the independent clustering of a single network (denoted by Ind) and the co-clustering (denoted by CoC). We denote our proposed framework as Joint or Joint + Prior in the remainder of this section. For a fair comparison, we applied the normalized cut algorithm on the entire multi-network G described above (instead of on G1 and G2 separately).

We use the normalized mutual information (NMI) measure to evaluate clustering results. It is defined as follows. Let C = {C_1, C_2, ..., C_k} denote the true set of clusters and Ĉ = {Ĉ_1, Ĉ_2, ..., Ĉ_k} denote the clusters obtained from the algorithm. Then the mutual information between them is defined as

MI(C, \hat{C}) = \sum_{C_i, \hat{C}_j} p(C_i, \hat{C}_j) \log \frac{p(C_i, \hat{C}_j)}{p(C_i)\,p(\hat{C}_j)}

and the normalized mutual information is given by

NMI(C, \hat{C}) = \frac{MI(C, \hat{C})}{\sqrt{H(C)\,H(\hat{C})}}

where H(C) and H(Ĉ) denote the entropies of the partitions C and Ĉ, respectively.

3.3.2 Synthetic Data Set

Our synthetic data set is generated as follows. First, the number of nodes and the number of clusters (k) in each network are given. In our experiments, the number of clusters in each network is fixed to be 4, with 400 data points in each cluster. Within each cluster i, a link is created between any two nodes with probability p1. On the other hand, an inter-cluster link is created between a node in cluster i and nodes in other clusters of the same network with probability p2. In addition, links are also created between nodes from different networks. We create links between networks G1 and G2 with probabilities q1 and q2, where the former is the probability of a link between corresponding clusters and the latter is the probability of a noisy link between non-corresponding clusters. For example, if networks G1 and G2 have 4 clusters each, q1 is the probability of a link between cluster_i in network G1 and cluster_i in network G2.
Similarly, q2 is the probability of a link between cluster_i in G1 and cluster_j in G2 with i ≠ j.

3.3.3 Complexity Verification

In this section, we use the synthetic network to verify the theoretical computational complexity derived in Section 3.2.3. Here we computed the time for the factorization of a single network with k = 4 clusters. We set the inter-cluster link probability to be 0.01 and the intra-cluster link probability to be 0.001. We then varied the number of nodes in the network from 400 to 4400 in steps of 800 nodes. For each network we computed the factorization D(A ‖ XUX^T) and plotted the computation time as a function of the number of links in the network. This is shown in Figure 3.2. Clearly, the computation time is seen to be linear in the number of links in the network, as derived in Section 3.2.3.

Figure 3.2: This plot shows the variation of computation time (in seconds) with increase in the number of links in the network (due to the increase in the number of nodes).

3.3.3.1 Effect of noise in one network

We created two networks, namely Network A and Network B, using the above mentioned parameters. Network A was generated with p1 = 0.4 and p2 = 0.35. The links between the networks were generated with q1 = 0.6 and q2 = 0.45. Network B was generated with p1 = 0.5, and we varied the inter-cluster link probability for Network B from 0.1 to 0.4 in steps of 0.1. These are the noisy links in Network B. We studied the effect of varying the noisy links in Network B on the cluster NMI on Network A. The results are plotted in Figure 3.3. As expected, a low level of noise in Network B helped in identifying the clusters in Network A with higher accuracy. As the noise level increases in Network B, the NMI of the clusters obtained in Network A decreases. This reflects our belief that if one of the networks is less noisy, then it helps in improving the clustering accuracy of the other network. Ncut-M and Ncut-I in Figures 3.3 and 3.4 refer to the normalized cut algorithm applied on the multi-graph and on the individual network, respectively.

Figure 3.3: Effect of noisy links in Network B on the clustering performance of Network A (p1 = 0.4 and p2 = 0.35). The curves compare Joint, Ncut-M, CoC, and Ncut-I.

3.3.3.2 Effect of noise in between the networks

Here, we tested the effect of noise between the networks on the clustering accuracy. To do this we created two networks with the following parameters: Network A with p1 = 0.4 and p2 = 0.35, and Network B with p1 = 0.5 and p2 = 0.45. We created the links between them with q1 = 0.6 and varied the noise between the networks by varying q2 from 0.1 to 0.5 in steps of 0.1. The results of clustering accuracy on Network A are shown in Figure 3.4. When the links between the networks are less noisy, the CoC algorithm gives good results, and the Joint algorithm gives better results still. This is because the information flow between the networks is more reliable, and the link structure in Network B helps in improving the performance of Joint clustering over the CoC clustering.

Figure 3.4: Effect of noisy links between the networks on the clustering performance of Network A (p1 = 0.4 and p2 = 0.35).

3.3.4 Wikipedia Dataset

We use the Wikipedia dump from Oct-09-2009 for our experiments.
We have chosen four topics as the ground-truth clusters: Biology, Natural Science, Computer Science, and Political Science. Each of the four topics is further divided into subtopics, which are shown in Table 3.1. We collected roughly 20K articles, with 5K articles in each category. After removing stubs and other smaller articles we were left with 10K articles and 53K editors (who have edited the articles). We removed articles/editors that do not have sufficient links (fewer than 3 links) with other articles/editors in our corpus. Our final data set contains 6403 articles and 5361 editors. A visual representation of the adjacency matrices of the article and editor networks is shown in Figure 3.5. Our goal of clustering is to identify the 12 sub-categories in the article network and relate them to the 4 categories in the editor network. Such a clustering is useful in other real-world applications where low-level label information, such as subtopics, is assigned to each article in the article corpus and high-level label information is assigned to authors based on their research interests. Joint clustering can then be used to identify and associate the related communities between these networks and propagate the high-level labels from the author network to more refined low-level labels in the article clusters.

Figure 3.5: Spy plot for the Article (left) and User (right) networks in Wikipedia.

Table 3.1: Data Category and Sub-Category

Category (User clusters)  | Sub-Categories (Article clusters)
Political Science         | Civil-Rights Liberties (878); Imperialism (601); Nationalism (368)
Natural Science           | Physics (568); Earth Sciences (513); Astronomy (613)
Computer Science          | Algorithms (112); Operating Systems (395); Computer Architecture (350)
Biology                   | Zoology (392); Anatomy (897); Cell-Biology (716)

Table 3.2: The link distribution between different author clusters in the Wikipedia data set

                  | Political Science | Natural Science | Computer Science | Biology
Political Science | 8313              | 1113            | 749              | 844
Natural Science   | 1113              | 3398            | 657              | 592
Computer Science  | 749               | 657             | 5337             | 516
Biology           | 844               | 592             | 516              | 4806

The Wikipedia data set is particularly challenging. Firstly, the editors do not seem to have a fixed domain of interest. As seen in the spy plot in Figure 3.5, a good proportion of editors have edited articles in all four categories. We assign a ground-truth label to each user based on the category to which the user has made the most contributions. Secondly, although each editor has his/her own Wikipedia page, many of these pages do not contain enough useful features that can be used to identify the cluster of an editor. Thirdly, the links between articles tend to be noisy. The article-article spy plot in Figure 3.5 shows 9 visually distinct groups even though there are 12 article clusters. This is because, in our sample, the articles in some of the sub-categories are highly connected to other related sub-categories and are therefore not visually discernible.

We first clustered the article network and the user network independently and used them as our baseline result. As shown in Table 3.3, independent clustering of the articles gives an NMI of 0.30, and co-clustering of the article network using the article-editor bipartite graph increases the NMI to 0.38. However, the joint clustering gave a slightly higher NMI of 0.41, highlighting the importance of information flowing from the other network. Joint clustering with prior information increases the NMI further to 0.45. The normalized cut algorithm produced the highest NMI of 0.55 on the article network.
The confusion matrices of the Ncut and Joint algorithms on the article network are given in Table 3.4 and Table 3.5, respectively.

Table 3.3: Average Cluster NMI on the Wikipedia dataset

Experiment                                   | Article | Editor
Normalized Cut on Multi Graph (12 clusters)  | 0.550   | -
Normalized Cut on Multi Graph (4 clusters)   | -       | 0.010
Independent clustering                       | 0.304   | 0.080
Co-clustering                                | 0.381   | 0.208
Joint clustering without Prior (λ = 1)       | 0.405   | 0.213
Joint clustering with Prior (λ = 0.5)        | 0.454   | 0.259

As seen in Table 3.4, the Ncut algorithm on the multi-graph G has found 4 predominant clusters in the article network, one for each category. For example, clusters 1, 6, 9 and 11 in the confusion matrix of the Ncut algorithm predominantly represent the four underlying major categories, and the rest of the columns are very sparse. However, the Joint clustering algorithm has found the finer sub-categories in the article network. For example, clusters 1-3 predominantly represent the sub-categories of political science, clusters 4-6 predominantly represent the natural science sub-categories, and clusters 10-12 predominantly represent the sub-categories of biology. The algorithm has found only two predominant sub-categories (columns 7-8) for the computer science category.

In the editor network, Joint + Prior has outperformed all other approaches. Ncut on the multi-graph could not identify any of the four major categories. Both the Joint and CoC algorithms show similar performance, with an NMI of 0.21; the simple Joint algorithm did not give any additional benefit compared to CoC. However, adding the non-informative prior boosted the performance of the joint clustering algorithm.

Table 3.4: Confusion matrix of the article network using the Ncut algorithm on the multi-graph G (NMI 0.55). Columns 1, 6, 9 and 11 are the dominant clusters identified by the Ncut algorithm. These four dominant clusters correspond to the four major categories; the algorithm has failed to identify the sub-categories.

Ground Truth  |   1   2   3   4   5   6   7   8   9  10  11  12
Civil rights  | 536 286  42   0   1   3   2   4   1   0   3   0
Imperialism   | 331   0 222   2   1   6  21   2   0   1  15   0
Nationalism   | 286   2  17   1   0   0   3   1  11   2  45   0
Physics       |   3   0 270   0   1  21 247   0   3   0  23   0
Earth Sc      |  17  21  16 161 234  27  26   2   8   0   1   0
Astronomy     |  49   0  10   0   0 347  20 179   2   1   5   0
Algo          |   0   0   0   0   0   1   7   0 104   0   0   0
OS            |   1   0   0   0   0  20   2   1 371   0   0   0
Architecture  |  19   1 101   0   0   2 213   0   0   0   0  14
Zoology       |  25   0  18   0   0   1   1   0   0   1 346   0
Anatomy       |   1   0   0   0   0   2   1   1   0   0 892   0
Cell Biology  |   5   1  11   0   0   2   0   1   5 200 491   0

Table 3.5: Confusion matrix of the article network using the Joint clustering method (NMI 0.40). Unlike the Ncut algorithm, the Joint clustering is able to identify the finer sub-categories. For example, columns 1, 2, 3 predominantly identify the sub-categories of the political science category. Similarly, columns 10, 11 and 12 relate to the sub-categories of the biology category.

Ground Truth  |   1   2   3   4   5   6   7   8   9  10  11  12
Civil rights  |  21 416  14   1   2 265   0   2   0 112  34  11
Imperialism   | 208 177 110   0  15  54   1   0   0  31   4   1
Nationalism   |  14 261  14   0  36   7   3   7   0  18   6   2
Physics       |   8   1   9   4 295 245   1   0   3   0   2   0
Earth Sc      |  18   4  25 248 171  13   0   7   5   6   9   7
Astronomy     | 120   1  96  78  13 130 162  12   0   0   0   1
Algorithm     |   0   0   0   2   0   0  62  48   0   0   0   0
OS            |   2   0   0   1   0   0 122 251   0   0  12   7
Architecture  |  32   9  26 195  13  55  11   0   0   1   3   5
Zoology       |  23  13  20   0  19  38   1   2   8   4 176  88
Anatomy       |   0   1   0   0  47  54  11   0  11 172 393 208
Cell Biology  |   5   9   4   0  32  12   1  30   4 187 286 146
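The NMI values reported in Table 3.3 follow the definition in Section 3.3.1. A small sketch of how such values can be computed from two labelings is given below (assumed helper, not the evaluation code used for these experiments).

    # Sketch of the NMI measure: mutual information between the true and
    # estimated partitions, normalized by sqrt(H(C) * H(C_hat)).
    import numpy as np

    def nmi(true_labels, pred_labels):
        true_labels = np.asarray(true_labels)
        pred_labels = np.asarray(pred_labels)
        n = len(true_labels)
        _, t = np.unique(true_labels, return_inverse=True)
        _, c = np.unique(pred_labels, return_inverse=True)
        joint = np.zeros((t.max() + 1, c.max() + 1))
        np.add.at(joint, (t, c), 1.0 / n)          # joint distribution p(C_i, C_hat_j)
        p_true, p_pred = joint.sum(1), joint.sum(0)
        nz = joint > 0
        mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_true, p_pred)[nz]))
        h_true = -np.sum(p_true[p_true > 0] * np.log(p_true[p_true > 0]))
        h_pred = -np.sum(p_pred[p_pred > 0] * np.log(p_pred[p_pred > 0]))
        return mi / np.sqrt(h_true * h_pred)

    # A perfect clustering gives NMI = 1 regardless of how cluster ids are permuted.
    print(nmi([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))   # -> 1.0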
3.3.5 Digg Data Set The web site www.digg.com is a popular social bookmarking web site where each individual users bookmark URL’s and share them publicly with other users. The goal is to identify different user community based on the topics of the bookmarked URL’s. As mentioned in the abstract and introduction, the objective of this work is to investigate whether it is possible to combine information from several networks to improve community detection. However, one of the challenges in demonstrating this effect empirically is the lack of ground truth information about the true clusters of various domains. So, the main reason we choose Wikipedia and Digg data sets for our experiments is the availability of the ground truth class labels that can be used to evaluate the performance of our joint community detection framework. Furthermore, though there may not be a direct correspondence between Wikipedia editors and Digg users, their community structures are often defined based on the topics of the articles they have edited or posted. Since there exists common topics among articles in both networks, this information that can be harnessed to improve their community detection tasks. The idea of using Wikipedia as an auxiliary data source for improving clustering has also been investigated before (see for example, the works by [6]), though none of the previous works consider the Wikipedia as source of auxiliary network data for user community detection. Our experiment suggests that Wikipedia is a potentially useful source to improve clustering of users in a domain such as Digg.com as well. The exact data collection mechanism is described below. We first sampled 5670 digg users who have bookmarked URLs on the following three topics: Politics, Computer Science, and Natural Science. We formed a user-user connectivity 37 matrix from the user-URL matrix. Two users are linked if they have at least τ = 5 URLs in common. We sampled 4206 Wikipedia editors that belong to aforementioned 3 categories. We established the links between the digg users and the Wikipedia editors by first establishing a similarity between the bookmarked URLs and the edited articles. Each URL bookmarked at digg.com has a title and a short description about the content of the web site. The digg url-word matrix and Wikipedia article-word matrix are used to establish a “weighted link” between a digg url and a Wikipedia article. Specifically, the weight of the link corresponds to the cosine similarity between the words in the title and description of a URL and the words that appear in the content of a Wikipedia article. We finally establish the link between a digg user and Wikipedia editor if there is high similarity between the contents of bookmarked URL and edited article. Here again, we assign a ground truth label to each editor (user) based on the category for which the editor has made the most number of contributions. Figure 3.6 shows the adjacency spy plot for two networks. In both these networks a good proportion of editors/users have contributed articles in all three categories. However, notice that the there is large number of links between the first two user communities in the digg network. These two communities refer to the politics and computer science. It indicated that in our sample, the users who have bookmarked politics related URL have also bookmarked technology related URLs. Figure 3.6: Adjacency matrix plot for Digg user and Wikipedia editor networks. 
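The cross-network linking procedure described above can be approximated with a standard bag-of-words cosine similarity. The following is a rough sketch (hypothetical texts and threshold, not the dissertation's preprocessing pipeline) of how a Digg URL and a Wikipedia article might be compared before user-editor links are formed.

    # Rough sketch of the linking step: cosine similarity between word vectors of
    # Digg URL titles/descriptions and Wikipedia article contents, thresholded to
    # decide which URL-article pairs (and hence user-editor pairs) are linked.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    digg_url_texts = ["senate passes new budget bill",
                      "quantum computer breakthrough announced"]
    wiki_article_texts = ["the united states senate voted on the federal budget",
                          "recent advances in quantum computing hardware"]

    vec = CountVectorizer()
    X_all = vec.fit_transform(digg_url_texts + wiki_article_texts)
    S = cosine_similarity(X_all[:len(digg_url_texts)], X_all[len(digg_url_texts):])

    threshold = 0.2                      # hypothetical cutoff for "high" similarity
    url_article_links = S > threshold    # candidate URL-article links
    # A Digg user and a Wikipedia editor are then linked (an entry of C) if some URL
    # bookmarked by the user is sufficiently similar to some article edited by the editor.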
Table 3.6: Confusion matrices on the Digg user and Wikipedia editor networks using Ncut on the multi-graph

Ncut on Digg + Wiki multi-graph
Digg User Network      | Wiki Editor Network
 167  1481     0       |  108  1708     0
 997   638     0       |  707   492     0
1952   435     0       | 1113    77     1
NMI: 0.185             | NMI: 0.307

Table 3.7: Confusion matrices of Digg users and Wikipedia editors using the proposed joint clustering method

Joint on Digg + Wiki multi-graph
Digg User Network      | Wiki Editor Network
1242    63   343       |  102    70  1644
 169   220  1246       |  631   270   298
 338  1889   160       |  122  1023    46
NMI: 0.387             | NMI: 0.418

We first applied the Ncut algorithm on the multi-graph containing both the Digg and Wikipedia networks. Both networks contain three corresponding clusters. The confusion matrix is shown in Table 3.6. Clearly, the Ncut algorithm identified only two clusters on each network. This is because the Digg network had predominantly two clusters. Table 3.7 gives the confusion matrices of the Digg and Wikipedia networks using the proposed Joint approach. It has identified the three dominant communities on each network. Since there was a natural correspondence between the three clusters on both networks, adding prior information (an identity matrix of size 3) did not give any additional improvement.

3.4 Summary

In this chapter we have discussed the problem of learning cohesive subgroups and their correspondences in multiple related networks. Our experiments reveal that the joint clustering of multiple networks gives better results in terms of the normalized mutual information between the actual clusters and the clusters identified by the algorithm. We have also introduced the idea of using a prior to guide the clustering process. We have performed a thorough analysis of the convergence of the proposed algorithm, and we have also given heuristics for faster convergence. The proposed algorithm is of order O(n^2 T) in complexity, where n is the number of nodes in either network and T is the number of iterations. The scalability of the proposed algorithm to networks with millions of nodes will be pursued in future work.

Chapter 4

Joint Clustering and Classification on Multiple Networks

In the previous chapter, we focused on learning a single task (community detection) simultaneously on two related networks. In this chapter we extend the framework to multi-task learning on two related network data sets. Specifically, we present a novel framework that enables one to perform classification on one network and community detection on another related network. Multi-task learning is accomplished by introducing a joint objective function that must be optimized to ensure the classes in one network are consistent with the link structure, nodal attributes, as well as the communities detected in another network. Experiments performed on both real-world and synthetic data sets demonstrate the effectiveness of the joint framework compared to applying classification and community detection algorithms on each network separately.

The motivation for this study is two-fold. First, the rapid proliferation of online social and information networks raises the question of whether we can leverage network data from known information sources (say, Wikipedia) to enhance network learning tasks. In the previous chapter, we demonstrated the advantages of combining data from multiple related networks to improve community detection. However, the approach presented in Chapter 3 considers the same learning task—community detection—on all networks. Here, we extend the analysis to the case when a different learning task is performed on each network.
For example, one might be interested in leveraging the knowledge about the classes of Wikipedia editors (based on the articles they have edited) to identify the communities of users at Digg.com, a social news Web site. Though the correspondence between the identities of Digg users and Wikipedia editors may not be known, their community structures and classes are potentially well-aligned as they are based on the topics of articles posted or edited by the users. Second, by comparing the classes and communities across different networks, one could potentially explain the identified communities in one network in terms of the known classes defined in another network, especially when there is high connectivity between nodes that belong to a community and its associated class. Furthermore, it is also possible to perform a comparative network analysis to identify the classes (clusters) found in one network but not the other. For example, given the known classes of users at Facebook, can we identify groups of MySpace users whose link structure and nodal behavior do not correspond to any known groups in Facebook? One approach to solve this multi-task multi-network learning problem is to combine all the networks into one giant graph and apply the classification or community finding algorithms to the combined graph. However such an approach has many limitations. First, by applying a single algorithm to the combined graph, we have no control over the number of clusters or classes that will be found in each of its underlying networks. In particular, as will be shown in our experiments, this approach does not work well when the number of classes in one network is different than the number of communities in another network. Furthermore, it may lead to a suboptimal solution as it attempts to fit a global model to a graph that contains local networks with their own distinctive properties. 42 The main contributions of this chapter are as follows. First, we extend the framework in Chapter 3, for joint classification and community detection in heterogeneous network data. The framework is applicable even when the number of classes in one network differs from the number of communities in another network. In addition, we have drawn parallels between the update formula and the label propagation algorithm [114, 118]. We have also included extensive experiments using synthetic data sets to assess the performance of our proposed framework under different multi-network parameter settings. In particular, these experiments help to shed light into fundamental questions such as (1) Can clustering on one network help to improve classification on another network, and vice-versa? (2) Does combining multiple related networks help in solving a task better than solving the same task independently on individual networks? (3) How does the presence of noisy links in different networks affect performance of each learning task? 4.1 Joint Learning Framework The notations used in this chapter are similar to the ones used in chapter. We summarize these notions in Table 4.1. The objective function for joint clustering and classification is very similar to the objective function (3.2) defined in Chapter 3 with an additional term for aligning the label information present in the network. L= min X,U,V,Y,W D(A ∥ XU X T ) + D(C ∥ XV Y T ) + D(B ∥ Y W Y T ) + D(βL ∥ Yl ) 43 (4.1) Table 4.1: Summary of notations used in the chapter. 
Symbol     | Description
G1         | A homogeneous network for detecting communities
G2         | A homogeneous network for classification
G12        | A bipartite graph connecting nodes between G1 and G2
v_ij ∈ V_i | A node in network G_i (i = 1 or 2)
E_i        | The set of links in network G_i
E_d        | The set of links in the bipartite graph G12
k1         | Number of communities in G1
k2         | Number of classes in G2
l          | Number of labeled nodes in G2
A          | An n × n adjacency matrix for network G1
B          | An m × m adjacency matrix for network G2
C          | An n × m adjacency matrix for the bipartite graph G12
X          | An n × k1 pseudo-label matrix for community membership of the nodes in G1
Y          | An m × k2 pseudo-label matrix for class membership of the nodes in G2
V          | A k1 × k2 community-class correspondence matrix
L          | An l × k2 true class membership matrix for the labeled nodes in G2

The first term in the objective function deals with the clustering of nodes in G1 by factorizing the adjacency matrix A into a product involving the pseudo-label matrix X. The last two terms deal with the classification of nodes in G2 by estimating the pseudo-label matrix Y, taking into account both the link structure (B) and the class information (L). Here, Yl is an l × k2 sub-matrix of Y consisting of the rows of Y corresponding to the labeled data points; thus, the last term in the objective function, D(L ‖ Yl), does not apply to unlabeled nodes in network G2. Meanwhile, the second term in the objective function is used to learn the relationship between the clusters found in network G1 and the classes obtained for network G2. The association between the clusters and classes is encoded by the cluster-class correspondence matrix V. The constant β indicates the factor by which Yl is fit to the ground-truth label matrix L. Since L is a fixed binary label matrix, the constant factor β can be absorbed into it; that is, L_ij = β if node v_2i ∈ V_2 belongs to class j and zero otherwise. In what follows we do not explicitly mention the label factor β.

Here again, the optimization problem is solved using an alternating minimization scheme. The update formulas for the matrices X, U, W, and V are the same as (3.5), (3.8), (3.9) and (3.7), respectively. The update formula for Y_ij depends on whether the node is labeled or not. It is given by

Y_{ij} = \begin{cases} Y_{ij} \dfrac{\sum_a \left( \frac{B_{ia}[YW^T]_{aj}}{[YWY^T]_{ia}} + \frac{B_{ai}[YW]_{aj}}{[YWY^T]_{ai}} \right) + \sum_a \frac{C_{ai}[XV]_{aj}}{[XVY^T]_{ai}}}{\sum_a [YW + YW^T]_{aj} + \sum_a [XV]_{aj}}, & i > l; \\[2ex] Y_{ij} \dfrac{\sum_a \left( \frac{B_{ia}[YW^T]_{aj}}{[YWY^T]_{ia}} + \frac{B_{ai}[YW]_{aj}}{[YWY^T]_{ai}} \right) + \sum_a \frac{C_{ai}[XV]_{aj}}{[XVY^T]_{ai}} + \frac{L_{ij}}{Y_{ij}}}{\sum_a [YW + YW^T]_{aj} + \sum_a [XV]_{aj} + 1}, & i \le l. \end{cases}    (4.2)

We highlight the advantages of our proposed multi-task learning technique compared to single-task learning with label propagation and cut-based graph partitioning algorithms in the next section.

4.2 Joint Learning vs Independent Learning

In this section, we give insight into the exact mechanism by which the proposed framework performs the iterative clustering and classification between the networks until the community structure and classes crystallize in each network. We explain this from the point of view of the classification problem, though a similar explanation can be made for the clustering problem. We also compare our joint learning approach to independent learning using the well-known label propagation (for classification) and graph partitioning (for clustering) algorithms.

4.2.1 Joint Factorization vs Label Propagation

Consider a binary classification problem in a network where each node belongs to either class 1 or class 2.
Suppose the class information is encoded in a two-dimensional vector, where [1, 0]^T represents a node assigned to class 1 and [0, 1]^T represents a node assigned to class 2. Unlabeled nodes are assigned the vector [0, 0]^T. A node classification algorithm can be designed to propagate a fraction of the class information from a labeled node to its neighbors at each iteration. An unlabeled node will sum up the label vectors it receives from each of its neighbors. After a sufficient number of iterations, an unlabeled node in the network whose label vector is denoted by [l_i^1, l_i^2] will be assigned to class 1 if l_i^1 > l_i^2 and to class 2 otherwise. This is the learning strategy employed by a broad class of label propagation algorithms [114, 118].

Two important parameters govern the propagation of labels between nodes in a network: (1) the propagation structure and (2) the rate of propagation. The former refers to the link structure that defines the neighborhood of each node in the network, while the latter determines the fraction by which the label information is propagated to neighboring nodes at each iteration. The rate of propagation is usually modeled as a function of the weight associated with the link between an adjacent pair of nodes. More formally, the update formula for the label propagation algorithm is given by

Y^{(t)} \leftarrow \alpha P Y^{(t-1)} + (1 - \alpha) L,    (4.3)

or equivalently,

Y_{ij}^{(t)} = \alpha \sum_a P_{ia} Y_{aj}^{(t-1)} + (1 - \alpha) L_{ij}    (4.4)

where P is the propagation matrix constructed from the adjacency matrix of the network. There are several forms of label propagation matrices proposed in the literature, including P = D^{-1}B [118] and P = D^{-1/2} B D^{-1/2} [114], where B is the adjacency matrix and D is a diagonal matrix whose diagonal entries are D_ii = \sum_j B_{ij}.

Next, we show that the multiplicative update formulas given by our matrix factorization framework have a similar mathematical form, which suggests that our multiplicative update formula can be viewed as a form of multi-task multi-network label propagation. Consider the problem of classification on a single network G2 without using the information from the bipartite graph G12. The objective function involves minimizing the terms D(B ‖ YWY^T) + D(L ‖ Yl). The update formula for Y can be obtained by considering only the relevant terms in Equation (4.2). Assuming the links in the network are undirected imposes a symmetry restriction on the matrices B and W. Furthermore, let M_ij = B_ij / [YWY^T]_ij, which represents a weighted adjacency matrix whose large weights are associated with links that were incorrectly predicted by YWY^T in the previous iteration. Hence, the update formula of Y for labeled nodes can be re-written as

Y_{ij} = Y_{ij} \frac{\sum_a \left( \frac{B_{ia}[YW^T]_{aj}}{[YWY^T]_{ia}} + \frac{B_{ai}[YW]_{aj}}{[YWY^T]_{ai}} \right) + \frac{L_{ij}}{Y_{ij}}}{\sum_a [YW + YW^T]_{aj} + 1} = Y_{ij} \frac{\sum_a M_{ia}[YW]_{aj} + \frac{L_{ij}}{Y_{ij}}}{\sum_a [YW]_{aj} + 1}    (4.5)

Notice that the denominator term approximately normalizes the columns of the matrix YW. Let D be the diagonal matrix with diagonal entries consisting of the column sums of the YW matrix, that is, D_ii = \sum_a [YW]_{ai} + 1. Then the column-normalized label matrix is given by \tilde{Y} = YW D^{-1}. Letting Y^j represent the j-th column of Y (that is, the label information of the j-th class), the above update formula can be written as

Y_{ij} = Y_{ij} \sum_a M_{ia} \tilde{Y}_{aj} + \frac{L_{ij}}{D_{jj}}    (4.6)

Comparing equations (4.4) and (4.6), we notice that the multiplicative update formula has the same mathematical form as the label propagation algorithm.
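For concreteness, a toy sketch of the fixed-propagation-matrix baseline of Eq. (4.3) is given below (assumed helper names; not the implementation used in the experiments). It iterates Y ← αPY + (1 − α)L with the normalized propagation matrix P = D^{-1/2} B D^{-1/2} used by the LGC algorithm [114].

    # Sketch of the label propagation update in Eq. (4.3)/(4.4).
    import numpy as np

    def label_propagation(B, L, alpha=0.9, iters=100):
        """B: m x m adjacency matrix; L: m x k one-hot labels (zero rows = unlabeled)."""
        d = B.sum(axis=1)
        d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
        P = (B * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]   # D^{-1/2} B D^{-1/2}
        Y = L.astype(float).copy()
        for _ in range(iters):
            Y = alpha * P @ Y + (1 - alpha) * L
        return Y.argmax(axis=1)              # predicted class = largest label score

    # Toy example: a 4-node path graph with the two end points labeled.
    B = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
    L = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], float)
    print(label_propagation(B, L))           # nodes near each labeled end inherit its class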
The key difference between them is that the multiplicative updates have a varying propagation matrix M, instead of a fixed propagation matrix P. The role of α in (4.4) is to establish consistency between the actual and predicted labels on the labeled data set; this is done by the factor β in (4.6). Recall that β is absorbed into L, i.e., L_ij = β if node v_2i ∈ V_2 belongs to class j and zero otherwise. The update formula for unlabeled nodes can similarly be shown to have the same mathematical form as (4.4), albeit with a different propagation matrix. This aspect makes the method very adaptive, in that the propagation matrix can be updated at each iteration to accommodate more information from the other network. At each iteration, the partial cluster information from network G1 and the partial label information from G2 determine the propagation structure and the propagation rate for the classification problem. The label information thus propagated minimizes the class ambiguity among the unlabeled nodes, and the enriched class information is then propagated back for performing the clustering in G1.

4.2.2 Joint Factorization vs Graph Cuts

The earliest algorithms for identifying communities in a given network were borrowed from the graph theory literature, where a subset of the edges is removed to induce a partition on the vertex set. Given a graph G = (V, E), a cut is a partitioning of the vertex set V = M ∪ N of the graph by removing edges between the vertex sets M and N. The size of the cut is defined as the number of edges removed from the edge set E. Let A be the adjacency matrix of the graph G. Then

\mathrm{cut} = \sum_{v \in M,\, u \in N} A_{uv}    (4.7)

A min-cut of the graph is a cut that has the smallest cut size. The well-known algorithm for solving the min-cut problem is based on the max-flow min-cut theorem by Ford and Fulkerson [33]. The flow-based algorithm for identifying graph cuts has been used to solve the community detection problem in large World Wide Web graphs [31, 32].

Figure 4.1: Sample multi-network to illustrate the functioning of graph-cut-based partitioning techniques. Left: The link density in the bipartite graph G12 is sparse. As a result, the graph-cut-based algorithm would first disintegrate the multi-network into individual networks and then continue to split the individual networks based on their link densities. Right: The link density is very different between networks G1 and G2; as a result, one of them may get split many times while the other remains intact. Such problems will not arise in the proposed joint factorization.

A key hindrance in applying the cut-based algorithm for community detection in a multi-relational heterogeneous network setting is that the link densities in different networks (data sources) are different. As a result, one may get trivial or suboptimal partitions. For example, consider a multi-relational heterogeneous network G, whose adjacency matrix is constructed as follows:

G_{ij} = \begin{cases} A_{ij} & 1 \le i, j \le n \\ C_{i,\,j-n} & 1 \le i \le n,\; n+1 \le j \le n+m \\ C_{j,\,i-n} & n+1 \le i \le n+m,\; 1 \le j \le n \\ B_{i-n,\,j-n} & n+1 \le i, j \le n+m \end{cases}    (4.8)

If the link density in the bipartite graph G12 is very low, then the cut-based algorithm would chop the multi-network into individual networks and subsequently split each of the individual networks independently, thus ignoring the valuable information given in the bipartite graph G12.
On the other hand, if the link density within one of the two networks is much higher than the other (including the bipartite graph), then the cut based algorithm would treat it as a dense community and split the other network repeatedly. This is illustrated graphically in Figure 4.1. Such problems would not arise in the joint learning framework as the proposed objective function factorizes each network separately into their respective communities (and classes) and these latent factor variables in turn induce partition on the bipartite graph that link the two networks. Nevertheless, the link density does affect the working of the update formula in two ways. If the link density in one of the network is higher than the other, then it contributes more to the objective function than the other network. For example, in the objective function (4.1), if the matrix A is denser than the other two matrices, then more weights would be given in minimizing the first term making the algorithm more focussed in determining a better estimate of value X than for Y. This problem can be addressed by appropriately weighting the terms in the objective function or scaling the adjacency matrices. 50 4.3 Experimental Evaluation This section presents the results of applying the proposed framework to the multi-task learning problem on both synthetic and real-world network data. We have designed a multinetwork simulator to generate the synthetic data for our experiments. In addition, the Wikipedia and Digg data discussed in Chapter 3, are used to evaluate the performance of the proposed multi-task learning framework. 4.3.1 Baseline Algorithms and Evaluation Metrics As a baseline, we use the normalized cut (Ncut) algorithm by Shi and Malik [91] for clustering and the label propagation algorithm with local and global consistency (LGC) by Zhou et al. [114] for classification. For a fair comparison, we applied each baseline algorithm on the entire multi-network G (instead of G1 and G2 separately). This is accomplished by constructing a single adjacency matrix G for the entire network using (4.8). In each experiment, we set the proportion of labeled nodes for the classification problem in one of the two network to be 0.2. The label factor β is set to 0.5 for synthetic data and set to 50 for Wikipedia, Digg data. We use the normalized mutual information (NMI) measure to evaluate clustering results and accuracy to evaluate classification results. Since we perform the classification task as semi supervised clustering, the accuracy measure is computed differently. Here , each estimated cluster is assigned to the ground truth class to which is most frequent in the cluster. If rows of the confusion matrix represents the ground truth class and columns represent the estimated clusters, then we sum the maximum values across each column and divide by total number of points. Note that if the confusion matrix is diagonal heavy (maximum of each row/column occurs in diagonal) then this measure is same as regular accuracy measure. 51 We denote our proposed multi-task, multi-network learning framework as Joint or Joint + Prior in the remainder of this section. Since the iterative update formula converges to a locally optimal solution, the Joint and Joint + Prior algorithms were run several times, each with different random initialization values for the pseudo-label matrices. 
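Before turning to the results, the cluster-to-class accuracy described above can be computed directly from the contingency table of ground-truth classes (rows) and estimated clusters (columns): assign each cluster to its most frequent class, i.e., sum the column maxima and divide by the total number of points. A small sketch (assumed helper, not the evaluation script used here):

    # Sketch of the cluster-to-class accuracy measure described above.
    import numpy as np

    def cluster_accuracy(true_labels, cluster_labels):
        true_labels = np.asarray(true_labels)
        cluster_labels = np.asarray(cluster_labels)
        _, t = np.unique(true_labels, return_inverse=True)
        _, c = np.unique(cluster_labels, return_inverse=True)
        confusion = np.zeros((t.max() + 1, c.max() + 1), dtype=int)
        np.add.at(confusion, (t, c), 1)                  # rows: classes, cols: clusters
        return confusion.max(axis=0).sum() / len(true_labels)

    # Example: cluster 0 is mostly class "a", cluster 1 is mostly class "b".
    print(cluster_accuracy(["a", "a", "b", "b", "b"], [0, 0, 0, 1, 1]))   # -> 0.8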
We report the results produced by the run which minimizes the objective function (4.1) (instead of choosing the run that maximizes the accuracy of inferring the class and cluster memberships of the nodes). For the synthetic data, we show the scatter plot of all the values from 15 different runs and highlight the values of the run with minimum error. We denote this run as Joint(ME).

4.3.2 Synthetic Data

Our multi-network simulator constructs two networks, G1 and G2. The parameters for constructing each network are summarized in Table 4.2 below. We assume there is a correspondence between the clusters/classes in G1 and the clusters/classes in G2. A bipartite graph G12 is also constructed by linking the nodes in G1 to those in G2. By default, each cluster contains 100 nodes (so a network with 4 clusters will contain 400 nodes).

Table 4.2: Parameters of the multi-network generator

Parameter | Explanation
k1  | Number of clusters/classes in G1
P11 | Probability of within-cluster link in G1
P12 | Probability of between-cluster link in G1
k2  | Number of clusters/classes in G2
P21 | Probability of within-cluster link in G2
P22 | Probability of between-cluster link in G2
Q1  | Probability of link between nodes in two corresponding clusters in G12
Q2  | Probability of link between nodes in two non-corresponding clusters in G12

Table 4.3: Configurations of the various synthetic data sets

Configuration   | Network G1 (k1, P11, P12) | Network G2 (k2, P21, P22) | Bipartite links G12 (Q1, Q2)
Noisy Network   | 4, 0.30, 0.05 - 0.25      | 4, 0.20, 0.10             | 0.12, 0.03
Noisy Bipartite | 4, 0.20, 0.12             | 4, 0.20, 0.15             | 0.30, 0.05 - 0.25
Uneven          | 3, 0.20, 0.12             | 5, 0.20, 0.15             | 0.30, 0.05 - 0.25

In this thesis we investigate three main configurations of the network parameter settings, given in Table 4.3. The "Noisy Network" configuration varies the proportion of noisy links (i.e., P12) in G1 while fixing all other parameters. The "Noisy Bipartite" configuration varies the proportion of noisy links (i.e., Q2) in the bipartite graph. In both configurations, there is a one-to-one correspondence between the clusters/classes in the two networks. For the "Uneven" configuration, the networks contain different numbers of clusters/classes: there are 3 clusters/classes in G1 and 5 in G2 (see Table 4.3). Thus, a many-to-one correspondence is established between the clusters/classes in G2 and those in G1. We evaluate the performance of our algorithm on the synthetic data sets under the configurations listed in Table 4.3, which probe different aspects of learning from multiple related networks.

4.3.2.1 Varying Noisy Links Within a Network

We use the "Noisy Network" configuration, in which we vary the inter-cluster noise (P12) in network G1 from 0.05 to 0.25 (with a step size of 0.05). We then performed clustering on G1 (classification on G2) and classification on G1 (clustering on G2) to study the effect of noise on both learning tasks.

Figure 4.2: Effect of varying the between-cluster link probabilities (P12) in G1. Top: accuracy on G1. Bottom: NMI on G2.

Classification on G1. Figure 4.2 compares the results for Joint learning against LGC and Ncut on the multi-graph.
The top panel shows classification accuracy on G1 and the bottom panel shows the NMI of clusters obtained from G2 . When the noise level is low, the LGC algorithm performs slightly better than Joint. However, at higher noise levels, the performance of LGC drops significantly and the joint 54 learning performs much better. This result suggests that the explicit community information (from variable X) in G2 helps to discern between the classes in G1 , whereas LCG is unable to use such information. Notice the inter quartile range for the box plot of accuracy values is very small, suggesting the robustness of the classification problem against the random initializations. Surprisingly, as the noise level in G1 increases, the NMI of the Ncut algorithm also increases in G2 . This is because of difference in the link densities between the two networks. (See Section 4.2.2). When P11 = 0.3 and P12 = 0.05, G1 has four distinct clusters, while in comparison, the network G2 with high link density (with P21 =0.20, P22 =0.10) by itself appears as a single community in the multi-network. The Ncut algorithm tries to identify four communities in the multi network. Due the difference in the link density between the two networks, the Ncut algorithm assign the entire G2 as one community and partitioned the network G1 into three communities. This results in lower NMI value on G2 . The increase in the noise level in G1 increases the link density in the network resulting in identifying better cuts by the Ncut algorithm on the multi-network. When the noise parameter is P12 = 0.15, the link densities in both the network becomes comparable and the proposed Joint algorithm still performs better than Ncut. Clustering on G1 For the same ”Noisy Network” configuration , we performed community detection G1 and classification accuracy in G2 . The results are shown in Figure 4.3. In both the cases, the Joint performs better than the respective baseline algorithms. As explained earlier, when the noise level is low the Ncut on multigraph identifies three communities on network G1 and the whole of G2 as fourth community. This resulted in NMI of 0.75. As the noise exceeded certain threshold (P12 > 0.15), the Ncut on multi-graph identified cuts in network G2 instead of G1 . This resulted in decrease of NMI value in G1 . 55 Classification accuracy in network 2 0.98 0.94 Joint (ME) LGC 0.90 0.86 0.82 0.05 0.25 0.20 0.15 0.10 Proportion of noisy links in network 1 Clustering NMI in network 1 0.9 0.7 0.5 0.3 Joint (ME) Ncut 0.1 0 0.05 0.10 0.15 0.20 0.25 Proportion of noisy links in network 1 Figure 4.3 Effect of varying the between-cluster link probabilities (P12 ) on G1 . Top: accuracy on G2 . Bottom: NMI on G1 Since the parameters for G2 are not varied, the classification results on G2 from LCG remains almost constant at 0.85 while the classification accuracy from Joint also remains pretty much constant at 0.98. 4.3.2.2 Varying Noisy Links Between Networks We use the ”Noisy Bipartite” configuration and vary q2 from 0.05 to 0.25 in step size of 0.05. Varying the noise between the network will have no bearing on the independent clustering 56 and classification tasks on either network. Here again, we performed both clustering and classification on the multi-relational network G given by (4.8). We then recorded the classification accuracy on subgraph G1 and clustering NMI on subgraph G2 . The results are shown in Figure 4.4. At lower noise levels, the Joint classification outperforms both LGC and Ncut. 
For higher noise levels in G12, the learned class information from G1 is not successfully transferred to G2 for community detection, and vice versa. In terms of the objective function, the adjacency matrix C influences both the cluster membership (X) and class membership (Y) factors. As C becomes noisier, it propagates noise into the factors X and Y, thus bringing down both accuracy and NMI.

Figure 4.4: Effect of varying the between-cluster link probabilities (q2) of the bipartite network G12 with the same number of clusters/classes in both G1 and G2. Top: accuracy on G1. Bottom: NMI on G2.

4.3.2.3 Effect of Unequal Number of Clusters and Classes

We use the "Uneven" configuration and vary q2 from 0.05 to 0.25 in steps of 0.05. We apply LGC to the entire multi-network G (with the number of classes equal to 3) and Ncut to G (with the number of clusters equal to 5). We compare their performance against the classification accuracy of Joint on G1 and the NMI of Joint on G2. The results are shown in Figure 4.5. Here again, the proposed Joint learning algorithm outperforms LGC and Ncut at lower noise levels, and the performance of Joint is comparable to LGC and Ncut at higher noise levels.

Figure 4.5: Effect of varying the between-cluster link probabilities (q2) of the bipartite network G12 with an uneven number of clusters/classes. Top: accuracy on G1. Bottom: NMI on G2.

4.3.3 Wikipedia Data

In this experiment, we perform the classification task on the article network and the community detection task on the editor network. As mentioned earlier, there are four large communities in the editor network and 12 sub-topics in the article network. Here we perform four sets of experiments:

1. Apply LGC on the article network only (with k1 = 12) and Ncut on the editor network only (with k2 = 4).
2. Apply LGC (with 12 classes) and Ncut (with 4 clusters) to the entire multi-network G.
3. Apply Joint to the article network (with k1 = 12) and editor network (with k2 = 4) without using prior information.
4. Apply Joint to the article network (with k1 = 12) and editor network (with k2 = 4) with prior information.

Table 4.4: Clustering results of Wikipedia editors. Here (ME) refers to the run with minimum error.

User network                   | 4 clusters (NMI)
Ncut on editor network only    | 0.07
Ncut on entire multi-network   | 0.02
Joint(ME) without prior        | 0.32
Joint without prior            | 0.30 ± 0.0389
Joint(ME) with Prior           | 0.39
Joint with Prior               | 0.36 ± 0.0215

Table 4.5: Classification results of Wikipedia articles. Here (ME) refers to the run with minimum error.

Article network                | 12 classes (accuracy)
LGC on article network only    | 0.87
LGC on entire multi-network    | 0.85
Joint without prior (ME)       | 0.84
Joint without prior            | 0.80 ± 0.054
Joint with Prior (ME)          | 0.88
Joint with Prior               | 0.85 ± 0.021

The results are shown in Tables 4.4 and 4.5. As shown in Table 4.4, the independent clustering of the user network gives very poor results compared to the Joint approach: the cluster NMI increases from 0.07 to 0.32.
Using the prior information further boosts NMI to 0.39. However, this additional gain comes at the expense of slightly reduced classification accuracy on the article network. Applying LGC on article network alone gives an accuracy 60 of 0.87 which reduces to 0.85, when applied to the entire multi-network. This is because the class information provided by the article network is more useful than the “coarse-level” cluster information provided by the editor network. Furthermore, a user typically contributes to articles across different categories which makes it difficult to decide his/her actual class label. We currently assigned the user to the category to which he/she has made the most contributions. In fact, it is because of this problem, it is difficult to acquire the label information in user network, and thus, clustering becomes a necessary task. The Joint approach gives an accuracy of 0.84 on the article network. The loss in accuracy in the article classification can also be explained by examining the resulting confusion matrix, as shown in Table 4.6. The Joint approach identifies four communities in the editor network. Each of these four communities have major correspondence with three article sub-categories. Therefore, the Joint approach settles for a local minimum solution in such a way that the resulting confusion matrix is block diagonal instead of pure diagonal. That is to say, the identification of four communities in the editor network would propagate coarse-level cluster information from the editor network to article network. This makes it harder to discern the sub-categories in the article network and hence the drop in accuracy However, using the right prior information improves the accuracy to 0.88. 4.3.3.1 Number of Iterations In this section ,we discuss the effect of number of iterations on the proposed joint factorization algorithm. Figure 4.6 shows the variation of the Wikipedia article network accuracy and the editor network NMI with every 10 iterations. Both the values tend to stabilize in long run. It should be noted that they do not necessarily exhibit a monotone property with respect to the number of iterations. It is not easy to determine the exact number of iterations required 61 Table 4.6: Confusion Matrix - Article Network using Joint algorithm Political Science 753 21 29 14 1064 27 1 6 622 0 4 2 11 2 0 1 7 0 0 0 0 0 1 0 3 80 0 1 4 14 0 0 2 1 7 11 Natural Science 9 27 3 32 38 11 7 0 3 680 1 17 41 514 4 16 0 747 2 0 2 3 0 2 8 0 6 1 0 4 2 0 2 1 0 3 Computer Science 215 1 1 96 1 8 25 9 0 1 0 3 16 5 0 1 0 1 22 102 5 27 420 1 7 3 769 4 0 0 1 0 0 5 1 0 Biology 6 1 2 29 2 0 4 7 4 1 0 0 13 0 1 7 0 0 1 1 0 0 0 0 0 0 13 345 32 24 17 870 37 139 49 589 for convergence. We run the algorithm until the error between two successive iterations is less than a specified threshold or maximum of 1500 steps. 4.3.4 Digg Data Here, we linked the Digg users with Wikipedia editors based on the similarity between the words in the bookmarked URLs and words in the edited Wikipedia articles. There are three clusters in each network. The adjacency plot for these two networks is shown in Figure 4.7. We first performed Ncut on the Digg data alone and Ncut on the overall network (Digg + Wikipedia + links between them). The confusion matrix is given in Table 4.7. As can be seen in the adjacency matrix plot, the first two clusters are noisy and heavily interlinked. So we obtain only two predominant clusters. We apply the LGC algorithm to propagate labels in the Wikipedia data set. 
The results are summarized in Table 4.8. The noisy Digg data degrades the performance of LGC on the Wikipedia network: propagating labels on the Wikipedia data set alone gives an accuracy of 0.71, which drops to 0.66 when LGC is applied to the multi-network.

[Figure 4.6: Plot showing the variation of classification accuracy (article network) and clustering NMI (editor network) for every 10 iterations (best viewed in color).]

[Figure 4.7: Adjacency matrix plot for the Digg user (right) and Wikipedia editor (left) networks (best viewed in color).]

The presence of noise in the Digg user network, combined with the noisy links between the Wikipedia and Digg networks, results in poor performance of the joint learning algorithm: the number of clusters obtained is smaller than expected. However, incorporating the prior matrix P = I3 ensures that we obtain three clusters on each network. The results are shown in Table 4.9. Clearly, the Joint + Prior results are significantly better than both Ncut and LGC.

Table 4.7: Confusion matrix for Digg clustering. The top panel shows the result of Ncut on the Digg user network. The bottom panel shows the clustering result on the multi-graph consisting of both Digg users and Wikipedia editors. Only two dominant clusters were found in both cases.

    Ncut on Digg
    1171     3   474
    1149     0   486
     392     0  1995
    NMI = 0.143

    Ncut on Digg + Wikipedia
     167  1481     0
     997   638     0
    1952   435     0
    NMI = 0.185

Table 4.8: Confusion matrix for Wikipedia classification. The top panel gives the confusion matrix for LGC on the Wikipedia editor network; the bottom panel gives the confusion matrix for LGC on the multi-graph consisting of both the Digg user and Wikipedia editor networks.

    LGC on Wikipedia
    1463   203   150
     350   690   159
     232   100   859
    Accuracy = 0.71

    LGC on Digg + Wikipedia
    1688    78    50
     710   416    73
     450    56   685
    Accuracy = 0.66

Table 4.9: Confusion matrix for Joint with prior information for the run with minimum error.

    Digg - Cluster
    1246   280   122
     136  1278   221
     227   103  2057
    NMI = 0.44

    Wiki - Classify
    1710    84    22
     298   770   131
      55   118  1018
    Accuracy = 0.83

4.4 Summary

In this chapter, we have presented a framework for performing mutual learning on multiple related networks. Through various sets of experiments, we showed that, on a collection of noisy networks, multi-task learning performs better than independent task learning on the individual networks. We have also introduced the idea of using a prior to guide the clustering process, and we have demonstrated a practical use of our algorithm by identifying similar communities across different network domains, namely Digg and Wikipedia. Recently, researchers have used Wikipedia as a knowledge source or auxiliary information source for several data mining tasks such as document clustering, web page classification, tagging, and semantic relationship mining [6, 38, 44, 101, 102]. In this thesis, we show yet another novel use of Wikipedia, in which the link information between Wikipedia articles is used to cluster users in other networks; in particular, we use the link information in Wikipedia to cluster users in the Digg network. Incorporating nodal attributes and improving the scalability of the algorithm to networks with millions of nodes are two potential directions for future research.

Chapter 5

Joint Community Detection and Link Prediction

The ability to predict the formation of links in a network is an important task in network analysis.
A reliable link prediction model is useful for uncovering missing links in a static network or for projecting the formation of new links in a dynamic network. The level of link prediction accuracy sought often varies depending on the context of its application. For example, despite its poor accuracy, the FOF (Friend of Friend) algorithm has been extensively used in predicting future links between users in large social networks. However, in critical applications such as bio-surveillance and terrorist network monitoring, predictions are cost sensitive in that there is a severe penalty factor associated with different incorrect predictions made by algorithm. The low accuracies of link prediction algorithms can be attributed to the inherent skewness of network data. In this regards, there are two types of skewness to be considered. The obvious one is class skewness, which refers to the lopsided ratio of non-linked (the negative class) to linked (the positive class) node pairs in a network. Typically the ratio is of the order of O(1/N ). Such high skewness would result in a biased decision boundary and requires inclusion of skew correction approaches into the link prediction framework. Another type of skewness, which has received little attention in the link prediction literature, is in 66 the degree distribution of the nodes. Since many networks exhibit a scale-free behavior, this results in the (few) high degree nodes exerting the most influence on the prediction for the positive class. As a result, most link prediction models tend to fail in their prediction for the majority of the low-degree nodes. To avoid this, we need to develop a loss function for link prediction that considers both type of skewness in the data. In addition, existing link prediction methods are either global or local in nature. The former (e.g., common neighbors[54]) simply utilizes information from the immediate neighborhood of the nodes to make its prediction. Though such an approach tends to perform poorly especially on large networks, it is computationally efficient. The latter (e.g., based on supervised learning [43, 85]) often achieves better performance but at the expense of higher computation time. Moving away from these two extremes is the method of finding links at the community (cluster) level. The intuition here is that links are more likely to be formed between nodes in the same community rather than those in different communities. However, identifying the right set of communities is itself a challenging problem. One of the most well-known community finding algorithms is based on the network modularity measure [72]. However, to the best of our knowledge, none of the existing link prediction algorithms are designed to optimize the measure. The main contributions of this chapter are as follows: • We propose a variable-cost loss function for supervised link prediction that considers both the imbalanced class distribution (of linked and non-linked node pairs) as well as skewness in the degree distribution. The variable-cost loss function addresses the bias in degree distribution by penalizing the misclassification of low-degree linked node pairs more than misclassification of high-degree linked node pairs. 67 • We show the intimate relationship between the proposed loss function and the modularity measure used to identify communities in a network. 
As a consequence, a link prediction algorithm that optimizes the proposed loss function is inherently biased towards finding links within communities without explicitly identifying the communities. • We design a boosting algorithm for link prediction called LinkBoost that optimizes the cost-sensitive loss function. We also provide weak learners that utilize the nodal attributes to estimate the link potentials between the node pairs. • We present an approach for scaling up the LinkBoost framework by first decomposing the network into smaller, potentially overlapping partitions and then combining the predictions made by the weak learners constructed from the different partitions. To the best of our knowledge, the degree dependent cost sensitive link prediction algorithm is the first of its kind. Through the design of our loss function, the chapter also highlights the connection between community-based link prediction, modularity measure, and the boosting algorithm. Finally, experimental results show that our proposed LinkBoost algorithm consistently performs as good as or better than many existing methods when evaluated on 4 real-world network datasets. 5.1 Approaches for Link Prediction Link prediction algorithms can be categorized in many ways. First, the algorithms can be supervised or unsupervised. Second, they can be based on the observed link structure only or may incorporate nodal attributes. Third, the algorithms may utilize the link structure information from immediate neighborhoods (local methods), entire network (global methods), 68 or at community levels. The simplest unsupervised link prediction algorithm is based on computing the similarity scores between a pair of nodes using the nodal attributes or their local network topology. Examples of such local methods include common neighbors, Salton index [82], preferential attachment [7, 105], and Adamic-Adar index [2]. The performance of the different local measures were compared in [54, 116]. The results suggest that simple common neighbors approach performs better than other local measures. A theoretical justification for the better performance of common neighbors approach was presented in [78]. Unsupervised global methods for link prediction typically consider the weighted paths between node pairs. Examples include the Katz measure [49], random walk with restart [99], average commute time, and matrix forest index [15]. Measures based on paths, in general offer higher prediction accuracy compared to the local similarity measures. However, they require the entire network link structure and their computations are generally time consuming. Link prediction using supervised learning has been investigated by many authors [43, 46, 85, 97]. Al Hasan et al. [43] derived several nodal and topological features for link prediction and applied a variety of classifiers such as support vector machines and decision tree to predict links in bibliographic databases. Kashima and Abe [46] proposed a parameterized probability model for the link structure and developed an expectation maximization algorithm to estimate the model parameters. Scripps et al. [85] employed a regularized matrix factorization approach for link prediction. Taskar et al. [97] used a relational Markov network to jointly model the nodal attributes and links. However, one of the main challenges in link prediction is the extremely large class skew, which leads to poor detection rate. 
Rattigan and Jensen [80] suggested an alternative problem, known as anomalous link discovery, to identify the most interesting links in the network. Recently, there have been attempts to develop semi-supervised approaches for link prediction [48], as well as to combine link prediction with other tasks such as collective classification [11]. More recently, there have been attempts to develop link prediction algorithms using generative models that account for the clustering (community) structure of the network. Guimera et al. [39] used likelihood-based methods to estimate the reliability of a link between any node pair given the observed link structure. The reliability score is then used to predict both missing and spurious links. Clauset et al. [18] proposed a maximum-likelihood-based method that represents the clusters in the network as a hierarchy, which in turn is represented as a dendrogram. Each dendrogram has an associated likelihood value indicating the strength of the community structure it represents. Missing links are predicted by first sampling a large number of dendrograms with probability proportional to their likelihood; for each unconnected node pair i and j, the expected connection probability is computed by averaging the corresponding probability over all sampled dendrograms. Finally, the node pairs are sorted according to their connection probability and the highest-ranked pairs are declared potential links. Both the reliability model and the hierarchical cluster model try to estimate the link potentials between node pairs at the cluster level. In fact, they average over all possible community partitions of the given network, which makes them very costly to apply even on small networks. In this chapter, we suggest an alternative to these two algorithms that strives to identify more links within a community. We do this by defining a loss function whose associated risk, when minimized, leans towards assigning higher scores to within-community node pairs. We also show the relationship between the proposed cost-sensitive loss function and the well-known modularity measure used for clustering networks.

5.2 Loss Function and Risk

We consider the link prediction task as a binary classification problem, in which a node pair is assigned to the positive class if there is a link between the nodes, and to the negative class otherwise. Let V = {1, 2, · · · , n} denote the set of nodes in the network and E = V × V denote the set of all node pairs. We represent the adjacency matrix of the network as A, where Aij ∈ {+1, −1} indicates the presence or absence of a link. Each node i ∈ V is associated with a d-dimensional vector of nodal attributes xi = (xi1, xi2, ..., xid). Our objective is to learn a target function f : V × V → ℜ that maps each node pair to its link potential. The function is optimal if it minimizes the expected risk R = E_{E,A}[L(f(e), a)] for any given node pair e ∈ E, where L[f(e), a] is the loss function. The loss function usually takes the form

    L[f(e_{ij}), A_{ij}] =
    \begin{cases}
    0,   & \text{if } \mathrm{sgn}(f(e_{ij})) = A_{ij} \\
    C_1, & \text{if } \mathrm{sgn}(f(e_{ij})) \neq A_{ij} = 1 \\
    C_2, & \text{if } \mathrm{sgn}(f(e_{ij})) \neq A_{ij} = -1
    \end{cases}    (5.1)

where sgn(·) is the sign function, whose value is +1 if its argument is non-negative and −1 otherwise. When C1 = C2 = 1, this corresponds to the 0-1 loss function.
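As a concrete illustration, the short Python sketch below evaluates the loss in (5.1) for a single node pair; the function name and the example values are ours and are not part of the original implementation.

    def cost_sensitive_loss(f_ij, A_ij, C1=1.0, C2=1.0):
        # Loss of Eq. (5.1): zero when the sign of the predicted link
        # potential matches the label; C1 for a missed link (A_ij = +1);
        # C2 for a falsely predicted link (A_ij = -1).
        pred = 1 if f_ij >= 0 else -1      # sgn(.), with sgn(0) = +1
        if pred == A_ij:
            return 0.0
        return C1 if A_ij == 1 else C2

    print(cost_sensitive_loss(-0.3, +1, C1=5.0, C2=0.1))   # 5.0: missed link
    print(cost_sensitive_loss(+0.7, -1, C1=5.0, C2=0.1))   # 0.1: spurious link
    print(cost_sensitive_loss(+0.7, +1))                   # 0.0: correct sign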
Many supervised link prediction algorithms are designed to yield a classifier that minimizes the following 0-1 empirical risk:

    \hat{R}_{0\text{-}1} = \frac{1}{n^2} \sum_{i,j=1}^{n} I\big[ A_{ij}\, \mathrm{sgn}(f(e_{ij})) \le 0 \big]    (5.2)

where I(·) is an indicator function, equal to 1 if its argument is true and zero otherwise. A major hindrance in this binary classification task is the class imbalance problem. In social network data, the negative examples (non-linked node pairs) tend to outnumber the positive examples by a significantly large proportion. The literature on classification of imbalanced data suggests two approaches to tackle this problem, namely sampling and cost-sensitive learning. In the first approach, a balanced training set is obtained by undersampling the negative examples or oversampling the positive examples. This approach has several drawbacks. First, undersampling the negative examples reduces the amount of data available for training an accurate model; furthermore, the undersampling has to be repeated to remove the sampling bias. On the other hand, oversampling the positive examples in social network data increases the training set size significantly, which in turn makes the training time considerably longer. The cost-sensitive learning approach is based on the premise that different classes of examples (positive or negative) incur different penalties for misclassification. The loss function defined in (5.1) is cost sensitive if C1 ≠ C2, where C1 is the cost of misclassifying linked node pairs as non-linked and C2 is the cost of misclassifying non-linked node pairs as linked. The loss function defined in (5.1) and its associated risk are not differentiable, and hence do not offer much mathematical dexterity in designing classifiers. The risk associated with the exponential loss can be used as an alternative:

    \hat{R}_{\exp} = \frac{1}{n^2} \sum_{ij} \exp\big[ -A_{ij}\, f(e_{ij}) \big]    (5.3)

The exponential risk is a continuous and differentiable function, and it bounds the 0-1 risk from above. It can be shown that an equivalent expression bounding the cost-sensitive loss function defined in (5.1) is

    \hat{R}_{\text{cost-sens}} = \frac{1}{n^2} \sum_{ij} \Big[ I(A_{ij} = 1) \exp\big(-C_1 f(e_{ij})\big) + I(A_{ij} = -1) \exp\big(C_2 f(e_{ij})\big) \Big]    (5.4)

By simply changing the class labels for the presence and absence of links from {+1, −1} to {C1, −C2}, the cost-sensitive risk in Equation (5.4) can be transformed into the empirical risk of (5.3). Generally, the cost parameters C1 and C2 are chosen so that they correct for the classification bias that arises from the skewness of the class distribution. If n+ and n− represent the number of positive and negative examples in the data, then C1 and C2 are often chosen such that C1/C2 = n−/n+. For large sparse networks, the fraction n−/n+ = O(n); thus, if we fix C1 = 1, then C2 ∼ n⁻¹, which forces us to work with extreme penalties that are easily polluted by the limitations of machine precision. To avoid this, we need to scale the costs of both the positive and negative labels such that the desired penalty ratio is maintained.
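The following minimal Python sketch (our own illustration, assuming dense n × n matrices F of link potentials and A of ±1 labels) computes the cost-sensitive empirical risk of (5.4) and verifies the label-rescaling argument that reduces it to the exponential risk of (5.3).

    import numpy as np

    def cost_sensitive_exp_risk(F, A, C1, C2):
        # Empirical risk of Eq. (5.4). F: n x n link potentials f(e_ij);
        # A: n x n adjacency matrix with entries in {+1, -1}.
        n = A.shape[0]
        risk = np.where(A == 1, np.exp(-C1 * F), np.exp(C2 * F))
        return risk.sum() / n**2

    def relabelled_exp_risk(F, A, C1, C2):
        # Same quantity via Eq. (5.3) after relabelling {+1, -1} as {C1, -C2}.
        Y = np.where(A == 1, C1, -C2)
        return np.exp(-Y * F).sum() / A.shape[0]**2

    A = np.array([[-1, 1], [1, -1]])
    F = np.array([[-0.2, 0.5], [0.5, -0.9]])
    print(cost_sensitive_exp_risk(F, A, 1.0, 0.01))
    print(relabelled_exp_risk(F, A, 1.0, 0.01))   # identical value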
Another significance of the cost ratio C1/C2 is its role in determining the optimal cost-sensitive decision surface. The optimal decision surface for cost-sensitive learning, f* = arg min_f E_{E,A}[L(e, a)], is given by the Bayes decision rule [66]:

    f^*(e) = \log \frac{P_{A|E}(a = 1 \mid e)\, C_1}{P_{A|E}(a = -1 \mid e)\, C_2}    (5.5)

Hence, for any cost structure (C1, C2), cost-sensitive optimality differs from cost-insensitive optimality only through the threshold T = log(C2/C1).

The preceding formulation assumes that C1 and C2 are constants. We argue that it may not be desirable to treat the misclassification cost of all the linked (and non-linked) node pairs by the same yardstick. In the next section, we present a variable-cost loss function in which the misclassification of low-degree linked node pairs incurs a greater penalty than the misclassification of high-degree linked node pairs. We also show that this modification leads to a link prediction algorithm that leans towards predicting more links within the same community than between communities.

5.3 Variable Cost Loss Function for Link Prediction

This section describes our rationale for introducing a variable-cost loss function for link prediction. It is generally observed that the degree distribution of real-world networks tends to follow a power law, with a few high-degree nodes and a large number of low-degree nodes. Consequently, a supervised learning algorithm for link prediction faces not only the bias from the large number of non-linked node pairs (negative class) but also a bias from the small number of high-degree nodes. Specifically, among the linked node pairs (positive class), the high-degree nodes contribute more in determining the decision surface. Since we want to build models that can explain the observed links between any node pairs, and that are not strongly influenced by the links formed by a few high-degree nodes, we need to design a loss function that removes this bias within the positive class.

One way to do this is to make the misclassification penalty dependent on the degrees of the nodes. Let ki be the degree of node i. Then the cost of misclassifying the linked node pair eij is given by C1(eij) = 1 − βki kj, where β is a user-defined parameter, typically chosen to keep the cost function non-negative. Notice that C1 monotonically decreases with increasing ki or kj, thus penalizing the misclassification of links between low-degree nodes more than the misclassification of links between high-degree nodes. Analogously, the same reasoning can be applied to the non-linked node pairs: the low-degree node pairs exert a higher influence on the negative class than the high-degree node pairs. To remove this bias among the negative examples, we define the cost of misclassifying non-linked node pairs as C2(eij) = γki kj, which increases when the node degrees are higher. Finally, we need to account for the overall bias between the positive and negative examples; this is done by choosing the values of β and γ such that C2 < C1. Putting it all together, we obtain the following loss function:

    L(f(e_{ij}), A_{ij}) =
    \begin{cases}
    1 - \beta k_i k_j, & \text{if } \mathrm{sgn}(f(e_{ij})) \neq A_{ij} = 1 \\
    \gamma k_i k_j,    & \text{if } \mathrm{sgn}(f(e_{ij})) \neq A_{ij} = -1 \\
    0,                 & \text{otherwise}
    \end{cases}    (5.6)

The distinguishing aspect of the above loss function is that it assigns a variable misclassification cost to different node pairs. When β = 1 / Σi ki, the term βki kj represents the expected number of links between nodes i and j [72].
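To make the degree-dependent costs concrete, the sketch below (our own; a 0/1 adjacency matrix is assumed) builds the per-pair cost matrices C1(eij) and C2(eij) of (5.6), using β = γ = 1/Σi ki as the default.

    import numpy as np

    def degree_dependent_costs(A01, beta=None, gamma=None):
        # Per-pair costs C1(e_ij) = 1 - beta*k_i*k_j and C2(e_ij) = gamma*k_i*k_j
        # from Eq. (5.6). A01 is a 0/1 adjacency matrix; by default
        # beta = gamma = 1 / sum_i k_i = 1/(2m), the value used in Section 5.4.
        k = A01.sum(axis=1)                 # node degrees
        two_m = k.sum()                     # sum of degrees = 2m
        beta = 1.0 / two_m if beta is None else beta
        gamma = 1.0 / two_m if gamma is None else gamma
        kk = np.outer(k, k)
        C1 = 1.0 - beta * kk                # cost of missing a true link
        C2 = gamma * kk                     # cost of predicting a spurious link
        return C1, C2

    # Path graph on 5 nodes: degrees (1, 2, 2, 2, 1), 2m = 8.
    A01 = np.array([[0, 1, 0, 0, 0],
                    [1, 0, 1, 0, 0],
                    [0, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1],
                    [0, 0, 0, 1, 0]])
    C1, C2 = degree_dependent_costs(A01)
    print(C1.round(3))   # low-degree pairs carry the largest C1
    print(C2.round(3))   # high-degree pairs carry the largest C2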
We will show in the next section that, for this specific value of β, lowering the risk associated with the variable-cost loss function is the same as maximizing the modularity measure. As a result, the learning algorithm is biased more towards learning links between node pairs in the same community than links that lie between communities.

5.4 Modularity

A well-accepted conjecture in the network mining literature is that link densities are expected to be higher within a community than between communities. This suggests the possibility of an intimate connection between the link prediction and community finding tasks. A popular method for identifying communities in a network is the well-known modularity measure [72]. Here, the possible existence of a community in a given network is revealed by comparing the actual link density in the subgraph induced by the community against the density one would expect if the nodes of the subgraph were linked irrespective of the community structure. The modularity measure can be quantified as follows:

    Q = \sum_{ij} \Big[ I(A_{ij} > 0) - P_{ij} \Big] I(c_i = c_j)    (5.7)

where Pij is the expected number of links between nodes i and j under a null model (or reference network), and the variables ci and cj represent the community memberships of nodes i and j, respectively. Modularity-based community finding algorithms are designed to assign the nodes to communities such that the overall modularity measure Q is maximized. The null model used often corresponds to a random graph with the same degree distribution as the given network. This leads to [17]

    Q = \frac{1}{n^2} \sum_{ij} \Big[ I(A_{ij} > 0) - \frac{k_i k_j}{2m} \Big] I(c_i = c_j)    (5.8)

where m = Σi ki / 2 is the number of links in the network. The following theorem shows the equivalence between maximizing (5.8) and minimizing the risk associated with a special case of the loss function given in (5.6). Consider a community-based link prediction model that predicts the existence of a link between a node pair based on whether the nodes are in the same community, i.e.,

    \mathrm{sgn}(f_{\text{comm}}(e_{ij})) =
    \begin{cases}
    +1, & \text{if } I(c_i = c_j); \\
    -1, & \text{otherwise.}
    \end{cases}    (5.9)

Theorem 2 For the variable-cost loss function given in (5.6), minimizing the risk associated with the community-based link prediction model f_comm(eij) with β = γ = 1/Σi ki is equivalent to maximizing the modularity function in (5.8).

Proof 1 The empirical risk associated with the variable-cost loss function given in (5.6) for the community-based link prediction model is

    \hat{R}_{\text{mod}} = \frac{1}{n^2} \bigg[ \sum_{ij: A_{ij}=1} I\big(\mathrm{sgn}(f_{\text{comm}}(e_{ij})) = -1\big)\,(1 - \beta k_i k_j) + \sum_{ij: A_{ij}=-1} I\big(\mathrm{sgn}(f_{\text{comm}}(e_{ij})) = 1\big)\, \gamma k_i k_j \bigg]
                        = \frac{1}{n^2} \bigg[ \sum_{ij: A_{ij}=1} (1 - \delta_{ij})\Big(1 - \frac{k_i k_j}{2m}\Big) + \sum_{ij: A_{ij}=-1} \delta_{ij}\, \frac{k_i k_j}{2m} \bigg]    (5.10)

where we have substituted δij = I(sgn(f_comm(eij)) = 1) and β = γ = 1/2m. Minimizing this empirical risk with respect to δ is equivalent to maximizing the following:

    \max_{\delta} \frac{1}{n^2} \bigg[ \sum_{ij: A_{ij}=1} \delta_{ij}\Big(1 - \frac{k_i k_j}{2m}\Big) - \sum_{ij: A_{ij}=-1} \delta_{ij}\, \frac{k_i k_j}{2m} \bigg]
    = \max_{\delta} \frac{1}{n^2} \sum_{ij} \delta_{ij} \Big[ I(A_{ij} > 0) - \frac{k_i k_j}{2m} \Big]    (5.11)

Since δij = I(sgn(f_comm(eij)) = 1) = I(ci = cj), this completes the proof.

The preceding theorem suggests that maximizing the modularity measure is equivalent to minimizing a special case of the loss function using the clustering solution, f_comm(eij), as the link prediction model. The clustering solution uses only the network topology to explain the link potential between node pairs.
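For reference, the following sketch (our own, using the 1/n² normalization of (5.8) rather than the more common 1/(2m) prefactor; the maximizing assignment is the same) computes the modularity of a given hard community assignment from a 0/1 adjacency matrix.

    import numpy as np

    def modularity_q(A01, labels):
        # Modularity of Eq. (5.8) for a hard community assignment.
        # A01: 0/1 adjacency matrix; labels: community id per node.
        k = A01.sum(axis=1)
        two_m = k.sum()
        n = A01.shape[0]
        B = A01 - np.outer(k, k) / two_m              # modularity matrix B_ij
        same = labels[:, None] == labels[None, :]     # delta_ij = I(c_i = c_j)
        return (B * same).sum() / n**2

    A01 = np.array([[0, 1, 1, 0, 0],
                    [1, 0, 1, 0, 0],
                    [1, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1],
                    [0, 0, 0, 1, 0]])
    labels = np.array([0, 0, 0, 1, 1])
    print(modularity_q(A01, labels))   # positive: the split respects the link density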
In contrast, our proposed variable-cost loss function provides a framework that allows us to estimate the link potential using other information, including the nodal attributes. We design f_comm(eij) as a function of the nodal attributes xi and xj. Our experimental results demonstrate the effectiveness of using an exponential loss compared to the modularity function (5.24) for link prediction.

5.5 Boosting Approach for Link Prediction

This section presents our method for optimizing the variable-cost loss function given in Section 5.3. The risk associated with the loss function in (5.6) is non-differentiable, so we employ the following variable-cost empirical risk function:

    \hat{R}_{\text{mod}} = \frac{1}{n^2} \sum_{ij} \Big[ I(A_{ij} = 1) \exp\big[-(1 - \beta k_i k_j) f(e_{ij})\big] + I(A_{ij} = -1) \exp\big[\gamma k_i k_j f(e_{ij})\big] \Big]    (5.12)

If we set β = γ, the above risk reduces to

    \hat{R}_{\text{mod}} = \frac{1}{n^2} \sum_{ij} \exp\Big[ -\big(I(A_{ij} > 0) - \beta k_i k_j\big) f(e_{ij}) \Big]    (5.13)

This form of loss function is well studied in the machine learning community using additive modeling or boosting techniques [19]. Specifically, an additive model takes the form

    f_{\alpha}(x) = \mathrm{sgn}\Big[ \sum_t \alpha_t f_t(x) \Big].

In boosting, each f_t corresponds to a weak learner, and the goal is to identify a sequence of constants α1, ..., αk such that a linear combination of the weak learners performs better than any of the individual learners.

5.5.1 Estimating αt

Our aim is to design a boosting algorithm that minimizes the variable-cost empirical risk function R̂_mod. To do this, we need to induce a sequence of weak learners that reduce the risk as much as possible. Let F = Σ_{i=1}^{t−1} αi f^i be the solution of the boosting algorithm at step (t − 1) and let f^t be the currently induced weak learner. We need to identify an appropriate αt that leads to an improvement in R̂_mod. The optimization problem at step t is given by

    \min_{\alpha_t, f^t} \sum_{ij} \exp\Big[ -\Big(I(A_{ij} > 0) - \frac{k_i k_j}{2m}\Big)\big(F_{ij} + \alpha_t f^t(e_{ij})\big) \Big]    (5.14)

To highlight the effect of the current weak learner, we isolate the effect of the past weak learners from the equation. Let

    D_{ij} = \exp\Big[ -\Big(I(A_{ij} > 0) - \frac{k_i k_j}{2m}\Big) F_{ij} \Big], \qquad
    M_{ij} = \Big(I(A_{ij} > 0) - \frac{k_i k_j}{2m}\Big) f^t(e_{ij}), \qquad
    s_{ij} = \mathrm{sgn}(M_{ij}),
    W^+ = \sum_{ij: M_{ij} > 0} D_{ij} |M_{ij}|, \qquad
    W^- = \sum_{ij: M_{ij} < 0} D_{ij} |M_{ij}|    (5.15)

It can be shown that the objective function given in (5.14) is bounded as follows:

    \sum_{ij} \exp\Big[ -\Big(I(A_{ij} > 0) - \frac{k_i k_j}{2m}\Big)\big(F_{ij} + \alpha_t f^t(e_{ij})\big) \Big]
    = \sum_{ij} D_{ij} \exp(-\alpha_t s_{ij} |M_{ij}|)
    \le \sum_{ij} D_{ij} \Big[ |M_{ij}|\big(\exp(-\alpha_t s_{ij}) - 1\big) + 1 \Big]
    = W^+ \exp(-\alpha_t) + W^- \exp(\alpha_t) - W^+ - W^- + \sum_{ij} D_{ij}    (5.16)

where the inequality follows from applying Jensen's inequality together with the assumption that |Mij| ≤ 1; the final sum over Dij does not depend on αt. For a given f^t, setting the partial derivative of this bound with respect to αt to zero gives

    \alpha_t = \frac{1}{2} \log \frac{W^+}{W^-}    (5.17)

The formula for αt is similar in spirit to that of regular AdaBoost, in which αt = ½ log((1 − e)/e), where e is the error rate of the weak classifier. In our case, W⁺ and W⁻ represent the weighted sums of the correctly classified and incorrectly classified node pairs, respectively.
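A minimal Python sketch of this step is given below; it is our own illustration of (5.15)-(5.17), with the pair-weight update written in the incremental form implied by the definition of Dij (Algorithm 2 later expresses the same update in terms of the subgraph adjacency matrix). All names are ours.

    import numpy as np

    def boosting_step_weight(D, B, f_t, eps=1e-12):
        # alpha_t of Eq. (5.17). D: n x n pair weights (Eq. 5.15);
        # B: modularity coefficients B_ij = I(A_ij > 0) - k_i k_j / (2m);
        # f_t: n x n weak-learner scores, scaled so that |B_ij * f_t_ij| <= 1.
        # eps guards against an empty W+ or W- in degenerate cases.
        M = B * f_t
        W_plus = (D * np.abs(M) * (M > 0)).sum()    # correctly scored pairs
        W_minus = (D * np.abs(M) * (M < 0)).sum()   # incorrectly scored pairs
        return 0.5 * np.log((W_plus + eps) / (W_minus + eps))

    def update_pair_weights(D, B, f_t, alpha):
        # Multiplicative update that keeps D_ij = exp(-B_ij * F_ij) after the
        # accumulated model F gains the term alpha * f_t.
        return D * np.exp(-alpha * B * f_t)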
5.5.2 Weak Learners

This section describes the construction of the weak learners used in our boosting framework. As in traditional boosting, we could apply any simple classifier as long as it takes into consideration the weight matrix D associated with the node pairs. The weak learner considered in this study is based on the nodal attributes and can be computed in closed form. Let X represent the n × d nodal attribute matrix. Given the current weight matrix D between the node pairs, the goal of the weak learner is to estimate the n × n link potential matrix L(X), where Lij = f^t(eij) indicates the strength of the link between nodes i and j. Large positive values of Lij indicate greater potential for a link between the nodes, while large negative values indicate greater repulsion against link formation. We model L(X) as a simple weighted correlation of the nodal features: let L(X) = XWXᵀ, where the weight matrix W is a d × d matrix estimated by solving the following objective function:

    Q = \max_{W} \sum_{ij} (D_{ij} B_{ij}) \big[XWX^T\big]_{ij} - \frac{\lambda}{2} \|W\|_2^2    (5.18)

where Bij = [I(Aij > 0) − ki kj/(2m)] is the coefficient term of the modularity measure, or equivalently the cost associated with each node pair. Differentiating the objective function, we get

    \frac{\partial L}{\partial W_{pq}} = -\sum_{ij} D_{ij} B_{ij} \big(X_{ip} X^T_{qj}\big) + \lambda W_{pq} = 0    (5.19)

which gives

    W_{pq} = \frac{1}{\lambda} \sum_{ij} X_{ip} D_{ij} B_{ij} X^T_{qj} = \frac{1}{\lambda} \sum_{ij} X^T_{pi} D_{ij} B_{ij} X_{jq}    (5.20)

Let • denote element-wise matrix multiplication. Then W can be written as

    W = \frac{1}{\lambda} X^T (B \bullet D) X    (5.21)

Thus, the link potential function L(X) = XWXᵀ for the given weight matrix D is

    L = \frac{1}{\lambda} X X^T (B \bullet D) X X^T    (5.22)

A crude interpretation of the above solution is that it aligns the correlation between the nodal attributes with the modularity matrix. λ is chosen as a normalization constant such that the estimated link potentials are mapped to the interval [−1, 1]. A distinct aspect of the above definition of weak learners is that it does not require explicit conversion of nodal features into edge features. Traditional classifiers such as support vector machines or logistic regression require one to construct a feature vector for each node pair from the nodal features, which is itself a time-consuming process.
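The closed-form weak learner of (5.21)-(5.22) can be sketched in a few lines of Python. This is our own illustration: a 0/1 adjacency matrix and uniform pair weights are assumed in the example, and λ is set to the value that scales the potentials into [−1, 1].

    import numpy as np

    def weak_learner_potentials(X, A01, D):
        # Closed-form weak learner of Eqs. (5.21)-(5.22).
        # X: n x d nodal attributes; A01: 0/1 adjacency; D: n x n pair weights.
        k = A01.sum(axis=1)
        two_m = k.sum()
        B = A01 - np.outer(k, k) / two_m        # modularity coefficients B_ij
        W = X.T @ (B * D) @ X                   # Eq. (5.21), up to 1/lambda
        L = X @ W @ X.T                         # Eq. (5.22)
        lam = np.abs(L).max() + 1e-12           # lambda chosen as a normalizer
        return L / lam

    A01 = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]])
    X = np.array([[1., 0., 1.],
                  [1., 1., 0.],
                  [0., 1., 1.],
                  [0., 0., 1.]])
    L = weak_learner_potentials(X, A01, np.ones((4, 4)))
    print(L.round(2))   # entries in [-1, 1]; large values suggest likely links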
5.5.3 Scalability

Link prediction algorithms such as preferential attachment and common neighbors, though they often have poor performance, are still considered attractive for many practical applications because they are easy to implement and scale to large networks. Scalability is one of the important aspects of the proposed link prediction algorithm. Even though the number of links in a large sparse network is small, a supervised link prediction algorithm must examine all possible node pairs, thereby increasing the size of the data to be dealt with. In addition to the number of node pairs, the number of features associated with each node may impose severe constraints on the performance of the model with respect to speed, memory requirements, and accuracy. In this section, we describe an approach to scale up our proposed algorithm by decomposing the network into smaller, potentially overlapping partitions and using the boosting approach to systematically combine the weak learners constructed from each partition. This divide-and-conquer strategy is well suited both to the link prediction problem and to the boosting framework, since link formation is typically a local phenomenon: there are several small communities in the network and links are formed mostly within a community. It is therefore beneficial to construct a local (weak) learner from a small segment of the network at a time and to aggregate the learners in a principled way, via the boosting formulation, to form the global model. There are many strategies for creating subgraph partitions from a large network. Our requirements are that (1) the partitions must be distinct enough from each other to induce a diverse (uncorrelated) set of weak learners, and (2) the partitioning approach must be efficient to implement, especially for large-scale networks. We tried several partitioning strategies (e.g., applying a random walk starting from randomly chosen seed nodes) but found that they often fail to satisfy one of the two requirements. This led us to consider the domain partitioning strategy, which is inexpensive to implement and often produces a diverse set of partitions.

The proposed scalable LinkBoost algorithm is summarized in Algorithm 2. The function GetFeaturePartition(η) returns a feature partition in which each partition set contains η% of the features. For each partition, we create a subgraph containing only those nodes that have at least one non-zero value with respect to the selected set of features. We then build a local model on the subgraph by invoking the GetBaseLearner subroutine. The subroutine takes the following parameters as input: (1) A_v, the adjacency matrix associated with the subgraph induced by the feature partition; (2) X_p, the subset of the nodal attributes in the subgraph; (3) D_v, the weights on the node pairs in the subgraph; and (4) k_v, the global degrees of the nodes in the subgraph. The weight matrix returned by the GetBaseLearner subroutine is used to update the estimated link potential matrix. This process is repeated T times on all subgraphs obtained from the different feature partitions.

Algorithm 2 LinkBoost
Input:
    A: n × n adjacency matrix with {+1, −1} entries
    X: n × d nodal attribute matrix
    η: threshold for feature partitioning
Output:
    F: n × n link potential matrix
Initialize: F = [0]_{n×n}; D^{(0)} = [1]_{n×n}; k: n × 1 column vector of node degrees
for t = 1 to T do
    P ← GetFeaturePartition(η)
    for X_p ∈ P do
        V ← GetSubGraphNodes(X_p)
        W ← GetBaseLearner(X_p, D_v, k_v, A_v)
        Compute f_v = X_p W X_p^T
        Compute α using (5.17)
        F_v ← F_v + α f_v
        D_v ← D_v exp(−α A_v • L_v)
    end for
end for
return F

In addition to its efficient implementation and the diversity of its induced weak learners, another advantage of the domain partitioning strategy is that the final hypothesis has a nonlinear decision surface. It can easily be seen that the weak learner XWXᵀ described in Section 5.5.2 yields a linear decision surface separating the linked and non-linked node pairs. Since the boosting algorithm also combines the weak learners in a linear fashion, it will not be able to significantly alter the decision surface. However, by employing domain partitioning in the weak learner construction, we work with one distinct subgraph at a time. The weight matrix W returned by the GetBaseLearner function is applied only to the current subgraph V and not to the entire network. This induces a nonlinear decision surface (a clipped line) in the feature space: a line with respect to the node pairs in the current partition and zero for node pairs outside the partition. Finally, the boosting algorithm combines the collection of clipped lines to produce a final classifier with a nonlinear decision surface.
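The sketch below is our own compact rendering of Algorithm 2 in Python, under a few stated assumptions: features are non-negative (so a node is "active" in a partition if it has a non-zero value on any selected feature), η is treated as the fraction of features per partition block, and the pair-weight update uses the modularity coefficients B as in (5.15). It is meant to illustrate the control flow, not to reproduce the authors' implementation.

    import numpy as np

    def linkboost(X, A01, eta=0.05, T=10, seed=0):
        # Sketch of Algorithm 2 (LinkBoost) with the closed-form weak learner
        # of Section 5.5.2 fitted on each feature-induced subgraph.
        n, d = X.shape
        k = A01.sum(axis=1)
        two_m = k.sum()
        B = A01 - np.outer(k, k) / two_m          # global modularity coefficients
        F = np.zeros((n, n))                      # accumulated link potentials
        D = np.ones((n, n))                       # pair weights
        rng = np.random.default_rng(seed)
        block = max(1, int(eta * d))
        for _ in range(T):
            feats = rng.permutation(d)
            for start in range(0, d, block):
                p = feats[start:start + block]
                v = np.where(X[:, p].sum(axis=1) > 0)[0]   # active nodes
                if len(v) < 2:
                    continue
                Xp = X[np.ix_(v, p)]
                Bv = B[np.ix_(v, v)]
                Dv = D[np.ix_(v, v)]
                W = Xp.T @ (Bv * Dv) @ Xp
                fv = Xp @ W @ Xp.T
                norm = np.abs(fv).max()
                if norm > 0:
                    fv = fv / norm                         # scale into [-1, 1]
                M = Bv * fv
                Wp = (Dv * np.abs(M) * (M > 0)).sum()
                Wm = (Dv * np.abs(M) * (M < 0)).sum()
                alpha = 0.5 * np.log((Wp + 1e-12) / (Wm + 1e-12))
                F[np.ix_(v, v)] += alpha * fv
                D[np.ix_(v, v)] = Dv * np.exp(-alpha * Bv * fv)
        return F

    # Usage (toy data assumed): F = linkboost(X, A01, eta=0.2, T=5);
    # the largest entries of F among unlinked pairs are the predicted links.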
5.6 Experimental Evaluations

This section reports the results of experiments conducted on the proposed LinkBoost algorithm. Since link prediction is cost-sensitive in nature, we compare the algorithm against several baseline methods using the receiver operating characteristic (ROC) curve. The curve is obtained by computing the true positive and false positive rates while varying the threshold on the estimated link potentials between the node pairs. The link prediction model is built on the training set, while the ROC curves are plotted for the node pairs in the test set.

5.6.1 Baseline Algorithms

We compared the performance of LinkBoost against the following link prediction algorithms discussed in the related work section.

Link-Based: We used three link-based algorithms for link prediction: preferential attachment, Katz, and modularity. Preferential attachment estimates the link potential between a node pair as the product of their degrees. The Katz measure is defined as

    \mathrm{score}(i, j) = \sum_{l=1}^{\infty} \beta^l \, \big|\mathrm{path}_{ij}^{(l)}\big|    (5.23)

where |path_{ij}^{(l)}| is the number of paths of length l from node i to node j and 0 < β < 1 is a user parameter. A special variant is the truncated Katz measure, in which only a finite number of terms in the summation are considered; the number of terms is again a user-given parameter. The Katz measure is sensitive to both of these parameters. The modularity measure for link prediction is computed as follows. Let Sir be 1 if vertex i belongs to group r and zero otherwise. Then modularity maximization involves identifying an n × k matrix S with elements Sir such that the following is maximized:

    \max_{S} \ \mathrm{tr}\, \mathbf{S}^T \mathbf{B} \mathbf{S}    (5.24)

The problem (5.24) is NP-hard and is relaxed by letting S be any real matrix such that SᵀS = I. We then define f_comm(eij) = Σr Sir Sjr.

Attribute-Based: Here we use the well-known fixed-cost AdaBoost (AdaCost), where the cost parameters are set to C1 = 1 and C2 = 0.01. It is not possible to make the cost larger than 1, as the derivation of α assumes that |Mij| ≤ 1. Furthermore, only the cost ratio C1/C2 matters, not the absolute magnitudes. Similarly, the modularity matrix B is multiplied by a constant factor so that the magnitudes of its entries are less than 1. For LinkBoost, we set η = 0.05. The base learners used for both LinkBoost and fixed-cost AdaBoost are the same (see Section 5.5.2).

5.6.2 Data Sets for Inferring Missing Links

Here we consider the problem of inferring missing links from an incomplete network. We use two well-known citation networks¹ [89], the Cora and Citeseer data sets, for this experiment. In both data sets, we first make the graph undirected, then randomly suppress 30% of the links from the network and use them as the test set for predicting missing links.

Cora Data Set contains publications from the machine learning area, covering the following 7 subcategories: Case-Based Reasoning, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory. The data set we use contains 2708 nodes, 5429 directed links, and 1433 unique words. Each node corresponds to a paper and is characterized by a 0/1-valued vector indicating the absence/presence of the corresponding word in the title of the paper.

Citeseer Data Set consists of data from 3312 scientific publications, each labeled as one of 6 classes. The data set we created contains 4732 links and 3703 unique words.

¹ http://www.cs.umd.edu/projects/linqs/projects/lbc/index.html

5.6.3 Data Sets for Predicting Future Links

Here, we are given the network link structure and the nodal attributes at a particular time period. Our task is to predict the links formed between the given nodes at a future time.
[Figure 5.1: Proportion of within-community links (good links) as a function of the top-K value, for the Cora (top) and Citeseer (bottom) data sets, comparing Katz, PA, Modularity, LinkBoost, and AdaCost.]

Table 5.1: Link Prediction: The table shows the AUC (% improvement compared to LinkBoost) of predicted missing links and future links in each of the four data sets.

    Data           Method       Cora                    Citeseer                DBLP                    Wiki
    Link Only      Katz         0.72 (-16.70%)          0.63 (-41.30%)          0.61 (-17.50%)          0.86 (-2.20%)
                   PA           0.63 (-14.20%)          0.59 (-33.70%)          0.71 (-4.00%)           0.90 (+2.20%)
                   Modularity   0.67 (-20.00%)          0.60 (-32.50%)          0.63 (-14.86%)          0.64 (-27.27%)
    Link+Content   AdaCost      0.53 ± 0.09 (-36.90%)   0.54 ± 0.12 (-39.32%)   0.56 ± 0.2 (-24.32%)    < 0.50
                   LinkBoost    0.84 ± 0.025            0.89 ± 0.063            0.74 ± 0.18             0.88 ± 0.14

DBLP Data Set contains all the computer science articles² from the proceedings of 28 conferences related to machine learning, data mining, and databases from 1997 to 2006. The training set consists of all publications from 1997-2000 and the test set contains all publications from 2001-2004. There are 9252 nodes in the training set, with 9136 nodal attributes. There are 21,107 links in the training set and only 6679 links in the test set.

Wikipedia Data Set is a web page network crawled from the Wikipedia web site by Kossinets³. The data set contains the edit history of all pages in Wikipedia from its inception until January 2008. We examined the user-user interaction network (user talk pages). The user interactions in the first 6 months of 2004 are taken as the training set and those in the next 6 months as the test set. There are 8178 users and 24891 features per user.

² http://dblp.uni-trier.de/
³ G. Kossinets. Processed Wikipedia Edit History. Stanford large network dataset collection.

5.6.4 Links Within Community

First, we evaluated the performance of the LinkBoost algorithm in terms of its ability to predict links within a community. For both the Cora and Citeseer data sets, we use the ground-truth community labels to verify the proportion of links formed within a community for each of the link prediction algorithms. We sort the link potentials and declare the top-K largest link potentials as the possible missing or future links. Figure 5.1 plots the proportion of within-community links, or good links, as a function of the top-K value. Clearly, the LinkBoost algorithm outperforms both the modularity and Katz measures, validating the claim that our algorithm indeed strives to identify links within a community. The proportion of good links identified by modularity and Katz is quite high for smaller values of top-K but falls significantly for larger top-K values.

[Figure 5.2: ROC curves comparing the performance of different link prediction algorithms on the DBLP (top) and Wiki (bottom) data sets.]

[Figure 5.3: ROC curves comparing the performance of different link prediction algorithms on the Cora (top) and Citeseer (bottom) data sets.]
5.6.5 Missing and Future Links

Next, we evaluate the performance of LinkBoost on the missing and future link prediction problems. The ROC curves are shown in Figures 5.2 and 5.3. First, notice that the AUC of LinkBoost is consistently higher than that of the modularity measure. As mentioned earlier, modularity utilizes only the network link structure, whereas boosting makes use of both the link and the content information, resulting in superior performance. LinkBoost consistently outperforms the fixed-cost AdaBoost as well, highlighting the importance of the proposed variable-cost structure. LinkBoost outperforms the Katz measure on both the Cora and Citeseer citation networks. The Katz measure performs better than the fixed-cost AdaBoost on the DBLP network and as well as LinkBoost on the Wikipedia network; however, it is sensitive to the choice of parameter settings. In this chapter, we report the results based on the parameters that best fit the test set. Finally, LinkBoost outperforms the preferential attachment measure on both citation networks, while its performance is comparable to preferential attachment on the DBLP and Wiki networks. Specifically, LinkBoost is slightly better than preferential attachment on the DBLP network and slightly worse on the Wikipedia network. This is because the preferential attachment algorithm is based on the premise that the rich get richer: we suspect that the user network in Wikipedia exhibits preferential attachment characteristics, with a few authoritative users communicating with a large number of other users. The average AUC for LinkBoost is 0.88 and for preferential attachment is 0.90.

5.6.6 Low Degree Nodes

In this section, we demonstrate the ability of the proposed method to identify the links formed between low-degree nodes in the citation networks. A node with degree less than 2 is considered a low-degree node. We compute the models on the training set and estimate the ROC curves for the subgraph consisting of low-degree nodes. The results are plotted in Figure 5.4. As expected, the preferential attachment measure underperforms, as it ranks the high-degree nodes ahead of the low-degree nodes. The proposed LinkBoost, with its degree-sensitive loss function, overcomes this problem.

5.7 Summary

In this chapter, we have given a new direction for the supervised link prediction problem in large sparse networks. We have proposed a new degree-dependent cost function and have shown that minimizing the associated risk leads to modular link prediction, in which more links are predicted within a community. Such a cost function addresses both the skewness in the class distribution and the skewness in the nodal degrees. The proposed algorithm is scalable and easy to implement. Experimental evaluations show the superior performance of the proposed method over existing supervised and unsupervised methods. The proposed method is especially effective in predicting missing links for low-degree nodes. For future work, we plan to investigate methods for estimating optimal cost parameters and alternative ways of creating the weak learners used in the LinkBoost formulation.

[Figure 5.4: ROC curves comparing the performance of different missing link prediction algorithms on the subgraph induced by the low-degree nodes in the citation networks (Cora, top; Citeseer, bottom).]
94 Chapter 6 Crowdsourcing for Network Mining In the previous chapters, we have presented multi-source and multi-task network mining frameworks for both supervised and unsupervised learning tasks. A key requirement for developing supervised learning algorithms is the availability of trustworthy label information for the given mining task. The quality of the labeled data often affects the performance of the learning algorithm. In this chapter, we present a framework that engages the services of crowdsourcing technology in order to acquire (or augment) the label information for the network data. First, we briefly discuss the role of crowdsourcing in aiding the label acquisition process. Next, we highlight the challenges that are present in acquiring label information for network data. Finally, we present a novel approach which transforms the given network data into an image corpus for labeling by the crowd. Crowdsourcing [23] is an emerging technology where a group of human workers, some of whom might be unskilled, are employed to solve a certain task that cannot be automatically and/or reliably solved by computers. For example, image annotation is a task that involves categorizing individual images in a corpus into certain pre-defined categories. It is not always possible to automate this task using computers and often times, humans can perform the task more accurately than computers. The key challenge for harnessing the power of the crowd lies in converting the problem at hand into a simpler task that can be handled by humans with great ease and speed. Such tasks are called Human Intelligence Tasks or HITs. For example, in the image annotation problem, the individual images constitute a HIT, which 95 are displayed to workers in order to elicit their label information. In the past, researchers have successfully employed the crowd to annotate data for search and retrieval problems [3, 67, 81, 93]. In addition, the power of the crowd has been harnessed to perform machine learning tasks such as clustering and classification [37, 45, 113]. One advantage of utilizing the services of crowdsourcing is that once the HITs are designed, they can be solved even by low skilled human workers (without any domain knowledge or expertise) for a menial payment. Therefore, the valuable time of domain experts can be spared from performing cumbersome data labeling task and focused on analyzing the data instead. The goal of this chapter is to design simple and easy to use HITs that can be presented to the crowd for acquiring label information in the network data. Unlike the image annotation problem, designing HITs for network data is more challenging as the raw network data does not easily lend itself to be presented to the crowd. Also, there are inherent privacy concerns when presenting social network data to third party workers. Therefore one has to design HITs that are simple and intuitive for average humans to act upon and at the same time do not disclose the original data in any way to the workers. To overcome this challenge, we present a data transformation approach whereby the network data is initially transformed into images so that it can visualized and subsequently classified by the crowd. 6.1 Transforming Network Data into Images Data transformation is a preprocessing procedure for converting an input pattern into a suitable representation before we apply supervised and unsupervised learning algorithms. 
Traditional approaches are mostly designed for in-domain data transformation, which means they simply manipulate the input space in such a way that the transformed space is more 96 aligned with the requirements of the learning algorithm. Such approaches would typically project the data into a low dimensional manifold (in order to remove noise as well as to eliminate redundant and irrelevant features) or map the data into a higher dimensional space (to extend the feature representation and enable the use of linear classifiers to discern patterns that belong to different classes). The in-domain transformation approaches do not bring any new information that was not already present in the original data. Figure 6.1: A toy example consisting of target network data (A) and source handwritten digit data (B). We map each distinct labeled target node to a unique source image and learn the transformation between target and source data. When this transformation is applied on all the target nodes, we get the transformed target data (C). The blurry images are interpreted in Section 6.1. Unlike previous research, this work investigates an out-of-domain data transformation approach that enables the use of crowdsourcing technology for network mining problems. Specifically, the transformation involves two unrelated domains (source and target), each having a unique set of attributes, classes, and probability distributions. The target is the domain in which the desired classification task is to be performed but has limited labeled data, whereas the source serves as an auxiliary data source for which labels are already available or can be easily acquired even from non-experts. For this study, the source corresponds to a collection of labeled images whereas the target is a social network. Since the domains 97 are unrelated, a key challenge is to learn a proper transformation function that maps each labeled example in the target domain to its corresponding “surrogate” in the source domain. To illustrate the proposed approach, consider the toy example shown in Figure 6.1. Here the source domain is a collection of hand written digit images corresponding to digits 1, 2, 3 and 4 (see Figure 6.1B). The target domain, is a network data with four distinct communities denoted as A, B, C and D, as shown in Figure 6.1A. The circles and lines are the nodes and links, respectively. Solid circles correspond to labeled nodes, whereas unfilled circles represent the unlabeled ones. The solid and dotted lines represent within community and between community links, respectively. We map each distinct labeled node to one of the labeled images and learn their corresponding transformation matrix. The mapping is done in such a way that nodes from communities A, B, C and D are mapped to distinct images corresponding to digits 1, 2, 3 and 4, respectively. The transformation is then applied to the unlabeled examples in the target domain to generate their corresponding images for subsequent labeling by the crowd. Figure 6.1C gives a pictorial representation of the transformed node representation in the image space. A good transformation may help reveal certain aspects of the underlying network. For example, the node C2 in community C has links to nodes outside of its community. After the transformation, even though the majority of the images associated with the nodes in community C are mapped to the digit 3, the image for node C2 is harder to discern because it lies at the border with other communities. 
Furthermore, the node D1, which has more links to nodes in community B than to those in its own community, is transformed into an image that resembles digit 2 more than digit 4. This example suggests that, although the feature space has completely changed, the transformed images should still retain useful information that helps provide insights into the latent structure of the network data. 98 A key question that remains to be answered is whether it is always possible to find a proper transformation that effectively maps nodes in a social network to images that can be easily discerned by humans. To help answer this question, we analyze the reconstruction error of the transformation and examine its relationship to the ranks of data matrices in the source and target domains. An alternating least square method is presented to learn a transformation matrix that minimizes the reconstruction error. Using network data from two real-world domains, we empirically showed the effectiveness of the framework in providing labels to solve a variety of network learning tasks including link prediction and node classification. The remainder of this chapter is organized as follows. Section 6.2 presents the notations used in this chapter and defines the data transformation problem for crowdsourcing social networks. In Section 6.3, we introduce our proposed framework for augmenting training data in the target domain with newly acquired labeled data from the source domain. Section 6.4 describes the detailed methodology for learning the transformation. We also discuss the conditions under which an exact transformation can be found. Experimental results are given in Section 6.5. Finally, we conclude with a summary of the work and suggestions for future research. 6.2 Preliminaries Let S and T denote the source and target domains. Throughout this chapter, we assume the source is an image corpus for which labels can be easily acquired from the crowd (human workers) whereas the target is a social network for which obtaining reliable labels is expensive. Both domains are assumed to have their own training examples. We further assume that 99 the classifier in the target domain can be improved if more labeled examples are augmented into the training set. Let X(s) be an ns × ds data matrix for the source domain and Y(s) be its corresponding ns × cs class membership matrix, where ns is the number of labeled examples, ds is the s number of attributes, and cs is the number of classes. Each element yij in the matrix Y(s) (s) is equal to 1 if the labeled example xi belongs to class j and zero otherwise. Similarly, let X(t) = [X(tl) ; X(tu) ] denote an (nt + r) × dt data matrix for the target domain and Y(t) be its corresponding nt × ct class membership matrix, where nt is the number of labeled examples, r is the number of unlabeled examples, dt is the number of attributes, and ct is the number of classes in the target domain. For brevity, we assume cs = ct = c and ns ≫ nt . Our goal is to learn a dt × ds transformation matrix U that effectively maps each labeled target example to a surrogate example in the source domain in such a way that preserves the distance and label information of the two data sets as much as possible. For example, in Figure 6.1, we seek a transformation matrix U that maps all the nodes from community A to images containing the handwritten digit 1, those from community B to images containing the digit 2, and so on. 
After learning the transformation, any unlabeled node can be mapped to its corresponding image by applying the matrix U to its node attributes. The transformed images can be labeled by human workers and the newly acquired labels can be combined with the original labeled data to train a better classifier. 6.3 Proposed Framework This section presents an overview of our proposed framework for augmenting training data in the target domain with labeled examples acquired from the source domain. The framework 100 consists of the following tasks: 1. Surrogate Mapping. Given the labeled examples in the source and target domains, (X(s) , Y(s) ) and (X(tl) , Y(tl) ), we need to learn a transformation matrix U that maps (tl) each xi (s) ∈ X(tl) to its surrogate xj ∈ X(s) . The mapping should be done in such a way that instances of a particular label in the target domain are mapped to instances of a fixed label in the source domain. 2. Surrogate Labeling. The transformation matrix U will be applied to unlabeled data ˆ in the target domain X(tu) to generate new surrogates X(su) for labeling by human workers. Since each surrogate can be labeled by more than one workers, a consensus on the class label must be made for each target instance (e.g., by taking a majority vote on the class labels). Let Y(tu) denote the consensus labels obtained for the unlabeled target instances. 3. Model Building. The newly labeled target examples (Xtu , Ytu ) are merged with the original training data. A classifier will be trained on the extended training data to generate a new predictive model for the target domain. The key challenge is to develop an effective and efficient algorithm for learning the transformation matrix U. We describe the details of the algorithm in the next section. 6.4 Surrogate Mapping We cast the surrogate mapping task into a constrained optimization problem. First, we need to design an objective function that assigns each target example to a unique source example satisfying the label consistency requirement as well as minimizing the reconstruction error. 101 In order to do this, we define an nt × ns surrogate selection matrix P, such that Pij = 1 (s) (t) indicates that the source example xj is the surrogate for target example xi . The objective function for the surrogate mapping task is given below: min ∥ PX(s) − X(tl) U ∥2 + ∥ PY(s) QT − Y(tl) ∥2 F F U,P,Q ∀i, j : Pij ∈ {0, 1}, P1ns = 1nt , ∀i, j : Qij ∈ {0, 1}, s.t. Q1cs = 1ct , where ∥ · ∥F denote the Frobenius norm and 1d is a d-dimensional column vector of all ones. The constraints ensure that the elements of the matrix P are binary-valued and that each target example is mapped to exactly one source example. The first term in the objective function is a measure of reconstruction error when mapping the target examples into instances of the source domain. The second term in the objective function ensures consistency of the class labels, i.e., labeled examples of a particular class in the target domain are mapped only to source examples of the same class. The label matching matrix Q is a cs × ct binary-valued matrix that represents the mapping between the class labels of the source and target domains. Assuming cs = ct = c and since the classes in the source and target domains are often unrelated, we found it is sufficient to assign Q to be an identity matrix when performing our experiments. A more careful selection of Q would require considerations of the within-class and between-class variability of the source and target examples. 
The constraint on the matrix P allows a many-to-one assignment between the target and the source domains. This is essential because any two target examples that are located close to each other and belong to the same class should be mapped by the surrogate selection matrix to source examples that are also close to each other. If no such pair of source examples can be found, it is better to map the two target examples to the same surrogate, as long as they are both from the same class.

6.4.1 Reconstruction Error Analysis

Our proposed framework considers a linear transformation for mapping the target examples into their corresponding surrogates in the source domain. When the source domain is an image corpus, one concern is whether the transformation can produce images that can be easily discerned by humans. One way to measure the quality of the transformation is to evaluate the reconstruction error of the surrogates selected for the target examples. Let X̂^(s) = X^(tl) U be the transformed images of the target examples and Z^(s) = P X^(s) be the selected surrogates. Given U and P, we consider the transformation to be ϵ-proper if the reconstruction error ∥Z^(s) − X̂^(s)∥_F^2 ≤ ϵ and exact if ∥Z^(s) − X̂^(s)∥_F^2 = 0. The transformed images of the target examples are expected to be as easily discernible as the original source images if ϵ is small.

A key question is whether it is possible to construct a transformation matrix U with low reconstruction error given the source and target data matrices X^(s) and X^(tl). To determine the condition under which a low reconstruction error can be obtained, assume the surrogate selection matrix P is known. The reconstruction error can be written as

\[
\| Z - X^{(tl)} U \|_F^2 \tag{6.1}
\]

whose minimizing solution is given by

\[
U = \left( X^{(tl)T} X^{(tl)} \right)^{-1} X^{(tl)T} Z \tag{6.2}
\]

A unique solution exists only if the covariance matrix X^(tl)T X^(tl) has full rank. Otherwise, we need to obtain a rank-reduced approximation of X^(tl) using techniques such as singular value decomposition (SVD). More importantly, the reconstruction error of the transformation can be assessed in terms of the ranks of the data matrices.

Proposition 1 Let A be an m × n matrix and B be an n × k matrix. If r(A), r(B), and r(AB) denote the ranks of the matrices A, B, and AB, respectively, then it can be shown that [79]

\[
r(AB) \le \min[r(A), r(B)] \tag{6.3}
\]

Since X̂^(s) = X^(tl) U, this proposition implies r(X̂^(s)) ≤ r(X^(tl)). Thus, if the rank of the target data matrix is considerably lower than that of the original source data matrix, then the transformed images will have a lower rank than the original source images, which in turn may lead to large reconstruction errors. In other words, an exact or low-ϵ-proper transformation is infeasible if r(X^(tl)) ≪ r(X^(s)).
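The sketch below computes the closed-form solution of Equation (6.2) for a fixed surrogate selection and compares the reconstruction error with the ranks discussed in Proposition 1. It is an illustrative NumPy implementation, not the dissertation's original code; np.linalg.lstsq is used so that rank-deficient cases fall back to a minimum-norm solution.

    import numpy as np

    def fit_transformation(X_tl, Z):
        """Least-squares solution U = (X^T X)^{-1} X^T Z of Eq. (6.2) for given surrogates Z."""
        U, _, rank, _ = np.linalg.lstsq(X_tl, Z, rcond=None)
        return U, rank

    # Z = P X^(s) holds the surrogate images selected for the labeled target examples.
    # U, rank_tl = fit_transformation(X_tl, Z)
    # X_hat = X_tl @ U                                   # transformed target examples
    # err = np.linalg.norm(Z - X_hat, 'fro') ** 2        # reconstruction error of Eq. (6.1)
    # By Proposition 1, np.linalg.matrix_rank(X_hat) <= np.linalg.matrix_rank(X_tl), so a
    # small error is unlikely when rank(X_tl) is much smaller than rank(X_s).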
We illustrate this with an example in Figure 6.2. Here we consider the well-known Iris data, consisting of 150 labeled examples belonging to three distinct categories (Iris versicolor, Iris virginica, and Iris setosa), as our target domain. Each category contains 50 examples, which are matched against 50 handwritten images of size 28 × 28 containing the digits 1, 2, or 3. The rank of the data matrix for the handwritten images is 150 (equal to its number of rows), which is much higher than the rank of the Iris data, which is 4 (its number of columns). According to Proposition 1, the rank of the transformed images for the Iris data is therefore at most 4, considerably lower than the rank of the original handwritten images. This suggests that the reconstruction error for the Iris data using the handwritten images is likely to be high. The top right panel of Figure 6.2 shows the transformed images for 30 selected examples from the Iris data. Even though all the target examples in the Iris versicolor class were mapped to surrogate images containing the digit 2, the transformed images look noisy and do not resemble the digit 2. Instead, they look like a mixture of the digits 2 and 3, because it is hard to distinguish target examples belonging to the Iris versicolor class from those belonging to the Iris virginica class. However, if we increase the dimensionality of the Iris data from 4 to 12 (by adding quadratic and cubic terms for each of the 4 original features), the reconstruction error is reduced significantly, especially for the images that correspond to the Iris versicolor class (see the bottom right panel of Figure 6.2). In particular, if we project the target data to a 150-dimensional feature space (using higher-degree polynomials), an exact transformation matrix U can be obtained.

Figure 6.2: The left panel shows the distribution of the three classes (in three colors) with respect to the first and second principal components. The top and bottom right panels show the transformed data using 4 and 12 features, respectively. The blurry images result from poor transformation quality, as discussed in Section 6.4.1.

Proposition 2 If Z^(s) ∈ ℜ^(n_t × d_s), X^(tl) ∈ ℜ^(n_t × d_t), and r(X^(tl)) ≥ r(X^(s)) (where X^(tl) is a full column rank matrix), then there exists a transformation matrix U such that Z^(s) = X^(tl) U.

The key lesson here is that it is preferable to have source data whose rank is smaller than that of the target data. There are two ways to achieve this. First, we can reduce the rank of the source data by applying SVD. The drawback here is that the source data is typically an image whose contents are manually evaluated by humans (the crowd), and the rank reduction may damage the visual clarity of the source images. Alternatively, we can increase the rank of the target data by adding features that correspond to higher-order polynomials of the target attributes. In fact, the objective function of our proposed surrogate mapping framework can be extended to a nonlinear transformation using the kernel trick (see the supplemental material for the derivation). This allows us to project the target data to a high-dimensional (possibly infinite-dimensional) space, thereby making it compatible with source data of higher rank.
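To illustrate the rank-augmentation idea concretely, the sketch below expands a low-dimensional target matrix with quadratic and cubic terms, as described above, and checks how the rank grows. The data and variable names are illustrative placeholders.

    import numpy as np

    # Stand-in for the labeled target matrix, e.g. the 150 x 4 Iris data.
    X_tl = np.random.rand(150, 4)
    print(np.linalg.matrix_rank(X_tl))                  # at most 4 (number of columns)

    # Append quadratic and cubic terms of each original feature: 4 -> 12 attributes.
    # This raises the attainable rank of X^(tl), which by Proposition 2 is what makes
    # a low-error (or exact) transformation feasible.
    X_aug = np.hstack([X_tl, X_tl ** 2, X_tl ** 3])
    print(X_aug.shape, np.linalg.matrix_rank(X_aug))    # (150, 12), typically rank 12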
6.4.2 Parameter Estimation

This section presents our approach for estimating the parameters P and U in the constrained optimization problem stated in Section 6.4. We also provide a proof of convergence for the proposed algorithm. First, note that the objective function is non-convex with respect to P and U jointly. However, for a fixed P it is convex with respect to U, and vice versa. Thus, we employ the well-known alternating least squares method to solve the optimization problem.

We begin with an initial surrogate selection matrix P^0 satisfying the label compatibility criterion, namely P^0 Y^(s) Q = Y^(tl) (assuming Q = I, as previously discussed). We then estimate the transformation matrix U^0 using (6.2). At iteration k, we estimate P^k based on the previous estimate U^(k−1) as follows. For each target example x_i^(tl), we compute its transformed image x̂_i^(tl) = x_i^(tl)T U^(k−1) and match it to the surrogate example x_{k_i}^(s) that minimizes the reconstruction error:

\[
k_i = \operatorname*{argmin}_{j : \, y_j^{(s)} = y_i^{(tl)}} \; \| x_j^{(s)} - \hat{x}_i^{(tl)} \|^2 \tag{6.4}
\]

We then set P^k_{i,k_i} = 1 and P^k_{i,j} = 0 for all j ≠ k_i. The condition y_j^(s) = y_i^(tl) imposes the label compatibility requirement on the selected surrogate.

Algorithm 3 Surrogate Mapping Algorithm
1: Input: source and target data X^(s), X^(tl)
2: Output: transformation matrix U
3: Initialize: k = 0; initialize P^0 satisfying P^0 Y^(s) = Y^(tl)
4: repeat
5:    k = k + 1
6:    Learn the transformation U^(k−1) = argmin_U ∥ P^(k−1) X^(s) − X^(tl) U ∥^2
7:    Compute the transformed images x̂_i^(tl) = x_i^(tl)T U^(k−1) for i = 1, 2, ..., n_t
8:    Set P^k = 0
9:    for i = 1 to n_t do
10:      k_i = argmin_{j : y_j^(s) = y_i^(tl)} ∥ x_j^(s) − x̂_i^(tl) ∥^2
11:      P^k_{i,k_i} = 1
12:    end for
13: until P^(k−1) = P^k
14: return U^(k−1)

Lemma 1 Let X^(tl) be a full column rank matrix. Then the objective function (6.1) is monotonically non-increasing as the number of iterations k increases in Algorithm 3.

Proof 1 To prove the result, we need to show the following inequality:

\[
E[P^k, U^k] \le E[P^k, U^{k-1}] \le E[P^{k-1}, U^{k-1}], \tag{6.5}
\]

where E[P^k, U^k] is the reconstruction error after the k-th iteration. Given U^(k−1), the new selection matrix P^k satisfies the condition that, for all i, P^k_{i,k_i} = 1 implies

\[
\| x_{k_i}^{(s)} - x_i^{(tl)T} U^{k-1} \| \le \| x_j^{(s)} - x_i^{(tl)T} U^{k-1} \| \quad \forall j \ne k_i \tag{6.6}
\]

The preceding inequality follows from the algorithm step given in Equation (6.4). Now suppose that E[P^k, U^(k−1)] > E[P^(k−1), U^(k−1)]. Then, by the definition of the reconstruction error given in (6.1),

\[
\| P^k X^{(s)} - X^{(tl)} U^{k-1} \|_F^2 > \| P^{k-1} X^{(s)} - X^{(tl)} U^{k-1} \|_F^2
\]
\[
\Rightarrow \;\; \sum_i \big[ x_{k_i}^{(s)} - x_i^{(tl)T} U^{k-1} \big]^2 > \sum_i \big[ x_{(k-1)_i}^{(s)} - x_i^{(tl)T} U^{k-1} \big]^2 \tag{6.7}
\]

Since each term in the sum is non-negative, there must exist an index i such that

\[
\big[ x_{k_i}^{(s)} - x_i^{(tl)T} U^{k-1} \big]^2 > \big[ x_{(k-1)_i}^{(s)} - x_i^{(tl)T} U^{k-1} \big]^2 \tag{6.8}
\]

which contradicts the condition given in (6.6). Thus, the original assumption must be false, which means E[P^k, U^(k−1)] ≤ E[P^(k−1), U^(k−1)]. The remaining inequality, E[P^k, U^k] ≤ E[P^k, U^(k−1)], follows from line 6 of Algorithm 3 (see Equation (6.2)). Since X^(tl) is a full column rank matrix, U^k exists for every k and is unique.

Theorem 3 Let X^(tl) be a full column rank matrix. Then Algorithm 3 terminates after a finite number of iterations.

Proof 2 Lemma 1 establishes that the objective function (6.1) is monotonically non-increasing when applying Algorithm 3. At each iteration, the algorithm re-estimates the selection matrix P and subsequently finds a corresponding optimal transformation matrix U. Since there are only finitely many possible selection matrices P, in the worst case the algorithm converges after considering all of them. It remains to show that the algorithm does not enter an infinite loop; for this, it is sufficient to show that the algorithm does not produce the same selection matrix P more than once, except in the last iteration. Suppose this is not true: let P^t, U^t be the selection matrix and transformation matrix obtained at the end of iteration t, and let P^(t+1) be the new selection matrix estimated using U^t such that P^(t+1) = P^i for some i < t, i ≠ t. Since the linear transformation U is unique for a given selection matrix, we have

\[
U^{t+1} = \operatorname*{argmin}_U \| P^{t+1} X^{(s)} - X^{(tl)} U \|^2 = \operatorname*{argmin}_U \| P^{i} X^{(s)} - X^{(tl)} U \|^2 = U^{i} \tag{6.9}
\]

However, this results in the following contradiction:

\[
E[P^i, U^i] < E[P^t, U^t] \le E[P^{t+1}, U^{t+1}] = E[P^i, U^i] \tag{6.10}
\]

Therefore, our assumption i ≠ t is wrong. In other words, P^(t+1) = P^i can only happen when i = t, i.e., in the last iteration, at which point the algorithm is said to have terminated.
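A possible NumPy realization of Algorithm 3 is sketched below, assuming Q = I and storing the selection matrix P as an index vector sel with sel[i] = k_i. The function and variable names are illustrative, and ties in the argmin are broken arbitrarily.

    import numpy as np

    def surrogate_mapping(X_s, y_s, X_tl, y_tl, max_iter=50):
        """Alternating least squares for surrogate mapping (sketch of Algorithm 3)."""
        n_t = X_tl.shape[0]
        # P^0: a random label-compatible assignment of each target example to a source example.
        sel = np.array([np.random.choice(np.where(y_s == y)[0]) for y in y_tl])
        for _ in range(max_iter):
            Z = X_s[sel]                                        # P^{k-1} X^(s)
            U, _, _, _ = np.linalg.lstsq(X_tl, Z, rcond=None)   # U = argmin || Z - X^(tl) U ||^2
            X_hat = X_tl @ U                                    # transformed target examples
            new_sel = np.empty(n_t, dtype=int)
            for i in range(n_t):
                cand = np.where(y_s == y_tl[i])[0]              # label-compatible surrogates
                dist = np.linalg.norm(X_s[cand] - X_hat[i], axis=1)
                new_sel[i] = cand[np.argmin(dist)]              # Eq. (6.4)
            if np.array_equal(new_sel, sel):                    # P^{k-1} = P^k: converged
                return U, sel
            sel = new_sel
        return U, sel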
6.5 Experimental Evaluations

We performed experiments using both synthetic and real-world data to demonstrate the effectiveness of the proposed data transformation technique in enabling the use of crowdsourcing technology for labeling network data. For this purpose, we selected a grey-scale image corpus of handwritten digits [51] as the source domain. The corpus contains roughly 5,000 images for each of the digits 0 through 9. Each image is of size 28 × 28 and is represented as a one-dimensional vector of length 784. We used two sparse data sets as target domains: a social network of Wikipedia editors and a sample collection of Wikipedia articles. A detailed description of these two data sets is given below.

• Wikipedia Editor Interactions: We sampled articles from Wikipedia on two topics, computer science and natural science. We then took all the editors who worked on at least 20 of the sampled articles and recorded the interactions between them from their respective Wikipedia User::Talk pages. We created two editor networks, one for each of the two topics. In each network, the set of editor interactions (linked node pairs) and non-interactions (non-linked node pairs) was equi-partitioned to create a TRAIN set and a TEST set; therefore, each unique editor pair is present in either the TRAIN set or the TEST set. The goal is to learn the presence or absence of interactions between editor pairs from the TRAIN set and predict the interactions between the editors in the TEST set.

• Biology Article Corpus: We sampled Wikipedia articles belonging to four related topics in biology: genetics, zoology, anatomy, and cell biology. Here, instead of the network link information, we used the text data (the words in each article) to classify the articles into one of the four categories. The TRAIN set consists of 2,000 articles with 500 articles per topic, and the TEST set consists of 800 articles with 200 articles per topic. There are 6,040 words in the TRAIN set, which were used as features. The goal is to obtain correct labels for the TEST articles without exposing the article content to the crowd.

6.5.1 Evaluation Methodology and Baseline

Our goal is to learn the labels for the TEST data points as accurately as possible. As a baseline, we trained a support vector machine (SVM) on the TRAIN set and used it to predict the labels for the points in the TEST set. In the proposed approach, we first learn the transformation matrix U between the labeled TRAIN set and the source domain (the set of digit images). We then apply the transformation to the unlabeled points in the TEST set, so that the transformed TEST data resides in the source domain (as handwritten digit images). We utilized the services of the crowd (non-domain-expert human workers) to manually label these surrogate digit images (the transformed test data) with the appropriate digit category. If an image is not well formed or appears visually cluttered, the worker flags the data point (image) as noise. We combine the evaluations of all the workers (the individual members of the crowd) by taking a simple majority vote. The crowd-based approach thus has the flexibility of rejecting a data point as noise, which uniquely partitions the TEST set into two groups: good images (TEST_G) and noisy images (TEST_N). The crowd's performance can only be reported on the respective TEST_G set, so it is not fair to compare it directly against the baseline. To address this problem, we performed an additional model-building step that generates labels for the entire TEST set. It is a semi-supervised approach in which we trained an SVM on both the TRAIN set and TEST_G (the subset of TEST in which every data point was assigned a unique label by the crowd's majority vote) and used it to predict labels for the entire TEST set. We denote this approach by Crowd + SVM.
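A minimal sketch of the consensus-and-retrain step just described is given below. The majority-vote rule, the handling of the noise flag, and the use of scikit-learn's SVC are illustrative assumptions rather than the exact implementation used in the experiments.

    import numpy as np
    from collections import Counter
    from sklearn.svm import SVC

    def majority_vote(worker_labels, noise_flag='noise'):
        """Consensus label per TEST image; None when no label wins a strict majority."""
        consensus = []
        for votes in worker_labels:                       # votes: one label per worker
            label, count = Counter(votes).most_common(1)[0]
            ok = count > len(votes) / 2 and label != noise_flag
            consensus.append(label if ok else None)
        return consensus

    # y_crowd = majority_vote(worker_labels)
    # good = [i for i, y in enumerate(y_crowd) if y is not None]       # indices of TEST_G
    # X_aug = np.vstack([X_train, X_test[good]])                       # TRAIN plus TEST_G
    # y_aug = np.concatenate([y_train, [y_crowd[i] for i in good]])
    # crowd_svm = SVC().fit(X_aug, y_aug)                              # the Crowd + SVM model
    # y_pred = crowd_svm.predict(X_test)                               # labels for the entire TEST set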
6.5.2 Synthetic Data

We used synthetic data to study the ability of the proposed out-of-domain mapping algorithm to reveal the latent structure present in the target data. In particular, we examined its ability to distinguish boundary points from non-boundary points in a simple binary classification problem. To this end, we generated 100 random samples each from two normal distributions, N(μ = 1, σ = 1) and N(μ = 3, σ = 1). We converted the one-dimensional data to 50-dimensional data by appending polynomials of the feature from degree 1 through 50. Samples from the former distribution are mapped to images of the digit 0 and samples from the latter are mapped to images of the digit 1. Figure 6.3 depicts the two normal distributions from which we sampled the 200 points.

Figure 6.3: Plot of the two normal distributions used in generating the synthetic data.

Figure 6.4 shows the transformed data (image representation) for all 200 sample points. The target domain consists of real numbers from the interval [−3, 5], and for each data point the actual sample value is printed above its image. The data points belonging to class 0 (N(μ = 1, σ = 1)) are marked by a ∗ sign. For the samples from this distribution, all points in the interval [−2, 1.5] were mapped to the digit image 0 without any clutter or noise. Similarly, for the samples from class 1 (N(μ = 3, σ = 1)), all points in the interval [2.5, 5] were mapped to the digit image 1 without any noise. However, the sample points from the interval [1.5, 2.4] were mapped to images that contain both digit 0 and digit 1. This interval marks the overlapping or boundary region between the two distributions, and the transformed images for points from this region resemble the symbol Φ. The effect persists in higher data dimensions, where the visual representation of data points in the boundary region encapsulates the characteristics of all the classes that share the boundary. This is further illustrated in the next section on a real-world example.

Figure 6.4: The transformed data from the two normal distributions. The number above each image is the actual sample value and is shown for visual reference only; the interpretation of the images is discussed in Section 6.5.2.
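The synthetic data described above can be reproduced with a few lines of NumPy. The polynomial expansion to 50 dimensions follows the setup in the text, while the random seed and variable names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)                          # illustrative seed
    x0 = rng.normal(1.0, 1.0, 100)                          # class 0: N(mu=1, sigma=1)
    x1 = rng.normal(3.0, 1.0, 100)                          # class 1: N(mu=3, sigma=1)
    x = np.concatenate([x0, x1])
    y = np.concatenate([np.zeros(100), np.ones(100)])       # mapped to digits 0 and 1

    # Lift the scalar samples to 50 dimensions using polynomial features of degree 1..50.
    X_target = np.vstack([x ** d for d in range(1, 51)]).T  # shape (200, 50)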
6.5.3 Biology Article Corpus

On this corpus we experimentally validated the propositions, lemma, and theorem proved in this chapter. We mapped the articles belonging to the four categories genetics, zoology, anatomy, and cell biology to the digit images 1, 2, 3, and 4, respectively. There are 500 articles in each category, and we randomly matched each article against a unique image from the appropriate digit category; this is done by the initial selection matrix P^0. Since the final surrogate mapping depends on the initial selection matrix P^0, we performed the experiment for 10 different random initializations of P^0 and chose the transformation that gave the minimum reconstruction error.

To validate Lemma 1, we plot the reconstruction error obtained at each iteration. This is shown in Figure 6.5 (left) for each of the 10 random initializations. As proved in Lemma 1, the reconstruction error decreases monotonically with each iteration for all 10 random initializations of P^0. We also observed that, for each random initialization, the reconstruction error converged to a local minimum, albeit a different one for each initialization, thus validating Theorem 3. As mentioned earlier, the reconstruction error can be analyzed with respect to the ranks of the source and target data matrices. Figure 6.5 (right) shows the decreasing trend of the rank of the re-sampled source matrix P^t X^(s) with each iteration. It should be noted that the rank of P^t X^(s) does not always decrease monotonically; the overall decrease can be attributed to the fact that at each iteration the algorithm performs a many-to-one assignment from target to source examples, effectively reducing the row rank of the matrix P^t X^(s). Notice that the minimum error across the 10 random runs is obtained when the rank of P^t X^(s) falls below the rank of the target data; however, that alone is not a sufficient condition for decreasing the reconstruction error. In Figure 6.5, the error curves marked by the legends ◦ and ♢ correspond to the condition where the rank of P^t X^(s) has fallen below the rank of the target data.

Figure 6.5: Wikipedia article corpus. The left panel shows the decreasing error with each iteration for 10 random initializations. The right panel shows that the rank of the re-sampled source data generally decreases with successive iterations; the lowest error is achieved when the rank of the source falls below the target rank.

The surrogate images generated by the transformation with the minimum reconstruction error were then presented to a crowd of three workers. On this data set, each of the three workers manually labeled all 800 TEST images. Each worker categorized an image into one of the four digit categories (1–4); if a surrogate image was cluttered, dark, or contained more than one digit, it was labeled as noise. On average, each worker labeled 90 TEST data points, or about 11% of the TEST set, as noise. The classification accuracy of the three independent workers on the entire TEST set was 74%, 72%, and 73%. For the baseline, we trained an SVM on the TRAIN set using a radial basis function kernel with σ = 0.1; the SVM gave an overall accuracy of 63.5% on the TEST set. Table 6.1 records the F-measure for each class obtained by the SVM and by each of the three individual workers in the crowd.

An important aspect distinguishing the crowdsourcing approach from the SVM is that the latter was trained and tested directly on the target domain, whereas in the former the workers manually assign labels by looking at the transformed data (digit images) residing in the source domain. In addition, the performance of the SVM depends directly on the data points in the sampled TRAIN set, the choice of kernel, and the kernel parameters.
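For reference, a baseline along these lines could be set up as follows. The use of scikit-learn, the reduced placeholder data sizes, and the mapping of the RBF width σ = 0.1 to the gamma parameter via gamma = 1/(2σ²) are assumptions about the implementation, which the text does not specify.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, f1_score

    # Small placeholders standing in for the bag-of-words TRAIN/TEST matrices of the corpus.
    rng = np.random.default_rng(0)
    X_train, y_train = rng.random((200, 500)), rng.integers(0, 4, 200)
    X_test, y_test = rng.random((80, 500)), rng.integers(0, 4, 80)

    sigma = 0.1
    baseline = SVC(kernel='rbf', gamma=1.0 / (2 * sigma ** 2))    # RBF kernel, sigma = 0.1
    baseline.fit(X_train, y_train)
    y_pred = baseline.predict(X_test)
    print(accuracy_score(y_test, y_pred))                         # overall accuracy
    print(f1_score(y_test, y_pred, average=None))                 # per-class F-measure (cf. Table 6.1)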
By contrast, the performance of the crowd depends on the quality of the transformation, the user interface for labeling, worker skills, and fatigue. In this experiment, the workers were not given any special training or instructions for labeling; they were not exposed to the images formed from the TRAIN set and relied on basic human intelligence to categorize the images into one of the four digits or noise. Finally, in order to obtain a unique label for each TEST data point, we used a simple majority-vote strategy to combine the results of the individual workers: a TEST data point is assigned to the category favored by the majority of the workers in the crowd (at least two out of three). This approach labels only a portion of the TEST set. We generalize the crowd-generated labels to the entire TEST set by training an SVM on the combined TRAIN set and the crowd-labeled portion of the TEST set and using it to predict the label for every TEST point. This approach is denoted as Crowd + SVM, and it gave an accuracy of 78.25%, as reported in Table 6.1.

Table 6.1: F-measure for each of the four article categories, where W_i denotes the i-th worker in the crowd. Each individual worker performs better than the SVM with RBF kernel (σ = 0.1), and Crowd + SVM performs better than the individual workers.

Class          SVM      W1       W2       W3       Crowd + SVM
Zoology        0.5639   0.6977   0.6667   0.7178   0.7079
Cell Biology   0.7250   0.8677   0.8571   0.8673   0.8756
Anatomy        0.6267   0.7436   0.7512   0.7350   0.7400
Genetics       0.6285   0.7866   0.7798   0.7803   0.8104
Accuracy       63.50%   74.75%   72.12%   72.88%   78.25%

6.5.4 Wiki Editor Networks

In this section, we analyze the performance of the proposed label acquisition approach on the Wikipedia editor networks. Each Wikipedia article is composed and edited by several editors, and each editor works on several articles. Sometimes the editors interact with each other through their respective User::Talk pages to improve the content of the articles. In our sample, we observed that a large pool of editors worked on a common set of articles without ever interacting with each other; therefore, merely editing a common set of articles does not imply interaction. For each editor, we use as attribute vector a binary vector indicating whether or not the editor worked on each article. For each editor pair, the sum of the two editors' attribute vectors is used as the feature vector. Here again, we used an SVM as the baseline algorithm to predict the interactions between editors. We trained the SVM using a linear kernel and RBF kernels with different σ parameters; the best result was obtained with the linear kernel.

The editor network data is similar to the Biology article corpus in that both are sparse, high-dimensional data sets that require strong domain expertise to label. However, learning on the editor network data poses a significant challenge owing to label skewness: in our sample, the number of non-interacting editor pairs is 2–10 times the number of interacting pairs. In addition, the attributes used for predicting interactions have limited discriminative power. This results in transformed images containing multiple digits or undecipherable characters that cannot be easily labeled by humans.
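The pair features just described can be constructed as follows. The sketch assumes a binary editor-by-article matrix and illustrative sizes; following the text, the feature vector of an editor pair is simply the sum of the two editors' attribute vectors.

    import numpy as np

    # E[i, j] = 1 if editor i worked on article j (binary editor-by-article matrix).
    rng = np.random.default_rng(0)
    E = (rng.random((300, 1000)) < 0.05).astype(int)        # placeholder for the real data

    def pair_features(E, pairs):
        """Feature vector of an editor pair = sum of the two editors' attribute vectors."""
        return np.array([E[i] + E[j] for i, j in pairs])

    # pairs would come from the TRAIN split of linked and non-linked editor pairs.
    pairs = [(0, 1), (2, 5), (7, 3)]
    X_pairs = pair_features(E, pairs)                       # shape (len(pairs), n_articles)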
Another distinguishing aspect between the experiments on the Biology article corpus and the editor networks is that on the former, three workers labeled the entire TEST set, whereas on the latter, 27 workers each labeled only a portion of the TEST set (on each network, 800 TEST images were labeled). Therefore, we do not report the performance of individual workers; rather, we report the overall performance of the Crowd + SVM approach. We mapped the data points corresponding to interactions (links) to the digit 1 and the data points belonging to the non-interaction (no-link) group to the digit 2. For each surrogate image, a worker chose one of the following five options: A) clearly digit 1; B) clearly digit 2; C) predominantly 1, with a little 2; D) predominantly 2, with a little 1; and E) noise. The surrogate images assigned to categories A and B were given labels +1 and −1, respectively; those assigned to categories C and D were given labels 0.5 and −0.5, respectively; and images assigned to category E were discarded as noise. We then generalized the labels over the entire TEST set by training an SVM on the TRAIN set and TEST_G. On average, only about 130 images (roughly 10% of the TEST set) in each network were assigned to a single category by the crowd.

Table 6.2: AUC values of the SVM and the Crowd + SVM approach for the link prediction problem on the two Wikipedia editor networks. The proposed crowd-assisted approach outperforms the baseline SVM by 8% on the computer science network and fares slightly better than the SVM on the natural science network.

Network      Pos/Neg in TEST   SVM AUC   Crowd + SVM AUC
Nat. Sc.     489/1003          0.7507    0.7701
Comp. Sc.    155/1500          0.7287    0.7950

In Table 6.2, we report the area under the ROC curve (AUC) for the SVM and the Crowd + SVM approach on each of the two networks. The SVM model trained on the TRAIN set gave an AUC of about 0.75 and 0.73 on the natural science and computer science editor networks, respectively, while the Crowd + SVM approach lifted the AUC to about 0.77 and 0.79, respectively. This shows that the partial label information generated by the crowd gives the SVM model new information with which to adjust its decision surface, which in turn results in higher performance. These experiments demonstrate the capability of the proposed framework to map complex, high-dimensional, sparse network data into human-comprehensible visual data that can be manually labeled by the crowd. The results suggest that learning on the surrogate data with assistance from the crowd outperforms (or at the very least matches) learning on the raw data. Throughout this work, we have used a simple majority vote among the workers to determine the best label for each surrogate image; more sophisticated algorithms [37, 45, 113] can be used to combine the label information from different workers into a final, reliable label for each surrogate image and the corresponding network data point.

6.6 Conclusions

In this chapter, we proposed a transformation learning technique that learns a point transformation between any two data domains. We highlighted the challenges associated with this problem and analyzed the data requirements for learning the point mapping. We also noted two applications of such a transformation: data obfuscation for privacy-preserving data mining and visual representation of data. In this work, we applied the proposed technique to transform sparse, high-dimensional social network data into a set of digit images that can be labeled by non-domain experts. In this way, we perform network mining tasks such as link prediction and node classification using crowdsourcing technology.
Chapter 7

Future Work

In this thesis, we have presented three learning frameworks for combining data from multiple sources to perform common network mining tasks. First, we presented a framework to jointly perform clustering and/or classification on two related network domains. Second, we presented a framework to simultaneously perform the related learning tasks of community detection and link prediction. Finally, we presented a generic framework for leveraging label information from an unrelated, non-network data domain to perform supervised learning on networks. We demonstrated the performance of all the proposed frameworks on real-world networks such as citation networks, co-authorship networks, and user networks. In this chapter, we highlight directions for future research on two problems that are unique to this thesis, namely multi-task learning on networks and out-of-domain transformation for label augmentation.

1. Multi-task Learning: In Chapter 5, we presented a cost-sensitive loss function for jointly performing the link prediction and community detection tasks on network data, and we showed the relationship between the proposed loss function and the well-known modularity measure for community detection. We demonstrated its applicability on a single network; in the future, it could be extended to multiple related networks as well. For example, the framework could be employed to predict missing citation links between relevant Wikipedia articles and, at the same time, identify communities among the corresponding authors or editors. In other words, the framework can be extended to perform link prediction on one network and community detection on another network. Another important problem in network mining is the identification of influential nodes, i.e., the minimal set of nodes that governs the diffusion of information in a network. Each community has its own set of influential nodes [4]. Once the communities are identified, the search for influential nodes can be narrowed to high-degree nodes inside communities and bridge nodes between communities. Conversely, knowing the influential nodes in the network could expedite community detection, as each community has its own popular or influential nodes [107]. It is therefore beneficial to perform these two tasks jointly, and the proposed degree-dependent loss function can be suitably altered to accomplish this.

2. Surrogate Mapping: In Chapter 6, we presented a novel data transformation framework for augmenting the label information available for a given network mining task. This framework allows us to transform large, sparse, high-dimensional network data into an image format. We hope that the proposed approach of representing network data in a visual format for crowd labeling will set a trend for future research in this field. It offers several directions for future work, as listed below.

• Extend the proposed linear framework for surrogate mapping to a nonlinear framework in order to reduce the transformation error. In addition, an algorithmic approach should be designed to automatically perform label matching between the domains (i.e., estimating the Q matrix in the objective function of Section 6.4).

• Currently, we use a simple majority vote to decide the final label for the data points labeled by the crowd.
However, more sophisticated learning algorithms, such as boosting, can be utilized to combine the labels generated by the individual workers in order to enhance the overall crowd performance.

• In this thesis, we demonstrated the usefulness of the surrogate mapping algorithm for solving the link prediction problem. It can be extended to solve multi-task learning problems on related networks, such as those described in Chapters 3 and 4.

APPENDIX

Proof of Convergence

In this section we provide the formal proofs showing that the update formulas (3.5)–(3.9) monotonically decrease the objective function (3.3). The proofs shown here are reproduced from our previous work [20] on joint community detection across multiple networks.

Definition 2 G(w, w′) is an auxiliary function for F(w) if the following two conditions are satisfied:

\[
G(w, w') \ge F(w), \qquad G(w, w) = F(w) \tag{A.1}
\]

Lemma 2 If G_1(w, w′) and G_2(w, w′) are auxiliary functions for F_1(w) and F_2(w), respectively, then G = G_1 + G_2 is an auxiliary function for F = F_1 + F_2. Further, F is non-increasing under the update formula

\[
w^{t+1} = \arg\min_w G(w, w^t) \tag{A.2}
\]

Proof 3 The claim that G is an auxiliary function for F follows directly from the definition. Below, we give the proof of the second part:

\[
F(w^{t+1}) = F_1(w^{t+1}) + F_2(w^{t+1}) \le G_1(w^{t+1}, w^t) + G_2(w^{t+1}, w^t) \le G_1(w^t, w^t) + G_2(w^t, w^t) = F(w^t)
\]

The second inequality follows from the fact that w^(t+1) minimizes the auxiliary function G. Thus, G(w^(t+1), w^t) ≤ G(w^t, w^t) and F(w^(t+1)) ≤ F(w^t), which completes the proof.

Equation (3.3) involves a summation of three distance functions D(·∥·). Lemma 2 implies that it is sufficient to show that minimizing an auxiliary function that bounds (3.3) decreases the overall objective function. First, we give the auxiliary function for the term D(A ∥ XUX^T); the auxiliary functions for the other terms can be derived similarly. The update formula for X that minimizes D(A ∥ XUX^T) is given by

\[
X_{ij} = X_{ij} \, \frac{\sum_{a=1}^{N} \left( \dfrac{A_{ia}[XU^T]_{aj}}{[XUX^T]_{ia}} + \dfrac{A_{ai}[XU]_{aj}}{[XUX^T]_{ai}} \right)}{\sum_{a=1}^{N} [XU + XU^T]_{aj}} \tag{A.3}
\]

The objective function can be written as

\[
F_1 = \sum_{i,j} \left( A_{ij} \log \frac{A_{ij}}{\sum_{kl} X_{ik} U_{kl} X^T_{lj}} - A_{ij} + \sum_{kl} X_{ik} U_{kl} X^T_{lj} \right) \tag{A.4}
\]

Now define

\[
\alpha_{kl} = \frac{X^{(t)}_{ik} U_{kl} X^{(t)T}_{lj}}{\sum_{rs} X^{(t)}_{ir} U_{rs} X^{(t)T}_{sj}} \tag{A.5}
\]

Clearly, Σ_kl α_kl = 1. Using Jensen's inequality, we get

\[
- \log \sum_{kl} X_{ik} U_{kl} X^T_{lj} \;\le\; - \sum_{kl} \alpha_{kl} \log \frac{X_{ik} U_{kl} X^T_{lj}}{\alpha_{kl}} \tag{A.6}
\]

Substituting (A.6) into (A.4), we get

\[
D(A \,\|\, XUX^T) \le \sum_{ij} \Big[ A_{ij}\log A_{ij} - A_{ij} - A_{ij} \sum_{kl} \alpha_{kl} \log \frac{X_{ik}U_{kl}X^T_{lj}}{\alpha_{kl}} + \sum_{kl} X_{ik}U_{kl}X^T_{lj} \Big] \tag{A.7}
\]

Plugging α_kl into the above expression gives the following bound on the objective function:

\[
\sum_{ij} \Big[ A_{ij}\log A_{ij} - A_{ij} - A_{ij}\sum_{kl} \frac{X^{(t)}_{ik}U_{kl}X^{(t)T}_{lj}}{\sum_{rs} X^{(t)}_{ir}U_{rs}X^{(t)T}_{sj}} \Big( \log X_{ik}U_{kl}X^T_{lj} - \log \frac{X^{(t)}_{ik}U_{kl}X^{(t)T}_{lj}}{\sum_{rs} X^{(t)}_{ir}U_{rs}X^{(t)T}_{sj}} \Big) + \sum_{kl} X_{ik}U_{kl}X^T_{lj} \Big]
\]

which is the auxiliary function for F_1; we denote it by G_1(X, X^(t)). Taking the derivative of G_1 with respect to X_pq and setting it to zero,

\[
\frac{\partial G_1}{\partial X_{pq}} = - \sum_{jl} A_{pj} \frac{X^{(t)}_{pq} U_{ql} X^{(t)T}_{lj}}{\sum_{rs} X^{(t)}_{pr} U_{rs} X^{(t)T}_{sj} \, X_{pq}} + \sum_{jl} U_{ql} X^{(t)T}_{lj} - \sum_{ik} A_{ip} \frac{X^{(t)}_{ik} U_{kq} X^{(t)T}_{qp}}{\sum_{rs} X^{(t)}_{ir} U_{rs} X^{(t)T}_{sp} \, X_{pq}} + \sum_{ik} X^{(t)}_{ik} U_{kq} = 0
\]

we obtain

\[
X_{pq} = X^{(t)}_{pq} \, \frac{\sum_{j} \dfrac{A_{pj}[U X^{(t)T}]_{qj}}{[X^{(t)} U X^{(t)T}]_{pj}} + \sum_{i} \dfrac{A_{ip}[X^{(t)} U]_{iq}}{[X^{(t)} U X^{(t)T}]_{ip}}}{\sum_{j=1}^{N} [U X^{(t)T}]_{qj} + \sum_{i=1}^{N} [X^{(t)} U]_{iq}} \tag{A.8}
\]

which is the same as (A.3).
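For illustration, the multiplicative update (A.3)/(A.8) can be written in vectorized form roughly as follows. This NumPy sketch is an assumption about one possible implementation; it updates only X and omits the analogous updates for U and the other factors of objective (3.3).

    import numpy as np

    def update_X(A, X, U, eps=1e-12):
        """One multiplicative update of X for minimizing D(A || X U X^T), cf. (A.3)/(A.8)."""
        XU = X @ U                                   # [X U]_{iq}
        XUt = X @ U.T                                # [X U^T]_{jq}
        R = X @ U @ X.T + eps                        # current reconstruction [X U X^T]_{ij}
        numer = (A / R) @ XUt + (A / R).T @ XU       # the two ratio sums in (A.8)
        denom = XUt.sum(axis=0) + XU.sum(axis=0) + eps
        return X * numer / denom                     # element-wise update, denom broadcast over rows

    # Iterating update_X (together with the corresponding updates for the other factors)
    # drives the divergence down monotonically, which is what Lemma 2 establishes.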
Similarly, the term D(C ∥ XVY^T) can be written as

\[
F_2 = \sum_{ij} \left( C_{ij} \log \frac{C_{ij}}{\sum_{kl} X_{ik} V_{kl} Y^T_{lj}} - C_{ij} + \sum_{kl} X_{ik} V_{kl} Y^T_{lj} \right)
\]

Now define

\[
\beta_{kl} = \frac{X^{(t)}_{ik} V_{kl} Y^{(t)T}_{lj}}{\sum_{rs} X^{(t)}_{ir} V_{rs} Y^{(t)T}_{sj}} \tag{A.9}
\]

Once again Σ_kl β_kl = 1 and, following the same procedure as above, the auxiliary function for D(C ∥ XVY^T) is

\[
G_2(X, X^{(t)}) = \sum_{ij} \Big[ C_{ij}\log C_{ij} - C_{ij} - C_{ij} \sum_{kl} \frac{X^{(t)}_{ik} V_{kl} Y^{(t)T}_{lj}}{\sum_{rs} X^{(t)}_{ir} V_{rs} Y^{(t)T}_{sj}} \Big( \log X_{ik} V_{kl} Y^T_{lj} - \log \frac{X^{(t)}_{ik} V_{kl} Y^{(t)T}_{lj}}{\sum_{rs} X^{(t)}_{ir} V_{rs} Y^{(t)T}_{sj}} \Big) + \sum_{kl} X_{ik} V_{kl} Y^T_{lj} \Big]
\]

Taking the derivative of this with respect to X_pq and equating it to zero, we get

\[
X_{pq} = X^{(t)}_{pq} \, \frac{\sum_{i=1}^{M} \dfrac{C_{pi}\,[Y V^T]_{iq}}{[X V Y^T]_{pi}}}{\sum_{i=1}^{M} [Y V^T]_{iq}} \tag{A.10}
\]

Minimizing the original objective function (3.3) with respect to X, we have

\[
\min_X \mathcal{J}(X) = \min_X F_1(X) + F_2(X) \le \min_X G_1(X, X^{(t)}) + G_2(X, X^{(t)}) \tag{A.11}
\]

Taking the derivative of G_1(X, X^(t)) + G_2(X, X^(t)) with respect to X and equating it to zero yields the update formula (3.5). In all, we have shown that G_1(X, X^(t)) + G_2(X, X^(t)) is an auxiliary function for (3.3) with respect to the variable X and that (3.5) decreases this auxiliary function, and hence the objective function. This proves Theorem 1.

It should be noted that Theorem 1 states that the update formula monotonically decreases the objective function (3.3); it does not guarantee convergence to a local minimum or any other stationary point. We stop the algorithm when the change in error between two consecutive iterations falls below a specified threshold or after a maximum number of iterations. The work of Finesso and Spreij [29] has shown the existence of a solution for the problem D(A ∥ WH). They first argue that the minimum exists in the interior of the domain, and then show that normalizing the A matrix (such that Σ_ij A_ij = 1) and adding a stochastic constraint on H (Σ_j H_ij = 1) results in an update formula that converges to a local minimum in the interior of the domain (the strictly positive quadrant).

BIBLIOGRAPHY

[1] Z. Abraham and P.-N. Tan. A semi-supervised framework for simultaneous classification and regression of zero-inflated time series data with application to precipitation prediction. In Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, ICDMW '09, pages 644–649, 2009. [2] L. Adamic and E. Adar. How to search a social network, 2005. [3] O. Alonso, C. Carson, D. Gerster, X. Ji, and S. U. Nabar. Detecting Uninteresting Content in Text Streams. In M. Lease, V. Carvalho, and E. Yilmaz, editors, Proceedings of the ACM SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010), pages 39–42, Geneva, Switzerland, July 2010. [4] M. Anjerani and A. Moeini. Selecting influential nodes for detected communities in real-world social networks. In Electrical Engineering (ICEE), 2011 19th Iranian Conference on, pages 1–6, May 2011. [5] L. Backstrom, D. P. Huttenlocher, J. M. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), pages 44–54, Philadelphia, PA, USA, August 2006. [6] S. Banerjee, K. Ramanathan, and A. Gupta. Clustering short texts using wikipedia. In SIGIR, pages 787–788, 2007. [7] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. In Science, volume 286, 1999. [8] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999. [9] S. Basu, A.
Banerjee, and R. J. Mooney. Active semisupervision for pairwise constrained clustering. In In Proceedings of the 2004 SIAM International Conference on Data Mining (SDM-04, pages 333–344, 2004. [10] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, pages 59–68, New York, NY, USA, 2004. ACM. ISBN 1-58113-888-1. doi: 10.1145/1014052.1014062. URL http://doi. acm.org/10.1145/1014052.1014062. 131 [11] M. Bilgic, G. Namata, and L. Getoor. Combining collective classification and link prediction. In Proceedings of the ICDM Workshop on Mining Graphs and Complex Structures, pages 381–386, Omaha, NE, USA, 2007. [12] D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Community mining from multi-relational networks. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2005. [13] R. Caruana. Multitask learning. Mach. Learn., 28:41–75, 1997. [14] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. SIGMOD International Conference on Management of Data, pages 307– 318, 1998. [15] Chebotarev and S. E. V. The matrix-forest theorem and measuring relations in small social groups. In Automation and Remote Control, page 1505, 1997. [16] F. Chen, P. N. Tan, and A. K. Jain. A co-classification framework for detecting web spam and spammers in social media web sites. In CIKM, 2009. [17] A. Clauset, M. Newman, and C. Moore. Finding community structure in very large networks. In Phys. Rev. E, number 6, page 066111, 2004. [18] A. Clauset, C. Moore, and M. Newman. Structural inference of hierarchies in networks. In Intl. Conf. on Machine Learning, ICML, 2006. [19] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, adaboost and bregman distances. Machine Learning, 48(1-3), 2002. [20] P. M. Comar, P.-N. Tan, and A. Jain. Identifying cohesive subgroups and their correspondences in multiple related networks. In Proc of the 2010 IEEE/WIC/ACM Int’l Conf on Web Intelligence (WI-2010), Toronto, Canada, 2010. [21] P. M. Comar, P.-n. Tan, and A. K. Jain. Multi task learning on multiple related networks. In Proceedings of the 19th ACM international conference on Information and knowledge management, CIKM ’10, pages 1737–1740, 2010. [22] P. M. Comar, P.-N. Tan, and A. K. Jain. Linkboost: A novel cost-sensitive boosting framework for community-level network link prediction. In ICDM, pages 131–140, 2011. [23] R. Das and M. Vukovic. Emerging theories and models of human computation systems: a brief survey. In Proceedings of the 2nd international workshop on Ubiquitous crowdsouring, UbiCrowd ’11, pages 1–4, New York, NY, USA, 2011. ACM. ISBN 9781-4503-0927-1. doi: 10.1145/2030100.2030102. URL http://doi.acm.org/10.1145/ 2030100.2030102. [24] I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In SDM, 2005. 132 [25] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proceedings of the 9th ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pages 89–98. ACM, 2003. [26] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD, pages 126–135. ACM, 2006. [27] E.Airoldi, D. Blei, S. Feinberg, A. Goldenberg, E. Xing, and A. Zheng. Social Network Analysis: Models, Issues, and New Directions, volume 4503 of LNCS. 
Springer, 2007. [28] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. J. Mach. Learn. Res., 6:615–637, 2005. [29] L. Finesso and P. Spreij. Nonnegative matrix factorization and i-divergence alternating minimization. Linear Algebra and its Applications, 416(23):270 – 287, 2006. [30] G. Flake, K. Tsioutsiouliklis, and R. Tarjan. Graph clustering techniques based on minimum cut trees. Technical Report, 2002. [31] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–160. ACM Press, 2000. [32] G. W. Flake, S. Lawrence, C. L. Giles, and F. M. Coetzee. Self-organization and identification of web communities. IEEE Computer, 35:66–71, 2002. [33] L. Ford and D. Fulkerson. Maximal flow through a network. 8:399–404, 1956. [34] B. Gallagher, H. Tong, T. Eliassi-Rad, and C. Faloutsos. Using ghost edges for classification in sparsely labeled networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 256–264, 2008. [35] L. Getoor and C. P. Diehl. Link mining: A survey. ACM SIGKDD Explorations, 7(2): 3–12, December 2005. [36] D. Gibson, J. M. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space - Structure in Hypermedia Systems, pages 225–234, Pittsburgh, PA, USA, June 1998. [37] R. Gomes, P. Welinder, A. Krause, and P. Perona. Crowdclustering. In NIPS, pages 558–566, 2011. [38] M. Grineva, M. Grinev, D. Turdakov, and P. Velikhov. Harnessing wikipedia for smart tags clustering. In Proceedings of the International Workshop on Knowledge Acquisition from the Social Web (KASW2008), 2008. [39] R. Guimera and M. Sales-Pardo. Missing and spurious interactions and the reconstruction of complex networks. In Proc. Natl. Acad. Sci. U.S.A., 2009. 133 [40] R. Guimer`, M. Sales-Pardo, and L. Amaral. Classes of complex networks defined by a role-to-role connectivity profiles. Nature Physics, 2007. [41] F. Harary. Graph Theory. Westview Press, 1994. [42] M. Hasan, V. Chaoji, S. Salem, and M. J. Zaki. Link prediction using supervised learning. In Workshop on Link analysis, Counter-terrorism, and Security, Bethesda, MD, USA, April 2006. [43] M. A. Hasan, V. Chaoji, S. Salem, and M. Zaki. Link prediction using supervised learning. In In Proc. of SDM 06 workshop on Link Analysis, Counterterrorism and Security, 2006. [44] X. Hu, X. Zhang, C. Lu, E. K. Park, and X. Zhou. Exploiting wikipedia as external knowledge for document clustering. In Proc. of Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2009. [45] A. K. J. Jinfeng Yi, Rong Jin and S. Jain. Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning. In NIPS, 2012. [46] H. Kashima and N. Abe. A parameterized probabilistic model of network evolution for supervised link prediction. In Proceedings of the ICDM, pages 556–559, New York, NY, USA, 2006. ACM. doi: http://doi.acm.org/10.1145/956863.956972. [47] H. Kashima and N. Abe. A parameterized probabilistic model of network evolution for supervised link prediction. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2006), 2006. [48] H. Kashima, T. Kato, Y. Yamanishi, M. Sugiyama, and K. Tsuda. Link propagation: A fast semi-supervised learning algorithm for link prediction. 
In Proceedings of the SIAM intl conf on Data Mining, pages 1099–1110, Sparks, NV, USA, 2009. [49] L. Katz. A new status index derived from sociometric analysis. Psychometrika, VOL. 18, NO. 1:39– 43, 1953. [50] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999. [51] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. [52] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, volume 13, pages 556–562, 2001. [53] R. Lempel and S. Moran. Salsa: The stochastic approach for link-structure analysis. ACM Transactions on Information Systems, 2001. 134 [54] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 556–559, New York, NY, USA, 2003. ACM. ISBN 158113-723-0. doi: http://doi.acm.org/10.1145/956863.956972. [55] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management (CIKM’03), New Orleans, LA, USA, November 2003. [56] Y.-R. Lin, J. Sun, P. Castro, R. Konuru, H. Sundaram, and A. Kelliher. MetaFac: community discovery via relational hypergraph factorization. In Proceedings of the 15th ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pages 527–536, 2009. [57] Y. Liu, R. Jin, and A. K. Jain. Boostcluster: boosting clustering by pairwise constraints. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’07, pages 450–459, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-609-7. doi: 10.1145/1281192.1281242. URL http://doi.acm.org/10.1145/1281192.1281242. [58] B. Long, Z. M. Zhang, and P. S. Yu. Co-clustering by block value decomposition. In Proceedings of the 11th ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pages 635–640. ACM, 2005. [59] B. Long, Z. M. Zhang, and P. S. Yu. A probabilistic framework for relational clustering. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 470–479, 2007. [60] B. Long, P. S. Yu, and Z. Zhang. A General Model for Multiple View Unsupervised Learning. In Proceedings of the 2008 SIAM International Conference on Data Mining, 2008. [61] Q. Lu and L. Getoor. Link-based classification. In International Conference on Machine Learning, 2003. [62] P. K. Mallapragada, R. Jin, A. K. Jain, and Y. Liu. Semiboost: Boosting for semisupervised learning. IEEE Trans. Pattern Anal. Mach. Intell., 31(11):2000–2014, Nov. 2009. ISSN 0162-8828. doi: 10.1109/TPAMI.2008.235. URL http://dx.doi.org/10. 1109/TPAMI.2008.235. [63] P. Mandayam Comar, P.-N. Tan, and A. K. Jain. Identifying cohesive subgroups and their correspondences in multiple related networks. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, pages 476–483, 2010. [64] P. Mandayam Comar, P.-N. Tan, and A. K. Jain. A framework for joint community detection across multiple related networks. Neurocomputing, 2011. 135 [65] P. Mandayam Comar, P.-N. Tan, and A. K. Jain. Simultaneous classification and community detection on heterogeneous network data. In To appear, Data Mining and Knowledge Discovery(DMKD), 2012. [66] H. Masnadi-Shirazi and N. Vasconcelos. 
Cost-sensitive boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:294–309, 2011. ISSN 0162-8828. doi: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.71. [67] R. McCreadie, C. Macdonald, and I. Ounis. Crowdsourcing Blog Track Top News Judgments at TREC. In M. Lease, V. Carvalho, and E. Yilmaz, editors, Proceedings of the Workshop on Crowdsourcing for Search and Data Mining (CSDM) at the Fourth ACM International Conference on Web Search and Data Mining (WSDM), pages 23– 26, Hong Kong, China, February 2011. [68] M. Meyerhoff. Introducing sociolinguistics. Routledge;, 1 edition, 2006. [69] M. Narayanan, A. Vetta, E. E. Schadt, and J. Zhu. Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS Computational Biology, 6(4), 2010. [70] J. Neville and D. Jensen. Leveraging relational autocorrelation with latent group models. In Proceedings of the IEEE International Conference on Data Mining (ICDM’05), pages 322–329, Houston, TX, USA, November 2005. [71] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69, Feb 2004. [72] M. E. J. Newman. Clustering and preferential attachment in growing networks. Phys. Rev. E, 64, 2001. [73] M. E. J. Newman. The structure and function of complex networks. SIAM Review, (45):167–256, 2003. [74] S. Ozawa, A. Roy, and D. Roussinov. A multitask learning model for online pattern recognition. IEEE Transactions Neural Networks, 20:430–445, March 2009. ISSN 1045-9227. [75] L. Page, S. Brin, R. Motwani, and T. Winograd. Pagerank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998. [76] S. J. Pan and Q. Yang. A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on, 22(10):1345 –1359, 2010. [77] A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Proc. of IJCAI Workshop on Learning Statistical Models from Relational Data, 2003. [78] S. Purnamrita, C. Deepayan, and M. Andrew. Theoretical justification of popular link prediction heuristics. In COLT, 2010. 136 [79] M. J. R and N. Heinz. New York: John Wiley, 2nd edition, 1999. [80] M. Rattigan and D. Jensen. The case for anomalous link discovery. SIGKDD Explorations, 7(2):41–47, 2005. [81] C. M. Richard M. C. McCreadie and I. Ounis. Crowdsourcing a News Query Classification Dataset. In M. Lease, V. Carvalho, and E. Yilmaz, editors, Proceedings of the ACM SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010), pages 31–38, Geneva, Switzerland, July 2010. [82] G. Salton and M. J. McGill. Introduction to modern information retrieval. In McGrawHill, Auckland, 1983. [83] J. Scripps and P.-N. Tan. Clustering in the presence of bridge-nodes. In Proceedings of the SIAM International Conference on Data Mining (SDM’06), Bethesda, MD, USA, April 2006. [84] J. Scripps, P. N. Tan, and A.-H. Esfahanian. Exploration of link structure and community-based node roles in network analysis. In Proceedings of the Seveth IEEE International Conference on Data Mining, 2007. [85] J. Scripps, P. Tan, F. Chen, and A.-H. Esfahanian. A matrix alignment approach for link prediction. In Proceedings of the 19th International Conference on Pattern Recognition, pages 1–4, 2008. [86] J. Scripps, P.-N. Tan, and A.-H. Esfahanian. A matrix alignment approach for collective classification. In Proc of the 2009 Int’l Conf on Advances in Social Networks Analysis and Mining, Athens, Greece, 2009. [87] J. Scripps, R. Nussbaum, P. Tan, and A. 
Esfahanian. Link-based network mining. In M. Dehmer, editor, Structural Analysis of Complex Networks. Birkhauser, 2010. [88] D. Sculley. Combined regression and ranking. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’10, 2010. [89] P. Sen, G. M. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008. [90] T. E. Senator. Link mining applications: progress and challenges. SIGKDD Explorations, 7(2):76–83, 2005. [91] J. Shi and J. Malik. Normalized cuts and image segmentation. In Proceedings of CVPR, 1997. [92] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. 137 [93] M. Soleymani and M. Larson. Crowdsourcing for Affective Annotation of Video: Development of a Viewer-reported Boredom Corpus. In M. Lease, V. Carvalho, and E. Yilmaz, editors, Proceedings of the ACM SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010), pages 4–8, Geneva, Switzerland, July 2010. [94] W. Tang, Z. Lu, and I. Dhillon. Clustering with multiple graphs. In ICDM, pages 1016–1021, Miami, FL, 2009. [95] C. Tantipathananandh, T. Y. Berger-Wolf, and D. Kempe. A framework for community identification in dynamic social networks. In Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007. [96] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02), 2002. [97] B. Taskar, M. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In In Advances in Neural Information Processing Systems 16, 2003. [98] B. Taskar, M. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Proceedings of the Neural Information Processing Systems (NIPS 2003), 2003. [99] H. Tong, C. Faloutsos, and J. Y. Pan. Fast random walk with restart and its applications. In ICDM ’06: Proceedings of the Sixth International Conference on Data Mining, pages 613–622, 2006. URL http://dx.doi.org/10.1109/ICDM.2006.70. [100] K. Wagstaff, C. Cardie, S. Rogers, and S. Schr¨dl. Constrained k-means clustering with o background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 577–584, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-778-1. URL http://dl.acm.org/citation. cfm?id=645530.655669. [101] P. Wang and C. Domeniconi. Building semantic kernels for text classification using wikipedia. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, 2008. [102] P. Wang, C. Domeniconi, and J. Hu. Using wikipedia for co-clustering based crossdomain text classification. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 1085–1090, 2008. ISBN 978-0-7695-3502-9. [103] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences). Cambridge University Press, 1994. [104] D. J. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393(6684):440–442, 1998. [105] Y.-B. Xie, T. Zhou, and B.-H. Wang. Scale-free networks without growth. In Physica, volume 387, 2008. 138 [106] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. 
Multi-task learning for classification with dirichlet process priors. J. Mach. Learn. Res., 8:35–63, 2007. [107] T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pages 927–936, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9. doi: 10.1145/1557019.1557120. URL http://doi.acm.org/10.1145/1557019.1557120. [108] T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In Proc. of the 15th ACM SIGKDD International Conference on Data Mining, pages 927–936, Paris, France, 2009. [109] T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09), pages 927–936, 2009. [110] Y. Yang, S. Slattery, and R. Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18, March 2002. [111] D. Zhang and D. Shen. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in alzheimer’s disease. NeuroImage, 59(2):895 – 907, 2012. [112] Z. Zhang, T. Li, C. Ding, and X. Zhang. Binary matrix factorization with applications. In Proceedings of the IEEE Int’l Conf on Data Mining, pages 391–400, 2007. [113] L. Zhao, G. Sukthankar, and R. Sukthankar. Robust active learning using crowdsourced annotations for activity recognition. In AAAI workshop on Human Computation, 2011. [114] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Sch¨lkopf. Learning with local and o global consistency. In Advances in Neural Information Processing Systems 16, 2004. [115] D. Zhou, S. Zhu, K. Yu, X. Song, B. L. Tseng, H. Zha, and C. L. Giles. Learning multiple graphs for document recommendations. In WWW ’08, pages 141–150, 2008. [116] T. Zhou, L. Lu, and Y.-C. Zhang. Predicting missing links via local information. In Eur. Phys. J., 2009. [117] S. Zhu, K. Yu, Y. Chi, and Y. Gong. Combining content and link for classification using matrix factorization. In Proceedings of the 30th Annual Int’l ACM SIGIR Conf on Research and development in information retrieval, pages 487–494. ACM, 2007. [118] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. 2002. 139