LEARNING FROM NOISILY CONNECTED DATA

By

Tianbao Yang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2012

ABSTRACT

LEARNING FROM NOISILY CONNECTED DATA

By Tianbao Yang

Machine learning is a discipline of developing computational algorithms for learning predictive models from data. Traditional learning methods treat the data as independent and identically distributed (i.i.d.) samples from unknown distributions. However, this assumption is violated in many real-world applications, which poses a challenge for learning predictive models. For example, on an electronic commerce website, customers may purchase a product on the recommendation of their friends; the purchase records of customers are therefore not i.i.d. samples but are correlated. Nowadays, data become correlated due to collaborations, interactions, communications, and many other types of connections. Effective learning from these connected data not only provides a better understanding of the data but also brings significant economic benefits. Learning from connected data also poses unique challenges to both supervised and unsupervised learning algorithms, because these algorithms are designed for i.i.d. data and are often sensitive to the noise in the connections. In this dissertation, I focus on developing theory and algorithms for learning from connected data. In particular, I consider two types of connections: the first type is naturally formed in real-world networks, while the second type, the must-and-cannot link, is manually created to facilitate the learning process. In the first part of this dissertation, I develop efficient algorithms for detecting communities in the first type of connected data. In the second part, I develop clustering algorithms that effectively utilize both must-links and cannot-links for the second type of connected data.

A common approach toward learning from connected data is to assume that if two data points are connected, they are likely to be assigned to the same class/cluster. This assumption is often violated in real-world applications, leading to the problem of noisy connections. One key challenge of learning from connected data is how to model the noisy pairwise connections that indicate the pairwise class relationship between two data points. In the problem of detecting communities in networked data, I develop Bayesian approaches that explicitly model the noisy pairwise links by introducing additional hidden variables, besides community memberships, to explain potential inconsistency between the pairwise connections and the pairwise class relationships. In clustering must-and-cannot linked data, I model how the noise is added to the pairwise connections in the manual generation process.
The main contributions of this dissertation include: (i) it introduces, for the first time, popularity and productivity, besides the community memberships, to model the generation of noisy links in real networks; the effectiveness of these factors is demonstrated through the task of community detection; (ii) it proposes, for the first time, a discriminative model that combines content and link analysis for detecting communities, to alleviate the impact of noisy connections in community detection; (iii) it presents a general approach for learning from noisily labeled data, proves the first theoretical convergence results, and applies the approach to clustering noisy must-and-cannot linked data.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my thesis advisor, Dr. Rong Jin. I feel extremely fortunate to have worked with and learned from Dr. Rong Jin. His decision five years ago to extend me a graduate assistantship offer significantly changed my life. Without this offer, I have no idea where I would be, and I cannot imagine performing as well as I did in the past five years. During the five years at MSU, Dr. Rong Jin gave me great help and had a profound influence on my research. He always inspired me to tackle interesting problems and guided me to solve problems using mathematical skills. His enthusiasm for reading books brought me to a world of interesting and useful books from which I learned a lot in the past five years, and I have no doubt that this influence will continue in the future.

I would also like to thank my other committee members, Dr. Pang-Ning Tan, Dr. Joyce Chai, and Dr. Selin Aviyente. I am grateful to my collaborators from NEC Laboratories America and Yahoo Research. I spent three summers at NEC Labs working with Yun Chi and Shenghuo Zhu. I am greatly indebted to them for their help on my first several publications and for introducing me to the interesting research topic of social network analysis. I also spent a wonderful summer at Yahoo Research mentored by Olivier Chapelle and Lihong Li. I enjoyed the discussions with Olivier and Lihong on research problems. I want to thank my colleagues from the LINKS lab at MSU, especially Wei Tong, Yang Zhou, Mehrdad Mahdavi, Fengjie Li, Yi Liu, and Jinfeng Yi. I also want to thank my colleagues from NEC Labs during the three summers. I also owe great thanks to Ying Liu, and I wish her all the best. I gratefully acknowledge the help from the staff of the CSE Department at MSU, especially Linda Moore and Norma Teague. My final and most heartfelt acknowledgment goes to my family for their love and support. I also want to specially thank Fangfang. She always patiently listens to me whenever I am feeling down.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Networks
1.2 Must-and-Cannot Link
1.3 Research on learning from connected data
2 Contributions of this Dissertation
3 Literature Survey
3.1 Community Detection for Networked Data
3.1.1 Community Detection based on Link Analysis
3.1.2 Community Detection based on Content Analysis
3.1.3 Community Detection based on Combined Link and Content Analysis
3.2 Clustering for Must-and-Cannot (M&C) Linked Data
3.2.1 Constrained Clustering
3.2.2 Distance Metric Learning
3.2.3 Kernel Learning
4 Community Detection for Networked Data: A Popularity and Productivity Link Model
4.1 Introduction
4.2 Background
4.2.1 Notation
4.2.2 Existing Models
4.3 Popularity and Productivity Link (PPL) Model
4.3.1 General Form of PPL
4.3.2 Variants of the General PPL Model
4.3.3 Properties of PPL Model
4.4 Relationship with Existing Models
4.4.1 Relationship with Conditional Link Models
4.4.2 Relationship with Symmetric Joint Link Models
4.4.3 Summary
4.5 Estimation Algorithm
4.6 Experiments
4.6.1 Data Sets
4.6.2 Community Detection
4.6.3 Link Prediction
4.6.4 Degree Fitting
4.7 Conclusions
5 Community Detection for Networked Data: A Discriminative Approach for Combining Link and Content
5.1 Introduction
5.2 Conditional Link Model
5.2.1 Popularity Conditional Link Model (PCL)
5.2.2 Analysis of the PCL Model
5.2.3 Maximum Likelihood Estimation
5.3 Discriminative Content Model
5.4 Extensions
5.4.1 PCL + PLSA
5.4.2 PHITS + DC
5.5 Experiments
5.5.1 Data Sets
5.5.2 Evaluation Metrics
5.5.3 Link Prediction
5.5.4 Community Detection
5.6 Conclusions
6 Clustering for Noisy M&C Linked Data: A Probabilistic Link Prediction Model (I)
6.1 Introduction
6.2 Learning from Noisy M&C Linked Data
6.2.1 Notations
6.2.2 Generalized Maximum Entropy Model
6.2.3 Estimating the Sufficient Statistics
6.2.4 Convergence Analysis
6.3 Experiments
6.3.1 Experiments with Real Noise
6.3.2 Experiments with Synthetic Noise
6.4 Conclusions
7 Clustering for Noisy M&C Linked Data: A Probabilistic Link Prediction Model (II)
7.1 Notations
7.2 Learning from Noisy M&C Linked Data
7.2.1 A Probabilistic Model
7.2.2 Estimating Sufficient Statistics
7.2.3 Convergence Analysis
7.3 Experiments
7.3.1 Experiments with Real Noise
7.3.2 Experiments with Synthetic Noise
7.4 Conclusions
8 Conclusions and Future Work
APPENDICES
A Proofs
BIBLIOGRAPHY

LIST OF TABLES

4.1 Taxonomy of the models categorized by type and variables
4.2 Community detection performance on the Political Blog data set
4.3 Community detection performance on the Cora data set
4.4 Community detection performance on the Citeseer data set
5.1 Community detection performance on the Cora and Citeseer data sets
6.1 Notations
6.2 Statistics of data sets
7.1 Statistics of data sets
7.2 Clustering performance on unseen data
7.3 Clustering performance with perturbed $(p_y, d_y)$

LIST OF FIGURES

4.1 (a)∼(c): Recalls on Political Blog data. (d): Degree distribution.
4.2 (a)∼(c): Recalls on Cora data. (d): Degree distribution.
4.3 (a)∼(c): Recalls on Citeseer data. (d): Degree distribution.
4.4 Degree fitting on the Political Blog data
4.5 Indegree fitting on Cora and outdegree fitting on Citeseer
5.1 Recall on the four data sets
5.2 Partition measure of PCL-DC vs. $\lambda$
6.1 Experimental results on the Cora data set
6.2 Experimental results on the Citeseer data set
6.3 Experimental results on the Terrorist Attack data set
7.1 Comparison of clustering performance on real noise
7.2 Comparison of clustering performance with different noise levels
7.3 Convergence behavior of the proposed approach

CHAPTER 1

Introduction

Most machine learning problems can be classified into three categories: (1) supervised learning [92], which aims to learn a prediction function from a set of labeled instances; most supervised learning tasks can be further divided into classification and regression [38], depending on the type of outputs; (2) unsupervised learning [29], which aims to divide a given collection of instances into multiple clusters, with the hope that instances within the same cluster share a larger similarity than instances from different clusters; and (3) semi-supervised learning [114], which aims to learn a prediction function from a mixture of labeled and unlabeled instances. Other learning paradigms include reinforcement learning [5] and learning to learn [88].

In the classical setup of learning problems, training instances are usually represented by vectors of attributes (also referred to as content information), and in the case of supervised learning, labels are provided for individual training instances. In this dissertation, we study the class of learning problems where additional relationship information is provided between pairs of instances. The pairwise relationship information can either be the observed connections between training instances (e.g., hyperlinks between web pages) or connections derived from side information (e.g., pairwise class relationships). Training instances together with their physical connections can be represented by a network, and this type of training data is also referred to as networked data. Pairwise class relationships were introduced by Wagstaff et al. [94] to specify the relationship between the class assignments of two instances, and they are also known as must-and-cannot (M&C) link information. Instances connected by a must-link are said to be in the same class, while instances connected by a cannot-link are in different classes. Pairwise class-relationship connections can be derived from the observed connections with some additional knowledge. We refer to data connected by must-and-cannot link information as must-and-cannot linked data, or M&C linked data. In this dissertation, we refer to both networked data and M&C linked data as connected data. The objective of this study is to take advantage of the connections between instances to facilitate the learning process. We thus refer to the related learning problems as learning from connected data.
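To make the two kinds of connected data concrete, the following minimal Python sketch uses hypothetical toy data (the variable names and values are ours, not taken from the dissertation) to show one common way of representing a networked data set and an M&C linked data set: content vectors, observed directed links, and pairwise class-relationship constraints.

```python
import numpy as np

# Content information: each row is the attribute vector of one instance.
X = np.array([[1.0, 0.2],
              [0.9, 0.1],
              [0.1, 1.0],
              [0.2, 0.8]])

# Networked data: observed directed links (source, target), e.g. hyperlinks
# or citations among the four instances above.
directed_links = [(0, 1), (1, 0), (2, 3), (0, 3)]

# M&C linked data: pairwise class-relationship constraints.
must_links = [(0, 1)]     # instances 0 and 1 should be in the same cluster
cannot_links = [(1, 2)]   # instances 1 and 2 should be in different clusters

# A weighted adjacency matrix S with S[i, j] > 0 iff there is a link i -> j.
N = X.shape[0]
S = np.zeros((N, N))
for i, j in directed_links:
    S[i, j] = 1.0

print(S)
```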
In particular, the research work presented in this dissertation focuses on unsupervised learning on networked data and M&C linked data. The challenge of learning from connected data is how to distinguish useful connections from noisy ones when inferring the pairwise class relationships. In the sequel, the discussion of connected data is divided into networked data and M&C linked data.

1.1 Networks

Networks, as a particular type of connected data, have become an important topic in computer science, physics, biology, communications, statistics, social science, psychology, etc. Many important properties of networks have been studied, including the scale-free property, the small-world property, community structure, triads and clustering coefficients, network motifs, etc. Below, we review the three most important properties of networks, i.e., the scale-free property, the small-world property, and community structure. For discussion of the other properties of networks, we refer the reader to the overviews in [4, 74, 55, 11, 12, 98].

A scale-free network is a network in which a few nodes have a tremendous number of connections to the rest of the nodes, while most nodes in the network have only a handful of connections. This property is also known as a power-law degree distribution. The nodes with very large numbers of connections are usually referred to as hubs. Over the past several years, researchers have uncovered scale-free structures in a stunning range of systems, including the World Wide Web [6], social networks such as the sexual network [59], the network of citations between scientific papers [23], the cellular metabolic network [3], and the protein-protein interaction network [80]. One important characteristic of scale-free networks is that they are robust to accidental failures but vulnerable to coordinated attacks. Scale-free networks have found applications in many domains, such as computing, medicine, and business. Computer networks with scale-free architectures, such as the World Wide Web, are highly resistant to accidental failures, but they are very vulnerable to deliberate attacks and sabotage. In medicine, vaccination campaigns against serious viruses, such as smallpox, might be most effective if they concentrate on treating hubs, i.e., people who have many connections to others. Mapping out the networks within the human cell could aid researchers in uncovering and controlling the side effects of drugs. Furthermore, identifying the hub molecules involved in certain diseases could lead to new drugs that target those hubs.

A small-world network is a type of network in which, although most nodes are not directly connected with each other, most nodes can be reached from any other by a small number of hops or steps. A typical small-world network is the network of acquaintances between people, where the famous six-degrees-of-separation phenomenon [90, 65] is well accepted. Other real-world networks that reveal small-world properties include road maps [108], food chains [66], electric power grids [7], metabolic processing networks [93], networks of brain neurons [85], telephone call graphs [69], and protein networks [91]. In practice, if a network has a degree distribution that can be fit with a power law, this is often taken as a sign that the network is scale-free and also small-world. Scale-free networks usually exhibit the small-world property; however, they are by no means the only kind of small-world network.
Networks of very different topology can still qualify as small-world networks as long as they satisfy the defining properties of small-world networks.

Community structure is another important property of networks. It refers to the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network. This inhomogeneity of connections suggests that the network has certain natural divisions within it. Community structures are quite common in real networks [77, 30, 67, 111, 36]. Social networks [30] often include community groups (the origin of the term, in fact) based on common location, interests, occupation, etc. Metabolic networks [71] have communities based on functional groupings. Citation networks [14] form communities by research topic. Being able to identify these sub-structures within a network can provide insight into how network function and topology affect each other. For more discussion of community structures in networks and methods for automatic community detection, we refer to the survey paper by Porter et al. [76] and the book by Mark Newman [73].

1.2 Must-and-Cannot Link

Must-and-cannot links were introduced as supervised information for clustering by Wagstaff et al. [94]. They specify the class relationship between instances. Pairs of instances connected by a must-link must be in the same cluster, and pairs of instances connected by a cannot-link cannot be in the same cluster. This link information thus provides constraints on clustering results. Must-links and cannot-links have some interesting properties. The must-link constraint is an equivalence relation and hence is symmetric, reflexive, and transitive. For example, if instance x and instance y are connected by a must-link, and instance y and instance z are connected by a must-link, then instance x and instance z should also be connected by a must-link. Similarly, must-link constraints can give rise to cannot-link constraints. For example, if x and y are connected by a must-link and y and z are connected by a cannot-link, then x and z should also be connected by a cannot-link. Though simple, must-and-cannot links are powerful and have been used to improve clustering accuracy [95, 19]. For more discussion of must-and-cannot links and clustering for M&C linked data, we refer the reader to the survey paper by I. Davidson [20].

1.3 Research on learning from connected data

Research on Networked Data

The study of networked data has attracted an enormous amount of interest in the last few years. Topics include the measurement and structure of networks, mathematical models of networks, theories of dynamical processes in networks, and prediction based on networks. We review some research topics as follows, but they are by no means a complete list.

• Community Detection aims to detect the community structure in networks [72, 34].

• Dynamics of Networks. Networks, especially social networks and the web, are constantly evolving, with additions and deletions of nodes and edges taking place all the time. Research on the dynamics of networks concerns the statistical properties and models that govern the generation and evolution of real-world networks [15, 60].

• Diffusion and Cascading in Networks. Information cascades refer to the phenomenon in which an action or idea becomes widely adopted due to the influence of others. Studies on diffusion and cascading in networks seek to build models for the flow of information or influence through a large network [33, 49].
• Statistical Modeling concerns modeling the network from the viewpoint of statistics. It aims to learn a statistical model that generates the network as a whole, based on the statistics collected from the network [99, 32].

• Link Prediction aims to infer new interactions among nodes of a given network that are likely to occur in the near future, given a snapshot of a social network [58, 87].

• Link-based Classification aims to improve classification accuracy by exploiting the link structure in addition to the content information of the nodes [62, 22].

Research on M&C Linked Data

Research on M&C linked data focuses on constrained clustering, distance metric learning, and kernel learning.

• Constrained Clustering aims to improve clustering accuracy by leveraging the must-link and cannot-link information [95, 19, 8, 81, 18, 53, 24].

• Distance Metric Learning aims to learn a distance metric that is consistent with the given M&C linked data, i.e., must-link pairs are separated by a shorter distance than cannot-link pairs [102, 82, 43, 31, 104, 100].

• Kernel Learning aims to learn a kernel matrix or function from the given M&C linked data such that the kernel similarity between must-link pairs is larger than the kernel similarity between cannot-link pairs [51, 42, 41, 56, 13, 46].

CHAPTER 2

Contributions of this Dissertation

This dissertation addresses a number of important problems regarding community detection for networked data and clustering for M&C linked data.

Most algorithms developed for community detection in networked data are based on probabilistic models. The advantage of these approaches is that they are easy to understand. Furthermore, these models are able to predict links in addition to detecting community structures. However, there are two major problems with most probabilistic models for community detection. (i) The links between nodes are generated based only on their community memberships. However, in real networks, many more factors could affect link generation. For example, a webpage could point to another webpage simply because the latter is a very popular webpage, such as Google. (ii) The memberships are assumed to be random variables, independent of the content. However, the community memberships are usually determined by the content of the nodes. We address the first limitation by a link-based probabilistic model that takes into account other factors, such as popularity and productivity, when modeling the generation of links. We address the second limitation by presenting a discriminative approach to incorporate the content information. Below, we describe each of the contributions in detail:

• We propose a novel probabilistic model for link-based community detection in networks, referred to as the Popularity and Productivity Link Model for Community Detection [105]. Besides community memberships, it introduces two important factors, namely popularity and productivity, for modeling the link generation. We verify both theoretically and empirically that the proposed model fits well the power-law degree distribution of real networks, a key property of scale-free networks that is overlooked by previous probabilistic models for community detection. This is the first research work that is able to model the community structure and scale-free properties at the same time.

• We present a novel approach to combine link and content information for community detection in networks [106].
Unlike previous work that assumes the community memberships are random variables inferred from the links, the proposed approach models the community memberships by the content information using a discriminative model and infers the model parameters from the data. With the discriminative model, the proposed approach is able to distinguish the relevant attributes for each community from the irrelevant ones. This is the first work that explores a discriminative model for community memberships when incorporating the content information.

Most previous work on clustering for M&C linked data assumes that both the must-link and cannot-link information is perfect. However, the M&C link information is usually derived from side information, such as citations between scientific papers, and can be vulnerable to noise. To address this limitation, we present two approaches for clustering noisy M&C linked data.

• Both approaches address the noise issue by estimating, from the noisy link information, the sufficient data statistics that would be obtained with perfect link information, and then using the estimated sufficient statistics to learn a probabilistic link prediction model.

• The two approaches differ in their assumptions about the data and the noise. We study two different data models and noise models, devise approaches for estimating the feature statistics under the two different models, and compare them empirically.

• We verify, both theoretically and empirically, the effectiveness of the proposed approaches. This is the first work that provides a theoretical analysis showing that the model learned from the noisy M&C link information converges to the model learned from the perfect M&C link information.

CHAPTER 3

Literature Survey

3.1 Community Detection for Networked Data

As online repositories such as digital libraries and user-generated media (e.g., blogs) become more popular, analyzing such networked data has become an increasingly important research issue. One major topic in analyzing such networked data is to detect salient communities among individuals. Community detection has many applications, such as understanding the social structure of organizations and modeling large-scale networks in Internet services [96]. While there are different formulations for community detection, in this work we focus on the unsupervised learning, or clustering, viewpoint, a commonly accepted and well-studied perspective.

A networked data set is usually represented as a graph where individuals in the network are represented by the nodes in the graph. The nodes are tied with each other by either directed or undirected links, which represent the relations among the individuals. In addition to the links they are incident to, nodes are often described by certain attributes, which we refer to as the contents of the nodes. For example, for web pages, online blogs, or scientific papers, the contents are usually represented by histograms of keywords; in a co-authorship network, the contents of nodes can be the demographic or affiliation information of researchers.

The goal of community detection is to find a set of communities such that the nodes in each community are densely connected with each other while sparsely connected with the rest. When content information is available, we may also expect nodes within the same community to share similar contents. However, most existing studies on community detection focus on either link-based analysis or content analysis alone.
Only very recently have people started to combine these two types of information for community detection. In the following, we review community detection methods in these three categories, i.e., link-based analysis, content-based analysis, and analysis based on the combination of link and content information.

3.1.1 Community Detection based on Link Analysis

Link-based approaches for community detection can be roughly classified into two categories, i.e., metric-based approaches and probabilistic modeling based approaches. For metric-based approaches, a metric is first defined to measure the quality of any potential community structure; then, procedures are developed to optimize the proposed metric. Some well-known metrics used for link-based community detection include normalized cut proposed by Shi et al. [83], modularity proposed by Newman et al. [72], and betweenness proposed by Gregory [34]. More recent work in this category introduces probabilistic interpretations for some of these metrics and extends the metrics from undirected networks to directed networks [54, 112]. A main weak point of these metric-based approaches is that the proposed metrics are usually heuristic and do not have solid foundations. As a result, it is difficult to reach a consensus on the optimal metric. These approaches are also limited in that they are unable to make predictions for unobserved links, which is important for a number of applications.

The second category of approaches is based on probabilistic models. They first define a generative process, in which links are generated based on latent community memberships, and then develop inference algorithms to derive the latent community memberships from the observed links. Holland et al. [45] proposed the first stochastic block model for community detection, and later many variations [2, 68, 25, 28, 39, 48] have been proposed. In the stochastic block models, the probability of creating a link between two nodes is assumed to be a constant that depends on the community assignments of the two nodes. Most of the follow-up works essentially refine this probability. In the probabilistic Hyperlink-Induced Topic Search (PHITS) model [16] and the latent Dirichlet allocation link model [10], the probability of observing a link from node i to node j is given by the probability for j to be linked by any node from the community of i. In the graph factorization model [109, 79], this probability is interpreted as a joint probability of observing both nodes simultaneously, which is factorized based on the Bayes product rule. In terms of inference, some of these models [45, 16, 109, 79] are based on maximum likelihood estimation, while others are based on full Bayesian inference [2, 68, 25, 28, 39, 48, 10]. Given that a large number of approaches have been developed in the family of stochastic block models, it is useful to further classify them into subcategories: the symmetric approaches [79, 109], which model links by symmetric joint probabilities, and the conditional approaches [16, 106], which focus on modeling the conditional probability of receiving links. However, neither of these models is satisfying: a symmetric model misses the semantics of link directions, a key factor that distinguishes directed networks from undirected networks, while a conditional model captures only one type of link, either incoming or outgoing, and therefore is unable to characterize nodes across the full spectrum.
As an example, in a blog readership network, there are two types of bloggers: "writers," who generate influential blogs read by many, and "readers," who read a lot but seldom write anything for others to read. Evidently, to characterize these two types of bloggers, it is important to examine both incoming links and outgoing links of the network.

3.1.2 Community Detection based on Content Analysis

Given the content description of nodes, it is appealing to cast the community detection problem as a clustering problem. Many traditional vector-based clustering algorithms, such as k-means [37] and hierarchical clustering algorithms [86, 97], have been applied to content analysis for community detection. In this subsection, we focus on the scenario in which the content is represented as a bag-of-words, since nodes in many online networks are usually webpages. One of the most well-known approaches for such content information is the topic model, where each topic is naturally interpreted as a community. Two well-known topic models are Probabilistic Latent Semantic Analysis (PLSA) [40] and Latent Dirichlet Allocation (LDA) [10]. In these models, each topic is modeled as a probability distribution over words. To generate a document, one first samples a topic from a prior distribution and then samples words for the given topic. One key drawback of topic models is that they are generative probabilistic models and therefore are vulnerable to words that are irrelevant to the target topics.

3.1.3 Community Detection based on Combined Link and Content Analysis

Neither link information nor content information alone is sufficient for detecting the community structure: the link information is usually sparse and noisy and often results in a poor partition of networks, while irrelevant content attributes can significantly mislead the detection of communities. It is therefore important to combine link analysis and content analysis for community detection in networks. Recently, several approaches [17, 28, 68, 113, 110] have been proposed to combine these two types of information for community detection. PHITS-PLSA combines PHITS, a link-based approach, with PLSA, a content-based approach, for community detection [17], and shows a significant improvement over approaches that utilize only link information or only content information. E. Erosheva et al. [28] combine LDA with LDA-Link, an extension of PHITS, for network analysis, referred to as the LDA-Link-Word model. R. Nallapati et al. [68] improve the LDA-Link-Word model by applying different modeling strategies to citing documents and cited documents: more specifically, the LDA-Link-Word model is applied to the citing documents and the PLSA model is applied to the cited documents. Other approaches that exploit LDA for combining link and content analysis can be found in [25, 35]. Despite these significant efforts, the existing approaches for combining link and content information are based on generative models and therefore suffer from the following two shortcomings. First, community membership by itself is insufficient to model links: link patterns are usually affected by factors other than communities, such as the popularity of a node (i.e., how likely the node is to be cited by other nodes). Second, the content information often includes irrelevant attributes; as a result, a generative model without feature selection is vulnerable to the irrelevant keywords and therefore leads to poor performance.
In addition to probabilistic models, other approaches that have been proposed to combine link and content information include matrix factorization [113] and kernel fusion [110] for spectral clustering. Most of them can be more or less viewed as generative models, according to the recent studies on the equivalence between mixture models, k-means, spectral clustering, and matrix factorization [26, 27]. As a result, these approaches usually yield performance similar to that of the generative models. Since the focus of this dissertation is probabilistic modeling approaches, we do not review them in detail.

3.2 Clustering for Must-and-Cannot (M&C) Linked Data

In this section, we review the related work on clustering for must-and-cannot (M&C) linked data. Most approaches in this subject can be classified into three categories: constrained clustering, distance metric learning, and kernel learning.

3.2.1 Constrained Clustering

Constrained clustering tries to improve the accuracy of data clustering by exploiting the M&C linked pairs, also termed pairwise constraints. This approach is also known as semi-supervised clustering. In constrained clustering, the clustering algorithm is modified so that the given pairwise constraints are used to bias the search for an appropriate clustering of the data. There are two types of constrained clustering approaches: (1) those with strict enforcement, which find the best feasible clustering satisfying all the given pairwise constraints [95, 19], and (2) those with partial enforcement, which find the best clustering while maximally respecting the constraints [8, 81, 18, 53]. Several techniques have been developed to fit the clustering results to the constraints: (1) modifying the clustering objective function so that it includes a term (penalty) for satisfying the specified constraints [24, 18], and (2) enforcing all constraints to be satisfied by the cluster assignments generated in each step of an iterative clustering algorithm [95]. More discussion on constrained clustering can be found in [20] and the references therein.

3.2.2 Distance Metric Learning

In distance metric learning, an appropriate distance metric is learned so that must-link pairs are separated by a short distance, while cannot-link pairs are separated by a large distance. Given a learned distance metric, we can then apply existing clustering algorithms, such as k-means, to find the clusters. Many algorithms [102, 82, 43, 31, 104, 100] have been developed for distance metric learning, such as distance metric learning by convex optimization [102], relevance component analysis [82], discriminative component analysis [43], nearest neighbor component analysis [31], local distance metric learning [104], the large margin nearest neighbor classifier [100], information theoretic metric learning [21], distance function learning [89], and learning a Bregman distance function [101]. All these approaches optimize an objective function of the distance metric subject to a set of equality/inequality constraints. The M&C link information serves in the objective [104], in the constraints [82, 21], or in both [102, 100]. More work on distance metric learning from M&C link information can be found in the survey [103] and references therein.

3.2.3 Kernel Learning

In kernel learning, an appropriate kernel matrix/function for a given data set is learned from the M&C link information, and then a similarity-based clustering algorithm (e.g., spectral clustering) is applied with the learned kernel matrix.
Recent work on kernel learning from M&C linked data includes learning a low-rank kernel matrix [51], nonparametric kernel learning [42], active kernel learning [41], spectral kernel learning [56], kernel-based metric adaptation [13], and semi-supervised kernel matrix learning by kernel propagation [46]. These approaches either propose a loss function that measures the inconsistency between a kernel function/matrix and the pairwise constraints and find the optimal kernel by minimizing the overall loss, or find a kernel matrix with certain properties, such as low rank, that satisfies a set of equality/inequality constraints constructed from the M&C link information.

Most studies on clustering M&C linked data assume the link information is noise-free. However, in many applications, the M&C link information is derived from side information, such as the observed connections among instances, making it prone to errors. For example, when classifying research articles with M&C links constructed from paper citations, a cited paper may not share the same research topic as the citing paper. It is important to consider the noise in the M&C link information, since learning directly from the given M&C link information would lead to suboptimal results.

CHAPTER 4

Community Detection for Networked Data: A Popularity and Productivity Link Model

4.1 Introduction

In this chapter, we propose a novel probabilistic framework for directed network community detection, termed the Popularity and Productivity Link model, or PPL for short, that explicitly addresses the shortcomings of the existing stochastic block models. In particular, we model both outgoing links and incoming links by introducing the latent variables productivity and popularity. We demonstrate the generality of the proposed framework by showing that both the symmetric models and the conditional models can be derived from it as special cases, leading to a unification of various seemingly different forms of the existing models. We develop efficient EM algorithms for computing the maximum likelihood solutions to the models proposed in this chapter. Extensive empirical studies show the promising performance of the proposed models in several application domains. Further analysis is conducted to investigate the trade-offs of each stochastic block model when the data characteristics vary.

The rest of the chapter is organized as follows. In Section 4.2, we give background information, including the notation we will use and details of several previous approaches. In Section 4.3, we present the PPL model, several of its variations, and some of their properties. In Section 4.4, we provide a detailed analysis of the relationship between PPL models and several existing stochastic block models. In Section 4.5, we describe an efficient estimation algorithm. In Section 4.6, we show the results of experimental studies. Finally, we conclude in Section 4.7.

4.2 Background

In this section, we first establish some necessary notation for ease of presentation. We then describe two representative existing stochastic block models that are most relevant to our work.

4.2.1 Notation

For a directed network, we denote the nodes by $V = \{1, \cdots, N\}$ and the directed links by $E = \{(i, j) \mid s_{ij} \neq 0\}$, where $s_{ij}$ records the value associated with the link from node $i$ to node $j$. $s_{ij}$ can either be binary, denoting whether there is a link from node $i$ to node $j$, or non-negative, denoting the weight of the link.
For simplicity, following [106], we assume the "link-in" space (i.e., all possible nodes that can point to a particular node) and the "link-out" space (i.e., all possible nodes that can be pointed to by a particular node) of every node to be $V$, i.e., the complete set of nodes. We use $I(i) = \{j \mid s_{ji} \neq 0\}$ to denote the set of all nodes that point to node $i$, and $O(i) = \{j \mid s_{ij} \neq 0\}$ to denote the set of all nodes that are pointed to by node $i$. Let $K$ denote the number of communities, $z_i \in \{1, \cdots, K\}$ denote the community variable of node $i$, and $\gamma_i = (\gamma_{i1}, \cdots, \gamma_{iK})$ denote the community memberships of node $i$. In other words, $\gamma_{ik}$ is the probability of the event $z_i = k$, i.e., that node $i$ belongs to community $k$.

4.2.2 Existing Models

We now review two variants of the well-known stochastic block model [45] that are closely related to the proposed model.

PHITS Model

PHITS [16] is a conditional model that focuses on the conditional link probability $\Pr(j|i)$, i.e., given that node $i$ produces a link, how likely this link points to node $j$ among all nodes. To compute $\Pr(j|i)$, a community variable $z_i$ is first sampled from a multinomial distribution with parameter $\gamma_i$ that describes the community membership of node $i$; then, for a given $z_i$, the conditional link probability is given by $\Pr(j|i, z_i) = \beta_{j z_i}$, where the parameter $\beta_{jk}$ represents the likelihood for node $j$ to be pointed to by any node in community $k$. By integrating out $z_i$, we get the PHITS model

$$\Pr(j|i) = \sum_k \beta_{jk} \gamma_{ik} \qquad (4.1)$$

It is well known that PHITS can be considered an application of PLSA [40] to networked data.

Symmetric Joint Link (SJL) Model

SJL [16, 109, 79] is a symmetric model for community detection. It models the link structure by the joint probability $\Pr(i, j)$, i.e., the likelihood of creating a link between nodes $i$ and $j$, as follows:

$$\Pr(i, j) = \sum_k \Pr(j|k) \Pr(i|k) \Pr(k) = \sum_k \beta_{jk} \beta_{ik} \pi_k \qquad (4.2)$$

In Equation (4.2), $\pi_k$ is the prior probability for a link to be produced in community $k$, and $\beta_{ik}$ and $\beta_{jk}$ are the conditional probabilities that nodes $i$ and $j$ are selected as the two ends of the link. Given the symmetric treatment, i.e., $\Pr(i, j) = \Pr(j, i)$, it is evident that SJL may not be suitable for directed network community detection.

4.3 Popularity and Productivity Link (PPL) Model

In this section, we first present our popularity and productivity link (PPL) model in its general form, then give three variations of the general PPL model, and finally discuss several properties of the PPL models.

4.3.1 General Form of PPL

PPL models the joint link probability $\Pr(i, j)$, i.e., how likely there is a directed link from node $i$ to node $j$. In order to emphasize the different roles played by $i$ and $j$, we write $\Pr(i, j)$ as $\Pr(i_\to, j_\gets)$, denoting that node $i$ plays the role of producing the link and node $j$ plays the role of receiving the link. Following the idea of SJL, we model $\Pr(i_\to, j_\gets)$ as follows:

$$\Pr(i_\to, j_\gets) = \sum_k \Pr(i_\to|k) \Pr(j_\gets|k) \Pr(k) = \sum_k \frac{\gamma_{ik} a_i}{\sum_{i'} \gamma_{i'k} a_{i'}} \cdot \frac{\gamma_{jk} b_j}{\sum_{i'} \gamma_{i'k} b_{i'}} \cdot \sum_{i'} \gamma_{i'k} c_{i'} \qquad (4.3)$$

where

• $\gamma_{ik}$: the probability for node $i$ to belong to community $k$;

• $a_i$: the productivity of node $i$, i.e., among all the nodes, how likely a link is produced by node $i$;

• $b_j$: the popularity of node $j$, i.e., among all the nodes, how likely a link is received by node $j$;

• $c_i$: the weight of node $i$ in determining the community prior $\Pr(k)$ (which will be elaborated momentarily).

To handle scale invariance, we normalize so that $\sum_i a_i = \sum_j b_j = \sum_i c_i = 1$.
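To illustrate how Equation (4.3) is evaluated, the following Python sketch computes the full matrix of joint link probabilities from given memberships $\gamma$ and node weights $a$, $b$, $c$ with NumPy. The function name and the randomly drawn toy parameters are ours and purely illustrative; only the computation follows Equation (4.3).

```python
import numpy as np

def ppl_joint_link_prob(gamma, a, b, c):
    """Joint link probabilities Pr(i->, j<-) of the general PPL model (Eq. 4.3).

    gamma: (N, K) community memberships, rows summing to one.
    a, b, c: (N,) productivity, popularity, and prior weights, each summing to one.
    Returns an (N, N) matrix P with P[i, j] = Pr(i->, j<-).
    """
    eta = gamma.T @ a                    # eta_k = sum_i gamma_ik a_i
    tau = gamma.T @ b                    # tau_k = sum_i gamma_ik b_i
    pi = gamma.T @ c                     # pi_k  = sum_i gamma_ik c_i, the prior
    p_out = (gamma * a[:, None]) / eta   # Pr(i->|k)
    p_in = (gamma * b[:, None]) / tau    # Pr(j<-|k)
    return (p_out * pi) @ p_in.T         # sum_k Pr(i->|k) Pr(j<-|k) Pr(k)

# A tiny random example with hypothetical parameters (not fitted to any data).
rng = np.random.default_rng(0)
N, K = 5, 2
gamma = rng.dirichlet(np.ones(K), size=N)
a = rng.dirichlet(np.ones(N))
b = rng.dirichlet(np.ones(N))
c = rng.dirichlet(np.ones(N))

P = ppl_joint_link_prob(gamma, a, b, c)
print(P.sum())   # close to 1: Pr(i->, j<-) is a distribution over ordered pairs
```

Because $\gamma$, $a$, $b$, and $c$ are normalized, the resulting matrix sums to one over all ordered node pairs, which gives a quick sanity check on the computation.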
Generative Process

We explain Equation (4.3) by the following generative process of PPL:

• Sample a community $z$ according to a prior distribution $\pi_1, \cdots, \pi_K$, where $\pi_k$ is computed by $\pi_k = \sum_{i=1}^N \gamma_{ik} c_i$.

• Given community $z$, the conditional link probability is given by

$$\Pr(i_\to, j_\gets|z) = \Pr(i_\to|z) \Pr(j_\gets|z) = \frac{\gamma_{iz} a_i}{\sum_{i'} \gamma_{i'z} a_{i'}} \cdot \frac{\gamma_{jz} b_j}{\sum_{i'} \gamma_{i'z} b_{i'}} \qquad (4.4)$$

There are two unique features in the above generative process:

• The prior probability $\pi_k = \sum_i \gamma_{ik} c_i$ is constructed as a weighted sum of the node memberships $\gamma_{ik}$, where $c_i$ is used to weight node $i$ in the combination. This construction enforces the consistency between the node memberships $\gamma_{ik}$ and the community prior $\{\pi_k\}_{k=1}^K$. This specific construction of community priors also simplifies the relation between the proposed framework and some existing models for community detection.

• In Equation (4.4), the two ends of the link $i \to j$ are treated differently when modeling $\Pr(i_\to, j_\gets|z)$: besides the dependence on the community memberships $\gamma_{ik}$ and $\gamma_{jk}$, $\Pr(i_\to|z)$ and $\Pr(j_\gets|z)$ are modeled by $a_i$ (i.e., the productivity of node $i$) and $b_j$ (i.e., the popularity of node $j$), respectively, leading to the differentiation of the roles played by the two nodes.

With the joint link probability defined in Equation (4.3), the log-likelihood for the links can be written as

$$L(a, b, c, \gamma) = \sum_{(i,j) \in E} s_{ij} \log \sum_k \frac{\gamma_{ik} a_i}{\sum_{i'} \gamma_{i'k} a_{i'}} \cdot \frac{\gamma_{jk} b_j}{\sum_{i'} \gamma_{i'k} b_{i'}} \cdot \sum_{i'} \gamma_{i'k} c_{i'} \qquad (4.5)$$

Note that we use the original data $s_{ij}$ in the joint link model rather than the normalized data $\hat{s}_{ij} = s_{ij} / \sum_j s_{ij}$ used in conditional link models [16, 106]. The parameters $\gamma$, $a$, $b$, and $c$ can be inferred by maximizing the log-likelihood $L(a, b, c, \gamma)$.

4.3.2 Variants of the General PPL Model

In this subsection, we present three variants of the PPL model obtained by introducing different restrictions on the parameters $a$, $b$, and $c$.

Popularity Link (PoL) Model

In the first restricted variation, we enforce $c_i = a_i, \forall i$ in Equation (4.3), leading to the following expression for $\Pr(i_\to, j_\gets)$:

$$\Pr(i_\to, j_\gets) = \sum_k \gamma_{ik} a_i \frac{\gamma_{jk} b_j}{\sum_{i'} \gamma_{i'k} b_{i'}} \qquad (4.6)$$

We refer to this variant as the Popularity Link (PoL) Model. By assuming $c_i = a_i$, we essentially assume that the prior probability of each community (i.e., $\sum_i \gamma_{ik} c_i$) is identical to the prior probability for a link to be produced from that community (i.e., $\sum_i \gamma_{ik} a_i$).

Productivity Link (PrL) Model

In the second restricted variation, we enforce $c_i$ to be equal to $b_i$, leading to the following expression for $\Pr(i_\to, j_\gets)$:

$$\Pr(i_\to, j_\gets) = \sum_k \frac{\gamma_{ik} a_i}{\sum_{i'} \gamma_{i'k} a_{i'}} \gamma_{jk} b_j \qquad (4.7)$$

We refer to this variant as the Productivity Link (PrL) Model. By assuming $c_i = b_i$, we essentially assume that the prior probability of each community (i.e., $\sum_i \gamma_{ik} c_i$) is identical to the prior probability for a link to be received by that community (i.e., $\sum_i \gamma_{ik} b_i$).

Regularized PPL (PPL-D) Model

In this variation, instead of enforcing a relationship between $c_i$ and $a_i$ or $b_i$, we learn $c_i$ from data under certain regularization. In particular, we introduce a Dirichlet prior for the parameters $c = (c_1, \ldots, c_N)$, i.e., $\Pr(c) \propto \prod_i c_i^{\alpha}$, where $\alpha$ is the hyper-parameter of the Dirichlet distribution. Using the prior $\Pr(c)$ as the regularization, we obtain a MAP estimate of the parameters by maximizing the following log-posterior probability:

$$L(a, b, c, \gamma) + \log \Pr(c) \qquad (4.8)$$

where $L(a, b, c, \gamma)$ is given in Equation (4.5). We call this PPL model regularized by the Dirichlet prior the PPL-D model.

4.3.3 Properties of PPL Model

In this subsection, we show two important properties of the PoL, PrL, and general PPL models.
Equivalence between the Variants of PPL Models

The first property concerns the relationship between the PoL model, the PrL model, and the general PPL model. Surprisingly, although their formulas are different, the optimal solutions for the three models result in identical joint link probabilities and therefore identical data likelihoods. This property is described in the following theorem.

Theorem 1. Under the optimal solutions, the joint link probabilities $\Pr(i_\to, j_\gets)$ of the PoL model, the PrL model, and the general PPL model are the same. That is,

$$\Pr_1(i_\to, j_\gets \mid a^1, b^1, \gamma^1) = \Pr_2(i_\to, j_\gets \mid a^2, b^2, \gamma^2) = \Pr_3(i_\to, j_\gets \mid a^3, b^3, c^3, \gamma^3),$$

where $\Pr_1(i_\to, j_\gets)$, $\Pr_2(i_\to, j_\gets)$, and $\Pr_3(i_\to, j_\gets)$ are the joint link probabilities of the PoL model, the PrL model, and the general PPL model, respectively, and $\{a^1, b^1, \gamma^1\}$, $\{a^2, b^2, \gamma^2\}$, and $\{a^3, b^3, c^3, \gamma^3\}$ are the optimal solutions maximizing the log-likelihoods of the PoL model, the PrL model, and the general PPL model, respectively. In particular, denoting the log-likelihoods of the PoL, PrL, and general PPL models by $L_1(a, b, \gamma)$, $L_2(a, b, \gamma)$, and $L_3(a, b, c, \gamma)$, respectively, we have $L_1(a^1, b^1, \gamma^1) = L_2(a^2, b^2, \gamma^2) = L_3(a^3, b^3, c^3, \gamma^3)$.

In order to prove Theorem 1, we first state the following lemma about the optimal solution to the PPL model given in Equation (4.3).

Lemma 1. Given that $(a^3, b^3, c^3, \gamma^3)$ is the optimal solution maximizing the log-likelihood of the PPL model, define $\pi_k = \sum_i \gamma^3_{ik} c^3_i$. Then we can obtain one set of parameters $(a^1, b^1, \gamma^1)$ such that $\sum_i \gamma^1_{ik} a^1_i = \pi_k$ and $(a^1, b^1, \gamma^1)$ is the optimal solution maximizing the log-likelihood of the PoL model. Similarly, we can obtain another set of parameters $(a^2, b^2, \gamma^2)$ such that $\sum_j \gamma^2_{jk} b^2_j = \pi_k$ and $(a^2, b^2, \gamma^2)$ is the optimal solution maximizing the log-likelihood of the PrL model.

Proof. We first show how to construct such $(a^1, b^1, \gamma^1)$ and $(a^2, b^2, \gamma^2)$. Given $(a^3, b^3, c^3, \gamma^3)$ and $\pi_k = \sum_i \gamma^3_{ik} c^3_i$, we can define $\hat q$ such that $\sum_i \gamma^3_{ik} a^3_i \hat q_k = \pi_k$. We then construct

$$\gamma^1_{ik} = \frac{\gamma^3_{ik} \hat q_k}{\sum_k \gamma^3_{ik} \hat q_k}, \qquad a^1_i = a^3_i \sum_k \gamma^3_{ik} \hat q_k, \qquad b^1_j = \frac{b^3_j \sum_k \gamma^3_{jk} \hat q_k}{\sum_{j'} b^3_{j'} \sum_k \gamma^3_{j'k} \hat q_k},$$

and we can show that $\sum_i \gamma^1_{ik} a^1_i = \pi_k$. We can also define $\tilde q$ such that $\sum_j \gamma^3_{jk} b^3_j \tilde q_k = \pi_k$. We then construct

$$\gamma^2_{ik} = \frac{\gamma^3_{ik} \tilde q_k}{\sum_k \gamma^3_{ik} \tilde q_k}, \qquad b^2_j = b^3_j \sum_k \gamma^3_{jk} \tilde q_k, \qquad a^2_i = \frac{a^3_i \sum_k \gamma^3_{ik} \tilde q_k}{\sum_{i'} a^3_{i'} \sum_k \gamma^3_{i'k} \tilde q_k},$$

and we can show that $\sum_j \gamma^2_{jk} b^2_j = \pi_k$. With the constructed $(a^1, b^1, \gamma^1)$ and $(a^2, b^2, \gamma^2)$, we can show that

$$L_3(a^3, b^3, c^3, \gamma^3) = L_1(a^1, b^1, \gamma^1) = L_2(a^2, b^2, \gamma^2).$$

Next, we need to show that $(a^1, b^1, \gamma^1)$ is the optimal solution to the PoL model and $(a^2, b^2, \gamma^2)$ is the optimal solution to the PrL model. We prove this by contradiction. Assume there exists another set of parameters $(a^*, b^*, \gamma^*)$ such that $L_1(a^*, b^*, \gamma^*) > L_1(a^1, b^1, \gamma^1) = L_3(a^3, b^3, c^3, \gamma^3)$; then $L_3(a^*, b^*, a^*, \gamma^*) = L_1(a^*, b^*, \gamma^*) > L_3(a^3, b^3, c^3, \gamma^3)$, which contradicts the fact that $(a^3, b^3, c^3, \gamma^3)$ is the optimal solution to the PPL model. Similarly, we can show that $(a^2, b^2, \gamma^2)$ is the optimal solution to the PrL model. This completes the proof.

Following the above lemma, we can easily prove Theorem 1. One implication of this theorem is that the set of optimal solutions to the general PPL model is not a single point. As a consequence, if in addition to the joint link probability we also care about the exact solution for the community membership $\gamma$, then we should not directly apply PPL in its general form.
Instead, we should choose PoL or PrL if the MLE solution is needed, or PPL-D if the MAP solution is needed.

Perfect Fitting of the Indegree and Outdegree Distributions

The second property of the PPL model concerns degree fitting. It turns out that the optimal solutions to the PoL model, the PrL model, and the general PPL model all fit exactly the degree distributions (both indegree and outdegree) of the network data. This is described in the following theorem.

Theorem 2. The model outdegree distribution $\Pr(i_\to)$ and the model indegree distribution $\Pr(j_\gets)$ of the PoL model, the PrL model, and the general PPL model fit exactly the actual outdegree and indegree distributions of the network data.

Proof. We can easily show that the optimal solution for $a_i$ in the PoL model is equal to the normalized outdegree of node $i$, i.e., $a^1_i = \frac{\sum_j s_{ij}}{\sum_{ij} s_{ij}}$, and the optimal solution for $b_j$ in the PrL model is equal to the normalized indegree of node $j$, i.e., $b^2_j = \frac{\sum_i s_{ij}}{\sum_{ij} s_{ij}}$. From the model formulation in Equation (4.6) for the PoL model, we have

$$\Pr_1(i_\to \mid a^1, b^1, \gamma^1) = \sum_j \Pr_1(i_\to, j_\gets \mid a^1, b^1, \gamma^1) = a^1_i.$$

So the model outdegree distribution of the PoL model fits exactly the actual outdegree distribution of the network. From the model formulation in Equation (4.7) for the PrL model, we have

$$\Pr_2(j_\gets \mid a^2, b^2, \gamma^2) = \sum_i \Pr_2(i_\to, j_\gets \mid a^2, b^2, \gamma^2) = b^2_j.$$

So the model indegree distribution of the PrL model fits exactly the actual indegree distribution of the network. Following Theorem 1, we have

$$\Pr_3(i_\to \mid a^3, b^3, c^3, \gamma^3) = \Pr_2(i_\to \mid a^2, b^2, \gamma^2) = \Pr_1(i_\to \mid a^1, b^1, \gamma^1) = a^1_i$$

and

$$\Pr_3(j_\gets \mid a^3, b^3, c^3, \gamma^3) = \Pr_1(j_\gets \mid a^1, b^1, \gamma^1) = \Pr_2(j_\gets \mid a^2, b^2, \gamma^2) = b^2_j.$$

We conclude that the model indegree and outdegree distributions estimated from the PoL, PrL, and PPL models fit exactly the actual indegree and outdegree distributions of the network.

This degree-fitting property is a consequence of the concepts of productivity and popularity. We argue that degree fitting is a very important property for a generative model, because in the real world most networks have heavy-tailed (or power-law) degree distributions. So far, no existing stochastic block model is guaranteed to generate degree distributions that fit both the indegree and outdegree distributions of real-world networks.

4.4 Relationship with Existing Models

In this section, we describe the relationship between PPL and several existing models, including the conditional link models, namely PCL [106] and PHITS [16], and the symmetric joint link (SJL) model [79, 109]. It turns out that these existing models can all be considered special cases of PPL with different constraints. Such a connection demonstrates that PPL provides a consistent framework to unify the existing models.

4.4.1 Relationship with Conditional Link Models

We show that the Popularity Conditional Link (PCL) model in [106] is a conditional version of the Popularity Link (PoL) model described in Section 4.3.2. Starting from the joint probability given in Equation (4.6), we can express the conditional probability of the PoL model as

$$\Pr(j_\gets \mid i_\to) = \frac{\Pr(i_\to, j_\gets)}{\Pr(i_\to)} = \sum_k \gamma_{ik} \frac{\gamma_{jk} b_j}{\sum_{i'} \gamma_{i'k} b_{i'}} \qquad (4.9)$$

Note that in the above derivation we use the fact that $\Pr(i_\to) = a_i$, which is obtained in the proof of Theorem 2. Equation (4.9) is exactly the same as the Popularity Conditional Link (PCL) model proposed in [106]. Because of this connection, in the following discussion we also refer to the PCL model described in Equation (4.9) (and in [106]) as the PoCL model.
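As a quick numerical illustration of Theorem 2 and of the conditional view in Equation (4.9), the sketch below builds a small random network (a hypothetical toy example, not one of the data sets used in this chapter), sets $a$ to the normalized outdegree as the PoL MLE prescribes, and checks that the model outdegree marginal reproduces $a$ exactly and that conditioning on the producing node yields a proper distribution over link targets.

```python
import numpy as np

def pol_joint_link_prob(gamma, a, b):
    """Joint link probabilities Pr(i->, j<-) of the PoL model (Equation 4.6)."""
    tau = gamma.T @ b                      # tau_k = sum_i gamma_ik b_i
    p_in = (gamma * b[:, None]) / tau      # Pr(j<-|k) = gamma_jk b_j / tau_k
    return (gamma * a[:, None]) @ p_in.T   # sum_k gamma_ik a_i Pr(j<-|k)

# Hypothetical toy network: S[i, j] is the weight of the link i -> j.
rng = np.random.default_rng(1)
S = rng.integers(1, 3, size=(6, 6))
np.fill_diagonal(S, 0)

# Set a to the normalized outdegree, as the PoL MLE prescribes (Theorem 2);
# the memberships and popularities are arbitrary here.
a = S.sum(axis=1) / S.sum()
gamma = rng.dirichlet(np.ones(3), size=6)
b = rng.dirichlet(np.ones(6))

P = pol_joint_link_prob(gamma, a, b)

# Model outdegree distribution Pr(i->) = sum_j Pr(i->, j<-) equals a exactly.
print(np.allclose(P.sum(axis=1), a))        # True

# Conditioning on the producing node recovers the PoCL model of Equation (4.9).
pocl = P / a[:, None]
print(np.allclose(pocl.sum(axis=1), 1.0))   # each row is a distribution over j
```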
Following a similar idea, from the Productivity Link (PrL) model we can derive a Productivity Conditional Link model by computing the conditional probability $\Pr(i_\rightarrow|j_\leftarrow)$ from Equation (4.7) as follows:
$$\Pr(i_\rightarrow|j_\leftarrow) = \frac{\Pr(i_\rightarrow, j_\leftarrow)}{\Pr(j_\leftarrow)} = \sum_k \gamma_{jk} \frac{\gamma_{ik} a_i}{\sum_{i'} \gamma_{i'k} a_{i'}} \qquad (4.10)$$
In the above derivation, we use the fact that $\Pr(j_\leftarrow) = b_j$, which was obtained in the proof of Theorem 2. Because of its connection to PrL, we refer to this new conditional model as PrCL. As we can see, PoCL and PrCL capture the conditional link probability in different directions: PoCL depends on the popularity of the receiving node $j$, while PrCL depends on the productivity of the producing node $i$. In addition, both PoCL and PrCL can be naturally derived from the PPL models, i.e., PoL and PrL. Because, as discussed before, PHITS is a relaxed version of PoCL, it can obviously also be derived from PPL.

4.4.2 Relationship with Symmetric Joint Link Models

To show its relationship with SJL, we enforce $c_i = a_i = b_i$ for all $i$ in the general PPL model. From a probabilistic viewpoint, this restriction requires that for each node the probability of producing links equals the probability of receiving links. With this restriction, Equation (4.3) reduces to
$$\Pr(i_\rightarrow, j_\leftarrow) = \sum_k \frac{\gamma_{ik} c_i}{\sum_{i'} \gamma_{i'k} c_{i'}} \cdot \frac{\gamma_{jk} c_j}{\sum_{j'} \gamma_{j'k} c_{j'}} \cdot \sum_{i'} \gamma_{i'k} c_{i'}$$
The following theorem shows that this restricted version of PPL is exactly the SJL model.

Theorem 3. Under the constraint that $a_i = b_i = c_i$ for all $i$, the general PPL model is equivalent to the SJL model.

Proof. The joint link probability of the SJL model is given in Equation (4.2), i.e.,
$$\Pr(i_\rightarrow, j_\leftarrow) = \sum_k \beta_{ik}\beta_{jk}\pi_k$$
The community membership of the SJL model is defined as [109, 79]
$$\gamma^f_{ik} = \frac{\beta_{ik}\pi_k}{\sum_k \beta_{ik}\pi_k} \qquad (4.11)$$
and we can also define $c^f_i$ as
$$c^f_i = \sum_k \beta_{ik}\pi_k \qquad (4.12)$$
Similarly, given a solution $(\gamma, c)$ to the PPL model with $a = b = c$, we can define
$$\pi^p_k = \sum_i \gamma_{ik} c_i, \qquad \beta^p_{ik} = \frac{\gamma_{ik} c_i}{\sum_{i'}\gamma_{i'k} c_{i'}} \qquad (4.13)$$
All we need to show is that (i) given that $(\beta, \pi)$ is the optimal solution to the SJL model, $(\gamma^f, c^f)$ defined in Equations (4.11, 4.12) is the optimal solution to the PPL model under the restriction $a = b = c$; and (ii) given that $(\gamma, c)$ is the optimal solution to the PPL model under the restriction $a = b = c$, $(\pi^p, \beta^p)$ defined in Equation (4.13) is the optimal solution to the SJL model. First note that
$$L_0(\beta, \pi) = L_3(\gamma^f, c^f), \qquad L_3(\gamma, c) = L_0(\beta^p, \pi^p)$$
where $L_0$ and $L_3$ are the log-likelihoods of the SJL model and the PPL model, respectively. Given that $(\beta, \pi)$ is the optimal solution to the SJL model, if there existed $(\gamma^*, c^*)$ such that $L_3(\gamma^*, c^*) > L_3(\gamma^f, c^f) = L_0(\beta, \pi)$, then we could construct $(\beta^*, \pi^*)$ as in Equation (4.13) such that $L_0(\beta^*, \pi^*) = L_3(\gamma^*, c^*) > L_0(\beta, \pi)$, which contradicts the assumption that $(\beta, \pi)$ is the optimal solution to the SJL model. Similarly, given that $(\gamma, c)$ is the optimal solution to the PPL model, if there existed $(\pi^*, \beta^*)$ such that $L_0(\pi^*, \beta^*) > L_0(\pi^p, \beta^p) = L_3(\gamma, c)$, then we could construct $(\gamma^*, c^*)$ as in Equations (4.11, 4.12) such that $L_3(\gamma^*, c^*) = L_0(\pi^*, \beta^*) > L_3(\gamma, c)$, which contradicts the assumption that $(\gamma, c)$ is the optimal solution to the PPL model. Therefore, the PPL model under the restriction $a = b = c$ is equivalent to the SJL model.

The relationship revealed by Theorem 3 shows that SJL is a special case of PPL with the constraint that each node has the same probability of producing and receiving links, which is appropriate only for modeling undirected networks.

4.4.3 Summary

In Table 4.1, we summarize all the models discussed in this paper.
Models that are newly developed in this paper are print in bold. We believe such a unified picture, offered through the PPL model, will be very helpful for understanding and further studying different stochastic block models for community detection. Table 4.1. Taxonomy of the models categorized by type and variables conditional joint symmetric 4.5 popularity PHITS, PoCL PoL productivity PrCL PrL both PPL, PPL-D SJL Estimation Algorithm In this section, we present efficient EM algorithms for computing the MLE solutions to PoL and PrL and the MAP solution to PPL-D. Because the derivation of the algorithms is rather lengthy, here we only present the final form of the algorithms as well as offer several observations, and we provide the detailed derivation in the appendix. Theorem 4. The following EM algorithms converge to the MLE solutions to PoL and PrL, and the MAP solution to PPL-D. E-step: qijk ∝ Prt−1 (i, j, k) where t-1 indicates the result in the previous iteration 28 M-step: nik , PoL : γik = τ mk bi + nout (i) bi = nin (i) , τ k mk γik ai = nout (i) i nout (i) nik , + nin (i) ai = nout (i) , η k mk γik bi = nin (i) i nin (i) PrL : γik = η mk ai ζ PPL-D : ζ nik + mik γik = mi + eα ,c = η ζ i ζ mk ai + mτ bi + mi i (mi + eα) k nin (i) nout (i) , bi = ai = η τ k mk γik k mk γik where e is the summation of all sij and the rest variables are defined as: ηk = i′ γ t−1 at−1 , τk = ′ ′ ik i nin (i, k) = j′ sji qjik , nout (i, k) = j∈I(i) nin (i) = t−1 γik ct−1 i γ t−1 bt−1 , ζik = j ′k j ′ t−1 t−1 i ′ γ i ′ k ci ′ sij qijk j∈O(i) nin (i, k), nout (i) = k nout (i, k) k nik = nin (i, k) + nout (i, k), mk = sij qijk (i→j)∈E mτ = k (i→j)∈E sij qijk τk ζ , sij qijk , mik = ζik η mk = (i→j)∈E sij qijk ηk ζ ζ mi = mik . k (i→j)∈E Proof. In the E-step, we would bound the log-likelihood from below. The key point is to apply the Jensen inequality log k pk ≥ k qk log pk /qk , where = 1, to the logx sum-term in the log-likelihood and to apply the inequality − log x ≥ 1 − − log y to the y summation term in the denominator of the log-sum-term in the log-likelihood. In particular, at the t-th iteration the log-sum-term is lower bounded as log k Pr(i, j, k) ≥ qijk log Pr(i, j, k)/qijk k 29 k qk with qijk computed as qijk ∝ Prt−1 (i, j, k) where superscript t − 1 means the probability is computed under the values of the parameters in the (t − 1)-th iteration. Then the denominator term in Pr(i, j, k) would be lower bounded as − log i′ − log γi′ k ai′ ≥ 1 − γi′ k bi′ ≥ 1 − i′ i′ γi′ k ai′ ηk i′ γi′ k bi′ τk − log ηk − log τk with ηk , τk computed as ηk = i′ and the summation term γ t−1 at−1 ′ ′ i ′ γ i ′ k ci ′ log i′ ik j′ γ t−1 bt−1 ′ ′ jk j in PPL model is lower bounded as γ i ′ k ci ′ ≥ t−1 t−1 γik ci with ζ computed as ζik = τk = i t−1 t−1 i ′ γ i ′ k ci ′ i′ ζi′ k log γi′ k ci′ /ζi′ k . Due to the limit of space, we omit the details about deriving the lower bound of the three log-likelihoods. In the M-step, we will maximize the corresponding lower bound over the corresponding parameters as follows: PoL : PrL : k sij (i,j)∈E j′ γi′ k ai′  ηk i′  sij PPL-D :  qijk log γjk γik ai − k (i,j)∈E γj ′ k bj ′ qijk log γik γjk bj − sij (i,j)∈E −  k  qijk log γik ai γjk bj − τk   γi′ k ai′ ηk i′  γi′ k bi′ ζi′ k log γi′ k  (ζi′ k + α) log ci + + τk i′ i′ i′ By taking the derivatives of the expressions and setting them to zero, we can obtain the corresponding formulas in the M-step. 
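As a rough illustration of how the E-step can be kept linear in the number of links, the sketch below computes the responsibilities q_ijk only for observed edges and normalizes them over the communities; `joint_term` is a hypothetical callable standing in for the model-specific quantity Pr^{t-1}(i, j, k) of PoL, PrL, or PPL-D, which is not spelled out here.

```python
import numpy as np

def e_step_responsibilities(edges, K, joint_term):
    """E-step of the EM algorithms: q_ijk proportional to Pr^{t-1}(i, j, k).

    edges:      list of (i, j, s_ij) for observed links only
    K:          number of communities
    joint_term: callable (i, j, k) -> Pr^{t-1}(i, j, k) under the current
                parameters (model-specific stand-in for PoL / PrL / PPL-D)
    Returns {(i, j): length-K array q_ij}, computed only for observed links,
    so the cost per iteration stays linear in the number of links.
    """
    q = {}
    for i, j, _ in edges:
        p = np.array([joint_term(i, j, k) for k in range(K)])
        q[(i, j)] = p / p.sum()  # normalize the responsibilities over communities
    return q
```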
It can be observed from the EM algorithm that in every iteration (and therefore in the final solution), for each node $i$, its productivity $a_i$ is proportional to its outdegree and its popularity $b_i$ is proportional to its indegree. This is consistent with our intention that the productivity of a node reflects how likely it is to produce a link and the popularity of a node reflects how likely it is to receive a link. In addition, it is worth mentioning that in the actual implementation we avoid explicitly computing all the $q_{ijk}$'s (whose number is $N^2 K$, which can be extremely large). Instead, the $q_{ijk}$'s are computed in an "on-demand" fashion. We can show that the per-iteration complexity of our EM algorithms is linear in the number of links in the network. Therefore, the algorithm is very efficient because in most real applications networks are sparse, so the number of links is usually manageable.

4.6 Experiments

In this section, we present the experimental results. We evaluate a variety of models (variants of PPL and existing models) on two tasks: community detection and link prediction. In addition, we investigate the issue of degree fitting. We start by describing the data sets used in the experiments.

4.6.1 Data Sets

In the following experiments, we use a blog network and two paper citation networks.

Political Blog Network: This is a directed network of hyperlinks between a set of weblogs about US politics, recorded by Adamic and Glance [1]. In this network, there are 1,490 nodes and 19,090 links in total. Each node is labeled as either conservative or liberal.

Paper Citation Networks: We use the Cora paper citation network and the Citeseer paper citation network processed by Getoor et al. (http://www.cs.umd.edu/projects/linqs/projects/lbc/). There are 2,708 nodes and 5,429 links in the Cora network, and 3,312 nodes and 4,732 links in the Citeseer network. Each paper in the Cora network is categorized into one of 7 classes (e.g., Genetic Algorithms, Neural Networks, etc.), and each paper in the Citeseer network is labeled as one of 6 classes.

4.6.2 Community Detection

In the first task, communities are detected from the networks. The real class labels in the data sets are used as the ground truth to evaluate the communities detected by different models. More specifically, we use the following evaluation metrics.

Evaluation Metrics for Community Detection: We use three commonly used metrics for evaluating the performance of community detection, namely normalized mutual information (NMI), pairwise F measure (PWF), and modularity (Modu). We first describe the three metrics in detail.

Normalized mutual information (NMI) is defined as follows: given the true community structure $C = \{C_1, \ldots, C_K\}$, where $C_k$ denotes the set of nodes in the $k$-th community, and the community structure $C' = \{C'_1, \ldots, C'_K\}$ obtained from a model, the mutual information is computed as
$$\mathrm{MI}(C, C') = \sum_{C_k, C'_l} p(C_k, C'_l) \log \frac{p(C_k, C'_l)}{p(C_k)\,p(C'_l)}$$
where $p(C_k)$ denotes the probability that a randomly selected node belongs to $C_k$, and $p(C_k, C'_l)$ denotes the joint probability that a randomly selected node belongs to both $C_k$ and $C'_l$. The normalized mutual information is defined as
$$\mathrm{NMI}(C, C') = \frac{\mathrm{MI}(C, C')}{\max(H(C), H(C'))}$$
where $H(C) = \sum_k p(C_k) \log \frac{1}{p(C_k)}$ is the entropy of partition $C$.

Pairwise F measure (PWF) is another commonly used measure for evaluating clustering algorithms.
Let $T$ be the set of node pairs $(i, j)$ where nodes $i$ and $j$ belong to the same community in the ground truth, and let $S$ be the set of node pairs that belong to the same community in the output of a specific model. The pairwise F measure is computed from pairwise precision and recall as
$$\mathrm{precision} = \frac{|S \cap T|}{|S|}, \qquad \mathrm{recall} = \frac{|S \cap T|}{|T|}, \qquad \mathrm{PWF} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
where $|\cdot|$ denotes the cardinality of a set.

Note that computing the normalized mutual information and the pairwise F measure requires the ground truth. However, in some cases the ground truth does not necessarily reflect the link structure faithfully. Therefore, we also use another measure, directed modularity (Modu), proposed by Leicht et al. [54] for measuring community partitions in directed networks without using the ground truth. The directed modularity is defined as
$$\mathrm{Modu} = \frac{1}{e}\sum_{ij}\left(s_{ij} - \frac{d_{out}(i)\, d_{in}(j)}{e}\right)\delta(c_i, c_j)$$
where $d_{in}(i)$ and $d_{out}(i)$ are the indegree and outdegree of node $i$ in the network, $e$ is the number of directed links in the network, $c_i$ denotes the community of node $i$ assigned by a model, and $\delta(\cdot,\cdot)$ is the Kronecker delta function. For all three metrics, i.e., NMI, PWF, and Modu, larger values correspond to better performance.

Performance on Community Detection: The community detection performance of the different models on the three data sets is given in Tables 4.2, 4.3, and 4.4. Among the models, PHITS, PoCL, and PrCL are conditional link models. PHITS [17] represents the model described in Equation (5.4); PoCL represents the Popularity Conditional Link model [106] described in Equation (4.9); PrCL represents the Productivity Conditional Link model described in Equation (4.10). SJL represents the symmetric link model described in [79]. PoL, PrL, and PPL-D are the joint link models proposed in this work. All the EM algorithms for MLE and MAP are run for 100 iterations, which according to our observations is more than enough for convergence. To alleviate the problem of local optima in the EM algorithms, for each test we conduct 10 trials with different random initializations and choose the one giving the largest likelihood. The prior α for the parameter c in PPL-D is set to 1. We found the performance insensitive to α: we tested values of α ranging from 0.01 to 1, and the results are almost the same. From the performance results, we can make the following comparisons and observations.

Joint link models vs. conditional link models: Joint link models clearly outperform conditional link models. Our joint link models, i.e., PoL, PrL, and PPL-D, always achieve the top performance and clearly outperform their conditional counterparts PoCL and PrCL. Even the symmetric joint link model SJL outperforms its conditional counterpart PHITS in most cases. This result verifies our assumption that modeling both the behavior in receiving links (popularity) and the behavior in producing links (productivity) is better than modeling just one behavior or none at all.

Non-symmetric vs. symmetric joint link models: Comparing the performance of the non-symmetric link models, i.e., PoL, PrL, and PPL-D, with that of the traditional SJL model, which is symmetric, we can see that the non-symmetric models consistently outperform SJL, and the improvement is quite significant in many cases. This verifies the benefit of separating the behavior of nodes in receiving links from that in producing links over simply ignoring the direction of links.
PPL models without vs. with restrictions: Comparing PPL-D, which does not restrict c other than imposing a weak prior, with PoL, PrL and SJL, which enforce c = a, c = b, and c = a = b, respectively, we can see that PPL-D has the best performance. However, as shown in Section 4.3.3, we can always derive from PPL-D a PoL and a PrL solution that give identical data likelihood, so this result suggests that PPL-D tends to find better solutions for the community memberships.

Popularity vs. productivity: If we could choose only one factor between popularity and productivity for community detection in our data sets, it seems that popularity has a small edge over productivity. This can be observed both in the joint link models (PoL over PrL) and in the conditional models (PoCL over PrCL). Such a result suggests that to determine the community membership of a node i in these three data sets, the nodes pointing to i may be more important than the nodes pointed to by i.

Table 4.2. Community detection performance on the Political Blog data set.
  Algo.    NMI     PWF     Modu
  PHITS    0.3829  0.7152  0.4200
  PoCL     0.4905  0.7947  0.4270
  PrCL     0.4569  0.7776  0.4243
  SJL      0.4409  0.7425  0.4323
  PoL      0.5156  0.8072  0.4324
  PrL      0.5178  0.8091  0.4324
  PPL-D    0.5365  0.8167  0.4324

Table 4.3. Community detection performance on the Cora data set.
  Algo.    NMI     PWF     Modu
  PHITS    0.0591  0.1862  0.3594
  PoCL     0.0797  0.1982  0.5982
  PrCL     0.0211  0.1666  0.4959
  SJL      0.0602  0.1840  0.6091
  PoL      0.0886  0.2014  0.6310
  PrL      0.0870  0.1993  0.6307
  PPL-D    0.0972  0.2085  0.6381

Table 4.4. Community detection performance on the Citeseer data set.
  Algo.    NMI     PWF     Modu
  PHITS    0.0117  0.1788  0.4374
  PoCL     0.0292  0.1909  0.6214
  PrCL     0.0131  0.1805  0.5954
  SJL      0.0236  0.1896  0.6348
  PoL      0.0292  0.1921  0.6648
  PrL      0.0263  0.1904  0.6612
  PPL-D    0.0317  0.1948  0.6687

4.6.3 Link Prediction

In this task, we study the performance of the joint link models on predicting links (both incoming and outgoing links). Specifically, for each node in the network we randomly hide one of its incoming links and one of its outgoing links and ask each model to recover the missing links. Such a task has practical value in applications such as friend recommendation in social networks and citation suggestion in citation networks.

Evaluation Metric for Link Prediction: We measure the performance of link prediction by recall. Two types of recall are reported, namely outlink recall and inlink recall. The outlink recall measures the ability of a model to predict the nodes pointed to by a given node; the inlink recall measures the ability of a model to predict the nodes pointing to a given node. To compute the outlink recall for node i, we first compute the outlink probabilities for node i to all other nodes by
$$\Pr(j_\leftarrow|i_\rightarrow) = \frac{\Pr(i_\rightarrow, j_\leftarrow)}{\sum_j \Pr(i_\rightarrow, j_\leftarrow)}.$$
The resulting probabilities assign an outlink rank to each node j. The outlink recall at rank position K is defined as the fraction of nodes whose top-K ranked predictions contain the true missing link. The inlink recall is defined similarly based on $\Pr(j_\rightarrow|i_\leftarrow)$. In addition, we also report the average of the inlink and outlink recalls.

Performance on Link Prediction: The recalls at top-1 through top-20 on the three data sets are given in Figures 4.1, 4.2, and 4.3. All results are averaged over 10 trials with different randomly selected missing links.
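To make the evaluation protocol concrete, the following minimal Python sketch computes the outlink recall at ranks 1 through K from a matrix of model joint link probabilities; the array `joint_prob` and the dictionary `hidden_links` are hypothetical inputs standing in for a fitted model and the randomly hidden links, and the inlink recall would be obtained the same way from the transposed scores.

```python
import numpy as np

def outlink_recall_at_k(joint_prob, hidden_links, K=20):
    """Recall at ranks 1..K for outgoing-link prediction.

    joint_prob:   (N, N) array, joint_prob[i, j] ~ Pr(i->, j<-) under a fitted model
    hidden_links: dict mapping node i -> index j of its hidden outgoing link
    Returns recall[k-1] = fraction of nodes whose top-k list contains the target.
    """
    hits = np.zeros(K)
    for i, j_true in hidden_links.items():
        scores = joint_prob[i].copy()   # Pr(j<- | i->) is proportional to the joint
        scores[i] = -np.inf             # never predict a self-link
        ranked = np.argsort(-scores)    # descending by probability
        rank = int(np.where(ranked == j_true)[0][0])
        if rank < K:
            hits[rank:] += 1            # a hit at rank r counts for every k >= r + 1
    return hits / max(len(hidden_links), 1)
```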
Because PoL, PrL and the general PPL model have equal link probabilities, and because we also found that PPL-D achieves almost the same performance as PoL and PrL, we report only one result for these models, denoted as the P-family models. We also report the results of a naive baseline, the Frequency-based model, in which the outgoing link probabilities are proportional to the indegree of nodes, i.e., $\Pr(j_\leftarrow|i_\rightarrow) \propto d_{in}(j)$, and the incoming link probabilities are proportional to the outdegree of nodes, i.e., $\Pr(j_\rightarrow|i_\leftarrow) \propto d_{out}(j)$.

Figure 4.1. (a)∼(c): Recalls on Political Blog data. (d): Degree distribution.

As can be seen from the figures, compared to SJL and the Frequency-based baseline, the P-family models perform the best in all cases except the inlink recall on the Cora network. This result illustrates that most of the time it is beneficial to use productivity and popularity to model the indegree and outdegree distributions separately in a directed network. However, the inlink recall on the Cora network is an abnormal case, where SJL performs the best, the P-family models perform worse, and the Frequency-based model has extremely poor performance (almost constantly zero). To see why this case is special, we show the degree distributions of the three networks in the rightmost panels of Figures 4.1, 4.2, and 4.3.

Figure 4.2. (a)∼(c): Recalls on Cora data. (d): Degree distribution.

All the degree distributions follow a power law except the outdegree distribution of the Cora network. The outdegree in Cora follows a rather uniform distribution with outdegree no larger than 5. (We suspect such a distribution is due to the small scale of the Cora data, which causes many references, and therefore outlinks, to fall outside the data set.) Because of such a uniform distribution, the outdegrees of nodes are not informative, which explains the extremely poor performance of the Frequency-based model. The P-family models treat indegree and outdegree as equally important and therefore also suffer from the uninformative outdegree distribution. SJL, in comparison, ignores the link direction and as a result lets the more informative indegree distribution dominate the uninformative outdegree distribution, and therefore suffers the least. This special case reveals some of the trade-offs made by the different models.
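For reference, the degree distributions discussed here can be tabulated directly from the edge list, as in the short sketch below (a generic helper, not tied to any particular data set): a heavy-tailed network shows a long tail in these counters, whereas Cora's outdegree counter concentrates on values no larger than 5.

```python
import numpy as np
from collections import Counter

def degree_distributions(edges, n_nodes):
    """In- and outdegree histograms of a directed network, i.e. the
    number of nodes having each degree value."""
    indeg = np.zeros(n_nodes, dtype=int)
    outdeg = np.zeros(n_nodes, dtype=int)
    for i, j in edges:
        outdeg[i] += 1
        indeg[j] += 1
    return Counter(indeg.tolist()), Counter(outdeg.tolist())
```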
Figure 4.3. (a)∼(c): Recalls on Citeseer data. (d): Degree distribution.

4.6.4 Degree Fitting

Finally, we verify the degree fitting property of the PPL models. Figure 4.4(a) shows the scatter plots for the indegree and outdegree fitting of the PPL models on the Political Blog data set. Note that PoL, PrL and PPL-D again give almost the same result in this experiment, so we refer to them collectively as PPL. Each point in the plot represents a node: its position on the horizontal axis is determined by its actual degree (indegree or outdegree) and its position on the vertical axis is determined by the degree predicted by the model. Therefore, a point falling on the diagonal line (the red lines in the plots) indicates a perfect degree match. As can be seen from the figure, all the points fall on the red line, which indicates that PPL captures the degree of each node exactly. In comparison, as shown in Figure 4.4(b), SJL has very poor performance in terms of degree fitting. Similar results are obtained for the paper citation data sets; Figures 4.5(a) and 4.5(b) show some of these results. These empirical studies clearly validate the degree fitting property of the PPL models stated in Section 4.3.3.

Figure 4.4. Degree fitting on the Political Blog data. (a): degree fitting by PPL. (b): degree fitting by SJL.

Figure 4.5. Indegree fitting on Cora and outdegree fitting on Citeseer. (a): degree fitting by PPL. (b): degree fitting by SJL.

4.7 Conclusions

The stochastic block model is a promising probabilistic model for community detection. In this paper, we presented a new stochastic block model, PPL, for community detection in directed networks. On one hand, our model is complete, in that it captures the roles of each node both as a link producer and as a link receiver while a single consistent community membership serves both roles; on the other hand, our model is unified, in that it offers a unified framework for connecting and understanding existing models. We believe such a complete and unified model provides a solid foundation for further studies of stochastic block models for community detection.

CHAPTER 5

Community Detection for Networked Data: A Discriminative Approach for Combining Link and Content

5.1 Introduction

In addition to the link information, nodes in networks are often described by certain attributes, which we refer to as the contents of the nodes. For example, for web pages, online blogs, or scientific papers, the contents are usually represented by histograms of keywords; in a co-authorship network, the contents of nodes can be the demographic or affiliation information of researchers. Many existing studies on community detection focus on either link analysis or content analysis alone. However, neither source of information alone is sufficient for determining the community memberships accurately: the link information is usually sparse and noisy and often results in a poor partition of the network, while irrelevant content attributes can significantly mislead the process of community detection.
It is therefore important to combine the link analysis and content analysis for community detection in networks. In this chapter, we propose a discriminative model of combining link and content analysis for community detection that explicitly addresses the above shortcomings of existing approaches. Our main contributions are summarized as follows. • We propose a conditional model for link analysis. In contrast to generative models, our approach does not attempt to generate the links; instead, the conditional probability for the destination of a given link is to be captured. To achieve this, in our model we introduce a hidden variable to capture the popularity of a node in terms of how likely 42 the node is cited by other nodes. • To alleviate the impact of irrelevant content attributes, we adopt a discriminative approach to make use of the node contents. We refer to this part as discriminative content model. As a consequence, the attributes are automatically weighed by their discriminative power in terms of telling apart salient communities. • We combine the above two models into a unified framework and propose a novel twostage optimization algorithm for the maximum likelihood inference. In addition, we show how the proposed link model and content model can be used to extend existing complementary approaches. Additional algorithms are presented to solve the extended models. To the best of our knowledge, the model proposed in this chapter is the first that combines conditional link models and discriminative content models for community detection. We conduct extensive experimental studies by using several benchmark data sets. The experimental results show significant improvement over the state-of-the-art approaches. Additional experiments are conducted to further verify the effectiveness of each of our link model and content model, respectively. The rest of the chapter is organized as follows. In Section 5.2 we present and analyze the conditional link model. In Section 5.3, we extend the link model to include the content information. Also in Section 5.3, we describe the two-stage optimization algorithm. In Section 5.4, we show extensions by combining our link model and content model with other existing content and link models. In Section 5.5, we show extensive experimental results on benchmark data sets. Finally, we give conclusions in Section 5.6. 5.2 Conditional Link Model In this section, we first present the proposed link model and followed by a maximum likelihood estimation method used to estimate the unknown parameters of the proposed model. 43 In Section 5.3, we incorporate the content information into the proposed link model by a discriminative model. 5.2.1 Popularity Conditional Link Model (PCL) Before going to the mathematical model, we first establish the assumptions and notations that are used in our model. All nodes in the network form a node space V = {1, · · · , n}, where the nodes could represent web pages, online blogs, etc. For each pair of ordered nodes (i, j), let sij record the information of the link from node i to node j. sij could either be {0, 1}, N + , or any nonnegative values dependent on the type of the link. If sij = 0, we say there is a directional link from node i to node j, or node i cites j (equivalently, node j is cited by node i). Let E = {(i → j)|sij = 0} denote all the directional links in the network. Each node i has an associated “link-in” space denoted by LI(i) ∈ V, which is the set of nodes that could possibly cite node i. 
Similarly, each node i is associated with a “link-out” space denoted by LO(i) ∈ V, which is the set of nodes that could possibly been cited by node i. Although in most cases we have LI(i) = LO(i) = V, in some scenarios such as citation of publications, the link-out space of a paper is the set of all papers that are older than the paper itself, and the link-in space is the set of all papers that are newer than the paper itself. Let I(i) = {j|sji = 0} be the set of nodes that actually cite node i, O(i) = {j|sij = 0} be the set of nodes that are actually cited by node i, and din (i) = |I(i)| be the indegree of node i, dout (i) = |O(i)| be the outdegree of node i. Finally, we denote by K the number of communities we aim to find. In our link model, we focus on modeling Pr(j|i), i.e., the probability of linking node i to node j among all the other candidates in LO(i). In other words, we model which node j among LO(i) is more likely to be cited by node i. This is in contrast to many existing approaches that explicitly model the presence or absence of link i → j, i.e., Pr(i → j). This modeling choice allows us to avoid modeling the absence of links, which was observed in [2, 63] as a major problem for link analysis. We introduce a set of hidden variables 44 zi ∈ {1, · · · , K} for each node i ∈ {1, · · · , n} to denote the community of node i. On the other hand, to model how likely a node will receive a citation in general, in our model for Pr(j|i), we introduce a popularity variable bi ≥ 0 for each node i: the higher popularity of one node, the higher chance the node will be cited by other nodes. Given the popularity and community memberships of all nodes, the link probability Pr(j|i) conditioned on the community variable zi of node i associated with this link is given as follows γjzi bj Pr(j|i; zi , b) = (5.1) j ′ ∈LO(i) γj ′ zi bj ′ where γik gives the community membership of node i in community k. As indicated by the above expression, the conditional link probability Pr(j|i) is proportional to bj , the popularity of the ending node of the link. By assuming a multinomial distribution for zi , i.e., zi ∼ Mult(γi1 , · · · , γiK ), we have Pr(j|i) written as Pr(j|i; γ, b) = γik k γjk bj j ′ ∈LO(i) γj ′ k bj ′ (5.2) where γik = Pr(zi = k). In Eq. (5.2), we assume that bi is independently from the community variable. As a result, each node will only have one copy of the popularity. An alternative approach is to have the popularity variable bi conditioned on the community variable. In other words, we have a different popularity variable bik for each node i when it is in a different community zi = k. Using the community dependent popularity bik , Pr(j|i) is computed as γjzi bjzi Pr(j|i; zi , b) = j ′ ∈LO(i) γj ′ z bj ′ z i i or by integrating out zi Pr(j|i; b) = γik k γjk bjk j ′ ∈LO(i) γj ′ k bj ′ k (5.3) Comparing Eq. (5.3) to (5.2), we see that Eq. (5.3) introduces the freedom of modeling the community dependent popularity at the price of increasing number of variables. As will be shown in our empirical study, Eq. (5.2) achieves better performance because of the reduced number of variables. 45 5.2.2 Analysis of the PCL Model In this section, we analyze our link model by establishing the relation and comparing to PHITS model [16]. For the purpose of consistency, we assume LO(i) = V for all i. In PHITS, each community is assumed to have a multinomial distribution that specifies the probability for each node to be cited by the other nodes in the same community. 
We denote by βjk the probability for node j to be cited by any nodes in the k th community. Pr(j|i) conditioned on community variable zi of node i for this link, and β is then expressed as Pr(j|i; zi , β) = βjzi Note that unlike our model in Eq. (7.6), the conditional link probability in PHITS model has nothing to do with the community membership of node j. This leads to the problem of undetermined community membership for nodes that do not cite any other nodes for PHITS, as discussed in the next section. By integrating out zi , we have Pr(j|i) written as Pr(j|i; γ, β) = γik βjk (5.4) k where γik is the probability that node i is in the kth community. The following proposition allows us to establish the relationship between the PHITS model and the popularity-based conditional link model. Proposition 1. The PHITS model specified in Eq. (5.4) is equivalent to the link model with Pr(j|i) specified in Eq. (5.3). The above proposition is proved by observing the link between βjk and the quantity γjk bjk / j ′ γj ′ k bj ′ k . As revealed by the above proposition, PHITS is in fact a relaxed version of the proposed PCL model by assuming that the popularity of each node depends on the community of the node. We can also derive the proposed model in Eq. (5.2) from the PHITS model in Eq. (5.4) by considering the relationship between γjk and βjk , as revealed by the following proposition. 46 Proposition 2. The popularity-based conditional link model specified in Eq. (5.2) is equivalent to the PHITS model specified in Eq. (5.4) if βjk is interpreted as Pr(j|Ck ), i.e., the probability of selecting node j from the k th community. The above proposition follows the Bayes’s rule, i.e., Pr(Ck |j) Pr(j) ′ ′ = j ′ Pr(Ck |j ) Pr(j ) Pr(j|Ck ) = γjk bj j ′ γj ′ k bj ′ The above proposition once again reveals that the proposed conditional link model is a restricted version of the PHITS model. We believe that it is the constraints introduced in the proposed conditional link model that lead to more reliable performance. 5.2.3 Maximum Likelihood Estimation In this section, we present the method of maximum likelihood for the PCL model specified in Eq. (5.2). Observing the directional links E = {(i → j)|sij = 0}, we write the log-likelihood as log L = sij log ˆ γik k (i→j)∈E where sij is normalized sij such that ˆ ˆ j∈LO(i) sij γjk bj j ′ ∈LO(i) γj ′ k bj ′ (5.5) = 1. We find optimal γ and b by maximizing the log-likelihood max γ,b sij log ˆ (i→j)∈E s.t. k γik k γjk bj j ′ ∈LO(i) γj ′ k bj ′ γik = 1, γik ≥ 0, bi ≥ 0 To derive the EM algorithm, we first have the following lemma for a low bound for log L. Lemma 5. The log-likelihood log L in Eq. (5.5) at the tth iteration is lower bounded as follows log L ≥ − sij ˆ k (i→j)∈E sij ˆ (i→j)∈E  qijk log γik + log γjk + log qijk log qijk k 47 bj +1− τik j ′ ∈LO(i) γj ′ k bj ′ τik   where the parameters τik and qijk are computed as τik = j ′ ∈LO(i) qijk ∝ t−1 γik γ t−1 bt−1 ′ ′ (5.6) jk j t−1 γjk bt−1 j s.t. t−1 t−1 j ′ ∈LO(i) γj ′ k bj ′ qijk = 1 (5.7) k and bt−1 , γ t−1 are the corresponding solutions in the t − 1th iteration. The above lemma follows from the Jensen’s inequality and the inequality of − log x ≥ 1−x. Using the result in the above lemma, we search for b and γ at the tth iteration that maximize the lower bound of log L, i.e., sij ˆ max γ,b (i→j)∈E s.t. k k  qijk log γik γjk bj − γj ′ k bj ′ τik j ′ ∈LO(i) γik = 1, γik ≥ 0, bi ≥ 0   (5.8) For this maximization problem, we have the following theorem. 
Before stating the theorem, we first establish the notations for the purpose of representation: nin (i, k) = sji qjik ˆ nout (i, k) = j∈I(i) nin (i) = sij qijk ˆ j∈O(i) nin (i, k) nout (i) = k nout (i, k) k n(i, k) = nin (i, k) + nout (i, k) m(i, k) = j∈LI(i) nout (j, k) τjk Theorem 6. The optimal solution to Eq. (5.8) satisfies the following conditions ∀i, dout (i) = 0, din (i) = 0, γik = n(i, k) , m(i, k)bi + nout (i) bi = nin (i) k m(i, k)γik ∀i, dout (i) = 0, din (i) = 0, n (i, k) , γik ∝ in m(i, k) bi = 48 nin (i) k m(i, k)γik (5.9) ∀i, dout (i) = 0, din (i) = 0, γik = nout (i, k) , k nout (i, k) bi = 0 ∀i, dout (i) = 0, din (i) = 0, γik is any non-negative value such that γik = 1, bi = 0 k Remark: As revealed in Eq. (5.9), bi is proportional to the number of nodes that cites node i, i.e., nin (i), which is consistent with interpreting bi as “popularity” or “authoritative” for node i. Advantage of PCL over PHITS can also be seen in the solution of γik . It can be shown that the membership of node i in PHITS model only depends on the membership of the nodes that are cited by node i, i.e., γik ∝ nout (i, k), and not affected by the nodes that cite node i. When nout (i) = 0, i.e., node i has no outgoing links, the membership γik is not determined. In contrast, in PCL model, community membership of node i depends on the membership of all the nodes connected to node i. 5.3 Discriminative Content Model In this section, we extend our link model to incorporate the content information of nodes. As we discussed in Sections 5.1, most existing approaches combine link and content by a generative model that generates both links and content attributes via a shared set of hidden variables related to community memberships. In this work, we propose a discriminative model, referred to as Discriminative Content(DC) model, to incorporate the content into the proposed link model. Let xi ∈ Rd denote the content vector of node i. The content information is used to model the memberships of nodes by a discriminative model, given by Pr(zi = k) = yik = T exp(wk xi ) T l exp(wl xi ) (5.10) where wk ∈ Rd is a d-dimensional weight vector for community k with each element corresponding to each attribute. We can see that by incorporating the content model, the 49 community membership is no longer specified by parameters γik , but rather conditioned on the content through yik by a softmax transformation. Then, the conditional link probability Pr(j|i) expressed in Eq. (5.2) is modified as follows Pr(j|i; b, w) = yik k yjk bj j ′ ∈LO(i) yj ′ k bj ′ where yik depends on w as given in Eq. (5.10). As revealed in the above expression, we do not generate the content attributes as most topic models do. Instead, by using the discriminative model, with an appropriately chosen weight vector wk that assign large weights to important attributes and small weights or zero weights to irrelevant attributes, we avoid the shortcoming of the generative models, i.e., being misled by irrelevant attributes. Another benefit from the discriminative model is that we can use a non-linear transformation φ(x) : Rd → Rm on the content vector as the new attribute to obtain a non-linear model. In the sequel, we use φ(x) rather than x for presentation. The log-likelihood of the combined model is written as log L = sij log ˆ (i→j)∈E yik k yjk bj j ′ ∈LO(i) yj ′ k bj ′ (5.11) We maximize the log-likelihood over the free parameters w and b. 
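To make the combined objective concrete, here is a minimal Python sketch of Equations (5.10) and (5.11): the softmax memberships computed from the content, and the log-likelihood summed over observed links. The arrays W, b, X and the edge list are hypothetical inputs, and LO(i) is taken to be the whole node set for simplicity.

```python
import numpy as np

def memberships(W, X):
    """Eq. (5.10): y_ik = softmax_k(w_k^T x_i) for every node."""
    logits = X @ W.T                                 # (N, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    Y = np.exp(logits)
    return Y / Y.sum(axis=1, keepdims=True)

def combined_log_likelihood(W, b, X, edges):
    """Eq. (5.11): sum over observed links of s_hat_ij * log Pr(j | i; b, w).

    W:     (K, d) weight vectors of the discriminative content model
    b:     (N,)   node popularities
    X:     (N, d) content vectors
    edges: list of (i, j, s_hat_ij) with normalized link weights
    """
    Y = memberships(W, X)
    weighted = Y * b[:, None]                        # y_jk * b_j
    per_comm = weighted / weighted.sum(axis=0)       # normalize over candidate targets j'
    ll = 0.0
    for i, j, s in edges:
        ll += s * np.log(per_comm[j] @ Y[i])         # sum_k y_ik * y_jk b_j / denom_k
    return ll
```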
Although we can use any gradient-based algorithm to optimize with wk and bi , we propose an efficient two-stage method as discussed in the next section, which helps us better understand the relation of link model and content model. A Two-Stage Method for Optimization In this section, we describe the method to maximize the log-likelihood in Eq. (5.11). We still use the EM algorithm to maximize the log-likelihood. In the E-step, we compute τik and qijk from y and b. In the M-step, we maximize the following problem:   yj ′ k bj ′  sij ˆ qijk log yik yjk bj − max τik w,b ′ (i→j)∈E k j ∈LO(i) 50 (5.12) subject to non-negative constraints on b. Instead of maximizing over w, we convert Eq. (7.1) into a constraint optimization problem over y and b by max y∈∆,b sij ˆ k (i→j)∈E where the domain ∆ is defined as ∆=  qijk log yik yjk bj − y|∃w, yik = yj ′ k bj ′ j ′ ∈LO(i) τik   T exp(wk φ(xi )) (5.13) (5.14) T l exp(wl φ(xi )) By having the domain of y given in Eq. (5.14) as a convex set, we can take a projection method to maximize the problem of Eq. (5.13), which leads to the two-stage method. In the first stage, we simply ignore the complex constraint for yik imposed by the domain ∆ and solve the optimization problem in Eq. (5.13) with only sum-to-one constraint on yik and non-negative constraints on b using the result in Theorem 6. In the second stage, we project the yik into the domain ∆. Let yik denote the optimal solution obtained from the first stage. ˜ The projection of yik , denoted by yik , is obtained by minimizing the KL divergence between ˜ yik and yik ∈ ∆, which is equal to the following optimization problem ˜ max w yik log yik = ˜ i yik log ˜ i k T exp(wk φ(xi )) T l exp(wl φ(xi )) k This problem is similar to the log-likelihood in multi-class logistic regression problem except that the class membership yik is not just binary but between 0 and 1 . As in logistic ˜ regression, we can add regularization term on wk to make the solution more robust, which leads to the following optimization problem max w yik log ˜ i k T exp(wk φ(xi )) T l exp(wl φ(xi )) − λ 2 T wk wk (5.15) k where λ is the regularization coefficient. This problem is a convex problem [9] and has a unique optimal solution, and can be maximized efficiently by the Newton-Raphson method. By converting the optimization problem over w into the problem over y and taking the two-stage method, we are able to have a better understanding of our combined model—the 51 Algorithm 1 Algorithm for maximizing the log-likelihood 1. Input the number of iterations or convergence rate 2. Initialize wk to zeros, bi randomly, λ to a fixed value 3. in the E-step, compute τik and qijk as in Eq. (5.6) and (5.7) using yik rather than γik 4. in the M-step, • compute γik , and bi as in Theorem 6 • update wk by maximizing the objective in Eq. (5.15) with γik in place of yik , and ˜ then compute yik 5. repeat Step 3 and 4 until the input number of iterations is exceeded or convergence rate is satisfied. 6. Output γik or yik as the final membership link structure will first give us a noisy estimation of community memberships y , and the ˜ noisy memberships are then used as supervised information for our discriminative content model to derive high-quality memberships y. These estimated memberships are further used in our EM iterations. Algorithm 1 summarizes the overall algorithms for combined link and content analysis for community detection. 
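For intuition about the projection step, the following sketch fits the weight vectors by plain gradient ascent on the regularized objective in Eq. (5.15), treating the free-form memberships from the first stage as soft targets; it is only an illustrative stand-in (the function name, learning rate and iteration count are assumptions), whereas the actual implementation uses the Newton-Raphson method as noted above.

```python
import numpy as np

def project_memberships(Y_tilde, X, lam=10.0, lr=0.01, iters=200):
    """Second stage (Eq. 5.15): fit W so that softmax(W x_i) matches the
    free-form memberships Y_tilde, with an L2 penalty on each w_k.

    Y_tilde: (N, K) soft targets from the first (link-only) stage, rows sum to 1
    X:       (N, d) content features phi(x_i)
    Returns W (K, d) and the projected memberships Y inside the feasible domain.
    """
    K, d = Y_tilde.shape[1], X.shape[1]
    W = np.zeros((K, d))
    for _ in range(iters):
        logits = X @ W.T
        logits -= logits.max(axis=1, keepdims=True)
        Y = np.exp(logits)
        Y /= Y.sum(axis=1, keepdims=True)
        grad = (Y_tilde - Y).T @ X - lam * W   # gradient of the regularized objective
        W += lr * grad
    return W, Y
```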
The algorithm has a time complexity of O(M(eKC1 + nKC2 + T3 )), where M is the number of iterations, e is the number of links in the network, n is the number of nodes in the network, C1 is a constant factor in computing qijk and τik , C2 is a constant factor in computing γik and bi , and T3 is the time for maximizing problem in Eq. (5.15) by the Newton-Raphson method. 5.4 Extensions In this section, we discuss two variants of the proposed framework for combining link information with content information. In the first variant, referred to as PCL+PLSA, we present an approach that combines the proposed conditional link model with the PLSA model for content analysis. In the second variant, referred to as PHITS+DC, we present an approach that combines the PHITS model for link analysis with the proposed discrimi- 52 native approach for content analysis. These two combined models will serve as baselines in our experimental study. 5.4.1 PCL + PLSA Similar to [17] where the PHITS link model is combined with PLSA content model, we combine our PCL link model with PLSA. The combined log-likelihood is given by log L = α sw log ˆij k i,j∈W(i) w βjk γik + (1 − α) sl log ˆij γik k i,j∈O(i) γjk bj j ′ ∈LO(i) γj ′ k bj ′ where α is combination coefficient, sw is the normalized number of times that word j occurs ˆij in the content of node i, W(i) denotes the set of unique words that occur in the content w of node i, and βjk = Pr(word j|Ck ). To maximize the log-likelihood, we derive the EMw l algorithm as follows. In the E-step, we compute qijk , qijk and τik as w w qijk ∝ γik βjk , τik = j ′ ∈LO(i) l qijk ∝ γik w qijk = 1 s.t. k γj ′ k bj ′ γjk bj , ′ ∈LO(i) γj ′ k bj ′ j l qijk = 1 s.t. k w In the M-step, we compute βjk , γik and bi as w βjk = γik = bi = ˆw w nw (j, k) i∈N (j) sij qijk in = w sw qijk ˆjk w j i∈N (j) j nin (j, k) αnw (i, k) + (1 − α)nl (i, k) out αnw (i) + (1 − α) nl (i) + bi ml (i, k) out out l (i) nin l k m (i, k)γik where N (j) denotes the set of nodes whose content have the word j, and nw , nw , nl , out in in nl , nl , and ml are defined similar as before. out 53 5.4.2 PHITS + DC In this variant, we combine the PHITS link model with our DC content model. The loglikelihood is given by log L = T where yik = exp(wk φ(xi ))/ sij log ˆ yik βjk k (i→j)∈E T l exp(wl φ(xi )). In the E-step, we compute qijk as qijk ∝ yik βjk , qijk = 1 k In the M-step, we first compute βjk and the free form membership γik by ˆ i∈I(i) sij qijk βjk = ˆ i∈I(i) sij qijk j ˆ j∈O(j) sij qijk γik = ˆ j∈O(j) sij qijk k nin (j, k) j nin (j, k) = = nout (i, k) nout (i) Then we maximize the following objective to get wk and yik , max k 5.5 i γik log yik − λ 2 T wk wk k Experiments In this section, we conduct several experimental studies. We first compare the PCL model with the PHITS model for the task of link prediction. Then we compare the performance of the PCL model with that of several state-of-the-art methods on the task of community detection by using two citation data sets. Before going into the details, we first describe the data sets and the metrics used in the experiment and evaluation. 5.5.1 Data Sets We used four data sets in our experiments. They are described in the following: Political Blog Data Set is a social blog network, which is a directed network of hyperlinks between webblogs about the US political issues, recorded in 2005 by Adamic and 54 Glance [1]. There are totally 1490 blogs, and each blog is labeled as either conservative or liberal. 
In the data set, we only have the link information and have no content information. So this data set is only used in the link prediction task to compare the PCL model with the PHITS model. The number of communities for this data set is set to K = 2. Wikipedia Data Set is a web page network which was crawled from Wikipedia web site by Gruber et al. [35]. This data set has 105 nodes and 799 links. This data set contains no explicit community label for each page. So we only use this data set in the link prediction task, with K set to 20 as suggested in [35]. Cora Data Set is a subset of the larger Cora citation data set [64]. This data set includes publications from the machine learning area, each of which is classified into 7 sub-categories as: Case-based reasoning, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning and Theory. There are totally 2708 nodes, and 5429 links. Each node corresponds to one paper and is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary of 1433 unique words. We use this data set in both the link prediction task and the community detection task. The number of communities is set to be K = 7. Citeseer Data Set is a subset of the larger Citeseer data set (http://citeseer.ist. psu.edu/). The Citeseer data set consists of 3312 scientific publications labeled as one of 6 classes and 4732 links. Each publication is described by a 0/1 valued word vector. The dictionary of word consists of 3703 unique words. This data set is used in both link prediction and community detection tasks. The number of communities is set to be K = 6. 5.5.2 Evaluation Metrics In the comparison of the PCL model and the PHITS model on the task of link prediction, we hide some links from the network, and run the two models on the remaining links. The performance is measured by the metric of Recall. Recall is an Information Retrieval measure. For each node, we compute the probabilities 55 for the node to generate links to the other nodes and then sort these probabilities in the decreasing order. The recall is computed at each position in the rank and defined as the fraction of target nodes that correspond to the hidden links. The recall is reported from positions 1 to 20 in the rank. To measure the performance of community detection, we used four metrics among which two are supervised and the other two are unsupervised. The two supervised metrics are normalized mutual information (NMI), and pairwise F-measure (PWF). These two metrics use the supervised label information. The other two unsupervised metrics are modularity (Modu) and normalized cut (NCut). These two metrics measure the partition performance in terms of the link structure. With the supervised label information, we can form the true community structure C = {C1 , . . . , CK }, where Ck contains the set of nodes that are in the kth community. The ′ ′ community structure given by the algorithms is represented by C ′ = {C1 , . . . , CK }. Then the mutual information between the two is defined as MI(C, C ′ ) ′ p(Ci , Cj ) log = ′ Ci ,Cj ′ p(Ci , Cj ) ′ p(Ci )p(Cj ) and the normalized mutual information is defined by NMI(C, C ′ ) = MI(C, C ′ ) max(H(C), H(C ′ )) where H(C) and H(C ′ ) are the entropies of the partitions C and C ′ . The higher the normalized mutual information, the closer the partition is to the ground truth. 
Let T denote the set of node pairs that have the same label, S denote the set of node pairs that are assigned to the same community, |T | denote the cardinality of set T . The pairwise F-measure is computed from the pairwise precision and recall, as the following precision = |S PWF = T |/|S| recall = |S 2 × precision × recall precision + recall 56 T |/|T | The higher the PWF, the better is the partition. Modularity is proposed by Newman et al. [72] for measuring community partitions. For a given community partition C = {C1 , . . . , CK }, the modularity is defined as Modu(C) = k where Cut(Ci , Cj ) = p∈Ci ,q∈Cj Cut(Ck , Ck ) − Cut(C, C) Cut(Ck , C) 2 Cut(C, C) wpq . As stated in [72], modularity measures how likely a network is generated due to the proposed community structure versus generated by a random process. Therefore, a higher modularity value indicates a community structure that better explains the observed network. Normalized cut is the objective of the normalized cut algorithm ([83], which we refer to as NCUT). Given a community partition C = {C1 , . . . , CK }, the normalized cut is defined as K NCut(C1 , · · · , Ck ) = i=1 ¯ Cut(Ci , Ci ) vol(Ci ) ¯ where Ci denotes the set of nodes that are not in Ci and vol(Ci ) = 5.5.3 p∈Ci q wpq . Link Prediction To validate the advantage of the PCL link model over the PHITS link model, we experiment them on the four data sets described in Section 5.5.1. The performance is reported in Figure 5.1 in terms of recall at positions 1 to 20. Each number in the figure is averaged over 5 runs. The PCL outperforms the PHITS in all the cases. To investigate the effects of the popularity parameter, b, we also perform the same experiments on PCL by setting bi = 1 for all i. The results are labeled as “PCL-b=1” in the figure. The performance given bi = 1 is worse than PCL and PHITS. It further confirms the importance of the popularity parameter. Overall, this result validates our conjecture that the conditional link model outperforms the generative link model, at least for the task of link predication. 57 Table 5.1. Community detection performance on Cora and Citeseer dataset Cora PWF Modu Algorithm NMI L PHITS LDA-Link PCL NCUT 0.0570 0.0762 0.0884 0.1715 0.1894 0.2278 0.2055 0.2864 C PLSA LDA-Word NCUT(RBF kernel) NCUT(pp kernel) 0.2107 0.2310 0.1317 0.1804 PHITS-PLSA LDA-Link-Word LCF NCUT(RBF kernel) NCUT(pp kernel) PCL-PLSA PHITS-DC PCL-DC 0.3140 0.3587 0.1227 0.2444 0.3866 0.3900 0.4359 0.5123 LC 5.5.4 Citeseer PWF Modu NCut NMI NCut 0.3929 0.2189 0.5903 0.2701 3.2466 4.5687 1.9391 0.2732 0.0101 0.0356 0.0315 0.1833 0.1773 0.2363 0.1927 0.3252 0.4588 0.2211 0.6436 0.6577 2.2370 3.7457 1.1181 0.1490 0.2864 0.2774 0.2457 0.2912 0.2682 0.2970 0.1839 0.2487 4.2686 3.7820 4.7775 4.6612 0.0965 0.1342 0.0976 0.1986 0.2298 0.2880 0.2386 0.3282 0.2885 0.3022 0.2133 0.4802 3.2294 3.0165 3.7078 1.8118 0.3526 0.3969 0.2456 0.3062 0.4214 0.4233 0.4526 0.5450 0.3956 0.4576 0.1664 0.3703 0.5158 0.5503 0.6384 0.6976 3.2880 2.8906 4.8101 1.6585 0.7903 2.1575 1.5165 1.0093 0.1188 0.1920 0.0934 0.1592 0.1986 0.2207 0.2062 0.2921 0.2596 0.3045 0.2361 0.2957 0.3282 0.3334 0.3295 0.3876 0.3863 0.5058 0.2011 0.4280 0.4802 0.5505 0.6117 0.6857 2.7397 2.0369 3.6721 1.7592 1.8118 1.6786 1.2074 0.7505 Community Detection In this section, we investigate the performance of our model on the task of community detection. We perform experiments on the two scientific publication date sets, which have both link and content information. 
To validate the advantage of our proposed model, we compare it with several baselines. Based on what information is used, the algorithms are categorized into 3 classes: Based on Link, we compare the following models: PHITS, PCL, LDA-Link, and Spectral Clustering (NCUT). Based on Content, we compare the following: PLSA, LDA-Word, and Spectral Clustering. In spectral clustering, the similarity matrix is the kernel matrix computed from the content of each publication. Here we report two kernels, one is the RBF kernel, and the other is the probabilistic product kernel proposed in [47]. Based on Link and Content, we compare the following: PHITS-PLSA, LDA-Link-Word, Link-Content-Factorization (LCF), Spectral Clustering, PCL-PLSA, PHITS-DC, and PCLDC. Notice that PHITS-PLSA refers to the combination of PHITS and PLSA proposed in [17], LDA-Link-Word refers to the mixed membership model proposed in [28], LCF refers 58 Recall on Political Blog Recall 0.3 Recall on Wikipedia 0.8 PCL PCL−b=1 PHITS 0.6 Recall 0.4 0.2 0.1 0 0 0.4 0.2 5 10 Rank 15 0 0 20 (a) Recall on Political Blog PCL PCL−b=1 PHITS 0.3 0.2 0.1 0 0 10 Rank 15 20 Recall on Citeseer 0.4 Recall Recall 0.3 5 (b) Recall on Wikipedia Recall on Cora 0.4 PCL PCL−b=1 PHITS PCL PCL−b=1 PHITS 0.2 0.1 5 10 Rank 15 0 0 20 (c) Recall on Cora 5 10 Rank 15 20 (d) Recall on Citeseer Figure 5.1. Recall on the four data sets to the model proposed in [113], Spectral Clustering is applied to linear combined kernel from the link matrix and content kernel, PCL-PLSA refers to the combination of the PCL and the PLSA model as described in Section 5.4, PHITS-DC refers to the PHITS model combined with the Discriminative Content model, and PCL-DC refers to the PCL model combined with the Discriminative Content model. In the implementation, the feature vector used in our model is the original word indicator vector without any transformation; the spectral clustering we used is the normalized cut algorithm [83] (NCUT). For the algorithms that are dependent on some parameters such as the σ parameter in RBF kernel, the combination coefficient in PHITS-PLSA, the combi- 59 nation coefficient of link matrix and content kernel for spectral clustering, the combination coefficient in PCL-PLSA, the regularization coefficient in PHITS-DC, we experiment on a wide range of values and choose the best one in terms of normalized mutual information and pairwise F-measure. For example, the combination coefficients in PHITS-PLSA, PCL-PLSA, and combined link matrix and content kernel are tuned from 0.1 to 0.9 with 0.1 as the step size. The regularization coefficient for PHITS-DC is tuned from 0 to 50 with 5 as the step size. The regularization coefficient for PCL-DC is set to a fixed value of 10. All the iterative algorithms are run until the relative difference of the objective is within 10−8 . Tables 5.1 show the results on the Cora data set and the Citeseer data set. For both data sets, PCL outperforms PHITS in all the cases, either using link only (PCL outperforms PHITS), or combining link and content (PCL-PLSA outperforms PHITS-PLSA and PCL-DC outperforms PHITS-DC). When considering content, the approaches that discriminatively combine content (DC) outperform the approaches that combine content using PLSA. That is, PHITS-DC outperforms PHITS-PLSA, and PCL-DC outperforms PCL-PLSA. These results further confirm that the discriminative models (either the link model, or the content model, or the combination of the two) achieve better performance than the generative ones. 
We also compared PCL and PCL-DC with the following algorithms. In the link-only case, the spectral clustering (NCUT) outperforms PCL. LDA-Link outperforms PCL in some metrics. When combining link and content, PCL-DC outperforms all algorithms except for the spectral clustering (NCUT) algorithm in the normalized cut (NCut) metric. The main reason for the spectral clustering (NCUT) to have the best performance in terms of normalized cut is that it directly minimizes this metric. However, we argue that people would consider the NMI and PWF metrics as equally important, because the NMI and PWF metrics measure how good the partition derived by the algorithms matches the ground truth. Finally, to reveal the performance of our model under different parameters, we show the performance of the PCL-DC model under different regularization coefficient λ on the two data sets in Figure 5.2. In both data sets, the performance achieves the highest level when 60 λ = 5. After that, the PCL-DC algorithm is not very sensitive to λ. Measure Vs. λ Measure Vs. λ 0.7 0.5 0.4 0.3 0.2 0 PWF NMI 0.4 Measure 0.6 Measure 0.5 PWF NMI 0.3 0.2 0.1 10 20 λ 30 40 0 0 50 (a) Cora 10 20 λ 30 40 50 (b) Citeseer Figure 5.2. Partition Measure of PCL-DC vs. λ 5.6 Conclusions In this chapter, we proposed a unified model to combine link and content analysis for community detection. To accurately model the link patterns, a conditional link model is proposed to capture the popularity of nodes. In order to alleviate the problem caused by the irrelevant attributes, a discriminative model, instead of a generative model, is proposed for modeling the contents of nodes. The link model and content model are combined via a probabilistic framework through the shared variables of community memberships. We observed that the combined model obtains significant improvement over the state-of-the-art approaches for community detection. 61 CHAPTER 6 Clustering for Noisy M&C Linked Data: A Probabilistic Link Prediction Model (I) 6.1 Introduction In this chapter and the next chapter, we explore the problem of clustering for noisy M&C linked data. Learning from M&C linked data has been studied extensively and has found its application in distance metric learning [102], constrained clustering [8], and kernel learning [42]. The M&C link information include the must-link for pairs in the same class and cannot-link for pairs in different classes. They are also termed as positive (pairwise) constraints for must-link pairs, and negative (pairwise) constraints for cannot-link pairs. M&C link information can often be derived from data, making it more attractive than the standard setup of supervised learning. For instance, in classifying research articles, we can derive the pairwise constraints based on the citations between papers. Although various algorithms have been proposed for learning from M&C linked data, most of them assume perfect M&C link information. In contrast, in this study, we focus on the problem of learning from noisy M&C link information in which some of the pairwise constraints are labeled incorrectly. This is important because the pairwise constraints extracted from data tend to be noisy and inaccurate. In the example of classifying research articles with pairwise constraints constructed from paper citations, the cited paper may not share the same research topic as the citing paper. To cluster the noisy M&C linked data, we proposed to learn a combined kernel from 62 the noisy M&C linked data. 
We propose a probabilistic link prediction model that learns the combination weights based on a generalized maximum entropy model, or equivalently a regularized logistic regression model. We propose two different approaches for estimating the sufficient statistics from the noisy M&C link information under different assumptions. We show that, under the stated assumptions, the probabilistic model trained from the noisy M&C link information converges to the model trained from the perfect M&C link information. Extensive experimental results verify the efficacy of the proposed framework for clustering noisy M&C linked data. The remainder of this chapter is organized as follows. In Section 6.2, we present the probabilistic link prediction model for learning from noisy M&C link information. We present experimental results in Section 6.3 and conclude our study in Section 6.4.

6.2 Learning from Noisy M&C Linked Data

We start with the basic formulation of maximum entropy learning from perfect M&C link information, followed by its generalization and its equivalence to a regularized logistic regression model. We then extend the formulation to the case of noisy M&C link information. For the purpose of presentation, we first introduce the notation used throughout this chapter.

6.2.1 Notations

Let D = {x_i ∈ X, i = 1, · · ·, N} be a collection of data points, and let P = {(x^1_i, x^2_i, ỹ_i) | x^1_i, x^2_i ∈ D, i = 1, . . . , n, ỹ_i ∈ {+1, −1}} be a collection of observed labeled pairs. We slightly abuse the terminology of labeled and unlabeled examples by referring to the examples in D that also occur in P as labeled examples, and to the remaining examples in D as unlabeled examples. We denote by y_i the true label for the pair (x^1_i, x^2_i). We refer to the pairs with y_i = +1 as perfect must-link pairs or perfect positive constraints, and to the pairs with y_i = −1 as perfect cannot-link pairs or perfect negative constraints. Similarly, we refer to the pairs with ỹ_i = +1 as noisy must-link pairs or noisy positive constraints, and to the pairs with ỹ_i = −1 as noisy cannot-link pairs or noisy negative constraints. We write ȳ = −y for the complement of y. We use κ_j(x^1, x^2), j ∈ {1, · · ·, m}, for the jth kernel feature function defined on X × X, and denote by k_i = (κ_1(x^1_i, x^2_i), · · ·, κ_m(x^1_i, x^2_i))^⊤ the feature vector for the pair (x^1_i, x^2_i). Throughout this chapter, we use capital letters X, Y, Ỹ for the corresponding random variables. In the sequel, we use the notation defined in Table 6.1, where δ(y_i, y) is the Kronecker delta function that outputs 1 if y_i = y and zero otherwise; quantities written with a tilde are computed with the noisy labels ỹ_i (or the random variable Ỹ) in place of the true labels. We let a^δ_y[κ_j] denote the jth element of a^δ_y[k], and similarly for ã^δ_y[κ_j]. The empirical averages a^δ_y[k] and a^o[k] in Table 6.1 are referred to as the sufficient statistics.

Table 6.1. Notations

  Expectation                                               Empirical Average
  E^δ_y[k] = E_{X^1,X^2,Y}[δ(Y, y) k(X^1, X^2)]             a^δ_y[k] = (1/n) Σ_{i=1}^n δ(y_i, y) k_i
  Ẽ^δ_y[k] = E_{X^1,X^2,Ỹ}[δ(Ỹ, y) k(X^1, X^2)]             ã^δ_y[k] = (1/n) Σ_{i=1}^n δ(ỹ_i, y) k_i
  E[k]     = E_{X^1,X^2}[k(X^1, X^2)]                       a[k] = (1/n) Σ_{i=1}^n k_i
  E^o[k]   = E_{X^1,X^2,Y}[Y k(X^1, X^2)]                   a^o[k] = (1/n) Σ_{i=1}^n y_i k_i
  E^c_y[k] = E_{X^1,X^2|Y=y}[k(X^1, X^2)]                   a^c_y[k] = Σ_{i=1}^n δ(y_i, y) k_i / Σ_{i=1}^n δ(y_i, y)
  Ẽ^c_y[k] = E_{X^1,X^2|Ỹ=y}[k(X^1, X^2)]                   —

6.2.2 Generalized Maximum Entropy Model

We propose to learn a probabilistic link prediction model Pr(Y | X^1, X^2), i.e., given a pair X^1, X^2, how likely they are to be related by a must-link or a cannot-link. We first consider the maximum entropy model for learning the conditional link prediction model from perfect M&C link information.
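The empirical averages in Table 6.1 are straightforward to compute. As a small illustration (a Python sketch with hypothetical names; the feature vectors k_i are assumed to be precomputed for every observed pair), the noisy-label quantities ã^δ_y[k], a[k], and ã^o[k] can be obtained as follows:

import numpy as np

def empirical_averages(K, y_tilde):
    # K: (n, m) array whose row i is the kernel feature vector k_i for pair i.
    # y_tilde: (n,) array of noisy labels in {+1, -1}.
    # Returns the empirical averages built from the noisy labels.
    n = len(y_tilde)
    a_delta = {y: K[y_tilde == y].sum(axis=0) / n for y in (+1, -1)}   # \tilde a^delta_y[k]
    a_all = K.sum(axis=0) / n                                          # a[k]
    a_o = (y_tilde[:, None] * K).sum(axis=0) / n                       # \tilde a^o[k]
    return a_delta, a_all, a_o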
We cast the problem of learning from M&C link information as a binary classification problem in which the objective is to classify each pair (x^1_i, x^2_i) into one of two categories, i.e., a positive pair (y_i = +1) or a negative pair (y_i = −1). Using the maximum entropy principle, we aim to learn the conditional distribution Pr(Y = y | X^1, X^2), which leads to the following optimization problem:

  max_p   Σ_{i=1}^n H(p | x^1_i, x^2_i)                                                           (6.1)
  s.t.    (1/n) Σ_{i=1}^n p(y | x^1_i, x^2_i) κ_j(x^1_i, x^2_i) = a^δ_y[κ_j],   ∀ y, j

where H(p | x^1_i, x^2_i) = − Σ_y p(y | x^1_i, x^2_i) ln p(y | x^1_i, x^2_i). The solution to (6.1) is given by

  p(y | x^1_i, x^2_i) = exp(y w^⊤ k_i / 2) / ( exp(w^⊤ k_i / 2) + exp(−w^⊤ k_i / 2) ) = 1 / (1 + exp(−y w^⊤ k_i))

where w ∈ R^m are the dual variables, obtained by solving the following optimization problem:

  min_{w ∈ R^m}   Σ_{i=1}^n ln( 1 + exp(−y_i w^⊤ k_i) ) = − (1/2) Σ_{i=1}^n y_i w^⊤ k_i + Σ_{i=1}^n ln Σ_y exp( y w^⊤ k_i / 2 )

One major problem with the maximum entropy model in (6.1) is the equality constraint, which is unlikely to hold if, for each pair (x^1_i, x^2_i), y_i is a random sample from the distribution p(y | x^1_i, x^2_i). We denote by a^p_y[κ_j] the left-hand side of the equality constraint in problem (6.1), i.e.,

  a^p_y[κ_j] = (1/n) Σ_{i=1}^n p(y | x^1_i, x^2_i) κ_j(x^1_i, x^2_i)

The following theorem shows that a^p_y[κ_j] and a^δ_y[κ_j] could differ significantly if n is small; the difference between the two quantities diminishes only as n approaches infinity.

Theorem 7. Assume (x^1_i, x^2_i, y_i) are i.i.d. samples from an unknown distribution P(X^1, X^2, Y), and that the kernel functions are bounded, |κ_j(x^1, x^2)| ≤ R, ∀ j. Then the equality constraint in (6.1) for any j and y holds with probability 1 as the number of instances approaches infinity. In particular, for any ǫ > 0 we have

  Pr( |a^p_y[κ_j] − a^δ_y[κ_j]| ≥ ǫ ) ≤ 4 exp( −ǫ² n / (8R²) )

The theorem can be proved by noting that E[a^δ_y[κ_j]] = E[a^p_y[κ_j]] and applying McDiarmid's inequality; details are provided in the appendix. To address the case that a^p_y[κ_j] and a^δ_y[κ_j] could be different, we propose a generalization of the traditional maximum entropy model in (6.1). Given the finite number of training data, we relax the equality constraints in (6.1) into inequality constraints, leading to the following formulation for learning from side information:

  max_p   (1/n) Σ_{i=1}^n H(p | x^1_i, x^2_i) − (1/(2λ)) Σ_y ||ǫ_y||²                              (6.2)
  s.t.    (1/n) Σ_{i=1}^n p(y | x^1_i, x^2_i) κ_j(x^1_i, x^2_i) ≥ a^δ_y[κ_j] − ǫ_{yj},   ∀ y, j

where ǫ_y = (ǫ_{y1}, . . . , ǫ_{ym})^⊤ and ||·|| is a norm that measures the length of the vector ǫ_y. The key features of the generalized maximum entropy model in (6.2) are:

• Replacing equality constraints with inequality ones. As a result, we have a^δ_y[κ_j] − ǫ_{yj} ≤ a^p_y[κ_j] ≤ a^δ_y[κ_j] + ǫ_{ȳj}. Note that although only a one-sided inequality is included in (6.2), the upper bound on a^p_y[κ_j] can easily be derived by using the relation a^p_y[κ_j] + a^p_ȳ[κ_j] = a^δ_y[κ_j] + a^δ_ȳ[κ_j].

• The positive slack variables ǫ are introduced to account for the difference between the two empirical means a^p_y[κ_j] and a^δ_y[κ_j]. A regularization term ||ǫ_y||²/(2λ) is added to the objective so that these variables are determined automatically.

We further justify the generalized maximum entropy model by showing that it is equivalent to the regularized logistic regression model.

Proposition 3.
When ||·|| = ||·||_2, the dual problem of (6.2) is equivalent to the regularized logistic regression model, i.e.,

  max_{w ∈ R^m}   (1/n) Σ_{i=1}^n ln p(y_i | x^1_i, x^2_i) − (λ/2) ||w||_2²                        (6.3)

or equivalently,

  max_{w ∈ R^m}   (1/2) w^⊤ a^o[k] − (λ/2) ||w||_2² − (1/n) Σ_{i=1}^n ln Σ_y exp( y w^⊤ k_i / 2 )   (6.4)

6.2.3 Estimating the Sufficient Statistics

In this section, we extend the framework of learning a probabilistic link prediction model to the case where the pairwise constraints are noisy, i.e., ỹ_i ≠ y_i for some pairs. The strategy we use in this section is to estimate the sufficient statistics a^δ_y[k] or a^o[k] from the noisy labels without having to know which labels are incorrect. We present an approach for estimating the sufficient statistics under certain assumptions. In order to estimate a^δ_y[k] in the case of noisy M&C link information, we make the following assumptions.

Assumption 1. We assume (1.a) Pr(Ỹ | X^1, X^2, Y) = Pr(Ỹ | Y), and (1.b) Pr(Ỹ = y | Y = y) = c_y is given.

In the above assumption, (1.a) states that Ỹ is conditionally independent of (X^1, X^2) given Y, and (1.b) assumes group-level knowledge about the noise in the pairwise constraints. With these two assumptions, the following theorem shows that it is possible to express the empirical mean a^δ_y[k] in terms of ã^δ_y[k], i.e., the empirical mean estimated from the noisy side information.

Theorem 8. Assuming (x_i, y_i, ỹ_i), i = 1, . . . , n, are i.i.d. samples, we have, with probability at least 1 − δ,

  || a^δ_y[k] − b^δ_y[k] ||_2 ≤ sqrt( (8 m R² / ((c_+ + c_− − 1)² n)) ln(4m/δ) ),   ∀ y

where

  b^δ_y[k] = ã^δ_y[k] / (c_y + c_ȳ − 1) − ((1 − c_ȳ) / (c_y + c_ȳ − 1)) · (1/n) Σ_{i=1}^n k(x^1_i, x^2_i)

The proof can be found in the appendix. As indicated by Theorem 8, under Assumption 1 we can approximate a^δ_y[k] by b^δ_y[k]. It is interesting to note that the convergence rate is O(1/[ |c_y + c_ȳ − 1| sqrt(n) ]), not O(1/sqrt(n)). Similar to Theorem 8, we have the following corollary to bound the difference between a^p_y[k] and b^δ_y[k].

Corollary 9. Assuming (x_i, y_i, ỹ_i), i = 1, . . . , n, are i.i.d. samples, we have, with probability at least 1 − δ,

  || a^p_y[k] − b^δ_y[k] ||_2 ≤ sqrt( (8 m R² / ((c_+ + c_− − 1)² n)) ln(4m/δ) ),   ∀ y

Given that b^δ_y[k] is an estimate of a^δ_y[k] and a^o[k] = a^δ_+[k] − a^δ_−[k], we can compute an estimate of a^o[k] under Assumption 1 by

  b^o[k] = b^δ_+[k] − b^δ_−[k] = (1/(c_+ + c_− − 1)) ã^o[k] − ((c_+ − c_−)/(c_+ + c_− − 1)) a[k]     (6.5)

Similar to Theorem 8, we have the following corollary to bound the difference between a^o[k] and b^o[k].

Corollary 10. Assuming (x_i, y_i, ỹ_i), i = 1, . . . , n, are i.i.d. samples, we have, with probability at least 1 − δ,

  || a^o[k] − b^o[k] ||_2 ≤ sqrt( (8 m R² / ((c_+ + c_− − 1)² n)) ln(4m/δ) )

With Theorem 8, Corollary 9, and Corollary 10, we finally arrive at the following generalized maximum entropy model for learning from noisily labeled data,

  p(y | x^1, x^2) = arg max_p   (1/n) Σ_{i=1}^n H(p | x^1_i, x^2_i) − (1/(2λ)) Σ_y ||ǫ_y||²          (6.6)
                    s.t.        (1/n) Σ_{i=1}^n p(y | x^1_i, x^2_i) k(x^1_i, x^2_i) ≥ b^δ_y[k] − ǫ_y,   ∀ y

or the following regularized logistic regression model for learning from noisily labeled data,

  w*_b = arg min_{w ∈ R^m}   (λ/2) ||w||_2² − (1/2) w^⊤ b^o[k] + (1/n) Σ_{i=1}^n ln Σ_y exp( y w^⊤ k_i / 2 )    (6.7)

6.2.4 Convergence Analysis

The conditional distribution p(y | x^1, x^2) resulting from (6.6) is given by

  p(y = 1 | x^1, x^2) = exp( w*_b^⊤ k(x^1, x^2) ) / ( 1 + exp( w*_b^⊤ k(x^1, x^2) ) )                (6.8)

Next, we show how the solution w is affected when a^δ_y[k] is replaced with b^δ_y[k], i.e., the empirical mean computed from the noisy pairwise constraints.

Theorem 11 (Lemma 6 of [78]). Let w*_a be the solution to (6.4) with a^o[k], and let w*_b be the solution to (6.7) with b^o[k].
We have 1 o ∗ ∗ wa − wb 2 ≤ a [k] − bo [k] 2 λ The proof for the theorem as well as the following theorems can be found in the appendix. Combining Theorem 11 and Corollary 10, we have the following theorem showing the impact of replacing ao [k] with bo [k]. Theorem 12. Let p(y|x1, x2 ) be the conditional model derived from noisy pairwise constraints using (6.6) using · 2 , and p(y|x1 , x2 ) be the conditional model derived from the perfect pairwise constraints using (6.2). Under Assumption 1, with probability 1 − δ, for any x1 , x2 and y, we have p(y|x1 , x2 ) − p(y|x1 , x2 ) ≤ mR2 λc 8 ln n 4m δ where c = |c+ + c− − 1|. As indicated by the above theorem, the difference between two conditional models will √ be reduced at the rate of 1/[|c+ + c− − 1| n]. Finally, since our algorithm depends on the knowledge of c+ and c− , we further analyze the behavior of the proposed algorithm with inaccurate estimation of c+ and c− . We denote by c+ and c− the estimates of c+ and c− , respectively. We define p(y|x1 , x2 ) the conditional model derived from the noisy side information using c+ and c− . We measure the difference (c+ , c− ) and their estimates by ∆ = max(|c+ − c+ |, |c− − c− |). The next theorem shows the difference between p(y|x1 , x2 ) and p(y|x1, x2 ). Theorem 13. Let · = · 2 . Let p(y|x1, x2 ) be the conditional model derived from noisy pairwise constraints with c+ and c− , and p(y|x1, x2 ) be the conditional model derived from the perfect pairwise constraints using (6.2). Assume |c+ + c− − 1| ≥ ρ with ρ ≥ 0 and ∆ ≤ ρ/4. Under Assumption 1, with probability 1 − δ, for any x1 , x2 and y, we have p(y|x1 , x2 ) − p(y|x1, x2 ) ≤ mR2 λc 69 8 4m mR2 ∆(8 + 2c) ln + n δ λρ2 We finally mention that w can serve as weights for combining kernels, i.e., j wj κj . However, the combined kernel may not be positive semi-definite, because some weights wj are negative. To ensure the combined kernel to be valid, we introduce one more constraint wj ≥ 0, j = 1, . . . , m to the optimization problem in (6.6). The optimization problems are solved by Nesterov method [70]. 6.3 Experiments We evaluate the proposed algorithm by clustering the linked documents. We first present the experiments on clustering linked documents with noisy pairwise constraints derived from the link information. We then examine the behavior of the proposed algorithm in more details. Before presenting the experimental results, we first introduce the data sets, baselines and evaluation metric. Data Sets We select three linked document data sets, i.e. Cora, Citeseer, Terrorist Attacks(TeAt) for our evalution. We choose these data sets because they have relatively low noise (20% ∼ 35%) in their pairwise constraints derived from links that satisfies the condition c+ + c− − 1 > 0 in Assumption 1. They were processed by the research group of Lise Getoor (see www.cs.umd.edu/projects/linqs/projects/lbc/). Each data set contains (1) a set of documents described by binary vectors indicating the presence and absence of words from a dictionary, (2) links among documents (e.g., citations between research articles), and (3) the class assignment for each document. The statistics of these three data sets are summarized in Table 1. In the experiments, the attributes for each document are normalized by first dividing the sum of the attributes and then taking the square root [47]. Evaluation In order to evaluate the proposed algorithm, we apply it to kernel learning as described at the end of Section 3. 
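Before describing the kernels used in the experiments, it may help to see the estimation-and-fitting step of Section 6.2 written out. The sketch below (Python; all names are hypothetical, and a simple projected-gradient loop stands in for the Nesterov method [70] used in our implementation) computes the corrected statistic b^o[k] of (6.5) and then fits the non-negative weights w by minimizing the objective of (6.7):

import numpy as np

def corrected_statistic(K, y_tilde, c_pos, c_neg):
    # b^o[k] in Eq. (6.5): an estimate of a^o[k] built from the noisy labels,
    # given c_y = Pr(noisy label = y | true label = y); assumes c_pos + c_neg != 1.
    n = len(y_tilde)
    a_o_tilde = (y_tilde[:, None] * K).sum(axis=0) / n
    a_all = K.sum(axis=0) / n
    c = c_pos + c_neg - 1.0
    return a_o_tilde / c - (c_pos - c_neg) / c * a_all

def fit_weights(K, b_o, lam=10.0, lr=0.1, iters=500):
    # Minimize  lam/2 ||w||^2 - w^T b_o / 2 + (1/n) sum_i ln(2 cosh(w^T k_i / 2)),
    # i.e. the objective in (6.7); w is kept non-negative so that the combined
    # kernel sum_j w_j kappa_j remains positive semi-definite.
    n, m = K.shape
    w = np.zeros(m)
    for _ in range(iters):
        s = np.tanh(K @ w / 2.0)
        grad = lam * w - b_o / 2.0 + (K * s[:, None]).sum(axis=0) / (2.0 * n)
        w = np.maximum(w - lr * grad, 0.0)   # projected gradient step onto w >= 0
    return w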
Specifically, for each attribute j, we construct a linear kernel matrix κ_j(x^1, x^2) = x^1[j] x^2[j] for paired documents (x^1, x^2), where x[j] is the jth normalized attribute of document x. The proposed algorithm is applied to learn the combination of these multiple kernel matrices from the noisy pairwise constraints derived from links. The ℓ2 norm is used in the proposed algorithm. Given the learned kernel matrix, a spectral clustering algorithm [84] is applied for document clustering. We evaluate the clustering result by comparing it to the class assignment information provided in each data set, using normalized mutual information (NMI) [107] as the evaluation metric. For all the experiments, we set γ in the proposed algorithm to 0.01/c², where c = c_+ + c_− − 1.

Table 6.2. Statistics of data sets

  name        #examples   #words   #links   #classes
  Cora           2708       1433     5429        7
  Citeseer       3312       3703     4732        6
  TeAt           1293        106      571        6

Baseline. We compare the proposed algorithm to the following metric/kernel learning algorithms: (a) GDM, the global distance metric learning algorithm [102]; (b) DCA, the discriminative component analysis algorithm [43]; (c) ITML, the information-theoretic metric learning algorithm proposed in [21]; and (d) SKL, the spectral kernel learning algorithm [44]. For a fair comparison, the distance metric A learned by the metric learning algorithms is used to construct a kernel matrix K = XAX^⊤, where X is the data matrix, and the same spectral clustering algorithm is applied to K for document clustering. We also evaluate the proposed algorithm against the metric pairwise constrained K-means clustering algorithm [8], referred to as MPCK. To improve the robustness of MPCK to noisy constraints, we follow [61] and weight the noisy positive constraints and the noisy negative constraints by c_+ and c_−, respectively, to reduce their impact on the clustering results. As the reference point, we compute a linear kernel for both labeled and unlabeled examples, without using the provided pairwise constraints; we refer to this baseline as base. Finally, we refer to the proposed generalized maximum entropy model for learning from noisy side information as GMEns, and to the generalized maximum entropy model that ignores the noise in the side information as GMEs. All the experiments are run five times, and the clustering accuracy averaged over the five runs is reported in our study.

6.3.1 Experiments with Real Noise

We conduct document clustering experiments with the noisy pairwise constraints derived from the links between documents. In particular, we use all the linked document pairs as positive constraints; the same number of document pairs without links are sampled to construct the negative constraints. To obtain the noise levels of the pairwise constraints, we sample a total of 100 pairwise constraints and estimate c_+ and c_− based on the correctness of the sampled constraints. These sampled pairwise constraints with their true labels are also used by the other baseline methods for computing distance metrics and kernel matrices. Figures 6.1(a), 6.2(a) and 6.3(a) show the clustering accuracy measured in NMI for the three data sets; the mean values of the estimated c_+ and c_− are listed under each figure. We observe that, given the noisy pairwise constraints, all the algorithms except ITML perform significantly worse than the reference method base on at least one data set.
In contrast, the proposed algorithm for learning from noisy pairwise constraints outperforms the reference method significantly on all three data sets. We thus conclude that the proposed algorithm is overall more robust to noise in the side information.

6.3.2 Experiments with Synthetic Noise

In this section, we examine the robustness of the proposed algorithm to (a) different noise levels in synthetically generated pairwise constraints, and (b) the estimated values of c_+ and c_−.

Robustness to the noise. We first sample 10,000 pairwise constraints from each data set, with 5,000 positive constraints and 5,000 negative constraints. Random noise is introduced into the synthetic constraints by randomly flipping the label of a pair with probability p%, where p% specifies the noise level. We set c_+ and c_− to 1 − p%, under the assumption that the knowledge of the noise level is perfect. To examine the impact of noisy positive constraints and noisy negative constraints separately, for each data set and a given noise level p%, we conduct two experiments: one with corrupted positive constraints but perfect negative constraints, and the other with corrupted negative constraints but perfect positive constraints. Figures 6.1(c), 6.2(c) and 6.3(c) compare the clustering results of GMEns and GMEs with the noise level in the synthetic pairwise constraints varied from 10% to 90% on the three data sets. We observe that GMEns, the generalized maximum entropy model for noisy side information, is significantly more robust to the noise in the pairwise constraints than GMEs, which does not take the noise in the side information into account. We also observe that the noisy positive constraints have a significantly higher adverse impact on the clustering results than the noisy negative constraints.

[Figure 6.1. Experimental results on the Cora data set: (a) NMI on real noisy constraints for base, GMEns, MPCK, GDM, DCA, ITML, and SKL; (b) NMI on synthetic noisy constraints at low, medium, and high noise levels; (c) robustness of GMEns and GMEs to 10%–90% noise in the positive and in the negative constraints; (d) sensitivity of GMEns to perturbations (±e) of c_+ and c_− for e = 1% and e = 10%.]

[Figure 6.2. Experimental results on the Citeseer data set, with the same four panels as Figure 6.1.]

Sensitivity to c_+, c_−. We use the same set of 10,000 randomly sampled pairwise constraints for this study and add the same noise level to both the positive and the negative constraints. To investigate the sensitivity to c_+ and c_−, instead of setting them to 1 − p%, we perturb these parameters by setting them to (1 − p%)(1 ± e). Figures 6.1(d), 6.2(d) and 6.3(d) show the results of GMEns on the three data sets for four noise levels p% = 10%–40% and e = 1%, 10% (the corruption protocol used in these studies is sketched below).
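The following Python sketch (hypothetical names) implements this corruption protocol: it flips the label of each eligible constraint with probability p, where eligibility selects the positive constraints, the negative constraints, or both.

import random

def corrupt_constraints(pairs, p, which="both", seed=0):
    # pairs: list of (i, j, y) with correct labels y in {+1, -1}.
    # p: flip probability as a fraction (e.g. 0.4 for a 40% noise level).
    # which: 'both', 'positive', or 'negative' controls which constraints are corrupted.
    rng = random.Random(seed)
    noisy = []
    for i, j, y in pairs:
        eligible = (which == "both"
                    or (which == "positive" and y == +1)
                    or (which == "negative" and y == -1))
        y_tilde = -y if (eligible and rng.random() < p) else y
        noisy.append((i, j, y_tilde))
    return noisy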
We observe that GMEns is overall robust to modest perturbations, making the proposed algorithm applicable even when the assumed noise levels are inaccurate.

[Figure 6.3. Experimental results on the Terrorist Attack data set, with the same four panels as Figure 6.1: (a) real noisy constraints, (b) synthetic noisy constraints, (c) robustness to noise, (d) sensitivity to c_+, c_−.]

Finally, we compare the proposed algorithm to the baselines on the synthetic noisy constraints by varying the level of noise. Because some of the baseline algorithms are time consuming, one thousand pairs are sampled for the positive constraints and one thousand for the negative constraints. We report the results for noise added to the positive constraints, owing to its stronger effect on performance. Figures 6.1(b), 6.2(b) and 6.3(b) show the clustering results of all algorithms at three noise levels, low (10%), medium (40%), and high (70%), on the three data sets. We observe that the proposed algorithm outperforms all the baseline algorithms in all the cases.

6.4 Conclusions

In this chapter, we have proposed a generalized maximum entropy model for learning from noisy M&C link information, and applied it to learning optimal weights for combining multiple kernels for clustering. Our theoretical analysis shows that the model trained from the noisy link information converges to the model trained from the perfect link information. Extensive experimental results verify the efficacy of the proposed model.

CHAPTER 7

Clustering for Noisy M&C Linked Data: A Probabilistic Link Prediction Model (II)

In this chapter, we present an alternative approach for estimating the feature statistics under different assumptions about the data and the noise. We begin with a description of the notation.

7.1 Notations

Let D = {x_j ∈ X} be a collection of observed data, and let P = {(x^1_i, x^2_i, ỹ_i) : x^1_i, x^2_i ∈ D, ỹ_i ∈ {1, −1}, i = 1, · · ·, n} be a collection of n pairwise constraints, where ỹ_i is the noisy label given to the pair (x^1_i, x^2_i), indicating whether the pair is a positive constraint (ỹ_i = 1) or a negative constraint (ỹ_i = −1). Let y_i denote the underlying true label for the pair (x^1_i, x^2_i), which is unknown in our setting. Let k_j(·, ·) : X × X → R, j = 1, · · ·, m, be a set of m base kernels. Our goal is to learn a combination of the multiple kernels, k(·, ·) = Σ_{j=1}^m w_j k_j(·, ·), from the noisy must-and-cannot links, and to use the combined kernel for clustering. A few more words about the notation used in this study: for any pair (x^1, x^2), we use k(x^1, x^2) = (k_1(x^1, x^2), · · ·, k_m(x^1, x^2))^⊤ to denote the similarities between x^1 and x^2 under all m kernel functions, and we use the shorthand k_i for k(x^1_i, x^2_i). We write ȳ = −y for the complement of y, and use the subscripts + and − for the conditions y = 1 and y = −1, respectively.
77 7.2 7.2.1 Learning from Noisy M&C Linked Data A Probabilistic Model To learn the combination of multiple kernels, we construct a conditional model Pr(y|x1 , x2 ) for computing the pairwise classification probability, where y ∈ {1, −1} is the underlying true label to indicate whether the pair (x1 , x2 ) belongs to the same class or not. Given multiple kernels kj (·, ·), j = 1, . . . , m, we formulate Pr(y|x1, x2 ) by Pr(y|x1 , x2 ) = exp(yw⊤k(x1 , x2 )/2) 1 = 1 + exp(−yw⊤ k(x1 , x2 )) exp(−w⊤ k(x1 , x2 )/2) + exp(w⊤ k(x1 , x2 )/2) where w = (w1 , · · · , wm )⊤ ∈ Rm are the non-negative weights that need to be learned. + We constrain w to be non-negative to ensure that the resulting combined kernel is positive semi-definite. By optimizing the log-likelihood of the true labels for the observed pairs, we obtain the optimal weights for kernel combination. More specifically, we need to solve the following optimization problem 1 max w∈Rm n + n i=1 ln p(yi |x1 , x2 ) − i i λ w 2 2 2 (7.1) or equivalently, max w∈Rm + 1 where ao [k] = n 1 ⊤ o λ 1 w a [k] − w 2− 2 2 2 n 1 2 i yi k(xi , xi ) n ln i=1 exp y 1 ⊤ yw ki 2 (7.2) is what we call sufficient statistics. The main challenge is that the true labels for the observed pairs are unknown, making it difficult to apply the maximum likelihood estimation. Below, we present an approach that effectively approximates the sufficient statistics using the noisy labels. 7.2.2 Estimating Sufficient Statistics In this section, we present an alternative approach for estimating the sufficient statistics ao [k] under a different assumption from that in Chapter 6. We first present the assumption. 78 Assumption 2. We assume (2.a) Pr(|X1 , X2 |Y, Y) = Pr(X1 , X2 |Y), (2.b) Pr(Y = y|Y = y) = dy and Pr(Y = y) = py is given. Compared to Assumption 1, we can see that Assumption 2 assumes that the data pair X1 , X2 is independent of the noisy label Y given the true label Y, which is essentially equivalent to assumption (1.a). Nevertheless, we assume different knowledge about the noise, i.e., assumption (2.b), which is different from assumption (1.b). Comparing the two assumptions, assumption 1 may be advantageous if we know that how much noise is added on top the true labels, however, the conditional probabilities in assumption (2.b) may be estimated more accurately by a sampling approach, which samples a part of examples among noisily labeled pairs and query their true labels. Given the knowledge of py , we rewrite the expectation Eo [k] by Eo [k] = EX1 ,X2 ,Y Yk(X1 , X2 ) = EY YEX1 ,X2 |Y k(X1 , X2 ) = p+ Ec [k] − p− Ec [k] − + where Ec [k] is defined in Table 1. Estimating Eo [k] is therefore reduced to estimating Ec [k]. y y Given the independence assumption, we have EX1 ,X2 |Y=y [k(X1 , X2 )] =EX1 ,X2 |Y=y [k(X1 , X2 )] Pr(Y = y|Y = y)+ EX1 ,X2 |Y=¯[k(X1 , X2 )] Pr(Y = y |Y = y) ¯ y Writing in the matrix form, we have Ec [k] + Ec [k] − ⊤ ⊤ =B Ec [k]⊤ + c [k]⊤ E− (7.3) 1 − d+ d+ . Equa1 − d− d− tion (7.3) allows us to estimate Ec [k] (and therefore Eo [k]) from Ec [k], a quantity that can be y y where Ec [k] is defined in Table 6.1, and B is defined as B = y computed from the noisy labels. In particular, we approximate Ec [k] by its sample average y ac [k] which is computed by y ac [k] = y 1 2 i δ(yi , y)k(xi , xi ) i δ(yi , y) 79 By replacing Ec [k] with ac [k] in (7.3), we obtain the estimation for Ec [k], denoted by bc [k], y y y y by solving the following least square problem min B bc [k] y bc [k]⊤ + bc [k]⊤ − −A (7.4) F ac [k]⊤ + . 
To obtain a more robust estimation of Ec [k], we note that y ac [k]⊤ − c y py Ey [k] = EX [k(X)] = E[k], and therefore add the following constraint for the esti- where A = mator bc [k] y py bc [k] y y 1 = n n k(x1 , x2 ) = a[k] i i (7.5) i=1 Combining Equation (7.4) and (7.5), we have the following constrained least square problem for bc [k], an estimator of Ec [k] y y min B bc [k] y bc [k]⊤ + bc [k]⊤ − −A , py bc [k] = a[k] y s.t. (7.6) y F It can be shown that the solution for bc [k] is given by y bc [k]⊤ + c [k]⊤ b− = (B⊤ B)−1 pa[k]⊤ pp⊤ (B⊤ B)−1 + I − ⊤ ⊤ −1 B⊤ A ⊤ (B⊤ B)−1 p p p (B B) p (7.7) where p = (p+ , p− )⊤ . Note that the solution in (7.7) requires B to be non-singular, implying d+ + d− = 1. Given Ec [k] is estimated by bc [k], we have Eo [k] estimated as follows y y Eo [k] ≈ bo [k] = p+ bc [k] − p− bc [k] + − leading to the following maximizing the approximate log-likelihood of the true labels for the observed pairs max w∈Rm + 7.2.3 λ 1 1 ⊤ o w b [k] − w 2− 2 2 2 n ln i exp y 1 ⊤ yw ki 2 (7.8) Convergence Analysis In this section, we present the convergence analysis showing that the combination weights learned by the proposed approach converge to the optimal ones learned from perfectly labeled 80 pairs. All the proofs for the lemmas and theorems presented in this section are included in supplementary material. Comparing (7.2) with (7.8), we see that the only difference between the two optimization problems is that ao [k] in (7.2) is replaced with bo [k] in (7.8). The following lemma showing the bounds for w when replacing ao [k] with bo [k]. ∗ ∗ Lemma 14 ( Lemma 6[78]). Let wa be the solution to (7.2) with ao [k], and wb be the solution to (7.8) with bo [k]. We have 1 o ∗ ∗ a [k] − bo [k] 2 wa − wb 2 ≤ λ Next, we try to bound the difference between ao [k] and bo [k]. Note that ao [k] − bo [k] 2 ≤ ao [k] − Eo [k] 2 + bo [k] − Eo [k] 2 (7.9) The following two lemmas allow us to bound the two terms in (7.9), respectively. Lemma 15. With the bounded kernel function kj (x1 , x2 ), i.e., |kj (x1 , x2 )| ≤ R, j = 1, . . . , m, the following inequality holds with probability at least 1 − δ for any δ > 0 ao [k] − Eo [k] 2 ≤ 2mR2 ln n 2m δ Lemma 16. With bounded kernel function kj (x1 , x2 ), i.e. |kj (x1 , x2 )| ≤ R, j = 1, . . . , m, and significant large number n+ of pairs labeled as positive and large number n− of pairs labeled as negative, i.e., there exists some positive constant ρ > 0 such that min(n+ /n, n− /n) ≥ ρ. Under the independence assumption, the following inequality holds with probability at least 1 − δ for any δ > 0 bo [k] − Eo [k] 2 ≤ C 2mR2 ln n 6m δ with C defined by C= ˆ ˆ ˆ 4d 4 32(d6 d + d7 ) 4κ2 32(1 + κ3 ) ≤ + + ρ2 σmin ρ2 d8 p 2 p 2 d4 2 2 where σmax , σmin are the maximum and minimum eigenvalue of B⊤ B, κ = σmax /σmin , ˆ d = d+ + d− − 1, and d = 1 + |d+ − d− |. 81 Combining the results in Lemma 1, Lemma 2 and Lemma 3, we have the following theorem. Theorem 17. With the same conditions as in Lemma 3, the following inequality holds with probability at least 1 − δ for any δ > 0 1 ∗ ∗ wa − wb 2 ≤ λ 2mR2 n C ln 12m + δ ln 4m δ ∗ ∗ Theorem 4 indicates wb , the optimal solution to problem (7.8) converges to wa , the optimal solution to problem (7.2), as the number of the pairs n approaches infinity. It is interesting to observe that the bound is proportional to C, which is inversely proportional to |d| that is related to noisy level. Hence, the larger the |d|, the better the approximate solution will be. This will be further validated by our experiments. 
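Returning to the estimator defined in (7.6), a minimal numerical sketch (Python, hypothetical names) is given below. Rather than using the closed form (7.7), it solves the equality-constrained least-squares problem column by column through its KKT system, which is equivalent under the assumption that B is non-singular; the per-column decomposition is possible because the constraint in (7.5) couples only the two entries of each column.

import numpy as np

def estimate_bc(A_tilde, a_all, B, p):
    # A_tilde: (2, m) array whose rows are the noisy conditional averages for y = +1 and y = -1.
    # a_all:   (m,) array, the unconditional average a[k].
    # B:       2x2 matrix built from d_+ and d_-;  p = (p_+, p_-).
    # Returns a (2, m) array whose rows estimate E^c_+[k] and E^c_-[k].
    B = np.asarray(B, dtype=float)
    p = np.asarray(p, dtype=float)
    m = A_tilde.shape[1]
    X = np.zeros((2, m))
    BtB, Bt = B.T @ B, B.T
    for j in range(m):
        # KKT system of:  min_x ||B x - A_tilde[:, j]||^2  s.t.  p^T x = a_all[j]
        KKT = np.block([[2 * BtB, p[:, None]], [p[None, :], np.zeros((1, 1))]])
        rhs = np.concatenate([2 * Bt @ A_tilde[:, j], [a_all[j]]])
        X[:, j] = np.linalg.solve(KKT, rhs)[:2]
    return X

def estimate_bo(bc, p):
    # b^o[k] = p_+ b^c_+[k] - p_- b^c_-[k], the estimate of the sufficient statistic a^o[k].
    return p[0] * bc[0] - p[1] * bc[1]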
∗ Our next result is regarding the difference between the estimated solution wb and the optimal solution w∗ obtained by solving (7.1) with an infinite number of perfectly labeled pairs. Theorem 18. With the same conditions as in Lemma 3, the following inequality holds with probability at least 1 − δ for any δ > 0 ∗ wb − w∗ 1 2 ≤λ 2mR2 n 24m + C ln δ 8m ln δ + √ 4ρ mR √ λ n 1+ 1 4 ln 2 δ Finally, we note that our analysis relies on accurate estimate of p, B. In supplementary material, we provide the error bound with inaccurate estimate of p, B. 7.3 Experiments In this section, we validate the proposed approach by clustering on linked documents using the combined kernels learned from noisy pairwise constraints. We conduct two sets of experiments: in the first set of experiments, we synthesize the noisy pairwise constraints by random sampling, and in the second setup, we derive the pairwise constraints from the links among documents. We use the same cora and citeseer data sets. The baselines for comparison are described as follows. 82 Data sets Two paper citation data sets, Cora and Citeseer, processed by Lise Getoor’s research group, are used in our study. Each paper in the data sets is described by a binary vector indicating the presence and absence of corresponding words, and is assigned to one of the given classes. Besides the attributes and class assignments of papers, the citations between papers are also available in these two datasets. The statistics of the two data sets are summarized in Table 7.1. Following [47], we normalize the attributes of each document by first dividing the sum of the attributes and then taking the square root. Baselines We compare the proposed algorithm to the following two algorithms that learn kernel matrices from pairwise constraints: (1) NPK, a state-of-the-art non-parametric kernel learning algorithm from pairwise constraints [42], and (2) SKL, a spectral kernel learning algorithm [44]. We did not compare to the kernel learning algorithms [50, 52] from pairwise constraints because they are outperformed by NPK according to [42], and to algorithm in [57] because it requires solving a SDP problem, which does not scale well for large data sets. Besides kernel learning algorithms, we also compare the proposed algorithm to three representative methods for distance metric learning from pairwise constraints: (1) GDM, a global distance metric learning algorithm [102], (2) DCA, a discriminative component analysis algorithm [43], and (3) ITML, a information-theoretic based metric learning algorithm [21]. We did not compare to the other metric learning algorithms because they either are not as effective as these three metric learning algorithms (e.g. [31]) or require labeled data for training instead of pairwise constraints (e.g., [100]). The learned metric/kernel matrix is used by a spectral clustering algorithm [84] to cluster documents. Since these learning algorithms do not address noisy pairwise constraints, the comparison will allow us to verify if the proposed algorithm is robust to the noise in pairwise constraints. The last two baselines are constrained clustering algorithms: MPCK [8], a state-of-the-art algorithm for constrained clustering, and LCVQE [75], a recently developed k-means algorithm with noisy constraints. To improve the robustness of MPCK to noisy constraints, we follow [61] 83 Table 7.1. 
Statistics of Data sets data Cora Citeseer #papers #attr #citations #classes 2708 3312 1433 3703 5429 4591 7 6 by weighting the noisy pairwise constraints by probability dy . Since MPCK and LCVQE are claimed to be robust to noisy constraints, the comparison will allow us to see if the proposed approach is effective for noisy pairwise constraints. Finally, we include the baseline that combines kernels with equal weights, referred to as LK. For the proposed approach, each candidate kernel kj is the linear kernel on the j th attribute. Given the learned combination of kernel matrices, the same spectral clustering algorithm will be used to cluster documents. In all the experiments, we set the regularization parameter λ as λ = 0.01/n. We refer to the proposed approach as LKCnpc. To evaluate the clustering results, we compute the normalized mutual information [107] by comparing the cluster assignments predicted by the clustering algorithms to the class assignments given in the data set. 7.3.1 Experiments with Real Noise In this subsection, we show experimental results on noisy pairwise constraints derived from citation information. In particular, a positive pairwise constraint is created if two papers are linked by a citation, and a negative pairwise constraint is created when two papers do not cite each other. Since the number of unlinked paper pairs is much larger than that of linked pairs, for computational efficiency, we randomly sample the same number of unlinked paper pairs as that of linked paper pairs. As a result, we have on average about 20%-25% pairwise constraints being incorrect. To obtain py , dy we randomly sample 1% of document pairs and label them according to their class assignments, and compute py , dy from the noisy labels and the correct labels for these sampled pairs. For fair comparison, the correct labels of these sampled pairs are used by the baselines. 84 Cora Citeseer 0.3 LKCnpc NPK SKL GDM DCA ITML LCVQE MPCK 0.2 0.1 0 NMI NMI 0.3 LKCnpc NPK SKL GDM DCA ITML LCVQE MPCK 0.2 0.1 0 (a) Cora (b) Citeseer Figure 7.1. Comparison of clustering performance on real noise. Table 7.2. Clustering performance on unseen data data Cora Citeseer observed data unseen data 0.3326±0.013 0.3028±0.012 0.2822±0.012 0.2502±0.020 Figure 7.1(a) and 7.1(b) show the clustering results on the two data sets for the proposed approach and baseline methods. We observe that most baseline methods, except for GDM, LCVQE, are unable to outperform the reference method LK. In particular, we observe that MPCK, a state-of-the-art constrained clustering algorithm, performs significantly worse than LK, indicating the importance of addressing noisy pairwise constraints in constrained clustering. Overall, we observe that the proposed approach outperforms all the baseline methods on both data sets. We thus conclude that the proposed approach is effective for learning from noisy pairwise constraints. Finally, we briefly show the clustering results applied on the unseen data with learned combinations of multiple kernels from the observed data. We randomly choose 80% papers as observed data and derive noisy pairwise constraints for these observed data from citations, and the remaining 20% papers as unseen data for testing the learned model. The NMI averaged over 5 trials for the learned model on the observed data and unseen data for the two data sets is shown in Table 7.2. We can see that the combination weights learned from 85 observed data also works well on unseen data. 
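To make concrete how the learned weights are used at clustering time, including on unseen documents as in Table 7.2, the following sketch builds the combined kernel from the per-attribute linear kernels and feeds it to a spectral clustering routine. All names are hypothetical, and scikit-learn's spectral clustering is used here only as a stand-in for the algorithm of [84].

import numpy as np
from sklearn.cluster import SpectralClustering

def combined_kernel(X, w):
    # X: (n_docs, m) matrix of normalized, non-negative attributes; w: (m,) learned weights.
    # Each base kernel is k_j(x1, x2) = x1[j] * x2[j], so the weighted combination
    # is the single weighted linear kernel X diag(w) X^T.
    return (X * w) @ X.T

def cluster_documents(X, w, n_classes, seed=0):
    K = combined_kernel(X, w)   # non-negative, hence a valid precomputed affinity
    sc = SpectralClustering(n_clusters=n_classes, affinity="precomputed", random_state=seed)
    return sc.fit_predict(K)    # cluster labels, to be compared with the classes via NMI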
7.3.2 Experiments with Synthetic Noise In this subsection, we show experimental results using synthesized noisy pairwise constraints. We first examine the robustness of the proposed approach to the noisy pairwise constraints by varying the level of noise. In particular, for each data set, we randomly sample 10, 000 pairs in the same class and 10, 000 pairs in different classes. To obtain the noisy label yi for each pair, we randomly flip the correct label yi with a probability of p%, where p% specifies the noise level. These 20, 000 document pairs together with their noisy labels are used to train both the proposed approach and the eight baseline methods for document clustering. The values of py and dy are computed from the correct labels and noisy labels. We choose four noise levels, p% = 20%, 40%, 60%, 80% in our study. Figure 7.2(a) and 7.2(b) compare the clustering performance of the proposed approach to that of the baseline methods on the two data sets. The results are averaged over five independent experiments. For the convenience of comparison, we also include the clustering result of the proposed approach but without modeling the noise in the pairwise constraints, i.e. simply using ao [k] in place of ao [k], as referred to LKCpc. The comparison to LKCpc allows us to see how the noise in pairwise constraints affects the clustering results. It is clear that the proposed approach LKCnpc is more resilient to the noisy pairwise constraints than the baseline methods. In particular, we observe that even the noise level is relatively low(20%), most of the baseline methods, except for LKCpc, GDM, are unable to outperform the reference method LK. As the noise level increases, except for ITML, the performance of all the other baseline methods degrade significantly and become much worse than that of LK when the noise level exceeds 50%. For all the cases, the proposed approach yields the best performance among all the methods in comparison. More interestingly, on both data sets, we observe that the clustering performance of the proposed approach increases when the noise level goes beyond or below 50%. This result seems surprising at the first 86 glance, however, by noting that our approach relies on the knowledge of dy = Pr(Y = y|Y = y) related to noise level, when the noise level is above 50%, the simple approach by flipping labels of pairwise constraints to their opposite can obtain pairwise constraints with smaller errors, leading to better clustering performance, and so of course does the proposed approach. Also, this result is consistent with our theoretical analysis in Theorem 4, because the difference between the model learned from noisily labeled pairs and the model learned from perfectly labeled pairs decreases as |d| = |d1 + d0 − 1| increases. The corresponding |d| for the four noise levels p% = 20%, 40%, 60%, 80% are 0.6, 0.2, 0.2, 0.6. Cora Citeseer NMI 0.3 0.2 0.1 0 LKCnpc LKCpc NPK SKL GDM DCA ITML LCVQE MPCK 0.3 NMI LKCnpc LKCpc NPK SKL GDM DCA ITML LCVQE MPCK 0.4 0.2 0.1 0 20% 40% 60% 80% percentage of noise (a) Cora 20% 40% 60% 80% percentage of noise (b) Citeseer Figure 7.2. Comparison of clustering performance with different noise levels. Second, we show the proposed approach is not sensitive to the estimated values of py and dy . With 20,000 synthetic pairwise constraints, and p% = 20% noise, we randomly perturb py and dy by either adding or subtracting 10 percentage from their “true” values computed from the ground truth. 
We found the clustering performance is almost the same, as shown in Table 7.3, where py , dy refer to their “true” values, i.e. py = i δ(yi , y)δ(yi, y)/ i δ(yi , y), i δ(yi , y)/n, dy = and py , dy refer to randomly perturbed values p+ = (1 ± ˆ ˆ ˆ ˆ 10%)p+ , dy = (1 ± 10%)dy , and the results for perturbed values are averaged over 10 random trials. Third, we show the experimental results to verify the convergence behavior in the number 87 Citeseer Cora 0.3 NMI NMI 0.3 0.2 0.1 0.1 0 0.2 MKL−noisy labels MKL−perfect labels 100 0 500 1000 5000 10000 #of pairwise constraints (a) Cora MKL−noisy labels MKL−perfect labels 100 500 1000 5000 10000 #of pairwise constraints (b) Citeseer Figure 7.3. Convergence behavior of the proposed approach Table 7.3. Clustering performance with perturbed (py , dy ) data Cora Citeseer (py , dy ) 0.3338 0.3120 (ˆy , dy ) p ˆ 0.3279±0.006 0.3011±0.007 of noisily labeled pairs n as presented in the paper. We sample the equal number of pairs in the same class and pairs in different classes. The noisy label for these pairs are obtained by using the same random flipping process. We fix the noisy level to p% = 20%. We compare the clustering performance of the proposed approach with noisy labels to the same approach but with perfect labels by increasing the total number of pairwise constraints from 200, 1000, 2000, 10, 000 to 20, 000. The results in Figure 7.3(a) and Figure 7.3(b) show that the difference decreases as the number of constraints increases. We also noticed that the iterative method (e.g., Newton method) for solving the related optimization problem converges very quickly in one hundred iterations on the two data sets with 20,000 paris. Finally, we mention that the two approaches proposed in this Chapter and last Chapter yield similar empirical results. 88 7.4 Conclusions In this chapter, we have presented an alternative approach for estimating the sufficient statistics in generalized maximum entropy model under different assumptions from Chapter 6. Our theoretical analysis shows that the model trained from the noisy link information converges to the model trained from the perfect link information. Extensive experimental results verify the efficacy of the proposed approach. 89 CHAPTER 8 Conclusions and Future Work In previous chapters, we have presented a link-based model for community detection, a discriminative approach for combining link and content information for community detection for networked data, and a probabilistic link prediction model for clustering the noisy M&C linked data. We summarize the contributions of the thesis as follows. • In Chapter 4, we present a probabilistic link model for detecting communities for directed networks. We introduce popularity and productivity to model the differences of nodes in receiving links and the differences of nodes in producing links, respectively. These two factors can explain the noisy connections in terms of community detection, because the connections between nodes may be not due to their common or similar community, but rather because of their high popularities tending to receive links or high productivities tending to produce links. These two factors can also explain the preferential attachment phenomenon in the real world, i.e., the power law degree distribution. Using these two factors, together with community memberships, we define a generative model to generate the links. 
The proposed link model is an unified framework which can include several previous models in degenerated cases, and also can derive us new link models. It is the first time that the power law degree distribution is modeled for community detection. • In Chapter 5, we present a discriminative approach for combining link and content information for community detection. Different from previous approaches that usually put together a generative link model and a generative content model and therefore are vulnerable to irrelevant or noisy attributes, the proposed approach uses a discriminative model on the content to fit the community memberships, which thus has 90 discriminative power to identify relevant attributes from irrelevant ones. The proposed approach for combining our link model and our content model yield significantly better empirical performance for community detection. It is the first discriminative approach for combining link and content for community detection. • In Chapter 6 and Chapter 7, we present approaches for clustering noisy must-andcannot linked data. To handle the noisy M&C links, we formulate the problem into learning from noisily labeled data. We propose a generalized maximum entropy model to learn the conditional model that given a pair of data how likely they are connected by a must link (i.e., how likely they belong to the same cluster). The critical problem of learning the conditional model is to compute the sufficient statistics that depend on the true labels which are unknown. We propose two different approaches under different assumptions to estimate the sufficient statistics and we prove the convergence results , i.e., the model learned from the noisy labels converges to the model learned from the true labels under appropriate assumptions about the data and the noise. It is the first work that proves the convergence for learning from noisily labeled data. The studies conducted during the course of this dissertation point to several directions for future research: • Algorithms for Detecting Communities and Their Evolutions in Dynamic Networks Since networks are dynamic in the real world, e.g., nodes could leave the network and new nodes could join in the network, nodes could change their communities, and communities could disappear and new communities would emerge, it is important to model the dynamic behavior of communities in networks. For future work, we can extend the proposed link models or previous link models to a dynamic version in order to model the dynamic changes in networks. The key challenges are (i) how to model the evolution of communities sequentially, and (ii) how to handle the deletion, insertion of nodes. 91 • Algorithms and Theory for Learning from Noisily Labeled Data The problem of noisy M&C linked data was abstracted into the problem of learning from noisily labeled data in Chapter 6&7, which is an important problem in machine learning since the labels could be noisy due to human error or uncertainty in labeling. 
For future work along the line, we can think about several directions: (1) how to relax the independence assumption made by the approaches in Chapter 6&7, while still maintain some theoretical guarantee about the learned model; (2) how to extend SVM to handle the noise labels with an efficient optimization algorithm to learn the parameter; (3) how to extend the problem to an online setup, where at each trial an noisy label is given to the learner together with the data, after making prediction on the data, the learner decides to make a query or not for the true label of the data, for which the learner usually needs to pay a cost. The goal of the learner is to minimize the errors he makes on the received examples under certain constraint imposed on the cost for querying the true labels. • Applications The proposed link models for community detection provide a solution for link prediction. The goal of link prediction is to predict whether two entities could link to each other (e.g., whether a user could be followed by other users or whether a product could be purchased by customers). The proposed approach can predict the links based on the communities of the entities and their popularities, productivities and attributes. It would be interesting to compare the community-based link models with traditional approaches for link prediction. The proposed approach for learning from noisily labeled data can be applied to semisupervised crowd clustering, where pairwise constraints are usually collected to facilitate the clustering and subject to errors because of biased or adversarial human annotators. The technique of estimating the sufficient statistics from noisy labels can be also applied to estimating the sufficient statistics for a target class from that for auxiliary classes based on the relation between the target class and the auxiliary classes 92 parameterized by the conditional probabilities that given a data point belongs to the target class how likely it also belongs to an auxiliary class. Using this technique, we can learn a model for predicting the target class without any training examples that are labeled by the target class, which could be useful when the target class label is difficult or expensive to collect. 93 APPENDICES 94 APPENDIX A Proofs Proof to Theorem 7 Proof. Under the i.i.d. assumption, it is straightforward to show that p E[ay [κj ]] = E[aδ [κj ]] = Eδ [κj ] y y Following the McDiarmid’s inequality, for any ǫ > 0, we have Pr Pr ǫ2 n 2R2 ǫ2 n aδ [κj ] − Eδ [κj ] ≥ ǫ ≤ 2 exp − 2 y y 2R p ay [κj ] − Eδ [κj ] ≥ ǫ ≤ 2 exp − y Using the following inequality and the union bound, p p aδ [κj ] − ay [κj ] ≤ aδ [κj ] − Eδ [κj ] + ay [κj ] − Eδ [κj ] y y y y we can complete the theorem using union bound. Proof to Theorem 8 Proof. Using the assumption (1.a) and (1.b), we have Eδ [κj ] = EX1 ,X2 EY|X1 ,X2 [δ(Y, y)κj (X 1 , X 2 )] = EX1 ,X2 [Pr(Y = y|X1 , X2 )κj (X1 , X2 )] y ¯ = EX1 ,X2 cy Pr(Y = y|X1 , X2 )κj (X1 , X2 ) + EX1 ,X2 (1 − cy ) Pr(Y = y |X1 , X2 )κj (X1 , X2 ) ¯ = cy Eδ [κj ] + (1 − cy )Eδ [κj ] = (cy + cy − 1)Eδ [κj ] + (1 − cy )E[κj (X1 , X2 )] ¯ y ¯ ¯ y ¯ y where we use the fact Eδ [κj ] + Eδ [κj ] = E[κj (X1 , X2 )]. Let us define y y ¯ cδ [κj ] = (cy + cy − 1)aδ [κj ] + (1 − cy ) ¯ ¯ y y 95 1 n κj (x1 , x2 ) i i i Under the i.i.d. 
assumption, it is straightforward to show that E[aδ [κj ]] = E[cδ [κj ]] = Eδ [κj ] y y y Following the McDiarmid’s inequality, for any ǫ > 0, we have ǫ2 n 2R2 ǫ2 n cδ [κj ] − Eδ [κj ] ≥ ǫ ≤ 2 exp − 2 y y 2R aδ [κj ] − Eδ [κj ] ≥ ǫ ≤ 2 exp − y y Pr Pr Then we have cδ [κj ] − aδ [κj ] ≥ ǫ ≤ 4 exp − y y Pr ǫ2 n 8R2 Dividing both sides of cδ [κj ] − aδ [κj ] ≥ ǫ by |cy + cy − 1|, we have ¯ y y Pr ǫ ≥ |cy + cy − 1| ¯ aδ [κj ] − bδ [κj ] y y ǫ2 n ≤ 4 exp − 2 8R Replacing ǫ with |(cy + cy − 1)|ǫ, we complete the proof using the union bound. ¯ Proof to Theorem 11 Proof. Let L(w) = 1 n ln i exp y 1 ⊤ 1 λ yw ki − w⊤ v + w 2 2 2 2 2 λ 1 = g(w) − w⊤ v + w 2 2 2 2 where g(w) is the sum of log-exponential function of w, which is convex in w. Assume w∗ is the optimal solution to minimizing L(w), w∗ is the optimal solution to minimizing L(w) with v replaced with v. First, we have L(w∗ ) ≥ L(w∗ ) + ∇L(w∗ )⊤ (w∗ − w∗ ) + ≥ L(w∗ ) + λ ∗ w − w∗ 2 2 2 96 λ ∗ w − w∗ 2 2 2 where the first inequality follows that L(·) is a λ-strongly convex function, and the second inequality follows the optimality criterion that ∇L(w∗ )⊤ (w∗ − w∗ ) ≥ 0. Second, λ 1 L(w∗ ) = g(w∗) − v⊤ w∗ + w∗ 2 2 2 2 1 1 λ = g(w∗) − v⊤ w∗ + w∗ 2 + (v − v)⊤ w∗ 2 2 2 2 1 ⊤ ∗ λ ∗ 2 1 ≤ g(w∗) − v w + w 2 + (v − v)⊤ w∗ 2 2 2 ∗) − 1 v⊤ w∗ + λ w∗ 2 + 1 (w∗ − w∗ )⊤ (v − v) ≤ g(w 2 2 2 2 1 ∗ ≤ L(w∗ ) + w − w∗ 2 v − v 2 2 Coming the above two bounds for L(w∗ ) together, we have 1 λ ∗ w − w∗ 2 ≤ w∗ − w∗ 2 v − v 2 2 2 2 i.e., 1 w∗ − w∗ 2 ≤ v−v 2 λ ∗ ∗ Let v = ao [k], and wa be the corresponding optimal solution, and v = bo [k], and wb be the corresponding optimal solution, then we have 1 o ∗ ∗ wa − wb 2 ≤ a [k] − bo [k] 2 λ Proof of Theorem 12 Proof. Since 1/(1 + exp(−s)) is a lipschitz continuous function with lipschitz continuity of 1, we have ∗ ∗ |p(y|x1 , x2 ) − p(y|x1 , x2 )| ≤ (wa − wb )⊤ k(x1 , x2 ) √ √ √ mR o mR o ∗ ∗ a [k] − b [k] 2 ≤ ≤ m wa − wb 2 R ≤ λ λ = mR2 cλ 8 n 4m δ 97 8mR2 ln (c+ + c− − 1)2 n 4m δ Proof of Theorem 13 Proof. We define bo [k] = ao [k] 1 − cy ¯ − a[k] c c where c = c+ + c− −1. Let p(y|x1, x2 ) be the classification model learned from the noisy side information using corrupted c+ and c− , and p(y|x1, x2 ) be the classification model learned from the noisy side information using perfect c+ and c− . Using the analysis in the proof of Theorem 12, we have √ mR o b [k] − bo [k] λ 2 √ o [k] o [k] 1 − cy 1 − cy mR a a ¯ ¯ ≤ − + a[k] − a[k] λ c c c c 2 √ √ 1 − cy 1 − cy mR √ 1 1 ¯ ¯ mR − + mR − ≤ λ c c c c 2 4∆ 2c∆ + 4∆ mR + ≤ 2 λ ρ ρ2 √ √ where we use ao [k] 2 ≤ mR, a[k] 2 ≤ mR, |c − c| ≤ 2∆, c ≥ ρ/2. Using the fact p(y|x1, x2 ) − p(y|x1 , x2 ) ≤ p(y|x1 , x2 ) − p(y|x1 , x2 ) ≤ p(y|x1 , x2 ) − p(y|x1 , x2 ) + p(y|x1, x2 ) − p(y|x1, x2 ) and the result in Theorem 12, we have the theorem. Proof to Lemma 15 Proof. First we note that Eo [kj ] = EX1 ,X2 ,Y [Ykj (X1 , X2 )] Under the i.i.d. assumption, we have E[ao [kj ]] = EX1 ,X2 ,Y [Ykj (X1 , X2 )] = Eo [kj ] 98 Following the McDiarmid’s inequality, for any ǫ > 0, we have Pr ao [kj ] − Eo [kj ] ≥ ǫ ≤ 2 exp − ǫ2 n 2R2 Using the union bound, we have 2 Pr Let δ = 2m exp − ǫ n ao [k] − Eo [k] 2 ≤ mǫ2 ≥ 1 − 2m exp − 2 2 2R ǫ2 n , i.e. ǫ = 2R2 2R2 ln n 2m , we can complete the proof. δ Proof to Lemma 16 Proof. First, under the independence assumption, we can have a similar solution for Ec [k] y as bc [k] in Equation 7.7, i.e. y Ec [k]⊤ + c [k]⊤ E− ¯ where A = Ec [k]⊤ o o + c [k]⊤ . 
To bound b [k] − E [k] 2 , we have E− bo [k] − Eo [k] 2 2 ≤4 ≤ pp⊤ (B⊤ B)−1 pE[k]⊤ ¯ + I − ⊤ ⊤ −1 B⊤ A ⊤ (B⊤ B)−1p p p (B B) p = (B⊤ B)−1 (B⊤ B)−1 σ −2 4 −2min 2 σmax p 2 bc [k]⊤ + c [k]⊤ b− ≤2 − 2 p(E[k] − a[k])⊤ p⊤ (B⊤ B)−1 p ≤4 p 2 2 +4 2 F (B⊤ B)−1 F pp⊤ (B⊤ B)−1 ¯ I − ⊤ ⊤ −1 B⊤ (A − A) p (B B) p ¯ E[k] − a[k] 2 + 8 (B⊤ B)−1B⊤ 2 A − A 2 2 F F (B⊤ B)−1 pp⊤ (B⊤ B)−1 ⊤ B +8 p⊤ (B⊤ B)−1 p κ2 Ec [k]⊤ + c [k]⊤ E− E[k] − a[k] 2 2 −1 + 16(σmin 2 F ¯ A−A 2 F −4 σmin ¯ + −2 σmax ) A − A 2 F σmax 3 4κ2 2 + 16(1 + κ ) A − A 2 ¯ E[k] − a[k] 2 ≤ F σmin p 2 2 99 2 F −1 where we use the fact p 2 σmax ≤ 2 p⊤ (B⊤ B)−1 p ≤ −1 p 2 σmin , 2 B 2 ≤ 2σmax and F −1 ¯ B−1 2 ≤ 2σmin Next, we show bound for E[k]−a[k] 2 and A− A 2 . For E[k]−a[k] 2 , 2 2 F F similar to Lemma 15, the following inequality holds for any ǫ > 0, Pr( E[k] − a[k] 2 ≤ mǫ2 ) ≥ 1 − 2m exp 2 ǫ2 n 2R2 ¯ For A − A 2 , we need first bound each element in the matrix, i.e. |Ec [kj ] − ac [kj ]|, it is y y F easy to show that ǫn ǫ2 n ) ≥ 1 − 2 exp − 2 ny 2R Pr(|Ec [kj ] − ac [kj ]| ≤ y y Then with the union bound, Pr 2 ¯ A − A 2 ≤ 2 mǫ2 F ρ ≥ 1 − 4m exp − ǫ2 n 2R2 where we use min(n+ /n, n− /n) ≥ ρ. Then we have Pr 32(1 + κ3 ) 2 4κ2 mǫ mǫ2 + bo [k] − Eo [k] 2 ≤ 2 ρ2 σmin p 2 2 ≥ 1 − 6m exp − ǫ2 n 2R2 2R2 6m ǫ2 n , and note that σmax σmin = det(B)2 = ln , i.e. ǫ = 2 n δ 2R ˆ d2 , σmax ≤ B⊤ B 1 = d, we have the result in Lemma 3. Let δ = 6m exp − Proof to Theorem 18 ∗ Proof. We first show wa −w∗ 2 can be well bounded with a large probability. By combining 2 this bound with Theorem 17, we have the result in Theorem 18. In the following derivation, √ we use k(X1 , X2 ) 2 ≤ mR. We denote by λ 1 L(w) = w 2+ 2 2 n n ln(1 + exp(−yi w⊤ ki ) i=1 λ w 2 + E[ln(1 + exp(−Yw⊤ k(X1 , X2 )))] L(w) = 2 2 100 ∗ We first bound L(wa ) as follows ∗ ∗ L(wa ) ≤ L(wa ) + max L(w) − L(w) w 2≤ρ where ρ = 2 ln 2 λ is due to L(w∗ ) ≤ ln 2. It is easy to show that, with probability at least 1 − δ, we have √ max L(w) − L(w) ≤ Rn [Fw ] + ρ mR w 2 ≤ρ √ √ ln(1/δ) 2ρ mR ≤ √ + ρ mR 2n n ln(1/δ) 2n where Rn [F ] is the Rademacher complexity of function class F , and Fw = {f (X1 , X2 , Y) = ln(1 + exp(−Yw⊤ k(X1 , X2 ))) : w 2 ≤ ρ}. Using the concentration inequality of bounded difference, with probability at least 1 − δ, we have √ L(w∗ ) ≤ L(w∗ ) + ρ mR ln(1/δ) 2n ∗ Using the fact L(w∗ ) ≥ L(wa ), with probability 1 − 2δ, we have ∗ L(wa ) − L(w∗ ) √ √ 2ρ mR + 2ρ mR ≤ √ n ln(1/δ) 2n ∗ ∗ Using the strong convexity of L(w), we have L(wa ) − L(w∗ ) ≥ λ wa − w∗ 2 , and therefore, 2 2 with probability 1 − δ, have ∗ wa − w∗ ≤ √ √ 4ρ mR 4ρ mR √ + √ λ n λ 2n ln 2 δ We complete the proof by combining the above result with the result in Theorem 4. 101 BIBLIOGRAPHY 102 BIBLIOGRAPHY [1] Lada Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, 2005. [2] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic block models for relational data with application to protein-protein interactions. In Proceedings of the International Biometrics Society Annual Meeting, 2006. [3] R. Albert. Scale-free networks in cell biology. Journal of Cell Science, 118(21):4947– 4957, 2005. [4] R. Albert and A. L. Barab´si. Statistical mechanics of complex networks. Reviews of a Modern Physics, 74:47–97, 2002. [5] Alex M. Andrew. Reinforcement learning: An introduction. Robotica, 17:229–235, March 1999. [6] A. L. Barab´si and R. Albert. Emergence of scaling in random networks. 