LEARNING FROM NOISILY CONNECTED DATA

By

Tianbao Yang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2012

ABSTRACT

LEARNING FROM NOISILY CONNECTED DATA

By Tianbao Yang

Machine learning is a discipline of developing computational algorithms for learning predictive models from data. Traditional learning methods treat the data as independent and identically distributed (i.i.d.) samples from unknown distributions. However, this assumption is violated in many real-world applications, which poses a challenge for learning predictive models. For example, on an electronic commerce website, customers may purchase a product on the recommendation of their friends; the purchase records of customers are therefore not i.i.d. samples but are correlated. Nowadays, data become correlated due to collaborations, interactions, communications, and many other types of connections. Effective learning from these connected data not only provides a better understanding of the data but also brings significant economic benefits. Learning from connected data also poses unique challenges to both supervised and unsupervised learning algorithms, because these algorithms are designed for i.i.d. data and are often sensitive to the noise in the connections. In this dissertation, I focus on developing theory and algorithms for learning from connected data. In particular, I consider two types of connections: the first type is naturally formed in real-world networks, while the second type, the must-and-cannot link, is manually created to facilitate the learning process. In the first part of this dissertation, I develop efficient algorithms for detecting communities in the first type of connected data. In the second part, I develop clustering algorithms that effectively utilize both must-links and cannot-links for the second type of connected data.

A common approach toward learning from connected data is to assume that if two data points are connected, they are likely to be assigned to the same class/cluster. This assumption is often violated in real-world applications, leading to the problem of noisy connections. One key challenge of learning from connected data is how to model the noisy pairwise connections that indicate the pairwise class relationship between two data points. In the problem of detecting communities in networked data, I develop Bayesian approaches that explicitly model the noisy pairwise links by introducing additional hidden variables, besides community memberships, to explain potential inconsistency between the pairwise connections and the pairwise class relationships. In clustering must-and-cannot linked data, I model how the noise is added to the pairwise connections in the manual generation process.
The main contributions of this dissertation include: (i) it introduces, for the first time, popularity and productivity, besides the community memberships, to model the generation of noisy links in real networks; the effectiveness of these factors is demonstrated through the task of community detection; (ii) it proposes, for the first time, a discriminative model that combines content and link analysis for detecting communities, to alleviate the impact of noisy connections in community detection; (iii) it presents a general approach for learning from noisily labeled data, proves the first theoretical convergence results, and applies the approach to clustering noisy must-and-cannot linked data.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my thesis advisor, Dr. Rong Jin. I feel extremely fortunate to have worked with and learned from Dr. Rong Jin. His decision five years ago to extend me a graduate assistantship offer significantly changed my life. Without this offer, I have no idea where I would be, and I cannot imagine performing as well as I did in the past five years. During the five years at MSU, Dr. Rong Jin gave me great help and had a profound influence on my research. He always inspired me to tackle interesting problems and guided me to solve problems using mathematical skills. His enthusiasm for reading books brought me to a world of interesting and useful books from which I learned a lot in the past five years, and I have no doubt that this influence will continue in the future.

I would also like to thank my other committee members, Dr. Pang-Ning Tan, Dr. Joyce Chai, and Dr. Selin Aviyente. I am grateful to my collaborators from NEC Laboratories America and Yahoo Research. I spent three summers at NEC Labs working with Yun Chi and Shenghuo Zhu. I am greatly indebted to them for their help on my first several publications and for introducing me to the interesting research topic of social network analysis. I also spent a wonderful summer at Yahoo Research mentored by Olivier Chapelle and Lihong Li. I enjoyed the discussions with Olivier and Lihong on research problems. I want to thank my colleagues from the LINKS lab at MSU, especially Wei Tong, Yang Zhou, Mehrdad Mahdavi, Fengjie Li, Yi Liu, and Jinfeng Yi. I also want to thank my colleagues from NEC Labs during the three summers. I also owe great thanks to Ying Liu, and I wish her all the best. I gratefully acknowledge the help from the staff of the CSE Department at MSU, especially Linda Moore and Norma Teague. My final and most heartfelt acknowledgment goes to my family for their love and support. I also want to specially thank Fangfang. She always patiently listens to me whenever I am feeling down.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Networks
1.2 Must-and-Cannot Link
1.3 Research on learning from connected data
2 Contributions of this Dissertation
3 Literature Survey
3.1 Community Detection for Networked Data
3.1.1 Community Detection based on Link Analysis
3.1.2 Community Detection based on Content Analysis
3.1.3 Community Detection based on Combined Link and Content Analysis
3.2 Clustering for Must-and-Cannot (M&C) Linked Data
3.2.1 Constrained Clustering
3.2.2 Distance Metric Learning
3.2.3 Kernel Learning
4 Community Detection for Networked Data: A Popularity and Productivity Link Model
4.1 Introduction
4.2 Background
4.2.1 Notation
4.2.2 Existing Models
4.3 Popularity and Productivity Link (PPL) Model
4.3.1 General Form of PPL
4.3.2 Variants of the General PPL Model
4.3.3 Properties of PPL Model
4.4 Relationship with Existing Models
4.4.1 Relationship with Conditional Link Models
4.4.2 Relationship with Symmetric Joint Link Models
4.4.3 Summary
4.5 Estimation Algorithm
4.6 Experiments
4.6.1 Data Sets
4.6.2 Community Detection
4.6.3 Link Prediction
4.6.4 Degree Fitting
4.7 Conclusions
5 Community Detection for Networked Data: A Discriminative Approach for Combining Link and Content
5.1 Introduction
5.2 Conditional Link Model
5.2.1 Popularity Conditional Link Model (PCL)
5.2.2 Analysis of the PCL Model
5.2.3 Maximum Likelihood Estimation
5.3 Discriminative Content Model
5.4 Extensions
5.4.1 PCL + PLSA
5.4.2 PHITS + DC
5.5 Experiments
5.5.1 Data Sets
5.5.2 Evaluation Metrics
5.5.3 Link Prediction
5.5.4 Community Detection
5.6 Conclusions
6 Clustering for Noisy M&C Linked Data: A Probabilistic Link Prediction Model (I)
6.1 Introduction
6.2 Learning from Noisy M&C Linked Data
6.2.1 Notations
6.2.2 Generalized Maximum Entropy Model
6.2.3 Estimating the Sufficient Statistics
6.2.4 Convergence Analysis
6.3 Experiments
6.3.1 Experiments with Real Noise
6.3.2 Experiments with Synthetic Noise
6.4 Conclusions
7 Clustering for Noisy M&C Linked Data: A Probabilistic Link Prediction Model (II)
7.1 Notations
7.2 Learning from Noisy M&C Linked Data
7.2.1 A Probabilistic Model
7.2.2 Estimating Sufficient Statistics
7.2.3 Convergence Analysis
7.3 Experiments
7.3.1 Experiments with Real Noise
7.3.2 Experiments with Synthetic Noise
7.4 Conclusions
8 Conclusions and Future Work
APPENDICES
A Proofs
BIBLIOGRAPHY

LIST OF TABLES

4.1 Taxonomy of the models categorized by type and variables
4.2 Community detection performance on the Political Blog data set
4.3 Community detection performance on the Cora data set
4.4 Community detection performance on the Citeseer data set
5.1 Community detection performance on the Cora and Citeseer data sets
6.1 Notations
6.2 Statistics of data sets
7.1 Statistics of data sets
7.2 Clustering performance on unseen data
7.3 Clustering performance with perturbed $(p_y, d_y)$

LIST OF FIGURES

4.1 (a)∼(c): Recalls on Political Blog data. (d): Degree distribution.
4.2 (a)∼(c): Recalls on Cora data. (d): Degree distribution.
4.3 (a)∼(c): Recalls on Citeseer data. (d): Degree distribution.
4.4 Degree fitting on the Political Blog data
4.5 Indegree fitting on Cora and outdegree fitting on Citeseer
5.1 Recall on the four data sets
5.2 Partition measure of PCL-DC vs. $\lambda$
6.1 Experimental results on the Cora data set
6.2 Experimental results on the Citeseer data set
6.3 Experimental results on the Terrorist Attack data set
7.1 Comparison of clustering performance on real noise
7.2 Comparison of clustering performance with different noise levels
7.3 Convergence behavior of the proposed approach

CHAPTER 1

Introduction

Most machine learning problems can be classified into three categories: (1) supervised learning [92], which aims to learn a prediction function from a set of labeled instances; most supervised learning tasks can be further divided into classification and regression [38], depending on the type of outputs; (2) unsupervised learning [29], which aims to divide a given collection of instances into multiple clusters, with the hope that instances within the same cluster share a larger similarity than instances from different clusters; and (3) semi-supervised learning [114], which aims to learn a prediction function from a mixture of labeled and unlabeled instances. Other learning paradigms include reinforcement learning [5] and learning to learn [88].

In the classical setup of learning problems, training instances are usually represented by vectors of attributes (also referred to as content information), and in the case of supervised learning, labels are provided for individual training instances. In this dissertation, we study the class of learning problems where additional relationship information is provided between pairs of instances. The pairwise relationship information can either be the observed connections between training instances (e.g., hyperlinks between web pages) or connections derived from side information (e.g., pairwise class relationships). Training instances together with their physical connections can be represented by a network, and this type of training data is also referred to as networked data. Pairwise class relationships were introduced by Wagstaff et al. [94] to specify the relationship between the class assignments of two instances, and they are also known as must-and-cannot (M&C) link information. Instances connected by a must-link are said to be in the same class, while instances connected by a cannot-link are in different classes. Pairwise class-relationship connections can be derived from the observed connections with some additional knowledge. We refer to data connected by must-and-cannot link information as must-and-cannot linked data, or M&C linked data. In this dissertation, we refer to both networked data and M&C linked data as connected data. The objective of this study is to take advantage of the connections between instances to facilitate the learning process. We thus refer to the related learning problems as learning from connected data.
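To make the two kinds of connected data concrete, the following minimal Python sketch uses hypothetical toy data (the variable names and values are ours, not taken from the dissertation) to show one common way of representing a networked data set and an M&C linked data set: content vectors, observed directed links, and pairwise class-relationship constraints.

```python
import numpy as np

# Content information: each row is the attribute vector of one instance.
X = np.array([[1.0, 0.2],
              [0.9, 0.1],
              [0.1, 1.0],
              [0.2, 0.8]])

# Networked data: observed directed links (source, target), e.g. hyperlinks
# or citations among the four instances above.
directed_links = [(0, 1), (1, 0), (2, 3), (0, 3)]

# M&C linked data: pairwise class-relationship constraints.
must_links = [(0, 1)]     # instances 0 and 1 should be in the same cluster
cannot_links = [(1, 2)]   # instances 1 and 2 should be in different clusters

# A weighted adjacency matrix S with S[i, j] > 0 iff there is a link i -> j.
N = X.shape[0]
S = np.zeros((N, N))
for i, j in directed_links:
    S[i, j] = 1.0

print(S)
```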
In particular, the research work presented in this dissertation focuses on unsupervised learning on networked data and M&C linked data. The challenge of learning from connected data is how to distinguish useful connections from noisy ones when inferring the pairwise class relationships. In the sequel, the discussion of connected data is divided into networked data and M&C linked data.

1.1 Networks

Networks, as a particular type of connected data, have become an important topic in computer science, physics, biology, communications, statistics, social science, psychology, etc. Many important properties of networks have been studied, including the scale-free property, the small-world property, community structure, triads and clustering coefficients, network motifs, etc. Below, we review the three most important properties of networks, i.e., the scale-free property, the small-world property, and community structure. For discussion of the other properties of networks, we refer the reader to the overviews in [4, 74, 55, 11, 12, 98].

A scale-free network is a network in which a few nodes have a tremendous number of connections to the rest of the nodes, while most nodes in the network have only a handful of connections. This property is also known as a power-law degree distribution. The nodes with very large numbers of connections are usually referred to as hubs. Over the past several years, researchers have uncovered scale-free structures in a stunning range of systems, including the World Wide Web [6], social networks such as the sexual network [59], the network of citations between scientific papers [23], the cellular metabolic network [3], and the protein-protein interaction network [80]. One important characteristic of scale-free networks is that they are robust to accidental failures but vulnerable to coordinated attacks. Scale-free networks have found applications in many domains, such as computing, medicine, and business. Computer networks with scale-free architectures, such as the World Wide Web, are highly resistant to accidental failures, but they are very vulnerable to deliberate attacks and sabotage. In medicine, vaccination campaigns against serious viruses, such as smallpox, might be most effective if they concentrate on treating hubs, i.e., people who have many connections to others. Mapping out the networks within the human cell could aid researchers in uncovering and controlling the side effects of drugs. Furthermore, identifying the hub molecules involved in certain diseases could lead to new drugs that target those hubs.

A small-world network is a type of network in which, although most nodes are not directly connected with each other, most nodes can be reached from any other by a small number of hops or steps. A typical small-world network is the network of acquaintances between people, where the famous six-degrees-of-separation phenomenon [90, 65] is well accepted. Other real-world networks that reveal small-world properties include road maps [108], food chains [66], electric power grids [7], metabolic processing networks [93], networks of brain neurons [85], telephone call graphs [69], and protein networks [91]. In practice, if a network has a degree distribution that can be fit with a power law, this is often taken as a sign that the network is scale-free and also small-world. Scale-free networks usually exhibit the small-world property; however, they are by no means the only kind of small-world network.
Networks of very different topology can still qualify as small-world networks as long as they satisfy the defining properties of small-world networks.

Community structure is another important property of networks. It refers to the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network. This inhomogeneity of connections suggests that the network has certain natural divisions within it. Community structures are quite common in real networks [77, 30, 67, 111, 36]. Social networks [30] often include community groups (the origin of the term, in fact) based on common location, interests, occupation, etc. Metabolic networks [71] have communities based on functional groupings. Citation networks [14] form communities by research topic. Being able to identify these sub-structures within a network can provide insight into how network function and topology affect each other. For more discussion of community structures in networks and methods for automatic community detection, we refer to the survey paper by Porter et al. [76] and the book by Mark Newman [73].

1.2 Must-and-Cannot Link

Must-and-cannot links were introduced as supervised information for clustering by Wagstaff et al. [94]. They specify the class relationship between instances. Pairs of instances connected by a must-link must be in the same cluster, and pairs of instances connected by a cannot-link cannot be in the same cluster. This link information thus provides constraints on clustering results. Must-links and cannot-links have some interesting properties. The must-link constraint is an equivalence relation and hence is symmetric, reflexive, and transitive. For example, if instance x and instance y are connected by a must-link, and instance y and instance z are connected by a must-link, then instance x and instance z should also be connected by a must-link. Similarly, must-link constraints can give rise to cannot-link constraints. For example, if x and y are connected by a must-link and y and z are connected by a cannot-link, then x and z should also be connected by a cannot-link. Though simple, must-and-cannot links are powerful and have been used to improve clustering accuracy [95, 19]. For more discussion of must-and-cannot links and clustering for M&C linked data, we refer the reader to the survey paper by I. Davidson [20].

1.3 Research on learning from connected data

Research on Networked Data

The study of networked data has attracted an enormous amount of interest in the last few years. Topics include the measurement and structure of networks, mathematical models of networks, theories of dynamical processes in networks, and prediction based on networks. We review some research topics as follows, but they are by no means a complete list.

• Community Detection aims to detect the community structure in networks [72, 34].

• Dynamics of Networks. Networks, especially social networks and the web, are constantly evolving, with additions and deletions of nodes and edges taking place all the time. Research on the dynamics of networks concerns the statistical properties and models that govern the generation and evolution of real-world networks [15, 60].

• Diffusion and Cascading in Networks. Information cascades refer to the phenomenon in which an action or idea becomes widely adopted due to the influence of others. Studies on diffusion and cascading in networks seek to build models for the flow of information or influence through a large network [33, 49].
• Statistical Modeling concerns modeling the network from the viewpoint of statistics. It aims to learn a statistical model that generates the network as a whole, based on the statistics collected from the network [99, 32].

• Link Prediction aims to infer new interactions among nodes of a given network that are likely to occur in the near future, given a snapshot of a social network [58, 87].

• Link-based Classification aims to improve classification accuracy by exploiting the link structure in addition to the content information of the nodes [62, 22].

Research on M&C Linked Data

Research on M&C linked data focuses on constrained clustering, distance metric learning, and kernel learning.

• Constrained Clustering aims to improve clustering accuracy by leveraging the must-link and cannot-link information [95, 19, 8, 81, 18, 53, 24].

• Distance Metric Learning aims to learn a distance metric that is consistent with the given M&C linked data, i.e., must-link pairs are separated by a shorter distance than cannot-link pairs [102, 82, 43, 31, 104, 100].

• Kernel Learning aims to learn a kernel matrix or function from the given M&C linked data such that the kernel similarity between must-link pairs is larger than the kernel similarity between cannot-link pairs [51, 42, 41, 56, 13, 46].

CHAPTER 2

Contributions of this Dissertation

This dissertation addresses a number of important problems regarding community detection for networked data and clustering for M&C linked data.

Most algorithms developed for community detection in networked data are based on probabilistic models. The advantage of these approaches is that they are easy to understand. Furthermore, these models are able to predict links in addition to detecting community structures. However, there are two major problems with most probabilistic models for community detection. (i) The links between nodes are generated based only on their community memberships. However, in real networks, many more factors could affect link generation. For example, a webpage could point to another webpage simply because the latter is a very popular webpage, such as Google. (ii) The memberships are assumed to be random variables, independent of the content. However, the community memberships are usually determined by the content of the nodes. We address the first limitation by a link-based probabilistic model that takes into account other factors, such as popularity and productivity, when modeling the generation of links. We address the second limitation by presenting a discriminative approach to incorporate the content information. Below, we describe each of the contributions in detail:

• We propose a novel probabilistic model for link-based community detection in networks, referred to as the Popularity and Productivity Link Model for Community Detection [105]. Besides community memberships, it introduces two important factors, namely popularity and productivity, for modeling the link generation. We verify both theoretically and empirically that the proposed model fits well the power-law degree distribution of real networks, a key property of scale-free networks that is overlooked by previous probabilistic models for community detection. This is the first research work that is able to model the community structure and scale-free properties at the same time.

• We present a novel approach to combine link and content information for community detection in networks [106].
Unlike previous work that assumes the community memberships are random variables inferred from the links, the proposed approach models the community memberships by the content information using a discriminative model and infers the model parameters from the data. With the discriminative model, the proposed approach is able to distinguish the relevant attributes for each community from the irrelevant ones. This is the first work that explores a discriminative model for community memberships when incorporating the content information.

Most previous work on clustering for M&C linked data assumes that both the must-link and cannot-link information is perfect. However, the M&C link information is usually derived from side information, such as citations between scientific papers, and can be vulnerable to noise. To address this limitation, we present two approaches for clustering noisy M&C linked data.

• Both approaches address the noise issue by estimating, from the noisy link information, the sufficient data statistics that would be obtained with perfect link information, and then using the estimated sufficient statistics to learn a probabilistic link prediction model.

• The two approaches differ in their assumptions about the data and the noise. We study two different data models and noise models, devise approaches for estimating the feature statistics under the two different models, and compare them empirically.

• We verify, both theoretically and empirically, the effectiveness of the proposed approaches. This is the first work that provides a theoretical analysis showing that the model learned from the noisy M&C link information converges to the model learned from the perfect M&C link information.

CHAPTER 3

Literature Survey

3.1 Community Detection for Networked Data

As online repositories such as digital libraries and user-generated media (e.g., blogs) become more popular, analyzing such networked data has become an increasingly important research issue. One major topic in analyzing such networked data is to detect salient communities among individuals. Community detection has many applications, such as understanding the social structure of organizations and modeling large-scale networks in Internet services [96]. While there are different formulations for community detection, in this work we focus on the unsupervised learning, or clustering, viewpoint, a commonly accepted and well-studied perspective.

A networked data set is usually represented as a graph where individuals in the network are represented by the nodes in the graph. The nodes are tied with each other by either directed or undirected links, which represent the relations among the individuals. In addition to the links they are incident to, nodes are often described by certain attributes, which we refer to as the contents of the nodes. For example, for web pages, online blogs, or scientific papers, the contents are usually represented by histograms of keywords; in a co-authorship network, the contents of nodes can be the demographic or affiliation information of researchers.

The goal of community detection is to find a set of communities such that the nodes in each community are densely connected with each other while sparsely connected with the rest. When content information is available, we may also expect nodes within the same community to share similar contents. However, most existing studies on community detection focus on either link-based analysis or content analysis alone.
Only very recently have people started to combine these two types of information for community detection. In the following, we review community detection methods in these three categories, i.e., link-based analysis, content-based analysis, and analysis based on the combination of link and content information.

3.1.1 Community Detection based on Link Analysis

Link-based approaches for community detection can be roughly classified into two categories, i.e., metric-based approaches and probabilistic modeling based approaches. For metric-based approaches, a metric is first defined to measure the quality of any potential community structure; then, procedures are developed to optimize the proposed metric. Some well-known metrics used for link-based community detection include normalized cut proposed by Shi et al. [83], modularity proposed by Newman et al. [72], and betweenness proposed by Gregory [34]. More recent work in this category introduces probabilistic interpretations for some of these metrics and extends the metrics from undirected networks to directed networks [54, 112]. A main weak point of these metric-based approaches is that the proposed metrics are usually heuristic and do not have solid foundations. As a result, it is difficult to reach a consensus on the optimal metric. These approaches are also limited in that they are unable to make predictions for unobserved links, which is important for a number of applications.

The second category of approaches is based on probabilistic models. They first define a generative process, in which links are generated based on latent community memberships, and then develop inference algorithms to derive the latent community memberships from the observed links. Holland et al. [45] proposed the first stochastic block model for community detection, and later many variations [2, 68, 25, 28, 39, 48] have been proposed. In the stochastic block models, the probability of creating a link between two nodes is assumed to be a constant that depends on the community assignments of the two nodes. Most of the follow-up works essentially refine this probability. In the probabilistic Hyperlink-Induced Topic Search (PHITS) model [16] and the latent Dirichlet allocation link model [10], the probability of observing a link from node i to node j is given by the probability for j to be linked by any node from the community of i. In the graph factorization model [109, 79], this probability is interpreted as a joint probability of observing both nodes simultaneously, which is factorized based on the Bayes product rule. In terms of inference, some of these models [45, 16, 109, 79] are based on maximum likelihood estimation, while others are based on full Bayesian inference [2, 68, 25, 28, 39, 48, 10]. Given that a large number of approaches have been developed in the family of stochastic block models, it is useful to further classify them into subcategories: the symmetric approaches [79, 109], which model links by symmetric joint probabilities, and the conditional approaches [16, 106], which focus on modeling the conditional probability of receiving links. However, neither of these models is satisfying: a symmetric model misses the semantics of link directions, a key factor that distinguishes directed networks from undirected networks, while a conditional model captures only one type of link, either incoming or outgoing, and therefore is unable to characterize nodes across the full spectrum.
As an example, in a blog readership network, there are two types of bloggers: "writers," who generate influential blogs read by many, and "readers," who read a lot but seldom write anything for others to read. Evidently, to characterize these two types of bloggers, it is important to examine both incoming links and outgoing links of the network.

3.1.2 Community Detection based on Content Analysis

Given the content description of nodes, it is appealing to cast the community detection problem as a clustering problem. Many traditional vector-based clustering algorithms, such as k-means [37] and hierarchical clustering algorithms [86, 97], have been applied to content analysis for community detection. In this subsection, we focus on the scenario in which the content is represented as a bag-of-words, since nodes in many online networks are usually webpages. One of the most well-known approaches for such content information is the topic model, where each topic is naturally interpreted as a community. Two well-known topic models are Probabilistic Latent Semantic Analysis (PLSA) [40] and Latent Dirichlet Allocation (LDA) [10]. In these models, each topic is modeled as a probability distribution over words. To generate a document, one first samples a topic from a prior distribution and then samples words for the given topic. One key drawback of topic models is that they are generative probabilistic models and therefore are vulnerable to words that are irrelevant to the target topics.

3.1.3 Community Detection based on Combined Link and Content Analysis

Neither link information nor content information alone is sufficient for detecting the community structure: the link information is usually sparse and noisy and often results in a poor partition of networks, while irrelevant content attributes can significantly mislead the detection of communities. It is therefore important to combine link analysis and content analysis for community detection in networks. Recently, several approaches [17, 28, 68, 113, 110] have been proposed to combine these two types of information for community detection. PHITS-PLSA combines PHITS, a link-based approach, with PLSA, a content-based approach, for community detection [17], and shows a significant improvement over approaches that utilize only link information or only content information. E. Erosheva et al. [28] combine LDA with LDA-Link, an extension of PHITS, for network analysis, referred to as the LDA-Link-Word model. R. Nallapati et al. [68] improve the LDA-Link-Word model by applying different modeling strategies to citing documents and cited documents: more specifically, the LDA-Link-Word model is applied to the citing documents and the PLSA model is applied to the cited documents. Other approaches that exploit LDA for combining link and content analysis can be found in [25, 35]. Despite these significant efforts, the existing approaches for combining link and content information are based on generative models and therefore suffer from the following two shortcomings. First, community membership by itself is insufficient to model links: link patterns are usually affected by factors other than communities, such as the popularity of a node (i.e., how likely the node is to be cited by other nodes). Second, the content information often includes irrelevant attributes; as a result, a generative model without feature selection is vulnerable to the irrelevant keywords and therefore leads to poor performance.
In addition to probabilistic models, other approaches that have been proposed to combine link and content information include matrix factorization [113] and kernel fusion [110] for spectral clustering. Most of them can be more or less viewed as generative models, according to the recent studies on the equivalence between mixture models, k-means, spectral clustering, and matrix factorization [26, 27]. As a result, these approaches usually yield performance similar to that of the generative models. Since the focus of this dissertation is probabilistic modeling approaches, we do not review them in detail.

3.2 Clustering for Must-and-Cannot (M&C) Linked Data

In this section, we review the related work on clustering for must-and-cannot (M&C) linked data. Most approaches in this subject can be classified into three categories: constrained clustering, distance metric learning, and kernel learning.

3.2.1 Constrained Clustering

Constrained clustering tries to improve the accuracy of data clustering by exploiting the M&C linked pairs, also termed pairwise constraints. This approach is also known as semi-supervised clustering. In constrained clustering, the clustering algorithm is modified so that the given pairwise constraints are used to bias the search for an appropriate clustering of the data. There are two types of constrained clustering approaches: (1) those with strict enforcement, which find the best feasible clustering satisfying all the given pairwise constraints [95, 19], and (2) those with partial enforcement, which find the best clustering while maximally respecting the constraints [8, 81, 18, 53]. Several techniques have been developed to fit the clustering results to the constraints: (1) modifying the clustering objective function so that it includes a term (penalty) for satisfying the specified constraints [24, 18], and (2) enforcing all constraints to be satisfied by the cluster assignments generated in each step of an iterative clustering algorithm [95]. More discussion on constrained clustering can be found in [20] and the references therein.

3.2.2 Distance Metric Learning

In distance metric learning, an appropriate distance metric is learned so that must-link pairs are separated by a short distance, while cannot-link pairs are separated by a large distance. Given a learned distance metric, we can then apply existing clustering algorithms, such as k-means, to find the clusters. Many algorithms [102, 82, 43, 31, 104, 100] have been developed for distance metric learning, such as distance metric learning by convex optimization [102], relevance component analysis [82], discriminative component analysis [43], nearest neighbor component analysis [31], local distance metric learning [104], the large margin nearest neighbor classifier [100], information theoretic metric learning [21], distance function learning [89], and learning a Bregman distance function [101]. All these approaches optimize an objective function of the distance metric subject to a set of equality/inequality constraints. The M&C link information serves in the objective [104], in the constraints [82, 21], or in both [102, 100]. More work on distance metric learning from M&C link information can be found in the survey [103] and references therein.

3.2.3 Kernel Learning

In kernel learning, an appropriate kernel matrix/function for a given data set is learned from the M&C link information, and then a similarity-based clustering algorithm (e.g., spectral clustering) is applied with the learned kernel matrix.
Recent work on kernel learning from M&C linked data includes learning a low-rank kernel matrix [51], nonparametric kernel learning [42], active kernel learning [41], spectral kernel learning [56], kernel-based metric adaptation [13], and semi-supervised kernel matrix learning by kernel propagation [46]. These approaches either propose a loss function that measures the inconsistency between a kernel function/matrix and the pairwise constraints and find the optimal kernel by minimizing the overall loss, or find a kernel matrix with certain properties, such as low rank, that satisfies a set of equality/inequality constraints constructed from the M&C link information.

Most studies on clustering M&C linked data assume the link information is noise-free. However, in many applications, the M&C link information is derived from side information, such as the observed connections among instances, making it prone to errors. For example, when classifying research articles with M&C links constructed from paper citations, a cited paper may not share the same research topic as the citing paper. It is important to consider the noise in the M&C link information, since learning directly from the given M&C link information would lead to suboptimal results.

CHAPTER 4

Community Detection for Networked Data: A Popularity and Productivity Link Model

4.1 Introduction

In this chapter, we propose a novel probabilistic framework for directed network community detection, termed the Popularity and Productivity Link model, or PPL for short, that explicitly addresses the shortcomings of the existing stochastic block models. In particular, we model both outgoing links and incoming links by introducing the latent variables productivity and popularity. We demonstrate the generality of the proposed framework by showing that both the symmetric models and the conditional models can be derived from it as special cases, leading to a unification of various seemingly different forms of the existing models. We develop efficient EM algorithms for computing the maximum likelihood solutions to the models proposed in this chapter. Extensive empirical studies show the promising performance of the proposed models in several application domains. Further analysis is conducted to investigate the trade-offs of each stochastic block model when the data characteristics vary.

The rest of the chapter is organized as follows. In Section 4.2, we give background information, including the notation we will use and details of several previous approaches. In Section 4.3, we present the PPL model, several of its variations, and some of their properties. In Section 4.4, we provide a detailed analysis of the relationship between PPL models and several existing stochastic block models. In Section 4.5, we describe an efficient estimation algorithm. In Section 4.6, we show the results of experimental studies. Finally, we conclude in Section 4.7.

4.2 Background

In this section, we first establish some necessary notation for ease of presentation. We then describe two representative existing stochastic block models that are most relevant to our work.

4.2.1 Notation

For a directed network, we denote the nodes by $V = \{1, \cdots, N\}$ and the directed links by $E = \{(i, j) \mid s_{ij} \neq 0\}$, where $s_{ij}$ records the value associated with the link from node $i$ to node $j$. $s_{ij}$ can either be binary, denoting whether there is a link from node $i$ to node $j$, or non-negative, denoting the weight of the link.
For simplicity, following [106], we assume the "link-in" space (i.e., all possible nodes that can point to a particular node) and the "link-out" space (i.e., all possible nodes that can be pointed to by a particular node) of every node to be $V$, i.e., the complete set of nodes. We use $I(i) = \{j \mid s_{ji} \neq 0\}$ to denote the set of all nodes that point to node $i$, and $O(i) = \{j \mid s_{ij} \neq 0\}$ to denote the set of all nodes that are pointed to by node $i$. Let $K$ denote the number of communities, $z_i \in \{1, \cdots, K\}$ denote the community variable of node $i$, and $\gamma_i = (\gamma_{i1}, \cdots, \gamma_{iK})$ denote the community memberships of node $i$. In other words, $\gamma_{ik}$ is the probability of the event $z_i = k$, i.e., that node $i$ belongs to community $k$.

4.2.2 Existing Models

We now review two variants of the well-known stochastic block model [45] that are closely related to the proposed model.

PHITS Model

PHITS [16] is a conditional model that focuses on the conditional link probability $\Pr(j|i)$, i.e., given that node $i$ produces a link, how likely this link points to node $j$ among all nodes. To compute $\Pr(j|i)$, a community variable $z_i$ is first sampled from a multinomial distribution with parameter $\gamma_i$ that describes the community membership of node $i$; then, for a given $z_i$, the conditional link probability is given by $\Pr(j|i, z_i) = \beta_{j z_i}$, where the parameter $\beta_{jk}$ represents the likelihood for node $j$ to be pointed to by any node in community $k$. By integrating out $z_i$, we get the PHITS model

$$\Pr(j|i) = \sum_k \beta_{jk} \gamma_{ik} \qquad (4.1)$$

It is well known that PHITS can be considered an application of PLSA [40] to networked data.

Symmetric Joint Link (SJL) Model

SJL [16, 109, 79] is a symmetric model for community detection. It models the link structure by the joint probability $\Pr(i, j)$, i.e., the likelihood of creating a link between nodes $i$ and $j$, as follows:

$$\Pr(i, j) = \sum_k \Pr(j|k) \Pr(i|k) \Pr(k) = \sum_k \beta_{jk} \beta_{ik} \pi_k \qquad (4.2)$$

In Equation (4.2), $\pi_k$ is the prior probability for a link to be produced in community $k$, and $\beta_{ik}$ and $\beta_{jk}$ are the conditional probabilities that nodes $i$ and $j$ are selected as the two ends of the link. Given the symmetric treatment, i.e., $\Pr(i, j) = \Pr(j, i)$, it is evident that SJL may not be suitable for directed network community detection.

4.3 Popularity and Productivity Link (PPL) Model

In this section, we first present our popularity and productivity link (PPL) model in its general form, then give three variations of the general PPL model, and finally discuss several properties of the PPL models.

4.3.1 General Form of PPL

PPL models the joint link probability $\Pr(i, j)$, i.e., how likely there is a directed link from node $i$ to node $j$. In order to emphasize the different roles played by $i$ and $j$, we write $\Pr(i, j)$ as $\Pr(i_\to, j_\gets)$, denoting that node $i$ plays the role of producing the link and node $j$ plays the role of receiving the link. Following the idea of SJL, we model $\Pr(i_\to, j_\gets)$ as follows:

$$\Pr(i_\to, j_\gets) = \sum_k \Pr(i_\to|k) \Pr(j_\gets|k) \Pr(k) = \sum_k \frac{\gamma_{ik} a_i}{\sum_{i'} \gamma_{i'k} a_{i'}} \cdot \frac{\gamma_{jk} b_j}{\sum_{i'} \gamma_{i'k} b_{i'}} \cdot \sum_{i'} \gamma_{i'k} c_{i'} \qquad (4.3)$$

where

• $\gamma_{ik}$: the probability for node $i$ to belong to community $k$;

• $a_i$: the productivity of node $i$, i.e., among all the nodes, how likely a link is produced by node $i$;

• $b_j$: the popularity of node $j$, i.e., among all the nodes, how likely a link is received by node $j$;

• $c_i$: the weight of node $i$ in determining the community prior $\Pr(k)$ (which will be elaborated momentarily).

To handle scale invariance, we normalize so that $\sum_i a_i = \sum_j b_j = \sum_i c_i = 1$.
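To illustrate how Equation (4.3) is evaluated, the following Python sketch computes the full matrix of joint link probabilities from given memberships $\gamma$ and node weights $a$, $b$, $c$ with NumPy. The function name and the randomly drawn toy parameters are ours and purely illustrative; only the computation follows Equation (4.3).

```python
import numpy as np

def ppl_joint_link_prob(gamma, a, b, c):
    """Joint link probabilities Pr(i->, j<-) of the general PPL model (Eq. 4.3).

    gamma: (N, K) community memberships, rows summing to one.
    a, b, c: (N,) productivity, popularity, and prior weights, each summing to one.
    Returns an (N, N) matrix P with P[i, j] = Pr(i->, j<-).
    """
    eta = gamma.T @ a                    # eta_k = sum_i gamma_ik a_i
    tau = gamma.T @ b                    # tau_k = sum_i gamma_ik b_i
    pi = gamma.T @ c                     # pi_k  = sum_i gamma_ik c_i, the prior
    p_out = (gamma * a[:, None]) / eta   # Pr(i->|k)
    p_in = (gamma * b[:, None]) / tau    # Pr(j<-|k)
    return (p_out * pi) @ p_in.T         # sum_k Pr(i->|k) Pr(j<-|k) Pr(k)

# A tiny random example with hypothetical parameters (not fitted to any data).
rng = np.random.default_rng(0)
N, K = 5, 2
gamma = rng.dirichlet(np.ones(K), size=N)
a = rng.dirichlet(np.ones(N))
b = rng.dirichlet(np.ones(N))
c = rng.dirichlet(np.ones(N))

P = ppl_joint_link_prob(gamma, a, b, c)
print(P.sum())   # close to 1: Pr(i->, j<-) is a distribution over ordered pairs
```

Because $\gamma$, $a$, $b$, and $c$ are normalized, the resulting matrix sums to one over all ordered node pairs, which gives a quick sanity check on the computation.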
Generative Process

We explain Equation (4.3) by the following generative process of PPL:

• Sample a community $z$ according to a prior distribution $\pi_1, \cdots, \pi_K$, where $\pi_k$ is computed by $\pi_k = \sum_{i=1}^N \gamma_{ik} c_i$.

• Given community $z$, the conditional link probability is given by

$$\Pr(i_\to, j_\gets|z) = \Pr(i_\to|z) \Pr(j_\gets|z) = \frac{\gamma_{iz} a_i}{\sum_{i'} \gamma_{i'z} a_{i'}} \cdot \frac{\gamma_{jz} b_j}{\sum_{i'} \gamma_{i'z} b_{i'}} \qquad (4.4)$$

There are two unique features in the above generative process:

• The prior probability $\pi_k = \sum_i \gamma_{ik} c_i$ is constructed as a weighted sum of the node memberships $\gamma_{ik}$, where $c_i$ is used to weight node $i$ in the combination. This construction enforces the consistency between the node memberships $\gamma_{ik}$ and the community prior $\{\pi_k\}_{k=1}^K$. This specific construction of community priors also simplifies the relation between the proposed framework and some existing models for community detection.

• In Equation (4.4), the two ends of the link $i \to j$ are treated differently when modeling $\Pr(i_\to, j_\gets|z)$: besides the dependence on the community memberships $\gamma_{ik}$ and $\gamma_{jk}$, $\Pr(i_\to|z)$ and $\Pr(j_\gets|z)$ are modeled by $a_i$ (i.e., the productivity of node $i$) and $b_j$ (i.e., the popularity of node $j$), respectively, leading to the differentiation of the roles played by the two nodes.

With the joint link probability defined in Equation (4.3), the log-likelihood for the links can be written as

$$L(a, b, c, \gamma) = \sum_{(i,j) \in E} s_{ij} \log \sum_k \frac{\gamma_{ik} a_i}{\sum_{i'} \gamma_{i'k} a_{i'}} \cdot \frac{\gamma_{jk} b_j}{\sum_{i'} \gamma_{i'k} b_{i'}} \cdot \sum_{i'} \gamma_{i'k} c_{i'} \qquad (4.5)$$

Note that we use the original data $s_{ij}$ in the joint link model rather than the normalized data $\hat{s}_{ij} = s_{ij} / \sum_j s_{ij}$ used in conditional link models [16, 106]. The parameters $\gamma$, $a$, $b$, and $c$ can be inferred by maximizing the log-likelihood $L(a, b, c, \gamma)$.

4.3.2 Variants of the General PPL Model

In this subsection, we present three variants of the PPL model obtained by introducing different restrictions on the parameters $a$, $b$, and $c$.

Popularity Link (PoL) Model

In the first restricted variation, we enforce $c_i = a_i, \forall i$ in Equation (4.3), leading to the following expression for $\Pr(i_\to, j_\gets)$:

$$\Pr(i_\to, j_\gets) = \sum_k \gamma_{ik} a_i \frac{\gamma_{jk} b_j}{\sum_{i'} \gamma_{i'k} b_{i'}} \qquad (4.6)$$

We refer to this variant as the Popularity Link (PoL) Model. By assuming $c_i = a_i$, we essentially assume that the prior probability of each community (i.e., $\sum_i \gamma_{ik} c_i$) is identical to the prior probability for a link to be produced from that community (i.e., $\sum_i \gamma_{ik} a_i$).

Productivity Link (PrL) Model

In the second restricted variation, we enforce $c_i$ to be equal to $b_i$, leading to the following expression for $\Pr(i_\to, j_\gets)$:

$$\Pr(i_\to, j_\gets) = \sum_k \frac{\gamma_{ik} a_i}{\sum_{i'} \gamma_{i'k} a_{i'}} \gamma_{jk} b_j \qquad (4.7)$$

We refer to this variant as the Productivity Link (PrL) Model. By assuming $c_i = b_i$, we essentially assume that the prior probability of each community (i.e., $\sum_i \gamma_{ik} c_i$) is identical to the prior probability for a link to be received by that community (i.e., $\sum_i \gamma_{ik} b_i$).

Regularized PPL (PPL-D) Model

In this variation, instead of enforcing a relationship between $c_i$ and $a_i$ or $b_i$, we learn $c_i$ from data under certain regularization. In particular, we introduce a Dirichlet prior for the parameters $c = (c_1, \ldots, c_N)$, i.e., $\Pr(c) \propto \prod_i c_i^{\alpha}$, where $\alpha$ is the hyper-parameter of the Dirichlet distribution. Using the prior $\Pr(c)$ as the regularization, we obtain a MAP estimate of the parameters by maximizing the following log-posterior probability:

$$L(a, b, c, \gamma) + \log \Pr(c) \qquad (4.8)$$

where $L(a, b, c, \gamma)$ is given in Equation (4.5). We call this PPL model regularized by the Dirichlet prior the PPL-D model.

4.3.3 Properties of PPL Model

In this subsection, we show two important properties of the PoL, PrL, and general PPL models.
Equivalence between the Variants of PPL Models

The first property concerns the relationship between the PoL model, the PrL model, and the general PPL model. Surprisingly, although their formulas are different, the optimal solutions for the three models result in identical joint link probabilities and therefore identical data likelihoods. This property is described in the following theorem.

Theorem 1. Under the optimal solutions, the joint link probabilities $\Pr(i_\to, j_\gets)$ of the PoL model, the PrL model, and the general PPL model are the same. That is,

$$\Pr_1(i_\to, j_\gets \mid a^1, b^1, \gamma^1) = \Pr_2(i_\to, j_\gets \mid a^2, b^2, \gamma^2) = \Pr_3(i_\to, j_\gets \mid a^3, b^3, c^3, \gamma^3),$$

where $\Pr_1(i_\to, j_\gets)$, $\Pr_2(i_\to, j_\gets)$, and $\Pr_3(i_\to, j_\gets)$ are the joint link probabilities of the PoL model, the PrL model, and the general PPL model, respectively, and $\{a^1, b^1, \gamma^1\}$, $\{a^2, b^2, \gamma^2\}$, and $\{a^3, b^3, c^3, \gamma^3\}$ are the optimal solutions maximizing the log-likelihoods of the PoL model, the PrL model, and the general PPL model, respectively. In particular, denoting the log-likelihoods of the PoL, PrL, and general PPL models by $L_1(a, b, \gamma)$, $L_2(a, b, \gamma)$, and $L_3(a, b, c, \gamma)$, respectively, we have $L_1(a^1, b^1, \gamma^1) = L_2(a^2, b^2, \gamma^2) = L_3(a^3, b^3, c^3, \gamma^3)$.

In order to prove Theorem 1, we first state the following lemma about the optimal solution to the PPL model given in Equation (4.3).

Lemma 1. Given that $(a^3, b^3, c^3, \gamma^3)$ is the optimal solution maximizing the log-likelihood of the PPL model, define $\pi_k = \sum_i \gamma^3_{ik} c^3_i$. Then we can obtain one set of parameters $(a^1, b^1, \gamma^1)$ such that $\sum_i \gamma^1_{ik} a^1_i = \pi_k$ and $(a^1, b^1, \gamma^1)$ is the optimal solution maximizing the log-likelihood of the PoL model. Similarly, we can obtain another set of parameters $(a^2, b^2, \gamma^2)$ such that $\sum_j \gamma^2_{jk} b^2_j = \pi_k$ and $(a^2, b^2, \gamma^2)$ is the optimal solution maximizing the log-likelihood of the PrL model.

Proof. We first show how to construct such $(a^1, b^1, \gamma^1)$ and $(a^2, b^2, \gamma^2)$. Given $(a^3, b^3, c^3, \gamma^3)$ and $\pi_k = \sum_i \gamma^3_{ik} c^3_i$, we can define $\hat q$ such that $\sum_i \gamma^3_{ik} a^3_i \hat q_k = \pi_k$. We then construct

$$\gamma^1_{ik} = \frac{\gamma^3_{ik} \hat q_k}{\sum_k \gamma^3_{ik} \hat q_k}, \qquad a^1_i = a^3_i \sum_k \gamma^3_{ik} \hat q_k, \qquad b^1_j = \frac{b^3_j \sum_k \gamma^3_{jk} \hat q_k}{\sum_{j'} b^3_{j'} \sum_k \gamma^3_{j'k} \hat q_k},$$

and we can show that $\sum_i \gamma^1_{ik} a^1_i = \pi_k$. We can also define $\tilde q$ such that $\sum_j \gamma^3_{jk} b^3_j \tilde q_k = \pi_k$. We then construct

$$\gamma^2_{ik} = \frac{\gamma^3_{ik} \tilde q_k}{\sum_k \gamma^3_{ik} \tilde q_k}, \qquad b^2_j = b^3_j \sum_k \gamma^3_{jk} \tilde q_k, \qquad a^2_i = \frac{a^3_i \sum_k \gamma^3_{ik} \tilde q_k}{\sum_{i'} a^3_{i'} \sum_k \gamma^3_{i'k} \tilde q_k},$$

and we can show that $\sum_j \gamma^2_{jk} b^2_j = \pi_k$. With the constructed $(a^1, b^1, \gamma^1)$ and $(a^2, b^2, \gamma^2)$, we can show that

$$L_3(a^3, b^3, c^3, \gamma^3) = L_1(a^1, b^1, \gamma^1) = L_2(a^2, b^2, \gamma^2).$$

Next, we need to show that $(a^1, b^1, \gamma^1)$ is the optimal solution to the PoL model and $(a^2, b^2, \gamma^2)$ is the optimal solution to the PrL model. We prove this by contradiction. Assume there exists another set of parameters $(a^*, b^*, \gamma^*)$ such that $L_1(a^*, b^*, \gamma^*) > L_1(a^1, b^1, \gamma^1) = L_3(a^3, b^3, c^3, \gamma^3)$; then $L_3(a^*, b^*, a^*, \gamma^*) = L_1(a^*, b^*, \gamma^*) > L_3(a^3, b^3, c^3, \gamma^3)$, which contradicts the fact that $(a^3, b^3, c^3, \gamma^3)$ is the optimal solution to the PPL model. Similarly, we can show that $(a^2, b^2, \gamma^2)$ is the optimal solution to the PrL model. This completes the proof.

Following the above lemma, we can easily prove Theorem 1. One implication of this theorem is that the set of optimal solutions to the general PPL model is not a single point. As a consequence, if in addition to the joint link probability we also care about the exact solution for the community membership $\gamma$, then we should not directly apply PPL in its general form.
Instead, we should choose PoL or PrL if the MLE solution is needed, or PPL-D if the MAP solution is needed.

Perfect Fitting of the Indegree and Outdegree Distributions

The second property of the PPL model concerns degree fitting. It turns out that the optimal solutions to the PoL model, the PrL model, and the general PPL model all fit exactly the degree distributions (both indegree and outdegree) of the network data. This is described in the following theorem.

Theorem 2. The model outdegree distribution $\Pr(i_\to)$ and the model indegree distribution $\Pr(j_\gets)$ of the PoL model, the PrL model, and the general PPL model fit exactly the actual outdegree and indegree distributions of the network data.

Proof. We can easily show that the optimal solution for $a_i$ in the PoL model is equal to the normalized outdegree of node $i$, i.e., $a^1_i = \frac{\sum_j s_{ij}}{\sum_{ij} s_{ij}}$, and the optimal solution for $b_j$ in the PrL model is equal to the normalized indegree of node $j$, i.e., $b^2_j = \frac{\sum_i s_{ij}}{\sum_{ij} s_{ij}}$. From the model formulation in Equation (4.6) for the PoL model, we have

$$\Pr_1(i_\to \mid a^1, b^1, \gamma^1) = \sum_j \Pr_1(i_\to, j_\gets \mid a^1, b^1, \gamma^1) = a^1_i.$$

So the model outdegree distribution of the PoL model fits exactly the actual outdegree distribution of the network. From the model formulation in Equation (4.7) for the PrL model, we have

$$\Pr_2(j_\gets \mid a^2, b^2, \gamma^2) = \sum_i \Pr_2(i_\to, j_\gets \mid a^2, b^2, \gamma^2) = b^2_j.$$

So the model indegree distribution of the PrL model fits exactly the actual indegree distribution of the network. Following Theorem 1, we have

$$\Pr_3(i_\to \mid a^3, b^3, c^3, \gamma^3) = \Pr_2(i_\to \mid a^2, b^2, \gamma^2) = \Pr_1(i_\to \mid a^1, b^1, \gamma^1) = a^1_i$$

and

$$\Pr_3(j_\gets \mid a^3, b^3, c^3, \gamma^3) = \Pr_1(j_\gets \mid a^1, b^1, \gamma^1) = \Pr_2(j_\gets \mid a^2, b^2, \gamma^2) = b^2_j.$$

We conclude that the model indegree and outdegree distributions estimated from the PoL, PrL, and PPL models fit exactly the actual indegree and outdegree distributions of the network.

This degree-fitting property is a consequence of the concepts of productivity and popularity. We argue that degree fitting is a very important property for a generative model, because in the real world most networks have heavy-tailed (or power-law) degree distributions. So far, no existing stochastic block model is guaranteed to generate degree distributions that fit both the indegree and outdegree distributions of real-world networks.

4.4 Relationship with Existing Models

In this section, we describe the relationship between PPL and several existing models, including the conditional link models, namely PCL [106] and PHITS [16], and the symmetric joint link (SJL) model [79, 109]. It turns out that these existing models can all be considered special cases of PPL with different constraints. Such a connection demonstrates that PPL provides a consistent framework to unify the existing models.

4.4.1 Relationship with Conditional Link Models

We show that the Popularity Conditional Link (PCL) model in [106] is a conditional version of the Popularity Link (PoL) model described in Section 4.3.2. Starting from the joint probability given in Equation (4.6), we can express the conditional probability of the PoL model as

$$\Pr(j_\gets \mid i_\to) = \frac{\Pr(i_\to, j_\gets)}{\Pr(i_\to)} = \sum_k \gamma_{ik} \frac{\gamma_{jk} b_j}{\sum_{i'} \gamma_{i'k} b_{i'}} \qquad (4.9)$$

Note that in the above derivation we use the fact that $\Pr(i_\to) = a_i$, which is obtained in the proof of Theorem 2. Equation (4.9) is exactly the same as the Popularity Conditional Link (PCL) model proposed in [106]. Because of this connection, in the following discussion we also refer to the PCL model described in Equation (4.9) (and in [106]) as the PoCL model.
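As a quick numerical illustration of Theorem 2 and of the conditional view in Equation (4.9), the sketch below builds a small random network (a hypothetical toy example, not one of the data sets used in this chapter), sets $a$ to the normalized outdegree as the PoL MLE prescribes, and checks that the model outdegree marginal reproduces $a$ exactly and that conditioning on the producing node yields a proper distribution over link targets.

```python
import numpy as np

def pol_joint_link_prob(gamma, a, b):
    """Joint link probabilities Pr(i->, j<-) of the PoL model (Equation 4.6)."""
    tau = gamma.T @ b                      # tau_k = sum_i gamma_ik b_i
    p_in = (gamma * b[:, None]) / tau      # Pr(j<-|k) = gamma_jk b_j / tau_k
    return (gamma * a[:, None]) @ p_in.T   # sum_k gamma_ik a_i Pr(j<-|k)

# Hypothetical toy network: S[i, j] is the weight of the link i -> j.
rng = np.random.default_rng(1)
S = rng.integers(1, 3, size=(6, 6))
np.fill_diagonal(S, 0)

# Set a to the normalized outdegree, as the PoL MLE prescribes (Theorem 2);
# the memberships and popularities are arbitrary here.
a = S.sum(axis=1) / S.sum()
gamma = rng.dirichlet(np.ones(3), size=6)
b = rng.dirichlet(np.ones(6))

P = pol_joint_link_prob(gamma, a, b)

# Model outdegree distribution Pr(i->) = sum_j Pr(i->, j<-) equals a exactly.
print(np.allclose(P.sum(axis=1), a))        # True

# Conditioning on the producing node recovers the PoCL model of Equation (4.9).
pocl = P / a[:, None]
print(np.allclose(pocl.sum(axis=1), 1.0))   # each row is a distribution over j
```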
Following a similar idea, from the Productivity Link (PrL) model we can derive a Productivity Conditional Link model by computing the conditional probability $\Pr(i_\rightarrow|j_\leftarrow)$ from Equation (4.7) as follows:
$$\Pr(i_\rightarrow|j_\leftarrow) = \frac{\Pr(i_\rightarrow, j_\leftarrow)}{\Pr(j_\leftarrow)} = \sum_k \gamma_{jk} \frac{\gamma_{ik} a_i}{\sum_{i'} \gamma_{i'k} a_{i'}} \qquad (4.10)$$
In the above derivation, we use the fact that $\Pr(j_\leftarrow) = b_j$, which was obtained in the proof of Theorem 2. Because of its connection to PrL, we refer to this new conditional model as PrCL. As we can see, PoCL and PrCL capture the conditional link probability in different directions: PoCL depends on the popularity of the receiving node $j$, while PrCL depends on the productivity of the producing node $i$. In addition, both PoCL and PrCL can be naturally derived from the PPL models, i.e., PoL and PrL. Because, as discussed before, PHITS is a relaxed version of PoCL, it can obviously also be derived from PPL.

4.4.2 Relationship with Symmetric Joint Link Models

To show its relationship with SJL, we enforce $c_i = a_i = b_i$ for all $i$ in the general PPL model. From a probabilistic viewpoint, this restriction requires that for each node the probability of producing links equals the probability of receiving links. With this restriction, Equation (4.3) reduces to
$$\Pr(i_\rightarrow, j_\leftarrow) = \sum_k \frac{\gamma_{ik} c_i}{\sum_{i'} \gamma_{i'k} c_{i'}} \cdot \frac{\gamma_{jk} c_j}{\sum_{j'} \gamma_{j'k} c_{j'}} \cdot \sum_{i'} \gamma_{i'k} c_{i'}$$
The following theorem shows that this restricted version of PPL is exactly the SJL model.

Theorem 3. Under the constraint that $a_i = b_i = c_i$ for all $i$, the general PPL model is equivalent to the SJL model.

Proof. The joint link probability of the SJL model is given in Equation (4.2), i.e.,
$$\Pr(i_\rightarrow, j_\leftarrow) = \sum_k \beta_{ik}\beta_{jk}\pi_k$$
The community membership of the SJL model is defined as [109, 79]
$$\gamma^f_{ik} = \frac{\beta_{ik}\pi_k}{\sum_k \beta_{ik}\pi_k} \qquad (4.11)$$
and we can also define $c^f_i$ as
$$c^f_i = \sum_k \beta_{ik}\pi_k \qquad (4.12)$$
Similarly, given a solution $(\gamma, c)$ to the PPL model with $a = b = c$, we can define
$$\pi^p_k = \sum_i \gamma_{ik} c_i, \qquad \beta^p_{ik} = \frac{\gamma_{ik} c_i}{\sum_{i'}\gamma_{i'k} c_{i'}} \qquad (4.13)$$
All we need to show is that (i) given that $(\beta, \pi)$ is the optimal solution to the SJL model, $(\gamma^f, c^f)$ defined in Equations (4.11, 4.12) is the optimal solution to the PPL model under the restriction $a = b = c$; and (ii) given that $(\gamma, c)$ is the optimal solution to the PPL model under the restriction $a = b = c$, $(\pi^p, \beta^p)$ defined in Equation (4.13) is the optimal solution to the SJL model. First note that
$$L_0(\beta, \pi) = L_3(\gamma^f, c^f), \qquad L_3(\gamma, c) = L_0(\beta^p, \pi^p)$$
where $L_0$ and $L_3$ are the log-likelihoods of the SJL model and the PPL model, respectively. Given that $(\beta, \pi)$ is the optimal solution to the SJL model, if there existed $(\gamma^*, c^*)$ such that $L_3(\gamma^*, c^*) > L_3(\gamma^f, c^f) = L_0(\beta, \pi)$, then we could construct $(\beta^*, \pi^*)$ as in Equation (4.13) such that $L_0(\beta^*, \pi^*) = L_3(\gamma^*, c^*) > L_0(\beta, \pi)$, which contradicts the assumption that $(\beta, \pi)$ is the optimal solution to the SJL model. Similarly, given that $(\gamma, c)$ is the optimal solution to the PPL model, if there existed $(\pi^*, \beta^*)$ such that $L_0(\pi^*, \beta^*) > L_0(\pi^p, \beta^p) = L_3(\gamma, c)$, then we could construct $(\gamma^*, c^*)$ as in Equations (4.11, 4.12) such that $L_3(\gamma^*, c^*) = L_0(\pi^*, \beta^*) > L_3(\gamma, c)$, which contradicts the assumption that $(\gamma, c)$ is the optimal solution to the PPL model. Therefore, the PPL model under the restriction $a = b = c$ is equivalent to the SJL model.

The relationship revealed by Theorem 3 shows that SJL is a special case of PPL with the constraint that each node has the same probability of producing and receiving links, which is appropriate only for modeling undirected networks.

4.4.3 Summary

In Table 4.1, we summarize all the models discussed in this paper.
Models that are newly developed in this paper are print in bold. We believe such a unified picture, offered through the PPL model, will be very helpful for understanding and further studying different stochastic block models for community detection. Table 4.1. Taxonomy of the models categorized by type and variables conditional joint symmetric 4.5 popularity PHITS, PoCL PoL productivity PrCL PrL both PPL, PPL-D SJL Estimation Algorithm In this section, we present efficient EM algorithms for computing the MLE solutions to PoL and PrL and the MAP solution to PPL-D. Because the derivation of the algorithms is rather lengthy, here we only present the final form of the algorithms as well as offer several observations, and we provide the detailed derivation in the appendix. Theorem 4. The following EM algorithms converge to the MLE solutions to PoL and PrL, and the MAP solution to PPL-D. E-step: qijk ∝ Prt−1 (i, j, k) where t-1 indicates the result in the previous iteration 28 M-step: nik , PoL : γik = τ mk bi + nout (i) bi = nin (i) , τ k mk γik ai = nout (i) i nout (i) nik , + nin (i) ai = nout (i) , η k mk γik bi = nin (i) i nin (i) PrL : γik = η mk ai ζ PPL-D : ζ nik + mik γik = mi + eα ,c = η ζ i ζ mk ai + mτ bi + mi i (mi + eα) k nin (i) nout (i) , bi = ai = η τ k mk γik k mk γik where e is the summation of all sij and the rest variables are defined as: ηk = i′ γ t−1 at−1 , τk = ′ ′ ik i nin (i, k) = j′ sji qjik , nout (i, k) = j∈I(i) nin (i) = t−1 γik ct−1 i γ t−1 bt−1 , ζik = j ′k j ′ t−1 t−1 i ′ γ i ′ k ci ′ sij qijk j∈O(i) nin (i, k), nout (i) = k nout (i, k) k nik = nin (i, k) + nout (i, k), mk = sij qijk (i→j)∈E mτ = k (i→j)∈E sij qijk τk ζ , sij qijk , mik = ζik η mk = (i→j)∈E sij qijk ηk ζ ζ mi = mik . k (i→j)∈E Proof. In the E-step, we would bound the log-likelihood from below. The key point is to apply the Jensen inequality log k pk ≥ k qk log pk /qk , where = 1, to the logx sum-term in the log-likelihood and to apply the inequality − log x ≥ 1 − − log y to the y summation term in the denominator of the log-sum-term in the log-likelihood. In particular, at the t-th iteration the log-sum-term is lower bounded as log k Pr(i, j, k) ≥ qijk log Pr(i, j, k)/qijk k 29 k qk with qijk computed as qijk ∝ Prt−1 (i, j, k) where superscript t − 1 means the probability is computed under the values of the parameters in the (t − 1)-th iteration. Then the denominator term in Pr(i, j, k) would be lower bounded as − log i′ − log γi′ k ai′ ≥ 1 − γi′ k bi′ ≥ 1 − i′ i′ γi′ k ai′ ηk i′ γi′ k bi′ τk − log ηk − log τk with ηk , τk computed as ηk = i′ and the summation term γ t−1 at−1 ′ ′ i ′ γ i ′ k ci ′ log i′ ik j′ γ t−1 bt−1 ′ ′ jk j in PPL model is lower bounded as γ i ′ k ci ′ ≥ t−1 t−1 γik ci with ζ computed as ζik = τk = i t−1 t−1 i ′ γ i ′ k ci ′ i′ ζi′ k log γi′ k ci′ /ζi′ k . Due to the limit of space, we omit the details about deriving the lower bound of the three log-likelihoods. In the M-step, we will maximize the corresponding lower bound over the corresponding parameters as follows: PoL : PrL : k sij (i,j)∈E j′ γi′ k ai′  ηk i′  sij PPL-D :  qijk log γjk γik ai − k (i,j)∈E γj ′ k bj ′ qijk log γik γjk bj − sij (i,j)∈E −  k  qijk log γik ai γjk bj − τk   γi′ k ai′ ηk i′  γi′ k bi′ ζi′ k log γi′ k  (ζi′ k + α) log ci + + τk i′ i′ i′ By taking the derivatives of the expressions and setting them to zero, we can obtain the corresponding formulas in the M-step. 
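As a rough illustration of how the E-step can be kept linear in the number of links, the sketch below computes the responsibilities q_ijk only for observed edges and normalizes them over the communities; `joint_term` is a hypothetical callable standing in for the model-specific quantity Pr^{t-1}(i, j, k) of PoL, PrL, or PPL-D, which is not spelled out here.

```python
import numpy as np

def e_step_responsibilities(edges, K, joint_term):
    """E-step of the EM algorithms: q_ijk proportional to Pr^{t-1}(i, j, k).

    edges:      list of (i, j, s_ij) for observed links only
    K:          number of communities
    joint_term: callable (i, j, k) -> Pr^{t-1}(i, j, k) under the current
                parameters (model-specific stand-in for PoL / PrL / PPL-D)
    Returns {(i, j): length-K array q_ij}, computed only for observed links,
    so the cost per iteration stays linear in the number of links.
    """
    q = {}
    for i, j, _ in edges:
        p = np.array([joint_term(i, j, k) for k in range(K)])
        q[(i, j)] = p / p.sum()  # normalize the responsibilities over communities
    return q
```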
It can be observed from the EM algorithm that in every iteration (and therefore in the final solution), for each node $i$, its productivity $a_i$ is proportional to its outdegree and its popularity $b_i$ is proportional to its indegree. This is consistent with our intention that the productivity of a node reflects how likely it is to produce a link and the popularity of a node reflects how likely it is to receive a link. In addition, it is worth mentioning that in the actual implementation we avoid explicitly computing all the $q_{ijk}$'s (whose number is $N^2 K$, which can be extremely large). Instead, the $q_{ijk}$'s are computed in an "on-demand" fashion. We can show that the per-iteration complexity of our EM algorithms is linear in the number of links in the network. Therefore, the algorithm is very efficient because in most real applications networks are sparse, so the number of links is usually manageable.

4.6 Experiments

In this section, we present the experimental results. We evaluate a variety of models (variants of PPL and existing models) on two tasks: community detection and link prediction. In addition, we investigate the issue of degree fitting. We start by describing the data sets used in the experiments.

4.6.1 Data Sets

In the following experiments, we use a blog network and two paper citation networks.

Political Blog Network: This is a directed network of hyperlinks between a set of weblogs about US politics, recorded by Adamic and Glance [1]. In this network, there are 1,490 nodes and 19,090 links in total. Each node is labeled as either conservative or liberal.

Paper Citation Networks: We use the Cora paper citation network and the Citeseer paper citation network processed by Getoor et al. (http://www.cs.umd.edu/projects/linqs/projects/lbc/). There are 2,708 nodes and 5,429 links in the Cora network, and 3,312 nodes and 4,732 links in the Citeseer network. Each paper in the Cora network is categorized into one of 7 classes (e.g., Genetic Algorithms, Neural Networks, etc.), and each paper in the Citeseer network is labeled as one of 6 classes.

4.6.2 Community Detection

In the first task, communities are detected from the networks. The real class labels in the data sets are used as the ground truth to evaluate the communities detected by different models. More specifically, we use the following evaluation metrics.

Evaluation Metrics for Community Detection: We use three commonly used metrics for evaluating the performance of community detection, namely normalized mutual information (NMI), pairwise F measure (PWF), and modularity (Modu). We first describe the three metrics in detail.

Normalized mutual information (NMI) is defined as follows: given the true community structure $C = \{C_1, \ldots, C_K\}$, where $C_k$ denotes the set of nodes in the $k$-th community, and the community structure $C' = \{C'_1, \ldots, C'_K\}$ obtained from a model, the mutual information is computed as
$$\mathrm{MI}(C, C') = \sum_{C_k, C'_l} p(C_k, C'_l) \log \frac{p(C_k, C'_l)}{p(C_k)\,p(C'_l)}$$
where $p(C_k)$ denotes the probability that a randomly selected node belongs to $C_k$, and $p(C_k, C'_l)$ denotes the joint probability that a randomly selected node belongs to both $C_k$ and $C'_l$. The normalized mutual information is defined as
$$\mathrm{NMI}(C, C') = \frac{\mathrm{MI}(C, C')}{\max(H(C), H(C'))}$$
where $H(C) = \sum_k p(C_k) \log \frac{1}{p(C_k)}$ is the entropy of partition $C$.

Pairwise F measure (PWF) is another commonly used measure for evaluating clustering algorithms.
Let $T$ be the set of node pairs $(i, j)$ where nodes $i$ and $j$ belong to the same community in the ground truth, and let $S$ be the set of node pairs that belong to the same community in the output of a specific model. The pairwise F measure is computed from pairwise precision and recall as
$$\mathrm{precision} = \frac{|S \cap T|}{|S|}, \qquad \mathrm{recall} = \frac{|S \cap T|}{|T|}, \qquad \mathrm{PWF} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
where $|\cdot|$ denotes the cardinality of a set.

Note that computing the normalized mutual information and the pairwise F measure requires the ground truth. However, in some cases the ground truth does not necessarily reflect the link structure faithfully. Therefore, we also use another measure, directed modularity (Modu), proposed by Leicht et al. [54] for measuring community partitions in directed networks without using the ground truth. The directed modularity is defined as
$$\mathrm{Modu} = \frac{1}{e}\sum_{ij}\left(s_{ij} - \frac{d_{out}(i)\, d_{in}(j)}{e}\right)\delta(c_i, c_j)$$
where $d_{in}(i)$ and $d_{out}(i)$ are the indegree and outdegree of node $i$ in the network, $e$ is the number of directed links in the network, $c_i$ denotes the community of node $i$ assigned by a model, and $\delta(\cdot,\cdot)$ is the Kronecker delta function. For all three metrics, i.e., NMI, PWF, and Modu, larger values correspond to better performance.

Performance on Community Detection: The community detection performance of the different models on the three data sets is given in Tables 4.2, 4.3, and 4.4. Among the models, PHITS, PoCL, and PrCL are conditional link models. PHITS [17] represents the model described in Equation (5.4); PoCL represents the Popularity Conditional Link model [106] described in Equation (4.9); PrCL represents the Productivity Conditional Link model described in Equation (4.10). SJL represents the symmetric link model described in [79]. PoL, PrL, and PPL-D are the joint link models proposed in this work. All the EM algorithms for MLE and MAP are run for 100 iterations, which according to our observations is more than enough for convergence. To alleviate the problem of local optima in the EM algorithms, for each test we conduct 10 trials with different random initializations and choose the one giving the largest likelihood. The prior α for the parameter c in PPL-D is set to 1. We found the performance insensitive to α: we tested values of α ranging from 0.01 to 1, and the results are almost the same. From the performance results, we can make the following comparisons and observations.

Joint link models vs. conditional link models: Joint link models clearly outperform conditional link models. Our joint link models, i.e., PoL, PrL, and PPL-D, always achieve the top performance and clearly outperform their conditional counterparts PoCL and PrCL. Even the symmetric joint link model SJL outperforms its conditional counterpart PHITS in most cases. This result verifies our assumption that modeling both the behavior in receiving links (popularity) and the behavior in producing links (productivity) is better than modeling just one behavior or none at all.

Non-symmetric vs. symmetric joint link models: Comparing the performance of the non-symmetric link models, i.e., PoL, PrL, and PPL-D, with that of the traditional SJL model, which is symmetric, we can see that the non-symmetric models consistently outperform SJL, and the improvement is quite significant in many cases. This verifies the benefit of separating the behavior of nodes in receiving links from that in producing links over simply ignoring the direction of links.
PPL models without vs. with restrictions: Comparing PPL-D, which does not restrict c other than imposing a weak prior, with PoL, PrL and SJL, which enforce c = a, c = b, and c = a = b, respectively, we can see that PPL-D has the best performance. However, as shown in Section 4.3.3, we can always derive from PPL-D a PoL and a PrL solution that give identical data likelihood, so this result suggests that PPL-D tends to find better solutions for the community memberships.

Popularity vs. productivity: If we could choose only one factor between popularity and productivity for community detection in our data sets, it seems that popularity has a small edge over productivity. This can be observed both in the joint link models (PoL over PrL) and in the conditional models (PoCL over PrCL). Such a result suggests that to determine the community membership of a node i in these three data sets, the nodes pointing to i may be more important than the nodes pointed to by i.

Table 4.2. Community detection performance on the Political Blog data set.
  Algo.    NMI     PWF     Modu
  PHITS    0.3829  0.7152  0.4200
  PoCL     0.4905  0.7947  0.4270
  PrCL     0.4569  0.7776  0.4243
  SJL      0.4409  0.7425  0.4323
  PoL      0.5156  0.8072  0.4324
  PrL      0.5178  0.8091  0.4324
  PPL-D    0.5365  0.8167  0.4324

Table 4.3. Community detection performance on the Cora data set.
  Algo.    NMI     PWF     Modu
  PHITS    0.0591  0.1862  0.3594
  PoCL     0.0797  0.1982  0.5982
  PrCL     0.0211  0.1666  0.4959
  SJL      0.0602  0.1840  0.6091
  PoL      0.0886  0.2014  0.6310
  PrL      0.0870  0.1993  0.6307
  PPL-D    0.0972  0.2085  0.6381

Table 4.4. Community detection performance on the Citeseer data set.
  Algo.    NMI     PWF     Modu
  PHITS    0.0117  0.1788  0.4374
  PoCL     0.0292  0.1909  0.6214
  PrCL     0.0131  0.1805  0.5954
  SJL      0.0236  0.1896  0.6348
  PoL      0.0292  0.1921  0.6648
  PrL      0.0263  0.1904  0.6612
  PPL-D    0.0317  0.1948  0.6687

4.6.3 Link Prediction

In this task, we study the performance of the joint link models on predicting links (both incoming and outgoing links). Specifically, for each node in the network we randomly hide one of its incoming links and one of its outgoing links and ask each model to recover the missing links. Such a task has practical value in applications such as friend recommendation in social networks and citation suggestion in citation networks.

Evaluation Metric for Link Prediction: We measure the performance of link prediction by recall. Two types of recall are reported, namely outlink recall and inlink recall. The outlink recall measures the ability of a model to predict the nodes pointed to by a given node; the inlink recall measures the ability of a model to predict the nodes pointing to a given node. To compute the outlink recall for node i, we first compute the outlink probabilities for node i to all other nodes by
$$\Pr(j_\leftarrow|i_\rightarrow) = \frac{\Pr(i_\rightarrow, j_\leftarrow)}{\sum_j \Pr(i_\rightarrow, j_\leftarrow)}.$$
The resulting probabilities assign an outlink rank to each node j. The outlink recall at rank position K is defined as the fraction of nodes whose top-K ranked predictions contain the true missing link. The inlink recall is defined similarly based on $\Pr(j_\rightarrow|i_\leftarrow)$. In addition, we also report the average of the inlink and outlink recalls.

Performance on Link Prediction: The recalls at top-1 through top-20 on the three data sets are given in Figures 4.1, 4.2, and 4.3. All results are averaged over 10 trials with different randomly selected missing links.
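To make the evaluation protocol concrete, the following minimal Python sketch computes the outlink recall at ranks 1 through K from a matrix of model joint link probabilities; the array `joint_prob` and the dictionary `hidden_links` are hypothetical inputs standing in for a fitted model and the randomly hidden links, and the inlink recall would be obtained the same way from the transposed scores.

```python
import numpy as np

def outlink_recall_at_k(joint_prob, hidden_links, K=20):
    """Recall at ranks 1..K for outgoing-link prediction.

    joint_prob:   (N, N) array, joint_prob[i, j] ~ Pr(i->, j<-) under a fitted model
    hidden_links: dict mapping node i -> index j of its hidden outgoing link
    Returns recall[k-1] = fraction of nodes whose top-k list contains the target.
    """
    hits = np.zeros(K)
    for i, j_true in hidden_links.items():
        scores = joint_prob[i].copy()   # Pr(j<- | i->) is proportional to the joint
        scores[i] = -np.inf             # never predict a self-link
        ranked = np.argsort(-scores)    # descending by probability
        rank = int(np.where(ranked == j_true)[0][0])
        if rank < K:
            hits[rank:] += 1            # a hit at rank r counts for every k >= r + 1
    return hits / max(len(hidden_links), 1)
```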
Because PoL, PrL and the general PPL model have equal link probabilities, and because we also found that PPL-D achieves almost the same performance as PoL and PrL, we report only one result for these models, denoted as the P-family models. We also report the results of a naive baseline, the Frequency-based model, in which the outgoing link probabilities are proportional to the indegree of nodes, i.e., $\Pr(j_\leftarrow|i_\rightarrow) \propto d_{in}(j)$, and the incoming link probabilities are proportional to the outdegree of nodes, i.e., $\Pr(j_\rightarrow|i_\leftarrow) \propto d_{out}(j)$.

Figure 4.1. (a)∼(c): Recalls on Political Blog data. (d): Degree distribution.

As can be seen from the figures, compared to SJL and the Frequency-based baseline, the P-family models perform the best in all cases except the inlink recall on the Cora network. This result illustrates that most of the time it is beneficial to use productivity and popularity to model the indegree and outdegree distributions separately in a directed network. However, the inlink recall on the Cora network is an abnormal case, where SJL performs the best, the P-family models perform worse, and the Frequency-based model has extremely poor performance (almost constantly zero). To see why this case is special, we show the degree distributions of the three networks in the rightmost panels of Figures 4.1, 4.2, and 4.3.

Figure 4.2. (a)∼(c): Recalls on Cora data. (d): Degree distribution.

All the degree distributions follow a power law except the outdegree distribution of the Cora network. The outdegree in Cora follows a rather uniform distribution with outdegree no larger than 5. (We suspect such a distribution is due to the small scale of the Cora data, which causes many references, and therefore outlinks, to fall outside the data set.) Because of such a uniform distribution, the outdegrees of nodes are not informative, which explains the extremely poor performance of the Frequency-based model. The P-family models treat indegree and outdegree as equally important and therefore also suffer from the uninformative outdegree distribution. SJL, in comparison, ignores the link direction and as a result lets the more informative indegree distribution dominate the uninformative outdegree distribution, and therefore suffers the least. This special case reveals some of the trade-offs made by the different models.
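For reference, the degree distributions discussed here can be tabulated directly from the edge list, as in the short sketch below (a generic helper, not tied to any particular data set): a heavy-tailed network shows a long tail in these counters, whereas Cora's outdegree counter concentrates on values no larger than 5.

```python
import numpy as np
from collections import Counter

def degree_distributions(edges, n_nodes):
    """In- and outdegree histograms of a directed network, i.e. the
    number of nodes having each degree value."""
    indeg = np.zeros(n_nodes, dtype=int)
    outdeg = np.zeros(n_nodes, dtype=int)
    for i, j in edges:
        outdeg[i] += 1
        indeg[j] += 1
    return Counter(indeg.tolist()), Counter(outdeg.tolist())
```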
Figure 4.3. (a)∼(c): Recalls on Citeseer data. (d): Degree distribution.

4.6.4 Degree Fitting

Finally, we verify the degree fitting property of the PPL models. Figure 4.4(a) shows the scatter plots for the indegree and outdegree fitting of the PPL models on the Political Blog data set. Note that PoL, PrL and PPL-D again give almost the same result in this experiment, so we refer to them collectively as PPL. Each point in the plot represents a node: its position on the horizontal axis is determined by its actual degree (indegree or outdegree) and its position on the vertical axis is determined by the degree predicted by the model. Therefore, a point falling on the diagonal line (the red lines in the plots) indicates a perfect degree match. As can be seen from the figure, all the points fall on the red line, which indicates that PPL captures the degree of each node exactly. In comparison, as shown in Figure 4.4(b), SJL has very poor performance in terms of degree fitting. Similar results are obtained for the paper citation data sets; Figures 4.5(a) and 4.5(b) show some of these results. These empirical studies clearly validate the degree fitting property of the PPL models stated in Section 4.3.3.

Figure 4.4. Degree fitting on the Political Blog data. (a): degree fitting by PPL. (b): degree fitting by SJL.

Figure 4.5. Indegree fitting on Cora and outdegree fitting on Citeseer. (a): degree fitting by PPL. (b): degree fitting by SJL.

4.7 Conclusions

The stochastic block model is a promising probabilistic model for community detection. In this paper, we presented a new stochastic block model, PPL, for community detection in directed networks. On one hand, our model is complete, in that it captures the roles of each node both as a link producer and as a link receiver while a single consistent community membership serves both roles; on the other hand, our model is unified, in that it offers a unified framework for connecting and understanding existing models. We believe such a complete and unified model provides a solid foundation for further studies of stochastic block models for community detection.

CHAPTER 5

Community Detection for Networked Data: A Discriminative Approach for Combining Link and Content

5.1 Introduction

In addition to the link information, nodes in networks are often described by certain attributes, which we refer to as the contents of the nodes. For example, for web pages, online blogs, or scientific papers, the contents are usually represented by histograms of keywords; in a co-authorship network, the contents of nodes can be the demographic or affiliation information of researchers. Many existing studies on community detection focus on either link analysis or content analysis alone. However, neither source of information alone is sufficient for determining the community memberships accurately: the link information is usually sparse and noisy and often results in a poor partition of the network, while irrelevant content attributes can significantly mislead the process of community detection.
It is therefore important to combine the link analysis and content analysis for community detection in networks. In this chapter, we propose a discriminative model of combining link and content analysis for community detection that explicitly addresses the above shortcomings of existing approaches. Our main contributions are summarized as follows. • We propose a conditional model for link analysis. In contrast to generative models, our approach does not attempt to generate the links; instead, the conditional probability for the destination of a given link is to be captured. To achieve this, in our model we introduce a hidden variable to capture the popularity of a node in terms of how likely 42 the node is cited by other nodes. • To alleviate the impact of irrelevant content attributes, we adopt a discriminative approach to make use of the node contents. We refer to this part as discriminative content model. As a consequence, the attributes are automatically weighed by their discriminative power in terms of telling apart salient communities. • We combine the above two models into a unified framework and propose a novel twostage optimization algorithm for the maximum likelihood inference. In addition, we show how the proposed link model and content model can be used to extend existing complementary approaches. Additional algorithms are presented to solve the extended models. To the best of our knowledge, the model proposed in this chapter is the first that combines conditional link models and discriminative content models for community detection. We conduct extensive experimental studies by using several benchmark data sets. The experimental results show significant improvement over the state-of-the-art approaches. Additional experiments are conducted to further verify the effectiveness of each of our link model and content model, respectively. The rest of the chapter is organized as follows. In Section 5.2 we present and analyze the conditional link model. In Section 5.3, we extend the link model to include the content information. Also in Section 5.3, we describe the two-stage optimization algorithm. In Section 5.4, we show extensions by combining our link model and content model with other existing content and link models. In Section 5.5, we show extensive experimental results on benchmark data sets. Finally, we give conclusions in Section 5.6. 5.2 Conditional Link Model In this section, we first present the proposed link model and followed by a maximum likelihood estimation method used to estimate the unknown parameters of the proposed model. 43 In Section 5.3, we incorporate the content information into the proposed link model by a discriminative model. 5.2.1 Popularity Conditional Link Model (PCL) Before going to the mathematical model, we first establish the assumptions and notations that are used in our model. All nodes in the network form a node space V = {1, · · · , n}, where the nodes could represent web pages, online blogs, etc. For each pair of ordered nodes (i, j), let sij record the information of the link from node i to node j. sij could either be {0, 1}, N + , or any nonnegative values dependent on the type of the link. If sij = 0, we say there is a directional link from node i to node j, or node i cites j (equivalently, node j is cited by node i). Let E = {(i → j)|sij = 0} denote all the directional links in the network. Each node i has an associated “link-in” space denoted by LI(i) ∈ V, which is the set of nodes that could possibly cite node i. 
Similarly, each node i is associated with a “link-out” space denoted by LO(i) ∈ V, which is the set of nodes that could possibly been cited by node i. Although in most cases we have LI(i) = LO(i) = V, in some scenarios such as citation of publications, the link-out space of a paper is the set of all papers that are older than the paper itself, and the link-in space is the set of all papers that are newer than the paper itself. Let I(i) = {j|sji = 0} be the set of nodes that actually cite node i, O(i) = {j|sij = 0} be the set of nodes that are actually cited by node i, and din (i) = |I(i)| be the indegree of node i, dout (i) = |O(i)| be the outdegree of node i. Finally, we denote by K the number of communities we aim to find. In our link model, we focus on modeling Pr(j|i), i.e., the probability of linking node i to node j among all the other candidates in LO(i). In other words, we model which node j among LO(i) is more likely to be cited by node i. This is in contrast to many existing approaches that explicitly model the presence or absence of link i → j, i.e., Pr(i → j). This modeling choice allows us to avoid modeling the absence of links, which was observed in [2, 63] as a major problem for link analysis. We introduce a set of hidden variables 44 zi ∈ {1, · · · , K} for each node i ∈ {1, · · · , n} to denote the community of node i. On the other hand, to model how likely a node will receive a citation in general, in our model for Pr(j|i), we introduce a popularity variable bi ≥ 0 for each node i: the higher popularity of one node, the higher chance the node will be cited by other nodes. Given the popularity and community memberships of all nodes, the link probability Pr(j|i) conditioned on the community variable zi of node i associated with this link is given as follows γjzi bj Pr(j|i; zi , b) = (5.1) j ′ ∈LO(i) γj ′ zi bj ′ where γik gives the community membership of node i in community k. As indicated by the above expression, the conditional link probability Pr(j|i) is proportional to bj , the popularity of the ending node of the link. By assuming a multinomial distribution for zi , i.e., zi ∼ Mult(γi1 , · · · , γiK ), we have Pr(j|i) written as Pr(j|i; γ, b) = γik k γjk bj j ′ ∈LO(i) γj ′ k bj ′ (5.2) where γik = Pr(zi = k). In Eq. (5.2), we assume that bi is independently from the community variable. As a result, each node will only have one copy of the popularity. An alternative approach is to have the popularity variable bi conditioned on the community variable. In other words, we have a different popularity variable bik for each node i when it is in a different community zi = k. Using the community dependent popularity bik , Pr(j|i) is computed as γjzi bjzi Pr(j|i; zi , b) = j ′ ∈LO(i) γj ′ z bj ′ z i i or by integrating out zi Pr(j|i; b) = γik k γjk bjk j ′ ∈LO(i) γj ′ k bj ′ k (5.3) Comparing Eq. (5.3) to (5.2), we see that Eq. (5.3) introduces the freedom of modeling the community dependent popularity at the price of increasing number of variables. As will be shown in our empirical study, Eq. (5.2) achieves better performance because of the reduced number of variables. 45 5.2.2 Analysis of the PCL Model In this section, we analyze our link model by establishing the relation and comparing to PHITS model [16]. For the purpose of consistency, we assume LO(i) = V for all i. In PHITS, each community is assumed to have a multinomial distribution that specifies the probability for each node to be cited by the other nodes in the same community. 
We denote by βjk the probability for node j to be cited by any nodes in the k th community. Pr(j|i) conditioned on community variable zi of node i for this link, and β is then expressed as Pr(j|i; zi , β) = βjzi Note that unlike our model in Eq. (7.6), the conditional link probability in PHITS model has nothing to do with the community membership of node j. This leads to the problem of undetermined community membership for nodes that do not cite any other nodes for PHITS, as discussed in the next section. By integrating out zi , we have Pr(j|i) written as Pr(j|i; γ, β) = γik βjk (5.4) k where γik is the probability that node i is in the kth community. The following proposition allows us to establish the relationship between the PHITS model and the popularity-based conditional link model. Proposition 1. The PHITS model specified in Eq. (5.4) is equivalent to the link model with Pr(j|i) specified in Eq. (5.3). The above proposition is proved by observing the link between βjk and the quantity γjk bjk / j ′ γj ′ k bj ′ k . As revealed by the above proposition, PHITS is in fact a relaxed version of the proposed PCL model by assuming that the popularity of each node depends on the community of the node. We can also derive the proposed model in Eq. (5.2) from the PHITS model in Eq. (5.4) by considering the relationship between γjk and βjk , as revealed by the following proposition. 46 Proposition 2. The popularity-based conditional link model specified in Eq. (5.2) is equivalent to the PHITS model specified in Eq. (5.4) if βjk is interpreted as Pr(j|Ck ), i.e., the probability of selecting node j from the k th community. The above proposition follows the Bayes’s rule, i.e., Pr(Ck |j) Pr(j) ′ ′ = j ′ Pr(Ck |j ) Pr(j ) Pr(j|Ck ) = γjk bj j ′ γj ′ k bj ′ The above proposition once again reveals that the proposed conditional link model is a restricted version of the PHITS model. We believe that it is the constraints introduced in the proposed conditional link model that lead to more reliable performance. 5.2.3 Maximum Likelihood Estimation In this section, we present the method of maximum likelihood for the PCL model specified in Eq. (5.2). Observing the directional links E = {(i → j)|sij = 0}, we write the log-likelihood as log L = sij log ˆ γik k (i→j)∈E where sij is normalized sij such that ˆ ˆ j∈LO(i) sij γjk bj j ′ ∈LO(i) γj ′ k bj ′ (5.5) = 1. We find optimal γ and b by maximizing the log-likelihood max γ,b sij log ˆ (i→j)∈E s.t. k γik k γjk bj j ′ ∈LO(i) γj ′ k bj ′ γik = 1, γik ≥ 0, bi ≥ 0 To derive the EM algorithm, we first have the following lemma for a low bound for log L. Lemma 5. The log-likelihood log L in Eq. (5.5) at the tth iteration is lower bounded as follows log L ≥ − sij ˆ k (i→j)∈E sij ˆ (i→j)∈E  qijk log γik + log γjk + log qijk log qijk k 47 bj +1− τik j ′ ∈LO(i) γj ′ k bj ′ τik   where the parameters τik and qijk are computed as τik = j ′ ∈LO(i) qijk ∝ t−1 γik γ t−1 bt−1 ′ ′ (5.6) jk j t−1 γjk bt−1 j s.t. t−1 t−1 j ′ ∈LO(i) γj ′ k bj ′ qijk = 1 (5.7) k and bt−1 , γ t−1 are the corresponding solutions in the t − 1th iteration. The above lemma follows from the Jensen’s inequality and the inequality of − log x ≥ 1−x. Using the result in the above lemma, we search for b and γ at the tth iteration that maximize the lower bound of log L, i.e., sij ˆ max γ,b (i→j)∈E s.t. k k  qijk log γik γjk bj − γj ′ k bj ′ τik j ′ ∈LO(i) γik = 1, γik ≥ 0, bi ≥ 0   (5.8) For this maximization problem, we have the following theorem. 
Before stating the theorem, we first establish the notations for the purpose of representation: nin (i, k) = sji qjik ˆ nout (i, k) = j∈I(i) nin (i) = sij qijk ˆ j∈O(i) nin (i, k) nout (i) = k nout (i, k) k n(i, k) = nin (i, k) + nout (i, k) m(i, k) = j∈LI(i) nout (j, k) τjk Theorem 6. The optimal solution to Eq. (5.8) satisfies the following conditions ∀i, dout (i) = 0, din (i) = 0, γik = n(i, k) , m(i, k)bi + nout (i) bi = nin (i) k m(i, k)γik ∀i, dout (i) = 0, din (i) = 0, n (i, k) , γik ∝ in m(i, k) bi = 48 nin (i) k m(i, k)γik (5.9) ∀i, dout (i) = 0, din (i) = 0, γik = nout (i, k) , k nout (i, k) bi = 0 ∀i, dout (i) = 0, din (i) = 0, γik is any non-negative value such that γik = 1, bi = 0 k Remark: As revealed in Eq. (5.9), bi is proportional to the number of nodes that cites node i, i.e., nin (i), which is consistent with interpreting bi as “popularity” or “authoritative” for node i. Advantage of PCL over PHITS can also be seen in the solution of γik . It can be shown that the membership of node i in PHITS model only depends on the membership of the nodes that are cited by node i, i.e., γik ∝ nout (i, k), and not affected by the nodes that cite node i. When nout (i) = 0, i.e., node i has no outgoing links, the membership γik is not determined. In contrast, in PCL model, community membership of node i depends on the membership of all the nodes connected to node i. 5.3 Discriminative Content Model In this section, we extend our link model to incorporate the content information of nodes. As we discussed in Sections 5.1, most existing approaches combine link and content by a generative model that generates both links and content attributes via a shared set of hidden variables related to community memberships. In this work, we propose a discriminative model, referred to as Discriminative Content(DC) model, to incorporate the content into the proposed link model. Let xi ∈ Rd denote the content vector of node i. The content information is used to model the memberships of nodes by a discriminative model, given by Pr(zi = k) = yik = T exp(wk xi ) T l exp(wl xi ) (5.10) where wk ∈ Rd is a d-dimensional weight vector for community k with each element corresponding to each attribute. We can see that by incorporating the content model, the 49 community membership is no longer specified by parameters γik , but rather conditioned on the content through yik by a softmax transformation. Then, the conditional link probability Pr(j|i) expressed in Eq. (5.2) is modified as follows Pr(j|i; b, w) = yik k yjk bj j ′ ∈LO(i) yj ′ k bj ′ where yik depends on w as given in Eq. (5.10). As revealed in the above expression, we do not generate the content attributes as most topic models do. Instead, by using the discriminative model, with an appropriately chosen weight vector wk that assign large weights to important attributes and small weights or zero weights to irrelevant attributes, we avoid the shortcoming of the generative models, i.e., being misled by irrelevant attributes. Another benefit from the discriminative model is that we can use a non-linear transformation φ(x) : Rd → Rm on the content vector as the new attribute to obtain a non-linear model. In the sequel, we use φ(x) rather than x for presentation. The log-likelihood of the combined model is written as log L = sij log ˆ (i→j)∈E yik k yjk bj j ′ ∈LO(i) yj ′ k bj ′ (5.11) We maximize the log-likelihood over the free parameters w and b. 
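To make the combined objective concrete, here is a minimal Python sketch of Equations (5.10) and (5.11): the softmax memberships computed from the content, and the log-likelihood summed over observed links. The arrays W, b, X and the edge list are hypothetical inputs, and LO(i) is taken to be the whole node set for simplicity.

```python
import numpy as np

def memberships(W, X):
    """Eq. (5.10): y_ik = softmax_k(w_k^T x_i) for every node."""
    logits = X @ W.T                                 # (N, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    Y = np.exp(logits)
    return Y / Y.sum(axis=1, keepdims=True)

def combined_log_likelihood(W, b, X, edges):
    """Eq. (5.11): sum over observed links of s_hat_ij * log Pr(j | i; b, w).

    W:     (K, d) weight vectors of the discriminative content model
    b:     (N,)   node popularities
    X:     (N, d) content vectors
    edges: list of (i, j, s_hat_ij) with normalized link weights
    """
    Y = memberships(W, X)
    weighted = Y * b[:, None]                        # y_jk * b_j
    per_comm = weighted / weighted.sum(axis=0)       # normalize over candidate targets j'
    ll = 0.0
    for i, j, s in edges:
        ll += s * np.log(per_comm[j] @ Y[i])         # sum_k y_ik * y_jk b_j / denom_k
    return ll
```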
Although we can use any gradient-based algorithm to optimize with wk and bi , we propose an efficient two-stage method as discussed in the next section, which helps us better understand the relation of link model and content model. A Two-Stage Method for Optimization In this section, we describe the method to maximize the log-likelihood in Eq. (5.11). We still use the EM algorithm to maximize the log-likelihood. In the E-step, we compute τik and qijk from y and b. In the M-step, we maximize the following problem:   yj ′ k bj ′  sij ˆ qijk log yik yjk bj − max τik w,b ′ (i→j)∈E k j ∈LO(i) 50 (5.12) subject to non-negative constraints on b. Instead of maximizing over w, we convert Eq. (7.1) into a constraint optimization problem over y and b by max y∈∆,b sij ˆ k (i→j)∈E where the domain ∆ is defined as ∆=  qijk log yik yjk bj − y|∃w, yik = yj ′ k bj ′ j ′ ∈LO(i) τik   T exp(wk φ(xi )) (5.13) (5.14) T l exp(wl φ(xi )) By having the domain of y given in Eq. (5.14) as a convex set, we can take a projection method to maximize the problem of Eq. (5.13), which leads to the two-stage method. In the first stage, we simply ignore the complex constraint for yik imposed by the domain ∆ and solve the optimization problem in Eq. (5.13) with only sum-to-one constraint on yik and non-negative constraints on b using the result in Theorem 6. In the second stage, we project the yik into the domain ∆. Let yik denote the optimal solution obtained from the first stage. ˜ The projection of yik , denoted by yik , is obtained by minimizing the KL divergence between ˜ yik and yik ∈ ∆, which is equal to the following optimization problem ˜ max w yik log yik = ˜ i yik log ˜ i k T exp(wk φ(xi )) T l exp(wl φ(xi )) k This problem is similar to the log-likelihood in multi-class logistic regression problem except that the class membership yik is not just binary but between 0 and 1 . As in logistic ˜ regression, we can add regularization term on wk to make the solution more robust, which leads to the following optimization problem max w yik log ˜ i k T exp(wk φ(xi )) T l exp(wl φ(xi )) − λ 2 T wk wk (5.15) k where λ is the regularization coefficient. This problem is a convex problem [9] and has a unique optimal solution, and can be maximized efficiently by the Newton-Raphson method. By converting the optimization problem over w into the problem over y and taking the two-stage method, we are able to have a better understanding of our combined model—the 51 Algorithm 1 Algorithm for maximizing the log-likelihood 1. Input the number of iterations or convergence rate 2. Initialize wk to zeros, bi randomly, λ to a fixed value 3. in the E-step, compute τik and qijk as in Eq. (5.6) and (5.7) using yik rather than γik 4. in the M-step, • compute γik , and bi as in Theorem 6 • update wk by maximizing the objective in Eq. (5.15) with γik in place of yik , and ˜ then compute yik 5. repeat Step 3 and 4 until the input number of iterations is exceeded or convergence rate is satisfied. 6. Output γik or yik as the final membership link structure will first give us a noisy estimation of community memberships y , and the ˜ noisy memberships are then used as supervised information for our discriminative content model to derive high-quality memberships y. These estimated memberships are further used in our EM iterations. Algorithm 1 summarizes the overall algorithms for combined link and content analysis for community detection. 
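For intuition about the projection step, the following sketch fits the weight vectors by plain gradient ascent on the regularized objective in Eq. (5.15), treating the free-form memberships from the first stage as soft targets; it is only an illustrative stand-in (the function name, learning rate and iteration count are assumptions), whereas the actual implementation uses the Newton-Raphson method as noted above.

```python
import numpy as np

def project_memberships(Y_tilde, X, lam=10.0, lr=0.01, iters=200):
    """Second stage (Eq. 5.15): fit W so that softmax(W x_i) matches the
    free-form memberships Y_tilde, with an L2 penalty on each w_k.

    Y_tilde: (N, K) soft targets from the first (link-only) stage, rows sum to 1
    X:       (N, d) content features phi(x_i)
    Returns W (K, d) and the projected memberships Y inside the feasible domain.
    """
    K, d = Y_tilde.shape[1], X.shape[1]
    W = np.zeros((K, d))
    for _ in range(iters):
        logits = X @ W.T
        logits -= logits.max(axis=1, keepdims=True)
        Y = np.exp(logits)
        Y /= Y.sum(axis=1, keepdims=True)
        grad = (Y_tilde - Y).T @ X - lam * W   # gradient of the regularized objective
        W += lr * grad
    return W, Y
```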
The algorithm has a time complexity of O(M(eKC1 + nKC2 + T3 )), where M is the number of iterations, e is the number of links in the network, n is the number of nodes in the network, C1 is a constant factor in computing qijk and τik , C2 is a constant factor in computing γik and bi , and T3 is the time for maximizing problem in Eq. (5.15) by the Newton-Raphson method. 5.4 Extensions In this section, we discuss two variants of the proposed framework for combining link information with content information. In the first variant, referred to as PCL+PLSA, we present an approach that combines the proposed conditional link model with the PLSA model for content analysis. In the second variant, referred to as PHITS+DC, we present an approach that combines the PHITS model for link analysis with the proposed discrimi- 52 native approach for content analysis. These two combined models will serve as baselines in our experimental study. 5.4.1 PCL + PLSA Similar to [17] where the PHITS link model is combined with PLSA content model, we combine our PCL link model with PLSA. The combined log-likelihood is given by log L = α sw log ˆij k i,j∈W(i) w βjk γik + (1 − α) sl log ˆij γik k i,j∈O(i) γjk bj j ′ ∈LO(i) γj ′ k bj ′ where α is combination coefficient, sw is the normalized number of times that word j occurs ˆij in the content of node i, W(i) denotes the set of unique words that occur in the content w of node i, and βjk = Pr(word j|Ck ). To maximize the log-likelihood, we derive the EMw l algorithm as follows. In the E-step, we compute qijk , qijk and τik as w w qijk ∝ γik βjk , τik = j ′ ∈LO(i) l qijk ∝ γik w qijk = 1 s.t. k γj ′ k bj ′ γjk bj , ′ ∈LO(i) γj ′ k bj ′ j l qijk = 1 s.t. k w In the M-step, we compute βjk , γik and bi as w βjk = γik = bi = ˆw w nw (j, k) i∈N (j) sij qijk in = w sw qijk ˆjk w j i∈N (j) j nin (j, k) αnw (i, k) + (1 − α)nl (i, k) out αnw (i) + (1 − α) nl (i) + bi ml (i, k) out out l (i) nin l k m (i, k)γik where N (j) denotes the set of nodes whose content have the word j, and nw , nw , nl , out in in nl , nl , and ml are defined similar as before. out 53 5.4.2 PHITS + DC In this variant, we combine the PHITS link model with our DC content model. The loglikelihood is given by log L = T where yik = exp(wk φ(xi ))/ sij log ˆ yik βjk k (i→j)∈E T l exp(wl φ(xi )). In the E-step, we compute qijk as qijk ∝ yik βjk , qijk = 1 k In the M-step, we first compute βjk and the free form membership γik by ˆ i∈I(i) sij qijk βjk = ˆ i∈I(i) sij qijk j ˆ j∈O(j) sij qijk γik = ˆ j∈O(j) sij qijk k nin (j, k) j nin (j, k) = = nout (i, k) nout (i) Then we maximize the following objective to get wk and yik , max k 5.5 i γik log yik − λ 2 T wk wk k Experiments In this section, we conduct several experimental studies. We first compare the PCL model with the PHITS model for the task of link prediction. Then we compare the performance of the PCL model with that of several state-of-the-art methods on the task of community detection by using two citation data sets. Before going into the details, we first describe the data sets and the metrics used in the experiment and evaluation. 5.5.1 Data Sets We used four data sets in our experiments. They are described in the following: Political Blog Data Set is a social blog network, which is a directed network of hyperlinks between webblogs about the US political issues, recorded in 2005 by Adamic and 54 Glance [1]. There are totally 1490 blogs, and each blog is labeled as either conservative or liberal. 
In the data set, we only have the link information and have no content information. So this data set is only used in the link prediction task to compare the PCL model with the PHITS model. The number of communities for this data set is set to K = 2. Wikipedia Data Set is a web page network which was crawled from Wikipedia web site by Gruber et al. [35]. This data set has 105 nodes and 799 links. This data set contains no explicit community label for each page. So we only use this data set in the link prediction task, with K set to 20 as suggested in [35]. Cora Data Set is a subset of the larger Cora citation data set [64]. This data set includes publications from the machine learning area, each of which is classified into 7 sub-categories as: Case-based reasoning, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning and Theory. There are totally 2708 nodes, and 5429 links. Each node corresponds to one paper and is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary of 1433 unique words. We use this data set in both the link prediction task and the community detection task. The number of communities is set to be K = 7. Citeseer Data Set is a subset of the larger Citeseer data set (http://citeseer.ist. psu.edu/). The Citeseer data set consists of 3312 scientific publications labeled as one of 6 classes and 4732 links. Each publication is described by a 0/1 valued word vector. The dictionary of word consists of 3703 unique words. This data set is used in both link prediction and community detection tasks. The number of communities is set to be K = 6. 5.5.2 Evaluation Metrics In the comparison of the PCL model and the PHITS model on the task of link prediction, we hide some links from the network, and run the two models on the remaining links. The performance is measured by the metric of Recall. Recall is an Information Retrieval measure. For each node, we compute the probabilities 55 for the node to generate links to the other nodes and then sort these probabilities in the decreasing order. The recall is computed at each position in the rank and defined as the fraction of target nodes that correspond to the hidden links. The recall is reported from positions 1 to 20 in the rank. To measure the performance of community detection, we used four metrics among which two are supervised and the other two are unsupervised. The two supervised metrics are normalized mutual information (NMI), and pairwise F-measure (PWF). These two metrics use the supervised label information. The other two unsupervised metrics are modularity (Modu) and normalized cut (NCut). These two metrics measure the partition performance in terms of the link structure. With the supervised label information, we can form the true community structure C = {C1 , . . . , CK }, where Ck contains the set of nodes that are in the kth community. The ′ ′ community structure given by the algorithms is represented by C ′ = {C1 , . . . , CK }. Then the mutual information between the two is defined as MI(C, C ′ ) ′ p(Ci , Cj ) log = ′ Ci ,Cj ′ p(Ci , Cj ) ′ p(Ci )p(Cj ) and the normalized mutual information is defined by NMI(C, C ′ ) = MI(C, C ′ ) max(H(C), H(C ′ )) where H(C) and H(C ′ ) are the entropies of the partitions C and C ′ . The higher the normalized mutual information, the closer the partition is to the ground truth. 
Let T denote the set of node pairs that have the same label, S denote the set of node pairs that are assigned to the same community, |T | denote the cardinality of set T . The pairwise F-measure is computed from the pairwise precision and recall, as the following precision = |S PWF = T |/|S| recall = |S 2 × precision × recall precision + recall 56 T |/|T | The higher the PWF, the better is the partition. Modularity is proposed by Newman et al. [72] for measuring community partitions. For a given community partition C = {C1 , . . . , CK }, the modularity is defined as Modu(C) = k where Cut(Ci , Cj ) = p∈Ci ,q∈Cj Cut(Ck , Ck ) − Cut(C, C) Cut(Ck , C) 2 Cut(C, C) wpq . As stated in [72], modularity measures how likely a network is generated due to the proposed community structure versus generated by a random process. Therefore, a higher modularity value indicates a community structure that better explains the observed network. Normalized cut is the objective of the normalized cut algorithm ([83], which we refer to as NCUT). Given a community partition C = {C1 , . . . , CK }, the normalized cut is defined as K NCut(C1 , · · · , Ck ) = i=1 ¯ Cut(Ci , Ci ) vol(Ci ) ¯ where Ci denotes the set of nodes that are not in Ci and vol(Ci ) = 5.5.3 p∈Ci q wpq . Link Prediction To validate the advantage of the PCL link model over the PHITS link model, we experiment them on the four data sets described in Section 5.5.1. The performance is reported in Figure 5.1 in terms of recall at positions 1 to 20. Each number in the figure is averaged over 5 runs. The PCL outperforms the PHITS in all the cases. To investigate the effects of the popularity parameter, b, we also perform the same experiments on PCL by setting bi = 1 for all i. The results are labeled as “PCL-b=1” in the figure. The performance given bi = 1 is worse than PCL and PHITS. It further confirms the importance of the popularity parameter. Overall, this result validates our conjecture that the conditional link model outperforms the generative link model, at least for the task of link predication. 57 Table 5.1. Community detection performance on Cora and Citeseer dataset Cora PWF Modu Algorithm NMI L PHITS LDA-Link PCL NCUT 0.0570 0.0762 0.0884 0.1715 0.1894 0.2278 0.2055 0.2864 C PLSA LDA-Word NCUT(RBF kernel) NCUT(pp kernel) 0.2107 0.2310 0.1317 0.1804 PHITS-PLSA LDA-Link-Word LCF NCUT(RBF kernel) NCUT(pp kernel) PCL-PLSA PHITS-DC PCL-DC 0.3140 0.3587 0.1227 0.2444 0.3866 0.3900 0.4359 0.5123 LC 5.5.4 Citeseer PWF Modu NCut NMI NCut 0.3929 0.2189 0.5903 0.2701 3.2466 4.5687 1.9391 0.2732 0.0101 0.0356 0.0315 0.1833 0.1773 0.2363 0.1927 0.3252 0.4588 0.2211 0.6436 0.6577 2.2370 3.7457 1.1181 0.1490 0.2864 0.2774 0.2457 0.2912 0.2682 0.2970 0.1839 0.2487 4.2686 3.7820 4.7775 4.6612 0.0965 0.1342 0.0976 0.1986 0.2298 0.2880 0.2386 0.3282 0.2885 0.3022 0.2133 0.4802 3.2294 3.0165 3.7078 1.8118 0.3526 0.3969 0.2456 0.3062 0.4214 0.4233 0.4526 0.5450 0.3956 0.4576 0.1664 0.3703 0.5158 0.5503 0.6384 0.6976 3.2880 2.8906 4.8101 1.6585 0.7903 2.1575 1.5165 1.0093 0.1188 0.1920 0.0934 0.1592 0.1986 0.2207 0.2062 0.2921 0.2596 0.3045 0.2361 0.2957 0.3282 0.3334 0.3295 0.3876 0.3863 0.5058 0.2011 0.4280 0.4802 0.5505 0.6117 0.6857 2.7397 2.0369 3.6721 1.7592 1.8118 1.6786 1.2074 0.7505 Community Detection In this section, we investigate the performance of our model on the task of community detection. We perform experiments on the two scientific publication date sets, which have both link and content information. 
To validate the advantage of our proposed model, we compare it with several baselines. Based on what information is used, the algorithms are categorized into 3 classes: Based on Link, we compare the following models: PHITS, PCL, LDA-Link, and Spectral Clustering (NCUT). Based on Content, we compare the following: PLSA, LDA-Word, and Spectral Clustering. In spectral clustering, the similarity matrix is the kernel matrix computed from the content of each publication. Here we report two kernels, one is the RBF kernel, and the other is the probabilistic product kernel proposed in [47]. Based on Link and Content, we compare the following: PHITS-PLSA, LDA-Link-Word, Link-Content-Factorization (LCF), Spectral Clustering, PCL-PLSA, PHITS-DC, and PCLDC. Notice that PHITS-PLSA refers to the combination of PHITS and PLSA proposed in [17], LDA-Link-Word refers to the mixed membership model proposed in [28], LCF refers 58 Recall on Political Blog Recall 0.3 Recall on Wikipedia 0.8 PCL PCL−b=1 PHITS 0.6 Recall 0.4 0.2 0.1 0 0 0.4 0.2 5 10 Rank 15 0 0 20 (a) Recall on Political Blog PCL PCL−b=1 PHITS 0.3 0.2 0.1 0 0 10 Rank 15 20 Recall on Citeseer 0.4 Recall Recall 0.3 5 (b) Recall on Wikipedia Recall on Cora 0.4 PCL PCL−b=1 PHITS PCL PCL−b=1 PHITS 0.2 0.1 5 10 Rank 15 0 0 20 (c) Recall on Cora 5 10 Rank 15 20 (d) Recall on Citeseer Figure 5.1. Recall on the four data sets to the model proposed in [113], Spectral Clustering is applied to linear combined kernel from the link matrix and content kernel, PCL-PLSA refers to the combination of the PCL and the PLSA model as described in Section 5.4, PHITS-DC refers to the PHITS model combined with the Discriminative Content model, and PCL-DC refers to the PCL model combined with the Discriminative Content model. In the implementation, the feature vector used in our model is the original word indicator vector without any transformation; the spectral clustering we used is the normalized cut algorithm [83] (NCUT). For the algorithms that are dependent on some parameters such as the σ parameter in RBF kernel, the combination coefficient in PHITS-PLSA, the combi- 59 nation coefficient of link matrix and content kernel for spectral clustering, the combination coefficient in PCL-PLSA, the regularization coefficient in PHITS-DC, we experiment on a wide range of values and choose the best one in terms of normalized mutual information and pairwise F-measure. For example, the combination coefficients in PHITS-PLSA, PCL-PLSA, and combined link matrix and content kernel are tuned from 0.1 to 0.9 with 0.1 as the step size. The regularization coefficient for PHITS-DC is tuned from 0 to 50 with 5 as the step size. The regularization coefficient for PCL-DC is set to a fixed value of 10. All the iterative algorithms are run until the relative difference of the objective is within 10−8 . Tables 5.1 show the results on the Cora data set and the Citeseer data set. For both data sets, PCL outperforms PHITS in all the cases, either using link only (PCL outperforms PHITS), or combining link and content (PCL-PLSA outperforms PHITS-PLSA and PCL-DC outperforms PHITS-DC). When considering content, the approaches that discriminatively combine content (DC) outperform the approaches that combine content using PLSA. That is, PHITS-DC outperforms PHITS-PLSA, and PCL-DC outperforms PCL-PLSA. These results further confirm that the discriminative models (either the link model, or the content model, or the combination of the two) achieve better performance than the generative ones. 
We also compared PCL and PCL-DC with the following algorithms. In the link-only case, the spectral clustering (NCUT) outperforms PCL. LDA-Link outperforms PCL in some metrics. When combining link and content, PCL-DC outperforms all algorithms except for the spectral clustering (NCUT) algorithm in the normalized cut (NCut) metric. The main reason for the spectral clustering (NCUT) to have the best performance in terms of normalized cut is that it directly minimizes this metric. However, we argue that people would consider the NMI and PWF metrics as equally important, because the NMI and PWF metrics measure how good the partition derived by the algorithms matches the ground truth. Finally, to reveal the performance of our model under different parameters, we show the performance of the PCL-DC model under different regularization coefficient λ on the two data sets in Figure 5.2. In both data sets, the performance achieves the highest level when 60 λ = 5. After that, the PCL-DC algorithm is not very sensitive to λ. Measure Vs. λ Measure Vs. λ 0.7 0.5 0.4 0.3 0.2 0 PWF NMI 0.4 Measure 0.6 Measure 0.5 PWF NMI 0.3 0.2 0.1 10 20 λ 30 40 0 0 50 (a) Cora 10 20 λ 30 40 50 (b) Citeseer Figure 5.2. Partition Measure of PCL-DC vs. λ 5.6 Conclusions In this chapter, we proposed a unified model to combine link and content analysis for community detection. To accurately model the link patterns, a conditional link model is proposed to capture the popularity of nodes. In order to alleviate the problem caused by the irrelevant attributes, a discriminative model, instead of a generative model, is proposed for modeling the contents of nodes. The link model and content model are combined via a probabilistic framework through the shared variables of community memberships. We observed that the combined model obtains significant improvement over the state-of-the-art approaches for community detection. 61 CHAPTER 6 Clustering for Noisy M&C Linked Data: A Probabilistic Link Prediction Model (I) 6.1 Introduction In this chapter and the next chapter, we explore the problem of clustering for noisy M&C linked data. Learning from M&C linked data has been studied extensively and has found its application in distance metric learning [102], constrained clustering [8], and kernel learning [42]. The M&C link information include the must-link for pairs in the same class and cannot-link for pairs in different classes. They are also termed as positive (pairwise) constraints for must-link pairs, and negative (pairwise) constraints for cannot-link pairs. M&C link information can often be derived from data, making it more attractive than the standard setup of supervised learning. For instance, in classifying research articles, we can derive the pairwise constraints based on the citations between papers. Although various algorithms have been proposed for learning from M&C linked data, most of them assume perfect M&C link information. In contrast, in this study, we focus on the problem of learning from noisy M&C link information in which some of the pairwise constraints are labeled incorrectly. This is important because the pairwise constraints extracted from data tend to be noisy and inaccurate. In the example of classifying research articles with pairwise constraints constructed from paper citations, the cited paper may not share the same research topic as the citing paper. To cluster the noisy M&C linked data, we proposed to learn a combined kernel from 62 the noisy M&C linked data. 
We propose a probabilistic link prediction model that learns the combination weights based on a generalized maximum entropy model, or equivalently a regularized logistic regression model. We propose two different approaches for estimating the sufficient statistics from the noisy M&C link information under different assumptions. We show that, under the stated assumptions, the probabilistic model trained from the noisy M&C link information converges to the model trained from the perfect M&C link information. Extensive experimental results verify the efficacy of the proposed framework for clustering noisy M&C linked data. The remainder of this chapter is organized as follows. In Section 6.2, we present the probabilistic link prediction model for learning from noisy M&C link information. We present experimental results in Section 6.3 and conclude our study in Section 6.4.

6.2 Learning from Noisy M&C Linked Data

We start with the basic formulation of maximum entropy learning from perfect M&C link information, followed by its generalization and its equivalence to a regularized logistic regression model. We then extend the formulation to the case of noisy M&C link information. For the purpose of presentation, we first introduce the notation used throughout this chapter.

6.2.1 Notations

Let D = {x_i ∈ X, i = 1, · · ·, N} be a collection of data points, and let P = {(x^1_i, x^2_i, ỹ_i) | x^1_i, x^2_i ∈ D, i = 1, . . . , n, ỹ_i ∈ {+1, −1}} be a collection of observed labeled pairs. We slightly abuse the terminology of labeled and unlabeled examples by referring to the examples in D that also occur in P as labeled examples, and to the remaining examples in D as unlabeled examples. We denote by y_i the true label for the pair (x^1_i, x^2_i). We refer to the pairs with y_i = +1 as perfect must-link pairs or perfect positive constraints, and to the pairs with y_i = −1 as perfect cannot-link pairs or perfect negative constraints. Similarly, we refer to the pairs with ỹ_i = +1 as noisy must-link pairs or noisy positive constraints, and to the pairs with ỹ_i = −1 as noisy cannot-link pairs or noisy negative constraints. We write ȳ = −y for the complement of y. We use κ_j(x^1, x^2), j ∈ {1, · · ·, m}, for the jth kernel feature function defined on X × X, and denote by k_i = (κ_1(x^1_i, x^2_i), · · ·, κ_m(x^1_i, x^2_i))^⊤ the feature vector for the pair (x^1_i, x^2_i). Throughout this chapter, we use capital letters X, Y, Ỹ for the corresponding random variables. In the sequel, we use the notation defined in Table 6.1, where δ(y_i, y) is the Kronecker delta function that outputs 1 if y_i = y and zero otherwise; quantities written with a tilde are computed with the noisy labels ỹ_i (or the random variable Ỹ) in place of the true labels. We let a^δ_y[κ_j] denote the jth element of a^δ_y[k], and similarly for ã^δ_y[κ_j]. The empirical averages a^δ_y[k] and a^o[k] in Table 6.1 are referred to as the sufficient statistics.

Table 6.1. Notations

  Expectation                                               Empirical Average
  E^δ_y[k] = E_{X^1,X^2,Y}[δ(Y, y) k(X^1, X^2)]             a^δ_y[k] = (1/n) Σ_{i=1}^n δ(y_i, y) k_i
  Ẽ^δ_y[k] = E_{X^1,X^2,Ỹ}[δ(Ỹ, y) k(X^1, X^2)]             ã^δ_y[k] = (1/n) Σ_{i=1}^n δ(ỹ_i, y) k_i
  E[k]     = E_{X^1,X^2}[k(X^1, X^2)]                       a[k] = (1/n) Σ_{i=1}^n k_i
  E^o[k]   = E_{X^1,X^2,Y}[Y k(X^1, X^2)]                   a^o[k] = (1/n) Σ_{i=1}^n y_i k_i
  E^c_y[k] = E_{X^1,X^2|Y=y}[k(X^1, X^2)]                   a^c_y[k] = Σ_{i=1}^n δ(y_i, y) k_i / Σ_{i=1}^n δ(y_i, y)
  Ẽ^c_y[k] = E_{X^1,X^2|Ỹ=y}[k(X^1, X^2)]                   —

6.2.2 Generalized Maximum Entropy Model

We propose to learn a probabilistic link prediction model Pr(Y | X^1, X^2), i.e., given a pair X^1, X^2, how likely they are to be related by a must-link or a cannot-link. We first consider the maximum entropy model for learning the conditional link prediction model from perfect M&C link information.
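The empirical averages in Table 6.1 are straightforward to compute. As a small illustration (a Python sketch with hypothetical names; the feature vectors k_i are assumed to be precomputed for every observed pair), the noisy-label quantities ã^δ_y[k], a[k], and ã^o[k] can be obtained as follows:

import numpy as np

def empirical_averages(K, y_tilde):
    # K: (n, m) array whose row i is the kernel feature vector k_i for pair i.
    # y_tilde: (n,) array of noisy labels in {+1, -1}.
    # Returns the empirical averages built from the noisy labels.
    n = len(y_tilde)
    a_delta = {y: K[y_tilde == y].sum(axis=0) / n for y in (+1, -1)}   # \tilde a^delta_y[k]
    a_all = K.sum(axis=0) / n                                          # a[k]
    a_o = (y_tilde[:, None] * K).sum(axis=0) / n                       # \tilde a^o[k]
    return a_delta, a_all, a_o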
We cast the problem of learning from M&C link information as a binary classification problem in which the objective is to classify each pair (x^1_i, x^2_i) into one of two categories, i.e., a positive pair (y_i = +1) or a negative pair (y_i = −1). Using the maximum entropy principle, we aim to learn the conditional distribution Pr(Y = y | X^1, X^2), which leads to the following optimization problem:

  max_p   Σ_{i=1}^n H(p | x^1_i, x^2_i)                                                           (6.1)
  s.t.    (1/n) Σ_{i=1}^n p(y | x^1_i, x^2_i) κ_j(x^1_i, x^2_i) = a^δ_y[κ_j],   ∀ y, j

where H(p | x^1_i, x^2_i) = − Σ_y p(y | x^1_i, x^2_i) ln p(y | x^1_i, x^2_i). The solution to (6.1) is given by

  p(y | x^1_i, x^2_i) = exp(y w^⊤ k_i / 2) / ( exp(w^⊤ k_i / 2) + exp(−w^⊤ k_i / 2) ) = 1 / (1 + exp(−y w^⊤ k_i))

where w ∈ R^m are the dual variables, obtained by solving the following optimization problem:

  min_{w ∈ R^m}   Σ_{i=1}^n ln( 1 + exp(−y_i w^⊤ k_i) ) = − (1/2) Σ_{i=1}^n y_i w^⊤ k_i + Σ_{i=1}^n ln Σ_y exp( y w^⊤ k_i / 2 )

One major problem with the maximum entropy model in (6.1) is the equality constraint, which is unlikely to hold if, for each pair (x^1_i, x^2_i), y_i is a random sample from the distribution p(y | x^1_i, x^2_i). We denote by a^p_y[κ_j] the left-hand side of the equality constraint in problem (6.1), i.e.,

  a^p_y[κ_j] = (1/n) Σ_{i=1}^n p(y | x^1_i, x^2_i) κ_j(x^1_i, x^2_i)

The following theorem shows that a^p_y[κ_j] and a^δ_y[κ_j] could differ significantly if n is small; the difference between the two quantities diminishes only as n approaches infinity.

Theorem 7. Assume (x^1_i, x^2_i, y_i) are i.i.d. samples from an unknown distribution P(X^1, X^2, Y), and that the kernel functions are bounded, |κ_j(x^1, x^2)| ≤ R, ∀ j. Then the equality constraint in (6.1) for any j and y holds with probability 1 as the number of instances approaches infinity. In particular, for any ǫ > 0 we have

  Pr( |a^p_y[κ_j] − a^δ_y[κ_j]| ≥ ǫ ) ≤ 4 exp( −ǫ² n / (8R²) )

The theorem can be proved by noting that E[a^δ_y[κ_j]] = E[a^p_y[κ_j]] and applying McDiarmid's inequality; details are provided in the appendix. To address the case that a^p_y[κ_j] and a^δ_y[κ_j] could be different, we propose a generalization of the traditional maximum entropy model in (6.1). Given the finite number of training data, we relax the equality constraints in (6.1) into inequality constraints, leading to the following formulation for learning from side information:

  max_p   (1/n) Σ_{i=1}^n H(p | x^1_i, x^2_i) − (1/(2λ)) Σ_y ||ǫ_y||²                              (6.2)
  s.t.    (1/n) Σ_{i=1}^n p(y | x^1_i, x^2_i) κ_j(x^1_i, x^2_i) ≥ a^δ_y[κ_j] − ǫ_{yj},   ∀ y, j

where ǫ_y = (ǫ_{y1}, . . . , ǫ_{ym})^⊤ and ||·|| is a norm that measures the length of the vector ǫ_y. The key features of the generalized maximum entropy model in (6.2) are:

• Replacing equality constraints with inequality ones. As a result, we have a^δ_y[κ_j] − ǫ_{yj} ≤ a^p_y[κ_j] ≤ a^δ_y[κ_j] + ǫ_{ȳj}. Note that although only a one-sided inequality is included in (6.2), the upper bound on a^p_y[κ_j] can easily be derived by using the relation a^p_y[κ_j] + a^p_ȳ[κ_j] = a^δ_y[κ_j] + a^δ_ȳ[κ_j].

• The positive slack variables ǫ are introduced to account for the difference between the two empirical means a^p_y[κ_j] and a^δ_y[κ_j]. A regularization term ||ǫ_y||²/(2λ) is added to the objective so that these variables are determined automatically.

We further justify the generalized maximum entropy model by showing that it is equivalent to the regularized logistic regression model.

Proposition 3.
When ||·|| = ||·||_2, the dual problem of (6.2) is equivalent to the regularized logistic regression model, i.e.,

  max_{w ∈ R^m}   (1/n) Σ_{i=1}^n ln p(y_i | x^1_i, x^2_i) − (λ/2) ||w||_2²                        (6.3)

or equivalently,

  max_{w ∈ R^m}   (1/2) w^⊤ a^o[k] − (λ/2) ||w||_2² − (1/n) Σ_{i=1}^n ln Σ_y exp( y w^⊤ k_i / 2 )   (6.4)

6.2.3 Estimating the Sufficient Statistics

In this section, we extend the framework of learning a probabilistic link prediction model to the case where the pairwise constraints are noisy, i.e., ỹ_i ≠ y_i for some pairs. The strategy we use in this section is to estimate the sufficient statistics a^δ_y[k] or a^o[k] from the noisy labels without having to know which labels are incorrect. We present an approach for estimating the sufficient statistics under certain assumptions. In order to estimate a^δ_y[k] in the case of noisy M&C link information, we make the following assumptions.

Assumption 1. We assume (1.a) Pr(Ỹ | X^1, X^2, Y) = Pr(Ỹ | Y), and (1.b) Pr(Ỹ = y | Y = y) = c_y is given.

In the above assumption, (1.a) states that Ỹ is conditionally independent of (X^1, X^2) given Y, and (1.b) assumes group-level knowledge about the noise in the pairwise constraints. With these two assumptions, the following theorem shows that it is possible to express the empirical mean a^δ_y[k] in terms of ã^δ_y[k], i.e., the empirical mean estimated from the noisy side information.

Theorem 8. Assuming (x_i, y_i, ỹ_i), i = 1, . . . , n, are i.i.d. samples, we have, with probability at least 1 − δ,

  || a^δ_y[k] − b^δ_y[k] ||_2 ≤ sqrt( (8 m R² / ((c_+ + c_− − 1)² n)) ln(4m/δ) ),   ∀ y

where

  b^δ_y[k] = ã^δ_y[k] / (c_y + c_ȳ − 1) − ((1 − c_ȳ) / (c_y + c_ȳ − 1)) · (1/n) Σ_{i=1}^n k(x^1_i, x^2_i)

The proof can be found in the appendix. As indicated by Theorem 8, under Assumption 1 we can approximate a^δ_y[k] by b^δ_y[k]. It is interesting to note that the convergence rate is O(1/[ |c_y + c_ȳ − 1| sqrt(n) ]), not O(1/sqrt(n)). Similar to Theorem 8, we have the following corollary to bound the difference between a^p_y[k] and b^δ_y[k].

Corollary 9. Assuming (x_i, y_i, ỹ_i), i = 1, . . . , n, are i.i.d. samples, we have, with probability at least 1 − δ,

  || a^p_y[k] − b^δ_y[k] ||_2 ≤ sqrt( (8 m R² / ((c_+ + c_− − 1)² n)) ln(4m/δ) ),   ∀ y

Given that b^δ_y[k] is an estimate of a^δ_y[k] and a^o[k] = a^δ_+[k] − a^δ_−[k], we can compute an estimate of a^o[k] under Assumption 1 by

  b^o[k] = b^δ_+[k] − b^δ_−[k] = (1/(c_+ + c_− − 1)) ã^o[k] − ((c_+ − c_−)/(c_+ + c_− − 1)) a[k]     (6.5)

Similar to Theorem 8, we have the following corollary to bound the difference between a^o[k] and b^o[k].

Corollary 10. Assuming (x_i, y_i, ỹ_i), i = 1, . . . , n, are i.i.d. samples, we have, with probability at least 1 − δ,

  || a^o[k] − b^o[k] ||_2 ≤ sqrt( (8 m R² / ((c_+ + c_− − 1)² n)) ln(4m/δ) )

With Theorem 8, Corollary 9, and Corollary 10, we finally arrive at the following generalized maximum entropy model for learning from noisily labeled data,

  p(y | x^1, x^2) = arg max_p   (1/n) Σ_{i=1}^n H(p | x^1_i, x^2_i) − (1/(2λ)) Σ_y ||ǫ_y||²          (6.6)
                    s.t.        (1/n) Σ_{i=1}^n p(y | x^1_i, x^2_i) k(x^1_i, x^2_i) ≥ b^δ_y[k] − ǫ_y,   ∀ y

or the following regularized logistic regression model for learning from noisily labeled data,

  w*_b = arg min_{w ∈ R^m}   (λ/2) ||w||_2² − (1/2) w^⊤ b^o[k] + (1/n) Σ_{i=1}^n ln Σ_y exp( y w^⊤ k_i / 2 )    (6.7)

6.2.4 Convergence Analysis

The conditional distribution p(y | x^1, x^2) resulting from (6.6) is given by

  p(y = 1 | x^1, x^2) = exp( w*_b^⊤ k(x^1, x^2) ) / ( 1 + exp( w*_b^⊤ k(x^1, x^2) ) )                (6.8)

Next, we show how the solution w is affected when a^δ_y[k] is replaced with b^δ_y[k], i.e., the empirical mean computed from the noisy pairwise constraints.

Theorem 11 (Lemma 6 of [78]). Let w*_a be the solution to (6.4) with a^o[k], and let w*_b be the solution to (6.7) with b^o[k].
We have 1 o ∗ ∗ wa − wb 2 ≤ a [k] − bo [k] 2 λ The proof for the theorem as well as the following theorems can be found in the appendix. Combining Theorem 11 and Corollary 10, we have the following theorem showing the impact of replacing ao [k] with bo [k]. Theorem 12. Let p(y|x1, x2 ) be the conditional model derived from noisy pairwise constraints using (6.6) using · 2 , and p(y|x1 , x2 ) be the conditional model derived from the perfect pairwise constraints using (6.2). Under Assumption 1, with probability 1 − δ, for any x1 , x2 and y, we have p(y|x1 , x2 ) − p(y|x1 , x2 ) ≤ mR2 λc 8 ln n 4m δ where c = |c+ + c− − 1|. As indicated by the above theorem, the difference between two conditional models will √ be reduced at the rate of 1/[|c+ + c− − 1| n]. Finally, since our algorithm depends on the knowledge of c+ and c− , we further analyze the behavior of the proposed algorithm with inaccurate estimation of c+ and c− . We denote by c+ and c− the estimates of c+ and c− , respectively. We define p(y|x1 , x2 ) the conditional model derived from the noisy side information using c+ and c− . We measure the difference (c+ , c− ) and their estimates by ∆ = max(|c+ − c+ |, |c− − c− |). The next theorem shows the difference between p(y|x1 , x2 ) and p(y|x1, x2 ). Theorem 13. Let · = · 2 . Let p(y|x1, x2 ) be the conditional model derived from noisy pairwise constraints with c+ and c− , and p(y|x1, x2 ) be the conditional model derived from the perfect pairwise constraints using (6.2). Assume |c+ + c− − 1| ≥ ρ with ρ ≥ 0 and ∆ ≤ ρ/4. Under Assumption 1, with probability 1 − δ, for any x1 , x2 and y, we have p(y|x1 , x2 ) − p(y|x1, x2 ) ≤ mR2 λc 69 8 4m mR2 ∆(8 + 2c) ln + n δ λρ2 We finally mention that w can serve as weights for combining kernels, i.e., j wj κj . However, the combined kernel may not be positive semi-definite, because some weights wj are negative. To ensure the combined kernel to be valid, we introduce one more constraint wj ≥ 0, j = 1, . . . , m to the optimization problem in (6.6). The optimization problems are solved by Nesterov method [70]. 6.3 Experiments We evaluate the proposed algorithm by clustering the linked documents. We first present the experiments on clustering linked documents with noisy pairwise constraints derived from the link information. We then examine the behavior of the proposed algorithm in more details. Before presenting the experimental results, we first introduce the data sets, baselines and evaluation metric. Data Sets We select three linked document data sets, i.e. Cora, Citeseer, Terrorist Attacks(TeAt) for our evalution. We choose these data sets because they have relatively low noise (20% ∼ 35%) in their pairwise constraints derived from links that satisfies the condition c+ + c− − 1 > 0 in Assumption 1. They were processed by the research group of Lise Getoor (see www.cs.umd.edu/projects/linqs/projects/lbc/). Each data set contains (1) a set of documents described by binary vectors indicating the presence and absence of words from a dictionary, (2) links among documents (e.g., citations between research articles), and (3) the class assignment for each document. The statistics of these three data sets are summarized in Table 1. In the experiments, the attributes for each document are normalized by first dividing the sum of the attributes and then taking the square root [47]. Evaluation In order to evaluate the proposed algorithm, we apply it to kernel learning as described at the end of Section 3. 
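Before describing the kernels used in the experiments, it may help to see the estimation-and-fitting step of Section 6.2 written out. The sketch below (Python; all names are hypothetical, and a simple projected-gradient loop stands in for the Nesterov method [70] used in our implementation) computes the corrected statistic b^o[k] of (6.5) and then fits the non-negative weights w by minimizing the objective of (6.7):

import numpy as np

def corrected_statistic(K, y_tilde, c_pos, c_neg):
    # b^o[k] in Eq. (6.5): an estimate of a^o[k] built from the noisy labels,
    # given c_y = Pr(noisy label = y | true label = y); assumes c_pos + c_neg != 1.
    n = len(y_tilde)
    a_o_tilde = (y_tilde[:, None] * K).sum(axis=0) / n
    a_all = K.sum(axis=0) / n
    c = c_pos + c_neg - 1.0
    return a_o_tilde / c - (c_pos - c_neg) / c * a_all

def fit_weights(K, b_o, lam=10.0, lr=0.1, iters=500):
    # Minimize  lam/2 ||w||^2 - w^T b_o / 2 + (1/n) sum_i ln(2 cosh(w^T k_i / 2)),
    # i.e. the objective in (6.7); w is kept non-negative so that the combined
    # kernel sum_j w_j kappa_j remains positive semi-definite.
    n, m = K.shape
    w = np.zeros(m)
    for _ in range(iters):
        s = np.tanh(K @ w / 2.0)
        grad = lam * w - b_o / 2.0 + (K * s[:, None]).sum(axis=0) / (2.0 * n)
        w = np.maximum(w - lr * grad, 0.0)   # projected gradient step onto w >= 0
    return w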
Specifically, for each attribute j, we construct a linear kernel matrix κ_j(x^1, x^2) = x^1[j] x^2[j] for paired documents (x^1, x^2), where x[j] is the jth normalized attribute of document x. The proposed algorithm is applied to learn the combination of these multiple kernel matrices from the noisy pairwise constraints derived from links. The ℓ2 norm is used in the proposed algorithm. Given the learned kernel matrix, a spectral clustering algorithm [84] is applied for document clustering. We evaluate the clustering result by comparing it to the class assignment information provided in each data set, using normalized mutual information (NMI) [107] as the evaluation metric. For all the experiments, we set γ in the proposed algorithm to 0.01/c², where c = c_+ + c_− − 1.

Table 6.2. Statistics of data sets

  name        #examples   #words   #links   #classes
  Cora           2708       1433     5429        7
  Citeseer       3312       3703     4732        6
  TeAt           1293        106      571        6

Baseline. We compare the proposed algorithm to the following metric/kernel learning algorithms: (a) GDM, the global distance metric learning algorithm [102]; (b) DCA, the discriminative component analysis algorithm [43]; (c) ITML, the information-theoretic metric learning algorithm proposed in [21]; and (d) SKL, the spectral kernel learning algorithm [44]. For a fair comparison, the distance metric A learned by the metric learning algorithms is used to construct a kernel matrix K = XAX^⊤, where X is the data matrix, and the same spectral clustering algorithm is applied to K for document clustering. We also evaluate the proposed algorithm against the metric pairwise constrained K-means clustering algorithm [8], referred to as MPCK. To improve the robustness of MPCK to noisy constraints, we follow [61] and weight the noisy positive constraints and the noisy negative constraints by c_+ and c_−, respectively, to reduce their impact on the clustering results. As the reference point, we compute a linear kernel for both labeled and unlabeled examples, without using the provided pairwise constraints; we refer to this baseline as base. Finally, we refer to the proposed generalized maximum entropy model for learning from noisy side information as GMEns, and to the generalized maximum entropy model that ignores the noise in the side information as GMEs. All the experiments are run five times, and the clustering accuracy averaged over the five runs is reported in our study.

6.3.1 Experiments with Real Noise

We conduct document clustering experiments with the noisy pairwise constraints derived from the links between documents. In particular, we use all the linked document pairs as positive constraints; the same number of document pairs without links are sampled to construct the negative constraints. To obtain the noise levels of the pairwise constraints, we sample a total of 100 pairwise constraints and estimate c_+ and c_− based on the correctness of the sampled constraints. These sampled pairwise constraints with their true labels are also used by the other baseline methods for computing distance metrics and kernel matrices. Figures 6.1(a), 6.2(a) and 6.3(a) show the clustering accuracy measured in NMI for the three data sets; the mean values of the estimated c_+ and c_− are listed under each figure. We observe that, given the noisy pairwise constraints, all the algorithms except ITML perform significantly worse than the reference method base on at least one data set.
In contrast, the proposed algorithm for learning from noisy pairwise constraints outperforms the reference method significantly on all three data sets. We thus conclude that the proposed algorithm is overall more robust to noise in the side information.

6.3.2 Experiments with Synthetic Noise

In this section, we examine the robustness of the proposed algorithm to (a) different noise levels in synthetically generated pairwise constraints, and (b) the estimated values of c_+ and c_−.

Robustness to the noise. We first sample 10,000 pairwise constraints from each data set, with 5,000 positive constraints and 5,000 negative constraints. Random noise is introduced into the synthetic constraints by randomly flipping the label of a pair with probability p%, where p% specifies the noise level. We set c_+ and c_− to 1 − p%, under the assumption that the knowledge of the noise level is perfect. To examine the impact of noisy positive constraints and noisy negative constraints separately, for each data set and a given noise level p%, we conduct two experiments: one with corrupted positive constraints but perfect negative constraints, and the other with corrupted negative constraints but perfect positive constraints. Figures 6.1(c), 6.2(c) and 6.3(c) compare the clustering results of GMEns and GMEs with the noise level in the synthetic pairwise constraints varied from 10% to 90% on the three data sets. We observe that GMEns, the generalized maximum entropy model for noisy side information, is significantly more robust to the noise in the pairwise constraints than GMEs, which does not take the noise in the side information into account. We also observe that the noisy positive constraints have a significantly higher adverse impact on the clustering results than the noisy negative constraints.

[Figure 6.1. Experimental results on the Cora data set: (a) NMI on real noisy constraints for base, GMEns, MPCK, GDM, DCA, ITML, and SKL; (b) NMI on synthetic noisy constraints at low, medium, and high noise levels; (c) robustness of GMEns and GMEs to 10%–90% noise in the positive and in the negative constraints; (d) sensitivity of GMEns to perturbations (±e) of c_+ and c_− for e = 1% and e = 10%.]

[Figure 6.2. Experimental results on the Citeseer data set, with the same four panels as Figure 6.1.]

Sensitivity to c_+, c_−. We use the same set of 10,000 randomly sampled pairwise constraints for this study and add the same noise level to both the positive and the negative constraints. To investigate the sensitivity to c_+ and c_−, instead of setting them to 1 − p%, we perturb these parameters by setting them to (1 − p%)(1 ± e). Figures 6.1(d), 6.2(d) and 6.3(d) show the results of GMEns on the three data sets for four noise levels p% = 10%–40% and e = 1%, 10% (the corruption protocol used in these studies is sketched below).
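The following Python sketch (hypothetical names) implements this corruption protocol: it flips the label of each eligible constraint with probability p, where eligibility selects the positive constraints, the negative constraints, or both.

import random

def corrupt_constraints(pairs, p, which="both", seed=0):
    # pairs: list of (i, j, y) with correct labels y in {+1, -1}.
    # p: flip probability as a fraction (e.g. 0.4 for a 40% noise level).
    # which: 'both', 'positive', or 'negative' controls which constraints are corrupted.
    rng = random.Random(seed)
    noisy = []
    for i, j, y in pairs:
        eligible = (which == "both"
                    or (which == "positive" and y == +1)
                    or (which == "negative" and y == -1))
        y_tilde = -y if (eligible and rng.random() < p) else y
        noisy.append((i, j, y_tilde))
    return noisy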
We observe that GMEns is overall robust to modest perturbations, making the proposed algorithm applicable even when the assumed noise levels are inaccurate.

[Figure 6.3. Experimental results on the Terrorist Attack data set, with the same four panels as Figure 6.1: (a) real noisy constraints, (b) synthetic noisy constraints, (c) robustness to noise, (d) sensitivity to c_+, c_−.]

Finally, we compare the proposed algorithm to the baselines on the synthetic noisy constraints by varying the level of noise. Because some of the baseline algorithms are time consuming, one thousand pairs are sampled for the positive constraints and one thousand for the negative constraints. We report the results for noise added to the positive constraints, owing to its stronger effect on performance. Figures 6.1(b), 6.2(b) and 6.3(b) show the clustering results of all algorithms at three noise levels, low (10%), medium (40%), and high (70%), on the three data sets. We observe that the proposed algorithm outperforms all the baseline algorithms in all the cases.

6.4 Conclusions

In this chapter, we have proposed a generalized maximum entropy model for learning from noisy M&C link information, and applied it to learning optimal weights for combining multiple kernels for clustering. Our theoretical analysis shows that the model trained from the noisy link information converges to the model trained from the perfect link information. Extensive experimental results verify the efficacy of the proposed model.

CHAPTER 7

Clustering for Noisy M&C Linked Data: A Probabilistic Link Prediction Model (II)

In this chapter, we present an alternative approach for estimating the feature statistics under different assumptions about the data and the noise. We begin with a description of the notation.

7.1 Notations

Let D = {x_j ∈ X} be a collection of observed data, and let P = {(x^1_i, x^2_i, ỹ_i) : x^1_i, x^2_i ∈ D, ỹ_i ∈ {1, −1}, i = 1, · · ·, n} be a collection of n pairwise constraints, where ỹ_i is the noisy label given to the pair (x^1_i, x^2_i), indicating whether the pair is a positive constraint (ỹ_i = 1) or a negative constraint (ỹ_i = −1). Let y_i denote the underlying true label for the pair (x^1_i, x^2_i), which is unknown in our setting. Let k_j(·, ·) : X × X → R, j = 1, · · ·, m, be a set of m base kernels. Our goal is to learn a combination of the multiple kernels, k(·, ·) = Σ_{j=1}^m w_j k_j(·, ·), from the noisy must-and-cannot links, and to use the combined kernel for clustering. A few more words about the notation used in this study: for any pair (x^1, x^2), we use k(x^1, x^2) = (k_1(x^1, x^2), · · ·, k_m(x^1, x^2))^⊤ to denote the similarities between x^1 and x^2 under all m kernel functions, and we use the shorthand k_i for k(x^1_i, x^2_i). We write ȳ = −y for the complement of y, and use the subscripts + and − for the conditions y = 1 and y = −1, respectively.
77 7.2 7.2.1 Learning from Noisy M&C Linked Data A Probabilistic Model To learn the combination of multiple kernels, we construct a conditional model Pr(y|x1 , x2 ) for computing the pairwise classification probability, where y ∈ {1, −1} is the underlying true label to indicate whether the pair (x1 , x2 ) belongs to the same class or not. Given multiple kernels kj (·, ·), j = 1, . . . , m, we formulate Pr(y|x1, x2 ) by Pr(y|x1 , x2 ) = exp(yw⊤k(x1 , x2 )/2) 1 = 1 + exp(−yw⊤ k(x1 , x2 )) exp(−w⊤ k(x1 , x2 )/2) + exp(w⊤ k(x1 , x2 )/2) where w = (w1 , · · · , wm )⊤ ∈ Rm are the non-negative weights that need to be learned. + We constrain w to be non-negative to ensure that the resulting combined kernel is positive semi-definite. By optimizing the log-likelihood of the true labels for the observed pairs, we obtain the optimal weights for kernel combination. More specifically, we need to solve the following optimization problem 1 max w∈Rm n + n i=1 ln p(yi |x1 , x2 ) − i i λ w 2 2 2 (7.1) or equivalently, max w∈Rm + 1 where ao [k] = n 1 ⊤ o λ 1 w a [k] − w 2− 2 2 2 n 1 2 i yi k(xi , xi ) n ln i=1 exp y 1 ⊤ yw ki 2 (7.2) is what we call sufficient statistics. The main challenge is that the true labels for the observed pairs are unknown, making it difficult to apply the maximum likelihood estimation. Below, we present an approach that effectively approximates the sufficient statistics using the noisy labels. 7.2.2 Estimating Sufficient Statistics In this section, we present an alternative approach for estimating the sufficient statistics ao [k] under a different assumption from that in Chapter 6. We first present the assumption. 78 Assumption 2. We assume (2.a) Pr(|X1 , X2 |Y, Y) = Pr(X1 , X2 |Y), (2.b) Pr(Y = y|Y = y) = dy and Pr(Y = y) = py is given. Compared to Assumption 1, we can see that Assumption 2 assumes that the data pair X1 , X2 is independent of the noisy label Y given the true label Y, which is essentially equivalent to assumption (1.a). Nevertheless, we assume different knowledge about the noise, i.e., assumption (2.b), which is different from assumption (1.b). Comparing the two assumptions, assumption 1 may be advantageous if we know that how much noise is added on top the true labels, however, the conditional probabilities in assumption (2.b) may be estimated more accurately by a sampling approach, which samples a part of examples among noisily labeled pairs and query their true labels. Given the knowledge of py , we rewrite the expectation Eo [k] by Eo [k] = EX1 ,X2 ,Y Yk(X1 , X2 ) = EY YEX1 ,X2 |Y k(X1 , X2 ) = p+ Ec [k] − p− Ec [k] − + where Ec [k] is defined in Table 1. Estimating Eo [k] is therefore reduced to estimating Ec [k]. y y Given the independence assumption, we have EX1 ,X2 |Y=y [k(X1 , X2 )] =EX1 ,X2 |Y=y [k(X1 , X2 )] Pr(Y = y|Y = y)+ EX1 ,X2 |Y=¯[k(X1 , X2 )] Pr(Y = y |Y = y) ¯ y Writing in the matrix form, we have Ec [k] + Ec [k] − ⊤ ⊤ =B Ec [k]⊤ + c [k]⊤ E− (7.3) 1 − d+ d+ . Equa1 − d− d− tion (7.3) allows us to estimate Ec [k] (and therefore Eo [k]) from Ec [k], a quantity that can be y y where Ec [k] is defined in Table 6.1, and B is defined as B = y computed from the noisy labels. In particular, we approximate Ec [k] by its sample average y ac [k] which is computed by y ac [k] = y 1 2 i δ(yi , y)k(xi , xi ) i δ(yi , y) 79 By replacing Ec [k] with ac [k] in (7.3), we obtain the estimation for Ec [k], denoted by bc [k], y y y y by solving the following least square problem min B bc [k] y bc [k]⊤ + bc [k]⊤ − −A (7.4) F ac [k]⊤ + . 
To obtain a more robust estimation of Ec [k], we note that y ac [k]⊤ − c y py Ey [k] = EX [k(X)] = E[k], and therefore add the following constraint for the esti- where A = mator bc [k] y py bc [k] y y 1 = n n k(x1 , x2 ) = a[k] i i (7.5) i=1 Combining Equation (7.4) and (7.5), we have the following constrained least square problem for bc [k], an estimator of Ec [k] y y min B bc [k] y bc [k]⊤ + bc [k]⊤ − −A , py bc [k] = a[k] y s.t. (7.6) y F It can be shown that the solution for bc [k] is given by y bc [k]⊤ + c [k]⊤ b− = (B⊤ B)−1 pa[k]⊤ pp⊤ (B⊤ B)−1 + I − ⊤ ⊤ −1 B⊤ A ⊤ (B⊤ B)−1 p p p (B B) p (7.7) where p = (p+ , p− )⊤ . Note that the solution in (7.7) requires B to be non-singular, implying d+ + d− = 1. Given Ec [k] is estimated by bc [k], we have Eo [k] estimated as follows y y Eo [k] ≈ bo [k] = p+ bc [k] − p− bc [k] + − leading to the following maximizing the approximate log-likelihood of the true labels for the observed pairs max w∈Rm + 7.2.3 λ 1 1 ⊤ o w b [k] − w 2− 2 2 2 n ln i exp y 1 ⊤ yw ki 2 (7.8) Convergence Analysis In this section, we present the convergence analysis showing that the combination weights learned by the proposed approach converge to the optimal ones learned from perfectly labeled 80 pairs. All the proofs for the lemmas and theorems presented in this section are included in supplementary material. Comparing (7.2) with (7.8), we see that the only difference between the two optimization problems is that ao [k] in (7.2) is replaced with bo [k] in (7.8). The following lemma showing the bounds for w when replacing ao [k] with bo [k]. ∗ ∗ Lemma 14 ( Lemma 6[78]). Let wa be the solution to (7.2) with ao [k], and wb be the solution to (7.8) with bo [k]. We have 1 o ∗ ∗ a [k] − bo [k] 2 wa − wb 2 ≤ λ Next, we try to bound the difference between ao [k] and bo [k]. Note that ao [k] − bo [k] 2 ≤ ao [k] − Eo [k] 2 + bo [k] − Eo [k] 2 (7.9) The following two lemmas allow us to bound the two terms in (7.9), respectively. Lemma 15. With the bounded kernel function kj (x1 , x2 ), i.e., |kj (x1 , x2 )| ≤ R, j = 1, . . . , m, the following inequality holds with probability at least 1 − δ for any δ > 0 ao [k] − Eo [k] 2 ≤ 2mR2 ln n 2m δ Lemma 16. With bounded kernel function kj (x1 , x2 ), i.e. |kj (x1 , x2 )| ≤ R, j = 1, . . . , m, and significant large number n+ of pairs labeled as positive and large number n− of pairs labeled as negative, i.e., there exists some positive constant ρ > 0 such that min(n+ /n, n− /n) ≥ ρ. Under the independence assumption, the following inequality holds with probability at least 1 − δ for any δ > 0 bo [k] − Eo [k] 2 ≤ C 2mR2 ln n 6m δ with C defined by C= ˆ ˆ ˆ 4d 4 32(d6 d + d7 ) 4κ2 32(1 + κ3 ) ≤ + + ρ2 σmin ρ2 d8 p 2 p 2 d4 2 2 where σmax , σmin are the maximum and minimum eigenvalue of B⊤ B, κ = σmax /σmin , ˆ d = d+ + d− − 1, and d = 1 + |d+ − d− |. 81 Combining the results in Lemma 1, Lemma 2 and Lemma 3, we have the following theorem. Theorem 17. With the same conditions as in Lemma 3, the following inequality holds with probability at least 1 − δ for any δ > 0 1 ∗ ∗ wa − wb 2 ≤ λ 2mR2 n C ln 12m + δ ln 4m δ ∗ ∗ Theorem 4 indicates wb , the optimal solution to problem (7.8) converges to wa , the optimal solution to problem (7.2), as the number of the pairs n approaches infinity. It is interesting to observe that the bound is proportional to C, which is inversely proportional to |d| that is related to noisy level. Hence, the larger the |d|, the better the approximate solution will be. This will be further validated by our experiments. 
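Returning to the estimator defined in (7.6), a minimal numerical sketch (Python, hypothetical names) is given below. Rather than using the closed form (7.7), it solves the equality-constrained least-squares problem column by column through its KKT system, which is equivalent under the assumption that B is non-singular; the per-column decomposition is possible because the constraint in (7.5) couples only the two entries of each column.

import numpy as np

def estimate_bc(A_tilde, a_all, B, p):
    # A_tilde: (2, m) array whose rows are the noisy conditional averages for y = +1 and y = -1.
    # a_all:   (m,) array, the unconditional average a[k].
    # B:       2x2 matrix built from d_+ and d_-;  p = (p_+, p_-).
    # Returns a (2, m) array whose rows estimate E^c_+[k] and E^c_-[k].
    B = np.asarray(B, dtype=float)
    p = np.asarray(p, dtype=float)
    m = A_tilde.shape[1]
    X = np.zeros((2, m))
    BtB, Bt = B.T @ B, B.T
    for j in range(m):
        # KKT system of:  min_x ||B x - A_tilde[:, j]||^2  s.t.  p^T x = a_all[j]
        KKT = np.block([[2 * BtB, p[:, None]], [p[None, :], np.zeros((1, 1))]])
        rhs = np.concatenate([2 * Bt @ A_tilde[:, j], [a_all[j]]])
        X[:, j] = np.linalg.solve(KKT, rhs)[:2]
    return X

def estimate_bo(bc, p):
    # b^o[k] = p_+ b^c_+[k] - p_- b^c_-[k], the estimate of the sufficient statistic a^o[k].
    return p[0] * bc[0] - p[1] * bc[1]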
∗ Our next result is regarding the difference between the estimated solution wb and the optimal solution w∗ obtained by solving (7.1) with an infinite number of perfectly labeled pairs. Theorem 18. With the same conditions as in Lemma 3, the following inequality holds with probability at least 1 − δ for any δ > 0 ∗ wb − w∗ 1 2 ≤λ 2mR2 n 24m + C ln δ 8m ln δ + √ 4ρ mR √ λ n 1+ 1 4 ln 2 δ Finally, we note that our analysis relies on accurate estimate of p, B. In supplementary material, we provide the error bound with inaccurate estimate of p, B. 7.3 Experiments In this section, we validate the proposed approach by clustering on linked documents using the combined kernels learned from noisy pairwise constraints. We conduct two sets of experiments: in the first set of experiments, we synthesize the noisy pairwise constraints by random sampling, and in the second setup, we derive the pairwise constraints from the links among documents. We use the same cora and citeseer data sets. The baselines for comparison are described as follows. 82 Data sets Two paper citation data sets, Cora and Citeseer, processed by Lise Getoor’s research group, are used in our study. Each paper in the data sets is described by a binary vector indicating the presence and absence of corresponding words, and is assigned to one of the given classes. Besides the attributes and class assignments of papers, the citations between papers are also available in these two datasets. The statistics of the two data sets are summarized in Table 7.1. Following [47], we normalize the attributes of each document by first dividing the sum of the attributes and then taking the square root. Baselines We compare the proposed algorithm to the following two algorithms that learn kernel matrices from pairwise constraints: (1) NPK, a state-of-the-art non-parametric kernel learning algorithm from pairwise constraints [42], and (2) SKL, a spectral kernel learning algorithm [44]. We did not compare to the kernel learning algorithms [50, 52] from pairwise constraints because they are outperformed by NPK according to [42], and to algorithm in [57] because it requires solving a SDP problem, which does not scale well for large data sets. Besides kernel learning algorithms, we also compare the proposed algorithm to three representative methods for distance metric learning from pairwise constraints: (1) GDM, a global distance metric learning algorithm [102], (2) DCA, a discriminative component analysis algorithm [43], and (3) ITML, a information-theoretic based metric learning algorithm [21]. We did not compare to the other metric learning algorithms because they either are not as effective as these three metric learning algorithms (e.g. [31]) or require labeled data for training instead of pairwise constraints (e.g., [100]). The learned metric/kernel matrix is used by a spectral clustering algorithm [84] to cluster documents. Since these learning algorithms do not address noisy pairwise constraints, the comparison will allow us to verify if the proposed algorithm is robust to the noise in pairwise constraints. The last two baselines are constrained clustering algorithms: MPCK [8], a state-of-the-art algorithm for constrained clustering, and LCVQE [75], a recently developed k-means algorithm with noisy constraints. To improve the robustness of MPCK to noisy constraints, we follow [61] 83 Table 7.1. 
Statistics of Data sets data Cora Citeseer #papers #attr #citations #classes 2708 3312 1433 3703 5429 4591 7 6 by weighting the noisy pairwise constraints by probability dy . Since MPCK and LCVQE are claimed to be robust to noisy constraints, the comparison will allow us to see if the proposed approach is effective for noisy pairwise constraints. Finally, we include the baseline that combines kernels with equal weights, referred to as LK. For the proposed approach, each candidate kernel kj is the linear kernel on the j th attribute. Given the learned combination of kernel matrices, the same spectral clustering algorithm will be used to cluster documents. In all the experiments, we set the regularization parameter λ as λ = 0.01/n. We refer to the proposed approach as LKCnpc. To evaluate the clustering results, we compute the normalized mutual information [107] by comparing the cluster assignments predicted by the clustering algorithms to the class assignments given in the data set. 7.3.1 Experiments with Real Noise In this subsection, we show experimental results on noisy pairwise constraints derived from citation information. In particular, a positive pairwise constraint is created if two papers are linked by a citation, and a negative pairwise constraint is created when two papers do not cite each other. Since the number of unlinked paper pairs is much larger than that of linked pairs, for computational efficiency, we randomly sample the same number of unlinked paper pairs as that of linked paper pairs. As a result, we have on average about 20%-25% pairwise constraints being incorrect. To obtain py , dy we randomly sample 1% of document pairs and label them according to their class assignments, and compute py , dy from the noisy labels and the correct labels for these sampled pairs. For fair comparison, the correct labels of these sampled pairs are used by the baselines. 84 Cora Citeseer 0.3 LKCnpc NPK SKL GDM DCA ITML LCVQE MPCK 0.2 0.1 0 NMI NMI 0.3 LKCnpc NPK SKL GDM DCA ITML LCVQE MPCK 0.2 0.1 0 (a) Cora (b) Citeseer Figure 7.1. Comparison of clustering performance on real noise. Table 7.2. Clustering performance on unseen data data Cora Citeseer observed data unseen data 0.3326±0.013 0.3028±0.012 0.2822±0.012 0.2502±0.020 Figure 7.1(a) and 7.1(b) show the clustering results on the two data sets for the proposed approach and baseline methods. We observe that most baseline methods, except for GDM, LCVQE, are unable to outperform the reference method LK. In particular, we observe that MPCK, a state-of-the-art constrained clustering algorithm, performs significantly worse than LK, indicating the importance of addressing noisy pairwise constraints in constrained clustering. Overall, we observe that the proposed approach outperforms all the baseline methods on both data sets. We thus conclude that the proposed approach is effective for learning from noisy pairwise constraints. Finally, we briefly show the clustering results applied on the unseen data with learned combinations of multiple kernels from the observed data. We randomly choose 80% papers as observed data and derive noisy pairwise constraints for these observed data from citations, and the remaining 20% papers as unseen data for testing the learned model. The NMI averaged over 5 trials for the learned model on the observed data and unseen data for the two data sets is shown in Table 7.2. We can see that the combination weights learned from 85 observed data also works well on unseen data. 
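To make concrete how the learned weights are used at clustering time, including on unseen documents as in Table 7.2, the following sketch builds the combined kernel from the per-attribute linear kernels and feeds it to a spectral clustering routine. All names are hypothetical, and scikit-learn's spectral clustering is used here only as a stand-in for the algorithm of [84].

import numpy as np
from sklearn.cluster import SpectralClustering

def combined_kernel(X, w):
    # X: (n_docs, m) matrix of normalized, non-negative attributes; w: (m,) learned weights.
    # Each base kernel is k_j(x1, x2) = x1[j] * x2[j], so the weighted combination
    # is the single weighted linear kernel X diag(w) X^T.
    return (X * w) @ X.T

def cluster_documents(X, w, n_classes, seed=0):
    K = combined_kernel(X, w)   # non-negative, hence a valid precomputed affinity
    sc = SpectralClustering(n_clusters=n_classes, affinity="precomputed", random_state=seed)
    return sc.fit_predict(K)    # cluster labels, to be compared with the classes via NMI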
7.3.2 Experiments with Synthetic Noise In this subsection, we show experimental results using synthesized noisy pairwise constraints. We first examine the robustness of the proposed approach to the noisy pairwise constraints by varying the level of noise. In particular, for each data set, we randomly sample 10, 000 pairs in the same class and 10, 000 pairs in different classes. To obtain the noisy label yi for each pair, we randomly flip the correct label yi with a probability of p%, where p% specifies the noise level. These 20, 000 document pairs together with their noisy labels are used to train both the proposed approach and the eight baseline methods for document clustering. The values of py and dy are computed from the correct labels and noisy labels. We choose four noise levels, p% = 20%, 40%, 60%, 80% in our study. Figure 7.2(a) and 7.2(b) compare the clustering performance of the proposed approach to that of the baseline methods on the two data sets. The results are averaged over five independent experiments. For the convenience of comparison, we also include the clustering result of the proposed approach but without modeling the noise in the pairwise constraints, i.e. simply using ao [k] in place of ao [k], as referred to LKCpc. The comparison to LKCpc allows us to see how the noise in pairwise constraints affects the clustering results. It is clear that the proposed approach LKCnpc is more resilient to the noisy pairwise constraints than the baseline methods. In particular, we observe that even the noise level is relatively low(20%), most of the baseline methods, except for LKCpc, GDM, are unable to outperform the reference method LK. As the noise level increases, except for ITML, the performance of all the other baseline methods degrade significantly and become much worse than that of LK when the noise level exceeds 50%. For all the cases, the proposed approach yields the best performance among all the methods in comparison. More interestingly, on both data sets, we observe that the clustering performance of the proposed approach increases when the noise level goes beyond or below 50%. This result seems surprising at the first 86 glance, however, by noting that our approach relies on the knowledge of dy = Pr(Y = y|Y = y) related to noise level, when the noise level is above 50%, the simple approach by flipping labels of pairwise constraints to their opposite can obtain pairwise constraints with smaller errors, leading to better clustering performance, and so of course does the proposed approach. Also, this result is consistent with our theoretical analysis in Theorem 4, because the difference between the model learned from noisily labeled pairs and the model learned from perfectly labeled pairs decreases as |d| = |d1 + d0 − 1| increases. The corresponding |d| for the four noise levels p% = 20%, 40%, 60%, 80% are 0.6, 0.2, 0.2, 0.6. Cora Citeseer NMI 0.3 0.2 0.1 0 LKCnpc LKCpc NPK SKL GDM DCA ITML LCVQE MPCK 0.3 NMI LKCnpc LKCpc NPK SKL GDM DCA ITML LCVQE MPCK 0.4 0.2 0.1 0 20% 40% 60% 80% percentage of noise (a) Cora 20% 40% 60% 80% percentage of noise (b) Citeseer Figure 7.2. Comparison of clustering performance with different noise levels. Second, we show the proposed approach is not sensitive to the estimated values of py and dy . With 20,000 synthetic pairwise constraints, and p% = 20% noise, we randomly perturb py and dy by either adding or subtracting 10 percentage from their “true” values computed from the ground truth. 
We found the clustering performance is almost the same, as shown in Table 7.3, where py , dy refer to their “true” values, i.e. py = i δ(yi , y)δ(yi, y)/ i δ(yi , y), i δ(yi , y)/n, dy = and py , dy refer to randomly perturbed values p+ = (1 ± ˆ ˆ ˆ ˆ 10%)p+ , dy = (1 ± 10%)dy , and the results for perturbed values are averaged over 10 random trials. Third, we show the experimental results to verify the convergence behavior in the number 87 Citeseer Cora 0.3 NMI NMI 0.3 0.2 0.1 0.1 0 0.2 MKL−noisy labels MKL−perfect labels 100 0 500 1000 5000 10000 #of pairwise constraints (a) Cora MKL−noisy labels MKL−perfect labels 100 500 1000 5000 10000 #of pairwise constraints (b) Citeseer Figure 7.3. Convergence behavior of the proposed approach Table 7.3. Clustering performance with perturbed (py , dy ) data Cora Citeseer (py , dy ) 0.3338 0.3120 (ˆy , dy ) p ˆ 0.3279±0.006 0.3011±0.007 of noisily labeled pairs n as presented in the paper. We sample the equal number of pairs in the same class and pairs in different classes. The noisy label for these pairs are obtained by using the same random flipping process. We fix the noisy level to p% = 20%. We compare the clustering performance of the proposed approach with noisy labels to the same approach but with perfect labels by increasing the total number of pairwise constraints from 200, 1000, 2000, 10, 000 to 20, 000. The results in Figure 7.3(a) and Figure 7.3(b) show that the difference decreases as the number of constraints increases. We also noticed that the iterative method (e.g., Newton method) for solving the related optimization problem converges very quickly in one hundred iterations on the two data sets with 20,000 paris. Finally, we mention that the two approaches proposed in this Chapter and last Chapter yield similar empirical results. 88 7.4 Conclusions In this chapter, we have presented an alternative approach for estimating the sufficient statistics in generalized maximum entropy model under different assumptions from Chapter 6. Our theoretical analysis shows that the model trained from the noisy link information converges to the model trained from the perfect link information. Extensive experimental results verify the efficacy of the proposed approach. 89 CHAPTER 8 Conclusions and Future Work In previous chapters, we have presented a link-based model for community detection, a discriminative approach for combining link and content information for community detection for networked data, and a probabilistic link prediction model for clustering the noisy M&C linked data. We summarize the contributions of the thesis as follows. • In Chapter 4, we present a probabilistic link model for detecting communities for directed networks. We introduce popularity and productivity to model the differences of nodes in receiving links and the differences of nodes in producing links, respectively. These two factors can explain the noisy connections in terms of community detection, because the connections between nodes may be not due to their common or similar community, but rather because of their high popularities tending to receive links or high productivities tending to produce links. These two factors can also explain the preferential attachment phenomenon in the real world, i.e., the power law degree distribution. Using these two factors, together with community memberships, we define a generative model to generate the links. 
The proposed link model is an unified framework which can include several previous models in degenerated cases, and also can derive us new link models. It is the first time that the power law degree distribution is modeled for community detection. • In Chapter 5, we present a discriminative approach for combining link and content information for community detection. Different from previous approaches that usually put together a generative link model and a generative content model and therefore are vulnerable to irrelevant or noisy attributes, the proposed approach uses a discriminative model on the content to fit the community memberships, which thus has 90 discriminative power to identify relevant attributes from irrelevant ones. The proposed approach for combining our link model and our content model yield significantly better empirical performance for community detection. It is the first discriminative approach for combining link and content for community detection. • In Chapter 6 and Chapter 7, we present approaches for clustering noisy must-andcannot linked data. To handle the noisy M&C links, we formulate the problem into learning from noisily labeled data. We propose a generalized maximum entropy model to learn the conditional model that given a pair of data how likely they are connected by a must link (i.e., how likely they belong to the same cluster). The critical problem of learning the conditional model is to compute the sufficient statistics that depend on the true labels which are unknown. We propose two different approaches under different assumptions to estimate the sufficient statistics and we prove the convergence results , i.e., the model learned from the noisy labels converges to the model learned from the true labels under appropriate assumptions about the data and the noise. It is the first work that proves the convergence for learning from noisily labeled data. The studies conducted during the course of this dissertation point to several directions for future research: • Algorithms for Detecting Communities and Their Evolutions in Dynamic Networks Since networks are dynamic in the real world, e.g., nodes could leave the network and new nodes could join in the network, nodes could change their communities, and communities could disappear and new communities would emerge, it is important to model the dynamic behavior of communities in networks. For future work, we can extend the proposed link models or previous link models to a dynamic version in order to model the dynamic changes in networks. The key challenges are (i) how to model the evolution of communities sequentially, and (ii) how to handle the deletion, insertion of nodes. 91 • Algorithms and Theory for Learning from Noisily Labeled Data The problem of noisy M&C linked data was abstracted into the problem of learning from noisily labeled data in Chapter 6&7, which is an important problem in machine learning since the labels could be noisy due to human error or uncertainty in labeling. 
For future work along the line, we can think about several directions: (1) how to relax the independence assumption made by the approaches in Chapter 6&7, while still maintain some theoretical guarantee about the learned model; (2) how to extend SVM to handle the noise labels with an efficient optimization algorithm to learn the parameter; (3) how to extend the problem to an online setup, where at each trial an noisy label is given to the learner together with the data, after making prediction on the data, the learner decides to make a query or not for the true label of the data, for which the learner usually needs to pay a cost. The goal of the learner is to minimize the errors he makes on the received examples under certain constraint imposed on the cost for querying the true labels. • Applications The proposed link models for community detection provide a solution for link prediction. The goal of link prediction is to predict whether two entities could link to each other (e.g., whether a user could be followed by other users or whether a product could be purchased by customers). The proposed approach can predict the links based on the communities of the entities and their popularities, productivities and attributes. It would be interesting to compare the community-based link models with traditional approaches for link prediction. The proposed approach for learning from noisily labeled data can be applied to semisupervised crowd clustering, where pairwise constraints are usually collected to facilitate the clustering and subject to errors because of biased or adversarial human annotators. The technique of estimating the sufficient statistics from noisy labels can be also applied to estimating the sufficient statistics for a target class from that for auxiliary classes based on the relation between the target class and the auxiliary classes 92 parameterized by the conditional probabilities that given a data point belongs to the target class how likely it also belongs to an auxiliary class. Using this technique, we can learn a model for predicting the target class without any training examples that are labeled by the target class, which could be useful when the target class label is difficult or expensive to collect. 93 APPENDICES 94 APPENDIX A Proofs Proof to Theorem 7 Proof. Under the i.i.d. assumption, it is straightforward to show that p E[ay [κj ]] = E[aδ [κj ]] = Eδ [κj ] y y Following the McDiarmid’s inequality, for any ǫ > 0, we have Pr Pr ǫ2 n 2R2 ǫ2 n aδ [κj ] − Eδ [κj ] ≥ ǫ ≤ 2 exp − 2 y y 2R p ay [κj ] − Eδ [κj ] ≥ ǫ ≤ 2 exp − y Using the following inequality and the union bound, p p aδ [κj ] − ay [κj ] ≤ aδ [κj ] − Eδ [κj ] + ay [κj ] − Eδ [κj ] y y y y we can complete the theorem using union bound. Proof to Theorem 8 Proof. Using the assumption (1.a) and (1.b), we have Eδ [κj ] = EX1 ,X2 EY|X1 ,X2 [δ(Y, y)κj (X 1 , X 2 )] = EX1 ,X2 [Pr(Y = y|X1 , X2 )κj (X1 , X2 )] y ¯ = EX1 ,X2 cy Pr(Y = y|X1 , X2 )κj (X1 , X2 ) + EX1 ,X2 (1 − cy ) Pr(Y = y |X1 , X2 )κj (X1 , X2 ) ¯ = cy Eδ [κj ] + (1 − cy )Eδ [κj ] = (cy + cy − 1)Eδ [κj ] + (1 − cy )E[κj (X1 , X2 )] ¯ y ¯ ¯ y ¯ y where we use the fact Eδ [κj ] + Eδ [κj ] = E[κj (X1 , X2 )]. Let us define y y ¯ cδ [κj ] = (cy + cy − 1)aδ [κj ] + (1 − cy ) ¯ ¯ y y 95 1 n κj (x1 , x2 ) i i i Under the i.i.d. 
assumption, it is straightforward to show that E[aδ [κj ]] = E[cδ [κj ]] = Eδ [κj ] y y y Following the McDiarmid’s inequality, for any ǫ > 0, we have ǫ2 n 2R2 ǫ2 n cδ [κj ] − Eδ [κj ] ≥ ǫ ≤ 2 exp − 2 y y 2R aδ [κj ] − Eδ [κj ] ≥ ǫ ≤ 2 exp − y y Pr Pr Then we have cδ [κj ] − aδ [κj ] ≥ ǫ ≤ 4 exp − y y Pr ǫ2 n 8R2 Dividing both sides of cδ [κj ] − aδ [κj ] ≥ ǫ by |cy + cy − 1|, we have ¯ y y Pr ǫ ≥ |cy + cy − 1| ¯ aδ [κj ] − bδ [κj ] y y ǫ2 n ≤ 4 exp − 2 8R Replacing ǫ with |(cy + cy − 1)|ǫ, we complete the proof using the union bound. ¯ Proof to Theorem 11 Proof. Let L(w) = 1 n ln i exp y 1 ⊤ 1 λ yw ki − w⊤ v + w 2 2 2 2 2 λ 1 = g(w) − w⊤ v + w 2 2 2 2 where g(w) is the sum of log-exponential function of w, which is convex in w. Assume w∗ is the optimal solution to minimizing L(w), w∗ is the optimal solution to minimizing L(w) with v replaced with v. First, we have L(w∗ ) ≥ L(w∗ ) + ∇L(w∗ )⊤ (w∗ − w∗ ) + ≥ L(w∗ ) + λ ∗ w − w∗ 2 2 2 96 λ ∗ w − w∗ 2 2 2 where the first inequality follows that L(·) is a λ-strongly convex function, and the second inequality follows the optimality criterion that ∇L(w∗ )⊤ (w∗ − w∗ ) ≥ 0. Second, λ 1 L(w∗ ) = g(w∗) − v⊤ w∗ + w∗ 2 2 2 2 1 1 λ = g(w∗) − v⊤ w∗ + w∗ 2 + (v − v)⊤ w∗ 2 2 2 2 1 ⊤ ∗ λ ∗ 2 1 ≤ g(w∗) − v w + w 2 + (v − v)⊤ w∗ 2 2 2 ∗) − 1 v⊤ w∗ + λ w∗ 2 + 1 (w∗ − w∗ )⊤ (v − v) ≤ g(w 2 2 2 2 1 ∗ ≤ L(w∗ ) + w − w∗ 2 v − v 2 2 Coming the above two bounds for L(w∗ ) together, we have 1 λ ∗ w − w∗ 2 ≤ w∗ − w∗ 2 v − v 2 2 2 2 i.e., 1 w∗ − w∗ 2 ≤ v−v 2 λ ∗ ∗ Let v = ao [k], and wa be the corresponding optimal solution, and v = bo [k], and wb be the corresponding optimal solution, then we have 1 o ∗ ∗ wa − wb 2 ≤ a [k] − bo [k] 2 λ Proof of Theorem 12 Proof. Since 1/(1 + exp(−s)) is a lipschitz continuous function with lipschitz continuity of 1, we have ∗ ∗ |p(y|x1 , x2 ) − p(y|x1 , x2 )| ≤ (wa − wb )⊤ k(x1 , x2 ) √ √ √ mR o mR o ∗ ∗ a [k] − b [k] 2 ≤ ≤ m wa − wb 2 R ≤ λ λ = mR2 cλ 8 n 4m δ 97 8mR2 ln (c+ + c− − 1)2 n 4m δ Proof of Theorem 13 Proof. We define bo [k] = ao [k] 1 − cy ¯ − a[k] c c where c = c+ + c− −1. Let p(y|x1, x2 ) be the classification model learned from the noisy side information using corrupted c+ and c− , and p(y|x1, x2 ) be the classification model learned from the noisy side information using perfect c+ and c− . Using the analysis in the proof of Theorem 12, we have √ mR o b [k] − bo [k] λ 2 √ o [k] o [k] 1 − cy 1 − cy mR a a ¯ ¯ ≤ − + a[k] − a[k] λ c c c c 2 √ √ 1 − cy 1 − cy mR √ 1 1 ¯ ¯ mR − + mR − ≤ λ c c c c 2 4∆ 2c∆ + 4∆ mR + ≤ 2 λ ρ ρ2 √ √ where we use ao [k] 2 ≤ mR, a[k] 2 ≤ mR, |c − c| ≤ 2∆, c ≥ ρ/2. Using the fact p(y|x1, x2 ) − p(y|x1 , x2 ) ≤ p(y|x1 , x2 ) − p(y|x1 , x2 ) ≤ p(y|x1 , x2 ) − p(y|x1 , x2 ) + p(y|x1, x2 ) − p(y|x1, x2 ) and the result in Theorem 12, we have the theorem. Proof to Lemma 15 Proof. First we note that Eo [kj ] = EX1 ,X2 ,Y [Ykj (X1 , X2 )] Under the i.i.d. assumption, we have E[ao [kj ]] = EX1 ,X2 ,Y [Ykj (X1 , X2 )] = Eo [kj ] 98 Following the McDiarmid’s inequality, for any ǫ > 0, we have Pr ao [kj ] − Eo [kj ] ≥ ǫ ≤ 2 exp − ǫ2 n 2R2 Using the union bound, we have 2 Pr Let δ = 2m exp − ǫ n ao [k] − Eo [k] 2 ≤ mǫ2 ≥ 1 − 2m exp − 2 2 2R ǫ2 n , i.e. ǫ = 2R2 2R2 ln n 2m , we can complete the proof. δ Proof to Lemma 16 Proof. First, under the independence assumption, we can have a similar solution for Ec [k] y as bc [k] in Equation 7.7, i.e. y Ec [k]⊤ + c [k]⊤ E− ¯ where A = Ec [k]⊤ o o + c [k]⊤ . 
To bound b [k] − E [k] 2 , we have E− bo [k] − Eo [k] 2 2 ≤4 ≤ pp⊤ (B⊤ B)−1 pE[k]⊤ ¯ + I − ⊤ ⊤ −1 B⊤ A ⊤ (B⊤ B)−1p p p (B B) p = (B⊤ B)−1 (B⊤ B)−1 σ −2 4 −2min 2 σmax p 2 bc [k]⊤ + c [k]⊤ b− ≤2 − 2 p(E[k] − a[k])⊤ p⊤ (B⊤ B)−1 p ≤4 p 2 2 +4 2 F (B⊤ B)−1 F pp⊤ (B⊤ B)−1 ¯ I − ⊤ ⊤ −1 B⊤ (A − A) p (B B) p ¯ E[k] − a[k] 2 + 8 (B⊤ B)−1B⊤ 2 A − A 2 2 F F (B⊤ B)−1 pp⊤ (B⊤ B)−1 ⊤ B +8 p⊤ (B⊤ B)−1 p κ2 Ec [k]⊤ + c [k]⊤ E− E[k] − a[k] 2 2 −1 + 16(σmin 2 F ¯ A−A 2 F −4 σmin ¯ + −2 σmax ) A − A 2 F σmax 3 4κ2 2 + 16(1 + κ ) A − A 2 ¯ E[k] − a[k] 2 ≤ F σmin p 2 2 99 2 F −1 where we use the fact p 2 σmax ≤ 2 p⊤ (B⊤ B)−1 p ≤ −1 p 2 σmin , 2 B 2 ≤ 2σmax and F −1 ¯ B−1 2 ≤ 2σmin Next, we show bound for E[k]−a[k] 2 and A− A 2 . For E[k]−a[k] 2 , 2 2 F F similar to Lemma 15, the following inequality holds for any ǫ > 0, Pr( E[k] − a[k] 2 ≤ mǫ2 ) ≥ 1 − 2m exp 2 ǫ2 n 2R2 ¯ For A − A 2 , we need first bound each element in the matrix, i.e. |Ec [kj ] − ac [kj ]|, it is y y F easy to show that ǫn ǫ2 n ) ≥ 1 − 2 exp − 2 ny 2R Pr(|Ec [kj ] − ac [kj ]| ≤ y y Then with the union bound, Pr 2 ¯ A − A 2 ≤ 2 mǫ2 F ρ ≥ 1 − 4m exp − ǫ2 n 2R2 where we use min(n+ /n, n− /n) ≥ ρ. Then we have Pr 32(1 + κ3 ) 2 4κ2 mǫ mǫ2 + bo [k] − Eo [k] 2 ≤ 2 ρ2 σmin p 2 2 ≥ 1 − 6m exp − ǫ2 n 2R2 2R2 6m ǫ2 n , and note that σmax σmin = det(B)2 = ln , i.e. ǫ = 2 n δ 2R ˆ d2 , σmax ≤ B⊤ B 1 = d, we have the result in Lemma 3. Let δ = 6m exp − Proof to Theorem 18 ∗ Proof. We first show wa −w∗ 2 can be well bounded with a large probability. By combining 2 this bound with Theorem 17, we have the result in Theorem 18. In the following derivation, √ we use k(X1 , X2 ) 2 ≤ mR. We denote by λ 1 L(w) = w 2+ 2 2 n n ln(1 + exp(−yi w⊤ ki ) i=1 λ w 2 + E[ln(1 + exp(−Yw⊤ k(X1 , X2 )))] L(w) = 2 2 100 ∗ We first bound L(wa ) as follows ∗ ∗ L(wa ) ≤ L(wa ) + max L(w) − L(w) w 2≤ρ where ρ = 2 ln 2 λ is due to L(w∗ ) ≤ ln 2. It is easy to show that, with probability at least 1 − δ, we have √ max L(w) − L(w) ≤ Rn [Fw ] + ρ mR w 2 ≤ρ √ √ ln(1/δ) 2ρ mR ≤ √ + ρ mR 2n n ln(1/δ) 2n where Rn [F ] is the Rademacher complexity of function class F , and Fw = {f (X1 , X2 , Y) = ln(1 + exp(−Yw⊤ k(X1 , X2 ))) : w 2 ≤ ρ}. Using the concentration inequality of bounded difference, with probability at least 1 − δ, we have √ L(w∗ ) ≤ L(w∗ ) + ρ mR ln(1/δ) 2n ∗ Using the fact L(w∗ ) ≥ L(wa ), with probability 1 − 2δ, we have ∗ L(wa ) − L(w∗ ) √ √ 2ρ mR + 2ρ mR ≤ √ n ln(1/δ) 2n ∗ ∗ Using the strong convexity of L(w), we have L(wa ) − L(w∗ ) ≥ λ wa − w∗ 2 , and therefore, 2 2 with probability 1 − δ, have ∗ wa − w∗ ≤ √ √ 4ρ mR 4ρ mR √ + √ λ n λ 2n ln 2 δ We complete the proof by combining the above result with the result in Theorem 4. 101 BIBLIOGRAPHY 102 BIBLIOGRAPHY [1] Lada Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, 2005. [2] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic block models for relational data with application to protein-protein interactions. In Proceedings of the International Biometrics Society Annual Meeting, 2006. [3] R. Albert. Scale-free networks in cell biology. Journal of Cell Science, 118(21):4947– 4957, 2005. [4] R. Albert and A. L. Barab´si. Statistical mechanics of complex networks. Reviews of a Modern Physics, 74:47–97, 2002. [5] Alex M. Andrew. Reinforcement learning: An introduction. Robotica, 17:229–235, March 1999. [6] A. L. Barab´si and R. Albert. Emergence of scaling in random networks. 