This is to certify that the dissertation entitled "Label Propagation for Classification and Ranking" presented by Ming Wu has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science at Michigan State University.

Major Professor's Signature
July 16, 2007

LABEL PROPAGATION FOR CLASSIFICATION AND RANKING

By

Ming Wu

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Computer Science
2007

ABSTRACT

LABEL PROPAGATION FOR CLASSIFICATION AND RANKING

By Ming Wu

Label propagation has been proven to be an effective semi-supervised learning approach in many applications. The key idea behind label propagation is to first construct a graph in which each node represents a data point and each edge is assigned a weight, often computed as the similarity between data points, and then propagate the class labels of the labeled data to their neighbors in the constructed graph in order to make predictions. This dissertation is a comprehensive study of label propagation approaches in several directions, including Relation Propagation, Rank Propagation, and Propagation over Directed Graphs.

Most previous work on label propagation propagates information among a single type of objects. However, many applications involve multiple types of objects. Inspired by the assumption that the correlation among different types of objects can be very helpful in many cases, a generalized framework for Relation Propagation is proposed to exploit this correlation in semi-supervised learning. The key idea behind relation propagation is to construct a graph that involves multiple types of objects and then propagate the relations among the different types of objects in this graph. The framework for Relation Propagation is applied to multi-label learning (classification problems) and collaborative filtering (ranking problems). Empirical results show that relation propagation is more effective than previous label propagation approaches.

It is also important to study label propagation approaches for ranking problems, due to existing challenges for label propagation. First, it may not be appropriate to propagate class labels (ordinal values) as numerical values in classification problems. It seems more reasonable to cast the problem as a ranking problem in which the class labels are converted to pairwise preferences between classes for each example and then the preferences are propagated. Second, most previous studies require absolute labels for learning, which are often hard to obtain, whereas relative ordering information is more easily available. Traditional label propagation approaches may not fit ranked data. Inspired by these challenges, a Rank Propagation framework is proposed for supervised learning.
The key idea behind Rank Propagation is to propagate the given preference judgements, instead of the true labels, from the labeled data to the unlabeled data and to compute preference matrices for the unlabeled data whose principal eigenvectors correspond to the class assignments. The application of this framework is presented in a multi-label categorization task with multiple datasets. The empirical results show that Rank Propagation is an effective approach in comparison with other commonly used supervised learning approaches.

Most studies in label propagation focus on undirected graphs. Motivated by the assumption that a directed graph may better capture the nature of the data, a framework for Propagation over Directed Graphs is proposed for utilizing directed graphs in propagation. The questions involved in this approach are how to construct and how to utilize a directed graph. Two asymmetric weight measures, namely a KL divergence-based measure and an asymmetric cosine similarity, are proposed in order to construct a directed graph. To utilize the directed graphs, one common method is to convert them into undirected graphs and then apply a standard label propagation approach to the converted undirected graphs. A random-walk-related method is discussed for this conversion. The application of this framework is presented in a binary classification task and a multi-label classification task. The empirical results show the effectiveness of Propagation over Directed Graphs in comparison with other approaches.

In summary, this dissertation discusses the proposed approaches in label propagation, namely Relation Propagation, Rank Propagation, and Propagation over Directed Graphs, in order to address the existing challenges in this area. Empirical studies show that the proposed approaches achieve promising performance in the given scenarios.

© Copyright 2007 by Ming Wu. All Rights Reserved.

ACKNOWLEDGMENTS

This thesis would not have been possible without the support of many people. Many thanks to my advisor, Rong Jin, who read my numerous revisions and helped make some sense of the confusion. Also thanks to my committee members, Eric Torng, Pang-Ning Tan, and Sarat Dass, who offered guidance and support. Special thanks to my fiancé, Aron White, who put up with me and provided so much love and support. Finally, thanks to my parents, and many friends who walked with me every step on this journey.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
2 Label Propagation: A Graph-based Approach
  2.1 Related Works
  2.2 Label Propagation based on Gaussian Fields and Harmonic Functions
    2.2.1 Description of the Framework
    2.2.2 Interpretation
    2.2.3 Learning the Weight Matrix
  2.3 Learning with Local and Global Consistency
    2.3.1 Description of the Method
    2.3.2 The Regularization Framework
  2.4 Summary
3 Relation Propagation: A Generalized Framework
  3.1 A Graph-based Framework
  3.2 Applications for Classification Problem
    3.2.1 Multi-Label Learning Model
    3.2.2 Case Study: Image Categorization
  3.3 Applications for Ranking Problem
    3.3.1 Ranking Learning Model for Collaborative Filtering
    3.3.2 Case Study I: Movie Recommendations
    3.3.3 Case Study II: Book Crossing
  3.4 Drawbacks of Label Propagation Approaches
4 Propagation for Ranked Data
  4.1 Ranking Problem Setting
  4.2 Related Works
    4.2.1 Statistical Measure of Rank Correlation
    4.2.2 Ranking SVM
    4.2.3 Large Margin Principle
    4.2.4 PRank Algorithm
    4.2.5 Conditional Model on Permutations
    4.2.6 Conditional Model on Ranking Poset
  4.3 Rank Propagation for Multi-label Learning
    4.3.1 Preliminaries
    4.3.2 Preference Matrix
    4.3.3 Framework Description
    4.3.4 Case Study: Multi-label Categorization
5 Propagation over Directed Graph
  5.1 Convert a Directed Graph into an Undirected Graph
    5.1.1 Conversion to Undirected Graphs
    5.1.2 Propagation Scheme: Spectral Graph Transducer
  5.2 Construct a Directed Graph
  5.3 Case Study I: Binary Classification
    5.3.1 Experiment Setup
    5.3.2 Experiment Results
    5.3.3 Analysis and Discussion
  5.4 Case Study II: Multi-label Classification
    5.4.1 Experiment Setup
    5.4.2 Results and Analysis
6 Conclusion and Future Work
  6.1 Summary
  6.2 Future Work
BIBLIOGRAPHY

LIST OF TABLES

3.1 Average Precision for the Top 20 Ranked Movies in the MovieLens Dataset
3.2 Average Precision for the Top 10 Ranked Books in the Book Crossing Dataset
4.1 PRBEP for Different Datasets with Different Numbers of Training Examples
4.2 Area under ROC Curve for the MovieLens Dataset
4.3 Area under ROC Curve for the St. Andrews Dataset
4.4 Area under ROC Curve for the Yeast Dataset
5.1 F1 Results for the MovieLens Dataset
5.2 F1 Measure for the 20-newsgroup Dataset
5.3 F1 Measure for the Reuters-21578 R8 Dataset
5.4 F1 Results for the 20-newsgroup Dataset with Different A Values

LIST OF FIGURES

2.1 An Example of the Connected Graph for Label Propagation
2.2 The Random Fields Constructed on Labeled and Unlabeled Examples
3.1 An Example of the Connected Graph for Relation Propagation
3.2 An Efficient Algorithm for Multi-label Learning Using Relation Propagation
3.3 Classification F1 Results Comparison Using Relation Propagation
3.4 Classification Results of Relation Propagation Using Different γ Values, K_D = 20
3.5 Classification Results of Relation Propagation Using Different K_D Values, γ = 1
3.6 Average Precision for Top Ranked Items Given 5, 10 or 20 Movies
3.7 Average Precision of Top Ranked Items Given 5, 10 or 20 Books
4.1 Analysis of Principal Eigenvectors of the Preference Matrix for Generated Datasets
4.2 Analysis of Principal Eigenvectors of the Preference Matrix for Study Datasets
4.3 F1 Scores of Three Algorithms for the MovieLens C18 Dataset
4.3 F1 Scores of Three Algorithms for the MovieLens C18 Dataset (cont'd)
4.4 F1 Scores of Three Algorithms for the St. Andrews C10 Dataset
4.4 F1 Scores of Three Algorithms for the St. Andrews C10 Dataset (cont'd)
4.5 F1 Scores of Three Algorithms for the Yeast C14 Dataset
4.5 F1 Scores of Three Algorithms for the Yeast C14 Dataset (cont'd)
4.6 F1 Scores Comparison of Different B Values in Rank Propagation
5.1 The Algorithm for Classification on a Directed Graph
5.2 The Distribution of Document Length in Three Datasets
5.3 Similarity Matrix Analysis of the MovieLens Dataset
5.4 Similarity Matrix Analysis of the 20-newsgroup Dataset
5.5 Similarity Matrix Analysis of the Reuters-21578 R8 Dataset
5.6 Distribution of Similarities in the MovieLens Dataset
5.7 Distribution of Similarities in the 20-newsgroup Dataset
5.8 Distribution of Similarities in the Reuters-21578 R8 Dataset
5.9 F1 Measure for the MovieLens Dataset in the Multi-label Classification Task

Chapter 1
Introduction

As a graph-based approach, label propagation belongs to the family of semi-supervised learning and has been proven effective for both text categorization and information retrieval [47, 10, 57, 50]. Assuming the objects in question are documents, the key idea of label propagation is as follows: first, view each document as a node in a connected graph, where each edge of the graph is associated with a weight proportional to the similarity between the two connected nodes/documents; then, estimate the confidence scores for all the classes of the unlabeled documents by propagating the class labels of the labeled documents through the weighted edges; finally, use the estimated confidence scores either to determine the appropriate categories for unlabeled documents in text categorization, or to rank the documents in information retrieval.

One of the key components in label propagation is the document similarity. It is usually calculated based on the content of documents, using either the dot product between two document-term vectors or an RBF kernel function. It can also be computed using side information, such as the hyperlinks among web pages [17].

A number of approaches have been developed in the past for label propagation, including the approach based on the harmonic function [57], the Gaussian random field [47], and the Green's function approach [50]. In the meantime, a number of propagation approaches have been developed in information retrieval, including the PageRank algorithm [8], the HITS algorithm [27], and retrieval based on implicit links [48, 29].

This dissertation explores and extends the idea of label propagation along several different avenues, namely Relation Propagation, Rank Propagation, and Propagation over Directed Graphs.

Relation Propagation. First, the generalized framework for Relation Propagation is explored and evaluated. Most previous works on label propagation propagate information among a single type of objects [57, 51, 23].
However, in the real world, many applications involve multiple types of objects. For instance, document categorization tasks involve two types of objects: documents and categories. In these scenarios, the information needs to be propagated among objects of different types in order to utilize the correlation among them, which can provide useful information for classification and ranking problems. Inspired by the assumption that the correlation among different types of objects can be very helpful in many cases, a generalized framework is proposed for Relation Propagation. The difference between our work and previous research is that the Relation Propagation framework allows the relationships between multiple types of objects to be propagated and thus improves performance. To better understand this framework, consider a scenario involving two types of objects, A and B. The key idea behind Relation Propagation is to first construct a graph in which each node is an object pair consisting of one object from each type, and then propagate the relationship between A and B in this graph. This framework can easily be applied to many applications. For instance, in a document categorization task, the two types of objects are documents and categories. The graph can be constructed as follows: each node is a document-category pair whose label is the membership of the document in the category; each edge is associated with a weight which can be defined using some similarity measure. The membership information can then be propagated through this graph. The resulting scores for each document with respect to all categories lead to a ranking list of all categories for this document. The framework can also easily be extended to scenarios with more than two types of objects.

In order to show the effectiveness of the relation propagation model, it is applied to two kinds of tasks: text categorization and collaborative filtering. The text categorization task is studied with an image retrieval dataset in which the two types of objects are images and categories; the membership of an image in a category is the information to be propagated. The collaborative filtering problem is studied with movie rating and book recommendation datasets in which users and items are the two types of objects and the rating information is propagated. Empirical results show that Relation Propagation achieves very promising performance in many applications in comparison with other state-of-the-art techniques.

Rank Propagation. The ranking problem has always been an important subject in many areas such as machine learning and statistics. A ranking problem often requires ranked data as the input, with the goal of generating a ranking function. Much previous work has been done in this area [2, 22, 43, 12, 32, 31]. Although label propagation, as an effective semi-supervised learning technique, has shown promising performance in many applications, it is natural to explore label propagation approaches on ranked data because of the existing challenges in label propagation. One of the challenges is related to propagating the class labels directly. As described above, the label propagation approaches address all the problems by first propagating the class labels and then generating ranking lists based on the resulting propagation scores computed by the algorithm. This may not be an appropriate approach in certain cases
(e.g., classification), since there is no ordering information among class labels. For example, consider numerical class labels 1, 2, 3, .... In most cases, the numbers just provide labels for the classes rather than ordering information: class 1 is not greater or smaller than class 2. When the class labels are propagated directly, ordering information is incorrectly introduced into the propagation process. One way to avoid this problem is to first convert the class labels into pairwise preferences between classes for each training example and then propagate these preferences. Clearly, this idea can be cast as a ranking problem. Another challenge is the difficulty of obtaining the exact labels of the labeled data. Most previous studies require the absolute label values or numeric values to be given in order to learn the ranking function, which is often not practical. The absolute labeling information is usually hard to obtain or is very costly and time consuming. In contrast, partial relative ordering information is more easily available for training in some cases. For instance, in a movie ratings dataset, the users provide ratings which indicate the preferences of certain movies over other movies. Traditional label propagation may not fit these applications because it is required to propagate the absolute labels.

To address these problems, a Rank Propagation framework is proposed for multi-label learning. The discussion starts with the definition of a general ranking problem whose input is the preference judgements, followed by the presentation of the Rank Propagation framework for supervised learning. The key idea behind Rank Propagation is to propagate the preference judgements (relative ordering) of the labeled data between the labeled data and the unlabeled data, instead of propagating the true labels of the labeled data. The preference judgements can be given directly by the datasets or computed from the labels of the given data. Consider the example of text categorization. For each document, a preference matrix can be computed with respect to the categories, and each entry in the preference matrix is the preference between two categories for this document. The preference judgements from the labeled data are propagated between the labeled documents and the unlabeled documents. The goal of Rank Propagation is to obtain preference matrices for the unlabeled data whose principal eigenvectors are proved to correspond to the class assignments.

The application of this framework is presented in a multi-label categorization task with movie categorization, image retrieval, and gene categorization datasets. The empirical results show that Rank Propagation is an effective approach in comparison with other commonly used supervised learning approaches, and thus verify our assumption that propagating the relative ordering information can be more helpful for solving these problems.

Propagation over Directed Graphs. As mentioned above, label propagation often involves a graph constructed from the given datasets. Most previous studies in label propagation focus on undirected graphs [57, 51]. However, directed graphs can be more appropriate for describing the given datasets in many cases. For instance, in the web page categorization scenario, it is often better to view the hyperlinks as directed edges instead of undirected edges in the graph. A number of works have been devoted to graph-based approaches on directed graphs [50, 52].
Motivated by the assumption that a directed graph may better capture the nature of the data, a framework for Propagation over Directed Graphs is proposed. The general idea of this approach is to first construct a directed graph from a given dataset, then convert this directed graph into an undirected graph, and finally propagate the labeling information over the converted graph using a standard propagation scheme. There are two questions involved in this approach.

The first question is how to construct a directed graph from the given data. In label propagation, a matrix is often computed in which each entry is the pairwise relationship between two data points. Based on this matrix, the graph is constructed so that each node represents a data point and each edge is associated with a weight equal to the corresponding pairwise relationship in the matrix. Evidently, symmetric weight measures result in undirected graphs, since the two edges connecting two nodes carry the same weight, while asymmetric weight measures lead to directed graphs. Based on this observation, two asymmetric weight measures are proposed, namely a KL divergence-based measure and an asymmetric cosine similarity, in order to construct a directed graph. The second question is how to utilize the directed graphs. One straightforward way is to convert a directed graph into an undirected one and then apply the existing label propagation approaches to the converted undirected graph. The regularization framework in [52] is discussed for this conversion. The Spectral Graph Transducer (SGT) is used as the standard propagation scheme in the proposed approach because of its proven effectiveness in various applications.

The application of this framework is presented in a binary classification task and a multi-label classification task. In the binary classification task, the approach of Propagation over Directed Graphs is compared with propagation using undirected graphs and other state-of-the-art classification techniques on multiple datasets. The empirical results show the effectiveness of Propagation over Directed Graphs in comparison with other approaches. The analysis of why directed graphs outperform other approaches verifies the assumption that directed graphs can better express the nature of the data in some cases. A brief empirical study is given for the multi-label classification task, in which the proposed approach is compared only with the label propagation approach based on harmonic functions and Gaussian fields; the goal is to show, from a different point of view, that using directed graphs is more effective than using undirected graphs.

It is important to note that this dissertation focuses on rank-related evaluation metrics for all the methods, which differ from the metrics commonly used in traditional classification and regression approaches. Traditionally, researchers often evaluate the effectiveness of label propagation approaches by classification accuracy. This requires an algorithm to generate exact labels or numerical values in order to compare with the true class assignments. Since most label propagation approaches generate scores which provide only ranking information for the unlabeled data, it is necessary to convert these scores into labels. However, converting from a ranking into a binary classification decision always involves the selection of a threshold, which is often determined empirically and prone to inaccuracy.
In order to evaluate the essential effectiveness of the label propagation approaches without the complication of the threshold, this dissertation focuses on label propagation approaches for ranking problems. Later chapters will discuss how label propagation can be applied to many scenarios by solving a ranking problem, along with the presentation of the proposed works.

The rest of this dissertation is organized as follows. Chapter 2 introduces the general idea of label propagation, surveys the existing related works, and discusses in detail two label propagation approaches which are closely related to our work. Chapter 3 presents the Relation Propagation approach, including both the general framework and its applications to the multi-label learning task and the collaborative filtering task, with empirical results; several drawbacks of the label propagation approaches for ranking problems are discussed at the end of the chapter. Chapter 4 defines the general ranking problem and proposes the Rank Propagation framework for supervised learning, which propagates the relative ordering information instead of the true labels of the labeled data. Chapter 5 proposes the framework of Propagation over Directed Graphs and discusses its applications to binary classification and multi-label classification, along with an analysis of the reasons why directed graphs can better express the nature of the data. Chapter 6 concludes this dissertation with a summary and future work.

Chapter 2
Label Propagation: A Graph-based Approach

In many traditional approaches to machine learning, the learner is trained with labeled examples. Labeled examples are often, however, time consuming and expensive to obtain because they require skilled human annotators. Therefore, the problem of combining labeled and unlabeled examples in the learning process becomes very important and necessary. Semi-supervised learning utilizes both labeled and unlabeled examples and has attracted an increasing amount of interest. A number of semi-supervised learning approaches have been proposed [41, 4, 23, 5, 51]. Among these approaches, there is one family of techniques which exploits the manifold structure of the data, usually inferred from the pairwise similarity of unlabeled and labeled data points. This family of techniques is referred to as label propagation, since they propagate the label information from labeled examples to unlabeled examples by using the pairwise similarity. The unlabeled examples are classified using the labeling information propagated to them from the labeled examples. The general process of label propagation is to propagate the labels of the labeled data over a graph in which each node represents a data point and each edge connecting two nodes is associated with a user-defined weight. Figure 2.1 shows an example of such a graph for label propagation.

Figure 2.1: An Example of the Connected Graph for Label Propagation

2.1 Related Works

Nearly all label propagation approaches are based on the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label.
The difference among the various algorithms for label propagation lies in the way they model the structure of the data and realize the assumption of consistency, and in the way they propagate the label information from labeled examples to unlabeled ones. A number of previous works have been devoted to label propagation [47, 10, 57, 50]. In [57, 54, 56], the Gaussian random fields and harmonic functions-based methods are motivated by the assumption that the label assignments should be smooth over the entire graph. In [51], the local and global consistency method proposes to enforce the smoothness of the class assignments by minimizing the sum of local variations measured at each edge in an undirected graph. In [23], the Spectral Graph Transducer, which can be viewed as a transductive version of kNN, propagates the labels of the labeled data by incorporating a quadratic penalty on the labeled data into the constrained minimization problem of normalized graph cuts. In [4, 5], the authors propose to extend the graph mincuts method to transductive learning; the key idea is to search for a partition of the graph which results in a minimum sum of the weights of the cut while agreeing with the labeled data. Local Laplacian embedding methods propose to project the data from the original space to a dimension-reduced space, on which the classification function can then be defined [3, 13].

Among the various approaches to label propagation, the two most popular ones are the Gaussian fields and harmonic functions-based approach [57] and the local and global consistency approach [51], both of which have been successfully applied to a number of applications. We present these two approaches in detail due to their resemblance to our proposed framework of Relation Propagation in Chapter 3. The harmonic function-based approach predicts the class labels by enforcing the smoothness of the class assignments over the entire graph; the smoothness of the class assignments is defined by a quadratic energy function which has a harmonic solution. The local and global consistency approach presents a simple algorithm to achieve a classification function that is smooth with respect to the intrinsic structure collectively revealed by the labeled and unlabeled points; this algorithm enforces the smoothness by minimizing the sum of local variations measured at each edge in the graph. These two methods are consistent with each other in that (1) both of them achieve smoothness of the label assignment probabilities over the entire graph by minimizing the sum of the variations defined at each edge; and (2) the initial labels of the labeled data are retained in some way in both methods. In the following sections, the two label propagation approaches are reviewed in detail.

2.2 Label Propagation based on Gaussian Fields and Harmonic Functions

[57] introduces a label propagation approach based on a random field model defined on a weighted graph over the unlabeled and labeled data, where the weights are given in terms of a similarity function between instances. This label propagation method adopts Gaussian fields over a continuous state space rather than random fields over a discrete label set. This "relaxation" to a continuous rather than discrete sample space results in many attractive properties: the most probable configuration of the field is unique and is characterized in terms of harmonic functions, with a closed-form solution that can be computed using matrix methods or loopy belief propagation.
2.2.1 Description of the Framework

Assume there are $l$ labeled examples $(x_1, y_1), \ldots, (x_l, y_l)$ and $u$ unlabeled examples $x_{l+1}, \ldots, x_{l+u}$ ($l \ll u$). Let $n = l + u$ be the total number of data examples. Consider the binary classification problem $y \in \{0, 1\}$. Construct a connected graph $G = (V, E)$ where $V$ corresponds to the $n$ data examples, with $L = \{1, \ldots, l\}$ corresponding to the labeled examples with labels $\{y_1, \ldots, y_l\}$ and $U = \{l+1, \ldots, l+u\}$ corresponding to the unlabeled examples. The goal is to assign labels to $U$. Let $W$ denote an $n \times n$ symmetric weight matrix on the edges $E$ of the graph. $W$ can be computed by an RBF kernel as

$$w_{ij} = \exp\left( -\sum_{d=1}^{m} \frac{(x_{id} - x_{jd})^2}{\sigma_d^2} \right) \qquad (2.1)$$

where $x_{id}$ is the $d$-th component of the example $x_i$ represented as a vector $x_i \in \mathbb{R}^m$, and $\sigma_d$ is a length-scale hyperparameter for each dimension. Obviously, nearby points in Euclidean space are assigned a large edge weight. Other measures can also be adopted if they are meaningful for the application, as long as they generate non-negative weights.

Figure 2.2: The random fields used in this work are constructed on labeled and unlabeled examples. The graph is formed with weighted edges between instances (digits), with labeled data as special "boundary" points and unlabeled points as "interior" points.

The goal is to find a real-valued function $f : V \to \mathbb{R}$ on $G$ and then assign labels based on $f$. We also constrain $f(i) = f_l(i) \equiv y_i$ on the labeled examples. We use $f_l$ to denote the function values on the labeled examples and $f_u$ on the unlabeled examples. It is intuitive to assume that data points sharing large similarities should have similar label assignments. This assumption leads to the quadratic energy function

$$E(f) = \frac{1}{2} \sum_{i,j} w_{ij} \bigl( f(i) - f(j) \bigr)^2 \qquad (2.2)$$

A probability distribution over $f$ is obtained by forming the Gaussian field $p_\beta(f) = \exp(-\beta E(f)) / Z_\beta$, where $\beta$ is an "inverse temperature" parameter and $Z_\beta = \int_{f|_L = f_l} \exp(-\beta E(f)) \, df$ is the partition function, which normalizes over all functions constrained to $f_l$ on the labeled data. Clearly, minimizing this energy function leads to a smooth label assignment function $f$, because the minimum value can only be reached by enforcing similar label probabilities on pairs of nodes that share large similarities. In other words, each node is finally assigned a label probability that is consistent with its neighbors, which leads to smooth label assignments over the entire graph.

The minimum energy function $f = \arg\min_{f|_L = f_l} E(f)$ is harmonic: it satisfies $\Delta f = 0$ on the unlabeled data points $U$ and is equal to $f_l$ on the labeled examples $L$. Here $\Delta$ is the combinatorial Laplacian, given in matrix form by $\Delta = D - W$, where $D = \mathrm{diag}(d_i)$ is the diagonal matrix with entries $d_i = \sum_j w_{ij}$ and $W = [w_{ij}]$ is the weight matrix. The harmonic property means that the value of $f$ at each unlabeled data point is the average of $f$ at the neighboring points,

$$f(j) = \frac{1}{d_j} \sum_{i \sim j} w_{ij} f(i), \qquad j = l+1, \ldots, l+u$$

or, in matrix notation, $f = Pf$ where $P = D^{-1} W$. According to the maximum principle of harmonic functions [14], a harmonic function $f$ defined on an open subset $S$ of $\mathbb{R}^n$ takes on its maximum and minimum values on the boundary. Thus we are guaranteed that $0 < f(j) < 1$ for $j \geq l+1$, since the labeled data are "boundary" points and the unlabeled data are "interior" points of the graph.
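To make the construction concrete, the following is a minimal numpy sketch of the weight matrix in Equation (2.1) and the energy in Equation (2.2). The toy data, the unit length scales, and the two candidate labelings are assumptions of this sketch, not part of the original experiments.

```python
import numpy as np

def rbf_weight_matrix(X, sigma):
    """Weight matrix of Equation (2.1): w_ij = exp(-sum_d (x_id - x_jd)^2 / sigma_d^2)."""
    diff = X[:, None, :] - X[None, :, :]                 # (n, n, m) pairwise differences
    return np.exp(-np.sum((diff / sigma) ** 2, axis=2))

def energy(W, f):
    """Quadratic energy of Equation (2.2): E(f) = 1/2 * sum_ij w_ij (f(i) - f(j))^2."""
    return 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)

# Toy data (assumed): two well-separated clusters of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.1]])
W = rbf_weight_matrix(X, sigma=np.ones(2))

f_smooth = np.array([0., 0., 0., 1., 1., 1.])   # respects the cluster structure
f_rough  = np.array([0., 1., 0., 1., 0., 1.])   # ignores the cluster structure
print(energy(W, f_smooth) < energy(W, f_rough))  # True: the smooth labeling has lower energy
```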
The harmonic solution can be computed explicitly in matrix form. The weight matrix $W$ can be split into four blocks after the $l$-th row and column:

$$W = \begin{bmatrix} W_{ll} & W_{lu} \\ W_{ul} & W_{uu} \end{bmatrix}$$

Letting $f = \begin{bmatrix} f_l \\ f_u \end{bmatrix}$, the harmonic solution $\Delta f = 0$ subject to $f|_L = f_l$ can be written as

$$f_u = (D_{uu} - W_{uu})^{-1} W_{ul} f_l = (I - P_{uu})^{-1} P_{ul} f_l \qquad (2.3)$$

To better understand the above method, we show its application to a document classification task. In order to apply the label propagation method, we relax the discrete class labels to a continuous space. The classification method is divided into two steps: first, we estimate continuous class label scores by propagating the class labels of the labeled data over the graph; then, based on the continuous class label scores, we make predictions using an empirically determined threshold.

Let $D = (d_1, d_2, \ldots, d_n)$ denote the entire collection of documents. Assume that the first $m$ documents of $D$ are labeled by $y_l = (y_1, y_2, \ldots, y_m)$, and the remaining $n_u = n - m$ documents are unlabeled. Each label $y_i$ can be either $+1$ or $-1$. Let $S = [S_{ij}]_{n \times n}$ denote the similarity matrix over both labeled and unlabeled documents. To estimate the class labels of the unlabeled documents, denoted by $y_u$, the label propagation approach described above suggests minimizing the following energy function:

$$E_s = \sum_{i,j=1}^{n} S_{ij} (y_i - y_j)^2 = \mathbf{y}^T L \mathbf{y} \qquad (2.4)$$

where $\mathbf{y} = (y_l^T, y_u^T)^T$. The matrix $L = [L_{ij}]_{n \times n}$ is the graph Laplacian of the similarity matrix $S$, defined as $L = D - S$, where $D = \mathrm{diag}(D_1, D_2, \ldots, D_n)$ and $D_i = \sum_{j=1}^{n} S_{ij}$. The optimal solution $y_u$ of the above problem is approximated in [57] as

$$y_u = L_{uu}^{-1} S_{ul}\, y_l \qquad (2.5)$$

where $L_{uu}$ refers to the part of the graph (combinatorial) Laplacian $L$ that is related only to the unlabeled documents, and $S_{ul}$ stands for the similarity matrix between the unlabeled documents and the labeled documents.
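A minimal numpy sketch of the closed-form solution in Equations (2.3) and (2.5) is shown below, assuming the labeled points occupy the first l rows and columns of the weight matrix; the function and variable names are illustrative, not from the original text.

```python
import numpy as np

def harmonic_solution(W, y_l):
    """Harmonic label propagation, Equation (2.3): f_u = (D_uu - W_uu)^{-1} W_ul f_l.

    W   : (n, n) symmetric non-negative weight matrix, labeled points first
    y_l : (l,) labels (or continuous scores) of the first l points
    """
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # combinatorial graph Laplacian L = D - W
    L_uu = L[l:, l:]                            # Laplacian block over the unlabeled points
    W_ul = W[l:, :l]                            # weights between unlabeled and labeled points
    # Solve L_uu f_u = W_ul y_l instead of forming the explicit inverse of Equation (2.5).
    return np.linalg.solve(L_uu, W_ul @ y_l)

# Example usage (assumed setup): with the first two points labeled 0 and 1, the
# remaining points receive scores strictly between 0 and 1, as the maximum principle guarantees.
# f_u = harmonic_solution(W, np.array([0.0, 1.0]))
```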
2.2.2 Interpretation

The framework can be viewed in several fundamentally different ways. These different viewpoints enrich the understanding of, and reasoning about, this approach to the semi-supervised learning problem.

The harmonic solution can be viewed as a random walk. Imagine a particle walking along the graph $G$. Starting from an unlabeled node $i$, the particle moves to a node $j$ with probability $P_{ij}$ after one step. The walk continues until the particle hits a labeled node. Then $f(i)$ is the probability that the particle, starting from node $i$, hits a labeled node with label 1. This view of the harmonic solution indicates that it is closely related to the random walk approach in [45]. There are two major differences between the harmonic function approach and the random walk approach: first, $f$ is fixed on the labeled examples in the harmonic function approach; second, the harmonic function solution described above is achieved at an equilibrium state, while the random walk approach in [45] depends on a time parameter $t$.

The solution $f$ can also be viewed from the standpoint of spectral graph theory. The heat kernel with time parameter $t$ on the graph $G$ is defined as $K_t = e^{-t\Delta}$. Here $K_t(i, j)$ is the solution of the heat equation on the graph with the initial condition being a point source at $i$ at time $t = 0$. It was proposed as an appropriate kernel for machine learning with categorical data in [28]. When used in a kernel method such as a support vector machine, the kernel classifier $f(j) = \sum_{i=1}^{l} \alpha_i y_i K_t(i, j)$ can be viewed as a solution of the heat equation with initial heat sources $\alpha_i y_i$ on the labeled data. The time parameter $t$ must be chosen using an auxiliary technique.

The harmonic function approach described in the previous section uses a different construction that is independent of $t$, the diffusion time. Consider the heat kernel $K_t = e^{-t \Delta_{uu}}$, where $\Delta_{uu} = D_{uu} - W_{uu}$ is the Laplacian restricted to the unlabeled data examples of $G$. $K_t$ describes heat diffusion on the unlabeled subgraph with Dirichlet boundary conditions on the labeled nodes. A Dirichlet boundary condition (often referred to as a first-type boundary condition), imposed on an ordinary or partial differential equation, specifies the values a solution must take on the boundary of the domain. Clearly, Dirichlet boundary conditions fix the values of the function on the labeled data. The Green's function $G$ can be expressed as

$$G = \int_0^\infty K_t \, dt = \int_0^\infty e^{-t\Delta_{uu}} \, dt = (D_{uu} - W_{uu})^{-1} \qquad (2.6)$$

The harmonic solution in (2.3) can then be written as

$$f_u = G\, W_{ul}\, f_l \qquad (2.7)$$

or, entrywise,

$$f_u(j) = \sum_{i=1}^{l} \Bigl( \sum_{k} G(j, k)\, W_{ki} \Bigr) f_l(i) \qquad (2.8)$$

This expression shows that the approach can be viewed as a kernel classifier with kernel $G$, i.e., as a specific form of kernel machine.

The graph mincut approach proposed in [4] has a very interesting and substantial connection to the harmonic function method described here. The starting point of the graph mincut approach is also a weighted graph $G$, but the learning problem is presented as finding the minimum $st$-cut, where the negatively labeled data are connected to a special source node $s$ and the positively labeled data are connected to a special sink node $t$. A minimum $st$-cut, which is not necessarily unique, minimizes the $L_1$ objective function $E_1(f) = \frac{1}{2} \sum_{i,j} w_{ij} |f(i) - f(j)|$ and corresponds to a function $f : V \to \{-1, +1\}$. The solution may not be unique and can be obtained using linear programming. However, its random field model is over the label space $\{-1, +1\}$ and the field is pinned on the labeled entries. This leads to two problems: first, approximation methods based on rapidly mixing Markov chains cannot be used; second, multi-label extensions of this approach are generally NP-hard. In contrast, the harmonic solution can be computed efficiently using matrix methods, even in the multi-label case, and inference for the Gaussian random field can be carried out efficiently and accurately using loopy belief propagation [46].

2.2.3 Learning the Weight Matrix

The weight matrix is an important input to the harmonic function approach described above. For the weight matrix given by Equation (2.1), the parameters $\sigma_d$ are the only parameters determining the weight matrix. By learning $\sigma_d$ from both labeled and unlabeled data, the graph structure can be better aligned with the data. One way to learn the parameters $\sigma_d$ is to minimize the entropy on the unlabeled data. The average label entropy is used as a heuristic criterion for parameter learning; it is defined as

$$H(f) = \frac{1}{u} \sum_{i=l+1}^{l+u} H_i(f(i)) \qquad (2.9)$$

where $H_i(f(i)) = -f(i)\log f(i) - (1 - f(i))\log(1 - f(i))$ is the entropy of the field at the individual unlabeled data point $i$. The maximum principle of harmonic functions guarantees that $0 < f(i) < 1$ for $i \geq l+1$. A small entropy implies that $f(i)$ is close to 0 or 1, which captures the intuition that a good $W$ should result in a confident labeling. Smoothing can be used to avoid the complication that $H$ has a minimum at 0 as $\sigma_d \to 0$: we replace $P$ (obtained by normalizing the weight matrix $W$) with the smoothed matrix $\tilde{P} = \epsilon U + (1 - \epsilon) P$, where $U$ is the uniform matrix with entries $U_{ij} = 1/(l+u)$. We use gradient descent to find the hyperparameters $\sigma_d$ that minimize $H$. The gradient is computed as

$$\frac{\partial f_u}{\partial \sigma_d} = (I - \tilde{P}_{uu})^{-1} \left( \frac{\partial \tilde{P}_{uu}}{\partial \sigma_d} f_u + \frac{\partial \tilde{P}_{ul}}{\partial \sigma_d} f_l \right)$$

Since the original matrix $P$ is obtained by normalizing $W$, we have

$$\frac{\partial p_{ij}}{\partial \sigma_d} = \frac{1}{\sum_{n=1}^{l+u} w_{in}} \left( \frac{\partial w_{ij}}{\partial \sigma_d} - p_{ij} \sum_{n=1}^{l+u} \frac{\partial w_{in}}{\partial \sigma_d} \right)$$

Finally, $\partial w_{ij} / \partial \sigma_d = 2 w_{ij} (x_{di} - x_{dj})^2 / \sigma_d^3$.
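The entropy criterion of Equation (2.9) is simple to evaluate once $f_u$ is available. The sketch below computes it and, as an illustrative alternative to the gradient descent described above (an assumption of this sketch, not the procedure used in [57]), selects a single shared bandwidth by grid search, reusing the helper functions from the earlier sketches.

```python
import numpy as np

def average_label_entropy(f_u, eps=1e-12):
    """Average label entropy of Equation (2.9) over the unlabeled points."""
    f = np.clip(f_u, eps, 1.0 - eps)                 # keep log() finite near 0 and 1
    return float(np.mean(-f * np.log(f) - (1.0 - f) * np.log(1.0 - f)))

def select_bandwidth(X, y_l, candidate_sigmas):
    """Pick the sigma whose harmonic solution is most confident (lowest entropy).

    Reuses rbf_weight_matrix() and harmonic_solution() from the earlier sketches;
    a single shared sigma for all dimensions is assumed for brevity.
    """
    def entropy_for(sigma):
        W = rbf_weight_matrix(X, sigma * np.ones(X.shape[1]))
        return average_label_entropy(harmonic_solution(W, y_l))
    return min(candidate_sigmas, key=entropy_for)
```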
2.3 Learning with Local and Global Consistency

The local and global consistency method [51] is similar to the harmonic function approach described above. It differs from the harmonic function approach in that it measures the smoothness of the classifying function in a different way. We first introduce some notation. Given a point set $X = \{x_1, \ldots, x_l, x_{l+1}, \ldots, x_n\} \subset \mathbb{R}^m$ and a label set $\mathcal{L} = \{1, \ldots, c\}$, the first $l$ points $x_i$ ($i \leq l$) are labeled as $y_i \in \mathcal{L}$ and the remaining points $x_u$ ($l+1 \leq u \leq n$) are unlabeled. The goal is to predict the labels of the unlabeled points. Let $\mathcal{F}$ denote the set of $n \times c$ matrices with nonnegative entries. A matrix $F = [F_1^T, \ldots, F_n^T]^T \in \mathcal{F}$ corresponds to a classification of the dataset $X$ obtained by labeling each point $x_i$ with the label $y_i = \arg\max_{j \leq c} F_{ij}$. In other words, each row of $F$ can be viewed as the vector of confidence scores for assigning the class labels to one data point, and we can understand $F$ as a vectorial function $F : X \to \mathbb{R}^c$ which assigns a vector $F_i$ to each point $x_i$. Define an $n \times c$ matrix $Y \in \mathcal{F}$ with $Y_{ij} = 1$ if $x_i$ is labeled as $y_i = j$ and $Y_{ij} = 0$ otherwise. $Y$ is consistent with the initial labels of the labeled data.

2.3.1 Description of the Method

The method can be described by the following steps: first, compute the weight matrix $W$ defined as $W_{ij} = \exp\left( -\|x_i - x_j\|^2 / 2\sigma^2 \right)$ if $i \neq j$ and $W_{ii} = 0$; second, construct the matrix $S = D^{-1/2} W D^{-1/2}$, in which $D$ is the diagonal matrix whose $(i,i)$-element equals the sum of the $i$-th row of $W$; third, iterate $F(t+1) = \alpha S F(t) + (1 - \alpha) Y$ until convergence, where $\alpha$ is a parameter in $(0, 1)$; finally, label each point $x_i$ as $y_i = \arg\max_{j \leq c} F^*_{ij}$.

This method can be easily understood from the propagation point of view. We first define the weight matrix $W$, which captures the manifold structure of the dataset. The graph $G = (V, E)$ for propagation is defined on the data, where the vertex set $V$ includes all points and the edges $E$ are weighted by $W$. $W$ is normalized symmetrically, which is necessary for the convergence of the iteration. In the iteration step, the labels of the labeled data are repeatedly propagated over the graph $G$ until convergence. It is not hard to see that each point receives the information propagated from its neighbors and at the same time retains its initial information. Self-reinforcement is avoided since the diagonal elements of the weight matrix $W$ are set to zero. After the propagation ends, we label the unlabeled data using $F^*$.

2.3.2 The Regularization Framework

Based on the above algorithm, a regularization framework can be developed as follows:

$$Q(F) = \frac{1}{2} \left( \sum_{i,j=1}^{n} W_{ij} \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{n} \left\| F_i - Y_i \right\|^2 \right) \qquad (2.10)$$

where $\mu > 0$ is the regularization parameter. The classifying function is then

$$F^* = \arg\min_{F \in \mathcal{F}} Q(F) \qquad (2.11)$$

The first term on the right-hand side of Equation (2.10) is the smoothness constraint, which means that a good classifying function should not change too much between nearby points. The second term is the fitting constraint, which means that a good classifying function should not change too much from the initial label assignment. The trade-off between the two terms is captured by $\mu$. $F^*$ can be found by differentiating $Q(F)$ with respect to $F$:

$$\left. \frac{\partial Q}{\partial F} \right|_{F = F^*} = F^* - S F^* + \mu (F^* - Y) = 0$$

which can be rewritten as

$$F^* - \frac{1}{1+\mu}\, S F^* - \frac{\mu}{1+\mu}\, Y = 0$$

Two new variables are introduced: $\alpha = \frac{1}{1+\mu}$ and $\beta = \frac{\mu}{1+\mu}$; note that $\alpha + \beta = 1$. Then

$$(I - \alpha S)\, F^* = \beta Y$$

Since $I - \alpha S$ is invertible, the solution is

$$F^* = \beta\, (I - \alpha S)^{-1}\, Y$$
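A minimal numpy sketch of the method is given below, covering both the iteration F(t+1) = αSF(t) + (1−α)Y from Section 2.3.1 and the closed form F* = β(I − αS)^{-1}Y derived above; the choice α = 0.99 and the fixed iteration count are assumptions of this sketch.

```python
import numpy as np

def local_global_consistency(W, Y, alpha=0.99, n_iter=1000):
    """Local and global consistency propagation (Section 2.3).

    W : (n, n) affinity matrix with zero diagonal
    Y : (n, c) initial label matrix, Y[i, j] = 1 iff point i is labeled with class j
    Returns the predicted class index for every point.
    """
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))           # S = D^{-1/2} W D^{-1/2}
    F = Y.astype(float)
    for _ in range(n_iter):                   # F(t+1) = alpha * S F(t) + (1 - alpha) * Y
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F.argmax(axis=1)

def local_global_consistency_closed_form(W, Y, alpha=0.99):
    """Equivalent closed form F* = beta * (I - alpha * S)^{-1} Y with beta = 1 - alpha."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    F = (1.0 - alpha) * np.linalg.solve(np.eye(len(W)) - alpha * S, Y.astype(float))
    return F.argmax(axis=1)
```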
2.4 Summary

Label propagation is an important technique in machine learning and has shown promising performance in many applications, such as classification and ranking problems. The key idea behind label propagation is to propagate the labels of the labeled data to the unlabeled data over a graph constructed from an affinity matrix computed on the data, and then make predictions accordingly. Despite the different motivations behind the various methods, such as graph-cut based approaches [23, 4, 5], the Gaussian process based approach [30], and the Gaussian random fields and harmonic function based approach [57, 54, 56], most label propagation methods assume the consistency of the resulting labels, meaning that data examples close to each other should have similar labels. The differences among the various label propagation methods lie in the way they model the consistency of the data and propagate the labels. In addition to an overview of the different algorithms for label propagation, we also presented in detail the two most popular algorithms, the Gaussian random field and harmonic function based approach and the local and global consistency approach, because of their resemblance to our work in the following chapters.

Chapter 3
Relation Propagation: A Generalized Framework

Despite the extensive study of label propagation, most previous work assumes that the similarity graph consists of a single type of object, namely documents, and that the propagation is conducted only among objects of the same type. However, there are scenarios in many applications that involve multiple types of objects, where the label information needs to be propagated not only among objects of the same type, but also between objects of different types. To illustrate this issue, consider the multi-label classification problem. To directly employ the label propagation approach, we would first decompose the multi-label classification problem into a number of binary classification problems, and then the labels of each binary class would be propagated through a similarity matrix. The problem with this approach is that it is incapable of exploiting the correlation information among different classes, which is often extremely useful when the number of training documents is small. To see this, consider a five-class classification problem. Assume that for an unlabeled document $d$, its confidence scores for the five classes after the binary class propagation are 0.4, 0.4, 0.4, 0.3, and 0.3. Without using the class correlation information, it is equally likely to assign document $d$ to categories 1, 2, and 3. However, if both categories 4 and 5 are highly correlated with category 3 and meanwhile are orthogonal to categories 1 and 2, we would expect that category 3 is more likely to be the right class for document $d$ than categories 1 and 2.

Figure 3.1: An Example of the Connected Graph for Relation Propagation

We can generalize the above example into a propagation framework for multiple types of objects. In particular, we treat the documents and the categories as two different types of objects.
Instead of propagating the labeling information among documents independently for each category, we propose to propagate the relationship between documents and categories. More specifically, we construct a weighted graph as follows: each node $O_{(i,j)} = (d_i, c_j)$ in the graph represents the relationship that document $d_i$ belongs to category $c_j$; any two nodes $O_{(i,j)}$ and $O_{(k,l)}$ in the graph are connected by an edge whose weight reflects the correlation between the two corresponding relationships. In other words, a large similarity between nodes $O_{(i,j)}$ and $O_{(k,l)}$ indicates that document $d_i$ is likely to be assigned to class $c_j$ if document $d_k$ belongs to class $c_l$, and vice versa. It is important to note that the classes $c_j$ and $c_l$ can be different in the above propagation scheme. Figure 3.1 illustrates an example of the graph for relation propagation with three documents and three categories. To distinguish it from previous work on label propagation, we refer to the proposed framework as "Relation Propagation", or RP for short.

We would like to emphasize that the relation propagation framework can be applied to many applications in addition to the multi-label learning problems described above. For instance, it can be applied to collaborative filtering by viewing users and items as two different types of objects, where the relationship $O_{(i,j)} = (u_i, t_j)$ represents the rating of item $t_j$ by user $u_i$. Using the framework of relation propagation, we are able to propagate the rating information among different users and different items simultaneously. In particular, it allows us to infer the rating of user $u_1$ on item $t_1$ given the rating of user $u_2$ on a different item $t_2$, provided that the items $t_1$ and $t_2$ share similar characteristics. This property can be extremely useful for alleviating the sparse data problem, which has been a critical issue in collaborative filtering [7].

3.1 A Graph-based Framework

To facilitate our discussion, we first describe the framework of relation propagation for two types of objects, followed by the generalization to multiple types of objects. We then present an efficient implementation for applying the proposed framework to multi-label learning and collaborative filtering.

Let $\mathcal{A} = (a_1, a_2, \ldots, a_n)$ and $\mathcal{B} = (b_1, b_2, \ldots, b_m)$ denote the two types of objects. Let $f(a_i, b_j) : \mathcal{A} \times \mathcal{B} \to \mathbb{R}$ denote the relation function between an object of type $\mathcal{A}$ and an object of type $\mathcal{B}$. Let $\mathbf{y}$ denote the vector of size $mn$ whose element $y_{(i,j)}$ corresponds to the relation $f(a_i, b_j)$. Note that we use a double index $(i,j)$ to refer to an element of the vector $\mathbf{y}$; this convention will be used throughout the remainder of this dissertation. For convenience of discussion, we assume that the first $N_l$ elements of $\mathbf{y}$, denoted by $\mathbf{y}_l$, are the labeled relations, and the remaining $N_u = mn - N_l$ elements of $\mathbf{y}$, denoted by $\mathbf{y}_u$, are the unlabeled relations that need to be predicted. Finally, let $S^A = [S^A_{ij}]_{n \times n}$ and $S^B = [S^B_{ij}]_{m \times m}$ denote the matrices of similarities among the objects of type $\mathcal{A}$ and among the objects of type $\mathcal{B}$, respectively.

In order to incorporate the similarity information $S^A$ and $S^B$ into the propagation scheme, we modify the energy function in Equation (2.4) into the following expression:

$$E_m = \sum_{i,k=1}^{n} \sum_{j,l=1}^{m} S^A_{ik}\, S^B_{jl}\, \bigl( y_{(i,j)} - y_{(k,l)} \bigr)^2 \qquad (3.1)$$

In the above, we introduce the weight $S^A_{ik} S^B_{jl}$, the product of the similarity measurements for the two types of objects, to weigh the difference between the two relations $y_{(i,j)}$ and $y_{(k,l)}$.
Hence, to minimize the energy function in Equation (3.1), two relations will be enforced to have similar values when they share similar input objects of both type $\mathcal{A}$ and type $\mathcal{B}$. We then simplify the expression for the energy $E_m$ using matrix notation, i.e.,

$$E_m = \mathbf{y}^T L^{A,B}\, \mathbf{y} \qquad (3.2)$$

where

$$L^{A,B} = D^{A,B} - S^{A,B}, \qquad S^{A,B} = S^A \otimes S^B$$

and the operator $\otimes$ stands for the direct product of matrices. $D^{A,B}$ in the above expression is a diagonal matrix whose diagonal elements are defined as

$$D^{A,B}_{(i,j),(i,j)} = \sum_{k=1}^{n} \sum_{l=1}^{m} S^A_{ik}\, S^B_{jl} \qquad (3.3)$$

Finally, similar to the solution in Equation (2.5), the optimal solution for $\mathbf{y}_u$ that minimizes the energy function in Equation (3.2) is

$$\mathbf{y}_u = \bigl[ L^{A,B}_{uu} \bigr]^{-1} S^{A,B}_{ul}\, \mathbf{y}_l \qquad (3.4)$$

As a further improvement, we introduce the normalized similarity matrices and the normalized graph Laplacian:

$$\bar{S}^A = (D^A)^{-1/2}\, S^A\, (D^A)^{-1/2} \qquad (3.5)$$

$$\bar{S}^B = (D^B)^{-1/2}\, S^B\, (D^B)^{-1/2} \qquad (3.6)$$

$$\bar{S}^{A,B} = (D^{A,B})^{-1/2}\, S^{A,B}\, (D^{A,B})^{-1/2} \qquad (3.7)$$

$$\bar{L}^{A,B} = (D^{A,B})^{-1/2}\, L^{A,B}\, (D^{A,B})^{-1/2} = I - \bar{S}^{A,B} \qquad (3.8)$$

Notice that, according to the above definitions, we have

$$\bar{S}^{A,B} = \bar{S}^A \otimes \bar{S}^B \qquad (3.9)$$

This is because

$$\bar{S}^{A,B}_{(i,j),(k,l)} = \frac{S^A_{ik}\, S^B_{jl}}{\sqrt{D^{A,B}_{(i,j),(i,j)}\, D^{A,B}_{(k,l),(k,l)}}} = \frac{S^A_{ik}\, S^B_{jl}}{\sqrt{D^A_i D^B_j}\, \sqrt{D^A_k D^B_l}} = \bar{S}^A_{ik}\, \bar{S}^B_{jl}$$

Replacing the graph Laplacian $L^{A,B}$ and the similarity matrix $S^{A,B}$ in Equation (3.4) with the normalized ones, we have $\mathbf{y}_u$ computed as

$$\mathbf{y}_u = \bigl( I - \bar{S}^{A,B}_{uu} \bigr)^{-1} \bar{S}^{A,B}_{ul}\, \mathbf{y}_l \qquad (3.10)$$

It is straightforward to extend the above formulation to the case of multiple types of objects. Let $\mathcal{T} = (\mathcal{O}^1, \mathcal{O}^2, \ldots, \mathcal{O}^t)$ denote the $t$ different types of objects. Let $f : \mathcal{O}^1 \times \mathcal{O}^2 \times \cdots \times \mathcal{O}^t \to \mathbb{R}$ denote the relation function among the $t$ different types of objects. Let $\{ S^k = [S^k_{ij}]_{n_k \times n_k},\ k = 1, 2, \ldots, t \}$ denote the similarity matrices for the $t$ types of objects. We then write the energy function for the $t$ types of objects as

$$E_m = \mathbf{y}^T L^{\mathcal{T}}\, \mathbf{y}$$

where $L^{\mathcal{T}}$ is the graph Laplacian constructed, as in Equation (3.2), from the direct product $\bigotimes_{k=1}^{t} S^k$, and each element of the vector $\mathbf{y}$, i.e., $y_{(i_1, i_2, \ldots, i_t)}$, corresponds to the relation $f(o^1_{i_1}, o^2_{i_2}, \ldots, o^t_{i_t})$. A solution similar to the one in Equation (3.4) will minimize the above energy function.
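For small problems, the two-type solution in Equation (3.10) can be evaluated directly. The numpy sketch below does so with an explicit Kronecker product and a boolean mask marking the labeled relations; the function and variable names, and the row-major flattening of the pair index (i, j) to position i*m + j, are assumptions of this sketch. The next section explains why this direct route does not scale.

```python
import numpy as np

def normalize(S):
    """Symmetric normalization of Equations (3.5)/(3.6): D^{-1/2} S D^{-1/2}."""
    d = S.sum(axis=1)
    return S / np.sqrt(np.outer(d, d))

def relation_propagation_direct(SA, SB, y, labeled):
    """Direct evaluation of Equation (3.10): y_u = (I - Sbar_uu)^{-1} Sbar_ul y_l.

    SA, SB  : (n, n) and (m, m) similarity matrices for the two object types
    y       : flattened relation vector of length n*m; entry i*m + j holds f(a_i, b_j)
    labeled : boolean vector of length n*m marking the labeled relations
    Only feasible for small n*m, since the full (n*m) x (n*m) matrix is formed.
    """
    Sbar = np.kron(normalize(SA), normalize(SB))     # Equation (3.9): Sbar^A x Sbar^B
    u = ~labeled
    A = np.eye(u.sum()) - Sbar[np.ix_(u, u)]         # I - Sbar_uu
    return np.linalg.solve(A, Sbar[np.ix_(u, labeled)] @ y[labeled])
```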
3.2 Applications for Classification Problem

The relation propagation approach can be used for classification problems. This section discusses its application to the multi-label learning task. We describe the multi-label learning model based on relation propagation and then present the empirical studies with different datasets.

3.2.1 Multi-Label Learning Model

It is straightforward to apply the framework of relation propagation to multi-label learning. Let us denote the collection of documents by $D = (d_1, d_2, \ldots, d_n)$ and the set of categories by $C = (c_1, c_2, \ldots, c_m)$. Then, the relation $f(d_i, c_j)$ is a binary function that outputs 1 when document $d_i$ belongs to the $j$-th category, and 0 otherwise. Let the normalized similarity matrices of categories and documents be denoted by $\bar{S}^C$ and $\bar{S}^D$, and the normalized graph Laplacian by $\bar{L}^{D,C}$. Assume that the first $n_l$ documents of $D$ are the labeled examples, and the remaining $n_u = n - n_l$ documents are the unlabeled examples. We then decompose the document similarity matrix $\bar{S}^D$ as follows:

$$\bar{S}^D = \begin{bmatrix} \bar{S}^D_{l,l} & \bar{S}^D_{l,u} \\ \bar{S}^D_{u,l} & \bar{S}^D_{u,u} \end{bmatrix}$$

where the subindices $l$ and $u$ refer to the labeled documents and the unlabeled documents, respectively. Similarly, we can decompose the normalized matrices $\bar{S}^{D,C}$ and $\bar{L}^{D,C}$ as follows:

$$\bar{S}^{D,C} = \bar{S}^D \otimes \bar{S}^C = \begin{bmatrix} \bar{S}^D_{l,l} \otimes \bar{S}^C & \bar{S}^D_{l,u} \otimes \bar{S}^C \\ \bar{S}^D_{u,l} \otimes \bar{S}^C & \bar{S}^D_{u,u} \otimes \bar{S}^C \end{bmatrix}$$

$$\bar{L}^{D,C} = I - \bar{S}^{D,C} = \begin{bmatrix} \bar{L}^{D,C}_{l,l} & \bar{L}^{D,C}_{l,u} \\ \bar{L}^{D,C}_{u,l} & \bar{L}^{D,C}_{u,u} \end{bmatrix} = \begin{bmatrix} I - \bar{S}^D_{l,l} \otimes \bar{S}^C & -\bar{S}^D_{l,u} \otimes \bar{S}^C \\ -\bar{S}^D_{u,l} \otimes \bar{S}^C & I - \bar{S}^D_{u,u} \otimes \bar{S}^C \end{bmatrix}$$

Using the above expressions for $\bar{S}^{D,C}$ and $\bar{L}^{D,C}$, the solution in Equation (3.10) for multi-label learning can be expanded as

$$\mathbf{y}_u = \bigl( I - \bar{S}^D_{u,u} \otimes \bar{S}^C \bigr)^{-1} \bigl( \bar{S}^D_{u,l} \otimes \bar{S}^C \bigr)\, \mathbf{y}_l \qquad (3.11)$$

Directly computing the solution in the above expression could be computationally expensive. This is because the size of the matrix $I - \bar{S}^D_{u,u} \otimes \bar{S}^C$ is $O(mn \times nm)$, roughly the square of the product of the number of documents and the number of categories. This matrix will be huge even when the number of documents and the number of categories are modest, making it infeasible to store, not to mention to invert. For instance, in our experiment with multi-label learning, the number of documents is 1000 and the number of categories is 100. This results in a matrix $I - \bar{S}^D_{u,u} \otimes \bar{S}^C$ of size $100{,}000 \times 100{,}000$, which is too large to be manipulated by most desktop computers. In the following, we present an efficient algorithm for applying relation propagation to multi-label learning.

The key idea of the efficient algorithm is to approximate the inverse of the matrix $I - \bar{S}^D_{u,u} \otimes \bar{S}^C$ by its eigenvectors. To this end, we first present an important theorem regarding the eigenvectors of the direct product of matrices, which will serve as the basis for our efficient algorithm.

Theorem 1. Let $\mathbf{v}_a$ and $\mathbf{v}_b$ be eigenvectors of matrices $A$ and $B$, and let $\lambda_a$ and $\lambda_b$ be the corresponding eigenvalues. Then $\mathbf{v}_a \otimes \mathbf{v}_b$ is an eigenvector of the matrix $A \otimes B$, and $\lambda_a \lambda_b$ is the corresponding eigenvalue.

Proof. To prove the above theorem, we need to show that the following equation holds:

$$(A \otimes B)(\mathbf{v}_a \otimes \mathbf{v}_b) = \lambda_a \lambda_b\, (\mathbf{v}_a \otimes \mathbf{v}_b) \qquad (3.12)$$

Notice that the left-hand side of the equation can be simplified as

$$(A \otimes B)(\mathbf{v}_a \otimes \mathbf{v}_b) = (A \mathbf{v}_a) \otimes (B \mathbf{v}_b) = \lambda_a \lambda_b\, (\mathbf{v}_a \otimes \mathbf{v}_b)$$

In the above, we use the following property of the matrix direct product:

$$(U \otimes V)(C \otimes D) = (UC) \otimes (VD) \qquad (3.13)$$

Thus, the left-hand side of Equation (3.12) is equal to its right-hand side, and the theorem is proved.

Using Theorem 1, we can approximate the expression in (3.11) using the eigenvectors of the matrices $\bar{S}^D_{u,u}$ and $\bar{S}^C$. Let the top eigenvalues and eigenvectors of $\bar{S}^D_{u,u}$ be denoted by $\{(\lambda^D_i, \mathbf{v}^D_i),\ i = 1, 2, \ldots, K_D\}$, and the top eigenvalues and eigenvectors of $\bar{S}^C$ be denoted by $\{(\lambda^C_k, \mathbf{v}^C_k),\ k = 1, 2, \ldots, K_C\}$. Then, the eigenvalues and eigenvectors of $I - \bar{S}^D_{u,u} \otimes \bar{S}^C$ can be expressed as

$$\lambda_{(i,k)} = 1 - \lambda^D_i \lambda^C_k, \qquad \mathbf{v}_{(i,k)} = \mathbf{v}^D_i \otimes \mathbf{v}^C_k, \qquad i = 1, \ldots, K_D,\ k = 1, \ldots, K_C$$

Therefore, the inverse of $I - \bar{S}^D_{u,u} \otimes \bar{S}^C$ is approximated as

$$\bigl( I - \bar{S}^D_{u,u} \otimes \bar{S}^C \bigr)^{-1} \approx \sum_{i=1}^{K_D} \sum_{k=1}^{K_C} \frac{1}{1 - \lambda^D_i \lambda^C_k}\, \bigl( \mathbf{v}^D_i \otimes \mathbf{v}^C_k \bigr) \bigl( \mathbf{v}^D_i \otimes \mathbf{v}^C_k \bigr)^T$$

Then, the product $\bigl( I - \bar{S}^D_{u,u} \otimes \bar{S}^C \bigr)^{-1} \bigl( \bar{S}^D_{u,l} \otimes \bar{S}^C \bigr)$ is simplified as

$$\bigl( I - \bar{S}^D_{u,u} \otimes \bar{S}^C \bigr)^{-1} \bigl( \bar{S}^D_{u,l} \otimes \bar{S}^C \bigr) \approx \sum_{i=1}^{K_D} \sum_{k=1}^{K_C} \frac{\lambda^C_k}{1 - \lambda^D_i \lambda^C_k}\, \bigl( \mathbf{v}^D_i \otimes \mathbf{v}^C_k \bigr) \Bigl( \bigl[ \mathbf{v}^D_i \bigr]^T \bar{S}^D_{u,l} \otimes \bigl[ \mathbf{v}^C_k \bigr]^T \Bigr)$$

In the last step of the derivation, we again use the property of the direct product in Equation (3.13) and the property of eigenvectors, namely $\bar{S}^C \mathbf{v}^C_k = \lambda^C_k \mathbf{v}^C_k$. Using the above expression for $\bigl( I - \bar{S}^D_{u,u} \otimes \bar{S}^C \bigr)^{-1} \bigl( \bar{S}^D_{u,l} \otimes \bar{S}^C \bigr)$, the predicted class labels $\mathbf{y}_u$ in (3.11) are then rewritten as

$$\mathbf{y}_u = \sum_{i=1}^{K_D} \sum_{k=1}^{K_C} \frac{\lambda^C_k}{1 - \lambda^D_i \lambda^C_k}\, \bigl( \mathbf{v}^D_i \otimes \mathbf{v}^C_k \bigr) \Bigl( \bigl[ \mathbf{v}^D_i \bigr]^T \bar{S}^D_{u,l} \otimes \bigl[ \mathbf{v}^C_k \bigr]^T \Bigr)\, \mathbf{y}_l \qquad (3.14)$$

Compared to the original expression for $\mathbf{y}_u$ in Equation (3.11), the above expression is computationally efficient because it no longer needs to compute the inverse of the matrix $I - \bar{S}^D_{u,u} \otimes \bar{S}^C$ explicitly.
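The sketch below implements the eigenvector approximation of Equation (3.14) with dense numpy eigendecompositions, following the same a, b, c steps used in the procedure of Figure 3.2 below; treating both similarity blocks as symmetric, and the specific argument names, are assumptions of this sketch.

```python
import numpy as np

def relation_propagation_efficient(SD_bar, SC_bar, y_l, n_l, K_D, K_C):
    """Approximate Equation (3.14) with the top eigenvectors of Sbar^D_uu and Sbar^C.

    SD_bar : (n, n) normalized document similarity, the n_l labeled documents first
    SC_bar : (m, m) normalized category similarity
    y_l    : flattened labels of the labeled documents, length n_l * m (order i*m + j)
    Returns the flattened scores for the n - n_l unlabeled documents.
    """
    m = SC_bar.shape[0]
    SD_uu, SD_ul = SD_bar[n_l:, n_l:], SD_bar[n_l:, :n_l]

    lam_D, V_D = np.linalg.eigh(SD_uu)               # eigenpairs of the unlabeled block
    lam_C, V_C = np.linalg.eigh(SC_bar)
    top_D = np.argsort(-lam_D)[:K_D]                 # indices of the largest eigenvalues
    top_C = np.argsort(-lam_C)[:K_C]

    y_u = np.zeros(SD_uu.shape[0] * m)
    for i in top_D:
        a = SD_ul.T @ V_D[:, i]                      # a = [Sbar^D_ul]^T v_i^D
        for k in top_C:
            b = np.kron(a, V_C[:, k]) @ y_l          # b = (a (x) v_k^C)^T y_l, a scalar
            c = np.kron(V_D[:, i], V_C[:, k]) * b    # c = (v_i^D (x) v_k^C) b
            y_u += lam_C[k] * c / (1.0 - lam_D[i] * lam_C[k])
    return y_u
```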
3.2.2 Case Study: Image Categorization

The goal of the experiments is to evaluate the effectiveness of the proposed framework of relation propagation for multi-label learning. In particular, we will address the following research questions in this empirical study:

• Will the relation propagation framework be effective for multi-label learning? In particular, we will examine whether the incorporation of the class correlation and the unlabeled documents will improve the classification performance. To this end, we compare the proposed algorithm to two graph-based approaches: the label propagation approach based on the harmonic function [57] and the spectral graph transducer (SGT) [24].

• How sensitive is the proposed algorithm for multi-label learning to the setup of parameters? As indicated in Figure 3.2, the input to the proposed algorithm includes the two similarity matrices, $S^D$ and $S^C$, and the numbers of required eigenvectors, $K_D$ and $K_C$. In this experiment, we will vary these input variables to examine their impact on the classification performance of the proposed algorithm.

Dataset

We evaluate the proposed algorithm for multi-label learning on the Eurovision St. Andrews photographic collection (ESTA, http://ir.shef.ac.uk/imageclef/2004/stand.html), which is provided by Text and Content-Based Cross Language Image Retrieval (ImageCLEF, http://ir.shef.ac.uk/imageclef/). This collection contains 28133 images that belong to 999 pre-defined categories. Each image is also described by a short textual document; the average length is about 50 words per document. In our experiment, we randomly selected 100 categories and the textual descriptions of 1000 images as our testbed. For the selected testbed, the number of documents for each category varies from 2 to around 600, and the average number of documents per category is around 72. Meanwhile, the number of categories per document varies from 6 to 11, and the average number of categories per document is around 7.

Evaluation Methodology

Similar to previous studies on multi-label text categorization, we evaluate the proposed algorithm using both the F1 measure across documents and the F1 measure across categories. The first metric, referred to as "micro-average", is calculated by first computing the precision and the recall of class labels for each document, then taking the average of the precision and the recall over all the documents in the test set, and finally computing the F1 measure based on the average precision and average recall.
The second metric, referred to as "macro-average", is calculated by first computing the average precision and recall for the documents within each category, then taking the average of the precision and the recall over all the categories, and finally computing the F1 measure based on the average precision and average recall. Since the micro-average is usually dominated by the popular categories while the macro-average is usually decided by the rare categories, by using both metrics we are able to obtain a comprehensive view of the classification performance of the proposed algorithm.

Furthermore, similar to the label ranking algorithms for multi-label learning, the proposed algorithm only estimates the confidence scores of class labels for the unlabeled documents. An additional procedure is needed to decide the set of class labels for each unlabeled document. To obtain a more comprehensive view of the classification performance of the proposed algorithm, for each document we first rank its class labels in descending order of the estimated confidence scores. We then compute the F1 measure based on the average precision and recall at the first 20 ranks. This is similar to the analysis based on Receiver Operating Characteristic (ROC) curves [35]. Finally, since the focus of our studies is on semi-supervised learning, 2%, 5% and 10% of the documents are randomly chosen as the training set and the rest of the documents are used for testing. Each experiment is conducted 10 times, and the average precision and recall across the 10 trials are used to compute the final F1 measure.
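The following sketch shows one reading of the two averaging schemes described above, computed from the confidence scores at a cut-off of the top K ranks. It is a minimal illustration with assumed variable names, not the evaluation code used in the experiments.

import numpy as np

def f1_at_k(scores, Y_true, K=20):
    """scores: (n_docs, n_cats) confidence scores; Y_true: binary label matrix.
    Returns (micro_F1, macro_F1) following the document/category averaging
    described above (ties and empty rows/columns handled naively)."""
    n_docs, n_cats = scores.shape
    # predicted label set of each document: its top-K categories
    Y_pred = np.zeros_like(Y_true)
    top = np.argsort(-scores, axis=1)[:, :K]
    for d in range(n_docs):
        Y_pred[d, top[d]] = 1
    tp = Y_pred * Y_true
    # "micro" average in the sense above: average P and R over documents, then F1
    p_doc = tp.sum(1) / np.maximum(Y_pred.sum(1), 1)
    r_doc = tp.sum(1) / np.maximum(Y_true.sum(1), 1)
    P, R = p_doc.mean(), r_doc.mean()
    micro_f1 = 2 * P * R / (P + R)
    # "macro" average: per-category precision/recall, averaged over categories, then F1
    p_cat = tp.sum(0) / np.maximum(Y_pred.sum(0), 1)
    r_cat = tp.sum(0) / np.maximum(Y_true.sum(0), 1)
    Pc, Rc = p_cat.mean(), r_cat.mean()
    macro_f1 = 2 * Pc * Rc / (Pc + Rc)
    return micro_f1, macro_f1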
Parameter Setup and Baseline Models

As indicated in Figure 3.2, there are four inputs to the proposed algorithm for multi-label learning: $S^D$, $S^C$, $K_D$, and $K_C$. To obtain the document similarity matrix $S^D$, the SMART information retrieval system [39] is first used for preprocessing the documents to generate the weighted term vectors. The preprocessing includes removing stopwords, stemming, and the Okapi term weighting [37]. Then, an RBF kernel function $f(\mathbf{t}_i, \mathbf{t}_j) = \exp(-\gamma\|\mathbf{t}_i - \mathbf{t}_j\|_2^2/\bar{d})$ is used to calculate the document similarity based on the weighted term vectors $\mathbf{t}_i$ and $\mathbf{t}_j$, where $\bar{d}$ is the average distance between any two documents and $\gamma$ is the scaling factor. In the experiment, we will examine the impact of the parameter $\gamma$ on the classification performance of the proposed algorithm. The category similarity matrix $S^C$ is obtained by first representing each category as a binary vector in the space of documents, and then using the cosine similarity between the category vectors as the similarity measurement of categories. Since the number of categories is only 100, we set $K_C = 100$ for all experiments.

We compare our approach for multi-label learning to two state-of-the-art classification algorithms: the label propagation (LP) approach that is based on the harmonic function in [57], and the spectral graph transducer (SGT) [24]. Both approaches are semi-supervised learning algorithms; in particular, they explore the distribution of unlabeled documents for predicting their class labels.

SGT is a semi-supervised version of the ratio-cut algorithm originally proposed for unsupervised learning [23]. Its objective function incorporates a quadratic penalty on the labeled data in addition to minimizing the graph cut:

$$\min_{\mathbf{f}}\ \mathbf{f}^T L\mathbf{f} + c\,(\mathbf{f} - \mathbf{r})^T C(\mathbf{f} - \mathbf{r})$$

where the vector $\mathbf{f}$ is a label probability vector, and the vector $\mathbf{r}$ is defined as

$$r_i = \begin{cases} r_{+} & x_i \text{ is a positive example} \\ -r_{-} & x_i \text{ is a negative example} \\ 0 & x_i \text{ is unlabeled} \end{cases}$$

The matrix $C = \mathrm{diag}(c_1, c_2, \cdots, c_n)$ is a diagonal cost matrix allowing for a different misclassification cost for each data example. The trade-off between the graph cut value and the training error penalty is balanced through the constant $c$.

Previous studies have shown that these two approaches are able to outperform a number of comparative semi-supervised learning algorithms, such as the Transductive Support Vector Machine (TSVM) [24]. However, one drawback of these two approaches is that neither of them is able to explore the correlation among different classes. By comparing to them, we are able to examine whether the proposed algorithm is able to improve the classification performance by incorporating the class correlation information. The implementation of the Spectral Graph Transducer used in our study was downloaded from http://svmlight.joachims.org/. For easy reference, we refer to the proposed algorithm for multi-label learning as "RP". Finally, both baseline models use the same document similarities as the proposed algorithm.

Experiment (I): Comparison to Baseline Models

Figure 3.3 shows the classification results when 2%, 5% and 10% of the documents are used for training. The left panel of each figure shows the F1 measure of the micro-average, and the right panel shows the F1 measure of the macro-average. In this experiment, we set the scaling factor $\gamma$ to be 1, $K_D = 20$ and $K_C = 100$.

Figure 3.3: Classification F1 results comparison using Relation Propagation. Panels (a), (b) and (c) correspond to 2%, 5% and 10% of the documents used for training; each panel plots the micro-average F1 measure (left) and the macro-average F1 measure (right) against the rank (1 to 20) for RP, LP and SGT.

First, we observe that as the size of the training set increases, both the micro-average and macro-average F1 measures improve for all three algorithms. This is expected because more training data provides more information and thus improves the performance. Second, we observe that the spectral graph transducer is able to outperform the label propagation algorithm noticeably when more than 2% of the documents are used for training. This is consistent with the results of the previous study in [24]. Third,
compared to label propagation and the spectral graph transducer, we observe that the relation propagation algorithm is able to achieve significantly better classification accuracy in all the cases, in terms of both the micro-F1 and macro-F1 measures. This result indicates that exploring the relationship among different types of objects does help improve the classification performance.

Experiment (II): Sensitivity Analysis

Figure 3.4: Classification results of the relation propagation using different $\gamma$'s ($K_D = 20$); the micro-average F1 measure (left) and the macro-average F1 measure (right) are plotted against the rank for $\gamma = 0.5, 1, 2, 4, 6$.

In this experiment, we examine how the scaling parameter $\gamma$ and the number of eigenvectors $K_D$ affect the performance of the proposed relation propagation algorithm. The number of eigenvectors for categories $K_C$ is set to be 100.

In the first experiment, we fix $K_D = 20$ and vary $\gamma$ from 0.5 to 1, 2, 4 and 6. Figure 3.4 shows the micro-averaged F1 and the macro-averaged F1 of the proposed algorithm for different $\gamma$. We observe that the classification performance of the proposed algorithm is not impacted much when $\gamma$ is set to be 0.5, 1, and 2. A significant degradation is observed when $\gamma$ is increased to 4 and 6. This is because, when the scaling factor $\gamma$ is too large, the similarity between any two documents based on the RBF function will be almost zero, thus making it impossible to explore the unlabeled data. This is consistent with the observation in [57], i.e., it is only possible to explore the correlation among unlabeled data when the scaling parameter $\gamma$ is small.

Figure 3.5: Classification results of the relation propagation using different $K_D$'s ($\gamma = 1$); the micro-average F1 measure (left) and the macro-average F1 measure (right) are plotted against the rank for $K_D = 5, 20, 30, 50, 70$.

In the second experiment, we fix $\gamma = 1$ and vary the number of eigenvectors $K_D$ from 5 to 20, 30, 50, and 70. Figure 3.5 shows the results for different $K_D$. We observe that the classification performance is very stable for $K_D \geq 20$ for both the micro-average and the macro-average F1 measures. However, the performance is significantly degraded when $K_D = 5$ for the micro-average F1 measure. Overall, the macro-average F1 measure is not affected significantly by the number of eigenvectors. Generally speaking, when $K_D \geq 20$, the algorithm has the best performance.

3.3 Applications for Ranking Problem

The relation propagation model can be easily applied to ranking problems. Since the model generates confidence scores for the objects in question, it is straightforward to rank the objects according to their confidence scores. We will use the collaborative filtering application as an example and describe how we can apply the relation propagation approach to the ranking problem. Empirical studies will be discussed after the model description.
3.3.1 Ranking Learning Model for Collaborative Filtering

The goal of collaborative filtering is to predict the preferences or interests of a particular user based on the available votes of the training users, which are sometimes called community data. Many methods have been proposed for collaborative filtering in the past years. Most methods assume that users with similar interests tend to rate items similarly. Thus, the similarity between users in their rating patterns becomes the key to most collaborative filtering methods. The user similarity can be used either for identifying a subset of training users that share similar interests with a test user, or for clustering a large number of users into a small number of user classes.

One of the key challenges in collaborative filtering is the sparse data problem, namely that each user only provides a limited number of ratings. This is often the case in real-world applications, given that few users are willing to spend time rating a large number of items. The sparse data problem often results in an unreliable estimation of the user similarity, which can significantly degrade the performance of collaborative filtering. This is because two users with similar interests may not share any commonly rated items and, as a result, have a zero user similarity. Many studies have been devoted to the sparse data problem in collaborative filtering [49, 6]. One strategy [19, 34] is to cluster the training users into a small number of classes; instead of identifying the individual training users that are similar to a test user, we identify the class of training users that fits best with the test user. Another strategy [57] is to fill out the ratings of the unrated items for the training users. This can be done either by propagating the rating of one item by one training user to the rating of another item by the same user if these two items are similar, or by propagating the rating of one item by one training user to the rating of the same item by another user if these two users share similar ratings for a number of items.

Our relation propagation model can be applied to address the sparse data problem in collaborative filtering. The key observation behind this work is that most of the previous work on collaborative filtering addresses the sparse data problem by exploring the user similarity and/or the item similarity separately. As a result, it only allows the rating information of the same item to be propagated among different users, or the rating information of the same user to be propagated among different items. This limitation prevents the system from answering questions such as: if user $y_1$ rates item $x_1$ as 5, what is the expected rating of item $x_2$ by user $y_2$, provided that users $y_1$ and $y_2$ are similar in their preference of items and, in the meantime, items $x_1$ and $x_2$ are similar? The relation propagation approach for collaborative filtering can address this problem by allowing the rating information to be propagated among different users and different items simultaneously.

To apply the framework, we first construct a weighted graph as follows: each node $v_{(i,j)} = (u_i, o_j)$ in the graph represents the rating of item $o_j$ by user $u_i$; any two nodes $v_{(i,j)}$ and $v_{(k,l)}$ in the graph are connected by an edge whose weight reflects the correlation between the two corresponding rating relationships: a large weight indicates that user $u_i$ will give a similar rating to item $o_j$ as the rating given by user $u_k$ to item $o_l$, and vice versa.
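One natural choice of edge weight, consistent with the energy function introduced later in this section, is the product of the user similarity and the item similarity, so that the full weight matrix over rating nodes is the direct product of the two similarity matrices. A minimal numpy sketch follows; the names and the toy numbers are illustrative.

import numpy as np

def rating_graph_weights(S_u, S_o):
    """Weight matrix over rating nodes v_(i,j) = (user u_i, item o_j).
    The weight between v_(i,k) and v_(j,l) is taken as S_u[i, j] * S_o[k, l],
    i.e. the direct (Kronecker) product of the two similarity matrices."""
    return np.kron(S_u, S_o)   # shape (n*m, n*m); node (i, k) maps to index i*m + k

# toy example: 3 users, 2 items
S_u = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
S_o = np.array([[1.0, 0.5],
                [0.5, 1.0]])
W = rating_graph_weights(S_u, S_o)
# the edge between (user 0, item 1) and (user 1, item 0) has weight 0.8 * 0.5
assert np.isclose(W[0 * 2 + 1, 1 * 2 + 0], 0.8 * 0.5)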
The rest of this section is arranged as follows: We first present how the label propaga- tion can be applied on collaborative filtering; we then introduce the relation propagation model for collaborative filtering, and the related algorithm. To better motivate the proposed framework for collaborative filtering, we will start with the description of the label propagation method for collaborative filtering. We will then describe the relation propagation framework, which can be viewed as a generalization of 39 the label propagation method in [57]. Before getting to the details, we will first introduce the notations that will be used throughout this paper. Let U = (111, 112. . . . ,u.,,) denote all users, where n. is the number of users including both the training users and the test users. Let the similarities of users denoted by the matrix S“ = [Sig-[mm where each element SU- represents the similarity between two users in their rating patterns. Let the collection of items denoted by O = (01, 02, . . . ,0,,,) where m is the number of items. Let the similarity of items denoted by the matrix S0 = [Sfflmxm where element SE]. represents the similarity between two items in their descriptive features. In the case of movie recommendation, each item corresponds to a different movie, and is described by a set of movie categories. Let the ratings of items 0 by users Ll denoted by the matrix R = [R,-,j],,xm. Each element Rm- 6 {0,1,2,...,-r} represents the rating of the j-th item by the i-th user. It takes value between 1 and r when the j-th item is rated by the i-th user, and 0 when it is not. We also write matrix R as R = (r1, r2, . . . ,rm) and R = (El, E2, . . . ,f'n)T, where each vector r,- represents the ratings of the i—th item by all the users, and f,- represents the ratings of all the items by the i-th user. Given the above information, the goal of collaborative filtering is to predict the rating of a given item by the [681 user. In order to apply the label propagation method to collaborative filtering, we will erase the difference between the training users and the test users. This is because predicting the rating of items by the test users is equivalent to predicting the rating of the unrated items by the training users. So, the question that label propagation method will address is to predict the ratings of a given item 0;, for all the users. For the sake of simplicity, let’s assume that only the first 71;, users among Ll rate the item 01,, and our goal is to estimate the ratings of item 01,. for the remaining 71,, = n — 71; users. For the convenience of the presentation, we refer to the first at users as the “labeled” users and the remaining 11,, users as the “unlabeled” users. Following the idea of the label propagation [57], we search for 40 *e-"vv ‘ 1 .“OJr; .‘g’ic‘...r,‘. the rating vector r], that minimizes the following energy function: 2: S,“ R”, —J-1.)2= rZL"rk (3.15) i j: 1 where matrix L“ = [L;f,]nx,, is the graph Laplacian for the similarity matrix S“. It is defined as L“ = D” — S" where D“ = diag(D‘,‘, Dg, . . . ,D“) )and D“: Z,_1S rewrite the vector rk as rk— — (rk, r},) where vector r, refers to the ratings that are already provided by the labeled users, and r}, refers to the ratings to be predicted for the unlabeled users. Similarly, we rewrite L“ and S“ as ’U Mu SU- S’u u 1.1 1, , S“ ___] 1.1 1 (3.16) LU. Lu 11 83’.” 11 l 11, 11 11.! L” = where the sub-indices l and 11 stand for the training users and the test users. 
The key advantage of using the label propagation method for collaborative filtering lies in its capability of utilizing the similarity information of all the training users, including both the rated users and the unrated users, to predict the rating for the test user. In this aspect, the label propagation method for collaborative filtering is similar to the clustering approach and to the collaborative filtering methods based on eigenvector analysis [40] and matrix factorization [44, 36].

The problem with the label propagation approach for collaborative filtering is that it is unable to exploit the matrix $S^o$, the similarity information of items. To address this problem, we propose the relation propagation framework, which effectively exploits the similarity of users and the similarity of items simultaneously. To this end, we modify the energy function in Equation (2.4) as follows:

$$E = \sum_{i,j=1}^{n}\sum_{k,l=1}^{m} S^u_{i,j}\,S^o_{k,l}\left(R_{i,k} - R_{j,l}\right)^2 \qquad (3.18)$$

In the above energy function, we weight the squared difference $(R_{i,k} - R_{j,l})^2$ by the product between the user similarity $S^u_{i,j}$ and the item similarity $S^o_{k,l}$. In other words, the ratings $R_{i,k}$ and $R_{j,l}$ should be close if user $u_i$ shares a similar interest with user $u_j$ (i.e., $S^u_{i,j}$ is large), and item $o_k$ is similar to item $o_l$ in their descriptive features. It is important to note that the energy function in Equation (3.18) reduces to the energy function in Equation (2.4) when the item similarity matrix $S^o$ becomes an identity matrix, i.e., $S^o_{k,l} = \delta(k, l)$. This is because

$$E = \sum_{i,j=1}^{n}\sum_{k,l=1}^{m} S^u_{i,j}\,\delta(k,l)\left(R_{i,k} - R_{j,l}\right)^2 = \sum_{k=1}^{m}\sum_{i,j=1}^{n} S^u_{i,j}\left(R_{i,k} - R_{j,k}\right)^2 = \sum_{k=1}^{m} E_k.$$

To minimize the energy function in Equation (3.18), we first introduce a matrix $S^{u,o} = [S^{u,o}_{(i,k),(j,l)}]_{nm\times nm}$. Using the property of the direct product, i.e., $(A\otimes B)(C\otimes D) = (AC)\otimes(BD)$, and the relations $\bar{L}^{u,o}_{u,u} = I - \bar{S}^u_{u,u}\otimes\bar{S}^o$ and $\bar{S}^{u,o}_{u,l} = \bar{S}^u_{u,l}\otimes\bar{S}^o$, we have the following simplification:

$$\left(\mathbf{v}^u_i\otimes\mathbf{v}^o_j\right)^T\left(\bar{L}^{u,o}_{u,u}\right)\left(\mathbf{v}^u_i\otimes\mathbf{v}^o_j\right) = 1 - \lambda^u_i\lambda^o_j, \qquad \left(\mathbf{v}^u_i\otimes\mathbf{v}^o_j\right)^T\left(\bar{S}^{u,o}_{u,l}\,\mathbf{r}_l\right) = \lambda^o_j\left(\left[\mathbf{v}^u_i\right]^T\bar{S}^u_{u,l}\right)\otimes\left[\mathbf{v}^o_j\right]^T\,\mathbf{r}_l.$$

Using the above simplification, the objective function above can be expanded as follows:

$$\sum_{i=1}^{K_u}\sum_{j=1}^{K_o}\left[-\frac{1}{2}\left(1 - \lambda^u_i\lambda^o_j\right)\alpha_{i,j}^2 + \alpha_{i,j}\,\lambda^o_j\left(\left[\mathbf{v}^u_i\right]^T\bar{S}^u_{u,l}\right)\otimes\left[\mathbf{v}^o_j\right]^T\,\mathbf{r}_l\right] \qquad (3.27)$$

Then the optimal solution for $\alpha_{i,j}$ that maximizes the above objective function is

$$\alpha_{i,j} = \frac{\lambda^o_j}{1 - \lambda^u_i\lambda^o_j}\left(\left[\mathbf{v}^u_i\right]^T\bar{S}^u_{u,l}\right)\otimes\left[\mathbf{v}^o_j\right]^T\,\mathbf{r}_l \qquad (3.28)$$

Note that if we substitute $\alpha_{i,j}$ into Equation (3.26), we get the same solution as before.

3.3.2 Case Study 1: Movie Recommendations

The goal of our experiment is to evaluate the effectiveness of the relation propagation model given sparse data. We compare the relation propagation model with five baselines: the Pearson Correlation Coefficient algorithm (PCC), the Personality Diagnosis model (PD), the Aspect model (Aspect), the Flexible Mixture model (FMM) and the Label Propagation model (LP) described in the previous section.

Dataset

We use the dataset of movie ratings named MovieLens (http://www.grouplens.org/) as our test bed. The MovieLens dataset includes 943 users and 1682 movies. Each movie is rated between 1 and 5, with 5 as the best rating. The average number of ratings provided by each user is around 374. To demonstrate the problem of sparse data, we selected 200 training users, and for each test user only 5, 10 or 20 movies are randomly selected as the rated items.
Evaluation Metrics

Since one of the goals of collaborative filtering is to rank the unseen movies for a test user, we apply each of the algorithms in comparison to order the unseen movies based on their predicted preferences, and evaluate these algorithms based on their ranking lists. The evaluation metric used in our experiments is the average precision [16] (AP), a commonly used metric in information retrieval. It measures the agreement between the top $K$ movies that are ranked by the automatic algorithms and the top $K$ movies that are ranked by the test user. To this end, we first rank the movies in descending order of their true ratings by a test user, and then compute the precision $P_K$ of an algorithm for the test user as follows:

$$P_K = \frac{1}{K}\sum_{k=1}^{K}\frac{k}{\mathrm{rank}(t_k)} \qquad (3.29)$$

where $t_k$ is the $k$-th movie ranked by the true ratings of the test user. Since the users' ratings only provide a partial order of the movies, we break ties by choosing a random ordering among the movies with the same rating. Finally, the average precision is computed by taking the average of the precision at $K$ over all test users. Evidently, the higher the average precision, the better the performance. Notice that we did not employ the Mean Absolute Error (MAE) for evaluation, as a number of collaborative filtering papers did. This is because our algorithm is only able to generate a ranking list of items for each individual test user; it is unable to predict the rating categories for items and, as a result, we cannot measure the MAE metric. Finally, each experiment is conducted 10 times and the results across the 10 trials are used as the final evaluation metric.

Parameter Setup

Our algorithm needs the inputs $S^u$, $S^o$, $K_u$ and $K_o$. The user similarity matrix $S^u$ is computed using the RBF kernel function $f(\tilde{\mathbf{r}}_i, \tilde{\mathbf{r}}_j) = \exp(-\lambda\|\tilde{\mathbf{r}}_i - \tilde{\mathbf{r}}_j\|_2^2/\bar{d})$ based on the user rating vectors $\tilde{\mathbf{r}}_i$ and $\tilde{\mathbf{r}}_j$, where $\bar{d}$ is the average distance between any two rating vectors and $\lambda$ is the scaling factor. The movie similarity matrix $S^o$ is obtained by representing each movie as a binary vector in the space of movie categories and using the cosine similarity between the movie vectors as the similarity measurement. Due to space limitations, we do not discuss the impact of different parameter values on the proposed algorithm. To speed up the computation, we set the number of eigenvectors $K_u = K_o = 20$ for all experiments according to empirical experience. We measure the average precision over the top 20 ranks for all algorithms.
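The precision of Equation (3.29) can be computed as in the short sketch below, a minimal illustration that assumes numpy and breaks ties in the true ratings randomly as described above; the names are illustrative.

import numpy as np

def precision_at_k(pred_scores, true_ratings, K=20, rng=None):
    """P_K from Equation (3.29): t_k is the k-th movie under the true ratings,
    rank(t_k) is its 1-based position in the algorithm's ranking."""
    rng = rng or np.random.default_rng()
    n = len(true_ratings)
    # true ordering with random tie breaking: sort by (-rating, random key)
    noise = rng.random(n)
    true_order = sorted(range(n), key=lambda i: (-true_ratings[i], noise[i]))
    # algorithm's ranking: rank[i] = 1-based position of movie i
    pred_order = np.argsort(-np.asarray(pred_scores))
    rank = np.empty(n, dtype=int)
    rank[pred_order] = np.arange(1, n + 1)
    return sum((k + 1) / rank[true_order[k]] for k in range(K)) / K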
Experiment Results

Table 3.1 shows the average precision for the top 20 ranked movies given 5, 10 and 20 observed ratings provided by each test user. First, we can see that the average precision for all approaches increases as the number of observed movies increases. This is expected since more training data should lead to better performance. Second, the label propagation approach improves the ranking accuracy compared to the Pearson correlation coefficient approach. However, the label propagation approach did not perform as well as the Aspect Model and the Flexible Mixture Model. We believe this is related to the capability of the algorithms to explore the correlation among users and the correlation among movies. Both the Aspect Model and the Flexible Mixture Model explore the movie correlation and the user correlation by data clustering. In contrast, the label propagation approach only explores the correlation among users. Third, the relation propagation (RP) algorithm outperforms the other approaches considerably. Most noticeably, the ranking precision of the relation propagation algorithm is 56.7% when the number of rated movies for the test user is 5. This number is better than the ranking precision of the other comparative algorithms when 20 movies are rated by the test users. We attribute the success of the relation propagation algorithm to the fact that the proposed algorithm allows relevance information to be propagated among users and movies simultaneously. This is particularly critical for alleviating the problem of sparse data. Although both the Flexible Mixture Model and the Aspect Model are able to alleviate the sparse data problem by user clustering, they are limited in that all the users in the same class have the same rating for the same movies. The proposed algorithm relieves this limitation through the propagation scheme. We also verified that the improvement in average precision across the 10 trials is statistically significant at the level of p < 0.05 based on the Student's t-test.

Furthermore, we vary the number of top ranked movies that are rated by the users. Figure 3.6 shows the average precision for different numbers of top ranked movies for each algorithm, given 5, 10 and 20 observed movies for each test user. Again, we observe that in all cases the proposed relation propagation algorithm is able to outperform the other algorithms considerably. This result again shows the power of the proposed algorithm.

Alg.    | Given 5       | Given 10      | Given 20
RP      | 0.567±(0.003) | 0.569±(0.007) | 0.571±(0.020)
LP      | 0.434±(0.004) | 0.446±(0.003) | 0.462±(0.004)
FMM     | 0.489±(0.003) | 0.497±(0.001) | 0.517±(0.004)
PD      | 0.470±(0.001) | 0.485±(0.001) | 0.528±(0.001)
PCC     | 0.321±(0.002) | 0.358±(0.001) | 0.433±(0.004)
ASPECT  | 0.449±(0.002) | 0.465±(0.007) | 0.515±(0.001)

Table 3.1: Average precision for the top 20 ranked movies for all algorithms. The variance of the average precision is included inside the parentheses (all values in parentheses are to be multiplied by $10^{-4}$).

3.3.3 Case Study II: Book Crossing

The Book-Crossing dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl from the Book-Crossing community. It contains 278858 users providing 1149780 ratings about 271379 books. Ratings are either explicit, expressed on the scale 1-10 (higher values indicate higher appreciation), or implicit, expressed by zero.

We sort the books by the number of ratings given to them and select the top 1000 books with the most ratings. Based on these books, we remove users with fewer than 30 ratings. The resulting dataset has 160 users with 7612 ratings on 1000 books. The average number of ratings for each user is around 47 and the average number of ratings for each book is around 8. In each experiment, 20% of the users are randomly selected as training users and the rest are used for testing. For each test user, 5, 10 or 20 book ratings are randomly selected as the observed ratings. Each experiment was conducted 10 times and the average across the 10 trials was taken as the final value. Average precision is also used to evaluate the performance.

Experiment Setup

Similar to the movie recommendation case, we compare our method to four other algorithms: Pearson Correlation Coefficient (PCC), the Aspect model (Aspect), Label Propagation (LP) and the Personality Diagnosis model (PD). To be consistent with the evaluation in the movie recommendation case, we also use the Average Precision described in Equation (3.29) as the evaluation metric.
Figure 3.6: Average precision at different numbers of top ranked items (top K rank, K = 1 to 20) for each algorithm, given 5, 10 or 20 observed movies for each test user.

We first compute the average precision on the top 10 ranked books for each user and then take the average across all users as the final result. The relation propagation model requires as input the user similarity matrix $S^u$, the book similarity matrix $S^o$, and the numbers of eigenvectors $K_u$ and $K_o$. $S^u$ is computed by taking the dot product between two user rating vectors, $S^u_{i,j} = \tilde{\mathbf{r}}_i\cdot\tilde{\mathbf{r}}_j$. Similarly, $S^o$ is computed by first representing each book as a vector of the ratings from all users and then taking the dot product between two book rating vectors $\mathbf{r}_i$ and $\mathbf{r}_j$, i.e., $S^o_{i,j} = \mathbf{r}_i\cdot\mathbf{r}_j$. According to empirical experience, we set $K_u = K_o = 20$.

Experiment Results and Analysis

Table 3.2 shows the average precision on the top 10 ranked books for all the algorithms. First, as the number of observed books increases, the average precision increases for all five algorithms. This is expected since more observed ratings provide more information for predicting the test data. Second, the LP algorithm performs the worst among all algorithms. This shows that propagating the ratings by using only the user similarity may not be enough for accurate prediction. The simple PCC method generally works better than the LP algorithm, although worse than the other approaches. Third, the PD model and the Aspect model work better than the LP model and the PCC method, but worse than the RP model. In particular, the performance of the PD model is very close to that of the RP model; grouping users by the similarity of their personality seems an effective strategy for improving the performance. Finally, the RP model performs the best among all the approaches. We believe this is the contribution of exploring the correlation among users and books. All results are statistically significant at the 95% level.

Alg.    | Given 5       | Given 10      | Given 20
RP      | 0.526±(0.003) | 0.569±(0.115) | 0.718±(0.051)
LP      | 0.354±(0.125) | 0.426±(0.106) | 0.625±(0.334)
PD      | 0.516±(0.007) | 0.556±(0.023) | 0.705±(0.001)
PCC     | 0.403±(0.182) | 0.448±(0.138) | 0.623±(0.099)
ASPECT  | 0.480±(0.220) | 0.524±(0.201) | 0.686±(0.139)

Table 3.2: Average precision for the top 10 ranked books for all five algorithms. Note: the variance of the average precision is included inside the parentheses (all values in parentheses are to be multiplied by $10^{-4}$).

Figure 3.7 shows the average precision at different numbers of top ranked books given 5, 10 and 20 observed ratings. The RP model performs best at each rank, and the result is consistent with the above analysis.
Figure 3.7: Average precision at different numbers of top ranked books (rank 1 to 10) for the test users, given 5, 10 or 20 observed books.

3.4 Drawbacks of Label Propagation Approaches

The label propagation approach has attracted a lot of attention from researchers in the machine learning area with its promising performance in many applications, including information retrieval, classification and ranking problems. In the previous sections, motivated by the label propagation approach, we presented the relation propagation model and its applications. Empirical studies showed that, in many problems, it is very helpful to propagate the class (or rating) information on an appropriately constructed weighted graph. However, there are still some general issues with the traditional label propagation approaches.

The label propagation approach essentially addresses a ranking problem. In order to apply the label propagation approach, we construct a graph based on the given data, including training and testing examples. Then the labeling information on the training data is propagated through the graph such that the entire system reaches a global equilibrium. Each node in the graph gets a corresponding numeric confidence score computed through the propagation. Finally, we compute a ranking list based on the confidence score assigned to each node and make the prediction accordingly. Although this sounds like a practical idea, the problem comes from propagating ordinal values (or class labels) as numerical values. In many cases, the labeling information for data examples does not carry numerical meaning. The label values are usually discrete and finite. For instance, in a classification problem, each class label is represented by a number that should carry only ordinal meaning. Consider the case where there are three classes $c_1 = 1$, $c_2 = 2$ and $c_3 = 3$. We cannot say that class $c_1$ should be counted less than $c_2$ and $c_3$ because numerically $c_1$ is smaller than $c_2$ and $c_3$. Furthermore, there is no order among $c_1$, $c_2$ and $c_3$, and changing the label value for each class will not change the nature of the problem; the class numbers are only names for the classes. But during the process of propagation, we actually propagate the labeling information, which is supposed to be ordinal, as numerical values, and take the final numerical results as the basis for assigning ordinal labels. This is also the case in many ranking problems. For example, in the collaborative filtering task, each rating is often represented by a number. Different from class labels, rating information does exhibit an order among different categories: if we have three categories bad (1), good (2) and excellent (3), we can tell that the ratings satisfy 1 < 2 < 3. However, propagating the rating information as numerical values is not an appropriate practice, since those rating numbers only carry ordinal meaning. Replacing a rating number with any other numerical value without breaking the ordering will not change the nature of the problem, but it may significantly affect the final results.

In many ranking problems, it is also very common that the exact rating information is not available. Instead, only the ranking information is given for the observed data. For instance, a web search engine presents a ranking list to the user and the user may click the preferred web pages. This clickthrough history for the user generates partial preference judgements over the clicked pages and the unclicked pages.
This kind of ranked data is a challenge for traditional classification and regression approaches. Without the exact ratings, propagating the rating information is not practical at all. In a word, label propagation has shown itself to be a promising approach for many problems, while the issues discussed above may limit its effectiveness in many scenarios. In the next chapter, we will discuss a solution which addresses the challenge of ranked data for label propagation.

Chapter 4

Propagation for Ranked Data

The analysis of ranked data is often involved in practical applications including information retrieval, collaborative filtering, etc. For example, the ranking presented by a search engine to the user is a list of all web pages ordered by the relevance of each page to the user's query; the links clicked by a user provide some kind of relevance judgement for web pages; the user's past ratings on movies are also ranked data in terms of the user's preferences. Interest in ranking problems is still the source of ongoing research in many fields such as mathematical economics, social science and computer science.

Ranking problems differ from traditional classification and regression problems in that ranking problems explore the ordering of examples rather than the absolute numeric/ordinal values assigned to each example. There are two different views of ranking problems. In the first view, the training examples are presented with ranking scores (e.g., relevance judgements or rank scales) and the goal is to learn a score function from the labeled data examples. One common approach for this kind of ranking problem is to learn a probabilistic classifier or a regression model from the given ranking scores. However, this ranking problem may be hard for regression models since the given ranks are usually discrete and finite. It is also unsuitable for classification models since the ranking information does not exist in class labels. There are a number of works devoted to this area [43, 18, 42, 12].
In [18], the authors proposed a large margin algorithm based on a mapping from objects to scalar utility values, using a distribution-independent risk formulation of rank learning. Kramer et al. investigated the use of a regression tree learner by mapping the ordinal variables into numerical values [38]. In [15], the authors converted a rank learning problem into nested binary classification problems that encode the ordering of the original ranks, so that the results of standard binary classifiers can be organized for prediction. In [43], two approaches based on the large margin principle were proposed for ranking learning, in which the rankings are viewed as ordinal scales. The first approach uses the "fixed margin" policy, in which the margin of the closest neighboring classes is maximized, and the second approach allows for multiple different margins, where the sum of margins is maximized. [9] presented a probabilistic kernel approach to ranking learning based on Gaussian processes.

In the second view, the training data is presented in the format of partial ordering information (e.g., preference judgements) and the goal is to learn a ranking function from the training data examples which generates the ordering of the data. The partial ordering information can be provided in the form of a ranking list of the training instances or of preferences over pairs from the training set. Given the ranking of training instances, the ranking function can be learned by minimizing the difference between the given ranking and the predicted ranking [22, 32]. In [32], an approach was introduced that takes ranking rather than classification as fundamental; this approach uses a generalization of the Mallows model on permutations to combine multiple input rankings. In [22], the author presented a Support Vector Machine approach which automatically optimizes the retrieval quality of search engines using clickthrough data. The ordering information can also be broken into a set of pairwise preference judgements and used in the form of pairwise preferences [25]. Given the pairwise preference judgements, we can construct models which predict the preferences or the rankings [16, 11, 31]. These kinds of approaches appear more appropriate for ranking problems, since we focus on the relative ordering information rather than absolute values. An advantage of learning the ordering directly is that preference judgements can be much easier to obtain than the exact labels required for classification or the numeric values required for regression. This chapter focuses on the approaches utilizing relative ordering information.

Although there has been a significant amount of research on ranking models in statistics and machine learning, there is little work on label propagation models for ranked data. However, it is very important, and also a natural decision, to explore label propagation approaches for ranked data due to the existing challenges discussed in Section 3.4. One of the challenges is related to the meaning carried by the class labels. As described above, the label propagation approaches address all the problems by first propagating the class labels and then generating ranking lists based on the resulting propagation scores computed by the algorithm. This may not be an appropriate approach in certain cases, since the numerical values are not able to preserve the essential meaning of the ordinal information among class labels. For example, consider the numerical class labels 1, 2, 3, and so on. In most cases, the numbers just provide the labeling for classes instead of ordering information: class 1 is not greater or smaller than class 2. However, when the class labels are propagated, ordering information is introduced into the propagation process. Another challenge is the difficulty of obtaining the exact labels of the labeled data. Most previous studies require the absolute label values or numeric values to be given in order to learn the ranking function, which is often not practical. The absolute labeling information is usually hard to obtain, or is very costly and time consuming. In contrast, partial relative ordering information is more easily available for training in some cases. For instance, in a movie rating dataset, the users provide ratings which indicate the preferences of certain movies over other movies. Traditional label propagation approaches may not fit these applications because they are required to propagate the absolute labels. In order to address these challenges, this chapter is devoted to label propagation on ranked data. The rest of the chapter starts with the formal definition of the ranking problem on relative ordering information, followed by a discussion of some recent research works related to rank learning.
Then it proposes a supervised framework of Rank Propagation in which the model is learned from pairwise preference judgements.

4.1 Ranking Problem Setting

Let us formally describe the ranking problem over relative ordering information. Let $X = \{x_1, x_2, \cdots, x_n\}$ be a set of labeled and unlabeled instances, called the domain or instance space. Elements of $X$ are called instances. Each instance $x_i \in \mathbb{R}^d$ ($i = 1, \cdots, n$) is represented by a vector in $d$-dimensional space. We define the preference judgement function, which is the input to the learning algorithm. This function encodes the known relative ordering information about a subset (the training set) of the domain $X$. For instance, in the movie recommendation task, the feedback from a movie reviewer provides the known movie preferences of this reviewer. Formally, we define the preference judgement function to have the form $\Phi: X \times X \rightarrow \mathbb{R}$. $\Phi(x_0, x_1)$ represents the degree to which $x_1$ should be correctly ranked above $x_0$; a positive value means that $x_1$ should be ranked above $x_0$ and a negative value means that $x_1$ should be ranked below $x_0$. We also define $\Phi(x, x) = 0$, and $\Phi$ is anti-symmetric in the sense that $\Phi(x_0, x_1) = -\Phi(x_1, x_0)$ for all $x_0, x_1 \in X$. The learning algorithm outputs a ranking of all instances, represented in the form of a function $H: X \rightarrow \mathbb{R}$. We interpret $H(x_0) > H(x_1)$ as $x_0$ being ranked above $x_1$, and vice versa. The goal of the learning algorithm is to produce a ranking function $H$ which generates a good ranking based on the preference judgements encoded by $\Phi$.

4.2 Related Works

There are a number of studies devoted to the analysis of ranked data. This section introduces some of the literature on learning approaches for ranked data in the machine learning area.

4.2.1 Statistical Measure of Rank Correlation

Guttman's Point Alienation, a statistical measure of rank correlation, has been used to construct an objective function that matches the user's preferences [2] in a combination-of-experts model. Previous work [1] has demonstrated that this measure can be highly correlated with average precision, which is a more typical measure of performance in information retrieval. Thus, optimization of this objective function is likely to lead to optimized average precision performance. The objective function can be optimized by gradient descent even though there are singularities.

Let $R_{\Theta,q}(d)$ be a ranked retrieval function which generates a score indicating the relevance of document $d$ to a query $q$ ($R$ must be differentiable in its parameters $\Theta$). The objective function is:

$$J(R_\Theta) = \frac{-1}{|Q|}\sum_{q\in Q}\frac{\sum_{d\succ_q d'}\left(R_{\Theta,q}(d) - R_{\Theta,q}(d')\right)}{\sum_{d\succ_q d'}\left|R_{\Theta,q}(d) - R_{\Theta,q}(d')\right|} \qquad (4.1)$$

where $Q$ is the set of training queries and $d$ and $d'$ are retrieved documents. $\succ_q$ is defined as a preference relation over document pairs: $d \succ_q d'$ if and only if the user prefers $d$ to $d'$. The goal is to find parameters $\Theta$ so that $R_{\Theta,q}(d) > R_{\Theta,q}(d')$ whenever $d \succ_q d'$. This method is applied to combining experts with the following linear model (three experts are illustrated):

$$R_{\Theta,q}(d) = \Theta_1 E_1(q,d) + \Theta_2 E_2(q,d) + \Theta_3 E_3(q,d)$$

The parameters $\Theta_i$ serve as the scales on the individual experts, denoted $E_i$. The critical feature of such a model is that the score it generates is differentiable with respect to the parameters $\Theta$, so the gradient of $J(R_\Theta)$ with respect to $\Theta$ can be computed directly from Equation (4.1). In an experiment on a commercial retrieval system called CMME, the combination model performed 47% better than any single expert.
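A minimal sketch of the per-query term of Equation (4.1) is given below, assuming the preference pairs and scores are stored in plain Python containers; the names are illustrative.

def point_alienation(scores, pref_pairs):
    """Guttman's point alienation for one query, as in Equation (4.1):
    scores maps a document to its retrieval score R_q(d); pref_pairs is a list of
    (d, d_prime) pairs with d preferred to d_prime."""
    num = sum(scores[d] - scores[dp] for d, dp in pref_pairs)
    den = sum(abs(scores[d] - scores[dp]) for d, dp in pref_pairs)
    return -num / den if den else 0.0

# averaging the per-query values over the training queries gives J(R_Theta)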
4.2.2 Ranking SVM

The Support Vector Machine approach has been applied to the ranking problem for automatically optimizing the retrieval quality of search engines using clickthrough data in [22]. This method is shown to be well founded in a risk minimization framework. Clickthrough data in a search engine can be viewed as triplets $(q, r, c)$, where $q$ is the query, $r$ is the ranking presented by the search engine and $c$ is the set of links the user clicked on. Since the clicked-on link set $c$ depends on the query $q$ and the presented ranking $r$, partial relative relevance judgement is a better interpretation of clickthrough data than absolute relevance.

The problem is formalized in this approach as follows: given a query $q$ and a document set $D = \{d_1, \cdots, d_m\}$, the retrieval system should return a ranking of the documents in $D$ according to their relevance with respect to $q$. Since retrieval systems usually cannot achieve an optimal ranking, a retrieval function $f$ is evaluated by how closely its ordering $r_{f(q)}$ approximates the optimum $r^*$. Both $r^*$ and $r_{f(q)}$ are considered as weak orderings. Kendall's $\tau$ [26] is used as the performance measure to evaluate the closeness between $r^*$ and $r_{f(q)}$. It is defined as

$$\tau(r_a, r_b) = \frac{P - Q}{P + Q} = 1 - \frac{2Q}{\binom{m}{2}}$$

where $P$ is the number of concordant pairs and $Q$ is the number of discordant pairs between $r_a$ and $r_b$, $m$ is the number of documents in the finite domain $D$, and the sum of $P$ and $Q$ is $\binom{m}{2}$ for strict orderings. It can be proved that maximizing $\tau(r_{f(q)}, r^*)$ is connected to improved retrieval quality. Given an independently and identically distributed training sample $S$ of size $n$ containing queries $q$ with their target rankings $r^*$,

$$(q_1, r_1^*), (q_2, r_2^*), \cdots, (q_n, r_n^*),$$

the learner $\mathcal{L}$ will find a ranking function $f$ from a family of ranking functions $F$ that maximizes the empirical $\tau$ defined as

$$\tau_S(f) = \frac{1}{n}\sum_{i=1}^{n}\tau\left(r_{f(q_i)}, r_i^*\right). \qquad (4.2)$$

This SVM ranking algorithm will find the function $f$ out of a family of ranking functions $F$ that maximizes (4.2) and generalizes well beyond the training data. Given the class of linear ranking functions

$$(d_i, d_j) \in f_{\vec{w}}(q) \iff \vec{w}\,\Phi(q, d_i) > \vec{w}\,\Phi(q, d_j)$$

where $\vec{w}$ is a weight vector which needs to be learned and $\Phi(q, d)$ is a mapping onto a feature vector describing the match between query $q$ and document $d$, the optimization problem is defined as follows.

OPTIMIZATION PROBLEM 1 (RANKING SVM)

minimize:
$$V(\vec{w}, \vec{\xi}) = \frac{1}{2}\vec{w}\cdot\vec{w} + C\sum\xi_{i,j,k} \qquad (4.3)$$
subject to:
$$\forall(d_i, d_j) \in r_1^*:\ \vec{w}\,\Phi(q_1, d_i) \geq \vec{w}\,\Phi(q_1, d_j) + 1 - \xi_{i,j,1}$$
$$\cdots$$
$$\forall(d_i, d_j) \in r_n^*:\ \vec{w}\,\Phi(q_n, d_i) \geq \vec{w}\,\Phi(q_n, d_j) + 1 - \xi_{i,j,n}$$
$$\forall i\,\forall j\,\forall k:\ \xi_{i,j,k} \geq 0$$

PRank [12] learns a ranking rule online from a sequence of instance-rank pairs $(\mathbf{x}^t, y^t)$, where each instance $\mathbf{x}^t \in \mathbb{R}^n$ and $y^t$ is an element of the finite set $\mathcal{Y} = \{1, 2, \cdots, k\}$ with $>$ as the order relation. The ranking rule $H: \mathbb{R}^n \rightarrow \mathcal{Y}$ is a mapping from the instances to the ranks. The family of ranking rules $H$ employs a vector $\mathbf{w} \in \mathbb{R}^n$ and a set of $k$ thresholds $b_1 \leq \cdots \leq b_{k-1} \leq b_k = \infty$. The predicted rank is then defined to be the index of the first threshold $b_r$ for which $\mathbf{w}\cdot\mathbf{x} < b_r$. The goal of the algorithm is to learn a ranking rule minimizing the ranking loss, which is defined to be the number of thresholds between the true rank and the predicted rank. To learn the ranking rule online, the true rank $y$ is expanded into $k-1$ variables $y_1, \cdots, y_{k-1}$; $y$ induces the vector

$$(y_1, \cdots, y_{k-1}) = (+1, \cdots, +1, -1, \cdots, -1)$$

where $r$ is the maximum index for which $y_r = +1$ and $r = y - 1$. Thus, the predicted rank is correct if $y_r(\mathbf{w}\cdot\mathbf{x} - b_r) > 0$ for all $r$; otherwise the rule is updated as follows:

• replace $b_r$ with $b_r - y_r$ for each $r$ for which $y_r(\mathbf{w}\cdot\mathbf{x} - b_r) \leq 0$;
• replace $\mathbf{w}$ with $\mathbf{w} + (\sum_r y_r)\,\mathbf{x}$, where the sum runs over the indices $r$ for which $y_r(\mathbf{w}\cdot\mathbf{x} - b_r) \leq 0$.
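One online step of the update rules above can be written compactly as in the following sketch, a minimal numpy illustration with assumed variable names (the threshold array is assumed to carry the fixed $b_k = \infty$ as its last entry).

import numpy as np

def prank_update(w, b, x, y):
    """One online PRank step for instance x with true rank y in {1,...,k};
    b has k entries with b[-1] = +inf. Follows the update rules above."""
    k = len(b)
    # expand the true rank into k-1 signs: +1 for r < y, -1 otherwise
    y_vec = np.where(np.arange(1, k) < y, 1, -1)
    score = w @ x
    wrong = y_vec * (score - b[:-1]) <= 0        # thresholds on the wrong side
    if wrong.any():
        w = w + y_vec[wrong].sum() * x           # w <- w + (sum_r y_r) x
        b = b.copy()
        b[:-1][wrong] -= y_vec[wrong]            # b_r <- b_r - y_r
    return w, b

def prank_predict(w, b, x):
    # predicted rank = index of the first threshold with w.x < b_r (1-based)
    return int(np.argmax(w @ x < b)) + 1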
The paper proved that PRank maintains a consistent hypothesis, in the sense that it preserves the correct order of the thresholds, and that given an input sequence $(\mathbf{x}^1, y^1), \cdots, (\mathbf{x}^T, y^T)$, the rank loss $\sum_t|\hat{y}^t - y^t|$ is at most $(k-1)(R^2+1)/\gamma^2$, where $R^2 = \max_t\|\mathbf{x}^t\|^2$ and $\gamma = \min_{r,t}\{(\mathbf{w}^*\cdot\mathbf{x}^t - b^*_r)\,y^t_r\}$ (see details in [12]).

4.2.5 Conditional Model on Permutations

CRanking [32], a new approach to ensemble learning, explores conditional models on permutations as a tool for solving ranking problems. This approach views each input classifier in terms of the ranked list of labels that it assigns to the input, builds probability distributions over rankings of the labels, and then learns models on permutation groups. The basic model in this approach is an extension of the Mallows model to the conditional setting, and it has a natural Bayesian interpretation as a generative model because of the invariance properties of the sufficient statistics.

Let $X = \{x_1, \ldots, x_n\}$ be a set of items to be ranked, identified with the numbers $1, \cdots, n$. A permutation $\pi$ is a bijection from $\{1, \cdots, n\}$ to itself. Let $\pi(i)$ denote the ranking given to item $i$ and $\pi^{-1}(i)$ denote the item assigned to rank $i$. The collection of all permutations of $n$ items forms a non-abelian group under composition, called the symmetric group of order $n$ and denoted $\mathcal{S}_n$. Given a set of training instances, a ranking of items needs to be assigned to each instance. The permutation given by the $j$-th input classifier for the $i$-th instance is denoted by $\sigma^{(i)}_j$. The permutation $\pi^{(i)}$ denotes the predicted ranking for the $i$-th instance. $\sigma$ is used to denote a sequence of permutations $\sigma_j$.

A generalization of the Mallows model was proposed for estimating a conditional distribution. Let $\sigma_j \in \mathcal{S}_n$ be the permutation from the $j$-th input classifier and $\theta_j \in \mathbb{R}$ for $j = 1, \cdots, k$. The distribution is described as follows:

$$p(\pi\,|\,\theta, \sigma) = \frac{1}{Z(\theta, \sigma)}\,e^{\sum_{j=1}^{k}\theta_j d(\pi, \sigma_j)}$$

This defines a conditional model when there are multiple instances and each instance is associated with a possibly different set of rankings $\sigma^{(i)}_j$. $Z(\theta, \sigma)$ is the normalizing constant, defined as $Z(\theta, \sigma) = e^{\psi(\theta,\sigma)}$, where $\psi$ is the cumulant function defined as

$$\psi(\theta, \sigma) = \log\sum_{\pi\in\mathcal{S}_n}\exp\left(\sum_{j=1}^{k}\theta_j d(\pi, \sigma_j)\right).$$

The model can be applied to, for example, the rankings of web pages generated by different search engines for a specified query. $\theta_j$ can be viewed as the weight for each ranker in the ensemble. In this scenario, only the $\theta_j$ are the free parameters to be estimated. The likelihood function based on this model is convex in $\theta$. Given a training set consisting of the pairs $D = \{(\pi^{(i)}, \sigma^{(i)})\}$, the parameters can be estimated by maximum conditional likelihood or MAP.

This model is different from other discriminative models in that it has a natural Bayesian interpretation. If we view $\pi$ as a parameter and suppose the $\sigma_j$ are independently sampled from Mallows models with common mode $\pi$ and dispersion parameters $\theta_j$, then under a prior $p(\pi)$ the posterior is

$$p(\pi\,|\,\theta, \sigma) \propto p(\pi)\exp\left(\sum_{j}\theta_j d(\pi, \sigma_j)\right)$$

because of the invariance property of $d(\cdot,\cdot)$. Thus, under a uniform prior on $\pi$, the distribution is precisely the posterior of a generative model for the rankings $\sigma$.
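For small $n$ the conditional distribution above can be evaluated exactly by enumerating the symmetric group, as in the following sketch; the Kendall distance is used as $d$, and all names and toy values are illustrative.

import itertools
import numpy as np

def kendall_distance(pi, sigma):
    """Number of discordant item pairs between two rankings (pi[i] = rank of item i)."""
    n = len(pi)
    return sum((pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
               for i in range(n) for j in range(i + 1, n))

def crank_probability(pi, sigmas, thetas):
    """p(pi | theta, sigma) for the extended Mallows model above, by brute-force
    enumeration of S_n (only feasible for small n)."""
    n = len(pi)
    energy = lambda p: np.exp(sum(t * kendall_distance(p, s)
                                  for t, s in zip(thetas, sigmas)))
    Z = sum(energy(p) for p in itertools.permutations(range(n)))
    return energy(tuple(pi)) / Z

# toy example: two input rankers over 4 items; negative weights favour agreement
sigmas = [(0, 1, 2, 3), (1, 0, 2, 3)]
print(crank_probability((0, 1, 2, 3), sigmas, thetas=[-1.0, -0.5]))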
4.2.6 Conditional Model on Ranking Poset

Another conditional model, defined on the ranking poset, is presented for the ranking problem in [31]. Similar to the CRanking algorithm, this model is also an extension of the Mallows model, and it generalizes the classifier combination methods used by several ensemble learning algorithms, including error correcting output codes, discrete AdaBoost, logistic regression and cranking.

A poset (partially ordered set) is a pair $(P, \leq)$, where $P$ is a set and $\leq$ is a binary relation that satisfies (1) $x \leq x$, (2) if $x \leq y$ and $y \leq x$ then $x = y$, and (3) if $x \leq y$ and $y \leq z$ then $x \leq z$, for all $x, y, z \in P$. The ranking poset $\mathcal{W}_n$ is defined as the poset whose elements are all possible cosets $\mathcal{S}_\lambda\pi$, where $\lambda$ is an ordered partition of $n$ and $\pi \in \mathcal{S}_n$; here $\pi$ denotes a permutation of $\{1, \cdots, n\}$ and $\pi(i)$ denotes the rank given to item $i$. The right coset is defined as

$$\mathcal{S}_{n-k}\pi = \{\sigma\pi\,|\,\sigma\in\mathcal{S}_{n-k}\}, \qquad \mathcal{S}_{n-k} = \{\pi\in\mathcal{S}_n\,|\,\pi(i) = i,\ i = 1, \cdots, k\}.$$

The right coset can be viewed as a partial ranking, where there is a full ordering of the $k$ top-ranked items. Kendall's tau $\tau(\pi, \sigma)$ is used as the distance measure $d$, as defined in (4.2). Given $k$ input rankings $\sigma_1, \sigma_2, \cdots, \sigma_k$ contained in some subset $\Sigma \subseteq \mathcal{W}_n$ of the ranking poset, and a probability mass function $q_0$ on $\mathcal{W}_n$, an exponential model $p_\theta(\pi\,|\,\sigma)$ is given as:

$$p_\theta(\pi\,|\,\sigma) = \frac{1}{Z(\theta,\sigma)}\,q_0(\pi)\exp\left(\sum_{j=1}^{k}\theta_j d(\pi, \sigma_j)\right)$$

where $\theta \in \Theta \subseteq \mathbb{R}^k$, $\pi \in \Pi \subseteq \mathcal{W}_n$ and $\sigma_j \in \Sigma_j \subseteq \mathcal{W}_n$. The term $Z(\theta, \sigma)$ is the normalizing constant

$$Z(\theta, \sigma) = \sum_{\pi\in\Pi}q_0(\pi)\exp\left(\sum_{j=1}^{k}\theta_j d(\pi, \sigma_j)\right).$$

4.3 Rank Propagation for Multi-label Learning

In this section, we propose the idea of rank propagation for the multi-label learning problem. We view the problem of multi-label learning as multi-label ranking: instead of making a hard decision about which subset of class labels should be assigned to each instance, our goal is to rank the class labels according to their relevance to the given instance. Unlike the typical label propagation schemes, which propagate prediction scores, the key idea of rank propagation is to propagate the pairwise preference relationships of class labels between labeled examples and unlabeled examples. We will focus the discussion on rank propagation for supervised learning.

4.3.1 Preliminaries

Let $X_l = (\mathbf{x}^l_1, \mathbf{x}^l_2, \cdots, \mathbf{x}^l_{n_l})$ denote the set of labeled examples. Each example $\mathbf{x}_i$ is labeled by a vector $\mathbf{y}_i = (y_{i,1}, y_{i,2}, \cdots, y_{i,C})$, where $C$ is the number of classes and $y_{i,k} \in \{0, 1\}$ indicates whether example $\mathbf{x}_i$ is assigned to the $k$-th class (1 indicates that the example belongs to the $k$-th class and 0 otherwise). We denote by $Y = (\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_{n_l})$ the class labels assigned to all the labeled instances. Let $X_u = (\mathbf{x}^u_1, \mathbf{x}^u_2, \cdots, \mathbf{x}^u_{n_u})$ denote the set of unlabeled examples. Our goal is to rank the class labels for each unlabeled example.

4.3.2 Preference Matrix

We focus on a single unlabeled example $\mathbf{x}$. We denote by $k(\mathbf{x}, \mathbf{x}')$ the kernel similarity function. For the sake of simplicity, we assume that the kernel similarity function always outputs non-negative values, i.e., $k(\mathbf{x}, \mathbf{x}') \geq 0$ for any $\mathbf{x}$ and $\mathbf{x}'$. To conduct the rank propagation, we encode the class label information into a pairwise preference matrix. More specifically, given the assigned class labels $\mathbf{y}_i$, the preference matrix $M_i \in \mathbb{R}^{C\times C}$ is defined as follows:

$$[M_i]_{k,l} = \begin{cases} a & y_{i,l} = 1 \text{ and } y_{i,k} = 1 \\ b & y_{i,l} = 1 \text{ and } y_{i,k} = 0 \\ a & y_{i,l} = 0 \text{ and } y_{i,k} = 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4.16)$$

where $a$ and $b$ are constants with $a < b$.
We can normalize the preference matrix $M_i$ into a transition matrix $S_i$ by defining
\[
[S_i]_{k,l} = \frac{[M_i]_{k,l}}{\sum_{k'=1}^{C}[M_i]_{k',l}},
\tag{4.17}
\]
so that each column of $S_i$ sums to one. Evidently, the constant $a$ is proportional to the transition probability between classes that are either both selected or both unselected. Similarly, the constant $b$ is proportional to the transition probability from an unselected class to a selected class. It is not difficult to see that the principal eigenvector of $S_i$, i.e., the vector $v$ satisfying $S_i v = v$, is
\[
v_k = \begin{cases} 1/\|y_i\|^2 & y_{i,k} = 1 \\ 0 & y_{i,k} = 0. \end{cases}
\]
Furthermore, we denote by $S \in \mathbb{R}^{C \times C}$ the transition matrix for the test example $x$. Our goal is to search for a transition matrix $S$ that is coherent with the transition matrices in the neighborhood of $x$.

Principal Eigenvector of the Transition Matrix. As claimed above, the principal eigenvector of the transition matrix of each example is consistent with the class assignment of that example. Thus, by computing the principal eigenvector of the transition matrix of a test example, we can generate a ranking list of all class labels for this example and make predictions accordingly.

To further verify this assumption, we randomly generated four datasets. Each data sample is assigned to one or multiple categories. We build the preference matrix for each data sample based on the category information as described above and then compute the principal eigenvector. To evaluate how consistent the ranking of categories based on the principal eigenvector is with the real category information, we use Average Precision as the evaluation metric. Figure 4.1 shows the average precision for each document in the generated datasets. Clearly, for the majority of the data samples, the principal eigenvector of the preference matrix is completely consistent with the real category information. We also verify this assumption on the existing datasets that will be used in the case study discussed in the following section (Figure 4.2).
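The consistency claim can be checked numerically with a short sketch (again illustrative, not the dissertation's code): it column-normalizes the preference matrix of (4.16) into the transition matrix of (4.17) and recovers the principal eigenvector by power iteration. The vectorized construction and the choice $a = 1$, $b = 2$ are assumptions.

```python
import numpy as np

def transition_matrix(y, a=1.0, b=2.0):
    """Preference matrix of Eq. (4.16), column-normalized as in Eq. (4.17)."""
    y = np.asarray(y, dtype=float)
    M = np.where(np.equal.outer(y, y), a, 0.0)   # both selected / both unselected
    M += b * np.outer(y, 1.0 - y)                # selected row, unselected column
    return M / M.sum(axis=0, keepdims=True)

def principal_eigenvector(S, iters=200):
    """Power iteration for the dominant eigenvector v satisfying S v = v."""
    v = np.full(S.shape[0], 1.0 / S.shape[0])
    for _ in range(iters):
        v = S @ v
        v /= v.sum()
    return v

y = np.array([0, 1, 0, 1, 1])       # classes 2, 4 and 5 assigned
v = principal_eigenvector(transition_matrix(y))
print(np.round(v, 3))               # ~1/3 on the assigned classes, ~0 elsewhere
```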
[Figure 4.1: Analysis of principal eigenvectors of the preference matrix for the randomly generated datasets; each panel plots the average precision for each document. (a) 1000 data samples and 50 categories: average number of categories per document is 5 (max 13, min 0); average number of documents per category is 101 (max 127, min 79). (b) 1000 data samples and 100 categories: average categories per document is 15 (max 26, min 4); average documents per category is 149 (max 181, min 117). (c) 2000 data samples and 100 categories: average categories per document is 20 (max 35, min 9); average documents per category is 400 (max 448, min 361). (d) 2000 data samples and 200 categories: average categories per document is 20 (max 35, min 8); average documents per category is 199 (max 239, min 166).]

[Figure 4.2: Analysis of principal eigenvectors of the preference matrix for the study datasets. (a) St. Andrews dataset: average number of categories per document is 6 (max 11, min 1); average number of documents per category is 56 (max 740, min 10). (b) MovieLens dataset: average number of categories per document is 3 (max 6, min 2); average number of documents per category is 108 (max 349, min 1).]

4.3.3 Framework Description

Using the transition matrices, we view the multi-label learning problem as a multi-label ranking problem. This goal can be encoded into the following optimization problem:
\[
\min_{S} \;\sum_{i=1}^{n_l} w_i\, l(S, S_i)
\quad\text{s.t.}\quad S_{k,l} \ge 0,\; \forall k, l = 1, 2, \dots, C, \qquad \sum_{k=1}^{C} S_{k,l} = 1,\; l = 1, \dots, C,
\tag{4.18}
\]
where $w_i = k(x, x_i)$ and $l(S, S')$ is a loss function that measures the difference between the two matrices $S$ and $S'$. Different types of loss functions can be used in this optimization problem. We consider the following two choices.

$l_1$ loss function. It is defined using the trace function as $l_1(S, S') = \mathrm{tr}\big((S - S')^{\mathsf T}(S - S')\big)$. This loss function can be rewritten as
\[
l_1(S, S') = \sum_{k,l=1}^{C} (S_{k,l} - S'_{k,l})^2.
\tag{4.19}
\]
The above equation essentially measures the squared difference between the two matrices across all elements. The solution for $S$ is straightforward when using this loss function. In particular, we solve a series of $C$ optimization problems, one for each column $k$ of $S$:
\[
\min_{S_{1,k},\dots,S_{C,k}} \;\sum_{i=1}^{n_l}\sum_{l=1}^{C} w_i\,\big(S_{l,k} - [S_i]_{l,k}\big)^2
\quad\text{s.t.}\quad S_{l,k} \ge 0,\; l = 1, \dots, C, \qquad \sum_{l=1}^{C} S_{l,k} = 1.
\tag{4.20}
\]
This is a quadratic programming problem. To see the meaning of the result more clearly, we rewrite the problem as
\[
\min_{\mathbf{s}_k} \;\sum_{i=1}^{n_l} w_i\,\|\mathbf{s}_k - \mathbf{s}_k^i\|_2^2
\quad\text{s.t.}\quad \mathbf{s}_k \succeq 0,\; \mathbf{e}^{\mathsf T}\mathbf{s}_k = 1,
\tag{4.21}
\]
where $\mathbf{s}_k = (S_{1,k}, S_{2,k}, \dots, S_{C,k})^{\mathsf T}$ and $\mathbf{s}_k^i = ([S_i]_{1,k}, [S_i]_{2,k}, \dots, [S_i]_{C,k})^{\mathsf T}$. The dual form of this problem (4.22) introduces multipliers $\gamma \succeq 0$ for the non-negativity constraints and $\lambda \ge 0$ for the equality constraint, where $\hat{\mathbf{s}}_k$ is defined as
\[
\hat{\mathbf{s}}_k = \frac{\sum_{i=1}^{n_l} w_i\,\mathbf{s}_k^i}{\sum_{i=1}^{n_l} w_i}.
\tag{4.23}
\]
The solution $\mathbf{s}_k$ is calculated as
\[
\mathbf{s}_k = \hat{\mathbf{s}}_k + \frac{\gamma + \lambda\mathbf{e}}{2\sum_{i=1}^{n_l} w_i}.
\tag{4.24}
\]
As we can see, the two constraints $\mathbf{s}_k \succeq 0$ and $\mathbf{e}^{\mathsf T}\mathbf{s}_k = 1$ result in the additional term $(\gamma + \lambda\mathbf{e})/(2\sum_{i=1}^{n_l} w_i)$.

$l_2$ loss function. It is defined as $l_2(S, S') = \max_{\|z\| \le 1} z^{\mathsf T}(S - S')^{\mathsf T}(S - S')\,z$. This loss function essentially requires the two transition matrices $S$ and $S'$ to be similar along any direction $z$. The interesting point is that the first loss function $l_1$ can also be written in a form similar to $l_2$, namely $l_1(S, S') = \mathbf{e}^{\mathsf T}(S - S')^{\mathsf T}(S - S')\,\mathbf{e}$; in other words, the first loss function only requires the two matrices to match along the direction $\mathbf{e}$. Thus, $l_2(S, S')$ is a more general form than $l_1(S, S')$.

The $l_2$ loss function leads to the following optimization problem:
\[
\min_{S}\;\max_{\|z\| \le 1}\; z^{\mathsf T}\Big(\sum_{i=1}^{n_l} w_i\,(S - S_i)^{\mathsf T}(S - S_i)\Big) z
\quad\text{s.t.}\quad S_{k,l} \ge 0,\; k, l = 1, \dots, C, \qquad S^{\mathsf T}\mathbf{e} = \mathbf{e}.
\tag{4.25}
\]
Since $\max_{\|z\| \le 1} z^{\mathsf T}\big(\sum_{i=1}^{n_l} w_i\,(S - S_i)^{\mathsf T}(S - S_i)\big) z$ is the principal eigenvalue of the matrix $\sum_{i=1}^{n_l} w_i\,(S - S_i)^{\mathsf T}(S - S_i)$, we can rewrite the above problem in the following form:
\[
\min_{t,\, S,\, A_i}\; t
\quad\text{s.t.}\quad \sum_{i=1}^{n_l} w_i A_i \preceq t\, I_C, \qquad
\begin{pmatrix} A_i & (S - S_i)^{\mathsf T} \\ S - S_i & I \end{pmatrix} \succeq 0,
\qquad S_{k,l} \ge 0,\; k, l = 1, \dots, C, \qquad S^{\mathsf T}\mathbf{e} = \mathbf{e},
\tag{4.26}
\]
where $I_C$ and $I$ are both identity matrices.
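For the $l_1$ loss, by contrast, no semidefinite program is needed: since every $S_i$ is a transition matrix (non-negative with unit column sums) and the weights $w_i$ are non-negative, the weighted average $\hat{S} = \sum_i w_i S_i / \sum_i w_i$ of (4.23) already satisfies the constraints of (4.20), so under this reading it is the optimal solution and the correction term in (4.24) vanishes. A minimal illustrative sketch (not the dissertation's implementation; function names and the toy inputs are assumptions):

```python
import numpy as np

def rank_propagation_l1(S_list, weights):
    """Weighted average of the training transition matrices (Eq. 4.23).

    Each S_i is column-stochastic and the weights are non-negative, so the
    average is itself column-stochastic, hence feasible for (4.20) and
    l1-optimal for the test example.
    """
    S_list = np.asarray(S_list, dtype=float)     # shape (n_labeled, C, C)
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w, S_list, axes=1) / w.sum()

def rank_labels(S, iters=200):
    """Rank classes by the principal eigenvector of S (highest score first)."""
    v = np.full(S.shape[0], 1.0 / S.shape[0])
    for _ in range(iters):
        v = S @ v
        v /= v.sum()
    return np.argsort(-v), v

# Toy usage: two labeled neighbours with kernel weights 0.7 and 0.3.
S1 = np.array([[0.5, 0.25], [0.5, 0.75]])
S2 = np.array([[0.9, 0.6], [0.1, 0.4]])
order, scores = rank_labels(rank_propagation_l1([S1, S2], [0.7, 0.3]))
print(order, np.round(scores, 3))
```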
As mentioned above, the $l_1$ loss function can be viewed as a special case of the $l_2$ loss function, and this can be verified through the above optimization problem. To better understand this, consider the special case where $A_i = \mathrm{diag}(a_i^1, a_i^2, \dots, a_i^C)$. Furthermore, we approximate the constraint
\[
\begin{pmatrix} A_i & (S - S_i)^{\mathsf T} \\ S - S_i & I \end{pmatrix} \succeq 0
\tag{4.27}
\]
by its necessary conditions, i.e., $a_i^k \ge \|\mathbf{s}_k - \mathbf{s}_k^i\|_2^2$. As a result of this simplification, we rewrite the original problem as
\[
\min_{t,\, \mathbf{s}_k,\, a_i^k}\; t
\quad\text{s.t.}\quad \sum_{i=1}^{n_l} w_i\, a_i^k \le t,\; k = 1, \dots, C, \qquad a_i^k \ge \|\mathbf{s}_k - \mathbf{s}_k^i\|_2^2,\; k = 1, \dots, C,\; i = 1, \dots, n_l,
\tag{4.28}
\]
\[
\mathbf{s}_k \succeq 0,\; k = 1, \dots, C, \qquad \mathbf{s}_k^{\mathsf T}\mathbf{e} = 1,\; k = 1, \dots, C.
\]
To minimize $t$, it is sufficient to minimize $\sum_{i=1}^{n_l} w_i\, a_i^k$, and it is not hard to see that this is equivalent to the optimization problem using the $l_1$ loss function.

4.3.4 Case Study: Multi-label Categorization

We evaluate the proposed rank propagation scheme on multiple datasets and report the empirical results in the following study. We will answer the following questions:

• Does Rank Propagation really help the multi-label classification task? As discussed before, traditional methods for multi-label classification have problems caused by treating class labels as numerical values. Rank Propagation utilizes the preferences among all categories and can avoid this problem.

• Does the $l_2$ loss function work better than the $l_1$ loss function? The $l_2$ loss function considers the projection of the difference between the preference matrix of the test example in question and the preference matrices of all training examples along all directions. Theoretically, the optimization problem based on the $l_2$ loss function should result in better performance, and the empirical studies confirm this.

• How sensitive is Rank Propagation to its parameters? There are two parameters involved in Rank Propagation: A and B. We give a detailed description of the influence of these parameters on the performance.

Datasets

We evaluate our proposed approach on three datasets: the MovieLens dataset, the St. Andrews dataset and the Yeast dataset. In all three datasets, each document is assigned to one or more categories. We describe the datasets as follows:

• MovieLens C18 dataset: MovieLens is a movie rating dataset. The original dataset provides the ratings of 943 users on 1682 movies from 19 categories. Each rating is an integer ranging from 1 to 5, with 1 being the lowest rating and 5 the highest; a zero rating means that the rating information is not available. We selected 849 movies and 18 categories as our testbed. The average number of categories per movie is around 3 and the average number of movies per category is around 114 in the resulting dataset.

• St. Andrews C10 dataset: This is an image categorization dataset which contains 999 predefined categories and 28133 images along with their captions (textual descriptions). The 28133 captions consist of 44085 terms and 1348474 word occurrences. The maximum caption length is 316 words, with an average of 48 words. We selected 10 categories and 1000 images with their captions as our text categorization testbed (we view each caption as a document). The number of categories per document in the resulting dataset varies from 2 to 4, with an average of around 2. The number of documents per category varies from 108 to around 320, and the average number of documents per category is around 209.

• Yeast C14 dataset: The Yeast dataset is formed by micro-array expression data and phylogenetic profiles with 2417 genes.
Each gene is represented by 103 attributes and is associated with a set of functional classes whose maximum size can potentially be more than 190. The dataset in our experiment contains 14 classes. The number of classes per gene ranges from 1 to 11, with an average of around 5. The number of genes per class ranges from 34 to 1816, with an average of around 732.

Baselines

The proposed Rank Propagation method is a supervised learning algorithm which predicts the labels by propagating the preferences between labeled data and unlabeled data. We refer to the Rank Propagation approaches using the two loss functions as RP L1 and RP L2, respectively. We compare the Rank Propagation approaches with two other supervised learning algorithms as our baselines: the K Nearest Neighbor method (KNN) and the Support Vector Machine (SVM).

KNN is similar to the Rank Propagation method except that KNN propagates the true labels of the training examples, while Rank Propagation propagates the preferences among categories obtained from the given labels of the training examples. In order to generate a ranking list of all categories for each example using KNN, we conduct KNN for each category and generate a ranking list of all categories based on the resulting scores.

Traditional SVM is often used for binary classification tasks. Many algorithms derived from SVM have been proposed for multi-label classification tasks; among them, SVM binary appears to perform the best according to the empirical results in [33]. The basic idea of SVM binary can be summarized in two steps. First, we conduct binary classification for each category and generate a score for each testing example on each category; for a testing example, the score corresponding to one category can be viewed as a confidence score indicating how likely the testing example is to belong to this category. Second, we generate a ranking list of all categories according to the confidence scores for each testing example. The higher a category is ranked, the more likely the testing example should be assigned to this category. Various metrics can be applied to the ranking lists in order to evaluate the performance.
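For concreteness, a small sketch of the KNN baseline's category ranking as described above (illustrative only; the cosine similarity, the plain vote and the function names are assumptions, since the dissertation does not specify these details):

```python
import numpy as np

def knn_category_ranking(X_train, Y_train, x_test, K=10):
    """Rank categories for one test example with a KNN vote.

    X_train : (n, d) feature matrix of labeled examples
    Y_train : (n, C) binary label matrix
    x_test  : (d,) feature vector
    Returns the category indices ordered from most to least likely.
    """
    sims = X_train @ x_test / (
        np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_test) + 1e-12)
    nn = np.argsort(-sims)[:K]            # K most similar labeled examples
    scores = Y_train[nn].sum(axis=0)      # votes per category
    return np.argsort(-scores), scores

# Toy usage with random data: 50 examples, 5 features, 4 categories.
rng = np.random.default_rng(0)
X, Y = rng.random((50, 5)), (rng.random((50, 4)) > 0.5).astype(int)
order, scores = knn_category_ranking(X, Y, rng.random(5), K=10)
print(order, scores)
```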
Evaluation Metrics

In the framework described in Section 4.3.3, we generate a ranking list of all categories for each document. Similarly, we generate a ranking list of all categories for each testing example in the two baseline algorithms. Thus, it is more appropriate to evaluate the performance of the three algorithms using rank-related measurements. In our empirical study, we use the category-wise Precision/Recall-breakeven point (PRBEP), Micro-average F1 score and Macro-average F1 score at different ranks, and the Area under the ROC curve as our evaluation metrics. For the Micro-average F1 score, we first compute the precision and recall at each rank of categories for each document, then compute the average precision and recall at each rank across all documents, and finally compute the F1 score based on this precision and recall. The Macro-average F1 score is computed similarly, but the precision and recall are computed for each category.

The Precision/Recall-breakeven point is a commonly used measure for evaluating text classifiers. It is based on the two statistics recall and precision, which are both widely used in information retrieval; there is a trade-off between high recall and high precision. The Precision/Recall-breakeven point is defined as the value at which precision and recall are equal. We compute the average Precision/Recall-breakeven point across all categories as follows: first, for each category, we compute the precision/recall when assigning the top n examples to this category, where n is the number of examples actually in this category; second, we take the average precision/recall across all categories as the final result.

The ROC (Receiver Operating Characteristic) curve was first used to define detection cut-off points for radar equipment with different operators. It is also a commonly used measure in information retrieval. In an ROC curve plot, sensitivity (true positive rate) is plotted against 1 − specificity (false positive rate). To get a comprehensive picture of the ranking list, we compute the average Area under the ROC curve: first, we compute the true positive rate and false positive rate at each rank for each document and plot the ROC curve for each example; then we compute the area under the ROC curve for each example and take the average as the final result.

By using the above three metrics, we can get a comprehensive picture of the ranking lists of categories generated by the three algorithms. We conduct each experiment 10 times and take the average results over the 10 trials as the final results.

Parameter Setup

There are two parameters involved in the Rank Propagation method: A and B (see details in Section 4.3.3). Since only the ratio of A and B matters, we fix A = 1 and test different B values. We set B = 10, which gives the best performance in all three experiments; we include a discussion of the influence of B values in the following sections. We also need to compute the similarity matrix among data examples in the Rank Propagation method: each entry in the similarity matrix is computed using the cosine similarity measure based on the given representation of the data points. We download SVM-light ([20]) as the implementation of SVM and use the default settings of the software. For KNN, K (the number of nearest neighbors) was set to the empirical value generating the best results.
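As a concrete reading of the category-wise PRBEP described in the Evaluation Metrics section, the following sketch (illustrative, not the dissertation's evaluation code) assigns, for each category, the top n scored documents to that category, where n is the number of documents actually in it, so that precision and recall coincide; skipping empty categories is an added assumption.

```python
import numpy as np

def category_prbep(scores, labels):
    """Category-wise Precision/Recall-breakeven point, averaged over categories.

    scores : (n_docs, C) predicted relevance of each category for each document
    labels : (n_docs, C) binary ground-truth assignments
    For category c with n_c positive documents, the top n_c scored documents
    are assigned to c; precision and recall are then equal by construction.
    """
    n_docs, C = labels.shape
    values = []
    for c in range(C):
        n_c = int(labels[:, c].sum())
        if n_c == 0:
            continue                             # skip empty categories
        top = np.argsort(-scores[:, c])[:n_c]
        values.append(labels[top, c].mean())     # precision = recall at breakeven
    return float(np.mean(values))

# Toy usage: random scores and labels for 100 documents and 6 categories.
rng = np.random.default_rng(1)
labels = (rng.random((100, 6)) > 0.7).astype(int)
scores = labels + 0.5 * rng.random((100, 6))     # noisy but informative scores
print(round(category_prbep(scores, labels), 3))
```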
Results and Analysis I: Comparison among All Methods

We evaluate the effectiveness of the proposed Rank Propagation method in the first experiment. Table 4.1 shows the Precision/Recall-breakeven point for the three datasets described in the previous sections. Tables 4.2, 4.3 and 4.4 present the Area under the ROC curve for the three datasets with different numbers of training examples. Figures 4.3, 4.4 and 4.5 show the Micro-average and Macro-average F1 scores for the three datasets. We have the following observations.

First, for all the algorithms, as the number of training examples increases, the performance improves in terms of all three metrics. Moreover, the curve of F1 scores usually starts at a low value, keeps ascending until it reaches a peak, and finally ends at a fixed point lower than the peak value. This is consistent with intuition, since it is hard to make all correct predictions at the first rank; as the rank increases, it becomes more likely to catch the positive examples for each category. The ascending trend continues until, at a certain point, there are not many positive examples left for each category at the lower ranks, and then the curve descends until it reaches the end. There may be some fluctuation in a curve because some positive examples are not ranked higher than the negative ones.

Second, we can see that in most cases both RP L1 and RP L2 outperform the two baselines, SVM binary and KNN. This indicates that propagating the preferences instead of the true labels does help classification in some applications. The RP L2 method shows significant improvement in terms of all the metrics. This is not hard to understand, since RP L2 considers a more generalized form of the differences and constructs a better optimization problem.

Third, it is interesting that KNN actually outperforms the SVM binary method in some cases, despite KNN being a simple algorithm. For instance, KNN achieves better Micro-average and Macro-average F1 scores than SVM binary for the Yeast dataset in Figure 4.5. This may be related to the nature of the datasets, including the distribution of the positive examples of each class, the numerical representation of the data examples, etc.

Table 4.1: PRBEP for the different datasets with different numbers of training examples.

(a) MovieLens dataset
Classifier   20              40              80              100
RP L1        0.135 ± 0.008   0.138 ± 0.006   0.150 ± 0.006   0.162 ± 0.010
RP L2        0.136 ± 0.006   0.144 ± 0.005   0.168 ± 0.003   0.184 ± 0.002
SVM          0.131 ± 0.007   0.136 ± 0.006   0.148 ± 0.002   0.152 ± 0.003
KNN          0.132 ± 0.006   0.133 ± 0.006   0.133 ± 0.007   0.135 ± 0.007

(b) St. Andrews C10 dataset
Classifier   100             200             300             400
RP L1        0.211 ± 0.007   0.213 ± 0.006   0.213 ± 0.003   0.214 ± 0.006
RP L2        0.217 ± 0.011   0.224 ± 0.005   0.237 ± 0.004   0.243 ± 0.009
SVM          0.206 ± 0.011   0.207 ± 0.009   0.209 ± 0.009   0.211 ± 0.008
KNN          0.208 ± 0.004   0.212 ± 0.005   0.2112 ± 0.006  0.212 ± 0.007

(c) Yeast C14 dataset
Classifier   300             500             1000            1500
RP L1        0.303 ± 0.004   0.308 ± 0.003   0.309 ± 0.003   0.315 ± 0.005
RP L2        0.316 ± 0.004   0.336 ± 0.009   0.352 ± 0.001   0.390 ± 0.009
SVM          0.301 ± 0.006   0.302 ± 0.005   0.304 ± 0.002   0.306 ± 0.003
KNN          0.301 ± 0.006   0.301 ± 0.005   0.302 ± 0.009   0.303 ± 0.008

[Figure 4.3: Micro-average and Macro-average F1 scores at each rank of categories for KNN (K = 10), SVM binary, Rank-Prop (L1) and Rank-Prop (L2) on the MovieLens C18 dataset, with (a) 20, (b) 40, (c) 80 and (d) 100 training examples.]
[Figure 4.4: Micro-average and Macro-average F1 scores at each rank of categories for KNN (K = 10), SVM binary, Rank-Prop (L1) and Rank-Prop (L2) on the St. Andrews C10 dataset, with (a) 100, (b) 200, (c) 300 and (d) 400 training examples.]

[Figure 4.5: Micro-average and Macro-average F1 scores at each rank of categories for KNN (K = 10), SVM binary, Rank-Prop (L1) and Rank-Prop (L2) on the Yeast C14 dataset, with 300, 500, 1000 and 1500 training examples.]
Table 4.2: Area under the ROC curve for the MovieLens dataset.

(a) Micro-average
Classifier   20              40              80              100
RP L1        0.640 ± 0.012   0.663 ± 0.009   0.690 ± 0.025   0.728 ± 0.003
RP L2        0.714 ± 0.025   0.742 ± 0.002   0.779 ± 0.011   0.790 ± 0.008
SVM          0.600 ± 0.008   0.628 ± 0.031   0.631 ± 0.008   0.682 ± 0.007
KNN          0.612 ± 0.029   0.625 ± 0.062   0.636 ± 0.040   0.641 ± 0.030

(b) Macro-average
Classifier   20              40              80              100
RP L1        0.591 ± 0.014   0.613 ± 0.004   0.643 ± 0.009   0.655 ± 0.007
RP L2        0.608 ± 0.022   0.628 ± 0.024   0.656 ± 0.019   0.665 ± 0.012
SVM          0.575 ± 0.015   0.599 ± 0.005   0.611 ± 0.015   0.623 ± 0.007
KNN          0.569 ± 0.005   0.579 ± 0.003   0.597 ± 0.007   0.608 ± 0.004

Table 4.3: Area under the ROC curve for the St. Andrews dataset.

(a) Micro-average
Classifier   100             200             300             400
RP L1        0.641 ± 0.015   0.644 ± 0.011   0.645 ± 0.010   0.651 ± 0.011
RP L2        0.650 ± 0.009   0.654 ± 0.010   0.672 ± 0.008   0.743 ± 0.007
SVM          0.634 ± 0.021   0.636 ± 0.007   0.638 ± 0.008   0.649 ± 0.007
KNN          0.512 ± 0.011   0.516 ± 0.009   0.519 ± 0.009   0.524 ± 0.009

(b) Macro-average
Classifier   100             200             300             400
RP L1        0.553 ± 0.023   0.592 ± 0.010   0.596 ± 0.011   0.596 ± 0.013
RP L2        0.582 ± 0.001   0.621 ± 0.003   0.665 ± 0.029   0.697 ± 0.014
SVM          0.533 ± 0.019   0.578 ± 0.011   0.581 ± 0.008   0.584 ± 0.013
KNN          0.519 ± 0.011   0.522 ± 0.010   0.527 ± 0.010   0.529 ± 0.009
Table 4.4: Area under the ROC curve for the Yeast dataset.

(a) Micro-average
Classifier   300             500             1000            1500
RP L1        0.690 ± 0.010   0.722 ± 0.004   0.734 ± 0.010   0.754 ± 0.008
RP L2        0.715 ± 0.013   0.754 ± 0.002   0.784 ± 0.005   0.797 ± 0.008
SVM          0.648 ± 0.011   0.710 ± 0.006   0.721 ± 0.005   0.727 ± 0.009
KNN          0.703 ± 0.009   0.709 ± 0.005   0.710 ± 0.006   0.715 ± 0.004

(b) Macro-average
Classifier   300             500             1000            1500
RP L1        0.594 ± 0.010   0.599 ± 0.002   0.599 ± 0.003   0.612 ± 0.009
RP L2        0.598 ± 0.008   0.606 ± 0.003   0.607 ± 0.003   0.621 ± 0.004
SVM          0.589 ± 0.003   0.589 ± 0.004   0.598 ± 0.006   0.601 ± 0.002
KNN          0.586 ± 0.004   0.587 ± 0.004   0.588 ± 0.007   0.600 ± 0.004

Results and Analysis II: Sensitivity Test

In this empirical study we discuss the influence of the parameters A and B on the proposed Rank Propagation algorithm. Figure 4.6 shows the F1 scores on the MovieLens C18 dataset with 80 and 100 training examples for different B values (since only the ratio between A and B matters, we fix A = 1 and consider the value of B). As we can see, values smaller or greater than the optimal value of B all result in worse performance than the optimal value. This is consistent with intuition, since too large a value will dominate the resulting propagation scores, while too small a value cannot express the differences in the transition probabilities. The parameter B has a similar influence on RP L2 as on RP L1, so we do not include those results here.

[Figure 4.6: Micro-average and Macro-average F1 scores at each rank of categories for the Rank Propagation algorithm using the l1 loss function with different B values (B = 5, 10, 20, 30, 40, 50) on the MovieLens C18 dataset, with (a) 80 and (b) 100 training examples.]

Chapter 5
Propagation over Directed Graphs

Label propagation approaches have gained popularity in the semi-supervised learning area. Most previous research has been done on undirected graphs [4, 5, 57, 55, 24, 51]. However, there are many scenarios which involve directed graphs. For example, in order to exploit the link structure of the web for ranking web pages, we usually view the links between web pages as directed. A number of studies have been devoted to semi-supervised learning on directed graphs, such as [50, 52]. In this chapter, we explore the label propagation approach over directed graphs in the application of text categorization.

Label propagation has been shown to be an effective approach for text categorization [57, 23]. The main idea is to predict the category assignments for unlabeled documents by propagating the category information of the training documents through the similarities between documents. Most previous research on label propagation focuses on undirected graphs, which are based on symmetric similarity matrices. However, in a number of scenarios, the similarity relationship between documents can be asymmetric and is better captured by directed graphs. For example, consider two documents $d_A$ and $d_B$ with document $d_A$ being a part of document $d_B$. Evidently, if document $d_A$ belongs to category $c$, we would expect that document $d_B$ should also belong to category $c$.
However, this inference is irreversible, namely we cannot infer the category of document $d_A$ if we know the category of $d_B$. To capture the asymmetric relationship among documents, we propose a label propagation approach for text categorization using directed graphs.

The proposed approach first constructs a directed graph from the document collection, then converts this directed graph into an undirected one, and finally makes predictions by propagating the class labels over the converted undirected graph. There are two questions which need to be answered:

How do we utilize the directed graph? A straightforward way is to first convert the directed graph into an undirected one and then apply a standard label propagation method to the converted undirected graph. In our study, we discuss the work in [52], which proposed a regularization framework that converts a directed graph based on an asymmetric similarity matrix into an undirected graph. We then propagate the labeling information on the undirected graph using the Spectral Graph Transducer [23].

How do we construct a directed graph? Constructing a directed graph from a document collection requires computing asymmetric similarities, namely the similarity of $d_A$ to $d_B$ is different from the similarity of $d_B$ to $d_A$. Two asymmetric similarity measures are presented in this study, i.e., the asymmetric similarity based on the KL divergence and the asymmetric similarity based on a modified cosine similarity. We evaluate the efficacy of these two asymmetric similarities through an empirical study with multiple datasets.

5.1 Converting a Directed Graph into an Undirected Graph

Let $D = (d_1, \dots, d_n)$ denote the entire collection of documents. Assume the first $n_l$ documents of $D$ are labeled by $y_l = (y_1, \dots, y_{n_l})$ and the remaining $n_u = n - n_l$ documents are unlabeled. Each class label $y_i$ is either $+1$ or $-1$. Let $w(i\|j) \ge 0$ denote the similarity of the $i$-th example to the $j$-th example. Note that the similarity is asymmetric, namely $w(i\|j) \ne w(j\|i)$. We further denote by $W \in \mathbb{R}^{n \times n}$ the similarity matrix for all the examples. Our goal is to predict $y_u$, the category labels for the unlabeled documents, by propagating the class labels $y_l$ over the asymmetric similarities $w(i\|j)$. In the following, we first describe the approach in [50] that is used to convert a directed graph into an undirected graph. We then describe the Spectral Graph Transducer (SGT) algorithm [23], which is used to propagate the class labels over the converted undirected graph.

5.1.1 Conversion to Undirected Graphs

Given a set of labeled documents $\mathcal{D}_l = \{(d_1, c_1), \dots, (d_l, c_l)\}$, the task is to classify the unlabeled documents $\mathcal{D}_u = \{d_{l+1}, \dots, d_{l+u}\}$. Assume $n = l + u$. We can construct a directed graph $G = (V, E)$ as described in the previous section. $V$ is the set of nodes, and each node represents a document. $E$ is the set of edges. The edges are weighted, and there is a weight function $w : E \to \mathbb{R}^+$ which associates a positive value $w([u, v])$ with each edge $[u, v] \in E$. We use the K-L divergence between two documents, based on their term vectors, as the weight function. The out-degree function $d^+ : V \to \mathbb{R}^+$ and in-degree function $d^- : V \to \mathbb{R}^+$ are defined as
\[
d^+(u) = \sum_{u \to v} w([u, v]) \qquad\text{and}\qquad d^-(u) = \sum_{u \leftarrow v} w([v, u]),
\]
where $u \to v$ denotes the set of vertices adjacent from the vertex $u$ and $u \leftarrow v$ denotes the set of vertices adjacent to $u$.
For a given weighted directed graph, there is a natural random walk on the graph with the transition probability function $p : V \times V \to \mathbb{R}^+$ defined by $p(u, v) = w([u, v])/d^+(u)$ for all $[u, v] \in E$, and 0 otherwise. The graph we construct is strongly connected and aperiodic, and the random walk has a unique stationary distribution $\pi$ which satisfies the balance equations
\[
\pi(v) = \sum_{u \to v} \pi(u)\, p(u, v) \quad \text{for all } v \in V,
\]
with $\pi(v) > 0$ for all $v \in V$.

Assume a classification function $f \in \mathcal{H}(V)$, which assigns a label $\mathrm{sign}\, f(v)$ to each vertex $v \in V$. A functional $\Omega$ can be defined as
\[
\Omega(f) = \frac{1}{2} \sum_{[u,v] \in E} \pi(u)\, p(u, v) \left(\frac{f(u)}{\sqrt{\pi(u)}} - \frac{f(v)}{\sqrt{\pi(v)}}\right)^2.
\tag{5.1}
\]
$\Omega$ sums the weighted variation of the function over each edge of the directed graph. The initial label assignment should be changed as little as possible. Let $y$ denote the function in $\mathcal{H}(V)$ defined by $y(v) = 1$ or $-1$ if vertex $v$ has been labeled as positive or negative respectively, and $y(v) = 0$ if it is unlabeled. We can form the optimization problem
\[
\operatorname*{argmin}_{f \in \mathcal{H}(V)} \big\{\Omega(f) + \mu \|f - y\|^2\big\},
\tag{5.2}
\]
where $\mu > 0$ is a parameter specifying the tradeoff between the two competing terms. Define the operator $\Theta : \mathcal{H}(V) \to \mathcal{H}(V)$ as
\[
(\Theta f)(v) = \frac{1}{2}\left(\sum_{u \to v} \frac{\pi(u)\, p(u, v)\, f(u)}{\sqrt{\pi(u)\pi(v)}} + \sum_{v \to u} \frac{\pi(v)\, p(v, u)\, f(u)}{\sqrt{\pi(v)\pi(u)}}\right).
\]
$\Theta$ can be written in matrix form as
\[
\Theta = \frac{\Pi^{1/2} P\, \Pi^{-1/2} + \Pi^{-1/2} P^{\mathsf T}\, \Pi^{1/2}}{2},
\tag{5.3}
\]
where $\Pi$ denotes the diagonal matrix with $\Pi(v, v) = \pi(v)$ for all $v \in V$, $P$ is the transition probability matrix and $P^{\mathsf T}$ is its transpose. Note that $\Theta$ is symmetric although the original weight matrix is asymmetric. Using $\Theta$ as the new symmetric similarity matrix, we conduct the propagation using SGT [23], which is discussed in detail in the following section. Figure 5.1 shows the algorithm. Note that for a multi-class problem, we compute the function $f$ for each category using the algorithm.
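A compact sketch of the conversion in (5.3) (illustrative, not the dissertation's implementation): it row-normalizes an asymmetric weight matrix into the transition matrix $P$, estimates the stationary distribution $\pi$ by power iteration, and forms the symmetric matrix $\Theta$. The teleporting term and the function name are assumptions added here so that the stationary distribution exists even when the graph is not strongly connected.

```python
import numpy as np

def theta_matrix(W, alpha=0.99, iters=1000):
    """Symmetrize an asymmetric weight matrix as in Eq. (5.3).

    W : (n, n) non-negative asymmetric weight matrix, W[u, v] = w([u, v]).
    Returns Theta = (Pi^{1/2} P Pi^{-1/2} + Pi^{-1/2} P^T Pi^{1/2}) / 2, where
    P is the row-normalized transition matrix and Pi = diag(pi) holds the
    stationary distribution of the walk (teleporting with alpha < 1 is an
    added assumption to guarantee ergodicity).
    """
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)        # p(u, v) = w([u, v]) / d+(u)
    P = alpha * P + (1.0 - alpha) / n           # make the walk ergodic
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):                      # power iteration for pi = pi P
        pi = pi @ P
        pi /= pi.sum()
    sq, isq = np.diag(np.sqrt(pi)), np.diag(1.0 / np.sqrt(pi))
    return 0.5 * (sq @ P @ isq + isq @ P.T @ sq)

# Toy usage on a small asymmetric weight matrix.
W = np.array([[0.0, 2.0, 1.0],
              [0.5, 0.0, 3.0],
              [1.0, 0.5, 0.0]])
Theta = theta_matrix(W)
print(np.allclose(Theta, Theta.T))   # True: Theta is symmetric
```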
5.1.2 Propagation Scheme: Spectral Graph Transducer

We briefly introduce the Spectral Graph Transducer (SGT) in this section. SGT is a method for transductive learning which can be seen as a transductive version of the k nearest-neighbor classifier. In this method, the training problem has a meaningful relaxation that can be solved globally optimally using spectral methods. A key advantage of the method is that it does not require additional heuristics to avoid unbalanced splits, which makes the algorithm very robust.

Let $y = (y_1, \dots, y_n)$ denote the class labels for all the examples; evidently, $y_u = (y_{n_l+1}, \dots, y_n)$. Given a symmetric similarity matrix $S$, we need to find the class labels $y$ that, on one hand, are consistent with $y_l$ and, on the other hand, are consistent with the similarity matrix $S$. Following [23], we can formulate these objectives into the following optimization problem:
\[
\min_{y \in \mathbb{R}^n} \;\sum_{i,j=1}^{n} S_{i,j}\,(y_i - y_j)^2 + c \sum_{i=1}^{n_l} (y_i - \gamma_i)^2
\quad\text{s.t.}\quad \sum_{i=1}^{n} y_i = 0, \;\; \sum_{i=1}^{n} y_i^2 = n,
\tag{5.4}
\]
where $\gamma_i$ is $\gamma_+$ if $y_i = +1$ and $\gamma_-$ otherwise. [23] computes the similarity-weighted $k$ nearest-neighbor graph $S'$ as
\[
S'_{i,j} = \begin{cases}
\dfrac{\mathrm{sim}(\vec{x}_i, \vec{x}_j)}{\sum_{\vec{x}_k \in \mathrm{knn}(\vec{x}_i)} \mathrm{sim}(\vec{x}_i, \vec{x}_k)} & \text{if } \vec{x}_j \in \mathrm{knn}(\vec{x}_i) \\
0 & \text{else}
\end{cases}
\]
and then symmetrizes it as $S = S' + S'^{\mathsf T}$. The details of the computation can be found in [23].

Input: the label assignment vector $y$ containing the labels for the training data and 0 for the testing data; the similarity matrix $S_D$ for documents computed by the K-L divergence.
Output: the classification function $f$.
Algorithm:
1. Construct the directed graph: construct the graph $G = (V, E)$. $V$ is the set of nodes and each node represents a document. $E$ is the set of edges, and each edge $[u, v]$ is assigned a weight $w([u, v])$ determined by $S_D(u, v)$. Compute the transition probability matrix $P$ defined as $p(u, v) = w([u, v])/d^+(u)$.
2. Define the random walk: define a random walk over $G$ with transition matrix $P$ such that it has a unique stationary distribution $\pi$.
3. Compute the matrix $\Theta$: let $\Pi$ denote the diagonal matrix whose diagonal elements are the stationary distribution $\pi$ of the random walk. Compute $\Theta = (\Pi^{1/2} P\, \Pi^{-1/2} + \Pi^{-1/2} P^{\mathsf T}\, \Pi^{1/2})/2$.
4. Compute the function $f$: compute the function $f$ using SGT, and classify each unlabeled vertex $v$ as $\mathrm{sign}\, f(v)$.

Figure 5.1: The algorithm for classification on a directed graph.

5.2 Constructing a Directed Graph

Label propagation approaches often involve a connected graph constructed from the data samples. Consider a classification problem. Assume we have the labeled data set $\{(x_1, y_1), \dots, (x_l, y_l)\}$ and the unlabeled data set $\{x_{l+1}, \dots, x_{l+u}\}$. We construct a connected graph $G = (V, E)$: $V$ is the set of nodes, in which each node represents one data sample; $E$ is the set of edges between nodes, and each edge is assigned a value determined by the weight function $w$. The function $w$ is usually computed based on the similarity between nodes; for instance, an RBF kernel can be used to compute the similarity. If $w$ is a symmetric measure, i.e., $w(x_i, x_j) = w(x_j, x_i)$, the graph is undirected, since the edge from node $x_i$ to $x_j$ is viewed the same as the edge from node $x_j$ to $x_i$. All the approaches discussed in the previous chapters are based on undirected graphs. In many cases, however, a directed graph may be a more appropriate choice to express the nature of the data set. By defining the weight function $w$ as an asymmetric function, we can construct a directed graph.

Let us consider the document classification problem. Given a set of labeled documents $\mathcal{D}_l = \{(d_1, c_1), \dots, (d_l, c_l)\}$, the task is to classify the unlabeled documents $\mathcal{D}_u = \{d_{l+1}, \dots, d_{l+u}\}$. In order to apply the label propagation approach, we need to compute the document similarity matrix $S$. Two asymmetric similarity measures are presented in the following study, i.e., the asymmetric similarity based on the K-L divergence and the asymmetric similarity based on a modified cosine similarity.

K-L divergence similarity measure. For probability distributions $P$ and $Q$ of a discrete variable, the K-L divergence of $Q$ from $P$ is defined as
\[
D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}.
\tag{5.5}
\]
Given the documents $d_i$ and $d_j$, we first compute the pairwise KL distance $D(i\|j)$ as follows:
\[
D(i\|j) = \sum_{k} \hat{d}_{i,k} \log \frac{\hat{d}_{i,k}}{\hat{d}_{j,k}},
\tag{5.6}
\]
where $\hat{d}_{i,k}$ is the normalized term frequency, defined as $\hat{d}_{i,k} = d_{i,k} / \sum_{k'} d_{i,k'}$. We then convert the distance $D(i\|j)$ to a similarity $w(i\|j)$ by
\[
w(i\|j) = \exp\big(-\lambda D(i\|j)\big),
\]
where $\lambda$ is a scaling factor determined empirically. The graph is directed, since the edge from the node representing $d_i$ to the node representing $d_j$ is different from the edge from the node representing $d_j$ to the node representing $d_i$. By using the KL divergence, we try to express in the pairwise similarities the differences in document length, term probability distribution and other factors.

Asymmetric cosine similarity. We assume that the document length reflects the variety of topics, which may have an impact on propagation; thus the resulting similarity should be weighted according to the document length. The standard cosine similarity is a symmetric measure.
To capture the differences in document length, we modify it into an asymmetric measure. Let us denote by $X_i$ the set of words used by the $i$-th document, i.e., $X_i = \{k \mid d_{i,k} \ne 0\}$. We then measure the similarity of $d_i$ to $d_j$ by the following expression:
\[
w(i\|j) = \frac{\sum_{k \in X_i} d_{i,k}\, d_{j,k}}{\sqrt{\sum_{k \in X_i} d_{i,k}^2}\,\sqrt{\sum_{k \in X_i} d_{j,k}^2}}.
\tag{5.7}
\]
This makes intuitive sense: a short document may have a large similarity to a long document with diverse topics, while the long document may not have a large similarity to the short one because of its diverse topics.
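The two asymmetric measures can be sketched as follows (illustrative code, not the dissertation's implementation; the smoothing constants added to avoid division by zero and the function names are assumptions):

```python
import numpy as np

def kl_similarity(D, lam=1.0):
    """Asymmetric similarity based on the KL divergence (Eqs. 5.5-5.6).

    D : (n_docs, n_terms) term-frequency matrix.
    w(i || j) = exp(-lam * KL(d_i || d_j)) on the normalized term
    distributions; the small smoothing constant is an added assumption.
    """
    P = (D + 1e-10)
    P = P / P.sum(axis=1, keepdims=True)            # normalized term frequencies
    n = P.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            W[i, j] = np.exp(-lam * np.sum(P[i] * np.log(P[i] / P[j])))
    return W

def asymmetric_cosine(D):
    """Asymmetric cosine similarity of Eq. (5.7): sums are restricted to the
    words that occur in document i, so w(i || j) != w(j || i) in general."""
    n = D.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        mask = D[i] > 0                             # X_i: words used by document i
        for j in range(n):
            denom = np.sqrt((D[i, mask] ** 2).sum()) * np.sqrt((D[j, mask] ** 2).sum())
            W[i, j] = (D[i, mask] * D[j, mask]).sum() / (denom + 1e-12)
    return W

# Toy usage: three documents over a five-word vocabulary.
D = np.array([[3, 1, 0, 0, 0],
              [2, 1, 4, 1, 0],
              [0, 0, 1, 2, 3]], dtype=float)
print(np.round(kl_similarity(D), 3))
print(np.round(asymmetric_cosine(D), 3))
```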
5.3 Case Study 1: Binary Classification

We evaluate our proposed approach on binary text classification tasks. Generally, the following questions will be addressed:

• Is the classification performance improved by using directed graphs? We compare label propagation approaches using directed graphs with approaches using undirected graphs on the same classification tasks. SGT [23], SVM [21] and LP [57] are used as our baselines.

• Why do directed graphs improve the performance? From the experiments we can see that propagation using directed graphs outperforms propagation using undirected graphs, but how it actually helps classification, and why this is the case, still needs to be studied. We discuss why directed graphs actually help classification with an analysis of the datasets.

• How sensitive are the two proposed asymmetric similarity measures? We proposed two similarity measures above; how they are affected by different parameters will be presented and analyzed.

5.3.1 Experiment Setup

We evaluate the proposed algorithm for semi-supervised text categorization on multiple datasets, which are described as follows:

• MovieLens dataset: We use the MovieLens dataset (http://www.grouplens.org/). The five most popular categories are used in this study, including Adventure, Children's, Comedy, Drama and Thriller, which results in a total of 1354 movies. Each movie is described by a number of keywords, and the goal of this experiment is to predict the genre of movies based on their keyword description.

• 20-newsgroup: For the 20-newsgroup dataset, we randomly selected 5 categories and 100 documents from each category. The resulting dataset contains 500 documents and around 7300 words. There is no overlapping of documents among categories; that is, each document is assigned to only one category.

• Reuters-21578: The Reuters-21578 dataset contains more than 20,000 documents. We used the Reuters-21578 R8 subset (http://www.gia.ist.utl.pt/~acardoso/datasets/), which contains 7674 documents and 8 classes. Each document is assigned to only one category. Since some classes do not have enough samples, we select the 6 classes which have more than 80 documents. The resulting dataset has 543 documents with a vocabulary of 4452 words, and the average number of documents within one class is around 91.

Three baselines are used in our study: the supervised support vector machine (SVM), the spectral graph transducer (SGT) over undirected graphs, where document similarity is computed with cosine similarity, and the label propagation using the harmonic function (LP) in [57]. SVM is a supervised learner which only utilizes the labeled data. SGT explores the undirected graph with an adjacency matrix based on k-nearest neighbors and propagates the labels over the undirected graph. LP propagates the labels over an undirected graph using Gaussian fields and the harmonic function.

We refer to the directed graph label propagation based on the asymmetric cosine similarity and the KL divergence as DP-Con and DP-KL, respectively. F1 is used as the evaluation metric. All the experiments are repeated ten times, and the F1 averaged over the ten trials is reported as the final result.

5.3.2 Experiment Results

Tables 5.1, 5.2 and 5.3 present the F1 scores of the five algorithms on different binary classification tasks for the three datasets. First, as we can see from the three tables, DP-KL and DP-Con always outperform SGT and LP, which are both semi-supervised learning algorithms using undirected graphs. We attribute the improvement to utilizing the directed graphs during propagation, since the directed graph can better capture the asymmetric relationship between documents in many cases. A detailed discussion of why this is the case is presented later.

Second, generally speaking, except for the LP algorithm, the semi-supervised learning algorithms (DP-KL, DP-Con and SGT) perform better than SVM, which is a supervised learning algorithm. This is reasonable, because the deficiency of training examples affects the effectiveness of supervised learning. This again shows that utilizing unlabeled data is important for classification tasks, especially when a large amount of training data is not available. As the size of the training set increases, the performance of SVM usually improves significantly, since more training samples can be used for learning. The performance of SVM varies from case to case; in some cases (for example, the category "Children's" in Table 5.1 and the category "trade" in Table 5.3) it outperforms SGT, but not DP-KL and DP-Con. This supports our expectation that directed graphs can improve the performance.

Third, LP performs poorly in all cases. We attribute this to two possible reasons. First, the poor performance may be caused by propagating the labels over the weighted undirected graph based on the full similarity matrix, in which every pair of nodes is connected and the edge between them is assigned a weight; the labeling information from connected edges with small weights may become very noisy during propagation. Thus, using an adjacency matrix based on the k nearest neighbors, which removes unnecessary edges from the graph, improves the performance. Another reason may be the threshold for classification: the prediction result from the harmonic function seems very sensitive to the threshold. Considering the class prior is one reasonable way to adjust the threshold, and it usually works better than simply taking the sign of the final value. But compared to the other algorithms, LP still does not always work well.
Table 5.1: The F1 results for the MovieLens dataset.

Cat.    Alg.      10              20              40              80
Adv.    DP-KL     0.201 (0.044)   0.209 (0.042)   0.210 (0.034)   0.220 (0.004)
        DP-Con    0.225 (0.055)   0.244 (0.053)   0.252 (0.050)   0.264 (0.047)
        SGT       0.149 (0.026)   0.183 (0.032)   0.2036 (0.034)  0.208 (0.044)
        SVM       0.111 (0.011)   0.170 (0.067)   0.229 (0.058)   0.346 (0.010)
        LP        0.155 (0.002)   0.157 (0.013)   0.156 (0.008)   0.163 (0.005)
Chi.    DP-KL     0.194 (0.046)   0.197 (0.040)   0.196 (0.052)   0.198 (0.017)
        DP-Con    0.161 (0.068)   0.2069 (0.087)  0.264 (0.091)   0.281 (0.068)
        SGT       0.139 (0.038)   0.147 (0.022)   0.156 (0.025)   0.159 (0.036)
        SVM       0.137 (0.085)   0.155 (0.020)   0.180 (0.056)   0.189 (0.016)
        LP        0.071 (0.003)   0.074 (0.021)   0.072 (0.022)   0.080 (0.005)
Com.    DP-KL     0.412 (0.069)   0.422 (0.043)   0.440 (0.038)   0.444 (0.034)
        DP-Con    0.381 (0.042)   0.404 (0.035)   0.411 (0.049)   0.421 (0.038)
        SGT       0.373 (0.0676)  0.395 (0.057)   0.405 (0.047)   0.413 (0.025)
        SVM       0.311 (0.111)   0.321 (0.053)   0.345 (0.074)   0.363 (0.065)
        LP        0.296 (0.223)   0.294 (0.002)   0.289 (0.004)   0.293 (0.006)
Dra.    DP-KL     0.630 (0.063)   0.639 (0.053)   0.678 (0.044)   0.702 (0.020)
        DP-Con    0.634 (0.024)   0.659 (0.051)   0.6612 (0.075)  0.700 (0.017)
        SGT       0.583 (0.071)   0.588 (0.105)   0.647 (0.039)   0.678 (0.031)
        SVM       0.348 (0.074)   0.3979 (0.054)  0.416 (0.014)   0.546 (0.055)
        LP        0.353 (0.002)   0.3601 (0.002)  0.427 (0.004)   0.431 (0.005)
Thri.   DP-KL     0.1985 (0.047)  0.201 (0.034)   0.209 (0.037)   0.213 (0.023)
        DP-Con    0.220 (0.044)   0.230 (0.044)   0.231 (0.046)   0.245 (0.029)
        SGT       0.172 (0.028)   0.181 (0.034)   0.185 (0.028)   0.195 (0.020)
        SVM       0.128 (0.017)   0.131 (0.058)   0.153 (0.060)   0.165 (0.027)
        LP        0.136 (0.003)   0.139 (0.002)   0.141 (0.006)   0.161 (0.007)

Table 5.2: F1 measure for the 20-newsgroup dataset.

Cat.    Alg.      10              20              30              40
1       DP-KL     0.585 (0.020)   0.606 (0.069)   0.611 (0.057)   0.623 (0.076)
        DP-Con    0.616 (0.042)   0.628 (0.042)   0.635 (0.043)   0.653 (0.076)
        SGT       0.570 (0.075)   0.578 (0.071)   0.595 (0.070)   0.604 (0.080)
        SVM       0.332 (0.055)   0.434 (0.070)   0.469 (0.022)   0.594 (0.008)
        LP        0.245 (0.007)   0.261 (0.013)   0.285 (0.011)   0.304 (0.016)
2       DP-KL     0.557 (0.044)   0.562 (0.039)   0.572 (0.047)   0.589 (0.042)
        DP-Con    0.629 (0.041)   0.652 (0.055)   0.663 (0.052)   0.667 (0.047)
        SGT       0.544 (0.045)   0.554 (0.024)   0.567 (0.066)   0.572 (0.065)
        SVM       0.399 (0.053)   0.422 (0.044)   0.514 (0.024)   0.592 (0.013)
        LP        0.361 (0.007)   0.376 (0.009)   0.385 (0.012)   0.395 (0.021)
3       DP-KL     0.562 (0.052)   0.594 (0.044)   0.629 (0.057)   0.626 (0.070)
        DP-Con    0.594 (0.054)   0.614 (0.035)   0.642 (0.051)   0.641 (0.051)
        SGT       0.544 (0.040)   0.559 (0.064)   0.566 (0.065)   0.570 (0.084)
        SVM       0.365 (0.055)   0.520 (0.053)   0.573 (0.074)   0.607 (0.069)
        LP        0.249 (0.008)   0.271 (0.013)   0.293 (0.015)   0.317 (0.015)
4       DP-KL     0.567 (0.046)   0.571 (0.066)   0.596 (0.064)   0.607 (0.060)
        DP-Con    0.614 (0.027)   0.632 (0.035)   0.639 (0.051)   0.640 (0.074)
        SGT       0.542 (0.031)   0.561 (0.058)   0.567 (0.068)   0.590 (0.090)
        SVM       0.382 (0.055)   0.460 (0.027)   0.515 (0.069)   0.600 (0.013)
        LP        0.382 (0.004)   0.411 (0.007)   0.420 (0.016)   0.427 (0.018)
5       DP-KL     0.477 (0.053)   0.487 (0.035)   0.500 (0.045)   0.501 (0.047)
        DP-Con    0.599 (0.038)   0.610 (0.041)   0.616 (0.043)   0.630 (0.033)
        SGT       0.459 (0.069)   0.462 (0.037)   0.462 (0.046)   0.473 (0.059)
        SVM       0.253 (0.072)   0.308 (0.009)   0.451 (0.011)   0.533 (0.069)
        LP        0.154 (0.008)   0.168 (0.010)   0.187 (0.013)   0.206 (0.020)
Table 5.3: F1 measure for the Reuters-21578 R8 dataset.

Cat.       Alg.      10              20              30              40
acq        DP-KL     0.514 (0.080)   0.535 (0.083)   0.560 (0.097)   0.591 (0.086)
           DP-Con    0.527 (0.050)   0.527 (0.051)   0.548 (0.041)   0.569 (0.048)
           SGT       0.504 (0.050)   0.531 (0.054)   0.541 (0.045)   0.549 (0.036)
           SVM       0.362 (0.029)   0.446 (0.010)   0.527 (0.049)   0.535 (0.042)
           LP        0.177 (0.005)   0.177 (0.010)   0.181 (0.013)   0.188 (0.013)
crude      DP-KL     0.773 (0.056)   0.811 (0.090)   0.813 (0.021)   0.821 (0.015)
           DP-Con    0.549 (0.041)   0.572 (0.045)   0.573 (0.051)   0.583 (0.054)
           SGT       0.540 (0.033)   0.561 (0.051)   0.589 (0.057)   0.579 (0.058)
           SVM       0.482 (0.084)   0.529 (0.037)   0.550 (0.046)   0.563 (0.101)
           LP        0.294 (0.009)   0.304 (0.011)   0.310 (0.015)   0.315 (0.017)
earn       DP-KL     0.696 (0.112)   0.731 (0.089)   0.705 (0.102)   0.789 (0.072)
           DP-Con    0.595 (0.038)   0.610 (0.044)   0.635 (0.043)   0.647 (0.068)
           SGT       0.572 (0.029)   0.595 (0.031)   0.606 (0.051)   0.614 (0.063)
           SVM       0.508 (0.092)   0.571 (0.120)   0.600 (0.140)   0.608 (0.026)
           LP        0.223 (0.018)   0.242 (0.016)   0.271 (0.030)   0.311 (0.026)
interest   DP-KL     0.470 (0.052)   0.505 (0.059)   0.483 (0.065)   0.537 (0.075)
           DP-Con    0.447 (0.047)   0.492 (0.061)   0.478 (0.044)   0.485 (0.052)
           SGT       0.439 (0.054)   0.490 (0.052)   0.464 (0.044)   0.480 (0.061)
           SVM       0.403 (0.036)   0.472 (0.032)   0.489 (0.022)   0.434 (0.067)
           LP        0.228 (0.012)   0.230 (0.011)   0.230 (0.017)   0.233 (0.014)
money      DP-KL     0.471 (0.074)   0.528 (0.071)   0.556 (0.080)   0.574 (0.078)
           DP-Con    0.471 (0.066)   0.498 (0.062)   0.506 (0.071)   0.506 (0.050)
           SGT       0.456 (0.063)   0.490 (0.061)   0.484 (0.060)   0.490 (0.042)
           SVM       0.382 (0.018)   0.436 (0.013)   0.458 (0.021)   0.487 (0.022)
           LP        0.208 (0.007)   0.211 (0.012)   0.212 (0.012)   0.212 (0.012)
trade      DP-KL     0.505 (0.090)   0.550 (0.106)   0.576 (0.041)   0.608 (0.057)
           DP-Con    0.494 (0.054)   0.510 (0.058)   0.567 (0.043)   0.582 (0.054)
           SGT       0.463 (0.057)   0.462 (0.061)   0.462 (0.054)   0.467 (0.062)
           SVM       0.478 (0.152)   0.492 (0.095)   0.506 (0.070)   0.523 (0.077)
           LP        0.347 (0.007)   0.351 (0.009)   0.354 (0.012)   0.318 (0.012)

5.3.3 Analysis and Discussion

The results presented above show that propagation using directed graphs works better than propagation using undirected graphs. However, this may not always be the case. In this section, we take a closer look at why directed graphs can bring the improvement in performance. We consider SGT, DP-KL and DP-Con.

[Figure 5.2: The distribution of document length (percentage of documents versus number of words per document) in the three datasets: MovieLens, 20-newsgroup and Reuters-R8.]

Why does propagation using directed graphs outperform propagation using undirected graphs? We have already shown that propagation using directed graphs outperforms propagation using undirected graphs in the text categorization task. Recall that we use the SGT algorithm for propagation, with the undirected graph converted from the directed graph.
The performance of each algorithm then depends on the similarity matrix. Thus, in order to understand the reason behind the improvement, we study the similarity matrices computed in DP-KL, DP-Con and SGT.

[Figure 5.3: Similarity matrix analysis of the MovieLens dataset. (a) The distribution of normalized pairwise similarities under the KL divergence measure, the asymmetric cosine measure, and the undirected graph. (b) The probability that two documents belong to the same category as a function of the normalized pairwise similarity, for DP-KL, DP-Con and the undirected graph.]

[Figure 5.4: Similarity matrix analysis of the 20-newsgroup dataset, with the same panels as Figure 5.3.]
0 0.1 0.2 0.3 0.4 0.5 Normalized Pairwise Similarity (b) The Distribution of Two Documents Belonging to One Category with respect to Normalized Pairwise Similarity in Reuters-215 78 R8 Dataset. Figure 5.5: Similarity Matrix Analysis of Reuters-215 78 R8 Dataset 108 Distribution of Similarities of Doc Pairs in One Category and Doc Pairs not in One category using KL Divergence I I I I I I I I ~+— Doc Pairs in One Category 6— Doc Pairs not in One Category 7 Distribution of Similarities of Doc Pairs in One Category 000 Pairs not in One Category using Asymmetric Cosine Measure 1 r r I fl 1 r r r r The Probability of Similarity o 01 l Distribution of Similarities of Doc Pairs in One Category and Doc Pairs not in One Category in Undirected Graph 1 r r r T I I r I F I 0 0.1 "'02 I 073 ’ “0.4: 0,5, 0.6 _ Normalized Parrwrse Srmrlanty Figure 5.6: The Distribution of Similarities of Document Pairs Belonging to the Same Category and Document Pairs not Belonging to the Same Category in MovieLens Dataset. 109 The Distribution of Similarities of 000 Pairs in One Category and Doc Pairs not in One Category using KL Divergence 0.4.: I I I I I I I I I i “is l. . . l I \ ~9— Doc Pairs not In One Category 0 2 —‘ 21* . . : 117' \1 —+— Doc Pairs In One Category I}; ‘. 1h], N 0"! 1?“ -‘--‘-“ :1:::1:‘:‘..:::.:.::::=':=‘::. .. 1:" ‘33 o 0.05 0.1 0715 0.2_ 0.25. 0.3 _ “0.35 0.4 135 0.5 Normalized Parrwrse Similarity g The Distribution of Similarities of Doc Pairs in One Category __g and Doc Pairs not in One Category using Asymmetric Cosine Similarity E I I I I I I I I I '03 0.4 F : “6 1 a III '5 0'2 Ill 3 10 I .0 . . e | . n- 0 '7‘ ‘0! II! I :l,‘ Q‘.‘J \" g 0 0.05 0 1 0 0.2, 0.25. 0.3 , 0.35 0.4 0.45 0.5 ,_ Normalized Parrwrse Similarity The Distribution of Similarities of Doc Pairs in One Category and Doc Pairs not in One Category in Undirected Graph 0.4 m I I m I I I I I 0.2:;- — J A A A r -‘ v "‘9 v _ . .. :L l l I I ' "" ’ 3. ' l 1 . I 1 ' ’ ' . ‘ l I nnrrnrnnxrrxrirrfxr+¥fi+t#%xf ”‘AAAJ‘A‘A‘ A‘AAM ' V O vvxlvxl\IVVVVVVUKJUWCTUVVUCJ‘UQ/VTHUG} 0 0.05 0.1 0.15 0.35 0. 0.4 0.5 0.2 0.25, 0.3 , Normalized Parrwrse Srmrlanty Figure 5.7: The Distribution of Similarities of Document Pairs Belonging to the Same Cat- egory and Document Pairs not Belonging to the Same Category in 20-newsgroup dataset. 110 Distribution of Similarities for Doc Pairs in One Category and Doc Pairs not in One Category using KL Divergence I I I I I I I +3e Doc pairs not in the same category —4— Doc pairs in the same category . A 0.4_ 0,5_ 0.6 , Normalized Parrwrse Similarity Distribution of Similarities for Doc Pairs in One Category and Doc Pairs not in One Category using Asymmetric Cosine Measure 0.2 I I I I I I T F I «Pr Doc pairs not in the same category —1— Doc pairs in the same category Probability of Similarity O 04. 0.5 . .0-6 . Normalized Parrwrse SImIIanty Distribution of Similarities for Doc Pairs in One Category and Doc Pairs not in One Category in Undirected Graph I l l l j 1 I T «e Doc pairs not in the same category —1— Doc pairs in the same category 3 Normalized0 Pairwise Similarity 0 7 Figure 5.8: The Distribution of Similarities of Document Pairs Belonging to the Same Category and Document Pairs not Belonging to the Same Category in Reuters-21578 R8 dataset. lll rithm simply depends on the similarity matrix. Thus, in order to understand the reason behind it, we study the similarity matrix computed in DP-KL, DP—Con and SGT. 
DP-KL and DP-Con generate their similarity matrices based on the KL divergence and the asymmetric cosine similarity, respectively. For SGT, the similarity between two documents is computed using the standard (symmetric) cosine similarity.

First, we claim that the correlation between pairwise similarity and the probability of two documents belonging to the same category is enhanced in directed graphs built with the KL divergence based similarity and the asymmetric cosine similarity. Figures 5.3(b), 5.4(b) and 5.5(b) present the distribution of two documents belonging to the same category with respect to the normalized pairwise similarity for the MovieLens, 20-newsgroup and Reuters-21578 R8 datasets. In all three figures, the probability of two documents belonging to the same category increases sharply at some high similarity value for the directed graphs, whereas for the undirected graph the probability shows no sharp change and instead increases gradually as the similarity gets larger. This indicates that the KL divergence based similarity and the asymmetric cosine similarity are more likely to assign high similarity values to document pairs belonging to the same category.

To examine this point further, we plot the similarity distributions of all document pairs belonging to the same category and all document pairs not belonging to the same category for the three datasets in Figures 5.6, 5.7 and 5.8. For the Reuters-21578 R8 dataset (Figure 5.8), the two kinds of similarities are clearly separated into two modes for the directed graphs: document pairs belonging to the same category tend to have higher pairwise similarities, and document pairs not belonging to the same category tend to have lower pairwise similarities. For 20-newsgroup (Figure 5.7) the same trend can be seen, although not as clearly as for Reuters-21578 R8. This may explain why propagation using directed graphs significantly outperforms propagation using undirected graphs on the Reuters-21578 R8 dataset but not on the 20-newsgroup dataset. We also plot the same distributions for the MovieLens dataset in Figure 5.6, but there the two distributions are not separated very clearly, which may explain why the performance improvement from directed graphs on the MovieLens dataset is not very significant. Our assumptions for using directed graphs are based on the analysis of document length. These assumptions may not hold for every dataset, for instance because of the data representation or its accuracy; in that case the similarity measures built on them cannot accurately express the relationships among data points, and the performance gain from using directed graphs over undirected graphs becomes less pronounced.

From the above analysis, we gain some insight into why propagation using directed graphs can outperform propagation using undirected graphs in some scenarios. Directed graphs based on the two proposed asymmetric measures can better explore the manifold structure of the dataset and reflect the impact of document length, term frequency distribution and other factors on classification in some cases, as in the Reuters-R8 and 20-newsgroup datasets, and the performance is therefore improved by using directed graphs. In the MovieLens dataset, however, the directed graphs did not capture the appropriate properties of the data and the performance was not significantly improved, although this might be addressed by using a different similarity measure.
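To make the two directed-graph measures discussed above concrete, the following sketch shows how such asymmetric similarities can be computed from term-frequency vectors. The KL-based weight follows the form w(i||j) = exp(-λ D(i||j)) used for DP-KL; the particular asymmetric cosine variant shown here is only one plausible choice and is not necessarily the exact measure defined in Section 5.2, and the smoothing constant eps is an implementation detail assumed for this sketch.

```python
import numpy as np

def kl_similarity(p, q, lam=10.0, eps=1e-12):
    """KL-divergence-based asymmetric similarity between two documents given as
    term-frequency vectors (numpy arrays): w(i||j) = exp(-lambda * D(p || q)).
    `lam` plays the role of the scaling factor lambda discussed in the text;
    the smoothing constant `eps` is an assumption of this sketch."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    kl = np.sum(p * np.log(p / q))
    return np.exp(-lam * kl)

def asymmetric_cosine(x, y, eps=1e-12):
    """One plausible asymmetric variant of cosine similarity: the inner product
    normalized by the norm of the source document only, so sim(x -> y) and
    sim(y -> x) generally differ.  This exact form is an assumption, not
    necessarily the measure used in the dissertation."""
    return float(x @ y) / (np.linalg.norm(x) ** 2 + eps)
```

Because both functions treat their first argument as the "source" document, the resulting weight matrix is asymmetric and defines a directed graph.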
In summary, directed graphs constructed from plausible assumptions can substantially improve the performance of label propagation and are worth studying further.

The sensitivity of asymmetric similarity measures. We proposed two asymmetric similarity measures, namely the KL divergence based similarity and the asymmetric cosine similarity. Since the asymmetric cosine similarity involves no parameter tuning, we leave it out of this discussion. The KL divergence based similarity uses an RBF-style kernel, w(i||j) = exp(-λ D(i||j)) (see Section 5.2), to convert the divergence into a similarity, and a scaling factor λ is involved in this conversion. We briefly discuss how the parameter λ affects the performance.

Table 5.4 shows the F1 scores of DP-KL on the 20-newsgroup dataset for different λ values. λ = 10 gives the best performance; the result for λ = 5 is much worse than for λ = 10, and for values greater than 10 the performance decreases in most cases as λ increases. This is consistent with the meaning of λ as the scaling factor of the distance: a scaling value that is too small or too large both result in inaccurate similarities.

                        Number of training examples
Cat.   λ     10               20               30               40
1      5     0.3971 (0.0613)  0.4431 (0.0813)  0.4996 (0.0662)  0.5055 (0.0641)
       10    0.5845 (0.0197)  0.6059 (0.0688)  0.6108 (0.0570)  0.6226 (0.0762)
       20    0.4221 (0.0773)  0.4436 (0.0809)  0.4781 (0.0688)  0.5178 (0.0577)
       30    0.4097 (0.0952)  0.4568 (0.0817)  0.4849 (0.0612)  0.5001 (0.0837)
       50    0.3807 (0.0848)  0.4295 (0.0710)  0.4870 (0.0730)  0.5102 (0.0708)
2      5     0.4371 (0.1013)  0.4695 (0.0987)  0.5253 (0.0895)  0.4719 (0.0831)
       10    0.6287 (0.0414)  0.6515 (0.0551)  0.6631 (0.0517)  0.6636 (0.0471)
       20    0.5391 (0.0089)  0.5862 (0.0872)  0.5868 (0.0854)  0.5937 (0.0773)
       30    0.4433 (0.1296)  0.4378 (0.0731)  0.4777 (0.0981)  0.4947 (0.0804)
       50    0.4040 (0.1232)  0.4759 (0.1137)  0.4955 (0.0936)  0.5283 (0.0880)
3      5     0.4349 (0.0469)  0.5364 (0.1061)  0.5659 (0.0855)  0.5709 (0.0892)
       10    0.5620 (0.0523)  0.5939 (0.0440)  0.6290 (0.0574)  0.6257 (0.0691)
       20    0.5049 (0.0305)  0.5257 (0.1024)  0.5571 (0.0869)  0.5757 (0.0772)
       30    0.3879 (0.1210)  0.5326 (0.0836)  0.5570 (0.0885)  0.6191 (0.0656)
       50    0.4984 (0.0504)  0.5231 (0.0901)  0.5588 (0.0721)  0.5731 (0.0795)
4      5     0.4159 (0.0782)  0.4667 (0.0761)  0.5134 (0.0719)  0.5052 (0.0733)
       10    0.5674 (0.0458)  0.5709 (0.0656)  0.5954 (0.0643)  0.6072 (0.0601)
       20    0.4019 (0.0975)  0.4918 (0.0838)  0.5127 (0.0698)  0.5214 (0.0762)
       30    0.3749 (0.0756)  0.4574 (0.0831)  0.5001 (0.0791)  0.5118 (0.0732)
       50    0.4086 (0.0689)  0.4758 (0.0761)  0.4954 (0.0643)  0.5062 (0.0709)
5      5     0.3491 (0.0475)  0.3532 (0.0530)  0.3830 (0.0587)  0.4049 (0.0540)
       10    0.4766 (0.0531)  0.4866 (0.0350)  0.4981 (0.0454)  0.5006 (0.0473)
       20    0.4274 (0.0665)  0.4563 (0.0763)  0.4812 (0.0407)  0.4815 (0.0496)
       30    0.3130 (0.0687)  0.3696 (0.0645)  0.3877 (0.0445)  0.4001 (0.0572)
       50    0.3250 (0.0629)  0.3556 (0.0600)  0.3908 (0.0389)  0.3941 (0.0608)

Table 5.4: The F1 results for the 20-newsgroup dataset with different λ values (standard deviations in parentheses).

5.4 Case Study II: Multi-label Classification

We extend the idea of propagation over directed graphs to the multi-label classification task on the MovieLens dataset. The goal of these experiments is to give a more comprehensive view of the effectiveness of propagation over directed graphs.

We conduct the experiments with propagation using the KL divergence based similarity measure as discussed above, and follow the propagation scheme proposed in [53]. For each category, the propagation generates a final score for each document, which we view as the confidence score of that document belonging to that category. Thus, by ranking the scores of all categories for each document, we can make predictions accordingly.
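The per-category scoring and ranking just described can be sketched as follows. This is a generic consistency-style propagation used only as a stand-in; the actual experiments follow the scheme of [53], which may differ in its details, and the parameter alpha and the closed-form solver are assumptions of this sketch.

```python
import numpy as np

def category_scores(S, Y, alpha=0.9):
    """Propagate per-category labels over a symmetric similarity matrix S and
    return a score matrix F (documents x categories).  Uses the standard
    closed form F = (I - alpha * S_norm)^{-1} Y of consistency-based label
    propagation as a stand-in for the scheme of [53].
    Y[i, c] = 1 if training document i is labeled with category c, else 0."""
    d = S.sum(axis=1)
    d[d == 0] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(d)
    S_norm = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    n = S.shape[0]
    # Solve (I - alpha * S_norm) F = Y for the score matrix F.
    return np.linalg.solve(np.eye(n) - alpha * S_norm, Y)

def predict_top_k(F, k=2):
    """Rank categories by confidence score for each document and return the
    indices of the top-k categories per document."""
    return np.argsort(-F, axis=1)[:, :k]
```

Ranking the columns of F per row is exactly the prediction step used in the experiments below, whatever scorer produced F.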
5.4.1 Experiment Setup

We again use the MovieLens dataset (http://www.grouplens.org/) as our testbed. The dataset provides movie category information: originally there are 1682 movies and 19 categories, the average number of movies per category is around 152, and the average number of categories per movie is 2. We also downloaded the movie keyword information. After removing the movies with no keywords, the resulting movie keyword file contains 1651 movies with 11107 keywords. We remove the corresponding movies from the movie category file, so that it also contains the same 1651 movies; the average number of movies per category is around 1 and the average number of categories per movie is around 4.

We refer to the propagation approach on the directed graph as Directed Propagation (DP) and compare it with the label propagation approach of [53], which we refer to as LP. As shown in Figure 5.1, directed propagation takes as input an asymmetric movie similarity matrix, which we compute from the movie keyword information using the KL divergence in Equation (5.6). The LP algorithm takes as input a symmetric movie similarity matrix, which we compute from the movie rating information using cosine similarity.

We use the F1 measure as our evaluation metric since it accounts for both precision and recall. We refer to the average F1 measure per category as the macro-average F1 measure and the average F1 measure per movie as the micro-average F1 measure. Since both the DP and LP algorithms provide a ranking of categories for each movie, we compute the F1 measure at different ranks. For the macro-average F1 measure, precision and recall are computed for each category, and the average precision and recall across all categories are used to compute the macro-average F1. For the micro-average F1 measure, precision and recall are computed for each movie, and the average across all movies is used to compute the micro-average F1. All experiments are repeated 10 times, and the average over the 10 trials is reported as the final result.

5.4.2 Results and Analysis

Figures 5.9(a) and 5.9(b) show the performance of the DP and LP algorithms for 40% and 20% training data, respectively. In terms of the micro-average F1 measure, the two algorithms perform almost the same; in terms of the macro-average F1 measure, the DP algorithm performs significantly better than the LP algorithm. This may be explained by the way the directed graph is constructed: since we compute the asymmetric weight matrix from the movie keyword information, movies from the same categories may have larger similarity because they share more keywords. More in-depth research is needed to fully understand this behavior, and we leave it to future work.
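For reference, the rank-based evaluation described in the experiment setup can be computed as in the following sketch. Where the text leaves the averaging order open, the choice made here (averaging precision and recall first, then combining them into F1) is an assumption.

```python
import numpy as np

def f1_at_rank(F, Y_true, k):
    """Macro- and micro-average F1 at rank k in the sense used in the text
    (macro: precision/recall averaged per category; micro: averaged per movie).
    F: score matrix (movies x categories); Y_true: 0/1 ground-truth matrix."""
    n_movies, _ = F.shape
    pred = np.zeros_like(Y_true)
    top_k = np.argsort(-F, axis=1)[:, :k]
    pred[np.arange(n_movies)[:, None], top_k] = 1
    tp = pred * Y_true

    # Macro-average: precision and recall per category, averaged, then F1.
    prec_c = tp.sum(axis=0) / np.maximum(pred.sum(axis=0), 1)
    rec_c = tp.sum(axis=0) / np.maximum(Y_true.sum(axis=0), 1)
    p, r = prec_c.mean(), rec_c.mean()
    macro_f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    # Micro-average in the per-movie sense: precision and recall per movie,
    # averaged across movies, then F1.
    prec_m = tp.sum(axis=1) / np.maximum(pred.sum(axis=1), 1)
    rec_m = tp.sum(axis=1) / np.maximum(Y_true.sum(axis=1), 1)
    p, r = prec_m.mean(), rec_m.mean()
    micro_f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return macro_f1, micro_f1
```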
Figure 5.9: F1 Measure at Different Ranks for the MovieLens Dataset in the Multi-label Classification Task. (a) 40% training data; (b) 20% training data (micro-average and macro-average F1 versus the rank of categories for DP and LP).

Chapter 6

Conclusion and Future Work

This dissertation presents a comprehensive study of label propagation and its applications. The related research on label propagation has been discussed in detail, and several propagation schemes, namely relation propagation, rank propagation and propagation over directed graphs, have been proposed to address existing problems in label propagation. The applications of the proposed schemes are presented together with empirical studies, which show that the proposed label propagation methods can achieve better performance than standard label propagation approaches on many problems. This chapter first summarizes the work proposed in this thesis and then discusses some interesting directions for future work.

6.1 Summary

Label propagation is an effective approach to semi-supervised learning. The main idea behind label propagation is to first construct a graph by denoting each data example as a node and connecting each pair of nodes with an edge whose weight is usually the similarity between the corresponding data examples, then propagate the labels of the known data examples, and finally make predictions according to the propagation scores. Empirical studies have shown it to be an effective approach in many applications. Although label propagation is a very promising approach, it still has limitations and disadvantages in many scenarios. We discussed some of these limitations and proposed several new propagation schemes accordingly.

Relation Propagation. We proposed the generalized framework of Relation Propagation and discussed its applications, including the multi-label classification task and the collaborative filtering task. In most previous work on label propagation, propagation is conducted among a single type of objects. However, real applications usually involve multiple types of objects. Propagating over only one type of objects will usually miss the correlations among different types of objects, and we believe that these correlations can help the propagation process. Based on this assumption, we propose the framework of relation propagation, which propagates the relations among multiple types of objects instead of propagating labels directly among one type of objects.

Consider the case of two types of objects. We construct a graph in which each node represents an object pair with one object from each type. Each edge connecting two nodes in the graph is assigned a user-defined weight; we proposed to use the direct product of the two similarity matrices, one from each type of objects. The relations among the two types of objects should be defined according to the application. For example, in the multi-label classification task, the two types of objects are documents and categories, and the relation is the membership of documents in categories. By propagating the relations over the constructed graph, we obtain for each unlabeled example the confidence scores of that example belonging to all categories. The empirical study showed that relation propagation is a more effective approach in some cases, supporting our assumption that it is helpful to utilize the correlations among multiple types of objects.
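A minimal sketch of the direct-product construction mentioned above, using documents and categories as the two object types; the node indexing convention is an assumption made here for illustration.

```python
import numpy as np

def pair_graph_weights(S_doc, S_cat):
    """Weight matrix over (document, category) pair nodes, built as the direct
    (Kronecker) product of a document similarity matrix and a category
    similarity matrix.  Pair node (i, c) maps to index i * n_cats + c under
    numpy's Kronecker ordering; this indexing convention is illustrative."""
    return np.kron(S_doc, S_cat)
```

For n_doc documents and n_cat categories the resulting matrix has (n_doc * n_cat)^2 entries, which already hints at the scalability issue revisited in the future work section.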
Rank Propagation. It is necessary to study label propagation approaches for ranked data because of two limitations of label propagation. First, most label propagation models propagate the class labels directly. A problem may arise because ordering information among the numerical values representing the class labels is incorrectly introduced into the propagation process. One way to avoid this issue is to cast the problem as a ranking problem in which the label information is converted to pairwise preferences over classes for each training example and these preferences are propagated. Second, most previous work on label propagation requires the labels of known data for training, but label information is not always available; in many ranking applications, including web page search and collaborative filtering, usually only ordering information is provided. Previous label propagation models are often not appropriate in these cases.

To address these challenges, a Rank Propagation scheme for the multi-label categorization scenario is proposed. Instead of propagating the labels of the training examples, rank propagation propagates the category preference information from all labeled data. Specifically, given the labels of all training examples, a preference matrix over all category pairs is built for each training example. Then, for each testing example, an optimization problem is constructed which minimizes the weighted differences between the preference matrix to be computed for the testing example and the preference matrices of the training examples. As a result, a preference matrix is computed for each unlabeled example, and a ranking list of all categories is generated by computing the principal eigenvector of this preference matrix. Two loss functions, namely a trace-based loss function and a more general loss function, are designed for measuring the difference between two preference matrices. Empirical studies with multiple datasets have shown that the rank propagation scheme is an effective approach.
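For the prediction step of Rank Propagation summarized above, a minimal sketch of ranking categories from a preference matrix via its principal eigenvector is given below. The power-iteration details are assumptions of this sketch, and the construction of the preference matrix itself is not reproduced here.

```python
import numpy as np

def rank_categories(pref, n_iter=200, tol=1e-10):
    """Rank categories from a nonnegative pairwise preference matrix by its
    principal eigenvector (computed with simple power iteration).
    pref[a, b] is the strength of the preference for category a over b."""
    n = pref.shape[0]
    v = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        v_next = pref @ v
        norm = np.linalg.norm(v_next)
        if norm == 0:
            break
        v_next /= norm
        if np.linalg.norm(v_next - v) < tol:
            v = v_next
            break
        v = v_next
    # Larger eigenvector entries correspond to more preferred categories.
    return np.argsort(-v)
```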
Propagation over Directed Graphs. Most previous research on label propagation was conducted on undirected graphs, for which the similarity matrix is symmetric. In many cases, however, the relationship among objects is asymmetric and is more appropriately expressed by a directed graph, which yields an asymmetric matrix. For instance, web pages connected by hyperlinks are better represented by a directed graph. The framework of Propagation over Directed Graphs is proposed to utilize such directed graphs.

In order to construct directed graphs, two asymmetric similarity measures are designed: the KL divergence-based similarity and the asymmetric cosine similarity. The directed graphs are converted into undirected graphs, and the propagation is conducted on the converted undirected graphs using a standard label propagation scheme; the Spectral Graph Transducer is used in the proposed approach because of its effectiveness. The empirical study with several widely used datasets has shown the advantage of using directed graphs over some state-of-the-art semi-supervised techniques.

6.2 Future Work

A detailed description of the existing and proposed label propagation approaches has been given, together with empirical studies showing that label propagation can be used effectively in many applications. However, several problems remain to be studied further.

Generally speaking, every label propagation approach involves a graph, which is associated with a similarity matrix, and a propagation method, which is usually determined by an optimization problem. How to define an appropriate similarity measure is always an important issue for label propagation, and a number of research works have focused on learning the best similarity matrix for different applications.

Besides computing the similarity matrix, there are also many open problems in the propagation schemes of the different models. In the relation propagation framework, consider the multi-label classification case: each node in the graph constructed from the data is a document-category pair. A problem arises when the numbers of documents and categories are large, which is usually the case in real-world scenarios. Although an approximation is proposed to alleviate the problem, how to scale the algorithm remains an open issue. For rank propagation, the optimization problem is designed based on the loss function, and the definition of the loss function can have a significant influence on the performance, as shown in the empirical study. In propagation over directed graphs, how the directed graph is constructed can affect the performance heavily. Two asymmetric measures to construct directed graphs are proposed based on our assumptions; however, as the empirical study showed, if the assumptions are not consistent with the nature of the application, the directed graphs may not work as well as expected. This problem can also be viewed as a similarity matrix learning problem.

Bibliography

[1] B. T. Bartell, G. W. Cottrell, and R. K. Belew. Learning the optimal parameters in a ranked retrieval system using multi-query relevance feedback. In Proc. Symp. on Document Analysis and Information Retrieval, Las Vegas, April 1994.

[2] Brian Bartell, Garrison W. Cottrell, and Rik Belew. Learning to retrieve information. In Current Trends in Connectionism: Proceedings of the Swedish Conference on Connectionism, LEA: Hillsdale, 1995.

[3] Mikhail Belkin and Partha Niyogi. Using manifold structure for partially labeled classification. Advances in Neural Information Processing Systems, 2002.

[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th International Conf. on Machine Learning, 2001.

[5] Avrim Blum, John Lafferty, Mugizi Robert Rwebangira, and Rajashekar Reddy. Semi-supervised learning using randomized mincuts. In Proc. of ICML'04, page 13, New York, NY, USA, 2004. ACM Press.

[6] C. Boutilier, R. Zemel, and B. Marlin. Active collaborative filtering. In Proc. UAI, 2003.

[7] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pages 43-52, 1998.

[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.

[9] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. Technical report, University College London, 2004.

[10] W. Chu and Z. Ghahramani. Preference learning with gaussian processes. In Proc. ICML, 2005.
[11] William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243-270, 1999.

[12] K. Crammer and Y. Singer. Pranking with ranking. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2001.

[13] Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf. Ranking on data manifolds. Advances in Neural Information Processing Systems, (16), 2004.

[14] P. Doyle and J. Snell. Random Walks and Electric Networks. Mathematical Assoc. of America, 1984.

[15] Eibe Frank and Mark Hall. A simple approach to ordinal classification. In Proceedings of the European Conference on Machine Learning, pages 145-165, 2001.

[16] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In Proc. of the Fifteenth Intl. Conf. on Machine Learning, pages 170-178, 1998.

[17] Rayid Ghani, Sean Slattery, and Yiming Yang. Hypertext categorization using hyperlink patterns and meta data. In Carla Brodley and Andrea Danyluk, editors, Proceedings of ICML-01, 18th International Conference on Machine Learning, pages 178-185, Williams College, US, 2001. Morgan Kaufmann Publishers, San Francisco, US.

[18] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, MIT Press, 2000.

[19] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proceedings of the International Joint Conference on Artificial Intelligence, 1997.

[20] T. Joachims. Making large-scale svm learning practical. Advances in Kernel Methods - Support Vector Learning, 1999.

[21] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 137-142, 1998.

[22] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

[23] Thorsten Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the International Conference on Machine Learning (ICML), 2003.

[24] Thorsten Joachims. Transductive learning via spectral graph partitioning. In Proc. ICML'03, 2003.

[25] Klaus Brinker. Active learning of label ranking functions. In Proc. 21st International Conf. on Machine Learning, 2004.

[26] M. Kendall. Rank Correlation Methods. Charles Griffin and Company Limited, 1948.

[27] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.

[28] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proc. 19th International Conf. on Machine Learning, 2002.

[29] Oren Kurland and Lillian Lee. PageRank without hyperlinks: Structural re-ranking using links induced by language models. In Proc. of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 306-313, 2005.

[30] Neil D. Lawrence and Michael I. Jordan. Semi-supervised learning via gaussian processes. Advances in Neural Information Processing Systems, (17):753-760, 2005.

[31] G. Lebanon and J. Lafferty. Conditional models on the ranking poset. Advances in Neural Information Processing Systems, 15, 2003.

[32] Guy Lebanon and John Lafferty. Cranking: combining rankings using conditional probability models on permutations. In International Conference on Machine Learning, 2002.
[33] Tao Li, Chengliang Zhang, and Shenghuo Zhu. Empirical studies on multi-label classification.

[34] M. O'Connor and J. Herlocker. Clustering items for collaborative filtering. In Proceedings of the SIGIR-2001 Workshop on Recommender Systems, 2001.

[35] Foster J. Provost and Tom Fawcett. Robust classification for imprecise environments. In AAAI/IAAI, pages 706-713, 1998.

[36] J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. ICML, 2005.

[37] S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In Proc. TREC-8, 1999.

[38] S. Kramer, G. Widmer, B. Pfahringer, and M. DeGroeve. Prediction of ordinal classes using regression trees. Fundamenta Informaticae, 47:1-13, 2001.

[39] G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1971.

[40] B. Sarwar, G. Karypis, and J. Konstan. Incremental singular value decomposition algorithms for highly scalable recommender systems. In 5th International Conference on Computer and Information Technology, 2002.

[41] M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2001.

[42] A. Shashua and A. Levin. Taxonomy of large margin principle algorithms for ordinal regression problems. Technical Report 39, Leibniz Center for Research, School of Computer Science and Engineering, the Hebrew University of Jerusalem, 2002.

[43] Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Proceedings of the 14th Conference on Neural Information Processing Systems (NIPS), 2003.

[44] N. Srebro, J. D. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Proc. NIPS, 2005.

[45] M. Szummer and T. Jaakkola. Partially labeled classification with markov random walks. Advances in Neural Information Processing Systems, 14, 2001.

[46] Y. Weiss and W. T. Freeman. Correctness of belief propagation in gaussian graphical models of arbitrary topology. Neural Computation, (13):2173-2200, 2001.

[47] C. Williams. Computation with infinite neural networks. Neural Computation, 10(5), 1998.

[48] G.-R. Xue, H.-J. Zeng, Z. Chen, W.-Y. Ma, H.-J. Zhang, and C.-J. Lu. Implicit link analysis for small web search. In Proc. SIGIR'03, pages 56-63, 2003.

[49] K. Yu, A. Schwaighofer, V. Tresp, W. Ma, and H. J. Zhang. Collaborative ensemble learning: combining collaborative and content-based information filtering via hierarchical bayes. In Proc. UAI, 2003.

[50] D. Zhou, B. Schölkopf, and T. Hofmann. Semi-supervised learning on directed graphs. In Proc. NIPS, 2005.

[51] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems, 16, 2004. MIT Press, Cambridge.

[52] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning from labeled and unlabeled data on a directed graph. In ICML'05: Proceedings of the 22nd International Conference on Machine Learning, pages 1036-1043, New York, NY, USA, 2005. ACM Press.

[53] S. Zhu, X. Ji, W. Xu, and Y. Gong. Multi-labeled classification using maximum entropy method. In Proc. ACM SIGIR, 2005.

[54] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin-Madison, 2005.

[55] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Science, University of Wisconsin-Madison, 2005.
[56] Xiaojin Zhu and John Lafferty. Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 1052-1059, New York, NY, USA, 2005. ACM Press.

[57] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. ICML, 2003.