. ‘
_......«,,,..U.a.,..
.~ 1.. . cw
., .ur ‘
{Aunts

.t.

i: . a» 3
‘55.“. a... Y
x» .. 9.533.?
.53..

.. l ..

yum...
A

. , . z. . .2
‘ .2.» a.
.. 1h

 

.‘u‘., L a. , . ,... la . ...
$51.3 .... hinmr ‘ , 7..ﬂ.ﬁ.€:.u... Bx.

 

 

. H . . . . A
. . l...

2 Km 1

This is to certify that the
dissertation entitled

Label Propagation for Classification and Ranking

 

presented by

Ming Wu

LIBRARY
Michigan State
University

has been accepted towards fulfillment
of the requirements for the

 

 

 

Doctoral degree in Computer Science

 

 

."
l1 /
i , / {I ‘
/ W

Major Professor’s Signature

July 16, 2007

 

Date

MSU is an affirmative-action, equal-opportunity employer

 

 

PLACE IN RETURN BOX to remove this checkout from your record.
i To AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

1 DATE DUE DATE DUE DATE DUE
'0 1 1 3 1 1

 

ﬂ

 

 

 

M

b
<9?

 

H».
Hi
“l

 

 

 

 

 

 

 

 

 

 

6/07 p:lClRC/DateDue.indd-p.1

 

LABEL PROPAGATION FOR CLASSIFICATION AND
RANKING

By

Ming Wu

A DISSERTATION

Submitted to
Michigan State University
In partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY
Department of Computer Science

2007

ABSTRACT
LABEL PROPAGATION FOR CLASSIFICATION AND RANKING
By
Ming Wu

Label Propagation has been proven to be an effective semi-supervised learning ap-
proach in many applications. The key idea behind label propagation is to ﬁrst construct a
graph in which each node represents a data point and each edge is assigned a weight often
computed as the similarity between data points, then propagate the class labels of labeled
data to neighbors in the constructed graph in order to make predictions. This dissertation is
a comprehensive study of the label propagation approaches in different directions including

Relation Propagation, Rank Propagation and Propagation over Directed Graphs.

Most previous works on label propagation propagate information among a single type
of objects. However, many applications involve multiple types of objects. Inspired by
the assumption that the correlation among different types of objects can be very helpful
in many cases, a generalized framework for Relation Propagation is proposed to explore
the correlation in semi-supervised learning. The key idea behind relation propagation is to
construct a graph which involves multiple types of objects and then propagate the relation
among different types of objects in this graph. The framework for Relation Propagation is
applied to multi—label learning (classiﬁcation problems) and collaborative ﬁltering (ranking
problems). Empirical results show that relation propagation is a more effective approach in

compari son with the previous approaches in label propagation.

It is very important to study the label propagation approaches for ranking problems due
to the existing challenges for label propagation. First, it may not be appropriate to propa-
gate class labels (ordinal values) as numerical values in classiﬁcation problems. It seems

more reasonable to cast the problem into a ranking problem in which the class labels are

converted to pairwise preferences between classes for each example and then the prefer-
ences are propagated. Second, most previous studies require absolute labels for learning
which are often hard to obtain. Instead, relative ordering information is more easily avail-
able. Traditional label propagation approaches may not ﬁt with ranked data. Inspired by
these challenges, a Rank Propagation framework is proposed for supervised learning. The
key idea behind Rank Propagation is to propagate the given preference judgements, in-
stead of the true labels, from the labeled data to unlabeled data and compute the preference
matrices for unlabeled data whose principal eigenvectors correspond to the class assign-
ments. The application of this framework is presented in a multi-label categorization task
with multiple datasets. The empirical results show that Rank Propagation is an effective
approach in comparison with other commonly used supervised learning approaches.

Most studies in label propagation focus on using the undirected graphs. Motivated by
the assumption that directed graph may better capture the nature of data, a framework for
Propagation over Directed Graphs is proposed for utilizing the directed graph in propaga-
tion. The question involved in this approach is how to construct and utilize a directed graph.
Two asymmetric weight measures, namely KL divergence-based measure and asymmetric
cosine similarity, are proposed in order to construct a directed graph. To utilize the directed
graphs, one common method is to convert the directed graphs into undirected ones and then
apply a standard label propagation approach to the converted undirected graphs. A random
walk related method is discussed for this conversion. The application of this framework is
presented in a binary classiﬁcation task and a multi-label classiﬁcation task. The empirical
results show the effectiveness of Propagation over Directed Graphs in comparison with
other approaches.

In summary, this dissertation discusses the proposed approaches in label propagation,
namely Relation Propagation, Rank Propagation and Propagation over Directed Graphs,
in order to address the existing challenges in this area. Empirical studies show that the

proposed approaches achieve promising performance in given scenarios.

© Copyright 2007 by Ming Wu
All Rights Reserved

ACKNOWLEDGMENTS

This thesis would not have been possible without the support of many people. Many thanks
to my advisor, Rong J in, who read my numerous revisions and helped make some sense
of the confusion. Also thanks to my committee members, Eric Tomg, Pang-Ning Tan, and
Sarat Dass, who offered guidance and support. Special thanks to my ﬁancé, Aron White,
who put up with me and provided so much love and support. Finally, thanks to my parents,

and many friends who walked with me every step on this journey.

TABLE OF CONTENTS

LIST OF TABLES ix
LIST OF FIGURES x
1 Introduction 1
2 Label Propagation : A Graph-based Approach 8
2.1 Related Works ................................... 9
2.2 Label Propagation based on Gaussian Fields and Harmonic Functions ..... 11
2.2.1 Description of the Framework ......................... 11
2.2.2 Interpretation .................................. 14
2.2.3 Learning the Weight Matrix .......................... 16
2.3 Learning with Local and Global Consistency .................. 17
2.3.1 Description of the Method ........................... 18
2.3.2 The Regularization Framework ......................... 19
2.4 Summary ..................................... 20
3 Relation Propagation: A Generalized Framework 21
3.1 A Graph-based Framework ............................ 23
3.2 Applications for Classiﬁcation Problem ..................... 26
3.2.1 Multi-Label Learning Model .......................... 26
3.2.2 Case Study: Image Categorization ....................... 30
3.3 Applications for Ranking Problem ........................ 38

vi

3.3.1 Ranking Learning Model for Collaborative Filtering ............. 38

3.3.2 Case Study 1: Movie Recommendations .................... 47
3.3.3 Case Study 11: Book Crossing ......................... 50
3.4 Drawbacks of Label Propagation Approaches .................. 54
4 Propagation for Ranked Data 56
4.1 Ranking Problem Setting ............................. 59
4.2 Related Works ................................... 59
4.2.1 Statistical Measure of Rank Correlation .................... 60
4.2.2 Ranking SVM ................................. 61
4.2.3 Large Margin Principle ............................. 63
4.2.4 PRank Algorithm ................................ 65
4.2.5 Conditional Model on Permutations ...................... 66
4.2.6 Conditional Model on Ranking Poset ..................... 67
4.3 Rank Propagation for Multi-label Learning ................... 68
4.3. 1 Preliminaries .................................. 69
4.3.2 Preference Matrix ................................ 69
4.3.3 Framework Description ............................. 73
4.3.4 Case Study: Multi-label Categorization .................... 77
5 Propagation over Directed Graph 92
5.1 Convert a Directed Graph into an Undirected Graph ............... 93
5.1.1 Conversion to Undirected Graphs ....................... 94
5.1.2 Propagation Scheme: Spectral Graph Transducer ............... 95
5.2 Construct a Direct Graph ............................. 97
5.3 Case Study 1: Binary Classiﬁcation ........................ 99
5.3.1 Experiment Setup ................................ 99
5.3.2 Experiment Results ............................... 101

vii

5.3.3 AnalysisandDiscussion ....................... ..... 105

5.4 Case Study 11: Multi-label Classiﬁcation ..................... 115
5.4.] Experiment Setup ................................ 1 15
5.4.2 Results and Analysis .............................. 116
6 Conclusion and Future Work 118
6.1 Summary ..................................... 1 18
6.2 Future Work .................................... 121
BIBLIOGRAPHY 123

viii

3.1
3.2

4.1
4.2
4.3
4.4

5.1
5.2
5.3
5.4

LIST OF TABLES

Average Precision for the Top 20 Ranked Movies in Movielens Dataset. 50
Average Precision for the Top 10 Ranked Books in Book Crossing Dataset. 52
PRBEP for Different Datasets with Different Number of Training Examples. . . 82
Area under ROC Curve for MovieLens Dataset. ................. 89
Area under ROC Curve for St. Andrews Dataset. ................ 89
Area under ROC Curve for Yeast Dataset. .................... 90
The F 1 results for Movielens Dataset. ...................... 102
F1 measure for 20-newsgronp Dataset ......................... 103
F 1 measure for Renters-21 5 78 R8 Dataset. ..................... 104
The F 1 Results for 20-newsgroup Dataset with Different A Values ........ 1 14

2.1
2.2

3.1
3.2
3.3
3.4
3.5
3.6
3.7

4.1
4.2
4.3
4.4
4.4
4.5
4.5
4.5
4.5

5.1
5.2
5.3

LIST OF FIGURES

An Example of the Connected Graph for Label Propagation ........... 9
The Random Fields Constructed on Labeled and Unlabeled Examples ...... 12
An Example of the Connected Graph for Relation Propagation ......... 22

An Efﬁcient Algorithm for Multi-label Learning using the Relation Propagation 30
Classiﬁcation F 1 Results comparison using Relation Propagation ........ 35
Classiﬁcation Results of the Relation Propagation using Different 73. K1) = 20 36
Classiﬁcation Results of the Relation Propagation using Different Kps. 7 = 1 . 37
Average Precision for Top Ranked Items Given 5, 10 or 20 Movies. ...... 51

Average Precision of Top Ranked Items Given 5, 10 or 20 Books ......... 53

Analysis of Principal Eigenvectors of Preference Matrix for Generated Datasets. 71

Analysis of Principal Eigenvectors of Preference Matrix for Study Datasets. . . 72
F1 Scores of Three Algorithms for MovLens C18 Dataset ............. 83
F1 Scores of Three Algorithms for MovLens C18 Dataset (cont’d). ....... 84
F1 Scores of Three Algorithms for St. Andrews C10 Dataset. .......... 85
F1 Scores of Three Algorithms for St. Andrews C10 Dataset (cont’d). ..... 86
F1 Scores of Three Algorithms for Yeast C14 Dataset ............... 87
F1 Scores of Three Algorithms for Yeast C14 Dataset (cont’d). ......... 88
F1 Scores Comparison of Different B Values in Rank Propagation ........ 91
The Algorithm for Classiﬁcation on a Directed Graph .............. 96
The Distribution of Document Length in Three Datasets. ............ 105
Similarity Matrix Analysis of MovieLens Dataset ................ 106

5.4
5.5
5.6
5.7
5.8
5.9

Similarity Mtrix Analysis of 20-newsgroup Dataset ............... 107

Similarity Matrix Analysis of Reuters-21578 R8 Dataset ............ 108
Distribution of Similarities in MovieLens Dataset ................. 109
Distribution of Similarities in 20-newsgroup Dataset. .............. 110
Distribution of Similarities in Reuters-21 5 78 R8 Dataset. ............ 11 1
F1 Measure for MovieLens Dataset in Multi-label Classiﬁcation Task. ..... 117

xi

Chapter 1

Introduction

As a graph-based approach, label Propagation belongs to the family of semi-supervised
learning and has been proved to be effective for both text categorization and information
retrieval [47, 10, 57, 50]. Assuming the objects in question are documents, the key idea
of label Propagation is as follows: ﬁrst, view each document as a node in a connected
graph and each edge of the graph is associated with a weight which is proportional to the
similarity between the two connected nodes/documents; then, the conﬁdence scores for
all the classes of the unlabeled documents are estimated by propagating the class labels
of the labeled documents through the weighted edges; ﬁnally, the estimated conﬁdence
scores are either used to determine the appropriate categories for unlabeled documents in
text categorization, or used to rank the documents in information retrieval. One of the key
components in label propagation is the document similarity. It is usually calculated based
on the content of documents, using either the dot product between two document-term
vectors or the RBF kernel function. It can also be computed using the side information,

such as the hyperlinks among web pages [17].

A number of approaches have been developed in the past for label propagation. includ-
ing the approach based on the Harmonic function [57], Gaussian random ﬁeld [47], and the

Green function approach [50]. In the meantime, a number of propagation approaches have

1

been developed in information retrieval, including the pagerank algorithm [8], the HITS
algorithm [27], and retrieval based on implicit link [48, 29].

This dissertation explores and extends the idea of label propagation in several different
avenues, namely Relation Propagation, Rank Propagation and Propagation over Directed

Graphs.

Relation Propagation. First, the generalized framework for Relation Propagation will
be exploited and evaluated. Most previous works on label propagation propagate infor-
mation among a single type of objects [57, 51, 23]. However, in the real world, many
applications involve multiple types of objects. For instance, document categorization tasks
involve two types of objects: documents and categories. In these scenarios, the informa-
tion needs to be propagated among the objects of different types in order to utilize the
correlation among the objects of different types which can provide useful information for
classiﬁcation and ranking problems. Inspired by the assumption that correlation among
the different types of objects can be very helpful in many cases, a generalized framework
is proposed for Relation Propagation. The difference between our work and previous re-
search is that the Relation Propagation framework allows the relationship between multiple

types of objects to be propagated and thus improve the performance.

To better understand this framework, consider the scenario which involves two types of
objects: A and B. The key idea behind Relation Propagation is to ﬁrst construct a graph in
which each node is a object pair consisting of one object from each type and then propagate
relationship between A and B in this graph. This framework of Relation Propagation can
be easily applied to many applications. For instance, in a document categorization task, two
types of objects are documents and categories. The graph can be constructed as: each node
is a document-category pair whose label is the membership of the document belonging
to the category; each edge is associated with a weight which can be deﬁned using some

similarity measure. Then, the membership information can be propagated through this

2

graph. The resulting scores for each document with respect to all categories will lead to a
ranking list of all categories for this document. The framework can also be easily extended

to the scenario with more than two types of objects.

In order to show the effectiveness of the relation propagation model. it is applied to two
kinds of tasks: text categorization tasks and collaborative ﬁltering tasks. The text catego-
rization task is studied with an image retrieval dataset in which two types of objects are
images and categories. The membership of an image belonging to a category is the infor-
mation to be propagated. The collaborative ﬁltering problem is studied with movie ratings
and book recommendation datasets in which users and items are two types of objects and
the ratings information is propagated. Empirical results show that the Relation Propaga—
tion achieves very promising performance in many applications in comparison with other

state-of—art techniques.

Rank Propagation. The ranking problem has always been an important subject in many
areas such as machine learning and statistics. A ranking problem often requires the ranked
data as the input with the goal of generating a ranking function. Many previous works
have been done in this area [2, 22, 43, 12, 32, 31]. Although label propagation, as an
effective semi-supervised learning technique, has shown promising performance in many
applications, it is quite natural to explore the label propagation approaches on ranked data

because of the existing challenges in label propagation.

One of the challenges is related to propagating directly the class labels. As described
above, the label propagation approaches address all the problems by ﬁrst propagating the
class labels and. then generating ranking lists based on the resulting propagation scores
computed by the algorithm. This may not be an appropriate approach in certain cases (e. g.,
classiﬁcation) since there is no ordering information among class labels. For example,
consider the numerical class labels 1, 2, 3, ~ - -. In most cases, the numbers just provide the

labeling for classes instead of the ordering information. Class 1 is not greater or smaller

than class ‘2. When the class labels are propagated directly, the ordering information will
be incorrectly introduced into the propagation process. One way to avoid this problem is
to ﬁrst convert the class labels into pairwise preferences between classes for each training
example and then prOpagate these preferences. Clearly this idea can be cast to a ranking

problem.

Another challenge is the difﬁculty of obtaining the exact labels of the labeled data.
Most previous studies require the absolute label values or numeric values to be given in
order to learn the ranking function, which is often not practical. The absolute labeling
information is usually hard to obtain or is very costly and time consuming. In contrast,
the partial relative ordering information is more easily available for training in some cases.
For instance, in a movie ratings dataset, the users provide the ratings which indicate the
preferences of certain movies over other movies. Traditional label propagation may not ﬁt

in these applications because they are required to propagate the absolute labels.

To address these problem, a Rank Propagation framework is proposed for multi-label
learning. The discussion starts with the deﬁnition of a general ranking problem whose input
is the preference judgements, followed by the presentation of the Rank Propagation frame-
work of supervised learning. The key idea behind Rank Propagation is to propagate the
preference judgements (relative ordering) of the labeled data between the labeled data and
unlabeled data, instead of propagating the true labels of the labeled data. The preference
judgements can be directly given by the datasets or computed from the labeles of the given
data. Consider the example of text categorization. For each document, a preference matrix
can be computed with respect to the categories and each entry in the preference matrix is
the preference between two categories of this document. The preference judgements from
the labeled data are propagated between the labeled documents and the unlabeled docu-
ments. The goal of Rank Propagation is to achieve the preference matrices for unlabeled

data whose principal eigenvectors are proved to correspond to the class assignments.
The application of this framework is presented in a multi-label categorization task with

4

movie categorization, image retrieval and gene categorization datasets. The empirical re-
sults show that Rank Propagation is an effective approach in comparison with other com-
monly used supervised learning approaches, and thus verify our assumption that propagat-

ing the relative ordering information can be more helpful for solving the problems.

Propagation over Directed Graphs. As mentioned above, label propagation often in-
volves a graph constructed from the given datasets. Most previous studies in label prop-
agation focus on using the undirected graphs [57, 51]. However, directed graphs can be
more appropriate in describing the given datasets in many cases. For instance, in the web
page categorization scenario, it’s often better to view the hyperlinks as directed edges in-
stead of undirected edges in the graph. A number of works have been devoted to the
graph-based approaches on the directed graph [50, 52]. Motivated by the assumption that
directed graph may better capture the nature of the data, a framework for Propagation
over Directed Graphs is proposed. The general idea of this approach is to ﬁrst construct
a directed graph from a given dataset, then convert this directed graph into an undirected
graph and ﬁnally propagate the labeling information over this converted graph using some

standard propagation scheme.

There are two questions involved in this approach. The ﬁrst question is how to construct
a directed graph from the given data. In label propagation, a matrix is often computed in
which each entry is the pairwise relationship between two data points. Based on this ma-
trix, the graph is constructed in which each node represents a data point and each edge is
associated with a weight equal to the corresponding pairwise relationship in the matrix.
Evidently symmetric weight measures result in the undirected graphs since two edges con-
necting two nodes carry the same weights while asymmetric weight measures lead to the
directed graphs. Based on this observation, two asymmetric weight measures are pr0posed,
namely KL divergence-based measure and asymmetric cosine similarity, in order to con-

struct a directed graph. The second question is how to utilize the directed graphs. One

5

straightforward way is to convert a directed graph into an undirected one and then apply
the existing label propagation approaches to the converted undirected graph. The regu-
larization framework in [52] is discussed for this conversion. Spectral Graph Transducer
(SGT) is used as the standard propagation scheme in the proposed approach because of its

proven effectiveness in various applications.

The application of this framework is presented in a binary classiﬁcation task and a
multi-label classiﬁcation task. In the binary classiﬁcation task, the approach of Propaga-
tion over Directed Graphs is compared with propagation using undirected graphs and other
state-of-art classiﬁcation techniques with multiple datasets. The empirical results show the
effectiveness of Propagation over Directed Graphs in comparison with other approaches.
The analysis of why directed graphs outperform other approaches veriﬁes the assumption
that directed graphs can better express the nature of the data in some cases. A brief
empirical study is given for the multi-label classiﬁcation in which the proposed approach
is compared with only the label propagation approach based on harmonic functions and
Gaussian ﬁelds. The goal is to show that using directed graphs is more effective than using

undirected graphs from a different point of view.

It is important to note that this dissertation focuses on the rank-related evaluation met-
rics for all the methods, which is different from the commonly used metrics in traditional
classiﬁcation and regression approaches. Traditionally, researchers often evaluate the ef-
fectiveness of the label propagation approaches by classiﬁcation accuracy. This requires an
algorithm to generate the exact labels or numerical values in order to compare with the true
assignments of classes. Since most label propagation approaches generate the scores which
can only provide the ranking information for the unlabeled data, it is necessary to convert
these scores into true labels. However, by converting from the ranking into the binary clas-
siﬁcation decision, it always involves the selection of threshold which is often determined

empirically and prone to inaccuracy. In order to evaluate the essential effectiveness of the

6

label propagation approaches without the complication of the threshold, this dissertation
focuses on the label propagation approaches for ranking problems. Later chapters will dis-
cuss how label propagation can be applied to many scenarios by solving a ranking problem
with the presentation of the proposed works.

The rest of this dissertation is organized as follows: Chapter 2 introduces the gen-
eral idea of label propagation with the existing related works and discusses in detail two
label propagation approaches which are closely related to our work. Chapter 3 presents
the Relation propagation approach which includes both the general framework of this ap-
proach and its applications for the multi-label learning task and the collaborative ﬁltering
task with the empirical results. Several drawbacks of the label propagation approaches
for ranking problems will be discussed at the end of the chapter. Chapter 4 deﬁnes the
general ranking problem and proposes the Rank Propagation framework for supervised
learning which propagates the relative ordering information instead of true labels of the
labeled data. Chapter 5 proposes the framework of Propagation over Directed Graphs and
discusses its applications of binary classiﬁcation and multi-label classiﬁcation along with
an analysis of the reasons why directed graphs can better express the nature of the data.

Chapter 6 concludes this dissertation with summarization and future work.

Chapter 2

Label Propagation : A Graph-based
Approach

In many traditional approaches to machine learning, the learner is trained by labeled ex-
amples. Labeled examples are often, however, time consuming and expensive to obtain
because they require very skilled human annotators. Therefore, the problem of combining
labeled and unlabeled examples together in the learning process becomes very important
and necessary. The semi-supervised learning utilizes both labeled and unlabeled examples
and has attracted an increasing amount of interest. A number of semi-supervised learning
approaches have been proposed [41, 4, 23, 5, 51]. Among these approaches, there is one
family of techniques which exploit the manifold structure of the data that is usually inferred
from the pairwise similarity of unlabeled and labeled data points. This family of techniques
is referred to as label propagation since they propagate the label information from labeled
examples to unlabeled examples by using the pairwise similarity. The unlabeled examples
are classiﬁed by using the labeling information propagated to the unlabeled examples from
the labeled examples. The general process of label propagation is to propagate the labels
of the labeled data over the graph in which each node represents a data point and each

edge connecting two nodes is associated with a user deﬁned weight. Figure 2.1 shows an

8

example of the graph for label propagation.

d1

 

d /O
3 O ”’35 d8

d7

0d.

Figure 2. 1: An Example of the Connected Graph for Label Propagation

2.1 Related Works

Nearly all label propagation approaches are based on the prior assumption of consistency,
which means: (1) nearby points are likely to have the same label; and (2) points on the
same structure (typically referred to as a cluster or a manifold) are likely to have the same
label. The difference among various algorithms for label propagation lies in the way to
model the structure of the data and realize the assumption of consistency, and the way to
propagate the label information from labeled examples to unlabeled ones.

A number of previous works have been devoted to label propagation [47, 10, 57, 50]. In
[57, 54, 56], Gaussian random ﬁelds and Harmonic functions-based methods are motivated
by the assumption that the label assignments should be smooth over the entire graph. In
[51], the local and global consistency method proposes to enforce the smoothness of the
class assignments by minimizing the sum of local variations measured at each edge in an

undirected graph. In [23], Spectral Graph Transducer, which can be viewed a transduc-

9

rive version of KN N, propagates the labels of the labeled data by incorporating a quadratic
penalty on the labeled data into the minimization problem of normalized graph cuts with
constraints. In [4, 5], the authors propose to extend the graph mincuts method for trans-
ductive learning. The key idea is to search for a partition of the graph which results in a
minimum sum of weights of the cut while agreeing with the labeled data. Local Laplacian
embedding methods propose to project the data from the original space to a dimension
reduced space and then the classiﬁcation function can deﬁned on the reduced dimension
space [3, 13].

Among various approaches in label propagation, the two most popular ones are the
Gaussian ﬁelds and Harmonic functions-based approach [57] and the local and global con—
sistency approach [51], which both have been successfully applied to a number of appli-
cations. We will present these two approaches in detail due to their resemblance to our
proposed framework of Relation Propagation in Section 3. The Harmonic function-based
approach predicts the class labels by enforcing the smoothness of the class assignments
over the entire graph. The smoothness of the class assignments is deﬁned by a quadratic
energy function which has a harmonic solution. The local and global consistency approach
presents a simple algorithm to achieve a smooth classiﬁcation function with respect to the
intrinsic structure collectively revealed by known labeled and unlabeled points. This al-
gorithm enforces the smoothness by minimizing the sum of local variations measured at
each edge in the graph. These two methods are consistent with each other in that (1) both
of them achieve the smoothness of label assignment probabilities over the entire graph by
minimizing the sum of the variations deﬁned at each edge; (2) the initial labels with the
labeled data are retained in some way for both methods. In the following section. the two

different label propagation approaches are reviewed in detail.

10

2.2 Label Propagation based on Gaussian Fields and Har-

monic Functions

[57] introduces a label propagation approach based on a random ﬁeld model deﬁned on a
weighted graph over the unlabeled and labeled data, where the weights are given in terms
of a similarity function between instances.

This label propagation method adopts Gaussian ﬁelds over a continuous state space
rather than random ﬁelds over a discrete label set. This “relaxation” to a continuous rather
than discrete sample space results in many attractive properties in that the most probable
conﬁguration of the ﬁeld is unique, characterized in terms of harmonic functions with a
closed form solution that can be computed using matrix methods or loopy belief propaga-

tion.

2.2.1 Description of the Framework

Assume there are 1 labeled examples (1171.111).---.(:r,,y,) and n. unlabeled examples
xi+1,-~.rl+u (1 << it). Let n ._—_ l + u be the total number of data examples. Con-
sider the binary classiﬁcation problem: y E {0, 1}. Construct a connected graph G (V, E )
where V corresponds to the 72 data examples with L = {1. - - - ,1} corresponding to the
labeled examples with labels {y1. - . - .311} and U = {l + 1. - - - .l + u} corresponding to the
unlabeled examples. The goal is to assign labels to U. Let W denote a n x n symmetric

weight matrix on the edges E of the graph. W can be computed by RBF kernel as:
m , . — ‘ 2
rm :- exp (._. Z W) (2.1)
d=l ‘1

where and is the d-th element for the example :11,- represented as a vector :13, E R’", and ad is
length scale hyper parameters for each dimension. Obviously, nearby points in Euclidean

space are assigned large edge weight. Other measures are also possible to be adopted if

11

1onc \4
\/ /
\ /324\/ \/
, //\ / /...._.
\ /

5 0

live zero

 

Figure 2.2: The random ﬁelds used in this work are constructed on labeled and unlabeled
examples. The graph is formed with weighted edges between instances (digits), with la-
beled data as special “boundary” points and unlabeled points as “interior” points.

they are meaningful for the application as long as they generate non-negative weights.
The goal is to ﬁnd a real-valued function f : V —> R on G and then assign the labels
based on f. Also we constrain f (i) = f,(i) E y,- for all labeled examples. We use f, to de-
note the function values for labeled examples and fa for unlabeled examples. It is intuitive
to assume that data points sharing large similarities should have similar label assignments.

This assumption leads to the quadratic energy function
1 _ , .
I: f) = 5 give (fit) — finf (2.2)

A probability distribution is assigned on f by forming the Gaussian ﬁeld p3( f) =
e~r3E(f)

Zr?
Z5 2 f fl L: I: exp (“’x/jE ( f )) df , which normalizes over all functions constrained to f; on

, where 13 is an “inverse temperature” parameter, 2,; is the partition function

the labeled data. Clearly minimizing this energy function will lead to a smooth label assign-
ment function f because the minimum value can only be reached by enforcing the similar
label probabilities to the pairs of nodes which share large similarities. In other words, each

node will ﬁnally be assigned a label probability which is consistent with its neighbors and

12

this leads to a smooth label assignments over the entire graph.

The minimum energy function f : argminﬂszlEU) is harmonic. It satisﬁes Af = 0
on unlabeled data points U and is equal to f; on the labeled examples L. A is combinatorial
Laplacian and the matrix form is A = D — W where D = diag(d,—) is the diagonal matrix
with entries (1,- = 2]. run-,- and W = [1120-] is the weight matrix. The harmonic property
means that the value of f at each unlabeled data point is the average of f at neighboring

points

f0) = diijU-fiz')

i~j
where j : l + 1,- - ' ,l + n. In other words, f = Pf where P = D‘1W. According to the
maximum principle of harmonic functions [14], a harmonic function f deﬁned on S (S is
an open subset of R") takes on its maximum value and its minimum value on the boundary.
Thus we guarantee 0 < f ( j ) < 1 for j 2 l+ 1 since the labeled data are “boundary” points

and unlabeled data are “interior" points in the graph.

The harmonic solution can be computed explicitly in matrix form. The weight matrix

W can be split into 4 blocks after lth row and column:

in, in.

W =
l/Vul ”furl
l
Letting f = , the harmonic solution A f = 0 subject to f IL 2 f; can be presented
fu
as:
fu : (Dun. _ H'l‘iru)—1I'Vulf1 Z (I "' Rut)_1PulfI (23)

To better understand the above method, we will show its application to document classi-
ﬁcation task. In order to apply the label propagation method, we relax the discrete class

labels to a continuous space. The classiﬁcation method will be divided into two steps: ﬁrst,

13

we estimate the continuous class label scores by propagating the class labels of labeled data
over the graph; then based on the continuous class label scores, we can make predictions
by an empirically determined threshold.

Let D = (d1. d2, . . . .d.,,) denote the entire collection of documents. Assume that the
ﬁrst in documents of D are labeled by y, = (y1 , yg, . . . , y,” ), and the remaining "u = n — m
documents are unlabeled. Each label y, can either be +1 or —1. Let S = [Sig]mm denote
the similarity matrix for both labeled and unlabeled documents. To estimate the class labels
for the unlabeled document, denoted by ya, the label propagation approach described above

suggested minimizing the following energy function, i.e.,

Es = Z S.-,,-(y,- — 1M2 = yTLy (2.4)

i,j=1

where y : (le yl)? Matrix L : [LL 1],,” is the graph Laplacian for similarity matrix
S. It is deﬁned as L = D —— S where D = diag(D1, D2, . . . , D,,) and D, = 237:, S”. The

optimal solution yu for the above problem is approximated in [57] as
y. = L;,}.Su,zyz. (2.5)

where L”, refers to the part of the graph Laplacian or combinatorial Laplacian L that is
only related to the unlabeled documents, and Sui stands for the similarity matrix between

the unlabeled documents and the labeled documents.

2.2.2 Interpretation

The framework can be viewed in several fundamentally different ways. These differ-
ent viewpoints enrich the understanding and reasoning about this approach to the semi-

supervised learning problem.

The harmonic solution can be viewed as a random walk approach. Imagine a particle

14

walking along the graph G. Starting from an unlabeled data i, the particle moves to a node
j with probability PU after one step. The walk continues until the particle hits a labeled
node. Then f (i) is the probability that the particle, starting from node 2', hits a labeled
node with label 1. This view of the harmonic solution indicates that it is closely related to
the random walk approach in [45]. There are two major differences between the harmonic
function approach and the random walk approach: ﬁrst, f is ﬁxed on the labeled examples
in the harmonic function approach; second, the harmonic function solution described above
is achieved from an equilibrium state, while in [45] the random walk approach depends on

the time parameter t.

The solution f can also be viewed from the viewpoint of spectral graph theory. The
heat kernel with time parameter t on the graph G is deﬁned as K, = e‘m. Here K, (12,) is
the solution to the heat equation on the graph with initial conditions being a point source
at i at time t = 0. It was proposed as an appropriate kernel for machine learning with
categorical data in [28]. When used in a kernel method such as a support vector machine,
the kernel classiﬁer ﬁg) 2 2,6,1 oriyiKAi, j) can be viewed as a solution to the heat
equation with initial heat sources my, on the labeled data. The time parameter t must be
chosen using an auxiliary technique. The harmonic function approach described in the
previous section uses a different approach independent of t, diffusion time. Consider the
heat kernel K, = e‘m'w on A,,,, where Am, 2 Dm, — WW, the Laplacian restricted to the
unlabeled data examples on G. K , describes heat diffusion on the unlabeled subgraph with
Dirichlet boundary conditions on the labeled nodes. A Dirichlet boundary condition (often
referred to as a ﬁrst-type boundary condition) imposed on an ordinary differential equation
or a partial differential equation speciﬁes the values a solution is to take on the boundary
of the domain. Clearly Dirichlet boundary conditions ﬁx the values of the function on the

labeled data. The Green’s function 9 can be expressed as:

g: f Kgdiz / e“m"“dt= (puritans-1 (2.6)
(‘l 0

15

The harmonic solution in 2.3 can be written as

fit : gl'V-ulfl (2.7)

01'

= ZEWHAQ( k j) (2.8)
k

i=1
The above expression shows that this approach can be viewed as a kernel classiﬁer with the
kernel Q and a speciﬁc form of kernel machine.

The graph mincut approach proposed in [4] has very interesting and substantial connec-
tion to the harmonic function method described here. The starting point for the graph min-
cut approach is also a weighted graph G, but the learning problem is presented as ﬁnding
the minimum st-cut, where negative labeled data is connected to a special source node 3 and
positive labeled data is connected to a special sink node t. A minimum st-cut, which is not
necessarily unique, minimizes the L1 objective function EU) = g 21,) triij|f(i) — f(j)|
and corresponds to a function f : V —+ {— 1, +1}. The solution may not be unique and can
be obtained using linear programming. However, its random ﬁeld model is over the label
space {—1, +1} and the ﬁeld is pinned on the labeled entries. This leads to two problems.
First, the approximation methods based on rapidly mixing markov chains can not be used;
second, multi-label extensions of this approach are generally NP-hard. Instead, the har-
monic solution can be computed efﬁciently using matrix methods, even in the multi-label
case and inference for the Gaussian random ﬁeld can be efﬁciently and accurately carried

out using loopy belief propagation [46].

2.2.3 Learning the Weight Matrix

The weight matrix is an important input in the harmonic function approach described above.
Consider the weight matrix given by the equation 2.1, the parameter ad is the only param-

eter determining the weight matrix. By learning ad from both labeled and unlabeled data,

16

the graph structure can be better aligned with the data. One way to learn the parameter ad
is to minimize the entropy on the unlabeled data.

The average label entropy is used as a heuristic criterion for parameter learning and it
is deﬁned as

l+u

H(f) =§ 2 HM» (2.9)

i=i+1
where H,(f(i)) : —f(i)logf(i) — (1 — f(i))log(1 — f(i)) is the entropy of the ﬁeld
at the individual unlabeled data point i. The maximum principle of harmonic functions
guarantees that 0 < f (i) < 1 for i 2 l + 1. Small entropy implies that f (i) is close to 0 or
1 and this captures the intuition that a good W should result in a conﬁdent labeling.
Smoothing can be used to avoid the complication that H has a minimum at 0 as 00 —> 0.
We replace P (by normalizing weight matrix W) with the smoothed matrix P = 611 + (l —
e)W, where U is the uniform matrix with entries U.)- = 1 / (1 +11).

We use gradient descent to ﬁnd the hyperparameters ad that minimize H. The gradient

0 it ~ _ ‘ﬁuu ‘ﬁu
'—f : (1— R111) 1(qu'l'0. If!)

is computed as

 

and 80d (30d

Since the original matrix P is obtained by normalizing W, we have

 

i.)l,l7i»' ‘ 1+1]. 8“"
t J _ . . . m
()[)ij _ 30.1 pi] 272:1 (lad
”at anr l + 'rr'wm

. 011'," ‘ , 2 3
Finally. 7):} = 2'l1rrj(-Tdi — 47(1)) /0d°

2.3 Learning with Local and Global Consistency

The local and global consistency method [51] is similar to the harmonic function approach
described above. It is different from the harmonic function approach in that it measures
the smoothness of the classifying function in a different way. We ﬁrst introduce some

notations.

l7

Given a point set X = {.r], . - - ,:r,.:r1+1. - ~ - ,.r,,} C R’” and alabel set L = {1, ~ - - ,c},
the ﬁrst 1 points 2:,(2' S l) are labeled as g,- E y and the remaining points :r:,,(l+l g u. S n)
are unlabeled. The goal is to predict the label of the unlabeled points. Let .7 denote the
set of n x c matrices with nonnegative entries. The matrix F = [F I, . ~ - ,Fij E J:
corresponds to a classiﬁcation on the dataset X by labeling each point .r, as a label y, =

argmax Ej. In other words, each row of F can be viewed as the conﬁdence score vector

jSC
of assigning all class labels to one data point and we can understand F as a vectorial
function F : X ——) RC which assigns a vector F,- to each point 13-. Deﬁne a n. x c matrix

Y E .7: with Y.”- = 1 if r, is labeled as y,- = j and Yij = 0 otherwise. Y is consistent with

the initial labels of the labeled data.

2.3.1 Description of the Method

The method can described in the following steps: ﬁrst, we compute the weight matrix W
deﬁned as He,- 2 exp (— ”:17,- — :1: j||2 /202) i f i aé j and IV..- = 0; second, we construct the
matrix S = D’l/‘illiDd/g in which D is diagonal matrix with its (i, i)-element equal to
the sum of the i-th row of W; third, we conduct the iteration F (t+ 1) = (18 F (t) + (1 — (1))”
until it converges where a- is a parameter in (O, 1): ﬁnally. each point :12,- can be labeled as
g,- = argmax}? Ft}.

This method can be easily understood from the propagation point of view. We ﬁrst
deﬁne the weight matrix W which capture the manifold structure of the dataset D. The
graph G = (V, E) for propagation can be deﬁned on D, where the vertex set V includes
all points from D and the edges E are weighted by W. W is normalized symmetrically
for the convergence of the iteration. In the iteration step, the labels of the labeled data are
repeatedly propagated over the graph G until convergence. It is not too hard to see that each
point receives the information propagated from its neighbors and at the same time retains

its initial information. The self-reinforcement is avoided since the diagonal elements of the

weight matrix W are set to zero. After the propagation ends, we can label the unlabeled

18

data using F".

2.3.2 The Regularization Framework

Based on the above algorithm, a regularization framework can be developed as follows:

2 Tl
F. + u 2 HF. — Y.“2 (2.10)
i=1

$§3u~

i.j=1

 

 

 

 

 

 

0‘”

where p > 0 is the regularization parameter. Then the classifying function is

F‘! = argminpefQ(F) (2.”)

The ﬁrst term of the right-hand side in Equation 2.10 is the smoothness constant, which
means that a good classifying function should not change too much between nearby points.
The second term is ﬁtting constraint, which means a good classifying function should not
change too much from the initial label assignment. The trade-off between two terms is

captured by p.

F can be solved by differentiating Q(1~ ) with respect to F,

 

 

 

 

0Q
—~_~.=F*—SF* F*—Y=
1 ,
r“— SFV— “ Y=0
1 + [i l + it
Two new variables are introduced, a = , and l3 = l1; . Note that (t +13 : 1. Then
1 + it 1 + it

U—GSM"=BY

l9

Since I — GS is invertible, the solution is

r“ : 13(1 — aS)—‘Y.

2.4 Summary

Label propagation is an important technique in machine learning and has shown the promis-
ing performance in many applications such as classiﬁcation and ranking problems. The key
idea behind label propagation is to propagate the labels of the labeled data to the unlabeled
data over the graph constructed from the afﬁnity matrix based on the data and then make
predictions accordingly. Despite the various motivations behind different methods such as '
graph-cut based approaches [23, 4, 5], gaussian processes based approach [30], gaussian
random ﬁelds and harmonic function based approach [57, 54, 56] and so on, most label
propagation assumes the consistency of the resulting labels, which means that the data ex—
amples close to each other should have similar labels. The difference among various label
propagation methods lie in the way to model the consistency of the data and propagate
the labels. In addition to the overview of different algorithms for label propagation, we
also presented in detail two most popular algorithms for label propagation including the
gaussian random ﬁeld and harmonic function based approach, and the local and global

consistency approach due to their resemblance to our work in the following chapters.

20

Chapter 3

Relation Propagation:

A Generalized Framework

Despite the extensive study in label propagation, most previous work assumes that the sim-
ilarity graph consists of a single type of object, namely documents, and the propagation
is only conducted among objects of the same type. However, there are scenarios in many
applications that involve multiple types of objects, and the label information needs to be
propagated not only among objects of the same type, but also between objects of different
types. To illustrate this issue, consider the multi-label classiﬁcation problem. To directly
employ the label propagation approach, we will ﬁrst decompose the multi-label classiﬁca-
tion problem into a number of binary classiﬁcation problems, and then the labels of each
binary class is propagated through a similarity matrix. The problem with this approach is
that it is incapable of exploring the correlation information among different classes, which
is often extremely useful when the number of training documents is small. To see this,
consider a ﬁve-class classiﬁcation problem. Assume that for an unlabeled document (1, its
conﬁdence scores for the ﬁve classes after the binary class propagation are 0.4, 0.4, 0.4, 0.3,
and 0.3. Without using the class correlation information, it is equally likely to assign the

document d to category 1, 2, and 3. However, if both category 4 and 5 are highly correlated

21

 

Figure 3.1: An Example of the Connected Graph for Relation Propagation

with category 3 and meanwhile are orthogonal to category 1 and 2, we would expect that

category 3 is more likely to be the right class for document d than category 1 and 2.

We can generalize the above example into a propagation framework for multiple types
of objects. In particular, we treat the documents and the categories as two different types
of objects. Instead of propagating the labeling information among documents indepen-
dently for each category, we propose to propagate the relationship between documents
and categories. More speciﬁcally, we construct a weighted graph as follows: each node
0(2‘3’) 2 ((1,, Cj) in the graph represents the relationship that document (1,- belongs to the
category C}; any two nodes 0(1‘. j) and OW) in the graph are connected by an edge whose
weight reﬂects the correlation between the two corresponding relationships. In other words,
a large similarity between node OW) and OW) indicates that document d, is likely to be
assigned to class c,- if document (1;, belongs to class c), and vice versa. It is important to
note that class ck and (:1 can be different in the above propagation scheme. Figure 3.1 il-
lustrates an example of the graph for the relation propagation with three documents and
three categories. To distinguish from the previous work on label propagation, we refer to

the proposed framework as “Relation Propagation”, or RP for short.

We would like to emphasize that the relation propagation framework can be applied to

22

many applications in addition to the multi-label learning problems described above. For
instance, it can be applied to collaborative ﬁltering by viewing users and items as two
different objects, and the relationship Om) = (11,-. 2‘3) represents the rating of item tj by
user 11,-. Using the framework of relation propagation, we will be able to propagate the
rating information among different users and different items simultaneously. In particular,
it allows us to infer the rating of user it] on item t1 given the rating of user U2 on a different
item t2 if items t1 and it) share similar characteristics. This property can be extremely
useful to alleviate the sparse data problem, which has been an critical issue in collaborative

ﬁltering [7].

3.1 A Graph-based Framework

To facilitate our discussion, we ﬁrst describe the framework of relation propagation for
two types of objects, followed by the generalization to multiple types of objects. We then
present the efﬁcient implementation of applying the proposed framework to multi-label

learning and collaborative ﬁltering.

Let .A = ((1.,(12. . . . ,a.,,) and B = ((11.62, . . . ,lim) denote the two types of objects. Let
f ((1,, bj) : .A x B ——> R denote the relation function between an object of type A and an
object of type B. Let y denote the vector of size mn whose element gm) corresponds to the
relation f (a,, bj). Note that we use a double index (i, j) to refer an element in vector y. This
convention will be used throughout the entire paper. For the convenience of discussion, we
assume that the ﬁrst N) elements in vector y, denoted by y). are the labeled relations, and
the remaining Nu : mn — N, elements in y, denoted by yu, are the unlabeled relations that
need to be predicted. Finally, let S" = [Sgt-hm and SB 2 lSEJlmxm denote the matrices

of similarities among the objects of type A and among the objects of type B, respectively.

In order to incorporate the similarity information 8" and S3 into the propagation

23

scheme, we modify the energy function in Equation (2.4) into the following expression:
"J

Em 1‘: Z 81'ka J?! (1101))“ y(k.l))2 (3-1)

ik: 171 1

In the above, we introduce the weight Sf US? ,the product of the similarity measurements
for the two types of objects, to weigh the difference between the two relations ym) and
gym). Hence, to minimize the energy function in Equation (3.1), two relations will be
enforced to have similar values when they share similar input objects in both type A and

type B. We then simplify the expression for energy Em by using the matrix notation, i .e.,
E... = yTLA‘By (3.2)

where

LAB : DA.B_S.A.B

St” = SA®SB

where operator ® stands for the direct product of matrices. DAB in the above expression

is deﬁned as a diagonal matrix whose diagonal element is deﬁned as

D6503) Z Zisij'iw)

k=ll=l

[)fo (3.3)

Finally, similar to the solution in Equation (2.5). the optimal solution for yu that rrrinimizes

24

the energy function in Equation (3.2) is

yu :[th/rlflﬂ1 Stills yl (34)

As the further improvement, we introduce the normalized similarity matrices, and the

normalized graph Laplacian:

54 = (DAV/isA (04)”2 (3.5)
SB ___ (DB)-1/QSB(DB)-1l2 (36)
334,8 2 (DA.B)—i/2 SAJB (DA.B)—l/2 (3.7)

pm 2 (DAB)‘1/2LAB(DAB)—1/2
= 1 - 5'” (3.8)

Notice that, according to the above deﬁnitions, we have

 

 

 

 

 

 

5‘43 2 SA (8) 53 (3.9)
This is because
A 8
5'4 .5 _ Siik) o1)

\/ (24k) (2:) HOW)

yopbétbfof
Sf SB

2 ' J 55151.8, 1

\/D,.BDA \/D;4DB

Replacing the graph Laplacian LAB and similarity matrix 8’“; in Equation (3.4) with the

25

normalized ones, we have ya computed as:
y. = (I — Eff)“ S‘ﬁf‘y. (3.10)

It is straightforward to extend the above formulism to the case of multiple types of
objects. Let T = (01192., . . .O’) denote the t different types of objects. Let f : (’21 x
(92 . . . x 0" —-> R denote the relation function among t different types of objects. Let

{8k :; [8k

.,-, 1].”, xnk, k: = l, 2. . . . ,t} denote the similarity matrices for the t types of objects.

We then have the energy function for the t types of objects written as:

t
Em Z yT <® Sh) y
kzl

where each element in the vector y, i.e., ywr 02 or ), corresponds to the relation
'1’ ‘2""' ‘t

f(o,-1l,o'ﬁ_,. . . . ,of, ). A solution similar to the one in Equation (3.4) will minimize the above

energy function.

3.2 Applications for Classiﬁcation Problem

The relation propagation approach can be used for classiﬁcation problem. This section
will discuss its application on multi-label learning task. We will describe the multi-label
learning model based on relation propagation and then present the empirical studies with

different datasets.

3.2.1 Multi-Label Learning Model

It is straightforward to apply the framework of relation propagation to multi-label learning.
Let’s denote the collection of documents by D 2 (d1, d2, . . . ,d,, ), and the set of categories
by C = (c1, c2, . . . ,cm). Then, the relation f (d,, cj) is a binary function that outputs 1 when

document (I,- belong to the j-th category, and 0 otherwise. Let the normalized similarity

26

matrices of categories and documents denoted by SC and SD, and the normalized graph

Laplacian denoted by L”.

Assume that the ﬁrst it, documents of D are the labeled examples, and the rest 71,, =
n — n.) documents are the unlabeled examples. We then decompose the document similarity

matrix SD as follows:

—’D _‘D

gp _ SH blur
_D 7D
811.] Sum.

6‘ ,9

where the subindexes “l” and u refer to the labeled documents and the unlabeled docu-
ments, respectively. Similarly, we can decompose the normalized matrix SD'C and L1" as
follows:

SIDI SC t1D ® SC

Lu

SD‘C : 5113 ® SC 2 ~ _ _ _
31131 ® SC Sign ® SC

—D —D
L1,! Ll,u

L“ = I—S'D’C =
£33, £17

1 _ sagsc _sggsc
43.63136 I — 323.6936

191 D.C

Using the above expressions for Sup and L” , the solution in Equation (3.10) for multi-

label learning can be expanded as:
_ _ —l _ _
y. = (I — 5,37,, ® SC) (5,, ® SC) y, (3.11)

Directly computing the solution in the above expression could be computationally ex-
pensive. This is because the size of the matrix I — S5,, ® SC is 0(mn x nm), roughly

the square of the product between the number of documents and the number of categories.

27

This matrix will be huge even when the number of documents and the number of cate-
gories are modest, thus making it unfeasible to store this matrix, not to mention computing
the inverse of this matrix. For instance, in our experiment with multi-label learning, the
number of documents is 1000 and the number of categories is 100. This results in matrix
I —— S3,, ® SC of size 100, 000 x 100, 000, which is too large to be manipulated by most
desktop computers. In the following, we present an efﬁcient algorithm for applying relation
propagation to multi-label learning.

The key idea of the efﬁcient algorithm is to approximate the inverse of matrix I —
S3,, ® SC by its eigenvectors. To this end, we ﬁrst present an important theorem regarding
the eigenvectors of the direct product of matrices, which will serve as the basis for our

efﬁcient algorithm.

Theorem 1 Let va and vb are the eigenvectors of matrix A and B, and AG and Ab are the
corresponding eigenvalues. Then, va ® vb is an eigenvector of matrix A ® B, and MM,

is the corresponding eigenvalue.

Proof To prove the above theorem, we need to show the following equation holds

(A (8) E) (v, (8) v.) = A,.\,(v, (8) vb) (3.12)

Notice that the left side of the equation in the above can be simpliﬁed as

(A®B)(va®vb) = (.4v,)®(Bv,)
= A.A.(v.®vb)

In the above, we use the following property of the matrix direct product:

(U (3 V) (0(2) D) = (UC) @(UD) (3.13)

Thus, the left hand side of Equation (3.12) is equal to .its right hand side, and the theorem

28

is proved. Using Theorem 1, we can approximate the expression in (3.11) using the eigen—

ﬁllu

vectors of matrices SD and SC. Let the top eigenvalues and eigenvectors of S3,, denoted
by {(AP, VP). 1' = 1.2 ..... 1(1)}, and the top eigenvalues and eigenvectors of SC denoted
by {(AE, vf), k = 1 2,... ,Ix’c} Then, the eigenvectors and eigenvalues of 1— S”, u ® SC

can be expressed as follows:

Anni) = l—AFAC,

m1.) = v?®vg,1:1,...,1r,,,1~=1,__,,1,vc

Therefore. the inverse of I— S E ,, (8) SC is calculated as:

—S.?,’1>—‘.®S(

I‘D KC

T
D c D :
:Zf—_)D)61(vi®vk)(vi ®v§)
i=1 k= 1
Then, the product(I 5,, ,, D® S)C S, , D® SC) )is simpliﬁed as

[_St? @113“) (‘§IZ)1.Z®SC)

K KC
—§D:§j,_,.,,( v. (2) :3) (vr®vt)T(s~a®sc)
K7) Kcl

_ ZZ_ _AADAc(V v, D®V® ([VP]T§51)® [vi]T

i=1k=l

In the last of the derivation, we again use the property of direct product in Equation (3.13),
and the property of eigenvectors, namely SCvagvf. Using the above expression for (I —

S5,, ® SC)“ (S u, ® SC), the predict class labels yu in (3.11) rs then rewritten as:

[(1) KC

”:2 2__1— 1171;: (V3) 8W) (MIT 5'53) ® [vi-1T a (3.14)
1': 1k: 1

Comparing to the original expression for yu in Equation (3.11), the above expression is

29

computationally efﬁcient because it no longer needs to compute the inverse of matrix I -—
S32" ® SC explicitly. Figure 3.2 summarizes the procedure for efﬁciently predicting the
class labels for unlabeled data using the framework of relation propagation. A careful
examination of the algorithm in Figure 3.2 indicates that our algorithm does not need to
compute any matrix of size ()(nm x nm). In fact, during the iterations, only vectors of size
0(mn) are computed, and no any intermediate matrix is calculated. Hence, this algorithm

is efﬁcient not only in computation but also in memory storage.

 

Input:
0 The labels of the training documents y;;
0 SD an S“: similarity matrices for documents and categories;
a K13 and Kc: the number of required eigenvectors for the similarity matrices
of documents and categories.
Output: The class labels for the unlabeled documents ya

1. Computing the eigen space
0 Compute normalized similarity matrices S D and SC as in Equation (3.5) and
(3.6)
0 Compute the top K1) eigenvalues and eigenvectors of SD, i.e., (Alp, v?),i =
1, 2, . . . , KD
0 Compute the top Ix'c eigenvalues and eigenvectors of SC, i.e., (Agni/3,16 =
1, 2, . . . , KC
2. Initialization: ya 2 0mm
3. Fori=1,2,...,Kp,andk =1,2,...,KC

Compute a 2 [S5,] T VD

. 1
0 Compute b = 3 ® VS] T y,
0 Compute c = v? ® vi) b

 

 

Update ya = yu + AEC/ ( 1 - WAS)

 

Figure 3.2: An Efﬁcient Algorithm for Multi-label Learning using the Relation Propagation

3.2.2 Case Study: Image Categorization

The goal of the experiments is to evaluate the effectiveness of the proposed framework of
relation propagation for multi-label learning. In particular, we will address the following

research questions in this empirical study:

30

0 Will the relation propagation framework be eﬂective for multi-label learning? In
particular, we will examine whether the incorporation of class correlation and the
unlabeled documents will improve the classiﬁcation performance. To this end, we
compare the proposed algorithm to two graph-based approaches, the label propaga- '
tion approach based on the harmonic function [57] and the spectral graph transducer
(SGT) [24].

o How sensitive is the proposed algorithm for multi-label learning to the setup of pa-
rameters? As indicated in Figure 3.2, the input to the proposed algorithm includes
the two similarity matrices, SD and SC, and the number of required eigenvectors,
KD and Kc. In this experiment, we will vary these input variables to examine their

impact on the classiﬁcation performance of the proposed algorithm.

Dataset

We evaluate the proposed algorithm for multi-label learning on the Eurovision ST. Andrews
photographic collection (ESTA)' , which is provided by Text and Content-Based Cross Lan-
guage Image Retrieval (ImageCLEFY. This collection contatins 28133 images that belong
to 999 pre-deﬁned categories. Each image is also described a short textual document.
The averaged length is about 50 words per document. In our experiment, we randomly
selected 100 categories and textual description of 1000 images as our testbed. For the se-
lected testbed, the number of documents for each category varies from 2 to around 600,
and the average number of documents per category is around 72. Meanwhile, the number
of categories per document varies from 6 to 11, and the average number of categories per

document is around 7.

 

lhttp://ir.shef.ac.uk/imageclef/2004/stand.html
2http://ir.shef.ac.uk/imageclef/

31

Evaluation Methodology

Similar to the previous studies on multi-label text categorization, we evaluate the proposed
algorithm using both the F1 measure across documents and the F1 measure across cate-
gories. The ﬁrst metric, referred as “micro-average”, is calculated by ﬁrst computing the
precision and the recall of class labels for each document, and then taking the average of
the precision and the recall over all the documents in the test set, and ﬁnally computing
Fl measure based on the average precision and average recall. The second metric, referred
as “macro-average”, is calculated by ﬁrst computing the average precision and recall for
documents within each category, and then taking the average of the precision and the re-
call over all the categories, and ﬁnally computing the F1 measure based on the average
precision and average recall. Since the micro-average is usually dominated by the popular
categories while the macro-average is usually decided by the rare categories, by using both
metrics, we will be able to obtain the comprehensive view of the classiﬁcation performance

for the proposed algorithm.

Furthermore, similar to the label ranking algorithms for multi-label learning, the pro-
posed algorithm only estimates the conﬁdence scores of class labels for the unlabeled docu-
ments. An additional procedure is needed to decide the set of class labels for each unlabeled
document. To obtain a more comprehensive view of the classiﬁcation performance for the
proposed algorithm, for each document, we ﬁrst rank its class labels in the descending or-
der of the estimated conﬁdence scores. We then compute the F1 measure based on average
precision and recall at ﬁrst 20 ranks. This is similar to the analysis based on Receiver

Operating Characteristic (ROC) Curves [35].

Finally, since the focus of our studies is on semi-supervised learning, 2%, 5% and 10%
of documents are randomly chosen as training set and the rest of the documents are used
for testing. Each experiment is conducted for 10 times, and the average precision and recall

across 10 trials are used to compute the ﬁnal Fl measure.

32

Parameter Setup and Baseline Models

As indicated in Figure 3.2. there are four inputs to the proposed algorithm for multi-label
learning: SD, SC, K17, and ICC. To obtain the document similarity matrix SD, the SMART
information retrieval system [39] is ﬁrst used for preprocessing the documents to generate
the weighted term vectors. The pre-processing includes removing stopwords, stemming,
and the Okapi term weighting [37]. Then, a RBF kernel function f (t,,t,~) = exp(—7||t,: —
t_,- ”3 /(I) is used to calculate the document similarity based on their weighted term vectors t,-
and t j, where (I is average distance between any two documents, and 7 is the scaling factor.
In the experiment, we will examine impact of parameter 7 on the classiﬁcation performance

of the proposed algorithm.

The category similarity matrix SC is obtained by ﬁrst representing each category as a
binary vector in the space of documents, and using the cosine similarity between the vectors
of categories as the similarity measurement of categories. Since the number of categories

is only 100, we set KC 2 100 for all experiments.

We compare our approach for multi-label learning to two state-of-art classiﬁcation
algorithms: the label propagation (LP) approach that is based on the harmonic function
in [57], and the spectral graph transducer (SGT) [24]. Both these two approaches are semi-
supervised learning algorithms. In particular, they explore the distribution of unlabeled

documents for predicting their class labels.

SGT is a semi-supervised version of ratio-cut algorithm originally proposed for unsu-
pervised learning [23]. The objective function incorporates a quadratic penalty on labeled

data in addition to minimize the graph cut as

minf = fTLf + C(f — r)TC(f — r)

33

where the vector f is a label probability vector, and the vector r is deﬁned as

f+ 11:, is a positive example

7‘1 — f2 11:,- is a negative example

0 1:,- is unlabeled

The matrix C : (liag(cl, (:2, - - - ,c,,) is a diagnal cost matrix allowing for different misclas-
siﬁcation cost for each data example. The trade-off between graph cut value and training
error penalty is balanced through the constant 0.

Previous studies have shown that these two approaches are able to outperform a number
of comparative semi-supervised learning algorithm, such as Transductive Support Vector
Machine (TSVM) [24]. However, one drawback with these two approaches is that none of
them is able to explore the correlation among different classes. By comparing to them, we
are able to examine if the proposed algorithm is able to improve the classiﬁcation perfor-
mance by incorporating the class correlation information. The implementation of Spectral
Graph Transducer used in our study is download from (http://svmlight.joachims.org/). For
easy reference, we refer to the proposed algorithm for multi—label learning as “RP”. Finally,

all the three baseline models use the same document similarities as the proposed algorithm.

Experiment (I): Comparison to Baseline Models

Figure 3.3 shows the classiﬁcation results when 2%, 5% and 10% of the documents are
used for training. The left panel in each ﬁgure shows the F1 measure of micro-average,
and the right panel shows the F 1 measure of macro-average. In this experiment, we set the
scaling factor 7 to be 1, K1) 2 20 and KC 2 100.

First, we observe that as the size of training set increases, both the micro-average and
macro-average Fl measures are improved for all three algorithms. This is expected be-
cause more training data will provide more information and thus improve the performance.

Second, we observe that the spectral graph transducer is able to outperform the label prop-

34

Figure 3.3: Classiﬁcation Fl Results comparison using Relation Propagation

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Micro—average F1 measure Macro—average F1 measure
0.10 7/ L1" . 0-151 A’ 'f ""f :ﬁ\ f
/’// 5 \‘\ ‘ /’/ f] \\‘ A \
0.08 / 5\" ~ /
9 I, 2 0-10 r" +
a3» [ l: 1% / . *WW— I
(B ‘ «I ‘ “’é) 8 15* . s \ " 519011,
EOOS‘ meww E f fEr'aCJR'T at“ x
F , 1:795“ ‘ ‘ E 0.05L x" 55‘
u. ”.14 l 1 ,6} '
0.041 a” — RP /, a 1
g? + LP 21
A ’ +1- SGT ‘9 J
002% ,m__ ,1_ﬁg___ 0.00;; ._—_,,
0 5 10 15 20 0 5 10 15 20
Rank Rank
(3) Percentage of Training Data = 2%
Micro-average F1 measure Macro-average F1 measure
0.08 "*7 «14:74 i
0.20} f 2]] // .ﬁs ]
/’/ 5‘ 5 5 '\"\r».,\]\ 1 ‘ ] /
'I/ 5 \ ‘ a 0.06 ’ /
(D .
5 0.14»
m 1‘
8 ,. .»..‘r+23‘:)‘-’9""':3"C"Jt’
E ‘ ,,..:--"-"'"H‘ 5 ‘
LL 0 081 x“ -1 “ﬁrm, i
1"“ ,IK’L‘ '5 1
. in? 1
0.02; ’
5 10 15 20
Rank
(b) Percentage of Training Data = "/c
Micro-average F1 measure Macro-average F1 measure
0.10
0.20 (,r'ﬁ'r “T ,,\]\ J /~]I/,~V/+
/ \
1 ,I' \~ 4i 0.08 /
20'16 ‘ 2 /
5 * 1' 1 3o 06
(D 1 . z ._1. 3;“) l0 '
8 0.12 I . Iiyéieyéﬁﬁwﬂ’ E 8 /
E I r J ’ 1 E0 04 '/
If. 0-03 1 pixiirere *hw 444+ E ‘ / _,. aut‘rewes
1 2" ween . r
0041 ){H — RP 0.021; , ‘
‘ ., ,2 LP I
V a
5+ SGT v - 1
0.00 0.00‘ ‘
0 5 1O 15 20 O 5 10 15 20
Rank Rank

(c) Percentage of Training Data = 10%

35

agation algorithm noticeably when more than 2% of documents are used for training. This
is consistent with the results of the previous study in [24]. Third. compared to label propa-
gation and spectral graph transducer, we observe that the relation propagation algorithm is
able to achieve signiﬁcantly better classiﬁcation accuracy for all the cases in terms of both
micro-Fl and macro-F1 measurement. This result indicates that exploring the relationship

among different types of objects does help improve the classiﬁcation performance.

Experiment (11): Sensitivity Analysis

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Micro-average F1 measure Macro-average F1 measure
0.22 T f‘ V" T I 0.12 I Y 1
d.3£f%::x 63‘s,. ‘ — '. ]..
“QR K Rae} ' f .
15$in 3%
‘ L 1‘ /
// 5» ——\ xx“. 0.08 f}
9 // / \ ‘ \ V 9
3 /’ a
8 ’ 00.06i
E / E
r: , / r:
0.10» ,» / —+— y: 0.5 0-04'
003 If ‘6‘ 1:1
' T“ 7: 2 0.02»
0.06» — 7‘4
1, _ _ Y: 6
. . 1 0.00 1 4 .
0'040 5 10 15 20 0 5 10 15 20
Rank Rank

Figure 3.4: Classiﬁcation Results of the Relation Propagation using Different 7's. Kp = 20

In this experiment, we examine how the scaling parameter '7 and the number of eigen-
vectors K D will affect the performance of the proposed relation propagation algorithm.
The number of eigenvectors for categories Kc is set to be 100.

In the ﬁrst experiment, we ﬁx K1; = 20 and vary ’7 from 0.5 to 1, 2, 4 to 6. Figure
3.4 shows the micro-averaged F1 and the macro-averaged F1 of the proposed algorithm
for different 7. We observe that the classiﬁcation performance of the proposed algorithm

is not impacted too much when 7 is set to be 0.5, 1, and 2. A signiﬁcant degradation

36

Micro-average F1 measure Macr o-aver age F1 measure

 

 

   

 

 

 

 

 

 

 

 

 

 

 

 

0-22 - . - 1 . 0.14 .— , ‘
012 i Luz/",1 (”fit
0‘20 5 waiétrr
0.1 o » ,ap,;',1—s.<;,@€}]@ff3“5‘% 2’
m 2 ' ’ j
i 0'18 3 0.03 ~ ,
‘° 8
"5’ 5
E016 _ Ko=5 L,_0.oe~
+ K = 20
,8 = 30 0.04 - ,
":7?‘ D
— K0 = 70 0.02 ~11,
0.12 * 1 i 0.00 . r .
0 5 10 15 20 0 5 10 15 20
Rank Rank

Figure 3.5: Classiﬁcation Results of the Relation Propagation using Different Kps. 7 = 1

is observed when ’1' is increased to 4 and 6. This is because, when the scaling 7 is too
large, the similarity between any two documents based on the RBF function will almost be
zero, thus making it impossible to explore the unlabeled data. This is consistent with the
observation in [57], i.e., only when it is possible to explore the correlation among unlabeled

data only when the scaling parameter 7 is small.

In the second experiment, we ﬁx 7 = 1 and vary the number of eigenvectors K1) from
5 to 20, 30, 50, and 70. Figure 3.5 shows the results for different K1). We observe that
the classiﬁcation performance was very stable for K1; 3 20 for both the micro-average
F1 measure and macro-average F1 measure. However, the performance is signiﬁcantly
degraded when K13 = 5 for micro-average F 1 measure. Overall, the macro-average F1
measure is not affected signiﬁcantly by the number of eigenvectors. Generally speaking,

when K1; 2 20, the algorithm has the best performance.

37

3.3 Applications for Ranking Problem

The relation propagation model can be easily applied to ranking problems. Since the model
generates the conﬁdence scores for the objects in question, it is straightforward to rank
the objects according to their conﬁdence scores. We will use the collaborative ﬁltering
application as an example and describe how we can apply the relation propagation approach

to the ranking problem. Empirical studies will be discussed after the model description.

3.3.1 Ranking Learning Model for Collaborative Filtering

The goal of collaborative ﬁltering is to predict the preferences or interests of a particular
user based on the available votes of training users, which sometimes is called community
data. Many methods have been proposed for collaborative ﬁltering in the past years.

Most methods assume that users with similar interests tend to rate items similarly. Thus,
the similarity between users in their rating patterns becomes the key to most of the collab-
orative ﬁltering methods. The user similarity can be used either for identifying a subset of
training users that share similar interests with a test user, or for clustering a large number
of users into a small number of user classes.

One of the key challenges to collaborative ﬁltering is the sparse data problem, namely
each user only provides a limited number of ratings. This is often the case in the real world
applications given few users are willing to spend time on rating a large number of items.
The sparse data problem often results in the unreliable estimation of the user similarity,
which can signiﬁcantly degrade the performance of collaborative ﬁltering. This is because
two users with similar interests may not share any commonly rated items, and as a result,
have a zero user similarity. Many studies have been devoted to the sparse data problem in
collaborative ﬁltering [49, 6]. One strategy [19, 34] is to cluster training users into a small
number of classes. Instead of identifying the individual training users that are similar to a

test user, we identify the class of training users that ﬁts in best with the test user. Another

38

strategy [57] is to ﬁll out the ratings of the unrated items for the training users. This can
be done either by propagating the rating of one item by one training user to the rating of
another item by the same user if these two items are similar, or by propagating the rating
of one item by one training user to the rating of the same item by another user if these two

users share similar ratings for a number of items.

Our Relation Propagation model can be applied to address the sparse data problem in
collaborative ﬁltering. The key observation behind this work is that most of the previous
work on collaborative ﬁltering addresses the sparse data problem by exploring the user sim-
ilarity and/or the item similarity separately. As a result, it only allows the rating information
of the same item to be propagated among different users, or the rating information of the
same user to be propagated among different items. This limitation will prevent the system
from answering the question such as, if user y] rate item 2:1 as 5, what is the expected
rating of item 562 by user yg provided that user 3), and 112 are similar in their preference of
items, and in the meantime item :r] and 1132 are similar. The relation propagation approach
for collaborative ﬁltering can address this problem by allowing the rating information to be

propagated among different users and different items simultaneously.

To apply the framework, we ﬁrst construct a weighted graph as follows: each node
rap-J) = (11,-.01) in the graph represents the rating of item 03- by user 11,-; any two nodes
1:“, j) and “(1:1) in the graph are connected by an edge whose weight reﬂects the correlation
between the two corresponding rating relationships: a large weight indicates that user 11.,-

will give a similar rating for item 0, as the rating by user 11,, for item 0,, and vice versa.

The rest of this section is arranged as follows: We ﬁrst present how the label propaga-
tion can be applied on collaborative ﬁltering; we then introduce the relation propagation

model for collaborative ﬁltering, and the related algorithm.

To better motivate the proposed framework for collaborative ﬁltering, we will start with
the description of the label propagation method for collaborative ﬁltering. We will then

describe the relation propagation framework, which can be viewed as a generalization of

39

the label propagation method in [57]. Before getting to the details, we will ﬁrst introduce

the notations that will be used throughout this paper.

Let U = (111, 112. . . . ,u.,,) denote all users, where n. is the number of users including
both the training users and the test users. Let the similarities of users denoted by the matrix
S“ = [Sig-[mm where each element SU- represents the similarity between two users in their
rating patterns. Let the collection of items denoted by O = (01, 02, . . . ,0,,,) where m is
the number of items. Let the similarity of items denoted by the matrix S0 = [Sfﬂmxm
where element SE]. represents the similarity between two items in their descriptive features.
In the case of movie recommendation, each item corresponds to a different movie, and is
described by a set of movie categories. Let the ratings of items 0 by users Ll denoted by
the matrix R = [R,-,j],,xm. Each element Rm- 6 {0,1,2,...,-r} represents the rating of
the j-th item by the i-th user. It takes value between 1 and r when the j-th item is rated
by the i-th user, and 0 when it is not. We also write matrix R as R = (r1, r2, . . . ,rm) and
R = (El, E2, . . . ,f'n)T, where each vector r,- represents the ratings of the i—th item by all
the users, and f,- represents the ratings of all the items by the i-th user. Given the above
information, the goal of collaborative ﬁltering is to predict the rating of a given item by the

[681 user.

In order to apply the label propagation method to collaborative ﬁltering, we will erase
the difference between the training users and the test users. This is because predicting the
rating of items by the test users is equivalent to predicting the rating of the unrated items
by the training users. So, the question that label propagation method will address is to
predict the ratings of a given item 0;, for all the users. For the sake of simplicity, let’s
assume that only the ﬁrst 71;, users among Ll rate the item 01,, and our goal is to estimate
the ratings of item 01,. for the remaining 71,, = n — 71; users. For the convenience of the
presentation, we refer to the ﬁrst at users as the “labeled” users and the remaining 11,, users

as the “unlabeled” users. Following the idea of the label propagation [57], we search for

40

 
 
 
     

*e-"vv ‘
1 .“OJr;

.‘g’ic‘...r,‘.

 
 

 

 

the rating vector r], that minimizes the following energy function:

2: S,“ R”, —J-1.)2= rZL"rk (3.15)
i j: 1
where matrix L“ = [L;f,]nx,, is the graph Laplacian for the similarity matrix S“. It is
deﬁned as L“ = D” — S" where D“ = diag(D‘,‘, Dg, . . . ,D“) )and D“: Z,_1S
rewrite the vector rk as rk— — (rk, r},) where vector r, refers to the ratings that are already
provided by the labeled users, and r}, refers to the ratings to be predicted for the unlabeled

users. Similarly, we rewrite L“ and S“ as

’U Mu SU- S’u u
1.1 1, , S“ ___] 1.1 1 (3.16)
LU. Lu 11 83’.”

11 l 11, 11 11.!

L” =

where the sub-indices l and 11 stand for the training users and the test users. Using the

above notations, we have the optimal solution to Equation (2.4) written as:

rZ— — [Lg—1“] S",rk. (3.17)

The key advantage of using the label propagation method for collaborative ﬁltering lies in
its capability of utilizing the similarity information of all the training users, including both
the rated users and the unrated users, to predict the rating for the test user. In this aspect,
the label propagation method for collaboration information is similar to the clustering ap-
proach and the collaborative ﬁltering methods based on the eigenvector analysis [40] and

the matrix factorization [44, 36].

The problem with the label propagation approach for collaborative ﬁltering is that it is
unable to exploit the matrix S 0, the similarity information of items. To address this prob-
lem, we propose the relation propagation framework that effectively exploits the similarity

of users and the similarity of items, simultaneously. To this end, we modify the energy

41

function in Equation (2.4) into as follows:

"I

—: Z S,“ 5, 12,, — 133,,)2 (3.18)

i.j= lk.=l1

In the above energy function, we weight the squared difference (RM. — RN)? by the prod-
uct between the user similarity S,“ and the item similarity S3 1. In other words, the rating
RM, and R1,, should be close if user 11,- share similar interest as user 11.,- (i.e., Sf“, is large),
and item 0,, is similar to item 0,- in their descriptive features. It is important to note that the
energy function in Equation (3.18) returns back to the energy function in Equation (2.4)
when the item similarity matrix 5" becomes an identity matrix, i.e., S3,, = 609,1). This is

because

E" = Z 2 5,135 1,}; — Rt!)2

i,j= lkl=1

= Z Z 5,“, (RM, — Rj,k)2

k=l i.j=1

m.
_ I
_ E :13,

k2]

To minimize the energy function in Equation (3.18), we ﬁrst introduce a matrix S""’ =
[ (11k),(j.1)]n111><n111 that represents the similarity between two ratings. Each element (u, k) U ,)
represents the similarity between the rating of item 0,. by user 11,- and the rating of item
0,. by user 11,-. According to Equation (3.18), the similarity 51:21.01) should be deﬁned as

8,“, 3,. In the matrix form, this is equivalent to deﬁning matrix S "'0 as the direct product

of S" and S", i e

311.0 : 811 ® So

where operator ® is the direct product of matrices. Then, the energy function in (3.18) can

42

be written as
ET 2 rTL“’”r (3. 19)
where

Lu,o : Du,o _ 8111.0
_ ~T ~T ~T T
r — (r1,r2,...,r,,,)

D” in the above expression is deﬁned as a diagonal matrix whose diagonal element is

deﬁned as

11.0 _ 11,0
D(z‘,j).(i.j) — S(i.j).(k,l)

n

= 52,) (E: 53.1,) = D3D§ (3.20)
1 1:]

Similar as before, we have the rating vector r rewritten as r = (r,, ru) where r, consists of
the ratings provided by the training users, and r,, represents the ratings to be predicted for
the test users. Similar to Equation (3.16), we can decompose S "'0 and L“ into the parts
related to the training users and the parts related to the test users. Finally, similar to the
solution in Equation (3.17), the optimal solution for r,, that minimizes the energy function
in Equation (3.19) is

r, = [ng3] ‘1 "-01, (3.21)

'u,l

Furthermore, since the normalized graph Laplacian usually delivers better performance

than the unnorrnalized ones, we replace the similarity matrix S “"0 and the graph Laplacian

43

L“"’ in the above with the normalized ones, i.e., S “*0 and L-““’, that are deﬁned as follows

Sup :— (Dll.0)—]/2 8'1110(D”~0)—1/2 (3.22)

1:150 ___ (Diary—U2 Lu.0 (DtL.O)—1/2 :. I _ 311,0 (3.23)
The solution to Equation (3.18) that uses the normalized similarity matrix is

_ _11.0
ru ‘_ [L'U.u

]‘1 531,011, (3.24)

L“"’ in the above can also be written as:

11,11

L” = I - St: = I — 5“ (g) 50

11.11 11,11

where S3,, and S0 are the normalized similarity matrix for training users and the normal-
ized similarity for items, respectively. They are deﬁned similarly as the ones in Equation

(3.22).

Remark: The solution in Equation(3.24) can be interpreted from the viewpoint of
graph. This graph is constructed by having each node 0,”) = (11,, 0,) in the graph cor-
responding to the rating of item 0,- by user 11,-. Any two nodes Wide) and 110,1) in the graph
are connected by an edge whose weight depends on both the similarity between users 11,
and 11,-, and the similarity between items 0),. and 0,. Then, the above solution can be inter-
preted as propagated the rating information from the rated items to the unrated ones. It is
important to note that the proposed relation propagation framework distinguishes the label
propagation approach in that it allows the rating information to propagate among different

users and different items simultaneously.

Similar to the multi-label learning model described before, directly computing the so-
lution in Equation (3.24) can be computationally expensive because size of matrix Lgf, is

()(mn x mn). It is very hard to compute the matrix inverse given a large number of users

44

and items. We already present an efﬁcient algorithm for multi-labcl learning model. In
this section, we will solve the problem in a different way which actually leads to the same

solution as before.
To resolve this problem, ﬁrst notice that the solution in (3.24) in fact corresponds to the
optimal solution to the following optimization problem:

1 _ _
l = —3r: [L33] ru + 1'13an (3.25)

d

This can be easily veriﬁed by setting the ﬁrst order derivative of the objective function
Z to be zero. Thus computing yu becomes the problem of optimizing Z. By the above
conversion, we avoid the difﬁculty in computing the inverse of 12:3. However, we still
have to resolve the problem with storing the matrix [:33 since it is too large. To this end,

we assume that the solution ru in (3.24) can be written as a linear combination of the direct

products between the principle eigenvectors of 5'“ and the principle eigenvectors of 3".

U.“

More speciﬁcally, let the top eigenvalues and eigenvectors of 3:", denoted by

(A?;v”),i==l,2,n.,1{u

I

and the top eigenvalues and eigenvectos of 5'" denoted by

ijyjzrzqu,

J

where ICU and K0 are the number of eigenvectors extracted from S3.” and 5‘“. Then, fol-

lowing our assumption, we can write r.“ as

1c, Ko
‘11, 0
ru——— 2 E (rim-v1. ®vj (3.26)
i=1 j=1

where 01,-, J-s are the parameters needed to be determined. Substituting the above expression

45

for r., in (3.25), we have the objective function I rewritten as follows:

0

_ $121k Ear-.10).: [(V Vit®v§ )T (Li): (v E®V?)I

i,k lj,l=l

U 0

+ Z (1,-1- (v H®v )T( 53;,01‘)

1:111

>II

Using the property of direct product. i.e.,
(A $9 B)(C 12 D) 2 (AC ® 80),

and the relation Lijj: = —5§f “I850 and 5",} 0 =55 u ® 5", we have the following simpli-

ﬁcation:

(vagf (L122) (v:®v:)= (1 — 1:130:11, mo. 1)
(v:®v:)T(S:::n=f)3?:(1v1511.)®[v:fr.

Using the above simpliﬁcation, the objective function Z in the above can be expanded as

follows:

Ku KO
1AM 0
“—5 G? 1.,j( ”A
K. ‘2‘ i=1 (3.27)
o 11 T ‘11 o T
+ ai-jAj ([Vi] U11) ® [Vj] I'l
1:1 1:1
Then the optimal solution for (1:. 1- that maximizes the above objective function is
a. - = —L (MT 3“,) (8) MT r, (3.28)
9] 1 _ AitAf; 3 11- J

Note that if we substitute (1,}- to the equation (3.26), we get the same solution as before.

46

3.3.2 Case Study 1: Movie Recommendations

The goal of our experiment is to evaluate the effectiveness of the Relation Propagation
model given the sparse data. We compare the relation propagation model with ﬁve base-
lines: Pearson Correlation Coefﬁcient algorithm (PCC), Personality Diagnosis model (PD),
Aspect model (Aspect), Flexible Mixture model (FMM) and Label Propagation model (LP)

as described in the previous section.

Dataset

We use the dataset of movie ratings, named MovieLen(http://www.grouplens.org/), as our
test bed. MovieLen data set includes 943 users and 1682 movies. Each movie is rated
between 1 and 5. with 5 as the best rating. The average number of ratings provided by each
user is around 374. To demonstrate the problem of sparse data, we selected 200 training
users and for each test user, only 5,10 or 20 movies are randomly selected as the rated

items.

Evaluation Metrics

Since one of the goals of collaborative ﬁltering is to rank the unseen movies for a test user,
we will apply each of the ﬁve algorithms in comparison with order the unseen movies based
on their predicted preference, and evaluate these algorithms based on their ranking lists.
The evaluation metrics used in our experiments is the average precision [16] (AP), which
is a commonly used metric in the information retrieval. It measures the agreement between
the top K movies that are ranked by the automatic algorithms, and the top K movies that
are ranked by the test user. To this end, we will ﬁrst rank movies in a descending order of
their true ratings by a test user, and then compute the precision PK of an algorithm for the

test user as follows:
1 K k
P - = — ——+— 3.29
K K I; ran/Ir(tk) ( )

47

where t), is the k-th movie ranked by the true ratings of the test user. Since the users’
ratings only provide a partial order of movies, we break the tie by choosing a random
ordering among the movies with the same rating. Finally, the average precision is computed
by taking the average of the precision at K over all test users. Evidently, the higher the
average precision, the better the performance is. Notice that we did not employ the Mean
Average Error (MAE) for evaluation as a number of collaborative ﬁltering papers did. This
is because our algorithm is only able to generate the ranking list of items for individual
test users. It is unable to predict the rating categories for items, and as a result, we can not
measure the MAE metric.

Finally, each experiment is conducted for 10 times and the results across 10 trials are

used as the ﬁnal evaluation metric.

Parameter Setup

Our algorithm needs the inputs 5“, 5", K ,, and K0. The user similarity matrix S" is com-
puted using the RBF kernel function f (n, rj) = exp(— Hr,- — 73-“ /(Z)‘2 based on the user
rating vectors 7‘,- and 7'], where J is the average distance between any two rating vectors
and A is the scaling factor. The movie similarity matrix 5° is obtained by representing each
movie as a binary vector in the space of movie categories and using the cosine similarity be-
tween the movie vectors as the similarity measurement. Due to the space limitation, we are
not going to discuss the impact on the proposed algorithm from different values of the pa-
rameters. To speed up our computation, we set the number of eigenvectors K u = K o = 20
for all experiments according to the empirical experience. We measure the average preci-

sion over the top 20 ranks for all algorithms.

Experiment Results

Table 3.1 shows the average precision for the top 20 ranked movies given 5, 10 and 20

observed ratings provided by each test user. First, we can see that the average precision

48

for all approaches increases as the number of observed movie increases. This is expected
since more training data should lead to better performance. Second, the label propagation
approach improves the ranking accuracy when compared to the Pearson correlation coef-
ﬁcient approach. However, the label propagation approach did not perform as well as the
Aspect Model and the Flexible Mixture Model. We believe this is related to the capabil—
ity of the algorithms in exploring the correlation among users and the correlation among
movies. Both the Aspect Model and the Flexible Mixture Model explore the movie corre—
lation and user correlation by data clustering. In contrast, the label propagation approach
only explores the correlation among movies. Third, the relation propagation (RP) algorithm
outperforms the other approaches considerably. Most noticeably, the ranking precision of
the relation propagation algorithm is 56.7% when the number of rated movies for the test
user is 5. This number is better than the ranking precision of the other comparative al-
gorithms when 20 movies are rated by test users. We attribute the success of the relation
propagation algorithm to the fact that the proposed algorithm allows relevance information
to be propagated among users and movies simultaneously. This is particularly critical to
alleviate the problem of sparse data. Although both the Flexible Mixture Model and the
Aspect model are able to alleviate the sparse data problem by user clustering, they are lim-
ited in that all the users in the same class have the same rating for the same movies. The
proposed algorithm relieve this limitation by the propagation scheme. We also veriﬁed the
average precision across 10 fold is statistically signiﬁcant at the level of p < 0.05 based on

the student’s t test.

Furthermore, we vary the number of top ranked movies that are rated by the users.
Figure 3.6 shows the average precision for different number of top ranked movies for each
algorithm given 5, 10 and 20 observed movies for each test user. Again, we observe that
for all cases, the proposed relation propagation algorithm is able to outperform the other

algorithm considerably. This result again shows the power of the proposed algorithm.

49

 

 

Alg. Given 5 Given 10 Given 20
RP 0.567 i(0.003) 0.569:t(0.007) 0.57lzt (0.020)
LP 0.434:l:(0.004) 0.446i(0.003) 0.462j:(0.004)
F MM 0.489j:(0.003) 0.497:l:(0.001) 0.5 1 7310.004)
PD 0.4701(0001) 0.4851(0001) 0.528:l:(0.001)
PCC 0.321 j:(0.002) 0.358:l:(0.001) 0.433:l:(0.004)
ASPECT 0.449i(0.002) 0.465:l:(0.007) 0.515i(0.001)

 

 

Table 3.1: Average Precision for the top 20 ranked movies for all ﬁve algorithms. Variance
of the average precision is included inside the parentheses (all values need to be multiplied
with 104).

3.3.3 Case Study 11: Book Crossing

Book-crossing dataset was collectedc by Cai-Nicolas Ziegler in a 4-week crawl from the
Book-crossing community. It contains 278858 users providing 1149780 ratings about
271379 books. Ratings are either explicit, expressed by the scale 1 - 10 (higher values
show the higher appreciation), or implicit, expressed by zero.

_ We sort the books by the number of ratings given to them and select the top 1000 books
with the most ratings. Based on these books, we remove users with less than 30 ratings.
The resulting dataset has 160 users with 7612 ratings on 1000 books. The average number
of ratings for each user is around 47 and the average number of ratings for each book is
around 8. In each experiment, 20% of the users are randomly selected as training users
and the rest are for testing. For each test user, 5, 10 and 20 books ratings are randomly
selected as observed ratings. Each experiment was conducted 10 times and the average
across 10 fold was taken as the ﬁnal value. Average precision is also used to evaluate the

performance.

Experiment Setup

Similar to the movie recommendation case, we compare our method to other four algo-
rithms: Pearson Correlation Coefﬁcient (PCC), Aspect model (Aspect), Label Propagation
(LP) and Personality Diagnosis model (PD). To be consistent with the evaluation in movie

recommendation case, we also use Average Precision described in Equation (3.29) as the

50

Given 5 Items

0
'0)
I

OI

09.0.0.0

Average Precision
o '_.. m w .1:-

 

1
Top K Rank
Given 10 Items
’7' ‘Iiii T—WV " T‘ ’ ' .’ 7"

 

Average Precision

 

 

 

 

 

Top K Rank

Given 20 Items

 

_O
.L
‘f

 

.0
N

Average Precision

 

 

O

 

0

Top K Rank

Figure 3.6: Average Precision for different number of top ranked items by the test user
given 5,10 or 20 movies.

evaluation metrics. We ﬁrst compute the average precision on top 10 ranked books for each
user and then take the average across all users as the ﬁnal results.

The relation propagation model requires the input, the user similarity matrix S“, the
book similarity matrix S 0, and the number of eigenvectors K, and K o. S“ is computed by
taking the dot product between two user rating vectors as 5;} = r, - rj. Similarly, S" is

computed by ﬁrst presenting each book as a vector of ratings from all users and then taking

51

 

Alg. Given 5 Given 10 Given 20
RP 0.526j:(0.003) 0.569i(0.115) 0.718i(0.051)
LP 0.354i(0.125) 0.426:t(0.106) 0.625i(0.334)
PD 10.516:t(0.007) 0.556i(0.023) 0.705i(0.001)

PCC 0.403:t(0.182) 0.448j:(0.138) 0.623i(0.099)

ASPECT 0.480i(0.220) 0.524i(0.201) 0.686i(0.139)

 

 

 

Table 3.2: Average Precision for the Top 10 Ranked Books for all Five Algorithms. Note:
Variance of the average precision is included inside the parentheses (all values need to be
multiplied with 10“).

the dot product between two book rating vectors F,- and '53- as 5,0]- : F, f}. According to the

empirical experience, we set Ku = K 0 = 20.

Experiment Results and Analysis

Table 3.2 shows the average precision on the top 10 ranked books for all the algorithms.
First, as the number of observed books increases, the average precision increases for all
ﬁve algorithms. This is expected since more observed ratings provide more information for
predicting the test data. Second, LP algorithm performs the worst among all algorithms.
This shows that propagating the ratings by using only user similarity may not be enough for
accurate prediction. Simple PCC method works better than LP algorithm general although
worse than other approaches. Third, PD model and Aspect model work better than LP
model and PCC method, but worse than the RP model. Especially the performance of
the PD model is very close to the RP model. Grouping users by the similarity of their
personality seems a effective strategy to improve the performance. Finally, the RP model
performs the best among all the approaches. We believe it’s the contribution of exploring
the correlation among users and books. All results are statistically signiﬁcant at 95% level.
Figure 3.7 shows the average precision at different number of top ranked books given 5, 10
and 20 observed ratings. The RP model performs best at each different rank and the result

is consistent with the above analysis.

52

Given 5 books

 

 

 

 

 

 

 
  

 

 

 

 

 

 

 

 

 

g 0.6; I I I 1 1 I 1
22 0.5k Q .5, :4“ E, Wu..-”— :9
E 0 4 . _V.,"_l;<;.::7;'§r~:::; if;::§:;:”:i§:;”+ﬁ,ﬂ- fﬂrgr
D. ' i'_,,+,._.g( ,1. r ’r "“ ,-,4(-} ’:W,:
h + M”. 13—“
g, 0.3 ; 7 _ rwlaaw: __ _, _
m g, - «Ff g z
a); 0.2 ,-- wt"? .- ~- ~
‘1 0.1 " ’1 ' 1 1 I
1 2 3 4 6 8 9 10
Rank
Given 10 books
.6 0.6 I 1 r r 1 I ﬂ
.2 , L,.
E 0.5 {jazz ruse: ﬁZﬁ/q £2 55’
n. \ _. 3» ~ $71589 / —
a: T‘iil) «fig ”:8: ”"8“"7ﬁ—A—ﬁ‘ér’” "‘:B:’f*i3 ﬂee PCC
8’ 0-4* ’_ ,Za/W‘t‘ + PD
§ /.»-4+ *---—*+/ —1— ASPECT
< 0.3 ‘ I I i ’ 1 1 j
1 2 3 5 6 8 9 10
Rank
Given 20 books
_5 0.75 , ,
(D 1:.
"" f ‘ " :5 ’ *4» , ‘\" "R *M}’*— Gr~ —~ A‘“ If,
g0 70 f//” ‘ j» w_:1,ﬁ:::~..—:.::::‘ 11-193; *3: W
“@065; _ . _ ﬂ _ -
IE.) 1.1.— ;Vb ;W‘t‘~‘*‘V-—erv'”.tj‘T-R‘__Ejl~:7\\n~3‘;h’: EL .: . 781;:“f3
E 0.60 A - A I . . .
1 2 3 5 7 8 9 10
Rank

Figure 3.7: Average Precision for Different Number of Top Ranked Books by the Test User
Given 5,10 or 20 Books.

53

3.4 Drawbacks of Label Propagation Approaches

Label propagation approach has attracted a lot of attention from researchers in the machine
learning area with its promising performance in many applications including information
retrieval, classiﬁcation and ranking problems. In the previous sections, motivated by the
label propagation approach, we presented the relation propagation model and their appli—
cations. Empirical studies showed that it is very helpful in a lot of problems to propagate
the class (or the rating) information on the weighted graph which is appropriately con-
structed. However, there are still some general issues with the traditional label propagation
approaches.

The label propagation approach essentially addresses a ranking problem. In order to
apply the label propagation approach, we construct the graph based on the given data in-
cluding training and testing examples. Then the labeling information on the training data is
propagated through the graph such that the entire system reaches a global equilibrium. Each
node in the graph will get a corresponding numeric conﬁdence score computed through the
propagation. Finally we compute a ranking list based on the conﬁdence score assigned to
each node and make the prediction accordingly. Although it sounds a practical idea, the
problem comes from propagating ordinal values (or class labels) as numerical values.

In many cases, the labeling information for data examples does not carry the numerical
meaning. The label values are usually discrete and ﬁnite. For instance, in a classiﬁcation
problem, each class label is represented by a number that should carry only ordinal mean-
ing. Consider the case that there are three classes c1 = 1, (:2 = 2 and 03 = 3. We can not
say that, the class cl should be counted less than C2 and 03 because numerically cl is smaller
than cl and C2. Furthermore there is not order among cl, 02 and c3 and changing the label
value for each class will not change the nature of the problem. The class numbers are only
the names for the classes. But during the process of propagation, we actually propagate the
labeling information that is supposed to be ordinal values as numerical value and take the

ﬁnal numerical results as the base of assigning ordinal labels. This is also the case in a lot

54

of ranking problem. For example, in collaborative ﬁltering task, each rating is often repre-
sented by a number. Different from class labels, rating information does exhibit an order
among different categories. If we have three categories ba.d(1), good(2) and emcellent(3).
we can tell the rating 1 < 2 < 3. However, propagating the rating information as numeri-
cal values is not an appropriate practice since those rating numbers only carry the ordinal
meaning. Replacing the rating number with any other numerical value without breaking
the ordering will not change the nature of the problem, but it may signiﬁcantly affect the
ﬁnal results.

In a lot of ranking problems, it’s also very common that the exact rating information
is not available. Instead, only the ranking information is given for the observed data. For
instance, a web search engine presents a ranking list to the user and the user may click
the preferred web pages. This clicked-through history for the user generates the partial
preference judgements over the clicked pages and the unclicked pages. This kind of ranked
data is a challenge for traditional classiﬁcation and regression approaches. Without the
exact ratings, propagating the rating information is not practical at all.

In a word, label propagation has shown itself a promising approach to many problems
while the issues we discussed above may limit its efﬁciency in a lot of scenarios. In the
next chapter, we will discuss a solution which addresses the challenge of the ranked data

for label propagation.

55

Chapter 4

Propagation for Ranked Data

Analysis of ranked data is often involved in practical applications including information
retrieval, collaborative ﬁltering, etc. For example, the ranking presented by a search engine
to the user is a ranking list of all web pages based on the relevance of each page to the
user’s query; the links clicked by a user provide some kind of relevance judgements for
web pages; the user’s past ratings on movies are also ranked data in terms of the user’s
preferences. The interest in ranking problems is still the source of ongoing research in

many ﬁelds such as mathematical economics, social science and computer science.

Ranking problems differ from traditional classiﬁcation and regression problems in that
ranking problems explore the ordering of examples rather than the absolute numeric/ordinal
values assigned to each example. There are two different views of ranking problems. In
the ﬁrst view, the training examples are presented with the ranking scores (e.g., relevance
judgements or rank scales) and the goal is to learn a score function from the labeled data
examples. One common approach for this kind of ranking problem is to learn a probabilis-
tic classiﬁer or regression model from the given ranking scores. However, This ranking
problem may be hard for regression models since the given ranks are usually discrete and
ﬁnite. It is also unsuitable for classiﬁcation models since the ranking information does not

exist in class labels. There are a number of works devoted to this area [43, 18, 42, 12].

56

In [18], the authors proposed a large margin algorithm based on a mapping from objects
to scalar utility values using a distribution independent risk formulation of rank learning.
Kramer et a1. investigated the use of a regression tree learner by mapping the ordinal vari-
ables into numerical values [38]. In [15], the authors converted a rank learning problem
into nested binary classiﬁcation problems that encode the ordering of the original ranks and
then the results of standard binary classiﬁers can be organized for prediction. In [43], two
large margin principle-based approaches were proposed for ranking learning in which the
rankings are viewed as ordinal scales. The ﬁrst appraoch uses the “ﬁxed margin” policy
in which the margin of the closest neighboring classes is being maximized and the second
approach allows for multiple different margins where the sum of margins is maximized. [9]

presented a probabilistic kernel approach to ranking learning based on Gaussian processes.

In the second view, the training data is presented in the format of partial ordering in-
formation (e.g., preference judgements) and the goal is to learn a ranking function from
the training data examples which generates the ordering of the data. The partial ordering
information can be provided in the form of the ranking list of training instances or the
preferences over pairs on training set. Given the ranking of training instances, the rank-
ing function can be learned by minimizing the difference between the given ranking and
predicted ranking [22, 32]. In [32], an approach was introduced to take ranking rather
than classiﬁcation as fundamental. This approach uses a generalization of the Mallows
model on permutations to combine multiple input rankings. In [22], the author presented
a Support Vector Machine approach which automatically optimizes the retrieval quality of
search engines using clickthrough data. The ordering information can also be broken into
a set of pairwise preference judgements and used. in the form of pairwise preferences [25].
Given the pairwise preference judgements, we can construct the models which predict the
preferences or the rankings [16, 11. 31]. These kinds of approaches appear more appro-
priate in ranking problems since we focus on the relative ordering information rather than

absolute values. An advantage of learning ordering directly is that preference judgements

57

can be much easier to obtain than the exact labels required for classiﬁcation or numeric
values for regression. This chapter focuses on the approaches utilizing the relative ordering

information.

Although there has been a signiﬁcant amount of research done on ranking models in
statistics and machine learning, there is little work on label propagation models for ranked
data. However, it is very important and also a natural decision to explore the label propa-
gation approaches for ranked data due to the existing challenges discussed in section 3.4.
One of the challenges is related to the meaning carried by the class labels. As we described
above, the label propagation approaches address all the problems by ﬁrst propagating the
class labels and then generating ranking lists based on the resulting propagation scores
computed by the algorithm. This may not be an appropriate approach in certain cases since
the numerical values are not able to preserve the essential meaning of the ordinal infor-
mation among class labels. For example, consider the numerical class labels 1, 2, 3, - - o.
In most cases, the numbers just provide the labeling for classes instead of the ordering in-
formation. Class 1 is not greater or smaller than class 2. However, when the class labels
are propagated, the ordering information will be introduced into the propagation process.
Another challenge is the difﬁculty of obtaining the exact labels of the labeled data. Most
previous studies require the absolute label values or numeric values to be given in order to
learn the ranking function, which is often not practical. The absolute labeling information
is usually hard to obtain or is very costly and time consuming. In contrast, the partial rela-
tive ordering information is more easily available for training in some cases. For instance,
in a movie rating dataset, the users provide the ratings which indicate the preferences of
certain movies over other movies. Traditional label propagation approaches may not ﬁt in
these applications because they are required to propagate the absolute labels. In order to

address these challenges, this chapter is devoted to label propagation on ranked data.

The rest of the chapter will start with the formal deﬁnition of the ranking problem on

relative ordering information, followed by the discussion of some recent research works

58

related to rank learning. Then it will propose a supervised framework of Rank Propagation

in which the model is learned from pairwise preference judgements.

4.1 Ranking Problem Setting

Let’s formally describe the ranking problem over the relative ordering information. Let
X = {271,172. - - - an} be a set of labeled and unlabeled instances called the domain or
instance space. Elements of X are called instances. Each instance .13,- 6 Rd (1' = 1, - - - ,n)
is represented by a vector in d dimensional space.

We deﬁne preference judgement function <I> which is the input to the learning algorithm.
This function encodes the known relative ordering information about a subset (training set)
of the domain X. For instance, in the movie recommendation task, the feedback from
a movie reviewer provides the known movie preferences by this reviewer. Formally, we
deﬁne the preference judgement function which has the form (I) : X x X —+ R <I>(:co, 11:1)
represents the degree to which 3:1 should be correctly ranked above 170. The positive value
means that 361 should be ranked higher above 1’0 and negative value means that :17] should
be ranked below (1:0. We also deﬁne <I>(.1:, .17) = 0 and (I) is anti-symmetric in the sense that
@(fl’o,$1)= -—<I>(;rl,:1:0) for all $0,117, E X.

The learning algorithm will output a ranking of all instances represented in the form of
a function H : X ——> R. We interpret H (3:0) > H (.111) as .110 is ranked above 1131 and vice
versa.

The goal of the learning algorithm is to produce a ranking function H which generates

a good ranking based on the per f creme judgements encoded by (I).

4.2 Related Works

There are a number of studies devoted to the analysis of ranked data. This section will

introduce some literature in the learning approaches on ranked data in machine learning

59

area.

4.2.1 Statistical Measure of Rank Correlation

Guttman ’5 Point Alienation, a statistical measure of rank correlation, has been used to con-
struct the objective function that matches user’s preferences [2] in a combination expert
model. Previous work [1] has demonstrated that this measure can be highly correlated with
average precision, which is a more typical measure of performance in information retrieval.
Thus optimization of this objective function is likely to lead to the optimized average pre-
cision performance. The objective function can be Optimized by gradient descent even
though there are singularities.

Let Re.q(d) be a ranked retrieval function which generates a score indicating the rel-
evance of document (1 to a query (1 (R must be differentiable in its parameters 6)). The

objective function is:

—1 Z.) .1 1129.10) — Raw)
J R : —— 9 4.1
( 9) <2 2Q 2....11119110 —Re..<d'>1 ‘ ’

 

where Q is the set of training queries and d and d.’ are documents retrieved. >q is deﬁned

as a preference relation over document pairs as

d >q (1' <=—> the user prefers d to d’

The goal is to ﬁnd parameters 9 so that R9,,(d) > R9,q(d’) whenever d >q d’. This
method is applied to combining experts as the following linear model (three experts are
illustrated):

Regid) = @lEl((I1d) + 921320141) “I” @3E3IQId)

The parameters 6,- serves as the scales on individual expert which is denoted as E,. The

critical feature of such a model is that the score it generates is differentiable with respect to

60

the parameters (-). The gradient in general is:

 

 

(1119) CW D organ; 01—)

In the experiment on a commercial retrieval system called CMME, the combination model

performs 47% better than any single expert.

4.2.2 Ranking SVM

Support Vector Machine approach has been applied to ranking problem for automatically
optimizing the retrieval quality of search engines using clickthrough data in [22]. This
method is shown to be well founded in a risk minimization framework.

Clickthrough data in search engine can be viewed as triplets (q, r, c) where q is the
query, r is the ranking presented by the search engine and c is the set of links the user
clicked on. Since the clicked-on link set 0 depends on the query q and the presented ranking
r, partial relative relevance judgement would be a better interpretation for clickthrough data
than absolution relevance. The problem is formalized in this approach as: Given a query
q and a document set D = {111,- - . ,dm}, the retrieval system should return a rankings
of documents in D according to their relevance with respect to q. Since retrieval systems
usually can not achieve an optimal ranking, the retrieval function f are evaluated by how
closely its ordering rm) approximates the optimum 7"“. Both 7““ and rm) are considered as
weak ordering. Kendall’s r [26] is used as performance measure to evaluate the closeness

between r” and r f(,,). It is deﬁned as

 

T(r I) P-Q 1 —-—2Q
‘0'. ‘b : — .
P+Q ”1

2

where P is the number of concordant pairs and Q is the number of discordant pairs between

. . . . . 711
rd and 7'1. m is the number of documents on a hnrte domain D and the sum of P and Q IS 2

61

for strict ordering. It can be proved that maximizing 7(1‘ 101): 7‘*) is connected to improved
retrieval quality. Given an independently and identically distributed training sample S of

size n containing queries q with their target rankings 7"“

((11, TI), (612.. 7‘31 - - - . ((111173)-

the learning £ will ﬁnd a ranking function f from a family of ranking functions F that

maximizes the empirical 7' deﬁned as

:3

= — , .. . 4.2
,n 1:17-(Tf(Qi)7( )

This SVM ranking algorithm will ﬁnd the function f out of a family of ranking func-
tions F that maximizes (4.2) and generalizes well beyond the training data. Given the class

of linear ranking functions

(diedj) E fﬁtUI) i=> W<I><qad11 > quadj)

where TB is a weight vector which needs to be learned and <I>(q, d) is a mapping onto
feature vector describing the match between query q and document d. The optimization

problem is deﬁned as follows

OPTIMIZATION PROBLEM I. (RANKING SVM)

minimize:
_+ 1
we, 1) = 51:1 - 17.1 + 0:61.11 (4.3)
subject to:
lV(dz,dj) E 7“; I WCpUhdz) _>_ W¢<Q1.d H) —€i.j.l

62

VOL, (1]) E 1'; I 7“:qu (1i) 2 wq)((I1n(-lj)+ 1 _‘ €i,j,n (404)

Viv‘j‘Vlk 1 €11.17}; 2 0 (4.5)

C is a parameter that allows trading-off margin size against training error. Adapting the
Ranking SVM to the case of partial relevance data by replacing 'r“ with the observed pref-

erences r’, the optimization problem can be deﬁned similarly.

4.2.3 Large Margin Principle

Large Margin principle has been applied for ranking problem in [43]. The paper views
the problem of ranking 1: instances as predicting variables of ordinal scale and introduced
two approaches for learning ranking functions that applied the large margin principle used
in SVM to the ordinal regression while maintaining an optimal problem size linear in the
number of training examples. The ﬁrst approach is the ”ﬁxed margin” policy in which the
margin of the closest neighboring classes is maximized which is a direct generalization of
SVM to ranking learning. The second approach allows for k — 1 different margins where
the sum of margins is maximized.

Let x? be the set of training example where j = 1, . - - , k. denotes the class number
and i = 1, - - - ,ij is the index within each class. I = 2:]. ij is the total number of training
examples. Similar to 2-c1ass separating hyperplane problem, this method is to look for k— 1
parallel hyperplanes represented by vector w E R" (n is the dimension of the input vectors)
and scalars (11 _<_ -- - g (11.4. Thus the hyperplanes are deﬁned as (w, (11), - - - , (w, bk_1)
such that the training data are separated by the following decision rule:

f(x) = min {1" : w-x — I). < 0}
rE{l,---,k}

In other words, all input vectors x are assigned the rank r if br_] < w ~ x < br.

63

In order to control the tradeoff between lowering the empirical risk (error measure on
the training et) and lowering the conﬁdence interval, the model should maximize the margin
between the boundaries of sets according to the structural risk minimization principle. Two

approaches based on large margin principle to different optimization problem.

a ﬁxed margin: the margin to be maximized is the margin between the closest neigh-
boring sets. If w and 1),. are scaled such that the distance of the boundary points from
the hyperplane (w, b.) is 1, the margin between the classes 7‘ and r + 1 becomes
2 / |w|. Thus the goal is to ﬁnd the direction w and the scalars b1, - - - ,bk_1 such that

w - w is minimized. This leads to the following optimization problem:

1L‘J)JI.1€1{i.r€1:j+1 éw - w + C Z ; (E: + 6?“) (4.6)
s.t. w - xf — bj S —1 + e? (4.7)
w-xf—bJ—S—1+ef (4.8)
w-xg‘Ll—bJ-S l—efj+1 (4.9)
11' 2 0,1? 2 0 (4.10)
wherej=1.---,k — 1 and 1' = 1. - - - .gz'j, and Cis predeﬁned constant.

0 sum-of-margins: large margin policy can be applied to k — 1 margins such that the
sum of them is maximized. The challenge is that pre-scaling of w can not be adopted
as in ﬁxed margin strategy. In order to solve this, the primal functional is represented
by 2(kv — 1) parallel hyperplanes instead of k - 1. The goal is to seek a ranking rule
which employs a vector w and a set of 2(k—) thresholds a1 5 b] 5 112 3 b2 3 - - - S
ak_1 g bk_] such that w - x? g (1.1 andw - xi 2 (134,110 2 —oc and a1. 2 00. The

margin between two hyperplanes separating class j and j + 1 is: (11]- — aj)/ “W“.

64

Hence, the Optimization problem can formulated as:

k—l
“Irina j; (aj — 111-) + CZ; (6? + 6:1“) (4.11)
s.t. (1.]- S b1 (4.12)
bjgaj+1,j=1,~-,k—2 (4.13)
w . x1" 3 a]- + 13,11]- — 1:1“ g 1111-11chr1 (4.14)
w-wgr,e{go,e:’+‘ 50 (4.15)

4.2.4 PRank Algorithm

PRank is an online algorithm for ranking problems motivated by the perceptron algorithm
for classiﬁcation. This method projects each instance into the reals and then operates on
rankings by associating each ranking with distinct sub-interval of the reals.

Given a sequence (x‘, y‘), . - . ,(x’ , y‘), - - - of instance-rank pairs, x‘ is preferred over
x3 if y‘ > y“ and y‘ is an element from ﬁnite set 32 = {1, 2, - - - ,k} with > as the order
relation. The ranking rule H : R" —> 31 which is a mapping from the instances to ranks.
The family of ranking rules H employs a vector w E R" and a set of k thresholds b1 2
- -- 2 bk_1 _>_ (1,. z 00. The predicted rank is then deﬁned to be the index of the ﬁrst
threshold b. for which w - :r < b... The goal of the algorithm is to learn the ranking rule
minimizing the ranking-loss which is deﬁned to be the number of thresholds between the

true rank and the predicted rank.

To learn the ranking rule online, the true rank 3) is expanded into k — 1 variables

y1,- - - , 111.21 and y induces a vector as follows:
.1...‘ ,_ ___ +13...,+1:_11...5_1
{yr 411.1} ( )

T
where r is the maximum index for which y, = +1 and r = y —— 1. Thus, the predicted rank

65

is correct if y,.(w . x — b.) > 0 for all r, otherwise the rule is updated as:
o replace 11,. with (1,. — 1),. for which y.,.(w - x — b.) S O.
o replace w with w + (23,.)3 for which yr(w - x — br) _<_ 0.

The paper proved that PRank maintains a consistent hypothesis in the sense that it pre-
serves the correct order of the thresholds and given an input sequence (x1, 1J1), - - - , (xT, yT)
, the rank loss 2; lg)‘ — y‘l is at most (I: — 1)(R2 +1)/’)«2 where R2 = max H x‘ H2 and

A], : nrinr.t{(W* . x‘ — b;)yf.} (see details in [12]).

4.2.5 Conditional Model on Permutations

CRanking [32], a new approach to ensemble learning, explores conditional models on per-
mutations as a tool for solving ranking problems. This approach views each input classiﬁer
in terms of the ranked list of labels that it assigns to the input, builds probability distri-
butions over rankings of the labels and then learn the models on permutation groups. The
basic model in this approach is an extension of the Mallows model to the conditional setting
and it has a natural Bayesian interpretation as a generative model because of the invariance
properties of the sufﬁcient statistics.

Let X = y 1. . . . ,1)" be a set of items to be ranked, identiﬁed with the numbers 1, - - - ,n.
A permutation 7r is a bijection from {1, - - - , n} to itself. Let 7r(i) denote the ranking given to
the item 2' and 7r‘1(t') denote the item assigned to rank 2'. The collection of all permutations
of n-items forms a non-abelian group under composition, called the symmetric group of
order n, and denoted Sn. Given a set of training instances, a ranking of items needs to be
assigned to each instance. The permutations given by the j—th input classiﬁer for the i-th
instance is denoted as of). The permutation 7r“) is used to denote the predicted ranking for
the i-th instance. a is used to denote a sequence of permutations Uj.

A generalization of the Mallows model was proposed for estimating a conditional dis—

tribution. Let aj E 8,, be a permutation from the j-th input classiﬁer and ()j E R for

66

j = 1, - - - , It. The distribution which is described as follows:

1
Z(9,0)

62:21 911101203)

 

11040.0) =

deﬁnes a conditional model when there are multiple instances and each instance is asso-

ciated with a possibly differently set of rankings of)

. Z (6,0) is the normalizing con-
stant deﬁned as Z (6,0) 2 ed“) and 1,0 is the cumulant function deﬁned as t’1(9,a) =
log Ewes. exp(6d(7r, 0)). The model can be applied to, for example, the rankings of web
pages generated by different search engines for a speciﬁed query. 0)- can be viewed as the
weight for each ranker in the ensemble. In this scenario, only the 6,- are the free param-
eters to be estimated. The likelihood function based on this model is convex in 6. Given
a training set consisting of the pairs D = (71“), aw), the parameters can be estimated by
maximum conditional likelihood or MAP.

This model is different from other discriminative models in that it has a natural Bayesian
interpretation. If we view 7r as a parameter, suppose Uj are indepently sampled from Mal-

low models with common mode 71 and dispersion parameters 61-. Under a prior p(7r), the

posterior is

p(7r|6, a) or p(7r)exp (2: 01-1101, 0))

because of the invariance property of d(., .). Thus, under a uniform prior on 7r, the distri-

bution is precisely the posterior of a generative model for the rankings (1.

4.2.6 Conditional Model on Ranking Poset

Another conditional model on ranking poset is presented for ranking problem in [31]. Sim-
ilar to CRank algorithm, this model is also an extension of the Mallows (11 model and it
generalizes the classiﬁer combination methods used by several ensemble leanring algo—
rithms, including error correcting output codes, discrete AdaBoost, logistic regression and

cranking.

67

A poset (partially ordered set) is a pair (P, S), where P is a set and g is a binary relation
that satisﬁes (1)2: S :r, (2)if :1: S y and y S :1: then a: = y, and (3)if :1: S y and y S 2
then :r _<_ z for all :r, y, z E P. Then ranking poset W,, is deﬁned as the poset in which the
elements are all possible cosets QM, where A is an ordered partition of n. and 7r 6 g. The
right coset is deﬁned as: Q,,_k7r = {anla E gn—k} and Qn_k = {7r 6 Q,,|tr(t') = 1,2' =
l, - - - ,k} where 71' denotes a permutation of {1, . . - ,n} and 7r(t') denotes the rank given to
item i. The right coset can be viewed as a partial ranking, where there is a ful ordering of

the k top-ranked items.

Kendall’s Tan T ( 7r. 0) is used as a distance measure d which is deﬁne in (4.2). Given
input I: rankings 01, 02, - - - ,ai, contained in some subset a E Wn of the ranking poset and

a probability mass function qo on Wn, an exponential model 719(7rla) is given as:

k
119(7rla) = fig—)qoﬁrkxp (32:; 61d(7r.oj))

where 6 E x E Rk, 7r 6 II 6 Wu and 113- E E3 E W,,. The term Z(6,a) is the normalizing

constant

I.-
Z(6’, a) = an(7r)exp (Z 6 + jd(7r, (73))

11'6“

4.3 Rank Propagation for Multi-label Learning

In this section, we propose the idea of rank propagation for multi-label learning problem.
We view the problem of multi-label learning as multi-label ranking, i.e., instead of making
a hard decision about which subset of class labels should be assigned to each instance. Our
goal is to rank the class labels according to their relevance to the given instance. Unlike
the typical label propagation schema that propagate the prediction scores, the key idea
of rank propagation is to propagate the pairwise preference relationship of class labels

between labeled examples and unlabeled examples. We will focus the discussion on rank

68

propagation for supervised learning.

4.3.1 Preliminaries

Let X, = (x11,x§, - - - ,xf,) denote the set of labeled examples. Each example x,- is labeled
by a vector y,- = (941.11.111.21 - - - ,y,~.(;) where C is the number of classes and y“, E {0, 1}
indicates if example x, is assigned to the kth class (1 indicates the example belongs to the
ktth class and 0 otherwise). We denote by Y = (y1,y2, - - - ,y,,) the class labels assigned
to all the instances. Let X,, = (x1‘, x3, - - - ,xﬁl) denote the set of unlabeled examples. Our

goal is to rank the class labels for each unlabeled example.

4.3.2 Preference Matrix

We focus on a single unlabeled example x. We denote by k(x,x') the kernel similarity
function. For the sake of simplicity, we assume that the kernel similarity function always
outputs non-negative values, i.e., k(x,x') Z 0 for any x and x'. To conduct the rank
propagation, we encode the class label information into a pairwise preference matrix. More
speciﬁcally, given assigned class labels y,, the preference matrix M,- E RCXC is deﬁned as

follows:

a 311.1 = 1 and yak = 1
b ill/1,1 =‘— 1 and y”; = 0
ler-r = (4.16)
a. y” = 0 and y“, = 0
0 otherwise

where a. and b are constants with a < b. We can normalize the preference matrix M,- into a

transition matrix S,- by deﬁning

 

[S'lkr = [Ml-hut
2:21 [A'Iiik’t

69

(4.17)

Evidently, the constant a is proportional to the transition probability between the classes
that are either both selected or both unselected. Similarly, constant b is proportional to the
transitional probability between the selected classes and the unselected classes. it is not

difﬁcult to see that the principle eigenvector of S, i.e., S v = v,is

1/ll3’ill‘2 311.1: :- 1
131'}; —
0 yth: = 0
Furthermore, we denote by S E REXC‘ the transitional matrix for the test example x. Our

goal is to search for the transitional matrix S that are coherent with the transitional matrix

in the neighborhood x.

Principal Eigenvector of Transition Matrix As claimed in the above section, the prin-
ciple eigenvector of the transitional matrix for each example is consistent with the class
assignment of the example. Thus, by computing the principle eigenvector of the transi-
tional matrix of a testing example, we can generate a ranking list of all classes labels for
this example and make predictions accordingly.

To further verify our assumption, we randomly generated four datasets. Each data sam-
ple is assigned to one or multiple categories. We build the preference matrix for each data
sample based on the category information as described above and then compute the prin-
cipal eigenvector. To evaluate how much the ranking of categories based on the principal
eigenvector is consistent with the real category information, we use Average Precision as
the evaluation metric.

Figure 4.2 shows the average precision for each document in all generated datasets.
Clearly, for the majority of the data samples, the principal eigenvector of preference matrix
is completely consistent with the real category information.

We also verify this assumption with the existing datasets which will be used in our case

study discussed in the following section.

70

Randomly Generated Dataset with
1000 Data Samples and 50 Categories

 

 

Average Precision

 

 

 

 

1 ~ — — ﬁ'
(18* 1
0.6 r
0.4» J
0.2 - .

_._ . . 44)
o J . . . i
0 200 400 600 800
Documents
(3) Randomly generated dataset 2: average

number of categories per document is 5 with
maximum 13 and minimum 0; average number
of documents per category is 101 with maxi-
mum 127 and minimum 79.

Randomly Generated Dataset with
2000 Data Samples and 100 Categories

 

 

Average Precision
O .0
a: on d

,0
h
r

 

.0
M
f

 

 

 

1000

r—r/
00 500 1000 1500 2000
Documents
(c) Randomly generated dataset 1: average

number of categories per document is 20 with
maximum 35 and minimum 9; average number
of documents per category is 400 with maxi-
mum 448 and minimum 361.

Average Precision

Randomly Generated Dataset with
1000 Data Samples and 100 Categories

 

 

 

 

 

1 .
S
-,—, 0.8
'6
a”
0 o 6 ~
0)
i
< 0.4
0.2 [’1
O r x J x
0 200 400 600 800
Documents
(b) Randomly generated dataset 3: average

number of categories per document is 15 with
maximum 26 and minimum 4; average number
of documents per category is 149 with maxi-
mum 181 and minimum 117.

Randomly Generated Dataset with
2000 Data Samples and 200 Categories

 

 

 

 

 

 

1
0.84 1
0.6 '1
0.4
02 x .4
0; A r A
0 500 1000 1500
Documents

((1) Randomly generated dataset 3: average
number of categories per document is 20 with
maximum 35 and minimum 8; average number
of documents per category is 199 with maxi-
mum 239 and minimum 166.

 

1000

2000

Figure 4.1:
Datasets.

Analysis of Principal Eigenvectors of Preference Matrix for Generated

71

Average Precision for Each Document
in St. Andrews Dataset

 

 

 

 

 

 

1' l
.5 0.8— 4
.9
0
93
o.
a, 0.6 ~
0)
m
i; 1
< 0.4"

0.2
OWJ—J r r r r
0 200 400 600 800 1000
Documents

(a) ST. Andrews dataset: average number of categories per
document is 6 with maximum 11 and minimum 1; average
number of documents per category is 56 with maximum 740
and minimum 10.

Average Precision for Each Document
in MovieLens Dataset

j Y Y # 1 T 7 Y 7

 

 

0.8r “

0.6~ *

Average Precision

0.4r 1

02* ~

00 100 260 300 460 500 600 700 860 900
Documents

 

 

 

 

(b) Movielens dataset: average number of categories per doc-
ument is 3 with maximum 6 and minimum 2; average num-
ber of documents per category is 108 with maximum 349 and
minimum 1.

Figure 4.2: Analysis of Principal Eigenvectors of Preference Matrix for Study Datasets.

72

4.3.3 Framework Description

Using the transitional matrix, we view the multi-label learning problem as a multi-label

ranking problem. This goal can be encoded into the following optimization problem

1115111 2 [(S, 5,)111, 1 (4.18)

s.t. 51,120,Vk,l=1,2,-~,C

C
2 5,.) = 1
k2]

where w,- = k(x, x.) l(S, S') is the loss function that measures the difference between two
matrix S and S '. Different types of loss functions can be used in this optimization problem.

We consider two choices of loss functions as follows:

[1 loss function It’s deﬁned using the trace function as MS, 5') = tr((S— S')T(S — S')).

This loss function can be further rewritten as

C

11(S,S') = 2(5):“ — 31,)? (4.19)

k,l=1

Obviously, the above equation essentially measures the square of the difference between

two matrices across all the elements.

The solution of S is straightforward when using the ﬁrst loss function. In particular, we
will solve a series of C optimization problem with each optimization problem written as

follows:

(1
J.

C
min 2: Z: ”(Hz-(5&1 — [Si]k.i)2 (4.20)

Lg . .'.'.IS‘ .'
Ll k.C (:1 i=1

8.1. 511,120,l=1,2,“-,C

C
2 3111:1-

(=1

73

This is a quadratic programming problem. To see more clearly about the meaning of

the result. we redeﬁne the problem as follows:

Tl

Insirr Z wills - 51H: (4.21)

i=1

s.t. s : (),eTs = 1.

WhCI'C $1, = (51411, 519, ' ' 35151;) and S? 2 ([Siij? [51]}:‘2, ' ° ' , [Si]k_c,'). The dual form Of

the above optimization problem can be further written as

. ”7 + Aelli
1'/\—-T/\ ~———.—— 4.22
111111 Sk( e + 1) 2 2:;1 lt’i ( )
s.t. ”y t 0, /\ 2 0.
where s... is deﬁned as

A 27.11 ”(pl-S}?

S), = ———"‘n ’ (4.23)
21:1“?
The solution s;, is calculated as

a )1

’ T e (4.24)

. . . . + /\e
As we can see, the two constrarnts s t 0 and eTs = 1 result 1n the additional term 7 ,,

21:1 “’1' .

12 loss function It is deﬁned as “1211121:1 zT (S —S' )T (S — S, )2. This loss function essentially
requires the two transitional matrix S and S’ to be similar along any direction 2. The
interesting point is that the ﬁrst loss function [1 can also be written in the similar form to [2
as MS, 5’) = eT(S —- S')T(S — S')e. In other words, the ﬁrst loss function only require

the two matrices to be matched along the direction e. Thus, 12(S, S') is a more general

form than 1,,(S, SI).

74

22 loss function leads to the following optimization problem

1118111 max zT ( ; try-(S — S))T(S — S,))z (4.25)

lzl‘él

s.t. SM 20, 11.7,! = 1,2,---,C

STe = e.

Since max zT ( 2:21 111,-(S— S,)T (S—S,)) z is the principle eigenvalue of matrix 211:1 (S—

IZISl

S,-)T (S — S,)z, we can rewrite the above problem into the following format:

min t (4.26)
,A'

S't- Zita/ii S tIC
i=1
Ai (S — SOT
Z 0
S — S,- I

Sk.120krl:132amec

S Te 2 e.
where I C and I are both identity matrix.

As we mentioned above, ll loss function can be viewed as a special case of [2 loss

function and it can be veriﬁed through the above optimization problem. To better under-

2 I

stand this, consider the special case where A,- = diag(a.}, a1, - - ,az-C). Furthermore, we

approximate the constraint
: 0 (4.27)

by its necessary conditions, i.e.,

k

a].

k 2
i ll

2 HSk—S' 2

‘l

75

As the result of the above simpliﬁcation, we rewrite the original problem as

11111] t (4.28)

LS1;

n
s. t. :11?in g f, k: 1,2,-~,C
=1
(1:12 “SA—Sf”; k: 1,"',C. i: 1:"‘271'

i

Sk_>:02 k21~21m~C

SZe=1,k=1,2,---,C.

To minimize t, it is sufﬁcient to minimize 221:1 uriaf‘. It is not too hard to see that this is

equivalent to the optimization problem using [1 loss function.

76

4.3.4 Case Study: Multi-label Categorization

We evaluate the proposed ranking propagation scheme on multiple data sets and report the

empirical results in the following study. We will answer the following questions:

0 Does Rank Propagation really help the multi-label classiﬁcation task? As we dis-
cussed before, the traditional methods for multi-label classiﬁcation have problems
caused by utilizing class labels as numerical values. Rank Propagation utilizes the

preferences among all categories and can avoid this problem.

0 Does 12 loss function work better than ll loss function? 12 loss function considers
the projection of the difference between the preference matrix of the test example
in question and the preference matrix of all training examples along all directions.
Theoretically, the optimization problem based 12 loss function should result in better

performance and the empirical studies proved this.

0 How sensitive is Rank Propagation to the parameters? There are two parameters
involved in Rank Propagation: A and B. We give a detailed description of the inﬂu-

ence of these parameters on the performance.

Datasets

We evaluate our proposed approach with three datasets: MovieLens dataset, St. Andrews
dataset and Yeast dataset. In three datasets, each document is assigned to one or more

categories. We describe the datasets as follows:

0 MovieLens C18 dataset: MovieLens dataset is a movie rating dataset. The original
dataset provides the ratings of 943 users on 1682 movies from 19 categories. Each
rating is an integer ranging from 1 to 5 with 1 being the lowest rating and 5 is the
highest rating. Zero rating means that the rating information is not available. We

selected 849 movies and 18 categories as our testbed. The average number of cat-

77

egories per movie is around 3 and the average number of movies per categories is

around 1 14 in the resulting dataset.

St. Andrews C 10 dataset: This is an image categorization dataset which contains 999
predeﬁned categories and 28133 images along with their captions (textual descrip-
tion). The 28133 captions consist of 44085 terms and 1348474 word occurrences.
The maximum caption length is 316 words, but on average 48 words in length. We
selected 10 categories and 1000 images with their captions as our text categorization
testbed (we view each caption as a document). The number of categories per docu-
ment in the resulting dataset varies from 2 to 4 and the average number is around 2.
The number of documents for each category varies from 108 to around 320, and the

average number of documents per category is around 209.

Yeast C14 dataset: The Yeast dataset is formed by micro-array expression data and
phylogenetic proﬁles with 2417 genes. East gene is represented by 103 attributes and
is associated with a set of functional classes whose maximum size can be potentially
more than 190. The dataset in our experiment contains 14 classes. The number of
classes per gene is ranging from 1 to 11 with the average number being around 5.
The average number of genes per class is ranging from 34 to 1816 with the average

number being around 732.

Baselines

The proposed Rank Propagation method is a supervised-learning algorithm which predicts

the labels by propagating the preferences between labeled data and unlabeled data. We refer

to the Rank Propagation approaches using two loss functions as RP L1 and RP L2 respec-

tively. We compare the Rank Propagation approaches with two other supervised-leaming

algorithms: K Nearest Neighbor method (KNN) and Support Vector machine (SVM) as

our baselines.

78

KNN is similar to the Rank Propagation method except that KN N propagates the true
labels of the training examples while Rank Propagation propagates the preferences among
categories which are obtained from the given labels of the training examples. In order to
generate a ranking list of all categories for each example using KN N, we conduct KNN for
each category and generate a ranking list of all categories based on the resulting scores.

Traditional SVM is often used for binary classiﬁcation tasks. Many algorithms derived
from SVM have been proposed for multi-label classiﬁcation tasks. Among them, SVM
binary appears to perform the best according to the empirical results in [33]. The basic idea
of SVM binary can be summarized as two steps. First, we conduct binary classiﬁcation for
each category and generate a score for each testing example on each category. For a testing
example, the score corresponding to one category can be viewed as a conﬁdence score
indicating how likely the testing example should be assigned to this category. Second, we
generate a ranking list of all categories according to the conﬁdence scores for each testing
example. The higher rank a category has, the more likely the testing example should be
assigned to this category. Various metrics can be applied to the ranking lists in order to

evaluate the performance.

Evaluation Metrics

In the framework we describe in section 4.3.3, we generate a ranking list of all categories
for each document. Similarly, we generate a ranking list of all categories for each testing
example in two baseline algorithms. Thus, it’s more appropriate to evaluate the perfor-
mance of three algorithms using rank-related measurements. In our empirical study, we
use category-wise Precision/Recall-breakeven point (PRBEP), Micro-average F1 score and
Macro-average F1 score at different ranks, and Area under ROC curve as our evaluation
metrics.

For Micro-average F1 score, we ﬁrst compute the precision and recall at each rank

of category for each document, then compute the average precision and recall at each

79

rank across all documents and ﬁnally compute F I score based on the precision and recall.
Macro-average F1 score is computed similarly, but the precision and recall are computed
for each category.

The Precision/Recall-breakeven point is a commonly used measure for evaluating text
classiﬁers. It is based on the two statistics recall and precision which are both widely
used in information retrieval. Between high recall and high precision exists a trade-off.
The Precision/Recall-breakeven point is deﬁned as that value for which precision and re-
call are equal. We compute the average Precision/Recall-breakeven point value across all.
categories as follows: ﬁrst, for each category, we compute the precision/recall when as-
signing the top 71 examples to this category and n. is the number of examples actually in
this category; second, we take the average precision/recall across all categories as the ﬁnal
results.

ROC (Receiver Operating Characteristic) curve was ﬁrst used to deﬁne detection cut-off
points for radar equipment with different operators. It is also a commonly used measure in
information retrieval. In ROC curve plot, sensitivity (true positive rate) is plotted against 1-
speciﬁcity (false positive rate). To get a comprehensive idea of the ranking list, we compute
the average Area under ROC curve. First, we compute the true positive rate and false
positive rate at each ranks for each document and plot the ROC curve for each example.
Then we compute the area under ROC curve for each example and take the average as the
ﬁnal results.

By using the above three metrics, we can get a comprehensive idea about the ranking
lists of categories generated in three algorithms. We conduct each experiment 10 times and

take the average results over 10 trials as the ﬁnal results.

Parameter Setup

There are two parameters involved in the Ranking Propagation method: A and B (see de-

tails in section 4.3.3). Since only the ratio of A and B matters, we ﬁx A = 1 and test

80

different B values. We set B = 10 which gives the best performance for all three experi-
ments. We’ll include the discussion on the inﬂuence of B values in the following sections.
We also need to compute the similarity matrix among data examples in the Ranking Prop-
agation method. Each entry in the similarity matrix is computed using cosine similarity
measure based on the given representation of the data points.

We download SVM-light ([20]) as the implementation of SVM and use the default
setting of the software. For KNN, K (the number of the nearest neighbors) was set to be

empirical value generating the best results.

Results and Analysis I: Comparison among All Methods

We evaluate the effectiveness of the proposed Ranking Propagation method in the ﬁrst
experiment. Table 4.1 shows the Precision/Recall-breakeven point for the three datasets
described in the previous sections. Table 4.2, 4.3 and 4.4 presented the Area under ROC
Curve for three datasets with different number of training examples. Also, Figure 4.5, 4.4
and 4.5 showed the Micro-average and Macro-average F1 scores of three datasets. We
have the following observations.

First, for all the algorithms, as the number of training examples increases, the perfor-
mance is also getting better in terms of all three metrics. Moreover, the curve of F1 scores
usually starts at a low value, then keeps ascending till it reaches a peak and ﬁnally ends at a
ﬁxed point lower than the peak value. This is consistent with the intuition since it is hard to
make all correct predictions at the ﬁrst rank. But as the rank increases, it is more possible
to catch the positive examples for each category. The ascending trend is continued until at
a certain point there are not many positive examples left for each category at lower ranks.
Then the curve starts descending until it reaches the end. There many be some ﬂuctuation
in a curve because some positive examples are not ranked higher than the negative ones.

Second, we can see that in most cases, both RP L1 and RP L2 outperform two baselines,

namely, SVM binary and KNN. This indicates that propagating the preferences instead of

81

true labels does help the classiﬁcation in some applications. RP L2 method shows signiﬁ—
cant improvement in terms of all the metrics. This is not too hard to understand since RP
L2 considers a more generalized form of differences and constructs a better optimization
problem.

Third, it is interesting that KNN actually outperforms SVM binary method in some
cases despite that KN N is a simple algorithm. For instance, KN N achieves the better Micro-
average and Macro-average Fl scores for yeast datssets in Figure 4.5 than SVM binary.
This may be related to the nature of the datasets including the distribution of the positive

examples of each class, the numerical presentation of the data examples, etc.

Table 4.1: PRBEP for Different Datasets with Different Number of Training Examples.

 

Number of Training examples

 

Classiﬁer

20

40

80

100

 

RP Ll

 

 

0.135 :1: (0.008)

 

0.138 i (0.006)

 

0.150 :t (0.006)

 

0.162 i (0.010)

 

 

 

 

 

 

 

 

 

 

 

 

 

RP L2 0.136 :t (0.006) 0.144 :1: (0.005) 0.168 :t (0.003) 0.184 :t (0.002)
SVM 0.131 :1: (0.007) 0.136 :1: (0.006) 0.148 :1: (0.002) 0.152 :l: (0.003)
KNN 0.132 :1: (0.006) 0.133 :I: (0.006) 0.133 :1: (0.007) 0.135 i (0.007)
(a) MovieLens Dataset
Number of Training examples
Classiﬁer 100 200 300 400
RP L1 0.211 i (0.007) 0.213 :t (0.006) 0.213 i (0.003) 0.214 :1: (0.006)
RP L2 0.217 :1: (0.011) 0.224 :1: (0.005) 0.237 :1: (0.004) 0.243 :1: (0.009)
SVM 0.206 i (0.011) 0.207 1 (0.009) 0.209 i (0.009) 0.211 :1: (0.008)
KNN 0.208 i (0.004) 0.212 i (0.005) 0.2112 i (0.006) 0.212 :1: (0.007)
(b) St. Andrews C- 10 Dataset.
Number of Training examples
Classiﬁer 300 500 1000 1500
RP L1 0.303 :1: (0.004) 0.308 i (0.003) 0.309 :1: (0.003) 0.315 :t (0.005)
RP L2 0.316 :1: (0.004) 0.336 :1: (0.009) 0.352 i (0.001) 0.390 i (0.009)
SVM 0.30] :1: (0.006) 0.302 :1: (0.005) 0.304 i (0.002) 0.306 :t (0.003)
KNN 0.30] :l: (0.006) 0.301 :t (0.005) 0.302 :t (0.009) 0.303 i (0.008)

 

 

 

 

 

 

(c) Yeast C-l4 Dataset

82

Figure 4.3: F1 Scores of Three Algorithms for MovLens C18 Dataset.

 

 

 

Micro-average F1 Scores
0.40, v v ~ . e 0.24 \ 381, (L \i ‘V— 54* Y
1:24;? few/“f r9: g
0.35 r \' 0,9 LWX/
0.20 "lax/x” -
/ XX
”0.30 w Q! i X
o @016» , // ,
I6 1.. 1 , .1
30.25 g 0* X
‘— “— /)(
LL LL . x f
0.20 "I." x. KNN ( K = 10 ) ,
{‘5' + SVM binary
L 7 Rank—Prop (L1)
015* x A Rank—Prop (L2)

 

 

 

 

 

0.10 ‘ ‘
2 4 6 81012141618

 

 

 

 

 

 

 

 

Macro-average F1 Scores

 

 

 

 

 

6 81012141618

 

 

 

 

 

0
Rank of Categories Rank of Categories
(3) The Number of Training data = 20
Micro-average F1 Scores Macro-average F1 Scores
0.45, - . . 0.33, . .
‘ ,L KNN (K =10)
+ SVM binary
— RP L1
4;} RP L2 0.25
J
i @120
l 08>
\ ’\\.‘(\ .\ L ‘-
*‘iiiﬁg 1.1.0.15
0.10
040 A ‘ 0.05 J .
024681012141618 024681012141618
Rank of Categories Rank of Categories

(b) The Number of Training data = 40

83

Figure 4.4: F 1 Scores of Three Algorithms for MovLens C18 Dataset (cont’d).

Micro-average F1 Scores Macro—average F1 Scores

 

0.30,

 

0.50I....ee

0.25 >

 

 

 

 

 

 

 

 

 

 

 

 

330.20»
0
c8
‘— , 7f
LL0-15 ,'.’.'_
ﬂy —-><— KNN(K= 10)
010* [ft/ix ”‘1”— SVM binary 4
' H.’ — Rank-Prop (L1)
.137" 9 Rank-Prop (L2)
1 i; ’52"
0.155“ - 5 . . . . . 0.05! . .
024681012141618 0 5 10 15

Rank of Categories Rank of Categories

(c) The Number of Training data = 80

 

 

 

 

 

 

 

 

 

 

 

 

 

Micro-average F1 Scores Macro-average F1 Scores
0501 ’ ' * . . r . . 0.30 . a 4 r a T r
| ‘705; @t'%%\»\
I of ' =>« / do
045'» 7 (‘3- 1 ' 3 ~- 88p.
5 W“ 0.25- ,f V
0.40:» f [g )5 i x Q
! \‘x K» '
i / #4. \u; .U _ f’ ,'
$0.35r’k , KQ‘Q..- a3020» ’ /’”‘- ‘51:“ ‘
é *‘—~(§$’\ g ' I /’/:7‘r‘"+f”#r
030 ‘ . "‘13.. , 7*?"
(D r . x 2x X» ¥:‘+\f’>f} (D 1 / f
‘— ‘ \XCK‘i-g\“*/\ ‘— /
LL 1 X \X\ ‘ 'F‘Qh r, LL 0 15 ~_' 1
025 Jr XX X‘~>¢.,;.-%2esn m L/y/
o 2 0_ 5x —><— KNN (K = 10)
. l 0 101 1%,. —+— SVM binary .
0 15 x' _ ' j ﬁg —- Rank-Prop (L1)
' 1 5 4g] «9 Rank-Prop (L2) 2
0.10i 1 1 L 1 r r 1 r 0.05' X 1 r r r 1 r 1 r 1
o 2 4 6 8 1o 12 14 16 18 o 2 4 6 8 1o 12 14 16 18

Rank of Categories Rank of Categories

((1) The Number of Training data = 100

84

F1 Scores

Figure 4.4: F 1 Scores of Three Algorithms for St. Andrews C10 Dataset.

Micro-average F1 Scores

 

 

 

 

 

 

 

 

Macro-average F1 Scores

 

0.40

0.35 .

.0
(.0
0

F1 Scores
0
i0
01

r‘.

/O\3“9\

1 V A
/ \./

.0
N
o
\\

0.15

 

\TIJ ‘

    

 

 

0.10 L

34567

8910

Rank of Categories

(a) The Number of Training data = 100

 

 

 

 

 

0.40 5
Q
/<)\\\ \ If”... ,,._ _~‘
0‘35 L ALL/£21 ELK/:57“ ¢:A€:i1g
,. rd—E’EX“ - ﬂ
5 ‘ " {9,1445
0.30 I," ,/,//x,/’ '
[11" [Air /‘
4.5 K/ X
0.25..» ,gr/
ix
if
0-20 [,5 —>e— KNN ( K: 10)
2 —+— SVM binary
—— Rank-Prop(L1)
0‘5)“ 9 Rank—Prop (L2) ‘
0.10 - 5 . . . g - 5
1 2 3 4 5 6 7 8 9 10
Rank of Categories
Micro-average F1 Scores
0.40 . . .
'94, 5:
' \‘(Dr’x JK“‘-f'_j_ ,-.--‘~L_
0.35» 130/ " /“ “L 27345»
I; x” , #4175,er ,.
,.,/+””*:/<' “‘e/
3 0.30 .5” ,/ /)<
5 ,1] l//// //
6’) 0.25;; tr f
1— \‘/ f/I //
u. /
// ~/
0.20 35/
,4.» —><—KNN(K=10)
' -+— SVM binary
Rank-Prop (L1)
4% Rank-Prop (L2)

 

 

 

 

 

 

2 3 4 5 6 7 8 9 10
Rank of Categories

Macro-average F1 Scores

 

0.40

0.35»

0.30 .

0.25 >

F1 Scores

0.20 '
\ J

.
, .‘
1‘ I
015 ‘
’7‘!
- ,3”
,

 

 

 

 

0.10

23456

78910

Rank of Categories

(b) The Number of Training data = 200

85

Figure 4.5: F 1 Scores of Three Algorithms for St. Andrews C10 Dataset (cont’d).

Macro-average F1 Scores

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Micro-average F1 Scores
0.45 . 5 a 5 . . 5 5 0.40 a 5
. /’&.,
0 4° 1 1;; “ \c 0.35
.>/’
0 35‘ W ””2“ ”‘
/ ,_,/‘ I , ‘: L51”: /
1’ ,, :74jfi—T—1'Iﬁfé'1wé 0 . 30 //?(
g 1 ”;;’+ 'I” 8 I,
50.30 ,4 2‘ - 75
($3 // ,/ c§125
fr 1—
5:11.25, ,3, // u.
,5?" ,g + KNN (K =10) . 0.20
0-20 ,9" /" —+- SVM binary *-
J» ,5" —— Ran k-Prop (L1) ,5?"
0.15 L' *9” Rank-Prop (L2) , 0.15.,9/5',’
0 10T 1 1 1 1 1 1 a 1 0‘10? 1 1 1 1 1 1 1 1
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
Rank of Categories Rank of Categories
(c) The Number of Training data = 300
Micro-average F1 Scores Macro-average F1 Scores
0.55. . . 5 a . 5 . 5 0,50 . . . . . .
I n
I 1 L
059 r x. 0 45 ,,5~>~~s\
," \ ‘v;\ ' \Télx
0.45» ;‘ < “5 ‘3.
(A; K: j ‘3‘ ‘0 \E.‘;\
0.40 ~ ,5 "‘“ Y, * Bk-
a) .t -’ ., , \
(D I ,,,,,,,, LP -_j“§ - , _
h0.3 ,,.-I , "“ $42M. ‘ . . f‘}
8 5:5 IQ:I+/+ K- 8‘ :1 /,§/f’§ 4
”0.30 , ’ /.X“ J a?
E / ,x ,2; ,
0.25» ,9' ~ 2‘2”
9" g +— KNN ( K =10) i
0-20 2;; ,5 —+— SVM binary
Jr," —— Rank-Prop (L1)
0.15,! «5} Rank-Prop (L2) ‘
010 1 1 1 1 4 1 1 1 1 1% 1 1 1 1
1 2 3 4 5 6 7 8 9 1o 3 4 5 6 7 8 9 10

Rank of Categories Rank of Categories

((1) The Number of Training data = 400

86

Micro-average F1 Scores

 

0.65
0.55 5
@0455
[0.35-

0.25 '

 

‘r

 

 

 

 

 

 

~x— KNN(K= 10)
A, —+- SVM binary
‘ ,",1} Rank-Prop (L1 )
j ~9- Rank-Prop (L2)

 

 

0.15

2 4 6 8 10 12 14
Rank of Categories

Figure 4.5: Fl Scores of Three Algorithms for Yeast C14 Dataset.

Macro—average F1 Scores

 

0.45

0.35 _

F1 Scores
0
N
U“

0.15-

 

 

‘ 936%wa 315.55,,
,

/

 

 

0.05

2 4 6 8 1O 12 14
Rank of Categories

(e) The Number of Training data = 300

Micro-average F1 Scores

 

0.65 r

 

0.60}

0.55 ,

9
01
o

_o
.9.
01

0.40 1

F1 Scores

0.35 r

0.30»

0.25 *

 

 

 

+ + KNN ( K =10)
91'} -+— SVM binary 1
J —— Rank-Prop (1.1)
1‘11" +5.» Rank-Prop (L2)

 

 

 

 

L

 

0.20

2 4 6 8 10 12 14
Rank of Categories

Macro-average F1 Scores

 

0.50
0.45 >
0.40 ~

0.35 1

.0
w
0

F1 Scores
.0
N
9‘

0.20 L

0.15-

0.10»

 

0.05l

 

 

2 4 6 8 10 12 14
Rank of Categories

(1) The Number of Training data = 500

F1 Scores

F1 scores

Figure 4.5: F 1 Scores of Three Algorithms for Yeast C14 Dataset (cont’d).

 

Micro-average F1 scores Macro-average F1 Scores

 

   

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0.65 f 5 5 a 5 0.50 5 5
i 19.}-T§\\W A A
0.60: A," \;\:\ 1 0.45 - gerawjaﬁig 1
. \J \,\_\ 2'25", . '_ ."1
0.55» 55 H 5 0.40 1 “--
:' //" $7#—:::\;¥ ,- (3" ,1-
0.501 5,1 ,1?” X32? 0.351 / /
rf1";(1 ' Q " m V
0 45 5 7" +’ ‘55 $0.30-
I 515
0 40 L ,1. x 5— 025’
5+ LL
0.351 55 + KNN(K= 10) - 0.20-
—+— SVM binary
0.305 —— Rank-Prop(L1) 5 0.151 1
j ‘75 A Rank-Prop (L2)
0.25» I 0.10L
5
0.201 1 1 1 1 1 1 0,05 5 1 1 1 . .
0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14
Rank of Categories Rank of Categories
(g) The Number of Training data = IOOO
Micro—average F1 Scores Macro-average F1 Scores
0.65 . . 5 . . . 0.50 5 5
0.60 0.45 get} {ya -
040 '27 W
0.551 _ 1 ,_ 1 5 ,- 1 .
0.501 5 0.351 )3 5
. , 5 ,1 ./
0451 .5; 80.301 1,: {2:
(D ’ of
0.401 .025 - ,1 ,5,
.. .5 LL ,5/7’
0.35» 5,155+— KNN(K= 10) 5 0.20, 15,15,511,
5"." —-+— SVM binary ', x
0.301 555,1} — Rank—Prop (L1) 1; 0.15} (3,5,: ,1"
l 7,5 ~—} Rank-Prop 5L2) ‘ 1 [1975
0.25- 5 5 0.105
i i l 1
0.20 1 1 1 1 + 1 0.05 1 5
0 2 4 6 8 10 12 14 0 5 10 15
Rank of Categories Rank of Categories

(h) The Number of Training data = 1500

88

Table 4.2: Area under ROC Curve for MovieLens Dataset.

 

Number of Training examples

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Classiﬁer 20 40 80 100
RP Ll 0.640 i (0.012) 0.663 d: (0.009) 0.690 i (0.025) 0.728 :1: (0.003)
RP L2 0.714 3: (0.025) 0.742 d: (0.002) 0.779 :1: (0.011) 0.790 i (0.008)
SVM 0.600 :1: (0.008) 0.628 i (0.031) 0.631 :1: (0.008) 0.682 :1: (0.007)
KNN 0.612 d: (0.029) 0.625 :1: (0.062) 0.636 i (0.040) 0.641 d: (0.030)
(a) Micro-average
Number of Training examples
Classiﬁer 20 40 80 100
RP L1 0.59] :1: (0.014) 0.613 i (0.004) 0.643 i (0.009) 0.655 i (0.007)
RP L2 0.608 :t (0.022) 0.628 :1: (0.024) 0.656 :1: (0.019) 0.665 d: (0.012)
SVM 0.575 d: (0.015) 0.599 :l: (0.005) 0.611 :1: (0.015) 0.623 :1: (0.007)
KNN 0.569 i (0.005) 0.579 :l: (0.003) 0.597 i (0.007) 0.608 d: (0.004)
(b) Macro-average
Table 4.3: Area under ROC Curve for St. Andrews Dataset.
Number of Training examples
Classiﬁer 100 200 300 400
RP Ll 0.64] :1: (0.015) 0.644 :1: (0.011) 0.645 :1: (0.010) 0.651 i (0.011)
RP L2 0.650 :1: (0.009) 0.654 :1: (0.010) 0.672 :1: (0.008) 0.743 :1: (0.007)
SVM 0.634 :1: (0.021) 0.636 :1: (0.007) 0.638 :1: (0.008) 0.649 i (0.007)
KNN 0.512 i (0.011) 0.516 :1: (0.009) 0.519 :1: (0.009) 0.524 :t (0.009)
(a) Micro~average
Number of Training examples
Classiﬁer 20 40 80 100
RP L1 0.553 i (0.023) 0.592 :1: (0.010) 0.596 :t (0.011) 0.596 d: (0.013)
RP L2 0.582 i (0.001) 0.621 d: (0.003) 0.665 :1: (0.029) 0.697 :1: (0.014)
SVM 0.533 :1: (0.019) 0.578 i (0.011) 0.581 :1: (0.008) 0.584 :1: (0.013)
KNN 0.519 i (0.011) 0.522 d: (0.010) 0.527 :t (0.010) 0.529 :1: (0.009)

 

 

 

 

 

 

(b) Macro-average

89

Table 4.4: Area under ROC Curve for Yeast Dataset.

 

Number of Training examples

 

 

 

 

 

 

 

 

 

 

 

Classiﬁer 300 500 1000 1500
RP Ll 0.690 :1: (0.010) 0.722 :1: (0.0041) 0.734 i (0.010) 0.754 :1: (0.008)
RP L2 0.715 3: (0.013) 0.754 i (0.002) 0.784 i (0.005) 0.797 :1: (0.008)
SVM 0.648 :1: (0.011) 0.710 i (0.006) 0.721 i (0.005) 0.727 :1: (0.009)
KNN 0.703 i (0.009) 0.709 i (0.005) 0.710 3: (0.006) 0.715 :t (0.004)

(a) Micro-average
Number of Training examples

Classiﬁer 300 500 1000 1500
RP Ll 0.594 :1: (0.010) 0.599 :1: (0.002) 0.599 :t (0.003) 0.612 :1: (0.009)
RP L2 0.598 1 (0.008) 0.606 :t (0.003) 0.607 i (0.003) 0.62] i (0.004)
SVM 0.589 :1: (0.003) 0.589 :1: (0.004) 0.598 :1: (0.006) 0.601 i (0.002)
KNN 0.586 :1: (0.004) 0.587 i (0.004) 0.588 i (0.007) 0.600 :1: (0.004)

 

 

 

 

 

 

(b) Macro-average

Results and Analysis 11: Sensitivity Test

We discuss the inﬂuence of the parameters A and B on the proposed Rank Propagation

algorithm in this empirical study. The following are the F1 scores for Yeast dataset with

80 and 100 training examples for different B values (since only the ratio between A and B

matters, we ﬁx A = 1 and consider the value of B).

As we can see, values smaller or greater than the optimal value for B all result in worse

performance than the optimal value. It is consistent with the intuition since too large values

will dominate the resulting propagation scores and too small values can not express the

differences in transitional probabilities. The parameter B has the similar inﬂuence on RP

L2 to RP Ll, thus we didn’t include it here.

90

Figure 4.5: F1 scores of Rank Propagation algorithm using l1 loss function with different
B values on MovLens C18 dataset.

Micro-average F1 Scores Macro-average F1 Scores

 

 

 

 

 

 

 

 

 

 

 

 

 

0.05 —_.J—LiJ—J—l—__ 0 1 1 1 1 1 1 1 1
o 2 4 6 8 1o 12 14 16 18 o 2 4 6 a 10 12 14 16 18
Rank of Categories Rank of Categories

(a) The Number of Training data = 80

 

 

 

 

 

 

 

 

 

 

 

 

Micro-average F1 Scores Macro—average F1 Scores
0.401 . . = 0.25 . . 1 . . .
9
/
X 4
12‘.
M /
)6
=5
=20
=30
=40
1 , =50
. ‘ , =10
0.05 / /
x l
>/
0 1* 1 A 1 1 1 1 o 1 1 1 A 1 1 1 1
024681012141618 024661012141618
Rank of Categories Rank 01031690035

(b) The Number of Training data = 100

91

Chapter 5

Propagation over Directed Graph

Label propagation approaches have gained popularity in semi-supervised learning area.
Most previous research has been done on the undirected graph [4, 5, 57, 55, 24, 51]. How-
ever, there are many scenarios which involve directed graphs. For example, in order to
explore the link structure of the web for ranking web pages, we usually view the links be-
tween web pages as directed. A number of studies have been devoted to semi-supervised
learning on the directed graph such as [50, 52]. In this chapter, we will explore the label

propagation approach over directed graphs in the application of text categorization.

Label propagation has been shown to be an effective approach for text categoriza-
tion [57, 23]. The main idea is to predict the category assignments for unlabeled documents
by propagating the category information of the training documents through the similarities
between documents. Most previous research on label propagation focuses on undirected
graphs which are based on symmetric similarity matrices. However, in a number of scenar-
ios, the similarity relationship between documents can be asymmetric and is better captured
by directed graphs. For example, consider two documents d,.; and dB with document (1,;
being a part of document d3. Evidently, if document dA belongs to category c, we would
expect that document (13 should also belong to category c. However, this inference is irre-

versible, namely we can not infer the category for document (1.1 if we know the category

92

of (13. To capture the asymmetric relationship among documents, we propose the label
propagation approach for text categorization using directed graphs.

The proposed approach ﬁrst constructs a directed graph from document collection, then
converts this directed graph into an undirected one, and ﬁnally makes prediction by propa-
gating the class labels over the converted undirected graph. There are two questions which

need to be answered:

How do we utilize the directed graph? A straightforward way is to ﬁrst convert the
direct graph into an undirected one and then apply a standard label propagation method
in this converted undirected graph. In our study, we will discuss the work in [52]. It
proposed a regularization framework which converts a directed graph based on asymmetric
similarity matrix into an undirected graph. We then propagate the labeling information on

the undirected graph using Spectral Graph Transducer [23].

How do we construct a directed graph? To construct a directed graph from document
collection, it requires computing asymmenic similarities, namely similarity of (1,4 to d B is
different from the similarity of (13 to (1,1. Two asymmetric similarity measures are presented
in this study, i.e., the asymmetric similarity based on the KL divergence and the asymmetric
similarity based on the modiﬁed cosine similarity. We evaluate the efﬁcacy of these two

asymmetric similarities through an empirical study with multiple datasets.

5.1 Convert a Directed Graph into an Undirected Graph

Let D 2 (d1, - - - ,dn) denote the entire collection of documents. Assume the ﬁrst 721 doc-
uments of D are labeled by y1 = (311,- - - .31.“) and the remaining nu = n — n, documents
are unlabeled. Each class label y,- is either +1 or —1. Let 'u;(-21||j) Z 0 denote the similar-
ity of the ith example to the jth example. Note that the similarity is asymmetric, namely

'w('z' Ilj) # 111(j Hi). We further denote by W E R1“ the similarity matrix for all the exam-

93

ples. Our goal is to predict y“, the category labels for the unlabeled documents, by propa-
gating the class labels y) over the asymmetric similarities 212(2' 11 j). In the following, we will
ﬁrst describe the approach in [50] that is used to convert a directed graph into an undirected
graph. We will then describe the spectral graph transducer (SGT) algorithm [23], which is

used to propagate the class labels over the converted undirected graph.

5.1.1 Conversion to Undirected Graphs

Given a set of labeled documents Dr 2 {(d1,c1), - - . , (d), c,)}, the task is to classify the
unlabeled documents ’19,, = {d,+1, - - - ,d,+,,}. Assume n = l + it. We can construct a
directed graph G = (V, E) as described in the previous section. V is the set of nodes and
each node represents a document. E is the set of edges. The edges are weighted and there is
a weight function 212 : E —> IR+ which associates a positive value w([u, 12]) with each edge
[11,21] 6 E. We use K-L divergence between two documents based on their term vectors
as the weight function. The out-degree function (1+ : V —> 1R+ and in-degree function
(1‘ : V ——> IR‘ are deﬁned as: d+(u) = Zungﬂuwl) and d”(u) = Zuﬂ,w([u,v]).
11. ——> 11 denotes the set of vertices adjacent from the vertex 1.1 and u <— 11 denotes the set of

vertices adjacent to u.

For a given weighted directed graph, there is a natural random walk on the graph with
the transition probability function p : V x V —+ IR+ deﬁned by p(u, v) = w([u, v])/d+(u)
for all [11. v] E E, and 0 otherwise. The graph we constructed is strongly connected and
aperiodic and the random walk has a unique stationary distribution 1r which satisﬁes the

balance equations 71(11) 2 Z 7r(u)p(u.,12), for all v E V. tr(rr) > 0 for all 0 E V.

u—vl’

Assume a classiﬁcation function f E ’H(V), which assigns a label sign f (c) to each

vertex v E V. A functional {2 can be deﬁned as

 

 

00’) F; 2: ”(U1p1'u,2’)( f0“? — M) > (5.1)

[11.116 E

94

Q sums the weighted variation of a function on each edge of the directed graph. The initial
label assignment should be be changed as little as possible. Let y denote the function
in H(V) deﬁned by y(2:) = 1 or -1 if vertex v has been labeled as positive or negative

respectively, or 0 if it’s unlabeled. We can form the optimization problem as follows:

argminfeH(\/){Q(f) + l1 11 f — 3/ “2} (52)

where [L > 0 is the parameter specifying the tradeoff between the two competitive terms.

Deﬁne the Operator 8 : H(V) —> H(V) as follows:

 

 

_ .,. :1 7r(u)p(u,'v)f(’u-) 7r(’v)p(v,u)f(u)
(0M) A; T—myrm +2 'r_(v)r(u) )

G) can be written in matrix form as

H1/2pn—1/2 + H—1/2PTH1/2

e 2

 

(5.3)

where II denote the diagonal matrix with 11(1), 1)) = tr(v) for all v E V. P is the transition
probability matrix and PT is the transpose of P. Note: 9 is symmetric although the original
weight matrix is asymmetric.

Using 9 as the new symmetric similarity matrix, we conduct the propagation using SGT
[23] which is discussed in details in the following section. Figure 5.1 shows the algorithm.
Note that for multi-class problem, we compute the function f for each category using the

algorithm.

5.1.2 Propagation Scheme: Spectral Graph Transducer

We brieﬂy introduce Spectral Graph Transducer (SGT) in this section. SGT is a method for
transductive learning which can be seen as a transductive version of the k nearest-neighbor

classiﬁer. In this method, the training problem has a meaningful relaxation that can be

95

solved globally optimally using spectral methods. A key advantage of the method is that it
doesn’t require additional heuristics to avoid unbalanced splits which makes this algorithm
very robust.

Let y = (511,. . . ,y,,) denote the class labels for all the examples. Evidently, ya =
(yn,+1,. . . , i1”). Given a symmetric similarity matrix S, we need to ﬁnd the class labels
y that on one hand is consistent with y,, and on the other hand is consistent with the
similarity matrix S. Follow [23], we can formulate the above objectives into the following
optimization problem,

it 7,,
mi” 2 Still/2' — 902 + 0:011 — 71')?
i=1

yeR" . .
20:1

s. t. 2": y, = 0, ity? =71. (5-4)
i=1

i=1

where “,2. is 7+ if y,- = +1 and 0,_ otherwise. [23] computes the similarity-weighted k

 

Input: The label assignment vector y containing the labels for training data and
0 for testing data; the similarity matrices SD for documents computed by K-L
divergence.

Output: The classiﬁcation function f.

Algorithm:

1. Construct the directed graph: construct the graph G = (V, E). V is the set
of nodes and each node represents a document. E is the set of edges and each
edge [11, v] is assigned a weight ru([u, 0]) determined by SD(u, 0). Compute the
transition probability matrix P deﬁned as p(u, v) = u'([u, 11]) / (1+(u).

1. Deﬁne the random walk: deﬁne a random walk over G with a transition
matrix P such that it has a unique stationary distribution 7r.

2. Compute the matrix 9: Let ll denote the diagonal matrix with its diag-
onal elements being the stationary distribution it of the random walk. Compute
the matrix 8 as G = (Ill/QPII‘I/2 + H‘l/QPTH1/2)/2.

3. Compute the function f: Compute the function f using SGT, and clas—
sify each unlabeled vertex v as sign f (v).

 

 

 

Figure 5.1: The Algorithm for Classiﬁcation on a Directed Graph

96

nearest-neighbor graph S as

simtﬁjﬁ)

SI kaeknn(r‘,) sim(f,.;r'k)

 

if f} E ktm(i",-)

0 else

and then symmetricize it as S = S + S’. The details of computing H can be found in [23].

5.2 Construct a Direct Graph

Label propagation approaches often involve a connected graph constructed from the data
samples. Consider a classiﬁcation problem. Assume we have the labeled data set
{(x1.y1), - . - , (x). 211)} and the unlabeled data set {x,+1, - - - ,x,+,,}. We construct a con-
nected graph G = (V, E): V is the set of nodes in which each node represents one data
sample; B is the set of edges between nodes and each edge is assigned with a value which
is determined by the weight function 10. The function w is usually computed based on the
similarity between all nodes. For instance, we use RBF kernel to compute the similarity. If
u; is a symmetric measure which means w(:r,-, .123) = tr(rj, xi), the graph is an undirected
graph since the edge from the node :1:,- to 35 is viewed equally as the edge from the node an,-
to xi. All the approaches we discussed from the previous chapters are based on undirected
graph. In many cases, directed graphs may be a more appropriate choice to express the
nature of the data set. By deﬁning the weight function 20 as an asymmetric function, we
can construct a directed graph.

Let’s consider the document classiﬁcation problem. Given a set of labeled doc—
uments ’D; = {(d1.c1), - . . , (d;,cl)}, the task is to classify the unlabeled documents
73,, = {d,+1, - - - ,d,+,,}. In order to apply the label propagation approach, we need to
compute the document similarity matrix S. Two asymmetric similarity measures are pre-
sented in the following study, i.e., the asymmetric similarity based on the K—L divergence

and the asymmetric similarity based on modiﬁed cosine similarity.

97

K-L divergence similarity measure For probability distribution P and Q of a discrete

variable, the K-L divergence of Q from P is deﬁned to be

Dm P II (JV—21’0")! —’—P (5.5)

Given the document d,- and the document dj, we ﬁrst compute the pairwise KL distance

D(i||j) as follows:

UIDIJ') = :dlzkl 0:: (A) (5.6)

d) k
where d“c is normalized term frequency and is deﬁned as d“. = d,,k/ 213:1 di,k’~ We then
convert the distance D(i||j) to similarity w(i||j) by 1.1)(illj) = exp (—/\D(i||j)) where A is
a scaling factor and is determined empirically.

The graph is directed since the edge from the node representing d,- to the node repre—
senting d,- is different from the edge from the node representing d,- to the node representing
di. By using KL divergence, we are trying to express in the pairwise similarities the differ-

ences from document length, term probability distribution and some other factors.

Asymmetric cosine similarity We assume that the document length reﬂects the variety
of topics which may have impact on propagation. Thus the resulting similarity should be
weighted according the document length. Standard cosine similarity is a symmetric mea-
sure. To capture the differences from document lengths, we modify it to be a asymmetric
measure.

Let’s denote by X,- the set of words used by the ith document, i.e., X,- = {jldm- 7£ 0}.

We then measure the similarity of d,- to dj by the following expression:

(1,: kdj k

keX Z\/Zkex, d-i k\/Zke.¥.- J31:

 

 

wUII))= (57)

This actually makes intuitive sense: a short document may have large similarity to a long

98

document with diverse topics while the long document may not have the large similarity to

the short one because of its diverse t0pics.

5.3 Case Study 1: Binary Classiﬁcation

We evaluate our pr0posed approach with binary text classiﬁcation tasks. Generally, the

following questions will be addressed:

0 Is the classiﬁcation performance improved by using directed graph? We compare
label propagation approaches using directed graphs with approaches using undirected
graphs on the same classiﬁcation tasks. SGT [23], SVM [21] and LP [57] are used

as our baselines.

6 Why do directed graphs improve the performance? From the experiments we can
see that propagation using directed graphs outperforms using undirected graphs. But
how it actually helps classiﬁcation and why this is the case still need to be studied.
We’ll discuss why directed graphs actually help classiﬁcation with the analysis of the

dataset.

o How sensitive are two proposed asymmetric similarity measures? We proposed two
similarity measures above. How they are impacted by different parameters will be

presented and analyzed.

5.3.1 Experiment Setup

We evaluate the proposed algorithm for semi-supervised text categorization by multiple

datasets which are described as follows:

6 Movielens dataset: We use the MovieLens dataset‘. The ﬁve most popular categories

are used in this study, including Adventure, Children’s, Comedy, Drama and Thriller,

 

lhttp://www.grouplens.org/

99

which results in a total of 1354 movies. Each movie is described by a number of
keywords, and the goal of this experiment is to predict the genre of movies based on

their keyword description.

6 20-newsgroup: For 20-newsgroup dataset, we randomly selected 5 categories and
100 documents from each category. The resulting dataset contains 500 documents
and around 7300 words. There is no overlapping of documents among categories.

That is to say, a document is assigned to only one category.

6 Reuters-21578: Reuters-215 78 dataset contains more than 20,000 documents. We
used the subset Reuters-21578 R8 dataset2 which contains 7674 documents and 8
classes. Each document is assigned to only one category. Since some classes don’t
have enough samples, we select 6 classes which have more than 80 documents. The
resulting dataset has 543 documents with the vocabulary of 4452 words and the av-

erage number of documents within one class is around 91.

Three baselines are used in our study: the supervised support vector machine (SVM),
the spectral graph transducer (SGT) over undirected graphs where document similarity is
computed with cosine similarity and the label propagation using harmonic function (LP) in
[57]. SVM is a supervised learner which only utilizes the labeled data. SGT explores the
undirected graph with an adjacency matrix based on k-nearest neighbors and propagates
the labels over the undirected graph. LP propagates the labels over an undirected graph
using gaussian ﬁelds and harmonic function.

We refer to the directed graph label propagation based on the asymmetric cosine sim-
ilarity and the KL divergence as DP-Con and DP-KL, respectively. F1 is used as the
evaluation metric. All the experiments are repeated ten times, and F1 averaged over ten

trials is reported as the ﬁnal result.

 

2http://www.gia.ist.utl.pt/*acardoso/datasets/

100

5.3.2 Experiment Results

Table 5.3.2, 5.2 and 5.3 present the F1 scores of ﬁve algorithms in different binary classiﬁ—
cation tasks for three datasets. First, as we can see from three tables, DP-KL and DP-Con
always outperform SGT and LP which are both semi-supervised learning algorithms using
undirected graphs. We attribute the improvement to utilizing the directed graphs during
propagation since the directed graph can better capture the asymmetric relationship be-
tween documents in many cases. Detailed discussion about why it is the case will be
presented later.

Second, generally speaking, except LP algorithm, semi-supervised learning algorithms
(DP-KL, DP-Con and SGT) perform better than SVM which is a supervised learning algo-
rithm. This is reasonable because the deﬁciency of training examples will affect the effec—
tiveness of supervised learning. This is again proved that utilizing unlabeled data is impor-
tant for classiﬁcation tasks especially when a large amount of training data is not available.
As the size of training set increases, the performance of SVM is usually improved signif-
icantly since more training samples can be used for learning. The performance of SVM
varies from case to case and in some cases shown (for example, the category ”Childre’s"
in Table 5.3.2, the category ”trade” in Table 5.3), it outperforms SGT, but not DP-KL and
DP-Con. This supports out estimation that directed graphs can improve the performance.

Third, LP performs poorly in all cases. We attribute this to two possible reasons. First,
the poor performance may be caused by propagating the labels over the weighted undi-
rected graph based on the similarity matrix in which each pair of nodes are connected and
the edge between them is assigned a weight. The labeling information from connected
edges with small weights may become very noisy during propagation. Thus it improves
the performance by using adjacency matrix with h nearest—neighbor which removes unnec-
essary edges from the graph. Another reason may be the threshold for classiﬁcation. The
prediction result from harmonic function seems very sensitive to the threshold. Consid-

ering the class prior is one reasonable way to adjust the threshold. It works usually better

101

than simply taking the sign of the ﬁnal value. But compared to other algorithms, LP doesn’t

seem always work well.

 

Number of Training examples

 

 

 

 

 

 

 

 

 

 

 

 

Cat. Alg. 10 20 40 80
DP-KL 0.201 (0.044) 0.209 (0.042) 0.210 (0.034) 0.220 (0.004)
DP—Con 0.225 (0.055) 0.244 (0.053) 0.252 (0.050) 0.264 (0.047)

Adv. SGT 0.149 (0.026) 0.183 (0.032) 0.2036 (0.034) 0.208 (0.044)

SVM 0.111 (0.011) 0.170 (0.067) 0.229 (0.058) 0.346 (0.010)

LP 0.155 (0.002) 0.157 (0.013) 0.156 (0.008) 0.163 (0.005)

DP-KL 0.194 (0.046) 0.197 (0.040) 0.196 (0.052) 0.198 (0.017)
DP-Con 0.161 (0.068) 0.2069 (0.087) 0.264 (0.091) 0.281 (0.068)

Chi. SGT 0.139 (0.038) 0.147 (0.022) 0.156 (0.025) 0.159 (0.036)
SVM 0.137 (0.085) 0.155 (0.020) 0.180 (0.056) 0.189 (0.016)

LP 0.071 (0.003) 0.074 (0.021) 0.072 (0.022) 0.080 (0.005)

DP-KL 0.412 (0.069) 0.422 (0.043) 0.440 (0.038) 0.444 (0.034)
DP—Con 0.381 (0.042) 0.404 (0.035) 0.411 (0.049) 0.421 (0.038)

Com. SGT 0.373 (0.0676) 0.395 (0.057) 0.405 (0.047) 0.413 (0.025)
SVM 0.311 (0.111) 0.321 (0.053) 0.345 (0.074) 0.363 (0.065)

LP 0.296 (0.223) 0.294 (0.002) 0.289 (0.004) 0.293 (0.006)

DP-KL 0.630 (0.063) 0.639 (0.053)) 0.678 (0.044) 0.702 (0.020)
DP-Con 0.634 (0.024) 0.659 (0.051) 0.6612 (0.075) 0.700 (0.017)

Dra. SGT 0.583 (0.071) 0.588 (0.105) 0.647 (0.039) 0.678 (0.031)
SVM 0.348 (0.074) 0.3979 (0.054) 0.416 (0.014) 0.546 (0.055)

LP 0.353 (0.002) 0.3601 (0.002) 0.427 (0.004) 0.431 (0.005)

DP-KL 0.1985 (0.047) 0.201 (0.034) 0.209 (0.037) 0.213 (0.023)
DP-Con 0.220 (0.044) 0.230 (0.044) 0.231 (0.046) 0.245 (0.029)

Thri. SGT 0.172 (0.028) 0.181 (0.034) 0.185 (0.028) 0.195 (0.020)
SVM 0.128 (0.017) 0.131 (0.058) 0.153 (0.060) 0.165 (0.027)

LP 0.136 (0.003) 0.139 (0.002) 0.141 (0.006) 0.161 (0.007)

 

 

Table 5.1: The F1 results for MovieLens Dataset.

102

 

Number of Training examples

 

 

 

 

 

 

 

 

 

 

 

 

Cat. Alg. 10 20 30 40
DP-KL 0.585 (0.020) 0.606(0.069) 0.611(0.057) 0.623 (0.076)
DP-Con 0.616(0.042) 0.628(0.042) 0.635 (0.043) 0.653 (0.076)

1 scr 0.570 (0.075) 0.578(0071) 0.595 (0.070) 0.604 (0.080)
svrvr 0.332 (0.055) 0.434 (0.070) 0.469 (0.022) 0.594 (0.008)

LP 0.245(0007) 0.261 (0.013) 0.285 (0.011) 0.304 (0.016)
DP-KL 0.557 (0.044) 0.562 (0.039) 0.572 (0.047) 0.589(0.042)
DP-Con 0.629 (0.041) 0.652 (0.055) 0.663 (0.052) 0.667 (0.047)
2 sor 0.544(0045) 0.554(0024) 0.567 (0.066) 0.572 (0.065)
SVM 0.399 (0.053) 0.422(0044) 0.514(0024) 0.592(0013)
LP 0.361 (0.007) 0.376 (0.009) 0.385 (0.012) 0.395(0021)
DP-KL 0.562 (0.052) 0.594 (0.044) 0.629 (0.057) 0.626 (0.070)
DP-Con 0.594 (0.054) 0.614(0.035) 0.642 (0.051) 0.641 (0.051)
3 SGT 0.544(0040) 0.559 (0.064) 0.566 (0.065) 0.570 (0.084)
SVM 0.365 (0.055) 0.520 (0.053) 0.573 (0.074) 0.607(0.069)
LP 0.249(0.008) 0.271(0013) 0.293(0015) 0.317(0015)
DP—KL 0.567 (0.046) 0.571 (0.066) 0.596 (0.064) 0.607(0.060)
DP—Con O.614(0.027) 0.632 (0.035) 0.639(0.051) 0.640 (0.074)
4 sor 0.542 (0.031) 0.561 (0.058) 0.567 (0.068) 0.590 (0.090)
SVM 0.382 (0.055) 0.460 (0.027) 0.515(0.069) 0.600(0.013)
LP 0.382 (0.004) 0.411(0007) 0.420 (0.016) 0.427 (0.018)
DP—KL 0.477(0053) 0.487 (0.035) 0.500 (0.045) 0.501 (0.047)
DP-Con 0.599(0.038) 0.610(0.041) 0.616(0.043) 0.630 (0.033)

5 sor 0.459 (0.069) 0.462 (0.037) 0.462 (0.046) 0.473 (0.059)

SVM 0.253 (0.072) O.308(0.009) 0.451 (0.011) 0.533(0.069)
LP 0.154 (0.008) 0.168(0.010) 0.187 (0.013) 0.206(0.020)

 

Table 5.2: F] measure for 20—newsgroup Dataset.

103

 

Number of Training examples

 

 

 

 

 

 

 

 

 

 

 

 

 

Cat. Alg. 10 20 30 40
DP-KL 0.514 (0.080) 0.535(0.083) 0.560 (0.097) 0.591 (0.086)
DP-Con 0.527(0.050) 0.527 (0.051) 0.548 (0.041) 0.569 (0.048)
acq SGT 0.504 (0.050) 0.531 (0.054) 0.541 (0.045) 0.549 (0.036)
SVM 0.362 (0.029) 0.446 (0.010) 0.527 (0.049) 0.535 (0.042)
LP 0.177 (0.005) 0.177 (0.010) 0.181 (0.013) 0.188 (0.013)
DP-KL 0.773 (0.056) 0.811 (0.090) 0.813 (0.021) 0.821 (0.015)
DP-Con 0.549 (0.041) 0.572 (0.045) 0.573 (0.051) 0.583 (0.054)
crude SGT 0.540 (0.033) 0.561(0.051) 0.589 (0.057) 0.579 (0.058)
SVM 0.482 (0.084) 0.529(0.037) 0.550 (0.046) 0.563 (0.101)
LP 0.294 (0.009) 0.304(0.011) 0.310 (0.015) 0.315 (0.017)
DP-KL 0.696(0.1 12) 0.731 (0.089) 0.705 (0.102) 0.789 (0.072)
DP-Con 0.595 (0.038) 0.610 (0.044) 0.635 (0.043) 0.647 (0.068)
earn SGT 0.572 (0.029) 0.595 (0.031) 0.606 (0.051) 0.614 (0.063)
SVM 0.508 (0.092) 0.571 (0.120) 0.600 (0.140) 0.608 (0.026)
LP 0.223 (0.018) 0.242 (0.016) 0.271 (0.030) 0.311 (0.026)
DP-KL 0.470 (0.052) 0.505 (0.059) 0.483 (0.065) 0.537 (0.075)
interest DP-Con 0.447 (0.047) 0.492 (0.061) 0.478 (0.044) 0.485 (0.052)
SGT 0.439 (0.054) 0.490 (0.052) 0.464 (0.044) 0.480 (0.061)
SVM 0.403 (0.036) 0.472 (0.032) 0.489 (0.022) 0.434 (0.067)
LP 0.228 (0.012) 0.230 (0.011) 0.230 (0.017) 0.233 (0.014)
DP-KL 0.471 (0.074) 0.528 (0.071) 0.556 (0.080) 0.574 (0.078)
DP—Con 0.471 (0.066) 0.498 (0.062) 0.506 (0.071) 0.506 (0.050)
money SGT 0.456 (0.063) 0.490 (0.061) 0.484 (0.060) 0.490 (0.042)
SVM 0.382 (0.018) 0.436 (0.013) 0.458 (0.021) 0.487 (0.022)
LP 0.208 (0.007) 0.211 (0.012) 0.212 (0.012) 0.212 (0.012)
DP-KL 0.505 (0.090) 0.550 (0.106) 0.576 (0.041)) 0.608 (0.057)
DP—Con 0.494 (0.054) 0.510 (0.058) 0.567 (0.043) 0.582 (0.054)
trade SGT 0.463 (0.057) 0.462 (0.061) 0.462 (0.054) 0.467 (0.062)
SVM 0.478 (0.152) 0.492 (0.095) 0.506 (0.070) 0.523 (0.077)
LP 0.347 (0.007) 0.351 (0.009) 0.354 (0.012) 0.318 (0.012)

 

Table 5.3: Fl measure for Reuters-215 78 R8 Dataset.

104

5.3.3 Analysis and Discussion

The results presented above show that propagation using directed graphs works better than

propagation using undirected graphs. However, it may not always be the case. In this

section, we give a closer look at why directed graphs can bring the improvement of perfor-

mance. We consider SGT, DP-KL and DP-Con.

0.08

0.06

0.04

0.02

0.06

0.04

0.02

The percentage of documents

0.1 *

The Distribution of Documents w.r.t the Number of Words for Movielen dataset

 

 

l

 

20

l l l r T l l

 

_— ‘L _ l _L l
40 60 80 100 120 140 160 180
The number of words in a document

The Distribution of Documents w.r.t the Number of Words for 20-newsgroup dataset

 

 

l

 

l l l l f T T

 

4 6 8 10 12 14 16 18
The number of words in a document

The Distribution of Documents w.r.t the Number of Words for Reuters-R8 dataset

 

l

 

 

50

j 77 77 l V l l I

 

100 150 200 250 300 350 400 450 500
The number of words in a document

Figure 5.2: The Distribution of Document Length in Three Datasets.

Why does propagation using directed graphs outperform undirected graph? We al-

ready show that propagation using directed graphs outperforms propagation using undi-

rected graphs in text categorization task. Recall that we use SGT algorithm for propagation

with the undirected graph converted from directed graph. The performance of each algo-

105

The Probability of Normalized Pairwise Similarity

The Distribution of Normalized Pairwise Similarities using KL Divergence

 

0.2 — *

 

 

1 I 1 ___-L '
0 0 l l l l 1

40.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Normalized Pairwise Similarity

 

The Distribution of Normalized Pairwise Similarities using Asymmetric Cosine Measure

1 I l l l l l l l
0.6 —
0.4 - e
0.2 : -
00 1 l L l l l 1 l l
-0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Normalized Pairwise Similarity

 

 

 

 

The Distribution of Normalized Pairwise Similarities in Undirected Graph

1 l l l l l l r l
0.6 a
0.4 r r
0.2 _
0 O i l l l l l l l
0 0.1 0.2

40.1 0.3 _0.4 . 0.5 . 06 0.7 0.8 0.9
Normalized Parrwrse Srmrlanty

 

l

l

 

 

 

(a) The Distribution of Pairwise Similarities in MovieLens Dataset.

The Distribution of Two Documents Belonging to One Category
w.r.t Normalized Pairwise Similarity for Movielen Dataset

1 1 T 4’1111“
1‘11

 

 
 
 
  

 

0 8 _ *1- DP-Con l
' —e— Undirected

 

 

 

The Percentage of Document Pairs
in One Category

 

 

4 l k
V

0 0.1 0.2 0.3 0.4 0.5
Normalized Pairwise Similarity

 

(b) The Distribution of No documents Belonging to One Category with respect to
Normalized Pairwise Similarity in MovieLens Dtaset.

Figure 5.3: Similarity Matrix Analysis of MovieLens Dataset

The Probability of Normalized Pairwise Similarity

The Distribution of Normalized Painrvise Similarities using KL Divergence
0.6 L I 1 l l I ' T T I

 

 

0 0 LL. i 1 l L 1 l l I 1 I
40.1 0 0.1 0.2 0.3 0.4 0.5 0.6
Normalized Pairwise Similarity
The Distribution of Normalized Pairwise Similarities using Asymmetric Cosine Measure

1 I I I l I l I l

 

 

 

 

0.4 .
0.2 «1
0' 7 A 1 1 I l l L )

-o.1 0 0.1 0.2 0.3 0.4 0.7 0.8 0.9

0.5 0.6
Normalized Pairwise Similarity

 

The Distribution of Normalized Pairwise Similarities in Undirected Graph

1 l l 3 l I l l l

0 I I 1 J; 1 _L #4 l
-0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Normalized Pairwise Similarities

(a) The Distribution of Pairwise Similarities in 20-newsgroup Dataset.

The Distribution of Two Documents Belonging to One Category
w.r.t Normalized Pairwise Similarity in ZO-newsgroup Dataset

 

1

 

.2

(B

a. + DP-KL

g 0-8’ ~+— DP—Con
E b —e— Undirected
3 o

o 8’ 0.6 .

o ‘65 a
"5 0

a) (D

35 O.

5 .5

8

a: 0.2

d)

.C

I'-

 

 

 

 

 

0.1 0.2 0.3 0.4
Normalized Pairwise Similarity

(b) The Distribution of TWO Documents Belonging to One Category with respect
to Normalized Pairwise Similarity in 20-newsgmup Dataset.

Figure 5.4: Similarity Mtrix Analysis of 20-newsgroup Dataset

The Distribution of Normalized Pairwise Similarities using KL Divergence

0-1 l l l l I l l l l

 

30.05 T T

o

 

 

 

O 1 l 1 1 l 1
0 0.1 0.2 0.3 0.4, 0,5 . 0.6 , 0.8 0.9 1
Norrnalrzed Parrwrse Similarity

The Distribution of Normalized Pairwise Similarities using Asymmetric Cosine Measure
0.2 I I T l I l l l l

 

0.1 :

 

 

0 A l l l l l l l

0 0’1 0'2 0'3 Norrrcrla‘lized Pgirswise Similarity 0'7 0'8 0'9 1

The Distribution of Normalized Pairwise Similarities in Undirected Graph

0.5 — r r a r r r r r r

0L1 1 1 1 1 1 l 1 1

0 0.1 0.2 0.3 0. . 0.5 . 0. . 0.7 0.8 0.9 1
Normalized Parrwrse Similarity

 

The Probability of Normalized Pairwise Similar

 

 

 

 

 

(a) The Distribution of Pairwise Similarities in Reuters-215 78 R8 Dataset.

The Distribution of Two Documents Belonging to One category
w.r.t Normalized Pairwise Similarity in Reuters-R8 dataset

I'\

       

1 L (\mrxnx A.
J r f T W

0.8 :

I 5? .
, W’eﬂl} ‘9
{fa '

 

 

 

 

 

 

 

if:

Q.

.9

C

(D

5 ?_~

S o 06» -8 -

'o #8..) ' (’3 l

“5 8 If /

a)

34’s 0.4r Peat. / 1

‘E 5 {a} f .

8 s:

._ / f + DP-KL

§ 0.2 34 . ' =4— DP-Con

E 395/ ”WWW -—e— Undirected
07 semiﬁw . 1 .
0 0.1 0.2 0.3 0.4 0.5

Normalized Pairwise Similarity

(b) The Distribution of Two Documents Belonging to One Category with respect
to Normalized Pairwise Similarity in Reuters-215 78 R8 Dataset.

Figure 5.5: Similarity Matrix Analysis of Reuters-215 78 R8 Dataset

108

Distribution of Similarities of Doc Pairs in One Category

and Doc Pairs not in One category using KL Divergence
I I I I I I I I

     
   

 

 

~+— Doc Pairs in One Category
6— Doc Pairs not in One Category 7

 

 

 

 

Distribution of Similarities of Doc Pairs in One Category
000 Pairs not in One Category using Asymmetric Cosine Measure

 

1 r r I ﬂ 1 r r r r

 

 

The Probability of Similarity
o
01
l

 

Distribution of Similarities of Doc Pairs in One Category
and Doc Pairs not in One Category in Undirected Graph

1 r r r T I I r I F I

 

 

 

0 0.1 "'02 I 073 ’ “0.4: 0,5, 0.6 _
Normalized Parrwrse Srmrlanty

Figure 5.6: The Distribution of Similarities of Document Pairs Belonging to the Same
Category and Document Pairs not Belonging to the Same Category in MovieLens Dataset.

109

The Distribution of Similarities of 000 Pairs in One Category
and Doc Pairs not in One Category using KL Divergence

 

 

 

 

 

 

0.4.: I I I I I I I I I

i “is
l. . .

l I \ ~9— Doc Pairs not In One Category

0 2 —‘ 21* . . :
117' \1 —+— Doc Pairs In One Category
I}; ‘.
1h], N

0"! 1?“ -‘--‘-“ :1:::1:‘:‘..:::.:.::::=':=‘::. .. 1:" ‘33

o 0.05 0.1 0715 0.2_ 0.25. 0.3 _ “0.35 0.4 135 0.5
Normalized Parrwrse Similarity

 

 

 

 

 

g The Distribution of Similarities of Doc Pairs in One Category
__g and Doc Pairs not in One Category using Asymmetric Cosine Similarity
E I I I I I I I I I
'03 0.4 F :
“6 1
a III
'5 0'2 Ill 3
10 I
.0 . .
e | .
n- 0 '7‘ ‘0! II! I :l,‘ Q‘.‘J \"
g 0 0.05 0 1 0 0.2, 0.25. 0.3 , 0.35 0.4 0.45 0.5
,_ Normalized Parrwrse Similarity
The Distribution of Similarities of Doc Pairs in One Category
and Doc Pairs not in One Category in Undirected Graph
0.4 m I I m I I I I I
0.2:;- —

  

 

 

J A A A r
-‘ v "‘9 v _ . .. :L l l I
I ' "" ’ 3. ' l
1 . I 1 ' ’ ' . ‘ l I nnrrnrnnxrrxrirrfxr+¥ﬁ+t#%xf ”‘AAAJ‘A‘A‘ A‘AAM
' V

O vvxlvxl\IVVVVVVUKJUWCTUVVUCJ‘UQ/VTHUG}
0 0.05 0.1 0.15

0.35 0. 0.4 0.5

 

0.2 0.25, 0.3 ,
Normalized Parrwrse Srmrlanty

Figure 5.7: The Distribution of Similarities of Document Pairs Belonging to the Same Cat-
egory and Document Pairs not Belonging to the Same Category in 20-newsgroup dataset.

110

Distribution of Similarities for Doc Pairs in One Category
and Doc Pairs not in One Category using KL Divergence
I I I I I I I

    
   

 

 

+3e Doc pairs not in the same category
—4— Doc pairs in the same category

 

 

 

 

. A 0.4_ 0,5_ 0.6 ,
Normalized Parrwrse Similarity

Distribution of Similarities for Doc Pairs in One Category
and Doc Pairs not in One Category using Asymmetric Cosine Measure

0.2 I I I I I I T F I

 

 

«Pr Doc pairs not in the same category
—1— Doc pairs in the same category

 

 

 

 

 

 

Probability of Similarity
O

 

04. 0.5 . .0-6 .
Normalized Parrwrse SImIIanty

Distribution of Similarities for Doc Pairs in One Category
and Doc Pairs not in One Category in Undirected Graph

      
  
 
   

 

I l l l j 1 I T

 

«e Doc pairs not in the same category
—1— Doc pairs in the same category

 

 

 

 

 

3 Normalized0 Pairwise Similarity 0 7
Figure 5.8: The Distribution of Similarities of Document Pairs Belonging to the Same

Category and Document Pairs not Belonging to the Same Category in Reuters-21578 R8
dataset.

lll

rithm simply depends on the similarity matrix. Thus, in order to understand the reason
behind it, we study the similarity matrix computed in DP-KL, DP—Con and SGT. DP-KL
and DP-Con generate their similarity matrices based on KL divergence and asymmetric
cosine similarity respectively. For SGT, the sirrrilarity between two documents is computed

by using standard cosine similarity (symmetric).

First, we claim that the correlation between pairwise similarity and the probability of
two document belonging to the same category is enhanced in directed graphs using KL
divergence based similarity and asymmetric cosine similarity. Figure 5.3(b), 5.4(b) and
5.5(b) present the distribution of two documents belonging to the same category with re-
spect to the normalized pairwise similarity in MovieLens dataset, 20-newsgroup dataset
and Reuters-21 5 78 R8 dataset. In three ﬁgures, the probability of two documents belong-
ing to the same category increases signiﬁcantly at some high similarity point for directed
graph while in undirected graph, the probability does not have a sharp change and instead
increases gradually as the similarity get larger. This shows that it is more likely for KL
divergence based similarity and asymmetric cosine similarity to assign higher similarity

values to document pairs belongs to the same category.

To further understand this point, we plot the similarity distribution of all document
pairs belonging to the same category and all document pairs not belonging to the same
category for three datasets in 5.6, 5.7 and 5.8. For Reuters-21578 R8 dataset (Figure 5.8),
two kinds of sirrrilarities are clearly distributed in two modes for directed graphs. That is,
the document pairs belonging to the same category are more likely to have higher pairwise
similarities and document pairs not belonging to the same category are more likely to have
lower pairwise similarities. For 20-newsgroup (Figure 5.7), this trend can be seen although
not as clear as in Reuters-21 5 78 R8 dataset. This may be the reason why propagation us-
ing directed graphs signiﬁcantly outperforms using undirected graphs in Reuters-215 78 R8
dataset while not in 20-newsgroup dataset. We also plot the same distribution for Movie-

Lens dataset in Figure 5.6, but two distributions are not separated very obviously. This may

112

explain why the performance improvement from directed graphs for MovieLens dataset is
not very signiﬁcant. Our assumptions for using directed graphs are based on the analy-
sis of document length. These assumptions may not work for all data sets because of the
representation of the data, accuracy and some other reasons. Accordingly the similarity
measures based on the assumptions will not be able to accurately express the relationship
among data points. Thus the performance gained from using directed graphs was not that

obvious compared to undirected graphs.

From the above analysis, we gain some insight into why propagation using directed
graphs can outperform propagation using undirected graphs in some scenarios. Directed
graphs based two proposed asymmetric measure can better explore the manifold structure
of the dataset and reﬂect the impact of document length, term frequency distribution and
other factors on classiﬁcation in some cases as in Renter-R8 dataset and 20-newsgroup
dataset. Thus the performance was improved by using directed graphs. However, in
MovieLens dataset, since the directed graphs didn’t capture the appropriate properties of
the dataset, the performance was not signiﬁcantly improved though this may be solved by
using some other different similarity measure. In summary, directed graphs constructed
from plausible assumptions will deﬁnitely improve the performance of label propagation

and are worthwhile studying.

The sensitivity of asymmetric similarity measures We proposed two asymmetric sim-
ilarity measures, namely KL divergence similarity and asymmetric cosine similarity. Since
there is no parameter tuning in asymmetric cosine similarity, we leave it out of our discus-
sion. KL divergence similarity uses RBF kernel (111(1' H j) = exp (—AD(i||j)), see section
5.2) to convert the distance into similarity and a scaling factor A is involved in the process.

We brieﬂy discuss how the parameter A affects the performance.
Table 5.4 shows the F1 scores of DP—KL on 20-newsgroup dataset with different A

values. A = 10 has the best performance. The result for A = 5 is much worse than A = 10.

113

For the values greater than 10, as A is increasing , the performance is decreasing in most

cases. This is consistent with the meaning of A which is the scaling factor of distance. Too

small or too big scaling values will both result in the inaccurate similarity.

 

Number of Training examples

 

 

 

 

 

 

 

 

 

 

 

 

Cat. A = 10 20 30 40
5 0.3971 (0.0613) 0.4431 (0.0813) 0.4996 (0.0662) 0.5055 (0.0641)
10 0.5845 (0.0197) 0.6059 (0.0688) 0.6108 (0.0570) 0.6226 (0.0762)
1 20 0.4221 (0.0773) 0.4436 (0.0809) 0.4781 (0.0688) 0.5178 (0.0577)
30 0.4097 (0.0952) 0.4568 (0.0817) 0.4849 (0.0612) 0.5001 (0.0837)
50 0.3807 (0.0848) 0.4295 (0.0710) 0.4870 (0.0730) 0.5102 (0.0708)
5 0.4371 (0.1013) 0.4695 (0.0987) 0.5253 (0.0895) 0.4719 (0.0831)
10 0.6287 (0.0414) 0.6515 (0.0551) 0.6631 (0.0517) 0.6636 (0.0471)
2 20 0.5391 (0.0089) 0.5862 (0.0872) 0.5868 (0.0854) 0.5937 (0.0773)
30 0.4433 (0.1296) 0.4378 (0.0731) 0.4777 (0.0981) 0.4947 (0.0804)
50 0.4040 (0.1232) 0.4759 (0.1137) 0.4955 (0.0936) 0.5283 (0.0880)
5 0.4349 (0.0469) 0.5364 (0.1061) 0.5659 (0.0855) 0.5709 (0.0892)
10 0.5620 (0.0523) 0.5939 (0.0440) 0.6290 (0.0574) 0.6257 (0.0691)
3 20 0.5049 (0.0305) 0.5257 (0.1024) 0.5571 (0.0869) 0.5757 (0.0772)
30 0.3879 (0.1210) 0.5326 (0.0836) 0.5570 (0.0885) 0.6191 (0.0656)
50 0.4984 (0.0504) 0.5231 (0.0901) 0.5588 (0.0721) 0.5731 (0.0795)
5 0.4159 (0.0782) 0.4667 (0.0761) 0.5134 (0.0719) 0.5052 (0.0733)
10 0.5674 (0.0458) 0.5709 (0.0656) 0.5954 (0.0643) 0.6072 (0.0601)
4 20 0.4019 (0.0975) 0.4918 (0.0838) 0.5127 (0.0698) 0.5214 (0.0762)
30 0.3749 (0.0756) 0.4574 (0.0831) 0.5001 (0.0791) 0.5118 (0.0732)
50 0.4086 (0.0689) 0.4758 (0.0761) 0.4954 (0.0643) 0.5062 (0.0709)
5 0.3491 (0.0475) 0.3532 (0.0530) 0.3830 (0.0587) 0.4049 (0.0540)
10 0.4766 (0.0531) 0.4866 (0.0350) 0.4981 (0.0454) 0.5006 (0.0473)
5 20 0.4274 (0.0665) 0.4563 (0.0763) 0.4812 (0.0407) 0.4815 (0.0496)
30 0.3130 (0.0687) 0.3696 (0.0645) 0.3877 (0.0445) 0.4001 (0.0572)
50 0.3250 (0.0629) 0.3556 (0.0600) 0.3908 (0.0389) 0.3941 (0.0608)

 

114

Table 5.4: The F1 Results for 20—newsgroup Dataset with Different A Values.

5.4 Case Study 11: Multi-label Classiﬁcation

We extent the idea of Propagation over directed graphs to the multi-label classiﬁcation task
on the MovieLens dataset. The goal of the experiments is to give a more comprehensive
view of the effectiveness of propagation over directed graphs.

We conduct the experiment of propagation using KL divergence based similarity mea-
sure as discussed above. We follow the propagation scheme proposed in [53]. For each
category, the propagation will generate the ﬁnal score for each document. We view those
scores as conﬁdence scores of one document belonging to one category. Thus, by ranking

the scores of all categories for each document, we can make predictions accordingly.

5.4.1 Experiment Setup

We again use MovieLens dataset3 as our testbed. The dataset provides the movie category
information. There are originally 1682 movies and 19 categories. The average number
of movies for each category is around 152 and the average number of categories for each
movie is 2.

We also downloaded the movie keyword. information. After we removed the movies
with no keywords, the resulting movie keyword ﬁle contains 1651. movies with 11107
keywords. We remove the corresponding movies from the movie category information ﬁle
and the resulting movie category ﬁle also contains the same 1651 movies. The average
number of movies per category is around 1 and the average number of categories per movie
is around 4.

We refer to the propagation approach on the directed graph as Directed Propagation
(DP) and compare it with label propagation approach in [53]. We refer to the label prop-
agation approach as LP. According to the ﬁgure 5.1, directed propagation needs the input

5” which is the asymmetric movie similarity matrix. We compute the S” from movie key-

 

3http://www.grouplens.org/

115

words information using K-L divergence in Equation (5.6). LP algorithm has also the input
SD which is the symmetric movie similarity matrix and we compute it from movie ratings
information using cosine similarity.

We use F1 measure as our evaluation metrics since it is a more appropriate measure for
evaluating both precision and recall. We refer to the average F 1 measure per category as
“macro-average” Fl measure and the average F1 measure per movie as “micro—average”
Fl measure. Since both DP and LP algorithms provide a ranking of categories for each
movie, we compute the F1 measure at different ranks. For macro—average F1 measure,
precision and recall for each category are computed; the average precision and recall across
all categories are taken for computing macro-average F1 measure. For micro-average F1
measure, precision and recall for each movie are computed; the average across all movies
are taken for computing micro-average F1 measure. All experiments are conducted 10

times and the average across 10 trials are used as the ﬁnal results.

5.4.2 Results and Analysis

Figure 5.9(a) and 5.9(b) shows the performance of DP and LP algorithms for 40% and 20%
training data respectively. For micro-average F 1 measure, two algorithms performs almost
the same. For macro-average F1 measure, DP algorithm performs signiﬁcantly better than
LP algorithm. This may be explained by the way of constructing the directed graph. Since
we compute the asymmetric weight matrix from movie keyword information, the movies
from the same categories may have larger similarity because they share more keywords.
More in-depth research has to be done in order to have a better understanding. We leave it

to the future work.

116

Micro—average F1 measure

 

 

 

0.45
0.4. 1
I —e—; DP
. , LP
”0.3le .
93
o l
0
c’30.:5it
EC !
0.25L
{9‘s}
1 r L r x A A r .
0 2 4 6 8101214161820

Rank of Categories

Macro-average F1 measure

 

 

 

 

(a) 40% training data.

Micro-average F1 measure

 

 

 

 

 

0.45[ . . 4 . - . a
I
l
L —+— LP 4
j .
moas‘ 1
e .
8
<0 0.3 .1
LT. .
0.25‘ BIKE .
Q
0.2L
0 2 4 6 8 1o 12 14 16 18 20

Rank of Categories

 

 

 

 

0.26 . .
0.24 " P‘E‘rx J
7{ \_ Gm
0.22 » . X .
l/ .9-&
0.2 ~ ' A
‘é i ,7” .
8 o 18 \ _ ~
[I 0.16% -
0.14 .
0.12 -
0.1 .
0.08 A . A
0 5 10 15 20
Rank of Categories
Macro-average F1 measure
0.26 - .
"Q
0.24 _ .
75 i: %
0.22- /+/ \ Rx «
/ \ ‘Is
0-2 ” 4/ M G)\®\ 1
s a X a .
o 0.18.» / \_1 a.
I 0.16 . i
0.14
0.12 L
0.1
0.08 A A L j
o 5 10 15 20

Rank of Categories

(b) 20% training data.

Figure 5.9: F 1 Measure at Different Ranks for MovieLens Dataset in Multi-label Classiﬁ-

cation Task.

117

Chapter 6

Conclusion and Future Work

This dissertation presents a comprehensive study of label propagation with its applications.
The related research works of label propagation has been discussed in detail. Several prop-
agation schemes, namely, relation propagation, rank propagation and propagation on di-
rected graph are proposed in order to solve the existing problems in label propagation. The
applications of the proposed schemes are presented with the empirical studies, which has
shown that the proposed label propagation methods can achieve better performance than
standard label propagation approaches in many problems. This chapter will start with sum-
marizing the proposed work in this thesis followed by the discussion of some interesting

directions in future work.

6.1 Summary

Label propagation is an effective approach to semi-supervised learning. The main idea be—
hind label propagation is to ﬁrst construct a graph by denoting each data example as a node
and connecting each pairs of nodes with an edge assigned with a weight (usually similarity
between corresponding data examples), then propagate the labels of known data examples
and ﬁnally make the prediction according to the propagation scores. Empirical study has

shown it’s an effective approach in a lot of applications. Although label propagation is a

118

very promising approach, it still has limitations and disadvantages in many scenarios. We
discussed some of its limitations and proposed several new propagation schemes accord-

ingly.

Relation Propagation We proposed the generalized framework of Relation Propagation
and discussed its applications including multi-label classiﬁcation task and collaborative
ﬁltering task. In most previous work in label propagation, propagation is conducted among
one single type of objects. However, real applications usually involve multiple types of
objects. Propagation over only one type of objects will usually miss the correlations among
different types of objects and we believe that the correlations among multiple types of
objects will help in the propagation process. Based on this assumption, we propose the
framework of relation propagation which propagates the relations among multiple types of
objects instead of propagating the labels directly among one type of objects.

Consider the case of two types of objects. We construct the graph in which each node
in the graph represents a object pair with one object from each type. Each edge connect-
ing two nodes in the graph is assigned with a user-deﬁned weight. We proposed to use
the direct product of two similarity matrices respectively from two types of objects. The
relations among two types of objects should be deﬁned according to different situations.
For examples, in multi-label classiﬁcation task, two types of objects are documents and
categories and the relation is the membership of documents in categories. Then by propa-
gating the relations over the constructed graph, we achieve for each unlabeled example the
conﬁdence scores of this example belonging to all categories. The empirical study showed
that the relation propagation is a more effective approach in some cases and it proved our

assumption that it is helpful to utilize the correlations among multiple types of objects.

Rank Propagation It is necessary to study the label propagation approaches on ranked
data due to the limitations of label propagation. First, most label propagation models prop-

agate directly the class labels. The problem may arise because the ordering information

119

among the numerical values representing the class labels is incorrectly introduced to the
propagation process. One way to avoid this issue is to cast the problem as a ranking prob-
lem in which the label information is converted to pairwise preferences over classes for
each training example and these preferences are propagated. Second, most previous re-
search works on label propagation require the labels of known data for training. However,
the label information is not always available. Instead, in a lot of ranking applications
including web page searching and collaborative ﬁltering, usually only the ordering infor-
mation is provided. Previous label propagation models are often not appropriate in these

C3868.

To address the challenges, a Rank Propagation scheme for multi-label categorization
scenario is proposed. Instead of propagating the labels of training examples, rank propaga-
tion actually propagates the category preference information from all labeled data. Speciﬁ-
cally, given the labels of all training examples, a preference matrix over all category pairs is
built for each training example. Then, for each testing example, an optimization problem is
constructed which minimizes the weighted differences between the preference matrix to be
computed for the testing example and the preference matrix of each training example. As
a result, for each unlabeled example, a preference matrix is computed and a ranking list of
all categories is generated by computing the principle eigenvector of the preference matrix.
Two loss functions, namely trace-based loss function and a more general loss function, are
designed for measuring the difference between two preference matrices. Empirical study

with multiple datasets has shown that rank propagation scheme is an effective approach.

Propagation over Directed Graphs Most previous research on label propagation was
conducted on undirected graphs. For propagation over undirected graphs, the similarity
matrix is symmetric. But in many cases, relationship among objects can be asymmetric
and is more appropriate to be expressed by directed graphs which result in asymmetric

matrices. For instance, web pages connected with hyperlinks are better presented. with a

120

directed graph.

The framework of Propagation over Directed Graphs is proposed to utilize the directed
graphs. In order to construct directed graphs, two asymmetric similarity measures are de-
signed: KL divergence-based similarity and Asymmetric cosine similarity. The directed
graphs are converted into undirected graphs and the propagation is conducted on the con-
verted undirected graphs by using a standard label propagation scheme. Spectral Graph
Transducer is used in the pr0posed approach because of its effectiveness. The empirical
study with several widely used datasets has shown the advantage of using directed graphs

over some state-of-art semi-supervised techniques.

6.2 Future Work

The detailed description of the existing and proposed label propagation approaches is given
with the empirical studies which have shown that label propagation can be effectively used
in many applications. However, some problems need to be further studied.

Generally speaking, all label propagation approaches involve the graph which is asso-
ciated with the similarity matrix and the propagation method which is usually determined
by an optimization problem. How to deﬁne the appropriate similarity measure is always an
important issue for label propagation. A number of research works have been focused on
learning the best similarity matrix for different applications.

Besides computing the similarity matrix, there are also many problems involved in the
propagation schemes of different models. In relation propagation framework, consider
the multi-label classiﬁcation case. Each node in the graph constructed from the data is a
document-category pair. The problem arises when the number of documents and categories
are large, which is usually the case in real world scenario. Although the approximation
is proposed to alleviate the problem, how to scale the algorithm remains to be an issue.

For rank propagation, the optimization problem is designed based on the loss function.

121

The deﬁnition of the loss function can have a signiﬁcant inﬂuence on the performance as
showed in the empirical study. In propagation over directed graphs, how to construct the
directed graph can affect the performance heavily. Two asymmetric measures to construct
directed graphs are proposed based on our assumptions. However, as the empirical study
showed, if the assumptions are not consistent with the nature of the application, the directed
graphs may not work well as expected. This problem can also be viewed as a similarity

matrix learning problem.

122

BIBLIOGRAPHY

123

Bibliography

[1] B. T. Bartel], Cottrell G.W., and R.K. Belew. Learning the optimal parameters in
a ranked retrieval system using multi-query relevance feedback. In Proc. Symp. on
document Analysis and Information Retrieval, Las Vegas, April 1994.

[2] Brian Bartel], Garrison W. Cottrell, and Rik Belew. Learning to retrieve informa-
tion. In Current trends in connectionism: Proceedings of the Swedish Conference on
Connectionism, LEA: Hillsdale, 1995.

[3] Mikhail Belkin and Partha Niyogi. Using manifold snucture for partially labeled
classiﬁcation. Advances in Neural Information Processing Systems, 2002.

[4] A. Blum and S. Chawla. Leaming from labeled and unlabeled data using graph min-
cuts. In Proc. 18th International conf. on Machine Learning, 2001.

[5] Avrim Blum, John Lafferty, Mugizi Robert Rwebangira, and Rajashekar Reddy.
Semi-supervised learning using randomized mincuts. In Proc. of ICML’04, page 13,
New York, NY, USA, 2004. ACM Press.

[6] C. Boutilier, R. Zemel, and B. Marlin. Active collaborative ﬁltering. In Proc. UAI,
2003.

[7] J. S. Breese, D. Heckerrnan, and C. Kadie. Empirical analysis of predictive algorithms
for collaborative ﬁltering. In Proceedings of the 14th Conference on Uncertainty in
Artiﬁcial Intelligence, pages 43—52, 1998.

[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine.
Computer Networks and ISDN Systems, 30(1—7)2107—1 17, 1998.

[9] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. Technical
report, University College London, 2004.

[10] W. Chu and Z. Ghahramani. Preference learning with gaussian processes. In
Proc. lCML, 2005.

[l 1] William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things.
Journal of Artiﬁcial Intelligence Research, 10:243—270, 1999.

[12] K. Crammer and Y. Singer. Pranking with ranking. In Proceedings of the conference
on Neural Information Processing Systems(NIPS), 2001.

124

[13] Arthur Gretton Olivier Bousquet Dengyong Zhou, Jason Weston and Bernard
Scholkopf. Ranking on data manifolds. Advances in Neural Information Process-
ing Systems, (16), 2004.

[14] P. Doyle and J. Snell. Random walks and electric networks. Mathematical Assoc. of
America, 1984.

[15] Eibe Frank and Mark Hall. A simple approach to ordinal classiﬁcation. In Proceed-
ings of the European Conference on Machine Learning, pages 145 — 165, 2001.

[16] Y. Freund, R. lyer, R. E. Schapire, and Y. Singer. An efﬁcient boosting algorithm
for combining preferences. In Proc. of the Fifteenth Intl. Conf. of Machine Learning,
pages 170—178, 1998.

[17] Rayid Ghani, Sean Slattery, and Yiming Yang. Hypertext categorization using hy-
perlink patterns and meta data. In Carla Brodley and Andrea Danyluk, editors, Pro-
ceedings of ICML-OI, 18th International Conference on Machine Learning, pages
178—185, "Williams College, US”, 2001. Morgan Kaufmann Publishers, San Fran-
cisco, US.

[18] R. Herbrich, T. Graepel, and K. Oberrnayer. Large margin rank boundaries for ordinal
regression. Advances in Large Margin Classiﬁers, MIT Press, 2000.

[19] T. Hofmann and J. Puzicha. Latent class models for collaborative ﬁltering. In Pro-
ceedings of International Joint Conference on Artiﬁcial Intelligence, 1997.

[20] T. Joachims. Making large-scale svm learning practical. Advances in Kernel Methods
- Support Vector Learning, 1999.

[21] Thorsten Joachims. Text categorization with support vector machines: learning with
many relevant features. In Proceedings of ECML-98, 10th European Conference on
Machine Learning, pages 137—142, 1998.

[22] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceed-
ings of the eighth ACM SIGKDD international conference on Knowledge discovery
and data mining, 2002.

[23] Thorsten Joachims. Transductive learning via spectral graph partitioning. In Proceed-
ings of the International Conference on Machine Learning (ICML), 2003.

[24] Thorsten Joachims. Transductive learning via spectral graph partitioning. In Proc.
ICML’03. 2003.

[25] Klaus Brinker Kbrinker. Active learning of label ranking functions. In Proc. 21st
International Conf on Machine Learning, 2004.

[26] M. Kendall. Charles Grifﬁn and Company Limited, 1948.

[27] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Joumal of the
ACM, 46(5):604—632, 1999.

125

[28] R. l. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input
spaces. In Proc. 19th International Conf. on Machine Learning, 2002.

[29] Oren Kurland and Lillian Lee. Pagerank without hyperlinks: Structural re-ranking
using links induced by language models. In Proc. of the 28th annual international

ACM SIGIR conference on Research and development in information retrieval, pages
306—313, 2005.

[30] Neil D. Lawrence and Michael I. Jordan. Semi-supervised learning via gaussian pro-
cesses. Advances in Neural Information Processing Systems, (l7):753—760, 2005.

[31] G. Lebanon and J. Lafferty. Conditional models on the ranking poset. Advances in
Neural Information Processing Systems, 15, 2003.

[32] Guy Lebanon and John Lafferty. Crankingzcombining rankings using conditional
probability models on permutations. In International Conference on Machine Leam-
ing, 2002.

[33] Tao Li, Chengliang Zhang, and Shenghuo Zhu. Empirical studies on mult-label clas-
siﬁcation.

[34] M. OConnor and J. Herlocker. Clustering items for collaborative ﬁltering. In Pro-
ceedings of SIGIR-2001 Workshop on. Recommender Systems, 2001.

[35] Foster J. Provost and Tom Fawcett. Robust classiﬁcation for imprecise environments.
In AAAI/IAAI,, pages 706—713, 1998.

[36] J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative
prediction. In Proc. ICML, 2005.

[37] S. E. Robertson and S. Walker. Okapi/keenbow at tree-8. In Proc. TREC-8, 1999.

[38] B. Pfahringer S. Kramer, G. Widmer and M. DeGroeve. Prediction of ordinal classes
using regression trees. F undamenta Informaticae, 47:1 — 13, 2001.

[39] G. Salton. The SMART Retrieval System: Experiments in Automatic Document Pro-
cessing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1971.

[40] B. Sarwar, G. Karypis, and J. Konstan. Incremental singular value decomposition
algorithms for highly scalable recommender systems. In 5th International Conference
on Computer and Information Technology, 2002.

[41] M. Seeger. Learning with labeled and unlabeled data. Technical report, University of
Edinburgh, 2001.

[42] A. Shashua and A. Levin. Taxonomy of large margin principle algorithms for ordinal

regression problems. Technical Report 39, Leibniz Center for Research, School of
Computer Science and Eng, the Hebrew University of Jerusalem, 2002.

126

[43] Amnon Shashua and Anat Levin. Ranking with large margin principle: Two ap-
proaches. In Proceedings of the 14th conference on Neural Infomation Processing
System(NIPS), 2003.

[44] N. Srebro, J. D. Rennie. and T. S. Jaakkola. Maximum-margin matrix factorization.
In Proc. NIPS, 2005.

[45] M. Szurmner and T. Jaakkola. Partially labeled classiﬁcation with markov random
walks. Advances in Neural Information Processing Systems, 14, 2001.

[46] Y Weiss and W. T. Freeman. Correctness of belief propagation in gaussian graphical
models of arbitrary topology. Neural Computation, (13):2173 — 2200, 2001.

[47] C. Williams. Computation with inﬁnite neural networks. Neural Computation, 10(5),
1998.

[48] G.-R. Xue, H.-J. Zeng, Z. Chen, W.-Y. Ma, H.-J. Zhang, and OJ. Lu. Implicit link
analysis for small web search. In Proc. SIGIR ’03, pages 56—63, 2003.

[49] K. Yu, A. Schwaighofer, V. Tresp, W. Ma, and H. J. Zhang. Collaborative ensemble
learning: combining collaborative and content-based information ﬁltering via hierar-
chical bayes. In Proc. UAI, 2003.

[50] D. Zhou, B. Scholkopf, and. T. Hofmann. Semi-supervised learning on directed
graphs. In Proc. NIPS, 2005.

[51] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard
Schdlkopf. learning with local and global consistency. Advances in Neural Informa-
tion Processing Systems, 16, 2004. MIT PRess, Cambridge.

[52] Dengyong Zhou, Jiayuan Huang, and Bernhard Scholkopf. Learning from labeled
and unlabeled data on a directed graph. In ICML’05: Proceedings of the 22nd inter-
national conference on Machine Learning, pages 1036—1043, New York, NY, USA,
2005. ACM press.

[53] S. Zhu, X. Ji, W. Xu, and Y. Gong. Multi-labeled classiﬁcation using maximum
entropy method. In Proc. ACM SIGIR, 2005.

[54] Xiaojin Zhu. Semi-supervised learning learning literature survey. technical report
1530, University of Wisconsin-Madison, 2005.

[55] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530,
Computer Science, University of Wisconsin-Madison, 2005.

[56] Xiaojin Zhu and John Lafferty. Harmonic mixtures: combining mixture models and
graph-based methods for inductive and scalable semi-supervised learning. In Pro-
ceedings of the 22nd international conference on Machine leaming, pages 1052-
1059, New York, NY, USA, 2005. ACM Press.

127

[57] Z.; Zhu, X.; Ghahramani and J. D. Lafferty. Semi—supervised learning using Gaussian
ﬁelds and harmonic functions. In Proc. ICML, 2003.

128

   

ll1111111]le1111]le[[11le1l