7 _ Il'
Eufiz

.ﬁurﬁu

 

 

 

 

 

 

\r. I .
a-vdwve www.-
.uu.... .w 4%
3?“... m

«#5. $58.

ll
0
.

2:333:

>...voa

 

 

 

 

 

i

LIBRARY
gan State
University

ch

A" Mi

This is to certify that the
dissertation entitled

Semi-supervised Learning with Side Information:
Graph-based Approaches

presented by

Yi Liu

has been accepted towards fulﬁllment
of the requirements for the

Ph.D. degree in Computer Science and Engin_eering

 

 

V Major Professor’s Signature

léi/o 612w] .

Date

MSU is an afﬁnnative-action, equal-opportunity employer

on-----I-o--n-n--|-.-s-n-n-n-u-n-n-u-u-o-o-u-oCo-o-o---n-o-n-o-u-u--v------u-u. - - -

 

PLACE IN RETURN BOX to remove this checkout from your record.
To AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE

DATE DUE

DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6/07 p:/CIRClDateDue.indd-p.1

SEMI-SUPERVISED LEARNING WITH SIDE
INFORMATION: GRAPH-BASED APPROACHES

By

Yi Liu

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the-requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2007

ABSTRACT

SEMI-SUPERVISED LEARNING WITH SIDE
INFORMATION: GRAPH-BASED APPROACHES

By
Yi Liu

In many real-world learning tasks, data examples are known not only by their input
patterns, but also in other forms that are often refered to as “side information”. Side
information provides additional knowledge about the data, leaving hope for machine
learning algorithms to gain more insight into the structure of data and thus perform
better. However, the usual incomplete, sparse, and noisy nature of side information
also poses challenges. This dissertation will present research work in semi—supervised
learning with side information.

Semi-supervised learning uses both labeled and unlabeled data in training. In this
work, we follow the graph-based approaches that are able to explore the underlying
structure of data by constructing a graph over data examples. Taking such an ad-
vantage of graph representation for semi-supervised learning, we propose to construct
a graph over all labeled and unlabeled data and further incorporate side informa-
tion into the graph. To effectively utilize side information against its incompleteness,
sparseness, and noise, this work adopts a common theme in formalizing graph-based
learning models, i.e., enforcing consistency over the graph. More speciﬁcally, consis-
tency is maximized among data input patterns, supervised information (if any), side
information, and predictions on the unlabeled data. Optimization approaches are
taken to carry out the consistency enforcement, with objective functions that collect
consistency measurement deﬁned on all possible data pairs.

Solutions to three generic learning tasks are presented to illustrate the proposed

method of utilizing side information in a semi-supervised learning setting. Speciﬁcally,

we study multi-label learning with class correlation information, to meet the challenge
of a large number of classes and a small training set. To improve classification with
link information, we propose a boostng framework that is able to improve the clas-
siﬁcation accuracy of any given supervised classiﬁers. Another boosting framework
is also designed to boost any given clustering algorithm, with the help of pairwise
constraints over the data.

In addition, two applications in the area of information retrieval will be discussed.
In the ﬁrst application, we develop a maximum coherence framework to tackle the
difficulty of query translation disambiguation in cross-language information retrieval,
with a bilingual dictionary as side information. The proposed framework will also
be explained as two-way graph partitioning. The second application is automatic
extraction of question-answer pairs from Web FAQs, where the side information comes
from human knowledge on the presentation regularity on Web FAQ pages. Correlated
label propagations over a graph constructed for each FAQ page is shown to be an
interpretation of the corresponding model.

All the semi-supervised learning models proposed for the tasks and applications
demonstrate the effectiveness of the consistency enforcement theme in exploiting side
information for semi-supervised learning. Analysis shows that the proposed models
are robust against the incompleteness, sparseness, and noise of side information, and
retain the power of utilizing both labeled and unlabeled data for training as semi-

supervised learning.

TABLE OF CONTENTS

LIST OF TABLES vii
LIST OF FIGURES viii
1 Introduction 1
1.1 Graph-based Semi-supervised Learning ................. 2
1.1.1 Mathematical Definition of Graph ................ 2

1.1.2 Graph-based Semi-supervised Learning Models ......... 4

1.2 Side Information ............................. 11
1.2.1 Motivation on Introducing Side Information .......... 11

1.2.2 Side Information for Semi—supervised Learning ......... 15

1.3 Graph-based Semi-supervised Learning with Side Information . . . . 18
1.4 Overview on Thesis Work ........................ 20

2 Semi-supervised Multi-label Learning with Class Correlations 23
2.1 Introduction ................................ 23
2.2 Related Work ............................... 25
2.2.1 Related Work in Multi-label Learning .............. 25

2.2.2 Representative Algorithms .................... 27

2.3 Semi-Supervised Multi-label Learning by Constrained NMF ..... 31
2.3.1 A Hamework for Multi-label Learning ............. 32

2.4 Solving the Constrained NMF ...................... 34
2.5 Experiments and Discussions ....................... 36
2.5.1 Experiment Setup ......................... 37

2.5.2 Experiment Results ........................ 38

2.6 Conclusions ................................ 43

3 Semi-supervised Classiﬁcation with Link Information 44
3.1 Introduction ................................ 44
3.2 Related Work ............................... 47
3.2.1 Link-based Classiﬁcation ..................... 47

3.2.2 Related Semi-supervised Classiﬁcation Models ......... 48

3.2.3 Representative Algorithms Review ............... 50

3.3 LinkBoost Framework .......................... 53
3.3.1 Objective Function ........................ 53

3.3.2 Boosting Algorithm ........................ 54

3.3.3 LinkBoost Framework Summary ................. 61

3.4 Experiments and Analysis ........................ 62

iv

3.4.1 Experiment Setup ......................... 62

3.4.2 Robustness against Sparse Link Information .......... 65
3.4.3 Boosting Power for Supervised Algorithms ........... 70
3.5 Conclusions ................................ 72
Semi-supervised Clustering with Pairwise Constraints 73
4.1 Problem Definition ............................ 73
4.2 Review on Previous Studies ....................... 74
4.2.1 Approach Based on Constraints Satisfaction .......... 75
4.2.2 Approach Based on Distance Metric Learning ......... 75
4.2.3 Representative Algorithms .................... 76
4.2.4 Summary ............................. 82
4.3 Boosting Clustering ............................ 83
4.3.1 Main Idea ............................. 83
4.3.2 Objective Function ........................ 87
4.3.3 The BoostCluster Framework .................. 88
4.4 Experiments ................................ 97
4.4.1 Experiment Setup ......................... 98
4.4.2 Generality of the Boosting Framework ............. 102
4.4.3 Robustness of Exploiting Pairwise Constraints ......... 103
4.4.4 BCP vs. BCS ........................... 109
4.5 Summary ................................. 112
Semi-supervised Learning for Query Translation Disambiguation in
Dictionary-based Cross Language Information Retrieval 113
5.1 Introduction ................................ 113
5.2 Related Work ............................... 117
5.2.1 Selection-based Approaches for Query Translation Disambigua-
tion ................................ 117
5.2.2 Spectral Clustering ........................ 119
5.3 The Statistical Framework For Dictionary-based CLIR ........ 121
5.3.1 Notation .............................. 121
5.3.2 Modeling the Uncertainty in Query Translation ........ 122
5.3.3 The Retrieval Model ....................... 123
5.3.4 Learning the Translation Probabilities ............. 124
5.3.5 Solving the Optimization Problem ................ 127
5.3.6 Summary and Discussion ..................... 129
5.4 Maximum Coherence Model ....................... 131
5.5 Spectral Query Translation Model .................... 132
5.5.1 Query Translation Disambiguation as Graph Partitioning . . . 133
5.5.2 Maximum Coherence vs. Spectral Graph Translation ..... 136

5.6 Experiments and Discussions ....................... 137
5.6.1 Experiment Setup ......................... 138
5.6.2 Comparison to Selection—based Approaches ........... 141
5.6.3 The Necessity of Including Translation Uncertainty ...... 146
5.6.4 The Impact of 'I‘ranslation Independence Assumption on Query

Disambiguation .......................... 148
5.6.5 Performance: MAC vs. SQT ................... 149
5.6.6 Computational Efﬁciency ..................... 150
5.7 Conclusions ................................ 151

Semi-supervised Learning for Extraction of Questions and Answers
from Web FAQs 152
6.1 Introduction ................................ 152
6.2 Related Work ............................... 160
6.2.1 QA Extraction from the Web .................. 160
6.2.2 Review on Spectral Graph Transducer ............. 161
6.3 FAQ Page Classification ......................... 162
6.4 Observations on Web FAQs Data .................... 163
6.5 A Semi-supervised Learning Approach for QA Extraction ....... 165
6.5.1 Web Page Preprocessing ..................... 165
6.5.2 Semi-supervised Learning for QA Extraction .......... 167
6.6 Experiments and Discussions ....................... 171
6.6.1 Experiment Setup ......................... 171
6.6.2 Experiments on Veriﬁcation of Side Information ........ 173
6.6.3 Experiments on QA Extraction ................. 175
6.7 Conclusions ................................ 179
7 Conclusions 180
APPENDICES 184
A Related Proofs and Lists 184
Al Proof of the eigenvector problem in Section 4.3.3 ............ 184
A2 Proof of the generalized eigenvector problem in Section 4.3.3 ..... 185
A3 Features for FAQ page classiﬁcation ................... 185
BIBLIOGRAPHY 188

vi

2.1

3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8

4.1
4.2
4.3

5.1

6.1
6.2

LIST OF TABLES

The CNMF algorithm ..........................

Classes in Cora dataset ...........................
Classes in Citeseer dataset .........................
Classiﬁcation accuracy on Cora dataset (Part A) ............
Classiﬁcation accuracy on Cora dataset (Part B) ............
Classiﬁcation accuracy on Citeseer dataset (Part A) ..........
Classiﬁcation accuracy on Citeseer dataset (Part B) ..........
Boosting classiﬁcation accuracy on Cora dataset ............

Boosting classiﬁcation accuracy on Citeseer dataset ..........

Notations .................................
Datasets used in the experiments. ...................

Performance comparison of BCP algorithm and BCS algorithm . . . .
Average precision for short and long queries on TREC datasets

Corpus statistics of QA pair data ....................

Performance comparison of QA extraction ...............

vii

63
63
66
67
68
69
71
71

144

162
176

1.1
1.2
1.3

2.1
2.2
2.3

3.1

4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8
4.9
4.10
4.11

5.1
5.2
5.3
5.4
5.5

LIST OF FIGURES

Example A of complicated structures of data. .............
Example B of complicated structures of data. .............

Dataset 2 with side information ......................

Precision comparison: CNMF vs. baselines ..............
Recall comparison: CNMF vs. baselines ................

F1 comparison: CNMF vs. baselines ..................
LinkBoost. framework ...........................

Illustrative example on iterative projections ..............
The ﬂowchart of the BoostCluster framework ..............
Boosting algorithm for BCP and BCS ..................
Example BCP objective function vs. number of iterations. ......
Legends for all algorithms in comparative study ............
Clustering performance evaluated by NMI (Part A) ..........
Clustering performance evaluated by NMI (Part B) ..........
Clustering performance evaluated by PWFl (Part A) .........
Clustering performance evaluated by PWFl (Part B) .........
Clustering performance (NMI) with noisy constraints .........

Clustering performance with noisy constraints .............

Steps of applying the proposed framework to CLIR ..........
Graph partitioning perspective for query translation disambiguation .
An example query in experiments ....................
Performance comparison on homogeneous datasets ...........

Performance comparison on heterogeneous datasets ..........

viii

14

61

86

92

94
101
104
105
106
107
110
111

130
135
140
142
143

6.1
6.2
6.3
6.4
6.5
6.6
6.7

Example translation probabilities estimated by “MAC” model

Example query translation by “BSTO” and “MAC” models. . . . . .

Com arison of “MAC” and “S T” models on an exam )le ( nerv . . .
l u’

Snapshot of example web FAQ pages 1 of 4) .............

(
Snapshot of example web FAQ pages (2 of 4) .............
Snapshot of example web FAQ pages (3 of 4) .............
Snapshot of example web FAQ pages (4 of 4) .............
Semi-supervised learning algorithm for QA extractions ........

Relation between class membership and format similarity .......

Correlation between prediction accuracy and prediction probability

146
147
149

CHAPTER 1

Introduction

Learning with inadequate amount of, or no, supervised information poses a challenge
for machine learning research. In practical domains, though supervised information
may be hard to collect, it is often the case that some certain partial supervised
information is available. This naturally leads to a question: can we learn better, if
more knowledge on the data is gained? ‘

Answering this question leads to the research topic of semi-supervised learning,
which has gained signiﬁcant attention in recent years.

Semi-supervised learning, in a strict sense, refers to the family of classiﬁcation
techniques in machine learning that makes use of both labeled and unlabeled data.
In a typical semi-supervised learning setting, the amount of labeled data is small
while the amount of unlabeled data is large. As its name suggests, semi-supervised
learning lies in the continuum of two ends in the spectrum of machine learning —
strictly supervised learning (with completely labeled data), and strictly unsupervised
learning (without any labeled data). Usually, some partial supervised information is
available in semi-supervised learning, which typically includes the observation of all
unlabeled data that is going to be classiﬁed into predeﬁned categories, possibly as
well as information in other forms that do not directly provides class labels.

Data clustering is a typical unsupervised learning problem, where no supervised

information is available. The word “semi—supervised” has also been introduced to data

clustering, i.e. “semi-supervised clustering”, when partial supervised information is
available for clustering purpose.

Regardless of how the terminology evolves, the essence of semi-supervised learning
is the utilization of any knowledge gained in addition to the supervised information.
The particular interest we will show in this thesis, is the kind of additional knowledge
beyond the input patterns of the data. By input patterns, we refer to the observation
of “feature representations” (or also known as “attribute values” in literature) on all

the data. This leads to the research issue that centers this thesis work, i.e.

How can we improve semi-supervised learning, if we know anything more

than the input patterns of labeled data (if any) and unlabeled data?

The knowledge about data that is not the input patterns will be referred to as
“side information” in this thesis. Also, among all types of semi-supervised learning
work, we will focus on graph-based approaches.

In this chapter, we will ﬁrst review graph-based semi-supervised learning, then
motivates the use of side information in semi-supervised learning, and brieﬂy overview
the entire thesis work. Detailed discussions on each piece of work will be left to the

rest chapters.

1.1 Graph-based Semi-supervised Learning

1.1.1 Mathematical Deﬁnition of Graph

From a mathematical point of view, a graph is a collection of points and lines con-
necting some (possibly empty) subset of them [151]. The points of a graph are most
commonly known as graph vertices, but may also be called “nodes” or simply “points”.
Similarly, the lines connecting the vertices of a graph are most commonly known as
graph edges, but may also be called “links”, “arcs” or simply “lines”. In an undirected

graph, edges are not directional, i.e., a line from point A to point B is not distin-

guished from a line from point B to point A. However, the two directions are distinct
in a directed graph (or digraph for short). On many occasions, a weight. (usually pos—
itive) will be associated with each edge, indicating the strength of the relationship
within the corresponding vertex pair.

Formally, we can denote a graph by G (V, E), where V is the vertex set and E is
the edge set. For a ﬁnite graph G with n vertices, the adjacency matrix is deﬁned as

a binar matrix A = a- - , with a' - = 1 denotinor there is an edge between the
y 2,] 2.] b b

nxn
i-th and the j—th vertices and ai, j =2 0 otherwise.

Introducing the weight to each edge in a graph will result in a weighted graph.
In this case the adjacency matrix (or weight matrix) becomes W = [wi’j] nxn’ with
wm- > 0 indicating the edge weight between the i-th and the j—th vertices and wi, j = 0
indicating no edge there. For an undirected graph, both the adjacency matrix and
the weight matrix are symmetric.

An important concept centering in Spectral Graph Theory is the graph Laplacian,

which has been widely used in graph-based semi-supervised learning models. With

respect to a symmetric adjacency matrix W, the graph Laplacian is deﬁned as follows
L = D — W (1.1)

where D is a diagonal matrix D = diag(d1,d2, - -- ,dn) with di 2 ZjEV wi,j- A
normalized version of the matrices above is deﬁned in some occasions

- 1 1
W = D_2WD_2

_1_1
L=D2LD2

I—w

The graph Laplacian L (or L) has the following properties

1. It is positive semi-deﬁnite, i.e., all the eigenvalues are non-negative;

 

2. Its minimum eigenvalue is always 0. For unnormalized graph Laplacian L,
the corresponding principal eigenvector is e = (1/ﬁ,1/\/h,...,1/\/H)T; for
1

normalized graph Laplacian L, the corresponding principal eigenvectrn' is D28

The set of eigenvalues of the graph Laplacian L can be denoted by 0 = A0 S A1 3

- S An_1, which is also called the spectrum of L (or the graph itself). Spectral
Graph Theory tells us that the structure of a graph and its principal properties can be
deduced from its spectrum. In particular, it has been shown that the eigenvalues are
closely related to almost all major invariants of the graph, and link external properties
together ( [38, 37]). As will be shown later in the next subsection, graph Laplacian is

clearly present in a number of graph-based learning models.

1.1.2 Graph-based Semi-supervised Learning Models

Semi-supervised learning deals with the use of both labeled and unlabeled data for
training. The central idea behind nearly all semi—supervised learning algorithms is an
assumption made on the data consistency that data examples close to each other or
in the same structure are more likely to have the same label. Various semi-supervised
learning approaches differ in the way to model the structure of data and attempts
to propagate the label information from labeled examples to unlabeled ones. One
approach is to construct a graph over all the labeled and unlabeled data: each vertex
represents a data example; whenever a pair of data examples are close enough by some
similarity measurement, an edge is constructed between them with a weight propor-
tional to their similarity value. Taking this graph point of view, a number of graph-
based semi-supervised learning model can have interpretation in graph languages,
such as graph partitioning [25, 26, 87], random walk on the graph [134, 119, 70],
ranking of nodes [119, 70, 92, 144], graph approximation [104, 105], and etc..

In the rest of this subsection, we will recapitulate a few typical graph-based semi-

supervised learning models.

GRAPH MINCUTS

The graph mincuts model extends the algorithm for ﬁnding the minimal cut in a
graph to a transductive learning setup [25, 26]. The basic idea is searching for a
partition of the graph which results in a minimum sum of weights of the edges being
cut while agreeing with the labeled data. To enforce the consistency with the labeled
data, a special weighting scheme is adopted to build a graph over all the labeled
and unlabeled examples: for any pair of data examples belonging to different classes,
the edge weight indicates their similarity; for any pair of data examples belonging to
the same class, an inﬁnite weight is assigned. In a binary case, the search for the
minimum cut amounts to the following optimization problem

- 2
U :w.,.<y. - y.)
{yilxz'EX } i,j

at. y, = 3):,in E XL

where 3;: indicates the known labels of the labeled data.

Note that the solution gives binary labels for the unlabeled data, which can also
been proved to be optimal in another sense that it minimizes the leave—one-out cross—
validation error of the nearest-neighbor algorithm applied to the entire dataset ([25]).
To improve the robustness of the solution, a follow-up work ([26]) introduces random

noise to edge weights and results in a solution with “soft” labeling.

GAUSSIAN RANDOM FIELDS AND HARMONIC FUNCTIONS

Gaussian random ﬁelds and harmonic functions method ([173, 172, 174]) is motivated
by the assumption that the label probability should vary smoothly over the entire

graph. To enforce the label smoothness on the graph, a quadratic energy function is

‘

proposed as follows (in a binary classiﬁcation case)

Elf) = 12mm 43-)?

where f = (f1, f2, . . . ,fn)T is the label probability vector deﬁned as

6(3),, 1) r,- is labeled
fi. =
Pr(y,- = 1|r,-) .23,- is unlabeled
The energy deﬁned above will be small when the label probability vector varies

smoothly over the graph, which leads to the minimization of the energy function. The

minimizer f * makes the function harmonic in the sense that

yz- :13,- is labeled

(Lf*)i =

0 r,- is unlabeled

This method is also related to Gaussian Random Field because the energy function

could be used to form a Gaussian density function

p(f) oc exp[—ﬁE(f)]

where B is an “inverse temperature” parameter.
If we deﬁne P = D’1W and further decompose it into four blocks

P11 P In

P =
Pul Puu

 

 

where Pll corresponds to the labeled data and Pan corresponds to the unlabeled

data, the ﬁnal prediction is made in the following way
ya = (I - PuuYIPuzfz

To save the computation introduced by the matrix inverse, an extended work is
proposed in [174], which essentially forms a backbone graph with super-nodes created

by pre—clustering the examples.

Again the important role 1..)layed by the graph Laplacian L is witnessed: the
smoothness regularization on the graph and the harmonic nature of the energy func-
tion are both achieved through it, which suggests a close relationship to the Spectral

Graph Theory ([173]).

SPECTRAL GRAPH TRANSDUCER

Spectral Graph Transducer is another semi—supervised version of ratio—cut algorithm
originally proposed for unsupervised learning ([87]). The objective function incorpo-

rates a quadratic penalty on labeled data in addition to minimizing the graph cut

mfin fTLf + Ca —— r)TC(f — r)
sat. fTe = 0

fo = n
where the vector f is a label probability vector, and the vector r is deﬁned as

a. 475i is a positive example
Ti = 1“; xi is a negative example
0 xi is unlabeled
The matrix C = diag((:1 , C2, . . . ,cn) iS a diagonal cost matrix allowing different mis-
classiﬁcation cost for each data example. The trade-off between the graph cut value
and the training error penalty is balanced through the constant c. It has been shown

that solving the optimization problem in Spectral Graph Transducer also leads to a

matrix eigen—decomposition problem ([87]).

LEARNING WITH LOCAL AND GLOBAL CONSISTENCY

The local and global consistency method ([166]) proposes the following optimization

problem

2

F' F r)

35—] wine—Yuk
l i

, 1
min — w,- - —-—

F 2 22]: ’7 \f— \ﬂi:
where F is the label probability matrix with each column corresponding to a class,
d; = ZjeV wm- and Y is the label matrix with yi,j = 1 if the i-th example is labeled
as a member in the j—th class and 31,-, j = 0 otherwise.

The ﬁrst term in the object function addresses the smoothness constraint on the
graph by the sum of local variations measured at each edge in an undirected graph;

the second term penalize the inconsistency with the labeled data.

LOCAL LAPLACIAN EMBEDDING

To learn the global manifold structure from the data, local Laplacian embedding meth-
ods propose to project the data from the original Space to a dimension reduced space.
In particular, let V = [v1, v2, . . . ,vp] be a matrix composed by the p smallest eigen-
vector of the graph Laplacian for the nearest neighbor graph built over all labeled
and unlabeled data. If we rewrite V = [if]: if; , . . . ,ile, it,- is the projected image

of x,- in the dimension reduced space.

For classiﬁcation purpose, in [18] a linear classiﬁer is learned from the labeled data
a = (VZLVLLl—IVLFLYL

where \7 L L is a sub—matrix Of V that corresponds to the labeled examples. Then the

prediction for the unlabeled data x,- is made by

T~
3% = a Xi

Another manifold learning based semi—supervised learning algorithm extends the

manifold ranking function ([169]). The prediction is as follows
y* = (I — aD—lwrlny

where a is a constant that controls how much we rely on the labeled data, and y

takes a i1 value if labeled and 0 if unlabeled.

LEARNING ON DIRECTED GRAPI-Is

Recently a semi-supervised learning method is proposed based on directed graphs
[167], which extends the work in [166]. To address the data consistency, the undirected
graph is expected to be smooth in the sense that nodes lying on a densely linked
subgraph are likely to have the same label. To search for a good classiﬁer which
results in a labeling with such smoothness on the graph, a smoothness functional is
proposed (for binary classiﬁcation)
_ 1 _ . . a _ it 2
- t. A

where f = [f1, f2, . . ., fn]T is the label probability vector, L = [ll-’3'] is the graph
Laplacian over the entire data examples, and 71’ = [7r1, r2, . . . ,7rn]T iS the principal
eigenvector of the graph Laplacian L.

To ﬁnd the labeling for the unlabeled data, an optimization problem is proposed
with an objective function addressing both the data smoothness enforcement and the

consistency with labeled data.
arg min Q(f) + ullf — y”
f

where the component of y takes a :l:1 value if labeled and 0 if unlabeled, and u > 0
is a constant specifying the trade—Off.
A Slightly different version Of the algorithm above is described in [168], where two

sets of smoothness functionals are deﬁned for the data examples: one for their “hub”

scores (which accounts for outgoing links) and one for their “authority” scores (which
accounts for incoming links). TO separate the two smoothness enforcement, directed

graphs are transformed into bipartite graphs.

SPECTRAL CLUSTERINC

Spectral clustering approaches View the problem of data clusterng as a problem Of
graph partitioning. Taking 2-way graph partitioning as an example, to form two
disjoint data sets A and B from a graph G = (V, E), edges connecting these two
parts Should be removed. The degree Of dissimilarity between the partitioned parts
are captured by the notion of cat, which is deﬁned as Cut(A, B) = Xvi €144,363 ail-J.
Generally Speaking, a good partitioning should lead to a small cut value.

Addressing different balancing concerns, there are several variants Of cut deﬁnition
which lead to the optimal partitioning in different senses. TO begin with, we deﬁne
S(A, B) = 2736A ZjEB wi,j and dA = ZiEA di- Ratio Cut addresses the balance
concern on the sizes Of partitioned graph ([68]), which leads to minimization of the

following Objective function

_ MB) S(A,B)
JRCU‘ “ W + lBl

 

Normalized Cut addresses the balance concern on the weights Of partitioned graph
([134]), which leads to minimization of

S(A, B) + S(A, B)
dA 413

 

jNCut =

Min-Max Cut addresses the balance concern between the intra—cluster weights and

inter-cluster weights in a partitioning ([49]), which leads to minimization Of

__ S(A,B) S(A,B)
JMCW ‘ S(A,A)+S(B,B)

 

By relaxing cluster memberships to real values, the above minimization problems

10

can all be formulated as eigenvector problems related tO the graph Laplacian

T
jRC'ut = q Lq
T~
~7NCut Z q Lq
qTWq
qTDq

 

~71” C' at

where q is related to the relaxed cluster membership. All the above three problems
lead to ﬁnding the second eigenvector Of the graph Laplacian L or L.
2-way spectral clustering can be extended to k-way spectral clustering ([66, 115]),

whose solution is related to the top It eigenvectors Of the graph Laplacian.

1.2 Side Information

1.2.1 Motivation on Introducing Side Information

In the previous section, we brieﬂy reviewed graph—based semi-supervised learning. All
the models mentioned there only assume the knowledge of unlabeled data, or more
speciﬁcally, the input patterns of unlabeled data. As we can see, by constructing a
graph from all the data, those models do utilize unlabeled data in training. However
on many occasions, it is hard to achieve satisfying performance, even with the knowl-
edge of input patterns Of both labeled and unlabeled data. The difficulties come from

various sources. TO name a few here

0 Very small number of training examples. As a result, not enough supervised

information can be gained.

0 Large number Of classes with skewed Size distributions. For example, it is easy
to come across hundreds Of categories in text classiﬁcations problems. Very
often, in those categories only a few major ones are Of large Sizes, while the
rest are minor categories with relatively smaller sizes. Classiﬁers tend to make

mistakes on those minor categories.

11

  

 

 

 

 

 

a on D
D Don
I:
o
05_ ““00:
. a
n D n a mnﬁbfdﬂal
On a
an” on
Do a": on -
D an at: m. I'.
O ’- D on g on . 7' :-
Dn '3 la,
0 Dan '3').-
0 Ch * .
an qﬁ
no %
-o.5r ° °° °
D
0000
—1 L Q I
-1.5 -1 -O.5 1.5
a 8.888
()D t t1
1 .
O
o 0000 00 000°
°o 0 0° 00
o N + 0000
0 5 0 0°0 I f F "3% ‘3
. " **+ O 0
0° 9 . mowmfﬂg oo
00 +" o ‘S o
o u: 00000 0+ QCB g: 08
3 :’ 0° {r353} °o++ o
O O %% ¢. . *0 ‘- . oo
O~ 1.°et” o ,2 g,
+ 00.” :0 g ‘9'
* O O o
+ q) 1 «0+ 0 g
”'4 (X +1 0+ *’* 0° *
+ * '01: OO .
O 5* IN- oc%°‘»+’:ﬂm 0000 I. I
4' %%% 3 o ‘0’.
+++++ o o o 9
+. ﬂ, 1v],
up"... .++*"+. .
4-
_1 I I l I 1 4
-1.5 -‘| -0.5 0 0.5 1 1.5

(b) Dataset 1 (with true class information)

Figure 1.1. Example dataset that presents data in complicated structure. The upper
ﬁgure shows the data distributions without class information; the lower ﬁgure Shows class
lables for data examples. Without any supervised information, clustering on this dataset is

difficult, since the two spirals are hard to be separated.

12

 

 

 

 

 

 

 

 

 

6 T 1 f I G I
X
08
4b 3: x O 0000 O ‘
a” a o
2_ x :8: O OO CIDo 4
”*2, 0869
00
0- o ’9‘ ~
0 V "
-2- xx x -
3‘81:
x X
-4. x X xx .
ea? ..
— b 1 I l I

 

 

 

-6 -4 -2 0 2 4 6

(b) Dataset 2 (with true class information)
Figure 1.2. Example dataset that presents data in complicated structure. The upper
ﬁgure shows the data distributions without class information; the lower ﬁgure shows class

lables for data examples. Without any supervised information, 2-way clustering on this
dataset is difficult, since it is hard to tell how to combine four data clouds into two clusters.

13

 

 

 

 

 

Figure 1.3. Dataset 2 with side information.

o Complicated underlying structure of data. For example, in general a clustering
algorithm is hard to perform well when the data examples do not form compact
and well-separated clusters. Figure 1.1 and Figure 1.2 present two such cases
that will fail many clustering algorithms. It is easy to imagine, yet hard to
visualize, the existence of much more complicated structure of data in real-

world applications.

0 Noisy input patterns. AS a result, the correlations between data examples
computed from their input patterns are unreliable, so propagating information

among data examples could be error-prone.

o Incomplete data. In this situation, it is hard to reveal the entire structure of

data.

However, with a little more information known beyond the data input patterns,

the difﬁculty involved in the semi-supervised learning can be eased or resolved. TO

14

 

illustrate this ﬁnding, let us consider the 2-way clustering problem shown in Figure
1.2. It is easy to ﬁnd that there are roughly four clouds of data points in the dataset,
but the difficult part is in what way these four clouds Should be combined into two
clusters. Suppose we know that there a few data pairs, each of them containing two
data points that should be put into the same cluster, as indicated by solid lines in
Figure 1.3. Then it is clear that the two data clusters should be formed by putting
the two clouds in the diagonal positions into one cluster.

In the above example, the extra information is only data relationships that take
only binary values, from merely four data pairs involving eight data points. Consider-
ing the input patterns Of all the data points and the relationships between all possible
data pairs, the extra information we additionally have is very Simple and small, SO
we can call it as a type of “side information”. However, the data clustering task is

greatly beneﬁted from such side information.

1.2.2 Side Information for Semi-supervised Learning

We may ﬁnd different forms of Side information in different task scenarios. Numerous
previous studies have resorted to Side information, in order to improve the perfor-

mance of learning algorithms. For example

0 The title, text annotations, or even surrounding texts are often referred as Side
information to the image representations. AS is well known, understanding
image contents is very hard due to the gap between the low-level image features
and the high-level image semantics. The textual side information will provide
additional features that help to describe the image contents. Precious work
along this line includes using textual information for image segmentation [10],

image retrieval [11, 35, 170], and image clustering [31].

0 Log data is widely gathered in web applications (such as search engine, infor-

mation ﬁltering, online shopping, and etc.) and serves as side information for

15

 

various purposes. User query log can be used to form a context to provide
additional information on the user’s information need, and then improve the
query model ([139, 131]) or the user model ([132]). User click-through data 1is
another important source of side information for web applications. For exam—
ple, from click-through data we can tell how Often a link from page A to page B
is actually used, which can be used to estimate the similarity between the two
pages [112], in addition to their page contents; gathering a speciﬁc user’s clicked
links can help to model the user’s interest [75, 140], as additional information

to the user’s proﬁle.

0 User feedback can be regarded as side information for information retrieval. For
example, user ratings on a returned document help to decide its relevance to
the query. In movie recommendation websites, user ratings provide additional

information either for describing movies or describing users.

0 External linguistic resource can serve as side information for many information
retrieval applications. For example, bilingual dictionaries are widely used to
help query translation for cross-language information retrieval [2, 95, 46, 57, 76,
93, 109, 81]. WordNet [54] can be used for query expansion, word disambigua—

tion, etc., and ﬁnally improve retrieval performance [106, 147].

o For classiﬁcation problems, knowledge gained about the categories can be
viewed as Side information. For example, human knowledge on the popular—
ity Of each category is always used to gain some control on the category sizes in
the classiﬁcation results. In image databases such as ImageCLEF [77], there is
some text description associated with each image category, which can be used

to gain some knowledge on the relationships between categories.

 

1Click-through data refers to a type Of log which keeps record of the time, session of hyperlinks
being clicked by web users.

16

o In real—world data clustering problems, it is Often possible to gather informa-
tion on whether two examples should be put in the same cluster or not (Often
called as “pairwise constraints”). For example, in speaker segmentation and
recognition, a conversation between several speakers needs to be segmented and
clustered according to the speaker identity. It may be possible to automatically
identify small segments Of Speech which are likely to be from a Single unknown
Speakers [9]. Another example would be ﬁnding lanes in GPS maps. Usually
lanes are identiﬁed by clustering a few road segments. By tracking the Signals
Of car GPS’S, continuous car-traveled segments would be considered belonging
to the same lane, and well separated car-traveled segments would be considered
belonging to different lanes [148]. These pairwise constraints, would provide

side information for clustering data examples based on their input patterns.

0 In some cases, a few identiﬁed representative data examples are available as
“seeds” for clustering. We can also view these cluster seeds as Side informa-

tion [13, 12].

One characteristic can be summarized from various forms Of side information is
that, it is capable of providing some additional knowledge about the data, but it
cannot replace the original data input patterns. In this thesis, we will refer to any
additional information on the unlabeled data beyond input patterns as “Side informa-
tion”.

Usually side information is small in amount and incomplete. However, with proper
use, side information could Signiﬁcantly improve the performance of unsupervised,
supervised, or semi—supervised learning algorithms.

Side information serves as partial supervised information. Also, as mentioned in
the beginning of this chapter, we will assume the data input patterns Of all the data
(including labeled and unlabeled) is known; and in all following discussions, we will

use unlabeled data in training. For these reasons, regardless Of the original learning

17

task being an unsupervised one or a supervised one, as long as side information is
used, we will refer the resulting learning method as semi-supervised learning with
side information.

Side information can be explored in a number Of different ways. FOr instance,
in probabilistic models, side information iS always represented as strong priors; in
non-probabilistic models, side information is always incorporated in distance metrics
or constraints, etc. The work presented in the thesis will focus on graph-based non-

probabilistic models.

1.3 Graph-based Semi-supervised Learning with Side Infor-

mation

Side information, depending on its form, has various interpretations in graph—based
semi-supervised learning models. These interpretations can be summarized into, but

not limited to, the following major categories

1. Additional dimensions to the node representation: more features are
found for the input pattern of part (or all) the data examples. A typical example
in this category is textual information for images, where we can append textual
feature vectors with image feature vectors to form a new image representation.
Other examples include query log for query modeling, click-through data for

user modeling, user ratings for movie recommendation, etc.

2. Constraints on the graph topology: an edge should or should not be con-
structed between two nodes. The pairwise constraints for clustering belongs to
this category, which is self-explanatory. For example, the link graph constructed

for web pages by hyperlinks falls into this category.

3. Reﬁnement on graph edge weights: a (new) weight Of an edge is proposed.

For instance, user feedback on a returned document deﬁnes a new relevance

18

 

score between the query and the document, i.e. a similarity value between the
query node and the document node. Click-through data for hyperlinks on a
web page also suggests a new weight for the corresponding pair of page nodes

(for example, proportional to the number of user clicks on the hyperlinks).

The variance in the roles played by Side information from a graph point of view
leads to a diversity Of ways to accommodate side information within semi-supervised

learning settings. The following is an incomplete list Of treatments of Side information

1. Data Representation Augmentation: Side information is used to form bet-
ter representation, in the hope that the underlying structure of data can be bet-
ter revealed when the augmented data representation is used for computation.
For example, in [63, 32, 120, 130, 33, 118], Side information is used to generate
additional feature dimensions for classiﬁcation; in [48, 40], side information is

used to learn a new representation in the latent Space.

2. Hard Constraints: Side information is formulated into constraints that can-
not be violated. For example, constrained clustering [148, 20], constrained EM
algorithm [73] and multiple-instance learning [6] all treat Side information as

non-violable rules.

3. Soft Penalty: Any inconsistency between the learning outcome and the side
information will incur a penalty. For those models with a clear Optimization
goal, such a treatment often leads to side information being encoded as part
of the Optimization Objective function (such as a regularizer term or its equiv-
alence). Typical examples include using Side information for distance metric

learning [155, 98, 74], clustering [96, 15, 108, 16], etc..

4. Other Heuristics to Carry Out Learning Algorithms: Side information

is used to provide building blocks for other learning algorithms. For example,

19

in [168, 167, 119, 70, 158, 149], side information is used to induce the graph
structure where learning algorithms over the graph can be carried out; in [27, 39]
side information is used for co—training; in [72] side information is used for

boosting.

1.4 Overview on Thesis Work

All the thesis work presented in the rest Of the chapters revolves around the topic
Of improving semi-supervised learning with side information. Due to the diversity of
learning tasks and forms of side information, we will demonstrate our efforts through
several typical learning settings and applications.

Chapter 2 discusses multi—label classiﬁcation, with class correlations as side infor—
mation. TO meet the challenge Of a large number Of classes and a small Size Of training
data in multi-label classiﬁcation, we propose tO utilize class correlations to link the
computation of data example similarities with their multiple class memberships. We
will also Show that the proposed multi-label learning algorithm leads to a constrained
non-negative matrix factorization formalization.

Chapter 3 focuses on binary classiﬁcation, with side information in the form Of
links. Link-based classiﬁcation gained signiﬁcant attention in text domains, due to
the large amount of needs in classifying web pages or scientiﬁc publications, where
inter-document connections are available as hyperlinks or references. However, the
sparse and noisy nature of links causes trouble in utilizing them for classiﬁcation. A
general boosting framework is proposed in this chapter that tries to make the best
use of link information against its sparseness and noise, and is able to improve any
binary classiﬁer.

Chapter 4 presents another boosting framework for semi-supervised clustering,
with pairwise constraints as side information. In this work, side information is en-

coded intO the data representations by iteratively selecting a good direction to project.

20

the original data into a low-dirnensional space. Again, the proposed framework is de-
signed as a meta-algorithm that can be applicable to any clustering algorithm, and
improve its performance with pairwise constraints.

In addition to the above generic tasks of semi-supervised learning, the thesis work
also includes two application examples. In particular, we will Show two information
retrieval tasks, where semi-supervised learning with Side information could achieve
improved performance over traditional methods. In Chapter 5, we study the problem
Of cross-language information retrieval, with bilingual dictionaries as side information.
Two models based on a maximum coherence principle are proposed, which can be
well explained as two-way partitioning of the graph induced by Side information — the
dictionary.

Another application, extracting question-answer pairs from web FAQs, is de-
scribed in Chapter 6, where side information comes from human knowledge on the
presentation regularities Of web FAQS. The model proposed in this chapter leads to a
correlated label propagation scheme over a graph built upon the text segments Of web
FAQ pages. As will be seen, properly using the Side information enables the question-
answer extraction task being performed without any human supervision, despite the
wide variety of possible contents and styles in web pages.

Finally, Chapter 7 concludes the thesis work.

Semi-supervised learning enforces the predicted labels (usually the learning goal)
to be consistent with the structure of data, while agreeing with the supervised infor-
mation (if available). This is already stated as the assumption made by all graph—
based learning algorithms in Section 1.1.2. The introduction Of Side information
should not violate this data consistency assumption. Therefore, including side infor-
mation into the consistency enforcement becomes necessary. Such an idea is common
across all the chapters that follow, in despite Of the diversity of tasks, models and

their graph interpretations that will be discussed in detail in each individual chapter.

21

In summary, the semi-supervised learning models and frameworks proposed in this
thesis work all conform to the same theme: consistency is maximally enforced among
the following factors -— data input patterns, supervised information (if any), side in—
formation, and predictions on unlabeled data; invariably, Optimizaticm will be the tool
to achieve the goal Of consistency enforcement. Before going into detailed discussions
in each chapter, understanding this theme will help to reveal the connections between

all parts of the thesis work and view them as one whole piece.

22

CHAPTER 2

Semi-supervised Multi—label Learning

with Class Correlations

2.1 Introduction

Multi—label learning refers to the classiﬁcation problems where each example can
be assigned to multiple different classes. It has found applications in many real-
world problems. For example, text categorization is typically multi-labeled since each
document can be assigned to several predeﬁned topics; in bioinformatics, most genes
are associated with more than one functional classes (e.g., metabolism, transcription
and protein synthesis); automatic image annotation, can also be treated as a multi-
label learning problem if we view each annotation word as a distinct class.

A straightforward approach toward multi-label learning is to decompose it into
a set of binary classiﬁcation problems, one for each class. The drawback with this
approach is that it does not explore the correlation among different classes, which
often could be an important hint for deciding the class memberships. Many algorithms
have been developed to incorporate the class correlation into multi—label learning,
including [110, 52, 85, 146, 28, 59, 60, 90, 171, 164, 145, 141, 42]. But most Of
these studies are limited to a relatively small number of classes and assume that

the amount Of training data is sufficient for exploiting class correlations and training

23

 

reliable classiﬁers. In contrast, the real-world application Of multi-label learning often
features a large number of classes and a relatively small size of training data. AS a
result, the amount of training data related to each class is Often sparse and insufficient
for computing class correlations and learning a reliable classiﬁer.

However, we ﬁnd that in practical, more reliable class correlations are often avail-
able as side information. For example, in text classiﬁcation problem, if each category
has a description, the correlation between two classes can be derived by computing
the similarity between their descrptions; also, human knowledge about the category
topic can also be used to decide how close two categories are in terms Of their topics.
SO the problem becomes, given class correlation information, how can we improve
multi-label learning?

To address this problem, we present a novel framework for multi-label learning
that explicitly explores the correlation among different classes. Compared to the
existing approaches for multi-label learning that also explore the class correlation,
the proposed framework provides a natural means for exploring the unlabeled data
and the class correlation simultaneously, thus effective for the learning scenarios with
a large number of classes and a small Size of training data.

The key assumption behind this work is that two examples tend to have large
overlap in their assigned class memberships if they share high similarity in their
input patterns. To be more speciﬁc, consider two examples x1 and x2 that are
labeled by two sets of class labels yl and y2, respectively. We can evaluate the
Similarity between these two examples in two different ways. The ﬁrst one is based
on the correlation between the input patterns of these two examples. The second one
is based on the overlap between the class labels of these two examples. We denote the
similarity based on the input patterns by K g;(x1 , x2), and the similarity based on the
class labels by Ky(y1, y2). If the assigned class labels Y1 and y2 are appropriate for

example x1 and x2, we would expect the two similarity measurements to be similar,

24

namely K 33(x1, x2) % Ky(y1, y2). Based on this assumption, we can determine the
best assignment Of class labels tO the unlabeled data by minimizing the difference
between the two sets Of similarities. Clearly, this approach is able to effectively
explore the unlabeled data because the assignment Of class labels to each unlabeled
example is dependent on the assignment. of class labels of other unlabeled examples.
This approach is also able to exploit the class correlation effectively through the kernel
similarity function Ky(y1, y2).

The rest Of the chapter is structured as follows: ﬁrst, we brieﬂy review the related
work on multi-label learning and semi-supervised learning; second, we introduce the
proposed framework for multi-label learning, and a formalization based on the con-
strained non-negative matrix factorization; third, we present an efﬁcient algorithm
to solve the related optimization problem that is based on the iterative bound opti-
mization algorithm; fourth, we present the empirical study with a text categorization

problem; ﬁnally, we conclude this study and raise some future work.

2.2 Related Work

We will ﬁrst review the related work on multi-label learning, followed by a discussion

of related semi-supervised learning problem.

2.2.1 Related Work in Multi-Iabel Learning

The simplest approach toward multi-label learning is to divide it into a number of
binary classiﬁcation problems [160, 86]. There are a number of disadvantages with
this approach. One is that it will not scale to a large number of classes Since a different
binary classiﬁer has to be built for each class. Another disadvantage is that it treats
each class independently, and therefore is unable to explore the correlation among
different classes. The third disadvantage is that this approach Often will suffer from

the unbalanced data problem when the minority classes are given only a few training

25

 

examples.

Another group of approaches toward multi-label learning is label ranking [42, 52,
127]. Instead of learning binary classiﬁers from labeled examples, these approaches
learn a ranking function from the labeled examples that order class labels for a given
test example according to their relevance to the example. Compared to the binary
classiﬁcation approaches, the label ranking approaches are advantageous in dealing
with large numbers of classes because only a Single ranking function is learned. How-
ever, Similar to the binary classiﬁcation approaches, the label ranking approaches are
usually unable to exploit the class correlation information.

In the past, a number Of studies have been devoted to exploring the class corre-
lation within the context Of multi—label learning. A generative model for multi-label
learning was proposed in [146] to explicitly incorporate the pairwise correlation be—
tween any two class labels. A maximum entropy model is proposed in [171] that
capture the pairwise class correlation by constraints. Approaches based on latent vari-
ables were proposed in [110, 164] to capture the correlation among different classes.
The study in [123] exploited the class correlation information given the hierarchi-
cal structure of classes. Unlike the previous work on multi-label learning that only
considers the correlation among different classes, in this chapter, we present a novel
framework that exploits the unlabeled data as well as the class correlation. This prop-
erty will make the proposed approach more effective than the existing approaches for
multi-label learning, particularly when the number Of classes is large and the size of
training data is small.

This work is also particularly related to the label propagation approaches for semi-
supervised learning. This is because by enforcing examples with similar input patterns
to share Similar sets of class labels, we essentially propagate the class labels through
the Similarity graph Of examples, which is the key idea of the label propagation

approaches. A number of machine learning methods have been developed recently

26

for label propagation, including the Gaussian processes [152], the harmtmic functions
[173], and Green functions [168]. Unlike most. of the previous work on semi—supervised
learning that is designed primarily for multi-class learning, this work is speciﬁcally
targeted on the semi-supervised multi—label learning. It effectively explores the class
correlation information when utilizing the unlabeled data. More discussion Of semi—

supervised learning can be found in [129, 172].

2.2.2 Representative Algorithms

In the following, we will recapitulate a few representative algorithms for semi-
supervised multi-label learning, as well as the non-negative matrix factorization al-

gorithm that is closely related to the model we are going to propose later.

MULTI-CLAss MULTI-LABEL PERCEPTRON ALGORITHM

Multi-class Multi—label Perceptron (MMP) algorithm extends the Perceptron algo—
rithm from Single binary output to a ranked list Of n—ary output. Speciﬁcally, for any
test example possibly belonging to one or more Of l categories, the algorithm output
will be a ranked list of the l categories, indicating the preference of assigning the test
example to those categories.

AS the Perceptron algorithm, h‘lMP n’raintains a set of l prototypes W1, - -- ,wl.
For any data example x -, wiij yields a score that will decide the ranking Of the
i-th category for this data example 233-. The procedures Of learning those prototypes

can be summarized as follows
Initialize: Set “’1 = - -- = W) = 0
Loop:

0 Get a new training data example xj 6 Rd, and its true category informa—

tion 9,, which is a set of category IDs

27

0 Rank the categories according to wlij, - - - ,wiTxl

o If the category ranking is not consistent. with the truth 3?]-
. - ,. __ A “f '1—
1. For Vr E yj, set M —— [[5 ﬁt yjlws xj Z w,. xj}]
. ~ , _ - T . T .
2. For VI Q yj, sct nr — [{s G yjlws x] 3 W7. x]}]
3. Compute loss 77 = 27.717.
4. Update for r E 51,-: wr +— wr + Iglxj

5. Update for r ¢ 51,-: WT +— wr — 7+]ij

Output: W1, - -- ,wl

MMP algorithm is computationally inexpensive and it is suitable for online learn-

ing.

MULTI-LABEL INFORMED LATENT SEMANTIC INDEXING ALGORITHM

Multi-label Informed Latent Semantic Indexing (MLSI) algorithm extends the Latent
Semantic Indexing (LSI) to supervised cases [164]. In LSI, a linear mapping from the
input space to a low-dimensional latent space is found, so that the structure of the
data can be preserved as much as possible. This leads to an Optimization problem
which minimizes the reconstruction error from the latent space to the original Space.
Given supervised information in the form Of multiple labels of training examples,
MLSI proposes to preserve the structures in the input patterns Of the data, as well
as their label information, using the same set of latent variables.

Formally, let X E Rm<d denote the input matrix for n data examples, each in
(1 dimensions; let Y E Rmd denote the corresponding label indicator matrix. The

MLSI algorithm is equivalent to solve the following Optimization problem

minA,B,v (1— ﬁlllX—VAHZ +B|lY — VBII2

s.t. VTV = I

28

where V E Rnx“ deﬁnes a latent space, which gives a k-dimensional projection for
both the input pattern (i.e. X) and their labels (i.e. Y); A E R“Xd and B E Rle
give the mappings from the latent space to the input space and the output space (i.e.
labels), respectively; B is a constant which balances the reconstruction errors to both
the input spaces and output Spaces. The above Optimization problem is equivalent

to the following eigenvector problem

maxveR-n vTCv

s.t. v v=1

where C = (1 — ;B)XXT + BYYT. The solution Of the latent semantic matrix
V is composed of the ﬁrst k-th eigenvector of the above eigenvector problem, i.e.,
V = [v1,--o ,vk].

Note that, MLSI includes output variable Y in its Objective function. The corre-

sponding term can be rewritten as

env — VBH2 = s - trace[(Y — VB)T(Y —- VB)]

= a . trace[YTY — BTVTY — YTVB + (VB)TVB]

In the above, YTY can be interpretted as a class similarity matrix, computed solely
from the training data. Therefore, MLSI implicitly addresses the idea of utilizing
class correlation. However, since only training data is involved, when the training

data is small in amount, such class Similarity information could not be very reliable.

NON-NONGATIVE MATRIX FACTORIZATION

When we use a matrix to represent a graph, such as the adjacency matrix or other
variants, there is a potential computation burden when the graph has a large number
nodes and edges. A natural thought would be to approximate the matrix while
preserving as much information as possible. This is particularly useful especially

when we are more interested in understanding global structures of the data.

29

 

Finding the optimal lit-rank approximation Of a given r-rank matrix A (k < r)
can be formulated as ( [138, 163])
B = argmin [IA—BHF (2.1)
Rank(B)=k

Applying Singular Value Decomposition to matrix A, we have
A = UsvT

where U and V are orthonormal matrices, and S = diag(sl, 52, . . . ,Sr, 0, . . . ,0) with
31 2 32 2 ~ . - 2 Sr > 0, the solution tO the lower rank approximation problem would

be
B = deiag(sl,sg, . . .,sk)V;€r (2.2)

Highly related tO matrix approximation problem, non-negative matrix factoriza-
tion tries to approximate a matrix A with two non-negative matrix factors U and V

([104, 105])
A 8 UV

To measure the approximation quality, two cost functions are used. The ﬁrst mea-

surement is the square Of Euclidean distance

2
HA — Bll = ZlAi,j — Big)
M
and the second measurement is the divergence
Am‘
D(AllB) = Z A. .- log— - Am- + 3,,-

. . ’ Bi j ’

3,] ’
Note that the second measurement D(A|]B) is always nonnegative and reaches zero

only when Ai, j = BM holds for all (i, j) pairs.

30

An iterative algorithm has been proposed to efficiently solve the problem by itera-
tively minimizing the above two cost functions. In particular, the following updating

rule minimizes the Euclidean distance [[A — UVII

Ui a f“ Ui a live. k
l l k (UV)l,k ,
U' (— Uin
2,0. Zj Uj!a

and the following updating rule minimizes the divergence D(A|]UV)
A- k
V ,__ V U; __"';__
a,k a.,k z,a
22.: (UVh'Jc
The above two rules which guarantees the two cost functions to be non-increasing
until a local Optimum is reached. To initialize the algorithm, the two matrix factors

U and V can be seeded as random non—negative matrices.

2.3 Semi-Supervised Multi-label Learning by Constrained
NMF

The following terminology and notations will be used throughout the rest Of the
chapter. Let D = (x1,x2, . . . ,xn) denote the entire dataset, where n is the total
number of examples, including both the labeled ones and the unlabeled ones. We
assume that the ﬁrst "I examples are labeled ones, and their label information is

}nlxm where m is the number of classes.

presented in the binary matrix T 6 {0,1
Let the Similarity Of all the examples denoted by a matrix A = [Agjlnxna where
element Az-J- : 0 represents the similarity between two examples based on their
input patterns. We denote by TU: _>_ 0 the conﬁdence score of assigning the k—th
class label to the i-th example, and by t,- = (Ti,1,T,-,2, . . . ,Ti,,n)T the conﬁdence

scores of assigning each class to the i—th example. Finally, the matrix T = [Ti,kln><nr

denotes the conﬁdence scores Of assigning every class label to all examples.

31

 

2.3.1 A Framework for Multi-label Learning

The key assumption behind this work is that two examples tend to be assigned similar
sets of class labels if they share high Similarity in the input patterns. In order to utilize
this assumption for predicting class labels, we evaluate the similarity Of two examples
in different ways, one by their input patterns and the other by their assigned class
memberships. We refer to the former as the input-based similarity, and the latter
as the class-based similarity. Then, if the class labels assigned to the examples are
consistent with their input patterns, we would expect the class-based Similarities to
be close to the input-based similarities. Since the input—based Similarities are already
given by the matrix A, the key question is how to compute the similarity of two
examples based on their class memberships. The simplest approach is to compute
the class-based Similarity between examples x,- and xj by the overlap between their
classmemberships, or tthj. The problem with this Similarity measurement is that it
treats all the classes independently and therefore is unable tO explore the correlation
among them. In particular, it will give zero Similarity whenever two examples share
no common classes. However, two examples with no common shared classes can still
be strongly related if their assigned classes have close relationship (e.g. the children-
parent relationship in the hierarchy Of class labels).

To capture the Side information Of correlation among different classes, we introduce
matrix B = [Blemxm for the class Similarities. Each element Bk,l _>_ 0 represents
the similarity between two classes. Then, instead of computing the class-based simi-
larity between two examples by the direct dot product, we compute it by a weighted
dot product, i.e., tZTBtj. Then, following the assumption stated above, we would

expect Ai, j % tJBtj if the class assignments ti and t j are appropriate for examples

32

 

x,- and xj. This leads to the following Optimization problem:

n m 2
argmin 2 AM — Z Tl,kBk,lTj,l (2.3)
T i,j=1 k,l=1

s. t. Tj‘l_>_0,j:l,...,n,l=1,...,rn

7i,k:7i.krf:17"'rnlik217"'r7n (2.4)

where the last set Of constraints is to ensure that the estimated label confidences
Tiak’s are consistent with the assigned class labels Ti’k’s for all the training examples.
Remark: It is interesting tO see that the formalization in (2.3) can also be written
as a non—negative matrix factorization problem under a linear constraint, if we ignore
the constraints coming from the training examples, i.e.,
arg min ”A — THllF
,H

S. t. Tj,l,Hj,lZO,j=1,...,Tl,l=1,...,7n

H=BTT

where I] - I] F stands for the Frobenius norm. The above problem is similar to the
standard Non—negative Matrix Factorization (NMF) problem except for the linear
constraint that restricts the matrix H to be linearly dependent on the matrix T. It
is this constraint and furthermore the constraints arising from the labeled data that.
prevent the direct application of the N MF algorithm.

One problem with the formulation in (2.3) is that since the input-based similarity
Ai,k can be any positive value, it could be signiﬁcantly larger than the elements in B.
As a result, the label conﬁdence TiJc that minimizes the objective function in (2.3)
will also be signiﬁcantly larger than 1. But, to satisfy the constraints in (2.4), the
label conﬁdence Ti,k Should be restricted to 0 or 1 Since the assigned class label THC
is binary. To resolve the conﬂicts between the minimizer of the objective function
in (2.3) and the binary constraints, we introduce two sets Of label conﬁdences: the

unnormalized label conﬁdence {Ti’k}, and the normalized label conﬁdence {Ti,k}°

33

 

The former can take any positive value, while the latter is positive and subject. to
the constraints of Zlgnzl Tuk = 1. We will on one hand, use the unnormalized
label conﬁdence to minimize the difference between the class-based similarity and the
input-based Similarity, and on the other hand, use the normalized label conﬁdence to
ensure that the predicted label conﬁdence is consistent with the assigned class labels.

Formally, we can summarize this idea into the following Optimization problem:

n m 2 n m
- ~ 2
Mg?“ Z An * Z 73,kBk.sz,l + 0 Z 2031’ “27231) (2-5)
T.T.a i,j=1 k,l=1 3‘21 [:1
S. t. ijl,Tj,l,aj20,]—1,...,TL.i=1, ,m

m

23%J2L221, m2

#1

EA

T- =———=—,i=1,...,n,k=l,...,m

Note that in the above formalization, we introduce the term C 23":1 :l:1(Tj,l -—
cry-T”? into the objective function to enforce that. the two sets of label conﬁdences
are consistent and only differ by a scaling factor a j for each example. Parameter C
weights the importance of the second term against the ﬁrst term, and is determined

empirically.

2.4 Solving the Constrained NMF

An alternative Optimization approach is adopted to solve the constrained N MF. In
particular, we will solve the optimization problem by alternatively ﬁxing one set of
label conﬁdences and ﬁnding the optimal solution for another set of label conﬁdences.

More speciﬁcally, we ﬁrst ﬁx the normalized label conﬁdence matrix T and the
scaling factors aj’s, and search for the unnormalized label conﬁdence Ti, j that Op-

2
timizes (2.5). To this end, we upper-bound the term (Ai,j — Zznl=1Ti,kBk,lTj,l)

34

 

as follows

m

—TZ 1.31. 1T

k,=ll

m Ti,kBk.sz,z A“ 'I‘B’I‘T T1 kBk 1T
2 W WI _—
k,l=1 7"]

[TBTTLJT2

1433+ Z (T

2 . . . .
“=1 ——kBk 1T 11.31. zT‘,z‘2Az.JTz.kBk.lTJ.l)

Z

 

"' ~ ,~ ~. 23 J?
3 A2,]- + 5k; [TBT li,sz',kBk,lTj,l T4,, + $113
, =1 2, J.

m
— 2 Z Ai,jTi,kBk,lTj,l (1+10gT’lJc + log T3") - log TiJC — log Tjal)
k,l=l
In the above, T refers to the matrix T from the last iteration. We use the convexity
Of the quadratic function in the first step Of the derivation, and the concaveness Of
the logarithm function in the third step of the derivation. Then, we can upper-bound

the ﬁrst term in the function (2.5) as

n m 2

Z 14,-3- — Z T1,kBk,lTj,l

i,j=1 k,l=1
n 4 m

3 Z +ZlTBTTl1JlTBl11T'l4ZA1JITBl1 1T 1logT,,1
i,j=1 z 1 J [:1

77?.
~ ~ T ~ ~ T ~
- 2413]" [TBT li,j +4 Z Ai,sz'.leT lks’ 103 TU:
k=1

Similarly, we can upper-bound the second term in (2.5) as follows:

n m
__ 2
_ (72:2:(711—2 aJi3ﬂ3r+aJi3p
j=11=1
n m T102 2
- J -
:; 02:2:TMl—21ijnﬂbgf +1)+ OJTN
j=1l=1 3’1

35

By combining the above two bounds, we have the upper bound for the objective

function in (25). Taking the derivative of the bounding function with respect to Tij

 

 

we have
T T31 n. 1 1

.7 ~ ~ ~ _

4Z[’i‘B'i‘ ], J[TB], IT? —4 Z AJJ-[TBMTJJJJ— + C(QTJl — 2aJ TJ ,JJT 55;) _ 0
Z: 1 j,l 2:1 .731
which leads to the following solution
1
__ *3 2 . ~4 . “. .
CTJ, + \/C + 8UJITJJ(2VJZ + CTJlaJ)
Tj,l = (2.6) H

4U j,l
where UJ, = [TBTTTBJJJ and VJ, = [A'i‘B]Jl.
In the second step, we ﬁx the unnormalized label confidence TUE and search for

the normalized label conﬁdence T1,}? that optimizes the problem in (2.5), which leads

 

to the following optimal solution:

Tl: ——T—-’——le j— Tzl+1,. n,l—1 m (27)
3 —J————,, , — — ,..., .
3' 21: 1 TN
a]::TJ-’l,j=nl,...,n (2.8)
l=l

In summary, the iterative steps solving the optimization problem (2.5) could be

formulated as a algorithm shown in Table 2.1.

 

Step 1 Randomly initialize T and T subject to the constraints in (2.5)

Step 2 Until convergence, do
1. Fix all aj’s and T, update T using Equation (2.6)

2. Fix T, update T using Equation (2.7)
3. Fix T, update all aj’s using Equation (2.8)

 

 

 

Table 2.1. The CNMF algorithm

2.5 Experiments and Discussions

Our experiments are designed to evaluate our proposed multi—label learning frame—

work in text categorization tasks, particularly in the case of a large number of classes

36

and a small size of training data.

2.5. 1 Experiment Setup

The dataset used in our study comes from the textual data of The Eurom'sion St An—
drews Photographic Collection (ESTA) in ImageCLEF collection [77]. We randomly
pick 3456 documents, and choose the top 100 most popular ones from all the cate-
gories those picked documents belong to. On average, each document is assigned to
4.5 classes. Documents are preprocessed by the SMART system with stop words re-
moved and words stemmed, and each document is represented by a vector of weighted
term frequency.

Our proposed framework is implemented in the following way. The document
similarity Ai, j is computed as the cosine similarity between the corresponding term
frequency vectors. To compute the class similarity matrix B, we first represent each
class c by a binary vector whose elements are set to be one when the corresponding
training documents belong to the class c and zero otherwise. We then compute
the pairwise class similarity based on their vector representation using a normalized
RBF kernel. Finally, the class assignment for each test document is made by the
ranking of the label conﬁdence scores that are obtained from the matrix T. Every
experiment is repeated 10 times by randomly re-splitting the dataset into the training
and the testing sets. The parameter C in the objective function (2.5) is set to 100.
We also varied the value of C from 20 to 200, and found that the results remain
almost unchanged. For an easy reference, we will refer to the proposed algorithm as
“CNMF”.

Since our approach only produces a ranked list of class labels for a test document,
in this study, we focus on evaluating the quality of class ranking. In particular,
for each test document, we compute the precision/ recall and the F 1 measurement

at each rank by comparing the ranked classes to the true class labels. Then, the

37

 

precision/recall and the F1 measurement averaged over all the test docm‘nents is
used as the ﬁnal evaluation metric.

Three baseline models are used in our study. The first one is Spectral Graph
Transducer (“SGT” for short) [87], which has been proved effective for exploring
unlabeled data. An separated SGT classiﬁer is built for each individual document
category, and the probability values output by SGT are used to rank the class la-
bels. The second baseline model is Multi-label Informed Latent Semantic Indexing
(“MLSI” for short) [164], which maps document vectors into a low-dimensional space
that is strongly correlated with the class labels of the documents. It has been shown
empirically that MLSI is effective for exploring both the unlabeled data and the cor-
relation among classes. The last baseline model is Support Vector Machine (“SVM”
for short). A linear SVM classiﬁer based on the term frequency vectors of the docu-
ments is built for each category independently. All the baseline models are tested by
a lO-fold experiment, using the same training/ test split of the dataset as the proposed

framework.

2.5.2 Experiment Results

Figure 2.1, Figure 2.2 and Figure 2.3 show the average precision, recall, and F1
measurement, respectively, at different ranks, for both the proposed framework and
the three baseline approaches.

A comparative analysis based on the results in Figure 2.1, Figure 2.2 and Figure

2.3 lead to the following ﬁndings:

1. All four approaches show a same trend of decreasing precision and increasing
recall, when the number of labels predicted for each document increases. This
is in accordance with the usual precision-recall tradeoff. However, as a measure-
ment balancing the precision and recall, each F 1 curve clearly shows a peak. As

can been seen from Figure 2.3, the F1 curve of CNMF reaches its climax when

38

 

Precision Micro (#train = 100)

 

 

 

 

Precision Micro
O
m
l

O
I-'
i

 

 

 

5 10 15 20 25
#Labels predicted for each example
(a) #Training=100

Precision Micro (#train = 500)

 

  
     
 

—CNMF ' "
SGT
---MLSI
—SVM

 

 

 

Precision Micro
0
U1
I

 

 

 

0 J l I l L l

5 10 15 20 25

#Labels predicted for each example
(b) #Training=500

Figure 2.1. Classiﬁcation performance (Precision) when varying the number of predicted
labels for each test example along the ranked list of class labels.

39

Recall Micro (#train = 100)

l.—
0.9—
o 8‘ .. » ‘I .
o 7— -' .,
3
O 0 6' ., .. H.
E ' I-I""' " -
HOS- ,gllllii . c ”K ... ML
H , . .
8 a
g 0.4’ ’

 

I ‘——CNMF'H
.HHHSGT

,ll’ﬁ. .H. ;

25 30

0
Lu
1
F-Fd -
I—‘rt ;
F-w-ﬁ

O
l-'
I

 

 

 

 

 

  
  

0 5 10 2
#Labels predicted for each example
(a) #Tl‘ainingleO
Recall Micro (#train = 500)
0.9— ll-iI'IR-I‘E‘E'I'E'!
0.8-
0.7-

Recall Micro
0
U1
I

 

 
 

 

 

 

 

 

 

o.3~ _CNMF
0.2—. ”"”'SGT --
---MLSI
0.1-1 _. . . . . —SVM
o l I l l 1 4|
0 30

S 10 1 0 2
#Labels predicted for each example
(b) #Training=500

Figure 2.2. Classiﬁcation performance (Recall) when varying the number of predicted
labels for each test example along the ranked list of class labels.

40

P1 Micro (#train = 100)

 

 

 

 

F1 Micro

 

 

 

0 , 1 1 i 1 l i

5 l O 1 5 2 O 2 5 3 0

#Labels predicted for each example
(a) #Training2100

F1 Micro (#train = 500)

 

 

 

 

F1 Micro

 

 

7 5 10 15 20 25 30

#Labels predicted for each example
(b) #Training=500

Figure 2.3. Classiﬁcation performance (Fl) when varying the number of predicted labels

for each test example along the ranked list of class labels.

41

the number of predicted labels is around 3 to 4, which is close to the average

number of labels per document. (i.e., 4.5).

. CNMF makes more significant improvement on the average recall than on the
average precision when compared to the three baseline approaches. This is re-
lated to our task scenario, which focuses on multi-label learning with a large
number of classes and a small size of training examples. Given such a sce-
nario, we would expect a number of classes that are not provided with sufﬁcient
amount of training examples. As a result, we hypothesize that prediction on
these classes will have to rely heavily on the correlation among classes. This hy-
pothesis is partially justiﬁed by the comparison between the proposed approach,
CNMF, that exploits class correlations, and SGT or SVM, which does not.
Although CNMF and SGT achieve similar performance in precision, CNMF

performs significantly better than the SGT in terms of the average recall.

. More improvement by CNMF to the three baseline approaches is observed
when the number of training documents is 100 than when the number of training
documents is 500. This is partially due to the same reason mentioned above
— the advantage of exploiting class correlations on sparse training data. It
can also be attributed to the reason that our approach also makes use of the
correlation among the unlabeled data, which has been proved by many studies

in semi-supervised learning, for instance [173, 129, 172].

. Although MLSI is able to explore the correlation among classes, its performance
depends heavily on the appropriate choice of the number of latent variables and
the tuning parameter determining how much the indexing should be biased by
the outputs. These two parameters are usually determined by a cross valida-
tion approach and therefore could be problematic when the number of training

examples is relatively small. This problem is directly reflected in the large vari-

42

ance in both precision and recall of the MLSI algorithm, which we believe it is
due to the inappropriate choice of the aforementioned two parameters given the

limited number of training examples.

2.6 Conclusions

In this chapter, we propose a novel framework to accommodate the side information
of class correlation in multi-label learning, which meets the challenging situation of
a large number of classes and a small size of training data. The advantage of our
proposed framework is that it is able to exploit the correlation among classes and the
unlabeled data. We also present an efﬁcient algorithm to solve the related optimiza—
tion problem. Experiments show that our proposed framework performs signiﬁcantly
better than the other three state-of-the—art multi-label learning techniques in text

classiﬁcation tasks.

43

CHAPTER 3

Semi-supervised Classiﬁcation with Link

Information

3.1 Introduction

From supervised learning to semi-supervised learning, machine learning research has
witnessed the trend of exploiting the structure of unlabeled data. A typical approach
towards revealing the underlying structure of unlabeled data is through establishing
correlation between example pairs from their data representation (or feature values).
For example, many graph-based learning models construct a k-nearest neighbor graph
by choosing an appropriate distance measurement deﬁned on the data representation
of a pair of examples, such as Local Linear Embedding [124], ISOMap [142], Laplacian
Eigenmaps [17], Manifold Learning [18] and etc.. However, many real-world datasets
already exhibit inherent correlations by entities that are interlinked with each other.
Especially in text domains, datasets features with “links” are very popular: nearly
all kinds of scientiﬁc publications are cross-referenced by each other; and the World
Wide Web is weaved by hyperlinks that connect pages. One can also ﬁnd datasets
with link information in other domains, such as social networks, bioinformatics, etc.

Naturally, link information suggests certain structure underlying the dataset. For

example, an empirical study showed that the Web’s spatial locality (induced by hy-

44

perlinks) mirrors its topical locality [47]. h‘lor‘eover, link information usually provides
a different view on the structure of data than that “computed” from the data rep-
resentations, since link generation often involves human interference. For example,
when a web author constructs hyperlinks, he or she can capture more sophisticated
semantic closeness between web pages that is beyond the power of state—of—the-art
text mining tools. Therefore, exploiting link information leaves the hope of gaining
more insight into the structure of data for many learning algorithms.

However, while it could be informative, link information also often presents some
annoying characteristics in practice. First, link information could be sparse, i.e.,
some data examples could be involved in no links. Second, link information are
often incomplete, i.e., one cannot always expect a link being observed wherever a
link “should” be constructed between two examples. For example, we cannot hope a
scientiﬁc publication to cite all work that is related to itself. Third, link information
tends to be noisy. One example to illustrate noise in links is the hyperlink spam
on the Web. In summary, in real-world datasets, link information is typically not a
reliable source for the structure of data. Therefore, link information is always treated
as “side information” that supplements the original data representations for learning
tasks.

Informative but unreliable, link information provides opportunity as well as chal-
lenge for semi-supervised learning. In this chapter, we will focus on classiﬁcation
on datasets with link structures. Depending on how link information is incorpo-
rated into a learning algorithm, previous studies on this topic can be roughly cat-
egorized into three types of models: representation augmentation, correlation aug-
mentation and training pool augmentation. The ﬁrst type modiﬁes the representa-
tion of a data example from its neighborhood that is induced from the link struc-
ture [33, 118, 61, 63, 162, 48, 120, 130, 40]. The second type utilizes link information

to construct a graph [168, 167, 7], or derive more accurate similarity measurement.

45

between data examples [32], and apply (or design) semi-supervised learning algorithm
over the graph. Essentially, link information plays the role of establishing correlation
between examples. The third type uses linked examples to augment the training set,
represented by the co-training algorithm and its variants [27, 39]. Usually a feature
based classiﬁer and a link based classiﬁer are applied alternatively, supplementing
each other’s training pool with its highly conﬁdent predictions.

However, the performance of all three types of models summarized above can
degrade signiﬁcantly with a decreasing number of links. When links are sparse, only
a small fraction of data is involved in some link(s). No matter a model augments
the data representation, correlation or training pool, an unbalanced issue is created
between those data examples that are inﬂuenced by the links and the rest that are
not. To overcome this problem brought by the sparseness of link information, it
is desirable that the limited link information can be somehow “smoothed” over the
entire dataset.

In this chapter, we will propose a novel semi-supervised framework, termed as
“LinkBoost”, to improve classiﬁcation accuracy by utilizing link information. Link-
Boost is a boosting framework: a base classiﬁer is applied iteratively with augmented
training pool, which is updated through a minimization of inconsistency between the
class label assignments and the pairwise data similarities that is augmented by the
link information; with a series of learned weights, the classiﬁcation results from all
iterations are ﬁnally combined as the ﬁnal prediction. Thus the classiﬁcation results
from the iteratively applied classiﬁer with updating training pool act as a way of
propagating link information around, i.e., “smoothing” the influence of links over the
entire dataset. As a result, LinkBoost is more robust against the sparseness of link
information. Besides, as a general boosting framework, LinkBoost can be applied to
any classiﬁcation algorithm. LinkBoost framework is also hybrid in the sense that it

can accommodate all three augmentations from link information on the dataset: rep-

46

resentation augmentation, correlation augmentation and training pool augmentation.

The following notations will be used throughout this chapter. Let X = {xi}?:1
denote a set of n examples, where the ﬁrst 77.] ones are labeled and the rest nu ones
are unlabeled, i.e. n = n] + nu. We also use x,- to represent the feature vector of
the i-th example, and use X to denote the matrix that gathers all the feature vectors
of examples, i.e., X = [x1, - -- ,xn]. Let y 2 {Eli [‘21 be the label vector, in which
{gig-:1 is given, and {Mill—7:1 is to be decided. SiJ is deﬁned as a similarity
measurement between the i—th and the j—th examples, and a matrix is formed as
S = [8i,j]an. The link information is encoded into a matrix R = [rm-Rx”, where
TiJ = 1 if there is a link from x,- to xj and Ti, j = 0 otherwise. From the link matrix,
a co—citation matrix can be deﬁned as C = [ci3j]an, where Ci, j = 1 if 31: 75 2', 3' such

that THC = 1:73,]: = 1, and Cz’,j = 0 otherwise 1.

3.2 Related Work

In this section, we will brieﬂy review previous efforts in the area of link-based classi-
ﬁcation (especially in text domain), and a few semi-supervised classiﬁcation models
that are closely related to this chapter, followed by a recapitulation of several repre-

sentative algorithms.

3.2.1 Link-based Classiﬁcation

The most popular way of link-based classiﬁcation is to augment the representation
of data examples with their neighbors. In [63, 32, 120, 130], the bag-of-words model
of documents is augmented by including words from neighboring documents, thus

creating a “virtual document” to feed into classiﬁers. However, the studies in [33, 118]

 

1This co—citation matrix is deﬁned in the “co-citing” manner, i.e., two examples are correlated
if they both link to a. third example. Another way is to deﬁne co—citation matrix in a “co—cited”
manner, i.e., two examples are correlated if they are both linked from a third example. Deciding
which deﬁnition is more appropriate is domain dependent.

47

suggests that incorporating words from neighboring documents sometimes leads to
performance degradation, while making use of their predicted class labels as additional
features for data representation was helpful. Later in [107], different schemes of
formulating the additional link-based features were discussed. Apart from explicitly
augmenting data representations with features or labels of linked examples, another
stream of research tries to unify content analysis with link analysis by creating new
data representation for linked data examples in the latent space [48, 40].

When links are often directional (which is true in most cases), there exists several
different criteria to identify neighborhood for a data example. The most frequently
used criteria include incoming links, outgoing links and co—citation links [107, 61,
162, 32]. Especially, the studies presented in [61, 162] suggests that different dataset
regularities may favor the use of different neighborhood identiﬁcation criteria.

Rather than augmenting data representation to better explore the underlying
structure of data for semi-supervised learning, a few studies directly create a graph
structure from link information. These approaches often leads to semi-supervised
learning algorithms over graphs: an iterative relaxation labeling algorithm is pro-
posed in [7] for undirected graph; the work in [168, 167] handles directed graph by

regularizing classiﬁcation functions to change slowly on densely linked subgraphs.

3.2.2 Related Semi-supervised Classiﬁcation Models

A large number of models have been proposed for semi-supervised classiﬁcation, which
can be broadly categorized into several types: graph-based models, margin-based
models, kernel-based models, ensemble-based models, and etc. Although only a small
fraction of models directly address the use of link information, at least two types
of models are closely related to the LinkBoost framework proposed in this chapter:
graph-based models and ensemble-based models.

Graph-based models build a connected graph on both labeled and unlabeled ex-

48

amples, with each vertex representing an example and each weighted edge represent-
ing correlations between example pairs. The most well-known graph—based models
include Harmonic Functions [173, 174], Spectral Graph Transducer [87], Gaussian
Processes [4, 103, 67], Manifold Regularization [18], Label Propagation [19], etc. A
common theme shared by many graph-based semi-supervised classiﬁcation models is
to find an optimal set of class labels for unlabeled examples, such that they are consis-
tent with supervised class labels from labeled examples, as well as the graph structure.
Graph Laplacian is a popular form to deﬁne an inconsistency measurement over a
set of class assignment y = {21,-} and the graph represented by a similarity matrix
S = [8233']

F(y) = $21 23-21 Sig-(y.- - 31))2 = yTLy

he above inconsistency measurement, is always combined with other components
to form an objective function for minimization. It is easy to imagine, when link
information is available, we can enforce class assignment to be consist with it, in
a similar way as graph-based models. As will be seen in Section 3.3, the proposed
LinkBoost framework uses a similar deﬁnition inconsistency measurement, except in
the form of exponential cost (to facilitate deriving a boosting algorithm) rather than
quadratic cost. Moreover, unlike most graph-based models, which are non-parametric
and do not build speciﬁc classiﬁcation models, the proposed LinkBoost framework is
able to yield one classiﬁcation model based on any given classiﬁcation algorithm.
This is particularly useful, when the amount of unlabeled data is extremely large.
Ensemble-based models gained increasing attention in semi-supervised learning
by several successful models, such as AdaBoost [56, 55], ASSEMBLE [21] and Semi-
supervised Margin Boost (SSMB) [43]. Usually, in an iterative manner, pseudo labels
are assigned to unlabeled examples, which are then sampled for training a new su-
pervised classiﬁer from an ensemble of them. The proposed LinkBoost framework

follows the idea of iteratively augmenting the training set, but it utilizes link infor-

49

mation to obtain more reliable pseudo labels for unlabeled examples. In this sense,
LinkBoost framework combines the advantages from both graph-based models and

ensemble—based models.

3.2.3 Representative Algorithms Review

Nearly all the algorithms proposed on link-based classiﬁcation make the assumption
that interlinked (or co—citation) examples are more likely to belong to the same class.
In the following, we will recapitulate a few representative algorithms proposed for

semi-supervised classiﬁcation on datasets with link information.

FEATURE REPRESENTATION AUGMENTATION

In the family of feature representation augmentation algorithms, a example’s feature
set is supplemented with features from its interlinked / co-citation examples. Formally,

we can write the augmented example as
xa“g = (1 -— A)X + AMTX (3.1)

where the matrix M = R if we only want to augment an example with incoming
links, or M = R+ RT if both incoming links and outgoing links are used, or M = C
if only co—citation examples are considered for augmentation. A is a weight to put on
the augmentation from linked examples.

The augmented feature representation will be fed into an semi-supervised learning

algorithm for classiﬁcation.

CO-TRAINING FOR LINK-BASED CLASSIFICATION

Co—training is ﬁrst proposed in [27], in which two or possibly more learners are trained
separately on a set of examples; each learner uses a different, and ideally independent,

set of features for each example. Co—training can be naturally applied for link-based

50

classiﬁcation: one learner is based on the data representation, the other learner is
based on link information.

Speciﬁcally, the co—training algorithm for link-based classiﬁcation can be formal-
ized as follows

Co—training for Link-based Classiﬁcation
Step 1 Initialize a training pool with all training examples R = x1, - - - ,xnl.

Step 2 Iteratively apply the following two learners

0 Train a data representation based learner on current training pool R, and
apply the learner to predict labels for the rest examples. Update the
training pool R by move those examples with high prediction conﬁdence
into it.

0 Train a link based learner on current training pool R, and apply the learner
to predict labels for the rest examples. Update the training pool R by move

those examples with high prediction conﬁdence into it.

Until no more examples can be added into the training pool T.

ITERATIVE CLASSIFICATION ALGORITHM

Lu &. Getoor proposed an iterative classiﬁcation Algorithm (ICA) in [107]. Different
from the feature representation augmentation approach, ICA augments the data rep-
resentation by a new set of features that summarize the class label statistics (from
prediction of the previous iteration) of interlinked /co—citation examples. Due to the
fact that newly introduced label feature set is different from the original feature set
in nature, two logistic regression models on both feature sets are combined to form a
prediction.

To formalize the ICA algorithm, let us introduce Z7; to represent the new feature

vector, i.e., linked examples’ label statistics, for the corresponding data example

51

xi. For example, for a C-way classiﬁcation problem, 2,; could be defined as a C-
dimensional vector, where a element 25,-], k counts the number of examples from those
linked with x; that belong to the k-th class.

Then the ICA algorithm can be summarized as follows

Iterative Classiﬁcation Algorithm

(Bootstrap) Initially assign class to each example based on its original feature repre-

sentation Xi-

(Iteration) lteratively apply the following two learners

0 Form a new feature representation 2, for each example xi, by gathering
the label statistics based on current class assignment to linked examples.
0 Train two separate logistic regression models on both feature representa-

tions, i.e., x,- and 2,, by the following MAP estimation

PFD’IX) = Pr(¥|{xi})Pr(YI{Zi})

where

Til—+7111,

Pr<y|{x.-}) = Z 1

i=n1+1 exp(—yingi) + 1

 

TLl-i-nu

Pr(¥|{zz’}) = Z

i=nl+1

1
exp(—y.-wlz.-> + 1

 

In the above, W; and w; are two parameters for the logistic regression
model on either feature set.
0 Apply the combined logistic regression model to the test examples and

update their class assignments.

Until no updates are made on class assignments or a maximum number itera-

tions has been reached.

52

3.3 LinkBoost Framework

For simplicity, here we only consider binary classiﬁcation, where the label vector y
can only take values y,- = 1 or y, = —1.

As a semi-supervised framework, LinkBoost inherits the common theme from
graph-based models that consistency is enforced between the link-information induced
graph structure and the class assignment to the unlabeled data. On the other hand,
as a general framework that is able to boost any base supervise learning algorithm,
LinkBoost follows the idea of ensemble-based models that a sampling of unlabeled
examples based on predicted pseudo labels will be used to iteratively augment the
training set of the Speciﬁed supervised algorithm. In the rest of the section, we will

formalize the LinkBoost framework and derive the related algorithm.

3.3.1 Objective Function

To encode the structure of both labeled and unlabeled data, we use a Similarity matrix
S = [5,0]an to combine both the link information and the feature representation of

data examples

SiJ = (1 — a)sz’m(x,-,xj) + arm- (3.2)

Here sim(-, -) deﬁnes a Similarity measurement based on the features. For example,
we can use cosine similarity in text domain. Ti, j is an element from the link matrix R
which, more speciﬁcally, encodes all the incoming links; the link matrix R can be re-
placed with R+RT if both incoming and outgoing links are taken into consideration,
or be replaced by the co-citation matrix C if co—citation links are more appropriate
to disclose the structure of data. Making a good choice on the link matrix is usually
dependent on the regularity presented by the dataset for classiﬁcation, as suggested
in [61, 162]. a is a combination weight factor, 0 S a S 1. The above similarity

deﬁnition can be viewed as using data represenatino based similarity computation to

53

smooth the links, thus minizing the problem brought by the sparseness in the link
information.

We deﬁne the inconsistency, within the unlabeled data, between the class labels

nu . . .
{9;}?! —nl ++1 and the SIIIlll‘dl‘lty measurement S as
Tll'l'nu
Fun 2 Z Siaj C()Sl’l(‘y.lj — 313') (3.3)
i’ij:nl+l

where cosh(-) is the hyperbolic cosine function, i.e., cosh(y,- — yj) = [exp(y,~ — 31]) +
exp(yj — y,)] / 2. When symmetric similarity measurement is considered, as in this

paper, the above inconsistency can be rewritten as

Til-+7111, ”(+7111 1
Fuu = Z 532:4 EXPO/i — yj) + 2: 581,1 exp(yj - yr)
z',j=nl+1 i,j=nl+1
TLl'l-Tlu
= 2 Sid exp(y,- — yj) (3.4)
i,j=nl+1

Also, we deﬁne the inconsistency, across the labeled and unlabeled data, between
the class labels and the similarity measurement as

nl nl+nu

Flu: Z :5 j(exp -2y7:yj) (36)

i=1 j: nl+1
Ideally, the labels decided for the unlabeled data should minimize both inconsis-

tencies stated above. This leads to the following optimization problem

y,-,i=n +l,...,n +nu
l l l

3.3.2 Boosting Algorithm

In this subsection, we will derive a boosting algorithm that solves the optimization
problem (3.6) in an iterative procedure. Speciﬁcally, given an arbitrary binary clas-

siﬁcation algorithm A, let h(t)(x) denote the classiﬁcation model learned in the t-th

54

iteration by this algorithm. Then the ﬁnal classiﬁcation model, after T iterations, is

the combination of T models learned at all iterations, i.e.
’ T
f1(1)(x) = Z a(tlli(t)(x) (3.7)
t=l

where the a“) Z O is the combination weight. Here the superscript in parenthesis
indicates the iteration number. Finally we will apply this model to predict the labels
for the unlabeled data, i.e., y,- = H(T)(x,j),i = "l + 1, . . . ,nl + nu. To derive a
boosting procedure that minimize the objective function (3.6), we need to ﬁnd a
good combination weight at each iteration, so that the objection function yields a
decreased value from previous iteration.

N ow we study the change of objective function from the t — 1 iteration to the t-th
iteration. TO simplify the notation, let H,- denote the class label of the i—th example
(unlabeled) predicted by the combined model from all t -— 1 iterations, and h,- denote
the same example’s predicted label at the t-th iteration. We also simplify am as a.

Now the objective function becomes

Flt) = F513,) Flat) (3.8)

= 2: Sid exp(H,;+ah,- —Hj —o'hj)
i,j=nl+1

"l nl+nu

' Z Z 8i,j€XP(-2yi(Hj+ahj))

i=1j=nl+1

Using the inequality of arithmetic and geometric means, we have

exp[a(h,- - hj)] g [exp(2ah.,-) + exp(—2ahj)] (3.9)

ml?“

55

. . , l
So we can bound the ﬁrst term 111 (3.8), FIE-1,2 , as follows

'nl-l-Tlu
t
File) 2 Z 5i,j exp(H,j + ah; _ Hf _ Obj)
i,j=’nl+l
nl+nu
: Z .sz-J- exp(Hz- — Hj) 0Xp(a(hi " hjll
i,j=nl+1
1
E Z 28i’j exp(Hi — Hj) [exp(2(ih,j) + eXp(—2(ihj)]
i,j=nl+1
nl+nu n1+nu
1
= Z exp(2ahi) 2 531,3‘ 8XP(H2' — Hj)
nl+nu nl+nu1
+ Z exp( —2(rhj) 2% jexpf (iH— H j)
j: 771+1 Z: —Tl-21+1
n1+nu n1+nu 1
= E: exp(20zhj) Z §5j,i exp(Hj — Hi)
jznl+1 i=nl+1
nl+nu nl+nu 1
+ Z exp(—2ahj) Z 5814' exp(Hi _‘ Hj) (3-10)

In the last step above, we switched the index 2' and j in the ﬁrst term for notational

convenience.

If we deﬁne
d f nl+nu 1
aj g 2 58.7.31: €Xp(Hj - Hi) (3.11)
i=nl+1
e
b,- = Z 53,-0- expw, — Hj) (3.12)
i=nl+1

then we can Simplify the bound for the F52 in (3.10) as

nl+nu
F1)? < : exp(2ahj)aj+exp(—2ahj)bj (3.13)

j=nl+l

56

Similarly, we can simplify HEEL—1) as
nl+nu
Edit—1) = Z 82,) exp(Hi —Hj)
i,j=nl+1
nl+nu nl+nu1nl+nu nwl+nu1
= Z 2% jexp( (iH- Hj) )+ Z: 2% 3'9fo (iH— Hj )
t: —nl+1j= "1+12 j: n1+1i= nl+l2
nl+nu nl+nu1nl+nu nl+nu1
= Z Z 2 sJ-iexp( (jH —H,-) )+ Z 2% jexp( (Hi— H)
j=nl+1i=nl+1j=nl+1i=nl+12
nl+nu
= Z aj+bj (3-14)
j=nl+1

The second term in (3.8), F [(5), can be rewritten as

() nl nl-i-nu
t
Flu = Z :8 jexp( _QLZ/z'lHj'l—ahjll
i=1j=—’nl+lz
= Z: 232-3]-exp(—2%Hj)ex1)(—2Oy,jhj)
j=n1+1i=l
nl+nu "l
= Z exp(2ahj)Zsivjexp(2Hj)6(yi,—1)
j=nl+l i=1
nl+nu n,
+ 2 exp( —2-)ah )Zsijepr“ 2Hj)6(yi,1) (3.15)
j: nl+l i=1

In the above, we use the fact that the label for a labeled example, yi, can only be 1
or —1. And the delta function 6(a), y) = 1 if :1: = y and 6(.r, y) = 0 if :2: # y.

Again, if we deﬁne

nl
d.f -
cj :9 ZsmeprH )0(J,,1) (3.16)
i=ll
d,- “2‘ :3, Je\p( 2H )6( 1) (3.17)
i=1

57

then we can simplify the bound for the F512) in (3.10) as

"l +nu

F182) 2 Z exp(20hj)cj+exp(—2crhj)(lj

j=nl+1

F(t’"1)

Similarly, we can simplify [u as

n) nl+nu
F(t—1)

lu = Z Z 2"“ QyiHjl

i=1j= —nl+1

nl+nu "l

= Z ZsijeprH') 6(1/i,— 1)+€XP(—2Hj)5(yia1)

j: nl+1i=1

Til-+7121,

= Z Cj+dj

j=nl+1

(3.18)

(3.19)

Given the above results, we can study the bound of the objective function change

from the t— 1-th iteration to the t—th iteration. However, to further simplify notations,

let us ﬁrst deﬁne

 

 

 

 

{1.7 : n.l+7zu]a b
23': =nl+1 0.? J
3 bi
j — nl+nu
ijnl+1a3+b3
c-
éj : Til-PHIL] i
23': =nl+1 Cj+‘J
2 d1"
.7 _ nl+nu
E:j= =nl+1 CJ+dJ

(3.20)

(3.21)

(3.22)

(3.23)

then we have

(t) W F“)
log [(11) = log ﬁlial)+log (5'11)
F ' F ' F

U1],

 

 

 

+- .
23:71:11 exp(ZQhJ-Mj + exp(—2dhj)bj

nl+nu
ijnl+1 “J + b]

|/\

 

log

Z711 +7lu

j=nl+1exp(2ohj)cj + exp(—2alzj)dj

 

+10g n n.
23-51131 c,- + 0’2
7Ll+nu
: log 2 exp(2a-hj)&j + exp(—2o'h.j)f)j
j=nl+1
nl+nu
+ log 2 exp(2ahj)cj + exp(—2ahj)cfj
j=nl+1
1Ll+nu
S E exp(2ahj)((ij + 5]) + exp(—20h.j)(5j + dj) — 2 (3.24)
j=n1+1

In the last step above, we used the inequality
logs: 3 23—1, Vx>0 (3.25)
Furthermore, if we apply the following inequality
exp(yx) g exp(v) + exp(—'7) - 1 + 7:17, V2: 6 [—1, +1] (3.26)
we can further bound log Fft) as
F(t—1 )

3 Z (Eij + 5]) [exp(2a) + exp(—2a) - 1+ 202th
j=nl+1

+(f2j + dj) [exp(2a) + exp(—2o) — 1 — 2ahj]

nl+nu
= Z (Eij + 53- + éj + dj) [exp(2a) + exp(—2a) — 1]

 

— . Z 20hj(f)j + (17,- -— 51,-— 5,) — 2 (3.27)

59

t
Now we have two upper bounds for log %, (3.24) and (3.27). Both bounds

(t)

. F u ‘ n , __ -, . _ , 3 __ , ,
and log F t—l) touch at a — 0, Since they all reach 0 when a — 0. Fuithermoie,

log Fit) 2 0 means F (t) 2 F071). Therefore, as 10110 as we minimize either

Hm) ..
upper bound, we can always have F (t) S F (t‘l), i.e., keeping the objective function
non-increasing in the iterative procedures.

The upper bound (3.27) is good for designing a sampling scheme for a boosting
algorithm. Speciﬁcally, to help lower such an upper bound, we expect the label value
hj (at the t iteration) to be consistent with the sign of (ﬁj + dj — cij — 5]). This
gives us a good hint on sampling unlabeled data for training: the j-th unlabeled
example should be labeled as sign(5j + dj — (ij — 5]) and sampled with a probability
proportional to [(5, + dj — fij — 55)].

Finally, to ﬁnd the optimal a, we can minimize either upper bound in (3.24) or
(3.27). Here we choose the former, because it is tighter than the other one, and also
minimizing it leads to a solution which is easier to compute. In particular, we take

the ﬁrst-order derivative (w.r.t. a) of the upper bound in (3.24) and set it to zero,

 

i.e.
nl+nu nl+nu
Z exp(2ah.j)2hj(&j + 5]) — Z exp(—2ahj)2hj(5j + dj) 2 0(3.28)
j=nl+1 j=nl+1
or equivalently
nl+nu nl+nu
Z exp(2oz)(éj + Ej)6(hj,1) — Z exp(—2a)(&j + Ej)6(hj, —1)
j=nl+1 j=nl+1
nl+nu nl+nu
— Z exp(—2a)(b,- + d,)3(h,-, 1) + Z exp(2a)(bj + d,)5(h,-, —1)
j=nl+1 j=n1+1
= 0 (3.29)
Solving the above equation, we have
Tll+nu ~ __ - ~ ~ -
a _ 1 Zj=nl+l(aj + Cj)()(hj,—1)+(bj + dj)()(hj,1) (3 30)
_ _ n +n ~ ., ~ ~ '
4 z I ’“ (a,+c,-)6(h,-,1)+(b,- +dj)6(hj,—1)

j=nl+1

60

 

 

Input
0 X: d x 12. matrix for the input data
0 A: given classification algorithm

Output: class labels
Algorithm

0 Compute pairwise similarity matrix S.
o Initialize H(O)(x) = 0.

o Fort=1,2,...,T

~

,3], 5,31,. using (320)—(323)

- Compute class label for each unlabeled example xj as sign(bj + dj —
aj — cj).

— Sample unlabeled examples with probability proportional to |(I3j +dj —

— Compute ii

~

aj - éj)|
— Adding sampled examples to the labeled examples, train a binary clas-
siﬁer h(t)(x) with the algorithm A

-— Compute the optimal am using (3.30)
— Update the classiﬁcation model as H(t)(x) <— H(t_1)(x) +a(t)h.(t)(x)

o Predict class labels with the ﬁnal classiﬁcation model H (t)(x)

 

Figure 3.1. LinkBoost framework.

3.3.3 LinkBoost Framework Summary

The LinkBoost framework can be summarized into a meta algorithm, as shown in

Figure 3.1

As we can see, LinkBoost framework is able to take any supervised algorithm as

the base classiﬁer. This is more useful as opposed to other link-based classiﬁcation

method that design a special algorithm. Consider the situation that we ﬁnd a spec-

iﬁed classiﬁer which works particularly well for a given domain. When we gained

more understanding on the dataset (in the form of “links”), we want to stick with

61

 

the good classiﬁer and further improve its accuracy. This requires a framework that
is ﬂexible enough to accommodate any classiﬁer, and is able to evaluate its perfor-
mance (without knowing ground truth) and make adjustment. LinkBoost meets such
a requirement in the following way: it iteratively applies the classiﬁer with a different
training set; by checking the inconsistency between current classiﬁcation results and
the structure of data that. is estimated from the link information and data repre—
sentations, it generates heuristics on assigning pseudo labels to unlabeled data, thus
updating the training set; at the same time, LinkBoost estimated the appropriate
amount of trust we can put on the current prediction for ﬁnal combination.

It is also worth noting that even when no link information is available, LinkBoost
can also work as a generic semi-supervised learning algorithm, by simply using the

similarity computed from data representations alone in (3.2) in Section 3.3.1.

3.4 Experiments and Analysis

In this section, we will present a few experiments to verify the proposed LinkBoost
framework. Speciﬁcally, through the experiments we try to address the following two

research questions

1. Is the propose LinkBoost framework effective in improving classiﬁcation perfor-

mance with link information?

2. As a general semi-supervised boosting framework, is LinkBoost able to improve

any supervised classiﬁcation algorithm?

3.4. 1 Experiment Setup

Two datasets of scientiﬁc publications will be used in our experiments: “Cora” and

“Citeseer”. We describe the two datasets here.

62

 

ID Class Name Size
Neural Networks 818
Rule Learning 180
Reinforcement Learning 217
Probabilistic Methods 426
Theory 351
Genetic Algorithms 418
Case Based 298

 

 

 

 

 

NCUUIACOMH

 

Table 3.1. Classes in Cora dataset.

 

 

ID Class Name Size
1 Agents 596
2 Information Retrieval 668
3 Database 701
4 Artiﬁcial Intelligence 249
5 Human-computer Interaction 508
6 Machine Learning 590

 

 

 

 

 

Table 3.2. Classes in Citeseer dataset.

Cora dataset The Cora dataset consists of 2708 scientiﬁc publications classiﬁed into
one of seven classes, as shown in Table 3.1. Each publication in the dataset is
described by a 0/1-valued word vector indicating the absence/ presence of the
corresponding word from the dictionary, which consists of 1433 unique words in
total. The citation network consists of 5429 links. By checking with the ground
truth class assignment of publications, we ﬁnd that about 81.38% of the links
correctly indicates that the linked publication pair should be put in the same
class. Furthermore, the links are not evenly distributed across all publications.
For example, there are about 100 publications each involved in more than 10

links, while there are also 1143 publications not involved in any link.

Citeseer dataset The CiteSeer dataset consists of 3312 scientiﬁc publications classi-
ﬁed into one of six classes, as shown in Table 3.2. Each publication in the dataset

is described by a 0/1—valued word vector indicating the absence/ presence of the

63

corresponding word from the dictionary, which consists of 3703 unique words in
total. The citation network consists of 4732 links. By checking with the ground
truth class assignment of publications, we ﬁnd that about. 74.61% of the links
correctly indicates that the linked publication pair should be put in the same
class. Furthermore, the links are not evenly distributed across all publications.
For example, there are about 50 publications each involved in more than 10

links, while there are also 1358 publications not involved in any link.

As we can see, the link information in both datasets are sparse and noisy. Therefore,
these two datasets can be seen as representative of real-world data.

Given a supervised classiﬁcation algorithm, we will apply the proposed LinkBoost
framework to boost its classiﬁcation performance on the two datasets described above,
using the link information. Due to the regularity presented in the two datasets, we
will use both incoming and outgoing links to form a link matrix, i.e., in Equation (3.2)
R+ RT will be used as the link matrix to compute similarity between data examples.
And the weight factor a is set to 0.4. As mentioned before, the proposed LinkBoost
Framework can accommodate augmented feature representation, for example, from
the Feature Representation Augmentation algorithm reviewed in Section 3.2.3. In
our experiments, we implemented LinkBoost on both the original feature space, and
the augmented feature space using the algorithm mentioned in Section 3.2.3. For
notation brevity, we will refer to the former method as “LB-OR”, and the latter
one as “LB-AR”. In LB-AR, The link matrix in Equation (3.1) takes the form of
R + RT, and the weight factor A = 0.5.

To compare with the proposed LinkBoost algorithm, two baseline algorithms are
implemented. The ﬁrst baseline algorithm is Feature Representation Augmentation
(“RepAug” for short). The second one is Co—training (“Cotrain” for short). Both

baseline algorithm were reviewed in Section 3.2.3.

64

Since only binary classiﬁcation model is discussed in this chapter, we tested all the
possible class pairs in both datasets. For each class pair, we randomly chose 5% data
examples for training and the rest for test. Each classiﬁcation was repeated 10 times
by using different randomly selected training data, and the classiﬁcation accuracy

was averaged over the 10-fold experiments.

3.4.2 Robustness against Sparse Link Information

As discussed in Section 3.4.1, the link information in both Cora and Citeseer datasets
is noisy. To further verify the proposed LinkBoost algorithm and its robustness
against sparse link information, we gradually decrease the number of links being used,
and compare the classiﬁcation performance of the four algorithms. Table 3.3 and
Table 3.4 give the classiﬁcation accuracies on Cora dataset; Table 3.5 and Table 3.6
give the classiﬁcation accuracies on Citeseer dataset. In this group of experiments,
Support Vector Machine (implemented by SVMLight software) is used as the base
supervised classiﬁcation algorithm to be boosted by the LinkBoost framework.

As we can see, with all the available link information being used, LB-AR and
AugRep deliver comparable performance, while overall speaking LB-AR is slightly
better. LB-OR is suboptimal among the four methods; and Cotrain almost always
performs worst. When the number of links being used is decreasing, the performance
of all four methods degrades as expected. However, with the link information getting
even sparser gradually, the advantage of LB-AR over AugRep becomes signiﬁcant,
since their performance gap enlarges. Moreover, when only 30% - 10% of links are
used, LB-OR also outperforms the AugRep. The above observations suggest that
LinkBoost is more robust against the sparseness of link information.

Comparing the performance of LB-OR and LB-AR, i.e., applying LinkBoost on
both original feature space or augmented feature space, leads to the following ﬁndings:

when 80% ~ 100% links are used, LB—AR performs better than LB-OR; however,

65

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Classes Model Percentage of Links Used
100% 90% 80% 70% 60% 50% 40% 30% 20% 10%
1 vs 2 AugRep .900 .898 .891 .888 .883 .875 .866 .859 .849 .841
Cotrain .836 .835 .837 .839 .841 .845 .843 .844 .844 .838
LB-OR .891 .889 .890 .890 .884 .890 .882 .886 .888 .885
LB-AR .911 .911 .905 .907 .900 .896 .897 .889 .889 .898
1 vs 3 AugRep .912 .910 .907 .902 .897 .893 .883 .873 .863 .846
Cotrain .840 .841 .840 .836 .842 .842 .839 .848 .849 .846
LB-OR .877 .882 .876 .873 .870 .869 .862 .864 .866 .864
LB-AR .911 .908 .908 .903 .895 .891 .880 .874 .869 .857
1 vs 4 AugRep .841 .835 .826 .822 .810 .798 .794 .779 .759 .748
Cotrain .732 .729 .727 .726 .728 .727 .730 .727 .735 .744
LB-OR .809 .804 .807 .806 .794 .789 .784 .772 .768 .775
LB-AR .850 .840 .836 .832 .815 .809 .802 .804 .779 .782
1 vs 5 AugRep .847 .844 .841 .832 .826 .816 .807 .790 .777 .770
Cotrain .763 .759 .753 .750 .748 .746 .747 .752 .757 .762
LB-OR .821 .814 .811 .806 .800 .802 .795 .792 .788 .786
LB—AR .846 .840 .845 .833 .824 .814 .807 .789 .784 .782
1 vs 6 AugRep .944 .941 .935 .929 .921 .909 .897 .876 .862 .861
Cotrain .833 .840 .836 .831 .835 .837 .840 .843 .849 .851
LB-OR .864 .863 .860 .857 .852 .855 .847 .844 .840 .837
LB-AR .945 .939 .932 .923 .920 .904 .891 .874 .857 .853
1 vs 7 AugRep .894 .889 .885 .877 .868 .857 .843 .828 .817 .803
Cotrain .789 .791 .789 .784 .786 .786 .786 .796 .798 .799
LB-OR .854 .849 .836 .831 .833 .833 .829 .824 .824 .821
LB-AR .899 .896 .890 .884 .876 .868 .854 .842 .841 .830
2 vs 3 AugRep .889 .879 .861 .851 .832 .802 .783 .745 .703 .705
Cotrain .755 .759 .756 .768 .761 .757 .760 .760 .735 .725
LB—OR .882 .875 .885 .868 .874 .851 .863 .846 .848 .805
LB—AR .891 .883 .869 .860 .844 .833 .830 .827 .815 .806
2 vs 4 AugRep .852 .853 .847 .839 .830 .819 .801 .788 .776 .760
Cotrain .764 .758 .754 .750 .756 .762 .759 .767 .763 .752
LB-OR .886 .887 .890 .874 .877 .877 .861 .857 .842 .850
LB-AR .942 .937 .934 .938 .920 .926 .902 .884 .893 .879
2 vs 5 AugRep .741 .732 .729 .725 .721 .710 .713 .706 .700 .687
Cotrain .689 .687 .684 .686 .689 .688 .688 .695 .698 .688
LB—OR .785 .778 .761 .766 .764 .767 .766 .744 .749 .738
LB-AR .808 .815 .807 .799 .775 .774 .756 .742 .733 .737
2 vs 6 AugRep .924 .920 .905 .889 .883 .862 .841 .825 .803 .771
Cotrain .772 .779 .780 .781 .775 .769 .769 .776 .771 .782
LB—OR .885 .884 .879 .879 .879 .862 .867 .856 .847 .831
LB-AR .934 .919 .919 .888 .896 .896 .883 .875 .859 .843
2 vs 7 AugRep .732 .733 .731 .723 .717 .703 .690 .679 .670 .654
Cotrain .650 .657 .643 .639 .654 .657 .664 .662 .665 .659
LB—OR .751 .765 .759 .757 .757 .756 .758 .757 .749 .742
LB-AR .773 .773 .767 .760 .760 .763 .734 .728 .747 .752

 

Table 3.3. Classiﬁcation Accuracy Comparison on Cora Dataset (Part A).

66

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Classes Model Percentage of Links Used
100 % 90% 80% 70% 60% 50% 40% 30% 20% 10%
3 vs 4 AugRep .940 .934 .930 .928 .912 .899 .883 .865 .846 .831
Cotrain .840 .838 .840 .836 .844 .858 .858 .859 .864 .848
LB-OR .902 .904 .895 .902 .906 .908 .897 .904 .899 .889
LB—AR .947 .944 .942 .939 .924 .921 .909 .889 .880 .862
3 vs 5 AugRep .872 .870 .862 .854 .838 .817 .804 .786 .759 .750
Cotrain .760 .749 .759 .764 .763 .763 .769 .759 .761 .763
LB-OR .831 .832 .826 .828 .812 .807 .806 .799 .778 .781
LB—AR .872 .874 .871 .853 .839 .824 .802 .799 .767 .778
3 vs 6 AugRep .903 .898 .893 .888 .879 .862 .845 .833 .805 .795
Cotrain .775 .772 .780 .785 .792 .784 .791 .785 .789 .777
LB-OR .856 .847 .844 .847 .845 .839 .839 .830 .829 .815
LB—AR .914 .901 .899 .892 .883 .862 .862 .842 .831 .823
3 vs 7 AugRep .871 .868 .862 .855 .836 .827 .808 .774 .746 .739
Cotrain .713 .724 .717 .722 .732 .743 .755 .755 .744 .740
LB-OR .797 .782 .783 .790 .784 .787 .806 .788 .800 .789
LB-AR .870 .868 .857 .850 .841 .848 .825 .804 .825 .785
4 vs 5 AugRep .899 .891 .880 .874 .864 .860 .861 .829 .819 .810
Cotrain .809 .811 .812 .803 .808 .809 .806 .818 .800 .798
LB-OR .844 .800 .818 .819 .827 .831 .831 .838 .841 .831
LB-AR .898 .893 .878 .879 .868 .859 .864 .827 .826 .794
4 vs 6 AugRep .959 .950 .948 .936 .923 .912 .892 .860 .833 .829
Cotrain .808 .805 .815 .818 .818 .816 .816 .810 .818 .806
LB-OR .923 .892 .898 .910 .915 .909 .914 .915 .919 .923
LB-AR .974 .968 .959 .947 .951 .940 .934 .925 .908 .895
4 vs 7 AugRep .904 .894 .885 .869 .855 .836 .821 .799 .774 .765
Cotrain .769 .775 .778 .774 .767 .774 .760 .770 .777 .778
LB—OR .829 .806 .799 .809 .813 .819 .810 .832 .832 .844
LB—AR .916 .893 .899 .886 .883 .874 .851 .838 .821 .811
5 vs 6 AugRep .926 .924 .919 .910 .898 .874 .851 .839 .828 .809
Cotrain .807 .802 .806 .794 .796 .796 .795 .797 .809 .798
LB—OR .836 .822 .826 .835 .840 .847 .847 .853 .862 .866
LB-AR .926 .925 .917 .911 .901 .871 .852 .844 .843 .816
5 vs 7 AugRep .812 .810 .807 .803 .796 .771 .747 .729 .713 .714
Cotrain .706 .707 .702 .705 .711 .716 .718 .712 .712 .709
LB-OR .782 .739 .745 .764 .749 .760 .769 .759 .776 .774
LB-AR .821 .821 .810 .803 .805 .788 .776 .755 .763 .743
6 vs 7 AugRep .940 .934 .929 .919 .909 .893 .875 .853 .827 .823
Cotrain .796 .779 .774 .769 .779 .774 .759 .769 .781 .789
LB-OR .846 .828 .833 .837 .831 .839 .843 .850 .843 .843
LB-AR .940 .930 .929 .918 .904 .892 .876 .859 .837 .832

 

Table 3.4. Classiﬁcation Accuracy Comparison on Cora datasets (Part B).

67

 

 

Classes Model Percentage of Links Used

100% 90% 80% 70% 60% 50% 40% 30% 20% 10%
1 vs 2 AugRep .939 .933 .928 .928 .921 .921 .915 .903 .906 .902
Cotrain .907 .910 .910 .909 .910 .908 .911 .911 .911 .905
LB-OR .938 .934 .933 .931 .929 .923 .927 .929 .927 .923
LB-AR .946 .942 .939 .937 .931 .932 .928 .917 .916 .909
1 vs 3 AugRep .932 .928 .916 .915 .906 .897 .887 .884 .886 .893
Cotrain .905 .905 .906 .905 .905 .906 .906 .904 .903 .902
LB-OR .921 .923 .918 .918 .916 .917 .911 .913 .906 .905
LB—AR .925 .922 .917 .916 .917 .912 .906 .908 .900 .896
1 vs 4 AugRep .785 .773 .763 .763 .760 .751 .739 .732 .732 .734
Cotrain .748 .744 .743 .742 .737 .738 .739 .739 .735 .732
LB—OR .760 .759 .756 .755 .750 .755 .751 .753 .755 .755
LB-AR .778 .772 .765 .753 .752 .747 .735 .738 .742 .742
1 vs 5 AugRep .895 .890 .881 .879 .867 .865 .853 .850 .854 .853
Cotrain .848 .852 .856 .858 .862 .858 .857 .859 .854 .851
LB-OR .859 .860 .863 .863 .860 .870 .870 .865 .856 .853
LB-AR .895 .888 .887 .881 .875 .870 .864 .865 .853 .854
1 vs 6 AugRep .904 .897 .892 .877 .860 .851 .848 .838 .834 .850
Cotrain .864 .859 .866 .865 .865 .869 .873 .873 .875 .881
LB—OR .883 .860 .863 .863 .860 .870 .870 .865 .856 .853
LB-AR .903 .898 .893 .885 .875 .875 .877 .869 .876 .882
2 vs 3 AugRep .845 .840 .838 .835 .822 .818 .803 .797 .780 .791
Cotrain .795 .798 .795 .800 .793 .794 .797 .794 .788 .787
LB-OR .807 .803 .810 .811 .803 .804 .794 .794 .797 .794
LB-AR .842 .839 .838 .836 .825 .824 .811 .796 .790 .784
2 vs 4 AugRep .787 .782 .776 .778 .778 .770 .768 .758 .757 .757
Cotrain .795 .755 .757 .753 .751 .754 .749 .746 .746 .743
LB—OR .780 .773 .777 .773 .773 .773 .773 .773 .775 .776
LB-AR .784 .782 .781 .777 .775 .774 .765 .767 .773 .774
2 vs 5 AugRep .870 .859 .854 .837 .829 .811 .799 .801 .789 .793
Cotrain .804 .808 .812 .811 .806 .806 .808 .803 .808 .805
LB-OR .841 .846 .843 .843 .845 .838 .830 .824 .817 .804
LB—AR .872 .869 .867 .856 .855 .844 .836 .822 .810 .791

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 3.5. Classiﬁcation Accuracy Comparison on Citeseer datasets (Part A).

68

 

 

 

 

 

 

 

 

 

Classes Model Percentage of Links Used
100% 90% 80% 70% 60% 50% 40% 30% 20% 10%
2 vs 6 AugRep .834 .831 .828 .824 .815 .793 .769 .760 .742 .765
Cotrain .803 .801 .792 .791 .787 .784 .773 .774 .782 .781
LB-OR .791 .793 .792 .788 .783 .785 .783 .779 .775 .774
LB-AR .829 .829 .824 .819 .807 .782 .778 .765 .748 .754
3 vs 4 AugRep .766 .765 .759 .757 .756 .756 .753 .751 .750 .749
Cotrain .754 .755 .756 .753 .754 .754 .754 .752 .752 .752
LB-OR .781 .775 .771 .771 .771 .767 .769 .770 .772 .768
LB-AR .770 .773 .766 .761 .761 .759 .755 .752 .758 .767
3 vs 5 AugRep .890 .883 .869 .856 .840 .830 .823 .820 .817 .821
Cotrain .848 .852 .846 .845 .845 .844 .843 .844 .839 .831
LB-OR .887 .884 .883 .885 .884 .878 .883 .881 .875 .869
LB—AR .905 .899 .890 .885 .881 .873 .869 .866 .862 .863
3 vs 6 AugRep .874 .865 .854 .844 .839 .831 .821 .815 .803 .809
Cotrain .830 .832 .832 .837 .837 .836 .835 .834 .831 .829
LB-OR .853 .853 .849 .851 .846 .846 .839 .841 .841 .833
LB-AR .878 .870 .861 .854 .854 .844 .842 .845 .831 .835
4 vs 5 AugRep .726 .726 .721 .713 .709 .703 .705 .709 .704 .701
Cotrain .707 .705 .702 .699 .700 .700 .699 .698 .699 .703
LB-OR .726 .724 .724 .723 .719 .722 .721 .723 .726 .726
LB-AR .731 .724 .725 .725 .726 .721 .726 .721 .717 .719
4 vs 6 AugRep .713 .712 .711 .711 .710 .708 .710 .709 .709 .710
Cotrain .706 .706 .707 .706 .706 .711 .709 .710 .709 .710
LB-OR .713 .714 .711 .712 .714 .714 .711 .713 .713 .713
LB-AR .714 .711 .711 .710 .706 .705 .707 .707 .709 .712
5 vs 6 AugRep .876 .861 .851 .848 .836 .820 .809 .805 .807 .814
Cotrain .814 .817 .814 .816 .820 .824 .824 .817 .810 .810
LB-OR .862 .855 .852 .847 .852 .838 .844 .835 .831 .826
LB-AR .882 .864 .859 .862 .854 .848 .839 .835 .820 .822

 

 

 

 

Table 3.6. Classiﬁcation Accuracy Comparison on Citeseer datasets (Part B).

69

 

when 10% - 30% links are used, LB—OR is slightly better than LB-AR. A possible
explanation to these ﬁndings is that when signiﬁcant amount of data examples are
involved in links, the classiﬁcation process is doubly boosted by LB-AR, both from
data representation augmentation and the training set augmentation; but when link
information are extremely sparse, the representation augmentation in a bag-of—words
manner is less reliable, compared to the training set augmention which selectively
supplements training pool with highly conﬁdent predictions. However, how to reliably

utilize extremely sparse link information needs further study.

3.4.3 Boosting Power for Supervised Algorithms

To verify the boosting power of the proposed LinkBoost framework, we apply it to
several base supervised classiﬁers, and compare the performance with that of the base

classiﬁer itself. In particular, ﬁve supervised algorithms are used

0 Support Vector Machine (“svm” for short), whose performance was proved to

be among the best in many text classiﬁcation applications.
0 J48 decision trees (“j48” for short).

0 HyperPipe classiﬁer (“hpp” for short), which constructs for each category a
“hypepipe” that contains all points of that category (essentially records the
attribute bounds observed for each category). Predictions on test instances are

made according to the category that most contains the instance.

0 Simple Naive Bayes classiﬁer (“nbs” for short).

Voted Perceptron (“vp” for short).

For “svm”, we used the SVMLight implementation. For the rest four classiﬁers, we

used its implementation in the Weka softwarez.

 

2http://www.cs.waikato.ac.nz/m1/weka/

70

 

Classes

SV 111

LB—OR

j48

LB-OR

hpp LB-OR

nbs

LB-OR.

vp

LB-OR

 

 

1vs2
1vs3
1vs4
1vs5
1vs6
1vs7
2v33
2vs4
2vs5
2vs6
2vs7
3vs4
3v35
3vs6
3vs7
4v35
4v36
4vs7
5v36
5vs7
6vs7

 

0.843
0.843
0.744
0.772
0.854
0.802
0.732
0.750
0.685
0.773
0.653
0.855
0.758
0.803
0.748
0.809
0.846
0.771
0.806
0.708
0.819

0.891
0.877
0.809
0.821
0.864
0.854
0.882
0.886
0.785
0.885
0.751
0.902
0.831
0.856
0.797
0.844
0.923
0.829
0.836
0.782
0.846

 

0.819
0.877
0.733
0.725
0.863
0.768
0.651
0.748
0.680
0.747
0.628
0.837
0.744
0.804
0.741
0.691
0.795
0.681
0.779
0.672
0.724

0.930
0.925
0.866
0.834
0.929
0.902
0.915
0.933
0.800
0.913
0.797
0.943
0.892
0.894
0.887
0.888
0.955
0.861
0.916
0.822
0.913

 

0.845
0.813
0.734
0.744
0.769
0.807
0.776
0.800
0.711
0.796
0.712
0.775
0.749
0.732
0.741
0.740
0.807
0.783
0.743
0.723
0.780

0.851
0.799
0.733
0.750
0.771
0.821
0.847
0.852
0.733
0.835
0.742
0.789
0.780
0.771
0.758
0.750
0.806
0.790
0.759
0.739
0.783

 

0.853
0.822
0.749
0.767
0.756
0.803
0.772
0.810
0.719
0.791
0.690
0.785
0.736
0.736
0.723
0.750
0.766
0.753
0.743
0.708
0.728

0.892
0.849
0.796
0.799
0.820
0.859
0.892
0.897
0.769
0.887
0.767
0.876
0.841
0.834
0.826
0.806
0.895
0.844
0.857
0.804
0.834

 

0.846
0.817
0.712
0.752
0.797
0.805
0.699
0.785
0.676
0.765
0.650
0.791
0.697
0.767
0.730
0.734
0.799
0.742
0.770
0.688
0.787

0.874
0.844
0.777
0.789
0.834
0.841
0.812
0.871
0.734
0.858
0.717
0.843
0.757
0.817
0.785
0.792
0.870
0.817
0.823
0.740
0.816

 

Table 3.7.

Boosting classiﬁcation accuracy on Cora datasets.

 

Classes

svm LB—OR

j48

LB—OR

hpp LB-OR

nbs

LB-OR

VP

LB-OR

 

 

1vs2
lvs3
1vs4
1v55
1vs6
2vs3
2vs4
2v55
2vs6
3vs4
3vs5
3vs6
4v35
4vs6
5vs6

 

0.904
0.901
0.741
0.854
0.869
0.786
0.754
0.802
0.785
0.754
0.843
0.819
0.702
0.709
0.813

0.938
0.921
0.760
0.859
0.883
0.807
0.780
0.841
0.791
0.781
0.887
0.853
0.726
0.713
0.862

 

0.853
0.870
0.733
0.840
0.842
0.665
0.717
0.667
0.639
0.720
0.717
0.726
0.681
0.656
0.681

0.946
0.929
0.772
0.887
0.898
0.824
0.798
0.859
0.800
0.807
0.904
0.864
0.758
0.706
0.872

 

0.785
0.790
0.737
0.749
0.754
0.705
0.777
0.763
0.720
0.768
0.791
0.760
0.724
0.706
0.766

0.865
0.838
0.747
0.796
0.803
0.773
0.776
0.809
0.738
0.777
0.842
0.796
0.713
0.704
0.804

 

0.817
0.793
0.738
0.752
0.775
0.723
0.778
0.773
0.728
0.769
0.782
0.762
0.730
0.695
0.774

0.903
0.862
0.761
0.815
0.844
0.780
0.821
0.840
0.751
0.789
0.868
0.825
0.759
0.691
0.839

 

0.818
0.832
0.734
0.754
0.782
0.742
0.743
0.726
0.715
0.755
0.771
0.745
0.697
0.663
0.742

0.882
0.869
0.755
0.810
0.827
0.773
0.785
0.786
0.741
0.773
0.825
0.797
0.738
0.681
0.797

 

Table 3.8. Boosting classiﬁcation accuracy on Citeseer datasets.

71

 

 

The original data represenation is used both in the base classiﬁer and the
LinkBoost-boosted classiﬁer.

Table 3.7 and Table 3.8 compare the performance of the base classiﬁers, and perfor-
mance of them within the LinkBoost framwork, on all class pairs in Cora and Citeseer
datasets respectively. In both tables, each column contains two sub—columns, with
the left one under the name of the base classiﬁer giving its classifation accuracy, and
the right one under “LB-OR” giving the classiﬁcation accuracy of the corresponding
base classiﬁer within the LinkBoost framework (using original data representations).
As we can see, in most cases, LinkBoost framework is able to signiﬁcantly improve the
classiﬁcation accuracy. These results empirically proves the boosting power of Link-
Boost framework, i.e., it is able to improve any supervised algorithm by effectively

exploiting link information.

3.5 Conclusions

In this chapter, we propose a semi-supervised learning framework, named as “Link-
Boost”, for boosting classiﬁcation performance, when side information in the form of
links are available. LinkBoost is designed to turn any supervised algorithm into a
semi—supervised one, and improve its classiﬁcation performance. Experiments show
that LinkBoost is robust against the noisy and sparse nature of link information, and

it does improve the classiﬁcation accuracy of several typical supervised algorithms.

72

CHAPTER 4

Semi-supervised Clustering with Pairwise

Constraints

In this chapter, a novel boosting framework for semi-supervised clustering will be
described. Starting with the problem deﬁnition of semi-supervised clustering, we will
review a few major approaches in previous studies on this problem, then present our
boosting idea with formal description of the related algorithms. Experiment results
and discussions will be provided, as empirical validation. Finally, we will summarize

our work and raise a few issues for future work.

4.1 Problem Deﬁnition

Data clustering, also called unsupervised learning, is one of the key techniques in data
mining that is used to understand and mine the structure of unlabeled data. The
idea of improving clustering by side information, sometimes called semi-supervised
clustering or constrained data clustering, has received signiﬁcant amount of attention
in recent studies on data clustering. Often, the side information is presented in the
form of pairwise constraints: the must-link pairs where data points should belong
to the same cluster, and the mustnot-lz'nk pairs where data points should belong to
different clusters.

Table 4.1 summarizes the notations that will be used throughout the chapter.

73

71

Total number of data examples.
The number of attributes for each data example

A d—dimension vector represents the i—th data example. We
also use x,- to refer to the i-th data example.

The set of all data examples, i.e. X = {X22 ?:1.

A matrix X = (x1, . . . ,xn) that gathers the vector repre-
sentations of all the data examples.

The set of all must-link pairs of data examples.

A n x n matrix where 52+- is one when examples x,- and x]-
form a must-link pair, and zero otherwise.

The set of all mustnot—link pairs of data examples

An n x n matrix where S 2— - is one when examples X, and
7
xj form a mustnot-hnk pair, and zero otherwrse.

For classiﬁcation or clustering problems, c gives the number
of classes.

For classiﬁcation or clustering problems, 1,- takes values
from the set {1, - -- ,c} which indicates the class label for
the i-th data example.

Table 4. 1. Notations

4.2 Review on Previous Studies

There are two major approaches to semi-supervised clustering: the approach based on
constraints satisfaction and the approach based on distance metric learning. The first
approach employs the side information to restrict the solution space, and only ﬁnds
the solution that is consistent with the pairwise constraints. The second approach
ﬁrst learns a distance metric from the given pairwise constraints, and computes the
pairwise similarity using the learned distance metric. The computed similarity matrix
is then used for data clustering. In this section, we will review some major work in
both approaches, recapitulate a few selected representative algorithms, followed by a

brief summary on these previous studies.

74

4.2.1 Approach Based on Constraints Satisfaction

The constraint~based approach for semi—supervised clustering utilizes the side infor—
mation to restrict the feasible solutions when deciding the cluster assignment. Early
work in this category took the side information as the hard constraints, and only con-
sidered the cluster assignments that were absolutely consistent with the given pairwise
constraints. In [148, 20], the authors proposed the constrained K-means algorithms
by adjusting the cluster memberships to be consistent with the pairwise constraints.
In [133], a generalized Expectation Maximization (EM) algorithm is proposed to in-
corporate the pairwise constraints into the EM algorithm. In particular, the cluster
assignments that are inconsistent with the constraints are excluded from the partition
function when computing the posterior probability for the cluster memberships. One
problem with treating the side information as hard constraints is that we may not
be able to ﬁnd feasible solutions that are consistent with all the constraints [44]. To
overcome this problem, a number of studies view the side information as soft con-
straints. The key idea is to penalize, not to exclude, the cluster assignments that are
inconsistent with the given pairwise constraints. In [15, 108, 16], the authors present
probabilistic models for semi-supervised clustering where the pairwise constraints are
incorporated into the clustering algorithms through the Bayesian priors. In [102],
the authors modiﬁed the mixture model for data clustering by redeﬁning the data
generation process through the introduction of hidden variables. In [14], the authors
extended the framework of semi-supervised clustering by selecting the most infor-
mative pairwise constraints to solicit the labeling information. In [45], the authors

studied semi-supervised clustering for the hierarchical clustering algorithm.

4.2.2 Approach Based on Distance Metric Learning

Another approach to semi-supervised clustering is to ﬁrst learn a distance metric from

the given pairwise constraints. The pairwise similarity between any two examples is

75

then computed based on the learned distance metric, and a clustering algorithm is
applied to the computed similarity matrix. The key to this approach is to effectively
learn a distance metric from the side information. Zhang et al. [165] proposed to
learn a distance metric by a linear regression model. Xing et al. [155] formulated
the distance metric learning problem as a constrained convex programming problem.
This algorithm is extended to the nonlinear case in [98] by the introduction of ker-
nels. Yang et al. [159] proposed a local distance metric algorithm that is designed to
address the problem of distance metric learning for multi-modal data distributions.
Golderberg et a1. [64] presented the neighborhood component analysis algorithm that
learns a local distance metric by extending the nearest neighbor classiﬁer. Wein—
berger [150] presented a large margin nearest-neighbor classiﬁer for distance metric
learning that extended the neighborhood component analysis to a maximum margin
framework. Discriminative component analysis [74] learned a distance metric by ex-
tending the relevance component analysis to effectively explore both the must-link
and the mustnot-link constraints simultaneously. In [71, 72], the authors extended
the boosting algorithms to learn a distance function from given pairwise constraints.
Schultz and Joachims [128] extended the framework of support vector machine to
learn distance metrics from the pairwise comparisons.

Finally, a few studies cluster data points by a similarity matrix that is directly
modiﬁed according to the pairwise constraints. In [96], the authors proposed to
modify the similarity matrix by linearly combining the original similarity matrix with
the pairwise constraints. Klein et al. [91] proposed to modify the similarity matrix

by propagating the pairwise constraints through the nearest neighbors.

4.2.3 Representative Algorithms

We will recapitulate a few representative algorithms proposed for semi-supervised

clustering in the following.

76

CONSTRAINED K-MEANS CLUSTERING ALGORITHM

The constrained K-mcans algorithm is a modiﬁed K-Ineans algorithm, by ensuring
none of the pairwise constrained is violated during the iterative steps of the K-means
algorithm. Speciﬁcally, the constrained K-means algoritlnn can be fornmlated as
follows [148]
Input

0 X: matrix for the input data

0 8+: matrix for must»link pairs

0 8": matrix for mustnot-link pairs
Output: cluster memberships

Algorithm
Step 1 Initialize C1, . . . ,Ck as the initial cluster centers.

Step 2 For each data point 27,-, assign it to the closest cluster center Cj such that

o for all xi; not belonging to the k—th cluster, S? j 7é 1;

o for all xi, belonging to the k-th cluster, 32— j ¢ 1.

If no Cj satisﬁes the above rules, fail.

Step 3 Update each cluster center C j by averaging all the data point (1, in the corre-

sponding cluster.
Step 4 Iterate through Step 2 and Step 3 until convergence.
Step 5 Return cluster memberships.

The main drawback of the above algorithm is that it can fail without yielding
a feasible solution. As the algorithm presents, if the cluster assignment (in Step
2) cannot be found to satisfy all the pairwise constraints, the algorithm will stop.

Another drawback with the constrained K-means algorithm is that. it only makes

77

efforts to satisfy those “known” constraints. but does not. generalize those constraints

to the unseen data where the pairwise relationship is “unknown”.

CONSTRAINED COMPLETE-LINK CLUS'I‘ERING ALGORITHM

The data examples involved in pairwise constraints, in general, can be viewed as rep-
resentative of their local neighborhoods. Having recognized this, it. would be natural
to try to induce a set of new distance measurements over all the data examples from
the limited number of pairwise constraints. This is the basic idea of the work pre-
sented in [91], which is also formulated as acquiring prior knowledge “from instance
level constraints to space-level constraints”. Then a speciﬁc clustering algorithm
(Complete-Link clustering) is applied with the new distance measurements. The cor-
responding algorithm is named as Constrained Complete-Link (CCL) algorithm.

In CCL algorithm, the distance measurement between each pair of data exam-
ples are generated by explicitly making adjustment on an initial distance matrix
computed from the data input patterns. The adjustment is done by ﬁrst imposing
the constraints, and then propagating them to the neighborhood of the constrained
examples.

For must-link constraints, imposing the constraints means setting each distance
between the must-link pair of data examples to zero. Then, the distance between all
other data example pairs are recomputed as the length of shortest path connecting
them (allowing using the “zero” length of those must-links). In this way, the must—link
gets propagated to their neighborhoods. After these distance adjustment, the triangle
inequality still holds, and the resulting distance matrix is still a valid “metric”.

For mustnot-links, imposing the constraints means setting each distance between
the mustnot-link pair of data examples to a large value. However, further propagat—
ing the mustnot-links to their neighborhood while maintaining the adjusted distance

matrix a valid “metric” will be computationally expensive. In [91], the propagation

78

Of mustnot-links are not carried out explicitly by adjusting the distance matrix, but
claimed to be implicitly done in the merging step Of the following Complete—Link
clustering procedures 1.

The most notable contribution of the CCL algorithm is its efforts to generalize
the instance—level pairwise constraints tO space—level distance measurements that can

affect data examples beyond the constrained ones. As a results, it is reported in [91]

to outperform the constrained K-means algorithm in clustering.

HMRF-KMEANS ALGORITHM

Basu et al. proposed a probabilistic framework based on Hidden Markov Random
Fields (HMRFS) that combines constraints satisfaction and distance metric learn-
ing [15]. Based on this framework, a partitional clustering algorithm is designed,
which is named as HMRF-Kmeans algorithm.

Speciﬁcally, the following Hidden Markov Random Field is considered

a A hidden set Of random variables [I = {ll-”1:1, where each random variable li

takes values from the set {1, . . . , c} which is a cluster membership indicator.

0 An observable set X = {xi}?___1, where each random variable x,- is generated

from a conditional probability distribution Pr(x,;|l,;).

TO incorporate the pairwise constraints, the probability Of a particular cluster label

conﬁguration is expressed in the following Gibbs distribution form

Pr(£) = iexpeZZI/(zym (4.1)
j

i

 

1In the Complete-Link algorithm, the distance between two clusters is defined as the maximum
distance between data examples from either cluster. Therefore, if (x,,xj) forms a mustnot-link,
merging X], with x.- will result in a mustnot-link practically being constructed between xk and x,-.

79

where Z 1 is a normalizing factor and

f+(x,-,xj) if (x,,xj) 6 8+ but 1,- #1]-

V('i,j) = f_(x,-,xj) if (x,-,x]-) e s— but I,- = l]-

0 otherwise
Here, f+(x,-,xj) and f_(x,;,xj) are two non—negative functions that penalize the
violations of must-links and mustnot-links, respectively. The intuition behind the
above treatment is that higher probability should be assigned to a label conﬁgurations
that satisfy more pairwise constraints.

The optimal set of cluster labels for all the data example is acquired by solving

the following MAP estimation problem
Pr(£|X) or Pr(£) Pr(X|£)

Since Pr(£) is given in (4.1), we need tO decide Pr(r1’ [13) to carry on the MAP esti-
mation. Assuming the set Of random variables X to be conditional independent given
the set Of hidden variables, i.e.
Pr(X|£) = H Pr(x,-|I,-) (4.2)
XiEX

we further parametrize Pr(x,-|l,) as
Pr(xz-|li) or exp(—D(xi,plz,)) (4.3)

where Mkfk = 1, - - - ,c) is the cluster representative Of the k-th cluster, and D(xi, p12.)
is a distortion measurement between the i—th data example and the li-th cluster
representative. Such distortion measurement can take various forms, such as cosine
similarity or I—divergence, etc..

Combining (4.1) - (4.3), the MAP estimation leads to the following objective

function for clustering

laungzl exp(—XXV(iai>>exp(—ZD(x.-,m,)>
i j i

maXUz'}?

80

Note that the above Optimization problem is a “incomplete—data problem” since both
the set Of cluster labels {ll-[[121 and the cluster representatives {pk}z:1 are unknown
in a clustering setting, Expectation Maximization (EM) method is used to ﬁnd the
solution. The EM steps can be formulated into a modiﬁed K-means algorithm (see [15]

for details).

METRIC PAIRWISE CONSTRAINED KMEANS ALGORITHM

Metric Pairwise Constrained Kmeans (MPCKmeans) is another research attempt
which tries to integrate distance metric learning into the iterative steps for cluster-
ing [24]. Again, an underlying K-means-style procedure (i.e. iteratively updating
cluster memberships and cluster representatives) is assumed for ”the clustering. The

Objective function for the related Optimization problem is

min Bx.- — m,)TAz,<x.- — #1,.) — Z) 10gdet(AI,)

i 1,
+ Z f+(xiaxj)6(li at lj) + Z f—(Xi,Xj)5(li = lj)(4-4)
(Xi,Xj)€S+ (Xi,Xj)€S_
where pk(k = 1, - -- ,c) is the cluster representative Of the k-th cluster, A]: is a

distance matrix for the k—th cluster, and f+(xz-,xj) and f_(xz-,xj) are two non-
negative functions that penalize the violations of must-links or mustnot-links.

In [24], the two penalty functions are deﬁned in the following forms

f+<x.-,xj) = —<x.—xj>TAz,(x.-— j>+§(x.-—xj>TAI,<x.—xj>

1
f—(Xz',Xj) = §(X z.(ij.-X[;)-§(Xi-Xj)TAlJ-(Xi-Xj)

where (xbxﬁ ) is the maximally separated pair of the points according to the li-th
distance matrix Ali (note that we only care about f. when l,- = 1]). The function
deﬁnition of f+ allow a larger penalty imposed on violating a must-link constraints
between a pair Of distant data examples than that between a pair of close data

examples. This reflects the intuition that if two must-link examples are measured as

81

far from each other by a distance metric (i.e. a large value of (x,- — leTAll-(Xz‘ — xj)
or (x,- ‘leTAlj (Xi —— 3)), we want to put more penalty for this situation so that the
corresponding distance metric can to be adjusted dramatically. A similar argument
can be found for the function deﬁnition of f_.

Note that there are four terms in the Objective function (4.4). The ﬁrst term
addresses the expectation on the compactness Of the data clusters, as in the K—means
algorithm, but allowing each cluster to take different shapes; the second term regu-
larizes the learned distance matrices; the third and the fourth term try to enforce the
pairwise constraints.

Similar to the HMRF-Kmeans algorithm, EM algorithm is used to solve the Opti-
mization problem (4.4), which can be formulated into a modiﬁed Kmeans algorithm
(see [24] for details).

The main advantages Of MPCKmeans algorithm are: 1) instead of using a ﬁxed
distance metric for clustering, it allows new distance metrics being learned during the
clustering procedures and hence improves clustering performance; 2) it allows different
distance metrics being learned for different clusters, so that clusters in different shapes
are possible to be detected. As reported in [24], MPCKmeans outperforms several
purely semi-supervised distance metric learning methods and purely semi-supervised

clustering methods in data clustering applications.

4.2.4 Summary

Although a large number Of studies have been devoted to semi-supervised clustering,
most of them focus on designing special clustering algorithms that can effectively
exploit the pairwise constraints. For instance, algorithms in [15, 108, 16] are designed
to improve the probabilistic models for data clustering by incorporating the pairwise
constraints into the priors of the probabilistic models; the constrained K-means algo-

rithm [148] exploits the pairwise constraints by adjusting the cluster memberships to

82

be consistent with the given constraints. However, it is Often the case that we have a
speciﬁc clustering algorithm that is specially designed for the target domain, and we
are interested in improving the accuracy Of this clustering algorithm by the available
side information. This motivates us to design a meta-algorithm that is able to improve
any given clustering algorithm by the pairwise constraints. It is important to note
that such a meta-algorithm should not make any assumption about the underlying
clustering algorithm, so that it can be applicable to any clustering algorithm. To this
end, we propose a boosting framework for data clustering. More details will be given

in the following section.

4.3 Boosting Clustering

With pairwise constraints available, it is always desirable if we can utilize such side
information to improve clustering performance, no matter which clustering algorithm
is used. To this end, we propose a general boosting framework, termed as Boost-
Cluster. In this section, we will ﬁrst illustrate the main idea of boosting clustering,
followed by a formal description on the framework, including the related Optimization
problem and solution, two variations Of the meta-algorithm for boosting, and related
discussions on the scalability issue.

Let A denote the given clustering algorithm tO be improved. In order to make
this framework general, we treat the clustering algorithm A as a black box that only
takes the data representation of all examples as its input and outputs the cluster
memberships for the given examples. Note in this work, we assume that the number

of clusters is given.

4.3.1 Main Idea

Although a number of boosting algorithms (e.g., [55]) have been successfully applied

to supervised learning, extending boosting algorithms to data clustering is signiﬁ—

83

cantly more challenging. The key difficulty is how to influence an arbitrary clustering
algorithm with the given pairwise constraints. TO overcome this challenge, we propose
to encode the side information into the data representation that is the input to the
clustering algorithm. More speciﬁcally, we will ﬁrst ﬁnd the subspace in which data
points Of the must-link pairs are close to each other while data points Of the mustnot-
link pairs are far apart. A new data representation is then generated by projecting
all the data points into the subspace, that is used by the given clustering algorithm
to ﬁnd the appropriate cluster assignments. Furthermore, the procedure for identify-
ing the appropriate subspace based on the remaining unsatisﬁed constraints, and the
procedure for clustering data points using the newly generated data representation
will alternate iteratively till most of the constraints are satisﬁed.

Figure 4.1 illustrates the idea of iterative data projection. The data points used
in this illustration are sampled from the “scale” dataset that will be described later
in Section 4.4.1. They belong to three clusters that are labeled in Figure 4.1 by
legends A, o, and x, respectively. A partitional clustering algorithm is used in
this illustration. Sub-ﬁgure (a) shows the original data distribution projected into
a 2D space that is generated by Principle Component Analysis (PCA). We clearly
see that many data points Of the cluster x overlap heavily with the data points of
the clusters A and o, and they are difficult to be well separated. The must-link and
mustnot-link constraints are indicated in sub—ﬁgure (a) by solid lines and dotted lines,
respectively. Sub-ﬁgures (b)~(d) illustrate the projected data distributions based on
the new representations that are generated by the proposed boosting framework in
iteration 1, 2, and 7, respectively. Evidently, the data representations generated in
different iterations are helpful in separating the data points in the cluster x from
those in the other two clusters.

In practice, the whole boosting idea will work in the following way: the original

data representation and the pairwise constraints will be used as input in the proposed

84

 

 

29 Projection 0f Oringinal Data 2D Projection of Data at Iteration 1
a r 1 r . .

 

 

 

 

 

 

 

 

 

 

 

 

0.8
0.6’ A A A A x x o
0.5- AAAAAAM: 0x900
0.4» 000(9 0 o
A A A95: A8,}: 0 <9 0 o
0.2- 0_ AAégASéAA‘6) a??? (900
0_ AAaAAAAAJ‘Od" 6:00
_ AAAAAAAA o 069 o
-02. . -0.5 AAA 8 O& 0 O
o O O
-o.4- _1_ x x o
-0.6'
_0'3 -0:5 I) 05 '1'5’4 -é 6 z 4
(3.) Original data distribution (b) Iteration 1
2D Projection of Data at Iteration 2 2D Projection of Data at Iteration 7
a . . . X . . .
1.5' AEE ‘ :0 3p 00 00 o
A x0
1' A RE § ‘ 0'5' A 308000080 00
05 R‘g §§§g . A A0 ”030080 00
' A g A Q 09 o
8 § A A AAxQ‘OO o
0 AR 19§ 3 <§8 0 A 163% oo
,9 § 3: § A AA A AA A ’9‘ o
—o.5- R g A A A A A $6 0
L E xx§8 o AAAAAAAAAAOOO
—1 A 09 o -—0.5- AAAAAAAAA ”it
_1 5. O 8 A A
. x O A
Q 0
'1 - -1
'4 -2 O 2 4 -3 -2 -1 0 1 2
(c) Iteration 2 (d) Iteration 7

Figure 4.1. An example illustrating the idea of iterative data projections. Sub-ﬁgure (a)
shows the original data distribution, projected to the space spanned by its two principal
components; Sub-ﬁgures (b)-(d) show the data distributions based on the new representa-
tions in the projected subspaces that are generated in iteration 1, 2, and 7.

85

 

 

 

Pairwise
II data examples
c°"‘"‘”““ / d attributes a x n

Data Matrix

 

Data Matrix

  
   
 
    
  

 

 

 

ow at:
Rep.
In Subspace -

Clustering
Algorithm

s-dlm
Subspace
‘ Protection

  
   

d x s
Projecﬂon Matrix

 

 

 

Clustering

J“Siaml’mr‘ BoostCluster-E
Framework

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

 

 

Flue!
Clustering Result:

Figure 4.2. The ﬂowchart of the BoostCluster framework.

86

boosting framework; then in a iterative manner, new data representations will be
generated and be fed into the “black-box” clustering algorithm, whose results will in
turn be used to ﬁnd a subspace where data representations can be generated for the
next iteration; during the iterative process, a kernel similarity matrix gets updated.
The kernel similarity matrix, which can be seen as learned from the above boosting
procedure, incorporates the side information gathered from the pairwise constraints
and will improve the clustering performance. Figure 4.2 presents a flowchart that
illustrates the working mechanism of the boosting framework. More details will be

provided in the following subsections.

4.3.2 Objective Function

The first step in designing boosting algorithm is to construct an appropriate Objective
function. Note that, as described in the introduction section, the goal Of the boosting
algorithm is to identify the subspace that keeps the data points in the must-link
pairs close to each other, and keeps the data points from the mustnot-link pairs well
separated. To this end, we introduce the kernel similarity matrix K E Rnxn, where
K 2', j _>_ 0 indicates the conﬁdence of assigning examples X, and x j to the same cluster.
Then, our goal is to iteratively construct this kernel similarity matrix by using the
clustering algorithm A and the pairwise constraints in 8+ and 8".

Since the ideal kernel matrix K is expected to assign a large value to examples in
a must-link pair and a small value to examples in a mustnot-link pair, we propose to

minimize the following objective function:

n n
BCP —
1.3 = E: E: SJjSa,bexp(Ka’b—Ki,j) (4.5)
ileat=1

In the above, each term within the summation compares K 0,1” i.e., the similarity
between two points from an mustnot-link pair, to KM, i.e., the similarity between

two data points from a must—link pair. By minimizing the Objective function in (4.5),

87

we will ensure that all the data points in the must-link pairs are more similar to each
other than the data points in the mustnot-link pairs.
The objective function in (4.5) can also be written as:

n n
EBCP = Z ngcxM—Ki’j) Z Sci—.bCXMKaib) (4.6)
i,j=1 a,b=1

The above objective function is a product of two terms: the ﬁrst term, i.e.,
2 j- 1 S + j—exp( K2} 3), measures the inconsistency between the kernel similarity ma—
trix K and] the must-link constraints; the second term, i.e., Zg,b=1 SJJ) exp(K 0,1,),
measures the inconsistency between the kernel similarity matrix K and the mustnot-
link constraints. Thus, by minimizing the product of the two terms, we enforce the
kernel matrix K to be consistent with the given pairwise constraints.
Instead Of multiplying the two inconsistency measurements as in (4.6), we can also
deﬁne the Objective function to be the sum of the two terms, i.e. ,:

(1305:: S: j(exp —K,'j) +c 2 Sa— bexp(Ka b) (4.7)
2',j=1 a,=b 1

The parameter c in (4.7) balances the two inconsistency measurements. To differ-
entiate these two Objective functions, we refer to the objective function in (4.6) as
BoostCluster by product, or BCP for short, and the Objective function in (4.7) as

BoostCluster by sum, or BCS for short.

4.3.3 The BoostCluster Framework

We will ﬁrst describe the efﬁcient optimization algorithm for BCP, followed by the
algorithm for BCS.

THE BCP ALGORITHM

To boost the performance of a clustering algorithm A given a set of pairwise con-
straints, we follow the idea Of boosting algorithms by iteratively improving the ker-

nel similarity matrix K. Let K denote the current kernel similarity matrix. Let

88

A E [0,1]")(7‘ denote the incremental kernel similarity matrix that is inferred from
the clustering results generated by the algorithm A. In particular, A, J- : 1 when
both x,- and xj are assigned to the same cluster and Atj = 0 otherwise. The overall

kernel matrix K’ is a linear combination Of K and A, i.e.,
K’ = K + aA (4.8)

where a 2 O is the combination weight. Then, the Objective function CBCP for the

combined kernel K’, denoted by CBCP(K’), is written as:

n n
LBCP(K’) = Z 2: SM Sabexp(ch —K£’j)

ivj:1a,b=1
n n
= Z Z ijawv- “(Aria—Aux)» (4.9)
i,j=1a,b=1
where
= 3+ X)(- A") (410)
pl,] 1,].8‘ I 2,] .
qa,b = Sgbexmh’at) (4.11)

In the above, pm- measures the inconsistency between the kernel matrix K and
the must-link pair (xi, xj), and (10,,) measures the inconsistency between K and the
mustnot-link pair (xa, xb).

We then employ the .Iensen’s inequality to Obtain an upper bound for the function

11 (4.9), i.e.,

m m
exp 2: pint,- S 2 Pi exp(.ri)
i=1 i=1

where pi 2 0,2' = 1,2,. . . ,m and 227:1 p,- = 1. Using the above inequality, an upper
bound for exp(Aa b“ A-

) can be constructed as follows

 

 

7w]
exp(—amm- — Aw) (4.12)
A- — A +1 1 A — A- -+1
2 exp (—30 2’] 3 a,b + 3a§ + 0 x at 3 2’] )
A" —Aab+1 1 Aa,b"Ai,j+1

 

 

|/\
N
9
(D
X
"C

(—-3cr) + E exp(3cr) + 3

89

In the ﬁrst step of the above derivation, we rewrite “(Amb — AM) as a summation
over the probability distribution of ((Aat — AM + 1)/3_. (AiJ — Ant + 1)/3,1/3).
Note that (AaJ) — AiJ +1) 2 0 since 0 3 Am- 3 1. Using the upper bound in (4.12),

we can now bound the Objective function Of BCP in (4.9) as follows

n Tl
£BCP(K’) = Z Z 1)i.jqa.beXI>(-O(Ai.j"Aw”

 

 

 

igzlat=1
n 'n.
exp(—30) - 1
S 3 Z Z Pi,jqa,b(Ai,j -Aa,b)
i,j=1a,b=1
exp(BO) + exp(—30) + 1 n n
+ 3 2 Zagreb
i,j=1 a,b
n n n
exp(3a) — 1
= ———3 2: A221 P232" }: qa.b — (1232' 2 Par;
z',j=1 a,b=1 a,b=1
exp(BO') + exp(—3(1) + 1 n n
+ 3 ’ Z ZPz’J‘Iai (4-13)
i,j=1 a,b

We can simplify the above expression by deﬁning a matrix T as follows

Pm‘ (In

T. ___________________
z 22,1):1 pa,b 22,1):1 qaab

,J'

The elements in matrix T measure the inconsistency between kernel matrix K and
the pairwise constraints: a large positive value for T2 j indicates that K is inconsistent
with the must-link pair (xi,xj); similarly, a large negative value for TGJJ indicates
that K is inconsistent with the mustnot-link pair (xa,xb). Using the notation Of

matrix T, the upper bound for LBCP in (4.13) becomes

 

£BCP(KI) S CBCP(K) x {lexp(3a) +e:p(—3a) + 1l _ [1 -exp(—3:)ltr(TAT)}

(4.14)

90

where

n n,

BCP

5 (K) = Z Ptﬂw
i,j=1a,b=1
n

T

”(TA 1 = Z T,,,A,-,,-

z',j=1

Note that when a = 0, the right side Of (4.14) becomes [IBCP(K), i.e., the Objective
function Of the previous iteration. Thus, by minimizing the upper bound in (4.14)
with respect to a, we are guaranteed to have EBCP(K’) 3 .CB 0}) (K), thus reducing
the objective function at successive iterations.

As suggested by the inequality in (4.14), to effectively reduce the Objective func-
tion LBCP, we need to maximize the term tr(TAT). We further assume that the
incremental kernel matrix A can be approximated by a linear projection of the input.
data X, i.e.,

A rs (PTX)T(PTX) = XTPPTX
where P = (p1,p2, . . . ,ps) is the projection matrix (.9 S d) with each Pi 6 Rd
specifying a different projection direction. Using the above expression, tr(TAT) can

be written as
tr(TAT) a: tr(PTXTXTP) (4.15)

If we enforce orthogonality between any two projection vectors, i.e., pgrpj = (5 (2', j ),
the optimal solution for p,- that maximizes the expression in (4.15) is the i-th max-
imum eigenvector of matrix XTXT. Let {(Ai,vi)}f:1 denote the top 3 principal
eigenvalues and eigenvectors Of matrix XTXT. Then, the optimal projection matrix

P is constructed as
P = (\/A1V1,\/A2V2,..., ASVS) (4.16)

Using the projection computed in (4.16), we generate a new data representation

as X’ = PTX. This new representation X’ will be input to the clustering algorithm

91

 

Input
0 X: matrix for the input data
0 A: the given clustering algorithm
0 s: the number Of principal eigenvectors used for projection
0 51': matrix for must-link pairs
0 8‘: matrix for mustnot-link pairs
Output: cluster memberships
Algorithm
0 Initialize KiJ = 0 for any i,j : 1,2,. . . ,n.
0 Fort: 1,2,...,T

- Compute p1,]- and (1M using (4.10) and (4.11).

— Compute matrix T using (4.15) for BCP and using (4.23) for BCS.

- Compute the top .3 eigenvectors and eigenvalues {0‘2" Vi) 25:1 of T.

— Construct the projection matrix P using (4.16), and generate the new
data representation X’ by projecting the input data X onto P.

— Run the clustering algorithm A using the new data representation X'.
Compute the matrix A with Am- = 1 when x,- and xj are grouped
into the same cluster by A, and zero otherwise.

— Compute (1 using (4.18) for BCP and using (4.24) for BCS.

— Update the kernel similarity matrix K as
K + (1A —+ K

0 Run the clustering algorithm A with the kernel matrix K (if A does not take
a kernel similarity matrix as input, a data representation can be generated
by the ﬁrst 8 + 1 eigenvectors of the matrix K).

 

 

 

Figure 4.3. Boosting algorithm for BCP and BCS

A to generate new cluster memberships. The resulting cluster memberships are then
used to compute the incremental kernel matrix A.

In addition to the projection matrix P, another important question is how to
compute the Optimal oz. We can estimate the optimal a by minimizing the upper
bound in (4.14), which leads to (I = log[1 + tr(TAT)]/6. However, we can further

improve the estimation Of a by minimizing the original Objective function in (4.6),

92

which is

n
CBCP(K’) : Sit-j (“Pf—Kid) Z SIT]. exp(KiJ)

23:1 i,j=1

n n
E p?" exp(—OAM) 2: (17:3. exp(aAiJ)
' i,j=1

II
N. .
a,
[L
i.

Tl 77.
Z I)zt,j5(Az',j,0) + Z Pi,j5(At,j,1)0XP(-a)
i,j=1 i,j=1

n n
x Z q,,,-5(A,,j,0) + Z q,,J-5(A,,j,1)exp(a) (4.17)
i,j=1 i,j=l
It is not difﬁcult to show that the optimal a that maximizes the above expression is:

a 11 g ZEjzlmgﬂAiJJ) X EiszlquAiJﬁ) (418)
= — O ~ _
2j=1pi,j6(Ai,j,0) ZZj=1Qi,j0(Ai,j,1)

 

 

2

Figure 4.3 summarizes the proposed BCP (and BCS) algorithm.

DISCUSSIONS ON EFFICIENCY AND SCALABILITY

Similar to most boosting algorithms, we can show that the objective function of the

proposed BCP algorithm is reduced exponentially, as shown by the following theorem.

Theorem 1 Let A1, A2, . . . , AT be the incremental kernel matrices computed from
the clustering results by running the boosting algorithm (in Figure 4.3). Then, the

objective function after T iterations, i.e., £¥CP, is bounded as follows:

 

n n T
BCP + _
‘CT S 2 Si,j 2: Sin]. HO - 7t), (4-19)
i,j=1 i,j=1 t=1
where
W (m - vBiCt)2
(At + Btlfct + Dt)

93

Change in BoostCluster Objective Function
12-

 

 

10.8 I l .L
O 10 20 3O 4O 50

#Iterations

Figure 4.4. An example of BCP Objective function vs. number Of iterations.

At, Bt, Ct, and Dt are non-negative constants, and are computed as

TL it
At: 2 p§,j5(AE,J-.0), Bi = Z adored-,1)

i,j=l i,j=1
TI. 77.
_ t t _ t t
Ct — Z qi,j”(Ai,j’0l’ Dt — Z gaff/Air”
z'.j=1 i.j=1

where ij and qgj are computed according to (4.10) and (4.11} using the kernel

matrix K at the t-th iteration.

The above theorem can be proved directly by using the expression in (4.17) and

the expression for a in (4.18). Figure 4.4 shows the change in the BOP algorithm’s

Objective function Observed in one of our experiments. As can be seen, the Objective

function converges very fast, and becomes very ﬂat after around 22 iterations. In our

experiments, our BCP algorithm usually converges within 25 iterations.

In terms Of analyzing the scalability Of the BCP algorithm, it is easy to ﬁnd that

the most crucial step is ﬁnding the projection matrix P since it seems to involve a

94

huge amount of computation cost. As in (4.15), we need to compute XTXT, which
seems to be computationally expensive, especially when the number of examples (i.e.,
n) is large. However, it is important XTXT only involves the examples used in the
pairwise constraints. This is because XTXT can also be written as:
n
XTXT = Z Tia-xix;— (4.20)
i, j =1

Since Ti, j is nonzero only when the example pair (xi, xj) are used by the constraint,
the above calculation only involves a very small portion of the entire example pairs.
Thus, XTXT can be computed efﬁciently as long as the number of labeled pairs is
relatively small.

After XTXT is computed, the computational cost in constructing the projection
matrix P mainly arises from computing the principal eigenvectors and eigenvalues
of XTXT, particularly when the dimensionality of the feature space is high. For
instance, for text categorization, each document is represented by a vector of over
100,000 word features, and the size Of matrix XTXT is over 100, 000 x 100,000. A
straightforward approach is to reduce the dimensionality before running the proposed
algorithm. However, most dimensionality reduction algorithms that are capable of
handling high dimensional space are unsupervised, and therefore are unable to exploit
the pairwise constraints. Here, we propose an algorithm that is able to efﬁciently com-
pute the eigenvectors Of XTXT when the input dimensionality is high. We ﬁrst. realize
that each eigenvector of XTXT has to lie in the space that is spanned by the exam-
ples used by the constraints. More speciﬁcally, we denote by X = (5:1, 5:2, . . . ,sem)
the subset of m examples that are involved in the constraints. Then, the eigenvector
v,- can be written as a linear combination Of {59};er i.e.,

m
v,- = 2 mike), = Xwi (4.21)
kzl

More generally, we have

at

V = (v1,v2...,v3) =X(w1,w2,...,w3) =XW.

95

The proof of this result can be found in Appendix A.1. We "furthermore denote by

T the pairwise constraints, where T,- j gives the pairwise constraint between x,- and
7

59. Then, wi,i = 1,2,. . .,s correspond to the ﬁrst 3 principal eigenvectors of the

following generalized eigenvector problem

xTXTxwa, = A,xwa, (4.22)

Note that since XTXTXTX and XTX are matrices of m x m, the cost. Of computing
the eigenvectors is independent Of the dimensionality Of the input space. The proof

of the above result can be found in Appendix A.2.

THE BCS ALGORITHM

The derivation of the BCS algorithm is very similar to the BCP algorithm. First, we

compute the BCS Objective function for K’ = K + aA as follows:

n
B I ’ —
£ CS(K) = Z 5:]- CXp(—Ki,j) ‘l‘ CSi,j eXP(Ki,j)
i,j=1
n
= Z mama-01AM)+qu,jexp(aAiJ)
i,j=1

Using the Jensen’s inequality, we have

Alaj +1 1 1— Atj
exp(—aAz-J) = exp —3a——3-—— + 303 + 0——3——

 

 

A' '+ 1 1 1 _ A- .
S —ZL3—— exp(—3a) + S exp(3a) + ——3—2’1
_ 1 + exp(3a) + exp(—30) A 1 — exp(—3O.)
“ 3 '- 133' 3

Similarly, the upper bound for exp(aAi‘j) is the following expression

 

 

1 + exp(3a) + exp(—3a) + A 1 — exp(—3(1)

eXp(aAi,j) S 3 i,j 3

Using the above inequalities, the BCS Objective function LB CS (K ’ ) is upper bounded

by the following expression:

96

LBCS(K’) S 1+ exp(3a-)3+ exp(-3a)£BCS(K) _ 1— ex[)(_3a)tr(T3AT)

 

 

where T; is deﬁned

lTSli.j = pi,j—C(Ii’j (4.23)

Note that the above deﬁnition is similar to T in (4.15). The key difference is that pm-
and qz-J- are normalized in (4.15) while they are not in the above expression. Similar
to the BCP algorithm, we compute the top 3 principal eigenvectors of T s to construct
the projection matrix, and the projected input data is sent to the algorithm A for

clustering. Furthermore, the Optimal oz for BCS is computed as

a _ 110g sz: 1p2,j6(Ai,j71)
gCZij_1qZ,j6(Az,j71)

Figure 4.3 summarizes the BCS algorithm. Finally, we can show that the Objective

 

(4.24)

function of BCS decreases exponentially, i.e.,

Theorem 2 Let A1, A2, . . . ,AT be the kernel matrices computed from the clustering

results by running the boosting algorithm ( in Figure 4.3) Then, the objective function

BCS
ET

after T iterations, i.e., , is bounded as follows:

T
£305 3 2 3+ +cSi-J H( (1—7t), (4.25)
Z,j=1t1:
where

,3 = («BE—«DD?
t At+Bt+Ct+Dt

 

At, Bt, Ct, and Dt are already deﬁned in Theorem I.

4.4 Experiments

We now present an empirical evaluation of our proposed boosting framework. In

particular, we aim to address the following four questions in our study:

97

1. As a general boosting framework, is the proposed method able to improve the

performance for any given clustering algorithm?

2. How eﬁective is the proposed boosting framework in improving the clustering

performance by using the pairwise constraints?

3. How robust is the proposed boosting framework in improving the clustering per-

formance by using the pairwise constraints?

4. How does the BCP algorithm compare to BCS algorithm in the proposed boosting

framework?

4.4.1 Experiment Setup

To validate the claim that the proposed boosting algorithm is capable of improving
any clustering algorithm by exploiting the pairwise constraints, three typical cluster-

ing algorithms are used in our study. They are:

1. K—means algorithm [5]. It represents the family of clustering algorithms that try
to ﬁnd compact and well-separated clusters. We adopted the implementation

from the Weka software2.

2. Partitional SingleLink algorithm (“SLINK” for short) [80]. It represents the
family Of the hierarchical clustering algorithms. We adopted the implementation

from the CLUTO software3

3. k—way spectral clustering (“SPEC” for short). It represents the family of spec-
tral methods for data clustering. In particular, we follow the paper [115] for the

implementation of spectral clustering.

 

2http://www.cs.waikato.ac.nz/ml/weka/
3http://glaros.dtc.umn.edu/gkhome/views/cluto

98

 

 

 

 

 

 

 

Name #Attributes #Clusters #Examples

wdbc 30 2 569

scale 4 3 625

vowel 10 1 1 990
segmentation 19 7 2310
handwrittiendigit 256 10 4000
pendigit 16 4 4396

 

 

 

 

 

 

Table 4.2. Datasets used in the experiments.

Six datasets drawn from the UCI machine learning repository [50] are used in our
study. Table 4.2 summarizes the information about these datasets4. As indicated
in Table 4.2, these datasets vary signiﬁcantly in their sizes, number of clusters, and
number of attributes.

To evaluate the clustering performance, two measurements are used in our exper-
iments. The ﬁrst measurement is normalized mutual information (NMI for short)
[15], which is deﬁned as

2MI(X, X0)
H(X) + H(X0)

 

NMI

where X 0 and X denote the random variables of cluster memberships from the ground
truth and the output of clustering algorithm, respectively. M I (:r, 3]) represents the
mutual information between random variables x and y, and H (51‘) represents the
Shannon entropy of random variable x. The second measurement is Pairwise F -
measure (PWFl for short), which is the harmonic mean of pairwise precision and

recall that are deﬁned as follows

#pairs correctly placed in the same cluster

 

 

 

preczszon _ Total #pairs placed in the same cluster
ll #pairs correctly placed in the same cluster
reca Total #pairs actually in the same cluster
PWFl _ 2 x precision x recall

precision + recall

 

4Note that for the “pendigit” dataset, examples in only four classes of letter “3”, “6”, “8” and “9”
are selected from a total of 10 classes because these four letters are in general difﬁcult to distinguish.

99

ur "1.7-

. ‘-..‘IIII'I_J' .. ‘_L.'l!

 

The PWF 1 measurement deﬁned above is closely related to the metric deﬁned in [155]
that measures the percentage of data pairs correctly clustered together. The key
problem with the metric deﬁned in [155] is that it. counts two types of data pairs, i.e.,
pairs of data points assigned to the same cluster and pairs of data points assigned
to different clusters, with equal importance. This is problematic because most of
the data pairs will consist of data points from different, clusters when the number of
clusters is large. A similar issue was raised in multi-class learning, and that. is why
F -measure is widely used for evaluating multi-class learning [161].

To verify the efﬁcacy of the proposed boosting framework in exploiting the pairwise

constraints for data clustering, three baseline approaches are used:

1. Metric Pairwise Constraints K—means (MPCKmeans for short) algorithm [?,

12], which is a probabilistic framework based on Hidden Markov Random Fields.

2. Semi-supervised Kernel K—means (SSKK for short) algorithm [96], which ex-
ploits the pairwise constraints by a kernel approach and ﬁnds clusters with

nonlinear boundaries in the input data space.

3. Spectral Learning (SpectraJLearn for short) algorithm [89], which applies spec-
tral methods to learn a data representation from the pairwise constraints. The
generated data representation can therefore be used by any clustering algorithm
to identify the appropriate data clusters. The key difference between spectral
learning and our algorithm is that our algorithm generates algorithm speciﬁc
data representations by taking into account the performance of clustering algo-

rithms.

Previous studies [15, 12, 96, 89] showed that the above three algorithms deliver the
state-of-the—art performance in comparison to other semi-supervised clustering algo-

rithms such as the constrained K-means.

100

 

 

+BoostCluster + K—means
+BoostCluster + SLINK
+BoostCluster + SPEC
"0"MCPKmeans

“HOH'SSKK

-£I-'SpectralLearn + K-means
-*--Spectra1Learn + SLINK
-A-'SpectralLearn + SPEC

 

 

 

Figure 4.5. Legends for all algorithms in our comparative study. These legends will be
used in all the ﬁgures in this paper.

In summary, we will compare the following semi-supervised clustering algorithms
in the experiments: the three clustering algorithms (K-means, SLINK, and SPEC)
being boosted by the proposed BoostCluster framework; the same three clustering
algorithms with input from the Spectral Learning algorithm; the MPCKmeans al-
gorithm; and the SSKK algorithm. For easy identiﬁcation in ﬁgures, we listed the
legends for the above algorithms to be compared, in Figure 4.5. These legends apply
to all following performance comparison ﬁgures (we will omit showing legends in those
ﬁgures due to space constraints).

Finally, in all the experiments, we vary the number of pairwise constraints from 0
to 800. Since a random sampling of data pairs tends to ﬁnd many more cannot-link
pairs than the must-link pairs, in this study, we enforce an equal number of constraints
for both must-link pairs and cannot-link pairs. The numbers of eigenvectors (i.e., the
parameter s in the boosting algorithm shown in Figure 4.3) are determined empirically
as follows: 3 for the “scale” dataset, 10 for the “handwrittendigit” dataset and 5 for
the remaining 4 datasets. All the experiments in this study are repeated ﬁve times,

and the evaluation results averaged over the ﬁve trials are reported.

101

 

4.4.2 Generality of the Boosting Framework

Figures 4.6 - 4.9 show the clustering performance, evaluated by NMI and PVVFl
respectively, of the BCP algorithm of BoostCluster framework using the three clus-
tering algorithms (i.e., K-means, partitional SingleLink, and spectral clustering), the
same three clustering algorithms with input as the new data representation from the

Spectral Learning algorithm, the MPCKmeans algorithm, and the SSKK algorithm.

1. We observe that for most datasets, the BoostCluster framework is able to im-
prove the clustering performance for all the three clustering algorithms regard—
less of which evaluation metric is used. This suggests that the proposed frame—
work is effective in exploiting the pairwise constraints to improve clustering
performance. MPCKmeans algorithm and SSKK algorithm are also effective in
general, however, their clustering performance improvements are less signiﬁcant,

especially for larger datasets (such as “handwrittendigit” and “pendigit”).

2. Although SpectralLearn algorithm can also be combined with any clustering al-
gorithm, in our experiments, it does not always improve the clustering algorithm
performance. For example, for “wdbc” and “handwrittendigit”, increasing the
number of pairwise constraints deteriorates clustering performance by combin-
ing SpectralLearn with any of the three clustering algorithms. Moreover, the
effect of SpectralLearning depends on the clustering algorithm. For example,
for the “pendigit” dataset, SpectralLearn improves the clustering performance
of K-means and SLINK, but degrades SPEC in general. In comparison, the
clustering performance improvement brought by the proposed BoostCluster is
substantially more stable and consistent, across different datasets and different
clustering algorithms. This can be attributed to the fact that BoostCluster is
adaptive to both clustering algorithms and datasets: in each iteration, it takes

the feedback from the result of applying the given clustering algorithm to the

102

 

particular dataset, and decides how to adjust the kernel matrix. However, Spec-
tralLearn generates new data representations independent from the clustering

algorithm that is used.

‘. The performance of the three clustering algorithms (K-means, SLINK, and

SPEC) varies significantly across the six different datasets. For instance, for
the “vowel” dataset, “BoostCluster+K-means” algorithm performs consider-
ably worse than “BoostCluster+SPEC” algorithm. However, the performance
of “BoostCluster+K-means” algorithm for the “pendigit” dataset, is signiﬁ-
cantly better than that of “BoostCluster+SPEC” algorithm. This result also
indicates that every clustering algorithm has its own strength, and therefore it is
important to develop a general framework that is able to boost the performance

of any clustering algorithm by the given pairwise constraints.

. The results based on the two different evaluation metrics, namely NMI
and PWFl, are inconsistent on some occasions. For instance, for the
“handwrittingdigit” dataset, according to NMI, the clustering performance of
“BoostCluster+K-means” and “BoostCluster+SLINK” appears to be similar.
However, according to PWF 1, “BoostCluster+SLIN K” performs noticeably bet-
ter than “BoostCluster+K-means”. The implication of this ﬁnding is the impor-
tance of evaluating clustering performance by more than one evaluation metric,
since conclusions based on the results of a single evaluation metric could be

biased.

4.4.3 Robustness of Exploiting Pairwise Constraints

Although the curves in Figure 4.6 - Figure 4.9 all display different degrees of “bumpi-
ness”, generally speaking, BoostCluster framework, SSKK and MPCKmeans deliver
a more robust performance than SpectralLearn algorithm. On the other hand, al-

though for most datasets, the performance curves of SSKK appear to be the most

103

'IF'T“"¢.A" H

NMI (wdbc dataset)

 

%:;:¢=;:2§222$111§ ﬁzzzzazzlﬁ'a:
0 100 200 300 400 500 500 700 800
#Pa1rw1se Constraints

NMI (scale dataset)

 

 

0 100 200' 3Q0 400 500 6005 700 800
#P airwise Constrain

NMI (vowel dataset)

 

 

0 100 200 300 400 500 600S 700 800
#Pairwise Constraint

Figure 4.6. Clustering performance evaluated by NMI (Part A). Each graph shows the
performance of the three clustering algorithms (K-means, parititional SLINK, spectral clus-
tering) boosted by the proposed BCP algorithm, and the performance of the MPCKmeans,

SSKK and SpectralLearn algorithms.

104

NMI

OOOOOOOOO

NMI (segmentation dataset)

 

 

.2" I“

"0 100 200 300 400 500 1n60t0S 700 800

Pairwise Constra
NMI (handwrittendigit ldatsaeet)

 

0“
5‘

-1‘ .§;;;3;;; :1: 323:5: ':;.= = 4523

 

 

V0 100 200I 300 400 500 6008 700 800

#P airwise Constraint

NMI (pendigit dataset)

 

 

 

0 100 EOOI 300 400 500 600S 700 800

Pairwise Constraint

Figure 4.7. Clustering performance evaluated by NMI (Part B). Each graph shows the
performance of the three clustering algorithms (K—means, parititional SLINK, spectral clus-
tering) boosted by the proposed BCP algorithm, and the performance of the MPCKmeans,
SSKK and SpectralLearn algorithms.

105

PWFl (wdbc dataset)

 

 

 

 

 

 

H
E o
m
o I 7 - — -..—- ;;; 112‘.‘.-.' :‘*ﬁ-““““‘
0.6
O 100 20a0i 3Q0 400 500 6005 700 800
#P rwise Constrain
PWFl (scale dataset)s ,
O. .
O.
0.
H I.
m
z
m
0.
O.
0.4
0 100 200 3QO 400 500 600 700 800
Pairwise Constraints
PWFl (vowel dataset)
0.
H
m
3
m
0.15

 

 

0 100 20Q 300 400 500 6008 700 800
#P airwise Constraint

Figure 4.8. Clustering performance evaluated by PWFI (Part A). Each graph shows the
performance of the three clustering algorithms (K—means, partitional SLINK, spectral clus-
tering) boosted by the proposed BCP algorithm, and the performance of the MPCKmeans,
SSKK and SpectralLearn algorithms.

106

PWFl (segmentation dataset)

 

 

 

 

 

 

 

 

 

O.
H(L
m
3
O...
O.
\‘A_—-IA”
0.3
0 100 200 300 400 500 600 700 800
#Pairwise Constraints
PWFl (handwrittendigit dataset)
O.
O.
H
m
z
010.
0.
.W
R—f-MMXL‘TX-HAn-ér-HA
0.2
O 100 200 300 400 500 600 700 800
Pairwise Constraints
PWPl (pendigit dataset)
1,. . .. . .7 ,
’3---q\ p
O.
0.
r.40.
m
3
“'0.
0.
0.
0

. O 100 200 300 400 500 600 700 800

#Pairwise Constraints
Figure 4.9. Clustering performance evaluated by PWFl (Part B). Each graph shows the
performance of the three clustering algorithms (K-means, partitional SLINK, spectral clus-
tering) boosted by the proposed BCP algorithm, and the performance of the MPCKmeans,
SSKK and SpectralLearn algorithms.

107

smooth among all the competitors, the resultant improvement is almost always the
least noticeable among all the semi—supervised clustering algorithms.

To further evaluate the robustness of all the algorithms, we conduct experiments
with noisy pairwise constraints. We randomly select 20% of the pairwise constraints
and flip their labels (i.e., a must—link pair becomes a cannot-link pair and vice versa).
This setting reflects the scenario when the side information includes incorrect pairwise
constraints. It could happen when for instance, the pairwise constraints are derived
from the implicit user feedback (e.g., user ratings or click-through data). Thus, it is
important to develop semi-supervised clustering algorithms that are resilient to the
noisy side information.

Figure 4.10 and Figure 4.11 show the performance of all the algorithms, on three
selected datasets (i.e., “scale”, “vowel”, and “pendigit”) when 20% of the pairwise
constraints are noisy. First, by comparing Figure 4.10 - Figure 4.11 with Figures 4.6 -
Figure 4.9, it is not surprising to observe a degradation in clustering performance when
20% of the pairwise constraints are noisy. Second, we observe a general trend that a
larger number of noisy constraints tend to result in an inferior clustering performance
by MPCKmeans. This is in contrast to the results of MPCKmeans shown in Figure 4.6
— Figure 4.9 where increasing the number of pairwise constraints usually improves the
performance of clustering. This result implies that the MPCKmeans algorithm is
unable to effectively exploit the pairwise constraints for data clustering when they
are noisy. Similarly, while “SpectralLearn+K—means” and “SpectralLearn+SLINK”
are able to noticeably improve the clustering performance with increasing number of
noise-free pairwise constraints, with 20% noise in the constraints, their performance
degrades with the increasing number of constraints. In comparison, as shown in Fig—
ure 4.10 and Figure 4.11, BoostCluster framework is generally able to improve the
performance of all the three clustering algorithms with increasing number of noisy

pairwise constraints, and SSKK algorithm is also able to improve clustering perfor-

108

 

 

 

 

 

 

 

 

 

 

 

 

K-means (BCP) K-means (BCS)

DATASETS NlVIf PWFI NMI PWFI
wdbc 0.5862 0.8502 0.6360 0.8221

scale 0.3040 0.5520 0.3450 0.5645

vowel 0.2723 0.2069 0.3317 0.2565
segmentation 0.6315 0.5843 0.6013 0.5544
handwrittendigit 0.5720 0.5664 0.5435 0.5359
pendigit 0.7448 0.8140 0.7261 0. 8166

SLINK (BCP) SLINK (BCS)I

DATASETS NMI PWFI NMI PWFI

wdbc 0.7438 0.9248 0.7971 0.9440

scale 0.5399 0.8091 0.5015 0.7839

vowel 0.2982 0.2302 0.3044 0.2599
segmentation 0.6164 0.5740 0.5765 0.5421
handwrittendigit 0.5451 0.6328 0.5300 0.5579
pendigit 0.7833 0.8654 0.7713 0.8573

SPEC (BCP) SPECLBCS)

DATASETS N Ml PWFI N MI PWFT
wdbc 0.7073 0.9038 0.6749 0.8964

scale 0.4792 0.6021 0.3818 0.6099

vowel 0.3431 0.2606 0.3417 0.2649
segmentation 0.6004 0.5559 0.5754 0.5768
handwrittendigit 0.4764 0.4306 0.4579 0.4417
pendigit 0.5912 0.6325 0.8056 0.8745

 

 

 

 

Table 4.3. The performance comparison between the BCP algorithm and the BCS algo-
rithm, when the number of pairwise constraints is 500.

mance despite the noise in pairwise constraints. This indicates that the proposed
BoostCluster framework and the SSKK algorithm are more resilient to the noise in

the side information.

4.4.4 BCP vs. BCS

We compare the clustering performance of the BCP algorithm and the BCS algorithm
in Table 4.3 that is evaluated in both NMI and PWFl. Due to the space limitation,
we only list the evaluation results for 500 pairwise constraints. We set c = 1 in the
objective function (4.7) of the BCS algorithm.

As indicated in Table 4.3, in general the BCP and BCS algorithms deliver similar
performance, which suggests that both algorithms are effective in boosting the per—
formance of the underlying clustering algorithms. BCP is more effective than BCS for

certain datasets and clustering algorithms, and vice versa. However, it is important

109

NMI (scale dataset, 20% noise)

  

 

0 100 200 300 400 500 600S 700 800
#Pa11rwise Constra

NMI (vowel dataset, 20% noise)

,n-u-a

 

 

'10 100 200i 3QO 400 500 600 700 800
#Pa rwise Constraints

NMI (pendigit dataset, 20% noise)
8 .

 

‘G
0.1 ”"44: ~.
h‘--~A-—-A---—A---§:’=“"“

 

v0 100 200 300 400 500 600 700 800
#Pairwise Constraints

Figure 4.10. Clustering performance (NMI) with 20% noise in the pairwise constraints.
Each graph shows the performance of the three clustering algorithms (K-means, pariti—
tional SLINK, spectral clustering) boosted by the proposed BCP algorithm of BoostCluster
framework, and the performance of the MPCKmeans, SSKK and SpectralLearn algorithms.

110

PWPl (scale dataset, 20% noise)

 

 

  
  

 

0.
H
:0
a
0.
0.4
0 100 20n0 I300 400 500 6005 700 800
#P w1se Constraint
PWFl (vowel dataset, 20% noise)
O.
H
[n
z
D;
0.
0 100 200 300 400 500 600S 700 800
Pai1rwise Constraint
PWPl (pendigit dataset, 20%S noise)
9
0.
0.
H
‘go
m
0.
O.
0

 

' 0 100 200 300 400 500 600 700 800
Pairwise Constraints

Figure 4.11. Clustering performance (PWFl) with 20% noise in the pairwise constraints.
Each graph shows the performance of the three clustering algorithms (K-means, pariti-
tional SLINK, spectral clustering) boosted by the proposed BCP algorithm of BoostCluster
framework, and the performance of the MPCKmeans, SSKK and SpectralLearn algorithms.

111

I.

to note that compared to BCP, BCS has an additional parameter c that can be used
to tune the boosting algorithm. Our empirical experience indicates that changing the
parameter c can lead to signiﬁcant difference in the clustering performance. It would
be useful to investigate the pros and cons of the BCP and BCS algorithms, and how

to choose the parameter c in the BCS objective function wisely.

4.5 Summary

In this chapter, we have studied the problem of improving data clustering by using side
information in the form of pairwise constraints. A general boosting framework has
been proposed to improve the accuracy of any given clustering algorithm with a given
set of pairwise constraints. Such performance improvement is achieved by iteratively
ﬁnding new data representations that are consistent with both the clustering results
from previous iterations and the pairwise constraints. Empirical study shows that
our proposed boosting framework is able to improve the clustering performance of

several popular clustering algorithms by using the pairwise constraints.

112

CHAPTER 5

Semi-supervised Learning for Query
Translation Disambiguation in
Dictionary-based Cross Language

Information Retrieval

5.1 Introduction

To bridge the gap between different languages, machine translation has been used
extensively in many research areas of multilingual information processing, such as
cross-language information retrieval (CLIR). In CLIR, we can either translate queries
into the language of documents or translate documents into the language of queries.
Usually, it is simpler and more efﬁcient to translate queries because of their shorter
length. Most query translation algorithms require external linguistic resources, among
which parallel corpora and bilingual dictionaries are the most commonly used. Meth-
ods based on parallel corpora usually learn the association between words of the
source language and words of the target language, and apply the learned association
to estimate the translation of queries. Examples in this category include statistical
translation models [157, 53, 116, 94], and relevance language models [93, 101, 100].

The main drawback of these methods is that they depend critically on the availability

113

of parallel bilingual corpora, which are often difﬁcult to acquire, especially for minor
languages.

While the problem of machine learning for language translation remains tough
(either theoretically or practically), undoubtedly, bilingual dictionaries can be viewed
as side information for various multi-lingual tasks. Typically, a bilingual dictionary
provides a repository of translation pairs, usually organized in a manner that each
word (or phrase) in one language is supplied with a list of possible translation words
(or phrases) in another language. These translation pairs, to some extent, helps to
bridge (or narrow) the gap between languages. With the increasing availability of
machine readable bilingual dictionaries, dictionary-based approaches becomes more
preferable for CLIR applications, especially when other linguistic resources are scarce.

Compared to the approaches based on parallel corpora, a major disadvantage of
the bilingual dictionary based approaches is that they lack the ability in disambiguat-
ing the translation of query terms among multiple candidates. Very often, a number
of translations (which we call translation candidates in this chapter) are found in a
bilingual dictionary for a single query word, but most of them are irrelevant to the
semantic meaning of the query. Hence, it is crucial for a dictionary-based approach
to reduce the ambiguity in translating query words as much as possible. However in
CLIR, given the short length of a query, it is usually impossible to completely resolve
the translation ambiguity due to the multiple interpretation of the query. Thus, it is
also important for any dictionary-based CLIR approach to maintain the uncertainty
in translating queries when the ambiguity is hard to resolve.

In the past, several approaches [81, 57, 58, 1, 95] have been proposed to resolve the
query translation ambiguity in dictionary-based CLIR. The simplest one is to use all
the translation candidates of each query word provided by the dictionary with equal
weights [46, 95]. This amounts to no sense disambiguation when translating query

words. Other approaches try to resolve the translation ambiguity by measuring the

114

coherence of a translation candidate to the entire query. Typically, the coherence score
of a translation candidate is computed using word co—occurrence statistics. Given a
query, a translation candidate of a query word is assigned with a high coherence
score when it co—occurs frequently with the translations of other query words. The
translation candidates with the highest coherence scores are selected to form the ﬁ-
nal translation for the original query. In [46, 76, 2, 93, 95, 57], only one translation
candidate is selected for each query word; in [81, 109], a translation candidate is se—
lected when its coherence score exceeds a predeﬁned threshold, which allows multiple
translations to be selected for each query word. We will refer to both approaches men-
tioned above as selection-based approaches, because they all have to make a binary
decision for each translation candidate regarding if it will be included in the trans-
lated query or not. Given the usually short length of queries and the large variance
existed in mapping information across different languages, such binary decisions are
usually difficult, if not impossible, to make. We call this problem the “translation
uncertainty problem”. Another problem with the selection-based approaches is
that the translation of one query word is usually determined independently from the
translations of others. This assumption is reﬂected in the calculation of coherence
scores. Usually, the coherence score of a translation candidate to a given query is
computed as the sum of its similarities to every translation candidate for other query
words. As a result, coherence scores are estimated independently from the choice of
translations for query words, which leads the selection of translation candidates for
different query words to be independent. We call this problem “translation in-
dependence assumption”. Although this problem has been addressed in previous
work (e. g., [57]), usually greedy approaches are applied and therefore only suboptimal
solutions can be obtained.

In this chapter, we propose a novel statistical framework for dictionary-based

CLIR. This framework will allow us to estimate the translation probabilities of query

words by maximizing the overall coherence of the translated query, which we call
“maximum coherence principle”. Particularly, the proposed framework explic-
itly addresses the two problems mentioned above: to resolve the translation uncer—
tainty problem, the proposed framework maintains the uncertainty in translating
queries through the estimation of translation probabilities of query words; to remove
the translation independence assumption, the proposed framework allows the transla-
tion probabilities of all query words to be estimated simultaneously. Furthermore, the
proposed framework is formulated as a quadratic programming problem [62], whose
global optimal solution can be found efficiently using standard optimization packages
such as Matlab. This is in contrast to several existing approaches such as the propa-
gation approach in [114], where the solution is determined by an iterative procedure,
which is not only time consuming but also sensitive to the initialization of parameters
or the stop criterion employed in the iterative procedure.

In addition to the general framework, we also present in this chapter two realiza-
tions of the proposed framework that employ different coherence measurements: the
“Maximum Coherence Model” that adopts the raw word-to-word similarity for
coherence measurement, and the “Spectral Query Translation Model” that mea-
sures the coherence score of each translation candidate based on the normalized word
similarity. As will be explained later, the Spectral Query Translation Model based
on the normalized similarity measurement, can be further explained as a graph parti-
tioning approach for query translation disambiguation, which is employed in spectral
clustering. Our empirical studies with TREC datasets have shown that both models
outperform the selection-based approaches with relative improvements ranging from

10% to 50%.

116

5.2 Related Work

We ﬁrst review the previous work in the selection-based approaches for query transla-
tion disambiguation, followed by the discussion of spectral clustering that is strongly

related to the proposed Spectral Query Translation l\-‘Iodel.

5.2.1 Selection-based Approaches for Query Translation Disambiguation

One of the major factors that can potentially degrade the effectiveness of dictionary-
based cross—language information retrieval is the ambiguity in translating query words
[8, 57]. In the efforts to resolve this translation ambiguity, a number of recent studies
[46, 76, 81, 2, 93, 109, 57, 95] have suggested the strategy of translation selection by
exploiting word co—occurrence patterns. Usually a similarity measurement between
two translation candidates is deﬁned in the form of word co—occurrence statistics.
With the word similarities, we can then measure the coherence of a translation candi-
date with regard to the theme of the entire query. Only those translation candidates
with high coherence scores will be selected for the query translation.

Ideally, for each query word we should select the translation candidate(s)
that is consistent with the selected translation candidates for other query words.
Apparently, this becomes a “chicken-egg” problem since the selection of translation
candidates for one query word is determined by the translation candidates selected
for other query words. Thus, due to the computational concern, most selection-based
approaches [1, 57, 58] adopted approximate approaches that usually only produce
suboptimal solutions. In particular, for each query word, those approaches choose
the translation candidates that are consistent with all the translation candidates
provided by the dictionary for all the query words, including both the selected and
the unselected translation candidates. Formally, a translation selection strategy can

be formulated as follows:

117

APPROXIMATE TRANSLATION SELECTION ALGORITHM

 

1. Given a query q8 = {qi (13, - ~ ,qfns} in the source language , for each query

word (1?, look up the dictionary for the translation candidate set S,- = {wa j}

2. For each set S,-

a For each translation candidate wf . in S -, deﬁne the similarit measure—
i y

2,]
ment between the word 11):]. and the set SI/(i’ aé i) as the sum of the

)

similarities between wf j and each word in the set S i” i.e.,
sim(w£IJ-, 52.)) = Z sim(wsz]-, wf, l) (5.1)
v r’: 68.
u2,,l 2’
where sim(wzt- j’ wzi, l) computes the word-to—word similarity.
(b) Compute the coherence score for wzt. j as
t _ ,- t
f(wi,j) — Z sam(w,-Ij, Si’) (5.2)

Viiyéi
(c) Select the word q; in S,- with the highest coherence score
t _ , , t
q,- — aIgmtax f(wi,j) (5.3)

w. .
3,]

The deﬁnition of similarity between two words in the above algorithm can take
various forms of co—occurrence statistics, such as Dice similarity (as in [1]), mutual
information (as in [81, 109]) or its variants (as in [57, 58]). In addition to selecting the
most likely translation for each query word, other selection-based approaches have
been tried, such as selecting the best N translations [46] or selecting translations
whose coherence scores exceed a predeﬁned threshold [81, 109].

Apparently the above approximate algorithm is not ideal. In particular, the coher-
ence score for a translation is computed with regard to both selected and unselected

translations. As a result of such an approximation, translation of different query

118

words are determined independently, which leads to the translation independence
problem as discussed in the introduction section. III the proposed statistical frame-
work, by formulating the problem of translation selection in a quadratic programming
form, we are able to efficiently estimate the translations of all query words simultane-
ously. Furthermore, in contrast to the selection-based approaches that make binary
decision for each translation candidate, the new framework employs soft probabilities
for representing both selected and unselected translation candidates. This is particu-
larly useful when binary decisions are hard to make, for instance, all the translation

candidates of a query word have very similar coherence scores.

5.2.2 Spectral Clustering

Spectral clustering approaches view the problem of data clustering as a problem of
graph partitioning. Each data point corresponds to a vertex in a graph. Any two
data points are connected by an edge whose weight is the similarity between the two
data points. To form data clusters, the graph is partitioned into multiple disjoint sets
such that only the edges with small weights are removed. Based on different criteria
imposed on the partitioning, there are three major variants for spectral clustering:
Ratio Cut [38], Normalized Cut [134] and Min-Max Cut [49]. In the following, we
brieﬂy recapitulate the 2-way Normalized Cut algorithm since it is the most widely
used spectral clustering algorithm and has the closest relation to our proposed work.

Let G'(V,E;W) denote an undirected graph, where V is the vertex set, E is
the edge set, and W = (wi,j)n><n is a matrix with win 2 0 denoting the edge
weight between the i-th and the j-th vertex. Deﬁne D = diag(d1, d2, - - - ,dn), where
d,- = ZjEV win. To partition the vertex set into two disjoint sets A and B, a 2-way

Normalized Cut algorithm minimizes the following objective function:

_ S(A,B) S(A,B)
J _ (M + dB (5.4)

 

where we deﬁne S(A,B) = ZiEA ZjEBwiJ as the cut value, dA = ZiEA di and

119

d B = 27:6 B d, as normalizing factors that. balance the size of the. two clusters. The

above objective function can be rewritten as

(lA-f-(lB ((lA+(lB)2
J= 2: Se =2 Zw < .
‘i,j 1,]
iEAjEB dAdB iEAjEB d4dB (dA+dB)

 

C"!
C}!
v

If we introduce a cluster indicator vector q, with each element q, deﬁned as

 

\/dB/ldA(dA+dB)l lfi E A

\/dA/ldB(dA + 6113)] ifi E B

 

the objective function becomes

2
J = Z (qr-(13') we)“

i,jeV
2 2
= Z ((12- +€Ij — QQIinlwiy
z‘,jeV
2
= :29i(zwi,j)— Z quqJ-u’m
iEV jeV z',jeV
2
= QZqidi—Q Z qiquiu’
iEV i,jeV
= 2qT(D—W)q (5.7)

Note that the minimizer to the above normalized cut value J is a binary vector q,
with each element ‘17; indicating the cluster membership of a vertex. Given its combi-
natorial nature, it is difﬁcult to solve the optimization problem efficiently (NP hard).

However, if we relax the cluster memberships to real values under the constraints

qTDq = 1 (5.8)

qTDe = 0 (where e = [1, - -- ,1]T) (5.9)
the Normalized Cut algorithm can be formulated as follows:

minq qT(D — W)q (5.10)

s.t. qTDq = 1, qTDe = 0

120

 

Furthermore, if we deﬁne a = DQq, we reach the following equivalent Optimization

problem

. ~T
1111qu q (I —D

s.t. a

Note the above problem is in the form of Rayleigh quotient, and its solution can be

found by solving the following eigenvalue system [65]

1 1
(I—D—QWD—?)q = is (5.12)

5.3 The Statistical Framework For Dictionary-based CLIR

The essential idea of the framework is to learn a set of translation probabilities for
query words from the word co—occurrence statistics that maximizes the overall co—
herence of the translated query. In the following subsections, we will describe the
components of the general statistical framework for dictionary-based cross-language
information retrieval, including uncertainty modeling, the retrieval model, translation
probabilities learning and the solution to the related optimization problem, followed

by a summary that sketches the steps of applying the proposed framework to CLIR.

5.3.1 Notation

The term “source language” and a superscript s are used when referring to the
language of queries. Similarly, the term “target language” and a superscript t are
for the language of documents. Let a query of the source language be denoted by
qs = {wiwi - -- ,wfns}, where m3 is the number of distinct words in qs. Let rk
denote the set of translation candidates provided by the dictionary for a word w]: in
the source language. The union of translation candidate sets for all the words in qs is

s
then denoted by R = U221 rk. The total number of distinct translation candidates

121

 

for query qs, i.e., the size of R, is denoted by mt . A matrix T = [tkIj] t repre-

ms x m.
sents the part of the bilingual dictionary related to query qs. Each element t k. j in T
is 1 if the j-th word in the target language appears in the dictionary as a translation

for the k-th word in the source language, and 0 otherwise.

5.3.2 Modeling the Uncertainty in Query Translation

To address the problem of translation uncertainty, we build the statistical framework
by introducing translation probabilities. For a given query word, instead of making
binary decision for its translation candidates, we estimate the probability of translating
the query word into each translation candidate. More importantly, the translation
probabilities are estimated under the context of the entire query, namely a translation
candidate will be assigned large probability mass if it is coherent with the semantic
meaning of the entire query and vice versa.

Let kaj denote the probability of translating a word w: of the source language

into a word wj of the target language, given the context of query qs. It is deﬁned as

Pk,j = Pr(w§-lw;2, qs) (5.13)

In order to be consistent with the dictionary T, we assume that translation probability
kaj = 0 if the word w;- does not appear in the dictionary as the translation of
word wz. In other words, kaj could be nonzero only if the word w;- is one of the
translation candidates for the word wz. This assumption can be formally expressed

by the following constraints:
Vk’ =1,...,m3, j =1,...,mt Z O S kaj S tkIj
In addition, we have a constraint

Vk = 1, ...,m3: 2 kaj =1
ng-El‘k

122

to ensure that each query word has only one ideal translation given the context of
the query qs.

To simplify our notation, we further introduce the matrix P = [kaj] t to

msxm
denote all the translation probabilities for query qs. Then, the above two sets of

constraints can be rewritten as

P ' eTntXI = ems-X1 (5.14)
ongT 6m)

where e = [1, 1, . . . , 1]T.

5.3.3 The Retrieval Model

The introduction of translation probabilities kaj can be well accommodated by a
statistical retrieval model for CLIR. In particular, we estimate Pr(dt|q3), i.e., the
probability for a document (it in the target language to be relevant to a query qs in

the source language. By the Bayes’ law, this probability can be approximated as

Pr(qsldt) -Pr(dt)
Pr(q3)

The last step assumes that document prior Pr(dt) follows a uniform distribution.

Pr<dth3> = ~ Pr<q3|dt> (5.16)

 

Hence, in the following, we will compute log Pr(q3]dt), instead of log Pr(dt|qs).
To model the translation uncertainty, we rewrite the expression for log Pr(q3[dt)

as follows:

10gPr(qSIdt) = 10s / dqt Pr(qslqt)Pr(qtldt)
1
Pr(q5)
~ / dthr(qt|q3)logPr(qtld‘)
_ t s
— :(logPrUv [q )lPr(qt|q5)

wt

2 :Pr(wths) log Pr(wt|dt) (5.17)

wt

22

 

/ dqt Pr(qslqt) (10g Pr(qt ldt) + 10s Pr(q3))

123

 

In the second step of the above derivation we employs the Jensen’s inequality [125],
which can be viewed as the ﬁrst step toward the variational approximation [79, 88]. In
the third step, we ignore the term log Pr(q5) that is independent from the document
dt, and we also switch the roles between qt and qs using the Bayes’ law (similar
to (5.16)) by assuming that the prior of the translated query qt follows a uniform
distribution, i.e. Pr(qt) is a constant. In the fourth step, () represents the mathe—
matical expectation of a random variable. To estimate Pr(wt|q3), i.e., the probability
of observing the word wt in the translation of the query qs, we decompose Pr(wt|q5)

into a summation over words in the source language:

Province) = Z Pr<wt1w8;q8)Pr(w51q3) (5.18)
wSEqs

Finally, from (5.16) - (5.18) we have

logPr(dt|q5) ~ ZZPr(wt[wS;q5)Pr(w5|q3)logPr(wt|dt) (5.19)

wt ws
Here Pr(wt|dt) is a monolingual language model for a document (it in the target
language; Pr(wt|w3;q3) is the probability for translating a query word ws into wt
given the context of query qS; and Pr(w3|q3) is a monolingual language model for
query qs in the source language, which can also be seen as the weight assigned to
the query word ws. For the sake of simplicity, an uniform distribution is assumed

for probability Pr(w3|qs). As indicated in (5.19), the key component to the above

retrieval model is how to estimate the translation probabilities Pr(wt|ws; qs).

5.3.4 Learning the Translation Probabilities

In this subsection, we will describe the essential part of the statistical framework, i.e.,
automatically learning translation probabilities from the word co-occurrence statis-
tics. We will begin with the deﬁnition of an overall coherence measurement for trans-

lating a query, followed by formulating the learning process in an optimization form.

124

 

Using the translation probabilities introduced in the previous subsection, we can

now deﬁne a measurement for the overall coherence when translating a query qs, i.e.,

t
Co(qs; T) = E E kaj IOJ'J’ 'pk’,j’ (5.20)
V108 Eqs th Er
k ‘
Vw

k

.7
S 3
eq . t

where 0]. j’ is a pair-wise similarity that measures the correlation between two words
7

w;- and “1;" in the target language.

The above measurement is motivated by the intuition that appropriate transla-

tions of query words tend to be coherent. with each other. In other words, if w;- and

w;, are appropriate translations for words w]: and mi, respectively, we expect that
1) both translation probabilities ka j and pk; j’ are assigned large values, and in the

same time, 2) w; and w? are related by a large coherence measurement 0 Hence,

t
k,k’ '
what is implied by this intuition is that the assignment of probability kaj and pk; 3"

should be synchronized with the coherence measurement 0;. .,. This synchroniza-

,

tion is expressed in (5.20) through the multiplication of the three terms kaj, pk; j”

and 0t. .,. In particular, by maximizing the overall coherence in (5.20), we enforce

1

the consistency between the probability assignment and the coherence measurement
by assigning large values to ka j and pk, 3" only when the corresponding coherence

measurement 0:. j’ is large.

7

The similarity information (i.e., 0t. .,) can be derived from monolingual word co—

7

occurrence statistics using the metric such as information gain. The expression for

the overall coherence can be simpliﬁed using the matrix notation introduced in (5.14)

and (5.15):
Co(q5;T) = eTPOPTe (5.21)

where O = [03. t is the similarity matrix that includes the similarity mea-

,j’1mtxm

surement of any two words in the target language.

125

It is important to note that the coherence measurement deﬁned above is signiﬁ-
cantly different from the coherence measurement deﬁned in [58]. The key difference
between them lies in the fact that. the measurement deﬁned in [58] is based on the
concept of translation selection, namely only the best translation candidate is chosen
for every query word. As a consequence, the resulting formalization in [58] is indeed
a combinatorial optimization problem, and therefore is difﬁcult to solve efficiently.
By relaxing the binary choice of translation for query words to translation probabili-
ties, on one hand we resolve the difficulty in optimization, and on the other hand we
are able to explore the translation uncertainty, which could be important when the
translation ambiguities are difficult to resolve.

Now our goal is to determine the translation probabilities such that the overall
coherence is maximized. Putting together both the objective function in (5.21) and
the constraints in (5.14) and (5.15), the learning process of the query-dependent

translation probabilities can be formalized as the following optimization form

max eTPOPTe — CpeTPPTe (5.22)
PERmsxmt
8.15. P 'emtxl = emSXI
0 S P _<_ T

Notice that in the above objective function, in addition to the ﬁrst term that
corresponds to the overall coherence measurement for query translation, another term
—CpeTPPTe is introduced. This additional term is called a regularizer in machine
learning, and plays the similar role as the prior in Bayesian learning [113]. Note
that eTPPTe = 2%, Zj kajpk/Ij stands for the sum of all elements in PPT,
and its maximizer is to assign uniform distributions to all translation probabilities P.
Thus, by including the regularizer in the objective function, we essentially introduce
an uninformative prior for P, i.e., without the context of a given query, we assume

that all translation candidates provided by a bilingual dictionary are equally likely

126

to be selected. The regularizer approach has been widely used in many well-known
machine learning models, including the logistic regression model [117] and support
vector machines [29]. Parameter Cp is introduced to balance the trade—off between
the overall coherence measurement and the regularizer. Another important issue
with the objective function in (5.22) is the choice of similarity measurement 0. A
different similarity matrix 0 can result in rather different performance in information
retrieval. We will show two of them in the later sections. Finally, we would like to
emphasize that by solving the resulting optimization problem in (5.22), we are able
to acquire the translation probabilities of all query words simultaneously through the
computation of matrix P. This is in contrast to a number of existing approaches for
dictionary—based CLIR, where the selection of translations for individual query words
are determined either independently or by certain greedy means that usually leads to

suboptimal solutions.

5.3.5 Solving the Optimization Problem

The Optimization problem in (5.22) is in fact a quadratic programming (QP) prob—
lem [62], since we ﬁnd the objective function consists exclusively of quadratic terms
with respect to the set of variables {kaj} and all the constraints are linear to those

variables. A standard QP problem has the following form

1
min —xTHx + ch
x 2
st. Ax S b
Ex 2 d

where the vector x is the unknown variable, and matrices H, A, E and vectors b,
c, d are known. In this section we will discuss how to reformulate our optimization
problem (5.22) into the standard QP form, so that. it can be easily recognized by most

of existing QP solver softwares.

127

Since the standard QP form takes a vector form of the optimization variables,
we ﬁrst need to rearrange the unknown variables in the matrix P (i.e. the set of
translation probabilities) as a vector; then we need to reformulate the objective func-
tion as well as all the constraints in terms of the vector representation of the set of
translation probabilities.

To begin with, we concatenate all the row vectors in the matrix P together to

form a vector, i.e.,

T
p1 T Pl
P2 - P2

msxmt : _i pmsmtxl : ' (5'23)
I)"; pms

where f) is the rearranged vector, consisting exactly the same set of variables in the
matrix P.

Then, we need to reformulate the objective function in (5.22) with respect to the
vector 13, i.e. ﬁnding a matrix H such that eTPOPTe - CpeTPPTe = 131-Hf) is
satisﬁed. It is easy to verify that such a matrix H can be written in a succinct form

by using the kronecker product operator (8) 1as follows

msmt x msmt
Here 1 stands for a matrix of all elements 1.
Similar procedures can be applied to reformulate the constraints in (5.22), with

the use the following matrices

T
t1 t1
t2T _ t2
msxmt 2' : a pmsmtxl = . (525)
trriis tms
- T T T
msxmsmt diag(t1 ,t2 ,. . . ’t’NLS) (5.26)

\

1Kroneclter product is also known as Matrix Direct Product. Given an m x 71. matrix A and a
P X q matrix B, their kronecker product A 81 B is an mp x nq matrix C whose (i, j)-th submatrix
is aid-B, i = 1,...,m andj=1,...,n.

128

And ﬁnally the optimization problem in (5.22) can be rewritten in a standard form

of the QP problem as follows

max pTHf) (5.27)
p

s.t. Ef) = e (5.28)

Osﬁsn on)

where e is a vector with all elements equal to 1, and the matrices H, E and the vector
1‘) are given in (5.24) - (5.26). In our experiments, the QP package from Matlab [143]

is used to solve the above problem.

Remark: For the QP problem formulated in (5.27), the problem size appears to be

t, i.e., the product between

large because the number of variables in vector q is msm
the number of unique query words and the number of distinct translation words
provided by the dictionary. However, notice that in the constraint (5.29), '13, i.e., the
upper bound of translation probabilities, is a concatenation of translation vectors t,-
obtained from T, the matrix notation of the bilingual dictionary. Given that most
query words only have a few translations, most of the elements in the matrix T will be
zeros. As a result, most elements in the upper bound vector 6 are zeros, which leads
to the zero values for the corresponding translation probabilities in q. Hence, the
number of non-zero translation probabilities in q is no more than the total number of
translations provided by the bilingual dictionary for the query words, which is usually

much smaller than the product msmt. Thus, the computation cost of the Maximum

Coherence Model is modest for real CLIR practice, if not overestimated.

5.3.6 Summary and Discussion

By putting together the uncertainty modeling, translation probability learning and

the retrieval model, we summarize the steps of applying the proposed framework to

129

 

 

Step 0 Prepare a bilingual dictionary

Step 1 Compute the pair—wise similarity of all the words in the target
language that appear in the dictionary as translations

Step 2 Prepare language models for both the source language
and the target language

Step 3 When a query in the source language comes
1. Identify the candidate set of translations by the dictionary

E0

Extract the similarity information related to the candidate set

9°

Compute translation probabilities by solving the QP problem (5.27)

:5

Retrieve documents according to the model in (5.19), by feeding in
the set of translation probabilities and both language models

 

 

 

Figure 5.1. Steps of applying the proposed framework to Cross Lingual Information
Retrieval

CLIR in Figure 5.1. In those steps, pair-wise similarity computation and language
model preparation can all be done offline. In query time, most computation comes
from solving the QP problem, in addition to the conventional retrieval process.

It is important to note that although in this study we limit ourselves to the bag
of words approach without exploring other linguistic structures such as phrases, our
framework can essentially be generalized to the linguistic structures other than the
bag of words. This can be achieved by treating all the w; and w}: as the units in the
targeted linguistic structures, and Pr(w§-Iw]‘:) as the likelihood of translation between

the units in the new linguistic structures. We focus our discussion on word translation

mainly because of the following two concerns:

1. As mentioned in the introduction part, one of the main motivation behind our
work is to resolve the problem of CLIR when only bilingual dictionaries are
available. Since in practice most bilingual dictionaries are word-based, we focus

our study on the word—based approaches.

2. Since many user queries, particularly the ones from the \Norld Wide Web, are

less likely to be grammatically structured, we believe it is important for a robust

130

 

CLIR framework to minimize its dependence on deep linguistic analysis of user

queries.

Despite of the above claim, we believe that given more linguistic resources avail—
able and well grammatically structured queries, our framework could be improve
by appropriately incorporating the linguistic constraints derived for the translated
queries. For example, we can incorporate the linguistic knowledge into the similarity
matrix 0; or we can introduce it into the regularizer in the optimization problem
(5.22) to make it more informative. These framework extensions will be considered

in future work.

5.4 Maximum Coherence Model

In the previous studies of selection—based approaches, several metrics have been used
for measuring the similarity between two words in the target language, i.e. the 0;. j”
including variants of mutual information [57, 81], and the Dice similarity [1, 2]. In

this model, a typical variant of mutual information (which has been used in previous

studies [58]) is used as the pair-wise similarity metric

 

Pr(w;-, wt.,)
st. ., = Pr(u.i§-,wt.,) x log t J t (5.30)
)3] J Pr(wj) x Pr(wj,)
Pr(w§) is the unigram probability for word wt-, and Pr(w§-, w?) is the joint probability
for word w; and 103., to co—occur in the same documents. Both probabilities can be

acquired by simply counting the term frequency of single words and the frequencies
of co—occurrence between two words. Note that Equation (5.30) is different from the
standard deﬁnition for mutual information in that only co-occurrence information is
used. Due to the computation concern, in Equation (5.30) the correlation between two
words when at least one of them does not occur in documents is ignored. According
to the deﬁnition in Equation (5.30), we see that two words will be regarded as similar

if they co—occur often in the document collection.

131

 

Using matrix notation S = [5; t and substitute the coherence matrix 0

J’lmtxm

with the similarity matrix S, we have the following model

max eTPSPTe — CpeTPPTe (5.31)
PER’IRS X ”It
at. P - emtxl = emsxl

O§P§T

We call the above model “Maximum Coherence Model” since it is a straightfor-
ward implementation of the proposed framework. Parameter Cp can be determined
empirically by cross validation. Heuristically, we hope Up to be roughly in the same
scale as those elements in the matrix S, since OP is used to balance between the
similarity measurement S and regularizer I. Hence, we heuristically propose to set

GP to be proportional to the average of the elements in S, i.e.
0p = n25; Jami)?
j j’ ’

Here n is a scaling factor. In our experiments we tested different values of 77 in the

range of [0.1, 10] and chose the one with good retrieval performance.

5.5 Spectral Query Translation Model

One problem with the Maximum Coherence Model proposed in the previous section
is the difﬁculty in determining the value of parameter Cp. This problem arises be-
cause the two terms in objective function in (5.31), namely the overall coherence
measurement of translated query, and the regularizer, are in different scales. As a
result, we have to search for appropriate C'p empirically. In order to put the two
terms on the same scale, we introduce the concept of normalized similarity ma-

trix: for a given similarity matrix S = [ we deﬁne a diagonal matrix

t
51-,jllmtth,

D = diag(d1, d2, - -- ,dn), where dj 2 237,11 3]. 3'" Then, the normalized similarity

132

 

matrix is S = D_%SD-’Z. Note that through this normalization procedure, we are
able to bring down the scale of matrix S to 0(1). As a result, both the coherence
term and the regularization term are on the same scale. Thus, instead of ﬁnding
appropriate 0,; empirically, we can simply set it to 1, which leads to the following
realization of the proposed framework:
T —1 —1 T T T
max e PD 28D 2P e —— e PP e (5.32)

PER’ITIS xmt

Si. P iemtX1= emsxl

0§P_<_T

We call the above model “Spectral Query Translation Model” because it can be
interpreted as a graph partitioning approach for query translation disambiguation,
which is strongly related to spectral clustering. We will further elaborate on this

interpretation in the following subsection.

5.5.1 Query Translation Disambiguation as Graph Partitioning

In order to see the relationship between graph partitioning and dictionary-based CLIR
model in (5.32), for a given query q3 and its translation candidate set R, we present
the related similarity information S through an undirected weighted graph. Each
translation candidate w]C E R is represented by a vertex. Any two translation can-
didates related to two different query words are connected by an edge if they ever
co—occur in at least one document. A non-negative weight is assigned to each edge to
indicate the similarity between the two connected words. Here we use the measure-
ment deﬁned in Equation (5.30) as the edge weight.

With the constructed graph for a given query and its translation, we hypothesize
that the best (or the most coherent) translation of a query corresponds to the most
strongly connected component within the graph. To separate the strongly connected

component from the rest of the graph, a graph partitioning algorithm can be employed

133

to divide the graph into two disjoint clusters: a cluster for strongly connected compo-
nent and a cluster for the rest of the graph. To this end, we inherit the idea from the
Normalized Cut algorithm. For the graph constructed for the translation candidate

set R, its adjacency matrix is exactly the similarity matrix S = [3; and the

,j’] mt x mt’
graph Laplacian matrix is L = D — S. Following the formalization of the Normalized
Cut algorithm [134], the optimal 2-way partitioning is found by solving the following

minimization problem
. T _ 1 _ 1
nnnv D 1ZLD 2v (5.33)

Here v = [1,1122 - ' -'vm]]T is a cluster membership indicator vector. Each element 12,-
is a binary variable with 1 indicating the corresponding word being included in the
query translation and O for not being included.

It is not difﬁcult to see that the optimization problems in (5.32) and (5.33) are in

fact equivalent if we set
v = PTe (5.34)

or in another more explicit form

vj = 2ka
k
= Z Pr(w§-|wz,qs)Pr(w]:|qs)vms
wlscEqS
= mS-Pr(w§|q3) (5.35)

where Pr(wi]q3) takes a uniform distribution, i.e. Pr(wi|q3) = 1/m3.

What is suggested by Equation (5.34) or (5.35) is to relax the cluster indicator
213- to a soft membership instead of a binary value. This soft membership, from
the graph partitioning point of view, indicates how likely the strongly connected
component will include the particular translation candidate 1113-. Equation (5.35) links

the soft membership with the probability Pr(w§-|qs), i.e. the likelihood of including

134

translation candidate w;- in the translation of query qs. Thus, the model proposed as
the optimization problem (5.32) can be perfectly explained from a graph partitioning
perspective. Note that the relaxation of a binary indicator to a soft membership
in the perspective of graph partitioning is in accordance with our introduction of
translation probabilities in the statistical framework. This is because both of them

try to address the uncertainty problem in the process of query translation.

20.9w
@321 0.1234s

Lang along

ingegnggnt 0.4071

 
   
 

I \
serum @545
0.0987
_90
0.5661 mm
0.0928

premonition
0.0978

 

 

-— >5'10e-7
>1‘10e-7
>5*10e-8

 

 

 

 

put out

am
03748 03603 55%

Figure 5.2. An example of graph partitioning perspective for query translation disam-
biguation.

Figure 5.2 gives an illustrative example of this graph partitioning perspective.
In the example, the query is composed of four Chinese words, and around each
Chinese word are its translation candidates in English provided by a Chinese—English
dictionary. The thickness of lines connecting two English words roughly represents
their similarity. The number below each English word is its translation probability

estimated from the Spectral Query Translation Model. Based on the graph repre-

135

 

sentation in Figure 5.2, we can easily see a strongly connected component consisting
of words “independent”, “sign”, and “press”, which have been assigned with large
translation probabilities. It is important to note that Figure 5.2 only serves as
an illustration of the proposed spectral query translation model. In particular, all
translation candidates will be used in the retrieval model (see Section 5.3.3), and

their importance will be weighted by their translation probabilities determined by

the optimization algorithm.

Remark: In the above, we discuss the relationship between the Spectral Query
Translation Model and the spectral graph partitioning. However, it is worth pointing
out that the solution to the Spectral Query Translation model can not be acquired by
the eigen analysis that is used for solving spectral graph partitioning. This is because
the Spectral Query Translation model seeks for the optimal translation probabili-
ties (which leads to soft cluster memberships) that maximize the overall coherence
measurement. In contrast, most spectral graph partitioning algorithms, such as Nor-
malized Cut, search for the binary cluster memberships that minimize the graph cut.
It is such difference that leads to the quadratic programming formalization, instead

of an eigenvector problem, for the Spectral Query Translation Model.

5.5.2 Maximum Coherence vs. Spectral Graph Translation

Aside from its graph partitioning explanation, the Spectra] Query Translation Model
is advantageous to the Maximum Coherence Model in that it is able to reduce the
translation probabilities for the “common” words, which may otherwise dominate in
the ﬁnal query translation. To see this, consider an element 5.7-J; in the normalized
similarity matrix S = D_%SD—%

st.

'I
gt. . = M (5.35)
10’ th t Z:mt t
. S . . . 8. .
J’=1 JJ’ J’=1 30’

136

 

 

Suppose translation candidate wt- is a common word that appears frequently in the

J

t
document collection of the target language. This implies that the sum 23’}; st. ., will

11}

be large. Since most of the “common” words are less informative for the purpose of

 

document retrieval, we use the normalization factor 1/ \/Z;7}:1 8;,‘7.’ to suppress their
coherence values, which will eventually reduce the probability of including common
words in the ﬁnal query translation.

Based on the above analysis, we see that the Spectral Query Translation Model
is more appealing than the Maximum Coherence Model. This is further conﬁrmed
by our empirical studies on cross-language information retrieval presented in the next

section.

5.6 Experiments and Discussions

The goal of this experiment is to examine the effectiveness of the proposed statisti-
cal framework for cross—language information retrieval. In particular, four research

questions will be addressed in this empirical study:

1. Is the proposed statistical framework effective for cross-language information re-
trieval? To obtain a comprehensive view, we compare the Maximum Coherence
Model and the Spectral Query Translation Model to the existing selection-based

approaches using a variety of queries and documents.

2. How important is it for a query disambiguation algorithm to include transla-
tion uncertainty in its analysis? To address this question, we will examine the
importance of including translation uncertainty in cross-language information

retrieval through case studies.

3. How important is it to remove the translation independence assumption for

cross-language information retrieval? To address this question, we will exam-

137

ll

ine the impact of the translation independence assumption on cross-language

information retrieval through case studies.

4. How does the Maximum Coherence Model empirically compare to the Spectral
Query Translation Model? we will show a performance comparison between
the two models as well as a case study as empirical evidence to our previous

comparative analysis of the two models.

5.6. 1 Experiment Setup

All our experiments are retrieval of English documents using Chinese queries. The
document collections used in this experiment are from TREC ad hoc test collections,

including

AP88-89 164,835 documents from Associated Press(l988, 1989)
WSJ87—88 83,480 documents from Wall Street Journal (1987, 1988)
DOE1-2 226,087 documents from Department of Energy abstracts 2

In addition to the homogeneous collections listed above, we also test the proposed
model against heterogeneous collections that are formed by combining multiple ho-
mogeneous collections. In particular, two heterogeneous collections are created: col-
lection AP88—89 + WSJ87—88, and collection AP89 + WSJ87—88 + DOE1-2. In a
heterogeneous collection, words are more likely to carry multiple senses than words
from a homogeneous collection, which will increase the difﬁculty for an automatic
algorithm to disambiguate the senses of query words using the pairwise word sim-
ilarities. The SMART system [126] is used to process document collections. Each

document is ﬁrst parsed into tokens with stop words removed, and then tokens are

 

2DOEl—2 collection is not used as one of the homogeneous datasets in our experiments because
DOEl-2 collection provides no relevant documents for a majority of the queries used in this ex-
periment. It is only used to create heterogeneous collections by combining with the other two
homogeneous collections.

138

stemmed using the Porter algorithm. Finally, each document is represented as a bag
of stemmed words. We also implemented pivoted document length normalization
weighting scheme [135] in SMART system for the retrieval process. Since our goal
is to illustrate the advantage of the proposed statistical framework, we do not apply
more sophisticated procedures for text analysis in our experiment, such as phrase
identiﬁcation.

Our queries come from a manual Chinese translation of TREC-3 ad hoc topics
(topic 151-200). To fully examine the effectiveness of the proposed models, we test it
against both the long Chinese queries and the short Chinese queries. A short Chinese
query is created by translating the “title” ﬁeld of an English query into Chinese; a long
Chinese query is formed by combining the Chinese translations of both the “title” ﬁeld
and the “description” ﬁeld in an English query. The average length of short Chinese
queries is 9.64 Chinese characters, and 30.72 Chinese characters for long queries. For
Chinese translations, we also manually segment the sentences into words with stop
words removed, then feed them into our query translation framework. Figure 5.3
gives an example of the title ﬁeld and description ﬁeld (in the bottom panel), which
is used to form a Chinese query in our experiments.

Since most of the words in a short query are highly relevant to the topic of the
query, we would expect that query disambiguation approaches based on word similar-
ities will work well. The analysis of CLIR with long queries could be tricky because
of the following two conﬂicting aspects: (1) On one hand, a long query provides
signiﬁcantly richer context than the short one in disambiguating the word sense of
translation. Therefore, we would expect that the long queries may achieve a better '
performance than the short queries in CLIR. (2) On the other hand, a long query
tends to include words either irrelevant or only slightly relevant to its topic. As a
result, even a translation word that is coherent with the translations of many query

words may not necessarily be a good candidate for selection. In our experiment, we

139

 

 

<num> Number: 187

<title> Topic: Signs of the Demise of Independent
Publishing

<desc> Description:

Document must identify instances of the loss of
independence by publishers, from the sale or
merger of their business or publication, or the sale
of a signiﬁcant interest in it to another person or
company.

 

 

 

Translate into Chinese

 

 

\/

<num> Number: 187
<title> Topic: iﬂlf/j lll lili ill 7; [1‘1 ill? 35
<desc> Description:
5C iii: lb 3131' N ltfr .'J‘. M “E .' l 1 Jill [iii (I i i \1k 212:}: Jill M 9W}
l'i’d ti‘it‘s ‘ a )t- ' I u at 55171-13}inFETMU‘JI‘EJHT, it
MJWEKTHMMAmadMiM.

Segment Chinese sentences
and remove sto a words

 

 

 

 

       

<num> 187
<title> Topic: Kill 32‘. lllllli il'l K (M) iiEJli
<desc> Description:

‘1 .7:- “él’i its (:11) (“"919 :1: J52 liil (iii) it)“:

(22) {J'Hilft'lilli’ﬂ (lt‘J) ii'if'ii (all) it'l- ('|') (. )(L'JZ
a) ((Ii) Wit ‘2 {2‘23 Milt] (Ill). (if/i) imi‘r'lit 1E3};
(”I“).1LilL|irmA(th) 2} ml (M) 5‘13fo (. )

Figure 5.3. An example query used in our experiments. The query is ﬁrst translated from
the No. 187 in TREC-3 ad hoc topics into Chinese. It is then segmented into words with
stop words removed. The upper panel shows the original query in English; the middle panel
shows the translated Chinese query; the bottom panel shows the segmented Chinese query,
where stop words in parentheses are removed.

140

will examine which factor among the two plays the major role in CLIR. Hence, a
long query may pose a more challenging problem than a short query for a translation
disambiguation algorithm based on word similarity information.

Finally, the relevance judgments for the original English queries are used as the
relevance judgments for their Chinese translations. The Chinese—English dictionary
used in our experiments comes from Linguistic Data Consortium (LDC, http://www.-
ldc.upenn.edu), which consists of translations for 53061 Chinese words. Since our
experiments do not involve the processing of English phrases, for any English phrase
that is the translation of a Chinese word, we simply treat it as a bag of words.

To evaluate the effectiveness of the proposed framework and models, we implement
two baseline models that take translation selection approaches. The ﬁrst baseline
model selects the most likely translation for each query word, which we call “BSTO”.
Speciﬁcally, we follow the ”Approximate Ttanslation Selection Algorithm” described
in Section 5.2.1. The second model, which we call “ALTR”, makes no efforts for
translation disambiguation by simply including all the translations provided by the
dictionary for query words into the ﬁnal query translation. Finally, for easy reference,
we use the abbreviation “MAC” for our Maximum Coherence Model and “SQT” for
our Spectral Query Translation Model. The constant Up for the regularizer term in
the Maximum Coherence Model is set to be 4 2].va 53.0., / (mt)2 based on our empirical

experience.

5.6.2 Comparison to Selection-based Approaches

Table 5.1 lists the average precision across 11 recall points for both the homogeneous
collections and the heterogeneous collections. As indicated in Table 5.1, the proposed
models (i.e., “MAC” and “SQT”) outperform the two baseline models for both short
queries and long queries across all four different collections. For the purpose of refer-

ence, we also list the results of monolingual information retrieval in the ﬁrst column of

141

0.8

      

 

Precision-Recall (Short Queries / AP)

 

" -+-MAC T

 

- .. 433m;
--aHALTR ’

 

 

-9—SQT :

 

0.2-
0 . .
0 02 0.4 0.6 0.8 1
Recall

0.8

 

Precision-Recall (Short Queries l WSJ)

 

 

-~-Bsr03
~a--ALTR f
.+'MAC f

 

 

 

Precision

 

142

0.8

_o

.O
N

 

Precision-Recall (Long Queries / AP)

 

 

-~-Bsr0§
“O-ALTR ;

V -+-MAC :

 

.0
.5

-e—SQT

 

 
 
   
 

 

 

Precision-Recall (Long Queries I WSJ)

 

 

«433m
--a-'ALTR 3

.+-MAC i

 

 

0:2 0:4 0:6

Figure 5.4. Comparison of CLIR performance on homogeneous datasets using both short
and long queries. The upper two ﬁgures are for AP88-89 dataset, and the lower two are for
WSJ87—88 dataset. The left two figures are for short queries, and the right two are for long
queries.

 

 

Precision-Recall

 

(Short Queries /AP+WSJ)
. -u-BSTO
i , ~-n--ALTR .
06,—, j ‘ +rMAC ‘

 

 

 

 

 

Precision-Recall
(Short Queries / AP+WSJ+DOE)

 

 

 

-~-BSTO'

. . - --r:r-'ALTR -

o. - i , '~+TMAC -
‘ —e—SQT 1

 

 

 

 

Precision

       
 

Precision-Recall
(Long Queries / AP+WSJ)

 

-I-BSTO
r-rawALTR
-+-MAC '
—e—SQT .

 

 

 

 

0.8

 

Precision-Recall
(Long Queries I AP+WSJ+DOE)

 

-~-Bsro
-»-rJ~ALTR
v+~MAC :
-6-SQT

 

 

 

 

 

Figure 5.5. Comparison of CLIR performance on heterogeneous datasets using both short
and long queries. The upper two figures are for AP88—89 + WSJ87—88, and the lower two
are for AP89 + WSJ87-88 + DOE1—2 dataset. The left two ﬁgures are for short queries,

and the right two are for long queries.

143

Table 5.1. ll-poiut average precision for both short and long queries on TREC datasets.
MLIR represents the monolingual information retrieval. The relative CLIR improvements
of MAC and SQT models over the other two baseline models are listed in the 5th, 6th, 8th
and 9th data columns.

 

 

 

 

 

 

 

 

 

 

[MLIR[BSTO ALTRMAC (M-B)/B (M-A)/ALSQT (S-B)/B (S-A)/A
Short Queries
AP .4112 .2331 .2241 .2959 +24.23% +32.04% .3115 +30.37% +39.05%
WSJ .3335 .1955 .2129 .2550 +30.21% +20.24% .2571 +30.77% +20.75%
AP+WSJ .4057 .2253 .2209 .2772 +23.04% +25.49% .2359 +25.90% +29.43%
AP+WSJ+DOE .3555 .1739 .1329 .2172 +24.90% +13.75% .2295 +32.03% +25.53%
Long Queries
AP .3755 .1749 .1303 .2095 +19.34% +15.25% .2425 +33.71% +34.55%
WSJ .4022 .1473 .1727 .2115 +43.17% +22.52% .2151 +45.21% +25.13%
AP+WSJ .3721 .1433 .1555 .1947 +35.37% +15.94% .2043 +49.92% +23.00%
AP+WSJ+DOE) .3299 .1122 .1411 .1575 +40.45% +11.59% .1712 +52.53% +21.33%

 

 

 

 

 

 

Table 5.1. Furthermore, we plot the precision-recall curves for both the short queries
and the long queries in Figure 5.4 and Figure 5.5, respectively. As illustrated in
Figure 5.4 and 5.5, for all four collections the precision—recall curves of the proposed
models always stay above the curves of the other two baseline models. Based on these
results, we conclude that the proposed statistical framework performs substantially
better than the other two selection-based approaches for cross-language information
retrieval.

A further examination of results in Table 5.1 gives rise to the following observa-

tions:

1. In general, the retrieval accuracy for heterogeneous collections appears to be
worse than that for homogeneous collections. In particular, a substantial de-
crease in the average precision is observed for all four models when the collection
of DOE1-2 is included in the heterogeneous collection. This result is in accor-
dance with our previous analysis, i.e., words from heterogeneous collections are

more likely to have multiple senses, thus resulting in higher translation ambi-

guity.

2. A better retrieval is achieved for short queries than for long queries. The degra-

144

(lation in performance between long queries and short queries is more signiﬁcant.
for heterogeneous collections than for homogeneous collections. Usually long
queries bring rich context as well as noise. Our result. indicates that among
the two factors, the second one, i.e., long queries consists of many irrelevant
words, seems to be more influential than the ﬁrst one, i.e., long queries provide
rich context for word sense disambiguation, in query translation. This result
seems to contradict the general belief that long queries tend to yield better
retrieval accuracy than the short ones given its rich context. However, it is
worth pointing out that this general belief is based on a simpliﬁed analysis and
overlooks the fact that long queries tend to include irrelevant words that could
corrupt the accuracy of query translation disambiguation. To further conﬁrm
our hypothesis, we compare the results of the monolingual information retrieval
between long queries and short queries. we observe that the short queries in
fact outperform the long queries for three among four datasets. All the results
indicate that due to the two conflicting factors related to long queries, it is not
necessary the case that information retrieval with long queries will deﬁnitely

deliver better retrieval accuracy than short queries.

. The “BSTO” model does not consistently outperform the “ALTR” model. In
fact, for the long queries, the “ALTR” model performs better than the “BSTO”
model across all four different collections. This phenomenon can also be at-
tributed to the fact that long queries are rather noisy and likely to include
irrelevant words. This result indicates that the “BSTO” model can be sensitive
to the noises present in queries. Given that a signiﬁcant amount of noise can be
present in queries, it is important to maintain the uncertainty of translation in
the retrieval process. Note that our results appear to be inconsistent with the
ﬁnding in [57]. We believe that this difference could be explained by the dif-

ference in the experiment setups. In particular, since our experiments focus on

145

 

 

upheaval commotion turbulence unrest turmoil
30% 0.19931 0.19723 0.19882 0.20209 0.20255

 

 

 

 

intent spring motive inducement incentive
filll‘ll 0.22821 0.19645 0.30384 0.13196 0.13954

 

 

 

 

I acid sour sore ache
as | 0.79070 0.05422 0.07552 0.05955

 

 

Figure 5.6. Examples of translation probabilities estimated by the Maximum Coherence
Model.

CLIR with a bag of words, we did not employ any linguistic tools other than re-
moving stop words and stemming keywords. In contrast, in [57] linguistic tools
are used to identify appropriate English phrases and their Chinese translation,
which has been shown as an important factor in CLIR [8, 57]. Although phrase
analysis is important to CLIR, we believe that a generic probabilistic model is
beneﬁcial to CLIR of any languages, particularly when linguistic resources are
scarce. Other differences between our baseline “BSTO” model and the model in
[57] include the different ways of mutual information computation and slightly

different translation selection strategies.

5.6.3 The Necessity of Including Translation Uncertainty

To demonstrate the uncertainty in query translation, in Figure 5.6, we list the trans-
lation probabilities for three Chinese words 3 that are estimated by the Maximum
Coherence Model. As we can see, a signiﬁcant variance exists in the distribution
of translation probabilities across different Chinese words. The ﬁrst example in the
ﬁgure shows an almost uniform distribution over all translations, while the third one
illustrates a very skewed distribution. Meanwhile, the second example provides a dis-

tribution that is neither uniform nor totally skewed. These three examples illustrate

 

3These three Chinese words are not from the same query.

146

 

 

 

 

 

, . Selection Trans. Prob. Trans. Prob.

CH Term EN Translation (BSTO) (M A (3 (S QT)
independent 0.36208 0.46944

541172 on one's own 0.24481 0.12342
stand alone x 0.3931 1 0.40715

press x 0.33789 0.22208

11’. lili publish 0.19973 0.37477

‘ ’ put out 0.33311 0.36082
print 0.12927 0.04232

disappear x 0.34941 0.38760

ill '1'; demise 0.32035 0.32365
fade 0.33024 0.28875

symptom 0.16698 0.14468

omen 0.16155 0.09868

111.315 sign x 0.35015 0.55512
portent 0.16097 0.09279
premonition 0. 16034 0.09773

 

Original TREC Topic in English (topic 187 'title' ﬁeld):
Signs of the Demise of Independent Publishing

Figure 5.7. An example of query translation, using the “BSTO” model and the Maximum
Coherence Model. (English words in italicized font are removed as stop words.)

the “translation uncertainty problem”, which we have addressed in previous sections.
Furthermore, the diversity in the distribution of translation probabilities makes it
difﬁcult for a selection-based approach to perform well over all different cases. For
example, the “BSTO” model is able to work well for the third example but will fail
in the ﬁrst one. On the other hand, the “ALTR” model would be perfect for the ﬁrst
example but not for the third one. Base on the above analysis, we conclude that it
is important to capture the translation uncertainty and the diversity of translation

uncertainty in a probabilistic model.

147

5.6.4 The Impact of Translation Independence Assumption on Query Dis-

ambiguation

To illustrate the impact of the translation independence assumption on query trans—
lation disambiguation, consider the example in Figure 5.7. This query consists of
four Chinese words, and the English translations for each Chinese are provided by
the dictionary are listed in the second column. The original English query is also in-
cluded at the bottom of the ﬁgure. The English translations selected by the “BSTO”
model are listed in the third column, marked by small crosses. The translation prob—
abilities from Chinese words to their English translations estimated by the Maximum
Coherence Model and the Spectral Query Translation Model are listed in the last two
columns respectively.

Comparing to the original English query, we see that the “BSTO” model makes
incorrect translation selection for both the ﬁrst and the second Chinese words. For the
ﬁrst one, the correct English translation should be “independent”, instead of “stand
(alone)” 4. The better translation for the second Chinese word should be “publish”
instead of “press”. One reason for such mistakes is that in the “BSTO” model, the
coherence score of a translation is computed based on all the English translations pro—
vided by the dictionary for the Chinese words in the query. Thus, the coherence score
of one translation word is completely independent from the selection of other transla-
tions. Since both “stand” and “press” are common in English, their overall coherence
scores turn out to be larger than the coherence scores of other words, which lead them
to be selected by the “BSTO” model. In contrast, in both the Maximum Coherence
Model and the Spectral Query Translation Model, the estimation of translation prob-
abilities for one word is dependent on the estimation of translation probabilities for

other words. As a result, they are able to adjust the mistakes by assigning signiﬁcant

 

4“alone” is removed as a stop word and does not count in the translation. It is listed only for
the sake of clarity.

148

amounts of probability mass to the correct translations. For example, for the ﬁrst
Chinese word, both models are able to assign a probability to the correct English
translation “independent” comparable to the probability assigned to the translation
“stand (alone)”.

Note that neither the Maximum Coherence Model nor the Spectral Query Trans-
lation Model is able to always assign the largest probabilities to the best translation
candidates (such as for the ﬁrst Chinese word in Figure 5.7). However, in general by
maximizing the coherence of the entire translated query both models tend to shift
more probability mass to the best translation candidates, thus alleviate the mistakes
brought by those selection-based approaches originated from their false assumption
on translation independence. We believe this is one of the major advantages of these

two models over all the selection-based approaches.

5.6.5 Performance: MAC vs. SQT

1Q 1111 I encroach erode weather eat
MAC 0 0.01091 0.34901 0.14008
SQT 0.02389 0.03.139 0.34487 0.09985

 

Original TREC Topic in English (topic 188 'title' ﬁeld):
Beachfront Erosion

Figure 5.8. Comparison of an example query word translation using Maximum Coherence
Model and Spectral Query 'Itanslation Model. The numbers showed in this example are
the probabilities of the translation candidates being included in the ﬁnal query translation.

Table 5.1 reveals a slightly higher average precision across 11 recall points of
“SQT” model compared to “MAC” model. In Figure 5.4 and Figure 5.5 the precision-
recall curves of “SQT” model generally stays above those of “MAC” model, though
the margin is not clear sometimes. All the experiment results suggest that the Spectral

Query Translation Model is slightly better than the Maximum Coherence Model.

149

 

In‘

This observation is in accordance with our theoretical analysis on the two models
at the end of Section 5.5. Apart. from the advantage of saving the trouble of choosing
the optimal regularizer constant Cp, the beneﬁt of normalizing the coherence matrix
can be observed from the example presented in Figure 5.8, where we list and compare
the probabilities of including different translation candidates in the ﬁnal query trans-
lation for one example query word from both models. As we can see, the common
word “eat” is assigned with a signiﬁcant amount of probability mass in the Maximum
Coherence Model although it is almost irrelevant to the context of the entire query.
At the same time, the word “encroach”, which in fact is related to the theme of the
query, receives an almost zero probability (or too small to represent). As indicated
by the results listed in Figure 5.8, the Spectral Query Translation Model is able to
overcome this problem by redistributing some of the probability mass on the com-
mon word “eat” to other words. This is consistent with our previous analysis that
the Spectral Query Translation Model has a better way to estimate the translation

probabilities of common words than the Maximum Coherence Model.

5.6.6 Computational Efficiency

To examine the computational efﬁciency, we calculate the averaged number of seconds
that are spent by the proposed algorithms to solve the Optimization problem for each
query. All the experiments are conducted on a PC with a Pentium 4 2.8GHz CPU
and 1G RAM that runs Matlab 7.1 on a Windows system. We directly use the
quadratic programming function provided by Matlab to solve the QP problem in
the proposed algorithms for query translation disambiguation. The results are 0.027
seconds per query for short queries, and 3.014 seconds per query for long queries.
Clearly, our algorithm is sufﬁciently fast for short queries, but a little slow for long
queries. To improve the computational efﬁciency of long queries, we can ﬁrst divide

a long query into a number short ones, and then translate each short query using the

150

 

proposed framework. Since the computational complexity of quadratic programming
is 0(n3) in the worst case where n is the number of translation probabilities, by
signiﬁcantly reducing the number of query words, we reduce the number of translation
probabilities, and therefore improve the computational efﬁciency. Note that the above
approach is based on the assumption that most of the context needed for query
translation disambiguation can be found in the neighborhood of each query word. As
a potential research issue, improving the computational efﬁciency for long queries is

worth further study.

5.7 Conclusions

In this chapter, we propose a novel statistical framework for cross-language infor-
mation retrieval. It utilizes word co—occurrence statistics for estimating translation
probabilities that are effective for query disambiguation. Compared to the selection—
based approaches, the merits of the framework are twofold: 1) It preserves the trans-
lation uncertainty through the estimation of translation probabilities; 2) It estimates
the translations for all query words simultaneously. Two realizations of the proposed
framework, namely the Maximum Coherence Model and the Spectral Query Transla—
tion Model, are presented based on different choices of coherence measurements. Our
analysis indicates that the Spectral Query Translation model can also be interpreted
as a graph partitioning approach for query translation disambiguation. Empirical
results under various scenarios have shown that the proposed framework is able to
perform substantially better than the existing selection-based approaches. Further
analysis indicates that the Spectral Query Translation Model is more effective and

reliable than the Maximum Coherence Model when dealing with common words.

151

CHAPTER 6

Semi-supervised Learning for Extraction

of Questions and Answers from Web FAQs

6.1 Introduction

Question-Answering is faced by a problem of a “lexical gap” [22] or a “word mis-
match” [156] between question and answers, suggesting that retrieval of candidate
documents for answer extraction should focus on documents that are similar to the
expected answer rather than on documents that are similar to the user question. Re-
cent work in Question-Answering has capitalized on existing repositories Frequently-
Asked-Question (FAQ) pages to bridge this lexical gap. FAQ pages are easily available
in large quantities on the web. They cover a wide range of topics, and thus present
an ideal resource to learn about question-answer correspondences. Large repositories
of question-answer (QA) pairs extracted from FAQ pages have been used to train
statistical translations models in order to provide an additional measure to rank can-
didate answers [22, 51, 122, 137], or to perform expansion of questions by answer
terms using various query expansion techniques [78, 121, 3, 69]. In other work, QA
repositories have been used as candidate collections for answer retrieval [30, 83, 154].
Besides Question-Answering, QA repositories have also been used in other related

information retrieval and natural language processing tasks, such as query summa-

152

 

rization [23], semantically similar question ﬁnding [82] and knowledge representa-
tion [153, 154]. Since the performance of these applications depends heavily on the
quality of the QA repository, high-precision extraction of QA pairs is crucial in order
to provide reliable data for various deployments.

While FAQ pages abound on the web, high-precision extraction of QA pairs is
challenging because of the wildly varying markup of questions and answers across
FAQ pages. We refer to this challenge as the Across Page Diversity challenge. Across
FAQ pages on the web, the variants in QA markup range from simple paragraph
breaks to special formatting (using italics, boldface, different fonts, font sizes, colors,
etc), various types of indentations (lists, tables, etc.), special preﬁxes (numbers,
Q., 0:, Question:, etc), or even sophisticated images. The creative power of web
designers is displayed not only in the huge variations of QA markup, but also in the
endless variants of “noise text”, i.e., headers, navigational text, or other annotations
that display neither questions nor answers.

To illustrate the Across Page Diversity challenge, Figure 6.1 - 6.4 present four
snapshots of FAQ pages, each displaying questions and answers in their own format.
Speciﬁcally, Figure 6.1 presents two FAQ answers in sophisticated formats — one
with embedded listing structures and varying fonts, and the other with multiple
paragraphs separated by images; the other three snapshots present QA text mostly
in plain paragraphs. In Figure 6.2 two types of images preﬁx each FAQ question and
answer; in Figure 6.1 and Figure 6.3 only FAQ questions are marked by numbers;
Figure 6.3 uses different colors and fonts to distinguish questions and answers. Figure
6.3 includes a real FAQ question without an ending question mark (i.e. the second
question in the snapshot), while Figure 6.4 includes a fake FAQ question that does
end with a question mark (i.e., in the dashed ellipse). Also, there are various types
of noise texts that are highlighted by boxes of all shapes in the four snapshots.

The diversity of web FAQ pages pose a serious challenge in extracting high-quality

153

 

There is a short lag period between the time you enter the
sites to be blocked and when our system begins blocking the
corresponding ads. This period should be no more than two
business days.

Back to Top

9. Are the ads blocked for all products within the Yahoo
Pubilsher Network?
Yes. Once you enter the URLs, the blocked URLs will be
applied to all products within the Yahoo! Publisher Network.
You will not need to create the list for every YPN product

Back to Top

10. Can 1 block Run-of-Network ads from appearing on my
' site?
No. we do not offer this option at this time.

Back to Top

- idn't find an answer to your question'il

Submit your inguig directm to our Customer Solution: team

 

 

      

NOTICE: AI information is the conﬁdential information of the Yahoo! Publisher Network.
Copyright © 2007 Yahoo! Inc. Al rights reserved.
: Poirc ' Terms of Sen/ice Terms and Conditions Hefn Center

  
  

 
  

ana

     

Figure 6.1. Snapshot of example web FAQ pages (1 of 4)

154

Is there a limit to how much karma you can accumulate?

Yes. Karma is now capped at "Excellent" This was done to keep people from
running up insane karma scores, and then being immune from moderation.
Despite some theories to the contrary, the karma cap applies to every account.

 

Answered by: Cmdr Taco
Last Modiﬁed: 1/24/02

 

It seems unfair that I can't get any more karma than
that even if I earn it.

Karma is used to remove risky users from the moderator pool, and to assign a
bonus point to users who have contributed positively to Slashdot in the past. It
is not your IQ, dick length/cup size, value as a human being, or a score in a
video game. It does not determine your worth as a Siashdot reader. It does not
cure cancer or grant you a seat on the secret spaceship that will be traveling to
Mars when the Krulls return to destroy the planet in 2012. Karma ﬂuctuates
dramatically as users post, moderate, and meta-moderate. Don't let it bother
you. It's just a number in the database.

 

Answered by: Ondrfaco
Last Modiﬁed: 10/19/00

 

Why didn't I get karma for a Quickie or a Slashback
story?

This is a shortcoming in the code that we haven't solved yet. Essentially, the

Figure 6.2. Snapshot of example web FAQ pages (2 0f 4)

155

 

4. How do I uninstall Google Web Accelerator?

if you decide you don't want to use Google Web Accelerator, here's how to
uninstall it

. Click on Start > Settings > Control Panel to open the Windows Control Panel.

. Click on Add or Remove Programs to open its window.

. Click on Google Web Accelerator. A Remove button will appear below the
Google Web Accelerator item.

4. Click the Remove button.

5. Close the Add or Remove Programs window and the Vlﬁndows Control Panel.

(”N-l

 

[The Google Web Accelerator Menu I w [W]

V J ""
1. How do I access the Google Web Accelerator menu?

 

 

You can bring up the Google Web Accelerator Menu by clicking on the Google
Web Accelerator speedometer icon in the toolbar or system tray.

 

 

FE) . 36.3w saved

 

Heb ' Preferences...

Preferences... Performance Data...
Performance data. .. Help >
DOO't accelerate “15 "Chit: Stop Google Web Accelerator
Stop Google Web Accelerate é;

 

Learn more about each menu item:

0 Preferences
0 Performance data
0 Stop/Start Google Web Accelerator

 

 

 

[Preferences .4 j ., A. _e . [M23

 

 

V1 . What Google Web Accelerator preferences can I set?

The Preferences menu option Opens this Web Accelerator Preferences page:

Figure 6.3. Snapshot of example web FAQ pages (3 of 4)

156

 

{HQFEW “ .
“Lat—~31 Qlle’J/lon - Glossary of terms:
Taxes

cw... . ; Denomination
FinanciaLMgkgts i

AQQQUEJBQWQ War/ww- — Refers to the different values of

2:23,} i money. U.S. coins currently are made in the
-~-_.______ ‘ following six denominations: cent, nickel,
Went :
liasi§h§§£§
EQLKJSE
WW9. %
l §§§L§E§ﬁ
5 Site Policies and
Notices

dime, quarter, half dollar, and dollar.

 

Quay/{on - What is the correct term for a
one-cent coin?

 

I.77n.savr~ The proper term is "one cent
piece," but in common usage this coins is
often referred to as a penny or cent Many
times, even the Treasury Department and
the United States Mint use the term penny
because that is what is normally referred to

in general use by the public.
a

 

Quad/on ~Are there any plans to
remove the one-cent coin (more
popularly known as the “penny”) from
circulation?

f7ii/z.rrm?r ~ You may be interested to know
that the penny is the most widely used

 

Figure 6.4. Snapshot of example web FAQ pages (4 of 4)

157

QA from them. At least. three major reasons that contribute to the difﬁculty of the

task:

0 First, the list. of QA pairs within a FAQ page is often mixed with various amount
of noise text, such as section headings, navigational text, or annotations, that
are not part of any QA pair. Separating the text. of QA pairs from the noise

text can be difﬁcult.

0 Second, it is usually signiﬁcantly more challenging to accurately identify the
texts for answers than for questions. This is because the answer text tends to
be much longer than that of questions, and more diversiﬁed in its presentation

format and word usage.

0 Third, many questions and answers from the FAQ pages do not follow the
grammar and syntax of English rigorously, which makes it difficult to apply the

algorithms that are developed in natural language processing.

One may enumerate a few heuristic rules for identifying questions and answers
from FAQ pages, such as punctuations, listing markers, lexical cues, repeating pat-
terns, etc., as suggested in several previous studies [136, 84, 137, 99]. However, there
are two major problems with employing those heuristics-based approaches. First,
due to the large diversity in the presentation formats of questions and answers, it is
difﬁcult, if not impossible, to come up with a comprehensive list of heuristics that
cover all the possible presentation patterns. Second, many heuristic-based approaches
tend to struggle between false positives (i.e., accuracy) and false negatives (i.e, com-
pleteness) in QA extraction. More speciﬁcally, a set of relaxed heuristic rules tend
to find more QA pairs but at the price of including noise text into the QA pairs; by
tightening the heuristic rules, we are likely to avoid the problem of including noise

text but at. the risk of missing many true QA pairs.

158

Another approach toward QA extraction is to View it as a classiﬁcation problem
of three classes, i.e., the class of questions, answers, and noise. we can then employ
the supervised learning algorithm to train a statistical classiﬁer from the labeled ex-
amples. However, a key difﬁculty with this approach is that, due to the large variance
in the presentation formats of QAs, it is difficult for any supervised learning algo—
rithm to capture all the possible patterns from a. limited number of labeled examples.
Apparently, this difﬁculty originates from the Across Page Diversity. Fortunately,
while QA markup may vary widely across pages, it is generally true that within sin-
gle pages QA markup is consistent. This Within Page Consistency principle can be
viewed as “side information” - which encodes human knowledge on the fact of FAQ
page creation — to the QA extraction task. As will be shown later in this section, the
principle’s accuracy is well supported by empirical study.

The Within Page Consistency leads to the possibility of a bootstrapping approach
to QA pair extraction: various heuristics that have been proposed for QA pair extrac-
tion can be deployed to perform a high-precision labeling of a seed set of text segments
as questions, answers and noise; an optimization problem is solved that essentially
propagates the class labels of the seed segments to the text segments that share sim-
ilar HTML formats as the seeds. This optimization problem can be described in a
transductive learning setting, similar to the well-known semi-supervised learning al-
gorithm — Spectral Graph Transducer [87]. Given a large enough database of FAQ
pages, which themselves have to be extracted in a high-precision fashion, each of the
extraction steps -—labeling of seed examples, and similarity—based propagation—can
be tuned to high precision, resulting in a large database of correctly identiﬁed QA
pairs. We show in an experimental evaluation that compared to both heuristics-based
and supervised learning approaches, the proposed approach significantly improves
precision in QA extraction.

The key advantages of the proposed algorithm are

159

 

o It takes full advantage of the heuristics discovered by the previous study of QA

extraction.

0 It is entirely unsupervised in that all the labeled examples for the semi-
supervised learning algorithm are acquired by the heuristic rules. Therefore

no human annotation is needed.

0 The extraction from one Web FAQ page is independent from any other pages,
which enables the algorithm to be carried out in a parallel manner to handle

huge amount of data from the Web.

The remaining chapter is organized as follows. In Section 6.2 we brieﬂy review
previous work on automatic QA extraction, and the related Spectral Graph Trans-
ducer algorithm. In Section 6.4, we present a few important observations on the web
FAQs data that motivates our work. The semi-supervised learning algorithm for QA
extraction is proposed in Section 6.5. Experiments and discussions are presented in

Section 6.6. Finally, Section 6.7 concludes the work.

6.2 Related Work

6.2.1 QA Extraction from the Web

At ﬁrst glance, extracting QA text from FAQ pages looks like a typical information
extraction (IE) task. In [111] a maximum entropy Markov model has been proposed,
and in [97] logical structure detection is used. While both approaches prove to be
effective in QA text extraction from Usenet FAQ ﬁles, they are domain-speciﬁc since
they rely heavily on the special format of Usenet FAQs. Although IE research have
gained signiﬁcant progress towards open-domain free-text tasks (for example, seminar
announcement extraction [36]), the extraction is usually carried out as ﬁlling template

slots. Unlike seminar announcements that have a relatively ﬁxed set of themes and

160

presentation formats, QA can be a place-holder for virtually any theme or any pre-
sentation format. As a result, it is difficult to apply IE algorithms to extract QA
pairs from FAQ pages.

Most of the previous studies on QA extraction from FAQ pages [99, 136, 84, 137]
are heuristics-based approaches. In summary, four types of heuristics have proved
to be useful for identifying QA text segments in a FAQ page: (1) punctuations, (2)
HTML tags (e.g., <p>, <BR>), (3) listing markers (e.g., 0:, (1)), and (4) lexical cues
(e.g., What, How). Then rules or memory—based learning algorithms are applied to
determine whether or not a text segment is a question or an answer. However, most
previous efforts toward building a QA repository only aim at extracting FAQ ques-
tions. Only a few studies are related to extracting FAQ answers. The work reported
in [99] concentrates on the extraction of FAQ questions, and deﬁnes as answers the
region between two consecutive FAQ questions, thus ignoring the separation of QA
pairs from noise text. In [136, 84, 137], up to three consecutive sentences following
each identiﬁed question are viewed as the corresponding answer text. As already
pointed out in Section 6.1, FAQ answers are signiﬁcantly more difﬁcult to be ex-
tracted than FAQ questions due to their length and diverse presentation formats. An
important aspect that distinguishes our work from all the previous studies is that we
are aiming at complete and noise—free text of both FAQ questions and answers, which

is more challenging yet more useful for most deployments of QA repositories.

6.2.2 Review on Spectral Graph Transducer

Spectral Graph Transducer (SGT) [87] balances between ﬁtting a model that is con-
sistent with supervised information (from labeled examples), and taking the central
assumption behind nearly all semi—supervised learning algorithms, i.e., examples are
more likely to share the same class labels if they are close to each other or in the

underlying manifold. Suppose {:r,}?=+1m is the set of data examples, with the ﬁrst n

161

 

 

web pages FAQ pages QA pairs
count 4 billion 795,483 10,568,160

 

 

 

 

 

 

 

Table 6.1. Corpus statistics of QA pair data

examples being labeled and the rest m examples as unlabeled. An (n + m) x (n + m)
matrix L is deﬁned as the graph Laplacian that can be derived from the pairwise
similarities of all the data examples [38, 87]. Note that the graph Laplacian matrix
L captures the structure of both label and unlabeled data. A (n + m)—dimension
vector r is used to encodes the label information, where r,- = 73+ if 3:,- is a positive
example, 7*, = 7“- if x,- is a negative example, and 7‘,- = 0 if :13,- is unlabeled. f... and
7”"- are constants that depend on the priors of the two classes. Let f 6 1R" denote the
vector of predicted class labels. Spectral Graph Transducer is deﬁned by the following

optimization problem

mfin fTLf + C(f — r)TC(f — r) (6.1)
at fTe = 0 (where e = [1, - -- ,1]T)
fo = n + m
where the matrix C = diag(c1, c2, . . . ,cn) is a diagonal cost matrix that assigns a

different misclassiﬁcation cost for each labeled data example. The trade-off between
graph cut value (i.e. the ﬁrst term) and training error (i.e. the second term) is

balanced through the constant c.

6.3 FAQ Page Classiﬁcation

As shown in Table 6.1, the FAQ pages used in our experiment were extracted from a
4 billion page web crawl using the queries “inurl : f aq” and “inurl : faqs” to match
the tokens “faq” or “faqs” in the urls. This extraction resulted in 2.6 million web

pages (0.07% of the crawl). Since not all those pages are actually FAQs, we manually

162

 

 

labeled 1,000 of those pages to train an online passive—aggressive classiﬁer [41] in a
10—fold cross validation setup. Training was done using 19 features on URLs, question
marks and word statistics (see Appendix A3 for details), and resulted in an F1 score
of around 90% for FAQ classiﬁcation. Application of the classiﬁer to the extracted
web pages resulted in a classiﬁcation of 795,483 pages as FAQ pages. Instead of going
into more details of the FAQ page classiﬁcation algorithm, we will concentrate in this
paper on QA pair extraction, and present the main algorithm for QA pair extraction

from FAQ pages in followinor section.

6.4 Observations on Web FAQs Data

To better understand the challenges in extracting QA pairs from the FAQ pages, in
this section, we summarize some important observations from the web FAQs data 1

as follows

Noise Text For most FAQ pages, we often ﬁnd the noise text that do not consist of
any questions or answers. These texts can appear either between two consecu-
tive QA pairs or outside the entire list of QA pairs. It is important for the QA

extractor to remove the noise text from the identiﬁed questions and answers.

Question Mark Rule Most, though not all, FAQ questions end with question
marks. This implies that the question mark is an important feature to identify

FAQ questions.

Pattern Diversity and Within Page Consistency Due to the heterogeneous
authorships of web pages, no speciﬁc text formats are consistently used across
all the FAQ pages. This implies that it could be very difﬁcult to develop a

supervised learning algorithm that is able to capture the large diversity in the

 

1Note that there are two types of FAQ pages: one that compiles multiple QA pairs in a single
page, and the other that devotes each page to only one QA pair. We will focus on the ﬁrst type
since it is dominant on the web.

163

 

presentation patterns of QA pairs across different FAQ pages. However, within
a FAQ page, it is often the case that consistent HTML formats are used to
present questions, answers, and noise text. Moreover, the formats for present-
ing questions, answers, and others are often different and distinguishable. This
motivates us to consider a semi-supervised learning algorithm that explicitly

exploits the within page consistency.

The proposed algorithm consists of two major components, i.e., the heuristics
that are applied to identify the seed examples of FAQ questions, answers, and noise
text, and the semi-supervised learning algorithm that propagate the class labels of
the seed examples to other text segments based on the Within Page Consistency

principle. The following heuristics are used to identify the seed examples:

1. From all the text segments that end with question marks, select the ones whose

format repeats the most often and label them as FAQ questions. 2

2. From all the text segments between any two consecutive FAQ questions that
are identiﬁed by the ﬁrst heuristic, select the ones whose format repeats the

most often and label them as FAQ answers.

3. If a text segment repeat itself k time in a FAQ page, all those occurrences of
the text segment will be labeled as noise text. 3 k is a predeﬁned integer, and
is set to 4 in our experiments. And all text segments before the ﬁrst question

are labeled as noise.

All the above heuristics can be seen as a simpliﬁed version of those proposed in
previous studies that are mentioned in Section 6.2.1. These rules tend to be accurate

with very small number of false positives. According to our study, these rules can

 

2Text segments of hyperlinks are excluded, to avoid the possible question lists at the top of the
many FAQ pages.

3This heuristic is designed to detect those highly repetitive noise text (such as navigational links,
e.g., “back to top”).

164

achieve a precision of 96.8% in classiﬁcation 4.

6.5 A Semi-supervised Learning Approach for QA Extrac-

tion

In order to exploit the Within Page Consistency principle, in the proposed approach,
we view the extraction of QA pairs from each FAQ page as an independent multi-
class classiﬁcation problem, i.e., to classify each text segment in a FAQ page into
the classes of questions, answers, and noise text. The key idea behind the proposed
algorithm is to ﬁrst identify a few text segments that can be classiﬁed by the speciﬁed
heuristics. The class labels of these identiﬁed segments are then propagated to the
text segments of the same web page that are similar in HTML format. In this section,
we ﬁrst review the preprocessing step that divides the FAQ page into text segments,
followed by the description of the proposed semi—supervised learning algorithm for

QA extraction.

6.5.1 Web Page Preprocessing

The main purpose of preprocessing is to divide a FAQ page into text segments to
be classiﬁed. In order to exploit the Within Page Consistency principle, we need
an effective way to represent the presentation format of text segments so that the
similarity between any two text segments can be computed accurately. Since all the
web FAQS are presented in the HTML ﬁle, we propose to divide each FAQ page into
text segments based on the layout of the HTML tags, and represent the format of
each text segment by the related HTML tags. In particular, we deﬁne a tent segment

as a continuous text region that is surrounded by exactly the same pair of immediate

 

4Due to the well-known precision-recall trade-off, when the precision of those rules are tuned to
96.8%, the corresponding recall becomes very low. Therefore these rules, though can achieve high
extraction accuracy, are not suitable for complete and noise free QA extraction

 

opening and closing tags. For example, given the following HTML text

<HTML><HEAD> Title Goes Here </HEAD>

<BODY><H1> This is the heading </H1>

<P> 01: this is question 1 </P>

<P> A1: answer 1 starts here

<SPAN style="font-size: xx-small;">
some small font text </SPAN>

<TABLE><TR><TD> table cell 1 </TD></TR>

<TR><TD> table cell 2 </TD></TR></TABLE></P>

the identiﬁed text segments are: "Title Goes Here", "This is the heading",
"01: this is question 1","A1: answer 1 starts here","some small font
text", "table cell 1", "table cell 2". Given the identiﬁed text segments, the
next step is to represent the format of each segment by a sequence of HTML tags. In
particular, each text segment t,- is represented by a HTML tag sequence q,- includ-
ing all the HTML tags that surround segment t,- from outermost to innermost. For
example, for the text segment Q1: this is question 1, its HTML tag sequence
is HTML, BODY, P, while the HTML tag sequence for the text segment some small
font text is HTML, BODY, P, SPAN. Finally, given the representation of presenta-
tion format, the format similarity between two segments t,- and tj is calculated as

follows:

Simf(ti7tj) : exp(-Ad(qivqj)) (62)

where d(qi,qj) is the edit distance between the two corresponding HTML tag se-
quences qz- and qj. A is a decay factor (set to 0.1 in our experiment).

We ﬁnd that there is a disadvantage of the web page division scheme we proposed
here. In particular, it will break a sentence into multiple text segments if there are

hyperlinks or highlighted terms within the sentence. To remedy this problem, we

166

can detect the broken sentences from the text segments based on a few grammatical
rules. For example, a text segment ending with no punctuation and its following text
segment starting with an small-cased letter are very likely to be two broken parts

from the same sentence. We then adjust the similarity measurement as follows:

5i,j = .9t7nf(tz', tj) - biaj (6.3)

where simf is the format similarity, and bi, j is an adjustment factor from broken sen-
tence detection, i.e. bi, j takes a constant value b0 > 1 if t,- and t j are two neighboring

text segments that are detected as from the same sentence, and 1 otherwise.

6.5.2 Semi-supervised Learning for QA Extraction

The key idea of the semi-supervised learning approach is to propagate the class labels
of the text segments that are classiﬁed by the heuristics to the segments of the same
page that are similar in the presentation patterns. The idea of label propagation, as
shown in [87], is equivalent to minimizing the inconsistency between the similarity of
text segments and the class labels assigned to the segments. We can thus encode the
idea of label propagation by the following optimization problem:

Let {tﬁﬁtl denote the set of all N text segments in a FAQ page. We introduce a

probability matrix P = (PM) , where 172', j indicates the probability of assigning

N x3
the i-th text segment to the j-th class. Here, we encode the three classes by: class 1
for FAQ questions, class 2 for FAQ answers, and class 3 for noise text. Note that for

each segment t,-

2PM = 1 (6-4)
j

For convenience, we also use pj to denote the j-th column vector in the matrix P,
i.e., P = [p1,p2, p3]. Each vector pj includes the probability of all the text segments

belonging to the corresponding class.

167

 

 

To represent the class information of those seed text segments, we introduce an-
other matrix P : (pm-)ng, where 13,-, j = 1 if the i-th text segment is a seed text
segment that is classified into the j—th class by the heuristics, and 13,),- = 0 otherwise.
Similarly we can decompose the matrix P into column vectors that represent the seed
examples in each class respectively, i.e., P = [131,132,133].

Let a matrix S = (Sii’)N>< N denote a matrix of all the pairwise similarities of
the text segments, as deﬁned in Equation (6.3). We can view the similarity matrix S
as building a weighted graph: each node represents a text segment; if the similarity
between two text segments is non-zero, the corresponding two nodes is connected by
an edge that is weighted by the similarity value. Based on a weighted graph, we can

deﬁne a normalized graph Laplacian [38]
L = I — D—ls

where D :2 diag(d1, - -- ,dN) is a diagonal matrix with each d,- = Zi’ 31,24. The
graph Laplacian includes the format similarity between any two text segments, and
will be used as the basis to exploit the Within Page Consistency principle.

Our goal is to estimate the optimal set of probabilities in the matrix P, such that if
two text segments are similar in their presentation formats, they will be assigned with '
similar probabilities to the three classes. In addition, the optimal set of probabilities
should also be consistent with the class labels that are assigned to the seed examples

by the heuristics. All these can be formulated into the following optimization problem
jlp[T ij + Cj (Pj - Pj)TAj(Pj - Pﬁ] (6-5)

s.t. -,1]T

p120) j=112i3

where A -

J is a diagonal matrix with a diagonal element equal to 1 if corresponding

168

text segment is a seed example in the j-th class and 0 otherwise. Cj is a predeﬁned
factor that balances the two terms in the objective function.

The ﬁrst term pJLJ-pj in the objective function (6.5) is exactly the normalized
graph cut [134, 87], whose minimization leads to a two-way clustering with minimum
intra—cluster connections [134]. In particular, this term can be further divided into

two parts, i.e.,

T
Pj ij = Z Sir/(Pia ’ pi’,j)2/Zsi,i’ (6'6)
/ i

i,i
Thus, by minimizing the ﬁrst term, we enforce the consistency between the class label
assignment and the similarity measurement. The second term (pj — 13,-)TAJ-(pj —
13]) is a term that penalize the disagreement between the estimated probabilities
pj with 13,-, the class labels of the seed examples. Instead of formulating it as a
hard constraint in the optimization problem, we use the penalty term to ensure the
consistency between pj and 13,-. The advantage of such a treatment is that it leaves
the room for correcting labeling mistakes in the seed examples.
It is easy to ﬁnd our proposed model in the expression (6.5) can be viewed as
a multi-class version of the Spectra] Graph Transducer model reviewed in Section
6.2. As pointed out before, our proposed model can be explained from the view of
label propagation: the enforcement of within page format consistency and agreement
with seed examples on all the probabilities pm- can be viewed as propagating the
labeling information from the labeled seed text segments to the unlabeled ones along
the structure of the weighted graph. Since we need to decide the probabilities of
assigning each text segment to three classes, our approach indeed has three separate
propagations, each for a different class. It is important to point out that the three
propagation processes are strongly correlated. This is because if a text segment is
assigned with a large probability for one class, the probability of assigning the text

segment to the other classes has to be small. This can be further illustrated by the

169

 

 

Step 0 Divide the page into text segments and
compute their pairwise format similarity (as
described in Section 6.5.1)

Step 1 Initialize a labeled example set L by
identifying a few seed examples (as
described in Section 6.4)

Step 1.1 Solve the semi-supervised model in (6.5)

Step 1.2 Predict the class label of each text segment

 

 

by the class with the largest probability.

 

Figure 6.5. The semi-supervised learning algorithm for extracting QA pairs from FAQ
pages

dual problem of (6.5), i.e.,

3
min A+u-+CoA"-TL+C-A-‘1A+u-+C-A-“-
“Ag? J .7 39]) ( J J) ( J J JpJ)
S.t.11j 20, j=1,2,3

where A and uj are the Lagrangian multipliers. Given the optimal solution for A and

uj, the solution for the primal problem can be written as:
pj = (L + CjAj)_1()‘ + llj + CjAjﬁj)

Clearly, the solutions for all pj’s are correlated through the Lagrangian multiplier
A. Finally, compared to the SGT approach, another advantage of the proposed algo-
rithm is its probabilistic nature. As we will show in our experiments, the estimated
probability does reveal the uncertainty in classifying text segments into the class of
questions, answers, or noise text.

The optimization problem (6.5) is a quadratic programming problem, and there-
fore can be solved efﬁciently using the standard packages. Given the estimated prob-
abilities in P, we predict the class of each text segment by the one with the largest
probability. Figure 6.5 summarizes the proposed algorithm for QA extraction.

As the ﬁnal step, we need to form the QA pairs based on the class labels assigned

to the text segments in each FAQ page. To this end, we ﬁrst reduce the sequence

170

 

by removing all the text segments being labeled as “noise, text”; then in the reduced
sequence, we merge those consecutive text segments that have the same label; ﬁnally,
we pair each merged text segment being labeled as answers with its most immediate
proceeding text segment being labeled as questions, thus form a QA pair. For ex-
ample, suppose we have a sequence of text segments (represented by their IDs) and

their classes labels (represented by Q, A, 0) as follows

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

DUQQOAAOQAAOAQAAO
Then three QA pairs can be extracted from the above sequence:

QA Pair 1: Question-#3,#4; Answer-#6,#7
QA Pair 2: Question-#9; Answer-#10,#11,#13

QA Pair 3: Question-#14; Answer-#15,#16

6.6 Experiments and Discussions

6.6. 1 Experiment Setup

Our testbed includes 10,912 text segments, which forms 1303 QA pairs 5. All these
1303 QA pairs are all manually labeled, and will be used as the truth for evaluation.
Since our proposed algorithm does not need any training data, we will carry out our
proposed algorithm on all the FAQ pages, and compare the extracted QA pairs with
the truth, i.e., the manual labeled QA pairs, for evaluation.

Both precision and recall are used as evaluation metrics. They are deﬁned as

 

E‘These text segments come from 100 FAQ pages that are randomly selected from those FAQ
pages we’ve identiﬁed from our web crawl data, as described in Section 6.3. The selection does not
favor any particular domain or page content). The size of this dataset is comparable to those used
in [99, 136, 84, 137].

171

 

follows

Number of correctly extracted QA pairs

 

Precision =
Number of extracted QA pairs

Number of correctly extracted QA pairs
Recall =

 

Number of the true QA pairs

Note that a QA pair is “correctly extracted” if and only if the corresponding question
and answer share the identical sets of text segments with those that are manually
identiﬁed as a QA pair.

Since the above evaluation metrics deﬁned above view all the text of an QA pair
as a whole, we refer to it as evaluation by QA. Evidently, these evaluation metrics
are very “strict” in that as long as an extracted QA pair is not identical to the true
one, it won’t count as correct, no matter how small the extraction error may be. Here,
we deﬁne another set of metrics based on word counting that are relatively “looser”

compared to the evaluation by QA. Specifically, we can deﬁne

Number of correctly extracted QA words

Number of extracted QA words
Number of correctly extracted QA words

 

Precision =

 

R 11 =
eca Number of words in the true QA pairs

Here, a word is correctly extracted as long as it is in the true QA pairs. Note that
using this set of evaluation metrics, an extracted QA pair will have some chance to
obtain “partial credit” when it is not identical to the true one. Since the evaluation is
based on word counting, we call it as evaluation by word. Furthermore, we combine
precision and recall together and compute the F 1 score that is deﬁned as harmonic
mean of precision and recall. All the above precision/ recall / F 1 metrics for QA pair
extraction can be extended to evaluate the performance of extracting FAQ questions
or FAQ answers separately.

To obtain comprehensive understanding of the overall performance of the ex-
traction, two different averaging methods are used in our evaluation. In the ﬁrst

averaging method, we compute the evaluation metric (i.e., precision, recall, or F 1)

172

 

for each FAQ page, and average it across all FAQ pages. We refer to this averaging
as macro-averaging. Alternatively, we can directly compute the evaluation metrics
for all the individual QA pairs from all the FAQ pages. We refer to this method as
micro—averaging.

We implemented three baseline methods. The ﬁrst two are based on heuristics,
and have been used in previous studies [99, 136, 84, 137]. Both baseline methods

6 to ﬁrst identify all the questions. In the ﬁrst

employ an optimal set of heuristics
baseline method, all the text between two consecutive questions are identiﬁed as the
answers to the proceeding question? We refer to this method as “H-all”. In the
second baseline method, up to 3 sentences following each FAQ question are extracted
as the corresponding FAQ answer. We refer to this baseline method as “H-3”. The
third baseline is Support Vector Machine with precomputed kernels, which is a rep-
resentative supervised learning method. We use the 1/4 of the FAQ pages as the
training set, and the rest as the test set. The kernel is precomputed using the sim-
ilarity measurement deﬁned in Section 6.5.1. LibSVM [34] software package is used
in our experiment. we will refer to this method as “kSVM”. Finally, we will use
“SSL” to refer to the proposed semi—supervised learning algorithm for extracting QA

pairs from FAQ pages. The constants in the optimization problem (6.5) are set as

C1 = 1, C2 = C3 -.= 0.5 from empirical experience.

6.6.2 Experiments on Veriﬁcation of Side Information

We ﬁrst examine the Within Page Consistency principle, namely two text segments
of the same page tend to follow the same HTML format pattern if they are in the
same class. To verify this principle, we computes two quantities for each pair of

text segments using the manually labeled FAQ page set: the similarity between the

 

6The optimal set of heuristics are identiﬁed empirically.
7For the last question in a FAQ page, up to n text segments are extracted for its answer, where
n is the average number of text segments in all previous FAQ answers.

173

l'T_'_ :.

 

 

Text Segment Similarity VS. Being in the Same Class
lr , ., . ._ .. ..... .. .

0
K0
1

 

O
(I)

” +Within Pages (Question) '
......... Across Pages (Question) _
—0—Within Pages (Answer)

0
\I
I

 

 

 

 

 

 

 

m
m
rd
r-l
U
a)
E
at
m
a)
,.C
J.)
g 0.6- --0---Across Pages (Answer)
01
g 0.5— ...........
-r-l ,
(D :
m 0.4—
“a i
0.3- ~
>~ :
.43. z
[:1 0.2.. .. .........
"(all I‘,E~~ ‘8‘ ‘
'80“ +.~+
' ..+ - . 's‘.
E or at” a - —— — - -' 4
O 0.2 0.4 0.6 . 1

Similarity

Figure 6.6. The probability distribution for two text segments to be in the same class
versus the format similarity between text segments

two text segments in their HTML format, and whether or not the two segments are
in the same class. Figure 6.6 summarizes the results by showing how the format
similarity between two text segments affects the probability for them to be in the
same class. The two solid lines show the results for the intra—class similarity of two
text segments within the same pages, and the two dot lines show the results for the
intra—class similarity of two text segments from different pages. It is clear that for
the text segments of the same web pages, their intra—class similarity nicely indicate
their relationship, namely the larger the similarity between two text segments, the
more likely the two segments will be classiﬁed into the same class. In contrast,
the intra—class similarity between text segments of different web pages appears to

be uninformative to their class memberships. Both dot lines in Figure 6.6 are flat

174

 

across wide range of the similarity, and do not. show the clear trend that the larger
the similarity the higher the probability. We thus conclude that the Within Page
Consistency principle provides a piece of valid side information for our experiment

data.

6.6.3 Experiments on QA Extraction

Table 6.2 lists the precision / recall / F 1 of the four different approaches for QA extrac-
tion. As shown in the table, the F1 scores of the proposed “SSL” method are always
considerably better than the three baseline models, whether using the micro—averaging
or the macro—averaging method. Especially, when using the strict evaluation metrics
“by QA”, the improvement made by the proposed algorithm is signiﬁcant. Therefore,
we can conclude that our proposed “SSL” method perform better than the baseline
methods in extracting QA pairs from the FAQ pages.

Further analysis on the experiment results reveals the following ﬁndings

1. For all the four algorithms, the performance of FAQ question extraction is in
general signiﬁcantly better than that of FAQ answer extraction when using the
metric of evaluation by QA. This is in accordance with the fact that answers
are more challenging to extract due to its longer length and more diversiﬁed

formats.

2. In general, for all the four extraction methods, the scores of evaluation by word
are considerably higher than the scores of evaluation by QA. This observation
implies that although the noise text mixed with the FAQ text is small in amount,
it is pervasive. This is related to the fact that there are often repeated patterns
in the FAQ pages, so an error in extracting one QA pair will be very likely to
recur when extracting other QA pairs. When the extracted QA pairs are used
in other applications as enumerated in Section 6.1, such kind of pervasive and

recurring errors could be a very annoying factor that can seriously degrade the

175

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

By QA
micro-averaging macro-averaging
P R F 1 P R F 1
H-all Pair 0.5055 0.4950 0.5002 0.5089 0.4919 0.4965
Q 0.8706 0.6662 0.7548 0.8617 0.6564 0.7280
A 0.5078 0.4973 0.5025 0.5105 0.4933 0.4980
H-3 Pair 0.4704 0.3599 0.4078 0.4601 0.3532 0.3910
Q 0.8706 0.6662 0.7548 0.8617 0.6564 0.7280
A 0.4794 0.3669 0.4157 0.4681 0.3578 0.3968
kSVM Pair 0.1267 0.1696 0.1450 0.1001 0.1493 0.1189
Q 0.3029 0.2304 0.2617 0.2109 0.1923 0.1956
A 0.1323 0.1842 0.1540 0.1104 0.1569 0.1245
SSL Pair 0.8163 0.7130 0.7612 0.7698 0.6838 0.7159
Q 0.9411 0.8220 0.8775 0.9196 0.7891 0.8356
A 0.8383 0.7322 0.7816 0.7901 0.7010 0.7344
By Word
micro—averaging macro-averaging
P R F 1 P R F 1
H-all Pair 0.8728 0.6987 0.7761 0.8640 0.7518 0.8007
Q 0.8896 0.7156 0.7932 0.9027 0.7196 0.7535
A 0.8673 0.9051 0.8858 0.8559 0.9102 0.8680
H-3 Pair 0.9303 0.3543 0.5132 0.9137 0.4656 0.5790
Q 0.8896 0.7156 0.7932 0.9027 0.7196 0.7535
A 0.9441 0.3113 0.4682 0.9229 0.4365 0.5486
kSVM Pair 0.4930 0.4265 0.4573 0.3922 0.3729 0.3654
Q 0.5023 0.5313 0.5164 0.4238 0.4247 0.4156
A 0.5123 0.4269 0.4657 0.4323 0.3920 0.4107
SSL Pair 0.9806 0.8126 0.8887 0.9589 0.8077 0.8493
Q 0.9800 0.8163 0.8907 0.9496 0.7981 0.8590
A 0.9807 0.8727 0.9235 0.9506 0.8324 0.8827

 

 

 

 

 

 

Table 6.2. The performance of QA extraction by four different methods. The columns of
“P”, “R” and “F1” give the precision/recall/ F1 scores.

176

 

application performance.

0::

. It is interesting to observe the performance gap between the “H-all” algorithm
and the “SSL” when we use different evaluation metrics. Using the metrics of
evaluation by word, the scores of the “H-all” algorithm are high, and sometimes
close to those of the “SSL” algorithm. However when using the metrics of
evaluation by QA, the performance gap between “Hall” and “SSL” is much
larger. This phenomenon shows that instead of extracting a large number of
QA pairs with small errors, our “SSL” method is able to yield many more
complete and noise-free QA pairs, thus improving the quality of extracted QA

text.

4. When comparing the “H-all” method with “H-3” method in terms of FAQ
answer extraction 8, we observe that using the “by word” metrics, the former
approach has a much better recall, but a worse precision. This is because the
“H-all” algorithm includes all the text between two consecutive questions as the
answer, and therefore achieves a high recall. In contrast, the “H—3” algorithm
only includes the ﬁrst up—to—three sentences following a question as the answer,

and therefore achieves a high precision. This result shows the interesting trade-

off between precision and recall that exists in most heuristics-based approaches.

5. The performance of “kSVM” method is extremely poor. This is not surprising,
since the precomputed kernel encodes much information about the intra—class
similarity from different pages, which is shown to be uninformative as discussed
in Section 6.6.2. This experiment implies that the large variance existing in
the presentation format from different FAQ pages is hard to be captured by a

supervised learning algorithm, given a limited number of labeled examples.

As stated before, one important advantage of the proposed algorithm is that it

 

8Their performance in extracting FAQ questions are the same because the two methods only
differ in their FAQ answer extraction strategies.

177

 

The Effectiveness of Label Probabilities

 

 

 

 

 

 

1~ ,. ,,,,,,,,,,,,,,, . ..... . ...... . .
i ------ " * * o
O . — I ’v .7! s
9 ’, xi °~-,_-o
>‘O.8‘ \ ﬁ _0” .
g . ‘t.’—;.
H — .~ ! b
:5 0 .7 {I _I a’ .
U 9‘ .’ I
O 0.6— -:I"\ ~_I»” . . ..
ﬁ .5! \‘ 1
c: 0.5— X 2' +All
S +/ ,w’ '“+ Question
'80'4_ :1!” I ' '9-Amswer '
H : . .
'8 O.3~ if .. - *-°---N01se
H _,
a.0.2—
O.l~
0 L 1 1 1 l 4' i
0.4 O. 0.8 0.9 l

5 0 .6 0.7 _ _

Label Probability
Figure 6.7. The correlation between prediction accuracy and classiﬁcation probability
estimated by the proposed algorithm for QA extraction

outputs not only the classiﬁcation results but also the uncertainty in classiﬁcation.
To examine the quality of the class probabilities that are estimated by the proposed
algorithm, for each text segment, we compute two quantities: the label probability
output by the proposed algorithm, and whether or not the predicted class label is
correct. Figure 6.7 summarizes these results by showing how the prediction accuracy
of each class is affected by the estimated label probability. Overall speaking, the larger
is the estimated probability, the higher is the prediction accuracy. We thus conclude
that the estimated label probabilities do indicate the conﬁdence of classifying text

segments.

178

6.7 Conclusions

In this paper, we study the problem of automatically extracting QA pairs from web
FAQS. We ﬁrst identify the pattern diversity challenge, a key challenge in QA extrac-
tion from FAQ pages. We then present a semi-supervised learning algorithm that is
effective in exploiting the Within Page Consistency principle, the key side information
to the QA extraction task. The main advantage of the proposed algorithm is twofold:
ﬁrst, the proposed algorithm is able to boost the limited knowledge by generating the
seed examples that are labeled by heuristics; second, the proposed algorithm is com—
pletely unsupervised. It predicts the class labels of text segments by propagating the
class labels of seed examples to the other text segments of the same page. Empirical
study shows that our proposed QA extraction method is able to yield more complete
and noise-free extracted QA pairs, hence improving the extraction performance sub-

stantially over heuristics-based or supervised learning approaches in previous studies.

179

El

CHAPTER 7

Conclusions

The topic of this thesis work is semi-supervised learning with side information. In
previous chapters, three generic learning tasks and two applications are discussed in

details. The contributions, of each chapter respectively, can be summarized as follows

a In Chapter 2, we propose a constrained non-negative matrix factorization
(CNMF) model for multi-label learning with class correlations, to meet the
challenging situation of a large number of classes and a small size of training

data.

0 In Chapter 3, we propose a novel boosting framework, LinkBoost, to improve

any supervised classiﬁcation algorithm with link information.

o In Chapter 4, we propose a novel boosting framework, BoostCluster, to improve

any clustering algorithm with pairwise constraints.

0 In Chapter 5, we propose a novel statistical framework, Maximum Coherence
Framework, for query translation disambiguation in cross-language information

retrieval with a bilingual dictionary.

0 In Chapter 6, we propose a semi-supervised learning model that utilizes human
knowledge on web FAQs, to automatically extract question-answer pairs from

them without human supervision.

180

 

 

The role of side information played in each task and application is listed as follows

0 In CNMF model for multi-label learning, side information (in the form of class
correlations) is encoded with class labels for computing pairwise example sim-
ilarities, whose consistency with input pattern based similarities is further en-

forced.

o In LinkBoost framework for classiﬁcation, side information (in the form of links)
is combined with input pattern based example similarities, whose consistency

with the pairwise relationship induced from class labels is further enforced.

o In BoostCluster framework for clustering, side information (in the form of pair-
wise constraints) is enforced to be consistent with input pattern based pairwise

similarities computation.

o In Maximum Coherence model for CLIR, side information (in the form of die-
tionaries) is used to construct a graph, over which soft class memberships are

enforced to be consistent with pairwise term similarities (or “coherence”).

o In automatic question-answer extraction from web FAQS, side information (in
the form of “within page consistency” knowledge) is used to correlate three
class label propagations over the same graph, where each label propagation can
be explained as consistency enforcement between class labels and input pattern

based similarities.

From the above, it is easy to ﬁnd that consistency enforcement is a common theme
involved in the use of side information. On an abstract level, the rationale behind all
the work presented in this thesis is the assumption that data examples that are close to
each other, when judging either from their input patterns or side information, should
be predicted similarly. This is a natural extension of the data consistency assumption

underlying most graph-based learning approaches (as stated in Section 1.1.2).

181

 

To effectively use the side information against its usual nature of sparseness, in-
completeness and noise, when incorporating side information into semi-supervised
models, we deliberately avoid formulating them as hard constraints. Instead, we
favor the use of side information in soft penalty. In this way, violation with side
information is allowed, but with a loss; minimizing the collective violation loss results
in tolerance with the noise in the side information. Also, all our proposed models in-
herit the spirit of graph-based learning that a connected graph is constructed over all
the (labeled and unlabeled) data based on their input patterns. Such a graph serves
as a good supplement to understand the underlying structure of data, wherever side
information is missing. This treatment reduces the risk brought by the incomplete-
ness and sparseness of side information. Consequently robustness is presented in all
the proposed semi-supervised learning models with side information, as suggested by
those experiments shown in previous chapters.

In all the semi—supervised models proposed in this thesis work, the theme of con-
sistency enforcement is always achieved through optimizations. A comparison among
all the objective functions in Chapter 2 through Chapter 6 will reveal a certain de—
gree of resemblance. In particular, the consistency is always formalized in a pairwise
manner, i.e., the overall consistency is decomposed into consistency measurements
deﬁned on each pair of data examples, including labeled and unlabeled ones. Such
a formalization embodies the use of unlabeled data, and also fosters the propagation
of supervised information from labeled data to unlabeled data. Another advantage
of formalizing optimization objectives in a sum-of-pairwise manner is, by carefully
choosing the consistency measurement deﬁned in a pair of data examples, the result-
ing optimization is often convex, thus tractable and with global optimum.

The above analysis summarizes a few nice properties in the proposed semi-
supervised models. In conclusion, the work presented in this thesis suggests a viable

approach towards semi-supervised learning with side information: enforcing consis-

182

"1 7‘7 -I r) -m-

tency among data input patterns, supervised information (if any), side information,
and predictions. We believe that this approach is applicable to a wide variety of

learning tasks and application areas.

183

APPENDIX A

Related Proofs and Lists

A.1 Proof of the eigenvector problem in Section 4.3.3

We show that every non—zero eigenvector Vi can be written as a linear combination
of xi,i = 1,2,. . .,m, i.e., the examples involved in the pairwise constraints. Let v
and A % 0 be an eigenvector and eigenvalue of matrix XTXT. We therefore have
XTXTV 2 Av. We further decompose v into two parts: v = v” + V1.7 where v”
represents the projection of v in the subspace spanned by {523,531, and x_L represents
the projection perpendicular to {i,-}f:,. To show v can be written as a linear
combination of {i,}f_=1, we need to Show vi = 0. To this end, we ﬁrst utilize
the expression in (4.20) to calculate VIXTXT, i.e.,
m
VIXTXT = Z edge-5a; = 0T
i,j=l
We then multiply the eigen equation XTXTV = Av by VI, which leads to the

following equation
T T T 2
viXTX v = 0 = Aviv 2 A||vi||2

Since A 74 0, we have Vl = 0 and v = v“.

184

A.2 Proof of the generalized eigenvector problem in Section

4.3.3

We will show that for the ith eigenvector Vi 2 KW, of XTXT, wi corresponds
to the ith eigenvector of the generalized eigenvector problem in (4.22). First, we
realize that the orthogonality condition vavj = 6 (i, j ) becomes WZTXXTWj = 62,3“
We can write the above condition for all wi,i = 1, 2, . . . , s in the matrix form, i.e.,
WTXXTW 2 I3. Second, the eigenvectors V = (v1,v2, . . . ,vs) are the optimal
solution to the following optimization problem, i.e.,

arg max tr(VTXTXTV)

VERdxs

s. t. vTv = I,

Replacing V in the above optimization problem with V 2 KW, we have

max t1‘(WTXTXTXTXW)
We R?” X S

s t. WTxTXW = I,

It is well known that the optimal solution W to the above problem consists of the

ﬁrst 5 eigenvectors of the generalized eigenvector problem in (4.22).

A.3 Features for FAQ page classiﬁcation

The 19 features we used in FAQ page classiﬁcation are as follows

1. Occurrence of keywords faq or f aqs in the URL host section 1

2. Number of keywords f aq or f aqs in the URL query section

 

1A URL can be divided into six sections. For example in
http://www.amazon.com:83/search.html?q=trave1#marker, the protocol section is http,
the host section is ww.amazon.com, the port section is 83, the path section is /search.html, the
query section is q=travel, and the fragment section is marker.

 

 

10.

11.

12.

13.

14.

15.

16.

17.

18.

Number of keywords f aq or faqs in the URL fragment section

. Normalized position of keywords faq or faqs in the URL path section (for

example, in URL path section / education/ f aq/ coins/ denominations . shtml,
the keyword f aq appears in the 3rd segment when counting from right to left.

Since altogether there are 4 segments, the normalized position of the faq is 3 / 4)

. Number of question marks in the page body text

. Number of question marks in the page title

Number of question marks in the anchor text

. Logarithm of total number of words in the whole page text

. Number of anchors with question marks

Percentage of anchors-with—question-marks in all the anchors
Ratio of question marks to words in the page body text

Ratio of question marks to words in the page title

Ratio of question marks to words in all the anchor text
Average number of words between consecutive question marks
Number of text segments in the page body text

Number of text segments (see Section 6.5.1 for deﬁnition) with question marks

in the page body text
Percentage of text-segments—with-question-marks in all the text segments

Percentage of links in all the text-segments—with-question-marks

186

 

kl

 

19. Number of well separated text—segments—with-question-marks (two consecutive
text segments are well separated if there are at least kt words between them. In

our practice, we set I; = 3.)

 

 

187

[1]

[2]

[3]

[4]

l5]

{6]

[7]

l8]

[9]

BIBLIOGRAPHY

Mirna Adriani. Dictionary-based clir for the clef multilingual track. In Cross-
Language Information Retrieval and Evaluation, Workshop of Cross-Language
Evaluation Forum, CLEF 2000, Lisbon, September 2000.

Mirna Adriani. Using statistical term similarity for sense disambiguation in
cross-language information retrieval. Inf. Retr., 2(1):71—~82, 2000.

Eugene Agichtein, Steve Lawrence, and Luis Gravano. Learning to ﬁnd answers
to questions on the web. ACM Transactions on Internet Technology, 4(2):129~
162, 2004.

Yasemin Altun, Thomas Hofmann, and Alexander J. Smola. Gaussian process
classiﬁcation for segmenting and annotating sequences. In I CML ’04: Proceed-

ings of the twenty-ﬁrst international conference on Machine learning, page 4,
New York, NY, USA, 2004. ACM Press.

M. R. Anderberg. Cluster Analysis for Applications. Academic Press, Inc., New
Youk, NY, 1973.

Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector
machines for multiple-instance learning. In Advances in Neural Information
Processing Systems 15, pages 561-568, 2002.

Ralitsa Angelova and Gerhard Weikum. Graph-based text classiﬁcation: learn
from your neighbors. In SIGIR ’06: Proceedings of the 29th annual international
ACM SICIR conference on Research and development in information retrieval,
pages 485—492, New York, NY, USA, 2006. ACM.

Lisa Ballesteros and W. Bruce Croft. Phrasal translation and query expansion
techniques for cross-language information retrieval. In SICIR ’97: Proceed-
ings of the 20th annual international ACM SICIR conference on Research and
development in information retrieval, pages 84—91. ACM Press, 1997.

Aharon Bar-Hillel, Tomer Hertz, Noam Shental, and Daphna Weinshall. Learn-
ing distance functions using equivalence relations. In I CML ’03: Proceedings of
the Twentieth International Conference, pages 11—18, August 21—24 2003.

188

[10]

[11]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

Kobus Barnard, Pinar Duygulu, David A. Forsyth, Nando de Freitas, David M.
Blei, and Michael I. Jordan. Matching words and pictures. Journal of Machine
Learning Research, 3:1107—1135, 2003.

Kobus Barnard and David A. Forsyth. Learning the semantics of words and
pictures. In ICC V ’01 : Proceedings. Eighth IEEE International Conference on
Computer Vision, pages 408—415, 2001.

Sugato Basu. Semi-supervised Clustering: Probabilistic Models, Algorithms and
Erperiments. PhD thesis, The University of Texas at Austin, 2005.

Sugato Basu, Arindam Banerjee, and Raymond J. Mooney. Semi-supervised
clustering by seeding. In I CML ’02: Proceedings of the Nineteenth International
Conference on Machine Learning, pages 27—34, San Francisco, CA, USA, 2002.
Morgan Kaufmann Publishers Inc.

Sugato Basu, Arindam Banerjee, and Raymond J. Mooney. Active semi-

supervision for pairwise constrained clustering. In SDM ’04 : Proc. of the Fourth
SIAM International Conference on Data Mining, 2004.

Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. A probabilistic frame-
work for semi—supervised clustering. In KDD ’04: Proceedings of the tenth ACM
SIGKDD international conference on Knowledge discovery and data mining,
pages 59—68, New York, NY, USA, 2004. ACM Press.

R. Bekkerman and M. Sahami. Semi-supervised clustering using combinato-
rial MRFS. In Proc. of ICML-06 Workshop on Learning in Structured Output
Spaces, 2006.

M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and
data representation. Neural Computation, 15(6):1373—1396, 2003.

Mikhail Belkin and Partha Niyogi. Using manifold stucture for partially labeled
classiﬁcation. In Sebastian Thrun, Lawrence Saul, and Bernhard Scholkopf,
editors, Advances in Neural Information Processing Systems 15, Cambridge,

MA, 2002. MIT Press.

Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. Label propagation and
quadratic criterion. In Olivier Chapelle, Bernhard Schlkopf, and Alexander
Zien, editors, Semi-Supervised Learning, pages 193-216. MIT Press, 2006.

K. Bennett, P. Bradley, and A. Demiriz. Constrained k—means clustering. Tech-
nical Report 2000—65, Microsoft Research, May 2000.

189

[21]

[22]

[23]

[24]

[261

[27]

[28]

[29]

[30]

Kristin P. Bennett, Ayhan Demiriz, and Richard Maclin. Exploiting unla-
beled data in ensemble methods. In KDD ’02: Proceedings of the eighth ACM
SIGKDD international conference on Knowledge discovery and data mining,

pages 289—296, New York, NY, USA, 2002. ACM.

Adam Berger, Rich Caruana, David Cohn, Dayne Freitag, and Vibhu Mittal.
Bridging the lexical chasm: statistical approaches to answer-ﬁnding. In SI CIR
’00: Proceedings of the 23rd annual international ACM SICIR conference on
Research and development in information retrieval, pages 192—199, New York,

NY, USA, 2000. ACM Press.

Adam Berger and Vibhu O. Mittal. Query-relevant summarization using faqs.
In ACL ’00: Proceedings of the 38th Annual Meeting on Association for Com-
putational Linguistics, pages 294—301, Morristown, NJ, USA, 2000. Association
for Computational Linguistics.

Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney. Integrating constraints
and metric learning in semi-supervised clustering. In ICML ’04: Proceedings
of the twenty-ﬁrst international conference on Machine learning, page 11, New

York, NY, USA, 2004. ACM Press.

Avrim Blum and Shuchi Chawla. Learning from labeled and unlabeled data
using graph mincuts. In I CML ’01 : Proceedings of the Eighteenth International
Conference on Machine Learning, pages 1926, San Francisco, CA, USA, 2001.
Morgan Kaufmann Publishers Inc.

Avrim Blum, John Lafferty, Mugizi Robert Rwebangira, and Rajashekar Reddy.
Semi-supervised learning using randomized mincuts. In I CML ’04: Proceedings
of the twenty—ﬁrst international conference on Machine learning, page 13, New

York, NY, USA, 2004. ACM Press.

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with
co-training. In COLT’ .98: Proceedings of the eleventh annual conference on
Computational learning theory, pages 92—100, New York, NY, USA, 1998. ACM.

Matthew R. Boutella, Jiebo Luob, Xipeng Shen, and Christopher M. Browna.
Learning multi-label scene classiﬁcation. Pattern Recognition, 37:1757—1771,
2004.

Christopher J. C. Burges. A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2(2):121~—167, 1998.

Robin D. Burke, Kristian J. Hammond, Vladimir A. Kulyukin, Steven L. Lyti-
nen, N. Tomuro, and S. Schoenberg. Question answering from frequently asked

190

 

 

question ﬁles: Experiences with the faq ﬁnder system. Technical report, Uni-
versity of Chicago, Chicago, IL, USA, 1997.

[31] Lijuan Cal and Thomas Hofmann. Hierarchical document categorization with
support vector machines. In Proc. ACM Conf Information and Knowledge
Management, 2004.

[32] Pavel Calado, Marco Cristo, Edleno Moura, Nivio Ziviani, Berthier Ribeiro—
Neto, and Marcos Andre; Goncalves. Combining link-based and content-based
methods for web document classiﬁcation. In CIKM ’03: Proceedings of the
twelfth international conference on Information and knowledge management,

pages 394—401, New York, NY, USA, 2003. ACM Press.

[33] Soumen Chakrabarti, Byron Dom, and Piotr Indyk. Enhanced hypertext cat-
egorization using hyperlinks. In SICMOD ’98: Proceedings of the 1998 ACM
SI CM 0D international conference on Management of data, pages 307—318, New
York, NY, USA, 1998. ACM.

[34] Chih-Chung Chang and Chili-Jen Lin. LIBSVM: a library
for support vector machines, 2001. Software available at
http://www.csie.ntu.edu.tw/ cjlin/libsvm.

[35] Zheng Chen, Liu Wenyin, Feng Zhang, and Mingjing Li. Web mining for web
image retrieval. J. Am. Soc. Inf. Sci. T echnol., 52(10):831—839, 2001.

[36] Hai Leong Chieu and Hwee Tou Ng. A maximum entropy approach to infor-
mation extraction from semi-structured and free text. In Eighteenth national
conference on Artiﬁcial intelligence, pages 786—791, Menlo Park, CA, USA,
2002. American Association for Artiﬁcial Intelligence.

[37] Fan R. K. Chung. Eigenvalues of graphs. In Proceedings of the International
Congress of Mathematicians, pages 1333—1342, Zurich, 1994. Birkh”auser Ver-
lag, Berlin.

[38] Fan R. K. Chung. Spectral Graph Theory. CBMS Regional Conference Series
in Mathematics, ISSN: 0160—7642. American Mathematical Society, 1997.

[39] William W. Cohen. Improving a page classiﬁer with anchor extraction and link
analysis. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural
Information Processing Systems 15, pages 1481-1488. MIT Press, Cambridge,
MA, 2002.

[40] David Cohn, Deepak Verma, and Karl Pﬁeger. Recursive attribute factoring. In
B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information
Processing Systems 19, pages 297—304. MIT Press, Cambridge, MA, 2007.

191

[41] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev—Slnvartz, and Yoram
Singer. Online passive-agressive algorithms. Machine Learning: 7:55la585,
2006.

[42] Koby Crammer and Yoram Singer. A new family of online algorithms for cate-
gory ranking. In SI CIR ’02: Proceedings of the 25th annual international ACM
SICIR conference on Research and development in informaion retrieval, 2002.

[43] Florence d’Alché Buc, Yves Grandvalet, and Christophe Ambroise. Semi-
supervised marginboost. In Advances in Neural Information Processing Systems
14, pages 553—560, 2001.

[44] I. Davidson and SS. Ravi. Clustering under constraints: Feasibility results and
the k-means algorithm. In SIAM Data Mining Conference, 2005.

[45] 1. Davidson and SS. Ravi. Hierarchical clustering with constraints: Theory
and practice. In Proc. of the 9th European Principles and Practice of KDD
(PKDD), 2005.

[46] Mark W. Davis. New experiments in cross-language text retrieval at NMSU’S
computing research lab. In D. K. Harman, editor, The Fifth Tert REtrieval
Conference (TREC-5). NIST, 1996.

[47] Brian D. Davison. Topical locality in the web. In SI CIR ’00: Proceedings of the
23rd annual international ACM SI CIR conference on Research and development
in information retrieval, pages 272—279, New York, NY, USA, 2000. ACM Press.

[48] Brian D. Davison. Toward a uniﬁcation of text and link analysis. In SICIR
’03: Proceedings of the 26th annual international ACM SICIR conference on
Research and development in informaion retrieval, pages 367—368, New York,

NY, USA, 2003. ACM Press.

[49] Chris H. Q. Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, and Horst D. Simon.
A min-max cut algorithm for graph partitioning and data clustering. In Pro-
ceedings of the 2001 IEEE International Conference on Data Mining (ICDM
2001), pages 107-114. IEEE Computer Society, 2001.

[50] CL. Blake D.J. Newman, S. Hettich and OJ. Merz. UCI repository of machine
learning databases, 1998.

[51] Abdessamad Echihabi and Daniel Marcu. A noisy-channel approach to ques-
tion answering. In ACL ’03: Proceedings of the 413t Annual Meeting of the
Association for Computation al Linguistics, Sapporo, Japan, 2003.

192

 

[52] Andre Elisseeff and JasonWeston. A kernel method for multi-labelled classiﬁ-

I54]

[55]

I56]

[57]

I59]

[60]

[61]

cation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances
in Neural Information Processing Systems 14, pages 681- 687, Cambridge, MA,
2002. MIT Press.

Marcello Federico and Nicola Bertoldi. Statistical cross—language information
retrieval using N-best query translations. In SI CIR ’02: Proceedings of the 25th
annual international ACM SICIR conference on Research and development in
information retrieval, pages 167—174. ACM Press, 2002.

Christiane Fellbaum, editor. Wordnet : an electronic lexical database. MIT
Press, 1998.

Yoav Freund and Robert E. Schapire. A decision—theoretic generalization of
on-line learning and an application to boosting. pages 23-37, 1995.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of
on—line learning and an application to boosting. J. of Computer and System
Sciences, 55(1):119—139, 1997.

Jianfeng Gao, Jian-Yun Nie, Endong Xun, Jian Zhang, Ming Zhou, and
Changning Huang. Improving query translation for cross—language informa-
tion retrieval using statistical models. In SICIR ’01: Proceedings of the 24th
annual international ACM SIGIR conference on Research and development in
information retrieval, pages 96~104. ACM Press, 2001.

Jianfeng Gao, Ming Zhou, Jian-Yun Nie, Hongzhao He, and Weijun Chen.
Resolving query translation ambiguity using a decaying co—occurrence model
and syntactic dependence relations. In SICIR ’02: Proceedings of the 25th
annual international ACM SICIR conference on Research and development in
information retrieval, pages 183-190. ACM Press, 2002.

Sheng Gao, Wen Wu, Chin-Hui Lee, and Tat-Seng Chua. A MFoM learning
approach to robust multiclass multi-label text categorization. In ICML ’04:
Proceedings of the twenty-ﬁrst international conference on Machine learning,
2004.

Nadia Ghamrawi and Andrew McCallum. Collective multi-label classiﬁcation.
In CIKM ’05: Proceedings of the 14th ACM international conference on Infor-
mation and knowledge management, pages 195—200, New York, NY, USA, 2005.
ACM Press.

Rayid Ghani, Sean Slattery, and Yiming Yang. Hypertext categorization using
hyperlink patterns and meta data. In I CML ’01 : Proceedings of the Eighteenth

193

 

 

 

[6‘3]

[63]

[54]

[66]

[67]

[68]

[69]

[70]

[71]

International Conference on Machine Learning, pages 178—185, San Francisco,

CA, USA, 2001. Morgan Kaufmann Publishers Inc.

PE. Gill, W. Murray, and M.H. Wright. Practical Optimization. Academic
Press, Inc., San Diego, USA, 1981.

Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, and
Gary W. Flake. Using web structure for classifying and describing web pages.
In WWW ’02: Proceedings of the 11th international conference on World Wide
Web, pages 562-569, New York, NY, USA, 2002. ACM.

Jacob Goldberger, Sam T. Roweis, Geoffrey E. Hinton, and Ruslan Salakhutdi-
nov. Neighbourhood components analysis. In Advances in Neural Information

Processing Systems 17 (NIPS 2004 f, 2004.

Gene H. Golub and Charles F. Van Loan. Matrix Computation. John Hopkins
Press, 1989.

Ming Gu, Hongyuan Zha, Chris Ding, Xiaofeng He, and Horst Simon. Spectral
relaxation models and structure analysis for k-way graph clustering and bi-
clustering. Technical Report CSE—01-007, Department of Computer Science
and Engineering, Pennsylvania State University, 2001.

Carlos Guestrin, Andreas Krause, and Ajit Paul Singh. Near-optimal sensor
placements in gaussian processes. In I CML ’05: Proceedings of the 22nd inter-
national conference on Machine learning, pages 265-272, New York, NY, USA,
2005. ACM Press.

L. Hagen and AB. Kahng. New spectral methods for ratio cut partitioning and
clustering. IEEE. Trans. on Computed Aided Desgin, 11:1074—1085, 1992.

Sanda Harabagiu and Finley Lacatusu. Strategies for advanced question an-
swering. In Sanda Harabagiu and Finley Lacatusu, editors, HLT-NAA CL 2004 :
Workshop on Pragmatics of Question Answering, pages 1—9, Boston, Mas-
sachusetts, USA, May 2 - May 7 2004. Association for Computational Lin-
guistics.

Taher H. Haveliwala. Topic-sensitive pagerank. In WWW ’02: Proceedings of
the 11th international conference on World Wide Web, pages 517-526, New
York, NY, USA, 2002. ACM Press.

Tomer Hertz, Aharon Bar-Hillel, and Daphna Weinshall. Boosting margin based
distance functions for clustering. In ICML ’04: Proceedings of the twenty-ﬁrst
international conference on Machine learning, page 50, New York, NY, USA,

2004. ACM Press.

194

[72]

[73]

[74]

[75]

[76]

[77]

I78]

[79]

[80]

[811

[821

Tomer Hertz, Aharon Bar Hillel, and Daphna W'einshall. Learning a kernel
function for classiﬁcation with small training samples. In I CML ’06: Proceedings
of the 23rd international conference on Machine learning, pages 401 -408, New

York, NY, USA, 2006. ACM Press.

Tomer Hertz, Noam Shental, Aharon Bar-Hillel, and Daphna Weinshall. En—
hancing image and video retrieval: Learning via equivalence constraints. In
C VPR ’03: 2003 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, volume 02, page 668, Los Alamitos, CA, USA, 2003. IEEE
Computer Society.

Steven C. H. Hoi, Wei Liu, Michael R. Lyu, and VVei-Ying Ma. Learning distance
metrics with contextual constraints for image retrieval. In 2006 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition ( C VPR 2006),
pages 2072—2078, 2006.

Stefan Holland, Martin Ester, and Werner KieBling. Preference mining: A novel

approach on mining user preferences for personalized applications. In PKDD,
pages 204—216, 2003.

David A. Hull and Gregory Grefenstette. Querying across languages: a
dictionary-based approach to multilingual information retrieval. In SICIR ’96:
Proceedings of the 19th annual international ACM SIGIR conference on Re-
search and development in information retrieval, pages 49—57. ACM Press, 1996.

ImageCLEF. The CLEF Cross Language Image Retrieval Track (ImageCLEF),
http://ir.shef.ac.uk/imageclef/ , 2003.

Abraham Ittycheriah, Martin Franz, and Salim Roukos. IBM’s statistical ques-
tion answering system. In TREC ’01 : Proceedings of the 10th Text REtrieval
Conference, Gaithersburg, MD, 2001.

Tommi S. Jaakkola. Variational methods for inference and estimation in graph-
ical models. PhD thesis, MIT, 1997. Supervisor-Michael I. Jordan.

A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM
Comput. Surv., 31(3):264—323, September 1999.

Myung-Gil Jang, Sung Hyon Myaeng, and Se Young Park. Using mutual infor-
mation to resolve query translation ambiguities and query term weighting. In
ACL ’99: Proceedings of the 37th annual meeting of the association for compu-
tational linguistics, 1999.

Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. Finding semantically simi-
lar questions based on their answers. In SICIR ’05: Proceedings of the 28th

195

 

 

[83]

[84]

[851

[86]

[87]

[88]

[89]

[90]

[91]

[92]

annual international ACM SI CIR conference on Research and development in

information retrieval, pages 617—618, New York, NY, USA, 2005. ACM Press.

Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. Finding similar questions in
large question and answer archives. In CIKM ’05: Proceedings of the 14th ACM
international conference on Information and knowledge management, pages 84—

90, New York, NY, USA, 2005. ACM Press.

Valentin Jijkoun and Maarten de Rijke. Retrieving answers from frequently
asked questions pages on the web. In CIKM ’05: Proceedings of the 14th ACM
international conference on Information and knowledge management, pages 76—

83, New York, NY, USA, 2005. ACM Press.

Rong Jin and Zoubin Ghahramani. Learning with multiple labels. In S. Thrun
S. Becker and K. Obermayer, editors, Advances in Neural Information Process-
ing Systems 15, pages 897—904. MIT Press, Cambridge, MA, 2003.

Thorsten Joachims. Text categorization with suport vector machines: Learn-
ing with many relevant features. In Proc European Conference on Machine
Learning, 1998.

Thorsten Joachims. Transductive learning via spectral graph partitioning.
In Proceedings of the International Conference on Machine Learning (ICML),
2003.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K.

Saul. An introduction to variational methods for graphical models. Mach.
Learn, 37(2):183-233, 1999.

Sepandar D. Kamvar, Dan Klein, and Christopher D. Manning. Spectral Learn-
ing. In IJCAI ’03: Proceedings of the Eighteenth International Joint Conference
on Artiﬁcial Intelligence, August 9—15 2003.

Hideto Kazawa, Tomonori Izumitani, Hirotoshi Taira, and Eisaku Maeda. Max-
imal margin labeling for multi-topic text categorization. In Lawrence K. Saul,

Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Process-
ing Systems 17, pages 649—656. MIT Press, Cambridge, MA, 2005.

D. Klein, S. D. Kamvar, and C. D. Manning. From instance-level constraints to

space-level constraints: Making the most of prior knowledge in data clustering.
In ICML ’02: Proceedings of 19th Intl. Conf. on Machine Learning, 2002.

Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal
of the ACM, 46(5):604~632, 1999.

196

 

 

[93]

[94]

[95]

[96]

[97]

[98]

[99]

[100]

[101]

W. Kraaij, R. Pohlmann, and D. Hiemstra. Twenty—one at TREC-8: using lan-
guage technology for information retrieval. In Ellen M. Voorhees and Donna K.
Harman, editors, The Eighth Text REtrieval Conference (TREC-8), volume 8,
pages 285—300. National Institute of Standards and Technology, NIST, 2000.
NIST Special Publication 500-246.

Wessel Kraaij, Jian-Yun Nie, and Michel Simard. Embedding web-based sta-
tistical translation models in cross-language information retrieval. Comput.
Linguist, 29(3):381—419, 2003.

Wessel ”Kraaij and Renée Pohlmann. Different approaches to cross language
information retrieval. In W. Daelemans, K. Sima’an, J. Veenstra, and J. Za-
vrel, editors, Computational Linguistics in the Netherlands 2000, number 37
in Language and Computers: Studies in Practical Linguistics, pages 97—111,
Amsterdam, 2001. Rodopi.

Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. Semi-
supervised graph clustering: a kernel approach. In ICML ’05: Proc. of the
22nd international conference on Machine learning, pages 457—464, New York,
NY, USA, 2005. ACM Press.

Vladimir A. Kulyukin, Kristian A. Hammond, and Robin D. Burke. Auto—
mated processing of structured online documents. Technical report, University
of Chicago, Chicago, IL, USA, 1998.

James T. Kwok and Ivor W. Tsang. Learning with idealized kernels. In Proceed-
ings of the Twentieth International Conference on Machine Learning (ICML-
2003), pages 400—407, Washington, DC, USA, August 2003.

Yu-Sheng Lai, Kuao—Ann Fung, and Chung-Hsien Wu. Faq mining via list
detection. In COLING—02: proceeding of the 2002 conference on multilingual
summarization and question answering, pages 1—7, Morristown, NJ, USA, 2002.
Association for Computational Linguistics.

Victor Lavrenko, Martin Choquette, and W. Bruce Croft. Cross-lingual rele-
vance models. In SI CIR ’02: Proceedings of the 25th annual international ACM
SIGIR conference on Research and development in information retrieval, pages
175—182. ACM Press, 2002.

Victor Lavrenko and W. Bruce Croft. Relevance based language models. In SI-
CIR ’01: Proceedings of the 24th annual international ACM SICIR conference
on Research and development in information retrieval, pages 120—127. ACM
Press, 2001.

197

[102}

[103]

[104]

[105]

[106]

[107]

[108]

[109]

[110]

[111]

Martin H. C. Law, Alexander P. Topchy, and Anil K. Jain. Model-based cluster-
ing with probabilistic constraints. In SIAM International Conference on Data

Mining (SDM), 2005.

Neil D. Lawrence and l\’lichael I. Jordan. Semi—supervised learning via gaus-
sian processes. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors,
Advances in Neural Information Processing Systems 17, pages 753—760. MIT
Press, Cambridge, MA, 2005.

D. D. Lee and H. S. Seung. Learning the parts of objects by non—negative matrix
factorization. Nature, 401(6755):788-791, October 1999.

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix

factorization. In Advances in Neural Information Processing Systems 13, pages
556—562, 2000.

Shuang Liu, Fang Liu, Clement Yu, and Weiyi Meng. An effective approach
to document retrieval via utilizing wordnet and recognizing phrases. In SI CIR
’04: Proceedings of the 27th annual international ACM SICIR conference on
Research and development in information retrieval, pages 266—272, New York,
NY, USA, 2004. ACM Press.

Qing Lu and Lise Getoor. Link-based classiﬁcation. In ICML ’03: Proceedings
of the twentieth international conference on Machine learning, pages 496—503,
2003.

Zhengdong Lu and Todd Leen. Semi—supervised learning with penalized proba-
bilistic clustering. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors,

Advances in Neural Information Processing Systems 17, pages 849—856, Cam-
bridge, MA, 2005. MIT Press.

Akira Maeda, Fatiha Sadat, Masatoshi Yoshikawa, and Shunsuke Uemura.
Query term disambiguation for web cross-language information retrieval using
a search engine. In Proceedings of the 5 th International Workshop on Infor-
mation Retrieval with Asian Languages (IRAL ’00), pages 25—32. ACM Press,
2000.

Andrew McCallum. Multi-label text classiﬁcation with a mixture model trained
by EM. In AAAI ’99 Workshop on T ext Learning, 1999.

Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum
entropy markov models for information extraction and segmentation. In I CML
’00: Proceedings of the Seventeenth International Conference on Machine
Learning, pages 591—598, San Pl‘ancisco, CA, USA, 2000. Morgan Kaufmann
Publishers Inc.

198

 

[112]

[113]

[114]

[115]

[116]

[117]

[118]

[119]

[120]

[121]

Joel C. Miller, Gregory Rae, Fred Schaefer, Lesley A. Ward, Thomas LoFaro,
and Ayman Farahat. Modiﬁcations of kleinberg’s hits algorithm using matrix
exponentiation and web log records. In SICIR ’01: Proceedings of the 24th

annual international ACM SIGIR conference on Research and development in
information retrieval, pages 444—445, New York, NY, USA, 2001. ACM Press.

Tom Mitchell. Machine Learning. hA’IcGraw Hill, 1997.

Christof Monz and Bonnie J. Dorr. Iterative translation disambiguation for
cross-language information retrieval. In SIGIR ’05: Proceedings of the 28th
annual international ACM SICIR conference on Research and development in
information retrieval, pages 520—527, 2005.

Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Anal-
ysis and an algorithm. In Advances in Neural Information Processing Systems
14, pages 849—856, 2001.

Jian-Yun Nie and Michel Simard. Using statistical translation models for bilin-
gual ir. In Cross-Language Information Retrieval and Evaluation, Workshop
of Cross-Language Evaluation Forum, CLEF 20’01, pages 137—150. Springer-
Verlag, 2002.

Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum en-
tropy for text classiﬁcation. In IJCAI-99 Workshop on Machine Learning for
Information Filtering, pages 61—67, 1999.

Hyo—Jung Oh, Sung Hyon Myaeng, and Mann-Ho Lee. A practical hypertext
catergorization method using links and incrementally available class informa-
tion. In SI GIR ’00: Proceedings of the 23rd annual international ACM SIGIR
conference on Research and development in information retrieval, pages 264—
271, New York, NY, USA, 2000. ACM.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pager-
ank citation ranking: Bringing order to the web. Technical report, Stanford
Digital Library Technologies Project, 1998.

Xiaoguang Qi and Brian D. Davison. Knowing a web page by the company it
keeps. In CIKM ’06: Proceedings of the 15th ACM international conference on
Information and knowledge management, pages 228—237, New York, NY, USA,
2006. ACM.

Dragomir R. Radev, Hong Qi, Zhiping Zheng, Sasha Blair—Goldensohn, Zhu
Zhang, Weigo Fan, and John Prager. Mining the web for answers to natural
language questions. In CIKM ’01 : Proceedings of the 10th ACM international
conference on Information and knowledge management, Atlanta, GA, 2001.

199

[122]

[123]

[124]

[12.5]

[126]
[127]

[128]

[129]

[130]

[131]

[132]

[133]

Ganesh Ramakrishnan, Soumen Chakrabarti, Deepa Paranjpe, and Pushpak
Bhattacharya. Is question answering an acquired skill? In WWW ’04: Proceed-
ings of the 13th international conference on World Wide Web, pages 111—120,
New York, NY, USA, 2004. ACM Press.

Juho Rousu, Craig Saunders, Sandor Szedmak, and John Shawe—Taylor. On
maximum margin hierarchical multi-label classiﬁcation. In NIPS Workshop on
Learning With Structured Outputs, 2004.

S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear
embedding. In Science, 2000.

Walter Rudin. Real 65 Complex Analysis. McGraw—Hill, Inc., New York, NY,
3rd edition, 1987.

Gerard Salton, editor. The SMART retrieval system. Prentice-Hall, 1971.

Robert E. Schapire and Yoram Singer. Boostexter: A boosting-based systemfor
text categorization. Machine Learning, 39(2-3), 2000.

Matthew Schultz and Thorsten Joachims. Learning a distance metric from
relative comparisons. In Advances in Neural Information Processing Systems

16, 2003.

M. Seeger. Learning with labeled and unlabeled data. Technical report, Uni-
versity of Edinburgh, 2001.

Dou Shen, Jian-Tao Sun, Qiang Yang, and Zheng Chen. A comparison of
implicit and explicit links for web page classiﬁcation. In WWW ’06: Proceedings
of the 15th international conference on World Wide Web, pages 643—650, New
York, NY, USA, 2006. ACM.

Xuehua Shen, Bin Tan, and ChengXiang Zhai. Context-sensitive information
retrieval using implicit feedback. In SI CIR ’05: Proceedings of the 28th annual
international ACM SICIR conference on Research and development in infor-
mation retrieval, pages 43—50, New York, NY, USA, 2005. ACM Press.

Xuehua Shen, Bin Tan, and ChengXiang Zhai. Implicit user modeling for per-
sonalized search. In CIKM ’05: Proceedings of the 14th ACM international
conference on Information and knowledge management, pages 824—831, New

York, NY, USA, 2005. ACM Press.

N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall. Computing gaussian
mixture models with em using side-information. In Proc. of workshop on the
continuum from labeled to unlabeled data in machine learning and data mining,

2003.

200

 

[134]

[135]

[136]

[137]

[138]

[139]

[140]

[141]

[142]

[143]

[144]

J ianbo Shi and J itendra Malik. Normalized cuts and image segmentation. IEEE
Trans. on PAMI, 22(8):888—905, August 2000.

Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length
normalization. In SICIR ’96: Proceedings of the 19th annual international
ACM SI CIR conference on Research and development in information retrieval,
pages 21—29, New York, NY, USA, 1996. ACM Press.

Radu Soricut and Eric Brill. Automatic question answering: Beyond the factoid.
In HLT-NAACL 2004, Human Language Technology Conference of the North
American Chapter of the Association for Computational Linguistics, pages 57—
64, 2004.

Radu Soricut and Eric Brill. Automatic question answering using the web:
Beyond the factoid. Information Retrieval - Special Issue on Web Information
Retrieval, 9(2):191—206, 2006.

Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In
ICML ’03: Proceedings of the twentieth international conference on Machine
learning, pages 720—727, 2003.

Smitha Sriram, Xuehua Shen, and Chengxiang Zhai. A session-based search
engine. In SI CIR ’04: Proceedings of the 27th annual international ACM SI CIR
conference on Research and development in information retrieval, pages 492—

493, New York, NY, USA, 2004. ACM Press.

Jian-Tao Sun, Hua-Jun Zeng, Huan Liu, Yuchang Lu, and Zheng Chen.
Cubesvd : a novel approach to personalized web search. In WWW ’05: Proceed-
ings of the 14th international conference on World Wide Web, pages 382—390,
New York, NY, USA, 2005. ACM Press.

Ben Taskar, Vassil Chatalbashev, and Daphne Koller. Learning associative
markov networks. In ICML ’04: Proceedings of the twenty—ﬁrst international
conference on Machine learning, page 102, New York, NY, USA, 2004. ACM

Press.

J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework
for nonlinear dimensionality reduction. Science, 2000.

The Mathworks. Matlab, http://www.mathworks.eom/.

John A. Tomlin. A new paradigm for ranking pages on the world wide web.
In WWW ’03: Proceedings of the 12th international conference on World Wide
Web, pages 350—355, New York, NY, USA, 2003. ACM Press.

201

 

[145] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Al-
tun. Support vector machine learning for interdependent and structured output
spaces. In ICML ’04: Proceedings of the twenty-ﬁrst international conference
on Machine learning, page 104, New York, NY, USA, 2004. ACM Press.

[146] Naonori Ueda and Kazumi Saito. Parametric mixture models for multi-labeled
text. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in
Neural Information Processing Systems 15, pages 649—656. MIT Press, Cam-
bridge, MA, 2003.

[147] Ellen M. Voorhees. Using wordnet to disambiguate word senses for text retrieval.
In SICIR ’93: Proceedings of the 16th annual international ACM SICIR con-
ference on Research and development in information retrieval, pages 171—180,

New York, NY, USA, 1993. ACM Press.

[148] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl. Constrained k-
means clustering with background knowledge. In Proceedings of the Eighteenth

International Conference on Machine Learning, pages 577—584, San Francisco,
CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[149] Jidong Wang, Huajun Zeng, Zheng Chen, Hongjun Lu, Li Tao, and Wei-Ying
Ma. Recom : reinforcement clustering of multi-type interrelated data objects.
In SICIR ’03: Proceedings of the 26th annual international ACM SICIR con-

ference on Research and development in informaion retrieval, pages 274—281,

New York, NY, USA, 2003. ACM.

[150] Kilian Weinberger, John Blitzer, and Lawrence Saul. Distance metric learning
for large margin nearest neighbor classiﬁcation. In Y. Weiss, B. Schélkopf, and
J. Platt, editors, Advances in Neural Information Processing Systems 18, pages

1473—1480, Cambridge, MA, 2006. MIT Press.

[151] Eric W. Weisstein. ”graph.” from mathworld—a wolfram web resource.
http: //mathworld.wolfram.com/graph.html.

[152] C. Williams. Computation with inﬁnite neural networks. Neural Computation,
10(5), 1998.

[153] Chung—Hsien Wu, Jui-Feng Yeh, and Ming-Jun Chen. Domain-speciﬁc faq re-
trieval using independent aspects. ACM Transactions on Asian Language In-
formation Processing (TALIP), 4(1):1—17, 2005.

[154] Chung-Hsien Wu, Jui-Feng Yeh, and Yu-Sheng Lai. Semantic segment extrac-
tion and matching for internet faq retrieval. IEEE Transactions on Knowledge
and Data Engineering, 18(7):930—940, 2006.

202

 

[159]

[150]

[161]

[162]

[163]

[164]

Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Dis-
tance metric learning with application to clusterng with side-information. In
S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information
Processing Systems 15, pages 505—512, Cambridge, MA, 2003. MIT Press.

' Jinxi Xu and W. Bruce Croft. Query expansion using local and global doc-

ument analysis. In SICIR ’96: Proceedings of the 19th annual international
ACM SICIR conference on Research and development in information retrieval,
Zurich, Switzerland, 1996.

Jinxi Xu and Ralph Weischedel. T REC-9 cross-lingual retrieval at BBN. In
The Ninth Text REtrieval Conference (TREC-9), 2001.

Gui-Rong Xue, Dou Shen, Qiang Yang, Hua—Jun Zeng, Zheng Chen, Yong Yu,
WenSi Xi, and Wei-Ying Ma. Irc : An iterative reinforcement categorization
algorithm for interrelated web objects. In I CDM ’04: Proceedings of the Fourth
IEEE International Conference on Data Mining (ICDM’04), pages 273—280,
Washington, DC, USA, 2004. IEEE Computer Society.

Liu Yang, Rong Jin, Rahul Sukthankar, and Yi Liu. An efﬁcient algorithm
for local distance metric learning. In Proceedings of the Twenty-First National

Conference on Artiﬁcial Intelligence and the Eighteenth Innovative Applications
of Artiﬁcial Intelligence Conference, 2006.

Yiming Yang. An evaluation of statistical approaches to text categorization.
Information Retrieval, 1(1/2), 1999.

Yiming Yang and Xin Liu. A re-examination of text categorization methods. In
SICIR ’99: Proceedings of the 22nd annual international ACM SICIR confer-
ence on Research and development in information retrieval, pages 42—49, New

York, NY, USA, 1999. ACM Press.

Yiming Yang, Sean Slattery, and Rayid Ghani. A study of approaches to hy-
pertext categorization. J. Intell. Inf. Syst., 18(2—3):219—241, 2002.

Jieping Ye. Generalized low rank approximations of matrices. In ICML ’04:
Proceedings of the twenty-ﬁrst international conference on Machine learning,

page 112, New York, NY, USA, 2004. ACM Press.

Kai Yu, Shipeng Yu, and Volker Tresp. Multi-label informed latent semantic
indexing. In SICIR ’05: Proceedings of the 28th annual international ACM
SIGIR conference on Research and development in informaion retrieval, 2005.

203

[165]

[166]

[167]

[168]

[169]

[170]

[171]

[172]

[173]

[174]

Z. Zhang, J.T. Kwok, and D.Y. Yeung. Parametric distance metric learning with
label information. In Proc. of the Eighteenth International Joint Conference on
Artiﬁcial Intelligence {IJCAI’O3}, pages 1450—1452, 2003.

Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bern—
hard Schblkopf. Learning with local and global consistency. In Sebastian Thrun,
Lawrence Saul, and Bernhard Scholkopf, editors, Advances in Neural Informa—
tion Processing Systems 16. MIT Press, Cambridge, MA, 2004.

Dengyong Zhou, Jiayuan Huang, and Bernhard Scholkopf. Learning from la-
beled and unlabeled data on a directed graph. In ICML ’05: Proceedings of
the 22nd international conference on Machine learning, pages 1036—1043, New

York, NY, USA, 2005. ACM Press.

Dengyong Zhou, Bernhard Schélkopf, and Thomas Hofmann. Semi-supervised
learning on directed graphs. In Lawrence K. Saul, Yair Weiss, and Léon Bottou,
editors, Advances in Neural Information Processing Systems 17, pages 1633—
1640. MIT Press, Cambridge, MA, 2005.

Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bern-
hard Sehélkopf. Ranking on data manifolds. In Sebastian Thrun, Lawrence
Saul, and Bernhard Schblkopf, editors, Advances in Neural Information Pro-
cessing Systems 16. MIT Press, Cambridge, MA, 2004.

Xiang Sean Zhou and Thomas S. Huang. Unifying keywords and visual contents
in image retrieval. IEEE MultiMedia, 9(2):23—33, 2002.

Shenghuo Zhu, Xiang Ji, Wei Xu, and Yihong Gong. Multi-labelled classiﬁca-
tion using maximum entropy method. In SICIR ’05: Proceedings of the 28th
annual international ACM SICIR conference on Research and development in
information retrieval, pages 274—281, New York, NY, USA, 2005. ACM Press.

Xiaojin Zhu. Semi-supervised learning literature survey. Technical Re—
port 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
http: / / www.cs.wisc.edu / ~jerryzhu / pub / ssl_survey.pdf .

Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learn-
ing using gaussian ﬁelds and harmonic functions. In ICML ’03: Proceedings

of the twentieth international conference on Machine learning, pages 912—919,
2003.

Xiaojin Zhu and John Lafferty. Harmonic mixtures: combining mixture models
and graph-based methods for inductive and scalable semi-supervised learning.
In ICML ’05: Proceedings of the 22nd international conference on Machine
learning, pages 1052—1059, New York, NY, USA, 2005. ACM Press.

204

   

lllﬂlllllllllﬂjllllmﬂlﬂlfl