SEMI-SUPERVISED LEARNING WITH SIDE INFORMATION: GRAPH-BASED APPROACHES

By Yi Liu

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Computer Science and Engineering
2007

ABSTRACT

SEMI-SUPERVISED LEARNING WITH SIDE INFORMATION: GRAPH-BASED APPROACHES

By Yi Liu

In many real-world learning tasks, data examples are known not only by their input patterns, but also in other forms that are often referred to as "side information". Side information provides additional knowledge about the data, leaving hope for machine learning algorithms to gain more insight into the structure of the data and thus perform better. However, the usually incomplete, sparse, and noisy nature of side information also poses challenges. This dissertation presents research work in semi-supervised learning with side information.

Semi-supervised learning uses both labeled and unlabeled data in training. In this work, we follow the graph-based approaches, which are able to explore the underlying structure of data by constructing a graph over the data examples. Taking advantage of the graph representation for semi-supervised learning, we propose to construct a graph over all labeled and unlabeled data and further incorporate side information into the graph. To effectively utilize side information despite its incompleteness, sparseness, and noise, this work adopts a common theme in formalizing graph-based learning models, i.e., enforcing consistency over the graph. More specifically, consistency is maximized among data input patterns, supervised information (if any), side information, and predictions on the unlabeled data. Optimization approaches are taken to carry out the consistency enforcement, with objective functions that aggregate consistency measurements defined on all possible data pairs.

Solutions to three generic learning tasks are presented to illustrate the proposed method of utilizing side information in a semi-supervised learning setting. Specifically, we study multi-label learning with class correlation information, to meet the challenge of a large number of classes and a small training set. To improve classification with link information, we propose a boosting framework that is able to improve the classification accuracy of any given supervised classifier. Another boosting framework is designed to boost any given clustering algorithm, with the help of pairwise constraints over the data. In addition, two applications in the area of information retrieval will be discussed.
In the first application, we develop a maximum coherence framework to tackle the difficulty of query translation disambiguation in cross-language information retrieval, with a bilingual dictionary as side information. The proposed framework can also be explained as two-way graph partitioning. The second application is automatic extraction of question-answer pairs from Web FAQs, where the side information comes from human knowledge of the presentation regularity of Web FAQ pages. Correlated label propagation over a graph constructed for each FAQ page is shown to be an interpretation of the corresponding model.

All the semi-supervised learning models proposed for these tasks and applications demonstrate the effectiveness of the consistency enforcement theme in exploiting side information for semi-supervised learning. Analysis shows that the proposed models are robust against the incompleteness, sparseness, and noise of side information, and retain the power of semi-supervised learning in utilizing both labeled and unlabeled data for training.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction
  1.1 Graph-based Semi-supervised Learning
    1.1.1 Mathematical Definition of Graph
    1.1.2 Graph-based Semi-supervised Learning Models
  1.2 Side Information
    1.2.1 Motivation on Introducing Side Information
    1.2.2 Side Information for Semi-supervised Learning
  1.3 Graph-based Semi-supervised Learning with Side Information
  1.4 Overview on Thesis Work

2 Semi-supervised Multi-label Learning with Class Correlations
  2.1 Introduction
  2.2 Related Work
    2.2.1 Related Work in Multi-label Learning
    2.2.2 Representative Algorithms
  2.3 Semi-Supervised Multi-label Learning by Constrained NMF
    2.3.1 A Framework for Multi-label Learning
  2.4 Solving the Constrained NMF
  2.5 Experiments and Discussions
    2.5.1 Experiment Setup
    2.5.2 Experiment Results
  2.6 Conclusions

3 Semi-supervised Classification with Link Information
  3.1 Introduction
  3.2 Related Work
    3.2.1 Link-based Classification
    3.2.2 Related Semi-supervised Classification Models
    3.2.3 Representative Algorithms Review
  3.3 LinkBoost Framework
    3.3.1 Objective Function
    3.3.2 Boosting Algorithm
    3.3.3 LinkBoost Framework Summary
  3.4 Experiments and Analysis
    3.4.1 Experiment Setup
    3.4.2 Robustness against Sparse Link Information
    3.4.3 Boosting Power for Supervised Algorithms
  3.5 Conclusions

4 Semi-supervised Clustering with Pairwise Constraints
  4.1 Problem Definition
  4.2 Review on Previous Studies
    4.2.1 Approach Based on Constraints Satisfaction
    4.2.2 Approach Based on Distance Metric Learning
    4.2.3 Representative Algorithms
    4.2.4 Summary
  4.3 Boosting Clustering
    4.3.1 Main Idea
    4.3.2 Objective Function
    4.3.3 The BoostCluster Framework
  4.4 Experiments
    4.4.1 Experiment Setup
    4.4.2 Generality of the Boosting Framework
    4.4.3 Robustness of Exploiting Pairwise Constraints
    4.4.4 BCP vs. BCS
  4.5 Summary

5 Semi-supervised Learning for Query Translation Disambiguation in Dictionary-based Cross Language Information Retrieval
  5.1 Introduction
  5.2 Related Work
    5.2.1 Selection-based Approaches for Query Translation Disambiguation
    5.2.2 Spectral Clustering
  5.3 The Statistical Framework For Dictionary-based CLIR
    5.3.1 Notation
    5.3.2 Modeling the Uncertainty in Query Translation
    5.3.3 The Retrieval Model
    5.3.4 Learning the Translation Probabilities
    5.3.5 Solving the Optimization Problem
    5.3.6 Summary and Discussion
  5.4 Maximum Coherence Model
  5.5 Spectral Query Translation Model
    5.5.1 Query Translation Disambiguation as Graph Partitioning
    5.5.2 Maximum Coherence vs. Spectral Graph Translation
  5.6 Experiments and Discussions
    5.6.1 Experiment Setup
    5.6.2 Comparison to Selection-based Approaches
    5.6.3 The Necessity of Including Translation Uncertainty
    5.6.4 The Impact of Translation Independence Assumption on Query Disambiguation
    5.6.5 Performance: MAC vs. SQT
    5.6.6 Computational Efficiency
  5.7 Conclusions

6 Semi-supervised Learning for Extraction of Questions and Answers from Web FAQs
  6.1 Introduction
  6.2 Related Work
    6.2.1 QA Extraction from the Web
    6.2.2 Review on Spectral Graph Transducer
  6.3 FAQ Page Classification
  6.4 Observations on Web FAQs Data
  6.5 A Semi-supervised Learning Approach for QA Extraction
    6.5.1 Web Page Preprocessing
    6.5.2 Semi-supervised Learning for QA Extraction
  6.6 Experiments and Discussions
    6.6.1 Experiment Setup
    6.6.2 Experiments on Verification of Side Information
    6.6.3 Experiments on QA Extraction
  6.7 Conclusions

7 Conclusions

APPENDICES
A Related Proofs and Lists
  A.1 Proof of the eigenvector problem in Section 4.3.3
  A.2 Proof of the generalized eigenvector problem in Section 4.3.3
  A.3 Features for FAQ page classification

BIBLIOGRAPHY

LIST OF TABLES

2.1 The CNMF algorithm
3.1 Classes in Cora dataset
3.2 Classes in Citeseer dataset
3.3 Classification accuracy on Cora dataset (Part A)
3.4 Classification accuracy on Cora dataset (Part B)
3.5 Classification accuracy on Citeseer dataset (Part A)
3.6 Classification accuracy on Citeseer dataset (Part B)
3.7 Boosting classification accuracy on Cora dataset
3.8 Boosting classification accuracy on Citeseer dataset
4.1 Notations
4.2 Datasets used in the experiments
4.3 Performance comparison of BCP algorithm and BCS algorithm
5.1 Average precision for short and long queries on TREC datasets
6.1 Corpus statistics of QA pair data
6.2 Performance comparison of QA extraction

LIST OF FIGURES

1.1 Example A of complicated structures of data
1.2 Example B of complicated structures of data
1.3 Dataset 2 with side information
2.1 Precision comparison: CNMF vs. baselines
2.2 Recall comparison: CNMF vs. baselines
2.3 F1 comparison: CNMF vs. baselines
3.1 LinkBoost framework
4.1 Illustrative example on iterative projections
4.2 The flowchart of the BoostCluster framework
4.3 Boosting algorithm for BCP and BCS
4.4 Example BCP objective function vs. number of iterations
4.5 Legends for all algorithms in comparative study
4.6 Clustering performance evaluated by NMI (Part A)
4.7 Clustering performance evaluated by NMI (Part B)
4.8 Clustering performance evaluated by PWF1 (Part A)
4.9 Clustering performance evaluated by PWF1 (Part B)
4.10 Clustering performance (NMI) with noisy constraints
4.11 Clustering performance with noisy constraints
5.1 Steps of applying the proposed framework to CLIR
5.2 Graph partitioning perspective for query translation disambiguation
5.3 An example query in experiments
5.4 Performance comparison on homogeneous datasets
5.5 Performance comparison on heterogeneous datasets
5.6 Example translation probabilities estimated by "MAC" model
5.7 Example query translation by "BSTO" and "MAC" models
5.8 Comparison of "MAC" and "SQT" models on an example query
6.1 Snapshot of example web FAQ pages (1 of 4)
6.2 Snapshot of example web FAQ pages (2 of 4)
6.3 Snapshot of example web FAQ pages (3 of 4)
6.4 Snapshot of example web FAQ pages (4 of 4)
6.5 Semi-supervised learning algorithm for QA extraction
6.6 Relation between class membership and format similarity
6.7 Correlation between prediction accuracy and prediction probability

CHAPTER 1

Introduction

Learning with an inadequate amount of, or no, supervised information poses a challenge for machine learning research. In practical domains, though supervised information may be hard to collect, it is often the case that some partial supervised information is available. This naturally leads to a question: can we learn better if more knowledge about the data is gained? Answering this question leads to the research topic of semi-supervised learning, which has gained significant attention in recent years.
Semi-supervised learning, in a strict sense, refers to the family of classification techniques in machine learning that make use of both labeled and unlabeled data. In a typical semi-supervised learning setting, the amount of labeled data is small while the amount of unlabeled data is large. As its name suggests, semi-supervised learning lies on the continuum between the two ends of the spectrum of machine learning: strictly supervised learning (with completely labeled data), and strictly unsupervised learning (without any labeled data). Usually, some partial supervised information is available in semi-supervised learning, which typically includes the observation of all the unlabeled data that are to be classified into predefined categories, possibly along with information in other forms that does not directly provide class labels.

Data clustering is a typical unsupervised learning problem, where no supervised information is available. The word "semi-supervised" has also been introduced to data clustering, i.e., "semi-supervised clustering", when partial supervised information is available for clustering purposes. Regardless of how the terminology evolves, the essence of semi-supervised learning is the utilization of any knowledge gained in addition to the supervised information. The particular interest of this thesis is the kind of additional knowledge beyond the input patterns of the data. By input patterns, we refer to the observation of "feature representations" (also known as "attribute values" in the literature) on all the data. This leads to the research issue at the center of this thesis work, i.e., how can we improve semi-supervised learning if we know anything more than the input patterns of the labeled data (if any) and the unlabeled data?

The knowledge about the data that is not the input patterns will be referred to as "side information" in this thesis. Also, among all types of semi-supervised learning work, we will focus on graph-based approaches. In this chapter, we will first review graph-based semi-supervised learning, then motivate the use of side information in semi-supervised learning, and briefly overview the entire thesis work. Detailed discussions of each piece of work are left to the remaining chapters.

1.1 Graph-based Semi-supervised Learning

1.1.1 Mathematical Definition of Graph

From a mathematical point of view, a graph is a collection of points and lines connecting some (possibly empty) subset of them [151]. The points of a graph are most commonly known as graph vertices, but may also be called "nodes" or simply "points". Similarly, the lines connecting the vertices of a graph are most commonly known as graph edges, but may also be called "links", "arcs" or simply "lines". In an undirected graph, edges are not directional, i.e., a line from point A to point B is not distinguished from a line from point B to point A. However, the two directions are distinct in a directed graph (or digraph for short). On many occasions, a weight (usually positive) will be associated with each edge, indicating the strength of the relationship within the corresponding vertex pair.

Formally, we can denote a graph by $G = (V, E)$, where $V$ is the vertex set and $E$ is the edge set. For a finite graph $G$ with $n$ vertices, the adjacency matrix is defined as a binary matrix $A = [a_{i,j}]_{n \times n}$, with $a_{i,j} = 1$ denoting that there is an edge between the $i$-th and the $j$-th vertices and $a_{i,j} = 0$ otherwise. Introducing a weight for each edge in a graph results in a weighted graph.
In this case the adjacency matrix (or weight matrix) becomes $W = [w_{i,j}]_{n \times n}$, with $w_{i,j} > 0$ indicating the edge weight between the $i$-th and the $j$-th vertices and $w_{i,j} = 0$ indicating no edge there. For an undirected graph, both the adjacency matrix and the weight matrix are symmetric.

An important concept at the center of Spectral Graph Theory is the graph Laplacian, which has been widely used in graph-based semi-supervised learning models. With respect to a symmetric weight matrix $W$, the graph Laplacian is defined as follows

$$L = D - W \qquad (1.1)$$

where $D$ is a diagonal matrix $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ with $d_i = \sum_{j \in V} w_{i,j}$. A normalized version of the matrices above is defined on some occasions

$$\tilde{W} = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}, \qquad \tilde{L} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}} = I - \tilde{W}$$

The graph Laplacian $L$ (or $\tilde{L}$) has the following properties:

1. It is positive semi-definite, i.e., all the eigenvalues are non-negative;
2. Its minimum eigenvalue is always 0. For the unnormalized graph Laplacian $L$, the corresponding principal eigenvector is $e = (1/\sqrt{n}, 1/\sqrt{n}, \ldots, 1/\sqrt{n})^T$; for the normalized graph Laplacian $\tilde{L}$, the corresponding principal eigenvector is $D^{\frac{1}{2}} e$.

The set of eigenvalues of the graph Laplacian $L$ can be denoted by $0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1}$, which is also called the spectrum of $L$ (or of the graph itself). Spectral Graph Theory tells us that the structure of a graph and its principal properties can be deduced from its spectrum. In particular, it has been shown that the eigenvalues are closely related to almost all major invariants of the graph, and link extremal properties together ([38, 37]). As will be shown in the next subsection, the graph Laplacian is clearly present in a number of graph-based learning models.
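To make these constructions concrete, here is a minimal NumPy sketch (ours, not from the thesis) that builds both Laplacians from a symmetric weight matrix and numerically checks the two properties listed above; all variable names are illustrative.

```python
import numpy as np

def graph_laplacians(W):
    """Return (L, L_tilde) for a symmetric non-negative weight matrix W."""
    d = W.sum(axis=1)                        # degrees d_i = sum_j w_ij
    L = np.diag(d) - W                       # unnormalized Laplacian, Eq. (1.1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # assumes no isolated vertices (d_i > 0)
    L_tilde = D_inv_sqrt @ L @ D_inv_sqrt    # normalized Laplacian I - W_tilde
    return L, L_tilde

# A small weighted graph on 4 vertices
W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 0.],
              [1., 1., 0., 3.],
              [0., 0., 3., 0.]])
L, L_tilde = graph_laplacians(W)
print(np.linalg.eigvalsh(L))       # eigenvalues are non-negative, smallest ~ 0
e = np.ones(4) / np.sqrt(4)        # principal eigenvector (1/sqrt(n), ..., 1/sqrt(n))
print(np.allclose(L @ e, 0.0))     # True: each row of L sums to zero, so L e = 0
```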
1.1.2 Graph-based Semi-supervised Learning Models

Semi-supervised learning deals with the use of both labeled and unlabeled data for training. The central idea behind nearly all semi-supervised learning algorithms is an assumption made on data consistency: data examples close to each other or lying in the same structure are more likely to have the same label. Various semi-supervised learning approaches differ in the way they model the structure of the data and attempt to propagate the label information from labeled examples to unlabeled ones. One approach is to construct a graph over all the labeled and unlabeled data: each vertex represents a data example; whenever a pair of data examples are close enough by some similarity measurement, an edge is constructed between them with a weight proportional to their similarity value. Taking this graph point of view, a number of graph-based semi-supervised learning models can be interpreted in graph language, such as graph partitioning [25, 26, 87], random walks on the graph [134, 119, 70], ranking of nodes [119, 70, 92, 144], graph approximation [104, 105], etc. In the rest of this subsection, we will recapitulate a few typical graph-based semi-supervised learning models.

GRAPH MINCUTS

The graph mincuts model extends the algorithm for finding the minimal cut in a graph to a transductive learning setup [25, 26]. The basic idea is to search for a partition of the graph which results in a minimum sum of weights of the edges being cut while agreeing with the labeled data. To enforce the consistency with the labeled data, a special weighting scheme is adopted to build a graph over all the labeled and unlabeled examples: for any pair of data examples belonging to different classes, the edge weight indicates their similarity; for any pair of data examples belonging to the same class, an infinite weight is assigned. In the binary case, the search for the minimum cut amounts to the following optimization problem

$$\min_{y \in \{0,1\}^n} \; \sum_{i,j=1}^{n} w_{i,j}\,(y_i - y_j)^2 \;+\; \mu \sum_{i=1}^{n_l} (y_i - \hat{y}_i)^2$$

where $\hat{y}_i$ denotes the given label of the $i$-th labeled example and $\mu > 0$ is a constant specifying the trade-off.

A slightly different version of the algorithm above is described in [168], where two sets of smoothness functionals are defined for the data examples: one for their "hub" scores (which accounts for outgoing links) and one for their "authority" scores (which accounts for incoming links). To separate the two smoothness enforcements, directed graphs are transformed into bipartite graphs.

SPECTRAL CLUSTERING

Spectral clustering approaches view the problem of data clustering as a problem of graph partitioning. Taking 2-way graph partitioning as an example, to form two disjoint data sets $A$ and $B$ from a graph $G = (V, E)$, the edges connecting these two parts should be removed. The degree of dissimilarity between the partitioned parts is captured by the notion of cut, which is defined as $\mathrm{Cut}(A, B) = \sum_{v_i \in A,\, v_j \in B} a_{i,j}$. Generally speaking, a good partitioning should lead to a small cut value. Addressing different balancing concerns, there are several variants of the cut definition, which lead to optimal partitionings in different senses. To begin with, we define $S(A, B) = \sum_{i \in A} \sum_{j \in B} w_{i,j}$ and $d_A = \sum_{i \in A} d_i$.

Ratio Cut addresses the balance concern on the sizes of the partitioned graph ([68]), which leads to minimization of the following objective function

$$J_{RCut} = \frac{S(A, B)}{|A|} + \frac{S(A, B)}{|B|}$$

Normalized Cut addresses the balance concern on the weights of the partitioned graph ([134]), which leads to minimization of

$$J_{NCut} = \frac{S(A, B)}{d_A} + \frac{S(A, B)}{d_B}$$

Min-Max Cut addresses the balance concern between the intra-cluster weights and inter-cluster weights in a partitioning ([49]), which leads to minimization of

$$J_{MCut} = \frac{S(A, B)}{S(A, A)} + \frac{S(A, B)}{S(B, B)}$$

By relaxing cluster memberships to real values, the above minimization problems can all be formulated as eigenvector problems related to the graph Laplacian

$$J_{RCut} = \frac{q^T L q}{q^T q}, \qquad J_{NCut} = \frac{q^T L q}{q^T D q}, \qquad J_{MCut} = \frac{q^T L q}{q^T W q}$$

where $q$ is related to the relaxed cluster membership. All three of the above problems lead to finding the second eigenvector of the graph Laplacian $L$ or $\tilde{L}$. 2-way spectral clustering can be extended to $k$-way spectral clustering ([66, 115]), whose solution is related to the top $k$ eigenvectors of the graph Laplacian.
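As an illustration of these relaxed formulations, the sketch below (ours; the thresholding details vary across the cited algorithms) performs 2-way partitioning by thresholding the second eigenvector of the normalized graph Laplacian.

```python
import numpy as np

def two_way_spectral_partition(W):
    """Split vertices into two groups using the second Laplacian eigenvector."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_tilde = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L_tilde)   # eigenvalues returned in ascending order
    q = D_inv_sqrt @ vecs[:, 1]         # relaxed membership, rescaled by D^{-1/2}
    return q >= 0                       # threshold at zero into clusters A and B

# Two triangles {0,1,2} and {3,4,5} joined by a single weak edge (2,3)
W = np.zeros((6, 6))
for i, j, w in [(0,1,1), (0,2,1), (1,2,1), (3,4,1), (3,5,1), (4,5,1), (2,3,0.1)]:
    W[i, j] = W[j, i] = w
print(two_way_spectral_partition(W))    # e.g. [False False False  True  True  True]
```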
1.2 Side Information

1.2.1 Motivation on Introducing Side Information

In the previous section, we briefly reviewed graph-based semi-supervised learning. All the models mentioned there only assume the knowledge of unlabeled data, or more specifically, the input patterns of unlabeled data. As we can see, by constructing a graph from all the data, those models do utilize unlabeled data in training. However, on many occasions it is hard to achieve satisfying performance, even with the knowledge of the input patterns of both labeled and unlabeled data. The difficulties come from various sources. To name a few here:

- Very small number of training examples. As a result, not enough supervised information can be gained.
- Large number of classes with skewed size distributions. For example, it is easy to come across hundreds of categories in text classification problems. Very often, only a few major categories are of large sizes, while the rest are minor categories with relatively smaller sizes. Classifiers tend to make mistakes on those minor categories.
- Complicated underlying structure of data. For example, a clustering algorithm is in general hard-pressed to perform well when the data examples do not form compact and well-separated clusters. Figure 1.1 and Figure 1.2 present two such cases that will fail many clustering algorithms. It is easy to imagine, yet hard to visualize, the existence of much more complicated structures of data in real-world applications.
- Noisy input patterns. As a result, the correlations between data examples computed from their input patterns are unreliable, so propagating information among data examples could be error-prone.
- Incomplete data. In this situation, it is hard to reveal the entire structure of the data.

[Figure 1.1. Example dataset that presents data in a complicated structure. The upper panel shows the data distribution without class information; the lower panel shows class labels for the data examples. Without any supervised information, clustering on this dataset is difficult, since the two spirals are hard to separate.]

[Figure 1.2. Example dataset that presents data in a complicated structure. The upper panel shows the data distribution without class information; the lower panel shows class labels for the data examples. Without any supervised information, 2-way clustering on this dataset is difficult, since it is hard to tell how to combine the four data clouds into two clusters.]

[Figure 1.3. Dataset 2 with side information.]

However, with a little more information known beyond the data input patterns, the difficulty involved in semi-supervised learning can be eased or resolved. To illustrate this finding, let us consider the 2-way clustering problem shown in Figure 1.2. It is easy to find that there are roughly four clouds of data points in the dataset, but the difficult part is in what way these four clouds should be combined into two clusters. Suppose we know a few data pairs, each of them containing two data points that should be put into the same cluster, as indicated by the solid lines in Figure 1.3. Then it is clear that the two data clusters should be formed by putting the two clouds in the diagonal positions into one cluster.

In the above example, the extra information consists only of data relationships that take binary values, from merely four data pairs involving eight data points. Considering the input patterns of all the data points and the relationships between all possible data pairs, the extra information we additionally have is very simple and small, so we can call it a type of "side information". However, the data clustering task benefits greatly from such side information.

1.2.2 Side Information for Semi-supervised Learning

We may find different forms of side information in different task scenarios. Numerous previous studies have resorted to side information in order to improve the performance of learning algorithms. For example:

- The title, text annotations, or even surrounding texts are often referred to as side information for image representations. As is well known, understanding image contents is very hard due to the gap between low-level image features and high-level image semantics. The textual side information provides additional features that help to describe the image contents. Previous work along this line includes using textual information for image segmentation [10], image retrieval [11, 35, 170], and image clustering [31].
- Log data is widely gathered in web applications (such as search engines, information filtering, online shopping, etc.) and serves as side information for
Consider- ing the input patterns Of all the data points and the relationships between all possible data pairs, the extra information we additionally have is very Simple and small, SO we can call it as a type of “side information”. However, the data clustering task is greatly benefited from such side information. 1.2.2 Side Information for Semi-supervised Learning We may find different forms of Side information in different task scenarios. Numerous previous studies have resorted to Side information, in order to improve the perfor- mance of learning algorithms. For example 0 The title, text annotations, or even surrounding texts are often referred as Side information to the image representations. AS is well known, understanding image contents is very hard due to the gap between the low-level image features and the high-level image semantics. The textual side information will provide additional features that help to describe the image contents. Precious work along this line includes using textual information for image segmentation [10], image retrieval [11, 35, 170], and image clustering [31]. 0 Log data is widely gathered in web applications (such as search engine, infor- mation filtering, online shopping, and etc.) and serves as side information for 15 various purposes. User query log can be used to form a context to provide additional information on the user’s information need, and then improve the query model ([139, 131]) or the user model ([132]). User click-through data 1is another important source of side information for web applications. For exam— ple, from click-through data we can tell how Often a link from page A to page B is actually used, which can be used to estimate the similarity between the two pages [112], in addition to their page contents; gathering a specific user’s clicked links can help to model the user’s interest [75, 140], as additional information to the user’s profile. 0 User feedback can be regarded as side information for information retrieval. For example, user ratings on a returned document help to decide its relevance to the query. In movie recommendation websites, user ratings provide additional information either for describing movies or describing users. 0 External linguistic resource can serve as side information for many information retrieval applications. For example, bilingual dictionaries are widely used to help query translation for cross-language information retrieval [2, 95, 46, 57, 76, 93, 109, 81]. WordNet [54] can be used for query expansion, word disambigua— tion, etc., and finally improve retrieval performance [106, 147]. o For classification problems, knowledge gained about the categories can be viewed as Side information. For example, human knowledge on the popular— ity Of each category is always used to gain some control on the category sizes in the classification results. In image databases such as ImageCLEF [77], there is some text description associated with each image category, which can be used to gain some knowledge on the relationships between categories. 1Click-through data refers to a type Of log which keeps record of the time, session of hyperlinks being clicked by web users. 16 o In real—world data clustering problems, it is Often possible to gather informa- tion on whether two examples should be put in the same cluster or not (Often called as “pairwise constraints”). For example, in speaker segmentation and recognition, a conversation between several speakers needs to be segmented and clustered according to the speaker identity. 
It may be possible to automatically identify small segments of speech which are likely to be from a single unknown speaker [9]. Another example would be finding lanes in GPS maps. Usually lanes are identified by clustering a few road segments. By tracking the signals of car GPS units, continuous car-traveled segments would be considered as belonging to the same lane, and well-separated car-traveled segments would be considered as belonging to different lanes [148]. These pairwise constraints provide side information for clustering data examples based on their input patterns.

- In some cases, a few identified representative data examples are available as "seeds" for clustering. We can also view these cluster seeds as side information [13, 12].

One characteristic that can be summarized from the various forms of side information is that it is capable of providing some additional knowledge about the data, but it cannot replace the original data input patterns. In this thesis, we will refer to any additional information on the unlabeled data beyond the input patterns as "side information".

Usually side information is small in amount and incomplete. However, with proper use, side information can significantly improve the performance of unsupervised, supervised, or semi-supervised learning algorithms. Side information serves as partial supervised information. Also, as mentioned in the beginning of this chapter, we will assume the data input patterns of all the data (both labeled and unlabeled) are known; and in all the following discussions, we will use unlabeled data in training. For these reasons, regardless of whether the original learning task is an unsupervised one or a supervised one, as long as side information is used, we will refer to the resulting learning method as semi-supervised learning with side information.

Side information can be explored in a number of different ways. For instance, in probabilistic models, side information is typically represented as strong priors; in non-probabilistic models, side information is typically incorporated into distance metrics or constraints. The work presented in this thesis will focus on graph-based non-probabilistic models.

1.3 Graph-based Semi-supervised Learning with Side Information

Side information, depending on its form, has various interpretations in graph-based semi-supervised learning models. These interpretations can be summarized into, but are not limited to, the following major categories:

1. Additional dimensions to the node representation: more features are found for the input patterns of part (or all) of the data examples. A typical example in this category is textual information for images, where we can append textual feature vectors to image feature vectors to form a new image representation. Other examples include query logs for query modeling, click-through data for user modeling, user ratings for movie recommendation, etc.

2. Constraints on the graph topology: an edge should or should not be constructed between two nodes. The pairwise constraints for clustering belong to this category, which is self-explanatory. The link graph constructed for web pages by hyperlinks also falls into this category.

3. Refinement of graph edge weights: a (new) weight for an edge is proposed. For instance, user feedback on a returned document defines a new relevance
Click-through data for hyperlinks on a web page also suggests a new weight for the corresponding pair of page nodes (for example, proportional to the number of user clicks on the hyperlinks). The variance in the roles played by Side information from a graph point of view leads to a diversity Of ways to accommodate side information within semi-supervised learning settings. The following is an incomplete list Of treatments of Side information 1. Data Representation Augmentation: Side information is used to form bet- ter representation, in the hope that the underlying structure of data can be bet- ter revealed when the augmented data representation is used for computation. For example, in [63, 32, 120, 130, 33, 118], Side information is used to generate additional feature dimensions for classification; in [48, 40], side information is used to learn a new representation in the latent Space. 2. Hard Constraints: Side information is formulated into constraints that can- not be violated. For example, constrained clustering [148, 20], constrained EM algorithm [73] and multiple-instance learning [6] all treat Side information as non-violable rules. 3. Soft Penalty: Any inconsistency between the learning outcome and the side information will incur a penalty. For those models with a clear Optimization goal, such a treatment often leads to side information being encoded as part of the Optimization Objective function (such as a regularizer term or its equiv- alence). Typical examples include using Side information for distance metric learning [155, 98, 74], clustering [96, 15, 108, 16], etc.. 4. Other Heuristics to Carry Out Learning Algorithms: Side information is used to provide building blocks for other learning algorithms. For example, 19 in [168, 167, 119, 70, 158, 149], side information is used to induce the graph structure where learning algorithms over the graph can be carried out; in [27, 39] side information is used for co—training; in [72] side information is used for boosting. 1.4 Overview on Thesis Work All the thesis work presented in the rest Of the chapters revolves around the topic Of improving semi-supervised learning with side information. Due to the diversity of learning tasks and forms of side information, we will demonstrate our efforts through several typical learning settings and applications. Chapter 2 discusses multi—label classification, with class correlations as side infor— mation. TO meet the challenge Of a large number Of classes and a small Size Of training data in multi-label classification, we propose tO utilize class correlations to link the computation of data example similarities with their multiple class memberships. We will also Show that the proposed multi-label learning algorithm leads to a constrained non-negative matrix factorization formalization. Chapter 3 focuses on binary classification, with side information in the form Of links. Link-based classification gained significant attention in text domains, due to the large amount of needs in classifying web pages or scientific publications, where inter-document connections are available as hyperlinks or references. However, the sparse and noisy nature of links causes trouble in utilizing them for classification. A general boosting framework is proposed in this chapter that tries to make the best use of link information against its sparseness and noise, and is able to improve any binary classifier. 
Chapter 4 presents another boosting framework for semi-supervised clustering, with pairwise constraints as side information. In this work, side information is encoded into the data representations by iteratively selecting a good direction along which to project the original data into a low-dimensional space. Again, the proposed framework is designed as a meta-algorithm that is applicable to any clustering algorithm and improves its performance with pairwise constraints.

In addition to the above generic tasks of semi-supervised learning, the thesis work also includes two application examples. In particular, we will show two information retrieval tasks where semi-supervised learning with side information achieves improved performance over traditional methods. In Chapter 5, we study the problem of cross-language information retrieval, with bilingual dictionaries as side information. Two models based on a maximum coherence principle are proposed, which can be well explained as two-way partitioning of the graph induced by the side information, i.e., the dictionary. Another application, extracting question-answer pairs from web FAQs, is described in Chapter 6, where side information comes from human knowledge of the presentation regularities of web FAQs. The model proposed in this chapter leads to a correlated label propagation scheme over a graph built upon the text segments of web FAQ pages. As will be seen, properly using the side information enables the question-answer extraction task to be performed without any human supervision, despite the wide variety of possible contents and styles in web pages. Finally, Chapter 7 concludes the thesis work.

Semi-supervised learning enforces the predicted labels (usually the learning goal) to be consistent with the structure of the data, while agreeing with the supervised information (if available). This is already stated as the assumption made by all graph-based learning algorithms in Section 1.1.2. The introduction of side information should not violate this data consistency assumption. Therefore, including side information in the consistency enforcement becomes necessary. Such an idea is common across all the chapters that follow, despite the diversity of tasks, models and their graph interpretations that will be discussed in detail in each individual chapter.

In summary, the semi-supervised learning models and frameworks proposed in this thesis work all conform to the same theme: consistency is maximally enforced among the following factors: data input patterns, supervised information (if any), side information, and predictions on unlabeled data; invariably, optimization will be the tool to achieve the goal of consistency enforcement. Before going into detailed discussions in each chapter, understanding this theme will help to reveal the connections between all parts of the thesis work and to view them as one whole piece.

CHAPTER 2

Semi-supervised Multi-label Learning with Class Correlations

2.1 Introduction

Multi-label learning refers to classification problems where each example can be assigned to multiple different classes. It has found applications in many real-world problems.
For example, text categorization is typically multi-labeled since each document can be assigned to several predefined topics; in bioinformatics, most genes are associated with more than one functional class (e.g., metabolism, transcription and protein synthesis); automatic image annotation can also be treated as a multi-label learning problem if we view each annotation word as a distinct class.

A straightforward approach toward multi-label learning is to decompose it into a set of binary classification problems, one for each class. The drawback of this approach is that it does not explore the correlation among different classes, which can often be an important hint for deciding class memberships. Many algorithms have been developed to incorporate the class correlation into multi-label learning, including [110, 52, 85, 146, 28, 59, 60, 90, 171, 164, 145, 141, 42]. But most of these studies are limited to a relatively small number of classes and assume that the amount of training data is sufficient for exploiting class correlations and training reliable classifiers. In contrast, real-world applications of multi-label learning often feature a large number of classes and a relatively small size of training data. As a result, the amount of training data related to each class is often sparse and insufficient for computing class correlations and learning a reliable classifier.

However, we find that in practice, more reliable class correlations are often available as side information. For example, in a text classification problem, if each category has a description, the correlation between two classes can be derived by computing the similarity between their descriptions; human knowledge about the category topics can also be used to decide how close two categories are in terms of their topics. So the problem becomes: given class correlation information, how can we improve multi-label learning?

To address this problem, we present a novel framework for multi-label learning that explicitly explores the correlation among different classes. Compared to the existing approaches for multi-label learning that also explore the class correlation, the proposed framework provides a natural means for exploring the unlabeled data and the class correlation simultaneously, and is thus effective for learning scenarios with a large number of classes and a small size of training data.

The key assumption behind this work is that two examples tend to have a large overlap in their assigned class memberships if they share high similarity in their input patterns. To be more specific, consider two examples $\mathbf{x}_1$ and $\mathbf{x}_2$ that are labeled by two sets of class labels $\mathbf{y}_1$ and $\mathbf{y}_2$, respectively. We can evaluate the similarity between these two examples in two different ways. The first one is based on the correlation between the input patterns of the two examples. The second one is based on the overlap between the class labels of the two examples. We denote the similarity based on the input patterns by $K_x(\mathbf{x}_1, \mathbf{x}_2)$, and the similarity based on the class labels by $K_y(\mathbf{y}_1, \mathbf{y}_2)$. If the assigned class labels $\mathbf{y}_1$ and $\mathbf{y}_2$ are appropriate for examples $\mathbf{x}_1$ and $\mathbf{x}_2$, we would expect the two similarity measurements to be similar, namely $K_x(\mathbf{x}_1, \mathbf{x}_2) \approx K_y(\mathbf{y}_1, \mathbf{y}_2)$. Based on this assumption, we can determine the best assignment of class labels to the unlabeled data by minimizing the difference between the two sets of similarities.
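A small sketch (our illustration, not code from the thesis) of this consistency assumption: with cosine similarity playing the role of both $K_x$ and $K_y$ so that the two measurements live on the same scale, the squared gap below is exactly the kind of per-pair quantity the framework will minimize.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Two examples: input patterns x1, x2 and binary multi-label vectors y1, y2
x1 = np.array([0.9, 0.1, 0.4]); y1 = np.array([1., 1., 0., 0.])
x2 = np.array([0.8, 0.2, 0.5]); y2 = np.array([1., 1., 0., 1.])

K_x = cosine(x1, x2)          # similarity from the input patterns
K_y = cosine(y1, y2)          # similarity from the overlap of class labels
gap = (K_x - K_y) ** 2        # per-pair inconsistency penalty
print(K_x, K_y, gap)          # similar inputs with overlapping labels -> small gap
```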
Clearly, this approach is able to effectively explore the unlabeled data because the assignment Of class labels to each unlabeled example is dependent on the assignment. of class labels of other unlabeled examples. This approach is also able to exploit the class correlation effectively through the kernel similarity function Ky(y1, y2). The rest Of the chapter is structured as follows: first, we briefly review the related work on multi-label learning and semi-supervised learning; second, we introduce the proposed framework for multi-label learning, and a formalization based on the con- strained non-negative matrix factorization; third, we present an efficient algorithm to solve the related optimization problem that is based on the iterative bound opti- mization algorithm; fourth, we present the empirical study with a text categorization problem; finally, we conclude this study and raise some future work. 2.2 Related Work We will first review the related work on multi-label learning, followed by a discussion of related semi-supervised learning problem. 2.2.1 Related Work in Multi-Iabel Learning The simplest approach toward multi-label learning is to divide it into a number of binary classification problems [160, 86]. There are a number of disadvantages with this approach. One is that it will not scale to a large number of classes Since a different binary classifier has to be built for each class. Another disadvantage is that it treats each class independently, and therefore is unable to explore the correlation among different classes. The third disadvantage is that this approach Often will suffer from the unbalanced data problem when the minority classes are given only a few training 25 examples. Another group of approaches toward multi-label learning is label ranking [42, 52, 127]. Instead of learning binary classifiers from labeled examples, these approaches learn a ranking function from the labeled examples that order class labels for a given test example according to their relevance to the example. Compared to the binary classification approaches, the label ranking approaches are advantageous in dealing with large numbers of classes because only a Single ranking function is learned. How- ever, Similar to the binary classification approaches, the label ranking approaches are usually unable to exploit the class correlation information. In the past, a number Of studies have been devoted to exploring the class corre- lation within the context Of multi—label learning. A generative model for multi-label learning was proposed in [146] to explicitly incorporate the pairwise correlation be— tween any two class labels. A maximum entropy model is proposed in [171] that capture the pairwise class correlation by constraints. Approaches based on latent vari- ables were proposed in [110, 164] to capture the correlation among different classes. The study in [123] exploited the class correlation information given the hierarchi- cal structure of classes. Unlike the previous work on multi-label learning that only considers the correlation among different classes, in this chapter, we present a novel framework that exploits the unlabeled data as well as the class correlation. This prop- erty will make the proposed approach more effective than the existing approaches for multi-label learning, particularly when the number Of classes is large and the size of training data is small. This work is also particularly related to the label propagation approaches for semi- supervised learning. 
This is because by enforcing examples with similar input patterns to share Similar sets of class labels, we essentially propagate the class labels through the Similarity graph Of examples, which is the key idea of the label propagation approaches. A number of machine learning methods have been developed recently 26 for label propagation, including the Gaussian processes [152], the harmtmic functions [173], and Green functions [168]. Unlike most. of the previous work on semi—supervised learning that is designed primarily for multi-class learning, this work is specifically targeted on the semi-supervised multi—label learning. It effectively explores the class correlation information when utilizing the unlabeled data. More discussion Of semi— supervised learning can be found in [129, 172]. 2.2.2 Representative Algorithms In the following, we will recapitulate a few representative algorithms for semi- supervised multi-label learning, as well as the non-negative matrix factorization al- gorithm that is closely related to the model we are going to propose later. MULTI-CLAss MULTI-LABEL PERCEPTRON ALGORITHM Multi-class Multi—label Perceptron (MMP) algorithm extends the Perceptron algo— rithm from Single binary output to a ranked list Of n—ary output. Specifically, for any test example possibly belonging to one or more Of l categories, the algorithm output will be a ranked list of the l categories, indicating the preference of assigning the test example to those categories. AS the Perceptron algorithm, h‘lMP n’raintains a set of l prototypes W1, - -- ,wl. For any data example x -, wiij yields a score that will decide the ranking Of the i-th category for this data example 233-. The procedures Of learning those prototypes can be summarized as follows Initialize: Set “’1 = - -- = W) = 0 Loop: 0 Get a new training data example xj 6 Rd, and its true category informa— tion 9,, which is a set of category IDs 27 0 Rank the categories according to wlij, - - - ,wiTxl o If the category ranking is not consistent. with the truth 3?]- . - ,. __ A “f '1— 1. For Vr E yj, set M —— [[5 fit yjlws xj Z w,. xj}] . ~ , _ - T . T . 2. For VI Q yj, sct nr — [{s G yjlws x] 3 W7. x]}] 3. Compute loss 77 = 27.717. 4. Update for r E 51,-: wr +— wr + Iglxj 5. Update for r ¢ 51,-: WT +— wr — 7+]ij Output: W1, - -- ,wl MMP algorithm is computationally inexpensive and it is suitable for online learn- ing. MULTI-LABEL INFORMED LATENT SEMANTIC INDEXING ALGORITHM Multi-label Informed Latent Semantic Indexing (MLSI) algorithm extends the Latent Semantic Indexing (LSI) to supervised cases [164]. In LSI, a linear mapping from the input space to a low-dimensional latent space is found, so that the structure of the data can be preserved as much as possible. This leads to an Optimization problem which minimizes the reconstruction error from the latent space to the original Space. Given supervised information in the form Of multiple labels of training examples, MLSI proposes to preserve the structures in the input patterns Of the data, as well as their label information, using the same set of latent variables. Formally, let X E Rm 0, the solution tO the lower rank approximation problem would be B = deiag(sl,sg, . . .,sk)V;€r (2.2) Highly related tO matrix approximation problem, non-negative matrix factoriza- tion tries to approximate a matrix A with two non-negative matrix factors U and V ([104, 105]) A 8 UV To measure the approximation quality, two cost functions are used. 
MULTI-LABEL INFORMED LATENT SEMANTIC INDEXING ALGORITHM

The Multi-label Informed Latent Semantic Indexing (MLSI) algorithm extends Latent Semantic Indexing (LSI) to supervised cases [164]. In LSI, a linear mapping from the input space to a low-dimensional latent space is found, so that the structure of the data can be preserved as much as possible. This leads to an optimization problem which minimizes the reconstruction error from the latent space back to the original space. Given supervised information in the form of multiple labels of training examples, MLSI proposes to preserve the structures in the input patterns of the data, as well as their label information, using the same set of latent variables. Formally, let $X \in \mathbb{R}^{n \times d}$ denote the matrix of input patterns. Given a matrix $A$ with singular value decomposition $A = U\,\mathrm{diag}(s_1, s_2, \ldots)\,V^T$, where the singular values satisfy $s_1 \ge s_2 \ge \cdots \ge 0$, the solution to the rank-$k$ lower-rank approximation problem would be

$$B = U_k\, \mathrm{diag}(s_1, s_2, \ldots, s_k)\, V_k^T \qquad (2.2)$$

where $U_k$ and $V_k$ consist of the first $k$ columns of $U$ and $V$, respectively.

Highly related to the matrix approximation problem, non-negative matrix factorization tries to approximate a matrix $A$ with two non-negative matrix factors $U$ and $V$ ([104, 105]):

$$A \approx UV$$

To measure the approximation quality, two cost functions are used. The first measurement is the squared Euclidean distance

$$\|A - B\|^2 = \sum_{i,j} (A_{i,j} - B_{i,j})^2$$

and the second measurement is the divergence

$$D(A\|B) = \sum_{i,j} \left( A_{i,j} \log\frac{A_{i,j}}{B_{i,j}} - A_{i,j} + B_{i,j} \right)$$

Note that the second measurement $D(A\|B)$ is always nonnegative and reaches zero only when $A_{i,j} = B_{i,j}$ holds for all $(i, j)$ pairs.

An iterative algorithm has been proposed to efficiently solve the problem by iteratively minimizing the above two cost functions. In particular, the following updating rules minimize the Euclidean distance $\|A - UV\|$

$$V_{a,k} \leftarrow V_{a,k}\, \frac{(U^T A)_{a,k}}{(U^T U V)_{a,k}}, \qquad U_{i,a} \leftarrow U_{i,a}\, \frac{(A V^T)_{i,a}}{(U V V^T)_{i,a}}$$

and the following updating rules minimize the divergence $D(A\|UV)$

$$V_{a,k} \leftarrow V_{a,k}\, \frac{\sum_i U_{i,a} A_{i,k} / (UV)_{i,k}}{\sum_j U_{j,a}}, \qquad U_{i,a} \leftarrow U_{i,a}\, \frac{\sum_k V_{a,k} A_{i,k} / (UV)_{i,k}}{\sum_k V_{a,k}}$$

The above rules guarantee that the two cost functions are non-increasing until a local optimum is reached. To initialize the algorithm, the two matrix factors $U$ and $V$ can be seeded as random non-negative matrices.
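The Euclidean-distance version of these multiplicative updates in a minimal sketch (ours, following the rules above; the small epsilon guards against division by zero):

```python
import numpy as np

def nmf(A, r, n_iter=500, eps=1e-9, seed=0):
    """Approximate a non-negative n x m matrix A as U @ V with inner dimension r."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    U = rng.random((n, r)) + eps       # random non-negative seeding
    V = rng.random((r, m)) + eps
    for _ in range(n_iter):
        V *= (U.T @ A) / (U.T @ U @ V + eps)   # multiplicative update for V
        U *= (A @ V.T) / (U @ V @ V.T + eps)   # multiplicative update for U
    return U, V

A = np.random.default_rng(1).random((8, 6))
U, V = nmf(A, r=3)
print(np.linalg.norm(A - U @ V))       # ||A - UV|| is non-increasing over iterations
```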
2.3 Semi-Supervised Multi-label Learning by Constrained NMF

The following terminology and notation will be used throughout the rest of the chapter. Let $\mathcal{D} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ denote the entire dataset, where $n$ is the total number of examples, including both the labeled ones and the unlabeled ones. We assume that the first $n_l$ examples are labeled, and their label information is presented in the binary matrix $\hat{T} \in \{0,1\}^{n_l \times m}$, where $m$ is the number of classes. Let the similarity of all the examples be denoted by a matrix $A = [A_{i,j}]_{n \times n}$, where element $A_{i,j} \ge 0$ represents the similarity between two examples based on their input patterns. We denote by $T_{i,k} \ge 0$ the confidence score of assigning the $k$-th class label to the $i$-th example, and by $\mathbf{t}_i = (T_{i,1}, T_{i,2}, \ldots, T_{i,m})^T$ the confidence scores of assigning each class to the $i$-th example. Finally, the matrix $T = [T_{i,k}]_{n \times m}$ collects the label confidence scores of all the examples, and the class similarity matrix $B = [B_{k,l}]_{m \times m}$ with $B_{k,l} \ge 0$ represents the similarity between two classes. Then, instead of computing the class-based similarity between two examples by the direct dot product, we compute it by a weighted dot product, i.e., $\mathbf{t}_i^T B\, \mathbf{t}_j$. Following the assumption stated above, we would expect $A_{i,j} \approx \mathbf{t}_i^T B\, \mathbf{t}_j$ if the class assignments $\mathbf{t}_i$ and $\mathbf{t}_j$ are appropriate for examples $\mathbf{x}_i$ and $\mathbf{x}_j$. This leads to the following optimization problem:

$$\arg\min_{T} \; \sum_{i,j=1}^{n} \left( A_{i,j} - \sum_{k,l=1}^{m} T_{i,k} B_{k,l} T_{j,l} \right)^2 \qquad (2.3)$$

$$\text{s.t.} \quad T_{j,l} \ge 0, \; j = 1, \ldots, n, \; l = 1, \ldots, m$$
$$\quad\;\; T_{i,k} = \hat{T}_{i,k}, \; i = 1, \ldots, n_l, \; k = 1, \ldots, m \qquad (2.4)$$

where the last set of constraints ensures that the estimated label confidences $T_{i,k}$ are consistent with the assigned class labels $\hat{T}_{i,k}$ for all the training examples.

Remark: It is interesting to see that the formalization in (2.3) can also be written as a non-negative matrix factorization problem under a linear constraint, if we ignore the constraints coming from the training examples, i.e.,

$$\arg\min_{T, H} \; \|A - TH\|_F$$
$$\text{s.t.} \quad T_{j,l} \ge 0, \; H_{l,j} \ge 0, \; j = 1, \ldots, n, \; l = 1, \ldots, m$$
$$\quad\;\; H = B T^T$$

where $\|\cdot\|_F$ stands for the Frobenius norm. The above problem is similar to the standard Non-negative Matrix Factorization (NMF) problem except for the linear constraint that restricts the matrix $H$ to be linearly dependent on the matrix $T$. It is this constraint, and furthermore the constraints arising from the labeled data, that prevent the direct application of the NMF algorithm.

One problem with the formulation in (2.3) is that, since the input-based similarity $A_{i,j}$ can be any positive value, it could be significantly larger than the elements in $B$. As a result, the label confidence $T_{i,k}$ that minimizes the objective function in (2.3) may also be significantly larger than 1. But, to satisfy the constraints in (2.4), the label confidence $T_{i,k}$ should be restricted to 0 or 1, since the assigned class label $\hat{T}_{i,k}$ is binary. To resolve the conflict between the minimizer of the objective function in (2.3) and the binary constraints, we introduce two sets of label confidences: the unnormalized label confidences $\{T_{i,k}\}$, and the normalized label confidences $\{\tilde{T}_{i,k}\}$. The former can take any positive value, while the latter is positive and subject to the constraints $\sum_{k=1}^{m} \tilde{T}_{i,k} = 1$. We will, on one hand, use the unnormalized label confidences to minimize the difference between the class-based similarity and the input-based similarity, and on the other hand, use the normalized label confidences to ensure that the predicted label confidences are consistent with the assigned class labels. Formally, we can summarize this idea into the following optimization problem:

$$\min_{T, \tilde{T}, \alpha} \; \sum_{i,j=1}^{n} \left( A_{i,j} - \sum_{k,l=1}^{m} T_{i,k} B_{k,l} T_{j,l} \right)^2 + C \sum_{j=1}^{n} \sum_{l=1}^{m} \left( T_{j,l} - \alpha_j \tilde{T}_{j,l} \right)^2 \qquad (2.5)$$

$$\text{s.t.} \quad T_{j,l},\; \tilde{T}_{j,l},\; \alpha_j \ge 0, \; j = 1, \ldots, n, \; l = 1, \ldots, m$$
$$\quad\;\; \sum_{l=1}^{m} \tilde{T}_{j,l} = 1, \; j = 1, \ldots, n$$
$$\quad\;\; \tilde{T}_{i,k} = \frac{\hat{T}_{i,k}}{\sum_{l=1}^{m} \hat{T}_{i,l}}, \; i = 1, \ldots, n_l, \; k = 1, \ldots, m$$

Note that in the above formalization, we introduce the term $C \sum_{j=1}^{n} \sum_{l=1}^{m} (T_{j,l} - \alpha_j \tilde{T}_{j,l})^2$ into the objective function to enforce that the two sets of label confidences are consistent and only differ by a scaling factor $\alpha_j$ for each example. The parameter $C$ weighs the importance of the second term against the first term, and is determined empirically.
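Before deriving the updates, it may help to see the objective (2.5) spelled out computationally. The sketch below (ours) evaluates it for given $T$, $\tilde{T}$ and $\alpha$, which is also handy for monitoring the convergence of the algorithm derived next.

```python
import numpy as np

def cnmf_objective(A, B, T, T_tilde, alpha, C):
    """Objective (2.5): ||A - T B T^T||_F^2 + C * sum_j ||t_j - alpha_j * t~_j||^2."""
    fit = np.linalg.norm(A - T @ B @ T.T, "fro") ** 2
    penalty = np.sum((T - alpha[:, None] * T_tilde) ** 2)
    return fit + C * penalty

rng = np.random.default_rng(0)
n, m = 5, 3
A = rng.random((n, n)); A = (A + A.T) / 2      # symmetric input-based similarities
B = np.eye(m)                                  # identity B = no class correlation
T = rng.random((n, m))
T_tilde = T / T.sum(axis=1, keepdims=True)     # rows sum to one, as required
alpha = T.sum(axis=1)                          # with these alphas the penalty is zero
print(cnmf_objective(A, B, T, T_tilde, alpha, C=100.0))
```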
2.5 Experiments and Discussions

Our experiments are designed to evaluate the proposed multi-label learning framework in text categorization tasks, particularly in the case of a large number of classes and a small training set.

2.5.1 Experiment Setup

The dataset used in our study comes from the textual data of the Eurovision St. Andrews Photographic Collection (ESTA) in the ImageCLEF collection [77]. We randomly pick 3456 documents, and choose the top 100 most popular categories from all the categories those picked documents belong to. On average, each document is assigned to 4.5 classes. Documents are preprocessed by the SMART system with stop words removed and words stemmed, and each document is represented by a vector of weighted term frequencies.

Our proposed framework is implemented in the following way. The document similarity A_{i,j} is computed as the cosine similarity between the corresponding term frequency vectors. To compute the class similarity matrix B, we first represent each class c by a binary vector whose elements are set to one when the corresponding training documents belong to the class c and zero otherwise. We then compute the pairwise class similarity based on this vector representation using a normalized RBF kernel. Finally, the class assignment for each test document is made by the ranking of the label confidence scores obtained from the matrix T.
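As a concrete illustration of this setup, the two similarity matrices might be assembled as below. The term-frequency matrix X, the binary label matrix Y, and the RBF width sigma are assumed inputs; the text does not specify the kernel width or the exact sense in which the RBF kernel is "normalized", so both choices here are illustrative.

```python
import numpy as np

def build_similarities(X, Y, sigma=1.0):
    """Return document similarity A (cosine) and class similarity B (normalized RBF).
    X: n x d term frequency vectors; Y: n_labeled x m binary class indicators."""
    # A_{i,j}: cosine similarity between term frequency vectors.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    A = Xn @ Xn.T
    # Each class is represented by a binary vector over the training documents.
    V = Y.T.astype(float)                                  # m x n_labeled
    sq = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
    B = np.exp(-sq / (2 * sigma**2))
    B /= B.sum(axis=1, keepdims=True)   # one plausible reading of "normalized"
    return A, B
```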
Every experiment is repeated 10 times by randomly re-splitting the dataset into training and testing sets. The parameter C in the objective function (2.5) is set to 100. We also varied the value of C from 20 to 200, and found that the results remained almost unchanged. For easy reference, we will refer to the proposed algorithm as "CNMF".

Since our approach only produces a ranked list of class labels for a test document, in this study we focus on evaluating the quality of class ranking. In particular, for each test document, we compute the precision, recall, and F1 measurement at each rank by comparing the ranked classes to the true class labels. Then the precision, recall, and F1 measurement averaged over all the test documents is used as the final evaluation metric.
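A sketch of this rank-based evaluation follows, assuming a confidence matrix scores and a binary truth matrix; the helper name rank_metrics and the max_rank cutoff are illustrative.

```python
import numpy as np

def rank_metrics(scores, truth, max_rank=30):
    """Average precision/recall/F1 at each rank over all test documents.
    scores: n x m confidence matrix; truth: n x m binary label matrix."""
    n, m = scores.shape
    order = np.argsort(-scores, axis=1)          # classes ranked by confidence
    P, R, F = [], [], []
    for k in range(1, max_rank + 1):
        # Number of true labels among the top-k ranked classes, per document.
        hits = np.array([truth[i, order[i, :k]].sum() for i in range(n)])
        prec = hits / k
        rec = hits / np.maximum(truth.sum(axis=1), 1)
        f1 = np.where(prec + rec > 0, 2 * prec * rec / (prec + rec + 1e-12), 0.0)
        P.append(prec.mean()); R.append(rec.mean()); F.append(f1.mean())
    return np.array(P), np.array(R), np.array(F)
```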
Three baseline models are used in our study. The first is the Spectral Graph Transducer ("SGT" for short) [87], which has proved effective at exploiting unlabeled data. A separate SGT classifier is built for each individual document category, and the probability values output by SGT are used to rank the class labels. The second baseline model is Multi-label Informed Latent Semantic Indexing ("MLSI" for short) [164], which maps document vectors into a low-dimensional space that is strongly correlated with the class labels of the documents. It has been shown empirically that MLSI is effective at exploiting both the unlabeled data and the correlation among classes. The last baseline model is the Support Vector Machine ("SVM" for short). A linear SVM classifier based on the term frequency vectors of the documents is built for each category independently. All the baseline models are tested in a 10-fold experiment, using the same training/test splits of the dataset as the proposed framework.

2.5.2 Experiment Results

Figure 2.1, Figure 2.2 and Figure 2.3 show the average precision, recall, and F1 measurement, respectively, at different ranks, for both the proposed framework and the three baseline approaches.

[Figure 2.1. Classification performance (Precision) when varying the number of predicted labels for each test example along the ranked list of class labels; panels (a) #Training=100 and (b) #Training=500.]

[Figure 2.2. Classification performance (Recall) when varying the number of predicted labels for each test example along the ranked list of class labels; panels (a) #Training=100 and (b) #Training=500.]

[Figure 2.3. Classification performance (F1) when varying the number of predicted labels for each test example along the ranked list of class labels; panels (a) #Training=100 and (b) #Training=500.]

A comparative analysis of the results in Figure 2.1, Figure 2.2 and Figure 2.3 leads to the following findings:

1. All four approaches show the same trend of decreasing precision and increasing recall as the number of labels predicted for each document increases. This is in accordance with the usual precision-recall tradeoff. However, as a measurement balancing precision and recall, each F1 curve clearly shows a peak. As can be seen from Figure 2.3, the F1 curve of CNMF reaches its peak when the number of predicted labels is around 3 to 4, which is close to the average number of labels per document (i.e., 4.5).

2. CNMF makes a more significant improvement on the average recall than on the average precision when compared to the three baseline approaches. This is related to our task scenario, which focuses on multi-label learning with a large number of classes and a small number of training examples. Given such a scenario, we would expect a number of classes that are not provided with a sufficient amount of training examples. As a result, we hypothesize that predictions on these classes will have to rely heavily on the correlation among classes. This hypothesis is partially justified by the comparison between the proposed approach, CNMF, which exploits class correlations, and SGT or SVM, which do not. Although CNMF and SGT achieve similar performance in precision, CNMF performs significantly better than SGT in terms of the average recall.

3. More improvement by CNMF over the three baseline approaches is observed when the number of training documents is 100 than when it is 500. This is partially due to the same reason mentioned above: the advantage of exploiting class correlations on sparse training data. It can also be attributed to the fact that our approach makes use of the correlation among the unlabeled data, whose benefit has been demonstrated by many studies in semi-supervised learning, for instance [173, 129, 172].

4. Although MLSI is able to exploit the correlation among classes, its performance depends heavily on the appropriate choice of the number of latent variables and of the tuning parameter determining how much the indexing should be biased by the outputs. These two parameters are usually determined by cross validation and can therefore be problematic when the number of training examples is relatively small. This problem is directly reflected in the large variance in both precision and recall of the MLSI algorithm, which we believe is due to the inappropriate choice of the aforementioned two parameters given the limited number of training examples.

2.6 Conclusions

In this chapter, we propose a novel framework to accommodate the side information of class correlations in multi-label learning, which meets the challenging situation of a large number of classes and a small amount of training data. The advantage of our proposed framework is that it is able to exploit both the correlation among classes and the unlabeled data. We also present an efficient algorithm to solve the related optimization problem. Experiments show that our proposed framework performs significantly better than three other state-of-the-art multi-label learning techniques in text classification tasks.

CHAPTER 3
Semi-supervised Classification with Link Information

3.1 Introduction

From supervised learning to semi-supervised learning, machine learning research has witnessed the trend of exploiting the structure of unlabeled data. A typical approach to revealing the underlying structure of unlabeled data is to establish correlations between example pairs from their data representations (or feature values). For example, many graph-based learning models construct a k-nearest neighbor graph by choosing an appropriate distance measurement defined on the data representations of a pair of examples, such as Locally Linear Embedding [124], ISOMap [142], Laplacian Eigenmaps [17], Manifold Learning [18], etc. However, many real-world datasets already exhibit inherent correlations through entities that are interlinked with each other. Especially in text domains, datasets featuring "links" are very common: nearly all kinds of scientific publications cross-reference each other, and the World Wide Web is woven together by hyperlinks that connect pages. One can also find datasets with link information in other domains, such as social networks, bioinformatics, etc. Naturally, link information suggests certain structure underlying the dataset. For example, an empirical study showed that the Web's spatial locality (induced by hyperlinks) mirrors its topical locality [47]. Moreover, link information usually provides a different view on the structure of the data than the one "computed" from the data representations, since link generation often involves human judgment.
For example, when a web author creates hyperlinks, he or she can capture sophisticated semantic closeness between web pages that is beyond the power of state-of-the-art text mining tools. Therefore, exploiting link information offers the hope of gaining more insight into the structure of data for many learning algorithms.

However, while it can be informative, link information also often presents some troublesome characteristics in practice. First, link information can be sparse, i.e., some data examples may be involved in no links at all. Second, link information is often incomplete, i.e., one cannot always expect a link to be observed wherever a link "should" be constructed between two examples. For example, we cannot expect a scientific publication to cite all work that is related to it. Third, link information tends to be noisy. One example illustrating noise in links is hyperlink spam on the Web. In summary, in real-world datasets, link information is typically not a reliable source for the structure of data. Therefore, link information is best treated as "side information" that supplements the original data representations for learning tasks.

Informative but unreliable, link information provides opportunity as well as challenge for semi-supervised learning. In this chapter, we will focus on classification on datasets with link structures. Depending on how link information is incorporated into a learning algorithm, previous studies on this topic can be roughly categorized into three types of models: representation augmentation, correlation augmentation and training pool augmentation. The first type modifies the representation of a data example using its neighborhood as induced from the link structure [33, 118, 61, 63, 162, 48, 120, 130, 40]. The second type utilizes link information to construct a graph [168, 167, 7], or to derive a more accurate similarity measurement between data examples [32], and applies (or designs) a semi-supervised learning algorithm over the graph. Essentially, link information plays the role of establishing correlations between examples. The third type uses linked examples to augment the training set, represented by the co-training algorithm and its variants [27, 39]. Usually a feature-based classifier and a link-based classifier are applied alternately, supplementing each other's training pool with their highly confident predictions.

However, the performance of all three types of models summarized above can degrade significantly with a decreasing number of links. When links are sparse, only a small fraction of the data is involved in any link. Whether a model augments the data representation, the correlation, or the training pool, an imbalance is created between those data examples that are influenced by the links and the rest that are not. To overcome this problem brought by the sparseness of link information, it is desirable that the limited link information be somehow "smoothed" over the entire dataset.

In this chapter, we will propose a novel semi-supervised framework, termed "LinkBoost", to improve classification accuracy by utilizing link information.
LinkBoost is a boosting framework: a base classifier is applied iteratively with an augmented training pool, which is updated through a minimization of the inconsistency between the class label assignments and the pairwise data similarities augmented by the link information; with a series of learned weights, the classification results from all iterations are finally combined into the final prediction. Thus the classification results from the iteratively applied classifier with an updating training pool act as a way of propagating link information around, i.e., "smoothing" the influence of links over the entire dataset. As a result, LinkBoost is more robust against the sparseness of link information. Besides, as a general boosting framework, LinkBoost can be applied to any classification algorithm. The LinkBoost framework is also hybrid in the sense that it can accommodate all three augmentations of the dataset from link information: representation augmentation, correlation augmentation and training pool augmentation.

The following notations will be used throughout this chapter. Let X = {x_i}_{i=1}^n denote a set of n examples, where the first n_l are labeled and the remaining n_u are unlabeled, i.e., n = n_l + n_u. We also use x_i to represent the feature vector of the i-th example, and use X to denote the matrix that gathers all the feature vectors of the examples, i.e., X = [x_1, ..., x_n]. Let y = {y_i}_{i=1}^n be the label vector, in which {y_i}_{i=1}^{n_l} is given and {y_i}_{i=n_l+1}^{n} is to be decided. S_{i,j} is defined as a similarity measurement between the i-th and the j-th examples, forming the matrix S = [S_{i,j}]_{n x n}. The link information is encoded into a matrix R = [r_{i,j}]_{n x n}, where r_{i,j} = 1 if there is a link from x_i to x_j and r_{i,j} = 0 otherwise. From the link matrix, a co-citation matrix can be defined as C = [c_{i,j}]_{n x n}, where c_{i,j} = 1 if there exists k != i, j such that r_{i,k} = 1 and r_{j,k} = 1, and c_{i,j} = 0 otherwise. (This co-citation matrix is defined in the "co-citing" manner, i.e., two examples are correlated if they both link to a third example. Another way is to define the co-citation matrix in a "co-cited" manner, i.e., two examples are correlated if they are both linked from a third example. Deciding which definition is more appropriate is domain dependent.)
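Given these definitions, the co-citation matrix can be derived from R in a few lines; this sketch is illustrative, and one simplification relative to the definition above is noted in a comment.

```python
import numpy as np

def cocitation(R, mode="co-citing"):
    """Build the 0/1 co-citation matrix C from a 0/1 link matrix R (n x n).
    "co-citing": c_{i,j} = 1 if i and j both link to some third example;
    "co-cited":  c_{i,j} = 1 if both are linked from some third example."""
    # (R @ R.T)[i, j] counts common link targets of i and j; strictly speaking
    # the definition excludes k = i, j, which matters only for self-links or
    # mutual links -- ignored here for simplicity.
    M = R @ R.T if mode == "co-citing" else R.T @ R
    C = (M > 0).astype(int)
    np.fill_diagonal(C, 0)
    return C
```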
3.2 Related Work

In this section, we will briefly review previous efforts in the area of link-based classification (especially in the text domain) and a few semi-supervised classification models that are closely related to this chapter, followed by a recapitulation of several representative algorithms.

3.2.1 Link-based Classification

The most popular way of doing link-based classification is to augment the representation of data examples with their neighbors. In [63, 32, 120, 130], the bag-of-words model of documents is augmented by including words from neighboring documents, thus creating a "virtual document" to feed into classifiers. However, the studies in [33, 118] suggest that incorporating words from neighboring documents sometimes leads to performance degradation, while making use of their predicted class labels as additional features for the data representation is helpful. Later, in [107], different schemes for formulating the additional link-based features were discussed. Apart from explicitly augmenting data representations with features or labels of linked examples, another stream of research tries to unify content analysis with link analysis by creating new representations for linked data examples in a latent space [48, 40].

Since links are often directional (which is true in most cases), there exist several different criteria for identifying the neighborhood of a data example. The most frequently used criteria include incoming links, outgoing links and co-citation links [107, 61, 162, 32]. In particular, the studies presented in [61, 162] suggest that different dataset regularities may favor different neighborhood identification criteria.

Rather than augmenting the data representation to better explore the underlying structure of data for semi-supervised learning, a few studies directly create a graph structure from the link information. These approaches often lead to semi-supervised learning algorithms over graphs: an iterative relaxation labeling algorithm is proposed in [7] for undirected graphs; the work in [168, 167] handles directed graphs by regularizing classification functions to change slowly on densely linked subgraphs.

3.2.2 Related Semi-supervised Classification Models

A large number of models have been proposed for semi-supervised classification, which can be broadly categorized into several types: graph-based models, margin-based models, kernel-based models, ensemble-based models, etc. Although only a small fraction of these models directly address the use of link information, at least two types are closely related to the LinkBoost framework proposed in this chapter: graph-based models and ensemble-based models.

Graph-based models build a connected graph on both labeled and unlabeled examples, with each vertex representing an example and each weighted edge representing the correlation between an example pair. The most well-known graph-based models include Harmonic Functions [173, 174], the Spectral Graph Transducer [87], Gaussian Processes [4, 103, 67], Manifold Regularization [18], Label Propagation [19], etc. A common theme shared by many graph-based semi-supervised classification models is to find an optimal set of class labels for the unlabeled examples, such that they are consistent with the supervised class labels from the labeled examples as well as with the graph structure. The graph Laplacian is a popular form for defining an inconsistency measurement over a set of class assignments y = {y_i} and the graph represented by a similarity matrix S = [S_{i,j}]:

\[ F(\mathbf{y}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} S_{i,j} (y_i - y_j)^2 = \mathbf{y}^T L \mathbf{y} \]

where L is the graph Laplacian of S. The above inconsistency measurement is usually combined with other components to form an objective function for minimization. It is easy to imagine that, when link information is available, we can enforce the class assignment to be consistent with it in a similar way to graph-based models. As will be seen in Section 3.3, the proposed LinkBoost framework uses a similar definition of the inconsistency measurement, except in the form of an exponential cost (to facilitate deriving a boosting algorithm) rather than a quadratic cost. Moreover, unlike most graph-based models, which are non-parametric and do not build explicit classification models, the proposed LinkBoost framework is able to yield a classification model based on any given classification algorithm. This is particularly useful when the amount of unlabeled data is extremely large.
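For concreteness, the quadratic inconsistency can be computed directly from its definition; this small sketch assumes a symmetric similarity matrix S, as required by the identity above.

```python
import numpy as np

def laplacian_inconsistency(S, y):
    """F(y) = 1/2 * sum_{i,j} S_ij (y_i - y_j)^2 = y^T L y, where L = D - S
    and D is the diagonal degree matrix of the (symmetric) similarity S."""
    D = np.diag(S.sum(axis=1))
    L = D - S
    return float(y @ L @ y)
```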
Ensemble-based models have gained increasing attention in semi-supervised learning through several successful models, such as AdaBoost [56, 55], ASSEMBLE [21] and Semi-supervised Margin Boost (SSMB) [43]. Usually, in an iterative manner, pseudo labels are assigned to unlabeled examples, which are then sampled for training a new supervised classifier to be added to an ensemble. The proposed LinkBoost framework follows this idea of iteratively augmenting the training set, but it utilizes link information to obtain more reliable pseudo labels for the unlabeled examples. In this sense, the LinkBoost framework combines the advantages of both graph-based models and ensemble-based models.

3.2.3 Representative Algorithms Review

Nearly all the algorithms proposed for link-based classification make the assumption that interlinked (or co-cited) examples are more likely to belong to the same class. In the following, we recapitulate a few representative algorithms proposed for semi-supervised classification on datasets with link information.

FEATURE REPRESENTATION AUGMENTATION

In the family of feature representation augmentation algorithms, an example's feature set is supplemented with features from its interlinked/co-citation examples. Formally, treating the rows of X as examples here, we can write the augmented examples as

\[ X^{aug} = (1 - \lambda) X + \lambda M^T X \tag{3.1} \]

where the matrix M = R if we only want to augment an example with incoming links, M = R + R^T if both incoming and outgoing links are used, or M = C if only co-citation examples are considered for augmentation. \lambda is a weight on the augmentation from linked examples. The augmented feature representation is then fed into a semi-supervised learning algorithm for classification.
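A minimal sketch of the construction in (3.1) follows; the function name and the default weight are illustrative, and M should be whichever link matrix (R, R + R^T, or C) fits the dataset.

```python
import numpy as np

def augment_features(X, M, lam=0.5):
    """Equation (3.1): X_aug = (1 - lam) * X + lam * M^T X.
    X is n x d with one example per row; M is the chosen 0/1 link matrix,
    so M^T X aggregates the feature vectors of each example's neighbors."""
    return (1 - lam) * X + lam * (M.T @ X)
```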
CO-TRAINING FOR LINK-BASED CLASSIFICATION

Co-training was first proposed in [27], in which two (or possibly more) learners are trained separately on a set of examples; each learner uses a different, and ideally independent, set of features for each example. Co-training can be naturally applied to link-based classification: one learner is based on the data representation, the other learner is based on the link information. Specifically, the co-training algorithm for link-based classification can be formalized as follows.

Co-training for Link-based Classification
Step 1  Initialize a training pool P with all training examples, P = {x_1, ..., x_{n_l}}.
Step 2  Iteratively apply the following two learners:
        - Train a data-representation-based learner on the current training pool P, and apply the learner to predict labels for the remaining examples. Update the training pool P by moving the examples with high prediction confidence into it.
        - Train a link-based learner on the current training pool P, and apply the learner to predict labels for the remaining examples. Update the training pool P by moving the examples with high prediction confidence into it.
        Until no more examples can be added into the training pool P.

ITERATIVE CLASSIFICATION ALGORITHM

Lu & Getoor proposed the Iterative Classification Algorithm (ICA) in [107]. Different from the feature representation augmentation approach, ICA augments the data representation with a new set of features that summarize the class label statistics (from the predictions of the previous iteration) of interlinked/co-citation examples. Because the newly introduced label feature set differs in nature from the original feature set, two logistic regression models, one on each feature set, are combined to form a prediction.

To formalize the ICA algorithm, let us introduce z_i to represent the new feature vector, i.e., the linked examples' label statistics, for the corresponding data example x_i. For example, for a C-way classification problem, z_i could be defined as a C-dimensional vector, whose element z_{i,k} counts the number of examples among those linked with x_i that belong to the k-th class. Then the ICA algorithm can be summarized as follows.

Iterative Classification Algorithm
(Bootstrap) Initially assign a class to each example based on its original feature representation x_i.
(Iteration) Iteratively apply the following steps:
        - Form the new feature representation z_i for each example x_i by gathering the label statistics of its linked examples under the current class assignment.
        - Train two separate logistic regression models on the two feature representations, i.e., x_i and z_i, by the following MAP estimation:
          \[ \Pr(Y|X) = \Pr(Y|\{x_i\}) \, \Pr(Y|\{z_i\}) \]
          where \Pr(y_i|x_i) = 1 / (1 + \exp(-y_i \mathbf{w}_x^T x_i)) and \Pr(y_i|z_i) = 1 / (1 + \exp(-y_i \mathbf{w}_z^T z_i)). In the above, w_x and w_z are the parameters of the logistic regression models on the two feature sets.
        - Apply the combined logistic regression model to the test examples and update their class assignments.
        Until no updates are made to the class assignments or a maximum number of iterations has been reached.
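The label-statistics feature z_i can be built as follows; this is an illustrative sketch (the function name and the 0-based class encoding are assumptions), not Lu & Getoor's reference implementation.

```python
import numpy as np

def label_statistics(R, labels, n_classes):
    """Build the ICA feature z_i: counts of each class among the examples
    that x_i links to. R is the 0/1 link matrix; labels holds the current
    class assignment as integers in 0..n_classes-1."""
    n = R.shape[0]
    Z = np.zeros((n, n_classes))
    for i in range(n):
        for j in np.nonzero(R[i])[0]:   # examples linked from x_i
            Z[i, labels[j]] += 1
    return Z
```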
3.3 LinkBoost Framework

For simplicity, here we only consider binary classification, where the label vector y can only take values y_i = 1 or y_i = -1. As a semi-supervised framework, LinkBoost inherits from graph-based models the common theme that consistency is enforced between the graph structure induced by the link information and the class assignment to the unlabeled data. On the other hand, as a general framework that is able to boost any base supervised learning algorithm, LinkBoost follows the idea of ensemble-based models: a sampling of unlabeled examples based on predicted pseudo labels is used to iteratively augment the training set of the specified supervised algorithm. In the rest of this section, we formalize the LinkBoost framework and derive the related algorithm.

3.3.1 Objective Function

To encode the structure of both labeled and unlabeled data, we use a similarity matrix S = [S_{i,j}]_{n x n} that combines both the link information and the feature representations of the data examples:

\[ S_{i,j} = (1 - a)\, sim(x_i, x_j) + a\, r_{i,j} \tag{3.2} \]

Here sim(.,.) defines a similarity measurement based on the features; for example, we can use cosine similarity in the text domain. r_{i,j} is an element of the link matrix R which, more specifically, encodes all the incoming links; the link matrix R can be replaced by R + R^T if both incoming and outgoing links are taken into consideration, or by the co-citation matrix C if co-citation links are more appropriate for disclosing the structure of the data. Making a good choice of the link matrix usually depends on the regularity presented by the dataset, as suggested in [61, 162]. a is a combination weight factor, 0 <= a <= 1. The above similarity definition can be viewed as using the representation-based similarity to smooth the links, thus mitigating the problem brought by the sparseness of the link information.
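A direct rendering of (3.2) follows, with cosine similarity as the feature-based measurement, which the text offers as an example for text data; the weight a = 0.4 mirrors the setting used in the experiments later in this chapter, and the small epsilon is an illustrative guard.

```python
import numpy as np

def combined_similarity(X, R, a=0.4):
    """Equation (3.2): S = (1 - a) * cosine_similarity + a * links.
    X is n x d (one example per row); R is the chosen link matrix
    (R, R + R^T, or the co-citation matrix C)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    return (1 - a) * (Xn @ Xn.T) + a * R
```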
We define the inconsistency, within the unlabeled data, between the class labels {y_i}_{i=n_l+1}^{n_l+n_u} and the similarity measurement S as

\[ F_{uu} = \sum_{i,j=n_l+1}^{n_l+n_u} S_{i,j} \cosh(y_i - y_j) \tag{3.3} \]

where cosh(.) is the hyperbolic cosine function, i.e., cosh(y_i - y_j) = [exp(y_i - y_j) + exp(y_j - y_i)] / 2. When a symmetric similarity measurement is considered, as in this chapter, the above inconsistency can be rewritten as

\[ F_{uu} = \sum_{i,j=n_l+1}^{n_l+n_u} \frac{1}{2} S_{i,j} \exp(y_i - y_j) + \sum_{i,j=n_l+1}^{n_l+n_u} \frac{1}{2} S_{i,j} \exp(y_j - y_i) = \sum_{i,j=n_l+1}^{n_l+n_u} S_{i,j} \exp(y_i - y_j) \tag{3.4} \]

Also, we define the inconsistency, across the labeled and unlabeled data, between the class labels and the similarity measurement as

\[ F_{lu} = \sum_{i=1}^{n_l} \sum_{j=n_l+1}^{n_l+n_u} S_{i,j} \exp(-2 y_i y_j) \tag{3.5} \]

Ideally, the labels decided for the unlabeled data should minimize both inconsistencies stated above. This leads to the following optimization problem:

\[ \min_{y_i,\; i=n_l+1,\ldots,n_l+n_u} \; F_{uu} + F_{lu} \tag{3.6} \]

3.3.2 Boosting Algorithm

In this subsection, we derive a boosting algorithm that solves the optimization problem (3.6) in an iterative procedure. Specifically, given an arbitrary binary classification algorithm A, let h^{(t)}(x) denote the classification model learned in the t-th iteration by this algorithm. Then the final classification model, after T iterations, is the combination of the T models learned at all iterations, i.e.,

\[ H^{(T)}(x) = \sum_{t=1}^{T} \alpha^{(t)} h^{(t)}(x) \tag{3.7} \]

where \alpha^{(t)} >= 0 is the combination weight. Here the superscript in parentheses indicates the iteration number. Finally we apply this model to predict the labels for the unlabeled data, i.e., y_i = H^{(T)}(x_i), i = n_l + 1, ..., n_l + n_u.

To derive a boosting procedure that minimizes the objective function (3.6), we need to find a good combination weight at each iteration, so that the objective function decreases from the previous iteration. We now study the change of the objective function from the (t-1)-th iteration to the t-th iteration. To simplify the notation, let H_i denote the class label of the i-th (unlabeled) example predicted by the combined model from all t-1 previous iterations, and h_i the same example's predicted label at the t-th iteration. We also abbreviate \alpha^{(t)} as \alpha. Now the objective function becomes

\[ F^{(t)} = F_{uu}^{(t)} + F_{lu}^{(t)} = \sum_{i,j=n_l+1}^{n_l+n_u} S_{i,j} \exp(H_i + \alpha h_i - H_j - \alpha h_j) + \sum_{i=1}^{n_l} \sum_{j=n_l+1}^{n_l+n_u} S_{i,j} \exp(-2 y_i (H_j + \alpha h_j)) \tag{3.8} \]

Using the inequality of arithmetic and geometric means, we have

\[ \exp[\alpha(h_i - h_j)] \leq \frac{1}{2}\left[\exp(2\alpha h_i) + \exp(-2\alpha h_j)\right] \tag{3.9} \]

So we can bound the first term in (3.8), F_{uu}^{(t)}, as follows:

\[ F_{uu}^{(t)} = \sum_{i,j=n_l+1}^{n_l+n_u} S_{i,j} \exp(H_i - H_j) \exp(\alpha(h_i - h_j)) \leq \sum_{i,j=n_l+1}^{n_l+n_u} \frac{1}{2} S_{i,j} \exp(H_i - H_j) \left[\exp(2\alpha h_i) + \exp(-2\alpha h_j)\right] \tag{3.10} \]

If we define

\[ a_j \stackrel{def}{=} \sum_{i=n_l+1}^{n_l+n_u} \frac{1}{2} S_{j,i} \exp(H_j - H_i) \tag{3.11} \]

\[ b_j \stackrel{def}{=} \sum_{i=n_l+1}^{n_l+n_u} \frac{1}{2} S_{i,j} \exp(H_i - H_j) \tag{3.12} \]

then, after switching the indices i and j in the first term of (3.10) for notational convenience, we can simplify the bound on F_{uu}^{(t)} as

\[ F_{uu}^{(t)} \leq \sum_{j=n_l+1}^{n_l+n_u} \exp(2\alpha h_j)\, a_j + \exp(-2\alpha h_j)\, b_j \tag{3.13} \]

Similarly, we can simplify F_{uu}^{(t-1)} as

\[ F_{uu}^{(t-1)} = \sum_{i,j=n_l+1}^{n_l+n_u} S_{i,j} \exp(H_i - H_j) = \sum_{j=n_l+1}^{n_l+n_u} a_j + b_j \tag{3.14} \]

The second term in (3.8), F_{lu}^{(t)}, can be rewritten as

\[ F_{lu}^{(t)} = \sum_{j=n_l+1}^{n_l+n_u} \sum_{i=1}^{n_l} S_{i,j} \exp(-2 y_i H_j) \exp(-2 \alpha y_i h_j) = \sum_{j=n_l+1}^{n_l+n_u} \exp(2\alpha h_j) \sum_{i=1}^{n_l} S_{i,j} \exp(2 H_j)\,\delta(y_i, -1) + \sum_{j=n_l+1}^{n_l+n_u} \exp(-2\alpha h_j) \sum_{i=1}^{n_l} S_{i,j} \exp(-2 H_j)\,\delta(y_i, 1) \tag{3.15} \]

In the above, we use the fact that the label of a labeled example, y_i, can only be 1 or -1, and the delta function \delta(x, y) = 1 if x = y and \delta(x, y) = 0 if x != y. Again, if we define

\[ c_j \stackrel{def}{=} \sum_{i=1}^{n_l} S_{i,j} \exp(2 H_j)\,\delta(y_i, -1) \tag{3.16} \]

\[ d_j \stackrel{def}{=} \sum_{i=1}^{n_l} S_{i,j} \exp(-2 H_j)\,\delta(y_i, 1) \tag{3.17} \]

then we can write (3.15) as

\[ F_{lu}^{(t)} = \sum_{j=n_l+1}^{n_l+n_u} \exp(2\alpha h_j)\, c_j + \exp(-2\alpha h_j)\, d_j \tag{3.18} \]

Similarly, we can simplify F_{lu}^{(t-1)} as

\[ F_{lu}^{(t-1)} = \sum_{i=1}^{n_l} \sum_{j=n_l+1}^{n_l+n_u} S_{i,j} \exp(-2 y_i H_j) = \sum_{j=n_l+1}^{n_l+n_u} c_j + d_j \tag{3.19} \]

Given the above results, we can study the bound on the change of the objective function from the (t-1)-th iteration to the t-th iteration. To further simplify notations, let us first define the normalized quantities

\[ \tilde{a}_j = \frac{a_j}{\sum_{j=n_l+1}^{n_l+n_u} a_j + b_j} \tag{3.20} \qquad \tilde{b}_j = \frac{b_j}{\sum_{j=n_l+1}^{n_l+n_u} a_j + b_j} \tag{3.21} \]

\[ \tilde{c}_j = \frac{c_j}{\sum_{j=n_l+1}^{n_l+n_u} c_j + d_j} \tag{3.22} \qquad \tilde{d}_j = \frac{d_j}{\sum_{j=n_l+1}^{n_l+n_u} c_j + d_j} \tag{3.23} \]

Then we have

\[ \log \frac{F^{(t)}}{F^{(t-1)}} \leq \log \frac{F_{uu}^{(t)}}{F_{uu}^{(t-1)}} + \log \frac{F_{lu}^{(t)}}{F_{lu}^{(t-1)}} \leq \log \sum_{j=n_l+1}^{n_l+n_u} \left[ \exp(2\alpha h_j)\tilde{a}_j + \exp(-2\alpha h_j)\tilde{b}_j \right] + \log \sum_{j=n_l+1}^{n_l+n_u} \left[ \exp(2\alpha h_j)\tilde{c}_j + \exp(-2\alpha h_j)\tilde{d}_j \right] \]

\[ \leq \sum_{j=n_l+1}^{n_l+n_u} \exp(2\alpha h_j)(\tilde{a}_j + \tilde{c}_j) + \exp(-2\alpha h_j)(\tilde{b}_j + \tilde{d}_j) - 2 \tag{3.24} \]

In the last step above, we used the inequality

\[ \log x \leq x - 1, \quad \forall x > 0 \tag{3.25} \]

Furthermore, if we apply the inequality

\[ \exp(\gamma x) \leq \exp(\gamma) + \exp(-\gamma) - 1 + \gamma x, \quad \forall x \in [-1, +1] \tag{3.26} \]

we can further bound \log(F^{(t)} / F^{(t-1)}) as

\[ \log \frac{F^{(t)}}{F^{(t-1)}} \leq \sum_{j=n_l+1}^{n_l+n_u} (\tilde{a}_j + \tilde{c}_j)\left[\exp(2\alpha) + \exp(-2\alpha) - 1 + 2\alpha h_j\right] + (\tilde{b}_j + \tilde{d}_j)\left[\exp(2\alpha) + \exp(-2\alpha) - 1 - 2\alpha h_j\right] - 2 \]

\[ = \sum_{j=n_l+1}^{n_l+n_u} (\tilde{a}_j + \tilde{b}_j + \tilde{c}_j + \tilde{d}_j)\left[\exp(2\alpha) + \exp(-2\alpha) - 1\right] - \sum_{j=n_l+1}^{n_l+n_u} 2\alpha h_j (\tilde{b}_j + \tilde{d}_j - \tilde{a}_j - \tilde{c}_j) - 2 \tag{3.27} \]

We now have two upper bounds on \log(F^{(t)} / F^{(t-1)}), (3.24) and (3.27). Both bounds touch \log(F^{(t)} / F^{(t-1)}) at \alpha = 0, since they all equal 0 when \alpha = 0. Furthermore, \log(F^{(t)} / F^{(t-1)}) <= 0 means F^{(t)} <= F^{(t-1)}. Therefore, as long as we minimize either upper bound, we can always have F^{(t)} <= F^{(t-1)}, i.e., keep the objective function non-increasing through the iterative procedure.

The upper bound (3.27) is well suited to designing a sampling scheme for a boosting algorithm. Specifically, to help lower this upper bound, we expect the label value h_j (at the t-th iteration) to be consistent with the sign of (\tilde{b}_j + \tilde{d}_j - \tilde{a}_j - \tilde{c}_j). This gives a good hint for sampling unlabeled data for training: the j-th unlabeled example should be labeled as sign(\tilde{b}_j + \tilde{d}_j - \tilde{a}_j - \tilde{c}_j) and sampled with a probability proportional to |\tilde{b}_j + \tilde{d}_j - \tilde{a}_j - \tilde{c}_j|.

Finally, to find the optimal \alpha, we can minimize either upper bound, (3.24) or (3.27). Here we choose the former, because it is tighter than the other, and also because minimizing it leads to a solution that is easier to compute. In particular, we take the first-order derivative (with respect to \alpha) of the upper bound in (3.24) and set it to zero, i.e.,

\[ \sum_{j=n_l+1}^{n_l+n_u} \exp(2\alpha h_j)\, 2 h_j (\tilde{a}_j + \tilde{c}_j) - \sum_{j=n_l+1}^{n_l+n_u} \exp(-2\alpha h_j)\, 2 h_j (\tilde{b}_j + \tilde{d}_j) = 0 \tag{3.28} \]

or equivalently, since h_j takes values in {1, -1},

\[ \sum_{j=n_l+1}^{n_l+n_u} \exp(2\alpha)(\tilde{a}_j + \tilde{c}_j)\delta(h_j, 1) - \sum_{j=n_l+1}^{n_l+n_u} \exp(-2\alpha)(\tilde{a}_j + \tilde{c}_j)\delta(h_j, -1) - \sum_{j=n_l+1}^{n_l+n_u} \exp(-2\alpha)(\tilde{b}_j + \tilde{d}_j)\delta(h_j, 1) + \sum_{j=n_l+1}^{n_l+n_u} \exp(2\alpha)(\tilde{b}_j + \tilde{d}_j)\delta(h_j, -1) = 0 \tag{3.29} \]

Solving the above equation, we have

\[ \alpha = \frac{1}{4} \log \frac{\sum_{j=n_l+1}^{n_l+n_u} (\tilde{a}_j + \tilde{c}_j)\,\delta(h_j, -1) + (\tilde{b}_j + \tilde{d}_j)\,\delta(h_j, 1)}{\sum_{j=n_l+1}^{n_l+n_u} (\tilde{a}_j + \tilde{c}_j)\,\delta(h_j, 1) + (\tilde{b}_j + \tilde{d}_j)\,\delta(h_j, -1)} \tag{3.30} \]

Input:
  - X: d x n matrix for the input data
  - A: given classification algorithm
Output: class labels
Algorithm:
  - Compute the pairwise similarity matrix S.
  - Initialize H^{(0)}(x) = 0.
  - For t = 1, 2, ..., T:
      - Compute \tilde{a}_j, \tilde{b}_j, \tilde{c}_j, \tilde{d}_j using (3.20)-(3.23).
      - Compute the class label for each unlabeled example x_j as sign(\tilde{b}_j + \tilde{d}_j - \tilde{a}_j - \tilde{c}_j).
      - Sample unlabeled examples with probability proportional to |\tilde{b}_j + \tilde{d}_j - \tilde{a}_j - \tilde{c}_j|.
      - Adding the sampled examples to the labeled examples, train a binary classifier h^{(t)}(x) with the algorithm A.
      - Compute the optimal \alpha^{(t)} using (3.30).
      - Update the classification model as H^{(t)}(x) <- H^{(t-1)}(x) + \alpha^{(t)} h^{(t)}(x).
  - Predict class labels with the final classification model H^{(T)}(x).

Figure 3.1. LinkBoost framework.
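The loop in Figure 3.1 can be sketched as follows. This is an illustrative rendering, not the dissertation's implementation: base_fit stands for an arbitrary supervised algorithm A and is assumed to return a predict function with labels in {-1, +1}, and the per-iteration sample size and numerical guards are arbitrary choices.

```python
import numpy as np

def linkboost(S, X, y_l, n_l, base_fit, T=20, seed=0):
    """Sketch of Figure 3.1 for labels in {-1, +1}. S: n x n similarity from
    (3.2); X: n x d features (one example per row); y_l: array of labels for
    the first n_l examples; base_fit(X, y) -> predict(X)."""
    rng = np.random.default_rng(seed)
    n_u = S.shape[0] - n_l
    H = np.zeros(n_u)                            # combined scores on unlabeled data
    for _ in range(T):
        E = np.exp(np.clip(H[:, None] - H[None, :], -30, 30))  # exp(H_i - H_j)
        Suu = S[n_l:, n_l:]
        a = 0.5 * (Suu * E).sum(axis=1)          # (3.11): sum_i S_{j,i} exp(H_j - H_i)
        b = 0.5 * (Suu * E).sum(axis=0)          # (3.12): sum_i S_{i,j} exp(H_i - H_j)
        Slu = S[:n_l, n_l:]
        c = Slu[y_l == -1].sum(axis=0) * np.exp(2 * H)    # (3.16)
        d = Slu[y_l == 1].sum(axis=0) * np.exp(-2 * H)    # (3.17)
        a, b = a / (a + b).sum(), b / (a + b).sum()       # (3.20)-(3.21)
        c, d = c / (c + d).sum(), d / (c + d).sum()       # (3.22)-(3.23)
        w = b + d - a - c                        # pseudo-label scores
        pseudo = np.where(w >= 0, 1.0, -1.0)
        p = (np.abs(w) + 1e-12) / (np.abs(w) + 1e-12).sum()
        idx = rng.choice(n_u, size=max(1, n_u // 4), replace=False, p=p)
        X_train = np.vstack([X[:n_l], X[n_l:][idx]])
        y_train = np.concatenate([y_l, pseudo[idx]])
        predict = base_fit(X_train, y_train)     # train the base classifier A
        h = predict(X[n_l:])                     # predictions in {-1, +1}
        num = ((a + c) * (h == -1) + (b + d) * (h == 1)).sum()
        den = ((a + c) * (h == 1) + (b + d) * (h == -1)).sum()
        alpha = 0.25 * np.log((num + 1e-12) / (den + 1e-12))  # (3.30)
        H = H + alpha * h
    return np.where(H >= 0, 1, -1)
```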
3.3.3 LinkBoost Framework Summary

The LinkBoost framework can be summarized as the meta algorithm shown in Figure 3.1. As we can see, the LinkBoost framework is able to take any supervised algorithm as the base classifier. This is more useful than other link-based classification methods that design a special-purpose algorithm. Consider the situation where we have found a specific classifier that works particularly well for a given domain. When we gain more understanding of the dataset (in the form of "links"), we want to stick with the good classifier and further improve its accuracy. This requires a framework that is flexible enough to accommodate any classifier, and that is able to evaluate its performance (without knowing the ground truth) and make adjustments. LinkBoost meets these requirements in the following way: it iteratively applies the classifier with a different training set; by checking the inconsistency between the current classification results and the structure of the data estimated from the link information and the data representations, it generates heuristics for assigning pseudo labels to unlabeled data, thus updating the training set; and at the same time, LinkBoost estimates the appropriate amount of trust to put on the current prediction for the final combination. It is also worth noting that even when no link information is available, LinkBoost can still work as a generic semi-supervised learning algorithm, by simply using the similarity computed from the data representations alone in (3.2) in Section 3.3.1.

3.4 Experiments and Analysis

In this section, we present experiments that verify the proposed LinkBoost framework. Specifically, through the experiments we try to address the following two research questions:

1. Is the proposed LinkBoost framework effective in improving classification performance with link information?
2. As a general semi-supervised boosting framework, is LinkBoost able to improve any supervised classification algorithm?

3.4.1 Experiment Setup

Two datasets of scientific publications are used in our experiments: "Cora" and "Citeseer". We describe the two datasets here.
ID  Class Name                Size
1   Neural Networks           818
2   Rule Learning             180
3   Reinforcement Learning    217
4   Probabilistic Methods     426
5   Theory                    351
6   Genetic Algorithms        418
7   Case Based                298

Table 3.1. Classes in the Cora dataset.

ID  Class Name                    Size
1   Agents                        596
2   Information Retrieval         668
3   Database                      701
4   Artificial Intelligence       249
5   Human-computer Interaction    508
6   Machine Learning              590

Table 3.2. Classes in the Citeseer dataset.

Cora dataset: The Cora dataset consists of 2708 scientific publications classified into one of seven classes, as shown in Table 3.1. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary, which consists of 1433 unique words in total. The citation network consists of 5429 links. By checking against the ground truth class assignments of the publications, we find that about 81.38% of the links correctly indicate that the linked publication pair should be put in the same class. Furthermore, the links are not evenly distributed across the publications. For example, about 100 publications are each involved in more than 10 links, while 1143 publications are not involved in any link.

Citeseer dataset: The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes, as shown in Table 3.2. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary, which consists of 3703 unique words in total. The citation network consists of 4732 links. By checking against the ground truth class assignments of the publications, we find that about 74.61% of the links correctly indicate that the linked publication pair should be put in the same class. Furthermore, the links are not evenly distributed across the publications. For example, about 50 publications are each involved in more than 10 links, while 1358 publications are not involved in any link.

As we can see, the link information in both datasets is sparse and noisy. Therefore, these two datasets can be seen as representative of real-world data.

Given a supervised classification algorithm, we apply the proposed LinkBoost framework to boost its classification performance on the two datasets described above, using the link information. Due to the regularity presented in the two datasets, we use both incoming and outgoing links to form the link matrix, i.e., R + R^T is used in Equation (3.2) to compute the similarity between data examples, and the weight factor a is set to 0.4. As mentioned before, the proposed LinkBoost framework can also accommodate an augmented feature representation, for example from the Feature Representation Augmentation algorithm reviewed in Section 3.2.3. In our experiments, we implemented LinkBoost on both the original feature space and the augmented feature space obtained with the algorithm from Section 3.2.3. For brevity, we refer to the former method as "LB-OR" and the latter as "LB-AR". In LB-AR, the link matrix in Equation (3.1) takes the form R + R^T, and the weight factor \lambda = 0.5.

To compare with the proposed LinkBoost algorithm, two baseline algorithms are implemented. The first baseline algorithm is Feature Representation Augmentation ("AugRep" for short). The second is Co-training ("Cotrain" for short). Both baseline algorithms were reviewed in Section 3.2.3.
Since only a binary classification model is discussed in this chapter, we tested all possible class pairs in both datasets. For each class pair, we randomly chose 5% of the data examples for training and the rest for testing. Each classification was repeated 10 times using different randomly selected training data, and the classification accuracy was averaged over the 10-fold experiments.

3.4.2 Robustness against Sparse Link Information

As discussed in Section 3.4.1, the link information in both the Cora and Citeseer datasets is noisy. To further verify the proposed LinkBoost algorithm and its robustness against sparse link information, we gradually decrease the number of links being used, and compare the classification performance of the four algorithms. Table 3.3 and Table 3.4 give the classification accuracies on the Cora dataset; Table 3.5 and Table 3.6 give the classification accuracies on the Citeseer dataset. In this group of experiments, the Support Vector Machine (implemented by the SVMLight software) is used as the base supervised classification algorithm to be boosted by the LinkBoost framework.

As we can see, with all the available link information being used, LB-AR and AugRep deliver comparable performance, with LB-AR slightly better overall. LB-OR is suboptimal among the four methods, and Cotrain almost always performs worst. As the number of links being used decreases, the performance of all four methods degrades, as expected. However, as the link information gradually becomes even sparser, the advantage of LB-AR over AugRep becomes significant, as their performance gap widens. Moreover, when only 10%-30% of the links are used, LB-OR also outperforms AugRep. These observations suggest that LinkBoost is more robust against the sparseness of link information.
Classes  Model    Percentage of Links Used
                  100%  90%   80%   70%   60%   50%   40%   30%   20%   10%
1 vs 2   AugRep   .900  .898  .891  .888  .883  .875  .866  .859  .849  .841
         Cotrain  .836  .835  .837  .839  .841  .845  .843  .844  .844  .838
         LB-OR    .891  .889  .890  .890  .884  .890  .882  .886  .888  .885
         LB-AR    .911  .911  .905  .907  .900  .896  .897  .889  .889  .898
1 vs 3   AugRep   .912  .910  .907  .902  .897  .893  .883  .873  .863  .846
         Cotrain  .840  .841  .840  .836  .842  .842  .839  .848  .849  .846
         LB-OR    .877  .882  .876  .873  .870  .869  .862  .864  .866  .864
         LB-AR    .911  .908  .908  .903  .895  .891  .880  .874  .869  .857
1 vs 4   AugRep   .841  .835  .826  .822  .810  .798  .794  .779  .759  .748
         Cotrain  .732  .729  .727  .726  .728  .727  .730  .727  .735  .744
         LB-OR    .809  .804  .807  .806  .794  .789  .784  .772  .768  .775
         LB-AR    .850  .840  .836  .832  .815  .809  .802  .804  .779  .782
1 vs 5   AugRep   .847  .844  .841  .832  .826  .816  .807  .790  .777  .770
         Cotrain  .763  .759  .753  .750  .748  .746  .747  .752  .757  .762
         LB-OR    .821  .814  .811  .806  .800  .802  .795  .792  .788  .786
         LB-AR    .846  .840  .845  .833  .824  .814  .807  .789  .784  .782
1 vs 6   AugRep   .944  .941  .935  .929  .921  .909  .897  .876  .862  .861
         Cotrain  .833  .840  .836  .831  .835  .837  .840  .843  .849  .851
         LB-OR    .864  .863  .860  .857  .852  .855  .847  .844  .840  .837
         LB-AR    .945  .939  .932  .923  .920  .904  .891  .874  .857  .853
1 vs 7   AugRep   .894  .889  .885  .877  .868  .857  .843  .828  .817  .803
         Cotrain  .789  .791  .789  .784  .786  .786  .786  .796  .798  .799
         LB-OR    .854  .849  .836  .831  .833  .833  .829  .824  .824  .821
         LB-AR    .899  .896  .890  .884  .876  .868  .854  .842  .841  .830
2 vs 3   AugRep   .889  .879  .861  .851  .832  .802  .783  .745  .703  .705
         Cotrain  .755  .759  .756  .768  .761  .757  .760  .760  .735  .725
         LB-OR    .882  .875  .885  .868  .874  .851  .863  .846  .848  .805
         LB-AR    .891  .883  .869  .860  .844  .833  .830  .827  .815  .806
2 vs 4   AugRep   .852  .853  .847  .839  .830  .819  .801  .788  .776  .760
         Cotrain  .764  .758  .754  .750  .756  .762  .759  .767  .763  .752
         LB-OR    .886  .887  .890  .874  .877  .877  .861  .857  .842  .850
         LB-AR    .942  .937  .934  .938  .920  .926  .902  .884  .893  .879
2 vs 5   AugRep   .741  .732  .729  .725  .721  .710  .713  .706  .700  .687
         Cotrain  .689  .687  .684  .686  .689  .688  .688  .695  .698  .688
         LB-OR    .785  .778  .761  .766  .764  .767  .766  .744  .749  .738
         LB-AR    .808  .815  .807  .799  .775  .774  .756  .742  .733  .737
2 vs 6   AugRep   .924  .920  .905  .889  .883  .862  .841  .825  .803  .771
         Cotrain  .772  .779  .780  .781  .775  .769  .769  .776  .771  .782
         LB-OR    .885  .884  .879  .879  .879  .862  .867  .856  .847  .831
         LB-AR    .934  .919  .919  .888  .896  .896  .883  .875  .859  .843
2 vs 7   AugRep   .732  .733  .731  .723  .717  .703  .690  .679  .670  .654
         Cotrain  .650  .657  .643  .639  .654  .657  .664  .662  .665  .659
         LB-OR    .751  .765  .759  .757  .757  .756  .758  .757  .749  .742
         LB-AR    .773  .773  .767  .760  .760  .763  .734  .728  .747  .752

Table 3.3. Classification accuracy comparison on the Cora dataset (Part A).
Classes  Model    Percentage of Links Used
                  100%  90%   80%   70%   60%   50%   40%   30%   20%   10%
3 vs 4   AugRep   .940  .934  .930  .928  .912  .899  .883  .865  .846  .831
         Cotrain  .840  .838  .840  .836  .844  .858  .858  .859  .864  .848
         LB-OR    .902  .904  .895  .902  .906  .908  .897  .904  .899  .889
         LB-AR    .947  .944  .942  .939  .924  .921  .909  .889  .880  .862
3 vs 5   AugRep   .872  .870  .862  .854  .838  .817  .804  .786  .759  .750
         Cotrain  .760  .749  .759  .764  .763  .763  .769  .759  .761  .763
         LB-OR    .831  .832  .826  .828  .812  .807  .806  .799  .778  .781
         LB-AR    .872  .874  .871  .853  .839  .824  .802  .799  .767  .778
3 vs 6   AugRep   .903  .898  .893  .888  .879  .862  .845  .833  .805  .795
         Cotrain  .775  .772  .780  .785  .792  .784  .791  .785  .789  .777
         LB-OR    .856  .847  .844  .847  .845  .839  .839  .830  .829  .815
         LB-AR    .914  .901  .899  .892  .883  .862  .862  .842  .831  .823
3 vs 7   AugRep   .871  .868  .862  .855  .836  .827  .808  .774  .746  .739
         Cotrain  .713  .724  .717  .722  .732  .743  .755  .755  .744  .740
         LB-OR    .797  .782  .783  .790  .784  .787  .806  .788  .800  .789
         LB-AR    .870  .868  .857  .850  .841  .848  .825  .804  .825  .785
4 vs 5   AugRep   .899  .891  .880  .874  .864  .860  .861  .829  .819  .810
         Cotrain  .809  .811  .812  .803  .808  .809  .806  .818  .800  .798
         LB-OR    .844  .800  .818  .819  .827  .831  .831  .838  .841  .831
         LB-AR    .898  .893  .878  .879  .868  .859  .864  .827  .826  .794
4 vs 6   AugRep   .959  .950  .948  .936  .923  .912  .892  .860  .833  .829
         Cotrain  .808  .805  .815  .818  .818  .816  .816  .810  .818  .806
         LB-OR    .923  .892  .898  .910  .915  .909  .914  .915  .919  .923
         LB-AR    .974  .968  .959  .947  .951  .940  .934  .925  .908  .895
4 vs 7   AugRep   .904  .894  .885  .869  .855  .836  .821  .799  .774  .765
         Cotrain  .769  .775  .778  .774  .767  .774  .760  .770  .777  .778
         LB-OR    .829  .806  .799  .809  .813  .819  .810  .832  .832  .844
         LB-AR    .916  .893  .899  .886  .883  .874  .851  .838  .821  .811
5 vs 6   AugRep   .926  .924  .919  .910  .898  .874  .851  .839  .828  .809
         Cotrain  .807  .802  .806  .794  .796  .796  .795  .797  .809  .798
         LB-OR    .836  .822  .826  .835  .840  .847  .847  .853  .862  .866
         LB-AR    .926  .925  .917  .911  .901  .871  .852  .844  .843  .816
5 vs 7   AugRep   .812  .810  .807  .803  .796  .771  .747  .729  .713  .714
         Cotrain  .706  .707  .702  .705  .711  .716  .718  .712  .712  .709
         LB-OR    .782  .739  .745  .764  .749  .760  .769  .759  .776  .774
         LB-AR    .821  .821  .810  .803  .805  .788  .776  .755  .763  .743
6 vs 7   AugRep   .940  .934  .929  .919  .909  .893  .875  .853  .827  .823
         Cotrain  .796  .779  .774  .769  .779  .774  .759  .769  .781  .789
         LB-OR    .846  .828  .833  .837  .831  .839  .843  .850  .843  .843
         LB-AR    .940  .930  .929  .918  .904  .892  .876  .859  .837  .832

Table 3.4. Classification accuracy comparison on the Cora dataset (Part B).
Classes  Model    Percentage of Links Used
                  100%  90%   80%   70%   60%   50%   40%   30%   20%   10%
1 vs 2   AugRep   .939  .933  .928  .928  .921  .921  .915  .903  .906  .902
         Cotrain  .907  .910  .910  .909  .910  .908  .911  .911  .911  .905
         LB-OR    .938  .934  .933  .931  .929  .923  .927  .929  .927  .923
         LB-AR    .946  .942  .939  .937  .931  .932  .928  .917  .916  .909
1 vs 3   AugRep   .932  .928  .916  .915  .906  .897  .887  .884  .886  .893
         Cotrain  .905  .905  .906  .905  .905  .906  .906  .904  .903  .902
         LB-OR    .921  .923  .918  .918  .916  .917  .911  .913  .906  .905
         LB-AR    .925  .922  .917  .916  .917  .912  .906  .908  .900  .896
1 vs 4   AugRep   .785  .773  .763  .763  .760  .751  .739  .732  .732  .734
         Cotrain  .748  .744  .743  .742  .737  .738  .739  .739  .735  .732
         LB-OR    .760  .759  .756  .755  .750  .755  .751  .753  .755  .755
         LB-AR    .778  .772  .765  .753  .752  .747  .735  .738  .742  .742
1 vs 5   AugRep   .895  .890  .881  .879  .867  .865  .853  .850  .854  .853
         Cotrain  .848  .852  .856  .858  .862  .858  .857  .859  .854  .851
         LB-OR    .859  .860  .863  .863  .860  .870  .870  .865  .856  .853
         LB-AR    .895  .888  .887  .881  .875  .870  .864  .865  .853  .854
1 vs 6   AugRep   .904  .897  .892  .877  .860  .851  .848  .838  .834  .850
         Cotrain  .864  .859  .866  .865  .865  .869  .873  .873  .875  .881
         LB-OR    .883  .860  .863  .863  .860  .870  .870  .865  .856  .853
         LB-AR    .903  .898  .893  .885  .875  .875  .877  .869  .876  .882
2 vs 3   AugRep   .845  .840  .838  .835  .822  .818  .803  .797  .780  .791
         Cotrain  .795  .798  .795  .800  .793  .794  .797  .794  .788  .787
         LB-OR    .807  .803  .810  .811  .803  .804  .794  .794  .797  .794
         LB-AR    .842  .839  .838  .836  .825  .824  .811  .796  .790  .784
2 vs 4   AugRep   .787  .782  .776  .778  .778  .770  .768  .758  .757  .757
         Cotrain  .795  .755  .757  .753  .751  .754  .749  .746  .746  .743
         LB-OR    .780  .773  .777  .773  .773  .773  .773  .773  .775  .776
         LB-AR    .784  .782  .781  .777  .775  .774  .765  .767  .773  .774
2 vs 5   AugRep   .870  .859  .854  .837  .829  .811  .799  .801  .789  .793
         Cotrain  .804  .808  .812  .811  .806  .806  .808  .803  .808  .805
         LB-OR    .841  .846  .843  .843  .845  .838  .830  .824  .817  .804
         LB-AR    .872  .869  .867  .856  .855  .844  .836  .822  .810  .791

Table 3.5. Classification accuracy comparison on the Citeseer dataset (Part A).
Classes  Model    Percentage of Links Used
                  100%  90%   80%   70%   60%   50%   40%   30%   20%   10%
2 vs 6   AugRep   .834  .831  .828  .824  .815  .793  .769  .760  .742  .765
         Cotrain  .803  .801  .792  .791  .787  .784  .773  .774  .782  .781
         LB-OR    .791  .793  .792  .788  .783  .785  .783  .779  .775  .774
         LB-AR    .829  .829  .824  .819  .807  .782  .778  .765  .748  .754
3 vs 4   AugRep   .766  .765  .759  .757  .756  .756  .753  .751  .750  .749
         Cotrain  .754  .755  .756  .753  .754  .754  .754  .752  .752  .752
         LB-OR    .781  .775  .771  .771  .771  .767  .769  .770  .772  .768
         LB-AR    .770  .773  .766  .761  .761  .759  .755  .752  .758  .767
3 vs 5   AugRep   .890  .883  .869  .856  .840  .830  .823  .820  .817  .821
         Cotrain  .848  .852  .846  .845  .845  .844  .843  .844  .839  .831
         LB-OR    .887  .884  .883  .885  .884  .878  .883  .881  .875  .869
         LB-AR    .905  .899  .890  .885  .881  .873  .869  .866  .862  .863
3 vs 6   AugRep   .874  .865  .854  .844  .839  .831  .821  .815  .803  .809
         Cotrain  .830  .832  .832  .837  .837  .836  .835  .834  .831  .829
         LB-OR    .853  .853  .849  .851  .846  .846  .839  .841  .841  .833
         LB-AR    .878  .870  .861  .854  .854  .844  .842  .845  .831  .835
4 vs 5   AugRep   .726  .726  .721  .713  .709  .703  .705  .709  .704  .701
         Cotrain  .707  .705  .702  .699  .700  .700  .699  .698  .699  .703
         LB-OR    .726  .724  .724  .723  .719  .722  .721  .723  .726  .726
         LB-AR    .731  .724  .725  .725  .726  .721  .726  .721  .717  .719
4 vs 6   AugRep   .713  .712  .711  .711  .710  .708  .710  .709  .709  .710
         Cotrain  .706  .706  .707  .706  .706  .711  .709  .710  .709  .710
         LB-OR    .713  .714  .711  .712  .714  .714  .711  .713  .713  .713
         LB-AR    .714  .711  .711  .710  .706  .705  .707  .707  .709  .712
5 vs 6   AugRep   .876  .861  .851  .848  .836  .820  .809  .805  .807  .814
         Cotrain  .814  .817  .814  .816  .820  .824  .824  .817  .810  .810
         LB-OR    .862  .855  .852  .847  .852  .838  .844  .835  .831  .826
         LB-AR    .882  .864  .859  .862  .854  .848  .839  .835  .820  .822

Table 3.6. Classification accuracy comparison on the Citeseer dataset (Part B).

Comparing the performance of LB-OR and LB-AR, i.e., applying LinkBoost on the original or on the augmented feature space, leads to the following findings: when 80%-100% of the links are used, LB-AR performs better than LB-OR; however, when only 10%-30% of the links are used, LB-OR is slightly better than LB-AR. A possible explanation for these findings is that when a significant number of data examples are involved in links, the classification process is doubly boosted by LB-AR, both from the data representation augmentation and from the training set augmentation; but when the link information is extremely sparse, the representation augmentation in a bag-of-words manner is less reliable compared to the training set augmentation, which selectively supplements the training pool with highly confident predictions. How to reliably utilize extremely sparse link information, however, needs further study.

3.4.3 Boosting Power for Supervised Algorithms

To verify the boosting power of the proposed LinkBoost framework, we apply it to several base supervised classifiers and compare the performance with that of each base classifier alone. In particular, five supervised algorithms are used:

- Support Vector Machine ("svm" for short), whose performance has proved to be among the best in many text classification applications.
- J48 decision trees ("j48" for short).
- HyperPipe classifier ("hpp" for short), which constructs for each category a "hyperpipe" that contains all points of that category (essentially recording the attribute bounds observed for each category). Predictions on test instances are made according to the category that most contains the instance.
- Simple Naive Bayes classifier ("nbs" for short).
- Voted Perceptron ("vp" for short).

For "svm", we used the SVMLight implementation. For the other four classifiers, we used the implementations in the Weka software (http://www.cs.waikato.ac.nz/ml/weka/).
Classes   svm   LB-OR   j48   LB-OR   hpp   LB-OR   nbs   LB-OR    vp   LB-OR
1 vs 2   0.843  0.891  0.819  0.930  0.845  0.851  0.853  0.892  0.846  0.874
1 vs 3   0.843  0.877  0.877  0.925  0.813  0.799  0.822  0.849  0.817  0.844
1 vs 4   0.744  0.809  0.733  0.866  0.734  0.733  0.749  0.796  0.712  0.777
1 vs 5   0.772  0.821  0.725  0.834  0.744  0.750  0.767  0.799  0.752  0.789
1 vs 6   0.854  0.864  0.863  0.929  0.769  0.771  0.756  0.820  0.797  0.834
1 vs 7   0.802  0.854  0.768  0.902  0.807  0.821  0.803  0.859  0.805  0.841
2 vs 3   0.732  0.882  0.651  0.915  0.776  0.847  0.772  0.892  0.699  0.812
2 vs 4   0.750  0.886  0.748  0.933  0.800  0.852  0.810  0.897  0.785  0.871
2 vs 5   0.685  0.785  0.680  0.800  0.711  0.733  0.719  0.769  0.676  0.734
2 vs 6   0.773  0.885  0.747  0.913  0.796  0.835  0.791  0.887  0.765  0.858
2 vs 7   0.653  0.751  0.628  0.797  0.712  0.742  0.690  0.767  0.650  0.717
3 vs 4   0.855  0.902  0.837  0.943  0.775  0.789  0.785  0.876  0.791  0.843
3 vs 5   0.758  0.831  0.744  0.892  0.749  0.780  0.736  0.841  0.697  0.757
3 vs 6   0.803  0.856  0.804  0.894  0.732  0.771  0.736  0.834  0.767  0.817
3 vs 7   0.748  0.797  0.741  0.887  0.741  0.758  0.723  0.826  0.730  0.785
4 vs 5   0.809  0.844  0.691  0.888  0.740  0.750  0.750  0.806  0.734  0.792
4 vs 6   0.846  0.923  0.795  0.955  0.807  0.806  0.766  0.895  0.799  0.870
4 vs 7   0.771  0.829  0.681  0.861  0.783  0.790  0.753  0.844  0.742  0.817
5 vs 6   0.806  0.836  0.779  0.916  0.743  0.759  0.743  0.857  0.770  0.823
5 vs 7   0.708  0.782  0.672  0.822  0.723  0.739  0.708  0.804  0.688  0.740
6 vs 7   0.819  0.846  0.724  0.913  0.780  0.783  0.728  0.834  0.787  0.816

Table 3.7. Boosting classification accuracy on the Cora dataset.

Classes   svm   LB-OR   j48   LB-OR   hpp   LB-OR   nbs   LB-OR    vp   LB-OR
1 vs 2   0.904  0.938  0.853  0.946  0.785  0.865  0.817  0.903  0.818  0.882
1 vs 3   0.901  0.921  0.870  0.929  0.790  0.838  0.793  0.862  0.832  0.869
1 vs 4   0.741  0.760  0.733  0.772  0.737  0.747  0.738  0.761  0.734  0.755
1 vs 5   0.854  0.859  0.840  0.887  0.749  0.796  0.752  0.815  0.754  0.810
1 vs 6   0.869  0.883  0.842  0.898  0.754  0.803  0.775  0.844  0.782  0.827
2 vs 3   0.786  0.807  0.665  0.824  0.705  0.773  0.723  0.780  0.742  0.773
2 vs 4   0.754  0.780  0.717  0.798  0.777  0.776  0.778  0.821  0.743  0.785
2 vs 5   0.802  0.841  0.667  0.859  0.763  0.809  0.773  0.840  0.726  0.786
2 vs 6   0.785  0.791  0.639  0.800  0.720  0.738  0.728  0.751  0.715  0.741
3 vs 4   0.754  0.781  0.720  0.807  0.768  0.777  0.769  0.789  0.755  0.773
3 vs 5   0.843  0.887  0.717  0.904  0.791  0.842  0.782  0.868  0.771  0.825
3 vs 6   0.819  0.853  0.726  0.864  0.760  0.796  0.762  0.825  0.745  0.797
4 vs 5   0.702  0.726  0.681  0.758  0.724  0.713  0.730  0.759  0.697  0.738
4 vs 6   0.709  0.713  0.656  0.706  0.706  0.704  0.695  0.691  0.663  0.681
5 vs 6   0.813  0.862  0.681  0.872  0.766  0.804  0.774  0.839  0.742  0.797

Table 3.8. Boosting classification accuracy on the Citeseer dataset.

The original data representation is used both in the base classifier and in the LinkBoost-boosted classifier. Table 3.7 and Table 3.8 compare the performance of the base classifiers with their performance within the LinkBoost framework, on all class pairs in the Cora and Citeseer datasets respectively. In both tables, each column pair contains the classification accuracy of the base classifier on the left, and, on the right under "LB-OR", the classification accuracy of the corresponding base classifier within the LinkBoost framework (using the original data representations). As we can see, in most cases the LinkBoost framework is able to significantly improve the classification accuracy. These results empirically demonstrate the boosting power of the LinkBoost framework, i.e., it is able to improve any supervised algorithm by effectively exploiting link information.
3.5 Conclusions

In this chapter, we proposed a semi-supervised learning framework, named "LinkBoost", for boosting classification performance when side information in the form of links is available. LinkBoost is designed to turn any supervised algorithm into a semi-supervised one and improve its classification performance. Experiments show that LinkBoost is robust against the noisy and sparse nature of link information, and that it does improve the classification accuracy of several typical supervised algorithms.

CHAPTER 4

Semi-supervised Clustering with Pairwise Constraints

In this chapter, a novel boosting framework for semi-supervised clustering will be described. Starting with the problem definition of semi-supervised clustering, we will review a few major approaches in previous studies on this problem, and then present our boosting idea with a formal description of the related algorithms. Experiment results and discussions will be provided as empirical validation. Finally, we will summarize our work and raise a few issues for future work.

4.1 Problem Definition

Data clustering, also called unsupervised learning, is one of the key techniques in data mining that is used to understand and mine the structure of unlabeled data. The idea of improving clustering by side information, sometimes called semi-supervised clustering or constrained data clustering, has received a significant amount of attention in recent studies on data clustering. Often, the side information is presented in the form of pairwise constraints: the must-link pairs, where data points should belong to the same cluster, and the mustnot-link pairs, where data points should belong to different clusters. Table 4.1 summarizes the notations that will be used throughout the chapter.

$n$              Total number of data examples.
$d$              The number of attributes for each data example.
$\mathbf{x}_i$   A $d$-dimensional vector representing the $i$-th data example. We also use $\mathbf{x}_i$ to refer to the $i$-th data example.
$\mathcal{X}$    The set of all data examples, i.e., $\mathcal{X} = \{\mathbf{x}_i\}_{i=1}^n$.
$X$              A matrix $X = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ that gathers the vector representations of all the data examples.
$\mathcal{S}^+$  The set of all must-link pairs of data examples.
$S^+$            An $n \times n$ matrix where $S^+_{i,j}$ is one when examples $\mathbf{x}_i$ and $\mathbf{x}_j$ form a must-link pair, and zero otherwise.
$\mathcal{S}^-$  The set of all mustnot-link pairs of data examples.
$S^-$            An $n \times n$ matrix where $S^-_{i,j}$ is one when examples $\mathbf{x}_i$ and $\mathbf{x}_j$ form a mustnot-link pair, and zero otherwise.
$c$              For classification or clustering problems, $c$ gives the number of classes.
$l_i$            For classification or clustering problems, $l_i$ takes values from the set $\{1, \ldots, c\}$ and indicates the class label for the $i$-th data example.

Table 4.1. Notations.
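As a concrete illustration of the notation in Table 4.1, the following sketch builds the constraint matrices $S^+$ and $S^-$ from lists of must-link and mustnot-link index pairs. It is a minimal sketch for illustration only; the function name and the pair-list input format are assumptions, not part of the original formulation.

```python
import numpy as np

def build_constraint_matrices(n, must_links, mustnot_links):
    """Build the n x n constraint matrices S+ and S- of Table 4.1.

    must_links / mustnot_links are iterables of index pairs (i, j);
    both matrices are kept symmetric since the pairs are unordered.
    """
    S_plus = np.zeros((n, n))
    S_minus = np.zeros((n, n))
    for i, j in must_links:
        S_plus[i, j] = S_plus[j, i] = 1.0
    for i, j in mustnot_links:
        S_minus[i, j] = S_minus[j, i] = 1.0
    return S_plus, S_minus

# Example: 5 points, one must-link (0, 1) and one mustnot-link (0, 4).
S_plus, S_minus = build_constraint_matrices(5, [(0, 1)], [(0, 4)])
```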
4.2 Review on Previous Studies

There are two major approaches to semi-supervised clustering: the approach based on constraint satisfaction and the approach based on distance metric learning. The first approach employs the side information to restrict the solution space, and only finds solutions that are consistent with the pairwise constraints. The second approach first learns a distance metric from the given pairwise constraints, and computes the pairwise similarity using the learned distance metric. The computed similarity matrix is then used for data clustering. In this section, we will review some major work in both approaches and recapitulate a few selected representative algorithms, followed by a brief summary of these previous studies.

4.2.1 Approach Based on Constraints Satisfaction

The constraint-based approach for semi-supervised clustering utilizes the side information to restrict the feasible solutions when deciding the cluster assignment. Early work in this category took the side information as hard constraints, and only considered cluster assignments that were absolutely consistent with the given pairwise constraints. In [148, 20], the authors proposed constrained K-means algorithms that adjust the cluster memberships to be consistent with the pairwise constraints. In [133], a generalized Expectation Maximization (EM) algorithm is proposed to incorporate the pairwise constraints into the EM algorithm. In particular, the cluster assignments that are inconsistent with the constraints are excluded from the partition function when computing the posterior probability of the cluster memberships.

One problem with treating the side information as hard constraints is that we may not be able to find feasible solutions that are consistent with all the constraints [44]. To overcome this problem, a number of studies view the side information as soft constraints. The key idea is to penalize, rather than exclude, the cluster assignments that are inconsistent with the given pairwise constraints. In [15, 108, 16], the authors present probabilistic models for semi-supervised clustering where the pairwise constraints are incorporated into the clustering algorithms through Bayesian priors. In [102], the authors modified the mixture model for data clustering by redefining the data generation process through the introduction of hidden variables. In [14], the authors extended the framework of semi-supervised clustering by selecting the most informative pairwise constraints to solicit the labeling information. In [45], the authors studied semi-supervised clustering for the hierarchical clustering algorithm.

4.2.2 Approach Based on Distance Metric Learning

Another approach to semi-supervised clustering is to first learn a distance metric from the given pairwise constraints. The pairwise similarity between any two examples is then computed based on the learned distance metric, and a clustering algorithm is applied to the computed similarity matrix. The key to this approach is to effectively learn a distance metric from the side information. Zhang et al. [165] proposed to learn a distance metric by a linear regression model. Xing et al. [155] formulated the distance metric learning problem as a constrained convex programming problem. This algorithm was extended to the nonlinear case in [98] through the introduction of kernels. Yang et al. [159] proposed a local distance metric algorithm that is designed to address the problem of distance metric learning for multi-modal data distributions. Goldberger et al. [64] presented the neighborhood component analysis algorithm, which learns a local distance metric by extending the nearest neighbor classifier. Weinberger [150] presented a large margin nearest-neighbor classifier for distance metric learning that extends neighborhood component analysis to a maximum margin framework. Discriminative component analysis [74] learns a distance metric by extending relevance component analysis to effectively explore both the must-link and the mustnot-link constraints simultaneously. In [71, 72], the authors extended boosting algorithms to learn a distance function from given pairwise constraints.
Schultz and Joachims [128] extended the framework of support vector machines to learn distance metrics from pairwise comparisons.

Finally, a few studies cluster data points by a similarity matrix that is directly modified according to the pairwise constraints. In [96], the authors proposed to modify the similarity matrix by linearly combining the original similarity matrix with the pairwise constraints. Klein et al. [91] proposed to modify the similarity matrix by propagating the pairwise constraints through the nearest neighbors.

4.2.3 Representative Algorithms

In the following, we recapitulate a few representative algorithms proposed for semi-supervised clustering.

CONSTRAINED K-MEANS CLUSTERING ALGORITHM

The constrained K-means algorithm is a modified K-means algorithm that ensures none of the pairwise constraints is violated during the iterative steps of the K-means algorithm. Specifically, the constrained K-means algorithm can be formulated as follows [148]:

Input
• $X$: matrix for the input data
• $S^+$: matrix for must-link pairs
• $S^-$: matrix for mustnot-link pairs

Output: cluster memberships

Algorithm
Step 1 Initialize $C_1, \ldots, C_k$ as the initial cluster centers.
Step 2 For each data point $\mathbf{x}_i$, assign it to the closest cluster center $C_j$ such that
  • for every $\mathbf{x}_k$ not belonging to the $j$-th cluster, $S^+_{i,k} \neq 1$;
  • for every $\mathbf{x}_k$ belonging to the $j$-th cluster, $S^-_{i,k} \neq 1$.
  If no $C_j$ satisfies the above rules, fail.
Step 3 Update each cluster center $C_j$ by averaging all the data points $\mathbf{x}_i$ in the corresponding cluster.
Step 4 Iterate through Step 2 and Step 3 until convergence.
Step 5 Return cluster memberships.

The main drawback of the above algorithm is that it can fail without yielding a feasible solution. As the algorithm shows, if no cluster assignment satisfying all the pairwise constraints can be found in Step 2, the algorithm stops. Another drawback of the constrained K-means algorithm is that it only makes efforts to satisfy the "known" constraints, but does not generalize those constraints to unseen data where the pairwise relationship is "unknown".

CONSTRAINED COMPLETE-LINK CLUSTERING ALGORITHM

The data examples involved in pairwise constraints can, in general, be viewed as representatives of their local neighborhoods. Having recognized this, it is natural to try to induce a set of new distance measurements over all the data examples from the limited number of pairwise constraints. This is the basic idea of the work presented in [91], which is also formulated as acquiring prior knowledge "from instance-level constraints to space-level constraints". A specific clustering algorithm (Complete-Link clustering) is then applied with the new distance measurements. The corresponding algorithm is named the Constrained Complete-Link (CCL) algorithm.

In the CCL algorithm, the distance between each pair of data examples is generated by explicitly adjusting an initial distance matrix computed from the data input patterns. The adjustment is done by first imposing the constraints, and then propagating them to the neighborhoods of the constrained examples. For must-link constraints, imposing the constraints means setting the distance between each must-link pair of data examples to zero. Then, the distance between every other pair of data examples is recomputed as the length of the shortest path connecting them (allowing the use of the "zero" length of the must-links). In this way, the must-links get propagated to their neighborhoods.
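The must-link imposition and propagation step just described can be sketched as follows, under the assumption that the propagation is realized with an all-pairs shortest-path (Floyd-Warshall) relaxation over the adjusted distance matrix; this is one straightforward reading of the step in [91], not the authors' implementation.

```python
import numpy as np

def propagate_must_links(D, must_links):
    """Impose must-links on a distance matrix D and propagate them.

    Each must-link distance is set to zero; every other distance is
    then replaced by the shortest-path length, which may route through
    the zero-length must-link edges (Floyd-Warshall relaxation).
    """
    D = D.copy()
    for i, j in must_links:
        D[i, j] = D[j, i] = 0.0
    n = D.shape[0]
    for k in range(n):  # standard Floyd-Warshall update
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D
```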
After this distance adjustment, the triangle inequality still holds, and the resulting distance matrix is still a valid "metric". For mustnot-links, imposing the constraints means setting the distance between each mustnot-link pair of data examples to a large value. However, further propagating the mustnot-links to their neighborhoods while keeping the adjusted distance matrix a valid "metric" would be computationally expensive. In [91], the propagation of mustnot-links is not carried out explicitly by adjusting the distance matrix, but is claimed to be implicitly done in the merging step of the subsequent Complete-Link clustering procedure¹.

¹ In the Complete-Link algorithm, the distance between two clusters is defined as the maximum distance between data examples from either cluster. Therefore, if $(\mathbf{x}_i, \mathbf{x}_j)$ forms a mustnot-link, merging $\mathbf{x}_k$ with $\mathbf{x}_i$ will result in a mustnot-link practically being constructed between $\mathbf{x}_k$ and $\mathbf{x}_j$.

The most notable contribution of the CCL algorithm is its effort to generalize the instance-level pairwise constraints to space-level distance measurements that can affect data examples beyond the constrained ones. As a result, it is reported in [91] to outperform the constrained K-means algorithm in clustering.

HMRF-KMEANS ALGORITHM

Basu et al. proposed a probabilistic framework based on Hidden Markov Random Fields (HMRFs) that combines constraint satisfaction and distance metric learning [15]. Based on this framework, a partitional clustering algorithm is designed, which is named the HMRF-Kmeans algorithm. Specifically, the following Hidden Markov Random Field is considered:

• A hidden set of random variables $\mathcal{L} = \{l_i\}_{i=1}^n$, where each random variable $l_i$ takes values from the set $\{1, \ldots, c\}$ and serves as a cluster membership indicator.
• An observable set $\mathcal{X} = \{\mathbf{x}_i\}_{i=1}^n$, where each random variable $\mathbf{x}_i$ is generated from a conditional probability distribution $\Pr(\mathbf{x}_i \mid l_i)$.

To incorporate the pairwise constraints, the probability of a particular cluster label configuration is expressed in the following Gibbs distribution form:

$\Pr(\mathcal{L}) = \frac{1}{Z_1} \exp\Big( -\sum_i \sum_j V(i,j) \Big)$   (4.1)

where $Z_1$ is a normalizing factor and

$V(i,j) = \begin{cases} f_+(\mathbf{x}_i, \mathbf{x}_j) & \text{if } (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{S}^+ \text{ but } l_i \neq l_j \\ f_-(\mathbf{x}_i, \mathbf{x}_j) & \text{if } (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{S}^- \text{ but } l_i = l_j \\ 0 & \text{otherwise} \end{cases}$

Here, $f_+(\mathbf{x}_i, \mathbf{x}_j)$ and $f_-(\mathbf{x}_i, \mathbf{x}_j)$ are two non-negative functions that penalize the violations of must-links and mustnot-links, respectively. The intuition behind the above treatment is that a higher probability should be assigned to a label configuration that satisfies more pairwise constraints.

The optimal set of cluster labels for all the data examples is acquired by solving the following MAP estimation problem:

$\Pr(\mathcal{L} \mid \mathcal{X}) \propto \Pr(\mathcal{L}) \Pr(\mathcal{X} \mid \mathcal{L})$

Since $\Pr(\mathcal{L})$ is given in (4.1), we need to decide $\Pr(\mathcal{X} \mid \mathcal{L})$ to carry out the MAP estimation. Assuming the set of random variables $\mathcal{X}$ to be conditionally independent given the set of hidden variables, i.e.,

$\Pr(\mathcal{X} \mid \mathcal{L}) = \prod_{\mathbf{x}_i \in \mathcal{X}} \Pr(\mathbf{x}_i \mid l_i)$   (4.2)

we further parametrize $\Pr(\mathbf{x}_i \mid l_i)$ as

$\Pr(\mathbf{x}_i \mid l_i) \propto \exp(-D(\mathbf{x}_i, \boldsymbol{\mu}_{l_i}))$   (4.3)

where $\boldsymbol{\mu}_k$ ($k = 1, \ldots, c$) is the cluster representative of the $k$-th cluster, and $D(\mathbf{x}_i, \boldsymbol{\mu}_{l_i})$ is a distortion measure between the $i$-th data example and the $l_i$-th cluster representative. Such a distortion measure can take various forms, such as cosine similarity or I-divergence. Combining (4.1)-(4.3), the MAP estimation leads to the following objective function for clustering:

$\max_{\{l_i\}_{i=1}^n} \; \exp\Big( -\sum_i \sum_j V(i,j) \Big) \exp\Big( -\sum_i D(\mathbf{x}_i, \boldsymbol{\mu}_{l_i}) \Big)$
Note that the above optimization problem is an "incomplete-data problem", since both the set of cluster labels $\{l_i\}_{i=1}^n$ and the cluster representatives $\{\boldsymbol{\mu}_k\}_{k=1}^c$ are unknown in a clustering setting; the Expectation Maximization (EM) method is used to find the solution. The EM steps can be formulated as a modified K-means algorithm (see [15] for details).

METRIC PAIRWISE CONSTRAINED KMEANS ALGORITHM

Metric Pairwise Constrained Kmeans (MPCKmeans) is another research attempt that tries to integrate distance metric learning into the iterative steps of clustering [24]. Again, an underlying K-means-style procedure (i.e., iteratively updating cluster memberships and cluster representatives) is assumed for the clustering. The objective function for the related optimization problem is

$\min_{\{l_i\}} \sum_i \Big[ (\mathbf{x}_i - \boldsymbol{\mu}_{l_i})^T A_{l_i} (\mathbf{x}_i - \boldsymbol{\mu}_{l_i}) - \log\det(A_{l_i}) \Big] + \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{S}^+} f_+(\mathbf{x}_i, \mathbf{x}_j)\, \delta(l_i \neq l_j) + \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{S}^-} f_-(\mathbf{x}_i, \mathbf{x}_j)\, \delta(l_i = l_j)$   (4.4)

where $A_{l_i}$ is the distance matrix associated with the $l_i$-th cluster, and

$f_+(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{2}(\mathbf{x}_i - \mathbf{x}_j)^T A_{l_i} (\mathbf{x}_i - \mathbf{x}_j) + \frac{1}{2}(\mathbf{x}_i - \mathbf{x}_j)^T A_{l_j} (\mathbf{x}_i - \mathbf{x}_j)$
$f_-(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{2}(\mathbf{x}'_{l_i} - \mathbf{x}''_{l_i})^T A_{l_i} (\mathbf{x}'_{l_i} - \mathbf{x}''_{l_i}) - \frac{1}{2}(\mathbf{x}_i - \mathbf{x}_j)^T A_{l_i} (\mathbf{x}_i - \mathbf{x}_j)$

where $(\mathbf{x}'_{l_i}, \mathbf{x}''_{l_i})$ is the maximally separated pair of points according to the $l_i$-th distance matrix $A_{l_i}$ (note that we only care about $f_-$ when $l_i = l_j$). The definition of $f_+$ allows a larger penalty to be imposed for violating a must-link constraint between a pair of distant data examples than between a pair of close data examples. This reflects the intuition that if two must-link examples are measured as far from each other by a distance metric (i.e., a large value of $(\mathbf{x}_i - \mathbf{x}_j)^T A_{l_i} (\mathbf{x}_i - \mathbf{x}_j)$ or $(\mathbf{x}_i - \mathbf{x}_j)^T A_{l_j} (\mathbf{x}_i - \mathbf{x}_j)$), we want to put more penalty on this situation so that the corresponding distance metric will be adjusted dramatically. A similar argument holds for the definition of $f_-$.

Note that there are four terms in the objective function (4.4). The first term addresses the expectation of compactness of the data clusters, as in the K-means algorithm, but allows each cluster to take a different shape; the second term regularizes the learned distance matrices; the third and fourth terms try to enforce the pairwise constraints.

Similar to the HMRF-Kmeans algorithm, the EM algorithm is used to solve the optimization problem (4.4), which can be formulated as a modified K-means algorithm (see [24] for details). The main advantages of the MPCKmeans algorithm are: 1) instead of using a fixed distance metric for clustering, it allows new distance metrics to be learned during the clustering procedure and hence improves clustering performance; 2) it allows different distance metrics to be learned for different clusters, so that clusters of different shapes can be detected. As reported in [24], MPCKmeans outperforms several semi-supervised clustering methods based purely on distance metric learning or purely on constraint satisfaction in data clustering applications.

4.2.4 Summary

Although a large number of studies have been devoted to semi-supervised clustering, most of them focus on designing special clustering algorithms that can effectively exploit the pairwise constraints. For instance, the algorithms in [15, 108, 16] are designed to improve probabilistic models for data clustering by incorporating the pairwise constraints into the priors of the probabilistic models; the constrained K-means algorithm [148] exploits the pairwise constraints by adjusting the cluster memberships to be consistent with the given constraints. However, it is often the case that we have a specific clustering algorithm that is specially designed for the target domain, and we are interested in improving the accuracy of this clustering algorithm with the available side information.
This motivates us to design a meta-algorithm that is able to improve any given clustering algorithm with the pairwise constraints. It is important to note that such a meta-algorithm should not make any assumption about the underlying clustering algorithm, so that it can be applied to any clustering algorithm. To this end, we propose a boosting framework for data clustering. More details will be given in the following section.

4.3 Boosting Clustering

With pairwise constraints available, it is desirable to utilize such side information to improve clustering performance, no matter which clustering algorithm is used. To this end, we propose a general boosting framework, termed BoostCluster. In this section, we will first illustrate the main idea of boosting clustering, followed by a formal description of the framework, including the related optimization problem and its solution, two variations of the meta-algorithm for boosting, and a discussion of the scalability issue.

Let $\mathcal{A}$ denote the given clustering algorithm to be improved. In order to make this framework general, we treat the clustering algorithm $\mathcal{A}$ as a black box that only takes the data representation of all examples as its input and outputs the cluster memberships for the given examples. Note that in this work, we assume the number of clusters is given.

4.3.1 Main Idea

Although a number of boosting algorithms (e.g., [55]) have been successfully applied to supervised learning, extending boosting algorithms to data clustering is significantly more challenging. The key difficulty is how to influence an arbitrary clustering algorithm with the given pairwise constraints. To overcome this challenge, we propose to encode the side information into the data representation that is the input to the clustering algorithm. More specifically, we first find the subspace in which the data points of the must-link pairs are close to each other while the data points of the mustnot-link pairs are far apart. A new data representation is then generated by projecting all the data points into this subspace, and is used by the given clustering algorithm to find the appropriate cluster assignments. Furthermore, the procedure for identifying the appropriate subspace based on the remaining unsatisfied constraints, and the procedure for clustering data points using the newly generated data representation, alternate iteratively until most of the constraints are satisfied.

Figure 4.1 illustrates the idea of iterative data projection. The data points used in this illustration are sampled from the "scale" dataset that will be described later in Section 4.4.1. They belong to three clusters that are labeled in Figure 4.1 by the legends A, o, and x, respectively. A partitional clustering algorithm is used in this illustration. Sub-figure (a) shows the original data distribution projected into a 2D space that is generated by Principal Component Analysis (PCA). We clearly see that many data points of cluster x overlap heavily with the data points of clusters A and o, and they are difficult to separate well. The must-link and mustnot-link constraints are indicated in sub-figure (a) by solid lines and dotted lines, respectively. Sub-figures (b)-(d) illustrate the projected data distributions based on the new representations that are generated by the proposed boosting framework in iterations 1, 2, and 7, respectively.
Evidently, the data representations generated in different iterations are helpful in separating the data points in cluster x from those in the other two clusters.

In practice, the whole boosting idea works in the following way: the original data representation and the pairwise constraints are used as input to the proposed boosting framework; then, in an iterative manner, new data representations are generated and fed into the "black-box" clustering algorithm, whose results are in turn used to find a subspace from which the data representations for the next iteration can be generated; during this iterative process, a kernel similarity matrix is updated. The kernel similarity matrix, which can be seen as learned from the above boosting procedure, incorporates the side information gathered from the pairwise constraints and improves the clustering performance. Figure 4.2 presents a flowchart that illustrates the working mechanism of the boosting framework. More details will be provided in the following subsections.

Figure 4.1. An example illustrating the idea of iterative data projections. Sub-figure (a) shows the original data distribution, projected to the space spanned by its two principal components; sub-figures (b)-(d) show the data distributions based on the new representations in the projected subspaces that are generated in iterations 1, 2, and 7.

Figure 4.2. The flowchart of the BoostCluster framework.

4.3.2 Objective Function

The first step in designing a boosting algorithm is to construct an appropriate objective function. Note that, as described in the introduction, the goal of the boosting algorithm is to identify the subspace that keeps the data points in the must-link pairs close to each other, and keeps the data points in the mustnot-link pairs well separated. To this end, we introduce the kernel similarity matrix $K \in \mathbb{R}^{n \times n}$, where $K_{i,j} \geq 0$ indicates the confidence of assigning examples $\mathbf{x}_i$ and $\mathbf{x}_j$ to the same cluster. Then, our goal is to iteratively construct this kernel similarity matrix by using the clustering algorithm $\mathcal{A}$ and the pairwise constraints in $\mathcal{S}^+$ and $\mathcal{S}^-$.
Since the ideal kernel matrix $K$ is expected to assign a large value to the examples in a must-link pair and a small value to the examples in a mustnot-link pair, we propose to minimize the following objective function:

$\mathcal{L}_{BCP} = \sum_{i,j=1}^n \sum_{a,b=1}^n S^+_{i,j} S^-_{a,b} \exp(K_{a,b} - K_{i,j})$   (4.5)

In the above, each term within the summation compares $K_{a,b}$, i.e., the similarity between two data points from a mustnot-link pair, to $K_{i,j}$, i.e., the similarity between two data points from a must-link pair. By minimizing the objective function in (4.5), we ensure that all the data points in the must-link pairs are more similar to each other than the data points in the mustnot-link pairs.

The objective function in (4.5) can also be written as:

$\mathcal{L}_{BCP} = \Big( \sum_{i,j=1}^n S^+_{i,j} \exp(-K_{i,j}) \Big) \Big( \sum_{a,b=1}^n S^-_{a,b} \exp(K_{a,b}) \Big)$   (4.6)

The above objective function is a product of two terms: the first term, i.e., $\sum_{i,j=1}^n S^+_{i,j} \exp(-K_{i,j})$, measures the inconsistency between the kernel similarity matrix $K$ and the must-link constraints; the second term, i.e., $\sum_{a,b=1}^n S^-_{a,b} \exp(K_{a,b})$, measures the inconsistency between the kernel similarity matrix $K$ and the mustnot-link constraints. Thus, by minimizing the product of the two terms, we enforce the kernel matrix $K$ to be consistent with the given pairwise constraints.

Instead of multiplying the two inconsistency measurements as in (4.6), we can also define the objective function as the sum of the two terms, i.e.,

$\mathcal{L}_{BCS} = \sum_{i,j=1}^n S^+_{i,j} \exp(-K_{i,j}) + c \sum_{a,b=1}^n S^-_{a,b} \exp(K_{a,b})$   (4.7)

The parameter $c$ in (4.7) balances the two inconsistency measurements. To differentiate these two objective functions, we refer to the objective function in (4.6) as BoostCluster by product, or BCP for short, and the objective function in (4.7) as BoostCluster by sum, or BCS for short.

4.3.3 The BoostCluster Framework

We will first describe the efficient optimization algorithm for BCP, followed by the algorithm for BCS.

THE BCP ALGORITHM

To boost the performance of a clustering algorithm $\mathcal{A}$ given a set of pairwise constraints, we follow the idea of boosting algorithms by iteratively improving the kernel similarity matrix $K$. Let $K$ denote the current kernel similarity matrix, and let $\Delta \in \{0,1\}^{n \times n}$ denote the incremental kernel similarity matrix that is inferred from the clustering results generated by the algorithm $\mathcal{A}$; in particular, $\Delta_{i,j} = 1$ when both $\mathbf{x}_i$ and $\mathbf{x}_j$ are assigned to the same cluster, and $\Delta_{i,j} = 0$ otherwise. The overall kernel matrix $K'$ is a linear combination of $K$ and $\Delta$, i.e.,

$K' = K + \alpha \Delta$   (4.8)

where $\alpha \geq 0$ is the combination weight. Then, the objective function $\mathcal{L}_{BCP}$ for the combined kernel $K'$, denoted by $\mathcal{L}_{BCP}(K')$, is written as:

$\mathcal{L}_{BCP}(K') = \sum_{i,j=1}^n \sum_{a,b=1}^n S^+_{i,j} S^-_{a,b} \exp(K'_{a,b} - K'_{i,j}) = \sum_{i,j=1}^n \sum_{a,b=1}^n p_{i,j}\, q_{a,b} \exp(-\alpha(\Delta_{i,j} - \Delta_{a,b}))$   (4.9)

where

$p_{i,j} = S^+_{i,j} \exp(-K_{i,j})$   (4.10)
$q_{a,b} = S^-_{a,b} \exp(K_{a,b})$   (4.11)

In the above, $p_{i,j}$ measures the inconsistency between the kernel matrix $K$ and the must-link pair $(\mathbf{x}_i, \mathbf{x}_j)$, and $q_{a,b}$ measures the inconsistency between $K$ and the mustnot-link pair $(\mathbf{x}_a, \mathbf{x}_b)$. We then employ Jensen's inequality to obtain an upper bound for the function in (4.9), i.e.,

$\exp\Big( \sum_{i=1}^m p_i x_i \Big) \leq \sum_{i=1}^m p_i \exp(x_i)$

where $p_i \geq 0$, $i = 1, 2, \ldots, m$, and $\sum_{i=1}^m p_i = 1$.
Using the above inequality, an upper bound for $\exp(-\alpha(\Delta_{i,j} - \Delta_{a,b}))$ can be constructed as follows:

$\exp(-\alpha(\Delta_{i,j} - \Delta_{a,b})) = \exp\Big( -3\alpha \cdot \frac{\Delta_{i,j} - \Delta_{a,b} + 1}{3} + 3\alpha \cdot \frac{1}{3} + 0 \times \frac{\Delta_{a,b} - \Delta_{i,j} + 1}{3} \Big)$
$\leq \frac{\Delta_{i,j} - \Delta_{a,b} + 1}{3} \exp(-3\alpha) + \frac{1}{3} \exp(3\alpha) + \frac{\Delta_{a,b} - \Delta_{i,j} + 1}{3}$   (4.12)

In the first step of the above derivation, we rewrite $-\alpha(\Delta_{i,j} - \Delta_{a,b})$ as a weighted sum over the probability distribution $((\Delta_{i,j} - \Delta_{a,b} + 1)/3,\; 1/3,\; (\Delta_{a,b} - \Delta_{i,j} + 1)/3)$. Note that $\Delta_{a,b} - \Delta_{i,j} + 1 \geq 0$ since $0 \leq \Delta_{i,j} \leq 1$. Using the upper bound in (4.12), we can now bound the objective function of BCP in (4.9) as follows:

$\mathcal{L}_{BCP}(K') = \sum_{i,j=1}^n \sum_{a,b=1}^n p_{i,j}\, q_{a,b} \exp(-\alpha(\Delta_{i,j} - \Delta_{a,b}))$
$\leq \frac{\exp(-3\alpha) - 1}{3} \sum_{i,j=1}^n \sum_{a,b=1}^n p_{i,j}\, q_{a,b} (\Delta_{i,j} - \Delta_{a,b}) + \frac{\exp(3\alpha) + \exp(-3\alpha) + 1}{3} \sum_{i,j=1}^n \sum_{a,b=1}^n p_{i,j}\, q_{a,b}$
$= -\frac{1 - \exp(-3\alpha)}{3} \Big( \sum_{i,j=1}^n \Delta_{i,j}\, p_{i,j} \sum_{a,b=1}^n q_{a,b} - \sum_{a,b=1}^n \Delta_{a,b}\, q_{a,b} \sum_{i,j=1}^n p_{i,j} \Big) + \frac{\exp(3\alpha) + \exp(-3\alpha) + 1}{3} \sum_{i,j=1}^n \sum_{a,b=1}^n p_{i,j}\, q_{a,b}$   (4.13)

We can simplify the above expression by defining a matrix $T$ as follows:

$T_{i,j} = \frac{p_{i,j}}{\sum_{a,b=1}^n p_{a,b}} - \frac{q_{i,j}}{\sum_{a,b=1}^n q_{a,b}}$

The elements in matrix $T$ measure the inconsistency between the kernel matrix $K$ and the pairwise constraints: a large positive value of $T_{i,j}$ indicates that $K$ is inconsistent with the must-link pair $(\mathbf{x}_i, \mathbf{x}_j)$; similarly, a large negative value of $T_{a,b}$ indicates that $K$ is inconsistent with the mustnot-link pair $(\mathbf{x}_a, \mathbf{x}_b)$. Using the notation of matrix $T$, the upper bound for $\mathcal{L}_{BCP}$ in (4.13) becomes

$\mathcal{L}_{BCP}(K') \leq \mathcal{L}_{BCP}(K) \times \Big\{ \frac{\exp(3\alpha) + \exp(-3\alpha) + 1}{3} - \frac{1 - \exp(-3\alpha)}{3}\, \mathrm{tr}(T \Delta^T) \Big\}$   (4.14)

where

$\mathcal{L}_{BCP}(K) = \sum_{i,j=1}^n \sum_{a,b=1}^n p_{i,j}\, q_{a,b}, \qquad \mathrm{tr}(T \Delta^T) = \sum_{i,j=1}^n T_{i,j} \Delta_{i,j}$

Note that when $\alpha = 0$, the right side of (4.14) becomes $\mathcal{L}_{BCP}(K)$, i.e., the objective value of the previous iteration. Thus, by minimizing the upper bound in (4.14) with respect to $\alpha$, we are guaranteed to have $\mathcal{L}_{BCP}(K') \leq \mathcal{L}_{BCP}(K)$, thus reducing the objective function at successive iterations.

As suggested by the inequality in (4.14), to effectively reduce the objective function $\mathcal{L}_{BCP}$, we need to maximize the term $\mathrm{tr}(T\Delta^T)$. We further assume that the incremental kernel matrix $\Delta$ can be approximated by a linear projection of the input data $X$, i.e., $\Delta \approx (P^T X)^T (P^T X) = X^T P P^T X$, where $P = (\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_s)$ is the projection matrix ($s \leq d$) with each $\mathbf{p}_i \in \mathbb{R}^d$ specifying a different projection direction. Using the above expression, $\mathrm{tr}(T\Delta^T)$ can be written as

$\mathrm{tr}(T\Delta^T) \approx \mathrm{tr}(P^T X T X^T P)$   (4.15)

If we enforce orthogonality between any two projection vectors, i.e., $\mathbf{p}_i^T \mathbf{p}_j = \delta(i,j)$, the optimal solution for $\mathbf{p}_i$ that maximizes the expression in (4.15) is the $i$-th maximum eigenvector of the matrix $X T X^T$. Let $\{(\lambda_i, \mathbf{v}_i)\}_{i=1}^s$ denote the top $s$ principal eigenvalues and eigenvectors of the matrix $X T X^T$. Then, the optimal projection matrix $P$ is constructed as

$P = (\sqrt{\lambda_1}\mathbf{v}_1, \sqrt{\lambda_2}\mathbf{v}_2, \ldots, \sqrt{\lambda_s}\mathbf{v}_s)$   (4.16)

Using the projection computed in (4.16), we generate a new data representation as $X' = P^T X$. This new representation $X'$ will be the input to the clustering algorithm $\mathcal{A}$ to generate new cluster memberships.
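As a minimal sketch of the projection step in (4.15) and (4.16): form the matrix $X T X^T$, extract its top $s$ eigenpairs, scale the eigenvectors by the square roots of the eigenvalues, and project the data. The function name and the dense eigendecomposition are illustrative assumptions; for high-dimensional inputs, the efficient computation discussed later in this section would be used instead.

```python
import numpy as np

def projection_step(X, T, s):
    """Compute P per (4.16) and the new representation X' = P^T X.

    X is d x n (columns are examples), T is the n x n inconsistency
    matrix; returns the s x n projected representation.
    """
    M = X @ T @ X.T                       # d x d matrix X T X^T
    evals, evecs = np.linalg.eigh(M)      # symmetric; ascending order
    idx = np.argsort(evals)[::-1][:s]     # indices of top-s eigenpairs
    lam = np.clip(evals[idx], 0.0, None)  # guard tiny negative values
    P = evecs[:, idx] * np.sqrt(lam)      # P = (sqrt(l1)v1, ..., sqrt(ls)vs)
    return P.T @ X                        # s x n new representation
```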
The resulting cluster memberships are then used to compute the incremental kernel matrix $\Delta$.

In addition to the projection matrix $P$, another important question is how to compute the optimal $\alpha$. We can estimate the optimal $\alpha$ by minimizing the upper bound in (4.14), which leads to $\alpha = \log[1 + \mathrm{tr}(T\Delta^T)]/6$. However, we can further improve the estimation of $\alpha$ by minimizing the original objective function in (4.6), which is

$\mathcal{L}_{BCP}(K') = \sum_{i,j=1}^n S^+_{i,j} \exp(-K'_{i,j}) \sum_{i,j=1}^n S^-_{i,j} \exp(K'_{i,j})$
$= \sum_{i,j=1}^n p_{i,j} \exp(-\alpha \Delta_{i,j}) \sum_{i,j=1}^n q_{i,j} \exp(\alpha \Delta_{i,j})$
$= \Big( \sum_{i,j=1}^n p_{i,j}\, \delta(\Delta_{i,j}, 0) + \sum_{i,j=1}^n p_{i,j}\, \delta(\Delta_{i,j}, 1) \exp(-\alpha) \Big) \times \Big( \sum_{i,j=1}^n q_{i,j}\, \delta(\Delta_{i,j}, 0) + \sum_{i,j=1}^n q_{i,j}\, \delta(\Delta_{i,j}, 1) \exp(\alpha) \Big)$   (4.17)

It is not difficult to show that the optimal $\alpha$ that minimizes the above expression is:

$\alpha = \frac{1}{2} \log \frac{\sum_{i,j=1}^n p_{i,j}\, \delta(\Delta_{i,j}, 1) \times \sum_{i,j=1}^n q_{i,j}\, \delta(\Delta_{i,j}, 0)}{\sum_{i,j=1}^n p_{i,j}\, \delta(\Delta_{i,j}, 0) \times \sum_{i,j=1}^n q_{i,j}\, \delta(\Delta_{i,j}, 1)}$   (4.18)

Figure 4.3 summarizes the proposed BCP (and BCS) algorithm.

Input
• $X$: matrix for the input data
• $\mathcal{A}$: the given clustering algorithm
• $s$: the number of principal eigenvectors used for projection
• $S^+$: matrix for must-link pairs
• $S^-$: matrix for mustnot-link pairs

Output: cluster memberships

Algorithm
• Initialize $K_{i,j} = 0$ for all $i, j = 1, 2, \ldots, n$.
• For $t = 1, 2, \ldots, T$:
  – Compute $p_{i,j}$ and $q_{a,b}$ using (4.10) and (4.11).
  – Compute matrix $T$ using (4.15) for BCP and using (4.23) for BCS.
  – Compute the top $s$ eigenvectors and eigenvalues $\{(\lambda_i, \mathbf{v}_i)\}_{i=1}^s$ of the matrix $X T X^T$.
  – Construct the projection matrix $P$ using (4.16), and generate the new data representation $X'$ by projecting the input data $X$ onto $P$.
  – Run the clustering algorithm $\mathcal{A}$ using the new data representation $X'$. Compute the matrix $\Delta$ with $\Delta_{i,j} = 1$ when $\mathbf{x}_i$ and $\mathbf{x}_j$ are grouped into the same cluster by $\mathcal{A}$, and zero otherwise.
  – Compute $\alpha$ using (4.18) for BCP and using (4.24) for BCS.
  – Update the kernel similarity matrix: $K + \alpha\Delta \rightarrow K$.
• Run the clustering algorithm $\mathcal{A}$ with the kernel matrix $K$ (if $\mathcal{A}$ does not take a kernel similarity matrix as input, a data representation can be generated from the first $s+1$ eigenvectors of the matrix $K$).

Figure 4.3. Boosting algorithm for BCP and BCS.

DISCUSSIONS ON EFFICIENCY AND SCALABILITY

Similar to most boosting algorithms, we can show that the objective function of the proposed BCP algorithm is reduced exponentially, as stated by the following theorem.

Theorem 1. Let $\Delta_1, \Delta_2, \ldots, \Delta_T$ be the incremental kernel matrices computed from the clustering results by running the boosting algorithm (in Figure 4.3). Then, the objective function after $T$ iterations, i.e., $\mathcal{L}_T^{BCP}$, is bounded as follows:

$\mathcal{L}_T^{BCP} \leq \sum_{i,j=1}^n S^+_{i,j} \sum_{i,j=1}^n S^-_{i,j} \prod_{t=1}^T (1 - \gamma_t)$   (4.19)

where

$\gamma_t = \frac{(\sqrt{A_t D_t} - \sqrt{B_t C_t})^2}{(A_t + B_t)(C_t + D_t)}$

and $A_t$, $B_t$, $C_t$, and $D_t$ are non-negative constants, computed as

$A_t = \sum_{i,j=1}^n p^t_{i,j}\, \delta(\Delta^t_{i,j}, 0), \quad B_t = \sum_{i,j=1}^n p^t_{i,j}\, \delta(\Delta^t_{i,j}, 1), \quad C_t = \sum_{i,j=1}^n q^t_{i,j}\, \delta(\Delta^t_{i,j}, 0), \quad D_t = \sum_{i,j=1}^n q^t_{i,j}\, \delta(\Delta^t_{i,j}, 1)$

where $p^t_{i,j}$ and $q^t_{i,j}$ are computed according to (4.10) and (4.11) using the kernel matrix $K$ at the $t$-th iteration.

The above theorem can be proved directly by using the expression in (4.17) and the expression for $\alpha$ in (4.18). Figure 4.4 shows the change in the BCP algorithm's objective function observed in one of our experiments. As can be seen, the objective function converges very fast, and becomes very flat after around 22 iterations. In our experiments, the BCP algorithm usually converges within 25 iterations.

Figure 4.4. An example of the BCP objective function vs. the number of iterations.

In terms of the scalability of the BCP algorithm, it is easy to see that the most crucial step is finding the projection matrix $P$, since it seems to involve a huge computational cost. As in (4.15), we need to compute $X T X^T$, which seems to be computationally expensive, especially when the number of examples (i.e., $n$) is large. However, it is important to note that $X T X^T$ only involves the examples used in the pairwise constraints.
This is because $X T X^T$ can also be written as:

$X T X^T = \sum_{i,j=1}^n T_{i,j}\, \mathbf{x}_i \mathbf{x}_j^T$   (4.20)

Since $T_{i,j}$ is nonzero only when the example pair $(\mathbf{x}_i, \mathbf{x}_j)$ is used by a constraint, the above calculation involves only a very small portion of all the example pairs. Thus, $X T X^T$ can be computed efficiently as long as the number of labeled pairs is relatively small.

After $X T X^T$ is computed, the computational cost of constructing the projection matrix $P$ mainly arises from computing the principal eigenvectors and eigenvalues of $X T X^T$, particularly when the dimensionality of the feature space is high. For instance, in text categorization, each document is represented by a vector of over 100,000 word features, and the size of the matrix $X T X^T$ is over 100,000 × 100,000. A straightforward approach is to reduce the dimensionality before running the proposed algorithm. However, most dimensionality reduction algorithms that are capable of handling high-dimensional spaces are unsupervised, and therefore unable to exploit the pairwise constraints. Here, we propose an algorithm that is able to efficiently compute the eigenvectors of $X T X^T$ when the input dimensionality is high. We first observe that each eigenvector of $X T X^T$ has to lie in the space spanned by the examples used by the constraints. More specifically, we denote by $\tilde{X} = (\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_m)$ the subset of $m$ examples that are involved in the constraints. Then, the eigenvector $\mathbf{v}_i$ can be written as a linear combination of $\{\tilde{\mathbf{x}}_k\}_{k=1}^m$, i.e.,

$\mathbf{v}_i = \sum_{k=1}^m w_{i,k}\, \tilde{\mathbf{x}}_k = \tilde{X} \mathbf{w}_i$   (4.21)

More generally, we have

$V = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_s) = \tilde{X}(\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_s) = \tilde{X} W$

The proof of this result can be found in Appendix A.1. We furthermore denote by $\tilde{T}$ the pairwise constraint matrix restricted to these examples, where $\tilde{T}_{i,j}$ gives the pairwise constraint between $\tilde{\mathbf{x}}_i$ and $\tilde{\mathbf{x}}_j$. Then $\mathbf{w}_i$, $i = 1, 2, \ldots, s$, correspond to the first $s$ principal eigenvectors of the following generalized eigenvector problem:

$\tilde{X}^T \tilde{X} \tilde{T} \tilde{X}^T \tilde{X} \mathbf{w}_i = \lambda_i \tilde{X}^T \tilde{X} \mathbf{w}_i$   (4.22)

Note that since $\tilde{X}^T \tilde{X} \tilde{T} \tilde{X}^T \tilde{X}$ and $\tilde{X}^T \tilde{X}$ are $m \times m$ matrices, the cost of computing the eigenvectors is independent of the dimensionality of the input space. The proof of the above result can be found in Appendix A.2.

THE BCS ALGORITHM

The derivation of the BCS algorithm is very similar to that of the BCP algorithm. First, we compute the BCS objective function for $K' = K + \alpha\Delta$ as follows:

$\mathcal{L}_{BCS}(K') = \sum_{i,j=1}^n \big( S^+_{i,j} \exp(-K'_{i,j}) + c\, S^-_{i,j} \exp(K'_{i,j}) \big) = \sum_{i,j=1}^n \big( p_{i,j} \exp(-\alpha\Delta_{i,j}) + c\, q_{i,j} \exp(\alpha\Delta_{i,j}) \big)$

Using Jensen's inequality, we have

$\exp(-\alpha\Delta_{i,j}) = \exp\Big( -3\alpha\,\frac{\Delta_{i,j}+1}{3} + 3\alpha\,\frac{1}{3} + 0\,\frac{1-\Delta_{i,j}}{3} \Big) \leq \frac{\Delta_{i,j}+1}{3}\exp(-3\alpha) + \frac{1}{3}\exp(3\alpha) + \frac{1-\Delta_{i,j}}{3} = \frac{1+\exp(3\alpha)+\exp(-3\alpha)}{3} - \Delta_{i,j}\,\frac{1-\exp(-3\alpha)}{3}$

Similarly, the upper bound for $\exp(\alpha\Delta_{i,j})$ is

$\exp(\alpha\Delta_{i,j}) \leq \frac{1+\exp(3\alpha)+\exp(-3\alpha)}{3} + \Delta_{i,j}\,\frac{1-\exp(-3\alpha)}{3}$

Using the above inequalities, the BCS objective function $\mathcal{L}_{BCS}(K')$ is upper bounded by the following expression:

$\mathcal{L}_{BCS}(K') \leq \frac{1+\exp(3\alpha)+\exp(-3\alpha)}{3}\,\mathcal{L}_{BCS}(K) - \frac{1-\exp(-3\alpha)}{3}\,\mathrm{tr}(T_s \Delta^T)$

where $T_s$ is defined as

$[T_s]_{i,j} = p_{i,j} - c\, q_{i,j}$   (4.23)

Note that the above definition is similar to $T$ in (4.15). The key difference is that $p_{i,j}$ and $q_{i,j}$ are normalized in (4.15), while they are not in the above expression. Similar to the BCP algorithm, we use $T_s$ to construct the projection matrix by computing the top $s$ principal eigenvectors as before, and the projected input data is sent to the algorithm $\mathcal{A}$ for clustering. Furthermore, the optimal $\alpha$ for BCS is computed as

$\alpha = \frac{1}{2} \log \frac{\sum_{i,j=1}^n p_{i,j}\,\delta(\Delta_{i,j},1)}{c \sum_{i,j=1}^n q_{i,j}\,\delta(\Delta_{i,j},1)}$   (4.24)

Figure 4.3 summarizes the BCS algorithm.
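Pulling the pieces together, the following is a compact, illustrative rendering of the BCP variant of the loop in Figure 4.3. Here `cluster` stands for the black-box algorithm $\mathcal{A}$ (assumed to map a projected data matrix to one label per example), `projection_step` is the helper sketched earlier, and the small epsilon guards and the non-negativity clamp on $\alpha$ are defensive assumptions rather than part of the original pseudo-code.

```python
import numpy as np

def boost_cluster_bcp(X, S_plus, S_minus, cluster, s, T_iters=25):
    """One possible rendering of the BCP loop in Figure 4.3.

    X: d x n data matrix; S_plus/S_minus: constraint matrices (each
    assumed to contain at least one pair); cluster: black-box algorithm
    mapping an s x n matrix to an array of n cluster labels.
    """
    n = X.shape[1]
    K = np.zeros((n, n))                       # kernel similarity matrix
    eps = 1e-12
    for _ in range(T_iters):
        p = S_plus * np.exp(-K)                # (4.10)
        q = S_minus * np.exp(K)                # (4.11)
        T = p / p.sum() - q / q.sum()          # inconsistency matrix T
        labels = cluster(projection_step(X, T, s))
        # Delta[i, j] = 1 iff x_i and x_j land in the same cluster.
        Delta = (labels[:, None] == labels[None, :]).astype(float)
        A = (p * (Delta == 0)).sum(); B = (p * (Delta == 1)).sum()
        C = (q * (Delta == 0)).sum(); D = (q * (Delta == 1)).sum()
        alpha = 0.5 * np.log((B * C + eps) / (A * D + eps))  # (4.18)
        K = K + max(alpha, 0.0) * Delta        # alpha >= 0 per (4.8)
    return K  # final clustering with K (last step of Figure 4.3) omitted
```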
Finally, we can show that the objective function of BCS also decreases exponentially:

Theorem 2. Let $\Delta_1, \Delta_2, \ldots, \Delta_T$ be the incremental kernel matrices computed from the clustering results by running the boosting algorithm (in Figure 4.3). Then, the objective function after $T$ iterations, i.e., $\mathcal{L}_T^{BCS}$, is bounded as follows:

$\mathcal{L}_T^{BCS} \leq \sum_{i,j=1}^n (S^+_{i,j} + c\, S^-_{i,j}) \prod_{t=1}^T (1 - \gamma_t)$   (4.25)

where

$\gamma_t = \frac{(\sqrt{B_t} - \sqrt{D_t})^2}{A_t + B_t + C_t + D_t}$

and $A_t$, $B_t$, $C_t$, and $D_t$ are as defined in Theorem 1.

4.4 Experiments

We now present an empirical evaluation of our proposed boosting framework. In particular, we aim to address the following four questions in our study:

1. As a general boosting framework, is the proposed method able to improve the performance of any given clustering algorithm?
2. How effective is the proposed boosting framework in improving the clustering performance by using the pairwise constraints?
3. How robust is the proposed boosting framework in improving the clustering performance by using the pairwise constraints?
4. How does the BCP algorithm compare to the BCS algorithm within the proposed boosting framework?

4.4.1 Experiment Setup

To validate the claim that the proposed boosting algorithm is capable of improving any clustering algorithm by exploiting the pairwise constraints, three typical clustering algorithms are used in our study. They are:

1. The K-means algorithm [5]. It represents the family of clustering algorithms that try to find compact and well-separated clusters. We adopted the implementation from the Weka software².
2. The partitional SingleLink algorithm ("SLINK" for short) [80]. It represents the family of hierarchical clustering algorithms. We adopted the implementation from the CLUTO software³.
3. k-way spectral clustering ("SPEC" for short). It represents the family of spectral methods for data clustering. In particular, we follow the paper [115] for the implementation of spectral clustering.

² http://www.cs.waikato.ac.nz/ml/weka/
³ http://glaros.dtc.umn.edu/gkhome/views/cluto

Six datasets drawn from the UCI machine learning repository [50] are used in our study. Table 4.2 summarizes the information about these datasets⁴. As indicated in Table 4.2, these datasets vary significantly in their sizes, numbers of clusters, and numbers of attributes.

Name              #Attributes  #Clusters  #Examples
wdbc              30           2          569
scale             4            3          625
vowel             10           11         990
segmentation      19           7          2310
handwrittendigit  256          10         4000
pendigit          16           4          4396

Table 4.2. Datasets used in the experiments.

⁴ Note that for the "pendigit" dataset, examples in only the four classes of digits "3", "6", "8" and "9" are selected from the total of 10 classes, because these four digits are in general difficult to distinguish.

To evaluate the clustering performance, two measurements are used in our experiments. The first measurement is normalized mutual information (NMI for short) [15], which is defined as

$\mathrm{NMI} = \frac{2\, MI(X, X^0)}{H(X) + H(X^0)}$

where $X^0$ and $X$ denote the random variables of cluster memberships from the ground truth and from the output of the clustering algorithm, respectively, $MI(x, y)$ represents the mutual information between random variables $x$ and $y$, and $H(x)$ represents the Shannon entropy of random variable $x$.
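A small sketch of the NMI computation as defined above, estimated from empirical label counts; this is a direct reading of the formula, not the evaluation code used in the experiments.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """NMI = 2 * MI(X, X0) / (H(X) + H(X0)) from empirical counts."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    eps = 1e-12
    # Joint distribution over (predicted cluster, true cluster).
    cs, ts = np.unique(labels_pred), np.unique(labels_true)
    joint = np.array([[np.sum((labels_pred == c) & (labels_true == t))
                       for t in ts] for c in cs]) / n
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mi = np.sum(joint * np.log(joint / (np.outer(px, py) + eps) + eps))
    hx = -np.sum(px * np.log(px + eps))   # entropy of predicted labels
    hy = -np.sum(py * np.log(py + eps))   # entropy of true labels
    return 2.0 * mi / (hx + hy)
```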
The second measurement is the Pairwise F-measure (PWF1 for short), which is the harmonic mean of the pairwise precision and recall defined as follows (a small sketch of this computation is given below):

precision = (#pairs correctly placed in the same cluster) / (total #pairs placed in the same cluster)
recall = (#pairs correctly placed in the same cluster) / (total #pairs actually in the same cluster)
PWF1 = (2 × precision × recall) / (precision + recall)

The PWF1 measurement defined above is closely related to the metric defined in [155], which measures the percentage of data pairs correctly clustered together. The key problem with the metric defined in [155] is that it counts two types of data pairs, i.e., pairs of data points assigned to the same cluster and pairs of data points assigned to different clusters, with equal importance. This is problematic because most of the data pairs will consist of data points from different clusters when the number of clusters is large. A similar issue was raised in multi-class learning, and that is why the F-measure is widely used for evaluating multi-class learning [161].

To verify the efficacy of the proposed boosting framework in exploiting the pairwise constraints for data clustering, three baseline approaches are used:

1. The Metric Pairwise Constraints K-means (MPCKmeans for short) algorithm [?, 12], which builds on a probabilistic framework based on Hidden Markov Random Fields.
2. The Semi-supervised Kernel K-means (SSKK for short) algorithm [96], which exploits the pairwise constraints by a kernel approach and finds clusters with nonlinear boundaries in the input data space.
3. The Spectral Learning (SpectralLearn for short) algorithm [89], which applies spectral methods to learn a data representation from the pairwise constraints. The generated data representation can then be used by any clustering algorithm to identify the appropriate data clusters. The key difference between spectral learning and our algorithm is that our algorithm generates algorithm-specific data representations by taking into account the performance of the clustering algorithm.

Previous studies [15, 12, 96, 89] showed that the above three algorithms deliver state-of-the-art performance in comparison to other semi-supervised clustering algorithms such as constrained K-means.

Figure 4.5. Legends for all algorithms in our comparative study: "BoostCluster + K-means", "BoostCluster + SLINK", "BoostCluster + SPEC", "MPCKmeans", "SSKK", "SpectralLearn + K-means", "SpectralLearn + SLINK", and "SpectralLearn + SPEC". These legends are used in all the figures in this chapter.

In summary, we will compare the following semi-supervised clustering algorithms in the experiments: the three clustering algorithms (K-means, SLINK, and SPEC) boosted by the proposed BoostCluster framework; the same three clustering algorithms with input from the Spectral Learning algorithm; the MPCKmeans algorithm; and the SSKK algorithm. For easy identification in figures, the legends for the algorithms to be compared are listed in Figure 4.5. These legends apply to all of the following performance comparison figures (we omit the legends in those figures due to space constraints). Finally, in all the experiments, we vary the number of pairwise constraints from 0 to 800.
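As referenced in the PWF1 definition above, here is a minimal sketch of the pairwise F-measure: enumerate all unordered example pairs, count pairs placed together by the clustering and pairs actually together in the ground truth, and combine pairwise precision and recall harmonically. Again, this is an illustrative reading of the definitions, not the original evaluation script.

```python
import numpy as np

def pairwise_f1(labels_true, labels_pred):
    """Pairwise F-measure (PWF1) over all unordered example pairs."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # Boolean "same cluster" indicators over the upper triangle.
    iu = np.triu_indices(len(labels_true), k=1)
    same_pred = labels_pred[iu[0]] == labels_pred[iu[1]]
    same_true = labels_true[iu[0]] == labels_true[iu[1]]
    correct = np.sum(same_pred & same_true)
    precision = correct / max(np.sum(same_pred), 1)
    recall = correct / max(np.sum(same_true), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```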
Since a random sampling of data pairs tends to find many more cannot-link pairs than must-link pairs, in this study we enforce an equal number of constraints for must-link pairs and cannot-link pairs. The numbers of eigenvectors (i.e., the parameter $s$ in the boosting algorithm shown in Figure 4.3) are determined empirically as follows: 3 for the "scale" dataset, 10 for the "handwrittendigit" dataset, and 5 for the remaining four datasets. All the experiments in this study are repeated five times, and the evaluation results averaged over the five trials are reported.

4.4.2 Generality of the Boosting Framework

Figures 4.6-4.9 show the clustering performance, evaluated by NMI and PWF1 respectively, of the BCP algorithm of the BoostCluster framework applied to the three clustering algorithms (i.e., K-means, partitional SingleLink, and spectral clustering), of the same three clustering algorithms with input from the new data representation produced by the Spectral Learning algorithm, of the MPCKmeans algorithm, and of the SSKK algorithm.

1. We observe that for most datasets, the BoostCluster framework is able to improve the clustering performance of all three clustering algorithms regardless of which evaluation metric is used. This suggests that the proposed framework is effective in exploiting the pairwise constraints to improve clustering performance. The MPCKmeans and SSKK algorithms are also effective in general; however, their clustering performance improvements are less significant, especially for the larger datasets (such as "handwrittendigit" and "pendigit").

2. Although the SpectralLearn algorithm can also be combined with any clustering algorithm, in our experiments it does not always improve the clustering performance. For example, for "wdbc" and "handwrittendigit", increasing the number of pairwise constraints deteriorates the clustering performance when combining SpectralLearn with any of the three clustering algorithms. Moreover, the effect of SpectralLearn depends on the clustering algorithm. For example, for the "pendigit" dataset, SpectralLearn improves the clustering performance of K-means and SLINK, but degrades SPEC in general. In comparison, the clustering performance improvement brought by the proposed BoostCluster is substantially more stable and consistent across different datasets and different clustering algorithms. This can be attributed to the fact that BoostCluster is adaptive to both clustering algorithms and datasets: in each iteration, it takes feedback from the result of applying the given clustering algorithm to the particular dataset, and decides how to adjust the kernel matrix. SpectralLearn, however, generates new data representations independently of the clustering algorithm that is used.

3. The performance of the three clustering algorithms (K-means, SLINK, and SPEC) varies significantly across the six datasets. For instance, for the "vowel" dataset, the "BoostCluster+K-means" algorithm performs considerably worse than the "BoostCluster+SPEC" algorithm. However, the performance of the "BoostCluster+K-means" algorithm on the "pendigit" dataset is significantly better than that of the "BoostCluster+SPEC" algorithm. This result also indicates that every clustering algorithm has its own strengths, and therefore it is important to develop a general framework that is able to boost the performance of any clustering algorithm with the given pairwise constraints.
4. The results based on the two different evaluation metrics, namely NMI and PWF1, are inconsistent on some occasions. For instance, for the "handwrittendigit" dataset, according to NMI, the clustering performance of "BoostCluster+K-means" and "BoostCluster+SLINK" appears to be similar. However, according to PWF1, "BoostCluster+SLINK" performs noticeably better than "BoostCluster+K-means". The implication of this finding is the importance of evaluating clustering performance with more than one evaluation metric, since conclusions based on the results of a single evaluation metric could be biased.

Figure 4.6. Clustering performance evaluated by NMI (Part A: the wdbc, scale, and vowel datasets). Each graph shows the performance of the three clustering algorithms (K-means, partitional SLINK, spectral clustering) boosted by the proposed BCP algorithm, and the performance of the MPCKmeans, SSKK and SpectralLearn algorithms.

Figure 4.7. Clustering performance evaluated by NMI (Part B: the segmentation, handwrittendigit, and pendigit datasets). Each graph shows the performance of the three clustering algorithms (K-means, partitional SLINK, spectral clustering) boosted by the proposed BCP algorithm, and the performance of the MPCKmeans, SSKK and SpectralLearn algorithms.

Figure 4.8. Clustering performance evaluated by PWF1 (Part A: the wdbc, scale, and vowel datasets). Each graph shows the performance of the three clustering algorithms (K-means, partitional SLINK, spectral clustering) boosted by the proposed BCP algorithm, and the performance of the MPCKmeans, SSKK and SpectralLearn algorithms.

Figure 4.9. Clustering performance evaluated by PWF1 (Part B: the segmentation, handwrittendigit, and pendigit datasets). Each graph shows the performance of the three clustering algorithms (K-means, partitional SLINK, spectral clustering) boosted by the proposed BCP algorithm, and the performance of the MPCKmeans, SSKK and SpectralLearn algorithms.

4.4.3 Robustness of Exploiting Pairwise Constraints

Although the curves in Figures 4.6-4.9 all display different degrees of "bumpiness", generally speaking, the BoostCluster framework, SSKK, and MPCKmeans deliver more robust performance than the SpectralLearn algorithm. On the other hand, although for most datasets the performance curves of SSKK appear to be the most
smooth among all the competitors, the resulting improvement is almost always the least noticeable among all the semi-supervised clustering algorithms.

To further evaluate the robustness of all the algorithms, we conduct experiments with noisy pairwise constraints. We randomly select 20% of the pairwise constraints and flip their labels (i.e., a must-link pair becomes a cannot-link pair and vice versa). This setting reflects the scenario where the side information includes incorrect pairwise constraints. It can happen when, for instance, the pairwise constraints are derived from implicit user feedback (e.g., user ratings or click-through data). Thus, it is important to develop semi-supervised clustering algorithms that are resilient to noisy side information.

Figure 4.10 and Figure 4.11 show the performance of all the algorithms on three selected datasets (i.e., "scale", "vowel", and "pendigit") when 20% of the pairwise constraints are noisy. First, by comparing Figure 4.10 and Figure 4.11 with Figures 4.6-4.9, it is not surprising to observe a degradation in clustering performance when 20% of the pairwise constraints are noisy. Second, we observe a general trend that a larger number of noisy constraints tends to result in inferior clustering performance by MPCKmeans. This is in contrast to the results of MPCKmeans shown in Figures 4.6-4.9, where increasing the number of pairwise constraints usually improves the clustering performance. This result implies that the MPCKmeans algorithm is unable to effectively exploit the pairwise constraints for data clustering when they are noisy. Similarly, while "SpectralLearn+K-means" and "SpectralLearn+SLINK" are able to noticeably improve the clustering performance with an increasing number of noise-free pairwise constraints, with 20% noise in the constraints their performance degrades as the number of constraints increases. In comparison, as shown in Figure 4.10 and Figure 4.11, the BoostCluster framework is generally able to improve the performance of all three clustering algorithms with an increasing number of noisy pairwise constraints, and the SSKK algorithm is also able to improve clustering performance despite the noise in the pairwise constraints. This indicates that the proposed BoostCluster framework and the SSKK algorithm are more resilient to noise in the side information.

                  K-means (BCP)      K-means (BCS)
DATASETS          NMI      PWF1      NMI      PWF1
wdbc              0.5862   0.8502    0.6360   0.8221
scale             0.3040   0.5520    0.3450   0.5645
vowel             0.2723   0.2069    0.3317   0.2565
segmentation      0.6315   0.5843    0.6013   0.5544
handwrittendigit  0.5720   0.5664    0.5435   0.5359
pendigit          0.7448   0.8140    0.7261   0.8166

                  SLINK (BCP)        SLINK (BCS)
DATASETS          NMI      PWF1      NMI      PWF1
wdbc              0.7438   0.9248    0.7971   0.9440
scale             0.5399   0.8091    0.5015   0.7839
vowel             0.2982   0.2302    0.3044   0.2599
segmentation      0.6164   0.5740    0.5765   0.5421
handwrittendigit  0.5451   0.6328    0.5300   0.5579
pendigit          0.7833   0.8654    0.7713   0.8573

                  SPEC (BCP)         SPEC (BCS)
DATASETS          NMI      PWF1      NMI      PWF1
wdbc              0.7073   0.9038    0.6749   0.8964
scale             0.4792   0.6021    0.3818   0.6099
vowel             0.3431   0.2606    0.3417   0.2649
segmentation      0.6004   0.5559    0.5754   0.5768
handwrittendigit  0.4764   0.4306    0.4579   0.4417
pendigit          0.5912   0.6325    0.8056   0.8745

Table 4.3. The performance comparison between the BCP algorithm and the BCS algorithm, when the number of pairwise constraints is 500.

4.4.4 BCP vs. BCS

We compare the clustering performance of the BCP algorithm and the BCS algorithm in Table 4.3, evaluated in both NMI and PWF1.
Due to space limitations, we only list the evaluation results for 500 pairwise constraints. We set $c = 1$ in the objective function (4.7) of the BCS algorithm. As indicated in Table 4.3, the BCP and BCS algorithms generally deliver similar performance, which suggests that both algorithms are effective in boosting the performance of the underlying clustering algorithms. BCP is more effective than BCS for certain datasets and clustering algorithms, and vice versa. However, it is important to note that, compared to BCP, BCS has an additional parameter $c$ that can be used to tune the boosting algorithm. Our empirical experience indicates that changing the parameter $c$ can lead to significant differences in the clustering performance. It would be useful to investigate the pros and cons of the BCP and BCS algorithms, and how to choose the parameter $c$ in the BCS objective function wisely.

Figure 4.10. Clustering performance (NMI) with 20% noise in the pairwise constraints, on the scale, vowel, and pendigit datasets. Each graph shows the performance of the three clustering algorithms (K-means, partitional SLINK, spectral clustering) boosted by the proposed BCP algorithm of the BoostCluster framework, and the performance of the MPCKmeans, SSKK and SpectralLearn algorithms.

Figure 4.11. Clustering performance (PWF1) with 20% noise in the pairwise constraints, on the scale, vowel, and pendigit datasets. Each graph shows the performance of the three clustering algorithms (K-means, partitional SLINK, spectral clustering) boosted by the proposed BCP algorithm of the BoostCluster framework, and the performance of the MPCKmeans, SSKK and SpectralLearn algorithms.

4.5 Summary

In this chapter, we have studied the problem of improving data clustering by using side information in the form of pairwise constraints. A general boosting framework has been proposed to improve the accuracy of any given clustering algorithm with a given set of pairwise constraints. This performance improvement is achieved by iteratively finding new data representations that are consistent with both the clustering results from previous iterations and the pairwise constraints. The empirical study shows that our proposed boosting framework is able to improve the clustering performance of several popular clustering algorithms by using the pairwise constraints.

CHAPTER 5

Semi-supervised Learning for Query Translation Disambiguation in Dictionary-based Cross Language Information Retrieval

5.1 Introduction

To bridge the gap between different languages, machine translation has been used extensively in many research areas of multilingual information processing, such as cross-language information retrieval (CLIR). In CLIR, we can either translate queries into the language of the documents or translate documents into the language of the queries. Usually, it is simpler and more efficient to translate queries because of their shorter length.
Most query translation algorithms require external linguistic resources, among which parallel corpora and bilingual dictionaries are the most commonly used. Methods based on parallel corpora usually learn the association between words of the source language and words of the target language, and apply the learned association to estimate the translation of queries. Examples in this category include statistical translation models [157, 53, 116, 94] and relevance language models [93, 101, 100]. The main drawback of these methods is that they depend critically on the availability of parallel bilingual corpora, which are often difficult to acquire, especially for minor languages.

While machine learning for language translation remains a tough problem (both theoretically and practically), bilingual dictionaries can undoubtedly be viewed as side information for various multilingual tasks. Typically, a bilingual dictionary provides a repository of translation pairs, usually organized so that each word (or phrase) in one language is supplied with a list of possible translation words (or phrases) in another language. These translation pairs, to some extent, help to bridge (or narrow) the gap between languages. With the increasing availability of machine-readable bilingual dictionaries, dictionary-based approaches become more attractive for CLIR applications, especially when other linguistic resources are scarce.

Compared to the approaches based on parallel corpora, a major disadvantage of the bilingual-dictionary-based approaches is that they lack the ability to disambiguate the translation of query terms among multiple candidates. Very often, a number of translations (which we call translation candidates in this chapter) are found in a bilingual dictionary for a single query word, but most of them are irrelevant to the semantic meaning of the query. Hence, it is crucial for a dictionary-based approach to reduce the ambiguity in translating query words as much as possible. However, in CLIR, given the short length of a query, it is usually impossible to completely resolve the translation ambiguity due to the multiple interpretations of the query. Thus, it is also important for any dictionary-based CLIR approach to maintain the uncertainty in translating queries when the ambiguity is hard to resolve.

In the past, several approaches [81, 57, 58, 1, 95] have been proposed to resolve the query translation ambiguity in dictionary-based CLIR. The simplest one is to use all the translation candidates of each query word provided by the dictionary with equal weights [46, 95], which amounts to no sense disambiguation when translating query words. Other approaches try to resolve the translation ambiguity by measuring the coherence of a translation candidate with respect to the entire query. Typically, the coherence score of a translation candidate is computed using word co-occurrence statistics. Given a query, a translation candidate of a query word is assigned a high coherence score when it co-occurs frequently with the translations of other query words. The translation candidates with the highest coherence scores are selected to form the final translation of the original query. In [46, 76, 2, 93, 95, 57], only one translation candidate is selected for each query word; in [81, 109], a translation candidate is selected when its coherence score exceeds a predefined threshold, which allows multiple translations to be selected for each query word.
We will refer to both approaches mentioned above as selection-based approaches, because they both have to make a binary decision for each translation candidate regarding whether or not it will be included in the translated query. Given the usually short length of queries and the large variance that exists in mapping information across different languages, such binary decisions are usually difficult, if not impossible, to make. We call this problem the "translation uncertainty problem". Another problem with the selection-based approaches is that the translation of one query word is usually determined independently of the translations of the others. This assumption is reflected in the calculation of the coherence scores. Usually, the coherence score of a translation candidate for a given query is computed as the sum of its similarities to every translation candidate for the other query words. As a result, coherence scores are estimated independently of the choice of translations for query words, which causes the selection of translation candidates for different query words to be independent. We call this problem the "translation independence assumption". Although this problem has been addressed in previous work (e.g., [57]), usually greedy approaches are applied and therefore only suboptimal solutions can be obtained.

In this chapter, we propose a novel statistical framework for dictionary-based CLIR. This framework allows us to estimate the translation probabilities of query words by maximizing the overall coherence of the translated query, which we call the "maximum coherence principle". In particular, the proposed framework explicitly addresses the two problems mentioned above: to resolve the translation uncertainty problem, the proposed framework maintains the uncertainty in translating queries through the estimation of translation probabilities of query words; to remove the translation independence assumption, the proposed framework allows the translation probabilities of all query words to be estimated simultaneously. Furthermore, the proposed framework is formulated as a quadratic programming problem [62], whose globally optimal solution can be found efficiently using standard optimization packages such as Matlab. This is in contrast to several existing approaches, such as the propagation approach in [114], where the solution is determined by an iterative procedure, which is not only time consuming but also sensitive to the initialization of parameters and the stopping criterion employed in the iterative procedure.

In addition to the general framework, we also present in this chapter two realizations of the proposed framework that employ different coherence measurements: the "Maximum Coherence Model", which adopts the raw word-to-word similarity for the coherence measurement, and the "Spectral Query Translation Model", which measures the coherence score of each translation candidate based on the normalized word similarity. As will be explained later, the Spectral Query Translation Model, based on the normalized similarity measurement, can be further explained as a graph partitioning approach for query translation disambiguation, similar to the one employed in spectral clustering. Our empirical studies with TREC datasets have shown that both models outperform the selection-based approaches with relative improvements ranging from 10% to 50%.
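To make the quadratic programming formulation concrete, here is a minimal sketch of estimating translation probabilities by maximizing a quadratic coherence objective, with each query word's candidate probabilities constrained to sum to one. This is an illustrative reading of the framework, not its actual implementation: the similarity matrix S, the grouping of candidates by query word, and the use of SciPy's general-purpose SLSQP solver (a local solver standing in for a dedicated QP package such as Matlab's) are all assumptions.

import numpy as np
from scipy.optimize import minimize

def solve_translation_qp(S, groups):
    """Estimate translation probabilities x by maximizing the quadratic
    coherence x^T S x over the candidates of all query words jointly.
    S: (n, n) symmetric candidate-to-candidate similarity matrix.
    groups: one list of candidate indices per query word."""
    n = S.shape[0]
    objective = lambda x: -(x @ S @ x)   # minimize the negative coherence
    # each query word's candidate probabilities must sum to one
    constraints = [{"type": "eq", "fun": (lambda x, g=g: x[g].sum() - 1.0)}
                   for g in groups]
    x0 = np.zeros(n)
    for g in groups:
        x0[g] = 1.0 / len(g)             # start from uniform probabilities
    res = minimize(objective, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n, constraints=constraints)
    return res.x

For a two-word query whose first word has candidates 0 and 1 and whose second word has candidates 2 and 3, solve_translation_qp(S, [[0, 1], [2, 3]]) returns the four estimated translation probabilities, all query words being estimated simultaneously.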
5.2 Related Work

We first review the previous work on selection-based approaches for query translation disambiguation, followed by a discussion of spectral clustering, which is strongly related to the proposed Spectral Query Translation Model.

5.2.1 Selection-based Approaches for Query Translation Disambiguation

One of the major factors that can potentially degrade the effectiveness of dictionary-based cross-language information retrieval is the ambiguity in translating query words [8, 57]. In the efforts to resolve this translation ambiguity, a number of recent studies [46, 76, 81, 2, 93, 109, 57, 95] have suggested the strategy of translation selection by exploiting word co-occurrence patterns. Usually a similarity measurement between two translation candidates is defined in the form of word co-occurrence statistics. With the word similarities, we can then measure the coherence of a translation candidate with regard to the theme of the entire query. Only those translation candidates with high coherence scores will be selected for the query translation. Ideally, for each query word we should select the translation candidate(s) consistent with the selected translation candidates for the other query words. Apparently, this becomes a "chicken-and-egg" problem, since the selection of translation candidates for one query word is determined by the translation candidates selected for the other query words. Thus, due to the computational concern, most selection-based approaches [1, 57, 58] adopted approximate approaches that usually only produce suboptimal solutions. In particular, for each query word, those approaches choose the translation candidates that are consistent with all the translation candidates provided by the dictionary for all the query words, including both the selected and the unselected translation candidates. Formally, such a translation selection strategy can be formulated as follows:

APPROXIMATE TRANSLATION SELECTION ALGORITHM

1. Given a query $q^s = \{q_1^s, q_2^s, \cdots, q_{m_s}^s\}$ in the source language, for each query word $q_i^s$, look up the dictionary for the translation candidate set $S_i = \{w_{i,j}^t\}$.

2. For each set $S_i$:

(a) For each translation candidate $w_{i,j}^t$ in $S_i$, define the similarity measurement between the word $w_{i,j}^t$ and the set $S_{i'}$ ($i' \neq i$) as the sum of the similarities between $w_{i,j}^t$ and each word in the set $S_{i'}$, i.e.,

$$\mathrm{sim}(w_{i,j}^t, S_{i'}) = \sum_{w_{i',l}^t \in S_{i'}} \mathrm{sim}(w_{i,j}^t, w_{i',l}^t) \qquad (5.1)$$

where $\mathrm{sim}(w_{i,j}^t, w_{i',l}^t)$ computes the word-to-word similarity.

(b) Compute the coherence score for $w_{i,j}^t$ as

$$f(w_{i,j}^t) = \sum_{\forall i' \neq i} \mathrm{sim}(w_{i,j}^t, S_{i'}) \qquad (5.2)$$

(c) Select the word $q_i^t$ in $S_i$ with the highest coherence score:

$$q_i^t = \arg\max_{w_{i,j}^t} f(w_{i,j}^t) \qquad (5.3)$$

The definition of similarity between two words in the above algorithm can take various forms of co-occurrence statistics, such as Dice similarity (as in [1]), mutual information (as in [81, 109]) or its variants (as in [57, 58]). In addition to selecting the most likely translation for each query word, other selection-based strategies have been tried, such as selecting the best N translations [46] or selecting translations whose coherence scores exceed a predefined threshold [81, 109].
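The baseline strategy above is straightforward to state in code. The sketch below follows Eqs. (5.1)-(5.3) directly; the data layout and the word-to-word similarity callback sim are illustrative assumptions, and any of the co-occurrence statistics mentioned above (Dice similarity, mutual information, or a variant) could be plugged in as sim.

def approximate_translation_selection(candidate_sets, sim):
    """For each query word, select the translation candidate with the
    highest coherence score against the candidate sets of all the other
    query words (Eqs. (5.1)-(5.3)).
    candidate_sets: list of candidate lists, one per query word.
    sim: word-to-word similarity function."""
    selected = []
    for i, S_i in enumerate(candidate_sets):
        def coherence(w):
            # Eq. (5.2): sum the Eq. (5.1) similarities over all other sets
            return sum(sim(w, v)
                       for i2, S_i2 in enumerate(candidate_sets) if i2 != i
                       for v in S_i2)
        selected.append(max(S_i, key=coherence))   # Eq. (5.3)
    return selected

Note how the score of a candidate depends on every candidate of the other query words, whether selected or not; this is exactly the approximation criticized next.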
Apparently, the above approximate algorithm is not ideal. In particular, the coherence score for a translation is computed with regard to both selected and unselected translations. As a result of such an approximation, the translations of different query words are determined independently, which leads to the translation independence problem discussed in the introduction section. In the proposed statistical framework, by formulating the problem of translation selection in a quadratic programming form, we are able to efficiently estimate the translations of all query words simultaneously. Furthermore, in contrast to the selection-based approaches that make a binary decision for each translation candidate, the new framework employs soft probabilities for representing both selected and unselected translation candidates. This is particularly useful when binary decisions are hard to make, for instance, when all the translation candidates of a query word have very similar coherence scores.

5.2.2 Spectral Clustering

Spectral clustering approaches view the problem of data clustering as a problem of graph partitioning. Each data point corresponds to a vertex in a graph. Any two data points are connected by an edge whose weight is the similarity between the two data points. To form data clusters, the graph is partitioned into multiple disjoint sets such that only the edges with small weights are removed. Based on the different criteria imposed on the partitioning, there are three major variants of spectral clustering: Ratio Cut [38], Normalized Cut [134] and Min-Max Cut [49]. In the following, we briefly recapitulate the 2-way Normalized Cut algorithm, since it is the most widely used spectral clustering algorithm and has the closest relation to our proposed work. Let $G(V, E; W)$ denote an undirected graph, where $V$ is the vertex set, $E$ is the edge set, and $W = (w_{i,j})_{n \times n}$ is the matrix of pairwise similarities.

For retrieval, a document $d^t$ in the target language is ranked by the probability $\Pr(d^t|q^s)$ of its relevance to the source-language query $q^s$. By Bayes' law, and assuming a uniform prior over documents,

$$\Pr(d^t|q^s) = \frac{\Pr(q^s|d^t)\Pr(d^t)}{\Pr(q^s)} \propto \Pr(q^s|d^t) \qquad (5.16)$$

Hence, in the following, we will compute $\log \Pr(q^s|d^t)$ instead of $\log \Pr(d^t|q^s)$. To model the translation uncertainty, we rewrite the expression for $\log \Pr(q^s|d^t)$ as follows:

$$\begin{aligned} \log \Pr(q^s|d^t) &= \log \int dq^t\, \Pr(q^s|q^t)\,\Pr(q^t|d^t) \\ &\geq \int dq^t\, \Pr(q^s|q^t)\left(\log \Pr(q^t|d^t) + \log \Pr(q^s)\right) \\ &\approx \int dq^t\, \Pr(q^t|q^s)\,\log \Pr(q^t|d^t) \\ &= \left\langle \log \Pr(q^t|d^t) \right\rangle_{\Pr(q^t|q^s)} \\ &= \sum_{w^t} \Pr(w^t|q^s)\,\log \Pr(w^t|d^t) \end{aligned} \qquad (5.17)$$

In the second step of the above derivation we employ Jensen's inequality [125], which can be viewed as the first step toward a variational approximation [79, 88]. In the third step, we ignore the term $\log \Pr(q^s)$, which is independent of the document $d^t$, and we also switch the roles of $q^t$ and $q^s$ using Bayes' law (similar to (5.16)) by assuming that the prior of the translated query $q^t$ follows a uniform distribution, i.e., $\Pr(q^t)$ is a constant. In the fourth step, $\langle \cdot \rangle$ represents the mathematical expectation of a random variable. To estimate $\Pr(w^t|q^s)$, i.e., the probability of observing the word $w^t$ in the translation of the query $q^s$, we decompose $\Pr(w^t|q^s)$ into a summation over words in the source language:

$$\Pr(w^t|q^s) = \sum_{w^s} \Pr(w^t|w^s)\,\Pr(w^s|q^s)$$

Figure 5.2. An example of the graph partitioning perspective for query translation disambiguation.

Figure 5.2 gives an illustrative example of this graph partitioning perspective. In the example, the query is composed of four Chinese words, and around each Chinese word are its translation candidates in English, provided by a Chinese-English dictionary. The thickness of the lines connecting two English words roughly represents their similarity. The number below each English word is its translation probability estimated by the Spectral Query Translation Model.
Based on the graph representation in Figure 5.2, we can easily see a strongly connected component consisting of the words "independent", "sign", and "press", which have been assigned large translation probabilities. It is important to note that Figure 5.2 only serves as an illustration of the proposed Spectral Query Translation Model. In particular, all translation candidates will be used in the retrieval model (see Section 5.3.3), and their importance will be weighted by their translation probabilities determined by the optimization algorithm.

Remark: In the above, we discuss the relationship between the Spectral Query Translation Model and spectral graph partitioning. However, it is worth pointing out that the solution to the Spectral Query Translation Model cannot be acquired by the eigen-analysis that is used for solving spectral graph partitioning. This is because the Spectral Query Translation Model seeks the optimal translation probabilities (which lead to soft cluster memberships) that maximize the overall coherence measurement. In contrast, most spectral graph partitioning algorithms, such as Normalized Cut, search for the binary cluster memberships that minimize the graph cut. It is this difference that leads to the quadratic programming formalization, instead of an eigenvector problem, for the Spectral Query Translation Model.

5.5.2 Maximum Coherence vs. Spectral Graph Translation

Aside from its graph partitioning explanation, the Spectral Query Translation Model is advantageous over the Maximum Coherence Model in that it is able to reduce the translation probabilities of "common" words, which may otherwise dominate the final query translation. To see this, consider an element $\tilde{s}_{i,j}$ in the normalized similarity matrix $\tilde{S} = D^{-\frac{1}{2}} S D^{-\frac{1}{2}}$, i.e.,

$$\tilde{s}_{i,j} = \frac{s_{i,j}}{\sqrt{\sum_{j'=1}^{m^t} s_{i,j'}}\,\sqrt{\sum_{j'=1}^{m^t} s_{j,j'}}} \qquad (5.35)$$

Suppose translation candidate $w_j^t$ is a common word that appears frequently in the document collection of the target language. This implies that the sum $\sum_{j'=1}^{m^t} s_{j,j'}$ will be large. Since most of the "common" words are less informative for the purpose of document retrieval, we use the normalization factor $1/\sqrt{\sum_{j'=1}^{m^t} s_{j,j'}}$ to suppress their coherence values, which will eventually reduce the probability of including common words in the final query translation. Based on the above analysis, we see that the Spectral Query Translation Model is more appealing than the Maximum Coherence Model. This is further confirmed by our empirical studies on cross-language information retrieval presented in the next section.

5.6 Experiments and Discussions

The goal of this experiment is to examine the effectiveness of the proposed statistical framework for cross-language information retrieval. In particular, four research questions will be addressed in this empirical study:

1. Is the proposed statistical framework effective for cross-language information retrieval? To obtain a comprehensive view, we compare the Maximum Coherence Model and the Spectral Query Translation Model to the existing selection-based approaches using a variety of queries and documents.

2. How important is it for a query disambiguation algorithm to include translation uncertainty in its analysis? To address this question, we will examine the importance of including translation uncertainty in cross-language information retrieval through case studies.

3. How important is it to remove the translation independence assumption for cross-language information retrieval?
To address this question, we will examine the impact of the translation independence assumption on cross-language information retrieval through case studies.

4. How does the Maximum Coherence Model empirically compare to the Spectral Query Translation Model? We will show a performance comparison between the two models, as well as a case study, as empirical evidence for our previous comparative analysis of the two models.

5.6.1 Experiment Setup

All our experiments are retrieval of English documents using Chinese queries. The document collections used in this experiment are from the TREC ad hoc test collections, including:

AP88-89: 164,835 documents from the Associated Press (1988, 1989)
WSJ87-88: 83,480 documents from the Wall Street Journal (1987, 1988)
DOE1-2: 226,087 documents from Department of Energy abstracts

In addition to the homogeneous collections listed above, we also test the proposed model against heterogeneous collections that are formed by combining multiple homogeneous collections. In particular, two heterogeneous collections are created: collection AP88-89 + WSJ87-88, and collection AP89 + WSJ87-88 + DOE1-2. (The DOE1-2 collection is not used as one of the homogeneous datasets in our experiments because it provides no relevant documents for a majority of the queries used in this experiment; it is only used to create heterogeneous collections by combining it with the other two homogeneous collections.) In a heterogeneous collection, words are more likely to carry multiple senses than words from a homogeneous collection, which increases the difficulty for an automatic algorithm to disambiguate the senses of query words using the pairwise word similarities.

The SMART system [126] is used to process the document collections. Each document is first parsed into tokens with stop words removed, and then the tokens are stemmed using the Porter algorithm. Finally, each document is represented as a bag of stemmed words. We also implemented the pivoted document length normalization weighting scheme [135] in the SMART system for the retrieval process. Since our goal is to illustrate the advantage of the proposed statistical framework, we do not apply more sophisticated procedures for text analysis in our experiment, such as phrase identification.

Our queries come from a manual Chinese translation of the TREC-3 ad hoc topics (topics 151-200). To fully examine the effectiveness of the proposed models, we test them against both long Chinese queries and short Chinese queries. A short Chinese query is created by translating the "title" field of an English query into Chinese; a long Chinese query is formed by combining the Chinese translations of both the "title" field and the "description" field of an English query. The average length of the short Chinese queries is 9.64 Chinese characters, and that of the long queries is 30.72 Chinese characters. For the Chinese translations, we also manually segment the sentences into words with stop words removed, and then feed them into our query translation framework. Figure 5.3 gives an example of the title field and description field (in the bottom panel) used to form a Chinese query in our experiments.

Since most of the words in a short query are highly relevant to the topic of the query, we would expect that query disambiguation approaches based on word similarities will work well.
The analysis of CLIR with long queries could be tricky because of the following two conflicting aspects: (1) on one hand, a long query provides significantly richer context than a short one for disambiguating the word senses of translations; we would therefore expect that long queries may achieve better performance than short queries in CLIR. (2) On the other hand, a long query tends to include words that are either irrelevant or only slightly relevant to its topic; as a result, even a translation word that is coherent with the translations of many query words may not necessarily be a good candidate for selection. In our experiment, we will examine which factor among the two plays the major role in CLIR. Hence, a long query may pose a more challenging problem than a short query for a translation disambiguation algorithm based on word similarity information.

Figure 5.3. An example query used in our experiments. The query is first translated from topic No. 187 in the TREC-3 ad hoc topics into Chinese (Number: 187; Topic: Signs of the Demise of Independent Publishing; Description: Document must identify instances of the loss of independence by publishers, from the sale or merger of their business or publication, or the sale of a significant interest in it to another person or company). It is then segmented into words with stop words removed. The upper panel shows the original query in English; the middle panel shows the translated Chinese query; the bottom panel shows the segmented Chinese query, where stop words in parentheses are removed.

Finally, the relevance judgments for the original English queries are used as the relevance judgments for their Chinese translations. The Chinese-English dictionary used in our experiments comes from the Linguistic Data Consortium (LDC, http://www.ldc.upenn.edu) and consists of translations for 53,061 Chinese words. Since our experiments do not involve the processing of English phrases, for any English phrase that is the translation of a Chinese word, we simply treat it as a bag of words.

To evaluate the effectiveness of the proposed framework and models, we implement two baseline models that take translation selection approaches. The first baseline model selects the most likely translation for each query word, which we call "BSTO". Specifically, we follow the "Approximate Translation Selection Algorithm" described in Section 5.2.1. The second model, which we call "ALTR", makes no effort at translation disambiguation and simply includes all the translations provided by the dictionary for the query words in the final query translation. Finally, for easy reference, we use the abbreviation "MAC" for our Maximum Coherence Model and "SQT" for our Spectral Query Translation Model. The constant $C_p$ for the regularizer term in the Maximum Coherence Model is set to $4 \sum_{j,j'} s_{j,j'} / (m^t)^2$ based on our empirical experience.
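The evaluation in the following sections reports 11-point average precision. For concreteness, here is a minimal sketch of this standard metric, i.e., interpolated precision averaged over the recall levels 0.0, 0.1, ..., 1.0; the function name and the input encoding are illustrative, as the dissertation does not spell out an implementation.

def eleven_point_avg_precision(ranked_relevance, num_relevant):
    """ranked_relevance: 1/0 relevance flags of the ranked result list.
    num_relevant: total number of relevant documents for the query."""
    precisions, recalls = [], []
    hits = 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        precisions.append(hits / rank)
        recalls.append(hits / num_relevant)
    total = 0.0
    for level in (i / 10.0 for i in range(11)):
        # interpolated precision: best precision at any recall >= level
        attainable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(attainable) if attainable else 0.0
    return total / 11.0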
5.6.2 Comparison to Selection-based Approaches

Table 5.1 lists the average precision across 11 recall points for both the homogeneous collections and the heterogeneous collections. As indicated in Table 5.1, the proposed models (i.e., "MAC" and "SQT") outperform the two baseline models for both short queries and long queries across all four collections. For reference, we also list the results of monolingual information retrieval in the first column of Table 5.1. Furthermore, we plot the precision-recall curves for the short queries and the long queries in Figure 5.4 and Figure 5.5, respectively. As illustrated in Figure 5.4 and Figure 5.5, for all four collections the precision-recall curves of the proposed models always stay above the curves of the other two baseline models. Based on these results, we conclude that the proposed statistical framework performs substantially better than the two selection-based approaches for cross-language information retrieval.

Figure 5.4. Comparison of CLIR performance on homogeneous datasets using both short and long queries. The upper two figures are for the AP88-89 dataset, and the lower two are for the WSJ87-88 dataset. The left two figures are for short queries, and the right two are for long queries.

Figure 5.5. Comparison of CLIR performance on heterogeneous datasets using both short and long queries. The upper two figures are for the AP88-89 + WSJ87-88 dataset, and the lower two are for the AP89 + WSJ87-88 + DOE1-2 dataset. The left two figures are for short queries, and the right two are for long queries.

Table 5.1. 11-point average precision for both short and long queries on TREC datasets. MLIR represents monolingual information retrieval. The relative CLIR improvements of the MAC and SQT models over the other two baseline models are listed in the 5th, 6th, 8th and 9th data columns.

Short Queries
              MLIR   BSTO   ALTR   MAC    (M-B)/B   (M-A)/A   SQT    (S-B)/B   (S-A)/A
AP            .4112  .2331  .2241  .2959  +24.23%   +32.04%   .3115  +30.37%   +39.05%
WSJ           .3335  .1955  .2129  .2550  +30.21%   +20.24%   .2571  +30.77%   +20.75%
AP+WSJ        .4057  .2253  .2209  .2772  +23.04%   +25.49%   .2359  +25.90%   +29.43%
AP+WSJ+DOE    .3555  .1739  .1329  .2172  +24.90%   +13.75%   .2295  +32.03%   +25.53%

Long Queries
AP            .3755  .1749  .1303  .2095  +19.34%   +15.25%   .2425  +33.71%   +34.55%
WSJ           .4022  .1473  .1727  .2115  +43.17%   +22.52%   .2151  +45.21%   +25.13%
AP+WSJ        .3721  .1433  .1555  .1947  +35.37%   +15.94%   .2043  +49.92%   +23.00%
AP+WSJ+DOE    .3299  .1122  .1411  .1575  +40.45%   +11.59%   .1712  +52.53%   +21.33%

A further examination of the results in Table 5.1 gives rise to the following observations:

1. In general, the retrieval accuracy for heterogeneous collections appears to be worse than that for homogeneous collections.
In particular, a substantial decrease in the average precision is observed for all four models when the DOE1-2 collection is included in the heterogeneous collection. This result is in accordance with our previous analysis, i.e., words from heterogeneous collections are more likely to have multiple senses, thus resulting in higher translation ambiguity.

2. Better retrieval is achieved for short queries than for long queries. The degradation in performance between long queries and short queries is more significant for heterogeneous collections than for homogeneous collections. Usually long queries bring rich context as well as noise. Our result indicates that, of the two factors, the second one (i.e., that long queries contain many irrelevant words) appears to be more influential in query translation than the first one (i.e., that long queries provide rich context for word sense disambiguation). This result seems to contradict the general belief that long queries tend to yield better retrieval accuracy than short ones, given their rich context. However, it is worth pointing out that this general belief is based on a simplified analysis and overlooks the fact that long queries tend to include irrelevant words that can corrupt the accuracy of query translation disambiguation. To further confirm our hypothesis, we compare the results of monolingual information retrieval between long queries and short queries. We observe that the short queries in fact outperform the long queries on three of the four datasets. All these results indicate that, due to the two conflicting factors related to long queries, it is not necessarily the case that information retrieval with long queries will deliver better retrieval accuracy than with short queries.

3. The "BSTO" model does not consistently outperform the "ALTR" model. In fact, for the long queries, the "ALTR" model performs better than the "BSTO" model across all four collections. This phenomenon can also be attributed to the fact that long queries are rather noisy and likely to include irrelevant words. This result indicates that the "BSTO" model can be sensitive to the noise present in queries. Given that a significant amount of noise can be present in queries, it is important to maintain the uncertainty of translation in the retrieval process. Note that our results appear to be inconsistent with the finding in [57]. We believe that this difference can be explained by the difference in the experiment setups. In particular, since our experiments focus on CLIR with a bag of words, we did not employ any linguistic tools other than removing stop words and stemming keywords. In contrast, in [57] linguistic tools are used to identify appropriate English phrases and their Chinese translations, which has been shown to be an important factor in CLIR [8, 57]. Although phrase analysis is important to CLIR, we believe that a generic probabilistic model is beneficial to CLIR for any languages, particularly when linguistic resources are scarce.

Figure 5.6. Examples of translation probabilities estimated by the Maximum Coherence Model. The figure shows three Chinese words with the estimated probabilities of their English translation candidates: (1) upheaval 0.19931, commotion 0.19723, turbulence 0.19882, unrest 0.20209, turmoil 0.20255; (2) intent 0.22821, spring 0.19645, motive 0.30384, inducement 0.13196, incentive 0.13954; (3) acid 0.79070, sour 0.05422, sore 0.07552, ache 0.05955.
Other differences between our baseline "BSTO" model and the model in [57] include different ways of computing mutual information and slightly different translation selection strategies.

5.6.3 The Necessity of Including Translation Uncertainty

To demonstrate the uncertainty in query translation, in Figure 5.6 we list the translation probabilities for three Chinese words (not from the same query) that are estimated by the Maximum Coherence Model. As we can see, a significant variance exists in the distribution of translation probabilities across different Chinese words. The first example in the figure shows an almost uniform distribution over all translations, while the third one illustrates a very skewed distribution. Meanwhile, the second example provides a distribution that is neither uniform nor totally skewed. These three examples illustrate the "translation uncertainty problem", which we have addressed in previous sections. Furthermore, the diversity in the distributions of translation probabilities makes it difficult for a selection-based approach to perform well over all the different cases. For example, the "BSTO" model is able to work well for the third example but will fail on the first one. On the other hand, the "ALTR" model would be perfect for the first example but not for the third one. Based on the above analysis, we conclude that it is important to capture the translation uncertainty, and the diversity of translation uncertainty, in a probabilistic model.

Figure 5.7. An example of query translation, using the "BSTO" model and the Maximum Coherence Model. (English words in italicized font are removed as stop words.) The figure lists, for each of the four Chinese query words, its English translation candidates, the selection made by the "BSTO" model (marked "x"), and the translation probabilities estimated by the Maximum Coherence Model (MAC) and the Spectral Query Translation Model (SQT), given here as MAC / SQT:

(word 1)  independent 0.36208 / 0.46944;  on one's own 0.24481 / 0.12342;  stand alone (x) 0.39311 / 0.40715
(word 2)  press (x) 0.33789 / 0.22208;  publish 0.19973 / 0.37477;  put out 0.33311 / 0.36082;  print 0.12927 / 0.04232
(word 3)  disappear (x) 0.34941 / 0.38760;  demise 0.32035 / 0.32365;  fade 0.33024 / 0.28875
(word 4)  symptom 0.16698 / 0.14468;  omen 0.16155 / 0.09868;  sign (x) 0.35015 / 0.55512;  portent 0.16097 / 0.09279;  premonition 0.16034 / 0.09773

Original TREC topic in English (topic 187, "title" field): Signs of the Demise of Independent Publishing.

5.6.4 The Impact of the Translation Independence Assumption on Query Disambiguation

To illustrate the impact of the translation independence assumption on query translation disambiguation, consider the example in Figure 5.7. This query consists of four Chinese words, and the English translations provided by the dictionary for each Chinese word are listed in the second column. The original English query is also included at the bottom of the figure. The English translations selected by the "BSTO" model are listed in the third column, marked by small crosses. The translation probabilities from the Chinese words to their English translations estimated by the Maximum Coherence Model and the Spectral Query Translation Model are listed in the last two columns, respectively.

Compared to the original English query, we see that the "BSTO" model makes incorrect translation selections for both the first and the second Chinese words. For the first one, the correct English translation should be "independent" instead of "stand (alone)" ("alone" is removed as a stop word and does not count in the translation; it is listed only for the sake of clarity). The better translation for the second Chinese word should be "publish" instead of "press".
One reason for such mistakes is that in the "BSTO" model, the coherence score of a translation is computed based on all the English translations provided by the dictionary for the Chinese words in the query. Thus, the coherence score of one translation word is completely independent of the selection of the other translations. Since both "stand" and "press" are common in English, their overall coherence scores turn out to be larger than the coherence scores of the other words, which leads them to be selected by the "BSTO" model. In contrast, in both the Maximum Coherence Model and the Spectral Query Translation Model, the estimation of translation probabilities for one word is dependent on the estimation of translation probabilities for the other words. As a result, they are able to correct the mistakes by assigning significant amounts of probability mass to the correct translations. For example, for the first Chinese word, both models are able to assign to the correct English translation "independent" a probability comparable to the probability assigned to the translation "stand (alone)".

Note that neither the Maximum Coherence Model nor the Spectral Query Translation Model is able to always assign the largest probabilities to the best translation candidates (as for the first Chinese word in Figure 5.7). However, in general, by maximizing the coherence of the entire translated query, both models tend to shift more probability mass to the best translation candidates and thus alleviate the mistakes made by the selection-based approaches, which originate from their false assumption of translation independence. We believe this is one of the major advantages of these two models over all the selection-based approaches.

5.6.5 Performance: MAC vs. SQT

Figure 5.8. Comparison of an example query word translation using the Maximum Coherence Model and the Spectral Query Translation Model. The numbers shown in this example are the probabilities of the translation candidates being included in the final query translation. Original TREC topic in English (topic 188, "title" field): Beachfront Erosion.

         encroach   erode     weather   eat
MAC      0          0.01091   0.34901   0.14008
SQT      0.02389    0.03139   0.34487   0.09985

Table 5.1 reveals a slightly higher average precision across the 11 recall points for the "SQT" model compared to the "MAC" model. In Figure 5.4 and Figure 5.5, the precision-recall curves of the "SQT" model generally stay above those of the "MAC" model, though the margin is sometimes small. All the experimental results suggest that the Spectral Query Translation Model is slightly better than the Maximum Coherence Model. This observation is in accordance with our theoretical analysis of the two models at the end of Section 5.5. Apart from the advantage of avoiding the trouble of choosing the optimal regularizer constant $C_p$, the benefit of normalizing the coherence matrix can be observed in the example presented in Figure 5.8, where we list and compare the probabilities, from both models, of including different translation candidates in the final query translation for one example query word. As we can see, the common word "eat" is assigned a significant amount of probability mass by the Maximum Coherence Model, although it is almost irrelevant to the context of the entire query.
At the same time, the word "encroach", which in fact is related to the theme of the query, receives an almost zero probability (or one too small to represent). As indicated by the results listed in Figure 5.8, the Spectral Query Translation Model is able to overcome this problem by redistributing some of the probability mass on the common word "eat" to other words. This is consistent with our previous analysis that the Spectral Query Translation Model has a better way of estimating the translation probabilities of common words than the Maximum Coherence Model.

5.6.6 Computational Efficiency

To examine the computational efficiency, we calculate the average number of seconds spent by the proposed algorithms to solve the optimization problem for each query. All the experiments are conducted on a PC with a Pentium 4 2.8GHz CPU and 1GB of RAM that runs Matlab 7.1 on a Windows system. We directly use the quadratic programming function provided by Matlab to solve the QP problem in the proposed algorithms for query translation disambiguation. The results are 0.027 seconds per query for short queries, and 3.014 seconds per query for long queries. Clearly, our algorithm is sufficiently fast for short queries, but a little slow for long queries. To improve the computational efficiency for long queries, we can first divide a long query into a number of short ones, and then translate each short query using the proposed framework. Since the computational complexity of quadratic programming is $O(n^3)$ in the worst case, where $n$ is the number of translation probabilities, by significantly reducing the number of query words we reduce the number of translation probabilities, and therefore improve the computational efficiency. Note that this approach is based on the assumption that most of the context needed for query translation disambiguation can be found in the neighborhood of each query word. As a potential research issue, improving the computational efficiency for long queries is worth further study.

5.7 Conclusions

In this chapter, we propose a novel statistical framework for cross-language information retrieval. It utilizes word co-occurrence statistics to estimate translation probabilities that are effective for query disambiguation. Compared to the selection-based approaches, the merits of the framework are twofold: 1) it preserves the translation uncertainty through the estimation of translation probabilities; 2) it estimates the translations for all query words simultaneously. Two realizations of the proposed framework, namely the Maximum Coherence Model and the Spectral Query Translation Model, are presented based on different choices of coherence measurements. Our analysis indicates that the Spectral Query Translation Model can also be interpreted as a graph partitioning approach for query translation disambiguation. Empirical results under various scenarios have shown that the proposed framework is able to perform substantially better than the existing selection-based approaches. Further analysis indicates that the Spectral Query Translation Model is more effective and reliable than the Maximum Coherence Model when dealing with common words.
CHAPTER 6

Semi-supervised Learning for Extraction of Questions and Answers from Web FAQs

6.1 Introduction

Question-answering is faced with the problem of a "lexical gap" [22] or a "word mismatch" [156] between questions and answers, suggesting that the retrieval of candidate documents for answer extraction should focus on documents that are similar to the expected answer rather than on documents that are similar to the user question. Recent work in question-answering has capitalized on existing repositories of Frequently-Asked-Questions (FAQ) pages to bridge this lexical gap. FAQ pages are easily available in large quantities on the web. They cover a wide range of topics, and thus present an ideal resource for learning about question-answer correspondences. Large repositories of question-answer (QA) pairs extracted from FAQ pages have been used to train statistical translation models in order to provide an additional measure for ranking candidate answers [22, 51, 122, 137], or to perform expansion of questions by answer terms using various query expansion techniques [78, 121, 3, 69]. In other work, QA repositories have been used as candidate collections for answer retrieval [30, 83, 154]. Besides question-answering, QA repositories have also been used in other related information retrieval and natural language processing tasks, such as query summarization [23], finding semantically similar questions [82] and knowledge representation [153, 154]. Since the performance of these applications depends heavily on the quality of the QA repository, high-precision extraction of QA pairs is crucial in order to provide reliable data for the various deployments.

While FAQ pages abound on the web, high-precision extraction of QA pairs is challenging because of the wildly varying markup of questions and answers across FAQ pages. We refer to this challenge as the Across Page Diversity challenge. Across FAQ pages on the web, the variants in QA markup range from simple paragraph breaks to special formatting (using italics, boldface, different fonts, font sizes, colors, etc.), various types of indentation (lists, tables, etc.), special prefixes (numbers, Q., Q:, Question:, etc.), or even sophisticated images. The creative power of web designers is displayed not only in the huge variations of QA markup, but also in the endless variants of "noise text", i.e., headers, navigational text, or other annotations that present neither questions nor answers.

To illustrate the Across Page Diversity challenge, Figures 6.1 - 6.4 present four snapshots of FAQ pages, each displaying questions and answers in its own format. Specifically, Figure 6.1 presents two FAQ answers in sophisticated formats: one with embedded listing structures and varying fonts, the other with multiple paragraphs separated by images; the other three snapshots present QA text mostly in plain paragraphs. In Figure 6.2, two types of images prefix each FAQ question and answer; in Figure 6.1 and Figure 6.3, only FAQ questions are marked by numbers; Figure 6.3 uses different colors and fonts to distinguish questions and answers. Figure 6.3 includes a real FAQ question without an ending question mark (i.e., the second question in the snapshot), while Figure 6.4 includes a fake FAQ question that does end with a question mark (i.e., in the dashed ellipse). Also, there are various types of noise text that are highlighted by boxes of all shapes in the four snapshots.
The diversity of web FAQ pages poses a serious challenge in extracting high-quality QA from them.

Figure 6.1. Snapshot of example web FAQ pages (1 of 4).

Figure 6.2. Snapshot of example web FAQ pages (2 of 4).

Figure 6.3. Snapshot of example web FAQ pages (3 of 4).
Figure 6.4. Snapshot of example web FAQ pages (4 of 4).

At least three major reasons contribute to the difficulty of the task:

- First, the list of QA pairs within a FAQ page is often mixed with various amounts of noise text, such as section headings, navigational text, or annotations, that are not part of any QA pair. Separating the text of QA pairs from the noise text can be difficult.

- Second, it is usually significantly more challenging to accurately identify the text of answers than that of questions. This is because the answer text tends to be much longer than that of questions, and more diversified in its presentation format and word usage.

- Third, many questions and answers on FAQ pages do not follow the grammar and syntax of English rigorously, which makes it difficult to apply the algorithms developed in natural language processing.

One may enumerate a few heuristic rules for identifying questions and answers on FAQ pages, such as punctuation, listing markers, lexical cues, repeating patterns, etc., as suggested in several previous studies [136, 84, 137, 99]. However, there are two major problems with employing those heuristics-based approaches. First, due to the large diversity in the presentation formats of questions and answers, it is difficult, if not impossible, to come up with a comprehensive list of heuristics that covers all the possible presentation patterns. Second, many heuristics-based approaches tend to struggle between false positives (i.e., accuracy) and false negatives (i.e., completeness) in QA extraction. More specifically, a set of relaxed heuristic rules tends to find more QA pairs, but at the price of including noise text in the QA pairs; by tightening the heuristic rules, we are likely to avoid the problem of including noise text, but at the risk of missing many true QA pairs.

Another approach toward QA extraction is to view it as a classification problem with three classes, i.e., the classes of questions, answers, and noise. We can then employ a supervised learning algorithm to train a statistical classifier from labeled examples. However, a key difficulty with this approach is that, due to the large variance in the presentation formats of QAs, it is difficult for any supervised learning algorithm to capture all the possible patterns from a limited number of labeled examples. Apparently, this difficulty originates from the Across Page Diversity. Fortunately, while QA markup may vary widely across pages, it is generally true that within a single page the QA markup is consistent.
This Within Page Consistency principle can be viewed as "side information" for the QA extraction task, encoding human knowledge about how FAQ pages are created. As will be shown later in this chapter, the principle's accuracy is well supported by an empirical study.

The Within Page Consistency principle leads to the possibility of a bootstrapping approach to QA pair extraction: the various heuristics that have been proposed for QA pair extraction can be deployed to perform a high-precision labeling of a seed set of text segments as questions, answers, and noise; then an optimization problem is solved that essentially propagates the class labels of the seed segments to the text segments that share similar HTML formats with the seeds. This optimization problem can be described in a transductive learning setting, similar to the well-known semi-supervised learning algorithm, the Spectral Graph Transducer [87]. Given a large enough database of FAQ pages, which themselves have to be extracted in a high-precision fashion, each of the extraction steps (labeling of seed examples, and similarity-based propagation) can be tuned to high precision, resulting in a large database of correctly identified QA pairs. We show in an experimental evaluation that, compared to both heuristics-based and supervised learning approaches, the proposed approach significantly improves the precision of QA extraction. The key advantages of the proposed algorithm are:

- It takes full advantage of the heuristics discovered by previous studies of QA extraction.

- It is entirely unsupervised, in that all the labeled examples for the semi-supervised learning algorithm are acquired by the heuristic rules. Therefore no human annotation is needed.

- The extraction from one web FAQ page is independent of any other pages, which enables the algorithm to be carried out in a parallel manner to handle huge amounts of data from the web.

The remainder of the chapter is organized as follows. In Section 6.2 we briefly review previous work on automatic QA extraction, and the related Spectral Graph Transducer algorithm. In Section 6.4, we present a few important observations on the web FAQ data that motivate our work. The semi-supervised learning algorithm for QA extraction is proposed in Section 6.5. Experiments and discussions are presented in Section 6.6. Finally, Section 6.7 concludes the work.

6.2 Related Work

6.2.1 QA Extraction from the Web

At first glance, extracting QA text from FAQ pages looks like a typical information extraction (IE) task. In [111] a maximum entropy Markov model has been proposed, and in [97] logical structure detection is used. While both approaches prove to be effective for QA text extraction from Usenet FAQ files, they are domain-specific, since they rely heavily on the special format of Usenet FAQs. Although IE research has made significant progress toward open-domain free-text tasks (for example, seminar announcement extraction [36]), the extraction is usually carried out as filling template slots. Unlike seminar announcements, which have a relatively fixed set of themes and presentation formats, QA can be a placeholder for virtually any theme or any presentation format. As a result, it is difficult to apply IE algorithms to extract QA pairs from FAQ pages.

Most of the previous studies on QA extraction from FAQ pages [99, 136, 84, 137] are heuristics-based approaches.
In summary, four types of heuristics have proved to be useful for identifying QA text segments on a FAQ page: (1) punctuation, (2) HTML tags (e.g., <p>, <BR>), (3) listing markers (e.g., Q:, (1)), and (4) lexical cues (e.g., What, How). Rules or memory-based learning algorithms are then applied to determine whether or not a text segment is a question or an answer. However, most previous efforts toward building a QA repository aim only at extracting FAQ questions; only a few studies are concerned with extracting FAQ answers. The work reported in [99] concentrates on the extraction of FAQ questions, and defines as the answer the region between two consecutive FAQ questions, thus ignoring the separation of QA pairs from noise text. In [136, 84, 137], up to three consecutive sentences following each identified question are viewed as the corresponding answer text. As already pointed out in Section 6.1, FAQ answers are significantly more difficult to extract than FAQ questions due to their length and diverse presentation formats. An important aspect that distinguishes our work from all the previous studies is that we aim at complete and noise-free text of both FAQ questions and answers, which is more challenging yet more useful for most deployments of QA repositories.

6.2.2 Review of the Spectral Graph Transducer

The Spectral Graph Transducer (SGT) [87] balances between fitting a model that is consistent with the supervised information (from labeled examples), and adopting the central assumption behind nearly all semi-supervised learning algorithms, i.e., that examples are more likely to share the same class labels if they are close to each other in the underlying manifold. Suppose $\{x_i\}_{i=1}^{n+m}$ is the set of data examples, with the first $n$ examples being labeled and the remaining $m$ examples unlabeled. An $(n+m) \times (n+m)$ matrix $L$ is defined as the graph Laplacian, which can be derived from the pairwise similarities of all the data examples [38, 87]. Note that the graph Laplacian matrix $L$ captures the structure of both labeled and unlabeled data. An $(n+m)$-dimensional vector $r$ is used to encode the label information, where $r_i = r_+$ if $x_i$ is a positive example, $r_i = r_-$ if $x_i$ is a negative example, and $r_i = 0$ if $x_i$ is unlabeled. $r_+$ and $r_-$ are constants that depend on the priors of the two classes. Let $f \in \mathbb{R}^{n+m}$ denote the vector of predicted class labels. The Spectral Graph Transducer is defined by the following optimization problem:

$$\min_f \; f^T L f + c\,(f - r)^T C (f - r) \qquad (6.1)$$
$$\text{s.t.} \quad f^T e = 0 \;\; (\text{where } e = [1, \cdots, 1]^T), \qquad f^T f = n + m$$

where the matrix $C = \mathrm{diag}(c_1, c_2, \ldots, c_n)$ is a diagonal cost matrix that assigns a different misclassification cost to each labeled data example. The trade-off between the graph cut value (i.e., the first term) and the training error (i.e., the second term) is balanced through the constant $c$.

Table 6.1. Corpus statistics of QA pair data.

           web pages    FAQ pages    QA pairs
count      4 billion    795,483      10,568,160

6.3 FAQ Page Classification

As shown in Table 6.1, the FAQ pages used in our experiment were extracted from a 4-billion-page web crawl using the queries "inurl:faq" and "inurl:faqs" to match the tokens "faq" or "faqs" in the URLs. This extraction resulted in 2.6 million web pages (0.07% of the crawl). Since not all of those pages are actually FAQs, we manually labeled 1,000 of those pages to train an online passive-aggressive classifier [41] in a 10-fold cross validation setup.
6.3 FAQ Page Classification

           web pages     FAQ pages    QA pairs
  count    4 billion     795,483      10,568,160

Table 6.1. Corpus statistics of the QA pair data

As shown in Table 6.1, the FAQ pages used in our experiment were extracted from a 4-billion-page web crawl using the queries "inurl:faq" and "inurl:faqs" to match the tokens "faq" or "faqs" in the URLs. This extraction resulted in 2.6 million web pages (0.07% of the crawl). Since not all of those pages are actually FAQs, we manually labeled 1,000 of them to train an online passive-aggressive classifier [41] in a 10-fold cross-validation setup. Training was done using 19 features on URLs, question marks, and word statistics (see Appendix A.3 for details), and resulted in an F1 score of around 90% for FAQ classification. Application of the classifier to the extracted web pages resulted in 795,483 pages being classified as FAQ pages. Instead of going into more details of the FAQ page classification algorithm, we will concentrate in this chapter on QA pair extraction, and present the main algorithm for QA pair extraction from FAQ pages in the following section.

6.4 Observations on Web FAQ Data

To better understand the challenges in extracting QA pairs from FAQ pages, we summarize in this section some important observations from the web FAQ data. (Note that there are two types of FAQ pages: one that compiles multiple QA pairs in a single page, and one that devotes each page to only one QA pair. We focus on the first type since it is dominant on the web.)

Noise Text: Most FAQ pages contain noise text that does not belong to any question or answer. Such text can appear either between two consecutive QA pairs or outside the entire list of QA pairs. It is important for the QA extractor to separate the noise text from the identified questions and answers.

Question Mark Rule: Most, though not all, FAQ questions end with question marks. This implies that the question mark is an important feature for identifying FAQ questions.

Pattern Diversity and Within Page Consistency: Due to the heterogeneous authorship of web pages, no specific text formats are used consistently across all FAQ pages. This implies that it could be very difficult to develop a supervised learning algorithm that captures the large diversity in the presentation patterns of QA pairs across different FAQ pages. However, within a FAQ page, it is often the case that consistent HTML formats are used to present questions, answers, and noise text. Moreover, the formats for presenting questions, answers, and noise are often different and distinguishable. This motivates us to consider a semi-supervised learning algorithm that explicitly exploits the within-page consistency.

The proposed algorithm consists of two major components: the heuristics that are applied to identify the seed examples of FAQ questions, answers, and noise text, and the semi-supervised learning algorithm that propagates the class labels of the seed examples to other text segments based on the Within Page Consistency principle. The following heuristics are used to identify the seed examples (a code sketch appears after the list):

1. From all the text segments that end with question marks, select the ones whose format repeats the most often and label them as FAQ questions. (Text segments of hyperlinks are excluded, to avoid the question lists found at the top of many FAQ pages.)

2. From all the text segments between any two consecutive FAQ questions identified by the first heuristic, select the ones whose format repeats the most often and label them as FAQ answers.

3. If a text segment repeats itself k times in a FAQ page, all occurrences of that text segment are labeled as noise text; this heuristic is designed to detect highly repetitive noise text such as navigational links (e.g., "back to top"). Here k is a predefined integer, set to 4 in our experiments. In addition, all text segments before the first question are labeled as noise.

All the above heuristics can be seen as simplified versions of those proposed in the previous studies mentioned in Section 6.2.1. These rules tend to be accurate, with a very small number of false positives: according to our study, they achieve a precision of 96.8% in classification. Due to the well-known precision-recall trade-off, however, when the precision of these rules is tuned to 96.8%, the corresponding recall becomes very low; therefore the rules, though able to achieve high extraction precision, are not by themselves suitable for complete and noise-free QA extraction.
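The sketch below is one simplified reading of the three heuristics (our own, with hypothetical names, and omitting the hyperlink exclusion): segments is assumed to be a list of (text, format) pairs in page order, produced by the preprocessing of Section 6.5.1, with the format given as a hashable tag sequence.

from collections import Counter

def label_seeds(segments, k=4):
    """Heuristic seed labeling: 'Q', 'A', 'N' (noise), or None (unlabeled)."""
    labels = [None] * len(segments)

    # Heuristic 1: among segments ending with a question mark, label the
    # most frequently repeated format as question seeds.
    q_fmts = Counter(f for t, f in segments if t.rstrip().endswith('?'))
    if q_fmts:
        q_fmt = q_fmts.most_common(1)[0][0]
        for i, (t, f) in enumerate(segments):
            if f == q_fmt and t.rstrip().endswith('?'):
                labels[i] = 'Q'

    # Heuristic 2: among segments lying between two consecutive question
    # seeds, label the most frequently repeated format as answer seeds.
    q_pos = [i for i, l in enumerate(labels) if l == 'Q']
    between = [i for lo, hi in zip(q_pos, q_pos[1:]) for i in range(lo + 1, hi)]
    a_fmts = Counter(segments[i][1] for i in between)
    if a_fmts:
        a_fmt = a_fmts.most_common(1)[0][0]
        for i in between:
            if segments[i][1] == a_fmt:
                labels[i] = 'A'

    # Heuristic 3: text repeated at least k times is noise (e.g., "back to
    # top" links), and so is everything before the first question seed.
    counts = Counter(t for t, _ in segments)
    for i, (t, _) in enumerate(segments):
        if counts[t] >= k:
            labels[i] = 'N'
    for i in range(q_pos[0] if q_pos else len(segments)):
        labels[i] = 'N'

    return labels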
6.5 A Semi-supervised Learning Approach for QA Extraction

In order to exploit the Within Page Consistency principle, the proposed approach views the extraction of QA pairs from each FAQ page as an independent multi-class classification problem, i.e., classifying each text segment in a FAQ page into the classes of questions, answers, and noise text. The key idea behind the proposed algorithm is to first identify a few text segments that can be classified by the specified heuristics. The class labels of these identified segments are then propagated to the text segments of the same web page that are similar in HTML format. In this section, we first describe the preprocessing step that divides a FAQ page into text segments, followed by the description of the proposed semi-supervised learning algorithm for QA extraction.

6.5.1 Web Page Preprocessing

The main purpose of preprocessing is to divide a FAQ page into the text segments to be classified. In order to exploit the Within Page Consistency principle, we need an effective way to represent the presentation format of text segments, so that the similarity between any two text segments can be computed accurately. Since all web FAQs are presented as HTML files, we propose to divide each FAQ page into text segments based on the layout of the HTML tags, and to represent the format of each text segment by the related HTML tags. In particular, we define a text segment as a continuous text region that is surrounded by exactly the same pair of immediate opening and closing tags. For example, given the following HTML text

<HTML><HEAD> Title Goes Here </HEAD>
<BODY><H1> This is the heading </H1>
<P> Q1: this is question 1 </P>
<P> A1: answer 1 starts here
<SPAN style="font-size: xx-small;"> some small font text </SPAN>
<TABLE><TR><TD> table cell 1 </TD></TR>
<TR><TD> table cell 2 </TD></TR></TABLE></P>

the identified text segments are: "Title Goes Here", "This is the heading", "Q1: this is question 1", "A1: answer 1 starts here", "some small font text", "table cell 1", "table cell 2".

Given the identified text segments, the next step is to represent the format of each segment by a sequence of HTML tags. In particular, each text segment t_i is represented by an HTML tag sequence q_i that includes all the HTML tags surrounding segment t_i, from outermost to innermost. For example, for the text segment "Q1: this is question 1", the HTML tag sequence is HTML, BODY, P, while the HTML tag sequence for the text segment "some small font text" is HTML, BODY, P, SPAN. Finally, given this representation of presentation format, the format similarity between two segments t_i and t_j is calculated as follows:

  Sim_f(t_i, t_j) = exp(-λ d(q_i, q_j))                           (6.2)

where d(q_i, q_j) is the edit distance between the two corresponding HTML tag sequences q_i and q_j, and λ is a decay factor (set to 0.1 in our experiments).
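Equation (6.2) reduces to a textbook edit-distance computation over tag sequences. A minimal sketch (ours), assuming the tag sequences are given as lists of tag names:

import math

def edit_distance(a, b):
    # Levenshtein distance between two HTML tag sequences, via the
    # standard dynamic program over one rolling row.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def format_similarity(q_i, q_j, lam=0.1):
    # Sim_f(t_i, t_j) = exp(-lambda * d(q_i, q_j)), as in Equation (6.2).
    return math.exp(-lam * edit_distance(q_i, q_j))

# e.g., for the two segments in the example above:
# format_similarity(['HTML', 'BODY', 'P'],
#                   ['HTML', 'BODY', 'P', 'SPAN'])  # exp(-0.1)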
We find that the web page division scheme proposed here has a disadvantage: it will break a sentence into multiple text segments if there are hyperlinks or highlighted terms within the sentence. To remedy this problem, we can detect the broken sentences among the text segments based on a few grammatical rules. For example, a text segment ending with no punctuation whose following text segment starts with a lower-case letter is very likely to be two broken parts of the same sentence. We then adjust the similarity measurement as follows:

  s_{i,j} = Sim_f(t_i, t_j) · b_{i,j}                             (6.3)

where Sim_f is the format similarity, and b_{i,j} is an adjustment factor from broken-sentence detection, i.e., b_{i,j} takes a constant value b_0 > 1 if t_i and t_j are two neighboring text segments that are detected as coming from the same sentence, and 1 otherwise.

6.5.2 Semi-supervised Learning for QA Extraction

The key idea of the semi-supervised learning approach is to propagate the class labels of the text segments that are classified by the heuristics to the segments of the same page that are similar in presentation patterns. The idea of label propagation, as shown in [87], is equivalent to minimizing the inconsistency between the similarity of text segments and the class labels assigned to the segments. We can thus encode the idea of label propagation in the following optimization problem.

Let {t_i}_{i=1}^{N} denote the set of all N text segments in a FAQ page. We introduce a probability matrix P = (p_{i,j})_{N×3}, where p_{i,j} indicates the probability of assigning the i-th text segment to the j-th class. Here, we encode the three classes by: class 1 for FAQ questions, class 2 for FAQ answers, and class 3 for noise text. Note that for each segment t_i

  Σ_j p_{i,j} = 1                                                 (6.4)

For convenience, we also use p_j to denote the j-th column vector in the matrix P, i.e., P = [p_1, p_2, p_3]. Each vector p_j collects the probabilities of all the text segments belonging to the corresponding class.

To represent the class information of the seed text segments, we introduce another matrix P̂ = (p̂_{i,j})_{N×3}, where p̂_{i,j} = 1 if the i-th text segment is a seed text segment that is classified into the j-th class by the heuristics, and p̂_{i,j} = 0 otherwise. Similarly, we can decompose the matrix P̂ into column vectors that represent the seed examples in each class, i.e., P̂ = [p̂_1, p̂_2, p̂_3].

Let S = (s_{i,i'})_{N×N} denote the matrix of all pairwise similarities of the text segments, as defined in Equation (6.3). We can view the similarity matrix S as defining a weighted graph: each node represents a text segment, and if the similarity between two text segments is non-zero, the corresponding two nodes are connected by an edge weighted by the similarity value. Based on this weighted graph, we can define a normalized graph Laplacian [38]

  L = I - D^{-1} S

where D = diag(d_1, ..., d_N) is a diagonal matrix with each d_i = Σ_{i'} s_{i,i'}. The graph Laplacian encodes the format similarity between any two text segments, and will be used as the basis for exploiting the Within Page Consistency principle.
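A minimal sketch (ours) of this construction, assuming the adjusted similarity matrix S from (6.3) is given as a dense array with non-zero row sums, and reusing the hypothetical seed labels from the earlier sketch:

import numpy as np

def build_laplacian(S):
    # Normalized graph Laplacian L = I - D^{-1} S of Section 6.5.2.
    d = S.sum(axis=1)                     # d_i = sum_{i'} s_{i,i'}
    return np.eye(len(S)) - S / d[:, None]  # divide row i by d_i

def build_seed_matrix(seed_labels):
    # P_hat: (N, 3) indicator matrix of the heuristic seed labels; the input
    # is the per-segment 'Q'/'A'/'N'/None output of the label_seeds sketch.
    cls = {'Q': 0, 'A': 1, 'N': 2}        # questions, answers, noise
    P_hat = np.zeros((len(seed_labels), 3))
    for i, l in enumerate(seed_labels):
        if l is not None:
            P_hat[i, cls[l]] = 1.0
    return P_hat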
Our goal is to estimate the optimal set of probabilities in the matrix P, such that if two text segments are similar in their presentation formats, they are assigned similar probabilities over the three classes. In addition, the optimal set of probabilities should be consistent with the class labels assigned to the seed examples by the heuristics. All of this can be formulated into the following optimization problem:

  min_P  Σ_{j=1}^{3} [ p_j^T L p_j + C_j (p_j - p̂_j)^T A_j (p_j - p̂_j) ]      (6.5)
  s.t.   P [1, 1, 1]^T = [1, ..., 1]^T
         p_j ≥ 0,  j = 1, 2, 3

where A_j is a diagonal matrix whose diagonal element is 1 if the corresponding text segment is a seed example of the j-th class and 0 otherwise, and C_j is a predefined factor that balances the two terms in the objective function.

The first term p_j^T L p_j in the objective function (6.5) is exactly the normalized graph cut [134, 87], whose minimization leads to a two-way clustering with minimum inter-cluster connections [134]. In particular, this term can be decomposed into pairwise parts, i.e.,

  p_j^T L p_j = Σ_{i,i'} s_{i,i'} (p_{i,j} - p_{i',j})^2 / Σ_{i''} s_{i,i''}    (6.6)

Thus, by minimizing the first term, we enforce the consistency between the class label assignment and the similarity measurement. The second term C_j (p_j - p̂_j)^T A_j (p_j - p̂_j) penalizes the disagreement between the estimated probabilities p_j and the seed labels p̂_j. Instead of formulating the seed labels as hard constraints in the optimization problem, we use this penalty term to ensure the consistency between p_j and p̂_j. The advantage of such a treatment is that it leaves room for correcting labeling mistakes in the seed examples.

It is easy to see that the proposed model in (6.5) can be viewed as a multi-class version of the Spectral Graph Transducer model reviewed in Section 6.2. As pointed out before, the proposed model can also be explained from the view of label propagation: the enforcement of within-page format consistency and of agreement with the seed examples on all the probabilities p_{i,j} can be viewed as propagating the labeling information from the labeled seed text segments to the unlabeled ones along the structure of the weighted graph. Since we need to decide the probabilities of assigning each text segment to three classes, our approach indeed runs three separate propagations, one for each class. It is important to point out that the three propagation processes are strongly correlated, because if a text segment is assigned a large probability for one class, the probability of assigning that text segment to the other classes has to be small. This can be further illustrated by the dual problem of (6.5), i.e.,

  min_{λ, u_j}  Σ_{j=1}^{3} (λ + u_j + C_j A_j p̂_j)^T (L + C_j A_j)^{-1} (λ + u_j + C_j A_j p̂_j)
  s.t.          u_j ≥ 0,  j = 1, 2, 3

where λ and u_j are the Lagrangian multipliers. Given the optimal solution for λ and u_j, the solution to the primal problem can be written as

  p_j = (L + C_j A_j)^{-1} (λ + u_j + C_j A_j p̂_j)

Clearly, the solutions for all the p_j's are correlated through the Lagrangian multiplier λ. Finally, compared to the SGT approach, another advantage of the proposed algorithm is its probabilistic nature. As we will show in our experiments, the estimated probabilities do reveal the uncertainty in classifying text segments into the classes of questions, answers, and noise text.

The optimization problem (6.5) is a quadratic programming problem, and can therefore be solved efficiently using standard packages. Given the estimated probabilities in P, we predict the class of each text segment as the one with the largest probability.
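Our experiments solve (6.5) with a standard QP package; purely for illustration, the following sketch (ours) instead uses the closed form p_j = (L + C_j A_j)^{-1} C_j A_j p̂_j, which is the solution above with λ = u_j = 0, i.e., with the simplex constraints dropped, and then clips and renormalizes each row of P.

import numpy as np

def propagate_labels(L, P_hat, C=(1.0, 0.5, 0.5)):
    # L: (N, N) normalized Laplacian I - D^{-1} S; P_hat: (N, 3) seed indicators.
    N = L.shape[0]
    P = np.zeros((N, 3))
    for j in range(3):
        A_j = np.diag(P_hat[:, j])            # 1 on the seeds of class j, else 0
        # Closed form with lambda = u_j = 0:
        # p_j = (L + C_j A_j)^{-1} C_j A_j p_hat_j
        P[:, j] = np.linalg.solve(L + C[j] * A_j, C[j] * (A_j @ P_hat[:, j]))
    P = np.clip(P, 0.0, None)                  # approximate p_j >= 0
    P /= P.sum(axis=1, keepdims=True) + 1e-12  # approximate rows summing to one
    return P.argmax(axis=1), P                 # predicted classes, probabilities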
  Step 0: Divide the page into text segments and compute their pairwise format similarities (as described in Section 6.5.1).
  Step 1: Initialize a labeled example set by identifying a few seed examples (as described in Section 6.4).
  Step 2: Solve the semi-supervised model in (6.5).
  Step 3: Predict the class label of each text segment as the class with the largest probability.

Figure 6.5. The semi-supervised learning algorithm for extracting QA pairs from FAQ pages

Figure 6.5 summarizes the proposed algorithm for QA extraction. As the final step, we need to form the QA pairs based on the class labels assigned to the text segments in each FAQ page. To this end, we first reduce the sequence by removing all the text segments labeled as noise text; then, in the reduced sequence, we merge consecutive text segments that have the same label; finally, we pair each merged text segment labeled as an answer with its most immediate preceding text segment labeled as a question, thus forming a QA pair. For example, suppose we have a sequence of text segments (represented by their IDs) and their class labels (represented by Q, A, and O for noise) as follows:

  ID:    1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
  Label: O  O  Q  Q  O  A  A  O  Q  A  A  O  A  Q  A  A  O

Then three QA pairs can be extracted from the above sequence:

  QA Pair 1: Question: #3, #4; Answer: #6, #7
  QA Pair 2: Question: #9; Answer: #10, #11, #13
  QA Pair 3: Question: #14; Answer: #15, #16
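The pair-forming step is mechanical enough to state directly in code. The sketch below (ours) reproduces the three QA pairs of the example above:

def form_qa_pairs(labels):
    """Form QA pairs from per-segment labels 'Q', 'A', 'O' (noise)."""
    # Step 1: drop noise; keep (segment_id, label) with 1-based ids.
    kept = [(i + 1, l) for i, l in enumerate(labels) if l != 'O']

    # Step 2: merge consecutive segments with the same label.
    merged = []
    for sid, l in kept:
        if merged and merged[-1][1] == l:
            merged[-1][0].append(sid)
        else:
            merged.append(([sid], l))

    # Step 3: attach each merged answer to the preceding merged question.
    pairs, question = [], None
    for ids, l in merged:
        if l == 'Q':
            question = ids
        elif question is not None:
            pairs.append((question, ids))
            question = None
    return pairs

# The running example:
print(form_qa_pairs(list("OOQQOAAOQAAOAQAAO")))
# [([3, 4], [6, 7]), ([9], [10, 11, 13]), ([14], [15, 16])]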
6.6 Experiments and Discussions

6.6.1 Experiment Setup

Our testbed includes 10,912 text segments, which form 1,303 QA pairs. These text segments come from 100 FAQ pages randomly selected from the FAQ pages identified in our web crawl data, as described in Section 6.3; the selection does not favor any particular domain or page content, and the size of this dataset is comparable to those used in [99, 136, 84, 137]. All 1,303 QA pairs are manually labeled, and are used as the ground truth for evaluation. Since our proposed algorithm does not need any training data, we run it on all the FAQ pages and compare the extracted QA pairs with the ground truth, i.e., the manually labeled QA pairs.

Both precision and recall are used as evaluation metrics. They are defined as follows:

  Precision = (Number of correctly extracted QA pairs) / (Number of extracted QA pairs)
  Recall    = (Number of correctly extracted QA pairs) / (Number of true QA pairs)

Note that a QA pair is "correctly extracted" if and only if the corresponding question and answer share identical sets of text segments with those manually identified as a QA pair. Since these evaluation metrics view all the text of a QA pair as a whole, we refer to them as evaluation by QA. Evidently, these metrics are very "strict": as long as an extracted QA pair is not identical to the true one, it does not count as correct, no matter how small the extraction error may be. We therefore define another set of metrics, based on word counting, that are relatively "looser" than the evaluation by QA. Specifically, we define

  Precision = (Number of correctly extracted QA words) / (Number of extracted QA words)
  Recall    = (Number of correctly extracted QA words) / (Number of words in the true QA pairs)

Here, a word is correctly extracted as long as it is in the true QA pairs. Note that under this set of metrics, an extracted QA pair has a chance to obtain "partial credit" when it is not identical to the true one. Since the evaluation is based on word counting, we call it evaluation by word. Furthermore, we combine precision and recall by computing the F1 score, defined as the harmonic mean of precision and recall. All of the above precision/recall/F1 metrics for QA pair extraction can be extended to evaluate the performance of extracting FAQ questions or FAQ answers separately.

To obtain a comprehensive understanding of the overall extraction performance, two different averaging methods are used in our evaluation. In the first method, we compute the evaluation metric (i.e., precision, recall, or F1) for each FAQ page and average it across all FAQ pages; we refer to this as macro-averaging. Alternatively, we can directly compute the evaluation metrics over all the individual QA pairs from all the FAQ pages; we refer to this as micro-averaging.

We implemented three baseline methods. The first two are based on heuristics, and have been used in previous studies [99, 136, 84, 137]. Both employ an empirically identified optimal set of heuristics to first find all the questions. In the first baseline method, all the text between two consecutive questions is identified as the answer to the preceding question (for the last question in a FAQ page, up to n text segments are extracted for its answer, where n is the average number of text segments in the previous FAQ answers). We refer to this method as "H-all". In the second baseline method, up to 3 sentences following each FAQ question are extracted as the corresponding FAQ answer. We refer to this baseline as "H-3". The third baseline is a Support Vector Machine with precomputed kernels, a representative supervised learning method. We use 1/4 of the FAQ pages as the training set and the rest as the test set; the kernel is precomputed using the similarity measurement defined in Section 6.5.1, and the LibSVM software package [34] is used in our experiment. We refer to this method as "kSVM". Finally, we use "SSL" to refer to the proposed semi-supervised learning algorithm for extracting QA pairs from FAQ pages. The constants in the optimization problem (6.5) are set to C_1 = 1, C_2 = C_3 = 0.5 based on empirical experience.
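Before turning to the results, the two families of evaluation metrics can be sketched as follows (our rendering; segment ids and words are assumed to be hashable):

from collections import Counter

def by_qa_scores(extracted, truth):
    # extracted, truth: collections of QA pairs, each a (question_ids,
    # answer_ids) tuple of tuples; a pair is correct only if it matches
    # a true pair exactly.
    extracted, truth = set(extracted), set(truth)
    correct = len(extracted & truth)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(truth) if truth else 0.0
    return p, r, 2 * p * r / (p + r) if p + r else 0.0

def by_word_scores(extracted_words, true_words):
    # Multiset (collections.Counter) versions of the 'by word' metrics: a
    # word counts as correct as long as it occurs in the true QA text.
    correct = sum((extracted_words & true_words).values())
    p = correct / sum(extracted_words.values())
    r = correct / sum(true_words.values())
    return p, r, 2 * p * r / (p + r) if p + r else 0.0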
6.6.2 Experiments on Verification of Side Information

We first examine the Within Page Consistency principle, namely that two text segments of the same page tend to follow the same HTML format pattern if they are in the same class. To verify this principle, we compute two quantities for each pair of text segments in the manually labeled FAQ page set: the similarity between the two text segments in their HTML format, and whether or not the two segments are in the same class.

[Figure 6.6 plots "Text Segment Similarity vs. Being in the Same Class": the probability of being in the same class (y-axis) against format similarity (x-axis), with four curves: Within Pages (Question), Across Pages (Question), Within Pages (Answer), and Across Pages (Answer).]

Figure 6.6. The probability distribution for two text segments to be in the same class versus the format similarity between the text segments

Figure 6.6 summarizes the results by showing how the format similarity between two text segments affects the probability of their being in the same class. The two solid lines show the results for the intra-class similarity of two text segments within the same page, and the two dotted lines show the results for the intra-class similarity of two text segments from different pages. It is clear that for text segments of the same web page, the intra-class similarity nicely indicates their relationship: the larger the similarity between two text segments, the more likely the two segments are to be in the same class. In contrast, the intra-class similarity between text segments of different web pages appears to be uninformative about their class memberships: both dotted lines in Figure 6.6 are flat across a wide range of similarities, and show no clear trend that larger similarity implies higher probability. We thus conclude that the Within Page Consistency principle provides a piece of valid side information for our experimental data.

6.6.3 Experiments on QA Extraction

Table 6.2 lists the precision/recall/F1 of the four different approaches to QA extraction. As shown in the table, the F1 scores of the proposed "SSL" method are always considerably better than those of the three baseline models, whether using the micro-averaging or the macro-averaging method. Especially when using the strict evaluation metrics "by QA", the improvement made by the proposed algorithm is significant. We can therefore conclude that our proposed "SSL" method performs better than the baseline methods in extracting QA pairs from FAQ pages. Further analysis of the experimental results reveals the following findings.

1. For all four algorithms, the performance of FAQ question extraction is in general significantly better than that of FAQ answer extraction under the evaluation-by-QA metrics. This is in accordance with the fact that answers are more challenging to extract, due to their greater length and more diversified formats.

2. In general, for all four extraction methods, the scores under evaluation by word are considerably higher than those under evaluation by QA. This observation implies that although the noise text mixed with the FAQ text is small in amount, it is pervasive. This is related to the fact that there are often repeated patterns in a FAQ page, so an error in extracting one QA pair is very likely to recur when extracting other QA pairs.
When the extracted QA pairs are used in other applications, as enumerated in Section 6.1, such pervasive and recurring errors could be a very annoying factor that seriously degrades the application performance.

  By QA              micro-averaging              macro-averaging
                     P       R       F1           P       R       F1
  H-all   Pair       0.5055  0.4950  0.5002       0.5089  0.4919  0.4965
          Q          0.8706  0.6662  0.7548       0.8617  0.6564  0.7280
          A          0.5078  0.4973  0.5025       0.5105  0.4933  0.4980
  H-3     Pair       0.4704  0.3599  0.4078       0.4601  0.3532  0.3910
          Q          0.8706  0.6662  0.7548       0.8617  0.6564  0.7280
          A          0.4794  0.3669  0.4157       0.4681  0.3578  0.3968
  kSVM    Pair       0.1267  0.1696  0.1450       0.1001  0.1493  0.1189
          Q          0.3029  0.2304  0.2617       0.2109  0.1923  0.1956
          A          0.1323  0.1842  0.1540       0.1104  0.1569  0.1245
  SSL     Pair       0.8163  0.7130  0.7612       0.7698  0.6838  0.7159
          Q          0.9411  0.8220  0.8775       0.9196  0.7891  0.8356
          A          0.8383  0.7322  0.7816       0.7901  0.7010  0.7344

  By Word            micro-averaging              macro-averaging
                     P       R       F1           P       R       F1
  H-all   Pair       0.8728  0.6987  0.7761       0.8640  0.7518  0.8007
          Q          0.8896  0.7156  0.7932       0.9027  0.7196  0.7535
          A          0.8673  0.9051  0.8858       0.8559  0.9102  0.8680
  H-3     Pair       0.9303  0.3543  0.5132       0.9137  0.4656  0.5790
          Q          0.8896  0.7156  0.7932       0.9027  0.7196  0.7535
          A          0.9441  0.3113  0.4682       0.9229  0.4365  0.5486
  kSVM    Pair       0.4930  0.4265  0.4573       0.3922  0.3729  0.3654
          Q          0.5023  0.5313  0.5164       0.4238  0.4247  0.4156
          A          0.5123  0.4269  0.4657       0.4323  0.3920  0.4107
  SSL     Pair       0.9806  0.8126  0.8887       0.9589  0.8077  0.8493
          Q          0.9800  0.8163  0.8907       0.9496  0.7981  0.8590
          A          0.9807  0.8727  0.9235       0.9506  0.8324  0.8827

Table 6.2. The performance of QA extraction by the four different methods. The columns "P", "R", and "F1" give the precision, recall, and F1 scores.

3. It is interesting to observe the performance gap between the "H-all" algorithm and "SSL" under the different evaluation metrics. Using the evaluation-by-word metrics, the scores of the "H-all" algorithm are high, and sometimes close to those of the "SSL" algorithm. However, when using the evaluation-by-QA metrics, the performance gap between "H-all" and "SSL" is much larger. This phenomenon shows that instead of extracting a large number of QA pairs with small errors, our "SSL" method is able to yield many more complete and noise-free QA pairs, thus improving the quality of the extracted QA text.

4. Comparing the "H-all" method with the "H-3" method in terms of FAQ answer extraction (their performance in extracting FAQ questions is the same, because the two methods differ only in their FAQ answer extraction strategies), we observe that under the "by word" metrics the former approach has a much better recall but a worse precision. This is because the "H-all" algorithm includes all the text between two consecutive questions as the answer, and therefore achieves a high recall, whereas the "H-3" algorithm includes only the first up-to-three sentences following a question as the answer, and therefore achieves a high precision. This result shows the interesting trade-off between precision and recall that exists in most heuristics-based approaches.

5. The performance of the "kSVM" method is extremely poor. This is not surprising, since the precomputed kernel encodes much information about the intra-class similarity across different pages, which is shown to be uninformative in Section 6.6.2. This experiment implies that the large variance in presentation formats across different FAQ pages is hard to capture with a supervised learning algorithm, given a limited number of labeled examples.
As stated before, one important advantage of the proposed algorithm is that it outputs not only the classification results but also the uncertainty in the classification. To examine the quality of the class probabilities estimated by the proposed algorithm, we compute two quantities for each text segment: the label probability output by the proposed algorithm, and whether or not the predicted class label is correct.

[Figure 6.7, "The Effectiveness of Label Probabilities", plots the prediction accuracy (y-axis) against the estimated label probability (x-axis), with four curves: All, Question, Answer, and Noise.]

Figure 6.7. The correlation between prediction accuracy and the classification probability estimated by the proposed algorithm for QA extraction

Figure 6.7 summarizes these results by showing how the prediction accuracy of each class is affected by the estimated label probability. Overall, the larger the estimated probability, the higher the prediction accuracy. We thus conclude that the estimated label probabilities do indicate the confidence in classifying text segments.

6.7 Conclusions

In this chapter, we studied the problem of automatically extracting QA pairs from web FAQs. We first identified the pattern diversity challenge, a key challenge in QA extraction from FAQ pages. We then presented a semi-supervised learning algorithm that is effective in exploiting the Within Page Consistency principle, the key side information for the QA extraction task. The main advantage of the proposed algorithm is twofold: first, it is able to boost the limited knowledge by generating seed examples labeled by heuristics; second, it is completely unsupervised. It predicts the class labels of text segments by propagating the class labels of seed examples to the other text segments of the same page. Empirical study shows that our proposed QA extraction method is able to yield more complete and noise-free QA pairs, hence improving the extraction performance substantially over the heuristics-based and supervised learning approaches of previous studies.

CHAPTER 7

Conclusions

The topic of this thesis is semi-supervised learning with side information. In the previous chapters, three generic learning tasks and two applications were discussed in detail. The contributions of each chapter can be summarized as follows:

- In Chapter 2, we propose a constrained non-negative matrix factorization (CNMF) model for multi-label learning with class correlations, to meet the challenging situation of a large number of classes and a small training set.
- In Chapter 3, we propose a novel boosting framework, LinkBoost, to improve any supervised classification algorithm with link information.
- In Chapter 4, we propose a novel boosting framework, BoostCluster, to improve any clustering algorithm with pairwise constraints.
- In Chapter 5, we propose a novel statistical framework, the Maximum Coherence Framework, for query translation disambiguation in cross-language information retrieval with a bilingual dictionary.
- In Chapter 6, we propose a semi-supervised learning model that utilizes human knowledge about web FAQs to automatically extract question-answer pairs from them without human supervision.
The role played by side information in each task and application is as follows:

- In the CNMF model for multi-label learning, side information (in the form of class correlations) is encoded with class labels for computing pairwise example similarities, whose consistency with input-pattern-based similarities is then enforced.
- In the LinkBoost framework for classification, side information (in the form of links) is combined with input-pattern-based example similarities, whose consistency with the pairwise relationships induced from class labels is then enforced.
- In the BoostCluster framework for clustering, side information (in the form of pairwise constraints) is enforced to be consistent with the input-pattern-based computation of pairwise similarities.
- In the Maximum Coherence model for CLIR, side information (in the form of dictionaries) is used to construct a graph, over which soft class memberships are enforced to be consistent with pairwise term similarities (or "coherence").
- In automatic question-answer extraction from web FAQs, side information (in the form of "within page consistency" knowledge) is used to correlate three class label propagations over the same graph, where each label propagation can be explained as consistency enforcement between class labels and input-pattern-based similarities.

From the above, it is easy to see that consistency enforcement is a common theme in the use of side information. On an abstract level, the rationale behind all the work presented in this thesis is the assumption that data examples that are close to each other, whether judged from their input patterns or from side information, should be predicted similarly. This is a natural extension of the data consistency assumption underlying most graph-based learning approaches (as stated in Section 1.1.2).

To use side information effectively despite its usual sparseness, incompleteness, and noise, we deliberately avoid formulating it as hard constraints when incorporating it into semi-supervised models. Instead, we favor the use of side information in soft penalties. In this way, violation of the side information is allowed, but at a loss; minimizing the collective violation loss results in tolerance of the noise in the side information. Also, all our proposed models inherit the spirit of graph-based learning, in which a connected graph is constructed over all the (labeled and unlabeled) data based on their input patterns. Such a graph serves as a good supplement for understanding the underlying structure of the data wherever side information is missing. This treatment reduces the risk brought by the incompleteness and sparseness of side information. Consequently, robustness is exhibited by all the proposed semi-supervised learning models with side information, as suggested by the experiments shown in the previous chapters.

In all the semi-supervised models proposed in this thesis, the theme of consistency enforcement is always realized through optimization. A comparison among the objective functions in Chapters 2 through 6 reveals a certain degree of resemblance. In particular, the consistency is always formalized in a pairwise manner, i.e., the overall consistency is decomposed into consistency measurements defined on each pair of data examples, including labeled and unlabeled ones. Such a formalization embodies the use of unlabeled data, and also fosters the propagation of supervised information from labeled data to unlabeled data.
Another advantage of formalizing the optimization objectives in a sum-of-pairwise manner is that, by carefully choosing the consistency measurement defined on a pair of data examples, the resulting optimization is often convex, and thus tractable with a global optimum.

The above analysis summarizes a few nice properties of the proposed semi-supervised models. In conclusion, the work presented in this thesis suggests a viable approach towards semi-supervised learning with side information: enforcing consistency among data input patterns, supervised information (if any), side information, and predictions. We believe that this approach is applicable to a wide variety of learning tasks and application areas.

APPENDIX A

Related Proofs and Lists

A.1 Proof of the eigenvector problem in Section 4.3.3

We show that every eigenvector of the matrix X T X^T with a non-zero eigenvalue can be written as a linear combination of x_i, i = 1, 2, ..., m, i.e., of the examples involved in the pairwise constraints. Let v and λ ≠ 0 be an eigenvector and eigenvalue of the matrix X T X^T. We therefore have

  X T X^T v = λ v

We further decompose v into two parts: v = v_∥ + v_⊥, where v_∥ represents the projection of v onto the subspace spanned by {x_i}_{i=1}^{m}, and v_⊥ represents the projection perpendicular to {x_i}_{i=1}^{m}. To show that v can be written as a linear combination of {x_i}_{i=1}^{m}, we need to show v_⊥ = 0. To this end, we first utilize the expression in (4.20) to calculate v_⊥^T X T X^T, i.e.,

  v_⊥^T X T X^T = Σ_{i,j=1}^{m} T_{i,j} (v_⊥^T x_i) x_j^T = 0^T

We then multiply the eigen-equation X T X^T v = λ v by v_⊥^T, which leads to the following equation:

  v_⊥^T X T X^T v = 0 = λ v_⊥^T v = λ ||v_⊥||_2^2

Since λ ≠ 0, we have v_⊥ = 0 and v = v_∥.

A.2 Proof of the generalized eigenvector problem in Section 4.3.3

We will show that for the i-th eigenvector v_i = X w_i of X T X^T, w_i corresponds to the i-th eigenvector of the generalized eigenvector problem in (4.22). First, we note that the orthogonality condition v_i^T v_j = δ(i, j) becomes w_i^T X^T X w_j = δ_{i,j}. We can write this condition for all w_i, i = 1, 2, ..., s in matrix form, i.e., W^T X^T X W = I_s. Second, the eigenvectors V = (v_1, v_2, ..., v_s) are the optimal solution to the following optimization problem:

  arg max_{V ∈ R^{d×s}}  tr(V^T X T X^T V)
  s.t.  V^T V = I_s

Replacing V in the above optimization problem with V = X W, we have

  max_W  tr(W^T X^T X T X^T X W)
  s.t.   W^T X^T X W = I_s

It is well known that the optimal solution W to the above problem consists of the first s eigenvectors of the generalized eigenvector problem in (4.22).
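For illustration, the reduced problem of A.2 can be handed to an off-the-shelf generalized symmetric eigensolver. The sketch below (ours, under the assumption that T is symmetric, so that X^T X T X^T X is symmetric as well) adds a small ridge in case X^T X is singular:

import numpy as np
from scipy.linalg import eigh

def top_projections(X, T, s):
    # X: (d, n) data matrix; T: (n, n) symmetric constraint matrix.
    # Solves (X^T X) T (X^T X) w = mu (X^T X) w for the top-s pairs,
    # the reduced form of maximizing tr(V^T X T X^T V) with V = X W.
    M = X.T @ X                          # (n, n) Gram matrix
    A = M @ T @ M                        # X^T (X T X^T) X
    B = M + 1e-8 * np.eye(len(M))        # ridge in case M is singular
    mu, W = eigh(A, B)                   # generalized symmetric eigensolver
    W = W[:, np.argsort(mu)[::-1][:s]]   # keep the top-s eigenvectors
    return X @ W                         # columns are the v_i = X w_i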
A.3 Features for FAQ page classification

The 19 features we used in FAQ page classification are as follows (a code sketch of the URL-based features appears after the list). A URL can be divided into six sections; for example, in http://www.amazon.com:83/search.html?q=travel#marker, the protocol section is http, the host section is www.amazon.com, the port section is 83, the path section is /search.html, the query section is q=travel, and the fragment section is marker.

1. Occurrence of the keyword faq or faqs in the URL host section
2. Number of keywords faq or faqs in the URL query section
3. Number of keywords faq or faqs in the URL fragment section
4. Normalized position of the keyword faq or faqs in the URL path section (for example, in the URL path section /education/faq/coins/denominations.shtml, the keyword faq appears in the 3rd segment when counting from right to left; since there are 4 segments altogether, the normalized position of faq is 3/4)
5. Number of question marks in the page body text
6. Number of question marks in the page title
7. Number of question marks in the anchor text
8. Logarithm of the total number of words in the whole page text
9. Number of anchors with question marks
10. Percentage of anchors-with-question-marks among all the anchors
11. Ratio of question marks to words in the page body text
12. Ratio of question marks to words in the page title
13. Ratio of question marks to words in all the anchor text
14. Average number of words between consecutive question marks
15. Number of text segments in the page body text
16. Number of text segments (see Section 6.5.1 for the definition) with question marks in the page body text
17. Percentage of text-segments-with-question-marks among all the text segments
18. Percentage of links among all the text-segments-with-question-marks
19. Number of well-separated text-segments-with-question-marks (two consecutive text segments are well separated if there are at least k words between them; in our practice, we set k = 3)
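A minimal sketch (ours) of the URL-based features 1 through 4; the remaining features require the page body, title, and anchor texts:

import re
from urllib.parse import urlparse

FAQ_TOKENS = {'faq', 'faqs'}

def tokens(s):
    return re.split(r'[^a-z0-9]+', s.lower())

def url_faq_features(url):
    u = urlparse(url)
    # 1. occurrence of 'faq'/'faqs' in the host section
    f1 = int(bool(FAQ_TOKENS & set(tokens(u.netloc))))
    # 2./3. number of 'faq'/'faqs' tokens in the query / fragment sections
    f2 = sum(t in FAQ_TOKENS for t in tokens(u.query))
    f3 = sum(t in FAQ_TOKENS for t in tokens(u.fragment))
    # 4. normalized position of the keyword in the path, counted from the right
    segs = [s for s in u.path.split('/') if s]
    pos = [len(segs) - i for i, s in enumerate(segs) if FAQ_TOKENS & set(tokens(s))]
    f4 = pos[0] / len(segs) if pos else 0.0
    return [f1, f2, f3, f4]

# url_faq_features('http://example.com/education/faq/coins/denominations.shtml')
# -> [0, 0, 0, 0.75]   (the 'faq' segment is 3rd from the right of 4 segments)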
In I CML ’02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 27—34, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. Sugato Basu, Arindam Banerjee, and Raymond J. Mooney. Active semi- supervision for pairwise constrained clustering. In SDM ’04 : Proc. of the Fourth SIAM International Conference on Data Mining, 2004. Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. A probabilistic frame- work for semi—supervised clustering. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 59—68, New York, NY, USA, 2004. ACM Press. R. Bekkerman and M. Sahami. Semi-supervised clustering using combinato- rial MRFS. In Proc. of ICML-06 Workshop on Learning in Structured Output Spaces, 2006. M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373—1396, 2003. Mikhail Belkin and Partha Niyogi. Using manifold stucture for partially labeled classification. In Sebastian Thrun, Lawrence Saul, and Bernhard Scholkopf, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA, 2002. MIT Press. Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. Label propagation and quadratic criterion. In Olivier Chapelle, Bernhard Schlkopf, and Alexander Zien, editors, Semi-Supervised Learning, pages 193-216. MIT Press, 2006. K. Bennett, P. Bradley, and A. Demiriz. Constrained k—means clustering. Tech- nical Report 2000—65, Microsoft Research, May 2000. 189 [21] [22] [23] [24] [261 [27] [28] [29] [30] Kristin P. Bennett, Ayhan Demiriz, and Richard Maclin. Exploiting unla- beled data in ensemble methods. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 289—296, New York, NY, USA, 2002. ACM. Adam Berger, Rich Caruana, David Cohn, Dayne Freitag, and Vibhu Mittal. Bridging the lexical chasm: statistical approaches to answer-finding. In SI CIR ’00: Proceedings of the 23rd annual international ACM SICIR conference on Research and development in information retrieval, pages 192—199, New York, NY, USA, 2000. ACM Press. Adam Berger and Vibhu O. Mittal. Query-relevant summarization using faqs. In ACL ’00: Proceedings of the 38th Annual Meeting on Association for Com- putational Linguistics, pages 294—301, Morristown, NJ, USA, 2000. Association for Computational Linguistics. Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning, page 11, New York, NY, USA, 2004. ACM Press. Avrim Blum and Shuchi Chawla. Learning from labeled and unlabeled data using graph mincuts. In I CML ’01 : Proceedings of the Eighteenth International Conference on Machine Learning, pages 1926, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. Avrim Blum, John Lafferty, Mugizi Robert Rwebangira, and Rajashekar Reddy. Semi-supervised learning using randomized mincuts. In I CML ’04: Proceedings of the twenty—first international conference on Machine learning, page 13, New York, NY, USA, 2004. ACM Press. Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In COLT’ .98: Proceedings of the eleventh annual conference on Computational learning theory, pages 92—100, New York, NY, USA, 1998. ACM. Matthew R. Boutella, Jiebo Luob, Xipeng Shen, and Christopher M. Browna. 
Learning multi-label scene classification. Pattern Recognition, 37:1757—1771, 2004. Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121~—167, 1998. Robin D. Burke, Kristian J. Hammond, Vladimir A. Kulyukin, Steven L. Lyti- nen, N. Tomuro, and S. Schoenberg. Question answering from frequently asked 190 question files: Experiences with the faq finder system. Technical report, Uni- versity of Chicago, Chicago, IL, USA, 1997. [31] Lijuan Cal and Thomas Hofmann. Hierarchical document categorization with support vector machines. In Proc. ACM Conf Information and Knowledge Management, 2004. [32] Pavel Calado, Marco Cristo, Edleno Moura, Nivio Ziviani, Berthier Ribeiro— Neto, and Marcos Andre; Goncalves. Combining link-based and content-based methods for web document classification. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 394—401, New York, NY, USA, 2003. ACM Press. [33] Soumen Chakrabarti, Byron Dom, and Piotr Indyk. Enhanced hypertext cat- egorization using hyperlinks. In SICMOD ’98: Proceedings of the 1998 ACM SI CM 0D international conference on Management of data, pages 307—318, New York, NY, USA, 1998. ACM. [34] Chih-Chung Chang and Chili-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/ cjlin/libsvm. [35] Zheng Chen, Liu Wenyin, Feng Zhang, and Mingjing Li. Web mining for web image retrieval. J. Am. Soc. Inf. Sci. T echnol., 52(10):831—839, 2001. [36] Hai Leong Chieu and Hwee Tou Ng. A maximum entropy approach to infor- mation extraction from semi-structured and free text. In Eighteenth national conference on Artificial intelligence, pages 786—791, Menlo Park, CA, USA, 2002. American Association for Artificial Intelligence. [37] Fan R. K. Chung. Eigenvalues of graphs. In Proceedings of the International Congress of Mathematicians, pages 1333—1342, Zurich, 1994. Birkh”auser Ver- lag, Berlin. [38] Fan R. K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, ISSN: 0160—7642. American Mathematical Society, 1997. [39] William W. Cohen. Improving a page classifier with anchor extraction and link analysis. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1481-1488. MIT Press, Cambridge, MA, 2002. [40] David Cohn, Deepak Verma, and Karl Pfieger. Recursive attribute factoring. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 297—304. MIT Press, Cambridge, MA, 2007. 191 [41] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev—Slnvartz, and Yoram Singer. Online passive-agressive algorithms. Machine Learning: 7:55la585, 2006. [42] Koby Crammer and Yoram Singer. A new family of online algorithms for cate- gory ranking. In SI CIR ’02: Proceedings of the 25th annual international ACM SICIR conference on Research and development in informaion retrieval, 2002. [43] Florence d’Alché Buc, Yves Grandvalet, and Christophe Ambroise. Semi- supervised marginboost. In Advances in Neural Information Processing Systems 14, pages 553—560, 2001. [44] I. Davidson and SS. Ravi. Clustering under constraints: Feasibility results and the k-means algorithm. In SIAM Data Mining Conference, 2005. [45] 1. Davidson and SS. Ravi. Hierarchical clustering with constraints: Theory and practice. In Proc. of the 9th European Principles and Practice of KDD (PKDD), 2005. 
[46] Mark W. Davis. New experiments in cross-language text retrieval at NMSU’S computing research lab. In D. K. Harman, editor, The Fifth Tert REtrieval Conference (TREC-5). NIST, 1996. [47] Brian D. Davison. Topical locality in the web. In SI CIR ’00: Proceedings of the 23rd annual international ACM SI CIR conference on Research and development in information retrieval, pages 272—279, New York, NY, USA, 2000. ACM Press. [48] Brian D. Davison. Toward a unification of text and link analysis. In SICIR ’03: Proceedings of the 26th annual international ACM SICIR conference on Research and development in informaion retrieval, pages 367—368, New York, NY, USA, 2003. ACM Press. [49] Chris H. Q. Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, and Horst D. Simon. A min-max cut algorithm for graph partitioning and data clustering. In Pro- ceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001), pages 107-114. IEEE Computer Society, 2001. [50] CL. Blake D.J. Newman, S. Hettich and OJ. Merz. UCI repository of machine learning databases, 1998. [51] Abdessamad Echihabi and Daniel Marcu. A noisy-channel approach to ques- tion answering. In ACL ’03: Proceedings of the 413t Annual Meeting of the Association for Computation al Linguistics, Sapporo, Japan, 2003. 192 [52] Andre Elisseeff and JasonWeston. A kernel method for multi-labelled classifi- I54] [55] I56] [57] I59] [60] [61] cation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 681- 687, Cambridge, MA, 2002. MIT Press. Marcello Federico and Nicola Bertoldi. Statistical cross—language information retrieval using N-best query translations. In SI CIR ’02: Proceedings of the 25th annual international ACM SICIR conference on Research and development in information retrieval, pages 167—174. ACM Press, 2002. Christiane Fellbaum, editor. Wordnet : an electronic lexical database. MIT Press, 1998. Yoav Freund and Robert E. Schapire. A decision—theoretic generalization of on-line learning and an application to boosting. pages 23-37, 1995. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on—line learning and an application to boosting. J. of Computer and System Sciences, 55(1):119—139, 1997. Jianfeng Gao, Jian-Yun Nie, Endong Xun, Jian Zhang, Ming Zhou, and Changning Huang. Improving query translation for cross—language informa- tion retrieval using statistical models. In SICIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 96~104. ACM Press, 2001. Jianfeng Gao, Ming Zhou, Jian-Yun Nie, Hongzhao He, and Weijun Chen. Resolving query translation ambiguity using a decaying co—occurrence model and syntactic dependence relations. In SICIR ’02: Proceedings of the 25th annual international ACM SICIR conference on Research and development in information retrieval, pages 183-190. ACM Press, 2002. Sheng Gao, Wen Wu, Chin-Hui Lee, and Tat-Seng Chua. A MFoM learning approach to robust multiclass multi-label text categorization. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning, 2004. Nadia Ghamrawi and Andrew McCallum. Collective multi-label classification. In CIKM ’05: Proceedings of the 14th ACM international conference on Infor- mation and knowledge management, pages 195—200, New York, NY, USA, 2005. ACM Press. Rayid Ghani, Sean Slattery, and Yiming Yang. Hypertext categorization using hyperlink patterns and meta data. 
[61] Rayid Ghani, Sean Slattery, and Yiming Yang. Hypertext categorization using hyperlink patterns and meta data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 178-185, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[62] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, Inc., San Diego, USA, 1981.

[63] Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, and Gary W. Flake. Using web structure for classifying and describing web pages. In WWW '02: Proceedings of the 11th international conference on World Wide Web, pages 562-569, New York, NY, USA, 2002. ACM.

[64] Jacob Goldberger, Sam T. Roweis, Geoffrey E. Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems 17 (NIPS 2004), 2004.

[65] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins Press, 1989.

[66] Ming Gu, Hongyuan Zha, Chris Ding, Xiaofeng He, and Horst Simon. Spectral relaxation models and structure analysis for k-way graph clustering and bi-clustering. Technical Report CSE-01-007, Department of Computer Science and Engineering, Pennsylvania State University, 2001.

[67] Carlos Guestrin, Andreas Krause, and Ajit Paul Singh. Near-optimal sensor placements in Gaussian processes. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 265-272, New York, NY, USA, 2005. ACM Press.

[68] L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Trans. on Computer-Aided Design, 11:1074-1085, 1992.

[69] Sanda Harabagiu and Finley Lacatusu. Strategies for advanced question answering. In HLT-NAACL 2004 Workshop on Pragmatics of Question Answering, pages 1-9, Boston, Massachusetts, USA, May 2-7, 2004. Association for Computational Linguistics.

[70] Taher H. Haveliwala. Topic-sensitive PageRank. In WWW '02: Proceedings of the 11th international conference on World Wide Web, pages 517-526, New York, NY, USA, 2002. ACM Press.

[71] Tomer Hertz, Aharon Bar-Hillel, and Daphna Weinshall. Boosting margin based distance functions for clustering. In ICML '04: Proceedings of the twenty-first international conference on Machine learning, page 50, New York, NY, USA, 2004. ACM Press.

[72] Tomer Hertz, Aharon Bar-Hillel, and Daphna Weinshall. Learning a kernel function for classification with small training samples. In ICML '06: Proceedings of the 23rd international conference on Machine learning, pages 401-408, New York, NY, USA, 2006. ACM Press.

[73] Tomer Hertz, Noam Shental, Aharon Bar-Hillel, and Daphna Weinshall. Enhancing image and video retrieval: Learning via equivalence constraints. In CVPR '03: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, page 668, Los Alamitos, CA, USA, 2003. IEEE Computer Society.

[74] Steven C. H. Hoi, Wei Liu, Michael R. Lyu, and Wei-Ying Ma. Learning distance metrics with contextual constraints for image retrieval. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), pages 2072-2078, 2006.

[75] Stefan Holland, Martin Ester, and Werner Kießling. Preference mining: A novel approach on mining user preferences for personalized applications. In PKDD, pages 204-216, 2003.
[76] David A. Hull and Gregory Grefenstette. Querying across languages: a dictionary-based approach to multilingual information retrieval. In SIGIR '96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 49-57. ACM Press, 1996.

[77] ImageCLEF. The CLEF Cross Language Image Retrieval Track (ImageCLEF), http://ir.shef.ac.uk/imageclef/, 2003.

[78] Abraham Ittycheriah, Martin Franz, and Salim Roukos. IBM's statistical question answering system. In TREC '01: Proceedings of the 10th Text REtrieval Conference, Gaithersburg, MD, 2001.

[79] Tommi S. Jaakkola. Variational methods for inference and estimation in graphical models. PhD thesis, MIT, 1997. Supervisor: Michael I. Jordan.

[80] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264-323, September 1999.

[81] Myung-Gil Jang, Sung Hyon Myaeng, and Se Young Park. Using mutual information to resolve query translation ambiguities and query term weighting. In ACL '99: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999.

[82] Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. Finding semantically similar questions based on their answers. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 617-618, New York, NY, USA, 2005. ACM Press.

[83] Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. Finding similar questions in large question and answer archives. In CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, pages 84-90, New York, NY, USA, 2005. ACM Press.

[84] Valentin Jijkoun and Maarten de Rijke. Retrieving answers from frequently asked questions pages on the web. In CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, pages 76-83, New York, NY, USA, 2005. ACM Press.

[85] Rong Jin and Zoubin Ghahramani. Learning with multiple labels. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 897-904. MIT Press, Cambridge, MA, 2003.

[86] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proc. European Conference on Machine Learning, 1998.

[87] Thorsten Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the International Conference on Machine Learning (ICML), 2003.

[88] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183-233, 1999.

[89] Sepandar D. Kamvar, Dan Klein, and Christopher D. Manning. Spectral learning. In IJCAI '03: Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, August 9-15 2003.

[90] Hideto Kazawa, Tomonori Izumitani, Hirotoshi Taira, and Eisaku Maeda. Maximal margin labeling for multi-topic text categorization. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 649-656. MIT Press, Cambridge, MA, 2005.

[91] D. Klein, S. D. Kamvar, and C. D. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In ICML '02: Proceedings of the 19th Intl. Conf. on Machine Learning, 2002.

[92] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
Twenty-One at TREC-8: using language technology for information retrieval. In Ellen M. Voorhees and Donna K. Harman, editors, The Eighth Text REtrieval Conference (TREC-8), volume 8, pages 285–300. National Institute of Standards and Technology, NIST, 2000. NIST Special Publication 500-246.
[94] Wessel Kraaij, Jian-Yun Nie, and Michel Simard. Embedding web-based statistical translation models in cross-language information retrieval. Comput. Linguist., 29(3):381–419, 2003.
[95] Wessel Kraaij and Renée Pohlmann. Different approaches to cross language information retrieval. In W. Daelemans, K. Sima'an, J. Veenstra, and J. Zavrel, editors, Computational Linguistics in the Netherlands 2000, number 37 in Language and Computers: Studies in Practical Linguistics, pages 97–111, Amsterdam, 2001. Rodopi.
[96] Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. Semi-supervised graph clustering: a kernel approach. In ICML '05: Proc. of the 22nd international conference on Machine learning, pages 457–464, New York, NY, USA, 2005. ACM Press.
[97] Vladimir A. Kulyukin, Kristian A. Hammond, and Robin D. Burke. Automated processing of structured online documents. Technical report, University of Chicago, Chicago, IL, USA, 1998.
[98] James T. Kwok and Ivor W. Tsang. Learning with idealized kernels. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), pages 400–407, Washington, DC, USA, August 2003.
[99] Yu-Sheng Lai, Kuao-Ann Fung, and Chung-Hsien Wu. FAQ mining via list detection. In COLING-02: proceedings of the 2002 conference on multilingual summarization and question answering, pages 1–7, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
[100] Victor Lavrenko, Martin Choquette, and W. Bruce Croft. Cross-lingual relevance models. In SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 175–182. ACM Press, 2002.
[101] Victor Lavrenko and W. Bruce Croft. Relevance based language models. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 120–127. ACM Press, 2001.
[102] Martin H. C. Law, Alexander P. Topchy, and Anil K. Jain. Model-based clustering with probabilistic constraints. In SIAM International Conference on Data Mining (SDM), 2005.
[103] Neil D. Lawrence and Michael I. Jordan. Semi-supervised learning via Gaussian processes. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 753–760. MIT Press, Cambridge, MA, 2005.
[104] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999.
[105] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, pages 556–562, 2000.
[106] Shuang Liu, Fang Liu, Clement Yu, and Weiyi Meng. An effective approach to document retrieval via utilizing WordNet and recognizing phrases. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 266–272, New York, NY, USA, 2004. ACM Press.
[107] Qing Lu and Lise Getoor. Link-based classification. In ICML '03: Proceedings of the twentieth international conference on Machine learning, pages 496–503, 2003.
[108] Zhengdong Lu and Todd Leen.
Semi-supervised learning with penalized probabilistic clustering. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 849–856, Cambridge, MA, 2005. MIT Press.
[109] Akira Maeda, Fatiha Sadat, Masatoshi Yoshikawa, and Shunsuke Uemura. Query term disambiguation for web cross-language information retrieval using a search engine. In Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages (IRAL '00), pages 25–32. ACM Press, 2000.
[110] Andrew McCallum. Multi-label text classification with a mixture model trained by EM. In AAAI '99 Workshop on Text Learning, 1999.
[111] Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning, pages 591–598, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[112] Joel C. Miller, Gregory Rae, Fred Schaefer, Lesley A. Ward, Thomas LoFaro, and Ayman Farahat. Modifications of Kleinberg's HITS algorithm using matrix exponentiation and web log records. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 444–445, New York, NY, USA, 2001. ACM Press.
[113] Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
[114] Christof Monz and Bonnie J. Dorr. Iterative translation disambiguation for cross-language information retrieval. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 520–527, 2005.
[115] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849–856, 2001.
[116] Jian-Yun Nie and Michel Simard. Using statistical translation models for bilingual IR. In Cross-Language Information Retrieval and Evaluation, Workshop of Cross-Language Evaluation Forum, CLEF 2001, pages 137–150. Springer-Verlag, 2002.
[117] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, 1999.
[118] Hyo-Jung Oh, Sung Hyon Myaeng, and Mann-Ho Lee. A practical hypertext categorization method using links and incrementally available class information. In SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 264–271, New York, NY, USA, 2000. ACM.
[119] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[120] Xiaoguang Qi and Brian D. Davison. Knowing a web page by the company it keeps. In CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management, pages 228–237, New York, NY, USA, 2006. ACM.
[121] Dragomir R. Radev, Hong Qi, Zhiping Zheng, Sasha Blair-Goldensohn, Zhu Zhang, Weiguo Fan, and John Prager. Mining the web for answers to natural language questions. In CIKM '01: Proceedings of the 10th ACM international conference on Information and knowledge management, Atlanta, GA, 2001.
[122] Ganesh Ramakrishnan, Soumen Chakrabarti, Deepa Paranjpe, and Pushpak Bhattacharyya.
Is question answering an acquired skill? In WWW '04: Proceedings of the 13th international conference on World Wide Web, pages 111–120, New York, NY, USA, 2004. ACM Press.
[123] Juho Rousu, Craig Saunders, Sandor Szedmak, and John Shawe-Taylor. On maximum margin hierarchical multi-label classification. In NIPS Workshop on Learning With Structured Outputs, 2004.
[124] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000.
[125] Walter Rudin. Real & Complex Analysis. McGraw-Hill, Inc., New York, NY, 3rd edition, 1987.
[126] Gerard Salton, editor. The SMART retrieval system. Prentice-Hall, 1971.
[127] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3), 2000.
[128] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16, 2003.
[129] M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2001.
[130] Dou Shen, Jian-Tao Sun, Qiang Yang, and Zheng Chen. A comparison of implicit and explicit links for web page classification. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 643–650, New York, NY, USA, 2006. ACM.
[131] Xuehua Shen, Bin Tan, and ChengXiang Zhai. Context-sensitive information retrieval using implicit feedback. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 43–50, New York, NY, USA, 2005. ACM Press.
[132] Xuehua Shen, Bin Tan, and ChengXiang Zhai. Implicit user modeling for personalized search. In CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, pages 824–831, New York, NY, USA, 2005. ACM Press.
[133] N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall. Computing Gaussian mixture models with EM using side-information. In Proc. of the workshop on the continuum from labeled to unlabeled data in machine learning and data mining, 2003.
[134] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. on PAMI, 22(8):888–905, August 2000.
[135] Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization. In SIGIR '96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 21–29, New York, NY, USA, 1996. ACM Press.
[136] Radu Soricut and Eric Brill. Automatic question answering: Beyond the factoid. In HLT-NAACL 2004, Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 57–64, 2004.
[137] Radu Soricut and Eric Brill. Automatic question answering using the web: Beyond the factoid. Information Retrieval - Special Issue on Web Information Retrieval, 9(2):191–206, 2006.
[138] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In ICML '03: Proceedings of the twentieth international conference on Machine learning, pages 720–727, 2003.
[139] Smitha Sriram, Xuehua Shen, and Chengxiang Zhai. A session-based search engine. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 492–493, New York, NY, USA, 2004. ACM Press.
[140] Jian-Tao Sun, Hua-Jun Zeng, Huan Liu, Yuchang Lu, and Zheng Chen. CubeSVD: a novel approach to personalized web search.
In WWW ’05: Proceed- ings of the 14th international conference on World Wide Web, pages 382—390, New York, NY, USA, 2005. ACM Press. Ben Taskar, Vassil Chatalbashev, and Daphne Koller. Learning associative markov networks. In ICML ’04: Proceedings of the twenty—first international conference on Machine learning, page 102, New York, NY, USA, 2004. ACM Press. J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 2000. The Mathworks. Matlab, http://www.mathworks.eom/. John A. Tomlin. A new paradigm for ranking pages on the world wide web. In WWW ’03: Proceedings of the 12th international conference on World Wide Web, pages 350—355, New York, NY, USA, 2003. ACM Press. 201 [145] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Al- tun. Support vector machine learning for interdependent and structured output spaces. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning, page 104, New York, NY, USA, 2004. ACM Press. [146] Naonori Ueda and Kazumi Saito. Parametric mixture models for multi-labeled text. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 15, pages 649—656. MIT Press, Cam- bridge, MA, 2003. [147] Ellen M. Voorhees. Using wordnet to disambiguate word senses for text retrieval. In SICIR ’93: Proceedings of the 16th annual international ACM SICIR con- ference on Research and development in information retrieval, pages 171—180, New York, NY, USA, 1993. ACM Press. [148] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl. Constrained k- means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 577—584, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. [149] Jidong Wang, Huajun Zeng, Zheng Chen, Hongjun Lu, Li Tao, and Wei-Ying Ma. Recom : reinforcement clustering of multi-type interrelated data objects. In SICIR ’03: Proceedings of the 26th annual international ACM SICIR con- ference on Research and development in informaion retrieval, pages 274—281, New York, NY, USA, 2003. ACM. [150] Kilian Weinberger, John Blitzer, and Lawrence Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, B. Schélkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1473—1480, Cambridge, MA, 2006. MIT Press. [151] Eric W. Weisstein. ”graph.” from mathworld—a wolfram web resource. http: //mathworld.wolfram.com/graph.html. [152] C. Williams. Computation with infinite neural networks. Neural Computation, 10(5), 1998. [153] Chung—Hsien Wu, Jui-Feng Yeh, and Ming-Jun Chen. Domain-specific faq re- trieval using independent aspects. ACM Transactions on Asian Language In- formation Processing (TALIP), 4(1):1—17, 2005. [154] Chung-Hsien Wu, Jui-Feng Yeh, and Yu-Sheng Lai. Semantic segment extrac- tion and matching for internet faq retrieval. IEEE Transactions on Knowledge and Data Engineering, 18(7):930—940, 2006. 202 [159] [150] [161] [162] [163] [164] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Dis- tance metric learning with application to clusterng with side-information. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 505—512, Cambridge, MA, 2003. MIT Press. ' Jinxi Xu and W. Bruce Croft. Query expansion using local and global doc- ument analysis. 
In SICIR ’96: Proceedings of the 19th annual international ACM SICIR conference on Research and development in information retrieval, Zurich, Switzerland, 1996. Jinxi Xu and Ralph Weischedel. T REC-9 cross-lingual retrieval at BBN. In The Ninth Text REtrieval Conference (TREC-9), 2001. Gui-Rong Xue, Dou Shen, Qiang Yang, Hua—Jun Zeng, Zheng Chen, Yong Yu, WenSi Xi, and Wei-Ying Ma. Irc : An iterative reinforcement categorization algorithm for interrelated web objects. In I CDM ’04: Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM’04), pages 273—280, Washington, DC, USA, 2004. IEEE Computer Society. Liu Yang, Rong Jin, Rahul Sukthankar, and Yi Liu. An efficient algorithm for local distance metric learning. In Proceedings of the Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, 2006. Yiming Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1/2), 1999. Yiming Yang and Xin Liu. A re-examination of text categorization methods. In SICIR ’99: Proceedings of the 22nd annual international ACM SICIR confer- ence on Research and development in information retrieval, pages 42—49, New York, NY, USA, 1999. ACM Press. Yiming Yang, Sean Slattery, and Rayid Ghani. A study of approaches to hy- pertext categorization. J. Intell. Inf. Syst., 18(2—3):219—241, 2002. Jieping Ye. Generalized low rank approximations of matrices. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning, page 112, New York, NY, USA, 2004. ACM Press. Kai Yu, Shipeng Yu, and Volker Tresp. Multi-label informed latent semantic indexing. In SICIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in informaion retrieval, 2005. 203 [165] [166] [167] [168] [169] [170] [171] [172] [173] [174] Z. Zhang, J.T. Kwok, and D.Y. Yeung. Parametric distance metric learning with label information. In Proc. of the Eighteenth International Joint Conference on Artificial Intelligence {IJCAI’O3}, pages 1450—1452, 2003. Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bern— hard Schblkopf. Learning with local and global consistency. In Sebastian Thrun, Lawrence Saul, and Bernhard Scholkopf, editors, Advances in Neural Informa— tion Processing Systems 16. MIT Press, Cambridge, MA, 2004. Dengyong Zhou, Jiayuan Huang, and Bernhard Scholkopf. Learning from la- beled and unlabeled data on a directed graph. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 1036—1043, New York, NY, USA, 2005. ACM Press. Dengyong Zhou, Bernhard Schélkopf, and Thomas Hofmann. Semi-supervised learning on directed graphs. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1633— 1640. MIT Press, Cambridge, MA, 2005. Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bern- hard Sehélkopf. Ranking on data manifolds. In Sebastian Thrun, Lawrence Saul, and Bernhard Schblkopf, editors, Advances in Neural Information Pro- cessing Systems 16. MIT Press, Cambridge, MA, 2004. Xiang Sean Zhou and Thomas S. Huang. Unifying keywords and visual contents in image retrieval. IEEE MultiMedia, 9(2):23—33, 2002. Shenghuo Zhu, Xiang Ji, Wei Xu, and Yihong Gong. Multi-labelled classifica- tion using maximum entropy method. 
In SICIR ’05: Proceedings of the 28th annual international ACM SICIR conference on Research and development in information retrieval, pages 274—281, New York, NY, USA, 2005. ACM Press. Xiaojin Zhu. Semi-supervised learning literature survey. Technical Re— port 1530, Computer Sciences, University of Wisconsin-Madison, 2005. http: / / www.cs.wisc.edu / ~jerryzhu / pub / ssl_survey.pdf . Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learn- ing using gaussian fields and harmonic functions. In ICML ’03: Proceedings of the twentieth international conference on Machine learning, pages 912—919, 2003. Xiaojin Zhu and John Lafferty. Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 1052—1059, New York, NY, USA, 2005. ACM Press. 204 lllfllllllllllfljllllmfllfllfl