PERSONALIZED INFORMATION DISCOVERY FROM UNSTRUCTURED TEXT

By Marilyn Wulfekuhler

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Computer Science

July 31, 1998

© Copyright July 31, 1998 by Marilyn Wulfekuhler. All Rights Reserved.

ABSTRACT

PERSONALIZED INFORMATION DISCOVERY FROM UNSTRUCTURED TEXT

By Marilyn Wulfekuhler

We have developed a technique that finds conceptually related groups of words which represent the underlying content of user-designated text examples. These word groups characterize the content of the documents at an intermediate conceptual level that is between full understanding and simple statistical processing of individual words. A number of users' needs for finding and organizing information can be met by using this intermediate conceptual level, beyond current methods which either rely on individual words to characterize documents, or methods that rely on a global, static, comprehensive knowledge base.

Given sets of examples of unstructured text which a user has designated as interesting, we can automatically find conceptually related groups of words from them which represent underlying building blocks of the content of the documents. We find these word groups using standard statistical clustering techniques without the use of complicated natural language processing techniques, and without the need for explicit external knowledge. We require only a relatively small number of example documents.

Both automatic document content search and automatic document classification have long been goals in information retrieval which have had only limited success. We conjecture that success has been limited because in the case of classification, the traditional focus has been to achieve a global, static, comprehensive scheme which is all things to all people and all documents. In the case of search, traditional methods have focused on documents and information needs at an individual word level; when words are considered in isolation, the problems of polysemy (identical words with different meanings) and synonymy (different words with identical meanings) limit the success of the search. By using an intermediate conceptual level, we can arrive at personalized, context-sensitive organizational schemes, and provide a more useful basis for search than that provided by representations based on isolated words.

To my family, Ted, Daniel, Erika

ACKNOWLEDGMENTS

I would first like to thank my advisor, Bill Punch, for his extraordinary support and patience. I could not have asked for a better advisor, and I was lucky to have the privilege of working with him.
Thanks also go to the other members of my guidance committee: Anil Jain, who was always interested in my academic success and always expected excellence; Nicholas Altiero, who carefully reviewed my work and offered helpful suggestions for improvements; and Mike Moch, who kept an eye on the practical application of my research.

I gratefully acknowledge financial support from the Dean's Distinguished Fellowship program and NEM-Online. Without the financial assistance they provided, it is likely that I would have given up on a Ph.D. years ago.

I would also like to thank my fellow graduate students when I was in the PRIP lab, including Chitra Dorai, Dan Swets, Sharath Pankanti, and Sally Howden. These people were not only good colleagues, but good friends. Also from my days in the PRIP lab, I would like to thank George Stockman and the late Richard Dubes for starting me down the path into research.

When I found a new home in the GARAGe, I also found more good colleagues and friends, including Eric Goodman, Owen Matthews, Terry Dexter, Jason Greanya, Gang Wang, Victor Miagkikh, Eman El-Sheikh, Oleg Lukibanov, and Iliana Martinez-Bermoudes. These individuals contributed to a stimulating and enjoyable environment. I would like to single out GARAGe dwellers Boris Gelfand, Arnold Patton, and Mike Raymer for special recognition, for listening to my brainstorming ideas, for their willingness to give suggestions and opinions, as well as for just being there with moral support. Also deserving special thanks are Hugh Smith, Donna Wicks, Darren Meahl, Marie-Pierre Dubuisson Jolly, and Qian Huang, who kept me going through particularly black times.

I would like to thank my parents, my sister Margaret and my brother Carl for their unwavering encouragement and support. I would also like to thank my children Erika and Daniel for putting up with my late nights and time spent in the lab. And finally, I want to thank my husband Ted for his understanding, patience, commitment, and support. This degree would not have been possible without his help.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction
1.1 Motivation
1.1.1 Information and Knowledge
1.1.2 A new domain - the Internet and World Wide Web
1.2 Representation of Information
1.2.1 Conceptual level representation
1.3 Information Discovery in a library vs. the Internet

2 Literature Review
2.1 Traditional Information Retrieval
2.1.1 Word Level Representation
2.1.2 Feature Selection and Extraction
2.1.3 Query Modification
2.1.4 Evaluation of Search Effectiveness
2.1.5 Traditional IR Search vs. On-Line Search
2.1.6 Search on the World Wide Web
2.2 Organization of Textual Information
2.2.1 Global collection representation
2.2.2 Subject Catalogs
2.2.3 Document Classification
2.2.4 Document Clustering
2.3 Summary

3 Finding Concepts
3.1 Preprocessing
3.1.1 Stemming
3.2 Forming the Pattern Matrix
3.3 Finding Important Features
3.3.1 Feature Selection
3.3.2 Feature Extraction
3.4 Transposing the Matrix: Clustering Words Instead of Documents
3.4.1 Choosing a value for K
3.4.2 Distance Measures
3.4.3 Normalization for document length
3.4.4 Combining runs into fuzzy sets
3.5 Summary

4 Experiments and Results
4.1 Sample Data Sets
4.2 Word Clustering Results
4.2.1 Statistically Random Data Set
4.2.2 Random Seeded Data Set
4.2.3 Reuters Data Set
4.2.4 NEM-Online Data Set
4.3 Cluster Usefulness
4.3.1 Classification Accuracy
4.3.2 Search for Similar Documents
4.3.3 Personalized Document Organization
4.3.4 Conclusions

5 Summary and Future Work
5.1 Summary
5.2 Future Work
5.2.1 Characterization of Training Set
5.2.2 Hierarchical Subclustering
5.2.3 Other methods for determining fuzzy set membership
5.2.4 Applications for conceptual representation
5.2.5 Incorporate knowledge in representation

LIST OF TABLES

3.1 Sample of the partial pattern matrix 54
4.1 Seed words from four Reuters categories 80
4.2 Sample Reuters data set categories 81
4.3 Sample NEM-Online data set categories 82
4.4 Summary of the training data sets: Statistically Random (StR), Seeded Random (SeR) at 1%, 5%, 10% seed words, Reuters articles (RA), and NEM-Online (NO) 83
4.5 Three smallest, strongest clusters from 5% random seeded document set 89
4.6 Two clusters containing seed words from 10% random seeded document set 90
4.7 The four clusters from actual Reuters articles whose strongest member was greater than .4. Words with strength < .7 appear in the clusters, but are not shown here. 92
4.8 The four clusters from NEM-Online documents 94
4.9 Occurrences of cluster words in all Reuters documents of the four categories 98
4.10 Precision and Recall of single word queries 99

LIST OF FIGURES

1.1 General Overview of the System 10
3.1 Four documents in 2 dimensions 67
4.1 Sizes of K - 1 smallest clusters in all data sets 84
4.2 Maximum Precision and Recall for Reuters cluster 1 in the gas category 98
4.3 Maximum Precision and Recall for Reuters cluster 2 in the grain category 99

Chapter 1

Introduction

We have developed a technique that finds conceptually related groups of words which represent the underlying content of user-designated text examples. These word groups characterize the content of the documents at an intermediate conceptual level that is between full understanding and simple statistical processing of individual words.
A number of users' needs for finding and organizing information can be met by using this intermediate conceptual level of characterization, beyond current methods which either rely on individual words to characterize documents, or methods that rely on a global, static, comprehensive knowledge base.

We can concisely state the problem as follows:

Given a set of n textual documents that a user has identified as "interesting", containing d total words (the features), find some subsets of d which represent conceptual features which distinguish these documents from others, and which represent concepts contained in the n documents.

The subsets of related words are found using standard statistical clustering techniques without the use of complicated natural language processing, and without the need for external knowledge. We require only a relatively small number of example documents; we do not try to build these conceptual features based on a large corpus of text; we use only documents which represent the user's focus of attention at the time.

We can use these conceptual features and the topics they suggest as the characterization of textual information in order to:

- automatically search for unknown documents on similar topics
- organize collections of documents tailored to personal preference
- dynamically classify document collections with topical proximity

These tasks are less fruitful under current schemes because traditional methods represent information either by using individual words in documents, which is too narrow a view, or by using a globally imposed topic categorization, which is too broad. Representing information at an intermediate conceptual level enables a personalized, dynamic view of textual information, allowing the above tasks to be performed more effectively. This is the major contribution of this thesis.

Assumptions we make are as follows:

1. A user has collected several documents from each of several categories of interest so that the document collection contains both commonality (as defined by a user) between documents in the same category, and diversity among categories.
2. The commonality consists of related, low-level concepts which likely share a specialized vocabulary rather than meta concepts (e.g., a collection of funny stories).
3. The documents are not labeled by category.
4. The number of documents n is smaller than the number of distinct words d.
5. A user may generate new conceptual features at any time, as interests or document examples change.

Assumption 4 states that n < d. If there are more documents than distinct words, that is, n > d, the document collection can be broken into subsets such that the commonality and diversity requirements are met. The reason for the restriction on the size of the document sets will be further explained in subsequent sections. Assumption 5 implies that total processing time should be short enough to allow user feedback as quickly as possible. While real-time processing is not yet possible, it should not take more than a few hours (e.g., over lunch or overnight) to generate conceptual features and use them for search or organization.

In this chapter we motivate the problem, describe how it has been handled historically in the context of library science, and explain why traditional methods don't work well in the new domain of the Internet. Chapter 2 reviews previous related work and current approaches to information discovery on the Internet.
Chapter 3 describes our system for extracting conceptual features which we use for representing informa- tion at an intermediate conceptual level in a personal, dynamic way. In this chapter we also describe our attempts to achieve personal categorization using word-level fea- tures, and explain why categorization using this sort of representation is doomed to 4 failure. Chapter 4 gives detailed results of our experiments using 4 different data sets. It then gives preliminary results of using the conceptual level features for automatic search and document organization. Chapter 5 gives a summary and outlines future work to be done in the area of automatic information discovery. 1.1 Motivation AS the information age proceeds, we are deluged with more and more information, and it is becoming increasingly difficult to manage it. An individual can not and does not want to understand the organization of all available information, only that which is of current interest to him or her. We need to have an easily derived representation of information that conveys meaning in a given context which represents a user’s current focus of attention. 1.1.1 Information and Knowledge Humans are unique in the animal kingdom in that they do not rely solely on expe- rience or a direct teacher to gain information. Instead, they can utilize an amassed body of knowledge which has been acquired and stored by others. Significant among sources of all information are textual sources, for which we have developed well es- tablished practices of access. Some information needs are specific, (e.g., I need to know the maximum recorded length of a blue whale), while others are general (I want to learn about blue whales). Representation of information encompasses both these needs, and information sources which fulfill needs. (The traditional resource for :3 acquiring knowledge from textual information is a library. and methods and solutions for meeting information needs have been developed in that context. 1.1.2 A new domain — the Internet and World Wide Web The explosion of the Internet and the World Wide Web in the past four years has created an unprecedented amount of readily and widely available textual information. The current apprOaches to discovering relevant information are typically patterned after methods and techniques from the information retrieval field, where the problem of automatically finding information from textual sources has been well studied in the context of library science. However, there are fundamental differences in the basic assumptions of search and classification of text in library contexts which make traditional methods inadequate when seeking information from the Internet. Specifically, published works in a library are collected under well defined rules and controls, using well established techniques. The tasks of categorizing, indexing, and representing information are performed by professional humans. With the rise of the Internet as a source of textual information, however, anyone can publish infor- mation, and a document is under complete control of the author, who is most likely to be untrained in library science. Library infrastructure and controls are absent in the Internet, providing unique challenges to information discovery. There have been attempts to automate some of the tasks performed by professional librarians, both in library collections and on the Internet. 
Here we briefly describe traditional approaches in broad terms; more detailed explanation is given in section 1.3 and in chapter 2. Both automatic document content search and automatic document classification have long been goals in information retrieval though researchers have achieved only limited success. In the case of classification, the traditional focus has been to find a global, Static, comprehensive categorization which is all things to all people and all documents. In the case of search, traditional methods have focused on documents and information needs at an individual word level; when words are considered in isolation, the problems of polysemy (identical words with different meanings) and synonymy (different words with identical meanings) limit the success of the search. Rather than attempt to classify or understand the universe, we consider only those concepts and categories that are important at a given time to a given user, and are therefore able to dynamically provide a reasonable approximation of content components of a document in a given context. In our approach we examine small sets of documents from which we derive groups of related words (related in the sense that they have Similar patterns of usage across the example documents). Using groups of words together rather than in isolation alleviates the polysemy and synonymy problems because the context of the other words in the set can disambiguate polysemic words, and can augment a single word with its synonyms. T 1.2 Representation of Information Textual documents are an abstraction of the information they contain. Represen- tation of documents. in turn, is a further abstraction, preserving the expression of information as much as possible while providing a computationally feasible way to analyze, categorize, and search. There are two main approaches to representing documents: One extreme considers a document as being composed of individual words or phrases. Though the underlying semantics can not in general be obtained automatically, the idea is that documents with similar semantics use Similar words, thus allowing a word search match. The other extreme consists of looking at a document as a whole, deciding what one thing it is about, and assigning it a label in a global information categorization. Clearly there is room for a middle ground, where documents are seen as being composed of individual coherent topics. This is the approach we take. 1.2.1 Conceptual level representation Rather than using either a word level representation or a global topical representation for textual sources, we consider documents at an intermediate conceptual level. Suppose a user has a collection of n interesting documents logically grouped to- gether in classes either explicitly (a user’s bookmark file, or email folders, for ex- ample), or implicitly in the user’s mind, according to the individual’s ideas of what constitutes a category. Two people may have the same collection of documents, but classify them differently. For example, one user may have saved a given page from a 8 company which does research on software agents because of her interest in one of its products which facilitates Internet commerce. They would (explicitly or implicitly) group it with other pages about similar products. while another user may have saved the same page because of interest in obtaining a job there. and grouped it with other job leads. 
Note that the user need not label the groupings; but when the whole collection is examined, it is apparent to the user that a certain subset of the documents refers to certain conceptual categories. The user-defined grouping is one that makes sense to that user in the context of the other documents in the set. This means that the underlying conceptual features which determine the user's categories will vary with each individual document collection, and with each user. The set of documents a user has saved provides implicit context in determining the semantic concepts that were used.

In trying to extract these conceptual features, it is not enough to merely find features that reproduce the user's classification of their small example document set. A number of traditional feature selection techniques based on individual words can reproduce the classification with a high degree of accuracy, as will be described later. However, this is not enough, since we want features that are at a higher conceptual level than single words, and which suggest the concepts that underlie the user's categories. We can find these conceptual features as long as the underlying concepts share a common vocabulary; meta concepts such as "humorous pages" or "ugly color schemes" will not be detected.

We will enable personalized information discovery from unstructured text by first finding conceptual features from example documents. These features consist of groups of semantically related words (concept clusters) with a value associated with each word according to its "strength" in the concept. The user may manually refine and/or label the concept clusters to better define their personal ideas of the topic, but intervention is not required. Note that these concept clusters are different from any classes that the user may have defined in their example documents, and are also different from the notion of classes provided by global subject catalogs such as Yahoo!. The word groups are generated dynamically according to the raw features from the example document set, and will change as the example document set changes.

Representing documents at an intermediate level through our conceptually related word groups is useful in a number of ways. We can perform automatic searches on behalf of a user based on topics they are interested in, rather than forcing them to construct queries composed of individual words which are iteratively submitted and refined. We can also improve the presentation of search results to the user by classifying them according to the concepts the user has already indicated interest in. Rather than merely presenting a linear list of ranked results as most search services do, we can present documents of potential interest in a way that allows a user to browse them with the expectation that physical proximity of the documents in the presentation corresponds to their topical proximity.

We initially examine single words contained in documents; however, once our word groups are identified, each group as a whole has an evident semantic meaning which does not rely on a universally defined knowledge base of meaning built by human categorization, and which is more powerful than considering words in isolation. The word groups also alleviate the problems of polysemy and synonymy by providing a context for individual words.
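To make the idea concrete before the full treatment in Chapter 3, the following is only a minimal sketch, in Python, of clustering words by their patterns of usage across a small set of example documents. It uses a plain k-means over word occurrence profiles and a crude distance-based "strength" value as a stand-in for the fuzzy memberships described later; the function and variable names (concept_clusters, doc_term_counts) are illustrative and are not part of the system itself.

```python
# Illustrative sketch only: the actual procedure (preprocessing, stemming,
# feature selection, choice of K, fuzzy set membership) is given in Chapter 3.
# Words are grouped by the similarity of their usage patterns across the
# user's example documents, so each group suggests one underlying concept.
import numpy as np

def concept_clusters(doc_term_counts, vocabulary, k, iterations=20, seed=0):
    """doc_term_counts: (n docs x d words) matrix of word counts.
    Returns k groups of words, each word paired with a rough 'strength'."""
    rng = np.random.default_rng(seed)
    # Each word is described by the pattern of documents it occurs in
    # (the transpose of the usual document-by-word view).
    word_profiles = doc_term_counts.T.astype(float)
    norms = np.linalg.norm(word_profiles, axis=1, keepdims=True)
    word_profiles = word_profiles / np.maximum(norms, 1e-12)

    centers = word_profiles[rng.choice(len(word_profiles), k, replace=False)]
    for _ in range(iterations):                      # plain k-means iterations
        dists = np.linalg.norm(word_profiles[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = word_profiles[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)

    clusters = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        # Nearness to the cluster centre serves as a crude 'strength' value.
        dist = np.linalg.norm(word_profiles[idx] - centers[c], axis=1)
        strength = 1.0 / (1.0 + dist)
        clusters.append(sorted(zip((vocabulary[i] for i in idx), strength),
                               key=lambda ws: -ws[1]))
    return clusters
```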
Providing a system to automatically classify/organize documents according to semantic content as defined by the user, and automatically searching for previously unseen documents with similar semantic content, are further contributions of this thesis. A general overview of the system is given in Figure 1.1, and will be described in detail in Chapters 3 and 4.

[Figure 1.1: General Overview of the System. An example document set is processed to find conceptual features (concept clusters), which the user may optionally refine (Chapter 3). The concept clusters are then used to construct and submit queries, and to evaluate and classify the returned list of information sources (documents) into an organization with topical proximity (Chapter 4).]

1.3 Information Discovery in a library vs. the Internet

There are two distinct but related activities which facilitate information discovery from textual sources (for example, the WWW, newsgroups, or a traditional library): searching and browsing. Search consists of submitting a query to a predefined index, relying on a word-level representation of documents, while browsing consists of briefly examining several items, relying on a global concept categorization. While both activities can arise from a specific or a general information need, search is typically for specific information, while browsing is typically for general information.

In a traditional library, search is facilitated by the card catalog, which has an established indexing scheme which maps individual query words (including keywords, author, title, etc.) to a document. The document has a unique identifier (Dewey Decimal System number [2] or Library of Congress Subject Heading [3]) which enables the user to precisely locate the document in the library. The indexing scheme and identifier also facilitate browsing, since the identifiers are constructed such that items about similar topics are intended to be in close physical proximity to each other. Thus, once a user locates a single item of interest, others on the same topic will be physically nearby. The underlying assumption of this scheme is that a given document identifier has a well-defined mapping to the semantic content or subject matter of the document. Likewise, there is a well-understood inverse mapping from subject matter to identifier. When new documents are published, the labor-intensive process of determining the details of this mapping is performed by specially trained people.

On the Internet, every document does have a unique identifier in the form of a Universal Resource Locator (URL), but there is no well-defined relation between topic and location. The URL of a web document reflects wherever the author happened to put it. There is no sense of topic proximity; items on the same topic can be widely scattered across the world. Users browse by following hyperlinks, which presumably go from a given page to a related topic, but the relevance of the relationship may only be known to the page author. A further difficulty on the Internet is that there is no well-defined notion of a "document".
A document on the World Wide Web is usually thought of as a single item with a distinct URL, but may logically be only a part of a document (for example, a Single section from a scientific paper), or logically several documents (for example, a Usenet newsgroup digest). Whenever we refer to an on—line “document”, we will mean a data item that has a unique URL as an identifier. This may include book chapters, directory lists, a Single email message, digests, etc. Current search of on—line sources consist of engines which are based on traditional information retrieval techniques, with some modifications, analogous to the library card catalog search. A user submits a query consisting of a Boolean expression of keywords to a service and is returned a set of documents which match the query. In a library card catalog, a user can search for a document by title, author, or keyword. Searching on the Internet is based only on keywords, not title or author. For example, in a search for C.J. van Rijsbergen’s book “Information Retrieval”, in the Michigan State University library’s MAGIC card catalog system, a query specifically for the title as “t=Information Retrieval” results in three items with that title, all 13 with the Library of Congress subject heading of 2699. Armed with this information, the user could go to the shelf in the library containing “2699” call numbers and browse for further resources related to the topic of information retrieval, or pinpoint the one book of three they had in mind. However, to find an on—line version of the same book on the web 1, there is no uniform way to search for a title, and if the author’s name (or even its spelling) has been forgotten, the user is forced to search for the title as a keyword. Submitting the query information retrieval to AltaVista, for example, returns the URLs of 30,000 documents which each must be examined to find the desired reference (the correct reference did not appear in the first 40 pages listed). Some search services do offer the capability to search Specifically for a Web page’s title, but since web pages do not conform to the standard notion of a Single volume of work, there is no guarantee that the title of a Web page containing the book is the title of the book. (Titles of the Web pages comprising that book are in fact unspecified and so default to the file names “Prefacehtml”, “Ch.1.html”, “Ch.2.html”, etc.) Another important difference between library search and on—line search lies in the construction of the keyword index. While library card catalogs are hand con- structed with carefully selected keywords, major web search engines such as Lycos [6], AltaVista [7], WebCrawler [8], and Open Text [9] employ software robots [10] (also called spiders) which traverse the web, read each document, and automatically create an index of the documents they find. The index is an attempt to somehow characterize a document without storing its full text. Each service makes its own 1its URL is http://www.dcs.gla.ac.uk/Keith/Preface.html 14 decisions about how much of a document to read, which words from the document belong in the index, and the weight or importance of each indexing term. Although there is a. facility for associating keywords with Web documents (using META tags within HTML documents), the responsibility for specifying them lies with the doc- ument author. 
These META tags do not affect the information displayed to a user, so the author may choose keywords poorly, may deliberately choose wrong keywords to fool indexing robots, or may simply not bother to specify any at all. Since the software robots cannot rely on the “keywords” given by web page authors, and since they do not have the intelligence of a professional librarian, the most comprehensive search engines treat every word in an on—line document as a potential keyword, which results in large numbers of irrelevant documents being returned for a query, simply because the query word occurred somewhere in the document, not because the query word was an important keyword of that document. Since various web search engines differ in their indexing criteria, coverage, and recency, identical queries to different services return different results. Query format also varies with the particular implementation, and most users have difficulty in formulating optimal queries for all the search engines. If the user is skilled enough to formulate an appropriate query, most search engines retrieve pages with adequate recall (the percent of the relevant pages retrieved among all possible relevant pages), but with poor precision (the ratio of relevant pages to the total number of pages retrieved), because of the keyword selection problem described above. The important point is that searches are conducted and matching documents are found based on an individual word level of representation, rather than an idea or topic level. 15 For example, a naive user looking for information on legal issues involved in run- ning a small business might. try the query legal small business, which to Alta\'ista means “give me all the documents which contain any of the words legal, small, or business. This query returns over a million documents (1,813,308 contained legal, 4,133,148 contained small, and the 8,805,231 containing business were ignored by AltaVista). A more sophisticated user who is familiar with AltaVista’s query syn- tax might try the query +lega1 +"small business", which means “give me all the documents which contain the word legal AND the phrase small business. This query returns around 10,000 documents, still an overwhelming amount for a user to sift through, and they are not relevant or precise. The results of these kinds of searches are typically presented as a list of URLs matching the documents along with their first 20—30 words. The list is usually ranked according to the search engine’s notion of relevance to the query, which may give higher rank to documents which contain the search terms in the title, documents which contain the search terms more times than others, or other schemes which are usually not made known to the public. In any case, there is no topical proximity in the returned list, and the user must search linearly through the list, possibly fetching the full document, in order to determine if any of the listed documents are of interest. In contrast to keyword based representations, global categorization schemes cre- ate directories (hierarchical categories), and are an attempt to provide some topical proximity in Internet documents, analogous to the Dewey Decimal System or Library of Congress Subject Headings in libraries. Yahoo! [4] is the largest and most well known Internet directory, but others also exist, sometimes alongside search engines 16 _- ‘r (e.g, Infoseek [a] and Lycos [6]). 
To construct a directory, individual documents are submitted or collected by hand, then assigned to a category in the hierarchical or- ganization of topics created by the directory designers. Browsing is facilitated by providing the hierarchical topic organization. For example, to find information on karate, it is easy to start at Yahool’s top level, and successively travel down the hi- erarchy to Recreation and Sports, then to Sports, Martial Arts, and finally Karate, where there are 101 documents listed. However, not all topics are as clear cut. For example, to search for a down-loadable version of the publicly available stemming code from the textbook ”Information Re- trieval: Data Structures and Algorithms” by Frakes and Baeza—Yates [11], one might guess to start at the top level Yahoo! category of Computers and Internet, then the sub category Software, then perhaps Archives. In any case, trying to guess the cat- egories from this point leads to frustration, since the code in question is nowhere to be found within the Computers and Internet sub—hierarchy. It is found in Yahoo!, but under the tOp level category Reference, then Libraries, Library and Information Science, and finally, Information Retrieval. Successful navigation of the tOpic hier- archy requires that the user have the same notion of topic organization used by the designers, not the users’ personal organization. Another problem with a global categorization is that the number of documents listed in a subject directory is necessarily much smaller than that in search engine indices because of the human categorization bottleneck. There have been attempts at automatic classification of web documents (e.g., [12], [13]), but these have had limited SUCCESS. 17' Rather than relying on user—specified queries of keyword based searches. or global subject categorizations. we will instead use our automatically determined features which describe the content of a document at. a higher level than individual words, according to an individual user’s interests. We then use these higher level features to seek out additional information and organize other documents (which may be candi- date interesting documents resulting from a search, or merely previously unclassified documents we already had on hand) in a topically related browsable form. This is an improvement over current methods for information discovery on the Internet. Chapter 2 Literature Review Information discovery from textual sources is a fairly mature field in the context of libraries, but has received much renewed interest recently because of the vast new amounts available from on—line sources such as the World Wide Web. Automatic methods for information discovery in traditional settings make up the field of In- formation Retrieval (IR),where the focus is typically on search for documents which meet a need that is expressed in a query. Many of the current methods of discov- ering information from on—line sources have roots in traditional IR techniques, so we begin this chapter with an overview of IR and the most widely used model of information, Salton’s vector space model [14] and its variants and assumptions. With this background, we can also examine current methods for on—line search. The other major activity in information discovery, browsing, is dependent on the specific or- ganization of text collections, so we next review various methods of text collection organization, and conclude with current techniques for browsing assistance without an explicit organization. 
18 19 2.1 Traditional Information Retrieval 2.1.1 Word Level Representation Information Retrieval [14. 15, 11] is a mature field which grew out of a need for automatically searching large collections of text such as those found in a library. In order to meet a user’s need for information, a query is given, and pointers to documents that are relevant to that query are returned to the user. Representation of documents, representation of queries, and determination of relevance are important issues in IR. The most widely used model for query—document matching is Salton’s vector space model [14], where each document (pattern) is represented as a vector of words (features). In this approach, n words form an n—dimensional feature space, and each document lies somewhere in that space. A query for information is composed of words, so it can also be represented as a vector in that same feature space. Matching a query to documents then consists of plotting the query in the space, and returning the documents which lie closest to the query according to some distance measure. The values for each vector element can range from binary (where a 1 indicates occurrence of the word in the document, a 0 indicates absence), to values derived from the frequency of the word in the document, to other more complex weighting schemes. Choice of a distance measure is coupled with the representation of the feature vectors. Possibilities include Euclidean distance, cosine coefficient, Dice’s coefficient, Jaccard’s coefficient, etc. [15] For example, a popular way to form values for the components in the feature vector 20 is to use the term frequency/inverse document frequency (TF/IDF) measure [14]. [15] for weighting each term. 1 Term frequency reflects the number of times a given word occurs in a single document, and document frequency refers to the percentage of documents from the entire collection in which the term appears. The idea behind the TF/IDF measure is that words that occur often in a document, but rarely across the entire document set, are important words and Should receive a higher value (weight) in the feature vector. The choice of a distance measure between documents, or between documents and queries, can imply an assumption about the importance (or lack thereof) of the length of a document. Most of the distance measures used, including the most popular cosine coefficient, automatically normalize for document length [15]. For library collections, where the types of documents being indexed are fairly homogeneous in form, it is reasonable to assume that terms in long documents should not be weighted more heavily than terms in short documents. However, Web documents can range from simple lists of URLs to complete texts. The assumption that long and short docu- ments should be equally treated for indexing and representation purposes may not be a good assumption for Web documents, as we will discuss more fully later. Many studies have been done on effective normalization schemes (e.g., [16]), but all of these have been with document collections (TREC, Reuters) whose members do not vary much with respect to style, unlike Web documents. 1A term is technically different from a word in the information retrieval literature, reflecting the fact that the original word may have been stemmed or changed to some canonical form, but we will use term and word interchangeably, and will note stemming or other operations when appropriate. 
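As a concrete illustration of the weighting and matching just described, the sketch below builds TF/IDF vectors for a toy collection and ranks documents against a query by the cosine coefficient. It is a minimal, assumed reading of the standard formulation, not the exact weighting or normalization used by any particular system discussed in this chapter; the toy tokens and document indices are purely illustrative.

```python
# Minimal sketch of the vector space model with TF/IDF weighting and the
# cosine coefficient; real systems differ in term selection and weighting details.
import math
from collections import Counter

def doc_frequency(docs):
    """Number of documents in the collection that each term appears in."""
    return Counter(term for doc in docs for term in set(doc))

def tfidf(tokens, df, n_docs):
    """TF/IDF weights for one document (or query), given collection statistics."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if t in df}

def cosine(u, v):
    """Cosine coefficient between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

docs = [["grain", "wheat", "export"], ["oil", "gas", "export"], ["wheat", "harvest"]]
df, n = doc_frequency(docs), len(docs)
doc_vecs = [tfidf(d, df, n) for d in docs]
query = tfidf(["wheat", "export"], df, n)
ranking = sorted(range(n), key=lambda i: cosine(query, doc_vecs[i]), reverse=True)
print(ranking)  # document indices ordered by decreasing similarity to the query
```

Note that the cosine coefficient divides by the vector norms, which is why it implicitly normalizes for document length as discussed above.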
21 2.1.2 Feature Selection and Extraction When considering documents as patterns and the words in documents as features, it obviously does not make sense to consider all words as equally important. We know that not all words in a document convey information, and that some words convey more conceptual content than others, so feature selection and extraction is an important aspect of document representation. It should be noted that sometimes all words in a document are not even eligible for feature consideration. Due to computational limitations, many systems in the past only used keywords chosen by humans, or words from only the title and abstract. As early as 1957, Luhn suggested that the most frequently appearing words in a document collection, as well as the least frequently appearing words, are not as significant [17]. Selecting features based on the “Luhn cutoffs” is still a popular method of feature selection, although there is no well established way to specify the thresholds for the cutoffs. Related to the idea of most frequent words not having high information content is the notion of a negative dictionary, or stoplist [18]. Words such as “and”, “of”, “the” should be eliminated from consideration as features, no matter what their frequency. Another operation which reduces features is stemming [19]. Stemming is the process which reduces a word to its stem (e.g., “computer” and “computing” both become “comput”). Stemming explicitly acknowledges the fact that distinct original features are not independent. Bookstein et. al. [20] attempted to identify content bearing words based not only 22 on frequency, but also based on the sequential occurrence pattern of the words. The hypothesis is that words which tend to appear in "clumps” throughout the document convey more information than words which are distributed randomly. This method uses the key idea that a given word may be important in one set of documents, but irrelevant in another. One method of both reducing the feature Space and eliminating feature inter- dependence is Latent Semantic Indexing [21]. Here the original pattern matrix is converted into an orthogonal one, which can be truncated aCcording to those vectors which correspond to the smallest eigenvalues. This is basically the principal compo— nents projection. It is claimed that the newly derived feature vectors contain “latent semantics” of the documents — semantics that can not be described explicitly, but are present as a linear combination of the original features. We attempted a similar approach, as will be described in the next chapter, but while latent semantic indexing may be useful for document classification accuracy, it is ineffective for search on large dynamic data collections, such as the World Wide Web. An extension to the basic vector space model which is related to the idea of latent semantic indexing is the Generalized Vector Space Model (GVSM) [22]. GVSM, rec- ognizing that one of the limitations of the Vector Space Model is that terms are not independent (especially when they have multiple meanings), considers each word to be independent of the others, unless two words both index the same document. The method then creates orthonormal basis vectors from the document—term frequency matrix, and derives a similarity measure incorporating these basis vectors. It was reported that precision was improved over the basic V SM model, but using linear 23 combinations of the original terms does not convey conceptual meaning of the doc- uments. 
This model was proposed during a time when documents were represented by a small set of keywords. Another attempt to capture semantics in the representation of documents for search is the Semantic Vector Space Model [23]. This model also-recognized that the basic vector space model, as a set of words, does not capture any contextual information about a word’s usage in a given document. The semantic vector space model extends the basic vector space model by using a heuristic natural language parsing technique which assesses the probability that a word in a given sentence plays one of a set of thematic roles as defined by linguists. A given term t in a document can play one of m thematic roles. Probabilities for each of the possible roles are computed for k occurrences in a single document, which are then combined linearly into a single length m term vector, VT. A single document, then, is represented, not as a weighted term vector (with n terms), but as a n x (m+1) matrix, where the rows of the matrix represent a vector for each term in the document. These row vectors contain first, the original term weight from the traditional vector space model, then the term vector VT which represents the roles that term might take in the document. In other words, the first column of the matrix is the traditional weights from the vector Space model. One issue that arises in using this model is the problem of computing similarity between matrices. It remains to be seen whether this fairly recent model will prove useful. 24 2.1.3 Query Modification When a user has an information need, it is usually specified in the form of a query — a free text or boolean expression consisting of keywords. The query is represented as a vector of its component words, then plotted in the same vector Space as the documents are, so that documents “near” the query can be returned as the most likely information relevant to the need. This is a simple and mathematically easy way to match information sourCes with information needs, but queries typically have many fewer terms than even the shortest documents, and while the mismatch in length is not too much of a problem when documents are represented by relatively few keywords, the length difference is greatly magnified for World Wide Web documents, which are usually indexed by their full text, and for which typical user queries consist of only two words [24]. The mismatch which results when a document is represented by too many terms, and a query is represented by too few terms, causes queries to match too many documents, and leads to the low precision characterized by Web searching. It has long been recognized that user generated queries are imperfect indicators of information needs [25]. One technique which addresses this problem is known as query modification (also known as relevance feedback), which weights the terms in a query according to the user’s evaluation of documents returned. In the first iteration, when the documents resulting from a user’s query are returned, the user labels individual documents as relevant or irrelevant. Then the distribution of the query terms in these two document sets are analyzed, and the query terms are assigned weights accordingly. For example, a query word which occurs frequently in the set of relevant 25 documents but infrequently in irrelevant documents receives a higher weight. while terms occurring frequently in irrelevant documents and infrequently in relevant ones will receive lower weight. 
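The reweighting step just described can be sketched as follows. This is only one simple, assumed way such weights might be computed (terms frequent in documents the user judged relevant are promoted, terms frequent in irrelevant documents are demoted); the cited systems each use their own particular formulas, and the constants and names here are illustrative.

```python
# Hedged sketch of query reweighting from user relevance judgments.
from collections import Counter

def reweight_query(query_terms, relevant_docs, irrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """query_terms: list of words; *_docs: lists of token lists.
    Returns {term: weight} for the modified query."""
    def mean_tf(docs):
        counts = Counter(tok for doc in docs for tok in doc)
        return {t: c / max(len(docs), 1) for t, c in counts.items()}

    rel, irr = mean_tf(relevant_docs), mean_tf(irrelevant_docs)
    weights = {}
    for term in set(query_terms) | set(rel):
        w = (alpha * (term in query_terms)
             + beta * rel.get(term, 0.0)
             - gamma * irr.get(term, 0.0))
        if w > 0:
            weights[term] = w          # terms pushed to negative weight are dropped
    return weights

new_query = reweight_query(
    ["clustering"],
    relevant_docs=[["clustering", "kmeans", "feature"], ["clustering", "distance"]],
    irrelevant_docs=[["clustering", "cuba", "expatriate"]])
print(new_query)  # "kmeans", "distance", "feature" gain weight; "cuba" is not added
```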
The newly weighted query is then submitted to the system, which returns another set of documents, which the user again evaluates as relevant or non relevant. There may be many iterations, all of which require the user to evaluate every document returned at each stage, which becomes tedious. Another technique for more closely matching a user—defined query to actual in- formation needs is query expanSion, which modifies the query by adding additional- words from a thesaurus.2 The thesaurus may be constructed by hand, which is very labor intensive and time consuming, or it may be constructed automatically. Word clustering is an appealing idea to automatically construct a thesaurus, and early work using this technique was done by Sparck—Jones [26]. In that work it was pointed out that though the intent at the time was to improve recall in retrieval by adding words not included in the original query, use of an automatically constructed thesaurus could improve precision as well, by disambiguating word senses. Words from the entire document collection are clustered to obtain thesaurus classes. This early work was only applied to small (by today’s standards) data sets, and only marginal improvements in recall and precision were realized, as has been the case with many extensions to this approach Since the early work [27] [28] [29]. A global approach to clustering the words identifies words that have similar co—occurrences across the entire corpus of documents, but it does not take into account the idea 2A thesaurus in information retrieval is not like the lay man’s idea of a thesaurus which contains only synonyms. An IR thesaurus entry for a given word may contain synonyms, broader terms, narrower terms, or related terms. 26 that different information needs will have different thesaurus classes. Global methods do not capture the context of a meaningful subset of documents which reflect the information need. In contrast to global methods, there have been local methods of automatic the- saurus construction described in early literature [30, 31]. In these approaches, only the top ranked documents returned after the results of the initial query were analyzed for local feedback. Based on analysis of these returned documents, queries were both expanded and re—weighted. Rather than relying on the user to determine which were relevant and which were not, these approaches assumed relevance, and adjusted the queries accordingly. A flaw in this approach is that if the n top ranked returned documents (or even a subset of them) are in fact not relevant (as is often the case in Web searching), the analysis of the words contained in these non-relevant documents will skew the context. Again, improvements in retrieval using this early approach were not promising. More recently, Xu and Croft have proposed an approach to query expansion using global analysis methods on a local subset of documents, which they call local context analysis [32]. They use noun groups (rather than individual words) as concepts, which are chosen from the top ranked documents (as in the local feedback described above). However, instead of using the “best” documents as a whole to choose the concepts, they use the “best” subsets of documents, which in this case are text windows with 300 words. They report better results in terms of precision and recall using this approach than with using either global analysis or local analysis alone. 
Crouch and Yang [33] describe a method for automatic thesaurus construction which clusters documents rather than words in an attempt to capture context from those documents seen to be similar to each other in some way. These clusters of documents are then used as a local set on which to do analysis to construct thesaurus classes.

AltaVista's Live Topics

AltaVista has attempted to aid the user in refining their query through their experimental Live Topics interface to the AltaVista search results. When many documents are returned, the system gives suggestions for additional terms which may improve the query. The user can select more terms (or delete some), and the query is resubmitted. Live Topics is a form of relevance feedback for generating better queries, which still must be submitted in the usual way.

With an original query consisting of the single word "clustering", the AltaVista search engine returns 20,000 documents. Live Topics suggests only 2 additional words to refine the query: Cuba and Bearer. On another try with the same query, the only two suggested words were expatriate and Anderson. These two sessions of Live Topics were of no help in narrowing down the 20,000 documents which matched "clustering". In another Live Topics session, the original query consisting of the phrase "curse of dimensionality" resulted in about 200 matches; the terms Live Topics suggested for query refinement included "clustering", "statistical", "feedforward", "hyperplane", "Gaussian", and "irrelevant". When "clustering" and "irrelevant" were clicked on, and the query resubmitted to include these keywords, AltaVista returned no matches. This experience illustrates that for some topics, Live Topics encourages query refinement to the point of narrowing down the search so much that there are no longer any matches. According to the documentation, Live Topics works best when the number of initial matches is 200 or more [34].

Live Topics does not consider the relevance rankings of the matches to the initial query. Rather, it analyzes all documents which provide a match [34]. Live Topics may not work well because its enhanced queries are dependent on the quality of the initial set of returned documents. If many irrelevant documents are returned as a result of a query, using words from those irrelevant documents to augment the query is not appropriate. Our approach instead clusters words from documents already known to be relevant.

2.1.4 Evaluation of Search Effectiveness

Evaluation of information retrieval systems is a difficult problem [15, 35]. Even though there has been more than 20 years of research in retrieval effectiveness evaluation, nearly all evaluation measures are based on precision and recall. As mentioned in the introduction, recall is the percentage of all relevant documents which were actually retrieved, and precision is the percentage of the documents retrieved which were actually relevant. More formally,

    recall = |relevant ∩ retrieved| / |relevant|

    precision = |relevant ∩ retrieved| / |retrieved|

Although these measures and those based on them can be mathematically defined, they rely on the subjective notion of relevance. They also require a binary partition of documents into the absolute classes "relevant" and "non-relevant". Therefore, comparisons of retrieval systems are often done on standard document collections where a large number of specific queries and relevance judgments have been labeled with respect to those two classes by experts.
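The two definitions translate directly into set arithmetic; a minimal sketch follows, in which the document identifiers are arbitrary and purely illustrative.

```python
# Direct implementation of the precision and recall definitions above.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                      # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 10 documents returned, 4 of them relevant, out of 8 relevant overall.
p, r = precision_recall(retrieved=range(10),
                        relevant=[2, 5, 7, 9, 11, 12, 15, 18])
print(p, r)   # 0.4 0.5
```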
Yao [36] suggests a performance measure which assumes documents are arranged by a weak order of relevance (which the user determines by comparing two documents at a time), leading to an ordinal scale of relevance of a number of equivalence classes of documents (rather than just two). The performance of a system is based on the distance between the user’s weak ordering and the system’s ordering. The traditional precision and recall measures can also be derived from the distance. The advantages of this approach are that documents can have an arbitrary number of levels of relevance, and each user may define their own personal ordering. The disadvantage of measuring system performance in this way is that the user must define their preference ordering of the entire document collection for every information need, which is clearly not practical in real systems. However, the idea of a user—defined weak ordering is a promising new development in measuring the performance of retrieval systems. In operational systems (as opposed to experimental ones), it is only the individual user who can truly determine the effectiveness of a retrieval system. 30 2.1.5 Traditional IR Search vs. On—Line Search The search for specific information from on—line sources can use the traditional infor- mation retrieval model of a Specific query which is then matched against a pre—existing index of documents. However, there are differences between traditional published works (where keyword selection, for example, is under the control of a publisher), and on—line information, where an author is also the publisher, and does not have the same discipline, training, or motives as print publishers. These differences cre- ate difficulties in applying the library science information retrieval model to Internet documents. Since an author can not be relied on to provide appropriate indexing information, automatic methods are used to construct massive indices of web pages. And since automatic methods are incapable of intelligently deciding which terms to index, the most comprehensive index all terms found in a document. This leads to many documents matching a given query, usually more than the user can easily ex- amine. Most search engines provide a ranking of the matched documents in order of relevance. However, there is no universally accepted measure of relevance, and dif- ferent search services use many different measures and algorithms in determining the score for a given document. Often, the list returned from a search contains irrelevant pages being ranked highly, while those truly relevant to a user are buried so far down the list that they are not examined. One of the assumptions used in traditional information retrieval models is that the keywords (those used to construct the pattern vectors and which define the feature space) are a proper subset of all of the words in the documents. Usually this subset 31 is much smaller than the set of the total occurring words, which means that. there is a somewhat controlled vocabulary. One of the challenges of automatic index building is the appropriate selection of the subset of words which comprise the feature space. Most information retrieval systems use statistical properties of the frequencies of word occurrences or co— occurrences to select the keywords. 
Another assumption in information retrieval systems is that the entire collection of documents is available for analysis, in order to construct the entire feature space and to compute statistics on the usage of certain words across the entire collection. For example, the TF/IDF measure requires all term frequencies in the collection to be known before constructing a feature vector for a document (or a query). The sheer size of the collection of Internet documents, as well as the dynamic nature of the collection, makes measures based on corpus-wide statistics impossible.

2.1.6 Search on the World Wide Web

The major search engines employed on the World Wide Web use traditional IR models based on full text indexing of documents encountered by their web robots, but commercial systems give few details about the specific models, distance measures, features, and relevance ranking methods they employ. Since the exact parameters of the model are unknown, it is hard for users to design optimal queries, even if they are information retrieval experts.

Since no librarian or publisher has specified relevant keywords for a given document, the current search engines use some form of full text indexing (with values of feature vectors based on different factors, such as giving more weight to words occurring in titles or hyperlinks). Typically recall is good (there are plenty of documents returned on almost any query), but precision is poor, since the search is similar to a Unix "grep" command (simply a search for the query string occurring anywhere in the text). Marchiori's work [37] has recognized some of these problems and sought to improve the relevance ranking algorithm by considering the hyperstructure of a document as well as the content. In that system, information content is a function of the text and other text which can be reached via a certain number of hyperlinks (i.e., a number of mouse clicks) from the original text. It uses the premise that connectivity is important for determining a page's worth.

2.2 Organization of Textual Information

Recall that browsing is an important activity for information discovery as well as search. In fact, one could argue that it is the browsing capability of the World Wide Web that has led to its huge popularity (after all, ftp and gopher have been around for a long time, but neither of them saw the tremendous growth that the web did). A meaningful organization of resources is essential for effective browsing. In this section we examine various document organization schemes arrived at by document clustering and document classification, as well as personal browsing agents.

2.2.1 Global collection representation

In order to facilitate browsing by concept, an information item's representation must necessarily be different from the word based model described above, since individual words in isolation do not in general denote concepts. Effective browsing requires a means for a user with general information needs to browse collections of text by examining several sources which are topically related. In a library, a user with a broad, general information need can go to the area where sources on the topic are kept, and briefly examine several to find text that meets the need. Organization of entire collections of information conceptually requires a universal, global topic categorization. Examples of such organizations in the library context are the Dewey Decimal System [2] and Library of Congress Subject Headings [3].
Examples from the Internet/Web are directories such as Yahoo! [4] and Infoseek [5]. All of these global categorizations require two things that are difficult to automate: (1) an agreed upon categorization scheme and (2) assignment of textual sources to the appropriate category.

Global representation is intended to provide topical proximity of textual information, in order to facilitate browsing. It provides the mapping of document location to topic. To use a global representation to fulfill an information need on the Internet, you must know and understand the categorization scheme used.

2.2.2 Subject Catalogs

There have been some attempts to create subject catalogs for the World Wide Web. The most extensive and successful of these is Yahoo! [4]. Yahoo! gives a catalog of web pages based on a hierarchical topic structure. The catalog is constructed according to the standard librarian approach; a human must decide on the appropriate categories and assignments. Since it relies on human categorization, its catalog is easy to browse if you have an idea of the topic you're looking for, but it suffers from an immense bottleneck of human evaluation. In a library, a document's content never changes, and there are relatively few new documents added in a given period of time. However, on the web, there are so many new documents added (and already so many documents in existence) that the human catalogers cannot hope to keep up. In November 1996, Yahoo! indexed 185,000 documents compared to AltaVista's 21,000,000 [38]. As of mid 1997, Yahoo! indexed 500,000+, but AltaVista had grown to 100 million [39]. Another drawback of catalogs is that there is no well-defined system of categorization, so it may be difficult for a user to determine in which category to look. For example, information retrieval stemming algorithms are found in Yahoo! under library science and not under computer science.

Since topical hierarchies provide guidance in narrowing down a topic of interest, they result in higher precision than the full text query based search services, but since human effort is involved, there is a bottleneck to getting more information categorized. The human evaluation factor results in lower recall. Yahoo! does not attempt to categorize the entire web; rather, their goal is to provide high quality organization for those sites that they do index. There are also human constructed catalogs for small specialized domains (e.g., Biosciences³, manufacturing⁴), which are sometimes subsets of the larger catalogs. These domain specific catalogs have the advantage that their catalogers have a smaller subset to work with, but there is still no unified standard categorization.

³http://golgi.harvard.edu/biopages/all.html
⁴http://www.nemonline.org

2.2.3 Document Classification

Document classification is the attempt to automatically place documents into predefined categories. These categories may be a globally imposed hierarchy such as Yahoo!'s, or merely the two categories of "relevant" and "not relevant" for a given user. Classification into global categories is of interest in the Information Retrieval and Machine Learning communities. Most methods require a large number of training samples and assume a fixed global document collection.

Letizia

Letizia [40] is a software agent that recommends web pages for a browsing user by concurrently exploring links from the page a user is currently browsing. It is actually performing dynamic web page classification into the two categories "interesting" and "not interesting".
It does not use any query based approach for resource discovery. Rather, it follows links from the user's current page and recommends those that it believes the user will find interesting. It infers the user's interest based on observation of the user's browsing behavior. Strong interest in a given page is indicated by saving a reference to it (i.e., "bookmarking" the page). Following a link from a page or returning to the same page several times also indicates interest. Not following a link, but following a link occurring later in the document, can indicate disinterest in the link that was "passed over". Letizia models the content of a document as a list of keywords. It provides a preference ordering of links to follow to assist the user. It does not require explicit relevance ranking of pages by the user; rather, it infers relevance based on assumptions about the user's behavior.

Fab

Another system for providing page recommendations is Fab [41]. The Fab system utilizes both the content of pages and the recommendations of other users to determine its recommendations. Rather than recommending pages directly linkable from the user's current browsing position, Fab uses a set of independent software agents which collect information from the web and try to match a user's interest profile in determining which to recommend. The user must explicitly rank all pages suggested by the agent in order to provide feedback and also to provide the evaluation of the effectiveness of the system.

Fab uses the traditional information retrieval vector space model with a term frequency - inverse document frequency (TF/IDF) measure for the features. Recall that we described above some of the problems with adopting the information retrieval model for web-based information discovery: namely the nonexistence and impracticality of measuring statistics from the entire collection, and the scarcity of labeled training samples relative to the entire collection. Fab avoids these problems by taking a random sample of 5,229 web pages, and considers that set to be the entire document collection for purposes of defining the global feature space. It then computes the maximum term frequency in a given document, and the document frequency for a given term. The 5,229 web pages yield a fixed dictionary of 70,000 stemmed terms (after stop words are removed), weights are computed for each term using TF/IDF, and then the top 100 weighted terms are selected as features for future analysis.

Thus a 100-ary vector is used to represent both documents and the user profile (corresponding to a query in traditional information retrieval). The user profile is updated according to the formula $m = m + sw$, where $m$ is the user profile (a term vector), $w$ is a document vector, and $s$ is an integer in $-3 \ldots 3$ representing the user's relevance score for that document. Several different agents are described, including search agents which explore the web [42], index agents which use existing web search services, and non-adaptive agents which consider "random" pages from various sources.

One of the problems with the Fab system is its reliance on the explicit human evaluation of each suggested page. It is likely that, especially in the early stages of learning, the user will be presented with many irrelevant or uninteresting pages, and the user will not want to bother ranking those, and may become frustrated with the high time investment with low payoff.
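Fab's profile update rule described above is simple enough to sketch directly. This is my own illustration, not code from the Fab system; the vector size and score are hypothetical.

```python
import numpy as np

def update_profile(profile: np.ndarray, doc_vector: np.ndarray, score: int) -> np.ndarray:
    """Fab-style update m = m + s*w, with s an integer relevance score in -3..3."""
    assert -3 <= score <= 3
    return profile + score * doc_vector

profile = np.zeros(100)                  # 100 TF/IDF-weighted term features
doc = np.random.rand(100)                # TF/IDF vector of a suggested page
profile = update_profile(profile, doc, score=2)   # user rated the page positively
```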
Experience with Fab has also shown that users are very strict in their rankings of pages: usually either very high or very low, rather than using the full range of the 7 point scale.

WebWatcher

WebWatcher [43] is a personal agent that attempts to learn which hyperlinks will lead to target information. Like Letizia, WebWatcher observes the user's current browsing behavior in order to infer relevance information about documents. However, unlike Letizia, WebWatcher assumes the user has a specific goal (in the form of a scientific paper in a specific domain) in mind, and observes a user's behavior to learn how to predict whether a hyperlink will be followed or not. Though the authors call this web search, they are not classifying documents. Instead, they are classifying hyperlinks into two classes: followed vs. not followed. They consider only information from the text of the current page, and not the content of the target of the hyperlink. It may not follow that these followed vs. unfollowed links correspond to relevant/irrelevant pages.

WebWatcher uses a fixed length (530) boolean feature vector composed of 4 subvectors. The words comprising the features are determined during the learning phase. The subvectors are used to distinguish words of 4 categories: words occurring in hyperlinks (the anchor text), words occurring in the sentence containing the hyperlink (e.g., "click here for information about machine learning"), words in the headings associated with the hyperlinks, and words used to define the user goal. Feature selection for the first three subvectors was performed by ranking all the words in the training set according to their mutual information with respect to correctly classifying the training data, then choosing the top N words in the ranking. Mutual information is a feature selection technique where each feature is taken in isolation and tested on its ability to classify the training data.

Preliminary results were given from data gleaned from 30 browsing sessions of 3 different users who started from the same page and had the same kind of goal (a specific kind of scientific paper). 23 of the 30 sessions resulted in a paper being found. Evaluation of the learning performance of the system was done by a comparison of the system's recommended hyperlinks with the links that the user actually followed. The best performer recommended as its first choice the link that the user actually chose 30% of the time, and the user's choice was in the classifier's top 3 choices 54% of the time. One disadvantage of this approach is that it does not consider the content of a linked document, as mentioned above. Another is that it must have a specific target goal, and training is only relevant for that goal. For every new goal a user specifies, training (including feature selection) must start again from scratch.

Hotlists and Coldlists

Pazzani, Nguyen, and Mantik [13] describe an information filtering agent that tries to learn a user's interests based on their ranking of pages. In their experiments, a single user was asked to classify a set of URLs from a single domain (BioSciences) as interesting (the "hotlist") or not interesting (the "coldlist"). Out of 120 total URLs, the user selected 38 documents as positive examples (interesting) and the remaining 82 were negative examples (uninteresting). Sets of varying sizes, ranging from 10 to 100, were randomly selected to be training patterns, with the remaining patterns used as test data.
Features were selected from the training data based on the expected information content of a word, with the idea that the most discriminating features would be those that occur with very different frequency between the two classes. The 128 most informative features (words) were selected to form binary feature vectors, indicating occurrence or non-occurrence of the word in the document. Error estimation was based on the classification accuracy of the test data. Four different learning algorithms were compared: a Bayesian classifier, a nearest neighbor algorithm using the matching coefficient as a distance measure, a nearest neighbor algorithm using a value difference metric [44] as a distance measure, and the ID3 decision tree algorithm.

Results showed that the Bayesian classifier clearly outperformed the other algorithms, achieving 85% accuracy with only 15 training examples, and 91% accuracy with 100 training examples. None of the other algorithms ever exceeded 78% accuracy. The authors conjecture that the superior performance of the Bayesian classifier may be due to the difference in importance of various features accorded by the conditional probability of each feature given the class, where the other algorithms treat all features equally, and the fact that the Bayesian classifier examines all features at the same time (in contrast to ID3). However, they do not mention the fact that the prior class probability is also considered by the Bayesian classifier, and the global pool of examples has P(hot) = .317 and P(cold) = .683. The decision rule for the Bayesian classifier is thus already predisposed to decide "uninteresting", and with no other information at all, it would be accurate nearly 70% of the time.

For the general information discovery problem, the prior class probabilities of interesting/not interesting cannot be computed, but they are likely to be even more skewed toward uninteresting than 30/70. If the prior probabilities were 5/95, one could use a classifier that always decided "uninteresting" and report a 95% accuracy rate. Note that the success of learning the profile may be due to the fact that all examples were from a single domain, which likely has a specific vocabulary. Note also that the assumed universe for feature selection was only the pool of interesting and uninteresting training data, and the features were chosen as those which were most discriminating across only this training set. While this technique may have produced accurate classification in a specific domain, it is questionable whether the features used for classification would be of any use in seeking additional information that was not present in the initial set of data.

2.2.4 Document Clustering

Document clustering also organizes documents into groups, but the groups are not according to some predetermined labeling; they are groups whose members are similar in some way, and dissimilar to members of other groups. Although there are no labeled categories (except for some attempts to label clusters), an organization of this kind enhances browsing once the user has found something of interest by providing topical proximity.

Visual Organization

Kohonen's WEBSOM group at Helsinki University of Technology uses a self organizing map (SOM) to present on-line documents for easier browsing and exploration [45], [46], [47], [48].
A self-organizing map is a neural network technique in which high dimensional statistical data is arranged in an easily visualized spatial map such that like inputs are near each other on the map. The first step is to create a SOM of word categories from the example documents using a local neighborhood of context (i.e., words that appear nearby) from the full text. The local neighborhood vectors are input into the neural network, which creates a map of word categories. Then individual documents are encoded as a histogram of the word categories, and a SOM of the encoded documents is created which can be browsed. This technique has been used to organize Grimm's fairy tales [45], articles from the single Usenet newsgroup "comp.ai.neural-nets" [48], and articles from 20 different Usenet newsgroups (which used a partially supervised method to increase the separability between groups) [47]. For the newsgroup "comp.ai.neural-nets", there were 4600 documents and 1700 words considered. The maps were created on specialized hardware, the massively parallel CNAPS neurocomputer, and then fine tuned with SOM software. One problem mentioned in connection with organizing Usenet articles is the fact that they contain both short query or answer postings, which may not provide enough context to properly categorize them, and long articles which contain multiple topics and properly belong in multiple locations.

HyPursuit

HyPursuit is a hierarchical network search engine described by Weiss et al. [49] which gives an architecture for organizing the web; it clusters hypertext documents in a "given information space" to aid browsing and searching. They propose a hierarchical organization where each information server's documents are clustered, and the server keeps track of the summary information on them, so that queries can be routed to particular servers which are likely to contain the information. This scheme is in contrast to current existing search engines, which construct a massive single database of information on the entire web. One of the main aims of the HyPursuit project is to distribute the indexing information and then group the indices in order to alleviate network load and provide a coherent organization. Unfortunately, this proposed scheme requires universal agreement on the new paradigm, and sites which do not follow it will still require other methods of information discovery.

The interesting aspect of the HyPursuit work is in its definitions of document and cluster similarity. They not only use document contents (the terms contained in the documents), but also hyperlink information. The idea is that if several documents $d_1, d_2, \ldots, d_n$ all contain a link to another document $d$, then they are likely to be semantically related. Similarly, if several documents $d_i$ are linked to from the same document $d$, then the $d_i$ are semantically related. Consequently, the definition for document similarity contains terms for both content similarity and hyperlink structure similarity. The idea that there is semantic information in the hyperlinks of a document is worth investigating further; however, it is difficult to discern the link structure for the entire web unless only small subsets are considered at a time.

Syntactic clustering

Rather than clustering documents based on semantic concepts, or topical categorization, Broder, Glassman, and Manasse clustered web documents based on their syntactic similarity [50], with the goal of finding documents that are "roughly" identical. For example,
documents which occur in the same cluster may differ only in formatting, minor corrections, webmaster signature, or logo. Their goal is therefore not enhancing information discovery on the web, but applications such as updating widely used information (such as FAQs which occur on multiple sites), identifying intellectual property violations, or finding a document which has changed locations. We are more interested in semantic clustering, and will not further discuss syntactic clustering.

Topic Categorization

Martin [12] describes a system for clustering documents into overlapping hierarchies using the factor learner algorithm. The hope was to achieve groups of documents which were comparable to the Yahoo! categories. In this algorithm, each of several different groupings of n documents is called a factor. The factors are found by constructing an undirected graph consisting of all features, where a link between two features indicates a dependency relation, and no link between features indicates independence. The factors are then formed by finding cliques in the graph. Once the factors are found, the objects are clustered according to the factors using a greedy approximation of a Bayesian clustering algorithm. The method estimates highly probable conditional dependencies between factors and features.

Martin constructs a stop word list based on the most frequent words across the collection, augments keywords using WordNet [51], and uses frequency for feature values. He does not specify how many features were ultimately used, and describes preliminary results of clustering 54 documents from a single Yahoo! category. Using only words from the documents as features, the resulting clustered hierarchy matched the Yahoo! hierarchy with only a few shared headings. Using the augmented words from WordNet, the resulting hierarchy shared half the headings with the Yahoo! directory.

SONIA

Sahami, Yusufali, and Baldonado [52] describe the SONIA project, which clusters documents returned as the result of an initial keyword search. They use partitional clustering rather than hierarchical clustering as in [12]. (Of course, partitional methods can always be applied iteratively to achieve a hierarchy of clusters.) Their goal is to aid search and browsing by dynamically organizing the results from an initial keyword search. They select features from the documents by first eliminating common English stop words, then very infrequent words, and then words with minimal entropy. They use a Boolean representation for feature vectors (indicating presence or absence of a given word in the document). They then cluster the documents using both a Bayesian network (AutoClass) and a K-means algorithm. Once the clusters are found, they extract "descriptive" keywords for each cluster, based on the cluster centroid, or frequency of the words in each cluster. They present results from a query on "mining" which led to 50 documents and 3500 features, and a query on "Mars", which led to 39 documents and 1400 features. Results showed plausible clusters of subtopics within the query domain (i.e., mineral mining vs. data mining, pictures of Mars vs. life on Mars).

Scatter/Gather

The Scatter/Gather browsing technique from Xerox PARC [53] creates a hierarchical structure for navigating through large text collections. The original intent of Scatter/Gather was not to create a search tool to replace current information retrieval techniques, but to communicate information about the topic structure of a collection to a user.
In the Scatter/Gather interface, the user is presented with a set of clusters of documents along with a summary of their contents based on topical words and typical titles of the documents they contain. The user can select one or more of the clusters which are of interest (gather), and then ask to have these chosen documents re-clustered (scatter). Scatter/Gather clusters documents based on pairwise document similarity, and the cluster hierarchy is pre-computed off line in order to provide constant time access. Results indicated that Scatter/Gather is not superior to other methods of information retrieval when the goal is to locate specific documents. However, indications are that it does effectively communicate topical information structure to a user about a given document collection. Further research with Scatter/Gather indicates that document clustering is an effective way to organize a document collection which assists the user in browsing a collection while avoiding the need to express their information needs as a formal query. More recent work has been done on clustering smaller collections, such as those returned from an initial search, and on efficient clustering algorithms so that the clustering can be performed in real time [54, 55, 56].

Personal Bookmark Clustering

Maarek and Shaul [57] have attempted to provide an organization of a user's bookmarks into conceptual hierarchical categories. This task is characterized by the fact that the concepts are personal and may be different for each user, the set of documents to be organized is a small subset of the entire document collection, and the collection is highly dynamic. They use a hybrid of manual and automatic organization where the user can control the behavior of the system, specifying which parts should be automatically grouped and what should be left under the user's control. A user can specify a node in the cluster hierarchy as "frozen", which the system then considers as a black box sub-tree, and it will not reorganize documents classified under that node.

An interesting aspect of this work is that instead of single words as features, it uses pairs of words which are linked by a lexical affinity, or correlation of their common appearance. They use a binary feature vector and do not specify the precise distance measure, but say it is a direct function of how many features the documents share. A hierarchical agglomerative clustering technique is used, and the problem of where to cut the dendrogram is automatically determined by identifying the level clusterings within the tree that have a comparable degree of intra-cluster similarity. This was found to have more meaning for users than forcing them to specify a cluster size.

2.3 Summary

In this chapter we have described some of the current research into automatic information discovery from the World Wide Web. We described the models from traditional information retrieval, which are used by most of the current Web search services. We also described some of the assumptions behind the library based information retrieval models, and characteristics of Web based information which require reconsideration of "standard" information retrieval search techniques. Existing search services are intended to be applicable to the general user, and are not tailored specifically to an individual's preferences and interests.
Browsing assistants consider the personal preferences of their users, but they are limited in what they can recommend by either the pages directly accessible from the user's current position, or by pages which have been recommended by others.

We want to utilize services which cover the entire information space of the World Wide Web, and also personalize the search according to the user's interests and preferences. By using a conceptual representation of information that is tailored to the current information need and interests, rather than relying on global conceptual ideas or isolated words as concepts, we will be able to perform automatic search and provide document organization which are based on concepts.

Chapter 3

Finding Concepts

In this chapter we describe the process used to define our intermediate level features which represent or approximate conceptual information. Remember that a user first provides some example documents that they determine are relevant or interesting in some way. These may be organized into categories (e.g., as a Netscape bookmarks file, or different email folders), or simply provided as an unlabeled list of files. The documents are fetched from remote servers if necessary and copied locally, where the words are parsed and counted. A word is defined as a sequence of letters without punctuation, digits, or white space.

Our next task is to find features that will define semantic concepts that are meaningful to the user from these example documents. Semantic concepts serve two specific purposes in aiding information discovery: they provide an organization or classification of documents, and they provide a means for searching for additional similar information. We make an initial approximation of semantic concepts by finding important keywords from the example documents. Using pattern recognition techniques and terminology, and consistent with the vector space model, we consider each document to be a pattern, with the words in all the documents as features.
The next step in the process is to present these word groups to the user in order that they may further define the concept that iS characterized by a given cluster. The user may tag Specific words in a cluster as useless (unrelated to the tOpic), may ask for large clusters to be further subdivided, or tag more than one cluster to be merged 51 into the same tOpic. In addition, the user can provide a concept label to the cluster. This is not required, but can aid the user in further browsing of documents. The user’s participation at this point is not. absolutely necessary; the system can use the topic definitions it found automatically without additional refinement, and can automatically create labels based on the individual words in the topic that have the strongest membership. Once the user is satisfied with the concept word clusters, they form a basis for a new approach to various problems: 0 automatic query construction for queries to existing search services 0 new methods of indexing o classification of query return results 0 classification of previously unseen documents 0 organization of existing collections according to personalized preferences 3. 1 Preprocessing We parse each document for words, defined as a contiguous string of letters, ignoring HTML tags, digits, and punctuation, as well as common English words (such as the, of, and, etc) from a pre—defined stop list. Our stop list is the one given by Fox [18], which is based on a broad range of literature in English. CI to 3.1.1 Stemming We can also reduce each remaining word to its word stem. so that words like computer and computing both become the same term, comput. This is a standard information retrieval technique. and we used the algorithm from Frakes [19]. If stemming is done, all unique word stems from the entire training set of documents form the global feature space. Without stemming, all unique words comprise the global feature Space. Stemming the words has the obvious advantage of reducing dimensionality of the feature space, for example, one of our data sets contained 7633 distinct non—stemmed words, and 5190 stemmed words. It would also seem advantageous in classification, since words with the same stems should be related to the same semantic concept, so we Should count them as the same feature without regarding Slight variations in tense and usage as distinct features. However, there are cases where the Simple stemming algorithm reduces words to the same stem when the originals were not related. For example, “animal” and “animation” are both reduced to the same stem, “anim”, which merges two words which denote distinct underlying concepts into a single feature. A second disadvan- tage is that stemming is not a one-to—one function, so there is no inverse function from the stemmed word to the original. Once the original term is lost, it can not then be used in subsequent searching. A third disadvantage is that the best stem— ming algorithms are often inaccurate; they use heuristics about typical English word formation which do not always lead to correct results. For example, an “ed” ending on a word like formed is removed to form the stem farm. But in the word “feed”, the 53 “ed" is removed to get a stem of “fe”, which is meaningless. Our initial experiments used stemming to reduce the initial feature space as much as possible, but we have subsequently found that the additional number of non—stemmed keywords do not. sig- nificantly affect processing time nor results. 
and the utility of having complete words for generating automatic searches outweighs any advantage brought by the decrease in the number of features. Furthermore, experimental evidence shows that words which really are variants of the same root word (such as america and american, or job and jobs) appear together in the final word clusters, so their indication of similarity is preserved without resorting to stemming.

3.2 Forming the Pattern Matrix

The n example documents contain a total of d distinct words, so there are n patterns in d dimensional space, with d >> n. The n × d pattern matrix M is formed as $M_{ij} = k$, where word j appears k times in document i. One of our data sets, referred to as the "NEM-Online" data, consists of 85 documents in 4 categories containing 5190 distinct stemmed words. The documents are from the manufacturing domain, labeled by experts into the categories "labor", "legal", "government", and "design". This data set will be more fully described in section 4.1.

A portion of the NEM-Online pattern matrix is given in Table 3.1, with rows representing documents (patterns) and columns representing word stems (features). Selected features out of the 5190 total feature space are shown for illustrative purposes. Rows which are not shown have 0 values in all positions of those columns.

Table 3.1: Sample of the partial pattern matrix (rows: documents from the legal, labor, and government categories; columns: the stemmed features action, affirm, discrimin, drug, fda, banana, and central; entries: word occurrence counts).

3.3 Finding Important Features

If we attempted to classify these documents using their words as features with a parametric statistical classifier, the standard rule of thumb is that you need 10 times as many training samples as features for each class to accurately estimate the parameters [58]. Assume a user has collected 100 documents in 2 classes. Using a conservative estimate of 5000 distinct words in the 100 documents, a parametric statistical classifier would need 100,000 labeled training samples to estimate the parameters. Clearly this is not possible with so many features.

Even though our ultimate goal is not merely classification of documents but detection of underlying concepts, a reasonable hypothesis is that features which successfully classify the user's training set will reflect the distinguishing characteristics of the document set and lead to the concepts. We therefore used some feature selection and extraction techniques to reduce the number of features, and used the classification accuracy of the training set as a measure of the worth of the features. The assumption is that a feature set which best reproduces the user's classification (has the highest classification accuracy) is the set which contains the "best" features. This assumption turned out to be false, and our experiments which led to this conclusion are detailed below.

3.3.1 Feature Selection

Feature selection is the process of choosing a subset of the available features for use in classification. There are $2^n$ possible subsets of n original features, so exhaustively enumerating all subsets is computationally prohibitive, especially when n is on the order of thousands.
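Before turning to the selection methods, it may help to make the pattern matrix of section 3.2 concrete, since it is the input to all of the experiments that follow. The sketch below (my own illustration, not the dissertation's implementation; documents and words are hypothetical) builds the n × d count matrix M from already-parsed documents.

```python
import numpy as np

def build_pattern_matrix(docs):
    """Build the n x d matrix M with M[i, j] = count of word j in document i."""
    vocab = sorted({w for doc in docs for w in doc})      # the d distinct words
    index = {w: j for j, w in enumerate(vocab)}
    M = np.zeros((len(docs), len(vocab)), dtype=int)
    for i, doc in enumerate(docs):
        for w in doc:
            M[i, index[w]] += 1
    return M, vocab

docs = [["drug", "fda", "drug"], ["labor", "osha"], ["court", "appeals", "court"]]
M, vocab = build_pattern_matrix(docs)
print(vocab)   # column (feature) order
print(M)       # word occurrence counts, one row per document
```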
Some effective conventional methods for feature selection are sequential forward selection [59], sequential floating feature selection [60], and genetic algorithm search [61, 62]. We applied sequential forward selection and genetic algorithm search to our sample manufacturing data set.

Sequential Forward Selection

Sequential forward selection was applied using a 1 nearest neighbor classifier, Euclidean distance, and leave one out testing. The classifier performs as follows:

1. Select a pattern from the data set to be a test pattern whose category label we wish to assign.
2. Find the pattern vector from all the remaining data whose Euclidean distance is closest to the test pattern. This is the nearest neighbor pattern.
3. Assign the test pattern the category label of the nearest neighbor.
4. Repeat for every pattern in the data set.

The sequential forward selection algorithm first chooses the single best feature which gives the highest accuracy in classifying the training set. It then successively chooses one feature at a time which, when combined with those features already chosen, gives the highest accuracy.

Using all 5190 features (stemmed words) in the four category NEM-Online data (described briefly in section 3.2) with the above classifier resulted in an error rate of 50/85, or accuracy of 41.18%. Using an SGI Indy¹, sequential forward selection achieved 17/85 errors, or 80% accuracy, by selecting only 13 features. These features were engin, action, david, contempl, affirm, architectur, ave, osha, abund, rehabilit, notic, commerc, transact.

¹Because the implementations of the C library routine quicksort behave differently in the presence of ties on an SGI and a Sparc Ultra, we achieved similar classification accuracy with a similarly sized feature set, but very different features. On a Sparc Ultra, sequential forward selection achieved 14/85 errors (83.5% accuracy) with 15 different features. Those features were low, act, affirm, govern, drive, version, reduc, benefit, labor, bank, at, omer, regul, edgar, label.

Selecting any number of features sequentially, we were ultimately able to achieve 7/85 errors, for a 91.76% accuracy rate, with only 17 features. Those features are project, job, staff, david, apr, clinton, govern, gov, sophist, affirm, favor, commission, eeo, fiscal, cfo, tm, stori.

Since we have such a sparse pattern matrix (zero values for many of the features in any given pattern), if we only add one feature at a time to the current pattern vector, there is a high probability that for a given test pattern, many other patterns will have a zero value for that feature, meaning that all of those patterns are equally far away from the test pattern. There will be many neighbors which are equally near the test pattern. This will be true for most distance measures and any sequential feature selection method. In other words, any sequential feature selection algorithm could produce many different subsets of features which are equivalent in classification accuracy by arbitrarily choosing different features in the presence of many ties.

Genetic Algorithm Feature Selection

A genetic algorithm is a technique for searching for optimal solutions to a problem based on the principles of natural selection [63, 64]. A solution to a particular problem is encoded in a bit string called a chromosome. Multiple solutions are initialized randomly in a single population and evaluated according to problem specific criteria.
All chromosomes receive a fitness value according to the "goodness" of the solutions they represent. Three basic operations are applied to a population of solutions over several generations: selection, crossover, and mutation. Selection methods are used to ensure good solutions are more likely to survive to the next generation than bad solutions. Crossover involves choosing two chromosomes in the current generation and using them to construct a new chromosome in the next generation which contains some parts of each parent. Mutation is used to randomly change some small part of a solution (usually a single bit) in order to maintain population diversity and make sure the population does not converge to a single solution too fast. Solutions which are more fit are more likely to survive and recombine with other good solutions in subsequent generations, while poor solutions tend to die out of the population.

Genetic algorithms have been used for the feature selection problem [61, 62]. The representation of a solution in a genetic algorithm for feature selection is typically a chromosome which is the same bit length as the number of features. A bit is assigned a zero value if the corresponding feature is not selected, and the value one if the feature is selected. This allows the genetic algorithm to potentially search all $2^n$ possible subsets. Using this technique with n = 5190 dimensions is computationally prohibitive, even though all $2^n$ possibilities are not examined. Instead, we chose a subset of fixed size k, so that a chromosome could contain exactly k features. Then there are $\binom{n}{k}$ possible subsets, which is a more tractable problem for the genetic algorithm (but still cannot be exhaustively searched). We implemented this form of feature selection using the GALOPPS genetic algorithm software from Michigan State University². We chose k = 25, and let the genetic algorithm search for the best set of 25 of the 5190 features. In 300 generations, the best subset of features was found in generation 168 and achieved 21/85 errors for 75.29% accuracy. The feature set was bliznakov, volant, law, biographi, permiss, cwatx, alaska, treatment, sodium, move, evolut, version, avex, darnell, thoma, photographi, build, pany, hospit, law, lexi, plamen, briefli, export, cite. Recall that these are stemmed words.

²http://isl.cps.msu.edu/GA/software/software-index.html

3.3.2 Feature Extraction

Another pattern recognition technique we can use to try to obtain accurate classification is feature extraction. Rather than simply selecting from the existing features, feature extraction derives new features by transforming the original features in some way.

The Discriminant Karhunen-Loeve Projection

One approach used in pattern recognition problems where the dimensionality is much greater than the number of training samples is the Discriminant Karhunen-Loeve (DKL) projection, proposed by Swets and Weng for face recognition [65]. This is similar to Latent Semantic Indexing [21], which has been proposed in information retrieval, and which also constructs new orthogonal features which are linear combinations of the original features. It is called "latent" semantic indexing because it claims to capture "latent" or "hidden" semantics of a document. The DKL projection consists of first performing a principal component projection followed by discriminant analysis.
The principal component projection selects as the first feature the linear combination of dimensions which has the greatest amount of variance among all the patterns. Then a new dimension is chosen which has the highest variance of patterns among all dimensions orthogonal to the first. So we can project the original space into one whose dimensions are the eigenvectors corresponding to the largest eigenvalues of the original features which still retain some high percentage of the total variance. With our 5190 stemmed features, only 23 eigenvectors were required to retain 90% of the total variance. Thus we transformed the 5190 dimensional original feature vectors to new 23 dimensional vectors. The 23 new dimensions are orthogonal linear combinations of the original 5190. Classification using these 23 dimensional feature vectors did not achieve very high accuracy, only 54%. This result is not surprising, since the principal component method is only concerned with finding orthogonal axes of maximal variance in the feature space, and is not concerned with class separation.

However, the Karhunen-Loeve projection is only the first step in the DKL projection. The 23 new feature vectors are next projected into another space using discriminant analysis, which does consider class separation of the training patterns. Discriminant analysis considers each class of patterns in the n dimensional space, and tries to project the data into a smaller dimensional space while maximizing the between-class scatter and holding the within-class scatter constant [66]. The scatter matrix for a given class k, $S^{(k)}$, is defined as follows:

\[ S^{(k)} = \sum_{j=1}^{n_k} (x_j - m_k)(x_j - m_k)^T \]

where $x_j$ is a pattern vector from class k, $m_k$ is the mean feature vector in class k, and $n_k$ is the number of patterns in class k. The within-group scatter matrix is the sum of the group scatter matrices:

\[ S_w = \sum_{k=1}^{K} S^{(k)} \]

where K is the total number of classes, 4 in our case. The between-group scatter matrix, $S_b$, is the scatter matrix for the group means:

\[ S_b = \sum_{k=1}^{K} (m_k - m)(m_k - m)^T \]

where $m$ is the mean of the group means.

To project the data into a t dimensional space, we retain the eigenvectors corresponding to the t largest eigenvalues of $S_w^{-1} S_b$. Note that in order for $S_w$ to be invertible, we must have the number of patterns, n, greater than the dimensionality of the patterns, d. This is why discriminant analysis by itself will not work on the 85 patterns in the original 5190 dimensional space. So we use the 23 new features obtained by the Karhunen-Loeve projection. We also note that there will be at most c − 1 non-zero eigenvalues of $S_w^{-1} S_b$, where c is the number of classes. So the final dimensionality of the space must be less than the number of classes. Now in our data set we have n = 85, d = 23, and c = 4. We use discriminant analysis to project the patterns of the Karhunen-Loeve projection into a 3-dimensional space.

With our data set and performing KNN classification on the combined DKL projection, we got 43/85 errors, or 49.41% accuracy, with K = 1. This is a lower accuracy than even the Karhunen-Loeve projection alone. With K = 5, we did only slightly better, with 36/85 errors for 57.65% accuracy. Although the DKL projection performed well in the face recognition problem [65], it appears that the DKL projection is not robust enough to create a general classifier for document classification. For the DKL projection to be successful, one must train over all possible expected variations of the eventual data to be classified.
The DKL based retrieval fails if the search probe varies significantly from any of the training examples [65]. Also, in the face recognition problem there were many more classes than 4. Since the final dimension is constrained to be less than the number of classes, the DKL projection may not perform well where the number of classes is so few.

Even if the DKL projection had adequately classified the training data, it suffers the drawback that its low dimensional features are created by linear combinations of the original 5190 features, which do not by themselves give any suggestion of semantic content. We cannot therefore use eigenvector based features for subsequent search, since documents are not indexed in search engine databases based on those eigenvector features. Even if one could construct such an index for every document on the web, a new index would have to be constructed for every additional or new training example. The eigenvectors derived from the original training set will not generalize to other documents in other categories. There is no way to examine only a subset of documents on the Web according to their weightings of 5190 features that happened to be in the training set. This is a very inefficient way to discover new information, especially since the vast majority of documents on the web properly belong to the reject class for a given information need. We need the ability to seek out documents which match a user's preferences without examining the entire global set of documents, preferably using the resources that already exist, namely, keyword based indexes of documents.

Discussion

The feature extraction technique did not achieve high classification accuracy, nor did it suggest semantic content. Though the feature selection techniques achieved high accuracy in classification of the training data, the features in each subset were quite different and do not by themselves suggest the content of the original (or any) categorization. It is not meaningful to select a feature as a category discriminator when there are many potential candidate neighbors for a test pattern with few features. Why then do the selected sets of features above achieve such high accuracy rates? The answer lies in the sparse nature of the original pattern space. With 85 points in 5190 dimensional space, there will be many piecewise linear decision boundaries which adequately partition the training patterns. However, the real question is whether these decision boundaries are meaningful or generalizable. We want not only to find some decision boundary, but the best general decision boundary, and meaningful features which will suggest the semantic content of documents and enable us to find new data.

We conclude that our assumption that the feature set which best reproduces the example classification is the best feature set for defining concepts was invalid. One of the problems with this approach is that individual words are used in isolation to represent document content while a globally imposed categorization label is used as the performance criterion. We are simultaneously using two extremes of representation which do not have a good mapping to each other; we would prefer a unified conceptual level of representation between the two extremes.

3.4 Transposing the Matrix: Clustering Words Instead of Documents

Our efforts at classification accuracy using feature selection were somewhat successful, but the resulting features were not useful in seeking new documents or in suggesting semantic concepts.
The problem lies in the low level representation of information at an individual isolated word level. We now focus on ways to find features which will aid us in finding new data, and which will represent the documents at a higher conceptual level than individual words. Since we know that the words in the original pattern matrix are correlated, we should be able to cluster the words into groups whose members are similar to each other and not similar to members of other groups. We may then be able to use the groups of similar words to form concepts which are meaningful representations of the information contained in the documents, and which can subsequently be used to aid search and provide personal organization of documents.

Recall that in the n × d pattern matrix M (Table 3.1), rows represent documents, columns represent individual features (words), and the feature values are the number of occurrences of that word in the given document. When we were trying to reproduce the example documents' classification, we were grouping documents according to their pattern of words. Now, instead, we group the words according to their usage pattern across the sample document collection.

To group the words, we used a K-means partitional clustering algorithm [67], where K points are chosen randomly as the means of K groups in the n dimensional space. Each of the d points is then assigned to the group whose mean is closest. After each point is assigned, the group means are recomputed, and the process repeats until either there is no change, or a fixed number of iterations has been reached. The K-means process is not guaranteed to find a globally optimal grouping, so we run it several times with different random seeds, resulting in groups which often overlap. Combining the results of these different groupings to form fuzzy sets is described in section 3.4.4.

3.4.1 Choosing a value for K

One of the important parameters of the K-means algorithm is the choice of a value for K, the number of groups. If K is too small, there will be few groups which have many members. If K is too large, there may be empty clusters, or many small groups which should be grouped together. If some of the K clusters are empty, we can simply accept fewer than K groups. This means that the number of final groups, as well as their final membership, will depend on the initial random seed. Since we know the number of categories, c, in the training set, we allow for two groups of words (i.e., 2 "concepts") for each document category, plus 2 groups for a reject category. Therefore we choose K = 2(c + 1). Experiments have shown that greater values of K tend to generate many empty clusters.

3.4.2 Distance Measures

The choice of a distance measure between patterns is also an important parameter in the clustering algorithm. We have used three different distance measures in our experiments: Euclidean distance, cosine distance, and Jaccard distance.

Euclidean distance

For two d dimensional vectors $X = (x_1, x_2, \ldots, x_d)$ and $Y = (y_1, y_2, \ldots, y_d)$, Euclidean distance is defined as

\[ \left[ \sum_{i=1}^{d} (x_i - y_i)^2 \right]^{1/2} \]

and is the typical straight line distance between the end points of the vectors X and Y in Euclidean space.

Cosine distance

If we again consider the points X and Y as vectors from the origin, instead of measuring the distance between the endpoints, we can measure the cosine of the angle between the vectors. Since in our case both X and Y will always have positive values, we are only dealing with the first quadrant of the space.
The cosine distance is defined as

\[ 1 - \frac{\sum_{i=1}^{d} x_i y_i}{\left( \sum_{i=1}^{d} x_i^2 \right)^{1/2} \left( \sum_{i=1}^{d} y_i^2 \right)^{1/2}} \]

and is commonly used in information retrieval. The cosine distance has appeal as a distance measure between documents where word occurrences are used as features, because points along the same vector will have the same distance to another vector, regardless of the vector lengths. This means that the proportion of words with respect to other words (the other dimensions in the space) is constant along a given vector, regardless of whether the number of words is more or less. This gives an automatic normalization for document length. Consider measuring document distance in a space with only two dimensions (i.e., two words), a and b. Suppose document d1 contains 2 a's and 2 b's, document d2 contains 2 a's and 4 b's, document d3 contains 50 a's and 50 b's, and document d4 contains 25 a's and 50 b's, as shown in Figure 3.1.

Figure 3.1: Four documents in 2 dimensions (documents d1 through d4 plotted with the number of a's on the horizontal axis and the number of b's on the vertical axis).

Using Euclidean distance, points d1 and d2 are much closer to each other than d1 and d3, but there is 0 cosine distance between d1 and d3. The cosine distance from d1 to d2 is identical to the cosine distance between d1 and d4. However, when measuring distance between words rather than documents, it does not make sense to consider a word w1 that occurs once in document d1 and once in document d2 to be identical to a word w2 that occurs 100 times in document d1 and 100 times in document d2 simply because their ratios are the same. We also see empirically that the cosine distance does not make sense between words, since our K-means clustering algorithm always failed to converge within 100 iterations when clustering words using the cosine distance.

Jaccard distance

The third distance measure we used was Jaccard's coefficient [66], using the binary values 1 for occurrence of a word in a document and 0 for non-occurrence. Jaccard's coefficient between two documents is defined as

\[ \frac{m_1}{d - m_0} \]

where $m_1$ is the number of distinct words that appear in both documents, d is the total number of unique words in the whole collection (the number of dimensions), and $m_0$ is the number of words that appear in neither document. This measure considers only presence or absence of a word, and ignores the frequency of occurrence of a word in a document. When using Jaccard's coefficient as a measure between words, $m_1$ represents the number of documents in which both words occur, d is the number of documents, and $m_0$ is the number of documents in which neither word appears. Since the Jaccard coefficient ignores some potentially important information, namely the frequency of usage of a word in a document, we expect that the Jaccard measure will not perform as well as Euclidean distance. Again, empirical evidence confirms this, where we see that words fail to converge into clusters within 100 iterations when using the Jaccard coefficient.
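The three measures can be sketched as follows (my own illustrative code, operating on the word usage vectors that form the columns of the pattern matrix; the vectors shown are hypothetical). The example reproduces the point above: two words with the same usage ratios have cosine distance 0 even though their Euclidean distance is large.

```python
import numpy as np

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.sqrt(np.sum((x - y) ** 2)))

def cosine_distance(x: np.ndarray, y: np.ndarray) -> float:
    # 1 - cos(angle): 0 when the vectors point in the same direction
    return 1.0 - float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_coefficient(x: np.ndarray, y: np.ndarray) -> float:
    # presence/absence only: m1 co-occurrences over the union of occurrences
    bx, by = x > 0, y > 0
    union = int(np.sum(bx | by))
    return float(np.sum(bx & by)) / union if union else 0.0

# Usage of two words across 4 documents.
w1 = np.array([1, 1, 0, 0])
w2 = np.array([100, 100, 0, 0])
print(euclidean(w1, w2), cosine_distance(w1, w2), jaccard_coefficient(w1, w2))
```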
3.4.3 Normalization for document length

As we mentioned, one attractive feature of the cosine measure (though only useful when measuring distance between documents) is that it automatically normalizes for a document's length, on the assumption that longer documents should not contribute more than shorter documents. When clustering words, we also wondered whether we should normalize for document length, by dividing each word count in a document by the total number of words in the document (giving a frequency between 0 and 1 as the value for a given word, rather than an integer number of occurrences). In other words, in our original pattern matrix, M_{ij} = k where word j appears k times in document i, and we can normalize M by document length by defining M', where

M'_{ij} = \frac{M_{ij}}{\sum_{l=1}^{d} M_{il}}

Under the original scheme, longer documents weigh more heavily in the clustering than shorter ones. The K-means clustering was performed on the NEM-Online data which had been row normalized. The resulting clusters tended to be more polarized, with small clusters smaller and large clusters larger, and the semantic content was less striking than when using raw word occurrences. Normalization seems like a good idea, but upon reflection we realize that it is beneficial only if the documents are fairly homogeneous in their style. We know that online documents vary widely, from full text to a simple list of links. We conjecture that we need some length to the documents in order to extract context which will lead to grouping of words from the same concept. Long documents are more likely to reflect "typical" usage patterns than simple lists or abbreviated documents such as abstracts. Long documents provide more information, so it is reasonable that they contribute more heavily to the clustering.

3.4.4 Combining runs into fuzzy sets

A mathematical set X with members x_i in a universe U can be defined as a function f : U -> {0, 1} where f(u) = 1 if and only if u \in X and f(u) = 0 if and only if u \notin X. In other words, set membership can be thought of as a function from all objects in the universe to the set {0, 1}, where a 1 indicates membership and a 0 indicates non-membership. Fuzzy set theory was first proposed by Zadeh [68] as an alternative to sets with absolute memberships. Now, instead of mapping the function from U to the discrete set {0, 1}, we map U to the interval [0, 1]. Members of U can have varying degrees of membership, and f is called the membership function. If an object has a membership value close to 1, it is a strong member of the set, while a value close to 0 indicates a weak member. Note that this is not the same as the probability of a member being in a set (though both have values in [0, 1]); if an object x_i has probability .5 of being in a set A and a probability of .5 of being in a set B, it still ultimately has a membership value of 1 in either A or B, but not both. In fuzzy set theory, it is possible for an object to simultaneously belong to multiple sets, and all membership values for all sets need not sum to 1. Fuzzy set theory has been used in pattern recognition [69] as a means to arrive at clusters which have a fuzzy membership. Rather than forming fuzzy clusters at the outset, we find partitional clusters, then assign a fuzzy membership function to their objects, as described below.

We ran the K-means word clustering algorithm many times with different initial random seeds to increase our confidence in the results. Examining different runs by hand reveals that there are consistencies, though differences do exist. We need to precisely quantify the cluster set membership, and combine the results from differently seeded runs. Philosophically, concepts are not composed of a fixed set of words, nor are they stochastic.
The underlying notion of a concept is fuzzy, so we use fuzzy set theory to define the final membership of a concept cluster [69]. To form "concept" clusters, we combine the results of several runs of the K-means clustering, in order to get a definitive assignment of words to clusters. Recall that we run the original K-means algorithm n times, with n different random seeds. Words that appear together in the same cluster all n times have strong evidence that they truly belong in the same cluster. Conversely, words that co-occur in the same cluster in only a few runs probably do not ultimately belong together. We can use the number of times words co-occur as a measure of their strength in relation to each other: co-occurring a large number of times shows that the words are bound together strongly, and are likely to be from the same "concept". Since words have different senses, some words may validly belong in more than one concept. For example, the word "circuit" may occur in half the runs along with "design", "CAD", and "electrical", and in the remaining runs may occur along with "court" and "appeals". The fuzzy membership function allows the word to occur in both concept clusters.

Our experiments have shown that there are usually K - 1 "smaller" clusters, and 1 very large cluster that contains more than 80% of the original feature words. This large cluster contains the very low frequency words which are not useful in discriminating between documents, so ignoring this cluster reduces the number of words without sacrificing discriminating power. To combine clusters from several K-means runs, we first find all unique words from the K - 1 smallest clusters across all runs. We want to form "master clusters" which combine the results of all runs, and which are actually fuzzy sets, with words appearing together many times across several runs having a high degree of membership, and words appearing few times together with the other words in the set having a low degree of membership. We have used two different methods to construct the fuzzy sets.

The first method is as follows. Assume we already have K - 1 master clusters, M_1, M_2, ..., M_{K-1}. Given a new cluster C, we can find the master cluster with the maximum intersection, that is, find the M_i such that |M_i \cap C| is maximized. Then M_i is updated by adding C to it, so M_i = M_i \cup C. However, we want to remember the intersection M_i \cap C, because the number of intersections will determine the fuzzy set membership. In practice, we can keep track of the number of intersections merely by not eliminating the duplicate members when taking the union. The fuzzy membership function for a word w belonging to master cluster M_i is defined as f_{M_i}(w) = dup/n, where dup is the number of duplicates of the word within that cluster (i.e., the number of times it intersected with another similar cluster), and n is the number of runs we are combining. Thus all elements have a membership value between 0 and 1. This approach is convenient and fast, but assumes the existence of initial master clusters. One way to obtain initial master clusters is to simply accept the first set of clusters generated as the "master" clusters, then find the maximal intersection of subsequent runs with these initial clusters. This is a cheap and easy way to form the fuzzy sets, but has the drawback of being dependent on the order in which the runs are presented. If the initial run has anomalies due to poorly chosen initial centers, it will affect the final cluster membership.
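A minimal sketch of this first, order-dependent combination method follows: the clusters of the first run are taken as the master clusters, each later cluster is merged into the master cluster with maximal intersection, and duplicates are kept so that membership can be computed as dup/n. The toy runs are hypothetical.

```python
from collections import Counter

def combine_runs(runs):
    """
    Order-dependent construction of fuzzy 'master clusters'.
    `runs` is a list of runs; each run is a list of clusters (sets of words),
    already restricted to the K-1 smallest clusters of that run.
    Returns, for each master cluster, a dict word -> membership (dup / n).
    """
    n = len(runs)
    # The clusters of the first run become the initial master clusters.
    masters = [Counter(cluster) for cluster in runs[0]]
    for run in runs[1:]:
        for cluster in run:
            # Find the master cluster with maximal intersection ...
            best = max(masters, key=lambda m: len(set(m) & cluster))
            # ... and take the union, keeping duplicates to count intersections.
            best.update(cluster)
    return [{w: count / n for w, count in m.items()} for m in masters]

runs = [
    [{"corn", "wheat", "grain"}, {"bank", "rate"}],
    [{"corn", "wheat", "crop"},  {"bank", "rate", "money"}],
]
for fuzzy in combine_runs(runs):
    print(fuzzy)
```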
To avoid this problem, our second method constructs a full co-occurrence matrix R for the words in the K - 1 smallest clusters and forms the fuzzy clusters by finding the words with the maximum co-occurrence. All pairs of words in a cluster in a given run are given a vote each time they appear together in a run. So for n runs, the maximum value in the co-occurrence matrix R_{ij} is n, meaning that word i and word j appeared together all n times. The minimum value of R_{ij} is 0, meaning that word i and word j never appeared together. Once the matrix R is constructed, the algorithm for forming fuzzy clusters is as follows:

1. Level <- n.
2. For all words i:
   (a) Find all words j such that R_{ij} >= Level.
   (b) If i and j are not already in the same cluster, create a new cluster that contains words i and j.
   (c) Compute the set membership function strength of i and j for each new cluster they appear in.
3. Decrement Level by 1.
4. Go to Step 2.

The strength function of a word w in cluster C can be defined in a number of ways. One method is to use the average connectedness of word w to all other words currently in the cluster:

strength(w) = \frac{\sum_{i \in C} R_{wi}}{|C|}

When any new word is added to a cluster, the strength of every word in that cluster must be recomputed. Unfortunately, this algorithm is in the worst case O(j n^3) in time, and O(4n^2) in space, where j is the number of runs and n is the number of unique words in the K - 1 clusters across all runs. We could make the algorithm faster by using a simpler strength function (such as strength(w) = the Level at which it joined the cluster), but we can in no case do better than the O(j n^3) of Warshall's algorithm required to find the transitive closure of the matrix j times. Ten runs for n = 5000 require over 4 hours of clock time and 100 MB of space on an Enterprise Server 3000 with 250 MHz Sparc Ultra 2 processors and 512 MB of main memory. This is a powerful machine, not readily accessible to an average user. Recall that one of our requirements is fast turnaround and easy accessibility. The use of this algorithm may be too resource intensive for average users.

When we have the fuzzy clusters constructed by either method, we can identify the "core concepts" in each by choosing the words with the strongest membership. These core concepts can now be presented to the user, who can identify whether key terms actually should belong to the concept, and who can modify the degree of fuzzy membership if desired.

3.5 Summary

In this chapter we have described methods for finding concepts in text. We have concluded that classification accuracy on a small set of training examples is not a good indicator of features that convey semantic meaning. Word clustering using several different runs of the K-means algorithm and Euclidean distance provides small groups of words which lead to an approximation of semantic content. In the next chapter we will discuss in detail the results of this process on four separate data sets, then describe the uses of the concept clusters for automatic search and document organization.

Chapter 4

Experiments and Results

In this chapter we describe our experiments with concept extraction and the results obtained using four different data sets. First we describe the data sets used, then the experiments which obtained the fuzzy concept clusters. We describe the convergence behavior, the sizes of the resulting clusters, and the strengths of the resulting clusters.
We conclude the chapter with a discussion of the uses of the derived clusters in automatic search and document organization.

4.1 Sample Data Sets

Statistically Random

First, in order to provide a baseline for comparison and to ensure that we don't derive conceptual results from statistically random documents, we need a set of documents that are statistically the same as typical documents, but which are not conceptually related. We generated a set of random documents in the following way. We obtained the Reuters-21578, Distribution 1.0 test collection of documents.¹ This collection consists of 21,578 Reuters newswire articles from 1987 which all deal with economic topics. There are 45,534 distinct words in the entire collection. To generate our purely random pages, we computed the frequency distribution of all words in all 21,578 articles, then sampled words with replacement according to their observed probability estimate. The word with the highest probability was, not surprisingly, "the". The overall probability of a word w appearing in a document is defined as

P(w) = \frac{n}{N}

where n is the number of occurrences of w in the entire collection and N is the total number of words in the collection. We created 80 documents with 1000 words in each document. The number of unique words in these 80 generated documents was 9238, with 4881 of those words occurring only once, and 1394 occurring only twice. The majority of all words are low frequency ones. Note that word order in the generated documents is not important, since we do not use phrases or word pairs in our analysis.

¹Available from David Lewis at http://www.research.att.com/lewis

Seeded Random

Once we had the purely random documents, we "seeded" them with a set of chosen "topic" words in a controlled way, in order to make sure we could reliably retrieve these topic words using our methods. Each Reuters document contains zero or more category labels which were assigned manually by experts. A given article may contain many labels, and many of the categories could have overlapping concepts, or be conceptually hierarchical. For example, "corn", "wheat", and "grain" are all distinct categories in the Reuters data, but many articles contain concepts from all three. Due to human decisions about labeling, it is possible for an article about wheat to be labeled "wheat" but not "grain", even though "wheat" can be considered a subcategory of "grain". Though the Reuters documents have labels manually assigned by experts, there are many errors and inconsistencies in the labeling of articles. For example, one article about soybeans (without mention of meal or oil) may carry all of the labels "grain", "soy-meal", "soy-oil", and "soybean", while another similar article may have only the single label "soybean". Many articles from the "gas" category do not also have the label "fuel", and some articles are clearly mislabeled (one example is an article about the price of gold carrying the label "gas"). The Reuters data is also full of many abbreviations standard to Reuters, but not to typical English text, such as "dlrs" for dollars, "mln" for million, "pct" for percent, and "bpd" for barrels per day. Additionally, there are many proper names and misspellings, and special routing codes for the news wire. All of these factors contribute to the phenomenon of a higher number of distinct words and a different word distribution than one would typically expect from, say, English literature.
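The generation of the statistically random documents can be sketched as follows; the miniature corpus stands in for the Reuters token stream, and the document length is shortened for illustration.

```python
import random
from collections import Counter

def build_distribution(corpus_tokens):
    """Overall probability P(w) = n / N over the whole collection."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    words = list(counts)
    probs = [counts[w] / total for w in words]
    return words, probs

def random_document(words, probs, length=1000, rng=None):
    """Sample `length` words with replacement according to P(w); order is irrelevant."""
    rng = rng or random.Random()
    return rng.choices(words, weights=probs, k=length)

# Hypothetical stand-in for the Reuters token stream:
corpus = "the price of oil rose the bank said the crop failed".split()
words, probs = build_distribution(corpus)
docs = [random_document(words, probs, length=20, rng=random.Random(i)) for i in range(3)]
print(docs[0])
```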
The four Reuters categories acquisitions, gas, interest, and grain were chosen, and the distribution of words in each category alone was computed to get the probability that a given word appears in an article of that category. The conditional probability of a word w appearing in a certain class c is

P(w \mid c) = \frac{n_c}{N_c}

where n_c is the number of occurrences of the word w in category c, and N_c is the total number of words in category c. When P(w | c) >> P(w), the word should be an important feature of class c. We therefore took the words with the largest values of P(w | c) - P(w) as "seed" words to insert into the purely random documents. For example, P(oil) = .00158, and P(oil | crude) = .0177. The difference indicates that oil is probably an important word for the category crude. The words with the greatest difference between overall probability and class conditional probability form our seed word sets for each category. It is interesting to note that the set of words with the greatest differences between overall and class conditional probability often included words such as "the", "from", and "a", which also have a very high overall probability. In the same way that we removed stop words from documents prior to word clustering, we removed stop words (from the same list) from the potential seed word sets. The top 10 "seed" words from each category are listed in Table 4.1. Note that even these small sets of words overlap categories; "pct" appears as a top class conditional word in both the categories "gas" and "interest".

    acquisitions    gas         interest    grain
    lt              pct         rates       tonnes
    shares          gasoline    rate        wheat
    company         mln         pct         grain
    include         oil         bank        corn
    offer           prices      market      agriculture
    corp            bpd         money       usda
    stake           barrels     stg         nil
    share           crude       fed         crop
    dlrs            demand      cut         department
    acquisition     octane      banks       export

Table 4.1: Seed words from four Reuters categories

There are many parameters we can control to construct seeded random documents, including document size, number of documents in a target category, seed word set size, percentage of the random document that is replaced with seed words, and the percentage of the seed word set included in a given document. To construct our random seeded documents, we constructed three sets of 80 documents of 1000 words each, in the four target categories mentioned above (20 documents per category). These values roughly correspond to our "real life" data sets, and so represent a realistically sized set from which to derive concept clusters. In set one, 10 words out of the 1000 (1%) were chosen uniformly from a size-10 seed word set. The remaining 990 words were chosen according to the overall collection distribution of words, giving a total of 8881 unique words. The second random seeded data set had 5% seed words (again, chosen uniformly from a seed set of size 10) with all other parameters the same, resulting in 8557 unique words among the 80,000. The third data set was 10% seed words (100 words out of 1000 chosen uniformly from a size-10 seed set), with 8346 unique words. We wanted to ensure that we obtain the seed words as concept clusters from each of the four categories.
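A small sketch of the seed-word selection just described, ranking words by the difference P(w|c) - P(w); the miniature category and background collections are hypothetical, and stop-word removal is omitted for brevity.

```python
from collections import Counter

def seed_words(category_tokens, all_tokens, top_k=10):
    """Rank words by P(w|c) - P(w); the largest differences become seed words."""
    overall = Counter(all_tokens)
    in_class = Counter(category_tokens)
    N, Nc = sum(overall.values()), sum(in_class.values())
    diff = {w: in_class[w] / Nc - overall[w] / N for w in in_class}
    return [w for w, _ in sorted(diff.items(), key=lambda kv: -kv[1])[:top_k]]

# Hypothetical miniature collection: `grain_docs` plays the role of one category.
grain_docs = "wheat crop tonnes wheat export grain usda corn".split()
other_docs = "bank rate pct oil crude barrels money market".split()
print(seed_words(grain_docs, grain_docs + other_docs, top_k=5))
```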
Reuters articles

We next chose a subset of actual Reuters articles, chosen from the same four specific categories. Again, we wanted to use a small subset of example articles, to realistically simulate real-life situations where a user in general does not have access to an entire document collection, and therefore cannot compute the global probabilities or the class conditional probabilities. We therefore sampled articles from the chosen categories at random without regard to any probabilities, or any requirements of exclusiveness of the category label. In other words, we were not choosing the "best" representatives of a category, which might be the case when an article has only that label. An article had an equal chance for selection as long as it contained the correct category label (from perhaps many labels). For each of the four categories acquisitions, gas, interest, and grain, we randomly selected 20 documents for a total of 80 documents. The Reuters data is summarized in Table 4.2.

    Category          Number of    Distinct Non-       Document Length
                      Documents    Stemmed Words    Min    Max    Avg    Median
    acquisitions          20           1087          30     369    134      82
    gas                   20            911          27     355    107      96
    interest rates        20           1087          27     388    134      92
    grain                 20            760          33     234     87      78
    total                 80           2665          27     388    115     110

Table 4.2: Sample Reuters data set categories

NEM-Online

The final data set we examined comes from the manufacturing domain, and consists of actual web documents from the Network for Excellence in Manufacturing (NEM Online)², which contains a set of documents manually selected by human browsers to be of interest to a particular user community. We again chose four categories, labor, legal, government, and design, and collected 85 documents which we were able to retrieve from the Web. The NEM-Online data is summarized in Table 4.3. This data set is the most realistic, with varying numbers of documents per category, and varying lengths of the documents.

²http://www.nemonline.org

    Category      Number of    Distinct Non-       Document Length
                  Documents    Stemmed Words    Min    Max     Avg    Median
    design           23            2155          37     990     253     127
    government       15            2852          16    5450     633      80
    labor            13            2444          33    2557     552     342
    legal            34            3921          34    3137     403     123
    total            85            7633          16    5450     456     127

Table 4.3: Sample NEM Online data set categories

In the tables, document length refers to the total number of words in the document (excluding stop words). Notice that the Reuters articles are much shorter than the NEM Online documents. This makes concept extraction from the Reuters data more difficult, since there is not as much context in a given article to distinguish the characterizing concepts. The category labels were assigned by human experts. Note that the categories may contain underlying concepts which overlap. For example, a document discussing "affirmative action" may be reasonably classified into the government, labor, or legal categories, according to the person who is doing the category assignment, and within the context of their other collected documents. The NEM Online category labels are distinct, but recall that the Reuters documents may (and usually do) have multiple category labels. The overlapping of concepts into multiple categories is one thing that makes text classification so difficult: even human categorizers do not have a well-defined, unequivocal method of assigning a definitive label to a certain document. All sets of training data are summarized in Table 4.4.
    Data Set    categories    docs in each    total distinct    conceptually
                              category        words             related words
    StR              1             80             9238              none
    SeR 1%           4             20             8881               10
    SeR 5%           4             20             8557               10
    SeR 10%          4             20             8346               10
    RA               4             20             2665             unknown
    NO               4        see Table 4.3       7633             unknown

Table 4.4: Summary of the training data sets: Statistically Random (StR), Seeded Random (SeR) at 1%, 5%, 10% seed words, Reuters articles (RA), and NEM-Online (NO).

4.2 Word Clustering Results

In this section we discuss the results of clustering the words of the four different data sets, covering convergence, cluster size, and word strength in the fuzzy clusters of each data set.

4.2.1 Statistically Random Data Set

Random data with no inherent clusters are manifested by patterns whose features are all similar. This is the case with the statistically random data. All documents look alike across the feature set, so we do not expect to get semantically related words as clusters. The K-means algorithm fails to converge after 100 iterations, indicating that it is having difficulty finding stable clusters. When forced to stop after 100 iterations, we assume that the words within the clusters have little chance of being semantically related to each other. They represent those words whose overall probability of getting into a document is similar. Interestingly, the sizes of the clusters follow the same pattern as the non-random data do; we get one very large cluster and K - 1 smaller ones. This is a result of the tendency of English text to follow a distribution according to Zipf's law [70], which states that the frequency of word use times its rank order in a given text is approximately constant [15]. The sizes of the K - 1 smallest clusters of all data sets are shown in Figure 4.1.

Figure 4.1: Sizes of the K - 1 smallest clusters in all data sets (number of words vs. cluster number, for the 1% seeded, 5% seeded, 10% seeded, Reuters, and NEM-Online data sets)

With K = 10, the sizes of the clusters from the random data are roughly the same over all 10 runs, even though the algorithm did not converge. Furthermore, the 3 smallest clusters have sizes 9, 10, and 30 (or 31) in all runs. If we examine the contents of those smallest clusters, we see the same words appearing. The smallest cluster contains be, dlrs, lt, march, mln, pct, reute, reuter, vs. The next smallest contains april, bank, billion, company, corp, cts, inc, mar, net, usa. If we examine the overall probabilities for these words in the Reuters collection, we see that the rank order of the probabilities for the first cluster is 15, 11, 12, 16, 14, 19, 30, 20, 26. The rank order for the second cluster is 46, 47, 36, 43, 41, 49, 45, 35, 50, 33. We are obtaining as clusters those words which have a similar overall probability, which is what we would expect. The words with the highest overall probability (ranks 1-10) are all stop words: the, to, of, f, in, and, a, said, 3, for, and thus were eliminated before clustering. So, for the purely random data, we obtain clusters that contain words with similar overall probabilities, but which do not suggest any semantic content.

4.2.2 Random Seeded Data Set

Since the purely random data had no class information, and its documents were created based on the overall probability distribution of words, the clusters obtained were based on those overall probabilities.
For the random seeded data, since we have specific words in four categories, with those words being the ones with the highest differences in class conditional probabilities, obtaining those seed words in our clusters will reflect the fact that the documents are differentiated by those words with the highest class conditional probability differences (not merely the highest class conditional probability). Using Euclidean distance and K = 10, the clusters converged in an average of 35.1 iterations for the 1% seed words, 44.5 iterations for the 5% seed words, and 45.4 iterations for the 10% seed words. In contrast to the purely random data, the K-means algorithm is able to find stable clusters. As seen in Figure 4.1, the distribution of the sizes of the K clusters again follows the same pattern as the random data; that is, more than 80% of the words fall into a single cluster, leaving less than 20% in the remaining K - 1 clusters. The words in the large cluster correspond to the great number of words that have very low frequencies in the document set. Recall that our categories are gas, grain, acquisitions, and interest. We know that the words with the largest differences between class conditional probability and overall probability for all grain articles in the Reuters collection are tonnes, grain, agriculture, nil, department, wheat, corn, usda, crop, export, which are the words seeded into the random seeded document sets at 1%, 5%, and 10% of the words in each document. We want to ensure that we detect those words through our clustering process. We do obtain those words in the K - 1 smallest clusters, but we also obtain clusters containing words with a high overall probability in the document collection.

Clusters from 1% Seeded Random

At the 1% level, for example, we get a cluster containing the grain seed words agriculture, corn, crop, usda with a strength of 1.0, and the additional grain seed words department, export, grain, nil, wheat appear in the same cluster at a strength of .4. However, the cluster containing the grain seed words also contains 359 other words which have strength .5 or higher, most of which are not seed words in any of the four categories. We also get a very strong small cluster consisting of the words be, dlrs, lt, mln, pct, reute, reuter.³ These are words with high values of P(w) (the rank order of their overall probabilities is 16, 15, 12, 11, 20, 14, 19) but which were not eliminated by the stop word list, because our stop word list is based on typical English, and not on Reuters data. Since we know something about the domain and where these articles were obtained, and in fact have access to the entire collection, we know that "reute" and "reuter" appear in nearly all documents. Similarly, the Reuters document database deals with economic topics, and uses the special abbreviation "dlrs" for dollars. Armed with this kind of domain information, we can easily add these high frequency words to the stop word list to prevent them from being clustered if we are restricting our analysis to Reuters articles. It is important to note that we do not want to blindly eliminate the most frequently occurring words in a collection. Since we use only a small set of examples, it may be that precisely those highest frequency words in our collection are what distinguishes them from others we have not yet seen.

³Recall that these words are special abbreviations peculiar to the Reuters collection.
It is only because we have access to the entire Reuters set and are restricting our analysis to that collection (for this set of experiments) that we can make decisions on the usefulness of the highest frequency words. Given a larger universe of discourse, the words "reute", "reuter", and "dlrs" would be those that would point us towards more Reuters economic news articles.

In the 1% data, using the order-dependent method of fuzzy set construction from 10 different K-means runs, if we examine the smallest K - 1 clusters, all seed words from all categories do appear, but they are scattered throughout the K - 1 clusters, and are not separated according to category. There are also many other non-seed words included with high strengths which reflect their high overall probability. Note that some words with very high overall probability are also seed words, such as "lt", "pct", "dlrs", "mln". We conclude that if only 1% of a document's content contains discriminating keywords, our method will have difficulty finding them. The words are there, but so are many other words which are not discriminating. We suppose that in a solid piece of text with a reasonable length, important concept words will constitute more than 1% of the total non-stop words, so this represents a worst case scenario.

Clusters from 5% Seeded Random

In the 5% seeded data, three clusters contained all seed words except "agriculture", which was not present in any of the K - 1 clusters. The fuzzy clusters containing the other seed words and their strengths are given in Table 4.5; words with strengths less than .4 are not shown. Cluster 1 represents the words with high overall probability, including some of the seed words. Cluster 2 contains all of the seed words from the interest category (except "pct", which is in Cluster 1), and all but one of the seed words from the grain category. Cluster 3 contains all of the seed words from the gas category except "mln" and "pct", which are in Cluster 1. It is interesting that the seed words from the "acquisitions" category appear in both Cluster 2 and Cluster 3, with strengths of .5. This means that half the time they clustered with the grain and interest words, and half the time with gas words.

    Cluster 1: 1.0 be, 1.0 company, 1.0 dlrs, 1.0 include, 1.0 lt, 1.0 march,
               1.0 mln, 1.0 pct, 1.0 reute, 1.0 reuter, 1.0 vs

    Cluster 2: 1.0 april, 1.0 bank, 1.0 banks, 1.0 billion, 1.0 cts, 1.0 cut,
               1.0 fed, 1.0 mar, 1.0 market, 1.0 money, 1.0 net, 1.0 rate,
               1.0 rates, 1.0 stg, 1.0 usa, .7 corn, .7 crop, .7 department,
               .7 export, .7 grain, .7 nil, .7 tonnes, .7 usda, .7 wheat,
               .5 apr, .5 loss, .5 rm, .5 share, .5 shares, .5 stock,
               .4 acquisition, .4 co, .4 offer, .4 stake, .4 trade

    Cluster 3: .9 barrels, .9 bpd, .9 crude, .9 demand, .9 gasoline, .9 octane,
               .9 oil, .9 prices, .6 sin, .6 trade, .5 acquisition, .5 apr,
               .5 corp, .5 loss, .5 offer, .5 rm, .5 share, .5 shares,
               .5 stake, .5 stock

Table 4.5: Three smallest, strongest clusters from the 5% random seeded document set

We do have another fairly large cluster (104 words) with strong membership (all but 9 words have strength 1.0 or .9), but which contains none of the seed words. If we restrict our focus to the smallest clusters, we find the seed words (along with some others). What we see at the 5% level is that the seed words are appearing in the smallest clusters with strong membership.

Clusters from 10% Seeded Random
In the 10% seeded data set, there are only 4 clusters that have any word at a strength greater than .6. Of these four "strong" clusters, two of them contain all the seed words for all four categories. These two clusters are shown in Table 4.6.

    Cluster 1 (gas):
        1.0 barrels, 1.0 bpd, 1.0 crude, 1.0 demand, 1.0 gasoline, 1.0 octane,
        1.0 oil, 1.0 prices, .8 mln, .6 pct

    Cluster 2:
        acquisitions:    1.0 acquisition, 1.0 company, 1.0 corp, 1.0 dlrs,
                         1.0 include, 1.0 lt, 1.0 offer, 1.0 share, 1.0 shares,
                         1.0 stake
        interest:        1.0 bank, 1.0 banks, 1.0 cut, 1.0 fed, 1.0 market,
                         1.0 money, 1.0 rate, 1.0 rates, 1.0 stg, .4 pct
        grain:           .8 agriculture, .8 corn, .8 crop, .8 department,
                         .8 export, .8 grain, .8 nil, .8 tonnes, .8 usda, .8 wheat
        non-seed words:  1.0 be, 1.0 reute, .8 march, .8 reuter, .8 vs,
                         .6 mar, .6 usa

Table 4.6: Two clusters containing seed words from the 10% random seeded document set

One cluster contained all of the gas seed words, and no other words. The strength of each word was 1.0 except for "mln", with a strength of .8, and "pct", with a strength of .6. This is a clear indication that we retrieved the seed words from the gas category. The other cluster contained all of the acquisitions seed words, all of the interest seed words, and all of the grain seed words. One of the other clusters containing words with strength greater than .6 contained high overall frequency words. The remaining strong cluster contained 115 words, none of which are seed words. At a 10% level, we can detect important category words by examining the smallest clusters with strong members, and by ignoring clusters which contain words with high overall probability. Of course, in real data situations, we would have no way of detecting which clusters are actual "seed words" (i.e., have high class conditional probabilities), and which are merely high overall probability words. But these high probability words which do not belong on the stop list are probably also important characterizing words for our example document set, even though they may not discriminate among our current categories.

4.2.3 Reuters Data Set

Next we examine the fuzzy clusters resulting from the actual Reuters data in four categories. Recall that we used the same four categories for the actual data as we did for the random seeded data, and we know the class conditional probabilities for the words in each of these four categories. Therefore, we expect to obtain clusters containing those same high class conditional probability words as when we "seeded" random documents with them. The K-means algorithm converged to stable clusters quickly, in an average of 13.7 iterations. The size distributions of the clusters follow the same consistent pattern, with one large cluster containing about 80% of the words. Again, using the order-dependent fuzzy cluster construction method, with K = 10, we obtained only 4 clusters whose strongest member was higher than .4. These clusters are shown in Table 4.7. The words of strength .6 or greater in these 4 clusters consisted almost entirely of words that are within the top 100 class conditional probabilities of one of the four target categories.
For example, cluster 1 contained 35 words with strength .6 or above; 19 of those 35 words appeared in the top 100 class conditional words for the gas category.

    Cluster 1: .8 crude, .8 gasoline, .7 companies, .7 dunham, .7 energy,
               .7 import, .7 imports, .7 industry, .7 unleaded
    Cluster 2: .9 demand, .9 production, .9 tonnes, .7 corn, .7 crop, .7 fall,
               .7 farmers, .7 growth, .7 soybean, .7 soybeans, .7 wheat
    Cluster 3: .8 budget, .8 co, .8 georgia, .8 power, .8 unit, .8 vogtle,
               .8 board, .7 third
    Cluster 4: .9 banks, .9 commercial, .9 debt, .8 official, .9 world,
               .7 brazil, .7 plan

Table 4.7: The four clusters from actual Reuters articles whose strongest member was greater than .4. Words with strength < .7 appear in the clusters, but are not shown here.

Cluster 2 clearly indicates words from the grain category. The third cluster had 10 words of strength .6 or above; 4 of them were in the top 100 of the acquisitions category and 1 was from interest. The final cluster had 9 words; banks and commercial are in the top 100 class conditional words from the interest category, and debt and official are from the top 100 of grain. These results lead us to conclude that we are successfully obtaining important keywords from the 4 categories, although the separation between categories is not always clearly delineated from the cluster information alone. We also get some extra words which do not show up in the list of the top class conditional words, especially at lower strength levels.

4.2.4 NEM-Online Data Set

The last data set is the HTML documents from NEM-Online. Here, we have no objective way to evaluate the cluster word membership, since we cannot compute any class conditional probabilities. We also cannot compute the overall probability for a given word; the overall probabilities for the Reuters data are not valid for this set of documents. Just as the Reuters data had words such as "reuter" and "dlrs" as high frequency words, HTML documents have words such as "homepage" and "Internet" as high frequency, non-discriminating words. We have evaluated the words in the clusters by having experts familiar with the NEM-Online domain examine them. We ran several runs of the K-means algorithm on the NEM-Online data described above. Cluster membership stabilized in an average of 26 iterations, and again there was 1 large cluster and K - 1 smaller ones for each run. With K = 10 and 10 different runs, we obtained 6 fuzzy clusters which had at least one member at strength .6 or above. These clusters are listed in Table 4.8. The two clusters not shown here had 178 and 165 members. We note that with the NEM-Online data, as in all other data sets, we get one very large cluster (the "useless" cluster), and the remaining clusters are smaller. The smallest of these remaining clusters suggest reasonable topics; however, others are still too large for a human to easily interpret. We conjecture that the words in these larger clusters may be useful in further suggesting content if we could subdivide them into smaller pieces.
    Cluster 1: .8 affirmative, .8 americans, .8 discrimination, .8 people,
               .8 women, .7 action, .6 employee, .6 fmla, .6 leave,
               .4 opportunity
    Cluster 2: 1.0 amendments, 1.0 bankruptcy, 1.0 code, 1.0 debtor,
               1.0 petition, 1.0 section
    Cluster 3: 1.0 cfr, 1.0 cosmetic, 1.0 cosmetics, 1.0 eye, 1.0 hair,
               1.0 product, 1.0 products, .9 color, .9 containing, .9 except,
               .9 fda, .9 skin
    Cluster 4: .7 cad, .7 cadmazing, .7 customer, .7 inc, .7 management,
               .7 project, .7 service, .7 services, .7 software, .6 company,
               .6 connx, .6 consultants, .6 consulting, .6 customers,
               .6 design, .6 electronic, .6 network, .6 process, .6 projects,
               .6 solutions

Table 4.8: The four clusters from NEM-Online documents.

We implemented a recursive variation of the partitional K-means algorithm which accepts some user-defined size threshold; if any cluster (other than the largest) exceeds that threshold, the words from that cluster are reclustered according to the original documents. Again, we have no way to objectively evaluate the worth of the words in these sub-clusters. Looking at the words subjectively, without the opinion of domain experts, it is not clear whether the sub-categorizations are useful. Work on the hierarchical subclustering is ongoing and a subject of future study.

4.3 Cluster Usefulness

Referring to our system overview in Figure 1.1, we have described the process and results of obtaining the concept clusters. We next discuss their uses in classification accuracy and in automatic querying and document organization.

4.3.1 Classification Accuracy

Now that we have some small clusters of words, are these groups useful in document categorization? We attempted to use the resulting feature clusters to recognize the original 4 (NEM) document categories. We used a simple scheme where each of the 10 clusters was considered a feature, and an occurrence of any term from a cluster was counted as an occurrence of that feature. We used the number of occurrences in each cluster as the feature values, and the 85 documents as patterns. So, for example, if the phrase "affirmative action" occurred three times in a given document, and the cluster containing both "affirm" and "action" was cluster 2, the second element of the 10-dimensional feature vector for that document would contain a value of 6 (provided no other terms from that cluster also occurred in the document). If no terms from a cluster appear in a document, the value for that cluster in the feature vector is zero. After constructing this new pattern matrix with the 10 features derived from the occurrences of terms from the 10 clusters, we performed K-nearest neighbor classification with leave-one-out testing, as in the feature selection cases. The classification accuracy was 42/85 errors, or 50% accuracy, with K = 1, and didn't change significantly for other values of K. Classification using the same scheme as described above, but with elimination of the largest cluster (therefore using only 9 features), yielded results which were only 55% accurate.
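A sketch of the feature construction and evaluation scheme just described: word counts are collapsed into per-cluster counts, and a leave-one-out nearest-neighbor test is run over the resulting vectors. The toy matrix, cluster assignment, and labels are hypothetical.

```python
import numpy as np

def cluster_features(M, word_to_cluster, n_clusters):
    """Collapse the document-by-word matrix M into document-by-cluster counts."""
    P = np.zeros((M.shape[0], n_clusters))
    for j, c in word_to_cluster.items():       # column j of M belongs to cluster c
        P[:, c] += M[:, j]
    return P

def leave_one_out_1nn(P, labels):
    """Leave-one-out nearest-neighbor accuracy with Euclidean distance."""
    correct = 0
    for i in range(len(P)):
        d = np.linalg.norm(P - P[i], axis=1)
        d[i] = np.inf                          # exclude the held-out document
        correct += labels[int(d.argmin())] == labels[i]
    return correct / len(P)

# Hypothetical toy data: 4 documents, 5 words, 2 clusters, 2 category labels.
M = np.array([[3, 1, 0, 0, 1],
              [2, 2, 0, 1, 0],
              [0, 0, 4, 2, 0],
              [0, 1, 3, 3, 0]])
word_to_cluster = {0: 0, 1: 0, 2: 1, 3: 1, 4: 0}
P = cluster_features(M, word_to_cluster, n_clusters=2)
print(leave_one_out_1nn(P, labels=["labor", "labor", "design", "design"]))
```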
Even though the features derived from the clusters as described above did not perform very accurate classification of the original categories, we must remember that classification accuracy is not our ultimate goal. We can use the word clusters as an initial approximation of a semantic concept. We can verify the semantic relationships between the cluster words by user evaluation, or by lookup in a semantic database. We can also allow the user to refine the concept by adding or deleting words, or by giving the word cluster a meaningful label (which is not necessarily the same as the user's original category labels). Our experience has shown the clusters are usually small enough that consideration of the individual words is not burdensome for a user. If a cluster is too large, the user asking the system to sub-cluster it may prove fruitful.

4.3.2 Search for Similar Documents

One of the advantages of having a semantically related group of words as a representation of a document is the ability to search for similar documents using these words. Though automatic query construction is a complicated issue, we can generate queries from word clusters using two simple strategies that show the efficacy of this approach. A query formed by taking the conjunction of cluster words (for example, crude AND gasoline AND companies AND ..., etc.) leads to a higher precision as more words are added than does a query consisting of single words alone. However, recall will decrease as words are added to the conjunction. A query formed by the disjunction of cluster words (for example, crude OR gasoline OR companies OR ..., etc.) results in a higher recall, but lower precision as more words are added to the query.

Remember from Table 4.7 that cluster 1 words suggest the category "gas" and cluster 2 words suggest the category "grain". We used these as target categories for purposes of determining document relevance. Using the 9 words of cluster 1 with strength .7 or greater, and the 11 words of cluster 2 with strength .7 or greater, we generated all possible conjunctive queries of all sizes for the target categories "gas" and "grain". We computed precision and recall for each query as defined earlier. For purposes of this experiment, "retrieved" documents were all those that satisfied the conjunction and "relevant" documents were those that carried the same label as the target category. Note that this definition of "relevant" may not be correct, due to the multiple, inconsistent, and erroneous labelings in the Reuters data. However, we used that definition of relevance to provide as objective a measure as possible. The maximum precision and recall values for a given size of query for the two concept clusters are shown in Figure 4.2 and Figure 4.3. We see that, as expected, for conjunctive queries the precision increases with additional words.

Figure 4.2: Maximum precision and recall for Reuters cluster 1 in the "gas" category (precision and recall vs. number of words in the conjunctive query)

We also simulated disjunctive queries, where we are interested in improving recall. We know that many disjunctive terms will decrease precision, since a single word from the query occurring in a document is enough to retrieve that document. We generated queries consisting of the disjunction of all words in Reuters cluster 1 and cluster 2 (Table 4.7), and measured the percentage of documents retrieved in each of our 4 target categories.
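The two query strategies and the precision/recall computation can be sketched as follows; the toy document collection and its relevance labels are hypothetical, but they illustrate the trade-off between conjunction and disjunction described above.

```python
def retrieved(docs, words, mode="and"):
    """Documents matching the conjunction ('and') or disjunction ('or') of the words."""
    test = all if mode == "and" else any
    return {d for d, text in docs.items() if test(w in text for w in words)}

def precision_recall(retrieved_ids, relevant_ids):
    if not retrieved_ids:
        return 0.0, 0.0
    hit = len(retrieved_ids & relevant_ids)
    return hit / len(retrieved_ids), hit / len(relevant_ids)

# Hypothetical toy collection; the labels stand in for Reuters category labels.
docs = {1: {"crude", "oil", "unleaded"}, 2: {"crude", "barrels"},
        3: {"wheat", "tonnes", "crude"}, 4: {"unleaded", "gasoline"}}
relevant = {1, 2, 4}                            # documents labelled 'gas'
conj = retrieved(docs, ["crude", "unleaded"], mode="and")
disj = retrieved(docs, ["crude", "unleaded"], mode="or")
print(precision_recall(conj, relevant))         # few documents, high precision
print(precision_recall(disj, relevant))         # more documents, high recall, lower precision
```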
Results are shown in Table 4.9.

    Category        Cluster 1               Cluster 2
                    Percentage retrieved    Percentage retrieved
    acquisitions        .16                     .07
    gas                 .75                     .39
    interest            .12                     .29
    grain               .25                     .86

Table 4.9: Occurrences of cluster words in all Reuters documents of the four categories

We see that the disjunction of cluster 1 words retrieved the highest percentage of "gas" documents, and the disjunction of cluster 2 words retrieved the highest percentage of "grain" documents from among the four targeted categories. Precision and recall of single word queries of the target categories, where the query consisted of the obvious choice of category label, are given in Table 4.10.

Figure 4.3: Maximum precision and recall for Reuters cluster 2 in the "grain" category (precision and recall vs. number of words in the conjunctive query)

    Category    Query Term     Precision    Recall
    acq         acquisition       .80          .22
    gas         gas               .03          .21
    interest    interest          .15          .37
    grain       grain             .31          .39

Table 4.10: Precision and recall of single word queries

For example, the single word query "gas" for the target category "gas" resulted in a precision of .025 and a recall of .21. The best precision single word query formed from the concept clusters in the category "gas" was the word "unleaded", with a precision of .52. For the query "grain" in the "grain" category, precision was .31 and recall was .39. A single word query from cluster 2 consisting of "corn" and another query of "wheat" both exceeded the recall of the "grain" query. So even without constructing multiple word queries, word clustering can obtain single word queries which perform better than the category labels themselves. If we then examine queries consisting of the conjunction of 2 words, we see that the query "tonnes AND wheat" achieved a precision of .30 and a recall of .30. The three word queries "tonnes AND farmers AND growth" and "corn AND farmers AND growth" achieved 1.0 precision. We know that most web queries consist of one or two words [24]; we see that by using more words per query we can improve precision with conjunction and recall with disjunction. Furthermore, by providing a set of words, we can achieve greater performance in both precision and recall by suggesting words a user may not have thought of (e.g., "unleaded"). More complicated combinations of operators for query generation may be able to improve both precision and recall simultaneously by using the words from the feature clusters and their strengths. This is a subject for further research.

4.3.3 Personalized Document Organization

Another application for our concept clusters of words is organizing a collection of documents (listed as "organization with topical proximity" in Figure 1.1). We have already seen that classification of documents into the original training set categories does not work well. Instead, we used an unsupervised clustering method to group the original documents using our derived concept clusters as features. Recall that in our original pattern matrix M, entry M_{ij} represented the number of times word j occurred in document i. We now form a new matrix P, with one row for each document and one column for each fuzzy cluster. We define an entry

P_{ij} = \sum_{k=1}^{n} w_k M_{ik}

where n is the number of words in fuzzy cluster j, k ranges over the words of cluster j, and w_k is the fuzzy membership value (i.e., the strength) of word k in cluster j. We then cluster the rows of P using K-means partitional clustering and Euclidean distance.
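A minimal sketch of the construction of P from the fuzzy clusters, under the definition above; the toy matrix and the cluster membership strengths are hypothetical.

```python
import numpy as np

def document_features(M, fuzzy_clusters):
    """
    Build the document-by-cluster matrix P described above:
    P[i, j] = sum over words k in fuzzy cluster j of strength_k * M[i, k].
    `fuzzy_clusters` maps cluster index -> {word column: membership strength}.
    """
    P = np.zeros((M.shape[0], len(fuzzy_clusters)))
    for j, members in fuzzy_clusters.items():
        for k, strength in members.items():
            P[:, j] += strength * M[:, k]
    return P

# Hypothetical toy data: 3 documents, 4 words, 2 fuzzy concept clusters.
M = np.array([[2, 1, 0, 0],
              [0, 0, 3, 1],
              [1, 0, 2, 2]])
fuzzy_clusters = {0: {0: 1.0, 1: 0.7}, 1: {2: 1.0, 3: 0.5}}
P = document_features(M, fuzzy_clusters)
print(P)   # the rows of P can now be clustered with K-means and Euclidean distance
```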
We now have many documents but only a few features. If we examine only the original labels of the documents to evaluate our clustering, we see that small, tight clusters have the same labeling, but there are also many documents that do not go into a small cluster. These are then combined into one cluster whose members have several category labels. However, if we look closely at the content of the clustered documents rather than just their labels, we see some close similarities between the documents. For example, for the NEM data, we see that a group of 3 hardware/software/electronics consulting companies always clusters together. The text of the Family Medical Leave Act (FMLA) occurs along with a document entitled "11 easy steps to avoid FMLA problems", "Employee relations: Resources on Affirmative Action", "Judy's affirmative action page", and "President's remarks on affirmative action". We also get a cluster containing the 3 documents "Poisonous substances in human and animal feed", "FDA import procedures", and "FDA recall policies". Another contains "Trusts and Estates bulletin", "GATT related changes in the U.S. Patent System", "Overview of the U.S. Patent", "Preparing the defective product case", and "Defective product litigation". While we did not get neat clusters of our original categories back, we did get clusters of documents that are more closely related to each other than the broad topics defined by the original labels.

The Reuters articles were less clear, although a similar argument can be made for closely related articles getting grouped together. For example, one cluster contained 4 documents, 2 each from the categories "acquisitions" and "interest". However, if we examine those 4 documents, one of the "acq" articles is about Brazil's bank debt. In another cluster, there were 7 "interest" documents and one "acquisitions". Here again, the "acquisitions" article was about Bank of America's German branch selling Bankhaus Centrale Credit, with words such as "bank" and "credit" occurring several times, lending credence to the idea that it should be together with articles relating to banking which carry the label "interest". That same cluster also revealed two identical articles under different identifiers, which clustered together. There were no small clusters among the Reuters set that contained the "gas" label. Upon inspection of our 20 randomly selected "gas" articles, it was discovered that some of them were mislabeled (articles about "the decline in gold production", "Vermont Financial Services approved cash dividend to shareholders", and "debt securities offered for sale"), and the remainder varied widely in their subject matter, though they could be reasonably considered to be in the category "gas". For example, there were articles about ethanol tax exemption, energy costs rising, the Ford Escort passing European emissions tests, the EPA proposing rules to shield water supplies from leaks of underground tanks, and a company proposing a plant to build an octane enhancer to replace lead. Though all can be considered to fall under the category "gas", they are not very closely related to each other; rather, they represent many subtopics under "gas". The 20 "gas" articles chosen (at random) were not closely related to each other; in spite of that, we did get a word cluster with words like crude, gasoline, energy, unleaded, diesel, octane, petroleum, refining, which suggest the concept of gas. This brings up an important point about document categorization.
The process of assigning documents to categories is difficult, even for humans, except in the most clear-cut and obvious cases. We should not expect that a computer will be able to categorize documents according to meaning which is not explicitly represented, when humans often have difficulty with the same task.

4.3.4 Conclusions

In this chapter we have described our experimental results in deriving concept clusters. We demonstrated that random documents (which nevertheless had the same distribution of words as the entire document collection) did not converge into stable clusters of semantically related words. We showed that the random documents which were seeded with category words yielded small, strong clusters containing those seed words. We also showed that for actual documents, the small strong clusters contained words which had a large difference between their class conditional probabilities and their overall probabilities. The categories were not well separated by the concept clusters; words from different categories often appeared together, but they were still present in the small clusters. We conclude that when we use our method on real documents, we obtain words which have a high difference between their class conditional probabilities and their overall probabilities, or words which have high overall probability within that set but were not removed by a stop list. We can use the concept clusters formed by our method as a basis for solving other problems, such as searching for documents on similar topics, and organizing document collections according to personal categorization schemes.

Chapter 5

Summary and Future Work

In this chapter we summarize the results of the research reported in this dissertation, and discuss a number of directions for future research.

5.1 Summary

Using only statistical techniques, we have developed a method of finding conceptually related groups of words which represent the underlying content of user-designated text examples. We use these groups of words as a representation of documents that is at an intermediate conceptual level between single word representations and global category representations.

We assume that a user has collected several documents from each of several categories of interest, so that the document collection contains both commonality between documents in the same category and diversity among the categories. This is a reasonable expectation of the user. We also assume that the number of example documents is smaller than the number of distinct words contained in all documents. If there are more documents than words, and if we attempt to use the entire document set to obtain word clusters, we are likely to suffer the problem of a sparse high dimensional space that was discussed in Chapter 3. In that case, we would need to break the document collection into subsets (subspaces of lower dimension) which meet the diversity and commonality requirements. The method described here is intended for personal use on a few example documents; therefore all processing should take place in a reasonably short time. Another assumption is that the documents in the same category share a common vocabulary which denotes the category. Users often categorize documents in this way (e.g., documents about vacations vs. documents about job openings), but also sometimes categorize documents according to meta concepts like humor or color schemes.
When documents in the same category do not share a common vocabulary which distinguishes them from other categories, our method will fail. The documents need not be labeled by category; it is enough that the categories are present implicitly.

We described the limitations of word-level and global representations of documents, and proposed an intermediate conceptual level of representation of text using our groups of semantically related words. This is the major contribution of this thesis. We described our initial attempts at finding good features from among all the words in documents, using classification accuracy as a performance measure. We concluded that when the number of examples is as small as those we are working with, classification accuracy is not a good measure of the worth of features. Traditional feature selection and extraction techniques obtained highly accurate classification, but features which do not convey the semantic content of the documents. Feature selection and extraction techniques use a word-level representation of a document, but classification accuracy is a global representation. Using one extreme of representation as a measure of the worth of the other extreme is not a good idea, especially when the goal is content representation, and not classification. This realization is a second contribution of this thesis.

We used word clustering to obtain groups of semantically related words which do convey content information. We reported on using Euclidean distance, cosine distance, and Jaccard's coefficient. Cosine distance is the one most favored by researchers in the information retrieval community. We concluded that while cosine distance makes sense as a measure of distance between documents, it does not make sense as a measure of distance between words. We discussed normalization for document length and concluded that it is not necessary, and may in fact be harmful, although this is an area for future study. Next we described a method for combining the clusters resulting from different runs of the clustering algorithm into fuzzy sets whose members have a strength function according to their proclivity to co-occur with other members of the group. We described two different algorithms for determining the fuzzy set membership function: one which depends on the order in which the results are presented but is fast, and one which is not order dependent but which is computationally resource intensive. In Chapter 4 we presented the results of experiments on 4 different data sets, and demonstrated that we can find words with high class conditional probabilities without many training samples, and without needing to explicitly compute those probabilities,
We used a fixed document size, a fixed seed word set size, and a fixed number of categories. There are many more studies that can be done to evaluate the effect of these parameters on the success in finding the appropriate words. Then we can determine the necessary characteristics of the example document set. One possibility is to use a larger seed word set, but seed each document with only a fraction of that set. We could see whether we can still detect the entire set, and if so, what portion of the seed word set needs to be included in each document in order to successfully find the seed words.

Varying the number of classes would allow us to analyze performance with respect to the commonality/diversity of the training set. The experiments reported here used 4 classes which were for the most part disjoint. It would be interesting to see whether smaller or larger numbers of categories affect the ability to detect the key words. It would also be interesting to see whether including a reject class (that is, a set of random documents with no seed words) would improve, hinder, or not affect the formation of concept clusters.

We could also study the effect document length has on the resulting clusters. We have already hypothesized, based on empirical evidence, that since we do not normalize for document length, we need documents of a certain size in order to extract the correct context and thus the correct concept clusters. We would like to see how short documents can be and still yield concept clusters.

5.2.2 Hierarchical Subclustering

We have implemented a hierarchical version of the K-means clustering algorithm which subclusters any cluster whose size exceeds a user-defined threshold. We have not fully studied the effects of the subclustering, and a next step would be to determine whether we can derive sub-concepts by clustering hierarchically, and if so, what characteristics of the training examples are necessary to achieve meaningful subclusters.

5.2.3 Other Methods for Determining Fuzzy Set Membership

As we mentioned, creating the fuzzy clusters by building the full co-occurrence matrix and traversing it several times according to different strength levels is computationally complex. It would be interesting to try different methods of determining the "strength" of a word in a cluster. Some possibilities for the strength function strength(w) in cluster C were mentioned in Section 3.4.4. The challenge is to develop an algorithm that is efficient enough for use by average users at any time.

5.2.4 Applications for Conceptual Representation

There are many directions to investigate regarding automatic search for new documents using a word cluster representation. We could run our own indexing spider which indexes documents according to the concept clusters, and then use it for retrieval with the automatic queries generated from the concept clusters. There is also great research potential in studying automatic query generation. One interesting idea is to use genetic programming to construct Boolean queries from cluster words. Not only could these queries lead to improved retrieval of documents, but they would also provide a further representation of the information contained therein.

5.2.5 Incorporating Knowledge in the Representation

Perhaps the most interesting direction for future research involves incorporating knowledge into the creation of the concept features, rather than relying only on statistics of single words.
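One concrete form this could take is to test whether the words in a concept cluster are linked in a lexical database. The sketch below is a minimal illustration of that idea; it assumes the NLTK Python interface to WordNet (a tool that was not used in this dissertation), and the example cluster shown is hypothetical.

from itertools import combinations
from nltk.corpus import wordnet as wn  # assumes NLTK with the WordNet corpus
                                       # installed, e.g. nltk.download('wordnet')

def related_pairs(cluster_words):
    # For each pair of words in a concept cluster, report whether WordNet
    # links them as synonyms (they share a synset) or by an is-a-kind-of
    # relation (a synset of the second word is a direct hypernym of a
    # synset of the first).
    relations = []
    for w1, w2 in combinations(cluster_words, 2):
        syns1, syns2 = wn.synsets(w1), wn.synsets(w2)
        if set(syns1) & set(syns2):
            relations.append((w1, w2, "islike"))
            continue
        hypernyms_of_w1 = {h for s in syns1 for h in s.hypernyms()}
        if hypernyms_of_w1 & set(syns2):
            relations.append((w1, w2, "isakindof"))
    return relations

# Hypothetical cluster from a set of travel-related example documents:
# print(related_pairs(["trip", "journey", "hotel", "lodging"]))

Checking only direct synonym and hypernym links keeps the pairwise test cheap, which matters once whole vocabularies, rather than small concept clusters, are being compared.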
We can use a semantic database such as WordNet [71] to determine relationships between the concept cluster words. We could also use word pairs for phrase detection, or natural language processing techniques to further determine word relationships. Techniques which were computationally prohibitive when considering all words in a document are no longer intractable when working with the concept clusters. Incorporating knowledge will allow further abstraction of information beyond the simple sets of individual words which we have dealt with in the research reported in this dissertation. Sets of words with relationships such as "follows" (phrase formation), "islike" (synonyms), "isakindof" (subclass), etc., offer a potentially more powerful representation of information.

Bibliography

[1] G. Salton, Automatic Information Organization and Retrieval. McGraw-Hill, 1968.
[2] M. Dewey, Abridged Dewey Decimal Classification and Relative Index. Lake Placid Club, NY: Forest Press, 10th ed., 1971.
[3] L. M. Chan, Library of Congress Subject Headings: Principles and Application. Englewood, CO: Libraries Unlimited, 1995.
[4] Yahoo. [Online] Available http://www.yahoo.com, July 31, 1998.
[5] Infoseek. [Online] Available http://www.infoseek.com, July 31, 1998.
[6] Lycos. [Online] Available http://www.lycos.com, July 31, 1998.
[7] AltaVista. [Online] Available http://altavista.digital.com, July 31, 1998.
[8] WebCrawler. [Online] Available http://www.webcrawler.com, July 31, 1998.
[9] Opentext (Livelink Pinstripe). [Online] Available http://pinstripe.opentext.com, July 31, 1998.
[10] M. Koster, "The web robots pages." [Online] Available http://info.webcrawler.com/mak/projects/robots/robots.html, July 31, 1998.
[11] W. B. Frakes and R. Baeza-Yates, eds., Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, New Jersey: Prentice Hall, 1992.
[12] J. D. Martin, "Clustering full text documents," in Proceedings of the IJCAI-95 Workshop on Data Engineering for Inductive Learning, 1995.
[13] M. Pazzani, L. Nguyen, and S. Mantik, "Learning from hotlists and coldlists: Towards a WWW information filtering and seeking agent," in Proc. AI Tools Conf., (Washington, DC), 1995.
[14] G. Salton and M. McGill, Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[15] C. J. van Rijsbergen, Information Retrieval. London: Butterworth & Co., 2nd ed., 1979.
[16] A. Singhal, C. Buckley, and M. Mitra, "Pivoted document length normalization," in Proceedings of the Nineteenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, (Zurich, Switzerland), pp. 21-29, August 1996.
[17] H. P. Luhn, "A statistical approach to mechanized encoding and searching of literary information," IBM Journal of Research and Development, vol. 1, no. 4, 1957.
[18] C. Fox, "Lexical analysis and stoplists," in Information Retrieval: Data Structures and Algorithms (W. B. Frakes and R. Baeza-Yates, eds.), pp. 102-130, Englewood Cliffs, New Jersey: Prentice Hall, 1992.
[19] W. B. Frakes, "Stemming algorithms," in Information Retrieval: Data Structures and Algorithms (W. B. Frakes and R. Baeza-Yates, eds.), pp. 131-160, Englewood Cliffs, New Jersey: Prentice Hall, 1992.
[20] A. Bookstein, S. T. Klein, and T. Raita, "Clumping properties of content-bearing words," Journal of the American Society for Information Science, vol. 49, pp. 102-114, February 1998.
[21] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[22] S. K. M. Wong, W. Ziarko, V. V. Raghavan, and P. C. N. Wong, "On modeling of information retrieval concepts in vector spaces," ACM Transactions on Database Systems, vol. 12, pp. 299-321, June 1987.
[23] G. Z. Liu, "Semantic vector space model: Implementation and evaluation," Journal of the American Society for Information Science, vol. 48, pp. 395-417, May 1997.
[24] W. B. Croft, R. Cook, and D. Wilder, "Providing government information on the internet: Experiences with THOMAS," in Digital Libraries Conference DL '95, (Austin, TX), pp. 19-24, 1995.
[25] D. Harman, "Relevance feedback and other query modification techniques," in Information Retrieval: Data Structures and Algorithms (W. B. Frakes and R. Baeza-Yates, eds.), pp. 241-263, Englewood Cliffs, New Jersey: Prentice Hall, 1992.
[26] K. S. Jones, Automatic Keyword Classification for Information Retrieval. London: Butterworth, 1971.
[27] K. S. Jones and E. O. Barber, "What makes an automatic keyword classification effective," Journal of the American Society for Information Science, vol. 22, no. 3, pp. 166-175, 1971.
[28] J. Minker, G. A. Wilson, and B. H. Zimmerman, "An evaluation of query expansion by the addition of clustered terms for a document retrieval system," Information Storage and Retrieval, vol. 8, no. 6, pp. 329-348, 1972.
[29] D. Harman, "Towards interactive query expansion," in Proceedings of the Eleventh Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, (Grenoble, France), 1988.
[30] R. Attar and A. S. Fraenkel, "Local feedback in full-text retrieval systems," Journal of the Association for Computing Machinery, vol. 24, no. 3, pp. 397-417, 1977.
[31] W. B. Croft and D. J. Harper, "Using probabilistic models of document retrieval without relevance information," Journal of Documentation, vol. 35, pp. 285-295, 1979.
[32] J. Xu and W. B. Croft, "Query expansion using local and global document analysis," in Proceedings of the Nineteenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, (Zurich, Switzerland), pp. 4-11, August 1996.
[33] C. J. Crouch and B. Yang, "Experiments in automatic statistical thesaurus construction," in Proceedings of the Fifteenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, (Denmark), pp. 77-87, 1992.
[34] Live Topics help page. [Online] Available http://altavista.digital.com/av/lt/help.html.
[35] C. Faloutsos and D. Oard, "A survey of information retrieval and filtering methods," Tech. Rep. TR-CS-3514, University of Maryland, oard@glue.umd.edu, August 1995.
[36] Y. Y. Yao, "Measuring retrieval effectiveness based on user preference of documents," Journal of the American Society for Information Science, vol. 46, pp. 133-145, March 1995.
[37] M. Marchiori, "The quest for correct information on the web: Hyper search engines," in Sixth International World Wide Web Conference, (Santa Clara, CA), April 1997.
[38] M. Slot, "The matrix of internet catalogs and search engines," 1996. [Online] Available http://www.ambrosiasw.com/fprefect/matrix/.
[39] D. Sullivan, March 1998. [Online] Available http://searchenginewatch.internet.com.
[40] H. Lieberman, "Letizia: An agent that assists web browsing," in Proceedings of the International Joint Conference on Artificial Intelligence, (Montreal), 1995.
[41] M. Balabanović, "An adaptive web page recommendation service," in First International Conference on Autonomous Agents, (Marina del Rey, CA), February 1997.
[42] M. Balabanović and Y. Shoham, "Learning information retrieval agents: Experiments with automated web browsing," in Proc. 1995 AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, (Stanford), March 1995.
[43] R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell, "WebWatcher: A learning apprentice for the world wide web," in Proceedings of the 1995 AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, March 1995.
[44] S. Cost and S. Salzberg, "A weighted nearest neighbor algorithm for learning with symbolic features," Machine Learning, vol. 10, pp. 57-78, 1993.
[45] T. Honkela, V. Pulkki, and T. Kohonen, "Contextual relations of words in Grimm tales, analyzed by self-organizing map," in Proceedings of the International Conference on Artificial Neural Networks, ICANN-95 (F. Fogelman-Soulie and P. Gallinari, eds.), (Paris), pp. 3-7, 1995.
[46] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "Newsgroup exploration with WEBSOM method and browsing interface," Tech. Rep. A32, Helsinki University of Technology, Laboratory of Computer and Information Science, January 1996.
[47] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "Exploration of full-text databases with self-organizing maps," in Proceedings of the International Conference on Artificial Neural Networks, ICANN-96, (Washington, DC), 1996.
[48] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "Creating an order in digital libraries with self-organizing maps," in Proceedings WCNN '96, World Congress on Neural Networks, (Mahwah, NJ), 1996.
[49] R. Weiss, B. Vélez, M. A. Sheldon, C. Namprempre, P. Szilagyi, A. Duda, and D. K. Gifford, "HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering," in Hypertext '96: The Seventh ACM Conference on Hypertext, (Washington, DC), March 1996.
[50] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, "Syntactic clustering of the web," in Sixth International World Wide Web Conference, (Santa Clara, CA), April 1997.
[51] G. A. Miller, "WordNet: A lexical database for English," Communications of the ACM, pp. 39-41, November 1995.
[52] M. Sahami, S. Yusufali, and M. Q. W. Baldonado, "SONIA: A service for organizing networked information autonomously," in Digital Libraries 98: Proceedings of the Third ACM Conference on Digital Libraries, (New York, NY), pp. 200-209, 1998.
[53] P. Pirolli, P. Schank, M. Hearst, and C. Diehl, "Scatter/gather browsing communicates the topic structure of a very large text collection," in Human Factors in Computing Systems CHI '96, pp. 213-220, Association for Computing Machinery, 1996.
[54] H. Schutze and C. Silverstein, "Projections for efficient document clustering," in Proceedings of the Twentieth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, (Philadelphia, PA), pp. 74-81, 1997.
[55] C. Silverstein and J. O. Pedersen, "Almost-constant-time clustering of arbitrary corpus subsets," in Proceedings of the Twentieth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, (Philadelphia, PA), pp. 60-66, 1997.
[56] M. A. Hearst and J. O. Pedersen, "Reexamining the cluster hypothesis: Scatter/gather on retrieval results," in Proceedings of the Nineteenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, (Zurich, Switzerland), pp. 76-84, August 1996.
[57] Y. S. Maarek and I. Z. B. Shaul, "Automatically organizing bookmarks per contents," in Fifth International World Wide Web Conference, (Paris, France), May 1996.
[58] A. K. Jain and B. Chandrasekaran, "Dimensionality and sample size considerations in pattern recognition practice," in Handbook of Statistics, Vol. 2 (P. R. Krishnaiah and L. N. Kanal, eds.), pp. 835-855, Amsterdam: North-Holland Publishing Company, 1982.
[59] A. Whitney, "A direct method of nonparametric measurement selection," IEEE Transactions on Computers, vol. 20, pp. 1100-1103, 1971.
[60] F. J. Ferri, P. Pudil, M. Hatef, and J. Kittler, "Comparative study of techniques for large-scale feature selection," in Pattern Recognition in Practice IV: Multiple Paradigms, Comparative Studies and Hybrid Systems (E. S. Gelsema and L. S. Kanal, eds.), pp. 403-413, Amsterdam: Elsevier, 1994.
[61] W. F. Punch, E. D. Goodman, M. Pei, L. Chia-Shun, P. Hovland, and R. Enbody, "Further research on feature selection and classification using genetic algorithms," in International Conference on Genetic Algorithms 93, (Champaign, IL), 1993.
[62] W. Siedlecki and J. Sklansky, "A note on genetic algorithms for large scale feature selection," Pattern Recognition Letters, vol. 10, pp. 335-347, November 1989.
[63] J. H. Holland, Adaptation in Natural and Artificial Systems. Ann Arbor, Michigan: University of Michigan Press, 1975.
[64] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, Massachusetts: Addison-Wesley, 1989.
[65] D. L. Swets and J. J. Weng, "Using discriminant eigenfeatures for image retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, October 1996.
[66] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, New Jersey: Prentice Hall, 1988.
[67] J. A. Hartigan and M. A. Wong, "A k-means clustering algorithm," Applied Statistics, vol. 28, pp. 100-108, 1979.
[68] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, pp. 338-353, 1965.
[69] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press, 1981.
[70] G. K. Zipf, Human Behavior and the Principle of Least Effort. Cambridge, Massachusetts: Addison-Wesley, 1949.
[71] G. Miller, "WordNet: A lexical database for English," Communications of the ACM, pp. 39-41, November 1995.