This is to certify that the thesis entitled "Text Document Classification," presented by Yonghong Li, has been accepted towards fulfillment of the requirements for the M.S. degree in Computer Science at Michigan State University.

TEXT DOCUMENT CLASSIFICATION

By

Yonghong Li

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Computer Science

1998

ABSTRACT

TEXT DOCUMENT CLASSIFICATION

By Yonghong Li

The exponential growth of the Internet has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. Document retrieval, categorization, routing, and filtering can all be formulated as classification problems. However, the complexity of natural languages and the extremely high dimensionality of the feature space of documents have made this classification problem very difficult. We investigate five different methods for document classification: the naive Bayes classifier, nearest neighbor classifier, decision trees, the subspace method, and a multilayer feedforward neural network. These classifiers were applied to seven-class Yahoo news groups (Business, Entertainment, Health, International, Politics, Sports, and Technology) individually and in combination. The naive Bayes classifier and the subspace classifier have also been tested on the enlarged Yahoo news groups and the Reuters-21578 benchmark. We studied three classifier combination approaches: simple voting, dynamic classifier selection, and adaptive classifier combination. These classifiers were also evaluated on different document representation schemes obtained by best feature selection, principal component analysis, and discriminant analysis. Our experimental results indicate that the naive Bayes classifier, the subspace method, and the multilayer feedforward neural network classifier outperform the other two classifiers on our data sets. Orthonormal discriminant analysis gives better classification results than the other dimensionality reduction schemes. Combinations of multiple classifiers do not always improve the classification accuracy compared to the best individual classifier. Among the three combination approaches, our adaptive classifier combination method introduced here performs the best. The best classification accuracy that we are able to achieve on this seven-class problem is approximately 83-85%, which is comparable to the performance of other similar studies. However, the classification problem considered here is more difficult because the pattern classes used in our experiments have a large overlap of words in their corresponding documents.

To My Husband Lin Hong

ACKNOWLEDGMENTS

I would like to acknowledge all the people who have assisted me during the years of my graduate study at Michigan State University. I am most grateful to my advisor, Dr. Anil Jain, for both his professional and personal advice, his continuous guidance, encouragement, and support, and the valuable time that he spent on projects with me.
My colleagues at the MSU PRIP lab and my friends at MSU have provided help and moral support throughout my study here; I would like to give my thanks to all of them. A special thanks to Aditya Vailaya, for the valuable discussions and help he gave me, not only as a colleague, but as a friend as well. I would like to specially acknowledge Shaoyun Chen for his great help with my projects and thesis, and for his assistance in using the artificial neural network and discriminant analysis code. I would also like to thank Chitra Dorai for her valuable suggestions and her help in adjusting to the new environment. Many thanks to Scott Connell, Salil Prabhakar, Nicolae Duta, and Marilyn Wulfekuhler for proofreading my papers and thesis, and for helpful discussions. My acknowledgment also goes to Jinhui Liu, Xiaojie Dong, and Fan Du, for their help in manually classifying the data.

No words can express my thanks to my husband, for his love, understanding, and support through all the difficulties, as well as for sharing the happiness. I must give my immense thanks to my parents; their never-fading love, care, and encouragement are of immeasurable value to me. I am also grateful to Dr. John Weng and Dr. Sridhar Mahadevan, for serving as my committee members and for reviewing the thesis. I would also like to thank Cathy Davison, Mary Gebbia, Lora Mae Higbee, and Linda Moore for their administrative assistance.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Search Engines
    1.1.1 Categorical Engines
    1.1.2 Full-Text Engines
    1.1.3 Meta Engines
  1.2 Intelligent Agents
    1.2.1 Personal Agents
    1.2.2 Collaborative Agents
  1.3 Motivation
  1.4 Problem Definition
  1.5 Thesis Outline
2 Literature Review
  2.1 Classification Techniques
    2.1.1 TF-IDF
    2.1.2 Probabilistic Methods
    2.1.3 Nearest Neighbor Classifier
    2.1.4 Symbolic Learning Methods
    2.1.5 Artificial Neural Networks
    2.1.6 Genetic Algorithms
  2.2 Dimensionality Reduction
    2.2.1 Best Feature Selection
    2.2.2 Latent Semantic Indexing
3 Classification Algorithms
  3.1 Preprocessing
  3.2 Feature Representation
  3.3 Classification Algorithms
    3.3.1 Naive Bayes Classifier
    3.3.2 Nearest Neighbor Classifier
    3.3.3 Decision Tree Classifier
    3.3.4 Multilayer Feed-Forward Network Classifier
    3.3.5 Linear Subspace Method
  3.4 Combination of Different Classifiers
    3.4.1 Simple Voting
    3.4.2 Dynamic Classifier Selection (DCS)
    3.4.3 Adaptive Classifier Combination (ACC)
4 Dimensionality Reduction
  4.1 Best Feature Selection
  4.2 Principal Component Analysis
  4.3 Discriminant Analysis
  4.4 Term Grouping in Subspace
5 Experimental Results
  5.1 Evaluation of Text Classification Effectiveness
  5.2 Individual Classifiers
  5.3 Combination of Different Classifiers
  5.4 Dimensionality Reduction
6 Conclusions and Discussion
APPENDICES
A Examples of Training Samples of the Yahoo Datasets
B Laplace's Law of Succession

LIST OF TABLES

1.1 A brief summary of characteristics of the most commonly used search engines.
5.1 Yahoo news training and test data.
5.2 Enlarged Yahoo news training and test data.
5.3 Number of documents in each category of the Reuters-21578 TOPIC set.
5.4 Contingency table of binary decisions for a test set, from [42].
5.5 Comparison of the five classification algorithms (NB, NN, DT, NNet, and SS).
5.6 Confusion matrix of the classification results using the naive Bayes classifier (NB) without dimensionality reduction.
5.7 Confusion matrix of the classification results using the nearest neighbor classifier (NN) without dimensionality reduction.
5.8 Confusion matrix of the classification results using the C5 decision tree classifier (DT) without dimensionality reduction.
5.9 Confusion matrix of the classification results using the adaptive neural network classifier (NNet) without dimensionality reduction.
5.10 Confusion matrix of the classification results using the subspace classifier (SS) without dimensionality reduction.
5.11 Classification accuracy using NB on two large data sets.
5.12 Classification accuracy of combinations of multiple classifiers.
5.13 Comparison of the feature extraction methods using NN.
5.14 Confusion matrix of the classification result using the nearest neighbor classifier with PCA feature extraction.
5.15 Confusion matrix of the classification result using the nearest neighbor classifier with LDA feature extraction.
5.16 Confusion matrix of the classification result using the nearest neighbor classifier with LDA feature extraction.
5.17 Comparison of using the subspace method with and without the term-grouping feature reduction technique on test set 1.
6.1 Comparison of the four classification algorithms (NB, NN, DT, and SS) on test set 1 for the reduced 5-class problem.

LIST OF FIGURES

1.1 The structure of a search engine, adapted from [41].
1.2 The structure of a meta-search engine, from [41].
1.3 A typical personal agent.
1.4 A typical collaborative agent.
1.5 A block diagram of our text classification scheme.
3.1 The architecture of our adaptive feed-forward neural network.
5.1 The accuracy of each algorithm with different sized feature subsets.
A.1 An example of the business news group: (a) a training sample; (b) extracted word list.
A.2 An example of the entertainment news group: (a) a training sample; (b) extracted word list.
A.3 An example of the health news group: (a) a training sample; (b) extracted word list.
A.4 An example of the international news group: (a) a training sample; (b) extracted word list.
A.5 An example of the politics news group: (a) a training sample; (b) extracted word list.
A.6 An example of the sports news group: (a) a training sample; (b) extracted word list.
A.7 An example of the technology news group: (a) a training sample; (b) extracted word list.

Chapter 1

Introduction

The World Wide Web (WWW) is a wide-area hypermedia information repository that aims to give world-wide access to a large repository of documents through the Internet. The WWW has inherent properties that make it different from traditional digital libraries in the following ways: (i) Unorganized and non-hierarchical: the information on the Web is distributed using a client/server model. The collection of documents resides on servers, and viewers access it from clients. Documents are connected by hyperlinks. In fact, the WWW was not really designed for organized information retrieval. (ii) Dynamic: it is very easy for a user to post information, add new sites, and update existing documents, as well as remove old files from the Web.

The WWW has exhibited exponential growth over the last few years, since it was created in 1989 at the European Laboratory for Particle Physics (CERN) in Geneva, Switzerland. A pessimistic estimate is that, in the middle of 1996, the WWW consisted of more than 60 million documents on 12 million hosts and 600,000 servers, up from 9 million hosts and 250,000 servers at the beginning of the year [77], and it is growing every day.

The rapid growth of the Internet and the characteristics of the WWW have resulted in an information overload [41, 5, 46], which makes it increasingly difficult for users to locate relevant information quickly. This has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. In the following sections, we provide a brief introduction to the various search engines and intelligent agents that are available to users.

1.1 Search Engines

Search engines were designed to reduce the effort and information overload on the Web. Commercial search engines, such as Yahoo, HotBot, InfoSeek, WebCrawler, MetaCrawler, and Lycos, are examples of tools that construct indices of Web resources and find information requested by a user. Basically, there are three different kinds of search engines: categorical, full-text, and meta (see A Search Engine Primer, http://131.252.56.87/TandT/SENGINE.HTM). They differ in their basic search logic. All search engines perform their function a bit differently, so their search results can be quite different.

1.1.1 Categorical Engines

Categorical engines, like Yahoo (http://www.yahoo.com) and Magellan (http://www.mckinley.com), are more of a subject guide than a search engine. The guide is hierarchically organized into topic categories by human analysts, each category containing a list of manually indexed web pages. For example, Yahoo lists more than 200,000 Web sites under 20,000 different categories. The hierarchical organization provides a good structure for browsing.
Starting with no specific index terms, a user of this kind of search engine can quickly find information on various topics. A nice property of categorical engines is that they extract high-level semantics (topics) and construct links between topics (directories and sub-directories). However, manual examination of each document is expensive, time-consuming, and tedious. Thus, these engines are faced with a trade-off between the size and the quality of their directory. Moreover, the larger the number of analysts involved in examining the web documents, the larger the intra-class variability.

1.1.2 Full-Text Engines

All current search engines seem to follow a gathering-indexing-matching strategy. Figure 1.1, adapted from [41], shows the structure of a search engine. These engines collect Web resources by periodically dispatching programs, known as Web robots, crawlers, wanderers, spiders, worms, and ants (see The Web Robots Pages, http://info.webcrawler.com/mak/projects/robots/robots.html), to the Web. Indexing robots automatically traverse the Web's hypertext structure by starting from one document and recursively retrieving all documents that are referenced. Unlike categorical engines, these search engines do not organize pages according to topics; they scan and analyze the internal text of a page, and perform keyword-based searches against an indexed database. The indexing and searching schemes of these search engines are similar, but they may differ in their database size, update frequency, search strategy and capability, and ways of presenting the search results.

[Figure 1.1: The structure of a search engine, adapted from [41]. The user submits a query through a search interface; the engine matches it against an index built by a Web-crawling robot and returns results.]

Search engines mainly follow the so-called location/frequency rule to determine the relevancy between the query keywords and retrieved documents. It is based on the assumption that any page relevant to the topic will mention those keywords right from the beginning, such as in the heading or in the first few lines of the page. Keyword frequency is the other major factor in determining relevancy: pages with a higher occurrence of keywords are more likely to be relevant than other Web documents. Some search engines add other strategies to "boost" the relevancy. For example, WebCrawler uses link popularity as part of its ranking method. A search engine features chart (http://www.searchenginewatch.com/webmasters/features.html, as of Feb. 2, 1998) provides a summary of important features of the most commonly used search engines; we briefly summarize these properties in Table 1.1.

Table 1.1: A brief summary of characteristics of the most commonly used search engines. These results are from the features chart cited above.

Engine     | Size (millions of pages) | Update freq.     | Ranking (in addition to location/frequency) | Comments
-----------|--------------------------|------------------|---------------------------------------------|---------
AltaVista  | 100                      | 1 day to 1 month | -                                           | one of the largest and most comprehensive search engines
Excite     | 55                       | 3 weeks          | star review                                 | best for finding widely discussed, mainstream topics
HotBot     | 110                      | 2 weeks          | -                                           | easy to scale up
InfoSeek   | 30                       | 1 to 2 months    | keywords in meta tag                        | one of the best tools for comprehensive search relevancy and speed
Lycos      | 30                       | 1 to 2 weeks     | -                                           | also offers a subject-based directory of the most popular pages from its database
WebCrawler | 2                        | -                | keywords in title, link popularity          | good for beginners with keyword searching
Since each search engine has a different strategy for selecting pages to index, and none of the search engines has a completely indexed database, the coverage of the individual engines is relatively low, i.e., searching with a second engine would often return several documents that were not returned by the first engine. Also, because search engines may have down times, and because of network traffic congestion, an engine's response time is often dynamic. Meta engines are a new breed of search engine which combine the results of several engines for a given set of query keywords.

1.1.3 Meta Engines

The structure of a meta-search engine given in [41] is shown in Figure 1.2. This kind of engine provides a uniform query interface, accepts the query keywords entered by a user, and sends the query to many different search engines. It then collects the results from the various search engines, analyzes them, eliminates duplicates, re-ranks pages on the basis of relevancy, and summarizes the information. MetaCrawler [72] and Fusion [74] are examples of such search engines.

[Figure 1.2: The structure of a meta-search engine, from [41]. A meta-search interface forwards the user's query to n underlying search engines (Search Engine_1 through Search Engine_n) and merges their results.]

Human indexing and categorization of Web pages are time-consuming and tedious. It is hard to keep pace with the explosive growth of the Internet by analyzing each document manually. Moreover, keyword-based search is not always accurate enough; it is not uncommon that search engines, in response to a query, return some sites that have little to do with a user's interest. This has led to the development of intelligent agents, which are playing an important role in making the Internet more usable [53, 6, 54].

1.2 Intelligent Agents

Intelligent agents can sense their environments, act on behalf of their owners, interact with other agents, and take actions in order to achieve their goals. They are being used in many applications, such as:

1. Electronic Commerce. As an example, AuctionBot (http://auction.eecs.umich.edu) is a multi-purpose server used to create automated Internet auctions according to user specifications, or to bid in an existing AuctionBot auction. Kasbah [9] (http://ecommerce.media.mit.edu/Kasbah/) is MIT's buying and selling agent in an electronic marketplace; it is a multi-agent research project which aims at fundamentally transforming the way people make transactions. Jango (http://jango.com) is a tool for shopping on the web, billed by NetBot as the world's first intelligent shopping assistant.

2. Entertainment. As an example, eGenie (http://egenie.opensesame.com) is a premier source for personalized entertainment information on the Web. It lets users explore categories of entertainment that include music, movies, books, events, and TV. eGenie is the showcase of Learning Sesame, which can learn from a user's interests and dynamically generate Web pages customized to the user's taste. MORSE (http://www.labs.bt.com/innovate/multimed/morse) is a movie recommendation system that can suggest new movies based on a user's previous rating record.

3. Internet. As an example, Autonomy Intelligent Agents (http://ultra.agentware.com/dailyme) is a news filtering system that automatically compiles a personalized newspaper of interest to the user. A user can train an agent by first typing a short description; the agent can then enhance its knowledge of the user's interests by using the new features learned from the user's past experience.
LIRA [4], developed at Stanford University, stands for Learning Information Retrieval Agent. It can help users browse or "surf" the Web. The system observes how users rank various pages they visit. It then retrieves new pages on its own, and presents only those which it believes the user may want. WebWatcher [36], created at Carnegie Mellon University, is a "tour guide" agent for the WWW. Given a description of a user's current interests (keywords), WebWatcher accompanies the user from page to page as he/she browses the Web, and recommends the hyperlinks that it believes to be highly relevant to the user's interests. It learns the strategy for giving advice from the feedback it receives during earlier tours.

Agents can work independently, known as personal agents, or work jointly in an intellectual endeavor, known as collaborative agents [40].

1.2.1 Personal Agents

Personal agents are usually used to model users' long-term interests. They take a user's interests, such as keywords, a brief description, bookmarks, and feedback, as the user's profile, and make predictions about the user's behavior based on personal history. Figure 1.3 shows a typical personal agent.

[Figure 1.3: A typical personal agent. The agent maintains a user model/profile and a database, takes queries and feedback from the user, and returns search results.]

For example, WBI (http://www.networking.ibm.com/wbi/wbisoft.htm), IBM's Web Browser Intelligence agent, can personalize a web user's experience, and give a web server the ability to add or modify web content on the fly. WebLearner [59] maintains a user's preferences as a hotlist (for links that are interesting) and a coldlist (for links that are not interesting), analyzes the document from each link, and recommends new pages that might interest the user. Letizia [45] is another Web browsing assistant. It keeps track of the user's browsing process, explores links concurrently and automatically from the user's current position by using the knowledge inferred from the user's behavior, and attempts to suggest pages that are suited to the user's interests.

1.2.2 Collaborative Agents

Collaborative agents interact and cooperate with each other to perform tasks on behalf of a user. Figure 1.4 shows a typical collaborative agent. The design of collaborative agents involves problem solving, communication, and coordination strategies for agents to maintain autonomy. They must also be able to efficiently use the available network bandwidth to communicate with other agents. Agent collaborations may involve heterogeneous or homogeneous groups of agents, and agents with similar or differing goals, languages, and knowledge representation facilities. For example, WiseWire (http://www.wisewire.com) uses a combination of content filters, collaborative filters, and learning agents to deliver interesting documents to users. It encodes the key conceptual material from document contents, sifts out uninteresting pages, categorizes them, and recommends information rated highly by one user to other users with similar interests [38]. Ginkgo (http://www.networking.ibm.com/iag/iaginkgo.htm), developed at IBM, is an agent-based learning system. It uses personal learning as well as collaborative learning to model a user's preferences, and predicts a user's behavior based on personal history, past experience, and on similar
Yenta15 under development at MIT, is a matchmaking system that aims at provid- ing privacy-protected, distributed, automatic generation of clusters of users who are interested in Similar topics [20]. Personal Agent_1 Personal Personal Agent_n Agent_2 Personal Agent_3 Figure 1.4: A typical collaborative agent 1 .3 Motivation Many tasks in information retrieval can be formulated as a document classification problem [43]. These tasks include: 15http: / / foner.www.media.mit.edu / people / foner / Yenta 12 1. Document Retrieval: Document retrieval is to select a set of documents from an indexed database, usually in response to some user queries. User queries could range from a few keywords to multi-sentence descriptions of an information item that is needed. A vast majority of retrieval systems currently in use range from simple Boolean systems (e.g., search engines) to systems using statistical or nat- ural language processing (e.g., intelligent agents). A document retrieval system usually consists of four main phases: indexing, query formulation, comparison, and feedback [43]. 2. Document Filtering and Routing: Also known as selective dissemination of information, is an information seeking system to sort through large volumes of dynamically generated information and present those documents to the user which satisfy a relatively stable and specific information need. It is similar to document retrieval, but emphasizes the dynamic environment and specific long- term interests [55]. It may involve indexing, user profiles modeling, adaptation of user profiles, matching, and social filtering etc. SIFT [82], FireFlyl6 and WiseWire are some examples of available document filtering systems. 3. Document Categorization: It is the assignment of a document to pre-defined categories. A category label can be assigned to a document manually, such as in Yahoo. However, manual assignment of categories requires considerable human labor, time, and cost. An automatic document categorization system is desirable to reduce these costs substantially. 16http: //www.firefiy.com 13 4. Document Clustering: Classification is ambiguous in information retrieval. Peo- ple often refer to clustering as a classification task. Clustering is to discover the underlying structure (categories) of a collection of documents based on some Similarity measure. It is an unsupervised classification. Text classification can also be incorporated with push technology, such as Point- C’ast Network”. Push technology involves software that is loaded on the machines connected to the Internet. The software works in the background to compile Internet sources related to a user’s selected areas of interest, and automatically delivers news and other specified information from the Internet to the user’s desktop. Text classification presents many challenges and difficulties. First, it is difficult to capture high-level semantics and abstract concepts of natural languages just from a few individual words. For instance, there are many ways to represent Similar concepts (e.g., agent, softbot, robot, or bot), and the same word can represent different mean- ings (e.g., bank can either mean a financial institution or a river bank). Furthermore, semantic analysis, which is a major step in designing a natural language information retrieval system, is not well understood, although there are some techniques that have been successfully applied to limited domains [18]. 
Noise, high dimensionality (thousands of features), and variable length of content are some of the other undesirable characteristics of the huge number of documents on the Web. Due to these difficulties, there is a tradeoff between the efficiency and the accuracy of a classification system.

1.4 Problem Definition

We focus on supervised text classification in this thesis. A typical classification problem can be stated as follows: given a set of labeled examples belonging to two or more classes (training data), classify a test sample to the class with the highest similarity. Document retrieval, routing, and filtering systems can often be viewed as two-class classifiers which label a document as relevant or non-relevant [59, 82, 38]. User feedback provides a set of training examples with positive and negative labels. A document is presented to the user if it is classified as belonging to the relevant class. In document categorization, we already have human-indexed training data available, so a classifier is used to automatically assign a given previously unseen document to the appropriate class. Examples include deciding which newsgroup an article belongs to [79], which folder an email message should be directed to [13], or automatic coding of diagnoses in patient records [47].

In this thesis, we concentrate on document categorization, i.e., we deal with a multi-class classification problem. Figure 1.5 shows a block diagram of our text classification scheme.

[Figure 1.5: A block diagram of our text classification scheme. Training and test documents are preprocessed (stopword and low-frequency word removal), converted to feature representations, and passed to one or more classifiers that output class labels.]

We report experimental results using news data from the Yahoo web site, which are categorized into seven groups (business, entertainment, health, international, politics, sports, and technology), and the Reuters-21578 newswire benchmark (http://www.research.att.com/~lewis), which involves 135 topical categories. Each item of news is indexed manually by human experts (i.e., we have labeled training examples and the "ground truth" of test samples). The contributions of this thesis are as follows:

1. Our classification system addresses a multi-class classification problem, rather than the two-class problems typical of the classification systems [44, 58, 71] reported in the literature.

2. We apply the subspace method to text classification. It performs well on the Yahoo dataset, but poorly on the Reuters dataset.

3. We apply five different classification methods to our datasets: the naive Bayes classifier, nearest neighbor classifier, decision trees, the subspace method, and an artificial neural network method. These classifiers are also evaluated on different document representation schemes obtained by best feature selection, principal component analysis, linear discriminant analysis, and term grouping in subspace.

4. We also investigate whether performance can be improved by a combination of different classifiers. Our experimental results indicate that a combination of multiple classifiers does not always improve classification accuracy compared to the best individual classifier. The adaptive classifier combination method introduced in this thesis outperforms simple voting and dynamic classifier selection.

1.5 Thesis Outline

The outline of this thesis is as follows. Chapter 2 briefly reviews some relevant work on text document classification. Chapter 3 describes the five text classification methods we have used in our experiments, and investigates combinations of multiple classifiers.
Dimensionality reduction techniques are introduced in Chapter 4. We present our experimental results in Chapter 5. Finally, we conclude with discussions and future work in Chapter 6.

Chapter 2

Literature Review

In the past few decades, the availability of cheap and effective storage devices and information systems, and the rapid growth of the Internet, have created an immense information bottleneck. Most commercial information retrieval systems (e.g., search engines) still rely on conventional inverted index and Boolean querying techniques. An inverted index catalogs a collection of objects in their textual representation: given a set of documents, keywords and other attributes (possibly including relevance ranking) are assigned to each document, and the inverted index is the list of keywords with links to the corresponding documents, sorted on the key values to allow rapid searching for a particular key value. But these techniques often produce less than satisfactory results. Probabilistic models have been used to improve the performance of information retrieval, filtering, routing, and categorization. Since the late 1980s, knowledge-based techniques have been used extensively by information science researchers. These techniques have attempted to capture users' and information specialists' domain knowledge, classification scheme knowledge, effective search strategies, and query refinement heuristics in the design of information systems [11]. More recently, machine learning approaches, such as symbolic inductive learning methods, artificial neural networks, and genetic algorithms, have also been utilized [10].

In this chapter, we briefly review several popular techniques used in document retrieval, filtering, routing, and categorization, paying particular attention to classification methods. We also review some feature selection and extraction techniques that are commonly used in text document classification systems, because of the high dimensionality of the feature vector (usually on the order of several thousand) often encountered in this problem.

2.1 Classification Techniques

There are several ways to design a classifier, from typical techniques in information retrieval, such as TF-IDF, to pattern recognition and machine learning techniques, such as decision trees, the naive Bayes classifier, nearest neighbor, artificial neural networks, and genetic algorithms. They all have the same goal: given a set of training samples along with category labels, classify a test sample to the one or several most appropriate categories.

2.1.1 TF-IDF

TF-IDF is one of the most successful techniques in information retrieval. A document is represented as a vector of weighted terms. Terms that appear frequently in one document (TF = term frequency), but rarely occur outside the document (IDF = inverse document frequency), have larger weights. The weights can be normalized according to the length of the documents. In the Syskill & Webert [58] and NewsWeeder [38] systems, the TF-IDF vectors of all examples of one class are averaged to get a prototype vector for the class. A test document is assigned to the class that has the highest cosine similarity, i.e., the smallest angle between the TF-IDF vector of the test pattern and the class prototype vector.
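As a concrete illustration, the prototype scheme can be sketched in a few lines of Python. This is a minimal reading of the description above, not code from Syskill & Webert or NewsWeeder; documents are assumed to arrive as pre-tokenized lists of words, and all names are ours.

```python
import math
from collections import Counter, defaultdict

def tfidf(tokens, doc_freq, n_docs):
    """TF-IDF weights for one document: term frequency times log inverse doc frequency."""
    return {t: f * math.log(n_docs / doc_freq[t])
            for t, f in Counter(tokens).items() if t in doc_freq}

def cosine(u, v):
    """Cosine of the angle between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_prototypes(docs, labels):
    """Average the TF-IDF vectors of all training examples of each class."""
    n_docs = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))
    sums, counts = defaultdict(Counter), Counter(labels)
    for d, c in zip(docs, labels):
        sums[c].update(tfidf(d, doc_freq, n_docs))
    protos = {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}
    return protos, doc_freq, n_docs

def classify(tokens, protos, doc_freq, n_docs):
    """Assign the class whose prototype has the smallest angle to the test vector."""
    v = tfidf(tokens, doc_freq, n_docs)
    return max(protos, key=lambda c: cosine(v, protos[c]))
```

A hypothetical call sequence would be `protos, df, n = train_prototypes(train_docs, train_labels)` followed by `classify(test_tokens, protos, df, n)`.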
2.1.2 Probabilistic Methods

Probabilistic methods have been widely used in information retrieval. The naive Bayes classifier has been demonstrated to be a powerful classifier in several domains [16, 44, 58]. Domingos et al. [16] conducted an empirical study on 28 domains, a varied selection of databases from the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). They showed that the naive Bayes classifier was more accurate than a decision tree (C4.5) in 16 domains. Lewis et al. [44] compared their ProbBayes method with a decision tree classifier (DT-min10) on two data sets. One of the data sets was the Reuters-22173 newswire stories, which have been manually indexed using 135 financial topic categories. The second data set consisted of 1,500 documents from the U.S. Foreign Broadcast Information Service that had been used in MUC-3. ProbBayes essentially uses the Bayes rule to estimate P(c_j = 1 | D), the a posteriori probability of category c_j given the document. A two-class classification was applied to each of the categories for a given test pattern. For the Reuters-22173 data set, the Bayes method and the decision tree method gave similar classification performance; the best break-even value (where the recall is equal to the precision) was about 65%. For the MUC-3 data set, DT-min10 performed slightly better, with a best break-even value of about 50%.

Bayesian networks, also known as inference networks, are generalizations of the naive Bayes classifier. A Bayesian network is a directed acyclic graph representation of the joint probability distribution for a set of variables, where each node in the graph represents a variable, and each edge represents the correlation between the two variables [32]. Bayesian networks were originally designed to encode the uncertain knowledge of an expert [60], and in recent years they have been applied to a variety of fields, such as expert systems, diagnosis engines, information retrieval, and decision support systems [25]. In [75, 76, 7], Turtle, Croft, et al. used Bayesian networks for information retrieval. Their Bayesian network-based retrieval systems consist of two parts: a document network and a query network. In the document network, nodes representing documents are connected to nodes representing terms. The query network is constructed so that terms in the query are connected to nodes representing how the terms should be combined. To perform a retrieval, the system connects these two networks together. Given the prior probabilities associated with the documents and the conditional probabilities of internally connected nodes, the a posteriori probability of each document in the collection, given the query, can be computed using the Bayes rule. The system then ranks the documents by this probability.

The underlying assumption of the naive Bayes approach is that, for a given class, the probabilities of terms occurring in a document are independent of each other. Bayesian inference networks, on the other hand, reflect the dependence between the terms. Several studies have attempted to relax the strong independence assumption of the naive Bayes classifier by incorporating Bayesian inference networks [66, 24].

2.1.3 Nearest Neighbor Classifier

For a given test document, the nearest neighbor classifier can be viewed as an attempt to estimate the a posteriori probability of a category from the training patterns. This rule assigns a test document to the category of its nearest training pattern.
A K-nearest neighbor classifier assigns a test document to the category which is most frequently present among its K nearest neighbors. Its simplicity and ease of implementation have made it popular in information retrieval systems. For example, Pazzani et al. [58] applied the nearest neighbor rule in their Syskill & Webert system, an agent that identifies interesting web pages by learning from a user's profile. Weiss et al. [79] showed that a classification-based information retrieval method using the nearest neighbor classifier performed better than the TF-IDF prototype method on USENET newsgroup datasets.

2.1.4 Symbolic Learning Methods

Symbolic learning techniques have been extensively studied during the last few decades [8, 10]. Among these methods, decision trees, such as ID3 [61] and C4.5 [63], are the most popular classifiers that have been used in information retrieval, in applications such as medical record classification [3], Web page selection [58], and news story categorization [44]. Decision trees classify patterns by starting from the root of the tree and traversing down to the leaves. Leaf nodes represent categories, while interior nodes represent attributes. When constructing a tree, attributes are selected based on their discrimination abilities.

Another type of symbolic learning is the rule-based method. Rules can be automatically induced from decision trees [63] and from neural networks [14], and they can also be learned from examples [12, 13, 50, 2].

2.1.5 Artificial Neural Networks

Artificial neural networks, e.g., multilayer perceptrons and radial basis networks, seem to fit well with conventional retrieval methods in information science [10], such as the vector space model [67]. A typical multilayer feedforward network [31, 48] learns the weights for its interconnections by using a gradient descent method to minimize the squared error between the network output values and the target values for these outputs. Schutze et al. [71] used neural networks trained by backpropagation for the document routing problem. Two types of neural networks were used in their work: one is a linear neural network which consists of only input and output units, and the other is a non-linear neural network which has one hidden layer with three units. They demonstrated that neural networks, linear discriminant analysis, and logistic regression classifiers perform 10-15% better than relevance feedback via Rocchio expansion on the TREC-2 and TREC-3 routing tasks. In another similar study [15], LSI (latent semantic indexing)-based neural networks were employed for classification of 10 categories of TREC-1 newswire data. The experiments were run in two configurations: (i) a single-sensor neural net, where the inputs are based on LSI alone, and (ii) a two-sensor neural net, which uses 10 additional keywords as inputs. Their experimental results showed that the neural network methods perform 4% to 20% better than the LSI method alone. Artificial neural network techniques have also been employed in commercial intelligent agents, such as the WiseWire collaborative filtering engine (http://www.wisewire-corp.com/indexprod.html).

2.1.6 Genetic Algorithms

Another type of learning method, genetic algorithms (GA), is also popular in information science. Chen [10] surveyed several implementations of genetic algorithms in information retrieval. Sheth [73] used a genetic algorithm to build a personal information filtering system, called Newt.
During the learning phase in Newt, a genetic algorithm was used to model the population characteristics and behavior in response to changing user interests. Using crossover and mutation, new members were added to the population, while unfit members were removed during each generation.

2.2 Dimensionality Reduction

Document classification is often characterized by the high dimensionality (often thousands of features) of the associated feature space and a relatively small number of training samples. The feature set is often "noisy" and may contain redundancy [43], which may lead to the problem of the "curse of dimensionality" [33]. Moreover, a large number of features results in larger computational and storage complexity. There are several ways to reduce the feature set size. Traditional feature selection approaches in pattern recognition, such as sequential forward/backward selection and "plus l-take away r" selection [35], may perform better, but they are often computationally demanding. Here, we briefly review the two most popular feature reduction methods in information retrieval: best feature selection and latent semantic indexing (LSI).

2.2.1 Best Feature Selection

In best feature selection, the goal is to find the best subset of features from the entire feature space. Salton et al. [69] provided a weighting scheme, such as TF-IDF, for terms in the document collection. Schutze et al. [71] applied a chi-squared measure of dependence to a relevant/non-relevant contingency table to indicate the importance of the features. Mladenic [49] suggested the use of mutual information to assign weights to the terms in the documents. Koller et al. [37] developed an efficient algorithm for feature selection based on information theory. Their empirical results indicated that the algorithm can effectively handle datasets with a large number of features (up to 1675 features).

2.2.2 Latent Semantic Indexing

The latent semantic indexing (LSI) method developed at Bellcore [51] is a vector space information retrieval method which has been widely adopted in information filtering and retrieval systems [18]. In the LSI method, introduced in [51, 18], a singular value decomposition (SVD) is performed on a term-document matrix formed from a set of training patterns. Recall that each entry in the term-document matrix indicates the term frequency in the corresponding document. Only the large singular values are preserved. The resulting singular vector and singular value matrices are used to project the term frequency vectors for documents and queries onto a reduced feature subspace where the semantic relationships in the term-document matrix are preserved. In a query system, documents can be ranked according to their similarity to a query by using these projected subspace vectors. This technique has been successfully used in Salton's SMART system [68]. LSI has also been used for feature reduction in document classification systems [71, 15].

Chapter 3

Classification Algorithms

In this chapter, we introduce the different classifiers used in our study of text document classification. We formulate text classification as a multi-class classification problem, i.e., assigning a test pattern to one of the pre-defined categories (see Figure 1.5). The vector space model [67] is used to represent document features. During the preprocessing stage, a simple lexical analysis is performed to convert an input stream of characters into a stream of "useful" words.

3.1 Preprocessing

Web documents are written in HTML. We use the following preprocessing steps to construct the so-called indexing file, which contains those words that may be helpful in text classification:

1. Use an HTML parser to filter out HTML tags.
2. Convert all characters to lower case.
3. Remove digits and punctuation.
4. Screen out words in a stoplist.
5. Remove low-frequency words (those which occur only once in the training examples).

Words in the stoplist are known as stopwords; these are the most frequently occurring words in English, such as "the", "some", "of", etc., and are not considered helpful in retrieval. The stoplist we choose is the one given in [21].

Other preprocessing steps, like stemming [22], can be applied to further reduce the size of the indexing files. An effective stemming is helpful in retrieval and classification, because words with the same stem should have similar meanings. However, over-stemming [22] can cause originally unrelated words to be represented by the same stem, in which case it may actually degrade the performance of a classification system. For example, "international" and "internal" are both stemmed to "intern", but they originally have different meanings.
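To make the five steps concrete, here is a minimal Python sketch of the pipeline. The `TagStripper` class, the toy stoplist, and the `train_freq` argument are illustrative stand-ins for the actual parser and the full stoplist of [21] used in the thesis.

```python
import re
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Step 1: filter out HTML tags, keeping only the text content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

STOPLIST = {"the", "some", "of", "a", "an", "and", "to", "in", "is"}  # stand-in; [21] is much larger

def index_words(html, train_freq=None):
    stripper = TagStripper()
    stripper.feed(html)
    text = " ".join(stripper.chunks).lower()            # step 2: lower case
    words = re.findall(r"[a-z]+", text)                 # step 3: drop digits and punctuation
    words = [w for w in words if w not in STOPLIST]     # step 4: stoplist screening
    if train_freq is not None:                          # step 5: drop words seen only once
        words = [w for w in words if train_freq.get(w, 0) > 1]
    return words
```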
3.2 Feature Representation

The feature space of documents can be represented by the indexing files generated above. We adopt the commonly used "bag-of-words" document representation scheme (vector space model), in which we ignore the structure of a document and the order of words in the document. The word-list W = (w_1, ..., w_d) of the training set consists of all the distinct words (terms) that appear in the training indexing files. Typically, there can be several thousand features in document classification (the number of commonly used English words is between 20,000 and 50,000). Given a document D, its feature (term) vector is represented by T = (t_1, ..., t_d), constructed from W. The value of each component of T can be either binary (a value of 1 indicates that the corresponding word appeared in the document) or an integer representing the number of times the corresponding word was observed. Some training examples and the corresponding indexing files of each Yahoo news group are shown in Appendix A.

3.3 Classification Algorithms

In the following sections, we describe the naive Bayes classifier, nearest neighbor classifier, decision tree classifier, and artificial neural network classifier used in our study. We also introduce our use of the subspace method for text document classification.

3.3.1 Naive Bayes Classifier

The naive Bayes classifier [48], also known as the simple Bayes classifier, has been successfully used in text classification [1, 58]. Let C = (c_1, ..., c_m) be the set of m document classes. Given a new unlabeled document D, its word-list W' = (w_1, ..., w_{d'}) (defined in the same way as the word-list for the training set), and its corresponding binary feature (term) vector T' = (t_1, ..., t_{d'}), D is described by the conjunction of a set of attribute values t_i, i = 1, ..., d'. Using the Bayes rule, the a posteriori probability of class c_j, given document D, can be written as:

    P(c_j | D) = P(c_j | t_1, \ldots, t_{d'}),    (3.1)

where

    P(c_j | t_1, \ldots, t_{d'}) = \frac{P(t_1, \ldots, t_{d'} | c_j) P(c_j)}{P(t_1, \ldots, t_{d'})}.    (3.2)

The underlying assumption of the naive Bayes approach is that, for a given class c_j, the probabilities of words occurring in a document are independent of each other, i.e.,

    P(t_1, \ldots, t_{d'} | c_j) = \prod_{i=1}^{d'} P(t_i | c_j).    (3.3)

The naive Bayes approach assigns D to a class c_NB as follows:

    c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{d'} P(t_i | c_j),    (3.4)

where P(c_j) is the a priori probability of class c_j and P(t_i | c_j) is the conditional probability of word w_i, given class c_j. Both P(c_j) and P(t_i | c_j) are estimated from the training data. When the size of the training set is small, the relative frequency estimates of the probabilities P(t_i | c_j) will not be reasonable. For example, if a word never appeared in the given training data, its relative frequency estimate will be zero. So, we applied Laplace's law of succession [64] to estimate P(t_i | c_j). It is essentially a Bayesian estimate of the multinomial parameters. The Bayesian estimate of P(t_i | c_j) is given as:

    P(t_i | c_j) = \frac{n_{ij} + 1}{n_j + k},    (3.5)

where n_j is the total number of word occurrences in class c_j, n_{ij} is the number of occurrences of word w_i in class c_j, and k is the vocabulary size of the training set. This Bayesian estimation is based on a uniform prior assumption, i.e., the various word occurrences in class c_j are a priori equally likely. A detailed derivation of this estimate can be found in Appendix B.
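A compact Python sketch of this classifier follows, using logarithms in place of the product of Eq. (3.4) to avoid numerical underflow. It takes a multinomial reading of Eqs. (3.4)-(3.5) over the tokens of the test document, which is one plausible reading of the text rather than the thesis's exact implementation; all function names are ours.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate the quantities needed by Eqs. (3.4)-(3.5) from tokenized training docs."""
    class_docs = Counter(labels)                    # for the priors P(c_j)
    n_ij = defaultdict(Counter)                     # n_ij: occurrences of word i in class j
    for doc, c in zip(docs, labels):
        n_ij[c].update(doc)
    k = len({w for counts in n_ij.values() for w in counts})   # vocabulary size
    return class_docs, n_ij, k, len(docs)

def classify_nb(tokens, class_docs, n_ij, k, n_train):
    """Eq. (3.4): argmax over classes of prior times smoothed likelihoods."""
    def log_posterior(c):
        n_j = sum(n_ij[c].values())                 # total word occurrences in class c
        lp = math.log(class_docs[c] / n_train)      # prior P(c)
        for t in tokens:
            lp += math.log((n_ij[c][t] + 1) / (n_j + k))  # Laplace's law, Eq. (3.5)
        return lp
    return max(class_docs, key=log_posterior)
```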
3.3.2 Nearest Neighbor Classifier

The nearest neighbor (NN) decision rule assigns an unlabeled document D to the class c_i if the training pattern closest to D is from class c_i. A variation of the nearest neighbor rule is the K-nearest neighbor (KNN) rule, which first finds the K nearest neighbors of D among the training patterns, and then uses a voting scheme to assign a category label to D. The NN/KNN rule can be viewed as an attempt to estimate the a posteriori probabilities P(c_j | D) from the training patterns. The number of nearest neighbors (K) should be small compared to the total number of training samples; a rule of thumb is that K should be proportional to \sqrt{N}. The NN rule is used in our text classification for simplicity. A number of algorithms can be adopted to reduce the complexity of nearest neighbor search, such as nearest neighbor editing [17], the reduced nearest neighbor rule [26], and an effective algorithm for nearest neighbor search in high dimensions [52].

We use the TF-IDF weighting scheme (TF is the term frequency in a document, and IDF is the inverse document frequency) and use the cosine similarity [69], instead of Euclidean distance, to measure the similarity of documents. Given two documents D_1 and D_2, their corresponding weighted feature vectors are represented as T_1 = (t_{1i} \delta_i)_{i=1}^{d} and T_2 = (t_{2i} \delta_i)_{i=1}^{d}, where \delta_i is the weight of word w_i (based on TF-IDF). The similarity between D_1 and D_2 is then defined as:

    S(D_1, D_2) = \frac{T_1^T T_2}{\|T_1\| \|T_2\|},    (3.6)

where \|\cdot\| denotes the norm of the vector.
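The decision rule reduces to a few lines of Python; this is a sketch assuming documents are sparse dicts mapping terms to TF-IDF-weighted values (the helper names are ours):

```python
import math

def cosine_sim(v1, v2):
    """Eq. (3.6): S(D1, D2) = T1.T2 / (||T1|| ||T2||) for sparse dict vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def nn_classify(test_vec, train_vecs, train_labels):
    """Assign the test document the label of its most similar (1-NN) training pattern."""
    best = max(range(len(train_vecs)),
               key=lambda i: cosine_sim(test_vec, train_vecs[i]))
    return train_labels[best]
```

A linear scan is used here for clarity; the editing and high-dimensional search algorithms cited above would replace it in a large-scale system.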
3.3.3 Decision Tree Classifier

Decision trees are one of the most widely used inductive learning methods. Their robustness to noisy data and their capability to learn disjunctive expressions seem suitable for document classification. One of the most well-known decision tree algorithms is ID3 [61], with its successors C4.5 [63] and C5. It is a top-down method which recursively constructs a decision tree classifier. The interior nodes of the tree are associated with specific attributes (e.g., terms in the document collection), and the leaves of the tree represent specific categories. At each level of the tree, ID3 selects the attribute that has the highest information gain. Information gain is simply the expected reduction in entropy caused by partitioning the training examples according to an attribute [48]. Given a collection of training samples Tr, the information gain for an attribute W is defined as [48]:

    Gain(Tr, W) = Entropy(Tr) - \sum_{v \in Values(W)} \frac{|Tr_v|}{|Tr|} Entropy(Tr_v),    (3.7)

    Entropy(Tr) = \sum_{i=1}^{m} -p_i \log_2 p_i,    (3.8)

where Values(W) is the set of values which W can take, Tr_v is the set of training samples for which Value(W) = v, and p_i is the proportion of samples that belong to class i.

In summary, ID3 searches a complete hypothesis space, and uses the statistical properties (e.g., information gain) of all training examples at each step in the search. This makes it more robust to noisy training data. Furthermore, it uses reduced-error pruning [62] to avoid overfitting in decision tree learning. Mitchell analyzes the ID3 decision tree in more detail in [48].

For our experiments, we chose the C5 decision tree package, since it has many nice features over its predecessors, ID3 and C4.5. For example, the rulesets used in C5 are more accurate, faster, and require less memory (http://www.rulequest.com/see5-comparison.html). Furthermore, adaptive boosting [23] is incorporated into the software. The basic idea of boosting is to generate n (n > 1, specified by the user) classifiers (either decision trees or rule sets) instead of one. The ith classifier is constructed by examining the errors made by the (i-1)th classifier. When a new document is to be classified, a voting scheme based on the n classifiers is used to determine the final class of the document. We use the adaptive boosting option in the decision tree classifier in our experiments.
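The splitting criterion itself is easy to compute directly. The following Python sketch mirrors Eqs. (3.7)-(3.8); it is a didactic illustration, not the C5 implementation, whose internals are proprietary.

```python
import math
from collections import Counter

def entropy(labels):
    """Eq. (3.8): sum over classes of -p_i log2 p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(samples, labels, attribute):
    """Eq. (3.7): expected entropy reduction from partitioning on one attribute.

    `samples` are dicts mapping attributes (terms) to values, e.g. 0/1 occurrence.
    """
    by_value = {}
    for sample, label in zip(samples, labels):
        by_value.setdefault(sample.get(attribute, 0), []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in by_value.values())
    return entropy(labels) - remainder
```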
3.3.4 Multilayer Feed-Forward Network Classifier

An important class of neural networks, namely adaptive layered networks (e.g., multilayer perceptrons and radial basis networks), has been widely applied to diverse pattern classification domains with some success. They also seem to fit well with conventional retrieval models, such as the vector space model in information science [10, 71, 15]. A three-layer feed-forward network with two hidden layers and an output layer is used in our text classification. Figure 3.1 shows the architecture of our feed-forward neural network.

[Figure 3.1: The architecture of our adaptive feed-forward neural network. Feature extraction feeds an input layer, two sigmoidal hidden layers, and a linear output layer.]

Text document classification usually involves thousands of features, and it is difficult for neural networks to handle such a large input dimension. The principal component analysis method (see Section 4.2) is therefore used on the vector space model to reduce the feature set size, and the projected features in the subspace are used as input to the three-layer feed-forward network.

Webb et al. [78] illustrated why a nonlinear adaptive feed-forward layered network with linear output is capable of performing classification tasks well. They demonstrated that, if the weights are adjusted to minimize the total mean square output error, then the nonlinear transformation of the hidden layers also maximizes the network discriminant function C [78]:

    C = \mathrm{Tr}(S_B S_T^{+}),    (3.9)

where S_B is the (weighted) between-class covariance matrix, and S_T^{+} is the pseudo-inverse [29] of the total-class covariance matrix S_T. For a one-to-m target coding scheme,

    t_{cp} = \begin{cases} 1, & \text{if input pattern } p \text{ is in class } c \\ 0, & \text{otherwise.} \end{cases}    (3.10)

This coding scheme results in a weighted between-class covariance matrix S_B. We choose to use an alternative target coding scheme [78],

    t_{cp} = \begin{cases} 1/\sqrt{n_c}, & \text{if input pattern } p \text{ is in class } c, \text{ where } n_c \text{ is the number of patterns in class } c \\ 0, & \text{otherwise,} \end{cases}    (3.11)

in which case S_B is the conventional between-class covariance matrix, and it is argued in [78] that, for multi-class classification problems, this target coding scheme compensates for unbalanced class memberships in the training data. By performing a nonlinear transformation of the data into a space where the classes may be more separated, as determined by the between-class and total-class covariance matrices, the subsequent linear transformation to minimize the mean square target error could perform better than on the original data.

The neural network uses the backpropagation algorithm [31, 48] for training. Basically, the algorithm performs stochastic gradient descent to minimize the mean square error between the network outputs and their corresponding target values. The activations of an input pattern are propagated forward through the net, while the errors are backpropagated to update the weights in order to minimize the mean square target error.

Traditionally, the weights in a feed-forward neural network trained by a backpropagation algorithm are initialized randomly. This may require many training epochs, and the weights may not converge in some cases. Instead, we initialize the weights based on the eigenvalues of the covariance matrix of the input pattern matrix, which helps to speed up the convergence [70].

3.3.5 Linear Subspace Method

The subspace model [56] decomposes a given feature space into m sub-regions of lower dimensionality (subspaces), where each region is a representative feature space for its corresponding pattern class c_i, i = 1, ..., m. A test document is classified based on a comparison of its compressed representation in each feature space with that of the different classes. We apply this model to document classification as follows.

Suppose we have m document classes C = (c_k)_{k=1}^{m}. Class c_k is represented by a subspace L_k of cardinality d_k. Let T = (t_i)_{i=1}^{d} denote the term-vector in the original d-dimensional feature space, corresponding to the word-list of the training set W = (w_i)_{i=1}^{d}. Let the word-list of the subspace L_k be denoted as W_k = (w_i^k)_{i=1}^{d_k}, where the w_i^k are the words observed in class c_k. Given a vector T in the original feature space, the weighted projection \Pi_k of vector T on the subspace L_k is defined as:

    \Pi_k = H_k T,    (3.12)

where H_k = (h_{ij}) is a d_k x d matrix; the ith row corresponds to the ith component of the word-list W_k in the subspace L_k, while the jth column corresponds to the jth component of the word-list W in the original feature space. The elements h_{ij} are calculated as follows:

    h_{ij} = \begin{cases} \delta_j^k, & \text{when the term } w_i^k \text{ is the same as the term } w_j \\ 0, & \text{otherwise,} \end{cases}    (3.13)

where \delta_j^k is the weight of term w_j in subspace L_k. We define \delta_j^k as:

    \delta_j^k = \frac{CLASSFREQ_{jk}}{\log_2(DOCFREQ_j + 1)},    (3.14)

where CLASSFREQ_{jk} denotes the ratio of the number of documents in class c_k in which the term w_j occurred to the number of documents in c_k, and DOCFREQ_j represents the ratio of the number of documents, over all classes, in which the term w_j occurred to the size of the training set. The Euclidean vector norm of \Pi_k is \|\Pi_k\| = \sqrt{\Pi_k^T \Pi_k}. For a new document D, the subspace decision rule classifies D to the class on whose subspace its term-vector T has the largest projection in terms of the Euclidean vector norm.
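The following Python sketch implements this decision rule, storing each H_k implicitly as a dict of per-term weights; the data-structure choices and names are ours, not from [56].

```python
import math

def subspace_weights(class_docs, docfreq):
    """The weights delta_j^k of Eq. (3.14) for one class subspace L_k.

    class_docs: token sets of the training documents of class c_k;
    docfreq[w]: DOCFREQ_w, a document-frequency ratio over the whole training set.
    """
    n_k = len(class_docs)
    vocab_k = {w for doc in class_docs for w in doc}
    return {w: (sum(1 for doc in class_docs if w in doc) / n_k)     # CLASSFREQ_wk
               / math.log2(docfreq[w] + 1)
            for w in vocab_k}

def projection_norm(term_vec, weights):
    """||Pi_k|| with Pi_k = H_k T as in Eq. (3.12); H_k is stored implicitly as `weights`."""
    return math.sqrt(sum((weights[w] * t) ** 2
                         for w, t in term_vec.items() if w in weights))

def classify_subspace(term_vec, class_weights):
    """Assign D to the class whose subspace yields the largest projection norm."""
    return max(class_weights, key=lambda c: projection_norm(term_vec, class_weights[c]))
```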
3.4 Combination of Different Classifiers

A number of studies have shown that combining different classifiers can improve the classification accuracy [80, 39, 28]. Larkey et al. [39] applied weighted linear combinations of different classifiers to the medical document domain, where the weights were assigned by the user. Another combination of multiple classifiers (CMC) approach is dynamic classifier selection (DCS) [80, 28], in which the single classifier is selected that has the highest local accuracy in a small region of the feature space surrounding the test sample presented to the system. We investigated three different combination approaches: simple voting, DCS, and our own approach of adaptive classifier combination (ACC).

3.4.1 Simple Voting

For each test document, classify it to class c_i if a majority of the classifiers individually assign the test document to class c_i.

3.4.2 Dynamic Classifier Selection (DCS)

We have implemented a version of DCS described in [80, 28]. For a test document D, we use the k-nearest neighbor approach to find the neighborhood of D, and the "leave-one-out" method [17] is applied on the training data to find the local accuracy in the neighborhood of D. We used the "soft" measure [28] of the local accuracy, where the weight of each neighbor of D is the cosine similarity between D and that neighbor.

3.4.3 Adaptive Classifier Combination (ACC)

Instead of selecting the best classifier with the highest local accuracy for a test document, we assign the document to the class identified by the classifier that has the highest local accuracy among all the classifiers. The outline of our ACC algorithm, given n classifiers, is as follows:

1. For a test document D, find the neighborhood of D, NB(D) = (x_1, ..., x_K), x_i ∈ Training Set, using the K-nearest neighbor algorithm.

2. Denote the classification results for D by the n classifiers as C = (c_1, ..., c_n), c_j ∈ {C_1, ..., C_m}.

3. For each class c_j ∈ C, calculate Acc_j^loc = \sum_{i=1}^{K} W_i P_i(c_j | x_i ∈ c_j), where P_i(c_j | x_i ∈ c_j) is the local accuracy of a neighborhood pattern x_i, i.e., the a posteriori probability that x_i belongs to class c_j. The local accuracy of each x_i can be obtained by using the "leave-one-out" method on the training data, and W_i is the cosine similarity between pattern x_i and D, which is defined as:

    W_i = Cosine(x_i, D) = \sum_{k=1}^{t} (Term_ik × Term_k) / sqrt(\sum_{k=1}^{t} (Term_ik)^2 × \sum_{k=1}^{t} (Term_k)^2),    (3.15)

where t is the dimensionality of the feature vector, Term_ik is the value of term k in document x_i, and Term_k is the value of term k in the given document D.

4. Classify D to the class c_j* with j* = argmax_j (Acc_j^loc).

Chapter 4

Dimensionality Reduction

Document classification is often characterized by a high dimensionality of the associated feature space and a relatively small number of training samples. The increase in dimensionality results in an increase in both the computational and the storage complexities. Furthermore, we must also guard against the potential problems of the "curse of dimensionality" [33]. We study four feature dimensionality reduction approaches: feature selection, feature extraction (principal component analysis and linear discriminant analysis), and term grouping in subspace.

4.1 Best Feature Selection

One way to reduce the number of features in document classification is to select a subset of the best terms from the entire feature space; a small sketch of this selection step is given below.
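The following minimal sketch (our own illustration; the exact procedure we use is described in the next paragraph) ranks terms by the information gain of their presence or absence (see Section 3.3.3) and keeps the top n:

    import numpy as np

    def entropy(counts):
        # Entropy of a vector of per-class counts (equation 3.8).
        p = counts / counts.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def select_top_terms(X, y, m, n):
        # X: binary document-term matrix (documents x terms);
        # y: class labels in {0, ..., m-1}; returns the indices of the
        # n terms with the highest information gain.
        X, y = np.asarray(X), np.asarray(y)
        N, d = X.shape
        base = entropy(np.bincount(y, minlength=m).astype(float))
        gains = np.empty(d)
        for j in range(d):
            present = X[:, j] > 0
            gain = base
            for split in (present, ~present):
                if split.any():
                    gain -= (split.sum() / N) * entropy(
                        np.bincount(y[split], minlength=m).astype(float))
            gains[j] = gain
        return np.argsort(gains)[::-1][:n]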
In this thesis, we use the individual best features approach, where terms are sorted by their weights in descending order and the top n terms with the highest weights are selected. We use mutual information (i.e., information gain, see Section 3.3.3), as suggested in [49], to assign weights to the terms. Traditional feature selection approaches in pattern recognition, such as sequential forward/backward selection and "plus l-take away r" selection [35], may perform better, but they are often much more expensive in terms of computational cost. This is a major consideration in the high dimensional feature spaces encountered in document classification problems.

Another method of dimensionality reduction is to map the original measurements onto a more effective lower dimensional subspace, where each new feature is a combination of the original features. This is the so-called feature extraction method, which includes the principal component analysis and linear discriminant analysis used in our study.

4.2 Principal Component Analysis

Principal component analysis (PCA, also called the K-L transform) is a commonly used linear projection method [34]. It projects the original data vectors (of dimension d) onto coordinate axes of dimension p (usually p < d). This minimizes the mean-square error between the original data and the new representation built from p eigenvectors of the covariance matrix. Using the vector space representation scheme in document classification, let T_1, ..., T_N denote the N d-dimensional training vectors, and let their normalized, zero-mean versions be denoted T_1^*, ..., T_N^*. Let the p basis vectors e_1, ..., e_p be a set of orthonormal vectors that best describe the distribution of documents in the p-dimensional subspace (eigenspace), p ≤ d. It can be shown that the "best" set of basis vectors corresponds to the eigenvectors of the covariance matrix Σ: the first basis vector e_1 corresponds to the largest eigenvalue of Σ, the second basis vector e_2 corresponds to the second largest eigenvalue of Σ, and so on. With the p-dimensional eigenspace defined, the training vectors T_1^*, ..., T_N^* can be represented as a set of p-dimensional feature vectors f_1, ..., f_N:

    f_i = e^T T_i^*,  i = 1, ..., N,    (4.1)

where e = (e_1, ..., e_p). The sum of the first p eigenvalues is the "variance" retained in the subspace, while the sum of all d eigenvalues is the "variance" in the original pattern space [34]. We can choose p such that \sum_{i=1}^{p} λ_i / \sum_{i=1}^{d} λ_i ≥ V, where V is a user-specified value representing the desired "variance" retained in the p-dimensional subspace.

4.3 Discriminant Analysis

Linear discriminant analysis (LDA) also attempts to project patterns onto a lower dimensional space than the original space [34]. Whereas PCA projects patterns onto the p-dimensional space whose orthogonal basis consists of the p eigenvectors corresponding to the p largest eigenvalues of the covariance matrix of the training samples, LDA projects patterns onto an (m-1)-dimensional space using the basis set (e_1, ..., e_{m-1}) computed from φ_W^{-1} φ_B, where m is the number of categories, φ_W is the within-class scatter matrix, and φ_B is the between-class scatter matrix. Let x_ki be the vector of the ith document in category k, i = 1, ..., N_k; let μ_k be the sample mean of class k, μ the overall mean of the training samples, and N = N_1 + ... + N_m the total number of documents in the training set.
Then, according to [34, 57], the within-class scatter matrix φ_W and the between-class scatter matrix φ_B are defined as follows:

    φ_W = (1/N) \sum_{k=1}^{m} \sum_{i=1}^{N_k} (x_ki - μ_k)(x_ki - μ_k)^T,    (4.2)

    φ_B = (1/N) \sum_{k=1}^{m} N_k (μ_k - μ)(μ_k - μ)^T.    (4.3)

The basis set (e_1, ..., e_p) is a set of eigenvectors corresponding to the (m-1) largest eigenvalues of the matrix φ_W^{-1} φ_B, which satisfy

    e_i^T Σ e_j = 1 if i = j, and e_i^T Σ e_j = 0 otherwise,    (4.4)

where Σ is the covariance matrix of the training samples. Since the rank of the matrix φ_W^{-1} φ_B is at most m-1, the number of coordinates of the projected space is limited by the number of document categories. Foley and Sammon [19] proposed an optimal discriminant plane for the 2-class problem under an orthonormality condition on the coordinate axes. Okada et al. [57] generalized it to the multi-class problem, calling it orthonormal discriminant analysis (ODA), where the number of features that can be extracted is constrained only by the original feature dimensionality d. Hamamoto et al. [30] proved that "the ODA method is more powerful than LDA in terms of the Fisher criterion."

4.4 Term Grouping in Subspace

The occurrences of different words in documents are usually not independent; there are correlations between words in a group of documents. Wulfekuhler et al. [81] applied K-means clustering within each document category to find clusters of words for that document class. To capture the correlation of all word-pairs, we construct a bigram matrix for each document class k. Let W_k = (w_1^k, ..., w_{d_k}^k) be the d_k-dimensional word-list in subspace L_k. The bigram matrix B_k = (m_ij) is a d_k × d_k matrix, with m_ij representing the number of documents from class c_k in which both the terms w_i^k and w_j^k jointly appear. A complete-link hierarchical clustering algorithm is applied to the proximity matrix B_k. The resulting dendrogram (tree) is then cut into p term-groups, where p is a user-specified parameter. Let the term-groups be denoted as H = (h_1, ..., h_p), where h_i ∩ h_j = ∅ for i ≠ j, and h_i = (w_{i1}^k, ..., w_{in_i}^k), w_{ij}^k ∈ W_k. Given a vector Π_k in subspace L_k, its projection onto the p-dimensional space, ξ = (ξ_1, ..., ξ_p), is defined as

    ξ_i = I H_i Π_k,    (4.5)

where H_i = (h_uv) is an n_i × d_k matrix defined as

    h_uv = 1, when the term w_{iu}^k is the same as the term w_v^k; 0, otherwise,    (4.6)

and I is the n_i-dimensional unit vector.

Chapter 5

Experimental Results

The data used in our experiments are news items downloaded from the Yahoo news group and the Reuters-21578 newswire benchmark. Each document in the data sets is indexed by human experts. We preprocess the HTML news items by (i) document parsing (removing headers and tags in the HTML files), and (ii) removing stopwords and low frequency words, as mentioned earlier.

• Yahoo News Data. The Yahoo news items were downloaded from the Yahoo news group in 1997. There were 9 categories on the site. We chose to use 7 of them: Business (B), Entertainment (E), Health (H), International (I), Politics (P), Sports (S), and Technology (T). The remaining two categories, Top-stories and Local, are excluded, since the associated semantics cannot be easily captured by word frequencies or occurrences. Each news item has a title and a short summary, which is about 120 words on average. We have constructed two data sets from the Yahoo news data. One of them is called the Yahoo news set, which contains 814 training samples and two test data sets (news items at different time intervals, see Table 5.1). Test data set1 has 680 news items and test data set2 has 621 documents.
The second data set is called the enlarged Yahoo news set, which has 4,199 training samples and 2,000 test samples (see Table 5.2).

    Categories                     B     E     H     I     P     S     T
    Training   #of documents      130   133    91   110   130   130    90
    Data       total #of terms   1848  2045  1213  1974  2070  1659  1364
    Test Data  #of documents      110   111    79    80   110   111    79
    Set1       total #of terms   2155  2583  1535  1999  2439  1952  1618
    Test Data  #of documents      100   101    78    70   101   101    70
    Set2       total #of terms   2046  2834  1803  2604  2070  1974  1689

Table 5.1: Yahoo news training and test data.

    Categories                     B     E     H     I     P     S     T
    Training   #of documents      682   687   476   510   682   673   489
    Data       total #of terms   5841  8610  5020  6893  6963  5756  4721
    Test Data  #of documents      322   334   224   245   323   330   222
    Set        total #of terms   3904  5275  3199  4086  4482  3803  3043

Table 5.2: Enlarged Yahoo news training and test data.

• Reuters-21578 benchmark. This data set consists of 21,578 newswire stories which appeared on the Reuters newswire in 1987. It can be accessed from Lewis' professional home page (http://www.research.att.com/~lewis). We use the TOPIC category set, which has been used in almost all previous experiments with the Reuters data. The TOPIC set contains 12,668 newswire stories with 135 categories, but only 120 categories have at least one document, and only 57 categories have at least 20 documents (see Table 5.3). The data are split into a training set (9,649 documents) and a test set (3,019 documents).

5.1 Evaluation of Text Classification Effectiveness

In information retrieval, there are two important measures of system effectiveness, called recall and precision [69]. Recall measures the ability of the system to present all relevant items; it is defined as the number of relevant documents retrieved divided by the total number of relevant documents in the collection. Precision measures the ability of the system to present only relevant items; it is defined as the number of relevant documents retrieved divided by the total number of documents retrieved. For example, suppose there are 100 documents in a collection that are relevant to a query Q, and a system retrieves 200 documents (for query Q), among which 80 are relevant to Q. The recall of this system for query Q is then 80/100 = 0.80, while the precision is 80/200 = 0.40.

Text classification is the assignment of documents to one or more pre-existing categories, rather than retrieving them in response to a query, so the recall and precision measures used in information retrieval do not fit a classification system very well. Lewis [42] defined microaveraged recall and precision to evaluate a text classification (categorization) system. For a two-class classification problem, a binary decision is made based on whether a document belongs to a given class or not. For a set of n test documents, a contingency table (Table 5.4) can be made for these n binary decisions [42].
    Category      acq   alum  austdlr  austral  barley  bfr  bop
    Training Set  1650  35    4        0        37      0    75
    Test Set      715   20    0        0        0       0    16

    Category      can  carcass  castor-meal  castor-oil  castorseed  citruspulp  cocoa
    Training Set  3    50       0            1           1           1           55
    Test Set      0    7        0            0           0           0           17

    Category      coconut  coconut-oil  coffee  copper  copra-cake  corn  corn-oil
    Training Set  4        4            111     47      2           182   1
    Test Set      1        0            23      14      0           1     0

    Category      cornglutenfeed  cotton  cotton-meal  cotton-oil  cottonseed  cpi  cpu
    Training Set  2               39      0            1           0           69   3
    Test Set      0               9       0            0           0           18   1

    Category      crude  cruzado  dfl  dkr  dlr  dmk  drachma
    Training Set  389    1        2    1    131  10   0
    Test Set      160    0        0    0    16   0    0

    Category      earn  escudo  f-cattle  ffr  fishmeal  flaxseed  fuel
    Training Set  2877  0       2         0    2         0         13
    Test Set      1083  0       0         0    0         0         7

    Category      gas  gnp  gold  grain  groundnut  groundnut-meal  groundnut-oil
    Training Set  37   101  94    433    5          0               1
    Test Set      15   28   27    127    2          0               0

    Category      heat  hk  hog  housing  income  instal-debt  interest
    Training Set  14    0   16   16       9       5            347
    Test Set      4     0   3    2        4       1            99

    Category      inventories  ipi  iron-steel  jet  jobs  l-cattle  lead
    Training Set  5            41   40          4    46    6         15
    Test Set      0            12   12          1    12    0         9

    Category      lei  lin-meal  lin-oil  linseed  lit  livestock  lumber
    Training Set  12   1         1        2        1    75         10
    Test Set      3    0         0        0        0    11         4

    Category      lupin  meal-feed  mexpeso  money-fx  money-supply  naphtha  nat-gas
    Training Set  0      30         0        538       140           2        75
    Test Set      0      6          0        151       31            1        15

    Category      nickel  nkr  nzdlr  oat  oilseed  orange  palladium
    Training Set  8       1    2      8    124      16      2
    Test Set      1       0    0      0    19       9       0

    Category      palm-meal  palm-oil  palmkernel  peseta  pet-chem  platinum  plywood
    Training Set  0          30        2           1       20        5         4
    Test Set      0          0         0           0       8         3         0

    Category      pork-belly  potato  propane  rand  rape-meal  rape-oil  rapeseed
    Training Set  3           3       3        2     1          5         18
    Test Set      0           3       2        0     0          0         0

    Category      red-bean  reserves  retail  rice  ringgit  rubber  rupiah
    Training Set  1         55        23      35    1        37      1
    Test Set      0         14        1       2     0        9       0

    Category      rye  saudriyal  sfr  ship  silk  silver  singdlr
    Training Set  1    3          0    197   0     21      0
    Test Set      0    0          0    55    0     2       0

    Category      skr  sorghum  soy-meal  soybean  stg  strategic-metal  sugar
    Training Set  1    24       13        78       17   16               126
    Test Set      0    0        0         1        0    6                27

    Category      sun-meal  sun-oil  sunseed  tapioca  tea  tin  trade
    Training Set  1         5        11       3        9    18   369
    Test Set      0         0        0        0        3    11   105

    Category      tung  tung-oil  veg-oil  wheat  wool  wpi  yen
    Training Set  0     0         87       212    2     19   45
    Test Set      0     0         24       3      0     9    6

    Category      zinc
    Training Set  21
    Test Set      7

Table 5.3: Number of documents in each category of the Reuters-21578 TOPIC set.

                  Yes is correct   No is correct
    Decide Yes    a                b                a+b
    Decide No     c                d                c+d
                  a+c              b+d              a+b+c+d = n

Table 5.4: Contingency table of binary decisions for a test set, from [42].

Given the contingency table, recall and precision are defined as recall = a/(a+c) and precision = a/(a+b). For m classes and n test documents, if we treat each classification as a binary decision, then a total of mn decisions are made. Microaveraging considers all mn decisions as a single group and computes recall and precision as defined above. There is a tradeoff between recall and precision: we can obtain a very high recall rate by always deciding yes, but the corresponding precision will then be small. A classification system attempts to maximize both precision and recall simultaneously. Usually, the break-even point of recall and precision is reported, i.e., the point at which the microaveraged recall equals the microaveraged precision; linear interpolation is often used to obtain the break-even values.
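These definitions translate directly into code; a minimal sketch (our own illustrative helper, not the evaluation script used in the experiments) pools all mn binary decisions and returns the microaveraged recall and precision:

    def micro_recall_precision(decisions, truths):
        # decisions, truths: parallel 0/1 sequences over all m*n binary
        # "document belongs to class" decisions and their ground truth.
        a = sum(d and t for d, t in zip(decisions, truths))        # decide yes, yes is correct
        b = sum(d and not t for d, t in zip(decisions, truths))    # decide yes, no is correct
        c = sum((not d) and t for d, t in zip(decisions, truths))  # decide no, yes is correct
        recall = a / (a + c) if (a + c) else 0.0
        precision = a / (a + b) if (a + b) else 0.0
        return recall, precision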
We consider text classification as a multi-class classification problem, i.e., each time a classifier assigns a document to one of the m categories, instead of classifying it to one class against all the others. We construct an m × m confusion matrix, where entry (i, j) shows how many documents belonging to category i were assigned to category j. The sum of the diagonal entries divided by the number of test examples measures the classification accuracy, which equals the microaveraged recall and precision defined by Lewis [42]. If a document indeed belongs to several categories, then a decision made by a classifier is treated as correct if it matches one of these "true label" categories, and as incorrect otherwise.

5.2 Individual Classifiers

On the small Yahoo data set, we compared the five classification algorithms (naive Bayes classifier (NB), nearest neighbor classifier (NN), decision tree classifier (DT), artificial neural network classifier (NNet), and the subspace classifier (SS)) on our two test data sets. Table 5.5 shows a comparison of the recognition rates (microaveraged recall/precision) using these five classification algorithms individually. The experimental results show that all five classification algorithms perform reasonably well; the NB approach performs the best on test data set1, while the NNet classifier performs the best on test set2; the NNet and the subspace methods outperform the other three classifiers on test data set1 and test data set2, respectively. Confusion matrices of the classification results using the five classifiers on test set1 are shown in Tables 5.6-5.10. Table 5.11 shows the classification results using the NB classifier on the enlarged Yahoo data set and the Reuters-21578 news story benchmark.

                                       NB     NN     DT     NNet   SS
    Test Data  #of misclassifications  115    165    178    131    139
    Set1       Recognition rate (%)    83.1   75.7   73.8   80.74  79.6
    Test Data  #of misclassifications  125    179    144    90     111
    Set2       Recognition rate (%)    79.87  71.18  76.8   85.51  82.13

Table 5.5: Comparison of the five classification algorithms (NB, NN, DT, NNet, and SS).

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    81   1    2    0    7    0    19   73.6
    E    1    89   1    8    3    2    7    80.2
    H    0    0    79   0    0    0    0    100.0
    I    1    1    0    53   25   0    0    66.3
    P    5    0    1    10   91   0    3    82.7
    S    2    0    1    1    3    102  2    91.9
    T    7    0    2    0    0    0    70   88.6

Table 5.6: Confusion matrix of the classification results using the naive Bayes classifier (NB) without dimensionality reduction.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    74   5    2    2    8    2    17   67.3
    E    3    83   4    4    5    8    4    74.8
    H    2    1    70   0    0    2    4    88.6
    I    5    3    3    50   18   0    1    62.5
    P    9    1    4    16   76   0    4    69.1
    S    2    0    0    2    0    106  1    95.5
    T    19   1    1    0    1    1    56   70.9

Table 5.7: Confusion matrix of the classification results using the nearest neighbor classifier (NN) without dimensionality reduction.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    70   8    1    0    13   1    17   63.6
    E    18   77   0    5    2    7    0    70.6
    H    1    2    71   4    0    0    1    89.9
    I    3    8    0    54   8    5    2    67.5
    P    7    5    2    15   73   3    5    66.4
    S    1    11   0    1    2    96   0    86.5
    T    9    4    2    3    1    0    61   76.3

Table 5.8: Confusion matrix of the classification results using the C5 decision tree classifier (DT) without dimensionality reduction.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    72   1    7    0    8    3    19   65.45
    E    1    74   14   2    2    14   4    65.00
    H    0    0    79   0    0    0    0    100.0
    I    1    1    3    52   19   4    0    65.00
    P    4    0    4    7    88   2    5    80.00
    S    0    0    2    0    0    109  0    98.20
    T    0    0    5    0    0    0    75   94.94

Table 5.9: Confusion matrix of the classification results using the adaptive neural network classifier (NNet) without dimensionality reduction.
         B    E    H    I    P    S    T    Recognition Rate (%)
    B    65   3    6    2    8    2    24   59.1
    E    1    89   2    7    1    6    5    80.2
    H    0    0    79   0    0    0    0    100.0
    I    1    2    3    51   22   0    1    63.8
    P    6    0    6    10   84   1    3    76.4
    S    2    2    1    0    0    106  0    95.5
    T    9    0    2    1    0    0    67   84.8

Table 5.10: Confusion matrix of the classification results using the subspace classifier (SS) without dimensionality reduction.

                              Yahoo   Reuters
    #of misclassifications    285     430
    Recognition rate (%)      85.55   85.76

Table 5.11: Classification accuracy using NB on the two large data sets.

5.3 Combination of Different Classifiers

Results of combinations of multiple classifiers using the different combination approaches are summarized in Table 5.12. We set k = 20 in our experiments. Note that for these two data sets, there was no significant improvement from using a combination of classifiers. Our opinion is that the performance of a combination of classifiers is data dependent.

    Combination of   Combination      Test Data   Test Data
    Classifiers      Approaches       Set1 (%)    Set2 (%)
    NB, SS, NN       Simple Voting    80.29       81.96
                     DCS              80.00       77.13
                     ACC              82.21       82.45
    NB, SS           DCS              80.44       79.87
                     ACC              83.24       82.93
    Best Individual Classifier        83.1        82.13

Table 5.12: Classification accuracy of combinations of multiple classifiers.

5.4 Dimensionality Reduction

Best feature selection, PCA, LDA, ODA, and term grouping in subspace are used to reduce the dimensionality in our experiments. We used mutual information to weight the words appearing in the training documents; a subset of the words with the highest weights is selected. We compared four classifiers (NB, NN, DT, and SS) using this feature selection technique. Figure 5.1 shows the recognition rate of the four classification algorithms with different sized feature subsets. From this figure, we can see that (i) a small number of features is not suitable for NB, and (ii) the performance of the NB and SS classifiers tends to improve as the number of features is increased.

[Figure 5.1: The accuracy of each algorithm (DT, NN, NB, and SS) as a function of the number of selected features (0 to 1000).]

We used the PCA method to project the original feature space onto a lower dimensional subspace. We set V = 0.90 (90% of the variance retained), resulting in about 400 features for the small Yahoo training set. When using LDA and ODA, the original features are projected onto a 6-dimensional subspace, since we have 7 classes in the Yahoo data sets. We apply the NN classifier on the projected feature vectors. Table 5.13 shows a comparison of the different feature extraction methods, including PCA, LDA and ODA.

                              NN     NN with PCA   NN with LDA   NN with ODA
    Test Data  #of errors     165    166           136           121
    Set1       Recall (%)     75.7   75.6          80.0          82.21
    Test Data  #of errors     179    180           109           102
    Set2       Recall (%)     71.18  71.01         82.45         83.57

Table 5.13: Comparison of the feature extraction methods using NN.

Confusion matrices of the classification results using the NN classifier with
PCA, LDA and ODA on test set1 are shown in Tables 5.14-5.16, respectively. The experimental results show that ODA outperforms LDA and PCA.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    75   4    2    2    9    3    15   68.2
    E    4    86   1    4    4    8    4    77.5
    H    2    1    72   0    0    3    1    91.1
    I    6    4    1    47   21   1    0    58.8
    P    10   3    4    17   71   0    5    64.6
    S    3    0    0    2    0    105  1    94.6
    T    17   1    1    0    1    1    58   73.4

Table 5.14: Confusion matrix of the classification result using the nearest neighbor classifier with PCA feature extraction.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    67   2    4    1    10   5    21   60.92
    E    2    97   0    3    2    5    2    87.39
    H    0    4    74   0    0    1    0    93.67
    I    3    1    1    46   26   2    1    57.50
    P    9    2    2    12   83   1    1    75.45
    S    1    2    1    2    0    105  0    94.59
    T    3    2    2    0    0    0    72   91.14

Table 5.15: Confusion matrix of the classification result using the nearest neighbor classifier with LDA feature extraction.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    73   3    2    1    8    5    18   66.36
    E    1    99   0    6    1    3    1    89.19
    H    0    1    77   0    0    1    0    97.47
    I    4    6    1    41   26   1    1    51.25
    P    9    1    2    6    91   1    0    82.73
    S    1    2    1    0    0    107  0    96.40
    T    4    1    3    0    0    0    71   89.87

Table 5.16: Confusion matrix of the classification result using the nearest neighbor classifier with ODA feature extraction.

We apply our term-grouping technique to the subspace method on test set1. A total of 30 term groups were chosen in our experiment. A comparison of the recognition rates before and after using the term-grouping technique on test set1 is shown in Table 5.17; it shows that the performance of the SS method improved marginally with the term-grouping technique.

                              SS without term-grouping   SS with term-grouping
    #of misclassifications    139                        137
    Recognition rate (%)      79.6                       79.9

Table 5.17: Comparison of the subspace method with and without the term-grouping feature reduction technique on test set1.

Chapter 6

Conclusions and Discussion

We have applied five different classification methods (NB, NN, DT, SS, and NNet) to the problem of document categorization. These methods were evaluated individually and in combination. Since document classification involves a high dimensional feature space, we also studied the effect of different feature reduction techniques (individual best feature selection, PCA, discriminant analysis, and term-grouping in subspace) on the performance of these classifiers. The seven classes of Yahoo news items used in our experiments have a large overlap of words in their documents (e.g., out of 2,948 total words, there are 1,096 common words between the international and politics news categories, and 744 out of 2,468 words are common between the business and technology news groups), so this is a difficult classification problem. We can make the following observations based on our experimental results:

1. Comparison of different classifiers:

(a) For the Yahoo news data, all five classifiers perform reasonably well on our data sets. We also asked four students in our laboratory to assign labels to the 680 documents in test set1. The average classification accuracy of the labels assigned by these subjects is about 81%. Comparing this to the accuracy of the best classifier (83.1%), we see that the machine classification performance is reasonable. Weiss et al. [79] also reported that the accuracy of human judgment on 1,000 messages from 10 USENET newsgroups (misc.health.diabetes, sci.math.num-analysis, dc.politics, rec.food.restaurants, alt.tv.seinfeld, comp.sys.ibm.pc.games.sports, rec.arts.comics.dc.universe, sci.military.naval, talk.philosophy.misc, and humanities.lit.authors.shakespeare) is about 85%. Both the NB and SS classifiers work better than the NN and DT methods, but the performance of the NB and SS classifiers is data dependent. Most of the misclassifications are between the international and politics categories, and between the business and technology document classes, which inherently have a large overlap of terms.
If we combine the international and politics news groups, and combine the business and technology news groups, the performance of the four classification algorithms on the resulting 5-class problem improves by an average of 7%. Classification accuracies on test set1 for the four classifiers on this five-class problem are shown in Table 6.1.

                              NB     NN     DT     SS
    #of misclassifications    67     95     128    101
    Recognition rate (%)      90.15  86.03  81.18  85.15

Table 6.1: Comparison of the four classification algorithms (NB, NN, DT, and SS) on test set1 for the reduced 5-class problem.

When we reverse the roles of the training and test data (combining test set1 and test set2 as a training set, with the old training set now serving as the test set), the classification accuracy remains essentially the same (e.g., the classification accuracy of NB is 83.91%). This shows that the classifier is not sensitive to the choice of the training data (assuming that the training set is sufficiently large).

(b) For the enlarged Yahoo news data and the Reuters newswire, NB performs steadily well on these large databases, despite its "independence" assumption, which is not always satisfied in document classification. While SS performs well on the enlarged Yahoo news data set, it performs very poorly on the Reuters data set, possibly because the data are noisy. Some preprocessing, such as fine semantic analysis [27], may improve the performance.

2. Combinations of multiple classifiers do not always improve the classification accuracy. The adaptive classifier combination introduced here works better than the simple voting and dynamic classifier selection approaches on our two test data sets.

3. Dimensionality reduction:

(a) There is no significant peaking in classification performance observed in our experiments. In particular, the performance of NB improves as the number of features increases (which differs from the results obtained in [44]). Additionally, NB does not use a small number of features effectively (which also differs from the observation in [44] that 10 features performs the best for the Reuters news groups). This indicates that our data set is less separable than the Reuters newswire.

(b) The widely adopted LSI method in information retrieval systems is essentially the same as PCA when used to compare the similarity of two documents. LDA performs much better than PCA, and ODA performs the best in our experiments.

(c) The problem with feature selection is that the small number of selected words may not generalize well to new documents. However, the advantage of dimensionality reduction is not only an improved recognition rate (eliminating the problem of overfitting); the reduced number of features also leads to lower time and space complexities. The term-grouping method reduces the feature dimensionality and overcomes the generalization problem of feature selection, while maintaining the performance of the classifier.

It is difficult to say which classifier is the best. Generally, we believe that simple classifiers, such as NB and SS, can give fairly good classification results. These methods are simple, fast, and can easily be scaled up to very large databases. The NNet classifier also performs well, and it is very fast in the test stage. Feature extraction, such as LDA, can improve the classification performance significantly in some cases. However, feature extraction is computationally expensive and has difficulty scaling up to large databases (e.g., memory requirements).
Incorporating natural language processing into a classifier (feature representation, feature selection) can remove noisy features and improve the classification accuracy (e.g., using WordNet, http://www.cogsci.princeton.edu/~wn [27]). In our experiments, we also observed that, for the NB classifier, if we assign a test document to the top two matching categories, the microaveraged recall is 97% on the enlarged Yahoo news data and 99% on the Reuters newswire database.

APPENDICES

Appendix A

Examples of Training Samples of the Yahoo Datasets

We show one training example from each of the seven Yahoo news groups after Web document parsing, together with its word list (used to construct the feature vector) after stopword and low-frequency word removal; in the word lists below, each removed word is rendered as a run of underscores.

(a) Training sample:

Minimum Wage Rises - Nearly 7 million Americans are getting a raise on this Labor Day. The federal minimum wage is rising to $5.15 an hour. Fast food workers, retail clerks, gas station attendants and others will be earning 40 cents an hour more when they report to work as the second phase of the hike goes into effect. It was first raised to $4.75 last Oct. 1. According to a report to be issued tomorrow by the Economic Policy Institute, most of the 6.8 million workers affected by the minimum wage hike are women who work in the service sector. The Washington, D.C.-based think tank's study found that in 18 states, more than 10 percent of the work force will be affected by the minimum wage increase.

(b) Extracted word list:

minimum wage rises _ nearly _ million americans ___ getting _ raise ______ labor day_ ___ federal minimum wage __ rising _________ hour_ fast food workers_ retail clerks_ gas station attendants _________ ______ earning __ cents __ hour ____________ report __ _______________ phase __ ___ hike goes ____ effect_ __ ________ raised __ _ oct_ __ according __ _ report __ __ issued tomorrow _____ economic policy institute_ ____ __ ___ ___ million workers affected _____ minimum wage hike women ___ ____ __ ___ service sector_ ___ washington _____ based _____ tank__ study found ___________________ ______ percent __ ___ ____ force ______ affected _____ minimum wage increase_

Figure A.1: An example of the business news group; (a) a training sample; (b) extracted word list.

(a) Training sample:

Diana Funeral to be Saturday - Princess Diana's funeral will be held Saturday. Buckingham Palace announced Monday that the "people's princess" would be honored with services at Westminster Abbey in London. Diana will be buried in private services at her family's estate in Althorp, central England. Until the funeral, Diana's body will lie in the Chapel Royal at St. James's Palace in London. Members of the public will not be allowed to file past, but can write a personal message in books of condolence. Thousands of people have been lining up to do so. French prosecutors Monday disclosed a new twist to the tragedy that has numbed Britons. French officials said the driver who crashed Princess Diana's car -- killing her, her friend Dodi Al Fayed and himself -- was driving at a high speed with twice the blood alcohol level that would have landed him in jail.
(b) Extracted word list:

diana funeral __ __ saturday _ princess diana__ funeral ______ held saturday_ buckingham palace announced monday _______ _people__ princess honored ____ services __ westminster abbey __ london_ diana ______ buried __ private services _____ family__ estate __ althorp_ central england_ ________ funeral diana__ body ____ lie _____ chapel royal __ st_ jame__ palace __ london_ ____________ public ____ ___ allowed __ file past_ ___ ___ write _ personal message books __ condolence_ thousands __ people ________ lining __ __ _____ french prosecutors monday disclosed _ ___ twist _____ tragedy _______ numbed britons_ french officials _______ driver ___ crashed princess diana__ car __ killing _______ friend dodi al fayed _______________ driving __ _ ____ speed ____ twice ___ blood alcohol level _____________ landed _____ jail_

Figure A.2: An example of the entertainment news group; (a) a training sample; (b) extracted word list.

(a) Training sample:

Heart Failure Therapy Saves Lives - A more aggressive and comprehensive management program for heart failure can reduce hospital admissions by 85% in patients waiting for a heart transplant, a new study suggests. The more comprehensive treatment may not only improve the quality of life for patients, it also saves money. An estimated 400,000 to 800,000 people in the U.S. have severe heart failure, and 200,000 die of the disease each year, according to the report in the Journal of the American College of Cardiology.

(b) Extracted word list:

heart failure therapy saves lives _ _ ____ aggressive ___ comprehensive management program ___ heart failure ___ reduce hospital admissions __ ___ __ patients waiting ___ _ heart transplant_ _ ___ study suggests_ _______ comprehensive treatment __________ improve ___ quality __ life ___ patients_ __ ____ saves money_ __ estimated _______ __ _______ people __ ____ ____ severe heart failure_ ___ _______ die __ ___ disease ____ year_ according __ ___ report __ ___ journal __ ___ american college __ cardiology_

Figure A.3: An example of the health news group; (a) a training sample; (b) extracted word list.

(a) Training sample:

Diana to Receive 'Unique Funeral' - Buckingham Palace Monday announced that Princess Diana will be given a "unique funeral for a unique person" Saturday in London. Her coffin will be carried through the streets to services at Westminster Abbey, to be followed by a private burial at her family's estate in central England. Until the funeral, Diana's body will lie in the Chapel Royal at St. James's Palace in central London, and thousands of mourners have flocked there to sign books of condolences. Crowds have also gathered outside Kensington Palace, Diana's London home, and outside Buckingham Palace, as they continue mourning the late Princess.

(b) Extracted word list:

diana __ receive _unique funeral_ _ buckingham palace monday announced ____ princess diana ____ __ _____ _ _unique funeral ___ _ unique person_ saturday __ london_ ___ coffin ______ carried __________ streets __ services __ westminster abbey_ __ __ followed __ _ private burial __ ___ family__ estate __ central england_ ________ funeral_ diana__ body ____ lie _____ chapel royal __ st_ jame__ palace __ central london_ ___ thousands __ mourners ____ flocked _______ sign books __ condolences_ crowds ________ gathered outside kensington palace_ diana__ london home_ ___ outside buckingham palace_ __ ____ continue mourning ___ late princess_

Figure A.4: An example of the international news group; (a) a training sample; (b) extracted word list.

(a) Training sample:

Clinton Sends Condolences - President Clinton is sending condolences to British Prime Minister Tony Blair on the death of Princess Diana, saying "all of us have lost a friend and a strong voice for those less fortunate." In a letter written at his vacation retreat in Martha's Vineyard, Clinton praised Diana's "untiring and selfless commitment to helping persons in need, particularly children, the victims of AIDS and landmines, and other vital humanitarian concerns." Clinton sent letters expressing similar sentiments to Queen Elizabeth, Diana's ex-husband Prince Charles, and Diana's brother Charles Spencer. A White House spokesman says there's been no decision on who will represent the United States at Diana's funeral Saturday.
In a letter written at his vacation retreat in Martha’s Vineyard, Clinton praised Diana’s "untiring and selfless commitment to helping persons in need, particularly children, the victims of AIDS and landmines, and other vital humanitarian concerns." Clinton sent letters expressing similar sentiments to Queen Elizabeth, Diana’s ex-husband Prince Charles, and Diana’s brother Charles Spencer. A White House spokesman says clinton sends condolences - president clinton _- sending condolences --british prime minister tony blair _____ death -- princess diana- saying _--- -- -- ---- lost _ friend --- _ strong voice --------- fortunate-- _- - letter written ----- vacation retreat -- martha-- vineyard- clinton praised diana-- _untiring --- selfless commitment -- helping persons ----- particularly children- --_ victims -- aids _-- landmines- -------- vital humanitarian concerns__ clinton sent letters expressing similar sentiments queen elizabeth- diana-- ex-husband prince charles- --- diana__ brother charles spencer- - white house spokesman there’s been no decision on who ------------- decision _____ will represent the United States _--- represent -_- united ------ at Diana’s funeral Saturdgy. -_ diana funeral saturday- (a) (b) Figure A.5: An example of the politics news group; (a) a training sample; (b) ex- tracted word list. 69 49ers’ Rice Has Knee Surgery - The 1997 season is off to a nightmarish start for the San Francisco 49ers. Jerry Rice, who holds all the major receiving records in NFL history, underwent surgery Monday to repair a torn anterior cruciate ligament and torn medial collateral ligament in his left knee and is expected to miss the rest of the season. An MRI taken Sunday night revealed the injury. Dr. Michael Dillingham, the Niners’ team physician, performed the surgery and estimated that Rice will be sidelined four to six months. Rice suffered the injury on a reverse in the second quarter when he was dragged to the ground- by his facemask by Tampa Bay defensive end Warren Sapp. Rice’s left knee buckled as he was pulled awkwardly to the ground on the play, which resulted in a 10-yard loss. Sapp was called for a 15-yard facemask penalty on the ers_ rice --- knee surgery - ------- season nightmarish start ------ san francisco --ers- jerry rice- holds ------ major receiving records -- nfl history- underwent surgery monday -- repair _ torn anterior cruciate ligament --- torn medial collateral ligament ----- left knee --- -_ expected -- miss --- rest ----- season- -- mri ----- sunday night revealed --- injury- dr- michael dillingham- --_ niners- team physician- performed ___ surgery ___ estimated ---- rice ______ sidelined ______ six months- rice suffered --- injury -- - reverse ----------- quarter _________ dragged -- --- ground ----- facemask -- tampa bay defensive --- warren sapp- rice-_ left knee buckled -- -- --- pulled awkwardly ----- ground _- --- play_ ----- resulted -- - ___yard loss- sapp --- called --- - ___yard facemask penalty _____ play. (a) play_ (b) Figure A.6: An example of the sports news group; (a) a training sample; (b) extracted word list. 70 Apple Buys Clone Maker Assets - apple buys clone maker assets - Apple Computer said today it’s apple computer paying $100 million in stock to paying ---- million -- stock _- buy the core assets of Power buy --- core assets -- power Computing, a privately held computing- - privately held licensee of Apple’s Macintosh licensee -- apple-- macintosh line of computers. 
(b) Extracted word list:

apple buys clone maker assets _ apple computer __________ paying ____ million __ stock __ buy ___ core assets __ power computing_ _ privately held licensee __ apple__ macintosh line __ computers_ power computing ___ pioneered direct marketing ___ sales _____ macintosh market_ successfully building _ ____ million business__ apple board ______ ___ founder steve jobs ____ announcing ___ agreement_ __ look forward __ learning _________ experience_ ___ welcoming _____ customers ____ ______ ___ apple family__ apple sold ___ macintosh license __ power computing __ december _____

Figure A.7: An example of the technology news group; (a) a training sample; (b) extracted word list.

Appendix B

Laplace's Law of Succession

Laplace's law of succession is a Bayesian approach to the problem of multinomial parameter estimation [64]. Given a set of distinct symbols, the problem is to estimate the symbol probabilities based on the known frequency with which each symbol occurred in the past. Let θ_i be the probability that word w_i occurs in class c. We assume that the words appear in the documents independently. The problem is to find θ̂_i, an estimate of θ_i. Let n be the total number of occurrences of words in class c, n_{w_i} the number of occurrences of word w_i in class c, and k the total number of distinct words in all the documents, i.e., the vocabulary size. Let the a priori probability density of the unknown parameters (θ_1, ..., θ_k) be uniform, i.e.,

    P(θ_1, ..., θ_k) = l, if 0 ≤ θ_i ≤ 1, i = 1, ..., k, and \sum_{i=1}^{k} θ_i = 1; 0, otherwise.    (B.1)

Since the p.d.f. must integrate to 1, l = (k-1)!. The a posteriori density of (θ_1, ..., θ_k) can be written as:

    P(θ_1, ..., θ_k | (n_{w_1}, ..., n_{w_k})) = P(n_{w_1}, ..., n_{w_k} | (θ_1, ..., θ_k)) P(θ_1, ..., θ_k) / P(n_{w_1}, ..., n_{w_k}).    (B.2)

P(n_{w_1}, ..., n_{w_k} | (θ_1, ..., θ_k)) is a multinomial distribution:

    P(n_{w_1}, ..., n_{w_k} | (θ_1, ..., θ_k)) = (n! / (n_{w_1}! ... n_{w_k}!)) θ_1^{n_{w_1}} ... θ_k^{n_{w_k}}    (B.3)

when n_{w_1} + n_{w_2} + ... + n_{w_k} = n. So,

    P(n_{w_1}, ..., n_{w_k}) = ∫...∫ P(n_{w_1}, ..., n_{w_k} | (θ_1, ..., θ_k)) P(θ_1, ..., θ_k) dθ_1 ... dθ_k    (B.4)
                             = ((k-1)! n! / (n_{w_1}! ... n_{w_k}!)) ∫...∫ θ_1^{n_{w_1}} ... θ_k^{n_{w_k}} dθ_1 ... dθ_k    (B.5)
                             = ((k-1)! n! / (n_{w_1}! ... n_{w_k}!)) × (n_{w_1}! ... n_{w_k}!) / (n+k-1)!    (B.6)
                             = n! (k-1)! / (n+k-1)!,    (B.7)

where 0 ≤ θ_i ≤ 1 and \sum_{i=1}^{k} θ_i = 1. See [65] for details. For a squared-error loss function,

    θ̂_i = ∫...∫ θ_i P(θ_1, ..., θ_k | (n_{w_1}, ..., n_{w_k})) dθ_1 ... dθ_k    (B.8)
         = ((n+k-1)! / (n_{w_1}! ... n_{w_k}!)) ∫...∫ θ_1^{n_{w_1}} ... θ_i^{n_{w_i}+1} ... θ_k^{n_{w_k}} dθ_1 ... dθ_k    (B.9)
         = ((n+k-1)! / (n_{w_1}! ... n_{w_k}!)) × (n_{w_1}! ... (n_{w_i}+1)! ... n_{w_k}!) / (n+k)!    (B.10)
         = (n_{w_i} + 1) / (n + k).    (B.11)

Bibliography

[1] Gentle Introduction to Rainbow. http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes/gentleintroduction.html.

[2] C. Apte, F. Damerau, and S. M. Weiss. Text Categorization: a Symbolic Approach. In ACM Trans. on Information Systems, Vol. 12, No. 3, pages 233-251, 1994.

[3] D. B. Aronow, S. Soderland, J. M. Ponte, F. F. Feng, W. B. Croft, and W. G. Lehnert. Automated Classification of Encounter Notes in Computer-Based Medical Records. In Proc. of the Eighth World Congress on Medical Informatics, Vol. 8, pages 8-12, 1995.

[4] M. Balabanovic, Y. Shoham, and Y. Yun. An Adaptive Agent for Automated Web Browsing. Technical Report SIDL-WP-1995-0023, Stanford University, 1995.

[5] H. Berghel. Cyberspace 2000: Dealing with Information Overload. In Digital Village, Communications of the ACM, Vol. 40, No. 2, pages 19-24, 1997.

[6] J. Bradshaw, editor. Software Agents. AAAI Press/The MIT Press, 1997.
[7] J. P. Callan, W. B. Croft, and S. M. Harding. The INQUERY Retrieval System. In Proc. of the 3rd International Conference on Database and Expert Systems Applications, pages 347-356, 1993.

[8] J. G. Carbonell, R. S. Michalski, and T. M. Mitchell. Machine Learning, An Artificial Intelligence Approach, chapter An Overview of Machine Learning, pages 3-23. Tioga Publishing Company, 1983.

[9] A. Chavez and P. Maes. Kasbah: An Agent Marketplace for Buying and Selling Goods. In Proceedings of the First International Conference on the Practical Application of Intelligent Agents and Multi-Agent Technology, 1996.

[10] H. Chen. Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms. In Journal of the American Society for Information Science, Vol. 46, No. 3, pages 194-216, 1995.

[11] H. Chen and V. Dhar. Cognitive process as a basis for intelligent retrieval system design. In Information Processing and Management, Vol. 27, No. 5, pages 405-432, 1991.

[12] W. W. Cohen. Fast Effective Rule Induction. In Machine Learning: Proc. of the 12th International Conference, Lake Tahoe, California, 1995.

[13] W. W. Cohen. Learning Rules that Classify E-Mail. In Proc. 1996 AAAI Spring Symposium on Machine Learning in Information Access, Stanford, 1996.

[14] M. W. Craven and J. W. Shavlik. Learning Symbolic Rules Using Artificial Neural Networks. In Proc. of the Tenth International Conference on Machine Learning, pages 73-80, 1993.

[15] V. Dasigi and R. C. Mann. Neural Net Learning Issues in Classification of Free Text Documents. In AAAI Spring Symposium on Machine Learning in Information Access Technical Papers, Palo Alto, March 1996.

[16] P. Domingos and M. Pazzani. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. In Proc. MLC96, pages 105-112, 1996.

[17] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley & Sons Inc., 1973.

[18] C. Faloutsos and D. Oard. A Survey of Information Retrieval and Filtering Methods. Technical Report CS-TR-3541, University of Maryland, 1995.

[19] D. H. Foley and J. W. Sammon, Jr. An Optimal Set of Discriminant Vectors. In IEEE Trans. Comput., C-24, pages 281-289, 1975.

[20] L. N. Foner. Yenta: A Multi-Agent, Referral Based Matchmaking System. In The First International Conference on Autonomous Agents (Agents '97), California, 1997.

[21] C. Fox. Lexical Analysis and Stoplists, pages 102-130. Prentice Hall, 1992. Edited by W. B. Frakes and R. Baeza-Yates.

[22] W. B. Frakes. Stemming Algorithms. In Information Retrieval: Data Structures and Algorithms, pages 131-160. Prentice Hall, 1992. Edited by W. B. Frakes and R. Baeza-Yates.

[23] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proc. Second European Conference on Computational Learning Theory, pages 23-37, 1995.

[24] N. Friedman and M. Goldszmidt. Building Classifiers Using Bayesian Networks. In AAAI, Vol. 2, pages 1277-1284, 1996.

[25] N. Friedman, D. Heckerman, M. Goldszmidt, and S. Russell. Challenge: Where is the Impact of Bayesian Networks in Learning? In Proc. Fifteenth International Joint Conference on Artificial Intelligence (IJCAI), 1997.

[26] G. W. Gates. The reduced nearest neighbor rule. In IEEE Trans. Information Theory, IT-18, pages 431-433, 1972.

[27] B. Gelfand, M. Wulfekuhler, and W. F. Punch.
Automatic Concept Extraction from Plain Text. Technical report, Michigan State University, 1998.

[28] G. Giacinto and F. Roli. Adaptive Selection of Image Classifiers. In Proc. of ICIAP '97 (Springer Verlag Lecture Notes in CS, Vol. 1310), pages 38-45, 1997.

[29] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. In SIAM Journal on Numerical Analysis, Series B, Vol. 2, No. 2, pages 205-224, 1965.

[30] Y. Hamamoto, T. Kanaoka, and S. Tomita. On a Theoretical Comparison between the Orthonormal Discriminant Vector Method and Discriminant Analysis. In Pattern Recognition, Vol. 26, No. 12, pages 1863-1867, 1993.

[31] S. Haykin. Neural Networks. Macmillan College, 1994.

[32] D. Heckerman. A Tutorial on Learning with Bayesian Networks. Technical Report MSR-TR-95-06, Microsoft Research, 1996.

[33] A. K. Jain and B. Chandrasekaran. Dimensionality and Sample Size Considerations in Pattern Recognition Practice. In Handbook of Statistics, pages 835-855. North-Holland, 1982. Edited by P. R. Krishnaiah and L. N. Kanal.

[34] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[35] A. K. Jain and D. Zongker. Feature Selection: Evaluation, Application and Small Sample Performance. In IEEE Trans. PAMI, Vol. 19, No. 2, pages 153-157, 1997.

[36] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: A Tour Guide for the World Wide Web. In Proc. IJCAI '97, 1997.

[37] D. Koller and M. Sahami. Toward Optimal Feature Selection. In Proceedings of the 13th International Conference on Machine Learning (ML), pages 284-292, Bari, Italy, 1996.

[38] K. Lang. NewsWeeder: Learning to Filter Netnews. In Proc. 12th International Conference on Machine Learning, pages 331-339, Tahoe City, CA, 1995.

[39] L. S. Larkey and W. B. Croft. Combining Classifiers in Text Classification. In Proc. of SIGIR '96, pages 289-297, 1996.

[40] Y. Lashkari, M. Metral, and P. Maes. Collaborative Interface Agents. In Proceedings of the AAAI '94 Conference, Seattle, Washington, 1994.

[41] J. Lee and S. M. Chung. Information Discovery on the Internet: Beyond Search Engines. In SERI Journal, Vol. 2, No. 1, pages 33-45, 1998.

[42] D. D. Lewis. Evaluating Text Categorization. In Proc. of Speech and Natural Language Workshop, pages 312-318, Kaufmann, San Mateo, CA, 1991.

[43] D. D. Lewis. Representation and Learning in Information Retrieval. PhD thesis, University of Massachusetts, 1992.

[44] D. D. Lewis and M. Ringuette. A Comparison of Two Learning Algorithms for Text Categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-93, Las Vegas, NV, 1994.

[45] H. Lieberman. Letizia: An Agent That Assists Web Browsing. In International Joint Conference on Artificial Intelligence, Montreal, 1995.

[46] C. Lynch. Searching the Internet. In Scientific American, pages 52-56, 1997.

[47] Y. Ming. Sampling strategies and learning efficiency in text categorization. In Proc. 1996 AAAI Spring Symposium on Machine Learning in Information Access, pages 88-95, Stanford, 1996.

[48] T. Mitchell. Machine Learning. McGraw-Hill, 1997.

[49] D. Mladenic. Personal WebWatcher: Design and Implementation. Technical Report IJS-DP-7472, Carnegie Mellon University, October 1996.

[50] I. Moulinier, G. Raskinis, and J.-G. Ganascia. Text Categorization: a Symbolic Approach. In SDAIR96, Las Vegas, 1996.

[51] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. In Journal of the American Society for Information Science, Vol. 41, No. 6, pages 391-407, 1990.
[52] S. A. Nene and S. K. Nayar. A simple algorithm for nearest neighbor search in high dimensions. In IEEE Trans. PAMI, Vol. 19, No. 9, pages 989-1003, 1997.

[53] H. S. Nwana. Software Agents: An Overview. In Knowledge Engineering Review, Vol. 11, No. 3, pages 1-40, 1996.

[54] M. J. Wooldridge and N. R. Jennings. Intelligent Agents: Theory and Practice. In Knowledge Engineering Review, Vol. 10, No. 2, pages 115-152, 1995.

[55] D. W. Oard and G. Marchionini. A Conceptual Framework for Text Filtering. Technical Report EE-TR-96-25, CAR-TR-830, CLIS-TR-96-02, CS-TR-3643, University of Maryland, 1996.

[56] E. Oja. Subspace Methods of Pattern Recognition. Wiley, 1983.

[57] T. Okada and S. Tomita. An Optimal Orthonormal System for Discriminant Analysis. In Pattern Recognition, Vol. 18, No. 2, pages 139-144, 1985.

[58] M. Pazzani, J. Muramatsu, and D. Billsus. Syskill & Webert: Identifying Interesting Web Sites. In AAAI Spring Symposium on Machine Learning in Information Access Technical Papers, Palo Alto, March 1996.

[59] M. Pazzani, L. Nguyen, and S. Mantik. Learning from Hotlists and Coldlists: Toward a WWW Information Filtering and Seeking Agent. In Proc. 6th International Conference on Tools with Artificial Intelligence, Washington DC, 1995.

[60] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.

[61] J. R. Quinlan. Induction of Decision Trees. In Machine Learning, Vol. 1, pages 81-106, 1986.

[62] J. R. Quinlan. Rule Induction with Statistical Data - a Comparison with Multiple Regression. In Journal of the Operational Research Society, Vol. 38, pages 347-352, 1987.

[63] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[64] E. S. Ristad. A Natural Law of Succession. Technical Report CS-TR-495-95, Princeton University, 1995.

[65] S. M. Ross. Probability Models. Academic Press, 1993.

[66] M. Sahami. Learning Limited Dependence Bayesian Classifiers. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 335-338, Portland, Oregon, 1996.

[67] G. Salton. Automatic Text Processing. Addison-Wesley, 1989.

[68] G. Salton, J. Allan, and C. Buckley. Automatic Structuring and Retrieval of Large Text Files. In Communications of the ACM, Vol. 37, No. 2, pages 97-108, 1994.

[69] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[70] S. Chen. Learning-Based Vision and its Application in Indoor Navigation. PhD thesis, Michigan State University, 1998.

[71] H. Schutze, D. A. Hull, and J. O. Pedersen. A Comparison of Classifiers and Document Representations for the Routing Problem. In SIGIR95, pages 229-237, Washington DC, 1995.

[72] E. Selberg and O. Etzioni. The MetaCrawler Architecture for Resource Aggregation on the Web. In IEEE Expert, Vol. 12, No. 1, pages 8-14, 1997.

[73] B. D. Sheth. A Learning Approach to Personal Information Filtering. Master's thesis, Massachusetts Institute of Technology, 1994.

[74] A. F. Smeaton and F. Crimmins. Using a Data Fusion Agent for Searching the WWW. In Sixth International World Wide Web Conference, Santa Clara, CA, 1997.

[75] H. Turtle and W. B. Croft. Inference Networks for Document Retrieval. In Proc. of the 13th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 1-24, Brussels, Belgium, 1990.
[76] H. Turtle and W. B. Croft. Evaluation of an Inference Network-based Retrieval Model. In ACM Trans. on Information Systems, Vol. 9, No. 3, pages 187-222, 1991.

[77] Wallace C. Koehler, Jr. An End User's View of Mining the Web: Focused and Satisfied Internet Search and Retrieval Strategies. In INET 97 Proceedings, Education 3, 1997. http://www.isoc.org/isoc/whatis/conferences/inet/97/proceedings/D3/INDEX.HTM

[78] A. R. Webb and D. Lowe. The Optimized Internal Representation of Multilayer Classifier Networks Performs Nonlinear Discriminant Analysis. In Neural Networks, Vol. 3, pages 367-375, 1990.

[79] S. A. Weiss, S. Kasif, and E. Brill. Text Classification in USENET Newsgroups: A Progress Report. In AAAI Spring Symposium on Machine Learning in Information Access Technical Papers, Palo Alto, March 1996.

[80] K. Woods, W. P. Kegelmeyer, and K. Bowyer, Jr. Combination of Multiple Classifiers Using Local Accuracy Estimates. In IEEE Trans. PAMI, Vol. 19, No. 4, pages 405-410, 1997.

[81] M. R. Wulfekuhler and W. F. Punch. Finding Salient Features for Personal Web Page Categories. In Sixth International World Wide Web Conference, Santa Clara, CA, 1997.

[82] T. W. Yan and H. Garcia-Molina. SIFT - A Tool for Wide-Area Information Dissemination. In Proc. 1995 USENIX Technical Conference, pages 177-186, 1995.