This is to certify that the thesis entitled "Text Document Classification," presented by Yonghong Li, has been accepted towards fulfillment of the requirements for the M.S. degree in Computer Science at Michigan State University.

TEXT DOCUMENT CLASSIFICATION

By

Yonghong Li

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Computer Science

1998

ABSTRACT

TEXT DOCUMENT CLASSIFICATION

By Yonghong Li

The exponential growth of the Internet has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. Document retrieval, categorization, routing, and filtering can all be formulated as classification problems. However, the complexity of natural languages and the extremely high dimensionality of the feature space of documents have made this classification problem very difficult. We investigate five different methods for document classification: the naive Bayes classifier, nearest neighbor classifier, decision trees, the subspace method, and a multilayer feedforward neural network. These classifiers were applied to seven-class Yahoo news groups (Business, Entertainment, Health, International, Politics, Sports, and Technology) individually and in combination. The naive Bayes classifier and the subspace classifier have also been tested on the enlarged Yahoo news groups and the Reuters-21578 benchmark. We studied three classifier combination approaches: simple voting, dynamic classifier selection, and adaptive classifier combination. These classifiers were also evaluated on different document representation schemes obtained by best feature selection, principal component analysis, and discriminant analysis. Our experimental results indicate that the naive Bayes classifier, the subspace method, and the multilayer feedforward neural network classifier outperform the other two classifiers on our data sets. Orthonormal discriminant analysis gives better classification results than the other dimensionality reduction schemes. Combinations of multiple classifiers do not always improve the classification accuracy compared to the best individual classifier. Among the three combination approaches, our adaptive classifier combination method introduced here performs the best. The best classification accuracy that we are able to achieve on this seven-class problem is approximately 83-85%, which is comparable to the performance of other similar studies. However, the classification problem considered here is more difficult because the pattern classes used in our experiments have a large overlap of words in their corresponding documents.

To My Husband Lin Hong

ACKNOWLEDGMENTS

I would like to acknowledge all the people who have assisted me during the years of my graduate study at Michigan State University. I am most grateful to my advisor, Dr. Anil Jain, for both his professional and personal advice, his continuous guidance, encouragement, and support, and the valuable time that he spent on projects with me.
My colleagues at the MSU PRIP lab and my friends at MSU have provided help and moral support throughout my study here; I would like to give my thanks to all of them. A special thanks to Aditya Vailaya, for the valuable discussions and help he gave me, not only as a colleague, but as a friend as well. I would like to specially acknowledge Shaoyun Chen for his great help with my projects and thesis, and for his assistance in using the artificial neural network and discriminant analysis code. I would also like to thank Chitra Dorai for her valuable suggestions and her help in adjusting to the new environment. Many thanks to Scott Connell, Salil Prabhakar, Nicolae Duta, and Marilyn Wulfekuhler for proofreading my papers and thesis, and for helpful discussions. My acknowledgment also goes to Jinhui Liu, Xiaojie Dong, and Fan Du, for their help in manually classifying the data.

No words can express my thanks to my husband, for his love, understanding, and support through all the difficulties, as well as for sharing the happiness. I must give my immense thanks to my parents; their never-fading love, care, and encouragement are of immeasurable value to me. I am also grateful to Dr. John Weng and Dr. Sridhar Mahadevan, for serving as my committee members and for reviewing the thesis. I would also like to thank Cathy Davison, Mary Gebbia, Lora Mae Higbee, and Linda Moore for their administrative assistance.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Search Engines
    1.1.1 Categorical Engines
    1.1.2 Full-Text Engines
    1.1.3 Meta Engines
  1.2 Intelligent Agents
    1.2.1 Personal Agents
    1.2.2 Collaborative Agents
  1.3 Motivation
  1.4 Problem Definition
  1.5 Thesis Outline
2 Literature Review
  2.1 Classification Techniques
    2.1.1 TF-IDF
    2.1.2 Probabilistic Methods
    2.1.3 Nearest Neighbor Classifier
    2.1.4 Symbolic Learning Methods
    2.1.5 Artificial Neural Networks
    2.1.6 Genetic Algorithms
  2.2 Dimensionality Reduction
    2.2.1 Best Feature Selection
    2.2.2 Latent Semantic Indexing
3 Classification Algorithms
  3.1 Preprocessing
  3.2 Feature Representation
  3.3 Classification Algorithms
    3.3.1 Naive Bayes Classifier
    3.3.2 Nearest Neighbor Classifier
    3.3.3 Decision Tree Classifier
    3.3.4 Multilayer Feed-Forward Network Classifier
    3.3.5 Linear Subspace Method
  3.4 Combination of Different Classifiers
    3.4.1 Simple Voting
    3.4.2 Dynamic Classifier Selection (DCS)
    3.4.3 Adaptive Classifier Combination (ACC)
4 Dimensionality Reduction
  4.1 Best Feature Selection
  4.2 Principal Component Analysis
  4.3 Discriminant Analysis
  4.4 Term Grouping in Subspace
5 Experimental Results
  5.1 Evaluation of Text Classification Effectiveness
  5.2 Individual Classifiers
  5.3 Combination of Different Classifiers
  5.4 Dimensionality Reduction
6 Conclusions and Discussion
APPENDICES
A Examples of Training Samples of the Yahoo Datasets
B Laplace's Law of Succession

LIST OF TABLES

1.1 A brief summary of characteristics of the most commonly used search engines.
5.1 Yahoo news training and test data.
5.2 Enlarged Yahoo news training and test data.
5.3 Number of documents in each category of the Reuters-21578 TOPIC set.
5.4 Contingency table of binary decisions for a test set, from [42].
5.5 Comparison of the five classification algorithms (NB, NN, DT, NNet, and SS).
5.6 Confusion matrix of the classification results using the naive Bayes classifier (NB) without dimensionality reduction.
5.7 Confusion matrix of the classification results using the nearest neighbor classifier (NN) without dimensionality reduction.
5.8 Confusion matrix of the classification results using the C5 decision tree classifier (DT) without dimensionality reduction.
5.9 Confusion matrix of the classification results using the adaptive neural network classifier (NNet) without dimensionality reduction.
5.10 Confusion matrix of the classification results using the subspace classifier (SS) without dimensionality reduction.
5.11 Classification accuracy using NB on two large data sets.
5.12 Classification accuracy of combinations of multiple classifiers.
5.13 Comparison of the feature extraction methods using NN.
5.14 Confusion matrix of the classification result using the nearest neighbor classifier with PCA feature extraction.
5.15 Confusion matrix of the classification result using the nearest neighbor classifier with LDA feature extraction.
5.16 Confusion matrix of the classification result using the nearest neighbor classifier with LDA feature extraction.
5.17 Comparison of using the subspace method with and without the term-grouping feature reduction technique on test set 1.
6.1 Comparison of the four classification algorithms (NB, NN, DT, and SS) on test set 1 for the reduced 5-class problem.

LIST OF FIGURES

1.1 The structure of a search engine, adapted from [41].
1.2 The structure of a meta-search engine, from [41].
1.3 A typical personal agent.
1.4 A typical collaborative agent.
1.5 A block diagram of our text classification scheme.
3.1 The architecture of our adaptive feed-forward neural network.
5.1 The accuracy of each algorithm with different sized feature subsets.
A.1 An example of the business news group: (a) a training sample; (b) extracted word list.
A.2 An example of the entertainment news group: (a) a training sample; (b) extracted word list.
A.3 An example of the health news group: (a) a training sample; (b) extracted word list.
A.4 An example of the international news group: (a) a training sample; (b) extracted word list.
A.5 An example of the politics news group: (a) a training sample; (b) extracted word list.
A.6 An example of the sports news group: (a) a training sample; (b) extracted word list.
A.7 An example of the technology news group: (a) a training sample; (b) extracted word list.

Chapter 1

Introduction

The World Wide Web (WWW) is a wide-area hypermedia information repository that aims to give world-wide access to a large repository of documents through the Internet. The WWW has inherent properties that make it different from traditional digital libraries in the following ways: (i) Unorganized and non-hierarchical: the information on the Web is distributed using a client/server model. The collection of documents resides on servers, and viewers access it from clients. Documents are connected by hyperlinks. In fact, the WWW was not really designed for organized information retrieval. (ii) Dynamic: it is very easy for a user to post information, add new sites, and update existing documents, as well as remove old files from the Web.

The WWW has exhibited exponential growth over the last few years, since it was created in 1989 at the European Laboratory for Particle Physics (CERN) in Geneva, Switzerland. A pessimistic estimate is that, in the middle of 1996, the WWW consisted of more than 60 million documents on 12 million hosts and 600,000 servers, up from 9 million hosts and 250,000 servers at the beginning of the year [77], and it is growing every day.

The rapid growth of the Internet and the characteristics of the WWW have resulted in an information overload [41, 5, 46], which makes it increasingly difficult for users to locate relevant information quickly. This has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. In the following sections, we provide a brief introduction to the various search engines and intelligent agents that are available to users.

1.1 Search Engines

Search engines were designed to reduce the effort and information overload on the Web. Commercial search engines, such as Yahoo, HotBot, InfoSeek, WebCrawler, MetaCrawler, and Lycos, are examples of tools that construct indices of Web resources and find information requested by a user. Basically, there are three different kinds of search engines: categorical, full-text, and meta (see A Search Engine Primer, http://131.252.56.87/TandT/SENGINE.HTM). They differ in their basic search logic. All search engines perform their function a bit differently, so their search results can be quite different.

1.1.1 Categorical Engines

Categorical engines, like Yahoo (http://www.yahoo.com) and Magellan (http://www.mckinley.com), are more of a subject guide than a search engine. The guide is hierarchically organized into topic categories by human analysts, each category containing a list of manually indexed web pages. For example, Yahoo lists more than 200,000 Web sites under 20,000 different categories. The hierarchical organization provides a good structure for browsing.
Starting with no specific index terms, a user of this kind of search engine can quickly find information on various topics. A nice property of categorical engines is that they extract high-level semantics (topics) and construct links between topics (directories and sub-directories). However, manual examination of each document is expensive, time-consuming, and tedious. Thus, these engines are faced with a trade-off between the size and the quality of their directory. Moreover, the larger the number of analysts involved in examining the web documents, the larger the intra-class variability.

1.1.2 Full-Text Engines

All current search engines seem to follow a gathering-indexing-matching strategy. Figure 1.1, adapted from [41], shows the structure of a search engine. These engines collect Web resources by periodically dispatching programs, known as Web robots, crawlers, wanderers, spiders, worms, and ants (see The Web Robots Pages, http://info.webcrawler.com/mak/projects/robots/robots.html), to the Web. Indexing robots automatically traverse the Web's hypertext structure by starting from one document and recursively retrieving all documents that are referenced. Unlike categorical engines, these search engines do not organize pages according to topics; they scan and analyze the internal text of a page, and perform keyword-based searches against an indexed database. The indexing and searching schemes of these search engines are similar, but they may differ in their database size, update frequency, search strategy and capability, and ways of presenting the search results.

[Figure 1.1: The structure of a search engine, adapted from [41]. The user submits a query through a search interface; the engine matches it against an index built by a Web-crawling robot and returns results.]

Search engines mainly follow the so-called location/frequency rule to determine the relevancy between the query keywords and retrieved documents. It is based on the assumption that any page relevant to the topic will mention those keywords right from the beginning, such as in the heading or in the first few lines of the page. Keyword frequency is the other major factor in determining relevancy: pages with a higher occurrence of keywords are more likely to be relevant than other Web documents. Some search engines add other strategies to "boost" the relevancy. For example, WebCrawler uses link popularity as part of its ranking method. A search engine features chart (http://www.searchenginewatch.com/webmasters/features.html, as of Feb. 2, 1998) provides a summary of important features of the most commonly used search engines; we briefly summarize these properties in Table 1.1.

Table 1.1: A brief summary of characteristics of the most commonly used search engines. These results are from the features chart cited above.

Engine     | Size (millions of pages) | Update freq.     | Ranking (in addition to location/frequency) | Comments
-----------|--------------------------|------------------|---------------------------------------------|---------
AltaVista  | 100                      | 1 day to 1 month | -                                           | one of the largest and most comprehensive search engines
Excite     | 55                       | 3 weeks          | star review                                 | best for finding widely discussed, mainstream topics
HotBot     | 110                      | 2 weeks          | -                                           | easy to scale up
InfoSeek   | 30                       | 1 to 2 months    | keywords in meta tag                        | one of the best tools for comprehensive search relevancy and speed
Lycos      | 30                       | 1 to 2 weeks     | -                                           | also offers a subject-based directory of the most popular pages from its database
WebCrawler | 2                        | -                | keywords in title, link popularity          | good for beginners with keyword searching
Since each search engine has a different strategy for selecting pages to index, and none of the search engines has a completely indexed database, the coverage of the individual engines is relatively low, i.e., searching with a second engine would often return several documents that were not returned by the first engine. Also, because search engines may have down times, and because of network traffic congestion, an engine's response time is often dynamic. Meta engines are a new breed of search engine which combine the results of several engines for a given set of query keywords.

1.1.3 Meta Engines

The structure of a meta-search engine given in [41] is shown in Figure 1.2. This kind of engine provides a uniform query interface, accepts the query keywords entered by a user, and sends the query to many different search engines. It then collects the results from the various search engines, analyzes them, eliminates duplicates, re-ranks pages on the basis of relevancy, and summarizes the information. MetaCrawler [72] and Fusion [74] are examples of such search engines.

[Figure 1.2: The structure of a meta-search engine, from [41]. A meta-search interface forwards the user's query to n underlying search engines (Search Engine_1 through Search Engine_n) and merges their results.]

Human indexing and categorization of Web pages are time-consuming and tedious. It is hard to keep pace with the explosive growth of the Internet by analyzing each document manually. Moreover, keyword-based search is not always accurate enough; it is not uncommon that search engines, in response to a query, return some sites that have little to do with a user's interest. This has led to the development of intelligent agents, which are playing an important role in making the Internet more usable [53, 6, 54].

1.2 Intelligent Agents

Intelligent agents can sense their environments, act on behalf of their owners, interact with other agents, and take actions in order to achieve their goals. They are being used in many applications, such as:

1. Electronic Commerce. As an example, AuctionBot (http://auction.eecs.umich.edu) is a multi-purpose server used to create automated Internet auctions according to user specifications, or to bid in an existing AuctionBot auction. Kasbah [9] (http://ecommerce.media.mit.edu/Kasbah/) is MIT's buying and selling agent in an electronic marketplace; it is a multi-agent research project which aims at fundamentally transforming the way people make transactions. Jango (http://jango.com) is a tool for shopping on the web, billed by NetBot as the world's first intelligent shopping assistant.

2. Entertainment. As an example, eGenie (http://egenie.opensesame.com) is a premier source for personalized entertainment information on the Web. It lets users explore categories of entertainment that include music, movies, books, events, and TV. eGenie is the showcase of Learning Sesame, which can learn from a user's interests and dynamically generate Web pages customized to the user's taste. MORSE (http://www.labs.bt.com/innovate/multimed/morse) is a movie recommendation system that can suggest new movies based on a user's previous rating record.

3. Internet. As an example, Autonomy Intelligent Agents (http://ultra.agentware.com/dailyme) is a news filtering system that automatically compiles a personalized newspaper of interest to the user. A user can train an agent by first typing a short description; the agent can then enhance its knowledge of the user's interests by using the new features learned from the user's past experience.
LIRA [4], developed at Stanford University, stands for Learning Information Retrieval Agent. It can help users browse or "surf" the Web. The system observes how users rank various pages they visit. It then retrieves new pages on its own, and presents only those which it believes the user may want. WebWatcher [36], created at Carnegie Mellon University, is a "tour guide" agent for the WWW. Given a description of a user's current interests (keywords), WebWatcher accompanies the user from page to page as he/she browses the Web, and recommends the hyperlinks that it believes to be highly relevant to the user's interests. It learns the strategy for giving advice from the feedback it receives during earlier tours.

Agents can work independently, known as personal agents, or work jointly in an intellectual endeavor, known as collaborative agents [40].

1.2.1 Personal Agents

Personal agents are usually used to model users' long-term interests. They take a user's interests, such as keywords, a brief description, bookmarks, and feedback, as the user's profile, and make predictions about the user's behavior based on personal history. Figure 1.3 shows a typical personal agent.

[Figure 1.3: A typical personal agent. The agent maintains a user model/profile and a database, takes queries and feedback from the user, and returns search results.]

For example, WBI (http://www.networking.ibm.com/wbi/wbisoft.htm), IBM's Web Browser Intelligence agent, can personalize a web user's experience, and give a web server the ability to add or modify web content on the fly. WebLearner [59] maintains a user's preferences as a hotlist (for links that are interesting) and a coldlist (for links that are not interesting), analyzes the document from each link, and recommends new pages that might interest the user. Letizia [45] is another Web browsing assistant. It keeps track of the user's browsing process, explores links concurrently and automatically from the user's current position by using the knowledge inferred from the user's behavior, and attempts to suggest pages that are suited to the user's interests.

1.2.2 Collaborative Agents

Collaborative agents interact and cooperate with each other to perform tasks on behalf of a user. Figure 1.4 shows a typical collaborative agent. The design of collaborative agents involves problem solving, communication, and coordination strategies for agents to maintain autonomy. They must also be able to efficiently use the available network bandwidth to communicate with other agents. Agent collaborations may involve heterogeneous or homogeneous groups of agents, and agents with similar or differing goals, languages, and knowledge representation facilities. For example, WiseWire (http://www.wisewire.com) uses a combination of content filters, collaborative filters, and learning agents to deliver interesting documents to users. It encodes the key conceptual material from document contents, sifts out uninteresting pages, categorizes them, and recommends information rated highly by one user to other users with similar interests [38]. Ginkgo (http://www.networking.ibm.com/iag/iaginkgo.htm), developed at IBM, is an agent-based learning system. It uses personal learning as well as collaborative learning to model a user's preferences, and predicts a user's behavior based on personal history, past experience, and on similar
Yenta15 under development at MIT, is a matchmaking system that aims at provid- ing privacy-protected, distributed, automatic generation of clusters of users who are interested in Similar topics [20]. Personal Agent_1 Personal Personal Agent_n Agent_2 Personal Agent_3 Figure 1.4: A typical collaborative agent 1 .3 Motivation Many tasks in information retrieval can be formulated as a document classification problem [43]. These tasks include: 15http: / / foner.www.media.mit.edu / people / foner / Yenta 12 1. Document Retrieval: Document retrieval is to select a set of documents from an indexed database, usually in response to some user queries. User queries could range from a few keywords to multi-sentence descriptions of an information item that is needed. A vast majority of retrieval systems currently in use range from simple Boolean systems (e.g., search engines) to systems using statistical or nat- ural language processing (e.g., intelligent agents). A document retrieval system usually consists of four main phases: indexing, query formulation, comparison, and feedback [43]. 2. Document Filtering and Routing: Also known as selective dissemination of information, is an information seeking system to sort through large volumes of dynamically generated information and present those documents to the user which satisfy a relatively stable and specific information need. It is similar to document retrieval, but emphasizes the dynamic environment and specific long- term interests [55]. It may involve indexing, user profiles modeling, adaptation of user profiles, matching, and social filtering etc. SIFT [82], FireFlyl6 and WiseWire are some examples of available document filtering systems. 3. Document Categorization: It is the assignment of a document to pre-defined categories. A category label can be assigned to a document manually, such as in Yahoo. However, manual assignment of categories requires considerable human labor, time, and cost. An automatic document categorization system is desirable to reduce these costs substantially. 16http: //www.firefiy.com 13 4. Document Clustering: Classification is ambiguous in information retrieval. Peo- ple often refer to clustering as a classification task. Clustering is to discover the underlying structure (categories) of a collection of documents based on some Similarity measure. It is an unsupervised classification. Text classification can also be incorporated with push technology, such as Point- C’ast Network”. Push technology involves software that is loaded on the machines connected to the Internet. The software works in the background to compile Internet sources related to a user’s selected areas of interest, and automatically delivers news and other specified information from the Internet to the user’s desktop. Text classification presents many challenges and difficulties. First, it is difficult to capture high-level semantics and abstract concepts of natural languages just from a few individual words. For instance, there are many ways to represent Similar concepts (e.g., agent, softbot, robot, or bot), and the same word can represent different mean- ings (e.g., bank can either mean a financial institution or a river bank). Furthermore, semantic analysis, which is a major step in designing a natural language information retrieval system, is not well understood, although there are some techniques that have been successfully applied to limited domains [18]. 
Noise, high dimensionality (thousands of features), and variable length of content are some of the other undesirable characteristics of the huge number of documents on the Web. Due to these difficulties, there is a tradeoff between the efficiency and the accuracy of a classification system.

1.4 Problem Definition

We focus on supervised text classification in this thesis. A typical classification problem can be stated as follows: given a set of labeled examples belonging to two or more classes (training data), classify a test sample to the class with the highest similarity. Document retrieval, routing, and filtering systems can often be viewed as two-class classifiers which label a document as relevant or non-relevant [59, 82, 38]. User feedback provides a set of training examples with positive and negative labels. A document is presented to the user if it is classified as belonging to the relevant class. In document categorization, we already have human-indexed training data available, so a classifier is used to automatically assign a given previously unseen document to the appropriate class. Examples include deciding which newsgroup an article belongs to [79], which folder an email message should be directed to [13], or automatic coding of diagnoses in patient records [47].

In this thesis, we concentrate on document categorization, i.e., we deal with a multi-class classification problem. Figure 1.5 shows a block diagram of our text classification scheme.

[Figure 1.5: A block diagram of our text classification scheme. Training and test documents are preprocessed (stopword and low-frequency word removal), converted to feature representations, and passed to one or more classifiers that output class labels.]

We report experimental results using news data from the Yahoo web site, which are categorized into seven groups (business, entertainment, health, international, politics, sports, and technology), and the Reuters-21578 newswire benchmark (http://www.research.att.com/~lewis), which involves 135 topical categories. Each item of news is indexed manually by human experts (i.e., we have labeled training examples and the "ground truth" of test samples). The contributions of this thesis are as follows:

1. Our classification system addresses a multi-class classification problem, rather than the two-class problems typical of the classification systems [44, 58, 71] reported in the literature.

2. We apply the subspace method to text classification. It performs well on the Yahoo dataset, but poorly on the Reuters dataset.

3. We apply five different classification methods to our datasets: the naive Bayes classifier, nearest neighbor classifier, decision trees, the subspace method, and an artificial neural network method. These classifiers are also evaluated on different document representation schemes obtained by best feature selection, principal component analysis, linear discriminant analysis, and term grouping in subspace.

4. We also investigate whether performance can be improved by a combination of different classifiers. Our experimental results indicate that a combination of multiple classifiers does not always improve classification accuracy compared to the best individual classifier. The adaptive classifier combination method introduced in this thesis outperforms simple voting and dynamic classifier selection.

1.5 Thesis Outline

The outline of this thesis is as follows. Chapter 2 briefly reviews some relevant work on text document classification. Chapter 3 describes the five text classification methods we have used in our experiments, and investigates combinations of multiple classifiers.
Dimensionality reduction techniques are introduced in Chapter 4. We present our experimental results in Chapter 5. Finally, we conclude with discussions and future work in Chapter 6.

Chapter 2

Literature Review

In the past few decades, the availability of cheap and effective storage devices and information systems, and the rapid growth of the Internet, have created an immense information bottleneck. Most commercial information retrieval systems (e.g., search engines) still rely on conventional inverted index and Boolean querying techniques. An inverted index catalogs a collection of objects in their textual representation: given a set of documents, keywords and other attributes (possibly including relevance ranking) are assigned to each document, and the inverted index is the list of keywords with links to the corresponding documents, sorted on the key values to allow rapid searching for a particular key value. But these techniques often produce less than satisfactory results. Probabilistic models have been used to improve the performance of information retrieval, filtering, routing, and categorization. Since the late 1980s, knowledge-based techniques have been used extensively by information science researchers. These techniques have attempted to capture users' and information specialists' domain knowledge, classification scheme knowledge, effective search strategies, and query refinement heuristics in the design of information systems [11]. More recently, machine learning approaches, such as symbolic inductive learning methods, artificial neural networks, and genetic algorithms, have also been utilized [10].

In this chapter, we briefly review several popular techniques used in document retrieval, filtering, routing, and categorization, paying particular attention to classification methods. We also review some feature selection and extraction techniques that are commonly used in text document classification systems, because of the high dimensionality of the feature vector (usually on the order of several thousand) often encountered in this problem.

2.1 Classification Techniques

There are several ways to design a classifier, from typical techniques in information retrieval, such as TF-IDF, to pattern recognition and machine learning techniques, such as decision trees, the naive Bayes classifier, nearest neighbor, artificial neural networks, and genetic algorithms. They all have the same goal: given a set of training samples along with category labels, classify a test sample to the one or several most appropriate categories.

2.1.1 TF-IDF

TF-IDF is one of the most successful techniques in information retrieval. A document is represented as a vector of weighted terms. Terms that appear frequently in one document (TF = term frequency), but rarely occur outside the document (IDF = inverse document frequency), have larger weights. The weights can be normalized according to the length of the documents. In the Syskill & Webert [58] and NewsWeeder [38] systems, the TF-IDF vectors of all examples of one class are averaged to get a prototype vector for the class. A test document is assigned to the class that has the highest cosine similarity, i.e., the smallest angle between the TF-IDF vector of the test pattern and the class prototype vector.
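As a concrete illustration, the prototype scheme can be sketched in a few lines of Python. This is a minimal reading of the description above, not code from Syskill & Webert or NewsWeeder; documents are assumed to arrive as pre-tokenized lists of words, and all names are ours.

```python
import math
from collections import Counter, defaultdict

def tfidf(tokens, doc_freq, n_docs):
    """TF-IDF weights for one document: term frequency times log inverse doc frequency."""
    return {t: f * math.log(n_docs / doc_freq[t])
            for t, f in Counter(tokens).items() if t in doc_freq}

def cosine(u, v):
    """Cosine of the angle between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_prototypes(docs, labels):
    """Average the TF-IDF vectors of all training examples of each class."""
    n_docs = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))
    sums, counts = defaultdict(Counter), Counter(labels)
    for d, c in zip(docs, labels):
        sums[c].update(tfidf(d, doc_freq, n_docs))
    protos = {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}
    return protos, doc_freq, n_docs

def classify(tokens, protos, doc_freq, n_docs):
    """Assign the class whose prototype has the smallest angle to the test vector."""
    v = tfidf(tokens, doc_freq, n_docs)
    return max(protos, key=lambda c: cosine(v, protos[c]))
```

A hypothetical call sequence would be `protos, df, n = train_prototypes(train_docs, train_labels)` followed by `classify(test_tokens, protos, df, n)`.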
2.1.2 Probabilistic Methods

Probabilistic methods have been widely used in information retrieval. The naive Bayes classifier has been demonstrated to be a powerful classifier in several domains [16, 44, 58]. Domingos et al. [16] conducted an empirical study on 28 domains, a varied selection of databases from the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). They showed that the naive Bayes classifier was more accurate than a decision tree (C4.5) in 16 domains. Lewis et al. [44] compared their ProbBayes method with a decision tree classifier (DT-min10) on two data sets. One of the data sets was the Reuters-22173 newswire stories, which have been manually indexed using 135 financial topic categories. The second data set consisted of 1,500 documents from the U.S. Foreign Broadcast Information Service that had been used in MUC-3. ProbBayes essentially uses the Bayes rule to estimate P(c_j = 1 | D), the a posteriori probability of category c_j given the document. A two-class classification was applied to each of the categories for a given test pattern. For the Reuters-22173 data set, the Bayes method and the decision tree method gave similar classification performance; the best break-even value (where the recall is equal to the precision) was about 65%. For the MUC-3 data set, DT-min10 performed slightly better, with a best break-even value of about 50%.

Bayesian networks, also known as inference networks, are generalizations of the naive Bayes classifier. A Bayesian network is a directed acyclic graph representation of the joint probability distribution for a set of variables, where each node in the graph represents a variable, and each edge represents the correlation between the two variables [32]. Bayesian networks were originally designed to encode the uncertain knowledge of an expert [60], and in recent years they have been applied to a variety of fields, such as expert systems, diagnosis engines, information retrieval, and decision support systems [25]. In [75, 76, 7], Turtle, Croft, et al. used Bayesian networks for information retrieval. Their Bayesian network-based retrieval systems consist of two parts: a document network and a query network. In the document network, nodes representing documents are connected to nodes representing terms. The query network is constructed so that terms in the query are connected to nodes representing how the terms should be combined. To perform a retrieval, the system connects these two networks together. Given the prior probabilities associated with the documents and the conditional probabilities of internally connected nodes, the a posteriori probability of each document in the collection, given the query, can be computed using the Bayes rule. The system then ranks the documents by this probability.

The underlying assumption of the naive Bayes approach is that, for a given class, the probabilities of terms occurring in a document are independent of each other. Bayesian inference networks, on the other hand, reflect the dependence between the terms. Several studies have attempted to relax the strong independence assumption of the naive Bayes classifier by incorporating Bayesian inference networks [66, 24].

2.1.3 Nearest Neighbor Classifier

For a given test document, the nearest neighbor classifier can be viewed as an attempt to estimate the a posteriori probability of a category from the training patterns. This rule assigns a test document to the category of its nearest training pattern.
A K-nearest neighbor classifier assigns a test document to the category which is most frequently present among its K nearest neighbors. Its simplicity and ease of implementation have made it popular in information retrieval systems. For example, Pazzani et al. [58] applied the nearest neighbor rule in their Syskill & Webert system, an agent that identifies interesting web pages by learning from a user's profile. Weiss et al. [79] showed that a classification-based information retrieval method using the nearest neighbor classifier performed better than the TF-IDF prototype method on USENET newsgroup datasets.

2.1.4 Symbolic Learning Methods

Symbolic learning techniques have been extensively studied during the last few decades [8, 10]. Among these methods, decision trees, such as ID3 [61] and C4.5 [63], are the most popular classifiers that have been used in information retrieval, in applications such as medical record classification [3], Web page selection [58], and news story categorization [44]. Decision trees classify patterns by starting from the root of the tree and traversing down to the leaves. Leaf nodes represent categories, while interior nodes represent attributes. When constructing a tree, attributes are selected based on their discrimination abilities.

Another type of symbolic learning is the rule-based method. Rules can be automatically induced from decision trees [63] and from neural networks [14], and they can also be learned from examples [12, 13, 50, 2].

2.1.5 Artificial Neural Networks

Artificial neural networks, e.g., multilayer perceptrons and radial basis networks, seem to fit well with conventional retrieval methods in information science [10], such as the vector space model [67]. A typical multilayer feedforward network [31, 48] learns the weights for its interconnections by using a gradient descent method to minimize the squared error between the network output values and the target values for these outputs. Schutze et al. [71] used neural networks trained by backpropagation for the document routing problem. Two types of neural networks were used in their work: one is a linear neural network which consists of only input and output units, and the other is a non-linear neural network which has one hidden layer with three units. They demonstrated that neural networks, linear discriminant analysis, and logistic regression classifiers perform 10-15% better than relevance feedback via Rocchio expansion on the TREC-2 and TREC-3 routing tasks. In another similar study [15], LSI (latent semantic indexing)-based neural networks were employed for classification of 10 categories of TREC-1 newswire data. The experiments were run in two configurations: (i) a single-sensor neural net, where the inputs are based on LSI alone, and (ii) a two-sensor neural net, which uses 10 additional keywords as inputs. Their experimental results showed that the neural network methods perform 4% to 20% better than the LSI method alone. Artificial neural network techniques have also been employed in commercial intelligent agents, such as the WiseWire collaborative filtering engine (http://www.wisewire-corp.com/indexprod.html).

2.1.6 Genetic Algorithms

Another type of learning method, genetic algorithms (GA), is also popular in information science. Chen [10] surveyed several implementations of genetic algorithms in information retrieval. Sheth [73] used a genetic algorithm to build a personal information filtering system, called Newt.
During the learning phase in Newt, a genetic algorithm was used to model the population characteristics and behavior in response to changing user interests. Using crossover and mutation, new members were added to the population, while unfit members were removed during each generation.

2.2 Dimensionality Reduction

Document classification is often characterized by the high dimensionality (often thousands of features) of the associated feature space and a relatively small number of training samples. The feature set is often "noisy" and may contain redundancy [43], which may lead to the problem of the "curse of dimensionality" [33]. Moreover, a large number of features results in larger computational and storage complexity. There are several ways to reduce the feature set size. Traditional feature selection approaches in pattern recognition, such as sequential forward/backward selection and "plus l-take away r" selection [35], may perform better, but they are often computationally demanding. Here, we briefly review the two most popular feature reduction methods in information retrieval: best feature selection and latent semantic indexing (LSI).

2.2.1 Best Feature Selection

In best feature selection, the goal is to find the best subset of features from the entire feature space. Salton et al. [69] provided a weighting scheme, such as TF-IDF, for terms in the document collection. Schutze et al. [71] applied a chi-squared measure of dependence to a relevant/non-relevant contingency table to indicate the importance of the features. Mladenic [49] suggested the use of mutual information to assign weights to the terms in the documents. Koller et al. [37] developed an efficient algorithm for feature selection based on information theory. Their empirical results indicated that the algorithm can effectively handle datasets with a large number of features (up to 1675 features).

2.2.2 Latent Semantic Indexing

The latent semantic indexing (LSI) method developed at Bellcore [51] is a vector space information retrieval method which has been widely adopted in information filtering and retrieval systems [18]. In the LSI method, introduced in [51, 18], a singular value decomposition (SVD) is performed on a term-document matrix formed from a set of training patterns. Recall that each entry in the term-document matrix indicates the term frequency in the corresponding document. Only the large singular values are preserved. The resulting singular vector and singular value matrices are used to project the term frequency vectors for documents and queries onto a reduced feature subspace where the semantic relationships in the term-document matrix are preserved. In a query system, documents can be ranked according to their similarity to a query by using these projected subspace vectors. This technique has been successfully used in Salton's SMART system [68]. LSI has also been used for feature reduction in document classification systems [71, 15].

Chapter 3

Classification Algorithms

In this chapter, we introduce the different classifiers used in our study of text document classification. We formulate text classification as a multi-class classification problem, i.e., assigning a test pattern to one of the pre-defined categories (see Figure 1.5). The vector space model [67] is used to represent document features. During the preprocessing stage, a simple lexical analysis is performed to convert an input stream of characters into a stream of "useful" words.

3.1 Preprocessing

Web documents are written in HTML. We use the following preprocessing steps to construct the so-called indexing file, which contains those words that may be helpful in text classification:

1. Use an HTML parser to filter out HTML tags.
2. Convert all characters to lower case.
3. Remove digits and punctuation.
4. Screen out words in a stoplist.
5. Remove low-frequency words (those which occur only once in the training examples).

Words in the stoplist are known as stopwords; these are the most frequently occurring words in English, such as "the", "some", "of", etc., and are not considered helpful in retrieval. The stoplist we choose is the one given in [21].

Other preprocessing steps, like stemming [22], can be applied to further reduce the size of the indexing files. An effective stemming is helpful in retrieval and classification, because words with the same stem should have similar meanings. However, over-stemming [22] can cause originally unrelated words to be represented by the same stem, in which case it may actually degrade the performance of a classification system. For example, "international" and "internal" are both stemmed to "intern", but they originally have different meanings.
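To make the five steps concrete, here is a minimal Python sketch of the pipeline. The `TagStripper` class, the toy stoplist, and the `train_freq` argument are illustrative stand-ins for the actual parser and the full stoplist of [21] used in the thesis.

```python
import re
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Step 1: filter out HTML tags, keeping only the text content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

STOPLIST = {"the", "some", "of", "a", "an", "and", "to", "in", "is"}  # stand-in; [21] is much larger

def index_words(html, train_freq=None):
    stripper = TagStripper()
    stripper.feed(html)
    text = " ".join(stripper.chunks).lower()            # step 2: lower case
    words = re.findall(r"[a-z]+", text)                 # step 3: drop digits and punctuation
    words = [w for w in words if w not in STOPLIST]     # step 4: stoplist screening
    if train_freq is not None:                          # step 5: drop words seen only once
        words = [w for w in words if train_freq.get(w, 0) > 1]
    return words
```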
3.2 Feature Representation

The feature space of documents can be represented by the indexing files generated above. We adopt the commonly used "bag-of-words" document representation scheme (vector space model), in which we ignore the structure of a document and the order of words in the document. The word-list W = (w_1, ..., w_d) of the training set consists of all the distinct words (terms) that appear in the training indexing files. Typically, there can be several thousand features in document classification (the number of commonly used English words is between 20,000 and 50,000). Given a document D, its feature (term) vector is represented by T = (t_1, ..., t_d), constructed from W. The value of each component of T can be either binary (a value of 1 indicates that the corresponding word appeared in the document) or an integer representing the number of times the corresponding word was observed. Some training examples and the corresponding indexing files of each Yahoo news group are shown in Appendix A.

3.3 Classification Algorithms

In the following sections, we describe the naive Bayes classifier, nearest neighbor classifier, decision tree classifier, and artificial neural network classifier used in our study. We also introduce our use of the subspace method for text document classification.

3.3.1 Naive Bayes Classifier

The naive Bayes classifier [48], also known as the simple Bayes classifier, has been successfully used in text classification [1, 58]. Let C = (c_1, ..., c_m) be the set of m document classes. Given a new unlabeled document D, its word-list W' = (w_1, ..., w_{d'}) (defined in the same way as the word-list for the training set), and its corresponding binary feature (term) vector T' = (t_1, ..., t_{d'}), D is described by the conjunction of a set of attribute values t_i, i = 1, ..., d'. Using the Bayes rule, the a posteriori probability of class c_j, given document D, can be written as:

    P(c_j | D) = P(c_j | t_1, \ldots, t_{d'}),    (3.1)

where

    P(c_j | t_1, \ldots, t_{d'}) = \frac{P(t_1, \ldots, t_{d'} | c_j) P(c_j)}{P(t_1, \ldots, t_{d'})}.    (3.2)

The underlying assumption of the naive Bayes approach is that, for a given class c_j, the probabilities of words occurring in a document are independent of each other, i.e.,

    P(t_1, \ldots, t_{d'} | c_j) = \prod_{i=1}^{d'} P(t_i | c_j).    (3.3)

The naive Bayes approach assigns D to a class c_NB as follows:

    c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{d'} P(t_i | c_j),    (3.4)

where P(c_j) is the a priori probability of class c_j and P(t_i | c_j) is the conditional probability of word w_i, given class c_j. Both P(c_j) and P(t_i | c_j) are estimated from the training data. When the size of the training set is small, the relative frequency estimates of the probabilities P(t_i | c_j) will not be reasonable. For example, if a word never appeared in the given training data, its relative frequency estimate will be zero. So, we applied Laplace's law of succession [64] to estimate P(t_i | c_j). It is essentially a Bayesian estimate of the multinomial parameters. The Bayesian estimate of P(t_i | c_j) is given as:

    P(t_i | c_j) = \frac{n_{ij} + 1}{n_j + k},    (3.5)

where n_j is the total number of word occurrences in class c_j, n_{ij} is the number of occurrences of word w_i in class c_j, and k is the vocabulary size of the training set. This Bayesian estimation is based on a uniform prior assumption, i.e., the various word occurrences in class c_j are a priori equally likely. A detailed derivation of this estimate can be found in Appendix B.
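A compact Python sketch of this classifier follows, using logarithms in place of the product of Eq. (3.4) to avoid numerical underflow. It takes a multinomial reading of Eqs. (3.4)-(3.5) over the tokens of the test document, which is one plausible reading of the text rather than the thesis's exact implementation; all function names are ours.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate the quantities needed by Eqs. (3.4)-(3.5) from tokenized training docs."""
    class_docs = Counter(labels)                    # for the priors P(c_j)
    n_ij = defaultdict(Counter)                     # n_ij: occurrences of word i in class j
    for doc, c in zip(docs, labels):
        n_ij[c].update(doc)
    k = len({w for counts in n_ij.values() for w in counts})   # vocabulary size
    return class_docs, n_ij, k, len(docs)

def classify_nb(tokens, class_docs, n_ij, k, n_train):
    """Eq. (3.4): argmax over classes of prior times smoothed likelihoods."""
    def log_posterior(c):
        n_j = sum(n_ij[c].values())                 # total word occurrences in class c
        lp = math.log(class_docs[c] / n_train)      # prior P(c)
        for t in tokens:
            lp += math.log((n_ij[c][t] + 1) / (n_j + k))  # Laplace's law, Eq. (3.5)
        return lp
    return max(class_docs, key=log_posterior)
```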
3.3.2 Nearest Neighbor Classifier

The nearest neighbor (NN) decision rule assigns an unlabeled document D to the class c_i if the training pattern closest to D is from class c_i. A variation of the nearest neighbor rule is the K-nearest neighbor (KNN) rule, which first finds the K nearest neighbors of D among the training patterns, and then uses a voting scheme to assign a category label to D. The NN/KNN rule can be viewed as an attempt to estimate the a posteriori probabilities P(c_j | D) from the training patterns. The number of nearest neighbors (K) should be small compared to the total number of training samples; a rule of thumb is that K should be proportional to \sqrt{N}. The NN rule is used in our text classification for simplicity. A number of algorithms can be adopted to reduce the complexity of nearest neighbor search, such as nearest neighbor editing [17], the reduced nearest neighbor rule [26], and an effective algorithm for nearest neighbor search in high dimensions [52].

We use the TF-IDF weighting scheme (TF is the term frequency in a document, and IDF is the inverse document frequency) and use the cosine similarity [69], instead of Euclidean distance, to measure the similarity of documents. Given two documents D_1 and D_2, their corresponding weighted feature vectors are represented as T_1 = (t_{1i} \delta_i)_{i=1}^{d} and T_2 = (t_{2i} \delta_i)_{i=1}^{d}, where \delta_i is the weight of word w_i (based on TF-IDF). The similarity between D_1 and D_2 is then defined as:

    S(D_1, D_2) = \frac{T_1^T T_2}{\|T_1\| \|T_2\|},    (3.6)

where \|\cdot\| denotes the norm of the vector.
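The decision rule reduces to a few lines of Python; this is a sketch assuming documents are sparse dicts mapping terms to TF-IDF-weighted values (the helper names are ours):

```python
import math

def cosine_sim(v1, v2):
    """Eq. (3.6): S(D1, D2) = T1.T2 / (||T1|| ||T2||) for sparse dict vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def nn_classify(test_vec, train_vecs, train_labels):
    """Assign the test document the label of its most similar (1-NN) training pattern."""
    best = max(range(len(train_vecs)),
               key=lambda i: cosine_sim(test_vec, train_vecs[i]))
    return train_labels[best]
```

A linear scan is used here for clarity; the editing and high-dimensional search algorithms cited above would replace it in a large-scale system.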
3.3.3 Decision Tree Classifier

Decision trees are one of the most widely used inductive learning methods. Their robustness to noisy data and their capability to learn disjunctive expressions seem suitable for document classification. One of the most well-known decision tree algorithms is ID3 [61], with its successors C4.5 [63] and C5. It is a top-down method which recursively constructs a decision tree classifier. The interior nodes of the tree are associated with specific attributes (e.g., terms in the document collection), and the leaves of the tree represent specific categories. At each level of the tree, ID3 selects the attribute that has the highest information gain. Information gain is simply the expected reduction in entropy caused by partitioning the training examples according to an attribute [48]. Given a collection of training samples Tr, the information gain for an attribute W is defined as [48]:

    Gain(Tr, W) = Entropy(Tr) - \sum_{v \in Values(W)} \frac{|Tr_v|}{|Tr|} Entropy(Tr_v),    (3.7)

    Entropy(Tr) = \sum_{i=1}^{m} -p_i \log_2 p_i,    (3.8)

where Values(W) is the set of values which W can take, Tr_v is the set of training samples for which Value(W) = v, and p_i is the proportion of samples that belong to class i.

In summary, ID3 searches a complete hypothesis space, and uses the statistical properties (e.g., information gain) of all training examples at each step in the search. This makes it more robust to noisy training data. Furthermore, it uses reduced-error pruning [62] to avoid overfitting in decision tree learning. Mitchell analyzes the ID3 decision tree in more detail in [48].

For our experiments, we chose the C5 decision tree package, since it has many nice features over its predecessors, ID3 and C4.5. For example, the rulesets used in C5 are more accurate, faster, and require less memory (http://www.rulequest.com/see5-comparison.html). Furthermore, adaptive boosting [23] is incorporated into the software. The basic idea of boosting is to generate n (n > 1, specified by the user) classifiers (either decision trees or rule sets) instead of one. The ith classifier is constructed by examining the errors made by the (i-1)th classifier. When a new document is to be classified, a voting scheme based on the n classifiers is used to determine the final class of the document. We use the adaptive boosting option in the decision tree classifier in our experiments.
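The splitting criterion itself is easy to compute directly. The following Python sketch mirrors Eqs. (3.7)-(3.8); it is a didactic illustration, not the C5 implementation, whose internals are proprietary.

```python
import math
from collections import Counter

def entropy(labels):
    """Eq. (3.8): sum over classes of -p_i log2 p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(samples, labels, attribute):
    """Eq. (3.7): expected entropy reduction from partitioning on one attribute.

    `samples` are dicts mapping attributes (terms) to values, e.g. 0/1 occurrence.
    """
    by_value = {}
    for sample, label in zip(samples, labels):
        by_value.setdefault(sample.get(attribute, 0), []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in by_value.values())
    return entropy(labels) - remainder
```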
3.3.4 Multilayer Feed-Forward Network Classifier

An important class of neural networks, namely adaptive layered networks (e.g., multilayer perceptrons and radial basis networks), has been widely applied to diverse pattern classification domains with some success. They also seem to fit well with conventional retrieval models, such as the vector space model in information science [10, 71, 15]. A three-layer feed-forward network with two hidden layers and an output layer is used in our text classification. Figure 3.1 shows the architecture of our feed-forward neural network.

[Figure 3.1: The architecture of our adaptive feed-forward neural network. Feature extraction feeds an input layer, two sigmoidal hidden layers, and a linear output layer.]

Text document classification usually involves thousands of features, and it is difficult for neural networks to handle such a large input dimension. The principal component analysis method (see Section 4.2) is therefore used on the vector space model to reduce the feature set size, and the projected features in the subspace are used as input to the three-layer feed-forward network.

Webb et al. [78] illustrated why a nonlinear adaptive feed-forward layered network with linear output is capable of performing classification tasks well. They demonstrated that, if the weights are adjusted to minimize the total mean square output error, then the nonlinear transformation of the hidden layers also maximizes the network discriminant function C [78]:

    C = \mathrm{Tr}(S_B S_T^{+}),    (3.9)

where S_B is the (weighted) between-class covariance matrix, and S_T^{+} is the pseudo-inverse [29] of the total-class covariance matrix S_T. For a one-to-m target coding scheme,

    t_{cp} = \begin{cases} 1, & \text{if input pattern } p \text{ is in class } c \\ 0, & \text{otherwise.} \end{cases}    (3.10)

This coding scheme results in a weighted between-class covariance matrix S_B. We choose to use an alternative target coding scheme [78],

    t_{cp} = \begin{cases} 1/\sqrt{n_c}, & \text{if input pattern } p \text{ is in class } c, \text{ where } n_c \text{ is the number of patterns in class } c \\ 0, & \text{otherwise,} \end{cases}    (3.11)

in which case S_B is the conventional between-class covariance matrix, and it is argued in [78] that, for multi-class classification problems, this target coding scheme compensates for unbalanced class memberships in the training data. By performing a nonlinear transformation of the data into a space where the classes may be more separated, as determined by the between-class and total-class covariance matrices, the subsequent linear transformation to minimize the mean square target error could perform better than on the original data.

The neural network uses the backpropagation algorithm [31, 48] for training. Basically, the algorithm performs stochastic gradient descent to minimize the mean square error between the network outputs and their corresponding target values. The activations of an input pattern are propagated forward through the net, while the errors are backpropagated to update the weights in order to minimize the mean square target error.

Traditionally, the weights in a feed-forward neural network trained by a backpropagation algorithm are initialized randomly. This may require many training epochs, and the weights may not converge in some cases. Instead, we initialize the weights based on the eigenvalues of the covariance matrix of the input pattern matrix, which helps to speed up the convergence [70].

3.3.5 Linear Subspace Method

The subspace model [56] decomposes a given feature space into m sub-regions of lower dimensionality (subspaces), where each region is a representative feature space for its corresponding pattern class c_i, i = 1, ..., m. A test document is classified based on a comparison of its compressed representation in each feature space with that of the different classes. We apply this model to document classification as follows.

Suppose we have m document classes C = (c_k)_{k=1}^{m}. Class c_k is represented by a subspace L_k of cardinality d_k. Let T = (t_i)_{i=1}^{d} denote the term-vector in the original d-dimensional feature space, corresponding to the word-list of the training set W = (w_i)_{i=1}^{d}. Let the word-list of the subspace L_k be denoted as W_k = (w_i^k)_{i=1}^{d_k}, where the w_i^k are the words observed in class c_k. Given a vector T in the original feature space, the weighted projection \Pi_k of vector T on the subspace L_k is defined as:

    \Pi_k = H_k T,    (3.12)

where H_k = (h_{ij}) is a d_k x d matrix; the ith row corresponds to the ith component of the word-list W_k in the subspace L_k, while the jth column corresponds to the jth component of the word-list W in the original feature space. The elements h_{ij} are calculated as follows:

    h_{ij} = \begin{cases} \delta_j^k, & \text{when the term } w_i^k \text{ is the same as the term } w_j \\ 0, & \text{otherwise,} \end{cases}    (3.13)

where \delta_j^k is the weight of term w_j in subspace L_k. We define \delta_j^k as:

    \delta_j^k = \frac{CLASSFREQ_{jk}}{\log_2(DOCFREQ_j + 1)},    (3.14)

where CLASSFREQ_{jk} denotes the ratio of the number of documents in class c_k in which the term w_j occurred to the number of documents in c_k, and DOCFREQ_j represents the ratio of the number of documents, over all classes, in which the term w_j occurred to the size of the training set. The Euclidean vector norm of \Pi_k is \|\Pi_k\| = \sqrt{\Pi_k^T \Pi_k}. For a new document D, the subspace decision rule classifies D to the class on whose subspace its term-vector T has the largest projection in terms of the Euclidean vector norm.
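The following Python sketch implements this decision rule, storing each H_k implicitly as a dict of per-term weights; the data-structure choices and names are ours, not from [56].

```python
import math

def subspace_weights(class_docs, docfreq):
    """The weights delta_j^k of Eq. (3.14) for one class subspace L_k.

    class_docs: token sets of the training documents of class c_k;
    docfreq[w]: DOCFREQ_w, a document-frequency ratio over the whole training set.
    """
    n_k = len(class_docs)
    vocab_k = {w for doc in class_docs for w in doc}
    return {w: (sum(1 for doc in class_docs if w in doc) / n_k)     # CLASSFREQ_wk
               / math.log2(docfreq[w] + 1)
            for w in vocab_k}

def projection_norm(term_vec, weights):
    """||Pi_k|| with Pi_k = H_k T as in Eq. (3.12); H_k is stored implicitly as `weights`."""
    return math.sqrt(sum((weights[w] * t) ** 2
                         for w, t in term_vec.items() if w in weights))

def classify_subspace(term_vec, class_weights):
    """Assign D to the class whose subspace yields the largest projection norm."""
    return max(class_weights, key=lambda c: projection_norm(term_vec, class_weights[c]))
```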
3.4 Combination of Different Classifiers

A number of studies have shown that combining different classifiers can improve the classification accuracy [80, 39, 28]. Larkey et al. [39] applied weighted linear combinations of different classifiers to the medical document domain, where the weights were assigned by the user. Another combination of multiple classifiers (CMC) approach is dynamic classifier selection (DCS) [80, 28], in which the single classifier is selected that has the highest local accuracy in a small region of the feature space surrounding the test sample presented to the system. We investigated three different combination approaches: simple voting, DCS, and our own approach of adaptive classifier combination (ACC).

3.4.1 Simple Voting

For each test document, classify it to class c_i if a majority of the classifiers individually assign the test document to class c_i.

3.4.2 Dynamic Classifier Selection (DCS)

We have implemented a version of DCS described in [80, 28]. For a test document D, we use the k-nearest neighbor approach to find the neighborhood of D, and the "leave-one-out" method [17] is applied on the training data to find the local accuracy in the neighborhood of D. We used the "soft" measure [28] of the local accuracy, where the weight of each neighbor of D is the cosine similarity between D and that neighbor.

3.4.3 Adaptive Classifier Combination (ACC)

Instead of selecting the best classifier with the highest local accuracy for a test document, we assign the document to the class identified by the classifier that has the highest local accuracy among all the classifiers. The outline of our ACC algorithm, given n classifiers, is as follows:

1. For a test document D, find the neighborhood of D, NB(D) = (x_1, ..., x_K), x_i ∈ Training Set, using the K-nearest neighbor algorithm.

2. Denote the classification results for D by the n classifiers as C = (c_1, ..., c_n), c_j ∈ {C_1, ..., C_m}.

3. For each class c_j ∈ C, calculate Acc_j^loc = \sum_{i=1}^{K} W_i P_i(c_j | x_i ∈ c_j), where P_i(c_j | x_i ∈ c_j) is the local accuracy of a neighborhood pattern x_i, i.e., the a posteriori probability that x_i belongs to class c_j. The local accuracy of each x_i can be obtained by using the "leave-one-out" method on the training data, and W_i is the cosine similarity between pattern x_i and D, which is defined as:

    W_i = Cosine(x_i, D) = \sum_{k=1}^{t} (Term_ik × Term_k) / sqrt(\sum_{k=1}^{t} (Term_ik)^2 × \sum_{k=1}^{t} (Term_k)^2),    (3.15)

where t is the dimensionality of the feature vector, Term_ik is the value of term k in document x_i, and Term_k is the value of term k in the given document D.

4. Classify D to the class c_j* with j* = argmax_j (Acc_j^loc).

Chapter 4

Dimensionality Reduction

Document classification is often characterized by a high dimensionality of the associated feature space and a relatively small number of training samples. The increase in dimensionality results in an increase in both the computational and the storage complexities. Furthermore, we must also guard against the potential problems of the "curse of dimensionality" [33]. We study four feature dimensionality reduction approaches: feature selection, feature extraction (principal component analysis and linear discriminant analysis), and term grouping in subspace.

4.1 Best Feature Selection

One way to reduce the number of features in document classification is to select a subset of the best terms from the entire feature space; a small sketch of this selection step is given below.
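The following minimal sketch (our own illustration; the exact procedure we use is described in the next paragraph) ranks terms by the information gain of their presence or absence (see Section 3.3.3) and keeps the top n:

    import numpy as np

    def entropy(counts):
        # Entropy of a vector of per-class counts (equation 3.8).
        p = counts / counts.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def select_top_terms(X, y, m, n):
        # X: binary document-term matrix (documents x terms);
        # y: class labels in {0, ..., m-1}; returns the indices of the
        # n terms with the highest information gain.
        X, y = np.asarray(X), np.asarray(y)
        N, d = X.shape
        base = entropy(np.bincount(y, minlength=m).astype(float))
        gains = np.empty(d)
        for j in range(d):
            present = X[:, j] > 0
            gain = base
            for split in (present, ~present):
                if split.any():
                    gain -= (split.sum() / N) * entropy(
                        np.bincount(y[split], minlength=m).astype(float))
            gains[j] = gain
        return np.argsort(gains)[::-1][:n]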
In this thesis, we use the individual best features approach, where terms are sorted by their weights in descending order and the top n terms with the highest weights are selected. We use mutual information (i.e., information gain, see Section 3.3.3), as suggested in [49], to assign weights to the terms. Traditional feature selection approaches in pattern recognition, such as sequential forward/backward selection and "plus l-take away r" selection [35], may perform better, but they are often much more expensive in terms of computational cost. This is a major consideration in the high dimensional feature spaces encountered in document classification problems.

Another method of dimensionality reduction is to map the original measurements onto a more effective lower dimensional subspace, where each new feature is a combination of the original features. This is the so-called feature extraction method, which includes the principal component analysis and linear discriminant analysis used in our study.

4.2 Principal Component Analysis

Principal component analysis (PCA, also called the K-L transform) is a commonly used linear projection method [34]. It projects the original data vectors (of dimension d) onto coordinate axes of dimension p (usually p < d). This minimizes the mean-square error between the original data and the new representation built from p eigenvectors of the covariance matrix. Using the vector space representation scheme in document classification, let T_1, ..., T_N denote the N d-dimensional training vectors, and let their normalized, zero-mean versions be denoted T_1^*, ..., T_N^*. Let the p basis vectors e_1, ..., e_p be a set of orthonormal vectors that best describe the distribution of documents in the p-dimensional subspace (eigenspace), p ≤ d. It can be shown that the "best" set of basis vectors corresponds to the eigenvectors of the covariance matrix Σ: the first basis vector e_1 corresponds to the largest eigenvalue of Σ, the second basis vector e_2 corresponds to the second largest eigenvalue of Σ, and so on. With the p-dimensional eigenspace defined, the training vectors T_1^*, ..., T_N^* can be represented as a set of p-dimensional feature vectors f_1, ..., f_N:

    f_i = e^T T_i^*,  i = 1, ..., N,    (4.1)

where e = (e_1, ..., e_p). The sum of the first p eigenvalues is the "variance" retained in the subspace, while the sum of all d eigenvalues is the "variance" in the original pattern space [34]. We can choose p such that \sum_{i=1}^{p} λ_i / \sum_{i=1}^{d} λ_i ≥ V, where V is a user-specified value representing the desired "variance" retained in the p-dimensional subspace.

4.3 Discriminant Analysis

Linear discriminant analysis (LDA) also attempts to project patterns onto a lower dimensional space than the original space [34]. Whereas PCA projects patterns onto the p-dimensional space whose orthogonal basis consists of the p eigenvectors corresponding to the p largest eigenvalues of the covariance matrix of the training samples, LDA projects patterns onto an (m-1)-dimensional space using the basis set (e_1, ..., e_{m-1}) computed from φ_W^{-1} φ_B, where m is the number of categories, φ_W is the within-class scatter matrix, and φ_B is the between-class scatter matrix. Let x_ki be the vector of the ith document in category k, i = 1, ..., N_k; let μ_k be the sample mean of class k, μ the overall mean of the training samples, and N = N_1 + ... + N_m the total number of documents in the training set.
Then, according to [34, 57], the within-class scatter matrix φ_W and the between-class scatter matrix φ_B are defined as follows:

    φ_W = (1/N) \sum_{k=1}^{m} \sum_{i=1}^{N_k} (x_ki - μ_k)(x_ki - μ_k)^T,    (4.2)

    φ_B = (1/N) \sum_{k=1}^{m} N_k (μ_k - μ)(μ_k - μ)^T.    (4.3)

The basis set (e_1, ..., e_p) is a set of eigenvectors corresponding to the (m-1) largest eigenvalues of the matrix φ_W^{-1} φ_B, which satisfy

    e_i^T Σ e_j = 1 if i = j, and e_i^T Σ e_j = 0 otherwise,    (4.4)

where Σ is the covariance matrix of the training samples. Since the rank of the matrix φ_W^{-1} φ_B is at most m-1, the number of coordinates of the projected space is limited by the number of document categories. Foley and Sammon [19] proposed an optimal discriminant plane for the 2-class problem under an orthonormality condition on the coordinate axes. Okada et al. [57] generalized it to the multi-class problem, calling it orthonormal discriminant analysis (ODA), where the number of features that can be extracted is constrained only by the original feature dimensionality d. Hamamoto et al. [30] proved that "the ODA method is more powerful than LDA in terms of the Fisher criterion."

4.4 Term Grouping in Subspace

The occurrences of different words in documents are usually not independent; there are correlations between words in a group of documents. Wulfekuhler et al. [81] applied K-means clustering within each document category to find clusters of words for that document class. To capture the correlation of all word-pairs, we construct a bigram matrix for each document class k. Let W_k = (w_1^k, ..., w_{d_k}^k) be the d_k-dimensional word-list in subspace L_k. The bigram matrix B_k = (m_ij) is a d_k × d_k matrix, with m_ij representing the number of documents from class c_k in which both the terms w_i^k and w_j^k jointly appear. A complete-link hierarchical clustering algorithm is applied to the proximity matrix B_k. The resulting dendrogram (tree) is then cut into p term-groups, where p is a user-specified parameter. Let the term-groups be denoted as H = (h_1, ..., h_p), where h_i ∩ h_j = ∅ for i ≠ j, and h_i = (w_{i1}^k, ..., w_{in_i}^k), w_{ij}^k ∈ W_k. Given a vector Π_k in subspace L_k, its projection onto the p-dimensional space, ξ = (ξ_1, ..., ξ_p), is defined as

    ξ_i = I H_i Π_k,    (4.5)

where H_i = (h_uv) is an n_i × d_k matrix defined as

    h_uv = 1, when the term w_{iu}^k is the same as the term w_v^k; 0, otherwise,    (4.6)

and I is the n_i-dimensional unit vector.

Chapter 5

Experimental Results

The data used in our experiments are news items downloaded from the Yahoo news group and the Reuters-21578 newswire benchmark. Each document in the data sets is indexed by human experts. We preprocess the HTML news items by (i) document parsing (removing headers and tags in the HTML files), and (ii) removing stopwords and low frequency words, as mentioned earlier.

• Yahoo News Data. The Yahoo news items were downloaded from the Yahoo news group in 1997. There were 9 categories on the site. We chose to use 7 of them: Business (B), Entertainment (E), Health (H), International (I), Politics (P), Sports (S), and Technology (T). The remaining two categories, Top-stories and Local, are excluded, since the associated semantics cannot be easily captured by word frequencies or occurrences. Each news item has a title and a short summary, which is about 120 words on average. We have constructed two data sets from the Yahoo news data. One of them is called the Yahoo news set, which contains 814 training samples and two test data sets (news items at different time intervals, see Table 5.1). Test data set1 has 680 news items and test data set2 has 621 documents.
The second data set is called the enlarged Yahoo news set, which has 4,199 training samples and 2,000 test samples (see Table 5.2).

    Categories                     B     E     H     I     P     S     T
    Training   #of documents      130   133    91   110   130   130    90
    Data       total #of terms   1848  2045  1213  1974  2070  1659  1364
    Test Data  #of documents      110   111    79    80   110   111    79
    Set1       total #of terms   2155  2583  1535  1999  2439  1952  1618
    Test Data  #of documents      100   101    78    70   101   101    70
    Set2       total #of terms   2046  2834  1803  2604  2070  1974  1689

Table 5.1: Yahoo news training and test data.

    Categories                     B     E     H     I     P     S     T
    Training   #of documents      682   687   476   510   682   673   489
    Data       total #of terms   5841  8610  5020  6893  6963  5756  4721
    Test Data  #of documents      322   334   224   245   323   330   222
    Set        total #of terms   3904  5275  3199  4086  4482  3803  3043

Table 5.2: Enlarged Yahoo news training and test data.

• Reuters-21578 benchmark. This data set consists of 21,578 newswire stories which appeared on the Reuters newswire in 1987. It can be accessed from Lewis' professional home page (http://www.research.att.com/~lewis). We use the TOPIC category set, which has been used in almost all previous experiments with the Reuters data. The TOPIC set contains 12,668 newswire stories with 135 categories, but only 120 categories have at least one document, and only 57 categories have at least 20 documents (see Table 5.3). The data are split into a training set (9,649 documents) and a test set (3,019 documents).

5.1 Evaluation of Text Classification Effectiveness

In information retrieval, there are two important measures of system effectiveness, called recall and precision [69]. Recall measures the ability of the system to present all relevant items; it is defined as the number of relevant documents retrieved divided by the total number of relevant documents in the collection. Precision measures the ability of the system to present only relevant items; it is defined as the number of relevant documents retrieved divided by the total number of documents retrieved. For example, suppose there are 100 documents in a collection that are relevant to a query Q, and a system retrieves 200 documents (for query Q), among which 80 are relevant to Q. The recall of this system for query Q is then 80/100 = 0.80, while the precision is 80/200 = 0.40.

Text classification is the assignment of documents to one or more pre-existing categories, rather than retrieving them in response to a query, so the recall and precision measures used in information retrieval do not fit a classification system very well. Lewis [42] defined microaveraged recall and precision to evaluate a text classification (categorization) system. For a two-class classification problem, a binary decision is made based on whether a document belongs to a given class or not. For a set of n test documents, a contingency table (Table 5.4) can be made for these n binary decisions [42].
    Category      acq   alum  austdlr  austral  barley  bfr  bop
    Training Set  1650  35    4        0        37      0    75
    Test Set      715   20    0        0        0       0    16

    Category      can  carcass  castor-meal  castor-oil  castorseed  citruspulp  cocoa
    Training Set  3    50       0            1           1           1           55
    Test Set      0    7        0            0           0           0           17

    Category      coconut  coconut-oil  coffee  copper  copra-cake  corn  corn-oil
    Training Set  4        4            111     47      2           182   1
    Test Set      1        0            23      14      0           1     0

    Category      cornglutenfeed  cotton  cotton-meal  cotton-oil  cottonseed  cpi  cpu
    Training Set  2               39      0            1           0           69   3
    Test Set      0               9       0            0           0           18   1

    Category      crude  cruzado  dfl  dkr  dlr  dmk  drachma
    Training Set  389    1        2    1    131  10   0
    Test Set      160    0        0    0    16   0    0

    Category      earn  escudo  f-cattle  ffr  fishmeal  flaxseed  fuel
    Training Set  2877  0       2         0    2         0         13
    Test Set      1083  0       0         0    0         0         7

    Category      gas  gnp  gold  grain  groundnut  groundnut-meal  groundnut-oil
    Training Set  37   101  94    433    5          0               1
    Test Set      15   28   27    127    2          0               0

    Category      heat  hk  hog  housing  income  instal-debt  interest
    Training Set  14    0   16   16       9       5            347
    Test Set      4     0   3    2        4       1            99

    Category      inventories  ipi  iron-steel  jet  jobs  l-cattle  lead
    Training Set  5            41   40          4    46    6         15
    Test Set      0            12   12          1    12    0         9

    Category      lei  lin-meal  lin-oil  linseed  lit  livestock  lumber
    Training Set  12   1         1        2        1    75         10
    Test Set      3    0         0        0        0    11         4

    Category      lupin  meal-feed  mexpeso  money-fx  money-supply  naphtha  nat-gas
    Training Set  0      30         0        538       140           2        75
    Test Set      0      6          0        151       31            1        15

    Category      nickel  nkr  nzdlr  oat  oilseed  orange  palladium
    Training Set  8       1    2      8    124      16      2
    Test Set      1       0    0      0    19       9       0

    Category      palm-meal  palm-oil  palmkernel  peseta  pet-chem  platinum  plywood
    Training Set  0          30        2           1       20        5         4
    Test Set      0          0         0           0       8         3         0

    Category      pork-belly  potato  propane  rand  rape-meal  rape-oil  rapeseed
    Training Set  3           3       3        2     1          5         18
    Test Set      0           3       2        0     0          0         0

    Category      red-bean  reserves  retail  rice  ringgit  rubber  rupiah
    Training Set  1         55        23      35    1        37      1
    Test Set      0         14        1       2     0        9       0

    Category      rye  saudriyal  sfr  ship  silk  silver  singdlr
    Training Set  1    3          0    197   0     21      0
    Test Set      0    0          0    55    0     2       0

    Category      skr  sorghum  soy-meal  soybean  stg  strategic-metal  sugar
    Training Set  1    24       13        78       17   16               126
    Test Set      0    0        0         1        0    6                27

    Category      sun-meal  sun-oil  sunseed  tapioca  tea  tin  trade
    Training Set  1         5        11       3        9    18   369
    Test Set      0         0        0        0        3    11   105

    Category      tung  tung-oil  veg-oil  wheat  wool  wpi  yen
    Training Set  0     0         87       212    2     19   45
    Test Set      0     0         24       3      0     9    6

    Category      zinc
    Training Set  21
    Test Set      7

Table 5.3: Number of documents in each category of the Reuters-21578 TOPIC set.

                  Yes is correct   No is correct
    Decide Yes    a                b                a+b
    Decide No     c                d                c+d
                  a+c              b+d              a+b+c+d = n

Table 5.4: Contingency table of binary decisions for a test set, from [42].

Given the contingency table, recall and precision are defined as recall = a/(a+c) and precision = a/(a+b). For m classes and n test documents, if we treat each classification as a binary decision, then a total of mn decisions are made. Microaveraging considers all mn decisions as a single group and computes recall and precision as defined above. There is a tradeoff between recall and precision: we can obtain a very high recall rate by always deciding yes, but the corresponding precision will then be small. A classification system attempts to maximize both precision and recall simultaneously. Usually, the break-even point of recall and precision is reported, i.e., the point at which the microaveraged recall equals the microaveraged precision; linear interpolation is often used to obtain the break-even values.
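These definitions translate directly into code; a minimal sketch (our own illustrative helper, not the evaluation script used in the experiments) pools all mn binary decisions and returns the microaveraged recall and precision:

    def micro_recall_precision(decisions, truths):
        # decisions, truths: parallel 0/1 sequences over all m*n binary
        # "document belongs to class" decisions and their ground truth.
        a = sum(d and t for d, t in zip(decisions, truths))        # decide yes, yes is correct
        b = sum(d and not t for d, t in zip(decisions, truths))    # decide yes, no is correct
        c = sum((not d) and t for d, t in zip(decisions, truths))  # decide no, yes is correct
        recall = a / (a + c) if (a + c) else 0.0
        precision = a / (a + b) if (a + b) else 0.0
        return recall, precision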
We consider text classification as a multi-class classification problem, i.e., each time a classifier assigns a document to one of the m categories, instead of classifying it to one class against all the others. We construct an m × m confusion matrix, where entry (i, j) shows how many documents belonging to category i were assigned to category j. The sum of the diagonal entries divided by the number of test examples measures the classification accuracy, which equals the microaveraged recall and precision defined by Lewis [42]. If a document indeed belongs to several categories, then a decision made by a classifier is treated as correct if it matches one of these "true label" categories, and as incorrect otherwise.

5.2 Individual Classifiers

On the small Yahoo data set, we compared the five classification algorithms (naive Bayes classifier (NB), nearest neighbor classifier (NN), decision tree classifier (DT), artificial neural network classifier (NNet), and the subspace classifier (SS)) on our two test data sets. Table 5.5 shows a comparison of the recognition rates (microaveraged recall/precision) using these five classification algorithms individually. The experimental results show that all five classification algorithms perform reasonably well; the NB approach performs the best on test data set1, while the NNet classifier performs the best on test set2; the NNet and the subspace methods outperform the other three classifiers on test data set1 and test data set2, respectively. Confusion matrices of the classification results using the five classifiers on test set1 are shown in Tables 5.6-5.10. Table 5.11 shows the classification results using the NB classifier on the enlarged Yahoo data set and the Reuters-21578 news story benchmark.

                                       NB     NN     DT     NNet   SS
    Test Data  #of misclassifications  115    165    178    131    139
    Set1       Recognition rate (%)    83.1   75.7   73.8   80.74  79.6
    Test Data  #of misclassifications  125    179    144    90     111
    Set2       Recognition rate (%)    79.87  71.18  76.8   85.51  82.13

Table 5.5: Comparison of the five classification algorithms (NB, NN, DT, NNet, and SS).

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    81   1    2    0    7    0    19   73.6
    E    1    89   1    8    3    2    7    80.2
    H    0    0    79   0    0    0    0    100.0
    I    1    1    0    53   25   0    0    66.3
    P    5    0    1    10   91   0    3    82.7
    S    2    0    1    1    3    102  2    91.9
    T    7    0    2    0    0    0    70   88.6

Table 5.6: Confusion matrix of the classification results using the naive Bayes classifier (NB) without dimensionality reduction.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    74   5    2    2    8    2    17   67.3
    E    3    83   4    4    5    8    4    74.8
    H    2    1    70   0    0    2    4    88.6
    I    5    3    3    50   18   0    1    62.5
    P    9    1    4    16   76   0    4    69.1
    S    2    0    0    2    0    106  1    95.5
    T    19   1    1    0    1    1    56   70.9

Table 5.7: Confusion matrix of the classification results using the nearest neighbor classifier (NN) without dimensionality reduction.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    70   8    1    0    13   1    17   63.6
    E    18   77   0    5    2    7    0    70.6
    H    1    2    71   4    0    0    1    89.9
    I    3    8    0    54   8    5    2    67.5
    P    7    5    2    15   73   3    5    66.4
    S    1    11   0    1    2    96   0    86.5
    T    9    4    2    3    1    0    61   76.3

Table 5.8: Confusion matrix of the classification results using the C5 decision tree classifier (DT) without dimensionality reduction.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    72   1    7    0    8    3    19   65.45
    E    1    74   14   2    2    14   4    65.00
    H    0    0    79   0    0    0    0    100.0
    I    1    1    3    52   19   4    0    65.00
    P    4    0    4    7    88   2    5    80.00
    S    0    0    2    0    0    109  0    98.20
    T    0    0    5    0    0    0    75   94.94

Table 5.9: Confusion matrix of the classification results using the adaptive neural network classifier (NNet) without dimensionality reduction.
         B    E    H    I    P    S    T    Recognition Rate (%)
    B    65   3    6    2    8    2    24   59.1
    E    1    89   2    7    1    6    5    80.2
    H    0    0    79   0    0    0    0    100.0
    I    1    2    3    51   22   0    1    63.8
    P    6    0    6    10   84   1    3    76.4
    S    2    2    1    0    0    106  0    95.5
    T    9    0    2    1    0    0    67   84.8

Table 5.10: Confusion matrix of the classification results using the subspace classifier (SS) without dimensionality reduction.

                              Yahoo   Reuters
    #of misclassifications    285     430
    Recognition rate (%)      85.55   85.76

Table 5.11: Classification accuracy using NB on the two large data sets.

5.3 Combination of Different Classifiers

Results of combinations of multiple classifiers using the different combination approaches are summarized in Table 5.12. We set k = 20 in our experiments. Note that for these two data sets, there was no significant improvement from using a combination of classifiers. Our opinion is that the performance of a combination of classifiers is data dependent.

    Combination of   Combination      Test Data   Test Data
    Classifiers      Approaches       Set1 (%)    Set2 (%)
    NB, SS, NN       Simple Voting    80.29       81.96
                     DCS              80.00       77.13
                     ACC              82.21       82.45
    NB, SS           DCS              80.44       79.87
                     ACC              83.24       82.93
    Best Individual Classifier        83.1        82.13

Table 5.12: Classification accuracy of combinations of multiple classifiers.

5.4 Dimensionality Reduction

Best feature selection, PCA, LDA, ODA, and term grouping in subspace are used to reduce the dimensionality in our experiments. We used mutual information to weight the words appearing in the training documents; a subset of the words with the highest weights is selected. We compared four classifiers (NB, NN, DT, and SS) using this feature selection technique. Figure 5.1 shows the recognition rate of the four classification algorithms with different sized feature subsets. From this figure, we can see that (i) a small number of features is not suitable for NB, and (ii) the performance of the NB and SS classifiers tends to improve as the number of features is increased.

[Figure 5.1: The accuracy of each algorithm (DT, NN, NB, and SS) as a function of the number of selected features (0 to 1000).]

We used the PCA method to project the original feature space onto a lower dimensional subspace. We set V = 0.90 (90% of the variance retained), resulting in about 400 features for the small Yahoo training set. When using LDA and ODA, the original features are projected onto a 6-dimensional subspace, since we have 7 classes in the Yahoo data sets. We apply the NN classifier on the projected feature vectors. Table 5.13 shows a comparison of the different feature extraction methods, including PCA, LDA and ODA.

                              NN     NN with PCA   NN with LDA   NN with ODA
    Test Data  #of errors     165    166           136           121
    Set1       Recall (%)     75.7   75.6          80.0          82.21
    Test Data  #of errors     179    180           109           102
    Set2       Recall (%)     71.18  71.01         82.45         83.57

Table 5.13: Comparison of the feature extraction methods using NN.

Confusion matrices of the classification results using the NN classifier with
PCA, LDA and ODA on test set1 are shown in Tables 5.14-5.16, respectively. The experimental results show that ODA outperforms LDA and PCA.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    75   4    2    2    9    3    15   68.2
    E    4    86   1    4    4    8    4    77.5
    H    2    1    72   0    0    3    1    91.1
    I    6    4    1    47   21   1    0    58.8
    P    10   3    4    17   71   0    5    64.6
    S    3    0    0    2    0    105  1    94.6
    T    17   1    1    0    1    1    58   73.4

Table 5.14: Confusion matrix of the classification result using the nearest neighbor classifier with PCA feature extraction.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    67   2    4    1    10   5    21   60.92
    E    2    97   0    3    2    5    2    87.39
    H    0    4    74   0    0    1    0    93.67
    I    3    1    1    46   26   2    1    57.50
    P    9    2    2    12   83   1    1    75.45
    S    1    2    1    2    0    105  0    94.59
    T    3    2    2    0    0    0    72   91.14

Table 5.15: Confusion matrix of the classification result using the nearest neighbor classifier with LDA feature extraction.

         B    E    H    I    P    S    T    Recognition Rate (%)
    B    73   3    2    1    8    5    18   66.36
    E    1    99   0    6    1    3    1    89.19
    H    0    1    77   0    0    1    0    97.47
    I    4    6    1    41   26   1    1    51.25
    P    9    1    2    6    91   1    0    82.73
    S    1    2    1    0    0    107  0    96.40
    T    4    1    3    0    0    0    71   89.87

Table 5.16: Confusion matrix of the classification result using the nearest neighbor classifier with ODA feature extraction.

We apply our term-grouping technique to the subspace method on test set1. A total of 30 term groups were chosen in our experiment. A comparison of the recognition rates before and after using the term-grouping technique on test set1 is shown in Table 5.17; it shows that the performance of the SS method improved marginally with the term-grouping technique.

                              SS without term-grouping   SS with term-grouping
    #of misclassifications    139                        137
    Recognition rate (%)      79.6                       79.9

Table 5.17: Comparison of the subspace method with and without the term-grouping feature reduction technique on test set1.

Chapter 6

Conclusions and Discussion

We have applied five different classification methods (NB, NN, DT, SS, and NNet) to the problem of document categorization. These methods were evaluated individually and in combination. Since document classification involves a high dimensional feature space, we also studied the effect of different feature reduction techniques (individual best feature selection, PCA, discriminant analysis, and term-grouping in subspace) on the performance of these classifiers. The seven classes of Yahoo news items used in our experiments have a large overlap of words in their documents (e.g., out of 2,948 total words, there are 1,096 common words between the international and politics news categories, and 744 out of 2,468 words are common between the business and technology news groups), so this is a difficult classification problem. We can make the following observations based on our experimental results:

1. Comparison of different classifiers:

(a) For the Yahoo news data, all five classifiers perform reasonably well on our data sets. We also asked four students in our laboratory to assign labels to the 680 documents in test set1. The average classification accuracy of the labels assigned by these subjects is about 81%. Comparing this to the accuracy of the best classifier (83.1%), we see that the machine classification performance is reasonable. Weiss et al. [79] also reported that the accuracy of human judgment on 1,000 messages from 10 USENET newsgroups (misc.health.diabetes, sci.math.num-analysis, dc.politics, rec.food.restaurants, alt.tv.seinfeld, comp.sys.ibm.pc.games.sports, rec.arts.comics.dc.universe, sci.military.naval, talk.philosophy.misc, and humanities.lit.authors.shakespeare) is about 85%. Both the NB and SS classifiers work better than the NN and DT methods, but the performance of the NB and SS classifiers is data dependent. Most of the misclassifications are between the international and politics categories, and between the business and technology document classes, which inherently have a large overlap of terms.
If we combine the international and politics news groups, and combine the business and technology news groups, the performance of the four classification algorithms on the resulting 5-class problem improves by an average of 7%. Classification accuracies on test set1 for the four classifiers on this five-class problem are shown in Table 6.1.

                              NB     NN     DT     SS
    #of misclassifications    67     95     128    101
    Recognition rate (%)      90.15  86.03  81.18  85.15

Table 6.1: Comparison of the four classification algorithms (NB, NN, DT, and SS) on test set1 for the reduced 5-class problem.

When we reverse the roles of the training and test data (combining test set1 and test set2 as a training set, with the old training set now serving as the test set), the classification accuracy remains essentially the same (e.g., the classification accuracy of NB is 83.91%). This shows that the classifier is not sensitive to the choice of the training data (assuming that the training set is sufficiently large).

(b) For the enlarged Yahoo news data and the Reuters newswire, NB performs steadily well on these large databases, despite its "independence" assumption, which is not always satisfied in document classification. While SS performs well on the enlarged Yahoo news data set, it performs very poorly on the Reuters data set, possibly because the data are noisy. Some preprocessing, such as fine semantic analysis [27], may improve the performance.

2. Combinations of multiple classifiers do not always improve the classification accuracy. The adaptive classifier combination introduced here works better than the simple voting and dynamic classifier selection approaches on our two test data sets.

3. Dimensionality reduction:

(a) There is no significant peaking in classification performance observed in our experiments. In particular, the performance of NB improves as the number of features increases (which differs from the results obtained in [44]). Additionally, NB does not use a small number of features effectively (which also differs from the observation in [44] that 10 features performs the best for the Reuters news groups). This indicates that our data set is less separable than the Reuters newswire.

(b) The widely adopted LSI method in information retrieval systems is essentially the same as PCA when used to compare the similarity of two documents. LDA performs much better than PCA, and ODA performs the best in our experiments.

(c) The problem with feature selection is that the small number of selected words may not generalize well to new documents. However, the advantage of dimensionality reduction is not only an improved recognition rate (eliminating the problem of overfitting); the reduced number of features also leads to lower time and space complexities. The term-grouping method reduces the feature dimensionality and overcomes the generalization problem of feature selection, while maintaining the performance of the classifier.

It is difficult to say which classifier is the best. Generally, we believe that simple classifiers, such as NB and SS, can give fairly good classification results. These methods are simple, fast, and can easily be scaled up to very large databases. The NNet classifier also performs well, and it is very fast in the test stage. Feature extraction, such as LDA, can improve the classification performance significantly in some cases. However, feature extraction is computationally expensive and has difficulty scaling up to large databases (e.g., memory requirements).
Incorporating natural language processing into a classifier (feature representation, feature selection) can remove noisy features and improve the classification accuracy (e.g., using WordNet, http://www.cogsci.princeton.edu/~wn [27]). In our experiments, we also observed that, for the NB classifier, if we assign a test document to the top two matching categories, the microaveraged recall is 97% on the enlarged Yahoo news data and 99% on the Reuters newswire database.

APPENDICES

Appendix A

Examples of Training Samples of the Yahoo Datasets

We show one training example from each of the seven Yahoo news groups after Web document parsing, together with its word list (used to construct the feature vector) after stopword and low-frequency word removal; in the word lists below, each removed word is rendered as a run of underscores.

(a) Training sample:

Minimum Wage Rises - Nearly 7 million Americans are getting a raise on this Labor Day. The federal minimum wage is rising to $5.15 an hour. Fast food workers, retail clerks, gas station attendants and others will be earning 40 cents an hour more when they report to work as the second phase of the hike goes into effect. It was first raised to $4.75 last Oct. 1. According to a report to be issued tomorrow by the Economic Policy Institute, most of the 6.8 million workers affected by the minimum wage hike are women who work in the service sector. The Washington, D.C.-based think tank's study found that in 18 states, more than 10 percent of the work force will be affected by the minimum wage increase.

(b) Extracted word list:

minimum wage rises _ nearly _ million americans ___ getting _ raise ______ labor day_ ___ federal minimum wage __ rising _________ hour_ fast food workers_ retail clerks_ gas station attendants _________ ______ earning __ cents __ hour ____________ report __ _______________ phase __ ___ hike goes ____ effect_ __ ________ raised __ _ oct_ __ according __ _ report __ __ issued tomorrow _____ economic policy institute_ ____ __ ___ ___ million workers affected _____ minimum wage hike women ___ ____ __ ___ service sector_ ___ washington _____ based _____ tank__ study found ___________________ ______ percent __ ___ ____ force ______ affected _____ minimum wage increase_

Figure A.1: An example of the business news group; (a) a training sample; (b) extracted word list.

(a) Training sample:

Diana Funeral to be Saturday - Princess Diana's funeral will be held Saturday. Buckingham Palace announced Monday that the "people's princess" would be honored with services at Westminster Abbey in London. Diana will be buried in private services at her family's estate in Althorp, central England. Until the funeral, Diana's body will lie in the Chapel Royal at St. James's Palace in London. Members of the public will not be allowed to file past, but can write a personal message in books of condolence. Thousands of people have been lining up to do so. French prosecutors Monday disclosed a new twist to the tragedy that has numbed Britons. French officials said the driver who crashed Princess Diana's car -- killing her, her friend Dodi Al Fayed and himself -- was driving at a high speed with twice the blood alcohol level that would have landed him in jail.
(b) Extracted word list:

diana funeral __ __ saturday _ princess diana__ funeral ______ held saturday_ buckingham palace announced monday _______ _people__ princess honored ____ services __ westminster abbey __ london_ diana ______ buried __ private services _____ family__ estate __ althorp_ central england_ ________ funeral diana__ body ____ lie _____ chapel royal __ st_ jame__ palace __ london_ ____________ public ____ ___ allowed __ file past_ ___ ___ write _ personal message books __ condolence_ thousands __ people ________ lining __ __ _____ french prosecutors monday disclosed _ ___ twist _____ tragedy _______ numbed britons_ french officials _______ driver ___ crashed princess diana__ car __ killing _______ friend dodi al fayed _______________ driving __ _ ____ speed ____ twice ___ blood alcohol level _____________ landed _____ jail_

Figure A.2: An example of the entertainment news group; (a) a training sample; (b) extracted word list.

(a) Training sample:

Heart Failure Therapy Saves Lives - A more aggressive and comprehensive management program for heart failure can reduce hospital admissions by 85% in patients waiting for a heart transplant, a new study suggests. The more comprehensive treatment may not only improve the quality of life for patients, it also saves money. An estimated 400,000 to 800,000 people in the U.S. have severe heart failure, and 200,000 die of the disease each year, according to the report in the Journal of the American College of Cardiology.

(b) Extracted word list:

heart failure therapy saves lives _ _ ____ aggressive ___ comprehensive management program ___ heart failure ___ reduce hospital admissions __ ___ __ patients waiting ___ _ heart transplant_ _ ___ study suggests_ _______ comprehensive treatment __________ improve ___ quality __ life ___ patients_ __ ____ saves money_ __ estimated _______ __ _______ people __ ____ ____ severe heart failure_ ___ _______ die __ ___ disease ____ year_ according __ ___ report __ ___ journal __ ___ american college __ cardiology_

Figure A.3: An example of the health news group; (a) a training sample; (b) extracted word list.

(a) Training sample:

Diana to Receive 'Unique Funeral' - Buckingham Palace Monday announced that Princess Diana will be given a "unique funeral for a unique person" Saturday in London. Her coffin will be carried through the streets to services at Westminster Abbey, to be followed by a private burial at her family's estate in central England. Until the funeral, Diana's body will lie in the Chapel Royal at St. James's Palace in central London, and thousands of mourners have flocked there to sign books of condolences. Crowds have also gathered outside Kensington Palace, Diana's London home, and outside Buckingham Palace, as they continue mourning the late Princess.

(b) Extracted word list:

diana __ receive _unique funeral_ _ buckingham palace monday announced ____ princess diana ____ __ _____ _ _unique funeral ___ _ unique person_ saturday __ london_ ___ coffin ______ carried __________ streets __ services __ westminster abbey_ __ __ followed __ _ private burial __ ___ family__ estate __ central england_ ________ funeral_ diana__ body ____ lie _____ chapel royal __ st_ jame__ palace __ central london_ ___ thousands __ mourners ____ flocked _______ sign books __ condolences_ crowds ________ gathered outside kensington palace_ diana__ london home_ ___ outside buckingham palace_ __ ____ continue mourning ___ late princess_

Figure A.4: An example of the international news group; (a) a training sample; (b) extracted word list.

(a) Training sample:

Clinton Sends Condolences - President Clinton is sending condolences to British Prime Minister Tony Blair on the death of Princess Diana, saying "all of us have lost a friend and a strong voice for those less fortunate." In a letter written at his vacation retreat in Martha's Vineyard, Clinton praised Diana's "untiring and selfless commitment to helping persons in need, particularly children, the victims of AIDS and landmines, and other vital humanitarian concerns." Clinton sent letters expressing similar sentiments to Queen Elizabeth, Diana's ex-husband Prince Charles, and Diana's brother Charles Spencer. A White House spokesman says there's been no decision on who will represent the United States at Diana's funeral Saturday.
In a letter written at his vacation retreat in Martha’s Vineyard, Clinton praised Diana’s "untiring and selfless commitment to helping persons in need, particularly children, the victims of AIDS and landmines, and other vital humanitarian concerns." Clinton sent letters expressing similar sentiments to Queen Elizabeth, Diana’s ex-husband Prince Charles, and Diana’s brother Charles Spencer. A White House spokesman says clinton sends condolences - president clinton _- sending condolences --british prime minister tony blair _____ death -- princess diana- saying _--- -- -- ---- lost _ friend --- _ strong voice --------- fortunate-- _- - letter written ----- vacation retreat -- martha-- vineyard- clinton praised diana-- _untiring --- selfless commitment -- helping persons ----- particularly children- --_ victims -- aids _-- landmines- -------- vital humanitarian concerns__ clinton sent letters expressing similar sentiments queen elizabeth- diana-- ex-husband prince charles- --- diana__ brother charles spencer- - white house spokesman there’s been no decision on who ------------- decision _____ will represent the United States _--- represent -_- united ------ at Diana’s funeral Saturdgy. -_ diana funeral saturday- (a) (b) Figure A.5: An example of the politics news group; (a) a training sample; (b) ex- tracted word list. 69 49ers’ Rice Has Knee Surgery - The 1997 season is off to a nightmarish start for the San Francisco 49ers. Jerry Rice, who holds all the major receiving records in NFL history, underwent surgery Monday to repair a torn anterior cruciate ligament and torn medial collateral ligament in his left knee and is expected to miss the rest of the season. An MRI taken Sunday night revealed the injury. Dr. Michael Dillingham, the Niners’ team physician, performed the surgery and estimated that Rice will be sidelined four to six months. Rice suffered the injury on a reverse in the second quarter when he was dragged to the ground- by his facemask by Tampa Bay defensive end Warren Sapp. Rice’s left knee buckled as he was pulled awkwardly to the ground on the play, which resulted in a 10-yard loss. Sapp was called for a 15-yard facemask penalty on the ers_ rice --- knee surgery - ------- season nightmarish start ------ san francisco --ers- jerry rice- holds ------ major receiving records -- nfl history- underwent surgery monday -- repair _ torn anterior cruciate ligament --- torn medial collateral ligament ----- left knee --- -_ expected -- miss --- rest ----- season- -- mri ----- sunday night revealed --- injury- dr- michael dillingham- --_ niners- team physician- performed ___ surgery ___ estimated ---- rice ______ sidelined ______ six months- rice suffered --- injury -- - reverse ----------- quarter _________ dragged -- --- ground ----- facemask -- tampa bay defensive --- warren sapp- rice-_ left knee buckled -- -- --- pulled awkwardly ----- ground _- --- play_ ----- resulted -- - ___yard loss- sapp --- called --- - ___yard facemask penalty _____ play. (a) play_ (b) Figure A.6: An example of the sports news group; (a) a training sample; (b) extracted word list. 70 Apple Buys Clone Maker Assets - apple buys clone maker assets - Apple Computer said today it’s apple computer paying $100 million in stock to paying ---- million -- stock _- buy the core assets of Power buy --- core assets -- power Computing, a privately held computing- - privately held licensee of Apple’s Macintosh licensee -- apple-- macintosh line of computers. 
(b) Extracted word list:

apple buys clone maker assets _ apple computer __________ paying ____ million __ stock __ buy ___ core assets __ power computing_ _ privately held licensee __ apple__ macintosh line __ computers_ power computing ___ pioneered direct marketing ___ sales _____ macintosh market_ successfully building _ ____ million business__ apple board ______ ___ founder steve jobs ____ announcing ___ agreement_ __ look forward __ learning _________ experience_ ___ welcoming _____ customers ____ ______ ___ apple family__ apple sold ___ macintosh license __ power computing __ december _____

Figure A.7: An example of the technology news group; (a) a training sample; (b) extracted word list.

Appendix B

Laplace's Law of Succession

Laplace's law of succession is a Bayesian approach to the problem of multinomial parameter estimation [64]. Given a set of distinct symbols, the problem is to estimate the symbol probabilities based on the known frequency with which each symbol occurred in the past. Let θ_i be the probability that word w_i occurs in class c. We assume that the words appear in the documents independently. The problem is to find θ̂_i, an estimate of θ_i. Let n be the total number of occurrences of words in class c, n_{w_i} the number of occurrences of word w_i in class c, and k the total number of distinct words in all the documents, i.e., the vocabulary size. Let the a priori probability density of the unknown parameters (θ_1, ..., θ_k) be uniform, i.e.,

    P(θ_1, ..., θ_k) = l, if 0 ≤ θ_i ≤ 1, i = 1, ..., k, and \sum_{i=1}^{k} θ_i = 1; 0, otherwise.    (B.1)

Since the p.d.f. must integrate to 1, l = (k-1)!. The a posteriori density of (θ_1, ..., θ_k) can be written as:

    P(θ_1, ..., θ_k | (n_{w_1}, ..., n_{w_k})) = P(n_{w_1}, ..., n_{w_k} | (θ_1, ..., θ_k)) P(θ_1, ..., θ_k) / P(n_{w_1}, ..., n_{w_k}).    (B.2)

P(n_{w_1}, ..., n_{w_k} | (θ_1, ..., θ_k)) is a multinomial distribution:

    P(n_{w_1}, ..., n_{w_k} | (θ_1, ..., θ_k)) = (n! / (n_{w_1}! ... n_{w_k}!)) θ_1^{n_{w_1}} ... θ_k^{n_{w_k}}    (B.3)

when n_{w_1} + n_{w_2} + ... + n_{w_k} = n. So,

    P(n_{w_1}, ..., n_{w_k}) = ∫...∫ P(n_{w_1}, ..., n_{w_k} | (θ_1, ..., θ_k)) P(θ_1, ..., θ_k) dθ_1 ... dθ_k    (B.4)
                             = ((k-1)! n! / (n_{w_1}! ... n_{w_k}!)) ∫...∫ θ_1^{n_{w_1}} ... θ_k^{n_{w_k}} dθ_1 ... dθ_k    (B.5)
                             = ((k-1)! n! / (n_{w_1}! ... n_{w_k}!)) × (n_{w_1}! ... n_{w_k}!) / (n+k-1)!    (B.6)
                             = n! (k-1)! / (n+k-1)!,    (B.7)

where 0 ≤ θ_i ≤ 1 and \sum_{i=1}^{k} θ_i = 1. See [65] for details. For a squared-error loss function,

    θ̂_i = ∫...∫ θ_i P(θ_1, ..., θ_k | (n_{w_1}, ..., n_{w_k})) dθ_1 ... dθ_k    (B.8)
         = ((n+k-1)! / (n_{w_1}! ... n_{w_k}!)) ∫...∫ θ_1^{n_{w_1}} ... θ_i^{n_{w_i}+1} ... θ_k^{n_{w_k}} dθ_1 ... dθ_k    (B.9)
         = ((n+k-1)! / (n_{w_1}! ... n_{w_k}!)) × (n_{w_1}! ... (n_{w_i}+1)! ... n_{w_k}!) / (n+k)!    (B.10)
         = (n_{w_i} + 1) / (n + k).    (B.11)

Bibliography

[1] Gentle Introduction to Rainbow. http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes/gentleintroduction.html.

[2] C. Apte, F. Damerau, and S. M. Weiss. Text Categorization: a Symbolic Approach. In ACM Trans. on Information Systems, Vol. 12, No. 3, pages 233-251, 1994.

[3] D. B. Aronow, S. Soderland, J. M. Ponte, F. F. Feng, W. B. Croft, and W. G. Lehnert. Automated Classification of Encounter Notes in Computer-Based Medical Records. In Proc. of the Eighth World Congress on Medical Informatics, Vol. 8, pages 8-12, 1995.

[4] M. Balabanovic, Y. Shoham, and Y. Yun. An Adaptive Agent for Automated Web Browsing. Technical Report SIDL-WP-1995-0023, Stanford University, 1995.

[5] H. Berghel. Cyberspace 2000: Dealing with Information Overload. In Digital Village, Communications of the ACM, Vol. 40, No. 2, pages 19-24, 1997.

[6] J. Bradshaw, editor. Software Agents. AAAI Press/The MIT Press, 1997.
[7] J. P. Callan, W. B. Croft, and S. M. Harding. The INQUERY Retrieval System. In Proc. of the 3rd International Conference on Database and Expert Systems Applications, pages 347-356, 1993.

[8] J. G. Carbonell, R. S. Michalski, and T. M. Mitchell. Machine Learning, An Artificial Intelligence Approach, chapter An Overview of Machine Learning, pages 3-23. Tioga Publishing Company, 1983.

[9] A. Chavez and P. Maes. Kasbah: An Agent Marketplace for Buying and Selling Goods. In Proceedings of the First International Conference on the Practical Application of Intelligent Agents and Multi-Agent Technology, 1996.

[10] H. Chen. Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms. In Journal of the American Society for Information Science, Vol. 46, No. 3, pages 194-216, 1995.

[11] H. Chen and V. Dhar. Cognitive process as a basis for intelligent retrieval system design. In Information Processing and Management, Vol. 27, No. 5, pages 405-432, 1991.

[12] W. W. Cohen. Fast Effective Rule Induction. In Machine Learning: Proc. of the 12th International Conference, Lake Tahoe, California, 1995.

[13] W. W. Cohen. Learning Rules that Classify E-Mail. In Proc. 1996 AAAI Spring Symposium on Machine Learning in Information Access, Stanford, 1996.

[14] M. W. Craven and J. W. Shavlik. Learning Symbolic Rules Using Artificial Neural Networks. In Proc. of the Tenth International Conference on Machine Learning, pages 73-80, 1993.

[15] V. Dasigi and R. C. Mann. Neural Net Learning Issues in Classification of Free Text Documents. In AAAI Spring Symposium on Machine Learning in Information Access Technical Papers, Palo Alto, March 1996.

[16] P. Domingos and M. Pazzani. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. In Proc. MLC96, pages 105-112, 1996.

[17] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley & Sons Inc., 1973.

[18] C. Faloutsos and D. Oard. A Survey of Information Retrieval and Filtering Methods. Technical Report CS-TR-3541, University of Maryland, 1995.

[19] D. H. Foley and J. W. Sammon, Jr. An Optimal Set of Discriminant Vectors. In IEEE Trans. Comput., C-24, pages 281-289, 1975.

[20] L. N. Foner. Yenta: A Multi-Agent, Referral Based Matchmaking System. In The First International Conference on Autonomous Agents (Agents '97), California, 1997.

[21] C. Fox. Lexical Analysis and Stoplists, pages 102-130. Prentice Hall, 1992. Edited by W. B. Frakes and R. Baeza-Yates.

[22] W. B. Frakes. Stemming Algorithms. In Information Retrieval: Data Structures and Algorithms, pages 131-160. Prentice Hall, 1992. Edited by W. B. Frakes and R. Baeza-Yates.

[23] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proc. Second European Conference on Computational Learning Theory, pages 23-37, 1995.

[24] N. Friedman and M. Goldszmidt. Building Classifiers Using Bayesian Networks. In AAAI, Vol. 2, pages 1277-1284, 1996.

[25] N. Friedman, D. Heckerman, M. Goldszmidt, and S. Russell. Challenge: Where is the Impact of Bayesian Networks in Learning? In Proc. Fifteenth International Joint Conference on Artificial Intelligence (IJCAI), 1997.

[26] G. W. Gates. The reduced nearest neighbor rule. In IEEE Trans. Information Theory, IT-18, pages 431-433, 1972.

[27] B. Gelfand, M. Wulfekuhler, and W. F. Punch.
Automatic Concept Extraction from Plain Text. Technical report, Michigan State University, 1998.

[28] G. Giacinto and F. Roli. Adaptive Selection of Image Classifiers. In Proc. of ICIAP '97 (Springer Verlag Lecture Notes in CS, Vol. 1310), pages 38-45, 1997.

[29] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. In SIAM Journal on Numerical Analysis, Series B, Vol. 2, No. 2, pages 205-224, 1965.

[30] Y. Hamamoto, T. Kanaoka, and S. Tomita. On a Theoretical Comparison between the Orthonormal Discriminant Vector Method and Discriminant Analysis. In Pattern Recognition, Vol. 26, No. 12, pages 1863-1867, 1993.

[31] S. Haykin. Neural Networks. Macmillan College, 1994.

[32] D. Heckerman. A Tutorial on Learning with Bayesian Networks. Technical Report MSR-TR-95-06, Microsoft Research, 1996.

[33] A. K. Jain and B. Chandrasekaran. Dimensionality and Sample Size Considerations in Pattern Recognition Practice. In Handbook of Statistics, pages 835-855. North-Holland, 1982. Edited by P. R. Krishnaiah and L. N. Kanal.

[34] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[35] A. K. Jain and D. Zongker. Feature Selection: Evaluation, Application and Small Sample Performance. In IEEE Trans. PAMI, Vol. 19, No. 2, pages 153-157, 1997.

[36] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: A Tour Guide for the World Wide Web. In Proc. IJCAI '97, 1997.

[37] D. Koller and M. Sahami. Toward Optimal Feature Selection. In Proceedings of the 13th International Conference on Machine Learning (ML), pages 284-292, Bari, Italy, 1996.

[38] K. Lang. NewsWeeder: Learning to Filter Netnews. In Proc. 12th International Conference on Machine Learning, pages 331-339, Tahoe City, CA, 1995.

[39] L. S. Larkey and W. B. Croft. Combining Classifiers in Text Classification. In Proc. of SIGIR '96, pages 289-297, 1996.

[40] Y. Lashkari, M. Metral, and P. Maes. Collaborative Interface Agents. In Proceedings of the AAAI '94 Conference, Seattle, Washington, 1994.

[41] J. Lee and S. M. Chung. Information Discovery on the Internet: Beyond Search Engines. In SERI Journal, Vol. 2, No. 1, pages 33-45, 1998.

[42] D. D. Lewis. Evaluating Text Categorization. In Proc. of Speech and Natural Language Workshop, pages 312-318, Kaufmann, San Mateo, CA, 1991.

[43] D. D. Lewis. Representation and Learning in Information Retrieval. PhD thesis, University of Massachusetts, 1992.

[44] D. D. Lewis and M. Ringuette. A Comparison of Two Learning Algorithms for Text Categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-93, Las Vegas, NV, 1994.

[45] H. Lieberman. Letizia: An Agent That Assists Web Browsing. In International Joint Conference on Artificial Intelligence, Montreal, 1995.

[46] C. Lynch. Searching the Internet. In Scientific American, pages 52-56, 1997.

[47] Y. Ming. Sampling strategies and learning efficiency in text categorization. In Proc. 1996 AAAI Spring Symposium on Machine Learning in Information Access, pages 88-95, Stanford, 1996.

[48] T. Mitchell. Machine Learning. McGraw-Hill, 1997.

[49] D. Mladenic. Personal WebWatcher: Design and Implementation. Technical Report IJS-DP-7472, Carnegie Mellon University, October 1996.

[50] I. Moulinier, G. Raskinis, and J.-G. Ganascia. Text Categorization: a Symbolic Approach. In SDAIR96, Las Vegas, 1996.

[51] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. In Journal of the American Society for Information Science, Vol. 41, No. 6, pages 391-407, 1990.
[52] S. A. Nene and S. K. Nayar. A simple algorithm for nearest neighbor search in high dimensions. In IEEE Trans. PAMI, Vol. 19, No. 9, pages 989-1003, 1997.

[53] H. S. Nwana. Software Agents: An Overview. In Knowledge Engineering Review, Vol. 11, No. 3, pages 1-40, 1996.

[54] M. J. Wooldridge and N. R. Jennings. Intelligent Agents: Theory and Practice. In Knowledge Engineering Review, Vol. 10, No. 2, pages 115-152, 1995.

[55] D. W. Oard and G. Marchionini. A Conceptual Framework for Text Filtering. Technical Report EE-TR-96-25, CAR-TR-830, CLIS-TR-96-02, CS-TR-3643, University of Maryland, 1996.

[56] E. Oja. Subspace Methods of Pattern Recognition. Wiley, 1983.

[57] T. Okada and S. Tomita. An Optimal Orthonormal System for Discriminant Analysis. In Pattern Recognition, Vol. 18, No. 2, pages 139-144, 1985.

[58] M. Pazzani, J. Muramatsu, and D. Billsus. Syskill & Webert: Identifying Interesting Web Sites. In AAAI Spring Symposium on Machine Learning in Information Access Technical Papers, Palo Alto, March 1996.

[59] M. Pazzani, L. Nguyen, and S. Mantik. Learning from Hotlists and Coldlists: Toward a WWW Information Filtering and Seeking Agent. In Proc. 6th International Conference on Tools with Artificial Intelligence, Washington DC, 1995.

[60] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.

[61] J. R. Quinlan. Induction of Decision Trees. In Machine Learning, Vol. 1, pages 81-106, 1986.

[62] J. R. Quinlan. Rule Induction with Statistical Data - a Comparison with Multiple Regression. In Journal of the Operational Research Society, Vol. 38, pages 347-352, 1987.

[63] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[64] E. S. Ristad. A Natural Law of Succession. Technical Report CS-TR-495-95, Princeton University, 1995.

[65] S. M. Ross. Probability Models. Academic Press, 1993.

[66] M. Sahami. Learning Limited Dependence Bayesian Classifiers. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 335-338, Portland, Oregon, 1996.

[67] G. Salton. Automatic Text Processing. Addison-Wesley, 1989.

[68] G. Salton, J. Allan, and C. Buckley. Automatic Structuring and Retrieval of Large Text Files. In Communications of the ACM, Vol. 37, No. 2, pages 97-108, 1994.

[69] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[70] S. Chen. Learning-Based Vision and its Application in Indoor Navigation. PhD thesis, Michigan State University, 1998.

[71] H. Schutze, D. A. Hull, and J. O. Pedersen. A Comparison of Classifiers and Document Representations for the Routing Problem. In SIGIR95, pages 229-237, Washington DC, 1995.

[72] E. Selberg and O. Etzioni. The MetaCrawler Architecture for Resource Aggregation on the Web. In IEEE Expert, Vol. 12, No. 1, pages 8-14, 1997.

[73] B. D. Sheth. A Learning Approach to Personal Information Filtering. Master's thesis, Massachusetts Institute of Technology, 1994.

[74] A. F. Smeaton and F. Crimmins. Using a Data Fusion Agent for Searching the WWW. In Sixth International World Wide Web Conference, Santa Clara, CA, 1997.

[75] H. Turtle and W. B. Croft. Inference Networks for Document Retrieval. In Proc. of the 13th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 1-24, Brussels, Belgium, 1990.
[76] H. Turtle and W. B. Croft. Evaluation of an Inference Network-based Retrieval Model. In ACM Trans. on Information Systems, Vol. 9, No. 3, pages 187-222, 1991.

[77] Wallace C. Koehler, Jr. An End User's View of Mining the Web: Focused and Satisfied Internet Search and Retrieval Strategies. In INET 97 Proceedings, Education 3, 1997. http://www.isoc.org/isoc/whatis/conferences/inet/97/proceedings/D3/INDEX.HTM

[78] A. R. Webb and D. Lowe. The Optimized Internal Representation of Multilayer Classifier Networks Performs Nonlinear Discriminant Analysis. In Neural Networks, Vol. 3, pages 367-375, 1990.

[79] S. A. Weiss, S. Kasif, and E. Brill. Text Classification in USENET Newsgroups: A Progress Report. In AAAI Spring Symposium on Machine Learning in Information Access Technical Papers, Palo Alto, March 1996.

[80] K. Woods, W. P. Kegelmeyer, and K. Bowyer, Jr. Combination of Multiple Classifiers Using Local Accuracy Estimates. In IEEE Trans. PAMI, Vol. 19, No. 4, pages 405-410, 1997.

[81] M. R. Wulfekuhler and W. F. Punch. Finding Salient Features for Personal Web Page Categories. In Sixth International World Wide Web Conference, Santa Clara, CA, 1997.

[82] T. W. Yan and H. Garcia-Molina. SIFT - A Tool for Wide-Area Information Dissemination. In Proc. 1995 USENIX Technical Conference, pages 177-186, 1995.