DESIGN AND ANALYSIS OF NEURAL NETWORKS FOR PATTERN RECOGNITION

By Jianchang Mao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Computer Science Department

1994

ABSTRACT

DESIGN AND ANALYSIS OF NEURAL NETWORKS FOR PATTERN RECOGNITION

By Jianchang Mao

There are many common links between neural networks and classical statistical pattern recognition approaches for designing a recognition system. For example, many statistical pattern recognition approaches can be mapped onto a neural network architecture. On the other hand, some well-known neural network models are related to classical statistical methods. In spite of numerous successful applications of feedforward networks reported in the literature, their generalization behavior has not been well understood. This dissertation further expands the common links between the two disciplines through the design of neural networks for pattern recognition problems and the analysis of the generalization ability of feedforward networks.

A number of neural networks and learning algorithms for feature extraction, data projection, classification and clustering are proposed. These networks include: a linear discriminant analysis network, a network for Sammon's projection, a nonlinear projection based on Kohonen's self-organizing maps, a nonlinear discriminant analysis network, a network-based k-nearest neighbor classifier, and a network for detecting hyperellipsoidal clusters. A comparative study of five networks for feature extraction and data projection is also conducted.

The generalization ability of a feedforward network refers to its performance on independent test data relative to its performance on the training data. Two different approaches for studying the generalization ability of feedforward networks, Vapnik's theory and Moody's approach, are compared. Moody's approach is extended to a more general setting which allows the sampling points of the test data to differ from those in the training data according to an additive noise model. Monte Carlo experiments are conducted to verify Moody's result and our extension of his model, and to demonstrate the role of weight-decay regularization in reducing the effective number of parameters. We also present a taxonomy of regularization techniques in neural networks, and demonstrate that a variety of networks and learning algorithms fit into this framework. One significant impact of the theoretical analysis of the generalization ability of feedforward networks is that it reveals how the different factors involved in designing feedforward networks interact with each other. A practical node pruning method is proposed for designing parsimonious networks which generalize well, and for reducing the dimensionality of the feature vector.

ACKNOWLEDGMENTS

I am deeply indebted to my advisor, Professor Anil K. Jain, who has been an inexhaustible source of ideas, encouragement and support throughout my program at Michigan State University. Without him, this dissertation would not have been completed. I am very lucky to have such a distinguished and responsible professor as my advisor.
I am grateful to professors John Weng, Michael Shanblatt, and Alvin Smucker, for serving on my committee, sharing ideas with me, and providing critical comments on my dissertation. I would like to express my special thanks to late Professor Richard Dubes, who had served on my dissertation committee till the last minute of his life. His invaluable comments on my dissertation proposal helped me to shape my dissertation research. My sincere thank goes to my wife Yao for her love, patience, encouragement, support, and understanding. I share all my academic accomplishments with her. Thanks to my son, David, who was born in this period, for having brought a lot of joy to my family. I would also like to express my heartful thanks to my mother, Aiqing, for her lifelong love, sacrifice and support. Without her love, sacrifice, and hard work during my childhood and teenage years, I would not have received any education. The last stage of this dissertation was completed during my employment at the IBM Almaden Research Center. I would like to express my deep gratitude to Dr. K. Mohiuddin, the manager of Integrated Data Management group, for his under— standing, support, valuable discussions on my dissertation, and challenging me with iv difficult real—world problems. Research reported in Chapter 7 was completed when I worked in his group as a student co—op in 1993. Thanks also to other members of his group, Dick Casey, Raymond Lorie, Andreas Kornai, and Sandeep GOpisetty for exposing me to their interesting research. A special acknowledgment goes to Dr. Eric Saund of XEROX Palo Alto Research Center, where I spent one summer as a research intern. A broad exposure to various research problems made my internship a valuable experience. During my stay in the Pattern Recognition and Image Processing (PRIP) Labora- tory at Michigan State University, I met many distinguished visitors and brilliant stu- dents. I would like to thank our two summer visitors, Martin Kraaijveld and Wouter Schmidt from Delft University of Technology, The Netherlands, for many motivated discussions and pleasant interaction. Thanks also to PRIPpies: Sushma Agrawal, Yao Chen, Shaoyun Chen, Jinlong Chen, Yuntao Cui, Chitra Dorai, Marrie—Piere Dubuisson, Sally Howden, Michiel Huizinga, Kalle Karu, Jan Linthorst, Sharathcha Pankanti, Nalini Ratha, Dan Swets, Marilyn Wulfekuhler, Yu Zhong, and former PRIPpies Hans Dulimart, Farshid Farrokhnia, Pat Flynn, Qian Huang, Greg Lee, Sateesh Nadabar, Tim Newman, Narayan Raja, and Deborah 'I‘rytten. Many of them have become good friends of mine. I enjoyed many open discussions and wonderful group activities which made my graduate study unforgettable. TABLE OF CONTENTS LIST OF TABLES ix LIST OF FIGURES x 1 Introduction 1 1.1 Statistical Pattern Recognition ..................... 2 1.2 Artificial Neural Networks ........................ 4 1.3 Common Links Between Neural Networks and Statistical Pattern Recognition ................................ 7 1.3.1 Representation .......................... 8 1.3.2 Feature Extraction and Data Projection ............ 9 1.3.3 Classification ........................... 11 1.3.4 Clustering ............................. 16 1.3.5 “Curse of Dimensionality” and Generalization Issues ..... 17 1.4 Contributions of This Thesis ....................... 21 1.5 Organization of This Thesis ....................... 23 2 Neural Networks for Feature Extraction and Data Projection 25 2.1 Feature Extraction and Data Projection ................ 
26 2.1.1 Principal Component Analysis (PCA) Networks ........ 28 2.1.2 Linear Discriminant Analysis (LDA) Network ......... 32 2.1.3 A Neural Network for Sammon’s Projection (SAMAN N) . . . 37 2.1.4 A Nonlinear Projection Based on Kohonen’s Self-Organizing Map (NP-SOM) .......................... 44 2.1.5 Nonlinear Discriminant Analysis (NDA) Network ....... 47 2.1.6 Summary of Networks for Feature Extraction and Data Projection 48 2.2 Comparative Study ............................ 49 2.2.1 Methodology ........................... 49 2.2.2 Visualization of Projection Maps ................ 53 2.2.3 Evaluation Based on Quantitative Criteria ........... 67 2.3 Summary ................................. 71 3 Neural Network-Based K -Nearest Neighbor Classifier 73 3.1 K -Nearest Neighbor Classifier ...................... 74 3.2 A Neural Network Architecture for the k-NN Classifier ........ 75 3.2.1 Design of the k-Maximum Network ............... 77 vi 3.2.2 Performance of the k-N N Neural Network Classifier ...... 81 3.3 Modified k-N N algorithms ........................ 82 3.3.1 Condensed nearest neighbor algorithm ............. 82 3.3.2 Reduced nearest neighbor rule .................. 84 3.3.3 Edited nearest neighbor rule ................... 84 3.4 Summary ................................. 86 A Network for Detecting Hyperellipsoidal Clusters (HEC) 88 4.1 Hyperspherical Versus Hyperellipsoidal Clusters ............ 89 4.2 Network for Hyperellipsoidal Clusters (HEC) .............. 96 4.2.1 The HEC Network and Learning Algorithm .......... 97 4.2.2 Simulations ............................ 103 4.3 Image Segmentation Using the HEC Network ............. 112 4.3.1 Representation .......................... 112 4.3.2 Segmentation Results ....................... 113 4.4 Summary ................................. 117 Generalization Ability of Feedforward Networks 119 5.1 Vapnik’s Theory .............................. 122 5.1.1 Vapnik-Chervonenkis Dimension ................. 122 5.1.2 Bound on Deviation of Time Expected Error From Empirical Error ................................ 123 5.1.3 Bounds on Sample Size Requirements .............. 125 5.2 Moody’s Approach ............................ 126 5.2.1 “Effective” Number of Parameters ................ 127 5.2.2 Direct Relationship between Expected MSEs on Training and Test Sets .............................. 127 5.2.3 Sample Size Requirement ..................... 129 5.3 Vapnik’s Theorem Versus Moody’s Results ............... 129 5.4 Extension of Moody’s Model ....................... 132 5.5 Simulation Results ............................ 135 5.6 Improving Generalization Abilities of Feedforward Networks ..... 141 5.7 Summary ................................. 143 Regularization Techniques in Neural Networks 144 6.1 Introduction to Regularization Techniques ............... 145 6.2 A Taxonomy of Regularization Techniques in Neural Networks . . . . 147 6.2.1 Type I: Architecture-Inherent Regularization .......... 148 6.2.2 Type II: Algorithm-Inherent Regularization .......... 152 6.2.3 Type III: Explicitly-Specified Regularization .......... 156 6.3 Role of Regularization in Neural Networks ............... 161 6.4 Summary ................................. 162 Improving Generalization Ability Through Node Pruning 163 7.1 Node Saliency ............................... 164 vii 7.2 N ode—Pruning Procedure ......................... 167 7.3 Experimental Results ........................... 169 7.3.1 Rank-Order of Features ..................... 
170 7.3.2 Feature Selection ......................... 173 7.3.3 Creating a Small Network Through Node-Pruning ....... 175 7.3.4 Application to an OCR system ................. 177 7.4 Summary ................................. 178 8 Summary, Conclusions and Future Work 180 A Derivation of the Effective Number of Parameters 186 B Regularization via Adding Noise to Weights and Training Patterns 192 B1 Adding Noise to Weights Only ...................... 193 B2 Adding Noise to Weights and Tfaining Patterns Simultaneously . . . 194 BIBLIOGRAPHY 196 viii 2.1 2.2 2.3 2.4 2.5 3.1 3.2 3.3 3.4 3.5 3.6 4.1 4.2 4.3 4.4 7.1 LIST OF TABLES Some features of the five representative networks. ........... Characteristics of the eight data sets. .................. The average values of Sammon’s stress for the PCA, LDA, SAMANN, N DA and NP-SOM networks on eight data sets. ........... The average values of the nearest neighbor classification error rate P81”, and relative ratio Ry” for the PCA, LDA, SAMANN, NDA and NP- SOM networks on eight data sets. ................... The average values of the minimum distance classification error rate PCMD and relative ratio R3“) for the PCA, LDA, SAMANN, NDA and NP-SOM networks on eight data sets. ................. Actual number of “on” nodes ...................... Error rates of k-NN classifiers ...................... Error rates and number of training patterns retained in the CNN rule Error rates and number of training patterns in the RNN rule Error rates and number of training patterns in the EN N rule Reductions in number of nodes and reduction rates of the CNN, RN N and ENN rules on three data sets .................... The 4 x 4 covariance matrices of the three classes in the IRIS data set. The significance levels of the Kolmogrov-Smirnov test for all the clus- ters in Figures 3(a)-3(d). ........................ The significance levels of the Kolmogrov-Smirnov test and numbers of misclassified patterns by the K-means algorithm and the HEC network for the IRIS data. The eigenvector projections of the 3—cluster solutions are shown in Figures 4(b)-4(c) and 5(b)-5(c), and those of the 2-cluster solutions are shown in Figures 6 and 7. ................ The “misclassification” rates of the texture segmentation results in Figures 10(c)-10(h). ........................... Classification accuracies using 5 features selected by the node-pruning procedure and Whitney’s feature selection procedure. ........ ix 86 92 104 107 115 1.1 1.2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 LIST OF FIGURES Dichotomies in the design of a SPR system. .............. A typical three-layer feedforward network. ............... A taxonomy of neural networks for feature extraction and data pro jec- tion. .................................... PCA network proposed by Rubner et a1. ................ Architecture of the LDA network ..................... A sketch of the NP-SOM network ..................... Some data sets used in the comparative study. (a) A perspective view of data set 2. (b) Range image and (c) texture image from which data sets 7 and 8 are extracted, respectively .................. 2D projection maps of data set Gauss2 using the four networks. (a) PCA network, (b) LDA network, (c) SAMANN network, and (d) NDA network. ................................. 2D projection maps of data set Curve2 using the four networks. (a) PCA network, (b) LDA network, (c) SAMANN network, and (d) NDA network (2 classes become linearly well-separable). .......... 
2D projection maps of data set Sphere2 using the four networks. (a) PCA network, (b) LDA network, (c) SAMANN network, and (d) NDA network (the black blob exclusively contains all the patterns on the surface of the inner sphere). ....................... 2D projection maps of data set Random using the four networks. (a) PCA network, (b) LDA network, (c) SAMANN network, and (d) NDA network. ................................. 2D projection maps of the IRIS data set using the four networks. (a) PCA network, (b) LDA network, (c) SAMANN network, and (d) NDA network (the black dot on the lower-left corner exclusively contains all the 50 patterns from class 1). ...................... 2D projection maps of the 80X data set using the four networks. (a) PCA network, (b) LDA network, (c) SAMANN network, and (d) NDA network. ................................. 2D projection maps of data set Range using the four networks. (a) PCA network, (b) LDA network, (0) SAMANN network, and (d) NDA network. ................................. 51 58 59 60 2.13 2.14 3.1 3.2 4.1 4.2 4.3 4.4 4.5 4.6 4.7 2D projection maps of data set Texture using the four networks. (a) PCA network, (b) LDA network, (0) SAMANN network, and (d) NDA network. ................................. The distance images of Kohonen’s SOM superimposed by the 2D pro- jection maps and boundaries between classes for the 8 data sets us- ing the NP-SOM network. (a) Gauss2, (b) Curve2, (c) Sphere2, (d) Random, (e) IRIS, (f) 80X, (g) Range, and (h) Texture. ........ A neural network architecture of the k-NN classifier .......... k-Maximum network ........................... Principal component (eigenvector) projection of the IRIS data. (a) Projection on the first and the second components. (b) Projection on the first and fourth components. Small circles. (0), triangles (A) and “plus” sign (+) denote patterns from classes 1 (iris setosa), 2 (iris versicolor) and 3 (iris virginica), respectively. ............. A scenario showing the possible 2-cluster solutions by the K-means clustering algorithm with the Euclidean and Mahalanobis distance measures. ................................. 66 76 95 Architecture of the neural network (HEC) for hyperellipsoidal clustering. 97 The K-means clustering with the Euclidean distance measure versus the HEC clustering on two artificial data sets, each containing two clusters. (a)-(b): data set 1; (c)-(d): data set 2; (a) and (c): results of the K-means clustering algorithm with the Euclidean distance measure; (b) and (d): results of the HEC network. ............... Two-dimensional projection maps (the first two principal components) of 3-cluster solutions by the K-means clustering with the Euclidean distance measure and the HEC clustering on the IRIS data. (a) The three true classes. (b) Result of the K-means clustering algorithm with the Euclidean distance measure. (c) Result of the HEC network. Two-dimensional projection maps (the first and fourth principal com- ponents) of 3—cluster solutions by the K-means clustering with the Eu- clidean distance measure and the HEC clustering on the IRIS data. (a) The three true classes. (b) Result of the K-means clustering algorithm with the Euclidean distance measure. (c) Result of the HEC network. Two-dimensional projection maps (the first two principal components) of 2-cluster solutions by the K-means clustering with the Euclidean distance measure and the HEC clustering on the IRIS data. (a) Two clusters produced by the K-means clustering algorithm with the Eu- clidean distance measure. 
Note the three “misclassified” patterns. (b) Two clusters produced by the HEC network. ............. xi 105 108 109 4.8 Two-dimensional projection maps (the first and fourth principal com- 4.9 5.1 5.2 5.3 6.1 6.2 7.1 7.2 ponents) of 2-cluster solutions by the K-means clustering with the Eu- clidean distance measure and the HEC clustering on the IRIS data. (a) Two clusters produced by the K-means clustering algorithm with the Euclidean distance measure. Note the three “misclassified” patterns. (b) Two clusters produced by the HEC network. ........... The K-means clustering with the Euclidean distance measure versus the HEC clustering for texture segmentation. (a) and (b): Original composite textured images. (c) and (d): 4-class segmentations of the image in (a) using the K-means algorithm on the original features and z-score normalized features, respectively. (e): 4—class segmentation using the HEC network. (f) and (g): 5-class segmentations of the image in (b) using the K-means algorithm on the original features and z-score normalized features, respectively. (h): 5-class segmentation using the HEC network. ......................... Values of Pen; and Pen, at the 15 consecutive cycles starting with 6000. The values of the square error (SE-cycle curve) on the training data set are also shown. ......................... Empirical values of peffy_em and peff_em, and the average values of pen”, pen, and p8” versus the regularization parameter A. ..... Standard deviations of the estimates of Peffz: p8,,” and pa” with re- spect to the regularization parameter A. ................ Projection maps of the IRIS data. (a) Original Sammon’s algo- rithm, E=0.0068; (b) 2-layer neural network with 100 hidden nodes, E=0.0061. ................................ Generalization ability of the network. (a) Projection of 75 randomly chosen training patterns from the IRIS data using a 2-layer network with 100 hidden nodes, E 20.0049; (b) Projection of all the 150 patterns using the trained network in (a), E=0.0057. .............. The distances between the two class means for individual features. The solid, dotted, and dashed curves plot the distance values obtained from the 500 training patterns, 50 training patterns, and the model which generates the data, respectively. The solid and dotted curves are smoothed by averaging over ten data sets of the corresponding sizes (500 and 50 patterns). ....................... The nearest neighbor classification accuracies of individual features. The solid curve: using 500 training patterns. The dotted curve: using 50 training patterns. The values of accuracy are averaged over 10 training sets of the corresponding size. A test set containing 1000 patterns is used for estimating the classification accuracies. ..... xii 111 116 137 138 140 152 153 171 7.3 7.4 7.5 7.6 7.7 The averaged saliency measures for the 50 features. The averaging is taken over 100 trials (10 trials with different initial weights for each of the 10 training sets). Solid curve: data sets with 500 training patterns. Dotted curve: data sets with 50 training patterns. .......... The probability of individual features belonging to the subset of 5 selected features using the node-pruning procedure on the modified 'Ii‘unk’s data. Solid curve: data sets with 500 training patterns. Dotted curve: data sets with 50 training patterns. ............... The probability of individual features belonging to the subset of 5 selected features using Whitney’s feature selection procedure on the modified 'I‘runk’s data. 
Solid curve: data sets with 500 training pat- terns. Dotted curve: data sets with 50 training patterns. The curves are averaged over 10 training data sets of each size. .......... Classification error rates for the modified Trunk’s data set versus the number of features being removed. The initial network contains a total of 50 input nodes (features). Solid curve: data sets with 500 training patterns. Dotted curve: data sets with 50 training patterns.’ ..... Classification error rates for the 801: data set versus the number of nodes being removed. The initial network contains a total of 21 nodes. xiii 173 175 176 177 178 CHAPTER 1 Introduction Automatic (machine) recognition, description, classification, and grouping of patterns are important activities in a variety of engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, artificial intelligence, pattern recognition, and remote sensing. What is a pattern? Watanabe [172] defines a pattern as an entity, vaguely defined, that could be given a name. A pattern is represented by a vector of measured properties or features and their interrelationships. For example, the recognition problem can range from using fingerprints to identify a suspect in a criminal case to finding defects in a printed circuit board [79]. The corresponding features could be minutiae (i.e., location of minutiae and pattern type) in the fingerprint or the broken line segments or loops in the printed circuit board. A pattern recognition system is expected to (i) identify a given pattern as a member of already known or defined classes (supervised classification), or (ii) assign a pattern to a so far unknown class of patterns (unsupervised classification or clustering). Design of a pattern recognition system essentially involves the following three steps: (i) data acquisition and preprocessing, (ii) representation/feature extraction, and (iii) decision making or clustering. The problem domain dictates the choice of sensor, preprocessing techniques, and decision making model. The domain-specific knowledge is implicit in the design and is not represented in a separate module (as is commonly done in expert systems). Some of the early work in pattern recognition was biologically motivated; the objective was to develop a recognition system whose structure and strategy followed the one utilized by humans. The work on perceptrons [149] is a good example of such attempts which were not very successful. The current serious activity in the area of artificial neural networks and connectionist models is reminiscent of this early work. The three most well-known models for designing a pattern recognition system are: (i) statistical, (ii) syntactic or structural, and (iii) neural networks. This thesis addresses the design and analysis of neural networks for pattern recognition. We will show that there is a considerable amount of overlap between the statistical pat- tern recognition (SPR) approaches and neural network approaches to design pattern recognition systems. The following two subsections present an overview of these two models. 1.1 Statistical Pattern Recognition Statistical pattern recognition is a relatively mature discipline and a number of com- mercial recognition systems have been designed based on this approach. In statistical pattern recognition, a pattern is represented by a set of d numerical measurements or a d—dimensional feature vector. Thus, a pattern can be viewed as a point in this d-dimensional space. 
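As a small illustration of this representation (not from the dissertation; the measurement values below are made up), the following sketch treats each pattern as a point in a d-dimensional feature space and compares patterns by their Euclidean distance, which is the geometric view that the decision rules discussed next build upon.

import numpy as np

# Each pattern is a d-dimensional feature vector, i.e., a point in R^d (here d = 4).
pattern_a = np.array([5.1, 3.5, 1.4, 0.2])
pattern_b = np.array([6.7, 3.0, 5.2, 2.3])
pattern_c = np.array([5.0, 3.4, 1.5, 0.2])

# Proximity in this feature space is what classifiers and clustering procedures
# ultimately exploit; Euclidean distance is the simplest choice of metric.
d_ab = np.linalg.norm(pattern_a - pattern_b)
d_ac = np.linalg.norm(pattern_a - pattern_c)
print(d_ab > d_ac)  # pattern_a lies much closer to pattern_c than to pattern_b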
An SPR system operates in two modes: training and recognition. In the training mode, the feature extractor is designed to find the appropriate features for representing the input patterns, and the classifier is trained on a set of training data to partition the feature space so that the number of misclassified patterns is minimized. In the recognition mode, the trained classifier assigns the input test pattern to one of the categories/clusters based on the extracted feature vector.

Figure 1.1. Dichotomies in the design of a SPR system. (The tree branches on the number of training samples, labeled versus unlabeled training samples, known versus unknown form of the density function, and known versus unknown number of pattern classes, leading to the Bayes decision rule, "optimal" rules, plug-in rules, density estimation, k-NN rules, mixture resolving, and cluster analysis.)

The decision making process in SPR can be summarized as follows. Given a pattern represented by a d-dimensional feature vector, x = (x_1, x_2, ..., x_d)^T, assign it to one of c categories, ω_1, ω_2, ..., ω_c. Depending on what kind of information is available about the class-conditional densities, various strategies can be used to design a classifier. If all the class-conditional densities p(x|ω_i), i = 1, 2, ..., c, are completely specified, then the optimal Bayes decision rule can be used to establish the decision boundaries between pattern classes. However, the class-conditional densities are never known in practice and must be learned from the training patterns. If the functional form of the class-conditional densities is known, but some of the parameters of the densities are unknown, then the problem is called a parametric decision making problem. Otherwise, either the density function must be estimated or some nonparametric decision rule must be used. The various dichotomies which appear in the design of a SPR system are shown in the tree diagram of Fig. 1.1. Various approaches in neural networks and statistical pattern recognition will be compared using this diagram.

1.2 Artificial Neural Networks

The field of neural networks is an interdisciplinary area of research. A thorough study of neural networks requires some understanding of neurophysiology, control theory, mathematics, statistics, decision making, and distributed computing. The diverse nature of topics studied under this heading makes it impossible for us to address all the related issues and topics. Neural networks have been viewed from various perspectives [109, 74, 147, 178, 9, 12].

Research in neural networks has experienced three consecutive cycles of enthusiasm and skepticism. The first peak, dating back to the 1940's, is due to McCulloch and Pitts' pioneering work [117]. The second period of intense activity occurred in the 1960's, which featured Rosenblatt's perceptron convergence theorem [149] and Minsky and Papert's work showing the limitations of the simple perceptron [122]. Minsky and Papert's results dampened the enthusiasm of most researchers, especially those in the computer science community. As a result, there was a lull in neural network research for almost 20 years. Since the early 1980's, neural networks have received considerable renewed interest.
The major developments behind this resurgence include Hopfield's energy approach [68] in 1982, and the backpropagation learning algorithm for multilayer perceptrons (multilayer feedforward networks), which was first proposed by Werbos [176], reinvented several times, and popularized by Rumelhart et al. [152] in 1986. Anderson and Rosenfeld [3] provide a detailed historical account of developments in neural networks.

Different interconnection strategies lead to different types of neural networks (feedback versus feedforward) which require different learning algorithms. Another dichotomy is based on the type of learning algorithm used. There are three types of learning algorithms: supervised, unsupervised, and hybrid (a combination of supervised and unsupervised methods). Supervised learning includes reinforcement learning as a special case, in which only qualitative target values (yes or no) are used.

Figure 1.2. A typical three-layer feedforward network, with an input layer, hidden layers, and an output layer.

Multilayer feedforward networks have emerged as a popular model for pattern classification tasks. A standard L-layer feedforward network (in this thesis, we adopt the convention that the input nodes are not counted as a layer) consists of one input stage, L − 1 hidden layers, and one output layer of nodes which are successively connected in a feedforward fashion, with no connections between nodes in the same layer and no feedback connections between layers. A typical three-layer feedforward network is shown in Figure 1.2. It has been shown by many researchers that a 2-layer network is capable of forming an arbitrary nonlinear decision boundary and approximating any continuous nonlinear function (see [63, 188]), provided the number of nodes in the hidden layer can be arbitrarily large. Unfortunately, these theoretical studies have not addressed the practical problem of selecting the number of nodes in the hidden layer. In fact, the number of nodes in the hidden layer cannot be arbitrarily large due to the phenomenon of "curse of dimensionality" [142]. Moreover, these theoretical studies have not introduced any new practical learning methods. As a result, the backpropagation algorithm [152] has been the most commonly used method for training multilayer feedforward networks.

Neural networks were originally proposed as simplified models of biological neural networks, and have been studied using mathematical tools, computer software simulations, and hardware implementations. Although many existing neural network models are extremely simplified versions of biological neural networks, they are valuable for understanding some principles of biological computation. One of the ultimate goals of neural networks is to explore why biological systems can effortlessly perform some tasks, such as pattern recognition, which are so difficult for modern conventional computers (von Neumann architecture), and to build working systems to perform these tasks based on state-of-the-art hardware (digital or analog) technology.

Neural networks offer some information processing advantages characterized by their robustness, adaptivity, self-organization, distributed and massive parallelism, and ease of implementation in hardware. In comparison to rule-based or knowledge-based systems, neural networks are claimed to be capable of automatically building the internal representation of knowledge from examples.
Therefore, they can be used in situations where the domain knowledge is neither available nor complete, or the underlying problem is not fully understood (e.g., [155, 166]). 1.3 Common Links Between Neural Networks and Statistical Pattern Recognition Machine pattern recognition techniques attempt to endow modern computers some of the recognition capabilities that humans possess. Some of the early work in pattern recognition, such as the work on perceptrons [149], was biologically motivated. Such approaches were not very successful in solving difficult recognition problems. As a result, statistical pattern recognition became the model of choice for designing recognition systems. The designers of pattern recognition systems are more concerned with the efficiency and performance issues rather than the biological plausibility of the underlying approaches. Interestingly, we have observed that many newly developed neural network models have also been influenced by this philosophy. Neural networks and statistical pattern recognition are two closely related disci- plines which share several common design problems. In the neural network commu- nity, pattern recognition is considered as one of the challenging problems. On the other hand, in the pattern recognition community, neural networks are viewed as a powerful complement to the well-known recognition approaches: statistical and struc- tural. It is difficult to estimate the amount of overlap between neural networks and statistical pattern recognition, because it requires knowing the precise boundary be- tween the disciplines. Werbos [177] estimated that about 80% of the work being done with neural networks deals with pattern recognition. Some links between the two disciplines have been established [156]. In the following sections, we will address the common links between neural networks and statistical pattern recognition according to different components of recognition systems and design issues. 1.3.1 Representation Representation addresses the problem of how an input pattern is represented in a recognition system. How do we design a good representation scheme? A good pattern representation should satisfy at least the following requirements: (i) high data com— pression rate, (ii) good discriminatory power/information preservation, (iii) desired invariance, and (iv) good robustness to noise. In most statistical pattern recogni- tion systems, representation schemes are often developed by the designers using their understanding of the underlying problem and the domain knowledge. These repre— sentation schemes are fixed once the systems have been built. Similarly, in many applications using multilayer feedforward networks, the extracted features are fed into a network. So, the network essentially performs the classification task. However, the feedforward networks have a desirable property of automatically building internal representation of input patterns (feature extraction), although it is sometimes diffi- cult to interpret these representations. Due to this property, some researchers feed the raw data (e.g., pixels) into the network and let the network learn a representation from the training patterns (e.g., [105, 155, 83]). For example, Jain and Karu [83] used a feedforward network to learn texture discrimination masks directly from the input image array. Gallinari et al. [53] have established a link between linear discriminant analysis and linear feedforward networks. 
Their theoretical analysis shows that in a linear two-layer feedforward network trained to minimize the squared-error criterion, the role of the hidden layer with rank(Σ_B) units is to realize a linear discriminant analysis that maximizes the criterion |Σ_Bh| / |Σ_Th|, where Σ_B is the between-class covariance matrix in the input space, and Σ_Bh and Σ_Th are the between-class covariance and total covariance matrices, respectively, in the space spanned by the hidden units. Webb and Lowe [173] have investigated a class of multilayer feedforward networks with nonlinear hidden units and linear output units, trained to minimize the squared error between the desired output and the actual output of the network. They have found that the nonlinear transformation implemented by the subnetwork from the input layer to the final hidden layer of such networks maximizes the so-called network discriminant function, Tr{S_B S_T^+}, where S_T^+ is the pseudo-inverse of the total scatter matrix of the patterns at the output of the final hidden layer, and S_B is the weighted between-class scatter matrix of the output of the final hidden layer (the weights are determined by the coding scheme of the desired outputs). The role of the hidden layers is to implement a nonlinear transformation which projects input patterns from the original space to a space in which patterns are easily separated by the output layer.

A good representation can make the decision making process and network design much simpler, and lead to good generalization. However, designing a good representation scheme requires a thorough understanding of the underlying problem and substantial domain knowledge. How to learn the representation scheme from given data remains an open research issue.

1.3.2 Feature Extraction and Data Projection

Feature extraction has a two-fold meaning. It often refers to the process of extracting numerical measurements from the raw input patterns (initial representation). On the other hand, it is also defined as the process of forming a new (often smaller) set of features (refined representation) from the original feature set. However, both processes can be considered as a mapping (or projection) from a d-dimensional input space to an m-dimensional output space (m < d).

A large number of artificial neural networks and learning algorithms have also been proposed for feature extraction and data projection [113]. These research efforts can be loosely classified into two groups. The first group contains new neural network designs, or existing neural network models, for feature extraction and data projection. Examples of this type of network include Kohonen's self-organizing maps [93], the nonlinear projection methods (NP-SOM) [100], and nonlinear discriminant analysis (NDA) based on the functionality of hidden units in feedforward network classifiers [173, 111]. These networks exhibit some nice properties and can be used as new tools or as supplementary approaches to classical feature extraction and data projection algorithms. A second line of research has investigated the properties of neural networks and learning rules to establish links between classical approaches and neural networks. As a result, it has been found that some of these neural networks actually perform many of the well-known feature extraction and data projection algorithms [61, 132, 53].
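As a concrete illustration of the network discriminant function discussed above, the following sketch (not part of the dissertation; the synthetic data and the equal per-class weighting are simplifying assumptions, whereas Webb and Lowe weight the classes according to the output coding scheme) evaluates Tr{S_B S_T^+} on a matrix of final-hidden-layer activations.

import numpy as np

def network_discriminant(hidden_outputs, labels):
    # hidden_outputs: n x h matrix of final-hidden-layer activations.
    # Returns Tr{S_B S_T^+}, where S_T^+ is the pseudo-inverse of the total
    # scatter matrix and S_B the between-class scatter matrix (equal class
    # weights assumed here for simplicity).
    H = np.asarray(hidden_outputs, dtype=float)
    y = np.asarray(labels)
    mean_all = H.mean(axis=0)
    S_T = (H - mean_all).T @ (H - mean_all)
    S_B = np.zeros_like(S_T)
    for c in np.unique(y):
        Hc = H[y == c]
        diff = (Hc.mean(axis=0) - mean_all).reshape(-1, 1)
        S_B += len(Hc) * (diff @ diff.T)
    return float(np.trace(S_B @ np.linalg.pinv(S_T)))

# Hidden-unit outputs that separate the two classes well give a larger
# criterion value than the same outputs with shuffled labels.
rng = np.random.default_rng(1)
H = np.vstack([rng.normal(0.0, 0.1, (20, 5)), rng.normal(1.0, 0.1, (20, 5))])
labels = np.array([0] * 20 + [1] * 20)
score = network_discriminant(H, labels)

Comparing the value obtained for class-separated activations with that for randomly labeled ones gives a rough feel for how the criterion rewards hidden representations in which the classes are easily separated by the output layer.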
Several classical feature extraction and data projection approaches have been implemented using neural network architectures, e.g., PCA networks (see [63, 69]), LDA networks [111, 102], and a network for Sammon's projection (SAMANN) [85]. Although such efforts do not necessarily provide new approaches to feature extraction and data projection (from the viewpoint of the functionality performed by the networks), the resulting networks do possess the following advantages over traditional approaches: (i) most learning algorithms and neural networks are adaptive in nature, so they are well-suited to many real environments where adaptive systems are required; (ii) for real-time implementation, neural networks provide good, if not the best, architectures which are relatively easily implemented using current VLSI and optical technologies; and (iii) neural network implementations can overcome drawbacks inherent in the classical algorithms. For example, Jain and Mao [85] proposed a neural network design for Sammon's nonlinear projection algorithm which offers the generalization ability of projecting new data, a property not present in the original algorithm.

1.3.3 Classification

In statistical pattern recognition, the goal is to assign a d-dimensional input pattern x to one of c classes, ω_1, ω_2, ..., ω_c. If the class-conditional densities, p(x|ω_i), and the a priori class probabilities, p(ω_i), i = 1, 2, ..., c, are completely known (the ideal case in Fig. 1.1, where we have an infinite number of labeled training samples), the "optimal" strategy for making the assignment is the Bayes decision rule [34]: Assign pattern x to class ω_k if

p(ω_k|x) > p(ω_i|x),  i = 1, 2, ..., c;  i ≠ k,   (1.1)

where p(ω_k|x) = p(ω_k)p(x|ω_k)/p(x) is called the a posteriori probability of class ω_k, given the pattern vector x. This rule minimizes the probability of error. The resulting "optimal" Bayes error also provides a lower bound for studying the asymptotic behavior of other classifiers.

In statistical pattern recognition, a decision rule can often be formulated as a set of c discriminant functions, g_i(x), i = 1, 2, ..., c. We assign x to class ω_j if g_j(x) > g_i(x) for all i ≠ j. For the 2-class problem, only a single discriminant function is needed. Feedforward networks such as multilayer perceptrons and radial basis function networks also use this familiar notion. Each output node implements a discriminant function, and the class membership of the input pattern is determined by a "winner-take-all" scheme, which is the "max" operation. There are two methods to construct the discriminant functions in statistical pattern recognition. We can first estimate the class-conditional density functions and then use these densities to form the discriminant functions. The second method is to specify the form of the discriminant functions and then learn the coefficients from training patterns. Neural networks take the second approach. In the following, we will discuss different classifiers in both statistical pattern recognition and neural networks (parametric versus nonparametric, one-shot versus tree classifiers).

Parametric Classifiers

If we have a finite set of labeled training patterns and the form of the density functions is known (more often, it is assumed), we can use either the "optimal" rule or the "plug-in" rule (see Fig. 1.1). In the "optimal" decision rule, a Bayesian approach involving an a priori density on the unknown parameters is used.
Since this approach is cumbersome, a simpler approach where parameter estimates are used to replace the unknown parameters is preferred. If the class-conditional densities are assumed to be multivariate Gaussian, then the “optimal” discriminant function is quadratic in x. Further, if the covariance matrices of the pattern classes are equal, then the “optimal” discriminant function reduces to a linear form. The well-known percep- tron can implement a linear decision boundary. Therefore, the perceptron classifier is optimal in the Bayes sense if the class-conditional densities of the pattern classes are Gaussian with equal covariance matrices. Several polynomial networks have also been proposed (e.g., [136, 164]). We should be aware of the underlying assumptions which guarantee the optimality of a classifier. If the assumptions are not valid, parametric classifiers may perform very poorly. Yau and Manry [186] mapped the Gaussian classifier (quadratic un- der Gaussian assumption) into a Gaussian isomorphic network which can be trained by the backpropagation algorithm. They found that this mapping can improve the performance when the Gaussian assumption is violated. Non-parametric Classifiers In many classification problems, we do not even know the form of the class- conditional density functions. In such situations, the designer of a statistical pat— tern recognition system either estimates the class-conditional densities from the finite training set, or designs nonparametric classifiers such as k—nearest neighbor classifiers. The two most well-known methods for estimating the class-conditional densities are based on Parzen windows and k-nearest neighbors [34]. 13 A number of neural networks have been proposed which are closely related to the Parzen window classifier. These networks essentially estimate class-conditional densities and perform classification based on them. Some examples of these networks are the probabilistic neural network (PNN) [164] and the Cerebellar Mode Arithmetic Computer (CMAC) [119]. The PNN network uses a Gaussian window, while the CMAC model uses a rectangular window. Another approach to the Parzen window classifier is the Radial Basis Function (RBF) network [123], which is a two-layer network. Each node in the hidden layer employs a radial basis function, such as a Gaussian kernel, as the activation function. Each output node implements a linear combination of these radial basis functions. Unlike in the Parzen window classifier, the number of kernels in the REF network is usually less than the number of training patterns. Furthermore, not only the width of kernels but also the position of these kernels (windows) are learned from training patterns. The Parzen window classifier and the related neural network models share another important design issue: window width selection. The choice of the window width, h, is very critical to Parzen density estimation [35, 164, 119, 123]. Too small a value of h would give a spiky or noisy estimate of p(x|w,-) whereas large values of h result in an oversmoothed estimate of p(x|w,-). In statistical pattern recognition literature, it I: d / , where is common to use a value of h based on the following relationship: h 0: n;— k is a constant (0 < k < 0.5), and n,- is the number of training samples from class w,-. This choice of h has been shown to be Optimal by Silverman [160] and Fukunaga and Hostetler [47] for Gaussian class-conditional densities. Jain and Ramaswami [87] used the bootstrap technique in selecting the optimal width. 
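To make the density-estimation route concrete, here is a minimal Parzen-window classifier sketch (illustrative only, not from the dissertation; the Gaussian kernel, the synthetic two-class data, and the fixed window width h are assumptions) that plugs the estimated class-conditional densities into the Bayes rule of Eq. (1.1).

import numpy as np

def parzen_density(x, samples, h):
    # Parzen window estimate of a class-conditional density at x using an
    # isotropic Gaussian kernel of width h.
    X = np.asarray(samples, dtype=float)
    d = X.shape[1]
    diff = X - x
    k = np.exp(-np.sum(diff * diff, axis=1) / (2.0 * h * h))
    norm = (2.0 * np.pi) ** (d / 2.0) * h ** d
    return k.sum() / (len(X) * norm)

def parzen_classify(x, class_samples, priors, h):
    # Plug the density estimates into the Bayes rule: pick the class that
    # maximizes the estimated p(omega_i) * p(x | omega_i).
    scores = [p * parzen_density(x, s, h) for s, p in zip(class_samples, priors)]
    return int(np.argmax(scores))

# Two Gaussian classes in 2-D; the window width h is a free design parameter
# whose choice is discussed below.
rng = np.random.default_rng(2)
class_samples = [rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))]
label = parzen_classify(np.array([2.5, 2.5]), class_samples, priors=[0.5, 0.5], h=0.5)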
These techniques would also be useful for the neural network models. A number of stochastic network models have been proposed to learn the underly- ing probability distributions for the given set of training patterns. The well-known example is the Boltzmann machine [65]. The probability of the network states can 14 be trained to comply with the desired probability distribution (class-conditional or a posteriori densities). In a comparative study, Kohonen et al. [96] found that the Boltzmann machine achieved considerably better accuracy than a feedforward net- work trained using the backpropagation algorithm, and its performance was close to the optimal Bayes rule. However, the basic Boltzmann learning algorithm employs a Monte Carlo simulation and is extremely slow, which greatly limits the applica- tion of the Boltzmann machine to real-world problems. A number of variations of Boltzmann learning algorithm have been proposed [63] including the deterministic Boltzmann machine [137] and the Boltzmann Perceptron Network [185]. It has been shown by many researchers, both theoretically and empirically, that networks with the I-of-C' coding of the desired outputs, such as the multilayer per- ceptron with sigmoidal nonlinearities, radial basis function networks, and high—order polynomial networks, trained using the squared-error, cross-entropy, and normalized- likelihood cost functions can produce outputs which estimate a posten’on’ probabilities (e.g., [146]). The estimation accuracy depends on the network complexity, the amount of training data, the degree to which training data reflect the true class-conditional densities and a pm’orz’ probabilities. These results have shown that feedforward net- works have the desirable property of providing both the discriminant functions (de— cision boundaries) and a posteriori probabilities which can be used as a confidence measure of the decision making process. This property also makes it easier to com- bine outputs of multiple networks to form a high-level decision, and to determine a meaningful rejection threshold. Instead of explicitly estimating the class-conditional densities or a posteriori prob- abilities, the K—nearest neighbor (K-NN) classifier is a nonparametric classifier that directly constructs the decision rule from the training data. The K-N N rule classifies a pattern x into the class which is most frequently represented among its K nearest neighbors. The choice of K is data dependent. The rule of thumb is that K should 15 be a fraction (e.g., \fi—z) of the number, n, of training samples. Several variants of the K -NN classifier are available (e.g., condensed and edited), which reduce its time complexity and memory requirements [30]. The K—NN classifier can be easily mapped into a neural network architecture [84]. One can argue whether the feedforward networks belong to the class of paramet— ric or nonparametric classifiers. Since a feedforward network of fixed size explicitly specifies the form of the discriminant functions, it implicitly assumes the forms of class-conditional densities which lead to the specified discriminant functions. In this sense, feedforward networks are parametric classifiers. On the other hand, if the net- work size is determined from data or the network is powerful enough to approximate a large number of density and discriminant functions, we can say that feedforward networks are nonparametric classifiers. 
Tree Classifiers In a tree classifier, a terminal decision is reached through a sequence of intermediate decisions (nonterminal nodes) following a particular path from the root to a leaf node; the remaining nodes in the tree need not be visited. In some situations, e.g., when pairs of classes can be separated by only one or a few components of the feature vector, tree classifiers can be much more efficient than single-stage classifiers. Moreover, tree classifiers may have a better small training sample size performance than one-shot classifiers because only a small number of features is used for decision making at each node. Note that the decision rule used at each node of the tree can be either parametric or nonparametric. Sethi [157] investigated the link between tree classifiers and multilayer feedfor- ward networks, and found that a decision tree can be restructured as a three-layer feedforward network. This restructuring provided a different tool for network design and training. It also solved the “credit-assignment” problem in training feedforward networks thus making it possible to progressively train each layer separately. The 16 backpropagation algorithm can be used to refine the weights in the network after the initial restructuring. There is not sufficient theoretical basis for making a conclusion on whether a multilayer feedforward network or a tree classifier performs better [4]; the performance is application-dependent. If the architecture of a feedforward network is carefully designed, then it can have a more compact structure than the tree classifier. 1.3.4 Clustering Various clustering or unsupervised learning algorithms proposed in the literature [81] can be grouped into the following two categories: hierarchical and partitional. A hier- archical clustering is a nested sequence of partitions, whereas a partitional clustering is a single partition. A partition can be either hard or soft (e.g., fuzzy clustering [15]). Commonly used partitional clustering algorithms are the k-means algorithm and its variants (e.g., ISODATA [7]) which incorporate some heuristics for merging and splitting clusters and for handling outliers [81]. Competitive neural networks are often used to cluster input data. The well-known examples are Kohonen’s Learn- ing Vector Quantization (LVQ) and self-organizing map [93], and ART (Adaptive Resonance Theory) models [21, 22, 23]. Despite their attractive architectures and biological and neurophysiological plausibility, the updating procedures used in these neural network-based clustering are quite similar to some classical partitional clus- tering approaches. For example, the relationship between the sequential k-means algorithm and Kohonen’s learning vector quantization (LVQ) has been addressed by Pal et al. [135]. The learning algorithm in ART models has been shown to be similar to the sequential leader clustering algorithm [58, 127]. Many partitional clustering algorithms [81] and competitive neural networks for unsupervised learning [63] are suitable primarily for detecting hyperspherical-shaped clusters, because the Euclidean distance metric is commonly used to compute the 17 distance between a pattern and the assigned cluster center. Other distance metrics, such as the Mahalanobis distance, can also be used. 
Mao and Jain [114] (see Chapter 4) have proposed a network for detecting hyperellipsoidal clusters (HEC), which uses the regularized squared Mahalanobis distance: In spite of the numerous research efforts, data clustering remains a difficult and essentially an unsolved problem [81]. 1.3.5 “Curse of Dimensionality” and Generalization Issues It is well-known that the error rate of the optimal Bayes classifier will not increase as the number of features is increased [34]. However, in a finite training sample size case, the additional discriminatory information provided by the new features can be outweighed by the increase in the inaccuracy of the parameter estimates needed in the classification rule. Thus, in practical classifiers, a “peaking” phenomenon is often observed: addition of new features decreases the classification error at first, then the error levels off, and finally it begins to increase [80, 36]. This phenomenon, often referred to as the “curse of dimensionality”, is also observed in neural networks for both classification [142] and regression [55]. Many important results related to the curse of dimensionality have been published in the statistical pattern recognition literature. We summarize here some of these results. Duin [36] derived a bound on the expected error of a classifier which uses the esti- mated class-conditional density functions, 15,-(x), instead of the true class-conditional density functions, p,(x), z' = 1, 2: me} 3 5...... + §E {2: / lp. — p.(x>ldx}. (1.2) where 530,,” is the optimal Bayes error. The second term on the right-hand side is l8 basically the expected Kolmogorov distance between the true and estimated class- conditional densities. Simulations have shown that for Gaussian densities, the ad- dition of new features may increase this expected Kolmogorov distance. However, evaluation of this bound is difficult because we never know the true densities in prac- tice. Jain and Waller [88] investigated a 2-class problem involving normal distributions with different class means but a common covariance matrix. They found that when the “plug-in” decision rule is used, the “peaking” phenomenon can be avoided if the addition of a new feature results in an increase in the Mahalanobis distance between the two classes by at least 1/(n — d — 3) percent, where n is the number of training patterns and d is the number of features. Raudys and Jain [143] showed that for the linear and quadratic discriminant func- tions which are obtained from the estimated mean vectors and covariance matrices, the expected classification error is given by E {5} = E + i, where 5 is the asymptotic error of the linear or quadratic classifier, and v is a constant which is proportional to d for the linear classifier and d2 for the quadratic classifier. This relationship implies that the number of training patterns should increase linearly according to the number of free parameters in the classifier. This theoretical result is consistent with intensive computer simulations [143, 80]. A general guideline of having five to ten times as many training patterns as free parameters seems to be a good practice to follow [80]. The generalization ability of both statistical and network-based classifiers depends on the following three factors: (i) the number of training patterns, (ii) the underly- ing class-conditional distributions, and (iii) the discrimination ability (or complexity) of the classifier. 
The theoretical and empirical results on curse of dimensionality discussed above were obtained under the assumption of Gaussian class-conditional distributions. For the general case, the exact relationship between these three factors 19 becomes intractable. Vapnik’s theory [170] provides a bound on the maximum devi- ation of the estimated error from the true error over a set of classifiers. The theory was developed based on the notion of Vapnik-Chervonenkis (VC) dimension (or ca- pacity), which is defined as the maximum number (V) of data points in the space for which the classifier can implement all possible 2“ dichotomies. For example, the VC dimension of a linear classifier in R“ is (d + 1), and the VC dimension of a quadratic classifier is {2d+d(d —- 1) / 2+ 1}. However, computing the VC dimension for a general classifier is very difficult. Baum and Haussler [14] have shown that the VC dimension of a one-hidden-layer feedforward network with threshold units and full connections between adjacent layers is bounded by 2[H/2]d S V g 2W log(eN ), where e is the base of the natural logarithm, H, N and W are the number of hidden units, the total number of units and the total number of weights in the network, respectively. Note that the upper bound on the VC dimension is linear in W (assuming that N is fixed). The bounds on the VC dimension of feedforward networks with continuous nonde- creasing activation function are more complicated [60], but are still roughly linear in W. Once the VC dimension of a classifier is known, a lower bound can be established on the (training) sample size requirement [17]: n 2 max {$1n2, lglni—g}, (1.3) where X and 6 are two parameters which determine the degree of the maximum devi- ation of the estimated error from the true error, and the confidence on the maximum deviation, respectively. Kraaijveld [98] has derived a slightly better bound. Note that the sample size requirement is roughly linearly proportional to the VC dimension of the classifier. For a feedforward network classifier, the number of training patterns 20 should be approximately linearly proportional to the number of weights in the net- work. These results derived from Vapnik’s theory have the following interesting and unique properties: (i) any classifier (or system), no matter how complicated, is charac- terized by a single number, i.e., the VC dimension; (ii) bounds on training sample size and on the maximum deviation of the true and estimated errors are distribution-free; (iii) bounds are independent of how the classifier is trained; and (iv) Vapnik’s theory only captures the difference between the true and estimated errors, and provides no statement about the value of the true error itself. In some sense, properties (ii) and (iii) are desirable, but these properties also make the bounds too conservative because now the worst case behavior must be considered. Therefore, the practical applicability of these bounds is questionable. For example, these bounds provide very little guidance even in the case of multivariate Gaussian distributions. Moreover, the capacity (or VC dimension) of a classifier may not be fully exploited by the learning algorithm, or may be restricted by regularization tech- niques. This means that the efiective capacity of a classifier can be much smaller than its true capacity. 
Kraaijveld [98] studied the effect of the backpropagation algorithm on the generalization properties of multilayer feedforward networks and found that the search capability of the backpropagation algorithm (or any other iterative algorithm) cannot fully exploit the theoretical capacity of multilayer feedforward networks. Kraaijveld proposed a simple but computationally intensive procedure to estimate the effective capacity of a classifier. His simulations showed that the effective capacity can be orders of magnitude smaller than the bound on the true capacity of a multilayer feedforward network. We recommend that the number of training patterns be at least five to ten times as large as the VC dimension of the classifier [80, 182].

In many practical situations, collecting a large number of training samples is expensive, time consuming, and sometimes impossible. Therefore, choosing a parsimonious system (with a small number of parameters) is a very important issue. Various efforts have been made to improve the generalization ability of neural networks. This problem is similar to selecting the best classifier from a given set of classifiers in statistical pattern recognition [143].

The techniques which attempt to improve the generalization ability of a network can be grouped into the following four categories: (i) local connections and weight sharing, (ii) adaptive network pruning, (iii) adaptive network growing, and (iv) regularization. The main idea behind these techniques can be related to the principle of minimum description length [148] and the intuition of Occam's razor. Methods in the first category try to reduce the number of free parameters (connection weights) by connecting a node to only a local region of its inputs or by sharing the same weight among many connections [106, 104]. Kraaijveld [98] has derived an upper bound on the VC dimension of weight-sharing networks.

1.4 Contributions of This Thesis

The contributions of this thesis are made to both the field of pattern recognition and the field of artificial neural networks. These contributions are summarized as follows.

1. Several links between statistical pattern recognition and neural networks have been established [86] (Chapter 1). These links can encourage communication between researchers in the two fields, so that they can inspire each other and avoid repetitious work.

2. A number of neural networks and learning algorithms have been proposed. These include:

• Linear Discriminant Analysis (LDA) network [111] (Chapter 2);
• Nonlinear Discriminant Analysis (NDA) network [111] (Chapter 2);
• A network for Sammon's nonlinear projection (SAMANN) [85] (Chapter 2);
• A Nonlinear Projection algorithm based on Kohonen's Self-Organizing Maps (NP-SOM) [100] (Chapter 2);
• A network-based k-nearest neighbor classifier [84] (Chapter 3); and
• A self-organizing network for detecting HyperEllipsoidal Clusters (HEC) [114] (Chapter 4).

A comparative study of five typical networks for feature extraction and data projection has also been conducted [113] (Chapter 2).

3. The main theoretical contributions of this thesis are:

• Vapnik's theory and Moody's approach for studying the generalization ability of feedforward networks have been compared. Moody's notion of the effective number of parameters has been extended to a more general noise model.
We have shown that the addition of noise to both the sampling points in the test data and the observations increases the deviation of the expected test-set MSE from the expected training-set MSE, and also increases the effective number of parameters. Monte Carlo experiments have been conducted to verify Moody's result and our extension, and to demonstrate the role of weight-decay regularization in improving the generalization ability of feedforward networks. See Chapter 5.

• We have provided a systematic study of regularization techniques in neural network design. We present a taxonomy of various forms of regularization and demonstrate that a variety of neural networks and learning algorithms fit into this framework. Some results discussed and developed in Chapter 5 are used to explain the counter-intuitive phenomenon that a network with a large number of free parameters, trained with a relatively small number of patterns, can still generalize well. See Chapter 6 and [112].

4. A practical method of node pruning has been proposed to design minimal networks which generalize well. This method can also be used for feature selection. See Chapter 7 and [115].

1.5 Organization of This Thesis

The rest of this thesis is organized as follows. The first part of this thesis (Chapters 2-4) focuses on designing neural networks for pattern recognition. Chapter 2 proposes a number of neural networks and learning algorithms for feature extraction and multivariate data projection. This chapter also conducts a comparative study of five typical networks for feature extraction and multivariate data projection. Chapter 3 implements the well-known k-nearest neighbor classifier and its variants using a neural network architecture. Chapter 4 presents a self-organizing network for detecting hyperellipsoidal clusters.

The second part of this thesis (Chapters 5-7) is devoted to the theoretical analysis of generalization ability and to practical techniques for improving the generalization ability of feedforward networks. Chapter 5 presents a theoretical study of the generalization ability of feedforward networks. It first introduces two frameworks for this study: Vapnik's theory and Moody's approach. The advantages and disadvantages of these two frameworks are analyzed. Then, we extend Moody's model to a more general setting. Monte Carlo experiments are conducted to verify both Moody's result and our extension, and to demonstrate the role of weight-decay regularization in reducing the effective number of parameters in feedforward networks. We also summarize four different types of practical techniques for improving the generalization ability of feedforward networks. Chapters 6 and 7 address two of these four types of techniques. Chapter 6 provides a survey of regularization techniques used in neural networks. Some theoretical results from Chapter 5 are used to explain the role of regularization techniques in the performance of neural networks. In Chapter 7, a node pruning method is proposed for designing minimal networks which generalize well, and for feature selection. Finally, Chapter 8 summarizes this thesis and mentions topics for future research.

CHAPTER 2

Neural Networks for Feature Extraction and Data Projection

Classical feature extraction and data projection methods have been well studied in the statistical pattern recognition and exploratory data analysis literature.
In this chapter, we propose a number of networks and learning algorithms which provide new or alternative tools for feature extraction and data projection. These networks include a network (SAMANN) for Sammon's nonlinear projection, a linear discriminant analysis (LDA) network, a nonlinear discriminant analysis (NDA) network, and a network for nonlinear projection (NP-SOM) based on Kohonen's self-organizing map. A common attribute of these networks is that they all employ adaptive learning algorithms, which makes them suitable for environments where the distribution of patterns in the feature space changes over time. The availability of these networks also facilitates hardware implementation of well-known classical feature extraction and projection approaches. Moreover, the SAMANN network offers the ability to project new data (a generalization ability not present in the original Sammon's projection algorithm), and the NDA method and the NP-SOM network provide new and powerful approaches for visualizing high-dimensional data. We evaluate five representative neural networks for feature extraction and data projection based on a visual judgement of their 2-dimensional projection maps and on three quantitative criteria, using eight data sets with various properties. Our conclusions, which are based on analysis and simulations, can be used as a guideline for choosing a proper method for a specific application.

2.1 Feature Extraction and Data Projection

Feature extraction and data projection can be formulated as a mapping Ψ from a d-dimensional input space to an m-dimensional output space (map space), Ψ: R^d → R^m, m ≤ d, such that some criterion, C, is optimized. This formulation is similar to a function approximation problem. However, unlike the function approximation problem, where the mapping function is estimated from training patterns which are input-output pairs (desired outputs are known), in feature extraction and data projection the desired outputs are often not available even for supervised approaches (e.g., linear discriminant analysis). Feature selection, which is used for choosing an optimal subset of features, can be viewed as a special case of feature extraction in which Ψ is an m × d matrix that simply selects m of the d original features. For data visualization purposes, the value of m is usually set to 2 or 3.

A large number of approaches for feature extraction and data projection are available in the pattern recognition literature [16, 81]. These approaches differ from each other in the characteristics of the mapping function Ψ, how Ψ is learned, and what optimization criterion C is used. The mapping function can be either linear or nonlinear, and can be learned by either supervised or unsupervised methods. The combination of these two factors results in the following four categories of feature extraction and data projection: (i) unsupervised linear, (ii) supervised linear, (iii) unsupervised nonlinear, and (iv) supervised nonlinear. Within each category, the methods differ in the criterion function C being used. This taxonomy is shown in Figure 2.1. In general, linear methods are attractive because they require less computation than nonlinear methods, and an analytical solution is often available. On the other hand, nonlinear methods are more powerful than linear methods. However, since an analytical solution is often not available, a numerical optimization method must be used to obtain the solution, which is usually computationally demanding.
Furthermore, the optimization procedure often gets trapped in a local minimum. It is also generally true that supervised methods perform better than unsupervised methods in situations where category information is available [16].

Various neural networks, including those proposed in this chapter for feature extraction and data projection, can also be grouped into the above four categories. The boxes at the leaf nodes in Figure 2.1 list a number of representative networks in each category which will be described and evaluated in this chapter.

[Figure 2.1. A taxonomy of neural networks for feature extraction and data projection: linear unsupervised (principal component analysis network), linear supervised (linear discriminant analysis network), nonlinear unsupervised (ANN-based Sammon's projection, Kohonen's map), and nonlinear supervised (nonlinear discriminant analysis network).]

The choice of a proper feature extraction and data projection method (both traditional and neural network approaches) for a given problem is driven by (i) the available information (supervised versus unsupervised), (ii) a priori knowledge about the underlying distribution of the data (linear versus nonlinear, and the criterion C), and (iii) the task requirement (classification or visualization). Knowing the details and characteristics of all the available approaches is crucial to making a good choice. In this chapter, we investigate the characteristics of a number of neural networks through analysis and evaluation experiments.

2.1.1 Principal Component Analysis (PCA) Networks

Principal component analysis, also called the Karhunen-Loève transform, is a well-known statistical method for feature extraction, data compression and multivariate data projection, and has been widely used in communication, signal and image processing, pattern recognition and data analysis. It is a linear orthogonal transform from a d-dimensional input space to an m-dimensional output space, m ≤ d, such that the coordinates of the data in the new m-dimensional space are uncorrelated and the maximal amount of variance of the original data is preserved by only a small number of coordinates.

Projection pursuit is a more general approach in this category which includes principal component analysis as a special case. Projection pursuit was first introduced by Friedman and Tukey [44] as a technique for exploratory analysis of multivariate data sets. Huber [73] provided a unified framework for projection pursuit which includes principal component analysis, multidimensional scaling, factor analysis, nonparametric regression (projection pursuit regression), density estimation (projection pursuit density estimation) and deconvolution. The standard 2-layer (one hidden layer) feedforward network is closely related to projection pursuit regression in approximating functions, although the backpropagation algorithm used for training the feedforward network is different from the algorithms used in projection pursuit regression. A number of researchers have explored the use of projection pursuit regression and density estimation in training or constructing feedforward networks [13, 75, 187].

Let ξ_k = (ξ_{k1}, ξ_{k2}, ..., ξ_{kd})^T denote the k-th d-dimensional input vector, k = 1, 2, ..., n. Assume zero-mean vectors without loss of generality. Let Σ be the covariance matrix of the data, with eigenvalues λ_1, λ_2, ..., λ_d in decreasing order.
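As a point of reference for the adaptive networks discussed below, the following minimal NumPy sketch carries out the batch computation that the next paragraph formalizes: project zero-mean data onto the leading eigenvectors of its covariance matrix. The function and variable names are ours, chosen only for illustration; the matrix Phi plays the role of the transform described next.

import numpy as np

def pca_project(X, m):
    # Project an n x d data matrix onto its m leading principal components.
    X = X - X.mean(axis=0)                  # enforce the zero-mean assumption
    cov = np.cov(X, rowvar=False)           # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # reorder into decreasing order
    Phi = eigvecs[:, order[:m]]             # d x m matrix of leading eigenvectors
    return X @ Phi, eigvals[order]

X = np.random.randn(200, 5) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
projections, eigenvalues = pca_project(X, m=2)
print("fraction of variance retained:", eigenvalues[:2].sum() / eigenvalues.sum())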
The m principal components of the data can be obtained by a linear transform o_k = Φ^T ξ_k, where o_k = (o_{k1}, o_{k2}, ..., o_{km})^T, and Φ is a d × m matrix whose columns are the m eigenvectors corresponding to the m largest eigenvalues of the covariance matrix Σ. It is easy to show that the covariance matrix of the transformed data is a diagonal matrix, diag{λ_1, λ_2, ..., λ_m}, which means that the components of the transformed data are uncorrelated. The variance of the original data retained in the new m-dimensional space is \sum_{i=1}^{m} \lambda_i, which is the largest value retained by any linear orthogonal transform of the same output dimensionality. Principal component analysis is optimal in the mean-square error sense among all linear orthogonal transforms.

A large number of neural networks and learning algorithms have been proposed for principal component analysis [1, 133, 151, 150, 154, 6, 40, 69, 101, 103]. We have extensively used the PCA network proposed by Rubner et al. [151, 150], whose architecture is shown in Figure 2.2. It consists of d input nodes and m output units. Assume m = d without any loss of generality. Each input node i is connected to each output unit j with connection strength w_{ij}. We denote the weight vector associated with output unit j from the d inputs by w_j = (w_{1j}, w_{2j}, ..., w_{dj})^T. All the output units are hierarchically organized in such a way that output unit i is connected to output unit j with connection strength u_{ij} if and only if i < j. Let {ξ_k | k = 1, 2, ..., n} be the set of n zero-mean input vectors, and {o_k | k = 1, 2, ..., n} be their corresponding output vectors produced by the network:

o_{kj} = w_j \cdot \xi_k + \sum_{l<j} u_{lj} o_{kl}.   (2.1)

[Figure 2.2. PCA network proposed by Rubner et al.]

Rubner and Tavan [151] proved that if the learning parameters η and μ are properly chosen according to the inequalities in Eq. (2.4), which involve the extreme eigenvalues λ_1 and λ_d, then all the lateral weights rapidly vanish and the network converges to a state in which the d weight vectors associated with the d output units are the d eigenvectors of the covariance matrix of the input patterns, with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d. Although, in practice, it is difficult to determine the values of η and μ according to the inequalities in Eq. (2.4) without first computing the eigenvalues, Eq. (2.4) does provide a range for the values of η and μ if λ_1 and λ_d can somehow be estimated. The asymptotically vanishing lateral connection strengths also provide a stopping criterion for the learning procedure.

We have found that introducing a momentum term and letting the learning rate and momentum decay with time can speed up the convergence. In our experiments, we use

\Delta w_{ij}(t+1) = \eta(t)\, \xi_{ki}\, o_{kj} + \beta(t)\, \Delta w_{ij}(t),   (2.5)

and

\Delta u_{lj}(t+1) = -\mu(t)\, o_{kl}\, o_{kj} + \beta(t)\, \Delta u_{lj}(t),   (2.6)

where η(t+1) = max(αη(t), 0.001), μ(t+1) = max(αμ(t), 0.002), β(t+1) = max(αβ(t), 0.001), t is the iteration index and α is the decay factor. The non-zero minimum values for η, μ and β enable the network to retain its plasticity so that it can adapt itself to new input data.

We summarize the learning algorithm for principal component analysis [151, 150] in Box 1.

Box 1. PCA Algorithm
1. Initialize all connection weights to small random values and choose the values of the learning parameters.
2. Randomly select a d-dimensional pattern and present it to the network.
3. Update the connection weights between the input nodes and the output layer according to the Hebbian rule, Eq. (2.5).
4. Normalize each weight vector to a unit vector.
5. Update the lateral weights according to the anti-Hebbian rule, Eq. (2.6).
6. Modify the parameters η, μ and β.
7. If all the lateral weights have remained sufficiently small (their absolute values below some threshold) for a given number of presentations, then stop; otherwise go to step 2.

Our computer simulations of Rubner's PCA network have shown that the precision of the computed eigenvalues and eigenvectors is very high (of the order of 10^{-5} compared to the values obtained by the commercial Eispack software) [111]. However, we have experienced very slow convergence of the PCA algorithm when the input dimensionality d is high and all d eigenvectors need to be computed, especially when some eigenvalues are very small.

2.1.2 Linear Discriminant Analysis (LDA) Network

If the category information of the patterns is known, then it is more appropriate to use supervised learning. Linear discriminant analysis incorporates the a priori category information into the projection or transform by maximizing the between-class scatter while holding the within-class scatter constant. Let ξ_i^{(l)} = (ξ_{i1}^{(l)}, ξ_{i2}^{(l)}, ..., ξ_{id}^{(l)})^T denote the i-th pattern in class l, i = 1, 2, ..., n_l, l = 1, 2, ..., c, where c is the number of categories. Let n = \sum_{l=1}^{c} n_l denote the total number of patterns. Define the within-class covariance matrix, Σ_W, as

\Sigma_W = \frac{1}{n-1} \sum_{l=1}^{c} \sum_{i=1}^{n_l} (\xi_i^{(l)} - m^{(l)}) (\xi_i^{(l)} - m^{(l)})^T,   (2.7)

where m^{(l)} is the mean vector of class l, l = 1, 2, ..., c. Similarly, define the between-class covariance matrix, Σ_B, as

\Sigma_B = \frac{1}{n-1} \sum_{l=1}^{c} n_l (m^{(l)} - m) (m^{(l)} - m)^T,   (2.8)

where m is the mean vector of the pooled data. The total scatter matrix is, therefore,

\Sigma_T = \frac{1}{n-1} \sum_{l=1}^{c} \sum_{i=1}^{n_l} (\xi_i^{(l)} - m) (\xi_i^{(l)} - m)^T = \Sigma_W + \Sigma_B.   (2.9)

The goal of linear discriminant analysis is, then, to find a d × m transform Φ such that |Φ^T Σ_B Φ| / |Φ^T Σ_W Φ| is maximized, where | · | denotes the determinant. It can be shown that such a transform Φ is composed of the m eigenvectors corresponding to the m largest non-zero eigenvalues of Σ_W^{-1} Σ_B. Because the matrix Σ_B has a maximum rank of c - 1, the value of m must be less than c. Therefore, the dimensionality of the projected space is limited by the number of categories. Several variations of linear discriminant analysis have been reported in the literature. For example, Foley and Sammon [41] and Okada and Tomita [134] derived a set of discriminant vectors by selecting the projection axes one at a time under an orthogonality constraint. We use the total scatter matrix Σ_T instead of the between-class covariance matrix Σ_B in the criterion, so that the output dimensionality is not limited by the number of categories.

We have proposed a neural network and a supervised self-organizing learning algorithm for multivariate linear discriminant analysis [111]. The linear discriminant analysis (LDA) network is shown in Figure 2.3.

[Figure 2.3. Architecture of the LDA network.]

It consists of two layers, each of which is identical to the PCA network proposed by Rubner et al. [151, 150] shown in Figure 2.2. A linear activation function is used for all the units. Let w_{ij}^{(1)} (w_{ij}^{(2)}) be the weight on the interlayer connection from the i-th unit in the input (hidden) layer to the j-th unit in the hidden (output) layer. Let u_{ij}^{(1)} (u_{ij}^{(2)}) denote the lateral weight from the i-th unit to the j-th unit within the hidden (output) layer. Moreover, let
ξ = (ξ_1, ξ_2, ..., ξ_d)^T be a d-dimensional input pattern, and let p = (p_1, p_2, ..., p_d)^T and o = (o_1, o_2, ..., o_d)^T be the output vectors of the hidden layer and the output layer, respectively. We can write these outputs as

p_j = \sum_{i=1}^{d} w_{ij}^{(1)} \xi_i + \sum_{l<j} u_{lj}^{(1)} p_l,   (2.10)