NONPARAMETRIC TRAINING PROCEDURES FOR MULTICATEGORY PATTERN CLASSIFICATION

Thesis for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
ALBERT YEN-SHIEN HUNG
1971

This is to certify that the thesis entitled NONPARAMETRIC TRAINING PROCEDURES FOR MULTICATEGORY PATTERN CLASSIFICATION presented by Albert Yen-Shien Hung has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.

Major professor

ABSTRACT

NONPARAMETRIC TRAINING PROCEDURES FOR MULTICATEGORY PATTERN CLASSIFICATION

By Albert Yen-shien Hung

The M-class pattern recognition problem is to construct a set of discriminant functions which partition a feature space into M regions, one region per pattern class. Each point in the feature space is a potential pattern and each pattern represents an object. A set of training patterns is to be generalized into a set of discriminant functions which classify the potential patterns. The fundamental algorithms developed here concern the situation where the origin of each training pattern is known and almost nothing is assumed about the origins of the patterns. An extension to the unsupervised case is also given.

Several new multiclass decision-making algorithms are proposed. An entirely new class of algorithms is obtained by translating the pattern recognition problem into the problem of minimizing a function of several variables and selecting suitable functions. This general formulation includes most known algorithms as special cases. The class of algorithms includes all procedures which approximate discriminant functions by linear combinations of basis functions. Several successful two-class algorithms are extended to the M-class problem.

The concept of linear inequalities and the role of the mean-square-error criterion in pattern recognition are studied. Several algorithms are shown to rely heavily on the basic mean-square-error criterion. In order to solve the generalization problem, the conditional probabilities are selected to form the optimal discriminant functions. A class of multiclass algorithms using stochastic approximation techniques is proposed that learns the unknown coefficients of the discriminant functions. A digital simulation has been performed.

The most novel aspect of this thesis is the introduction of an algorithm that combines cluster-seeking and multiclass pattern recognition. Cluster-seeking tries to uncover the structure inherent in the training patterns. The algorithm exploits this structural information to construct discriminant functions. The success of the discriminant functions in classifying training patterns then provides clues about structure. The algorithm is straightforward and computationally realistic. It has been tested with both artificial and practical data.

NONPARAMETRIC TRAINING PROCEDURES FOR MULTICATEGORY PATTERN CLASSIFICATION

By Albert Yen-shien Hung

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Electrical Engineering and Systems Science
1971

ACKNOWLEDGEMENTS

I am deeply indebted to Dr. R.C. Dubes, who introduced me to the field of pattern recognition and whose valuable suggestions have gone a long way in shaping this thesis.
I am also grateful to him for the guidance and encouragement I received from him during my entire doctoral degree program.

I am also indebted to the Division of Engineering Research, Michigan State University, and the Air Force Office of Scientific Research for their financial support during the course of this research. Thanks are also due to Dr. R.O. Barr, Dr. G.L. Park, Dr. J.H. Stapleton and Dr. Rita Zemach for their interest and critical reviews of this thesis.

Finally, this thesis would be incomplete without mentioning the contribution of my wife Lilian, to whom this thesis is dedicated.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES

Chapter I  INTRODUCTION
    1.1 Basic Problems
    1.2 Multiclass Problems
    1.3 Cluster-Seeking and Pattern Recognition
    1.4 Thesis Objectives and Outline

Chapter II  GENERALIZED LINEAR INEQUALITIES IN MULTICATEGORY PATTERN RECOGNITION
    2.1 The Concept of Linear Inequalities in Pattern Recognition
    2.2 The General Learning Algorithms
    2.3 A Class of Iterative Procedures for Linear Inequalities

Chapter III  THE MEAN-SQUARE-ERROR CRITERION IN MULTICATEGORY PATTERN RECOGNITION
    3.1 The Mean-Square-Error Criterion in Pattern Recognition
    3.2 The Generalized Inverse Approach in Multiclass Pattern Recognition
    3.3 Some Properties of Least-Mean-Square-Error Pattern Classifiers
    3.4 Relation Between Linear Inequalities and the Mean-Square-Error Criterion

Chapter IV  MULTICATEGORY PATTERN RECOGNITION USING STOCHASTIC APPROXIMATION TECHNIQUES
    4.1 Mathematical Formulation
    4.2 Generalized Inverse Adaptation
    4.3 A Stochastic Approximation Algorithm with Updating Property
    4.4 Sensitivity Study and Error Upper Bound
    4.5 An Example

Chapter V  MULTICLASS PATTERN RECOGNITION IN UNSTRUCTURED SITUATIONS
    5.1 Some Cluster-Seeking Techniques .......................................... 64
    5.2 A Procedure for Combining Cluster-Seeking and Multiclass Pattern Classification ... 68
    5.3 A Procedure for Unsupervised Structure Analysis .......................... 74
    5.4 Computer Simulations ..................................................... 76

Chapter VI  CONCLUSIONS
    6.1 Thesis Results ........................................................... 79
    6.2 Possible Extensions ...................................................... 82

REFERENCES ...................................................................... 84
APPENDIX ........................................................................ 88

LIST OF FIGURES

Figure                                                                          Page
1      A proposed pattern recognition system .................................... 6
2 & 3  Decision surfaces for algorithm GA.4 ..................................... 32
4      A mathematical model for pattern recognition ............................. 63
5      Error rate versus number of iterations ................................... 63
6      Flow chart of the proposed algorithm ..................................... 72
7      Example of algorithm in Figure 6 ......................................... 77

LIST OF APPENDICES

Appendix                                                                        Page
A      Convergence proof of algorithm GA.4 ...................................... 88

CHAPTER I
INTRODUCTION

The problem of automatic data analysis has drawn considerable attention since the development of high speed computers. Today there are automated systems that read handwriting and fingerprints, as well as systems that classify data, forecast weather and perform medical diagnosis. In all these cases, some type of patterns or templates are assumed as a basis for recognition; that is, they are pattern recognition problems (H1, B3, N1, N2).

1.1 Basic Problems

There are three fundamental problems associated with pattern recognition.

(1) Feature extraction.
A set of real variables {x1, x2, ..., xd}, called features or attributes, identifies the object to be classified. A vector X* = (x1, x2, ..., xd)^T is called a pattern vector or, simply, a pattern. No general procedure currently exists for selecting an optimal set of features for a given problem. In most cases, the success of feature extraction depends entirely upon factors in the specific problem. The feature extraction problem will not be discussed in this thesis; a "good" set of features is assumed.

(2) The abstraction problem. Once a set of features has been selected, and a set of pattern vectors D* = {(X*_i, y_i), i = 1,2,...,N} is given, where y_i denotes the classification of the pattern vector X*_i, then a set of augmented pattern vectors D = {(X_i, y_i), i = 1,2,...,N}, called training patterns, can be formed. An augmented pattern vector is a d-dimensional vector X* augmented by a (d+1)st component whose value is 1; that is, X = (x1, x2, ..., xd, 1)^T. The problem considered in this report is to find a set of discriminant functions {f_j(X)}, j = 1,2,...,M, defined on the augmented feature space such that

    f_i(X) > f_j(X)  if X is in class i;  i, j = 1,2,...,M,  i ≠ j.        (1)

If there is a set of discriminant functions satisfying (1), an "optimal" solution exists; otherwise one must settle for a solution which permits a few violations of (1). The abstraction problem is thus one of distilling the information from the training patterns that is necessary to construct a set of discriminant functions.

Generally speaking, there are two approaches to the abstraction problem. First, if a great deal of information is available about the data, a probabilistic model can be generated to describe the underlying physical phenomenon. The abstraction problem can then be treated in the framework of statistical decision theory. This approach is the parametric (or statistical) method. Second, there are many problems, especially in the biomedical area, in which data has simply been amassed. It is reasonable to assume, however, that there exists a set of discriminant functions which approximate (1), and so the functional form for the discriminant functions is assumed known except for a set of parameters. Based on the given training patterns, an iterative procedure, sometimes called training or learning, can be formulated to determine reasonable values for the unknown parameters. This approach is termed the nonparametric (or deterministic) method of training. This thesis will be primarily concerned with nonparametric training.

(3) The generalization problem. Once a set of discriminant functions has been determined, an attempt may be made to predict performance for new patterns. If the parametric method was chosen to solve the abstraction problem, the generalization problem consists in determining the error probability. If the nonparametric approach was adopted, the generalization question is usually difficult to answer because of the absence of a suitable criterion. The misclassification percentage with training patterns is usually employed even though it cannot be related mathematically to the problem parameters.
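The decision rule (1) and the training-set misclassification percentage just mentioned can be made concrete with a short sketch. The following code is not part of the original thesis; it assumes linear discriminant functions f_i(X) = a_i^T phi(X) with phi(X) taken to be the augmented pattern itself, and all names (augment, classify, A, Phi) are hypothetical:

import numpy as np

def augment(X):
    """Append the constant component 1 to each d-dimensional pattern (rows of X)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def classify(A, Phi):
    """Decision rule (1): assign each pattern to the class whose discriminant
    f_i(x) = a_i^T phi(x) is largest.  A is M x (d+1); Phi is N x (d+1)."""
    scores = Phi @ A.T            # N x M matrix of discriminant values
    return np.argmax(scores, axis=1)

def misclassification_percentage(A, Phi, y):
    """Training-set error rate, used here as the nonparametric generalization estimate."""
    return 100.0 * np.mean(classify(A, Phi) != y)

# Hypothetical two-feature, three-class example with an arbitrary weight matrix A.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
y = rng.integers(0, 3, size=12)
A = rng.normal(size=(3, 3))       # one row of weights per class; last column is the threshold
Phi = augment(X)                  # here phi(x) is simply the augmented pattern itself
print(misclassification_percentage(A, Phi, y))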
1.2 Multiclass Problems

A single data source, called a pattern class or category, generates each pattern. In the two-class problem, each pattern originates in one of two pattern classes. A review of the current literature (H2, H3, P1) indicates that the majority of pattern recognition algorithms have been developed for the two-class problem. In theory, the M-class (M > 2) problem can be solved as a collection of M(M-1)/2 two-class problems (N1). However, there are several advantages in considering M-class pattern classification on its own merits.

(1) A generalization of a two-class algorithm to solve a multiclass problem by pair-wise operations is quite involved computationally. For instance, it would be necessary to compute M(M-1)/2 generalized inverse matrices in the Ho-Kashyap algorithm (H2); with an M-class algorithm, only one inverse computation is required.

(2) Two-class algorithms give no immediate answer to the multiclass problem even if a solution exists. A voting scheme must be imposed to classify a pattern. With an M-class algorithm, there is no need for a voting scheme, and it will furnish a reasonable suboptimal solution in case of overlapping sets of training patterns.

(3) With a deterministic M-class algorithm, the problems of pattern recognition and cluster-seeking can be combined, as discussed later, into a single problem. Such a simplification is the main reason for devoting this thesis to deterministic M-class algorithms. Criteria and procedures are presented in Sec. 5.2.

1.3 Cluster-Seeking and Pattern Recognition

According to Ball (B1), "A cluster is a set of patterns contained in a high dimensional space where the density of patterns is large compared to the density in the surrounding volumes." The idea behind cluster-seeking techniques is the grouping of patterns into clusters or groups so that all patterns within a cluster are very "alike" and patterns in different clusters are very "unalike". Several measures of similarity will lead to clustering algorithms. For instance, Friedman and Rubin (F1) proposed some invariant criteria for grouping data through linear transformations. Ball and Hall (B2) suggested the distance between a pattern and the cluster center be used. A number of other measures of similarity were categorized by Ball (B1). In this thesis, the criterion of minimum probability of misclassification will be used for cluster-seeking. A pattern recognition system is proposed which combines cluster analysis and classification in such a way that knowledge of data structure guides pattern classification, and the results of pattern classification provide additional insight into the true structure of the data.

Assuming that the feature extraction problem has been solved, a set of (augmented) training patterns D = {(Xi, yi), i = 1,2,...,N} from M pattern classes is provided. The value of yi, which indicates the true classification of Xi, may or may not be given for all i. The functional block diagram of the proposed system is shown in Figure 1.

Figure 1. A proposed pattern recognition system. (Flow chart: training patterns with known or unknown classification enter supervised or unsupervised structure analysis; the learning phase applies an M-class algorithm to compute the discriminant functions, checks whether performance is acceptable, and changes the number of clusters otherwise; the decision phase takes new observations and makes decisions.)

If the classifications of all training patterns are known, the training patterns from each class are grouped to form one cluster per class. A preselected M-class iterative learning algorithm is applied to learn the unknown parameters in a set of M discriminant functions. The rate of learning is determined by properties of the algorithm and the distribution of patterns. If the performance of the algorithm is acceptable, a set of discriminant functions is obtained which adequately classifies all training patterns. By comparing the results of this classification with {yi}, the misclassification percentage can be computed. If the performance is acceptable, the pattern recognition problem is solved and the data structure obtained assigns one cluster to each pattern class. Otherwise, all the misclassified patterns from each class are clustered. These new clusters, together with the clusters of correctly classified patterns, represent an updated data structure. A new set of discriminant functions is then obtained. The learning process will not stop until performance is satisfactory.
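To make the learning phase concrete, the following sketch (not from the thesis, and not the specific procedure of Sec. 5.2) shows one way the supervised loop of Figure 1 could be organized. It assumes a simple least-squares (generalized-inverse) fit standing in for whichever preselected M-class iterative algorithm is actually used, and it splits the misclassified patterns of each class into a single new cluster rather than invoking a full cluster-seeking routine; all function and variable names are hypothetical:

import numpy as np

def train_least_squares(Phi, cluster_ids, n_clusters):
    """One-shot generalized-inverse (least-squares) fit of one linear discriminant
    per cluster: A^T = pinv(Phi) @ E, where E holds +1/-1 cluster targets."""
    E = -np.ones((Phi.shape[0], n_clusters))
    E[np.arange(Phi.shape[0]), cluster_ids] = 1.0
    return np.linalg.pinv(Phi) @ E          # (d+1) x n_clusters

def learning_phase(X, y, n_classes, max_rounds=5, acceptable_error=0.05):
    """Illustrative version of the Figure 1 learning loop (supervised case):
    start with one cluster per class, re-estimate the discriminant functions,
    and split each class's misclassified patterns into a new cluster."""
    Phi = np.hstack([X, np.ones((len(X), 1))])      # augmented patterns
    cluster_ids = y.copy()                          # one cluster per class initially
    cluster_to_class = list(range(n_classes))       # which class each cluster represents
    for _ in range(max_rounds):
        A_T = train_least_squares(Phi, cluster_ids, len(cluster_to_class))
        cluster_hat = np.argmax(Phi @ A_T, axis=1)
        y_hat = np.array([cluster_to_class[c] for c in cluster_hat])
        error = np.mean(y_hat != y)
        if error <= acceptable_error:
            break
        # update the data structure: each class's misclassified patterns form a new cluster
        for k in range(n_classes):
            wrong = (y == k) & (y_hat != k)
            if wrong.any():
                cluster_to_class.append(k)
                cluster_ids = cluster_ids.copy()
                cluster_ids[wrong] = len(cluster_to_class) - 1
    return A_T, cluster_to_class, error

# Example usage with random data (three classes in two dimensions):
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(30, 2)) for m in ([0, 0], [4, 0], [0, 4])])
y = np.repeat(np.arange(3), 30)
A_T, clusters, err = learning_phase(X, y, n_classes=3)
print(err, len(clusters))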
If the true pattern classifications are unknown, it is possible in principle but not in practice to solve the computational problem of evaluating all the possible partitions of the input data into M clusters in order to find a partition that minimizes the probability of misclassification (F1). A suboptimal procedure for unsupervised structure analysis is discussed in Sec. 5.3. In any event, the system is switched from the learning phase to the classification phase when satisfactory system performance has been reached with training patterns.

1.4 Thesis Objectives and Outline

This thesis suggests that cluster-seeking and multiclass pattern recognition can be combined into a workable pattern recognition system, Figure 1, in which the two problems are solved in a step-wise fashion, partial solutions for one providing clues for the other. The literature for both clustering and multiclass, nonparametric pattern recognition studies is reviewed, and a specific algorithm is proposed in Sec. 5.2. Several new multiclass algorithms are proposed as extensions of two-class algorithms.

In order to realize this proposed system, a class of linear multiclass learning algorithms is derived in Chapters II, III, and IV. The idea of translating the abstraction problem into an optimization problem, in which linear functionals are minimized under certain constraints, provides a great deal of freedom when creating new algorithms. In fact, the only limitation is one's imagination in finding meaningful linear functionals. The concept of linear inequalities and the mean-square-error criterion for formulating linear functionals are studied in Chapters II and III, respectively. Based on these criteria, a general learning algorithm for the M-class problem is derived. In each chapter, particular algorithms are formulated as special cases.

Since no assumptions are made about the origins of the patterns in Chapters II and III, the pattern recognition problem is solved only for the given training patterns. In order to solve the generalization problem, we must view the fixed training patterns as a sample from a population, as is done in Chapter IV. The conditional probability functions P(wi|X), i = 1,2,...,M, are selected as the optimal discriminant functions. The techniques of stochastic approximation (W5) are employed to estimate the coefficients of the unknown discriminant functions. Several M-class stochastic approximation algorithms are proposed.
The relations among them and the sensitivity problems are investigated. Chapter V examines the abstraction problem in unstructured situations. A pattern recognition system which combines multiclass pattern recognition and cluster-seeking is proposed. The system has been tested with both artificial and practical data.

CHAPTER II
GENERALIZED LINEAR INEQUALITIES IN MULTICATEGORY PATTERN RECOGNITION

The most important task in nonparametric pattern recognition can be posed as the selection of a set of weight vectors {Wi} that defines a set of discriminant functions {fi(X) = Wi^T Φ(X)}, where Φ(X) is a vector function of the augmented pattern X; that is, the output of a "Φ-machine" (N1). If training patterns are provided, the unknown weights can be determined either with or without training. For instance, a minimum-distance classifier (N1) is achieved without training since the unknown cluster points are computed directly from the training patterns non-recursively. In this section, training is viewed as an optimization problem and the concept of linear inequalities is chosen to form linear functionals. A general training algorithm for the M-class problem is derived and a collection of training algorithms are presented as special cases.

2.1 The Concept of Linear Inequalities in Pattern Recognition

In the two-class pattern recognition problem, a set of N patterns with known classification is given. The problem is to construct a discriminant function f(X) such that

    f(X) > 0  if X is in class w1
                                                                        (2)
    f(X) < 0  if X is in class w2

for as many of the N training patterns as possible. By the Weierstrass theorem on polynomial approximation (W1), one can choose a polynomial which uniformly approximates the continuous function f(X) on a closed interval with arbitrary accuracy. The function f(X) will be approximated by

    f̂(X) = Σ_{i=1}^{d+1} c_i φ_i(X) = C^T Φ(X).

The parameters C^T = (c1, c2, ..., c_{d+1}) are unknown; Φ^T(X) = (φ1(X), φ2(X), ..., φd(X), 1) are linearly independent, preselected, real functions. Nilsson (N1) calls f̂(X) a "Φ-function," and any pattern classifier employing Φ-functions, a "Φ-machine." It is not necessary to actually approximate the discriminant function f(X) itself. It is sufficient to achieve agreement in the signs of the functions f(X) and f̂(X) for all training patterns. Thus, the problem of finding a discriminant function can be converted to the problem of solving the system of N simultaneous linear inequalities in (3).

    y C^T Φ(X) > 0  for all training patterns X.                        (3)

Here y = 1 if X is from class w1 and y = -1 if X is from class w2. The coefficients in C^T are chosen to map Φ(X) into either 1 or -1 as dictated by y. Once C^T is chosen, the decision rule is: decide X is in w1 if C^T Φ(X) > 0, and in w2 if C^T Φ(X) < 0.

In M-class pattern recognition, a set of discriminant functions {fi(X)}, i = 1,2,...,M, is needed. Each discriminant function will be approximated by a finite sum
a ‘_ M j _.M1 M2 M,d+l The letters {e1,e2,...,eM} denote a set of M m-dimen- sional points, called vertices, for which eiej = O for all i # j. By analogy to the two-class problem, matrix A is chosen to project ¢(X) onto the vertex ei as dictated by Y1, the natrual generalization of y; that is, {Y1} (i = 1,2,...,M) is a set of MXM-dimensional orthogonal matrices which satisfy the following conditions. Y?Y, = I 1 1 for all i = 1,2,...,M O Y.e. = e 1 1 1 Thus, the orthogonal matrix Yi transforms the vertex ei into e1 for each of the M pattern classes. If the M pattern classes are linearly separable, there exists a linear transformation A which maps each training pattern X from class mi close to the vertex ei. Choosing the vertex 12 e1 > 0 (each entry of e1 is greater than zero), an orthonormal matrix Yi is to be found such that Y1A¢(X) ezel for all X in class mi The natural extension of (3) to the M-class problem is: YiA¢(X) > 0 (vector) for all i and X when X is in class w. (4) 1 This is a system of MN simultaneous linear inequalities. The inequalities (4) can be written in another form. ‘CT¢(X) - 1‘ < ‘CT¢(X) + 1‘ if X C class ml (5) T T . \c ¢(X) - 1‘ > |c Q15(X) + 1\ If x c class wz . An extension of (5) to the M-class problem was proposed by Chaplin and Levadi (Cl). HA¢(X) ' 61“ < HA¢(X) - eju if X 6 class wi for all j = 1,2,...,M; i # j (6) where {ei}T=1 are M-dimensional vertices satisfying the follow- ing conditions: He H E trace {e eT} = 1 i i i uei - eju = “ei - ck“ for all j,k # i . Equation (6) can be rewritten as T T , . . eiA¢(X) > ejA¢(X) If X 6 class mi for all J # 1 for all training pattern X. Equivalently, decide X is in class w. if 1 13 eiAqfiX) - max {eTA¢(X)} > O (7) jh j Inequalities (3) and (4) are based on the idea of achieving agreement between the signs of discriminant functions and approximat- ing functions. The inequality prOposed by Chaplin and Levadi is based on a minimum-distance criterion. In both cases, the assumption of linear separability assures the existence of a linear transforma- tion A which correctly classifies all N training patterns. When training patterns overlap, Equations (4) and (7) will be inconsistent, and no solution for a matrix A can be found. 2.2 The General Learning Algorithm It is a common practice in solving linear inequalities to transform the problem into an Optimization problem. Then, the gradient descent method, or some other optimization procedure, is chosen to minimize certain linear functionals. A general approach for obtaining meaningful linear functionals will now be described; these functionals will then be minimized to obtain general learning algorithms. The two-class pattern recognition problem will be con- sidered first, and the results extended to the M-class problems. Equation (3) can be rewritten as yCT®(X) = B > O ; B is a positive number . (8) Define Z = yCT¢(X) - B for all X . A strictly convex and differentiable function defined on the set 2 is denoted by F(.) where l4 F(Z)>0 if Z<0 = 0 if z 2 0 . A solution to the equation 2 = O can be obtained by minimizing a linear functional J(C) = L{F(2)} = L{F(yCT¢(X) - 3)} where L denotes a linear operator. It is clear that the functional J(C) will also be strictly convex and differentiable, which guarantees the existence and unique- ness of a minimum. The problem is now to find C such that J(C) is a global minimum for the N training patterns. The usual method for minimizing J(C) is the gradient descent procedure (B3, Dl). 
An algorithm can be written as C[n] = CLn-l] - p :éLE 1] x[ 3 (9) where n is the iteration index, and X[n] is the training pattern presented to the algorithm at the nth iteration; p is a positive scalar which determines the distance to be moved at each step in the direction of the negative gradient of J(C) in the parameter space. Equation (9) leads to the following general iterative algorithm for the two-class pattern recognition problem. (GA.1) c[n] = c[n-1] + pL{F'(en_1)Y[“]¢(x[“])} em = sin-1] + pLUF'\ - Wen-1’} where en-l = y[n]CT[n-l]¢(X[n]) - B[n-l] for all n = 1,2,3,... and B[O] > O . 15 It should be noted that, in the above algorithm, only one training pattern is used at each iteration. Each of the N train- ing patterns must be presented to the algorithm many times to assure that J(C) is a global minimum. Algorithm (GA.l) can be extended easily to the M-class prob- lem by invoking (4). Equation (4) can be rewritten as YiA¢(X) = v > O for all i, X . Letting W = YiA¢(X) - y, a solution of the linear equations W = O can be obtained by finding the minimum of the linear functional: J(A) = L{F(W)] = L{F(YiA¢(X) - y)} . Again using (9), a general learning algorithm for solving the M-class pattern recognition problem is obtained. (GA.2) A[n] = A[n-1] + pL{F'(gn_1)Yf[n]¢(X[n])} Yin] = Yin'll + DL{\F'(€n_1)‘ ' F(en_1)} where en-l = Yi[n]A[n-l]¢(X[n]) - y[n-l] . Euqation (GA.2) holds for all X in class wi’ i = 1,2,...,M; and n = 1,2,...,; y[0] > 0. Similarly, (7) can be rewritten as T T eiA¢(X) - max ejA¢(X) = B >’O . (10) ji‘i This leads to the following general learning algorithm: l6 (GA.3) Ain] = Aim-1] + pLiF'(sn_l>}[ei¢T(X[n]> - mix ej¢T(X[n]>} j i B[n] = B[n-l] + pL{\F'(en_1)‘ ' F'(en-1)} where en-l = e§A[n-I]¢(X[n]) - mg: e§A[n-1]¢(X[n] - e[n-1]). Equation (GA.3) holds for all X in class mi where i = 1,2,...,M and n = 1,2,... . In two-class pattern recognition, setting e = 1, e = -l, and A = CT changes (10) to: 2CT¢(X) = B if X 6 class ml T ' . -2C ¢(X) = B If X 6 class wz . This is equivalent to: T . yC ¢(X) = 5/2, y = 1 If x 6 class ml -1 if X 6 class m2 - This is exactly (8). Therefore, (GA.3) reduces to (GA.1) for two- class problems. Based on the general learning algorithms (GA.1), (GA.2), and (GA.3), a particular class of iterative algorithm can be derived easily. In fact, the number of specific algorithms is limited only by the availability of meaningful convex functions F(.) and linear operators L. In Sec. 2.3, a class of standard iterative algorithm for the M-class problem is presented. 2.3 A Class of Iterative Procedures for Linear Inequalities A class of iterative algorithms for the two-class problem corresponding to (GA.1) with B[n] E O for all n is given in 17 Table 2 of Devyaterikov, Propoi, and Tsypkin (D1). The extension to M-class algorithms according to the general algorithm.(GA.2) is straightforward and hence not included. A collection of iterative algorithms corresponding to (GA.3) will be presented in this section. 2.3.1 The Fixed Increment Method Consider the following linear functional. 1 T T J(A) = —{‘e_A¢(X) - max e,A¢(X) - 5‘ 2 1 j#l J - (eEAqflX) - max e?A¢(X) - 3)} . j#i J This function has a unique minimum only for those values of A satisfying (7). Note that J(A) = 0 if egA¢(X) > e§A¢(X) for all i # j, X 6 class mi T T . = -(e,A®(X) - max e,A¢(X) - 3) otherwise . 1 ifij J The gradient Of A is J _ , T T . . T T . = -(ei¢ (X) - max ei¢ (X)) otherwise . 
l#j Substituting into the general algorithms (GA.3): (PA.1) A[n] A[n-l] + %p{[ei¢T(X[n-l]) - max ej¢T(X[n-1])] ' jfil [ei¢T(X[n-l]) - gm: ej¢T(X[n-1])]sgn(en_l)} B[n-l] + %p{[1 - sgn(gn_1)] +"l sgn(en_1)‘} BIn] where en-l — e§A[n-1]¢(X[n-1]) - max e§A[n-l]¢(X[n-l]) - Bln-l] . j¢1 18 In this algorithm, no correction is made to A if eEA¢(X) - max eTA¢(X) > B , jfi j which means X is correctly classified. If X is incorrectly classified, A is incremented by T T p(ei¢ (X) - max e.¢ (X)) jfi 3 That is, A[n] = A[n-1] + p(e ¢T(X[n-1]) - max e.OT(X[n-1])) The convergence proof of algorithm (PA.1) with B[n] E O for all n is a modified version of Novikoff's convergence proof for perceptrons, the details of which can be found in Blaydon, Appendix II.2 (BB). 2.3.2 The Relaxation Method Consider the following linear functional. J(A) = l((e'¥‘A¢(X) - max eT,‘AQ5(X) - B) 8 1 1#j J T T 2 - esA¢(X) - max e A¢(X) - B ) . A 1 jfi j ‘ Thus J(A) = 0 if e§A¢(X) > e§A¢(X) for all i # j, X 6 class mi §A¢(X) - B)2 otherwise . 1 T "(eA (X) -maxe 2 1 ¢ j#I The gradient of J(A) is its 0 if eEA¢(X) > e§A¢(X) for all i # j, X 6 class mi T T (eiA¢(X) ' max ejA¢(X)-B)(ei¢T(X)-max e-¢T(X)) otherwise. 1*) i#j J 19 Substituting into the general algorithm (GA.3) yields: (PA.2) A[n] = A[n-1] if e§A[n-l]¢(X[n-l]) > e§A[n-1]¢(X[n-1]) for all i # j, x 5 class mi = Aim-1] + puendlteisTocin-Iv-Tz ejchmn-Im otherwise where en_1 = efA[n-l]¢(X[n-l])-?:§ e§A[n-l]¢(X[n-l])-e[n-l] . B[n] = e[n-l] if e§A[n-l]¢(x[n-l]) > e§A[n-l]¢(X[n-l]) for all i # j, X 6 class mi = Bin‘I] + p{len_1‘ + €n_1} otherwise . Algorithm (PA.2) behaves as algorithm (PA.1) except that the increment added to A[n] is weighted by the magnitude of the absolute error. 2.3.3 The Minimum-SquareéError Method Consider the following strictly convex and differentiable linear functional. 1 T T 2 J(A) = —(e,A®(X) - max e A¢(X) - B) , 2 1 j I#j The gradients of J(A) are LJ ... (e’ngbCX) - max eTAsoc) - e)(ei¢T - max 8 451300) ai.= - ?A x - IA x - 86 (e1 ¢() :1; ej ¢() 8) 20 Substituting into the general algorithm (GA.3) yields: (PA.3) A[n] A[n-1] - p{en_1[eiOT(X[n-1] - 9:3: ej¢T O for all i and X 6 class mi . 1H 3 If it is satisfied, the problem is solved. If not, the number of misclassified training patterns is counted; the learning procedure is terminated after a fixed number of iterations and the solution with the smallest number of errors is retained. CHAPTER III THE DEAN -SQUARE -ERROR CRITERION IN MJLTICATEGORY PATTERN RECOGNITION The unknown parameters in a discriminant function are established by selecting and Optimizing a performance criterion. If the criterion is well defined, the parameters can be found by search techniques such as the gradient descent method in (9). For the two-class problem, the criterion function should be chosen so that the discriminant function f(X) has approximately one value whenever X comes from class ml and a different value whenever X comes from class wz. One such method is the mean-square-error criterion, first used by Widrow and Hoff (W2) to find an adaptive procedure for classifying binary patterns. Koford and Groner (Kl) later extended Widrow and Hoff's procedure to real-valued patterns and showed that the adaptive pattern classifier converges to the optimal classifier (in the Bayes sense) for normally distributed pattern classes with equal covariance matrices. Sebestyen (Sl) prOposed a two-step clustering transformation which maps all patterns belonging to one class into the neighborhood of a fixed point. 
Sebestyen's cluster- ing transformation actually minimizes a mean-square-error criterion. A mean-square-error approach to the M-class problem, using the generalized inverse idea for rapid computation, was undertaken by Wee (W3). This approach allows numerous results obtained in the 21 22 two-class problem to be extended to the M-class problem. In fact, the results show that the pattern classifiers proposed by Widrow and Hoff (W2), Patterson and Womack (P1) and Chaplin and Levadi (C1) are special cases of Wee's procedure. The remainder of this chapter will contain a mathematical formulation and some applications of the mean-square-error criterion. 3.1 The Mean-Square-Error Criterion in Pattern Recognition The mean—square-error criterion will be formulated for the two-class problem and then extended to the M-class problem. Since minimizing the mean-square-error does not necessarily minimize the probability of misclassification, the relation between the resulting discriminant function and the Optimum Bayes discriminant function will be discussed under special circumstances. If the discriminant function f(X) in (2) is considered to be a transformation from the feature space to the real line, then a reasonable solution to the two-class pattern recognition problem will be to find an 'f(X) which transforms all patterns X belonging to class m1 as close as possible to some number 31 > O and which transforms all patterns belonging to class m2 as close as possible to 32 < 0. Using this reasoning, the mean-square-error criterion can be formulated as follows. N2 2 N . 2 [Minn-3112 + 2: [f(xian-BZ) (11) 1 2 i=1 i=1 1 where {X1(i),X2(i),...,XNi(i)} are the training patterns from class mi (i = 1,2). Patterson and Womack (P1, P2) have proposed a modified version of (11), 23 P N1 T 2 P2 N2 T 2 J(C) = J z [c X.(1)-C(2/1)] + —— z [c x.(2)+C(1/2)] N1 i=1 1 N2 i=1 1 where P1, P2 are the prior probabilities of pattern classes wl and m2, respectively; C(i/j) denotes the cost in assigning pattern X to class mi when it really belongs to class mj. Since J(C) is a convex function of C, a unique minimum exists, which can be obtained by computer search techniques. Patterson and Womack (Pl) also relate the optimum Bayes discriminant function and the least mean-square-error discriminant function as the number of training patterns from each class approaches infinity. This relationship is stated in Theorem 3.1.1. If pi(X) denotes the probability density of X, given that class m i is active, and P is the prior probability of 1 class wi’ the Bayes discriminant function is: 5(X) 2 0 if X 6 class ml (12) < 0 if X 6 class m2 where 800 £C(2/1)P1pl(X) - C(1/2)P2p2(X) . P1p1(X) + P2132 (X) N l 1 T 2 Theorem 3.1.1. If lim f 2[(: Xi(1)-C(2/1)] = N am 1 i=1 1 EIECTX - C(2/1)] almost everywhere and N 1 2 T 2 1: 2 11m —-— z [c x,(2) + c(1/2)] = E [c x + C(1/2)] N 1 2 Nzem 2 i=1 almost everywhere 24 * then, if C =C minimizes J(C) as N also minimizes the following integral [ch - e]2[P1p1(x> + P21»2 1dx fax provided the integral exists, where “X is the feature Space. In other words, the least mean-square-error discriminant function is the best mean-square, linear, approximation to the optimum Bayes discriminant function as the number of training patterns from each class approaches infinity. Equation (11) can be rewritten as 2 N- 1 1 2 J =1; )3 Z [f(X.(i)) - Bi] 3 i=1 j=l 3 where N = N1 + N2 or, equivalently, as: 2 Ni 1 T 2 1 2 J(C) =1; 2 r (x.(i)C-a.) 
=-\\YC-a\\ (13) ._ ._ j l N 1-1 j-l where T Y - X1 T x2 training patterns from class w J k-_ - .... r___._.___.J h’ training patterns from class wz 25 B = 61 , where Bi > O for all i = 1,2,...,N L__._J The mean-square-error criterion will now be extended to the M-category problem. Consider the set of M discriminant functions £f1(X)}?=1 as a set of transformations which, for each i, fi maps all multidimensional patterns belonging to class mi as close as possible to some K-dimensional vertex* ei = (eil’ei2"'°’ei,K)T as defined in Sec. 2.1. The mean-square error criterion can be written as follows. 1 K M Ni T 2 k=1 1=1 j=l This is equivalent to _ I. T 2 A l_ T T T J - N “WA -EH _.N trace {(YA as) (yA 43)} (15) where A is defined in Sec. 2.1 and r- __ Y = Y[1] vtraining patterns from class m1 Y[2] training patterns from class wz Y[M] training patterns from class wM * The number K = M-l if all the vertices lie on a sphere with centroid at the origin having equal angles formed by the lines joining the vertices and the origin, that is, eT = l iej if 1 = j = -l/K otherwise _ T K M if eiej 1 if i = j 0 otherwise 26 M r- N = Z N and Y[i] = X'{(i)-j for all i 1:1 i i XN (1) __ i _. E = e = e e ... e 1 11 12 1M T eT e e e __N_J __N1 N2 " RE“ where e, > 0 if X, 6 class w lj 1 j e,, < 0 if X, E class w . 1] 1 j The pattern recognition problem becomes the problem of selecting A and E to minimize (15). In Sec. 3.2, the relation to Bayes procedure will be discussed. 3.2 The Generalized Inverse Approach in Multiclass Pattern Recognition Generalized inverse computations can be used to furnish a quick solution under the mean-square-error criterion for M-class problems, (15), for fixed-size training patterns. Since the rows of matrix E in (15) can be interpreted either as reference points (vertices) or as cost vectors, the pattern classifier proposed by Chaplin and Levadi (Cl) and the adaptive classifier of Patterson and Womack (P1) are special cases of the generalized inverse approach proposed by Wee (W3). was also extended Theorem 3.1.1 to the M-class problem with (15). However, in both of his interpretations, the matrix E is fixed and a solution to (15) is obtained by letting dJ/dA = O. 27 A modified version of the generalized inverse approach per- mitting variation in the E matrix, subject to certain constraints, will now be described. In fact, the proposed method resembles the Ho-Kashyap strategy (H2) for the two-class problem. The introduction of the entries of the E matrix as additional parameters in the optimization problem improves the convergence rate of the algorithm. Since A is unconstrained, the minimum of J in (15) can be obtained by letting dJ/dA = O. This implies that: - # AT = (YTY) 1 YTE = Y E where Y# is called the generalized inverse (W3). Let N. 1 1 x[i] = §- 2 X,[i] i j=1 J T 1 M Ni T xx = {q- r z x.[i]X.[i] i=1 j=1 J J ._ 1 M x = " z N.X[i] . N 1 =1 Then, T M Ni T .4. M Ni T A = 2 2 X.[i]X.[i] z z X.[i]ei i=1 j=1 J 1 i=1 j=1 J -—-— M = N'1(xxT)‘1 z N, X[i] e? , i=1 1 1 Set e: = (C(1/i),C(2/i),...,C(M/i)) = CT(i) and assume 28 C(j/i) o if i = j (1 6) c > 0 otherwise -1 __ N ____ __ N ____ __ N ____ AT = c(X x1) (x {€1qu x -N—2-X[2],...,X -fiMX[M]). A reasonable decision rule is: Decide X 6 class mi if T “X?A - cT(1)u2 < HX?AT - cT(j)\\2 for all j # i. (17) Expanding (17), the decision rule becomes: Decide X 6 class wi if M M c 2 XTak - %(M-1)c2 > c 2 XTak - 'Zl'(M-1)c2 k=1 k=1 Rfii k#j or, if XTaj > xTai for all j r i (18) where _'I '1 - Ni ai = c(XX ) X - fi-'X[i] . 
Equation (18) is the result of the generalized inverse approach when the E matrix satisfies (16). In practice, if a pattern class has multimodal structure, a cluster-seeking technique should be employed and one discriminant function should be assigned to each cluster before applying (18). The relation between the generalized inverse approach and the Optimum Bayes decision rule is as follows. By analogy to (12), the Optimum Bayes discriminant functions for the M-class problem are (W4) 29 M 2: C(i/k)P (X)? k=1 k k Bi(X) = M for all i = 1,2,...,M E p (101’ k=1 1‘ k where, ideally, Bi(X) < B (X) if X is from class mi. 1 Let 3T .-. (31(x), 132(X),...,BM(X)) Theorem 3.1.1 can be extended to the M-class problem as follows. N 1 Theorem 3.2.1. If lim .fi" 2 HXI[i]AT-e?u . ._ j 1 N14» 1.3-1 = Ei[HXT[i]AT-efuz] almost everywhere then M. N Ni 2 lim 2 fii' fi" 2 WXT[i]AT-e?u N—ml=1 ij=l j 1 i M = i§1P1E1[HXT[i]AT-eifiz] almost everywhere . Furthermore, minimizing J of (15) is equivalent to minimizing the integral M T T T 2 HXA - B H ( Zpk(X)Pk) dx . I“): k=l The proof is given in Wee (W3). Thus, the discriminant functions obtained by the generalized inverse approach are closest among all linear functions to Optimum Bayes discriminant functions in a mean-square sense as the number of training patterns approaches infinity. It is assumed that the entries of matrix E can be varied subject to the constraint that any rwo vector e in the grouping 30 of rows corresponding to class wi must satisfy the inequality T T . e (n)ei(0) 2 e (n)ej(0), for all j r 1 where n is the iteration number, and where ei(O) is the vertex vector assigned to class mi. It satisfies: e§(0)ej(0) a if 1 = j s > 5 3 otherwise where a, B are real numbers. The problem becomes to find A and E so as to minimize J(A,E) '1qu “XAT-EHZ = trace{(XAT-E)T(XAT-E)} . ZIP‘ The complete algorithm can be derived by minimizing J with the gradient descent method. (GA.4) AT[O] = Y#E[0] D[n] = yAT[n] - E[n] AT[n+1] = AT[n] + Y#6E[n] E[n+l] = E[n] + 6E[n] where 5Ein11j = DDin11j if 91“]ijej[0] 2 e[n]ije&[0] for all L f j = 0 otherwise Here, 6E[n]ij denotes the ith row vector in the grouping of rows corresponding to class mi. The convergence proof is given in Appendix A. A computer simulation Of (GA.4) was performed on CDC 3600. Two artificial pattern recognition problems were con- sidered. The first problem contains three linearly separable 31 pattern classes with two training patterns from each class. The decision surfaces which correctly classify all training patterns are shown in Figure 2. The second problem contains three linearly unseparable pattern classes. There are eight training patterns from class 1, six from class 2 and two from class 3. The decision surfaces which misclassify one training pattern are shown in Figure 3. 3.3 Some Properties of LeastAMean-Square-Error Pattern Classifiers The properties of linear two-class pattern classifiers which are based on the mean-square-error criterion have been investigated by a number of authors (K1, P1, P2). In this section, the statistical properties of two-class classifiers will be studied and the results extended to the M-class problem. The mean-square-error criterion for the two-class problem is given in (13) N. l 2 1 T 2 J(C) =§ 2 2 (X.(i)C - Bi) i=1 j=l J or equivalently, N 2 i 1 *T * 2 J(C) =fi' 2 Z (X. (1)0 + CO - Bi) i=1 j=l 3 where 0* A (C C C ) and C A C " 1’ 2""’ d O -' d+1 ' For simplicity, the star (*) is dropped. Thus 1 2 N1 T 2 J(C) =" 2‘. 2 (X (i)C +C - B.) 
(19) N i=1 j=l J 0 1 ° The problem is to find C and CO to minimize J(C). 32 f13(X) = 0 l 3 E X1 2 ‘3 Class #3 ‘L / I 1+2 — X1 - 0.26X2 - 0.16 — X1 + 1.28X2 - 0.16 f13(X) = XI - 6.4X2 - 0.16 H1 ...: N A X V I H N U ’52 v I Figure 2. Decision surfaces for algorithm GA.4. 33 /1 2 \‘x a x X 3 //\ Class #1 A - X 0' X2 f23(X)=O Class #3 A K x x R 5 4 t = X 3 1 2 3 4 1 O 2 o O 1 x = O 2 f12() 0 3\ C1ass#2 \1 f13(X) = 0 f12(X) = XI + l.lX2 + 0.01 £1300 = XI - 0.3x2 + 1.2 Figure 3. Decision surfaces for algorithm GA.4 34 Let Ni Al . M _._ Z. X(1) and 1 N1 3:1 j N (20) s 11— 21mm -M) (xm -M)T i‘Ni 1:1 j i j i be the sample mean vector and sample covariance matrix for pattern class wi' Equation (19) can be rewritten as N _ l T T 2 J(C) — N1+NZ 1:1 Ni{C sic + (Mic + cO - Bi) } . (21) Letting OJ/OCO = 0 , 2 2 T 2: N.(MC+C - 8.) =0 N1+N2 i=1 1 i 0 1 or T c = -(MlN1 + MZNZ) c + N131 + N282 (22) 0 N1 +u2 ‘ If N1 = N2 and Bl = 1, 82 = -1, then _ T c0 — -l/2 (M1 + M2) c . Substituting (22) into (21), 2 J(C) = CT“1 + s2)c +1 (CTCM1 - M2) - 2) NIH Let “5 i @i and M12 _.M1 - M2 ||r1:z 1 . J(C) = cTsc +% (CTM12 - 2)2 35 Letting aJ/aC = 0, l T ZQC + §'(C M12 - 2)M12 = 0 or 'r sec + (Mucus12 - 2M12'= o . Try = (1)112 2 + "11“;[22 1M12 Thus (MT - 1 + (1112‘1 l’hzi’hz _ 2M 2 + 1M? {R11 2 + 11113212 2 2 2111112 1 T - ’ 2 +51? {1111 141524101121 11“12mm 4 2 2 2 'r -1 ) ' (M121 M12 M121 = o . From (23), — - = -1 - c—Ks 11412 Rs (M1 M2) where K — _ 1 T - 2 +2111? 1“12 (23) The least-square-error linear classifier for the two-class problem is: 36 , T ml If X C +Co 2 0 T say X 6 class m2 if X C +-CO < 0 say X 6 class (24) where C and C0 are defined in (23) and (22) respectively. On the other hand, the optimal Bayes pattern classifier for the two-calss problem, assuming normal distribution with equal co- variance matrices, is (D2) X 6 class m1 if T -1 l. T -l X Q (M1 - M2) - 2(M1‘+ M2) Q (M1 - M2) 2 0 . (25) Let c = {1011 - M and c = %(M T 2) 0 ‘+ M2) C . 1 Equation (25) becomes, X G class ml if XTC + C0 2 0 which is equivalent to (24). Therefore, the least-mean-square-error classifier is equi- valent to the optimal parametric classifier for normally distributed patterns with equal covariance matrices. A different proof leading to the above result has been pro- vided by Koford and Groner (Kl), who also show that a simple modifica- tion of the least-mean-square-error adaption procedure enables the adaptive structure to converge to a nearly optimal classifier, even though the numbers of training patterns from the two categories are not equal. 37 For the M-class problem, the mean-square-error criterion is given in (14). 1 K M N1 'r 2 J=§ E z z (xj(i)ak-eik) k=l i=1 j=l or equivalently, N K M i * 2 J=l% z 2 )3 (X::T(i)ak+ak -e'k) k=l i=1 j=l ° 1 where A * T A * T X _.(X ,l) and ak _ (ak’ako) . Again, for simplicity, the star (*) is drOpped. Thus M Ni K r 2 z z 2 (x(i)a +a -e.) . k=l 1=1 j=1 3 k k° 1k (.4 ll 2 IH The problem of minimizing J with respect to ak and ak0 is equivalent to minimizing J(ak, ako)’ k = 1,2,...,N, where 1M N1 T 2 J(ak, ako) = fi-igl jEI (Xj(1)ak + ak0 - eik) . (26) Applying (20) and assuming N1 = N2 =...= NM, Equation (26) can be rewritten as: J( )=-1-M T +(TM+ )2 27 ak’ako M 121 ak‘iak £1R i ako " eik ' ( ) Letting aJ/aako = 0, M r + - 1:1 (akMi ako eik) ZIN> 38 Let M M — 1 ‘- 1 M = -' z M, and e = -' 2 e, M i=1 1 k M.i=1 1k then a = -aT1~_4+ Z . (28) Re R k If ei = (eil,e12,...,eik), i = 1,2,...,N, are simplex vertices, then M 2 e. = 0 for all k . . 
1k 1=l If ei = (eil’e12"°"eiM)’ for all 1, are vertices, then ek # 0. Substituting (28) into (27), M _ _1_ '1‘ — T - 2 J(ak’ako) ' M 121 ak‘iak + [(Mi M) 3k ' (ek + e119] } ° Setting aJ/aak = O, M l_ T T T T—“‘T — M 121 [%akéi + ZakMiMi + ZakM M - 4a MWM - 2(2k + eik)(Mi - M) Li! H c: or u M 3 z 3 I zl 3% 39 and Then a: = vk[§ +13]'1 . (29) Substituting (29) into (28), aka = -vk[¢ + T]'1fi + E# . (30) Equations (29) and (30) indicate the relation between the optimal solution based on the least-mean-square-error criterion and the sample covariance matrices and sample mean vectors. 3.4 Relation Between Linear Inequalities and the Mean-Square- Error Criterion In the two-class problem, a correct decision is made if yCTX(i) > o 1 1,2 where y = 1 if 1 = 1, and y -1 if 1 = 2. Let 2 = yCTX(i). Ideally, CT is chosen to maximize the probability that Z > 0. On the other hand, the mean-square-error criterion implies that CT is chosen to map X(i) as close as possible into points 1.1, which minimizes the mean-square-error |cTX(1) - ylz , or (2-1)2 . In the M-class problem, the corresponding linear inequality is: 40 yiAX(i) > 0 for all i where A is a linear transformation which maps X(i) into the vertices {e1,e2,...,ek}, k s M. Let W = YiAX(i). Ideally, the matrix A is chosen to maximize the probability that W > O. Chaplin and Levadi (Cl) have shown that the result of maximizing the probability that W > 0 is identical to that obtained by mini minimizing 2 ‘W - e1| . It can be shown that the result of maximizing the probability that Z > 0 is identical to that obtained by minimizing 2 (Z-l) . T Consider Z = yC X(i) as a random variable with unknown 0: =E(z - {)2, where 2 denotes the mean. By the Tchebycheff inequality, (D2) density function and finite variance _ _ '0 2 prob(‘Z -Z‘ 2 Z) _<. (12") . Z Since prob(‘Z - 2] 2 Z) > prob(Z < O), minimizing the ratio O§IZ£ minimizes prob(Z < O), or maximizes prob(Z > O). 52 —2 T -2 In order to minimize prob(Z < O), the term Z Z /Z must be maximized. N N H _ cTX(i)ng)c 9 MC) . 2 (chmy) may CT) 41 Setting aH(C)/5C = O, *-——-—- T T X(i)XT(i)C qugnx (1)0 X(i)y = o C X(i)y or . T . -l "‘T- C = (X(1)X (1)) K X(1)y (31) where T T K = C Xéi)x (i)C , a real number. C X(i)y Now, minimize (z - 1)2 = CTX(i)XT(i)C - ZCTX(i)y + 1 . Letting a(Z - 1)2/aC = 0, 2X(i)XT(i)C -2m = 0 or c = (xmxTun'lm . (32) If K = l in (31), then (31) is identical to (32). For the M-class problem, ‘W - el‘2 = \AX(i) - ei‘z = XT(i)AtAX(i) - 2e?AX(i) + 1 a‘W ’ 31‘ O i 1. 8A , mp 1es X(i)XT(i)AT - X(i)e§ = 0 or -————-———— -1 AT = (X(i)XT(i)) X(i)ef . (33) 42 By analogy to the two-class problem, consider the random variable 9 = ‘W‘ with unknown density function and finite variance 02 9 O 02 T 2 .9. = 9.1. .. 1 == lfl— - 1 -Q -2 _.2 ' 9 9 \WI Equation (33) can be obtained by minimizing ‘W‘z / \W‘Z . Letting 2 a_. W 2 = 0 , 3" m a— \w\2 -KL m2 = o (34) 3A 5" where 2 _ x = M /|w\2. Since I‘VE ‘ ( ‘2 M K d+1 2 w =Y.AXi) =1: 22y.ax(i) 1 m=l(j=1 k=1 “‘1 jkk ) W 2 M K d+l a-L-L—= 22 z 2x(i)x(i)ayy aAqr m=l j=l k=l r k jk mj mq which is the (q,r)th component of the matrix BW‘Z T TT T T = 2X(i)X (i)A Y.Y, = 2X(i)X (i)A . (35) 3A 1 1 Similarly, _2 2 M ’K (1+1 2 |w| = \YiAX(i)[ = 2 ( )3 t. ymjajkxkfi) m=l j=1 k=1 43 a 2 M K (H'l = 2 i 13.1. “£1 ”rm (j; 1.51 ymjajkxk( ’) qu qr which is the (q,r)th component of the matrix amz 5A = 2 X(ifiyi . (36) Substituting (35) and (36) into (34) produces 2 X(i)XT(i)AT - K 2X(1)GTYi = 0 or -1 AT = (X(1)XT(1)) X(i)KWTYi . 
Choosing KWT = e{, then T T '1 T A = (xmx (1)) X(i)ei which is identical to (33). CHAPTER IV MUDTICATEGORY PATTERN RECOGNITION USING STOCHASTIC APPROXIMATION TECHNIQUES In the previous chapters, a class of deterministic pattern classification algorithms was discussed. The pattern recognition problem was solved only with reSpect to the N training patterns given. No questions related to the generalization problem could be answered. The stochastic, or parametric, approach to pattern recognition (B4, K3, P4, W6, Y1) views the N training patterns as samples from the populations corresponding to the pattern classes. The pattern classes are described by conditional probabilities P(wi!x), the probability that pattern x belongs to pattern class mi. In this chapter the function fi(x) = a§¢(x) will be used as an approximation to P(wilx). The stochastic problem is then one of selecting a suitable criterion function involving all the available information so that the extremum of the criteria function correSponds to a reasonable approximation to P(wi|x), i = 1,2,...,N. In this chapter, the generalized inverse idea is invoked to form the criterion function. The mathematical formulation of the pattern recognition problem is discussed in Sec. 4.1. In Sec. 4.2, two stochastic algorithms which are similar to the deterministic algorithm of Wee (W3) are proposed. Both schemes use information from all training patterns at each stage of the algorithm. In Sec. 4.3, a stochastic algorithm with a special updating property is proposed. At each iteration, only the information from the 44 45 particular pattern presented is utilized. As the number of training patterns increases without bound, the algorithm of Sec. 4.2 is equivalent to the algorithm of Sec. 4.3. In Sec. 4.4, the problem of an occasionally mislabeled training pattern is investigated. The mean square approximation error is defined and its upper bound is computed. A computer simulation of the algorithms is presented in Sec. 4.5. 4.1 Mathematical Formulation Let there be M pattern classes w1,w2,...,mM, any one of which can be active to produce a d+l dimensional (augmented) pattern vector X as shown in Figure 4. The prior probabilities q1,q2,...,qM and the probability densities p1(-),p2(-),...,pM(-) are unknown. However, it is possible to observe a sequence {(X1,yi), i = 1,2,...,N} of training patterns. The correct classification of pattern X is denoted by yi E (1,2,...,M). The i patterns {Xi} are chosen independently under probability density pk(-) and prior probability qk when y1 = k. The controller has two operating modes described as follows; 1) Training mode: the switch is governed by prior probabilities q1,q2,...,qM and the exact location of the switch is observable 2) Decision mode: Same as the training mode except that the location of the switch is unknown. We assume that P(wi|X) is approximated by 3E¢(X)\fi = 1,2,...,N, where J(X) = (cpl(X),cp2(X),...,cp&+1(X)), and {winnfg is a set T of known and bounded functions of X; ai = (ail’ai2"°°’ai,d+1)’ {(aij)‘i = 1,2,...,N, j = 1,2,...,d+l} is a set of constants to be and determined. 46 The training patterns are compiled in a matrix. T . . . m (X1) where x1,X2,.b.d.,XN 1s a set of N training Q A m?(x ) samples; N = Z N where N. denotes the N -' 2 i=1 i 1 I number of training patterns from pattern class T 1:9 or. - The coefficients to be determined are compiled in the A matrix de- fined in Sec. 2.1. The classifications of the training patterns are summarized in the matrix Z . N ZT(X1) 21(X1) 22(X1) .... 
zM(X1) 2N A zT(x2) = 21(x2) 22(x2) zM(X2) 1 if Xi E wj ( yi = j ) where zj(Xi) 0 if Xi {inj (yi # i) In general, Ez ‘X[21(X)] = 1-P[zj (X) = 1\x] + O-P[zj(X) = o|x1 J' Pb: = fix] = P[mj\X] . The matrix Pk is the following matrix of conditional expectations. —P_('w1\x1) P(w2|x1) P(u)M‘x1) F(wl‘xz) P(w2|x2) .... P(wM\x2) A . . . PN- : . : P(w\x) P(w|x).. P(u)‘x) ___ 1 N 2 N M NhJ The random matrix Z can be considered as a noisy "measure- N ment" of PN since ZN = Pk +VN where VN is the "measurement 47 noise". This particular formulation is very useful in proving Theorem 4.2.2 later. 4.2 Generalized Inverse Adaptation The stochastic pattern classification problem can be viewed as the problem of constructing an approximation to a function which can only be measured in the presence of noise. In other words, we are given a random pair (QN’ZN) which contains all the available in- formation, and requested to determine the unknown parameters in A so that the unknown functions {F(wilX)}?=1 are approximated by {a§¢(x)}?=1 for all i and all patterns X. The criterion function prOposed here involves the available random pair [QN’ZN] and is similar to that used with the deter- ministic generalized inverse approach of Wee (W3). 1 2 1 JN(A) = h— HZN ' QNATH AI; TracedZN - QNAT]T[ZN - QNATD. The matrix AI = 28-; which minimizes JN(A) can be written as T T '1 T By the gradient descent procedure, this can be computed iteratively as T (SA-1) AT - p(n)TN[§NAT(n) - 2N] where n denotes the iteration number, n = 1,2,3,... , AT(1) is an arbitrary m X M matrix (e.g. A(1) = BI where B > O is an small positive number) and p(n) is the weighting factors required for stochastic approximation algorithms. The following conditions 48 are necessary for convergence (BB, Gl). Q on 2 Lim p(n) = O, 2 p(n) = w, 2 p (n) < m . new n=l n-l The second order gradient scheme defined by T a 2N, (A) -1 MN (A) A (n+1>= AT (n) - p < 2 —) (ET—”Mm a A leads to the stochastic algorithm defined below. aJNATHTJ We now show that the criterion function J(A) is a suitable criterion since it is equivalent to a mean-squared criteria function M J(A) = z: qiEiLHZTX) - cpT(X>ATHZ] i=1 M M T quiE 1Ez|1TkE 1‘71"" ' “’T 003192 1 i: M M = z 1: :11 E [Pmkm - 2PcpT 0081, + WT 0031821 1:1 k=1 = EM :1 E. [P( le) - P20» \X)] + E [Pan IX) - T008121 i= -1 k=1q w k qi i k T k = J0 +~J(A) M M 2 where J z 2 q. E. 1[F(u) ‘X) - P (m ‘X)] is not a function of the O= i=1 k—l k k 50 unknown parameters A. A M M T 2 = — a J(A) .2: z qiEi[P(wk‘X) (p (X) k] . 1=l k=1 Thus,minimizing J(A) corresponds to minimizing 3(A) which is the mean-square-error approximation criterion. The solution for the unknown coefficients can be written explicitly as follows. The matrix A? = A: which minimizes J(A) is, T M T -1 M A* = {.31‘11Eiwo‘m (X))} {iilq1E1[‘P(X)P(w1‘X)]’ 1: dra:z M 1Q iEiEqXX) P(leX) ] , - - - , iglq iEihpoi) P(wM\X)]} - 1 The relation hijél,= 0 implies 68k M T zlqiEito - 2P + 2((p (X)ak)cp(X)] i: M T = 2 qiEi[-P>1,.... 2q q1 Elzia- P>]} i= i=1 M M = { ZlqiEiiw(X)(F(w11X)-P(w11X>)],...,q 2q .E [a(X)(P(mMIX)- P(w M\X))]} i= i=1 T T Combining these three parts, lifllAN = A* with probability one. Q.E.D. N—m 4.3 A Stochastic Approximation Algorithm with an Updating Property The method of stochastic approximation prOposed by Robbins and Monroe (R2, W5) is designed to find the zero of a regression function R(x) = EV(x), where V(x) is a random function. Sometimes, the R-M method is employed to locate the minimum of R(x) by finding the zero of a regression function E[V'(x)]. 
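A minimal numerical sketch of the scalar Robbins-Monro iteration just described is given below; it is illustrative only and is not part of the thesis. The regression function R(x) = E[V(x)] = x - 3 and the gain constant are assumptions chosen for the example, with the gain sequence rho_n = c/n satisfying the usual conditions (sum of rho_n diverges, sum of rho_n^2 is finite):

import numpy as np

def robbins_monro(noisy_v, x0=0.0, n_steps=2000, c=1.0, rng=None):
    """Scalar Robbins-Monro iteration x_{n+1} = x_n - rho_n * V(x_n),
    with the gain sequence rho_n = c / n."""
    rng = rng or np.random.default_rng(0)
    x = x0
    for n in range(1, n_steps + 1):
        x = x - (c / n) * noisy_v(x, rng)
    return x

# Toy regression function R(x) = E[V(x)] = x - 3, observed only through noise.
# The iterates converge (with probability one) to the root x* = 3.
noisy_v = lambda x, rng: (x - 3.0) + rng.normal(scale=0.5)
print(robbins_monro(noisy_v))   # approximately 3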
52 To study the multidimensional case, let V be a stationary random vector which is observed at discrete times. The probability distribution of V is unknown. Let H be a matrix of parameters, and let f(V,H) be a real valued function of V and H. Our prob- lem is to find a matrix H = H which minimize a regression function * R(H) QEv{f(V,H)} . Since the probability distribution of V is unknown, R(H) cannot be evaluated. However, in practice, if we are given a sequence Vn’ n = 1,2,..., of independent observations of the random vector V, the matrix H* can be obtained by iteratively finding the zero of a regression function V R(H) = EV{VHf(V,H)}, where v H H denotes the gradient operator applied with respect to H. The following theorem concerning the R4M process for finding A* has been proved by Gladyshev (GI) and was restated for the multi- dimensional case by Yau (Y1). Theorem 4.3.1 Let VH f(V,H) be a random variable with EV{VHf(V,H)}, and let H VHR(H) be the (unique) solution of * VHR(H) - 0. Choose H1 arbitrarily and define (GSA-l) Hn+l = Hn - anHf(Vn,Hn), n = 1,2,... Then P[lim Hn ' H*] = 1 n—m \2=o lim Euun - H*\ n-coo provided the following conditions are satisfied. (1) inf 1 <(H - H*), VHR(H)> > O for each 6 > 0 where €HH‘H*“<;' 53 “H“ A ()% the norm of a matrix H; A trace {HEHZ} the inner product of matrices H1 and H2; (2) There exists a positive number 6 such that for all H EHVHW’MH s mum ; (3) va(Vn,Hn) is a random variable whose conditional distribution, given H1,H2,...,Hn, is the same as the distribution of VHf(Vn,Hn). (4) {on} is a sequence of positive numbers satisfying the conditions “ 2 pn=0 and anA \\ 1 . 1: Since the random variable (X,Z) has the same significance T as the random variable V and the matrix A is equivalent to H, the stochastic algorithm (GSA-1) can be rewritten as follows. _ T T _ (GSA-2) A — An anAf(Xn,Zn,An), n — 1,2,3,... 54 We now consider the following criterion function. T J(A) = E[H2(X) - qF(x>ATH2] = E[f(x,z,AT)] where f(X,Z,AT) = Hiax) - ¢F(X)ATH2 . Taking the gradient, vAf(x,z,AT) = 2¢£¢¢AT - iQX)] . By invoking (GSA-2), we obtain the following stochastic approxima- tion algorithm, which has an updating property. (SP-l) A:+1 = A: + pntp(Xn)[ZT(X.n) - (pT(Xn)A:] . The sequence A:, n = 1,2,..., converges with probability one to the value A: which minimize J(A) and, hence, 3(A). To show the convergence of the algorithm SP-l, we need to prove the following lemmas. M d+l Lemma 4.3.2 If P{iEIqipi(X) > 0} = 1 and {¢i(.)}i=l are linearly independent and continuous functions, then A: exists. T Proof. J(A) = EEHZ(X) ‘ ¢F(X)ATH2] A EAL)- : o 2 E£2gp(X)cpT(X)A: - 2Qp(X)ZT(X)] = o . T T - . T Thus A* = {E[¢(X)m (X)]} 1E[m(X)2kX)] and the matrix E[qKX)m (X)] has been shown (P3) to be nonsingular; hence A: exists. The following lemma has been proved (Y1). Lemma 4.3.3 Let T be a symmetric operator on Ed+1 with eigen- values *1 3 k2 S...s xd+l° Then xlnATnz s «Mg xdfluATnz 55 for all AT 6 L(Ed+l,EM) where L(Ed+1,EM) is the set of all linear transformations taking Ed+1 into EM and forms a vector space. M d+l Lemma 4.3.4 If P{ 2 qipi(X) > 0} = l and {¢i(X)}i_1 are linearly i=1 ‘ independent and continuous functions on the pattern space 0X, then there exists a constant k such that T - AT VAJ(A)> . kHAT - EH stcpT(x>]AT - E[cpzT1 . 0! ID By invoking Lemma 4.3.3, we have T T 2 T T T T xluA - A*H s where k1 is the smallest eigenvalue of Q, X1 > 0. Therefore, 2 kuAT - I“ s . Q.E.D. Lemma 4.3.5 LEt v = (x,2), vAf(V,AT) = 2(p(X)[q)T(X)AT - ZT(X)]. 
If the matrix M ≜ 4E[‖φ(X)‖² φ(X)φ^T(X)] exists, then there exists a constant d > 0 such that, for all A^T,

E‖∇_A f(V,A^T)‖² ≤ d(1 + ‖A^T‖²).

Proof.

‖∇_A f(V,A^T)‖² = Trace{[2φ(X)φ^T(X)A^T - 2φ(X)z^T(X)]^T [2φ(X)φ^T(X)A^T - 2φ(X)z^T(X)]},

so that, taking expectations,

E‖∇_A f(V,A^T)‖² = <A^T, M A^T> - <δ, A^T> + γ,

where δ = 8E[‖φ(X)‖² φ(X)z^T(X)] and γ = 4E[‖φ(X)‖² ‖z(X)‖²]. Hence

E‖∇_A f(V,A^T)‖² ≤ λ_M ‖A^T‖² + ‖δ‖·‖A^T‖ + γ,

where λ_M is the maximum eigenvalue of the matrix M and is positive. Choosing d = λ_M + ‖δ‖ + γ, we have E‖∇_A f(V,A^T)‖² ≤ d(1 + ‖A^T‖²).  Q.E.D.

Theorem 4.3.2. Let A_*^T be the unique matrix minimizing J(A), and let the sequence of matrices A_n^T, n = 1,2,..., be generated by the proposed algorithm (SP-1). If the following conditions are satisfied:

1) P{Σ_{i=1}^M q_i p_i(X) > 0} = 1;
2) {φ_i(·)}_{i=1}^{d+1} are linearly independent and continuous functions of X;
3) the matrix M = 4E[‖φ(X)‖² φ(X)φ^T(X)] exists;
4) {[X_n, z(X_n)], n = 1,2,3,...} is an arbitrary training sequence of independent observations;
5) {ρ_n} is a sequence of positive numbers satisfying Condition 4 of Theorem 4.3.1;

then P[lim_{n→∞} A_n^T = A_*^T] = 1 and lim_{n→∞} E‖A_n^T - A_*^T‖² = 0.

Proof. Invoke Theorem 4.3.1, utilizing Lemmas 4.3.2 through 4.3.5. Condition 1 of Theorem 4.3.1 is satisfied because of Lemma 4.3.4. Condition 2 of Theorem 4.3.1 is satisfied because of Lemma 4.3.5. Condition 3 of Theorem 4.3.1 is satisfied because of Condition 4 of Theorem 4.3.2. Condition 4 of Theorem 4.3.1 is identical to Condition 5 of Theorem 4.3.2.  Q.E.D.

4.4 Sensitivity Study and Error Upper Bound

In a practical situation, some of the patterns used for training might be mislabeled. The causes of mislabeling are plentiful and include typing errors and measurement errors. In this section, we assume that it is possible to observe a sequence of sample patterns {(X_i, ŷ_i), i = 1,2,...,N}, where ŷ_1, ŷ_2, ..., ŷ_N is a sequence of labeling (or classification) numbers containing some mislabels. Furthermore, we write the probability of mislabeling the origin of pattern X_i as f_i.

Let y_1, y_2, ..., y_N be the correctly labeled sequence corresponding to ŷ_1, ŷ_2, ..., ŷ_N. The probability that y_i = k is P(ω_k) for any i. Then

P[ŷ_i = y_i, y_i = k] = P(ω_k)(1 - f_i)  and  P[ŷ_i ≠ y_i, y_i = k] = P(ω_k) f_i.

We now study the sensitivity problem of the proposed algorithm. The following is a generalization of the results obtained in the previous sections. Define

ẑ(X) = (ẑ_1(X), ẑ_2(X), ..., ẑ_M(X))^T,  where  ẑ_j(X) = 1 if ŷ = j and ẑ_j(X) = 0 if ŷ ≠ j.

In general,

E_{ẑ|X}[ẑ_j(X)] = 1·P[ẑ_j(X) = 1|X] + 0·P[ẑ_j(X) = 0|X] = P[ŷ = y, y = j|X] = P(ω_j|X)(1 - f).

Let Ẑ_N be the N × M matrix whose i-th row is ẑ^T(X_i), and let P̂_N be the N × M matrix whose (i,j) entry is P(ω_j|X_i)(1 - f_i). As in Sec. 4.1, we can write Ẑ_N = P̂_N + Û_N, where Û_N is the "measurement noise".

Consider the criterion function

Ĵ_N(A) = (1/N) ‖Ẑ_N - Φ_N A^T‖² = (1/N) tr{[Ẑ_N - Φ_N A^T]^T [Ẑ_N - Φ_N A^T]}.

The matrix A^T = Â_N^T which minimizes Ĵ_N(A) is

Â_N^T = [Φ_N^T Φ_N]^{-1} Φ_N^T Ẑ_N.

As in the iterative algorithms (SA-1) and (SA-2), Â_N^T can be computed recursively as

(SA-3)  A^T(n+1) = A^T(n) - ρ(n) Φ_N^T [Φ_N A^T(n) - Ẑ_N],
(SA-4)  A^T(n+1) = A^T(n) - ρ(n) [Φ_N^T Φ_N]^{-1} Φ_N^T [Φ_N A^T(n) - Ẑ_N].

Theorem 4.4.1. lim_{N→∞} Ĵ_N(A) = Σ_{i=1}^M q_i E_i[‖ẑ^T(X) - φ^T(X)A^T‖²] ≜ Ĵ(A).

Proof.

lim_{N→∞} (1/N) ‖Ẑ_N - Φ_N A^T‖² = lim_{N→∞} Σ_{k=1}^M (1/N) Σ_{j=1}^N (ẑ_k(X_j) - φ^T(X_j)a_k)²
 = Σ_{i=1}^M Σ_{k=1}^M q_i E_i[(ẑ_k(X) - φ^T(X)a_k)²] = Σ_{i=1}^M q_i E_i[‖ẑ^T(X) - φ^T(X)A^T‖²].  Q.E.D.

Taking f_i = f for all i, the matrix Â_*^T which minimizes Ĵ(A) is

Â_*^T = {Σ_{i=1}^M q_i E_i[φ(X)φ^T(X)]}^{-1} {Σ_{i=1}^M q_i E_i[φ(X)P(ω_1|X)(1-f)], Σ_{i=1}^M q_i E_i[φ(X)P(ω_2|X)(1-f)], ..., Σ_{i=1}^M q_i E_i[φ(X)P(ω_M|X)(1-f)]} = (1 - f) A_*^T.
If f = 0, then Â_*^T = A_*^T, or, equivalently, {ŷ_i}_{i=1}^N is identical to {y_i}_{i=1}^N. For f = 1 we have Â_*^T = 0 (the zero matrix), which indicates that the unknown parameters A^T cannot be estimated because of the lack of available information (i.e., no correctly labeled training samples).

Corresponding to (SP-1), we have the following stochastic approximation algorithm:

(SA-5)  A^T_{n+1} = A^T_n + ρ_n φ(X_n)[ẑ^T(X_n) - φ^T(X_n)A^T_n].

The sequence A^T_n, n = 1,2,..., converges with probability one to the matrix Â_*^T.

We shall now define the mean-square approximation error and compute its upper bound. The results are also applicable to the algorithms proposed in Secs. 4.2 and 4.3. Let

e_k ≜ E_X[P(ω_k|X) - φ^T(X)â_k]²
 = E_X[P²(ω_k|X)] - 2â_k^T E_X[φ(X)P(ω_k|X)] + â_k^T E_X[φ(X)φ^T(X)]â_k
 = E_X[P²(ω_k|X)] - 2â_k^T Σ_{i=1}^M q_i E_i[φ(X)P(ω_k|X)] + â_k^T Σ_{i=1}^M q_i E_i[φ(X)φ^T(X)]â_k.

Since â_k satisfies Σ_i q_i E_i[φ(X)φ^T(X)]â_k = Σ_i q_i E_i[φ(X)P(ω_k|X)](1 - f), this becomes

e_k = E_X[P²(ω_k|X)] - â_k^T E_X[φ(X)φ^T(X)]â_k (1 + f)/(1 - f)
    ≤ P_max - â_k^T E_X[φ(X)φ^T(X)]â_k (1 + f)/(1 - f) ≜ e_kb,

where P_max ≜ max_{1≤k≤M} E_X[P²(ω_k|X)] and e_kb is the upper bound for the mean-square error e_k, k = 1,2,...,M.

Let e_kbN be an estimate of the error upper bound after N training samples. The relation between e_kbN and e_kb is stated in the following theorem.

Theorem 4.4.2. lim_{N→∞} e_kbN = e_kb with probability 1.

Proof. Let

e_kN ≜ E_X[P(ω_k|X) - φ^T(X)â_kN]² = E_X[P²(ω_k|X)] - 2â_kN^T E_X[φ(X)P(ω_k|X)] + â_kN^T E_X[φ(X)φ^T(X)]â_kN

and

e_kbN ≜ P_max - 2â_kN^T E_X[φ(X)P(ω_k|X)] + â_kN^T E_X[φ(X)φ^T(X)]â_kN.

Since

â_kN = [Σ_{i=1}^N φ(X_i)φ^T(X_i)]^{-1} Σ_{i=1}^N φ(X_i)ẑ_k(X_i)

and

lim_{N→∞} â_kN = {E_X[φ(X)φ^T(X)]}^{-1} E_X[φ(X)P(ω_k|X)(1 - f)] = â_k,

it follows that lim_{N→∞} e_kbN = e_kb with probability one.  Q.E.D.

Pitt and Womack (P5) have suggested that the value of e_kbN can be used to test different sets of φ(X) functions and to compare their relative performance based on the available training samples.

4.5 An Example

A simulation of the proposed algorithm (SP-1) was performed on an IBM 1130 computer. The problem was to classify patterns drawn from three categories with equal a priori probabilities and bivariate Gaussian distributions. Each class has the same covariance matrix but a different mean vector for pattern classes ω_1, ω_2 and ω_3. Since the optimal Bayes classifiers for this type of pattern classes are known to be linear, we chose

φ^T(X) = (x_1, x_2, 1).

Figure 5 shows the classification performance of algorithm (SP-1) after each training group. At the completion of each training group, the system was tested by classifying 300 unknown patterns (100 patterns from each class). The misclassification rate and the corresponding Bayes misclassification rate are shown for comparison. The results indicate that the system error rate approaches the Bayes error rate, which is about 0.2, as the number of training samples increases.

Figure 4. A mathematical model for pattern recognition (controller, pattern generators for classes 1 through M, receptors, channel, and the pattern vector X).

Figure 5. Error rate versus number of iterations (error rates of the stochastic classifier and the Bayes classifier on 300 test patterns, for N = 300 to 1500 training patterns, N/3 from each class).
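A minimal re-creation of the kind of simulation described in Sec. 4.5 is sketched below. The class means, common covariance, sample sizes and step-size schedule are illustrative values chosen here, not the thesis's parameters; the update inside the loop, however, is exactly algorithm (SP-1) with φ^T(X) = (x_1, x_2, 1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-class problem: equal a priori probabilities, common covariance,
# different means, so the Bayes-optimal decision boundaries are linear.
means = [np.array([0.0, 2.0]), np.array([2.0, -1.0]), np.array([-2.0, -1.0])]
cov = np.array([[1.0, 0.3], [0.3, 1.0]])

def phi(x):                                   # basis functions phi(X) = (x1, x2, 1)
    return np.array([x[0], x[1], 1.0])

def sample(n_per_class):
    X = np.vstack([rng.multivariate_normal(m, cov, size=n_per_class) for m in means])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y

AT = np.zeros((3, 3))                         # the (d+1) x M coefficient matrix A^T
Xtr, ytr = sample(500)
order = rng.permutation(len(ytr))             # present the patterns in random order
for n, i in enumerate(order, start=1):
    rho = 1.0 / n                             # rho_n -> 0, sum = inf, sum of squares < inf
    p = phi(Xtr[i])
    z = np.zeros(3)
    z[ytr[i]] = 1.0                           # the indicator vector z(X_n)
    AT += rho * np.outer(p, z - p @ AT)       # the (SP-1) update

Xte, yte = sample(100)
pred = np.array([np.argmax(phi(x) @ AT) for x in Xte])
print("test error rate:", np.mean(pred != yte))
```

The estimated error rate on the 300 test patterns settles near the Bayes rate of the chosen configuration as the number of training samples grows, mirroring the behaviour reported for Figure 5.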
CHAPTER V

CLUSTER-SEEKING AND PATTERN RECOGNITION

Chapters II, III and IV assumed that the data had structure, so the pattern recognition problem could be solved by selecting an algorithm derived according to selected criteria. In reality, information about data structure is rarely known beforehand. Some techniques are thus required to sort the patterns into subsets such that the patterns in each subset are as much "alike" as possible, and the patterns of different subsets are as much "unalike" as possible. Having obtained the data structure, a pattern classifier can be designed to achieve a minimum number of misclassifications. Some typical cluster-seeking techniques will be discussed in this chapter and a new cluster-seeking procedure will be proposed.

5.1 Some Cluster-Seeking Techniques

Nearly all cluster-seeking techniques were developed after 1960, since they generally require a great deal of computerized high-speed computation. G.H. Ball (B1) classified cluster-seeking techniques into the following six categories: probabilistic, signal detection, clumping, eigenvalue, minimal mode-seeking, and miscellaneous. Cluster-seeking techniques can also be categorized into the following three groups.

(a) Techniques for reproducing "natural" structure. The pattern recognizer selects a grouping criterion; based on this criterion, the data itself should suggest "natural" clusters. ISODATA by Ball and Hall (B2), and the work of Friedman and Rubin (F1), are typical of techniques for reproducing "natural" structure.

(b) Techniques for data compression. Patterns are mapped from a higher-dimensional space to a lower-dimensional space in such a way that the inherent data structure is preserved. These techniques include eigenvalue-type techniques (S1), a nonlinear mapping proposed by Sammon (S2), and discriminant analysis (W4).

(c) Techniques for minimizing the misclassification rate. The probability of error is used as a criterion for clustering. Incorrect classifications imply that new clusters are needed. The algorithm proposed in Sec. 5.2 belongs to this category.

5.1.1 Clustering Techniques for Reproducing "Natural" Structure

Friedman and Rubin (F1) suggested that a real-valued function be defined and evaluated for all possible partitions of the given patterns. The partition which has the maximal function value is selected to represent the data structure. Suppose a data matrix Y is given in which each row represents a training pattern. A decomposition of the N training patterns into g groups can be achieved by a row partition of Y into g submatrices. Without loss of generality, let the sample mean of the N patterns be zero. The total scatter matrix (W4), in the notation of Sec. 3.1, is T = Σ_{i=1}^N X_i X_i^T. For each partition of the N patterns into g groups, the following matrix identity is satisfied:

T = W + B,  where  W = Σ_{k=1}^g W_k = Σ_{k=1}^g Σ_{ℓ=1}^{n_k} [X_ℓ(k) - C_k][X_ℓ(k) - C_k]^T.

W is called the pooled within-group scatter matrix; n_k is the number of patterns in group k, with Σ_{k=1}^g n_k = N; C_k (k = 1,2,...,g) denotes the mean pattern vector of group k; and X_1(k), ..., X_{n_k}(k) are the patterns from group k. The matrix

B = Σ_{k=1}^g n_k C_k C_k^T

is called the between-group scatter matrix.

Since the total scatter matrix T is fixed, a natural condition for grouping data is to minimize |W|, the determinant of W, or, equivalently, to maximize the ratio |T|/|W|. One criterion, therefore, is to maximize

M = |T|/|W| = |I + W^{-1}B| = Π_{i=1}^p (1 + λ_i),

where the λ_i are the eigenvalues of W^{-1}B.
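The scatter-matrix bookkeeping just defined is summarized in the short sketch below; the function name and the toy data are assumptions introduced for illustration only. It computes W and B, verifies T = W + B, and evaluates the grouping criterion |I + W^{-1}B| = |T|/|W|. The invariance property discussed next can be checked numerically by applying any nonsingular linear transformation to the rows of X and recomputing the criterion.

```python
import numpy as np

def scatter_criterion(X, labels):
    """Return W (pooled within-group scatter), B (between-group scatter) and the
    criterion |I + W^{-1} B| = |T|/|W| for one partition of the rows of X."""
    X = X - X.mean(axis=0)                     # put the sample mean at the origin
    d = X.shape[1]
    W = np.zeros((d, d))
    B = np.zeros((d, d))
    for k in np.unique(labels):
        Xk = X[labels == k]
        Ck = Xk.mean(axis=0)                   # mean pattern vector C_k of group k
        W += (Xk - Ck).T @ (Xk - Ck)           # within-group scatter of group k
        B += len(Xk) * np.outer(Ck, Ck)        # between-group contribution n_k C_k C_k^T
    T = X.T @ X                                # total scatter; T = W + B
    assert np.allclose(T, W + B)
    return W, B, np.linalg.det(np.eye(d) + np.linalg.solve(W, B))

# Two well-separated groups of patterns give a large criterion value.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(5.0, 1.0, (20, 2))])
W, B, crit = scatter_criterion(X, np.repeat([0, 1], 20))
print(crit)
```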
The ratio |T|/|W| is invariant under nonsingular linear transformations of the data and is to be maximized over all possible groupings. The number of possible partitions of N patterns into g groups is enormous; as noted by Friedman and Rubin, the computational problem may be solved in principle but not in practice.

Ball and Hall (B2) proposed a cluster-seeking technique called ISODATA, an adaptive, semi-heuristic algorithm. The user supplies training patterns and program parameters such as the minimum allowable size of each cluster and the number of clusters desired. Several patterns are initially selected as cluster centers and all patterns are clustered around them by assigning each pattern to the nearest center. New cluster centers are then computed by averaging all patterns in each cluster. Several cycles of assigning patterns to the nearest cluster center and computing new cluster centers are performed. Heuristic subroutines are provided to divide a cluster in two when the number of clusters is small, or to combine two clusters to reduce the number of clusters.

The main drawbacks of the Ball-Hall technique are that the resulting cluster configuration is highly dependent upon the program parameters supplied by the user, and that there is a lack of good criteria for determining the adequacy of the clustering. The type of distance measure also implies spherical or ellipsoidal clusters only.

5.1.2 Clustering Techniques for Data Compression

One clustering algorithm, proposed by Sammon (S2), is basically a nonlinear mapping of the original feature space to a low-dimensional space (usually two or three dimensions) such that visual identification of clusters is possible. Structure preservation is achieved by moving the N points in the low-dimensional space in such a way that the interpoint distances approximate the corresponding interpoint distances in the original high-dimensional space.

Let d*_ij be the distance between patterns X_i and X_j, where X_i = (x_{i1}, x_{i2}, ..., x_{id}) denotes a pattern in the original space. Similarly, let d_ij be the distance between patterns Y_i and Y_j, where Y_i = (y_{i1}, y_{i2}, ..., y_{iD}) is the pattern in the D-dimensional space corresponding to X_i, D < d. An error function for determining the degree of structure preservation is defined as

E = (1 / Σ_{i<j} d*_ij) Σ_{i<j} [d*_ij - d_ij]² / d*_ij,

and the points Y_i are adjusted by a steepest-descent procedure so as to minimize E.

5.2 A Procedure for Combining Cluster-Seeking and Multiclass Pattern Classification

A piecewise linear classifier (PWLC) assigns to each pattern class ω_i a bank of subsidiary linear discriminant functions f_{im}(X), m = 1,2,...,L_i. A pattern X is placed in class ω_i whenever its largest subsidiary function belongs to class ω_i, that is, whenever

f_{im}(X) > f_{jn}(X)  for all (j,n) ≠ (i,m).

Therefore, a PWLC is actually a multiclass linear classifier and can be trained by the same methods used to train linear classifiers. In order to obtain a set of weight vectors for a PWLC, information is needed about the structure of the subclasses. Since such information is rarely known beforehand, structure analysis must be applied; that is, a similarity measure must be selected and the patterns from each pattern class must be partitioned into subclasses.

An algorithm will now be proposed for determining a PWLC without prior knowledge of the pattern class distributions. The algorithm combines M-class, linear-classifier training procedures and cluster-seeking techniques under the control of a minimum-error-probability performance criterion. The flow chart of the proposed algorithm is given in Figure 6; the main steps of the algorithm are as follows.

(A) Phase I

Step 1. The iteration number is n. Determine a set of S_n (S_0 = M) discriminant functions, one for each of the S_n nonempty subsets, or clusters, of patterns.

Step 2. Classify each training pattern and compute the misclassification percentage, Pr(n).
Step 3. If Pr(n) ≤ N_e, where N_e is the maximum acceptable misclassification percentage for the particular problem, or if the total number of discriminant functions exceeds the allowable number of clusters, then stop Phase I of the algorithm and begin Phase II. If Pr(n) > N_e and the total number of discriminant functions is small enough, then continue to Step 4.

Step 4. Divide each of the S_n clusters into two clusters, one containing all correctly classified patterns and the other containing all misclassified patterns. Learn a new set of discriminant functions, one for each cluster. Check whether the current set of discriminant functions is identical to the previous set of discriminant functions; if so, split those clusters which were misclassified. Let n = n+1 and return to Step 1.

(B) Phase II

Step 5. Find the pair of clusters, among the clusters belonging to each pattern class, which are closest together, using an appropriate distance measure. Lump them and compute a new set of discriminant functions. Compute Pr(n) and repeat the procedure until the number of clusters equals the number of categories plus one.

Step 6. Select the best partition. If the partition is acceptable, stop; otherwise, continue to Step 2 of Phase I.

Step 7. Terminate the algorithm when the current best partition is inferior to the previous best partition (Step 6).

Figure 6. Flow chart of the proposed clustering algorithm.

The convergence of the algorithm is intuitively clear, since in Phase I there is a finite number of training patterns; error-free classification can always be achieved if each pattern is considered to be a cluster. In reality, it can reasonably be assumed that the training patterns from each class are somewhat clustered together; therefore, the number of clusters should be much smaller than the total number of training patterns. Only a finite number of clusters is generated in Phase I. The lumping procedure terminates when the number of clusters decreases to M + 1. The interaction between Phase I and Phase II provides a search for optimal results.

If a minimum distance classifier (MDC) is employed for classifying training patterns and N_e = 0, the proposed algorithm is actually another version of the condensed nearest neighbor rule (H4). The resulting set of cluster points correctly classifies all training patterns. Cover (C2) has shown that the error probability of classifying new patterns is bounded above by twice the Bayes probability of error. If other M-class algorithms are selected for classifying patterns, the proposed algorithm is basically a training procedure for a PWLC, so that a bank of subsidiary functions for each class is available at the completion of the algorithm.
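A compressed sketch of Phase I's split step, driven by a minimum distance classifier, is given below. The function names, the fixed iteration cap and the simplified stopping details are assumptions introduced here; this is not the thesis's FORTRAN program, only an illustration of Steps 1 through 4.

```python
import numpy as np

def mdc_predict(X, centers, owner):
    """Minimum distance classifier: each pattern takes the class (owner) of its
    nearest cluster center."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return owner[np.argmin(d, axis=1)]

def phase1_split(X, y, max_clusters=30, max_err=0.0, max_iter=50):
    """Phase I sketch: start with one cluster per class, then repeatedly split each
    cluster into its correctly classified and misclassified parts until Pr(n) is
    acceptable or the allowable number of clusters is reached."""
    classes = np.unique(y)
    members = [np.where(y == c)[0] for c in classes]   # index sets, one cluster per class
    owner = list(classes)                              # pattern class of each cluster
    err = 1.0
    for _ in range(max_iter):
        centers = np.array([X[m].mean(axis=0) for m in members])
        pred = mdc_predict(X, centers, np.array(owner))
        err = float(np.mean(pred != y))                # the misclassification rate Pr(n)
        if err <= max_err or len(members) >= max_clusters:
            break
        new_members, new_owner = [], []
        for m, c in zip(members, owner):               # Step 4: split each cluster
            for part in (m[pred[m] == y[m]], m[pred[m] != y[m]]):
                if len(part):
                    new_members.append(part)
                    new_owner.append(c)
        members, owner = new_members, new_owner
    return members, owner, err
```

With max_err = 0 and a minimum distance classifier this reduces to the condensed-nearest-neighbor behaviour noted above. Phase II, not shown, would repeatedly lump the closest pair of clusters belonging to the same class, recomputing Pr(n) and retaining the best partition encountered.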
5.3 A Procedure for Unsupervised Structure Analysis

In supervised learning, a set of training patterns is provided in which each pattern has a label indicating the pattern class to which it belongs. Since no such labels exist in the unsupervised learning problem, the training patterns must be studied for natural groupings. Each such grouping, or cluster, is viewed as defining a pattern class and is assigned a discriminant function. A partition of the N training patterns into M groups, or clusters, is desired for which the misclassification probability and the number of clusters, or discriminant functions, are both minimized. Such a partition converts unclassified patterns into classified training patterns, since all patterns in each cluster are tentatively given the same pattern class label. The algorithms previously defined will then produce discriminant functions for any partition.

The computational problem of searching all possible partitions of N patterns into M groups to find the "best" partition has been solved in principle but not in practice. A suboptimal procedure for searching through the partitions follows. Any prior knowledge about the data structure is inserted into the initial partition. If prior knowledge is not available, the center of gravity of all training patterns can be computed; this center and the (M-1) patterns farthest away from it can be selected as initial cluster centers. The initial partition is then obtained by assigning each pattern to the closest cluster center.

Rubin (R1) proposed an algorithm for finding a local extremum of a criterion function that searches only some of the possible groupings. Starting with the best partition achieved, a single pattern is moved into every group other than its own. If no move increases the criterion, the group label of the pattern is not changed; otherwise, the pattern is transferred to the group which maximizes the criterion function. This operation is repeated with each pattern in the group. It is then repeated for all patterns in the group with the closest cluster center. After several passes, a point is reached at which changing the label on any single pattern degrades the criterion. This grouping provides a local maximum.

Once the patterns have been tentatively converted into training patterns by labeling each with a group identifier, the algorithm of Sec. 5.2 will produce the error rate (based on the assumed partition) as well as the discriminant functions.

5.4 Computer Simulations

This section contains results obtained by using the CDC 3600 computer system at Michigan State University to process both artificial data and practical data. A FORTRAN listing is provided in (H5).

The artificial pattern recognition problem is shown in Figure 7. The minimum distance classifier (MDC) is selected for classifying training patterns. The data consist of twelve patterns, four for each of three pattern classes (Fig. 7a); the patterns are two-dimensional to produce meaningful plots. The initial number of misclassifications is eight (Fig. 7b); that is, it is assumed that each pattern class forms one cluster. The number of misclassified patterns at the fourth iteration, before splitting, is two (Fig. 7c). At the end of Phase I there are eight clusters and the number of misclassifications is zero (Fig. 7d). After the completion of Phase II, the best partition of the data is into six clusters and the corresponding number of misclassified patterns is zero (Fig. 7e).

Figure 7. Example of the algorithm of Figure 6: (a) feature space showing four patterns from each of three classes; (b) initial decision regions, eight patterns misclassified; (c) decision regions before splitting, two misclassifications; (d) decision regions before lumping, no misclassifications; (e) best partition, no misclassifications.
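For reference, the initial-partition rule of Sec. 5.3 (the center of gravity plus the M-1 patterns farthest from it) can be sketched in a few lines; the function name below is hypothetical and the sketch is only an illustration of that rule.

```python
import numpy as np

def initial_partition(X, M):
    """Select the center of gravity and the (M-1) patterns farthest from it as
    initial cluster centers, then assign every pattern to its closest center."""
    center = X.mean(axis=0)                            # center of gravity
    far = np.argsort(np.linalg.norm(X - center, axis=1))[len(X) - (M - 1):]
    centers = np.vstack([center, X[far]])
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(d, axis=1), centers               # tentative group labels, centers
```

The tentative labels produced this way can then be refined by Rubin's single-pattern moves and handed to the algorithm of Sec. 5.2.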
The practical data is the Iris data, which was published by Fisher (F2) and reproduced by Kendall (K4). There are three species of Iris: setosa, versicolor and virginica. There are fifty flowers of each species, and four measurements (sepal length, sepal width, petal length and petal width) are taken on each flower. We labeled them 1-50, 51-100, and 101-150 respectively. Fisher (F2) constructed a linear function of the four variables to classify them. He found that setosas could be separated from the other two species perfectly. However, versicolors and virginicas overlapped. Kendall (K4) applied a clustering technique based on a pairwise distance function utilizing only the rank order of the measurements, and he was able to separate setosas from the others; however, for versicolors and virginicas, he classified 87 flowers and left 13 flowers doubtful. Friedman and Rubin (F1) applied the min trace W criterion to these data. They found that, when the data were partitioned into three groups, setosas appeared as a separate group, with the other two groups being predominantly versicolors and virginicas except for about ten flowers which were misclassified.

A computer simulation of the system proposed here, using minimum distance classifiers, has been completed. The initial number of misclassifications was eleven (assuming that each pattern class formed one cluster). With 30 allowable clusters and N_e = 0, the number of misclassified flowers at the end of Phase I was twenty-one, and there were thirty-three clusters. Setosas appeared as one cluster. Versicolors contained seventeen clusters with fourteen flowers misclassified. Virginicas contained fifteen clusters with seven flowers misclassified. After the completion of Phase II (the lumping operation), the best partition of the data was into six clusters and the corresponding number of misclassified flowers was six. The six clusters were: one for setosas, two for versicolors and three for virginicas.

CHAPTER VI

CONCLUSIONS

This chapter discusses the general results of this thesis and possible extensions of this work.

6.1 Thesis Results

There are two sets of results in this thesis.

(1) The development of a class of nonparametric, multiclass pattern recognition algorithms.

The idea of translating a problem into an optimization problem, in which search techniques are employed to find the minimum of a linear functional, is used in many branches of applied mathematics. Applied to the abstraction problem, this idea provides a great deal of freedom when creating new algorithms. In Chapters II and III, although no assumptions were made about the data structure and pattern classes, the concept of linear inequalities and the mean-square-error criterion were employed to formulate linear functionals. The general multiclass learning algorithms (GA.2) and (GA.3) derived in Chapter II led to particular algorithms such as the fixed-increment method (PA.1), the relaxation method (PA.2) and the minimum-square-error method (PA.3). These algorithms are generalized versions of algorithms in the literature.
Chapter III showed that the mean-square-error criterion is equivalent to the generalized inverse approach. A multiclass algorithm, (GA.4), utilizing the generalized inverse approach is proposed and is unique to this thesis. The convergence proof is given in Appendix A, and the utility of the algorithm was demonstrated by a digital computer simulation.

Since no assumptions were made about data structure in Chapters II and III, the pattern recognition problem was solved only for the N given training patterns. In order to investigate the generalization problem, we must view the N fixed training patterns as a sample from a population, as was done in Chapter IV. We assumed the existence of probability densities of unknown functional form, so the optimal discriminant functions involved conditional probabilities. Based on the available information, the method of stochastic approximation was employed to estimate the unknown discriminant functions, and two stochastic algorithms were derived from the generalized inverse approach. Both schemes used information from all training patterns to update the discriminant functions at each iteration. The asymptotic properties of both algorithms are studied for the first time in Theorems 4.2.1 and 4.2.2.

In Sec. 4.3, a new stochastic algorithm with an updating property was proposed. Since only the information from the particular pattern presented is utilized at each iteration, the problem of storing large numbers of patterns is eliminated. The convergence of the proposed algorithm was proved by invoking Gladyshev's theorem (G1). In fact, the algorithms of Sec. 4.2 are equivalent to the algorithm of Sec. 4.3 as the number of training samples approaches infinity.

In practical situations, some of the training patterns might be mislabeled. This sensitivity problem is studied for the first time in Sec. 4.4. The results show that the estimated coefficients of the discriminant functions under the mislabeled condition are equal to the probability of correct labeling times the estimated coefficients under the ideal situation. The mean-square approximation error was defined and its upper bound was derived. An estimate of the error upper bound after N training samples can be used as a guide for selecting the φ(X) functions.

(2) The introduction of a procedure that combines cluster-seeking and multiclass pattern recognition into a workable pattern recognition system.

In some cases, the pattern class has a multimodal structure; both the number and the locations of the modes are unknown. In order to minimize the misclassification error, a cluster-seeking technique should be applied to learn the data structure and one discriminant function should be assigned to each cluster (or subclass). A new algorithm for solving the cluster-seeking and multiclass pattern classification problems in a step-wise fashion was proposed in Chapter V. The procedure proposed in Sec. 5.2 exploits structural information to construct discriminant functions. The success of the discriminant functions in classifying training patterns then provides clues about the data structure. The procedure was implemented on the CDC computers at Michigan State University and tested with the problems reported in Sec. 5.4.

The focal point of the thesis is the presentation of a computationally feasible solution to a very difficult, but very
The difficulty is caused by the lack of prior knowledge about the data structure, or one's un- willingness to make assumptions about pattern class distributions or linear separability among pattern classes. This problem is extremely acute in situations involving large numbers of pattern and features. The abstraction and clustering problems are attacked simultaneously in Chapter 5 of this thesis. The algorithms proposed derive a piece-wise linear pattern classifier in an iterative manner. Information about structure is incorporated into the classifier. In turn, the operation of the classifier provides structural in- formation. 6.2 Possible Extensions To simulate the proposed pattern recognition system on a digital computer, the minimum distance classifier was selected for classifying training patterns. In fact, all multiclass algorithms listed in this thesis could be used for classifying patterns. This fact generates a class of recognition procedures and each procedure needs a full-scale test and optimization of program parameters. For a given pattern recognition problem, a complete explora- tion of all possible procedures and selection of the "best" is possible in principle but not in practice. However, if one is willing to make assumptions about the pattern classes, he might derive some theoretical results or heuristic rules which optimize the searching among possible procedures. Another possible extension is to develop an algorithm for unsupervised structure analysis. Since the true classifications of 83 the patterns are unknown in an unsupervised learning problem, the unclassified patterns must somehow be converted into classified training patterns. Again, if one is willing to impose some probabilistic structure (such as the mixture probability densities), he has converted the problem of clustering into the problem of unsupervised estimation (P6, P7). REFERENCES REFERENCES G.H. Ball, "Data analysis in the social science; what about the details?", Fall Joint Computer Conference (1965). G.H. Ball and D.J. Hall, "ISODATA, a novel method of data analysis and pattern recognition," Technical Report, Stanford Research Institute, Menlo Park, California (1965). C.C. Blaydon, "Recursive algorithms for pattern classifica- tion," Technical Report No. 520, Division of Engineering and Applied Physics, Harvard University, Cambridge, Massachusetts, March, 1967. Colin Blaydon and Yu-chi Ho, "On the abstraction problem in pattern classification," Proc. 1966 NEC, Vol. 22, 857 -8 62 e W.G. Chaplin and V.S. Levadi, "A generalization of the linear threshold decision algorithm to multiple classes," Proc. Second Symposium on Computer and Information Sciences at Batelle Memorial Institute, August 22-24 (1966). T.M. Cover and P.E. Hart, "Nearest neighbor pattern classification," IEEE Trans. on Information Theory, Vol. IT-l3, No. 1, January 1967. I.P. Devyaterikov, A.I. Propoi, and Y.Z. Tsypkin, "Iterative learning algorithms for pattern recognition," Automation and Remote Control (English Translation), 108-117 (1967). R.C. Dubes, The Theory of Applied Probability, (Prentice- Hall, Inc., New'York, 1968). H.P. Friedman and J. Rubin, "On some invariant criteria for grouping data," J. Auer. Statist. Assoc. 62, 320, 1159-1178 (1967). R.A. Fisher, "Multiple measurements in taxonomic problems," Annals of Eugenics, Vol. VII, part II, 179-188, 1936. E.E. Gladyshev, "On stochastic approximation," Theory of Probability and Its Applications, Vol. 10, 275-278, 1965. 84 85 Y.C. Ho and A.K. 
(H1) Y.C. Ho and A.K. Agrawala, "On pattern classification algorithms: introduction and survey," Proc. IEEE 56, 12, 2101-2114 (1968).

(H2) Y.C. Ho and R.L. Kashyap, "An algorithm for linear inequalities and its applications," IEEE Trans. Electronic Computers EC-14, 5 (1965).

(H3) Y.C. Ho and R.L. Kashyap, "A class of iterative procedures for linear inequalities," J. SIAM Control 4, 1 (1966).

(H4) P.E. Hart, "The condensed nearest neighbor rule," IEEE Trans. on Information Theory, Vol. IT-14, No. 3, May 1968.

(H5) A.Y. Hung and R.C. Dubes, "An introduction to multiclass pattern recognition in unstructured situations," Interim Report No. 12, Contract No. AFOSR-1023-67B, Division of Engineering Research, Michigan State University, 1970.

(K1) J.S. Koford and G.F. Groner, "The use of an adaptive threshold element to design a linear optimal pattern classifier," IEEE Trans. Information Theory IT-12, 1 (1966).

(K2) R.E. Kalman and J.E. Bertram, "Control system analysis and design via the 'second method' of Lyapunov, II: Discrete-time systems," Trans. ASME J. Basic Engr. 83, D, 2, 371-400 (1960).

(K3) R.L. Kashyap and C.C. Blaydon, "Recovery of functions from noisy measurements taken at randomly selected points and its application to pattern classification," Proc. IEEE, Vol. 54, 1127-1129, August 1966.

(K4) M.G. Kendall, "Discrimination and classification," pp. 165-185 in Multivariate Analysis, edited by P.R. Krishnaiah, Academic Press, New York, 1966.

(N1) N.J. Nilsson, Learning Machines, McGraw-Hill, New York, 1965.

(N2) G. Nagy, "State of the art in pattern recognition," Proc. IEEE 56, 5 (1968).

(P1) J.D. Patterson and B.F. Womack, "An adaptive pattern classification system," IEEE Trans. Systems Science and Cybernetics SSC-2, 1 (1966).

(P2) J.D. Patterson, T.J. Wagner and B.F. Womack, "A mean-square performance criterion for adaptive pattern classification systems," IEEE Trans. Automatic Control AC-12, 2, 195-197 (1967).

(P3) J.D. Patterson, T.J. Wagner and B.F. Womack, "A performance criterion for adaptive pattern classification systems," Proc. 1966 Joint Automatic Control Conf. (Seattle, Wash.), 38-46.

(P4) J.M. Pitt and B.F. Womack, "A sequentialization of the Patterson classifier," Proc. IEEE (letters), Vol. 54, 1987-1988, December 1966.

(P5) J.M. Pitt and B.F. Womack, "Additional features of an adaptive, multicategory pattern classification system," IEEE Trans. on Systems Science and Cybernetics, Vol. SSC-5, No. 5, July 1969.

(P6) E.A. Patrick and J.P. Costello, "On unsupervised estimation algorithms," IEEE Trans. on Information Theory, Vol. IT-16, No. 5, September 1970.

(P7) E.A. Patrick, "On a class of unsupervised estimation problems," IEEE Trans. on Information Theory, Vol. IT-14, No. 3, May 1968.

(R1) J. Rubin, "Optimal classification into groups: an approach for solving the taxonomy problem," J. Theoret. Biol. 15, 103-144 (1967).

(R2) H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., Vol. 22, 400-407, 1951.

(R3) A. Ralston, A First Course in Numerical Analysis, McGraw-Hill, New York, 1965.

(S1) G.S. Sebestyen, Decision-Making Processes in Pattern Recognition, The Macmillan Company, New York, 1962.

(S2) J.W. Sammon, Jr., "A nonlinear mapping for data structure analysis," IEEE Trans. Computers C-18, 401-409 (1969).

(W1) D.V. Widder, Advanced Calculus, 2nd edition, Prentice-Hall, Inc., New York, 1961.

(W2) B. Widrow and M.E. Hoff, "Adaptive switching circuits," 1960 IRE WESCON Conv. Record, Part 4, 96-104 (1960).

(W3) W.G. Wee, "Generalized inverse approach to adaptive multiclass pattern classification," IEEE Trans. Computers C-17, 12 (1968).
(W4) S.S. Wilks, Mathematical Statistics, John Wiley and Sons, Inc., New York, 1962.

(W5) M.T. Wasan, Stochastic Approximation, Cambridge University Press, 1969.

(W6) T.J. Wagner, J.M. Pitt and B.F. Womack, "A comparison between pattern classification approaches," IEEE Trans. Information Theory (Correspondence), Vol. IT-13, 611-613, October 1967.

(Y1) S.S. Yau and J.M. Schumpert, "Design of pattern classifiers with updating property using stochastic approximation techniques," IEEE Trans. on Computers, Vol. C-17, No. 9, September 1968.

APPENDIX A

CONVERGENCE PROOF

The convergence proof of (GA.4) can be divided into two possible situations.

Case 1. If the constraint on E is violated, the algorithm terminates, since δE[n] = 0 for all n.

Case 2. Assume that the constraint on E holds. It must be shown that the algorithm converges. Before proving that ‖D[n]‖ → 0 as n → ∞, two matrix identities will be established.

(a)  Y^T D[n] = Y^T (YA[n] - E[n]) = Y^T (YY#E[n] - E[n]) = (Y^T YY# - Y^T) E[n]
             = (Y^T Y (Y^T Y)^{-1} Y^T - Y^T) E[n] = 0.

(b)  Trace{D[n]^T (YY# - I)^T (YY# - I) D[n]}
      = Trace{D[n]^T [(YY#)^T (YY#) - YY# - (YY#)^T + I] D[n]}
      = Trace{D[n]^T [Y(Y^T Y)^{-1} Y^T Y (Y^T Y)^{-1} Y^T - 2YY# + I] D[n]}
      = Trace{D[n]^T (I - YY#) D[n]}
      = Trace{D[n]^T D[n]} - Trace{D[n]^T YY# D[n]}
      = ‖D[n]‖² - Trace{(YY#E[n] - E[n])^T YY# D[n]}
      = ‖D[n]‖² - Trace{E[n]^T (YY# - I)^T Y (Y^T Y)^{-1} Y^T D[n]}
      = ‖D[n]‖²,  by (a).

Define V(D[n]) = ‖D[n]‖², a positive definite function. Then

ΔV(D[n]) = V(D[n+1]) - V(D[n])
         = V(YA[n+1] - E[n+1]) - V(D[n])
         = V(Y(A[n] + Y#δE[n]) - E[n+1]) - V(D[n])
         = V(YA[n] + YY#ρD[n] - E[n] - ρD[n]) - V(D[n])
         = V(D[n] + ρ(YY# - I)D[n]) - V(D[n])
         = ‖D[n] + ρ(YY# - I)D[n]‖² - ‖D[n]‖²
         = Trace{D[n]^T D[n]} + 2ρ Trace{D[n]^T (YY# - I)D[n]} + ρ² Trace{D[n]^T (YY# - I)^T (YY# - I)D[n]} - ‖D[n]‖²
         = 2ρ Trace{D[n]^T (YY# - I)D[n]} + ρ²‖D[n]‖²,  by (b),
         = 2ρ Trace{D[n]^T YY# D[n]} - 2ρ Trace{D[n]^T D[n]} + ρ²‖D[n]‖²
         = -2ρ‖D[n]‖² + ρ²‖D[n]‖²,  by (a),
         = -‖D[n]‖² (2ρ - ρ²).

For 0 < ρ < 2, 2ρ - ρ² = ρ(2 - ρ) > 0; hence ΔV(D[n]) ≤ 0 for all D[n], with equality only if D[n] = 0. By Lyapunov's stability theorem for discrete systems (K2),

lim_{n→∞} V(D[n]) = lim_{n→∞} ‖D[n]‖² = 0.
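The two identities and the Lyapunov decrease established above are easy to check numerically. The sketch below uses a randomly generated pattern matrix Y and margin matrix E[1] (illustrative values; the positivity constraint on E is ignored). Starting from the least-squares solution A[1] = Y#E[1], it verifies Y^T D[n] = 0 and ΔV(D[n]) = -(2ρ - ρ²)‖D[n]‖² at every step.

```python
import numpy as np

rng = np.random.default_rng(1)

N, m, M, rho = 12, 4, 3, 0.5              # patterns, basis functions, classes, 0 < rho < 2
Y = rng.standard_normal((N, m))           # pattern matrix (full column rank w.p. 1)
E = np.abs(rng.standard_normal((N, M)))   # initial margin matrix E[1]
Y_pinv = np.linalg.pinv(Y)                # the generalized inverse Y# = (Y^T Y)^{-1} Y^T
A = Y_pinv @ E                            # A[1] = Y# E[1]

for n in range(20):
    D = Y @ A - E                         # D[n] = Y A[n] - E[n]
    assert np.allclose(Y.T @ D, 0.0, atol=1e-9)          # identity (a)
    v_old = np.linalg.norm(D) ** 2                        # V(D[n]) = ||D[n]||^2
    dE = rho * D                          # delta E[n] = rho D[n]
    E = E + dE                            # E[n+1] = E[n] + delta E[n]
    A = A + Y_pinv @ dE                   # A[n+1] = A[n] + Y# delta E[n]
    v_new = np.linalg.norm(Y @ A - E) ** 2
    assert np.isclose(v_new - v_old, -(2 * rho - rho**2) * v_old, atol=1e-9)

print("||D[n]|| after 20 steps:", np.linalg.norm(Y @ A - E))   # decays toward zero
```

Each pass shrinks ‖D[n]‖ by the factor |1 - ρ|, which is exactly the geometric decay implied by ΔV(D[n]) = -(2ρ - ρ²)‖D[n]‖².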