NONPARAMETRIC PROCEDURES FOR LEARNING WITH AN IMPERFECT TEACHER

Thesis for the Degree of Ph. D.
MICHIGAN STATE UNIVERSITY
RONALD JOSEPH RICHTER
1972

This is to certify that the thesis entitled NONPARAMETRIC PROCEDURES FOR LEARNING WITH AN IMPERFECT TEACHER presented by Ronald Joseph Richter has been accepted towards fulfillment of the requirements for the Ph. D. degree in Electrical Engineering.

Major professor
Date: November 8, 1972

ABSTRACT

NONPARAMETRIC PROCEDURES FOR LEARNING WITH AN IMPERFECT TEACHER

By Ronald Joseph Richter

In this dissertation a general pattern recognition problem is investigated in which the classification of an observed phenomenon or pattern is inferred from a set of training patterns. The statistical approach to pattern recognition is taken. The pattern classes are assumed to be characterized by probability distributions that are inadequately known. In order to form decision rules the unknowns must be "learned" from a set of training patterns. The training patterns are classified by an imperfect teacher that makes errors in its classifications. This type of learning, called learning with an imperfect teacher, lies in between supervised learning and unsupervised learning.

In the first part of the thesis, a probabilistic model of an imperfect teacher is proposed. Expressions are developed relating a perfect teacher to an imperfect teacher. An example is studied to show the asymptotic effects of using misclassified training patterns in an algorithm designed for supervised learning. The example illustrates the need for developing algorithms specifically for use with an imperfect teacher. A class of nonparametric learning procedures is proposed for learning to recognize patterns with an imperfect teacher. The procedures require prior knowledge only of the nonsingular matrix of error probabilities characterizing the teacher and of whether the patterns are discrete or continuous random variables. Formal proofs are given showing that the procedures are asymptotically optimal in the sense that they have expected risks which converge with increasing number of training patterns to the optimal (Bayes) risk. Theorems on rates of convergence are also obtained.

In the latter part of the thesis, the two-class recognition problem is investigated in detail with the objective being to study the quantitative and qualitative effects of using an imperfect teacher rather than a perfect teacher. Large-sample approximations are developed for evaluating the expected risk of the estimated decision rules. The performance of the learning procedures is studied in several examples involving normal, triangular, and binomial distributions. Various large-sample properties of the expected risk are also investigated. Finally, a measure of relative performance is proposed for quantitatively evaluating the effects of an imperfect teacher. This measure is evaluated for the important case of a zero-one loss function. The measure is then used along with a cost of training to establish an overall cost for an imperfect teacher. Conditions are established under which an imperfect teacher is more cost effective than a perfect teacher.
NONPARAMETRIC PROCEDURES FOR LEARNING WITH AN IMPERFECT TEACHER

By

RONALD JOSEPH RICHTER

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering and Systems Science

1972

ACKNOWLEDGEMENTS

The author wishes to thank his thesis advisor, Dr. Richard C. Dubes, for the guidance and encouragement given throughout this research. Thanks are also due to Dr. Dennis C. Gilliland for his helpful suggestions. Appreciation is also expressed to Dr. G. L. Park, Dr. R. O. Barr, and Dr. J. H. Stapleton for their interest in this work. The principal support for this research was a United States Steel Foundation Fellowship. The cooperation of The Magnavox Company is also gratefully acknowledged. Special thanks go to Mrs. Jean Emig for her excellence in typing.

TABLE OF CONTENTS

Chapter                                                            Page
I.   INTRODUCTION ................................................... 1
     1.1  Survey of the Pattern Recognition Problem ................. 2
     1.2  The Imperfect Teacher ..................................... 5
     1.3  Thesis Contributions ...................................... 7
II.  THE LEARNING PROBLEM ........................................... 9
     2.1  Mathematical Model for Pattern Recognition ................ 9
     2.2  Model of the Imperfect Teacher ............................ 12
     2.3  Some Fundamental Relations ................................ 13
     2.4  Effects of Misclassifications on Supervised Learning
          Procedures ................................................ 15
III. A NONPARAMETRIC LEARNING PROCEDURE ............................. 21
     3.1  Decision Rules ............................................ 22
     3.2  Convergence Criteria ...................................... 25
     3.3  Preliminary Lemmas ........................................ 27
     3.4  The Discrete Case ......................................... 30
     3.5  The Continuous Case ....................................... 35
IV.  EFFECTS OF THE IMPERFECT TEACHER ............................... 41
     4.1  Expected Risk for the Two-Class Problem ................... 42
     4.2  Normal Approximation ...................................... 44
     4.3  Examples of Learning ...................................... 49
     4.4  Large Sample Properties of the Expected Risk .............. 61
     4.5  A Measure of Performance .................................. 65
     4.6  Cost of Training .......................................... 68
V.   CONCLUSIONS .................................................... 73
     5.1  Summary ................................................... 73
     5.2  Extensions ................................................ 75
BIBLIOGRAPHY ........................................................ 78

APPENDICES
A.   Optimal Decision Rules ......................................... 82
B.   Nonparametric Estimation of Density Functions .................. 86
C.   Proof of Theorem 3.5 ........................................... 89

LIST OF TABLES

Table                                                              Page
B.1  Univariate Kernels ............................................. 88

LIST OF FIGURES

Figure                                                             Page
2.1   Asymptotic Risk with an Imperfect Teacher ..................... 20
4.1   Expected Risk with Normal Distributions, P1 = 0.5 ............. 51
4.2   Expected Risk with Normal Distributions, P1 = 0.1 ............. 52
4.3   Expected Risk with Normal Distributions, β11 ≠ β22 ............ 53
4.4   Expected Risk for Large N ..................................... 54
4.5   Triangular Density Functions .................................. 55
4.6   Expected Risk with Triangular Densities ....................... 57
4.7   Comparison of Large Sample Approximation and Simulation ....... 58
4.8   Expected Risk with Binomial Distributions ..................... 60
4.9   Measure of Relative Performance ............................... 69
4.10  Teacher's Cost for Classification ............................. 71
4.11  Relative Cost of Training ..................................... 72

CHAPTER I

INTRODUCTION

Pattern recognition is the study of ways in which machines, usually meaning digital computers and associated equipment, can observe an environment, learn to distinguish relevant details from background trivia, and make sound and reasonable decisions. The task of designing machines to recognize patterns appears in many different forms in a variety of disciplines. The problems encountered range from the practical to the profound, from engineering design and economics to the theories of artificial intelligence and human learning. But the central problem of pattern recognition is to develop procedures or algorithms that effectively classify an observed phenomenon as resulting from one of a set of sources.

In this thesis a general pattern recognition problem is investigated in which the classification of an observed phenomenon or pattern is inferred from a set of training patterns. A statistical approach to pattern recognition is taken. In this approach the sources or pattern classes are assumed to be characterized by probability distributions, and statistical decision theory is used as the mathematical tool for deriving classification procedures. When the probability distributions that describe the patterns are inadequately known, the unknowns must be "learned" from a set of training patterns.

Learning with a teacher (supervised learning) refers to the situation in which all of the training patterns have been correctly classified as to origin. When the true classifications of the training patterns are unknown, the learning is said to occur without a teacher (unsupervised learning). This thesis investigates a third type of learning, learning with an imperfect teacher, which lies somewhere between supervised and unsupervised learning. For this type of learning the training patterns are classified by an imperfect teacher that makes errors in its classifications.

1.1 SURVEY OF THE PATTERN RECOGNITION PROBLEM

Surveys of early work in pattern recognition have been written by Nagy [N-1] and Ho and Agrawala [H-1]. Nilsson [N-2] presented some of the early work on the theory of "learning machines," or machines that can be trained to recognize patterns. The idea of finding clustering transformations for designing pattern recognizers was developed by Sebestyen [S-3]. A book edited by Kanal [K-1] describes applications of pattern recognition to the problem of character recognition, and a book edited by Watanabe [W-1] is a very good collection of papers emphasizing the philosophy of various approaches. Sequential methods in statistical pattern recognition have been presented by Fu [F-4].

Ho and Agrawala [H-1] identify three fundamental problems associated with pattern recognition: characterization, abstraction, and generalization. Characterization is concerned with the problem of selecting the measurements which should be taken on the objects and of developing methods for reducing these measurements to a set of real variables which effectively characterize the objects and are amenable to automatic data processing. The set of real numbers obtained from the measurements on an object is called a pattern, and each element of the pattern is called a feature. Each of the states of nature or sources of patterns is said to be a pattern class. At present there is no unifying theory for the selection of features [I-2].
Much of the problem of feature selection is left to the ingenuity of the system designer [I-l]. After the features have been selected, the designer, using all available information, must devise a decision procedure for classifying a new pattern of unknown origin. The process of deriving a decision rule from the available information is called abstraction. The ability of the resulting decision rule to correctly classify new patterns is termed generalization. This quality is best stated in probabilistic terms, such as the probability of correct classification. These three aspects of the pattern recognition problem are not entirely independent of each other. Clearly, the choice of imprOper features manifests itself in unduly complex forms for the decision rule with poor generalization ability. Similarly, the ability to generalize may be the criterion for choosing the features. In this thesis it will be assumed that the features have been judiciously chosen. This research is concerned with a particular abstraction problem and the generali- zation ability of the resulting algorithms. Two primary approaches have been taken to the abstraction problem in the literature [B-l]: namely, a deterministic approach and a sta- tistical one. In the deterministic approach, the goal of the recognition system is to partition the feature space into regions such that each region can be identified with a pattern class. A functional form is 4 often assumed for the decision function, and unknown parameters are derived from a set of classified training patterns. Many of the de- terministic procedures take the form of error correction algorithms or gradient descent algorithms. Blaydon [B-l], Nilsson [N—Z], and Ho and Kashyap [H-Z] have described such methods. This thesis is concerned with the statistical approach to the abstraction problem. In this approach the mathematical tools of sta- tistical decision theory [F-Z] are applied to the problem of designing the classifier. The pattern features are assumed to be described by probability distributions; and optimal decision rules are obtained to satisfy certain classification criteria; for example, minimum average risk. A recent book by Fukunaga [F-S] presents this statistical ap- proach. The problem of learning arises during the abstraction process when the probability distributions characterizing the pattern classes are inadequately known. The unknowns of the class distributions must be estimated or "learned" from training patterns drawn from each of the pattern classes. When the origins of the training patterns are known, the learning is said to take place with a perfect teacher (supervised learning); but when the classification of the training patterns is un- known, the learning is called unsupervised (without a teacher). Abramson [A-1] and Spragins [8-9] have studied the convergence question in supervised learning when only a few unknown parameters need to be learned. Aizerman, et. a1. [A-2] deve10ped the method of potential functions for supervised learning, and Van Ryzin [V-l], [V-Z] has shown convergence of a procedure that estimates complete density functions via Parzen type [P-l], [C-l] density estimators. 5 Most of the practical learning procedures that have been deve10ped use supervised learning. Theoretical foundations for unsupervised learning have been established, but practical algorithms have been slow in coming. Fralick [F-3], Patrick and Hancock [P-2], and Yakowitz [Y-l] have discussed various theoretical aspects of unsupervised learning. 
A third type of learning, which has received little attention in. the literature, is studied in this thesis. This type may be called "learning with an imperfect teacher" and lies somewhere in between super- vised and unsupervised learning. 1.2 THE IMPERFECT TEACHER In many practical situations it is unreasonable to assume that all of the training patterns supplied to a learning system by a "teacher" are correctly classified. The teacher often classifies the training patterns using past experience and additional information not available to the learning system. But even this additional information may not be ade- quate to ensure that the teacher's classifications are all correct. As an example, a case has been reported in the literature [M-3] in which two groups of electrocardiographers reading the same set of 561 EKG's disagreed in over 40 percent of the normal-abnormal classifi- cations. If one were to attempt to use this set of EKG's for supervised learning, he is confronted with a dilemma. Which group's classifi- cations should be used? It is clearly questionable to attempt to perform supervised learning with such unreliable data. One might solve the problem by using only those EKC's for which both groups agreed. But this is a wasteful procedure since time and money are involved in recording 6 and reading each EKG. Another solution might be to use fUrther tests and clinical records from each patient as an aid for classifying each EKG. Again this may be a costly and time consuming undertaking with no guarantee of success. Thus it is reasonable to talk about an "imperfect teacher" and consider situations in which the training patterns may not all be correctly classified. Estimation and decision making with misclassified data has re- ceived some attention in the literature. Bross [B-Z] considered the effects of misclassified data on estimators and significance tests in- volving binomial distributions. Tenenbein [T-l] extended this work to investigate the effects of estimation with misclassified multinomial data. He presented a double sampling scheme that minimized a measure- ment cost. Linear classifiers have been used in many pattern recognition problems. Lachenbruch [L-l] looked at the effects of misclassification when learning a linear discriminant function for a Gaussian classifi- cation problem. He exhibited the asymptotic effects of estimating the mean and variance of the normal distribution from training samples that had a constant probability of being misclassified. Whitney and Dwyer [W-Z] derived the asymptotic performance of the k-nearest neighbor rule [C-Z] which used training patterns classified by an imperfect teacher. Most of the research involving imperfect teachers has been con- cerned with analyzing the performance of existing supervised learning algorithms when some of the training samples are misclassified. Recent- ly Shanmugam [8-4], [8-5] developed an error correction procedure for learning with an imperfect teacher. Shanmugam studied a nonparametric learning scheme that was asymptotically optimal in the sense that it had 7 an average risk which converged to the Bayes risk. He considered only a two-class problem, used a Bayes decision rule with a zero-one loss function, and assumed that the teacher had the same rate of misclassifi- cation for each pattern class. He proposed a threshold feedback scheme for gradually phasing out the teacher in the case when the class densi- ties had disjoint support. 
The work in this thesis removes some of the assumptions needed for the convergence of Shanmugam's algorithms. An alternative and more general approach to learning with an imperfect teacher is developed herein. The procedures considered are analogous to the type considered by Van Ryzin [V-l], [V-2] for supervised learning. These procedures are nonparametric in the sense that minimal knowledge about the class distributions is assumed to be available to the system designer. 1.3 THESIS CONTRIBUTIONS The emphasis of this research is on developing and evaluating a procedure for learning with an imperfect teacher. The first contribution of this thesis appears in Chapter II which contains a mathematical model of the learning problem. A probabilistic model for an imperfect teacher is proposed and expressions are developed relating the imperfect teacher to a perfect teacher. These basic relations provide the key for studying the class of learning procedures proposed in Chapter III. An example is also presented in Chapter II to illustrate the effects of using mis- classified training patterns in an algorithm designed for supervised learning. The example provides motivation for developing algorithms Specifically for use with an imperfect teacher. 8 The second major contribution appears in Chapter III. A specific class of nonparametric procedures is proposed for learning to recognize patterns with an imperfect teacher. Formal proofs are given showing that the procedures are asymptotically optimal in the sense that they have expected risks which converge with increasing number of training patterns to the optimal (Bayes) risk. The conditions under which the convergence holds are very general for the two cases considered: i.) the case of discrete valued patterns and ii.) the case of patterns with a.e. continuous densities. Theorems on rates of convergence are also obtained. These rate theorems serve as guidelines for selecting the parameters for the estimators. The final contributions of the thesis are presented in Chapter IV where attention is restricted to the two-class recognition problem. The purpose of Chapter IV is to investigate the quantitative and qualitative effects of using an imperfect teacher rather than a perfect one. Two large-sample approximations are developed for evaluating the expected risk of the estimated decision rules. Qualitative effects of the im- perfect teacher are studied in section 4.3 by considering examples in- volving normal, triangular, and binomial distributions. In the latter part of Chapter IV, a measure of relative per- formance is proposed for quantitatively evaluating the effects of the imperfect teacher. This measure, which can be evaluated for the case of a zero-one loss fUnction, is used along with a cost of training to evaluate an overall cost of an imperfect teacher. The results in Chapter IV are original and represent a contribution to the study of learning theory. CHAPTER II THE LEARNING PROBLEM This chapter presents the mathematical model to be used to study the problem of learning with an imperfect teacher. The problem is formulated in terms of statistical decision theory. A Bayesian decision strategy is employed throughout. In Section 2.1, the basic model for statistical pattern recog- nition is defined. The Bayes decision rule and associated Bayes risk are given fer the M-class problem. In Section 2.2, a model is preposed for an imperfect teacher. Section 2.3 then develops some fundamental relations between the imperfect teacher and a perfect teacher. 
Finally, Section 2.4 investigates the asymptotic effects of using an imperfect teacher with a procedure designed for supervised learning. 2.1 MATHEMATICAL MODEL FOR PATTERN RECOGNITION The major objective of pattern recognition is to derive decision rules for classifying a pattern X as coming from one of M possible sources or pattern classes. In the statistical approach to pattern recognition the sources are characterized by probability distributions, and decision theoretic strategies are used for decision making. The mathematical model for pattern classification to be used in this research will now be defined. The pattern X is defined to be a 9 . lO vector random variable taking values in a feature space T, a subset of Euclidean n-space. The set of pattern classes is denoted as n = {w1, w2, ..., wM} with “i representing the ith pattern class. The pattern classes are assumed to be characterized by a prior probability distribution P = (P1, P2, ...,PM)T where Pi is the prior probability of occurrence of pattern class mi. Let A be an identifi- cation random variable defined over 9 such that A(wi) = i. The proba- bility distribution of A is then just the prior distribution P. The probability density function of the pattern X given that X is from pattern class w- is denoted by f(- A = i). This is a density with re- 1 spect to some measure v on T. Denote the vector of conditional densi- ties by f(-) = (f(- A = MDT. A = 1). f(° A = 2): ..., f(' After observing a value of the pattern random variable X, the pattern recognizer is required to classify X as having come from one of the M possible pattern classes “1 with density f(xIA % i). The random variable A indicating the class that produced X is, however, unobserva- ble. The general elements of this statistical decision making problem are outlined in Appendix A. A Bayesian decision strategy is used to obtain optimal decision rules. The Bayes decision rule, which follows from (A.S) of Appendix A, is any randomized decision rule 68(x) = (681(x), 637(x), ..., 63M(x)) satisfying 1 if Dj(x) < min Dk(x) ka‘j GB.(x) = 0 if D (x) > min Dk(x), j = l, 2, ..., M (2.1) J J - k7!) Yj if Dj(x) = min Dk(x) ka‘j 11 where the Bayes discriminant function Dk(x) is given by M 0km = E A kapr(X|A = A); k = l, 2, coo, B10 (202) l The function 68j(x) is interpreted as the conditional probability of classifying the pattern X as coming from pattern class wj given that the observed value of the pattern is X = x. The loss function ka represents numerical loss incurred by deciding X is from pattern class wk when the true pattern class is wk' The loss function is a nonnegative, real- valued function that is assumed to satisfy the condition Lij > ij :_0, i # j. The Bayes rule minimizes the average loss of misclassification. This minimum average less, known as the Bayes risk, follows from (A.7) and is given by M E RB(P, f) M pA Ir jgl ijf(xIA = x)ij(x) dv(x) A 1 u Ilbvfiz f Dj(x)dBj(x) dv(x). (2.3) J'l'r The notation fer RB in (2.3) emphasizes the dependence of the Bayes risk on the prior distribution P and the class-conditional densities f. This notation will be of use in later chapters. The optimal Bayes rule can be used for decision making only if one has exact knowledge of the class-conditional densities and of the prior distribution. When these distributions are known, the abstraction and generalization problems are essentially solved. The system designer needs only to implement the decision rule of (2.1). 
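For illustration, the rule in (2.1) and (2.2) can be written out directly when the distributions are known. The short Python sketch below is not part of the original thesis; the function names and the two-class usage values (unit-variance normal densities with means 1.5 and -1.5, equal priors, zero-one loss) are purely illustrative, and exact knowledge of the priors, class-conditional densities, and loss matrix is assumed.

```python
import numpy as np

def bayes_discriminants(x, priors, densities, loss):
    """D_k(x) = sum_j L_{kj} P_j f(x | class j), k = 1, ..., M, as in (2.2).

    priors    : length-M sequence of prior probabilities P_j
    densities : list of M callables, densities[j](x) = f(x | class j)
    loss      : M x M array with loss[k, j] = L_{kj}
    """
    f = np.array([densities[j](x) for j in range(len(densities))])
    return loss @ (np.asarray(priors) * f)

def bayes_decision(x, priors, densities, loss):
    """Bayes rule (2.1): choose the class whose discriminant is smallest
    (ties may be broken arbitrarily, which the randomized rule permits)."""
    return int(np.argmin(bayes_discriminants(x, priors, densities, loss)))

if __name__ == "__main__":
    # Two unit-variance normal classes (illustrative values), equal priors,
    # zero-one loss.
    from math import exp, pi, sqrt
    normal = lambda m: (lambda x: exp(-(x - m) ** 2 / 2.0) / sqrt(2.0 * pi))
    priors = [0.5, 0.5]
    densities = [normal(1.5), normal(-1.5)]
    loss = np.array([[0.0, 1.0], [1.0, 0.0]])
    print(bayes_decision(0.3, priors, densities, loss))   # prints 0, i.e. class 1
```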
The resulting system perfbrmance will be given by the Bayes risk in (2.3). 12 But in practice, the required distributions are usually only partially known. The unknowns of the class distributions must be learned (estimated) from training patterns drawn from each of the pattern classes. The pattern recognition system must use the training patterns provided by a (imperfect) teacher to abstract a decision rule for classifying new patterns of unknown origin. One would hope that the learned decision rules adapt or converge with increasing number of training patterns to what the optimal rule would be if the true proba- bility distributions were known. 2.2 MODEL OF THE IMPERFECT TEACHER The training patterns to be used for learning a decision rule are represented by a sequence of independent, identically distributed N random variables Y = (Y1, Y2, ..., YN). Each random variable Y1 = (Xi, A;' A1) consists of a training pattern Xi, a label A; provided by an imperfect teacher, and an identification random variable Ai representing the true classification of Xi- Only Xi and A; are available to the learning system; Ai remains unknown. The pattern random variable Xi is assumed to be distributed as f(-|A = A) if the correct classification of Xi is A1 = X. The identification random vari- ables are identically distributed according to the prior distribution P. The labels provided by the imperfect teacher are modeled as identically distributed random variables on O having a probability distribution defined by Pr[A; = le. 1 k] = Bjk (2.4) with Bjk 1.0, Bjk = l. The probabilities Bjk are assumed not to be ll M3 j 1 13 a function of i or of the value of the training pattern Xi. An M x M matrix of probabilities characterizing the imperfect teacher may be defined by I" 1 B11 81? ”' 81M 8 B ... B 21 22 2M _8_ = . (2.5) 3M1 8M2 BMM This simple probabilistic model of an imperfect teacher is one that is justifiable in many practical situations. The model is an extension of one used by Whitney and Dwyer [W-Z] and by Shanmugam [8-5]. When Bii = l for all i, A; = A1 with probability one, in which case the model reduces to one of a perfect teacher. When Bij = l/M for all i and j, the model corresponds to unsupervised learning since having a teacher that SUpplies equally likely labels for all pattern classes is equivalent to having no teacher at all. There is no classification information available in the labels. 2.3 SOME FUNDAMENTAL RELATIONS Several expressions can be derived to relate the probability distributions associated with an imperfect teacher to those for a perfect teacher. The labels of the perfect teacher are distributed as the identification random variable A, and the class-conditional densities with the perfect teacher are f(- A). With an imperfect teacher, the density of X conditioned on A' is given for j = l, 2, ..., M by f(x|A' = j) = f(x|A' = j, A = k)Pr[A = kIA' = j]. (2.6) k nevfi: l 14 But f(X|A' = j, A = k) = Pr[[\' zjjxj A :: k]f(xLA = k) WM'=HA=M = f(xlA = k) (2.7) where the last equality follows from the assumption that Bjk is not a function of the value of X. Also Bayes rule gives PrjA' = le = k]Pr[A = k] M Z Pr[A' = le = X]Pr[A = A] 1:1 Pr[A = kIA' = j] = E = BjkPk/pj’ j, k = 1, 2, ooo, b1o (208) Here P; is the probability that A' equals j, P j: 1, 2, ooo, Mo (209) ji 1' Substitution of (2.7) and (2.8) into (2.6) then gives M . P3f(xIA' = j) = Z BjkPkf(xIA = k), j = 1, 2, ..., M. 
(2.10) k=1 Equations (2.9) and (2.10) can be expressed in a convenient vector form by defining the following vectors and matrices: f'(x) = (f(xIA' = 1), ..., f(xIA' = 11))T (2.1la) I_ 1: 01‘ p _ (p1, p2, ..., PM) (2.11b) 3.: diag (p1, P2, ..., PM) (2.11c) and 3: = diag (Pi, Pg, ..., Pg). (2.11d) Equation (2.10) then becomes P'f'(x) = §_g_f(x) (2.12) and (2.9) has the form P = _e_ P. (2.13) When §_is nonsingular, it may be inverted to obtain from (2.12) and (2.13) P = g‘lp' (2.14) and Pf(x) g-lg'f'm. (2.15) The expressions (2.12) thru (2.15) describe the relations between the probability distributions for the perfect teacher and those for the imperfect teacher. Equations (2.14) and (2.15) provide the key for develOping and studying in Chapter III and Chapter IV an algorithm for learning with an imperfect teacher. Previous studies with imperfect teachers [W-2], [8-4], [8-5] have used (2.10) for the case of two pattern classes (M = 2), but none of these studies have made use of the inverse relations in (2.14) and (2.15). Shanmugam [8-4], [8-5] avoided the need to use the inverse relations by restricting the density functions to having disjoint supports and by using a zero-one loss function along with a known prior distribution. The inverse relations remove the need for such restrictive assumptions. 2.4 EFFECTS OF MISCLASSIFICATIONS ON SUPERVISED LEARNING PROCEDURES Many algorithms have been deve10ped for supervised learning. Several of these algorithms have also been used with an imperfect teacher [S-S], [W-Z], [L-l]. For the problem considered in [8-5], it was shown that misclassified training patterns could be used in the supervised learning procedure and that the resulting decision rule would 16 still converge to the Bayes rule. The use of misclassified data did not prevent the algorithm from converging to the desired decision rule. If this were always the case, one would have little motivation for deve10ping learning procedures which take into account the imperfect teacher. One could just ignore the fact that the data was misclassi- fied and use existing algorithms which were designed for supervised learning. An example presented in this section shows that, as one would expect, misclassified training data can significantly effect the con- vergence of a supervised learning algorithm. This example provides motivation for developing learning procedures for use with an imperfect teacher. The learning procedures discussed in [8-5] and [V-l], as well as those preposed in Chapter III, are based on nonparametric estimators of the class-conditional densities and prior distribution. In [V-l], the training patterns with label k were used to form estimates f(xlA = k) and Pk of f(x|A = k) and Pk,respectively. An estimate of the Bayes dis- criminant function Dk was then defined as M . . Dk(x) = E ij ij(xIA = 3). (2.16) ) A decision rule 6 was formed according to (2.1) by replacing Dk with the estimate Dk. It was shown in [V-l] that,under suitable conditions, 3 converged to the Bayes decision rule 63. Now if the training patterns were misclassified, one would be estimating f(xIA' = k) and Pi, not f(xIA = k) and Pk, when using the patterns with label k. The resulting decision rule would, at best, con- verge to a decision rule 6' having the form of (2.1) but with 17 M DL(X) = 521 ij P§f(X|A' = j). k = 1, 2, ..., M (2.17) as the discriminant functions. This decision rule would be equiva- lent to the Bayes rule only if the risk (see equation (A.2)) ' M R(P, f. 
6 ) = I [11.00 6.'(x) d\)(x) (2.18) J J J l were equal to the Bayes risk RB(P, f). For the problem considered in [8-4], [5—5], R(P, f, 6') and RB(P, f) were equal. Hence for that problem, the misclassified data did not alter the convergence of the learning procedure. Shanmugam [5-5] has shown that for the M-class problem, 6' and 63 are equivalent if the following conditions are satisfied: a.) A zero-one loss function is used; i.e., Lii = 0 and Li' = l, i # j. J b.) The probabilities characterizing the imperfect teacher are Bii = B > l/M, i = 1, 2, ..., M (2.19) and _1-3 . ._ . . 220 B " a 1: J _ 1: 2, 0”, M, 17‘]- (. ) 13 M- 1 Under these conditions, supervised learning algorithms based on density estimators as in [V-l] will perform asymptotically just as well with an imperfect teacher as with a perfect teacher. The following example shows what happens when either condition (a.) or (b.) is not satisfied. Suppose that there are two pattern classes w and w2' and assume 1 that under pattern class ”j the pattern X has a normal distribution with mean uj and variance 02, 18 _1 6 f(x|A = j) = (2no’) exp (~(x - uj)?/202). (2.21) For convenience assume that 112 > “1' The decision rule 6' = (6;, 6;) obtained with the imperfect teacher is given by (2.1) using the discriminant functions Di in place of DR. The rule is 6;(x) = 1 if i I t l I ! L11P1f(x|A = 1) + L12P2f(xIA = 2) :_L21Plf(xIA = 1) + L222;f(x|A' = 2), (2.22) and 65(x) = l — 6;(x). Making use of (2.12) one can show that (2.22) simplifies to 61(x) = 1 if either x :_r, p > 0 or x :.T’ p < 0, where p 3 ‘(L21 ‘ L11)812 + (L12 ' L22)822 (2'23) and T _ 02 2(1421 -I;11)811‘ (L12 " L22)82-1- + u] + “2 u2 - P1 P2 (L12 ’ L222322 ‘ (L21 ' L113312 2 (2.24) The risk incurred in using 6' follows from (2.18). One can show that (2.18) can be evaluated as P IJ + P,L " A if p > 0 RU), f, (3') z 1 21 Z 22 (2.25) P L + P L + A if . p < 0 1 11 2 12 where A = (L21 ' L113P1¢(Y1) * (L12 ' L22)P2¢(Y2) (2°26) - 1 P L - L - L - L S Y1 = T’ HI = __1n .1.( 21 11)811 ( 12 22)821 + __ (2.27) O S P2 (L12 " L22)822 ' (L21 ' L11)812 2 Y? = (T ‘ U2)/0 = Y1 - S (2.28) S = (“2 - ull/o (2.29) 19 and a l. ¢(a) = ] (2n)’? exp (-t2/2) dt. (2.30) .00 The Bayes decision rule and the Bayes risk may be found for this example by setting 811 = B = 1 in the above equations. It is clear 22 from the above equations that in general R(P, f, 6') will vary with 811 and 822. Hence the rule 6' is not always equivalent to the Bayes rule. In Figure 2.1 the risk R(P, f, 6') is plotted as a function of 5:: for the case of a zero-one loss function and equal prior proba- bilities. The figure shows that for small 8 the risk exceeds the 22’ Bayes risk by a significant amount. In this case, estimating the density functions without compensating for the misclassified data leads to a decision rule which is not a Bayes rule. Thus, one is motivated to develop a learning procedure for use with an imperfect teacher. 20 o.m genome? poomwemEH cm :ufiz xmwm capoumEXm< .H.N magmam A00 00z min Dk(x; v) (3.5) k¢i yj 1f Dj(x; v) min Dk(x; v) k#j The notation in (3.4) and (3.5) displays the dependence of the Bayes rule on the vector v (or, alternatively, on P and f). If the training patterns are used to obtain an estimate ;N(x) of v(x) and if 6N is a good approximation to v, then a reasonable decision rule may be formed by using 6N in (3.4) as if it were the true v. The resulting decision rule is 8N(x) = chx; 6N) = (lime), lime» (3.6) with (I if DNj(x) > pi? DNk(x) . 
6Nj(x) =4 0 if DNj(x) < min DNk(x) (3.7) kfij min DNk(x) Lyj if an(x) k¥j and with the estimates of the Bayes discriminant functions given by DNj(x) Dj(x; 3N) M = z [Jo-v 0(X). (3.8) i=1 )1 N1 The class of estimators of v(x) preposed here is defined as follows. Let {gN(x, y)} be a sequence of nonnegative real-valued func- tions on T x T such that I chx. y) dv(x) = 1 a.e.v (3.9) T and lim E[g (x, X)IA = A] = f(xIA New N A) a.e.v. (3.10) Let 1 if i = j Mi. 3') = (3.11) o if i s j be the Kronecker delta, and denote the elements of Bil by bij: Bil = [bij]° Then the estimator 9N = (5N1, VNZ, ..., GNM) is defined by M 1 izl bjiA(Afi, i)gN(x, xk). (3.12) ZIH IIMZ ij(x) = k Explicit expressions for the 3N functions will be exhibited in Section 3.4 for the discrete case and in Section 3.5 fer the continuous case. The form of (3.12) may be obtained from the following consider- ations. Suppose that Ni is the number of training patterns classified by the imperfect teacher as coming from pattern class mi. Then an esti- mate of f(xIA' i) is given by the Parzen estimator described in Appendix B, N f(xIA = i) = $1 kzl A(A£, i)gN (x, xk). (3.13) 1 = 1 An estimate of the probability Pi is given by Ni/N. Combining these estimates according to the inverse relation (2.15) results in an esti- mator for v(x) having the form in (3.12). The decision rule in (3.6),along with the estimator in (3.12), forms the class of procedures for learning with an imperfect teacher to be studied in the remainder of this thesis. When Bii = 1 for all 1, these procedures reduce to those of Van Ryzin [V-l] for supervised learning. 25 3.2 CONVERGENCE CRITERIA A The risk incurred in using decision rule 6N is given by (A.2) as M M R , 6 I.,,P,f A =° 6, d (v N) 351 IT [12 31 1 (xl 1)] NJ(X) v(x) =1 J M 2 IT chx; v) 3113' (x) dv(x) (3.14) 1 and the Bayes risk is given by (2.3) as R3”) = . J II M2 1 1‘ Dj(x; v) 6Bj(x; v) dv(x). (3.15) The dependence of the risks on the vector v is displayed by the no- tation in (3.14) and (3.15). One would hepe that as N + w, the estimated rule 6N would in some sense converge to the Bayes rule 63. Now the risk R(v, 6N) is a random variable since it is a function of the training set. 80 the decision rule 6N is said to converge to the Bayes rule 6B if the risk R(v, 6N) converges in some sense to the Bayes risk RB(v). The following defi- nitions of convergence are due to Van Ryzin [V-Z]. DEFINITION 3.1. A decision rule 6N is said to be Bayes Risk Consistent if for every 5 > 0, Pr[R(v, 6N) - RB(v) _>_ e] -> o as N ~> oo; (3.16) i.e., if R(v, 6N) converges in probability to RB(v) as N + w. Let EN[-] denote the expectation with respect to the distribution of the training data, {(Xl, A1, A1), ..., (XN, A§, AN)}. Then define the expected risk for the learning procedure as Rch) 2 ENtncv. 3N1]. (3.17) RN is a nonrandom variable which depends upon the vector v. 26 DEFINITION 3.2. A decision rule 6N is said to be Mean Risk Consistent if lim RN(v) = RB(V). (3.18) N-HZO Definition 3.2 is a special case of the following definition: DEFINITION 3.3. A decision rule 6N is said to be Mean Risk Consistent of order r > 0 if $1: EN[IR(V, 8N) - RB(v)|r] = 0. (3.19) The following notation is defined for use in later sections: AR(v, SN) 2 R(v, 6N) - RB(v) (3.20) and ARN(v) 2 Rch) .. RB(v). (3.21) So a decision rule 6N is Bayes Risk Consistent if AR(v, 6N) converges to zero in probability, and the rule is Mean Risk Consistent if ARN con- verges to zero in the ordinary sense of a limit. 
Since convergence in the mean implies convergence in probability [L-2], Mean Risk Consistency implies Bayes Risk Consistency. A stronger relation between the various types of convergence is established by the following preperty. proPeTW3.1. If a decision rule 6N is Mean Risk Consistent, then it is Mean Risk Consistent of order r for any r > O. PROOF: From (3.14), it follows that . M M |R(V. 5N)l 5,.2 _Z [I j=1 1=1 LjiPif(xIA = i)6Nj(x)I dv(x) M 5_ £1 ? LjiPi IT f(xIA = 1) dv(x) j: i=1 27 M Z Ljipi < m- (3.22) 1 i=1 u Ilbvfiz 5 So the sequence of random variables {R(v, 6N)} is 3.5. uniformly bounded. Hence R(v, 6N) converges to RB(v) in the rth mean if it con- verges fer r = l [L-2, p. 158]. End of proof. In the following sections the estimated decision rules are shown to be Mean Risk Consistent. The above property then establishes that the rules are Mean Risk Consistent of order r for any r > 0. 3.3 PRELIMINARY LEMMAS The two lemmas presented in this section are used to establish convergence theorems and rate theorems for the decision rule 6N in both the discrete case (Section 3.4) and the continuous case (Section 3.5). Let E[-] denote expection with respect to the mixture distribution M Z Pif(xlA = i) =1 p(X) 1 M E P{f(x|A' = 1). (3.23) 1: 1 The estimator GN satisfies the following lemma: LEMMA 3.1. Let GNj(x), j = 1, 2, ..., M, be defined by (3.12). Then ENtij(x)] = PjEIchx. X)IA = j] = Pj [ gN(x, z)f(zIA = j) dv(z). (3.24) T 28 PROOF: Taking expectations in (3.12) gives M M A 1 1 v . EN[ij(x)] = fi' 2 .) bjiEN[A(Ak, 1)gN(x, xk)]. (3.25) k=l i=1 Since the training patterns are identically distributed according to the mixture distribution (3.23), equation (3.25) may be written as ,. M ' EN[VNj(X)] = 111 bji E[A(A . i)gN(x. X)] M = “Z bjiP; EIgN(x. X)|A' = 1] 1=l = P3 E[2N(x. X)IA = 1] (3.26) where the last equality follows from (2.15). End of proof. The decision rule 6N satisfies the following lemma: LEMMA 3.2. The average risk for the decision rule 6N defined by (3.7), (3.8), and (3.12) satisfies M A 0 < ARN _<_ 1Z1 L-iIEN[|vNi(x) - PiE[gN(x, X)|A .-. i]l] dv(x) M . . + I f{EN[DNJ-(X)] - Dj(X)} - {éBj(x; v) - EN[6Nj(x)]} dV(X) (3.27) .J.=1 where E: = max L . 1 lijih 31 (3.28) and the integrals are over T. PROOF: By the definition of Bayes risk RB(v) :_R(v. 3N) (3.29) 29 so that ARN(v) :_0. Also R(GN. 68(x; v1) :_RB(GN). (3.30) So ARN :_EN[R(v, 6N)] - RB(v) + EN[R(vN, 6B(x; V)) - RB(VN)] = hN[R(V. 5N) - RB(VN)] + LN[R(VN. 63(X; V)) - RB(V)]. (3.31) Consider the first expectation in (3.31) and use (3.14) to get . . M . . EN[R(v, 5N) - RB(vN)] = ENS-El I {nj (x) - DNj (x)}..| 2 v (x) + v.(x). (3.79) 3 1 0, (3.82b) > ll iii) for some 0 < y :_1 and some constant C A max 1(y; f(' 1:}:M Al) :_C||y||Y. (3.82c) Then choosing hN = 0(N-1/(n+a)) implies that there exist constants Q1 and Q2 such that for large N Q1 N'Y/(“+“) if a > 2y -o/2(n+a) ARN 5. Q2 N if a < 2y (3.83) (<21 + (aura/0‘ ”“3 if a 2Y The proof of this theorem is given in Appendix C. The conditions in (3.82) necessary for proving Theorem 3.5 are very general. Condition (3.82a) is satisfied by all of the kernels in 40 Table 8.1 except for the Cauchy kernel (entry 5). Condition (3.82b) is a weak moment condition on the density functions. Condition (3.82c) covers an extensive class of densities as discussed by Van Ryzin [V-l]. Van Ryzin has shown that when the density functions f(x|A = k), k = l, 2, ..., M, are absolutely continuous and have first partial derivatives which are integrable, then (3.82c) is satisfied with y = 1. 
It is easy to shown that when y = 1 the bound in (3.83) is smallest if u is chosen to equal 2. In this case the rate of convergence is ARN = 0(N-1/(n+2)). So a good rule of thumb would be to choose -1/(n+2) hN = 0(N ). Of course if a were known, a better rate of con- vergence could be obtained by properly choosing a. CHAPTER IV EFFECTS OF THE IMPERFECT TEACHER In Chapter II,a model was preposed for an imperfect teacher. An example (Section 2.4) was presented to show that the use of misclassi- fied training data can significantly affect the convergence of a super- vised learning procedure. A nonparametric algorithm for learning with an imperfect teacher was then proposed in Chapter III. The algorithm took into account the misclassifications so that the estimated decision rules converged to the Bayes rule as the number of training patterns became large. Rates of convergence were also given by establishing bounds on ARN that were functions of the number of training patterns used to learn the decision rules. Since these bounds hold for a large class of probability distributions, one would not expect the bounds to be very tight; they only give a loose measure of rate of convergence. In this chapter the two-class learning problem is investigated in some detail. The objective is to study the qualitative and quanti- tative effects of an imperfect teacher on learning. This objective is achieved by restricting attention to the two-class problem and assuming specific forms for the underlying class-conditional densities. Measures of performance for comparing the perfect and imperfect teacher are proposed and studied. A large sample approximation is developed in Section 4.2 to evaluate the expected risk in the two-class problem. The approximation 41 42 is then used in Section 4.3 to study three examples of learning with an imperfect teacher. In Section 4.4 large sample properties of the ex- pected risk are investigated. A quantitative measure of performance is pr0posed in Section 4.5. The measure is evaluated for a zero-one loss function and used in Section 4.6 to compute a cost associated with training. This cost provides a second quantitative measure for com- paring an imperfect teacher with a perfect one. 4.1 EXPECTED RISK FOR THE TWO-CLASS PROBLEM The Bayes decision rule for the two-class problem may be written from (3.1) and (3.5) as 631(x) = 1 if D(x) :_O = o if D(x) < o , (4.1) 637(x) = 1 — 681(x) (4.2) where the Bayes discriminant function has the form D(x) = (L - L11)Plf(x|A = 1) - (I.12 - L22)P2f(xIA = 2) 21 (L21 ‘ L11)V1(X) ‘ (L12 ‘ L22)V2(X)° (4'3) It follows from (4.3), (3.8) and (3.12) that the estimator of D(x) has the form lISNO‘) = (L21 ’ L11)01(X) ' (L12 ' L22)‘72(X) N M 1 ‘ v . N kzl 1%1 [(L21 ' L11)b1i 7 (L12 ' L22)bzi]MAk’1)gN(x’ xk) .1. N 1 nrvfiz k 43 where {WNk} is a sequence of independent random variables that are identically distributed as the random variable 2 WN(x) = 1 [(1.21 - L11)bli - (L12 — L22)b2i]A(A', i)gN(x, X) 1: = [01A(A'. 1) + 021(1), 2)]gN(x, X) (4.5) with Ci 3 (L21 - IJll)bli - (L1? - L22)b21. (4.6) The estimated decision rule is then = o if' fiN(x) < o , (4.7) 8N2(x) = 1 — 3N1(x). (4.8) The risk for any decision rule 6 = (6 6?) can be expressed in a 1’ convenient form for the two-class problem as follows. From (A.4) the risk is given by R(P, f, 8) = P1‘([L (x) + L (x)]f(xlA = 1) dv(x) 1161 2162 + P2 1 [L1261(x) + L2262(x)]f(xIA = 2) dv(x). 
(4.9) Using the relation 62(x) = l - 61(x) (4.10) in (4.9) gives R(P, f, 8) P1L21 + 1321.22 + f [P1(L11 — L21)f(xIA = 1) + p20.12 - L22)f(xIA = 2)]61(x) dv(x) PILZI + p2122 _ f D(x)61(x) dv(x). (4.11) This expression shows that the risk for any decision rule in the two- class problem can be expressed as a sum of two constant terms that are 44 not functions of the decision rule plus an integral involving the de- cision rule 61(-) and the Bayes discriminant function D(-). The risk for the estimated decision rule 8N is then R(P, f, EN) = P1L21 + P21.22 — f D(x)8N1(x) dv(x), (4.12) and the expected risk with N training patterns is RN(P, f) P1L21 + P2L22 - f D(x)EN[SN1(x)] dv(x) 1211.21 + P2L22 - j D(x)Pr[fiN(x) :_0] dv(x). (4.13) To evaluate the expected risk, one must be able to compute the probability that the estimated discriminant function is nonnegative. Since 6N is the average of a sequence of independent, identically dis- tributed random variables, the calculation of Pr[6N(x) :_O] is gener- ally an extremely difficult problem, although a normal approximation for 6N may be used when N is large. This approximation, which may be used to evaluate the expected risk given by (4.13), is developed in the next section. It is assumed throughout this chapter that 1/2 < Bii :_l, 1 = 1, 2. 4.2 NORMAL APPROXIMATION The first two moments of the random variable WN(x) defined in (4.5) will be used in the normal approximation. From (4.5) it follows that the expected value of WN is given by EIWN(X)] - I II MIN) II MN GjE[A(A . i)gNCX. X)] = . 1 0:. 1 J G.PjE[gN(x. X)|A 1] j 1 J 2 1 cj 121 BjiPiE[gN(x, X)IA = 1] (4.14) II II MN j 45 where the last equality follows from (2.10). Interchanging summations results in E[wN(x)] = (61811 + G2821)P15[8N(X, X)IA = 1] + (61812 + 62822)P2E[gN(x, X)IA = 2]. (4.15) For the two-class problem the 8 matrix is _ — B11 B12 §.= (4.16) 8' B _21 2a so that the inverse matrix is 1 B22 “812 8'1 = . (4.17) - 811 + 822'1 ‘821 811 Substituting (4.17) and (4.6) into (4.15) and simplifying gives Eth(x)] (L21 - L11)P1E[gN(x, X)|A = 11 ’ (L12 ‘ L22)P25[gN(X, X)|A = 2] I ll MN (L - - L .)P-E[g (x. X)|A = 1]. (4.18) l 2] 1] j N 1 Note that the expected value of WN(x) is not a function of the teacher; that is, it does not depend upon the §_matrix. The second moment of WN is EIW§(x)1 E[(G,A(A', 1) + 924(4', 2))2g§(x. X)] GzP'E 2 x A' = 1 + GZP'E 2 x x A' = 2 . 4.19 1 1 [gN(x, )I 1 2 2 [gN( , )I 1 ( ) Proceeding as in (4.14), the second moment may be written as 46 E[w§(x)] = Hlpln[g§(x, X)IA = 1] + n2P25[g§(x, X)IA = 2] (4.20) where _ 2 2 H1 “ G1811 + G2821 (4'21) and _ 2 2 n? _ 61812 + 62822. (4.22) Substitution of (3.38) into (4.18) shows that when X is a discrete random variable, E[WN(x)] = D(x). Thus 6N is an unbiased estimator of D in the discrete case. Also for discrete X, (4.20) becomes E[w§(x)] = H1P1f(xIA = 1) + H2P2f(xIA = 2) (4.23) which is finite for all x. It then follows that 6N is asymptotically normal as defined by the following theorem: THEOREM 4.1. Let the training patterns be represented by discrete random variables. Then DN(x) is asymptotically normal in the sense that, for every real number a, Bch) - D(x) lim Pr 1 < a = 4(3) (4.24) N+u> (VARIWN(x)1/N)é" where ¢(-) is the standard normal cumulative distribution, a _ 2 1 e t /2 d w /§F t. 
(4.25) 4(a) = I PROOF: Since 6N is the average of a sequence of independent, identi- cally distributed random variables with finite mean and variance, the conclusion follows immediately from the Lindberg-Levy Central Limit Theorem [F-l, p. 256]. End of proof. Now consider the case when the patterns are continuous random variables and the gN functions have the form described in Section 3.5. 47 Proving that 6N is asymptotically normal in this case is not as direct as in Theorem 4.1 because the mean and variance of ”N are now both functions of N. The asymptotic normality of ON is given by the follow- ing theorem. THEOREM 4.2. If the estimator 6N is defined by (4.4) with 3N defined as in (3.63), then at every continuity point of f(xIA = k), k = l, 2, UN is asymptotically normal in the sense that, for every real number 3, lim Pr DNOO - E[WN(X)1] f. a NM (VAR1wN(x)]/N1/2 = ¢(a). (4.26) PROOF: Let x be a continuity point of f(xIA = k), k l, 2. From Parzen [P-l, p. 1069] and Loéve [L-2, p. 316], it follows that a suf- ficient condition for (4.26) to hold is that for some 5 > O, Q N E[|WN(x) - 13[wN(x)]|2+€]_> 0 (N VAR1wN(x)1)1*€/2 qN(x) as N + m. (4.27) Now from (4.5) it follows that 2+ '+ v + ' EIIWNI 5] = |cl|2 5P. 512% t"cx. X)|A = 1] + |62|2*€P£ 5(g§+€(x, X)IA' = 21. (4.28) Using Lemma 8.1, it follows from (4.28) that 2""; 9 1+5 +5 hn( )E[IWNI2 ] = {loll P1f(x|A' = 1) 11m N N~>oo + IG2I2+€P;f(xIA' = 2)} f K2+E(y) dv(y). (4.29) Since the conditions on the kernel K guarantee that [C-l, p. 185] f K2+€(y) dv(y) < m, (4.29) implies that 48 1 o < lim h"( +5)12[|w - E[W ]|2*5] < m (4.30) N N N N+oo and o < lim h“ VAR[W ] < m. (4.31) N+ua N N But qN may be written as hn(1+€/2) n(1+E)E[ W - E[W ] 2+5] q...) . N 1N I N N ' . (4.32) Ng/2h§(l+€) (ha VAR[WN])“‘3/2 Thus qN + O as N + w because the first term of (4.32) goes to zero as N + w while the numerator and denominator of the second term remain finite. ‘ End of proof. Theorems 4.1 and 4.2 provide justification for using a normal approximation to Pr[ON(x) :_O] in (4.13). Specifically, for large N . -N%r[wN(x)1 Pr[DN(x) :_O] 3 l - 4 L (VAR[WN(X)])2 N%EIWN(x)J = 4 1 . (4.33) (VARth(x)])e Thus for large N the expected risk may be approximated by N%E[wN(x)] RN(P, f) 2 P11.21 + P7L22 — f D(x) 4 1 dv(x). (4.34) (VAR [WN (X) ] )/2 Even with the large sample approximation, the integral in (4.33) is a formidable expression that cannot be easily evaluated. In the next section, three examples are studied in an attempt to evaluate the effects of an imperfect teacher. For these examples, (4.34) is evalu- ated numerically to study the learning algorithm. 49 4.3 EXAMPLES OF LEARNING Example 1 - Normal Distributions In the first example to be studied, the patterns are assumed to have univariate normal distributions. Under pattern class mi the patterns are normally distributed with mean “i and variance 02. Since the density functions are continuous, the class of estimators described in Section 3.5 will be used for learning a decision rule. The kernel for the estimator is taken to be ..1/ KCY) = (2“) 2 eXp (-y2/2). (4.35) The 3N function is then _ 2 chx. y) = 1 . eXp (- M). (4.36) . hN(2n)/2 2h§ The first and second moments of gN are needed in order to evaluate the approximation for the expected risk. The conditional mean is given by 1 -/2 E[gN(x, X)|A = 1] = f h§1(2n) exp {- (x - z)2/2h§} 1 “’2 o'1(2n) exp {- (z - pi)2/202} dz. (4.37) Completing the squares in the exponentials and integrating the resulting expression gives 1 -/2 EIchx. 
X)|A = i] = [2"ch2 + 1113)] exp {—acx -u-1)2/(02 + 1113)}. (4.38) A similar evaluation for the second moment shows that ., , _; E[gfi(x, X)IA = i] = (ZnhN)"1(202 + hfi) 2 exp {4%(x - u.)2/(o2 + h§/2)}. 1 (4.39) The Bayes discriminant function for this example is found from 50 (4.3) to be D(X) = P1(L;)1 - I:11)(207)-l/2 CXp {"‘ (X "' U1)2/202} - P2(L12 - L22)(202)"% exp {- (x - u2)2/202}- (4.40) The Bayes risk for this example was derived in (2.25). The above relations in conjunction with (4.18) and (4.19) can be used to evaluate, via numerical integration, the approximation for the expected risk given by (4.34). The results of evaluating this approxi— mation are presented in Figures 4.1 thru 4.4. For this example the following parameter values were used: “1 = 1.5, u2 = —l.5, and 02 = 1.0. A zero-one loss function was used for all cases shown. The parameter hN was chosen to have the form hN = N_a. Following the rule of thumb suggested in Section 3.5, a was chosen to be 1/3. Figures 4.1 and 4.2 present the expected risk plotted as a function of 8(811 = 822 = B) for the prior probability of pattern class ml equal to 0.5 and 0.1 respectively. Figure 4.3 shows the same case as Figure 4.1 except that 811 = 1.0. Figure 4.4 shows the expected risk plotted as a fUnction of the number of training patterns. These figures illustrate the effects of an imperfect teacher. Basically, the im- perfect teacher slows down the rate of convergence. For B > 0.9 the effect of the imperfect teacher is small, but for B < 0.7, say, con- vergence of the algorithm is much slower. The dotted line in Figures 4.1, 4.2, and 4.3 is the teacher's risk. This risk is just the teacher's probability of error which is equal to 1 - B. The figures show that for small 8, the learning algorithm becomes better than the teacher as the number of training patterns increases. oé m.o n Hm .mcoflusnfihpmflo HmEHoz :pflz xmwm wouoomxm .H.v oysmwm 3V momma 5.23 as as no so no A u q A _ Coo / 14 Va; 3:3an to \/ Va; rfizuqfi / 82 u z / / OOH N Z l N.O / / / / 1 m.o muwmmugm oHuz / . H / m o u a / o.H u Hms u was 1 3.0 m.o )ISI‘d 03133dX3 52 ~.o u Hm .mcofiusnfiaumfio HmEHoz :pflz xmflm wouoomxm .N.v oHDMfim iv ”5% E53: 04 ad mg no 9o mic / . . a . 0.0 Illllllllllllulllulllullllllndmalmfia / / no oooH u z 1 / / on: 95:23 \// 02 u z / . III .JN o / / // / $5 an Nun... 2m / . H [II n a H o S u z / o.H 1 HN4 1 man Ill - - // 1: o u NN._ u :4 / / / / m5 NSIH 03133dX3 53 Nam x Ham .m:0ap:nfiupmfim HmEuoz spa: xmfim wmpoomxm .m.v masmwu ANNE ENE E139 o; as ma no so no . a a a 0.0 / / IIIIV lllnllllullllllllllllllllllllllulnllnll. 1 Xmam mm>014 2] 4x(hN) 4(2x mum—W 1 4(x 1)¢(--————.hN 1 +4hN [exp {1‘21} .. 2 exp {W} + exp {W}] (4042) - 2h lé 2 (2n) ZhN ZhN N S6 ., 2 /2‘ /2’ - 0.5 Elgh(x1 XllA = 1] = [X¢(——5J - (2x - 1)¢( (x . )) hN/JF hN hN + (x - 1.0)('/§(x " 1'91)] +l-[9XP {33;} hN Tr hN _ _ 2 _ _ 2 - 2 exp { (x 0'5) } + exp { (X 119) }] (4.43) h2 2 N N E19136. X)|A = 21 = $141495) - (2x + 114(59‘ " 0'53) hN n hN hN /2 + 1.0 - 2 + (x + 1.014( (11 )1] + -1-[exp {'35—} hN 7‘ hN .. + 7. _ + 2 - 2 exp { (x ?;5) } + exp { (x 1°Ol‘}]. (4.44) l 2 hN Equations (4.41) thru (4.44) can now be used along with (4.18) and (4.20) to evaluate numerically the expected risk. Figures 4.6 and 4.7 present the results of this evaluation using a zero-one loss function and equal prior probabilities. The results of the simulation in [8-5] are also shown on Figure 4.7. There is reasonable agreement between the simulation and the large sample approximation. 
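The kind of numerical evaluation behind Figures 4.1 through 4.4 can be reproduced along the following lines. This Python sketch is not part of the thesis; it assumes the parameter values of Example 1 (mu1 = 1.5, mu2 = -1.5, sigma^2 = 1, alpha = 1/3), a zero-one loss with beta11 = beta22 = beta, and a simple Riemann sum for the integral in (4.34). The simplification H1 = H2 = (2*beta - 1)^(-2) used below follows from (4.21)-(4.22) and the inverse of the 2 x 2 B matrix for this symmetric zero-one case; it is an assumption of the sketch, not a formula quoted from the text.

```python
import numpy as np
from math import erf, sqrt, pi

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * pi * var)

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def expected_risk_normal(N, beta, p1=0.5, mu1=1.5, mu2=-1.5, var=1.0, alpha=1.0 / 3.0):
    """Large-sample approximation (4.34) of the expected risk R_N for
    Example 1: univariate normal classes, Gaussian kernel, h_N = N^(-alpha),
    zero-one loss, beta11 = beta22 = beta."""
    h = N ** (-alpha)
    p2 = 1.0 - p1
    x = np.linspace(-8.0, 8.0, 2001)               # integration grid
    dx = x[1] - x[0]

    # Bayes discriminant (4.40) with zero-one loss: D(x) = P1 f(x|1) - P2 f(x|2)
    D = p1 * normal_pdf(x, mu1, var) - p2 * normal_pdf(x, mu2, var)

    # Conditional moments of g_N(x, X), equations (4.38) and (4.39)
    Eg1 = normal_pdf(x, mu1, var + h * h)
    Eg2 = normal_pdf(x, mu2, var + h * h)
    Eg1sq = np.exp(-(x - mu1) ** 2 / (2.0 * var + h * h)) / (2.0 * pi * h * np.sqrt(2.0 * var + h * h))
    Eg2sq = np.exp(-(x - mu2) ** 2 / (2.0 * var + h * h)) / (2.0 * pi * h * np.sqrt(2.0 * var + h * h))

    # Moments of W_N(x): the mean (4.18) does not involve the teacher; for
    # this symmetric zero-one case H1 = H2 = 1/(2*beta - 1)^2 (see lead-in).
    mean_W = p1 * Eg1 - p2 * Eg2
    H = 1.0 / (2.0 * beta - 1.0) ** 2
    var_W = np.maximum(H * (p1 * Eg1sq + p2 * Eg2sq) - mean_W ** 2, 1e-300)

    # Pr[D_N(x) >= 0] ~ Phi(sqrt(N) E[W_N] / sqrt(Var[W_N])), equation (4.33)
    phi = np.array([normal_cdf(z) for z in sqrt(N) * mean_W / np.sqrt(var_W)])

    # R_N ~ P1*L21 + P2*L22 - integral of D(x)*Phi(.), equation (4.34);
    # with the zero-one loss the constant term is just P1.
    return p1 - float(np.sum(D * phi) * dx)

# e.g. expected_risk_normal(100, 0.7) can be compared with the
# perfect-teacher value expected_risk_normal(100, 1.0).
```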
Example 3 - Binomial Distributions For the final example, suppose that the patterns are discrete random variables that have binomial distributions under both pattern classes, f(xIA = j) = (919391 - ej)c"‘, x = 0, 1, ..., c, j = 1, 2.(4.45) For this discrete case, the estimator 6N is formed according to (4.4) with the gN function defined by (3.37). 57 o.H mofluwmcoo Hmasmcmflhh :ufiz xmflm wouoomxm av mos; ESE .o.¢ magma; m.o m.o “.0 o.o m.o . _ I] d 0.0 ooofi 2 xxx 1 H.o 002 z 1 N.o OH 1 Q1 NNQ 1 Zn m.o 1 HQ 1 m o o.H 1 Hms 1 mas o 1 was 1 Has 1. «.0 m.o NSIH 03133dX3 58 newumfisefim paw cowumsfixOHam< onEmw ownmq mo comfiummeou .n.v oHSMfim § 3% 55,9 o.H m.o w.o 5.0 0.0 m.o ql‘fl _’ 4 _ I II II: ,1, onpquxomaa< mqazoo with Q = f K2(y) dv(y) and H1 and “2 defined by (4.21) and (4.22) respectively. PROOF: Lemma B.l implies that at every continuity point x of the density functions lim hn E[g2(x, X)|A = k] = Qf(xlA = k). (4.56) N-Hzo N N Now from (4.4) Nhg VAR[6N(x)] = hfi E[w§(x)] — h; E2[WN(x)]. (4.57) Equation (4.54) and the fact that hN + 0 as N + w imply that the second term of (4.57) converges to zero as N + w. Thus substituting (4.20) into (4.57) and using (4.56) gives the desired conclusion. End of proof. In view of Lemmas 4.1 and 4.2, it would seem reasonable to expect that DN(x) is asymptotically normal with mean D(x) and variance 0% = [H1P1f(xIA = 1) + H2P2f(xIA = 2)]Q/Nh“. (4.58) Showing that such a condition holds requires some further restrictions on the rate of convergence of hN to zero. Lemma 8.2 leads to the neces- sary restrictions. In view of Theorem 4.2, it is clear that ¢(a) (4.59) 1im Pr[(DN(x) — D(x))/0N :.a] N+oo provided VAR[6N(x)]/o§ +-1 as N + m (4.60) and 63 (E[DN(x)] - D(x))/0N + 0 as N + m. (4.61) .Lemma 4.2 guarantees that (4.60) is satisfied. Also 515 (X) - D(x)] h'2{E 6 (x)] - D( )) -% N = (Nhn+4)%, N [ N x Q 1 . (4.62) 0N N [H1P1fcxlA = 1) + H2P2f(x|A = 2)]/2 So from (4.4), (4.18) and Lemma B.2, it follows that condition (4.61) will be satisfied if Nh3+” + 0 as N + m. This is then the additional condition on hN required for proving that 6N is asymptotically normal in the sense of the following theorem. THEOREM 4.3. In the continuous case if 6N is defined by (4.4), if f(x|A = k), k = 1, 2, have continuous partial derivatives of third order a.e., and if hN satisfies the conditions Nh§+u + 0, Nhg + w as N + m, (4.63) then DN(x) is asymptotically normal in the sense that, for every real number a, 1im Pr[(DN(x) — D(x))/ON(X) :_a] = 4(a) (4.64) N+oo with oN(x) defined by (4.58). In the remainder of this chapter, the conditions of Theorem 4.3 are assumed to be satisfied. In particular hN is taken to be hN = N'a, 1/(n + 4) < a < l/n. For sufficiently large N, the expected risk may be approximated by RN(P, f) z P1L21 + P2L22 - f 0(x)4(t(x)) de(x) (4.65) with t(x) = NCl'na)/20(x)Q'5[H1plf(xIA = 1) + H2P2f(x|A = 2)]-%. (4.66) 64 In the three examples in Section 4.3 it was observed that the expected risks were decreasing functions of N and 8. The following two theorems show that, as one would expect for large N, RN is always a decreasing function of N, 811, and 822. THEOREM 4.4. For large N, the expected risk RN is a strictly decreasing function of N. PROOF: Treat N as a real variable. Then from (4.65) it follows that -f (2n)-%D(x) exp {-t2(x)/2}8t(x)/8N dv(x) BRN/BN ..1 ..1 /2 /2 -%(2n) (1 - na)N'na/2 I D2(X) exp {‘t2(X)/2}Q - [H1P1f(xIA = 1) + H2P2f(xIA = 2)]‘1/2 dv(x) < 0 (4.67) since (1 - na) > O for a < l/n and the integrand is always nonnegative. End of proof. 
THEOREM 4.5. For large N, the expected risk R_N is a strictly decreasing function of β_11 and β_22.

PROOF: For large N, (4.65) implies that

∂R_N/∂β_11 = -∫ (2π)^{-1/2} D(x) exp{-t^2(x)/2} ∂t(x)/∂β_11 dν(x).   (4.68)

But

∂t(x)/∂β_11 = -(1/2) N^{(1-na)/2} D(x) {(∂H_1/∂β_11) P_1 f(x|Λ = 1) + (∂H_2/∂β_11) P_2 f(x|Λ = 2)} / {Q^{1/2}[H_1 P_1 f(x|Λ = 1) + H_2 P_2 f(x|Λ = 2)]^{3/2}}.   (4.69)

By differentiating (4.21) and (4.22) with respect to β_11, one can show that

∂H_1/∂β_11 = -[1 - β_11 + β_22(2β_11 - 1)](L_21 - L_11 + L_12 - L_22)^2 / (β_11 + β_22 - 1)^2 < 0   (4.70)

∂H_2/∂β_11 = -[(1 - β_11)(1 - β_22) + β_11 β_22](L_21 - L_11 + L_12 - L_22)^2 / (β_11 + β_22 - 1)^2 < 0.   (4.71)

Substitution of (4.69), (4.70), and (4.71) into (4.68) then shows that ∂R_N/∂β_11 < 0. Thus R_N is a strictly decreasing function of β_11. A similar argument holds for β_22. End of proof.

One should note that even though Theorems 4.4 and 4.5 were stated and proved for the continuous case, they also hold for the discrete case. This follows immediately by comparing (4.64) with (4.23) and (4.24).

4.5 A MEASURE OF PERFORMANCE

The problem that one encounters in trying to develop any quantitative measure of the effects of the imperfect teacher is that the behavior of the learning algorithm depends upon several factors: a.) the class conditional densities, b.) the prior probabilities, c.) the loss function, d.) the B matrix, and e.) the form of the g_N function. Any measure of performance which one defines will generally be a function of all of these parameters. What one would like for comparing a perfect teacher with an imperfect teacher is a measure which is insensitive to all of the parameters except the B matrix.

A second problem that one encounters is that an analytic evaluation of the expected risk is virtually impossible except in special cases. Even then, numerical methods and the large sample approximations developed in the previous sections must be used. In view of these considerations, the measure of performance proposed in this section will also be evaluated from the large sample approximations.

The criterion proposed here for comparing an imperfect teacher to a perfect teacher is defined as follows. For a given number N_0 of training patterns, the nonparametric learning procedure with a perfect teacher has some expected risk, call it R_{N_0}. Define N(B, N_0) to be the number of training patterns required for the learning algorithm with an imperfect teacher to have the same expected risk. Then a measure of relative performance may be defined as

η(B, N_0) = N(B, N_0)/N_0.   (4.72)

In general this measure of relative performance will be a function of N_0 as well as a function of the five factors listed earlier. But η does provide a quantitative measure of the effect of the imperfect teacher in the sense that it measures the additional number of training patterns required to compensate for the imperfect teacher.

The dependence of (4.72) on N_0 may be removed by defining an asymptotic measure of performance according to

η(B) = lim_{N_0→∞} η(B, N_0)   (4.73)

provided the limit exists. In general the evaluation of (4.73) is an intractable problem. But for one very important case, evaluation of (4.73) is rather direct.

Consider the case when a zero-one loss function is used and β_11 = β_22 = β. Then for large N, the expected risk for the continuous case is given by (4.65). Now suppose an N_0 is chosen and (4.65) is evaluated for the perfect teacher (β = 1) to obtain R_{N_0}.
Let R_N(β) denote the expected risk when using N training patterns with an imperfect teacher that has a probability β of correctly classifying a training pattern. To find N(β, N_0), one must then solve the equation

R_N(β) = R_{N_0}(1)   (4.74)

for N. One way in which (4.74) is satisfied is for the function t(·) of (4.66) to be the same for both R_N and R_{N_0}. But this will occur if

(2β - 1) N^{(1-na)/2} = N_0^{(1-na)/2}.   (4.75)

Thus

N(β, N_0) = N_0 / (2β - 1)^{2/(1-na)}.   (4.76)

Since by Theorems 4.4 and 4.5 R_N is a strictly decreasing function of N and β, (4.76) is the unique solution of (4.74). Hence in this case the measure of relative performance is

η(β) = (2β - 1)^{-2/(1-na)}.   (4.77)

This measure of performance has the pleasant property that it is not a function of the class-conditional densities or prior probabilities, and it depends upon the g_N function only through the parameter a. Thus for a zero-one loss function, η(β) in (4.77) is interpreted as the proportionate number of training patterns required for an imperfect teacher to yield the same expected risk as a perfect teacher.

Even when β_11 ≠ β_22 but a zero-one loss function is still used, (4.77) provides a meaningful bound. Since R_N is a decreasing function of β_11, β_22, and N,

η(max(β_11, β_22)) ≤ η(β_11, β_22) ≤ η(min(β_11, β_22)).   (4.78)

Thus using min(β_11, β_22) in (4.77) yields an upper bound on the actual measure of relative performance, and using max(β_11, β_22) gives a lower bound.

In Figure 4.9 the measure η(β) is plotted as a function of β for n = 1 and a = 0.2, 0.33, 0.5, and 0.65. The figure shows, for example, that with a = 1/2 the algorithm requires roughly 2.5 times as many training patterns when β = 0.9 as compared to a perfect teacher. When β = 0.8, 7.6 times as many training patterns are required; and when β = 0.7, nearly 40 times as many training patterns are required for the imperfect teacher to match the performance of the perfect teacher. Thus a poor teacher is costly in the sense that many more training patterns are required to achieve a performance equal to that achieved with a perfect teacher.

[Figure 4.9. Measure of Relative Performance]
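Equation (4.77) is simple enough to tabulate directly. The short sketch below (Python, assuming n = 1 as in Figure 4.9) reproduces the factors quoted above for a = 1/2; it is an illustration added here, not a computation taken from the thesis.

    def eta(beta, n=1, a=0.5):
        """Relative training-set size (4.77): patterns needed with an imperfect teacher
        (beta_11 = beta_22 = beta, zero-one loss, h_N = N^(-a)) relative to a perfect teacher."""
        return (2.0 * beta - 1.0) ** (-2.0 / (1.0 - n * a))

    for beta in (0.9, 0.8, 0.7):
        print(beta, round(eta(beta), 1))   # about 2.4, 7.7, 39.1 -> the 2.5x, 7.6x, 40x factors above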
4.6 COST OF TRAINING

The idea of a cost associated with an imperfect teacher can be further developed by assuming that the teacher incurs a cost in classifying each training pattern. Assume that the cost of classifying a training pattern is an increasing function of β_11 and β_22. Let T(B) denote this cost function.

Suppose that one wishes to attain a given expected risk R* with minimum cost. If a perfect teacher is used, N* training patterns will be required to achieve R* at a total cost of N*T(I), I being the unit matrix. For an imperfect teacher, the number of training patterns required to achieve R* is approximately N*η(B, N*) and the cost incurred is

C = N* η(B, N*) T(B).   (4.79)

Differentiating (4.79) with respect to β_ii and equating to zero gives

N*[(∂η(B, N*)/∂β_ii) T(B) + η(B, N*) (∂T(B)/∂β_ii)] = 0,   i = 1, 2.   (4.80)

Solving this set of simultaneous equations for β_11 and β_22 gives the B matrix that minimizes C whenever the solution satisfies 1/2 < β_ii < 1.

As an example, suppose that a zero-one loss function is used and that β_11 = β_22 = β. Then η(β) is given by (4.77). Let the cost of classification have the form (see Figure 4.10)

T(β) = a_1 + a_2(2β - 1)^ξ.   (4.81)

Then

dC/dβ = 2N*(2β - 1)^{-2/(1-na) - 1} {a_2[ξ - 2/(1 - na)](2β - 1)^ξ - 2a_1/(1 - na)}   (4.82)

and the minimum cost occurs at

β* = 1/2 + (1/2) [2a_1 / (a_2[ξ(1 - na) - 2])]^{1/ξ}   (4.83)

provided 1/2 < β* < 1. Otherwise the minimum occurs for a perfect teacher. From (4.83) it follows that a perfect teacher is best whenever

ξ ≤ 2(1 + a_1/a_2)/(1 - na).   (4.84)

Figure 4.11 shows the relative cost plotted as a function of β for several parameter values. For this figure n = 1 and a = 1/3. Figure 4.11 indicates that when there is a cost associated with classifying the training patterns, a perfect teacher is not necessarily the best. An imperfect teacher may provide an acceptable level of learning at a lower cost than a perfect teacher.

[Figure 4.10. Teacher's Cost for Classification]

[Figure 4.11. Relative Cost of Training]
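The cost trade-off of (4.79)-(4.84) can be checked numerically. The sketch below (Python, assuming n = 1, a = 1/3, and the cost form (4.81); the values chosen for a_1, a_2, and ξ are illustrative only, not the figure's exact parameters) computes the cost-minimizing teacher quality β* of (4.83) and compares the relative cost C/N* at β* with that of a perfect teacher.

    def relative_cost(beta, a1, a2, xi, n=1, a=1.0/3.0):
        """Relative cost C/N* = eta(beta) * T(beta), from (4.77), (4.79), and (4.81)."""
        eta = (2.0 * beta - 1.0) ** (-2.0 / (1.0 - n * a))
        T = a1 + a2 * (2.0 * beta - 1.0) ** xi
        return eta * T

    def beta_star(a1, a2, xi, n=1, a=1.0/3.0):
        """Interior minimizer (4.83); meaningful only when xi > 2(1 + a1/a2)/(1 - n*a), cf. (4.84)."""
        return 0.5 + 0.5 * (2.0 * a1 / (a2 * (xi * (1.0 - n * a) - 2.0))) ** (1.0 / xi)

    a1, a2, xi = 0.01, 1.0, 6.0        # xi exceeds 2(1 + a1/a2)/(1 - na) = 3.03, so (4.83) applies
    b = beta_star(a1, a2, xi)
    print(b, relative_cost(b, a1, a2, xi), relative_cost(1.0, a1, a2, xi))
    # Here the imperfect teacher (beta* near 0.73) reaches the target risk at a noticeably
    # lower total cost than the perfect teacher, the behavior summarized by Figure 4.11.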
CHAPTER V

CONCLUSIONS

This chapter summarizes the main results of the thesis and discusses possibilities for future research.

5.1 SUMMARY

This thesis has been concerned with studying nonparametric procedures for learning to recognize patterns with an imperfect teacher. The statistical approach to pattern recognition was taken, and statistical decision theory was used as the tool for developing and evaluating learning algorithms. The objective was to study procedures for learning a Bayes decision rule using training patterns, some of which were misclassified by an imperfect teacher.

In Chapter II, the general M-class statistical pattern recognition problem was outlined and a model for an imperfect teacher was proposed. The imperfect teacher was characterized by a matrix of conditional error probabilities. The main result of Chapter II was the development of a set of expressions relating the probability distributions of the perfect teacher with those of the imperfect teacher. These relations were one of the key factors used for developing in Chapter III learning procedures that required less restrictive assumptions than in previous studies. An example was also presented to show the asymptotic effects of using an imperfect teacher with an algorithm designed for supervised learning. It was concluded that misclassified training patterns could prevent the convergence of supervised learning algorithms and that procedures are needed for use specifically with imperfect teachers.

In Chapter III a class of nonparametric procedures was proposed for learning with an imperfect teacher. The procedures required prior knowledge only of the nonsingular matrix of error probabilities characterizing the teacher and of whether the pattern random variable was discrete or continuous. The main result of the chapter was formal proofs of the convergence of the estimated decision rules to the Bayes rule in both the discrete case and the continuous case. The procedures were asymptotically optimal in the sense that the expected risk of the decision procedures converged to the Bayes risk with increasing number of training patterns. Theorems were also given establishing rates of convergence of the expected risk. These rate theorems provided guidelines for choosing the parameters in the estimators.

The two-class problem was studied in more detail in Chapter IV. The expected risk was expressed in a convenient form as a function of the Bayes discriminant function and of the probability that the estimated discriminant function was nonnegative. A large sample approximation was then developed to evaluate the expression for the expected risk. This approximation was used to study three examples of learning. The examples indicated that an imperfect teacher could have a significant effect on the rate of learning. When the teacher's error rate was high, learning occurred at a much slower rate than when a perfect teacher was used. The examples also indicated that the learning procedure would eventually perform better than a poor teacher.

A second large sample approximation was derived and used to establish several large sample properties of the expected risk. It was shown that the expected risk was a strictly decreasing function of the number of training patterns and of the teacher's correct-classification probabilities β_ii.

A measure of relative performance was then proposed for quantitatively comparing an imperfect teacher with a perfect teacher. This performance measure was a measure of the additional number of training patterns required to compensate for an imperfect teacher. The measure was readily evaluated for the case of a zero-one loss function. For this important case, the measure was not a function of the underlying probability distributions. Thus it provided a very useful criterion for evaluating the effects of the imperfect teacher. It was concluded that a poor teacher was costly in the sense that many more training patterns were required to achieve the same expected risk as obtained with a perfect teacher.

By assigning a cost for classifying the training patterns, an overall cost of an imperfect teacher was derived. It was shown that when the cost of training is proportional to the quality of the teacher, an imperfect teacher may be preferred. By using an imperfect teacher, a learning system may be able to achieve a given level of performance at less overall cost than with a perfect teacher.

5.2 EXTENSIONS

The work in this thesis suggests several possibilities for future research. The concept of an imperfect teacher can be extended to many areas of pattern recognition.

This thesis has been concerned with one general class of learning procedures. One can also look at several other types of supervised learning algorithms and attempt to develop analogous algorithms for learning with an imperfect teacher. For example, one might develop algorithms based on stochastic approximation, the method of potential functions, the k-nearest neighbor rule, or the various mean-square estimation methods for discriminant functions. All of these techniques for supervised learning will be affected by an imperfect teacher, and consequently they should be altered for use with an imperfect teacher.

The model proposed in Section 2.2 for the imperfect teacher could also be developed further. A natural extension would be to assume that the teacher's error probabilities are a function of the observed values of the training patterns. One might also consider a situation in which the teacher improves with time and experience.

It was assumed throughout this thesis that the matrix of error probabilities characterizing the teacher was known. An interesting problem would be to investigate conditions under which learning could occur without knowledge of the matrix of error probabilities. More structure would necessarily have to be assumed known about the probability distributions. This problem would be very close to one of unsupervised learning.
In Sections 4.5 and 4.6 measures of performance were proposed for quantitatively evaluating the effects of an imperfect teacher. These measures were chosen because they were intuitively appealing and mathematically tractable. Further work is needed in developing other measures of performance. Also the idea of a cost for the teacher could be pursued further.

Finally, one might consider ways of gradually phasing the teacher out of the learning process. It was observed in the examples of Section 4.3 that the learning procedures eventually perform better than a poor teacher. One wonders whether the learned decision rules could eventually be used to classify the training patterns and thus eliminate the teacher. This might speed up the learning process since the learning procedure eventually has a smaller expected risk than the teacher. Eliminating the teacher would also be desirable when there is a cost of training associated with the teacher as in Section 4.6.

BIBLIOGRAPHY

[A-1] N. Abramson and D. Braverman, "Learning to recognize patterns in a random environment," IRE Trans. Inform. Theory, vol. IT-8, pp. 58-63, 1962.

[A-2] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, "Method of potential functions in the problem of restoration of functional converter characteristic by means of points observed randomly," Autom. and Remote Control, vol. 25, pp. 1705-1714, 1964.

[B-1] C. C. Blaydon, "Recursive algorithms for pattern classification," Rept. No. 520, Division of Engineering and Applied Physics, Harvard Univ., Cambridge, Mass., 1967.

[B-2] I. Bross, "Misclassification in 2 x 2 tables," Biometrics, vol. 10, pp. 478-486, 1954.

[C-1] T. Cacoullos, "Estimation of a multivariate density," Ann. Inst. Statist. Math., vol. 18, pp. 179-189, 1966.

[C-2] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inform. Theory, vol. IT-13, pp. 21-27, Jan. 1967.

[F-1] W. Feller, An Introduction to Probability Theory and Its Applications, vol. II. New York: Wiley, 1966.

[F-2] T. S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press, 1967.

[F-3] S. C. Fralick, "Learning to recognize patterns without a teacher," IEEE Trans. Inform. Theory, vol. IT-13, pp. 57-65, Jan. 1967.

[F-4] K. S. Fu, Sequential Methods in Pattern Recognition and Machine Learning. New York: Academic Press, 1968.

[F-5] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic Press, 1972.

[H-1] Y. C. Ho and A. K. Agrawala, "On pattern classification algorithms, introduction and survey," Proc. IEEE, vol. 56, pp. 2101-2114, Dec. 1968.

[H-2] Y. C. Ho and R. L. Kashyap, "An algorithm for linear inequalities and its applications," IEEE Trans. Electronic Computers, vol. EC-14, pp. 683-688, Oct. 1965.

[I-1] IEEE Transactions on Computers, vol. C-20, Sept. 1971 (special issue on feature extraction).

[I-2] B. G. Ishaku, "Feature extraction and φ-function selection in φ-systems," Ph.D. dissertation, Mich. State Univ., East Lansing, Mich., Oct. 1971.

[K-1] L. N. Kanal, Ed., Pattern Recognition. Washington, D.C.: Thompson, 1968.

[L-1] P. A. Lachenbruch, "Discriminant analysis when the initial samples are misclassified," Technometrics, vol. 8, pp. 657-662, Nov. 1966.

[L-2] M. Loeve, Probability Theory, 2nd ed. Princeton, N.J.: Van Nostrand, 1955.

[M-1] W. S. Meisel, "Potential functions in mathematical pattern recognition," IEEE Trans. Comput., vol. C-18, pp. 911-918, Oct. 1969.
[M-2] J. M. Mendel and K. S. Fu, Ed., Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications. New York: Academic Press, 1970.

[M-3] A. M. Mucciardi and E. E. Gose, "A comparison of seven techniques for choosing subsets of pattern recognition properties," IEEE Trans. Comput., vol. C-20, pp. 1023-1031, Sept. 1971.

[M-4] V. K. Murthy, "Nonparametric estimation of multivariate densities with applications," in Proc. Int. Symp. on Multivariate Analysis, P. Krishnaiah, Ed. New York: Academic Press, 1966, pp. 43-56.

[N-1] G. Nagy, "State of the art in pattern recognition," Proc. IEEE, vol. 56, pp. 836-862, May 1968.

[N-2] N. J. Nilsson, Learning Machines. New York: McGraw-Hill, 1965.

[P-1] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Statist., vol. 33, pp. 1065-1076, 1962.

[P-2] E. A. Patrick and J. C. Hancock, "Nonsupervised sequential classification and recognition of patterns," IEEE Trans. Inform. Theory, vol. IT-12, pp. 362-372, July 1966.

[P-3] M. L. Puri and P. K. Sen, Nonparametric Methods in Multivariate Analysis. New York: Wiley, 1971.

[R-1] H. Robbins, "The empirical Bayes approach to statistical decision problems," Ann. Math. Statist., vol. 35, pp. 1-20, 1964.

[R-2] H. L. Royden, Real Analysis, 2nd ed. New York: Macmillan, 1968.

[S-1] S. C. Schwartz, "Convergence of risk in adaptive pattern recognition procedures," Proc. 5th Allerton Conf. on Circuits and Systems Theory, pp. 800-806, 1967.

[S-2] S. C. Schwartz, "Estimation of probability density by an orthogonal series," Ann. Math. Statist., vol. 38, pp. 1261-1265, 1967.

[S-3] G. S. Sebestyen, Decision-Making Processes in Pattern Recognition. New York: Macmillan, 1962.

[S-4] K. Shanmugam and A. M. Breipohl, "An error correcting procedure for learning with an imperfect teacher," IEEE Trans. Systems, Man, and Cybernetics, vol. SMC-1, pp. 223-229, July 1971.

[S-5] K. Shanmugam, "Learning to recognize patterns with an imperfect teacher," Ph.D. dissertation, Oklahoma State Univ., Stillwater, Okla., 1970.

[S-6] K. Shanmugam, "A parametric procedure for learning with an imperfect teacher," IEEE Trans. Inform. Theory (Corresp.), vol. IT-18, pp. 300-302, March 1972.

[S-7] D. F. Specht, "Generation of polynomial discriminant functions for pattern recognition," IEEE Trans. Electron. Comput., vol. EC-16, pp. 308-319, June 1967.

[S-8] D. F. Specht, "Series estimation of a probability density function," Technometrics, vol. 13, pp. 409-424, May 1971.

[S-9] J. Spragins, "Learning without a teacher," IEEE Trans. Inform. Theory, vol. IT-11, pp. 544-549, April 1966.

[T-1] A. Tenenbein, "A double sampling scheme for estimating from misclassified multinomial data with applications to sampling inspection," Technometrics, vol. 14, pp. 187-202, Feb. 1972.

[V-1] J. Van Ryzin, "Non-parametric Bayesian decision procedures for (pattern) classification with stochastic learning," Fourth Prague Conf. on Inform. Theory, Statistical Decision Functions, and Random Processes, 1965.

[V-2] J. Van Ryzin, "Bayes risk consistency of classification procedures using density estimation," Sankhya, Series A, vol. 28, pp. 261-270, 1966.

[V-3] J. Van Ryzin, "On strong consistency of density estimators," Ann. Math. Statist., vol. 40, pp. 1765-1772, 1969.

[W-1] S. Watanabe, Ed., Methodologies of Pattern Recognition. New York: Academic Press, 1969.

[W-2] A. W. Whitney and S. J. Dwyer, "Performance and implementation of the k-nearest neighbor rule with incorrectly identified training patterns," Proc. 4th Allerton Conf. on Circuit and System Theory, pp. 96-106, 1966.
[W-3] C. T. Wolverton and T. J. Wagner, "Asymptotically optimal discriminant functions for pattern classification," IEEE Trans. Inform. Theory, vol. IT-15, pp. 258-265, March 1969.

[Y-1] S. J. Yakowitz, "Unsupervised learning and the identification of finite mixtures," IEEE Trans. Inform. Theory, vol. IT-16, pp. 330-338, May 1970.

APPENDICES

APPENDIX A

OPTIMAL DECISION RULES

This appendix outlines the elements of statistical decision theory that are used in this thesis. The results, which are taken from the literature [F-2], [R-1], are presented from the viewpoint of pattern classification.

A problem in statistical decision theory consists of the following basic elements:

a.) A state space Ω with generic element ω representing a "state of nature."
b.) An action space A with generic element α representing an action available to the decision maker.
c.) A loss function L(α, ω) defined on A × Ω and representing the loss incurred in taking action α when the true state of nature is ω.
d.) An observable random variable X belonging to a space T on which a σ-finite measure ν is defined. When the true state of nature is ω, X has a specified probability density f(·|ω) with respect to ν.

The type of decision making problem of concern here is the statistical classification problem. In classification problems the state space is a finite set, Ω = {ω_1, ω_2, ..., ω_M}. The action space is A = {α_1, α_2, ..., α_M}, where action α_i is "say ω_i was active to produce the observed X."

The decision rules that are convenient for classification problems are the so-called behavioral decision rules [F-2, pp. 198]. Let δ(x) = (δ_1(x), ..., δ_M(x)) be a vector of real-valued measurable functions on T such that Σ_{j=1}^{M} δ_j(x) = 1, δ_j(x) ≥ 0. Denote the class of all such functions by Δ. Then any δ ∈ Δ is a behavioral decision rule with δ_j(x) = Pr[α_j|x] identified as the conditional probability of classifying X as coming from ω_j given that X = x is observed. In other words, a behavioral decision rule assigns to each x ∈ T a probability distribution on A.

A Bayesian strategy for classification involves the notion of a prior distribution on the state space Ω. This distribution is denoted by P = (P_1, P_2, ..., P_M) with P_i being the prior probability assigned to ω_i. For any decision rule δ ∈ Δ, the average risk associated with state of nature ω_i ∈ Ω is defined as

r(δ, ω_i) = E[Σ_{j=1}^{M} L(α_j, ω_i) δ_j(X) | ω_i]   (A.1)

where E[·|ω_i] denotes expectation with respect to the conditional density f(·|ω_i). The overall risk of misclassification with decision rule δ under the prior distribution P and conditional densities f = (f(·|ω_1), ..., f(·|ω_M)) is defined by

R(P, f, δ) = Σ_{i=1}^{M} P_i r(δ, ω_i).   (A.2)

A decision rule δ_B ∈ Δ is said to be a Bayes rule with respect to the prior distribution P if and only if it satisfies

R(P, f, δ_B) = inf_{δ ∈ Δ} R(P, f, δ).   (A.3)

Thus a Bayes decision rule minimizes the overall risk of misclassification. Substituting (A.1) into (A.2) gives

R(P, f, δ) = Σ_{i=1}^{M} P_i ∫_T Σ_{j=1}^{M} L(α_j, ω_i) δ_j(x) f(x|ω_i) dν(x)
           = Σ_{j=1}^{M} ∫_T [Σ_{i=1}^{M} P_i L(α_j, ω_i) f(x|ω_i)] δ_j(x) dν(x).   (A.4)

A choice of δ minimizing the risk in (A.4) is seen to be any behavioral decision rule δ_B = (δ_B1, δ_B2, ..., δ_BM) satisfying

δ_Bj(x) = 1    if D_j(x) < min_{k≠j} D_k(x)
        = 0    if D_j(x) > min_{k≠j} D_k(x)
        = γ_j  if D_j(x) = min_{k≠j} D_k(x)   (A.5)

where

D_k(x) = Σ_{i=1}^{M} P_i L(α_k, ω_i) f(x|ω_i),   k = 1, 2, ..., M   (A.6)

and where {γ_j} is such that δ_B is a measurable function, γ_j ≥ 0, and Σ_{j=1}^{M} δ_Bj(x) = 1.
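The Bayes rule of (A.5)-(A.6) is direct to implement once the priors, loss matrix, and class-conditional densities are in hand. The sketch below (Python; the two-class normal densities and parameter values are illustrative stand-ins, not taken from the thesis) computes D_k(x) and selects the action with the smallest value, with ties broken arbitrarily, which corresponds to one admissible choice of the {γ_j}.

    import numpy as np
    from scipy.stats import norm

    # Illustrative two-class setup: equal priors, zero-one loss, normal class densities.
    P = np.array([0.5, 0.5])                        # prior probabilities P_i
    L = np.array([[0.0, 1.0],                       # L[k, i] = loss of action a_k when w_i is true
                  [1.0, 0.0]])
    densities = [lambda x: norm.pdf(x, 1.5, 1.0),   # f(x | w_1)
                 lambda x: norm.pdf(x, -1.5, 1.0)]  # f(x | w_2)

    def bayes_action(x):
        """Return the index k minimizing D_k(x) of (A.6)."""
        f = np.array([d(x) for d in densities])
        D = L @ (P * f)                             # D_k(x) = sum_i P_i L(a_k, w_i) f(x | w_i)
        return int(np.argmin(D))

    # Index 0 corresponds to "say w_1", index 1 to "say w_2".
    print([bayes_action(x) for x in (-2.0, -0.1, 0.1, 2.0)])   # expect [1, 1, 0, 0]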
The Bayes risk is the risk incurred by a Bayes decision rule. Substituting (A.6) into (A.4) results in the following expression for the Bayes risk:

R_B(P, f) = Σ_{j=1}^{M} ∫_T D_j(x) δ_Bj(x) dν(x).   (A.7)

The notation in (A.7) emphasizes the dependence of R_B on the prior distribution P and conditional densities f.

The identification of the above statistical classification problem with pattern recognition is immediate. The state space corresponds to the set of pattern classes, T to the feature space, and X to the pattern to be classified. When the pattern is discrete valued, ν is taken to be counting measure; when X is real valued, ν is taken to be Lebesgue measure on Euclidean n-space. The optimal decision rules for pattern recognition are then the Bayes rules given by (A.5).

APPENDIX B

NONPARAMETRIC ESTIMATION OF DENSITY FUNCTIONS

This appendix presents a nonparametric method of estimating a probability density function. The method was first proposed by Parzen [P-1] for estimating a univariate density and later extended to multivariate densities by Cacoullos [C-1] and Murthy [M-4].

Let X_1, X_2, ..., X_N be N independent observations on an n-dimensional random variable X with density function f(x). Let K(y) = K(y_1, y_2, ..., y_n) be a Borel scalar function on Euclidean n-space E_n such that

K(y) ≥ 0   (B.1a)

∫_{E_n} K(y) dy = 1   (B.1b)

sup_{y ∈ E_n} K(y) < ∞   (B.1c)

||y||^n K(y) → 0 as ||y|| → ∞.   (B.1d)

The function K(·) is called the kernel of the estimator. Define a function g_N on E_n × E_n by

g_N(x, y) = (1/h_N^n) K((x - y)/h_N)   (B.2)

with {h_N} a sequence of positive constants satisfying

h_N → 0 as N → ∞   (B.3a)

N h_N^n → ∞ as N → ∞.   (B.3b)

Then the nonparametric estimator for the density function f(x) is defined as

f_N(x) = (1/N) Σ_{i=1}^{N} g_N(x, X_i).   (B.4)

Cacoullos [C-1] has shown that the estimator f_N is asymptotically unbiased and consistent at all points of continuity of f. There are many kernels which satisfy (B.1). Typical examples of univariate kernels are shown in Table B.1. Specht [S-7], [S-8] has investigated methods for approximating the Gaussian kernel (entry 3 of Table B.1) to obtain estimators which have fixed storage requirements.

In this dissertation the density estimator of (B.4) is not directly used; instead, linear combinations of the g_N functions are used to form estimators of discriminant functions. The following two lemmas concerning the g_N function are proven in [C-1]:

LEMMA B.1. Let g_N be a function on E_n × E_n defined by (B.1) thru (B.3). Then at every continuity point of a density function f(x),

lim_{N→∞} h_N^{n(r-1)} E[g_N^r(x, X)] = f(x) ∫_{E_n} K^r(y) dy.   (B.5)

LEMMA B.2. Let g_N be defined as above. If a probability density function f(x) has continuous partial derivatives of third order in a neighborhood of x, then

lim_{N→∞} h_N^{-2}{E[g_N(x, X)] - f(x)} = λ/2   (B.6)

where

λ = Σ_{i=1}^{n} Σ_{j=1}^{n} (∂^2 f(x)/∂x_i ∂x_j) ∫ y_i y_j K(y) dy.   (B.7)

TABLE B.1. UNIVARIATE KERNELS

    K(y)                                        ∫ K^2(y) dy
 1. 1/2 for |y| ≤ 1, 0 for |y| > 1              1/2
 2. 1 - |y| for |y| ≤ 1, 0 for |y| > 1          2/3
 3. (2π)^{-1/2} exp(-y^2/2)                     1/(2√π)
 4. (1/2) exp(-|y|)                             1/4
 5. 1/[π(1 + y^2)]                              1/(2π)
 6. (1/π)(sin y / y)^2                          2/(3π)
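The estimator (B.4) takes only a few lines of code. The sketch below (Python; a univariate illustration written for this appendix, not code from the thesis) uses the Gaussian kernel of Table B.1, entry 3, with h_N = N^{-1/3}, which satisfies (B.3).

    import numpy as np

    def parzen_estimate(x, samples, h):
        """Kernel density estimate (B.4) at the points x from the observations,
        using the Gaussian kernel (Table B.1, entry 3) and bandwidth h."""
        x = np.atleast_1d(x)
        u = (x[:, None] - samples[None, :]) / h                  # (x - X_i)/h_N
        g = np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))   # g_N(x, X_i), n = 1
        return g.mean(axis=1)                                    # average over the N observations

    rng = np.random.default_rng(1)
    N = 1000
    samples = rng.normal(0.0, 1.0, size=N)        # X_1, ..., X_N drawn from N(0, 1)
    h = N ** (-1.0 / 3.0)                         # h_N -> 0 while N h_N -> infinity
    grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    print(parzen_estimate(grid, samples, h))      # should lie close to the N(0, 1) density

The same routine, evaluated class by class and combined linearly with the coefficients of Chapter III, gives the discriminant-function estimators used in the body of the thesis.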
APPENDIX C

PROOF OF THEOREM 3.5

The proof of Theorem 3.5 is given in this appendix. The proof is as follows. The first term in the bound of Lemma 3.2 is

T_1 = Σ_{j=1}^{M} ∫_T L̄_j q_Nj(x) dν(x)   (C.1)

where

q_Nj(x) = E_N[|G_Nj(x) - P_j E[g_N(x, X)|Λ = j]|].   (C.2)

Now from (3.77) it follows that

q_Nj(x) ≤ N^{-1/2} {Σ_{ℓ=1}^{M} (Σ_{i=1}^{M} b_ji^2 β_iℓ) E[g_N^2(x, X)|Λ = ℓ]}^{1/2}.   (C.3)

So

T_1 ≤ N^{-1/2} Σ_{j=1}^{M} L̄_j ∫ {Σ_{ℓ=1}^{M} (Σ_{i=1}^{M} b_ji^2 β_iℓ) E[g_N^2(x, X)|Λ = ℓ]}^{1/2} dν(x)
    ≤ N^{-1/2} Σ_{j=1}^{M} L̄_j {∫ Π_{s=1}^{n}(1 + |x_s|^{1+ε}) Σ_{ℓ=1}^{M} (Σ_{i=1}^{M} b_ji^2 β_iℓ) E[g_N^2(x, X)|Λ = ℓ] dν(x)}^{1/2} {∫ Π_{s=1}^{n}(1 + |x_s|^{1+ε})^{-1} dν(x)}^{1/2}   (C.4)

where the second inequality follows from the Schwarz inequality. Van Ryzin [V-1] has shown that if (3.82b) is satisfied, then

lim_{N→∞} h_N^n ∫ [Π_{s=1}^{n}(1 + |x_s|^{1+ε})] E[g_N^2(x, X)|Λ = ℓ] dν(x) = E[Π_{s=1}^{n}(1 + |X_s|^{1+ε}) | Λ = ℓ] ∫ K^2(y) dν(y) < ∞   (C.5)

where ε = min(ε', 1/n). Since

∫ (1 + |x_s|^{1+ε})^{-1} dx_s < ∞,   (C.6)

(C.5) along with (C.4) implies that

lim_{N→∞} (N h_N^n)^{1/2} T_1 ≤ Σ_{j=1}^{M} L̄_j {Σ_{ℓ=1}^{M} (Σ_{i=1}^{M} b_ji^2 β_iℓ) E[Π_{s=1}^{n}(1 + |X_s|^{1+ε}) | Λ = ℓ] ∫ K^2(y) dν(y)}^{1/2} {∫ Π_{s=1}^{n}(1 + |x_s|^{1+ε})^{-1} dx_s}^{1/2} ≜ Q_1.   (C.7)

Thus for large N

T_1 ≤ Q_1 (N h_N^n)^{-1/2}.   (C.8)

Now consider the second term in the bound of Lemma 3.2 (see equation (3.69)),

T_2 = Σ_{i=1}^{M} P_i ∫ {E[g_N(x, X)|Λ = i] - f(x|Λ = i)} {Σ_{j=1}^{M} L_ji (δ_Bj(x; ν) - E_N[δ_Nj(x)])} dν(x).   (C.9)

Using (3.63) and the definition of the expectation operator gives

T_2 = Σ_{i=1}^{M} P_i ∫ {∫ (1/h_N^n) K((x - y)/h_N) f(y|Λ = i) dν(y) - f(x|Λ = i)} {Σ_{j=1}^{M} L_ji (δ_Bj(x; ν) - E_N[δ_Nj(x)])} dν(x).   (C.10)

Making a change of variable and applying Fubini's Theorem gives

T_2 = Σ_{i=1}^{M} P_i ∫∫ K(z)[f(x - h_N z|Λ = i) - f(x|Λ = i)] Σ_{j=1}^{M} L_ji (δ_Bj(x; ν) - E_N[δ_Nj(x)]) dν(x) dν(z)
    ≤ Σ_{i=1}^{M} P_i ∫∫ K(z) |f(x - h_N z|Λ = i) - f(x|Λ = i)| |Σ_{j=1}^{M} L_ji (δ_Bj(x; ν) - E_N[δ_Nj(x)])| dν(x) dν(z)
    ≤ 2 Σ_{i=1}^{M} L̄_i P_i ∫ K(z) τ(h_N z; f(·|Λ = i)) dν(z)   (C.11)

where the function τ(·;·) is defined by (3.81) and the last inequality follows from the fact that the magnitude of the second term in the integrand is bounded by 2L̄_i. Conditions (3.82a) and (3.82c) then imply that

T_2 ≤ 2C ∫ K(z) ||z||^γ dν(z) Σ_{i=1}^{M} P_i L̄_i h_N^γ ≜ Q_2 h_N^γ.   (C.12)

Thus for large N

ΔR_N ≤ T_1 + T_2 ≤ Q_1 (N h_N^n)^{-1/2} + Q_2 h_N^γ.   (C.13)

If h_N is then chosen as h_N = O(N^{-1/(n+α)}), it follows from (C.13) that

ΔR_N ≤ (Q_1 + Q_2) N^{-γ/(n+α)}      if α > 2γ
ΔR_N ≤ (Q_1 + Q_2) N^{-α/2(n+α)}     if α < 2γ
ΔR_N ≤ (Q_1 + Q_2) N^{-γ/(n+2γ)}     if α = 2γ.   (C.14)

This completes the proof of Theorem 3.5.