2f§ -. r". 51' .- U a I $4". ......3 f6 Rana. ......vb mat?) . V .vi #4. o. '1.- ‘ a r?! . . a”... l I 4 .. .. .9... 41%. f. u 1 in ‘ a”??? 1.1. ...Em. ; Dinlabbvxd 3:. «9. f. .. ivy. .. .0 9:. ..V (Cuvuflro 4,113, L} ma? . . Eb .. ,n.< . . . .x'...‘v yl o. .. I. . . - nV . H! V .‘ .u ‘ . . ...,. .. . V... . ‘4 .1 . ... .. .. , .. » . . . —. c -‘ .~. ‘ ‘ n . ‘, V ‘ r, T A . . .H _. ._ . . it 3...:” A . V .. . . . V . V.V.. ,‘ . a .. . : .. , .. . .x. .. .r M .. , .. . V.... ... y ”a . .éruflnfluu." . .V v. . . : v , I . . ‘ I J’AI.V n , _ < a , . ‘ , .. .. . V. . .V . VJ. . . , . .V V .. .. . . . . . n ,. _ . ,.V . .. . V. x. V .. ‘ .. V ... o , . t y o I l . V .. V .1 u v . .. , . V n, -. ..,_ A ,. ‘ . V .A . 0V . . . n ‘ . .. x . . . . V .<. -V . r .u. I ,1 . . . , . 4 c u. . . . . u- ..4 ,. . _ ~ ..4. . . 0 v a _ V . .. . . . ~v , . . . v a ,. . . V x . 1 V . . .V . V. I v .. . - . . . .... :. ... ,2...- .. 0 if i 6 class H1 no?) - _, 2 < 0 if X 6 class H The generalization problem attempts to assess the perfor- mance, or goodness, of u(§) with new patterns through the use Of appropriate criteria, chief among these being various error prob- abilities. It is obvious that these three problems are interrelated. Success in solving the abstraction and generalization problems clearly depends on how well the feature extraction problem has been solved. But the feature extraction problem remains the most difficult of the three. The nature of primitive features is not easily defined in many situations and, at present, the primitive features selected to characterize Objects depend entirely on the intuition of the experimenter. The selection of the best subset of primitive features for the purposes of recognition constitutes the feature extraction problem proper. The abstraction and gen- eralization problems have received much more attention than the feature extraction problem and some solutions in these areas are available on the basis of Statistical Decision Theory and non- probabilistic procedures. The application of Statistical Decision Theory to the pattern recognition problem is called parametric pattern recogni- tion and requires that the pattern classes be characterized by probability distributions. When the underlying probability dis- tributions are not known, so called nonparametric or deterministic pattern recognition procedures are available for solving the abstraction problem. A pattern recognition system consists of algorithms which attempt to solve the problems stated above. Given a set of primitive features, which provide the absolute descriptions of the Objects, the pattern recognition system must determine which primitive features or which combinations of such features are effective for recognition purposes, how to extract them, and how to classify the patterns. A general model for a pattern recognition system is shown in Fig. 1.1. The primitive features are sensed and measured by the receptor. The derivation of these features is not always obvious and, at present, only features for which there are measuring devices are object A PATTERN RECOGNITION SYSTEM \V Receptor Feature >< Vé Extractor Figure 1.1 Categorizer allowed. At the output of the receptor is a set of measurements from which the feature extractor distills the feature subset to be used by the categorizer in classifying unknown input patterns. 
The receptor provides the set of features which are used in the solution of the feature extraction problem, while the categorizer attempts to solve the abstraction and generalization problems. Many of the decision rules developed in mathematical statistics and in nonparametric pattern recognition have been implemented in the design of the categorizer. The design of the receptor still remains subjective.

1.2 THESIS OUTLINE AND CONTRIBUTIONS

The main objective of this thesis is to examine the problem of feature extraction in multiclass pattern recognition systems which use Φ-discriminant functions for decision making. No assumptions are made about the forms of the probability densities for the pattern classes.

Chapter 2 unifies the three basic problems in pattern recognition, namely, feature extraction, abstraction, and generalization. The optimum pattern recognition system under the Bayes criterion is derived in the form of a 3-tuple, and a set of requirements for "good" performance criteria for feature extraction and abstraction is proposed. Following a comprehensive literature review, it is shown that a computationally feasible pattern recognition system can be expressed as a 4-tuple when separate criteria are used to evaluate optimum features and discriminant functions. Several definitions are stated for comparing the relative performance of pattern recognition systems during the recognition phase of operation. Hitherto, there has been no general procedure for evaluating pattern recognition systems. Researchers merely constructed and implemented their own isolated recognition systems, providing little comparison to systems developed by others. The definitions should standardize investigations in the formulation of a general procedure for determining optimum pattern recognition systems.

Chapter 3 introduces the Φ-discriminant function and reviews an iterative procedure for "learning" its coefficients; the procedure converges to the coefficients which minimize the mean-square error between the Φ-discriminant functions and the Bayes discriminant functions in a multiclass pattern recognition problem. A scheme which uses the estimate of the upper bound on the mean-square error as a criterion for selecting effective Φ-functions is also reviewed. This criterion is shown to be a good criterion for selecting effective features as well. Finally, the algorithm for determining the coefficients of the Φ-discriminant functions is related to other learning schemes, and suggestions are made for adjusting the coefficients to improve performance.

In Chapter 4, the estimate of the upper bound on the mean-square error between the Φ-discriminant functions and the Bayes discriminant functions is proposed as a criterion for both feature and Φ-function selection. It is shown that the criterion has some of the desirable properties of a good performance criterion proposed in Chapter 2. Two sequential feature extraction algorithms are reviewed. A third sequential algorithm is proposed and its characteristics examined. The proposed algorithm can lead to simultaneous feature and Φ-function selection. Four strategies are advanced for the case of generalized polynomial discriminant functions. The distributions of the Φ-functions are assumed to be normal for each class in a two-class recognition system. This assumption is baseless and very unlikely to hold in practice, but it exhibits the relationships between the coefficients of the Φ-discriminant functions and coefficients derived in other algorithms.
Finally, the proposed criterion function and sequential extraction rule are used to construct and implement a Φ-system. The results demonstrate the feasibility of simultaneous feature extraction and Φ-function selection.

Chapter 5 summarizes the conclusions and opinions concerning the thesis and suggests further research in the area of feature extraction in pattern recognition systems.

CHAPTER II
FORMULATION OF A PATTERN RECOGNITION SYSTEM

A generalized mathematical formulation of a pattern recognition system must include not only the selection of features and of discriminant functions, but also the selection of performance criteria for evaluating the effectiveness of features and discriminant functions. The universe is assumed to consist of an m-dimensional measurement space I^m, where m is the maximum number of primitive features available; these features provide the absolute description of an object. An object is thus represented by a vector-valued random variable. The vector*, x̄, of values assumed by such a random variable is called a primitive pattern and its elements X1, X2, ..., Xm are referred to as primitive features. The absolute standard of performance for a pattern recognition system is probability of error or probability of misclassification, depending on the strategy adopted.

Section 2.1 deals with the mathematical formulation of the optimum pattern recognition system under the absolute standard of performance, known as the Bayes criterion. A survey of the literature in feature extraction can be found in Sec. 2.2. Section 2.3 suggests classes of criteria for feature extraction and discrimination. These criteria measure the effectiveness of various sets of features and sets of discriminant functions. Section 2.3 also proposes a general class of discriminant functions and a general class of feature extraction rules. Definitions of some commonly used feature performance criteria are given in Sec. 2.4. Finally, Sec. 2.5 summarizes the main topics of the chapter.

* The same symbol is used in denoting the random variable and the values it assumes.

2.1 PATTERN RECOGNITION SYSTEMS AND FEATURE EXTRACTION

For a given primitive feature space I^m, a feature extraction rule t is a deletion of some of the coordinates of I^m; the resulting p-dimensional space I^p is called the feature space and each point in I^p is called a pattern; i.e.,

t : I^m → I^p ,   p ≤ m .

The deletion can also be performed after a linear transformation has been applied to the primitive feature space, but only deletions in the primitive feature space will be considered in this thesis. The family of all p-dimensional feature spaces is expressed as {I_ℓ^p}, ℓ = 1, 2, ..., (m choose p).

the distance between the sets X and Y,

ρ(X, Y) = min over i, j of ρ(x̄^i, ȳ^j) ,

is maximal.

Integral mean-square distance.
The integral mean-square distance between two functions y(x̄) and z(x̄) is defined as

d(y, z) = ∫ [y(x̄) − z(x̄)]² dx̄ .

[Summary table of the feature extraction literature: author and year (1963–1971), feature extraction rule (e.g., sequential forward, sequential backward, linear transformation, dynamic programming), performance criterion (e.g., divergence, Bhattacharyya distance, integral mean-square distance, mean-square error, maximum likelihood ratio, bounds on probability of error, information measure), and error measure (e.g., error rate, probability of error or misclassification, percentage of recognition).]

CHAPTER III
Φ-DISCRIMINANT FUNCTIONS

A practical pattern recognition system has been defined in Sec. 2.3 as the 4-tuple (t, b; u, a). The order in which the elements of the 4-tuple are selected should be immaterial, but we shall follow the general approach of first selecting a discriminant function u and a corresponding performance criterion a to evaluate it. Later on, we shall seek a matching feature extraction rule along with a measure b to evaluate the performance of feature sets.
Various types of probabilistic structure are available for probability-based discriminant functions. When the form of the density for each pattern class is completely known, or known to within a set of parameters, parametric classification algorithms can be applied to classify patterns. The unknown parameters can be "learned with a teacher" [S-6] [S-7], when a set of training patterns of known classification is available, or "learned without a teacher" [S-5], when the set of training patterns is of unknown classification. Nonparametric algorithms, on the other hand, deal with situations in which the distributions describing the pattern classes are unknown. The only data available are a set of training patterns, of known or of unknown classification. A mathematical form for the discriminant function is then assumed and the training patterns are used to "learn" the coefficients of the discriminant function. In this thesis, a Φ-discriminant function [N-5] will be used for the nonparametric multiclass pattern recognition problem.

The discriminant function adopted in this thesis is presented in Sec. 3.1, along with its learning capability, convergence rates, and other computational considerations. Section 3.2 reviews a scheme for the selection of Φ-functions and discusses how it leads to the definition of Φ-systems. The possibility of improving the performance of the chosen discriminant function is discussed in Sec. 3.3. Finally, a summary of the main topics of this chapter is given in Sec. 3.4.

3.1 LEARNING THE COEFFICIENTS OF A Φ-DISCRIMINANT FUNCTION

This thesis will consider a set of discriminant functions which are formed from a set, {φ1(x̄), φ2(x̄), ..., φM(x̄)}, of arbitrary functions defined on the feature space I^p. These arbitrary functions are sometimes referred to as Φ-functions, or potential functions, and are used in formulating discriminant functions which approximate the Bayes discriminant functions. The form of a Φ-discriminant function is

u(x̄, ā) = Σ_{ℓ=1}^{M} a_ℓ φ_ℓ(x̄) = ā^T Φ_M(x̄)

where Φ_M(x̄) = (φ1(x̄), φ2(x̄), ..., φM(x̄)), with the superscript T denoting the transpose operation. Then, under a given criterion, and utilizing any available prior knowledge, values for ā ∈ R^M which best approximate the Bayes discriminant function for the pair of pattern classes being considered are determined. The potential functions not only estimate the Bayes discriminant functions nonparametrically but, unlike the conventional linear hyperplane learning schemes [H-6] [S-1] [S-2] [M-4], also take into account higher-order statistics.

The Bayes decision rule minimizes the expected cost of decision-making and is defined to be the "best" rule. For a p-dimensional pattern vector, or point in I^p, which is known to have originated in one of the s pattern classes with prior probability q_i and density function p_i(·), the Bayes discriminant functions are defined in Sec. 2.1.

The application of the Bayes discriminant functions in Sec. 2.1 requires full knowledge of the quantities {q_ℓ, p_ℓ(·)}_{ℓ=1}^{s}. Since these quantities are unknown, ways have to be sought to form discriminant functions from a set of training patterns. The decision rule problem therefore calls for the construction of a sequence of estimates which approach the Bayes discriminant functions in some sense as the number of training patterns approaches infinity. Several authors have constructed functions which approximate the Bayes discriminant functions: [A-3], [B-2], [P-3], [H-5], [P-1], [R-3], [S-3], [W-5], [S-4].
The approximating functions given below are from Pitt and Womack [P-7]. No attempt is made to develop a new set of discriminant functions in this thesis. Rather, an investigation of the feature extraction problem is made when a set of Φ-discriminant functions is applied to the pattern recognition problem. The Φ-functions will be selected to optimize both feature extraction and decision-making.

The unknown Bayes discriminant functions will be approximated by

u_jk(x̄, ā^{jk}) = Σ_{ℓ=1}^{M} a_ℓ^{jk} φ_ℓ(x̄) = ā^{jkT} Φ_M(x̄)   (3.1.1)

where u_jk(x̄, ā^{jk}) = −u_kj(x̄, ā^{kj}) and Φ_M(x̄) = (φ1(x̄), φ2(x̄), ..., φM(x̄)) is a vector-valued function on I^p. The {φ_i(x̄)}_{i=1}^{M} are selected to meet specific design considerations such as computer storage. The set of classified training patterns {x̄(1), x̄(2), ..., x̄(N), ...} is obtained sequentially and utilized in determining the vectors {ā^{jk}(N)}_{j,k=1}^{s} which minimize the set of performance indices {Q_E^{jk}(ā)}_{j,k=1}^{s} introduced later in this section. Then, a pattern vector x̄ of unknown classification is classified with the approximate discriminant functions as if they were the Bayes functions in Sec. 2.1. The optimum decision rule, in the mean-square sense, can be stated as:

Decide x̄ ∈ H_j if u_jk(x̄, ā^{jk}) ≥ 0 ,  k = 1, 2, ..., s ,  j ≠ k .

The following assumptions are made concerning the Φ-functions.

Assumption I. {φ_i(x̄)}_{i=1}^{M} are linearly independent with probability 1; i.e., there exists a measure μ(·) satisfying ∫ φ_i(x̄) φ_j(x̄) dμ(x̄) = α_i δ_ij, where δ_ij is the Kronecker delta and α_i is some constant.

Assumption II. For each class H_ℓ the second moments E_ℓ{φ_i(x̄) φ_j(x̄)} exist, where x̄_ℓ(i) denotes the i-th training pattern from class H_ℓ and N1 + N2 + ... + Ns = N.

The term βI can be dropped if it is assumed that the number of training patterns N is larger than M, the dimension of the Φ-function space. In this case, the matrix P^{-1}(n) defined in (3.1.2) exists. A convergence theorem states that for each pair of classes H_j and H_k the vector ā^{jk}(N) in (3.1.2) approaches a vector, denoted by ā_ms^{jk}, with probability 1 as the number of training patterns N approaches infinity. The coefficients ā_ms^{jk} minimize the mean-square error between the discriminant function using a particular set of Φ-functions and the Bayes discriminant function. This mean-square error is given by

Q_E^{jk}(ā) = E_T{[u_jk(x̄, ā) − u_jk(x̄)]²}

where u_jk(x̄) is the unknown Bayes discriminant function. A second convergence theorem shows the rate at which the vector ā^{jk}(N) converges to ā_ms^{jk}.

This procedure requires s(s−1)/2 discriminant functions and s(s−1)/2 sets of coefficients ā^{jk}(N). The computation of the coefficients can be simplified [P-7] by expressing the coefficients as follows:

ā^{jk}(N) = Σ_{ℓ=1}^{s} r_ℓ(j|k) ḡ^ℓ(N) .   (3.1.5)

The vector ḡ^ℓ(N) is computed for pattern class H_ℓ from the set of N training patterns as follows:

ḡ^ℓ(n) = P^{-1}(n) q̄^ℓ(n)   (3.1.6)

where

q̄^ℓ(n) = q̄^ℓ(n−1) + i^ℓ(n) Φ_M(x̄(n))

and

i^ℓ(n) = 1 if x̄(n) is from class H^ℓ ,
       = 0 otherwise .

This procedure not only reduces the number of parameter vectors that must be computed iteratively from s(s−1)/2 to s, but also retains the convergence rate properties of ā^{jk}(N) proved in Theorem 2 [P-7]. In both theorems mentioned above, the Φ-functions are assumed as given, even though it is clear that the mean-square approximation error between all of the j-k-th discriminant functions and the Bayes discriminant functions, and hence the loss due to misclassification, depends upon the Φ-functions selected.
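For concreteness, the batch form toward which the estimates above converge can be sketched as follows. This is only an illustration, not the Pitt–Womack implementation: it assumes a 0-1 loss, takes P(N) = Σ_i Φ_M(x̄(i)) Φ_M^T(x̄(i)) + βI together with per-class sums q̄^ℓ(N) = Σ_{x̄ ∈ H_ℓ} Φ_M(x̄), and forms ā^{jk}(N) = P^{-1}(N)(q̄^j(N) − q̄^k(N)), the shape suggested by (4.6.2) in Chapter 4; the particular Φ-vector used (the features plus their quadratic terms) is likewise only an example.

import numpy as np

def phi(x):
    # Example Φ-vector: the features themselves plus all quadratic terms,
    # i.e., a generalized polynomial of order r = 2 (see Chapter 4).
    x = np.asarray(x, dtype=float)
    quad = np.outer(x, x)[np.triu_indices(x.size)]
    return np.concatenate([x, quad])

def learn_coefficients(patterns, labels, beta=1e-10):
    # Batch estimate of the pairwise coefficient vectors a^{jk}(N) for a 0-1 loss.
    labels = np.asarray(labels)
    Phi = np.array([phi(x) for x in patterns])        # N x M images in the Φ-space
    M = Phi.shape[1]
    P = Phi.T @ Phi + beta * np.eye(M)                # Σ Φ Φ^T + βI (β keeps P invertible when N is small)
    P_inv = np.linalg.inv(P)
    classes = np.unique(labels)
    q = {c: Phi[labels == c].sum(axis=0) for c in classes}   # per-class Φ sums
    coeffs = {(j, k): P_inv @ (q[j] - q[k])
              for j in classes for k in classes if j != k}
    return coeffs, classes

def classify(x, coeffs, classes):
    # Decision rule of this section: decide x in H_j if a^{jk·T} Φ(x) >= 0 for every k != j.
    f = phi(x)
    for j in classes:
        if all(coeffs[(j, k)] @ f >= 0 for k in classes if k != j):
            return j
    return None   # ties and rejections are not resolved in this sketch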
In general, there is no direct procedure for selecting an optimum set or number of Φ-functions. The scheme in Sec. 3.2 was proposed by Pitt and Womack [P-7] for selecting and testing several sets of Φ-functions to determine their relative performance based on the available training patterns.

3.2 A SCHEME FOR SELECTING Φ-FUNCTIONS

Denote by ā_ms^{jk} the coefficients which minimize Q_E^{jk}(ā), where ā^{jk}(N) converges to ā_ms^{jk} with probability one as N → ∞. The resulting approximation error, ε̄^{jk}, between the j-k-th discriminant function and the Bayes discriminant function is given by

ε̄^{jk} = E_T{[ā_ms^{jkT} Φ_M(x̄) − u_jk(x̄)]²} .

An upper bound for this mean-square approximation error was derived [P-7] and is denoted by

ε̄_ms^{jk} = R(j|k) − ā_ms^{jkT} P_ms ā_ms^{jk}   (3.2.1)

where ā^{jk}(N) converges to ā_ms^{jk} and P(N) converges to P_ms with probability one as N → ∞, and

R(j|k) = max_{1 ≤ ℓ ≤ s} |r_ℓ(j|k)| .

An estimate of an upper bound on the mean-square approximation error after n training patterns is

ε̂^{jk}(n) = R(j|k) − ā^{jkT}(n) q̄^{jk}(n) ,

where ā^{jk}(n) and q̄^{jk}(n) are given by (3.1.2) and (3.1.3), respectively. Theorem 1 [P-7] shows that

lim_{N→∞} ε̂^{jk}(N) = ε̄_ms^{jk} with probability 1 .

The use of the approximation error ε̂^{jk}(N) can improve the performance of the Φ-discriminant functions since it permits one to compare the performance of sets of Φ-functions. There is, however, no general procedure for selecting the optimum set of Φ-functions, or the set which makes the Φ-discriminant functions perform as well as the Bayes discriminant functions. In fact, no set of Φ-functions has this property in every situation. The average loss due to misclassification is always greater than that incurred with the Bayes discriminant functions. The only case in which the system loss is asymptotically equivalent to the Bayes loss is when the Bayes discriminant functions can be expressed as a finite sum of the Φ-functions selected. The following statements, proved in [P-7], state the point precisely.

Statement 1. If for some j and k there exists an ā_0 ∈ R^M such that

u_jk(x̄) = ā_0^T Φ_M(x̄) ,   (3.2.2)

then lim_{N→∞} ā^{jk}(N) = ā_ms^{jk} = ā_0 with probability 1, where ā_ms^{jkT} Φ_M(x̄) = u_jk(x̄) a.e.

Statement 2. For each pair j, k such that ā_0 satisfies (3.2.2),

lim_{N→∞} Q_E^{jk}(ā^{jk}(N)) = 0 with probability 1 .

Statement 3. If there exists a set of coefficients ā_0^{jk} ∈ R^M such that

u_jk(x̄) = ā_0^{jkT} Φ_M(x̄)

and the set A_jk = {x̄ : u_jk(x̄) = 0} has probability zero for j, k = 1, 2, ..., s, j ≠ k, then

lim_{N→∞} |e_s(N; u, ā) − e_s(N; u, ā_0)| = 0 with probability 1 ,

where e_s(N; u, ā) is the expected system loss due to misclassification after N adaptations of the Φ-discriminant functions and e_s(N; u, ā_0) is the expected system loss incurred from using the Bayes discriminant functions. It should be noted that the expressions e_s(N; u, ā) and e_s(N; u, ā_0) do not involve feature extraction or Φ-function selection.

A slight alteration in the notation will be made to show the dependence of the coefficients ā^{jk}(N), and the approximation error ε̂^{jk}(N), on the dimensionality of the primitive feature space as well as on the dimensionality of the Φ-space. A Φ-extraction rule τ is a deletion of some of the coordinates of the given Φ-space Q^M to produce a V-dimensional Φ-space Q^V, V ≤ M. Let t and τ denote two extraction rules applied to the primitive feature space I^m and the Φ-space Q^M, respectively. Figure 3.2.1 indicates the sequence in which t and τ are applied.

[Figure 3.2.1 — The sequence of the application of t and τ: primitive patterns (m features) → feature extraction rule t → patterns (p features, p ≤ m) → Φ-machine → M Φ-functions → Φ-extraction rule τ → V Φ-functions (V ≤ M).]

Following the application of t and τ, the coefficients ā^{jk}(N) and the mean-square approximation error ε̂^{jk}(N) are designated by ā^{jk}(N; t, τ) and ε̂^{jk}(N; t, τ), respectively. Expressing the system's mean-square approximation error as

ε_s(N; t, τ) = Σ_{j,k=1}^{s} ε̂^{jk}(N; t, τ) ,

the following definitions can be stated.

Definition VII. Given a Φ-extraction rule τ and N training patterns, a feature extraction rule t_ℓ^p is better than a feature extraction rule t_k^q, written t_ℓ^p ⊐ t_k^q | τ, if
Following the application of t and T, the coefficients 3 a 1‘'(N) and the mean-square approximation error ejk(N) are designated by Epk(N; t, T) and ejk(N; t, T), reSpectively. Expressing the system's mean-square approximation error as s k 68(N; t: T) = z €j (N; t: T) j.k=1 the following definitions can be stated. Definition VII. Given a Q-extraction rule T and N training patterns, P a feature extraction rule tL is better than a feature extraction rule written tpza tq\T if q "k’ t k THE SEQUENCE OF THE APPLICATION OF primitive ‘ pattern, m features rule patterns I p features pSm i machine AND w rule M Q-functions Figure 3.2.1 ___\ 7 V Q-functions V Sin 48 .801; ti. T) < esm; ti. T) - A p-dimensional feature Space is better than a q-dimensional feature Space, p s q, if tziD ti‘T for some L and all k . Definition VIII. Given a feature extraction rule t and N training patterns, . V w . a Q-extraction rule Tr is better than a Q-extraction TL, written, V W TrDTL‘t if es(N; t, TY) < e504; t. T2) - A set of V é-functions is better than a set of W @- functions, V s w if V W Tr:3 TL\t for some r and all L . These definitions suggest that the system's mean-square approximation error es* is a good criterion for selecting Q-func- tions as well as features. This means that the performance of features and the performance of Q-discriminant functions can be evaluated under a single performance criterion. The notion of assessing both the feature extractor and the categorizer under a single performance criterion was suggested and implemented in a different mathematical structure by Wee [w-4]. * The notation es stands for the mean-square approximation error and 38(N; t, T) stands for the value it assumes for given N, t and '1'. 49 The definition of a pattern recognition system given in Sec. 2.3 included a class of feature extraction rules T. The adaption of Q-discriminant functions has imposed the additional requirement of an extraction rule for Q-functions. Thus, pattern recognition systems which use Q-discriminant functions, called Q-systems, must include two types of extraction rules. The defini- tion of a general pattern recognition System was given as the 4- tuple (t, b; u, a). Choosing the performance criterion as for both .a and b, the Q-system can be defined as the 4-tuple (N; t, 7, es) where N is the number of training patterns. If the expected cost or the percentage of misrecognition of a Q- system is denoted by YS(N; t, 7, es), the following definition shows how Q-systems can be compared. Definition Lg. Given N training patterns, a Q-system (N: ti, rZ, es) . is better than a Q-system (N; ti, 7:, es) written . P V . q w (N: tL: Tr: es) 3 (N: tk’ Tf’ es) if p s q, V s w, and . V , q w mm :3, Tr. es) < ysm, tk. if, 65) . A lesser degree of a better Q-System occurs when p s q, v>w, or p>q,V$W and V W Yam; t2! Tr: ‘8) < Yam; tit Tf. ‘8) ° 50 3.3 ALGORITHMS FOR IMPROVING THE Q’DISCRIMINANT FUNCTIONS As noted by Nilsson [N-S], a Q-discriminant function is merely a linear discriminant function in which the individual features have been replaced by Q-functions. Consequently, the results Of studies on linear discriminant functions as applied to pattern classification can be extended to Q-discriminant functions. 
Several authors [H-6] [S-2] [M-4] [H-7] [G-2] [P-4] [G-3] [D-1] [Y-1] have dealt with linear discriminant functions in a fundamental problem in mathematical programming, switching theory, and pattern classification which can be stated in the following way: determine vectors ā and b̄ such that

Aā = b̄ and b̄ > 0 .

Each row of the N × m matrix A is a training pattern from one of two classes and N > m. Each training pattern is augmented to have an (m+1)-th component of value 1. The augmented training patterns from one class are then multiplied by (−1). A solution to this inequality is a hyperplane which dichotomizes the primitive feature space into regions which are associated with the two pattern classes. A complete separating hyperplane is one which correctly separates all training patterns into pattern classes. If a complete separating hyperplane exists, the training patterns are said to be linearly separable or consistent. If no complete separating hyperplane exists, the training patterns are said to be nonseparable or inconsistent. Ho and Kashyap [H-7] formulated an iterative algorithm which not only gives the complete separating hyperplane in a finite number of steps, if it exists, but also serves as a test for linear separability. Their algorithm consists of the following alternating steps:

A1. For a fixed b̄, a vector ā is determined such that it constitutes a least-squares fit in which J = ||Aā − b̄||² is minimized.

A2. For a fixed ā, change b̄ in the direction of steepest descent of J, subject to the condition b̄ > 0.

The problem of learning the coefficients ā^{jk}(N) of a Φ-discriminant function for any pair of classes H_j and H_k can be stated as follows: determine vectors ā^{jk}(N) and b̄^{jk}(N) such that

C ā^{jk}(N) = b̄^{jk}(N) > 0

where C is a given N × M matrix consisting of rows which are the images of the N training patterns in the Φ-space and N > M. Stated in this manner, it is easily observed that the problem is identical with the fundamental problem discussed above, with the obvious need for changing the terms linearly separable and linearly nonseparable to Φ-linearly separable and Φ-linearly nonseparable, respectively.

Complete separating hyperplanes exist only when regions containing all training patterns from each pattern class can be defined which do not intersect and are convex. In most practical problems, however, such regions do intersect and the discriminant functions are chosen to meet some specific criterion function. Suppose, for any pair of pattern classes H_j and H_k, each training pattern in the set {x̄_i}_{i=1}^{N_j} from H_j and {x̄_ℓ}_{ℓ=1}^{N_k} from H_k is assigned the variable δ_i^j or δ_ℓ^k, respectively, which takes the values 0 and 1. The value 1 is given to correctly classified training patterns and the value 0 to incorrectly classified ones. One criterion function that has been applied as a constraint on the discriminant function is F, the maximum number of training patterns correctly classified during the learning process. With the additional constraint, the fundamental problem mentioned earlier is restated as follows: determine vectors ā and b̄ such that Aā = b̄ and b̄ > 0 so as to maximize F. Stated precisely,

Maximize F = Σ_{i=1}^{N_j} δ_i^j + Σ_{ℓ=1}^{N_k} δ_ℓ^k

under the constraints

δ_i^j = 1 if x̄_i^T ā − b̄_i > 0 , and δ_i^j = 0 if x̄_i^T ā − b̄_i ≤ 0 ,  i = 1, 2, ..., N_j ;
δ_ℓ^k = 1 if x̄_ℓ^T ā − b̄_ℓ > 0 , and δ_ℓ^k = 0 otherwise ,  ℓ = 1, 2, ..., N_k ;  j, k = 1, 2, ..., s , j ≠ k .
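A compact version of the alternating steps A1 and A2 above can be sketched as follows. This is only an illustrative rendering of the Ho–Kashyap iteration, assuming the rows of A are the augmented training patterns with one class multiplied by −1, and using the common update b̄ ← b̄ + ρ(ē + |ē|) for the constrained steepest-descent step; in the Φ-discriminant setting, A would be replaced by the N × M matrix C of Φ-space images.

import numpy as np

def ho_kashyap(A, rho=0.5, b0=0.1, max_iter=1000, tol=1e-8):
    # Alternating least-squares / steepest-descent solution of A a = b with b > 0.
    b = np.full(A.shape[0], b0)          # start from a positive margin vector
    A_pinv = np.linalg.pinv(A)
    for _ in range(max_iter):
        a = A_pinv @ b                   # A1: least-squares fit of a for fixed b
        e = A @ a - b                    # error vector
        if np.all(np.abs(e) < tol):
            return a, b, True            # all margins met: complete separating hyperplane found
        if np.all(e <= tol):
            return a, b, False           # error has no positive part: patterns are nonseparable
        b = b + rho * (e + np.abs(e))    # A2: raise b only where e > 0, keeping b positive
    return a, b, False                   # iteration limit reached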
The quantity

ε_s(n; t, τ) = Σ_{j,k=1, j≠k}^{s} ε̂^{jk}(n; t, τ)

is an estimate of the upper bound on the sum of the mean-square approximation errors between the Φ-discriminant functions and the Bayes discriminant functions based on n training patterns. For a given Φ-extraction rule τ, and a given number of training patterns, various feature extraction rules might be tried to see which minimizes this criterion. If the set of all feature extraction rules leading to feature spaces of dimension r is designated by

T_r = {t_ℓ^r} ,  ℓ = 1, 2, ..., (m choose r) ,

the optimum feature extraction rule of dimension r, t_opt^r, minimizes the misrecognition rate γ_s(N; t_ℓ^r, τ, ε_s) of the Φ-system (N; t_ℓ^r, τ, ε_s); i.e.,

t_opt^r ⊐ t_ℓ^r | τ  for ℓ = 1, 2, ..., (m choose r) ,  ℓ ≠ opt .

Since misrecognition rates can only be determined from an actual computer simulation, we shall associate t_opt^r with the feature extraction rule which minimizes the system mean-square approximation error ε_s(N; t_ℓ^r, τ); i.e.,

t_opt^r ⊐ t_ℓ^r | τ  if  ε_s(N; t_opt^r, τ) ≤ ε_s(N; t_ℓ^r, τ)  for all ℓ = 1, 2, ..., (m choose r) .

To obtain a truly optimum feature extraction rule, it is necessary to consider all possible subsets of the set of primitive features. This means that a total number of

Σ_{r=1}^{m} (m choose r) = Σ_{r=1}^{m} m!/[(m−r)! r!]

subsets of features must be appraised to determine the subset with the best performance measure. It is clear that the total number of feature subsets, even for a relatively small value of m, can be prohibitively high. For example, if m = 20,

Σ_{r=1}^{20} (20 choose r) = Σ_{r=1}^{20} 20!/[(20−r)! r!] = 1,048,575 .

Consequently, any realistic search procedure for the extraction of features will have to be suboptimum. Several authors [F-3] [F-4] [L-6] [M-3] have considered two sequential search procedures as alternatives to an exhaustive search. The first is the sequential forward procedure, which first extracts the best single primitive feature, then extracts the best pair, in which the best single feature is included, and so on, as discussed in Sec. 2.2. At each step, the search procedure selects from the remaining set of features that feature which leads to the largest decrease in the performance criterion. The procedure is stopped either when the performance measure has reached a prescribed value or when all the primitive features have been exhausted. The other sequential search procedure is the sequential backward search procedure. At each step it discards that feature which increases the performance measure the most. The procedure stops at a given value of the performance measure. The sequential procedures decrease the number of feature subsets to be appraised considerably. For example, the total number of feature sets that must be considered in the sequential forward search is given by m(m+1)/2. For m = 20, this number is 210, which is four orders of magnitude smaller than that required in the exhaustive search.

The main drawback to the sequential search procedures is the fact that, once a feature has been selected as a good feature, it is retained in the subset which eventually gets admitted as the best feature subset. However, the best (p−1)-tuple feature subset may not be contained in the best p-tuple feature subset. This fact is illustrated forcefully in an example in Prabhu [P-8]. In a two-class pattern recognition system in which each pattern is characterized by the three features (X1, X2, X3), Prabhu showed that the best single feature for the purposes of discrimination is X3. However, the best two features were found to be the pair (X1, X2), which clearly shows that a sequential search procedure in which X3 is retained can never find the pair (X1, X2). Thus, any sequential search procedure in which selected features can be re-assessed will be a highly desirable search procedure.
Of course, the novelty of re-assessing selected features will not in itself procure a truly optimum subset of features. The reappraisal of selected features permits discarding feature sets which, in combination with recently acquired features, become less effective in discrimination.

Pattern recognition problems generally assume that a set of N training patterns is given. This thesis further assumes that the origins of the primitive training patterns are known, and the patterns are compiled in an N × m matrix, where each row is a primitive pattern. In matrix notation, the training set takes the form

A = [A^{1T} ; A^{2T} ; ... ; A^{sT}] ,   A^{ℓT} = [x̄_ℓ^T(1) ; x̄_ℓ^T(2) ; ... ; x̄_ℓ^T(N_ℓ)] ,   x̄_ℓ(j) ∈ class H_ℓ ,

where N_ℓ is the number of training patterns from class H_ℓ, Σ_{ℓ=1}^{s} N_ℓ = N, and s is the number of pattern classes. Each training pattern x̄_ℓ(j), j = 1, 2, ..., N_ℓ; ℓ = 1, 2, ..., s, is represented as a point in the m-dimensional primitive feature space I^m. Thus x̄_ℓ^T(j) = (X_{j1}^ℓ, X_{j2}^ℓ, ..., X_{jm}^ℓ).

The feature extraction problem deals with the attempt to rearrange the columns of the data matrix A into two matrices B_1 and B_2 so that only features with a high measure of discriminative effectiveness are retained in B_1. The primitive features in the data matrix B_2 will have little or no effectiveness in discrimination. Ideally, a rearrangement of this nature will supply the categorizer with a subset of primitive features which will be almost as effective in classifying unknown patterns as a scheme which uses all of the m primitive features. After the rearrangement of the columns of the data matrix A, the set of N training patterns can be represented as the partitioned matrix

B = [B_1 : B_2] = [B_1^{1T} B_2^{1T} ; B_1^{2T} B_2^{2T} ; ... ; B_1^{sT} B_2^{sT}] ,

where B_1 is an N × r matrix and B_2 is an N × q matrix, with r + q = m. Only the first r features are used for discrimination. The rearrangement of the data matrix A into B_1 and B_2 can be considered as a dichotomization of the m primitive features.

4.2 PROPOSED SEQUENTIAL FEATURE EXTRACTION RULE

The basic property of a pattern class is that its members are more similar to each other than they are to the members of another class in some of the coordinates of the primitive feature space I^m. In order for recognition to occur between any pair of classes it is only necessary that the objects of one class be dissimilar in one feature from the objects of the other class. This means that the lower bound on the number of features required for recognition between any pair of classes is one, and that the feature extractor can afford to destroy the information carried in all but one of the primitive features. In an s-class pattern recognition system the lower bound on the number of features required for recognition is between one and s(s−1)/2. The lower bound of one is reached when the s class regions are located in a chain on a single coordinate. The bound s(s−1)/2 arises when a different coordinate is required for the recognition of each pair of pattern classes. A sequential feature extraction rule is now proposed which seeks to extract only one feature between any pair of classes, but which allows for the extraction of more than one feature when necessary.

To begin the sequential operation, any two primitive features are measured and three feature subsets are constructed; the first subset consists of one primitive feature, the second of the other primitive feature, and the third of the combination of the two primitive features.
The subset which performs the best is said to contain effective features and is retained. The remaining subsets are discarded. At the second stage of the sequential operation, a third primitive feature is measured and two feature subsets are constructed; the first consists of the third primitive feature alone and the second consists of the third primitive feature in combination with the effective features found in stage one. The current effective features are determined and compared to the effective features of the previous stage. The better of the two is retained and the operation proceeds to the next stage. At any stage r, two feature subsets are constructed; the first consists of the (r+1)-th primitive feature alone and the other consists of the (r+1)-th primitive feature in combination with the previous effective features. A comparison is made between the effective features at the r-th stage and at the (r−1)-th stage and the better of the two is retained. A subset of p effective features, p ≤ r + 1, is achieved at the end of the r-th stage of the sequential operation. The feature space spanned by the p effective features is denoted by I_ℓ^p. This space can be considered to have been obtained through the application of a feature extraction rule t_ℓ^p, i.e.,

t_ℓ^p : I^m → I_ℓ^p  for some ℓ ;  ℓ = 1, 2, ..., (m choose p) .

At each stage of the operation, the effective features must perform no worse than the effective features at the previous stage. This means that at any stage r, for a given Φ-extraction rule τ and a given set of N training patterns, t_ℓ^p ⊐ t_h^q | τ with p ≤ r + 1, q ≤ r, for some ℓ and all h. The sequential operation is continued until either all the m primitive features have been processed, or the feature performance criterion has reached a pre-assigned threshold. The total number of feature subsets that must be considered is 2m − 1, where m is the number of primitive features. For m = 20, the total number of feature subsets is 39. This number is considerably smaller than the 1,048,575 obtained with an exhaustive search, as well as the number 210 obtained for the sequential forward procedure. Furthermore, under certain conditions it can perform better than the sequential forward procedure reviewed in Sec. 4.1. For example, in Prabhu's example cited in Sec. 4.1, the best single feature was X3, but the pair (X1, X2) gave the optimum performance. The sequential forward procedure would not attain the optimum features if X3 were retained. The sequential procedure proposed here would obtain the optimum features if the primitive features are measured in the order X1, X2, X3. At the first stage of the operation, the three feature subsets (X1), (X2) and (X1, X2) would be constructed, and in this case (X1, X2) would be retained. However, if the order of measurement is altered, one cannot guarantee that the proposed sequential procedure will always perform better than the sequential forward procedure. The only way to determine the relative performance of these procedures is to actually carry them out. However, if a cost is associated with each feature subset processed, it is clear that the proposed sequential procedure will be the least costly to compute.
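The stages just described can be outlined in a short sketch. This is only a schematic of the proposed rule: it assumes a caller-supplied function criterion(subset) returning the feature performance criterion (smaller is better, as with ε_s) for a candidate subset, and it omits the Φ-extraction rule τ and the bookkeeping of the actual system.

def sequential_extraction(features, criterion, threshold=None):
    # Stage 1: two features, three candidate subsets.
    f1, f2 = features[0], features[1]
    effective = min([[f1], [f2], [f1, f2]], key=criterion)
    # Stages 2, 3, ...: at each stage the new feature is tried alone and in
    # combination with the current effective features; the better of the
    # resulting subset and the previous effective subset is retained.
    for f in features[2:]:
        current = min([[f], effective + [f]], key=criterion)
        effective = min([effective, current], key=criterion)
        if threshold is not None and criterion(effective) <= threshold:
            break
    return effective

For m features the sketch considers 3 + 2(m − 2) = 2m − 1 subsets, the count quoted above, and a feature retained at an earlier stage can still be displaced at a later one.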
4.3 SIMULTANEOUS FEATURE EXTRACTION AND Φ-EXTRACTION

The application of Φ-functions in pattern recognition generates several difficulties. Besides the problems of selecting "good" Φ-functions and extracting effective discriminating features, one must determine the number, M, of Φ-functions required and the number, N, of training patterns to use. For a sufficiently large number of training patterns, the Φ-discriminant functions should be "close" to the mean-square approximations to the Bayes discriminant functions, as shown by the convergence rate of the system [P-7]. Even if the optimum mean-square sets of coefficients ā_ms^{jk} are used, the system performance might be much worse than that obtained when the Bayes discriminant functions are employed. The set {ε̂^{jk}(N)}_{j,k=1}^{s} depends on both the Φ-functions and the true probability densities of the pattern classes. Thus the selection of the number and form of Φ-functions are both critical. The questions then arise: For a given set of N training patterns, how does one select a set of "good" Φ-functions? How many of them does one select? Which subset of the primitive features does one extract and employ in generating the Φ-functions? How does one go about extracting the features? The answers to these questions require the consideration of the relationship between the number of training patterns, the dimensionality of the primitive feature space, and the number of Φ-functions selected, as implied in Fig. 3.2.1. Kanal and Chandrasekaran [K-5] have concluded that the probability structure assumed for the pattern classes and its correspondence to the real problem at hand determine the relationship between the number of training patterns, the number of features, and the probability of error. In the application of Φ-functions to pattern recognition, this conclusion means that the Φ-functions selected and their correspondence to the real problem at hand will determine the relationship between the number of training patterns, the number of features, and the loss incurred in misclassifying patterns.

The Φ-extraction problem is to determine which subset of the Φ-functions is significant for the purposes of discrimination. From this point of view, the Φ-extraction problem becomes identical to the feature extraction problem. This means that, in addition to the problem of extracting an optimum feature subset from a given feature set, the pattern recognition system must attempt to extract an optimum Φ-subset from a given set of Φ-functions.

The optimum Φ-system (N; t, τ, ε_s)_opt can thus be associated with the pair (t*, τ*)_opt which optimizes ε_s(N; t_ℓ^r, τ_h^V) over all pairs (ℓ, h), ℓ = 1, 2, ..., (m choose r); h = 1, 2, ..., (M choose V); and all pairs (r, V). The optimum Φ-system at level (r, V) is the system (t_*^r, τ_*^V) for which

ε_s(N; t_*^r, τ_*^V) ≤ ε_s(N; t_ℓ^r, τ_h^V)  for all (ℓ, h) ;

and the optimum Φ-system (t*, τ*)_opt satisfies

(t*, τ*)_opt ⊐ (t_*^r, τ_*^V)  for all (r, V) .

This thesis limits the Φ-discriminant functions to generalized polynomials and examines the interplay between the order of the polynomial, r, and the extraction rules t and τ, which are applied as indicated in Fig. 3.2.1. For a given r, the number of Φ-functions that can be generated after the application of a feature extraction rule t_q^p is given [N-5] by

M(p) = (p+r choose r) − 1 = (p+r)!/(p! r!) − 1

where p ≤ m, q = 1, 2, ..., (m choose p), and each of the Φ-functions is of the form

φ_ℓ(x̄) = (x1)^{r1} (x2)^{r2} ... (xp)^{rp}

with ℓ = 1, 2, ..., M(p) and r1 + r2 + ... + rp ≤ r. Since the number of all possible feature subsets is given by Σ_{p=1}^{m} (m choose p), there are a total of

Σ_{p=1}^{m} Σ_{V=1}^{M(p)} (M(p) choose V)(m choose p)

possible pattern recognition systems to consider. It is obvious that such a number is orders of magnitude larger than that encountered when extracting subsets of the m primitive features alone.
In spite of this computational difficulty, ways have to be sought to simultaneously extract features and Φ-functions which are effective in discrimination.

The 3-tuple (r, p, q) denotes not only the q-th feature subset of size p, but also represents the set of M(p) Φ-functions that can be generated from the feature subset for a given r, where r is the highest order of polynomial allowed. Thus the problem of extracting optimum features and Φ-functions is reduced to finding the optimum 3-tuple, (r, p, q)_opt, over all r, p and q. The following approaches are proposed.

Strategies for feature extraction alone

S.1. For a given r and p ≤ m, generate all (m choose p) p-dimensional feature subsets and evaluate their performances. The feature subset with the best performance is then selected as the optimum subset. This approach does not attempt to select optimum Φ-functions. The performance of each p-dimensional feature subset is evaluated with all the Φ-functions that can be generated from that subset, M(p) in number.

S.2. For a given L ≤ r and p ≤ m, generate the (p+L choose L) − 1 possible Φ-functions and evaluate the performance of all pairs (L, p). The pair with the best performance is then selected as the optimum pair. As in S.1, no attempt is made to extract optimum Φ-functions, since all the Φ-functions that the pair (L, p) can generate are used.

Strategies for simultaneous feature and Φ-function extraction

S.3. For a given r and p ≤ m, generate all the (m choose p) possible feature subsets. Then, for each feature subset, generate all the M(p) possible Φ-functions and apply the sequential operation proposed in Sec. 4.2 to select a subset of effective Φ-functions. This approach attempts to solve the problems of feature and Φ-function extraction simultaneously. The optimum features and Φ-functions are contained in that feature subset and its corresponding subset of effective Φ-functions which give the best performance.

S.4. Select r and examine the M(m) = (m+r choose r) − 1 Φ-functions in a sequence such that the ℓ-th Φ-function in the sequence differs from the (ℓ+1)-th in one feature and/or in the order of one feature. For example, for r = 2 and m = 3, order the M(3) = 9 Φ-functions as follows:

φ1(x̄) = x1    φ2(x̄) = x1²    φ3(x̄) = x2
φ4(x̄) = x1x2  φ5(x̄) = x2²    φ6(x̄) = x3
φ7(x̄) = x1x3  φ8(x̄) = x2x3   φ9(x̄) = x3²

With this ordering of the Φ-functions, the sequential operation proposed in Sec. 4.2 can be applied to extract effective Φ-functions. It is easily seen that if the final set of Φ-functions does not contain a particular feature, that feature is eliminated and simultaneous feature and Φ-extraction is achieved. On the other hand, if the final Φ-functions contain all the primitive features, only Φ-extraction is achieved. A simulation of the third and fourth approaches will be carried out in Sec. 4.7.
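One way to produce such an ordering, restricted for simplicity to r = 2, is sketched below; the generator is an assumption about how the sequence might be constructed (each new feature enters by itself, then in products with earlier features, then as a square), and for m = 3 it reproduces the nine Φ-functions listed above. The count M(p) = (p+r choose r) − 1 is also checked.

from math import comb

def phi_count(p, r):
    # M(p) = C(p + r, r) - 1: generalized polynomial Φ-functions of order at most r
    # on p features, excluding the constant term.
    return comb(p + r, r) - 1

def s4_order(m):
    # Candidate S.4 ordering for r = 2: consecutive Φ-functions differ in one
    # feature and/or in the order of one feature.
    seq = []
    for j in range(1, m + 1):
        seq.append((j,))                    # x_j
        for i in range(1, j):
            seq.append((i, j))              # x_i * x_j
        seq.append((j, j))                  # x_j squared
    return seq

print(phi_count(3, 2))    # 9
print(s4_order(3))        # [(1,), (1, 1), (2,), (1, 2), (2, 2), (3,), (1, 3), (2, 3), (3, 3)]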
4.4 CHARACTERISTICS OF THE PROPOSED SEQUENTIAL ALGORITHM

The sequential algorithm proposed for feature extraction and Φ-function selection in Sec. 4.2 possesses the following characteristics. These characteristics apply, as well, to Φ-extraction.

1. At each step, the algorithm exhibits the best features obtained so far in the operation.

2. Feature sets which are effective in discrimination may be discarded if found to be less effective in combination with a new feature.

3. In the case of a small number of training patterns, the algorithm allows a trade-off between the number of features selected and the number of training patterns available.

4. No probability density functions for the pattern classes are required, and high orders of features can be taken into consideration.

5. The number of subsets of features that must be examined is smaller than in sequential extraction rules suggested elsewhere, as explained in Sec. 4.2.

6. Feature extraction and Φ-function selection can be handled simultaneously.

Since the algorithm is not exhaustive, it cannot be expected to produce an absolutely optimum system. The extraction of absolutely optimum sets of Φ-functions and features is not guaranteed and, in general, will not be achieved, since relatively few subsets are appraised. The algorithm also fails to exhibit an explicit relationship between the system percentage of misrecognition and the performance criterion. This problem, however, is common to all procedures in which no prior knowledge is available concerning the pattern class probability densities. Furthermore, the algorithm does not give an explicit relationship between the number of Φ-functions and features selected, the number of training patterns required, and the performance criterion ε_s, even though the algorithm does provide for a trade-off among these quantities. In addition, the only Φ-functions and features that can be selected are those supplied to the pattern recognition system by its designer. All computationally realistic algorithms suffer these drawbacks. Along these lines, Highleyman [H-3] has noted that it is not simple to predict in advance whether a linear discriminant function has a chance of working. The same can be said about Φ-discriminant functions. The experimenter has no recourse but to formulate the discriminant function and estimate its performance. If the performance is found to be unacceptable, the experimenter can only search for a better set of Φ-functions (or features) or seek other types of discriminant functions.

4.5 CHARACTERISTICS OF THE PROPOSED PERFORMANCE CRITERION

The proposed performance criterion is a measure of how "close," in the mean-square sense, the Φ-discriminant functions are to the best mean-square approximation to the Bayes discriminant functions. In this section, it is shown that this criterion, when used for Φ-function extraction, meets some of the requirements listed in Sec. 2.1 for feature extraction.

R.1 The performance criterion should show that the performance of a Φ-function set which carries more discriminative information is better than that of a Φ-function set which carries less. In a W-dimensional Φ-function space, Appendix A shows that the system's performance criterion function after N training patterns have been processed is given by

ε_s(N; t, τ^W) = Σ_{j≠k} ε̂^{jk}(N; t, τ^W)   (4.5.1)
              = Σ_{j≠k} [R(j|k) − ḡ^{jkT}(N; t, τ^W) P_W^{-1}(N) ḡ^{jk}(N; t, τ^W)] .

With the selection of the (W+1)-th Φ-function, the system's performance criterion becomes

ε_s(N; t, τ^{W+1}) = Σ_{j≠k} ε̂^{jk}(N; t, τ^{W+1})
                   = Σ_{j≠k} [R(j|k) − ḡ^{jkT}(N; t, τ^{W+1}) P_{W+1}^{-1}(N) ḡ^{jk}(N; t, τ^{W+1})] .

The expected value of the discriminant function when class H1 is active is E_1{ā^T Φ_W(x̄)}, where q1 and q2 are the prior probabilities of the two classes. Because of the Gaussian assumption, the probability of error can be expressed as

P_e = q1[1 − Φ(m1/σ1)] + q2[1 − Φ(−m2/σ2)]

where Φ(·) is the cumulative distribution function of the standard normal distribution, and m_i and σ_i² denote the mean and variance of the discriminant function under class H_i. It is not easy to obtain the coefficient ā_opt which minimizes P_e in a simple form.
However, Peterson and Mattson [P-6] have given a search technique for finding the linear coefficient ā which minimizes the probability of error, and concluded that the coefficient ā_opt of an optimum linear discriminant function

u(x̄) = Σ_{i=1}^{m} a_i φ_i(x̄) ,   φ_i(x̄) = x_i ,

satisfies an equation of the form

[α1 Σ1 + α2 Σ2] ā_opt = (μ̄1 − μ̄2) α3

where α1, α2, and α3 are scalars. The coefficient ā_opt can be expressed as

ā_opt = [α1 Σ1 + α2 Σ2]^{-1} (μ̄1 − μ̄2) α3   (4.6.1)

which is seen to have the same form as the coefficient ā(N) in (3.1.2), which is rewritten here for convenience:

ā^{12}(N) = P^{-1}(N) ĝ^{12}(N) = [P1(N) + P2(N)]^{-1} (ḡ1(N) − ḡ2(N))   (4.6.2)

for a 0-1 loss function; P1(N) and P2(N) are defined in Appendix B. This suggests that there is a possible link between the iterative procedure for determining ā(N) discussed in Sec. 3.1 and the search technique. Peterson and Mattson further stated that their approach is applicable to many different problems and many different performance criteria which do not necessarily involve the assumption of normal distributions. This tends to make the possibility of a link even greater, since ā(N) was determined with no assumption on the distributions of the pattern classes. From (4.6.1) and (4.6.2), ā^{12}(N) and ā_opt are of the same form, and the determination of P_i(N_i) and ḡ_i(N_i) is seen to be equivalent to determining quantities which are proportional to covariance matrices and means. This suggests that the parametric and deterministic approaches to pattern recognition are essentially the same. In the former, assumptions are made concerning the form of the distributions of the classes, and unknown parameters, such as covariance matrices and mean vectors, are learned. The latter also learns its unknown parameters, which are generally in the form of coefficients of a discriminant function assumed for each pattern class.

Anderson and Bahadur [A-2] have also dealt with the problem of determining the optimum coefficient ā_opt in a linear classification scheme involving two multivariate normal distributions with different covariance matrices. They arrived at a vector of the same form as the ones in the two approaches above, specifically

ā_opt = [t1 Σ1 + t2 Σ2]^{-1} (μ̄1 − μ̄2)

where t1 and t2 are scalars such that t1 Σ1 + t2 Σ2 is positive definite. One case in their treatment involved the determination of the coefficient ā_mm in a procedure that minimizes the maximum probability of error. A scalar c was defined as

c = ā_mm^T (μ̄1 − μ̄2) / [(ā_mm^T Σ1 ā_mm)^{1/2} + (ā_mm^T Σ2 ā_mm)^{1/2}]

where

ā_mm = [t Σ1 + (1 − t) Σ2]^{-1} (μ̄1 − μ̄2)

with 0 < t < 1. This was the scalar quantity that Min [M-3] used as a separability measure and as a performance criterion for feature selection, and which he showed is related to the divergence. This suggests that another scalar, C(N), can be defined as

C(N) = ā^T(N)(μ̄1 − μ̄2) / [(ā^T(N) Σ1 ā(N))^{1/2} + (ā^T(N) Σ2 ā(N))^{1/2}]

for use as a performance criterion for feature selection in the case where the Φ-functions have normal distributions. Similar expressions were also arrived at by Ho and Kashyap [H-7], who expressed the optimum linear coefficient in the form

ā_opt = k(Σ1 + Σ2)^{-1}(μ̄1 − μ̄2) ,   k > 0 .

This coefficient also maximizes the function

J = {ā_opt^T(μ̄1 − μ̄2)}² / [ā_opt^T(Σ1 + Σ2) ā_opt]

which is interpreted as the "distance" between the clusters of the two classes over the "spread" of the cluster of each of the two classes.
Obviously the function

J(N) = {ā^T(N)(μ̄1 − μ̄2)}² / [ā^T(N)(Σ1 + Σ2) ā(N)]

can be adopted as a feature performance criterion, where the vector ā(N) is the coefficient derived in Sec. 3.1.

4.7 A PARTICULAR Φ-SYSTEM

The Φ-system described in Fig. 4.7.1 is now proposed for implementation. The estimate ε_s of an upper bound on the mean-square error between a set of Φ-discriminant functions and the Bayes discriminant functions is adopted as a performance criterion for selecting both optimum features and Φ-functions.

[Figure 4.7.1 — Implementation of Strategies S.3 and S.4: select a subset of primitive features; select an initial subset of Φ-functions; compute ε_s using the training patterns; increment the Φ-functions, retaining the minimum ε_s encountered at any stage with the particular feature subset; form the discriminant functions for ε_s,min; classify all training patterns; if the rate of misclassification is unsatisfactory, adjust the discriminant functions or select a new feature subset.]
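The flow of Fig. 4.7.1 can be outlined as follows. This is only a schematic of Strategy S.3: the Φ-generation, coefficient learning, criterion evaluation, and error-rate computation are delegated to hypothetical helper functions (generate_phi, fit_coefficients, criterion_es, error_rate) standing in for the procedures of Chapters 3 and 4, and the coefficient-adjustment step is left as a comment.

from itertools import combinations

def run_phi_system(features, patterns, labels, r, accept_rate, sequential_extraction,
                   generate_phi, fit_coefficients, criterion_es, error_rate):
    # Strategy S.3: for each feature subset, sequentially select effective Φ-functions
    # under the criterion ε_s, learn the coefficients, and stop at the first system
    # whose training misclassification rate is satisfactory.
    best = None
    for p in range(1, len(features) + 1):
        for subset in combinations(features, p):
            phis = generate_phi(subset, r)                          # all M(p) candidate Φ-functions
            chosen = sequential_extraction(
                phis, lambda s: criterion_es(s, patterns, labels))  # Sec. 4.2 rule applied to Φ-functions
            coeffs = fit_coefficients(chosen, patterns, labels)     # Sec. 3.1 learning procedure
            rate = error_rate(coeffs, chosen, patterns, labels)     # classify the training patterns
            if best is None or rate < best[0]:
                best = (rate, subset, chosen, coeffs)
            if rate <= accept_rate:
                return best       # satisfactory: switch to the "recognize" position
            # otherwise: adjust the coefficients (thresholds d_jk), increase r,
            # or obtain more training patterns before trying the next subset
    return best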
The simulation flow chart differs somewhat from Fig. 4.7.1 in that the misclassification rate was determined for each best subset of Φ-functions encountered, both to ensure that the sequential operation was working properly and to observe whatever functional relationship might arise between the criterion function $e_s$ and the percentage of misrecognition. In all the simulations $\beta$ was taken to be $10^{-10}$.

[Figure 4.7.2: Flow Chart of the Proposed Φ-System. Read in the data; initialize the Phi vectors; compute $\bar{\theta}^{jk}(N)$ for the Phi vectors from the training patterns; choose the vector which produced the minimum $e_s = \sum_{j<k} e^{jk}(N)$; test the training patterns with the chosen vector; if the desired performance is not achieved and Phi functions remain, revise the vectors, otherwise adjust the discriminant functions; test the testing patterns; plot graphs of $e_s$ versus percent misrecognition.]

A summary of the results of approach S.3 is shown in Tables 4-1 and 4-2 for the feature subset $X_2, X_3, X_4$. Only 25 patterns from each class were used for training, but all 50 patterns were used for testing to obtain the results in Table 4-1. The results in Table 4-2 were obtained by using all 50 patterns from each class for both training and testing. Similar results were obtained for the remaining three possible feature subsets of size three, as well as for the case in which all four primitive features were considered. The Φ-functions generated by each feature subset were ordered in the manner proposed in approach S.4. Tables 4-3 through 4-6 show a comparison of the performance of the two approaches employed in the simulation.

[Tables 4-1 through 4-6: stage-by-stage values of the minimum $e_s$, the Φ-functions producing them, and the percentages of misrecognition for Ve/Vi/Se, Ve/Vi, Vi/Se, and Ve/Se recognition under approaches S.3 and S.4; the tabulated figures are not recoverable from the scan.]

4.8 SUMMARY

The discriminant function adopted in Chap. 2 was used in combination with a proposed sequential extraction rule to construct a pattern recognition system in which optimum features and Φ-functions are sought under a single criterion function.

An estimate of the upper bound on the mean-square error between the Φ-discriminant functions and the mean-square approximation to the Bayes discriminant functions was proposed as a criterion for feature extraction in Sec. 4.1. In the same section, two sequential feature extraction rules were reviewed. It was shown that neither of the rules can be guaranteed to be truly optimum. In Sec. 4.2, a third sequential extraction rule, which considers fewer feature subsets than the two sequential extraction rules, was developed. An added feature of the proposed sequential extraction rule is that a feature found to be effective need not be retained if it is found to be ineffective in combination with other features.
The possibility of simultaneous feature and Φ-extraction was discussed in Sec. 4.3. Characteristics of the proposed feature extraction rule and the mean-square error criterion were given in Sec. 4.4 and Sec. 4.5, respectively. It was shown that the criterion has desirable properties. In Sec. 4.6, it was assumed that the Φ-functions selected have normal distributions with respect to two pattern classes, and the Φ-discriminant function was studied in the case where the two distributions differ in mean vectors as well as in covariance matrices. It was shown that the mean-square optimum coefficients of the discriminant function have the same form as those derived by others [P-6], [A-2], [H-7]. Finally, Sec. 4.7 proposed and implemented a Φ-system.
The results indicate that in a three-class recognition problem involving Fisher's data [F-7], a minimum percentage of misrecognition of 2% was achieved with the Φ-functions $X_2$, $X_2^2$, $X_3$, $X_2X_3$, $X_3^2$, $X_4$, $X_3X_4$. Since the feature $X_1$ is not included in the set of optimum Φ-functions, both feature extraction and Φ-function selection were achieved simultaneously.

The results for all feature subsets considered do not indicate any simple functional relationship between the criterion $e_s$ and the percentage of misrecognition. A monotonic relationship between $e_s$ and the percentage of misrecognition was not found. Furthermore, the set of Φ-functions which was selected as optimum, i.e., the set that produced the minimum value of $e_s$, did not produce the minimum percentage of misrecognition. Nevertheless, judging from the results, $e_s$ is clearly a workable criterion function. In fact, the proposed Φ-system exhibits a performance level comparable to, and in some cases better than, other recognition systems [D-2], [F-8], [K-10] which experimented with the same data, as shown in Table 4-7. The proposed Φ-system also achieved a minimum misrecognition rate of 0.0% for Se/Vi and Se/Ve recognition and a minimum misrecognition rate of 2.0% for Ve/Vi/Se recognition.

TABLE 4-7

Author            Algorithm                                    Minimum Ve/Vi % misrecognition achieved
Ishaku            Proposed                                     6.0%
Kendall [K-10]                                                 11.0%
Freeman [F-8]                                                  5.5%
Dubes [D-2]       Entropy dispersion method                    46.0%
Dubes [D-2]       Energy subspace method                       44.0%
Dubes [D-2]       Gaussian rule (1-dim)                        14.0%
Dubes [D-2]       Minimum distance clustering method (1-dim)   14.0%

CHAPTER V

CONCLUSIONS AND POSSIBLE EXTENSIONS

Some general comments and possible extensions in the area of general pattern recognition systems are discussed in Sections 5.1 and 5.2. The conclusions and possible extensions of the particular system investigated in this thesis are summarized in Sections 5.3 and 5.4.

5.1 GENERAL COMMENTS ON PATTERN RECOGNITION SYSTEMS

Four problems are generally associated with the design of pattern recognition systems, namely feature measurement, feature extraction, abstraction, and generalization. This thesis concentrates on feature extraction. There are two problems associated with feature extraction: What criterion should be used in appraising the relative performance of features? How can optimum features be established under the chosen criterion? The theoretical solution to the second problem is obvious once a performance criterion has been selected: one simply tries all possible feature subsets to be absolutely certain that the optimum feature subset is obtained. Economy forbids the consideration of all possible feature subsets, so practical feature extraction rules consider only a fraction of the possible feature sets. It appears, therefore, that future developments in feature extraction can be most useful if efforts are concentrated on the search for more useful criterion functions and on the development of machines that will economically handle problems which require considerable computation.

The problem of abstraction also involves selecting a suitable criterion. There have been basically two approaches to this problem: probabilistic and deterministic. The main technique for the first approach is the recursive application of the Bayes rule to "learn" the unknown parameters of probability densities having known form. Then, under a given criterion, the optimum discriminant function is determined using the likelihood ratio.
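As a minimal illustration of this probabilistic route, the following sketch assumes two synthetic Gaussian classes, "learns" their means and covariances from labeled samples, and classifies a new pattern with the log-likelihood ratio against a prior-dependent threshold. The data, priors, and function names are illustrative only and do not come from the thesis.

```python
import numpy as np

# Synthetic two-class data, purely illustrative.
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=50)
X2 = rng.multivariate_normal([0.0, 0.0], [[1.5, -0.2], [-0.2, 1.0]], size=50)

# "Learning" the unknown parameters of the assumed Gaussian densities reduces
# to estimating each class's mean vector and covariance matrix.
mu1, cov1 = X1.mean(axis=0), np.cov(X1, rowvar=False)
mu2, cov2 = X2.mean(axis=0), np.cov(X2, rowvar=False)

def gaussian_log_density(x, mu, cov):
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2.0 * np.pi))

def likelihood_ratio_decision(x, prior1=0.5, prior2=0.5):
    """Decide class 1 when the log-likelihood ratio exceeds the prior threshold."""
    llr = gaussian_log_density(x, mu1, cov1) - gaussian_log_density(x, mu2, cov2)
    return 1 if llr >= np.log(prior2 / prior1) else 2

print(likelihood_ratio_decision(np.array([1.5, 0.8])))
```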
In the absence of any knowledge about the class probability densities, the main technique is to iteratively "learn" the coefficients, generally of a linear discriminant function, which optimize a selected criterion function.

In the case of probabilistic discriminant functions, the generalization problem is solved through the determination of various error probabilities. No such error probabilities can be determined from deterministic discriminant functions. Since the probability of error can very seldom be computed, it may be more useful to assess pattern recognition systems only on how well the systems perform in actual simulations. In that case, the optimum pattern recognition system can be defined as that system which has the smallest misrecognition rate for a given data base during the testing phase.

This thesis suggests and formulates a pattern recognition system as a 3-tuple, (t; u, a), when a single criterion, a, is used in determining optimum features and discriminant functions, and as a 4-tuple, (t, b; u, a), when b is used in determining optimum features and a in determining optimum discriminant functions. This marks the first attempt at integrating the problems of feature extraction, abstraction, and generalization into a single mathematical formulation. The validity of this formulation is borne out in Table 2-1, where it is shown that all the pattern recognition systems that have been developed are arrived at through a choice of particular elements in the formulation. This unified formulation facilitates theoretical research into the optimality of total pattern recognition systems and represents a major contribution of the thesis. A second major contribution is the development of a new sequential extraction rule which considers fewer feature subsets than all previous sequential extraction rules and which has been shown to be superior, under certain conditions, to the sequential forward rule.

5.2 EXTENSIONS AND OPINIONS IN THE STUDY OF PATTERN RECOGNITION SYSTEMS IN GENERAL

The author hopes that research into pattern recognition systems can be approached from a fresh viewpoint which may or may not be related to the traditional manner described in Chapter 2. The following possible extensions concern only pattern recognition systems which can be expressed in the formulation mentioned in Sec. 5.1.

1. For a given data base, it would be interesting to rate all the pattern recognition systems that have been developed under the performance criterion of misrecognition rate during the testing phase of the systems.

2. It is possible that more effective elements can be included in the classes T, B, U, and A defined in Chapter 2.

3. Since the problem of feature measurement appears to be beyond mathematical formulation, the author agrees with Prabhu [P-8] that members of any discipline who can help the experimenter intuit what features to measure in a specific recognition problem should be consulted. In fact, educated intuition concerning what features to measure should be regarded as essential in the formulation of recognition systems and certainly as a valid tool in the solution of the problem of feature measurement.

5.3 CONCLUSIONS IN REGARD TO Φ-SYSTEMS

This thesis constructed and implemented a three-class Φ-system with a particular set of the three elements (t; u, a) which define a generalized pattern recognition system. The feature extraction rule proposed in Sec.
4.2 was employed for t, generalized second-order polynomial discriminant functions were employed for u, and the approximation of an upper bound on the mean-square error between a set of polynomial discriminant functions and the Bayes discriminant functions was used for a. The data used in the simulations were Fisher's [F-7], but the Φ-system is not restricted to any type of data.

This thesis extends the work of Pitt and Womack [P-7] and contains the first investigation into the problems involved in linking the selection of Φ-functions to both feature extraction and discriminant functions. Thus, a basic goal of this thesis was accomplished when the simulations indicated that feature extraction and Φ-function selection can be done simultaneously. A second basic goal of this thesis, which was not realized, was the exhibition of a monotonic functional relationship between the criterion function $e_s$ and the percentage of misrecognition. The thesis, however, showed beyond any doubt that $e_s$ is an effective criterion in the selection of both optimum features and Φ-functions.

5.4 POSSIBLE EXTENSIONS IN THE STUDY OF Φ-SYSTEMS

The study of the Φ-system constructed and implemented in this thesis may be extended as follows:

1. Construct discriminant functions which partition the Φ-space by associating more than two disjoint regions with any pair of pattern classes.

2. Approximate the distributions of each pattern class in the Φ-space, or the distributions of the Φ-discriminant functions, and approach the abstraction problem from the probabilistic viewpoint.

3. Use Φ-functions other than generalized polynomials, such as Legendre and Fourier functions, to implement the system.

4. Investigate the possibility of obtaining the optimum ordering of the Φ-functions prior to the start of the proposed sequential operation.

5. Investigate the conditions under which a functional relationship exists between $e_s$ and the percentage of misrecognition.

BIBLIOGRAPHY

[A-1] Anderson, T.W. "An Introduction to Multivariate Statistical Analysis", New York, Wiley, 1958.
[A-2] Anderson, T.W. and Bahadur, R.R. "Classification into Two Multivariate Normal Distributions with Different Covariance Matrices", Ann. Math. Statist., Vol. 33, pp. 420-431, 1962.
[A-3] Aizerman, M.A., Braverman, E.M. and Rozoner, L.I. "The Probability Problem of Pattern Recognition Learning and the Method of Potential Functions", Automation and Remote Control, Vol. 25, pp. 1175-1193, 1964.
[B-1] Block, H.D., Nilsson, N.J. and Duda, R.O. "Determination and Detection of Features in Patterns", Computer and Information Sciences, J.T. Tou and R.H. Wilcox, Eds., Washington, D.C., Spartan, pp. 75-110, 1964.
[B-2] Braverman, E.M. and Byatnitskii, E.S. "Estimation of the Rate of Convergence of Algorithms Based on the Potential Functions Method", Automation and Remote Control, Vol. AC-12, pp. 195-197, 1967.
[C-1] Calvert, T.W. "Nonorthogonal Projections for Feature Extraction in Pattern Recognition", IEEE Trans. Computers, Vol. C-19, pp. 447-452, 1970.
[C-2] Chien, Y.T. and Fu, K.S. "On the Generalized Karhunen-Loève Expansion", IEEE Trans. Info. Theory, Vol. IT-13, pp. 518-520, 1967.
[C-3] Chien, Y. "A Sequential Decision Model for Selecting Feature Subsets in Pattern Recognition", IEEE Trans. Computers, Vol. C-20, pp. 282-290, 1971.
[C-4] Chu, J.T. and Chueh, J.C. "Error Probability in Decision Functions for Character Recognition", Journ. of A.C.M., No. 14, pp. 273-280, 1967.
[D-1] Dantzig, J. "Linear Programming", Progress, 1966.
[D-2] Dubes, R.C.
"Information Compression, Structure Analysis, and Decision Making with a Correlation Matrix", Interim Scientific Rept. NO. 11, Division of Engineering Research, Michigan State University, East Lansing, Michigan, 1970. 101 [Ia-11 [F-ll [F-21 [p-31 [F-4] [F-S] [F-6] [3-7] [F-81 16-11 16-21 16-31 [n-1] [n-2] 102 Estes, 8.E. "Measurement Selection for Linear Discriminants used in Pattern Classification", IBM Research Rept. RJ-331, 1965. Frantsuz, A.G. "Influence Of Correlations between Attributes on their Informativeness for Pattern Recognition", Engineering Cybernetics U.S.S.R., NO. 4, pp. 64-67, 1967. Fu, K.S. and Chen, C.H. "Sequential Decisions, Pattern Recognition and Machine Learning", Tech. Rept. TR-EE65-6, School of Elect. Eng., Purdue University, Lafayette, Indiana, 1965. Fu, K.S. "Sequential Methods in Pattern Recognition and Machine Learning", Academic Press, New York and London, 1968. Fu, K.8. and Cardillo, G.P. “An Optimum Finite Sequential Procedure for Feature Selection and Pattern Classification", IEEE Trans. Auto. Control, Vol. 12, pp. 588-591, 1967. Fukunaga, K. and Koontz, L.G. “Application of the Karhunen- Loave Expansion to Feature Selection and Ordering", IEEE Trans. Computers, Vol. C-19, pp. 311-318, 1970. Fukunaga, K. and Olsen, D.R. “An Algorithm for Finding Intrinsic Dimensionality Of Data", IEEE Trans. Computers, Vol. c-zo, pp. 176-183, 1971. Fisher, R.A. "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics 3, Part 2, pp. 179-188, 1936. Freeman, J.J. "Experiments in Discrimination and Classifica- tion", Pattern ReCOgnition 1, pp. 207-218, 1969. Grettenberg, T.L. "Signal Selection in Communication and Radar System", IEEE Trans. Info. Theory, Vol. IT-9, pp. 265- 275, 1963. Golovkin, B.A. “Adaptive Pattern Recognition as a Problem in Partial Integer Linear Programming", Engineering Cyber- netics U.S.S.R., NO. 2, pp. 140-144, 1968. GOl'shteyn, E.G. and Yudin, D.B. "New Directions in Linear Programming", Sovetskye Radio, 1966. Hellman, M.E. and Raviv, J. "Probability of Error, Equi- vocation, and the Chernoff Bound", IEEE Trans. Info. Theory, Vol. IT-l6, pp. 368-372, 1970. Henderson, T.L. and Lainiotis, D.G. "Comments on Linear Feature Extraction", IEEE Trans. Info. Theory, Vol. IT-15, pp. 728-730, 1969. 1H-31 [H-4] [H-S] 1H-61 [n-7] [I-l] [K-3] [K-4] [K-S] [K-6] 1K-71 103 Highleyman, W.H. "Linear Decision Functions, with Applica- tion to Pattern Recognition", IRE Proc., pp. 1501-1514, 1962. Ho, Y. and Agrawala, A.K. "On Pattern Classification Algorithms: Introduction and Survey", IEEE Proc., Vol. 56, HO, Y.C. and Blaydon, C. "On the Abstraction Problem in Pattern Classification", Proc. N.E.C., 1966. Ho, Y.C. and Kashyap, R.L. “A Class of Iterative Procedures for Linear Inequalities", SIAM Journal on Control, Vol. 4, pp. 112‘115 , 19660 HO, Y.C. and Kashyap, R.L. “An Algorithm for Linear Inequalities and its Applications", IEEE Trans. Electronic Computers, Vol. EC-l4, 1965. IEEE Computer Group Midwest Area and Pattern Recognition Committees, and Argonne National Laboratory, Applied Mathe- matics and High Energy Physics Divisions. "IEEE Conference Record of the Symposium on Feature Extraction and Selection in Pattern Recognition", 1970. Kadota, T.T. and Shepp, L.A. "On the Best Finite Set Of Linear Observables for Discriminating Two Gaussian Signals", IEEE Trans. Info. Theory, Vol. 13, pp. 278-284, 1967. Kailath, T. "The Divergence and Bhattacharyya Distance Measures in Signal Selection", IEEE Trans. Comm. Tech., Vol. con-15, pp. 52-60, 1967. 
[K-3] Kanal, L. and Chandrasekaran, B. "On Dimensionality and Sample Size in Statistical Pattern Classification", Proc. N.E.C., pp. 2-7, Chicago, 1968.
[K-4] Kharkevich, A.A. "Choice of Features for Recognition Machines", Engineering Cybernetics U.S.S.R., No. 2, pp. 1-7, 1963.
[K-5] Kobayashi, H. and Thomas, J.B. "Distance Measures and Related Criteria", Proc. Fifth Annual Allerton Conference on Circuit and System Theory, pp. 491-500, 1967.
[K-6] Koontz, L.G. and Fukunaga, K. "Nonparametric Mapping of Multivariate Measurements for Separability Enhancement", Proc. N.E.C., 1970.
[K-7] Korsunskaya, V.O. "Some Principles in Formation of Attributes in Pattern Recognition", Engineering Cybernetics U.S.S.R., No. 6, pp. 109-114, 1966.
[K-8] Kullback, S. "Information Theory and Statistics", John Wiley and Sons, Inc., New York, New York, 1959.
[K-9] Kullback, S. "Certain Inequalities in Information Theory and the Cramér-Rao Inequality", Ann. Math. Statist., Vol. 25, pp. 745-751, 1954.
[K-10] Kendall, M.G. "Discrimination and Classification", in Multivariate Analysis (ed. P.R. Krishnaiah, Academic Press, New York, 1966).
[L-1] Lainiotis, D.G. "Optimal Feature Extraction in Pattern Recognition", IEEE International Info. Theory Symp. Abstracts, San Remo, Italy, 1967.
[L-2] Lainiotis, D.G. and Henderson, T.L. "Application of State Variable Techniques to Optimal Feature Extraction", Proc. IEEE (Letters), Vol. 5, pp. 2175-2176, 1968.
[L-3] Lainiotis, D.G. "A Class of Upper Bounds on Probability of Error for Multihypotheses Pattern Recognition", IEEE Trans. Info. Theory, Vol. IT-15, pp. 730-731, 1969.
[L-4] Levine, M.D. "Feature Extraction: A Survey", Proc. IEEE, Vol. 57, pp. 1391-1407, 1969.
[L-5] Lewis, P.M. "The Characteristic Selection Problem in Recognition Systems", IRE Trans. Info. Theory, Vol. IT-8, pp. 171-178, 1962.
[M-1] Mahalanobis, P.C. "On the Generalized Distance in Statistics", Proc. Nat. Inst. Sci., Vol. 122, pp. 49-55, India, 1936.
[M-2] Marill, T. and Green, D.M. "On the Effectiveness of Receptors in Recognition Systems", IEEE Trans. Info. Theory, Vol. 9, pp. 11-17, 1963.
[M-3] McDonald, R.P. "A Unified Treatment of the Weighting Problem", Psychometrika, Vol. 33, pp. 351-381, 1968.
[M-4] Mengert, P.H. "Solution of Linear Inequalities", IEEE Trans. Computers, Vol. C-19, pp. 124-131, 1970.
[M-5] Min, P.J. "On Feature Selection in Multi-class Pattern Recognition", Ph.D. Thesis, School of Elect. Eng., Purdue University, Lafayette, Indiana, 1969.
[N-1] Nagy, G. "Feature Extraction on Binary Patterns", IEEE Trans. System Sci. and Cybernetics, Vol. SSC-5, 1969.
[N-2] Nelson, G.D. and Levy, D.M. "A Dynamic Programming Approach to the Selection of Pattern Features", IEEE Trans. System Sci. and Cybernetics, Vol. SSC-4, 1968.
[N-3] Nelson, G.D. and Levy, D.M. "Selection of Pattern Features by Mathematical Programming Algorithms", IEEE Trans. System Sci. and Cybernetics, Vol. SSC-6, 1970.
[N-4] Neymark, Y.I., Batalova, Z.S. and Obraztsova, N.D. "Pattern Recognition and Computer Systems: Feature Selection in Pattern Recognition", Engineering Cybernetics U.S.S.R., 1969.
[N-5] Nilsson, N.J. "Learning Machines", McGraw-Hill, 1965.
[P-1] Parzen, E. "On the Estimation of a Probability Density and Mode", Ann. Math. Statist., Vol. 33, pp. 1065-1076, 1962.
[P-2] Patrick, E.A. and Fischer II, F.P. "Nonparametric Feature Selection", IEEE Trans. Info. Theory, Vol. IT-15, pp. 577-584, 1969.
[P-3] Patterson, J.D., Wagner, T.J. and Womack, B.F.
"A Performance Criterion for Adaptive Pattern Classification Systems", IEEE Trans. Automatic Control, Vol. AC-12, pp. 195-197, 1967.
[P-4] Pervozanskiy, A.A. "The Recognition of Abstract Patterns as a Problem in Linear Programming", Engineering Cybernetics U.S.S.R., No. 4, pp. 38-41, 1965.
[P-5] Peterson, D.W. "Discriminant Functions: Classes and Computational Techniques", Stanford Electronic Lab., Stanford, California, Rept. SEL-65-021 (TR6761-2), 1965.
[P-6] Peterson, D.W. and Mattson, R.L. "A Method of Finding Linear Discriminant Functions for a Class of Performance Criteria", IEEE Trans. Info. Theory, Vol. IT-12, pp. 380-387, 1966.
[P-7] Pitt, J.M. and Womack, B.F. "Additional Features of an Adaptive, Multicategory Pattern Classification System", IEEE Trans. System Sci. and Cybernetics, Vol. SSC-5, 1969.
[P-8] Prabhu, K.P.S. "On Feature Reduction with Application to Electroencephalographs", Ph.D. Thesis, Division of Engineering and Applied Physics, Harvard University, Cambridge, Massachusetts, 1970.
[R-1] Renyi, A. "On the Amount of Missing Information and the Neyman-Pearson Lemma", Research Papers in Statistics, F.N. David (Ed.), Wiley, 1966.
[R-2] Renyi, A. "On the Amount of Information Concerning an Unknown Parameter in a Sequence of Observations", Publ. Math. Inst. Hungar. Acad. Sci., 9, pp. 617-624, 1964.
[R-3] Ryzin, J.V. "Bayes Risk Consistency of Classification Procedures Using Density Estimation", Sankhya: The Indian Journal of Statistics, Series A, Vol. 28, 1966.
[S-1] Sammon Jr., J.W. "Interactive Pattern Analysis and Classification", IEEE Trans. Computers, Vol. C-19, pp. 594-616, 1970.
[S-2] Sammon Jr., J.W. "An Optimal Discriminant Plane", IEEE Trans. Computers, Vol. C-19, pp. 826-829, 1970.
[S-3] Schwartz, S.C. "Convergence of Risk in Adaptive Pattern Recognition Procedures", Proc. Fifth Allerton Conf. on Circuit and System Theory, 1967.
[S-4] Sebestyen, G.S. "Decision-Making Processes in Pattern Recognition", Macmillan, New York and London, 1962.
[S-5] Spragins, J. "Learning without a Teacher", IEEE Trans. Info. Theory, Vol. IT-12, pp. 223-229, 1966.
[S-6] Spragins, J. "A Note on the Iterative Application of Bayes Rule", IEEE Trans. Info. Theory, Vol. IT-11, pp. 544-549, 1965.
[S-7] Spragins, J. "Reproducing Distributions for Machine Learning", Stanford University, Stanford, California, Tech. Rept.
[T-1] Tou, J.T. and Heydorn, R.P. "Some Approaches to Optimum Feature Extraction", Computer and Info. Sci.-II, Academic Press, pp. 51-89, 1967.
[W-1] Watanabe, S. "A Method of Self-Featuring Information Compression in Pattern Recognition", Bionics Symp., 1966.
[W-2] Watanabe, S. "Karhunen-Loève Expansion and Factor Analysis: Theoretical Remarks and Applications", Proc. Fourth Conf. Inform. Theory, Prague, 1965.
[W-3] Whitney, A.W. "A Direct Method of Nonparametric Measurement Selection", IEEE Conf. Record of the Symp. on Feature Extraction and Selection in Pattern Recognition, 1970.
[W-4] Wee, W.G. "On Feature Selection in a Class of Distribution-Free Pattern Classifiers", IEEE Trans. Info. Theory, Vol. IT-16, 1970.
[W-5] Wolverton, C.T. and Wagner, T.J. "Asymptotically Optimal Discriminant Functions for Pattern Classification", IEEE Trans. Info. Theory, Vol. IT-15, 1969.
[Y-1] Yegisapetov, E.G. "Learning Process in Pattern Recognition Using Linear Decision Functions", Engineering Cybernetics U.S.S.R., No. 5, pp. 76-85, 1969.
APPENDICES

APPENDIX A

This appendix demonstrates (4.5.1). The performance criterion function for any pair of classes $H_j$ and $H_k$, in a W-dimensional Φ-function space, after N training patterns have been processed, is given by

$$e^{jk}(N;t,\tau_W) = R(j|k) - \frac{1}{N}\,\bar{a}^{jkT}(N;t,\tau_W)\,P(N;t,\tau_W)\,\bar{a}^{jk}(N;t,\tau_W).$$

Substituting for $\bar{a}^{jk}(N;t,\tau_W)$ from (3.1.2),

$$e^{jk}(N;t,\tau_W) = R(j|k) - \frac{1}{N}\,\bar{\theta}^{jkT}(N;t,\tau_W)\,P^{-1}(N;t,\tau_W)\,\bar{\theta}^{jk}(N;t,\tau_W),$$

so that

$$e_s(N;t,\tau_W) = \sum_{j<k} 2\left[R(j|k) - \frac{1}{N}\,\bar{\theta}^{jkT}(N;t,\tau_W)\,P^{-1}(N;t,\tau_W)\,\bar{\theta}^{jk}(N;t,\tau_W)\right].$$

For a two-class pattern recognition system with a 0-1 loss function, the system's performance criterion function is given by

$$e_s(N;t,\tau_W) = 2\left[1 - \frac{1}{N}\,\bar{\theta}^{12T}(N;t,\tau_W)\,P^{-1}(N;t,\tau_W)\,\bar{\theta}^{12}(N;t,\tau_W)\right]$$

with $\bar{\theta}^{12}(N;t,\tau_W) = \bar{\theta}^1(N;t,\tau_W) - \bar{\theta}^2(N;t,\tau_W)$, where $\bar{\theta}^j(N;t,\tau_W)$ and $P^{-1}(N;t,\tau_W)$ are defined in Sec. 3.1.

The quantity $R(j|k)$ is equal to 1, as seen in the following steps:

$$R(j|k) = \max_{1 \le i \le S} \{r_i(j|k)\} = \max_{1 \le i \le S} \{[c(j|i) - c(k|i)]^2\}.$$

If $c(j|i) = 1$ for $j \ne i$ and $c(j|i) = 0$ for $j = i$, then $R(j|k) = 1$.

APPENDIX B

This appendix establishes (4.5.2) and (4.5.3), which are rewritten below:

$$\Delta_W e_s = -[d^{jk}(N,W+1) - d^{jk}(N,W)] < 0; \qquad (4.5.2)$$

$\Delta_W e_s$ is not defined when $\phi_{W+1}$ is a linear combination of the previously established Φ-functions $(\phi_1, \phi_2, \ldots, \phi_W)$. (4.5.3)

The notation has been simplified by dropping t and replacing $\tau_W$ with W:

$$d^{jk}(N,W) = \bar{\theta}^{jkT}(N,W)\,P^{-1}(N,W)\,\bar{\theta}^{jk}(N,W).$$

From (3.1.3),

$$\bar{\theta}^{jk}(n,W) = \bar{\theta}^{jk}(n-1,W) + z^{jk}(n)\,\bar{\phi}[\bar{x}(n)]; \qquad \bar{\theta}^{jk}(0,W) = \bar{0},$$

where $z^{jk}(n) = c(j|L) - c(k|L)$ if $\bar{x}(n)$ belongs to class $H_L$, and $c(j|k)$ is the cost of deciding that $\bar{x}(n)$ belongs to class $H_j$ when it actually belongs to $H_k$, with $c(j|j) = 0$. The alternative computation of the $\bar{\theta}$ vector discussed in Sec. 3.1 proceeds by writing

$$\bar{\theta}^L(n,W) = \bar{\theta}^L(n-1,W) + i^L(n)\,\bar{\phi}_W(\bar{x}(n)); \qquad \bar{\theta}^L(0,W) = \bar{0},$$

where

$$P_L(n,W) = P_L(n-1,W) + i^L(n)\,\bar{\phi}_W(\bar{x}(n))\,\bar{\phi}_W^T(\bar{x}(n)); \qquad P_L(0,W) = \beta I,$$

and

$$i^L(n) = \begin{cases} 1 & \text{if } \bar{x}(n) \text{ belongs to class } H_L \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$\bar{\theta}^{jk}(N,W) = c(k|j)\,\bar{\theta}^j(N,W) - c(j|k)\,\bar{\theta}^k(N,W) + \sum_{\substack{L=1 \\ L \ne j,k}}^{S} [c(j|L) - c(k|L)]\,\bar{\theta}^L(N,W),$$

where $N = N_1 + N_2 + \ldots + N_S$ and $N_i$ is the number of training patterns from class $H_i$. For a 0-1 loss function,

$$\bar{\theta}^{jk}(N,W) = \bar{\theta}^j(N,W) - \bar{\theta}^k(N,W) \quad \text{if } j \ne k,$$

and $P(N,W) = P_1(N,W) + P_2(N,W) + \ldots + P_S(N,W)$.

If a new Φ-function $\phi_{W+1}$ is added,

$$d^{jk}(N,W+1) = \begin{pmatrix} \bar{\theta}^{jk}(N,W) \\ \theta^{jk}_{W+1} \end{pmatrix}^T \begin{pmatrix} P(N,W) & C \\ C^T & p_{W+1} \end{pmatrix}^{-1} \begin{pmatrix} \bar{\theta}^{jk}(N,W) \\ \theta^{jk}_{W+1} \end{pmatrix},$$

where $\theta^{jk}_{W+1}$ is the scalar difference in the theta values summed over all patterns from classes $H_j$ and $H_k$, $p_{W+1}$ is a scalar which equals the sum of the squares of the new Φ-function summed over all patterns, and $C$ is a vector whose j-th row is $\sum_{i=1}^{N} \phi_j(\bar{x}(i))\,\phi_{W+1}(\bar{x}(i))$. It can be shown [P-8] that

$$d^{jk}(N,W+1) - d^{jk}(N,W) = \frac{\left[\theta^{jk}_{W+1} - C^T P^{-1}(N,W)\,\bar{\theta}^{jk}(N,W)\right]^2}{p_{W+1} - C^T P^{-1}(N,W)\,C}.$$

Consider

$$\det \begin{pmatrix} P(N,W) & C \\ C^T & p_{W+1} \end{pmatrix},$$

where "det" stands for determinant. Multiplying the first W rows of the matrix by $C^T P^{-1}(N,W)$ and subtracting from row W+1 gives

$$\begin{pmatrix} P(N,W) & C \\ 0 & p_{W+1} - C^T P^{-1}(N,W)\,C \end{pmatrix}.$$

Expanding from the bottom row,

$$\det \begin{pmatrix} P(N,W) & C \\ C^T & p_{W+1} \end{pmatrix} = \left[p_{W+1} - C^T P^{-1}(N,W)\,C\right] \det P(N,W).$$

Now, both determinants are greater than zero, since both are determinants of positive definite matrices. This implies that

$$p_{W+1} - C^T P^{-1}(N,W)\,C > 0 \quad \text{and} \quad d^{jk}(N,W+1) - d^{jk}(N,W) > 0,$$

which demonstrates (4.5.2).

To show (4.5.3), let $\phi_{W+1}(\bar{x}) = \bar{a}^T\bar{\phi}_W(\bar{x})$, where $\bar{a}$ is a scalar vector and $\bar{\phi}_W(\bar{x})$ is the vector which consists of the W previously established Φ-functions. Then

$$\theta^{jk}_{W+1} = \theta^j_{W+1} - \theta^k_{W+1} = \bar{a}^T\bar{\theta}^{jk}(N,W), \qquad p_{W+1} = \bar{a}^T P(N,W)\,\bar{a}, \qquad C = P(N,W)\,\bar{a};$$

and

$$d^{jk}(N,W+1) - d^{jk}(N,W) = \frac{\left[\bar{a}^T\bar{\theta}^{jk}(N,W) - \bar{a}^T P(N,W)\,P^{-1}(N,W)\,\bar{\theta}^{jk}(N,W)\right]^2}{\bar{a}^T P(N,W)\,\bar{a} - \bar{a}^T P(N,W)\,P^{-1}(N,W)\,P(N,W)\,\bar{a}} = \frac{0}{0},$$

which is not defined. This in turn implies that (4.5.3) is not defined.
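The following numerical check, not part of the thesis, illustrates the two results of this appendix on synthetic data: appending a new Φ-function never decreases d^{jk}(N,W), while appending a linear combination of the existing Φ-functions changes it only by an amount that vanishes with the regularization, since the closed-form increment is then the undefined 0/0 of (4.5.3). Here β = 10^{-10} follows the value used in the simulations; the data and class labels are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta = 60, 1e-10
Phi = rng.normal(size=(N, 3))            # N patterns, 3 existing Phi-functions
labels = rng.integers(0, 2, size=N)      # two classes, 0-1 loss

def d_jk(Phi_w):
    """d = theta_jk' P^-1 theta_jk with P = Phi'Phi + beta*I and
    theta_jk = theta^j - theta^k (class-wise column sums)."""
    P = Phi_w.T @ Phi_w + beta * np.eye(Phi_w.shape[1])
    theta_jk = Phi_w[labels == 0].sum(axis=0) - Phi_w[labels == 1].sum(axis=0)
    return theta_jk @ np.linalg.solve(P, theta_jk)

base = d_jk(Phi)

# (4.5.2): appending an independent Phi-function cannot decrease d_jk.
extra = rng.normal(size=(N, 1))
print(base, "<=", d_jk(np.hstack([Phi, extra])))

# (4.5.3): appending a linear combination of the existing Phi-functions leaves the
# quadratic form essentially unchanged (the printed difference is of the order of
# beta and round-off); without the regularization the increment would be 0/0.
combo = Phi @ np.array([[1.0], [-2.0], [0.5]])
print(abs(d_jk(np.hstack([Phi, combo])) - base))
```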
APPENDIX C

In this appendix the distance properties of $d^{jk}(N,W)$ are investigated. For a 0-1 loss function, and simplifying the notation as in Appendix B,

$$d^{jk}(N,W) = [\bar{\theta}^j(N,W) - \bar{\theta}^k(N,W)]^T P^{-1}(N,W)\,[\bar{\theta}^j(N,W) - \bar{\theta}^k(N,W)]$$
$$= [\bar{\theta}^k(N,W) - \bar{\theta}^j(N,W)]^T P^{-1}(N,W)\,[\bar{\theta}^k(N,W) - \bar{\theta}^j(N,W)] = d^{kj}(N,W).$$

Since $d^{jk}(N,W)$ is a positive-definite quadratic form, $d^{jk}(N,W) = 0$ if and only if $\bar{\theta}^j(N,W) = \bar{\theta}^k(N,W)$, and $d^{jk}(N,W) > 0$ otherwise.