2f§ -.
r". 51' .- U
a

 

   

I

 

          

 
 

$4". ......3
f6

  

Rana.
......vb mat?) .
V .vi #4. o.
'1.- ‘ a r?!

. .
a”...
l I
4 ..

    

 

              

       

          

       

 

.. .9... 41%.
f. u 1 in ‘
a”???

1.1.

...Em.
; Dinlabbvxd 3:.
«9. f.

 

..
ivy.

 

  

..

.0 9:. ..V
(Cuvuﬂro

4,113, L}

         

   

ma?

 

       

. .

 

Eb

   

 

 
 

 

 

     

 

 

  

  
    
    

 

 

 

 

 
 

 

.. ,n.< . . . .x'...‘v yl o. .. I. . .
- nV . H! V .‘ .u ‘
. . ...,. .. . V... . ‘4
.1 . ... .. .. , .. » .
. . —. c -‘ .~. ‘ ‘ n . ‘, V ‘ r,
T A . . .H _. ._ . . it 3...:”
A . V .. . . . V . V.V.. ,‘ .
a .. . : .. , .. . .x. .. .r M
.. , .. . V.... ... y ”a . .éruﬂnﬂuu."
. .V v. . . : v , I . . ‘ I J’AI.V n
, _ <
a , . ‘ ,
.. .. . V. . .V . VJ. . . ,
. .V V .. .. . . . . . n
,. _ . ,.V . .. . V.
x. V .. ‘ .. V ...
o , . t y o
I l . V .. V .1 u
v . .. , . V
n, -. ..,_ A ,. ‘ . V
.A . 0V . . . n
‘ . .. x . . . .
V .<. -V . r .u. I ,1 . . . ,
. 4 c u. . . . .
u- ..4 ,. .
_ ~ ..4. . . 0 v a
_ V . .. . . .
~v , . . . v a
,. . . V
x .
1 V . . .V . V.
I v .. . - .
. .

 

.... :. ... ,2...-

 
  

  

    

   

 

 

 

..<V. ..J:‘.,V.. A.V;.‘.:...:. . . y . o
1....r: 3%.. ...z. . .-.... ...-... .. V. .-.Z {unwmznwnrua ...»mwﬂm‘qudw.
. y . ... «1 . .5233. ?&§.§5m . b 9&3.

 
 

         

 

This is to certify that the
thesis entitled

FEATURE EXTRACTION AND
@-FUNCTION SELECTION
IN é-SYSTEMS

presented by

Benjamin Garba Ishaku

has been accepted towards fulﬁllment
of the requirements for

Ph.D degree in W Engineering
and Systems Science

;, /.//" _/ y .!,7__H___ .. ’l’nc'l I ,, I,
A-c/Vexz’gﬁ/ Cs» i" .x,::.vt..___ r

Major professor
Richard C. Dubes

Date October 14 97

0-7639

ABSTRACT
FEATURE EXTRACTION AND

9 -FUNCTION SELECTION
IN Q -SY STEMS

BY

Benjamin Garba Ishaku

This thesis is concerned with the problem of selecting or
extracting the optimum subset or subsets from a given set of prim-
itive features in a nonparametric multiclass pattern recognition
system.which uses Q-discriminant functions. The two main questions
encountered are: (1) What criterion should be used to compare the
performance of various feature subsets; and (2) What procedure will
ensure an optimum subset? Given any criterion, the solution to the
feature extraction problem is mathematically trivial. All that is
required is to evaluate the performance of all possible subsets and
pick out that subset which Optimizes the criterion. The problem is
not computationally trivial, however. It is not economically feasible
to consider all possible subsets, even for a moderate number of
primitive features on existing computers. Consequently, all practical
feature selection procedures involve only a fraction of the possible
feature subsets.

The problems of feature extraction and decision-making are
unified with a generalized mathematical formulation of the pattern

recognition problem in this thesis. The object of a pattern

Benjamin Garba Ishaku

recognition system is to minimize the misrecognition rate during the
testing phase of the system. Several definitions are given which
establish the notion of an Optimum pattern recognition system. 'Mis-
recognition rate or costs are the performance criteria.

An estimate of an upper bound on the mean-square error be-
tween a set of Q-discriminant functions and the Bayes discriminant
functions is adOpted as a computationally feasible criterion. A
sub-Optimum sequential procedure is then deveIOped for selecting
Optimal features and Q-functions simultaneously. The criterion adopted
is shown to have desirable properties but it is found to be a "good"
criterion for evaluating the discriminative effectiveness of features
and Q-functions only when the largest feature and @-function sets
are the best. The sequential procedure prOposed considers fewer
feature and Q-function subsets than do other sequential algorithms.
It allows feature sets and Q-function sets selected earlier as
effective to be discarded if they are found to be less effective
when combined with recently acquired features and Q-functions. Under
certain conditions, this characteristic of the prOposed algorithm
appears to be desirable eSpecially in cases where the Optimum (p-l)-
tuple feature (Q-function) subset is not contained in the Optimum
p-tuple subset.

A three-class Q-system.was constructed and implemented employ-
ing the criterion function and the sequential extraction rule pro-
posed in this thesis. The recognition capability of the system was
demonstrated as well as its ability to achieve feature extraction
and Q-function selection simultaneously. However, no simple func-

tional relationship between the misrecognition rate and the adOpted

Benjamin Garba Ishaku

criterion function was evident from the results of the simulations.
Recognizing that such useful discriminant function families

as linear, quadratic and generalized polynomials are included in

the family of Q-discriminant functions, the simultaneous feature

extraction and Q-function selection methods deveIOped in this thesis

represent quite general and powerful techniques for pattern recogni-

tion.

FEATURE EXTRACTION AND

Q-FUNCTION SELECTION
IN Q-SYSTEMS

By

Benjamin Garba Ishaku

A THESIS

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering
and Systems Science

1971

Dedicated to the article:
"The student as nigger"
by
Jerry Farber
from

This'Magazine is About Schools, Winter 1968

ii

ACKNOWLEDGEMENTS

The author wishes to express his gratitude to Dr. R.C. Dubes
for his guidance in the preparation of this thesis. He also wishes
to thank Dr. W.L. Kilmer, Dr. J.R. Kinney, Dr. C.V. Page and
Dr. R. Zemach for the fruitful discussions concerning this thesis.
Finally, the author wishes to express his appreciation to
Mrs. N.L. Barnes for typing the thesis and to Mr. Gary Ingvaldson

for programming the computer simulations.

iii

ABSTRACT

TABLE OF CONTENTS

ACKNOWLEDGEMENTS

LIST OF FIGURES

LIST OF TABLES

LIST OF APPENDICES

Chapter

I

II

III

IV

INTRODUCTION

1.1
1.2

Three basic problems in pattern recognition
Thesis outline and contributions

FORMULATION OF A PATTERN RECOGNITION SYSTEM

Pattern recognition systems and feature
extraction

Literature review

Criteria for pattern recognition systems
Feature performance criteria

Summary

6 DISCRIMINANT FUNCTIONS

3.1

3.2
3.3

3.4

Learning the coefficients of a @-
discriminant function

A scheme for selecting Q-functions
Algorithms for improving the @-
discriminant functions

Summary

FEATURE EXTRACTION AND DECISION MAKING UNDER A
SINGLE CRITERION

4.1
4.2
4.3

404

PrOposed criterion for feature extraction
PrOposed sequential feature extraction rule
Simultaneous feature extraction and @-
extraction

Features of the proposed sequential algorithm

iv

Page

iii
vi
vii

viii

var:

13
24
28
32

38
43

SO
53

57
61

69

4.5 Features of the prOposed performance
criterion

4.6 The Gaussian two-class case

4.7 The prOposed pattern recognition system
and simulations

4 . 8 Sumary

V CONCLUSIONS, OPINIONS AND POSSIBLE EXTENSIONS

5.1 General comments on pattern recognition
systems

5.2 Extensions and Opinions in the study Of
pattern recognition systems in general

5.3 Conclusions in regards to @-systems

5.4 Possible extensions in the study of @-
systems

BIBLIOGRAPHY

Page

70
74

79
84

96

98
99

100

101

Figure

1.1

3.2.1 .

4.7.1

4.7.2

LIST OF FIGURES

Page
A pattern recognition system 4
The sequence of the application of t and T 47
Implementation of strategy 8.3 and 3.4 80
Flow chart of prOposed Q-system 83

Table
2-1

4-1

4-2

4-4

4-5

4-6

LIST OF TABLES

A summary of pattern recognition systems

A summary of the results of approach 8.3 with
25 training patterns per class

A summary of the results of approach 3.3 with
50 training patterns per class

A comparison Of the approaches simulated for
Ve/Vi/Se recognition

A comparison of the approaches simulated for
Vi/Se recognition

A comparison of the approaches simulated for
Se/Ve recognition

A comparison of the approaches simulated for
Vi/Ve recognition

A comparison of pattern recognition systems

vii

Page

34

85

87

89

90

91

92

95

Appendix

LIST OF APPENDICES

The derivation of the performance criterion for
a two-class recognition system with 0-1 loss

The demonstration of the effects of increasing
the dimension of the Q-Space on the performance

criterion

The demonstration of the distance prOperties of
the performance criterion

viii

Page

107

109

113

CHAPTER I

INTRODUCTION

The subject of pattern recognition arose from a desire to
design a machine that automatically classifies objects into one of
a finite number of predetermined categories. An Object is given
to the pattern recognition machine, or computer, as a vector of
measurements taken on the object. Each measurement is referred
to as an attribute, or primitive feature, and the vector of measure-
ments, as a primitive pattern. A pattern class, or category, is a
source of object supply and, in general, each category is assumed
to be described by a probability distribution over the Space of
attributes. The forms Of the distributions may be known, known
partially, or not known at all. Recognition is achieved when, on
the basis of the measurements, the pattern recognition machine

associates a pattern with one of the pattern classes.

1.1 THREE BASIC PROBLEMS IN PATTERN RECOGNITION

In a recent paper which surveyed pattern classification
algorithms [H-4], Ho and Agrawala characterized the fundamental
problems associated with pattern recognition as the three problems
Of feature extraction, abstraction, and generalization.

Feature extraction deals with the selection of features,

minimal in number, but sufficient to characterize a given Object

for recognition purposes. The primitive features give the absolute
descriptions for a set of objects and are considered as a set of
coordinates. A given object is represented by a vector, or a point
called a primitive pattern, im = (X1,X2,...,Xm), in the m-dimensional
primitive feature space Im. Not all m primitive features are always
required for recognition. The solution to the feature extraction
problem calls for the selection Of a minimal set of primitive features
sufficient for recognition at a predetermined error rate thereby
generating a pattern, i, for every primitive pattern Em.

The abstraction problem involves the feature subset obtained
from a solution to the feature extraction problem and a set of patterns
called training patterns. The information necessary to produce a
discriminant function, u(§), is abstracted from these training
patterns. When only two classes, H1 and H2, are involved, the

discriminant function has the form:

> 0 if i 6 class H1
no?) - _, 2
< 0 if X 6 class H

The generalization problem attempts to assess the perfor-
mance, or goodness, of u(§) with new patterns through the use Of
appropriate criteria, chief among these being various error prob-
abilities.

It is obvious that these three problems are interrelated.
Success in solving the abstraction and generalization problems
clearly depends on how well the feature extraction problem has been

solved. But the feature extraction problem remains the most

difficult of the three. The nature of primitive features is not
easily defined in many situations and, at present, the primitive
features selected to characterize Objects depend entirely on the
intuition of the experimenter. The selection of the best subset
of primitive features for the purposes of recognition constitutes
the feature extraction problem proper. The abstraction and gen-
eralization problems have received much more attention than the
feature extraction problem and some solutions in these areas are
available on the basis of Statistical Decision Theory and non-
probabilistic procedures.

The application of Statistical Decision Theory to the
pattern recognition problem is called parametric pattern recogni-
tion and requires that the pattern classes be characterized by
probability distributions. When the underlying probability dis-
tributions are not known, so called nonparametric or deterministic
pattern recognition procedures are available for solving the
abstraction problem.

A pattern recognition system consists of algorithms which
attempt to solve the problems stated above. Given a set of primitive
features, which provide the absolute descriptions of the Objects,
the pattern recognition system must determine which primitive features
or which combinations of such features are effective for recognition
purposes, how to extract them, and how to classify the patterns. A
general model for a pattern recognition system is shown in Fig. 1.1.
The primitive features are sensed and measured by the receptor.

The derivation of these features is not always obvious and, at

present, only features for which there are measuring devices are

object

A PATTERN RECOGNITION SYSTEM

 

 

\V

 

Receptor

 

Feature

><
Vé

 

 

 

 

 

Extractor

 

 

 

Figure 1.1

 

Categorizer

 

 

allowed. At the output of the receptor is a set of measurements
from which the feature extractor distills the feature subset to be
used by the categorizer in classifying unknown input patterns. The
receptor provides the set of features which are used in the solution
to the feature extraction problem while the categorizer attempts to
solve the abstraction and generalization problems. Many of the
decision rules develOped in mathematical statistics and in non-
parametric pattern recognition have been implemented in the design
Of the categorizer. The design of the receptor still remains sub-

jective.

1.2 THESIS OUTLINE AND CONTRIBUTIONS

The main objective Of this thesis is to examine the problem
Of feature extraction in multiclass pattern recognition systems
which use Q-discriminant functions for decision making. No assump-
tions are made about the forms Of the probability densities for the
pattern classes.

Chapter 2 unifies the three basic problems in pattern
recognition, namely, feature extraction, abstraction, and generaliza-
tion. The Optimum pattern recognition system under Bayes criterion
is derived in the form of a 3-tuple, and a set Of requirements for
"good? performance criteria for feature extraction and abstraction
is prOposed. Following a comprehensive literature review, it is
shown that a computationally feasible pattern recognition system
can be expressed as a 4-tuple when separate criteria are used to
evaluate Optimum features and discriminant functions. Several

definitions are stated for comparing the relative performance of

pattern recognition systems during the recognition phase Of Operation.
Hitherto, there has been no general procedure for evaluating pattern
recognition systems. Researchers merely constructed and implemented
their own isolated recognition systems, providing little comparison
to systems developed by others. The definitions should standardize
investigations in the formulation of a general procedure for deter-
mining Optimum pattern recognition systems.

Chapter 3 introduces the Q-discriminant function and reviews
an iterative procedure for "learning" the coefficients which converges
to the coefficients which minimize the mean-square error between the
Q-discriminant functions and the Bayes discriminant functions in a
multiclass pattern recognition problem. A scheme which uses the
estimate of the upper bound on the mean-square error as a criterion
for selecting effective Q-functions is also reviewed. This criterion
is shown to be a good criterion for selecting effective features as
well. Finally, the algorithm for determining the coefficients of
the 6-discriminant functions is related to other learning schemes and
suggestions are made for adjusting the coefficients to improve per-
formance.

In Chapter 4, the estimate of the upper bound on the mean-
square error between the @-discriminant functions and the Bayes dis-
criminant functions is proposed as a criterion for both feature and
Q-function selection. It is shown that thecriterion has some of
the desirable prOperties of a good performance criterion proposed
in Chapter 2. Two sequential feature extraction algorithms are re-
viewed. A third sequential algorithm is proposed and its char-

acteristics examined. The prOposed algorithm can lead to

simultaneous feature and Q-function selection. Four strategies are
advanced for the case of generalized polynomial discriminant func-
tions. The distributions of the O-functions are assumed to be
normal for each class in a two-class recognition system. This
assumption is baseless and very unlikely in practice, but it
exhibits the relationships between the coefficients of the Q-dis-
criminant functions and coefficients derived in other algorithms.
Finally, the prOposed criterion function and sequential extraction
rule are used to construct and implement a Q-system. The results
demonstrate the feasibility of simultaneous feature extraction and
@-function selection.

Chapter 5 summarizes the conclusions and Opinions concerning
the thesis and suggests further research in the area of feature

extraction in pattern recognition systems.

CHAPTER II

FORMULATION OF A PATTERN RECOGNITION SYSTEM

A generalized mathematical formulation of a pattern recogni-
tion system must include not only the selection of features and of
discriminant functions, but also the selection of performance
criteria for evaluating the effectiveness Of features and discriminant
functions. The universe is assumed to consist of an m-dimensional
measurement space IF, where m is the maximum number Of primitive
features available; these features provide the absolute description
of an object. An object is thus represented by a vectordvalued
random'variable. The vector*, gm, of values assumed by such a
random variable is called a primitive pattern and its elements
X1,X2,...,Xm are referred to as primitive features. The absolute
standard of performance for a pattern recognition system is proba-
bility of error or probability of misclassification depending on
the strategy adopted.

Section 2.1 deals with the mathematical formulation of the
Optimum pattern recognition system under the absolute standard of
performance, known as Bayes criterion. A survey of the literature
in feature extraction can be found in Sec. 2.2. Section 2.3

suggests classes of criteria for feature extraction and discrimination.

 

The same symbol is used in denoting the random variable and the
values it assumes.

These criteria measure the effectiveness of various sets of features
and sets of discriminant functions. Section 2.3 also prOposes a
general class of discriminant functions and a general class of
feature extraction rules. Definitions of some commonly used feature
performance criteria are given in Sec. 2.4. Finally, Sec. 2.5

summarizes the main tOpics of the chapter.

2.1 PATTERN RECOGNITION SYSTEMS AND FEATURE EXTRACTION

For a given primitive feature Space Im, a feature extrac-
tion rule t is a deletion of some Of the coordinates of Im; the
resulting p-dimensional Space IP is called the feature Space and
each point in I? is called a pattern;

i.e.,

t: I? d I? , p s m .

The deletion can also be performed after a linear transformation has
been applied to the primitive feature space, but only deletions in
the primitive feature space will be considered in this thesis. The

family of all p-dimensional feature Spaces is expressed as:
m
{1:}. 4, = 1,2....,<p).

If XP 6 I? is a pattern with unknown classification, the
recognition problem is to assign it to one of s predetermined
pattern classes denoted by H1, H2....,HS. The density functions

.0 .0 II. p
p1(X),p2(X),...,pa(X) are defined on the feature space ‘1 , one

for each pattern class. A discriminant function u which

associates a pattern class with each set in the partition

10
{R1,R2,...,RS} of 1p is sought. The partition has the prOperty:
UR=IP;
#13

Rj n RR = ¢, j # k; j,k = 1,2,...,s;

where ¢ is the empty set. Thus the discriminant function maps
a feature Space to the decision space D = {d1,d2,...,d8]. In

particular,

Rj 5 {X: X 6 IP and u(X) = dj}
where u(X) denotes the value a discriminant function u assigns
to a point in the feature space 1?.
A cost c(j\k) is incurred whenever a decision is made in
favor of class Hj and a pattern actually originated in class HR.

The probability of this misclassification under u is given by

mﬂm=£pgbﬁ.
J

If the prior probabilities of the pattern classes are given
by q1,q2,...,qs, then the expected cost of misclassification of a
recognition system which uses the feature extraction rule t and

the discriminant function u, is defined as:

S S
Es(t; n) = z qkl 2 C(jlk)E(J\k)}.
kal j=1
j¢k

The discriminant functions which are optimum in the sense that they
minimize the expected cost of misclassification for fixed t are

Bayes discriminant functions and are given [A-l], by

11

.. 1 S ..
u (X) -—-:,- z qr (mop (X).1s3.kss
jk meh=1 hh h
where
pT(§) = qlp1(§) + 9292(2) +u..+-qsps(§)
and

rh(j\k) = c(j|h) - c(k‘h) .
The resulting Bayes decision rule is:

Decide if e H1 if ujkdi) 2 o for all k a! j.

If the Bayes performance criterion, 5, is used to find
Optimum feature sets as well as Optimum discriminant functions,
then a pattern recognition system can be represented as a 3-tuple
(t; u,6). The absolutely Optimum pattern recognition system under
this single performance criterion is the system that minimizes the
expected cost Es(t; u) over all t and u;

i.e.,

(t; u,6) is the system for which Es(t; u)

abs.Opt.

is minimum over all t and u, t E T, u E U; the class of allow-

P
LP

L I l,2,...,(:)}, and the class of allowable discriminant functions,

able feature extraction rules is denoted by T = {t ; p = 1,2,...,m;

P
by U = {u: I? a D}, where D is a decision Space. The functions
u can assume various forms and some of these will be discussed
later.
Intuitively, for both features and discriminant functions

to be effective in pattern recognition they must carry a high measure

12

of information of a type which highlights the differences between
pattern classes. This type of information will be referred to as
discriminative information and may or may not have a precise mathe-
matical definition. The discriminative effectiveness of features
and discriminant functions refers to a measure of this information.

The expected cost for a given pattern recognition system
should be a function of the discriminative effectiveness of the
combination of feature set and discriminant function. The Optimum
system minimizes this cost over all possible feature subsets and dis-
criminant functions. The expected cost of misclassification of the
absolutely Optimum recognizer is only computable if the density
functions for the classes are known. Pattern recognition problems
in which the forms of these density functions are known, or known
to within a set of parameters, are referred to as parametric pattern
recognition problems. Exact mathematical expressions for the prob-
ability of error can be Obtained only in simple parametric pattern
recognition problems. Most parametric pattern recognition problems
use Bayes discriminant functions whose total eXpected costs are
computationally impossible to minimize over all possible feature sub-
sets. Consequently, investigators have resorted to less desirable
but computationally workable performance criteria in seeking nearly
Optimum discriminant functions and feature sets.

A desirable characteristic of a performance criterion for
feature extraction is that it can be related, hOpefully monotonically,
to the probability of error. Except in very special cases, these
relationships have been difficult to exhibit. Nonparametric (deter-

ministic) pattern recognition problems have no probabilistic basis

13

and provide no mathematical expressions whatever for the expected
cost. The following five requirements for a "good" performance
criterion for feature extraction are proposed. Analagous require-
ments can be prOposed for discriminant function performance criteria.

R.1 The performance criterion should show that the per-
formance of a feature set which carries more discriminative informa-
tion is better than that of a feature set which carries less.

R.2 The performance criterion should not change when a
feature which carries no discriminative information is added to or
deleted from the feature space.

R.3 The performance criterion should be Optimum when the
feature set carrying the greatest amount of discriminative informa-
tion is achieved.

R.4 The performance criterion should be computationally
feasible.

R.5 The performance criterion should converge to its

Optimum‘value as the number of training patterns increases without

bound.

2.2 LITERATURE REVIEW

The feature set having Optimum potential for discrimination
must possess the best discrimination capability Of all possible
feature subsets. Not only does the generalization problem require
the formulation of a measure by which the discrimination effective-
ness of features can be evaluated, but a feature extraction rule
which selects the feature set which achieves discrimination most

effectively is also necessary.

14

The measure of discriminative effectiveness for each feature
must be isolated so that its contribution to a total measure Of a
feature subset can be ascertained. This can be done if the features
are independent in some sense. Isolating the discriminative effective-
ness for each feature becomes very difficult when the features are
statistically correlated. The total measure of discriminative
effectiveness for features related by strong stochastic or functional
linear relationships may be significantly larger than the sum of the
discriminative effectiveness for the separate features [F-l]. This
means that features which themselves are not effective in discrimina-
tion, but which are strongly correlated, may result in a feature
set having high discriminative effectiveness. Of course, in some
cases, the joint discriminative effectiveness of a feature set of
strongly correlated features may be less than the sum of the dis-
criminative effectiveness measures of the individual features. There-
fore, in any pattern recognition problem, it is important that
relations between features be taken into account. This, however,
necessitates considering all possible feature subsets, which is
usually impractical. The important point to note is that the dis-
criminative effectiveness of a given feature set is not always
improved by merely adding more features to that set. Improvement
only occurs when the discriminative effectiveness increases.

The seemingly unresolvable difficulties sighted in the
search for the optimum feature extraction rule have led many in-
vestigators to abandon the search for a solution to the problem,
and instead, to seek alternative subOptimum feature extraction pro-

cedures which can be implemented. There have been basically three

15

approaches of this sort which, in general, assume a discriminant
function and a performance criterion with which to assess it and
then search for the optimum feature extraction rule to match them.
The three basic approaches are discussed in the following sub-

sections. Table 2-1 summarizes the Literature Survey.

2.2.1 Information theoretic and statistical distance approach

In this approach, the search has been for a suitable
"distance" measure between a pair of pattern class distributions,
or a measure of the separation between the classes. Ideally, the
further apart a pair of classes in the feature space, in some sense
of distance, the smaller the probability of mistaking one for the
'Other. Thus, a separability measure is used as the measure of
feature discriminative effectiveness. The most widely used sep-
arability measure in pattern recognition problems has been the
divergence, a measure closely related to Shannon's logarithmic
measure of information. Unfortunately it can only be related to
the probability of misclassification in the case of Gaussian proba-
bility models.

Kullback [K-8] and Marill and Green [M-Z], applied the
divergence as a measure of feature discriminative effectiveness in
a two-class pattern recognition problem in which the class distribu-
tions were Gaussian with equal covariance matrices. They demonstrated
a monotonic function relating the divergence to the probability of
misclassification. However, no simple function has yet been found
that relates the divergence to this probability when the covariance
matrices are unequal, even though Kadota and Shepp [K-l], have

derived upper and lower bounds for the probability of error in terms

16

of divergence.

In multi-class pattern recognition problems, the criterion
for feature extraction has been to select that set Of features which
maximizes the minimum divergence between all pairs of classes:
Grettenberg [G-l], Fu and Chen [F-Z]. Another criterion in multi-
class pattern recognition problems was prOposed by Min [M-S] who
maximized the mean of the separability measures for all pairs of
classes under the constraint of minimizing the maximum variance of
the separability measures. Using a linear discriminant function
found under the performance criterion of minimizing the maximum
prObability of over-all error (minimax procedure), Min found a
monotonic functional relation between each of a set of pair-wise
misclassification probabilities and a statistical quantity prOposed
by Kullback [K-B] and by Anderson and Bahadur [A-Z]. He used this
quantity as the measure of feature discriminative effectiveness and
showed that it was related, under certain conditions, to both the
Mahalanobis distance [M-l], and the divergence. He further showed
that a monotonic functional relation can be found between the prob-
ability of misclassification and this new measure Of feature dis-
criminative effectiveness for the case of Gaussian distributions
with unequal covariance matrices.

In nonparametric pattern recognition, Min [M-S], prOposed
a feature extraction criterion which is based on a direct estima-
tion of the minimum expected error rates from a given set of train-
ing samples. The techniques for feature extraction he develOped
for the parametric pattern recognition problem were found to be

directly applicable to the nonparametric problem when the system

17

error was taken as the over-allpair-wise error probability. The
over-all pair-wise error probability is arrived at by computing the
error probability for each pair of classes and combining them into

a total multi-class pattern recognition system error probability.

In most cases, the weighted average of the pair-wise error proba-
bilities is taken as the system error probability.

In a recent thesis, Prabhu [P-8] applied the Mahalanobis
distance, which is a generalization of Fisher's criterion, as a
measure of the discriminative effectiveness of features along with
a linear discriminant function of second order to reduce the number
of features and classify electronencephalographic (EEG) signals.

At each step, that feature is selected which adds the most to the
discriminative effectiveness of the feature set already chosen.

All of the authors mentioned so far have deleted some
original features to obtain that feature set exhibiting the greatest
discriminative effectiveness under a given criterion.

Some investigators, [W-l] [T-l] [F-3], have approached the
problem through feature-Space compression techniques using linear
transformations as feature extraction rules. In this technique
the primitive feature Space 1m is compressed into a transformed
feature space through a linear transformation, in such a way as to
maximize the distances between pairs of classes in the transformed
feature space. Tou and Heydorn [T-l], applied this technique and
used the divergence as a measure of feature discriminative effective-
ness in the transformed feature Space while, in a nonparamatric
feature extraction problem, Patrick and Fischer [P-Z], used the L

2

norm instead of the divergence.

18

Relatively recently, the Bhattacharyya distance has been
proposed as a signal design criterion by Kailath [K-Z], and as a
measure of feature discriminative effectiveness by Lainiotis [L-l]
and Lainiotis and Henderson [L-Z]. The Bhattacharyya distance is
the negative of the natural logarithm of the Bhattacharyya coefficient,
also referred to as the affinity. Unlike the divergence distance
measure, the affinity is not restricted to Gaussian models. Kobayashi
and Thomas [K-S], have shown the existence of upper and lower bounds
on the Bayes risk in terms of the affinity as a generalization of
the error bounds first derived by Kailath [K-Z]. Using a different
approach, Lainiotis [L-3] also obtained a class of upper bounds on
the probability of error for the multi-class pattern recognition
problem in which an upper bound is shown to be a linear functional
of the pair-wise Bhattacharyya coefficients. Except for the case
of Gaussian models, no upper bounds have been Obtained for the
divergence. This has led to the claim that the Bhattacharyya dis-
tance is a better measure of feature discriminative effectiveness
than the divergence. A detailed comparison between the divergence
and the Bhattacharyya distance is available: Kailath [K-Z],
Kobayashi and Thomas [K-S]. Lainiotis and Henderson [L-Z] adOpted
the Bhattacharyya distance measure to a two-class feature extrac-
tion prOblem using feature-Space compression techniques. The
original data was assumed to be Gaussian and the best linear trans-
formation which maximized the Bhattacharyya distance in the trans-
formed feature space was obtained.

A coefficient prOposed by Kullback [K-9] is a generalization

of the Bhattacharyya coefficient. This coefficient, used by Renyi

l9

[R-l] to develOp a generalization of the Bhattacharyya bound, is
also referred to as the Chernoff bound, and provides an upper bound
on the Bayes error probability. Renyi [R-Z], also developed a bound
on the Bayes error probability in terms of the equivocation, which
is defined as the information missing from an average pattern about
the identity of the pattern class in the information-theoretic sense.
Further discussions and extensions on Chernoff bounds and the re-
lationship between the equivocation and Bayes error probability can
be found in Hellman and Raviv [H-l]. An upper bound on the Bayes
probability of error derived by Chu and Chueh [C-4], has been applied
to the feature extraction problem but the equivocation has not been
used as a performance criterion for feature selection.

Investigators such as Lewis [L-6], have prOposed a single-
number statistic in the form of an entrOpy as a measure of the dis-
criminative effectiveness for each feature. This statistic is taken
to be a measure of the "goodness" of the feature in reducing the
probability of misclassification. It is assumed that the features
are statistically independent and the optimum extraction rule is to
choose features in a descending order of their "goodness". Follow-
ing Lewis [L-6], Fu and Cardillo [F-4] prOposed two single-number
statistics for sequential forward and sequential backward extrac-
tion rules, reSpectively. The statistic for the sequential forward
extraction rule is an expectation Of a function describing the
correlation among pattern classes, the features already extracted,
and each of the remaining features. At each stage of the sequential
forward extraction rule, that feature with the maximum "goodness"

among all remaining features is added to the feature set already

20

extracted. The sequential backward extraction rule, on the other
hand, starts with the complete set of features and discards the

feature with the minimum "goodness" at each stage of the Operation.

2.2.2 Dimensignglity reduction approach

Viewing the problem of feature extraction as one of
dimensionality reduction, several investigators [F-3], have applied
the statistical and algebraic techniques of factor and discriminant
analysis, and of principal components, or KarhunenéLOEVe expansions.
All these techniques are Special cases of a general weighting pro-
cedure for choosing weights for linear combinations of the given
features [M-B]. In the Karhunen-Loeve expansion, for example, the
features are pre-weighted according to their relative importance in
characterizing the input patterns, regardless of the Specific dis-
criminant function in the recognition system. The weights are chosen
to make the features uncorrelated.

The feature reduction technique based on the Karhunen-LOEve
expansion was prOposed by Watanabe [W-Z], and later generalized by
Chien and Fu [0-2]. They defined the relative importance of a
feature in terms of its ability to represent pattern classes when
the primitive patterns are projected to a p-dimensional feature
Space, p < m. They develOped a rule for extracting a coordinate
system of weights that not only provides the best mean-Square
approximation to the primitive feature set, but also minimizes the
entrOpy function among all possible coordinate systems. It should
be mentioned however that in their develOpment they required that
the primitive features be normalized so as to make the lengths of

all patterns the same. This normalization facilitated the isolation

21

of the contribution of each primitive feature to the length of
patterns. A discussion on the normalization of primitive features

in pattern recognition can be found in Dubes [D-2]. The result is
that the discriminative information over the primitive feature

space is concentrated in a few coordinates by a linear transforma-
tion. This technique is not dependent on the probability structure
of the given features in the primitive feature Space IE, but Offers
no way for computing the expected system error. Furthermore, as
pointed out by Fukunaga and Koontz [F-S], the information concentrated
in the few coordinates in the Karhunen-LOEVe expansion is more class-
representative information than class-discriminative information.

For even though the expansion reduces the dimensionality of the
primitive feature space by discarding coordinates which do not carry
much information, the coordinates retained are only Optimum with
respect to representing a class. They do not necessarily extract
coordinates which highlight the difference between pattern classes.
Two pattern classes may have similar representation in the same
coordinates, making recognition in the feature Space no easier than
in the primitive feature Space.

In order to make the Karhunen-Loéve expansion relate mean-
ingfully to the pattern recognition problem, Fukunaga and Koontz
[F-S] proposed a procedure whereby the features are transformed
before applying the expansion. In a two-class pattern recognition
problem, their procedure allowed the important coordinates, in terms
of pattern representation of the first pattern class, to be the
least important for the second pattern class and vice versa. In

this way they were able to reduce the dimension of the original

22

feature space to two, consisting Of the one most important repre-
sentation coordinate for one pattern class and the one most impor-
tant representation coordinate for the other pattern class. These
two coordinates were shown to contain as much discrimination infor-
mation as all the original set of coordinates. It should be men-
tioned however that this method worked only for the case in which
both pattern classes have the same covariance matrices. This case
has been fully studied and is well known that the assumption of
equal covariance matrices allows decisions to be made with the same
probability Of error in one dimension as in the primitive feature
space.

Korsunskaya [K-7], used linear transformations to obtain
uncorrelated transformed features in a two-class Gaussian distribu-
tion feature extraction scheme. The linear transformation also
minimized the number of dimensions in the feature space by maximizing
the square of the difference between the means of the distributions
for the two classes, under the constraint of keeping the diSpersions
of the distributions constant. The fact that the features are un-
correlated in the feature Space permits isolating the contribution
of each transformed feature to the recognition problem, through its
measure of discriminative information.

Recently, Fukunaga and Olsen [F-6] approached the problem of
dimensionality reduction by looking at the local characteristics of
the data distribution rather than at the global characteristics.
They established small regions around the coordinates of the data,
and showed that the KarhuneneLOEVe expansions for these local regions

indicated dimensionalities which are closer to the intrinsic

23

dimensionality of the data than are the global KarhunensLOEVe
expansion. The intrinsic dimensionality of a datum is the minimum
number of parameters required to describe it. In another paper,
Koontz and Fukunaga [K-6] treated jointly the problems of dimen-
sionality reduction and distance measure enhancement between pattern
classes. They develop a recursive non-linear mapping algorithm
which generates a seQuence of data configurations that converges to
a final configuration in which the sum of the squares of the inter-

class Euclidean distance is maximized.

2.2.3 Reconstitution approach

Several approaches to the problem of feature extraction as
applied to image analysis on a two dimensional plane can be found
in a survey by Levine [L-S]. One of these attempts to determine
a set of features, minimal in number, with the prOperty that each
pattern under consideration can be formed by the superposition of a
subset of these features. This idea was suggested and implemented
by Block et a1. [8-1] in an algorithm for training the first layer
of weights in a two-layer learning machine. The features are con-
sidered as components or sub-patterns of the given patterns. This
method is essentially equivalent to clustering the pattern vectors
in the Space defined by the components. In practice, the require-
ment that the set Of features be a minimal set is difficult to
achieve. Nagy [N-l] used this approach in feature extraction on
binary patterns. The features were determined by the disposition
of the component vectors in a pattern space rather than by the

superposition method of Block et a1. [B-l].

24

Image processing deals with the actual problem of what
features make up the absolute description of Specific objects on
planar surfaces. In this thesis it is assumed that the problem of

the absolute description of Objects has been solved.

2.3 CRITERIA FOR PATTERN RECOGNITION SYSTEMS

The general philOSOphy behind many of the algorithms dis-
cussed in Sec. 2.2 is to establish a measure which evaluates the
discriminative effectiveness of all feature sets. The feature set
with the greatest discriminative effectiveness is that which con-
tributes most to recognition. Thus it is hOped that a good measure
will concentrate its discriminative effectiveness in as few features
as possible, thereby minimizing the number of features required for
discrimination. The key problem is to find a computationally feasible
measure of discriminative effectiveness; i.e., a reasonable perfor-
mance criterion. Some of the performance criteria already mentioned,
such as entropy, divergence and Bhattacharyya distance, belong to a
general class, B, of computationally feasible performance criteria
for finding Optimum feature sets.

After the selection of a performance criterion from B comes
the selection of a feature extraction rule itself, or a procedure
for choosing features with a total measure Of discriminative effective-
ness sufficient for recognition at a prescribed level. Obviously,
there are many ways by which features can be extracted. The
literature of pattern recognition contains essentially two major
feature extraction rules. The first is the deletion rule, which

allows only subsets of the primitive features. This approach has

25

appeared in three forms. The first is the Sequential Forward
extraction rule which starts by selecting the best discriminating
feature and sequentially arrives at the best p-tuple feature set
by adding a feature to the best (p-l)-tuple feature set. The
Sequential Backward extraction rule on the other hand, starts with
the entire feature set and determines the best p-tuple feature set
by discarding a feature from the best (p-l)-tuple feature set.

The third form of the delection rule considers all possible subsets
of the primitive features and deletes as many features as possible
saving the smallest number Sufficient for the discrimination of
pattern classes. This form is very limited in applicability since
present day computers cannot search over all feature subsets,
especially when the number Of primitive features is large.

The second most widely used extraction rule is the one
obtained by compressing the primitive feature space to a lower
dimensional Space, generally through a linear transformation. Both
the deletion rules and the compression rules will be considered as
members of a general class of feature extraction rules denoted by T.

Another problem is selecting a performance criterion for
assessing the effectiveness of a discriminant function. Peterson
and Mattson [P-6] deal with the prOblem of finding linear dis-
criminant functions for a class of performance criteria. The best
coefficients for use in a linear discriminant function are Obtained
not only according to Bayes criterion, which is the absolute standard
of performance, but also according to the computationally feasible
Fisher criterion. Other computationally feasible performance criteria

are entrOpy, divergence and the least-mean-Square, which, in addition

26

to Fisher's criterion, can be included in a general class of computa-
tionally feasible performance criteria, A, for comparing discriminant
functions.

The search for discriminant functions for use in decision-
making strategies led to Statistical Decision Theory, which is
applicable to probabilistic problems. For deterministic decision-
making strategies, linear, quadratic and polynomial discriminant
functions have been employed. All such discriminant functions can
be regarded as members of a general class of discriminant functions U.

If the existence of B, T, U and A is accepted, then each
of the algorithms prOposed as a solution to the pattern recognition
problem define the pattern recognition system (t, b; u, a), where
t E T, b E B, u E U and a E A. Expressing the performance criterion
corresponding to the pattern recognition system (t, b; u, a) by
the notation ys(t, b; u, a) where Ys* is the eXpected cost, or
the percentage Of misrecognition of the system, the following defini-
tions show how algorithms can be compared.

Definitiongl.

Given a discriminant function u E U, a performance criterion
a €,A, and a feature extraction rule t E T, a feature performance
criterion bg is better than a feature performance criterion b

h,

written b8 2) b if

h‘t,u,a

vs“, b8; U. a) < Ys(t. bh; U. n) -

 

The notation Vs stands for the expected cost or the percentage
of misrecognition and ys(t,b;u,a) stands for the value it assumes
for given t, b, u and a.

27

Def init ion 1;.

Given a discriminant function u E U, a performance criterion
a E A, and a feature performance criterion b E B, a feature extrac-
tion rule t: is better than a feature extraction rule tq written

k

P q
I) if
ti. t:k|b,u,a

YSUZE’ b; Us a) < 'Ys(t:s 1);“: a)

A p-dimensional feature space is better than a q-dimensional

feature Space, p s q if

for some L and all k.
Definition III.

Given a discriminant function u E U, a performance criterion
a E A, a feature extractor (ti, bg) is better than a feature

extractor (ti, bh), written (ti, bg) :3 (ti, bh)\u’a if

ys(t:, b ; u, a) < ys(t:, bh; u, a) and p s q .

8

Definition I2,

Given a discriminant function u E U, a feature extraction
rule t E T and a feature performance criterion b E B, a dis-
criminant function performance criterion ai is better than a

discriminant function performance criterion aj, written ai:3 a

if

ys(t, b; u, ai) < vs(t, b; u, aj) .

jtsbsu

28

Definition V.

Given a feature performance criterion b E B, a discriminant
function performance criterion a E A, and a feature extraction rule
t E T, a discriminant function uw is better than a discriminant

function u , written u 13 u \ if
z w z t,b,a

Ys(ta b; UW’ 3) < 'YS(t’ b; ”z, 3) °

Definition VI.

P
L

than a pattern recognition system (ti, bh; uz, aj) if p s q and

A pattern recognition system (t , bg; uw’ ai) is better

P . P .
YS(tL’ bg) Uw, 8.1..) < YS(tk, bh’ Uz, aj) 0

Table 2-1 lists some of the computationally feasible per-
formance criteria employed in the selection of features and the
search for Optimum discriminant functions. Also in the table is
a list of various feature extraction rules and discriminant func-
tions that have been applied in pattern recognition Systems, matched
with the authors who applied them. Surveys Of the literature Of
feature extraction can also be found in Min [M-S], and Prabhu [P-B],
as well as in an IEEE Conference Record of the Symposium on Feature

Extraction and Selection in Pattern Recognition [I-l].

2.4 FEATURE PERFORMANCE CRITERIA

The definitions Of some of the feature performance criteria
listed in Table 2-1 are given below. The definitions are stated for

two-class pattern recognition systems.

29

Denote the density functions defined over the feature space
LP, p S m, for class H1 and H2 by p1(X) and p2(X) reSpectively.

Let the likelihood ratio be

p20?)

15(2)

 

L6?) =

and let Ei[°] be the expectation under class H1.

Divergence.

2
The divergence, J, between classes H1 and H is defined

 

 

as
p (i') _, _, p (55) _,
J =J‘ p1(XM,n 1 _. dX +I p2(X).{,n 2 _. dX
xv p200 IP p100

= E1[(L - 1)Ln L] .

Bhattacharyya distance.

The Bhattacharyya distance, B is given by

B a any #1669263) di’
IP

= -Ln EIL/L] .

Mahalanobis distance.

The Mahalanobis distance, d, is given by

-. 4T 1 2-1—9 —0
d=(l-L1-l-52) (Z +2)(l1'1‘ll2)

where 31 and 2'1

are the mean vector and covariance matrix under
class Hj.
Tou 22d Heydorn's intraset meanrsqugre estimation error.

Let {ai; i = l,2,...,m; mi< n} be a set of coefficients.

Then, the intraset mean-square estimation error, Rm(;), is defined by

30
a a m s a 2
sm(a) = Ej{\z - iElfiotwiail }

where
fi(X) is a function of pattern vector X

{wt} are weighting coefficients

---9 -9

Z = X - 31 is a normalized pattern from class HJ

”j is the mean Of class Hj.

Lewis' entrOpy measure.

Assume that each feature f , j = l,2,...,N, can take Vj

J
possible values, a particular value of fj being denoted by fj(k)

k = l,2,...,Vj. Lewis' entrOpy, Gj’ defined for each feature f

j
is given by

2 Vj

i
G. = z 2 p[H .fj(k)]tn
3 i=1 k=1

1
pEH ‘f1(k)]
9011)

2 v . p(Hijf (k)]
= '23 szLH‘.fj(k))[ 1i 1]
i=1 k=l p(H )

 

 

where p[H1,f (k)] is the probability of the joint occurrence of

J
H1 and fj(k), p[Hi‘fj(k)] is the probability of the occurrence
of Hi given fj(k) and p(Hi) is the prior probability of Hi.
Fu's entrOpy measure.

Let r s N be the number of features left to be measured
at any stage of a sequential probability ratio test and fj,
j 8 l,2,...,r, be the j-th remaining feature. Fu's entrOpy measure

Ij(n) associated with f after n features have already been

J

measured is defined as

i
2 1 p(f .H .x1.x2.....xn)
Ij (n) = 2 p(fj 9H 9x13x23' "9xn)‘{ln

i=1

j
P(fj)p(Hi)p(X1,X2,...,Xn)

31

where j = l,2,...,r, and N = r +'n.
Manhattan distance.
The Manhattan distance between points R and y in an
m-dimensional Space is defined as
d a m
pm(X.Y) = iEI‘Xi - yil
i.e., the sum of the absolute difference of the feature values.
Hamming distJanTce.
Let U = [a1,a2,...,an] be a binary word. That is, ak = 0
or 1. The Hamming distance between two points _U and
V = [81,52,...,ﬁn] is defined as the number of coordinates for
which all ak and Bk are different; i.e.
n
d" = kglmk C9 Bk) .
The notation <3 indicates mod-2 addition.
Neymark's spaped metric.
In an N-dimensional feature Space IF, consider the two sets

42 a

Of points X(;1,x ,...,xp) and Y<§1,§2,...,§q). A metric

N N %
p(xo') = [“51 “Elgmkm - 3;men - ynﬂ

is called spaced if, under the normalization condition

811 + 322 +u..+‘gNN = N

and with nonnegative quadratic form

N N
.51 “318mm“... - MW. ' Yn>

the distance between the sets X and Y

32
i j
p(X.Y) " min 90! .y )
1.1
is maximal.
Integral mean-squgre distance.
The integral mean-square distance between two functions

y(X) and z(x) is defined as

d<y.z) = [yum - zcxndxj’é .

The "continuity" criterion K.

K is defined as

where dij = distance from point i to point j in the primitive
feature space
Dij = distance from point i to point j in the feature

Space .

2.5 SUMMARY

This chapter was concerned with the mathematical formula-
tion of pattern recognition systems and a review Of pertinent
literature.

It was shown in Sec. 2.1 that a pattern recognition system
can be expressed asva 3-tup1e when the Bayes performance criterion
is used in seeking both optimum features and discriminant functions.
Following a comprehensive literature review in Sec. 2.2, Sec. 2.3

showed that a computationally feasible pattern recognition system

33

can be expressed as a 4-tup1e when the features and discriminant
functions are determined under different performance criteria.
Finally, several definitions were posited by which the relative
performances Of various pattern recognition systems can be assessed.
based on their relative costs or misrecognition rates. Section

2.4 contains definitions of some feature performance criteria.

34

mooa

moma

momﬁ

momﬁ

NomH

new»

:Owumowm
uHmmoHOmHz mo
auaaaeueoem

coma
nﬂcwooom mo
mwuucmouom

comm
uouum

coﬂuﬁcwoomm
Oomuuoo mo
suaaaeeeeum

coﬁuwcwooom
uouuuoo
mo Owuucooumm

coﬁuoumuo
mocuauomuom
Buumhm
coHuwawOOOm
auouuum

coeuueam
:wmmmﬁomwz mo
suaHaeeeoum

coﬁu
nwcwoowm uO
mwmucoouom

3 use
eooeaamxaq
Ensaxmz

xmﬁu mmzmm

cause
eeoeaaexag
Seaman:

cOHuOuHuu
oucuauomuum
:OHansm
uaucaaauuuwa

unused

nausea

oauem
sooeaﬁexaa

woman

oases
eoeeuﬂexea

coHuosam
unmawaauomﬂa

Hum Manda

mocowum>wa

COwumuummm
mhupcsom

mocuwuu>wm

moamwuo>wn

ex

eunuuom Sumo you

n penance
U museums
:OHumEuomcH

aoHuOuﬁuo
ouauauowuom
ouauuom

mmusuuow
mo mucmeoe
pSOOOm

mam umuﬂw
mo umouo
wcﬁocoomma
ouusxomm
Huﬁucosvom
cheapom
Hmwucosumm

some
nuEuommamuH
ummcwa

aOHu
nusuommcmua
Hammad

consumes

a

0 HO homuo
wawccmomou
aw mmusumom

uouﬁom

was“
noduouuuxm
unauuom

mauﬂw moumm

he-ew

now>uxumx

ﬂ~-og

whencuuuouu

ﬂu-zg
cumuu mam Hawks:

ﬁning meson

nonus<

35

woma

moma

womﬁ

moma

Roma

Roma

moma

Roma

coma

coHuchooom mo
mmuucmouom

coﬁquﬁw
uwmmuaomHz mo
muwaanunoum
SOHO

uﬁcwoomm wo
owuucouuom
coﬂu

Iwcwooom mo
mwuucooumm

mush nouum

sump uouum

moan uouum

moon uouum

coauaawouum
a“ muouum

cOwuchooom mo
mwuucmouom

oases

eooeaaexaa
Eaewxmz

muuucwpuooo
o>moa
namcSSumM

chasm
eooeaﬁexau
Eaaﬁxdz

nouum
muwsvmnquoz

mucoumwn
muesvmunuuz
umwuouaH

wouuu mo

sueaaeeeoum
so mpcaom

uouum mo

suaaeeeeeum
co amazon

sees 52
cause
coon: 9: .—
83.“ g

unused

cause
eooeeamaaa

COHmauoxm
O>NOA
scucssumx

oases
eooeuaexaa

“nomad

unused

oases
coon: 33

oases
eooaaaoxau

oases
eooeaaexag

unwuuumCOO
Ono cums
m.uonmﬁm

x musumow wcﬁcwua

now some pomH
escapee Ace H
ousmmoz
cOHumBuomcH

mouuaﬁouoou

m>woA
scmcssumx

mocowuo>mn

uouum cowuuSHumm

ouupvmncwuz
uomuuucH

mocowuo>aa
HuumoucH
wowcHHHum

codename
umkuunouuumnm

SOHmuoamﬁn
uamumooo
pu>o mono
rummage cum:

mcHEEmuwoum
owsmcza

oumsxomm
Huwuaosuom
ohmsuom
Huﬁucoscom

cOHu
luauowmmuue
amused

sown
umEuowmamuH
umoawg

some
nuEqumcuuH
umucwn

nose
quuOmmmuuH
“momma

some
nusuommnuua
unused

ﬂ~-zL

>>ma was COmHOz

mm-mg as

mm-uu em

ﬂs-au

snowmen was :09

ﬁa-eg

mmmsm pad uuovmx

9.2—...

assesses;

Hauxu

umvxmcamuox

36

Humw

Cums

Omma

0mmH

Oan

ohmH

momH

moms

moma

moms

moma

musm uouum

Soﬁuwcwoomm
mo meucwoumm

boom uouum

muum cOHqu
-auaemeaeeaz

cowuﬂcwoomm
mo mwuucmoumm

coﬁuwcwooom
mo mmuucmouom

cOHu
-chooum mo
oweuamouom

coﬁu
uﬁcwooom mo
Ownucooumm

woman no
seaﬁaeeeeum

cued uouum
:oﬁuwcwouom

GOEuwcwooom
mo owuucoouom

mummm

mucwuuumcoo 03u
some m.uonmam

commando
3855932

M coﬁumuwuu
: AAUHDGH Ugo U.—

uouum mo

muHHHnunoum
voumeumm

uouum
opusmmumumz

nouum mo

muﬁHHnunoum
momum

oases

eoeeaﬁexaa
835axuz

oauam

voosﬁamxwd
assess:

uuum nouum
coHuwawoomm

:OHuaawooom
mo Owuucoouum

ounce
eooeamexau

unocwg

hemmed

unused
masm

nonswwmz
umouuucux

HuHEOcmHom

oauem
eoocaaexea

ounce
coonaﬂexaa

cause
ecosaﬁuxaa

Humaocmaom

wcaumuoaao
mcmoaux

muum uoupm

mucamuumcoo
03u sums m.uusmwm

museumma
manocmausmz

M coauouﬁuo
tauaseaucoo:

omens:
mmuuuncmz

nouum
mumsmmucmoz

opusmmncuuz
HaywoucH

mmuum nouum
mouuomxm

OHS m” 0:

muaaanuuumow
x55 52

xauuua
vuoumm

ooauumun
ma.“ 88mm

cOHuOHmQ

maﬁaEuuwoum
owaucma

oumsuom
Huaucosmmm

:oHumEuow
umomua “mused
cheapom

Huguesdmmm

nunsuom
Huﬁucmsmom

aoﬁu

unsuommqmua
Hammad

cOwumHmQ

consumes

cos»
unencumasua
nausea

pucsuom
Huaucuamom

hm-oL guano
ﬂm-zH

m>mA mam :OmHOz

ﬁw-mm scenes

9 a o... u um>Hmo

hm-:g assumes
me-3L ems
h~-mu

monumwm
one Mewuumm

Fla... 5:

ﬂe-zu
.Hu um xuuamoz

Si... .32

CHAPTER III

Q-DISCRIMINANT FUNCTIONS

A practical pattern recognition system has been defined in
Sec. 2.3 as the 4-tuple (t, b; u, a). The order in which the
elements of the 4-tuple are selected Should be immaterial, but we
shall follow the general approach of first selecting a discriminant
function u and a corresponding performance criterion a, to
evaluate it. Later on, we Shall seek a matching feature extraction
rule along with a measure b to evaluate the performance of feature
sets.

Various types of probabilistic structures are available
for probability-based discriminant functions. When the form of
the density for each pattern class is completely known, or known
to within a set of parameters, parametric classification algorithms
can be applied to classify patterns. The unknown parameters can
be "learned with a teacher" [8-6] [8-7], when a set of training
patterns of known classification is available, or "learned without
a teacher" [8-5], when the set of training patterns are of unknown
classification. Nonparametric algorithms, on the other hand, deal
with situations in which distributions describing the pattern
classes are unknown. The only data available are a set of training
patterns, of known or of unknown classification. A mathematical

form for the discriminant function is then assumed and the training

37

38

patterns are used to "learn" the coefficients of the discriminant

function. In this thesis, a Q-discriminant function [N-S], will be

used for the nonparametric multi-class pattern recognition problem.
The discriminant function adopted in this thesis is pre-

sented in Sec. 3.1, along with its learning capability, convergence

rates and other computational considerations. Section 3.2 reviews

a scheme for the selection of Q-functions and discusses how it leads

to the definition of Q-systems. The possibility of improving the

performance of the chosen discriminant function is discussed in

Sec. 3.3. Finally, a summary of the main tOpics of this chapter is

given in Sec. 3.4.

3.1 LEARNING THE COEFFICIENTS OF A Q-DISCRIMINANT FUNCTION

This thesis will consider a set of discriminant functions
which are formed from a set, {W1(i):¢2(§):°°°»¢M(Y)}:0f arbitrary
functions defined on the primitive feature space 1?. These
arbitrary functions are sometimes referred to as Q, or potential,
functions and are used in formulating discriminant functions which
approximate the Bayes discriminant functions. The form of a @-

discriminant function is:

M
unit?) = 2 a cp (3?)
=1 L L
= 2T 0,5?)

'1‘ -o
where §M(K) - (¢ﬁ(i)»¢2(x):---:¢h(i))s superscript T denoting the
tranSpose Operation. Then, under a given criterion, and utilizing

any available prior knowledge, values for 3 6 RM' which best

39

approximate the Bayes discriminant function for the pair of pattern
classes being considered are determined. The potential functions
not only estimate the Bayes discriminant functions nonparametrically
but, unlike the conventional linear hyperplane learning schemes
[H-6] [8-1] [8-2] [M-A], also take into account higher order statistics.
The Bayes decision rule minimizes the expected cost of
decision-making and is defined to be the "best" rule. For a p-
dimensional pattern vector, or point in 1?, which is known to have
originated in one of the 8 pattern classes with prior probability
qi and density function pi(-), the Bayes discriminant functions
are defined in Sec. 2.1.
The application of the Bayes discriminant functions in Sec.
2.1 requires full knowledge of the quantities {qL,pL(-)}:=1. Since
these quantities are unknown, ways have to be sought to form dis-
criminant functions from a set of training patterns. The decision
rule problem therefore calls for the construction of a sequence of
estimates which approach the Bayes discriminant functions in some
sense, as the number of training patterns approach infinity. Several
authors have constructed functions which approximate the Bayes dis-
criminant functions: [A-3], [B-Z], [P-B], [H-S], [P-l], [R-3],
[8-3], [W-S], [8-4]. The approximating functions given below are
from Pitt and Womack [P-7]. No attempt is made to develop a new
set of discriminant functions in this thesis. Rather, an investiga-
tion of the feature extraction problem is made when a set of @-
discriminant functions is applied to the pattern recognition problem.
The Q-functions will be selected to Optimize both feature extrac-

tion and decision-making.

40

The unknown Bayes discriminant functions will be approximated

by:
M
—o ajk g jk -o
ujk(X,a ) Z a; qk(X)
{Fl
(3.1.1)
= '33“ 691,65)
where
ujk(§,;jk) = - ukj(§,;kj)

and Qafi) - (¢IC§),m2(i),...,mM(§)) is a vector-valued function

on 1?. The {mi(i)}:=1 are selected to meet Specific design con-
siderations such as computer storage. The set of classified training
patterns {§(l),§(2),...,Y(N),...} are obtained sequentially and
utilized in determining the vectors {ij(N)}:,k=1 which minimize
the set of performance indices {ng(3)}:’k=1 introduced later in
this section. Then, a pattern vector 2 of unknown claSsification
is classified with the approximate discriminant functions as if they
were the Bayes functions in Sec. 2.1. The Optimum decision rule,

in the mean-square sense, can be stated aS:

Decide i 6 Hj if ujk(i,zjk) 2 O, k = l,2,...,s
j 9‘ k .

The following assumptions are made concerning the Q-functions.
Assumption!l. {mi(§)}T=1 are linearly independent with probability
1, i.e., there exists a measure p(o) satisfying
fpcpiéimj (2mm?) - 0215“, where 51 is the Kronecka delta function

1
I.

and ai is some constant.

41

Assumption II.

2 "'° 2 -0 -o
EL{<pi<X)} = J‘pcpﬁwdxwx
I

A
8

L

ll
....
v
N
U
0
O
U
U)

i = l,2,...,M
where {pL(§)} are unknown probability density functions.
The coefficients for the j-kth discriminant function in
(3.1.1) after N training patterns have been processed is given,
[P-7], by the following iterative equations. The number in brackets
indicate the iteration number.
~ k - a'k
.1 (N) = P 1m) 93 (N) (3.1.2)
where
_. k —-0 -O “'1‘ -.
ej (n) = ejk(n-l) +-zjk(n)QM[X(n)] and 9] (0) = 0 (3.1.3)
for n = l,2,...,N.

The number zjk(n) = rL(j‘k) (defined in Sec. 2.1) if 2(n) belongs

to pattern class HL, and
P-1(n) = P-1(n-1) - [Q;(§(n))P-1(n-I)QM(§(n)) + 13‘1 (3.1.4)
-1 - T d -l
. [P (n-l)§M(X(n))§M(X(n)]P (n-l)

for n = l,2,...,N;

P (0) (l/B)I.

with B an arbitrary small number and I and M x M unit matrix.

j

Nk(3) for each j and k is given

The performance index Q

by

42

N
k.~ 1 S ‘ , a 2 ere
<23 (a) - ,7 L51 151 [ujk<5?‘(1>,a> - now] + a . a

where iL(i) is the i-th training pattern from class HL

, and

'NI +-N2 +u..+-Ns = N. The term 3 3T; can be drOpped if it is
assumed that the number of training patterns N is larger than M,
the dimension of the Q-function Space. In this case, the matrix
P-1(n) defined in (3.1.2) exists. A convergence theorem states

that for each pair of classes H1 and HR

the vector ajRCN)

in (3.1.2) approaches a vector, denoted by a;:, with probability 1
as the number of training patterns N approach infinity. The
coefficients 2;: minimize the mean-square error between the dis-

criminant function using a particular set of Q-functions, and the

Bayes discriminant function. This mean-square error is given by:
jk“. . -—O - .9 2
QE (a) ET{[ujk(i,a) ujk(X)] ]

where ujk(i) is the unknown Bayes discriminant function. A
second convergence theorem shows the rate at which the vector EjRCN)
jk

—¢
converges to a .
ms

This procedure requires ﬁliéll discriminant functions and

s s-l ﬁjk .
2 sets of coefficients a (N). The computation of the co-
efficients can be simplified, [P-7], by expressing the coefficients
as follows.
3 .

. k “L 3 1 5

8(N) ZrL(J\)W(N). (..)
L=1

The vector GL(N) is computed for pattern class BL from the set

of N training patterns as follows.

3L(n) . P-1(n) §L(n) (3.1.6)

43

where

3%) = Econ-1) + ﬂu) «is»
and

L _ . v . L

i (n) - 1 if X(n) 18 from class H

= 0 otherwise .
This procedure not only reduces the number of parameter

vectors that must be computed iteratively from E12211. to s, but

2

also retains the convergence rate prOperties of ajk(N) proved in
Theorem 2 [P-7].

In both theorems mentioned above, the Q-functions are
assumed as given, even though it is clear that the mean-square
approximation error between all j-kth discriminant functions and
the Bayes discriminant functions, and hence the loss due to mis-
classification, depends upon the Q-functions selected. In general,
there is no direct procedure for selecting an Optimum set or number
of Q-functions. The scheme in Sec. 3.2 was prOposed by Pitt and
Womack [P-7] for selecting and testing several sets of Q-functions
to determine their relative performance based on the available

training patterns.

3.2 A SCHEME FOR SELECTING Q-FUNCTIONS

..jk

a,ms the coefficients which minimize ng(3),

Denote by
«jk «jk .
where a (N) converges to ams with probability one as N a m.
The resulting approximation error, gjk, between the j-kth dis-

criminant function and the Bayes discriminant function is given by

44

jk

a -0j ICTQ
6 ET{[amS

a a 2
An upper bound for this mean-square approximation error was derived

[P-7] and denoted by:

31k gjk'r 31k ajkT -1 —-jk

P 9 (3.2.1)

IDS ms

= Rom - - Rom-

where Bjk(N) converges to 3;: and P(N) converges to Pms

with probability one as 'N 4 m, and
R(j\k)= max {r i<1|k>1
Isiss
An estimate of an upper bound on the mean-square approximation

error after n training patterns is:

km) =R(jlk)-;11'3m (r03 km ,

where ajk(n) and ‘ajk(n) are given by (3.1.2) and (3.1.3)
reSpectively. Theorem 1 [P-7], shows that

lim gjk(N) = 31: with probability 1 .

Nam

The use of the approximation error ejk(N) can improve

the performance of the Q-discriminant functions since it permits
one to compare the performance of sets of Q-functions. There is,
however, no general procedure for selecting the Optimum set of @-
functions, or the set which makes the Q-discriminant functions per-
form as well as the Bayes discriminant functions. In fact, no set
of Q-functions has this prOperty in every situation. The average
loss due to misclassification is always greater than that incurred

with the Bayes discriminant functions. The only case in which the

45

system loss is asymptotically equivalent to the Bayes loss is when
the Bayes discriminant functions can be expressed as a finite sum
Of the Q-functions selected. The following statements, proved in

[P-7], state the point precisely.

 

Statement 1. If for some j and k there exists an ; E‘ZO 6 RM
such that
u (if) =§T§ (if) (322)
jk 6 M o 0
then
11111 31km) = 331‘ = '3 with probability 1
ms 6
N-«n
where
"jkT -o = —o
ems QMOI) ujkOt) a.e.
Statement 2. For each pair j,k such that 36 satisfies (3.2.2)
lim ng(3jkm)) = 0 with probability 1.

N—IOO

Statement 3. If there exists a set of coefficients 2 k M

 

g E R such
that
—o = "jkT -o
ujkm a6 mm
and
J“ 93613.55 = o for j,k = 1,2,...,s
jk
j 9‘ k
where Ajk

- {if : ujkdi) = 0}, then

lim ‘eSCN; u, 3) - eS(N; u, 36” = o with probabiliy 1,

N-m

46

where es(N; u, 3) is the expected system loss due to misclassifica-
tion after N adaptations of the Q-discriminant functions and

eS(N; u, 36) is the expected system loss incurred from using

the Bayes discriminant functions. It should be noted that the
expressions es(N; u, 2) and eSQN; u, :5) do not involve feature
extraction or Q-function selection.

A slight alteration in the notation will be made to show the
dependence of the coefficients zjk(N), and the approximation error
ejk(N), on the dimensionality of the primitive feature Space as well
as on the dimensionality of the Q-Space. A Q-extraction rule 7
is a deletion Of some of the coordinates of the given Q-Space QM
to produce a V-dimensional Q-Space Qv, V s M. Let t and T
denote two extraction rules applied to the primitive feature space
I? and the Q-space QM, reSpectively. Figure 3.2.1 indicates the
sequence in which t and T are applied.

Following the application of t and T, the coefficients

3

a 1‘'(N) and the mean-square approximation error ejk(N) are

designated by Epk(N; t, T) and ejk(N; t, T), reSpectively.
Expressing the system's mean-square approximation error as
s
k
68(N; t: T) = z €j (N; t: T)
j.k=1
the following definitions can be stated.
Definition VII.
Given a Q-extraction rule T and N training patterns,

P

a feature extraction rule tL is better than a feature extraction

rule written tpza tq\T if

q
"k’ t k

THE SEQUENCE OF THE APPLICATION OF

primitive

‘

pattern,

m features

 

 

rule

patterns

 

 

 

 

I

p features

pSm

 

i machine

AND

 

 

 

 

w

 

rule

M Q-functions

Figure 3.2.1

 

___\
7

V Q-functions

V Sin

48
.801; ti. T) < esm; ti. T) -

A p-dimensional feature Space is better than a q-dimensional

feature Space, p s q, if

tziD ti‘T for some L and all k .

Definition VIII.

 

Given a feature extraction rule t and N training patterns,
. V w .
a Q-extraction rule Tr is better than a Q-extraction TL, written,

V W
TrDTL‘t if
es(N; t, TY) < e504; t. T2) -

A set of V é-functions is better than a set of W @-

functions, V s w if
V W
Tr:3 TL\t for some r and all L .

These definitions suggest that the system's mean-square
approximation error es* is a good criterion for selecting Q-func-
tions as well as features. This means that the performance of
features and the performance of Q-discriminant functions can be
evaluated under a single performance criterion. The notion of
assessing both the feature extractor and the categorizer under a
single performance criterion was suggested and implemented in a

different mathematical structure by Wee [w-4].

 

*

The notation es stands for the mean-square approximation
error and 38(N; t, T) stands for the value it assumes for
given N, t and '1'.

49

The definition of a pattern recognition system given in
Sec. 2.3 included a class of feature extraction rules T. The
adaption of Q-discriminant functions has imposed the additional
requirement of an extraction rule for Q-functions. Thus, pattern
recognition systems which use Q-discriminant functions, called
Q-systems, must include two types of extraction rules. The defini-
tion of a general pattern recognition System was given as the 4-
tuple (t, b; u, a). Choosing the performance criterion as for
both .a and b, the Q-system can be defined as the 4-tuple
(N; t, 7, es) where N is the number of training patterns. If
the expected cost or the percentage of misrecognition of a Q-
system is denoted by YS(N; t, 7, es), the following definition
shows how Q-systems can be compared.
Definition Lg.

Given N training patterns, a Q-system (N: ti, rZ, es)

. is better than a Q-system (N; ti, 7:, es) written
. P V . q w
(N: tL: Tr: es) 3 (N: tk’ Tf’ es)
if p s q, V s w, and

. V , q w
mm :3, Tr. es) < ysm, tk. if, 65) .

A lesser degree of a better Q-System occurs when p s q,

v>w, or p>q,V$W and

V W
Yam; t2! Tr: ‘8) < Yam; tit Tf. ‘8) °

50

3.3 ALGORITHMS FOR IMPROVING THE Q’DISCRIMINANT FUNCTIONS

As noted by Nilsson [N-S], a Q-discriminant function is
merely a linear discriminant function in which the individual
features have been replaced by Q-functions. Consequently, the
results Of studies on linear discriminant functions as applied to
pattern classification can be extended to Q-discriminant functions.

Several authors [H-6] [8-2] [Mk4] [H-7] [G-Z] [P-4] [G-3]
[D-l] [Y-l],have dealt with linear discriminant functions in a
fundamental problem in mathematical programming, switching theory
and pattern classification which can be stated in the following way:
Determine vectors 3 and '6 Such that 113-=3 and §>0. Each
row of the N X m matrix A is a training pattern from one of
two classes and N >'m. Each training pattern is augmented to have
an (m+l)-th component of value 1. The augmented training patterns
from one class are then multiplied by (-l). A solution to this
inequality is a hyperplane which dichotomizes the primitive feature
space into regions which are associated with the two pattern
classes. A complete separating hyperplane is one which correctly
separates all training patterns into pattern classes. If a complete
separating hyperplane exists.the training patterns are said to be
linearly separable or consistent. If no complete separating hyper-
plane exists the training patterns are said to be nonseparable or
inconsistent. Ho and Kashyap [H-7], formulated an iterative
algorithm which not only gives the complete separating hyperplane
in a finite number of steps, if it exists, but also serves as a
test for linear separability. Their algorithm consists of the

following alternating steps:

51

A1. For a fixed 3, a vector 3 is determined such that it con-
stitutes a least square fit in which J = “A3 - EHZ is minimized.
A2. For a fixed a, change a in the direction of steepest descent
of J, subject to the condition 34> 0.

The problem of learning the coefficients ajk(N) of a @-
discriminant function for any pair of classes Hj and HR can be
stated as follows: Determine vectors 3pk(N) and zgk(N) such
that C zjk(N) ' Egkcu) > 0 where C is a given N X M
matrix consisting of rows which are the images of the N training
patterns in the 9-space and N > M. Stated in this manner it is
easily observed that the problem is identical with the fundamental
problem discussed above with the obvious need for changing the terms
linearly separable and linearly nonseparable to Q-linearly separable
and Q-linearly nonseparable respectively.

Complete separating hyperplanes exist only when the regions
containing all training patterns from each pattern class can be
defined which do not intersect and are convex. In most practical
problems, however, such regions do intersect and the discriminant
functions are chosen to meet some Specific criterion function.
Suppose, for any pair of pattern classes Hj and HR, each training
pattern in the set {ii}:il from H‘1 and {it}::1 from Hk is

assigned the variables 61 and OR, reSpectively, which take the

L
values 0 and l. The value 1 is given to correctly classified
training patterns and the value 0 to incorrectly classified ones.
One criterion function that has been applied as a constraint on the

discriminant function is F, the maximm nuuber of training patterns

correctly classified during the learning process. With the

52

additional constraint, the fundamental problem mentioned earlier
is restated as follows: Determine vector 3 and 3 such that

A; “ B and ‘3 >~O to maximize F. Stated precisely

N N
31 k k
Maximize F = 2 5i +- Z 6L
i=1 L=l
under the constraints
1 if ii 3 - E > 0
61 =
1 . dj-O —o
0 if Xi a - B s 0 1 = l,2,...,Nj
1 if TEE-E<o
k
6L3 «k-o -0
0 if xLa-BZO L=1,2,000’Nk0

Golovkin [G-Z], has shown that the exact solution to this
problem is Obtained by solving a problem in partial integer linear
programming. Gol'shteyn and Yudin [G-3], and Dantzig [D-l] have
shown that the computations involved can be carried out in a finite
number of steps and Pervozanskiy [P-4], gave an algorithm which (1)
determines the complete separating hyperplane it if exists; (2)
Shows the existence or nonexistence of such a hyperplane. Finally
Yegisapetov [Y-l] has extended Golovkin's solution to the case
where more than one hyperplane can be constructed to separate a
given pair of pattern classes.

Golovkin's algorithm can obviously be extended to Q-dis-
criminant functions by formulating the images of the training
patterns in the Q-Space to replace the original training patterns.

In addition, both Ho and Kashyap's and Golovkin's algorithm can

53

be carried out in combination with Pitt and Womack's algorithm by
Obtaining the vector 35k(N) from‘Pitt and Womack and the vector
agk(N) by either Ho and Kashyap's or Golovkin's algorithm. The
vector :kaN) from‘Pitt and Womack is known to approach the co-
efficients which minimize the mean-square error between the j-kth
Q-discriminant function and the j-kth Bayes discriminant function.
The performance of the j-kth Q-discriminant function may be improved

upon by determining an optimum 3%k(N) through the Ho-Kashyap and

Golovkin algorithms or an algorithm develOped by Mengert [M-4].

3.4 SUMMARY

In this chapter, the Q-discriminant function was reviewed.
Section 3.1 showed, under certain assumptions on the Q-functions,
that an iterative procedure can be implemented to determine sets
of coefficients which converge to the coefficients for a linear
discriminant function that minimizes the mean-Square error between
the Q-discriminant functions and the Bayes discriminant functions
with prObability one as the number Of training patterns increases
to infinity. Section 3.2 reviewed a Scheme for selecting Q-functions
and the definitions by which the relative performances of Q-systems
can be assessed. Finally, in Sec. 3.3, the coefficients of the
Q-discriminant functions were related to coefficients determined
from other algorithms, and possible ways for improving on the

Q-discriminant functions were suggested.

CHAPTER IV

FEATURE EXTRACTION AND DECISION MAKING
UNDER A SINGLE CRITERION

From the point of view of Information Theory, each feature
in the set of primitive features contains some information about the
object being recognized. This information can be either repre-
sentative information or discriminative information. Representative
information accentuates the common attributes within a pattern class
while discriminative information emphasizes the attributes which
distinguish pattern classes. Consequently, discriminative informa-
tion appears to be the more relevant in pattern recognition. Even
though the possibility of a feature containing high measures of
both types Of information cannot be discounted, it seems that pattern
recognition can be regarded as an information destroying process.
This thesis conjectures that the more representative information
destroyed the better, and that the feature extractor is the device
to perform this destruction of information. In going from the
primitive feature Space 1? to the feature space 1?, p s m, the
feature extractor attempts to eliminate features with high measures
of representative information and retain features which discriminate
well. These discriminative features are, Of course, unknown to the
feature extractor. If they were known before hand, the feature
extractor would simply select the most discriminative features and
the problem of feature extraction would be easily solved. The

S4

55

destruction of representative information by the feature extractor
makes the patterns in the transformed feature space bear little
"resemblance" to what they "look like" in the primitive feature
space.

The same prOblem can be approached from the point of view of
distances between probability distributions over a given feature
space. The notion of a distance between probability distributions
is widely used in mathematical statistics and various distance
measures have been formulated, as discussed in Sec. 2.2.1. In
feature extraction, the ultimate feature Set should permit minimiza-
tion of the probability of error over all feature sets. In most
prOblemS, simple analytical expressions for the probability Of error
as functions of the feature set do not exist. When they can be
found, they are usually impossible to minimize over all feature sets.
Consequently, several distance measures have been studied as
computationally efficient alternatives to the minimization of error
probability. Only in Special cases [P-8], has the error probability
been expressed as a function of a distance measure. But a distance
measure will suffice if upper and lower bounds for the probability
of error can be formulated in terms of the distance measure. For
example, the probability of error can be bounded above by the
divergence and the Bhattacharyya distance, both members of the class,
B, of computationally feasible feature performance criteria dis-
cussed in Sec. 2.1. The aim of feature extraction is to search for
a transformation that will maximize, over all feature sets, the
distance between sets of training patterns from different classes

in the transformed feature Space. This thesis takes the view that

56

patterns in the feature space need not resemble those in the primitive
feature Space and that the "self distance" of a training pattern be
changed as much as possible by the transformation. Not only should
the dimensionality of the transformed feature space be minimized,
but patterns of different classes should also be made to "look
unalike". A possible measure of the self distance of a training
pattern is the Euclidean length of the vector representing the
training pattern.

Section 4.1 prOposes the Q-performance criterion of Pitt
and Womack [P-7] as a performance criterion for feature extraction
and reviews two sequential feature extraction rules. A third
sequential feature extraction rule is suggested in Sec. 4.2. Section
4.3 proposes a procedure for selecting both Q-functions and features
which are effective in discrimination simultaneously under a single
performance criterion. Four system approaches are suggested for the
case when the Q-functions are generalized polynomials. Section 4.4
establishes the properties of the prOposed sequential search pro-
cedure and Sec. 4.5 examines the prOperties of the prOposed per-
formance criterion. In Sec. 4.6, the Q-discriminant function is
studied in a two-class recognition problem under the assumption
that the Q-functions selected can be represented by Gaussian random
variables under the two classes. Section 4.7 summarizes the prOposed
Q-pattern recognition system and some computer simulations. Section

4.8 summarizes the main tOpics of the chapter.

57

4.1 PROPOSED CRITERION FOR FEATURE EXTRACTION

The feature performance criterion being prOposed is the
quantity introduced in Sec. 3.2. The notation expresses the
dependence of the quantities on the feature extraction rule t
and Q-extraction rule 7.

s

es(n;t.“r) - 2: ejk(n;t.'r) = 2[R(j\k) -§3“‘T(n;c.¢)'é'jk(n;m>1
j,k=I jak

“=1,2’...’N.

This is an estimate of the upper bound on the sum of the mean-
square approximation errors between the Q-discriminant functions and
the Bayes discriminant functions based on n training patterns.

For a given Q-extraction rule T, and a given number of training
patterns, various feature extraction rules might be tried to see
which minimize this criterion. If the set of all feature extraction

rules leading to feature spaces of dimension r is designated by

r _ m
Tr 3 {CL}, L ' 1:2:°°°s(r) a

O O I r c I O
the optimum feature extraction rule of dimension r, t minimizes

r Lopt
the misrecognition rate ys(N; CL, T, as) of the Q-system

(N; CZ: T: 35); 103':
r r m
t :3 t f = 1,2,... ; #
Lopt LLr or L .(r) L Lopt

Since misrecognition rates can only be determined from an actual

computer simulation we shall associate tr with the feature
opt
extraction rule which minimizes the system mean-square approxima-

r
, r); i.e.,

tion error 38(N, t1.

58

tr :3 tr\ if
L T
Opt

1' r
Opt

for all 4, - 1,2,...,(1‘).
To obtain a truly Optimum feature extraction rule, it is

necessary to consider all possible subsets of the set of primitive

m m
features. This means that a total number of 2 (m) = 8 -——EL———
r=l r r=1(m-r)lrl

subsets of features must be appraised to determine the subset with
the best performance measure. It is clear that the total number of
feature subsets, even for a relatively small value of m, can be

prohibitively high. For example, if m = 20

20 20

20 20!
2 = 2 —-—- = 1,048,575 .
r=l (r ) r=l (20-r)lr!

Consequently, any.realistic search procedure for the extraction of
features will have to be sub-Optimum.

Several authors [F-B] [F-4] [L-6] [M-B], have considered two
sequential search procedures as alternatives to an exhaustive search.
The first is the sequential forward procedure, which first extracts
the best single primitive feature, then extracts the best pair, in
which is included the best Single feature, and so on, as discussed
in Sec. 2.2. At each step, the search procedure selects from the
remaining set of features that feature which leads to the largest
decrease in the performance criterion. The procedure is stOpped
either when the performance measure has reached a prescribed value
or when all the primitive features have been exhausted. The other

sequential search procedure is the sequential backward search

S9

procedure. At each step it discards that feature which increases
the performance measure the most. The procedure stOps at a given
value Of the performance measure. The sequential procedures de-
crease the number Of feature subsets to be appraised considerably.
For example, the total number of feature sets that must be con-
sidered in the sequential forward search, is given by Ejgill .
For m = 20, this number is 210 which is four orders of magnitude
smaller than that required in the exhaustive search.

The main drawback to the sequential search procedures is
the fact that, once a feature has been selected as a good feature,
it is retained in the subset which eventually get admitted as the
best feature subset. However, the best (p-l)-tuple feature subset
may not be contained in the best p-tuple feature subset. This fact
is illustrated forcefully in an example in Prabhu [P-8]. In a two-
class pattern recognition system in which each pattern is char-
acterized by the three features (X1,X2,X3), Prabhu showed that the
best single feature for the purposes of discrimination is X3.
However, the best two features were found to be the pair (X1,X2),
which clearly show that a sequential search procedure in which X3
is retained can never find the pair (x1,x2). Thus, any sequential
search procedure in which selected features can be re-assessed will
be a highly desirable search procedure. Of course, the novelty of
re-assessing selected features will not in itself procure a truly
optimum subset of features. The reappraisal of selected features

permits discarding feature sets which, in combination with recently

acquired features, become less effective in discrimination.

60

Pattern recognition problems generally assume that a set of
N training patterns is given. This thesis further assumes that the
origins of the primitive training patterns are known, and the
patterns are compiled in an N X m matrix, where each row is a

primitive pattern. In matrix notation, the training set takes the

 

 

 

 

form:
f' r- '1
* Alﬂ if” (1)
A = AZT , ALT = ELT(2) , NL(j) 6 class Ht
ST T
5A J LYL (Natl
where NL is the number of training patterns from class Ht,
5

2 NL =‘N, and s is the number of pattern classes. Each training
(31 _,
pattern X&(j), j 8 1’2"..’NL; L 3 l,2,...,s, is represented as

a point in the m-dimensional primitive feature space 1?. Thus

ELT(j) = (X31,X§2,...,X§m). The feature extraction problem deals

with the attempt to rearrange the columns of the data matrix A

into two matrices BE and BE so that only features with a high
measure of discriminative effectiveness are retained in BE. The

primitive features in the data matrix B: will have little or no
effectiveness in discrimination. Ideally, a rearrangement of this
nature will supply the categorizer with a subset of primitive
features which will be almost as effective in classifying unknown
patterns as a scheme which uses all of the m primitive features.
After the rearrangement of the columns of the data matrix A, the
set of N training patterns can be represented as the partitioned

matrix

,.
1T 11‘

.1 .21
T . T 21' 21:
B=[B1.BZ]= 131 132
ST ST

PI Bze

 

 

where Bi is an N X r matrix and BE

with r +-q = m. Only the first r features are used for dis-

is an N X q matrix,

crimination. The rearrangement of the data matrix A into Bi
and B3, can be considered as a dichotomization of the m primitive

features.

4.2 PROPOSED SEQUENTIAL FEATURE EXTRACTION RULE

The basic prOperty of a pattern class is that its members
are more similar to each other than they are to the members of
another class in some of the coordinates Of the primitive feature
Space 1?. In order for recognition to occur between any pair of
classes it is only necessary that the objects of one class be dis-
similar in one feature from the objects of the other class. This
means that the lower bound on the number of features required for
recognition between any pair of classes is one, and that the feature
extractor can afford to destroy the information carried in all but
one of the primitive features. In an s-class pattern recognition
system the lower bound on the number of features required for
recognition is between one and .Eiglll . The lower bound of one
is reached when the 8 class regions are located in a chain on
a single coordinate. The lower bound 21%:11 arises when a dif-

ferent coordinate is required for the recognition of all pairs of

62

pattern classes. A sequential feature extraction rule is now pro-
posed which seeks to extract only one feature between any pair Of
classes, but which allows for the extraction of more than one
feature when necessary.

To begin the sequential operation, any two primitive features
are measured and three feature subsets are constructed; the first
subset consists of one primitive feature, the second of the other
primitive feature, and the third of the combination of the two
primitive features. The subset which performs the best is said to
contain effective features and is retained. The remaining subsets
are discarded. At the second Stage of the sequential Operation, a
third primitive feature is measured and two feature subsets are
constructed; the first consists of the third primitive feature
alone and the second consists of the third primitive feature in
combination with the effective features found in stage one. The
current effective features are determined and compared to the
effective features of the previous stage. The better of the two
is retained and the Operation proceeds to the next stage. At any
stage r, two feature subsets are constructed; the first consists
of the (r+l)-th primitive feature alone and the other consists of
the (r+l)-th primitive feature in combination with the previous
effective features. A comparison is made between the effective
features at the rth Stage and at the (r-l)-th stage and the
better of the two is retained. A subset of p effective features,
p s r +'l, is achieved at the end of the r-th stage of the sequential
operation. The feature space Spanned by the p effective features

is denoted by IP. This space can be considered to have been

L

63

P

obtained through the application of feature extraction CL, i.e.,
m m
t: : I 4 T2 for some L; L = l,2,...,(p)

At each stage of the Operation, the effective features must
perform no worse than the effective features at the previous stage.

This means that at any stage r, for a given Q-extraction rule T,

q
h‘T

p s r+l, q s r for some L and all h. The sequential Operation

and a given set of N training patterns, tEID t with
is continued until either all the m primitive features have been
processed, or the feature performance criterion has reached a pre-
assigned threshold.

The total number of feature subsets that must be considered
is 2m-1, where m is the number of primitive features. For
m B 20, the total number of feature subsets is 39. This number is
considerably smaller than 1,048,575 Obtained with an exhaustive
search, as well as the number 210 obtained for the sequential for-
ward procedure. Furthermore, under certain conditions it can per-
form better than the sequential forward procedure reviewed in
Sec. 4.2. For example, in Prabhu's example cited in Sec. 4.1, the
best single feature was X3, but the pair (X1,X2) gave the Optimum
performance. The sequential forward procedure would not attain the
Optimum features if X3 were retained. The sequential procedure
proposed here would Obtain the optimum features if the primitive
features are measured in the order x1,x2,x3. At the first stage
of the Operation, the three feature subsets (X1), (X2) and
(X1,X2) would be constructed and in this case (X1,X2) would be

retained. However if the order of measurement is altered, one

64

cannot guarantee that the prOposed sequential procedure will always
perform better than the sequential forward procedure. The only way
to determine the relative performance Of these procedures is to
actually carry them out. However, if a cost is associated with
each feature subset processed, it is clear that the proposed

sequential procedure will be the least costly to compute.

4.3 SIMULTANEOUS FEATURE EXTRACTION AND Q-EXTRACTION

The application of Q-functions in pattern recognition gen-
erates several difficulties. Besides the problems of selecting
"good" Q-functions, and the extraction of effective discriminating
features, one must determine the number, M, of Q-functions required
and of the number, N, Of training patterns to use. For a suffi-
ciently large number of training patterns, the @-discriminant
functions should be "close" to the mean-square approximations to
the Bayes discriminant functions, as shown by the convergence rate
of the system [P-7]. Even if the optimum mean-square sets of
coefficients 3%: are used, the system performance might be much
worse than that obtained when the Bayes discriminant functions are
employed. The set {Ejk(N)}:,k'1 depends on both the Q-functions
and the true probability densities of the pattern classes. Thus
the selection of the number and form of é-functions are both critical.
The questions then arise:' For a given set of N training patterns,
how does one select a set of "good" Q-functions? How many of them
does one select? Which feature subset of the primitive features

does one extract and employ in generating the 9-functions? How

does one go about extracting the features? The answers to these

65

questions require the consideration of the relationship between
the number of training patterns, the dimensionality of a primitive
feature space, and the number of 6-functions selected, as implied
in Fig. 3.2.1.

Kanal and Chandrasekaran [K-S] have concluded that the
probability structure assumed for the pattern classes and its
correspondence to the real problem at hand determine the relation-
ship between the number of the training patterns, the number of
features and the probability of error. In the application of @-
functions to pattern recognition, this conclusion means that the
Q-functions selected and their correspondence to the real problem
at hand will determine the relationship between the number of the
training patterns, the number of features and the loss incurred
in misclassifying patterns.

The Q-extraction problem is to determine which subset of
the Q-functions are significant for the purposes of discrimination.
From this point of view, the Q-extraction problem becomes identical
to the feature extraction problem. This means that, in addition
to the problem of extracting an optimum feature subset from a
given feature set, the pattern recognition system must attempt
to extract an optimum Q-subset from a given set of Q-functions.
The Optimum Q-system (N; t, T, as) can thus be associated

opt .
with the pair (t*,'r*)opt which optimizes

r V
68m, CL, Th)

over all pairs (4,,h), 4, = 1,2,...,(3); h = 1,2,...,(3); and all

pairs (r,V).

66

V
The optimum 9-system at level (r,V) is the system (t:, T*)

for which
r V , r V
68(N’ t*! T*) S 68(N, tL’ Th)

for all (L.h); and the optimum Q-system (t*, T*)0pt satisfies

r

(t*: 7*) 3 (t*, TX.) for 811 (r,V).

opt
This thesis limits the Q-discriminant functions to gen-
eralized polynomials and examines the interplay between the order
of the polynomial r, and the extraction rules t and T which
are applied as indicated in Fig. 3.2.1. For a given r, the number
of Q-functions that can be generated after the application of a

.3

feature extraction rule is given [N-S] by

M(p>=(P:T>-1=§FE}L-1

where p S m, q = l,2,...,(2) and each of the Q-functions is of
the form

r

d r1 2 r
¢L(x) = (x1) (x2) ... (xp) p

with L = l,2,...,M(p) and r + r +...+ r s r. Since the number

1 2
m
of all possible feature subsets is given by E (m), there are a
m M(p) p=1

M(p)

total number of Z 2 ( V

p=l V=l
systems to consider. It is obvious that such a number is orders

)(2) possible pattern recognition

of magnitude larger than that encountered when extracting subsets
of m primitive features alone. In spite of this computational
difficulty, ways have to be sought to simultaneously extract

features and @-functions which are effective in discrimination.

67

The 3-tuple (r,p,q) denotes not only the q-th feature
subset of size p, but also represents the set of M(p) Q-functions
that can be generated from the feature subset for a given r, where
r is the highest-order polynomial allowed. Thus the problem of
extracting optimum features and Q-functions is reduced to finding
the Optimum 3-tuple, (r’p’q)opt.’ over all r, p and q. The

following approaches are proposed.

Strategies for feature extraction alone

8.1. For a given r and p s m, generate all (z) p-
dimensional feature subsets and evaluate their performances. The
feature subset with the best performance is then selected as the
optimum subset. This approach does not attempt to select optimum
Qefunctions. The performance of each p-dimensional feature sub-
set is evaluated with all the 6-functions that can be generated
from that subset, M(p) in number.
8.2. For a given L s r and p s m, generate the (PZL) - 1
possible Q-functions and evaluate the performance of all pairs
(L,p). The pair with the best performance is then selected as the
optimum pair. As in 8.1, no attempt is made to extract optimum

Q-functions since all the Q-functions that the pair (L,p) can

generate are used.

Strategies for simultaneous feature and Q-function extraction

8.3. For a given r and p s m, generate all the (2)
possible feature subsets. Then, for each feature subset, generate
all the M(p) possible Q-functions and apply the sequential opera-

tion proposed in Sec. 4.2 to select a subset of effective Q-functions.

68

This approach attempts to solve the problems of feature and 9-
function extraction simultaneously. The Optimum features and @-
functions are contained in that feature subset and its correspond-
ing subset Of effective Q-functions which give the best performance.
8.4. Select r and examine the M(m) = (m:r) - l 6-
functions in a sequence such that the L-th Q-function in the
sequence differs from the (L+l)th in one feature and/or in the
order of one feature. For example, for r = 2 and m = 3 order

the M(3) Q-functions as follows:

$166) = x1
62d) = Xi
e366 = x2
‘94 (if) B x1x2
(95(2) = x;
$668) = x3
q’76‘.) = x1x3
‘Psm g x2x3
q”966) = x§

With this ordering of the Q-functions, the sequential operation

prOposed in Sec. 4.2 can be applied to extract effective Q-functions.

It is easily seen that if the final set of Q-functions do not contain

a particular feature, that feature would be eliminated and simultaneous

feature and Q-extraction would be achieved. On the other hand, if

the final Q-functions contain all the primitive features, only

69

Q-extraction would be achieved. A simulation of the third and

fourth approaches will be carried out in Sec. 4.7.

4.4 CHARACTERISTICS OF THE PROPOSED SEQUENTIAL ALGORITHM

The sequential algorithm proposed for feature extraction

and Q-function selection in Sec. 4.2 possesses the following char-

acteristics. These characteristics apply, as well, to Q-extraction.

1. At each step, the algorithm exhibits the best features obtained
so far in theoperation.

2. Feature sets which are effective in discrimination may be dis-
carded if found to be less effective in combination with a new
feature.

3. In case of a small number of training patterns, the algorithm
allows a trade-Off between the number of features selected and
the number of training patterns available.

4. No probability density functions for the pattern classes are
required and high orders of features can be taken into con-
sideration.

5. The number of subsets of features that must be examined is
smaller than in sequential extraction rules suggested elsewhere
as explained in Sec. 4.2.

6. Feature extraction and 6 selection can be handled simultaneously.

Since the algorithm is not exhaustive, it cannot be expected
to produce an absolutely Optimum system. The extraction of absolutely

Optimum sets of Q-functions and features is not guaranteed and, in

general, will not be achieved since relatively few subsets are

appraised. The algorithm also fails to exhibit an explicit

70

relationship between the system percentage of misrecognition and
the performance criterion. This problem, however, is common to all
procedures in which no prior knowledge is available concerning the
pattern class probability densities. Furthermore, the algorithm
does not give an explicit relationship between the number of Q-
functions and features selected, the number of training patterns
required, and the performance criterion as, even though the
algorithm does provide for a trade-off among these quantities.

In addition, the only Q-functions and features that can be selected
are supplied to the pattern recognition system by its designer.

All computationally realistic algorithms suffer these drawbacks.
Along these lines, Highleyman [H-3] has noted that it is not simple
to predict in advance whether a linear discriminant function has

a chance of working. The same can be said about Q-discriminant
functions. The experimenter has no recourse but to formulate the
discriminant function and estimate its performance. If the per-
formance is found to be unacceptable the experimenter can only
search for a better set of Q-functions (features) or seek other

types of discriminant functions.

4.5 CHARACTERISTICS OF THE PROPOSED PERFORMANCE CRITERION

The prOposed performance criterion is a measure of how
"close" in the mean-square sense, the Q-discriminant functions are
to the best mean-square approximation to the Bayes discriminant
functions. In this section, it is shown that this criterion, when
used for 9-function extraction, meets some of the requirements

listed in Sec. 2.1 for feature extraction.

7l

R.l The performance criterion should show that the per-
formance of a Q-function set which carries more discriminative
information is better than that Of a Q-function set which carries
less.

In a W-dimensional Q-function space, Appendix A shows that
the system's performance criterion function after N training

patterns have been processed is given by:

Z ejkO‘H t9 Tw)
J'ak

ll

88m; t: Tw)
(4.5.1)
.1

z [Rum - ﬁEJkTmnnww'lmntnh'é kmnnwn.
j.k

With the selection of the (W1) 'thé-function, the system's per-

formance criterion becomes:

W+l W+l

.301: c. 'r >= 2 ejkm; t. T )
3,1.

.. kT w+1 .. w+
ej )ejk 1

= ammo-1% (N;t,-r"’”>r'1<N;t.w (Nun )1.

LR

The increase in the performance criterion due to the addition of
the (W+1)-thQ-function is designated by:
_ W+1 . W
AWs8 680‘], t. “r )- 6801. t. T)
1 k k
= q,- 2 [dj (N.w+1) - dj own
LR
where

4 RT
9.1

djkmm - (N; c. Tww'lms c. T")'e’3“cu; t. T").

In this notation, the system's performance criterion in a W-dimen-

sional Q-function space, can be expressed as

72

63(N; t. Tw) ‘ j2k[R(J\k) ' % djka,W)] °

Appendix B establishes the following relationships under

the assumption of a 0-1 loss function:

(i) Awe8 < 0; (4.5.2)
(ii) AWeB is not defined when qW+l 1s a linear combina-

tion Of the previously established Q-functions. (4.5.3)

Equation (4.5.2) means that the addition of any Q-function does

not worsen the performance of the existing system and the criterion
function changes monotonically with the addition Of any Q-functions.
This implies that, as measured by the performance criterion, the
performance of any Q-function set is better than that of any 0-
function set of smaller size. Thus, the adOpted criterion meets
requirement R.l only in cases where the largest set of é-functions
is the best. For cases in which the size of the Q-function set does
not determine the discriminative effectiveness Of the set, the per-
formance criterion does not meet requirement R.l.

R.2 The performance criterion should not change when a
Q-function which carries no discriminative information is added
to the system.

Equation (4.5.3) states that AWes is not defined whenever
the additional Q-function is a linear combination of the Q-functions
already selected. This indicates that Requirement 2 is not met by
the proposed criterion function. The difficulty may be in the
fact that a Q-function which is a linear combination of the 6—

functions already selected in fact still carries some information

73

about pattern classes. This implies that the addition of a non-
information bearing Q-function is not equivalent to the addition
of a i-function which is a linear combination of the Q-functions
already selected. This non-equivalence is discussed by Frantsuz
[F-l].

R.3 The performance criterion should be minimum when the
Q-Eunction set carrying the greatest amount of discriminative in-
formation is achieved.

The convergence of the Q-system's performance criterion is
directly related to the convergence of all j-kth performance
criteria ejk(N; t, T), which is in turn directly related to the

convergence of the coefficients ij(N; t, T). It has been proved

[P-7] that
lim ijrn; t, T) = zi:(t, T) w.p. l
N-ooo

where 3::(t. T) are the coefficients which minimize the mean-
square error resulting from approximating the Bayes discriminant
functions.’ This directly leads to
k k
11m ej (N; t. T) = ej (t, T) W-Po 1
ms
N—nuo
the upper bound on the approximation error. This can also be

expressed as

11m§33kTms c. OP’ICN; c. '06“

New

(N; t, T)

" 31““. T)P'1(t. T)'e°jk(t. 1')
m8 m8 m8

which indicates that Optimality in the mean-square sense is achieved

74

as the number of training patterns increases to infinity for all
pairs of classes Hj and HR. Optimality in the Bayes sense,
however, is achieved only when the Bayes discriminant functions
can be represented as a finite sum Of the Q-functions selected.
The optimality of the Q-system is achieved when optimality is
achieved for all pairs of classes in either the mean-square or
Bayes sense. The convergence of the @-system in the mean-square
sense is achieved‘when

lim 2 ejk(N; t, T) = E ejk(t. T) .
N-mj,k jak ms

R.4 The performance criterion should be computationally
feasible. This is demonstrated in the computer simulations.

R.5 The performance criteria should converge to their
Optimum values as the number of training patterns increases. This

has been demonstrated in Requirement 3.

4.6 THE GAUSSIAN TWO-CLASS CASE

With a 0-1 loss function the quantity djk(N,W) satisfies
the properties required of a distance measure except for the tri-
angle inequality, as shown in Appendix C. It can be expressed in
a form similar to that of the Mahalanobis distance measure [M-l]

between two normal distributions. This "M-distance" is defined as

12 _ a a T
dMaha. - (p1 - p’2) 2

-1.. ..

where z = 214+ 22 and Hi and 21 are the mean vector and co-

variance matrix under class H1 and H2. In the absence of any

knowledge about the probability structure of pattern classes, the

75

means and covariance matrices are not known. Hence the quantities
a W

eiON; t, T ) and Piﬂﬂ; t, TN), 1 8 1,2, are determined instead to
define a distance measure d12(N,W) between pattern classes. The

quantities 31(N; t, T") and PiON; t, T”) are defined in Appendix

B.
If the Q-functions selected are assumed to be Gaussian
random variables which describe the distributions of classes H1

2
and H , then the Q-discriminant function is given by

W
«1am» = 2 a (N) spa?)
L‘l "

= 3%) Qua?)

where the notation lulzdfiadzm» has been simplified to u(}—(',3(N)).

The decision rule is:
Decide if 6 H1 if u(i,3(N)) > 0 .

The expected value of the discriminant function when class H1

is active is:
E,{3T<N>¢w<i)} = ETmmiuwm} = 'ii'rapii-i = 6Tm)p'1m);i

where Ei{§W(R)} = 3i is the mean vector of the Q-functions under
class Hi, i = 1,2. The variance of the discriminant function when

class H1 is active is:

_, -o c-O 2
E1{[aT(N)§w(Y) - Jami] }

...T..

zitszmws’) - Eimwﬁ) - pi) aCN)}

3%) 21 3m)

'é'Tm)r‘1(N) £1 P'lm)‘e'(u)

76

where 21 is the covariance matrix of the Q-functions under class
Hi, i = 1,2. It cannot be overemphasized that in practice the @-
functions may not be Gaussian even if the primitive features are
Gaussian. Nevertheless, if the assumption is made that the Q-

2
functions are Gaussian under H1 and H then the means of the

Q-discriminant function are:
-0T .4 «T -1 -. .
m1 = a (”W1 = e (N)? «mi. 1 = 1.2;

and the variances are:

2 —-o -o —o - - -o
a, = Jan 2, am) = 9%)? 10021? Imam). i = 1,2.

The probability of error is then given by

Pe = q1 Prob.1[u s 0] +q2 Prob.2[u > 0]

where q1 and q2 are the prior probabilities of the two classes.
Because of the Gaussian assumption the probability of error can be
expressed as

m -m

P. = qlu - 9(ng + qzu - p(—02-)]
l 2

where p is the cumulative distribution function of the standard
normal distribution. It is not easy to obtain the coefficient 30:)
which minimizes Pe in a simple form. However Peterson and Mattson
[P-6] have given a search technique for finding the linear coefficient
2 which minimizes the probability of error, and concluded that the

Opt

coefficient 30

pt of an optimum linear discriminant function

.663) . 2%: 1... .162) =xi

77

satisfies an equation of the form
+ -o = .4 -—o
[“121 “222]‘opt (”1 “2)“3

where a , a , and a are scalars. The coefficient 3 can
1 2 3 Opt

be expressed as

a =[a2 +a21'1(” -">a (4.6.1)
Opt 1 1 2 2 ”1 “2 3
which is seen to have the same form as the coefficient KIN) in

(3.1.2) which is rewritten here for convenience.

«12

3120:) = P‘lmm (N) = [P100 + P2m>]’1<'é’1m> - 6200) (4.6.2)

for a 0-1 loss function; P1(N) and PZGN) are defined in
Appendix B. This suggests that there is a possible link between
the iterative procedure for determining 3(N) discussed in Sec.
3.1, and the search technique. Peterson and Mattson further

stated that their approach is applicable to many different problems
and many different performance criteria which do not necessarily
involve the assumption of normal distributions. This tends to

make the possibility of a link even greater since EKN) was
determined with no assumption on the distributions of the pattern

~12

classes. From (4.6.1) and (4.6.2), a ON) and 3 are equal if

Opt

~i

and the determination of Pi(Ni) and 31(N1) is seen to be equi-
valent to determining quantities which are prOportional to co-

variance matrices and means. This suggests that the parametric

78

and deterministic approach to pattern recognition are essentially
the same. In the former, assumptions are made concerning the form
of the distributions of classes and unknown parameters, such as
covariance matrices and mean vectors, are learned. The latter
also learns its unknown parameters which are generally in the
form of coefficients of a discriminant function assumed for each
pattern class.

Anderson and Bahadur [A-Z] have also dealt with the problem
of determining the Optimum coefficient Zopt in a linear classifica-
tion scheme involving two multivariate normal distributions with

different covariance matrices. They came out with a vector of the

same form as the ones in the two approaches above, specifically

—o = -1 q - —o
aopt “131 + t:2’32] (“‘1 ”2)

where t1 and t2 are scalars such that tlzl + tzz is positive

2

definite. One case in their treatment involved the determination

—.
Of the coefficient am max in a procedure that minimizes the

maximum probability Of error. A scalar c was defined as

 

3T ( -">
m max ul p'2
c g am a % am a k
(a 2 a ) + (a 2 a )

m max 2 m max

where

a _ -1—9 —-0
8m max - [tzl + (1 - t)£2] (”l ' ”2)

with 0 < t < 1. This was the scalar quantity that Min [M-B]
used as a separability measure and as a performance criterion for

feature selection and which he showed is related to the divergence.

79

This suggests that another scalar, C(N), can be defined as

3%)(31 - II)
C(N) ' 'r _. J} ..1' .. 55
(3 (N)Zla(N)) + (a 0022801))

 

for use as a performance criterion for feature selection in the
case where the Q-functions have normal distributions. Similar
expressions were also arrived at by HO and Kashyap [H-7], who

expressed the Optimum linear coefficient in the form

3 = k(21 + 22)'1

Opt (“1 - ”2)’ k > 0 °

This coefficient also maximizes the function

T

—o -. —o 2
{aODt (”I - #2)}

J 3 am a
+
aOpt(zl z:2)aopt:

which is interpreted as the "distance" between the clusters of the
two classes over the "spread" of the cluster of each of the two

classes. Obviously the function

4T —+ —0 2
{a (N) ("-1 ' 92)}

3%le + 293m

 

JON) ‘

can be adopted as a feature performance criterion, where the vector

EON) is the coefficient derived in Sec. 3.1.

4.7 A PARTICULAR Q-SYSTEM

The 6-system described in Fig. 4.7.1 is now prOposed for
implementation. The estimate ‘3 of an upper bound on the mean-
square error between a set Of Q-discriminant functions and the

Bayes discriminant functions is adOpted as a performance criterion for

80

 

Select subset of

 

 

 

 

 

primitive features 6 : minimum a
’:*- I encountered at any
Select initial subset stage With a Part1°U1ar
feature subset
of Q-functions
[
'Y eﬁaés

 

Compute‘ as using |
-:tterns

 

 

 

 

Select new

3 Increment —
feature subset é-functions 7

 

 

 

 

 

 

 

  
  

\

/
Form discriminant
functions for c

smin
* Classify all I
trainin Apatterns
4*—

 

Adjust
discriminant functions

 
 

'Rate of
misclassification
satisfactory?

     

 

 

Figure 4.7.1 Implementation Of Strategy 3.3 and 8.4

81

selecting both optimum features and 6-functions. The discriminant
functions are restricted to generalized polynomials and, to avoid

excessive computation, only the strategies 8.3 and 8.4 in Sec. 4.3
will be simulated.

Strategy 8.3 first selects a subset of features and the
highest power, r, allowed for any feature. Strategy 8.4 is a special
casecﬂ Strategy 8.3 when the feature subset selected is the entire
set of primitive features. The Q-extractor selects the best 6-
function subset at each stage of the prOposed sequential extraction
rule in Sec. 4.2, until all the Q-functions have been exhausted.
At this juncture, the Q-extractor has the effective Q-functions
and the derived coefficients of the discriminant functions stored
in the Q-memory and the coefficient memory, respectively, for the
particular subset of features chosen earlier. Next, the N
training patterns are classified using these Q-functions and dis-
criminant functions. If an acceptable rate of misclassification
is achieved, the selector switch is turned to the "recognize"
position so patterns of unknown classification can be processed.

If the misclassification rate achieved is not acceptable, the
coefficients are adjusted in an attempt to improve the error rates.
If the adjustment leads to an acceptable improvement the selection
switch is turned on to the "recognize" position. If no acceptable
improvement is achieved, one must either increase r and gen-
erate a new set of Q-functions, obtain more training patterns,

if they are available, or select a new subset of primitive features

and repeat the entire learning procedure.

82

To adjust the coefficients, a scalar djk = dj - dk is

chosen and the following decision rule is employed.

.‘ - j «jk
Decide x EH if ujk(3i,a cm) - djkz o,

where

D-
II
I
...a
U
N
J
z

j min \E'jkTCN) «REM. L -
L

dk

ll

...:
U

N
V

O

O

O
U
2

min ‘zjkT(N) 6(it)‘, i
i

and N1 is the number of training patterns from class Hi.

Fisher's data [F-7] was used in implementing the Q-system.
The data consisted of fifty four-dimensional primitive patterns
from each of three pattern classes representing varieties Of irises,
namely, Sesotas (Se), Versicolors (Ve) and Virginicas (Vi). Each
Of the 150 irises is represented by four measurements of petal
width and length and sepal width and length. The flow chart for
the simulation is given in Fig. 4.7.2. This differs somewhat from
Fig. 4.7.1 in that the misclassification rate was determined for
each best subset of Q-functions encountered to ensure that the
sequential Operation was working prOperly and also to observe
whatever functional relationship may arise between the criterion
function es and the percentage of misrecognition. In all the
isimulations B was taken to be 10-10. A summary of the results
of approach 8.3 is shown in Table 4-1 and 4-2 for the feature

subset X2,X3,X4. Only 25 patterns from each class were used for

training but all 50 patterns were used for testing to Obtain the

83

®

 

[

Read in Data

1

 

 

I Initialize Phi1 Vectors“

J

 

 

Compute ¢Jk(N) for the Phi vectors
from traininggpatterns

 

~--—--.-—..

 

 

Choose vector which produced

3 jk

minimum e = 2 e (N)

i=1
ji‘k

 

 

Test training patterns with chosen

Yes

Aggctor

  
   
 

esired performanc

— ‘ —~

 

“if

 

 

L
L—r

 

Adjust discriminant function

i

]

 

 

LgTest Testin patterns

Yes

 

— Plot Graphs of

vs Z mis-

 

Figure 4.7.2 Flow Chart of Proposed Q-System

  

NO
Phi Functions
xhausted ?

     
   

y

 

Revise
Vectors

 

i . -..—..-—-—.. o-“

 

34

results in Table 4-1. The results in Table 4-2 were obtained by
using all 50 patterns from each class for both training and
testing. Similar results were Obtained for the remaining three
possible feature subsets of size three as well as for the case
in which all four primitive features were considered. The 6-
functions generated by each feature subset were ordered in the
manner prOposed in approach 8.4. Tables 4-3 through 4-6 show a
comparison of the performance Of the two approaches employed in

the simulation.

4.8 SUMMARY

The discriminant function adopted in Chap. 2 was used in
combination with a proposed sequential extraction rule to con-
struct a pattern recognition system in which Optimum features
and Q-functions are sought under a single criterion function.

An estimate of the upper bound on the mean-square error
between the é-discriminant functions and the mean-square approxima-
tion to the Bayes discriminant functions was proposed as a
criterion for feature extraction in Sec. 4.1. In the same section,
two sequential feature extraction rules were reviewed. It was
shown that neither of the rules can be guaranteed to be truly
optimum. In Sec. 4.2, a third sequential extraction rule which
considers fewer feature subsets than the two sequential extrac-
tion rules was develOped. An added feature of the prOposed
sequential extraction rule is that a feature found to be effective
need not be retained if found to be ineffective in combination

with other features. The possibility of simultaneous feature and

85

a .qu

Nx x
.sxmx .ex .mxmx .mx .Mx .Nx moamm.~ w
exmx .exmx .ex .mxmx .mx .Mx .Nx Hmwmm.~ a
exmx .ex .mxmx .mx .Mx .Nx oemmm.~ 0
sx .mex .mx .Mx .Nx mammm.~ m
mxmx .nx .Mx .Nx o¢¢ma.~ q
mex .mx .Mx .Nx oemmm.~ n
ma .Mx .Nx meamm.~ N
we .Nx ooooo.n a
mo Enswcwe m coHuouoao
couscouo cages cowuocsuuo mo oomanm u + uz+ o u w answcwz Hmwucmsoom venomouo mo meum

mmm> mmw> o> a>
Tan.» ”#29

ex .mx .Nx uomnsm ousumom Sues mmmHO zoom Eouw mcuouumm wcﬁumoa on was mahogany mm "m.m somouao<

mommm.o o.¢~ nmmm¢.o o.w~ Hmmma.o o.- mo~mm.~ n.0a w

ensmm.o o.m Nmmom.o o.oH ammam.o o.~H Hmmmm.~ m.m a
os¢a¢.o o.~ ~mmmm.o 0.5 ~mmam.o o.w oemas.~ a.m o
masam.o o.m omaaa.o o.m Hmsmo.o o.s~ wmmmm.~ n.8H m
6

.s Hamma.o o.m ssmmm.o o.a ssaaa.o o.aq oemmm.~ o.ma e
Hmama.o o.m asmmm.o o.H emmaa.c o.He s¢msm.~ o.mH m
ammam.o o.e meamm.o 0.8 maams.o o.me semam.~ o.mH N
aaaaa.o 0.0m ommsm.o o.om msoaa.o o.om ooooo.m o.om H

Odom

m coduomuuxu

Ocauﬁcmo coauwswo coﬂuﬁcwo u :Oﬁuaowo Haausoawom

o noose“: N u nooumﬂz N w nooumwx N locum“: N vomoaouo

mmo> om\o> mma> am\a> a>o> a>\m>. saaacaa om\a>\o> so «scum

mu .7 d mama

87

N

.d& .mx .mNNx .mx .NN .Nx
N N

qux adxmx

.«N .mx .MKNN .MN .NN .Nx
N N

«xmx .ON .MN .mex .MN .m% .Nx

ex .mx .mex .mx .Nx .Nx
N N
N N

mex .mx .m% .N%

ﬁx .NN .NN
N

N .N
NN X

m
u 8:5«aﬁ8

couscoun scans m:Owuoc=uuo

a m N

x x x banana ouauomm sows mango some Bonn mauouumm wawumoa on was mcwaﬁmuy on

o
omo>

_+

mmmmm.~

mamam.~
onmmm.~
mmmom.w
owamm.~
ooooo.m
ooooo.m

ooooo.m

u + u
mma> m>a>

<n N... a Bag

Esaﬁcwe

mass coauomuuxo
Hmﬁusoaoom oomooouo mo Oumum

"m.m sumouaa¢

88

H0000.0
0n000.0
H0000.0
50000.0
N0000.0
00000.0
00000.~

00000.H

u
omo>

0.N
0.H
0.0
0.0
0.0
0.0a
0.0a

0.00

cowuwcwo
nouns“: N
Om\o>

N0000.0
05000.0
H0000.0
H0000.0
50000.0
00000.0
00000.H

00000.H

u
mmﬁ>

0.0
0.~
0.0
0.H
0.0
0.00
0.H0

0.Hm

cowuwcwo
nomumazx
om\w>

00000.0
00000.0
00000.0
00000.0
00000.0
00000.0
00000.H

00000.H

a
H>o>

0.0
0.0
0.0
0.0a
0.0a
0.0a
0.0a

0.0q

cONOacwo
nooumﬁz no N
w>\o>

m: N...» mag

50000.0
00000.0
om000.~
00000.N
00000.N
00000.0
00000.0

00000.0

Basaaae

0.0
0.8
0.N
0.0a
n.0a
0.5a
n.0a

0.00

cowuwawo
Iowans: no N
Om\«>\o>

mass
co«uumuuxo
amwucosvom

venomouo

so magnum

89

0

vo>ownuo umuwm mos
:ONONOMOOOMmHz N om
\H>\O>.esewswe gowns
us mans anuomuuxm
Hmwucosvom vomooouo
Osu uo woman see

c 0

x

a .e H

x. x

Nx.wx.Hx

qN.Mx.NxHx.Nx.Wx.HN

d .Mx.0xNx.0x.Mx.Nx

Mx.mxHx.mx.wx.Hx

N AH “H
x
Nx x

Nxax.Nx.WN.Nx

x x x.¢x.~x.~x~x.~x.~x.ﬁx

N

N N

¢N0N.qx.0xNx.0x.Mx.Nx

NxHNQ N“. W“. H“

OOHuﬁcwoomu

-maz N smm\a>\m>

aneﬁaae 00020
Iowa suﬁsa you
upon coﬂuocnmuo

0.0a

0.NN

0.N

0.0

0.0a

N.qH

N.w

5.0

0.NH

N.¢H

.oEmnom coﬂuﬂcwoomu bﬂomnomucu m mmuocmb mm\«>\o>
.1

bo>ounom
coHucho

nooumwz

om\«>\m>

N

00

00

00

00

00

00

00

00

on

00

use mauouuwa
snags“: wcuumou mo .02

00

00

00

00

on

0N

0N

0N

0N

0N

moans

you mauouuma
wcaawmuu mo .02

ex.mx.~x.ax
ex.~x.ax
ex.mx.~x
ex.mx.ax
mx.~x.ax
ex.mx.ux.ax
sx.~x.ax
ex.mx.~x

«x.0x.~x

MN. NNQHx

mango
Oomasm

cONOHOwooom Om\«>\o> you noumanawm monomouoa< oau mo somwuuasoo <

0: a 5049

0.0

0.0

0.0

0.0

0.0

«.0

0.0

0.0

0.0

ousuuom somouna<

90

N 0x .wx.0x $0.0 00 00 0x.mx.~x.ax 0.0
0 Mx.mx0x.mx.wx.wx $0.0 00 00 «$.0x.~x 0.0
N ~x.wx.ﬁx $0.0 00 00 0x.~x.~x 0.0
N 0x.wx.ax $0.0 00 00 0x.mx.~x 0.0
N 0x.wx.Hx $0.0 00 00 0x.0x.ax 0.0
a wx.0x $0.$ 0m 00 0x.mx.~x.ax 0.0
N 0x.wx.~x $0.H 00 mm 0x.mx.mx 0.0
m ex.wx.0xax.$x.wx.ax $0.0 00 mm 0x.~x.ax 0.0
N 0x.wx.ax $0.0 0m 00 0x.mx.ax 0.0
H wx.ax $0.$ 00 00 0x.Nx.Hx 0.0
vm>oucum umuuu
ems coHuHcmoooumﬁz N cOHuwcwoomu
om\a>.a=ea=as noses -maz_$ amxa> capsicum cans
um wash scauouuuxo Essacaa vocab uﬂcwoomumﬁz mmmHO mmmao
aneucmswom venomoua noun Loans umm N om\w> use msuOuuma you mauouuma uomaam
onu mo ammum any spam coHuOc=muo ﬁnesse: wcwumou «0 .oz. wcacamuu 00 .oz OHOOOOM somou00<

ceauacwooom om\g> you woumasaam monomouaa< Ono wo acmﬂucasou <

anc mnm¢H

m NxHx.~x.Hx.Hx $0.H 00 00 «$.0x.~x.Hx 0.0

N
0 same. x.mx.mx~x.mx.mx.~x $0.0 00 00 0x.mx.~x 0.0
m ~x~x.~x.wx.ax $0.H 00 00 0x.~x.ax 0.0
m mxHx.0x.wx.Hx $0.0 00 00 0x.0x.~x 0.0
m Nxax.mx.wx.ﬁx $0.0 00 00 0x.~x.ax 0.0
0 Nxax.~x.wx.ax $0.0 0m 00 0x.mx.~x.ax 0.0
0 0x0x.ex.mx~x.mx.wx.~x $0.0 00 mm ex.0x.~x 0.0
m NxHx.~x.wx.0x $0.0 00 00 0x.~x.ﬁx 0.0
1
9
m mxax.mx.wx.sx $0.00 on 00 0x.mx.ax 0.0
0 NxHx.~x.wx.Hx $0.0 00 mm 0x.~x.ax 0.0
oo>oanom umuwu
mos co« uwcwoo oumaz N
o>xom asawcws scans newuﬂcwoooumﬁz nm>ownoo coda
us was“ coﬁuosuuxo N O>xom asewcﬂe nacwoomumwz N mamas mmmao
Hmaucusvom vmmoaoua nauseouo sagas use o>\om pom announce you mcumuuma Oomnsm
may mo swoon may ness coauocswuo EOENOAZ wcgumou mo .02 wawcﬁmuu mo .02 muouoom nomou00¢

cowuﬂcwooom m>xom you vmumH:E«m monomouaa< Ono mo :OmwumOEoo <

0.1» 5049

92

m 0x.wx.$xﬁx.wx.wx.ax $0.$~ 00 00 0x.mx.ux.ax 0.0
0 0xmx.0x.mx.mx~x.mx.wx.~x $0.0 00 00 ex.mx.mx 0.0
0 “a.exax.0x.mx.$xﬁx.~x.wx.0x $0.08 00 00 0x.~x.~x 0.0
0 Mx.mxax.mx.wx.ax $0.00 00 00 0x.0x.ax 0.0
m 0x.wx.~xax.~x.wx.ax $0.$~ 00 00 0x.~x.ax 0.0
$ mxax.mx~x.mx.mx.~xax.0x.wx.ﬁx $0.00 on 00 ex.nx.~x.0x 0.0
0 sxmx.0x.0x~x.nx.wx.~x $0.0 00 mm 0%.0x.~x 0.0
$ sxax.sx.mx.~xax.0x.wx.ax $0.08 00 00 ex.Nx.Hx 0.0
0 «max. x.Mx.mxHx.mx.wx.Hx $0.HH 00 mm 0x.mx.ax 0.0
$ mxax.mx~x.0x.wx.~xﬁx.~x.wx.ax $0.00 0m 00 0x.Nx.~x 0.0

vo>oasom umuwu

was aOauacwoooumuz N
m>\«>.asawcaa nouns
us many cowuomuuxm
Hugucosuom venomoua
on» no owmum one

:oHuﬁswoomumHz.N vm>oanoa cowu

o>\w>ﬂasaacaa nacwoomumﬁz N mango cacao
vmoovoun £0053 o>\«> non acumuuma non manouuma uomnam
Oomnsm coauoasmue anawaﬁz wcﬂumou mo .02. wcwaamuu mo .02 musumom aomou00<

cowuwcwooom o>\«> you woumaaawm monomou00< Onu mo aooauogaou 4

on.» 5048

93

Q-extraction was discussed in Sec. 4.3. Characteristics of the
prOposed feature extraction rule and the mean-square error
criterion were given in Sec. 4.4 and Sec. 4.5, respectively. It
was shown that the criterion has desirable prOperties. In Sec.
4.6, it was assumed that the Q-functions selected have normal dis-
tributions with reSpect to two pattern classes, and the Q-dis-
criminant function was studied in the case when the two distribu-
tions differ in mean vectors as well as in covariance matrices.
It is shown that the mean-square Optimum coefficients Of the dis-
criminant function has the same form as those derived by others
[P-6], [A-Z], [H-7].

Finally, Sec. 4.7 proposed and implemented a Q-system.
The results indicate that in a three-class recognition problem
involving Fishers' data [F-7], a minimum percentage of mis-

2
recognition of 2% was achieved with the Q-functions X2, X2, X3,

X2X3, Kg, X4, XBXA. Since the feature X1 is not included in
the set of optimum Q-functions both feature extraction and Q-
function selection were achieved simultaneously.

The results for all feature subsets considered do not
indicate any simple functional relationship between the criterion
as and the percentage of misrecognition. A monotonic relation-
ship between as and the percentage of misrecognition was not
found. Furthermore, the set of Q-functions which were selected
as Optimum, i.e. produced the minimum value of es’ did not pro-
duce the minimum percentage of misrecognition. However, judging

from the results, it is obvious that as is a workable criterion

function. In fact, the @-system prOposed exhibits a performance

94

level comparable to, and in some cases better than, other recogni-
tion systems [D-Z] [F-B] [K-IO] which experimented with the same
data as shown in Table 4-7. The proposed 6-system also achieved

a minimum misrecognition rate Of 0.0% for Se/Vi and Se/Ve
recognition and a minimum misrecognition rate of 2.0% for Ve/Vi/Se

recognition.

95

TABLE 4-7
Author Algorithm Minimum
Ve/Vi
% Misrecognition
achieved
Ishaku Proposed 6.0%
Kendall 11.0%
[F-S]
Freeman 5.5%
[Ii-8]
Dubes Entropy
[D-Z] dispersion 46.0%
method
Dubes Energy Subspace 44.0%
[D-Z] method
Dubes Gaussian 14.0%
[D-Z] rule (l-dim)
Dubes Minimum
[D-Z] clustering
distance 14.0%
method

(l-dimo

CHAPTER V

CONCLUSIONS AND POSSIBLE EXTENSIONS

Some general comments and possible extensions in the area
of general pattern recognition systems are discussed in Sections
5.1 and 5.2. The conclusions and possible extensions of the
particular system investigated in this thesis are summarized in

Sections 5.3 and 5.4.

5.1 GENERAL COMMENTS ON PATTERN RECOGNITION SYSTEMS

Four prOblems are generally associated with the design of
pattern recognition systems; namely, feature measurement, feature
extraction, abstraction and generalization. This thesis concentrates
on feature extraction.

There are two problems associated with feature extraction:
What criterion should be used in appraising the relative performance
of features? How can Optimum features be established under the
chosen criterion? The theoretical solution to the second problem
is obvious once a performance criterion has been selected. One
simply tries all possible feature subsets to be absolutely certain
that the Optimum feature subset is Obtained. Economy forbids the
consideration of all possible feature subsets so practical feature
extraction rules consider only a fraction of the possible feature
sets. It appears, therefore, that future developments in feature

extraction can be most useful if efforts are concentrated in the

96

97

search for more useful criterion functions, and in the development
of machines that will economically handle problems which require
considerable computations.

The problem of abstraction also involves selecting a suit-
able criterion. There have been basically two approaches to this
prOblem: probabilistic and deterministic. The main technique for
the first approach is the recursive application of the Bayes rule
to "learn" the unknown parameters of probability densities having
known form. Then, under a given criterion, the Optimum discriminant
function is determined using the likelihood ratio. In the absence
of any knowledge about class prObability densities, the main
technique is to iteratively "learn" the coefficients, generally of
a linear discriminant function, which Optimize a selected criterion
function. A

In the case of probabilistic discriminant functions, the
generalization problem is solved through the determination of
various error probabilities. NO such error probabilities can be
determined from deterministic discriminant functions. Since the
probability of error can very seldom be computed, it may be more
useful to assess pattern recognition systems only on how well the
systems perform in actual simulations. In that case, the Optimum
pattern recognition system can be defined as that system which has
the smallest misrecognition rate for a given data base during the
testing phase.

This thesis suggests and formulates a pattern recognition
system as a 3-tuple, (t; u, a), when a single criterion, a, is used

in determining optimum features and discriminant functions, and as

98

a 4-tuple, (t, b; u, a) when b is used in determining Optimum
features and a in determining Optimum discriminant functions.
This marks the first attempt at integrating the problems of feature
extraction, abstraction and generalization into a single mathe-
matical formulation. The validity of this formulation is borne

out in Table 2-1 where it is shown that all the pattern recognition
systems that have been developed are arrived at through a choice of
particular elements in the formulation. This unified system
facilitates theoretical research into the Optimality of total pattern
recognition systems. This represents a major contribution of the
thesis. A second major contribution is the develOpment of a new
sequential extraction rule which considers fewer feature subsets
than all previous sequential extraction rules and which has been
shown to be superior, under certain conditions, to the sequential

forward rule.

5.2 EXTENSIONS AND OPINIONS IN THE STUDY OF PATTERN RECOGNITION

SYSTEMS IN GENERAL

The author hOpes that research into pattern recognition
systems can be approached from a fresh viewpoint which may or may
not be related to the traditional manner described in Chapter 2.
The following possible extensions concern only pattern recognition
systems which can be expressed in the formulation mentioned in
Sec. 5.1.

1. For a given data base, it would be interesting to rate
all the pattern recognition systems that have been develOped under

the performance criterion of misrecognition rate during the testing

99

phase of the systems.

2. It is possible that more effective elements can be
included into the classes T, B, U and A defined in Chapter 2.

3. Since the problem of feature measurement appears to be
beyond mathematical formulation, the author agrees with Prabhu [P-B]
that members of any discipline who can assist the experimenter to
intuite as to what features to measure in a Specific recognition
problem should be consulted. In fact, educated intuition concerning
what features to measure should be regarded as essential in the
formulation of recognition systems and certainly as a valid tool in

the sOlution to the problem of feature measurement.

5.3 CONCLUSIONS IN REGARD TO Q-SYSTEMS

This thesis constructed and implemented a three-class @-
system with a particular set of the three elements (t; u, a) which
define a generalized pattern recognition system. The feature
extraction rule prOposed in Sec. 4.2 was employed for t, generalized
second order polynomial discriminant functions were employed for u
and the approximation of an upper bound on the mean-square error
between a set of polynomial discriminant functions and the Bayes
discriminant functions was used for a. The data used in the
simulations was Fishers' [F-7] but the Q-system is not restricted
to any type of data.

This thesis extends the work of Pitt and Womack [P-7] and
contains the first investigation into the problems involved in
linking the selection of 6-functions to both feature extraction

and discriminant functions. Thus, a basic goal of this thesis was

100

accomplished when the simulations indicated that feature extraction
and Q-function selection can be done simultaneously. A second basic
goal of this thesis which was not realized was the exhibition of a
monotonic functional relationship between the criterion function

as and the percentage of misrecognition. The thesis, however,

showed beyond any doubts that es is an effective criterion in the

selection of both Optimum features and Q-functions.

5.4 POSSIBLE EXTENSIONS IN THE STUDY OF Q-SYSTEMS

The study of the Q-system constructed and implemented in
this thesis, may be extended as follows:

1. Construct discriminant functions which partition the
Q-Space by associating more than two disjoint regions with any pair
of pattern classes.

2. Approximate the distributions of each pattern class in
the Q-Space or the distribution of the Q-discriminant functions and
approach the abstraction problem from the probabilistic viewpoint.

3. Use Q-functions other than generalized polynomials to
implement the system, such as Legendre and Fourier functions.

4. Investigate the possibility of Obtaining the optimum
ordering of the Q-functions prior to the start of the prOposed
sequential Operation.

5. Investigate the conditions under which a functional
relationship exists between as and the percentage of misrecogni-

tion.

B IBLIOGRAPHY

[A-IJ

p-23

[A-3]

[13-1]

[3-2]

02-11

[c-z]

[c-3]

[as]

[9-1]
[p-21

BIBLIOGRAPHY

Anderson, T.W. “Ag_lntroduction £2 Multivariate Statistical
Anal sis", New York, Wiley, 1958.

Anderson, T.W. and Bahadur, R.Ra 'Classification into Two
Multivariate Normal Distributions with Different Covariance
Matrices", Ann. Math. 8tatist., Vol. 33, pp. 420-431, 1962.

Aizerman, MMA., Braverman, E.M. and Rozoner, L.I. "The
Probability Problem of Pattern Recognition Learning and
the Method of Potential Function", Automation and Remote
Control, Vol. 25, pp. 1175-1193, 1964.

Block, H.D., Nilsson, N.J. and Duda, R.O. "Determination
and Detection of Features in Patterns", Computer and
Information Sciences, J.T. Tou and R.H. Wilcox, Eds.,
washington, D.C., Spartan, pp. 75-110, 1964.

Braverman, E.M. and Byatnitskii, E.S. "Estimation of the
Rate of Convergence Algorithms based on the Potential
Functions Method", Automation and Remote Control, Vol.
AC-12, pp. 195-197, 1967.

Calvert, T.W. "Nonorthogonal Projections for Feature
Extraction in Pattern Recognition", IEEE Trans. Computers,
Vol. c-19, pp. 447-452, 1970.

Chien, Y.T. and Fu, K.S., "On the Generalized Karhunen-
LOEVe Expansion", IEEE Trans. Info. Theory, Vol. IT-13,
pp. 518-520, 1967.

Chien, Y. “A Sequential Decision Model for Selecting
Feature Subsets in Pattern Recognition", IEEE Trans.
Computers, Vol. C-20, pp. 282-290, 1971.

Chu, J.T. and Chueh, J.C. “Error Probability in Decision
Functions for Character Recognition", Journ. of A.C.M.,
N0. 14, pp. 273-280, 1967.

Dantzig, J. "Linear Programming", Progress, 1966.

Dubes, R.C. "Information Compression, Structure Analysis,
and Decision Making with a Correlation Matrix", Interim

Scientific Rept. NO. 11, Division of Engineering Research,
Michigan State University, East Lansing, Michigan, 1970.

101

[Ia-11

[F-ll

[F-21

[p-31

[F-4]

[F-S]

[F-6]

[3-7]

[F-81

16-11

16-21

16-31

[n-1]

[n-2]

102

Estes, 8.E. "Measurement Selection for Linear Discriminants
used in Pattern Classification", IBM Research Rept. RJ-331,
1965.

Frantsuz, A.G. "Influence Of Correlations between Attributes
on their Informativeness for Pattern Recognition", Engineering
Cybernetics U.S.S.R., NO. 4, pp. 64-67, 1967.

Fu, K.S. and Chen, C.H. "Sequential Decisions, Pattern
Recognition and Machine Learning", Tech. Rept. TR-EE65-6,
School of Elect. Eng., Purdue University, Lafayette,
Indiana, 1965.

Fu, K.S. "Sequential Methods in Pattern Recognition and
Machine Learning", Academic Press, New York and London, 1968.

Fu, K.8. and Cardillo, G.P. “An Optimum Finite Sequential
Procedure for Feature Selection and Pattern Classification",
IEEE Trans. Auto. Control, Vol. 12, pp. 588-591, 1967.

Fukunaga, K. and Koontz, L.G. “Application of the Karhunen-
Loave Expansion to Feature Selection and Ordering", IEEE
Trans. Computers, Vol. C-19, pp. 311-318, 1970.

Fukunaga, K. and Olsen, D.R. “An Algorithm for Finding
Intrinsic Dimensionality Of Data", IEEE Trans. Computers,
Vol. c-zo, pp. 176-183, 1971.

Fisher, R.A. "The Use of Multiple Measurements in Taxonomic
Problems", Annals of Eugenics 3, Part 2, pp. 179-188, 1936.

Freeman, J.J. "Experiments in Discrimination and Classifica-
tion", Pattern ReCOgnition 1, pp. 207-218, 1969.

Grettenberg, T.L. "Signal Selection in Communication and
Radar System", IEEE Trans. Info. Theory, Vol. IT-9, pp. 265-
275, 1963.

Golovkin, B.A. “Adaptive Pattern Recognition as a Problem
in Partial Integer Linear Programming", Engineering Cyber-
netics U.S.S.R., NO. 2, pp. 140-144, 1968.

GOl'shteyn, E.G. and Yudin, D.B. "New Directions in Linear
Programming", Sovetskye Radio, 1966.

Hellman, M.E. and Raviv, J. "Probability of Error, Equi-
vocation, and the Chernoff Bound", IEEE Trans. Info. Theory,
Vol. IT-l6, pp. 368-372, 1970.

Henderson, T.L. and Lainiotis, D.G. "Comments on Linear
Feature Extraction", IEEE Trans. Info. Theory, Vol. IT-15,
pp. 728-730, 1969.

1H-31

[H-4]

[H-S]

1H-61

[n-7]

[I-l]

[K-3]

[K-4]

[K-S]

[K-6]

1K-71

103

Highleyman, W.H. "Linear Decision Functions, with Applica-
tion to Pattern Recognition", IRE Proc., pp. 1501-1514, 1962.

Ho, Y. and Agrawala, A.K. "On Pattern Classification
Algorithms: Introduction and Survey", IEEE Proc., Vol. 56,

HO, Y.C. and Blaydon, C. "On the Abstraction Problem in
Pattern Classification", Proc. N.E.C., 1966.

Ho, Y.C. and Kashyap, R.L. “A Class of Iterative Procedures
for Linear Inequalities", SIAM Journal on Control, Vol. 4,
pp. 112‘115 , 19660

HO, Y.C. and Kashyap, R.L. “An Algorithm for Linear
Inequalities and its Applications", IEEE Trans. Electronic
Computers, Vol. EC-l4, 1965.

IEEE Computer Group Midwest Area and Pattern Recognition
Committees, and Argonne National Laboratory, Applied Mathe-
matics and High Energy Physics Divisions. "IEEE Conference
Record of the Symposium on Feature Extraction and Selection
in Pattern Recognition", 1970.

Kadota, T.T. and Shepp, L.A. "On the Best Finite Set Of
Linear Observables for Discriminating Two Gaussian Signals",
IEEE Trans. Info. Theory, Vol. 13, pp. 278-284, 1967.

Kailath, T. "The Divergence and Bhattacharyya Distance
Measures in Signal Selection", IEEE Trans. Comm. Tech.,
Vol. con-15, pp. 52-60, 1967.

Kanal, L. and Chandrasekaran, B. "On Dimensionality and
Sample Size in Statistical Pattern Classification", Proc.
N.E.C., pp. 2-7, Chicago, 1968.

Kharkevich, A.A. "Choice of Features for Recognition
Machines", Engineering Cybernetics U.S.S.R., NO. 2, pp. l-7,
1963.

KObayashi, H. and Thomas, J.B. "Distance Measures and
Related Criterion", Proc. Fifth Annual Allerton Conference
on Circuit and System Theory, pp. 491-500, 1967.

Koontz, L.G. and Fukunaga, K. "Nonparametric Mapping of
Multivariate Measurements for Separability Enhancement",
Proc. N.E.C., 1970.

Korsunskaya, V.O. "Some Principles in Formation of
Attributes in Pattern Recognition", Engineering Cyber-
netics U.S.S.R., NO. 6, pp. 109-114, 1966.

[K-81

[K-9]

[K-IO]

[L-l]

[Ia-2]

[1-3]

[L-4]

[L-5]

[M-1]

[M-2]

[M-3]
[n-41

[PI-5]

[N '1]

[u-21

104

Kullback, 8. "Information Theory and Statistics", John
Wiley and Sons, Inc., New York, New York, 1959.

Kullback, 8. ”Certain Inequalities in Information Theory
and Cramér-Rao Inequality", Ann. Math. Statist., Vol. 25,
pp. 745-751, 1954.

Kendall, M.G. "Discrimination and Classification", in
Multivariate Analysis (ed. P.R. Krishnaiah, Academic Press,
New‘York, 1966).

Lainiotis, D.B. "Optimal Feature Extraction in Pattern
Recognition", IEEE International Info. Theory Symp.
Abstracts, San Remo, Italy, 1967.

Lainiotis, D.G. and Henderson, T.L. “Application of State
Variable Techniques to Optimal Feature Extraction", Proc.
IEEE (Letters), Vol. 5, pp. 2175-2176, 1968.

Lainiotis, D.G. “A Class of Upper Bounds on Probability
of Error for Multihypotheses Pattern Recognition", IEEE
Trans. Info. Theory, Vol. IT-15, pp. 730-731, 1969.

Levine, M.D. "Feature Extraction: A Survey", Proc. IEEE,
Vol. 57, pp. 1391-1407, 1969.

Lewis, P.M. "The Characteristic Selection Problem in
Recognition System", IRE Trans. Info. Theory, Vol. IT-8,
pp. 171-178, 1962.

Mahalanobis, P.C. "On the Generalized Distance in Statistics",
Proc. Nat. Inst. Sci., Vol. 122, pp. 49-55, India, 1936.

Marill, T. and Green, D.M. "On the Effectiveness of

Receptors in Recognition Systems", IEEE Trans. Info. Theory,
V01. 9, pp. 11-17, 1963.

McDonald, R.P. “A Unified Treatment of the Weighting
Problem", Psychometrika, Vol. 33, pp. 351-381, 1968.

Mengert, P.H. "Solution Of Linear Inequalities", IEEE
Trans. Computers, Vol. C-l9, pp. 124-131, 1970.

Min, P.J. "On Feature Selection in Multi-class Pattern
Recognition", Ph.D. Thesis, School of Elect. Eng., Purdue
University, Lafayette, Indiana, 1969.

Nagy, G. "Feature Extraction on Binary Patterns", IEEE
Trans. System Sci. and Cybernetics, Vol. SSC-S, 1969.

Nelson, G.D. and Levy, D.M. "A Dynamic Programming Approach
to the Selection of Pattern Features", IEEE Trans. System
Sci. and Cybernetics, Vol. 880-4, 1968.

[N4]

[N-S]
[P-ll

[p-23

[p-3]

[P41]

[P-5]

[P-6]

[P-71

[P-8]

[R-l]

[R-21

105

Nelson, G.D. and Levy, D.M. "Selection of Pattern Fvnruros
by Mathematical Programming Algorithms", IEEE Trans. System
Sci. and Cybernetics, Vol. SSC-6, 1970.

Neymark, Y.I., Batalova, 2.8. and Obraztsova, N.D. "Pattern
Recognition and Computer Systems: Feature Selection in

Pattern Recognition", Engineering Cybernetics U.S.S.R., 1969.
Nilsson, N.J. "Learning Machines", McGraw-Hill, 1965.

Parzen, E. "On the Estimation of a Probability Density and
Mode", Ann. Math. Statist., Vol. 33, pp. 1065-1076, 1962.

Patrick, E.A. and Fischer II, F.P. "Nonparametric Feature
Selection", IEEE Trans. Info. Theory, Vol. IT-15, pp. 577-
584, 1969.

Patterson, J.D., Wagner, T.J. and Womack, B.F. “A Perfor-
mance Criterion for Adaptive Pattern Classification Systems",
IEEE Trans. Automatic Control, Vol. AC-12, pp. 195-197, 1967.

Pervozanskiy, A.A. "The Recognition of Abstract Patterns
as a Problem in Linear Programming", Engineering Cybernetics
U.S.S.R., NO. 4, pp. 38-41, 1965. '

Peterson, D.W. "Discriminant Functions: Classes and
Computational Techniques", Stanford Electronic Lab.,
Stanford, California, Rept. SEL-65-021 (TR6761-2), 1965.

Peterson, D.W. and Mattson, R.L. “A Method of Finding
Linear Discriminant Functions for a Class of Performance
Criteria", IEEE Trans. Info. Theory, Vol. IT-12, pp. 380-
287, 1966.

Pitt, J.M. and Womack, B.F. “Additional Features Of an
Adaptive, Multicategory Pattern Classification System",
IEEE Trans. System Sci. and Cybernetics, Vol. SSC-S, 1969.

Prabhu, K.P.S. "On Feature Reduction with Application to
Electroencephalographs", Ph.D. Thesis, Division of Engineering
and Applied Physics, Harvard University, Cambridge,
Massachusetts, 1970.

Renyi, A. "On the Amount of Missing Information and the
Neyman-Pearson Lemma", Research Papers in Statistics,
F.N. David (Ed.), Wiley, 1966.

Renyi, A. "On the Amount of Information Concerning an
Unknown Parameter in a Sequence of Observations", Publ.
Math. Inst. Hungar. Acad. Sci., 9, pp. 617-624, 1964.

[R-3]

[8-11

[3-2]

[3-31

18-41
[8-51

[8-6]

[3-7]

[r-1]

[w-IJ

[w-z]

[w—33

[w-4]

[W-Sl

[Y-l]

106

"Bayes Risk.Consistency of Classification
The Indian

Ryzin, J.V. _
Procedures Using Density Estimation", Sankhya:
Journal of Statistics: Series A, Vol. 28, 1966.

Sammon Jr., J.W. "Interactive Pattern Analysis and
Classification", IEEE Trans. Computers, Vol. C-l9, pp. 594-
616, 1970.

Sammon Jr., J.W. “An Optimal Discriminant Plane", IEEE
Trans. Computers, Vol. C-19, pp. 826-829, 1970.

Schwartz, S.C. "Convergence of Risk in Adaptive Pattern
Recognition Procedures", Proc. Fifth Allerton Conf. on
Circuit and System Theory, 1967.

Sebestyen, G.S. "Decision-Making Processes in Pattern
Recognition", Macmillan, New'York and London, 1962.

Spragins, J. "Learning without a Teacher", IEEE Trans.
Info. Theory, Vol. IT-12, pp. 223-229, 1966.

Spragins, J. "A Note on the Iterative Application of Bayes
Rule", IEEE Trans. Info. Theory, Vol. IT-11, pp. 544-549,
1965.

Spragins, J. "Reproducing Distributions for Machine Learn-
ing", Stanford University, Stanford, California, Tech. Rept.,

Tou, J.T. and Heydorn, R.P. "Some.Approaches to Optimum
Feature Extraction", Computer and Info. Sci.-II, Academic
Press, pp. 51-89, 1967.

Watanabe, S. “A Method of Self-Featuring Information Com-
pression in Pattern Recognition", Bionics Symp., 1966.

Watanabe, S. "Karhunen-Loéve Expansion and Factor Analysis:
Theoretical Remarks and Applications", Proc. Fourth Conf.
Inform. Theory, Prague, 1965.

Whitney, A.W. “A Direct Method of Nonparametric Measurement
Selection", IEEE Conf. Record of the Symp. on Feature
Extraction and Selection in Pattern Recognition, 1970.

Wee, W.G. "On Feature Selection in a Class of Distribution-
Free Pattern Classifiers", IEEE Trans. Info. Theory, Vol.
IT-16’ 19700

Wolverton, C.T. and Wagner, T.J. “Asymptotically Optimal Dis-
criminant functions for Pattern Classification", IEEE Trans.
Info. Theory, Vol. IT-lS, 1969.

Yegisapetov, E.G. "Learning Process in Pattern Recognition
Using Linear Decision Functions", Engineering Cybernetics

U.S.S.R., NO. 5, pp. 76-85, 1969.

APPENDICES

APPENDIX.A

This appendix demonstrates (4 5 l) The performance criterion

function for any pair of classes Hj and H , in a W dimenSional
training patterns have been processed,

Q-function space, after N

 

is given by:
Jim; m w) = R(j|k)- ﬁ-EjkTmnprwféjkmnnw) .
Substituting for 31km;t,'rw) from (3.1.2),

ekj (N; m w) = Mum-$617” (N;t.ww)P‘1(N;c.tw>6jk(N;t.w )

80

68013.7”) = 2[R(1\k)-6jkl-}T;(N t,T ‘4‘)? (N; mwm” (N; BTW-)1
J.k
’ - loss

For a two-class pattern recognition system Wlth O 1
function the system's performance criterion function is given by
T W -l W «12

(N;t.'r )P (N;t.T )9 (Nun)

1 -
w - w
1(Nun )[el(N;t.T)

Zh—a

6:20“ ;tsTw)

- N1 [61m;t,t")-62(N;t,ww)]
02

- e (N;c,ww)]

OjCN t,T ) and P-1(N;C,Tw) are defined in Sec. 3 1

where
is equal to l as seen in the following

The quantity R(j‘k)
steps.

107

108

R(j\k) = max {r:(j‘k)}
lsi-“s
= mx {[c(j\1) - c(k‘i)]2}
lsiss
if c(j‘i) = 1 for j # i
= 0 for j = i
then

ll
...:
O

R(j\k>

APPENDIX B

This appendix establishes (4.5.2) and (4.5.3) which are

rewritten below.
15th = -[djk(N,W-+l) - dJk(N,W)] < 0 (4.5.2)

AWgS is not defined when ¢W+1 is a linear
combination Of the previously established Q

functions (ml,¢z,...,gw).

The notation has been simplified by dropping t and replacing Tw

with W.
djkmm = EjkTmmw'ICNmYe‘jkmM)
From (3.1.3),

'e‘jk(n,w> = 'éjkm-LN) + zjk(n>6[i(n>]; 'éj“(o,w) = 6

where

zjk(n) = c(j‘L) - c(k‘L) if 2(n) belongs to class HL

and c(j\k) is the cost of deciding that §(n) belongs to class
HJ when it actually belongs to HR, with c(j‘j) = 0.
The alternative computation Of the 6 vector discussed in

Sec. 3.1 proceeds by writing

109

110

3%.)» = E‘s-1.)» + i‘(n)aw(2(n)); 2510.)!) = o
where
= _ .6 r . =.B
PL(nsw) PLO) 19w) + 1 (n)¢w(x(n))a PL(Osw) S I
and
1 if E(n) belongs to class HL
iL(n).=
0 otherwise .
Then,
-Jk . —~j . -k S . -L
6 (NM) = c(k\J)e (NM) - c(J\k)e (N,W) + z [c(J\t)-c(k|t)]e (NM)
L=1
ask.)

where N = N1 + N2 +...+ NS; N1 is the number Of training patterns

from class H1.

For a 0-1 loss function,

a'k -9' -ok
6J (N,W) = 63(N,W) - 6 (NM) if 1% k
and

P(N,W) = P1(N,W) + P2(N,W) +...+ PS(N,W).

If a new Q-function ¢W+l is added,

-1

d (N.W+1) = (e (N,W)ew+1)
T jk
C pw+1 9w+1

where 9421 is the scalar difference in the theta values summed

over all patterns from classes Hj and Hk is a scalar

’ pw+1

which equals the sum of the squares of the new Q-function summed

over all patterns, and C is a vector whose j-th row is

111
N —-o -—o
E p (X(i)) (X(i)). It can be shown [P-8] that
i=1 1 Wu

[ejk - CT P'lmmféjkmmnz

 

djk(N,W+l) - dJk(N,W) = ”+1 'r _1 .
Pw+1 - C P (N.W)C
P(N,W) C
Consider det
CT

where "det" stands for determinant. Multiplying the first W rows

of the matrix by CTP-1(N,W) and subtracting from row W+1 gives

P(N,W) C

o pw+1 - c P-1(N,W)C

Expanding from the bottom row,

P(N,W) C
T -l
det = [p - C p (N,w)c]. det P(N,W) .
T W+l
C pn+1
P(N,W) C
Now, both det and det P(N,W) are greater
CT

than zero since both are determinants Of positive definite matrices.

This implies that

T '1

and

dJkCN,W+l) - djkm,W) > o,

112

which demonstrates (4.5.2). TO show (4.5.3), let ¢W+l(i) = aT¢W(E),
where a is a scalar vector and §w(i) is the vector which consists

of the W previously established Q-functions. Then,

1k = j ... k
ew+1 ew+1 ew+1
....k

= aTej (M)

= ..Tpm W)—'

pw+1 a ’ °’
c = P(N,W);

and

jk(N,w)]2

ST? (N JOE-[P (N .W)&'T]P'1(N ,W) [P (N .1052]

.k jk [aT'éjkcmm-P(N,W)&°]TP'1(N,W)’6‘
d3 (N,W+1) - d (N.W) =

 

which is not defined. This in turn implies that (4.5.3) is not

defined.

APPENDIX C

'k
In this appendix the distance properties of dJ (N,W) are
investigated.
For a 0-1 loss function and simplifying the notation as

in Appendix B,
djchM) =[‘e’jm,W)-6k(N.W)]TP'1(N,W)['éj(N,W)-6km,w)]

[3km,w)-§j (N,w)]P'1(N,W)[§k(N,w) 563' (N,W)]

k
=d’mmx
Since dJkGN,W) is a positive-definite, quadratic form

djkmm) = o if and only if 61mm) = 6km»
and

dJRCN,W) > O .

113

  

    

VLBR

MIIIIWUIIIWIIIWIIVIIIIIEIIIIWIW\I ll
3 1 2 9 3 0 3 013 2

244

 

 

 

 

 

 

    
  

   

ARIES