NONPARAMETRIC TRAINING PROCEDURES FOR MULTICATEGORY PATTERN CLASSIFICATION

Thesis for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
ALBERT YEN-SHIEN HUNG
1971

This is to certify that the thesis entitled NONPARAMETRIC TRAINING PROCEDURES FOR MULTICATEGORY PATTERN CLASSIFICATION presented by Albert Yen-Shien Hung has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.

Major professor

ABSTRACT

NONPARAMETRIC TRAINING PROCEDURES FOR MULTICATEGORY PATTERN CLASSIFICATION

By Albert Yen-shien Hung

The M-class pattern recognition problem is to construct a set of discriminant functions which partition a feature space into M regions, one region per pattern class. Each point in the feature space is a potential pattern and each pattern represents an object. A set of training patterns is to be generalized into a set of discriminant functions which classify the potential patterns. The fundamental algorithms developed here concern the situation where the origin of each training pattern is known and almost nothing is assumed about the origins of the patterns. An extension to the unsupervised case is also given.

Several new multiclass decision-making algorithms are proposed. An entirely new class of algorithms is obtained by translating the pattern recognition problem into the problem of minimizing a function of several variables and selecting suitable functions. This general formulation includes most known algorithms as special cases. The class of algorithms includes all procedures which approximate discriminant functions by linear combinations of basis functions. Several successful two-class algorithms are extended to the M-class problem.

The concept of linear inequalities and the role of the mean-square-error criterion in pattern recognition are studied. Several algorithms are shown to rely heavily on the basic mean-square-error criterion. In order to solve the generalization problem, the conditional probabilities are selected to form the optimal discriminant functions. A class of multiclass algorithms using stochastic approximation techniques is proposed that learns the unknown coefficients of the discriminant functions. A digital simulation has been performed.

The most novel aspect of this thesis is the introduction of an algorithm that combines cluster-seeking and multiclass pattern recognition. Cluster-seeking tries to uncover the structure inherent in the training patterns. The algorithm exploits this structural information to construct discriminant functions. The success of the discriminant functions in classifying training patterns then provides clues about structure. The algorithm is straightforward and computationally realistic. It has been tested with both artificial and practical data.

NONPARAMETRIC TRAINING PROCEDURES FOR MULTICATEGORY PATTERN CLASSIFICATION

By Albert Yen-shien Hung

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Electrical Engineering and Systems Science
1971

ACKNOWLEDGEMENTS

I am deeply indebted to Dr. R.C. Dubes, who introduced me to the field of pattern recognition and whose valuable suggestions have gone a long way in shaping this thesis.
I am also grateful to him for the guidance and encouragement I received from him during my entire doctoral degree program.

I am also indebted to the Division of Engineering Research, Michigan State University, and the Air Force Office of Scientific Research for their financial support during the course of this research. Thanks are also due to Dr. R.O. Barr, Dr. G.L. Park, Dr. J.H. Stapleton and Dr. Rita Zemach for their interest and critical reviews of this thesis.

Finally, this thesis would be incomplete without mentioning the contribution of my wife Lilian, to whom this thesis is dedicated.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES

Chapter I  INTRODUCTION
    1.1 Basic Problems
    1.2 Multiclass Problems
    1.3 Cluster-Seeking and Pattern Recognition
    1.4 Thesis Objectives and Outline

Chapter II  GENERALIZED LINEAR INEQUALITIES IN MULTICATEGORY PATTERN RECOGNITION
    2.1 The Concept of Linear Inequalities in Pattern Recognition
    2.2 The General Learning Algorithms
    2.3 A Class of Iterative Procedures for Linear Inequalities

Chapter III  THE MEAN-SQUARE-ERROR CRITERION IN MULTICATEGORY PATTERN RECOGNITION
    3.1 The Mean-Square-Error Criterion in Pattern Recognition
    3.2 The Generalized Inverse Approach in Multiclass Pattern Recognition
    3.3 Some Properties of Least-Mean-Square-Error Pattern Classifiers
    3.4 Relation Between Linear Inequalities and the Mean-Square-Error Criterion

Chapter IV  MULTICATEGORY PATTERN RECOGNITION USING STOCHASTIC APPROXIMATION TECHNIQUES
    4.1 Mathematical Formulation
    4.2 Generalized Inverse Adaptation
    4.3 A Stochastic Approximation Algorithm with Updating Property
    4.4 Sensitivity Study and Error Upper Bound
    4.5 An Example

Chapter V  MULTICLASS PATTERN RECOGNITION IN UNSTRUCTURED SITUATIONS
    5.1 Some Cluster-Seeking Techniques .......................................... 64
    5.2 A Procedure for Combining Cluster-Seeking and Multiclass Pattern Classification ... 68
    5.3 A Procedure for Unsupervised Structure Analysis .......................... 74
    5.4 Computer Simulations ..................................................... 76

Chapter VI  CONCLUSIONS
    6.1 Thesis Results ........................................................... 79
    6.2 Possible Extensions ...................................................... 82

REFERENCES ...................................................................... 84
APPENDIX ........................................................................ 88

LIST OF FIGURES

Figure                                                                          Page
1      A proposed pattern recognition system .................................... 6
2 & 3  Decision surfaces for algorithm GA.4 ..................................... 32
4      A mathematical model for pattern recognition ............................. 63
5      Error rate versus number of iterations ................................... 63
6      Flow chart of the proposed algorithm ..................................... 72
7      Example of algorithm in Figure 6 ......................................... 77

LIST OF APPENDICES

Appendix                                                                        Page
A      Convergence proof of algorithm GA.4 ...................................... 88

CHAPTER I
INTRODUCTION

The problem of automatic data analysis has drawn considerable attention since the development of high speed computers. Today there are automated systems that read handwriting and fingerprints, as well as systems that classify data, forecast weather and perform medical diagnosis. In all these cases, some type of patterns or templates are assumed as a basis for recognition; that is, they are pattern recognition problems (H1, B3, N1, N2).

1.1 Basic Problems

There are three fundamental problems associated with pattern recognition.

(1) Feature extraction.
A set of real variables {x1, x2, ..., xd}, called features or attributes, identifies the object to be classified. A vector X* = (x1, x2, ..., xd)^T is called a pattern vector or, simply, a pattern. No general procedure currently exists for selecting an optimal set of features for a given problem. In most cases, the success of feature extraction depends entirely upon factors in the specific problem. The feature extraction problem will not be discussed in this thesis; a "good" set of features is assumed.

(2) The abstraction problem. Once a set of features has been selected, and a set of pattern vectors D* = {(X*_i, y_i), i = 1,2,...,N} is given, where y_i denotes the classification of the pattern vector X*_i, then a set of augmented pattern vectors D = {(X_i, y_i), i = 1,2,...,N}, called training patterns, can be formed. An augmented pattern vector is a d-dimensional vector X* augmented by a (d+1)st component whose value is 1; that is, X = (x1, x2, ..., xd, 1)^T. The problem considered in this report is to find a set of discriminant functions {f_j(X)}, j = 1,2,...,M, defined on the augmented feature space such that

    f_i(X) > f_j(X)  if X is in class i;  i, j = 1,2,...,M,  i ≠ j.        (1)

If there is a set of discriminant functions satisfying (1), an "optimal" solution exists; otherwise one must settle for a solution which permits a few violations of (1). The abstraction problem is thus one of distilling the information from the training patterns that is necessary to construct a set of discriminant functions.

Generally speaking, there are two approaches to the abstraction problem. First, if a great deal of information is available about the data, a probabilistic model can be generated to describe the underlying physical phenomenon. The abstraction problem can then be treated in the framework of statistical decision theory. This approach is the parametric (or statistical) method. Second, there are many problems, especially in the biomedical area, in which data has simply been amassed. It is reasonable to assume, however, that there exists a set of discriminant functions which approximate (1), and so the functional form for the discriminant functions is assumed known except for a set of parameters. Based on the given training patterns, an iterative procedure, sometimes called training or learning, can be formulated to determine reasonable values for the unknown parameters. This approach is termed the nonparametric (or deterministic) method of training. This thesis will be primarily concerned with nonparametric training.

(3) The generalization problem. Once a set of discriminant functions has been determined, an attempt may be made to predict performance for new patterns. If the parametric method was chosen to solve the abstraction problem, the generalization problem consists in determining the error probability. If the nonparametric approach was adopted, the generalization question is usually difficult to answer because of the absence of a suitable criterion. The misclassification percentage with training patterns is usually employed even though it cannot be related mathematically to the problem parameters.
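The decision rule (1) and the training-set misclassification percentage just mentioned can be made concrete with a short sketch. The following code is not part of the original thesis; it assumes linear discriminant functions f_i(X) = a_i^T phi(X) with phi(X) taken to be the augmented pattern itself, and all names (augment, classify, A, Phi) are hypothetical:

import numpy as np

def augment(X):
    """Append the constant component 1 to each d-dimensional pattern (rows of X)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def classify(A, Phi):
    """Decision rule (1): assign each pattern to the class whose discriminant
    f_i(x) = a_i^T phi(x) is largest.  A is M x (d+1); Phi is N x (d+1)."""
    scores = Phi @ A.T            # N x M matrix of discriminant values
    return np.argmax(scores, axis=1)

def misclassification_percentage(A, Phi, y):
    """Training-set error rate, used here as the nonparametric generalization estimate."""
    return 100.0 * np.mean(classify(A, Phi) != y)

# Hypothetical two-feature, three-class example with an arbitrary weight matrix A.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
y = rng.integers(0, 3, size=12)
A = rng.normal(size=(3, 3))       # one row of weights per class; last column is the threshold
Phi = augment(X)                  # here phi(x) is simply the augmented pattern itself
print(misclassification_percentage(A, Phi, y))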
1.2 Multiclass Problems

A single data source, called a pattern class or category, generates each pattern. In the two-class problem, each pattern originates in one of two pattern classes. A review of the current literature (H2, H3, P1) indicates that the majority of pattern recognition algorithms have been developed for the two-class problem. In theory, the M-class (M > 2) problem can be solved as a collection of M(M-1)/2 two-class problems (N1). However, there are several advantages in considering M-class pattern classification on its own merits.

(1) A generalization of a two-class algorithm to solve a multiclass problem by pair-wise operations is quite involved computationally. For instance, it would be necessary to compute M(M-1)/2 generalized inverse matrices in the Ho-Kashyap algorithm (H2); with an M-class algorithm, only one inverse computation is required.

(2) Two-class algorithms give no immediate answer to the multiclass problem even if a solution exists. A voting scheme must be imposed to classify a pattern. With an M-class algorithm, there is no need for a voting scheme, and it will furnish a reasonable suboptimal solution in case of overlapping sets of training patterns.

(3) With a deterministic M-class algorithm, the problems of pattern recognition and cluster-seeking can be combined, as discussed later, into a single problem. Such a simplification is the main reason for devoting this thesis to deterministic M-class algorithms. Criteria and procedures are presented in Sec. 5.2.

1.3 Cluster-Seeking and Pattern Recognition

According to Ball (B1), "A cluster is a set of patterns contained in a high dimensional space where the density of patterns is large compared to the density in the surrounding volumes." The idea behind cluster-seeking techniques is the grouping of patterns into clusters or groups so that all patterns within a cluster are very "alike" and patterns in different clusters are very "unalike". Several measures of similarity will lead to clustering algorithms. For instance, Friedman and Rubin (F1) proposed some invariant criteria for grouping data through linear transformations. Ball and Hall (B2) suggested the distance between a pattern and the cluster center be used. A number of other measures of similarity were categorized by Ball (B1). In this thesis, the criterion of minimum probability of misclassification will be used for cluster-seeking. A pattern recognition system is proposed which combines cluster analysis and classification in such a way that knowledge of data structure guides pattern classification, and the results of pattern classification provide additional insight into the true structure of the data.

Assuming that the feature extraction problem has been solved, a set of (augmented) training patterns D = {(Xi, yi), i = 1,2,...,N} from M pattern classes is provided. The value of yi, which indicates the true classification of Xi, may or may not be given for all i. The functional block diagram of the proposed system is shown in Figure 1.

Figure 1. A proposed pattern recognition system. (Flow chart: training patterns with known or unknown classification enter supervised or unsupervised structure analysis; the learning phase applies an M-class algorithm to compute the discriminant functions, checks whether performance is acceptable, and changes the number of clusters otherwise; the decision phase takes new observations and makes decisions.)

If the classifications of all training patterns are known, the training patterns from each class are grouped to form one cluster per class. A preselected M-class iterative learning algorithm is applied to learn the unknown parameters in a set of M discriminant functions. The rate of learning is determined by properties of the algorithm and the distribution of patterns. If the performance of the algorithm is acceptable, a set of discriminant functions is obtained which adequately classifies all training patterns. By comparing the results of this classification with {yi}, the misclassification percentage can be computed. If the performance is acceptable, the pattern recognition problem is solved and the data structure obtained assigns one cluster to each pattern class. Otherwise, all the misclassified patterns from each class are clustered. These new clusters, together with the clusters of correctly classified patterns, represent an updated data structure. A new set of discriminant functions is then obtained. The learning process will not stop until performance is satisfactory.
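To make the learning phase concrete, the following sketch (not from the thesis, and not the specific procedure of Sec. 5.2) shows one way the supervised loop of Figure 1 could be organized. It assumes a simple least-squares (generalized-inverse) fit standing in for whichever preselected M-class iterative algorithm is actually used, and it splits the misclassified patterns of each class into a single new cluster rather than invoking a full cluster-seeking routine; all function and variable names are hypothetical:

import numpy as np

def train_least_squares(Phi, cluster_ids, n_clusters):
    """One-shot generalized-inverse (least-squares) fit of one linear discriminant
    per cluster: A^T = pinv(Phi) @ E, where E holds +1/-1 cluster targets."""
    E = -np.ones((Phi.shape[0], n_clusters))
    E[np.arange(Phi.shape[0]), cluster_ids] = 1.0
    return np.linalg.pinv(Phi) @ E          # (d+1) x n_clusters

def learning_phase(X, y, n_classes, max_rounds=5, acceptable_error=0.05):
    """Illustrative version of the Figure 1 learning loop (supervised case):
    start with one cluster per class, re-estimate the discriminant functions,
    and split each class's misclassified patterns into a new cluster."""
    Phi = np.hstack([X, np.ones((len(X), 1))])      # augmented patterns
    cluster_ids = y.copy()                          # one cluster per class initially
    cluster_to_class = list(range(n_classes))       # which class each cluster represents
    for _ in range(max_rounds):
        A_T = train_least_squares(Phi, cluster_ids, len(cluster_to_class))
        cluster_hat = np.argmax(Phi @ A_T, axis=1)
        y_hat = np.array([cluster_to_class[c] for c in cluster_hat])
        error = np.mean(y_hat != y)
        if error <= acceptable_error:
            break
        # update the data structure: each class's misclassified patterns form a new cluster
        for k in range(n_classes):
            wrong = (y == k) & (y_hat != k)
            if wrong.any():
                cluster_to_class.append(k)
                cluster_ids = cluster_ids.copy()
                cluster_ids[wrong] = len(cluster_to_class) - 1
    return A_T, cluster_to_class, error

# Example usage with random data (three classes in two dimensions):
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(30, 2)) for m in ([0, 0], [4, 0], [0, 4])])
y = np.repeat(np.arange(3), 30)
A_T, clusters, err = learning_phase(X, y, n_classes=3)
print(err, len(clusters))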
If the true pattern classifications are unknown, it is possible in principle but not in practice to solve the computational problem of evaluating all the possible partitions of the input data into M clusters in order to find a partition that minimizes the probability of misclassification (F1). A suboptimal procedure for unsupervised structure analysis is discussed in Sec. 5.3. In any event, the system is switched from the learning phase to the classification phase when satisfactory system performance has been reached with training patterns.

1.4 Thesis Objectives and Outline

This thesis suggests that cluster-seeking and multiclass pattern recognition can be combined into a workable pattern recognition system, Figure 1, in which the two problems are solved in a step-wise fashion, partial solutions for one providing clues for the other. The literature for both clustering and multiclass, nonparametric pattern recognition studies is reviewed, and a specific algorithm is proposed in Sec. 5.2. Several new multiclass algorithms are proposed as extensions of two-class algorithms.

In order to realize this proposed system, a class of linear multiclass learning algorithms is derived in Chapters II, III, and IV. The idea of translating the abstraction problem into an optimization problem, in which linear functionals are minimized under certain constraints, provides a great deal of freedom when creating new algorithms. In fact, the only limitation is one's imagination in finding meaningful linear functionals. The concept of linear inequalities and the mean-square-error criterion for formulating linear functionals are studied in Chapters II and III, respectively. Based on these criteria, a general learning algorithm for the M-class problem is derived. In each chapter, particular algorithms are formulated as special cases.

Since no assumptions are made about the origins of the patterns in Chapters II and III, the pattern recognition problem is solved only for the given training patterns. In order to solve the generalization problem, we must view the fixed training patterns as a sample from a population, as is done in Chapter IV. The conditional probability functions P(wi|X), i = 1,2,...,M, are selected as the optimal discriminant functions. The techniques of stochastic approximation (W5) are employed to estimate the coefficients of the unknown discriminant functions. Several M-class stochastic approximation algorithms are proposed.
The relations among them and the sensitivity problems are investigated. Chapter V examines the abstraction problem in unstructured situations. A pattern recognition system which combines multiclass pattern recognition and cluster-seeking is proposed. The system has been tested with both artificial and practical data.

CHAPTER II
GENERALIZED LINEAR INEQUALITIES IN MULTICATEGORY PATTERN RECOGNITION

The most important task in nonparametric pattern recognition can be posed as the selection of a set of weight vectors {Wi} that defines a set of discriminant functions {fi(X) = Wi^T Φ(X)}, where Φ(X) is a vector function of the augmented pattern X; that is, the output of a "Φ-machine" (N1). If training patterns are provided, the unknown weights can be determined either with or without training. For instance, a minimum-distance classifier (N1) is achieved without training since the unknown cluster points are computed directly from the training patterns non-recursively. In this section, training is viewed as an optimization problem and the concept of linear inequalities is chosen to form linear functionals. A general training algorithm for the M-class problem is derived and a collection of training algorithms are presented as special cases.

2.1 The Concept of Linear Inequalities in Pattern Recognition

In the two-class pattern recognition problem, a set of N patterns with known classification is given. The problem is to construct a discriminant function f(X) such that

    f(X) > 0  if X is in class w1
                                                                        (2)
    f(X) < 0  if X is in class w2

for as many of the N training patterns as possible. By the Weierstrass theorem on polynomial approximation (W1), one can choose a polynomial which uniformly approximates the continuous function f(X) on a closed interval with arbitrary accuracy. The function f(X) will be approximated by

    f̂(X) = Σ_{i=1}^{d+1} c_i φ_i(X) = C^T Φ(X).

The parameters C^T = (c1, c2, ..., c_{d+1}) are unknown; Φ^T(X) = (φ1(X), φ2(X), ..., φd(X), 1) are linearly independent, preselected, real functions. Nilsson (N1) calls f̂(X) a "Φ-function," and any pattern classifier employing Φ-functions, a "Φ-machine." It is not necessary to actually approximate the discriminant function f(X) itself. It is sufficient to achieve agreement in the signs of the functions f(X) and f̂(X) for all training patterns. Thus, the problem of finding a discriminant function can be converted to the problem of solving the system of N simultaneous linear inequalities in (3).

    y C^T Φ(X) > 0  for all training patterns X.                        (3)

Here y = 1 if X is from class w1 and y = -1 if X is from class w2. The coefficients in C^T are chosen to map Φ(X) into either 1 or -1 as dictated by y. Once C^T is chosen, the decision rule is: decide X is in w1 if C^T Φ(X) > 0, and in w2 if C^T Φ(X) < 0.

In M-class pattern recognition, a set of discriminant functions {fi(X)}, i = 1,2,...,M, is needed. Each discriminant function will be approximated by a finite sum
a ‘_ M j _.M1 M2 M,d+l The letters {e1,e2,...,eM} denote a set of M m-dimen- sional points, called vertices, for which eiej = O for all i # j. By analogy to the two-class problem, matrix A is chosen to project ¢(X) onto the vertex ei as dictated by Y1, the natrual generalization of y; that is, {Y1} (i = 1,2,...,M) is a set of MXM-dimensional orthogonal matrices which satisfy the following conditions. Y?Y, = I 1 1 for all i = 1,2,...,M O Y.e. = e 1 1 1 Thus, the orthogonal matrix Yi transforms the vertex ei into e1 for each of the M pattern classes. If the M pattern classes are linearly separable, there exists a linear transformation A which maps each training pattern X from class mi close to the vertex ei. Choosing the vertex 12 e1 > 0 (each entry of e1 is greater than zero), an orthonormal matrix Yi is to be found such that Y1A¢(X) ezel for all X in class mi The natural extension of (3) to the M-class problem is: YiA¢(X) > 0 (vector) for all i and X when X is in class w. (4) 1 This is a system of MN simultaneous linear inequalities. The inequalities (4) can be written in another form. ‘CT¢(X) - 1‘ < ‘CT¢(X) + 1‘ if X C class ml (5) T T . \c ¢(X) - 1‘ > |c Q15(X) + 1\ If x c class wz . An extension of (5) to the M-class problem was proposed by Chaplin and Levadi (Cl). HA¢(X) ' 61“ < HA¢(X) - eju if X 6 class wi for all j = 1,2,...,M; i # j (6) where {ei}T=1 are M-dimensional vertices satisfying the follow- ing conditions: He H E trace {e eT} = 1 i i i uei - eju = “ei - ck“ for all j,k # i . Equation (6) can be rewritten as T T , . . eiA¢(X) > ejA¢(X) If X 6 class mi for all J # 1 for all training pattern X. Equivalently, decide X is in class w. if 1 13 eiAqfiX) - max {eTA¢(X)} > O (7) jh j Inequalities (3) and (4) are based on the idea of achieving agreement between the signs of discriminant functions and approximat- ing functions. The inequality prOposed by Chaplin and Levadi is based on a minimum-distance criterion. In both cases, the assumption of linear separability assures the existence of a linear transforma- tion A which correctly classifies all N training patterns. When training patterns overlap, Equations (4) and (7) will be inconsistent, and no solution for a matrix A can be found. 2.2 The General Learning Algorithm It is a common practice in solving linear inequalities to transform the problem into an Optimization problem. Then, the gradient descent method, or some other optimization procedure, is chosen to minimize certain linear functionals. A general approach for obtaining meaningful linear functionals will now be described; these functionals will then be minimized to obtain general learning algorithms. The two-class pattern recognition problem will be con- sidered first, and the results extended to the M-class problems. Equation (3) can be rewritten as yCT®(X) = B > O ; B is a positive number . (8) Define Z = yCT¢(X) - B for all X . A strictly convex and differentiable function defined on the set 2 is denoted by F(.) where l4 F(Z)>0 if Z<0 = 0 if z 2 0 . A solution to the equation 2 = O can be obtained by minimizing a linear functional J(C) = L{F(2)} = L{F(yCT¢(X) - 3)} where L denotes a linear operator. It is clear that the functional J(C) will also be strictly convex and differentiable, which guarantees the existence and unique- ness of a minimum. The problem is now to find C such that J(C) is a global minimum for the N training patterns. The usual method for minimizing J(C) is the gradient descent procedure (B3, Dl). 
An algorithm can be written as C[n] = CLn-l] - p :éLE 1] x[ 3 (9) where n is the iteration index, and X[n] is the training pattern presented to the algorithm at the nth iteration; p is a positive scalar which determines the distance to be moved at each step in the direction of the negative gradient of J(C) in the parameter space. Equation (9) leads to the following general iterative algorithm for the two-class pattern recognition problem. (GA.1) c[n] = c[n-1] + pL{F'(en_1)Y[“]¢(x[“])} em = sin-1] + pLUF'\ - Wen-1’} where en-l = y[n]CT[n-l]¢(X[n]) - B[n-l] for all n = 1,2,3,... and B[O] > O . 15 It should be noted that, in the above algorithm, only one training pattern is used at each iteration. Each of the N train- ing patterns must be presented to the algorithm many times to assure that J(C) is a global minimum. Algorithm (GA.l) can be extended easily to the M-class prob- lem by invoking (4). Equation (4) can be rewritten as YiA¢(X) = v > O for all i, X . Letting W = YiA¢(X) - y, a solution of the linear equations W = O can be obtained by finding the minimum of the linear functional: J(A) = L{F(W)] = L{F(YiA¢(X) - y)} . Again using (9), a general learning algorithm for solving the M-class pattern recognition problem is obtained. (GA.2) A[n] = A[n-1] + pL{F'(gn_1)Yf[n]¢(X[n])} Yin] = Yin'll + DL{\F'(€n_1)‘ ' F(en_1)} where en-l = Yi[n]A[n-l]¢(X[n]) - y[n-l] . Euqation (GA.2) holds for all X in class wi’ i = 1,2,...,M; and n = 1,2,...,; y[0] > 0. Similarly, (7) can be rewritten as T T eiA¢(X) - max ejA¢(X) = B >’O . (10) ji‘i This leads to the following general learning algorithm: l6 (GA.3) Ain] = Aim-1] + pLiF'(sn_l>}[ei¢T(X[n]> - mix ej¢T(X[n]>} j i B[n] = B[n-l] + pL{\F'(en_1)‘ ' F'(en-1)} where en-l = e§A[n-I]¢(X[n]) - mg: e§A[n-1]¢(X[n] - e[n-1]). Equation (GA.3) holds for all X in class mi where i = 1,2,...,M and n = 1,2,... . In two-class pattern recognition, setting e = 1, e = -l, and A = CT changes (10) to: 2CT¢(X) = B if X 6 class ml T ' . -2C ¢(X) = B If X 6 class wz . This is equivalent to: T . yC ¢(X) = 5/2, y = 1 If x 6 class ml -1 if X 6 class m2 - This is exactly (8). Therefore, (GA.3) reduces to (GA.1) for two- class problems. Based on the general learning algorithms (GA.1), (GA.2), and (GA.3), a particular class of iterative algorithm can be derived easily. In fact, the number of specific algorithms is limited only by the availability of meaningful convex functions F(.) and linear operators L. In Sec. 2.3, a class of standard iterative algorithm for the M-class problem is presented. 2.3 A Class of Iterative Procedures for Linear Inequalities A class of iterative algorithms for the two-class problem corresponding to (GA.1) with B[n] E O for all n is given in 17 Table 2 of Devyaterikov, Propoi, and Tsypkin (D1). The extension to M-class algorithms according to the general algorithm.(GA.2) is straightforward and hence not included. A collection of iterative algorithms corresponding to (GA.3) will be presented in this section. 2.3.1 The Fixed Increment Method Consider the following linear functional. 1 T T J(A) = —{‘e_A¢(X) - max e,A¢(X) - 5‘ 2 1 j#l J - (eEAqflX) - max e?A¢(X) - 3)} . j#i J This function has a unique minimum only for those values of A satisfying (7). Note that J(A) = 0 if egA¢(X) > e§A¢(X) for all i # j, X 6 class mi T T . = -(e,A®(X) - max e,A¢(X) - 3) otherwise . 1 ifij J The gradient Of A is J _ , T T . . T T . = -(ei¢ (X) - max ei¢ (X)) otherwise . 
l#j Substituting into the general algorithms (GA.3): (PA.1) A[n] A[n-l] + %p{[ei¢T(X[n-l]) - max ej¢T(X[n-1])] ' jfil [ei¢T(X[n-l]) - gm: ej¢T(X[n-1])]sgn(en_l)} B[n-l] + %p{[1 - sgn(gn_1)] +"l sgn(en_1)‘} BIn] where en-l — e§A[n-1]¢(X[n-1]) - max e§A[n-l]¢(X[n-l]) - Bln-l] . j¢1 18 In this algorithm, no correction is made to A if eEA¢(X) - max eTA¢(X) > B , jfi j which means X is correctly classified. If X is incorrectly classified, A is incremented by T T p(ei¢ (X) - max e.¢ (X)) jfi 3 That is, A[n] = A[n-1] + p(e ¢T(X[n-1]) - max e.OT(X[n-1])) The convergence proof of algorithm (PA.1) with B[n] E O for all n is a modified version of Novikoff's convergence proof for perceptrons, the details of which can be found in Blaydon, Appendix II.2 (BB). 2.3.2 The Relaxation Method Consider the following linear functional. J(A) = l((e'¥‘A¢(X) - max eT,‘AQ5(X) - B) 8 1 1#j J T T 2 - esA¢(X) - max e A¢(X) - B ) . A 1 jfi j ‘ Thus J(A) = 0 if e§A¢(X) > e§A¢(X) for all i # j, X 6 class mi §A¢(X) - B)2 otherwise . 1 T "(eA (X) -maxe 2 1 ¢ j#I The gradient of J(A) is its 0 if eEA¢(X) > e§A¢(X) for all i # j, X 6 class mi T T (eiA¢(X) ' max ejA¢(X)-B)(ei¢T(X)-max e-¢T(X)) otherwise. 1*) i#j J 19 Substituting into the general algorithm (GA.3) yields: (PA.2) A[n] = A[n-1] if e§A[n-l]¢(X[n-l]) > e§A[n-1]¢(X[n-1]) for all i # j, x 5 class mi = Aim-1] + puendlteisTocin-Iv-Tz ejchmn-Im otherwise where en_1 = efA[n-l]¢(X[n-l])-?:§ e§A[n-l]¢(X[n-l])-e[n-l] . B[n] = e[n-l] if e§A[n-l]¢(x[n-l]) > e§A[n-l]¢(X[n-l]) for all i # j, X 6 class mi = Bin‘I] + p{len_1‘ + €n_1} otherwise . Algorithm (PA.2) behaves as algorithm (PA.1) except that the increment added to A[n] is weighted by the magnitude of the absolute error. 2.3.3 The Minimum-SquareéError Method Consider the following strictly convex and differentiable linear functional. 1 T T 2 J(A) = —(e,A®(X) - max e A¢(X) - B) , 2 1 j I#j The gradients of J(A) are LJ ... (e’ngbCX) - max eTAsoc) - e)(ei¢T - max 8 451300) ai.= - ?A x - IA x - 86 (e1 ¢() :1; ej ¢() 8) 20 Substituting into the general algorithm (GA.3) yields: (PA.3) A[n] A[n-1] - p{en_1[eiOT(X[n-1] - 9:3: ej¢T O for all i and X 6 class mi . 1H 3 If it is satisfied, the problem is solved. If not, the number of misclassified training patterns is counted; the learning procedure is terminated after a fixed number of iterations and the solution with the smallest number of errors is retained. CHAPTER III THE DEAN -SQUARE -ERROR CRITERION IN MJLTICATEGORY PATTERN RECOGNITION The unknown parameters in a discriminant function are established by selecting and Optimizing a performance criterion. If the criterion is well defined, the parameters can be found by search techniques such as the gradient descent method in (9). For the two-class problem, the criterion function should be chosen so that the discriminant function f(X) has approximately one value whenever X comes from class ml and a different value whenever X comes from class wz. One such method is the mean-square-error criterion, first used by Widrow and Hoff (W2) to find an adaptive procedure for classifying binary patterns. Koford and Groner (Kl) later extended Widrow and Hoff's procedure to real-valued patterns and showed that the adaptive pattern classifier converges to the optimal classifier (in the Bayes sense) for normally distributed pattern classes with equal covariance matrices. Sebestyen (Sl) prOposed a two-step clustering transformation which maps all patterns belonging to one class into the neighborhood of a fixed point. 
Sebestyen's cluster- ing transformation actually minimizes a mean-square-error criterion. A mean-square-error approach to the M-class problem, using the generalized inverse idea for rapid computation, was undertaken by Wee (W3). This approach allows numerous results obtained in the 21 22 two-class problem to be extended to the M-class problem. In fact, the results show that the pattern classifiers proposed by Widrow and Hoff (W2), Patterson and Womack (P1) and Chaplin and Levadi (C1) are special cases of Wee's procedure. The remainder of this chapter will contain a mathematical formulation and some applications of the mean-square-error criterion. 3.1 The Mean-Square-Error Criterion in Pattern Recognition The mean—square-error criterion will be formulated for the two-class problem and then extended to the M-class problem. Since minimizing the mean-square-error does not necessarily minimize the probability of misclassification, the relation between the resulting discriminant function and the Optimum Bayes discriminant function will be discussed under special circumstances. If the discriminant function f(X) in (2) is considered to be a transformation from the feature space to the real line, then a reasonable solution to the two-class pattern recognition problem will be to find an 'f(X) which transforms all patterns X belonging to class m1 as close as possible to some number 31 > O and which transforms all patterns belonging to class m2 as close as possible to 32 < 0. Using this reasoning, the mean-square-error criterion can be formulated as follows. N2 2 N . 2 [Minn-3112 + 2: [f(xian-BZ) (11) 1 2 i=1 i=1 1 where {X1(i),X2(i),...,XNi(i)} are the training patterns from class mi (i = 1,2). Patterson and Womack (P1, P2) have proposed a modified version of (11), 23 P N1 T 2 P2 N2 T 2 J(C) = J z [c X.(1)-C(2/1)] + —— z [c x.(2)+C(1/2)] N1 i=1 1 N2 i=1 1 where P1, P2 are the prior probabilities of pattern classes wl and m2, respectively; C(i/j) denotes the cost in assigning pattern X to class mi when it really belongs to class mj. Since J(C) is a convex function of C, a unique minimum exists, which can be obtained by computer search techniques. Patterson and Womack (Pl) also relate the optimum Bayes discriminant function and the least mean-square-error discriminant function as the number of training patterns from each class approaches infinity. This relationship is stated in Theorem 3.1.1. If pi(X) denotes the probability density of X, given that class m i is active, and P is the prior probability of 1 class wi’ the Bayes discriminant function is: 5(X) 2 0 if X 6 class ml (12) < 0 if X 6 class m2 where 800 £C(2/1)P1pl(X) - C(1/2)P2p2(X) . P1p1(X) + P2132 (X) N l 1 T 2 Theorem 3.1.1. If lim f 2[(: Xi(1)-C(2/1)] = N am 1 i=1 1 EIECTX - C(2/1)] almost everywhere and N 1 2 T 2 1: 2 11m —-— z [c x,(2) + c(1/2)] = E [c x + C(1/2)] N 1 2 Nzem 2 i=1 almost everywhere 24 * then, if C =C minimizes J(C) as N also minimizes the following integral [ch - e]2[P1p1(x> + P21»2 1dx fax provided the integral exists, where “X is the feature Space. In other words, the least mean-square-error discriminant function is the best mean-square, linear, approximation to the optimum Bayes discriminant function as the number of training patterns from each class approaches infinity. Equation (11) can be rewritten as 2 N- 1 1 2 J =1; )3 Z [f(X.(i)) - Bi] 3 i=1 j=l 3 where N = N1 + N2 or, equivalently, as: 2 Ni 1 T 2 1 2 J(C) =1; 2 r (x.(i)C-a.) 
=-\\YC-a\\ (13) ._ ._ j l N 1-1 j-l where T Y - X1 T x2 training patterns from class w J k-_ - .... r___._.___.J h’ training patterns from class wz 25 B = 61 , where Bi > O for all i = 1,2,...,N L__._J The mean-square-error criterion will now be extended to the M-category problem. Consider the set of M discriminant functions £f1(X)}?=1 as a set of transformations which, for each i, fi maps all multidimensional patterns belonging to class mi as close as possible to some K-dimensional vertex* ei = (eil’ei2"'°’ei,K)T as defined in Sec. 2.1. The mean-square error criterion can be written as follows. 1 K M Ni T 2 k=1 1=1 j=l This is equivalent to _ I. T 2 A l_ T T T J - N “WA -EH _.N trace {(YA as) (yA 43)} (15) where A is defined in Sec. 2.1 and r- __ Y = Y[1] vtraining patterns from class m1 Y[2] training patterns from class wz Y[M] training patterns from class wM * The number K = M-l if all the vertices lie on a sphere with centroid at the origin having equal angles formed by the lines joining the vertices and the origin, that is, eT = l iej if 1 = j = -l/K otherwise _ T K M if eiej 1 if i = j 0 otherwise 26 M r- N = Z N and Y[i] = X'{(i)-j for all i 1:1 i i XN (1) __ i _. E = e = e e ... e 1 11 12 1M T eT e e e __N_J __N1 N2 " RE“ where e, > 0 if X, 6 class w lj 1 j e,, < 0 if X, E class w . 1] 1 j The pattern recognition problem becomes the problem of selecting A and E to minimize (15). In Sec. 3.2, the relation to Bayes procedure will be discussed. 3.2 The Generalized Inverse Approach in Multiclass Pattern Recognition Generalized inverse computations can be used to furnish a quick solution under the mean-square-error criterion for M-class problems, (15), for fixed-size training patterns. Since the rows of matrix E in (15) can be interpreted either as reference points (vertices) or as cost vectors, the pattern classifier proposed by Chaplin and Levadi (Cl) and the adaptive classifier of Patterson and Womack (P1) are special cases of the generalized inverse approach proposed by Wee (W3). was also extended Theorem 3.1.1 to the M-class problem with (15). However, in both of his interpretations, the matrix E is fixed and a solution to (15) is obtained by letting dJ/dA = O. 27 A modified version of the generalized inverse approach per- mitting variation in the E matrix, subject to certain constraints, will now be described. In fact, the proposed method resembles the Ho-Kashyap strategy (H2) for the two-class problem. The introduction of the entries of the E matrix as additional parameters in the optimization problem improves the convergence rate of the algorithm. Since A is unconstrained, the minimum of J in (15) can be obtained by letting dJ/dA = O. This implies that: - # AT = (YTY) 1 YTE = Y E where Y# is called the generalized inverse (W3). Let N. 1 1 x[i] = §- 2 X,[i] i j=1 J T 1 M Ni T xx = {q- r z x.[i]X.[i] i=1 j=1 J J ._ 1 M x = " z N.X[i] . N 1 =1 Then, T M Ni T .4. M Ni T A = 2 2 X.[i]X.[i] z z X.[i]ei i=1 j=1 J 1 i=1 j=1 J -—-— M = N'1(xxT)‘1 z N, X[i] e? , i=1 1 1 Set e: = (C(1/i),C(2/i),...,C(M/i)) = CT(i) and assume 28 C(j/i) o if i = j (1 6) c > 0 otherwise -1 __ N ____ __ N ____ __ N ____ AT = c(X x1) (x {€1qu x -N—2-X[2],...,X -fiMX[M]). A reasonable decision rule is: Decide X 6 class mi if T “X?A - cT(1)u2 < HX?AT - cT(j)\\2 for all j # i. (17) Expanding (17), the decision rule becomes: Decide X 6 class wi if M M c 2 XTak - %(M-1)c2 > c 2 XTak - 'Zl'(M-1)c2 k=1 k=1 Rfii k#j or, if XTaj > xTai for all j r i (18) where _'I '1 - Ni ai = c(XX ) X - fi-'X[i] . 
Equation (18) is the result of the generalized inverse approach when the E matrix satisfies (16). In practice, if a pattern class has multimodal structure, a cluster-seeking technique should be employed and one discriminant function should be assigned to each cluster before applying (18). The relation between the generalized inverse approach and the Optimum Bayes decision rule is as follows. By analogy to (12), the Optimum Bayes discriminant functions for the M-class problem are (W4) 29 M 2: C(i/k)P (X)? k=1 k k Bi(X) = M for all i = 1,2,...,M E p (101’ k=1 1‘ k where, ideally, Bi(X) < B (X) if X is from class mi. 1 Let 3T .-. (31(x), 132(X),...,BM(X)) Theorem 3.1.1 can be extended to the M-class problem as follows. N 1 Theorem 3.2.1. If lim .fi" 2 HXI[i]AT-e?u . ._ j 1 N14» 1.3-1 = Ei[HXT[i]AT-efuz] almost everywhere then M. N Ni 2 lim 2 fii' fi" 2 WXT[i]AT-e?u N—ml=1 ij=l j 1 i M = i§1P1E1[HXT[i]AT-eifiz] almost everywhere . Furthermore, minimizing J of (15) is equivalent to minimizing the integral M T T T 2 HXA - B H ( Zpk(X)Pk) dx . I“): k=l The proof is given in Wee (W3). Thus, the discriminant functions obtained by the generalized inverse approach are closest among all linear functions to Optimum Bayes discriminant functions in a mean-square sense as the number of training patterns approaches infinity. It is assumed that the entries of matrix E can be varied subject to the constraint that any rwo vector e in the grouping 30 of rows corresponding to class wi must satisfy the inequality T T . e (n)ei(0) 2 e (n)ej(0), for all j r 1 where n is the iteration number, and where ei(O) is the vertex vector assigned to class mi. It satisfies: e§(0)ej(0) a if 1 = j s > 5 3 otherwise where a, B are real numbers. The problem becomes to find A and E so as to minimize J(A,E) '1qu “XAT-EHZ = trace{(XAT-E)T(XAT-E)} . ZIP‘ The complete algorithm can be derived by minimizing J with the gradient descent method. (GA.4) AT[O] = Y#E[0] D[n] = yAT[n] - E[n] AT[n+1] = AT[n] + Y#6E[n] E[n+l] = E[n] + 6E[n] where 5Ein11j = DDin11j if 91“]ijej[0] 2 e[n]ije&[0] for all L f j = 0 otherwise Here, 6E[n]ij denotes the ith row vector in the grouping of rows corresponding to class mi. The convergence proof is given in Appendix A. A computer simulation Of (GA.4) was performed on CDC 3600. Two artificial pattern recognition problems were con- sidered. The first problem contains three linearly separable 31 pattern classes with two training patterns from each class. The decision surfaces which correctly classify all training patterns are shown in Figure 2. The second problem contains three linearly unseparable pattern classes. There are eight training patterns from class 1, six from class 2 and two from class 3. The decision surfaces which misclassify one training pattern are shown in Figure 3. 3.3 Some Properties of LeastAMean-Square-Error Pattern Classifiers The properties of linear two-class pattern classifiers which are based on the mean-square-error criterion have been investigated by a number of authors (K1, P1, P2). In this section, the statistical properties of two-class classifiers will be studied and the results extended to the M-class problem. The mean-square-error criterion for the two-class problem is given in (13) N. l 2 1 T 2 J(C) =§ 2 2 (X.(i)C - Bi) i=1 j=l J or equivalently, N 2 i 1 *T * 2 J(C) =fi' 2 Z (X. (1)0 + CO - Bi) i=1 j=l 3 where 0* A (C C C ) and C A C " 1’ 2""’ d O -' d+1 ' For simplicity, the star (*) is dropped. Thus 1 2 N1 T 2 J(C) =" 2‘. 2 (X (i)C +C - B.) 
(19) N i=1 j=l J 0 1 ° The problem is to find C and CO to minimize J(C). 32 f13(X) = 0 l 3 E X1 2 ‘3 Class #3 ‘L / I 1+2 — X1 - 0.26X2 - 0.16 — X1 + 1.28X2 - 0.16 f13(X) = XI - 6.4X2 - 0.16 H1 ...: N A X V I H N U ’52 v I Figure 2. Decision surfaces for algorithm GA.4. 33 /1 2 \‘x a x X 3 //\ Class #1 A - X 0' X2 f23(X)=O Class #3 A K x x R 5 4 t = X 3 1 2 3 4 1 O 2 o O 1 x = O 2 f12() 0 3\ C1ass#2 \1 f13(X) = 0 f12(X) = XI + l.lX2 + 0.01 £1300 = XI - 0.3x2 + 1.2 Figure 3. Decision surfaces for algorithm GA.4 34 Let Ni Al . M _._ Z. X(1) and 1 N1 3:1 j N (20) s 11— 21mm -M) (xm -M)T i‘Ni 1:1 j i j i be the sample mean vector and sample covariance matrix for pattern class wi' Equation (19) can be rewritten as N _ l T T 2 J(C) — N1+NZ 1:1 Ni{C sic + (Mic + cO - Bi) } . (21) Letting OJ/OCO = 0 , 2 2 T 2: N.(MC+C - 8.) =0 N1+N2 i=1 1 i 0 1 or T c = -(MlN1 + MZNZ) c + N131 + N282 (22) 0 N1 +u2 ‘ If N1 = N2 and Bl = 1, 82 = -1, then _ T c0 — -l/2 (M1 + M2) c . Substituting (22) into (21), 2 J(C) = CT“1 + s2)c +1 (CTCM1 - M2) - 2) NIH Let “5 i @i and M12 _.M1 - M2 ||r1:z 1 . J(C) = cTsc +% (CTM12 - 2)2 35 Letting aJ/aC = 0, l T ZQC + §'(C M12 - 2)M12 = 0 or 'r sec + (Mucus12 - 2M12'= o . Try = (1)112 2 + "11“;[22 1M12 Thus (MT - 1 + (1112‘1 l’hzi’hz _ 2M 2 + 1M? {R11 2 + 11113212 2 2 2111112 1 T - ’ 2 +51? {1111 141524101121 11“12mm 4 2 2 2 'r -1 ) ' (M121 M12 M121 = o . From (23), — - = -1 - c—Ks 11412 Rs (M1 M2) where K — _ 1 T - 2 +2111? 1“12 (23) The least-square-error linear classifier for the two-class problem is: 36 , T ml If X C +Co 2 0 T say X 6 class m2 if X C +-CO < 0 say X 6 class (24) where C and C0 are defined in (23) and (22) respectively. On the other hand, the optimal Bayes pattern classifier for the two-calss problem, assuming normal distribution with equal co- variance matrices, is (D2) X 6 class m1 if T -1 l. T -l X Q (M1 - M2) - 2(M1‘+ M2) Q (M1 - M2) 2 0 . (25) Let c = {1011 - M and c = %(M T 2) 0 ‘+ M2) C . 1 Equation (25) becomes, X G class ml if XTC + C0 2 0 which is equivalent to (24). Therefore, the least-mean-square-error classifier is equi- valent to the optimal parametric classifier for normally distributed patterns with equal covariance matrices. A different proof leading to the above result has been pro- vided by Koford and Groner (Kl), who also show that a simple modifica- tion of the least-mean-square-error adaption procedure enables the adaptive structure to converge to a nearly optimal classifier, even though the numbers of training patterns from the two categories are not equal. 37 For the M-class problem, the mean-square-error criterion is given in (14). 1 K M N1 'r 2 J=§ E z z (xj(i)ak-eik) k=l i=1 j=l or equivalently, N K M i * 2 J=l% z 2 )3 (X::T(i)ak+ak -e'k) k=l i=1 j=l ° 1 where A * T A * T X _.(X ,l) and ak _ (ak’ako) . Again, for simplicity, the star (*) is drOpped. Thus M Ni K r 2 z z 2 (x(i)a +a -e.) . k=l 1=1 j=1 3 k k° 1k (.4 ll 2 IH The problem of minimizing J with respect to ak and ak0 is equivalent to minimizing J(ak, ako)’ k = 1,2,...,N, where 1M N1 T 2 J(ak, ako) = fi-igl jEI (Xj(1)ak + ak0 - eik) . (26) Applying (20) and assuming N1 = N2 =...= NM, Equation (26) can be rewritten as: J( )=-1-M T +(TM+ )2 27 ak’ako M 121 ak‘iak £1R i ako " eik ' ( ) Letting aJ/aako = 0, M r + - 1:1 (akMi ako eik) ZIN> 38 Let M M — 1 ‘- 1 M = -' z M, and e = -' 2 e, M i=1 1 k M.i=1 1k then a = -aT1~_4+ Z . (28) Re R k If ei = (eil,e12,...,eik), i = 1,2,...,N, are simplex vertices, then M 2 e. = 0 for all k . . 
1k 1=l If ei = (eil’e12"°"eiM)’ for all 1, are vertices, then ek # 0. Substituting (28) into (27), M _ _1_ '1‘ — T - 2 J(ak’ako) ' M 121 ak‘iak + [(Mi M) 3k ' (ek + e119] } ° Setting aJ/aak = O, M l_ T T T T—“‘T — M 121 [%akéi + ZakMiMi + ZakM M - 4a MWM - 2(2k + eik)(Mi - M) Li! H c: or u M 3 z 3 I zl 3% 39 and Then a: = vk[§ +13]'1 . (29) Substituting (29) into (28), aka = -vk[¢ + T]'1fi + E# . (30) Equations (29) and (30) indicate the relation between the optimal solution based on the least-mean-square-error criterion and the sample covariance matrices and sample mean vectors. 3.4 Relation Between Linear Inequalities and the Mean-Square- Error Criterion In the two-class problem, a correct decision is made if yCTX(i) > o 1 1,2 where y = 1 if 1 = 1, and y -1 if 1 = 2. Let 2 = yCTX(i). Ideally, CT is chosen to maximize the probability that Z > 0. On the other hand, the mean-square-error criterion implies that CT is chosen to map X(i) as close as possible into points 1.1, which minimizes the mean-square-error |cTX(1) - ylz , or (2-1)2 . In the M-class problem, the corresponding linear inequality is: 40 yiAX(i) > 0 for all i where A is a linear transformation which maps X(i) into the vertices {e1,e2,...,ek}, k s M. Let W = YiAX(i). Ideally, the matrix A is chosen to maximize the probability that W > O. Chaplin and Levadi (Cl) have shown that the result of maximizing the probability that W > 0 is identical to that obtained by mini minimizing 2 ‘W - e1| . It can be shown that the result of maximizing the probability that Z > 0 is identical to that obtained by minimizing 2 (Z-l) . T Consider Z = yC X(i) as a random variable with unknown 0: =E(z - {)2, where 2 denotes the mean. By the Tchebycheff inequality, (D2) density function and finite variance _ _ '0 2 prob(‘Z -Z‘ 2 Z) _<. (12") . Z Since prob(‘Z - 2] 2 Z) > prob(Z < O), minimizing the ratio O§IZ£ minimizes prob(Z < O), or maximizes prob(Z > O). 52 —2 T -2 In order to minimize prob(Z < O), the term Z Z /Z must be maximized. N N H _ cTX(i)ng)c 9 MC) . 2 (chmy) may CT) 41 Setting aH(C)/5C = O, *-——-—- T T X(i)XT(i)C qugnx (1)0 X(i)y = o C X(i)y or . T . -l "‘T- C = (X(1)X (1)) K X(1)y (31) where T T K = C Xéi)x (i)C , a real number. C X(i)y Now, minimize (z - 1)2 = CTX(i)XT(i)C - ZCTX(i)y + 1 . Letting a(Z - 1)2/aC = 0, 2X(i)XT(i)C -2m = 0 or c = (xmxTun'lm . (32) If K = l in (31), then (31) is identical to (32). For the M-class problem, ‘W - el‘2 = \AX(i) - ei‘z = XT(i)AtAX(i) - 2e?AX(i) + 1 a‘W ’ 31‘ O i 1. 8A , mp 1es X(i)XT(i)AT - X(i)e§ = 0 or -————-———— -1 AT = (X(i)XT(i)) X(i)ef . (33) 42 By analogy to the two-class problem, consider the random variable 9 = ‘W‘ with unknown density function and finite variance 02 9 O 02 T 2 .9. = 9.1. .. 1 == lfl— - 1 -Q -2 _.2 ' 9 9 \WI Equation (33) can be obtained by minimizing ‘W‘z / \W‘Z . Letting 2 a_. W 2 = 0 , 3" m a— \w\2 -KL m2 = o (34) 3A 5" where 2 _ x = M /|w\2. Since I‘VE ‘ ( ‘2 M K d+1 2 w =Y.AXi) =1: 22y.ax(i) 1 m=l(j=1 k=1 “‘1 jkk ) W 2 M K d+l a-L-L—= 22 z 2x(i)x(i)ayy aAqr m=l j=l k=l r k jk mj mq which is the (q,r)th component of the matrix BW‘Z T TT T T = 2X(i)X (i)A Y.Y, = 2X(i)X (i)A . (35) 3A 1 1 Similarly, _2 2 M ’K (1+1 2 |w| = \YiAX(i)[ = 2 ( )3 t. ymjajkxkfi) m=l j=1 k=1 43 a 2 M K (H'l = 2 i 13.1. “£1 ”rm (j; 1.51 ymjajkxk( ’) qu qr which is the (q,r)th component of the matrix amz 5A = 2 X(ifiyi . (36) Substituting (35) and (36) into (34) produces 2 X(i)XT(i)AT - K 2X(1)GTYi = 0 or -1 AT = (X(1)XT(1)) X(i)KWTYi . 
Choosing KWT = e{, then T T '1 T A = (xmx (1)) X(i)ei which is identical to (33). CHAPTER IV MUDTICATEGORY PATTERN RECOGNITION USING STOCHASTIC APPROXIMATION TECHNIQUES In the previous chapters, a class of deterministic pattern classification algorithms was discussed. The pattern recognition problem was solved only with reSpect to the N training patterns given. No questions related to the generalization problem could be answered. The stochastic, or parametric, approach to pattern recognition (B4, K3, P4, W6, Y1) views the N training patterns as samples from the populations corresponding to the pattern classes. The pattern classes are described by conditional probabilities P(wi!x), the probability that pattern x belongs to pattern class mi. In this chapter the function fi(x) = a§¢(x) will be used as an approximation to P(wilx). The stochastic problem is then one of selecting a suitable criterion function involving all the available information so that the extremum of the criteria function correSponds to a reasonable approximation to P(wi|x), i = 1,2,...,N. In this chapter, the generalized inverse idea is invoked to form the criterion function. The mathematical formulation of the pattern recognition problem is discussed in Sec. 4.1. In Sec. 4.2, two stochastic algorithms which are similar to the deterministic algorithm of Wee (W3) are proposed. Both schemes use information from all training patterns at each stage of the algorithm. In Sec. 4.3, a stochastic algorithm with a special updating property is proposed. At each iteration, only the information from the 44 45 particular pattern presented is utilized. As the number of training patterns increases without bound, the algorithm of Sec. 4.2 is equivalent to the algorithm of Sec. 4.3. In Sec. 4.4, the problem of an occasionally mislabeled training pattern is investigated. The mean square approximation error is defined and its upper bound is computed. A computer simulation of the algorithms is presented in Sec. 4.5. 4.1 Mathematical Formulation Let there be M pattern classes w1,w2,...,mM, any one of which can be active to produce a d+l dimensional (augmented) pattern vector X as shown in Figure 4. The prior probabilities q1,q2,...,qM and the probability densities p1(-),p2(-),...,pM(-) are unknown. However, it is possible to observe a sequence {(X1,yi), i = 1,2,...,N} of training patterns. The correct classification of pattern X is denoted by yi E (1,2,...,M). The i patterns {Xi} are chosen independently under probability density pk(-) and prior probability qk when y1 = k. The controller has two operating modes described as follows; 1) Training mode: the switch is governed by prior probabilities q1,q2,...,qM and the exact location of the switch is observable 2) Decision mode: Same as the training mode except that the location of the switch is unknown. We assume that P(wi|X) is approximated by 3E¢(X)\fi = 1,2,...,N, where J(X) = (cpl(X),cp2(X),...,cp&+1(X)), and {winnfg is a set T of known and bounded functions of X; ai = (ail’ai2"°°’ai,d+1)’ {(aij)‘i = 1,2,...,N, j = 1,2,...,d+l} is a set of constants to be and determined. 46 The training patterns are compiled in a matrix. T . . . m (X1) where x1,X2,.b.d.,XN 1s a set of N training Q A m?(x ) samples; N = Z N where N. denotes the N -' 2 i=1 i 1 I number of training patterns from pattern class T 1:9 or. - The coefficients to be determined are compiled in the A matrix de- fined in Sec. 2.1. The classifications of the training patterns are summarized in the matrix Z . N ZT(X1) 21(X1) 22(X1) .... 
zM(X1) 2N A zT(x2) = 21(x2) 22(x2) zM(X2) 1 if Xi E wj ( yi = j ) where zj(Xi) 0 if Xi {inj (yi # i) In general, Ez ‘X[21(X)] = 1-P[zj (X) = 1\x] + O-P[zj(X) = o|x1 J' Pb: = fix] = P[mj\X] . The matrix Pk is the following matrix of conditional expectations. —P_('w1\x1) P(w2|x1) P(u)M‘x1) F(wl‘xz) P(w2|x2) .... P(wM\x2) A . . . PN- : . : P(w\x) P(w|x).. P(u)‘x) ___ 1 N 2 N M NhJ The random matrix Z can be considered as a noisy "measure- N ment" of PN since ZN = Pk +VN where VN is the "measurement 47 noise". This particular formulation is very useful in proving Theorem 4.2.2 later. 4.2 Generalized Inverse Adaptation The stochastic pattern classification problem can be viewed as the problem of constructing an approximation to a function which can only be measured in the presence of noise. In other words, we are given a random pair (QN’ZN) which contains all the available in- formation, and requested to determine the unknown parameters in A so that the unknown functions {F(wilX)}?=1 are approximated by {a§¢(x)}?=1 for all i and all patterns X. The criterion function prOposed here involves the available random pair [QN’ZN] and is similar to that used with the deter- ministic generalized inverse approach of Wee (W3). 1 2 1 JN(A) = h— HZN ' QNATH AI; TracedZN - QNAT]T[ZN - QNATD. The matrix AI = 28-; which minimizes JN(A) can be written as T T '1 T By the gradient descent procedure, this can be computed iteratively as T (SA-1) AT - p(n)TN[§NAT(n) - 2N] where n denotes the iteration number, n = 1,2,3,... , AT(1) is an arbitrary m X M matrix (e.g. A(1) = BI where B > O is an small positive number) and p(n) is the weighting factors required for stochastic approximation algorithms. The following conditions 48 are necessary for convergence (BB, Gl). Q on 2 Lim p(n) = O, 2 p(n) = w, 2 p (n) < m . new n=l n-l The second order gradient scheme defined by T a 2N, (A) -1 MN (A) A (n+1>= AT (n) - p < 2 —) (ET—”Mm a A leads to the stochastic algorithm defined below. aJNATHTJ We now show that the criterion function J(A) is a suitable criterion since it is equivalent to a mean-squared criteria function M J(A) = z: qiEiLHZTX) - cpT(X>ATHZ] i=1 M M T quiE 1Ez|1TkE 1‘71"" ' “’T 003192 1 i: M M = z 1: :11 E [Pmkm - 2PcpT 0081, + WT 0031821 1:1 k=1 = EM :1 E. [P( le) - P20» \X)] + E [Pan IX) - T008121 i= -1 k=1q w k qi i k T k = J0 +~J(A) M M 2 where J z 2 q. E. 1[F(u) ‘X) - P (m ‘X)] is not a function of the O= i=1 k—l k k 50 unknown parameters A. A M M T 2 = — a J(A) .2: z qiEi[P(wk‘X) (p (X) k] . 1=l k=1 Thus,minimizing J(A) corresponds to minimizing 3(A) which is the mean-square-error approximation criterion. The solution for the unknown coefficients can be written explicitly as follows. The matrix A? = A: which minimizes J(A) is, T M T -1 M A* = {.31‘11Eiwo‘m (X))} {iilq1E1[‘P(X)P(w1‘X)]’ 1: dra:z M 1Q iEiEqXX) P(leX) ] , - - - , iglq iEihpoi) P(wM\X)]} - 1 The relation hijél,= 0 implies 68k M T zlqiEito - 2P + 2((p (X)ak)cp(X)] i: M T = 2 qiEi[-P>1,.... 2q q1 Elzia- P>]} i= i=1 M M = { ZlqiEiiw(X)(F(w11X)-P(w11X>)],...,q 2q .E [a(X)(P(mMIX)- P(w M\X))]} i= i=1 T T Combining these three parts, lifllAN = A* with probability one. Q.E.D. N—m 4.3 A Stochastic Approximation Algorithm with an Updating Property The method of stochastic approximation prOposed by Robbins and Monroe (R2, W5) is designed to find the zero of a regression function R(x) = EV(x), where V(x) is a random function. Sometimes, the R-M method is employed to locate the minimum of R(x) by finding the zero of a regression function E[V'(x)]. 
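A minimal numerical sketch of the scalar Robbins-Monro iteration just described is given below; it is illustrative only and is not part of the thesis. The regression function R(x) = E[V(x)] = x - 3 and the gain constant are assumptions chosen for the example, with the gain sequence rho_n = c/n satisfying the usual conditions (sum of rho_n diverges, sum of rho_n^2 is finite):

import numpy as np

def robbins_monro(noisy_v, x0=0.0, n_steps=2000, c=1.0, rng=None):
    """Scalar Robbins-Monro iteration x_{n+1} = x_n - rho_n * V(x_n),
    with the gain sequence rho_n = c / n."""
    rng = rng or np.random.default_rng(0)
    x = x0
    for n in range(1, n_steps + 1):
        x = x - (c / n) * noisy_v(x, rng)
    return x

# Toy regression function R(x) = E[V(x)] = x - 3, observed only through noise.
# The iterates converge (with probability one) to the root x* = 3.
noisy_v = lambda x, rng: (x - 3.0) + rng.normal(scale=0.5)
print(robbins_monro(noisy_v))   # approximately 3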
52 To study the multidimensional case, let V be a stationary random vector which is observed at discrete times. The probability distribution of V is unknown. Let H be a matrix of parameters, and let f(V,H) be a real valued function of V and H. Our prob- lem is to find a matrix H = H which minimize a regression function * R(H) QEv{f(V,H)} . Since the probability distribution of V is unknown, R(H) cannot be evaluated. However, in practice, if we are given a sequence Vn’ n = 1,2,..., of independent observations of the random vector V, the matrix H* can be obtained by iteratively finding the zero of a regression function V R(H) = EV{VHf(V,H)}, where v H H denotes the gradient operator applied with respect to H. The following theorem concerning the R4M process for finding A* has been proved by Gladyshev (GI) and was restated for the multi- dimensional case by Yau (Y1). Theorem 4.3.1 Let VH f(V,H) be a random variable with EV{VHf(V,H)}, and let H VHR(H) be the (unique) solution of * VHR(H) - 0. Choose H1 arbitrarily and define (GSA-l) Hn+l = Hn - anHf(Vn,Hn), n = 1,2,... Then P[lim Hn ' H*] = 1 n—m \2=o lim Euun - H*\ n-coo provided the following conditions are satisfied. (1) inf 1 <(H - H*), VHR(H)> > O for each 6 > 0 where €HH‘H*“<;' 53 “H“ A ()% the norm of a matrix H; A trace {HEHZ} the inner product of matrices H1 and H2; (2) There exists a positive number 6 such that for all H EHVHW’MH s mum ; (3) va(Vn,Hn) is a random variable whose conditional distribution, given H1,H2,...,Hn, is the same as the distribution of VHf(Vn,Hn). (4) {on} is a sequence of positive numbers satisfying the conditions “ 2 pn=0 and anA \\ 1 . 1: Since the random variable (X,Z) has the same significance T as the random variable V and the matrix A is equivalent to H, the stochastic algorithm (GSA-1) can be rewritten as follows. _ T T _ (GSA-2) A — An anAf(Xn,Zn,An), n — 1,2,3,... 54 We now consider the following criterion function. T J(A) = E[H2(X) - qF(x>ATH2] = E[f(x,z,AT)] where f(X,Z,AT) = Hiax) - ¢F(X)ATH2 . Taking the gradient, vAf(x,z,AT) = 2¢£¢¢AT - iQX)] . By invoking (GSA-2), we obtain the following stochastic approxima- tion algorithm, which has an updating property. (SP-l) A:+1 = A: + pntp(Xn)[ZT(X.n) - (pT(Xn)A:] . The sequence A:, n = 1,2,..., converges with probability one to the value A: which minimize J(A) and, hence, 3(A). To show the convergence of the algorithm SP-l, we need to prove the following lemmas. M d+l Lemma 4.3.2 If P{iEIqipi(X) > 0} = 1 and {¢i(.)}i=l are linearly independent and continuous functions, then A: exists. T Proof. J(A) = EEHZ(X) ‘ ¢F(X)ATH2] A EAL)- : o 2 E£2gp(X)cpT(X)A: - 2Qp(X)ZT(X)] = o . T T - . T Thus A* = {E[¢(X)m (X)]} 1E[m(X)2kX)] and the matrix E[qKX)m (X)] has been shown (P3) to be nonsingular; hence A: exists. The following lemma has been proved (Y1). Lemma 4.3.3 Let T be a symmetric operator on Ed+1 with eigen- values *1 3 k2 S...s xd+l° Then xlnATnz s «Mg xdfluATnz 55 for all AT 6 L(Ed+l,EM) where L(Ed+1,EM) is the set of all linear transformations taking Ed+1 into EM and forms a vector space. M d+l Lemma 4.3.4 If P{ 2 qipi(X) > 0} = l and {¢i(X)}i_1 are linearly i=1 ‘ independent and continuous functions on the pattern space 0X, then there exists a constant k such that T - AT VAJ(A)> . kHAT - EH stcpT(x>]AT - E[cpzT1 . 0! ID By invoking Lemma 4.3.3, we have T T 2 T T T T xluA - A*H s where k1 is the smallest eigenvalue of Q, X1 > 0. Therefore, 2 kuAT - I“ s . Q.E.D. Lemma 4.3.5 LEt v = (x,2), vAf(V,AT) = 2(p(X)[q)T(X)AT - ZT(X)]. 
If the matrix M ≜ 4E[‖φ(X)‖² φ(X)φ^T(X)] exists, then there exists a constant d > 0 such that, for all A^T,

E‖∇_A f(V,A^T)‖² ≤ d(1 + ‖A^T‖²).

Proof.

‖∇_A f(V,A^T)‖² = Trace{[2φ(X)φ^T(X)A^T - 2φ(X)z^T(X)]^T [2φ(X)φ^T(X)A^T - 2φ(X)z^T(X)]},

so that, taking expectations,

E‖∇_A f(V,A^T)‖² = <A^T, M A^T> - <δ, A^T> + γ,

where δ = 8E[‖φ(X)‖² φ(X)z^T(X)] and γ = 4E[‖φ(X)‖² ‖z(X)‖²]. Hence

E‖∇_A f(V,A^T)‖² ≤ λ_M ‖A^T‖² + ‖δ‖·‖A^T‖ + γ,

where λ_M is the maximum eigenvalue of the matrix M and is positive. Choosing d = λ_M + ‖δ‖ + γ, we have E‖∇_A f(V,A^T)‖² ≤ d(1 + ‖A^T‖²).  Q.E.D.

Theorem 4.3.2. Let A_*^T be the unique matrix minimizing J(A), and let the sequence of matrices A_n^T, n = 1,2,..., be generated by the proposed algorithm (SP-1). If the following conditions are satisfied:

1) P{Σ_{i=1}^M q_i p_i(X) > 0} = 1;
2) {φ_i(·)}_{i=1}^{d+1} are linearly independent and continuous functions of X;
3) the matrix M = 4E[‖φ(X)‖² φ(X)φ^T(X)] exists;
4) {[X_n, z(X_n)], n = 1,2,3,...} is an arbitrary training sequence of independent observations;
5) {ρ_n} is a sequence of positive numbers satisfying Condition 4 of Theorem 4.3.1;

then P[lim_{n→∞} A_n^T = A_*^T] = 1 and lim_{n→∞} E‖A_n^T - A_*^T‖² = 0.

Proof. Invoke Theorem 4.3.1, utilizing Lemmas 4.3.2 through 4.3.5. Condition 1 of Theorem 4.3.1 is satisfied because of Lemma 4.3.4. Condition 2 of Theorem 4.3.1 is satisfied because of Lemma 4.3.5. Condition 3 of Theorem 4.3.1 is satisfied because of Condition 4 of Theorem 4.3.2. Condition 4 of Theorem 4.3.1 is identical to Condition 5 of Theorem 4.3.2.  Q.E.D.

4.4 Sensitivity Study and Error Upper Bound

In a practical situation, some of the patterns used for training might be mislabeled. The causes of mislabeling are plentiful and include typing errors and measurement errors. In this section, we assume that it is possible to observe a sequence of sample patterns {(X_i, ŷ_i), i = 1,2,...,N}, where ŷ_1, ŷ_2, ..., ŷ_N is a sequence of labeling (or classification) numbers containing some mislabels. Furthermore, we write the probability of mislabeling the origin of pattern X_i as f_i.

Let y_1, y_2, ..., y_N be the correctly labeled sequence corresponding to ŷ_1, ŷ_2, ..., ŷ_N. The probability that y_i = k is P(ω_k) for any i. Then

P[ŷ_i = y_i, y_i = k] = P(ω_k)(1 - f_i)  and  P[ŷ_i ≠ y_i, y_i = k] = P(ω_k) f_i.

We now study the sensitivity problem of the proposed algorithm. The following is a generalization of the results obtained in the previous sections. Define

ẑ(X) = (ẑ_1(X), ẑ_2(X), ..., ẑ_M(X))^T,  where  ẑ_j(X) = 1 if ŷ = j and ẑ_j(X) = 0 if ŷ ≠ j.

In general,

E_{ẑ|X}[ẑ_j(X)] = 1·P[ẑ_j(X) = 1|X] + 0·P[ẑ_j(X) = 0|X] = P[ŷ = y, y = j|X] = P(ω_j|X)(1 - f).

Let Ẑ_N be the N × M matrix whose i-th row is ẑ^T(X_i), and let P̂_N be the N × M matrix whose (i,j) entry is P(ω_j|X_i)(1 - f_i). As in Sec. 4.1, we can write Ẑ_N = P̂_N + Û_N, where Û_N is the "measurement noise".

Consider the criterion function

Ĵ_N(A) = (1/N) ‖Ẑ_N - Φ_N A^T‖² = (1/N) tr{[Ẑ_N - Φ_N A^T]^T [Ẑ_N - Φ_N A^T]}.

The matrix A^T = Â_N^T which minimizes Ĵ_N(A) is

Â_N^T = [Φ_N^T Φ_N]^{-1} Φ_N^T Ẑ_N.

As in the iterative algorithms (SA-1) and (SA-2), Â_N^T can be computed recursively as

(SA-3)  A^T(n+1) = A^T(n) - ρ(n) Φ_N^T [Φ_N A^T(n) - Ẑ_N],
(SA-4)  A^T(n+1) = A^T(n) - ρ(n) [Φ_N^T Φ_N]^{-1} Φ_N^T [Φ_N A^T(n) - Ẑ_N].

Theorem 4.4.1. lim_{N→∞} Ĵ_N(A) = Σ_{i=1}^M q_i E_i[‖ẑ^T(X) - φ^T(X)A^T‖²] ≜ Ĵ(A).

Proof.

lim_{N→∞} (1/N) ‖Ẑ_N - Φ_N A^T‖² = lim_{N→∞} Σ_{k=1}^M (1/N) Σ_{j=1}^N (ẑ_k(X_j) - φ^T(X_j)a_k)²
 = Σ_{i=1}^M Σ_{k=1}^M q_i E_i[(ẑ_k(X) - φ^T(X)a_k)²] = Σ_{i=1}^M q_i E_i[‖ẑ^T(X) - φ^T(X)A^T‖²].  Q.E.D.

Taking f_i = f for all i, the matrix Â_*^T which minimizes Ĵ(A) is

Â_*^T = {Σ_{i=1}^M q_i E_i[φ(X)φ^T(X)]}^{-1} {Σ_{i=1}^M q_i E_i[φ(X)P(ω_1|X)(1-f)], Σ_{i=1}^M q_i E_i[φ(X)P(ω_2|X)(1-f)], ..., Σ_{i=1}^M q_i E_i[φ(X)P(ω_M|X)(1-f)]} = (1 - f) A_*^T.
If f = 0, then Â_*^T = A_*^T, or, equivalently, {ŷ_i}_{i=1}^N is identical to {y_i}_{i=1}^N. For f = 1 we have Â_*^T = 0 (the zero matrix), which indicates that the unknown parameters A^T cannot be estimated because of the lack of available information (i.e., no correctly labeled training samples).

Corresponding to (SP-1), we have the following stochastic approximation algorithm:

(SA-5)  A^T_{n+1} = A^T_n + ρ_n φ(X_n)[ẑ^T(X_n) - φ^T(X_n)A^T_n].

The sequence A^T_n, n = 1,2,..., converges with probability one to the matrix Â_*^T.

We shall now define the mean-square approximation error and compute its upper bound. The results are also applicable to the algorithms proposed in Secs. 4.2 and 4.3. Let

e_k ≜ E_X[P(ω_k|X) - φ^T(X)â_k]²
 = E_X[P²(ω_k|X)] - 2â_k^T E_X[φ(X)P(ω_k|X)] + â_k^T E_X[φ(X)φ^T(X)]â_k
 = E_X[P²(ω_k|X)] - 2â_k^T Σ_{i=1}^M q_i E_i[φ(X)P(ω_k|X)] + â_k^T Σ_{i=1}^M q_i E_i[φ(X)φ^T(X)]â_k.

Since â_k satisfies Σ_i q_i E_i[φ(X)φ^T(X)]â_k = Σ_i q_i E_i[φ(X)P(ω_k|X)](1 - f), this becomes

e_k = E_X[P²(ω_k|X)] - â_k^T E_X[φ(X)φ^T(X)]â_k (1 + f)/(1 - f)
    ≤ P_max - â_k^T E_X[φ(X)φ^T(X)]â_k (1 + f)/(1 - f) ≜ e_kb,

where P_max ≜ max_{1≤k≤M} E_X[P²(ω_k|X)] and e_kb is the upper bound for the mean-square error e_k, k = 1,2,...,M.

Let e_kbN be an estimate of the error upper bound after N training samples. The relation between e_kbN and e_kb is stated in the following theorem.

Theorem 4.4.2. lim_{N→∞} e_kbN = e_kb with probability 1.

Proof. Let

e_kN ≜ E_X[P(ω_k|X) - φ^T(X)â_kN]² = E_X[P²(ω_k|X)] - 2â_kN^T E_X[φ(X)P(ω_k|X)] + â_kN^T E_X[φ(X)φ^T(X)]â_kN

and

e_kbN ≜ P_max - 2â_kN^T E_X[φ(X)P(ω_k|X)] + â_kN^T E_X[φ(X)φ^T(X)]â_kN.

Since

â_kN = [Σ_{i=1}^N φ(X_i)φ^T(X_i)]^{-1} Σ_{i=1}^N φ(X_i)ẑ_k(X_i)

and

lim_{N→∞} â_kN = {E_X[φ(X)φ^T(X)]}^{-1} E_X[φ(X)P(ω_k|X)(1 - f)] = â_k,

it follows that lim_{N→∞} e_kbN = e_kb with probability one.  Q.E.D.

Pitt and Womack (P5) have suggested that the value of e_kbN can be used to test different sets of φ(X) functions and to compare their relative performance based on the available training samples.

4.5 An Example

A simulation of the proposed algorithm (SP-1) was performed on an IBM 1130 computer. The problem was to classify patterns drawn from three categories with equal a priori probabilities and bivariate Gaussian distributions. Each class has the same covariance matrix but a different mean vector for pattern classes ω_1, ω_2 and ω_3. Since the optimal Bayes classifiers for this type of pattern classes are known to be linear, we chose

φ^T(X) = (x_1, x_2, 1).

Figure 5 shows the classification performance of algorithm (SP-1) after each training group. At the completion of each training group, the system was tested by classifying 300 unknown patterns (100 patterns from each class). The misclassification rate and the corresponding Bayes misclassification rate are shown for comparison. The results indicate that the system error rate approaches the Bayes error rate, which is about 0.2, as the number of training samples increases.

Figure 4. A mathematical model for pattern recognition (controller, pattern generators for classes 1 through M, receptors, channel, and the pattern vector X).

Figure 5. Error rate versus number of iterations (error rates of the stochastic classifier and the Bayes classifier on 300 test patterns, for N = 300 to 1500 training patterns, N/3 from each class).
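A minimal re-creation of the kind of simulation described in Sec. 4.5 is sketched below. The class means, common covariance, sample sizes and step-size schedule are illustrative values chosen here, not the thesis's parameters; the update inside the loop, however, is exactly algorithm (SP-1) with φ^T(X) = (x_1, x_2, 1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-class problem: equal a priori probabilities, common covariance,
# different means, so the Bayes-optimal decision boundaries are linear.
means = [np.array([0.0, 2.0]), np.array([2.0, -1.0]), np.array([-2.0, -1.0])]
cov = np.array([[1.0, 0.3], [0.3, 1.0]])

def phi(x):                                   # basis functions phi(X) = (x1, x2, 1)
    return np.array([x[0], x[1], 1.0])

def sample(n_per_class):
    X = np.vstack([rng.multivariate_normal(m, cov, size=n_per_class) for m in means])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y

AT = np.zeros((3, 3))                         # the (d+1) x M coefficient matrix A^T
Xtr, ytr = sample(500)
order = rng.permutation(len(ytr))             # present the patterns in random order
for n, i in enumerate(order, start=1):
    rho = 1.0 / n                             # rho_n -> 0, sum = inf, sum of squares < inf
    p = phi(Xtr[i])
    z = np.zeros(3)
    z[ytr[i]] = 1.0                           # the indicator vector z(X_n)
    AT += rho * np.outer(p, z - p @ AT)       # the (SP-1) update

Xte, yte = sample(100)
pred = np.array([np.argmax(phi(x) @ AT) for x in Xte])
print("test error rate:", np.mean(pred != yte))
```

The estimated error rate on the 300 test patterns settles near the Bayes rate of the chosen configuration as the number of training samples grows, mirroring the behaviour reported for Figure 5.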
CHAPTER V

CLUSTER-SEEKING AND PATTERN RECOGNITION

Chapters II, III and IV assumed that the data had structure, so the pattern recognition problem could be solved by selecting an algorithm derived according to selected criteria. In reality, information about data structure is rarely known beforehand. Some techniques are thus required to sort the patterns into subsets such that the patterns in each subset are as much "alike" as possible, and the patterns of different subsets are as much "unalike" as possible. Having obtained the data structure, a pattern classifier can be designed to achieve a minimum number of misclassifications. Some typical cluster-seeking techniques will be discussed in this chapter and a new cluster-seeking procedure will be proposed.

5.1 Some Cluster-Seeking Techniques

Nearly all cluster-seeking techniques were developed after 1960, since they generally require a great deal of computerized high-speed computation. G.H. Ball (B1) classified cluster-seeking techniques into the following six categories: probabilistic, signal detection, clumping, eigenvalue, minimal mode-seeking, and miscellaneous. Cluster-seeking techniques can also be categorized into the following three groups.

(a) Techniques for reproducing "natural" structure. The pattern recognizer selects a grouping criterion; based on this criterion, the data itself should suggest "natural" clusters. ISODATA by Ball and Hall (B2), and the work of Friedman and Rubin (F1), are typical of techniques for reproducing "natural" structure.

(b) Techniques for data compression. Patterns are mapped from a higher-dimensional space to a lower-dimensional space in such a way that the inherent data structure is preserved. These techniques include eigenvalue-type techniques (S1), a nonlinear mapping proposed by Sammon (S2), and discriminant analysis (W4).

(c) Techniques for minimizing the misclassification rate. The probability of error is used as a criterion for clustering. Incorrect classifications imply that new clusters are needed. The algorithm proposed in Sec. 5.2 belongs to this category.

5.1.1 Clustering Techniques for Reproducing "Natural" Structure

Friedman and Rubin (F1) suggested that a real-valued function be defined and evaluated for all possible partitions of the given patterns. The partition which has the maximal function value is selected to represent the data structure. Suppose a data matrix Y is given in which each row represents a training pattern. A decomposition of the N training patterns into g groups can be achieved by a row partition of Y into g submatrices. Without loss of generality, let the sample mean of the N patterns be zero. The total scatter matrix (W4), in the notation of Sec. 3.1, is T = Σ_{i=1}^N X_i X_i^T. For each partition of the N patterns into g groups, the following matrix identity is satisfied:

T = W + B,  where  W = Σ_{k=1}^g W_k = Σ_{k=1}^g Σ_{ℓ=1}^{n_k} [X_ℓ(k) - C_k][X_ℓ(k) - C_k]^T.

W is called the pooled within-group scatter matrix; n_k is the number of patterns in group k, with Σ_{k=1}^g n_k = N; C_k (k = 1,2,...,g) denotes the mean pattern vector of group k; and X_1(k), ..., X_{n_k}(k) are the patterns from group k. The matrix

B = Σ_{k=1}^g n_k C_k C_k^T

is called the between-group scatter matrix.

Since the total scatter matrix T is fixed, a natural condition for grouping data is to minimize |W|, the determinant of W, or, equivalently, to maximize the ratio |T|/|W|. One criterion, therefore, is to maximize

M = |T|/|W| = |I + W^{-1}B| = Π_{i=1}^p (1 + λ_i),

where the λ_i are the eigenvalues of W^{-1}B.
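The scatter-matrix bookkeeping just defined is summarized in the short sketch below; the function name and the toy data are assumptions introduced for illustration only. It computes W and B, verifies T = W + B, and evaluates the grouping criterion |I + W^{-1}B| = |T|/|W|. The invariance property discussed next can be checked numerically by applying any nonsingular linear transformation to the rows of X and recomputing the criterion.

```python
import numpy as np

def scatter_criterion(X, labels):
    """Return W (pooled within-group scatter), B (between-group scatter) and the
    criterion |I + W^{-1} B| = |T|/|W| for one partition of the rows of X."""
    X = X - X.mean(axis=0)                     # put the sample mean at the origin
    d = X.shape[1]
    W = np.zeros((d, d))
    B = np.zeros((d, d))
    for k in np.unique(labels):
        Xk = X[labels == k]
        Ck = Xk.mean(axis=0)                   # mean pattern vector C_k of group k
        W += (Xk - Ck).T @ (Xk - Ck)           # within-group scatter of group k
        B += len(Xk) * np.outer(Ck, Ck)        # between-group contribution n_k C_k C_k^T
    T = X.T @ X                                # total scatter; T = W + B
    assert np.allclose(T, W + B)
    return W, B, np.linalg.det(np.eye(d) + np.linalg.solve(W, B))

# Two well-separated groups of patterns give a large criterion value.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(5.0, 1.0, (20, 2))])
W, B, crit = scatter_criterion(X, np.repeat([0, 1], 20))
print(crit)
```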
The ratio |T|/|W| is invariant under nonsingular linear transformations of the data and is to be maximized over all possible groupings. The number of possible partitions of N patterns into g groups is enormous; as noted by Friedman and Rubin, the computational problem may be solved in principle but not in practice.

Ball and Hall (B2) proposed a cluster-seeking technique called ISODATA, an adaptive, semi-heuristic algorithm. The user supplies training patterns and program parameters such as the minimum allowable size of each cluster and the number of clusters desired. Several patterns are initially selected as cluster centers and all patterns are clustered around them by assigning each pattern to the nearest center. New cluster centers are then computed by averaging all patterns in each cluster. Several cycles of assigning patterns to the nearest cluster center and computing new cluster centers are performed. Heuristic subroutines are provided to divide a cluster in two when the number of clusters is small, or to combine two clusters to reduce the number of clusters.

The main drawbacks of the Ball-Hall technique are that the resulting cluster configuration is highly dependent upon the program parameters supplied by the user, and that there is a lack of good criteria for determining the adequacy of the clustering. The type of distance measure also implies spherical or ellipsoidal clusters only.

5.1.2 Clustering Techniques for Data Compression

One clustering algorithm, proposed by Sammon (S2), is basically a nonlinear mapping of the original feature space to a low-dimensional space (usually two or three dimensions) such that visual identification of clusters is possible. Structure preservation is achieved by moving the N points in the low-dimensional space in such a way that the interpoint distances approximate the corresponding interpoint distances in the original high-dimensional space.

Let d*_ij be the distance between patterns X_i and X_j, where X_i = (x_{i1}, x_{i2}, ..., x_{id}) denotes a pattern in the original space. Similarly, let d_ij be the distance between patterns Y_i and Y_j, where Y_i = (y_{i1}, y_{i2}, ..., y_{iD}) is the pattern in the D-dimensional space corresponding to X_i, D < d. An error function for determining the degree of structure preservation is defined as

E = (1 / Σ_{i<j} d*_ij) Σ_{i<j} [d*_ij - d_ij]² / d*_ij,

and the points Y_i are adjusted by a steepest-descent procedure so as to minimize E.

5.2 A Procedure for Combining Cluster-Seeking and Multiclass Pattern Classification

A piecewise linear classifier (PWLC) assigns to each pattern class ω_i a bank of subsidiary linear discriminant functions f_{im}(X), m = 1,2,...,L_i. A pattern X is placed in class ω_i whenever its largest subsidiary function belongs to class ω_i, that is, whenever

f_{im}(X) > f_{jn}(X)  for all (j,n) ≠ (i,m).

Therefore, a PWLC is actually a multiclass linear classifier and can be trained by the same methods used to train linear classifiers. In order to obtain a set of weight vectors for a PWLC, information is needed about the structure of the subclasses. Since such information is rarely known beforehand, structure analysis must be applied; that is, a similarity measure must be selected and the patterns from each pattern class must be partitioned into subclasses.

An algorithm will now be proposed for determining a PWLC without prior knowledge of the pattern class distributions. The algorithm combines M-class, linear-classifier training procedures and cluster-seeking techniques under the control of a minimum-error-probability performance criterion. The flow chart of the proposed algorithm is given in Figure 6; the main steps of the algorithm are as follows.

(A) Phase I

Step 1. The iteration number is n. Determine a set of S_n (S_0 = M) discriminant functions, one for each of the S_n nonempty subsets, or clusters, of patterns.

Step 2. Classify each training pattern and compute the misclassification percentage, Pr(n).
Step 3. If Pr(n) ≤ N_e, where N_e is the maximum acceptable misclassification percentage for the particular problem, or if the total number of discriminant functions exceeds the allowable number of clusters, then stop Phase I of the algorithm and begin Phase II. If Pr(n) > N_e and the total number of discriminant functions is small enough, then continue to Step 4.

Step 4. Divide each of the S_n clusters into two clusters, one containing all correctly classified patterns and the other containing all misclassified patterns. Learn a new set of discriminant functions, one for each cluster. Check whether the current set of discriminant functions is identical to the previous set of discriminant functions; if so, split those clusters which were misclassified. Let n = n+1 and return to Step 1.

(B) Phase II

Step 5. Find the pair of clusters, among the clusters belonging to each pattern class, which are closest together, using an appropriate distance measure. Lump them and compute a new set of discriminant functions. Compute Pr(n) and repeat the procedure until the number of clusters equals the number of categories plus one.

Step 6. Select the best partition. If the partition is acceptable, stop; otherwise, continue to Step 2 of Phase I.

Step 7. Terminate the algorithm when the current best partition is inferior to the previous best partition (Step 6).

Figure 6. Flow chart of the proposed clustering algorithm.

The convergence of the algorithm is intuitively clear, since in Phase I there is a finite number of training patterns; error-free classification can always be achieved if each pattern is considered to be a cluster. In reality, it can reasonably be assumed that the training patterns from each class are somewhat clustered together; therefore, the number of clusters should be much smaller than the total number of training patterns. Only a finite number of clusters is generated in Phase I. The lumping procedure terminates when the number of clusters decreases to M + 1. The interaction between Phase I and Phase II provides a search for optimal results.

If a minimum distance classifier (MDC) is employed for classifying training patterns and N_e = 0, the proposed algorithm is actually another version of the condensed nearest neighbor rule (H4). The resulting set of cluster points correctly classifies all training patterns. Cover (C2) has shown that the error probability of classifying new patterns is bounded above by twice the Bayes probability of error. If other M-class algorithms are selected for classifying patterns, the proposed algorithm is basically a training procedure for a PWLC, so that a bank of subsidiary functions for each class is available at the completion of the algorithm.
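A compressed sketch of Phase I's split step, driven by a minimum distance classifier, is given below. The function names, the fixed iteration cap and the simplified stopping details are assumptions introduced here; this is not the thesis's FORTRAN program, only an illustration of Steps 1 through 4.

```python
import numpy as np

def mdc_predict(X, centers, owner):
    """Minimum distance classifier: each pattern takes the class (owner) of its
    nearest cluster center."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return owner[np.argmin(d, axis=1)]

def phase1_split(X, y, max_clusters=30, max_err=0.0, max_iter=50):
    """Phase I sketch: start with one cluster per class, then repeatedly split each
    cluster into its correctly classified and misclassified parts until Pr(n) is
    acceptable or the allowable number of clusters is reached."""
    classes = np.unique(y)
    members = [np.where(y == c)[0] for c in classes]   # index sets, one cluster per class
    owner = list(classes)                              # pattern class of each cluster
    err = 1.0
    for _ in range(max_iter):
        centers = np.array([X[m].mean(axis=0) for m in members])
        pred = mdc_predict(X, centers, np.array(owner))
        err = float(np.mean(pred != y))                # the misclassification rate Pr(n)
        if err <= max_err or len(members) >= max_clusters:
            break
        new_members, new_owner = [], []
        for m, c in zip(members, owner):               # Step 4: split each cluster
            for part in (m[pred[m] == y[m]], m[pred[m] != y[m]]):
                if len(part):
                    new_members.append(part)
                    new_owner.append(c)
        members, owner = new_members, new_owner
    return members, owner, err
```

With max_err = 0 and a minimum distance classifier this reduces to the condensed-nearest-neighbor behaviour noted above. Phase II, not shown, would repeatedly lump the closest pair of clusters belonging to the same class, recomputing Pr(n) and retaining the best partition encountered.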
5.3 A Procedure for Unsupervised Structure Analysis

In supervised learning, a set of training patterns is provided in which each pattern has a label indicating the pattern class to which it belongs. Since no such labels exist in the unsupervised learning problem, the training patterns must be studied for natural groupings. Each such grouping, or cluster, is viewed as defining a pattern class and is assigned a discriminant function. A partition of the N training patterns into M groups, or clusters, is desired for which the misclassification probability and the number of clusters, or discriminant functions, are both minimized. Such a partition converts unclassified patterns into classified training patterns, since all patterns in each cluster are tentatively given the same pattern class label. The algorithms previously defined will then produce discriminant functions for any partition.

The computational problem of searching all possible partitions of N patterns into M groups to find the "best" partition has been solved in principle but not in practice. A suboptimal procedure for searching through the partitions follows. Any prior knowledge about the data structure is inserted into the initial partition. If prior knowledge is not available, the center of gravity of all training patterns can be computed; this center and the (M-1) patterns farthest away from it can be selected as initial cluster centers. The initial partition is then obtained by assigning each pattern to the closest cluster center.

Rubin (R1) proposed an algorithm for finding a local extremum of a criterion function that searches only some of the possible groupings. Starting with the best partition achieved, a single pattern is moved into every group other than its own. If no move increases the criterion, the group label of the pattern is not changed; otherwise, the pattern is transferred to the group which maximizes the criterion function. This operation is repeated with each pattern in the group. It is then repeated for all patterns in the group with the closest cluster center. After several passes, a point is reached at which changing the label on any single pattern degrades the criterion. This grouping provides a local maximum.

Once the patterns have been tentatively converted into training patterns by labeling each with a group identifier, the algorithm of Sec. 5.2 will produce the error rate (based on the assumed partition) as well as the discriminant functions.

5.4 Computer Simulations

This section contains results obtained by using the CDC 3600 computer system at Michigan State University to process both artificial data and practical data. A FORTRAN listing is provided in (H5).

The artificial pattern recognition problem is shown in Figure 7. The minimum distance classifier (MDC) is selected for classifying training patterns. The data consist of twelve patterns, four for each of three pattern classes (Fig. 7a); the patterns are two-dimensional to produce meaningful plots. The initial number of misclassifications is eight (Fig. 7b); that is, it is assumed that each pattern class forms one cluster. The number of misclassified patterns at the fourth iteration, before splitting, is two (Fig. 7c). At the end of Phase I there are eight clusters and the number of misclassifications is zero (Fig. 7d). After the completion of Phase II, the best partition of the data is into six clusters and the corresponding number of misclassified patterns is zero (Fig. 7e).

Figure 7. Example of the algorithm of Figure 6: (a) feature space showing four patterns from each of three classes; (b) initial decision regions, eight patterns misclassified; (c) decision regions before splitting, two misclassifications; (d) decision regions before lumping, no misclassifications; (e) best partition, no misclassifications.
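For reference, the initial-partition rule of Sec. 5.3 (the center of gravity plus the M-1 patterns farthest from it) can be sketched in a few lines; the function name below is hypothetical and the sketch is only an illustration of that rule.

```python
import numpy as np

def initial_partition(X, M):
    """Select the center of gravity and the (M-1) patterns farthest from it as
    initial cluster centers, then assign every pattern to its closest center."""
    center = X.mean(axis=0)                            # center of gravity
    far = np.argsort(np.linalg.norm(X - center, axis=1))[len(X) - (M - 1):]
    centers = np.vstack([center, X[far]])
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(d, axis=1), centers               # tentative group labels, centers
```

The tentative labels produced this way can then be refined by Rubin's single-pattern moves and handed to the algorithm of Sec. 5.2.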
The practical data is the Iris data, which was published by Fisher (F2) and reproduced by Kendall (K4). There are three species of Iris: setosa, versicolor and virginica. There are fifty flowers of each species, and four measurements (sepal length, sepal width, petal length and petal width) are taken on each flower. We labeled them 1-50, 51-100, and 101-150 respectively. Fisher (F2) constructed a linear function of the four variables to classify them. He found that setosas could be separated from the other two species perfectly. However, versicolors and virginicas overlapped. Kendall (K4) applied a clustering technique based on a pairwise distance function utilizing only the rank order of the measurements, and he was able to separate setosas from the others; however, for versicolors and virginicas, he classified 87 flowers and left 13 flowers doubtful. Friedman and Rubin (F1) applied the min trace W criterion to these data. They found that, when the data were partitioned into three groups, setosas appeared as a separate group, with the other two groups being predominantly versicolors and virginicas except for about ten flowers which were misclassified.

A computer simulation of the system proposed here, using minimum distance classifiers, has been completed. The initial number of misclassifications was eleven (assuming that each pattern class formed one cluster). With 30 allowable clusters and N_e = 0, the number of misclassified flowers at the end of Phase I was twenty-one, and there were thirty-three clusters. Setosas appeared as one cluster. Versicolors contained seventeen clusters with fourteen flowers misclassified. Virginicas contained fifteen clusters with seven flowers misclassified. After the completion of Phase II (the lumping operation), the best partition of the data was into six clusters and the corresponding number of misclassified flowers was six. The six clusters were: one for setosas, two for versicolors and three for virginicas.

CHAPTER VI

CONCLUSIONS

This chapter discusses the general results of this thesis and possible extensions of this work.

6.1 Thesis Results

There are two sets of results in this thesis.

(1) The development of a class of nonparametric, multiclass pattern recognition algorithms.

The idea of translating a problem into an optimization problem, in which search techniques are employed to find the minimum of a linear functional, is used in many branches of applied mathematics. Applied to the abstraction problem, this idea provides a great deal of freedom when creating new algorithms. In Chapters II and III, although no assumptions were made about the data structure and pattern classes, the concept of linear inequalities and the mean-square-error criterion were employed to formulate linear functionals. The general multiclass learning algorithms (GA.2) and (GA.3) derived in Chapter II led to particular algorithms such as the fixed-increment method (PA.1), the relaxation method (PA.2) and the minimum-square-error method (PA.3). These algorithms are generalized versions of algorithms in the literature.
Chapter III showed that the mean-square-error criterion is equivalent to the generalized inverse approach. A multiclass algorithm, (GA.4), utilizing the generalized inverse approach is proposed and is unique to this thesis. The convergence proof is given in Appendix A, and the utility of the algorithm was demonstrated by a digital computer simulation.

Since no assumptions were made about data structure in Chapters II and III, the pattern recognition problem was solved only for the N given training patterns. In order to investigate the generalization problem, we must view the N fixed training patterns as a sample from a population, as was done in Chapter IV. We assumed the existence of probability densities of unknown functional form, so the optimal discriminant functions involved conditional probabilities. Based on the available information, the method of stochastic approximation was employed to estimate the unknown discriminant functions, and two stochastic algorithms were derived from the generalized inverse approach. Both schemes used information from all training patterns to update the discriminant functions at each iteration. The asymptotic properties of both algorithms are studied for the first time in Theorems 4.2.1 and 4.2.2.

In Sec. 4.3, a new stochastic algorithm with an updating property was proposed. Since only the information from the particular pattern presented is utilized at each iteration, the problem of storing large numbers of patterns is eliminated. The convergence of the proposed algorithm was proved by invoking Gladyshev's theorem (G1). In fact, the algorithms of Sec. 4.2 are equivalent to the algorithm of Sec. 4.3 as the number of training samples approaches infinity.

In practical situations, some of the training patterns might be mislabeled. This sensitivity problem is studied for the first time in Sec. 4.4. The results show that the estimated coefficients of the discriminant functions under the mislabeled condition are equal to the probability of correct labeling times the estimated coefficients under the ideal situation. The mean-square approximation error was defined and its upper bound was derived. An estimate of the error upper bound after N training samples can be used as a guide for selecting the φ(X) functions.

(2) The introduction of a procedure that combines cluster-seeking and multiclass pattern recognition into a workable pattern recognition system.

In some cases, the pattern class has a multimodal structure; both the number and the locations of the modes are unknown. In order to minimize the misclassification error, a cluster-seeking technique should be applied to learn the data structure and one discriminant function should be assigned to each cluster (or subclass). A new algorithm for solving the cluster-seeking and multiclass pattern classification problems in a step-wise fashion was proposed in Chapter V. The procedure proposed in Sec. 5.2 exploits structural information to construct discriminant functions. The success of the discriminant functions in classifying training patterns then provides clues about the data structure. The procedure was implemented on the CDC computers at Michigan State University and tested with the problems reported in Sec. 5.4.

The focal point of the thesis is the presentation of a computationally feasible solution to a very difficult, but very
The difficulty is caused by the lack of prior knowledge about the data structure, or one's un- willingness to make assumptions about pattern class distributions or linear separability among pattern classes. This problem is extremely acute in situations involving large numbers of pattern and features. The abstraction and clustering problems are attacked simultaneously in Chapter 5 of this thesis. The algorithms proposed derive a piece-wise linear pattern classifier in an iterative manner. Information about structure is incorporated into the classifier. In turn, the operation of the classifier provides structural in- formation. 6.2 Possible Extensions To simulate the proposed pattern recognition system on a digital computer, the minimum distance classifier was selected for classifying training patterns. In fact, all multiclass algorithms listed in this thesis could be used for classifying patterns. This fact generates a class of recognition procedures and each procedure needs a full-scale test and optimization of program parameters. For a given pattern recognition problem, a complete explora- tion of all possible procedures and selection of the "best" is possible in principle but not in practice. However, if one is willing to make assumptions about the pattern classes, he might derive some theoretical results or heuristic rules which optimize the searching among possible procedures. Another possible extension is to develop an algorithm for unsupervised structure analysis. Since the true classifications of 83 the patterns are unknown in an unsupervised learning problem, the unclassified patterns must somehow be converted into classified training patterns. Again, if one is willing to impose some probabilistic structure (such as the mixture probability densities), he has converted the problem of clustering into the problem of unsupervised estimation (P6, P7). REFERENCES REFERENCES G.H. Ball, "Data analysis in the social science; what about the details?", Fall Joint Computer Conference (1965). G.H. Ball and D.J. Hall, "ISODATA, a novel method of data analysis and pattern recognition," Technical Report, Stanford Research Institute, Menlo Park, California (1965). C.C. Blaydon, "Recursive algorithms for pattern classifica- tion," Technical Report No. 520, Division of Engineering and Applied Physics, Harvard University, Cambridge, Massachusetts, March, 1967. Colin Blaydon and Yu-chi Ho, "On the abstraction problem in pattern classification," Proc. 1966 NEC, Vol. 22, 857 -8 62 e W.G. Chaplin and V.S. Levadi, "A generalization of the linear threshold decision algorithm to multiple classes," Proc. Second Symposium on Computer and Information Sciences at Batelle Memorial Institute, August 22-24 (1966). T.M. Cover and P.E. Hart, "Nearest neighbor pattern classification," IEEE Trans. on Information Theory, Vol. IT-l3, No. 1, January 1967. I.P. Devyaterikov, A.I. Propoi, and Y.Z. Tsypkin, "Iterative learning algorithms for pattern recognition," Automation and Remote Control (English Translation), 108-117 (1967). R.C. Dubes, The Theory of Applied Probability, (Prentice- Hall, Inc., New'York, 1968). H.P. Friedman and J. Rubin, "On some invariant criteria for grouping data," J. Auer. Statist. Assoc. 62, 320, 1159-1178 (1967). R.A. Fisher, "Multiple measurements in taxonomic problems," Annals of Eugenics, Vol. VII, part II, 179-188, 1936. E.E. Gladyshev, "On stochastic approximation," Theory of Probability and Its Applications, Vol. 10, 275-278, 1965. 84 85 Y.C. Ho and A.K. 
(H1) Y.C. Ho and A.K. Agrawala, "On pattern classification algorithms: introduction and survey," Proc. IEEE 56, 12, 2101-2114 (1968).

(H2) Y.C. Ho and R.L. Kashyap, "An algorithm for linear inequalities and its applications," IEEE Trans. Electronic Computers EC-14, 5 (1965).

(H3) Y.C. Ho and R.L. Kashyap, "A class of iterative procedures for linear inequalities," J. SIAM Control 4, 1 (1966).

(H4) P.E. Hart, "The condensed nearest neighbor rule," IEEE Trans. on Information Theory, Vol. IT-14, No. 3, May 1968.

(H5) A.Y. Hung and R.C. Dubes, "An introduction to multiclass pattern recognition in unstructured situations," Interim Report No. 12, Contract No. AFOSR-1023-67B, Division of Engineering Research, Michigan State University, 1970.

(K1) J.S. Koford and G.F. Groner, "The use of an adaptive threshold element to design a linear optimal pattern classifier," IEEE Trans. Information Theory IT-12, 1 (1966).

(K2) R.E. Kalman and J.E. Bertram, "Control system analysis and design via the 'second method' of Lyapunov, II: Discrete-time systems," Trans. ASME J. Basic Engr. 83, D, 2, 371-400 (1960).

(K3) R.L. Kashyap and C.C. Blaydon, "Recovery of functions from noisy measurements taken at randomly selected points and its application to pattern classification," Proc. IEEE, Vol. 54, 1127-1129, August 1966.

(K4) M.G. Kendall, "Discrimination and classification," pp. 165-185 in Multivariate Analysis, edited by P.R. Krishnaiah, Academic Press, New York, 1966.

(N1) N.J. Nilsson, Learning Machines, McGraw-Hill, New York, 1965.

(N2) G. Nagy, "State of the art in pattern recognition," Proc. IEEE 56, 5 (1968).

(P1) J.D. Patterson and B.F. Womack, "An adaptive pattern classification system," IEEE Trans. Systems Science and Cybernetics SSC-2, 1 (1966).

(P2) J.D. Patterson, T.J. Wagner and B.F. Womack, "A mean-square performance criterion for adaptive pattern classification systems," IEEE Trans. Automatic Control AC-12, 2, 195-197 (1967).

(P3) J.D. Patterson, T.J. Wagner and B.F. Womack, "A performance criterion for adaptive pattern classification systems," Proc. 1966 Joint Automatic Control Conf. (Seattle, Wash.), 38-46.

(P4) J.M. Pitt and B.F. Womack, "A sequentialization of the Patterson classifier," Proc. IEEE (letters), Vol. 54, 1987-1988, December 1966.

(P5) J.M. Pitt and B.F. Womack, "Additional features of an adaptive, multicategory pattern classification system," IEEE Trans. on Systems Science and Cybernetics, Vol. SSC-5, No. 5, July 1969.

(P6) E.A. Patrick and J.P. Costello, "On unsupervised estimation algorithms," IEEE Trans. on Information Theory, Vol. IT-16, No. 5, September 1970.

(P7) E.A. Patrick, "On a class of unsupervised estimation problems," IEEE Trans. on Information Theory, Vol. IT-14, No. 3, May 1968.

(R1) J. Rubin, "Optimal classification into groups: an approach for solving the taxonomy problem," J. Theoret. Biol. 15, 103-144 (1967).

(R2) H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., Vol. 22, 400-407, 1951.

(R3) A. Ralston, A First Course in Numerical Analysis, McGraw-Hill, New York, 1965.

(S1) G.S. Sebestyen, Decision-Making Processes in Pattern Recognition, The Macmillan Company, New York, 1962.

(S2) J.W. Sammon, Jr., "A nonlinear mapping for data structure analysis," IEEE Trans. Computers C-18, 401-409 (1969).

(W1) D.V. Widder, Advanced Calculus, 2nd edition, Prentice-Hall, Inc., New York, 1961.

(W2) B. Widrow and M.E. Hoff, "Adaptive switching circuits," 1960 IRE WESCON Conv. Record, Part 4, 96-104 (1960).

(W3) W.G. Wee, "Generalized inverse approach to adaptive multiclass pattern classification," IEEE Trans. Computers C-17, 12 (1968).
(W4) S.S. Wilks, Mathematical Statistics, John Wiley and Sons, Inc., New York, 1962.

(W5) M.T. Wasan, Stochastic Approximation, Cambridge University Press, 1969.

(W6) T.J. Wagner, J.M. Pitt and B.F. Womack, "A comparison between pattern classification approaches," IEEE Trans. Information Theory (Correspondence), Vol. IT-13, 611-613, October 1967.

(Y1) S.S. Yau and J.M. Schumpert, "Design of pattern classifiers with updating property using stochastic approximation techniques," IEEE Trans. on Computers, Vol. C-17, No. 9, September 1968.

APPENDIX A

CONVERGENCE PROOF

The convergence proof of (GA.4) can be divided into two possible situations.

Case 1. If the constraint on E is violated, the algorithm terminates, since δE[n] = 0 for all n.

Case 2. Assume that the constraint on E holds. It must be shown that the algorithm converges. Before proving that ‖D[n]‖ → 0 as n → ∞, two matrix identities will be established.

(a)  Y^T D[n] = Y^T (YA[n] - E[n]) = Y^T (YY#E[n] - E[n]) = (Y^T YY# - Y^T) E[n]
             = (Y^T Y (Y^T Y)^{-1} Y^T - Y^T) E[n] = 0.

(b)  Trace{D[n]^T (YY# - I)^T (YY# - I) D[n]}
      = Trace{D[n]^T [(YY#)^T (YY#) - YY# - (YY#)^T + I] D[n]}
      = Trace{D[n]^T [Y(Y^T Y)^{-1} Y^T Y (Y^T Y)^{-1} Y^T - 2YY# + I] D[n]}
      = Trace{D[n]^T (I - YY#) D[n]}
      = Trace{D[n]^T D[n]} - Trace{D[n]^T YY# D[n]}
      = ‖D[n]‖² - Trace{(YY#E[n] - E[n])^T YY# D[n]}
      = ‖D[n]‖² - Trace{E[n]^T (YY# - I)^T Y (Y^T Y)^{-1} Y^T D[n]}
      = ‖D[n]‖²,  by (a).

Define V(D[n]) = ‖D[n]‖², a positive definite function. Then

ΔV(D[n]) = V(D[n+1]) - V(D[n])
         = V(YA[n+1] - E[n+1]) - V(D[n])
         = V(Y(A[n] + Y#δE[n]) - E[n+1]) - V(D[n])
         = V(YA[n] + YY#ρD[n] - E[n] - ρD[n]) - V(D[n])
         = V(D[n] + ρ(YY# - I)D[n]) - V(D[n])
         = ‖D[n] + ρ(YY# - I)D[n]‖² - ‖D[n]‖²
         = Trace{D[n]^T D[n]} + 2ρ Trace{D[n]^T (YY# - I)D[n]} + ρ² Trace{D[n]^T (YY# - I)^T (YY# - I)D[n]} - ‖D[n]‖²
         = 2ρ Trace{D[n]^T (YY# - I)D[n]} + ρ²‖D[n]‖²,  by (b),
         = 2ρ Trace{D[n]^T YY# D[n]} - 2ρ Trace{D[n]^T D[n]} + ρ²‖D[n]‖²
         = -2ρ‖D[n]‖² + ρ²‖D[n]‖²,  by (a),
         = -‖D[n]‖² (2ρ - ρ²).

For 0 < ρ < 2, 2ρ - ρ² = ρ(2 - ρ) > 0; hence ΔV(D[n]) ≤ 0 for all D[n], with equality only if D[n] = 0. By Lyapunov's stability theorem for discrete systems (K2),

lim_{n→∞} V(D[n]) = lim_{n→∞} ‖D[n]‖² = 0.
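The two identities and the Lyapunov decrease established above are easy to check numerically. The sketch below uses a randomly generated pattern matrix Y and margin matrix E[1] (illustrative values; the positivity constraint on E is ignored). Starting from the least-squares solution A[1] = Y#E[1], it verifies Y^T D[n] = 0 and ΔV(D[n]) = -(2ρ - ρ²)‖D[n]‖² at every step.

```python
import numpy as np

rng = np.random.default_rng(1)

N, m, M, rho = 12, 4, 3, 0.5              # patterns, basis functions, classes, 0 < rho < 2
Y = rng.standard_normal((N, m))           # pattern matrix (full column rank w.p. 1)
E = np.abs(rng.standard_normal((N, M)))   # initial margin matrix E[1]
Y_pinv = np.linalg.pinv(Y)                # the generalized inverse Y# = (Y^T Y)^{-1} Y^T
A = Y_pinv @ E                            # A[1] = Y# E[1]

for n in range(20):
    D = Y @ A - E                         # D[n] = Y A[n] - E[n]
    assert np.allclose(Y.T @ D, 0.0, atol=1e-9)          # identity (a)
    v_old = np.linalg.norm(D) ** 2                        # V(D[n]) = ||D[n]||^2
    dE = rho * D                          # delta E[n] = rho D[n]
    E = E + dE                            # E[n+1] = E[n] + delta E[n]
    A = A + Y_pinv @ dE                   # A[n+1] = A[n] + Y# delta E[n]
    v_new = np.linalg.norm(Y @ A - E) ** 2
    assert np.isclose(v_new - v_old, -(2 * rho - rho**2) * v_old, atol=1e-9)

print("||D[n]|| after 20 steps:", np.linalg.norm(Y @ A - E))   # decays toward zero
```

Each pass shrinks ‖D[n]‖ by the factor |1 - ρ|, which is exactly the geometric decay implied by ΔV(D[n]) = -(2ρ - ρ²)‖D[n]‖².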