NONPARAMETRIC PROCEDURES FOR LEARNING WITH AN IMPERFECT TEACHER

Thesis for the Degree of Ph. D.
MICHIGAN STATE UNIVERSITY
RONALD JOSEPH RICHTER
1972

This is to certify that the thesis entitled NONPARAMETRIC PROCEDURES FOR LEARNING WITH AN IMPERFECT TEACHER presented by Ronald Joseph Richter has been accepted towards fulfillment of the requirements for the Ph. D. degree in Electrical Engineering.

Major professor
Date: November 8, 1972

ABSTRACT

NONPARAMETRIC PROCEDURES FOR LEARNING WITH AN IMPERFECT TEACHER

By Ronald Joseph Richter

In this dissertation a general pattern recognition problem is investigated in which the classification of an observed phenomenon or pattern is inferred from a set of training patterns. The statistical approach to pattern recognition is taken. The pattern classes are assumed to be characterized by probability distributions that are inadequately known. In order to form decision rules the unknowns must be "learned" from a set of training patterns. The training patterns are classified by an imperfect teacher that makes errors in its classifications. This type of learning, called learning with an imperfect teacher, lies in between supervised learning and unsupervised learning.

In the first part of the thesis, a probabilistic model of an imperfect teacher is proposed. Expressions are developed relating a perfect teacher to an imperfect teacher. An example is studied to show the asymptotic effects of using misclassified training patterns in an algorithm designed for supervised learning. The example illustrates the need for developing algorithms specifically for use with an imperfect teacher. A class of nonparametric learning procedures is proposed for learning to recognize patterns with an imperfect teacher. The procedures require prior knowledge only of the nonsingular matrix of error probabilities characterizing the teacher and of whether the patterns are discrete or continuous random variables. Formal proofs are given showing that the procedures are asymptotically optimal in the sense that they have expected risks which converge with increasing number of training patterns to the optimal (Bayes) risk. Theorems on rates of convergence are also obtained.

In the latter part of the thesis, the two-class recognition problem is investigated in detail with the objective being to study the quantitative and qualitative effects of using an imperfect teacher rather than a perfect teacher. Large-sample approximations are developed for evaluating the expected risk of the estimated decision rules. The performance of the learning procedures is studied in several examples involving normal, triangular, and binomial distributions. Various large-sample properties of the expected risk are also investigated. Finally, a measure of relative performance is proposed for quantitatively evaluating the effects of an imperfect teacher. This measure is evaluated for the important case of a zero-one loss function. The measure is then used along with a cost of training to establish an overall cost for an imperfect teacher. Conditions are established under which an imperfect teacher is more cost effective than a perfect teacher.
NONPARAMETRIC PROCEDURES FOR LEARNING WITH AN IMPERFECT TEACHER

By

RONALD JOSEPH RICHTER

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering and Systems Science

1972

ACKNOWLEDGEMENTS

The author wishes to thank his thesis advisor, Dr. Richard C. Dubes, for the guidance and encouragement given throughout this research. Thanks are also due to Dr. Dennis C. Gilliland for his helpful suggestions. Appreciation is also expressed to Dr. G. L. Park, Dr. R. O. Barr, and Dr. J. H. Stapleton for their interest in this work. The principal support for this research was a United States Steel Foundation Fellowship. The cooperation of The Magnavox Company is also gratefully acknowledged. Special thanks go to Mrs. Jean Emig for her excellence in typing.

TABLE OF CONTENTS

Chapter                                                            Page
I.   INTRODUCTION ................................................... 1
     1.1  Survey of the Pattern Recognition Problem ................. 2
     1.2  The Imperfect Teacher ..................................... 5
     1.3  Thesis Contributions ...................................... 7
II.  THE LEARNING PROBLEM ........................................... 9
     2.1  Mathematical Model for Pattern Recognition ................ 9
     2.2  Model of the Imperfect Teacher ............................ 12
     2.3  Some Fundamental Relations ................................ 13
     2.4  Effects of Misclassifications on Supervised Learning
          Procedures ................................................ 15
III. A NONPARAMETRIC LEARNING PROCEDURE ............................. 21
     3.1  Decision Rules ............................................ 22
     3.2  Convergence Criteria ...................................... 25
     3.3  Preliminary Lemmas ........................................ 27
     3.4  The Discrete Case ......................................... 30
     3.5  The Continuous Case ....................................... 35
IV.  EFFECTS OF THE IMPERFECT TEACHER ............................... 41
     4.1  Expected Risk for the Two-Class Problem ................... 42
     4.2  Normal Approximation ...................................... 44
     4.3  Examples of Learning ...................................... 49
     4.4  Large Sample Properties of the Expected Risk .............. 61
     4.5  A Measure of Performance .................................. 65
     4.6  Cost of Training .......................................... 68
V.   CONCLUSIONS .................................................... 73
     5.1  Summary ................................................... 73
     5.2  Extensions ................................................ 75
BIBLIOGRAPHY ........................................................ 78

APPENDICES
A.   Optimal Decision Rules ......................................... 82
B.   Nonparametric Estimation of Density Functions .................. 86
C.   Proof of Theorem 3.5 ........................................... 89

LIST OF TABLES

Table                                                              Page
B.1  Univariate Kernels ............................................. 88

LIST OF FIGURES

Figure                                                             Page
2.1   Asymptotic Risk with an Imperfect Teacher ..................... 20
4.1   Expected Risk with Normal Distributions, P1 = 0.5 ............. 51
4.2   Expected Risk with Normal Distributions, P1 = 0.1 ............. 52
4.3   Expected Risk with Normal Distributions, β11 ≠ β22 ............ 53
4.4   Expected Risk for Large N ..................................... 54
4.5   Triangular Density Functions .................................. 55
4.6   Expected Risk with Triangular Densities ....................... 57
4.7   Comparison of Large Sample Approximation and Simulation ....... 58
4.8   Expected Risk with Binomial Distributions ..................... 60
4.9   Measure of Relative Performance ............................... 69
4.10  Teacher's Cost for Classification ............................. 71
4.11  Relative Cost of Training ..................................... 72

CHAPTER I

INTRODUCTION

Pattern recognition is the study of ways in which machines, usually meaning digital computers and associated equipment, can observe an environment, learn to distinguish relevant details from background trivia, and make sound and reasonable decisions. The task of designing machines to recognize patterns appears in many different forms in a variety of disciplines. The problems encountered range from the practical to the profound, from engineering design and economics to the theories of artificial intelligence and human learning. But the central problem of pattern recognition is to develop procedures or algorithms that effectively classify an observed phenomenon as resulting from one of a set of sources.

In this thesis a general pattern recognition problem is investigated in which the classification of an observed phenomenon or pattern is inferred from a set of training patterns. A statistical approach to pattern recognition is taken. In this approach the sources or pattern classes are assumed to be characterized by probability distributions, and statistical decision theory is used as the mathematical tool for deriving classification procedures. When the probability distributions that describe the patterns are inadequately known, the unknowns must be "learned" from a set of training patterns.

Learning with a teacher (supervised learning) refers to the situation in which all of the training patterns have been correctly classified as to origin. When the true classifications of the training patterns are unknown, the learning is said to occur without a teacher (unsupervised learning). This thesis investigates a third type of learning, learning with an imperfect teacher, which lies somewhere between supervised and unsupervised learning. For this type of learning the training patterns are classified by an imperfect teacher that makes errors in its classifications.

1.1 SURVEY OF THE PATTERN RECOGNITION PROBLEM

Surveys of early work in pattern recognition have been written by Nagy [N-1] and Ho and Agrawala [H-1]. Nilsson [N-2] presented some of the early work on the theory of "learning machines," or machines that can be trained to recognize patterns. The idea of finding clustering transformations for designing pattern recognizers was developed by Sebestyen [S-3]. A book edited by Kanal [K-1] describes applications of pattern recognition to the problem of character recognition, and a book edited by Watanabe [W-1] is a very good collection of papers emphasizing the philosophy of various approaches. Sequential methods in statistical pattern recognition have been presented by Fu [F-4].

Ho and Agrawala [H-1] identify three fundamental problems associated with pattern recognition: characterization, abstraction, and generalization. Characterization is concerned with the problem of selecting the measurements which should be taken on the objects and of developing methods for reducing these measurements to a set of real variables which effectively characterize the objects and are amenable to automatic data processing. The set of real numbers obtained from the measurements on an object is called a pattern, and each element of the pattern is called a feature. Each of the states of nature or sources of patterns is said to be a pattern class. At present there is no unifying theory for the selection of features [I-2].
Much of the problem of feature selection is left to the ingenuity of the system designer [I-l]. After the features have been selected, the designer, using all available information, must devise a decision procedure for classifying a new pattern of unknown origin. The process of deriving a decision rule from the available information is called abstraction. The ability of the resulting decision rule to correctly classify new patterns is termed generalization. This quality is best stated in probabilistic terms, such as the probability of correct classification. These three aspects of the pattern recognition problem are not entirely independent of each other. Clearly, the choice of imprOper features manifests itself in unduly complex forms for the decision rule with poor generalization ability. Similarly, the ability to generalize may be the criterion for choosing the features. In this thesis it will be assumed that the features have been judiciously chosen. This research is concerned with a particular abstraction problem and the generali- zation ability of the resulting algorithms. Two primary approaches have been taken to the abstraction problem in the literature [B-l]: namely, a deterministic approach and a sta- tistical one. In the deterministic approach, the goal of the recognition system is to partition the feature space into regions such that each region can be identified with a pattern class. A functional form is 4 often assumed for the decision function, and unknown parameters are derived from a set of classified training patterns. Many of the de- terministic procedures take the form of error correction algorithms or gradient descent algorithms. Blaydon [B-l], Nilsson [N—Z], and Ho and Kashyap [H-Z] have described such methods. This thesis is concerned with the statistical approach to the abstraction problem. In this approach the mathematical tools of sta- tistical decision theory [F-Z] are applied to the problem of designing the classifier. The pattern features are assumed to be described by probability distributions; and optimal decision rules are obtained to satisfy certain classification criteria; for example, minimum average risk. A recent book by Fukunaga [F-S] presents this statistical ap- proach. The problem of learning arises during the abstraction process when the probability distributions characterizing the pattern classes are inadequately known. The unknowns of the class distributions must be estimated or "learned" from training patterns drawn from each of the pattern classes. When the origins of the training patterns are known, the learning is said to take place with a perfect teacher (supervised learning); but when the classification of the training patterns is un- known, the learning is called unsupervised (without a teacher). Abramson [A-1] and Spragins [8-9] have studied the convergence question in supervised learning when only a few unknown parameters need to be learned. Aizerman, et. a1. [A-2] deve10ped the method of potential functions for supervised learning, and Van Ryzin [V-l], [V-Z] has shown convergence of a procedure that estimates complete density functions via Parzen type [P-l], [C-l] density estimators. 5 Most of the practical learning procedures that have been deve10ped use supervised learning. Theoretical foundations for unsupervised learning have been established, but practical algorithms have been slow in coming. Fralick [F-3], Patrick and Hancock [P-2], and Yakowitz [Y-l] have discussed various theoretical aspects of unsupervised learning. 
A third type of learning, which has received little attention in. the literature, is studied in this thesis. This type may be called "learning with an imperfect teacher" and lies somewhere in between super- vised and unsupervised learning. 1.2 THE IMPERFECT TEACHER In many practical situations it is unreasonable to assume that all of the training patterns supplied to a learning system by a "teacher" are correctly classified. The teacher often classifies the training patterns using past experience and additional information not available to the learning system. But even this additional information may not be ade- quate to ensure that the teacher's classifications are all correct. As an example, a case has been reported in the literature [M-3] in which two groups of electrocardiographers reading the same set of 561 EKG's disagreed in over 40 percent of the normal-abnormal classifi- cations. If one were to attempt to use this set of EKG's for supervised learning, he is confronted with a dilemma. Which group's classifi- cations should be used? It is clearly questionable to attempt to perform supervised learning with such unreliable data. One might solve the problem by using only those EKC's for which both groups agreed. But this is a wasteful procedure since time and money are involved in recording 6 and reading each EKG. Another solution might be to use fUrther tests and clinical records from each patient as an aid for classifying each EKG. Again this may be a costly and time consuming undertaking with no guarantee of success. Thus it is reasonable to talk about an "imperfect teacher" and consider situations in which the training patterns may not all be correctly classified. Estimation and decision making with misclassified data has re- ceived some attention in the literature. Bross [B-Z] considered the effects of misclassified data on estimators and significance tests in- volving binomial distributions. Tenenbein [T-l] extended this work to investigate the effects of estimation with misclassified multinomial data. He presented a double sampling scheme that minimized a measure- ment cost. Linear classifiers have been used in many pattern recognition problems. Lachenbruch [L-l] looked at the effects of misclassification when learning a linear discriminant function for a Gaussian classifi- cation problem. He exhibited the asymptotic effects of estimating the mean and variance of the normal distribution from training samples that had a constant probability of being misclassified. Whitney and Dwyer [W-Z] derived the asymptotic performance of the k-nearest neighbor rule [C-Z] which used training patterns classified by an imperfect teacher. Most of the research involving imperfect teachers has been con- cerned with analyzing the performance of existing supervised learning algorithms when some of the training samples are misclassified. Recent- ly Shanmugam [8-4], [8-5] developed an error correction procedure for learning with an imperfect teacher. Shanmugam studied a nonparametric learning scheme that was asymptotically optimal in the sense that it had 7 an average risk which converged to the Bayes risk. He considered only a two-class problem, used a Bayes decision rule with a zero-one loss function, and assumed that the teacher had the same rate of misclassifi- cation for each pattern class. He proposed a threshold feedback scheme for gradually phasing out the teacher in the case when the class densi- ties had disjoint support. 
The work in this thesis removes some of the assumptions needed for the convergence of Shanmugam's algorithms. An alternative and more general approach to learning with an imperfect teacher is developed herein. The procedures considered are analogous to the type considered by Van Ryzin [V-l], [V-2] for supervised learning. These procedures are nonparametric in the sense that minimal knowledge about the class distributions is assumed to be available to the system designer. 1.3 THESIS CONTRIBUTIONS The emphasis of this research is on developing and evaluating a procedure for learning with an imperfect teacher. The first contribution of this thesis appears in Chapter II which contains a mathematical model of the learning problem. A probabilistic model for an imperfect teacher is proposed and expressions are developed relating the imperfect teacher to a perfect teacher. These basic relations provide the key for studying the class of learning procedures proposed in Chapter III. An example is also presented in Chapter II to illustrate the effects of using mis- classified training patterns in an algorithm designed for supervised learning. The example provides motivation for developing algorithms Specifically for use with an imperfect teacher. 8 The second major contribution appears in Chapter III. A specific class of nonparametric procedures is proposed for learning to recognize patterns with an imperfect teacher. Formal proofs are given showing that the procedures are asymptotically optimal in the sense that they have expected risks which converge with increasing number of training patterns to the optimal (Bayes) risk. The conditions under which the convergence holds are very general for the two cases considered: i.) the case of discrete valued patterns and ii.) the case of patterns with a.e. continuous densities. Theorems on rates of convergence are also obtained. These rate theorems serve as guidelines for selecting the parameters for the estimators. The final contributions of the thesis are presented in Chapter IV where attention is restricted to the two-class recognition problem. The purpose of Chapter IV is to investigate the quantitative and qualitative effects of using an imperfect teacher rather than a perfect one. Two large-sample approximations are developed for evaluating the expected risk of the estimated decision rules. Qualitative effects of the im- perfect teacher are studied in section 4.3 by considering examples in- volving normal, triangular, and binomial distributions. In the latter part of Chapter IV, a measure of relative per- formance is proposed for quantitatively evaluating the effects of the imperfect teacher. This measure, which can be evaluated for the case of a zero-one loss fUnction, is used along with a cost of training to evaluate an overall cost of an imperfect teacher. The results in Chapter IV are original and represent a contribution to the study of learning theory. CHAPTER II THE LEARNING PROBLEM This chapter presents the mathematical model to be used to study the problem of learning with an imperfect teacher. The problem is formulated in terms of statistical decision theory. A Bayesian decision strategy is employed throughout. In Section 2.1, the basic model for statistical pattern recog- nition is defined. The Bayes decision rule and associated Bayes risk are given fer the M-class problem. In Section 2.2, a model is preposed for an imperfect teacher. Section 2.3 then develops some fundamental relations between the imperfect teacher and a perfect teacher. 
Finally, Section 2.4 investigates the asymptotic effects of using an imperfect teacher with a procedure designed for supervised learning. 2.1 MATHEMATICAL MODEL FOR PATTERN RECOGNITION The major objective of pattern recognition is to derive decision rules for classifying a pattern X as coming from one of M possible sources or pattern classes. In the statistical approach to pattern recognition the sources are characterized by probability distributions, and decision theoretic strategies are used for decision making. The mathematical model for pattern classification to be used in this research will now be defined. The pattern X is defined to be a 9 . lO vector random variable taking values in a feature space T, a subset of Euclidean n-space. The set of pattern classes is denoted as n = {w1, w2, ..., wM} with “i representing the ith pattern class. The pattern classes are assumed to be characterized by a prior probability distribution P = (P1, P2, ...,PM)T where Pi is the prior probability of occurrence of pattern class mi. Let A be an identifi- cation random variable defined over 9 such that A(wi) = i. The proba- bility distribution of A is then just the prior distribution P. The probability density function of the pattern X given that X is from pattern class w- is denoted by f(- A = i). This is a density with re- 1 spect to some measure v on T. Denote the vector of conditional densi- ties by f(-) = (f(- A = MDT. A = 1). f(° A = 2): ..., f(' After observing a value of the pattern random variable X, the pattern recognizer is required to classify X as having come from one of the M possible pattern classes “1 with density f(xIA % i). The random variable A indicating the class that produced X is, however, unobserva- ble. The general elements of this statistical decision making problem are outlined in Appendix A. A Bayesian decision strategy is used to obtain optimal decision rules. The Bayes decision rule, which follows from (A.S) of Appendix A, is any randomized decision rule 68(x) = (681(x), 637(x), ..., 63M(x)) satisfying 1 if Dj(x) < min Dk(x) ka‘j GB.(x) = 0 if D (x) > min Dk(x), j = l, 2, ..., M (2.1) J J - k7!) Yj if Dj(x) = min Dk(x) ka‘j 11 where the Bayes discriminant function Dk(x) is given by M 0km = E A kapr(X|A = A); k = l, 2, coo, B10 (202) l The function 68j(x) is interpreted as the conditional probability of classifying the pattern X as coming from pattern class wj given that the observed value of the pattern is X = x. The loss function ka represents numerical loss incurred by deciding X is from pattern class wk when the true pattern class is wk' The loss function is a nonnegative, real- valued function that is assumed to satisfy the condition Lij > ij :_0, i # j. The Bayes rule minimizes the average loss of misclassification. This minimum average less, known as the Bayes risk, follows from (A.7) and is given by M E RB(P, f) M pA Ir jgl ijf(xIA = x)ij(x) dv(x) A 1 u Ilbvfiz f Dj(x)dBj(x) dv(x). (2.3) J'l'r The notation fer RB in (2.3) emphasizes the dependence of the Bayes risk on the prior distribution P and the class-conditional densities f. This notation will be of use in later chapters. The optimal Bayes rule can be used for decision making only if one has exact knowledge of the class-conditional densities and of the prior distribution. When these distributions are known, the abstraction and generalization problems are essentially solved. The system designer needs only to implement the decision rule of (2.1). 
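For illustration, the rule in (2.1) and (2.2) can be written out directly when the distributions are known. The short Python sketch below is not part of the original thesis; the function names and the two-class usage values (unit-variance normal densities with means 1.5 and -1.5, equal priors, zero-one loss) are purely illustrative, and exact knowledge of the priors, class-conditional densities, and loss matrix is assumed.

```python
import numpy as np

def bayes_discriminants(x, priors, densities, loss):
    """D_k(x) = sum_j L_{kj} P_j f(x | class j), k = 1, ..., M, as in (2.2).

    priors    : length-M sequence of prior probabilities P_j
    densities : list of M callables, densities[j](x) = f(x | class j)
    loss      : M x M array with loss[k, j] = L_{kj}
    """
    f = np.array([densities[j](x) for j in range(len(densities))])
    return loss @ (np.asarray(priors) * f)

def bayes_decision(x, priors, densities, loss):
    """Bayes rule (2.1): choose the class whose discriminant is smallest
    (ties may be broken arbitrarily, which the randomized rule permits)."""
    return int(np.argmin(bayes_discriminants(x, priors, densities, loss)))

if __name__ == "__main__":
    # Two unit-variance normal classes (illustrative values), equal priors,
    # zero-one loss.
    from math import exp, pi, sqrt
    normal = lambda m: (lambda x: exp(-(x - m) ** 2 / 2.0) / sqrt(2.0 * pi))
    priors = [0.5, 0.5]
    densities = [normal(1.5), normal(-1.5)]
    loss = np.array([[0.0, 1.0], [1.0, 0.0]])
    print(bayes_decision(0.3, priors, densities, loss))   # prints 0, i.e. class 1
```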
The resulting system perfbrmance will be given by the Bayes risk in (2.3). 12 But in practice, the required distributions are usually only partially known. The unknowns of the class distributions must be learned (estimated) from training patterns drawn from each of the pattern classes. The pattern recognition system must use the training patterns provided by a (imperfect) teacher to abstract a decision rule for classifying new patterns of unknown origin. One would hope that the learned decision rules adapt or converge with increasing number of training patterns to what the optimal rule would be if the true proba- bility distributions were known. 2.2 MODEL OF THE IMPERFECT TEACHER The training patterns to be used for learning a decision rule are represented by a sequence of independent, identically distributed N random variables Y = (Y1, Y2, ..., YN). Each random variable Y1 = (Xi, A;' A1) consists of a training pattern Xi, a label A; provided by an imperfect teacher, and an identification random variable Ai representing the true classification of Xi- Only Xi and A; are available to the learning system; Ai remains unknown. The pattern random variable Xi is assumed to be distributed as f(-|A = A) if the correct classification of Xi is A1 = X. The identification random vari- ables are identically distributed according to the prior distribution P. The labels provided by the imperfect teacher are modeled as identically distributed random variables on O having a probability distribution defined by Pr[A; = le. 1 k] = Bjk (2.4) with Bjk 1.0, Bjk = l. The probabilities Bjk are assumed not to be ll M3 j 1 13 a function of i or of the value of the training pattern Xi. An M x M matrix of probabilities characterizing the imperfect teacher may be defined by I" 1 B11 81? ”' 81M 8 B ... B 21 22 2M _8_ = . (2.5) 3M1 8M2 BMM This simple probabilistic model of an imperfect teacher is one that is justifiable in many practical situations. The model is an extension of one used by Whitney and Dwyer [W-Z] and by Shanmugam [8-5]. When Bii = l for all i, A; = A1 with probability one, in which case the model reduces to one of a perfect teacher. When Bij = l/M for all i and j, the model corresponds to unsupervised learning since having a teacher that SUpplies equally likely labels for all pattern classes is equivalent to having no teacher at all. There is no classification information available in the labels. 2.3 SOME FUNDAMENTAL RELATIONS Several expressions can be derived to relate the probability distributions associated with an imperfect teacher to those for a perfect teacher. The labels of the perfect teacher are distributed as the identification random variable A, and the class-conditional densities with the perfect teacher are f(- A). With an imperfect teacher, the density of X conditioned on A' is given for j = l, 2, ..., M by f(x|A' = j) = f(x|A' = j, A = k)Pr[A = kIA' = j]. (2.6) k nevfi: l 14 But f(X|A' = j, A = k) = Pr[[\' zjjxj A :: k]f(xLA = k) WM'=HA=M = f(xlA = k) (2.7) where the last equality follows from the assumption that Bjk is not a function of the value of X. Also Bayes rule gives PrjA' = le = k]Pr[A = k] M Z Pr[A' = le = X]Pr[A = A] 1:1 Pr[A = kIA' = j] = E = BjkPk/pj’ j, k = 1, 2, ooo, b1o (208) Here P; is the probability that A' equals j, P j: 1, 2, ooo, Mo (209) ji 1' Substitution of (2.7) and (2.8) into (2.6) then gives M . P3f(xIA' = j) = Z BjkPkf(xIA = k), j = 1, 2, ..., M. 
(2.10) k=1 Equations (2.9) and (2.10) can be expressed in a convenient vector form by defining the following vectors and matrices: f'(x) = (f(xIA' = 1), ..., f(xIA' = 11))T (2.1la) I_ 1: 01‘ p _ (p1, p2, ..., PM) (2.11b) 3.: diag (p1, P2, ..., PM) (2.11c) and 3: = diag (Pi, Pg, ..., Pg). (2.11d) Equation (2.10) then becomes P'f'(x) = §_g_f(x) (2.12) and (2.9) has the form P = _e_ P. (2.13) When §_is nonsingular, it may be inverted to obtain from (2.12) and (2.13) P = g‘lp' (2.14) and Pf(x) g-lg'f'm. (2.15) The expressions (2.12) thru (2.15) describe the relations between the probability distributions for the perfect teacher and those for the imperfect teacher. Equations (2.14) and (2.15) provide the key for develOping and studying in Chapter III and Chapter IV an algorithm for learning with an imperfect teacher. Previous studies with imperfect teachers [W-2], [8-4], [8-5] have used (2.10) for the case of two pattern classes (M = 2), but none of these studies have made use of the inverse relations in (2.14) and (2.15). Shanmugam [8-4], [8-5] avoided the need to use the inverse relations by restricting the density functions to having disjoint supports and by using a zero-one loss function along with a known prior distribution. The inverse relations remove the need for such restrictive assumptions. 2.4 EFFECTS OF MISCLASSIFICATIONS ON SUPERVISED LEARNING PROCEDURES Many algorithms have been deve10ped for supervised learning. Several of these algorithms have also been used with an imperfect teacher [S-S], [W-Z], [L-l]. For the problem considered in [8-5], it was shown that misclassified training patterns could be used in the supervised learning procedure and that the resulting decision rule would 16 still converge to the Bayes rule. The use of misclassified data did not prevent the algorithm from converging to the desired decision rule. If this were always the case, one would have little motivation for deve10ping learning procedures which take into account the imperfect teacher. One could just ignore the fact that the data was misclassi- fied and use existing algorithms which were designed for supervised learning. An example presented in this section shows that, as one would expect, misclassified training data can significantly effect the con- vergence of a supervised learning algorithm. This example provides motivation for developing learning procedures for use with an imperfect teacher. The learning procedures discussed in [8-5] and [V-l], as well as those preposed in Chapter III, are based on nonparametric estimators of the class-conditional densities and prior distribution. In [V-l], the training patterns with label k were used to form estimates f(xlA = k) and Pk of f(x|A = k) and Pk,respectively. An estimate of the Bayes dis- criminant function Dk was then defined as M . . Dk(x) = E ij ij(xIA = 3). (2.16) ) A decision rule 6 was formed according to (2.1) by replacing Dk with the estimate Dk. It was shown in [V-l] that,under suitable conditions, 3 converged to the Bayes decision rule 63. Now if the training patterns were misclassified, one would be estimating f(xIA' = k) and Pi, not f(xIA = k) and Pk, when using the patterns with label k. The resulting decision rule would, at best, con- verge to a decision rule 6' having the form of (2.1) but with 17 M DL(X) = 521 ij P§f(X|A' = j). k = 1, 2, ..., M (2.17) as the discriminant functions. This decision rule would be equiva- lent to the Bayes rule only if the risk (see equation (A.2)) ' M R(P, f. 
6 ) = I [11.00 6.'(x) d\)(x) (2.18) J J J l were equal to the Bayes risk RB(P, f). For the problem considered in [8-4], [5—5], R(P, f, 6') and RB(P, f) were equal. Hence for that problem, the misclassified data did not alter the convergence of the learning procedure. Shanmugam [5-5] has shown that for the M-class problem, 6' and 63 are equivalent if the following conditions are satisfied: a.) A zero-one loss function is used; i.e., Lii = 0 and Li' = l, i # j. J b.) The probabilities characterizing the imperfect teacher are Bii = B > l/M, i = 1, 2, ..., M (2.19) and _1-3 . ._ . . 220 B " a 1: J _ 1: 2, 0”, M, 17‘]- (. ) 13 M- 1 Under these conditions, supervised learning algorithms based on density estimators as in [V-l] will perform asymptotically just as well with an imperfect teacher as with a perfect teacher. The following example shows what happens when either condition (a.) or (b.) is not satisfied. Suppose that there are two pattern classes w and w2' and assume 1 that under pattern class ”j the pattern X has a normal distribution with mean uj and variance 02, 18 _1 6 f(x|A = j) = (2no’) exp (~(x - uj)?/202). (2.21) For convenience assume that 112 > “1' The decision rule 6' = (6;, 6;) obtained with the imperfect teacher is given by (2.1) using the discriminant functions Di in place of DR. The rule is 6;(x) = 1 if i I t l I ! L11P1f(x|A = 1) + L12P2f(xIA = 2) :_L21Plf(xIA = 1) + L222;f(x|A' = 2), (2.22) and 65(x) = l — 6;(x). Making use of (2.12) one can show that (2.22) simplifies to 61(x) = 1 if either x :_r, p > 0 or x :.T’ p < 0, where p 3 ‘(L21 ‘ L11)812 + (L12 ' L22)822 (2'23) and T _ 02 2(1421 -I;11)811‘ (L12 " L22)82-1- + u] + “2 u2 - P1 P2 (L12 ’ L222322 ‘ (L21 ' L113312 2 (2.24) The risk incurred in using 6' follows from (2.18). One can show that (2.18) can be evaluated as P IJ + P,L " A if p > 0 RU), f, (3') z 1 21 Z 22 (2.25) P L + P L + A if . p < 0 1 11 2 12 where A = (L21 ' L113P1¢(Y1) * (L12 ' L22)P2¢(Y2) (2°26) - 1 P L - L - L - L S Y1 = T’ HI = __1n .1.( 21 11)811 ( 12 22)821 + __ (2.27) O S P2 (L12 " L22)822 ' (L21 ' L11)812 2 Y? = (T ‘ U2)/0 = Y1 - S (2.28) S = (“2 - ull/o (2.29) 19 and a l. ¢(a) = ] (2n)’? exp (-t2/2) dt. (2.30) .00 The Bayes decision rule and the Bayes risk may be found for this example by setting 811 = B = 1 in the above equations. It is clear 22 from the above equations that in general R(P, f, 6') will vary with 811 and 822. Hence the rule 6' is not always equivalent to the Bayes rule. In Figure 2.1 the risk R(P, f, 6') is plotted as a function of 5:: for the case of a zero-one loss function and equal prior proba- bilities. The figure shows that for small 8 the risk exceeds the 22’ Bayes risk by a significant amount. In this case, estimating the density functions without compensating for the misclassified data leads to a decision rule which is not a Bayes rule. Thus, one is motivated to develop a learning procedure for use with an imperfect teacher. 20 o.m genome? poomwemEH cm :ufiz xmwm capoumEXm< .H.N magmam A00 00z min Dk(x; v) (3.5) k¢i yj 1f Dj(x; v) min Dk(x; v) k#j The notation in (3.4) and (3.5) displays the dependence of the Bayes rule on the vector v (or, alternatively, on P and f). If the training patterns are used to obtain an estimate ;N(x) of v(x) and if 6N is a good approximation to v, then a reasonable decision rule may be formed by using 6N in (3.4) as if it were the true v. The resulting decision rule is 8N(x) = chx; 6N) = (lime), lime» (3.6) with (I if DNj(x) > pi? DNk(x) . 
6Nj(x) =4 0 if DNj(x) < min DNk(x) (3.7) kfij min DNk(x) Lyj if an(x) k¥j and with the estimates of the Bayes discriminant functions given by DNj(x) Dj(x; 3N) M = z [Jo-v 0(X). (3.8) i=1 )1 N1 The class of estimators of v(x) preposed here is defined as follows. Let {gN(x, y)} be a sequence of nonnegative real-valued func- tions on T x T such that I chx. y) dv(x) = 1 a.e.v (3.9) T and lim E[g (x, X)IA = A] = f(xIA New N A) a.e.v. (3.10) Let 1 if i = j Mi. 3') = (3.11) o if i s j be the Kronecker delta, and denote the elements of Bil by bij: Bil = [bij]° Then the estimator 9N = (5N1, VNZ, ..., GNM) is defined by M 1 izl bjiA(Afi, i)gN(x, xk). (3.12) ZIH IIMZ ij(x) = k Explicit expressions for the 3N functions will be exhibited in Section 3.4 for the discrete case and in Section 3.5 fer the continuous case. The form of (3.12) may be obtained from the following consider- ations. Suppose that Ni is the number of training patterns classified by the imperfect teacher as coming from pattern class mi. Then an esti- mate of f(xIA' i) is given by the Parzen estimator described in Appendix B, N f(xIA = i) = $1 kzl A(A£, i)gN (x, xk). (3.13) 1 = 1 An estimate of the probability Pi is given by Ni/N. Combining these estimates according to the inverse relation (2.15) results in an esti- mator for v(x) having the form in (3.12). The decision rule in (3.6),along with the estimator in (3.12), forms the class of procedures for learning with an imperfect teacher to be studied in the remainder of this thesis. When Bii = 1 for all 1, these procedures reduce to those of Van Ryzin [V-l] for supervised learning. 25 3.2 CONVERGENCE CRITERIA A The risk incurred in using decision rule 6N is given by (A.2) as M M R , 6 I.,,P,f A =° 6, d (v N) 351 IT [12 31 1 (xl 1)] NJ(X) v(x) =1 J M 2 IT chx; v) 3113' (x) dv(x) (3.14) 1 and the Bayes risk is given by (2.3) as R3”) = . J II M2 1 1‘ Dj(x; v) 6Bj(x; v) dv(x). (3.15) The dependence of the risks on the vector v is displayed by the no- tation in (3.14) and (3.15). One would hepe that as N + w, the estimated rule 6N would in some sense converge to the Bayes rule 63. Now the risk R(v, 6N) is a random variable since it is a function of the training set. 80 the decision rule 6N is said to converge to the Bayes rule 6B if the risk R(v, 6N) converges in some sense to the Bayes risk RB(v). The following defi- nitions of convergence are due to Van Ryzin [V-Z]. DEFINITION 3.1. A decision rule 6N is said to be Bayes Risk Consistent if for every 5 > 0, Pr[R(v, 6N) - RB(v) _>_ e] -> o as N ~> oo; (3.16) i.e., if R(v, 6N) converges in probability to RB(v) as N + w. Let EN[-] denote the expectation with respect to the distribution of the training data, {(Xl, A1, A1), ..., (XN, A§, AN)}. Then define the expected risk for the learning procedure as Rch) 2 ENtncv. 3N1]. (3.17) RN is a nonrandom variable which depends upon the vector v. 26 DEFINITION 3.2. A decision rule 6N is said to be Mean Risk Consistent if lim RN(v) = RB(V). (3.18) N-HZO Definition 3.2 is a special case of the following definition: DEFINITION 3.3. A decision rule 6N is said to be Mean Risk Consistent of order r > 0 if $1: EN[IR(V, 8N) - RB(v)|r] = 0. (3.19) The following notation is defined for use in later sections: AR(v, SN) 2 R(v, 6N) - RB(v) (3.20) and ARN(v) 2 Rch) .. RB(v). (3.21) So a decision rule 6N is Bayes Risk Consistent if AR(v, 6N) converges to zero in probability, and the rule is Mean Risk Consistent if ARN con- verges to zero in the ordinary sense of a limit. 
Since convergence in the mean implies convergence in probability [L-2], Mean Risk Consistency implies Bayes Risk Consistency. A stronger relation between the various types of convergence is established by the following preperty. proPeTW3.1. If a decision rule 6N is Mean Risk Consistent, then it is Mean Risk Consistent of order r for any r > O. PROOF: From (3.14), it follows that . M M |R(V. 5N)l 5,.2 _Z [I j=1 1=1 LjiPif(xIA = i)6Nj(x)I dv(x) M 5_ £1 ? LjiPi IT f(xIA = 1) dv(x) j: i=1 27 M Z Ljipi < m- (3.22) 1 i=1 u Ilbvfiz 5 So the sequence of random variables {R(v, 6N)} is 3.5. uniformly bounded. Hence R(v, 6N) converges to RB(v) in the rth mean if it con- verges fer r = l [L-2, p. 158]. End of proof. In the following sections the estimated decision rules are shown to be Mean Risk Consistent. The above property then establishes that the rules are Mean Risk Consistent of order r for any r > 0. 3.3 PRELIMINARY LEMMAS The two lemmas presented in this section are used to establish convergence theorems and rate theorems for the decision rule 6N in both the discrete case (Section 3.4) and the continuous case (Section 3.5). Let E[-] denote expection with respect to the mixture distribution M Z Pif(xlA = i) =1 p(X) 1 M E P{f(x|A' = 1). (3.23) 1: 1 The estimator GN satisfies the following lemma: LEMMA 3.1. Let GNj(x), j = 1, 2, ..., M, be defined by (3.12). Then ENtij(x)] = PjEIchx. X)IA = j] = Pj [ gN(x, z)f(zIA = j) dv(z). (3.24) T 28 PROOF: Taking expectations in (3.12) gives M M A 1 1 v . EN[ij(x)] = fi' 2 .) bjiEN[A(Ak, 1)gN(x, xk)]. (3.25) k=l i=1 Since the training patterns are identically distributed according to the mixture distribution (3.23), equation (3.25) may be written as ,. M ' EN[VNj(X)] = 111 bji E[A(A . i)gN(x. X)] M = “Z bjiP; EIgN(x. X)|A' = 1] 1=l = P3 E[2N(x. X)IA = 1] (3.26) where the last equality follows from (2.15). End of proof. The decision rule 6N satisfies the following lemma: LEMMA 3.2. The average risk for the decision rule 6N defined by (3.7), (3.8), and (3.12) satisfies M A 0 < ARN _<_ 1Z1 L-iIEN[|vNi(x) - PiE[gN(x, X)|A .-. i]l] dv(x) M . . + I f{EN[DNJ-(X)] - Dj(X)} - {éBj(x; v) - EN[6Nj(x)]} dV(X) (3.27) .J.=1 where E: = max L . 1 lijih 31 (3.28) and the integrals are over T. PROOF: By the definition of Bayes risk RB(v) :_R(v. 3N) (3.29) 29 so that ARN(v) :_0. Also R(GN. 68(x; v1) :_RB(GN). (3.30) So ARN :_EN[R(v, 6N)] - RB(v) + EN[R(vN, 6B(x; V)) - RB(VN)] = hN[R(V. 5N) - RB(VN)] + LN[R(VN. 63(X; V)) - RB(V)]. (3.31) Consider the first expectation in (3.31) and use (3.14) to get . . M . . EN[R(v, 5N) - RB(vN)] = ENS-El I {nj (x) - DNj (x)}..| 2 v (x) + v.(x). (3.79) 3 1 0, (3.82b) > ll iii) for some 0 < y :_1 and some constant C A max 1(y; f(' 1:}:M Al) :_C||y||Y. (3.82c) Then choosing hN = 0(N-1/(n+a)) implies that there exist constants Q1 and Q2 such that for large N Q1 N'Y/(“+“) if a > 2y -o/2(n+a) ARN 5. Q2 N if a < 2y (3.83) (<21 + (aura/0‘ ”“3 if a 2Y The proof of this theorem is given in Appendix C. The conditions in (3.82) necessary for proving Theorem 3.5 are very general. Condition (3.82a) is satisfied by all of the kernels in 40 Table 8.1 except for the Cauchy kernel (entry 5). Condition (3.82b) is a weak moment condition on the density functions. Condition (3.82c) covers an extensive class of densities as discussed by Van Ryzin [V-l]. Van Ryzin has shown that when the density functions f(x|A = k), k = l, 2, ..., M, are absolutely continuous and have first partial derivatives which are integrable, then (3.82c) is satisfied with y = 1. 
It is easy to shown that when y = 1 the bound in (3.83) is smallest if u is chosen to equal 2. In this case the rate of convergence is ARN = 0(N-1/(n+2)). So a good rule of thumb would be to choose -1/(n+2) hN = 0(N ). Of course if a were known, a better rate of con- vergence could be obtained by properly choosing a. CHAPTER IV EFFECTS OF THE IMPERFECT TEACHER In Chapter II,a model was preposed for an imperfect teacher. An example (Section 2.4) was presented to show that the use of misclassi- fied training data can significantly affect the convergence of a super- vised learning procedure. A nonparametric algorithm for learning with an imperfect teacher was then proposed in Chapter III. The algorithm took into account the misclassifications so that the estimated decision rules converged to the Bayes rule as the number of training patterns became large. Rates of convergence were also given by establishing bounds on ARN that were functions of the number of training patterns used to learn the decision rules. Since these bounds hold for a large class of probability distributions, one would not expect the bounds to be very tight; they only give a loose measure of rate of convergence. In this chapter the two-class learning problem is investigated in some detail. The objective is to study the qualitative and quanti- tative effects of an imperfect teacher on learning. This objective is achieved by restricting attention to the two-class problem and assuming specific forms for the underlying class-conditional densities. Measures of performance for comparing the perfect and imperfect teacher are proposed and studied. A large sample approximation is developed in Section 4.2 to evaluate the expected risk in the two-class problem. The approximation 41 42 is then used in Section 4.3 to study three examples of learning with an imperfect teacher. In Section 4.4 large sample properties of the ex- pected risk are investigated. A quantitative measure of performance is pr0posed in Section 4.5. The measure is evaluated for a zero-one loss function and used in Section 4.6 to compute a cost associated with training. This cost provides a second quantitative measure for com- paring an imperfect teacher with a perfect one. 4.1 EXPECTED RISK FOR THE TWO-CLASS PROBLEM The Bayes decision rule for the two-class problem may be written from (3.1) and (3.5) as 631(x) = 1 if D(x) :_O = o if D(x) < o , (4.1) 637(x) = 1 — 681(x) (4.2) where the Bayes discriminant function has the form D(x) = (L - L11)Plf(x|A = 1) - (I.12 - L22)P2f(xIA = 2) 21 (L21 ‘ L11)V1(X) ‘ (L12 ‘ L22)V2(X)° (4'3) It follows from (4.3), (3.8) and (3.12) that the estimator of D(x) has the form lISNO‘) = (L21 ’ L11)01(X) ' (L12 ' L22)‘72(X) N M 1 ‘ v . N kzl 1%1 [(L21 ' L11)b1i 7 (L12 ' L22)bzi]MAk’1)gN(x’ xk) .1. N 1 nrvfiz k 43 where {WNk} is a sequence of independent random variables that are identically distributed as the random variable 2 WN(x) = 1 [(1.21 - L11)bli - (L12 — L22)b2i]A(A', i)gN(x, X) 1: = [01A(A'. 1) + 021(1), 2)]gN(x, X) (4.5) with Ci 3 (L21 - IJll)bli - (L1? - L22)b21. (4.6) The estimated decision rule is then = o if' fiN(x) < o , (4.7) 8N2(x) = 1 — 3N1(x). (4.8) The risk for any decision rule 6 = (6 6?) can be expressed in a 1’ convenient form for the two-class problem as follows. From (A.4) the risk is given by R(P, f, 8) = P1‘([L (x) + L (x)]f(xlA = 1) dv(x) 1161 2162 + P2 1 [L1261(x) + L2262(x)]f(xIA = 2) dv(x). 
(4.9) Using the relation 62(x) = l - 61(x) (4.10) in (4.9) gives R(P, f, 8) P1L21 + 1321.22 + f [P1(L11 — L21)f(xIA = 1) + p20.12 - L22)f(xIA = 2)]61(x) dv(x) PILZI + p2122 _ f D(x)61(x) dv(x). (4.11) This expression shows that the risk for any decision rule in the two- class problem can be expressed as a sum of two constant terms that are 44 not functions of the decision rule plus an integral involving the de- cision rule 61(-) and the Bayes discriminant function D(-). The risk for the estimated decision rule 8N is then R(P, f, EN) = P1L21 + P21.22 — f D(x)8N1(x) dv(x), (4.12) and the expected risk with N training patterns is RN(P, f) P1L21 + P2L22 - f D(x)EN[SN1(x)] dv(x) 1211.21 + P2L22 - j D(x)Pr[fiN(x) :_0] dv(x). (4.13) To evaluate the expected risk, one must be able to compute the probability that the estimated discriminant function is nonnegative. Since 6N is the average of a sequence of independent, identically dis- tributed random variables, the calculation of Pr[6N(x) :_O] is gener- ally an extremely difficult problem, although a normal approximation for 6N may be used when N is large. This approximation, which may be used to evaluate the expected risk given by (4.13), is developed in the next section. It is assumed throughout this chapter that 1/2 < Bii :_l, 1 = 1, 2. 4.2 NORMAL APPROXIMATION The first two moments of the random variable WN(x) defined in (4.5) will be used in the normal approximation. From (4.5) it follows that the expected value of WN is given by EIWN(X)] - I II MIN) II MN GjE[A(A . i)gNCX. X)] = . 1 0:. 1 J G.PjE[gN(x. X)|A 1] j 1 J 2 1 cj 121 BjiPiE[gN(x, X)IA = 1] (4.14) II II MN j 45 where the last equality follows from (2.10). Interchanging summations results in E[wN(x)] = (61811 + G2821)P15[8N(X, X)IA = 1] + (61812 + 62822)P2E[gN(x, X)IA = 2]. (4.15) For the two-class problem the 8 matrix is _ — B11 B12 §.= (4.16) 8' B _21 2a so that the inverse matrix is 1 B22 “812 8'1 = . (4.17) - 811 + 822'1 ‘821 811 Substituting (4.17) and (4.6) into (4.15) and simplifying gives Eth(x)] (L21 - L11)P1E[gN(x, X)|A = 11 ’ (L12 ‘ L22)P25[gN(X, X)|A = 2] I ll MN (L - - L .)P-E[g (x. X)|A = 1]. (4.18) l 2] 1] j N 1 Note that the expected value of WN(x) is not a function of the teacher; that is, it does not depend upon the §_matrix. The second moment of WN is EIW§(x)1 E[(G,A(A', 1) + 924(4', 2))2g§(x. X)] GzP'E 2 x A' = 1 + GZP'E 2 x x A' = 2 . 4.19 1 1 [gN(x, )I 1 2 2 [gN( , )I 1 ( ) Proceeding as in (4.14), the second moment may be written as 46 E[w§(x)] = Hlpln[g§(x, X)IA = 1] + n2P25[g§(x, X)IA = 2] (4.20) where _ 2 2 H1 “ G1811 + G2821 (4'21) and _ 2 2 n? _ 61812 + 62822. (4.22) Substitution of (3.38) into (4.18) shows that when X is a discrete random variable, E[WN(x)] = D(x). Thus 6N is an unbiased estimator of D in the discrete case. Also for discrete X, (4.20) becomes E[w§(x)] = H1P1f(xIA = 1) + H2P2f(xIA = 2) (4.23) which is finite for all x. It then follows that 6N is asymptotically normal as defined by the following theorem: THEOREM 4.1. Let the training patterns be represented by discrete random variables. Then DN(x) is asymptotically normal in the sense that, for every real number a, Bch) - D(x) lim Pr 1 < a = 4(3) (4.24) N+u> (VARIWN(x)1/N)é" where ¢(-) is the standard normal cumulative distribution, a _ 2 1 e t /2 d w /§F t. 
(4.25) 4(a) = I PROOF: Since 6N is the average of a sequence of independent, identi- cally distributed random variables with finite mean and variance, the conclusion follows immediately from the Lindberg-Levy Central Limit Theorem [F-l, p. 256]. End of proof. Now consider the case when the patterns are continuous random variables and the gN functions have the form described in Section 3.5. 47 Proving that 6N is asymptotically normal in this case is not as direct as in Theorem 4.1 because the mean and variance of ”N are now both functions of N. The asymptotic normality of ON is given by the follow- ing theorem. THEOREM 4.2. If the estimator 6N is defined by (4.4) with 3N defined as in (3.63), then at every continuity point of f(xIA = k), k = l, 2, UN is asymptotically normal in the sense that, for every real number 3, lim Pr DNOO - E[WN(X)1] f. a NM (VAR1wN(x)]/N1/2 = ¢(a). (4.26) PROOF: Let x be a continuity point of f(xIA = k), k l, 2. From Parzen [P-l, p. 1069] and Loéve [L-2, p. 316], it follows that a suf- ficient condition for (4.26) to hold is that for some 5 > O, Q N E[|WN(x) - 13[wN(x)]|2+€]_> 0 (N VAR1wN(x)1)1*€/2 qN(x) as N + m. (4.27) Now from (4.5) it follows that 2+ '+ v + ' EIIWNI 5] = |cl|2 5P. 512% t"cx. X)|A = 1] + |62|2*€P£ 5(g§+€(x, X)IA' = 21. (4.28) Using Lemma 8.1, it follows from (4.28) that 2""; 9 1+5 +5 hn( )E[IWNI2 ] = {loll P1f(x|A' = 1) 11m N N~>oo + IG2I2+€P;f(xIA' = 2)} f K2+E(y) dv(y). (4.29) Since the conditions on the kernel K guarantee that [C-l, p. 185] f K2+€(y) dv(y) < m, (4.29) implies that 48 1 o < lim h"( +5)12[|w - E[W ]|2*5] < m (4.30) N N N N+oo and o < lim h“ VAR[W ] < m. (4.31) N+ua N N But qN may be written as hn(1+€/2) n(1+E)E[ W - E[W ] 2+5] q...) . N 1N I N N ' . (4.32) Ng/2h§(l+€) (ha VAR[WN])“‘3/2 Thus qN + O as N + w because the first term of (4.32) goes to zero as N + w while the numerator and denominator of the second term remain finite. ‘ End of proof. Theorems 4.1 and 4.2 provide justification for using a normal approximation to Pr[ON(x) :_O] in (4.13). Specifically, for large N . -N%r[wN(x)1 Pr[DN(x) :_O] 3 l - 4 L (VAR[WN(X)])2 N%EIWN(x)J = 4 1 . (4.33) (VARth(x)])e Thus for large N the expected risk may be approximated by N%E[wN(x)] RN(P, f) 2 P11.21 + P7L22 — f D(x) 4 1 dv(x). (4.34) (VAR [WN (X) ] )/2 Even with the large sample approximation, the integral in (4.33) is a formidable expression that cannot be easily evaluated. In the next section, three examples are studied in an attempt to evaluate the effects of an imperfect teacher. For these examples, (4.34) is evalu- ated numerically to study the learning algorithm. 49 4.3 EXAMPLES OF LEARNING Example 1 - Normal Distributions In the first example to be studied, the patterns are assumed to have univariate normal distributions. Under pattern class mi the patterns are normally distributed with mean “i and variance 02. Since the density functions are continuous, the class of estimators described in Section 3.5 will be used for learning a decision rule. The kernel for the estimator is taken to be ..1/ KCY) = (2“) 2 eXp (-y2/2). (4.35) The 3N function is then _ 2 chx. y) = 1 . eXp (- M). (4.36) . hN(2n)/2 2h§ The first and second moments of gN are needed in order to evaluate the approximation for the expected risk. The conditional mean is given by 1 -/2 E[gN(x, X)|A = 1] = f h§1(2n) exp {- (x - z)2/2h§} 1 “’2 o'1(2n) exp {- (z - pi)2/202} dz. (4.37) Completing the squares in the exponentials and integrating the resulting expression gives 1 -/2 EIchx. 
X)|A = i] = [2"ch2 + 1113)] exp {—acx -u-1)2/(02 + 1113)}. (4.38) A similar evaluation for the second moment shows that ., , _; E[gfi(x, X)IA = i] = (ZnhN)"1(202 + hfi) 2 exp {4%(x - u.)2/(o2 + h§/2)}. 1 (4.39) The Bayes discriminant function for this example is found from 50 (4.3) to be D(X) = P1(L;)1 - I:11)(207)-l/2 CXp {"‘ (X "' U1)2/202} - P2(L12 - L22)(202)"% exp {- (x - u2)2/202}- (4.40) The Bayes risk for this example was derived in (2.25). The above relations in conjunction with (4.18) and (4.19) can be used to evaluate, via numerical integration, the approximation for the expected risk given by (4.34). The results of evaluating this approxi— mation are presented in Figures 4.1 thru 4.4. For this example the following parameter values were used: “1 = 1.5, u2 = —l.5, and 02 = 1.0. A zero-one loss function was used for all cases shown. The parameter hN was chosen to have the form hN = N_a. Following the rule of thumb suggested in Section 3.5, a was chosen to be 1/3. Figures 4.1 and 4.2 present the expected risk plotted as a function of 8(811 = 822 = B) for the prior probability of pattern class ml equal to 0.5 and 0.1 respectively. Figure 4.3 shows the same case as Figure 4.1 except that 811 = 1.0. Figure 4.4 shows the expected risk plotted as a fUnction of the number of training patterns. These figures illustrate the effects of an imperfect teacher. Basically, the im- perfect teacher slows down the rate of convergence. For B > 0.9 the effect of the imperfect teacher is small, but for B < 0.7, say, con- vergence of the algorithm is much slower. The dotted line in Figures 4.1, 4.2, and 4.3 is the teacher's risk. This risk is just the teacher's probability of error which is equal to 1 - B. The figures show that for small 8, the learning algorithm becomes better than the teacher as the number of training patterns increases. oé m.o n Hm .mcoflusnfihpmflo HmEHoz :pflz xmwm wouoomxm .H.v oysmwm 3V momma 5.23 as as no so no A u q A _ Coo / 14 Va; 3:3an to \/ Va; rfizuqfi / 82 u z / / OOH N Z l N.O / / / / 1 m.o muwmmugm oHuz / . H / m o u a / o.H u Hms u was 1 3.0 m.o )ISI‘d 03133dX3 52 ~.o u Hm .mcofiusnfiaumfio HmEHoz :pflz xmflm wouoomxm .N.v oHDMfim iv ”5% E53: 04 ad mg no 9o mic / . . a . 0.0 Illllllllllllulllulllullllllndmalmfia / / no oooH u z 1 / / on: 95:23 \// 02 u z / . III .JN o / / // / $5 an Nun... 2m / . H [II n a H o S u z / o.H 1 HN4 1 man Ill - - // 1: o u NN._ u :4 / / / / m5 NSIH 03133dX3 53 Nam x Ham .m:0ap:nfiupmfim HmEuoz spa: xmfim wmpoomxm .m.v masmwu ANNE ENE E139 o; as ma no so no . a a a 0.0 / / IIIIV lllnllllullllllllllllllllllllllulnllnll. 1 Xmam mm>014 2] 4x(hN) 4(2x mum—W 1 4(x 1)¢(--————.hN 1 +4hN [exp {1‘21} .. 2 exp {W} + exp {W}] (4042) - 2h lé 2 (2n) ZhN ZhN N S6 ., 2 /2‘ /2’ - 0.5 Elgh(x1 XllA = 1] = [X¢(——5J - (2x - 1)¢( (x . )) hN/JF hN hN + (x - 1.0)('/§(x " 1'91)] +l-[9XP {33;} hN Tr hN _ _ 2 _ _ 2 - 2 exp { (x 0'5) } + exp { (X 119) }] (4.43) h2 2 N N E19136. X)|A = 21 = $141495) - (2x + 114(59‘ " 0'53) hN n hN hN /2 + 1.0 - 2 + (x + 1.014( (11 )1] + -1-[exp {'35—} hN 7‘ hN .. + 7. _ + 2 - 2 exp { (x ?;5) } + exp { (x 1°Ol‘}]. (4.44) l 2 hN Equations (4.41) thru (4.44) can now be used along with (4.18) and (4.20) to evaluate numerically the expected risk. Figures 4.6 and 4.7 present the results of this evaluation using a zero-one loss function and equal prior probabilities. The results of the simulation in [8-5] are also shown on Figure 4.7. There is reasonable agreement between the simulation and the large sample approximation. 
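The kind of numerical evaluation behind Figures 4.1 through 4.4 can be reproduced along the following lines. This Python sketch is not part of the thesis; it assumes the parameter values of Example 1 (mu1 = 1.5, mu2 = -1.5, sigma^2 = 1, alpha = 1/3), a zero-one loss with beta11 = beta22 = beta, and a simple Riemann sum for the integral in (4.34). The simplification H1 = H2 = (2*beta - 1)^(-2) used below follows from (4.21)-(4.22) and the inverse of the 2 x 2 B matrix for this symmetric zero-one case; it is an assumption of the sketch, not a formula quoted from the text.

```python
import numpy as np
from math import erf, sqrt, pi

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * pi * var)

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def expected_risk_normal(N, beta, p1=0.5, mu1=1.5, mu2=-1.5, var=1.0, alpha=1.0 / 3.0):
    """Large-sample approximation (4.34) of the expected risk R_N for
    Example 1: univariate normal classes, Gaussian kernel, h_N = N^(-alpha),
    zero-one loss, beta11 = beta22 = beta."""
    h = N ** (-alpha)
    p2 = 1.0 - p1
    x = np.linspace(-8.0, 8.0, 2001)               # integration grid
    dx = x[1] - x[0]

    # Bayes discriminant (4.40) with zero-one loss: D(x) = P1 f(x|1) - P2 f(x|2)
    D = p1 * normal_pdf(x, mu1, var) - p2 * normal_pdf(x, mu2, var)

    # Conditional moments of g_N(x, X), equations (4.38) and (4.39)
    Eg1 = normal_pdf(x, mu1, var + h * h)
    Eg2 = normal_pdf(x, mu2, var + h * h)
    Eg1sq = np.exp(-(x - mu1) ** 2 / (2.0 * var + h * h)) / (2.0 * pi * h * np.sqrt(2.0 * var + h * h))
    Eg2sq = np.exp(-(x - mu2) ** 2 / (2.0 * var + h * h)) / (2.0 * pi * h * np.sqrt(2.0 * var + h * h))

    # Moments of W_N(x): the mean (4.18) does not involve the teacher; for
    # this symmetric zero-one case H1 = H2 = 1/(2*beta - 1)^2 (see lead-in).
    mean_W = p1 * Eg1 - p2 * Eg2
    H = 1.0 / (2.0 * beta - 1.0) ** 2
    var_W = np.maximum(H * (p1 * Eg1sq + p2 * Eg2sq) - mean_W ** 2, 1e-300)

    # Pr[D_N(x) >= 0] ~ Phi(sqrt(N) E[W_N] / sqrt(Var[W_N])), equation (4.33)
    phi = np.array([normal_cdf(z) for z in sqrt(N) * mean_W / np.sqrt(var_W)])

    # R_N ~ P1*L21 + P2*L22 - integral of D(x)*Phi(.), equation (4.34);
    # with the zero-one loss the constant term is just P1.
    return p1 - float(np.sum(D * phi) * dx)

# e.g. expected_risk_normal(100, 0.7) can be compared with the
# perfect-teacher value expected_risk_normal(100, 1.0).
```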
Example 3 - Binomial Distributions For the final example, suppose that the patterns are discrete random variables that have binomial distributions under both pattern classes, f(xIA = j) = (919391 - ej)c"‘, x = 0, 1, ..., c, j = 1, 2.(4.45) For this discrete case, the estimator 6N is formed according to (4.4) with the gN function defined by (3.37). 57 o.H mofluwmcoo Hmasmcmflhh :ufiz xmflm wouoomxm av mos; ESE .o.¢ magma; m.o m.o “.0 o.o m.o . _ I] d 0.0 ooofi 2 xxx 1 H.o 002 z 1 N.o OH 1 Q1 NNQ 1 Zn m.o 1 HQ 1 m o o.H 1 Hms 1 mas o 1 was 1 Has 1. «.0 m.o NSIH 03133dX3 58 newumfisefim paw cowumsfixOHam< onEmw ownmq mo comfiummeou .n.v oHSMfim § 3% 55,9 o.H m.o w.o 5.0 0.0 m.o ql‘fl _’ 4 _ I II II: ,1, onpquxomaa< mqazoo with Q = f K2(y) dv(y) and H1 and “2 defined by (4.21) and (4.22) respectively. PROOF: Lemma B.l implies that at every continuity point x of the density functions lim hn E[g2(x, X)|A = k] = Qf(xlA = k). (4.56) N-Hzo N N Now from (4.4) Nhg VAR[6N(x)] = hfi E[w§(x)] — h; E2[WN(x)]. (4.57) Equation (4.54) and the fact that hN + 0 as N + w imply that the second term of (4.57) converges to zero as N + w. Thus substituting (4.20) into (4.57) and using (4.56) gives the desired conclusion. End of proof. In view of Lemmas 4.1 and 4.2, it would seem reasonable to expect that DN(x) is asymptotically normal with mean D(x) and variance 0% = [H1P1f(xIA = 1) + H2P2f(xIA = 2)]Q/Nh“. (4.58) Showing that such a condition holds requires some further restrictions on the rate of convergence of hN to zero. Lemma 8.2 leads to the neces- sary restrictions. In view of Theorem 4.2, it is clear that ¢(a) (4.59) 1im Pr[(DN(x) — D(x))/0N :.a] N+oo provided VAR[6N(x)]/o§ +-1 as N + m (4.60) and 63 (E[DN(x)] - D(x))/0N + 0 as N + m. (4.61) .Lemma 4.2 guarantees that (4.60) is satisfied. Also 515 (X) - D(x)] h'2{E 6 (x)] - D( )) -% N = (Nhn+4)%, N [ N x Q 1 . (4.62) 0N N [H1P1fcxlA = 1) + H2P2f(x|A = 2)]/2 So from (4.4), (4.18) and Lemma B.2, it follows that condition (4.61) will be satisfied if Nh3+” + 0 as N + m. This is then the additional condition on hN required for proving that 6N is asymptotically normal in the sense of the following theorem. THEOREM 4.3. In the continuous case if 6N is defined by (4.4), if f(x|A = k), k = 1, 2, have continuous partial derivatives of third order a.e., and if hN satisfies the conditions Nh§+u + 0, Nhg + w as N + m, (4.63) then DN(x) is asymptotically normal in the sense that, for every real number a, 1im Pr[(DN(x) — D(x))/ON(X) :_a] = 4(a) (4.64) N+oo with oN(x) defined by (4.58). In the remainder of this chapter, the conditions of Theorem 4.3 are assumed to be satisfied. In particular hN is taken to be hN = N'a, 1/(n + 4) < a < l/n. For sufficiently large N, the expected risk may be approximated by RN(P, f) z P1L21 + P2L22 - f 0(x)4(t(x)) de(x) (4.65) with t(x) = NCl'na)/20(x)Q'5[H1plf(xIA = 1) + H2P2f(x|A = 2)]-%. (4.66) 64 In the three examples in Section 4.3 it was observed that the expected risks were decreasing functions of N and 8. The following two theorems show that, as one would expect for large N, RN is always a decreasing function of N, 811, and 822. THEOREM 4.4. For large N, the expected risk RN is a strictly decreasing function of N. PROOF: Treat N as a real variable. Then from (4.65) it follows that -f (2n)-%D(x) exp {-t2(x)/2}8t(x)/8N dv(x) BRN/BN ..1 ..1 /2 /2 -%(2n) (1 - na)N'na/2 I D2(X) exp {‘t2(X)/2}Q - [H1P1f(xIA = 1) + H2P2f(xIA = 2)]‘1/2 dv(x) < 0 (4.67) since (1 - na) > O for a < l/n and the integrand is always nonnegative. End of proof. 
THEOREM 4.5. For large N, the expected risk R_N is a strictly decreasing function of β_11 and β_22.

PROOF: For large N, (4.65) implies that

∂R_N/∂β_11 = -∫ (2π)^{-1/2} D(x) exp{-t^2(x)/2} ∂t(x)/∂β_11 dν(x).   (4.68)

But

∂t(x)/∂β_11 = -(1/2) N^{(1-na)/2} D(x) {(∂H_1/∂β_11) P_1 f(x|Λ = 1) + (∂H_2/∂β_11) P_2 f(x|Λ = 2)} / {Q^{1/2}[H_1 P_1 f(x|Λ = 1) + H_2 P_2 f(x|Λ = 2)]^{3/2}}.   (4.69)

By differentiating (4.21) and (4.22) with respect to β_11, one can show that

∂H_1/∂β_11 = -[1 - β_11 + β_22(2β_11 - 1)](L_21 - L_11 + L_12 - L_22)^2 / (β_11 + β_22 - 1)^2 < 0   (4.70)

∂H_2/∂β_11 = -[(1 - β_11)(1 - β_22) + β_11 β_22](L_21 - L_11 + L_12 - L_22)^2 / (β_11 + β_22 - 1)^2 < 0.   (4.71)

Substitution of (4.69), (4.70), and (4.71) into (4.68) then shows that ∂R_N/∂β_11 < 0. Thus R_N is a strictly decreasing function of β_11. A similar argument holds for β_22. End of proof.

One should note that even though Theorems 4.4 and 4.5 were stated and proved for the continuous case, they also hold for the discrete case. This follows immediately by comparing (4.64) with (4.23) and (4.24).

4.5 A MEASURE OF PERFORMANCE

The problem that one encounters in trying to develop any quantitative measure of the effects of the imperfect teacher is that the behavior of the learning algorithm depends upon several factors: a.) the class conditional densities, b.) the prior probabilities, c.) the loss function, d.) the B matrix, and e.) the form of the g_N function. Any measure of performance which one defines will generally be a function of all of these parameters. What one would like for comparing a perfect teacher with an imperfect teacher is a measure which is insensitive to all of the parameters except the B matrix.

A second problem that one encounters is that an analytic evaluation of the expected risk is virtually impossible except in special cases. Even then, numerical methods and the large sample approximations developed in the previous sections must be used. In view of these considerations, the measure of performance proposed in this section will also be evaluated from the large sample approximations.

The criterion proposed here for comparing an imperfect teacher to a perfect teacher is defined as follows. For a given number N_0 of training patterns, the nonparametric learning procedure with a perfect teacher has some expected risk, call it R_{N_0}. Define N(B, N_0) to be the number of training patterns required for the learning algorithm with an imperfect teacher to have the same expected risk. Then a measure of relative performance may be defined as

η(B, N_0) = N(B, N_0)/N_0.   (4.72)

In general this measure of relative performance will be a function of N_0 as well as a function of the five factors listed earlier. But η does provide a quantitative measure of the effect of the imperfect teacher in the sense that it measures the additional number of training patterns required to compensate for the imperfect teacher.

The dependence of (4.72) on N_0 may be removed by defining an asymptotic measure of performance according to

η(B) = lim_{N_0→∞} η(B, N_0)   (4.73)

provided the limit exists. In general the evaluation of (4.73) is an intractable problem. But for one very important case, evaluation of (4.73) is rather direct.

Consider the case when a zero-one loss function is used and β_11 = β_22 = β. Then for large N, the expected risk for the continuous case is given by (4.65). Now suppose an N_0 is chosen and (4.65) is evaluated for the perfect teacher (β = 1) to obtain R_{N_0}.
Let R_N(β) denote the expected risk when using N training patterns with an imperfect teacher that has a probability β of correctly classifying a training pattern. To find N(β, N_0), one must then solve the equation

R_N(β) = R_{N_0}(1)   (4.74)

for N. One way in which (4.74) is satisfied is for the function t(·) of (4.66) to be the same for both R_N and R_{N_0}. But this will occur if

(2β - 1) N^{(1-na)/2} = N_0^{(1-na)/2}.   (4.75)

Thus

N(β, N_0) = N_0 / (2β - 1)^{2/(1-na)}.   (4.76)

Since by Theorems 4.4 and 4.5 R_N is a strictly decreasing function of N and β, (4.76) is the unique solution of (4.74). Hence in this case the measure of relative performance is

η(β) = (2β - 1)^{-2/(1-na)}.   (4.77)

This measure of performance has the pleasant property that it is not a function of the class-conditional densities or prior probabilities, and it depends upon the g_N function only through the parameter a. Thus for a zero-one loss function, η(β) in (4.77) is interpreted as the proportionate number of training patterns required for an imperfect teacher to yield the same expected risk as a perfect teacher.

Even when β_11 ≠ β_22 but a zero-one loss function is still used, (4.77) provides a meaningful bound. Since R_N is a decreasing function of β_11, β_22, and N,

η(max(β_11, β_22)) ≤ η(β_11, β_22) ≤ η(min(β_11, β_22)).   (4.78)

Thus using min(β_11, β_22) in (4.77) yields an upper bound on the actual measure of relative performance, and using max(β_11, β_22) gives a lower bound.

In Figure 4.9 the measure η(β) is plotted as a function of β for n = 1 and a = 0.2, 0.33, 0.5, and 0.65. The figure shows, for example, that with a = 1/2 the algorithm requires roughly 2.5 times as many training patterns when β = 0.9 as compared to a perfect teacher. When β = 0.8, 7.6 times as many training patterns are required; and when β = 0.7, nearly 40 times as many training patterns are required for the imperfect teacher to match the performance of the perfect teacher. Thus a poor teacher is costly in the sense that many more training patterns are required to achieve a performance equal to that achieved with a perfect teacher.

[Figure 4.9. Measure of Relative Performance]
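Equation (4.77) is simple enough to tabulate directly. The short sketch below (Python, assuming n = 1 as in Figure 4.9) reproduces the factors quoted above for a = 1/2; it is an illustration added here, not a computation taken from the thesis.

    def eta(beta, n=1, a=0.5):
        """Relative training-set size (4.77): patterns needed with an imperfect teacher
        (beta_11 = beta_22 = beta, zero-one loss, h_N = N^(-a)) relative to a perfect teacher."""
        return (2.0 * beta - 1.0) ** (-2.0 / (1.0 - n * a))

    for beta in (0.9, 0.8, 0.7):
        print(beta, round(eta(beta), 1))   # about 2.4, 7.7, 39.1 -> the 2.5x, 7.6x, 40x factors above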
4.6 COST OF TRAINING

The idea of a cost associated with an imperfect teacher can be further developed by assuming that the teacher incurs a cost in classifying each training pattern. Assume that the cost of classifying a training pattern is an increasing function of β_11 and β_22. Let T(B) denote this cost function.

Suppose that one wishes to attain a given expected risk R* with minimum cost. If a perfect teacher is used, N* training patterns will be required to achieve R* at a total cost of N*T(I), I being the unit matrix. For an imperfect teacher, the number of training patterns required to achieve R* is approximately N*η(B, N*) and the cost incurred is

C = N* η(B, N*) T(B).   (4.79)

Differentiating (4.79) with respect to β_ii and equating to zero gives

N*[(∂η(B, N*)/∂β_ii) T(B) + η(B, N*) (∂T(B)/∂β_ii)] = 0,   i = 1, 2.   (4.80)

Solving this set of simultaneous equations for β_11 and β_22 gives the B matrix that minimizes C whenever the solution satisfies 1/2 < β_ii < 1.

As an example, suppose that a zero-one loss function is used and that β_11 = β_22 = β. Then η(β) is given by (4.77). Let the cost of classification have the form (see Figure 4.10)

T(β) = a_1 + a_2(2β - 1)^ξ.   (4.81)

Then

dC/dβ = 2N*(2β - 1)^{-2/(1-na) - 1} {a_2[ξ - 2/(1 - na)](2β - 1)^ξ - 2a_1/(1 - na)}   (4.82)

and the minimum cost occurs at

β* = 1/2 + (1/2) [2a_1 / (a_2[ξ(1 - na) - 2])]^{1/ξ}   (4.83)

provided 1/2 < β* < 1. Otherwise the minimum occurs for a perfect teacher. From (4.83) it follows that a perfect teacher is best whenever

ξ ≤ 2(1 + a_1/a_2)/(1 - na).   (4.84)

Figure 4.11 shows the relative cost plotted as a function of β for several parameter values. For this figure n = 1 and a = 1/3. Figure 4.11 indicates that when there is a cost associated with classifying the training patterns, a perfect teacher is not necessarily the best. An imperfect teacher may provide an acceptable level of learning at a lower cost than a perfect teacher.

[Figure 4.10. Teacher's Cost for Classification]

[Figure 4.11. Relative Cost of Training]
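The cost trade-off of (4.79)-(4.84) can be checked numerically. The sketch below (Python, assuming n = 1, a = 1/3, and the cost form (4.81); the values chosen for a_1, a_2, and ξ are illustrative only, not the figure's exact parameters) computes the cost-minimizing teacher quality β* of (4.83) and compares the relative cost C/N* at β* with that of a perfect teacher.

    def relative_cost(beta, a1, a2, xi, n=1, a=1.0/3.0):
        """Relative cost C/N* = eta(beta) * T(beta), from (4.77), (4.79), and (4.81)."""
        eta = (2.0 * beta - 1.0) ** (-2.0 / (1.0 - n * a))
        T = a1 + a2 * (2.0 * beta - 1.0) ** xi
        return eta * T

    def beta_star(a1, a2, xi, n=1, a=1.0/3.0):
        """Interior minimizer (4.83); meaningful only when xi > 2(1 + a1/a2)/(1 - n*a), cf. (4.84)."""
        return 0.5 + 0.5 * (2.0 * a1 / (a2 * (xi * (1.0 - n * a) - 2.0))) ** (1.0 / xi)

    a1, a2, xi = 0.01, 1.0, 6.0        # xi exceeds 2(1 + a1/a2)/(1 - na) = 3.03, so (4.83) applies
    b = beta_star(a1, a2, xi)
    print(b, relative_cost(b, a1, a2, xi), relative_cost(1.0, a1, a2, xi))
    # Here the imperfect teacher (beta* near 0.73) reaches the target risk at a noticeably
    # lower total cost than the perfect teacher, the behavior summarized by Figure 4.11.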
CHAPTER V

CONCLUSIONS

This chapter summarizes the main results of the thesis and discusses possibilities for future research.

5.1 SUMMARY

This thesis has been concerned with studying nonparametric procedures for learning to recognize patterns with an imperfect teacher. The statistical approach to pattern recognition was taken, and statistical decision theory was used as the tool for developing and evaluating learning algorithms. The objective was to study procedures for learning a Bayes decision rule using training patterns, some of which were misclassified by an imperfect teacher.

In Chapter II, the general M-class statistical pattern recognition problem was outlined and a model for an imperfect teacher was proposed. The imperfect teacher was characterized by a matrix of conditional error probabilities. The main result of Chapter II was the development of a set of expressions relating the probability distributions of the perfect teacher with those of the imperfect teacher. These relations were one of the key factors used for developing in Chapter III learning procedures that required less restrictive assumptions than in previous studies. An example was also presented to show the asymptotic effects of using an imperfect teacher with an algorithm designed for supervised learning. It was concluded that misclassified training patterns could prevent the convergence of supervised learning algorithms and that procedures are needed for use specifically with imperfect teachers.

In Chapter III a class of nonparametric procedures was proposed for learning with an imperfect teacher. The procedures required prior knowledge only of the nonsingular matrix of error probabilities characterizing the teacher and of whether the pattern random variable was discrete or continuous. The main result of the chapter was formal proofs of the convergence of the estimated decision rules to the Bayes rule in both the discrete case and the continuous case. The procedures were asymptotically optimal in the sense that the expected risk of the decision procedures converged to the Bayes risk with increasing number of training patterns. Theorems were also given establishing rates of convergence of the expected risk. These rate theorems provided guidelines for choosing the parameters in the estimators.

The two-class problem was studied in more detail in Chapter IV. The expected risk was expressed in a convenient form as a function of the Bayes discriminant function and of the probability that the estimated discriminant function was nonnegative. A large sample approximation was then developed to evaluate the expression for the expected risk. This approximation was used to study three examples of learning. The examples indicated that an imperfect teacher could have a significant effect on the rate of learning. When the teacher's error rate was high, learning occurred at a much slower rate than when a perfect teacher was used. The examples also indicated that the learning procedure would eventually perform better than a poor teacher.

A second large sample approximation was derived and used to establish several large sample properties of the expected risk. It was shown that the expected risk was a strictly decreasing function of the number of training patterns and of the teacher's correct-classification probabilities β_ii.

A measure of relative performance was then proposed for quantitatively comparing an imperfect teacher with a perfect teacher. This performance measure was a measure of the additional number of training patterns required to compensate for an imperfect teacher. The measure was readily evaluated for the case of a zero-one loss function. For this important case, the measure was not a function of the underlying probability distributions. Thus it provided a very useful criterion for evaluating the effects of the imperfect teacher. It was concluded that a poor teacher was costly in the sense that many more training patterns were required to achieve the same expected risk as obtained with a perfect teacher.

By assigning a cost for classifying the training patterns, an overall cost of an imperfect teacher was derived. It was shown that when the cost of training is proportional to the quality of the teacher, an imperfect teacher may be preferred. By using an imperfect teacher, a learning system may be able to achieve a given level of performance at less overall cost than with a perfect teacher.

5.2 EXTENSIONS

The work in this thesis suggests several possibilities for future research. The concept of an imperfect teacher can be extended to many areas of pattern recognition.

This thesis has been concerned with one general class of learning procedures. One can also look at several other types of supervised learning algorithms and attempt to develop analogous algorithms for learning with an imperfect teacher. For example, one might develop algorithms based on stochastic approximation, the method of potential functions, the k-nearest neighbor rule, or the various mean-square estimation methods for discriminant functions. All of these techniques for supervised learning will be affected by an imperfect teacher, and consequently they should be altered for use with an imperfect teacher.

The model proposed in Section 2.2 for the imperfect teacher could also be developed further. A natural extension would be to assume that the teacher's error probabilities are a function of the observed values of the training patterns. One might also consider a situation in which the teacher improves with time and experience.

It was assumed throughout this thesis that the matrix of error probabilities characterizing the teacher was known. An interesting problem would be to investigate conditions under which learning could occur without knowledge of the matrix of error probabilities. More structure would necessarily have to be assumed known about the probability distributions. This problem would be very close to one of unsupervised learning.
In Sections 4.5 and 4.6 measures of performance were proposed for quantitatively evaluating the effects of an imperfect teacher. These measures were chosen because they were intuitively appealing and mathematically tractable. Further work is needed in developing other measures of performance. Also the idea of a cost for the teacher could be pursued further.

Finally, one might consider ways of gradually phasing the teacher out of the learning process. It was observed in the examples of Section 4.3 that the learning procedures eventually perform better than a poor teacher. One wonders whether the learned decision rules could eventually be used to classify the training patterns and thus eliminate the teacher. This might speed up the learning process since the learning procedure eventually has a smaller expected risk than the teacher. Eliminating the teacher would also be desirable when there is a cost of training associated with the teacher as in Section 4.6.

BIBLIOGRAPHY

[A-1] N. Abramson and D. Braverman, "Learning to recognize patterns in a random environment," IRE Trans. Inform. Theory, vol. IT-8, pp. 58-63, 1962.

[A-2] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, "Method of potential functions in the problem of restoration of functional converter characteristic by means of points observed randomly," Autom. and Remote Control, vol. 25, pp. 1705-1714, 1964.

[B-1] C. C. Blaydon, "Recursive algorithms for pattern classification," Rept. No. 520, Division of Engineering and Applied Physics, Harvard Univ., Cambridge, Mass., 1967.

[B-2] I. Bross, "Misclassification in 2 x 2 tables," Biometrics, vol. 10, pp. 478-486, 1954.

[C-1] T. Cacoullos, "Estimation of a multivariate density," Ann. Inst. Statist. Math., vol. 18, pp. 179-189, 1966.

[C-2] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inform. Theory, vol. IT-13, pp. 21-27, Jan. 1967.

[F-1] W. Feller, An Introduction to Probability Theory and Its Applications, vol. II. New York: Wiley, 1966.

[F-2] T. S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press, 1967.

[F-3] S. C. Fralick, "Learning to recognize patterns without a teacher," IEEE Trans. Inform. Theory, vol. IT-13, pp. 57-65, Jan. 1967.

[F-4] K. S. Fu, Sequential Methods in Pattern Recognition and Machine Learning. New York: Academic Press, 1968.

[F-5] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic Press, 1972.

[H-1] Y. C. Ho and A. K. Agrawala, "On pattern classification algorithms, introduction and survey," Proc. IEEE, vol. 56, pp. 2101-2114, Dec. 1968.

[H-2] Y. C. Ho and R. L. Kashyap, "An algorithm for linear inequalities and its applications," IEEE Trans. Electronic Computers, vol. EC-14, pp. 683-688, Oct. 1965.

[I-1] IEEE Transactions on Computers, vol. C-20, Sept. 1971 (special issue on feature extraction).

[I-2] B. G. Ishaku, "Feature extraction and φ-function selection in φ-systems," Ph.D. dissertation, Mich. State Univ., East Lansing, Mich., Oct. 1971.

[K-1] L. N. Kanal, Ed., Pattern Recognition. Washington, D.C.: Thompson, 1968.

[L-1] P. A. Lachenbruch, "Discriminant analysis when the initial samples are misclassified," Technometrics, vol. 8, pp. 657-662, Nov. 1966.

[L-2] M. Loeve, Probability Theory, 2nd ed. Princeton, N.J.: Van Nostrand, 1955.

[M-1] W. S. Meisel, "Potential functions in mathematical pattern recognition," IEEE Trans. Comput., vol. C-18, pp. 911-918, Oct. 1969.
[M-2] J. M. Mendel and K. S. Fu, Ed., Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications. New York: Academic Press, 1970.

[M-3] A. M. Mucciardi and E. E. Gose, "A comparison of seven techniques for choosing subsets of pattern recognition properties," IEEE Trans. Comput., vol. C-20, pp. 1023-1031, Sept. 1971.

[M-4] V. K. Murthy, "Nonparametric estimation of multivariate densities with applications," in Proc. Int. Symp. on Multivariate Analysis, P. Krishnaiah, Ed. New York: Academic Press, 1966, pp. 43-56.

[N-1] G. Nagy, "State of the art in pattern recognition," Proc. IEEE, vol. 56, pp. 836-862, May 1968.

[N-2] N. J. Nilsson, Learning Machines. New York: McGraw-Hill, 1965.

[P-1] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Statist., vol. 33, pp. 1065-1076, 1962.

[P-2] E. A. Patrick and J. C. Hancock, "Nonsupervised sequential classification and recognition of patterns," IEEE Trans. Inform. Theory, vol. IT-12, pp. 362-372, July 1966.

[P-3] M. L. Puri and P. K. Sen, Nonparametric Methods in Multivariate Analysis. New York: Wiley, 1971.

[R-1] H. Robbins, "The empirical Bayes approach to statistical decision problems," Ann. Math. Statist., vol. 35, pp. 1-20, 1964.

[R-2] H. L. Royden, Real Analysis, 2nd ed. New York: Macmillan, 1968.

[S-1] S. C. Schwartz, "Convergence of risk in adaptive pattern recognition procedures," Proc. 5th Allerton Conf. on Circuits and Systems Theory, pp. 800-806, 1967.

[S-2] S. C. Schwartz, "Estimation of probability density by an orthogonal series," Ann. Math. Statist., vol. 38, pp. 1261-1265, 1967.

[S-3] G. S. Sebestyen, Decision-Making Processes in Pattern Recognition. New York: Macmillan, 1962.

[S-4] K. Shanmugam and A. M. Breipohl, "An error correcting procedure for learning with an imperfect teacher," IEEE Trans. Systems, Man, and Cybernetics, vol. SMC-1, pp. 223-229, July 1971.

[S-5] K. Shanmugam, "Learning to recognize patterns with an imperfect teacher," Ph.D. dissertation, Oklahoma State Univ., Stillwater, Okla., 1970.

[S-6] K. Shanmugam, "A parametric procedure for learning with an imperfect teacher," IEEE Trans. Inform. Theory (Corresp.), vol. IT-18, pp. 300-302, March 1972.

[S-7] D. F. Specht, "Generation of polynomial discriminant functions for pattern recognition," IEEE Trans. Electron. Comput., vol. EC-16, pp. 308-319, June 1967.

[S-8] D. F. Specht, "Series estimation of a probability density function," Technometrics, vol. 13, pp. 409-424, May 1971.

[S-9] J. Spragins, "Learning without a teacher," IEEE Trans. Inform. Theory, vol. IT-11, pp. 544-549, April 1966.

[T-1] A. Tenenbein, "A double sampling scheme for estimating from misclassified multinomial data with applications to sampling inspection," Technometrics, vol. 14, pp. 187-202, Feb. 1972.

[V-1] J. Van Ryzin, "Non-parametric Bayesian decision procedures for (pattern) classification with stochastic learning," Fourth Prague Conf. on Inform. Theory, Statistical Decision Functions, and Random Processes, 1965.

[V-2] J. Van Ryzin, "Bayes risk consistency of classification procedures using density estimation," Sankhya, Series A, vol. 28, pp. 261-270, 1966.

[V-3] J. Van Ryzin, "On strong consistency of density estimators," Ann. Math. Statist., vol. 40, pp. 1765-1772, 1969.

[W-1] S. Watanabe, Ed., Methodologies of Pattern Recognition. New York: Academic Press, 1969.

[W-2] A. W. Whitney and S. J. Dwyer, "Performance and implementation of the k-nearest neighbor rule with incorrectly identified training patterns," Proc. 4th Allerton Conf. on Circuit and System Theory, pp. 96-106, 1966.
[W-3] C. T. Wolverton and T. J. Wagner, "Asymptotically optimal discriminant functions for pattern classification," IEEE Trans. Inform. Theory, vol. IT-15, pp. 258-265, March 1969.

[Y-1] S. J. Yakowitz, "Unsupervised learning and the identification of finite mixtures," IEEE Trans. Inform. Theory, vol. IT-16, pp. 330-338, May 1970.

APPENDICES

APPENDIX A

OPTIMAL DECISION RULES

This appendix outlines the elements of statistical decision theory that are used in this thesis. The results, which are taken from the literature [F-2], [R-1], are presented from the viewpoint of pattern classification.

A problem in statistical decision theory consists of the following basic elements:

a.) A state space Ω with generic element ω representing a "state of nature."
b.) An action space A with generic element α representing an action available to the decision maker.
c.) A loss function L(α, ω) defined on A × Ω and representing the loss incurred in taking action α when the true state of nature is ω.
d.) An observable random variable X belonging to a space T on which a σ-finite measure ν is defined. When the true state of nature is ω, X has a specified probability density f(·|ω) with respect to ν.

The type of decision making problem of concern here is the statistical classification problem. In classification problems the state space is a finite set, Ω = {ω_1, ω_2, ..., ω_M}. The action space is A = {α_1, α_2, ..., α_M}, where action α_i is "say ω_i was active to produce the observed X."

The decision rules that are convenient for classification problems are the so-called behavioral decision rules [F-2, pp. 198]. Let δ(x) = (δ_1(x), ..., δ_M(x)) be a vector of real-valued measurable functions on T such that Σ_{j=1}^{M} δ_j(x) = 1, δ_j(x) ≥ 0. Denote the class of all such functions by Δ. Then any δ ∈ Δ is a behavioral decision rule with δ_j(x) = Pr[α_j|x] identified as the conditional probability of classifying X as coming from ω_j given that X = x is observed. In other words, a behavioral decision rule assigns to each x ∈ T a probability distribution on A.

A Bayesian strategy for classification involves the notion of a prior distribution on the state space Ω. This distribution is denoted by P = (P_1, P_2, ..., P_M) with P_i being the prior probability assigned to ω_i. For any decision rule δ ∈ Δ, the average risk associated with state of nature ω_i ∈ Ω is defined as

r(δ, ω_i) = E[Σ_{j=1}^{M} L(α_j, ω_i) δ_j(X) | ω_i]   (A.1)

where E[·|ω_i] denotes expectation with respect to the conditional density f(·|ω_i). The overall risk of misclassification with decision rule δ under the prior distribution P and conditional densities f = (f(·|ω_1), ..., f(·|ω_M)) is defined by

R(P, f, δ) = Σ_{i=1}^{M} P_i r(δ, ω_i).   (A.2)

A decision rule δ_B ∈ Δ is said to be a Bayes rule with respect to the prior distribution P if and only if it satisfies

R(P, f, δ_B) = inf_{δ ∈ Δ} R(P, f, δ).   (A.3)

Thus a Bayes decision rule minimizes the overall risk of misclassification. Substituting (A.1) into (A.2) gives

R(P, f, δ) = Σ_{i=1}^{M} P_i ∫_T Σ_{j=1}^{M} L(α_j, ω_i) δ_j(x) f(x|ω_i) dν(x)
           = Σ_{j=1}^{M} ∫_T [Σ_{i=1}^{M} P_i L(α_j, ω_i) f(x|ω_i)] δ_j(x) dν(x).   (A.4)

A choice of δ minimizing the risk in (A.4) is seen to be any behavioral decision rule δ_B = (δ_B1, δ_B2, ..., δ_BM) satisfying

δ_Bj(x) = 1    if D_j(x) < min_{k≠j} D_k(x)
        = 0    if D_j(x) > min_{k≠j} D_k(x)
        = γ_j  if D_j(x) = min_{k≠j} D_k(x)   (A.5)

where

D_k(x) = Σ_{i=1}^{M} P_i L(α_k, ω_i) f(x|ω_i),   k = 1, 2, ..., M   (A.6)

and where {γ_j} is such that δ_B is a measurable function, γ_j ≥ 0, and Σ_{j=1}^{M} δ_Bj(x) = 1.
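The Bayes rule of (A.5)-(A.6) is direct to implement once the priors, loss matrix, and class-conditional densities are in hand. The sketch below (Python; the two-class normal densities and parameter values are illustrative stand-ins, not taken from the thesis) computes D_k(x) and selects the action with the smallest value, with ties broken arbitrarily, which corresponds to one admissible choice of the {γ_j}.

    import numpy as np
    from scipy.stats import norm

    # Illustrative two-class setup: equal priors, zero-one loss, normal class densities.
    P = np.array([0.5, 0.5])                        # prior probabilities P_i
    L = np.array([[0.0, 1.0],                       # L[k, i] = loss of action a_k when w_i is true
                  [1.0, 0.0]])
    densities = [lambda x: norm.pdf(x, 1.5, 1.0),   # f(x | w_1)
                 lambda x: norm.pdf(x, -1.5, 1.0)]  # f(x | w_2)

    def bayes_action(x):
        """Return the index k minimizing D_k(x) of (A.6)."""
        f = np.array([d(x) for d in densities])
        D = L @ (P * f)                             # D_k(x) = sum_i P_i L(a_k, w_i) f(x | w_i)
        return int(np.argmin(D))

    # Index 0 corresponds to "say w_1", index 1 to "say w_2".
    print([bayes_action(x) for x in (-2.0, -0.1, 0.1, 2.0)])   # expect [1, 1, 0, 0]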
The Bayes risk is the risk incurred by a Bayes decision rule. Substituting (A.6) into (A.4) results in the following expression for the Bayes risk:

R_B(P, f) = Σ_{j=1}^{M} ∫_T D_j(x) δ_Bj(x) dν(x).   (A.7)

The notation in (A.7) emphasizes the dependence of R_B on the prior distribution P and conditional densities f.

The identification of the above statistical classification problem with pattern recognition is immediate. The state space corresponds to the set of pattern classes, T to the feature space, and X to the pattern to be classified. When the pattern is discrete valued, ν is taken to be counting measure; when X is real valued, ν is taken to be Lebesgue measure on Euclidean n-space. The optimal decision rules for pattern recognition are then the Bayes rules given by (A.5).

APPENDIX B

NONPARAMETRIC ESTIMATION OF DENSITY FUNCTIONS

This appendix presents a nonparametric method of estimating a probability density function. The method was first proposed by Parzen [P-1] for estimating a univariate density and later extended to multivariate densities by Cacoullos [C-1] and Murthy [M-4].

Let X_1, X_2, ..., X_N be N independent observations on an n-dimensional random variable X with density function f(x). Let K(y) = K(y_1, y_2, ..., y_n) be a Borel scalar function on Euclidean n-space E_n such that

K(y) ≥ 0   (B.1a)

∫_{E_n} K(y) dy = 1   (B.1b)

sup_{y ∈ E_n} K(y) < ∞   (B.1c)

||y||^n K(y) → 0 as ||y|| → ∞.   (B.1d)

The function K(·) is called the kernel of the estimator. Define a function g_N on E_n × E_n by

g_N(x, y) = (1/h_N^n) K((x - y)/h_N)   (B.2)

with {h_N} a sequence of positive constants satisfying

h_N → 0 as N → ∞   (B.3a)

N h_N^n → ∞ as N → ∞.   (B.3b)

Then the nonparametric estimator for the density function f(x) is defined as

f_N(x) = (1/N) Σ_{i=1}^{N} g_N(x, X_i).   (B.4)

Cacoullos [C-1] has shown that the estimator f_N is asymptotically unbiased and consistent at all points of continuity of f. There are many kernels which satisfy (B.1). Typical examples of univariate kernels are shown in Table B.1. Specht [S-7], [S-8] has investigated methods for approximating the Gaussian kernel (entry 3 of Table B.1) to obtain estimators which have fixed storage requirements.

In this dissertation the density estimator of (B.4) is not directly used; instead, linear combinations of the g_N functions are used to form estimators of discriminant functions. The following two lemmas concerning the g_N function are proven in [C-1]:

LEMMA B.1. Let g_N be a function on E_n × E_n defined by (B.1) thru (B.3). Then at every continuity point of a density function f(x),

lim_{N→∞} h_N^{n(r-1)} E[g_N^r(x, X)] = f(x) ∫_{E_n} K^r(y) dy.   (B.5)

LEMMA B.2. Let g_N be defined as above. If a probability density function f(x) has continuous partial derivatives of third order in a neighborhood of x, then

lim_{N→∞} h_N^{-2}{E[g_N(x, X)] - f(x)} = λ/2   (B.6)

where

λ = Σ_{i=1}^{n} Σ_{j=1}^{n} (∂^2 f(x)/∂x_i ∂x_j) ∫ y_i y_j K(y) dy.   (B.7)

TABLE B.1. UNIVARIATE KERNELS

    K(y)                                        ∫ K^2(y) dy
 1. 1/2 for |y| ≤ 1, 0 for |y| > 1              1/2
 2. 1 - |y| for |y| ≤ 1, 0 for |y| > 1          2/3
 3. (2π)^{-1/2} exp(-y^2/2)                     1/(2√π)
 4. (1/2) exp(-|y|)                             1/4
 5. 1/[π(1 + y^2)]                              1/(2π)
 6. (1/π)(sin y / y)^2                          2/(3π)
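The estimator (B.4) takes only a few lines of code. The sketch below (Python; a univariate illustration written for this appendix, not code from the thesis) uses the Gaussian kernel of Table B.1, entry 3, with h_N = N^{-1/3}, which satisfies (B.3).

    import numpy as np

    def parzen_estimate(x, samples, h):
        """Kernel density estimate (B.4) at the points x from the observations,
        using the Gaussian kernel (Table B.1, entry 3) and bandwidth h."""
        x = np.atleast_1d(x)
        u = (x[:, None] - samples[None, :]) / h                  # (x - X_i)/h_N
        g = np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))   # g_N(x, X_i), n = 1
        return g.mean(axis=1)                                    # average over the N observations

    rng = np.random.default_rng(1)
    N = 1000
    samples = rng.normal(0.0, 1.0, size=N)        # X_1, ..., X_N drawn from N(0, 1)
    h = N ** (-1.0 / 3.0)                         # h_N -> 0 while N h_N -> infinity
    grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    print(parzen_estimate(grid, samples, h))      # should lie close to the N(0, 1) density

The same routine, evaluated class by class and combined linearly with the coefficients of Chapter III, gives the discriminant-function estimators used in the body of the thesis.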
APPENDIX C

PROOF OF THEOREM 3.5

The proof of Theorem 3.5 is given in this appendix. The proof is as follows. The first term in the bound of Lemma 3.2 is

T_1 = Σ_{j=1}^{M} ∫_T L̄_j q_Nj(x) dν(x)   (C.1)

where

q_Nj(x) = E_N[|G_Nj(x) - P_j E[g_N(x, X)|Λ = j]|].   (C.2)

Now from (3.77) it follows that

q_Nj(x) ≤ N^{-1/2} {Σ_{ℓ=1}^{M} (Σ_{i=1}^{M} b_ji^2 β_iℓ) E[g_N^2(x, X)|Λ = ℓ]}^{1/2}.   (C.3)

So

T_1 ≤ N^{-1/2} Σ_{j=1}^{M} L̄_j ∫ {Σ_{ℓ=1}^{M} (Σ_{i=1}^{M} b_ji^2 β_iℓ) E[g_N^2(x, X)|Λ = ℓ]}^{1/2} dν(x)
    ≤ N^{-1/2} Σ_{j=1}^{M} L̄_j {∫ Π_{s=1}^{n}(1 + |x_s|^{1+ε}) Σ_{ℓ=1}^{M} (Σ_{i=1}^{M} b_ji^2 β_iℓ) E[g_N^2(x, X)|Λ = ℓ] dν(x)}^{1/2} {∫ Π_{s=1}^{n}(1 + |x_s|^{1+ε})^{-1} dν(x)}^{1/2}   (C.4)

where the second inequality follows from the Schwarz inequality. Van Ryzin [V-1] has shown that if (3.82b) is satisfied, then

lim_{N→∞} h_N^n ∫ [Π_{s=1}^{n}(1 + |x_s|^{1+ε})] E[g_N^2(x, X)|Λ = ℓ] dν(x) = E[Π_{s=1}^{n}(1 + |X_s|^{1+ε}) | Λ = ℓ] ∫ K^2(y) dν(y) < ∞   (C.5)

where ε = min(ε', 1/n). Since

∫ (1 + |x_s|^{1+ε})^{-1} dx_s < ∞,   (C.6)

(C.5) along with (C.4) implies that

lim_{N→∞} (N h_N^n)^{1/2} T_1 ≤ Σ_{j=1}^{M} L̄_j {Σ_{ℓ=1}^{M} (Σ_{i=1}^{M} b_ji^2 β_iℓ) E[Π_{s=1}^{n}(1 + |X_s|^{1+ε}) | Λ = ℓ] ∫ K^2(y) dν(y)}^{1/2} {∫ Π_{s=1}^{n}(1 + |x_s|^{1+ε})^{-1} dx_s}^{1/2} ≜ Q_1.   (C.7)

Thus for large N

T_1 ≤ Q_1 (N h_N^n)^{-1/2}.   (C.8)

Now consider the second term in the bound of Lemma 3.2 (see equation (3.69)),

T_2 = Σ_{i=1}^{M} P_i ∫ {E[g_N(x, X)|Λ = i] - f(x|Λ = i)} {Σ_{j=1}^{M} L_ji (δ_Bj(x; ν) - E_N[δ_Nj(x)])} dν(x).   (C.9)

Using (3.63) and the definition of the expectation operator gives

T_2 = Σ_{i=1}^{M} P_i ∫ {∫ (1/h_N^n) K((x - y)/h_N) f(y|Λ = i) dν(y) - f(x|Λ = i)} {Σ_{j=1}^{M} L_ji (δ_Bj(x; ν) - E_N[δ_Nj(x)])} dν(x).   (C.10)

Making a change of variable and applying Fubini's Theorem gives

T_2 = Σ_{i=1}^{M} P_i ∫∫ K(z)[f(x - h_N z|Λ = i) - f(x|Λ = i)] Σ_{j=1}^{M} L_ji (δ_Bj(x; ν) - E_N[δ_Nj(x)]) dν(x) dν(z)
    ≤ Σ_{i=1}^{M} P_i ∫∫ K(z) |f(x - h_N z|Λ = i) - f(x|Λ = i)| |Σ_{j=1}^{M} L_ji (δ_Bj(x; ν) - E_N[δ_Nj(x)])| dν(x) dν(z)
    ≤ 2 Σ_{i=1}^{M} L̄_i P_i ∫ K(z) τ(h_N z; f(·|Λ = i)) dν(z)   (C.11)

where the function τ(·;·) is defined by (3.81) and the last inequality follows from the fact that the magnitude of the second term in the integrand is bounded by 2L̄_i. Conditions (3.82a) and (3.82c) then imply that

T_2 ≤ 2C ∫ K(z) ||z||^γ dν(z) Σ_{i=1}^{M} P_i L̄_i h_N^γ ≜ Q_2 h_N^γ.   (C.12)

Thus for large N

ΔR_N ≤ T_1 + T_2 ≤ Q_1 (N h_N^n)^{-1/2} + Q_2 h_N^γ.   (C.13)

If h_N is then chosen as h_N = O(N^{-1/(n+α)}), it follows from (C.13) that

ΔR_N ≤ (Q_1 + Q_2) N^{-γ/(n+α)}      if α > 2γ
ΔR_N ≤ (Q_1 + Q_2) N^{-α/2(n+α)}     if α < 2γ
ΔR_N ≤ (Q_1 + Q_2) N^{-γ/(n+2γ)}     if α = 2γ.   (C.14)

This completes the proof of Theorem 3.5.