ERROR RRoaARJJJTJ JR UNSUPERVESEE} REFERRERJ _ H5 "
, PATTERN CLASSIFICATIQN . »

Thesis for the Degtee of Ph 0..
MICHIGAN STATE UNIVERS‘TY
. JOHN JAMES FORSYTH
197,1

tr?" ‘

  
     

  

L I B R A I. Y
Michigan State
University

This is to certify that the

thesis entitled

ERROR PROBABILITY IN UNSUPERVISED DEPENDENT
PATTERN CLASSIFICATION

presented by

John James Forsyth

has been accepted towards fulfillment
of the requirements for

Ph.D. degreein Electrical Engineering &

Systems Science

 

 

//7 \ .” “ A
't I/" ./ r" 7
¢ . ll, / ;T I, a I I -‘l O.
www.cstg-J/ Ce , Vou- .4.

Major professor

Date Qié/ ‘jéz /};)/

0-7639

 

 

 

 

ABSTRACT

ERROR PROBABILITY IN UNSUPERVISED DEPENDENT
PATTERN CLASSIFICATION

BY
John James Forsyth

This thesis investigates a communications problem
within the framework of unsupervised multi-category pat-
tern recognition. Consider a digital data communications
system in which a source randomly selects symbols from an
alphabet, encodes the symbols as a string of digits (the
code being of fixed but unknown length) and transmits the
resulting digital data over a channel. A symbol synchroni-
zation problem ensues when a receiver locks on to the
signal at a time other than when the first digit of a
coded symbol arrives. When the receiver does not know
the length of the individual symbol codes being received,
and when special synchronizing pulses are not present as
a guide, then the receiver is faced with processing pat-
terns which exhibit statistical dependence. The term
"synchronization" as used here refers to the problem of
establishing the starting point of each symbol code in
the data, which in turn requires determination of the

code length.

The syn
in unsupervis
Solutions to
both through
stochastic a;
havior of the
whether the s
an independen
decision proc
Since the dec
possible pro:

A cieta'l
decision proc
through a lix
Probability c
Another uses

 

COeffiCient
ational dist.

dlStance me a

 

John James Forsyth

The synchronization problem is treated as a problem
in unsupervised multi-category pattern recognition.
Solutions to the synchronization problem are develOped
both through the Bayes decision process and through a
stochastic approximation algorithm. The convergent be-
havior of these solutions is proved. It is shown that
whether the source generating the codes is governed by
an independent or a Markov random process the Bayes
decision process for the synchronization problem converges.
Since the decision procedures use no training data, the
possible probability models for the source must be known.

A detailed study of error bounds for the Bayes
decision process is presented. One bound is obtained
through a limiting process which examines the asymptotic
probability of error of a suboptimum decision process.
Another uses information theoretic concepts. The roles
of two measures of distance between probability distri-
butions is examined; those measures are the Bhattacharyya
coefficient (Hellinger integral) and the Kolmogorov vari-
ational distance. Error bounds based directly on those

distance measures are exhibited.

in

De

ERROR PROBABILITY IN UNSUPERVISED DEPENDENT

PATTERN CLASSIFICATION

BY

John James Forsyth

A THESIS

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering and
Systems Science

1971

To my wife and children

ii

Than]
for his thc
course of 1
of the writ
for his hel
Professors
t0 particul
the efforts
(3- L. Park

improvement

ACKNOWLEDGMENT S

Thanks are in order to Professor R. C. Dubes both
for his thoughtful guidance and assistance in shaping the
course of this research, and for his painstaking review
of the writing. Thanks also go to Professor R. J. Reid
for his help in defining the motivating problem, and to
Professors Reid and R. Staudte for stimulating solutions
to particular problems which were encountered. Finally,
the efforts of Professors M. G. Keeney, C. V. Page, and
G. L. Park in reviewing the thesis and suggesting helpful

improvements are acknowledged.

iii

LIST OF T}
LIST 01" F1

LIST OF NC

Chapter

I. INT

H H H H H
I I O O O
(n J:- (A) k) H

II. THE

TABLE OF CONTENTS

Page
LIST OF TABLES . . . . . . . . . . . . vi
LIST OF FIGURES. . . . . . . . . . . . vii
LIST OF NOTATION . . . . . . . . . . . viii
Chapter
I. INTRODUCTION . . . . . . . . . . l
1.1 Communications System Models . . . 2
1.2 Symbol Code Synchronization. . . . 4
1.3 Pattern Recognition Methodology . . 7
1.4 Contributions of the Thesis. . . . 15
1.5 Organization of the Thesis . . . . 17
II. THE SYMBOL SYNCHRONIZATION PROBLEM . . . 18
2.1 The Data Generation Model . . . . 18
2.2 Estimating the Symbol Code Length
and Synchronization Instant. . . . 22
2.3 Convergence of the Posterior
Probability Mass Functions . . . . 26
2.4 A Stochastic Approximation Approach . 29
2.5 Synchronization and Learning Unknown
Parameters . . . . . . . . . 36
2.6 Markovian Symbol Selection . . . . 41
2.7 Summary . . . . . . . . . . 43
III. ERROR BOUNDS FOR DECISION PROCESSES. . . 45
3.1 Majority Decision Functions and
Error Probability for Dependent
Random Variables . . . . . . . 46
3.1.1 Asymptotic Error Probability
for Two Pattern Classes. . . 47
3.1.2 Bayes Majority Decision
Functions . . . . . . . 57

iv

Chag

IV.

Ulm

BIBLIOGR;

Chapter Page

3.1.3 Asymptotic Error Probability
for m Pattern Classes . . . . 64

3.2 Bounding the Probability of Error After

a Finite Number of Samples. . . . . 67
3.2.1 An Information Theoretic
Approach . . . . . . . . 68
3.2.2 Bounds Based on the Distance
Between Distributions . . . . 80
3.3 Summary . . . . . . . . . . . 90

IV. EXAMPLES OF DECISION PROCESSES AND ERROR

BOUNDS O O O O O O O O O O O O O 9 2
4.1 The Bayes Decision Process Applied to
the Symbol Synchronization Problem . . 93
4.2 The Binary Symmetric Channel . . . . 98
4.3 Error Estimates and Bounds. . . . . 108
V. CONCLUSIONS AND RECOMMENDATIONS. . . . . 115
5.1 Summary of the Thesis . . . . . . 115
5.2 Recommendations for Continued Research. 117
BIBLIOGRAPHY . . . . . . . . . . . . . 120

Table
4.1 Symbc

4.2 Compa

 

LIST OF TABLES

Table Page
4.1 Symbol Generating Probabilities . . . . . 96
4.2 Comparison of Three Decision Processes. . . 107

vi

Figure
1.1
1.2
2.1

2.2

4.1

4.2

4.3

 

LIST OF FIGURES

Figure Page
1.1 A Communication System . . . . . . . 2
1.2 Multisource Communication System. . . . 3
2.1 Genesis of Pattern Dependence. . . . . 21

2.2 Pattern Dependency with Markov Symbol
Selection. . . . . . . . . . . 41

4.1 Posterior Distribution of the Source

Index Parameter. . . . . . . . . 94
4.2 The Binary Symmetric Channel . . . . . 98
4.3 Comparison of Error Bounds. . . . . . 112

vii

Notation

 

 

Notation

A

covi(V)
d(X2n+1

k
dB(X )

CB

LIST OF NOTATION

Page
Meaning Reference
1 n
11m — Z A 52
n+w n h=l k+h
2C0V1{Uk'Uk+l} + varl{Uk+1} 52
1 n

lim — 2 B 56
n+w n h=l k+h
2cov2{Vk,Vk+l} + var2{Vk+1} 56
Covariance of V under class i 52
Majority decision function 47
Bayes decision function 9
Compound Bayes majority decision
function 65
Decision function used for k—th
pattern for majority decision
function 47
Majority decision function--
same as d(X2n+1) 47

viii

Rotation

was = ti)

EMRi)
51W)

EGW)

fllek)

fin)

Notation

E(VIO = ti)

E(VIQi)
Ei(V)

EG(V)
f (lek)

fi(V)

G = {gl,...,gm}

H(')

Page
Meaning Reference

Conditional expectation
of V under class i 52

Same as E(VIG

t.) 74
1

Same as E(VIO t.) 11

Expectation of V with respect
to the mixture G 33

Sample conditional probability
density function for V 37

Conditional probability density
function for V under class i 11

Element of G that equals one 30

Estimator for gi based on n
observations 33

Non-negative estimator for 9i

derived from gi n 33
I

Probability distribution for A
used as a finite mixing

distribution 30
Shannon's entrOpy 69
A vector space 32

Divergence measure relative
to two probability density
functions 11

ix

Notation

i.

P(VJw)

Notation

2

p>

kB

n>

Meaning

Index of sources

Correct decision about 2

Bayes estimator for 2

Maximum likelihood extimator
for 2

Total number of sources

Symbol code length for source

Number of digits in a pattern

Prior probability that class i
is active

Conditional Probability mass
function for V given w

Probability of error for
decision function V

Empirical probability mass
function for V

Probability mass function for
V under class i

Conditional probability mass
function for V given W under
class i

Vector of probability mass
function for V under class i

Page
Reference

19

26

25

26

19

l 19

20

46

23

49,66

28

46

32

Notation

RB

9—5)

T
ok

Notation
POW)
(x

rij k)

ik

r9>

kB

*3)

kM

ok

Page

Meaning Reference

Prior probability mass
function for V

Factor for recursively com-
puting pi?)

Decision region for deciding
class i at k-th observation

Value of 0 indicating a pattern
class

Synchronization point in k-th
pattern

Bayes estimator for Tk

Maximum likelihood estimator
for Tk

Correct decision about Tk

Decision indicator random
variable for k-th pattern--

Uk = 1 - Vk

Decision indicator random
variable for k-th pattern--
Vk = l - Uk

Variance of V under class i

Random variable for k-th
pattern

The sequence of random
variables xl""'xk

xi

25

84

47

20

25

26

26

48

48

52

9,20

9,20

Notation

Z .
J

ik

“ik (UR-l)

Yk(£,'rk)

 

Page

Notation Meaning Reference
Z. Base 10 representation of
J a binary pattern 99
a max(al,a2) 59
g min(al'a2) 59
a. Probability of error under
1 .
class 1 58
aik Probability of error of
d (X ) under class i 47
k k
ik_(Uk 1) Probability of error of
(Xk) given dk-1(Xk—1) under
c ass i 50
yk(£,Tk) Mapping from (£,Tk) to A 30
6 One-half of upper bound on
variational distance 59
_l__<5_
E 8—: 4 62
n n = E - g 59
0 Class indexing parameter 9
A Class indexing parameter
mapped from (E, Tk ) under
k(£ T k) 30
p Bhattacharyya coefficient
(Hellinger integral) 11

xii

Notation

Uij

 

 

 

Notation

$1?)
13

¢.

1

¢(x,u,02)

Meaning

Interclass Bhattacharyya
coefficient based on one
pattern

Interclass Bhattacharyya
coefficient based on k
patterns

Component of §i(V)

Integral over the right
tail of the normal density
function

Subset of a probability
space on which 0 = th

xiii

Page
Reference

81

82

33

55

74

Recer
engineering
use of stat
the traditi
H-Z, P-Z, w
trend towar
the framewo
estimation,
formulatigr1
assumptions

System. In

the applica
Edge of Pro:
tiCians hav
Ba‘yes Strat

tomed to th

System desc

 

 

 

CHAPTER I

INTRODUCTION

Recent research efforts in engineering and recent
engineering practice have turned more and more toward the
use of statistical models rather than, or in addition, to
the traditional deterministic system models [A-7, D-2, F-S,
H-Z, P-2, W-l]. In particular, there exists a growing
trend toward the formulation of engineering problems in
the framework of statistical hypothesis testing, parameter
estimation, and pattern recognition methodology. These
formulations often show a willingness to make certain
assumptions about the probabilistic description of the
system. Indeed, many of the solution algorithms grow from
the application of Bayes' rule, which requires prior knowl-
edge of probability distributions [H-7, N-l]. Some statis-
ticians have long eschewed any such assumptions,* but the
Bayes strategy represents a logical step for those accus-
tomed to the availability of a complete, deterministic

system description. As a rule, the models describe

 

*Although for a scholarly treatise supporting
the Bayesian approach see Good [G-l].

prOCGSS'
tistica.‘
cations
produce

process

1.1 Corn
Fu

a source

The Objec
Priately

and whic}
Radar SYE
SUCCessfu
Neran~pe
“‘11 to c
radar app.
in? noise
Signal is
descriPtic
the parame
receiVEr C
minimi2e t

dlsmiSSal

processes for which successive experiments produce sta-
tistically independent outcomes. However, in communi-
cations theory and other fields, processes can occur which
produce statistically dependent observations. One such

process is now introduced.

l.l Communications System Models

 

Fundamentally, a communications system consists of

a source cascaded with a channel and a receiver (Figure 1.1).

 

 

 

Receiver

 

 

xiv

Channel

v

Source

 

 

 

 

 

 

 

 

 

Figure 1.1 A Communication System

The objective of the system is to have the receiver appro-
priately process signals, which are emitted from the source,
and which are subject to some modification by the channel.
Radar systems provide the first historical example of the
successful application of probabilistic models and of the
Neyman-Pearson hypothesis testing theory [D—2, H—2, M-8,
W—l] to communication systems. In the usual model of the
radar application, the channel distorts the signal by add—
ing noise to it. The receiver must determine whether a
signal is present or absent. From the probabilistic
descriptions of the source and channel, one can determine
the parameters that define a threshold detector which the
receiver can use on the output of a correlator in order to
minimize the false alarm probability for a given false

dismissal probability. The source and the channel can

both be 1
tain eitl
M-7, P-l
trates t2
is more ‘

Th.
nition, a
consider:
sources n

its own 5

E
E

’ SOL

Fi
fIOm an al
and transm

The r eCeiv

starting
effective

StUdy Unde-

both be modeled probabilistically, and the models can con-
tain either independent or dependent random variables [M-5,
M-7, P-l, S-3]. While the signal detection problem illus-
trates the use of probabilistic models, a modified system
is more directly related to the problems of this thesis.
The fields of communication theory, pattern recog-
nition, and statistical hypotheses testing intertwine when
considering a communications system in which one of several
sources may transmit signals (Figure 1.2). Each source has

its own process by which it repeatedly selects symbols

 

Source 1 Encoder

 

 

 

 

 

 

 

Source i Encoder ___« Channel Receiver

 

 

 

 

 

 

 

 

 

 

 

 

Source r Encoder

 

 

 

 

Figure 1.2 Multisource Communication System

from an alphabet, encodes the symbol into several digits,
and transmits the resulting digits in a continuing stream.
The receiver must know which source is active and the
starting point of each symbol code in order to apply an
effective decoding algorithm. The problem motivating the

study undertaken in this thesis is to determine which

source
receiv
zation

startii

1.2 8?

channel
a corre]
the symt
problem
because .
because c
receiving
tyne of p
each Symb.
They furti
tically ir
Zation pro
and Stewar
prOVed tha'
theorem1 [8-
One
with othErs
Recent Work

an Muftha

 

source is active and the symbol code starting point. The
receiver will be said to have achieved symbol synchroni-
zation when it has identified the source and determined the

starting point of each symbol code which it receives.

1.2 Symbol Code Synchronization

Digital data transmitted over a communication
channel often contains a symbol synchronization pulse which
a correlator can process to determine the starting point of
the symbol code [M-2, W-l]. A symbol synchronization
problem arises when such a pulse is either not provided
because of bandwidth or other design considerations, lost
because of channel degradation, or simply unknown at the
receiving end. Hancock and Stewart [H-l] considered this
type of problem by assuming that the number of bits in
each symbol code was known and fixed from symbol to symbol.
They further assumed that successive symbols were statis-
tically independent. By modeling the symbol synchroni—
zation problem as a hypothesis testing problem, Hancock
and Stewart established a Bayesian decision procedure and
proved that it converges in the sense of Spragins'
theorem [S—6].

One should compare this type of synchronization
with others receiving attention in current research.
Recent works by McBride and Sage [M-3, M-4] and by Farrell

and Murtha [F-l] examine the problem of extracting bit

synchrc
purpOSE
which a
On the

synchro
symbol.
quently.
that acc
able for
to devel
of sever
the type
PrOblem 1
Stewart,
Sources,
and SUCCe
that alph
into a Ce
used is c.
mits Stat;
independe;
uses a fi)

but diffe}

 

COde lenC'

synchronization information from the message data. Their
purpose is to determine the apprOpriate time interval over
which a correlator should Operate for signal detection.
0n the other hand, Harnett [H-3] takes the approach that
synchronization error effectively inserts or deletes a
symbol. However, except for the randomly (and infre-
quently) inserted and deleted symbols, Harnett assumes
that accurate synchronization, or registration, is avail-
able for the rest of the symbol stream, and he proceeds
to develop decoding theorems for code strings consisting
of several symbols. These two views are distinct from
the type of synchronization studied in this thesis.

This thesis approaches the symbol synchronization
problem by generalizing the point of view of Hancock and
Stewart. A receiver obtains data from one of a set of
sources. The source model assumes an alphabet of symbols,
and successive random experiments select symbols from
that alphabet for transmission. Each symbol is encoded
into a certain number of digits, and the number of digits
used is called the symbol code length. Each source trans-
mits statistically independent symbols although this
independence constraint is not necessary. Each source
uses a fixed symbol code length for all of its symbols,
but different sources do not necessarily use the same

code length.

fr

wa

an

mic

tra

The
ObServing
first bit
the reCei

that the

Example

Suppose there are two sources, one using a
four binary digit (binit) code length and the other
using three binits for each code word. If the

receiver observes six binits, say

1 0 l l l 0

from the data stream, there are several possible
ways these binits could be partitioned for decoding.
If the four binit source is transmitting, then

any of the partitions

 

 

 

l 0 l l l 0 .
. . . l 0 l l l 0 . . .
. . l O 1 l l 0

 

. l 0 l l 0 .

 

might be appropriate. If the three binit source is

transmitting, then there are three possible partitions.

l O l l l 0
O O l 0 1 l 1 0 O

 

. l O l l l O . .

 

The receiver does not know which source it is
observing or whether the first bit it received is the
first bit of a symbol code. However, it is assumed that
the receiver has achieved bit synchronization, and further

that the receiver knows something about the probability

distr
butio:
funct:
zatioz
proble
first
precis
must c

random

ature a
ables [
be said
ables [1
random x
extend t
e"amples
of jOint
tation o
The COnv
PEUdent ,
random Va

The
literatur
ables are

Weak and

distributions governing the sources--either the distri-
butions themselves or the parametric form of their density
functions. Under these assumptions, this symbol synchroni-
zation problem can be modeled as a pattern recognition
problem for which the observed random variables exhibit
first order dependence, or the Markov prOperty (for the
precise specification, see page 24). Consequently, one
must consider decision-making processes using dependent

random variables.

1.3 Pattern Recognition Methodology

There is an ample and rapidly growing body of liter-
ature about decision making based on independent random vari—
ables [D-3, D-4, L-l, P-l]. Unfortunately, the same cannot
be said for decision making based on dependent random vari-
ables [H—6]. Quite often, the algorithms for independent
random variables derive from some property that does not
extend to dependent random variables. Two salient
examples are the product rule for defining probabilities
of joint independent events and the rule for the expec-
tation of the product of uncorrelated random variables.
The convenient factoring which they provide in the inde-
pendent case are denied to one who deals with dependent
random variables.

The most powerful general results in the statistical
literature which are applicable to dependent random vari-
ables are the Central Limit Theorem [F-4, H-8], and the

weak and strong laws of large numbers [F-2]. These tools

the cc
for PE
Stewar
consis
zation
exhibi
Bayes 1
the prc
ponds t
mate a
bution
the mod.
nonzero
Of the i
PUted b)
Consiste
the Post
the Prob
Conseque.
lated are
ObserVat]
argument
questiOns

and Stor

provide the basis for describing the asymptotic behavior
of decision processes. Signori [S-3] used them to show
the convergence of optimum (Bayesian) decision processes
for Partially Observable Markov Systems. Hancock and
Stewart also used them to show the existence of a strongly
consistent sequence of estimators for the symbol synchroni-
zation problem. In both works, the technique was to
exhibit strongly consistent estimators in proving that the
Bayes posterior distributions asymptotically assign all
the probability mass to the pattern class which corres-
ponds to the correct decision [8-5, 8-6]. One can esti-
mate a parameter value by computing the posterior distri-
bution of the parameter and taking the mean argument of
the mode as the estimate. Spragins showed that if a
nonzero prior probability is assigned to the true value

of the parameter, if the posterior distributions are com—
puted by the Bayes rule and if there exists a strongly
consistent sequence of estimators for the parameter, then
the posterior distribution asymptotically assigns all of
the probability mass to the true value of the parameter.
Consequently, when the posterior distributions so calcu-
lated are premised on a sufficiently large number of
observations, the parameter estimate provided by the mean
argument of the mode should be reliable. This raises
questions of rate of convergence, probability of error,

and storage requirements which are examined in this thesis.

M‘
process
Given a
variabl
i.e., p
The Bay
andap
9={tl
require

distrib

Where x
mass fu]

p(

is the 1

*F.
defined ‘

values 0:

Much of what follows concerns the Bayes decision
process, which will be briefly summarized at this point.
Given a sequence of observations, X1, X2, ..., of a random
variable governed by one of m probability distributions,
i.e., pattern classes, decide which distribution is active.
The Bayes decision process requires a prior distribution
and a parameter, say 0, which indicates the active class,
0 = {t1, ..., tm}. The Bayes decision process further
requires calculating the sample conditional posterior

distribution of O by the Bayes rule:*

le-l

_ k-l
P(o — tilx ) Pi(Xk )

 

_ k _
P(o — tilx ) —

_ k-l k-l
ZP(O-—tiIX ) Pi(XkIX )
G
where Xk = X X X and P (X IXk—l) is the i-th
l' 2' "" k i k

mass function of Xk' The decision function, which maps

the observation space to the decision space, is
dB(Xk) = ti if i is the smallest integer for which
k k . .
P(@ = tiIX ) 2 P(o = thX ), all 3 ¢ 1.

Here, dB(Xk) = ti means to decide that element ti of O

is the index of the pattern class being observed.

 

*For notational convenience, terms which have been
defined to denote random variables will also denote the
values of the random variables.

U
its wel
risk st
minimunx
generat
to an e
the min

U1
Bayes de
pendent,
vestigat
R-41. C
SUbOptim
rule. T
Optimum
bEhavior
rule, am
the mlmb'

Sex
Of boundj
of "dista

 

 

10

Use of the Bayes decision process is motivated by
its well-known prOperties which include being the minimum

risk strategy and being the strategy which leads to

minimum probability of error. A standard derivation
generates the Bayes decision function as the solution
to an extremum problem, in which the extremum sought is
the minimum of the expectation of a risk function.

Upper bounds on the probability of error for the
Bayes decision process have been established for inde-
pendent, continuous random variables by a number of in-
vestigators [C-lO, F—3, H-S, K-2, L—2, L—3, L-4, R—3,
R-4]. Chu and Chueh established bounds by defining a
suboptimum decision rule called the majority decision
rule. The Central Limit Theorem implied that the sub-
optimum procedure converged and established the asymptotic
behavior of the probability of error for the Bayes decision
rule, and the majority decision rule, as a function of
the number of observations.

Several closely related approaches to the problem
of bounding the probability of error rely on some measure
of "distance" between the distributions of the pattern
classes.* The underlying decision procedure uses the
likelihood ratio. The goal (which in general has not
yet been reached) is to find a monotonic relation between

the distance_between the distributions of the pattern

 

*

The term distance appears in quotes here because
the measures do not always obey the triangle inequality
required of a metric [K-l, K-2, K-S].

claSS‘
divert

with

Kailat
[K‘Z],
0f one
exists
butions
bility
diverge.
effGCtil
diStribL
Th

in the t

Se

are ”lore

11

classes and the probability of error. One measure, the
divergence, is defined as follows in the two-class case

with the class distributions fl(x) and f2(x), respectively.

fl(x) fl(X)
J=Elan'E21nf-2—m

where

f1(X) fl(x) _
Ei 1n W = ln f-E—(X—Y fi(X) dX, 1 = 1,2.

Kailath, in his well-documented paper on these techniques
[K-2], points out that if one can choose the distributions
of one's signal sets to increase the divergence, then there
exists a set of prior probabilities under which the distri—
butions with the larger divergence provide a lower proba-
bility of error than would distributions with a smaller
divergence. However, this falls short of providing an
effective computational technique for selecting signal
distributions even if that option were available.

The Bhattacharyya coefficient, denoted p, is defined

in the two-class case as

 

p = f/fl(x)f2(x) dx.

Several investigators [K-2, L—2, L—3, L—4] have
obtained results using the Bhattacharyya coefficient which

are more encouraging than the results stemming from the

diverge
bounds
the Bha
vergenC
underly
hood de
diverge
of obse
spite o.
usual 5;
formidak
Th
method f
the Conv
R-SI. R
COnCErni
equivoca
fOUnd the
tial fUnc
Raviv [H-
error. 'I
the PrOba
for any n
niqUQ Of
the aSym;
thesis Sh

IEtic Poi

12

divergence approach. Kailath obtains both upper and lower
bounds on the probability of error which are functions of
the Bhattacharyya coefficient. This measure, like the di-
vergence, indicates the potential effectiveness of the
underlying probability distributions for maximum likeli-
hood decision making. If one tries to apply either the
divergence or the Bhattacharyya coefficient to a sequence
of observations in a multihypothesis situation, then in
spite of the simple form of the bounds, one still has the
usual situation that computing the error bounds can be more
formidable than computing the decision function.

The late Alfréd Rényi proposed a highly interesting
method for using information theoretic measures to show
the convergence of Bayes decision processes [R-3, R-4,
R-S]. Rényi saw the observations as providing information
concerning the classification parameter. He studied the
equivocation (average entropy) of the observations and
found that the equivocation approaches zero as an exponen-
tial function of the number of observations. Hellman and
Raviv [H-5] related Rényi's results to the probability of
error. This approach provides an absolute upper bound on
the probability of error of the Bayes decision process
for any number of observations. In contrast, the tech-
nique of Chu and Chueh, mentioned earlier, describes only
the asymptotic behavior of the probability of error. This
thesis shows the applicability of this information theo-

retic point of view to dependent random variables.

obserr
descri
bution
distri
distri
ing di:
in proc
estimat
However
sequenc
distrib
indepen(
to a Stc
[C‘2] re
mation a
as a fUn
Presente(
random VE
C.
dependEnc
VariatiOn
hypothesi

elements .

 

adjaCent.

not neCeS

13

Another view of pattern recognition considers the
observations as originating from a process which is
described by a mixture [T-l, T-2, Y-2] of the distri-
butions of individual component processes. The prior
distribution of the classes serves as an unknown mixing
distribution. Some of the approaches to estimating mix-
ing distributions [C-4, D-l, R-7, Y-l, Y-3] have resulted
in procedures which require either a steadily growing
estimator space or an infinite set of finite distributions.
However, Robbins [R-6] has shown how to calculate a
sequence of strongly consistent estimators for the mixing
distribution when the observations are statistically
independent. His procedure can be shown to be equivalent
to a stochastic approximation algorithm. Chien and Fu
[C-2] related Bayesian learning to stochastic approxi-
mation and gave bounds on the variance of such estimators
as a function of the number of observations. The work
presented here applies these ideas in concert to dependent
random variables.

C. K. Chow [C-6, C-7, C-8] has investigated inter-
dependence between pattern elements (features) that is a
variation on the idea of the Markov prOperty. Chow
hypothesized that the features might be ordered such that
elements having interdependence would not necessarily be
adjacent. This suggestion grew from character recognition
work where the order in which features were measured did

not necessarily reflect the history of their generation.

He CC
descr
estab
resul
matin
depen
are C:
a max;
matior

bution

thesis
(which
of the

with 1.117

indepenc
samples
niQUes.
Processil
the Sampl
Used eith
distribut
late weig
{C-ll' C-
&q

14

He conceived a tree structure imposed on the features to
describe the dependencies, and devised techniques for
establishing the most effective tree. One interesting
result showed that if one limits the form for approxi-
mating a probability distribution to products of low order
dependent distributions; and if the low order distributions
are chosen to be those for which the components exhibited

a maximum mutual information; then the resulting approxi-
mation has minimum mutual information with the distri-

bution being approximated [C-8].

There is no attempt to extend Chow's work in this
thesis, since Chow worked in a supervised learning mode
(which uses classified training data to evaluate parameters
of the decision function), and this thesis is concerned
with unsupervised techniques.

Many pattern recognition researchers have available
independent, identically distributed (i.i.d.) training
samples with which they can use supervised learning tech—
niques. This avoids the high cost of optimum unsupervised
processing, and no underlying probability distribution on
the sample space need be specified. The training data are
used either to construct approximations to the probability
distributions [A-2, B-l, K-4, M-6, P-4, Y-4] or to calcu-
late weighting parameters for discriminant functions
[C-ll, C-12, C-13, C-l4, D-S, F-S, I—l, N-2, 8—1, 8—2,
8-4]. Various supervised learning techniques are often

compared by tuning them on the same set of training data

 

and ti“
test d
the d8
would .
some C]
were ur
techniq
of symb
unsuper1
zation ;
depender
in the w
removed.

be applit

l.
w

Thi

prOblem w

 

 

15

and then evaluating their performance on another set of
test data. Other researchers [K-4, P-4] have proved that
the decision functions produced by a particular scheme
would agree with the decision function which would optimize
some criterion function if the number of training samples
were unlimited. This thesis has not pursued supervised
techniques primarily because the motivating problem, that
of symbol synchronization, seemed more plausible in an
unsupervised operation. Further, the symbol synchroni—
zation problem very definitely produces statistically
dependent data, and the theoretically pleasing results

in the works cited collapse when the i.i.d. assumption is
removed. Many of the strictly empirical techniques could

be applied equally well to i.i.d. or non-i.i.d. data.

1.4 Contributions of the Thesis

This thesis identifies the symbol synchronization
problem with unknown symbol code length as a problem in
unsupervised, multicategory pattern recognition with dis-
crete,dependent random variables. A sequence of decisions
based on a Bayes strategy is shown to converge and allow
one to simultaneously determine the synchronization and
learn the parameters in the source distribution. Even
though the process at the source which selects successive
symbols for encoding selects symbols independently, the
data-gathering model at the receiver produces dependent

patterns. It is pointed out that if the symbol selection

proce-
and SI
proce
vides
studi.
cedure

varial

decisi
asympt
Optimu;
continx
informa
random

bility

the numl
depender
error b0
PIObabil
Coefficie
Presents

error b0

    
 

majority
St°Chasti

algorithr

l6

process at the source were Markov, then the synchronization
and source parameters could still be learned by the decision
process at the receiver. The binary symmetric channel pro-
vides a specific example of parameter learning which is
studied. The convergence properties of such decision pro-
cedures are thoroughly examined for dependent random
variables.

A subOptimum decision procedure, the majority
decision procedure, is used to derive expressions for the
asymptotic behavior of the probability of error for the
optimum procedure; such expressions are derived for both
continuous and discrete dependent random variables. An
information theoretic argument, applied to the dependent
random variables, provides an upper bound on the proba-
bility of error of the optimum procedure as a function of
the number of observations. This thesis also extends to
dependent random variables some techniques for determining
error bounds which use measures of the distance between
probability distributions, specifically, the Bhattacharyya
coefficient, and the Kolmogorov variational distance, and
presents the role of these quantities in the more elaborate
error bound techniques, those using equivocation and the
majority decision function. Another procedure is based on
stochastic approximation of a mixing distribution, and the

algorithm's convergence rate is discussed. Finally,

CODPU
the f

anal}

 

in ti
estar
conce
for t
of ob
which
the tl

vised

17

computer simulations of selected techniques demonstrate
the feasibility of the type of processing which has been

analyzed.

1.5 Organization of the Thesis

Chapter II places the symbol synchronization problem
in the framework of a pattern recognition problem and
establishes convergent decision processes. Chapter III
concentrates on various methods of computing error bounds
for the Bayes decision process as functions of the number
of observations. Chapter IV presents several examples
which illustrate the theory, while Chapter V summarizes
the thesis and suggests alternate approaches to unsuper-

vised decision making.

proble
the de
will b
depend'
depends
in the

eters a
it is s
arise f;
SPeCific
the fran

describe

L

%
Th

One of 86

not rest

CHAPTER II

THE SYMBOL SYNCHRONIZATION PROBLEM

Detailed explanations of the symbol synchronization
problem, the decision procedure, and the convergence of
the decision procedure are presented in this chapter. It
will be shown that the processing technique must deal with
dependent random variables. The exact effect of this
dependence on the decision process is described. While
in the basic problem description the only unknown param—
eters are the source active and synchronization instant,
it is shown that other unknown parameters, such as could
arise from either a noisy channel or a less complete
specification of the source, can be accommodated within
the framework of the class of decision procedures

described.

2.1 The Data Generation Model

The information to be received can be generated by
one of several sources, and each source encodes the infor-
mation in binary form. The assumption of binary data does

not restrict the generality of the results obtained. Many

18

data
nal
and ‘
data

a la]

19

data transmission systems--such as remote computer termi-
nal facilities, equipment monitors in earth satellites,
and various character recognition schemes--use binary
data, so the binary data model is directly applicable to
a large class of existing systems.

Let L denote the number of sources. All sources
transmit at the same bit rate, and all sources use the
same signal to represent a binary digit. A particular
source uses the same symbol code length, or number of
binary digits per coded symbol, for all symbols in its
alphabet. Throughout this thesis, the term symbol refers
to an element of a source alphabet, and the term symbol
code refers to the encoded version of a symbol. Two or
more different sources might use the same symbol code
length; however, it is also possible that different
sources use different symbol code lengths.

By letting the letter 2 stand for the index of the

source, £e{l,2,...,L}, one can then have m stand

R

for the symbol code length used by source 2. Every mk

time units (the time unit is the reciprocal of the trans-
mission bit rate) source 2 randomly selects a symbol from

its alphabet, encodes the symbol using m binary digits,

£

and transmits the m binary digits. At some time, not

2
necessarily as the source begins transmitting, a receiver
locks on to one of the sources and receives from that

source for the rest of the time. The receiver does not

know
face
is be
lengt
each
the I
that
each
ceive

Methol

separa
binary
struct
digits
Stream.
X1,X2,.
Where n
length,
first k

Stands f

 

20

know which source is transmitting. Two immediate tasks
face the receiver. The first is to determine which source
is being observed, which in turn specifies the symbol code
length. It is assumed that the symbol code length of

each source is known to the receiver. As a second task
the receiver must determine the "synchronization instants,"
that is, the binary digit which is the first digit of

each symbol code. Note that the first binary digit re-
ceived is not necessarily the first digit of a symbol code.
Methods for solving these two tasks will be developed.

The receiver observes patterns which it obtains by
separating the input stream into successive sets of n
binary digits. The input stream then could be recon-
structed as a concatenated version of the patterns; no
digits are lost in extracting the patterns from the input
stream. The sequence of patterns is represented by
X1,X2,... and each Xi (i = 1,2,...) has n binary components
where n is at least as large as the largest symbol code
length, i.e., 1‘: m2 : n. The notation Xk denotes the
first k patterns; Xk = (X1,X2,...,Xk). The notation Tk
stands for the synchronization instant in the k-th pat-
tern, Xk’ and is defined as follows:

a. If exactly one symbol code has its beginning

binary digit in X then T is the position of

k' k
that binary digit in Xk.

21

b. If more than one symbol code has its beginning
binary digit in Xk (which can happen whem mi < n),
then Tk is the position in Xk of the beginning
digit of the first symbol code which begins in
Xk.

This definition of Tk comes from the following idea: if

there are exactly i binary digits (1 = 0,1,...,m£-1) in Xk

which belong to a symbol code which began in Xk-l’ then

Tk = i+l, and Tk

left over from a symbol started in Xk_1) starts in Xk'

tells where the first new symbol code (not

Figure 2.1 shows examples of sequences of values for

T Values of m = 2 and n = 3 are assumed, and the two

k' 1
possible values for T1 produce the sequences of values for

T ,T as shown. Effectively then, Figure 2.1 illustrates

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2 3'...
. . . o o 1 o 1 o o o 1 o 1 1 . . . (a)
* * * *
X1 X2 X3 X4
Tl=l T2=2 T3=1 T4=2
. . . +_g o 1 o 1 g_y_g o 1 o 1 1 . . . (b)
* * * *
x1 x2 x3 x4
Tl=2 T2=l T3=2 T4=l

Figure 2.1 Genesis of Pattern Dependence

tWKD ways in which the same sequence of patterns, Xk' could
Eriseefrom a source which uses two binary digits for a

SYHﬂmol code. Pairs of underlined digits represent symbol

22

codes. The values of T are shown. The "*" under a digit

k

shows that the digit is the one which determines the value

of Tk'

The value of Tk cannot exceed the symbol code length

of the source. Since n might exceed mg, T is not neces-

k

sarily equal to Tk+l° However, given m n, and T the

R' k-l'

number of digits which carry over to X from a symbol code

k
started in xk-l is mi minus the remainder of
n - (Tk-l - 1)L/h£ so that
Tk = mx — [n - Tk-l + I] + l 2.1.l
mod m

2

The distinguishing feature of this chapter is the
assumption that once the receiver has selected a stream to
observe, the receiver stays with that one stream. Sections
2.2 through 2.4 develop an Optimum receiver and discuss its
convergence for the case in which the symbol code length
and synchronization instants are the only unknowns; the
probability model for the sources will be completely known
and the channel between the source and receiver will be
noiseless. Sections 2.5 and 2.6 allow unknown parameters
in the source distributions and include a noisy channel.

2.2 Estimating the Symbol Code Length
and Synchronization Instant

 

Each source performs a random experiment to select

a symbol to be encoded. The probability of any source

23

selecting a given symbol is assumed known, fixed and
independent of previously selected symbols. The stream

of binary digits (binits) so produced represents obser-
vations on a discrete—parameter, discrete-time stochastic
process. The values of the sample conditional probability

mass function, for a noiseless channel

P(XI9.,T 5L=l,2,...,L;T =1,...,£

k” k

is known for all 2n mass points and fixed. In the expres-
sion above, X--which has represented a random variable--is
used to represent the value of the random variable as well.
This practice will be continued, and the context will indi-
cate whether the random variable or the value of the
random variable is intended.

A symbol code might have its initial binits in Xk__1

and its final binits at the start of Xk' which causes Tk

to be greater than one. Yet no single symbol code could
overlap more than two patterns, because the length of a
pattern is as great as the longest code length. That

being so, the value of X could depend on the value of

k
Xk_1 but not on Xj for j < (k-l); recall that successive
symbols are independent. Furthermore, since stationary
probability models for all sources are known, the con-

ditional probability of Xk given the value of xk-l can

be specified. In short, the sequence of random variables

trW

P (3

anc
pos
the
denc
func
k pa1
P(T
quant

GXpre:

P(Tk(

24

X1,X2,... is a first-order Markov chain with stationary

transition probabilities. Thus, P(Xk|£,Tk,Xk-1) =

P(XkI£,Tk,Xk_l).
The procedure for estimating the active source, 2,

and the synchronization instant, T is based on forming

k’
posterior estimates of the probability mass function for
the active source given the first k patterns observed--
denoted P(£|Xk); R = l,...,L--and of the probability mass
function for the synchronization instant given the first
k patterns observed and the active source--denoted

P(Tk|£,xk); l = l,...,L; T = l,...,l. Expanding these

k

quantities by the Bayes rule gives the following recursive

expressions.

 

 

 

 

k P(xk|2,xk l)PULIXk 1)
P(£IX ) = L
z P(XkI£,Xk lk)PULIX 1)
2:1
2.2.1
1£>(xk|r,xk'l">130lek 1)
= k— l) 9, = 1'2'ooopL
P(XkIX
k P(xk|Tk,2,x l)P(Tk|9.,xk'l)
P(Tk|£,X ) =
2: PM IT ,2.x"‘1)1=<'r Isaack 1)
k k k
T =1
k
k 1 k 1 2.2.2
P(X IT ,£,x ' )P(T |2,x )
k k k _
= k-l 2 — l,2,...,L
P(xk|2,x )

25

The denominator of 2.2.2 shows how to compute the first
factor in the numerator of 2.2.1. All that is left is to

Specify initial values for

PO(2) P(2|X0) and

_ O
PO(Tkl2) — P(Tk|2,X )

because P(X|Tk,2) is assumed fixed and known.
With the procedure for computing the posterior
probabilities specified in 2.2.1 and 2.2.2 one can con-

sider ways of estimating the value of 2 and T One esti-

k.
mator, called the Bayes estimator, uses the mean of the

posterior distribution. The Bayes estimators of 2 based

on Xk is denoted by 2 (the estimator for the source

kB

index after k observations) and that for T, by T (the

kB

estimator for the synchronization instant after k obser—

vations). These estimators are defined by:

A L k
2kB = z 2p(£|x )
2:1
and
m
T = z z“ T P(T I2,Xk)P(2,Xk)
kB _ k k
2 Tk-l

However, these estimators tend to give fractional values
for quantities defined as integers, so some rounding

algorithm would have to be specified.

26

Another, perhaps more intuitively satisfying,
estimator is the maximum likelihood estimator. The maxi-

mum likelihood estimators of 2 and Tk based on Xk are

denoted by 2 and T and are defined by:

kM kM’
EkM = 2os.t. P(2O|Xk) = max P(2|Xk)
and
TkM = Toks°t' P(Tok|2kM,Xk) = max P(Tkl2kM,Xk).
Tk

2.3 Convergence of the Posterior
Probability Mass Functions

 

 

The receiver is connected to a single source, 20,

so all symbol codes have the same symbol code length, mg .
The receiver stays connected to the source, missing none0
of the binary digits produced after the connection is made.
Consequently, the sequence of true synchronization in-
stants, {T }00 will satisfy 2.1.1 with T0 substituted

0k k=1 k
for T . It is essential to show that the posterior

k
probability mass functions computed according to 2.2.1
and 2.2.2 converge in the sense that as k+w they have all
of their mass at 20 and Tok, respectively. In order for
that to happen, a theorem of Spragins requires that the
following conditions be met:

(1) the posterior probabilities must be computed

by the Bayes rule,

27

(2) the true value of 20 and TO must have non-zero

1
prior probabilities, and

(3) there must exist sequences of functions of the

observations, {fk(Xk)} and {gk(Xk)}. such that

k k k k
fk(X )-—'*2O w.p.l. and gk(X )-—-->T0k

w.p.l.

where w.p.l means "with probability one."

Equations 2.2.1 and 2.2.2 show that condition 1 is

met, while proper choice of prior probabilities
that condition 2 is met. The following theorem
existence of the strongly consistent estimators

in condition 3.

assures
shows the

required

Theorem 2.3.1

 

If (a) the sequence of patterns represent obser-
vations on a regular Markov chain;

(b) only one synchronization, denoted by 20, TO
k

exists during the transmission of all

Xk, k: 1,2,...;

(c) members of the family {P(Xk|2,Tk); 2 = l,...,L;

Tk = l,...,mﬁ} are distinct;

(d) P(Xkl2,Tk) is the same for all k = 1,2,...
then there exist sequences of minimum dis-
tance estimators 2 and T for 2 and T
m m k
which converge to 20 and T0 with probability

k
one.*

 

*The estimators 2m and Tm imply a sequence of esti-

mators 2m,T1m,T2m,...,Tmm,..., for which ij is not

28

Proof: Define the empirical discrete probability

mass function for X, §h(x), as follows:

F(x)=-1- 11fx=x
m m

IX(Xk) a.e. where I x(Xk) k

">18

k 1

I X(Xk) 0 1f Xk # X.

This is the proportion of the first m observations

which equal X.
Define the estimators 2m and Tm by

inf supIPm (X) - P(X|2, T k)I = supIPm (X) - P(X|2 m,Tm)l

2 'Tk X

w.p.l

The infimum on the left allows us to write

suple(X) — P(Xl2m,Tm) Issupl'ﬁmm
X X

- P(XI2O,TO )|—l“—+o w.p.l. 2.3.1
m
Convergence to zero with probability one is by the
Glivenko-Cantelli theorem (using Signori's extension
to regular Markov chains). Now the triangle in-

equality is used to write

 

necessarily the estimator T based on the j- -th pattern,

for j#m. Correct estimation of the synchronization means

that the value of 2m is 20 and Tkm gives the value TO .
k

29

sup|P(Xl2m,Tm) - P(XI2O,TO )l = suplP(X|2m,Tm)
X m X

- Pm(X) + Pm(X) — P(XI2O,TOm)] s s;plP(Xl2m,Tm)

- §m(X)| + sup|5m(X) - P(X|2O,TO )| 2.3.2
X m
Equation 2.3.1 says that both quantities on the right
in 2.3.2 go to zero with probability one so that
A A m
consequently P(Xl2m,Tm)-——+P(XI2O,TO ) w.p.l. Hy-

m
pothesis (c) then provides that

2m-+ 2O w.p.l and T —+ T w.p.l.

Q.E.D.

This result in turn implies that, because of
Spragins' theorem, either of the decision procedures of
section 2.2 will give correct estimates of 20 and T0k if
the process continues long enough. With the estimators
discussed so far, the rate of convergence and probability
of error are difficult to specify. Section 2.4 provides

an algorithm which allows statements about the rate of

convergence.

2.4 A Stochasticquproximation Approach
Synchronization is completely specified by the combi-

nation of source index and synchronization instant, (2 Tk).
I

30

The probability of a given pattern can be represented by

the mixture

2
1

"b4;

P(X) = P(X|2,Tk)P(2,Tk) 2.4.

Ilbab

2 l T

For a convenient change of notation, observe that since

L
6 n and l s T s m there arexn = 2 m or at most,
2 k 2 2:1 2

nL distinct pairs (2,Tk) which can specify the synchroni-

l s m

zation. The notational change comes through letting an
unknown parameter A€{l,2,...,m} index the possible source-
synchronization pairs expressed relative to the first pat-
tern. A sequence of functions Yk(2,Tk) is defined such
that Yk(2,Tk) is a l-l map between the values of A and
the values of the ordered pair (2,Tk). Further, yk(2,Tk)
is defined such that given the value of the pair (2,Tl),
the sequence of values of (2,T2),(2,T3),... generated
according to 2.1.1--with 2 held constant--maps under
Yk(2,Tk) to the same value A for k = 1,2,... .

The parameter A has an unknown probability vector

G = {gl,...,gm}, gi ) 0, 2m gi = 1 such that P(A=i) = gi.

i=1

Assuming that a single source with one synchronization,

(2O,To ), produces all of the observation vectors implies
k
that one value of A is in effect for all observations.

So the system being observed has a probability vector

G = {gl,...,gm} such that for 1 = 10, gi0 = P(A = 10) = l

and for all other 1, 9i = 0. The entry of G which has

31

the value 1 is unknown. Once i0 is known, gi = l and
0
all other gi, iiio, must be zero.

The parameters and notation above leads to rewriting

2.4.1 as
L rn’Q
P(Xk) = E E P(Xkl2 = a,Tk = b)P(2 = a,Tk = b)

a—l b—l

m

- .E P(Xlek(a.b) = 1)P(Yk(a.b) = 1)

1—1

m

1—l

where 2 = a, Tk = b maps under Yk to A = i,

Pi(Xk) = P(Xklyk = 1), and

Estimating the synchronization now implies finding
i0, the true value of A, for which P(A = i0) = l. The
approach will be to estimate the probability vector G. A
decision on the value of i0 can then be made from the
estimates of G.

If one can obtain a sequence of estimates gi,n for
gi such that gi,n converges to gi, then one can use a

maximum likelihood decision rule for deciding on the

synchronization. The following theorem shows the

32

existence of a set of strongly consistent estimators for
9i (i = l,...,m) which can be computed by a stochastic

approximation algorithm (of. [A-4]).

Theorem 2.4.1
If (a) {Pi(X)}IE=1 is an identifiable [T-l, T—2, Y—3]
family of probability mass functions
_ m _ .
(b) G — {gl,...,gm}, gi a 0, 21:1 gi — l 18 a
finite mixing distribution, and

(c) X (k = 1,2,...) is a Markov chain with the

k
. . . = m
d1str1but1on PG(Xk) Zi=l gipi(xk)
then there exists a sequence of estimators for the

mixing distribution, G, which converges to G with

probability one.

Proof: The sequence of estimators developed in the
proof will have the form of a stochastic approxi-
mation algorithm. The proof follows that of

H. Robbins [R-6] for a related theorem. A member

of the family {Pi(X)} will be denoted by the vector
Pi(X), i = l,...,m which has 2n elements, one for
each of the 2n possible values of the observations
X; recall that X is an n element binary vector.
Yakowitz and Spragins [Y-3] show that identifiability
assures that the vectors Pi(X), i = l,...,m are
linearly independent and span the space Rm (and form

a basis for Rm). Let Hj denote the m-l dimensional

33

(X) P

—j- -1. “—j+l(x)

subspace Spanned by P1(X),...,P.

ooo’EIn(X)o Then

33. (X) = g). (X) + 23. (X) where

g]. (X)€Hj , g]. (X) j_ Hj and gj (X) a! 0. 2.4.3
Define
¢.(X) = P."(X)/2[P."(X)12
J J x 3
so that
i ¢j(X)Pk(X) = 1 if j = k
= o if j 7! k.

The elements of the set ¢j(X): j = l,...,m form a set

of orthonormal components of gj(X).

Now define

n
— +
Z ¢.(X) and g. = [g. ] X [g. ]
k=1 1 k 1,n 1,n ///j-1 j, n

ﬁo|
I:

ll
SIP

where [a]+ = max (a,0).
Hypothesis (c) and the above result imply that
for all k
m
E¢.(X)=E¢ik(X)ZgP.k(X)= Eg.2¢ikjk(X)P(X)=g.

2.4.4

gonal 1
Gram-Sc
Bj-l (X1
an ortl
zation

90nal

    

H-: t
J h

34

Signori [8-3] formulated a theorem, based on
a proof of Raviv [R-2],extending the law of large
numbers to Markov chains. Signori's theorem pro-
vides that when ¢i(X) is a Baire function integrable
with respect to a Lebesque measure on X then 2.4.4

implies that

31 n n’91 with probability one, hence
I

n . O I
gi,n __+gi w1th probab111ty one.
QoEoDo

In 2.4.3, the vector gj"(x) was defined to be ortho-
gonal to Hj; Pj"(x) can be obtained by applying the
Gram-Schmidt orthogonalization procedure to P1(X),...,

P (X).P

-j-l _j+l(X),...,Em(X), wh1ch span Hj' The result 15

an orthogonal basis for Hj' Then the final orthogonali-
zation step finds the portion of gj(X) which is ortho—
gonal to the orthogonal basis of Hj, hence orthogonal to
Hj; this gives the vector gj"(X).

While the estimators developed in Theorem 2.3.1
estimated 20 and T0k and required only one source active
to converge, the estimator of Theorem 2.4.1 estimates
the probability distribution of the synchronization. So

the estimator of Theorem 2.4.1 suggests a decision rule

for learning the probability law which governs a receiver

35

when the receiver's data generation model selects a differ-
ent source for each observation. Additional complications
would be introduced in defining 2 and Tk' Those compli-
cations would motivate the addition of data buffering to
more completely state the data generation model of such

a problem. It is not clear what the practical value of
such a device would be.

Rewriting the definition of g.

to give it the form
1,n

of a stochastic approximation algorithm begins with

_ _ 1 n l n-l
gi,n ' i,n-l = H kil¢i(xk) ' HIT kil¢i(xk’
n-l
-1 £-__1
_ n ¢i(xn) + (n n-l kil¢i(xk)

_ l _ —
— n[¢i(Xn) gi'n_1].
So the recursive expression

— _ — l: _ _

has the form of a stochastic approximation algorithm.
Chien and Fu [C-2] show that, according to Dvoretsky's

theorem, 3: converges to 91 in mean square and with

,n
probability one. Letting B denote an upper bound on the

variance of ¢i(Xk), Chien and Fu show further that the

mean squared error of Si n decreases at least as i for k
I

36

observations and is less than or equal to B/k. This gives
a bound on the rate of convergence of estimators for the
mixing distribution.

2.5 Synchronization and Learning
Unknown Parameters

 

A processor might not be fortunate enough to know
the probability model for the received patterns (given the
synchronization information). A likely situation would
have the receiver see each source transmitting through a
channel having unknown noise parameters. This section
considers such unknown parameters. The functional forms
of the probability distributions for the patterns, given
the synchronization, are assumed known. The parameters
must be learned.

The model by which data are generated is now estab—
lished. Several sources generate binary data in the
manner described at the beginning of this chapter. The
operation of the sources is exactly as described before.
At some point in time the receiver connects to the channel
for one of the sources and receives from that one source
through that one channel from that time forward. Each
source has its own channel, such as would be the case for
two space satellites whose signal paths experience differ-
ent kinds of distortion, and the channel might contain
noise. Each source-channel combination might contain

unknown parameters which will be denoted by G where 2

2"

is the source index. The notation {02}:=1 denotes the

/— ""
l .. __

 

 

37

set of all unknown parameters. As before, the first bit
of the first observation vector is not necessarily the
first bit of a symbol code, so that the unknown synchroni-
zation instant, Tk’ must be learned; Tk is defined
exactly as in the introductory paragraphs of this
chapter.

The received data will be processed as n binary-

digit patterns denoted by X k = 1,2,.... The functional

kl
form of the parameter conditional probability of the re-

ceived pattern is known, so that P(Xkl2,Tk,{O£},Xk-1) is
the fundamental known quantity from which posterior
distributions will be calculated. Recursive formulas for

the posterior densities follow.

k-l k-l
p(xk|{e£}.x )f({GR}IX )

k-l)

 

f ({e£}|xk) =
p(XkIX

£=1’000’L;k=l'2'ooo

k-l k—l
P(xkl2,{oz},x )P(2I{0£},X )

k-l)

 

P(2I{G£},Xk) =
p(xk|{e£},x

2 = l,...,L; k = 1,2,...

k-l k-l
P(Xkl2,Tk,{92},X )P(Tkl2,{9£},x )

k-l)

 

k
P(T 12,{e },x )
k “ p(xk|£.{e£} .x

2 = l,...,L; k = 1,2,...; Tk = l,...,mz

2.5.1

38

Denominator terms come from integrating the numerator over
{9%}, summing the numerator over 2, and summing the numer-

ator over T respectively, for the three equations of

k’
2.5.1. Lower case f represents probability density
function as Opposed to probability mass function. Channel
noise could make possible a continuum of values for com-
ponents of the patterns. However, if a preprocessor were
to convert the received data into binary valued pattern
features, then conditional probability mass functions for
the observations would be appropriate.

If 20 denotes the index of the source being observed,
TO the true synchronization instant, and Go the value of

the parameters for the channel, then the patterns are

distributed according to the conditional density

P(X) = P(XIGO,2O,TO), k = 1,2,....

In this case the system must learn the parameters of the
channel for the source as well as the source index and
synchronization instant. Spragins' theorem can be applied
again to the posterior densities of 2.5.1, with the criti—
cal matter being the existence of strongly consistent

estimators for O , 2 , and T .
o o 0

Theorem 2.5.1

 

If (a) all observations are taken from the channel

connected to the source with symbol code

39

length m2 , synchronization To' and channel
0
parameters Go so that

P(Xk) = P(XkIGO,2O,TO) for k = 1,2,...

(b) members of the family {p(kac%) ,2,Tk)}£’Tk

are distinct

A

then there exist sequences of estimators Gm, 2m,
and Tm which converge to 00' 20, and To, respec-

tively, with probability one.

Proof: The proof used for Theorem 2.3.1 applies
almost identically here. The empirical distribution
function and the extended Glivenko-Cantelli theorem

. A A A m
establish P(Xklem,2m,Tm)-—+P(XkIOO,2O,TO) w.p.l.
Hypothesis (b) then assures that the limiting
é , 2 and T are unique and equal 0 , 2 and T ,

m o o o

m m

respectively.
QOEOD.

Notice that under this mode of operation, information
is obtained about the parameters of only one channel, the
channel over which the symbols are received. In general
the system provides no improvement over the prior infor-
mation about the parameters of the other channels.

In classical unsupervised learning [8-5] the class

identification is random from observation to observation.

40

The k-th pattern there is represented by (Xk,ak), where
ak indicates the parameters of the class being observed
through the k-th pattern. The set {ck} are usually

assumed to be independent identically distributed random
variables for which a set of prior probabilities, P(ak = Z)
must be defined. Over a long run of observations, all
classes are sampled and the parameters of all classes are
learned. By contrast with classical unsupervised learning,
the problem considered here assumes that the class identifi-
cation is fixed, though unknown.

While the formulation of problems of unsupervised
learning of unknown parameters is reasonably straight-
forward, the computational problems in most instances are
severe. Unless the functional forms are exceptionally
convenient to deal with, one is forced to make discrete
approximations to the continuous parameter values. This
leads to the usual difficulties associated with numerical
techniques, with a very difficult tradeoff between the
small discretizing intervals desired for accuracy and the
cost in memory and computing time that results; questions
of roundoff and truncation error cannot be ignored, of
course. Hellman and Cover [H-4] have worked toward a
theory of the computational overhead involved in pattern

recognition algorithms.

41

2.6 Markovian Symbol Selection

Many interesting symbol generation experiments
are modeled better by assuming Markov dependence
than by assuming stochastic independence among symbols.
If the sources described in the single source case were
modeled by first order Markov chains, then what kind of
model would be apprOpriate for the sequence of patterns?
Some insight can be gained from an example.

Assume that patterns of length n = 3 are taken from

a source whose symbol code length is m2 = 3. Let

= 010, X = 110, and T = 2, as illustrated in

xk-l k k
Figure 2.2. The symbol denoted by 10? is incompletely

observed in Xk and might be either 100 or 101. Since

I
0.1 o l1.10 ! ?
I I I I
l6X-———+L——X+—eﬁ

= 110, T = 2, and m = 3.

k-l k k 2
The binary digits underlined by | lare a single

n = 3, X = 010, X
coded symbol. Digits through the k-th pattern are

known.

Figure 2.2 Pattern Dependency with Markov
Symbol Selection

the symbols are generated under first order Markovian
dependence, knowledge of the previous symbol, 101, allows
one to specify the probability that 10x will turn out

to be 100. The distribution of Xk+1 is conditioned on

42

the source, symbol code length, synchronization instant,

X and X is needed.

k k-l' k-l
Consequently, when the source has a symbol code length of

No information prior to X

3 and first order Markovian dependence between symbols,

the sequence of patterns of length 3, {X is also

k}k=l'
a first order Markov chain whose states correspond to
the eight possible values of a pattern.

In order to generalize to all possible code lengths

and pattern lengths for first order Markov sources, observe

that having T > 1 caused part of X to be an incomplete

k+1

observation of a symbol. So m

k

2 - (Tk+l - 1) d1g1ts of the

incompletely observed symbol are in X The symbol pre-

k.
ceding the incompletely observed symbol has m2 code

digits, which, together with the ma - - 1) digits

(Tk+l

2 - (Tk+l g - Tk + l d1gits which have

already been observed and which have to be considered

makes 2m - l) = 2m

to give the conditional distribution of Xk+l° Since

- T + 1 6 2n - T + 1 < 2n

2 k+l k+1

when Tk+1 > 1. As a result, the patterns observed prior

ml 6 n by assumption, 2m

to X have no influence on the prior conditional

k-l

distribution of X Therefore, if the source generates

k+1'
symbols by using a first order Markov process, and if
the source symbol code length does not exceed the ob-
served pattern length, then the observed process can be

represented by a second order Markov chain. A similar

analysis can show that if the source generates symbols

43

by using an i—th order Markov process, and if the

source code length does not exceed the observation vector
length, then the observed process can be represented by a
Markov process of order less than or equal to i + 1. The
fact that an equivalent first order Markov process can be
defined for any higher order Markov process allows one to
conclude that the theorems developed for sources with
independent symbols also apply to Markov sources of any
order. As is well known, however, the number of states
in the first order equivalent of a higher order chain
increases exponentially with the order of the chain, which
presents an effective restraint on the application of the
technique of first order approximation. The effect of
the large number of states shows up both in storage re-
quirements and in the number of Operations that must be

performed.

2.7 Summary

 

It has been shown that the Bayes posterior distri-
butions of the unknown parameters have a converging be-
havior when dependent random variables are observed
and consequently the Bayes decision process is an effec-
tive procedure for deciding the symbol synchronization.
In the process of proving convergence, a strongly
consistent minimum distance estimator for the unknown
parameters is defined, which suggests other decision

procedures which are not pursued here. One alternative

 

 

 

decis
appro
has a
proce
zatio.
of th
decis
sourc(
ing, 1
then 1

tive s

 

44

decision procedure which is pursued uses a stochastic
approximation technique, whose estimator of the parameter
has a linearly decreasing variance. The Bayes decision
process can be applied simultaneously to the synchroni-
zation problem and to the problem of learning parameters
of the source being observed, and the sequences of
decisions on both problems will converge. Finally, if the
sources use Markov processes to select symbols for encod-
ing, rather than stochastically independent experiments,
then the Bayes decision process still provides an effec-

tive solution to the symbol synchronization problem.

CHAPTER III

ERROR BOUNDS FOR DECISION PROCESSES

This chapter follows several approaches to determine
the rate of convergence and probability of error of a
decision procedure described in Chapter II. First, a sub-
Optimum procedure is defined for which the asymptotic error
probability can be determined. This asymptotic behavior Of
the error of the suboptimum procedure is used to describe
the asymptotic behavior of the minimum error procedure.
The second approach uses an information theoretic measure
to define an upper bound on the error probability of the
Optimum procedure as a function of the number of obser-
vations. Subsequently, measures of the hypothesis con-
ditional probability distributions, namely the Bhattacharyya
coefficient and the Kolmogorov variational distance, give
error bounds that are formally elegant but whose asymptotic
properties are difficult to describe.

All of the error estimating and bounding theorems
presented here require the same basic functions, the hy-
pothesis conditional densities of the patterns, although the

processing operations which are specified by each theorem

45

46

vary widely in their computational requirements. The
underlying question throughout is: Is it worthwhile to
compute this bound on the basis of n observations? Any
worth must be measured against the computational costs,
which for some of the theorems tend to offset the payoff
Obtained from computing the error bounds.
3.1 Majority Decision Functions and Error
Probability for Dependent Random Variables

This section uses a subOptimum procedure called the
majority decision procedure to obtain an upper bound on
the error probability of the optimum decision procedure.
Chu and Chueh [C-lO] invented this approach and studied
its prOperties for the i.i.d. case. Here, the technique
will be extended to dependent random variables. An exact
expression for the suboptimum error probability provides
the upper bound for the Optimum procedure. The asymptotic
behavior of the exact expression indicates the asymptotic
behavior of the Optimum probability of error. Details of
the application to both continuous and discrete first
order dependent random variables are presented.

In an m-class decision problem, let G = {tl,t2,...,tm}
denote the m classes, where class i occurs with proba-

bility pi for i = l,...,m and 2m pi = 1. Let {Xk}

i=1
k = 1,2,..., denote a sequence of vector-valued r.v.'s, all
having the same distribution. When discrete Xk are con-

sidered then Pi(xk) stands for the conditional probability

47

mass function for Xk when class i is active. Similarly,

when continuous X are considered, the corresponding con-

k

ditional probability density is written fi(xk)' First order

dependence (l-dependence) of the X is assumed; i.e.,

k
Pi(xk|xl,...,xk_l) = Pi(xk|xk_l) for i = l,...,m.

3.1.1 Asymptotic Error Probability
for Two Pattern Classes

In the two-class case (m = 2), with probabilities

2 +
p1 and p2, let d(X n l) = D[d1(xl)""'d2n+ﬂx2n+l)] where

dk(Xk) depends only on X so that dk(Xk) = t1 or t2 for

2n+l

k

k = l,2,...,2n+l. Then d(X ) is a majority decision

 

function if it follows the decision of the majority of
dk(xk)' Here, dk(Xk) is a mapping from the domain of
values for the k-th r.v. to the set (tl,t2); there can be
a different mapping for each k. The function D, on the
other hand, maps from the cartesian product 92n+l to O.

The decision regions in the Observation space for Xk and

the conditional probabilities of error are:

Sik = {Xk dk(xk) = ti};
“1k = X2ES P.(Xk) — X2 5 X2 Pi(xklxk_l)Pi(xk_1)
k jk kE jk k-l

for i¢j , i,j = 1,2 , and k = l,2,...,2n+l.

3.1.1.1

48

An exact expression for the probability of error
for the majority decision function will be developed with
the aid of random variables which indicate the decision
dk(xk)' These random variables are purposely redundant
in order to provide a clear expression for the error
probability and to facilitate application of the Central
Limit Theorem. The approach largely follows the lead of
Chu and Chueh with appropriate allowances and new defi-
nitions for handling discrete and l-dependent random vari-

ables.

Decision indicator random variables are defined as

U = 0 and V

k k 1 1f dk(X

k)

3.1.1.2

l
w

l and V

C
ll

0 1f dk(Xk) — 2

In terms of these variables, the majority decision is

2n+1
) = t if Z V 2 n+1

x2n+1 1 k
k=l

d(

2n+1
= t if 2 U

2 2 n+1. 3.1.1.3
k=1

k

The distribution of Uk is related to the conditional

error probability as follows:

ll
(1'

k=1|e

49

and
P(Uk = ole = t1) = X2ES P1(Xk) = 1 — “1k
k 1k
so that
= — — E“ ' _ 1_€ —
P(Uk ale — t1) — alk(l alk) , g — 0.1

where alk is the probability of the decision error that
O = t2 when in fact 0 = t1 based on Xk alone.

By a similar develOpment,

_ _ _ v _ l-v _
P(Vk _\k|9 — t2) — a2k(l a2k) ,v — 0,1.

The exact expression for the error probability for the

majority decision on 2n+1 observations is

_ 2n+1 _ _
Pe(d) — P(d(X ) — t2,O — t1)
2n+1 _ _
+ P(d(X )— t1,O — t2)
2n+1
Pe(d) = plP kil Uk 2 n+1IO = t1

n+1
2 v 2 n+1IO = t2 3.1.1.4

 

50

U* 2n+1
Pe(d) = pl 2 kgz p(uk|e = tl,Uk_l)Po(UlIO = t1)
V* 2n+1
+ p2 2 kgz P(VkIO = t2,vk_l)Po(v1|® = t2).
3.1.1.5

 

where Po(° O) is the prior conditional mass function of

the indicated random variable and 2°*

is the sum over all
sequences for which the sum of the indicated random vari—
able exceeds n. The random variables Uk and Vk are both
functions Of single, l-dependent random variables and so
are, in turn, l-dependent random variables.

Equation 3.1.1.5 can be written somewhat more com-

pactly by defining conditional error probabilities

a1k(Uk-l) = P(Uk = 1'9 = tl'Uk-l)
and

“2k(Vk—1) P(vk = lIO — tZ'Vk-1)°
Now, P(Uk = 0|e = tl,Uk_l) = 1 - alk(Uk_l)
and similarly

P(Vk = 0'9 = tZ'Vk-l) = l ‘ a2k(Vk-1)'

51

If one further defines

all(UO) = PO(U = lIO t1) and

“21(Vo) = PO(V = lIO t2) then

these newly defined quantities can be used in 3.1.1.5

to give
U". 2n+1 U l-Uk
Pe(d) = pl 2 kgl alk (Uk-l) l - alk(Uk-l)
V* 2n+1 Vk 1-Vk
+ p2 Z knl “2k (Vk-l) l ‘ a2k(Vk-l) .

The form Of 3.1.1.4, containing probabilities of
sums of l-dependent random variables, leads one to apply
the Central Limit Theorem. In particular, consider the
following factor from 3.1.1.4:

2n+1

P E U

3 n+1]@ = t1 .
k=1

k
In order for the Central Limit Theorem for 1—dependent
random variables to apply [F-4] the following three

conditions are sufficient

l. E(UkIO = t1) = 5 < m must exist for k = 1,2,...

k

2. 2(lukl3le = t1) < w for k = 1,2,...

52

A = A must exist uniformly

1

IIMZS

3. lim l
n

n+m h k+h

for all k, where

A = 2cov1{Uk'Uk+l} + varl{Uk+l}

k

and where covi(-) and vari(-) denote the covariance and
variance, respectively, of the argument when class i,
i = 1,2, is active. The first condition can be seen to

be satisfied by considering the expectation directly,

5(Ukle = t1) = o-Pwk = ole = t1)

+ 1-P(Uk = 1|e = t1) = elk

for k = 1,2,... .

In fact, from the eXpansion of the first moment,
E{Uk|O = t1}, one can see that all moments and absolute

moments of U about zero are equal to alk’ so that con—

k
dition 2 for the Central Limit Theorem is satisfied by
the random variables {Uk}’

The expression for Ak reduces to terms containing

the various conditional error probabilities already de-

fined. The variance term is

_ 2 _ _ —2
var1{Uk+1} ' E(|Uk+ll '9 ‘ t1) Uk+1

_ .- 2— —
‘ 0"1,k+1 a1,k+l ‘ 0‘1,k+1‘l “1,k+1)°

53
The covariance term is

°°V1{Uk'Uk+1} = E{(Uk ' alk)(Uk+1 ’ °‘1,k+1’IO

= E{UkUk+1IG = t1} ’ O‘1k"‘1,k+1°
Writing the expectation of the product,
3{Ukuk+l|0 = t1} = o-o-P(Uk==o,Uk+l - oltl)
+ o-1eP(Uk==o,uk+l = 1|9 =
+ 1-O-P(Uk==1,uk+l = Oltl)
+ 1°1'P(Uk==l,Uk+l = 1|e — t
= p(uk = 1,0k+l = 1|e = t1)
— P(Uk+l 1|0 =«t1,Uk — 1)
pmk - 1|e = t1)
= 2,11; ° ..
Combining the previous three equations,
Ak = 2°°vl{Uk’Uk+l} + varl{Uk+l}
= 2“ (l) ' “1k ' 20‘11:"‘1,k+1 + O‘1,k+1 ' (1 ‘

1,k+l

= t1}

O‘1,k+1)'

54

If P.(X ) is stationary for i = 1,2, then the a.
1 k 1,k
and d(j) are constant for all k and uniform convergence
i'k n 00
1 . . .

of the sequence {# hil Ak+h}n=lls assured. V1ew1ng the
terms in the sequence as sample averages, one can see that
the uniform convergence allows one to disregard an ini-
tial finite number of Ak and the average of the remain-
ing Ak remains unchanged. So when the XR are stationary,
or under any other circumstance in which the sequence of

sample averages on A converges uniformly, one can apply

k
the Central Limit Theorem for dependent random variables

[F-4] to sums of U to Obtain

k

2n+1 2n+1

P 2 U ) n+l|O = t1] + @[n+l, Z k,(2n+l)A

a
k=1 k k=1 1
where A is the uniform limit described in condition 3

and 0 is defined in 3.1.1.6. A similar result applies to

the sum of V in 3.1.1.4 so that the following theorem

k
has been proved.
Theorem 3.1.1.1
If (a) X1,X2,..., are stationary, 1-dependent random
variables under either hypothesis of a two-
hypothesis decision problem,
(b) Uk and Vk are indicator functions, as defined
in 3.1.1.2, for the decision based on Xk, and
(c) Z¢* represents the sum of the function ¢ over

all sequences of length 2n+1 which give a

sum 2 n+1,

55

then the probability of error for the majority

decision function after 2n+1 observations is

U* 2n+1
pe(d) = p12 kgz P(Ukle = t1.uk_1)PO(U1|e = t1)
V* 2n+1
+ pz 2 n P(VkIO = t2,Vk_l)Po(VlIO = t2)
k=2
0* 2n+1 Uk l-Uk
= pl )3 II (11k "(UK-1) EI- - alk(Uk-1):l
k=1
V* 2n+1 vk l-Vk
+ p2 2 II 02k (Vk’l) E'- - a2k‘ Vk-l’] .
k=1
Further,

 

 

2n+1
Lim P (d) = p 0 n+1, 2 a ,(2n+l)A
n+w e l k=1 1k
2n+1
+ p20 n+1, E a2k,(2n+l)B
k-l
where

co

¢(X.u.02) = f (21102)-;5 exP[-(y-uL/202]dyr 3.1.1.6
X

alk and 02k are defined in 3.1.1.1, A is defined

in condition 3 and B is defined similarly to A,

that is,

56

1 n
B = 1im — 2 B and
n+w n h=l k+h
Bk = 2cov2{Vk,Vk+1} + var2{Vk+1}.

The above theorem holds whether the XR are contin-
uous or discrete random variables. For continuous random
variables, the definitions of Sij'aik' alk(Uk-l) and
02k(vk_1) involve appropriate integrals rather than dis-
crete sums. However, the resulting Uk and Vk’ which are
used in the theorem, are discrete in either case so that
the analysis above holds. Further, the theorem implies
convergence of the majority decision function regardless
of the local decision function used on each individual
observation. If the Xk were independent, then the

expression for A would reduce to A

k k = O‘1,k+1(1 ‘ O‘1,k+1)
and Theorem 3 of Chu and Chueh is thereby Obtained as a
special case.

The asymptotic distribution of the probability of
error provides a method for deciding on the number of
observations required to achieve a particular level of
performance. The Central Limit Theorem showed that
2n+1

2 n+1IO = t1] —E+ ¢[n+l,k:l alk,(2n+l)A

Prob ZUk

3.1.1.7

57

and a similar statement holds for sums of the random vari-
ables Vk' If the goal is to make Pe(d) < 5, then one must

choose n at least large enough to force both terms of the

type in 3.1.1.7 to be less than 5, because Pe(d) is the
average of such terms. Defining
2n+1

a' = Z a
k=1

1k

and

n+1 -a'

/(2n+1)A

 

one can use tabulated values to find 8 such that

00

‘§? g exp(-x/2)dx = g

and in turn solve for n in the definition of 8.

3.1.2 Bayes Majority Decision Functions

Application of the previous theorem can require
substantial computing effort to evaluate the conditional
error probabilities, aik' for each decision region. Re-
stricting the decision functions at each step to a class
which is formally similar to a Bayes decision produces
the Bayes majority decision function, whose precise
definition follows shortly. Under the Bayes majority

decision function, the expressions for the mean and

58

variance of the limiting distribution take on a much
simplified form. This simplified form contains a factor
which is derived from an upper bound on the Kolmogorov
variational distance between the class distributions.

In what follows, one needs to use a particular
prOperty of the probability of error of individual
decisions in a two-hypothesis decision problem. Using
essentially the same notation as above, for a discrete
random variable X let P1(-) and P2(-) denote the proba-

bility mass functions for X given O t and O = t

l 2'

respectively. Also let
_ . = = I
S1 — {x . d(x) t1} and S2 S1 .
Then
on = Z P (x) and a = 2 P (x)
l x582 l 2 xeSl 2

are the conditional probabilities of error for rule d.

The overall probability of error, Pe’ is then

P

e plal + pzaz where pl + p2 = 1.

e l

in which case

This implies that min(o:.l dz) 6 P s max(a a2) with
I I

equality holding only if a1 = a2,
a1 = a2 = Pe' Def1ne

59

‘g E m1n(alra2),
a : max(al’a2) and
n E E - g = Ial - a2]

so that n denotes the length of the interval within which

Pe is bounded.

Theorem 3.1.2.1

 

mud
I
an»
+
an:
*

If ilp1(X) - P2(x)l a 25 then Pe s

Proof: Case 1: pl 2 p2. By steps which are
identical to those in Theorem 2 of Chu and Chueh

[C-10] one gets
1 ‘ Pe 3 91‘8 + Pzaz + p1°‘1 = 91‘3 + Pe'

Substituting a lower bound for Pe on the right

gives

and the Pe term can be isolated on the right by

 

*The quantity §|Pl(x) - P2(x)|, or £|fl(x) - f2(x)|dx

when x is a continuous r.v., is the Kolmogorov variational
distance.

60
Next, add 5 to both sides

1 - p16 + E -‘g a Pe + 3.

But Fe + E a 2 P8 and n = E - g so that

l - p16 + n Z 2 Pe

or

1
Pex< '2-(1 - p15 + U)

(l - g + n).

NIH

and if p1 2 p2 then pl ;,% so P s

Case 2: pl < p2. In this case 1 - Pe 2 p26 + Pe
and an analysis of the type above with pl replaced

by p2 again produces the conclusion of the theorem.

Q.E.D.

Still considering the two-class decision problem, a

Bayes majority decision function is a majority decision

 

function such that for every k = l,2,...,2n+l, the

decision regions, Sik' are defined as

s (xk)} and s = s '

lk = {xk ‘ qlkPlk(xk) > qZkPZk 2k 1k

where qlk’qZk 2 0 and q1k + q2k = l. The qik need not be
the true probability, pi, that O = ti’ nor need they be

the Bayes posterior estimates of pi.

61

The least favorable distribution of O with respect
to P1k(-) and P2k(-) is defined to be that set Of values
for q1k and q2k wh1ch m1n1mizes nk = ak - Ek’ where
ak = max(alk,a2k) and 2k = m1n(alk,a2k) and alk is de-
fined in 3.1.1.1. This distribution is least favorable
in the sense that, since 2k is a lower bound on
Pe[dk(xk)), and since minimizing nk implies maximizing 3k
(because alk and 02k are measures of complementary regions
in the sample space), then minimizing nk maximizes the
lower bound which 2k places on Pe(dk(xk)). However,
minimizing nk does minimize the general upper bound of
the previous theorem. For continuous random variables,
the minimum value of nk is zero, provided that f1(x)if2(x).

In the case of continuous random variables, the

following theorem results:

Theorem 3.1.2.2

2n+1 is a sequence of l-dependent, con-

If (a) x
tinuous random variables;

(b) if d(x) is the Bayes majority decision
function such that for k = l,...,2n+l, q1k
and q2k are the least favorable distribution
of O with respect to flk(xk) and f2k(xk); and

(c) if for every k, fIf1k - f2k| > 26 > 0

then

Lim Pe‘d) s 1im <I>(n+l,(2n+l)e,3(2n+l)e(l-e)) w.p.l

n+°° n+oo

where

62

. Consequently, Pe(d)+0as n+m, w.p.1

.Mo:

6 = % -

Proof: The least favorable distribution for con-
tinuous random variables makes alk = a2k and n = 0,

so by the above theorem aik 6 e < %, i = 1,2. The

d can be replaced by e in Theorem 3.1.1.1 and the

ik

variance terms are bounded above by 35(1-6). That

is,

Ak = 20‘11<("‘1,k+1(1) ‘ O‘1,k+1) + O‘1,k+1‘l ’ O‘1,k+1)
s 26(1-6) + 5(1-5) = 3€(l-€).

Consequently
1 n

Lim — Z A 6 38(1-8).

n+m h=1 k+h

A similar approach holds for Bk and B. Thus,

both normal distribution functions in 3.1.1.6

are bounded above by

n+1 - (2n+l)e
/3(2n+l)e(l-€)

0,1 .

 

¢ n+1,(2n+1)€,(2n+1) - 36(1-8) = 4

 

Since 6 < %, the first argument is positive and

32

increases as n so the theorem follows.

Q.E.D.

63

One encounters difficulty in attempting to apply
the above proof technique to a similar theorem for dis-
crete random variables. In particular, since n¢0 in the
definition of least favorable distribution, it is not
necessarily the case that proper choice of qlk and q2k can
force both a. < A for i = 1,2. As an example, consider

1k 2

the following two discrete distributions on {0,1}.

0.8

Pl(0) 0.9 P2(0)

ll
0
H

ll
0
O
N

P1(1) P2(l)

The full set of possible partitionings of the sample space
into decision regions, and the resulting conditional error

probabilities are:

d1(xk) d2(xk> d3(xk) d4(xk)
sl {0.1} {o} {1} {<1}
52 {4:} {1} {0} {0,1}
a1 0.0 0.1 0.9 1.0
a2 1.0 0.8 0.2 0.0

 

 

 

 

For this example, no decision function exists for which

both 01 and a2 are less than %. Consequently, substitution

64

of an upper bound for aik’ i = 1,2, in 3.1.1.6 is fruit-
less.

However, for sample spaces which can be partitioned
so that ai < %, i = 1,2, the above theorem extends to
discrete l-dependent distributions.

3.1.3 Asymptotic Error Probability for
m Pattern Classes

In the m-hypothesis case, 0 = {tl,...,tm}, the maxi-
mum likelihood-Bayes decision regions for a single Obser-
vation are defined as

T. ={x:p.f.(x) =max p.f.(x)}, i=1,...,m
l 11 j=lpooopm J J

and the pairwise decision regions are

= {X : pifi(x) a p fj(X)} ilj = lice-rm.

S..
13 J

In the expressions above, fi(x) is the conditional proba-
bility density of x given 0 = ti. The error probability

for the Bayes decision procedure is

P (Bayes) = X f p.f.(x)dx + f p.f.(x)dx .
e i<j Ti 3 J Tj l 1

. , .
Since Ti_<:_: Sij and T3. S Sij’ it follows that

Pe(Bayes) S .2. f pjfj(x)dx + f. pifi(x)dx .
1<j S.. S..
13 13

65

Consider the problem of deciding O = t1 vs. 0 # ti
by a Bayes majority decision function when there are m
classes. That is, m such two-class partitionings can be

made. Under the hypothesis 0 = t1' x has the density

fi(x). Under the composite hypothesis 0 # ti’ x has the

density
1 m
f’.‘(X) : _ Z p.f.(X).
1 1 pi j=l 1 j
j#i
Let
I
Sik = {Xk ‘ qikfik(xk) > qikfik(xk)} and Sik = Sik

where qik’qik a O and qik + qik = l. The decision

for xk is: Decide O = ti if xkesik and decide 0 # ti other-
wise. Then the above theorem says that if for all k,

flfik - f?

1k| a 26 > 0 then

Lim Pe(d) s Lim¢(n+1,(2n+1)e,(2n+1)t(1-e))
moo n+oo

where e = % - %. Define the compound Bayes majority

decisionyprocedure, dCB’ by making Bayes majority decisions

 

for the mmdecisions O = ti vs. 0 # ti’ 1 = l,...,m, and
deciding O = tj where j is the value of i for which the
two-way Bayes majority decision was 0 = ti, provided that
exactly one such j exists. Otherwise, no compound decision

is made, but another Observation would be taken.

 

 

66

Theorem 3.1.3.1
If (a) x is a sequence of 1—dependent Observations
in an m-class decision problem, and
(b) if for all i = l,...,m and k = 1,2,...,

flfik - f. 2 25 > o

1kl
then

{II-J “]

lim P (dCB) < (m—1)1im[%(n+l,(2n+1)€p3(2n+l)€(l-€){]2

n+oo n+oo

a.e.

on»

where e = % -

 

Consequently 11m Pe(dCB) = 0 a.e.

n+oo

Pgoof: Of the decisions which dCB can make, m-l
represent errors, and each of those results from
exactly two errors in the two-way Bayes majority
decision procedures. Since the above theorem applies

to the probability of error of the Bayes majority

decision procedures, the result follows.
Corollary: Under the same hypotheses

lim Pe(Bayes) s (m-l)lim[%(n+l,(2n+l)€,3(2n+l)€(l‘€)i]2

11+co n+m
a.e.
Proof: The Bayes decision procedure minimizes the
error probability, so the error of any other decision
procedure, such as dCB' provides an upper bound for

Pe(Bayes).

67

3.2 Bounding the Probability of Error
After a Finite Number of Samples

 

In studying m-class hypothesis testing using de-
pendent random variables, the results presented so far
have shown the existence of a strongly consistent mini—
mum distance estimator for the pattern class, that Bayes
rule calculations of posterior distributions converge
in the sense that eventually the mass of the posterior
distribution lies at the point corresponding to the correct
class index, and that the error probability of a suboptimum
procedure--and consequently Of the Bayes procedure-~vanishes
as the number of Observations grows without bound. Now it
will be shown that an upper bound exists for the probability
of error of the Bayes decision process, and that the upper
bound decreases exponentially as the number of obser-
vations increases. An exact expression for the upper
bound is obtained as a function of the number of patterns
observed. The expression is derived from information
theoretic considerations, and it turns out that the inter-
class Bhattacharyya coefficients are factors of terms in
the bound.

This presentation will use the general notation which
applies to the class of m-hypothesis decision-making
problems for which there is first order dependence be-
tween successive Observations. The symbol synchronization
problem with unknown source code length is a member of

that class. For convenient reference, the Specification

68

of the m-class decision-making problem and the notation
to be used will be briefly reviewed. This is the same
problem and notation which was described in detail
earlier.

Let X1,X2,...,X be a sequence of identically

k""
distributed discrete* random variables having first order
dependence. The common distribution of Xk’ k = 1,2,...,
depends on the pattern class, indexed by O = {t1,t2,...,tm}
and the prior probability for the pattern class j is

given by Prob(0==tj) = p. > 0. The parameter conditional

3
distribution of the Xi will be written

Prob{Xk|O tj} = Pj(Xk), j = l,...,m;

k

1,2,...,

and the first order dependence provides that

k—l)'

Pj(xk|xl,x2,...,xk_l)

= Pj(XkIX

It is assumed that each class has a unique probability
d1str1bution; that 13, Pj(xklxk_1) and Ph(Xk|Xk-l) are not
identical for all values of the arguments when j#h.
3.2.1 An Information Theoretic
Approach

After k Observations, the amount of information

contained in the sequence of random variables

 

*The arguments to be set forth would apply to con-
tinuous random variables as well.

69

Xk = X1,X2,...,Xk about 0 is the mutual information

1k = Ik(xk:0) = H(0) ' E(H(9|Xk))r

where H(O) is Shannon's entropy:

H(O) =
j

utna

9:) 1°9 (31‘)
1 j

and pj = Pr0b(9=tj) as described above. Similarly

 

H(OIXk) =

"HHS

P e=t.|xk log 1 k .
3 1 3 P(O=tj|X-)

Logarithms are to the base 2, and throughout this chapter,
E(-) denotes the expectation with respect to the joint
distribution of Xk.

The develOpment which follows is based on work done
in 1964 by A. Rényi [R-3]. Rényi described the behavior
of the average entropy for independent random variables
and used his results to show the almost sure convergence
of a decision procedure which was similar to the maximum
likelihood decision procedure. Hellman, Raviv and others

have called the expectation of the entrOpy E(H(lek)),

the equivocation. Hellman and Raviv [H-S] showed that

 

for the Bayes decision procedure, the probability of
error, PB(e), is bounded above by one—half the equi-

vocation, i.e.,

70

PB(e) \< %E(H(lek)).

They showed also that in the i.i.d. case,
3(H(e|xk)) s K(p**)k

where p** is defined as follows.

pgj = inf 2 Pi(x)aP.(x)l—a and
Osasl x 3
p** = max pt.
i¢j 13
When a = %, the argument of the infimum is called the inter-

class Bhattacharyya coefficient, so that Hellman and Raviv's
result for i.i.d. random variables defines an upper bound
on the error probability of the Bayes decision procedure
which is an exponentially decreasing function of the maximum
interclass Bhattacharyya coefficient. In order to Obtain
related results for dependent random variables, two lemmas

must be established.

Lemma 3.2.1.1
(Rényi) A universal constant C > 0 exists such
that for any set p1,...,pm of positive numbers

forming a probability distribution

71

The logarithm is to the base 2.

Proof: This is Rényi's proof, although he used

 

 

=1
a 2.
x loglﬁl (l-x)1og[i%;]
Both a and a are continuous
x x
in [0,1].
Define
xloglil (l-x)log[I%§]
C1 = max a and C2 = max a
Osxsl x Osxsl x

for 0 s a 6 1

Then by breaking the entropy expression into two

parts, one gets bounds on each part of

 

l
m m p. log[-—]
1 .
.2 Pj 109(57] = .2 3 p3 p? s c ? p9
j=2 3 3:2 a j 1 .=2 3
Pj 3

and

72

 

 

 

 

1 m 1
pl log —— = 1 - X p. log
P -= J m
l 3 2 1 - 2 p.
-_ J
3—2 ,
’ 1
m m
l - : pj log 1 - .: pJ m a
= J l. 3 2 2 p
m a ._2 j
2 p. 3-
j=2 3

 

M

0

n:
T7
":38
N

”'0
L14

Q

I

O

N
LJ
"has
N)

”O

U Q

The lemma follows with C = C

Q.E.D.

Lemma 3.2.1.2

 

If Y1,Y2,...,Y is a sequence of random vari-

kl...
ables having the Markov prOperty, and if E(Yk|Yk_l)

exists and is bounded for all k, say

E(Yk|Y s K , k = 2,3,... w.p.1

k-l)

then

k k-l
E 2 Y. s E(Y1)K where K = E(Y2|Yl).
i=1 1

Proof: Expanding the expression for the expectation
of the product of k random variables having the

Markov prOperty gives

73

k
E .E Yi = 2 ...2 Yl-YZ...YkP(YkIYk_1)...P(y2|Yl)P(Yl|Y0)
1—1 Yl Yk

2 '.'§ Y1...Yk_1§ YkP(YkIYk_1)P(Yk_1IYk_2)...
1 k-l k

P(YZIY1)P(Y1IY0)

Y1...Yk_lE(Ylek_l)P(Yk_llYk_2)...

2
Yk-1

P(YZIY1)P(Y1|YO)

/A
M
O
O

yl. . 'Yk-1KP(Yk-1|Yk-2) . . .P(Y2|Y1)P(Y1|YO)

Oz
Yk-l

This process repeats k-2 more times to yield the
lemma.
Q.E.D.

These lemmas will be used in proving the following

theorem.

Theorem 3.2.1.1
If (a) O is a discrete random variable taking on m

different values tl,t2,...,tm with positive

prior probabilities pj Prob(O = tj),
j = 1,2,...,m;
(b) the discrete random variables X1,X2,...,Xk,...

have, for each j, identical conditional, given

74

O = tj, distributions with the Markov
property; and

(c) the conditional joint distributions of
Xk_1,Xk g1ven O = tj versus 0 = th are
different for each j#h, i.e., there is no
value Of Xk_1 such that Ph(Xk|Xk_l)

then there exist positive constants A and q < 1

such that
O s E(H(O|Xk)) s Aqkm1 for k = 1,2,...
h k - *

Proof: Letting 9h denote the subset of the full

probability space, 9, on which 0 = t (h = l,...,m),

h
the equivocation is expanded in terms of the parameter
conditional equivocation as follows-**

3(H(e|xk)) = phE(H(OIXk)IQh) 3.2.1.1

"baa

h 1

The Bayes posterior distribution of 0 given Xk, which
is needed to evaluate the entropy factors in 3.2.1.1

is:

 

*Korsh [K-6] proved a similar theorem using a
different proof.

 

 

**The notations E(- 0h) and E(° O = th) mean the same
thing. They are alternative notations for the same con-
ditional expectation.

P(e = thXk) = i

. k P. (X. |x. )
< —l H i 1 3.2.2.2

ph i=-l Ph(xi [xi-1)
where the strict inequality holds for all finite
k and h = 1,2,...,m. Here, Pj(X1IXO) stands for

the prior conditional probability of X the first

ll
pattern.
Now the entropy expression can be expanded,

Lemma 3.2.1.1 applied to the expansion, and then the

bound in 3.2.1.2 applied to the result:

 

 

 

m
H(O|Xk) = 2 P(O = t.IXk)1og 1 k
j=l 3 p(0 = tle )
m
s c 2 [P(O — t Ix )]
i=1
j#h
m p ak Pji(X|Xi_l)0L
< C 2 H Ph (x. X. )
j=l ph i=1 1l i- -1
j#h
for 0 s a s 1. 3.2.1.3

The function Pj(Xi|Xi_l) depends on the random vari-

able Xi_1, so one can take the expectation of the

76

bound in 3.2.1.3 given 9h, recalling that Qh specifies

the distribution of Xk. The expectation wanted is

m [oiJOE E [Pj(X1|Xi_l)]d 9h .

 

k
EH(GIX)|Q]<C 2
1 h j=l ph i=1 Ph1inXi-l1

 

3.2.1.4

The rest of the proof proceeds using a = % for the
sake of a helpful cancellation which develops. Lemma
3.2.1.2 can be applied to the E{...} on the right of

3.2.1.4 as soon as the K bound is demonstrated. In

particular, it will be shown that

P.(X2|Xl) s
E 3 9h < l to give a bound on the

Ph(x21xl)

 

equivocation which decreases monotonically with k.

 

 

9h

E Pi(X2|Xl) 3
Ph1lexl)

P.(X IX ) k
2 1~ 2 1 P (x ,x )
xl.x2[Ph(X2[x1)] h 2 1

= 2 Pj(x2'xl)|Pj(Xl) %P (x .x )
Xl'xz Ph(x2,x17]Ph(xl) h 2 1

 

P-j—Thml) 15 ( ) ( ) 1‘5
= Z P. X [X P X Ix
Xl'xz Pj Xl [ j 2 l h 2 1 J

Ph(xl) 1: .ﬁ
=§ 3.7377 3; [Pj (XZ'X11Ph(x2'Xl))
1 2

77

2 [Ph(X1) * 3
< Z P.(X ,X )2 P (X ,X )
X1 szxl) X2 3 2 1 X211 2 1

3.2.1.5
The inequality in 3.2.1.5 is an application of the
Schwarz inequality. Strict inequality is a result
of the assumption of unique parameter conditional
distributions, i.e., the condition for which equality
would hold in the Schwarz inequality is equivalent to
Pj(X2|Xl) E Ph(X2IXl) for some j and h, j#h. The
right side of 3.2.1.5 reduces to 1, so there exists

a qjh’ 0 s qjh < 1, such that

E [pj(x2|xl)]a

< 1 3.2.1.6
Ph(X2|X1)

“h = qjh

 

Letting a = % in 3.2.1.4 and applying Lemma 2 with

K = q results in 3.2.1.4 becoming

 

 

jh
m p- % P-(X ) h -
E(H(lek)|0h) < c 2 —l s 517217 a q1h1 .
j=1 ph h 1 3
Also, 3 h 3.2.1.7
P.(X)15 P-(X)1§ 1
l _ l _ 1
E [FﬁTXIT] 9h - i [PﬁTXIT] Ph(Xl) - i [Pj(x1)Ph(Xl)]
1 1
= pjh < 1 3.2.1.8

This expectation turns out to be precisely the inter-
class Bhattacharyya coefficient [H-S, K-2, L-3] for

the prior distributions of the first Observation, X1.

78

As in 3.2.1.5, the strict inequality in 3.2.1.8 is
from the Schwarz inequality and the unique parameter
conditional distributions. Now defining

q A max (q.h) < l and using this q with 3.2.1.7
1sj<h

and 3.2.1.8 in 3.2.1.1 gives

m m p. 8 _
E(H(OIXk)) < 2 ph c 2 —1 p.hqk 1 3.2.1.9
h=1 j=1 ph 3
jﬁh

This proves the theorem with

A = c 2 2 (p.ph) p.h < C(m-l) 3.2.1.10
h=1 j=1 3 3
j#h
Q.E.D.

In Rényi's derivation for the case of independent
random variables q turns out to be the maximum interclass

Bhattacharyya coefficient.

Corollary: Under the hypotheses of the theorem, the
probability of error for the Bayes decision pro-

cedure is bounded by
1
P(e) S E Aq , k = 1,2,... 0 3.2.1011

where A is defined in 3.2.1.10 and q is the maximum

of the qjh defined in 3.2.1.6.

79

Proof: Hellman and Raviv [H-S] showed that the
error probability for the Bayes decision procedure
is bounded by % the equivocation which this theorem

shows is bounded in turn by Aqk-l.

Q.E.D.

In the argument between 3.2.1.4 and 3.2.1.5,
a value of a = % was chosen in order to argue that qjh < l
in 3.2.1.6. With that accomplished, one can establish a
tighter bound on the probability of error by defining, as

an alternative to qjh in 3.2.1.6,

 

Q

Pj(X2IX1) a
h

Ph(szX1’

 

q.‘ = inf E
0<a$1

NIH

Since the definition of qjh exhibits a value of a, a =

for which the expectation is less than 1, then

I
qjh s qjh < l

Letting 3 denote the value of a for which the infimum
is obtained, and following the subsequent steps of the

theorem gives an alternative to 3.2.1.9.

2(H(0|xk)) <

"P43

p
1 h j=1

J#h

c 2 31-5 (‘)( ')n‘1 3 2 1 13
h ph pjh a q . . .

where

80
q1 = max(q.ﬁ) and
Ji‘h 1

— _ a 1-5
1

This approach gives the result of the theorem with q'

defined differently from q and

>

ll

('3
H043
m.u015

While 3.2.1.13 gives the strongest version of the theorem
obtainable with this approach, there is no straightforward
general algorithm for computing 3, so that 3.2.1.9 is
probably easier to use except when the distributions have
a convenient form. Also, pjh(3) is not necessarily

algebraically less than pjh(%) so 3.2.1.9 might

= thr
give smaller values than 3.2.1.13 in some cases.
3.2.2 Bounds Based on the Distance
Between Distributions

Several investigators [K-l, K-2, L-2] have used the
Bhattacharyya coefficient to bound the probability of
error of a maximum likelihood rule. Kadota and Shepp [K-l]
and other statistical literature call this quantity the
Hellinger integral. This coefficient is an inner product
of two hypothesis conditional probability densities. Using

0 to denote the Bhattacharyya coefficient for two—class

81

hypothesis testing, the definition of p for discrete

distributions is

_ ‘1
p — (Pl(x)P2(x))

Z
X
where Pi(X), i = 1,2, is the mass function under hypothesis
Hi. From Schwarz's inequality one can see that 0 < p 6 1

since

2P(X)P(X)1"1s 2P(X)P(X)1”’=1

and the arguments are non—negative over their domains.
Several functions of p can be used to describe a "distance"
between the density functions, with - log 0 being a
favorite since it is non-negative and - log 0 = 0 when
P1(X) E P2(X). Since the Bhattacharyya coefficient has
played such a significant role in the recent literature,
several results will be presented in order to show its
role in providing error bounds for m—class maximum likeli-
hood hypothesis testing using dependent random variables.

In m-class hypothesis testing with m > 2, the inter-

class Bhattacharyya coefficient, pij’ is given by

= g ..=
p.. (Pi(X)Pj(X)) for 1,3 l,...,m.

Z
13 X

82

The derivations of error bounds related to the
Bhattacharyya coefficient are designed by treating all of
the Observed patterns as a single pattern, rather than
defining a bound that is a function of the number of
patterns. This is in distinct contrast to the kinds of
bounds derived by using either majority decision functions
or equivocation. Further, it changes the mathematical
treatment of the dependence to a matter Of the computation
of 0, making that computation more complex as the number
of observations increases.

In order to use the Bhattacharyya coefficient
approach to bound the error probability after the k-th
pattern in a sequence one must let the argument of Pi(')
be Xk, a sequence of real variables whose possible values
are the set of possible sequences of the first k patterns.
When the patterns have first order dependence, the proba-
bility pi(xk) is given by

k

Pi(Xk) = 1

Pi(xylxy-1)

where Pi(Xl|X0) is the prior probability of X1 under
hypothesis Hi’ and Xk = X1,X2,...,Xk. Then the interclass
Bhattacharyya coefficient between class i and class j
after k Observations is given by
k

= 2 H P.(X IX _
Xk y=1 1 Y Y 1

(k)

k
on P. X 3.2.2.]-
913 ) 3(

ley_1)

 

83

When each pattern, Xy' has n binary digits then X
k

Y
has 2n possible values, and the sum in 3.2.2.1 has 2n

(k)
ij
summing m(m-l)-2nk terms. While this technique does not

terms for each 0 . Computing all pig) for i < j requires
provide a closed form solution for the error bound as a
function of the number Of observations, it does suggest
computerized experiments to determine a sequence of error
bounds, subject to whatever limits exist on computing

resources .

(k)

The computational technique for evaluating pij can

 

take a recursive form evidenced by expanding 3.2.2.1 as

(k) _ 8
pij — 2 2 (Pi(xklxk_l)Pj(xk|xk_l)) .
x x
k k-l
%
2 (Pi(Xk_1IXk_2)Pj(Xk_lIXk_2)) ...
x
k-2
%
2 (Pi(X3|X2)Pj(X3IX2)) .
x
2
t
X (Pi(X2IXl)Pj(XZIXl)Pi(XlIX0)Pj(XlIX0))
1

3.2.2.2
Defining the factors in the equation above

_ %
rij(X2) — é (Pi(x2|xl)Pj(lexl)Pi(x1|xo)Pj(xllx0))

1

84

and
r..(X ) = 2 (P.(x |x )P (x |x ))5r (x )
1] k X 1 k ,k-l j k k-l ij k-l
k-l
for k > 2
gives
(k) _
pij — Z r1j(Xk) for k 2 2
X
k
. . . (k) -
W1th th1s techn1que, one can compute the pij sequent1ally
by saving the 2n values of rij(Xk) at each stage for use
at the next stage.
The use of pig) in computing error bounds will now

1]
be considered. The quantity pij for a single observation

appeared in Theorem 3.2.1.1 and it was pointed out that

for the i.i.d. case the theorem became

E[H(OIXk)) s Ap*k

where p*Emax pij and the superscript represents exponen-
i753
tiation. If one starts with the approach of the previous

(k)

theorem and attempts to include pij in the expression for

the error bound, then one obtains the following theorem.

Theorem 3.2.2.1

 

Under the hypotheses of Theorem 3.2.1.1,

m

1: (k)
X . .
j=1(pjph) oh]
j#h

2(H(e|xk)) < c

IIMB

h 1

 

‘ .AX.‘

I~.-.
1

85

Proof: The proof commences in the same manner as

before through 3.2.1.4. Following 3.2.1.4,

 

 

 

 

 

 

h ph

E g :3;:1|:1' ; 8 ab = 2 E :3::1::1’1: LSPh(xk)
i=1 h 13 i-l Xk i=1 h 1 1-1
1‘ I»;
= 1k i:l(Pj(XiIXi-11Ph(XiIXi-l)) = phj(k)
x
so that
m p. %
E{H(O|Xk)|0h} < C 2 —l] ph.(k)°
j=1 ph 3
j#h
Therefore
k m m 31 3 (k)
E(H(O]X )) < E phC Z phj

1 j=1
j#h

which is the result stated in the theorem.

Q.E.D.

While this does show the role of the interclass
Bhattacharyya coefficient for k Observations, it is not

as tight a bound as Lainiotis [L-3] has achieved working

from Chu and Chueh's premise that

P < 2 P (iqj) .
e i<j e

 

86

Lainiotis' upper bound on the error probability of the maxi-

mum likelihood decision is given in the following theorem.

Theorem 3.2.2.2

In the m-class hypothesis testing problem with prior
probability pi of hypothesis Hi and conditional proba-
bility Pi(X) under hypothesis Hi' the probability of

error of the maximum likelihood decision is bounded

 

above by
aij -aij dij l-aij
P s 2 p. p. XP.(X) P.(X) 3.2.2.3
e i<j 1 j X 1 j
for 0 S 0.. s l and i,j = l,...,m.

1]

Proof: The proof is given by Lainoitis [L-3] and

will not be repeated here.

For maximum likelihood decisions using the Bayes
posterior distribution after k observations on

dependent random variables, the argument X in

3.2.2.3 becomes xk, and the Pi(Xk) are the Bayes

posterior distributions.
While there is no general way to choose the

aij in 3.2.2.3 to minimize the bound, one is free

to choose any convenient value in [0,1] for aij'

When the hypotheses have equal prior probability,

choosing aij = % for all i and j, and applying

Lainiotis' theorem gives

87

'U
IA
5““

E p..
i<j 13

for a single observation or

P s 1 2 p..(k) 3.2.2.4
m <

for the decision after k observations, which compares

with Pe < C(m-l)pij(k) in Theorem 3.2.2.1. Even using ;

3.2.2.4, there is the possibility that the bound 2

 

is greater than 1. There are m(m-l)/2 terms in the

sum in 3.2.2.4, and if all Of the pij or pij(k)

very close to 1, then the bound can be close to

ul‘O‘ ( ['1

are

(m-l)/2. However, in the case described, all of

the distributions are nearly identical, and one

would anticipate difficulty in making decisions.
Lainiotis' theorem agrees with results pub—

lished by Kailath [K-2] and Kadota and Shepp [K-l].

Kailath's upper and lower bounds for the two class,

equal prior probability, maximum likelihood decision

are

ubl l-'
'0
IA
'U
/A

NI H
'0

while Kadota and Shepp retain the prior probabili-

ties, p1 and p2, for the bounds

1 . 2
-2-m1n(P1,p2)o \< Pe s “P p o-

88

G. T. Toussaint, in a recently published paper
[T-3],derived an upper bound on the interclass
Bhattacharyya coefficient by using the Kolmogorov
variational distance, which might be easier to
calculate than is the Bhattacharyya coefficient.

The Kolmogorov variational distance between two

hypothesis conditional distributions is defined as

 

Vij = §|P1(X) - Pj(X)

The theorem is as follows:

Theorem 3.2.2.3

 

(Toussaint) In m-class hypothesis testing using
maximum likelihood decisions with equal prior

probabilities, the probability of error is bounded

by

"U
/A
5|“

2 (l - V../2)
i<j 13

Proof: Using Chu and Chueh's upper bound

P s X P (ilj)

e i<j e
with
Pe(1,j) = §m1n{piPi(X);ijj(X)} 3.2.2.5

and

 

:l-I’. . .. .-

89

l . .
pi = pj = a for 1,] = l,...,m
gives
P < l 2 2m1n{P (X)-P (X)} 3 2 2 6
e ‘ m . . i ' j ° ' ' ’
1<j X

‘_'1
5

Kailath [K-2] shows that

. . l
Pe(l,j) = pi - 5 leiPi(X) - p.Pj(X)| 3.2.2.7

X J

 

One can use 3.2.2.5 on the left of 3.2.2.7, so that

for pi = p3. = %, the prior probabilities factor and

cancel leaving 3.2.2.7 reduced to

. _ _ 1 _
m1n{Pi(X),Pj(X)} — 1 2 iIPi(X) Pj(X)I
= 1 - % v.. 3.2.2.8
1]

Using 3.2.2.8 in 3.2.2.6 completes the proof.

Q.E.D.

The functional form of the Kolmogorov variational
distance prohibits the type of factoring used in
3.2.2.2 to compute pij(k)' So when the argument of
the hypothesis conditional probability is Xk, the

sequence of k observations, the storage requirements

for Pi(Xk) grow exponentially with k.

90

3.3 Summary

The majority decision procedure is designed to allow
one to apply the Central Limit Theorem to describe the
error probability as the number of observations grows with-
out bound. The behavior of the resulting normal distri-
bution function is well known, and tabulated values are
readily available if one wants to obtain an estimate of
the error probability in a specific case. There is no way
to compute a number of Observations (finite) which is
sufficient to guarantee that such an estimate is an upper
bound on the error probability. Computationally, this
procedure has the advantage that, once the single pattern
variational distance is computed, arguments for the limit-
ing distribution can be easily hand calculated.

Consideration of the equivocation of a sequence of
Observations provides the valuable insight that the expec-
tation of the amount of additional information that can
be Obtained by taking additional samples goes to zero as
the number of samples increases. The functional form of
the equivocation, coupled with the probability of error
expression for the Optimum processor described in Chapter
II, defines a computable upper bound for the error proba-
bility--which was missing in the first approach using the
Central Limit Theorem. Once the parameters of the bound
have been computed--generally a job which requires the

aid of a digital computer-—one can use a slide rule to

 

91

determine the number of Observations required in order for
the error bound to assure a given error level.

The error bounds obtained from the Bhattacharyya
coefficient and the variational distance appear to be
easier to calculate than the equivocation bound, but have
the drawback that the bound might exceed 1 regardless of

(k)

the number of observations. Also the computation of pij

and Vij' when the distance is between Pi(Xk) and Pj(Xk),
can be as burdensome as computing the quantities needed
for the equivocation bound. The recursive method
described for the Bhattacharyya coefficient produces an
algorithm for which the computation increases linearily
with the number of Observations, and the storage require-
ments are fixed. The variational distance bound, on the
other hand, forces exponentially growing amounts of compu-

tation and storage as one tries to compute the bound for

successively larger numbers of observations.

 

”a“ ." .

CHAPTER IV
EXAMPLES OF DECISION PROCESSES

AND ERROR BOUNDS

The results presented in the previous two chapters

suggest a wide variety of test cases to illustrate the

 

theory involved. Several such tests have been simulated E
on the CDC 6500 computer, and the results are summarized

in this chapter. The first example shows how the processor
learns the symbol code length of the source when the
sources have different code lengths and there is no

channel noise. Next, a noisy channel--the binary sym-
metric channe1--is used, and the sources all have the same
code length. In this latter case, only the symbol distri-
butions distinguish the sources, and the channel error

rate is unknown. Decisions are compared with decisions
made using a minimum distance estimator. Computational
overhead for computing Bayes posterior conditional distri-
butions from the l-dependent Observations is examined, and
a processing technique is proposed which reduces the com-
putational overhead to the amount one would require if

the Observations were independent. The proposed processing

92

93

technique is used on the binary symmetric channel with
favorable experimental results. The convergence and
error probability theorems of Chapters II and III do not
apply to this subOptimum technique.

Finally, the various error bounds presented in
Chapter III are compared for a Specific example of distri-
butions for three sources. The upper bounds in particular
indicate a requirement for a large number Of observations
to obtain a desired error rate, although the limiting
distribution derived from the majority decision function
suggests that low error probability could be obtained from
relatively fewer observations.

4.1 The Bayes Decision Process Applied
to the Symbol Synchronization Problem

Experiments programmed for the CDC 6500 computer
illustrate the behavior of the posterior distributions
used in the optimum decision procedure described in
Chapter II. Figure 4.1 shows the results of one set
of experiments. In these experiments each of three sources
used a different symbol code length--one, two and three
binary digits, respectively. The probability law govern-
ing the i.i.d. symbol selection process for the respective
sources is shown in Table 4.1. The receiver was simu-
lated by making observations on patterns containing 3
binary digits, the smallest pattern length that could be

used and still satisfy the requirement that the length

 

t1-I'.- ,~

94

     
    
 

 

 

 

 

     

 

 

 

1.0 ' r“—‘\"_-_-—
0-9 b // Source 1 active (2 - l)
008 ’— /’
P(2 - zlxk) ,
0.7 I
0.6
0.5
0.4
0.3
0.2 b- \\ I ’
\vl .
\Pu - 1|x") \.
‘ , I
3.: 1‘~"1 1 1 1 1 1.«~._ 1 l 1 1 1‘~
2 4 6 6 10 12 l4 16 18 20 22 24 26 28 30'
Number of Observations
Figure 4.1 (a) Posterior Distribution of the Source Index Parameter
1.0 F ——r
0.9 —
P(P. - 2|xk)
0.8
Source 2 active (2 - 2)
0.7
0.6
0.5
0.4 —
K

\

0.3 -\
\.
0.2 — ‘\P(2 - 3|x")
1\\
0.1 \\
k \~ 1~
‘P(2 - l X) ‘1 "s ’ ‘~ /‘~~1 --
\ 1' L 1 ”’1 a 1 1" 1 1‘-”1 ‘12.: 1 1 _1
2 4 6 8 10 12 l4 16 18 20 22 24 26 28 30

Number of Observations

Figure 4.1 (b) Posterior Distribution of the Source Index Parameter

 

95

on

  

.
.
.1
.n I .3 2’30. n ooh—son \\
\
‘\ I.
s
s s
fin a in ~// 1 1
\.n, x I t
I .51.s I
\ II x c 1
\\I¢\Il\ (\I(\\
I'""‘\""\ L

umuoEdumm xmvcH mousom on» no :oﬁusnwuumwo uowumumom A0. H.v ouamwm

unowum>umuno mo nonasz

mm mm VN NN 0N ma 9.” «A NH OH w w v N
. q 4 q d 4 4 \J

3 4
.xx: u in .

 

    

   
   

Asia u in

 

 

 

 

v.0

m.o

v.0

5.0

0.0

«.0

o.H

96

Table 4.1 Symbol Generating Probabilities.

_g —

 

 

 

 

Source Source Code Source Probability
...... ..
2
l l 0 0.75
l 0.25
00 0.127
2 2 01 0.375
10 0.375
11 0.125
000 0.09375
001 0.125
010 0.15625
3 3 011 0.125
100 0.125
101 0.09375
110 0.09375
111 0.1875

 

Symbol selections by the sources mentioned in the
example of Figure 4.1 are governed by the respective
distribution shown above.

97

of Xk be at least the longest symbol code length. Figure
4.1 shows the posterior conditional probability that

each source is active as a function of the number of
observations. The ordinate intersections represent the
prior probabilities. After any particular Observation,
the decision procedure decides that the source with the
greatest posterior probability is active.

Figure 4.1 illustrates that the initial bias, im-
posed by the prior distributions, can be overcome by this
decision process. The discussion in Chapter III suggests
that the distance between the probability distributions
of the Observations, which is influenced by the source
distributions, should affect the number of Observations
required to overcome this initial bias. Posterior distri-
butions for the synchronization instant P(Tkl2,Xk) were
computed as well. In the first example with m2 = l, Tk
is l for all k, so that once the source is correctly
decided the synchronization instants are obvious. In the
second example with 2 = 2 and m = 2, the true value of

2

Tk alternates between 1 and 2. Since the prior proba-

bility for the correct value of T was 0.714, the posterior

l

probabilities P(Tkl2 = 2,Xk) produced correct decisions

for all k. In the third example with 2 = 3 and m = 3,

2
the prior probability for the correct value of T1 was

0.222, vs. values of 0.333 and 0.445 for the other possi-

bilities for T1' This, combined with the probability

'..a .N_

 

I 1' 1. '5.‘
4

98

distribution for source 3, led to needing over 20 pattern
Observations before correct synchronization decisions were
made, even though the source decision was correct after

every observation.

4.2 The Binary Symmetric Channel

The binary symmetric channel illustrates the notion l
of an unknown channel parameter. As the name implies, the
noise in the channel has the net effect of changing a code

digit one to a digit zero with probability p and of chang— 1

 

ing a code digit zero to a one with probability p, as Em
diagrammed in Figure 4.2. Thus p is the probability
that a binary digit is complemented as it passes through

the channel, and will be called the complementation rate.

 

 

 

Figure 4.2

The Binary Symmetric Channel

A digit passes through the channel unchanged with proba-
bility l-p. As in the data generation model previously
considered, the channel is connected to a source which

generates a binary digit code of length m and transmits

21'

the codes in a continuous stream. The receiver processes

99

the output as n binary digit patterns, Xk, k = 1,2,...,
n 2 m2 for all 2. Yk will denote the input which pro-
duced the output Xk. Both Xk and Yk can be any one of the
2n vectors having n binary elements, but Xk is not neces-
sarily equal to Yk because of the noise properties of the
channel. The quantity p is an unknown parameter.

The full set of assumptions for this special case
follows.

1. Patterns of n binary elements, Xk’ are received

through the binary symmetric channel of Figure

4.2.
2. The value of the parameter p is unknown.

3. Changes of distinct binary digits in a pattern

are mutually independent.

4. Given the index of the active source, 2, and

synchronization instant, T the distribution

kl
of Yk at the 1nput, denoted by Pin(Yk|2,Tk), 18

known.

Define C(kak) as the number of bit positions in which Yk

and Xk differ; C(Yk,Xk) is the Hamming distance between Y

k

and Xk'

Let Zj, je{l,2,...,2h}, denote the base 10 values
corresponding to the strings of base 2 digits which can
be assumed by Yk and Xk' The probability of receiving

Xk = Zj when Yk = 21 was transm1tted 1s

‘Wwo- .

 

100

C(Z ,Z.) n-C(Z ,Z )
P(Xk = ZjIY = Z-) = P l J (l-p) l J

i,je{l,2,...,2n}

The conditional probability of receiving a particular

vector, Xk = Zj, can be represented by

P(xk zjlp.2,Tk)

 

n
2
= 1:1P1n(Yk = Zilp,2,Tk)P(Xk = szYk = Zi,p,2,Tk) ,
2n
= i:lPinwk = Zi|2,Tk)P(Xk = zjlyk = zi)
n
2 C(Zi,Z.) n-C(Zi,Zj)
= 1:1P1“(Yk = zi|2.Tk)p (l-p)

||l>

i,je{l,2,...,2n}, 0 s p s 1, 2 = l,...,L,

and Tk = l,...,2.

This equation defines the quantity Qj(p,2,Tk) =
P(Xk = Zjlp,2,Tk) as a polynomial in p of degree n. Given
2 and Tk’ there are 2n of these polynomials, one for each

possible value of Xk'

Theorem 2.5.1 says that if members of the family

{P(X Ip,2,T )} are distinct, then strongly consistent

estimators exist for p, 2, and Tk' Saying that members

101

of the family are distinct means that for any (a,b)?(c,d)
there exists at least one value Zj such that

P(Xk = Zjlp,2 = a,Tk = b)7€P(Xk = Zjlp,2 = c,T = d) when

k
those quantities are defined. As a result, the hypotheses
of Theorem 2.5.1 require that no two synchronizations

have exactly the same set of Zn polynomials Qj(p,2,Tk)

for all 2n values of j. Equivalently, for (a,b)#(c,d)
Qj(p,2 = a,Tk = b) must not have the same set of coef-

k = d), j€{l,2,...,2n}, when é
(a,b) and (c,d) are such that Qj(:) is defined. In turn,

ficients as Qj(p,2 = c,T

 

the coefficients of Qj(p,2,Tk) are defined above to

be linear combinations of Pin(YkI2,Tk), so it is reason—
able to look for a condition on the family {Pin1Ykll’Tk)2tTk
which guarantees that linear combinations of members of

that family are unique.

Identifiability is such a sufficient condition.
Consequently, one can proceed with the Bayes decision
process to simultaneously decide the source, synchroni—
zation, and channel complementation rate and assume that
the sequences of decisions will converge provided that
{Pin(Yk|£’Tk)}2,Tk is known to be an identifiable family
[T-l, T-2, Y-3] or provided that the members can be shown
to be linearly independent.

The learning capability of the processor was demon-

strated for a binary symmetric channel with unknown comple—

mentation rate. The complementation rate was the parameter

102

to be learned. A Fortran language program, BSC MC--for
Binary Symmetric Channel, Monte Carlo--provides appropri-
ate data generation, channel simulation, and Bayes rule
processing. For coding convenience, BSC MC processes
sources whose code length is 3 binary digits and uses
observations of 3 binary digits. BSC MC allows up to 10
discrete values for the channel complementation rate; two
values, 0.05 vs. 0.1, were used in the examples which are
reported here. The program is set to cut off after 500
observations. Then the prior distributions can be re-
initialized for as many Monte Carlo iterations as one
desires.

As a check on the data generated, BSC MC maintains
counts of the number of times each symbol is generated,
the number of bits changed by the channel and the number
of times each observation vector value is Observed. This
last is used to compute the empirical mass function
P(Xk) which can be used in an alternative minimum dis-

tance estimate in which

distance = min max I P(Xk) - P(X) '-

2,Tk X

The decision procedure decides that the values of the
minimization arguments for which the minimum is achieved
give the source and synchronization. In the examples run,

the optimum decision and the minimum distance decision

 

‘11:: sum _.’. '.-

103

agreed with respect to the source and synchronization, but
not always with respect to the channel parameter (channel
complementation rate).

A second program, called BMCIND (for Binary symmetric
channel, Monte garlo, INerendent) is identiCal to BSC MC

except for a subroutine named POST which computes the

- ‘1 (maxim

posterior distributions. For BMCIND, POST computes the
posterior distributions as if the Observations were inde- I
pendent. After the k-th observation, the version of

POST used with BMCIND computes

1r—

 

P(xklp.2,Tk)P(p|2.Tk.xk_1)

= :P(xklp.2.Tk)P(pl2.Tk.xk_1)

 

P(pl2.Tk.xk)

where the denominator is defined to be P(Xkl2,Tk),

= P(Xkl2,Tk)P(TkI2,Xk_l)
g P(XkT2,Tk)P(Tk|2,Xk_l)
k

P(Tkl2,Xk)

where the denominator is defined to be P(Xkl2),

= P(xk|2)P(2|xk_l)
P(xklz)P(2[xk_17‘°

 

P(2IXk)

By contrast, the version of POST used in connection with

BSCMC computes

. k. k :P(Xk[Pp2,Tk.Xk_lIP(pr2,Tk,xk_l)

 

104

where the denominator is defined to be P(Xkl2,Tk,Xk_l),

= P(Xkl2,Tk,Xk_1)P(TkI2,Xk_l)
é P(xkl2,Tk,xk_l)P(Tk12,xk_l)
k

 

P(Tkl2,Xk)

where the denominator is P(Xkl2,X and We

k-l) '
P(xklz,xk_l)P(2|xk_l)

— iP(Xk[2,Xk_1)P(2IX

 

P(2le)

k-l)

 

My: .(.-"1 . u .
1'

The "posterior distributions,’ computed in connection with
BMCIND, when used with a maximum likelihood decision pro-
cedure, learned the unknowns but appeared to converge less
rapidly than did the Bayes processor. The quantities com-
puted by the version of POST used with BMCIND coincided
with the Bayes posterior conditional distributions only if
the observations were independent. The motivation for using
these quantities when the observations are dependent stems
from a consideration of storage requirements and compu-
tation volume, which is discussed next.

In considering the storage requirements for computing
the Bayes posterior conditional distributions for the pro—
cessor described in Chapter II, one must first realize
that there are 2n values of the conditional probability

k-l

P(Xkl2,Tk,X ) for each set of values of the conditional

k l

arguments. The dependence on X - is a result of the

possibility that Tk#l, i.e., a symbol code begins in Xk-l

105

and ends in Xk as described in Section 2.1. Since suc-
cessive symbol codes are independent, only the cases in

which symbol codes overlap observations Xk-l and Xk will

tend to affect the storage of the posterior conditional

distributions. If Tk = 1, then the conditional probability

is independent of X and F‘

k-l'

k-l
P(Xkl2,Tk,X ) = P(Xkl2,Tk).

.-DL.) - m»: 11' '

' ‘lxll

If Tk = 2, then the conditional probability is dependent

on the last m2 — 1 digits of Xk-l‘ In general, 1_
k...

 

P(Xkl2,Tk,X 1) is dependent on the last

d = [m — T

g k + 1]

mod mﬁ

digits of X While d is in fact a function of m2 and

k-l'
Tk' writing d rather than d(m£,Tk) is convenient because d

will be used as an exponent in subsequent expressions.
Taking into account the above description of how

the value of T describes the dependence on xk-l’ one can

k
k-l

see that there are 2n+d values of P(Xkl2,Tk,X ) for each

value of 2 and Tk' Given 2 = A, there are mA values of

T so for each source, A, there are

kI

n+0

Z 2 = 2 + 2n+mA—1 n+1

+ ... + 2

106

k.1) to compute and store, which

values of P(Xkl2,Tk,X
could tax the capacity of a very large computer. For
example, suppose one of the sources which might be ob-
served uses a symbol code length, including parity digits,
of 9 binary digits. Suppose further that this is the

longest possible source code length, so n is chosen to be

9. Then

- 1) = 2 - 2 > 2 .

This shows that just storing the conditional distribution
for a single source could use up all of the immediately
accessible, individually addressable core storage of the
latest model computers. The usual storage requirement for
a Markov chain having 29 states is 218 values, but the
analysis based on the value of Tk was aimed at identifying
redundant values that need not be stored.

The technique used by subroutine POST in the BMCIND
program would reduce the storage requirements to 29 in
the case cited above, or a reduction factor of over 2(n—1).

Table 4.2 summarizes the results of one computer
experiment with the binary symmetric channel. In this
instance, each of three sources used three binary digits
in its symbol codes. The channel error rate was to be

either 0.05 or 0.10. Ten runs were made, and after the

500-th observation a decision was made about the channel

 

107

error rate, about which source was active, and about the
synchronization instant. Results of the optimum processor
(BSC MC), one subOptimum processor (BMCIND), and the mini-
mum distance decision described earlier in this section

are compared.

Table 4.2 Comparison of Three Decision Processes.

 

Number of Correct Decisions in
10 Monte Carlo Runs

 

 

[ﬁver Au,-
1

 

Channel Bit Active Synchronization
Complementation Rate Source Instant
BSC MC 4 10 10
BMCIND 2 10 10
Minimum
Distance 6 10 10

 

The Bayes decision process, a Bayes-like process,

and a minimum distance decision process were used to decide
the bit complementation rate, source, and synchronization.

In these experiments, the prior distributions were
randomly generated before the first run, and the same
prior distributions were re-established for each of the
following 9 runs.

It is heartening to note that BSC MC, which uses the
Bayes posterior distribution, performs somewhat better
than BMCIND, which uses the Bayes formula with the
marginals of the conditional distributions. But the

superior performance of the minimum distance estimator

was unexpected.

108

Examination of the history of the posterior distri—
butions computed by both BSC.MC and BMCIND shows that the
decisions made after the earlier observations vary, but
after about 25 observations, all subsequent decisions are
identical, whether they are correct or not. This suggests
that a type of bias develOps that is unlikely to be over-
ruled if the probability distributions are very close.

The posterior distributions of quantities that are quite
distinctive,such as the source and synchronization in-
stant in this case, tend to be capable of producing correct
decisions even if the decisions regarding other quantities

are incorrect.

4.3 Error Estimates and Bounds

This section presents numerical results obtained by
applying the various error bounds of Chapter III to a
specific example. Details of the example were chosen for
computational expediency rather than to represent a par-
ticular application. Specifically, the example is not
typical of the models one would expect in the symbol
synchronization problem.

In the example there are three pattern classes, for
which a pattern is one binary digit. Each class has a
stationary first order dependent distribution. This is
intended to mean that for fixed 2 and m--2,me{0,l}--and

any class i

 

IA- _ Dv". ‘

Pi(xk = I”XX-1

109

= m) is constant for all k =

2,3,....

The values used for the class conditional joint proba-

bility distributions are tabulated below.

Pi(Xk = 0,xk_l = 0)
Pi(xk = 0,xk_1 = l)
P.(xk 1,xk_1 = 0)
Pi(Xk = 1,xk_l = l)

The values of

i=1

0.40

 

i=2 i=3
0.30 0.10 §_1
5
1
s
0.05 0.20 *
0.45 0.30
0.20 0.40

the joint distributions lead to the sample

conditional distributions below, which are used to obtain

the upper bound based on the equivocation measure.

P.(xk = Ole-l = 0)
P.(Xk = 0|xk__l = 1)
P.(Xk — 1|xk_1 = 0)
Pi(Xk = l|xk_1 = 1

0.375

0.333

0.625

i=2 i=3

0.400 0.250
0.200 0.333
0.600 0.750
0.800 0.667

An important part of the equivocation bound argu—

ment,

3.2.1.6 of Chapter III, states that

110

 

Q < l for j#h

[Pj(xklxk_l)]t
E h

PhIXkTXk-l)

 

A matrix of the values Obtained for the above expectations

Obtained for this example is given below.

h=1 h=2 h=3 {7
j=1 1.000 .968 .963 i
j=2 .971 1.000 .988
j=3 .945 .987 1.000

 

The diagonal terms should obviously be 1 and serve as a
check on the computation.

The classes were assumed to be equally probable.
The prior class conditional distribution of the first
observation and the corresponding interclass Bhattacharyya
coefficients are tabulated below. The values of Pi(xl = -)

were chosen to make Pi(Xk) stationary with respect to k.

1=l i=2 1=3
P1(X1 = 0) 0.529 0.25 0.308
P.(X1 = 1) 0.471 0.75 0.692

Prior Class Conditional Distributions
of the First Observation

111

013 = 031 = 0.974
023 = 032 = 0.997

Interclass Bhattacharyya Coefficients for the
Distribution of the First Observation

Assuming that C of Lemma 1, Chapter III, is not
greater than 2, the value of A in 3.2.1.10, is 3.907, and

the value of q in 3.2.1.9 is 0.988. So the error bound

is

Pe < 1.953 x 0.988k'1.
Under this approach, one is very much at the mercy of the
value of q, which in this case is very close to the maxi-

mum possible value, one. Figure 4.3 shows that for this

example the equivocation bound gives the largest values of

all the techniques which were compared. Approximately

367 observations would be required in order for this error

bound to assure an error rate of less than 0.05.
Comparing the above analysis based on the equi-

vocation bound with the asymptotic distribution from the

majority decision function approach reveals sizable differ-

ences in the error estimates. Using the form for the m

pattern class case, one calculates

um .1 2'." 3n

n- 1) ._.__.r -

 

 
   

 

112

 
    

Equivocation
Bound

 
 
 
 
 
 
  
 
  

Lainiotis' Bound
(Average Interclass
Bhattacharyya Coefficients)

Toussaint's
Variational
Distance Bound

Asymptotic Distribution
from Majority Decision Function

1 1, .1 1 l __J
5 10 15 20 25 30

Number of Observations

Figure 4.3 Comparison of Error Bounds

 

113

1. m
P? X = —_ X .P. x
1( k) 1 7 Pi j=1 p] 3( k)
j#i

and Obtains the values tabulated below.

i=1 i=2 i=3
Pi~(xk = 0) 0.279 0.419 0.390 f.
1
A = i1
Pi(Xk 1) 0.721 0.581 0.610 S

The next task is to find 6 such that

 

ﬁiPi(xk) - P§(Xk)l ; 26
where the Kolmogorov variational distances on the left
turn out to be 0.501, 0.337, and 0.164, respectively, for
i = 1,2,3. Taking one-half the minimum distance for 6
gives 6 = 0.082, and e = — - — = 0.479. The values of
the asymptotic distribution of the error probability as
a function of the number of Observations is given in the
lowest curve of Figure 4.3 labeled "Majority Decision
Function Asymptotic Distribution."

The bound suggested by Lainiotis in 3.2.2.4 is
plotted as the middle curve in Figure 4.3. The values
are all smaller than the values obtained from the
equivocation method, plotted in the top curve, for the
number of observations investigated. Whether this is

true for large numbers of Observations is not clear, since

114

the form of the Lainiotis bound does not reveal its
asymptotic behavior. A consequent advantage of the
equivocation bound is that, once the coefficient and base
are calculated, one can estimate values for large numbers
of samples by using a slide rule. Conversely, the
Lainiotis bound requires a rather tedious recalculation,
best done on a digital computer, for each successive
number of observations.

As one comes to the error bound proposed by
Toussaint, the storage requirements for the probability
distributions threaten to be very costly. However, the
bound was computed on the basis of three observations and
came out to be 0.707, the smallest value provided by any

of the analyses. The value is spotted on Figure 4.3.

 

i ...5 ‘1!” . - .

CHAPTER V
CONCLUSIONS AND RECOMMENDATIONS

5.1 Summary of the Thesis r7

 

Pattern recognition techniques offer a choice be-
tween unsupervised learning, which requires prior knowl—

edge of the probability distributions governing the events

 

producing the patterns, and supervised learning, which re-
quires a set of training data for computing parameters of
the decision function. This thesis has ignored supervised
learning techniques and the problems that arise in deter-
mining that the training data are sufficiently representa-
tive of the total population to assure a low probability
of error. Instead, this work has demonstrated that cer—
tain statements about probability of error for m-class

(m > 2) unsupervised pattern recognition can be extended,
in an apprOpriately modified form, to problems involving
statistical dependence.

The motivating problem for this research, symbol
synchronization for an unknown source code length, has been
shown to be a problem in unsupervised learning with sta—
tistically dependent data. Several solutions to this

problem have been presented, including a Bayes decision

115

116

process and a stochastic approximation technique, both of
which were shown to produce a sequence of decisions which
converge to the correct decision. A subOptimum technique,
formally like the Bayes process for independent random
variables,was suggested, and empirical results were en-
couraging.
New results include applying the interclass

Bhattacharyya coefficient to a sequence of observations

having first order dependence in order to bound the proba—

 

bility of error of the Bayes decision process. Other error

11"».
Y

bounds are presented, one based on the eXpectation of the
additional information in successive samples and one based
on the Kolmogorov variational distance. The asymptotic
distribution of the error probability for a subOptimum
process provides an in-the-limit statement about the

error rate of the Bayes decision procedure. It is inter-
esting to note that the Bhattacharyya coefficient appears
in the information theoretic bound and the Kolmogorov
variational distance plays a role in the asymptotic distri—
bution of the error probability. The probability of error
statements which are applied to decision problems using
dependent random variables have turned out to use these
two measures on the distributions, the Central Limit
Theorem and expectations of products of a very limited
class of dependent random variables. In all of this, the
computational expediency of the resulting algorithms has

been a foremost consideration.

 

 

'II'I'II l1 JII 4}]. .l.ll I 'I'I.

117

5.2 Recommendations for Continued Research

Unsupervised learning methodology would benefit from
an influx of new decision processes which could be used
as alternatives to the Bayes decision process. The
majority decision techniques may be an important step in

this direction. Chu and Chueh recommended that the error

lg“-
estimates derived from the majority decision process be ”

considered as upper bounds on the error of the Bayes pro- ,
cess, but that decisions should be made by the Bayes pro-

cess in order to achieve the minimum decision error

 

probability. An alternative approach would be to use :—
easily computed local decisions as each pattern is ob-
served and refine the estimate of the error probability
based on properties of those local decision functions.
This is essentially the philOSOphy behind the Bayes
majority decision function, and others might be feasible.
The objective, of course, is to reduce the computational
burden imposed by the Bayes decision process but with a
technique whose convergence and error properties could be
well defined. The "posterior distributions" described in
Chapter IV in connection with the program called BMCIND
provided a computational advantage to the Bayes posterior
distributions, but the convergence properties of the
BMCIND process need to be determined. The BMCIND process
was a semi-Bayes process which would be the Bayes decision
process if the observations were independent. As work
moves toward decision processes which are more and more

unlike the Bayes decision process,it will most likely be

118

conducted by considering i.i.d. random variables at first,
with the hOpe of generalizing to dependent random variables
at a later time.

Since the Bhattacharyya coefficient has proved to
have an important relationship to the error probability
for the Bayes decision process, additional studies of its r“
properties might prove useful. In particular one might be .
able to describe the class of distributions for which

pij(k) defined by 3.2.2.1 is a monotonically decreasing

 

function of k. One could also attempt to determine

I..."M"D’.§‘;S -.‘ 0.- - -'.

relationships between the Bhattacharyya coefficient
and error rates for non-Bayes decision procedures in
either a supervised or an unsupervised mode.

Supervised learning techniques include the use of
a linear combination of functions called potential
functions [A-Z, B-l, P-4] to approximate unknown proba-
bility distributions. Unsupervised learning has tra-
ditionally proceeded by assuming that, if one did not
have classified training data, then one would assume that
the probability distributions for the pattern classes are
known to provide a starting point around which a decision
process can be built. Knowing the probability distributions
has been a demanding assumption for unsupervised learning.
It is tempting to try to relax that assumption, and per-
haps that could be done by using potential functions.

The first task would be to determine whether there are

119

any conditions for the traditional unsupervised learning
problem (in which all pattern classes are represented
randomly in the data) under which combinations of potential
functions would provide useful estimates of unknown pattern
class distributions.

In general the properties of supervised learning

I .1!
8

techniques using dependent observations are not described

in existing literature. The work of C. K. Chow is an

'I “(Jr 9 mum's?) ' '1

exception, of course. Aside from the mathematical com-

plexities introduced by considering dependent random vari-

 

ables, there are problems in that a large number of
dependency models are candidates for consideration. Chow
has had some success in applying information theoretic
methods to this type of problem. Perhaps other methods

could be fruitful as well.

BIBLIOGRAPHY

 

[A-l]

[A-Zl

[A-3]

[A-4]

[A-S]

[A-6]

[A-7]

[B-l]

BIBLIOGRAPHY

Abramson, N., and D. Braverman, "Learning to Recog-
nize Patterns in a Random Environment," IRE Trans.,
Vol. IT-8, pp. 558—563, 1962.

 

Aizerman, M. A., E. M. Brauerman, and L. I.
Rozonoer, "Theoretical Foundations of the Potential
Function Method in Pattern Recognition Learning,"
Automatica i Telemekhanika, Vol. 25, June 1964,
trans. Jan. 1965, pp. 821-837.

 

Albert, A. "A Mathematical Theory of Pattern
Recognition," Ann. Math. Stat., Vol. 34, pp. 284—299,
March, 1963.

 

Albert, Arthur E., and Leland A. Gardner, Jr.
Stochastic Approximation and Nonlinear Regression,
MIT Press, Cambridge, 1967.

 

Amari, Shunichi, "A Theory of Adaptive Pattern
Classifiers," IEEE Trans., Vol. EC-l6, No. 3,
pp. 299-307, June 1967.

Anderson, T. W., and R. R. Bahadur, "Classification
Into Two Multivariate Normal Distributions with
Different Covariance Matrices," Ann. Math. Stat.,
Vol. 33, pp. 420-431, June 1962.

 

Aoki, Masanao, thimization of Stochastic Systems,
Academic Press, New York, 1967.

Bashkirov, O. A., E. M. Braverman, and I. B.
Muchnik, "Potential Function Algorithms for Pattern
Recognition Learning Machines," Automatic Machines
and Remote Control, Vol. 25, No. 5, pp. 629-631,
March, 1964, trans. from Aut. i Telemekh., Vol. 25,
No. 5, pp. 692-695, May 1964.

 

 

120

 

.
l

[B-Zl

[B-3]

[C-l]

[C-Z]

[c—3]

[C-4]

[C-5]

[C-6]

[C-7]

[C-8]

[C-9]

[C-lO]

121

Blaydon, Colin C.,"Recursive Algorithms for Pattern

Classification," Thesis, Harvard University, Div.
of Engineering and Applied Physics, March 1967.

Bledsoe, W. W., "Some Results on Multicategory
Pattern Recognition," J. ACM, Vol. 13, No. 2,
pp. 304-316, April 1966.

Chadwick, Henry D., and Ludwik Kurz, "Two Sequential
Nonparametric Detection Procedures," Info. & Con—
trol, Vol. 13, No. 5, pp. 403—428, November 1968.

 

Chien, Y. T., and K. S. Fu, "On Bayesian Learning
and Stochastic Approximation," IEEE Trans., Vol.
SSC-3, No. 1, pp. 28-38, June 1967.

 

Chien, Y. T., and K. S. Fu, "Selection and Order-
ing of Feature Observation in a Pattern-Recognition
System," Info. & Control, Vol. 12, Nos. 5/6, pp.
394-414, May-June 1968.

 

Choi, Keewhan, "Estimators for the Parameters of a
Finite Mixture of Distributions," Annals of the
Inst. of Stat. Math., Tokyo, Vol. 21, No. 1, pp.
107-116, 1969.

 

Choi, Keewhan, "Empirical Bayes Procedure for
(Pattern) Classification with Stochastic Learning,"
Ann. of the Inst. of Stat. Math., Tokyo, Vol. 21,
No. 1, pp. 117-125, 1969.

 

Chow, C. K., "A Recognition Method Using Neighbor
Dependence," IRE Trans., Vol. EC-ll, pp. 683-690,
Oct. 1962.

 

Chow, C. K., "Statistical Independence and
Threshold Functions," IEEE Trans., Vol. EC-14,
No. 1, pp. 66-68, Feb. 1965.

 

Chow, C. K., and C. N. Kiu, Approximating Discrete
Probability Distributions with Dependence Tree,
IBM Research Report RC-1816, May 4, 1967.

Chu, J. T., "Optimal Decision Functions for Com—
puter Character Recognition," J. of ACM, Vol. 12,
No. 2, pp. 213-226, 1965.

 

Chu, J. T., and J. C. Chueh, "Error Probability
in Decision Functions for Character Recognition,"
J.gf ACM, Vol. 14, No. 2, pp. 273-280, April
1967.

 

 

[C-11]

[C-12]

[C-13]

[C-14]

[D-ll

[D-2]

[D-3]

[D-4]

[D-5]

[F-l]

[F-2]

122

Cooper, P. W., "The Hyperplane in Pattern Recog—
nition," Cybernetica, Vol. 4, pp. 215-218, 1962.

Cooper, D. B., and P. W. Cooper, "Nonsupervised
Adaptive Signal Detection and Pattern Recognition,"
Info. & Control, Vol. 7, pp. 416—444, 1964.

COOper, P. W., "Hyperplanes, Hyperspheres, and
Hyperquadrics as Decision Boundaries," Computer and
Information Sciences, Tou and Wilcox, Eds.,
Spartan, Washington, D. C., 1964. 9»

 

Cover, Thomas M., "Geometrical and Statistical
PrOperties of Systems of Linear Inequalities with
Applications in Pattern Recognition," IEEE Trans.,
Vol. EC-14, No. 3, pp. 326-334, June 1965.

 

Deely, J. J., and R. L. Kruse, "Construction of
Sequences Estimating the Mixing Distribution," 1
Ann. Math. Statist., Vol. 39, No. 1, pp. 286-288,
1968.

 

I .' .'.

 

Dubes, R. C., The Theory of Applied Probability,
Prentice-Hall, Englewood Cliffs, N. J., 1968.

 

Dubes, R. C., and P. J. Donoghue, "Bayesian Learn-
ing in Markov Chains with Observable States,"
Interim Report No. 5, Contract No. AFOSR-1023-67B,
Division of Engineering Research, Michigan State
University.

Dubes, R. C., and E. Panayirci, "Pattern Recognition
with Continuous Parameter, Observable Markov Chains,"
Interim Report No. 10, Contract No. AFOSR-1023-67B,
Division of Engineering Research, Michigan State
University.

Duda, R. 0., and H. Fossum, "Pattern Classifi-
cation by Iteratively Determined Linear and
Piecewise Linear Discriminant Functions," IEEE
Trans., Vol. EC-lS, No. 2, pp. 220-232.

Farrell, J. L., and J. C. Murtha, "Statistical
Band Synchronization in Digital Communication,"
Proceedings of the Univ. of Missouri, Rolla,

Mervin J. Kellngommunications Conference, Oct.

‘-

1970, pp. 3—5-1 through 3-5-6.

 

Feller, W., An Introduction to Probability Theory
and Its Appligations, John Wiley & Sons, Inc.,
New York, 1957.

[F-3]

[F-4]

[F-5]

[G-l]

[H-l]

[H-2]

[H-3]

[H-4]

[H-5]

[H-6]

[H-7]

[H-8]

[I-l]

123

Friedman, H. D., "On the Expected Error in the
Probability of Misclassification," Proc. IEEE,
Vol. 53, pp. 658-659, June 1965.

 

Fraser, D. A. S., Non-Parametric Methods in
Statistics, Wiley, New York, 1957.

 

Fu, K. S., Sequential Methods in Pattern Recog-
nition and Machine Learning, Academic Press,
New York, 1968.

Good, I. J., The Estimation of Probabilities,
Research Monograph No. 30, MIT Press, Cambridge,
1965.

.1):

Hancock, J. C., and T. L. Stewart, "Parameter
Estimation with Unknown Symbol Synchronization,"
Thesis, Purdue University, January 1967.

“TI-- 5:?) " ..

 

Hancock, J. C., and P. A. Wintz, Signal Detection
Theory, McGraw-Hill, New York, 1966.

 

Harnett, W. E., "Generalization of Tests for
Certain PrOperties of Variable-Length Codes,"
Info. & Control, Vol. 13, No. 1, pp. 20-24,
July 1968.

Hellman, M. E., and T. M. Cover, "Learning with
Finite Memory, Ann. Math. Stat., Vol. 41, No. 3,
pp. 765-782, June 1970.

 

Hellman, Martin E., and Josef Raviv, "Proba-
bility of Error, Equivocation, and the Chernoff

 

Bound," IEEE Trans., Vol. IT-16, No. 4, pp.
368-372, July 1970.
Hilborn, C. G., and D. G. Lainiotis, "Optimum

Unsupervised Learning Multicategory Dependent
Hypothesis Pattern Recognition," IEEE Trans., Vol.
IT-14, No. 3, pp. 468-470, May 1968.

Ho, and Agrawala, "Summary of the State of the Art
in Pattern Recognition," Proc. IEEE., Vol. 56, No.
12, pp. 2101-2114, Dec. 1968.

Hoeffding, W., and H. Robbins, "The Central Limit
Theorem for Dependent Random Variables, Duke Math
10, V01. 15' pp. 773-780, 1948.

Irani, K. B., ”A Finite-Memory Adaptive Pattern
Recognizer," IEEE Trans., Vol. SSC-4, No. 1, pp.
2-11, March 1968.

 

 

 

[K-l]

[K-2]

[K-3]

[K-4]

[K-Sl

[K-6]

[K-7]

[L-l]

[L-Zl

[L-3]

[L-4]

[M-l]

[M-2]

124

Kadota, T. T., and L. A. Shepp, "On the Best

Set of Linear Observables for Discriminating Two
Gaussian Signals," IEEE Trans., Vol. IT-13, No. 3,
pp. 278-284, April 1967.

Kailath, Thomas, "The Divergence and Bhattacharyya
Distance Measures in Signal Selection," IEEE Trans.,
Vol. COM-15, No. 1, pp. 52-60, February 1967.

 

Kazmierczak, H., and K. Steinbuch, "Adaptive
Systems in Pattern Recognition," IEEE Trans., Vol. {9
EC-12, No. 6, pp. 822-835, Dec. 1963. i

Keehn, Daniel G., "A Note on Learning for Gaussian
Properties," IEEE Trans., Vol. IT-ll, No. 1, pp.
126-132, Jan. 1965.

 

Kobayashi, H., and J. B. Thomas, "Distance
Measures and Related Criteria," Proceedings of
the Fifth Allerton Conference on Cirguit and 1_
System Theory, pp. 491-500, Oct. 1967.

 

 

 

Korsh, James F. "On Decisions and Information
Concerning an Unknown Parameter," Info. & Control,
Vol. 16, pp. 123-127, 1970.

 

Ku, Harry S., and Solomon Kullback, "Approxi-
mating Discrete Probability Distributions," IEEE
Trans., Vol. IT-15, No. 4, pp. 444-447, July 1969.

Lainiotis, D. G., "Optimal Feature Extraction in
Pattern Recognition," IEEE Intern. Info. Theory
Symp. Abstracts, San Remo, Italy, Sept. 1967.

 

 

Lainiotis, Demetrios G., "On a General Relationship
Between Estimation, Detection, and the
Bhattacharyya Coefficient," IEEE Trans., Vol. IT-lS,
No. 4, pp. 504-505, July 1969.

 

Lainiotis, D. G. "A Class of Upper Bounds on
Probability of Error for Multihypothesis Pattern
Recognition," IEEE Trans., Vol. IT-lS, No. 6, pp.
730-731, Nov. 1969.

 

Lainiotis, D. G., and S. K. Park, "Probability of
Error Bounds," IEEE Trans., Vol. SMC-l, No. 2, pp.
175-178, April 19 1.

Martin, J. J., Bayesian Decision Problems and
Markov Chains, Wiley, New York, 1967.

 

Martin, James. Telecommunications and the Com-
puter, Prentice-Hall, Englewood Cliffs, N.J.,
1969.

[M-3]

[M-4]

[M-5]

[M-6]

[M-7]

[M-8]

[N-l]

[N-Zl

[P-l]

[P-2]

[P-3]

[P-4]

[R-l]

[R-Zl

125

McBride, Alan L., and Andrew P. Sage, "Optimum
Estimation of Bit Synchronization," IEEE Trans.,
Vol. ABS-5, No. 3, pp. 525-536, May 1969.

McBride, Alan L., and Andrew P. Sage, "On Dis-
crete Sequential Estimation of Bit Synchronization,"
IEEE Com., Vol. 18, No. 1, Feb. 1970.

 

 

McLendon, J. R. "A Pseudo Bayes Approach to
Digital Detection and Likelihood Ratio Computation,"
Thesis, Southern Methodist University, Dec. 1969.

Meisel, William S. "Potential Functions in Mathe-
matical Pattern Recognition," IEEE Trans., Vol.
C-18, No. 10, pp. 911-918, Oct. 1969.

Middleton, D., "On the Detection of Random Signals
in Additive Normal Noise--Part 1," IREE Trans.,
Vol. IT-3, No. 2, pp. 86-122, June 1957.

 

m‘ _ . J... I

Middleton, D., An Introduction to Statistical Com-
munication Theory, McGraw-Hill, New York, 1960.

 

 

"State of the Art in Pattern Recog-
IEEE., Vol. 56, pp. 836-862, 1968.

Nagy, G.,
nition," Proc.

 

Nilsson, N. J., Learning Machines, McGraw—Hill,
New York, 1965.

 

Panayirci, E., "Bayesian Decision Making and Learn-
ing for Continuous-Time Markov Systems," Ph.D.
dissertation, Michigan State University, 1970.

Papoulis, A., Probability, Random Variables, and
Stochastic Processes, McGraw-Hill, New York, 1965.

 

Patrick, E. A., and J. C. Hancock, "Nonsupervised
Sequential Classification and Recognition of
Patterns," IEEE Trans., Vol. IT-12, No. 3, pp.
362-372, July 1966.

Pitt, J. M., and B. F. Womack, "Additional Features
of an Adaptive, Multicategory Pattern Classifi-
cation System," IEEE Trans., Vol. SSC-S, No. 3,

pp. 183-191, July 1969.

Raviv, J., "Decision Making in Incompletely Known
Stochastic Systems," Int. J. Eng. Sci., Vol. 3,
pp. 119-140, 1965.

 

Raviv, J. "Decision Making in Markov Chains
Applied to the Problem of Pattern Recognition,"
IEEE Trag§., Vol. IT-3, No. 4, pp. 536-551,
Oct. 1967.

 

[R-3]

[R-4]

[R-5]

[R-6]

[R-7]

[S-l]

[5-2]

[5’3]

[S-4]

[5‘5]

[S-6]

[T-l]

126

Rényi, Alfréd, "On the Amount of Information Con-
cerning an Unknown Parameter in a Sequence of
Observations," Ma ar TudgAkad. Mat. Kutaté Int.
K621., Vol. 9, pp. 617-625, 1964.

 

Rényi, A., "On the Amount of Missing Information
and the Neyman-Pearson Lemma," Research Papers in
Statistics Festschrift for J. Neyman, Wiley, New
York, 1966.

 

 

Rényi, A. "On Some Basic Problems of Statistics {“
from the Point of View of Information Theory,"

Proc. 5th Berkeley Symp. on Math. Stat. and Prop.,
Vol. 1, Berkeley, Calif., U. of Cal. Press, 1967.

 

Robbins, Herbert, "The Empirical Bayes Approach to ,
Statistical Decision Problems," Ann. Math. Statist.,
Vol. 35, pp. 1-20, 1964.

 

Rolph, John E., "Bayesian Estimation of Mixing
Distribution," Columbia Univ.

Sebestyen, George S., Decision-Making Processes in
Pattern Recognition, Macmillan, New York, 1962.

 

 

Sebestyen, G., and J. Edie, "An Algorithm for
Non-Parametric Pattern Recognition," IEEE Trans.,
Vol. EC-lS, No. 6, pp. 908-915, Dec. 1966.

 

Signori, D. T., Jr., "Estimation and Adaptive
Decision Making for Partially Observable Markov
Systems," Ph.D. dissertation, Michigan State Uni-
versity, 1968.

Specht, Donald F., "Generation of Polynomial Dis—
criminant Functions for Pattern Recognition,"
IEEE Trans., Vol. EC-l6, No. 3, June 1967.

 

"Learning Without a Teacher,"
IT-IZ’ NO. 2, pp. 273-230,

Spragins, J. D.
IEEE Trans., Vol.
April 1966.

 

Spragins, J. D. Rpproducing Distributions for
Machine Learning, Stanford Electronic Lab, Tech.
Report No. 6103-7, Nov. 1963.

 

Tallis, G. M., "The Identifiability of Mixtures
of Distributions," J. Appl. Prob., Vol. 6, pp.
398, 1969.

389-

 

[T-2]

[T-3]

[T-4]

[W-l]

[Y-l]

[Y-2]

[Y-3]

[Y-4]

127

Teicher, H., "Identifiability of Mixtures," Ann.
Math. Stat., Vol. 32, pp. 244-248, 1961.

 

 

Toussaint, G. T., "Some Upper Bounds on Error
Probability for Multiclass Feature Selection,"
IEEE Trans., Vol. C-20, No. 8, pp. 943-944,
Aug. 1971.

Tucker, Howard G., A Graduate Course in Probability,
Academic Press, New York, 1967.

 

Wainstein, L. A., and V. D. Zubakov, Extraction of
Signals from Noise, Prentice-Hall, Englewood
Cliffs, N.J., 1962.

 

Yakowitz, Sidney J., "A Consistent Estimator for
the Identification of Finite Mixtures," Ann. Math.
Stat., Vol. 40, No. 5, pp. 1728-1735, 1969.

‘-.'.'\‘_"‘) H '1 )Li-"fn‘ :- f - J‘

 

 

Yakowitz, Sidney J., "Unsupervised Learning and
the Identification of Finite Mixtures," IEEE Trans.,
Vol. IT-16, No. 3, pp. 330-338, May 1970.

 

Yakowitz, Sidney J., and John D. Spragins, "On the
Identifiability of Finite Mixtures," Ann. Math.
Stat., Vol. 39, No. 1, pp. 209-214, 1968.

 

Yau, S. 8., and J. M. Schumpert, "Design of
Pattern Classifiers with the Updating Property
Using Stochastic Approximation Techniques., IEEE
Trans., Vol. C-l7, No. 9, pp. 861-872. Sept. 1968.

 

WWW W “11111111111111 WIVES
3 1293 03056 5984