BAYESLAN DECISION MAKING AND LEARNING FOR CONTINUOUS - TIME MARK-0V SYSTEMS Thesis for the Degree of Ph. D. MICHEGAN STATE UNIVERSITY ERDAL PANAYERCI 1970 LIBPxiRY meme Mldil? "'1 State Uniursity This is to certify that the thesis entitled BAYESIAN DECISION MAKING AND LEARNING FOR CONTINUOUS -TIME MARKOV SYSTEMS presented by ERDAL PANAYIRCI has been accepted towards‘fnlfillment of the requirements for Ph.D. degree in Electrical Engineering & Systems Science ' y —' ,w .a-N I"! / » . U '3’ . ‘ / ‘ ,, I" - f a- . . c/ ‘L. 4 _ . w’ 1.. .0, -' — .1.-.”- thor professor Date Novembg: 17, 1970 0-169 1.1 8-37 ABSTRACT BAYESIAN DECISION MAKING AND LEARNING FOR CONTINUOUS-TIME MARKOV SYSTEMS BY Erdal Panay1rc1 This thesis is concerned with Bayesian decision making and learning algorithms for a particular problem in parametric pattern recognition in which each of a finite set of pattern classes is characterized by a continuous-time, discrete-state Markov process. The basic problem considered is that of determining rules for making decisions about the identity of the active pattern class based upon observation of a sample function in some finite interval. The sta- tionary transition probability matrices for the processes in question are the parameters of the pattern classes. Statistical decision theory is employed throughout to develop optimal solutions. In the first part of the thesis, Bayes-optimum decision rules are derived under a perfect observation mechanism (noiseless case). The observed quantities are the sojourn times in the states and the state numbers themselves. Using classified samples from each pattern class, an algorithm for supervised learning is presented and the existence of reproducing prior densities for the parameters is demonstrated. Particularly useful results are the formulations of recursive, computationally simple parametric forms for the posterior densities of the unknown parameters and for the optimum decision rules. The simulation of a Specific example shows l I o I Q .§ 0. 'U '1 u D ‘ .\. ‘ a — r. .~ _. o . vl .\. :- D4 . v, , s D u ~ ~ . o . . .h. ... v. . 9. n . pl. . p, a . . . u L- . . u . ... my so Erdal Panayirc1 the empirical probability of error for different amounts of training data and demonstrates the inherent practicality of the results. The problem of computing probability of error is investigated extensively for the noiseless case. The exact probability of error, as well as lower and upper bounds and asymptotic expressions, are established for several cases. Conditional error probabilities of the first and second kinds are introduced by which the usual proba- bility of error can be computed iteratively. In the second part, only "noisy" observations are available. In this case, a new model is defined in which the states of the con- tinuous-time Markov chains are described by random processes, but the transition times can be observed. Iterative, optimal (minimum Bayes risk) decision rules are derived and conditions are established under which these rules perform effectively. Optimum-adaptive decision rules are defined when the underlying model is not completely specified. Decision rules are formulated with two types of random processes. Finally, the situation when the transition times cannot be observed is investigated for the Special case in which there are two pattern classes and the states are observed with additive, Gaussian, white noise. Both discrete and continuous observation times are considered. 
Computationally feasible algorithms are derived for the likelihood ratio which optimally solve the problem, assuming a discrete- sampling scheme. Also, stochastic differential equations are found for the continuous logarithm of the likelihood ratio and the continuous conditional probabilities of errors from discrete results by a limiting operation. The results are applied to the Specific problem Erdal PanayerI of detecting a random telegraph signal (two-state, continuous-time Markov chain) in white noise. BAYESIAN DECISION MAKING AND LEARNING FOR CONTINUOUS-TIME MARKOV SYSTEMS By Erdal Panayirc1 A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Electrical Engineering and Systems Science 1970 ‘71’ 57/3/ TO MY MOTHER AND FATHER ii . a t _ .. . .= .S. C I . s v t . u o . 2. o .1 P. o . .8 u. v .d rn . v. . .. L¢ .v.. o. la I: v _ n .0. v, no 0 S «L . : p o o . 0 h T .. .r o s.-. . v. .. run .\. _\. .. ns. . .2. . o w. . n). p. b. «i. u. . o .I o a O Q . . I, . n .n. .s. S u» .\e .0 I o . . . p. Ir pk. _ e .d r‘u -. . . o s 1. . v. v. In. 3 .» ... . . s e v. ., n. L“ D Q r. V» L . .. .x- ’t .‘v c; . a. D a . a o . e . rs .\v .A v a n! . . c Q... u s — . o . O 5 a p p .I- .- . c '0. a. O. s - ,n u .p. c . q - PJs o .. .x. .. .. “I 1 .. .1. u. u.“ . .1 .g.. .. . ACKNOWLEDGEMENTS The author very gratefully acknowledges the guidance and encouragement of Dr. Richard C. Dubes, Department of Computer Science, his teacher and academic advisor, throughout the course of this research. The author is indebted to Dr. Dubes for suggest- ing the problem as a potentially fruitful area for research, as well as for his generous devotion of time to provide counsel. Special thanks are due to Dr. Dennis C. Gilliland, Depart- ment of Statistics and Probability, for many useful discussions concerning the problems treated in this thesis. For their interest and consideration of this research, the author wishes to acknowledge the other members of his doctoral com- mittee: Dr. Carl V. Page, Department of Computer Science, Dr. James H. Stapleton, Department of Statistics and Probability, Dr. Rita Zemach, Department of Electrical Engineering and Systems Science. Also, Dr. Yilmaz Tokad, Department of Electrical Engineering and Systems Science, deserves a Special acknowledgement because of his recommendation and continuous encouragement. Finally, acknowledgement is made to the Technical University of Istanbul and the Division of Engineering Research, Michigan State University, for the financial support which made this work possible. iii Von- h... Yt—u t.\. ' p—l o 'J ha o P] - , .. -.~ . 4 v d." ‘ 1 4" c , .4 Chapter I II III TABLE OF CONTENTS ABSTRACT ACKNOWLEDGEMENTS LIST OF FIGURES LIST OF APPENDICES LIST OF SYMBOLS INTRODUCTION .000........0.0. ..... . 00000 . ...... .... 1.1 Literature ReView ...0...............0......O. 1.2 Thesis Outline and Contributions ............. DECISION MAKING AND LEARNING WITH OBSERVABLE STATES AND TWSITION TIES 0....0...0.........0.00 2 1 Source Model ................................. 2 2 Decision Rules ............................... 2.3 Optimum Decision ............................. 2 4 Adaptive Decision Rules When Q-Matrices Are Not Known ................................ 2 5 Supervised Learning .......................... 2 6 Learning (q, } .............................. 2.7 Optimum-Adapfive Decision Rule ............... 2 8 Computer Implementation ...................... 
2 9 Conclusions .................................. ROBABILITY OF ERROR O.............0.0...0...0..0.. 3.1 Exact Probability of Error for N = 2, and Known Q-Matrices ............................. 3 2 An Upper Bound on the Probability of Error ... 3.3 Probability of Error for Large Sample Sizes .. 3.3.1 N = 2, M = 2, and Known Q-Matrices .... 3.3.2 N > 2, M = 2, and Known Q-Matrices .... 3.4 Recurrent Expressions for Probability of Error .......0.......O...00..................O 3.5 COHCIUSionS .0.........0...........00.....0... iv Page iii 11 12 13 15 17 19 23 25 26 31 32 32 4O 42 42 44 50 52 IV VI DECISION MAKING AND LEARNING WITH UNOBSERVABLE STATES AND OBSERVABLE TRANSITION TIMES ............ 4.1 Mathematical Model for Generating Observations 4.2 Iterative Generation of the Optimum Decision Rule ......................................... 4.3 Application of the Model to the Special Cases 4.4 Adaptive Decision Making .................... 4.4.1 Optimal-Adaptive Decision Rule ........ 4.4.2 Iterative Form for the Decision Rule .. 4.4.3 Asymptotic Optimality and Adaptation .. 4.5 Conclusions DECISION MAKING WITH UNOBSERVABLE STATES AND TRANSITION TIMES 5.1 Optimal Decision Making with Time Samples .... 5.2 Observation of Entire Sample Function ........ 5.3 Optimal Detection of Random'Telegraph Signal in Additive Gaussian Noise ................... 5.3.1 Optimal Decision Rule ................. 5.3.2 Iterative Computations of the Mean and Variance of ................... 5.3.3 Probabilities of Errors ............... 5.3.4 Observation of Entire Sample Function . 5.4 Conclusions GENEML CONCLUSIONS AND EXTENSIONS 0 0 O . C . O O . O 0 O C 0 . O 6.1 Conclusions 6.2 Extensions BIBLIOGRAPHY ... 54 55 58 62 68 69 71 73 76 78 80 84 89 90 92 96 97 100 103 103 106 108 , a. v .oLcn . . ‘ '50. 3- ..L- I I LIST OF FIGURES Figure Page 2.8. 1 Average Error C‘Jrves 0 . . . . . . . . . . . . . . . 0 . . 0 . 0 0 0 . 0 . . . . t 30 4.1.1 The General Model and the Sampling Scheme ......... 57 4.2.1 An Algorithm for Iterative Implementation of k k 8(5 ,t) ......OCOOOCOOOCC....00000...000..00...... 63 4.3.2 Asampling SCheme ................00..0000..00..0.0 64 5.0.1 The Model for Decision Making with Discrete Sampling SCheme ......0.......0....00.0.00.......0. 79 5.3.1 The Block Diagram of the Optimum Detector ......... 98 A.1 Typical Sample Function from 8 Continuous- Parauleter Markov Chain 0.0.000..................... 118 D.2 Algorithms for a Single Error Curve and for DeCiSionMaking ......0...............00000......0. 126 vi o --.. ’l o... 'v ... Appendix A B LIST OF APPENDICES DEFINITIONS AND PROPER TIES 0.000000000000000000000 DERIVATIONS OF OPTIMAL DECISION RULES ............. A PROOF OF CONVERGENCE ............................ ALGORITHMS FOR ERROR CURVES AND FOR DECISION MAKING ............................................ THE ROOF OF 1mm 303.2 0.00.00.00.000.0.0.0000... vii Page 112 119 123 125 128 ‘0. Appendix LIST OF APPENDICES DEFINITIONS AND mOPER TIES 0.00000000000000000000. DERIVATIONS OF OPTIMAL DECISION RULES ............. A PROOF OF CONVERGENCE ............................ ALGORITHMS FOR ERROR CURVES AND FOR DECISION MAKING ............................................ THE ROOF OF IJEMM 30302 0000......00000000.0.0.... vii Page 112 119 123 125 128 Symbol 3(xk.tk\e = 8) (s) 11 LIST OF SYMBOLS Description Number of states in each Markov chain. Number of pattern classes (Markov chains). Prior probability for pattern class 3. Density (probability mass) function for first observation when chain 3 is active. 
é (x1,x2,...,xk) Random variables representing state numbers (also, the values for a particular realization of the random variables). = {l,2,...,N} Range Space for each Xi' IID (t1,t2,...,tk) Random.variables representing sojourn times (also, the values for a particular realization of the random variables). 3 = [qi )] Transition-rate matrix for pattern .1 class s. Q (S) ‘ ‘qii Parameter establishing distribution of sojourn in state i when class 8 is active. Point in state Space (Pattern class). A k k = gs(x ,t ) Product of density function for k k t and probability mass function for x When class 8 is active. 3 s glj)/q§ ) Transition probability from state IID i to state j for the jump chain associated with class 3. viii C a V1 I .\u r.- .i- S I. p» s - Paw n a: S s .\1 (. ~\ ) . C V C a \, Symbol fs(qS’rS\YaT) n9?) 1J (8) Description = N,j(xk) Number of one-step transitions from 1 . . k state 1 to state j 1n x . k . . . k . = Zi(t ) Total SOJourn time 1n t Spent 1n state i. _ k . . - Ki(x ) Number of occupancies of state 1 . k 1n x . IID (ysl,y82,...,ysns) States from the tra1n1ng sample function produced by pattern class S. = (TSI’TSZ’...’TSH ) SOJourns from training sample function produced by pattern class S. Concatenation of states from all training functions. Concatenation of sojourns from all training functions. _ (S) q(S) q(S 8) — (ql sq 2 900°:q N entries of Q8. ) Vector of diagonal Vector of all non-zero terms from the transition matrix of the jump chain for class 8. Posterior density for the parameters of the class 3, given all the training patterns. n(: S) (y: 8) Number of one- step transitions from state i to State j in the training n patterns ysS from class S. 2(8 H)(T: ) Total sojourn time in state i n8 recorded in the training pattern T88 from class 3. Symbol Description RES) = k§5)(y:s) Number of occupancies of state i in the training pattern y:8 from class f(qs,rs\To,yo) Prior density for the parameters from class WES) ”flip [u:§s)]§,j=l Parameters of the prior density for (q8,rs). [w :8) ’Vilbi-fl’ (S) N . . [mij 1i,j=l Parameters of posterior denS1ty for (qs,rs). w.p.1 With probability one. ij Kronecker delta. Pe Total probability of error p Bhattacharyya distance. 00,50 Probability of error of the first kind and the second kind. ao(k),30(k) Conditional probability of errors of first and second kinds. p1,p2,...,pk Times at which transitions occur. R(t) Auto-correlation function. 6(t) Impulse function. Ak Likelihood function. n . Threshold. u(-) Unit step function. -. ...g . . . . v - Q. 1.1. ... _. T . _. ’ o v a .- ,c a Q ~. C. A.» .... w, A. t. S k ... a. a .. 7: .c (. .. u." a“ .3 ’s u p. 6» ‘5 t. .d a. .. fie *3 .. .... .... .H .u. ..q S at w. .3 .t _. a“ .m .t .s .... ... .a . . .\ DA . a .. a .. t u a in 1.. ... ..r. CHAPTER I INTRODUCTION In recent years, pattern recognition and learning theory have received a great deal of attention from investigators in many branches of sciences. Some applications are: Recognition of speech and handwritten characters, checkers and chess playing machines, classification of electrocardiograms and EEG'S, detec- tion and filtering theory, System identification and Operations research. In this thesis, "pattern recognition" is another name for "computerized decision making". Given a set of objects, each arising in one of a finite number of sources, an algorithm is to be established which efficiently classifies the objects and all other Similar objects. A pattern recognition problem, in general, consists of two sub-problems. 
The first sub-problem determines which measurements should be taken on the objects. These measure- ments, called "features", characterize all the possible objects. A "pattern" is a vector of feature measurements. At present, there is very little general theory for the selection of features. In- vestigators have been concerned with the selection of subsets or linear combinations of existing features or with ordering features in a given set of measurements. The criterion for feature selection or ordering is often based on either the importance of features in II I II I I n I Q . u o ... .9. an . v. .... .. .u . . it. 1 ... . .0 5 ... 3.. .. a. a}. ( p. _. v. o. v. w. . r. .3 ... . .I .G r. r. 3.: r. . .. at. v. .. .. .. .C . ... s .d T. .t .... r. ..l. pun .u. 9L . .1 . are. u . at a b a r: . o O s o x a. VI {3 rug 3.. u . r. t . ...“. .. I .. nu .. .... r .. z .. 2 it 5 .. ... ... S ... 6. 5 Z ... .~ 6 .C ..(n . a: -. v. .A. .4.» ... a b .C b; N. v. .Q .t n.. r «a .n 0.. a I... . . 2’ .r. C. .r. ._ .J ... «.... ~— .\. ... S a: .r. Lu pl v. .. . V. On «p. D a. 3. Lu 0. A. . ... O. .3 NJ .. I. .: u. i: a can . . b . o a y s .C . _ 5. o e n u .u. .d . I .r. s t e L... ., .. .a.u ‘5. N a .Dn AL . V; a . u . .7 r. ..h . .. 11‘ v u u . a C . a a v D u y. 3 Grow. " DJ .0 . Hit a s v. u. 1 »\ b H l.- . ”I. ... S M“ I! IN m‘ .an~ I.” characterizing the patterns or on the contribution of the features to the performance of a recognition algorithm. The second sub- problem in pattern recognition is the problem of classification, or making decisions about the source of the patterns. Thus, a pattern recognizer consists of a "feature extractor", and a "classifier". Statistical decision theory provides a powerful mathematical tool for solving the classification problem when the features representing the pattern classes can be described by probability distributions. The application of decision theory to the pattern recognition problem is called "parametric pattern recognition". One can proceed to obtain optimal decision rules satisfying a (subjectively chosen) classification criterion; e.g., minimum probability of misclassification (probability of error). The problem of learning in parametric pattern recognition must be solved when the distributions characterizing the pattern classes are inadequately known. The unknowns of the class dis- tributions are "learned" from sample patterns drawn from each of the sources, or pattern classes. These samples are called "training patterns". Supervised learning (learning with a teacher) refers to the case when the training patterns are classified. When the origins of the training patterns are unknown, learning is non-supervised (learning without a teacher). A so-called "non-parametric" approach to pattern recogni- tion refers to the design of classifiers in which no assumption is nade as to the form of the underlying probability distributions char- acterizing each class. The goal of the recognition system is to partition the feature Space into regions such that each region can l II . . s. ...u A. .. . . . . . .4 a .3 to . 4 u. . o —._ a n .a u a n ‘ I A o a e . k .... v o '5 . . V. 3a .0; v. .C an v. .r on .J .u an an .3 a... . . s . a: C. ...A .... v . v . .9“ .5. .fv .V- 9‘ P. a . t V .A 2's ,v. n. ,1 s . . .4.» uh. - u o a 5 . . . .au ... r. .5 0.. ... s. o. I $— é .a V. Arv 9: ix .4 . . ‘5 u thud on. In t L uau S Rb. . . an". s u . . .\< -§ v - s ..l. . . 5 ., m . . n 4 P. .ts .x. .w. T s . a by it c P. .. S u .6A 5 s be identified with a pattern class. 
This can be achieved by defining a discriminant function for each pattern class on the feature space. A pattern is classified by choosing the class corresponding to the largest discriminant function. Training patterns from each class are usually available and the problem is to establish a reasonable set of functions from them. In this thesis, a general parametric pattern recognition problem is investigated in which the classification of an unknown pattern is inferred from a finite set of training patterns. Each pattern class is characterized by a different N-State, continuous- time Markov chain. The stationary transition matrices of the chains are the parameters of the pattern classes. Depending upon the medium (noisy or noiseless) in which the observations are made, several feature selection schemes are considered. Optimum Bayes decision rules are formulated as solutions to problems in statistical decision theory. When the parameters are not known, a supervised learning scheme that uses classified training patterns is employed. The general model considered here finds, specifically, an application area in the classification of EEG'S which has been investigated by Dubes [D-S], recently. It is also applicable to detection and filtering problems with Gaussian noise when the underlying signal is a function of a finite dimensional Markov process. 1.1 LITERATURE REVIEW A comprehensive survey of early works on pattern recogni- tion and learning theory has been written by Nagy [N-l]. Nilsson [N-3] presented some of the theory of "learning machines", or . . . . .. ... . . . .Ie .- . c o. v. . 1. u u . . A n . . a I. ... t- . ... a. .. .A . . . . . o. p: p. p . a a . ... s a A .u .. ..\v at. .4. .r: .\~ .47. a .J . v. ”I 5Q . ... . . .n.. L: ~. . a . v. . . v . x u. v. . a a Z t. .. n. s . a s A r» in. p‘. h. ...» V. L. . \ .—. s . b . C. n .3 . .u. u . _ S. .h.. P. t s v. .. at: C ,. ... ._ 1 h. r:— a. .C L. .C uh. . . .p a v . {a P3 P . ~ A (a 5d 1!:- nachines which can be trained to recognize patterns, and Sebestyen [S-l] identified the task of finding clustering transformations as central to the design of pattern recognizers. Fu [F-A] presented the latest developments in the area of machine learning, emphasizing sequential methods in statistical decision theory and estimation theory. Recent books edited by Watanable [W—3], Kanal [K-B] and Tou [T-Z] are collections of papers on pattern recognition. Each author emphasizes the philosophy of the approach rather than mathe- matical derivations or experimental data. The problem of classification has, indeed, received much attention by statisticians. Of the many sources, the books of Fisher [F-Z], Anderson [A-l], Raiffa and Schlaifer [R-l], Ferguson [F-ZJ and Blackwell and Girshick [B-B] Should be mentioned that deal with the theory of statistical techniques and the application to classification problems. For learning theory, Spragins [S-S] and Braverman [B-A] studied the convergence question in supervised learning. This question deals with the Sufficient conditions under which the para- meter posterior density approaches the delta function about the true value of the parameters as the number of samples increases. Patrick and Hancock [P-l] gave a rather general approach to learning schemes. The literature closely related to the thesis is now sum- marized. Dubes and Donoghue [D-h] considered the problem of determining which of a finite set of N-state, discrete-parameter Markov chains is active. 
The states are observed without noise and the transition probabilities for the chains in question are ..K. --.r’ on _. II D V . h . . . I .. w. ..t ~ I . I O n o . ... r. a n A r: .‘e ‘n r . .d a» Qs l . p. ..>_ . . u. u . .0; _ v. C .r. L- r. L . Z .\c . _ a3 .34 .u .: .1 . a a. L. s t _. ... a e . ... . t . J . . .u . v. . .. s. . C .. .. a? ,S .. ... G. ... a: .\. p . Ce 9 a fix. t .cu .: _ .w. o t I . A . v. ..t. . u .. v . .S c \v . . L. .3 3 ... ... .u: .W. .5 . . ...,“ Eu. ..1 v,\l. sot. unknown. A Bayesian Strategy is employed throughout the report. However, they did not investigate probability of error problems. The reSults in Chapters II and III presented in this thesis differ from reference [D-h] in that the pattern classes are described by continuous-parameter Markov chains. The unknown parameters are the transition rate matrices 62-matrices) and noisy observations are considered. Exact and asymptotic probability of error expressions are also derived for several cases. When the state activity of a general system is described by a first order, homogeneous, discrete-parameter Markov chain and the states of the chain can be observed only in the presence of noise, most of the literature deals with the design of optimum and sub- optimum decision rules for making decisions about the states of the system and establishing conditions under which the unknown para- meters can be learned. Billingsley [B-Z], Martin [M-Z], Good [G-1] and Bartlett [B-l] estimated the transition probability matrix of the underlying chain assuming, by some external means, that the states of the system could be observed. Removing the assumptions of observability of the states, Raviv [R-Z] constructed a class of adaptive decision rules using an estimate of the transition matrix, P, and only part of the past observations. Recently, Signori [8-2] studied the problem of determining the optimum-adaptive decision rule when the observations were governed by an underlying discrete-para— meter Markov chain. The conditional densities of the observed random variables, given the state of the system, were characterized by a set of unknown parameters. He derived an iterative, optimum adaptive decision rule with the capability of using future and past observations I | ‘ . u I n r \n I n . a I 45. . . J u v . o . . e a ._ r. o I o p. O u s p . .. n S a u. . . o . A. ... o o a C n: L . a . . I; a. .. ..- . . u o . .I . . o . . ~L 2’. . .1 s . _ . u c. n. v. r. .L .3 o. «I p c. _ u . u. . . .s . . . a. .\. . Ce . u . v . . . .4. . . ,r. .— .u. :. u a . _. . . .. o . . a. n . . ._ a. r. v. v. a . 3 .. cl . c. .. ... . . .. a . at. .n . u. .s a” q . - b . ... .4. . Vt - . P. a» .. r . 0 . . o . a u .x. . c a .1 . . . QJ c a nil as well as present observations. He also constructed a variety of consistent estimators for the unknown parameters which yield a class of Suboptimal rules. Patrick and Hancock [P-l] gave a rather general approach to the problem Of learning for classification and recognition of pat- terns with or without supervision. Their model for the quantization of the parameter Space is given as a reference in Chapter IV, to the solution of the problem of finite computer storage in implementing optimal decision rules and learning schemes. 
Hilborn and Lainiotis [H-l] investigated the Optimal (in the quadratic sense) nonlinear estimation of discrete-time or sampled stochastic processes, where the processes can be characterized as having probability distributions of known functional form but con- taining a set Of unknown parameters. A Bayes optimal estimate for a state was to be determined and expressed in terms Of the parameter- conditional-optimum estimates and another statistic which could be computed recursively. The observations obeyed a generalized Markov property. The results of Chapters IV and V in this thesis differ from [H—l] since the Optimization criterion used here was minimum probability of error. The decision problem is alos different in nature. Recent studies of the problem of detecting an arbitrary random signal in the presence of additive Gaussian noise by Kailath [K-Z] have resulted in an exact formula for the optimal likelihood ratio, which applies to the detection Of any continuous second order Stochastic process. The result must be expressed in terms Of con- tinuous Stochastic integrals, so it cannot be implemented directly. . u -.n~'v:" V “.-...U.‘ ‘ z.-. ... -40 v .‘1- .- Ob.- — " .... 3 . ‘I;"";O . >LI-‘l.~. -'- n ' h- . ... _ a . n 5‘ .. ‘__I; w ‘ ' I ~. ...L ‘ - D . 1" q/. . ._-,‘.‘_.‘-‘ -. o ~ ~.‘ ... ‘~L ... u U-.. A... ‘" r- ‘ 7 nl ... a a .. . "“ -L o. ‘ a 3-. a... “; , \‘u_ ,' u . I . “-.C" ... ' h _ ' O ' O .. ..I " O a “5, v~. “\ . .1 v o .V . l \ , . - a "-1" Q- - A . .. _ « .. . ‘1, .v- . 4 . . '5. ~.,:-. ~.S ..~ I .' e4 5 I‘ D . bl, 4 K a . b". . u u a ‘s ‘ _‘ . a, .v ~ , .. ‘- . §~:.- . n McLendon [M-l} investigated the general problem Of extracting an arbitrary random process from additive white noise. Under certain approximations, computationally feasible algorithms were derived for the logarithm of the likelihood ratio. The assumption essential to the solution of the problem was that the joint densities of the Observation processes, when the signal is present, could be approx- imated by Gaussian densities (Pseudo Bayes approximation). The question of detection of Markov processes in a noisy background has been studied by several authors. Nifontov and Likharev [N-Z] considered the Optimal detection Of a Binary, quantized, Markov signal in the presence of noise similar to the Signal. SOSOlin [8-4] investigated the Optimal detection of Gauss- Markov noise with discrete-time observations. They both adopted the Bayes likelihood ratio criterion as an Optimum decision rule and Obtained recurrent relationships for the likelihood ratio. Kulman and Stratonovic [K-O] provided Optimal devices for detecting a random telegraph Signal in the presence Of white Gaussian noise. They first Obtained a non-linear stochastic differential equation for the optimum filtering and then tried to solve it for some Special cases and compared the results Of the probabilities Of errors with non-linear and linear filtering. The work in Chapter V of this thesis, related to their results, was done independently and deals With a particular model which yields more specific results. 1.2 THESIS OUTLINE AND CONTRIBUTIONS The emphasis of this research is on pattern recognition and learning theory. The first major contribution of the thesis appears ,0- .‘u .3 _\. . . in Chapter II which contains a formal development of optimal and adaptive decision making and learning, employing a new model which has not been heretofore studied. 
The contribution lies in the fact that, under some necessary and sufficient conditions, a finite set Of constant parameters can be found which uniquely defines the underlying model. Chapter II is devoted to the case of perfect Observation (noiseless case). The basic model and the decision problem are defined in Sec. 2.1, Appendix A and Appendix B. In the early sections of the chapter, the optimum and the optimum-adaptive decision rules are found and expressed in terms Of sufficient Statistics Of finite dimensions for two cases. A supervised learning scheme is employed to learn the paramters, and the exis- tence Of reproducing prior densities for them is exhibited. It is also shown that the computer memory needed to implement these rules is fixed. Finally, a computer simulation is developed and discussed for a Special case. The second major contribution appears in Chapter III in which the probability Of error is studied. Some Specific reSults are Obtained for the model considered in Chapter II. In the case considered, all quantities in the model are known and there are only two pattern classes. In Sec. 3.3, exact probability Of error expressions are derived for a particular case while in the follow- ing sections, upper and lower bounds and asymptotic expressions are obtained. When the number of Observations increases without bound, the bound on the probability of error is shown to approach zero. In the last section of the chapter, conditional error probabilities _‘— . . II . H . . . r\- .V. . v. .s. . . .du Of the first kind and the second kind are introduced from which the total probability of error can be computed. The expressions derived for the error probabilities are original and self-contained. Final contributions appear in Chapters IV and V in which optimal and adaptive decision-making and parameter-learning problems are investigated under an imperfect Observation mechanism. Inserting this condition into the model of Chapter II requires a new model in which the states Of the chains are described by random processes. The main assumption Of Chapter IV is that the transition times from one State tO another can be Observed. After discussing several Sampling schemes for selecting features, the Optimum decision rule is derived and its basic components are generated iteratively, assuming all the parameters Of the model are known. Analytic results are Obtained for two Special cases. In the second part Of Chapter IV, adaptive decision-making and learning are studied when the model is not completely specified. A theorem about the convergence of the Optimum-adaptive decision rule is provided. The storage problem in implementing the adaptive decision rule and supervised learning algorithm is discussed. In Chapter V, the model and decision problem are studied but the assumption of observability of the transition times is re- moved. A uniform sampling scheme (discrete-time Observations) is assumed. The Bayes likelihood algorithm is developed for the Optimum decision rule and the recurrent expressions for the likeli- hood ratio is derived. In the case of continuous Observations, non-linear stochastic differential equations are derived for the IOgarithm of the likelihood ratio and the conditional error R?*"A In . L .9- ...... 10 probabilities Of the first and second kinds. The results of Chapters IV and V are original and have not appeared in the literature and present one of the major contributions of the thesis. I v .. O Q r Ix ‘ .— ... . . .n. .v . § .- ... . a. .9. o. . A‘. o a. .3 T. L . . 
Q ark. -\_ a . a b. p h § 0 .u. . \ CHAPTER II DECISION MAKING AND LEARNING WITH OBSERVABLE STATES AND TRANSITION TIMES The basic problem considered here is that Of determining which of a finite set Of M continuous-time, discrete-state Markov process is active, based upon Observations Of sample functions, when the stationary transition probability matrices, [P€§)(t)]g ._ ; s 6 {1,2,...,M}, for the processes in question 13 1,j—l are unknown. The main reSults Of this chapter rely heavily upon the definitions and theorems related with continuous-time, discrete- state Markov processes (continuous-parameter Markov chains) which are given without proofs in Appendix A. A Bayesian strategy is employed throughout. The problem is formulated as a problem in Statistical decision theory. Prior distributions which lead to convenient computer implementations are chosen for the unknown parameters. The amount Of computer storage required is of prime importance. The decision problem is defined and a source model is chosen for generating Observations in Sec. 2.1. In Sec. 2.2, Observation and parameter Spaces are defined, and in Sec. 2.3, the Optimal decision rule is derived when the transition rate matrices «2- matrices) are known for every pattern class. In Sec. 2.4, adaptive decision rules are derived when the transition-rate matrices are not known. A Supervised learning scheme is employed tO learn the 11 II II .‘ F. . . v I .- r. v v. 0 ~. . up an .3 o .. . u. v. .0. .u it v. _\ ._ r. 5. a. J n.. .1 .« n) a . ..I» .. a... a. .. . «n v. a. .. on .o. . 0 av; .- ‘A .s ‘. ... . . .. . .. .. M . a. .. U ' .'« .. o . v. I .O Q n s . \. ..u b . .e a 0.. v . .4. . .~ v . .nq .v. s .rs ... ... .s n . . v. 1 u 12 unknown transition-rate matrices and the existence of reproducing prior densities is demonstrated in Sec. 2.5. Algorithms for learn- ing transition rates are introduced in Sec. 2.6, while the final form for the optimal Optimum-adaptive Bayes decision rule is obtained in Sec. 2.7. Section 2.8 is assigned to the computer implementation of a Specific problem. Finally, the main results of the chapter are summarized in Sec. 2.9. 2.1 SOURCE MODEL Before going into decision rules and learning, the model by which observations are generated must be chosen. The Object of all decision rules is to decide which of M continuous-parameter Markov chain is active; each continuous-parameter Markov chain characterizes a pattern class. The observable quantities, or the "features", are the sojourn times in the states and the state numbers themselves. The transition-rate, or Q, matrices defining the processes are the parameters Of the distributions governing the Observations and must be learned from the training data. The train- ing data consist Of sample functions from labelled sources and are used to form posterior densities for the parameters which, in turn, are employed in Bayes decision rules. The properties of the obser- vation processes, some important definitions and theorems about continuous-parameter Markov chains, and the necessary and sufficient conditions under which infinitesimal parameters can uniquely determine a process are given in Appendix A. A key assumption is that a single Markov chain is active during the entire Observation interval. A decision about the .r .A-e.’ . .ut... ~- .-..»- | I. . .— v D p v s n. . . u . A .K- o . g I! - I . u . . . ... w ... .i r. o. h . n I .. ~ . . I .. . . .0. . a C. . ... v a a a r: . e . . 0.1 . .N. I.- n .4. Q s .u a O . u u c . .. .1 t). . . _ .... 2 v . ... .1 a v . 
.5 9. a. .V. n . ‘s w .. .I a t t s s . L, A . . . q .4 o .RL. w. . «’4. o . av . . u AL. s . . .oasw .Opm ( Dr .44 . . ....u . u . . . .1 A ru .: .. v. a. a . .. .o .5. s- m .. a .nu A a = .F. Q 1 n 9‘ .. . 1' y s I. .Ith 13 identity Of that chain is to be made after Observing a sample function. Each sample function is assumed to be generated in a manner consistent with the assumptions and conditions explained in Appendix A. The generation process can be pictured as follows: Suppose a typical realization starts in any state, say 1. Then, a waiting time, having an exponential distribution with parameter qi > 0, determines the length Of time Spent in state i. At the end of this sojourn, the process jumps to State j with proba- q. bility -li' qi random duration, as determined by an exponential distribution with , j # i. The process stays in the new state for a parameter qj’ and then moves to another state, say k, with probability Elk, k # j. The sojourn time again has an exponential j distribution with parameter All possible realization of the qk' process can be generated by this procedure. A typical sample function constructed by the above procedure is illustrated in Fig. A.l. 2.2 DECISION RULES The Observation Space is defined first. Each random variable in the sequence xk é (x1,x2,...,xk) of state random variables takes on values in a finite Space A = {l,2,...,N}, N < m, and each sojourn time random variable in the sequence tk é {t1,t2,...,tk} takes on values on the positive real line. All random variables are defined on the Space Sk = {wt w E Ak X [0,m)k}. Thus, the sample space 8k is the 2k-dimensional Euclidean space Of all sequences k k m = ( H gi) x ( H n1), where gi 6 A, “i E [0,m), vi. In particular, i=1 i=1 the random variable xi, i s k, is defined as Xi(w) = gi and the . \ ...... n a . .. t .. . \ . . s. u. b u. . a k u . 1 . . . . . S . .... a. r . . t - - . . . . o v o v. . a .r. ... . . . o _ .I ~ . n v 8 Ohm .~ . F. H‘. . 5 19.. s . .2 = T L; .. . s . ,C b s p . v. I. . ~ S . . L. D. .. .. .: >. . J v- .x. u . s .. c . i. r. ~§ . e .3 . t .x. .. . 3 1. 2‘ .: a >., . ~ . . n .u. r. t e . . _ . .nl‘ sat VJ r. NL- ~v- ... r n c . .d ... .' ‘- v‘-.4 .. .' .- I .. ... . . an" -. .3..- ra H'V 9" L.|~.. 14 random variable ti, i s k, is defined as ti(m) = “1' The necessary and sufficient conditions and Theorem A.1 given in Appendix A show that {xk; k 2 l} is a discrete-state, discrete-time Markov chain with transition matrix [rij]§,j=1 defined as follows: = (2.2.1) Chung [0-2] calls such a Markov chain, the jump chain associated with the continuous-parameter Markov chain, [xt; O S t < w}. The sojourn times (t1,t2,...,) are conditionally inde- pendent random variables. That is, if P(-) is a probability measure defined on the sample Space Sk’ then P(c1 e A1,t2 e A2,...,tk e Ak\x1 = i,x2 = j,...,xk = L) = P(t1 e A1\x1 = i) ... P(tk e Ak|xk = L) (2.2.2) where A1 is suitable Borel set on R1, i = l,2,...,k. Hence, for each sequence of realizations xk = (x1,x2,...,xk), a conditional distribution for tk = (t1,t2,...,tk), given xk is defined. In this model, the conditional distribution can be described by the . . . k R Q conditional den51ty f(t ‘x ) - f(t1\x1)f(t2\x2)...f(tk‘xk) on k [O,m)k with reSpect to Lebesgue measure p . 
The mass function p(xk) é p(x1,x2,...,xk), which is a discrete density over Nk points in Rk with reSpect to counting tneasure vk is defined in terms of the initial distribution over 1N ij i,j=1' Tflien, the joint density for (xk,tk), deuoted by g(-), is well- the states and the transition probability matrix, [r ‘— 5. £21ch ,.' b 15 k k k defined in terms of f(t ‘x ) and p(x ), with respect to the pro- . k 2k duct of counting and Lebesgue measures v xu on R and can be expressed as k k g(x at) 8(X19°°°9Xk9 t1:°°°atk) f(t1,...,tk‘x1,...,xk)p(x1,...,xk) (2.2.3) The general elements of the decision making problem and derivation of the non-randomized, Optimal decision rules for any arbitrary loss function, L(.,.) are given in Appendix B. In the following two sections, optimal and adaptive decision rules are derived for the cases of known and unknown Q-matrices, assuming a Special loss function. 2.3 OPTIMAL DECISION RULES WHEN Q-MATRICES ARE KNOWN The first case to be considered assumes that the Q-matrix, (S) N . _ [qij Ji,j=l 18 known for every pattern class, 8 l,2,...,M. The special "0-1" loss function is chosen, defined by L(i,j) = 1 - A(i,j)1. * The optimal decision rule, d (.), follows from (B.5), Appendix B, and is given by the Bayes decision rule: k k ,t ) = s if s is the first index such that P(9 = s xk,tk) 2 P(e = L‘xk,tk), * d (x v L f s, {as E A (2.3.1) It follows from.Bayes rule that l A(i,j) denotes the Kronecker delta. 7' C . ...-new!- I! ' ‘\ ‘ x (.0... ' Y ,. U .. y I u- a .7 2 :- - .. ‘ 9- ~ v .. . a ' o .‘ ...‘K ,G‘ 6' l g u. -. .1: :.A_ - ~~- r' \_H\‘ W .‘ \ "-' c s ‘. ’. t-‘.. ‘0 .5- , t b- . '¥ _, r" - ’ I r. I..p 16 k xk,tk) = P(9=S)g(:>1§19=s) . (2.3.2) 80‘ at ) Defining gS (xk,tk) é g(xk,tk‘e = s), P: =P(e= s) and employing P(e = 8‘ (2.3.2), (2.3.1) becomes d*(xk,tk) = s if s is the first index such that o k k k k PS gs(x ,t ) 2 1);:ng ,t ) v L ¢ 5, 3,4, 6 A. (2.3.3) The density functions required in (2.3.3) are given by k k gs(x ,t ) — fS(t1,t2,...,tk x1,x2,...,xk)ps(x1,...,xk) V s E A (2.3.4) The first factor in (2.3.4) is now computed. From the con- ditional independence of (t1,t2,...,tk) and (A.14), k fs(tl,...,tk‘xl,...,xk) = .n fs(ti‘xi) 1=1 k k ‘ H q<3> exp - Z q(s)t. (2'3'5) . x . x. 1 1=1 i i=1 1 The second factor in (2.3.4) is found from (A.13). o k'1 qié)’ x1+1 pS(x1,x2,...,xk) = Ps(x1) .H 1 (2.3.6) i=1 q(s) xi Then, the joint density gs(xk,tk) is given from (2.3.5) and (2.3.6) as: k k k-l (1(3) x k , . 83(x ,tk) = H qis) exp — E q(s)t. Po(x ) H Xi 1+1 (2.3.7) . . x. 1 1 , i=1 1 i=1 1 i=1 (8) qx "a“:d . H ht '-‘.. O . ." - \ K. h \ . o ...,O ‘ n. “‘3 .V,... .7- \ $315 ... 5“V ...- 21): ‘ “I ‘ l . ) . - l‘. ‘h. ...r " i .. A . .F’ - ‘- / ).- l \ ‘4 ~.“ I" mm?" H“ A.» I. I I“ I ‘ .1-" . 'I. \I- .I 17 Equation (2.3.7) can be written in a more convenient way k as follows: Let Nij = Nij(x ) denote the number of one-step transitions from state i to state j in xk = (x1,...,xk). Let k Ki = Ki(x ) be the number of occupancies of state i in x , not counting x1, and let tij = tij(tk) be the waiting (sojourn) . . . .th t1mes in state 1, for the j- occupancy of the process. The total . . . k . sojourn time In state 1, 2i = Zi(t ) is then, = +t +...+t i Zi til 12 1K1 V 6A Then, (2.3.7) can be written as, (s) N.. k k N K. N N N (1'1 13 830‘ ,t ) = n(qi(s))1 exp -z: qis)zi P:(x1) n n :8) i=1 i=1 i=1 3‘1 qi ji‘i (2.3.8) Taking natural logorithms of (2.3.8) and noting that ij i k ‘iLIIMZ ...: 2 II N H "I :4 ‘IL H L4. L;- H. 
= K, +1 if x =1 (2.3.9) the optimal decision rule in (2.3.3) can be expressed in a simpler form as follows: k * k d (x ,t ) = s if s = L maximizes N N N 0 0 (L) (L) (L) Ln P (x )P - 2: q. z. + 2 EN an (2.3.10) I: L .1 L qxl] i=1 1 ,1 i=1j=1 ij 13 ji‘i 2.4 ADAPTIVE DECISION RULES WHEN Q-MATRICES ARE NOT KNOWN When the infinitesimal parameters, {q§:)}§ j’ s E {l,2,...,M}, 3 are not known, they are treated as parameters in density functions ..qu. C" .. ¥ 5.. :3‘31 vu- 3256.":6: vu- in: .t .- Le 18 describing the observations. To make effective decisions under such circumstances, information about these unknowns must be extracted from classified observations. The state numbers and the sojourn times of sample functions from each pattern class are observed and the infinitesimal parameters can be eliminated from the decision rule using supervised learning techniques. The train- ing data from pattern class 3 form a random vector ns ns A = ... ... of states numbers (yS ’Ts ) (ysl.ysz. ’ysnS’Tsl’TSZ’ .Tsns) and of sojourn times. The optimal decision rule in this case is * k k d (x ,t ) = s if s is the first index such that o k k “s “s o k k “L “4, PS gs(x .t lys .TS ) 2 PL gL(X .t ‘YL ’TL ) v L 5‘ 8. L.s e A (2.4-1) where k k n “S k k nS ns nS nS 85(X ,t \ySS,TS ) = INZ 88(X ,t ‘ys ,TS ,qs,rs)f(qs,rs\yS ’Ts )dqsdrS R (2.4.2) where 9 (S) (S) (s) . qs - (q1 .q2 ....,qN ), = (S) N rs {rij }i,j=1 (2.4.3) (s) N The term r, was defined in (2.2.1); 2 r(S) = 1, Vi E A. 1J ._1 ij J— j#i The supervised learning procedure is the same for all Pattern classes so the subscript and superscript s will be dropped from the following development. The first factor in the integrand 0f (2.4.2) can be written in a form similar to (2.3.8) as follows: 19 N K N N N N k k n n 1 O . 30‘ ’t ‘y ’T ’q’r) = n qi exp ' 2 qizi P (X1) 11 II (rij) U 3541 (2.4.4) k _ k _ k . where Nij - Nij(x ), Zi - Zi(t ) and K1 - Ki(t ) are defined as before. The second factor in the integrand of (2.4.2) is computed in Sec. 2.5. As a result, the required density has the form: k k n n o N Ki N 30‘ ’t ‘3' ’t ) = P (*1) $qu .13 qi exp ' E qizi 1—1 i—l N N Ni. n n x n n (r..) J f(q,r\y ,w >dq dr (2.4.5) ._ ._ ij 1—1 j—l j¢1 Equation (2.4.5) shows explicitly that the required density function is proportional to a joint moment of the posterior distribu- tion. As shown in the next section, the computer storage is limited whenever a natural conjugate family of distribution is used for each (qs.rs)- 2.5 SUPERVISED LEARNING The object of this section is to form the posterior density n n function for (qs,rs) from the training samples (ySS,T S). The 3 learning is supervised because the continuous-parameter Markov chain that produces each set of training data is labelled. The s subscript and superscript will be dropped for convenience. The necessary and sufficient condition for the existence of a reproducing prior distribution for (q,r) is that a sufficient statistic of finite dimension exists for (q,r) (Theorem 2, Spragins [S-6]). To demonstrate this finite-dimensioned sufficient Statistic, the likelihood function g(yn,Tn‘q,r) can be written 20 from (2.3.8) as: N k. n n 1 8(y :T ‘Q:r) = H (11 exp - i=1 i o N N nij q-z. P (y ) 11 n (r. ) (2.5.1) 1 1 1 1 i=1 j=l ‘3 ja‘i IIMZ where nij = nij(yn); zi = 21(Tn), ki = kiCYn)- Equation (2.5.1) shows that g(yn,7n|q,r) belongs to an exponential family of distributions. Thus, (q,r) has a sufficient statistic of finite dimension. 
One such statistic is denoted by T(yn,Tn) and can be easily determined by applying the factoriza- tion theorem to (2.5.1). N N N T(yn.Tn) = nij(yn) . 21(Tn) . ki(yn) i,j=1 i=1 1= j-‘fi (2.5.2) Thus, a reproducing prior density exists for the para- meter (q,r). Any reproducing prior density can be written in the following form: 8(Y_ :°-°ay :T_ 900°3T ‘Q:r)¢(Qar) f(q,r|yO,To) = m 0 m 0 (2.5.3) 8(Y_m9"°’y097_ ’0003T0‘Qar)¢(Q:r)dq dr qxr m where (yo’To) = (y_m,...,yo,m_m,...,To) are fictitious observations and ¢(q,r) is an arbitrary positive function of (q,r) except that the denominator in (2.5.3) must exist. In particular, setting ¢(q,r) E 1, the following reproducing prior density is obtained for (q,r). This density is the product of N gamma censities and a matrix beta density. ' vi+l v. -w,q, ' o N w, q 1e 1 1 N N p, -1 i=1 (V1 ) i=1 j=1 N 13 ._ .J ja‘i f. —. .. .tA .H. .p. .» as I... l ... . . . y . (V ’r I. .» ... .. . .... Qt b e at . . .v. .a I». m. .3 vi ‘ ../ LIV ..- . b ‘ Ir.“ e. 3. .. . C 4.. . :- n‘J .p. L L. . .. o t . W.— .. ... .a. .15 A: ..u. .U A ..nl. fl ult d 21 where F(D.) do A 1 A N O B (Lien) = 9 fl. = 2 [1“. (2.5.5) N 1 N o 1 ,=1 13 n F(u..) J i=1 ‘3 j¢i The parameters of this distribution are the matrix n = [no ]N where ”O = O and the vectors v = (v v ... v ) 0 ij i,j=1 ii ’ - 1’ 2’ ’ N j#i and w_= (w1,w2,...,wN). The posterior density for (q,r), based on all training samples (yn,Tn) = (y1,...,yn,71,...,mn) is found to be: g(yn.Tn\q.r)f(q.r\yo.¢o) f(q,r‘y“,.“) = f g£(q.r\yo.¢o)dq dr qxr V +1 V, 4W,q, N wi1 qil e 1 1 [ N N A mij-l = H ' H H B (m.)r.. i]. 1:]. F(Vi + 1) i=1 j=1 N 1 1.] j#i (2.5.6) where V 9 v, + k,, ‘W = w + z , m e p? + n 1 1 1j ij 1j F(Mi) N (2.5.7) 91m” -Bé M92 ijJi,j=l’ i N ’ i ,=1 mij II NHL.) 3 i=1 13 The posterior density has the same mathematical form as the prior density for (yn,7n). The only difference between (2.5.4) and (2.5.6) is in the parameters of the two densities, which are re- lated by (2.5.7). The convergence question in supervised learning deals with conditions under which the joint prior densities of the parameters, ‘ «‘n' '- v‘ t‘ 1.. ...-fi ' -. I," — ua ...... a L up P P -- _- -- .- U . 1.. - D .. . . . g. .. . ..:" ° 1“. .. § . ~. .-j .. .. b I 9.. .. w P . -.._' b ... ‘\ .b ..P '- -. . I ,. m 1‘: .L. H _S s ~A 0 for all z in a Sphere containing 90. C. The posterior density fe(.‘yn) is defined as before. D. A consistent sequence of estimators t1 = t1(y1), t2 = t2(y1,y2),...,tk = tk(y1,...,yk) exists that converges to 90 Wpl. Then, fe(z‘yn)-E::>|6(z - 90) Wpl, where 6(.) is an impulse function having the same dimension as 90. For the problem considered above, it is easily seen that the first three conditions are satisfied for g(q,r|yn,Tn). To exhibit the existence of a sequence of consistent estimators for the unknown parameters, (q,r), the following well known result [0-3] is used: If a sufficient statistic of an unknown parameter exists, then any'naximum.1ikelihood estimator will be a function of this sufficient statistic. Bartlett [B-l] first showed that the maximm likelihood estimators, ei , obtained by Eij = nij/ni’ j N where n1 3 E nij form the consistent sequence of estimation i=1 for the parameters rij in (2.5.1). It can be also shown that there exists a set of consistent maximum likelihood estimators for the parameters q. in (2.5.1), iven in terms of the sufficient 1 8 Statistics ki and 21 as a, = (ki +-l)/zi, i = l,2,...,N. 
23 Thus, the last condition of Theorem 2.5.1 is satisfied for n n n 8(q,r\yn,Tn) and g(q,r‘y ,T )-¥:£9’6(r - ro,q - qo) w.p.l. N 2.6 LEARNING {qijh’j=1 ja‘i N J}i.j=1 N ja‘i i.J'=1’ ji‘i '8, given the N S fa 1 th a a t a d o r, on y e p r me ers {qi}i=1 n {ri have been learned. It is also possible to learn {qij} or to determine the posterior densities for the qij . . n n . training samples, (y ,T ), and to prove that these posterior densities converge to 6(qij - q? ) w.p.l., as n tends to infinity. 3 0 . . n n Here, qij is the true value of qij' To find ¢(qij‘y ,T ), the posterior density for qij’ let q.. = q.r.. i,j 6 A(j # i) and (2.6.1) .(yk). (2.6.2) define N, Q . 1 1 ij i ~‘l|~ll["12 H D J j The joint posterior density function for (qi,ri), given (yn,Tn), can be obtained easily by integrating (2.5.6). The required density has the form: V.+l V. -q W. l. 1 l 1 W. q. e n n _ 1 1 f(ql’rlj‘y 9T ) - 1n(vi + 1) I‘(Ni - nij)I‘(nij) ij ij 0.5(qij - qij) w.p.l., is shown in Appendix C. 2.7 OPTIMUM-ADAPTIVE DECISION RULE In the previous section, joint posterior density functions for the parameters of each pattern class were obtained and used to eliminate the unknown infinitesimal parameters, {qij}, from the decision rule. The decision rule obtained in this manner is an adaptive decision rule which is optimal for the prior information and cost function given. Furthermore, the decision rule adapts or converges to what the optimal rule would be if the true para- meter values were known. Thus, in summary, the optimum-adaptive Bayes decision rule is given by: * d (xk,tk) = s if s is the first index such that n I"! n n o k k s s o k k L L Pg(X.t\y .T _)=max {Ps(X.t\y .T } S S S S Le[l,2,...,M} LL " L (2.7.1) where k k “L “L - k k “L TIL “L “L 81(x .t \YL ’TL ) - qur31(x ,t ‘YL ,TL ,q,r)f(q,r‘yL ’TL )dq dr \IL (2.722) The first and second factors in the integrand above are given in (2.4.4) and (2.5.6), respectively. After some algebra, (L) v+1 (L) 1 MW.) (L) gL(x ’t WW.“ '7 n (L) ' i (L)i 1‘1 (L) vi ”‘1'” 1"("1 +1) (. .) 1 i N jgl (Nij +mg’) - 1)....(mf3J) o N j¢i (2 7 3) x P (x ) n ' ' L 1 i=1 (Ni +miu.) _ 1hump») 26 where w(L) A (L) + (TQL , v(L) g (L) + k ( 9L) (L) g ( 9L) + o(L) i ‘ W1 zi L )’ 1 vi i 12 ’ mij “ij yL “1) 21 Z (t ), K Ki(x )a Nlj N1J(X )3 N N EL) A )3 mg), N1 A 8 Ni' V L j=1 J j=1 J j¢i j¢i 2.8 COMPUTER IMPIEMENTATION The optimum-adaptive Bayes decision rule obtained in (2.7.1) and (2.7.3) would be implemented in an iterative fashion in an actual application. Such an implementation and a simUlation are given in this section. The storage requirements and execution time required to simulate the decision rule are discussed. Equation (2.7.3) can be written iteratively as follows: ViL)+Kx (Xk'l)+1 k-l k k k k n n (L) k [wiif—mxka )] g (x¢\yLnL)=V +K Q) (L) {I L xk Xk ViL)+Kx (xk)+1 [wi:)+£xk(tk)] k k a) k'1 +N (X ) x ka-l’xk xk-l’xk - g (xk-1 tk-lh’nL T114”) (2 8 1) méL) *‘Nx (xk-l L ’ L ’ L k-l k-l VS.)+1 ((50 1 1 1 “L InL _ (L) 1 80! ,t ”L ’TL ) — (Vx1 + 1) Va)“ +1 P (X1) X "X w“)+z 1 1 x1 X1 27 In (2.8.1), Rx (xk) is the number of occupancies of state k k xk in x , Zx (t ) is the total sojourn time in state x k k-l _ . . N (X ) is the number of one-step tranSitions from state x x k-l’ k k-l k-l x to state x in x and N (X ) is the number of k-l k xk_1 . . . k-l . . . . one-step tranSitions in x whose initial state is . k k in x , xk_1. An algorithm for computing (2.8.1) is shown in Fig. D.l of Appendix D. 
The optimum-adaptive decision rule implemented in this fashion is a fixed memory rule. No matter how many learning samples are employed and no matter how large the size of the vector to be classified, only the parameters for the posterior densities need be stored. Including the storage requirements for prior informa- tion, the total number of words of storage needed is equal to M(N2 + 2N + 2). A computer simulation was made to illustrate the performance of the optimum and optimum-adaptive decision rules. All computer simulations discussed below were performed on the CDC 6500 digital computer at Michigan State University. The Specific case considered is the following: 1. There are two pattern classes denoted by ml and wz; Pi = P; = 1/2. 2. The continuous-parameter Markov chains which produce the samples are assumed to have 3 states for both pattern classes. 3. The Q-matrices used to produce all observations are listed below. Q1 = r 0.60 0.40 L036 0.12 1.00 0.84 28 q 0.48 0.60 1.20 .4 Q2 = P 0.50 0.60 Lo.7o 0.15 1.20 0.30 The initial state probability distributions for the bility distributions of the correSponding chains. were computed by (A.ll) and are given as follows: H: LAND-d 0 Pi("1 1) 0.387 0.306 0.307 O 2(x1 P _ i) 0.388 0.224 0.388 The following parameter matrices were used for the l 0.35 0.60 1.00 ‘ chains were chosen to be equal to the stationary proba- They matrix-beta prior density, which was used for the para- meters of the jump chain. 0 ”1 o “2 = (o 5 7 b 6. The following parameters for the prior the sojourn times were used. F" = 1.0 h- 1.0 1.0 3 - V —2 densities of F- - 15 10 , b12 0‘ In the first part of the sinulation, the average error Q-matrices were not known. curve was obtained using the decision rule in (2.3.10). F661 = 10.0 12.0 In the Second part, the average curves were obtained for the case when the Two cases of supervised learning were 29 employed, all applying (2.8.1) and differing in the number of training samples provided per pattern class. Training samples of size 50 and 100 were used for each pattern class. Each error curve provides a Monte-Carlo estimate of the probability of error as a function of the size of the observations. In each situation studied, 10 such error curves were obtained along with an average error curve. Using the same notation as tD-4], 100Li(k) is defined as the number of wrong decisions in 100 classifications of a sequence k k (x :t ) for the iEh-run of Fig. D.1. The kgh'point on the average curve is defined by 10 - =}_ E L-(k) L(k) 10 i=1 1 Thus, 1003(k) is the percent of wrong decisions in 1000 independent classifications of k states. Average error curves for several situations are shown in Fig. 2.8.1. As mentioned in [D-4], the quality and amount of prior information are critical factors in determining the rate at which the error converges to that for known parameters. Equation (2.7.3) shows that the posterior density function for (xk,tk), given the training samples, depends on the parameters VEL), W(L) and mgL) i ij (L) __ (L) (L) _ (L) (L) _ 0(L) where Vi - vi +-ki, Wi — wi + zi’ mij — n11 +-nij. The prior information as presented by VEL), WEL) and pi§L), can thus be made to either overwhelm the initial training samples or to be overwhelmed by them so the magnitudes of va), wa) and uZ§L) are a measure of the amount of prior information being inserted in the decision rule. 
If these parameters are properly selected, the training data will reinforce the prior information 3O A Estimated Error Prdbability 0.3 -r 0.2 7 <::f-' Optimum, 50 learning samples “—r Optimum, 100 learning samples 0.1 _. Known Q-matrices . “— 4 > 50 100 Length of Observation Sequence Fig. 2.8.1: Average Error Curves 31 and the amount of training data used is not critical. 2.9 CONCLUSIONS This chapter has dealt mainly with decision making and with learning the unknown parameters in finite-state, continuous-para- meter Markov system.with observable states and transition times. The model by which observations are generated was defined in Sec. 2.1 and the properties of the observation process were given in Sec. 2.2. Assuming the Q-nmtrices defining the chains were known, an optimal decision rule was defined in Sec. 2.3 to be that rule in a given class of rules which minimizes the Bayes Risk. 08.2). The optimum-adaptive decision rule (2.4.1) was derived in Sec. 2.4, in case of unknown.Q-matrices while the supervised learn- ing scheme for learning them'was employed in Sec. 2.5. The existence of a reproducing prior distribution for (q,r) was exhibited. It was also shown that, under the stated conditions, the parameter posterior density (2.5.6) converges to a delta function centered at the true value of the unknown parameter. The posterior densities for the infinitesimal parameters given the training samples were obtained in Sec. 2.6. The final analytical form of the optimum-adaptive decision rule was given in Sec. 2.7 and it was expressed in an iterative form in Sec. 2.8. Finally, a computer simulation was performed for a specific case to obtain the probability of error curves in both known and unknown parameter cases. CHAPTER III PROBABILITY OF ERROR The quality of a decision rule is characterized by the total probability of error. Unfortunately, for general decision- making problems, exact analytic solutions for the probability of error are impossible. Even if one could find such solutions, they would be tremendously complex. For this reason, simple lower and upper bounds, or asymptotic errors or iterative approximations are more useful than exact error probabilities. In Sec. 3.1, exact probability of error expressions are derived for the two-pattern class case where both classes are de- scribed by 2-state, continuous-parameter Markov chains with known Q-matrices. Lower and upper bounds are given in Sec. 3.2. The limit cases are also studied as the number of observations tends to infinity. Asymptotic probability of error formulas are derived for the two-pattern class problem with N-state Markov chains having known Q-natrices in Sec. 3.3. The probability of error is shown to converge to zero w.p.1 as the number of observations tends to infinity. In Sec. 3.4, the conditional probability of error notion is introduced and iterative expressions are established for them. Finally, Sec. 3.5 summarizes the main results of the chapter. 3.1 EXACT PROBABILITY OF ERROR FOR N = 2 AND Q-MATRICES Let Q1. and Q2 be known Q-matrices characterizing pattern classes, ml and m2, reSpectively. The following notation will 32 “vi-133a}. 1.. .111 33 be used: Q1 = ; Q2 = 9 q19q2’p19p2 > 0 N The density functions, (2.3.8), are evaluated below for the observa- k tion sequence (xk,t ) = (x1,...,xk,t1,...,tk). Since the states must alternate, and assuming k is an even integer, Ki 8 k/Z and rij = l V i,j E {1,2}, (j # i). Then 2 -Z.q. k k k/2 31(x ,t ) = P:(X1) U qi e l 1 (3.1.1) i=1 2 -Z.p. 
k k k/Z 32(x .t ) = 9361) n pi e 1 1 (3.1.2) i=1 Hence, the likelihood ratio, A(-), is defined as k k 0 g (x .t) P (X) 2 p. k/2 -z,(p,-q,) 2 _ 2 l (_19 e i 1 1 (3.1.3) A(Z ,2 ,X ) -—-——- - -———- 1 2 1 g1(xk,tk) 230(1) i=1 qi Using the "0-1" loss function, the optimal decision rule is given as * k d(xk,t)=u)1 if L<'n = (1:2 if L > n (3°1°“) where o 2/k o P q q P (x ) n QLn -% +Ln plpz; L3 ?an g 1 +12: )3 (qi - pi)Zi (3.1.5) P2 1 2 P1(x1) i=1 The total probability of error, Pe’ is then given as Fe = P(L > Tflwl)P(£ + P(L < szwg (3.1.6) 34 where pa. > (no.1) = m > mxl -- 1.691930) + P(L > n\x1 = 2,m1)Pi(2) (3.1.7) P(L < n|m2) = P(L < n\x1 = l,w2)P;(l) + P(L < n\x1 - 2,w2)P:(2) (3.1.8) Using (3.1.5), (3.1.7) and (3.1.8), it follows that O O P(L > n\wl) = P(Z > n1\w1)P1(l) + P(Z > n2\w1)p1(2) (3.1.9) 0 P(L< filwz) = P(Z < n1\w2)P2(1) + P(Z < n2‘w2)Pg(2) (3°1-10) where 2 2 ) P1P1(i) Z/k q1‘12 1 2 Z=-£(q.-p.Z.; n.=Ln --—,i=, k i=1 1 1 1 1 P3P;(i) p192 (3.1.11) The density functions for Z1 and 22 are now obtained k/2 k/2 Z1 = .2 t2i-1; Z2 = z t2i if x1 = 1 (3.1.12) i=1 i=1 k/2 k/2 Z1 = .E t21 ; Z2 = ._ t2i-1 if x1 = 2 (3.1.13) 1—1 1—1 There are k/2 terms in each sum in the above expressions and also ti ~ Exponential (qi) when wl active ti ~ Exponential (pi) when w active 2 Thus, for both values of XI, Zi is the Sum of k/2 i.i.d. random variables, all having the exponential distribution with parameter (11 or pi (i = 1,2). Thus, Zi ~ Gamma (k/2,qi) under ml Zi ~ Gamma (k/2,pi) under wz 35 Explicit probability of error expressions are now derived for several cases. 92. - . 9.2 . Let g1 - k(q1 p1), g2 - R(qz-p2)) g1 > 0, g2 > 0' Then, 2, defined in (3.1.11) can be written as z = 1:121 + 5222 and, €121 ~ Gamma (k/2,i1); €222 Gamma (k/2,xz) under ml 5121 ~ Gamma (k/2,p.1); €222 Gamma (It/2m?) under (02 - -——¥———— 1 = 1,2. (3.1.14) The density function for 2 can be found by convolving the densities and expanding the integrand with a power series. As a result, the densities for Z under m1 and under wz are given as a n -322 fZ(Z\w1) = 2‘. c112 e z > 0 (3.1.15) i=0 m n -p22 fz(z\w2) = .2 C122 e z > 0 (3.1.16) i=0 wh A - + k _ 1. g F(k/2 + 1) (1 x )k/Z (12 ' 1‘1)1 , ere “ 1 ’ C1i F(k+i)F(k/2) 1 2 1! ’ i g F(k/2 +11) ( )k/Z (”2 ‘ “1) C12 I"(k+i)I‘(k/2) “1112 1! Using (3.1.6), (3.1.9), (3.1.10), (3.1.15) and (3.1.16), the total probability of error can be evaluated for several sub-cases. The following probabilities in (3.1.9) and (3.1.10) are first computed. With a modest amount of effort, it can be shown that ‘hh n mnpj P(Z > n11m1) = z 611 e 2 “‘3T"‘ (3.1.17) 1=o j=0 m -81 ncnifl P(Z > nz‘ml) = z 611 e 2 2 z -—%T;—— (3.1.18) 1=o j= j w -n 1 n (n i ) P(Z < n1\m2) = 2 b,2 [1 - e 2 2 —l!—2——] (3.1.19) 1: 1 J=0 ‘1 ( n ‘ ) m [i -n,i2 n (“212)j] (3 1 20) P Z < w = 2 b. - e 2 -fi--—- . . 2 2 1= 12 j=0 J! "D C ng-i-k) . b grg1+k2 11 11 i+k ’ ° “ 1‘2 ”2 where b Then, in terms of the above expressions, the total proba- bility of error becomes . -ni n(nifl -ni n<8ifl P = PO[ 2 611(P°(1)e 1 2 z; —l-2—— + P:(2)e 2 2 z —-2—-2-—)] i=0 = e 1 l j 0 j! 3:0 3! w n» n mu)j 'Butlmu)j o o 1 2 l 2 o 2 2 2 2 + P 2 b. P (1) (1 - e 2 ——) + P (2)(1 - e Z _____)] (3.1.21) CASE IB. “1 < 0, “2 > O P(Z > n1\m1) = 1 ; 8(2 < n1|mz) = o The remaining probabilities are computed as above. Using these probabilities in (3.1.6), the total probability of error is: an 'In A n (A Tl )1 _ o o o 2 2 2 2 P8 — P1[1gobfl(?1(l) + p1(2) e 320 j! )] °° Tm n (Bu )1 + P;[ E b12P(2)(2) (1 - e 2 2 )3 -—-§-!—2—):l (3.1.22) i=0 j=0 37 CASEIC. 
n1>o,n2 n1\w1) = 1 ; P(Z > nz\w1) P(Z < nl‘mz) = 0 ; P(Z < n2|m2) = 0 Total probability of error is, Pe = P1 Df' é3( - ). =2( -) 3124 e lne g1 k ql pl , g2 k ‘12 p2 ( ' ' ) Then, the logarithm of likelihood ratio in (3.1.5) becomes 2 P3011) L = ELn O + glzl - |g2|22 (3.1.25) P1911) The total probability of error is calculated in terms of the formulas, (3.1.6), (3.1.9) and (3.1.10), where 2/k . D192 00, qq P2P2(1) 1 2 0 O . Q A P1P1(1) . z - 5121 "1521221111 = Ln -— , 1 = 1,2. (3.1.26) Here, €121 ~ Gamma (k/2,x1) ; lgzjzz Gamma (k/2,x2) under ml 5121 .. Gamma (k/2,p,1) ; 152122 Gamma (k/2,p.2) under where X. ' = 1,2,. (3.1.27) 38 The density functions for 2, under m1 and under wz, can be determined as in Case I and are given by: m n . . -Z(k +1 ) fZ(z\w1) = Z 2 CE?) z1+J e 1 2 z > 0 i=0 j=0 J m . z) = z c§1) 21 e 2 z < 0 (3.1.28) i=0 1 m n -z(n +u 2 fzczlm,)= z 2 f) 1 2 z>o -0 j=0 J m 2% = z cfz) 2 z < 0 (3.1.29) 1=o where m 2 g - l ; n Q k - i - 2 k/2 C(1)3 H”) rgn - H1) (1112) , . 2 3 1 (1N+)2 r (k/2) k/2 = (-1) ( ) n-l 2 1 (9141.2) r (k/2) (1 +1 )1 < + )j (1) A (1) 1 2 (2) 1 (2) ”1 LL2 = C. ° _.———— C.. - C. —'——:'—_- C1] 1 J! ij i j! The total probability of error will be computed for the fOllowing'sub-cases. CASE IIA. n1 > 0, n2 > O 'The following probabilities are first computed as in Case I. m n 'n (A +1 ) b(l) l l 2 P(Z > n w ) = Z Z “r (3-1-30) 1‘ l i-O j-O r-O bijr le m n -B (1 +1 ) P(Z > n2|w1) = z z zbfji file 2 1 2 (3.1.31) i=0 j=0 r=O ._ (2) m n L (2) r “1‘21“”2) P(Z < n w) b 2 2 2b (3.1.32) 1‘ 2 i=0 j=0 r=O 13‘ 1 m n L -n (u +u ) P(z < n2\w2) - b(2) - z z z b(2) n: 2 1 2 (3.1.33) m> NHx2 where m b<1) g c<1) m. + 1) “1%) bijr ij (11+22)L+1 ' r! ; r b(2) A (2) m. + 1) , (““1 + ”2) . bijr — Cij (ulfip2)L+l r! ’ g m (2) P(l + 1 n (2) P(Lrfi-l) b2 ‘ 12.01} 1) iii—1+ E Ocij L+1 ° Substituting the above expressions in (3.1.9) and (3.1.10) and using them in (3.1.6), the total probability of error is obtained. _ o (1) o r ““1(*1+*2) o r '“2(*1+*2) Pe ' P1 . 2 [Eijr P1””1 e +P1(2)“2 e 19j2r o (1) o r ““1(“1+”2) o r '“2(”1+”2) + P2 b2 - . E [},,r P2(1)n1 e + P2(2)n2e ] 13J2r (3.1.34) Using the same procedures as in Case IIA, the total proba- bility of error expressions are derived for the other sub-cases. CASE 113. nl < 0, n2 > 0 Mn 1-n (1 +1 _ o 012 (l)r 2 1 2 Pe - P1E31P1(l)- isz°1(1)b:: H(-n1)e + §r1P011'(2)b31-en2 j n u -n (u *uz ) + P0 [ 2 P °(1)b(2 r)(-n1)re 1 2 + bZP:(2) - 2 P °(2)b(§)n;e 2 12] i,r i,j r (3.1.35) 40 CASE 110. nl > O , n2 < O -T\ (1 +1) 1 1 P = P0 2: P°(1)b(1)11e1 1 2 +b P°(2) - 2: P °(2)b(1)(- 412 )2e 2 2 e 1 ”j ijr 1e 1 1 i,r 4] (M +1») I.“ u + P‘2’[b2P‘2’(1) - 2 P 3(l)b,(2,)'n1e 1 1 2 + 2 P2 0(2)b,(2)( (412).: 2 2] i,jar i,r (3.1.36) CASE IID. n1 < 0 , “2 < 0 122 - o (1) o r n o 2 r n22‘2 pe _ p1 b1 - izrbir [P1(l)(-fl1) e + Plan-112) e :1 i,r H 1» ll 11 + P; . 2: b(2)[: P;(1)('Tl1)r e 1 2 + 123600-712)r e 2 2] (3-1-37) Kr where bf: ) A <-1> a”) $52—12 73 ; b3.) 2 <-1> cf” 4411? 2 “2 m n b 1 Z[1.11)<1>.2111..,:<1>_2_+.11] 1 , _ ij 1+1 1=0 12+ j- -0 (11+12) 3.2 AN UPPER BOUND ON THE PROBABILITY OF ERROR The exact probability of error formulas derived in Sec. 3.1 are very complicated and the asymptotic behavior is difficult to ascertain. A simple upper bound on the probability of error is needed and will be derived from the following theorem [K-l], for the case considered in Sec. 3.1. THEOREM 3.2.1. 
Let P: and P; be the prior probabilities for the pattern classes, ml and wz, respectively, and let gi(-) be the density function for the sequence of state and sojourn time observations, (xk,tk) = (x1,...,xk,t1,...,tk) when pattern class 41 mi is active, 1 = 1,2. Then, Pe is bounded by the Bhattacharyya distance as follows: %min(P‘iP;)p2 s Pe s (PiP‘z’fi 9 (3.2.1) where p is Bhattacharyya distance and is defined by k P 21* $103.1: )sz(xk.tk) dxkdtk (3.2.2) ARXEO 3”)k To apply this bound to the problem being considered, 31(-) and g2(-) must first be determined. Since the 2-state continuous- parameter Markov chain is being considered, the values of (x2,...,xk) are known w.p.1 as soon as x1 is known. Substituting r 1 i = l,2,...,k-1 in (2.3.7), the desired densities xi’x i+1 are determined and are given as: 31(xk.tk) = (QIq2)k/2[:£I: emqltn"1 {(12:21 as<1-:c2i_1>a<2-x211 ]P‘1’<1> + (q1q2)k/2[::§: e-qthZi-1 eqltsib(1-x21)6(2-x21_1)] Pq(2) (3.2.3) 82(Xk,tk) = (p1p2)k/2[:§: e-pltZi'l e-p2t816(1-x21-1)6(2-x21)] P32) + (p1p2)k/2[:1: e plt“ e-p2t2i'15(1-x21)6(2-x21_1) P;(2)] (3.2.4) Here, 6(-) is the impulse function. Then, kl!» ”2 °° '(qulnzi 1 '(qzz 2)t2i = o o - d . P (qlqulpz) P1(1)P2(1) II I e dt21-1 x {g e t21 rd 0 9 +9 Ca +91 0 0 kn” '( 222)t2i 1 ° " 12 ”21 + - P1(2)P2(2)121 e dt21_1£ e «1121 (3.2.5) 42 Performing the integrals in the above, the result follows. k/Z _ k/4 o 0 4 p - (qlqulpz) P1(1)P2(1)[ ((114131ansz 0 o [ 4 k/Z + P1(1)P2(2) (92"?2) ((114131) ] (3.2.6) Then, the lower and upper bounds for probability of error are given in terms of p, Pi, P0 by (3-2-1)- 2 o O 1/2 Since (Ple) s% P° 1, P; 6 [0,1], each term in the upper bound approaches zero as k tends to infinity so P; approaches zero. 3.3 PROBABILITY OF ERROR FOR LARGE SAMPUE SIZES In this section, asymptotic probability of error expressions are derived in the case of large sample sizes. In Sec. 3.3.1, the 2-pattern class problem with 2-state continuous-time Markov chains is investigated when both Q-matrices are known while results are generalized to N-state case in Sec. 3.3.2. 3.3.1 N = 2, M = 2, AND KNOWN Q-MATRICES Let the Q-uatrices for the two-pattern classes be given as in Sec. 3.1. The formulas necessary to calculate the total proba- bility of error are given in (3.1.6), (3.1.9) and (3.1.10). For large k, Z and Z defined in (3.1.10) are asymptotically 1 2 normally distributed. The means mi and variances oi, i I 1,2 are defined for ml and w2 as follows: m, = -E- , o? 8 -E-' under w , i = 1,2 1 2q 1 2 l i 2q1 k 2 ’ I: mi = EST , Oi = .55. under wz, 1 1,2 43 Then, 2, defined in (3.1.10) is asymptotically normal with mean mm and variance 0: defined by i i c11-P1 <12-P2 2 2 P(QI'P1)2 ((12-132)? mb = q +' q ; o¢ =-E +. 2 under ml 1 1 2 1 _ q1 (‘2 . r 2 21 q -P q -P (q -P ) (q -P ) 1-1 - 1 1 2 2 . 2 _ 1 ..1._1_ q” — p +- p , gm — k ,2 4--—3L7?L—- under wz 2 1 2 2 P P - 1 2 .. Thus, the total probability of error for large k is I given by the following integral formula: * o O Q * o m * = + Pe P1[P1(l) 1; fz(z‘wl)dz P1(2) 1; f (z\m1)dz] 1 2 0 O 2 * 0 P2 * 1 + P2 [:Pz(l) l: fz (z‘w1)dz + P2(2) .m fz(z‘u)2)dz] (3.3. ) * where “1 and n2 are as defined in (3.1.10) and fz(z‘wi), i = 1,2, is a limiting density function of the random‘variable 2. Performing the integrations in (3.3.1), the asymptotic proba- * bility of error, Pe is obtained. 
nl-m flz-m * _ o u)1 0 ml Pe - P1 PC;(1)[1 - P(fk 01 )] + P1(2)[1 - 2(fk 01 )] o o 111-me o “Z-mmz + P2[:P2(2)<§(fk °2 ) + P2(2)<§(/k 02 )] - (3.3.2) 2 2" 2 2 (q -P) (q- ) (q -P) (q -P) where 0292 1 1 + 2p2 3 2“ —i-L-+-—Z'—2— 3 . 1 2 2 2 2 2 q1 q2 p1 p2 x 2 1009-11—1- e-t ’2 dt . /2n 44 * To show that Pe decreases toward zero as k increases without bound, let ql > p1, q2 > p2. Then, it is clear from (3.1.9) that m9 lim’n1= 1min2 =an1:2>o k—«oo k—cao P1 2 A little algebra shows that m < < m 0"1 TL” “’2 “1..me “2..me Thus, lim Q(,/k ) = lim 9 A = 1 k—ooo 0]. k—noo 0’1 nz'mbz n2-mb2 limQQ/‘k )=lim§(,/k———)=O. k...» km 0 So, from (3.3.1), it follows that * P-+0 as k—oco. e 3.3.2 N > 2, M = 2 AND KNOWN Q-MATRICES In this section, the asymptotic probability of error is derived for the N-state case, (N > 2), assuming that there are two pattern classes and that the Q-matrices correSponding to the pattern classes are given. Let the Q-natrices and the density functions for the observa- k k tion sequence (x ,t ) be given as N N x, N N q ij Q =[ exP{-(pi-qi)zi}] P° N N p.. q."13 x II n {—1-}- -1-) . (3.3.5) P°1(x 1) i=1 j= -1 pi qij ji‘i Using the "0-1" loss function, the optimal decision rule is given as in (3.1.4), where A P° 2/k n = Ln(-;) and (3-3-6) P 2 L g imtn A N pi N N N p Pp(x ) =‘11;[2K.Ln(—)+ 2(q1-Pi)2 + 2 NijLn(—:‘,l°q—q—) +;Lno 2 1 ] i=1 1 i=1 i=1 j-1 pi ij P° 1(x1) jii N For large k, Ki 5 Z N. , then, =1 ij J 1 N N pi. 1 N 1 #23011) L = '1: Z 2 Ni' Ln “—1 + '1: 2 Zi(qi-pi) + Eta o . (3.3.7) i=1 j=1 J qij i=1 1110(1) ja‘i N N Since k is fixed and k = 2 z Nij’ the Nij's are i-l j-l js‘i stochastically dependent random variables. Bartlett [8-1] proved that the asymptotic distribution of Nij's are normal with expected values mij = kPirij, where 46 g 11.1 . d rij qi (j # 1) un er wl 3.3.8 é'Eli (' # i) nder ( ) .. p J u (”2 i and the Pi's are the stationary probabilities of the corresponding jump chain. Using Bartlett's method, the covariance matrix V = KVO, where V = {cov(nu ,n 1, can be determined by the expansion N O r tq)}u,r,t,q= of Ln ”1(2) up to second degree in 913’ where p1(§) is the root such that ”1(0) = 1, of the matrix 9.. N R(fi) =[r.. e 1]] . 13 1.331 Then the matrix Vo which is independent of k is given by the quadratic expression %h§?Vo§' in which 'g stands for the column vector with components (g1,§2,...,§u) where £1 - (911’612""’91N)' As a result of the above argument, the first term in (3.3.7), N N p. n A l- 2 2 N. Ln ‘L1, is asymptotically normally distributed k 1j q. i=1 j=1 1j ji‘i 2 according to N(mn, on), where N N P mm = z 2 Piri, Ln JJ- (3.3.9) i=1 j=l J qij o2 N P P 2 2 o = k2 ; 00 Q 2 (Ln -2£)(Ln -£g) Cov(n r,nt ) (3.3.10) “ u,r,t,q ur qtq “ q ufr,t¢q 1 N To determine the distribution of 2 Q E: 2 Zi(q -p ), the 1.1 i 1 second term in (3.3.7), the distribution of the 21's is first studied. The random variable Zi was defined by (3.1.12) and (3.1.13) as K. A 1 Z r 2 t. i G A (3.3.11) 47 N Because of the linear relation k 8 2 K1, the Ki's are i=1 dependent random variables. Also, the Ki's are asymptotically normally distributed as k tends to infinity. Means and covariance of the Ki's are given by Good [G-l]. They are as follows: E(Ki) = k Pi i e A (3.3.12) Cov(Ki,Kj) = k xij i,j E A (3.3.13) Here, the Pi's are the stationary probabilities of the chain and 1,. is defined as 1J 1 -1 -1 i]. {AijPi - P1P], + Pi[S(I-S) 11,1 + Pj[S(I-S) 11,1}. (3.3.14) The matrix S is defined in terms of the equation given by R = g_§F +13 T , where R — [rij]’-£ — [p1,p2,...,pN] and g. 
correSponds to left T T and right eigenvectors, reSpectively; i.e., _I_’ R I E and R g = 3. Now, define ll> ~< H xflhi It“)?! j-1 tij(qi-pi) V'i E A (3.3.15) Since the Ki's are dependent random variables, the yi's are also dependent random variables. For small k, the distributions of the yi's are too involved to compute. However, it can be shown that as k tends to infinity, their distributions approach the normal with probability one (w.p.1). Note that K P{1im E3- = P.} w.p.1. (3.3.16) 1.... 1 48 So, for large R, R1 ~ k Pi w.p.1. Substituting this into (3.3.15), it follows that kP, 1 1 Y1 - k iEltijmi-Pi) (3.3.17) Since the tij's are i.i.d., exponentially distributed random variables, when k is large enough, using the Central limit theorem, it can be seen that the yi's are independent and asymptotically normal according to N(m1,oi) where 2 _ Pi(qi-Pi) . 2 _ Pi(qi-Pi) i kq i 2 m = 31:31:311 . 02 _ Pi(qi'pi) under w s ’ . — ‘2’ 1 pi 1 kg. 2 1 N . Also, 2 = 2 y is asymptotically normally distributed i=1 according to N(m ,o ) under m and N(m ,02 ) under w , z z 1 z z 2 l l 2 2 where N P ( - ) N P ( - )2 i qi Pi 2 1 i qi pi 1 i=1 ql 1 i=1 qi m = N Pi(qi-Pi) . 02 = l. g Pi(qi pi) (3 3 19) 22 i=1 Pi 22 k i=1 p: Q As a result, 2 n + z is asymptotically normally dis- tributed according to N(mz ,0: ), i = 1,2. Where i i _ . 2 _ 2 2 mz1 — 11121 + m.r11 , 021 - azl +-an1 under ml (3.3.20) mz = m + m ; 02 = 0'2 + 0'2 under (D2. (3.3.21) 2 22 n2 22 22 n2 49 In the above, mn and a: , i = 1,2, are expressed in i 1 terms of m.n and 0: defined in (3.3.9) and (3.3.10) as follows: q . — ° 2 - 2 -- —1Ii . ' mn - mh , on — on when rij — q. 1,j E A (j # 1) 1 l 1 m = ° 2 = 2 when r = 331 i E A ( i i) The total probability of error, then, can be calculated using equations (3.16), (3.19) and (3.1.10). The result is as follows: ,, 1‘1'“1 *21 T‘2'““z1 P8 = P1(l) 1 - P(fk o ) + P1(2) 1 - i(/1c 0* ) Z1 Z1 0 “Fuzz T‘2’"‘22 + P2 P2(1) @(fko ) + P2 (2) :(fk* 2) (3.3.22) c’z Z2 2 W(i) where 0:. A,/k Uzi , n1- , i = 1,2. 1 P: P2(i) Note that ha: 111 = 0, =.1,2 Ram It is shown that for a Specific case in which qi‘1 > pij’ V i,j E A (j f i), the total probability of error decreases toward zero as k tends to infinity. The following lemma is first introduced and its proof is given in Appendix E. LEMMA 3.3.2 qij > pij = -m < mi < O < mi < +m From Lemma 3.3.2 and the property of Q(.), the following limits hold: L111.» '1 I 1‘1"mzl T‘Z-mzl lim §(/k T) = lim 9(fk T) = 1 k—m 0' k—-m 0 z1 z1 Tll-mz 1‘12- 22 11m @(/k _,_2.) = lim @(fl —;——) = o 1?“ Oz k—cco Oz 2 2 Thus, from (3.3.20), it follows that * lim P = 00 e k—m 3.4 RECURRENT EXPRESSIONS FOR THE PROBABILITY OF ERROR In two-pattern class problems, the total probability of error can always be expressed in terms of the probability of error of the first kind (the probability of false alarm), go, the second kind (the probability of a false dismissal), 50m and the prior 0 probabilities, P3, P2. 0 O = + 3.4.1 where do and Bo are defined as follows: 0.0 ‘3 P[d*(xk,tk) = “”219 = 1], Bo 9 P[d*(xk,tk) = mlle = 2] (3.4.2) From (3.1.4), they can be written in another form as Q A - l 3 4 3 aO-P[I.K>n\e=0],eo-PELK 1119 = 105’1 (3.4.4) 6,00 P[d*(xk.tk) - 2.131 MK < me = 2,151] (3.4.5) I e fiL. (D n It is clear that the usual error probabilities, “0’50 are related to the conditional error probabilities by the follow- ing expressions: 010 = 0,00.) 
, so = 6000 (“-6) With the help of the conditional error probabilities and the total probability law, a0 and so may be obtained in terms of the follow- ing expressions: a0 =2 J” aL(k>g1dc" (3.4.7) 4 L x t Bo=z j‘ BL(k)g2(xL,t‘)dt’° (3.4.8) x". c" where XL E A&, tL 6 [O,m)L, L = O,1,2,...,k. The conditional probabilities of error can be generated iteratively. This can be achieved writing gi(xL,tL), i - 1,2, as: g.(xL,tL) = fi(xL XL-1,CL-1)fi(CL‘XL,tL-1)g.(it-1,6L-1) (3.4.9) 1 1 Substituting in (3.4.7): 52 = L"1tL"1 L-1tL-l L'l a0 2 f,_1 g1(x ) 2 It gc(k)f1(xL‘x ,t )f1(cL|x%H)dtL dt L-l t Lt X X Defining 4- 1 4- -1 4 4- -1 . 4,, 10031,: J“, Maw (x J" ,t )f (t \ )d 4’ 4, l,...,k (3.4.10) the desired iterative form for qL(k) is obtained. Similarly, for B£(k): ‘xL tL'1)dtL, L = 1,...,k. (3.4.11) = L—l L-l BL_1(k) :4 {L BL(k)f2(xL‘x ,t )£2(th X From (3.4.4) and (3.4.5), it follows that these probabilities of error must satisfy the following boundary conditions when L = k. ak(k) = U(Lk - n). Bk(k) = 1 - u(Lk - n) (3.4.12) Here, u(.) is the unit step function. In terms of the recurrent relations (3.4.10), (3.4.11) and of the boundary conditions, (3.4.12), it is possible to obtain Pe via a0 and BO. Un- fortunately, except for some simple cases, an analytical form for the total probability of error is almost impossible to obtain. How- ever, because of the iterative character of the method, it is more convenient for the computer implementation than the work in Sec. 3.3. 3.5 CONCLUSIONS The aim of this chapter has been to derive expressions for the total probability of error and to study their asymptotic pro- perties for the problem considered in Chapter II. Because of the tremendous complexity involved in computations with unknown para- meters, attempts have been made only for the known parameters case 53 with two pattern classes. The exact analytical solutions were obtained in Sec. 3.1 for the total probability of error assuming two pattern classes with 2-state chains and known.Q-matrices while lower and upper bounds were given in Sec. 3.2. It was shown that the probability of error decreases toward zero as the number of observations increases without bound. Section 3.3 derives the asymptotic proba- bility of errors for large sample sizes. Proving the asymptotic normalities of the underlying distributions, asymptotic error expressions were obtained for the case of two pattern classes, N-states and known Q-matrices, and, in terms of the Lemma 3.3.2, their convergence to zero as the number of observations tends to infinity was proven. Finally, In Sec. 3.4, the conditional probabilities of error of the first and second kind were defined and it was shown that the total probability of error can be computed in terms of these conditional error probabilities and of the prior probabilities. A method was also given to generate them in an iterative fashion. CHAPTER IV DECISION MAKING AND LEARNING WITH UNOBSERVABLE STATES AND OBSERVABIE TRANSITION TIMES In this chapter, optimal decision making with unreliable observations on finite-state, continuous-parameter Markov systems is considered. The states of the systems cannot be observed directly, but the exact times at which transitions from one state to another occur can be observed. The decision problem itself is the same as that in Chapter II. The model by which observations are generated is defined in Sec. 4.1. In Sec. 4.2, the optimal decision rule is established and is generated iteratively for the case when all parameters in the model are assumed known. 
In Sec. 4.3, the results of Sec. 4.2 are applied to two Specific cases. Section 4.4 deals with various aSpeCtS of decision making for continuous-parameter Markov systems when the model is not completely specified. The main problem of Sec. 4.4 is to extract some information about the unknowns of the model from the observations and use this information to construct an optimum-adaptive decision rule which performs almost as well (with reSpect to a well-defined criteria) as the optimal decision rule of Sec. 4.3. Finally, Sec. 4.5 summarizes the main results of the chapter. 54 55 4.1 MATHEMATICA1.MODEL FOR GENERATING OBSERVATIONS The basic model considered in this chapter is similar to that in Chapter II. Each of the M-pattern classes is defined by a finite-state, continuous-parameter Markov chain. The Q-matrices for the chains are parameters of the problem. The states of the chains cannot be observed directly, but each state is characterized by an observable random process. For instance, a noise process can be added to each state. The random variables describing the observa- tions during a given time interval have joint distributions which depend on the state of the continuous-parameter Markov chain during that time interval. Appendix A shows that almost all sample functions of a finite-state, continuous-parameter Markov chain, {x O s t < a}, t’ which satisfies certain conditions are step functions, and that the process can be uniquely determined by observing any of the sample functions in the interval 0 s t < w. Observing the sample function on [0,m) is equivalent to observing the sequences of random vari- ables {xk}:, {tk}: correSponding to the state numbers and the sojourn times. It is also shown in Appendix A that {xk): is a discrete-time Markov chain (jump chain) with the transition matrix [rij] defined in (2.2.1) and that the sojourn time random variables {tk}: are state-conditionally independent and have exponential dis- tributions with parameters {qi}q. In this chapter, the random variables {xk}: cannot be observed directly, but it is known that when Au = i, a sample function of a random process xi(t) can be observed. The random variables {tk}: can be observed. The main assumption on the 56 random processes {xi(t)}§, which describe the states of the chains generated by each pattern class, is that the statistical properties of the random processes {xi(t)}§ are known and are the same for all pattern classes. This simplifies the formulation. There are many possible sampling schemes for defining the observation process. Here, it is assumed that time samples x(t..) = x 1] ij’ i = 1,2,...; j = 1’2"°°’ni’ are taken during the interval [pi-l’pi) where pi_1 and p11 are the successive jump A points of the process at which transitions occur and po = 0. Another sampling scheme is discussed in Sec. 4.3. The observation process is then defined by the sequence of k random variables {§_,tk}: where tk é (t1....,tk), ti A pi - pi_1: and 5% Q (£1,xz,...,xk); ii correSponds to the iEE Observed vector T £1 = [xil,...,xin ]. Sampling times t are determined as follows: i 13 The first sample xi1 IS always taken at time t11 = pi-l’ just after a transition occurs. The following samples (xi2""’xin ) i are taken at times t. = 12 pi-1+T,ooo,ti u1 = 91-1 + (“i'DT’ reSpectively, where T > O is the time interval between two samples. 
From the above, it is clear that the number of samples n1 is a random variable which depends on ti and is determined by ni = [ti/T], in which the expression [g] is defined for any real number § 2 O as the largest integer less than or equal to g. The general model with the sampling scheme defined above is illustrated in Fig. 4.1.1. The :andom'vector x: = [xkl’xk2""’xknk] defined above takes values in R k and has a nk-dimensional density function denoted by fi('\9) when it is known that pattern class 9 is active and that the correSponding Markov chain is in state i 13/. Ch. mxnnmnnon o N H . rIIIII////q\ Mmmncnm mwm. b.H.HH arm nmomnmp Somme mod arm mmaowwsm mnrmam 58 during the interval [pi 1,pi). That is, fi(.|e) is the conditional density of observations 5k given that chain 9 is in state i. By assumption, fi(§k‘e) is independent of e for any 5k and any k and will be denoted by fi(§k). Since the states of the underlying continuous-parameter Markov chains are not observed, 5k has the following global density when pattern class 9 is active: N f(gkle) = 2 fi(§k)P(1k = its) 9 e a (4.1.1) i=1 This density is a finite-mixture with component densities {f,(.)}§ and mixing parameters {P(xk = i‘e)}§ where 1 N E P(xk = i‘e) = 1. Another simplifying assumption is that the i=1 . . km .. random variables in the sequence {5 }l are state-conditionally independent. That is, fl(£1,£2,...,£k‘>\l = 12)).2 = j:'°-3Ak = L39) = f1(£1)fj(¥-2)..OfL(¥-k) (4.1.2) As in Chapter II, a single, continuous-parameter Markov k k chain is active to produce (§_,t ) during the entire observation interval [0,T], T < m. A decision about the identity of that chain is to be made. 4.2 ITERATIVE GENERATION OF THE OPTIMUM DECISION RULE The model is completely Specified when the followingoquantities are known: The compoennt densities, {fi(')}§’ the infinitesimal parameters, {q:§)}q, and the initial probabilities, {P(x1=i|e)f: over the states for all pattern classes 9 6 @. Decision theory then provides the optimal strategies for processing the observations. 59 * k k An optimum, non-randomized decision rule d (x_,t ) can be chosen as in Appendix B. In particular, when the loss function is given by L(i,j) = l - Aij (O-l loss function), then the optimum decision rule becomes a minimum probability of error rule defined by l d*(£k’tk) = S if P(9=S|§.k’tk) {max {P(e=%\ak.tk)}] :1 ’ LEO L=s or if Pzgs(xk,tk) = [max {PZgL()_<_k,tk)}] (4.2.1) LEO L=s The rule above can be written iteratively. For simplicity, the subscript L denoting pattern class will be drOpped. From the Markov property k k k tk- 1)f k- l tk- 1 k-l k-l 8(§ :t ) = f(tk‘x mk‘ )g(§. ,t ) (4.2.2) k-l k-l , , The last factor, g(§_ ,t ), is available from the pre- k-l tk- l vious step of the iteration scheme. The middle factor, f(x #‘ ,t ), can be written in an iterative manner as follows. By the total probability law, k-l k-l k- l tk- 1 N _. k- l k- 1 (eke .t )— iEIf(§_ka-1:_ WM?"— .c > (4.2.3) where Assumption (4.1.2) implies f(xk‘xke‘i " ,c ) = £,(§k) (4.2.4) The second factor in (4.2.3) will also be required in the first part of (4.2.2) and is computed below. This factor, P(xk= i‘xk 1,t k 1), can be obtained iteratively as follows: 60 N k- 1 tk- l k- 1 tk- l , k- 1 k-l P(kk=i‘_ ) = 2P()\k=i‘xk_1=xja )P()\k_1=J‘x ,t ) j#i (4.2.5) From (4.2.1), Xk- 1 tk- l . . P(kk? 
_1hk_ 1:] j at )= rji 1f J # i = o if j = 1 (4.2.6) Using Bayes rule for the second factor in summation (4.2.5), ‘ k- 1 k 2 P1 k- 1 k- 2 k 1=j ) - N k- 1 k- 2 k- 1 k- 2 j§1f(tk- 1l1k_1=1 x )P(1k_1=1|x .t > (4.2.7) where, from the state-conditionality independence assumption _. k-l k-l _. _ f(tk_1\xk_1-J.§. .t ) = f(tk_1\xk_1-J) - qj exp{-qjtk_1} (4-2.8> and k- 2 k- 2 f gc )P<1=j|x .c ) P(xk 1_ _.‘Xk 1, k 2) = N -k- 1 k- 1 k (4.2.9) k- 2 2 m_1 Then, putting (4.2.6) and (4.2.7) into (4.2.5), k- 2 k- 2 k- 1 k- 1 N f(tk-I‘Xk-1=j)fj (Xk-1)P(xk- 1=jlx ) P(x =i\_ ) = z r . k j=1 31 N f =m)f p ‘ k- 2 1c-2 j#i mE1 (tk-l‘xk-l m(Xk-1) ("k-1:“?x ’t ) (4. 2.10) This is a "predictive" factor since it used the present and past samples to make decisions about future states. Substituting this in (4.2.3) will produce the middle factor of (4.2.2). The first factor in (4. 2.2), f(tk|xk,t k 1), can be written in an iterative form as follows: By the total probability law and the Bayes rule, 61 N k k-l X1. -1 , k k-l f(tk\x,t )= iz_; f(t kl1k=i ._ ,t )P()(k=1\x,t ) (4.2.11) where , k k-l . . f(tkhk=1”i ,t ) = f(tkh‘fl) = qi exp{-qitk} (4.2.12) k 1-1 fi<1k>mk =1|_“ 1.1:“ 1) P(xk=i\§ ,t ) = (4.2.13) N k=1t-k 1 2: f mr<1k=mlx ,t > m=1 Substituting (4.2.12) and (4.2.13) in (4.2.11), 1(— - k tk- 1 N fi()ik)P(>\k=i‘§ 1,tk 1) f(t \x ,c )= )3 f(t \1 =1) (4.2.14) k =1 k k N k-l k-l mzlfm (x k)P(x k=m\x ) k-l k-l Here, again, {P(Xk=i1£ ,t )]§=1 can be obtained iteratively from (4.2.2). The starting procedure for the iterative scheme is given below. When k = 1, observe (x1,t1) and compute 1 1 8(1 .t ) = f(t1|§1)f(§1) (4.2.15) where N o A f(£1)= 23 f. (x 1:)1’ . Pi - P()(1 =- 1) i=1 N f (x )P: _ _. i -1 f(tllil) " 1211301141“) N o E ftn(}'('1)Pm m-l Next, compute the predictive probabilities, {P(12=i‘x1,t1)}1=1. N f(t I1 =j)f (x )P° 1 1 _ P(12=i\§.t)=2rjiN 11 111 ,1- j l _ 0 j mEIf(t1h1m)fm(£1)Pm 62 When k = 2, observe (x2,t2), compute 2 2 2 1 1 l 1 1 f(>_<. .t ) = f(t2\1<_ .t >f<>12|§ .t >134; ,t) 1 1 . . where f(x ,t ) 18 obta1ned from the above, and N f<§Q\£}.t1) = 2 fi(§2)P(12=i|£}.tl) £1(§2>P<12=i|§1.t1> f(t2\§?’t1) = f“2112:” N 1 1 malfmczgpmfmtz .12 ) 1 "(‘12 H 2 N Again, {P(x3=i\§?,t )}i-1 must be computed for the next step. This procedure can be repeated up to time k. The flow- diagram 4.2.1 shows how gs(xk,tk) can be determined in an iterative fashion. 4.3 APPLICATION OF THE MODEL TO SPECIAL CASES The model defined in Sec. 4.1 is quite general and applicable to problems encountered in communication theory, pattern recogni- tion and operations research. The major difficulty in implementing the Optimal decision rule is that even if all information about the random processes {xi(t)}1:=1 is known, determining the n%h order density of 5i = (xil’xi2’°'°’xini) can be very complex and, sometimes, impossible. Some ways of getting around this problem are presented below. 1. Take only one sample from each time interval [p1_1,pi). The random vector 5i is then reduced to a single random'variable, xi. Sampling points may be taken at times pi_1 when transitions occur. Fig. 4.3.2 shows the sampling scheme for a particular sample function. 63 Observe ] (5151) ‘ 4k Compute g(x1,t1) | Eq. (4.2.15) —“L‘; 1 = 1 Predict Next Step Com ut {P(Xj+1=1‘x ’t )}i=1 7 l I I Observe I (£j+1:t1+1) ' .12 _____.L__. .____ilr Compute Compute N . N A {f1(§j+1)}1=1 {f(tj+l‘)‘j+l=1}i= (Known) Eq. (4-2-8) ‘71 L1 L 4:12.: Compute Compute Compute 1 f k YES Eq. (4.2.2) Fig. 4.2.1: An Algorithm for Iterative Implementation of g(xF,tk). 
l Obta in 3(xk,tk) x3 6 t1 e: t2 4?. t3 —: > t po p1 p2 P3 Fig. 4.3.2: A Sampling Scheme 2. Choose processes for which the joint distribution of {5k}:=1 can be simply described. In the following developments, the iterative results obtained for optimal decision rule are applied to the case when the random processes {xi(t)}§=1 defining the states of the continuous-para- meter Markov chains, are wide-sense stationary normal processes. In this case, the component densities {fi(°)}§-1 for observation vector Ek’ vectors E1 and covariance matrices 21 are assumed known. They can be computed from the mean function and auto-correlation function a vector of order ni, have known forms. The mean of the correSponding process as follows: 91 = niI, where I is a unit vector of order ni and Cov(x = Ri(t - t ku’xkv) ku 1G,) 2 u,v = l,2,...,ni. The joint density fi(°) may be written in the following form and is the same for all pattern classes. NIH -1 fi(§k) = n /2 exp{- (5k - 111(1)T :1 (5k - 111(1)} (4.3.1) 65 The iterative, optimal decision rule in Sec. 4.2 can be rewritten as * k k d (g ,tk) = s if P:g8()_<_k,t ) > PEfL(§k,tk) 2?. 9‘ s (4.3-2) where, from (4.22), the density required has the form k k xk- 1 tk- 1 Omitting, again, supercript and subscript L, Jfi_ and FiL are defined as: 1 N A i N 1 N Jkoik’tk’U-‘k-lh) ’ Jkl(§k’{Pk-1}1)Jk2(x ’tk’{rk-1}1) “'3'“ A k-1tk-1xk k- 1 Jkl f(x “-1.1 ), Jk2= Akf(t ,t ) (4.3.5) t)k Pk =Fk(Xk:t k’{r.k- 131): A(Ak+11xk :t (ll-3:6) . . i . The fOIIOW1ng expre551ons for the Pk, Jk1 and sz are der1ved using the iterative equations in Sec. 4.2. N . r ' 1 T -1 121 Fi‘l “jib 1/2 exp“ 501W?) 21 (’inI) ' “1‘13 Pi ___ iii (2") ‘3‘ k N I‘m m - I I t #21 k-l n,/2 1/2 exp{' 2(xkfum )T2m1(§k-um ) - qm k} <2“) 1 12ml. .. N 1 _1_ -1 Jk1 " 121 Fk- 1 n /2 1/2 expi' flak-ping (1151411)} (211) |Zi| N . q. 1 1 1 T -1 121 rk-l n./2 1,2 “91‘ f(ik'PiI) 21 (’ifl‘il) ' qitk} (211) 1 lzd J = k2 N m 1 1 T -l 2: I‘k-1 n [2 exP{' 2 (’ik‘uml) 3m (ik‘uml) 66 Note that {qi}1 and {rij}§,j=1 depend on the pattern #1 class, but the mean vectors and covariance matrices are the same . . . i for all pattern classes. The initial values of Pk, Jk1 and sz can be easily determined by putting k = l in the above expressions 1 and noting that F = P?. o i The optimal decision rule can also be established for the case when the random processes {xi(t)1§=1’ describing the states of the chains, are Gauss-Markov processes. Then, the transition probabilities and initial distributions for a given observation vector 5k may be written in the form: 1 _[xk +1 (Xk “95’le piock .+1‘xk .) = 2 exp 412 g 1 (4.3.7) ’1 ’3 2n(1-Ri)oi 201(1 - R1) (x - u.) 1 P(i)(xk 1) = exp - k’1 2 1 (4.3.8) ’ 21101 201 Note that the double subscript notation, x was used to 191’ denote xkj for convenience. Here, pi, Oi and R1 are the correSponding mean, variance and correlation functions of the process xi(t) where R1 = exp{-yi‘¢l}; T is the sampling interval, —- Y1 is the correlation time. The JOlnt density of Ek = (xk1,...,xknk) is determined in terms of (4.3.7) and (4.3.8) as follows: 0 nk-l g1(-k) _ P1(Xk,1) 121 pi(xk,j+1‘xk,_]) nk-l 2 [x (x - )R ] 1>°i(xk 1) E1 k,j+l k,j 1‘ 1 ““1 - T’— 8*? J i 67 n-l k 2 2 2 where Ai [’3 [211(1 - Ri)oi] . Then, the necessary quantities, Flt, Jlk’ J2k defined in (4.3.4), (4.3.5), (4.3.6), determining the iterative optimal decision rule, are computed. The expressions are given below. 
1 Pk = nk-l 2 N r..q.p‘.’(x ) 31 [xk.r+1'(xk r-ijj " ”1] J J1 .1 .1 k:1 r _ 2 1"1-1 A P ‘ 2 2 qjtk 1,1 nk-l 2 0 2 x '( '1») =1» 21 I‘m qum(Xk’1) p r"1 k,r+l k,r m m 111 q t k-l A - 2 2 - m k m=1 m 2qm(1 - Rm) nk-l 2 R " l N 1 p:(xk 1) r21 [xk,r+l (xk rqli) i 1&1 Jkl = Z rk-l A ex" ' 2 2 i=1 i 201(1 - R ) sz — nk-l 2 N i qip:(xk 1) 131 [xk,r+l (xk r u'1)Ri - H'1:l 2Fk_1 A L em) 2 2 -qik i=1 20.(1 - R ) 1 i nk-l 2 N m p0 (x ) E [xk,r+l (xk r M'm) m - p'm.J z r‘ In k 1 r—l =1 k-l ———X—L—'exp - 2 2 In m 20 (1 ‘ R ) m m The initial values of Flt, Jk1 and sz are given as followa: N’ o o I’ t , , x ex - ,t 121 j leJ( 1.1)qj p{ (11 1} i _ 1in . 1 " N ’ P0 ° - mil mpm (xl,l)qm exp{ qmtl} 67 n -l k 2 where A1 9 [2n(l - Ri)ci] . Then, the necessary quantities, Ti, Jlk’ J2k defined in (4.3.4), (4.3.5), (4.3.6), determining the iterative optimal decision rule, are computed. The expressions are given below. 1 Pk = 0 21h -(X m )R -u]2 N ..Q.P.(X1)e k,r+l k,r j J j J Jl l l ”k r-1 2 Fk-l A 2 2 - thk j=1 j 203 (1 - Rj) 1941 nk-l 2 0 2‘. x 'X( -p. )R -u N rm qmpm(xkll) r— _1 k, r+l k,r m m m _ t 21 k-l A e - 2 1 R2 qm k m m qm( " up nk-l N 1 9:05:11) r21 [xk “‘1 (ka-u )R - p‘ 3 Jkl = z Fk-l A "p ' 1=l i 201(1 ‘ R ) sz _ nk-l 2 N . q.po(x ) E [xk,r+1 - (xk,r-ui)Ri - H'i:I 1 1 1 k,l r-1 ’3 rkq A exp 2 2 ' qitk i=1 20.0. - R ) 1 i nk-l 2 Z Fm p;(xk 1) r21 [xk,r+l - (Xk,r-um>Rm.- p'm] _ k-l ---L- exp - ““1 A 2 2(1 - R2) m cm m The initial values of Pi, Jk1 and sz are given as follows: N e - ,t 121 Pjt j. pj(x1 1)qj xp{ <1] 1} 1 _ fii . F1 -' N o , P° - Z P (X 1, 1) qm EXP{ qmt 1} m=l m.m x1,1 4.4 ADAPTIVE DECISION MAKING Adaptive decision making schemes are now studied when the underlying Markov model is not completely Specified. The unknOWns of the model are the Q-matrices, QS = [q§:)]§,j=1, qis) g Qii), S 6 ®. The parameters and the properties of the random processes, {xi(t)}§, describing the states of the underlying continuous- parameter Markov chains are assumed to be known. The component densities, {fi(.‘e)fi, defined in Sec. 4.1, are, then, completely known. Furthermore, for simplicity, they are again chosen to be the same for each pattern class. That is, fi(1<_k>= fiekle) v £1. (4.4.1) A prior distribution summarizing initial knowledge about the unknown infinitesimal parameters is available for each pattern class. Using a supervised learning scheme, posterior distributions for the fixed, but unknown, parameters are "learned" from training samples (2:8, T:S). The training data are employed to form posterior densities for the parameters which, in turn, lead to an adaptive decision rule which makes decisions with minimum error probability. The purpose of Sec. 4.4.1 is to derive the optimum-adaptive decision rule using the Supervised learning scheme. In Sec. 4.2.2, components of the optimum-adaptive decision rule are generated iteratively and 69 the structure and computational feasibility of the iterative scheme are discussed. The learning feature of the rule is discussed in 4 . 1+ . 1 OPTIMAL-ADAPTIVE DECISION RULE The optimum-adaptive decision rule will be derived for the problem defined above. The following definitions introduce the problem. 0 I I k The non random1zed dec181on rule k ,t ), given the C k k j C . 
observation sequence (x ,t ), given all tra1n1ng samples n n n A (BUT) = {(111,T11),...,(1MM,TM )} and given Q-matrices dgd(x n 9, = {Ql’Q2’°°"QM}’ has expected loss M k R[d\ (fl :tk):(Y:T) 9Q] = 2 L[d(xk:tk):i]P[e=i‘(ngtk)9(Y:T) ’9‘] (40402) i=1 Equation (4.4.2) is referred to as the sample-parameter- 2 conditional risk . The parameter-conditional, adaptive-Bayes risk is defined as: ra,mf<§k,tk\(m), >42k dck (4.4.3) k 0 where P0 = (P3,P2,...,PO) M is the prior distribution over the para- meter Space and k k M o k k n. n, {[5 ,t [(y.'r),Q] = 2: Pigi[x ,t “111,35 ’91] (4.4.4) i=1 Thus, the adaptive-Bayes risk is given by O O ra(P .d) = jg ra(P ,d\g)fo(9)dg (4.4.5) '2 Signori [8-2], pg. 14. 70 This development, which involves the use of conditional risks, differs from that given in Appendix B and emphasises the role of prior information in constructing the total risk. * *- The optimum-adaptive decision rule da 9 d (xk,tk) is defined as a non-randomized decision rule which minimizes the adapt ive-Bayes risk. That is, 0 * . o ra(P ’da) = inf ra(P ,d) (4.4.6) dED where, D is the set of all non-randomized decision rules. Note that * * k k ra_<_ .t ).(yLL.TLL>] Leo , k k n n o k k n n or, if pggs(x ,t ‘ySS,TSS) = 2:; 2,3,(x .c \y,‘.¢,‘> (4.4.8) The basic elements of the above decision rule are the posterior densities of the observation sequence, given the learning samples. These densities are obtained by averaging the parameter-conditional densities over the posterior densities as in (4.4.9). n n n n L L a k k L L L L 8,01» ”at ”L ) £24;st .t Ix, .T,’ ’QL)f(QL‘XL .1',’ MD, L 6 6) (4.4.9) Both factors in the integrand above are generated iteratively in the next sect ion. Viv»! 71 4.4.2 ITERATIVE FORM FOR THE DECISION RULE The first factor in the integrand of (4.4.9) can be imple- mented iteratively in exactly the same way as in Sec. 4.2, conditioned “L on the unknown parameter QL. The second factor f(Q ”‘21 TZL) can be implemented iteratively as follows. In the following, the subscript L will be dropped as a convenience; f(Q‘yé,¢n) is given by the Bayes theorem as follows: f(x“ . “0\Q>f (Q) f(Q‘yn,'rn)= (4.4.10) “in ’1' n) where fo(Q) is the prior density function over the Q Space. Using the Markov dependency between the samples, (4.4.10) can be written as - -1Tn-1 m Li‘s“ am \y, " .Q) f(Q\xn,'rn) = n n _1 1““! m1 “-1 mix " 1 “ 1) (4.4.11) f(rn‘z .w min”. .«r ) Equation (4.4.11) is the desired result. The last factor n-1Tn- 1 on the right side of (4. 4. 11), fGQ‘x ), is the density in the Q space at the (k-1)-h'stage and is available from the pre- vious step of the iteration scheme. The first and second factors . n n- -l tn- 1 1n the same numerator, f(rn‘y_,rn1,Q) and f(yn|y ,Q), are the densities directly utilizing the prior knowledge above and can be obtained iteratively as explained in Sec. 4.3. The denominator of (4.4.11) is a normalization constant which assures that f(Q‘lnn'n) integrates over the Q Space to unity. The n n parameter-conditional density,g L(xk ,t k‘yL L ’L&’ ), is the density . k . function for observation (§.:t k) correSponding to pattern class {, and is given by (4.2.2) (where QL was assumed known) as a 72 function of the unknown parameters and the training samples. This term can be generated iteratively in a manner Similar to that of Sec. 4.2, but conditioned on knowledge of the unknown parameters. 
The information about the fixed but unknown parameters {QL}I is n n summarized by f(QL‘XLL’TLé) which is generated iteratively as explained above and is employed in the decision rule by the averaging procedure given in (4.4.9). Even though the iterative expressions are available for both posterior densities in the integrand of (4.4.9), it has one major draw-back. The posterior densities for the parameters are not usually reproducing because the class of densities involved are mixtures. In general, two Serious problems are immediately encountered when trying to implement this decision rule on a computer. The storage problem refers to the difficulty in allocat- ing computer storage locations for the posterior densities in (4.4.9). The computation time problem refers to the difficulty inherent in performing the number of computations required to change the old posterior densities calculated at (k-l)£h'step, into the new posterior densities at REE-step. Since the entries of each matrix Qi are continuous random.variables, in general, the amount of Storage required to store these densities is infinite. The only way to make use of the rule is by quantization of the parameter Space. Then, each parameter taken to be a discrete random variable so only a finite number of values needed to be stored and updated with each observation. By quantizing fine enough, it is possible to get arbitrarily close to the optimum solution at the expense of increased memory. However, the memory 73 grows exponentially with the dimensionality of the parameter vector (1, making the method feasible only for problems with a small number of states. As a simple example, assume that M = 2,‘N = 3; then, there are 2 x 9 = 18 parameters. Using 10 quanta level for each parameter, 1018 storage allocations are required for Storing posterior densities of the parameters for both classes. One way to get around this problem is to use some sub-optimum methods in which consistent estimators for the parameters are found and, in turn, used in the decision rule as if they were the true values of the parameters. 4.4.3 ASYMPTOTIC OPTIMALITY AND ADAPTATION In the previous sub-sections, the decision making problem was studied for the case when the Q-matrices describing the pattern classes were unknown. In that case, the optimum-adaptive decision rule d:(x ,t ) was used instead of the optimum decision rule d*(xk,tk). To discuss the learning capability of the optimum adaptive decision rule, the following definition is given, [R-Z]. * DEFINITION 4.4.1 An optimum-adaptive decision rule dat is said * to be an asymptotically Optimum decision rule d relative to RC = (Qi,Q:,...,Q;), the true value of Q, if and only if . o * o * 11m r(p .da) = r(p .d ) n1, 0 o o ,nM—oaa That is the Bayes risk in using the optimumsadaptive decision rule, when 9_ is unknown, converges as the number of training samples ni‘» m, i E {l,2,...,M}, to the same limit as the Bayes risk when Q is known and the optimal decision rule is used. 74 The following theorem, then, insures the asymptotic optimality of the adaptive-optimum decision rule. * THEOREM 4.1.1 da is asymptotical optimal relative to go. A lemma which will be used in the proof of Theorem 4.1.1 is first given. n n n,~o 1mm 4.1.1 P[e=i\(_xk,tk).(y_ i,r i> «211—1—9 P£e=il :03] w.p.1 i E {l,2,...,M} The proof of the above lemma is anologous to that given by Signori [8-2] and will be omitted. PROOF OF THEOREM 4.1.1: For Simplicity, the following are defined. [1 n n k A k k . 
i L) i i Q U (E :t ) : V (Xi ,Ti ) , Vni (yini’Tini) Then, from (4.4.6) it follows that o s r(po,d:) - r(p°,d*) s[R(d:|uk) - R(d*[uk)] n . E[R(d:|uk - R(dZ‘uk, {V l}?:9] n, n, + Emwjlu“, {v 112‘s) - R3 But, Etudihknvnifiw - R S 0 ThuS, o _<. r s E[R(d:|uk) - R(d:\uk. {Vnifl‘m n + E[R(d*|uk, {v iffxp - R(d*\uk)] 75 Substituting the values of R(.‘.) defined in (4.4.2) in the above, it fo llows that o * 0* 0sr

-r(p,d> M k M n i=1 i=1 n M , M + E{ 2 L(d*,i)P(e=i|uk,V 1,9) - 2 L(d*,i)P(6‘iluk9 i=1 i=1 M n. s s{ 2 L(d:.i>tp - P determine which pattern class is active based upon observation of }{t° The entire model is illustrated in Fig. 5.01. In Sec. 5.1, Optimal decision making with discrete-time cflaservations is studied and iterative expressions are obtained for 78 79 mm H E 2.: G: on N Ema 2.: ] mew. m.oHu P--- ..3 arm ZOQmH mom Umowmwos mvawwsm mnrmam. + n n n meHsm awn: owmonmnm mmncnm mxnnmonon W . Uh u AxHuxNvooovu a. 80 the likelihood ratio. In Sec. 5.2, a continuous likelihood ratio is obtained by a limiting operation. Section 5.3 discusses a method for solving the classical problem of detecting a random telegraph signal in additive white noise. 5.1 OPTIMAL DECISION MAKING WITH TIME SAMPLES In this section, the features on which decisions are based are defined as point samples of the observed function xt as follows: = k + n , k = l,2,...,K, (5.1.1) k k k where K = T/T and T is the interval between samples; 2 k n(k-1)¢ Gaussian processes at time (k-1)T, with n represents the noise sample taken from the white E(nk) = o, Cov(nk,n£) = v06(k-L), k,L = l,2,...,K. (5.1.2) The sequence of random variables {Ak}§, xk é x(k-1)T’ take values {a1,a2,...,aN} corresponding tozifinite state Space A = {l,2,...,N}. That is, if the Markov process is in state i at time (k-l)T, then xk = ai. Since the xk's are the time samples from the process At they satisfy the Markov prOperty; namely, = = _ _ _ (S) P(Kk aj‘xk-l a.,..., ll — a ,9-3) - p 1 L ij (T) when ws is active (5.1.3) ( ) where pi: function of the continuous-parameter Markov chain describing chain 5. (T), s = 1,2, is the stationary transition probability 81 The observed random variables {Xk}§ takes values in a finite-dimensional Euclidean Space and have the following density when it is known that the chain in state i at time kT. fi(xk) = p(xk - ai) (5.1.4) Since the noise samples are Gaussian and uncorrelated, it follows that they are state-conditionally independent. That is, P(Xk-l’xk‘Xk-l = ai’xk = j) P(Xk-l‘kk-l = ai)P(Xk‘xk = j) fi(xk_1)fj(xk) (5.1.5) Since the true states of the chain that is active cannot be observed directly, xk has the global density: N f(xk\e) = iElfi(xk)P(>.k = ai\e) s E e (5.1-6) In accordance with the discussion in Chapter III, the * Optimum decision rule, d (.) is given by the Bayes decision rule: * K d (x ) = ml if AK < n = m, if AK > n (5.1.7) where K 0 K g (x ) A P2 K K AK = AK(X ) ‘ K , fl =‘—; ; gs(x ) = g(x ls = s) 81(X ) P1 Using conditional probabilities, the likelihood ratio, AK, can be written iteratively as follows: k+1 k k 2(x ) = g2 f(xk+1\xk.e=1> 8 A = k+1 (5.1.8) 82 By the total probability law, N k - - k = = = k = = By the state-conditional independence, the first factor in the summation above can be written as: f(x \xk x =a e=s> = f (x >= p (5.1.9) H4 ’kfl i’ i an 1&1 i The following notation is now introduced. 111; = k __ . iZQ = k _ . ___ rk mm ailx .e 1) .rk mm ailx ,e 2). 1 e A. s 1.2. (5.1.10) . 
il Iterative expressions are then obtained for AK, Pk '2 and F; as follows: N i2 121 p("k+1 aiwk Ak+l = AK N 11 (5.1.11) , - a ._ p(xk+1 i)rk 1-1 From the total probability law, Fis, s = 1,2, is given by is k N k k rk = P(Ak+1=ailx 39:8) = .EIP(Ak+1—ai‘Ak—ajix 59:3)P(Ak—J‘X 39‘s) 1 (5.1.12) where (5.1.3) and state-conditional independence imply P(x =a ‘1 =a xk 9:5) = P(S) = 1 2 (5 1 13) k+likj” 31(T)’S " '° The second factor in (5.1.12) can be evaluated from the Bayes rule as: 83 k-l f(x \x =a..e=s)P< =a.\x .e=s) P(xk=aj‘xk,e=s) = k k l )‘k J f - - P - \ k'l - 1 (xk\xk—a,.e-s) (1k_L x .e-s) "1452 L js p(x - a.)F _ = N R J k 1 (5.1.14) L3 2 o(x - )F _ as Substituting (5.1.13) and (5.1.14) into (5.1.12) gives the following recursive relation. g (S)( > (x - a >rjs is .=1pji T p k j k-l r =*1 , s = 1,2 (5.1.15) K N PL3 2 p(x - a ) _ L=1 k L k l . . . il 12 N . The initial values, A1 and {F1 ’r1 )1, are given by N 0 . p(x1 - a,)P1(i) i 1 1 A1 " (5.1.16) o(x1 - ai)P§ "P1 "P132 1 )4. Z 2 P;:)(T)p(x1 - aj)P:(j) F18 = l=1h , s = 1,2, 1 e A. (5.1.17) 2 pPp(1> i=1 1 L s where (P:(l), P:(2),...,P:(N)); s = 1,2, are the prior probabilities over the states. The recurrent expressions (5.1.11), (5.1.15) and the initial values (5.1.16), (5.1.17) permit sequential computation of the like- lihood ratio for any k. The rule d*(xK) is an iterative, optimal decision rule which can be though of as an element of a class of on- 84 line decision rules with fixed memory. However, the number of computations needed to make the optimum decision grows linearly with the length of the observation sequence. 5.2 OBSERVATION OF ENTIRE SAMPLE FUNCTION Expressions analogous to those obtained in Sec. 5.1 can be obtained when the entire sample function xt is observed for T seconds. In this case, the iterative expressions (5.1.11) and (5.1.15) are replaced by a set of non-linear stochastic differential equations with a limiting argument. Difficulties are encountered when attempts are made to take a mathematically rigorous limit of the results derived from the model of Sec. 5.1 as the sampling interval goes to zero. Because of the infinite variance of the resulting continuous white noise, mathematical operations (e.g., differentiation and integration) do not exist in any strict sense. For this reason, it is necessary to modify the sampling scheme so that the correSponding limits exist. To achieve this, a new observation scheme will be employed. Let the observed process be x = l t t + nt, t E [0,T], where xt is an N-state, continuous-time Markov chain with the . .. .. (S) N- _ stationary tranSition probability matrix [pij (t)]i j—l, s - 1,2, .9 and nt is a white Gaussian process. The process At is first approximated by a homogeneous, N-state, discrete-time Markov process * (Markov chain), xt = {xk, k = l,2,...,K}, taking values in a finite set {a1,...,aN} at times (k-1)T; k = l,2,...,K, TK = T, with transition matrix [p§:)(r)]§ j=l’ s = 1,2. Good [G-l] showed 3 * that At will converge to xt as T a O with probability one. 85 Thus, for sufficiently small T, x may be approximated by t * xt ~ kt +nt O S t S T (5.2.1) where the equality will be held when T = O, w.p.1. The sampling process being employed here is defined in (5.2.2) + n ik k . k = l,2,...,K. 
(5.2.2) 1 k xk = ;'I T xtdt z (k -1)T It is clear from (5.02) and (5.2.2) that, E(nk) = o, Var(nk) = $9- (5.2.3) 1 2 p(nk) = v exp{- §-'nk} (5.2.4) (2n o)l/2 The distributions and the prOperties of the observed random variables {xk}§ are the same as with discrete-time observations and are given in (5.1.4), (5.1.5) and (5.1.6). Thus, the stochastic differential equations for the logarithm of the likelihood ratio, Lk+1 = LnAk+1, can be obtained in the case of continuous-time observations as follows. Taking logarithms in (5.1.11), substituting ' for p(.), and cancelling common terms: N - .. 12 _T_ - 2 Lk+l Lk L“ [121 Fk eXP{" 2vO (Xk+l 81) 1] 1=l N il T 2 - {,n[ I‘k exp{- —2Vo (Xk+1 - ai) }] (5.2.5) For sufficiently small T, Ir 2 ... 1' 2 exp{ 23;.(xk+1 ai) } ~ 1 - 2;;'(Xk+1 ai) . (5.2.6) From the definitions of Til, Fiz, it follows that 86 N 2 F = 1 , s = 1,2 . (5.2.7) Putting these expressions into (5.2.5) produces N . ~ .;L_ 2 il Lk+1 ' Lk " L“ [I ' 2vo .2 “k+1 ‘ 31) Pk] T N 2 11 - Ln [1 - E— 2 (xk+1 - ai) Pk] . (5.2.8) 0 i=1 Also, for sufficiently small T, («ml-“LEW '8)2FiSs--I-g( -a)2'I‘iSs=12 2v0 i=1 k+1 i k 2v .=1 Xk+1 i k ’ ’ ' ° 1 (5.2.9) Thus, (5.2.8) becomes N . . g T 2 il 12 Lk+l Lk 2vo i'_’3_1("k+1 ' ai) (Pk ’ I‘k )° (5°2°10) Dividing both sides by T and letting T a O, the approx- imate expression in (5.2.10) becomes an exact stochastic differential equation for the logarithm of the likelihood ratio. 2 o o - 31) (r:1 - r12) 0 s t s 'r (5.2.11) _=_1_ N dt 2v .E(X t is is , where Ft , s = 1,2, is the corresponding value of Fk in con- tinuous time and is defined as is _ = :2 Ft _ P(xt ai‘xa, 0 s a S t, e s) , s 1,2. The initial condition will be obtained from (5.1.16) by setting T = O and taking into account the substitution L1 = LnAl. L a 0 (5.2.12) 87 From (5.2.11) and (5.2.12), it follows that L__1_ t 2v , o i t 2 '1 '2 f0 (xz - a1) (r: - r: )dz (5.2.13) "NZ 1 In a similar manner a system of stochastic differential equations are obtained for {F:3}2, s = 1,2, as follows. Replacing k by k+1 in (5.1.15) and then subtracting ris from both sides, 2 I 2p (3 i--2—-) )rf’ Fis 'PiS~j=1pjl V0 -Fis s=12 k+1 k N k ° ’ ° 2 exp(- -—“’— (x - a fir” 8—1 2vO k+1 {, R For small T, using approximation (5.2.6) and considering (5.2.7), 12p (S i) (T) T (X _ a )2 rjs r - F18 2 ° - r13 (5 2 14) k+1 k N k ' ’ ° 2 (S 1-; - 2‘.( at)? x 2v ML _1 k+1 For Sufficiently small T, N -l N T 2 s N T _ 2 Ls [ -—2VM§ ska «m Fig] ~ 1+—-—, Elem ap rk. (5.2.15) Thus, (5.2.14) becomes is is 1 _1;_ N 2 LS Fk+1 ‘ rk ‘5 + 2v 2 (xk+l ' a N) o L=l N (5) 2 js is T 1-— -a, P -I‘. X j21p31( )l: V0 (Xk +1 J) :] k. k Expanding the right side of the above expression and taking only the first orders terms in T gives is ‘_ is _ (s) is (s) is I“k+1 I5k ~ (1 p11 (0) I“k + jilpji (flrk js‘i T (s) _ 2 is _1;_ (s) _ 2 js ' 2v ii (T)(Xk+1 ai) I“k 2v 2 pji (10(ka a1) Pk o o j=l jaéi N N (s) 2 Ls T Fsz 2 +_ (TN: 2(X )I‘ +— 2 )(T) 1" S(x -a). 2v0 L'- _1 k+1 8!, k 2vOj""1!,=1p‘llSi k k+1 L j¥i Dividing both sides by T, letting T a O and taking into account the following limiting conditions, 1-p$5)(v) p(S .)(T) lim--—%%—-—- = qES) ; lim -li;—-—'=’(1'Aij)q(:S) ; 11m P(: )(T) = AU T-eO T—*0 T-+0 (5.2.16) where Aij denotes the Kronecker delta, (5.2.17) follows. dI‘iS N N t _ - (s) is fl (3) js 'l _ 2 is l is _ Ls dc " qi Ft +,2_‘qjirt '2v (Xt 8L) I‘t +2v I51: 2 (xt aL2t)F J-l o o L=1 ji‘i i E A, s = 1,2 . (5.2.17) is N . The initial conditions for {Ft }1, s = 1,2, can be obtained from (5.1.17) by putting T = O. . 
P°(i) is s _ Ft = N , i E A, s = 1,2. (5.2.18) 0 . 2 PS (J) j=1 Expressions (5.2.13), (5.2.17) permit computing the logarithm of the likelihood ratio for any point in time. The stochastic differential equations representing the solutions for Lt and {F:8}q, s = 1,2, appear initially to be neat and concise. How- ever, careful examination shows that they represent an infinite- 89 dimensional system of first-order stochastic differential equations, each with the observation process xt as a driving term. Because of this fact, these stochastic differential equations must be solved by representing the non-linear functions in these equations as series expansions and retaining only the first few terms. The infinite system of differential equations then reduces to a finite set and realizable algorithms are possible [8-6]. 5.3 OPTIMAL DETECTION OF RANDOM TELEGRAPH SIGNAL IN ADDITIVE, GAUSSIAN NOISE The problem of detecting the presence of a random tele- graph signal in white Gaussian noise is investigated in this section. This signal is a 2-state, continuous-time Markov chain. Observations are made over a time interval of fixed duration, T. This problem is a Special case of the model considered in the previous section. There are two pattern classes. One is associated with the presence of the random telegraph signal and the other pattern class is associated with noise alone. The optimal decision rule and an iterative implementation are given in Sec. 5.3.1, while the mean and the variance of the logarithm of the likelihood ratio are derived and discussed in Sec. 5.3.2. In Sec. 5.3.3, the probabilities of errors of the first and second kinds are given and recurrent relationships are established for the conditional error probabilities. Finally, in Sec. 5.3.4, stochastic differential equations are obtained both for the continuous logarithm of the likelihood ratio and for the error probabilities. 90 5.3.1 OPTIMAL DECISION RULE In the particular statistical decision problem considered here, the action Space and the parameter Space consist of two elements, ® = {90,61}, 6': {wo,w1}. The basic problem is, then, that of choosing between the two alternatives in (5.3.1); namely, where, T = TK, T is the finite observation interval and K samples. Sec. 5.1. The signal process, represented by kk = k = l,2,...,K, is a 2-state, continuous-time Markov chain which can take values 0 and a with Q given by (5.3.2). ll xk + nk = n k the sampling interval, T k = l,2,...,K, (5.3.1) is the duration of is the (fixed) number of x(k-1)vr’ a specified transition-rate matrix The same sampling scheme is employed as was defined in (5.3.2) The stationary transition probability matrix, P(T) = [pij(T)] can be obtained by solving the backward Kolmogoroff system of differential equations, The complete matrix is: q1 + qo qofill qofil P(T) = q1 q0 qo+q1 qo+q -(qo+q1)T -(qo+q1)T '1 qo _ qo e-(qo+q1)T o+q1 qo+q1 qo + q1 e-(qo+q1)T ofiH qdml (5.3.3) 91 The noise process, represented by nk = n(k-1)T’ k = l,2,...,K, is a zero mean, Gaussian noise process. The observation process {xk}§ is as defined in (5.1.1). The optimal decision rule is expressed in terms of the likelihood ratio, AK, and the threshold n. That is, if AK < n choose mo (5.3.4) if AK > n choose u’1 where the notations are the sane as in Sec. 5.1. Taking logarithm of (5.3.5) produces the recurrent relationship for the logarithm of the likelihood ratio, k-l _ f(xk\x .91) 5 3 5 Lk ’ Lk+1 +‘Ln k-l ( ° ' ) f(xklx .90) where Lk g LnAK k = l,2,...,K. 
The probability density functions f(x_k | x^{k-1}, θ_i), i = 0,1, are determined from (5.1.6) as

f(x_k | x^{k-1}, θ_0) = p(x_k)                                               (5.3.7)

f(x_k | x^{k-1}, θ_1) = Γ_{k-1}^0 p(x_k) + Γ_{k-1}^a p(x_k - a) = p(x_k)[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ]   (5.3.8)

where

Γ_k^0 ≜ P(λ_{k+1} = 0 | x^k, θ_1) ;   Γ_k^a ≜ P(λ_{k+1} = a | x^k, θ_1) ;   Γ_k^0 + Γ_k^a = 1 .

Substituting (5.3.7) and (5.3.8) into (5.3.5) gives

L_k = L_{k-1} + ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ]                (5.3.9)

with initial value

L_1 = ln[ 1 + P^0(1) ( p(x_1 - a)/p(x_1) - 1 ) ] ,                           (5.3.10)

and the probabilities Γ_k^0, Γ_k^a can be generated recursively by a Bayes update followed by a one-step prediction through P(τ):

Γ_k^0 = [ p_00(τ) Γ_{k-1}^0 p(x_k) + p_10(τ) Γ_{k-1}^a p(x_k - a) ] / [ Γ_{k-1}^0 p(x_k) + Γ_{k-1}^a p(x_k - a) ]   (5.3.11)

Γ_k^a = [ p_01(τ) Γ_{k-1}^0 p(x_k) + p_11(τ) Γ_{k-1}^a p(x_k - a) ] / [ Γ_{k-1}^0 p(x_k) + Γ_{k-1}^a p(x_k - a) ]   (5.3.12)

with initial values

Γ_1^0 = [ p_00(τ) p(x_1) P^0(0) + p_10(τ) p(x_1 - a) P^0(1) ] / [ p(x_1) P^0(0) + p(x_1 - a) P^0(1) ]   (5.3.13)

Γ_1^a = [ p_01(τ) p(x_1) P^0(0) + p_11(τ) p(x_1 - a) P^0(1) ] / [ p(x_1) P^0(0) + p(x_1 - a) P^0(1) ]   (5.3.14)

where P^0(0) and P^0(1) are the prior probabilities of the signal states.

5.3.2 ITERATIVE COMPUTATIONS OF THE MEAN AND VARIANCE OF L_k

In this section, the mean and variance of the logarithm of the likelihood ratio are derived. Iterative algorithms are discussed and a theorem related to the likelihood ratio algorithm is given. Before getting into the details, the following is first defined for convenience:

ξ_k ≜ ξ_k(x^k) ≜ ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ]               (5.3.15)

The expected value of the logarithm of the likelihood ratio can be determined directly from (5.3.9) as follows:

E(L_k - L_{k-1} | θ_i) = ∫_{X^k} ξ_k(x^k) g_i(x^k) dx^k ,   i = 0,1 ,        (5.3.16)

where g_i(x^k) is the joint density of x^k under θ_i. In general, it is impossible to evaluate the integral in (5.3.16) analytically. As an alternative, it may help to investigate the conditional expectations μ_ℓ(k | θ_i), which are defined as

μ_ℓ(k | θ_i) ≜ E(ξ_k | θ_i, x^ℓ) = E{ ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ] | θ_i, x^ℓ } ,   ℓ = 1,2,...,k ;  k = 1,2,... .

Using the above definition and the fact that the σ-field induced by x^ℓ is a sub-field of the σ-field induced by x^{ℓ+1}, μ_ℓ(k | θ_i) can be written iteratively as follows:

μ_ℓ(k | θ_i) = ∫_{X_{ℓ+1}} μ_{ℓ+1}(k | θ_i) f(x_{ℓ+1} | θ_i, x^ℓ) dx_{ℓ+1}   (5.3.17)

This is the desired result. It is clear that at the final point, i.e., ℓ = k,

μ_k(k | θ_i) = ξ_k = ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ] ,   i = 0,1 .   (5.3.18)

With the help of (5.3.16) and (5.3.17), it is possible to obtain E(L_k - L_{k-1} | θ_i) by the expression

E(L_k - L_{k-1} | θ_i) = μ_0(k | θ_i) ,   i = 0,1 .                          (5.3.19)

Substituting the values of f(x_{ℓ+1} | θ_i, x^ℓ) from (5.3.7) and (5.3.8) into (5.3.17) makes the conditional means computable recursively. The variance of L_k can be decomposed as

Var(L_k | θ_i) = Var(L_{k-1} | θ_i) + Var(ξ_k | θ_i) + 2 Cov(L_{k-1}, ξ_k | θ_i)   (5.3.22)

where

Var(ξ_k | θ_i) = E(ξ_k² | θ_i) - μ_0²(k | θ_i)                               (5.3.24)

Cov(L_{k-1}, ξ_k | θ_i) = E(L_{k-1} ξ_k | θ_i) - E(L_{k-1} | θ_i) μ_0(k | θ_i)   (5.3.25)

But, from (5.3.9), L_{k-1} can be expressed as

L_{k-1} = Σ_{r=1}^{k-1} ξ_r(x^r) .                                           (5.3.26)

Substituting this into (5.3.25) gives

Cov(L_{k-1}, ξ_k | θ_i) = Σ_{r=1}^{k-1} [ E(ξ_r ξ_k | θ_i) - μ_0(r | θ_i) μ_0(k | θ_i) ] .   (5.3.27)

The conditional expectations

s_ℓ(k | θ_i) ≜ E(ξ_k² | θ_i, x^ℓ) ,   e_ℓ(r,k | θ_i) ≜ E(ξ_r ξ_k | θ_i, x^ℓ)

can be expressed iteratively, using an argument similar to that in (5.3.17), as follows:

s_ℓ(k | θ_i) = ∫_{X_{ℓ+1}} s_{ℓ+1}(k | θ_i) f(x_{ℓ+1} | x^ℓ, θ_i) dx_{ℓ+1}

e_ℓ(r,k | θ_i) = ∫_{X_{ℓ+1}} e_{ℓ+1}(r,k | θ_i) f(x_{ℓ+1} | x^ℓ, θ_i) dx_{ℓ+1} ,   ℓ = 1,2,...,k ;  i = 0,1 .

The boundary values are obtained by putting ℓ = k:

s_k(k | θ_i) = ξ_k²(x^k) = { ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ] }²

e_k(r,k | θ_i) = ξ_r(x^r) ξ_k(x^k) = ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ] · ln[ 1 + Γ_{r-1}^a ( p(x_r - a)/p(x_r) - 1 ) ] .

It is obvious that the usual expected values in (5.3.24) and (5.3.27) are related to these conditional expectations by the expressions:

E(ξ_k² | θ_i) = s_0(k | θ_i) ;   E(ξ_r ξ_k | θ_i) = e_0(r,k | θ_i) ,   i = 0,1 .   (5.3.28)

In terms of (5.3.22), (5.3.24), (5.3.27), (5.3.28) and the boundary values, it is possible to compute Var(L_k | θ_i) iteratively.
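The detector itself needs only the recursion (5.3.9) driven by the updates (5.3.11)-(5.3.14), and these combine into a single pass over the samples. The following sketch is illustrative rather than the thesis's implementation: the function name is hypothetical, the per-sample noise is assumed Gaussian with variance v_0, the prior is assumed normalized so that P^0(0) + P^0(1) = 1, and the default threshold η* = ln η = 0 corresponds to η = 1.

```python
import numpy as np

def detect_telegraph(x, tau, a, q0, q1, prior, v0, eta_star=0.0):
    """One pass of the discrete-time detector: log-likelihood recursion (5.3.9)
    with the state-probability updates (5.3.11)-(5.3.14)."""
    s = q0 + q1
    e = np.exp(-s * tau)
    P = np.array([[(q1 + q0*e)/s, (q0 - q0*e)/s],    # p_00, p_01 per (5.3.3)
                  [(q1 - q1*e)/s, (q0 + q1*e)/s]])   # p_10, p_11
    dens = lambda u: np.exp(-u*u/(2.0*v0)) / np.sqrt(2.0*np.pi*v0)
    # initial values: L_1 from (5.3.10), Gamma_1 from (5.3.13)-(5.3.14)
    w = np.array([prior[0]*dens(x[0]), prior[1]*dens(x[0] - a)])
    L = np.log(w.sum() / dens(x[0]))      # = ln f(x_1|theta_1)/f(x_1|theta_0)
    G = P.T @ (w / w.sum())               # predictive (Gamma_1^0, Gamma_1^a)
    for xk in x[1:]:
        L += np.log(1.0 + G[1]*(dens(xk - a)/dens(xk) - 1.0))   # (5.3.9)
        w = np.array([G[0]*dens(xk), G[1]*dens(xk - a)])        # Bayes step
        G = P.T @ (w / w.sum())                                 # (5.3.11)-(5.3.12)
    return (1 if L > eta_star else 0), L
```

The two-line update inside the loop is exactly the structure of (5.3.11)-(5.3.12): condition on the new sample, then predict one sampling interval ahead through P(τ).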
5.3.3 PROBABILITIES OF ERROR

The probabilities of error of the first and second kinds are defined by

α_0 = P(L_K > η* | θ_0) ,   β_0 = P(L_K < η* | θ_1) ,

where η* ≜ ln η. In general, exact analytical expressions for α_0 and β_0 are impossible. However, it is possible to obtain them recursively in terms of the conditional error probabilities α_k(K) and β_k(K), which are defined below, similar to those in (3.4.4), (3.4.5):

α_k(K) = P(L_K > η* | θ_0, x^k) ,   β_k(K) = P(L_K < η* | θ_1, x^k) ,   k = 1,2,...,K .

By the same argument used in the previous section, the following recurrent expressions for the conditional error probabilities can be obtained:

α_k(K) = ∫_{X_{k+1}} α_{k+1}(K) p(x_{k+1}) dx_{k+1}                          (5.3.29)

β_k(K) = ∫_{X_{k+1}} β_{k+1}(K) p(x_{k+1}) [ 1 + Γ_k^a ( p(x_{k+1} - a)/p(x_{k+1}) - 1 ) ] dx_{k+1}   (5.3.30)

From (5.3.29) and (5.3.30), it follows that these probabilities of error at the final point, k = K, must satisfy the expressions:

α_K(K) = u(L_K - η*) ;   β_K(K) = 1 - u(L_K - η*) ,

where u(·) is the unit step function. In terms of these relations and the boundary values, the usual error probabilities can be computed by

α_0 = α_0(K) ;   β_0 = β_0(K) .

5.3.4 OBSERVATION OF ENTIRE SAMPLE FUNCTION

The results obtained in Sec. 5.2 can be specialized to the present problem. The following stochastic differential equation is obtained for the logarithm of the likelihood ratio:

dL_t/dt = (a/2v_0)(2x_t - a) Γ_t^a ,   0 ≤ t ≤ T̄ .                           (5.3.31)

The initial condition is obtained from (5.3.10); setting τ = 0 in the expression for L_1 gives L_0 = 0.

In a similar manner, the following system of non-linear stochastic differential equations is obtained for Γ_t^0 and Γ_t^a:

dΓ_t^0/dt = -q_0 Γ_t^0 + q_1 Γ_t^a + (a/2v_0)(a - 2x_t) Γ_t^0 Γ_t^a          (5.3.32)

dΓ_t^a/dt = q_0 Γ_t^0 - q_1 Γ_t^a + (a/2v_0)(2x_t - a) Γ_t^0 Γ_t^a           (5.3.33)

Again, the initial conditions are obtained from (5.3.13) and (5.3.14) by setting τ = 0 in the expressions for Γ_1^0 and Γ_1^a:

Γ_0^0 = P^0(0) / (P^0(0) + P^0(1)) ;   Γ_0^a = P^0(1) / (P^0(0) + P^0(1)) .

The non-linear stochastic differential equation obtained for z_t ≜ Γ_t^a is identical to that derived by Kulman [K-6] and by Stratonovich [S-6], using a completely different technique. Expressions (5.3.31), (5.3.32) and (5.3.33) permit computing the logarithm of the likelihood ratio for any time t ∈ [0, T̄]. The block diagram of the optimum receiver is realized in terms of these expressions in Fig. 5.3.1.

Fig. 5.3.1: The Block Diagram of the Optimum Detector. (The observation x_t = λ_t + n_t drives the solution of the differential equations (5.3.32)-(5.3.33) for Γ_t^0, Γ_t^a and of (5.3.31) for L_t; L_{T̄} is compared with the threshold η* and the maximum selects ω_0 or ω_1.)
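A minimal simulation of the receiver in Fig. 5.3.1 needs only a grid-sampled telegraph path, additive noise of per-sample variance v_0/dt (cf. (5.2.3)), and an Euler integration of (5.3.31) and (5.3.33). Everything below is an illustrative choice rather than part of the thesis: the grid step, the seed, the stationary initialization P(λ = a) = q_0/(q_0+q_1), and the clipping of Γ_t^a to [0,1].

```python
import numpy as np

rng = np.random.default_rng(0)

def telegraph_path(q0, q1, a, n, dt):
    """Grid-sampled telegraph signal: flip 0 <-> a with prob. ~ rate*dt."""
    lam = np.empty(n)
    up = rng.random() < q0 / (q0 + q1)        # stationary start in state a
    for k in range(n):
        lam[k] = a if up else 0.0
        if rng.random() < (q1 if up else q0) * dt:
            up = not up
    return lam

def telegraph_receiver(x, dt, a, q0, q1, Ga, v0):
    """Euler integration of (5.3.31) and (5.3.33); Ga is Gamma_0^a."""
    L = 0.0                                   # L_0 = 0
    for xt in x:
        c = (a / (2.0 * v0)) * (2.0 * xt - a)
        L += dt * c * Ga                                         # (5.3.31)
        Ga += dt * (q0*(1.0 - Ga) - q1*Ga + c*(1.0 - Ga)*Ga)     # (5.3.33)
        Ga = min(max(Ga, 0.0), 1.0)           # keep the probability in [0,1]
    return L

q0, q1, a, dt, v0, n = 1.0, 1.0, 1.0, 1e-3, 0.5, 10_000
lam = telegraph_path(q0, q1, a, n, dt)
x = lam + rng.normal(0.0, np.sqrt(v0 / dt), n)   # Var(n_k) = v0/dt, cf. (5.2.3)
L = telegraph_receiver(x, dt, a, q0, q1, Ga=q0/(q0 + q1), v0=v0)
print("signal present" if L > 0.0 else "noise alone")
```

Note that Γ_t^0 never needs to be integrated separately, since Γ_t^0 = 1 - Γ_t^a; the receiver state is one scalar probability plus the accumulated L_t.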
Recurrent expressions for the conditional probabilities of error obtained in (5.3.29), (5.3.30) can also be obtained for the case of continuous-time observations. The details of the derivation are as follows. Replacing k with t and k+1 with t+τ, (5.3.29) can be written in a new form:

α_t(L_t, Γ_t^0) = ∫_{X_{t+τ}} α_{t+τ}(L_{t+τ}, Γ_{t+τ}^0) p(x_{t+τ}) dx_{t+τ}   (5.3.34)

Feller [F-1] showed that, under mild regularity conditions, α_t(L_t, Γ_t^0) must be bounded, p(x_{t+τ}) must satisfy the postulates (4.2), (4.3) and (4.4)³, and α must satisfy the "backward diffusion equation,"

-∂α/∂t = A_1 ∂α/∂L + A_2 ∂α/∂Γ^0 + (1/2) B_11 ∂²α/∂L² + B_12 ∂²α/∂L∂Γ^0 + (1/2) B_22 ∂²α/∂(Γ^0)² .   (5.3.35)

³ Feller [F-1], pg. 321.

For the problem considered here, it is easily proved that p(x_{t+τ}) satisfies the above postulates, and since α is a probability of error, |α| ≤ 1. Thus, the regularity conditions are satisfied. The coefficients in (5.3.35) can be determined as follows. From (5.3.34), for τ > 0, the following equation is first obtained:

[ α_{t+τ}(L_t, Γ_t^0) - α_t(L_t, Γ_t^0) ] / τ = (1/τ) ∫_{X_{t+τ}} [ α_{t+τ}(L_{t+τ}, Γ_{t+τ}^0) - α_{t+τ}(L_t, Γ_t^0) ] p(x_{t+τ}) dx_{t+τ}   (5.3.36)

Expanding α_{t+τ}(L_{t+τ}, Γ_{t+τ}^0) in a Taylor series about (L_t, Γ_t^0) and taking only the first and second order terms gives

α_{t+τ}(L_{t+τ}, Γ_{t+τ}^0) ≈ α_{t+τ}(L_t, Γ_t^0) + (∂α_{t+τ}/∂L) ΔL_τ + (∂α_{t+τ}/∂Γ^0) ΔΓ_τ + (1/2)(∂²α_{t+τ}/∂L²)(ΔL_τ)² + (∂²α_{t+τ}/∂L∂Γ^0) ΔL_τ ΔΓ_τ + (1/2)(∂²α_{t+τ}/∂(Γ^0)²)(ΔΓ_τ)² ,

where ΔL_τ ≜ L_{t+τ} - L_t and ΔΓ_τ ≜ Γ_{t+τ}^0 - Γ_t^0. Substituting this expansion into (5.3.36), performing the resulting integration and then passing to the limit as τ → 0 produce the coefficients in (5.3.37). The resulting partial stochastic differential equation can be written as:

∂α/∂t - (a²/2v_0)(1 - Γ^0) ∂α/∂L + [ q_1 - (q_0 + q_1)Γ^0 + (a²/2v_0)Γ^0(1 - Γ^0) ] ∂α/∂Γ^0
   + (a²/2v_0)(1 - Γ^0)² ∂²α/∂L² - (a²/v_0)Γ^0(1 - Γ^0)² ∂²α/∂L∂Γ^0 + (a²/2v_0)(Γ^0)²(1 - Γ^0)² ∂²α/∂(Γ^0)² = 0   (5.3.37)

Using similar arguments, a partial stochastic differential equation can be obtained for the conditional probability of error of the second kind and is given below.

∂β/∂t + (a²/2v_0)(1 - Γ^0)(1 - 2Γ^0) ∂β/∂L + [ q_1 - (q_0 + q_1)Γ^0 - (a²/2v_0)Γ^0(1 - Γ^0)(1 - 2Γ^0) ] ∂β/∂Γ^0
   + (a²/2v_0)(1 - Γ^0)² ∂²β/∂L² - (a²/v_0)Γ^0(1 - Γ^0)² ∂²β/∂L∂Γ^0 + (a²/2v_0)(Γ^0)²(1 - Γ^0)² ∂²β/∂(Γ^0)² = 0   (5.3.38)

where α ≜ α_t(L_t, Γ_t^0), β ≜ β_t(L_t, Γ_t^0), L ≜ L_t and Γ^0 ≜ Γ_t^0. Equations (5.3.37) and (5.3.38) describe the change in a conditional error probability over a recurrent period of time. From (5.3.34), it follows that these probabilities must satisfy the conditions

α_{T̄}(L, Γ^0) = u(L - η*) ;   β_{T̄}(L, Γ^0) = 1 - u(L - η*)                 (5.3.39)

where T̄ is the time at which the detection is completed. If α_t(L_t, Γ_t^0) and β_t(L_t, Γ_t^0) are the solutions of equations (5.3.37), (5.3.38) with the boundary conditions (5.3.39), then the first and second kinds of probabilities of error are determined by the expressions

α_0 = α_0( 0, P^0(0)/(P^0(0) + P^0(1)) ) ;   β_0 = β_0( 0, P^0(0)/(P^0(0) + P^0(1)) ) .
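In principle, (5.3.37) with the terminal condition (5.3.39) can be integrated backward in time on an (L, Γ^0) grid to tabulate α_0. The explicit finite-difference sketch below is illustrative only; the grid limits, grid sizes and the explicit scheme are ad hoc choices (an implicit or upwind scheme would be needed for a reliable solution), and the coefficients are those appearing in (5.3.37).

```python
import numpy as np

def alpha_pde_sketch(a, v0, q0, q1, eta_star, T, nt=4000, nL=121, nG=31):
    """Step the conditional error probability alpha of (5.3.37) backward
    from the terminal condition (5.3.39) at t = T to t = 0."""
    L = np.linspace(-10.0, 10.0, nL); dL = L[1] - L[0]
    G = np.linspace(0.0, 1.0, nG);    dG = G[1] - G[0]
    LL, GG = np.meshgrid(L, G, indexing="ij")
    alpha = (LL > eta_star).astype(float)            # alpha at t = T
    c = a * a / (2.0 * v0)
    bL = -c * (1.0 - GG)                             # drift of L_t under theta_0
    bG = q1 - (q0 + q1)*GG + c*GG*(1.0 - GG)         # drift of Gamma_t^0
    dt = T / nt
    for _ in range(nt):                              # t runs from T down to 0
        aL  = np.gradient(alpha, dL, axis=0)
        aG  = np.gradient(alpha, dG, axis=1)
        aLL = np.gradient(aL, dL, axis=0)
        aGG = np.gradient(aG, dG, axis=1)
        aLG = np.gradient(aL, dG, axis=1)
        Lop = (bL*aL + bG*aG + c*(1 - GG)**2 * aLL
               - 2.0*c*GG*(1 - GG)**2 * aLG
               + c*GG**2 * (1 - GG)**2 * aGG)
        alpha = np.clip(alpha + dt * Lop, 0.0, 1.0)  # backward step: d(alpha)/dt = -Lop
    return L, G, alpha   # evaluate at L = 0, Gamma^0 = P0(0) to read off alpha_0
```

The same loop with the drift coefficients of (5.3.38) would produce β_0; only bL and bG change.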
5.4 CONCLUSIONS

In the first part of this chapter, the optimal decision making problem was investigated for the model which consists of two pattern classes characterized by different N-state continuous-parameter Markov chains. All parameters of the underlying model were assumed known. Because of the noisy medium, which affects the model in an additive manner, neither the states nor the transition times of the sample function generated by these chains could be observed directly.

In Sec. 5.1, assuming additive white Gaussian noise and a discrete-time sampling scheme, the observation process was defined and the Bayes likelihood ratio algorithm was derived. The likelihood ratio and a statistic (5.1.10) were computed recursively. Continuous-time results were obtained by introducing a new sampling scheme and applying a limiting argument. A set of non-linear differential equations was obtained for the logarithm of the likelihood ratio and for the statistics defined in (5.1.10).

The second part of this chapter considered the problem of detecting a random telegraph signal with a fixed observation time in additive white Gaussian noise. It was shown that the problem is a special case of the model considered in the first part. The optimum detector was defined in Sec. 5.3.1 and its basic components, (5.3.9) and (5.3.11), were generated recursively. Iterative schemes for computing the mean and variance of the logarithm of the likelihood ratio were given in Sec. 5.3.2. In order to analyze the quality of the optimum detector, it was necessary to consider the probabilities of error of the first and second kind. For this, the conditional error probabilities were first investigated in Sec. 5.3.3. Recurrent relationships were established for them, from which the usual error probabilities were found as functions of the observation time and the parameters of the problem. Finally, in Sec. 5.3.4, the case of continuous observation times was considered. Stochastic differential equations were obtained for the logarithm of the likelihood ratio and for the conditional probabilities of error.

CHAPTER VI

GENERAL CONCLUSIONS AND EXTENSIONS

The main objectives of the thesis are reviewed in this chapter and possibilities for future research are discussed.

6.1 CONCLUSIONS

This thesis has been concerned with Bayesian decision making and learning algorithms for a particular problem in parametric pattern recognition. Each of M pattern classes was characterized by an N-state, continuous-time, homogeneous Markov chain, and the infinitesimal-rate matrices (Q-matrices) for the chains were the parameters of the problem. The object of the decision rules was to decide which of the M chains produced the observed sample function. Statistical decision theory was employed throughout the thesis to develop optimal solutions for a special loss function (the 0-1 loss function).

In Chapter II, the Bayes-optimum decision rules were derived under a perfect observation mechanism (noiseless case). The observable quantities, or the "features," were the sojourn times in the states and the state numbers themselves. Using classified training data from each pattern class, an algorithm for supervised learning was presented and the existence of reproducing prior densities for the parameters was demonstrated. The main result was the formulation of recursive, computationally simple parametric forms for the posterior densities of the unknown parameters and for the components of the optimum decision rules. It was shown that the amount of computer storage required to implement these algorithms was fixed and finite. Computer simulations of a specific example gave curves showing probability of error versus the length of the observation sequence for different amounts of training data and demonstrated the inherent practicality of the results.

The problem of computing probability of error was investigated for the noiseless case in Chapter III. All parameters in the model were assumed known and there were only two pattern classes. The exact probability of error, as well as lower and upper bounds, was established for several cases. Conditional error probabilities of the first and second kinds were introduced, by which the usual probability of error could be computed iteratively. The main conclusion of the chapter was that, even for a simple case, exact expressions for the probability of error are very complicated and difficult to evaluate. However, the fact that the probability of error decreases toward zero as the number of samples increases without bound was proved by employing a Bhattacharyya bound. The asymptotic behavior of the error probability was also studied.

The model of Chapter II was extended in Chapter IV to include the case in which the observations are made in a noisy medium. In the new model, the states of the chains were described by random processes, but sojourn times could be observed. The optimal and the adaptive-optimal decision rules were defined and their basic components were generated iteratively. The asymptotic optimality of the adaptive rules was exhibited. For the known-parameter case, the iterative optimal decision rule can be implemented on a computer using only a finite memory. However, computer time increases linearly with the number of observation sequences. One can also conclude that the optimum-adaptive rule is a fixed-memory, iterative rule and only a high-storage quantization procedure is needed to implement it.

Finally, Chapter V dealt with the case when neither the transition times nor the states could be observed but the processes defining the states were white Gaussian processes. This is equivalent to additive white Gaussian noise.
The Bayes likelihood algorithm was established for the optimal decision rule assuming two pattern classes and a discrete-time sampling scheme. Non-linear stochastic differential equations were obtained for the logarithm of the likelihood ratio and for the conditional error probabilities when the sampling scheme was changed to permit observation of the entire sample function. The results were applied to the specific problem of detecting a random telegraph signal in white noise. In general, the algorithms for the likelihood ratio are computationally feasible. The implementation of the optimal decision rule with discrete-time sampling requires knowledge of the stationary transition probability functions for each pattern class. These functions can be obtained by solving a system of N² linear differential equations with constant coefficients. The continuous sampling scheme leads to much simpler algorithms and does not require the transition functions.

6.2 EXTENSIONS

The work in this thesis suggests several extensions that should be investigated.

First is the problem of developing sub-optimum decision rules that are easier to implement than the high-storage quantization procedure of Sec. 4.4.2 implicit in the optimal-adaptive rule. Strongly consistent estimators for the unknown Q-matrices might be developed. These are functions of the observations that converge with probability one to the true value of the parameter. Conditions for the existence of such estimators are important because, in Sec. 4.4.3, it was shown that a strongly consistent estimator for q_0 must be exhibited to ensure that these parameters will be learned during the operation of the optimal rule.

Secondly, the expressions obtained for the probability of error can be extended to the cases where M patterns are present and the parameters of the underlying models are not known. The conditional error probabilities, introduced in Sec. 3.4, suggest a useful computer implementation algorithm for evaluating adaptive decision-making devices.

In Chapter V, stochastic differential equations were derived for several quantities, but these equations were not solved. Since, in general, these differential equations are non-linear as well as stochastic, one must not expect to obtain exact analytical solutions for them. A natural extension of these results would be to seek approximate solution methods, such as expressing them in the form of difference equations which can be solved on a computer, or, under some reasonable assumptions, reducing them to stochastic linear differential equations for which exact solutions may exist.

Finally, a particularly significant theorem concerning the Bayes likelihood algorithm in Chapter V can be proved: the probability of error goes to zero as the number of samples, k, increases without bound. This follows from the fact that the expected value of the logarithm of the likelihood ratio is a monotonically increasing function of k under θ_1 and a monotonically decreasing function of k under θ_0.

BIBLIOGRAPHY

[A-1] Anderson, T. W., Introduction to Multivariate Statistical Analysis, Wiley, New York, 1958.

[B-1] Bartlett, M. S., "The Frequency Goodness-of-fit Test for Probability Chains," Proc. Camb. Philos. Soc., Vol. 47, pp. 86-95, 1951.

[B-2] Billingsley, P., "Statistical Methods in Markov Chains," Ann. Math. Stat., Vol. 32, pp. 12-40, 1961.

[B-3] Blackwell, D. and M. A. Girshick, Theory of Games and Statistical Decisions, Wiley, New York, 1954.

[B-4] Braverman, D., "Machine Learning and Automatic Pattern Recognition," Stanford Electronics Laboratories, Technical Report No. 2003-1, Feb. 1961.

[C-1] Chiang, C. L., Introduction to Stochastic Processes in Biostatistics, Wiley, New York, 1968.

[C-2] Chung, K. L., Markov Chains with Stationary Transition Probabilities, Springer-Verlag, Berlin, 1960.

[C-3] Cramér, H., Mathematical Methods of Statistics, Princeton U. Press, 1958.

[D-1] Daly, R. F., "Adaptive Binary Detectors," Stanford Electronics Lab., Technical Report No. 2003-2, June 26, 1961.

[D-2] Doob, J. L., Stochastic Processes, Wiley, New York, 1953.

[D-3] Drake, A. W., "Observation of a Markov Source Through a Noisy Channel," presented at the IEEE Symposium on Signal Transmission and Processing, Columbia University, May 1965.

[D-4] Dubes, R. C. and Donoghue, P. J., "Bayesian Learning in Markov Chains with Observable States," Interim Report No. 5, Contract No. AFOSR-1023-67B, Div. Engineering Research, Michigan State University.

[D-5] Dubes, R. C., Hung, A., and McCrum, W. R., "Classification of Electroencephalograms with Pattern Recognition Algorithms," Interim Report No. 4, Contract No. AFOSR-1023-67B, Div. Engineering Research, Michigan State University.
[D-6] Dubes, R. C., The Theory of Applied Probability, Prentice-Hall, Englewood Cliffs, N.J., 1968.

[F-1] Feller, W., An Introduction to Probability Theory and Its Applications, Vol. 2, Wiley, New York, 1966.

[F-2] Ferguson, T. S., Mathematical Statistics: A Decision Theoretic Approach, Academic Press, New York, 1967.

[F-3] Fisher, R. A., Contributions to Mathematical Statistics, Wiley, New York, 1952.

[F-4] Fu, K. S., Sequential Methods in Pattern Recognition and Machine Learning, Academic Press, New York, 1968.

[G-1] Good, I. J., "The Frequency Count of a Markov Chain and the Transition to Continuous Time," Ann. Math. Stat., pp. 41-47, October 29, 1960.

[H-1] Hilborn, C. G. and D. G. Lainiotis, "Optimal Estimation in the Presence of Unknown Parameters," IEEE Trans. on Systems Science and Cybernetics, Vol. SSC-5, January 1965.

[K-1] Kadota, T. T. and L. A. Shepp, "On the Best Finite Set of Linear Observables for Discriminating Two Gaussian Signals," IEEE Trans. on Information Theory, Vol. IT-13, pp. 278-284, April 1967.

[K-2] Kailath, T., "A General Likelihood Ratio Formula for Random Signals in Gaussian Noise," IEEE Trans. on Information Theory, Vol. IT-15, pp. 350-361, May 1969.

[K-3] Kanal, L., "Basic Principles of Some Pattern Recognition Systems," Proc. National Electronics Conference, Vol. 18, pp. 279-295, October 1962.

[K-4] Karlin, S., A First Course in Stochastic Processes, Academic Press, New York, 1966.

[K-5] Keinosuke, F. and L. G. Koontz, "Application of the Karhunen-Loève Expansion to Feature Selection and Ordering," IEEE Trans. on Computers, Vol. C-19, No. 4, April 1970.

[K-6] Kulman, N. K., "Certain Optimal Devices for Detection of a Pulse Signal of Random Duration in the Presence of Noise," Radio Engineering and Electronic Physics, Vol. 6, No. 9, pp. 1279-1288, 1961.

[L-1] Loève, M., Probability Theory, Van Nostrand, Princeton, N.J., 1963.

[M-1] McLendon, J. R., "A Pseudo Bayes Approach to Digital Detection and Likelihood Ratio Computation," Ph.D. Thesis, Southern Methodist University, 1969.

[M-2] Martin, J. J., Bayesian Decision Problems and Markov Chains, Wiley, New York, 1967.
[N-1] Nagy, G., "State of the Art in Pattern Recognition," Proc. IEEE, Vol. 56, No. 5, May 1968.

[N-2] Nifontov, Y. A. and V. A. Likharev, "Optimal Detection of a Binary Quantized Markov Signal in the Presence of Noise Similar to the Signal," Engineering Cybernetics, No. 6, 1968.

[N-3] Nilsson, N. J., Learning Machines, McGraw-Hill, New York, 1966.

[P-1] Patrick, E. A. and J. C. Hancock, "Learning Probability Spaces for Classification and Recognition of Patterns With or Without Supervision," Purdue University Report TR-EE-65-21, November 1965.

[P-2] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill System Science Series, New York, 1965.

[R-1] Raiffa, H. and R. Schlaifer, Applied Statistical Decision Theory, The M.I.T. Press, Cambridge, Massachusetts, 1961.

[R-2] Raviv, J., "Decision Making in Incompletely Known Stochastic Systems," Int. J. Eng. Sci., Vol. 3, pp. 119-140, Pergamon Press, 1965.

[R-3] Robbins, H., "The Empirical Bayes Approach to Statistical Decision Problems," Ann. Math. Stat., Vol. 35, pp. 1-20, 1964.

[R-4] Royden, H. L., Real Analysis, Macmillan, New York, 1963.

[S-1] Sebestyen, G. S., Decision-Making Processes in Pattern Recognition, Macmillan, 1962.

[S-2] Signori, D. T., "Estimation and Adaptive Decision Making for Partially Observable Markov Systems," Ph.D. Thesis, Michigan State University, 1968.

[S-3] Silver, E. D., "Markovian Decision Processes with Uncertain Transition Probabilities or Rewards," Ph.D. Thesis, Massachusetts Institute of Technology, 1963.

[S-4] Sosolin, Y. G., "Optimal Detection of Markov Signals and Markov Noise with Discrete Time," Engineering Cybernetics, No. 6, 1966.

[S-5] Spragins, J., "A Note on the Iterative Application of Bayes' Rule," IEEE Trans. on Information Theory, Vol. IT-11, No. 4, October 1965.

[S-6] Stratonovich, R. L., Conditional Markov Processes, American Elsevier Publishing Company, Inc., New York, 1968.

[T-1] Takacs, L., Stochastic Processes, Problems and Solutions, Methuen's Monographs on Applied Probability and Statistics, 1960.

[T-2] Tou, J. T., Computer and Information Sciences-II, Academic Press, New York, 1967.

[T-3] Tucker, H. G., A Graduate Course in Probability Theory, Academic Press, 1967.

[W-1] Wilks, S. S., Mathematical Statistics, Wiley, New York, 1962.

[W-2] Wassily, H., "Probability Inequalities for Sums of Bounded Random Variables," American Stat. Association J., March 1963.

[W-3] Watanabe, S., Methodologies of Pattern Recognition, Academic Press, New York, 1969.

APPENDICES

APPENDIX A. DEFINITIONS AND PROPERTIES

In this appendix, some important definitions and theorems related to continuous-parameter Markov chains are given. Most results are based on Doob [D-2] and Chung [C-2].

Consider a family of real random variables indexed by t, {x_t, 0 ≤ t < ∞}, on a probability triplet (Ω, 𝔉, P), where the set of possible values of x_t is a set of non-negative integers. Here Ω is a probability space, 𝔉 is a σ-field of subsets of Ω, and P is a probability measure defined on 𝔉.

DEFINITION 1.A The stochastic process {x_t, 0 ≤ t < ∞} is said to be a homogeneous, continuous-time, discrete-state Markov process (continuous-parameter Markov chain) if and only if the following two conditions are satisfied:

i) Markov property:

P(x_{t_n} = i_n | x_{t_{n-1}} = i_{n-1}, ..., x_{t_1} = i_1) = P(x_{t_n} = i_n | x_{t_{n-1}} = i_{n-1})   (A.1)

for every n ≥ 2, 0 ≤ t_1 < t_2 < ... < t_n, and for all possible values of the random variables in question.
ii) Homogeneity property: The chain is said to be homogeneous, or to have stationary transition probabilities, if and only if

P(x_{t+h} = j | x_h = i) = P_ij(t) ,   ∀ i,j ∈ Λ and t > 0 ,                 (A.2)

where P_ij(t) is independent of h ≥ 0.

DEFINITION 2.A The functions {P_ij(t)}_{i,j=1}^N are called stationary transition probability functions if and only if the following conditions are satisfied:

A1-1.  P_ij(t) ≥ 0

A1-2.  Σ_{j=1}^N P_ij(t) = 1 ,   t > 0, ∀ i ∈ Λ

A1-3.  Σ_{k=1}^N P_ik(t) P_kj(s) = P_ij(t + s) ,   s,t > 0

The following continuity condition is also assumed to be satisfied for t > 0:

A1-4.  lim_{t→0} P_ij(t) = δ_ij

The matrix [P_ij(t)]_{N×N} is called the stationary transition probability matrix.

DEFINITION 3.A The process {x_t; 0 ≤ t < ∞} is said to be separable (relative to the class of closed sets) if and only if there exist a denumerable subset R ⊂ {t; 0 ≤ t < ∞}, called a separability set, and a null set N with the following property: for any closed linear set A and open interval G in (-∞,∞),

{ω: x_t(ω) ∈ A, t ∈ G ∩ R} - {ω: x_t(ω) ∈ A, t ∈ G ∩ [0,∞)} ⊂ N .

The main results in this thesis are based upon the following theorem, which is given without proof.

THEOREM 1.A If [P_ij(t)] is the stationary transition probability matrix, then the following limits exist:

lim_{t→0} (1 - P_ii(t))/t = -P'_ii(0) ≜ q_i < ∞ ,   ∀ i ∈ Λ                  (A.3)

lim_{t→0} P_ij(t)/t = P'_ij(0) ≜ q_ij < ∞ ,   ∀ i ≠ j ∈ Λ .                  (A.4)

Moreover, if q_i > 0 and x_{t_0} = i, there is a sample function discontinuity for some t > t_0 with probability one (w.p.1). If 0 < α ≤ ∞, and if there is a discontinuity in the interval [t_0, t_0 + α], the probability that the first jump is to j is q_ij/q_i.

The matrix Q = [q_ij]_{i,j=1}^N, where q_ii ≜ -q_i, defined above is called the infinitesimal matrix, the transition-rate matrix, or the Q-matrix. From A1-1 and A1-2 it follows that

q_ij ≥ 0 ,   Σ_{j≠i} q_ij = q_i ,   ∀ i,j ∈ Λ (j ≠ i) .                      (A.6)

Differentiating A1-3 with respect to each variable and setting the result equal to zero produces the following system of differential equations for the stationary transition probability functions, which is called the "backward differential equation system":

P'_ik(t) = -q_i P_ik(t) + Σ_{j≠i} q_ij P_jk(t) ,   ∀ i,k ∈ Λ .               (A.7)

The initial conditions are given by

P_ij(0) = δ_ij ,   ∀ i,j ∈ Λ .                                               (A.8)

The following question is of interest: what are the necessary and sufficient conditions that the q_i's and q_ij's must satisfy to determine the {P_ij(t)}_{i,j=1}^N uniquely? Or, in other words, what conditions on the q_i's and q_ij's ensure a unique solution to the backward differential equation system? Doob [D-2] shows that the necessary and sufficient conditions under which an allowable set of stationary transition probabilities is defined by the backward differential equation system for a given infinitesimal Q-matrix are that the process is separable (which assures that the sojourn times in the states are independent and exponentially distributed), has a finite number of states, satisfies (A.6), and that the states are stable (i.e., q_i > 0). For a discrete-parameter Markov chain, these conditions imply that the process is ergodic. As long as the above conditions are satisfied, a unique continuous-parameter Markov chain with the stationary transition probability matrix [P_ij(t)] can always be found from a given Q-matrix. The process constructed in this way is called a minimal process.
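For a finite Q-matrix, the backward system (A.7) with initial condition (A.8) can be checked numerically: its solution is the matrix exponential P(t) = e^{Qt} of the minimal process. The following sketch (rate values, tolerances and function name are arbitrary) integrates (A.7) directly and compares the result with the exponential:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import expm

def make_Q(rates):
    """Q-matrix from non-negative off-diagonal rates q_ij, with
    q_ii = -q_i = -sum_{j != i} q_ij as required by (A.6)."""
    Q = np.asarray(rates, dtype=float)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

Q = make_Q([[0.0, 0.4, 0.6],
            [0.2, 0.0, 0.3],
            [0.5, 0.5, 0.0]])
N = Q.shape[0]
t_end = 2.0
# Backward system (A.7): P'(t) = Q P(t), with P(0) = I per (A.8).
sol = solve_ivp(lambda t, p: (Q @ p.reshape(N, N)).ravel(),
                (0.0, t_end), np.eye(N).ravel(), rtol=1e-9, atol=1e-12)
P = sol.y[:, -1].reshape(N, N)
assert np.allclose(P, expm(Q * t_end), atol=1e-6)  # minimal-process solution
assert np.allclose(P.sum(axis=1), 1.0)             # rows satisfy A1-2
```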
The relation between the P_ik(t)'s and the q_i's, q_ij's is given by one of the following integral equations:

P_ik(t) = δ_ik e^{-q_i t} + Σ_{j≠i} ∫_0^t q_ij e^{-q_i s} P_jk(t - s) ds      (A.9)

or

P_ik(t) = δ_ik e^{-q_k t} + Σ_{j≠k} ∫_0^t P_ij(s) q_jk e^{-q_k (t - s)} ds .  (A.10)

If the eigenvalues of the Q-matrix are real and distinct, then the solutions of the above integral equations are the same and are given by

P_ik(t) = Σ_{ℓ=1}^N [ A_ik(ℓ) e^{ρ_ℓ t} / Π_{m≠ℓ} (ρ_ℓ - ρ_m) ] ,   ∀ i,k ∈ Λ ,   (A.11)

where ρ_1, ρ_2, ..., ρ_N are the eigenvalues of the Q-matrix and A_ik(ℓ) is the cofactor of the matrix A(ρ) = [ρI - Q] evaluated at ρ = ρ_ℓ. Here, I is the unit matrix.

Doob also proves that the sample functions of a continuous-parameter Markov chain satisfying the conditions stated above are almost all step functions. The corresponding continuous-parameter Markov chains can thus be constructed in terms of their sample functions in the following manner. Let x_1, x_2, ... be random variables taking values in Λ = {1,2,...,N} only. Let τ_1, τ_2, ... be positive random variables, and suppose {q_i}_{i=1}^N and {q_ij}_{j≠i} are any real positive numbers satisfying (A.6) with 0 < q_i < ∞. Define the following conditional probabilities:

P(τ_1 ≤ t_1 | x_1 = i) = 1 - e^{-q_i t_1} ,   t_1 > 0                         (A.12)

P(x_k = j | x_{k-1} = i, ..., x_1 = ℓ, t_{k-1}, ..., t_1) = q_ij / q_i ,   q_i > 0, ∀ i,j ∈ Λ (j ≠ i)   (A.13)

P(τ_k ≤ t_k | t_{k-1}, ..., t_1, x_k = j, ..., x_1) = 1 - e^{-q_j t_k} ,   t_k > 0, j ∈ Λ .   (A.14)

This shows that the random variables {x_k}_{k=1}^∞ have a Markov dependency and the random variables {τ_k}_{k=1}^∞ are conditionally independent. Now, define the process {x_t; 0 ≤ t < ∞} as follows:

x_t = x_n   if   ρ_{n-1} ≤ t < ρ_n ,

where τ_n ≜ ρ_n - ρ_{n-1} and ρ_0 ≡ 0. Then it is easy to show that the process defined in this way is a continuous-parameter Markov chain and that the q_i's and q_ij's satisfy the limit conditions specified in Theorem 1.A. The x_n's are the state numbers and the τ_n's are the sojourn times in these states. A typical sample function from this process is illustrated in Fig. A.1.

Fig. A.1: Typical sample function from a continuous-time Markov chain.
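The construction (A.12)-(A.14) translates directly into a simulation recipe: draw an exponential sojourn with rate q_i, then jump to state j with probability q_ij/q_i. A sketch (the function name and seed are arbitrary, and states are indexed from 0 rather than 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_path(Q, x0, T):
    """One sample function of the chain on [0, T] from its Q-matrix,
    following (A.12)-(A.14). Returns (jump_times, states)."""
    N = Q.shape[0]
    times, states = [0.0], [x0]
    t, i = 0.0, x0
    while True:
        qi = -Q[i, i]                          # q_i = sum_{j != i} q_ij
        t += rng.exponential(1.0 / qi)         # sojourn time, (A.12)/(A.14)
        if t >= T:
            return np.array(times), np.array(states)
        p = Q[i].copy(); p[i] = 0.0; p /= qi   # jump distribution q_ij/q_i, (A.13)
        i = rng.choice(N, p=p)
        times.append(t); states.append(i)
```

This generator assumes stable states (q_i > 0), exactly the condition under which Appendix A guarantees that almost all sample functions are step functions.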
APPENDIX B. DERIVATION OF OPTIMAL DECISION RULES

The basic elements of statistical decision theory are summarized in this appendix. Most of the results are taken from the literature, but are presented from a slightly different point of view, more suitable for the problem of interest in this thesis.

A problem in decision theory consists of three basic elements:

1. A non-empty set, Θ, of possible states of nature (parameter space);
2. A non-empty set, 𝒜, called an action space;
3. A real-valued loss function, L(θ,a), defined on Θ × 𝒜.

In pattern recognition terminology, a state of nature is called a "pattern class." Before making a decision, a random vector, (x^k, t^k) ≜ (x_1,...,x_k, t_1,...,t_k), is observed whose distribution depends on the true state of nature θ ∈ Θ. The sample space, S_k = {ω: ω = (x^k, t^k)}, is taken to be a Borel subset of a finite-dimensional Euclidean space, and the probability distributions of (x^k, t^k) are defined on the class ℬ of Borel subsets of S_k. For each θ ∈ Θ, there is a product measure μ(x^k) × ν(t^k) defined on ℬ and a corresponding multivariate density function g(x^k, t^k | θ), which represents the distribution of (x^k, t^k) when θ is the true state of nature.

On the basis of the set of observations (x^k, t^k), an action d(x^k, t^k) ∈ 𝒜 is chosen. In choosing d(x^k, t^k), the loss L[θ, d(x^k, t^k)], θ ∈ Θ, which is a random quantity, is incurred. The expected value of L[θ, d(x^k, t^k)] when θ is the true state of nature is called the risk:

R(θ, d) = E_θ L[θ, d(x^k, t^k)] .                                            (B.1)

Any function, d(·), that maps the sample space, S_k, into 𝒜 is called a non-randomized decision rule, provided the risk function, R(θ,d), exists and is finite.

The Bayes principle involves the notion of the Bayes risk. A distribution P^0, called a prior distribution, is imposed on the parameter space. The Bayes risk is the expectation of the risk function with respect to P^0; namely,

r(P^0, d) = E R(Z, d) ,                                                      (B.2)

where Z is a random variable over Θ having distribution P^0.

DEFINITION B.1 A non-randomized decision rule, d*(x^k, t^k), is said to be Bayes with respect to the prior distribution P^0 if and only if

r(P^0, d*) = inf_d r(P^0, d) .

Since q_ij^0 ≜ q_i^0 r_ij^0 and δ((q_ij - q_ij^0)/r_ij^0) = r_ij^0 δ(q_ij - q_ij^0), the result follows. That is,

lim_{n→∞} g_n(q_ij | y^n, τ^n) = δ(q_ij - q_ij^0) ,   ∀ i,j ∈ Λ (j ≠ i) ;

the posterior density of q_ij concentrates on the true value q_ij^0 as the number of classified samples increases without bound.

Figure D.1: Algorithms for a single error curve and for decision making. (The first flow chart generates an empirical error curve: a pattern class r is chosen with probability P_r^0; a classified sequence (x^k, t^k) = (x_1,...,x_k, t_1,...,t_k) is generated for class r and classified; misclassifications are counted over i = 1,...,100 trials; the error probability is computed; and k is advanced in steps of 40 until k > 120. The second flow chart implements the decision rule: as each (x_{j+1}, t_{j+1}) is observed, g_r(x^k, t^k | y^{n_r}, τ^{n_r}) is computed from Eq. (2.8.1) and the maximum of p_r^0 g_r(x^k, t^k | y^{n_r}, τ^{n_r}) over r = 1,2,...,m is found.)

APPENDIX E. THE PROOF OF LEMMA 3.3.2

First, it is shown that -∞ < m_1 < 0. Since q_ij > p_ij and q_i > p_i, ∀ i ∈ Λ, it follows from (3.3.19), (3.3.18) and (3.3.9) that

m_1 = Σ_{i=1}^N Σ_{j≠i} (p_i q_ij / q_i) ln(p_ij / q_ij) + Σ_{i=1}^N p_i (q_i - p_i)/q_i .

Because q_ij > p_ij ⟹ ln(q_ij/p_ij) > 0, m_1 can be rewritten as

m_1 = Σ_{i=1}^N (p_i / q_i) [ (q_i - p_i) - Σ_{j≠i} q_ij ln(q_ij / p_ij) ] .

Using the inequality ln(q_ij/p_ij) ≥ 1 - p_ij/q_ij in the above, it follows that

Σ_{j≠i} q_ij ln(q_ij / p_ij) ≥ Σ_{j≠i} q_ij (1 - p_ij/q_ij) = Σ_{j≠i} (q_ij - p_ij) = q_i - p_i > 0 ,   ∀ i ∈ Λ .

Thus,

Σ_{j≠i} q_ij ln(q_ij / p_ij) ≥ q_i - p_i ,   i ∈ Λ   ⟹   -∞ < m_1 < 0 .

Now, it is shown that 0 < m_2 < +∞. From (3.3.20), (3.3.19) and (3.3.9),

m_2 = Σ_{i=1}^N Σ_{j≠i} (p_i p_ij / p_i) ln(p_ij / q_ij) + Σ_{i=1}^N p_i (q_i - p_i)/p_i .

Using the same argument as in the first part of the proof, m_2 can be written as

m_2 = Σ_{i=1}^N (p_i / p_i) [ (q_i - p_i) - Σ_{j≠i} p_ij ln(q_ij / p_ij) ] .

Using the inequality ln(q_ij/p_ij) ≤ q_ij/p_ij - 1 in the above, it follows that

Σ_{j≠i} p_ij ln(q_ij / p_ij) ≤ Σ_{j≠i} p_ij (q_ij/p_ij - 1) = Σ_{j≠i} (q_ij - p_ij) = q_i - p_i ,   ∀ i ∈ Λ .

Thus,

Σ_{j≠i} p_ij ln(q_ij / p_ij) ≤ q_i - p_i ,   i ∈ Λ   ⟹   0 < m_2 < +∞ .
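The two bounds of Lemma 3.3.2 can be spot-checked numerically: for any rates with q_ij > p_ij and any positive weights on the states, the bracketed terms above have the asserted signs. The sketch below uses the bracketed forms of m_1 and m_2 exactly as they appear in this proof; the uniform weights and random rate values are arbitrary test choices, not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)

N = 4
p = rng.uniform(0.2, 1.0, (N, N)); np.fill_diagonal(p, 0.0)      # p_ij
q = p + rng.uniform(0.1, 1.0, (N, N)); np.fill_diagonal(q, 0.0)  # q_ij > p_ij
pi_, qi_ = p.sum(axis=1), q.sum(axis=1)                          # p_i, q_i
w = np.full(N, 1.0 / N)                                          # state weights

off = ~np.eye(N, dtype=bool)
lg = np.zeros((N, N)); lg[off] = np.log(q[off] / p[off])         # ln(q_ij/p_ij)
m1 = np.sum(w / qi_ * ((qi_ - pi_) - (q * lg).sum(axis=1)))
m2 = np.sum(w / pi_ * ((qi_ - pi_) - (p * lg).sum(axis=1)))
assert m1 < 0.0 < m2
print(m1, m2)
```

The assertion holds for every draw, since ln x ≥ 1 - 1/x forces the m_1 bracket negative and ln x ≤ x - 1 forces the m_2 bracket positive, exactly as in the proof.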