ESTIMATION AND ADAPTIVE DECISION
MAKING FOR PARTIALLY OBSERVABLE
MARKOV SYSTEMS

Thesis for the Degree of Ph. D.
MICHIGAN STATE UNIVERSITY
DAVID T. SIGNORI, JR.
1968

 

7HESIS

_ 1' 7- “2.5%,?"
Michigan State
University

  
  

L1

This is to certify that the
thesis entitled

ESTIMATION AND ADAPTIVE DECISION MAKING
FOR PARTIALLY OBSERVABLE MARKOV SYSTEMS

presented by

David T. Signori, Jr.

has been accepted towards fulfillment
of the requirements for

_P_h._I_)_.__ degree in i121—

.., ﬂ,-

/, ,7, \’/ I! if,“ / ,/ .;/.I I
I, /./f._’ )7 ﬁ-"lf/r/ / ”A ' 3(/////./ / ,~
.. "1.44'4,/r--' .,/-/-, - /L
4/“ -"‘ ' r‘AA L1 1 I

 

Major professor

Date Sept. 25. 1968

0-169

P‘;_j._ i“ ’“dfm‘fw

   
       

alu‘smo av “7
HOME 8: SUNS'
300K BINIIERY INC.

' "’RARY BINDERS
““‘l’. Ilcllﬂl

ABSTRACT

A Partially Observable Markov System (POMS) is a discrete state,
discrete time system whose state activity is described by a Markov
chain. The states of the system cannot be observed directly, but
"noisy" observations are available.

The main problem considered is that of determining rules for
making decisions about system states when the conditional densities
of observed random variables given the state of the system are char-
acterized by a set of unknown parameters. Furthermore, it is desired
that, as more observations are taken, these rules converge to the rule
that would be used if the parameters were known.

An iterative,optima1 (minimum Bayes risk), decision rule is
derived for making decisions concerning the state of the system at a
given time on the basis of available observations. This rule has the
capability of using future observations as well as past observations.
An optimal rule is also established for determining to which class
the state of the system belongs among a set of non-communicating
classes of states and an optimal, adaptive estimator is constructed
for the parameters associated with the active class. Conditions are
established under which these rules perform effectively.

A variety of consistent estimators are constructed for the un-
known parameters, yielding a class of suboptimum rules. The basic
estimation problem is a nonsupervisory one involving the resolution
of mixtures. However, unlike previous work, the observation process

is dependent and nonstationary. A general strategy is established

David T. Signori, Jr.

for extending estimation techniques developed for the case of independent
identically distributed observations to this problem. The results apply
also to the non-parametric case and the case with unknown transition
matrix.

The model under study here corresponds directly to that of a
Pattern Recognition System with Markov dependent pattern activity. How-
ever, several communication systems of interest can be shown to be
POMS. These include systems with a Markov encoder, intersymbol inter-
ference, unknown synchronization, signals with random arrival times,
and combinations of the foregoing. In all of these systems, the
observations are dependent and the design of adaptive detectors is
generally difficult. However, by formulating the problem as one of
decision making for a POMS, optimal and suboptimal detectors as well
as conditions for effective operation follow easily and in a unified

manner .

ESTIMATION AND ADAPTIVE DECISION MAKING
FOR PARTIALLY OBSERVABLE MARKOV SYSTEMS

B)!
.,e/
G“
David T. Signori, Jr.

A THESIS
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1968

To

My Mother and Father

ii

ACKNOWLEDGEMENTS

It is with great pleasure that the author thanks his thesis

advisor Dr. R.C. Dubes for his guidance and encouragement during the

course of this research

Thanks are also due to Dr. H. Salehi and Dr. R.B. Zemach for
their suggestions concerning some mathematical questions and to Dr.

H.E. Koenig and Dr. G.L. Park for their interest in this work.

iii

Chapter

I. INTRODUCTION ....................................
1.1 Partially Observable Markov Systems ........
1.2 Optimal Decision Making for a POMS .........
1.3 Unsupervised Learning ......................
1.4 Review of the Literature ...................
1.5 Thesis Objectives ..........................
II. OPTIMAL ADAPTIVE DECISION MAKING ........... .....
2.1 The Decision Problem .......................
2.2 Derivation of the Decision Rule ............

2.3 Iterative Generation of the Posterior
Densities ..................................
2.4 Analysis of the Iterative Procedure ........

2.5 The Learning Features of the Optimum
Decision Rule ..............................
2.6 Related Inference Problems .................
2.7 Conclusions ................................
III. CONSISTENT ESTIMATORS ...........................

TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

wwuwwww
\onmJ-‘uNr-I

WU)
mm

The Estimation Problem .....................
Properties of the Observations .............
The Basic Tools of Estimation ..............
Classical Estimation Methods ...............
Minimum Distance Estimation Methods ........
An Example .................................

Alternate Strategies for the Estimation

Problem ....... . ....... .. ............ . ......
Adaptive Estimation and Class Estimation ...
Conclusions ................................

iv

Page

iii
vi

vii

16
19

22
24
27

29

30
33
35
37
40
43

47
49
52

IV.

EXAMPLES OF PARTIALLY OBSERVABLE MARKOV SYSTEMS

4.1 Pattern Recognition with Markov Dependent

Pattern Activity ...........................
4.2 Adaptive Signal Detection with a Markov
Encoder ....................................
4.3 Adaptive Detection with Intersymbol
Interference ...............................
4.4 Adaptive Detection with Unknown
Synchronization ............................
4.5 M-ary Adaptive Detection of Signal with
Random Arrival Times .......................
4.6 Adaptive Detection of Signals with Random
Arrival Times ............ . .................
4.7 Remarks ....................................
4.8 Conclusions ................................
GENERAL CON CLUS IONS .............................
5.1 Review .... .................................
5.2 Extensions .................................
5.3 Some Interesting Problems ................. .
BIBLIOGRAPHY ....................................
APPENDIX ........................................

54

55
58
6O
63
67
7O
74
75
77
77
78
79
81

85

LIST OF TABLES

Table Page

3.6.1 Computer Simulation Results for 20 Runs with

E0 = 2 ........................................ 46

vi

LIST OF FIGURES
Figure Page

1.1.1 A Schematic of a Partially Observable Markov

System ............. .... ....................... 5
4.1.1 A Pattern Recognition System .................. 56
4.2.1 A Communication System ........................ 58

4.5.1 The Channel Output for a M-ary Signal Set with
Random Arrival Times .................... . ..... 68

4.6.1 The Channel Output for a Signal with Random
Arrival Times ................................. 7O

vii

CHAPTER I

Introduction

A fundamental requirement of classical engineering techniques
for the design of control and information processing systems is the
existence of a fully specified model of the system under study in-
cluding all the factors that influence system performance. However,
factors such as unpredictable changes in environment, drift in system
parameter values due to component aging, or difficulties in measuring
relevant quantities often make the development of an accurate model
impractical. This problem has motivated the study of adaptive decision
making devices such as controllers and detectors which can achieve a
desired goal despite some degree of ignorance concerning the under-
lying model. Such devices are characterized by the fact that they
improve their performance on the basis of past experience. In effect
they "learn" the additional information needed to complete the model
in terms of which their task is defined.

Workable adaptive schemes have been proposed for several types
of communication and control systems [A-l][H-4][M-2][S-4]. In partic-
ular, a large class of problems that arise from analyzing such systems
can be posed as problems in Mathematical Statistics. In such problems
Decision Theory provides the mathematical form of the adaptive device
and a mathematical criterion for evaluating its performance. The learn-
ing process is related to the well-defined problem of estimating the
unknowns in a partially specified statistical structure. The only pre-

requisite for using the powerful tools of Mathematical Statistics is

the existence of a meaningful statistical model.

The basic model under study in this Thesis is that of a discrete
state, discrete time system whose state activity is described by a
Markov chain. The system states cannot be observed directly because
of an imperfect observation mechanism that can be accounted for

statistically. This system will be referred to as a Partially Observ-

 

able Markov §ystem (POMS). Such systems arise frequently in Pattern
Recognition, Signal Detection, and Operations Research [D-BJEK-lJER-B].
The model for a POMS is precisely defined in Sec. 1.1.

The main decision problem associated with a POMS is one of estab-
lishing rules for taking effective action concerning the states of the
system on the basis of available observations. In Sec. 1.2 optimal
decision making is defined for a POMS when all quantities in its model
are assumed known. The structure of the resulting rule is discussed
along with its computational feasibility, a basic consideration through-
outthis study.

This Thesis deals with various aspects of decision making for a
POMS when the model is not completely specified (not all the quantities
in the model are known). Of primary interest is the problem of extract-
ing from the observations information concerning the model unknowns.
Hopefully, such information can be used to construct adaptive decision
rules or rules which perform almost as well (in some well defined
manner) as the optimal rules of Sec. 1.2 where the model is completely
known. The problem of learning the unknowns in a POMS is discussed in
Sec. 1.3. Previous work related to this problem is listed in Sec. 1.4

and in Sec. 1.5 the Thesis objectives are explicitly stated.

1.1 PARTIALLY OBSERVABLE MARKOV SYSTEMS

The basic model considered in the Thesis is established in this
section. The model is composed of two random processes. The first is
a discrete-time, finite-dimensional Markov chain which cannot be observed.
The second is an observable process with the property that the random
variable describing the observations at a given time has a distribution
which depends on the state of the chain at that time. The model corre-
sponds to that of a system whose states cannot be observed but must be
monitored indirectly through a "noisy" observation mechanism, which
suggests the name Partially Observable Markov System (POMS).1

More specifically, the state activity of the system is described
by a first order homogeneous Markov chain; that is, a sequence of ran-
dom variables {AN; N = 1,2,...} taking values in a finite state space
A = {l,2,...,M}; M«< m and satisfying the Markov Property. Namely,
if P(-) is a probability measure defined on the same sample space as
the sequence {AN}: then

P(AN = le

N_1 = i,...,A = k) = P(AN = JliN_l = 1) = p

1
IJN > 1 and i,j = l...M (1 1.1)

where pij is the probability the system is in state j at time N
given it was in state i at time N-l. Hence, knowledge of the last
state summarizes the past history of the system. The probability state

vector of the system is defined by

EN = [P(XN=1)...-.P(KN=M)] (1.1.2)

 

1This is a generalization to a continuous observation space of
what has been previously referred to as a POMS [D-3],[K-1].

where P(AN=i) is the probability that the system is in state i at

time N. Then

_ N-l _
RN - RlP - RN-lp (1.1.3)

where P = [pij] is a stationary transition matrix and .21 is an
initial probability state vector at time 1. Then ‘21 and P are
sufficient to summarize the prior knowledge (knowledge before any
observations are taken) of the state activity of the system.

When the past history of the system is summarized by knowledge
of the last k states, the describing random process is termed a kth
order Markov chain. Since any kth order chain can be reduced to a 18t
order chain, the above chain implies higher order chains [D-Z].

The observation process is defined by a sequence of random vari-
ables {KN}T; XN, the random variable observed at time N, takes values
in a finite-dimensional Euclidean space and has a density function
fi(-) when it is known the system is in state i at time N. That
is, fi(-) is the conditional density of the observations given the
system is in state i. Since the states of the system are unknown,

XN has the global density

pN(X) = .E
l—

f.(X) PO» =i) (1.1.4)
1 1 N

which is referred to as a finite mixture with component densities
f (- M and mixing parameters [T-l . The sequence X }m is
1 1 2N N 1

assumed state conditionally independent. This implies, for example,

. _. _ 2
p(xN’bullxqu’I‘N—n—J) - fi(XN)fj(xN+l) (1‘1'5)

 

2p(°) will denote probability density with Xi indicating both
the random variables and the value it takes on.

and hence the joint density of any number of random variables in the
observation process can be constructed from the component densities

and the system probabilities p1 and P.

 

£100 —"\ lr-‘IE

f2<x> --\, K XN

g4...) _/

FIG. 1.1.1 A Schematic of a Partially
Observable Markov System

The entire model is illustrated by the sampling scheme in Fig.
1.1.1. At time N a sample is taken according to a density determined

by the state of the system at time N.

1.2 OPTIMAL DECISION MAKING FOR.A POMS

When the model is completely specified, Decision Theory provides
a means of generating optimal strategies for action relative to the
states of the system. In this section the type of decision rules of
interest in the Thesis are illustrated along with important properties
of this class of rules.

In order to define the decision problem, the following elements
of Decision Theory are introduced. The action space or set of allowable
actions that can be taken at a given time is denoted by A = {a1,...,ar},
r < m with generic element a; L(-,°) is a non-negative loss function

defined on A X A with L(a,i) denoting the loss incurred when action

. . . . m
a 18 taken and the system IS in state 1; X = [X1,...,Xm] represents

the set of observations obtained up to time m. The problem, then, is
. . . . . m
to find a nonrandomized dec131on function 6N mapping X to A such

that the risk
11(1)“) = E L(eN(xm),>.N) (1.2.1)

is a minimum. The expectation is taken with respect to Xm and AN.
If, for example, r = M and the action ai is "say AN=i" and

3
L(i,j) = l - Aij (the 0-1 loss function) the optimal decision rule,

5:, is given by the Bayes decision rule

52mm) = i if P(>.N=i/x'“) 2 1>(>.N=j/xm) Vjaéi (1.2.2)

where P(AN=i/Xm) is the conditional probability that the system is
in state i at time N given observation Xm [R-B]. Using the
assumption of state conditional independence the above rule can be
written iteratively. For example, if m = N it follows from Bayes

rule and the Markov property that

. -1
fi(XN)P(AN=1/XN )

 

 

N
P(A =i/X ) = _ (1.2.3)
N p(XN/XN 1)
where
. N-l M N-l
P(>~N=1/x ) = z pkiP(AN_1=k/X ) (1.2.4)
1
and
P xN-1 _ M _. N-l
(XN/ ) - ‘12: fi(xN)P(AN-1/x ). (1.2.5)
3

Aij denotes the Kronecker delta.

The iterative scheme operates in two basic steps. The state posterior
probability P(AN_1=i/XN-1) computed at step N-l is projected from
the transition probabilities in (1.2.4) to P(AN=i/XN-1) which serves
as a prior probability for the state activity at time N before the
Nth observation is available. This in turn is converted in (1.2.3) to
a posterior probability using xN and the Bayes Rule.

It is worth noting that, because of the Markov dependencies be-
tween the states of the system at different times, observations are in
general dependent. Consequently, observations at one time may contain
information about the states at other times. This point is reflected
in the above decision rule by the fact that both past and present
observations are used to make decisions about the state of the system
at a given time. By the same token the sequence of decision made by
{6:}? can usually be improved if XN is used to classify all past
states simultaneously. However, rules of this type lead to memory re-
quirements which grow linearly and exponentially with the length of the
observed sequence and decisions are not available for immediate use
[c-1].

The type of decision rules established in this section can be
thought of as a class of on-line rules with fixed memory but changing
structure. The case where m'> N, termed a look ahead mode, is an

attempt to improve performance by increasing memory by a fixed amount.

1.3 UNSUPERVISED LEARNING

If the model for a POMS is only partially specified the pre-

viously established decision rules can be considered functions of the

unknowns in the model. For example, the transition matrix and/or the
component densities may be unknown or may contain unknown parameters.
To make effective decisions under such circumstances information about
these unknowns must be extracted from the observations. Since the
states of the system are unobservable the unknowns appear in a mixture.
The process of estimating or approximating these unknowns is commonly
referred to as unsupervised learning. This is in contrast to the case
in which, by some external means, the states of the system are known
for a fixed period of time and each of the component densities and the
transition matrix can be determined separately or with supervision [Bel]
[M- 1][H-4].

As an example of the above class of estimation problems a POMS
is considered with a transition matrix which has identical row vectors.

That is
pij = q- j=1)2!"')M (19.391)

If £1 is given by a row of P, the probability state vector EN is

independent of time and xN, the random variable observed at time N,

has density

M
1300 = Z quj(x> vN (1.3.2)
1

Then the state conditional independence assumption implies {x1}: is
a sequence of independent identically distributed (i.i.d.) random vari-
ables. This is a well-studied case and the tools of classical estima-
tion theory have been used to develop several mixture resolving tech-

niques for various degrees of uncertainty about (1.3.2) [S-6][H-2].

Some of these methods are discussed in Chapters II and III.
However, for a general transition matrix and initial probability
vector the observations are neither independent nor identically dis-

tributed and the above techniques do not apply directly.
1.4 REVIEW OF THE LITERATURE

Research related to adaptive decision making for a POMS has
been motivated by the need to handle decision problems in which the
usual independence and stationarity assumptions mentioned in Sec. 1.3
do not hold. The results outlined in this section can be categorized
according to what is assumed to be unknown in the model.

For the case in which the only unknowns are parameters in the
component densities, most of the work consists of attempts to design
optimal (Minimum Bayes Risk) and suboptimal adaptive detectors for
communication systems wherein the signal is unknown but, for various
reasons, the observations are dependent even when the signal is known.
Some examples of conditions which result in dependent observations are
intersymbol interference or signal overlap due to channel memory
(Chang [H-l]), unknown symbol synchronization between the transmitter
and receiver (Stewart [H-3]) and random signal arrival times (Stewart
[H-3] and Nolte [N-l]).

The main problem associated with these examples is to establish
conditions under which the unknown parameters can be learned. In the
first two examples Chang and Stewart were able to establish such con-
ditions because they assumed signal activity at the transmitter was

5 .
independent from one time interval to the next. Their results do not

 

Optimal and suboptimal adaptive decision rules are defined at
the beginning of Chapters II & III respectively. The correspondence
between a POMS and a communication system is made in Chapter IV.

5 . . . .
This corresponds to assuming a special form for the tran31tion

10

apply when such dependencies exist. In the example of signals with
random arrival times Stewart was unable to prove convergence of his
estimates and Nolte does not treat the problem. These examples are
discussed in more detail in Chapter IV.

For the case in which the component densities are assumed known
but the transition matrix P is unknown Raviv [R-Z] constructed a
class of adaptive decision rules using an estimate of P and only
part of the past observations. He established conditions under which
P can be estimated and developed some properties of the observation
process for a large class of POMS. These properties are stated in
Chapter III.

Recently, Patrick [P-2] and Hilborn and Lainiotus [H-S] have
made some general observations concerning non-supervisory problems
with non-stationary, dependent observations. The work in this Thesis
related to their results was done independently and deals with a

particular model which yields more specific results.
1.5 THESIS OBJECTIVES

For most of the work in this Thesis it is assumed that the tran-
sition matrix P is known and that the component densities are known
to within a parameter.6 For this case, the Thesis objectives can be
stated generally as follows:

1. To find a class of optimal adaptive decision rules with

properties similar to those of the rules of Sec. 1.2.
2. To show that, under appropriate assumptions, virtually all

the mixture resolving techniques developed for the i.i.d.

 

6The estimation of P and f (X) for the nonparametric case is
discussed briefly in Chapter III.

11

case defined in Sec. 1.3 can be extended to the dependent,
non-stationary case considered here.

3. To show that a variety of decision problems arising in
communication systems, including those of Sec. 1.4, can be
easily solved by considering them as decision making pro-
blems for a POMS.

Fulfillment of Objective 1 provides a reference for evaluation
of any related adaptive scheme. Objective 2 implies conditions under
which the unknown parameters can be learned and leads to a class of
suboptimum decision rules. Objective 3 suggests that the model of a
POMS is a very versatile one, providing a unifying approach to a class
of communications problems.

Objectives 1, 2, and 3 are pursued in Chapters II, III, and IV
reSpectively. In Chapter V the main results of the Thesis are outlined

and problems that need further study are discussed.

CHAPTER II
OPTIMAL ADAPTIVE DECISION MAKING

When the component densities of a Partially Observable Markov
System (POMS) are specified to within a parameter set and a prior dis-
tribution7 summarizing initial knowledge about the unknown parameters
is available then Bayes decision-theoretic techniques can generally be
used to establish decision rules which are optimal against prior in-
formation and a given cost function. Furthermore, under appropriate
conditions, the fixed but unknown value of the parameter is learned
from the observations and the decision rule adapts or converges to
what the optimal rule would be if the true parameter value were known.
Patrick [H-2] derived the optimal decision rule for the i.i.d. case.

In Sec. 2.1, the decision problem is defined and in Sec. 2.2
the corresponding optimal decision rule is derived. In Sec. 2.3, the
basic components of the optimal rule are generated recursively and in
Sec. 2.4 the structure and computational feasibility of the iterative
scheme are discussed. In Sec. 2.5, the learning properties of the
rule are discussed and in Sec. 2.6 some inference problems related to
that defined in Sec. 2.1 are treated. Finally, in Sec. 2.7, the main

results of the chapter are summarized.
2.1 THE DECISION PROBLEM

In this section, the decision problem under consideration in
this chapter is defined. The basic elements of the problem are a POMS

with an unknown parameter set and a prior density for the parameters,

 

7 .
The standard Bayesian technique of treating the unknown parameter

as a random variable will be employed.
12

13

a class of decision functions and a criterion for evaluating the per-
formance of these functions.

As in Sec. 1.1, the state activity of the POMS is described by
a Markov chain {AN}: with transition matrix P and initial probability
state vector ‘21° The component densities are characterized by a para-
meter vector in the following manner. When the system is known to be
in state i and the parameter B1 is given, XN’ the random variable
observed at time N, has density f('/Bi).8 The random vector

B = [31...BM] takes on an unknown value B according to the prior

0
density pO(B) and maintains this value throughout system operation.
The basic assumptions are as follows.

, m
l. The observations X = [X1...Xm] are state-parameter-con-

ditional independent. This implies

p(Xm...X /x = 1,... A = j,B)

1 m 1 p(xm/>.m = 1,Bi)...p(x1/>.1 = j,Bj)

f(Xm/Bi)...f(X1/Bj) (2.1.1)
or

p(xm/1m=i,xm‘1,s) = p(xm/1m=i,Bi) = f(Xm/Bi) (2.1.2)

2. For each N, the random variables B and AN are independent.
That is, the parameter values do not affect system state activity.

If, as in Sec. 1.2, A is the action space, the class of allow-
able decision rule, D, is the set of all non-randomized functions map-
ping the space of observations Xm to A. If SN 6 D denotes a

decision rule for taking action relative to the state of the system at

 

The random variables Xi and Bi take values in a finite-
dimensional Euclidean vector space.

14

time N and L(°,') is a nonnegative loss function, the corresponding

risk is given by

R(6N) = E L(6N(xm),iN) (2.1.3)

. . . m
where the expectation is taken With respect to X , AN, and B.

The problem is to find the optimal rule, 6N which is defined

*9

by
R(e§) s R(6N) v/éN e D (2.1.4)

That is, the minimum risk decision rule for taking action concerning
the state of the system at time N on the basis of m observations

is to be found.
2.2 DERIVATION OF THE DECISION RULE

In this section, the optimum decision rule is derived for the
problem defined in Sec. 2.1. The development involves the use of con-
ditional risks which emphasize the role of prior information in con-
structing the total risk.

With the decision rule 6 E D, given observation Km, and given

parameter B the average loss is

R[6(Xm)/Xm,B] = z L[a(xm),1]P(iN=i/xm,s) (2.2.1)9
i=1
Equation (2.2.1) is referred to as the samp1e-parameter-conditional
risk. The parameter—conditional risk is given by
R(6/B) = IR[5(xm)/xm,B]p(xm/B)dxm (2.2.2)

 

N
9The superscript in 6 has been dropped for convenience.

15

and the total risk, (2.1.3), is

R(6) = j‘ R(6/B)pO(B)dB (2.2.

Substituting (2.2.1) and (2.2.2) into (2.2.3) and interchanging the

order of integration yields

12(5) = j‘ R[6(Xm)/Xm] p(x"’)c1xIn (2.2.
where
Ill m Ill
R[5(x )/x ] = *2 L[e(x"‘),i]p(iN=i/xm) (2.2.
i=1
and
P(>.N=i/xm) = f P(AN=i/Xm,B)p(B/Xm) dB (2.2.
P(B/Xm) = P(Xm/B)PO(B)/P(Xm) (2.2.
p(Xm) = I p(Xm/B)pO(B)dB (2.2.

Since 6(Xm) = a for some a E A, and L(~,') and p(Xm) are non-

negative it follows that

R(6*) = inf 11(5) = E R[6*(Xm)/Xm] (2.2.
66D
where
5*(xm) = ai if Mai/x”) s R(aj/Xm) Vj aé i (2.2.

In particular, when r ==M, ai is the action ”say AN=i" and

the loss function is given by L(ai,j) = 1 - Ai (0-1 loss function)

3

then the decision rule becomes a minimum probability of error rule 6e

defined by

66(xm) = i if P(xN=i/xm) 2 P(AN=j/Xm) Vi 34 j (2.2.

 

Minimum sample conditional risk implies minimum total risk.

3)

4)

5)

6)

7)

8)

9)10

10)

11)

16

The basic elements of the above decision rules are the state
posterior probabilities given by (2.2.6). These probabilities are
obtained by averaging the parameter-conditional posterior probabilities
P(AN=i/Xm,B) (This is the probability that would be used for decision
making with B = B if B were known) over the posterior density

0 O

p(B/Xm). This posterior density summarizes knowledge about B0 in
the first m observations. Both terms are generated iteratively in

the next section.
2.3 ITERATIVE GENERATION OF THE POSTERIOR DENSITIES

In this section, the key posterior densities in the decision
rule derived in Sec. 2.2 are generated iteratively. First, P(AN=i/Xm,B)
is generated for m = N, m > N and m < N. Then p(B/Xm) is treated.

Case 1. m = N

From the Bayes rule,

_, m-l =. m-l
p(xm/xm-1.B.x )Pam 1/x ,B)

 

P(Am=i/Xm,B) = m_1 (2.3.1)
B

p(Xm/X . )

where the three terms on the right hand side can be accounted for in

the following three steps.

1. p(Xm/Am=i,B,Xm-1) is given by (2.1.2)

2. By the total probability law

_. m-l _. m-l
m_1-J,X ,B)P(im_1—J/x ,B) (2.3.2)

m-l M
P<x =i/x ,B) = 2 P(i =i/A
m j=1 m

where (2.1.2) and Assumption 1 of Sec. 2.1 imply

. _. m-l _
P(Am-i/Am_1—J,X ,B) — pji (2-3-3)

17

_. m-l
and P(im_l—J/x ,B)

is available from the previous step

of the iteration scheme.

3. Again by the total probability law and (2.1.2)

p(xm/xm'1,s) = 2 f(Xm /B »)P(i m/ m 1,3)

i=1

Case 2. m >‘N

(2.3.4)

Using arguments similar to those used in the previous development, it

follows that

p(xN+1...xm/AN=i,xN,E)p(iN=i/xN,E)

 

—.' m _
P(AN—i/X ,B) -

p(xN+1...xm/xN,B)

(2.3.5)

where p(AN=i/XN,B) is known from the previous iteration using steps

1,

The iteration procedure would be complete if p(XN+l...Xm/AN

C011

2, 3 above and with m replaced by N

M

N
P(XN+1...Xm/X ,B) = iElp(xN+l...xm

=- N =. N
/XN 1.x ,B)P(>.N 1/x ,B)

(2.3.6)

N
=1,x ,B)

Id be determined. This factor should be recognized as the heart of

the look-ahead procedure that results from using future observations.

To generate this last factor it is convenient to define

Then, by (2 1 2),

But

p<zN/1N=i.xN,B)

M
p(zN/xN=i.B> = E p(zN/1Nm
i=1
M.
= 2 MN /1N=

j 1

= p(zN/1N=i,B)

.... =1 BMW/Ir . ...

’KN+1=j’B)pﬁ

N
z = [xN+1...xm

(2.3.7)

(2.3.8)

].

18
Similarly,
N . . N . .
p(Z /XN=1,XN+1=J,B) = E pjkp(Z /kN=1,XN+1=J,AN+2=k,B) (2.3.9)
Hence this procedure can be repeated until

N _, = = N =.
p(Z /xN-1...km_l t,B) E ptqp(z /XN 1,K

N+l=j,...,km=q,B) (2.3.10)

where

N -' =' = =
p(Z /XN-1,XN+1 J,...,Xm q,B) f(XN+1/Bj)...f(Xm/Bq) (2.3.11)

Although the above procedure shows the decision rule to be a fixed-
memory rule (the number of observations stored is fixed) the memory in-
creased linearly with m-N and the number of computations in (2.3.8)
to (2.3.11) increases exponentially with m-N. Hence, the look-ahead
is expensive.

Case 3. m_< N
Past samples are being used to make decision about future states. For

m < N

lltdCZ

_' m _
P(xN-1/x ,B) -

_. =. m =. m
j P(xN-1/xm J,X ,B)P(Xm J/X ,B) (2.3.12)

1

But, by (2.1.2),
—. a... m _ :0 =0

t - , .
which is the (j,i) h element of PN m and P(Xm=J/Xm,B) can be
obtained iteratively using steps 1, 2, 3 above. In this case, the
number of computations involved in making decisions on future states

increases linearly with m-N.

19

Finally, an iterative form for p(B/Xm) is established.

p(xm/xm'1,B)p(B/xm'l)

 

p(B/Xm) = (2.3.14)

p(xm/xm’l)

where p(Xm/Xm-1,B) is given by (2.3.4), p(B/Xm-l) is available from

the previous step and

p(xm/xm'l) = f p(xm/xm’1,3)p(B/xm’l)d3 (2.3.15)

2.4 ANALYSIS OF THE ITERATIVE PROCEDURE

In this section, a special but important case of the previous
decision rule is studied. The iterative structure of the rule is in-
vestigated and interpreted. The start of the iterative procedure is
illustrated and the problems encountered in machine implementation of
the rule are discussed.

If m=N and a 0-1 loss function is used then the result is the
minimum probability of error rule, be, given by (2.2.11), for determin-
ing the present state of the POMS, described in Sec. 2.1, on the basis

of past and present observations. This rule is summarized below.
6e(xN) = i if p(lN=i/xN) 2 P(>.N=j/XN) Vi aé j (2.4.1)
where, from (2.2.6),
P(xN=i/xN) = f p(xN=i/xN,B)p(B/xN)dB (2.4.2)
with, from (2.3.1) and (2.3.2),

f(xN/Bi)P(xN=i/XN'1,B)
p(xN/XN’1,B)

(2.4.3)

 

P(xN=i/XN,B) =

and

20

M
. N-l , -
P(lN=i/x ,B) = jEIPjiP(XN_1=J/XN 1,3) (2.4.4)
and, from (2.3.14),
N p(xN/XN'1,B>p(B/XN'1)
p(B/X ) = (2.4.5)

 

p(XN/XN'I)
All the terms in (2.4.2)-(2.4.5) are either known, available from the
previous step or can be calculated from those given above.

The rule 6e is a fixed memory, iterative, optimal decision
rule. The parameter-conditional state posterior probability,
P(AN=i/XN,B), is the state posterior probability given by (1.2.3)

(where B0 was assumed known) as a function of the unknown parameter.
Equations (2.4.3) and (2.4.4) generate this term iteratively in a manner
similar to that of Sec. 1.2 but conditioned on knowledge of the unknown
parameter. That is, this probability at time N-l is projected with
the transition matrix in (2.4.4) to P(XN=i/XN-1,B) which is used in
the Bayes rule in (2.4.3) in incorporate the information provided by

XN and to generate P(XN=i/XN,B). Everything in the procedure is con-
ditioned on knowledge of B0 and information about B0 is summarized
by P(B/XN) which is generated iteratively in (2.4.5) and is intro-
duced into the decision rule by the averaging procedure given in
(2.4.2).

The starting procedure for the iterative scheme is given below.
At step 1
£(x1/Bi)P(kl=i)

p(Xl/B)

P(x1=i/x1,B) =

 

(2.4.6)

where

then

where

At step 2

where

and

then

where

21

M
p(Xl/B) = 2
i:

1£(xl/Bi)P(>.1=i) (2.4.7)

p(XllB)p0(B)
p (X1)

 

P(B/X1) =

p(X1) = I p(Xl/B)pO(B)dB

fi(x2/Bi)P(l2=i/XI,B)
p(xz/x1.B)

 

P(x2=i/x1,x2,B) =

m
p(Xz/X1,B) = iElfi(x2/Bi)P(x2=i/xl,3)

P(x2=i/x1,B) = § pjiP(x1=j/x1,B)

p(XZ/X1,B)p(B/X1)
p(x2/x1)

 

P(B/X1,X2) =

p(lexl) = I P(X2/X1,B)P(B/Xl)dB

this procedure can be repeated up to time N.

Despite the desirable features of the above rule it has one major

drawback.

At each step, m+l functions of B must be stored. If this

rule is to be machine implemented it can only be done under one of the

following conditions which are characteristic of general Bayesian learning.

22

l. The parameter vector B is a discrete random variable taking on a
finite number of values. That is, B0 is known to be one of a
finite number of values. This allows storage of the required
densities but is a highly restrictive assumption.

2. The parameter vector B is a continuous random variable but a
finite dimensional sufficient statistic is available for the un-
known parameter. Under these conditions, only a function of the
observations need to be stored [8-5]. However, such statistics do
not usually arise in unsupervised learning problems because the
class of densities involved are mixtures.

3. In general, the only way to make use of the rule is by quantization
of the parameter space, thus reducing the problem to case 1 above.
By quantizing fine enough it is possible to get arbitrarily close
to the optimum solution at the expense of increased memory [F-l].
However, the memory grows exponentially with the dimension of the
parameter vector B, making the method feasible only for problems
with a small number of parameters. An example is given in Sec.

3.6 which indicates the extent of the storage limitations.
2.5 THE LEARNING FEATURE OF THE OPTIMUM DECISION RULE

In this section, some limiting properties of the posterior
densities that comprise the optimum decision rule (2.2.10) are dis-
cussed in a manner that exhibits the learning capability of the rule.
First it is shown that the state posterior probability P(XN=i/Xm)
given by (2.2.6) with N fixed converges with probability one as the

number of observations is increased. Then, the conditions under which

23
p(B/Xm )m 6(B- B 0) wp 1 (2.5.1)

where 6(B-BO) is the Dirac delta function are stated. Equation (2.5.1)
can be restated as saying that the posterior distribution of B, which
summarizes all the information about B0 contained in Xm, approaches

wpl a distribution whose mass is grouped about BO so that B0 is

learned. Then, for all practical purposes,

p(XN=i/Xm) 3 p(lN=i/x°,BO) (2.5.2)

and the rule adapts, or converges, to the optimal rule that would be
obtained if B0 were known.

The statistical stability of P(XN=i/Xm) = Ym can be demonstrated
by showing {Ym} to be a bounded martingale. Then, convergence follows
immediately from a theorem of Doob which says that every bounded martin-

gale converges with probability one [D-Z]. To show Ym to be a martin-

gale it is sufficient to prove that
m
EEYm+1/x ] — Ym
But
_, m+1 m+1 m
[YmH/xml —IP<xN-1/x >p(x /x)<1xm+1

m+1/).N=i)P(XN =i)

 

 

=fp( pp; de+
p<x m+1) p(x'“)
p(xm/ =i>P( =1)
= N N.
p(Xm) m

Since for each m Ym is a probability,

|Ym|s1 vm

 

1The implied interchange of the limit and integration process
has been carried out formally and (2.5.2) has not been proven for any
mode of convergence.

24

and Ym is bounded. Therefore lim Y exists a.e. in the space of
m m
sequences {X }.

The conditions under which the parameter posterior density
approaches a delta function are well known. Spragins [8-5] and Braver-
man [B-3] have given sufficient conditions for (2.5.1) to hold through
an interpretation of the 0-1 law of probability [L-l]. These conditions
are presented below.

N
1. p(B/X ) is computed using the Bayes rule.
2. p0(B) is positive in some sphere about BO

G
3. There exist functions {fm(Xm)}1 of the observations such

that

m
fm(Xm)'- B0 wpl

That these conditions are satisfied for decision rule (2.2.10) is now
demonstrated. Condition 1 follows from (2.3.7). Condition 2 is assumed.
Condition 3 is the major requirement and can be restated as saying
that a strongly consistent estimator for BO must be exhibited. In
Chapter III, a variety of such estimators are established by placing
constraints on the transition matrix and the family of component

densities.
2.6 RELATED INFERENCE PROBLEMS

Some of the properties of the previously-derived decision rule
can be used to solve additional decision problems of interest. For

example, in the adaption process, B0 is learned but estimates of B0

are never actually generated. Since the parameter posterior density

converges wpl to a dirac delta at B any property of this density,

0,

25

such as the mean, maximum or median, converges wpl to B0 also.
Furthermore, it is well known that the mean of the posterior density

is the minimum mean-square estimate of B and hence can be interpreted

O
as an optimal estimate for B0.
Another decision problem of interest is that of determining
which class or subset of the states the system is in. In particular,
if the transition matrix P is block diagonal with q blocks, the
states of the system are divided into q classes with the property
that the system stays in the class in which it starts.12 Then, if
P(wi/Xm) is the conditional probability that the system is in class i,

I?(W./XN) = 2 P(x =j/xN) i = l,2,...,q (2.6.1)
1 jEv. N

1

where Yi is that subset of the state Space corresponding to class i.
The probability P(XN=j/Xm) can be generated iteratively in the manner
of Sec. 2.3 and an optimal decision rule for choosing the system class
on the basis of XN is to pick the class for which the class posterior

probability P(wi/Xm) is largest. In order to exhibit some of the

properties of the above decision rule, (2.6.1) is rewritten as

 

N-l N-l
P( /W.,X )P(w./X )
P(wi/XN) = K” 1 N_1 1 (2.6.2)
p(XNﬁX )
where
- q - -
p(XN/XN 1) = i§1p(xN/wi,xN 1)P(wi/XN 1) (2.6.3)
and

 

2 l O O l O
The corresponding Markov chain 15 said to have q noncommunicat-
ing classes.

26

N-l , ' - - -
p(XN/wiJC ) = jESI p(igq/XN=J,wi,Bl,XN 1)1=(>.N=j/wi,xN 1,131)
i

l

p(Bi/wi,xN' )dBi (2.6.4)

. i . . . .th
w1th B the parameter vector for the component den31t1es 1n the 1

class and {P0(wi)}q, the prior class probabilities, assumed known.

Then (2.6.2) indicates that the class posterior probabilities can be
generated using the Bayes rules with normalizing factor given by (2.6.3).
All the terms in the mixture (2.6.4) are conditioned on knowledge of

the class and thus can be generated iteratively using only the block

of the transition matrix and component densities corresponding to the
given class. By an appropriate interpretation of Spragins' conditions

for convergence, it follows that if P0(wi ) > O and there exists a

0
strongly consistent estimator for i0 then
p(w./x‘“) 9—. A,, wpl (2.6.5)
J J10

where 10 is the true class. Conditions for such an estimator to
exist are discussed in Sec. 3.8.

Finally, the above two decision problems can be combined into
that of estimating the unknown parameters defining the class in which
the system is. If the parameter vectors are the same dimension for

each class, the optimal estimate is given by the mean of

q
p(B/XN) = >3 p(B/wi,XN)P(wi/XN) (2.6.6)
i=1

This procedure is referred to as adaptive estimation and conditions

for convergence are given in Sec. 3.8.

27
2.7 CONCLUSIONS

Optimal decision making for a POMS with unknown parameters in
the component densities has been the topic of this chapter. Assuming
a prior distribution over the parameter space, an optimal decision rule
was defined in Sec. 2.1 to be that rule in a given class of rules which
minimizes the Bayes risk (2.1.3). This class of rules, similar to
that of Sec. 1.2 where the component densities were assumed known, in-
cludes rules with the capability of using some future observations
(look-ahead mode) as well as past observations to make decision about
the state of the system at a given time.

The optimal decision rule (2.2.10) was derived in Sec. 2.2 and
its basic components (2.2.6)-(2.2.8) were generated iteratively in
Sec. 2.3. It was shown that for the look-ahead mode the memory grows
linearly and the number of computations exponentially with the number
of future samples used. However, the extent to which future observa-
tions affect the risk needs investigation. It is clear from the
derivation that these results can be extended to include time-varying
transition probabilities.

As emphasized in Sec. 1.4, the optimal decision rule is a fixed-
memory, iterative rule with a structure similar to that of the rules
in Sec. 1.2. In general, only a high storage quatization procedure can
be used to implement the rule. This suggests that the main use of the
optimal rule may be to evaluate related suboptimal, low storage rules
in the hope that conclusions can be extrapolated to the more complicated

cases .

28

In Sec. 2.5, it was shown that, under the stated conditions,
the parameter posterior density (2.2.7), which summarizes the knowledge
about BO, converges to a dirac delta function. Thus, B is learned

0
and the rule adapts.

Finally, in Sec. 2.6, various results established in the pre-
vious sections were used to solve related inference problems. In
particular, a class of estimators for B0 was established including
an optimal (minimum mean square) estimator. An optimal decision rule
was constructed for determining to which class the state of the system
belongs among a set of noncommunicating classes of states. An optimal
adaptive estimator was constructed for the parametersin the component

densities associated with the active class of states. These examples

indicate the versatility of the decision problem of Sec. 2.1.

CHAPTER III
CONSISTENT ESTIMATORS

When the component densities in a Partially Observable Markov
System (POMS) are specified to within a parameter vector, but no prior
distribution for the parameter is available, optimal decision making
as defined in Sec. 2.1 is no longer possible. However, the observations

still contain information about B the true but unknown value of the

O’
parameter. If a strongly-consistent estimator for BO ( a function of
the observations that converges with probability one to B0) can be
found, decision rules can be constructed by treating the estimate at a
given time as if it were the true value. The decision rule of Sec. 1.2,
where the component densities were assumed known, can then be used.
Hopefully, as more observations are taken, decision rules constructed
in this manner adapt or converge to the optimal rule, which uses the
true component densities.13 Such rules, unlike those of Chapter II,
may not extract information about B0 in an optimal manner but do
approach the optimum as more observations are taken.

There are several other reasons for investigating consistent
estimators, some of which are listed below.

1. Even when a prior distribution on the parameters is avail-
able suboptimal rules may be easier to implement than the high-storage
quantization procedure of Sec. 2.4 implicit in the optimal rule.14

2. In Sec. 2.5, it was shown that a strongly-consistent

estimator for BO must be exhibited to ensure that this parameter

 

This convergence problem is discussed in Chapter V.

4Optimal and suboptimal rules are compared with regard to implemen-

tation in Chapter V.
29

30

will be learned during the operation of the optimal rule. Hence con-
ditions for the existence of such estimators are important.

3. In some applications the true parameter values may be of
interest in themselves. The estimators suggested in Sec. 2.6 not only
require parameter prior distributions but suffer from the implementation
difficulties mentioned in 1 above.

This chapter deals mainly with the problem of finding strongly-
consistent estimators for the parameters defining the component densities
of a POMS. In Sec. 3.1 the estimation problem is defined and some im-
portant assumptions are discussed. The properties of the observation
process are listed in Sec. 3.2 while estimation tools are developed in
Sec. 3.3. In Sec. 3.4 and 3.5, estimation techniques developed for the
case of independent identically distributed (i.i d.) observations dis-
cussed in Sec. 1.3 are extended to the more general, dependent, non-
stationary case under study in this chapter. Section 3.6 provides
computer-simulated results for an example illustrating some of these
estimation techniques. Generalizations of the problem defined in Sec.
3.1 are discussed in Sec. 3.7 and 3.8. Finally, Sec. 3.9 summarizes

the main results of the chapter.
3.1 THE ESTIMATION PROBLEM

The problem of finding a strongly-consistent estimator for B0,
the unknown parameter in a set of component distributions, is precisely

stated and some of the assumptions necessary to ensure the existence

of such estimators are discussed in this section.

31
A POMS similar to that of Sec. 2.1 is considered.15 The state
activity is described by a transition matrix P and initial probability
state vector 21° The distribution of the observed process {KN}: is
defined by a set of distinct univariate component CDF's {Fi(°)}? corre-
sponding to the component densities of Sec. 1.1; Fi(-) is assumed to
be an unknown element of the family' 8 '=:{F(X;c:v)}cvEA indexed by a
point a in a subset A of the real line.16 As indicated in Sec.
1.1, XN, the random variable observed at time N, has CDF KN(X) which
' a l t f th f ' = ;B
13 n e emen o e set 0 mixtures HN {KN(X )}B65' where
M
KN(X;B) = 2 F(X;B,)P(A =i) (3.1.1)
. 1 N
1=1
In (3-1-1) B = [Bl’B2’°°°’BM] and '3' is the set of all such vectors
with distinct components Bi 6 A. That is 5” is a subset of M-

dimensional Euclidean space. Then there exists at least one point

B0 6 5' such that

KN(X) = 1%(x;30) VN (3.1.2)

The problem is to find Bm(Xm), a function of the observations

m . .
x - [x1,...,xm], such that

. . m _ .
P<,Bm(x'“) —. B0) - 1 (3.1.3)

The major assumptions necessary for the construction of such
estimators are listed below. They are assumed to hold throughout the
chapter and will be referred to when needed.

a D

A-l. The observed process {Xi}1 is state-conditionally 1n-

17
dependent.

 

5 .
Here B is no longer being treated as a random variable.

6 . . .
Scalar observations and parameters are assumed for Simplic1ty
but all ideas extend to the vector case.

7 . . .
This corresponds to the assumption of state-parameter cond1tional

32

A-2. For each X, F(X;v) is a continuous function on A and

for each a, F(°;a) is continuous in X; i = 1,2,...,m.
A-3. B0 is an interior point of 18 and 18 is a compact sub-
set of 5' .

A-4. {AN}: is a regular Markov chain.18

Assumption A-l was introduced in Sec. 1.1 and is repeated here
for convenience. It is a key assumption in developing the estimation
tools of Sec. 3.3. Assumptions A-2 and A-3 are standard requirements
for developing the existence and convergence properties of estimators.

Assumption A-4 characterizes the class of POMS to which the
techniques of this chapter apply. It ensures that the observations

contain information about all the components of B More explicitly,

O”
a regular Markov chain has the property that it is possible to be in
any state after some finite number of steps no matter what the starting
state [K-4]. Hence, among a large number of observations there will
be a large number of representatives from each of the component dis-
tributions. As shown in Sec, 3.2, this assumption also implies the
asymptotic ergodicity of the observed process, which is the basis of
the estimation strategy to come.

In addition to assumptions A-l through A-4 listed above, a

uniqueness condition on B is required. This is a standard assump-

0
tion in estimation problems and takes different forms depending on the
method used. For the estimation problem defined above a special unique-

ness condition arises since, in general, mixtures do not have a unique

decomposition into allowable component distributions. So (3.1.2) may

 

8 . . .
1 A regualr Markov chain IS one that has no tran51ent states and

only one ergodic class with no cyclically moving sub-classes.

33

not define BO uniquely. To avoid this type of ambiguity the mapping
defined by (3.1.1) from 18 onto a subset of HN’ (for a fixed family
of component distributions) must be one-to-one. This subset of HN

is then an example of an identifiable class of parameter-indexed mix-
tures [Heiﬂ. The concept of identifiability is discussed in the
Appendix where it is shown that this property, which is necessary for

all the methods of this chapter, can be ensured by placing constraints

on the family of component distributions 8 and the parameter spaceJBt
3.2 PROPERTIES OF THE OBSERVATIONS

In order to establish estimation strategies it is necessary to
study some properties of the observation process {Xi}:. In this
section, properties of the system state activity and the resulting
observations are developed by making use of assumptions GX-l) and 0A-4).

When {xi}: is a regular Markov chain with transition matrix
P and initial probability state vector 21, it has the following three

properties [K-4].

1. There exists a unique vector 5.: [ﬂ1,...,ﬁM] such that
3:31P 1 (3.2.1)
2 f - {1}” 't' ~
. I 21 - E: i 1 1; a -ta 1onary process.
_ N-lN
3. pN-pN_lP-_91P 4:; V31 (3.2.2)

The vector 3 is called the stationary probability state vector.

Given P, (3.2.1) can usually be solved directly for 3. Otherwise

(3.2.2) suggests a convenient algorithm for approximating 3,
Properties 2 and 3 indicate that a regular Markov chain is

asymptotically stationary. Such activity in a POMS suggests that the

34

observations might exhibit some form of asymptotic stability. In fact,
va is the set of mixing parameters for the mixture KN(X) which

governs the observations at time N. Hence properties 1 and 3 imply

M
there exists a unique mixture cdf Kh(X) = 2 Fi(X)'n’i such that
l
M N
- X S - =' -0 .
sup |1q<x> Kﬁml . Ini P(>.N1)| o (3.23)

-”<X<m i=1
Hence the cdf of XN’ the random variable observed at time N, uniformly
approaches a unique limit mixture; Kﬁ(x) is the lSt order distribution
of the stationary process which results when 21 = E.
To gain further insight into the statistical structure of the
observation process, two results proved by Raviv [R-Z] are presented
below in slightly modified form.

Lemma 3.2.1 Under assumption QA-l) each random variable in the

process {Xi}: can be represented as

21,, = ¢<MN>
where ¢(-) is a Baire function and {Mi}: is a Markov process satisfy-
ing Doeblins Hypothesis.19

Lemma 3.2.2 Under assumptions (A-1) and (A-2), if 21 = E. then
{Mi}: and {Xi}: are stationary and ergodic (metrically transitive)
processes.

The above lemmas and property 3 imply that the observation pro-
cess is asymptotically ergodic for a class of POMS defined by assump-
tions A-1 and A-4. Furthermore, for this class of systems the limiting
behavior of the observations is independent of the initial system

activity.

 

19 Doob [D-Z], pg. 192 .

35
3.3 THE BASIC TOOLS OF ESTIMATION

When the observation process in an estimation problem is a
sequence of independent, identically-distributed (i.i.d.) random vari-
ables, the key tools for constructing estimators are the Law of Large
Numbers (LLN) and the Glivenko-Cantelli Theorem (GCT). If the obser-
vation process of Sec. 3.1 is ergodic, the ergodic theorem establishes
appropriate extensions of these tools. The fact that the observation
process for the class of estimation problems considered in this chapter
is asymptotically ergodic suggests that these important tools might
be extended to this case. In this section the needed generalization
of the LLN and the COT are established and a basic estimation strategy
is stated for the problem of Sec. 3.1.

Using lemma 3.2.1 and assumption A-4, Raviv [R-Z] has essentially
proved the following theorem for the observation process {Xi}: of a
POMS with arbitrary initial probability state vector 21.20

THEOREM 3.3.1 If g(') is a Baire function integrable with

respect to Kﬁ(X) then

1 M N
- '2: g(X.) _. E g(X) WP1 (33.1)
N 1 ﬂ
1 o
where
E x ~13. I (mm (x) (3 3 2)21
"Om ) - 1 Tri s 1 - -

Theorem 3.3.1 is the required extension of the LLN. It provides a
means for establishing strongly-consistent estimators for any expec-

tation of Kh(X).

 

20The proof which consists of a direct application of Theorem 6.2
of Doob appears as part of the proof of Raviv's Lemma 2.5.

21ETT will be used to denote expectation with respect to the true
limit mixgure of the POMS; i.e. the distribution with cdf Kh(X) = K%(X;BO).

36

Q
The problem of estimating Kﬁ(x) from the observations {Xi}l
is now considered and an extension of the GCT is presented.

The function

A N
KN(X) = l/N killlrerXJ(xk) (3.3.3)

where IA(-) is the indicator function on the set A is called the

empirical distribution function; RN(x) is the proportion of samples

from the set XN = [X1,...,XN] that are less than or equal to x.
Since
= S
En I[r:er ](X) Kh(x0) 1
O O
and22

n

E OIErerxO](X) = Kh(XO-O) S 1

the following lemma is an immediate consequence of Theorem 3.3.1.

Lemma 3.3.1 For every real x

12
H

P1RN(x)

Kn(X)} =

II
...:

12

P{KN(x-0)

Kn(x-O)}
Lemma 3.3.1 is the key step in extending the GCT.
THEOREM 3.3.2

P{ sup | (x) - K <x>| E o} = 1
-a<x<m KN n

With Lemma 3.3.1 the proof is exactly the same as that of the classical
theorem [T-Z] and will be omitted.
Theorems 3.3.1 and 3.3.2 lead to a general estimation strategy

for determining B As indicated by the discussion of Sec. 3.1 and

O.

 

22f(x-O) = lim f(x-E); e > o.
eso

37

3.2 the unknown limit mixture Kﬁ(X;Bo) is an element of the class of

mixtures HTr = {Kn(X;B)}B66 where
M
Kn(X;B) = fF(X;Bi)1Ti

m
The main idea is to treat the process {X111 as if it were a sequence

 

 

of indgpendent random variables with common CDF Kﬁ(X;Bo)' The schemes

 

developed for the i.i.d. case are then used with the classical conver-
gence tools replaced by Theorems 3.3.1 and 3.3.2. This procedure is
illustrated in Sec. 3.4 and 3.5. The general format is to establish
an estimator, present a convergence theorem and evaluate the results.

Assumptions A-1 and A-4 will be assumed through the rest of the chapter.
3.4 CLASSICAL ESTIMATION METHODS

In this section, the principle behind the classical method of
moments and the maximum likelihood method are used to solve the
estimation problem defined in Sec. 3.1. Both methods involve the use
of sample averages whose convergence is guaranteed by Theorem 3.3.1.

The method of moments has been used frequently for mixture re-
solving [B-Z][C-5][P-3][R-l][R-4] and is treated in the most general
form by Chien and Fu [0-2]. The procedure requires equating M sample
moments to the corresponding population moments of Kn(X;B). A solution,
if one exists, of the resulting equation in B can then be taken as an
estimate for BO. More specifically, if F(B) is a vector of popula-
tion moments of Kn(X;B), which are assumed to exist, and Tm is the

corresponding set of sample moment constructed from the observation

Xm, then by Theorem 3.3.1

38

T Wpl (3.4.1)

where T0 is a vector of moments of the true limit mixture Kn(X;B0).

The set of existing solutions of
Tm = F(B) (3.4.2)

may contain candidates for strongly-consistent estimates of B In

0'
order to obtain a more explicit result the following uniqueness con-
dition is assumed. The function F(-) is a one-to-one mapping from

18 onto T, a subset of M-dimensional Euclidean Space. Under this con-
dition, a consistent estimator can be constructed as indicated by the
following theorem.

THEOREM 3.4.1 If the above uniqueness condition is satisfied
then, with probability one, there exists a sequence of solutions of
(3.4.2), say {Bm}:O, which converges to B0.

Proof. Assumption A-2 implies that F(°) is a continuous
function on 18. Hence by the uniqueness condition F-1(-) is defined
and continuous [R-S]. Equation (3.4.1) and the fact that B0 is an
interior point of .8 imply that for every sequence of observations,
except possibly those in a set of measure zero, there exists an m0
such that Tm E f for m > m0. Thus for m > m0, Bm = F-1(Tm) and
converges to 30'

In general (3.4.2) can have more than one solution and prior
information concerning the particular application must be used to se-

lect a convergent sequence of solutions. This corresponds to restrict-

ing 8' to B as given in the uniqueness condition.

39

Equation (3.4.2) can pose some difficult computation problems
since, in general, (3.4.2) is a set of complicated nonlinear equations
and an iterative algorithm for finding a zero of a function must be
used at each step. Since the sample moments can be generated iteratively,
the storage requirement is fixed. When B can be found as an explicit
function of Tm the resulting estimator is very desirable with respect
to implementation.

The maximum likelihood method has been used by Patrick [H-Z]
for mixture-resolving. He treated a mixture of Gaussian distributions
in detail. The version of maximum likelihood estimation given here is
essentially a modification of that given by Wilks [W-l].

The following notation is introduced.

6 Log dKrr (X ;B)

 

Sj(X;B) = a Bj j = 1,2,...,M (3.4.3)
§(x;B) = [81(X;B),...,SM(X;B)] (3.4.4)
AJ.(B;B ) = fsj (X;B)dKﬂ(X;B ) (3.4.5)
5033') = [A1(X;B),...,AM(X;B)] (3.4.6)

Now, Kn(X;B) is said to be regular in 73 with respect to its first

derivative if
Es(x-B)EA(B-B)=§-—J”dx(x-B)sovBes j=12...,M (3.4.7)
TTj ’ j , aBj TT ’ , , 3

Under this condition Theorem 3.3.1 implies

m
_S_(Xm;B) E l/m 2: §(Xi;B) 9 A(B;BO) (3.4.8)
1

40

Since A(BO,BO) = O a reasonable strategy for generating a con-

sistent estimator is to choose among the roots of
m
§(X ;B) = 0 (3.4.9)

This idea is made more explicit in the following theorem.
THEOREM 3.4.2 If KH(X;B) is regular in 16 with respect to
its first derivative, then there exists a sequence of solutions of

(3.4.9) which converges with probability one to B In particular,

0'

if (3.4.9) has a unique solution Bm for m > m then the sequence

0
{Bm3zo converges with probability one to B0.
The proof, which is in the same spirit as that given for the
method of moments and which follows from (3.4.8), is analogous to that
given by Wilks and will be omitted. The same comments can be made about
the uniqueness condition and the computational problem of finding roots.
of (3.4.9) as were mentioned for the method of moments. However in
this case, all the samples must be stored and the form of (3.4.9)
changes at each step. The same problem was faced in the i.i.d. case
[K-2]. A
The interpretation of the above method as a maximum likelihood
method is obvious if the likelihood function is taken to be the product

K%(X1;B),...,Kh(xm;B). Then, under suitable restrictions, the value

of B which maximizes the likelihood function satisfies (3.4.9).
3.5 MINIMUM DISTANCE ESTIMATION METHODS

In this section, the minimum distance principle is used to con-

struct estimators for B0. Given an appropriate distance measure

p(',') and the empirical distribution function for Kﬁ(X;Bo)’ HN(X)

41

defined in (3.3.3), and estimator Bm can be defined as an element in
18 which minimizes p(Rh(X),Kﬁ(X;B)). Theorem 3.3.2 is the main tool
for establishing convergence. This method has been used for mixture
resolving by Patrick [H-Z], Stewart [H-B], Choi [0-3], and Deely and
Kruse [D-1] and their ideas apply to the problem.

Patrick, Stewart, and Deely used the sup norm for a distance
measure.23 Analogously, the estimate Bm is defined by
:63 Himm - Kn(x;B)H = HKm(X) - Kn(x;Bm)H (3.5.1)
The existence of Bm is ensured by A-2 and A-3; i.e. Bm is obtained
by minimizing a continuous function over a compact set. The following
theorem establishes the consistency of such estimates.

THEOREM 3.5.1 If Hn = {Kﬂ(X;B)}BEB is an identifiable class
of mixtures, then Bm’ defined by (3.5.1), converges with probability
one to B -

0
Proof. If Qm(B) = HRm(X) - Kﬁ(X;B)H then (3.5.1) and Theorem

 

3.3.2 imply
Qm(Bm) S Qm(BO) 9 o wpl (3.5.2)
Since
Hxﬁ(x;30) - Kﬁ(X;BO)H s Qm(Bm) + Qm(BO) (3.5.3)
it follows that
Kn(x;Bm)-‘9Kn(x;30) wpl (3.5.4)
23Hs<X)H= ... Igool.

-m<x<m

42

Assumption A-2 and A-3 and the fact that HTT is identifiable imply
that the mapping defined by Kﬁ(X;B) from 16 to Hn is a one-to-one,
continuous mapping from a compact set to a Hausdorf space. Hence the
inverse exists and is continuous [R-S]. Thus (3.5.4) implies Bm|9 B0
wpl.

Another estimator analogous to that given by Choi can be con-

structed using the distance function

sm<B> =f<1<n<xm> - Rm<x112di<m<x>
Then Btn is defined as the element of 16 which minimize Sm(B). Under
the same conditions as those in Theorem 3.5.1, the existence and strong
consistency of Bm can be demonstrated from Theorem 3.3.2 and arguments
analogous to those given by Choi.

The conditions needed to establish the existence of the above
estimators are weakest considered in this thesis under which a strongly
consistent estimator for BO can be established. Hence they are key
assumptions which guarantee the learning property of the optimal rule
given in Chapter II. Identifiability is a necessary condition for the
uniqueness conditions required by the method of moments and the maximum
likelihood method but, in general, is not sufficient.

The minimum-distance methods require storage of all samples and
the use of a computational algorithm at each step. Such procedures

have been developed for the i.i.d. case and can be applied here [C-3]

[D-l].

43
3.6 AN EXAMPLE

In this section, a simple example of a POMS is used to illustrate
some of the estimation techniques developed previously. Estimators are
established from the method of moments, the maximum likelihood method
and the optimal Bayes method of Sec. 2.6. To illustrate some basic
properties of these estimators, results of computer simulation are pre-
sented.

The example consists of a POMS defined by the following.

.25 .75

24 . .
regular tran31tlon

Matrix .40 .60

initial probability
21 = [.7 .3]

state vector

 

1 e-ko-ZOH-B)2

component densities, f1(X;B) =
2n 0

Gaussian in form 2 2
_1_ 8-..; (M)

f (X;B) =
2 Ma

The true mean B0 is assumed to be in the interval (0,4). .The sta-

tionary probability vector is easily calculated as
g,= [15/23 8/23]

and hence the limit mixture has the form

 

-2 2 -2 2.
1 [15/23 e’$0 (X+B) + 8/23 e'kc (X'B) ] (3-5-1>

f (X;B) =
" /35 o

It is well-known that all moments of a Gaussian distribution exist and

Patrick [H42] has shown that a mixture of Gaussian distributions with

 

24As indicated in Chapter IV, if all the elements of a transition

matrix are positive the corresponding Markov chain is regular.

44

unknown means is regular with respect to its first derivatives. Hence,
the method of moments and the maximum likelihood method can be applied
here.
A moment estimator is easily obtained using the second moment
2

2
equation snx = B + 02. Then, Theorem 3.4.1 implies that the follow-

ing estimator is strongly consistent.

B' =v/S S 2 0
m m

m
= < o o
O Sm 0 (3 6 2)
= 2 S > 4
m
m
where S = 1/m 2X? - 02.
m , 1 1
1:

As indicated in Sec. 3.4, the maximum likelihood estimate 33'

is defined as a solution to the following equation in the interval [0,4]
-2 ' 2 -2 2
- X.+B- x.-B)
m -15(Xi+B)e 15° ( 1 ) 35° ( 1

+ 8(Xi-B)e-
2

 

= 0 (3.6.3)

1 1 -31,;<7'2(Xi+B)2 450'2 (xi-312

02(15 e + 8 e )
Theorem 3.4.2 guarantees a sequence of roots that converges. Moreover,
the computer solution25 of (3.6.3) indicated only one root in [0,4]
for most values of m.

The Bayes estimator is obtained by quantizing the interval [1,3]

2
into 50 levels, .04 apart.6 The estimate is defined by
1.1 50
m

B = E B p(B /X) (3.6.4)
m k k

k=1

, th . . .

where Bk is the k quantization level and p(Bk/Xm) is generated

iteratively using equations (2.4.3)-(2.4.5) with B replaced by Bk'

 

2SUnder the conditions stated below.

6 . . . .
Additional prior information is assumed for Bayes estimator to
emphasize its effect.

45

A uniform prior density is assumed. As pointed out in Sec. 2.6, if B0
is in [1,3] convergence is guaranteed by the existence of the previous
two estimators.

The above estimators were implemented on a digital computer with

a random number generator supplying the samples; B was chosen equal

0
to 2. For a given value of O, E(Bm) and Var(Bm) were approximated
for different values of m by averaging the results of twenty runs.
On each run the same samples were used to generate estimates by all
three methods. The results are presented in Table (3.6.1) for
o = l,1.5,2 and m = 25,50,100.

For this example, Table (3.6.1) indicates some general trends.
Not only do all the estimates converge on the average,but as more
observations are taken the variances of the estimates decrease. Also,
as the variance of the observations (controlled by 0) increases, the

estimates become less accurate; i.e., Var(Bm) and 1E(Bm) - B in-

0|
creases. All three estimators seem to perform about the same with
perhaps the Bayes estimator slightly better for small m and large a.
On the whole, the estimators behave much as they do for the i.i.d. case.

The above estimators can also be rated with regard to implementa-
tion. Since 02 + Sm = (1+1/m)(O2 + Sm-l) + (l/m)X:, the moment
estimator is a simple iterative one. The Bayes estimate is also
iterative but, for this example, requires fifty times more storage and
computation than the moment estimator. The maximum likelihood estimate
is, by far, the worst requiring storage of all samples and use of a

computational algorithm to find the zeros of (3.6.2) whenever an

estimate is desired.

46

mommm "m

ooossaoxsq assess: "as

muﬁmsoz NO UOSUmz ":2

 

“woo. mooo. mmmo. ammo. Mmma. qua. wmqa. come. moan.
mmam.a «meo.H mmmm.H mmHm.H Hmom.a momw.a mmHm.H mumm.a nmmw.a
mHmo. mmqo. aNm. ammo. wmmo. mon. coma. mama. HmNN.
«Nmm.H aomm.~ mmnm.H mmoo.~ mmmo.~ ommo.m wmma.~ Homm.H ommo.~
MNHO. mmao. HNHO.. qomo. mono. wmmo. Nmmo. memo. «ode.
mwma.H Nooo.N «wwmma Nwmm.H mmmm.H mmmm.H «qu.H mmqm.m smcm.a
m 1mm11 22 \fl m thll 22 .xf, m ”WW1 22 11\
ooH u 8 on u 8 mm M E
o

N u m mHHS mznm ON roam mHADmmm ZCHH<ADZHm MmHDmEOo a.e.m mam<ﬂ

Asmvum>
Lees
Asmvss>
Ase s
Aamvso>

A88 m

m.H

47
3.7 ALTERNATE STRATEGIES FOR THE ESTIMATION PROBLEM

In this section, three modifications of the basic estimation
strategy outlined in Sec. 3.3 and illustrated in Sec. 3.4 and 3.5 are
discussed briefly. Also, methods for handling extensions of the
original estimation problem are indicated.

The first modification is one that can be used when n is un-
known, as when P is unknown or ‘E_ is too difficult to calculate. In
such a case, E. can be included in the set of unknowns of K“(X;B)
and the methods of Sec. 3.4 and 3.5 can be used with an enlarged para-
meter vector.

The second modification is one that requires knowledge of P
but computes E. during the procedure. The main idea is to treat the
random variables Xm as if they had common distribution Km(X), an
element of Hm = {Km(X;B)]B68. Then Bm’ the mth estimate, can be
calculated using one of the principles of Sec. 3.4 and 3.5 with Kn(X;B)
replaced by Km(X;B). Since for each B E B, Km(X;B) —' Kn(X;B)
uniformly in X, the estimate Bm should be consistent. As an example,
a minimum distance estimator is considered; Bm is defined as the element

in B which minimizes 1mm) = Mirna) - Km(X;B)H. Then
Im(Bm) s Im(BO) s Hikm(x> - 1%(X;BO)H + HKﬁ(X;BO) - Km(X;BO)H (3.7.1)
and V
“KT-[(X;BO) - Km(X;Bm)H s HKn(X;BO) - Km(X)H + Im(Bm). (3.7.2)

Assuming identifiability of the class of mixtures generated by the

family of component distribution with B and n as parameters, the

48

strong consistency of Bm follows from Theorem 3.3.2 and the fact that
Km(X;BO) ﬂ Kﬁ(X) uniformly. Analogous estimators can be established
using the method of moments and the maximum likelihood method.

The final modification takes advantage of the fact that, in gen-
eral, the random variables observed at adjacent points in time are not
independent and, hence, information can be obtained about the unknown
parameter by cross correlating successive observations. Raviv [R-Z]
used this technique to estimate the transition matrix P. His basic
tool was a modification of Theorem 3.3.1. Namely, if g(-,-) is an
integrable function then

N

1
PiN iflgocimm) Eng(xk.x

k4%)} = 1 (3.7.3)

where ETr denotes the expectation when 21 = E: The approach here is

based on the fact that

KN(XN’XN+1) = iszi(xN)FjOSl+l)pijP(kN=i)

and

_. 15,

P(XN-i) ”i
G
That is, the sequence of random variables {[xi’xi+1]}l is described
by a sequence of mixtures of joint distributions which converges to a
fixed mixture
Kn (xk’ka) = iszi (xk) Fj (xk+l) pijni

2

Hence a second order estimation strategy analogous to the first order
m

strategy of this chapter can be developed using observations {Exi’xi+1]}l

and all past techinques with (3.7.3) in place of theorem 3.3.1. In

this case, the elements of the transition matrix P appear as para-
meters in the mixture and hence can be included in the list of unknowns

to be estimated.2
3.8 ADAPTIVE ESTIMATION AND CLASS ESTIMATION

Estimation problems related to the POMS defined in Sec. 2.6 are
considered in this section. The transition matrix is assumed to be
block-diagonal; hence the system is in one of L noncommunicating
classes of states and each class is assumed to satisfy the assumptions
of the POMS defined in Sec. 3.1. The component distributions associated
with each class are assumed to be unknown to within a parameter as in
Sec. 3.1 and the active class of states is unknown also. The problems
are to determine which class is active and estimate the unknown para-
meters of that class.

To formulate the problems more clearly some notation is defined;
33 is the stationary probability vector corresponding to the ith class;
Hni = {Nﬁi(X;B)]B661 is the set of limit mixtures induced by the family
of component cdf's for the 1th class with 1Gi the corresponding para-
meter Space;29 10 is the value of the index for the active class; B0
is the true value of the parameter defining the component densities of
class i0. Both B0 and i0 are unknown and are to be estimated.

The first problem considered will be that of finding a strongly-
consistent estimator for BO. One strategy is to assume the system is

in a particular class and construct an estimate accordingly using the

methods of Sec. 3.4 and 3.5. If this is done for all L classes the

 

27 .
If P is to be estimated n must be known or a conSistent

estimator for n must be available. Such an estimator can be obtained
from the first order strategy discussed in modification one above.

8 . . .
Each class is being treated as a separate Markov chain.
9For simplicity the parameter Space is assumed to be the same

dimension for all classes. But all arguments apply directly to the
more general case.

50

result in a set of L estimators containing at least one consistent
estimator. This procedure is illustrated below using the minimum dis-
tance principle.

If Qi(B) = HRm(X) - Kﬁi(X;B)H then Bi, the estimate assuming
class i is active, is defined as the element in 131 which minimizes
Qi(B). Since there exists at least one value of i, say i, such that

k.(X) B Knl(X;B Wpl, then, as in Theorem (3.5.1), if H c is
m 11].

>0

identifiable B B B0 Wpl. Therefore a sufficient condition for the

8H

adaptive estimator given by (2.6.6) to converge to B0 is that H i
n

be an identifiable set of mixtures for i = 1,2,...,L.

It is important to realize that i as defined by
11..
Qm(Bm) 0 Wpl (3.8.1)

may not be unique. Hence B0 can be estimated even when it is not

possible to determine which class is active. The question of sufficient
conditions for determining 10 will now be considered and a more

explicit estimate for BO will be constructed.

Definition 3.8.1. The set of mixtures H is said to

L
= U H
. 1 111
be class identifiable if, for any B1 6 £91 and B2 6 BJ

Kni(X;B1) = Knj(X;B2) VX (3.8.2)

implies i = j.

The above definition can be interpreted as saying that the
space of mixtures HTr is class identifiable if and only if {Hui}?
is a collection of disjoint sets. Sufficient conditions for class
identifiability are given in the Appendix. As indicated by Theorem

3.8.1, class identifiability is the key condition for determining i0.

51

L
THEOREM 3.8.1. If an = U Hi is class identifiable then im
i=1
defined by

Q(Blm) min Q;(B;) (3.8.3)

1

is a strongly-consistent estimator for i0. Furthermore, if, for each

1
i, Hni is identifiable then Bmﬁ defined by (3.8.3) is a strongly-

consistent estimator for BO.
Proof. By definition and Theorem 3.3.2
1 i
Q(B m) s QmO(BO) 9 o wpl (3.8.4)

Then the subadditivity of the supnorm implies

Ki EK,€H, wpl (3.8.5)
n m 1T10 n10

i
where K , = K , (X;B m) and K , = K , (X;B ). Since for each i
nl ﬁ1 n1 Ni O
m m i 0 O i
the mapping from 13 onto H i’ defined by K i(X;B ), is continuous
n n
and 161 is compact, H i is compact also [R-S]. By the class identi-
n

{H i}l is a collection of disjoint sets. Hence,

fiability of H
E n

1

for almost every sequence of observations, there exists an mO such

that for m > mO K i E H i . Otherwise there would exist a sub-
: m n 0
sequence of {K . } contained in H , for some i # i and con-
nlm 1 1111 1 0
verging to K i a point not in H i . This contradicts the compact-
n n

O

ness of H . . Thus for m > m , i = i . If in addition H . is
"11 O m 0 . ”10

H

0 wpl, as in Theorem (3.5.1).

It follows from both identifiability conditions that for m.s mO B

1m,”
identifiable then {B }m converges to B
0 im
is well defined.
As indicated in the Appendix identifiability of H i is not a
n

necessary condition for class identifiability of HTT hence iO can

52

be estimated even when BO cannot.

3.9 CONCLUSIONS

This chapter has dealt mainly with the problem of finding strongly-
consistent estimators for the unknowns in the component distributions
of a POMS. The problem was defined in Sec. 3.1 as one of estimating
the parameter set BO that defines a sequence of mixtures using de-
pendent samples from successive elements in this sequence. The study
was restricted to a class of systems with state activity described by
a regular Markov chain. As shown in Sec. 3.2, the corresponding se-
quence of mixtures approaches a limit mixture Kﬁ(X;BO) and the
observation process is asymptotically ergodic. These properties were
used in Sec. 3.3 to establish tools (extensions of the Law of Large
Numbers and the Glivenko-Cantelli Theorem) for estimating K"(X;BO)
and any of its expectations from available observations, thus re-
ducing the estimation problem to the resolution of the limit mixture.
This estimation strategy was illustrated with the method of moments
and the maximum likelihood method in Sec. 3.4 and the minimum distance
principle in Sec. 3.5. The result was a variety of conditions under
which the parameters could be estimated and, hence, under which the
optimal rule of Chapter II adapts. A specific example illustrating
some of these methods and the optimal Bayes estimator of Chapter II
was presented in Sec. 3.6. Computer simulations indicated the behavior
of the estimators to be typical.

Alternate strategies were proposed in Sec. 3.7. It was shown

that these strategies would also handle the case in which the transition

matrix P is unknown.

53

In Sec. 3.8, the adaptive estimation problem and class estimation
problems of Sec. 2.6 were solved suboptimally. In the process of
establishing conditions under which such estimators could be constructed,
the concept of class identifiability was introduced with sufficient con-
ditions for this type of identifiability being given in the Appendix.

Finally, the basic aim of the Chapter has been to put forth a
general estimation strategy and illustrate it with examples. It should
be clear that many methods not mentioned here [C-4][H-2][S-2][S-3], in-

cluding nonparametric ones, apply equally well to this problem.

CHAPTER IV
EXAMPLES OF PARTIALLY OBSERVABLE MARKOV SYSTEMS

Examples of Partially Observable Markov Systems (POMS) can be
found in the fields of Pattern Recognition and Communication Theory.
When the model defined in Sec. 1.1 can be associated with systems in
these fields, the decision rules of Chapter II and the estimation
schemes of Chapter III lead to a class of decision devices with a
learning capability.

This chapter deals mainly with the design of optimum, adaptive
signal detectors for a variety of communication systems with unknown
signals. The basic approach is to propose a communication system,
make the correspondence between it and a POMS with unknown parameter
in the component densities, identify the optimal detector, and check
critical assumptions that ensure adaption.

The main assumptions that guarantee the existence of an iterative
optimal rule which adapts are listed below from Sec. 2.1, 2.5, 3.1 and
3.5.

l. The observation process is state-parameter-conditionally

independent.

2. The underlying Markov chain is regular.

3. The Corresponding set of mixtures is identifiable.

Through most of the chapter the transition matrix P will be
given and the component densities will be Gaussian with unknown mean.
Consequently, in verifying the above assumptions, the following in-
formation will be useful.

54

55

A regular Markov chain can be characterized by its transition
matrix in either of the following two ways [K-3][K-4].

1. There exists a finite integer N, such that PN has all

positive entries, denoted by PN > 0.

2. All but one eigenvalue of P lies inside the unit circle.

As indicated in the Appendix,all the sets of finite mixtures of
Gaussian distributions with distinct meansare identifiable if constraints
are imposed to rule out any ambiguities which may arise in the parameter
space. In this chapter it is assumed that, for each example, a parameter
prior density which reflects such constraints is available.

The purpose of this chapter is to display the versatility of the
model for a POMS and not to investigate each application in detail. In
order to display the main ideas clearly, special cases are treated which
can be easily generalized. Additional background concerning each problem
can be found in the references cited in the corresponding sections.

In Sec. 4.1, a Pattern Recognition System with Markov-dependent
pattern activity is introduced and discussed briefly. In Sec. 4.2, a
basic communication system with a Markov encoder, memoryless channel, and
known synchronization is considered. The assumptions of the basic system
are weakened in Sec. 4.3 and 4.4; systems with unknown synchronization and
channels with memory are considered. Sections 4.5 and 4.6 deal with vari-
ations of the basic system in which synchronization is undefined; namely,
systems in which signals arrive at random times. The results of the chapter
are discussed generally in Sec. 4.7 and 4.8.

4.1 PATTERN RECOGNITION WITH MARKOV-DEPENDENT PATTERN ACTIVITY

Before attacking any communication systems it is convenient to

investigate a more general class of systems and illustrate the format

 

30

Since the elements in each row of P sum to one, 1 is an eigenvalue.

56

for utilizing the results of Chapters II and III. In this section, a
pattern recognition problem in which pattern activity at one time de-
pends on pattern activity at other times in a Markov manner is shown
to be a problem in decision making with a POMS.

The system under study is depicted in Fig. 4.1.1.

 

PC1 .
/" — Sw1tch
PC2 """“\1 , ,
Feature Pattern
Extractor Classifier

pc——/\ ’81

 

 

 

 

 

 

Fig. 4.1.1 A Pattern Recognition System

There are M pattern classes {PCi}T. At time N, a sample pattern

is randomly chosen according to the probability vector

N
2N =[P(w1)9-°°9P(wl:4)] (4.1.1)

where P(W§) is the probability that PCi is active at time N and
P(W§|W§-1,...,W:) = P(W§|W§-1) = PjiIV’N\ The sample pattern is mapped
to a point XN in a finite-dimensional, Euclidean vector space via the
feature extractor. Associated with each pattern class is a density
f(X|Wi) for XN when PCi is active. The pattern classifier makes

a decision as to which pattern class generated XN. Raviv [R-3] applied

57

such a model to the recognition of characters in text. He assumed all
quantities were known or obtainable through supervised learning.

The problem here is to design an adaptive classifier which makes
decisions on the basis of observations XN = [X1,...,XN] when ‘21
and P = [pij] are given but the densities {f(XiWi)}? are unknown.
The relation between this problem and that of decision making for a
POMS is established by introducing the random variable KN which maps
the event W? to the interger 1. Then {AN}: is a first-order,
homogeneous Markov chain. The events {WE}? have been put into a one-
to-one correspondence with the states of a POMS whose state activity
is summarized by ‘21 and P = [Pij] and whose observations are governed by
the component densities {f(XIWi)}T. The problem of classifying
feature vectors is that of making decision about the states of a POMS.
A variety of adaptive classifiers follow from the decision rules of
Chapters II and III.

For example, when the component densities are specified to
within a parameter vector B, the minimum probability of error rule

as given in Sec. 2.2 is:
decide PCi is active at time N if
p(wlny) 2 F(wbj‘IXN1 vi 14 1 (4.1.2)
where
P(WT|XN) = j‘ P(W1:|XN,B)p(B|XN)dB (4.1.3)

As in Sec. 2.3, the posterior probabilities {F(WEIXN)}? can be gen-
erated iteratively under Assumption 1 of this chapter. Furthermore,

if Assumption 2 and 3 hold

58

P(BIXN) 5 6(B-BO) wpl (4.1.4)

where B0 is the true parameter value. Hence B0 is learned and the
rule adapts.

It is clear that the estimation schemes and suboptimum rules de-
fined in Chapter III apply as well.

In the remainder of the chapter more explicit examples are given

in which the assumptions guaranteeing existence and adaption of optimal

rules can be checked.
4.2 ADAPTIVE SIGNAL DETECTION WITH A MARKOV ENCODER

This section considers a particular example of the system treated
in Sec. 4.1. A communication system, wherein the signal sent in one
time interval depends on that sent during another time interval in a
Markov manner, is investigated. Figure 4.2.1 illustrates the system

under study.

 

81(t) \V /" Encoder /’|Sampler
/
_ ./
32“) ‘7 - 8(t) X(t
. Channel ' Detector *ep-

+
X
-—_./’* N

M n(t)

 

 

 

 

 

 

Fig. 4.2.1 A Communication System

59

Every T seconds a signal is randomly chosen from the set

{Si(t); 0 S t S T}? and sent through a channel which changes it in

some fixed but unknown manner. The channel output 9(t) is corrupted

by additive noise and the result is sampled. The observable coordinates

are a sequence of time samples {X(ti)}: where X(ti) = 9(ti) + n(ti).

The detector uses these observations to determine which signal was

sent in a given time interval.

The Basic assumptions are the following

1.

The operation of the encoder is described by a matrix of
positive probabilities P = [pij] = [F(Wg/Wﬁ-1)] for all

N, where W? is the event that the ith signal was sent over
the interval [(N-1)T, NT].

A set of prior probabilities -21 = [P(Wi),...,P(W;p] govern-
ing transmission in the first interval is given.

The channel is memoryless. Hence, there is no intersymbol
interference and the channel output during the Nth interval
is caused only by the input during that interval. The re-
sponse to Si(t) is 91(t).

The Noise process is white and Gaussian with zero mean and
finite variance.

The encoder and detector are synchronized so the time re-
ference is the same for both.

Every T seconds a block of L samples is taken at the
receiver in accordance with standard sampling theorems;

EN is the block taken during the Nth interval and 91(t)

is characterized by an L-dimensional vector of samples

_8_i = [eil,...,8i,].

60

The correspondence between this system and that of Sec. 4.1 is
immediate. The encoder acts as the switching device and the sampler
as the feature extractor. When the event W? occurs, the resulting
observation is EN = ﬁi +DEN where EN is a vector of noise samples.
Hence, the component densities are Gaussian with means £§i]¥. The
problem is to establish a detector which makes decisions on the basis
of the observations XN = [X1,...,XN] while learning the signal
vector .g = Lgl,g2,...,§3g.

As in Sec. 4.1 the optimum decision rule is:

decide the ith signal was sent in the Nth interval if
N
P(W1:/X ) 2 P(W1:/XN) Vj )4 1 (4.2.1)

This rule has the learning property indicated by (4.1.4) with B re-
placed by 9. Only the assumptions remain to be checked. The state-
parameter-conditional independence follows from the white Gaussian

Noise assumption. This will be the case through the remainder of the
Chapter. Hence only Assumptions 2 and 3 will be discussed forthwith.
Since pij > O, the underlying Markov chain is regular. The family

of component densities are Gaussian. Hence if the vectors in the set
L01}? are distinct, the identifiability assumption is satisfied and

the rule adapts.
4.3 ADAPTIVE DETECTION WITH INTERSYMBOL INTERFERENCE

When the channel of the system considered in Sec. 4.2 has memory,
the signal transmitted in one time interval spills over into other
time intervals. Consequently, the received signal at any time is

affected by what was sent before. This dependence, under suitable

61

assumptions, can be shown to be Markovian. Chang [H-l] considered
adaptive detection for the binary case in which one of two antipodal
signals were sent independently from one time interval to the next.
In this section, a more general class of signals transmitted with
Markov dependencies is considered.

The system under study is the basic communication system of
Fig. 4.2.1 with the channel described by a causal linear filter with
impulse response h(-). For simplicity, interference is assumed to
be limited to the immediately succeeding time interval, which is ex-

pressed as

h(t-T) = 0 t > T+¢
(4.3.1)

t<T

The channel input-output relation is given by

1 1
j0h(t-T)si(t)d¢ ei(t) o s t s T

(4.3.2)

0:(t) T S t S 21

Let P(W: ) be the probability that the ith signal was sent during

1
th . .th . . . .
the N interval and the 3 Signal, during the preceeding interval.

From the superposition property of linear filters the output of the

channel when the event Wﬁj occurs is

eij(t) = 61(NT-t) + G§IN+1yr-n (N-1)T s t s NT

and EN = 013 +-EN where 833 is the vector representation of
613(t), 0 S t S T. Then, using the transition probabilities of Sec.
4.2, the event {ng} can be put into one-to-one correspondence with

i O
the states of a POMS with Gaussian component densities with means {Q J].

62

K
II
N

For example if

-1 -1
P0419 W1“? )=P11P(wli)

= pum 111) n 112:4) + F(wli'l n w’i'zn

-1 -1
p11 Pmlil) + p11 P 12 )

where as defined in Sec. 4.2 W? is the event that the ith signal was
sent in the Nth interval. This procedure can be repeated for the re-
maining events in the set {Wﬁj}. Then the probability vector defined

by
EN = [F(Wbil)P(WI:2)P(Wl:1)P(Wl:2)]

satisfies the iterative relation EN = EN 1p1 where

 

 

r- ‘1
0
p11 p12 0
p11 0 p12 0
P =
o
1 0 p21 p22
0 p 0 p
L 21 22]

The initial probability vector p2 can be calculated from that of Sec.
4.2; e.g. P(W:1) = p12 P(W:). When the underlying chain is in the
state corresponding to Wﬁj the density of the observations is Gaussian
with mean .013.

The optimum decision rule is given by:
. . th . .
dec1de Si was sent during the N interval 1f

p(wi/xN) 2 P(Wk/XN) v k ,1 1 (4.3.4)

63

where

P(Wi/XN) = 2 P(Wij/XN) (4.3.5)
N j
/X ) can be generated iteratively as in Sec. 2.3.

and P011:j

The assumptions ensuring adaption are now investigated. Since

pij > 0 i,j = 1,2, Pi > 0 and, thus P1 is regular. It is clear

that, for general M, P1 retains a structure such that Pi > 0. Again,
because the component densities are Gaussian, identifiability is in-
sured if the vectors in the set (913} are distinct.

When p11 = p21 = q1 and p22 = p12 = q2 (this corresponds to
the case Chang treated) P2 is uniform in the columns. Hence {A2k]:

and {A are independent subsequence of the underlying chains and

}¢
2k+1 l
the corresponding observations are independent. Chang used this fact,
which can be arrived at by direct consideration of the model, to con-
struct estimators for the unknown signals. He used the method of
moments and his techniques are a special case of those discussed in

3
Sec. 2.4 and 2.7, 1 where if M = 2, the unnormalized stationary prob-

ability vector is

E-‘ [pllp21’ p12921’ p12p21’ 9229121' (4°3'6)

4.4 ADAPTIVE DETECTION WITH UNKNOWN SYNCHRONIZATION

When synchronization is unknown in the basic communication
system of Sec. 4.2, each block of samples may contain the effects of
two signals. Hence there are dependencies between the signal received
in one time interval and that in adjacent intervals. Under appropriate

assumptions , these dependencies can be shown to be Markovian. Stewart

 

31The Third Modification.

64

[H-3] found the optimum decision rule for determining the true synchro-
nization time when the signals were transmitted independently from one
time interval to the next. In this section transmitted signals are
Markov-dependent and the optimum decision rule for determining synchro-
nization is shown to follow easily from the results of Sec. 2.6 and 3.8.

The system under study is the basic communication system of Sec.

4.2 without the synchronization assumption. That is, in each block of
samples the time sample at which the effect of one signal ends and that
of the following signal begins is unknown. Additional assumptions are
needed and these will be considered in force for the remainder of the
chapter.

1. Time zero is the time the receiver is turned on.

2. The initial probability state vector of Sec. 4.2 governing
the transmission of the first signal is the stationary
probability vector En

Assumption 2 implies the time the transmitter is turned on is ir-

relevant provided it occurs before the receiver is turned on.

If each block is assumed to consist of L samples and the rth
sample from the start is the true synchronization time, the Nth
r
O . = + C O h
observation has the form. KN §”j EN some 1,] w ere
r
= 0.. , ’00., . 4.4.1
Eij [91,L-r+2’ ’eiL ejl eJ,L-r+l] ( )

is the vector representation of the signal at the output of the channel
between (N-1)T and NT. That is, gij contains the last r-l com-
ponents of «31 and the first L-r+l components of 9) where -Qi and

SE are the vector representations of the channel responses to signals

65

Si and Sj respectively.

Let Pm:j 6 XN) be the probability that eij is the channel
output in the Nth interval and let P(Wr) be the prior probability
that r is the true synchronization time. Then, using the transition
probabilities of Sec. 4.2 the events {Wij}; i,j = 1,2,...,M; r = 1,2,...,L
can be put into a one-to-one correspondence with the states of a POMS.

For example if M = 2 and L = 2

2 _ 2 k k-l
P(W11€XN) -P(W nwlrlw1 )

k . .th .
where, as in Sec. 4.2 W1 is the event that the 1 signal is sent in

th . . .
the k interval after the transmitter is turned on; k is unknown. Then

P(Wil e xN) = F(wli/wlf'lmzwmz n wli'l)

k-l k-2

1 k-2 2
nw1)+1>(w rlw1 nw2 )]

2 k-
p11[P(w 0 W1

2 2
p11 P(wll e xn-l) + p11 PCW21 6 ’51-1)

This procedure can be repeated for all the events [W§j}. Then the

probability state vector

2 l l 2 2 2 2
EN = [P (Wu)?(1422»<w11)P<w21)P(w12)P(w22>]

is related to its predecessor by Bé = E§_1p2 where

"P11 p12 0 o o 0]
p21 p22 0 o o 0
P2 = O 0 p11 0 p12 0
O 0 p11 0 p12 0
o 0 0 p21 0 p22

 

 

66

The initial probability vector 21 can be obtained from E. and
{P(Wr)}€. The corresponding component densities are Gaussian with
means {9:j].

The transition matrix is block diagonal with one block asso-
ciated with each synchronization time. The problem of determining the
true synchronization time corresponds to that of class estimation given
in Sec. 2.6. The optimum decision rule follows from (2.6.1) and has
the form:

decide the true synchronization time is r if
P(Wr/XN) 2 P(Wk/XN) V k aé r
where
P(Wr/XN) = iZjPWEj/XN)

Conditions under which

P(Wr/XN) 15. at r , wpl
l 0

where rO is the true synchronization time, are now considered. In
general, P2 will have L-l blocks exactly the same (those correspond-
ing to synchronization time r = 2,...,L). These have the same form

as P1 considered in Sec. 4.3 and hence are regular. The remaining
block (for r = 1) has all positive elements and is regular also.
Since for 62> 2 the blocks are not distinct and the component den-
sities belong to the same family, then as indicated in Sec. 3.8 and

the Appendix r cannot be determined unless prior information is

0

available concerning the parameters of each class. However, this POMS

has some special properties that can be exploited. The observations

67

from one class are related to those in others(when they are active)
through a shifting procedure. Hence, an empirical cdf for the limit
mixture of each class can be constructed for minimum distance estimation.
This observation was used by Stewart in the i.i.d. case to establish
that rO could be uniquely determined if {QEj} contained at least
m+1 distinct vectors for each r. Since the identifiability problem
is the same for the Markov case, his result can be extended via Theorem

3.3.2.
4.5 Mr ary ADAPTIVE DETECTION OF SIGNALS WITH RANDOM.ARRIVAL TIMES

In Sec. 4.2-4.4, the periodic behavior of the encoder allowed
observations to be processed in blocks of known size, whether or not
synchronization was known. In this section and the next, the un-
certainty concerning the signal arrival time is greatly increased and
the samples must be processed one at a time. The detection of signals
of known duration but with random arrival times is considered. For
random lengths of time no signal is transmitted and only noise is re-
ceived.

Stewart [H-B] found the optimum receiver for a case in which
signals were transmitted independently in time but was unable to find
a suboptimum solution and, therefore, could not prove adaption. In
this section, the optimum receiver for the Markov-dependent case is
established and convergence follows immediately from previous results.

The system under consideration is essentially the basic commu-
nication system of Sec. 4.2 with an additional "no signal" input which

can be active for a random length of time. Signals are transmitted

68

for a duration T and no transitions of the encoder are allowed during
this time. A typical signal of the output of the channel is depicted

in Fig. 4.5.1.

9(t)

1

 

 

I
1

 

Fig. 4.5.1 Channel Output for an M-ary
Signal set with Random Arrival Times

Then Xk = 9(tk) + M(tk) is the sample at time tk' The sampler is
assumed synchronized; i.e., signals can arrive only at sampling in-
stants. Thus, 9(tk) is either 0 or eij, the jth component of the
signal vector characterizing 61(t); O S t S T.

Let P(W:j) be the probability that eij is active in Xk
and let P(W:) be the probability that noise alone is present. Then,
as in previous sections, the events {ng}, WE can be associated with
the state of a POMS with Gaussian component densities with means {eij}

and 0.

When L

M = 2 the probability state vector
k k k k k
2k - [P<w0)P(w11)P(w12)P(w21)P<w22)]

can be generated by Bk = 2k-1P3 where

 

 

p00 p01 0 p02 0
o o 1 o 0
P3 7 p10 p11 0 p12 0
o o o o l
Lon p21 0 p22 0.

The probabilities p01, and p10 are those governing transitions (when
they can occur) between the signal and noise states of the encoder,
from one sampling time to the next.

The following posterior probabilities can be used to make optimal

decisions as to which signal, if any, is active at time k.

P(W1;) =ZP(W1; /xk) i=0,1,...,M

J

For ease in verifying assumptions, let

3
- . 0:1,°'°9M

(This is the case considered by Stewart). For this case with 1M - L = 2

P3 has eigenvalues 0, 0, 0, 1, qO-l and P3 is regular. With
general M and L, the characteristic polynomial of P3 can be shown
to be
mi. -1 +1 4. L-l
ems) = (-1> s(‘“ ” <-s +qu +q>
M

where q = l - q = Z q,. Since q (s) = O has only one root on the
0 1 1 mL
2

unit circle3 , s = 1, P3 is regular. In the most general case P3
has the same structure, with respect to non zero entries, as in the
previous case. Hence the type of state activity is the same and P3
, 33
is regular.

Identifiability conditions follow as before. If {91)} is a

set of distinct elements, all unknown parameters can be determined.

 

32The only value of a which satisfies qu(eJa) = O is a = 0.

33See Sec. 4.7.

70
4.6 ADAPTIVE DETECTION OF SIGNALS WITH RANDOM.ARRIVAL TIMES

This section treats a variation of the problem defined in Sec.
4.5. Instead of employing a randomly-chosen sequence of signals with
random Spacing, one signal is randomly chosen and transmitted repeatedly
at random times. The problem is to design detectors to determine when
the signal is present and which signal is being sent. Nolte [N-l]
found a detector for the case in which all the signals are known. How-
ever, he did not discuss the conditions for adaption; i.e., for con-
vergence to the detector that would be used if the identity of the
signal being sent were known. Here, the signals are unknown but other
prior information is assumed available to make the problem meaningful.

The system under study is basically the same as that in Sec. 4.5
except that here the encoder switches between a fixed signal and noise.

A typical signal at the channel output is shown in Fig. 4.6.1.
9(t)

1

1—T—-1 1——r——-1 1——T———1

 

Fig. 4.6.1 Channel Output for a Signal
with Random Arrival Times

71

When the ith signal is being sent X(tk) = Xk = 6(tk) + n(tk) where
6(tk) is either 0 or 91.1 for some j; eij is the jth component of
the signal vector corresponding to the ith signal 91(t). Let PCWi)
be the prior probability that signal i is the one being sent repeatedly.

As in Sec. 4.5, PCW: ) is the probability that X ‘8 e + n(tk)

J k ij

and PCWEO) is the probability that Xk = n(tk) and signal 1 is the
one being sent. Then, the event {WIJ} can be associated with states

of a POMS with a block diagonal transition matrix, each block correspond-

ing to the event that a particular signal is being sent repeatedly.

For example, when M = L = 2 the probability state vector is

- k k k k k k
EN - [P (W10) ’Pmll) ’P (“12) ’szo) ,P(W21) ,P 0122)] (4. 6. 1)

and the transition matrix is

0 5
v1 l-vl
O 0 1
P4 = v1 l-vl 0 (4.6.2)
v2 l-v2 O
O O 1
v 1-v O
L. 2 2 J

 

 

k -1 k k . .
where vi PCWlo/Wﬁo ) — PCWiO/Wiz) are assumed given. Again, the
component densities are Gaussian with means {913} and 0.

Then, as indicated in Sec. 2.6, under appropriate assumptions

to be discussed presently

k k
P(Wi/X) = JZIP(Wij/x ) _. A130 wpl (4.6.3)

72

where jo is the index of the true signal being sent and F(Wi) is
the probability that ith signal is being sent. The probabilities
{F(Wi/Xk)}? can be used to make decisions as to which signal is being
sent.

Now if the ith signal is assumed the one being sent the decision
rule is

decide si present if

L P(W, /ka.)

11, 1

L(Xk/Wi) = 2 k
J=1 F(in/x wi)

 

> 1 (4.6.4)

Consequently an adaptive detector is given by

decide a signal present if L(Xk) > 1

where
k M k k
L(X) = z L(x /Wi)P(Wi/x) (4.6.5)
z=l
and
k as
L(X ) e L(X /wi ,91 ) (4.6.6)

0 O
The rule (4.6.4) is analogous to that given by Nolte but it is
not the optimal rule for detecting a signal. The optimal rule uses

the likelihood ratio

M L k M, k
z 2Pm1./x)/ 21>(w1/X)
i=1 j=l 3 i=1 0

which has adaption properties similar to (4.6.6).

The conditions for convergence will now be investigated. The

ith block of P corresponds to a special case (M = l) of P3 con-

4

sidered in Sec. 4.5. Hence, P is regular.

4

73

In checking identifiability it is important to realize that
there are two estimation problems involved. The first is concerned
with determining which signal is being sent and involves class iden-
tifiability. The second involves the estimation of the signal being
sent and regular identifiability. As indicated in Sec. 3.8, (4.6.3)
can hold without (4.6.6) being true and vice versa.

For example, let M.= L = 2. Then the stationary probability
vector for the class corresponding to the ith signal is proportional

to [vi l-v l-vi]. Hence, according to the discussion in the Appendix,

1
(4.6.3) will be true if v1 # v2 and, for each i, {913} is a set of
distinct elements. If v1 = v2, additional prior information is needed

on the parameters. It must be known that 91 and 92 lie in disjoint
regions of the parameter space; e.g., the signal set might be anti-
podal. If such information is not available, the signals cannot be
distinguished, in general.

0n the other hand, if, for each i, {eij} is a set of distinct

elements, Q can be learned and (4.6.6) will obtain even if the class

10

of the signal is indeterminable. If no special prior information is
available concerning the signal from each class, then,one might as
well design a detector for one unknown signal. The rule of Sec. 4.5
could be used with M = 1. The important point is that the rules con-
sidered in this section provide a means of using this prior information
when it is available.

When 6 is known (the case treated by Nolte) it is clear from
the Appendix that the existence of the set {Bi} of distinct vectors

is sufficient for adaption.

74
4.7 REMARKS

This section considers briefly some points that apply generally
to the contents of this chapter.

In attempting to treat a variety of situations in a uniform
manner, simplifying assumptions were made and the models were slightly
contrived. Consequently, the results generalize in many respects and
are intended to include other situations that give rise to similar
decision problems. The emphasis in these applications should be on the
received signals with the encoder and channel serving as a convenient
way of accounting for the generation of unknown signals in a Markov
manner. From this point of view, the results of Sec. 4.3 indicate
that the Markov dependencies between the received signals of other
sections could be due to intersymbol interferrence; the randomly
arriving signals of Sec. 4.5 or 4.6 could, for example, be seismic
waves; any signal space representation of the received signal in Sec.
4.2 will serve as well as time samples to make decisions. It is also
clear that the Gaussian noise assumptions can be weakened and the
number of unknowns can be increased.

With regard to the identifiability conditions, the following
observations are important. The conditions stated in each section are
sufficient to ensure that all parameters can be learned and effective
decisions can be made on all the states of the corresponding POMS.
However, in many cases, (Sec. 3.3-3.6) the events of interest consist
of a union of other events. To make effective decisions, it is not
necessary that all the component densities corresponding to the events

in this union be distinguishable. Thus, depending on the inferrence

75

problem of interest and the available prior information, conditions for
adaption can be considerably weakened, with Sec. 3.8 and the Appendix
providing the guidelines.

Verifying the regularity of the underlying Markov chains may
have appeared to be a formidable task. However, it is well known that
the regularity of a Markov chain can generally be determined from the
structure of the corresponding transition matrix. While the eigenvalues
give useful information concerning the system activity, they do not have
to be computed to verify this assumption. Furthermore, the behavior of
a general class of systems can usually be summarized by a simple example.
This point of view was not developed in this chapter but was implicit
in Sec. 4.5.

Finally, while the optimum solutions given in this chapter pro-
vide a reference for comparison, they are generally undesirable from
the view point of practical engineering. Among the suboptimum
solutions suggested by the estimation techniques of Chapter III it
appears the method of moments would yield the most fruitful results.
The success that Chang and Stewart have had with this method in de-
veloping low-memory estimators for cases of practical interest in-

dicate that similar results could be obtained here.
4.8 CONCLUSIONS

The aim of this chapter has been to display the versatility of
the model for a POMS and to exhibit how the results of Chapters II and
III can be applied. Several communication systems were shown to be

POMS. These include systems with

76

l. A Markov information source

2. A channel with memory

3. unknown synchronization

4. signals with random arrival time

5. combinations of the above
For all these cases the form of the optimal detector for unknown re-
ceived signal was shown to follow directly from the results of Chapter
II and the conditions for adaptions from Chapter III. Although optimal
decision making was emphasized in the above examples, it is implied
that the estimation schemes and suboptimal decision rules apply as well.

In addition to providing quick solution, the technique of formu-
lating these inference problems as those of decision making for a POMS
has other advantages. First, it provides a common model for a variety
of seemingly different systems. This facilitates comparison and helps
focus analysis efforts in one direction. Results developed for gen-
eral POMS apply to all of the above systems. Next, it illustrates
clearly the nature of the estimation problem involved. Whereas, the
properties of the observations are not always clear, in an ad hoc
formulation, the mixture approach used in this thesis clearly defines
conditions for the existence of estimators and suggests a wealth of
techniques. Finally, it brings into play the powerful tools of Markov
chain theory. Once a state space and transition matrix have been

established, a great deal can be inferred about system state activity.

CHAPTER V
GENERAL CONCLUSIONS

In this chapter, the main contributions of the Thesis are re-

viewed and suggestions are made for future research.
5.1 REVIEW

This Thesis has been concerned with several inference problems
related to a class of Partially Observable Markov Systems. Generally,
the results represent an extension of previous work on unsupervised
learning and adaption from the i.i.d. case to a particular case with
dependent, non-stationary observations. However, additional inference
problems not well established for the i.i.d. case were generalized and
solved here also; namely those of class estimation and adaptive estima-
tion treated in Sec. 2.6 and 3.8. The basic goal has been to construct
estimators and decision rules and to state conditions under which they
perform effectively.

In Chapter II, Bayes Decision-Theoretic concepts were used to
develop optimal solutions when the component densities are defined by
an unknown parameter set. Large sample theory was used in Chapter III
to establish suboptimum solutions. The basic estimation problem was
one of mixture resolving and a general strategy was developed for ex-
tending estimation techniques developed for the i.i.d. case to the
more general case studied here. In general the results display many

similarities with the i.i.d. case. The dominant role of mixtures and

77

78

identifiability, the implementation difficulties of the optimum rule
and the need for prior information are all general characteristics of
nonsupervisory problems.

Chapter IV showed several communication systems of current in-
terest to be POMS. Consequently adaptive detectors and estimators were
easily established along with conditions for effective operation. This
unifying approach to solving a previously troublesome class of problems

represents a major contribution of the Thesis.
5.2 EXTENSIONS

The similarities between the results obtained here and previous
work in both estimation and Markov chain theory suggest certain natural
extensions that used to be investigated.

First is a class of inference problems with time varying para-
meters. For example, as an extension of the optimal i.i.d. case
Braverman [B-3] and Fralich [F-l] considered parameter changes summarized

by the difference equation of the form

= +
Bm+1 BM AM

where {AM} is a sequence of independent random variables. Their
ideas are applicable to the problem treated here [H-S].

Next is the problem of developing estimators that are easier
to implement than those given here. Sakrison [S-l] has used stochastic
approximation techniques to solve the likelihood equation. The result
is a simple iterative low memory estimator. This method appﬁﬁs to
ergodic observation processes which suggests that it could be extended

to the asymptotically ergodic case treated in this thesis.

79

Finally, most of the results in this Thesis apply to POMS whose
state activity is described by a regular Markov chain. Although as
demonstrated in Chapter IV this represents a useful class of systems
it would be desirable to extend the results to include ergodic chains.
In so far as those properties which effect the proof of Theorem 3.3.1
are concerned, regular and ergodic chains are similar and it appears the

general estimation strategy can be extended.
5.3 SOME INTERESTING PROBLEMS

In an attempt to exploit the similarities between the i.i.d.
case and that of a general POMS, several interesting problems have
been ignored. Most of these problems arise from the fact that, unlike
the i.i.d. case, the optimum decision rule for a POMS has a changing
structure (as a function of the observations) even when all quantities
in the model are known. This causes three main difficulties.

The first problem is concerned with implementation of suboptimum
rules. Unless F(AN=i/XN,B) is stored as a function of B (This
would lead to the same storage problems as the optimum rule) using
P(AN=i/XN,BN(XN)) to make decisions at the Nth step requires storage
of XN and an iteration over N steps using the schemes of Sec. 1.2.
This is the case regardless of what is available from the (N-l)th
step. Hence the memory and number of computations grow with N. In
an attempt to overcome this problem Raviv [R-Z] has shown, for P un-
known and 21 = E, that a fixed number of past samples can be used to
construct decision rules with an asymptotic risk arbitrarily close to

the corresponding risk for known P. The question of how many past

80

observations are needed in a practical situation remains open.
The second problem is that of proving adaption of both optimal

and suboptimal rules. For example one would like to Show
. N N _. N a
IP(XN 1/X ,BN(X )) - PO,N 1/X ,BO)| 0 Wpl

This is a nontrivial statement and again the problem can be traced to
the varying structure of the rule. Raviv by using only a fixed number
of past observations had a fixed form rule (given P) and adaption
followed.

The third problem is that of computing probability of error.
Even when allquantities are known and a very simple example is used
error calculations are prohibitive [D-B]. This suggests that computer
simulation must be used to evaluate adaptive decision making devices.

The above problems indicates that a profitable result would be
some practical measure of dependency between the samples. If such a
tool were available effective suboptimum rules could be constructed
with desirable implementation properties; the look-ahead mode of de-
cision making could be better evaluated; and the effect of initial
system behavior on asymptotic decision modes could be determined. At

present computer simulation and intuition are the only guides.

BIBLIOGRAPHY

[A-l]
[B-l]

[3-2]

CB-3]

[c-1]

[0-2]

[0-3]

[c-4]

[0-5]

[D-l]

[D-2]

BIBLIOGRAPHY

Aoki, M., Optimization of Stochastic Systems, Academic Press,
New York, 1967.

Billingsley, P., “Statistical Methods in Markov Chains", Ann.
Math. Statist., Vol. 32, pp. 12-40, 1961.

Blischke, W.R., "Moment Estimators for the Parameters of Two
Binomial Distributions," Ann. Math. Statist., Vol. 33, pp.
444-454, 1962.

 

Braverman, D., Machine Learning and Automatic Pattern Recognition,
Stanford Electronics Laboratories, Technical Report No. 2003-1,
Feb., 1961. '

Chang, R.W. and J.C. Hancock, "0n Receiver Structures for
Channels Having Memory,“ IEEE Trans. on Information Theory, Vol.
IT-12, No. 4, pp. 463-468, Oct., 1966.

 

Chien, Y.T., and K.S. Fu, "On Bayesian Learning and Stochastic
Approximation", IEEE Trans. on System Science and Cybernetics,
Vol. SSC-3, No. 1, pp. 28-38, June 1967.

 

Choi, K., Estimates for the parameters of a finite mixture of
distributions, University of Missouri, Technical Report No. 18,
1966.

Cooper, D.B., 0n the Existence of Nonsupervised Adaptive Signal
Detectors and Detector Estimation using Stochastic Approximation
Methods, Ph.D. Dissertation, Columbia University, April 1966.

Cooper, D.B., and P.W. Cooper, "Nonsupervisory Adaptive Signal
Detection," Information and Control, Vol. 7, No. 3, pp. 416-444,
Sept., 1964.

 

Deely, J.J., and R.L. Kruse, Construction of Sequences Estimating
the Mixing Distribution, Ann. Math. Statist., Vol. 39, No. 1,
pp. 286-288, 1968.

 

Doob, J.L., Stochastic Processes, Wiley, New York, 1953.

 

81

[H-2]

[H-3]

[H-4J

[H-S]

[K-l]

[K-2]
[K-3]
[K-4]

[L-1]
[M-1]

[M-2]

[N-l]

82

Drake, A.W., "Observation of a Markov Source Through a Noisy
Channel," presented at the IEEE Symposium on Signal Transmission
and Processing, Columbia University, May 1965.

Fralick, S.C., "The Synthesis of Machines which Learn Without
a Teacher," Stanford Technical Report No. 61308-9, April 1964.

Hancock, J.C., and R.W. Chang, "Unsupervised Learning Receivers
for Binary Channels with Intersymbol Interference," presented at
the IEEE Symposium on Signal Transmission and Processing, Columbia
University, May 1965.

Hancock, J.C., and E.A. Patrick, "Learning Probability Spaces for
Classification and Recognition of Patterns with or without
Supervision", Purdue University Report TR-EE-65-21, Nov. 1965.

Hancock, J.C., and T.L. Stewart, Parameter Estimation with Un-
known Symbol Synchronization, Purdue University Report TR-EE-
67-1.

Hancock, J.C., and P.A. Wintz, Signal Detection Theory, McGraw-
Hill, 1966.

 

Hilborn, C.G., and D.G. Lainiotis, "Optimal Unsupervised Learning
Multicategory Dependent Hypothesis Pattern Recognition," IEEE
Trans, on Information Theory, Vol. IT-l4, No. 3, pp. 468-470,

May 1968.

 

Kakalik, J.S., Optimum Policies for Partially Observable Markov
Systems, Technical Report No. 18, Operations Research Center,
Massachusetts Institute of Technology, Oct. 1965.

Kale, B.K., "On the Solution of the Likelihood Equation by
Iteration Processes," Biometrika, Vol. 48, pp. 452-456, 1961.

 

Karlin, S., A First Course in Stochastic Processes, Academic
Press, New York, 1966.

Kemeny, J.C., and J.L. Snell, Finite Markov Chains, Van Nostrand,
1960.

 

Loeve, M., Probability Theory, Van Nostrand, New York, 1955.

 

Martin, J.J., Bayesian Decision Problems and Markov Chains,
John Wiley and Sons, New York, 1967.

Mendel, J.M., "A Survey of Learning Control Systems," I.S.A.
Transactions, pp. 297-303, July 1966.

 

Nolte, L.W., "An Adaptive Realization of the Optimum Receiver
for a Sporadically Recurrent Wave Form in Noise," IEEE Trans.
on Information Theory, Vol. IT-13, pp. 308-401, April 1967.

[P-1]

[P-2]

[P-3]
[R-1]

[R-2]

[R-3]

[R-4]

[R-S]
[3-1]
[8-2]
[s-3]

[3-4]

[3-5]
[s-e]

[T-l]

83

Patrick, E.A., and J.C. Hancock, "Nonsupervised Sequential
Classification and Recognition of Patterns," IEEE Trans. on
Information Theory, Vol. IT-12, No. 3, pp. 362-372, Oct. 1966.

 

 

Patrick, E.A., "On a Class of Unsupervised Estimation Problems,"
IEEE Trans. on Information Theory, Vol. IT-14, No. 3, pp. 407-
415, May 1968.

Pearson, K.P., "Contributions to the Mathematical Theory of
Evolution," Phil. Trans. Roy. Soc., 185A 71-110, 1894.

Rao, C.R., Advanced Statistical Studies in Biometric Research,
John Wiley and Sons, New York, 1952.

Raviv, J., "Decision Making in Incompletely Known Stochastic
Systems," Int. J. Eng. Sci., Vol. 3, pp. 119-140, Pergamon
Press, 1965.

 

Raviv, J., "Decision Making in Markov Chains Applied to the
Problem of Pattern Recognition," IEEE Trans. on Information
Theory, Vol. IT-3, No. 4, pp. 536-551, Oct. 1967.

Rider, P.R., The Method of Moments Applied to a Mixture of Two
Exponential Distributions, Ann. Math. Statist., Vol. 32, pp.
143-147, 1961.

Royden, H.L., Real Analysis, Macmillan, New York, 1963.

 

Sakrison, D.J., "Stochastic Approximation: A Recursive Method
for Solving Regression Problems," Advances in Communication
Systems, Vol. 2, 1966, Academic Press, New York, 1966.

Sammon, J.W., “An Adaptive Technique for Multiple Signal
Detection and Identification," IEEE Convention Record, 1967.

Scudder, H.J., "Adaptive Communication Receivers," IEEE Trans.
on Information Theory, Vol. IT-11, pp. 167-174, April 1965.

Slansky, J., "Learning Systems for Automatic Control," IEEE
Trans. on Automatic Control, Vol. AC-11, No. 1, pp. 6-19,
Jan. 1966.

Spragins, J.D., Reproducing Distributions for Machine Learning,
Stanford Electronic Lab., Technical Report No. 6103-7, Nov. 1963.

Spragins, J.D., "Learning Without a Teacher," IEEE Trans. on
Information Theory, Vol. IT-12, No. 2, pp. 223-230, April 1966.

 

Teicher, H., "On Mixtures of Distributions," Ann. Math. Statist.,
Vol. 31, pp. 55-73, 1961.

84

[T-Z] Tucker, H.G., A Graduate Course in Probability Theory,
Academic Press, 1967.

[W-l] Wilks, S.S., Mathematical Statistics, John Wiley and Sons,
New York, 1962.

[Y-l] Yakowitz, S.J., and J.D. Spragins, "On the Identifiability
of Finite Mixtures," Ann. Math. Statist., Vol. 39, No. 1,
pp. 209-214, 1968.

APPENDIX

APPENDIX

Identifiability and Prior Information
for Unsupervised Learning

In Sec. 3.5 and 3.8, it was shown that when the set of mixtures
which arise in the estimation problem of Sec. 3.1 is identifiable, the
unknown parameter vector B0 can be uniquely determined from the
observations. To ensure this condition constraints must be imposed on
the family of component densities and the parameter space. Thus in a
particular decision making problem a certain amount of prior information
is needed to guarantee solution. This Appendix deals with sufficient
conditions for identifiability. Most of the results are taken from
the literature but are presented from a slightly different point of
view, more suitable for the problems of interest in this Thesis.

Sufficient conditions are also established for class identifiability

introduced in Sec. 3.8.

The Uniqueness of Representation Property
for Finite Mixtures

Let 8 = {F(X;a)}a€A be a family of joint densities indexed by
a point a taking values in a subset of a finite dimensional Euclidean

vector space A. Let

k k
HR = {H(X) = 2 C.F(X;or.), c. > o, z c. = 1, F(x;a.) e 8}
, 1 1 1 1 1
1=1 l
k . . . °° k .
where the {ai}1 is a distinct set of elements. Then. &'= U H 13
1

the set of all finite mixtures of the family 8. The set ﬂ' is said

85

86

to have the uniqueness of representation property (urp) if

R k
.2 czF(x;ai) = .2 0119(me (A-l)
1=1 1=1

implies k = k and for each i, 1 S i S k there is some j, 1 S j S k

A A

such that Ci - Cj and ai = aj. The URP can be restated as saying
there is a one-to-one correspondence between each set of allowable
points {C1,ai}: and the mixture they generate. As indicated by the
following theorem this property can be characterized by 8.

Theorem A-l. A necessary and sufficient condition that H’ have
the URP is that 8 be a linearly independent set over the field of
real numbers.

This theorem can be used to establish the URP for the set of
finite mixtures generated by the following families.

1. The family of n-dimensional Gaussian cdf's indexed by the

mean and/or the covariance matrix.

2. The family of n-dimensional exponential cdf's indexed by

the exponent constant.

3. The translation parameter family induces by any cdf with

a bilateral Laplace transform.

That is any finite set of distinct elements in each of the above

families is a linearly independent set.
Identifiability and the Estimation Problem

For the estimation problem Sec. 3.1 the class of mixtures of

1nterest 13 HH = {Kh(X;B)}B68 where

M
Knows) = iflniﬂxmi) (A-Z)

87

In (A-2) B = [B1,...,BM], f(X;Bi) E {5, and B is a subset of M-
dimensional Euclidean space containing only vectors with distinct
components. Then HTT is said to be an identifiable class of para-
meter indexed mixtures if the mapping defined by (A-2) say Qn is a
one-to-one mapping from 46 onto H". Then if Kh(X) E H" there

exists a unique vector B such that K%(X;BO) = Kh(X).

0
Since Hn<2 H, if H has the URP so does H". However the

URP guarantees the uniqueness of B only to within an equivalence

0

class. That is permutations of the components of B might result

0
in another solution vector. This would be the case if all the com-
ponents of n were not distinct. Thus in order to guarantee iden-
tifiability as defined here a constraint on the parameter space is
needed as well as URP. In a practical situation this constraint will
be a reflection of prior information on a particular problem. For
example, if M = 2, HI = n2 = 5 and it is known B1

is the parameter associated with state i of the system, then with

>
32’ where Bi

the URP BO can be uniquely determined by minimum distance estimation.
It is important to realize also that in order for the Bayes algorithm
to learn Bo these constraints must be reflected in the prior dis-
tribution PO(B). If one is not interested in using the estimates

for decision making then constraints can be arbitrarily imposed to get
a unique solution vector. The definition of identifiability and all

the above remarks can be extended to the case in which n is unknown

and the parameter space is 2M-dimensional.

 

34Identifiability is usually defined for this larger class of
mixtures [H-2] and the above definition is a consistent modification.

88
Class Identifiability

If 81 is the family of component CDF's associated with the
.th , i ,
1 class then as 1n Sec. 3.8 H . = {K (X;B )} . . 18 the corre-
n1 n1 31631
sponding set of mixtures with parameter space 181 and probability

L
1
state vector n . According to definition 3.3.1 H = U H is

1
i=1 "
said to be class identifiable if {H i}? are disjoint subsets of H".
n
Then from Theorem A-1 and a simple contradiction argument any of the
following conditions are sufficient for class identifiability of H”.
L
l

L . .
l. S = U 81 is a linearly independent set, {n1} is a set

1=1
of vectors distinct to within permutations on the components.
2. 8 is a linearly independent set by 51 n 5'1 = d) i 9‘ j
L .
3. S is a linearly independent set and ﬂ 81 = ¢
i=1

Assuming 8 is a linearly independent set, conditions 1 and 3
above indicate that classes of states can be distinguished when the
stationary Probability vector associated with different classes are
distinct or the component densities associated with different classes
have distinct forms. However condition 2 indicates that if sufficient
prior information is available concerning the parameters associated
with each class the classes can be distinguished even when the class
transition matrices are equal and the component densities for each
class have the same form.

The above conditions are sufficient and it is clear that they
can be weakened. For example if 81 is not a linearly independent
set, its mixtures do not necessarily equal those corresponding to
other classes. Also the components of B3 need not be distinct to

distinguish classes. In fact when n is known the distinctness of

 

89

the components of B is needed only to ensure the system states can

0
be distinguished in some sense and is not necessary to estimate B0.
Finally when the class transition matrices are distinct but the corre-

sponding stationary probability vectors are not, a second order esti-

mation strategy (Sec. 3.7) may still lead to an estimator for 10.

m1
L mg
”7
"4
“7
“1
m3