ESTIMATION AND ADAPTIVE DECISION MAKING FOR PARTIALLY OBSERVABLE MARKOV SYSTEMS Thesis for the Degree of Ph. D. MICHIGAN STATE UNIVERSITY DAVID T. SIGNORI, JR. 1968 7HESIS _ 1' 7- “2.5%,?" Michigan State University L1 This is to certify that the thesis entitled ESTIMATION AND ADAPTIVE DECISION MAKING FOR PARTIALLY OBSERVABLE MARKOV SYSTEMS presented by David T. Signori, Jr. has been accepted towards fulfillment of the requirements for _P_h._I_)_.__ degree in i121— .., fl,- /, ,7, \’/ I! if,“ / ,/ .;/.I I I, /./f._’ )7 fi-"lf/r/ / ”A ' 3(/////./ / ,~ .. "1.44'4,/r--' .,/-/-, - /L 4/“ -"‘ ' r‘AA L1 1 I Major professor Date Sept. 25. 1968 0-169 P‘;_j._ i“ ’“dfm‘fw alu‘smo av “7 HOME 8: SUNS' 300K BINIIERY INC. ' "’RARY BINDERS ““‘l’. Ilcllfll ABSTRACT A Partially Observable Markov System (POMS) is a discrete state, discrete time system whose state activity is described by a Markov chain. The states of the system cannot be observed directly, but "noisy" observations are available. The main problem considered is that of determining rules for making decisions about system states when the conditional densities of observed random variables given the state of the system are char- acterized by a set of unknown parameters. Furthermore, it is desired that, as more observations are taken, these rules converge to the rule that would be used if the parameters were known. An iterative,optima1 (minimum Bayes risk), decision rule is derived for making decisions concerning the state of the system at a given time on the basis of available observations. This rule has the capability of using future observations as well as past observations. An optimal rule is also established for determining to which class the state of the system belongs among a set of non-communicating classes of states and an optimal, adaptive estimator is constructed for the parameters associated with the active class. Conditions are established under which these rules perform effectively. A variety of consistent estimators are constructed for the un- known parameters, yielding a class of suboptimum rules. The basic estimation problem is a nonsupervisory one involving the resolution of mixtures. However, unlike previous work, the observation process is dependent and nonstationary. A general strategy is established David T. Signori, Jr. for extending estimation techniques developed for the case of independent identically distributed observations to this problem. The results apply also to the non-parametric case and the case with unknown transition matrix. The model under study here corresponds directly to that of a Pattern Recognition System with Markov dependent pattern activity. How- ever, several communication systems of interest can be shown to be POMS. These include systems with a Markov encoder, intersymbol inter- ference, unknown synchronization, signals with random arrival times, and combinations of the foregoing. In all of these systems, the observations are dependent and the design of adaptive detectors is generally difficult. However, by formulating the problem as one of decision making for a POMS, optimal and suboptimal detectors as well as conditions for effective operation follow easily and in a unified manner . ESTIMATION AND ADAPTIVE DECISION MAKING FOR PARTIALLY OBSERVABLE MARKOV SYSTEMS B)! .,e/ G“ David T. Signori, Jr. A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Electrical Engineering 1968 To My Mother and Father ii ACKNOWLEDGEMENTS It is with great pleasure that the author thanks his thesis advisor Dr. R.C. Dubes for his guidance and encouragement during the course of this research Thanks are also due to Dr. H. Salehi and Dr. R.B. Zemach for their suggestions concerning some mathematical questions and to Dr. H.E. Koenig and Dr. G.L. Park for their interest in this work. iii Chapter I. INTRODUCTION .................................... 1.1 Partially Observable Markov Systems ........ 1.2 Optimal Decision Making for a POMS ......... 1.3 Unsupervised Learning ...................... 1.4 Review of the Literature ................... 1.5 Thesis Objectives .......................... II. OPTIMAL ADAPTIVE DECISION MAKING ........... ..... 2.1 The Decision Problem ....................... 2.2 Derivation of the Decision Rule ............ 2.3 Iterative Generation of the Posterior Densities .................................. 2.4 Analysis of the Iterative Procedure ........ 2.5 The Learning Features of the Optimum Decision Rule .............................. 2.6 Related Inference Problems ................. 2.7 Conclusions ................................ III. CONSISTENT ESTIMATORS ........................... TABLE OF CONTENTS ABSTRACT ACKNOWLEDGEMENTS LIST OF TABLES LIST OF FIGURES wwuwwww \onmJ-‘uNr-I WU) mm The Estimation Problem ..................... Properties of the Observations ............. The Basic Tools of Estimation .............. Classical Estimation Methods ............... Minimum Distance Estimation Methods ........ An Example ................................. Alternate Strategies for the Estimation Problem ....... . ....... .. ............ . ...... Adaptive Estimation and Class Estimation ... Conclusions ................................ iv Page iii vi vii 16 19 22 24 27 29 30 33 35 37 40 43 47 49 52 IV. EXAMPLES OF PARTIALLY OBSERVABLE MARKOV SYSTEMS 4.1 Pattern Recognition with Markov Dependent Pattern Activity ........................... 4.2 Adaptive Signal Detection with a Markov Encoder .................................... 4.3 Adaptive Detection with Intersymbol Interference ............................... 4.4 Adaptive Detection with Unknown Synchronization ............................ 4.5 M-ary Adaptive Detection of Signal with Random Arrival Times ....................... 4.6 Adaptive Detection of Signals with Random Arrival Times ............ . ................. 4.7 Remarks .................................... 4.8 Conclusions ................................ GENERAL CON CLUS IONS ............................. 5.1 Review .... ................................. 5.2 Extensions ................................. 5.3 Some Interesting Problems ................. . BIBLIOGRAPHY .................................... APPENDIX ........................................ 54 55 58 6O 63 67 7O 74 75 77 77 78 79 81 85 LIST OF TABLES Table Page 3.6.1 Computer Simulation Results for 20 Runs with E0 = 2 ........................................ 46 vi LIST OF FIGURES Figure Page 1.1.1 A Schematic of a Partially Observable Markov System ............. .... ....................... 5 4.1.1 A Pattern Recognition System .................. 56 4.2.1 A Communication System ........................ 58 4.5.1 The Channel Output for a M-ary Signal Set with Random Arrival Times .................... . ..... 68 4.6.1 The Channel Output for a Signal with Random Arrival Times ................................. 7O vii CHAPTER I Introduction A fundamental requirement of classical engineering techniques for the design of control and information processing systems is the existence of a fully specified model of the system under study in- cluding all the factors that influence system performance. However, factors such as unpredictable changes in environment, drift in system parameter values due to component aging, or difficulties in measuring relevant quantities often make the development of an accurate model impractical. This problem has motivated the study of adaptive decision making devices such as controllers and detectors which can achieve a desired goal despite some degree of ignorance concerning the under- lying model. Such devices are characterized by the fact that they improve their performance on the basis of past experience. In effect they "learn" the additional information needed to complete the model in terms of which their task is defined. Workable adaptive schemes have been proposed for several types of communication and control systems [A-l][H-4][M-2][S-4]. In partic- ular, a large class of problems that arise from analyzing such systems can be posed as problems in Mathematical Statistics. In such problems Decision Theory provides the mathematical form of the adaptive device and a mathematical criterion for evaluating its performance. The learn- ing process is related to the well-defined problem of estimating the unknowns in a partially specified statistical structure. The only pre- requisite for using the powerful tools of Mathematical Statistics is the existence of a meaningful statistical model. The basic model under study in this Thesis is that of a discrete state, discrete time system whose state activity is described by a Markov chain. The system states cannot be observed directly because of an imperfect observation mechanism that can be accounted for statistically. This system will be referred to as a Partially Observ- able Markov §ystem (POMS). Such systems arise frequently in Pattern Recognition, Signal Detection, and Operations Research [D-BJEK-lJER-B]. The model for a POMS is precisely defined in Sec. 1.1. The main decision problem associated with a POMS is one of estab- lishing rules for taking effective action concerning the states of the system on the basis of available observations. In Sec. 1.2 optimal decision making is defined for a POMS when all quantities in its model are assumed known. The structure of the resulting rule is discussed along with its computational feasibility, a basic consideration through- outthis study. This Thesis deals with various aspects of decision making for a POMS when the model is not completely specified (not all the quantities in the model are known). Of primary interest is the problem of extract- ing from the observations information concerning the model unknowns. Hopefully, such information can be used to construct adaptive decision rules or rules which perform almost as well (in some well defined manner) as the optimal rules of Sec. 1.2 where the model is completely known. The problem of learning the unknowns in a POMS is discussed in Sec. 1.3. Previous work related to this problem is listed in Sec. 1.4 and in Sec. 1.5 the Thesis objectives are explicitly stated. 1.1 PARTIALLY OBSERVABLE MARKOV SYSTEMS The basic model considered in the Thesis is established in this section. The model is composed of two random processes. The first is a discrete-time, finite-dimensional Markov chain which cannot be observed. The second is an observable process with the property that the random variable describing the observations at a given time has a distribution which depends on the state of the chain at that time. The model corre- sponds to that of a system whose states cannot be observed but must be monitored indirectly through a "noisy" observation mechanism, which suggests the name Partially Observable Markov System (POMS).1 More specifically, the state activity of the system is described by a first order homogeneous Markov chain; that is, a sequence of ran- dom variables {AN; N = 1,2,...} taking values in a finite state space A = {l,2,...,M}; M«< m and satisfying the Markov Property. Namely, if P(-) is a probability measure defined on the same sample space as the sequence {AN}: then P(AN = le N_1 = i,...,A = k) = P(AN = JliN_l = 1) = p 1 IJN > 1 and i,j = l...M (1 1.1) where pij is the probability the system is in state j at time N given it was in state i at time N-l. Hence, knowledge of the last state summarizes the past history of the system. The probability state vector of the system is defined by EN = [P(XN=1)...-.P(KN=M)] (1.1.2) 1This is a generalization to a continuous observation space of what has been previously referred to as a POMS [D-3],[K-1]. where P(AN=i) is the probability that the system is in state i at time N. Then _ N-l _ RN - RlP - RN-lp (1.1.3) where P = [pij] is a stationary transition matrix and .21 is an initial probability state vector at time 1. Then ‘21 and P are sufficient to summarize the prior knowledge (knowledge before any observations are taken) of the state activity of the system. When the past history of the system is summarized by knowledge of the last k states, the describing random process is termed a kth order Markov chain. Since any kth order chain can be reduced to a 18t order chain, the above chain implies higher order chains [D-Z]. The observation process is defined by a sequence of random vari- ables {KN}T; XN, the random variable observed at time N, takes values in a finite-dimensional Euclidean space and has a density function fi(-) when it is known the system is in state i at time N. That is, fi(-) is the conditional density of the observations given the system is in state i. Since the states of the system are unknown, XN has the global density pN(X) = .E l— f.(X) PO» =i) (1.1.4) 1 1 N which is referred to as a finite mixture with component densities f (- M and mixing parameters [T-l . The sequence X }m is 1 1 2N N 1 assumed state conditionally independent. This implies, for example, . _. _ 2 p(xN’bullxqu’I‘N—n—J) - fi(XN)fj(xN+l) (1‘1'5) 2p(°) will denote probability density with Xi indicating both the random variables and the value it takes on. and hence the joint density of any number of random variables in the observation process can be constructed from the component densities and the system probabilities p1 and P. £100 —"\ lr-‘IE f2 --\, K XN g4...) _/ FIG. 1.1.1 A Schematic of a Partially Observable Markov System The entire model is illustrated by the sampling scheme in Fig. 1.1.1. At time N a sample is taken according to a density determined by the state of the system at time N. 1.2 OPTIMAL DECISION MAKING FOR.A POMS When the model is completely specified, Decision Theory provides a means of generating optimal strategies for action relative to the states of the system. In this section the type of decision rules of interest in the Thesis are illustrated along with important properties of this class of rules. In order to define the decision problem, the following elements of Decision Theory are introduced. The action space or set of allowable actions that can be taken at a given time is denoted by A = {a1,...,ar}, r < m with generic element a; L(-,°) is a non-negative loss function defined on A X A with L(a,i) denoting the loss incurred when action . . . . m a 18 taken and the system IS in state 1; X = [X1,...,Xm] represents the set of observations obtained up to time m. The problem, then, is . . . . . m to find a nonrandomized dec131on function 6N mapping X to A such that the risk 11(1)“) = E L(eN(xm),>.N) (1.2.1) is a minimum. The expectation is taken with respect to Xm and AN. If, for example, r = M and the action ai is "say AN=i" and 3 L(i,j) = l - Aij (the 0-1 loss function) the optimal decision rule, 5:, is given by the Bayes decision rule 52mm) = i if P(>.N=i/x'“) 2 1>(>.N=j/xm) Vjaéi (1.2.2) where P(AN=i/Xm) is the conditional probability that the system is in state i at time N given observation Xm [R-B]. Using the assumption of state conditional independence the above rule can be written iteratively. For example, if m = N it follows from Bayes rule and the Markov property that . -1 fi(XN)P(AN=1/XN ) N P(A =i/X ) = _ (1.2.3) N p(XN/XN 1) where . N-l M N-l P(>~N=1/x ) = z pkiP(AN_1=k/X ) (1.2.4) 1 and P xN-1 _ M _. N-l (XN/ ) - ‘12: fi(xN)P(AN-1/x ). (1.2.5) 3 Aij denotes the Kronecker delta. The iterative scheme operates in two basic steps. The state posterior probability P(AN_1=i/XN-1) computed at step N-l is projected from the transition probabilities in (1.2.4) to P(AN=i/XN-1) which serves as a prior probability for the state activity at time N before the Nth observation is available. This in turn is converted in (1.2.3) to a posterior probability using xN and the Bayes Rule. It is worth noting that, because of the Markov dependencies be- tween the states of the system at different times, observations are in general dependent. Consequently, observations at one time may contain information about the states at other times. This point is reflected in the above decision rule by the fact that both past and present observations are used to make decisions about the state of the system at a given time. By the same token the sequence of decision made by {6:}? can usually be improved if XN is used to classify all past states simultaneously. However, rules of this type lead to memory re- quirements which grow linearly and exponentially with the length of the observed sequence and decisions are not available for immediate use [c-1]. The type of decision rules established in this section can be thought of as a class of on-line rules with fixed memory but changing structure. The case where m'> N, termed a look ahead mode, is an attempt to improve performance by increasing memory by a fixed amount. 1.3 UNSUPERVISED LEARNING If the model for a POMS is only partially specified the pre- viously established decision rules can be considered functions of the unknowns in the model. For example, the transition matrix and/or the component densities may be unknown or may contain unknown parameters. To make effective decisions under such circumstances information about these unknowns must be extracted from the observations. Since the states of the system are unobservable the unknowns appear in a mixture. The process of estimating or approximating these unknowns is commonly referred to as unsupervised learning. This is in contrast to the case in which, by some external means, the states of the system are known for a fixed period of time and each of the component densities and the transition matrix can be determined separately or with supervision [Bel] [M- 1][H-4]. As an example of the above class of estimation problems a POMS is considered with a transition matrix which has identical row vectors. That is pij = q- j=1)2!"')M (19.391) If £1 is given by a row of P, the probability state vector EN is independent of time and xN, the random variable observed at time N, has density M 1300 = Z quj(x> vN (1.3.2) 1 Then the state conditional independence assumption implies {x1}: is a sequence of independent identically distributed (i.i.d.) random vari- ables. This is a well-studied case and the tools of classical estima- tion theory have been used to develop several mixture resolving tech- niques for various degrees of uncertainty about (1.3.2) [S-6][H-2]. Some of these methods are discussed in Chapters II and III. However, for a general transition matrix and initial probability vector the observations are neither independent nor identically dis- tributed and the above techniques do not apply directly. 1.4 REVIEW OF THE LITERATURE Research related to adaptive decision making for a POMS has been motivated by the need to handle decision problems in which the usual independence and stationarity assumptions mentioned in Sec. 1.3 do not hold. The results outlined in this section can be categorized according to what is assumed to be unknown in the model. For the case in which the only unknowns are parameters in the component densities, most of the work consists of attempts to design optimal (Minimum Bayes Risk) and suboptimal adaptive detectors for communication systems wherein the signal is unknown but, for various reasons, the observations are dependent even when the signal is known. Some examples of conditions which result in dependent observations are intersymbol interference or signal overlap due to channel memory (Chang [H-l]), unknown symbol synchronization between the transmitter and receiver (Stewart [H-3]) and random signal arrival times (Stewart [H-3] and Nolte [N-l]). The main problem associated with these examples is to establish conditions under which the unknown parameters can be learned. In the first two examples Chang and Stewart were able to establish such con- ditions because they assumed signal activity at the transmitter was 5 . independent from one time interval to the next. Their results do not Optimal and suboptimal adaptive decision rules are defined at the beginning of Chapters II & III respectively. The correspondence between a POMS and a communication system is made in Chapter IV. 5 . . . . This corresponds to assuming a special form for the tran31tion 10 apply when such dependencies exist. In the example of signals with random arrival times Stewart was unable to prove convergence of his estimates and Nolte does not treat the problem. These examples are discussed in more detail in Chapter IV. For the case in which the component densities are assumed known but the transition matrix P is unknown Raviv [R-Z] constructed a class of adaptive decision rules using an estimate of P and only part of the past observations. He established conditions under which P can be estimated and developed some properties of the observation process for a large class of POMS. These properties are stated in Chapter III. Recently, Patrick [P-2] and Hilborn and Lainiotus [H-S] have made some general observations concerning non-supervisory problems with non-stationary, dependent observations. The work in this Thesis related to their results was done independently and deals with a particular model which yields more specific results. 1.5 THESIS OBJECTIVES For most of the work in this Thesis it is assumed that the tran- sition matrix P is known and that the component densities are known to within a parameter.6 For this case, the Thesis objectives can be stated generally as follows: 1. To find a class of optimal adaptive decision rules with properties similar to those of the rules of Sec. 1.2. 2. To show that, under appropriate assumptions, virtually all the mixture resolving techniques developed for the i.i.d. 6The estimation of P and f (X) for the nonparametric case is discussed briefly in Chapter III. 11 case defined in Sec. 1.3 can be extended to the dependent, non-stationary case considered here. 3. To show that a variety of decision problems arising in communication systems, including those of Sec. 1.4, can be easily solved by considering them as decision making pro- blems for a POMS. Fulfillment of Objective 1 provides a reference for evaluation of any related adaptive scheme. Objective 2 implies conditions under which the unknown parameters can be learned and leads to a class of suboptimum decision rules. Objective 3 suggests that the model of a POMS is a very versatile one, providing a unifying approach to a class of communications problems. Objectives 1, 2, and 3 are pursued in Chapters II, III, and IV reSpectively. In Chapter V the main results of the Thesis are outlined and problems that need further study are discussed. CHAPTER II OPTIMAL ADAPTIVE DECISION MAKING When the component densities of a Partially Observable Markov System (POMS) are specified to within a parameter set and a prior dis- tribution7 summarizing initial knowledge about the unknown parameters is available then Bayes decision-theoretic techniques can generally be used to establish decision rules which are optimal against prior in- formation and a given cost function. Furthermore, under appropriate conditions, the fixed but unknown value of the parameter is learned from the observations and the decision rule adapts or converges to what the optimal rule would be if the true parameter value were known. Patrick [H-2] derived the optimal decision rule for the i.i.d. case. In Sec. 2.1, the decision problem is defined and in Sec. 2.2 the corresponding optimal decision rule is derived. In Sec. 2.3, the basic components of the optimal rule are generated recursively and in Sec. 2.4 the structure and computational feasibility of the iterative scheme are discussed. In Sec. 2.5, the learning properties of the rule are discussed and in Sec. 2.6 some inference problems related to that defined in Sec. 2.1 are treated. Finally, in Sec. 2.7, the main results of the chapter are summarized. 2.1 THE DECISION PROBLEM In this section, the decision problem under consideration in this chapter is defined. The basic elements of the problem are a POMS with an unknown parameter set and a prior density for the parameters, 7 . The standard Bayesian technique of treating the unknown parameter as a random variable will be employed. 12 13 a class of decision functions and a criterion for evaluating the per- formance of these functions. As in Sec. 1.1, the state activity of the POMS is described by a Markov chain {AN}: with transition matrix P and initial probability state vector ‘21° The component densities are characterized by a para- meter vector in the following manner. When the system is known to be in state i and the parameter B1 is given, XN’ the random variable observed at time N, has density f('/Bi).8 The random vector B = [31...BM] takes on an unknown value B according to the prior 0 density pO(B) and maintains this value throughout system operation. The basic assumptions are as follows. , m l. The observations X = [X1...Xm] are state-parameter-con- ditional independent. This implies p(Xm...X /x = 1,... A = j,B) 1 m 1 p(xm/>.m = 1,Bi)...p(x1/>.1 = j,Bj) f(Xm/Bi)...f(X1/Bj) (2.1.1) or p(xm/1m=i,xm‘1,s) = p(xm/1m=i,Bi) = f(Xm/Bi) (2.1.2) 2. For each N, the random variables B and AN are independent. That is, the parameter values do not affect system state activity. If, as in Sec. 1.2, A is the action space, the class of allow- able decision rule, D, is the set of all non-randomized functions map- ping the space of observations Xm to A. If SN 6 D denotes a decision rule for taking action relative to the state of the system at The random variables Xi and Bi take values in a finite- dimensional Euclidean vector space. 14 time N and L(°,') is a nonnegative loss function, the corresponding risk is given by R(6N) = E L(6N(xm),iN) (2.1.3) . . . m where the expectation is taken With respect to X , AN, and B. The problem is to find the optimal rule, 6N which is defined *9 by R(e§) s R(6N) v/éN e D (2.1.4) That is, the minimum risk decision rule for taking action concerning the state of the system at time N on the basis of m observations is to be found. 2.2 DERIVATION OF THE DECISION RULE In this section, the optimum decision rule is derived for the problem defined in Sec. 2.1. The development involves the use of con- ditional risks which emphasize the role of prior information in con- structing the total risk. With the decision rule 6 E D, given observation Km, and given parameter B the average loss is R[6(Xm)/Xm,B] = z L[a(xm),1]P(iN=i/xm,s) (2.2.1)9 i=1 Equation (2.2.1) is referred to as the samp1e-parameter-conditional risk. The parameter—conditional risk is given by R(6/B) = IR[5(xm)/xm,B]p(xm/B)dxm (2.2.2) N 9The superscript in 6 has been dropped for convenience. 15 and the total risk, (2.1.3), is R(6) = j‘ R(6/B)pO(B)dB (2.2. Substituting (2.2.1) and (2.2.2) into (2.2.3) and interchanging the order of integration yields 12(5) = j‘ R[6(Xm)/Xm] p(x"’)c1xIn (2.2. where Ill m Ill R[5(x )/x ] = *2 L[e(x"‘),i]p(iN=i/xm) (2.2. i=1 and P(>.N=i/xm) = f P(AN=i/Xm,B)p(B/Xm) dB (2.2. P(B/Xm) = P(Xm/B)PO(B)/P(Xm) (2.2. p(Xm) = I p(Xm/B)pO(B)dB (2.2. Since 6(Xm) = a for some a E A, and L(~,') and p(Xm) are non- negative it follows that R(6*) = inf 11(5) = E R[6*(Xm)/Xm] (2.2. 66D where 5*(xm) = ai if Mai/x”) s R(aj/Xm) Vj aé i (2.2. In particular, when r ==M, ai is the action ”say AN=i" and the loss function is given by L(ai,j) = 1 - Ai (0-1 loss function) 3 then the decision rule becomes a minimum probability of error rule 6e defined by 66(xm) = i if P(xN=i/xm) 2 P(AN=j/Xm) Vi 34 j (2.2. Minimum sample conditional risk implies minimum total risk. 3) 4) 5) 6) 7) 8) 9)10 10) 11) 16 The basic elements of the above decision rules are the state posterior probabilities given by (2.2.6). These probabilities are obtained by averaging the parameter-conditional posterior probabilities P(AN=i/Xm,B) (This is the probability that would be used for decision making with B = B if B were known) over the posterior density 0 O p(B/Xm). This posterior density summarizes knowledge about B0 in the first m observations. Both terms are generated iteratively in the next section. 2.3 ITERATIVE GENERATION OF THE POSTERIOR DENSITIES In this section, the key posterior densities in the decision rule derived in Sec. 2.2 are generated iteratively. First, P(AN=i/Xm,B) is generated for m = N, m > N and m < N. Then p(B/Xm) is treated. Case 1. m = N From the Bayes rule, _, m-l =. m-l p(xm/xm-1.B.x )Pam 1/x ,B) P(Am=i/Xm,B) = m_1 (2.3.1) B p(Xm/X . ) where the three terms on the right hand side can be accounted for in the following three steps. 1. p(Xm/Am=i,B,Xm-1) is given by (2.1.2) 2. By the total probability law _. m-l _. m-l m_1-J,X ,B)P(im_1—J/x ,B) (2.3.2) m-l M P‘N (2.3.4) Using arguments similar to those used in the previous development, it follows that p(xN+1...xm/AN=i,xN,E)p(iN=i/xN,E) —.' m _ P(AN—i/X ,B) - p(xN+1...xm/xN,B) (2.3.5) where p(AN=i/XN,B) is known from the previous iteration using steps 1, The iteration procedure would be complete if p(XN+l...Xm/AN C011 2, 3 above and with m replaced by N M N P(XN+1...Xm/X ,B) = iElp(xN+l...xm =- N =. N /XN 1.x ,B)P(>.N 1/x ,B) (2.3.6) N =1,x ,B) Id be determined. This factor should be recognized as the heart of the look-ahead procedure that results from using future observations. To generate this last factor it is convenient to define Then, by (2 1 2), But p = E p(zN/1Nm i=1 M. = 2 MN /1N= j 1 = p(zN/1N=i,B) .... =1 BMW/Ir . ... ’KN+1=j’B)pfi N z = [xN+1...xm (2.3.7) (2.3.8) ]. 18 Similarly, N . . N . . p(Z /XN=1,XN+1=J,B) = E pjkp(Z /kN=1,XN+1=J,AN+2=k,B) (2.3.9) Hence this procedure can be repeated until N _, = = N =. p(Z /xN-1...km_l t,B) E ptqp(z /XN 1,K N+l=j,...,km=q,B) (2.3.10) where N -' =' = = p(Z /XN-1,XN+1 J,...,Xm q,B) f(XN+1/Bj)...f(Xm/Bq) (2.3.11) Although the above procedure shows the decision rule to be a fixed- memory rule (the number of observations stored is fixed) the memory in- creased linearly with m-N and the number of computations in (2.3.8) to (2.3.11) increases exponentially with m-N. Hence, the look-ahead is expensive. Case 3. m_< N Past samples are being used to make decision about future states. For m < N lltdCZ _' m _ P(xN-1/x ,B) - _. =. m =. m j P(xN-1/xm J,X ,B)P(Xm J/X ,B) (2.3.12) 1 But, by (2.1.2), —. a... m _ :0 =0 t - , . which is the (j,i) h element of PN m and P(Xm=J/Xm,B) can be obtained iteratively using steps 1, 2, 3 above. In this case, the number of computations involved in making decisions on future states increases linearly with m-N. 19 Finally, an iterative form for p(B/Xm) is established. p(xm/xm'1,B)p(B/xm'l) p(B/Xm) = (2.3.14) p(xm/xm’l) where p(Xm/Xm-1,B) is given by (2.3.4), p(B/Xm-l) is available from the previous step and p(xm/xm'l) = f p(xm/xm’1,3)p(B/xm’l)d3 (2.3.15) 2.4 ANALYSIS OF THE ITERATIVE PROCEDURE In this section, a special but important case of the previous decision rule is studied. The iterative structure of the rule is in- vestigated and interpreted. The start of the iterative procedure is illustrated and the problems encountered in machine implementation of the rule are discussed. If m=N and a 0-1 loss function is used then the result is the minimum probability of error rule, be, given by (2.2.11), for determin- ing the present state of the POMS, described in Sec. 2.1, on the basis of past and present observations. This rule is summarized below. 6e(xN) = i if p(lN=i/xN) 2 P(>.N=j/XN) Vi aé j (2.4.1) where, from (2.2.6), P(xN=i/xN) = f p(xN=i/xN,B)p(B/xN)dB (2.4.2) with, from (2.3.1) and (2.3.2), f(xN/Bi)P(xN=i/XN'1,B) p(xN/XN’1,B) (2.4.3) P(xN=i/XN,B) = and 20 M . N-l , - P(lN=i/x ,B) = jEIPjiP(XN_1=J/XN 1,3) (2.4.4) and, from (2.3.14), N p(xN/XN'1,B>p(B/XN'1) p(B/X ) = (2.4.5) p(XN/XN'I) All the terms in (2.4.2)-(2.4.5) are either known, available from the previous step or can be calculated from those given above. The rule 6e is a fixed memory, iterative, optimal decision rule. The parameter-conditional state posterior probability, P(AN=i/XN,B), is the state posterior probability given by (1.2.3) (where B0 was assumed known) as a function of the unknown parameter. Equations (2.4.3) and (2.4.4) generate this term iteratively in a manner similar to that of Sec. 1.2 but conditioned on knowledge of the unknown parameter. That is, this probability at time N-l is projected with the transition matrix in (2.4.4) to P(XN=i/XN-1,B) which is used in the Bayes rule in (2.4.3) in incorporate the information provided by XN and to generate P(XN=i/XN,B). Everything in the procedure is con- ditioned on knowledge of B0 and information about B0 is summarized by P(B/XN) which is generated iteratively in (2.4.5) and is intro- duced into the decision rule by the averaging procedure given in (2.4.2). The starting procedure for the iterative scheme is given below. At step 1 £(x1/Bi)P(kl=i) p(Xl/B) P(x1=i/x1,B) = (2.4.6) where then where At step 2 where and then where 21 M p(Xl/B) = 2 i: 1£(xl/Bi)P(>.1=i) (2.4.7) p(XllB)p0(B) p (X1) P(B/X1) = p(X1) = I p(Xl/B)pO(B)dB fi(x2/Bi)P(l2=i/XI,B) p(xz/x1.B) P(x2=i/x1,x2,B) = m p(Xz/X1,B) = iElfi(x2/Bi)P(x2=i/xl,3) P(x2=i/x1,B) = § pjiP(x1=j/x1,B) p(XZ/X1,B)p(B/X1) p(x2/x1) P(B/X1,X2) = p(lexl) = I P(X2/X1,B)P(B/Xl)dB this procedure can be repeated up to time N. Despite the desirable features of the above rule it has one major drawback. At each step, m+l functions of B must be stored. If this rule is to be machine implemented it can only be done under one of the following conditions which are characteristic of general Bayesian learning. 22 l. The parameter vector B is a discrete random variable taking on a finite number of values. That is, B0 is known to be one of a finite number of values. This allows storage of the required densities but is a highly restrictive assumption. 2. The parameter vector B is a continuous random variable but a finite dimensional sufficient statistic is available for the un- known parameter. Under these conditions, only a function of the observations need to be stored [8-5]. However, such statistics do not usually arise in unsupervised learning problems because the class of densities involved are mixtures. 3. In general, the only way to make use of the rule is by quantization of the parameter space, thus reducing the problem to case 1 above. By quantizing fine enough it is possible to get arbitrarily close to the optimum solution at the expense of increased memory [F-l]. However, the memory grows exponentially with the dimension of the parameter vector B, making the method feasible only for problems with a small number of parameters. An example is given in Sec. 3.6 which indicates the extent of the storage limitations. 2.5 THE LEARNING FEATURE OF THE OPTIMUM DECISION RULE In this section, some limiting properties of the posterior densities that comprise the optimum decision rule (2.2.10) are dis- cussed in a manner that exhibits the learning capability of the rule. First it is shown that the state posterior probability P(XN=i/Xm) given by (2.2.6) with N fixed converges with probability one as the number of observations is increased. Then, the conditions under which 23 p(B/Xm )m 6(B- B 0) wp 1 (2.5.1) where 6(B-BO) is the Dirac delta function are stated. Equation (2.5.1) can be restated as saying that the posterior distribution of B, which summarizes all the information about B0 contained in Xm, approaches wpl a distribution whose mass is grouped about BO so that B0 is learned. Then, for all practical purposes, p(XN=i/Xm) 3 p(lN=i/x°,BO) (2.5.2) and the rule adapts, or converges, to the optimal rule that would be obtained if B0 were known. The statistical stability of P(XN=i/Xm) = Ym can be demonstrated by showing {Ym} to be a bounded martingale. Then, convergence follows immediately from a theorem of Doob which says that every bounded martin- gale converges with probability one [D-Z]. To show Ym to be a martin- gale it is sufficient to prove that m EEYm+1/x ] — Ym But _, m+1 m+1 m [YmH/xml —IPp(x /x)<1xm+1 m+1/).N=i)P(XN =i) =fp( pp; de+ pP( =1) = N N. p(Xm) m Since for each m Ym is a probability, |Ym|s1 vm 1The implied interchange of the limit and integration process has been carried out formally and (2.5.2) has not been proven for any mode of convergence. 24 and Ym is bounded. Therefore lim Y exists a.e. in the space of m m sequences {X }. The conditions under which the parameter posterior density approaches a delta function are well known. Spragins [8-5] and Braver- man [B-3] have given sufficient conditions for (2.5.1) to hold through an interpretation of the 0-1 law of probability [L-l]. These conditions are presented below. N 1. p(B/X ) is computed using the Bayes rule. 2. p0(B) is positive in some sphere about BO G 3. There exist functions {fm(Xm)}1 of the observations such that m fm(Xm)'- B0 wpl That these conditions are satisfied for decision rule (2.2.10) is now demonstrated. Condition 1 follows from (2.3.7). Condition 2 is assumed. Condition 3 is the major requirement and can be restated as saying that a strongly consistent estimator for BO must be exhibited. In Chapter III, a variety of such estimators are established by placing constraints on the transition matrix and the family of component densities. 2.6 RELATED INFERENCE PROBLEMS Some of the properties of the previously-derived decision rule can be used to solve additional decision problems of interest. For example, in the adaption process, B0 is learned but estimates of B0 are never actually generated. Since the parameter posterior density converges wpl to a dirac delta at B any property of this density, 0, 25 such as the mean, maximum or median, converges wpl to B0 also. Furthermore, it is well known that the mean of the posterior density is the minimum mean-square estimate of B and hence can be interpreted O as an optimal estimate for B0. Another decision problem of interest is that of determining which class or subset of the states the system is in. In particular, if the transition matrix P is block diagonal with q blocks, the states of the system are divided into q classes with the property that the system stays in the class in which it starts.12 Then, if P(wi/Xm) is the conditional probability that the system is in class i, I?(W./XN) = 2 P(x =j/xN) i = l,2,...,q (2.6.1) 1 jEv. N 1 where Yi is that subset of the state Space corresponding to class i. The probability P(XN=j/Xm) can be generated iteratively in the manner of Sec. 2.3 and an optimal decision rule for choosing the system class on the basis of XN is to pick the class for which the class posterior probability P(wi/Xm) is largest. In order to exhibit some of the properties of the above decision rule, (2.6.1) is rewritten as N-l N-l P( /W.,X )P(w./X ) P(wi/XN) = K” 1 N_1 1 (2.6.2) p(XNfiX ) where - q - - p(XN/XN 1) = i§1p(xN/wi,xN 1)P(wi/XN 1) (2.6.3) and 2 l O O l O The corresponding Markov chain 15 said to have q noncommunicat- ing classes. 26 N-l , ' - - - p(XN/wiJC ) = jESI p(igq/XN=J,wi,Bl,XN 1)1=(>.N=j/wi,xN 1,131) i l p(Bi/wi,xN' )dBi (2.6.4) . i . . . .th w1th B the parameter vector for the component den31t1es 1n the 1 class and {P0(wi)}q, the prior class probabilities, assumed known. Then (2.6.2) indicates that the class posterior probabilities can be generated using the Bayes rules with normalizing factor given by (2.6.3). All the terms in the mixture (2.6.4) are conditioned on knowledge of the class and thus can be generated iteratively using only the block of the transition matrix and component densities corresponding to the given class. By an appropriate interpretation of Spragins' conditions for convergence, it follows that if P0(wi ) > O and there exists a 0 strongly consistent estimator for i0 then p(w./x‘“) 9—. A,, wpl (2.6.5) J J10 where 10 is the true class. Conditions for such an estimator to exist are discussed in Sec. 3.8. Finally, the above two decision problems can be combined into that of estimating the unknown parameters defining the class in which the system is. If the parameter vectors are the same dimension for each class, the optimal estimate is given by the mean of q p(B/XN) = >3 p(B/wi,XN)P(wi/XN) (2.6.6) i=1 This procedure is referred to as adaptive estimation and conditions for convergence are given in Sec. 3.8. 27 2.7 CONCLUSIONS Optimal decision making for a POMS with unknown parameters in the component densities has been the topic of this chapter. Assuming a prior distribution over the parameter space, an optimal decision rule was defined in Sec. 2.1 to be that rule in a given class of rules which minimizes the Bayes risk (2.1.3). This class of rules, similar to that of Sec. 1.2 where the component densities were assumed known, in- cludes rules with the capability of using some future observations (look-ahead mode) as well as past observations to make decision about the state of the system at a given time. The optimal decision rule (2.2.10) was derived in Sec. 2.2 and its basic components (2.2.6)-(2.2.8) were generated iteratively in Sec. 2.3. It was shown that for the look-ahead mode the memory grows linearly and the number of computations exponentially with the number of future samples used. However, the extent to which future observa- tions affect the risk needs investigation. It is clear from the derivation that these results can be extended to include time-varying transition probabilities. As emphasized in Sec. 1.4, the optimal decision rule is a fixed- memory, iterative rule with a structure similar to that of the rules in Sec. 1.2. In general, only a high storage quatization procedure can be used to implement the rule. This suggests that the main use of the optimal rule may be to evaluate related suboptimal, low storage rules in the hope that conclusions can be extrapolated to the more complicated cases . 28 In Sec. 2.5, it was shown that, under the stated conditions, the parameter posterior density (2.2.7), which summarizes the knowledge about BO, converges to a dirac delta function. Thus, B is learned 0 and the rule adapts. Finally, in Sec. 2.6, various results established in the pre- vious sections were used to solve related inference problems. In particular, a class of estimators for B0 was established including an optimal (minimum mean square) estimator. An optimal decision rule was constructed for determining to which class the state of the system belongs among a set of noncommunicating classes of states. An optimal adaptive estimator was constructed for the parametersin the component densities associated with the active class of states. These examples indicate the versatility of the decision problem of Sec. 2.1. CHAPTER III CONSISTENT ESTIMATORS When the component densities in a Partially Observable Markov System (POMS) are specified to within a parameter vector, but no prior distribution for the parameter is available, optimal decision making as defined in Sec. 2.1 is no longer possible. However, the observations still contain information about B the true but unknown value of the O’ parameter. If a strongly-consistent estimator for BO ( a function of the observations that converges with probability one to B0) can be found, decision rules can be constructed by treating the estimate at a given time as if it were the true value. The decision rule of Sec. 1.2, where the component densities were assumed known, can then be used. Hopefully, as more observations are taken, decision rules constructed in this manner adapt or converge to the optimal rule, which uses the true component densities.13 Such rules, unlike those of Chapter II, may not extract information about B0 in an optimal manner but do approach the optimum as more observations are taken. There are several other reasons for investigating consistent estimators, some of which are listed below. 1. Even when a prior distribution on the parameters is avail- able suboptimal rules may be easier to implement than the high-storage quantization procedure of Sec. 2.4 implicit in the optimal rule.14 2. In Sec. 2.5, it was shown that a strongly-consistent estimator for BO must be exhibited to ensure that this parameter This convergence problem is discussed in Chapter V. 4Optimal and suboptimal rules are compared with regard to implemen- tation in Chapter V. 29 30 will be learned during the operation of the optimal rule. Hence con- ditions for the existence of such estimators are important. 3. In some applications the true parameter values may be of interest in themselves. The estimators suggested in Sec. 2.6 not only require parameter prior distributions but suffer from the implementation difficulties mentioned in 1 above. This chapter deals mainly with the problem of finding strongly- consistent estimators for the parameters defining the component densities of a POMS. In Sec. 3.1 the estimation problem is defined and some im- portant assumptions are discussed. The properties of the observation process are listed in Sec. 3.2 while estimation tools are developed in Sec. 3.3. In Sec. 3.4 and 3.5, estimation techniques developed for the case of independent identically distributed (i.i d.) observations dis- cussed in Sec. 1.3 are extended to the more general, dependent, non- stationary case under study in this chapter. Section 3.6 provides computer-simulated results for an example illustrating some of these estimation techniques. Generalizations of the problem defined in Sec. 3.1 are discussed in Sec. 3.7 and 3.8. Finally, Sec. 3.9 summarizes the main results of the chapter. 3.1 THE ESTIMATION PROBLEM The problem of finding a strongly-consistent estimator for B0, the unknown parameter in a set of component distributions, is precisely stated and some of the assumptions necessary to ensure the existence of such estimators are discussed in this section. 31 A POMS similar to that of Sec. 2.1 is considered.15 The state activity is described by a transition matrix P and initial probability state vector 21° The distribution of the observed process {KN}: is defined by a set of distinct univariate component CDF's {Fi(°)}? corre- sponding to the component densities of Sec. 1.1; Fi(-) is assumed to be an unknown element of the family' 8 '=:{F(X;c:v)}cvEA indexed by a point a in a subset A of the real line.16 As indicated in Sec. 1.1, XN, the random variable observed at time N, has CDF KN(X) which ' a l t f th f ' = ;B 13 n e emen o e set 0 mixtures HN {KN(X )}B65' where M KN(X;B) = 2 F(X;B,)P(A =i) (3.1.1) . 1 N 1=1 In (3-1-1) B = [Bl’B2’°°°’BM] and '3' is the set of all such vectors with distinct components Bi 6 A. That is 5” is a subset of M- dimensional Euclidean space. Then there exists at least one point B0 6 5' such that KN(X) = 1%(x;30) VN (3.1.2) The problem is to find Bm(Xm), a function of the observations m . . x - [x1,...,xm], such that . . m _ . P<,Bm(x'“) —. B0) - 1 (3.1.3) The major assumptions necessary for the construction of such estimators are listed below. They are assumed to hold throughout the chapter and will be referred to when needed. a D A-l. The observed process {Xi}1 is state-conditionally 1n- 17 dependent. 5 . Here B is no longer being treated as a random variable. 6 . . . Scalar observations and parameters are assumed for Simplic1ty but all ideas extend to the vector case. 7 . . . This corresponds to the assumption of state-parameter cond1tional 32 A-2. For each X, F(X;v) is a continuous function on A and for each a, F(°;a) is continuous in X; i = 1,2,...,m. A-3. B0 is an interior point of 18 and 18 is a compact sub- set of 5' . A-4. {AN}: is a regular Markov chain.18 Assumption A-l was introduced in Sec. 1.1 and is repeated here for convenience. It is a key assumption in developing the estimation tools of Sec. 3.3. Assumptions A-2 and A-3 are standard requirements for developing the existence and convergence properties of estimators. Assumption A-4 characterizes the class of POMS to which the techniques of this chapter apply. It ensures that the observations contain information about all the components of B More explicitly, O” a regular Markov chain has the property that it is possible to be in any state after some finite number of steps no matter what the starting state [K-4]. Hence, among a large number of observations there will be a large number of representatives from each of the component dis- tributions. As shown in Sec, 3.2, this assumption also implies the asymptotic ergodicity of the observed process, which is the basis of the estimation strategy to come. In addition to assumptions A-l through A-4 listed above, a uniqueness condition on B is required. This is a standard assump- 0 tion in estimation problems and takes different forms depending on the method used. For the estimation problem defined above a special unique- ness condition arises since, in general, mixtures do not have a unique decomposition into allowable component distributions. So (3.1.2) may 8 . . . 1 A regualr Markov chain IS one that has no tran51ent states and only one ergodic class with no cyclically moving sub-classes. 33 not define BO uniquely. To avoid this type of ambiguity the mapping defined by (3.1.1) from 18 onto a subset of HN’ (for a fixed family of component distributions) must be one-to-one. This subset of HN is then an example of an identifiable class of parameter-indexed mix- tures [Heifl. The concept of identifiability is discussed in the Appendix where it is shown that this property, which is necessary for all the methods of this chapter, can be ensured by placing constraints on the family of component distributions 8 and the parameter spaceJBt 3.2 PROPERTIES OF THE OBSERVATIONS In order to establish estimation strategies it is necessary to study some properties of the observation process {Xi}:. In this section, properties of the system state activity and the resulting observations are developed by making use of assumptions GX-l) and 0A-4). When {xi}: is a regular Markov chain with transition matrix P and initial probability state vector 21, it has the following three properties [K-4]. 1. There exists a unique vector 5.: [fl1,...,fiM] such that 3:31P 1 (3.2.1) 2 f - {1}” 't' ~ . I 21 - E: i 1 1; a -ta 1onary process. _ N-lN 3. pN-pN_lP-_91P 4:; V31 (3.2.2) The vector 3 is called the stationary probability state vector. Given P, (3.2.1) can usually be solved directly for 3. Otherwise (3.2.2) suggests a convenient algorithm for approximating 3, Properties 2 and 3 indicate that a regular Markov chain is asymptotically stationary. Such activity in a POMS suggests that the 34 observations might exhibit some form of asymptotic stability. In fact, va is the set of mixing parameters for the mixture KN(X) which governs the observations at time N. Hence properties 1 and 3 imply M there exists a unique mixture cdf Kh(X) = 2 Fi(X)'n’i such that l M N - X S - =' -0 . sup |1q Kfiml . Ini P(>.N1)| o (3.23) -” where ¢(-) is a Baire function and {Mi}: is a Markov process satisfy- ing Doeblins Hypothesis.19 Lemma 3.2.2 Under assumptions (A-1) and (A-2), if 21 = E. then {Mi}: and {Xi}: are stationary and ergodic (metrically transitive) processes. The above lemmas and property 3 imply that the observation pro- cess is asymptotically ergodic for a class of POMS defined by assump- tions A-1 and A-4. Furthermore, for this class of systems the limiting behavior of the observations is independent of the initial system activity. 19 Doob [D-Z], pg. 192 . 35 3.3 THE BASIC TOOLS OF ESTIMATION When the observation process in an estimation problem is a sequence of independent, identically-distributed (i.i.d.) random vari- ables, the key tools for constructing estimators are the Law of Large Numbers (LLN) and the Glivenko-Cantelli Theorem (GCT). If the obser- vation process of Sec. 3.1 is ergodic, the ergodic theorem establishes appropriate extensions of these tools. The fact that the observation process for the class of estimation problems considered in this chapter is asymptotically ergodic suggests that these important tools might be extended to this case. In this section the needed generalization of the LLN and the COT are established and a basic estimation strategy is stated for the problem of Sec. 3.1. Using lemma 3.2.1 and assumption A-4, Raviv [R-Z] has essentially proved the following theorem for the observation process {Xi}: of a POMS with arbitrary initial probability state vector 21.20 THEOREM 3.3.1 If g(') is a Baire function integrable with respect to Kfi(X) then 1 M N - '2: g(X.) _. E g(X) WP1 (33.1) N 1 fl 1 o where E x ~13. I (mm (x) (3 3 2)21 "Om ) - 1 Tri s 1 - - Theorem 3.3.1 is the required extension of the LLN. It provides a means for establishing strongly-consistent estimators for any expec- tation of Kh(X). 20The proof which consists of a direct application of Theorem 6.2 of Doob appears as part of the proof of Raviv's Lemma 2.5. 21ETT will be used to denote expectation with respect to the true limit mixgure of the POMS; i.e. the distribution with cdf Kh(X) = K%(X;BO). 36 Q The problem of estimating Kfi(x) from the observations {Xi}l is now considered and an extension of the GCT is presented. The function A N KN(X) = l/N killlrerXJ(xk) (3.3.3) where IA(-) is the indicator function on the set A is called the empirical distribution function; RN(x) is the proportion of samples from the set XN = [X1,...,XN] that are less than or equal to x. Since = S En I[r:er ](X) Kh(x0) 1 O O and22 n E OIErerxO](X) = Kh(XO-O) S 1 the following lemma is an immediate consequence of Theorem 3.3.1. Lemma 3.3.1 For every real x 12 H P1RN(x) Kn(X)} = II ...: 12 P{KN(x-0) Kn(x-O)} Lemma 3.3.1 is the key step in extending the GCT. THEOREM 3.3.2 P{ sup | (x) - K | E o} = 1 -a o. eso 37 3.2 the unknown limit mixture Kfi(X;Bo) is an element of the class of mixtures HTr = {Kn(X;B)}B66 where M Kn(X;B) = fF(X;Bi)1Ti m The main idea is to treat the process {X111 as if it were a sequence of indgpendent random variables with common CDF Kfi(X;Bo)' The schemes developed for the i.i.d. case are then used with the classical conver- gence tools replaced by Theorems 3.3.1 and 3.3.2. This procedure is illustrated in Sec. 3.4 and 3.5. The general format is to establish an estimator, present a convergence theorem and evaluate the results. Assumptions A-1 and A-4 will be assumed through the rest of the chapter. 3.4 CLASSICAL ESTIMATION METHODS In this section, the principle behind the classical method of moments and the maximum likelihood method are used to solve the estimation problem defined in Sec. 3.1. Both methods involve the use of sample averages whose convergence is guaranteed by Theorem 3.3.1. The method of moments has been used frequently for mixture re- solving [B-Z][C-5][P-3][R-l][R-4] and is treated in the most general form by Chien and Fu [0-2]. The procedure requires equating M sample moments to the corresponding population moments of Kn(X;B). A solution, if one exists, of the resulting equation in B can then be taken as an estimate for BO. More specifically, if F(B) is a vector of popula- tion moments of Kn(X;B), which are assumed to exist, and Tm is the corresponding set of sample moment constructed from the observation Xm, then by Theorem 3.3.1 38 T Wpl (3.4.1) where T0 is a vector of moments of the true limit mixture Kn(X;B0). The set of existing solutions of Tm = F(B) (3.4.2) may contain candidates for strongly-consistent estimates of B In 0' order to obtain a more explicit result the following uniqueness con- dition is assumed. The function F(-) is a one-to-one mapping from 18 onto T, a subset of M-dimensional Euclidean Space. Under this con- dition, a consistent estimator can be constructed as indicated by the following theorem. THEOREM 3.4.1 If the above uniqueness condition is satisfied then, with probability one, there exists a sequence of solutions of (3.4.2), say {Bm}:O, which converges to B0. Proof. Assumption A-2 implies that F(°) is a continuous function on 18. Hence by the uniqueness condition F-1(-) is defined and continuous [R-S]. Equation (3.4.1) and the fact that B0 is an interior point of .8 imply that for every sequence of observations, except possibly those in a set of measure zero, there exists an m0 such that Tm E f for m > m0. Thus for m > m0, Bm = F-1(Tm) and converges to 30' In general (3.4.2) can have more than one solution and prior information concerning the particular application must be used to se- lect a convergent sequence of solutions. This corresponds to restrict- ing 8' to B as given in the uniqueness condition. 39 Equation (3.4.2) can pose some difficult computation problems since, in general, (3.4.2) is a set of complicated nonlinear equations and an iterative algorithm for finding a zero of a function must be used at each step. Since the sample moments can be generated iteratively, the storage requirement is fixed. When B can be found as an explicit function of Tm the resulting estimator is very desirable with respect to implementation. The maximum likelihood method has been used by Patrick [H-Z] for mixture-resolving. He treated a mixture of Gaussian distributions in detail. The version of maximum likelihood estimation given here is essentially a modification of that given by Wilks [W-l]. The following notation is introduced. 6 Log dKrr (X ;B) Sj(X;B) = a Bj j = 1,2,...,M (3.4.3) §(x;B) = [81(X;B),...,SM(X;B)] (3.4.4) AJ.(B;B ) = fsj (X;B)dKfl(X;B ) (3.4.5) 5033') = [A1(X;B),...,AM(X;B)] (3.4.6) Now, Kn(X;B) is said to be regular in 73 with respect to its first derivative if Es(x-B)EA(B-B)=§-—J”dx(x-B)sovBes j=12...,M (3.4.7) TTj ’ j , aBj TT ’ , , 3 Under this condition Theorem 3.3.1 implies m _S_(Xm;B) E l/m 2: §(Xi;B) 9 A(B;BO) (3.4.8) 1 40 Since A(BO,BO) = O a reasonable strategy for generating a con- sistent estimator is to choose among the roots of m §(X ;B) = 0 (3.4.9) This idea is made more explicit in the following theorem. THEOREM 3.4.2 If KH(X;B) is regular in 16 with respect to its first derivative, then there exists a sequence of solutions of (3.4.9) which converges with probability one to B In particular, 0' if (3.4.9) has a unique solution Bm for m > m then the sequence 0 {Bm3zo converges with probability one to B0. The proof, which is in the same spirit as that given for the method of moments and which follows from (3.4.8), is analogous to that given by Wilks and will be omitted. The same comments can be made about the uniqueness condition and the computational problem of finding roots. of (3.4.9) as were mentioned for the method of moments. However in this case, all the samples must be stored and the form of (3.4.9) changes at each step. The same problem was faced in the i.i.d. case [K-2]. A The interpretation of the above method as a maximum likelihood method is obvious if the likelihood function is taken to be the product K%(X1;B),...,Kh(xm;B). Then, under suitable restrictions, the value of B which maximizes the likelihood function satisfies (3.4.9). 3.5 MINIMUM DISTANCE ESTIMATION METHODS In this section, the minimum distance principle is used to con- struct estimators for B0. Given an appropriate distance measure p(',') and the empirical distribution function for Kfi(X;Bo)’ HN(X) 41 defined in (3.3.3), and estimator Bm can be defined as an element in 18 which minimizes p(Rh(X),Kfi(X;B)). Theorem 3.3.2 is the main tool for establishing convergence. This method has been used for mixture resolving by Patrick [H-Z], Stewart [H-B], Choi [0-3], and Deely and Kruse [D-1] and their ideas apply to the problem. Patrick, Stewart, and Deely used the sup norm for a distance measure.23 Analogously, the estimate Bm is defined by :63 Himm - Kn(x;B)H = HKm(X) - Kn(x;Bm)H (3.5.1) The existence of Bm is ensured by A-2 and A-3; i.e. Bm is obtained by minimizing a continuous function over a compact set. The following theorem establishes the consistency of such estimates. THEOREM 3.5.1 If Hn = {Kfl(X;B)}BEB is an identifiable class of mixtures, then Bm’ defined by (3.5.1), converges with probability one to B - 0 Proof. If Qm(B) = HRm(X) - Kfi(X;B)H then (3.5.1) and Theorem 3.3.2 imply Qm(Bm) S Qm(BO) 9 o wpl (3.5.2) Since Hxfi(x;30) - Kfi(X;BO)H s Qm(Bm) + Qm(BO) (3.5.3) it follows that Kn(x;Bm)-‘9Kn(x;30) wpl (3.5.4) 23Hs =f<1 - Rm Then Btn is defined as the element of 16 which minimize Sm(B). Under the same conditions as those in Theorem 3.5.1, the existence and strong consistency of Bm can be demonstrated from Theorem 3.3.2 and arguments analogous to those given by Choi. The conditions needed to establish the existence of the above estimators are weakest considered in this thesis under which a strongly consistent estimator for BO can be established. Hence they are key assumptions which guarantee the learning property of the optimal rule given in Chapter II. Identifiability is a necessary condition for the uniqueness conditions required by the method of moments and the maximum likelihood method but, in general, is not sufficient. The minimum-distance methods require storage of all samples and the use of a computational algorithm at each step. Such procedures have been developed for the i.i.d. case and can be applied here [C-3] [D-l]. 43 3.6 AN EXAMPLE In this section, a simple example of a POMS is used to illustrate some of the estimation techniques developed previously. Estimators are established from the method of moments, the maximum likelihood method and the optimal Bayes method of Sec. 2.6. To illustrate some basic properties of these estimators, results of computer simulation are pre- sented. The example consists of a POMS defined by the following. .25 .75 24 . . regular tran31tlon Matrix .40 .60 initial probability 21 = [.7 .3] state vector 1 e-ko-ZOH-B)2 component densities, f1(X;B) = 2n 0 Gaussian in form 2 2 _1_ 8-..; (M) f (X;B) = 2 Ma The true mean B0 is assumed to be in the interval (0,4). .The sta- tionary probability vector is easily calculated as g,= [15/23 8/23] and hence the limit mixture has the form -2 2 -2 2. 1 [15/23 e’$0 (X+B) + 8/23 e'kc (X'B) ] (3-5-1> f (X;B) = " /35 o It is well-known that all moments of a Gaussian distribution exist and Patrick [H42] has shown that a mixture of Gaussian distributions with 24As indicated in Chapter IV, if all the elements of a transition matrix are positive the corresponding Markov chain is regular. 44 unknown means is regular with respect to its first derivatives. Hence, the method of moments and the maximum likelihood method can be applied here. A moment estimator is easily obtained using the second moment 2 2 equation snx = B + 02. Then, Theorem 3.4.1 implies that the follow- ing estimator is strongly consistent. B' =v/S S 2 0 m m m = < o o O Sm 0 (3 6 2) = 2 S > 4 m m where S = 1/m 2X? - 02. m , 1 1 1: As indicated in Sec. 3.4, the maximum likelihood estimate 33' is defined as a solution to the following equation in the interval [0,4] -2 ' 2 -2 2 - X.+B- x.-B) m -15(Xi+B)e 15° ( 1 ) 35° ( 1 + 8(Xi-B)e- 2 = 0 (3.6.3) 1 1 -31,;<7'2(Xi+B)2 450'2 (xi-312 02(15 e + 8 e ) Theorem 3.4.2 guarantees a sequence of roots that converges. Moreover, the computer solution25 of (3.6.3) indicated only one root in [0,4] for most values of m. The Bayes estimator is obtained by quantizing the interval [1,3] 2 into 50 levels, .04 apart.6 The estimate is defined by 1.1 50 m B = E B p(B /X) (3.6.4) m k k k=1 , th . . . where Bk is the k quantization level and p(Bk/Xm) is generated iteratively using equations (2.4.3)-(2.4.5) with B replaced by Bk' 2SUnder the conditions stated below. 6 . . . . Additional prior information is assumed for Bayes estimator to emphasize its effect. 45 A uniform prior density is assumed. As pointed out in Sec. 2.6, if B0 is in [1,3] convergence is guaranteed by the existence of the previous two estimators. The above estimators were implemented on a digital computer with a random number generator supplying the samples; B was chosen equal 0 to 2. For a given value of O, E(Bm) and Var(Bm) were approximated for different values of m by averaging the results of twenty runs. On each run the same samples were used to generate estimates by all three methods. The results are presented in Table (3.6.1) for o = l,1.5,2 and m = 25,50,100. For this example, Table (3.6.1) indicates some general trends. Not only do all the estimates converge on the average,but as more observations are taken the variances of the estimates decrease. Also, as the variance of the observations (controlled by 0) increases, the estimates become less accurate; i.e., Var(Bm) and 1E(Bm) - B in- 0| creases. All three estimators seem to perform about the same with perhaps the Bayes estimator slightly better for small m and large a. On the whole, the estimators behave much as they do for the i.i.d. case. The above estimators can also be rated with regard to implementa- tion. Since 02 + Sm = (1+1/m)(O2 + Sm-l) + (l/m)X:, the moment estimator is a simple iterative one. The Bayes estimate is also iterative but, for this example, requires fifty times more storage and computation than the moment estimator. The maximum likelihood estimate is, by far, the worst requiring storage of all samples and use of a computational algorithm to find the zeros of (3.6.2) whenever an estimate is desired. 46 mommm "m ooossaoxsq assess: "as mufimsoz NO UOSUmz ":2 “woo. mooo. mmmo. ammo. Mmma. qua. wmqa. come. moan. mmam.a «meo.H mmmm.H mmHm.H Hmom.a momw.a mmHm.H mumm.a nmmw.a mHmo. mmqo. aNm. ammo. wmmo. mon. coma. mama. HmNN. «Nmm.H aomm.~ mmnm.H mmoo.~ mmmo.~ ommo.m wmma.~ Homm.H ommo.~ MNHO. mmao. HNHO.. qomo. mono. wmmo. Nmmo. memo. «ode. mwma.H Nooo.N «wwmma Nwmm.H mmmm.H mmmm.H «qu.H mmqm.m smcm.a m 1mm11 22 \fl m thll 22 .xf, m ”WW1 22 11\ ooH u 8 on u 8 mm M E o N u m mHHS mznm ON roam mHADmmm ZCHH Lees Asmvss> Ase s Aamvso> A88 m m.H 47 3.7 ALTERNATE STRATEGIES FOR THE ESTIMATION PROBLEM In this section, three modifications of the basic estimation strategy outlined in Sec. 3.3 and illustrated in Sec. 3.4 and 3.5 are discussed briefly. Also, methods for handling extensions of the original estimation problem are indicated. The first modification is one that can be used when n is un- known, as when P is unknown or ‘E_ is too difficult to calculate. In such a case, E. can be included in the set of unknowns of K“(X;B) and the methods of Sec. 3.4 and 3.5 can be used with an enlarged para- meter vector. The second modification is one that requires knowledge of P but computes E. during the procedure. The main idea is to treat the random variables Xm as if they had common distribution Km(X), an element of Hm = {Km(X;B)]B68. Then Bm’ the mth estimate, can be calculated using one of the principles of Sec. 3.4 and 3.5 with Kn(X;B) replaced by Km(X;B). Since for each B E B, Km(X;B) —' Kn(X;B) uniformly in X, the estimate Bm should be consistent. As an example, a minimum distance estimator is considered; Bm is defined as the element in B which minimizes 1mm) = Mirna) - Km(X;B)H. Then Im(Bm) s Im(BO) s Hikm(x> - 1%(X;BO)H + HKfi(X;BO) - Km(X;BO)H (3.7.1) and V “KT-[(X;BO) - Km(X;Bm)H s HKn(X;BO) - Km(X)H + Im(Bm). (3.7.2) Assuming identifiability of the class of mixtures generated by the family of component distribution with B and n as parameters, the 48 strong consistency of Bm follows from Theorem 3.3.2 and the fact that Km(X;BO) fl Kfi(X) uniformly. Analogous estimators can be established using the method of moments and the maximum likelihood method. The final modification takes advantage of the fact that, in gen- eral, the random variables observed at adjacent points in time are not independent and, hence, information can be obtained about the unknown parameter by cross correlating successive observations. Raviv [R-Z] used this technique to estimate the transition matrix P. His basic tool was a modification of Theorem 3.3.1. Namely, if g(-,-) is an integrable function then N 1 PiN iflgocimm) Eng(xk.x k4%)} = 1 (3.7.3) where ETr denotes the expectation when 21 = E: The approach here is based on the fact that KN(XN’XN+1) = iszi(xN)FjOSl+l)pijP(kN=i) and _. 15, P(XN-i) ”i G That is, the sequence of random variables {[xi’xi+1]}l is described by a sequence of mixtures of joint distributions which converges to a fixed mixture Kn (xk’ka) = iszi (xk) Fj (xk+l) pijni 2 Hence a second order estimation strategy analogous to the first order m strategy of this chapter can be developed using observations {Exi’xi+1]}l and all past techinques with (3.7.3) in place of theorem 3.3.1. In this case, the elements of the transition matrix P appear as para- meters in the mixture and hence can be included in the list of unknowns to be estimated.2 3.8 ADAPTIVE ESTIMATION AND CLASS ESTIMATION Estimation problems related to the POMS defined in Sec. 2.6 are considered in this section. The transition matrix is assumed to be block-diagonal; hence the system is in one of L noncommunicating classes of states and each class is assumed to satisfy the assumptions of the POMS defined in Sec. 3.1. The component distributions associated with each class are assumed to be unknown to within a parameter as in Sec. 3.1 and the active class of states is unknown also. The problems are to determine which class is active and estimate the unknown para- meters of that class. To formulate the problems more clearly some notation is defined; 33 is the stationary probability vector corresponding to the ith class; Hni = {Nfii(X;B)]B661 is the set of limit mixtures induced by the family of component cdf's for the 1th class with 1Gi the corresponding para- meter Space;29 10 is the value of the index for the active class; B0 is the true value of the parameter defining the component densities of class i0. Both B0 and i0 are unknown and are to be estimated. The first problem considered will be that of finding a strongly- consistent estimator for BO. One strategy is to assume the system is in a particular class and construct an estimate accordingly using the methods of Sec. 3.4 and 3.5. If this is done for all L classes the 27 . If P is to be estimated n must be known or a conSistent estimator for n must be available. Such an estimator can be obtained from the first order strategy discussed in modification one above. 8 . . . Each class is being treated as a separate Markov chain. 9For simplicity the parameter Space is assumed to be the same dimension for all classes. But all arguments apply directly to the more general case. 50 result in a set of L estimators containing at least one consistent estimator. This procedure is illustrated below using the minimum dis- tance principle. If Qi(B) = HRm(X) - Kfii(X;B)H then Bi, the estimate assuming class i is active, is defined as the element in 131 which minimizes Qi(B). Since there exists at least one value of i, say i, such that k.(X) B Knl(X;B Wpl, then, as in Theorem (3.5.1), if H c is m 11]. >0 identifiable B B B0 Wpl. Therefore a sufficient condition for the 8H adaptive estimator given by (2.6.6) to converge to B0 is that H i n be an identifiable set of mixtures for i = 1,2,...,L. It is important to realize that i as defined by 11.. Qm(Bm) 0 Wpl (3.8.1) may not be unique. Hence B0 can be estimated even when it is not possible to determine which class is active. The question of sufficient conditions for determining 10 will now be considered and a more explicit estimate for BO will be constructed. Definition 3.8.1. The set of mixtures H is said to L = U H . 1 111 be class identifiable if, for any B1 6 £91 and B2 6 BJ Kni(X;B1) = Knj(X;B2) VX (3.8.2) implies i = j. The above definition can be interpreted as saying that the space of mixtures HTr is class identifiable if and only if {Hui}? is a collection of disjoint sets. Sufficient conditions for class identifiability are given in the Appendix. As indicated by Theorem 3.8.1, class identifiability is the key condition for determining i0. 51 L THEOREM 3.8.1. If an = U Hi is class identifiable then im i=1 defined by Q(Blm) min Q;(B;) (3.8.3) 1 is a strongly-consistent estimator for i0. Furthermore, if, for each 1 i, Hni is identifiable then Bmfi defined by (3.8.3) is a strongly- consistent estimator for BO. Proof. By definition and Theorem 3.3.2 1 i Q(B m) s QmO(BO) 9 o wpl (3.8.4) Then the subadditivity of the supnorm implies Ki EK,€H, wpl (3.8.5) n m 1T10 n10 i where K , = K , (X;B m) and K , = K , (X;B ). Since for each i nl fi1 n1 Ni O m m i 0 O i the mapping from 13 onto H i’ defined by K i(X;B ), is continuous n n and 161 is compact, H i is compact also [R-S]. By the class identi- n {H i}l is a collection of disjoint sets. Hence, fiability of H E n 1 for almost every sequence of observations, there exists an mO such that for m > mO K i E H i . Otherwise there would exist a sub- : m n 0 sequence of {K . } contained in H , for some i # i and con- nlm 1 1111 1 0 verging to K i a point not in H i . This contradicts the compact- n n O ness of H . . Thus for m > m , i = i . If in addition H . is "11 O m 0 . ”10 H 0 wpl, as in Theorem (3.5.1). It follows from both identifiability conditions that for m.s mO B 1m,” identifiable then {B }m converges to B 0 im is well defined. As indicated in the Appendix identifiability of H i is not a n necessary condition for class identifiability of HTT hence iO can 52 be estimated even when BO cannot. 3.9 CONCLUSIONS This chapter has dealt mainly with the problem of finding strongly- consistent estimators for the unknowns in the component distributions of a POMS. The problem was defined in Sec. 3.1 as one of estimating the parameter set BO that defines a sequence of mixtures using de- pendent samples from successive elements in this sequence. The study was restricted to a class of systems with state activity described by a regular Markov chain. As shown in Sec. 3.2, the corresponding se- quence of mixtures approaches a limit mixture Kfi(X;BO) and the observation process is asymptotically ergodic. These properties were used in Sec. 3.3 to establish tools (extensions of the Law of Large Numbers and the Glivenko-Cantelli Theorem) for estimating K"(X;BO) and any of its expectations from available observations, thus re- ducing the estimation problem to the resolution of the limit mixture. This estimation strategy was illustrated with the method of moments and the maximum likelihood method in Sec. 3.4 and the minimum distance principle in Sec. 3.5. The result was a variety of conditions under which the parameters could be estimated and, hence, under which the optimal rule of Chapter II adapts. A specific example illustrating some of these methods and the optimal Bayes estimator of Chapter II was presented in Sec. 3.6. Computer simulations indicated the behavior of the estimators to be typical. Alternate strategies were proposed in Sec. 3.7. It was shown that these strategies would also handle the case in which the transition matrix P is unknown. 53 In Sec. 3.8, the adaptive estimation problem and class estimation problems of Sec. 2.6 were solved suboptimally. In the process of establishing conditions under which such estimators could be constructed, the concept of class identifiability was introduced with sufficient con- ditions for this type of identifiability being given in the Appendix. Finally, the basic aim of the Chapter has been to put forth a general estimation strategy and illustrate it with examples. It should be clear that many methods not mentioned here [C-4][H-2][S-2][S-3], in- cluding nonparametric ones, apply equally well to this problem. CHAPTER IV EXAMPLES OF PARTIALLY OBSERVABLE MARKOV SYSTEMS Examples of Partially Observable Markov Systems (POMS) can be found in the fields of Pattern Recognition and Communication Theory. When the model defined in Sec. 1.1 can be associated with systems in these fields, the decision rules of Chapter II and the estimation schemes of Chapter III lead to a class of decision devices with a learning capability. This chapter deals mainly with the design of optimum, adaptive signal detectors for a variety of communication systems with unknown signals. The basic approach is to propose a communication system, make the correspondence between it and a POMS with unknown parameter in the component densities, identify the optimal detector, and check critical assumptions that ensure adaption. The main assumptions that guarantee the existence of an iterative optimal rule which adapts are listed below from Sec. 2.1, 2.5, 3.1 and 3.5. l. The observation process is state-parameter-conditionally independent. 2. The underlying Markov chain is regular. 3. The Corresponding set of mixtures is identifiable. Through most of the chapter the transition matrix P will be given and the component densities will be Gaussian with unknown mean. Consequently, in verifying the above assumptions, the following in- formation will be useful. 54 55 A regular Markov chain can be characterized by its transition matrix in either of the following two ways [K-3][K-4]. 1. There exists a finite integer N, such that PN has all positive entries, denoted by PN > 0. 2. All but one eigenvalue of P lies inside the unit circle. As indicated in the Appendix,all the sets of finite mixtures of Gaussian distributions with distinct meansare identifiable if constraints are imposed to rule out any ambiguities which may arise in the parameter space. In this chapter it is assumed that, for each example, a parameter prior density which reflects such constraints is available. The purpose of this chapter is to display the versatility of the model for a POMS and not to investigate each application in detail. In order to display the main ideas clearly, special cases are treated which can be easily generalized. Additional background concerning each problem can be found in the references cited in the corresponding sections. In Sec. 4.1, a Pattern Recognition System with Markov-dependent pattern activity is introduced and discussed briefly. In Sec. 4.2, a basic communication system with a Markov encoder, memoryless channel, and known synchronization is considered. The assumptions of the basic system are weakened in Sec. 4.3 and 4.4; systems with unknown synchronization and channels with memory are considered. Sections 4.5 and 4.6 deal with vari- ations of the basic system in which synchronization is undefined; namely, systems in which signals arrive at random times. The results of the chapter are discussed generally in Sec. 4.7 and 4.8. 4.1 PATTERN RECOGNITION WITH MARKOV-DEPENDENT PATTERN ACTIVITY Before attacking any communication systems it is convenient to investigate a more general class of systems and illustrate the format 30 Since the elements in each row of P sum to one, 1 is an eigenvalue. 56 for utilizing the results of Chapters II and III. In this section, a pattern recognition problem in which pattern activity at one time de- pends on pattern activity at other times in a Markov manner is shown to be a problem in decision making with a POMS. The system under study is depicted in Fig. 4.1.1. PC1 . /" — Sw1tch PC2 """“\1 , , Feature Pattern Extractor Classifier pc——/\ ’81 Fig. 4.1.1 A Pattern Recognition System There are M pattern classes {PCi}T. At time N, a sample pattern is randomly chosen according to the probability vector N 2N =[P(w1)9-°°9P(wl:4)] (4.1.1) where P(W§) is the probability that PCi is active at time N and P(W§|W§-1,...,W:) = P(W§|W§-1) = PjiIV’N\ The sample pattern is mapped to a point XN in a finite-dimensional, Euclidean vector space via the feature extractor. Associated with each pattern class is a density f(X|Wi) for XN when PCi is active. The pattern classifier makes a decision as to which pattern class generated XN. Raviv [R-3] applied 57 such a model to the recognition of characters in text. He assumed all quantities were known or obtainable through supervised learning. The problem here is to design an adaptive classifier which makes decisions on the basis of observations XN = [X1,...,XN] when ‘21 and P = [pij] are given but the densities {f(XiWi)}? are unknown. The relation between this problem and that of decision making for a POMS is established by introducing the random variable KN which maps the event W? to the interger 1. Then {AN}: is a first-order, homogeneous Markov chain. The events {WE}? have been put into a one- to-one correspondence with the states of a POMS whose state activity is summarized by ‘21 and P = [Pij] and whose observations are governed by the component densities {f(XIWi)}T. The problem of classifying feature vectors is that of making decision about the states of a POMS. A variety of adaptive classifiers follow from the decision rules of Chapters II and III. For example, when the component densities are specified to within a parameter vector B, the minimum probability of error rule as given in Sec. 2.2 is: decide PCi is active at time N if p(wlny) 2 F(wbj‘IXN1 vi 14 1 (4.1.2) where P(WT|XN) = j‘ P(W1:|XN,B)p(B|XN)dB (4.1.3) As in Sec. 2.3, the posterior probabilities {F(WEIXN)}? can be gen- erated iteratively under Assumption 1 of this chapter. Furthermore, if Assumption 2 and 3 hold 58 P(BIXN) 5 6(B-BO) wpl (4.1.4) where B0 is the true parameter value. Hence B0 is learned and the rule adapts. It is clear that the estimation schemes and suboptimum rules de- fined in Chapter III apply as well. In the remainder of the chapter more explicit examples are given in which the assumptions guaranteeing existence and adaption of optimal rules can be checked. 4.2 ADAPTIVE SIGNAL DETECTION WITH A MARKOV ENCODER This section considers a particular example of the system treated in Sec. 4.1. A communication system, wherein the signal sent in one time interval depends on that sent during another time interval in a Markov manner, is investigated. Figure 4.2.1 illustrates the system under study. 81(t) \V /" Encoder /’|Sampler / _ ./ 32“) ‘7 - 8(t) X(t . Channel ' Detector *ep- + X -—_./’* N M n(t) Fig. 4.2.1 A Communication System 59 Every T seconds a signal is randomly chosen from the set {Si(t); 0 S t S T}? and sent through a channel which changes it in some fixed but unknown manner. The channel output 9(t) is corrupted by additive noise and the result is sampled. The observable coordinates are a sequence of time samples {X(ti)}: where X(ti) = 9(ti) + n(ti). The detector uses these observations to determine which signal was sent in a given time interval. The Basic assumptions are the following 1. The operation of the encoder is described by a matrix of positive probabilities P = [pij] = [F(Wg/Wfi-1)] for all N, where W? is the event that the ith signal was sent over the interval [(N-1)T, NT]. A set of prior probabilities -21 = [P(Wi),...,P(W;p] govern- ing transmission in the first interval is given. The channel is memoryless. Hence, there is no intersymbol interference and the channel output during the Nth interval is caused only by the input during that interval. The re- sponse to Si(t) is 91(t). The Noise process is white and Gaussian with zero mean and finite variance. The encoder and detector are synchronized so the time re- ference is the same for both. Every T seconds a block of L samples is taken at the receiver in accordance with standard sampling theorems; EN is the block taken during the Nth interval and 91(t) is characterized by an L-dimensional vector of samples _8_i = [eil,...,8i,]. 60 The correspondence between this system and that of Sec. 4.1 is immediate. The encoder acts as the switching device and the sampler as the feature extractor. When the event W? occurs, the resulting observation is EN = fii +DEN where EN is a vector of noise samples. Hence, the component densities are Gaussian with means £§i]¥. The problem is to establish a detector which makes decisions on the basis of the observations XN = [X1,...,XN] while learning the signal vector .g = Lgl,g2,...,§3g. As in Sec. 4.1 the optimum decision rule is: decide the ith signal was sent in the Nth interval if N P(W1:/X ) 2 P(W1:/XN) Vj )4 1 (4.2.1) This rule has the learning property indicated by (4.1.4) with B re- placed by 9. Only the assumptions remain to be checked. The state- parameter-conditional independence follows from the white Gaussian Noise assumption. This will be the case through the remainder of the Chapter. Hence only Assumptions 2 and 3 will be discussed forthwith. Since pij > O, the underlying Markov chain is regular. The family of component densities are Gaussian. Hence if the vectors in the set L01}? are distinct, the identifiability assumption is satisfied and the rule adapts. 4.3 ADAPTIVE DETECTION WITH INTERSYMBOL INTERFERENCE When the channel of the system considered in Sec. 4.2 has memory, the signal transmitted in one time interval spills over into other time intervals. Consequently, the received signal at any time is affected by what was sent before. This dependence, under suitable 61 assumptions, can be shown to be Markovian. Chang [H-l] considered adaptive detection for the binary case in which one of two antipodal signals were sent independently from one time interval to the next. In this section, a more general class of signals transmitted with Markov dependencies is considered. The system under study is the basic communication system of Fig. 4.2.1 with the channel described by a causal linear filter with impulse response h(-). For simplicity, interference is assumed to be limited to the immediately succeeding time interval, which is ex- pressed as h(t-T) = 0 t > T+¢ (4.3.1) t 0 i,j = 1,2, Pi > 0 and, thus P1 is regular. It is clear that, for general M, P1 retains a structure such that Pi > 0. Again, because the component densities are Gaussian, identifiability is in- sured if the vectors in the set (913} are distinct. When p11 = p21 = q1 and p22 = p12 = q2 (this corresponds to the case Chang treated) P2 is uniform in the columns. Hence {A2k]: and {A are independent subsequence of the underlying chains and }¢ 2k+1 l the corresponding observations are independent. Chang used this fact, which can be arrived at by direct consideration of the model, to con- struct estimators for the unknown signals. He used the method of moments and his techniques are a special case of those discussed in 3 Sec. 2.4 and 2.7, 1 where if M = 2, the unnormalized stationary prob- ability vector is E-‘ [pllp21’ p12921’ p12p21’ 9229121' (4°3'6) 4.4 ADAPTIVE DETECTION WITH UNKNOWN SYNCHRONIZATION When synchronization is unknown in the basic communication system of Sec. 4.2, each block of samples may contain the effects of two signals. Hence there are dependencies between the signal received in one time interval and that in adjacent intervals. Under appropriate assumptions , these dependencies can be shown to be Markovian. Stewart 31The Third Modification. 64 [H-3] found the optimum decision rule for determining the true synchro- nization time when the signals were transmitted independently from one time interval to the next. In this section transmitted signals are Markov-dependent and the optimum decision rule for determining synchro- nization is shown to follow easily from the results of Sec. 2.6 and 3.8. The system under study is the basic communication system of Sec. 4.2 without the synchronization assumption. That is, in each block of samples the time sample at which the effect of one signal ends and that of the following signal begins is unknown. Additional assumptions are needed and these will be considered in force for the remainder of the chapter. 1. Time zero is the time the receiver is turned on. 2. The initial probability state vector of Sec. 4.2 governing the transmission of the first signal is the stationary probability vector En Assumption 2 implies the time the transmitter is turned on is ir- relevant provided it occurs before the receiver is turned on. If each block is assumed to consist of L samples and the rth sample from the start is the true synchronization time, the Nth r O . = + C O h observation has the form. KN §”j EN some 1,] w ere r = 0.. , ’00., . 4.4.1 Eij [91,L-r+2’ ’eiL ejl eJ,L-r+l] ( ) is the vector representation of the signal at the output of the channel between (N-1)T and NT. That is, gij contains the last r-l com- ponents of «31 and the first L-r+l components of 9) where -Qi and SE are the vector representations of the channel responses to signals 65 Si and Sj respectively. Let Pm:j 6 XN) be the probability that eij is the channel output in the Nth interval and let P(Wr) be the prior probability that r is the true synchronization time. Then, using the transition probabilities of Sec. 4.2 the events {Wij}; i,j = 1,2,...,M; r = 1,2,...,L can be put into a one-to-one correspondence with the states of a POMS. For example if M = 2 and L = 2 2 _ 2 k k-l P(W11€XN) -P(W nwlrlw1 ) k . .th . where, as in Sec. 4.2 W1 is the event that the 1 signal is sent in th . . . the k interval after the transmitter is turned on; k is unknown. Then P(Wil e xN) = F(wli/wlf'lmzwmz n wli'l) k-l k-2 1 k-2 2 nw1)+1>(w rlw1 nw2 )] 2 k- p11[P(w 0 W1 2 2 p11 P(wll e xn-l) + p11 PCW21 6 ’51-1) This procedure can be repeated for all the events [W§j}. Then the probability state vector 2 l l 2 2 2 2 EN = [P (Wu)?(1422»] is related to its predecessor by Bé = E§_1p2 where "P11 p12 0 o o 0] p21 p22 0 o o 0 P2 = O 0 p11 0 p12 0 O 0 p11 0 p12 0 o 0 0 p21 0 p22 66 The initial probability vector 21 can be obtained from E. and {P(Wr)}€. The corresponding component densities are Gaussian with means {9:j]. The transition matrix is block diagonal with one block asso- ciated with each synchronization time. The problem of determining the true synchronization time corresponds to that of class estimation given in Sec. 2.6. The optimum decision rule follows from (2.6.1) and has the form: decide the true synchronization time is r if P(Wr/XN) 2 P(Wk/XN) V k aé r where P(Wr/XN) = iZjPWEj/XN) Conditions under which P(Wr/XN) 15. at r , wpl l 0 where rO is the true synchronization time, are now considered. In general, P2 will have L-l blocks exactly the same (those correspond- ing to synchronization time r = 2,...,L). These have the same form as P1 considered in Sec. 4.3 and hence are regular. The remaining block (for r = 1) has all positive elements and is regular also. Since for 62> 2 the blocks are not distinct and the component den- sities belong to the same family, then as indicated in Sec. 3.8 and the Appendix r cannot be determined unless prior information is 0 available concerning the parameters of each class. However, this POMS has some special properties that can be exploited. The observations 67 from one class are related to those in others(when they are active) through a shifting procedure. Hence, an empirical cdf for the limit mixture of each class can be constructed for minimum distance estimation. This observation was used by Stewart in the i.i.d. case to establish that rO could be uniquely determined if {QEj} contained at least m+1 distinct vectors for each r. Since the identifiability problem is the same for the Markov case, his result can be extended via Theorem 3.3.2. 4.5 Mr ary ADAPTIVE DETECTION OF SIGNALS WITH RANDOM.ARRIVAL TIMES In Sec. 4.2-4.4, the periodic behavior of the encoder allowed observations to be processed in blocks of known size, whether or not synchronization was known. In this section and the next, the un- certainty concerning the signal arrival time is greatly increased and the samples must be processed one at a time. The detection of signals of known duration but with random arrival times is considered. For random lengths of time no signal is transmitted and only noise is re- ceived. Stewart [H-B] found the optimum receiver for a case in which signals were transmitted independently in time but was unable to find a suboptimum solution and, therefore, could not prove adaption. In this section, the optimum receiver for the Markov-dependent case is established and convergence follows immediately from previous results. The system under consideration is essentially the basic commu- nication system of Sec. 4.2 with an additional "no signal" input which can be active for a random length of time. Signals are transmitted 68 for a duration T and no transitions of the encoder are allowed during this time. A typical signal of the output of the channel is depicted in Fig. 4.5.1. 9(t) 1 I 1 Fig. 4.5.1 Channel Output for an M-ary Signal set with Random Arrival Times Then Xk = 9(tk) + M(tk) is the sample at time tk' The sampler is assumed synchronized; i.e., signals can arrive only at sampling in- stants. Thus, 9(tk) is either 0 or eij, the jth component of the signal vector characterizing 61(t); O S t S T. Let P(W:j) be the probability that eij is active in Xk and let P(W:) be the probability that noise alone is present. Then, as in previous sections, the events {ng}, WE can be associated with the state of a POMS with Gaussian component densities with means {eij} and 0. When L M = 2 the probability state vector k k k k k 2k - [P s(‘“ ” <-s +qu +q> M where q = l - q = Z q,. Since q (s) = O has only one root on the 0 1 1 mL 2 unit circle3 , s = 1, P3 is regular. In the most general case P3 has the same structure, with respect to non zero entries, as in the previous case. Hence the type of state activity is the same and P3 , 33 is regular. Identifiability conditions follow as before. If {91)} is a set of distinct elements, all unknown parameters can be determined. 32The only value of a which satisfies qu(eJa) = O is a = 0. 33See Sec. 4.7. 70 4.6 ADAPTIVE DETECTION OF SIGNALS WITH RANDOM.ARRIVAL TIMES This section treats a variation of the problem defined in Sec. 4.5. Instead of employing a randomly-chosen sequence of signals with random Spacing, one signal is randomly chosen and transmitted repeatedly at random times. The problem is to design detectors to determine when the signal is present and which signal is being sent. Nolte [N-l] found a detector for the case in which all the signals are known. How- ever, he did not discuss the conditions for adaption; i.e., for con- vergence to the detector that would be used if the identity of the signal being sent were known. Here, the signals are unknown but other prior information is assumed available to make the problem meaningful. The system under study is basically the same as that in Sec. 4.5 except that here the encoder switches between a fixed signal and noise. A typical signal at the channel output is shown in Fig. 4.6.1. 9(t) 1 1—T—-1 1——r——-1 1——T———1 Fig. 4.6.1 Channel Output for a Signal with Random Arrival Times 71 When the ith signal is being sent X(tk) = Xk = 6(tk) + n(tk) where 6(tk) is either 0 or 91.1 for some j; eij is the jth component of the signal vector corresponding to the ith signal 91(t). Let PCWi) be the prior probability that signal i is the one being sent repeatedly. As in Sec. 4.5, PCW: ) is the probability that X ‘8 e + n(tk) J k ij and PCWEO) is the probability that Xk = n(tk) and signal 1 is the one being sent. Then, the event {WIJ} can be associated with states of a POMS with a block diagonal transition matrix, each block correspond- ing to the event that a particular signal is being sent repeatedly. For example, when M = L = 2 the probability state vector is - k k k k k k EN - [P (W10) ’Pmll) ’P (“12) ’szo) ,P(W21) ,P 0122)] (4. 6. 1) and the transition matrix is 0 5 v1 l-vl O 0 1 P4 = v1 l-vl 0 (4.6.2) v2 l-v2 O O O 1 v 1-v O L. 2 2 J k -1 k k . . where vi PCWlo/Wfio ) — PCWiO/Wiz) are assumed given. Again, the component densities are Gaussian with means {913} and 0. Then, as indicated in Sec. 2.6, under appropriate assumptions to be discussed presently k k P(Wi/X) = JZIP(Wij/x ) _. A130 wpl (4.6.3) 72 where jo is the index of the true signal being sent and F(Wi) is the probability that ith signal is being sent. The probabilities {F(Wi/Xk)}? can be used to make decisions as to which signal is being sent. Now if the ith signal is assumed the one being sent the decision rule is decide si present if L P(W, /ka.) 11, 1 L(Xk/Wi) = 2 k J=1 F(in/x wi) > 1 (4.6.4) Consequently an adaptive detector is given by decide a signal present if L(Xk) > 1 where k M k k L(X) = z L(x /Wi)P(Wi/x) (4.6.5) z=l and k as L(X ) e L(X /wi ,91 ) (4.6.6) 0 O The rule (4.6.4) is analogous to that given by Nolte but it is not the optimal rule for detecting a signal. The optimal rule uses the likelihood ratio M L k M, k z 2Pm1./x)/ 21>(w1/X) i=1 j=l 3 i=1 0 which has adaption properties similar to (4.6.6). The conditions for convergence will now be investigated. The ith block of P corresponds to a special case (M = l) of P3 con- 4 sidered in Sec. 4.5. Hence, P is regular. 4 73 In checking identifiability it is important to realize that there are two estimation problems involved. The first is concerned with determining which signal is being sent and involves class iden- tifiability. The second involves the estimation of the signal being sent and regular identifiability. As indicated in Sec. 3.8, (4.6.3) can hold without (4.6.6) being true and vice versa. For example, let M.= L = 2. Then the stationary probability vector for the class corresponding to the ith signal is proportional to [vi l-v l-vi]. Hence, according to the discussion in the Appendix, 1 (4.6.3) will be true if v1 # v2 and, for each i, {913} is a set of distinct elements. If v1 = v2, additional prior information is needed on the parameters. It must be known that 91 and 92 lie in disjoint regions of the parameter space; e.g., the signal set might be anti- podal. If such information is not available, the signals cannot be distinguished, in general. 0n the other hand, if, for each i, {eij} is a set of distinct elements, Q can be learned and (4.6.6) will obtain even if the class 10 of the signal is indeterminable. If no special prior information is available concerning the signal from each class, then,one might as well design a detector for one unknown signal. The rule of Sec. 4.5 could be used with M = 1. The important point is that the rules con- sidered in this section provide a means of using this prior information when it is available. When 6 is known (the case treated by Nolte) it is clear from the Appendix that the existence of the set {Bi} of distinct vectors is sufficient for adaption. 74 4.7 REMARKS This section considers briefly some points that apply generally to the contents of this chapter. In attempting to treat a variety of situations in a uniform manner, simplifying assumptions were made and the models were slightly contrived. Consequently, the results generalize in many respects and are intended to include other situations that give rise to similar decision problems. The emphasis in these applications should be on the received signals with the encoder and channel serving as a convenient way of accounting for the generation of unknown signals in a Markov manner. From this point of view, the results of Sec. 4.3 indicate that the Markov dependencies between the received signals of other sections could be due to intersymbol interferrence; the randomly arriving signals of Sec. 4.5 or 4.6 could, for example, be seismic waves; any signal space representation of the received signal in Sec. 4.2 will serve as well as time samples to make decisions. It is also clear that the Gaussian noise assumptions can be weakened and the number of unknowns can be increased. With regard to the identifiability conditions, the following observations are important. The conditions stated in each section are sufficient to ensure that all parameters can be learned and effective decisions can be made on all the states of the corresponding POMS. However, in many cases, (Sec. 3.3-3.6) the events of interest consist of a union of other events. To make effective decisions, it is not necessary that all the component densities corresponding to the events in this union be distinguishable. Thus, depending on the inferrence 75 problem of interest and the available prior information, conditions for adaption can be considerably weakened, with Sec. 3.8 and the Appendix providing the guidelines. Verifying the regularity of the underlying Markov chains may have appeared to be a formidable task. However, it is well known that the regularity of a Markov chain can generally be determined from the structure of the corresponding transition matrix. While the eigenvalues give useful information concerning the system activity, they do not have to be computed to verify this assumption. Furthermore, the behavior of a general class of systems can usually be summarized by a simple example. This point of view was not developed in this chapter but was implicit in Sec. 4.5. Finally, while the optimum solutions given in this chapter pro- vide a reference for comparison, they are generally undesirable from the view point of practical engineering. Among the suboptimum solutions suggested by the estimation techniques of Chapter III it appears the method of moments would yield the most fruitful results. The success that Chang and Stewart have had with this method in de- veloping low-memory estimators for cases of practical interest in- dicate that similar results could be obtained here. 4.8 CONCLUSIONS The aim of this chapter has been to display the versatility of the model for a POMS and to exhibit how the results of Chapters II and III can be applied. Several communication systems were shown to be POMS. These include systems with 76 l. A Markov information source 2. A channel with memory 3. unknown synchronization 4. signals with random arrival time 5. combinations of the above For all these cases the form of the optimal detector for unknown re- ceived signal was shown to follow directly from the results of Chapter II and the conditions for adaptions from Chapter III. Although optimal decision making was emphasized in the above examples, it is implied that the estimation schemes and suboptimal decision rules apply as well. In addition to providing quick solution, the technique of formu- lating these inference problems as those of decision making for a POMS has other advantages. First, it provides a common model for a variety of seemingly different systems. This facilitates comparison and helps focus analysis efforts in one direction. Results developed for gen- eral POMS apply to all of the above systems. Next, it illustrates clearly the nature of the estimation problem involved. Whereas, the properties of the observations are not always clear, in an ad hoc formulation, the mixture approach used in this thesis clearly defines conditions for the existence of estimators and suggests a wealth of techniques. Finally, it brings into play the powerful tools of Markov chain theory. Once a state space and transition matrix have been established, a great deal can be inferred about system state activity. CHAPTER V GENERAL CONCLUSIONS In this chapter, the main contributions of the Thesis are re- viewed and suggestions are made for future research. 5.1 REVIEW This Thesis has been concerned with several inference problems related to a class of Partially Observable Markov Systems. Generally, the results represent an extension of previous work on unsupervised learning and adaption from the i.i.d. case to a particular case with dependent, non-stationary observations. However, additional inference problems not well established for the i.i.d. case were generalized and solved here also; namely those of class estimation and adaptive estima- tion treated in Sec. 2.6 and 3.8. The basic goal has been to construct estimators and decision rules and to state conditions under which they perform effectively. In Chapter II, Bayes Decision-Theoretic concepts were used to develop optimal solutions when the component densities are defined by an unknown parameter set. Large sample theory was used in Chapter III to establish suboptimum solutions. The basic estimation problem was one of mixture resolving and a general strategy was developed for ex- tending estimation techniques developed for the i.i.d. case to the more general case studied here. In general the results display many similarities with the i.i.d. case. The dominant role of mixtures and 77 78 identifiability, the implementation difficulties of the optimum rule and the need for prior information are all general characteristics of nonsupervisory problems. Chapter IV showed several communication systems of current in- terest to be POMS. Consequently adaptive detectors and estimators were easily established along with conditions for effective operation. This unifying approach to solving a previously troublesome class of problems represents a major contribution of the Thesis. 5.2 EXTENSIONS The similarities between the results obtained here and previous work in both estimation and Markov chain theory suggest certain natural extensions that used to be investigated. First is a class of inference problems with time varying para- meters. For example, as an extension of the optimal i.i.d. case Braverman [B-3] and Fralich [F-l] considered parameter changes summarized by the difference equation of the form = + Bm+1 BM AM where {AM} is a sequence of independent random variables. Their ideas are applicable to the problem treated here [H-S]. Next is the problem of developing estimators that are easier to implement than those given here. Sakrison [S-l] has used stochastic approximation techniques to solve the likelihood equation. The result is a simple iterative low memory estimator. This method appfifis to ergodic observation processes which suggests that it could be extended to the asymptotically ergodic case treated in this thesis. 79 Finally, most of the results in this Thesis apply to POMS whose state activity is described by a regular Markov chain. Although as demonstrated in Chapter IV this represents a useful class of systems it would be desirable to extend the results to include ergodic chains. In so far as those properties which effect the proof of Theorem 3.3.1 are concerned, regular and ergodic chains are similar and it appears the general estimation strategy can be extended. 5.3 SOME INTERESTING PROBLEMS In an attempt to exploit the similarities between the i.i.d. case and that of a general POMS, several interesting problems have been ignored. Most of these problems arise from the fact that, unlike the i.i.d. case, the optimum decision rule for a POMS has a changing structure (as a function of the observations) even when all quantities in the model are known. This causes three main difficulties. The first problem is concerned with implementation of suboptimum rules. Unless F(AN=i/XN,B) is stored as a function of B (This would lead to the same storage problems as the optimum rule) using P(AN=i/XN,BN(XN)) to make decisions at the Nth step requires storage of XN and an iteration over N steps using the schemes of Sec. 1.2. This is the case regardless of what is available from the (N-l)th step. Hence the memory and number of computations grow with N. In an attempt to overcome this problem Raviv [R-Z] has shown, for P un- known and 21 = E, that a fixed number of past samples can be used to construct decision rules with an asymptotic risk arbitrarily close to the corresponding risk for known P. The question of how many past 80 observations are needed in a practical situation remains open. The second problem is that of proving adaption of both optimal and suboptimal rules. For example one would like to Show . N N _. N a IP(XN 1/X ,BN(X )) - PO,N 1/X ,BO)| 0 Wpl This is a nontrivial statement and again the problem can be traced to the varying structure of the rule. Raviv by using only a fixed number of past observations had a fixed form rule (given P) and adaption followed. The third problem is that of computing probability of error. Even when allquantities are known and a very simple example is used error calculations are prohibitive [D-B]. This suggests that computer simulation must be used to evaluate adaptive decision making devices. The above problems indicates that a profitable result would be some practical measure of dependency between the samples. If such a tool were available effective suboptimum rules could be constructed with desirable implementation properties; the look-ahead mode of de- cision making could be better evaluated; and the effect of initial system behavior on asymptotic decision modes could be determined. At present computer simulation and intuition are the only guides. BIBLIOGRAPHY [A-l] [B-l] [3-2] CB-3] [c-1] [0-2] [0-3] [c-4] [0-5] [D-l] [D-2] BIBLIOGRAPHY Aoki, M., Optimization of Stochastic Systems, Academic Press, New York, 1967. Billingsley, P., “Statistical Methods in Markov Chains", Ann. Math. Statist., Vol. 32, pp. 12-40, 1961. Blischke, W.R., "Moment Estimators for the Parameters of Two Binomial Distributions," Ann. Math. Statist., Vol. 33, pp. 444-454, 1962. Braverman, D., Machine Learning and Automatic Pattern Recognition, Stanford Electronics Laboratories, Technical Report No. 2003-1, Feb., 1961. ' Chang, R.W. and J.C. Hancock, "0n Receiver Structures for Channels Having Memory,“ IEEE Trans. on Information Theory, Vol. IT-12, No. 4, pp. 463-468, Oct., 1966. Chien, Y.T., and K.S. Fu, "On Bayesian Learning and Stochastic Approximation", IEEE Trans. on System Science and Cybernetics, Vol. SSC-3, No. 1, pp. 28-38, June 1967. Choi, K., Estimates for the parameters of a finite mixture of distributions, University of Missouri, Technical Report No. 18, 1966. Cooper, D.B., 0n the Existence of Nonsupervised Adaptive Signal Detectors and Detector Estimation using Stochastic Approximation Methods, Ph.D. Dissertation, Columbia University, April 1966. Cooper, D.B., and P.W. Cooper, "Nonsupervisory Adaptive Signal Detection," Information and Control, Vol. 7, No. 3, pp. 416-444, Sept., 1964. Deely, J.J., and R.L. Kruse, Construction of Sequences Estimating the Mixing Distribution, Ann. Math. Statist., Vol. 39, No. 1, pp. 286-288, 1968. Doob, J.L., Stochastic Processes, Wiley, New York, 1953. 81 [H-2] [H-3] [H-4J [H-S] [K-l] [K-2] [K-3] [K-4] [L-1] [M-1] [M-2] [N-l] 82 Drake, A.W., "Observation of a Markov Source Through a Noisy Channel," presented at the IEEE Symposium on Signal Transmission and Processing, Columbia University, May 1965. Fralick, S.C., "The Synthesis of Machines which Learn Without a Teacher," Stanford Technical Report No. 61308-9, April 1964. Hancock, J.C., and R.W. Chang, "Unsupervised Learning Receivers for Binary Channels with Intersymbol Interference," presented at the IEEE Symposium on Signal Transmission and Processing, Columbia University, May 1965. Hancock, J.C., and E.A. Patrick, "Learning Probability Spaces for Classification and Recognition of Patterns with or without Supervision", Purdue University Report TR-EE-65-21, Nov. 1965. Hancock, J.C., and T.L. Stewart, Parameter Estimation with Un- known Symbol Synchronization, Purdue University Report TR-EE- 67-1. Hancock, J.C., and P.A. Wintz, Signal Detection Theory, McGraw- Hill, 1966. Hilborn, C.G., and D.G. Lainiotis, "Optimal Unsupervised Learning Multicategory Dependent Hypothesis Pattern Recognition," IEEE Trans, on Information Theory, Vol. IT-l4, No. 3, pp. 468-470, May 1968. Kakalik, J.S., Optimum Policies for Partially Observable Markov Systems, Technical Report No. 18, Operations Research Center, Massachusetts Institute of Technology, Oct. 1965. Kale, B.K., "On the Solution of the Likelihood Equation by Iteration Processes," Biometrika, Vol. 48, pp. 452-456, 1961. Karlin, S., A First Course in Stochastic Processes, Academic Press, New York, 1966. Kemeny, J.C., and J.L. Snell, Finite Markov Chains, Van Nostrand, 1960. Loeve, M., Probability Theory, Van Nostrand, New York, 1955. Martin, J.J., Bayesian Decision Problems and Markov Chains, John Wiley and Sons, New York, 1967. Mendel, J.M., "A Survey of Learning Control Systems," I.S.A. Transactions, pp. 297-303, July 1966. Nolte, L.W., "An Adaptive Realization of the Optimum Receiver for a Sporadically Recurrent Wave Form in Noise," IEEE Trans. on Information Theory, Vol. IT-13, pp. 308-401, April 1967. [P-1] [P-2] [P-3] [R-1] [R-2] [R-3] [R-4] [R-S] [3-1] [8-2] [s-3] [3-4] [3-5] [s-e] [T-l] 83 Patrick, E.A., and J.C. Hancock, "Nonsupervised Sequential Classification and Recognition of Patterns," IEEE Trans. on Information Theory, Vol. IT-12, No. 3, pp. 362-372, Oct. 1966. Patrick, E.A., "On a Class of Unsupervised Estimation Problems," IEEE Trans. on Information Theory, Vol. IT-14, No. 3, pp. 407- 415, May 1968. Pearson, K.P., "Contributions to the Mathematical Theory of Evolution," Phil. Trans. Roy. Soc., 185A 71-110, 1894. Rao, C.R., Advanced Statistical Studies in Biometric Research, John Wiley and Sons, New York, 1952. Raviv, J., "Decision Making in Incompletely Known Stochastic Systems," Int. J. Eng. Sci., Vol. 3, pp. 119-140, Pergamon Press, 1965. Raviv, J., "Decision Making in Markov Chains Applied to the Problem of Pattern Recognition," IEEE Trans. on Information Theory, Vol. IT-3, No. 4, pp. 536-551, Oct. 1967. Rider, P.R., The Method of Moments Applied to a Mixture of Two Exponential Distributions, Ann. Math. Statist., Vol. 32, pp. 143-147, 1961. Royden, H.L., Real Analysis, Macmillan, New York, 1963. Sakrison, D.J., "Stochastic Approximation: A Recursive Method for Solving Regression Problems," Advances in Communication Systems, Vol. 2, 1966, Academic Press, New York, 1966. Sammon, J.W., “An Adaptive Technique for Multiple Signal Detection and Identification," IEEE Convention Record, 1967. Scudder, H.J., "Adaptive Communication Receivers," IEEE Trans. on Information Theory, Vol. IT-11, pp. 167-174, April 1965. Slansky, J., "Learning Systems for Automatic Control," IEEE Trans. on Automatic Control, Vol. AC-11, No. 1, pp. 6-19, Jan. 1966. Spragins, J.D., Reproducing Distributions for Machine Learning, Stanford Electronic Lab., Technical Report No. 6103-7, Nov. 1963. Spragins, J.D., "Learning Without a Teacher," IEEE Trans. on Information Theory, Vol. IT-12, No. 2, pp. 223-230, April 1966. Teicher, H., "On Mixtures of Distributions," Ann. Math. Statist., Vol. 31, pp. 55-73, 1961. 84 [T-Z] Tucker, H.G., A Graduate Course in Probability Theory, Academic Press, 1967. [W-l] Wilks, S.S., Mathematical Statistics, John Wiley and Sons, New York, 1962. [Y-l] Yakowitz, S.J., and J.D. Spragins, "On the Identifiability of Finite Mixtures," Ann. Math. Statist., Vol. 39, No. 1, pp. 209-214, 1968. APPENDIX APPENDIX Identifiability and Prior Information for Unsupervised Learning In Sec. 3.5 and 3.8, it was shown that when the set of mixtures which arise in the estimation problem of Sec. 3.1 is identifiable, the unknown parameter vector B0 can be uniquely determined from the observations. To ensure this condition constraints must be imposed on the family of component densities and the parameter space. Thus in a particular decision making problem a certain amount of prior information is needed to guarantee solution. This Appendix deals with sufficient conditions for identifiability. Most of the results are taken from the literature but are presented from a slightly different point of view, more suitable for the problems of interest in this Thesis. Sufficient conditions are also established for class identifiability introduced in Sec. 3.8. The Uniqueness of Representation Property for Finite Mixtures Let 8 = {F(X;a)}a€A be a family of joint densities indexed by a point a taking values in a subset of a finite dimensional Euclidean vector space A. Let k k HR = {H(X) = 2 C.F(X;or.), c. > o, z c. = 1, F(x;a.) e 8} , 1 1 1 1 1 1=1 l k . . . °° k . where the {ai}1 is a distinct set of elements. Then. &'= U H 13 1 the set of all finite mixtures of the family 8. The set fl' is said 85 86 to have the uniqueness of representation property (urp) if R k .2 czF(x;ai) = .2 0119(me (A-l) 1=1 1=1 implies k = k and for each i, 1 S i S k there is some j, 1 S j S k A A such that Ci - Cj and ai = aj. The URP can be restated as saying there is a one-to-one correspondence between each set of allowable points {C1,ai}: and the mixture they generate. As indicated by the following theorem this property can be characterized by 8. Theorem A-l. A necessary and sufficient condition that H’ have the URP is that 8 be a linearly independent set over the field of real numbers. This theorem can be used to establish the URP for the set of finite mixtures generated by the following families. 1. The family of n-dimensional Gaussian cdf's indexed by the mean and/or the covariance matrix. 2. The family of n-dimensional exponential cdf's indexed by the exponent constant. 3. The translation parameter family induces by any cdf with a bilateral Laplace transform. That is any finite set of distinct elements in each of the above families is a linearly independent set. Identifiability and the Estimation Problem For the estimation problem Sec. 3.1 the class of mixtures of 1nterest 13 HH = {Kh(X;B)}B68 where M Knows) = iflniflxmi) (A-Z) 87 In (A-2) B = [B1,...,BM], f(X;Bi) E {5, and B is a subset of M- dimensional Euclidean space containing only vectors with distinct components. Then HTT is said to be an identifiable class of para- meter indexed mixtures if the mapping defined by (A-2) say Qn is a one-to-one mapping from 46 onto H". Then if Kh(X) E H" there exists a unique vector B such that K%(X;BO) = Kh(X). 0 Since Hn<2 H, if H has the URP so does H". However the URP guarantees the uniqueness of B only to within an equivalence 0 class. That is permutations of the components of B might result 0 in another solution vector. This would be the case if all the com- ponents of n were not distinct. Thus in order to guarantee iden- tifiability as defined here a constraint on the parameter space is needed as well as URP. In a practical situation this constraint will be a reflection of prior information on a particular problem. For example, if M = 2, HI = n2 = 5 and it is known B1 is the parameter associated with state i of the system, then with > 32’ where Bi the URP BO can be uniquely determined by minimum distance estimation. It is important to realize also that in order for the Bayes algorithm to learn Bo these constraints must be reflected in the prior dis- tribution PO(B). If one is not interested in using the estimates for decision making then constraints can be arbitrarily imposed to get a unique solution vector. The definition of identifiability and all the above remarks can be extended to the case in which n is unknown and the parameter space is 2M-dimensional. 34Identifiability is usually defined for this larger class of mixtures [H-2] and the above definition is a consistent modification. 88 Class Identifiability If 81 is the family of component CDF's associated with the .th , i , 1 class then as 1n Sec. 3.8 H . = {K (X;B )} . . 18 the corre- n1 n1 31631 sponding set of mixtures with parameter space 181 and probability L 1 state vector n . According to definition 3.3.1 H = U H is 1 i=1 " said to be class identifiable if {H i}? are disjoint subsets of H". n Then from Theorem A-1 and a simple contradiction argument any of the following conditions are sufficient for class identifiability of H”. L l L . . l. S = U 81 is a linearly independent set, {n1} is a set 1=1 of vectors distinct to within permutations on the components. 2. 8 is a linearly independent set by 51 n 5'1 = d) i 9‘ j L . 3. S is a linearly independent set and fl 81 = ¢ i=1 Assuming 8 is a linearly independent set, conditions 1 and 3 above indicate that classes of states can be distinguished when the stationary Probability vector associated with different classes are distinct or the component densities associated with different classes have distinct forms. However condition 2 indicates that if sufficient prior information is available concerning the parameters associated with each class the classes can be distinguished even when the class transition matrices are equal and the component densities for each class have the same form. The above conditions are sufficient and it is clear that they can be weakened. For example if 81 is not a linearly independent set, its mixtures do not necessarily equal those corresponding to other classes. Also the components of B3 need not be distinct to distinguish classes. In fact when n is known the distinctness of 89 the components of B is needed only to ensure the system states can 0 be distinguished in some sense and is not necessary to estimate B0. Finally when the class transition matrices are distinct but the corre- sponding stationary probability vectors are not, a second order esti- mation strategy (Sec. 3.7) may still lead to an estimator for 10. m1 L mg ”7 "4 “7 “1 m3