BAYESLAN DECISION MAKING AND LEARNING FOR CONTINUOUS - TIME MARK-0V SYSTEMS Thesis for the Degree of Ph. D. MICHEGAN STATE UNIVERSITY ERDAL PANAYERCI 1970 LIBPxiRY meme Mldil? "'1 State Uniursity This is to certify that the thesis entitled BAYESIAN DECISION MAKING AND LEARNING FOR CONTINUOUS -TIME MARKOV SYSTEMS presented by ERDAL PANAYIRCI has been accepted towards‘fnlfillment of the requirements for Ph.D. degree in Electrical Engineering & Systems Science ' y —' ,w .a-N I"! / » . U '3’ . ‘ / ‘ ,, I" - f a- . . c/ ‘L. 4 _ . w’ 1.. .0, -' — .1.-.”- thor professor Date Novembg: 17, 1970 0-169 1.1 8-37 ABSTRACT BAYESIAN DECISION MAKING AND LEARNING FOR CONTINUOUS-TIME MARKOV SYSTEMS BY Erdal Panay1rc1 This thesis is concerned with Bayesian decision making and learning algorithms for a particular problem in parametric pattern recognition in which each of a finite set of pattern classes is characterized by a continuous-time, discrete-state Markov process. The basic problem considered is that of determining rules for making decisions about the identity of the active pattern class based upon observation of a sample function in some finite interval. The sta- tionary transition probability matrices for the processes in question are the parameters of the pattern classes. Statistical decision theory is employed throughout to develop optimal solutions. In the first part of the thesis, Bayes-optimum decision rules are derived under a perfect observation mechanism (noiseless case). The observed quantities are the sojourn times in the states and the state numbers themselves. Using classified samples from each pattern class, an algorithm for supervised learning is presented and the existence of reproducing prior densities for the parameters is demonstrated. Particularly useful results are the formulations of recursive, computationally simple parametric forms for the posterior densities of the unknown parameters and for the optimum decision rules. The simulation of a Specific example shows l I o I Q .§ 0. 'U '1 u D ‘ .\. ‘ a — r. .~ _. o . vl .\. :- D4 . v, , s D u ~ ~ . o . . .h. ... v. . 9. n . pl. . p, a . . . u L- . . u . ... my so Erdal Panayirc1 the empirical probability of error for different amounts of training data and demonstrates the inherent practicality of the results. The problem of computing probability of error is investigated extensively for the noiseless case. The exact probability of error, as well as lower and upper bounds and asymptotic expressions, are established for several cases. Conditional error probabilities of the first and second kinds are introduced by which the usual proba- bility of error can be computed iteratively. In the second part, only "noisy" observations are available. In this case, a new model is defined in which the states of the con- tinuous-time Markov chains are described by random processes, but the transition times can be observed. Iterative, optimal (minimum Bayes risk) decision rules are derived and conditions are established under which these rules perform effectively. Optimum-adaptive decision rules are defined when the underlying model is not completely specified. Decision rules are formulated with two types of random processes. Finally, the situation when the transition times cannot be observed is investigated for the Special case in which there are two pattern classes and the states are observed with additive, Gaussian, white noise. Both discrete and continuous observation times are considered. 
Computationally feasible algorithms are derived for the likelihood ratio which optimally solve the problem, assuming a discrete- sampling scheme. Also, stochastic differential equations are found for the continuous logarithm of the likelihood ratio and the continuous conditional probabilities of errors from discrete results by a limiting operation. The results are applied to the Specific problem Erdal PanayerI of detecting a random telegraph signal (two-state, continuous-time Markov chain) in white noise. BAYESIAN DECISION MAKING AND LEARNING FOR CONTINUOUS-TIME MARKOV SYSTEMS By Erdal Panayirc1 A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Electrical Engineering and Systems Science 1970 ‘71’ 57/3/ TO MY MOTHER AND FATHER ii . a t _ .. . .= .S. C I . s v t . u o . 2. o .1 P. o . .8 u. v .d rn . v. . .. L¢ .v.. o. la I: v _ n .0. v, no 0 S «L . : p o o . 0 h T .. .r o s.-. . v. .. run .\. _\. .. ns. . .2. . o w. . n). p. b. «i. u. . o .I o a O Q . . I, . n .n. .s. S u» .\e .0 I o . . . p. Ir pk. _ e .d r‘u -. . . o s 1. . v. v. In. 3 .» ... . . s e v. ., n. L“ D Q r. V» L . .. .x- ’t .‘v c; . a. D a . a o . e . rs .\v .A v a n! . . c Q... u s — . o . O 5 a p p .I- .- . c '0. a. O. s - ,n u .p. c . q - PJs o .. .x. .. .. “I 1 .. .1. u. u.“ . .1 .g.. .. . ACKNOWLEDGEMENTS The author very gratefully acknowledges the guidance and encouragement of Dr. Richard C. Dubes, Department of Computer Science, his teacher and academic advisor, throughout the course of this research. The author is indebted to Dr. Dubes for suggest- ing the problem as a potentially fruitful area for research, as well as for his generous devotion of time to provide counsel. Special thanks are due to Dr. Dennis C. Gilliland, Depart- ment of Statistics and Probability, for many useful discussions concerning the problems treated in this thesis. For their interest and consideration of this research, the author wishes to acknowledge the other members of his doctoral com- mittee: Dr. Carl V. Page, Department of Computer Science, Dr. James H. Stapleton, Department of Statistics and Probability, Dr. Rita Zemach, Department of Electrical Engineering and Systems Science. Also, Dr. Yilmaz Tokad, Department of Electrical Engineering and Systems Science, deserves a Special acknowledgement because of his recommendation and continuous encouragement. Finally, acknowledgement is made to the Technical University of Istanbul and the Division of Engineering Research, Michigan State University, for the financial support which made this work possible. iii Von- h... Yt—u t.\. ' p—l o 'J ha o P] - , .. -.~ . 4 v d." ‘ 1 4" c , .4 Chapter I II III TABLE OF CONTENTS ABSTRACT ACKNOWLEDGEMENTS LIST OF FIGURES LIST OF APPENDICES LIST OF SYMBOLS INTRODUCTION .000........0.0. ..... . 00000 . ...... .... 1.1 Literature ReView ...0...............0......O. 1.2 Thesis Outline and Contributions ............. DECISION MAKING AND LEARNING WITH OBSERVABLE STATES AND TWSITION TIES 0....0...0.........0.00 2 1 Source Model ................................. 2 2 Decision Rules ............................... 2.3 Optimum Decision ............................. 2 4 Adaptive Decision Rules When Q-Matrices Are Not Known ................................ 2 5 Supervised Learning .......................... 2 6 Learning (q, } .............................. 2.7 Optimum-Adapfive Decision Rule ............... 2 8 Computer Implementation ...................... 
2 9 Conclusions .................................. ROBABILITY OF ERROR O.............0.0...0...0..0.. 3.1 Exact Probability of Error for N = 2, and Known Q-Matrices ............................. 3 2 An Upper Bound on the Probability of Error ... 3.3 Probability of Error for Large Sample Sizes .. 3.3.1 N = 2, M = 2, and Known Q-Matrices .... 3.3.2 N > 2, M = 2, and Known Q-Matrices .... 3.4 Recurrent Expressions for Probability of Error .......0.......O...00..................O 3.5 COHCIUSionS .0.........0...........00.....0... iv Page iii 11 12 13 15 17 19 23 25 26 31 32 32 4O 42 42 44 50 52 IV VI DECISION MAKING AND LEARNING WITH UNOBSERVABLE STATES AND OBSERVABLE TRANSITION TIMES ............ 4.1 Mathematical Model for Generating Observations 4.2 Iterative Generation of the Optimum Decision Rule ......................................... 4.3 Application of the Model to the Special Cases 4.4 Adaptive Decision Making .................... 4.4.1 Optimal-Adaptive Decision Rule ........ 4.4.2 Iterative Form for the Decision Rule .. 4.4.3 Asymptotic Optimality and Adaptation .. 4.5 Conclusions DECISION MAKING WITH UNOBSERVABLE STATES AND TRANSITION TIMES 5.1 Optimal Decision Making with Time Samples .... 5.2 Observation of Entire Sample Function ........ 5.3 Optimal Detection of Random'Telegraph Signal in Additive Gaussian Noise ................... 5.3.1 Optimal Decision Rule ................. 5.3.2 Iterative Computations of the Mean and Variance of ................... 5.3.3 Probabilities of Errors ............... 5.3.4 Observation of Entire Sample Function . 5.4 Conclusions GENEML CONCLUSIONS AND EXTENSIONS 0 0 O . C . O O . O 0 O C 0 . O 6.1 Conclusions 6.2 Extensions BIBLIOGRAPHY ... 54 55 58 62 68 69 71 73 76 78 80 84 89 90 92 96 97 100 103 103 106 108 , a. v .oLcn . . ‘ '50. 3- ..L- I I LIST OF FIGURES Figure Page 2.8. 1 Average Error C‘Jrves 0 . . . . . . . . . . . . . . . 0 . . 0 . 0 0 0 . 0 . . . . t 30 4.1.1 The General Model and the Sampling Scheme ......... 57 4.2.1 An Algorithm for Iterative Implementation of k k 8(5 ,t) ......OCOOOCOOOCC....00000...000..00...... 63 4.3.2 Asampling SCheme ................00..0000..00..0.0 64 5.0.1 The Model for Decision Making with Discrete Sampling SCheme ......0.......0....00.0.00.......0. 79 5.3.1 The Block Diagram of the Optimum Detector ......... 98 A.1 Typical Sample Function from 8 Continuous- Parauleter Markov Chain 0.0.000..................... 118 D.2 Algorithms for a Single Error Curve and for DeCiSionMaking ......0...............00000......0. 126 vi o --.. ’l o... 'v ... Appendix A B LIST OF APPENDICES DEFINITIONS AND PROPER TIES 0.000000000000000000000 DERIVATIONS OF OPTIMAL DECISION RULES ............. A PROOF OF CONVERGENCE ............................ ALGORITHMS FOR ERROR CURVES AND FOR DECISION MAKING ............................................ THE ROOF OF 1mm 303.2 0.00.00.00.000.0.0.0000... vii Page 112 119 123 125 128 ‘0. Appendix LIST OF APPENDICES DEFINITIONS AND mOPER TIES 0.00000000000000000000. DERIVATIONS OF OPTIMAL DECISION RULES ............. A PROOF OF CONVERGENCE ............................ ALGORITHMS FOR ERROR CURVES AND FOR DECISION MAKING ............................................ THE ROOF OF IJEMM 30302 0000......00000000.0.0.... vii Page 112 119 123 125 128 Symbol 3(xk.tk\e = 8) (s) 11 LIST OF SYMBOLS Description Number of states in each Markov chain. Number of pattern classes (Markov chains). Prior probability for pattern class 3. Density (probability mass) function for first observation when chain 3 is active. 
é (x1,x2,...,xk) Random variables representing state numbers (also, the values for a particular realization of the random variables). = {l,2,...,N} Range Space for each Xi' IID (t1,t2,...,tk) Random.variables representing sojourn times (also, the values for a particular realization of the random variables). 3 = [qi )] Transition-rate matrix for pattern .1 class s. Q (S) ‘ ‘qii Parameter establishing distribution of sojourn in state i when class 8 is active. Point in state Space (Pattern class). A k k = gs(x ,t ) Product of density function for k k t and probability mass function for x When class 8 is active. 3 s glj)/q§ ) Transition probability from state IID i to state j for the jump chain associated with class 3. viii C a V1 I .\u r.- .i- S I. p» s - Paw n a: S s .\1 (. ~\ ) . C V C a \, Symbol fs(qS’rS\YaT) n9?) 1J (8) Description = N,j(xk) Number of one-step transitions from 1 . . k state 1 to state j 1n x . k . . . k . = Zi(t ) Total SOJourn time 1n t Spent 1n state i. _ k . . - Ki(x ) Number of occupancies of state 1 . k 1n x . IID (ysl,y82,...,ysns) States from the tra1n1ng sample function produced by pattern class S. = (TSI’TSZ’...’TSH ) SOJourns from training sample function produced by pattern class S. Concatenation of states from all training functions. Concatenation of sojourns from all training functions. _ (S) q(S) q(S 8) — (ql sq 2 900°:q N entries of Q8. ) Vector of diagonal Vector of all non-zero terms from the transition matrix of the jump chain for class 8. Posterior density for the parameters of the class 3, given all the training patterns. n(: S) (y: 8) Number of one- step transitions from state i to State j in the training n patterns ysS from class S. 2(8 H)(T: ) Total sojourn time in state i n8 recorded in the training pattern T88 from class 3. Symbol Description RES) = k§5)(y:s) Number of occupancies of state i in the training pattern y:8 from class f(qs,rs\To,yo) Prior density for the parameters from class WES) ”flip [u:§s)]§,j=l Parameters of the prior density for (q8,rs). [w :8) ’Vilbi-fl’ (S) N . . [mij 1i,j=l Parameters of posterior denS1ty for (qs,rs). w.p.1 With probability one. ij Kronecker delta. Pe Total probability of error p Bhattacharyya distance. 00,50 Probability of error of the first kind and the second kind. ao(k),30(k) Conditional probability of errors of first and second kinds. p1,p2,...,pk Times at which transitions occur. R(t) Auto-correlation function. 6(t) Impulse function. Ak Likelihood function. n . Threshold. u(-) Unit step function. -. ...g . . . . v - Q. 1.1. ... _. T . _. ’ o v a .- ,c a Q ~. C. A.» .... w, A. t. S k ... a. a .. 7: .c (. .. u." a“ .3 ’s u p. 6» ‘5 t. .d a. .. fie *3 .. .... .... .H .u. ..q S at w. .3 .t _. a“ .m .t .s .... ... .a . . .\ DA . a .. a .. t u a in 1.. ... ..r. CHAPTER I INTRODUCTION In recent years, pattern recognition and learning theory have received a great deal of attention from investigators in many branches of sciences. Some applications are: Recognition of speech and handwritten characters, checkers and chess playing machines, classification of electrocardiograms and EEG'S, detec- tion and filtering theory, System identification and Operations research. In this thesis, "pattern recognition" is another name for "computerized decision making". Given a set of objects, each arising in one of a finite number of sources, an algorithm is to be established which efficiently classifies the objects and all other Similar objects. A pattern recognition problem, in general, consists of two sub-problems. 
The first sub-problem determines which measurements should be taken on the objects. These measure- ments, called "features", characterize all the possible objects. A "pattern" is a vector of feature measurements. At present, there is very little general theory for the selection of features. In- vestigators have been concerned with the selection of subsets or linear combinations of existing features or with ordering features in a given set of measurements. The criterion for feature selection or ordering is often based on either the importance of features in II I II I I n I Q . u o ... .9. an . v. .... .. .u . . it. 1 ... . .0 5 ... 3.. .. a. a}. ( p. _. v. o. v. w. . r. .3 ... . .I .G r. r. 3.: r. . .. at. v. .. .. .. .C . ... s .d T. .t .... r. ..l. pun .u. 9L . .1 . are. u . at a b a r: . o O s o x a. VI {3 rug 3.. u . r. t . ...“. .. I .. nu .. .... r .. z .. 2 it 5 .. ... ... S ... 6. 5 Z ... .~ 6 .C ..(n . a: -. v. .A. .4.» ... a b .C b; N. v. .Q .t n.. r «a .n 0.. a I... . . 2’ .r. C. .r. ._ .J ... «.... ~— .\. ... S a: .r. Lu pl v. .. . V. On «p. D a. 3. Lu 0. A. . ... O. .3 NJ .. I. .: u. i: a can . . b . o a y s .C . _ 5. o e n u .u. .d . I .r. s t e L... ., .. .a.u ‘5. N a .Dn AL . V; a . u . .7 r. ..h . .. 11‘ v u u . a C . a a v D u y. 3 Grow. " DJ .0 . Hit a s v. u. 1 »\ b H l.- . ”I. ... S M“ I! IN m‘ .an~ I.” characterizing the patterns or on the contribution of the features to the performance of a recognition algorithm. The second sub- problem in pattern recognition is the problem of classification, or making decisions about the source of the patterns. Thus, a pattern recognizer consists of a "feature extractor", and a "classifier". Statistical decision theory provides a powerful mathematical tool for solving the classification problem when the features representing the pattern classes can be described by probability distributions. The application of decision theory to the pattern recognition problem is called "parametric pattern recognition". One can proceed to obtain optimal decision rules satisfying a (subjectively chosen) classification criterion; e.g., minimum probability of misclassification (probability of error). The problem of learning in parametric pattern recognition must be solved when the distributions characterizing the pattern classes are inadequately known. The unknowns of the class dis- tributions are "learned" from sample patterns drawn from each of the sources, or pattern classes. These samples are called "training patterns". Supervised learning (learning with a teacher) refers to the case when the training patterns are classified. When the origins of the training patterns are unknown, learning is non-supervised (learning without a teacher). A so-called "non-parametric" approach to pattern recogni- tion refers to the design of classifiers in which no assumption is nade as to the form of the underlying probability distributions char- acterizing each class. The goal of the recognition system is to partition the feature Space into regions such that each region can l II . . s. ...u A. .. . . . . . .4 a .3 to . 4 u. . o —._ a n .a u a n ‘ I A o a e . k .... v o '5 . . V. 3a .0; v. .C an v. .r on .J .u an an .3 a... . . s . a: C. ...A .... v . v . .9“ .5. .fv .V- 9‘ P. a . t V .A 2's ,v. n. ,1 s . . .4.» uh. - u o a 5 . . . .au ... r. .5 0.. ... s. o. I $— é .a V. Arv 9: ix .4 . . ‘5 u thud on. In t L uau S Rb. . . an". s u . . .\< -§ v - s ..l. . . 5 ., m . . n 4 P. .ts .x. .w. T s . a by it c P. .. S u .6A 5 s be identified with a pattern class. 
This can be achieved by defining a discriminant function for each pattern class on the feature space. A pattern is classified by choosing the class corresponding to the largest discriminant function. Training patterns from each class are usually available and the problem is to establish a reasonable set of functions from them. In this thesis, a general parametric pattern recognition problem is investigated in which the classification of an unknown pattern is inferred from a finite set of training patterns. Each pattern class is characterized by a different N-State, continuous- time Markov chain. The stationary transition matrices of the chains are the parameters of the pattern classes. Depending upon the medium (noisy or noiseless) in which the observations are made, several feature selection schemes are considered. Optimum Bayes decision rules are formulated as solutions to problems in statistical decision theory. When the parameters are not known, a supervised learning scheme that uses classified training patterns is employed. The general model considered here finds, specifically, an application area in the classification of EEG'S which has been investigated by Dubes [D-S], recently. It is also applicable to detection and filtering problems with Gaussian noise when the underlying signal is a function of a finite dimensional Markov process. 1.1 LITERATURE REVIEW A comprehensive survey of early works on pattern recogni- tion and learning theory has been written by Nagy [N-l]. Nilsson [N-3] presented some of the theory of "learning machines", or . . . . .. ... . . . .Ie .- . c o. v. . 1. u u . . A n . . a I. ... t- . ... a. .. .A . . . . . o. p: p. p . a a . ... s a A .u .. ..\v at. .4. .r: .\~ .47. a .J . v. ”I 5Q . ... . . .n.. L: ~. . a . v. . . v . x u. v. . a a Z t. .. n. s . a s A r» in. p‘. h. ...» V. L. . \ .—. s . b . C. n .3 . .u. u . _ S. .h.. P. t s v. .. at: C ,. ... ._ 1 h. r:— a. .C L. .C uh. . . .p a v . {a P3 P . ~ A (a 5d 1!:- nachines which can be trained to recognize patterns, and Sebestyen [S-l] identified the task of finding clustering transformations as central to the design of pattern recognizers. Fu [F-A] presented the latest developments in the area of machine learning, emphasizing sequential methods in statistical decision theory and estimation theory. Recent books edited by Watanable [W—3], Kanal [K-B] and Tou [T-Z] are collections of papers on pattern recognition. Each author emphasizes the philosophy of the approach rather than mathe- matical derivations or experimental data. The problem of classification has, indeed, received much attention by statisticians. Of the many sources, the books of Fisher [F-Z], Anderson [A-l], Raiffa and Schlaifer [R-l], Ferguson [F-ZJ and Blackwell and Girshick [B-B] Should be mentioned that deal with the theory of statistical techniques and the application to classification problems. For learning theory, Spragins [S-S] and Braverman [B-A] studied the convergence question in supervised learning. This question deals with the Sufficient conditions under which the para- meter posterior density approaches the delta function about the true value of the parameters as the number of samples increases. Patrick and Hancock [P-l] gave a rather general approach to learning schemes. The literature closely related to the thesis is now sum- marized. Dubes and Donoghue [D-h] considered the problem of determining which of a finite set of N-state, discrete-parameter Markov chains is active. 
The states are observed without noise and the transition probabilities for the chains in question are ..K. --.r’ on _. II D V . h . . . I .. w. ..t ~ I . I O n o . ... r. a n A r: .‘e ‘n r . .d a» Qs l . p. ..>_ . . u. u . .0; _ v. C .r. L- r. L . Z .\c . _ a3 .34 .u .: .1 . a a. L. s t _. ... a e . ... . t . J . . .u . v. . .. s. . C .. .. a? ,S .. ... G. ... a: .\. p . Ce 9 a fix. t .cu .: _ .w. o t I . A . v. ..t. . u .. v . .S c \v . . L. .3 3 ... ... .u: .W. .5 . . ...,“ Eu. ..1 v,\l. sot. unknown. A Bayesian Strategy is employed throughout the report. However, they did not investigate probability of error problems. The reSults in Chapters II and III presented in this thesis differ from reference [D-h] in that the pattern classes are described by continuous-parameter Markov chains. The unknown parameters are the transition rate matrices 62-matrices) and noisy observations are considered. Exact and asymptotic probability of error expressions are also derived for several cases. When the state activity of a general system is described by a first order, homogeneous, discrete-parameter Markov chain and the states of the chain can be observed only in the presence of noise, most of the literature deals with the design of optimum and sub- optimum decision rules for making decisions about the states of the system and establishing conditions under which the unknown para- meters can be learned. Billingsley [B-Z], Martin [M-Z], Good [G-1] and Bartlett [B-l] estimated the transition probability matrix of the underlying chain assuming, by some external means, that the states of the system could be observed. Removing the assumptions of observability of the states, Raviv [R-Z] constructed a class of adaptive decision rules using an estimate of the transition matrix, P, and only part of the past observations. Recently, Signori [8-2] studied the problem of determining the optimum-adaptive decision rule when the observations were governed by an underlying discrete-para— meter Markov chain. The conditional densities of the observed random variables, given the state of the system, were characterized by a set of unknown parameters. He derived an iterative, optimum adaptive decision rule with the capability of using future and past observations I | ‘ . u I n r \n I n . a I 45. . . J u v . o . . e a ._ r. o I o p. O u s p . .. n S a u. . . o . A. ... o o a C n: L . a . . I; a. .. ..- . . u o . .I . . o . . ~L 2’. . .1 s . _ . u c. n. v. r. .L .3 o. «I p c. _ u . u. . . .s . . . a. .\. . Ce . u . v . . . .4. . . ,r. .— .u. :. u a . _. . . .. o . . a. n . . ._ a. r. v. v. a . 3 .. cl . c. .. ... . . .. a . at. .n . u. .s a” q . - b . ... .4. . Vt - . P. a» .. r . 0 . . o . a u .x. . c a .1 . . . QJ c a nil as well as present observations. He also constructed a variety of consistent estimators for the unknown parameters which yield a class of Suboptimal rules. Patrick and Hancock [P-l] gave a rather general approach to the problem Of learning for classification and recognition of pat- terns with or without supervision. Their model for the quantization of the parameter Space is given as a reference in Chapter IV, to the solution of the problem of finite computer storage in implementing optimal decision rules and learning schemes. 
Hilborn and Lainiotis [H-l] investigated the Optimal (in the quadratic sense) nonlinear estimation of discrete-time or sampled stochastic processes, where the processes can be characterized as having probability distributions of known functional form but con- taining a set Of unknown parameters. A Bayes optimal estimate for a state was to be determined and expressed in terms Of the parameter- conditional-optimum estimates and another statistic which could be computed recursively. The observations obeyed a generalized Markov property. The results of Chapters IV and V in this thesis differ from [H—l] since the Optimization criterion used here was minimum probability of error. The decision problem is alos different in nature. Recent studies of the problem of detecting an arbitrary random signal in the presence of additive Gaussian noise by Kailath [K-Z] have resulted in an exact formula for the optimal likelihood ratio, which applies to the detection Of any continuous second order Stochastic process. The result must be expressed in terms Of con- tinuous Stochastic integrals, so it cannot be implemented directly. . u -.n~'v:" V “.-...U.‘ ‘ z.-. ... -40 v .‘1- .- Ob.- — " .... 3 . ‘I;"";O . >LI-‘l.~. -'- n ' h- . ... _ a . n 5‘ .. ‘__I; w ‘ ' I ~. ...L ‘ - D . 1" q/. . ._-,‘.‘_.‘-‘ -. o ~ ~.‘ ... ‘~L ... u U-.. A... ‘" r- ‘ 7 nl ... a a .. . "“ -L o. ‘ a 3-. a... “; , \‘u_ ,' u . I . “-.C" ... ' h _ ' O ' O .. ..I " O a “5, v~. “\ . .1 v o .V . l \ , . - a "-1" Q- - A . .. _ « .. . ‘1, .v- . 4 . . '5. ~.,:-. ~.S ..~ I .' e4 5 I‘ D . bl, 4 K a . b". . u u a ‘s ‘ _‘ . a, .v ~ , .. ‘- . §~:.- . n McLendon [M-l} investigated the general problem Of extracting an arbitrary random process from additive white noise. Under certain approximations, computationally feasible algorithms were derived for the logarithm of the likelihood ratio. The assumption essential to the solution of the problem was that the joint densities of the Observation processes, when the signal is present, could be approx- imated by Gaussian densities (Pseudo Bayes approximation). The question of detection of Markov processes in a noisy background has been studied by several authors. Nifontov and Likharev [N-Z] considered the Optimal detection Of a Binary, quantized, Markov signal in the presence of noise similar to the Signal. SOSOlin [8-4] investigated the Optimal detection of Gauss- Markov noise with discrete-time observations. They both adopted the Bayes likelihood ratio criterion as an Optimum decision rule and Obtained recurrent relationships for the likelihood ratio. Kulman and Stratonovic [K-O] provided Optimal devices for detecting a random telegraph Signal in the presence Of white Gaussian noise. They first Obtained a non-linear stochastic differential equation for the optimum filtering and then tried to solve it for some Special cases and compared the results Of the probabilities Of errors with non-linear and linear filtering. The work in Chapter V of this thesis, related to their results, was done independently and deals With a particular model which yields more specific results. 1.2 THESIS OUTLINE AND CONTRIBUTIONS The emphasis of this research is on pattern recognition and learning theory. The first major contribution of the thesis appears ,0- .‘u .3 _\. . . in Chapter II which contains a formal development of optimal and adaptive decision making and learning, employing a new model which has not been heretofore studied. 
The contribution lies in the fact that, under some necessary and sufficient conditions, a finite set Of constant parameters can be found which uniquely defines the underlying model. Chapter II is devoted to the case of perfect Observation (noiseless case). The basic model and the decision problem are defined in Sec. 2.1, Appendix A and Appendix B. In the early sections of the chapter, the optimum and the optimum-adaptive decision rules are found and expressed in terms Of sufficient Statistics Of finite dimensions for two cases. A supervised learning scheme is employed to learn the paramters, and the exis- tence Of reproducing prior densities for them is exhibited. It is also shown that the computer memory needed to implement these rules is fixed. Finally, a computer simulation is developed and discussed for a Special case. The second major contribution appears in Chapter III in which the probability Of error is studied. Some Specific reSults are Obtained for the model considered in Chapter II. In the case considered, all quantities in the model are known and there are only two pattern classes. In Sec. 3.3, exact probability Of error expressions are derived for a particular case while in the follow- ing sections, upper and lower bounds and asymptotic expressions are obtained. When the number of Observations increases without bound, the bound on the probability of error is shown to approach zero. In the last section of the chapter, conditional error probabilities _‘— . . II . H . . . r\- .V. . v. .s. . . .du Of the first kind and the second kind are introduced from which the total probability of error can be computed. The expressions derived for the error probabilities are original and self-contained. Final contributions appear in Chapters IV and V in which optimal and adaptive decision-making and parameter-learning problems are investigated under an imperfect Observation mechanism. Inserting this condition into the model of Chapter II requires a new model in which the states Of the chains are described by random processes. The main assumption Of Chapter IV is that the transition times from one State tO another can be Observed. After discussing several Sampling schemes for selecting features, the Optimum decision rule is derived and its basic components are generated iteratively, assuming all the parameters Of the model are known. Analytic results are Obtained for two Special cases. In the second part Of Chapter IV, adaptive decision-making and learning are studied when the model is not completely specified. A theorem about the convergence of the Optimum-adaptive decision rule is provided. The storage problem in implementing the adaptive decision rule and supervised learning algorithm is discussed. In Chapter V, the model and decision problem are studied but the assumption of observability of the transition times is re- moved. A uniform sampling scheme (discrete-time Observations) is assumed. The Bayes likelihood algorithm is developed for the Optimum decision rule and the recurrent expressions for the likeli- hood ratio is derived. In the case of continuous Observations, non-linear stochastic differential equations are derived for the IOgarithm of the likelihood ratio and the conditional error R?*"A In . L .9- ...... 10 probabilities Of the first and second kinds. The results of Chapters IV and V are original and have not appeared in the literature and present one of the major contributions of the thesis. I v .. O Q r Ix ‘ .— ... . . .n. .v . § .- ... . a. .9. o. . A‘. o a. .3 T. L . . 
Q ark. -\_ a . a b. p h § 0 .u. . \ CHAPTER II DECISION MAKING AND LEARNING WITH OBSERVABLE STATES AND TRANSITION TIMES The basic problem considered here is that Of determining which of a finite set Of M continuous-time, discrete-state Markov process is active, based upon Observations Of sample functions, when the stationary transition probability matrices, [P€§)(t)]g ._ ; s 6 {1,2,...,M}, for the processes in question 13 1,j—l are unknown. The main reSults Of this chapter rely heavily upon the definitions and theorems related with continuous-time, discrete- state Markov processes (continuous-parameter Markov chains) which are given without proofs in Appendix A. A Bayesian strategy is employed throughout. The problem is formulated as a problem in Statistical decision theory. Prior distributions which lead to convenient computer implementations are chosen for the unknown parameters. The amount Of computer storage required is of prime importance. The decision problem is defined and a source model is chosen for generating Observations in Sec. 2.1. In Sec. 2.2, Observation and parameter Spaces are defined, and in Sec. 2.3, the Optimal decision rule is derived when the transition rate matrices «2- matrices) are known for every pattern class. In Sec. 2.4, adaptive decision rules are derived when the transition-rate matrices are not known. A Supervised learning scheme is employed tO learn the 11 II II .‘ F. . . v I .- r. v v. 0 ~. . up an .3 o .. . u. v. .0. .u it v. _\ ._ r. 5. a. J n.. .1 .« n) a . ..I» .. a... a. .. . «n v. a. .. on .o. . 0 av; .- ‘A .s ‘. ... . . .. . .. .. M . a. .. U ' .'« .. o . v. I .O Q n s . \. ..u b . .e a 0.. v . .4. . .~ v . .nq .v. s .rs ... ... .s n . . v. 1 u 12 unknown transition-rate matrices and the existence of reproducing prior densities is demonstrated in Sec. 2.5. Algorithms for learn- ing transition rates are introduced in Sec. 2.6, while the final form for the optimal Optimum-adaptive Bayes decision rule is obtained in Sec. 2.7. Section 2.8 is assigned to the computer implementation of a Specific problem. Finally, the main results of the chapter are summarized in Sec. 2.9. 2.1 SOURCE MODEL Before going into decision rules and learning, the model by which observations are generated must be chosen. The Object of all decision rules is to decide which of M continuous-parameter Markov chain is active; each continuous-parameter Markov chain characterizes a pattern class. The observable quantities, or the "features", are the sojourn times in the states and the state numbers themselves. The transition-rate, or Q, matrices defining the processes are the parameters Of the distributions governing the Observations and must be learned from the training data. The train- ing data consist Of sample functions from labelled sources and are used to form posterior densities for the parameters which, in turn, are employed in Bayes decision rules. The properties of the obser- vation processes, some important definitions and theorems about continuous-parameter Markov chains, and the necessary and sufficient conditions under which infinitesimal parameters can uniquely determine a process are given in Appendix A. A key assumption is that a single Markov chain is active during the entire Observation interval. A decision about the .r .A-e.’ . .ut... ~- .-..»- | I. . .— v D p v s n. . . u . A .K- o . g I! - I . u . . . ... w ... .i r. o. h . n I .. ~ . . I .. . . .0. . a C. . ... v a a a r: . e . . 0.1 . .N. I.- n .4. Q s .u a O . u u c . .. .1 t). . . _ .... 2 v . ... .1 a v . 
.5 9. a. .V. n . ‘s w .. .I a t t s s . L, A . . . q .4 o .RL. w. . «’4. o . av . . u AL. s . . .oasw .Opm ( Dr .44 . . ....u . u . . . .1 A ru .: .. v. a. a . .. .o .5. s- m .. a .nu A a = .F. Q 1 n 9‘ .. . 1' y s I. .Ith 13 identity Of that chain is to be made after Observing a sample function. Each sample function is assumed to be generated in a manner consistent with the assumptions and conditions explained in Appendix A. The generation process can be pictured as follows: Suppose a typical realization starts in any state, say 1. Then, a waiting time, having an exponential distribution with parameter qi > 0, determines the length Of time Spent in state i. At the end of this sojourn, the process jumps to State j with proba- q. bility -li' qi random duration, as determined by an exponential distribution with , j # i. The process stays in the new state for a parameter qj’ and then moves to another state, say k, with probability Elk, k # j. The sojourn time again has an exponential j distribution with parameter All possible realization of the qk' process can be generated by this procedure. A typical sample function constructed by the above procedure is illustrated in Fig. A.l. 2.2 DECISION RULES The Observation Space is defined first. Each random variable in the sequence xk é (x1,x2,...,xk) of state random variables takes on values in a finite Space A = {l,2,...,N}, N < m, and each sojourn time random variable in the sequence tk é {t1,t2,...,tk} takes on values on the positive real line. All random variables are defined on the Space Sk = {wt w E Ak X [0,m)k}. Thus, the sample space 8k is the 2k-dimensional Euclidean space Of all sequences k k m = ( H gi) x ( H n1), where gi 6 A, “i E [0,m), vi. In particular, i=1 i=1 the random variable xi, i s k, is defined as Xi(w) = gi and the . \ ...... n a . .. t .. . \ . . s. u. b u. . a k u . 1 . . . . . S . .... a. r . . t - - . . . . o v o v. . a .r. ... . . . o _ .I ~ . n v 8 Ohm .~ . F. H‘. . 5 19.. s . .2 = T L; .. . s . ,C b s p . v. I. . ~ S . . L. D. .. .. .: >. . J v- .x. u . s .. c . i. r. ~§ . e .3 . t .x. .. . 3 1. 2‘ .: a >., . ~ . . n .u. r. t e . . _ . .nl‘ sat VJ r. NL- ~v- ... r n c . .d ... .' ‘- v‘-.4 .. .' .- I .. ... . . an" -. .3..- ra H'V 9" L.|~.. 14 random variable ti, i s k, is defined as ti(m) = “1' The necessary and sufficient conditions and Theorem A.1 given in Appendix A show that {xk; k 2 l} is a discrete-state, discrete-time Markov chain with transition matrix [rij]§,j=1 defined as follows: = (2.2.1) Chung [0-2] calls such a Markov chain, the jump chain associated with the continuous-parameter Markov chain, [xt; O S t < w}. The sojourn times (t1,t2,...,) are conditionally inde- pendent random variables. That is, if P(-) is a probability measure defined on the sample Space Sk’ then P(c1 e A1,t2 e A2,...,tk e Ak\x1 = i,x2 = j,...,xk = L) = P(t1 e A1\x1 = i) ... P(tk e Ak|xk = L) (2.2.2) where A1 is suitable Borel set on R1, i = l,2,...,k. Hence, for each sequence of realizations xk = (x1,x2,...,xk), a conditional distribution for tk = (t1,t2,...,tk), given xk is defined. In this model, the conditional distribution can be described by the . . . k R Q conditional den51ty f(t ‘x ) - f(t1\x1)f(t2\x2)...f(tk‘xk) on k [O,m)k with reSpect to Lebesgue measure p . 
The mass function p(xk) é p(x1,x2,...,xk), which is a discrete density over Nk points in Rk with reSpect to counting tneasure vk is defined in terms of the initial distribution over 1N ij i,j=1' Tflien, the joint density for (xk,tk), deuoted by g(-), is well- the states and the transition probability matrix, [r ‘— 5. £21ch ,.' b 15 k k k defined in terms of f(t ‘x ) and p(x ), with respect to the pro- . k 2k duct of counting and Lebesgue measures v xu on R and can be expressed as k k g(x at) 8(X19°°°9Xk9 t1:°°°atk) f(t1,...,tk‘x1,...,xk)p(x1,...,xk) (2.2.3) The general elements of the decision making problem and derivation of the non-randomized, Optimal decision rules for any arbitrary loss function, L(.,.) are given in Appendix B. In the following two sections, optimal and adaptive decision rules are derived for the cases of known and unknown Q-matrices, assuming a Special loss function. 2.3 OPTIMAL DECISION RULES WHEN Q-MATRICES ARE KNOWN The first case to be considered assumes that the Q-matrix, (S) N . _ [qij Ji,j=l 18 known for every pattern class, 8 l,2,...,M. The special "0-1" loss function is chosen, defined by L(i,j) = 1 - A(i,j)1. * The optimal decision rule, d (.), follows from (B.5), Appendix B, and is given by the Bayes decision rule: k k ,t ) = s if s is the first index such that P(9 = s xk,tk) 2 P(e = L‘xk,tk), * d (x v L f s, {as E A (2.3.1) It follows from.Bayes rule that l A(i,j) denotes the Kronecker delta. 7' C . ...-new!- I! ' ‘\ ‘ x (.0... ' Y ,. U .. y I u- a .7 2 :- - .. ‘ 9- ~ v .. . a ' o .‘ ...‘K ,G‘ 6' l g u. -. .1: :.A_ - ~~- r' \_H\‘ W .‘ \ "-' c s ‘. ’. t-‘.. ‘0 .5- , t b- . '¥ _, r" - ’ I r. I..p 16 k xk,tk) = P(9=S)g(:>1§19=s) . (2.3.2) 80‘ at ) Defining gS (xk,tk) é g(xk,tk‘e = s), P: =P(e= s) and employing P(e = 8‘ (2.3.2), (2.3.1) becomes d*(xk,tk) = s if s is the first index such that o k k k k PS gs(x ,t ) 2 1);:ng ,t ) v L ¢ 5, 3,4, 6 A. (2.3.3) The density functions required in (2.3.3) are given by k k gs(x ,t ) — fS(t1,t2,...,tk x1,x2,...,xk)ps(x1,...,xk) V s E A (2.3.4) The first factor in (2.3.4) is now computed. From the con- ditional independence of (t1,t2,...,tk) and (A.14), k fs(tl,...,tk‘xl,...,xk) = .n fs(ti‘xi) 1=1 k k ‘ H q<3> exp - Z q(s)t. (2'3'5) . x . x. 1 1=1 i i=1 1 The second factor in (2.3.4) is found from (A.13). o k'1 qié)’ x1+1 pS(x1,x2,...,xk) = Ps(x1) .H 1 (2.3.6) i=1 q(s) xi Then, the joint density gs(xk,tk) is given from (2.3.5) and (2.3.6) as: k k k-l (1(3) x k , . 83(x ,tk) = H qis) exp — E q(s)t. Po(x ) H Xi 1+1 (2.3.7) . . x. 1 1 , i=1 1 i=1 1 i=1 (8) qx "a“:d . H ht '-‘.. O . ." - \ K. h \ . o ...,O ‘ n. “‘3 .V,... .7- \ $315 ... 5“V ...- 21): ‘ “I ‘ l . ) . - l‘. ‘h. ...r " i .. A . .F’ - ‘- / ).- l \ ‘4 ~.“ I" mm?" H“ A.» I. I I“ I ‘ .1-" . 'I. \I- .I 17 Equation (2.3.7) can be written in a more convenient way k as follows: Let Nij = Nij(x ) denote the number of one-step transitions from state i to state j in xk = (x1,...,xk). Let k Ki = Ki(x ) be the number of occupancies of state i in x , not counting x1, and let tij = tij(tk) be the waiting (sojourn) . . . .th t1mes in state 1, for the j- occupancy of the process. The total . . . k . sojourn time In state 1, 2i = Zi(t ) is then, = +t +...+t i Zi til 12 1K1 V 6A Then, (2.3.7) can be written as, (s) N.. k k N K. N N N (1'1 13 830‘ ,t ) = n(qi(s))1 exp -z: qis)zi P:(x1) n n :8) i=1 i=1 i=1 3‘1 qi ji‘i (2.3.8) Taking natural logorithms of (2.3.8) and noting that ij i k ‘iLIIMZ ...: 2 II N H "I :4 ‘IL H L4. L;- H. 
= K, +1 if x =1 (2.3.9) the optimal decision rule in (2.3.3) can be expressed in a simpler form as follows: k * k d (x ,t ) = s if s = L maximizes N N N 0 0 (L) (L) (L) Ln P (x )P - 2: q. z. + 2 EN an (2.3.10) I: L .1 L qxl] i=1 1 ,1 i=1j=1 ij 13 ji‘i 2.4 ADAPTIVE DECISION RULES WHEN Q-MATRICES ARE NOT KNOWN When the infinitesimal parameters, {q§:)}§ j’ s E {l,2,...,M}, 3 are not known, they are treated as parameters in density functions ..qu. C" .. ¥ 5.. :3‘31 vu- 3256.":6: vu- in: .t .- Le 18 describing the observations. To make effective decisions under such circumstances, information about these unknowns must be extracted from classified observations. The state numbers and the sojourn times of sample functions from each pattern class are observed and the infinitesimal parameters can be eliminated from the decision rule using supervised learning techniques. The train- ing data from pattern class 3 form a random vector ns ns A = ... ... of states numbers (yS ’Ts ) (ysl.ysz. ’ysnS’Tsl’TSZ’ .Tsns) and of sojourn times. The optimal decision rule in this case is * k k d (x ,t ) = s if s is the first index such that o k k “s “s o k k “L “4, PS gs(x .t lys .TS ) 2 PL gL(X .t ‘YL ’TL ) v L 5‘ 8. L.s e A (2.4-1) where k k n “S k k nS ns nS nS 85(X ,t \ySS,TS ) = INZ 88(X ,t ‘ys ,TS ,qs,rs)f(qs,rs\yS ’Ts )dqsdrS R (2.4.2) where 9 (S) (S) (s) . qs - (q1 .q2 ....,qN ), = (S) N rs {rij }i,j=1 (2.4.3) (s) N The term r, was defined in (2.2.1); 2 r(S) = 1, Vi E A. 1J ._1 ij J— j#i The supervised learning procedure is the same for all Pattern classes so the subscript and superscript s will be dropped from the following development. The first factor in the integrand 0f (2.4.2) can be written in a form similar to (2.3.8) as follows: 19 N K N N N N k k n n 1 O . 30‘ ’t ‘y ’T ’q’r) = n qi exp ' 2 qizi P (X1) 11 II (rij) U 3541 (2.4.4) k _ k _ k . where Nij - Nij(x ), Zi - Zi(t ) and K1 - Ki(t ) are defined as before. The second factor in the integrand of (2.4.2) is computed in Sec. 2.5. As a result, the required density has the form: k k n n o N Ki N 30‘ ’t ‘3' ’t ) = P (*1) $qu .13 qi exp ' E qizi 1—1 i—l N N Ni. n n x n n (r..) J f(q,r\y ,w >dq dr (2.4.5) ._ ._ ij 1—1 j—l j¢1 Equation (2.4.5) shows explicitly that the required density function is proportional to a joint moment of the posterior distribu- tion. As shown in the next section, the computer storage is limited whenever a natural conjugate family of distribution is used for each (qs.rs)- 2.5 SUPERVISED LEARNING The object of this section is to form the posterior density n n function for (qs,rs) from the training samples (ySS,T S). The 3 learning is supervised because the continuous-parameter Markov chain that produces each set of training data is labelled. The s subscript and superscript will be dropped for convenience. The necessary and sufficient condition for the existence of a reproducing prior distribution for (q,r) is that a sufficient statistic of finite dimension exists for (q,r) (Theorem 2, Spragins [S-6]). To demonstrate this finite-dimensioned sufficient Statistic, the likelihood function g(yn,Tn‘q,r) can be written 20 from (2.3.8) as: N k. n n 1 8(y :T ‘Q:r) = H (11 exp - i=1 i o N N nij q-z. P (y ) 11 n (r. ) (2.5.1) 1 1 1 1 i=1 j=l ‘3 ja‘i IIMZ where nij = nij(yn); zi = 21(Tn), ki = kiCYn)- Equation (2.5.1) shows that g(yn,7n|q,r) belongs to an exponential family of distributions. Thus, (q,r) has a sufficient statistic of finite dimension. 
One such statistic is denoted by T(yn,Tn) and can be easily determined by applying the factoriza- tion theorem to (2.5.1). N N N T(yn.Tn) = nij(yn) . 21(Tn) . ki(yn) i,j=1 i=1 1= j-‘fi (2.5.2) Thus, a reproducing prior density exists for the para- meter (q,r). Any reproducing prior density can be written in the following form: 8(Y_ :°-°ay :T_ 900°3T ‘Q:r)¢(Qar) f(q,r|yO,To) = m 0 m 0 (2.5.3) 8(Y_m9"°’y097_ ’0003T0‘Qar)¢(Q:r)dq dr qxr m where (yo’To) = (y_m,...,yo,m_m,...,To) are fictitious observations and ¢(q,r) is an arbitrary positive function of (q,r) except that the denominator in (2.5.3) must exist. In particular, setting ¢(q,r) E 1, the following reproducing prior density is obtained for (q,r). This density is the product of N gamma censities and a matrix beta density. ' vi+l v. -w,q, ' o N w, q 1e 1 1 N N p, -1 i=1 (V1 ) i=1 j=1 N 13 ._ .J ja‘i f. —. .. .tA .H. .p. .» as I... l ... . . . y . (V ’r I. .» ... .. . .... Qt b e at . . .v. .a I». m. .3 vi ‘ ../ LIV ..- . b ‘ Ir.“ e. 3. .. . C 4.. . :- n‘J .p. L L. . .. o t . W.— .. ... .a. .15 A: ..u. .U A ..nl. fl ult d 21 where F(D.) do A 1 A N O B (Lien) = 9 fl. = 2 [1“. (2.5.5) N 1 N o 1 ,=1 13 n F(u..) J i=1 ‘3 j¢i The parameters of this distribution are the matrix n = [no ]N where ”O = O and the vectors v = (v v ... v ) 0 ij i,j=1 ii ’ - 1’ 2’ ’ N j#i and w_= (w1,w2,...,wN). The posterior density for (q,r), based on all training samples (yn,Tn) = (y1,...,yn,71,...,mn) is found to be: g(yn.Tn\q.r)f(q.r\yo.¢o) f(q,r‘y“,.“) = f g£(q.r\yo.¢o)dq dr qxr V +1 V, 4W,q, N wi1 qil e 1 1 [ N N A mij-l = H ' H H B (m.)r.. i]. 1:]. F(Vi + 1) i=1 j=1 N 1 1.] j#i (2.5.6) where V 9 v, + k,, ‘W = w + z , m e p? + n 1 1 1j ij 1j F(Mi) N (2.5.7) 91m” -Bé M92 ijJi,j=l’ i N ’ i ,=1 mij II NHL.) 3 i=1 13 The posterior density has the same mathematical form as the prior density for (yn,7n). The only difference between (2.5.4) and (2.5.6) is in the parameters of the two densities, which are re- lated by (2.5.7). The convergence question in supervised learning deals with conditions under which the joint prior densities of the parameters, ‘ «‘n' '- v‘ t‘ 1.. ...-fi ' -. I," — ua ...... a L up P P -- _- -- .- U . 1.. - D .. . . . g. .. . ..:" ° 1“. .. § . ~. .-j .. .. b I 9.. .. w P . -.._' b ... ‘\ .b ..P '- -. . I ,. m 1‘: .L. H _S s ~A 0 for all z in a Sphere containing 90. C. The posterior density fe(.‘yn) is defined as before. D. A consistent sequence of estimators t1 = t1(y1), t2 = t2(y1,y2),...,tk = tk(y1,...,yk) exists that converges to 90 Wpl. Then, fe(z‘yn)-E::>|6(z - 90) Wpl, where 6(.) is an impulse function having the same dimension as 90. For the problem considered above, it is easily seen that the first three conditions are satisfied for g(q,r|yn,Tn). To exhibit the existence of a sequence of consistent estimators for the unknown parameters, (q,r), the following well known result [0-3] is used: If a sufficient statistic of an unknown parameter exists, then any'naximum.1ikelihood estimator will be a function of this sufficient statistic. Bartlett [B-l] first showed that the maximm likelihood estimators, ei , obtained by Eij = nij/ni’ j N where n1 3 E nij form the consistent sequence of estimation i=1 for the parameters rij in (2.5.1). It can be also shown that there exists a set of consistent maximum likelihood estimators for the parameters q. in (2.5.1), iven in terms of the sufficient 1 8 Statistics ki and 21 as a, = (ki +-l)/zi, i = l,2,...,N. 
23 Thus, the last condition of Theorem 2.5.1 is satisfied for n n n 8(q,r\yn,Tn) and g(q,r‘y ,T )-¥:£9’6(r - ro,q - qo) w.p.l. N 2.6 LEARNING {qijh’j=1 ja‘i N J}i.j=1 N ja‘i i.J'=1’ ji‘i '8, given the N S fa 1 th a a t a d o r, on y e p r me ers {qi}i=1 n {ri have been learned. It is also possible to learn {qij} or to determine the posterior densities for the qij . . n n . training samples, (y ,T ), and to prove that these posterior densities converge to 6(qij - q? ) w.p.l., as n tends to infinity. 3 0 . . n n Here, qij is the true value of qij' To find ¢(qij‘y ,T ), the posterior density for qij’ let q.. = q.r.. i,j 6 A(j # i) and (2.6.1) .(yk). (2.6.2) define N, Q . 1 1 ij i ~‘l|~ll["12 H D J j The joint posterior density function for (qi,ri), given (yn,Tn), can be obtained easily by integrating (2.5.6). The required density has the form: V.+l V. -q W. l. 1 l 1 W. q. e n n _ 1 1 f(ql’rlj‘y 9T ) - 1n(vi + 1) I‘(Ni - nij)I‘(nij) ij ij 0.5(qij - qij) w.p.l., is shown in Appendix C. 2.7 OPTIMUM-ADAPTIVE DECISION RULE In the previous section, joint posterior density functions for the parameters of each pattern class were obtained and used to eliminate the unknown infinitesimal parameters, {qij}, from the decision rule. The decision rule obtained in this manner is an adaptive decision rule which is optimal for the prior information and cost function given. Furthermore, the decision rule adapts or converges to what the optimal rule would be if the true para- meter values were known. Thus, in summary, the optimum-adaptive Bayes decision rule is given by: * d (xk,tk) = s if s is the first index such that n I"! n n o k k s s o k k L L Pg(X.t\y .T _)=max {Ps(X.t\y .T } S S S S Le[l,2,...,M} LL " L (2.7.1) where k k “L “L - k k “L TIL “L “L 81(x .t \YL ’TL ) - qur31(x ,t ‘YL ,TL ,q,r)f(q,r‘yL ’TL )dq dr \IL (2.722) The first and second factors in the integrand above are given in (2.4.4) and (2.5.6), respectively. After some algebra, (L) v+1 (L) 1 MW.) (L) gL(x ’t WW.“ '7 n (L) ' i (L)i 1‘1 (L) vi ”‘1'” 1"("1 +1) (. .) 1 i N jgl (Nij +mg’) - 1)....(mf3J) o N j¢i (2 7 3) x P (x ) n ' ' L 1 i=1 (Ni +miu.) _ 1hump») 26 where w(L) A (L) + (TQL , v(L) g (L) + k ( 9L) (L) g ( 9L) + o(L) i ‘ W1 zi L )’ 1 vi i 12 ’ mij “ij yL “1) 21 Z (t ), K Ki(x )a Nlj N1J(X )3 N N EL) A )3 mg), N1 A 8 Ni' V L j=1 J j=1 J j¢i j¢i 2.8 COMPUTER IMPIEMENTATION The optimum-adaptive Bayes decision rule obtained in (2.7.1) and (2.7.3) would be implemented in an iterative fashion in an actual application. Such an implementation and a simUlation are given in this section. The storage requirements and execution time required to simulate the decision rule are discussed. Equation (2.7.3) can be written iteratively as follows: ViL)+Kx (Xk'l)+1 k-l k k k k n n (L) k [wiif—mxka )] g (x¢\yLnL)=V +K Q) (L) {I L xk Xk ViL)+Kx (xk)+1 [wi:)+£xk(tk)] k k a) k'1 +N (X ) x ka-l’xk xk-l’xk - g (xk-1 tk-lh’nL T114”) (2 8 1) méL) *‘Nx (xk-l L ’ L ’ L k-l k-l VS.)+1 ((50 1 1 1 “L InL _ (L) 1 80! ,t ”L ’TL ) — (Vx1 + 1) Va)“ +1 P (X1) X "X w“)+z 1 1 x1 X1 27 In (2.8.1), Rx (xk) is the number of occupancies of state k k xk in x , Zx (t ) is the total sojourn time in state x k k-l _ . . N (X ) is the number of one-step tranSitions from state x x k-l’ k k-l k-l x to state x in x and N (X ) is the number of k-l k xk_1 . . . k-l . . . . one-step tranSitions in x whose initial state is . k k in x , xk_1. An algorithm for computing (2.8.1) is shown in Fig. D.l of Appendix D. 
The optimum-adaptive decision rule implemented in this fashion is a fixed memory rule. No matter how many learning samples are employed and no matter how large the size of the vector to be classified, only the parameters for the posterior densities need be stored. Including the storage requirements for prior informa- tion, the total number of words of storage needed is equal to M(N2 + 2N + 2). A computer simulation was made to illustrate the performance of the optimum and optimum-adaptive decision rules. All computer simulations discussed below were performed on the CDC 6500 digital computer at Michigan State University. The Specific case considered is the following: 1. There are two pattern classes denoted by ml and wz; Pi = P; = 1/2. 2. The continuous-parameter Markov chains which produce the samples are assumed to have 3 states for both pattern classes. 3. The Q-matrices used to produce all observations are listed below. Q1 = r 0.60 0.40 L036 0.12 1.00 0.84 28 q 0.48 0.60 1.20 .4 Q2 = P 0.50 0.60 Lo.7o 0.15 1.20 0.30 The initial state probability distributions for the bility distributions of the correSponding chains. were computed by (A.ll) and are given as follows: H: LAND-d 0 Pi("1 1) 0.387 0.306 0.307 O 2(x1 P _ i) 0.388 0.224 0.388 The following parameter matrices were used for the l 0.35 0.60 1.00 ‘ chains were chosen to be equal to the stationary proba- They matrix-beta prior density, which was used for the para- meters of the jump chain. 0 ”1 o “2 = (o 5 7 b 6. The following parameters for the prior the sojourn times were used. F" = 1.0 h- 1.0 1.0 3 - V —2 densities of F- - 15 10 , b12 0‘ In the first part of the sinulation, the average error Q-matrices were not known. curve was obtained using the decision rule in (2.3.10). F661 = 10.0 12.0 In the Second part, the average curves were obtained for the case when the Two cases of supervised learning were 29 employed, all applying (2.8.1) and differing in the number of training samples provided per pattern class. Training samples of size 50 and 100 were used for each pattern class. Each error curve provides a Monte-Carlo estimate of the probability of error as a function of the size of the observations. In each situation studied, 10 such error curves were obtained along with an average error curve. Using the same notation as tD-4], 100Li(k) is defined as the number of wrong decisions in 100 classifications of a sequence k k (x :t ) for the iEh-run of Fig. D.1. The kgh'point on the average curve is defined by 10 - =}_ E L-(k) L(k) 10 i=1 1 Thus, 1003(k) is the percent of wrong decisions in 1000 independent classifications of k states. Average error curves for several situations are shown in Fig. 2.8.1. As mentioned in [D-4], the quality and amount of prior information are critical factors in determining the rate at which the error converges to that for known parameters. Equation (2.7.3) shows that the posterior density function for (xk,tk), given the training samples, depends on the parameters VEL), W(L) and mgL) i ij (L) __ (L) (L) _ (L) (L) _ 0(L) where Vi - vi +-ki, Wi — wi + zi’ mij — n11 +-nij. The prior information as presented by VEL), WEL) and pi§L), can thus be made to either overwhelm the initial training samples or to be overwhelmed by them so the magnitudes of va), wa) and uZ§L) are a measure of the amount of prior information being inserted in the decision rule. 
If these parameters are properly selected, the training data will reinforce the prior information 3O A Estimated Error Prdbability 0.3 -r 0.2 7 <::f-' Optimum, 50 learning samples “—r Optimum, 100 learning samples 0.1 _. Known Q-matrices . “— 4 > 50 100 Length of Observation Sequence Fig. 2.8.1: Average Error Curves 31 and the amount of training data used is not critical. 2.9 CONCLUSIONS This chapter has dealt mainly with decision making and with learning the unknown parameters in finite-state, continuous-para- meter Markov system.with observable states and transition times. The model by which observations are generated was defined in Sec. 2.1 and the properties of the observation process were given in Sec. 2.2. Assuming the Q-nmtrices defining the chains were known, an optimal decision rule was defined in Sec. 2.3 to be that rule in a given class of rules which minimizes the Bayes Risk. 08.2). The optimum-adaptive decision rule (2.4.1) was derived in Sec. 2.4, in case of unknown.Q-matrices while the supervised learn- ing scheme for learning them'was employed in Sec. 2.5. The existence of a reproducing prior distribution for (q,r) was exhibited. It was also shown that, under the stated conditions, the parameter posterior density (2.5.6) converges to a delta function centered at the true value of the unknown parameter. The posterior densities for the infinitesimal parameters given the training samples were obtained in Sec. 2.6. The final analytical form of the optimum-adaptive decision rule was given in Sec. 2.7 and it was expressed in an iterative form in Sec. 2.8. Finally, a computer simulation was performed for a specific case to obtain the probability of error curves in both known and unknown parameter cases. CHAPTER III PROBABILITY OF ERROR The quality of a decision rule is characterized by the total probability of error. Unfortunately, for general decision- making problems, exact analytic solutions for the probability of error are impossible. Even if one could find such solutions, they would be tremendously complex. For this reason, simple lower and upper bounds, or asymptotic errors or iterative approximations are more useful than exact error probabilities. In Sec. 3.1, exact probability of error expressions are derived for the two-pattern class case where both classes are de- scribed by 2-state, continuous-parameter Markov chains with known Q-matrices. Lower and upper bounds are given in Sec. 3.2. The limit cases are also studied as the number of observations tends to infinity. Asymptotic probability of error formulas are derived for the two-pattern class problem with N-state Markov chains having known Q-natrices in Sec. 3.3. The probability of error is shown to converge to zero w.p.1 as the number of observations tends to infinity. In Sec. 3.4, the conditional probability of error notion is introduced and iterative expressions are established for them. Finally, Sec. 3.5 summarizes the main results of the chapter. 3.1 EXACT PROBABILITY OF ERROR FOR N = 2 AND Q-MATRICES Let Q1. and Q2 be known Q-matrices characterizing pattern classes, ml and m2, reSpectively. The following notation will 32 “vi-133a}. 1.. .111 33 be used: Q1 = ; Q2 = 9 q19q2’p19p2 > 0 N The density functions, (2.3.8), are evaluated below for the observa- k tion sequence (xk,t ) = (x1,...,xk,t1,...,tk). Since the states must alternate, and assuming k is an even integer, Ki 8 k/Z and rij = l V i,j E {1,2}, (j # i). Then 2 -Z.q. k k k/2 31(x ,t ) = P:(X1) U qi e l 1 (3.1.1) i=1 2 -Z.p. 
k k k/Z 32(x .t ) = 9361) n pi e 1 1 (3.1.2) i=1 Hence, the likelihood ratio, A(-), is defined as k k 0 g (x .t) P (X) 2 p. k/2 -z,(p,-q,) 2 _ 2 l (_19 e i 1 1 (3.1.3) A(Z ,2 ,X ) -—-——- - -———- 1 2 1 g1(xk,tk) 230(1) i=1 qi Using the "0-1" loss function, the optimal decision rule is given as * k d(xk,t)=u)1 if L<'n = (1:2 if L > n (3°1°“) where o 2/k o P q q P (x ) n QLn -% +Ln plpz; L3 ?an g 1 +12: )3 (qi - pi)Zi (3.1.5) P2 1 2 P1(x1) i=1 The total probability of error, Pe’ is then given as Fe = P(L > Tflwl)P(£ + P(L < szwg (3.1.6) 34 where pa. > (no.1) = m > mxl -- 1.691930) + P(L > n\x1 = 2,m1)Pi(2) (3.1.7) P(L < n|m2) = P(L < n\x1 = l,w2)P;(l) + P(L < n\x1 - 2,w2)P:(2) (3.1.8) Using (3.1.5), (3.1.7) and (3.1.8), it follows that O O P(L > n\wl) = P(Z > n1\w1)P1(l) + P(Z > n2\w1)p1(2) (3.1.9) 0 P(L< filwz) = P(Z < n1\w2)P2(1) + P(Z < n2‘w2)Pg(2) (3°1-10) where 2 2 ) P1P1(i) Z/k q1‘12 1 2 Z=-£(q.-p.Z.; n.=Ln --—,i=, k i=1 1 1 1 1 P3P;(i) p192 (3.1.11) The density functions for Z1 and 22 are now obtained k/2 k/2 Z1 = .2 t2i-1; Z2 = z t2i if x1 = 1 (3.1.12) i=1 i=1 k/2 k/2 Z1 = .E t21 ; Z2 = ._ t2i-1 if x1 = 2 (3.1.13) 1—1 1—1 There are k/2 terms in each sum in the above expressions and also ti ~ Exponential (qi) when wl active ti ~ Exponential (pi) when w active 2 Thus, for both values of XI, Zi is the Sum of k/2 i.i.d. random variables, all having the exponential distribution with parameter (11 or pi (i = 1,2). Thus, Zi ~ Gamma (k/2,qi) under ml Zi ~ Gamma (k/2,pi) under wz 35 Explicit probability of error expressions are now derived for several cases. 92. - . 9.2 . Let g1 - k(q1 p1), g2 - R(qz-p2)) g1 > 0, g2 > 0' Then, 2, defined in (3.1.11) can be written as z = 1:121 + 5222 and, €121 ~ Gamma (k/2,i1); €222 Gamma (k/2,xz) under ml 5121 ~ Gamma (k/2,p.1); €222 Gamma (It/2m?) under (02 - -——¥———— 1 = 1,2. (3.1.14) The density function for 2 can be found by convolving the densities and expanding the integrand with a power series. As a result, the densities for Z under m1 and under wz are given as a n -322 fZ(Z\w1) = 2‘. c112 e z > 0 (3.1.15) i=0 m n -p22 fz(z\w2) = .2 C122 e z > 0 (3.1.16) i=0 wh A - + k _ 1. g F(k/2 + 1) (1 x )k/Z (12 ' 1‘1)1 , ere “ 1 ’ C1i F(k+i)F(k/2) 1 2 1! ’ i g F(k/2 +11) ( )k/Z (”2 ‘ “1) C12 I"(k+i)I‘(k/2) “1112 1! Using (3.1.6), (3.1.9), (3.1.10), (3.1.15) and (3.1.16), the total probability of error can be evaluated for several sub-cases. The following probabilities in (3.1.9) and (3.1.10) are first computed. With a modest amount of effort, it can be shown that ‘hh n mnpj P(Z > n11m1) = z 611 e 2 “‘3T"‘ (3.1.17) 1=o j=0 m -81 ncnifl P(Z > nz‘ml) = z 611 e 2 2 z -—%T;—— (3.1.18) 1=o j= j w -n 1 n (n i ) P(Z < n1\m2) = 2 b,2 [1 - e 2 2 —l!—2——] (3.1.19) 1: 1 J=0 ‘1 ( n ‘ ) m [i -n,i2 n (“212)j] (3 1 20) P Z < w = 2 b. - e 2 -fi--—- . . 2 2 1= 12 j=0 J! "D C ng-i-k) . b grg1+k2 11 11 i+k ’ ° “ 1‘2 ”2 where b Then, in terms of the above expressions, the total proba- bility of error becomes . -ni n(nifl -ni n<8ifl P = PO[ 2 611(P°(1)e 1 2 z; —l-2—— + P:(2)e 2 2 z —-2—-2-—)] i=0 = e 1 l j 0 j! 3:0 3! w n» n mu)j 'Butlmu)j o o 1 2 l 2 o 2 2 2 2 + P 2 b. P (1) (1 - e 2 ——) + P (2)(1 - e Z _____)] (3.1.21) CASE IB. “1 < 0, “2 > O P(Z > n1\m1) = 1 ; 8(2 < n1|mz) = o The remaining probabilities are computed as above. Using these probabilities in (3.1.6), the total probability of error is: an 'In A n (A Tl )1 _ o o o 2 2 2 2 P8 — P1[1gobfl(?1(l) + p1(2) e 320 j! )] °° Tm n (Bu )1 + P;[ E b12P(2)(2) (1 - e 2 2 )3 -—-§-!—2—):l (3.1.22) i=0 j=0 37 CASEIC. 
n1>o,n2 n1\w1) = 1 ; P(Z > nz\w1) P(Z < nl‘mz) = 0 ; P(Z < n2|m2) = 0 Total probability of error is, Pe = P1 Df' é3( - ). =2( -) 3124 e lne g1 k ql pl , g2 k ‘12 p2 ( ' ' ) Then, the logarithm of likelihood ratio in (3.1.5) becomes 2 P3011) L = ELn O + glzl - |g2|22 (3.1.25) P1911) The total probability of error is calculated in terms of the formulas, (3.1.6), (3.1.9) and (3.1.10), where 2/k . D192 00, qq P2P2(1) 1 2 0 O . Q A P1P1(1) . z - 5121 "1521221111 = Ln -— , 1 = 1,2. (3.1.26) Here, €121 ~ Gamma (k/2,x1) ; lgzjzz Gamma (k/2,x2) under ml 5121 .. Gamma (k/2,p,1) ; 152122 Gamma (k/2,p.2) under where X. ' = 1,2,. (3.1.27) 38 The density functions for 2, under m1 and under wz, can be determined as in Case I and are given by: m n . . -Z(k +1 ) fZ(z\w1) = Z 2 CE?) z1+J e 1 2 z > 0 i=0 j=0 J m . z) = z c§1) 21 e 2 z < 0 (3.1.28) i=0 1 m n -z(n +u 2 fzczlm,)= z 2 f) 1 2 z>o -0 j=0 J m 2% = z cfz) 2 z < 0 (3.1.29) 1=o where m 2 g - l ; n Q k - i - 2 k/2 C(1)3 H”) rgn - H1) (1112) , . 2 3 1 (1N+)2 r (k/2) k/2 = (-1) ( ) n-l 2 1 (9141.2) r (k/2) (1 +1 )1 < + )j (1) A (1) 1 2 (2) 1 (2) ”1 LL2 = C. ° _.———— C.. - C. —'——:'—_- C1] 1 J! ij i j! The total probability of error will be computed for the fOllowing'sub-cases. CASE IIA. n1 > 0, n2 > O 'The following probabilities are first computed as in Case I. m n 'n (A +1 ) b(l) l l 2 P(Z > n w ) = Z Z “r (3-1-30) 1‘ l i-O j-O r-O bijr le m n -B (1 +1 ) P(Z > n2|w1) = z z zbfji file 2 1 2 (3.1.31) i=0 j=0 r=O ._ (2) m n L (2) r “1‘21“”2) P(Z < n w) b 2 2 2b (3.1.32) 1‘ 2 i=0 j=0 r=O 13‘ 1 m n L -n (u +u ) P(z < n2\w2) - b(2) - z z z b(2) n: 2 1 2 (3.1.33) m> NHx2 where m b<1) g c<1) m. + 1) “1%) bijr ij (11+22)L+1 ' r! ; r b(2) A (2) m. + 1) , (““1 + ”2) . bijr — Cij (ulfip2)L+l r! ’ g m (2) P(l + 1 n (2) P(Lrfi-l) b2 ‘ 12.01} 1) iii—1+ E Ocij L+1 ° Substituting the above expressions in (3.1.9) and (3.1.10) and using them in (3.1.6), the total probability of error is obtained. _ o (1) o r ““1(*1+*2) o r '“2(*1+*2) Pe ' P1 . 2 [Eijr P1””1 e +P1(2)“2 e 19j2r o (1) o r ““1(“1+”2) o r '“2(”1+”2) + P2 b2 - . E [},,r P2(1)n1 e + P2(2)n2e ] 13J2r (3.1.34) Using the same procedures as in Case IIA, the total proba- bility of error expressions are derived for the other sub-cases. CASE 113. nl < 0, n2 > 0 Mn 1-n (1 +1 _ o 012 (l)r 2 1 2 Pe - P1E31P1(l)- isz°1(1)b:: H(-n1)e + §r1P011'(2)b31-en2 j n u -n (u *uz ) + P0 [ 2 P °(1)b(2 r)(-n1)re 1 2 + bZP:(2) - 2 P °(2)b(§)n;e 2 12] i,r i,j r (3.1.35) 40 CASE 110. nl > O , n2 < O -T\ (1 +1) 1 1 P = P0 2: P°(1)b(1)11e1 1 2 +b P°(2) - 2: P °(2)b(1)(- 412 )2e 2 2 e 1 ”j ijr 1e 1 1 i,r 4] (M +1») I.“ u + P‘2’[b2P‘2’(1) - 2 P 3(l)b,(2,)'n1e 1 1 2 + 2 P2 0(2)b,(2)( (412).: 2 2] i,jar i,r (3.1.36) CASE IID. n1 < 0 , “2 < 0 122 - o (1) o r n o 2 r n22‘2 pe _ p1 b1 - izrbir [P1(l)(-fl1) e + Plan-112) e :1 i,r H 1» ll 11 + P; . 2: b(2)[: P;(1)('Tl1)r e 1 2 + 123600-712)r e 2 2] (3-1-37) Kr where bf: ) A <-1> a”) $52—12 73 ; b3.) 2 <-1> cf” 4411? 2 “2 m n b 1 Z[1.11)<1>.2111..,:<1>_2_+.11] 1 , _ ij 1+1 1=0 12+ j- -0 (11+12) 3.2 AN UPPER BOUND ON THE PROBABILITY OF ERROR The exact probability of error formulas derived in Sec. 3.1 are very complicated and the asymptotic behavior is difficult to ascertain. A simple upper bound on the probability of error is needed and will be derived from the following theorem [K-l], for the case considered in Sec. 3.1. THEOREM 3.2.1. 
Let P: and P; be the prior probabilities for the pattern classes, ml and wz, respectively, and let gi(-) be the density function for the sequence of state and sojourn time observations, (xk,tk) = (x1,...,xk,t1,...,tk) when pattern class 41 mi is active, 1 = 1,2. Then, Pe is bounded by the Bhattacharyya distance as follows: %min(P‘iP;)p2 s Pe s (PiP‘z’fi 9 (3.2.1) where p is Bhattacharyya distance and is defined by k P 21* $103.1: )sz(xk.tk) dxkdtk (3.2.2) ARXEO 3”)k To apply this bound to the problem being considered, 31(-) and g2(-) must first be determined. Since the 2-state continuous- parameter Markov chain is being considered, the values of (x2,...,xk) are known w.p.1 as soon as x1 is known. Substituting r 1 i = l,2,...,k-1 in (2.3.7), the desired densities xi’x i+1 are determined and are given as: 31(xk.tk) = (QIq2)k/2[:£I: emqltn"1 {(12:21 as<1-:c2i_1>a<2-x211 ]P‘1’<1> + (q1q2)k/2[::§: e-qthZi-1 eqltsib(1-x21)6(2-x21_1)] Pq(2) (3.2.3) 82(Xk,tk) = (p1p2)k/2[:§: e-pltZi'l e-p2t816(1-x21-1)6(2-x21)] P32) + (p1p2)k/2[:1: e plt“ e-p2t2i'15(1-x21)6(2-x21_1) P;(2)] (3.2.4) Here, 6(-) is the impulse function. Then, kl!» ”2 °° '(qulnzi 1 '(qzz 2)t2i = o o - d . P (qlqulpz) P1(1)P2(1) II I e dt21-1 x {g e t21 rd 0 9 +9 Ca +91 0 0 kn” '( 222)t2i 1 ° " 12 ”21 + - P1(2)P2(2)121 e dt21_1£ e «1121 (3.2.5) 42 Performing the integrals in the above, the result follows. k/Z _ k/4 o 0 4 p - (qlqulpz) P1(1)P2(1)[ ((114131ansz 0 o [ 4 k/Z + P1(1)P2(2) (92"?2) ((114131) ] (3.2.6) Then, the lower and upper bounds for probability of error are given in terms of p, Pi, P0 by (3-2-1)- 2 o O 1/2 Since (Ple) s% P° 1, P; 6 [0,1], each term in the upper bound approaches zero as k tends to infinity so P; approaches zero. 3.3 PROBABILITY OF ERROR FOR LARGE SAMPUE SIZES In this section, asymptotic probability of error expressions are derived in the case of large sample sizes. In Sec. 3.3.1, the 2-pattern class problem with 2-state continuous-time Markov chains is investigated when both Q-matrices are known while results are generalized to N-state case in Sec. 3.3.2. 3.3.1 N = 2, M = 2, AND KNOWN Q-MATRICES Let the Q-uatrices for the two-pattern classes be given as in Sec. 3.1. The formulas necessary to calculate the total proba- bility of error are given in (3.1.6), (3.1.9) and (3.1.10). For large k, Z and Z defined in (3.1.10) are asymptotically 1 2 normally distributed. The means mi and variances oi, i I 1,2 are defined for ml and w2 as follows: m, = -E- , o? 8 -E-' under w , i = 1,2 1 2q 1 2 l i 2q1 k 2 ’ I: mi = EST , Oi = .55. under wz, 1 1,2 43 Then, 2, defined in (3.1.10) is asymptotically normal with mean mm and variance 0: defined by i i c11-P1 <12-P2 2 2 P(QI'P1)2 ((12-132)? mb = q +' q ; o¢ =-E +. 2 under ml 1 1 2 1 _ q1 (‘2 . r 2 21 q -P q -P (q -P ) (q -P ) 1-1 - 1 1 2 2 . 2 _ 1 ..1._1_ q” — p +- p , gm — k ,2 4--—3L7?L—- under wz 2 1 2 2 P P - 1 2 .. Thus, the total probability of error for large k is I given by the following integral formula: * o O Q * o m * = + Pe P1[P1(l) 1; fz(z‘wl)dz P1(2) 1; f (z\m1)dz] 1 2 0 O 2 * 0 P2 * 1 + P2 [:Pz(l) l: fz (z‘w1)dz + P2(2) .m fz(z‘u)2)dz] (3.3. ) * where “1 and n2 are as defined in (3.1.10) and fz(z‘wi), i = 1,2, is a limiting density function of the random‘variable 2. Performing the integrations in (3.3.1), the asymptotic proba- * bility of error, Pe is obtained. 
nl-m flz-m * _ o u)1 0 ml Pe - P1 PC;(1)[1 - P(fk 01 )] + P1(2)[1 - 2(fk 01 )] o o 111-me o “Z-mmz + P2[:P2(2)<§(fk °2 ) + P2(2)<§(/k 02 )] - (3.3.2) 2 2" 2 2 (q -P) (q- ) (q -P) (q -P) where 0292 1 1 + 2p2 3 2“ —i-L-+-—Z'—2— 3 . 1 2 2 2 2 2 q1 q2 p1 p2 x 2 1009-11—1- e-t ’2 dt . /2n 44 * To show that Pe decreases toward zero as k increases without bound, let ql > p1, q2 > p2. Then, it is clear from (3.1.9) that m9 lim’n1= 1min2 =an1:2>o k—«oo k—cao P1 2 A little algebra shows that m < < m 0"1 TL” “’2 “1..me “2..me Thus, lim Q(,/k ) = lim 9 A = 1 k—ooo 0]. k—noo 0’1 nz'mbz n2-mb2 limQQ/‘k )=lim§(,/k———)=O. k...» km 0 So, from (3.3.1), it follows that * P-+0 as k—oco. e 3.3.2 N > 2, M = 2 AND KNOWN Q-MATRICES In this section, the asymptotic probability of error is derived for the N-state case, (N > 2), assuming that there are two pattern classes and that the Q-matrices correSponding to the pattern classes are given. Let the Q-natrices and the density functions for the observa- k k tion sequence (x ,t ) be given as N N x, N N q ij Q =[ exP{-(pi-qi)zi}] P° N N p.. q."13 x II n {—1-}- -1-) . (3.3.5) P°1(x 1) i=1 j= -1 pi qij ji‘i Using the "0-1" loss function, the optimal decision rule is given as in (3.1.4), where A P° 2/k n = Ln(-;) and (3-3-6) P 2 L g imtn A N pi N N N p Pp(x ) =‘11;[2K.Ln(—)+ 2(q1-Pi)2 + 2 NijLn(—:‘,l°q—q—) +;Lno 2 1 ] i=1 1 i=1 i=1 j-1 pi ij P° 1(x1) jii N For large k, Ki 5 Z N. , then, =1 ij J 1 N N pi. 1 N 1 #23011) L = '1: Z 2 Ni' Ln “—1 + '1: 2 Zi(qi-pi) + Eta o . (3.3.7) i=1 j=1 J qij i=1 1110(1) ja‘i N N Since k is fixed and k = 2 z Nij’ the Nij's are i-l j-l js‘i stochastically dependent random variables. Bartlett [8-1] proved that the asymptotic distribution of Nij's are normal with expected values mij = kPirij, where 46 g 11.1 . d rij qi (j # 1) un er wl 3.3.8 é'Eli (' # i) nder ( ) .. p J u (”2 i and the Pi's are the stationary probabilities of the corresponding jump chain. Using Bartlett's method, the covariance matrix V = KVO, where V = {cov(nu ,n 1, can be determined by the expansion N O r tq)}u,r,t,q= of Ln ”1(2) up to second degree in 913’ where p1(§) is the root such that ”1(0) = 1, of the matrix 9.. N R(fi) =[r.. e 1]] . 13 1.331 Then the matrix Vo which is independent of k is given by the quadratic expression %h§?Vo§' in which 'g stands for the column vector with components (g1,§2,...,§u) where £1 - (911’612""’91N)' As a result of the above argument, the first term in (3.3.7), N N p. n A l- 2 2 N. Ln ‘L1, is asymptotically normally distributed k 1j q. i=1 j=1 1j ji‘i 2 according to N(mn, on), where N N P mm = z 2 Piri, Ln JJ- (3.3.9) i=1 j=l J qij o2 N P P 2 2 o = k2 ; 00 Q 2 (Ln -2£)(Ln -£g) Cov(n r,nt ) (3.3.10) “ u,r,t,q ur qtq “ q ufr,t¢q 1 N To determine the distribution of 2 Q E: 2 Zi(q -p ), the 1.1 i 1 second term in (3.3.7), the distribution of the 21's is first studied. The random variable Zi was defined by (3.1.12) and (3.1.13) as K. A 1 Z r 2 t. i G A (3.3.11) 47 N Because of the linear relation k 8 2 K1, the Ki's are i=1 dependent random variables. Also, the Ki's are asymptotically normally distributed as k tends to infinity. Means and covariance of the Ki's are given by Good [G-l]. They are as follows: E(Ki) = k Pi i e A (3.3.12) Cov(Ki,Kj) = k xij i,j E A (3.3.13) Here, the Pi's are the stationary probabilities of the chain and 1,. is defined as 1J 1 -1 -1 i]. {AijPi - P1P], + Pi[S(I-S) 11,1 + Pj[S(I-S) 11,1}. (3.3.14) The matrix S is defined in terms of the equation given by R = g_§F +13 T , where R — [rij]’-£ — [p1,p2,...,pN] and g. 
correSponds to left T T and right eigenvectors, reSpectively; i.e., _I_’ R I E and R g = 3. Now, define ll> ~< H xflhi It“)?! j-1 tij(qi-pi) V'i E A (3.3.15) Since the Ki's are dependent random variables, the yi's are also dependent random variables. For small k, the distributions of the yi's are too involved to compute. However, it can be shown that as k tends to infinity, their distributions approach the normal with probability one (w.p.1). Note that K P{1im E3- = P.} w.p.1. (3.3.16) 1.... 1 48 So, for large R, R1 ~ k Pi w.p.1. Substituting this into (3.3.15), it follows that kP, 1 1 Y1 - k iEltijmi-Pi) (3.3.17) Since the tij's are i.i.d., exponentially distributed random variables, when k is large enough, using the Central limit theorem, it can be seen that the yi's are independent and asymptotically normal according to N(m1,oi) where 2 _ Pi(qi-Pi) . 2 _ Pi(qi-Pi) i kq i 2 m = 31:31:311 . 02 _ Pi(qi'pi) under w s ’ . — ‘2’ 1 pi 1 kg. 2 1 N . Also, 2 = 2 y is asymptotically normally distributed i=1 according to N(m ,o ) under m and N(m ,02 ) under w , z z 1 z z 2 l l 2 2 where N P ( - ) N P ( - )2 i qi Pi 2 1 i qi pi 1 i=1 ql 1 i=1 qi m = N Pi(qi-Pi) . 02 = l. g Pi(qi pi) (3 3 19) 22 i=1 Pi 22 k i=1 p: Q As a result, 2 n + z is asymptotically normally dis- tributed according to N(mz ,0: ), i = 1,2. Where i i _ . 2 _ 2 2 mz1 — 11121 + m.r11 , 021 - azl +-an1 under ml (3.3.20) mz = m + m ; 02 = 0'2 + 0'2 under (D2. (3.3.21) 2 22 n2 22 22 n2 49 In the above, mn and a: , i = 1,2, are expressed in i 1 terms of m.n and 0: defined in (3.3.9) and (3.3.10) as follows: q . — ° 2 - 2 -- —1Ii . ' mn - mh , on — on when rij — q. 1,j E A (j # 1) 1 l 1 m = ° 2 = 2 when r = 331 i E A ( i i) The total probability of error, then, can be calculated using equations (3.16), (3.19) and (3.1.10). The result is as follows: ,, 1‘1'“1 *21 T‘2'““z1 P8 = P1(l) 1 - P(fk o ) + P1(2) 1 - i(/1c 0* ) Z1 Z1 0 “Fuzz T‘2’"‘22 + P2 P2(1) @(fko ) + P2 (2) :(fk* 2) (3.3.22) c’z Z2 2 W(i) where 0:. A,/k Uzi , n1- , i = 1,2. 1 P: P2(i) Note that ha: 111 = 0, =.1,2 Ram It is shown that for a Specific case in which qi‘1 > pij’ V i,j E A (j f i), the total probability of error decreases toward zero as k tends to infinity. The following lemma is first introduced and its proof is given in Appendix E. LEMMA 3.3.2 qij > pij = -m < mi < O < mi < +m From Lemma 3.3.2 and the property of Q(.), the following limits hold: L111.» '1 I 1‘1"mzl T‘Z-mzl lim §(/k T) = lim 9(fk T) = 1 k—m 0' k—-m 0 z1 z1 Tll-mz 1‘12- 22 11m @(/k _,_2.) = lim @(fl —;——) = o 1?“ Oz k—cco Oz 2 2 Thus, from (3.3.20), it follows that * lim P = 00 e k—m 3.4 RECURRENT EXPRESSIONS FOR THE PROBABILITY OF ERROR In two-pattern class problems, the total probability of error can always be expressed in terms of the probability of error of the first kind (the probability of false alarm), go, the second kind (the probability of a false dismissal), 50m and the prior 0 probabilities, P3, P2. 0 O = + 3.4.1 where do and Bo are defined as follows: 0.0 ‘3 P[d*(xk,tk) = “”219 = 1], Bo 9 P[d*(xk,tk) = mlle = 2] (3.4.2) From (3.1.4), they can be written in another form as Q A - l 3 4 3 aO-P[I.K>n\e=0],eo-PELK 1119 = 105’1 (3.4.4) 6,00 P[d*(xk.tk) - 2.131 MK < me = 2,151] (3.4.5) I e fiL. (D n It is clear that the usual error probabilities, “0’50 are related to the conditional error probabilities by the follow- ing expressions: 010 = 0,00.) 
, so = 6000 (“-6) With the help of the conditional error probabilities and the total probability law, a0 and so may be obtained in terms of the follow- ing expressions: a0 =2 J” aL(k>g1dc" (3.4.7) 4 L x t Bo=z j‘ BL(k)g2(xL,t‘)dt’° (3.4.8) x". c" where XL E A&, tL 6 [O,m)L, L = O,1,2,...,k. The conditional probabilities of error can be generated iteratively. This can be achieved writing gi(xL,tL), i - 1,2, as: g.(xL,tL) = fi(xL XL-1,CL-1)fi(CL‘XL,tL-1)g.(it-1,6L-1) (3.4.9) 1 1 Substituting in (3.4.7): 52 = L"1tL"1 L-1tL-l L'l a0 2 f,_1 g1(x ) 2 It gc(k)f1(xL‘x ,t )f1(cL|x%H)dtL dt L-l t Lt X X Defining 4- 1 4- -1 4 4- -1 . 4,, 10031,: J“, Maw (x J" ,t )f (t \ )d 4’ 4, l,...,k (3.4.10) the desired iterative form for qL(k) is obtained. Similarly, for B£(k): ‘xL tL'1)dtL, L = 1,...,k. (3.4.11) = L—l L-l BL_1(k) :4 {L BL(k)f2(xL‘x ,t )£2(th X From (3.4.4) and (3.4.5), it follows that these probabilities of error must satisfy the following boundary conditions when L = k. ak(k) = U(Lk - n). Bk(k) = 1 - u(Lk - n) (3.4.12) Here, u(.) is the unit step function. In terms of the recurrent relations (3.4.10), (3.4.11) and of the boundary conditions, (3.4.12), it is possible to obtain Pe via a0 and BO. Un- fortunately, except for some simple cases, an analytical form for the total probability of error is almost impossible to obtain. How- ever, because of the iterative character of the method, it is more convenient for the computer implementation than the work in Sec. 3.3. 3.5 CONCLUSIONS The aim of this chapter has been to derive expressions for the total probability of error and to study their asymptotic pro- perties for the problem considered in Chapter II. Because of the tremendous complexity involved in computations with unknown para- meters, attempts have been made only for the known parameters case 53 with two pattern classes. The exact analytical solutions were obtained in Sec. 3.1 for the total probability of error assuming two pattern classes with 2-state chains and known.Q-matrices while lower and upper bounds were given in Sec. 3.2. It was shown that the probability of error decreases toward zero as the number of observations increases without bound. Section 3.3 derives the asymptotic proba- bility of errors for large sample sizes. Proving the asymptotic normalities of the underlying distributions, asymptotic error expressions were obtained for the case of two pattern classes, N-states and known Q-matrices, and, in terms of the Lemma 3.3.2, their convergence to zero as the number of observations tends to infinity was proven. Finally, In Sec. 3.4, the conditional probabilities of error of the first and second kind were defined and it was shown that the total probability of error can be computed in terms of these conditional error probabilities and of the prior probabilities. A method was also given to generate them in an iterative fashion. CHAPTER IV DECISION MAKING AND LEARNING WITH UNOBSERVABLE STATES AND OBSERVABIE TRANSITION TIMES In this chapter, optimal decision making with unreliable observations on finite-state, continuous-parameter Markov systems is considered. The states of the systems cannot be observed directly, but the exact times at which transitions from one state to another occur can be observed. The decision problem itself is the same as that in Chapter II. The model by which observations are generated is defined in Sec. 4.1. In Sec. 4.2, the optimal decision rule is established and is generated iteratively for the case when all parameters in the model are assumed known. 
In Sec. 4.3, the results of Sec. 4.2 are applied to two Specific cases. Section 4.4 deals with various aSpeCtS of decision making for continuous-parameter Markov systems when the model is not completely specified. The main problem of Sec. 4.4 is to extract some information about the unknowns of the model from the observations and use this information to construct an optimum-adaptive decision rule which performs almost as well (with reSpect to a well-defined criteria) as the optimal decision rule of Sec. 4.3. Finally, Sec. 4.5 summarizes the main results of the chapter. 54 55 4.1 MATHEMATICA1.MODEL FOR GENERATING OBSERVATIONS The basic model considered in this chapter is similar to that in Chapter II. Each of the M-pattern classes is defined by a finite-state, continuous-parameter Markov chain. The Q-matrices for the chains are parameters of the problem. The states of the chains cannot be observed directly, but each state is characterized by an observable random process. For instance, a noise process can be added to each state. The random variables describing the observa- tions during a given time interval have joint distributions which depend on the state of the continuous-parameter Markov chain during that time interval. Appendix A shows that almost all sample functions of a finite-state, continuous-parameter Markov chain, {x O s t < a}, t’ which satisfies certain conditions are step functions, and that the process can be uniquely determined by observing any of the sample functions in the interval 0 s t < w. Observing the sample function on [0,m) is equivalent to observing the sequences of random vari- ables {xk}:, {tk}: correSponding to the state numbers and the sojourn times. It is also shown in Appendix A that {xk): is a discrete-time Markov chain (jump chain) with the transition matrix [rij] defined in (2.2.1) and that the sojourn time random variables {tk}: are state-conditionally independent and have exponential dis- tributions with parameters {qi}q. In this chapter, the random variables {xk}: cannot be observed directly, but it is known that when Au = i, a sample function of a random process xi(t) can be observed. The random variables {tk}: can be observed. The main assumption on the 56 random processes {xi(t)}§, which describe the states of the chains generated by each pattern class, is that the statistical properties of the random processes {xi(t)}§ are known and are the same for all pattern classes. This simplifies the formulation. There are many possible sampling schemes for defining the observation process. Here, it is assumed that time samples x(t..) = x 1] ij’ i = 1,2,...; j = 1’2"°°’ni’ are taken during the interval [pi-l’pi) where pi_1 and p11 are the successive jump A points of the process at which transitions occur and po = 0. Another sampling scheme is discussed in Sec. 4.3. The observation process is then defined by the sequence of k random variables {§_,tk}: where tk é (t1....,tk), ti A pi - pi_1: and 5% Q (£1,xz,...,xk); ii correSponds to the iEE Observed vector T £1 = [xil,...,xin ]. Sampling times t are determined as follows: i 13 The first sample xi1 IS always taken at time t11 = pi-l’ just after a transition occurs. The following samples (xi2""’xin ) i are taken at times t. = 12 pi-1+T,ooo,ti u1 = 91-1 + (“i'DT’ reSpectively, where T > O is the time interval between two samples. 
From the above, it is clear that the number of samples n1 is a random variable which depends on ti and is determined by ni = [ti/T], in which the expression [g] is defined for any real number § 2 O as the largest integer less than or equal to g. The general model with the sampling scheme defined above is illustrated in Fig. 4.1.1. The :andom'vector x: = [xkl’xk2""’xknk] defined above takes values in R k and has a nk-dimensional density function denoted by fi('\9) when it is known that pattern class 9 is active and that the correSponding Markov chain is in state i 13/. Ch. mxnnmnnon o N H . rIIIII////q\ Mmmncnm mwm. b.H.HH arm nmomnmp Somme mod arm mmaowwsm mnrmam 58 during the interval [pi 1,pi). That is, fi(.|e) is the conditional density of observations 5k given that chain 9 is in state i. By assumption, fi(§k‘e) is independent of e for any 5k and any k and will be denoted by fi(§k). Since the states of the underlying continuous-parameter Markov chains are not observed, 5k has the following global density when pattern class 9 is active: N f(gkle) = 2 fi(§k)P(1k = its) 9 e a (4.1.1) i=1 This density is a finite-mixture with component densities {f,(.)}§ and mixing parameters {P(xk = i‘e)}§ where 1 N E P(xk = i‘e) = 1. Another simplifying assumption is that the i=1 . . km .. random variables in the sequence {5 }l are state-conditionally independent. That is, fl(£1,£2,...,£k‘>\l = 12)).2 = j:'°-3Ak = L39) = f1(£1)fj(¥-2)..OfL(¥-k) (4.1.2) As in Chapter II, a single, continuous-parameter Markov k k chain is active to produce (§_,t ) during the entire observation interval [0,T], T < m. A decision about the identity of that chain is to be made. 4.2 ITERATIVE GENERATION OF THE OPTIMUM DECISION RULE The model is completely Specified when the followingoquantities are known: The compoennt densities, {fi(')}§’ the infinitesimal parameters, {q:§)}q, and the initial probabilities, {P(x1=i|e)f: over the states for all pattern classes 9 6 @. Decision theory then provides the optimal strategies for processing the observations. 59 * k k An optimum, non-randomized decision rule d (x_,t ) can be chosen as in Appendix B. In particular, when the loss function is given by L(i,j) = l - Aij (O-l loss function), then the optimum decision rule becomes a minimum probability of error rule defined by l d*(£k’tk) = S if P(9=S|§.k’tk) {max {P(e=%\ak.tk)}] :1 ’ LEO L=s or if Pzgs(xk,tk) = [max {PZgL()_<_k,tk)}] (4.2.1) LEO L=s The rule above can be written iteratively. For simplicity, the subscript L denoting pattern class will be drOpped. From the Markov property k k k tk- 1)f k- l tk- 1 k-l k-l 8(§ :t ) = f(tk‘x mk‘ )g(§. ,t ) (4.2.2) k-l k-l , , The last factor, g(§_ ,t ), is available from the pre- k-l tk- l vious step of the iteration scheme. The middle factor, f(x #‘ ,t ), can be written in an iterative manner as follows. By the total probability law, k-l k-l k- l tk- 1 N _. k- l k- 1 (eke .t )— iEIf(§_ka-1:_ WM?"— .c > (4.2.3) where Assumption (4.1.2) implies f(xk‘xke‘i " ,c ) = £,(§k) (4.2.4) The second factor in (4.2.3) will also be required in the first part of (4.2.2) and is computed below. This factor, P(xk= i‘xk 1,t k 1), can be obtained iteratively as follows: 60 N k- 1 tk- l k- 1 tk- l , k- 1 k-l P(kk=i‘_ ) = 2P()\k=i‘xk_1=xja )P()\k_1=J‘x ,t ) j#i (4.2.5) From (4.2.1), Xk- 1 tk- l . . P(kk? 
_1hk_ 1:] j at )= rji 1f J # i = o if j = 1 (4.2.6) Using Bayes rule for the second factor in summation (4.2.5), ‘ k- 1 k 2 P1 k- 1 k- 2 k 1=j ) - N k- 1 k- 2 k- 1 k- 2 j§1f(tk- 1l1k_1=1 x )P(1k_1=1|x .t > (4.2.7) where, from the state-conditionality independence assumption _. k-l k-l _. _ f(tk_1\xk_1-J.§. .t ) = f(tk_1\xk_1-J) - qj exp{-qjtk_1} (4-2.8> and k- 2 k- 2 f gc )P<1=j|x .c ) P(xk 1_ _.‘Xk 1, k 2) = N -k- 1 k- 1 k (4.2.9) k- 2 2 m_1 Then, putting (4.2.6) and (4.2.7) into (4.2.5), k- 2 k- 2 k- 1 k- 1 N f(tk-I‘Xk-1=j)fj (Xk-1)P(xk- 1=jlx ) P(x =i\_ ) = z r . k j=1 31 N f =m)f p ‘ k- 2 1c-2 j#i mE1 (tk-l‘xk-l m(Xk-1) ("k-1:“?x ’t ) (4. 2.10) This is a "predictive" factor since it used the present and past samples to make decisions about future states. Substituting this in (4.2.3) will produce the middle factor of (4.2.2). The first factor in (4. 2.2), f(tk|xk,t k 1), can be written in an iterative form as follows: By the total probability law and the Bayes rule, 61 N k k-l X1. -1 , k k-l f(tk\x,t )= iz_; f(t kl1k=i ._ ,t )P()(k=1\x,t ) (4.2.11) where , k k-l . . f(tkhk=1”i ,t ) = f(tkh‘fl) = qi exp{-qitk} (4.2.12) k 1-1 fi<1k>mk =1|_“ 1.1:“ 1) P(xk=i\§ ,t ) = (4.2.13) N k=1t-k 1 2: f mr<1k=mlx ,t > m=1 Substituting (4.2.12) and (4.2.13) in (4.2.11), 1(— - k tk- 1 N fi()ik)P(>\k=i‘§ 1,tk 1) f(t \x ,c )= )3 f(t \1 =1) (4.2.14) k =1 k k N k-l k-l mzlfm (x k)P(x k=m\x ) k-l k-l Here, again, {P(Xk=i1£ ,t )]§=1 can be obtained iteratively from (4.2.2). The starting procedure for the iterative scheme is given below. When k = 1, observe (x1,t1) and compute 1 1 8(1 .t ) = f(t1|§1)f(§1) (4.2.15) where N o A f(£1)= 23 f. (x 1:)1’ . Pi - P()(1 =- 1) i=1 N f (x )P: _ _. i -1 f(tllil) " 1211301141“) N o E ftn(}'('1)Pm m-l Next, compute the predictive probabilities, {P(12=i‘x1,t1)}1=1. N f(t I1 =j)f (x )P° 1 1 _ P(12=i\§.t)=2rjiN 11 111 ,1- j l _ 0 j mEIf(t1h1m)fm(£1)Pm 62 When k = 2, observe (x2,t2), compute 2 2 2 1 1 l 1 1 f(>_<. .t ) = f(t2\1<_ .t >f<>12|§ .t >134; ,t) 1 1 . . where f(x ,t ) 18 obta1ned from the above, and N f<§Q\£}.t1) = 2 fi(§2)P(12=i|£}.tl) £1(§2>P<12=i|§1.t1> f(t2\§?’t1) = f“2112:” N 1 1 malfmczgpmfmtz .12 ) 1 "(‘12 H 2 N Again, {P(x3=i\§?,t )}i-1 must be computed for the next step. This procedure can be repeated up to time k. The flow- diagram 4.2.1 shows how gs(xk,tk) can be determined in an iterative fashion. 4.3 APPLICATION OF THE MODEL TO SPECIAL CASES The model defined in Sec. 4.1 is quite general and applicable to problems encountered in communication theory, pattern recogni- tion and operations research. The major difficulty in implementing the Optimal decision rule is that even if all information about the random processes {xi(t)}1:=1 is known, determining the n%h order density of 5i = (xil’xi2’°'°’xini) can be very complex and, sometimes, impossible. Some ways of getting around this problem are presented below. 1. Take only one sample from each time interval [p1_1,pi). The random vector 5i is then reduced to a single random'variable, xi. Sampling points may be taken at times pi_1 when transitions occur. Fig. 4.3.2 shows the sampling scheme for a particular sample function. 63 Observe ] (5151) ‘ 4k Compute g(x1,t1) | Eq. (4.2.15) —“L‘; 1 = 1 Predict Next Step Com ut {P(Xj+1=1‘x ’t )}i=1 7 l I I Observe I (£j+1:t1+1) ' .12 _____.L__. .____ilr Compute Compute N . N A {f1(§j+1)}1=1 {f(tj+l‘)‘j+l=1}i= (Known) Eq. (4-2-8) ‘71 L1 L 4:12.: Compute Compute Compute 1 f k YES Eq. (4.2.2) Fig. 4.2.1: An Algorithm for Iterative Implementation of g(xF,tk). 
l Obta in 3(xk,tk) x3 6 t1 e: t2 4?. t3 —: > t po p1 p2 P3 Fig. 4.3.2: A Sampling Scheme 2. Choose processes for which the joint distribution of {5k}:=1 can be simply described. In the following developments, the iterative results obtained for optimal decision rule are applied to the case when the random processes {xi(t)}§=1 defining the states of the continuous-para- meter Markov chains, are wide-sense stationary normal processes. In this case, the component densities {fi(°)}§-1 for observation vector Ek’ vectors E1 and covariance matrices 21 are assumed known. They can be computed from the mean function and auto-correlation function a vector of order ni, have known forms. The mean of the correSponding process as follows: 91 = niI, where I is a unit vector of order ni and Cov(x = Ri(t - t ku’xkv) ku 1G,) 2 u,v = l,2,...,ni. The joint density fi(°) may be written in the following form and is the same for all pattern classes. NIH -1 fi(§k) = n /2 exp{- (5k - 111(1)T :1 (5k - 111(1)} (4.3.1) 65 The iterative, optimal decision rule in Sec. 4.2 can be rewritten as * k k d (g ,tk) = s if P:g8()_<_k,t ) > PEfL(§k,tk) 2?. 9‘ s (4.3-2) where, from (4.22), the density required has the form k k xk- 1 tk- 1 Omitting, again, supercript and subscript L, Jfi_ and FiL are defined as: 1 N A i N 1 N Jkoik’tk’U-‘k-lh) ’ Jkl(§k’{Pk-1}1)Jk2(x ’tk’{rk-1}1) “'3'“ A k-1tk-1xk k- 1 Jkl f(x “-1.1 ), Jk2= Akf(t ,t ) (4.3.5) t)k Pk =Fk(Xk:t k’{r.k- 131): A(Ak+11xk :t (ll-3:6) . . i . The fOIIOW1ng expre551ons for the Pk, Jk1 and sz are der1ved using the iterative equations in Sec. 4.2. N . r ' 1 T -1 121 Fi‘l “jib 1/2 exp“ 501W?) 21 (’inI) ' “1‘13 Pi ___ iii (2") ‘3‘ k N I‘m m - I I t #21 k-l n,/2 1/2 exp{' 2(xkfum )T2m1(§k-um ) - qm k} <2“) 1 12ml. .. N 1 _1_ -1 Jk1 " 121 Fk- 1 n /2 1/2 expi' flak-ping (1151411)} (211) |Zi| N . q. 1 1 1 T -1 121 rk-l n./2 1,2 “91‘ f(ik'PiI) 21 (’ifl‘il) ' qitk} (211) 1 lzd J = k2 N m 1 1 T -l 2: I‘k-1 n [2 exP{' 2 (’ik‘uml) 3m (ik‘uml) 66 Note that {qi}1 and {rij}§,j=1 depend on the pattern #1 class, but the mean vectors and covariance matrices are the same . . . i for all pattern classes. The initial values of Pk, Jk1 and sz can be easily determined by putting k = l in the above expressions 1 and noting that F = P?. o i The optimal decision rule can also be established for the case when the random processes {xi(t)1§=1’ describing the states of the chains, are Gauss-Markov processes. Then, the transition probabilities and initial distributions for a given observation vector 5k may be written in the form: 1 _[xk +1 (Xk “95’le piock .+1‘xk .) = 2 exp 412 g 1 (4.3.7) ’1 ’3 2n(1-Ri)oi 201(1 - R1) (x - u.) 1 P(i)(xk 1) = exp - k’1 2 1 (4.3.8) ’ 21101 201 Note that the double subscript notation, x was used to 191’ denote xkj for convenience. Here, pi, Oi and R1 are the correSponding mean, variance and correlation functions of the process xi(t) where R1 = exp{-yi‘¢l}; T is the sampling interval, —- Y1 is the correlation time. The JOlnt density of Ek = (xk1,...,xknk) is determined in terms of (4.3.7) and (4.3.8) as follows: 0 nk-l g1(-k) _ P1(Xk,1) 121 pi(xk,j+1‘xk,_]) nk-l 2 [x (x - )R ] 1>°i(xk 1) E1 k,j+l k,j 1‘ 1 ““1 - T’— 8*? J i 67 n-l k 2 2 2 where Ai [’3 [211(1 - Ri)oi] . Then, the necessary quantities, Flt, Jlk’ J2k defined in (4.3.4), (4.3.5), (4.3.6), determining the iterative optimal decision rule, are computed. The expressions are given below. 
1 Pk = nk-l 2 N r..q.p‘.’(x ) 31 [xk.r+1'(xk r-ijj " ”1] J J1 .1 .1 k:1 r _ 2 1"1-1 A P ‘ 2 2 qjtk 1,1 nk-l 2 0 2 x '( '1») =1» 21 I‘m qum(Xk’1) p r"1 k,r+l k,r m m 111 q t k-l A - 2 2 - m k m=1 m 2qm(1 - Rm) nk-l 2 R " l N 1 p:(xk 1) r21 [xk,r+l (xk rqli) i 1&1 Jkl = Z rk-l A ex" ' 2 2 i=1 i 201(1 - R ) sz — nk-l 2 N i qip:(xk 1) 131 [xk,r+l (xk r u'1)Ri - H'1:l 2Fk_1 A L em) 2 2 -qik i=1 20.(1 - R ) 1 i nk-l 2 N m p0 (x ) E [xk,r+l (xk r M'm) m - p'm.J z r‘ In k 1 r—l =1 k-l ———X—L—'exp - 2 2 In m 20 (1 ‘ R ) m m The initial values of Flt, Jk1 and sz are given as followa: N’ o o I’ t , , x ex - ,t 121 j leJ( 1.1)qj p{ (11 1} i _ 1in . 1 " N ’ P0 ° - mil mpm (xl,l)qm exp{ qmtl} 67 n -l k 2 where A1 9 [2n(l - Ri)ci] . Then, the necessary quantities, Ti, Jlk’ J2k defined in (4.3.4), (4.3.5), (4.3.6), determining the iterative optimal decision rule, are computed. The expressions are given below. 1 Pk = 0 21h -(X m )R -u]2 N ..Q.P.(X1)e k,r+l k,r j J j J Jl l l ”k r-1 2 Fk-l A 2 2 - thk j=1 j 203 (1 - Rj) 1941 nk-l 2 0 2‘. x 'X( -p. )R -u N rm qmpm(xkll) r— _1 k, r+l k,r m m m _ t 21 k-l A e - 2 1 R2 qm k m m qm( " up nk-l N 1 9:05:11) r21 [xk “‘1 (ka-u )R - p‘ 3 Jkl = z Fk-l A "p ' 1=l i 201(1 ‘ R ) sz _ nk-l 2 N . q.po(x ) E [xk,r+1 - (xk,r-ui)Ri - H'i:I 1 1 1 k,l r-1 ’3 rkq A exp 2 2 ' qitk i=1 20.0. - R ) 1 i nk-l 2 Z Fm p;(xk 1) r21 [xk,r+l - (Xk,r-um>Rm.- p'm] _ k-l ---L- exp - ““1 A 2 2(1 - R2) m cm m The initial values of Pi, Jk1 and sz are given as follows: N e - ,t 121 Pjt j. pj(x1 1)qj xp{ <1] 1} 1 _ fii . F1 -' N o , P° - Z P (X 1, 1) qm EXP{ qmt 1} m=l m.m x1,1 4.4 ADAPTIVE DECISION MAKING Adaptive decision making schemes are now studied when the underlying Markov model is not completely Specified. The unknOWns of the model are the Q-matrices, QS = [q§:)]§,j=1, qis) g Qii), S 6 ®. The parameters and the properties of the random processes, {xi(t)}§, describing the states of the underlying continuous- parameter Markov chains are assumed to be known. The component densities, {fi(.‘e)fi, defined in Sec. 4.1, are, then, completely known. Furthermore, for simplicity, they are again chosen to be the same for each pattern class. That is, fi(1<_k>= fiekle) v £1. (4.4.1) A prior distribution summarizing initial knowledge about the unknown infinitesimal parameters is available for each pattern class. Using a supervised learning scheme, posterior distributions for the fixed, but unknown, parameters are "learned" from training samples (2:8, T:S). The training data are employed to form posterior densities for the parameters which, in turn, lead to an adaptive decision rule which makes decisions with minimum error probability. The purpose of Sec. 4.4.1 is to derive the optimum-adaptive decision rule using the Supervised learning scheme. In Sec. 4.2.2, components of the optimum-adaptive decision rule are generated iteratively and 69 the structure and computational feasibility of the iterative scheme are discussed. The learning feature of the rule is discussed in 4 . 1+ . 1 OPTIMAL-ADAPTIVE DECISION RULE The optimum-adaptive decision rule will be derived for the problem defined above. The following definitions introduce the problem. 0 I I k The non random1zed dec181on rule k ,t ), given the C k k j C . 
observation sequence (x ,t ), given all tra1n1ng samples n n n A (BUT) = {(111,T11),...,(1MM,TM )} and given Q-matrices dgd(x n 9, = {Ql’Q2’°°"QM}’ has expected loss M k R[d\ (fl :tk):(Y:T) 9Q] = 2 L[d(xk:tk):i]P[e=i‘(ngtk)9(Y:T) ’9‘] (40402) i=1 Equation (4.4.2) is referred to as the sample-parameter- 2 conditional risk . The parameter-conditional, adaptive-Bayes risk is defined as: ra,mf<§k,tk\(m), >42k dck (4.4.3) k 0 where P0 = (P3,P2,...,PO) M is the prior distribution over the para- meter Space and k k M o k k n. n, {[5 ,t [(y.'r),Q] = 2: Pigi[x ,t “111,35 ’91] (4.4.4) i=1 Thus, the adaptive-Bayes risk is given by O O ra(P .d) = jg ra(P ,d\g)fo(9)dg (4.4.5) '2 Signori [8-2], pg. 14. 70 This development, which involves the use of conditional risks, differs from that given in Appendix B and emphasises the role of prior information in constructing the total risk. * *- The optimum-adaptive decision rule da 9 d (xk,tk) is defined as a non-randomized decision rule which minimizes the adapt ive-Bayes risk. That is, 0 * . o ra(P ’da) = inf ra(P ,d) (4.4.6) dED where, D is the set of all non-randomized decision rules. Note that * * k k ra_<_ .t ).(yLL.TLL>] Leo , k k n n o k k n n or, if pggs(x ,t ‘ySS,TSS) = 2:; 2,3,(x .c \y,‘.¢,‘> (4.4.8) The basic elements of the above decision rule are the posterior densities of the observation sequence, given the learning samples. These densities are obtained by averaging the parameter-conditional densities over the posterior densities as in (4.4.9). n n n n L L a k k L L L L 8,01» ”at ”L ) £24;st .t Ix, .T,’ ’QL)f(QL‘XL .1',’ MD, L 6 6) (4.4.9) Both factors in the integrand above are generated iteratively in the next sect ion. Viv»! 71 4.4.2 ITERATIVE FORM FOR THE DECISION RULE The first factor in the integrand of (4.4.9) can be imple- mented iteratively in exactly the same way as in Sec. 4.2, conditioned “L on the unknown parameter QL. The second factor f(Q ”‘21 TZL) can be implemented iteratively as follows. In the following, the subscript L will be dropped as a convenience; f(Q‘yé,¢n) is given by the Bayes theorem as follows: f(x“ . “0\Q>f (Q) f(Q‘yn,'rn)= (4.4.10) “in ’1' n) where fo(Q) is the prior density function over the Q Space. Using the Markov dependency between the samples, (4.4.10) can be written as - -1Tn-1 m Li‘s“ am \y, " .Q) f(Q\xn,'rn) = n n _1 1““! m1 “-1 mix " 1 “ 1) (4.4.11) f(rn‘z .w min”. .«r ) Equation (4.4.11) is the desired result. The last factor n-1Tn- 1 on the right side of (4. 4. 11), fGQ‘x ), is the density in the Q space at the (k-1)-h'stage and is available from the pre- vious step of the iteration scheme. The first and second factors . n n- -l tn- 1 1n the same numerator, f(rn‘y_,rn1,Q) and f(yn|y ,Q), are the densities directly utilizing the prior knowledge above and can be obtained iteratively as explained in Sec. 4.3. The denominator of (4.4.11) is a normalization constant which assures that f(Q‘lnn'n) integrates over the Q Space to unity. The n n parameter-conditional density,g L(xk ,t k‘yL L ’L&’ ), is the density . k . function for observation (§.:t k) correSponding to pattern class {, and is given by (4.2.2) (where QL was assumed known) as a 72 function of the unknown parameters and the training samples. This term can be generated iteratively in a manner Similar to that of Sec. 4.2, but conditioned on knowledge of the unknown parameters. 
The information about the fixed but unknown parameters {QL}I is n n summarized by f(QL‘XLL’TLé) which is generated iteratively as explained above and is employed in the decision rule by the averaging procedure given in (4.4.9). Even though the iterative expressions are available for both posterior densities in the integrand of (4.4.9), it has one major draw-back. The posterior densities for the parameters are not usually reproducing because the class of densities involved are mixtures. In general, two Serious problems are immediately encountered when trying to implement this decision rule on a computer. The storage problem refers to the difficulty in allocat- ing computer storage locations for the posterior densities in (4.4.9). The computation time problem refers to the difficulty inherent in performing the number of computations required to change the old posterior densities calculated at (k-l)£h'step, into the new posterior densities at REE-step. Since the entries of each matrix Qi are continuous random.variables, in general, the amount of Storage required to store these densities is infinite. The only way to make use of the rule is by quantization of the parameter Space. Then, each parameter taken to be a discrete random variable so only a finite number of values needed to be stored and updated with each observation. By quantizing fine enough, it is possible to get arbitrarily close to the optimum solution at the expense of increased memory. However, the memory 73 grows exponentially with the dimensionality of the parameter vector (1, making the method feasible only for problems with a small number of states. As a simple example, assume that M = 2,‘N = 3; then, there are 2 x 9 = 18 parameters. Using 10 quanta level for each parameter, 1018 storage allocations are required for Storing posterior densities of the parameters for both classes. One way to get around this problem is to use some sub-optimum methods in which consistent estimators for the parameters are found and, in turn, used in the decision rule as if they were the true values of the parameters. 4.4.3 ASYMPTOTIC OPTIMALITY AND ADAPTATION In the previous sub-sections, the decision making problem was studied for the case when the Q-matrices describing the pattern classes were unknown. In that case, the optimum-adaptive decision rule d:(x ,t ) was used instead of the optimum decision rule d*(xk,tk). To discuss the learning capability of the optimum adaptive decision rule, the following definition is given, [R-Z]. * DEFINITION 4.4.1 An optimum-adaptive decision rule dat is said * to be an asymptotically Optimum decision rule d relative to RC = (Qi,Q:,...,Q;), the true value of Q, if and only if . o * o * 11m r(p .da) = r(p .d ) n1, 0 o o ,nM—oaa That is the Bayes risk in using the optimumsadaptive decision rule, when 9_ is unknown, converges as the number of training samples ni‘» m, i E {l,2,...,M}, to the same limit as the Bayes risk when Q is known and the optimal decision rule is used. 74 The following theorem, then, insures the asymptotic optimality of the adaptive-optimum decision rule. * THEOREM 4.1.1 da is asymptotical optimal relative to go. A lemma which will be used in the proof of Theorem 4.1.1 is first given. n n n,~o 1mm 4.1.1 P[e=i\(_xk,tk).(y_ i,r i> «211—1—9 P£e=il :03] w.p.1 i E {l,2,...,M} The proof of the above lemma is anologous to that given by Signori [8-2] and will be omitted. PROOF OF THEOREM 4.1.1: For Simplicity, the following are defined. [1 n n k A k k . 
i L) i i Q U (E :t ) : V (Xi ,Ti ) , Vni (yini’Tini) Then, from (4.4.6) it follows that o s r(po,d:) - r(p°,d*) s[R(d:|uk) - R(d*[uk)] n . E[R(d:|uk - R(dZ‘uk, {V l}?:9] n, n, + Emwjlu“, {v 112‘s) - R3 But, Etudihknvnifiw - R S 0 ThuS, o _<. r s E[R(d:|uk) - R(d:\uk. {Vnifl‘m n + E[R(d*|uk, {v iffxp - R(d*\uk)] 75 Substituting the values of R(.‘.) defined in (4.4.2) in the above, it fo llows that o * 0* 0sr

-r(p,d> M k M n i=1 i=1 n M , M + E{ 2 L(d*,i)P(e=i|uk,V 1,9) - 2 L(d*,i)P(6‘iluk9 i=1 i=1 M n. s s{ 2 L(d:.i>tp - P determine which pattern class is active based upon observation of }{t° The entire model is illustrated in Fig. 5.01. In Sec. 5.1, Optimal decision making with discrete-time cflaservations is studied and iterative expressions are obtained for 78 79 mm H E 2.: G: on N Ema 2.: ] mew. m.oHu P--- ..3 arm ZOQmH mom Umowmwos mvawwsm mnrmam. + n n n meHsm awn: owmonmnm mmncnm mxnnmonon W . Uh u AxHuxNvooovu a. 80 the likelihood ratio. In Sec. 5.2, a continuous likelihood ratio is obtained by a limiting operation. Section 5.3 discusses a method for solving the classical problem of detecting a random telegraph signal in additive white noise. 5.1 OPTIMAL DECISION MAKING WITH TIME SAMPLES In this section, the features on which decisions are based are defined as point samples of the observed function xt as follows: = k + n , k = l,2,...,K, (5.1.1) k k k where K = T/T and T is the interval between samples; 2 k n(k-1)¢ Gaussian processes at time (k-1)T, with n represents the noise sample taken from the white E(nk) = o, Cov(nk,n£) = v06(k-L), k,L = l,2,...,K. (5.1.2) The sequence of random variables {Ak}§, xk é x(k-1)T’ take values {a1,a2,...,aN} corresponding tozifinite state Space A = {l,2,...,N}. That is, if the Markov process is in state i at time (k-l)T, then xk = ai. Since the xk's are the time samples from the process At they satisfy the Markov prOperty; namely, = = _ _ _ (S) P(Kk aj‘xk-l a.,..., ll — a ,9-3) - p 1 L ij (T) when ws is active (5.1.3) ( ) where pi: function of the continuous-parameter Markov chain describing chain 5. (T), s = 1,2, is the stationary transition probability 81 The observed random variables {Xk}§ takes values in a finite-dimensional Euclidean Space and have the following density when it is known that the chain in state i at time kT. fi(xk) = p(xk - ai) (5.1.4) Since the noise samples are Gaussian and uncorrelated, it follows that they are state-conditionally independent. That is, P(Xk-l’xk‘Xk-l = ai’xk = j) P(Xk-l‘kk-l = ai)P(Xk‘xk = j) fi(xk_1)fj(xk) (5.1.5) Since the true states of the chain that is active cannot be observed directly, xk has the global density: N f(xk\e) = iElfi(xk)P(>.k = ai\e) s E e (5.1-6) In accordance with the discussion in Chapter III, the * Optimum decision rule, d (.) is given by the Bayes decision rule: * K d (x ) = ml if AK < n = m, if AK > n (5.1.7) where K 0 K g (x ) A P2 K K AK = AK(X ) ‘ K , fl =‘—; ; gs(x ) = g(x ls = s) 81(X ) P1 Using conditional probabilities, the likelihood ratio, AK, can be written iteratively as follows: k+1 k k 2(x ) = g2 f(xk+1\xk.e=1> 8 A = k+1 (5.1.8) 82 By the total probability law, N k - - k = = = k = = By the state-conditional independence, the first factor in the summation above can be written as: f(x \xk x =a e=s> = f (x >= p (5.1.9) H4 ’kfl i’ i an 1&1 i The following notation is now introduced. 111; = k __ . iZQ = k _ . ___ rk mm ailx .e 1) .rk mm ailx ,e 2). 1 e A. s 1.2. (5.1.10) . 
il Iterative expressions are then obtained for AK, Pk '2 and F; as follows: N i2 121 p("k+1 aiwk Ak+l = AK N 11 (5.1.11) , - a ._ p(xk+1 i)rk 1-1 From the total probability law, Fis, s = 1,2, is given by is k N k k rk = P(Ak+1=ailx 39:8) = .EIP(Ak+1—ai‘Ak—ajix 59:3)P(Ak—J‘X 39‘s) 1 (5.1.12) where (5.1.3) and state-conditional independence imply P(x =a ‘1 =a xk 9:5) = P(S) = 1 2 (5 1 13) k+likj” 31(T)’S " '° The second factor in (5.1.12) can be evaluated from the Bayes rule as: 83 k-l f(x \x =a..e=s)P< =a.\x .e=s) P(xk=aj‘xk,e=s) = k k l )‘k J f - - P - \ k'l - 1 (xk\xk—a,.e-s) (1k_L x .e-s) "1452 L js p(x - a.)F _ = N R J k 1 (5.1.14) L3 2 o(x - )F _ as Substituting (5.1.13) and (5.1.14) into (5.1.12) gives the following recursive relation. g (S)( > (x - a >rjs is .=1pji T p k j k-l r =*1 , s = 1,2 (5.1.15) K N PL3 2 p(x - a ) _ L=1 k L k l . . . il 12 N . The initial values, A1 and {F1 ’r1 )1, are given by N 0 . p(x1 - a,)P1(i) i 1 1 A1 " (5.1.16) o(x1 - ai)P§ "P1 "P132 1 )4. Z 2 P;:)(T)p(x1 - aj)P:(j) F18 = l=1h , s = 1,2, 1 e A. (5.1.17) 2 pPp(1> i=1 1 L s where (P:(l), P:(2),...,P:(N)); s = 1,2, are the prior probabilities over the states. The recurrent expressions (5.1.11), (5.1.15) and the initial values (5.1.16), (5.1.17) permit sequential computation of the like- lihood ratio for any k. The rule d*(xK) is an iterative, optimal decision rule which can be though of as an element of a class of on- 84 line decision rules with fixed memory. However, the number of computations needed to make the optimum decision grows linearly with the length of the observation sequence. 5.2 OBSERVATION OF ENTIRE SAMPLE FUNCTION Expressions analogous to those obtained in Sec. 5.1 can be obtained when the entire sample function xt is observed for T seconds. In this case, the iterative expressions (5.1.11) and (5.1.15) are replaced by a set of non-linear stochastic differential equations with a limiting argument. Difficulties are encountered when attempts are made to take a mathematically rigorous limit of the results derived from the model of Sec. 5.1 as the sampling interval goes to zero. Because of the infinite variance of the resulting continuous white noise, mathematical operations (e.g., differentiation and integration) do not exist in any strict sense. For this reason, it is necessary to modify the sampling scheme so that the correSponding limits exist. To achieve this, a new observation scheme will be employed. Let the observed process be x = l t t + nt, t E [0,T], where xt is an N-state, continuous-time Markov chain with the . .. .. (S) N- _ stationary tranSition probability matrix [pij (t)]i j—l, s - 1,2, .9 and nt is a white Gaussian process. The process At is first approximated by a homogeneous, N-state, discrete-time Markov process * (Markov chain), xt = {xk, k = l,2,...,K}, taking values in a finite set {a1,...,aN} at times (k-1)T; k = l,2,...,K, TK = T, with transition matrix [p§:)(r)]§ j=l’ s = 1,2. Good [G-l] showed 3 * that At will converge to xt as T a O with probability one. 85 Thus, for sufficiently small T, x may be approximated by t * xt ~ kt +nt O S t S T (5.2.1) where the equality will be held when T = O, w.p.1. The sampling process being employed here is defined in (5.2.2) + n ik k . k = l,2,...,K. 
(5.2.2) 1 k xk = ;'I T xtdt z (k -1)T It is clear from (5.02) and (5.2.2) that, E(nk) = o, Var(nk) = $9- (5.2.3) 1 2 p(nk) = v exp{- §-'nk} (5.2.4) (2n o)l/2 The distributions and the prOperties of the observed random variables {xk}§ are the same as with discrete-time observations and are given in (5.1.4), (5.1.5) and (5.1.6). Thus, the stochastic differential equations for the logarithm of the likelihood ratio, Lk+1 = LnAk+1, can be obtained in the case of continuous-time observations as follows. Taking logarithms in (5.1.11), substituting ' for p(.), and cancelling common terms: N - .. 12 _T_ - 2 Lk+l Lk L“ [121 Fk eXP{" 2vO (Xk+l 81) 1] 1=l N il T 2 - {,n[ I‘k exp{- —2Vo (Xk+1 - ai) }] (5.2.5) For sufficiently small T, Ir 2 ... 1' 2 exp{ 23;.(xk+1 ai) } ~ 1 - 2;;'(Xk+1 ai) . (5.2.6) From the definitions of Til, Fiz, it follows that 86 N 2 F = 1 , s = 1,2 . (5.2.7) Putting these expressions into (5.2.5) produces N . ~ .;L_ 2 il Lk+1 ' Lk " L“ [I ' 2vo .2 “k+1 ‘ 31) Pk] T N 2 11 - Ln [1 - E— 2 (xk+1 - ai) Pk] . (5.2.8) 0 i=1 Also, for sufficiently small T, («ml-“LEW '8)2FiSs--I-g( -a)2'I‘iSs=12 2v0 i=1 k+1 i k 2v .=1 Xk+1 i k ’ ’ ' ° 1 (5.2.9) Thus, (5.2.8) becomes N . . g T 2 il 12 Lk+l Lk 2vo i'_’3_1("k+1 ' ai) (Pk ’ I‘k )° (5°2°10) Dividing both sides by T and letting T a O, the approx- imate expression in (5.2.10) becomes an exact stochastic differential equation for the logarithm of the likelihood ratio. 2 o o - 31) (r:1 - r12) 0 s t s 'r (5.2.11) _=_1_ N dt 2v .E(X t is is , where Ft , s = 1,2, is the corresponding value of Fk in con- tinuous time and is defined as is _ = :2 Ft _ P(xt ai‘xa, 0 s a S t, e s) , s 1,2. The initial condition will be obtained from (5.1.16) by setting T = O and taking into account the substitution L1 = LnAl. L a 0 (5.2.12) 87 From (5.2.11) and (5.2.12), it follows that L__1_ t 2v , o i t 2 '1 '2 f0 (xz - a1) (r: - r: )dz (5.2.13) "NZ 1 In a similar manner a system of stochastic differential equations are obtained for {F:3}2, s = 1,2, as follows. Replacing k by k+1 in (5.1.15) and then subtracting ris from both sides, 2 I 2p (3 i--2—-) )rf’ Fis 'PiS~j=1pjl V0 -Fis s=12 k+1 k N k ° ’ ° 2 exp(- -—“’— (x - a fir” 8—1 2vO k+1 {, R For small T, using approximation (5.2.6) and considering (5.2.7), 12p (S i) (T) T (X _ a )2 rjs r - F18 2 ° - r13 (5 2 14) k+1 k N k ' ’ ° 2 (S 1-; - 2‘.( at)? x 2v ML _1 k+1 For Sufficiently small T, N -l N T 2 s N T _ 2 Ls [ -—2VM§ ska «m Fig] ~ 1+—-—, Elem ap rk. (5.2.15) Thus, (5.2.14) becomes is is 1 _1;_ N 2 LS Fk+1 ‘ rk ‘5 + 2v 2 (xk+l ' a N) o L=l N (5) 2 js is T 1-— -a, P -I‘. X j21p31( )l: V0 (Xk +1 J) :] k. k Expanding the right side of the above expression and taking only the first orders terms in T gives is ‘_ is _ (s) is (s) is I“k+1 I5k ~ (1 p11 (0) I“k + jilpji (flrk js‘i T (s) _ 2 is _1;_ (s) _ 2 js ' 2v ii (T)(Xk+1 ai) I“k 2v 2 pji (10(ka a1) Pk o o j=l jaéi N N (s) 2 Ls T Fsz 2 +_ (TN: 2(X )I‘ +— 2 )(T) 1" S(x -a). 2v0 L'- _1 k+1 8!, k 2vOj""1!,=1p‘llSi k k+1 L j¥i Dividing both sides by T, letting T a O and taking into account the following limiting conditions, 1-p$5)(v) p(S .)(T) lim--—%%—-—- = qES) ; lim -li;—-—'=’(1'Aij)q(:S) ; 11m P(: )(T) = AU T-eO T—*0 T-+0 (5.2.16) where Aij denotes the Kronecker delta, (5.2.17) follows. dI‘iS N N t _ - (s) is fl (3) js 'l _ 2 is l is _ Ls dc " qi Ft +,2_‘qjirt '2v (Xt 8L) I‘t +2v I51: 2 (xt aL2t)F J-l o o L=1 ji‘i i E A, s = 1,2 . (5.2.17) is N . The initial conditions for {Ft }1, s = 1,2, can be obtained from (5.1.17) by putting T = O. . 
P°(i) is s _ Ft = N , i E A, s = 1,2. (5.2.18) 0 . 2 PS (J) j=1 Expressions (5.2.13), (5.2.17) permit computing the logarithm of the likelihood ratio for any point in time. The stochastic differential equations representing the solutions for Lt and {F:8}q, s = 1,2, appear initially to be neat and concise. How- ever, careful examination shows that they represent an infinite- 89 dimensional system of first-order stochastic differential equations, each with the observation process xt as a driving term. Because of this fact, these stochastic differential equations must be solved by representing the non-linear functions in these equations as series expansions and retaining only the first few terms. The infinite system of differential equations then reduces to a finite set and realizable algorithms are possible [8-6]. 5.3 OPTIMAL DETECTION OF RANDOM TELEGRAPH SIGNAL IN ADDITIVE, GAUSSIAN NOISE The problem of detecting the presence of a random tele- graph signal in white Gaussian noise is investigated in this section. This signal is a 2-state, continuous-time Markov chain. Observations are made over a time interval of fixed duration, T. This problem is a Special case of the model considered in the previous section. There are two pattern classes. One is associated with the presence of the random telegraph signal and the other pattern class is associated with noise alone. The optimal decision rule and an iterative implementation are given in Sec. 5.3.1, while the mean and the variance of the logarithm of the likelihood ratio are derived and discussed in Sec. 5.3.2. In Sec. 5.3.3, the probabilities of errors of the first and second kinds are given and recurrent relationships are established for the conditional error probabilities. Finally, in Sec. 5.3.4, stochastic differential equations are obtained both for the continuous logarithm of the likelihood ratio and for the error probabilities. 90 5.3.1 OPTIMAL DECISION RULE In the particular statistical decision problem considered here, the action Space and the parameter Space consist of two elements, ® = {90,61}, 6': {wo,w1}. The basic problem is, then, that of choosing between the two alternatives in (5.3.1); namely, where, T = TK, T is the finite observation interval and K samples. Sec. 5.1. The signal process, represented by kk = k = l,2,...,K, is a 2-state, continuous-time Markov chain which can take values 0 and a with Q given by (5.3.2). ll xk + nk = n k the sampling interval, T k = l,2,...,K, (5.3.1) is the duration of is the (fixed) number of x(k-1)vr’ a specified transition-rate matrix The same sampling scheme is employed as was defined in (5.3.2) The stationary transition probability matrix, P(T) = [pij(T)] can be obtained by solving the backward Kolmogoroff system of differential equations, The complete matrix is: q1 + qo qofill qofil P(T) = q1 q0 qo+q1 qo+q -(qo+q1)T -(qo+q1)T '1 qo _ qo e-(qo+q1)T o+q1 qo+q1 qo + q1 e-(qo+q1)T ofiH qdml (5.3.3) 91 The noise process, represented by nk = n(k-1)T’ k = l,2,...,K, is a zero mean, Gaussian noise process. The observation process {xk}§ is as defined in (5.1.1). The optimal decision rule is expressed in terms of the likelihood ratio, AK, and the threshold n. That is, if AK < n choose mo (5.3.4) if AK > n choose u’1 where the notations are the sane as in Sec. 5.1. Taking logarithm of (5.3.5) produces the recurrent relationship for the logarithm of the likelihood ratio, k-l _ f(xk\x .91) 5 3 5 Lk ’ Lk+1 +‘Ln k-l ( ° ' ) f(xklx .90) where Lk g LnAK k = l,2,...,K. 
The probability density functions f(x_k | x^{k-1}, θ_i), i = 0,1, are determined from (5.1.6) as

f(x_k | x^{k-1}, θ_0) = p(x_k)                                               (5.3.7)

f(x_k | x^{k-1}, θ_1) = Γ_{k-1}^0 p(x_k) + Γ_{k-1}^a p(x_k - a) = p(x_k)[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ]   (5.3.8)

where

Γ_k^0 ≜ P(λ_{k+1} = 0 | x^k, θ_1) ;   Γ_k^a ≜ P(λ_{k+1} = a | x^k, θ_1) ;   Γ_k^0 + Γ_k^a = 1 .

Substituting (5.3.7) and (5.3.8) into (5.3.5) gives

L_k = L_{k-1} + ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ]                (5.3.9)

with initial value

L_1 = ln[ 1 + P^0(1) ( p(x_1 - a)/p(x_1) - 1 ) ] ,                           (5.3.10)

and the probabilities Γ_k^0, Γ_k^a can be generated recursively by a Bayes update followed by a one-step prediction through P(τ):

Γ_k^0 = [ p_00(τ) Γ_{k-1}^0 p(x_k) + p_10(τ) Γ_{k-1}^a p(x_k - a) ] / [ Γ_{k-1}^0 p(x_k) + Γ_{k-1}^a p(x_k - a) ]   (5.3.11)

Γ_k^a = [ p_01(τ) Γ_{k-1}^0 p(x_k) + p_11(τ) Γ_{k-1}^a p(x_k - a) ] / [ Γ_{k-1}^0 p(x_k) + Γ_{k-1}^a p(x_k - a) ]   (5.3.12)

with initial values

Γ_1^0 = [ p_00(τ) p(x_1) P^0(0) + p_10(τ) p(x_1 - a) P^0(1) ] / [ p(x_1) P^0(0) + p(x_1 - a) P^0(1) ]   (5.3.13)

Γ_1^a = [ p_01(τ) p(x_1) P^0(0) + p_11(τ) p(x_1 - a) P^0(1) ] / [ p(x_1) P^0(0) + p(x_1 - a) P^0(1) ]   (5.3.14)

where P^0(0) and P^0(1) are the prior probabilities of the signal states.

5.3.2 ITERATIVE COMPUTATIONS OF THE MEAN AND VARIANCE OF L_k

In this section, the mean and variance of the logarithm of the likelihood ratio are derived. Iterative algorithms are discussed and a theorem related to the likelihood ratio algorithm is given. Before getting into the details, the following is first defined for convenience:

ξ_k ≜ ξ_k(x^k) ≜ ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ]               (5.3.15)

The expected value of the logarithm of the likelihood ratio can be determined directly from (5.3.9) as follows:

E(L_k - L_{k-1} | θ_i) = ∫_{X^k} ξ_k(x^k) g_i(x^k) dx^k ,   i = 0,1 ,        (5.3.16)

where g_i(x^k) is the joint density of x^k under θ_i. In general, it is impossible to evaluate the integral in (5.3.16) analytically. As an alternative, it may help to investigate the conditional expectations μ_ℓ(k | θ_i), which are defined as

μ_ℓ(k | θ_i) ≜ E(ξ_k | θ_i, x^ℓ) = E{ ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ] | θ_i, x^ℓ } ,   ℓ = 1,2,...,k ;  k = 1,2,... .

Using the above definition and the fact that the σ-field induced by x^ℓ is a sub-field of the σ-field induced by x^{ℓ+1}, μ_ℓ(k | θ_i) can be written iteratively as follows:

μ_ℓ(k | θ_i) = ∫_{X_{ℓ+1}} μ_{ℓ+1}(k | θ_i) f(x_{ℓ+1} | θ_i, x^ℓ) dx_{ℓ+1}   (5.3.17)

This is the desired result. It is clear that at the final point, i.e., ℓ = k,

μ_k(k | θ_i) = ξ_k = ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ] ,   i = 0,1 .   (5.3.18)

With the help of (5.3.16) and (5.3.17), it is possible to obtain E(L_k - L_{k-1} | θ_i) by the expression

E(L_k - L_{k-1} | θ_i) = μ_0(k | θ_i) ,   i = 0,1 .                          (5.3.19)

Substituting the values of f(x_{ℓ+1} | θ_i, x^ℓ) from (5.3.7) and (5.3.8) into (5.3.17) makes the conditional means computable recursively. The variance of L_k can be decomposed as

Var(L_k | θ_i) = Var(L_{k-1} | θ_i) + Var(ξ_k | θ_i) + 2 Cov(L_{k-1}, ξ_k | θ_i)   (5.3.22)

where

Var(ξ_k | θ_i) = E(ξ_k² | θ_i) - μ_0²(k | θ_i)                               (5.3.24)

Cov(L_{k-1}, ξ_k | θ_i) = E(L_{k-1} ξ_k | θ_i) - E(L_{k-1} | θ_i) μ_0(k | θ_i)   (5.3.25)

But, from (5.3.9), L_{k-1} can be expressed as

L_{k-1} = Σ_{r=1}^{k-1} ξ_r(x^r) .                                           (5.3.26)

Substituting this into (5.3.25) gives

Cov(L_{k-1}, ξ_k | θ_i) = Σ_{r=1}^{k-1} [ E(ξ_r ξ_k | θ_i) - μ_0(r | θ_i) μ_0(k | θ_i) ] .   (5.3.27)

The conditional expectations

s_ℓ(k | θ_i) ≜ E(ξ_k² | θ_i, x^ℓ) ,   e_ℓ(r,k | θ_i) ≜ E(ξ_r ξ_k | θ_i, x^ℓ)

can be expressed iteratively, using an argument similar to that in (5.3.17), as follows:

s_ℓ(k | θ_i) = ∫_{X_{ℓ+1}} s_{ℓ+1}(k | θ_i) f(x_{ℓ+1} | x^ℓ, θ_i) dx_{ℓ+1}

e_ℓ(r,k | θ_i) = ∫_{X_{ℓ+1}} e_{ℓ+1}(r,k | θ_i) f(x_{ℓ+1} | x^ℓ, θ_i) dx_{ℓ+1} ,   ℓ = 1,2,...,k ;  i = 0,1 .

The boundary values are obtained by putting ℓ = k:

s_k(k | θ_i) = ξ_k²(x^k) = { ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ] }²

e_k(r,k | θ_i) = ξ_r(x^r) ξ_k(x^k) = ln[ 1 + Γ_{k-1}^a ( p(x_k - a)/p(x_k) - 1 ) ] · ln[ 1 + Γ_{r-1}^a ( p(x_r - a)/p(x_r) - 1 ) ] .

It is obvious that the usual expected values in (5.3.24) and (5.3.27) are related to these conditional expectations by the expressions:

E(ξ_k² | θ_i) = s_0(k | θ_i) ;   E(ξ_r ξ_k | θ_i) = e_0(r,k | θ_i) ,   i = 0,1 .   (5.3.28)

In terms of (5.3.22), (5.3.24), (5.3.27), (5.3.28) and the boundary values, it is possible to compute Var(L_k | θ_i) iteratively.
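The detector itself needs only the recursion (5.3.9) driven by the updates (5.3.11)-(5.3.14), and these combine into a single pass over the samples. The following sketch is illustrative rather than the thesis's implementation: the function name is hypothetical, the per-sample noise is assumed Gaussian with variance v_0, the prior is assumed normalized so that P^0(0) + P^0(1) = 1, and the default threshold η* = ln η = 0 corresponds to η = 1.

```python
import numpy as np

def detect_telegraph(x, tau, a, q0, q1, prior, v0, eta_star=0.0):
    """One pass of the discrete-time detector: log-likelihood recursion (5.3.9)
    with the state-probability updates (5.3.11)-(5.3.14)."""
    s = q0 + q1
    e = np.exp(-s * tau)
    P = np.array([[(q1 + q0*e)/s, (q0 - q0*e)/s],    # p_00, p_01 per (5.3.3)
                  [(q1 - q1*e)/s, (q0 + q1*e)/s]])   # p_10, p_11
    dens = lambda u: np.exp(-u*u/(2.0*v0)) / np.sqrt(2.0*np.pi*v0)
    # initial values: L_1 from (5.3.10), Gamma_1 from (5.3.13)-(5.3.14)
    w = np.array([prior[0]*dens(x[0]), prior[1]*dens(x[0] - a)])
    L = np.log(w.sum() / dens(x[0]))      # = ln f(x_1|theta_1)/f(x_1|theta_0)
    G = P.T @ (w / w.sum())               # predictive (Gamma_1^0, Gamma_1^a)
    for xk in x[1:]:
        L += np.log(1.0 + G[1]*(dens(xk - a)/dens(xk) - 1.0))   # (5.3.9)
        w = np.array([G[0]*dens(xk), G[1]*dens(xk - a)])        # Bayes step
        G = P.T @ (w / w.sum())                                 # (5.3.11)-(5.3.12)
    return (1 if L > eta_star else 0), L
```

The two-line update inside the loop is exactly the structure of (5.3.11)-(5.3.12): condition on the new sample, then predict one sampling interval ahead through P(τ).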
5.3.3 PROBABILITIES OF ERROR

The probabilities of error of the first and second kinds are defined by

α_0 = P(L_K > η* | θ_0) ,   β_0 = P(L_K < η* | θ_1) ,

where η* ≜ ln η. In general, exact analytical expressions for α_0 and β_0 are impossible. However, it is possible to obtain them recursively in terms of the conditional error probabilities α_k(K) and β_k(K), which are defined below, similar to those in (3.4.4), (3.4.5):

α_k(K) = P(L_K > η* | θ_0, x^k) ,   β_k(K) = P(L_K < η* | θ_1, x^k) ,   k = 1,2,...,K .

By the same argument used in the previous section, the following recurrent expressions for the conditional error probabilities can be obtained:

α_k(K) = ∫_{X_{k+1}} α_{k+1}(K) p(x_{k+1}) dx_{k+1}                          (5.3.29)

β_k(K) = ∫_{X_{k+1}} β_{k+1}(K) p(x_{k+1}) [ 1 + Γ_k^a ( p(x_{k+1} - a)/p(x_{k+1}) - 1 ) ] dx_{k+1}   (5.3.30)

From (5.3.29) and (5.3.30), it follows that these probabilities of error at the final point, k = K, must satisfy the expressions:

α_K(K) = u(L_K - η*) ;   β_K(K) = 1 - u(L_K - η*) ,

where u(·) is the unit step function. In terms of these relations and the boundary values, the usual error probabilities can be computed by

α_0 = α_0(K) ;   β_0 = β_0(K) .

5.3.4 OBSERVATION OF ENTIRE SAMPLE FUNCTION

The results obtained in Sec. 5.2 can be specialized to the present problem. The following stochastic differential equation is obtained for the logarithm of the likelihood ratio:

dL_t/dt = (a/2v_0)(2x_t - a) Γ_t^a ,   0 ≤ t ≤ T̄ .                           (5.3.31)

The initial condition is obtained from (5.3.10); setting τ = 0 in the expression for L_1 gives L_0 = 0.

In a similar manner, the following system of non-linear stochastic differential equations is obtained for Γ_t^0 and Γ_t^a:

dΓ_t^0/dt = -q_0 Γ_t^0 + q_1 Γ_t^a + (a/2v_0)(a - 2x_t) Γ_t^0 Γ_t^a          (5.3.32)

dΓ_t^a/dt = q_0 Γ_t^0 - q_1 Γ_t^a + (a/2v_0)(2x_t - a) Γ_t^0 Γ_t^a           (5.3.33)

Again, the initial conditions are obtained from (5.3.13) and (5.3.14) by setting τ = 0 in the expressions for Γ_1^0 and Γ_1^a:

Γ_0^0 = P^0(0) / (P^0(0) + P^0(1)) ;   Γ_0^a = P^0(1) / (P^0(0) + P^0(1)) .

The non-linear stochastic differential equation obtained for z_t ≜ Γ_t^a is identical to that derived by Kulman [K-6] and by Stratonovich [S-6], using a completely different technique. Expressions (5.3.31), (5.3.32) and (5.3.33) permit computing the logarithm of the likelihood ratio for any time t ∈ [0, T̄]. The block diagram of the optimum receiver is realized in terms of these expressions in Fig. 5.3.1.

Fig. 5.3.1: The Block Diagram of the Optimum Detector. (The observation x_t = λ_t + n_t drives the solution of the differential equations (5.3.32)-(5.3.33) for Γ_t^0, Γ_t^a and of (5.3.31) for L_t; L_{T̄} is compared with the threshold η* and the maximum selects ω_0 or ω_1.)
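A minimal simulation of the receiver in Fig. 5.3.1 needs only a grid-sampled telegraph path, additive noise of per-sample variance v_0/dt (cf. (5.2.3)), and an Euler integration of (5.3.31) and (5.3.33). Everything below is an illustrative choice rather than part of the thesis: the grid step, the seed, the stationary initialization P(λ = a) = q_0/(q_0+q_1), and the clipping of Γ_t^a to [0,1].

```python
import numpy as np

rng = np.random.default_rng(0)

def telegraph_path(q0, q1, a, n, dt):
    """Grid-sampled telegraph signal: flip 0 <-> a with prob. ~ rate*dt."""
    lam = np.empty(n)
    up = rng.random() < q0 / (q0 + q1)        # stationary start in state a
    for k in range(n):
        lam[k] = a if up else 0.0
        if rng.random() < (q1 if up else q0) * dt:
            up = not up
    return lam

def telegraph_receiver(x, dt, a, q0, q1, Ga, v0):
    """Euler integration of (5.3.31) and (5.3.33); Ga is Gamma_0^a."""
    L = 0.0                                   # L_0 = 0
    for xt in x:
        c = (a / (2.0 * v0)) * (2.0 * xt - a)
        L += dt * c * Ga                                         # (5.3.31)
        Ga += dt * (q0*(1.0 - Ga) - q1*Ga + c*(1.0 - Ga)*Ga)     # (5.3.33)
        Ga = min(max(Ga, 0.0), 1.0)           # keep the probability in [0,1]
    return L

q0, q1, a, dt, v0, n = 1.0, 1.0, 1.0, 1e-3, 0.5, 10_000
lam = telegraph_path(q0, q1, a, n, dt)
x = lam + rng.normal(0.0, np.sqrt(v0 / dt), n)   # Var(n_k) = v0/dt, cf. (5.2.3)
L = telegraph_receiver(x, dt, a, q0, q1, Ga=q0/(q0 + q1), v0=v0)
print("signal present" if L > 0.0 else "noise alone")
```

Note that Γ_t^0 never needs to be integrated separately, since Γ_t^0 = 1 - Γ_t^a; the receiver state is one scalar probability plus the accumulated L_t.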
Recurrent expressions for the conditional probabilities of error obtained in (5.3.29), (5.3.30) can also be obtained for the case of continuous-time observations. The details of the derivation are as follows. Replacing k with t and k+1 with t+τ, (5.3.29) can be written in a new form:

α_t(L_t, Γ_t^0) = ∫_{X_{t+τ}} α_{t+τ}(L_{t+τ}, Γ_{t+τ}^0) p(x_{t+τ}) dx_{t+τ}   (5.3.34)

Feller [F-1] showed that, under mild regularity conditions, α_t(L_t, Γ_t^0) must be bounded, p(x_{t+τ}) must satisfy the postulates (4.2), (4.3) and (4.4)³, and α must satisfy the "backward diffusion equation,"

-∂α/∂t = A_1 ∂α/∂L + A_2 ∂α/∂Γ^0 + (1/2) B_11 ∂²α/∂L² + B_12 ∂²α/∂L∂Γ^0 + (1/2) B_22 ∂²α/∂(Γ^0)² .   (5.3.35)

³ Feller [F-1], pg. 321.

For the problem considered here, it is easily proved that p(x_{t+τ}) satisfies the above postulates, and since α is a probability of error, |α| ≤ 1. Thus, the regularity conditions are satisfied. The coefficients in (5.3.35) can be determined as follows. From (5.3.34), for τ > 0, the following equation is first obtained:

[ α_{t+τ}(L_t, Γ_t^0) - α_t(L_t, Γ_t^0) ] / τ = (1/τ) ∫_{X_{t+τ}} [ α_{t+τ}(L_{t+τ}, Γ_{t+τ}^0) - α_{t+τ}(L_t, Γ_t^0) ] p(x_{t+τ}) dx_{t+τ}   (5.3.36)

Expanding α_{t+τ}(L_{t+τ}, Γ_{t+τ}^0) in a Taylor series about (L_t, Γ_t^0) and taking only the first and second order terms gives

α_{t+τ}(L_{t+τ}, Γ_{t+τ}^0) ≈ α_{t+τ}(L_t, Γ_t^0) + (∂α_{t+τ}/∂L) ΔL_τ + (∂α_{t+τ}/∂Γ^0) ΔΓ_τ + (1/2)(∂²α_{t+τ}/∂L²)(ΔL_τ)² + (∂²α_{t+τ}/∂L∂Γ^0) ΔL_τ ΔΓ_τ + (1/2)(∂²α_{t+τ}/∂(Γ^0)²)(ΔΓ_τ)² ,

where ΔL_τ ≜ L_{t+τ} - L_t and ΔΓ_τ ≜ Γ_{t+τ}^0 - Γ_t^0. Substituting this expansion into (5.3.36), performing the resulting integration and then passing to the limit as τ → 0 produce the coefficients in (5.3.37). The resulting partial stochastic differential equation can be written as:

∂α/∂t - (a²/2v_0)(1 - Γ^0) ∂α/∂L + [ q_1 - (q_0 + q_1)Γ^0 + (a²/2v_0)Γ^0(1 - Γ^0) ] ∂α/∂Γ^0
   + (a²/2v_0)(1 - Γ^0)² ∂²α/∂L² - (a²/v_0)Γ^0(1 - Γ^0)² ∂²α/∂L∂Γ^0 + (a²/2v_0)(Γ^0)²(1 - Γ^0)² ∂²α/∂(Γ^0)² = 0   (5.3.37)

Using similar arguments, a partial stochastic differential equation can be obtained for the conditional probability of error of the second kind and is given below.

∂β/∂t + (a²/2v_0)(1 - Γ^0)(1 - 2Γ^0) ∂β/∂L + [ q_1 - (q_0 + q_1)Γ^0 - (a²/2v_0)Γ^0(1 - Γ^0)(1 - 2Γ^0) ] ∂β/∂Γ^0
   + (a²/2v_0)(1 - Γ^0)² ∂²β/∂L² - (a²/v_0)Γ^0(1 - Γ^0)² ∂²β/∂L∂Γ^0 + (a²/2v_0)(Γ^0)²(1 - Γ^0)² ∂²β/∂(Γ^0)² = 0   (5.3.38)

where α ≜ α_t(L_t, Γ_t^0), β ≜ β_t(L_t, Γ_t^0), L ≜ L_t and Γ^0 ≜ Γ_t^0. Equations (5.3.37) and (5.3.38) describe the change in a conditional error probability over a recurrent period of time. From (5.3.34), it follows that these probabilities must satisfy the conditions

α_{T̄}(L, Γ^0) = u(L - η*) ;   β_{T̄}(L, Γ^0) = 1 - u(L - η*)                 (5.3.39)

where T̄ is the time at which the detection is completed. If α_t(L_t, Γ_t^0) and β_t(L_t, Γ_t^0) are the solutions of equations (5.3.37), (5.3.38) with the boundary conditions (5.3.39), then the first and second kinds of probabilities of error are determined by the expressions

α_0 = α_0( 0, P^0(0)/(P^0(0) + P^0(1)) ) ;   β_0 = β_0( 0, P^0(0)/(P^0(0) + P^0(1)) ) .
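In principle, (5.3.37) with the terminal condition (5.3.39) can be integrated backward in time on an (L, Γ^0) grid to tabulate α_0. The explicit finite-difference sketch below is illustrative only; the grid limits, grid sizes and the explicit scheme are ad hoc choices (an implicit or upwind scheme would be needed for a reliable solution), and the coefficients are those appearing in (5.3.37).

```python
import numpy as np

def alpha_pde_sketch(a, v0, q0, q1, eta_star, T, nt=4000, nL=121, nG=31):
    """Step the conditional error probability alpha of (5.3.37) backward
    from the terminal condition (5.3.39) at t = T to t = 0."""
    L = np.linspace(-10.0, 10.0, nL); dL = L[1] - L[0]
    G = np.linspace(0.0, 1.0, nG);    dG = G[1] - G[0]
    LL, GG = np.meshgrid(L, G, indexing="ij")
    alpha = (LL > eta_star).astype(float)            # alpha at t = T
    c = a * a / (2.0 * v0)
    bL = -c * (1.0 - GG)                             # drift of L_t under theta_0
    bG = q1 - (q0 + q1)*GG + c*GG*(1.0 - GG)         # drift of Gamma_t^0
    dt = T / nt
    for _ in range(nt):                              # t runs from T down to 0
        aL  = np.gradient(alpha, dL, axis=0)
        aG  = np.gradient(alpha, dG, axis=1)
        aLL = np.gradient(aL, dL, axis=0)
        aGG = np.gradient(aG, dG, axis=1)
        aLG = np.gradient(aL, dG, axis=1)
        Lop = (bL*aL + bG*aG + c*(1 - GG)**2 * aLL
               - 2.0*c*GG*(1 - GG)**2 * aLG
               + c*GG**2 * (1 - GG)**2 * aGG)
        alpha = np.clip(alpha + dt * Lop, 0.0, 1.0)  # backward step: d(alpha)/dt = -Lop
    return L, G, alpha   # evaluate at L = 0, Gamma^0 = P0(0) to read off alpha_0
```

The same loop with the drift coefficients of (5.3.38) would produce β_0; only bL and bG change.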
5.4 CONCLUSIONS

In the first part of this chapter, the optimal decision making problem was investigated for the model which consists of two pattern classes characterized by different N-state continuous-parameter Markov chains. All parameters of the underlying model were assumed known. Because of the noisy medium, which affects the model in an additive manner, neither the states nor the transition times of the sample function generated by these chains could be observed directly.

In Sec. 5.1, assuming additive white Gaussian noise and a discrete-time sampling scheme, the observation process was defined and the Bayes likelihood ratio algorithm was derived. The likelihood ratio and a statistic (5.1.10) were computed recursively. Continuous-time results were obtained by introducing a new sampling scheme and applying a limiting argument. A set of non-linear differential equations was obtained for the logarithm of the likelihood ratio and for the statistics defined in (5.1.10).

The second part of this chapter considered the problem of detecting a random telegraph signal with a fixed observation time in additive white Gaussian noise. It was shown that the problem is a special case of the model considered in the first part. The optimum detector was defined in Sec. 5.3.1 and its basic components, (5.3.9) and (5.3.11), were generated recursively. Iterative schemes for computing the mean and variance of the logarithm of the likelihood ratio were given in Sec. 5.3.2. In order to analyze the quality of the optimum detector, it was necessary to consider the probabilities of error of the first and second kind. For this, the conditional error probabilities were first investigated in Sec. 5.3.3. Recurrent relationships were established for them, from which the usual error probabilities were found as functions of the observation time and the parameters of the problem. Finally, in Sec. 5.3.4, the case of continuous observation times was considered. Stochastic differential equations were obtained for the logarithm of the likelihood ratio and for the conditional probabilities of error.

CHAPTER VI

GENERAL CONCLUSIONS AND EXTENSIONS

The main objectives of the thesis are reviewed in this chapter and possibilities for future research are discussed.

6.1 CONCLUSIONS

This thesis has been concerned with Bayesian decision making and learning algorithms for a particular problem in parametric pattern recognition. Each of M pattern classes was characterized by an N-state, continuous-time, homogeneous Markov chain, and the infinitesimal-rate matrices (Q-matrices) for the chains were the parameters of the problem. The object of the decision rules was to decide which of the M chains produced the observed sample function. Statistical decision theory was employed throughout the thesis to develop optimal solutions for a special loss function (the 0-1 loss function).

In Chapter II, the Bayes-optimum decision rules were derived under a perfect observation mechanism (noiseless case). The observable quantities, or the "features," were the sojourn times in the states and the state numbers themselves. Using classified training data from each pattern class, an algorithm for supervised learning was presented and the existence of reproducing prior densities for the parameters was demonstrated. The main result was the formulation of recursive, computationally simple parametric forms for the posterior densities of the unknown parameters and for the components of the optimum decision rules. It was shown that the amount of computer storage required to implement these algorithms was fixed and finite. Computer simulations of a specific example gave curves showing probability of error versus the length of the observation sequence for different amounts of training data and demonstrated the inherent practicality of the results.

The problem of computing probability of error was investigated for the noiseless case in Chapter III. All parameters in the model were assumed known and there were only two pattern classes. The exact probability of error, as well as lower and upper bounds, was established for several cases. Conditional error probabilities of the first and second kinds were introduced, by which the usual probability of error could be computed iteratively. The main conclusion of the chapter was that, even for a simple case, exact expressions for the probability of error are very complicated and difficult to evaluate. However, the fact that the probability of error decreases toward zero as the number of samples increases without bound was proved by employing a Bhattacharyya bound. The asymptotic behavior of the error probability was also studied.

The model of Chapter II was extended in Chapter IV to include the case in which the observations are made in a noisy medium. In the new model, the states of the chains were described by random processes, but sojourn times could be observed. The optimal and the adaptive-optimal decision rules were defined and their basic components were generated iteratively. The asymptotic optimality of the adaptive rules was exhibited. For the known-parameter case, the iterative optimal decision rule can be implemented on a computer using only a finite memory. However, computer time increases linearly with the number of observation sequences. One can also conclude that the optimum-adaptive rule is a fixed-memory, iterative rule and only a high-storage quantization procedure is needed to implement it.

Finally, Chapter V dealt with the case when neither the transition times nor the states could be observed but the processes defining the states were white Gaussian processes. This is equivalent to additive white Gaussian noise.
The Bayes likelihood algorithm was established for the optimal decision rule assuming two pattern classes and a discrete-time sampling scheme. Non-linear stochastic differential equations were obtained for the logarithm of the likelihood ratio and for the conditional error probabilities when the sampling scheme was changed to permit observation of the entire sample function. The results were applied to the specific problem of detecting a random telegraph signal in white noise. In general, the algorithms for the likelihood ratio are computationally feasible. The implementation of the optimal decision rule with discrete-time sampling requires knowledge of the stationary transition probability functions for each pattern class. These functions can be obtained by solving a system of N² linear differential equations with constant coefficients. The continuous sampling scheme leads to much simpler algorithms and does not require the transition functions.

6.2 EXTENSIONS

The work in this thesis suggests several extensions that should be investigated.

First is the problem of developing sub-optimum decision rules that are easier to implement than the high-storage quantization procedure of Sec. 4.4.2 implicit in the optimal-adaptive rule. Strongly consistent estimators for the unknown Q-matrices might be developed. These are functions of the observations that converge with probability one to the true value of the parameter. Conditions for the existence of such estimators are important because, in Sec. 4.4.3, it was shown that a strongly consistent estimator for q_0 must be exhibited to ensure that these parameters will be learned during the operation of the optimal rule.

Secondly, the expressions obtained for the probability of error can be extended to the cases where M patterns are present and the parameters of the underlying models are not known. The conditional error probabilities, introduced in Sec. 3.4, suggest a useful computer implementation algorithm for evaluating adaptive decision-making devices.

In Chapter V, stochastic differential equations were derived for several quantities, but these equations were not solved. Since, in general, these differential equations are non-linear as well as stochastic, one must not expect to obtain exact analytical solutions for them. A natural extension of these results would be to seek approximate solution methods, such as expressing them in the form of difference equations which can be solved on a computer, or, under some reasonable assumptions, reducing them to stochastic linear differential equations for which exact solutions may exist.

Finally, a particularly significant theorem concerning the Bayes likelihood algorithm in Chapter V can be proved: the probability of error goes to zero as the number of samples, k, increases without bound. This follows from the fact that the expected value of the logarithm of the likelihood ratio is a monotonically increasing function of k under θ_1 and a monotonically decreasing function of k under θ_0.

BIBLIOGRAPHY

[A-1] Anderson, T. W., Introduction to Multivariate Statistical Analysis, Wiley, New York, 1958.

[B-1] Bartlett, M. S., "The Frequency Goodness-of-fit Test for Probability Chains," Proc. Camb. Philos. Soc., Vol. 47, pp. 86-95, 1951.

[B-2] Billingsley, P., "Statistical Methods in Markov Chains," Ann. Math. Stat., Vol. 32, pp. 12-40, 1961.

[B-3] Blackwell, D. and M. A. Girshick, Theory of Games and Statistical Decisions, Wiley, New York, 1954.

[B-4] Braverman, D., "Machine Learning and Automatic Pattern Recognition," Stanford Electronics Laboratories, Technical Report No. 2003-1, Feb. 1961.

[C-1] Chiang, C. L., Introduction to Stochastic Processes in Biostatistics, Wiley, New York, 1968.

[C-2] Chung, K. L., Markov Chains with Stationary Transition Probabilities, Springer-Verlag, Berlin, 1960.

[C-3] Cramér, H., Mathematical Methods of Statistics, Princeton U. Press, 1958.

[D-1] Daly, R. F., "Adaptive Binary Detectors," Stanford Electronics Lab., Technical Report No. 2003-2, June 26, 1961.

[D-2] Doob, J. L., Stochastic Processes, Wiley, New York, 1953.

[D-3] Drake, A. W., "Observation of a Markov Source Through a Noisy Channel," presented at the IEEE Symposium on Signal Transmission and Processing, Columbia University, May 1965.

[D-4] Dubes, R. C. and Donoghue, P. J., "Bayesian Learning in Markov Chains with Observable States," Interim Report No. 5, Contract No. AFOSR-1023-67B, Div. Engineering Research, Michigan State University.

[D-5] Dubes, R. C., Hung, A., and McCrum, W. R., "Classification of Electroencephalograms with Pattern Recognition Algorithms," Interim Report No. 4, Contract No. AFOSR-1023-67B, Div. Engineering Research, Michigan State University.
[D-6] Dubes, R. C., The Theory of Applied Probability, Prentice-Hall, Englewood Cliffs, N.J., 1968.

[F-1] Feller, W., An Introduction to Probability Theory and Its Applications, Vol. 2, Wiley, New York, 1966.

[F-2] Ferguson, T. S., Mathematical Statistics: A Decision Theoretic Approach, Academic Press, New York, 1967.

[F-3] Fisher, R. A., Contributions to Mathematical Statistics, Wiley, New York, 1952.

[F-4] Fu, K. S., Sequential Methods in Pattern Recognition and Machine Learning, Academic Press, New York, 1968.

[G-1] Good, I. J., "The Frequency Count of a Markov Chain and the Transition to Continuous Time," Ann. Math. Stat., pp. 41-47, October 29, 1960.

[H-1] Hilborn, C. G. and D. G. Lainiotis, "Optimal Estimation in the Presence of Unknown Parameters," IEEE Trans. on Systems Science and Cybernetics, Vol. SSC-5, January 1965.

[K-1] Kadota, T. T. and L. A. Shepp, "On the Best Finite Set of Linear Observables for Discriminating Two Gaussian Signals," IEEE Trans. on Information Theory, Vol. IT-13, pp. 278-284, April 1967.

[K-2] Kailath, T., "A General Likelihood Ratio Formula for Random Signals in Gaussian Noise," IEEE Trans. on Information Theory, Vol. IT-15, pp. 350-361, May 1969.

[K-3] Kanal, L., "Basic Principles of Some Pattern Recognition Systems," Proc. National Electronics Conference, Vol. 18, pp. 279-295, October 1962.

[K-4] Karlin, S., A First Course in Stochastic Processes, Academic Press, New York, 1966.

[K-5] Keinosuke, F. and L. G. Koontz, "Application of the Karhunen-Loève Expansion to Feature Selection and Ordering," IEEE Trans. on Computers, Vol. C-19, No. 4, April 1970.

[K-6] Kulman, N. K., "Certain Optimal Devices for Detection of a Pulse Signal of Random Duration in the Presence of Noise," Radio Engineering and Electronic Physics, Vol. 6, No. 9, pp. 1279-1288, 1961.

[L-1] Loève, M., Probability Theory, Van Nostrand, Princeton, N.J., 1963.

[M-1] McLendon, J. R., "A Pseudo Bayes Approach to Digital Detection and Likelihood Ratio Computation," Ph.D. Thesis, Southern Methodist University, 1969.

[M-2] Martin, J. J., Bayesian Decision Problems and Markov Chains, Wiley, New York, 1967.
[N-1] Nagy, G., "State of the Art in Pattern Recognition," Proc. IEEE, Vol. 56, No. 5, May 1968.

[N-2] Nifontov, Y. A. and V. A. Likharev, "Optimal Detection of a Binary Quantized Markov Signal in the Presence of Noise Similar to the Signal," Engineering Cybernetics, No. 6, 1968.

[N-3] Nilsson, N. J., Learning Machines, McGraw-Hill, New York, 1966.

[P-1] Patrick, E. A. and J. C. Hancock, "Learning Probability Spaces for Classification and Recognition of Patterns With or Without Supervision," Purdue University Report TR-EE-65-21, November 1965.

[P-2] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill System Science Series, New York, 1965.

[R-1] Raiffa, H. and R. Schlaifer, Applied Statistical Decision Theory, The M.I.T. Press, Cambridge, Massachusetts, 1961.

[R-2] Raviv, J., "Decision Making in Incompletely Known Stochastic Systems," Int. J. Eng. Sci., Vol. 3, pp. 119-140, Pergamon Press, 1965.

[R-3] Robbins, H., "The Empirical Bayes Approach to Statistical Decision Problems," Ann. Math. Stat., Vol. 35, pp. 1-20, 1964.

[R-4] Royden, H. L., Real Analysis, Macmillan, New York, 1963.

[S-1] Sebestyen, G. S., Decision-Making Processes in Pattern Recognition, Macmillan, 1962.

[S-2] Signori, D. T., "Estimation and Adaptive Decision Making for Partially Observable Markov Systems," Ph.D. Thesis, Michigan State University, 1968.

[S-3] Silver, E. D., "Markovian Decision Processes with Uncertain Transition Probabilities or Rewards," Ph.D. Thesis, Massachusetts Institute of Technology, 1963.

[S-4] Sosolin, Y. G., "Optimal Detection of Markov Signals and Markov Noise with Discrete Time," Engineering Cybernetics, No. 6, 1966.

[S-5] Spragins, J., "A Note on the Iterative Application of Bayes' Rule," IEEE Trans. on Information Theory, Vol. IT-11, No. 4, October 1965.

[S-6] Stratonovich, R. L., Conditional Markov Processes, American Elsevier Publishing Company, Inc., New York, 1968.

[T-1] Takacs, L., Stochastic Processes, Problems and Solutions, Methuen's Monographs on Applied Probability and Statistics, 1960.

[T-2] Tou, J. T., Computer and Information Sciences-II, Academic Press, New York, 1967.

[T-3] Tucker, H. G., A Graduate Course in Probability Theory, Academic Press, 1967.

[W-1] Wilks, S. S., Mathematical Statistics, Wiley, New York, 1962.

[W-2] Wassily, H., "Probability Inequalities for Sums of Bounded Random Variables," American Stat. Association J., March 1963.

[W-3] Watanabe, S., Methodologies of Pattern Recognition, Academic Press, New York, 1969.

APPENDICES

APPENDIX A. DEFINITIONS AND PROPERTIES

In this appendix, some important definitions and theorems related to continuous-parameter Markov chains are given. Most results are based on Doob [D-2] and Chung [C-2].

Consider a family of real random variables indexed by t, {x_t, 0 ≤ t < ∞}, on a probability triplet (Ω, 𝔉, P), where the set of possible values of x_t is a set of non-negative integers. Here Ω is a probability space, 𝔉 is a σ-field of subsets of Ω, and P is a probability measure defined on 𝔉.

DEFINITION 1.A The stochastic process {x_t, 0 ≤ t < ∞} is said to be a homogeneous, continuous-time, discrete-state Markov process (continuous-parameter Markov chain) if and only if the following two conditions are satisfied:

i) Markov property:

P(x_{t_n} = i_n | x_{t_{n-1}} = i_{n-1}, ..., x_{t_1} = i_1) = P(x_{t_n} = i_n | x_{t_{n-1}} = i_{n-1})   (A.1)

for every n ≥ 2, 0 ≤ t_1 < t_2 < ... < t_n, and for all possible values of the random variables in question.
ii) Homogeneity property: The chain is said to be homogeneous, or to have stationary transition probabilities, if and only if

P(x_{t+h} = j | x_h = i) = P_ij(t) ,   ∀ i,j ∈ Λ and t > 0 ,                 (A.2)

where P_ij(t) is independent of h ≥ 0.

DEFINITION 2.A The functions {P_ij(t)}_{i,j=1}^N are called stationary transition probability functions if and only if the following conditions are satisfied:

A1-1.  P_ij(t) ≥ 0

A1-2.  Σ_{j=1}^N P_ij(t) = 1 ,   t > 0, ∀ i ∈ Λ

A1-3.  Σ_{k=1}^N P_ik(t) P_kj(s) = P_ij(t + s) ,   s,t > 0

The following continuity condition is also assumed to be satisfied for t > 0:

A1-4.  lim_{t→0} P_ij(t) = δ_ij

The matrix [P_ij(t)]_{N×N} is called the stationary transition probability matrix.

DEFINITION 3.A The process {x_t; 0 ≤ t < ∞} is said to be separable (relative to the class of closed sets) if and only if there exist a denumerable subset R ⊂ {t; 0 ≤ t < ∞}, called a separability set, and a null set N with the following property: for any closed linear set A and open interval G in (-∞,∞),

{ω: x_t(ω) ∈ A, t ∈ G ∩ R} - {ω: x_t(ω) ∈ A, t ∈ G ∩ [0,∞)} ⊂ N .

The main results in this thesis are based upon the following theorem, which is given without proof.

THEOREM 1.A If [P_ij(t)] is the stationary transition probability matrix, then the following limits exist:

lim_{t→0} (1 - P_ii(t))/t = -P'_ii(0) ≜ q_i < ∞ ,   ∀ i ∈ Λ                  (A.3)

lim_{t→0} P_ij(t)/t = P'_ij(0) ≜ q_ij < ∞ ,   ∀ i ≠ j ∈ Λ .                  (A.4)

Moreover, if q_i > 0 and x_{t_0} = i, there is a sample function discontinuity for some t > t_0 with probability one (w.p.1). If 0 < α ≤ ∞, and if there is a discontinuity in the interval [t_0, t_0 + α], the probability that the first jump is to j is q_ij/q_i.

The matrix Q = [q_ij]_{i,j=1}^N, where q_ii ≜ -q_i, defined above is called the infinitesimal matrix, the transition-rate matrix, or the Q-matrix. From A1-1 and A1-2 it follows that

q_ij ≥ 0 ,   Σ_{j≠i} q_ij = q_i ,   ∀ i,j ∈ Λ (j ≠ i) .                      (A.6)

Differentiating A1-3 with respect to each variable and setting the result equal to zero produces the following system of differential equations for the stationary transition probability functions, which is called the "backward differential equation system":

P'_ik(t) = -q_i P_ik(t) + Σ_{j≠i} q_ij P_jk(t) ,   ∀ i,k ∈ Λ .               (A.7)

The initial conditions are given by

P_ij(0) = δ_ij ,   ∀ i,j ∈ Λ .                                               (A.8)

The following question is of interest: what are the necessary and sufficient conditions that the q_i's and q_ij's must satisfy to determine the {P_ij(t)}_{i,j=1}^N uniquely? Or, in other words, what conditions on the q_i's and q_ij's ensure a unique solution to the backward differential equation system? Doob [D-2] shows that the necessary and sufficient conditions under which an allowable set of stationary transition probabilities is defined by the backward differential equation system for a given infinitesimal Q-matrix are that the process is separable (which assures that the sojourn times in the states are independent and exponentially distributed), has a finite number of states, satisfies (A.6), and that the states are stable (i.e., q_i > 0). For a discrete-parameter Markov chain, these conditions imply that the process is ergodic. As long as the above conditions are satisfied, a unique continuous-parameter Markov chain with the stationary transition probability matrix [P_ij(t)] can always be found from a given Q-matrix. The process constructed in this way is called a minimal process.
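For a finite Q-matrix, the backward system (A.7) with initial condition (A.8) can be checked numerically: its solution is the matrix exponential P(t) = e^{Qt} of the minimal process. The following sketch (rate values, tolerances and function name are arbitrary) integrates (A.7) directly and compares the result with the exponential:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import expm

def make_Q(rates):
    """Q-matrix from non-negative off-diagonal rates q_ij, with
    q_ii = -q_i = -sum_{j != i} q_ij as required by (A.6)."""
    Q = np.asarray(rates, dtype=float)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

Q = make_Q([[0.0, 0.4, 0.6],
            [0.2, 0.0, 0.3],
            [0.5, 0.5, 0.0]])
N = Q.shape[0]
t_end = 2.0
# Backward system (A.7): P'(t) = Q P(t), with P(0) = I per (A.8).
sol = solve_ivp(lambda t, p: (Q @ p.reshape(N, N)).ravel(),
                (0.0, t_end), np.eye(N).ravel(), rtol=1e-9, atol=1e-12)
P = sol.y[:, -1].reshape(N, N)
assert np.allclose(P, expm(Q * t_end), atol=1e-6)  # minimal-process solution
assert np.allclose(P.sum(axis=1), 1.0)             # rows satisfy A1-2
```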
The relation between the P_ik(t)'s and the q_i's, q_ij's is given by one of the following integral equations:

P_ik(t) = δ_ik e^{-q_i t} + Σ_{j≠i} ∫_0^t q_ij e^{-q_i s} P_jk(t - s) ds      (A.9)

or

P_ik(t) = δ_ik e^{-q_k t} + Σ_{j≠k} ∫_0^t P_ij(s) q_jk e^{-q_k (t - s)} ds .  (A.10)

If the eigenvalues of the Q-matrix are real and distinct, then the solutions of the above integral equations are the same and are given by

P_ik(t) = Σ_{ℓ=1}^N [ A_ik(ℓ) e^{ρ_ℓ t} / Π_{m≠ℓ} (ρ_ℓ - ρ_m) ] ,   ∀ i,k ∈ Λ ,   (A.11)

where ρ_1, ρ_2, ..., ρ_N are the eigenvalues of the Q-matrix and A_ik(ℓ) is the cofactor of the matrix A(ρ) = [ρI - Q] evaluated at ρ = ρ_ℓ. Here, I is the unit matrix.

Doob also proves that the sample functions of a continuous-parameter Markov chain satisfying the conditions stated above are almost all step functions. The corresponding continuous-parameter Markov chains can thus be constructed in terms of their sample functions in the following manner. Let x_1, x_2, ... be random variables taking values in Λ = {1,2,...,N} only. Let τ_1, τ_2, ... be positive random variables, and suppose {q_i}_{i=1}^N and {q_ij}_{j≠i} are any real positive numbers satisfying (A.6) with 0 < q_i < ∞. Define the following conditional probabilities:

P(τ_1 ≤ t_1 | x_1 = i) = 1 - e^{-q_i t_1} ,   t_1 > 0                         (A.12)

P(x_k = j | x_{k-1} = i, ..., x_1 = ℓ, t_{k-1}, ..., t_1) = q_ij / q_i ,   q_i > 0, ∀ i,j ∈ Λ (j ≠ i)   (A.13)

P(τ_k ≤ t_k | t_{k-1}, ..., t_1, x_k = j, ..., x_1) = 1 - e^{-q_j t_k} ,   t_k > 0, j ∈ Λ .   (A.14)

This shows that the random variables {x_k}_{k=1}^∞ have a Markov dependency and the random variables {τ_k}_{k=1}^∞ are conditionally independent. Now, define the process {x_t; 0 ≤ t < ∞} as follows:

x_t = x_n   if   ρ_{n-1} ≤ t < ρ_n ,

where τ_n ≜ ρ_n - ρ_{n-1} and ρ_0 ≡ 0. Then it is easy to show that the process defined in this way is a continuous-parameter Markov chain and that the q_i's and q_ij's satisfy the limit conditions specified in Theorem 1.A. The x_n's are the state numbers and the τ_n's are the sojourn times in these states. A typical sample function from this process is illustrated in Fig. A.1.

Fig. A.1: Typical sample function from a continuous-time Markov chain.
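The construction (A.12)-(A.14) translates directly into a simulation recipe: draw an exponential sojourn with rate q_i, then jump to state j with probability q_ij/q_i. A sketch (the function name and seed are arbitrary, and states are indexed from 0 rather than 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_path(Q, x0, T):
    """One sample function of the chain on [0, T] from its Q-matrix,
    following (A.12)-(A.14). Returns (jump_times, states)."""
    N = Q.shape[0]
    times, states = [0.0], [x0]
    t, i = 0.0, x0
    while True:
        qi = -Q[i, i]                          # q_i = sum_{j != i} q_ij
        t += rng.exponential(1.0 / qi)         # sojourn time, (A.12)/(A.14)
        if t >= T:
            return np.array(times), np.array(states)
        p = Q[i].copy(); p[i] = 0.0; p /= qi   # jump distribution q_ij/q_i, (A.13)
        i = rng.choice(N, p=p)
        times.append(t); states.append(i)
```

This generator assumes stable states (q_i > 0), exactly the condition under which Appendix A guarantees that almost all sample functions are step functions.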
APPENDIX B. DERIVATION OF OPTIMAL DECISION RULES

The basic elements of statistical decision theory are summarized in this appendix. Most of the results are taken from the literature, but are presented from a slightly different point of view, more suitable for the problem of interest in this thesis.

A problem in decision theory consists of three basic elements:

1. A non-empty set, Θ, of possible states of nature (parameter space);
2. A non-empty set, 𝒜, called an action space;
3. A real-valued loss function, L(θ,a), defined on Θ × 𝒜.

In pattern recognition terminology, a state of nature is called a "pattern class." Before making a decision, a random vector, (x^k, t^k) ≜ (x_1,...,x_k, t_1,...,t_k), is observed whose distribution depends on the true state of nature θ ∈ Θ. The sample space, S_k = {ω: ω = (x^k, t^k)}, is taken to be a Borel subset of a finite-dimensional Euclidean space, and the probability distributions of (x^k, t^k) are defined on the class ℬ of Borel subsets of S_k. For each θ ∈ Θ, there is a product measure μ(x^k) × ν(t^k) defined on ℬ and a corresponding multivariate density function g(x^k, t^k | θ), which represents the distribution of (x^k, t^k) when θ is the true state of nature.

On the basis of the set of observations (x^k, t^k), an action d(x^k, t^k) ∈ 𝒜 is chosen. In choosing d(x^k, t^k), the loss L[θ, d(x^k, t^k)], θ ∈ Θ, which is a random quantity, is incurred. The expected value of L[θ, d(x^k, t^k)] when θ is the true state of nature is called the risk:

R(θ, d) = E_θ L[θ, d(x^k, t^k)] .                                            (B.1)

Any function, d(·), that maps the sample space, S_k, into 𝒜 is called a non-randomized decision rule, provided the risk function, R(θ,d), exists and is finite.

The Bayes principle involves the notion of the Bayes risk. A distribution P^0, called a prior distribution, is imposed on the parameter space. The Bayes risk is the expectation of the risk function with respect to P^0; namely,

r(P^0, d) = E R(Z, d) ,                                                      (B.2)

where Z is a random variable over Θ having distribution P^0.

DEFINITION B.1 A non-randomized decision rule, d*(x^k, t^k), is said to be Bayes with respect to the prior distribution P^0 if and only if

r(P^0, d*) = inf_d r(P^0, d) .

Since q_ij^0 ≜ q_i^0 r_ij^0 and δ((q_ij - q_ij^0)/r_ij^0) = r_ij^0 δ(q_ij - q_ij^0), the result follows. That is,

lim_{n→∞} g_n(q_ij | y^n, τ^n) = δ(q_ij - q_ij^0) ,   ∀ i,j ∈ Λ (j ≠ i) ;

the posterior density of q_ij concentrates on the true value q_ij^0 as the number of classified samples increases without bound.

Figure D.1: Algorithms for a single error curve and for decision making. (The first flow chart generates an empirical error curve: a pattern class r is chosen with probability P_r^0; a classified sequence (x^k, t^k) = (x_1,...,x_k, t_1,...,t_k) is generated for class r and classified; misclassifications are counted over i = 1,...,100 trials; the error probability is computed; and k is advanced in steps of 40 until k > 120. The second flow chart implements the decision rule: as each (x_{j+1}, t_{j+1}) is observed, g_r(x^k, t^k | y^{n_r}, τ^{n_r}) is computed from Eq. (2.8.1) and the maximum of p_r^0 g_r(x^k, t^k | y^{n_r}, τ^{n_r}) over r = 1,2,...,m is found.)

APPENDIX E. THE PROOF OF LEMMA 3.3.2

First, it is shown that -∞ < m_1 < 0. Since q_ij > p_ij and q_i > p_i, ∀ i ∈ Λ, it follows from (3.3.19), (3.3.18) and (3.3.9) that

m_1 = Σ_{i=1}^N Σ_{j≠i} (p_i q_ij / q_i) ln(p_ij / q_ij) + Σ_{i=1}^N p_i (q_i - p_i)/q_i .

Because q_ij > p_ij ⟹ ln(q_ij/p_ij) > 0, m_1 can be rewritten as

m_1 = Σ_{i=1}^N (p_i / q_i) [ (q_i - p_i) - Σ_{j≠i} q_ij ln(q_ij / p_ij) ] .

Using the inequality ln(q_ij/p_ij) ≥ 1 - p_ij/q_ij in the above, it follows that

Σ_{j≠i} q_ij ln(q_ij / p_ij) ≥ Σ_{j≠i} q_ij (1 - p_ij/q_ij) = Σ_{j≠i} (q_ij - p_ij) = q_i - p_i > 0 ,   ∀ i ∈ Λ .

Thus,

Σ_{j≠i} q_ij ln(q_ij / p_ij) ≥ q_i - p_i ,   i ∈ Λ   ⟹   -∞ < m_1 < 0 .

Now, it is shown that 0 < m_2 < +∞. From (3.3.20), (3.3.19) and (3.3.9),

m_2 = Σ_{i=1}^N Σ_{j≠i} (p_i p_ij / p_i) ln(p_ij / q_ij) + Σ_{i=1}^N p_i (q_i - p_i)/p_i .

Using the same argument as in the first part of the proof, m_2 can be written as

m_2 = Σ_{i=1}^N (p_i / p_i) [ (q_i - p_i) - Σ_{j≠i} p_ij ln(q_ij / p_ij) ] .

Using the inequality ln(q_ij/p_ij) ≤ q_ij/p_ij - 1 in the above, it follows that

Σ_{j≠i} p_ij ln(q_ij / p_ij) ≤ Σ_{j≠i} p_ij (q_ij/p_ij - 1) = Σ_{j≠i} (q_ij - p_ij) = q_i - p_i ,   ∀ i ∈ Λ .

Thus,

Σ_{j≠i} p_ij ln(q_ij / p_ij) ≤ q_i - p_i ,   i ∈ Λ   ⟹   0 < m_2 < +∞ .
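The two bounds of Lemma 3.3.2 can be spot-checked numerically: for any rates with q_ij > p_ij and any positive weights on the states, the bracketed terms above have the asserted signs. The sketch below uses the bracketed forms of m_1 and m_2 exactly as they appear in this proof; the uniform weights and random rate values are arbitrary test choices, not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)

N = 4
p = rng.uniform(0.2, 1.0, (N, N)); np.fill_diagonal(p, 0.0)      # p_ij
q = p + rng.uniform(0.1, 1.0, (N, N)); np.fill_diagonal(q, 0.0)  # q_ij > p_ij
pi_, qi_ = p.sum(axis=1), q.sum(axis=1)                          # p_i, q_i
w = np.full(N, 1.0 / N)                                          # state weights

off = ~np.eye(N, dtype=bool)
lg = np.zeros((N, N)); lg[off] = np.log(q[off] / p[off])         # ln(q_ij/p_ij)
m1 = np.sum(w / qi_ * ((qi_ - pi_) - (q * lg).sum(axis=1)))
m2 = np.sum(w / pi_ * ((qi_ - pi_) - (p * lg).sum(axis=1)))
assert m1 < 0.0 < m2
print(m1, m2)
```

The assertion holds for every draw, since ln x ≥ 1 - 1/x forces the m_1 bracket negative and ln x ≤ x - 1 forces the m_2 bracket positive, exactly as in the proof.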