ERROR RRoaARJJJTJ JR UNSUPERVESEE} REFERRERJ _ H5 " , PATTERN CLASSIFICATIQN . » Thesis for the Degtee of Ph 0.. MICHIGAN STATE UNIVERS‘TY . JOHN JAMES FORSYTH 197,1 tr?" ‘ L I B R A I. Y Michigan State University This is to certify that the thesis entitled ERROR PROBABILITY IN UNSUPERVISED DEPENDENT PATTERN CLASSIFICATION presented by John James Forsyth has been accepted towards fulfillment of the requirements for Ph.D. degreein Electrical Engineering & Systems Science //7 \ .” “ A 't I/" ./ r" 7 ¢ . ll, / ;T I, a I I -‘l O. www.cstg-J/ Ce , Vou- .4. Major professor Date Qié/ ‘jéz /};)/ 0-7639 ABSTRACT ERROR PROBABILITY IN UNSUPERVISED DEPENDENT PATTERN CLASSIFICATION BY John James Forsyth This thesis investigates a communications problem within the framework of unsupervised multi-category pat- tern recognition. Consider a digital data communications system in which a source randomly selects symbols from an alphabet, encodes the symbols as a string of digits (the code being of fixed but unknown length) and transmits the resulting digital data over a channel. A symbol synchroni- zation problem ensues when a receiver locks on to the signal at a time other than when the first digit of a coded symbol arrives. When the receiver does not know the length of the individual symbol codes being received, and when special synchronizing pulses are not present as a guide, then the receiver is faced with processing pat- terns which exhibit statistical dependence. The term "synchronization" as used here refers to the problem of establishing the starting point of each symbol code in the data, which in turn requires determination of the code length. The syn in unsupervis Solutions to both through stochastic a; havior of the whether the s an independen decision proc Since the dec possible pro: A cieta'l decision proc through a lix Probability c Another uses COeffiCient ational dist. dlStance me a John James Forsyth The synchronization problem is treated as a problem in unsupervised multi-category pattern recognition. Solutions to the synchronization problem are develOped both through the Bayes decision process and through a stochastic approximation algorithm. The convergent be- havior of these solutions is proved. It is shown that whether the source generating the codes is governed by an independent or a Markov random process the Bayes decision process for the synchronization problem converges. Since the decision procedures use no training data, the possible probability models for the source must be known. A detailed study of error bounds for the Bayes decision process is presented. One bound is obtained through a limiting process which examines the asymptotic probability of error of a suboptimum decision process. Another uses information theoretic concepts. The roles of two measures of distance between probability distri- butions is examined; those measures are the Bhattacharyya coefficient (Hellinger integral) and the Kolmogorov vari- ational distance. Error bounds based directly on those distance measures are exhibited. in De ERROR PROBABILITY IN UNSUPERVISED DEPENDENT PATTERN CLASSIFICATION BY John James Forsyth A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Electrical Engineering and Systems Science 1971 To my wife and children ii Than] for his thc course of 1 of the writ for his hel Professors t0 particul the efforts (3- L. Park improvement ACKNOWLEDGMENT S Thanks are in order to Professor R. C. Dubes both for his thoughtful guidance and assistance in shaping the course of this research, and for his painstaking review of the writing. Thanks also go to Professor R. J. Reid for his help in defining the motivating problem, and to Professors Reid and R. Staudte for stimulating solutions to particular problems which were encountered. Finally, the efforts of Professors M. G. Keeney, C. V. Page, and G. L. Park in reviewing the thesis and suggesting helpful improvements are acknowledged. iii LIST OF T} LIST 01" F1 LIST OF NC Chapter I. INT H H H H H I I O O O (n J:- (A) k) H II. THE TABLE OF CONTENTS Page LIST OF TABLES . . . . . . . . . . . . vi LIST OF FIGURES. . . . . . . . . . . . vii LIST OF NOTATION . . . . . . . . . . . viii Chapter I. INTRODUCTION . . . . . . . . . . l 1.1 Communications System Models . . . 2 1.2 Symbol Code Synchronization. . . . 4 1.3 Pattern Recognition Methodology . . 7 1.4 Contributions of the Thesis. . . . 15 1.5 Organization of the Thesis . . . . 17 II. THE SYMBOL SYNCHRONIZATION PROBLEM . . . 18 2.1 The Data Generation Model . . . . 18 2.2 Estimating the Symbol Code Length and Synchronization Instant. . . . 22 2.3 Convergence of the Posterior Probability Mass Functions . . . . 26 2.4 A Stochastic Approximation Approach . 29 2.5 Synchronization and Learning Unknown Parameters . . . . . . . . . 36 2.6 Markovian Symbol Selection . . . . 41 2.7 Summary . . . . . . . . . . 43 III. ERROR BOUNDS FOR DECISION PROCESSES. . . 45 3.1 Majority Decision Functions and Error Probability for Dependent Random Variables . . . . . . . 46 3.1.1 Asymptotic Error Probability for Two Pattern Classes. . . 47 3.1.2 Bayes Majority Decision Functions . . . . . . . 57 iv Chag IV. Ulm BIBLIOGR; Chapter Page 3.1.3 Asymptotic Error Probability for m Pattern Classes . . . . 64 3.2 Bounding the Probability of Error After a Finite Number of Samples. . . . . 67 3.2.1 An Information Theoretic Approach . . . . . . . . 68 3.2.2 Bounds Based on the Distance Between Distributions . . . . 80 3.3 Summary . . . . . . . . . . . 90 IV. EXAMPLES OF DECISION PROCESSES AND ERROR BOUNDS O O O O O O O O O O O O O 9 2 4.1 The Bayes Decision Process Applied to the Symbol Synchronization Problem . . 93 4.2 The Binary Symmetric Channel . . . . 98 4.3 Error Estimates and Bounds. . . . . 108 V. CONCLUSIONS AND RECOMMENDATIONS. . . . . 115 5.1 Summary of the Thesis . . . . . . 115 5.2 Recommendations for Continued Research. 117 BIBLIOGRAPHY . . . . . . . . . . . . . 120 Table 4.1 Symbc 4.2 Compa LIST OF TABLES Table Page 4.1 Symbol Generating Probabilities . . . . . 96 4.2 Comparison of Three Decision Processes. . . 107 vi Figure 1.1 1.2 2.1 2.2 4.1 4.2 4.3 LIST OF FIGURES Figure Page 1.1 A Communication System . . . . . . . 2 1.2 Multisource Communication System. . . . 3 2.1 Genesis of Pattern Dependence. . . . . 21 2.2 Pattern Dependency with Markov Symbol Selection. . . . . . . . . . . 41 4.1 Posterior Distribution of the Source Index Parameter. . . . . . . . . 94 4.2 The Binary Symmetric Channel . . . . . 98 4.3 Comparison of Error Bounds. . . . . . 112 vii Notation Notation A covi(V) d(X2n+1 k dB(X ) CB LIST OF NOTATION Page Meaning Reference 1 n 11m — Z A 52 n+w n h=l k+h 2C0V1{Uk'Uk+l} + varl{Uk+1} 52 1 n lim — 2 B 56 n+w n h=l k+h 2cov2{Vk,Vk+l} + var2{Vk+1} 56 Covariance of V under class i 52 Majority decision function 47 Bayes decision function 9 Compound Bayes majority decision function 65 Decision function used for k—th pattern for majority decision function 47 Majority decision function-- same as d(X2n+1) 47 viii Rotation was = ti) EMRi) 51W) EGW) fllek) fin) Notation E(VIO = ti) E(VIQi) Ei(V) EG(V) f (lek) fi(V) G = {gl,...,gm} H(') Page Meaning Reference Conditional expectation of V under class i 52 Same as E(VIG t.) 74 1 Same as E(VIO t.) 11 Expectation of V with respect to the mixture G 33 Sample conditional probability density function for V 37 Conditional probability density function for V under class i 11 Element of G that equals one 30 Estimator for gi based on n observations 33 Non-negative estimator for 9i derived from gi n 33 I Probability distribution for A used as a finite mixing distribution 30 Shannon's entrOpy 69 A vector space 32 Divergence measure relative to two probability density functions 11 ix Notation i. P(VJw) Notation 2 p> kB n> Meaning Index of sources Correct decision about 2 Bayes estimator for 2 Maximum likelihood extimator for 2 Total number of sources Symbol code length for source Number of digits in a pattern Prior probability that class i is active Conditional Probability mass function for V given w Probability of error for decision function V Empirical probability mass function for V Probability mass function for V under class i Conditional probability mass function for V given W under class i Vector of probability mass function for V under class i Page Reference 19 26 25 26 19 l 19 20 46 23 49,66 28 46 32 Notation RB 9—5) T ok Notation POW) (x rij k) ik r9> kB *3) kM ok Page Meaning Reference Prior probability mass function for V Factor for recursively com- puting pi?) Decision region for deciding class i at k-th observation Value of 0 indicating a pattern class Synchronization point in k-th pattern Bayes estimator for Tk Maximum likelihood estimator for Tk Correct decision about Tk Decision indicator random variable for k-th pattern-- Uk = 1 - Vk Decision indicator random variable for k-th pattern-- Vk = l - Uk Variance of V under class i Random variable for k-th pattern The sequence of random variables xl""'xk xi 25 84 47 20 25 26 26 48 48 52 9,20 9,20 Notation Z . J ik “ik (UR-l) Yk(£,'rk) Page Notation Meaning Reference Z. Base 10 representation of J a binary pattern 99 a max(al,a2) 59 g min(al'a2) 59 a. Probability of error under 1 . class 1 58 aik Probability of error of d (X ) under class i 47 k k ik_(Uk 1) Probability of error of (Xk) given dk-1(Xk—1) under c ass i 50 yk(£,Tk) Mapping from (£,Tk) to A 30 6 One-half of upper bound on variational distance 59 _l__<5_ E 8—: 4 62 n n = E - g 59 0 Class indexing parameter 9 A Class indexing parameter mapped from (E, Tk ) under k(£ T k) 30 p Bhattacharyya coefficient (Hellinger integral) 11 xii Notation Uij Notation $1?) 13 ¢. 1 ¢(x,u,02) Meaning Interclass Bhattacharyya coefficient based on one pattern Interclass Bhattacharyya coefficient based on k patterns Component of §i(V) Integral over the right tail of the normal density function Subset of a probability space on which 0 = th xiii Page Reference 81 82 33 55 74 Recer engineering use of stat the traditi H-Z, P-Z, w trend towar the framewo estimation, formulatigr1 assumptions System. In the applica Edge of Pro: tiCians hav Ba‘yes Strat tomed to th System desc CHAPTER I INTRODUCTION Recent research efforts in engineering and recent engineering practice have turned more and more toward the use of statistical models rather than, or in addition, to the traditional deterministic system models [A-7, D-2, F-S, H-Z, P-2, W-l]. In particular, there exists a growing trend toward the formulation of engineering problems in the framework of statistical hypothesis testing, parameter estimation, and pattern recognition methodology. These formulations often show a willingness to make certain assumptions about the probabilistic description of the system. Indeed, many of the solution algorithms grow from the application of Bayes' rule, which requires prior knowl- edge of probability distributions [H-7, N-l]. Some statis- ticians have long eschewed any such assumptions,* but the Bayes strategy represents a logical step for those accus- tomed to the availability of a complete, deterministic system description. As a rule, the models describe *Although for a scholarly treatise supporting the Bayesian approach see Good [G-l]. prOCGSS' tistica.‘ cations produce process 1.1 Corn Fu a source The Objec Priately and whic} Radar SYE SUCCessfu Neran~pe “‘11 to c radar app. in? noise Signal is descriPtic the parame receiVEr C minimi2e t dlsmiSSal processes for which successive experiments produce sta- tistically independent outcomes. However, in communi- cations theory and other fields, processes can occur which produce statistically dependent observations. One such process is now introduced. l.l Communications System Models Fundamentally, a communications system consists of a source cascaded with a channel and a receiver (Figure 1.1). Receiver xiv Channel v Source Figure 1.1 A Communication System The objective of the system is to have the receiver appro- priately process signals, which are emitted from the source, and which are subject to some modification by the channel. Radar systems provide the first historical example of the successful application of probabilistic models and of the Neyman-Pearson hypothesis testing theory [D—2, H—2, M-8, W—l] to communication systems. In the usual model of the radar application, the channel distorts the signal by add— ing noise to it. The receiver must determine whether a signal is present or absent. From the probabilistic descriptions of the source and channel, one can determine the parameters that define a threshold detector which the receiver can use on the output of a correlator in order to minimize the false alarm probability for a given false dismissal probability. The source and the channel can both be 1 tain eitl M-7, P-l trates t2 is more ‘ Th. nition, a consider: sources n its own 5 E E ’ SOL Fi fIOm an al and transm The r eCeiv starting effective StUdy Unde- both be modeled probabilistically, and the models can con- tain either independent or dependent random variables [M-5, M-7, P-l, S-3]. While the signal detection problem illus- trates the use of probabilistic models, a modified system is more directly related to the problems of this thesis. The fields of communication theory, pattern recog- nition, and statistical hypotheses testing intertwine when considering a communications system in which one of several sources may transmit signals (Figure 1.2). Each source has its own process by which it repeatedly selects symbols Source 1 Encoder Source i Encoder ___« Channel Receiver Source r Encoder Figure 1.2 Multisource Communication System from an alphabet, encodes the symbol into several digits, and transmits the resulting digits in a continuing stream. The receiver must know which source is active and the starting point of each symbol code in order to apply an effective decoding algorithm. The problem motivating the study undertaken in this thesis is to determine which source receiv zation startii 1.2 8? channel a corre] the symt problem because . because c receiving tyne of p each Symb. They furti tically ir Zation pro and Stewar prOVed tha' theorem1 [8- One with othErs Recent Work an Muftha source is active and the symbol code starting point. The receiver will be said to have achieved symbol synchroni- zation when it has identified the source and determined the starting point of each symbol code which it receives. 1.2 Symbol Code Synchronization Digital data transmitted over a communication channel often contains a symbol synchronization pulse which a correlator can process to determine the starting point of the symbol code [M-2, W-l]. A symbol synchronization problem arises when such a pulse is either not provided because of bandwidth or other design considerations, lost because of channel degradation, or simply unknown at the receiving end. Hancock and Stewart [H-l] considered this type of problem by assuming that the number of bits in each symbol code was known and fixed from symbol to symbol. They further assumed that successive symbols were statis- tically independent. By modeling the symbol synchroni— zation problem as a hypothesis testing problem, Hancock and Stewart established a Bayesian decision procedure and proved that it converges in the sense of Spragins' theorem [S—6]. One should compare this type of synchronization with others receiving attention in current research. Recent works by McBride and Sage [M-3, M-4] and by Farrell and Murtha [F-l] examine the problem of extracting bit synchrc purpOSE which a On the synchro symbol. quently. that acc able for to devel of sever the type PrOblem 1 Stewart, Sources, and SUCCe that alph into a Ce used is c. mits Stat; independe; uses a fi) but diffe} COde lenC' synchronization information from the message data. Their purpose is to determine the apprOpriate time interval over which a correlator should Operate for signal detection. 0n the other hand, Harnett [H-3] takes the approach that synchronization error effectively inserts or deletes a symbol. However, except for the randomly (and infre- quently) inserted and deleted symbols, Harnett assumes that accurate synchronization, or registration, is avail- able for the rest of the symbol stream, and he proceeds to develop decoding theorems for code strings consisting of several symbols. These two views are distinct from the type of synchronization studied in this thesis. This thesis approaches the symbol synchronization problem by generalizing the point of view of Hancock and Stewart. A receiver obtains data from one of a set of sources. The source model assumes an alphabet of symbols, and successive random experiments select symbols from that alphabet for transmission. Each symbol is encoded into a certain number of digits, and the number of digits used is called the symbol code length. Each source trans- mits statistically independent symbols although this independence constraint is not necessary. Each source uses a fixed symbol code length for all of its symbols, but different sources do not necessarily use the same code length. fr wa an mic tra The ObServing first bit the reCei that the Example Suppose there are two sources, one using a four binary digit (binit) code length and the other using three binits for each code word. If the receiver observes six binits, say 1 0 l l l 0 from the data stream, there are several possible ways these binits could be partitioned for decoding. If the four binit source is transmitting, then any of the partitions l 0 l l l 0 . . . . l 0 l l l 0 . . . . . l O 1 l l 0 . l 0 l l 0 . might be appropriate. If the three binit source is transmitting, then there are three possible partitions. l O l l l 0 O O l 0 1 l 1 0 O . l O l l l O . . The receiver does not know which source it is observing or whether the first bit it received is the first bit of a symbol code. However, it is assumed that the receiver has achieved bit synchronization, and further that the receiver knows something about the probability distr butio: funct: zatioz proble first precis must c random ature a ables [ be said ables [1 random x extend t e"amples of jOint tation o The COnv PEUdent , random Va The literatur ables are Weak and distributions governing the sources--either the distri- butions themselves or the parametric form of their density functions. Under these assumptions, this symbol synchroni- zation problem can be modeled as a pattern recognition problem for which the observed random variables exhibit first order dependence, or the Markov prOperty (for the precise specification, see page 24). Consequently, one must consider decision-making processes using dependent random variables. 1.3 Pattern Recognition Methodology There is an ample and rapidly growing body of liter- ature about decision making based on independent random vari— ables [D-3, D-4, L-l, P-l]. Unfortunately, the same cannot be said for decision making based on dependent random vari- ables [H—6]. Quite often, the algorithms for independent random variables derive from some property that does not extend to dependent random variables. Two salient examples are the product rule for defining probabilities of joint independent events and the rule for the expec- tation of the product of uncorrelated random variables. The convenient factoring which they provide in the inde- pendent case are denied to one who deals with dependent random variables. The most powerful general results in the statistical literature which are applicable to dependent random vari- ables are the Central Limit Theorem [F-4, H-8], and the weak and strong laws of large numbers [F-2]. These tools the cc for PE Stewar consis zation exhibi Bayes 1 the prc ponds t mate a bution the mod. nonzero Of the i PUted b) Consiste the Post the Prob Conseque. lated are ObserVat] argument questiOns and Stor provide the basis for describing the asymptotic behavior of decision processes. Signori [S-3] used them to show the convergence of optimum (Bayesian) decision processes for Partially Observable Markov Systems. Hancock and Stewart also used them to show the existence of a strongly consistent sequence of estimators for the symbol synchroni- zation problem. In both works, the technique was to exhibit strongly consistent estimators in proving that the Bayes posterior distributions asymptotically assign all the probability mass to the pattern class which corres- ponds to the correct decision [8-5, 8-6]. One can esti- mate a parameter value by computing the posterior distri- bution of the parameter and taking the mean argument of the mode as the estimate. Spragins showed that if a nonzero prior probability is assigned to the true value of the parameter, if the posterior distributions are com— puted by the Bayes rule and if there exists a strongly consistent sequence of estimators for the parameter, then the posterior distribution asymptotically assigns all of the probability mass to the true value of the parameter. Consequently, when the posterior distributions so calcu- lated are premised on a sufficiently large number of observations, the parameter estimate provided by the mean argument of the mode should be reliable. This raises questions of rate of convergence, probability of error, and storage requirements which are examined in this thesis. M‘ process Given a variabl i.e., p The Bay andap 9={tl require distrib Where x mass fu] p( is the 1 *F. defined ‘ values 0: Much of what follows concerns the Bayes decision process, which will be briefly summarized at this point. Given a sequence of observations, X1, X2, ..., of a random variable governed by one of m probability distributions, i.e., pattern classes, decide which distribution is active. The Bayes decision process requires a prior distribution and a parameter, say 0, which indicates the active class, 0 = {t1, ..., tm}. The Bayes decision process further requires calculating the sample conditional posterior distribution of O by the Bayes rule:* le-l _ k-l P(o — tilx ) Pi(Xk ) _ k _ P(o — tilx ) — _ k-l k-l ZP(O-—tiIX ) Pi(XkIX ) G where Xk = X X X and P (X IXk—l) is the i-th l' 2' "" k i k mass function of Xk' The decision function, which maps the observation space to the decision space, is dB(Xk) = ti if i is the smallest integer for which k k . . P(@ = tiIX ) 2 P(o = thX ), all 3 ¢ 1. Here, dB(Xk) = ti means to decide that element ti of O is the index of the pattern class being observed. *For notational convenience, terms which have been defined to denote random variables will also denote the values of the random variables. U its wel risk st minimunx generat to an e the min U1 Bayes de pendent, vestigat R-41. C SUbOptim rule. T Optimum bEhavior rule, am the mlmb' Sex Of boundj of "dista 10 Use of the Bayes decision process is motivated by its well-known prOperties which include being the minimum risk strategy and being the strategy which leads to minimum probability of error. A standard derivation generates the Bayes decision function as the solution to an extremum problem, in which the extremum sought is the minimum of the expectation of a risk function. Upper bounds on the probability of error for the Bayes decision process have been established for inde- pendent, continuous random variables by a number of in- vestigators [C-lO, F—3, H-S, K-2, L—2, L—3, L-4, R—3, R-4]. Chu and Chueh established bounds by defining a suboptimum decision rule called the majority decision rule. The Central Limit Theorem implied that the sub- optimum procedure converged and established the asymptotic behavior of the probability of error for the Bayes decision rule, and the majority decision rule, as a function of the number of observations. Several closely related approaches to the problem of bounding the probability of error rely on some measure of "distance" between the distributions of the pattern classes.* The underlying decision procedure uses the likelihood ratio. The goal (which in general has not yet been reached) is to find a monotonic relation between the distance_between the distributions of the pattern * The term distance appears in quotes here because the measures do not always obey the triangle inequality required of a metric [K-l, K-2, K-S]. claSS‘ divert with Kailat [K‘Z], 0f one exists butions bility diverge. effGCtil diStribL Th in the t Se are ”lore 11 classes and the probability of error. One measure, the divergence, is defined as follows in the two-class case with the class distributions fl(x) and f2(x), respectively. fl(x) fl(X) J=Elan'E21nf-2—m where f1(X) fl(x) _ Ei 1n W = ln f-E—(X—Y fi(X) dX, 1 = 1,2. Kailath, in his well-documented paper on these techniques [K-2], points out that if one can choose the distributions of one's signal sets to increase the divergence, then there exists a set of prior probabilities under which the distri— butions with the larger divergence provide a lower proba- bility of error than would distributions with a smaller divergence. However, this falls short of providing an effective computational technique for selecting signal distributions even if that option were available. The Bhattacharyya coefficient, denoted p, is defined in the two-class case as p = f/fl(x)f2(x) dx. Several investigators [K-2, L—2, L—3, L—4] have obtained results using the Bhattacharyya coefficient which are more encouraging than the results stemming from the diverge bounds the Bha vergenC underly hood de diverge of obse spite o. usual 5; formidak Th method f the Conv R-SI. R COnCErni equivoca fOUnd the tial fUnc Raviv [H- error. 'I the PrOba for any n niqUQ Of the aSym; thesis Sh IEtic Poi 12 divergence approach. Kailath obtains both upper and lower bounds on the probability of error which are functions of the Bhattacharyya coefficient. This measure, like the di- vergence, indicates the potential effectiveness of the underlying probability distributions for maximum likeli- hood decision making. If one tries to apply either the divergence or the Bhattacharyya coefficient to a sequence of observations in a multihypothesis situation, then in spite of the simple form of the bounds, one still has the usual situation that computing the error bounds can be more formidable than computing the decision function. The late Alfréd Rényi proposed a highly interesting method for using information theoretic measures to show the convergence of Bayes decision processes [R-3, R-4, R-S]. Rényi saw the observations as providing information concerning the classification parameter. He studied the equivocation (average entropy) of the observations and found that the equivocation approaches zero as an exponen- tial function of the number of observations. Hellman and Raviv [H-5] related Rényi's results to the probability of error. This approach provides an absolute upper bound on the probability of error of the Bayes decision process for any number of observations. In contrast, the tech- nique of Chu and Chueh, mentioned earlier, describes only the asymptotic behavior of the probability of error. This thesis shows the applicability of this information theo- retic point of view to dependent random variables. obserr descri bution distri distri ing di: in proc estimat However sequenc distrib indepen( to a Stc [C‘2] re mation a as a fUn Presente( random VE C. dependEnc VariatiOn hypothesi elements . adjaCent. not neCeS 13 Another view of pattern recognition considers the observations as originating from a process which is described by a mixture [T-l, T-2, Y-2] of the distri- butions of individual component processes. The prior distribution of the classes serves as an unknown mixing distribution. Some of the approaches to estimating mix- ing distributions [C-4, D-l, R-7, Y-l, Y-3] have resulted in procedures which require either a steadily growing estimator space or an infinite set of finite distributions. However, Robbins [R-6] has shown how to calculate a sequence of strongly consistent estimators for the mixing distribution when the observations are statistically independent. His procedure can be shown to be equivalent to a stochastic approximation algorithm. Chien and Fu [C-2] related Bayesian learning to stochastic approxi- mation and gave bounds on the variance of such estimators as a function of the number of observations. The work presented here applies these ideas in concert to dependent random variables. C. K. Chow [C-6, C-7, C-8] has investigated inter- dependence between pattern elements (features) that is a variation on the idea of the Markov prOperty. Chow hypothesized that the features might be ordered such that elements having interdependence would not necessarily be adjacent. This suggestion grew from character recognition work where the order in which features were measured did not necessarily reflect the history of their generation. He CC descr estab resul matin depen are C: a max; matior bution thesis (which of the with 1.117 indepenc samples niQUes. Processil the Sampl Used eith distribut late weig {C-ll' C- &q 14 He conceived a tree structure imposed on the features to describe the dependencies, and devised techniques for establishing the most effective tree. One interesting result showed that if one limits the form for approxi- mating a probability distribution to products of low order dependent distributions; and if the low order distributions are chosen to be those for which the components exhibited a maximum mutual information; then the resulting approxi- mation has minimum mutual information with the distri- bution being approximated [C-8]. There is no attempt to extend Chow's work in this thesis, since Chow worked in a supervised learning mode (which uses classified training data to evaluate parameters of the decision function), and this thesis is concerned with unsupervised techniques. Many pattern recognition researchers have available independent, identically distributed (i.i.d.) training samples with which they can use supervised learning tech— niques. This avoids the high cost of optimum unsupervised processing, and no underlying probability distribution on the sample space need be specified. The training data are used either to construct approximations to the probability distributions [A-2, B-l, K-4, M-6, P-4, Y-4] or to calcu- late weighting parameters for discriminant functions [C-ll, C-12, C-13, C-l4, D-S, F-S, I—l, N-2, 8—1, 8—2, 8-4]. Various supervised learning techniques are often compared by tuning them on the same set of training data and ti“ test d the d8 would . some C] were ur techniq of symb unsuper1 zation ; depender in the w removed. be applit l. w Thi prOblem w 15 and then evaluating their performance on another set of test data. Other researchers [K-4, P-4] have proved that the decision functions produced by a particular scheme would agree with the decision function which would optimize some criterion function if the number of training samples were unlimited. This thesis has not pursued supervised techniques primarily because the motivating problem, that of symbol synchronization, seemed more plausible in an unsupervised operation. Further, the symbol synchroni— zation problem very definitely produces statistically dependent data, and the theoretically pleasing results in the works cited collapse when the i.i.d. assumption is removed. Many of the strictly empirical techniques could be applied equally well to i.i.d. or non-i.i.d. data. 1.4 Contributions of the Thesis This thesis identifies the symbol synchronization problem with unknown symbol code length as a problem in unsupervised, multicategory pattern recognition with dis- crete,dependent random variables. A sequence of decisions based on a Bayes strategy is shown to converge and allow one to simultaneously determine the synchronization and learn the parameters in the source distribution. Even though the process at the source which selects successive symbols for encoding selects symbols independently, the data-gathering model at the receiver produces dependent patterns. It is pointed out that if the symbol selection proce- and SI proce vides studi. cedure varial decisi asympt Optimu; continx informa random bility the numl depender error b0 PIObabil Coefficie Presents error b0 majority St°Chasti algorithr l6 process at the source were Markov, then the synchronization and source parameters could still be learned by the decision process at the receiver. The binary symmetric channel pro- vides a specific example of parameter learning which is studied. The convergence properties of such decision pro- cedures are thoroughly examined for dependent random variables. A subOptimum decision procedure, the majority decision procedure, is used to derive expressions for the asymptotic behavior of the probability of error for the optimum procedure; such expressions are derived for both continuous and discrete dependent random variables. An information theoretic argument, applied to the dependent random variables, provides an upper bound on the proba- bility of error of the optimum procedure as a function of the number of observations. This thesis also extends to dependent random variables some techniques for determining error bounds which use measures of the distance between probability distributions, specifically, the Bhattacharyya coefficient, and the Kolmogorov variational distance, and presents the role of these quantities in the more elaborate error bound techniques, those using equivocation and the majority decision function. Another procedure is based on stochastic approximation of a mixing distribution, and the algorithm's convergence rate is discussed. Finally, CODPU the f anal} in ti estar conce for t of ob which the tl vised 17 computer simulations of selected techniques demonstrate the feasibility of the type of processing which has been analyzed. 1.5 Organization of the Thesis Chapter II places the symbol synchronization problem in the framework of a pattern recognition problem and establishes convergent decision processes. Chapter III concentrates on various methods of computing error bounds for the Bayes decision process as functions of the number of observations. Chapter IV presents several examples which illustrate the theory, while Chapter V summarizes the thesis and suggests alternate approaches to unsuper- vised decision making. proble the de will b depend' depends in the eters a it is s arise f; SPeCific the fran describe L % Th One of 86 not rest CHAPTER II THE SYMBOL SYNCHRONIZATION PROBLEM Detailed explanations of the symbol synchronization problem, the decision procedure, and the convergence of the decision procedure are presented in this chapter. It will be shown that the processing technique must deal with dependent random variables. The exact effect of this dependence on the decision process is described. While in the basic problem description the only unknown param— eters are the source active and synchronization instant, it is shown that other unknown parameters, such as could arise from either a noisy channel or a less complete specification of the source, can be accommodated within the framework of the class of decision procedures described. 2.1 The Data Generation Model The information to be received can be generated by one of several sources, and each source encodes the infor- mation in binary form. The assumption of binary data does not restrict the generality of the results obtained. Many 18 data nal and ‘ data a la] 19 data transmission systems--such as remote computer termi- nal facilities, equipment monitors in earth satellites, and various character recognition schemes--use binary data, so the binary data model is directly applicable to a large class of existing systems. Let L denote the number of sources. All sources transmit at the same bit rate, and all sources use the same signal to represent a binary digit. A particular source uses the same symbol code length, or number of binary digits per coded symbol, for all symbols in its alphabet. Throughout this thesis, the term symbol refers to an element of a source alphabet, and the term symbol code refers to the encoded version of a symbol. Two or more different sources might use the same symbol code length; however, it is also possible that different sources use different symbol code lengths. By letting the letter 2 stand for the index of the source, £e{l,2,...,L}, one can then have m stand R for the symbol code length used by source 2. Every mk time units (the time unit is the reciprocal of the trans- mission bit rate) source 2 randomly selects a symbol from its alphabet, encodes the symbol using m binary digits, £ and transmits the m binary digits. At some time, not 2 necessarily as the source begins transmitting, a receiver locks on to one of the sources and receives from that source for the rest of the time. The receiver does not know face is be lengt each the I that each ceive Methol separa binary struct digits Stream. X1,X2,. Where n length, first k Stands f 20 know which source is transmitting. Two immediate tasks face the receiver. The first is to determine which source is being observed, which in turn specifies the symbol code length. It is assumed that the symbol code length of each source is known to the receiver. As a second task the receiver must determine the "synchronization instants," that is, the binary digit which is the first digit of each symbol code. Note that the first binary digit re- ceived is not necessarily the first digit of a symbol code. Methods for solving these two tasks will be developed. The receiver observes patterns which it obtains by separating the input stream into successive sets of n binary digits. The input stream then could be recon- structed as a concatenated version of the patterns; no digits are lost in extracting the patterns from the input stream. The sequence of patterns is represented by X1,X2,... and each Xi (i = 1,2,...) has n binary components where n is at least as large as the largest symbol code length, i.e., 1‘: m2 : n. The notation Xk denotes the first k patterns; Xk = (X1,X2,...,Xk). The notation Tk stands for the synchronization instant in the k-th pat- tern, Xk’ and is defined as follows: a. If exactly one symbol code has its beginning binary digit in X then T is the position of k' k that binary digit in Xk. 21 b. If more than one symbol code has its beginning binary digit in Xk (which can happen whem mi < n), then Tk is the position in Xk of the beginning digit of the first symbol code which begins in Xk. This definition of Tk comes from the following idea: if there are exactly i binary digits (1 = 0,1,...,m£-1) in Xk which belong to a symbol code which began in Xk-l’ then Tk = i+l, and Tk left over from a symbol started in Xk_1) starts in Xk' tells where the first new symbol code (not Figure 2.1 shows examples of sequences of values for T Values of m = 2 and n = 3 are assumed, and the two k' 1 possible values for T1 produce the sequences of values for T ,T as shown. Effectively then, Figure 2.1 illustrates 2 3'... . . . o o 1 o 1 o o o 1 o 1 1 . . . (a) * * * * X1 X2 X3 X4 Tl=l T2=2 T3=1 T4=2 . . . +_g o 1 o 1 g_y_g o 1 o 1 1 . . . (b) * * * * x1 x2 x3 x4 Tl=2 T2=l T3=2 T4=l Figure 2.1 Genesis of Pattern Dependence tWKD ways in which the same sequence of patterns, Xk' could Eriseefrom a source which uses two binary digits for a SYHflmol code. Pairs of underlined digits represent symbol 22 codes. The values of T are shown. The "*" under a digit k shows that the digit is the one which determines the value of Tk' The value of Tk cannot exceed the symbol code length of the source. Since n might exceed mg, T is not neces- k sarily equal to Tk+l° However, given m n, and T the R' k-l' number of digits which carry over to X from a symbol code k started in xk-l is mi minus the remainder of n - (Tk-l - 1)L/h£ so that Tk = mx — [n - Tk-l + I] + l 2.1.l mod m 2 The distinguishing feature of this chapter is the assumption that once the receiver has selected a stream to observe, the receiver stays with that one stream. Sections 2.2 through 2.4 develop an Optimum receiver and discuss its convergence for the case in which the symbol code length and synchronization instants are the only unknowns; the probability model for the sources will be completely known and the channel between the source and receiver will be noiseless. Sections 2.5 and 2.6 allow unknown parameters in the source distributions and include a noisy channel. 2.2 Estimating the Symbol Code Length and Synchronization Instant Each source performs a random experiment to select a symbol to be encoded. The probability of any source 23 selecting a given symbol is assumed known, fixed and independent of previously selected symbols. The stream of binary digits (binits) so produced represents obser- vations on a discrete—parameter, discrete-time stochastic process. The values of the sample conditional probability mass function, for a noiseless channel P(XI9.,T 5L=l,2,...,L;T =1,...,£ k” k is known for all 2n mass points and fixed. In the expres- sion above, X--which has represented a random variable--is used to represent the value of the random variable as well. This practice will be continued, and the context will indi- cate whether the random variable or the value of the random variable is intended. A symbol code might have its initial binits in Xk__1 and its final binits at the start of Xk' which causes Tk to be greater than one. Yet no single symbol code could overlap more than two patterns, because the length of a pattern is as great as the longest code length. That being so, the value of X could depend on the value of k Xk_1 but not on Xj for j < (k-l); recall that successive symbols are independent. Furthermore, since stationary probability models for all sources are known, the con- ditional probability of Xk given the value of xk-l can be specified. In short, the sequence of random variables trW P (3 anc pos the denc func k pa1 P(T quant GXpre: P(Tk( 24 X1,X2,... is a first-order Markov chain with stationary transition probabilities. Thus, P(Xk|£,Tk,Xk-1) = P(XkI£,Tk,Xk_l). The procedure for estimating the active source, 2, and the synchronization instant, T is based on forming k’ posterior estimates of the probability mass function for the active source given the first k patterns observed-- denoted P(£|Xk); R = l,...,L--and of the probability mass function for the synchronization instant given the first k patterns observed and the active source--denoted P(Tk|£,xk); l = l,...,L; T = l,...,l. Expanding these k quantities by the Bayes rule gives the following recursive expressions. k P(xk|2,xk l)PULIXk 1) P(£IX ) = L z P(XkI£,Xk lk)PULIX 1) 2:1 2.2.1 1£>(xk|r,xk'l">130lek 1) = k— l) 9, = 1'2'ooopL P(XkIX k P(xk|Tk,2,x l)P(Tk|9.,xk'l) P(Tk|£,X ) = 2: PM IT ,2.x"‘1)1=<'r Isaack 1) k k k T =1 k k 1 k 1 2.2.2 P(X IT ,£,x ' )P(T |2,x ) k k k _ = k-l 2 — l,2,...,L P(xk|2,x ) 25 The denominator of 2.2.2 shows how to compute the first factor in the numerator of 2.2.1. All that is left is to Specify initial values for PO(2) P(2|X0) and _ O PO(Tkl2) — P(Tk|2,X ) because P(X|Tk,2) is assumed fixed and known. With the procedure for computing the posterior probabilities specified in 2.2.1 and 2.2.2 one can con- sider ways of estimating the value of 2 and T One esti- k. mator, called the Bayes estimator, uses the mean of the posterior distribution. The Bayes estimators of 2 based on Xk is denoted by 2 (the estimator for the source kB index after k observations) and that for T, by T (the kB estimator for the synchronization instant after k obser— vations). These estimators are defined by: A L k 2kB = z 2p(£|x ) 2:1 and m T = z z“ T P(T I2,Xk)P(2,Xk) kB _ k k 2 Tk-l However, these estimators tend to give fractional values for quantities defined as integers, so some rounding algorithm would have to be specified. 26 Another, perhaps more intuitively satisfying, estimator is the maximum likelihood estimator. The maxi- mum likelihood estimators of 2 and Tk based on Xk are denoted by 2 and T and are defined by: kM kM’ EkM = 2os.t. P(2O|Xk) = max P(2|Xk) and TkM = Toks°t' P(Tok|2kM,Xk) = max P(Tkl2kM,Xk). Tk 2.3 Convergence of the Posterior Probability Mass Functions The receiver is connected to a single source, 20, so all symbol codes have the same symbol code length, mg . The receiver stays connected to the source, missing none0 of the binary digits produced after the connection is made. Consequently, the sequence of true synchronization in- stants, {T }00 will satisfy 2.1.1 with T0 substituted 0k k=1 k for T . It is essential to show that the posterior k probability mass functions computed according to 2.2.1 and 2.2.2 converge in the sense that as k+w they have all of their mass at 20 and Tok, respectively. In order for that to happen, a theorem of Spragins requires that the following conditions be met: (1) the posterior probabilities must be computed by the Bayes rule, 27 (2) the true value of 20 and TO must have non-zero 1 prior probabilities, and (3) there must exist sequences of functions of the observations, {fk(Xk)} and {gk(Xk)}. such that k k k k fk(X )-—'*2O w.p.l. and gk(X )-—-->T0k w.p.l. where w.p.l means "with probability one." Equations 2.2.1 and 2.2.2 show that condition 1 is met, while proper choice of prior probabilities that condition 2 is met. The following theorem existence of the strongly consistent estimators in condition 3. assures shows the required Theorem 2.3.1 If (a) the sequence of patterns represent obser- vations on a regular Markov chain; (b) only one synchronization, denoted by 20, TO k exists during the transmission of all Xk, k: 1,2,...; (c) members of the family {P(Xk|2,Tk); 2 = l,...,L; Tk = l,...,mfi} are distinct; (d) P(Xkl2,Tk) is the same for all k = 1,2,... then there exist sequences of minimum dis- tance estimators 2 and T for 2 and T m m k which converge to 20 and T0 with probability k one.* *The estimators 2m and Tm imply a sequence of esti- mators 2m,T1m,T2m,...,Tmm,..., for which ij is not 28 Proof: Define the empirical discrete probability mass function for X, §h(x), as follows: F(x)=-1- 11fx=x m m IX(Xk) a.e. where I x(Xk) k ">18 k 1 I X(Xk) 0 1f Xk # X. This is the proportion of the first m observations which equal X. Define the estimators 2m and Tm by inf supIPm (X) - P(X|2, T k)I = supIPm (X) - P(X|2 m,Tm)l 2 'Tk X w.p.l The infimum on the left allows us to write suple(X) — P(Xl2m,Tm) Issupl'fimm X X - P(XI2O,TO )|—l“—+o w.p.l. 2.3.1 m Convergence to zero with probability one is by the Glivenko-Cantelli theorem (using Signori's extension to regular Markov chains). Now the triangle in- equality is used to write necessarily the estimator T based on the j- -th pattern, for j#m. Correct estimation of the synchronization means that the value of 2m is 20 and Tkm gives the value TO . k 29 sup|P(Xl2m,Tm) - P(XI2O,TO )l = suplP(X|2m,Tm) X m X - Pm(X) + Pm(X) — P(XI2O,TOm)] s s;plP(Xl2m,Tm) - §m(X)| + sup|5m(X) - P(X|2O,TO )| 2.3.2 X m Equation 2.3.1 says that both quantities on the right in 2.3.2 go to zero with probability one so that A A m consequently P(Xl2m,Tm)-——+P(XI2O,TO ) w.p.l. Hy- m pothesis (c) then provides that 2m-+ 2O w.p.l and T —+ T w.p.l. Q.E.D. This result in turn implies that, because of Spragins' theorem, either of the decision procedures of section 2.2 will give correct estimates of 20 and T0k if the process continues long enough. With the estimators discussed so far, the rate of convergence and probability of error are difficult to specify. Section 2.4 provides an algorithm which allows statements about the rate of convergence. 2.4 A Stochasticquproximation Approach Synchronization is completely specified by the combi- nation of source index and synchronization instant, (2 Tk). I 30 The probability of a given pattern can be represented by the mixture 2 1 "b4; P(X) = P(X|2,Tk)P(2,Tk) 2.4. Ilbab 2 l T For a convenient change of notation, observe that since L 6 n and l s T s m there arexn = 2 m or at most, 2 k 2 2:1 2 nL distinct pairs (2,Tk) which can specify the synchroni- l s m zation. The notational change comes through letting an unknown parameter A€{l,2,...,m} index the possible source- synchronization pairs expressed relative to the first pat- tern. A sequence of functions Yk(2,Tk) is defined such that Yk(2,Tk) is a l-l map between the values of A and the values of the ordered pair (2,Tk). Further, yk(2,Tk) is defined such that given the value of the pair (2,Tl), the sequence of values of (2,T2),(2,T3),... generated according to 2.1.1--with 2 held constant--maps under Yk(2,Tk) to the same value A for k = 1,2,... . The parameter A has an unknown probability vector G = {gl,...,gm}, gi ) 0, 2m gi = 1 such that P(A=i) = gi. i=1 Assuming that a single source with one synchronization, (2O,To ), produces all of the observation vectors implies k that one value of A is in effect for all observations. So the system being observed has a probability vector G = {gl,...,gm} such that for 1 = 10, gi0 = P(A = 10) = l and for all other 1, 9i = 0. The entry of G which has 31 the value 1 is unknown. Once i0 is known, gi = l and 0 all other gi, iiio, must be zero. The parameters and notation above leads to rewriting 2.4.1 as L rn’Q P(Xk) = E E P(Xkl2 = a,Tk = b)P(2 = a,Tk = b) a—l b—l m - .E P(Xlek(a.b) = 1)P(Yk(a.b) = 1) 1—1 m 1—l where 2 = a, Tk = b maps under Yk to A = i, Pi(Xk) = P(Xklyk = 1), and Estimating the synchronization now implies finding i0, the true value of A, for which P(A = i0) = l. The approach will be to estimate the probability vector G. A decision on the value of i0 can then be made from the estimates of G. If one can obtain a sequence of estimates gi,n for gi such that gi,n converges to gi, then one can use a maximum likelihood decision rule for deciding on the synchronization. The following theorem shows the 32 existence of a set of strongly consistent estimators for 9i (i = l,...,m) which can be computed by a stochastic approximation algorithm (of. [A-4]). Theorem 2.4.1 If (a) {Pi(X)}IE=1 is an identifiable [T-l, T—2, Y—3] family of probability mass functions _ m _ . (b) G — {gl,...,gm}, gi a 0, 21:1 gi — l 18 a finite mixing distribution, and (c) X (k = 1,2,...) is a Markov chain with the k . . . = m d1str1but1on PG(Xk) Zi=l gipi(xk) then there exists a sequence of estimators for the mixing distribution, G, which converges to G with probability one. Proof: The sequence of estimators developed in the proof will have the form of a stochastic approxi- mation algorithm. The proof follows that of H. Robbins [R-6] for a related theorem. A member of the family {Pi(X)} will be denoted by the vector Pi(X), i = l,...,m which has 2n elements, one for each of the 2n possible values of the observations X; recall that X is an n element binary vector. Yakowitz and Spragins [Y-3] show that identifiability assures that the vectors Pi(X), i = l,...,m are linearly independent and span the space Rm (and form a basis for Rm). Let Hj denote the m-l dimensional 33 (X) P —j- -1. “—j+l(x) subspace Spanned by P1(X),...,P. ooo’EIn(X)o Then 33. (X) = g). (X) + 23. (X) where g]. (X)€Hj , g]. (X) j_ Hj and gj (X) a! 0. 2.4.3 Define ¢.(X) = P."(X)/2[P."(X)12 J J x 3 so that i ¢j(X)Pk(X) = 1 if j = k = o if j 7! k. The elements of the set ¢j(X): j = l,...,m form a set of orthonormal components of gj(X). Now define n — + Z ¢.(X) and g. = [g. ] X [g. ] k=1 1 k 1,n 1,n ///j-1 j, n fio| I: ll SIP where [a]+ = max (a,0). Hypothesis (c) and the above result imply that for all k m E¢.(X)=E¢ik(X)ZgP.k(X)= Eg.2¢ikjk(X)P(X)=g. 2.4.4 gonal 1 Gram-Sc Bj-l (X1 an ortl zation 90nal H-: t J h 34 Signori [8-3] formulated a theorem, based on a proof of Raviv [R-2],extending the law of large numbers to Markov chains. Signori's theorem pro- vides that when ¢i(X) is a Baire function integrable with respect to a Lebesque measure on X then 2.4.4 implies that 31 n n’91 with probability one, hence I n . O I gi,n __+gi w1th probab111ty one. QoEoDo In 2.4.3, the vector gj"(x) was defined to be ortho- gonal to Hj; Pj"(x) can be obtained by applying the Gram-Schmidt orthogonalization procedure to P1(X),..., P (X).P -j-l _j+l(X),...,Em(X), wh1ch span Hj' The result 15 an orthogonal basis for Hj' Then the final orthogonali- zation step finds the portion of gj(X) which is ortho— gonal to the orthogonal basis of Hj, hence orthogonal to Hj; this gives the vector gj"(X). While the estimators developed in Theorem 2.3.1 estimated 20 and T0k and required only one source active to converge, the estimator of Theorem 2.4.1 estimates the probability distribution of the synchronization. So the estimator of Theorem 2.4.1 suggests a decision rule for learning the probability law which governs a receiver 35 when the receiver's data generation model selects a differ- ent source for each observation. Additional complications would be introduced in defining 2 and Tk' Those compli- cations would motivate the addition of data buffering to more completely state the data generation model of such a problem. It is not clear what the practical value of such a device would be. Rewriting the definition of g. to give it the form 1,n of a stochastic approximation algorithm begins with _ _ 1 n l n-l gi,n ' i,n-l = H kil¢i(xk) ' HIT kil¢i(xk’ n-l -1 £-__1 _ n ¢i(xn) + (n n-l kil¢i(xk) _ l _ — — n[¢i(Xn) gi'n_1]. So the recursive expression — _ — l: _ _ has the form of a stochastic approximation algorithm. Chien and Fu [C-2] show that, according to Dvoretsky's theorem, 3: converges to 91 in mean square and with ,n probability one. Letting B denote an upper bound on the variance of ¢i(Xk), Chien and Fu show further that the mean squared error of Si n decreases at least as i for k I 36 observations and is less than or equal to B/k. This gives a bound on the rate of convergence of estimators for the mixing distribution. 2.5 Synchronization and Learning Unknown Parameters A processor might not be fortunate enough to know the probability model for the received patterns (given the synchronization information). A likely situation would have the receiver see each source transmitting through a channel having unknown noise parameters. This section considers such unknown parameters. The functional forms of the probability distributions for the patterns, given the synchronization, are assumed known. The parameters must be learned. The model by which data are generated is now estab— lished. Several sources generate binary data in the manner described at the beginning of this chapter. The operation of the sources is exactly as described before. At some point in time the receiver connects to the channel for one of the sources and receives from that one source through that one channel from that time forward. Each source has its own channel, such as would be the case for two space satellites whose signal paths experience differ- ent kinds of distortion, and the channel might contain noise. Each source-channel combination might contain unknown parameters which will be denoted by G where 2 2" is the source index. The notation {02}:=1 denotes the /— "" l .. __ 37 set of all unknown parameters. As before, the first bit of the first observation vector is not necessarily the first bit of a symbol code, so that the unknown synchroni- zation instant, Tk’ must be learned; Tk is defined exactly as in the introductory paragraphs of this chapter. The received data will be processed as n binary- digit patterns denoted by X k = 1,2,.... The functional kl form of the parameter conditional probability of the re- ceived pattern is known, so that P(Xkl2,Tk,{O£},Xk-1) is the fundamental known quantity from which posterior distributions will be calculated. Recursive formulas for the posterior densities follow. k-l k-l p(xk|{e£}.x )f({GR}IX ) k-l) f ({e£}|xk) = p(XkIX £=1’000’L;k=l'2'ooo k-l k—l P(xkl2,{oz},x )P(2I{0£},X ) k-l) P(2I{G£},Xk) = p(xk|{e£},x 2 = l,...,L; k = 1,2,... k-l k-l P(Xkl2,Tk,{92},X )P(Tkl2,{9£},x ) k-l) k P(T 12,{e },x ) k “ p(xk|£.{e£} .x 2 = l,...,L; k = 1,2,...; Tk = l,...,mz 2.5.1 38 Denominator terms come from integrating the numerator over {9%}, summing the numerator over 2, and summing the numer- ator over T respectively, for the three equations of k’ 2.5.1. Lower case f represents probability density function as Opposed to probability mass function. Channel noise could make possible a continuum of values for com- ponents of the patterns. However, if a preprocessor were to convert the received data into binary valued pattern features, then conditional probability mass functions for the observations would be appropriate. If 20 denotes the index of the source being observed, TO the true synchronization instant, and Go the value of the parameters for the channel, then the patterns are distributed according to the conditional density P(X) = P(XIGO,2O,TO), k = 1,2,.... In this case the system must learn the parameters of the channel for the source as well as the source index and synchronization instant. Spragins' theorem can be applied again to the posterior densities of 2.5.1, with the criti— cal matter being the existence of strongly consistent estimators for O , 2 , and T . o o 0 Theorem 2.5.1 If (a) all observations are taken from the channel connected to the source with symbol code 39 length m2 , synchronization To' and channel 0 parameters Go so that P(Xk) = P(XkIGO,2O,TO) for k = 1,2,... (b) members of the family {p(kac%) ,2,Tk)}£’Tk are distinct A then there exist sequences of estimators Gm, 2m, and Tm which converge to 00' 20, and To, respec- tively, with probability one. Proof: The proof used for Theorem 2.3.1 applies almost identically here. The empirical distribution function and the extended Glivenko-Cantelli theorem . A A A m establish P(Xklem,2m,Tm)-—+P(XkIOO,2O,TO) w.p.l. Hypothesis (b) then assures that the limiting é , 2 and T are unique and equal 0 , 2 and T , m o o o m m respectively. QOEOD. Notice that under this mode of operation, information is obtained about the parameters of only one channel, the channel over which the symbols are received. In general the system provides no improvement over the prior infor- mation about the parameters of the other channels. In classical unsupervised learning [8-5] the class identification is random from observation to observation. 40 The k-th pattern there is represented by (Xk,ak), where ak indicates the parameters of the class being observed through the k-th pattern. The set {ck} are usually assumed to be independent identically distributed random variables for which a set of prior probabilities, P(ak = Z) must be defined. Over a long run of observations, all classes are sampled and the parameters of all classes are learned. By contrast with classical unsupervised learning, the problem considered here assumes that the class identifi- cation is fixed, though unknown. While the formulation of problems of unsupervised learning of unknown parameters is reasonably straight- forward, the computational problems in most instances are severe. Unless the functional forms are exceptionally convenient to deal with, one is forced to make discrete approximations to the continuous parameter values. This leads to the usual difficulties associated with numerical techniques, with a very difficult tradeoff between the small discretizing intervals desired for accuracy and the cost in memory and computing time that results; questions of roundoff and truncation error cannot be ignored, of course. Hellman and Cover [H-4] have worked toward a theory of the computational overhead involved in pattern recognition algorithms. 41 2.6 Markovian Symbol Selection Many interesting symbol generation experiments are modeled better by assuming Markov dependence than by assuming stochastic independence among symbols. If the sources described in the single source case were modeled by first order Markov chains, then what kind of model would be apprOpriate for the sequence of patterns? Some insight can be gained from an example. Assume that patterns of length n = 3 are taken from a source whose symbol code length is m2 = 3. Let = 010, X = 110, and T = 2, as illustrated in xk-l k k Figure 2.2. The symbol denoted by 10? is incompletely observed in Xk and might be either 100 or 101. Since I 0.1 o l1.10 ! ? I I I I l6X-———+L——X+—efi = 110, T = 2, and m = 3. k-l k k 2 The binary digits underlined by | lare a single n = 3, X = 010, X coded symbol. Digits through the k-th pattern are known. Figure 2.2 Pattern Dependency with Markov Symbol Selection the symbols are generated under first order Markovian dependence, knowledge of the previous symbol, 101, allows one to specify the probability that 10x will turn out to be 100. The distribution of Xk+1 is conditioned on 42 the source, symbol code length, synchronization instant, X and X is needed. k k-l' k-l Consequently, when the source has a symbol code length of No information prior to X 3 and first order Markovian dependence between symbols, the sequence of patterns of length 3, {X is also k}k=l' a first order Markov chain whose states correspond to the eight possible values of a pattern. In order to generalize to all possible code lengths and pattern lengths for first order Markov sources, observe that having T > 1 caused part of X to be an incomplete k+1 observation of a symbol. So m k 2 - (Tk+l - 1) d1g1ts of the incompletely observed symbol are in X The symbol pre- k. ceding the incompletely observed symbol has m2 code digits, which, together with the ma - - 1) digits (Tk+l 2 - (Tk+l g - Tk + l d1gits which have already been observed and which have to be considered makes 2m - l) = 2m to give the conditional distribution of Xk+l° Since - T + 1 6 2n - T + 1 < 2n 2 k+l k+1 when Tk+1 > 1. As a result, the patterns observed prior ml 6 n by assumption, 2m to X have no influence on the prior conditional k-l distribution of X Therefore, if the source generates k+1' symbols by using a first order Markov process, and if the source symbol code length does not exceed the ob- served pattern length, then the observed process can be represented by a second order Markov chain. A similar analysis can show that if the source generates symbols 43 by using an i—th order Markov process, and if the source code length does not exceed the observation vector length, then the observed process can be represented by a Markov process of order less than or equal to i + 1. The fact that an equivalent first order Markov process can be defined for any higher order Markov process allows one to conclude that the theorems developed for sources with independent symbols also apply to Markov sources of any order. As is well known, however, the number of states in the first order equivalent of a higher order chain increases exponentially with the order of the chain, which presents an effective restraint on the application of the technique of first order approximation. The effect of the large number of states shows up both in storage re- quirements and in the number of Operations that must be performed. 2.7 Summary It has been shown that the Bayes posterior distri- butions of the unknown parameters have a converging be- havior when dependent random variables are observed and consequently the Bayes decision process is an effec- tive procedure for deciding the symbol synchronization. In the process of proving convergence, a strongly consistent minimum distance estimator for the unknown parameters is defined, which suggests other decision procedures which are not pursued here. One alternative decis appro has a proce zatio. of th decis sourc( ing, 1 then 1 tive s 44 decision procedure which is pursued uses a stochastic approximation technique, whose estimator of the parameter has a linearly decreasing variance. The Bayes decision process can be applied simultaneously to the synchroni- zation problem and to the problem of learning parameters of the source being observed, and the sequences of decisions on both problems will converge. Finally, if the sources use Markov processes to select symbols for encod- ing, rather than stochastically independent experiments, then the Bayes decision process still provides an effec- tive solution to the symbol synchronization problem. CHAPTER III ERROR BOUNDS FOR DECISION PROCESSES This chapter follows several approaches to determine the rate of convergence and probability of error of a decision procedure described in Chapter II. First, a sub- Optimum procedure is defined for which the asymptotic error probability can be determined. This asymptotic behavior Of the error of the suboptimum procedure is used to describe the asymptotic behavior of the minimum error procedure. The second approach uses an information theoretic measure to define an upper bound on the error probability of the Optimum procedure as a function of the number of obser- vations. Subsequently, measures of the hypothesis con- ditional probability distributions, namely the Bhattacharyya coefficient and the Kolmogorov variational distance, give error bounds that are formally elegant but whose asymptotic properties are difficult to describe. All of the error estimating and bounding theorems presented here require the same basic functions, the hy- pothesis conditional densities of the patterns, although the processing operations which are specified by each theorem 45 46 vary widely in their computational requirements. The underlying question throughout is: Is it worthwhile to compute this bound on the basis of n observations? Any worth must be measured against the computational costs, which for some of the theorems tend to offset the payoff Obtained from computing the error bounds. 3.1 Majority Decision Functions and Error Probability for Dependent Random Variables This section uses a subOptimum procedure called the majority decision procedure to obtain an upper bound on the error probability of the optimum decision procedure. Chu and Chueh [C-lO] invented this approach and studied its prOperties for the i.i.d. case. Here, the technique will be extended to dependent random variables. An exact expression for the suboptimum error probability provides the upper bound for the Optimum procedure. The asymptotic behavior of the exact expression indicates the asymptotic behavior of the Optimum probability of error. Details of the application to both continuous and discrete first order dependent random variables are presented. In an m-class decision problem, let G = {tl,t2,...,tm} denote the m classes, where class i occurs with proba- bility pi for i = l,...,m and 2m pi = 1. Let {Xk} i=1 k = 1,2,..., denote a sequence of vector-valued r.v.'s, all having the same distribution. When discrete Xk are con- sidered then Pi(xk) stands for the conditional probability 47 mass function for Xk when class i is active. Similarly, when continuous X are considered, the corresponding con- k ditional probability density is written fi(xk)' First order dependence (l-dependence) of the X is assumed; i.e., k Pi(xk|xl,...,xk_l) = Pi(xk|xk_l) for i = l,...,m. 3.1.1 Asymptotic Error Probability for Two Pattern Classes In the two-class case (m = 2), with probabilities 2 + p1 and p2, let d(X n l) = D[d1(xl)""'d2n+flx2n+l)] where dk(Xk) depends only on X so that dk(Xk) = t1 or t2 for 2n+l k k = l,2,...,2n+l. Then d(X ) is a majority decision function if it follows the decision of the majority of dk(xk)' Here, dk(Xk) is a mapping from the domain of values for the k-th r.v. to the set (tl,t2); there can be a different mapping for each k. The function D, on the other hand, maps from the cartesian product 92n+l to O. The decision regions in the Observation space for Xk and the conditional probabilities of error are: Sik = {Xk dk(xk) = ti}; “1k = X2ES P.(Xk) — X2 5 X2 Pi(xklxk_l)Pi(xk_1) k jk kE jk k-l for i¢j , i,j = 1,2 , and k = l,2,...,2n+l. 3.1.1.1 48 An exact expression for the probability of error for the majority decision function will be developed with the aid of random variables which indicate the decision dk(xk)' These random variables are purposely redundant in order to provide a clear expression for the error probability and to facilitate application of the Central Limit Theorem. The approach largely follows the lead of Chu and Chueh with appropriate allowances and new defi- nitions for handling discrete and l-dependent random vari- ables. Decision indicator random variables are defined as U = 0 and V k k 1 1f dk(X k) 3.1.1.2 l w l and V C ll 0 1f dk(Xk) — 2 In terms of these variables, the majority decision is 2n+1 ) = t if Z V 2 n+1 x2n+1 1 k k=l d( 2n+1 = t if 2 U 2 2 n+1. 3.1.1.3 k=1 k The distribution of Uk is related to the conditional error probability as follows: ll (1' k=1|e 49 and P(Uk = ole = t1) = X2ES P1(Xk) = 1 — “1k k 1k so that = — — E“ ' _ 1_€ — P(Uk ale — t1) — alk(l alk) , g — 0.1 where alk is the probability of the decision error that O = t2 when in fact 0 = t1 based on Xk alone. By a similar develOpment, _ _ _ v _ l-v _ P(Vk _\k|9 — t2) — a2k(l a2k) ,v — 0,1. The exact expression for the error probability for the majority decision on 2n+1 observations is _ 2n+1 _ _ Pe(d) — P(d(X ) — t2,O — t1) 2n+1 _ _ + P(d(X )— t1,O — t2) 2n+1 Pe(d) = plP kil Uk 2 n+1IO = t1 n+1 2 v 2 n+1IO = t2 3.1.1.4 50 U* 2n+1 Pe(d) = pl 2 kgz p(uk|e = tl,Uk_l)Po(UlIO = t1) V* 2n+1 + p2 2 kgz P(VkIO = t2,vk_l)Po(v1|® = t2). 3.1.1.5 where Po(° O) is the prior conditional mass function of the indicated random variable and 2°* is the sum over all sequences for which the sum of the indicated random vari— able exceeds n. The random variables Uk and Vk are both functions Of single, l-dependent random variables and so are, in turn, l-dependent random variables. Equation 3.1.1.5 can be written somewhat more com- pactly by defining conditional error probabilities a1k(Uk-l) = P(Uk = 1'9 = tl'Uk-l) and “2k(Vk—1) P(vk = lIO — tZ'Vk-1)° Now, P(Uk = 0|e = tl,Uk_l) = 1 - alk(Uk_l) and similarly P(Vk = 0'9 = tZ'Vk-l) = l ‘ a2k(Vk-1)' 51 If one further defines all(UO) = PO(U = lIO t1) and “21(Vo) = PO(V = lIO t2) then these newly defined quantities can be used in 3.1.1.5 to give U". 2n+1 U l-Uk Pe(d) = pl 2 kgl alk (Uk-l) l - alk(Uk-l) V* 2n+1 Vk 1-Vk + p2 Z knl “2k (Vk-l) l ‘ a2k(Vk-l) . The form Of 3.1.1.4, containing probabilities of sums of l-dependent random variables, leads one to apply the Central Limit Theorem. In particular, consider the following factor from 3.1.1.4: 2n+1 P E U 3 n+1]@ = t1 . k=1 k In order for the Central Limit Theorem for 1—dependent random variables to apply [F-4] the following three conditions are sufficient l. E(UkIO = t1) = 5 < m must exist for k = 1,2,... k 2. 2(lukl3le = t1) < w for k = 1,2,... 52 A = A must exist uniformly 1 IIMZS 3. lim l n n+m h k+h for all k, where A = 2cov1{Uk'Uk+l} + varl{Uk+l} k and where covi(-) and vari(-) denote the covariance and variance, respectively, of the argument when class i, i = 1,2, is active. The first condition can be seen to be satisfied by considering the expectation directly, 5(Ukle = t1) = o-Pwk = ole = t1) + 1-P(Uk = 1|e = t1) = elk for k = 1,2,... . In fact, from the eXpansion of the first moment, E{Uk|O = t1}, one can see that all moments and absolute moments of U about zero are equal to alk’ so that con— k dition 2 for the Central Limit Theorem is satisfied by the random variables {Uk}’ The expression for Ak reduces to terms containing the various conditional error probabilities already de- fined. The variance term is _ 2 _ _ —2 var1{Uk+1} ' E(|Uk+ll '9 ‘ t1) Uk+1 _ .- 2— — ‘ 0"1,k+1 a1,k+l ‘ 0‘1,k+1‘l “1,k+1)° 53 The covariance term is °°V1{Uk'Uk+1} = E{(Uk ' alk)(Uk+1 ’ °‘1,k+1’IO = E{UkUk+1IG = t1} ’ O‘1k"‘1,k+1° Writing the expectation of the product, 3{Ukuk+l|0 = t1} = o-o-P(Uk==o,Uk+l - oltl) + o-1eP(Uk==o,uk+l = 1|9 = + 1-O-P(Uk==1,uk+l = Oltl) + 1°1'P(Uk==l,Uk+l = 1|e — t = p(uk = 1,0k+l = 1|e = t1) — P(Uk+l 1|0 =«t1,Uk — 1) pmk - 1|e = t1) = 2,11; ° .. Combining the previous three equations, Ak = 2°°vl{Uk’Uk+l} + varl{Uk+l} = 2“ (l) ' “1k ' 20‘11:"‘1,k+1 + O‘1,k+1 ' (1 ‘ 1,k+l = t1} O‘1,k+1)' 54 If P.(X ) is stationary for i = 1,2, then the a. 1 k 1,k and d(j) are constant for all k and uniform convergence i'k n 00 1 . . . of the sequence {# hil Ak+h}n=lls assured. V1ew1ng the terms in the sequence as sample averages, one can see that the uniform convergence allows one to disregard an ini- tial finite number of Ak and the average of the remain- ing Ak remains unchanged. So when the XR are stationary, or under any other circumstance in which the sequence of sample averages on A converges uniformly, one can apply k the Central Limit Theorem for dependent random variables [F-4] to sums of U to Obtain k 2n+1 2n+1 P 2 U ) n+l|O = t1] + @[n+l, Z k,(2n+l)A a k=1 k k=1 1 where A is the uniform limit described in condition 3 and 0 is defined in 3.1.1.6. A similar result applies to the sum of V in 3.1.1.4 so that the following theorem k has been proved. Theorem 3.1.1.1 If (a) X1,X2,..., are stationary, 1-dependent random variables under either hypothesis of a two- hypothesis decision problem, (b) Uk and Vk are indicator functions, as defined in 3.1.1.2, for the decision based on Xk, and (c) Z¢* represents the sum of the function ¢ over all sequences of length 2n+1 which give a sum 2 n+1, 55 then the probability of error for the majority decision function after 2n+1 observations is U* 2n+1 pe(d) = p12 kgz P(Ukle = t1.uk_1)PO(U1|e = t1) V* 2n+1 + pz 2 n P(VkIO = t2,Vk_l)Po(VlIO = t2) k=2 0* 2n+1 Uk l-Uk = pl )3 II (11k "(UK-1) EI- - alk(Uk-1):l k=1 V* 2n+1 vk l-Vk + p2 2 II 02k (Vk’l) E'- - a2k‘ Vk-l’] . k=1 Further, 2n+1 Lim P (d) = p 0 n+1, 2 a ,(2n+l)A n+w e l k=1 1k 2n+1 + p20 n+1, E a2k,(2n+l)B k-l where co ¢(X.u.02) = f (21102)-;5 exP[-(y-uL/202]dyr 3.1.1.6 X alk and 02k are defined in 3.1.1.1, A is defined in condition 3 and B is defined similarly to A, that is, 56 1 n B = 1im — 2 B and n+w n h=l k+h Bk = 2cov2{Vk,Vk+1} + var2{Vk+1}. The above theorem holds whether the XR are contin- uous or discrete random variables. For continuous random variables, the definitions of Sij'aik' alk(Uk-l) and 02k(vk_1) involve appropriate integrals rather than dis- crete sums. However, the resulting Uk and Vk’ which are used in the theorem, are discrete in either case so that the analysis above holds. Further, the theorem implies convergence of the majority decision function regardless of the local decision function used on each individual observation. If the Xk were independent, then the expression for A would reduce to A k k = O‘1,k+1(1 ‘ O‘1,k+1) and Theorem 3 of Chu and Chueh is thereby Obtained as a special case. The asymptotic distribution of the probability of error provides a method for deciding on the number of observations required to achieve a particular level of performance. The Central Limit Theorem showed that 2n+1 2 n+1IO = t1] —E+ ¢[n+l,k:l alk,(2n+l)A Prob ZUk 3.1.1.7 57 and a similar statement holds for sums of the random vari- ables Vk' If the goal is to make Pe(d) < 5, then one must choose n at least large enough to force both terms of the type in 3.1.1.7 to be less than 5, because Pe(d) is the average of such terms. Defining 2n+1 a' = Z a k=1 1k and n+1 -a' /(2n+1)A one can use tabulated values to find 8 such that 00 ‘§? g exp(-x/2)dx = g and in turn solve for n in the definition of 8. 3.1.2 Bayes Majority Decision Functions Application of the previous theorem can require substantial computing effort to evaluate the conditional error probabilities, aik' for each decision region. Re- stricting the decision functions at each step to a class which is formally similar to a Bayes decision produces the Bayes majority decision function, whose precise definition follows shortly. Under the Bayes majority decision function, the expressions for the mean and 58 variance of the limiting distribution take on a much simplified form. This simplified form contains a factor which is derived from an upper bound on the Kolmogorov variational distance between the class distributions. In what follows, one needs to use a particular prOperty of the probability of error of individual decisions in a two-hypothesis decision problem. Using essentially the same notation as above, for a discrete random variable X let P1(-) and P2(-) denote the proba- bility mass functions for X given O t and O = t l 2' respectively. Also let _ . = = I S1 — {x . d(x) t1} and S2 S1 . Then on = Z P (x) and a = 2 P (x) l x582 l 2 xeSl 2 are the conditional probabilities of error for rule d. The overall probability of error, Pe’ is then P e plal + pzaz where pl + p2 = 1. e l in which case This implies that min(o:.l dz) 6 P s max(a a2) with I I equality holding only if a1 = a2, a1 = a2 = Pe' Def1ne 59 ‘g E m1n(alra2), a : max(al’a2) and n E E - g = Ial - a2] so that n denotes the length of the interval within which Pe is bounded. Theorem 3.1.2.1 mud I an» + an: * If ilp1(X) - P2(x)l a 25 then Pe s Proof: Case 1: pl 2 p2. By steps which are identical to those in Theorem 2 of Chu and Chueh [C-10] one gets 1 ‘ Pe 3 91‘8 + Pzaz + p1°‘1 = 91‘3 + Pe' Substituting a lower bound for Pe on the right gives and the Pe term can be isolated on the right by *The quantity §|Pl(x) - P2(x)|, or £|fl(x) - f2(x)|dx when x is a continuous r.v., is the Kolmogorov variational distance. 60 Next, add 5 to both sides 1 - p16 + E -‘g a Pe + 3. But Fe + E a 2 P8 and n = E - g so that l - p16 + n Z 2 Pe or 1 Pex< '2-(1 - p15 + U) (l - g + n). NIH and if p1 2 p2 then pl ;,% so P s Case 2: pl < p2. In this case 1 - Pe 2 p26 + Pe and an analysis of the type above with pl replaced by p2 again produces the conclusion of the theorem. Q.E.D. Still considering the two-class decision problem, a Bayes majority decision function is a majority decision function such that for every k = l,2,...,2n+l, the decision regions, Sik' are defined as s (xk)} and s = s ' lk = {xk ‘ qlkPlk(xk) > qZkPZk 2k 1k where qlk’qZk 2 0 and q1k + q2k = l. The qik need not be the true probability, pi, that O = ti’ nor need they be the Bayes posterior estimates of pi. 61 The least favorable distribution of O with respect to P1k(-) and P2k(-) is defined to be that set Of values for q1k and q2k wh1ch m1n1mizes nk = ak - Ek’ where ak = max(alk,a2k) and 2k = m1n(alk,a2k) and alk is de- fined in 3.1.1.1. This distribution is least favorable in the sense that, since 2k is a lower bound on Pe[dk(xk)), and since minimizing nk implies maximizing 3k (because alk and 02k are measures of complementary regions in the sample space), then minimizing nk maximizes the lower bound which 2k places on Pe(dk(xk)). However, minimizing nk does minimize the general upper bound of the previous theorem. For continuous random variables, the minimum value of nk is zero, provided that f1(x)if2(x). In the case of continuous random variables, the following theorem results: Theorem 3.1.2.2 2n+1 is a sequence of l-dependent, con- If (a) x tinuous random variables; (b) if d(x) is the Bayes majority decision function such that for k = l,...,2n+l, q1k and q2k are the least favorable distribution of O with respect to flk(xk) and f2k(xk); and (c) if for every k, fIf1k - f2k| > 26 > 0 then Lim Pe‘d) s 1im (n+l,(2n+l)e,3(2n+l)e(l-e)) w.p.l n+°° n+oo where 62 . Consequently, Pe(d)+0as n+m, w.p.1 .Mo: 6 = % - Proof: The least favorable distribution for con- tinuous random variables makes alk = a2k and n = 0, so by the above theorem aik 6 e < %, i = 1,2. The d can be replaced by e in Theorem 3.1.1.1 and the ik variance terms are bounded above by 35(1-6). That is, Ak = 20‘11<("‘1,k+1(1) ‘ O‘1,k+1) + O‘1,k+1‘l ’ O‘1,k+1) s 26(1-6) + 5(1-5) = 3€(l-€). Consequently 1 n Lim — Z A 6 38(1-8). n+m h=1 k+h A similar approach holds for Bk and B. Thus, both normal distribution functions in 3.1.1.6 are bounded above by n+1 - (2n+l)e /3(2n+l)e(l-€) 0,1 . ¢ n+1,(2n+1)€,(2n+1) - 36(1-8) = 4 Since 6 < %, the first argument is positive and 32 increases as n so the theorem follows. Q.E.D. 63 One encounters difficulty in attempting to apply the above proof technique to a similar theorem for dis- crete random variables. In particular, since n¢0 in the definition of least favorable distribution, it is not necessarily the case that proper choice of qlk and q2k can force both a. < A for i = 1,2. As an example, consider 1k 2 the following two discrete distributions on {0,1}. 0.8 Pl(0) 0.9 P2(0) ll 0 H ll 0 O N P1(1) P2(l) The full set of possible partitionings of the sample space into decision regions, and the resulting conditional error probabilities are: d1(xk) d2(xk> d3(xk) d4(xk) sl {0.1} {o} {1} {<1} 52 {4:} {1} {0} {0,1} a1 0.0 0.1 0.9 1.0 a2 1.0 0.8 0.2 0.0 For this example, no decision function exists for which both 01 and a2 are less than %. Consequently, substitution 64 of an upper bound for aik’ i = 1,2, in 3.1.1.6 is fruit- less. However, for sample spaces which can be partitioned so that ai < %, i = 1,2, the above theorem extends to discrete l-dependent distributions. 3.1.3 Asymptotic Error Probability for m Pattern Classes In the m-hypothesis case, 0 = {tl,...,tm}, the maxi- mum likelihood-Bayes decision regions for a single Obser- vation are defined as T. ={x:p.f.(x) =max p.f.(x)}, i=1,...,m l 11 j=lpooopm J J and the pairwise decision regions are = {X : pifi(x) a p fj(X)} ilj = lice-rm. S.. 13 J In the expressions above, fi(x) is the conditional proba- bility density of x given 0 = ti. The error probability for the Bayes decision procedure is P (Bayes) = X f p.f.(x)dx + f p.f.(x)dx . e i qikfik(xk)} and Sik = Sik where qik’qik a O and qik + qik = l. The decision for xk is: Decide O = ti if xkesik and decide 0 # ti other- wise. Then the above theorem says that if for all k, flfik - f? 1k| a 26 > 0 then Lim Pe(d) s Lim¢(n+1,(2n+1)e,(2n+1)t(1-e)) moo n+oo where e = % - %. Define the compound Bayes majority decisionyprocedure, dCB’ by making Bayes majority decisions for the mmdecisions O = ti vs. 0 # ti’ 1 = l,...,m, and deciding O = tj where j is the value of i for which the two-way Bayes majority decision was 0 = ti, provided that exactly one such j exists. Otherwise, no compound decision is made, but another Observation would be taken. 66 Theorem 3.1.3.1 If (a) x is a sequence of 1—dependent Observations in an m-class decision problem, and (b) if for all i = l,...,m and k = 1,2,..., flfik - f. 2 25 > o 1kl then {II-J “] lim P (dCB) < (m—1)1im[%(n+l,(2n+1)€p3(2n+l)€(l-€){]2 n+oo n+oo a.e. on» where e = % - Consequently 11m Pe(dCB) = 0 a.e. n+oo Pgoof: Of the decisions which dCB can make, m-l represent errors, and each of those results from exactly two errors in the two-way Bayes majority decision procedures. Since the above theorem applies to the probability of error of the Bayes majority decision procedures, the result follows. Corollary: Under the same hypotheses lim Pe(Bayes) s (m-l)lim[%(n+l,(2n+l)€,3(2n+l)€(l‘€)i]2 11+co n+m a.e. Proof: The Bayes decision procedure minimizes the error probability, so the error of any other decision procedure, such as dCB' provides an upper bound for Pe(Bayes). 67 3.2 Bounding the Probability of Error After a Finite Number of Samples In studying m-class hypothesis testing using de- pendent random variables, the results presented so far have shown the existence of a strongly consistent mini— mum distance estimator for the pattern class, that Bayes rule calculations of posterior distributions converge in the sense that eventually the mass of the posterior distribution lies at the point corresponding to the correct class index, and that the error probability of a suboptimum procedure--and consequently Of the Bayes procedure-~vanishes as the number of Observations grows without bound. Now it will be shown that an upper bound exists for the probability of error of the Bayes decision process, and that the upper bound decreases exponentially as the number of obser- vations increases. An exact expression for the upper bound is obtained as a function of the number of patterns observed. The expression is derived from information theoretic considerations, and it turns out that the inter- class Bhattacharyya coefficients are factors of terms in the bound. This presentation will use the general notation which applies to the class of m-hypothesis decision-making problems for which there is first order dependence be- tween successive Observations. The symbol synchronization problem with unknown source code length is a member of that class. For convenient reference, the Specification 68 of the m-class decision-making problem and the notation to be used will be briefly reviewed. This is the same problem and notation which was described in detail earlier. Let X1,X2,...,X be a sequence of identically k"" distributed discrete* random variables having first order dependence. The common distribution of Xk’ k = 1,2,..., depends on the pattern class, indexed by O = {t1,t2,...,tm} and the prior probability for the pattern class j is given by Prob(0==tj) = p. > 0. The parameter conditional 3 distribution of the Xi will be written Prob{Xk|O tj} = Pj(Xk), j = l,...,m; k 1,2,..., and the first order dependence provides that k—l)' Pj(xk|xl,x2,...,xk_l) = Pj(XkIX It is assumed that each class has a unique probability d1str1bution; that 13, Pj(xklxk_1) and Ph(Xk|Xk-l) are not identical for all values of the arguments when j#h. 3.2.1 An Information Theoretic Approach After k Observations, the amount of information contained in the sequence of random variables *The arguments to be set forth would apply to con- tinuous random variables as well. 69 Xk = X1,X2,...,Xk about 0 is the mutual information 1k = Ik(xk:0) = H(0) ' E(H(9|Xk))r where H(O) is Shannon's entropy: H(O) = j utna 9:) 1°9 (31‘) 1 j and pj = Pr0b(9=tj) as described above. Similarly H(OIXk) = "HHS P e=t.|xk log 1 k . 3 1 3 P(O=tj|X-) Logarithms are to the base 2, and throughout this chapter, E(-) denotes the expectation with respect to the joint distribution of Xk. The develOpment which follows is based on work done in 1964 by A. Rényi [R-3]. Rényi described the behavior of the average entropy for independent random variables and used his results to show the almost sure convergence of a decision procedure which was similar to the maximum likelihood decision procedure. Hellman, Raviv and others have called the expectation of the entrOpy E(H(lek)), the equivocation. Hellman and Raviv [H-S] showed that for the Bayes decision procedure, the probability of error, PB(e), is bounded above by one—half the equi- vocation, i.e., 70 PB(e) \< %E(H(lek)). They showed also that in the i.i.d. case, 3(H(e|xk)) s K(p**)k where p** is defined as follows. pgj = inf 2 Pi(x)aP.(x)l—a and Osasl x 3 p** = max pt. i¢j 13 When a = %, the argument of the infimum is called the inter- class Bhattacharyya coefficient, so that Hellman and Raviv's result for i.i.d. random variables defines an upper bound on the error probability of the Bayes decision procedure which is an exponentially decreasing function of the maximum interclass Bhattacharyya coefficient. In order to Obtain related results for dependent random variables, two lemmas must be established. Lemma 3.2.1.1 (Rényi) A universal constant C > 0 exists such that for any set p1,...,pm of positive numbers forming a probability distribution 71 The logarithm is to the base 2. Proof: This is Rényi's proof, although he used =1 a 2. x loglfil (l-x)1og[i%;] Both a and a are continuous x x in [0,1]. Define xloglil (l-x)log[I%§] C1 = max a and C2 = max a Osxsl x Osxsl x for 0 s a 6 1 Then by breaking the entropy expression into two parts, one gets bounds on each part of l m m p. log[-—] 1 . .2 Pj 109(57] = .2 3 p3 p? s c ? p9 j=2 3 3:2 a j 1 .=2 3 Pj 3 and 72 1 m 1 pl log —— = 1 - X p. log P -= J m l 3 2 1 - 2 p. -_ J 3—2 , ’ 1 m m l - : pj log 1 - .: pJ m a = J l. 3 2 2 p m a ._2 j 2 p. 3- j=2 3 M 0 n: T7 ":38 N ”'0 L14 Q I O N LJ "has N) ”O U Q The lemma follows with C = C Q.E.D. Lemma 3.2.1.2 If Y1,Y2,...,Y is a sequence of random vari- kl... ables having the Markov prOperty, and if E(Yk|Yk_l) exists and is bounded for all k, say E(Yk|Y s K , k = 2,3,... w.p.1 k-l) then k k-l E 2 Y. s E(Y1)K where K = E(Y2|Yl). i=1 1 Proof: Expanding the expression for the expectation of the product of k random variables having the Markov prOperty gives 73 k E .E Yi = 2 ...2 Yl-YZ...YkP(YkIYk_1)...P(y2|Yl)P(Yl|Y0) 1—1 Yl Yk 2 '.'§ Y1...Yk_1§ YkP(YkIYk_1)P(Yk_1IYk_2)... 1 k-l k P(YZIY1)P(Y1IY0) Y1...Yk_lE(Ylek_l)P(Yk_llYk_2)... 2 Yk-1 P(YZIY1)P(Y1|YO) /A M O O yl. . 'Yk-1KP(Yk-1|Yk-2) . . .P(Y2|Y1)P(Y1|YO) Oz Yk-l This process repeats k-2 more times to yield the lemma. Q.E.D. These lemmas will be used in proving the following theorem. Theorem 3.2.1.1 If (a) O is a discrete random variable taking on m different values tl,t2,...,tm with positive prior probabilities pj Prob(O = tj), j = 1,2,...,m; (b) the discrete random variables X1,X2,...,Xk,... have, for each j, identical conditional, given 74 O = tj, distributions with the Markov property; and (c) the conditional joint distributions of Xk_1,Xk g1ven O = tj versus 0 = th are different for each j#h, i.e., there is no value Of Xk_1 such that Ph(Xk|Xk_l) then there exist positive constants A and q < 1 such that O s E(H(O|Xk)) s Aqkm1 for k = 1,2,... h k - * Proof: Letting 9h denote the subset of the full probability space, 9, on which 0 = t (h = l,...,m), h the equivocation is expanded in terms of the parameter conditional equivocation as follows-** 3(H(e|xk)) = phE(H(OIXk)IQh) 3.2.1.1 "baa h 1 The Bayes posterior distribution of 0 given Xk, which is needed to evaluate the entropy factors in 3.2.1.1 is: *Korsh [K-6] proved a similar theorem using a different proof. **The notations E(- 0h) and E(° O = th) mean the same thing. They are alternative notations for the same con- ditional expectation. P(e = thXk) = i . k P. (X. |x. ) < —l H i 1 3.2.2.2 ph i=-l Ph(xi [xi-1) where the strict inequality holds for all finite k and h = 1,2,...,m. Here, Pj(X1IXO) stands for the prior conditional probability of X the first ll pattern. Now the entropy expression can be expanded, Lemma 3.2.1.1 applied to the expansion, and then the bound in 3.2.1.2 applied to the result: m H(O|Xk) = 2 P(O = t.IXk)1og 1 k j=l 3 p(0 = tle ) m s c 2 [P(O — t Ix )] i=1 j#h m p ak Pji(X|Xi_l)0L < C 2 H Ph (x. X. ) j=l ph i=1 1l i- -1 j#h for 0 s a s 1. 3.2.1.3 The function Pj(Xi|Xi_l) depends on the random vari- able Xi_1, so one can take the expectation of the 76 bound in 3.2.1.3 given 9h, recalling that Qh specifies the distribution of Xk. The expectation wanted is m [oiJOE E [Pj(X1|Xi_l)]d 9h . k EH(GIX)|Q] ll ('3 H043 m.u015 While 3.2.1.13 gives the strongest version of the theorem obtainable with this approach, there is no straightforward general algorithm for computing 3, so that 3.2.1.9 is probably easier to use except when the distributions have a convenient form. Also, pjh(3) is not necessarily algebraically less than pjh(%) so 3.2.1.9 might = thr give smaller values than 3.2.1.13 in some cases. 3.2.2 Bounds Based on the Distance Between Distributions Several investigators [K-l, K-2, L-2] have used the Bhattacharyya coefficient to bound the probability of error of a maximum likelihood rule. Kadota and Shepp [K-l] and other statistical literature call this quantity the Hellinger integral. This coefficient is an inner product of two hypothesis conditional probability densities. Using 0 to denote the Bhattacharyya coefficient for two—class 81 hypothesis testing, the definition of p for discrete distributions is _ ‘1 p — (Pl(x)P2(x)) Z X where Pi(X), i = 1,2, is the mass function under hypothesis Hi. From Schwarz's inequality one can see that 0 < p 6 1 since 2P(X)P(X)1"1s 2P(X)P(X)1”’=1 and the arguments are non—negative over their domains. Several functions of p can be used to describe a "distance" between the density functions, with - log 0 being a favorite since it is non-negative and - log 0 = 0 when P1(X) E P2(X). Since the Bhattacharyya coefficient has played such a significant role in the recent literature, several results will be presented in order to show its role in providing error bounds for m—class maximum likeli- hood hypothesis testing using dependent random variables. In m-class hypothesis testing with m > 2, the inter- class Bhattacharyya coefficient, pij’ is given by = g ..= p.. (Pi(X)Pj(X)) for 1,3 l,...,m. Z 13 X 82 The derivations of error bounds related to the Bhattacharyya coefficient are designed by treating all of the Observed patterns as a single pattern, rather than defining a bound that is a function of the number of patterns. This is in distinct contrast to the kinds of bounds derived by using either majority decision functions or equivocation. Further, it changes the mathematical treatment of the dependence to a matter Of the computation of 0, making that computation more complex as the number of observations increases. In order to use the Bhattacharyya coefficient approach to bound the error probability after the k-th pattern in a sequence one must let the argument of Pi(') be Xk, a sequence of real variables whose possible values are the set of possible sequences of the first k patterns. When the patterns have first order dependence, the proba- bility pi(xk) is given by k Pi(Xk) = 1 Pi(xylxy-1) where Pi(Xl|X0) is the prior probability of X1 under hypothesis Hi’ and Xk = X1,X2,...,Xk. Then the interclass Bhattacharyya coefficient between class i and class j after k Observations is given by k = 2 H P.(X IX _ Xk y=1 1 Y Y 1 (k) k on P. X 3.2.2.]- 913 ) 3( ley_1) 83 When each pattern, Xy' has n binary digits then X k Y has 2n possible values, and the sum in 3.2.2.1 has 2n (k) ij summing m(m-l)-2nk terms. While this technique does not terms for each 0 . Computing all pig) for i < j requires provide a closed form solution for the error bound as a function of the number Of observations, it does suggest computerized experiments to determine a sequence of error bounds, subject to whatever limits exist on computing resources . (k) The computational technique for evaluating pij can take a recursive form evidenced by expanding 3.2.2.1 as (k) _ 8 pij — 2 2 (Pi(xklxk_l)Pj(xk|xk_l)) . x x k k-l % 2 (Pi(Xk_1IXk_2)Pj(Xk_lIXk_2)) ... x k-2 % 2 (Pi(X3|X2)Pj(X3IX2)) . x 2 t X (Pi(X2IXl)Pj(XZIXl)Pi(XlIX0)Pj(XlIX0)) 1 3.2.2.2 Defining the factors in the equation above _ % rij(X2) — é (Pi(x2|xl)Pj(lexl)Pi(x1|xo)Pj(xllx0)) 1 84 and r..(X ) = 2 (P.(x |x )P (x |x ))5r (x ) 1] k X 1 k ,k-l j k k-l ij k-l k-l for k > 2 gives (k) _ pij — Z r1j(Xk) for k 2 2 X k . . . (k) - W1th th1s techn1que, one can compute the pij sequent1ally by saving the 2n values of rij(Xk) at each stage for use at the next stage. The use of pig) in computing error bounds will now 1] be considered. The quantity pij for a single observation appeared in Theorem 3.2.1.1 and it was pointed out that for the i.i.d. case the theorem became E[H(OIXk)) s Ap*k where p*Emax pij and the superscript represents exponen- i753 tiation. If one starts with the approach of the previous (k) theorem and attempts to include pij in the expression for the error bound, then one obtains the following theorem. Theorem 3.2.2.1 Under the hypotheses of Theorem 3.2.1.1, m 1: (k) X . . j=1(pjph) oh] j#h 2(H(e|xk)) < c IIMB h 1 ‘ .AX.‘ I~.-. 1 85 Proof: The proof commences in the same manner as before through 3.2.1.4. Following 3.2.1.4, h ph E g :3;:1|:1' ; 8 ab = 2 E :3::1::1’1: LSPh(xk) i=1 h 13 i-l Xk i=1 h 1 1-1 1‘ I»; = 1k i:l(Pj(XiIXi-11Ph(XiIXi-l)) = phj(k) x so that m p. % E{H(O|Xk)|0h} < C 2 —l] ph.(k)° j=1 ph 3 j#h Therefore k m m 31 3 (k) E(H(O]X )) < E phC Z phj 1 j=1 j#h which is the result stated in the theorem. Q.E.D. While this does show the role of the interclass Bhattacharyya coefficient for k Observations, it is not as tight a bound as Lainiotis [L-3] has achieved working from Chu and Chueh's premise that P < 2 P (iqj) . e iumuno mo nonasz mm mm VN NN 0N ma 9.” «A NH OH w w v N . q 4 q d 4 4 \J 3 4 .xx: u in . Asia u in v.0 m.o v.0 5.0 0.0 «.0 o.H 96 Table 4.1 Symbol Generating Probabilities. _g — Source Source Code Source Probability ...... .. 2 l l 0 0.75 l 0.25 00 0.127 2 2 01 0.375 10 0.375 11 0.125 000 0.09375 001 0.125 010 0.15625 3 3 011 0.125 100 0.125 101 0.09375 110 0.09375 111 0.1875 Symbol selections by the sources mentioned in the example of Figure 4.1 are governed by the respective distribution shown above. 97 of Xk be at least the longest symbol code length. Figure 4.1 shows the posterior conditional probability that each source is active as a function of the number of observations. The ordinate intersections represent the prior probabilities. After any particular Observation, the decision procedure decides that the source with the greatest posterior probability is active. Figure 4.1 illustrates that the initial bias, im- posed by the prior distributions, can be overcome by this decision process. The discussion in Chapter III suggests that the distance between the probability distributions of the Observations, which is influenced by the source distributions, should affect the number of Observations required to overcome this initial bias. Posterior distri- butions for the synchronization instant P(Tkl2,Xk) were computed as well. In the first example with m2 = l, Tk is l for all k, so that once the source is correctly decided the synchronization instants are obvious. In the second example with 2 = 2 and m = 2, the true value of 2 Tk alternates between 1 and 2. Since the prior proba- bility for the correct value of T was 0.714, the posterior l probabilities P(Tkl2 = 2,Xk) produced correct decisions for all k. In the third example with 2 = 3 and m = 3, 2 the prior probability for the correct value of T1 was 0.222, vs. values of 0.333 and 0.445 for the other possi- bilities for T1' This, combined with the probability '..a .N_ I 1' 1. '5.‘ 4 98 distribution for source 3, led to needing over 20 pattern Observations before correct synchronization decisions were made, even though the source decision was correct after every observation. 4.2 The Binary Symmetric Channel The binary symmetric channel illustrates the notion l of an unknown channel parameter. As the name implies, the noise in the channel has the net effect of changing a code digit one to a digit zero with probability p and of chang— 1 ing a code digit zero to a one with probability p, as Em diagrammed in Figure 4.2. Thus p is the probability that a binary digit is complemented as it passes through the channel, and will be called the complementation rate. Figure 4.2 The Binary Symmetric Channel A digit passes through the channel unchanged with proba- bility l-p. As in the data generation model previously considered, the channel is connected to a source which generates a binary digit code of length m and transmits 21' the codes in a continuous stream. The receiver processes 99 the output as n binary digit patterns, Xk, k = 1,2,..., n 2 m2 for all 2. Yk will denote the input which pro- duced the output Xk. Both Xk and Yk can be any one of the 2n vectors having n binary elements, but Xk is not neces- sarily equal to Yk because of the noise properties of the channel. The quantity p is an unknown parameter. The full set of assumptions for this special case follows. 1. Patterns of n binary elements, Xk’ are received through the binary symmetric channel of Figure 4.2. 2. The value of the parameter p is unknown. 3. Changes of distinct binary digits in a pattern are mutually independent. 4. Given the index of the active source, 2, and synchronization instant, T the distribution kl of Yk at the 1nput, denoted by Pin(Yk|2,Tk), 18 known. Define C(kak) as the number of bit positions in which Yk and Xk differ; C(Yk,Xk) is the Hamming distance between Y k and Xk' Let Zj, je{l,2,...,2h}, denote the base 10 values corresponding to the strings of base 2 digits which can be assumed by Yk and Xk' The probability of receiving Xk = Zj when Yk = 21 was transm1tted 1s ‘Wwo- . 100 C(Z ,Z.) n-C(Z ,Z ) P(Xk = ZjIY = Z-) = P l J (l-p) l J i,je{l,2,...,2n} The conditional probability of receiving a particular vector, Xk = Zj, can be represented by P(xk zjlp.2,Tk) n 2 = 1:1P1n(Yk = Zilp,2,Tk)P(Xk = szYk = Zi,p,2,Tk) , 2n = i:lPinwk = Zi|2,Tk)P(Xk = zjlyk = zi) n 2 C(Zi,Z.) n-C(Zi,Zj) = 1:1P1“(Yk = zi|2.Tk)p (l-p) ||l> i,je{l,2,...,2n}, 0 s p s 1, 2 = l,...,L, and Tk = l,...,2. This equation defines the quantity Qj(p,2,Tk) = P(Xk = Zjlp,2,Tk) as a polynomial in p of degree n. Given 2 and Tk’ there are 2n of these polynomials, one for each possible value of Xk' Theorem 2.5.1 says that if members of the family {P(X Ip,2,T )} are distinct, then strongly consistent estimators exist for p, 2, and Tk' Saying that members 101 of the family are distinct means that for any (a,b)?(c,d) there exists at least one value Zj such that P(Xk = Zjlp,2 = a,Tk = b)7€P(Xk = Zjlp,2 = c,T = d) when k those quantities are defined. As a result, the hypotheses of Theorem 2.5.1 require that no two synchronizations have exactly the same set of Zn polynomials Qj(p,2,Tk) for all 2n values of j. Equivalently, for (a,b)#(c,d) Qj(p,2 = a,Tk = b) must not have the same set of coef- k = d), j€{l,2,...,2n}, when é (a,b) and (c,d) are such that Qj(:) is defined. In turn, ficients as Qj(p,2 = c,T the coefficients of Qj(p,2,Tk) are defined above to be linear combinations of Pin(YkI2,Tk), so it is reason— able to look for a condition on the family {Pin1Ykll’Tk)2tTk which guarantees that linear combinations of members of that family are unique. Identifiability is such a sufficient condition. Consequently, one can proceed with the Bayes decision process to simultaneously decide the source, synchroni— zation, and channel complementation rate and assume that the sequences of decisions will converge provided that {Pin(Yk|£’Tk)}2,Tk is known to be an identifiable family [T-l, T-2, Y-3] or provided that the members can be shown to be linearly independent. The learning capability of the processor was demon- strated for a binary symmetric channel with unknown comple— mentation rate. The complementation rate was the parameter 102 to be learned. A Fortran language program, BSC MC--for Binary Symmetric Channel, Monte Carlo--provides appropri- ate data generation, channel simulation, and Bayes rule processing. For coding convenience, BSC MC processes sources whose code length is 3 binary digits and uses observations of 3 binary digits. BSC MC allows up to 10 discrete values for the channel complementation rate; two values, 0.05 vs. 0.1, were used in the examples which are reported here. The program is set to cut off after 500 observations. Then the prior distributions can be re- initialized for as many Monte Carlo iterations as one desires. As a check on the data generated, BSC MC maintains counts of the number of times each symbol is generated, the number of bits changed by the channel and the number of times each observation vector value is Observed. This last is used to compute the empirical mass function P(Xk) which can be used in an alternative minimum dis- tance estimate in which distance = min max I P(Xk) - P(X) '- 2,Tk X The decision procedure decides that the values of the minimization arguments for which the minimum is achieved give the source and synchronization. In the examples run, the optimum decision and the minimum distance decision ‘11:: sum _.’. '.- 103 agreed with respect to the source and synchronization, but not always with respect to the channel parameter (channel complementation rate). A second program, called BMCIND (for Binary symmetric channel, Monte garlo, INerendent) is identiCal to BSC MC except for a subroutine named POST which computes the - ‘1 (maxim posterior distributions. For BMCIND, POST computes the posterior distributions as if the Observations were inde- I pendent. After the k-th observation, the version of POST used with BMCIND computes 1r— P(xklp.2,Tk)P(p|2.Tk.xk_1) = :P(xklp.2.Tk)P(pl2.Tk.xk_1) P(pl2.Tk.xk) where the denominator is defined to be P(Xkl2,Tk), = P(Xkl2,Tk)P(TkI2,Xk_l) g P(XkT2,Tk)P(Tk|2,Xk_l) k P(Tkl2,Xk) where the denominator is defined to be P(Xkl2), = P(xk|2)P(2|xk_l) P(xklz)P(2[xk_17‘° P(2IXk) By contrast, the version of POST used in connection with BSCMC computes . k. k :P(Xk[Pp2,Tk.Xk_lIP(pr2,Tk,xk_l) 104 where the denominator is defined to be P(Xkl2,Tk,Xk_l), = P(Xkl2,Tk,Xk_1)P(TkI2,Xk_l) é P(xkl2,Tk,xk_l)P(Tk12,xk_l) k P(Tkl2,Xk) where the denominator is P(Xkl2,X and We k-l) ' P(xklz,xk_l)P(2|xk_l) — iP(Xk[2,Xk_1)P(2IX P(2le) k-l) My: .(.-"1 . u . 1' The "posterior distributions,’ computed in connection with BMCIND, when used with a maximum likelihood decision pro- cedure, learned the unknowns but appeared to converge less rapidly than did the Bayes processor. The quantities com- puted by the version of POST used with BMCIND coincided with the Bayes posterior conditional distributions only if the observations were independent. The motivation for using these quantities when the observations are dependent stems from a consideration of storage requirements and compu- tation volume, which is discussed next. In considering the storage requirements for computing the Bayes posterior conditional distributions for the pro— cessor described in Chapter II, one must first realize that there are 2n values of the conditional probability k-l P(Xkl2,Tk,X ) for each set of values of the conditional k l arguments. The dependence on X - is a result of the possibility that Tk#l, i.e., a symbol code begins in Xk-l 105 and ends in Xk as described in Section 2.1. Since suc- cessive symbol codes are independent, only the cases in which symbol codes overlap observations Xk-l and Xk will tend to affect the storage of the posterior conditional distributions. If Tk = 1, then the conditional probability is independent of X and F‘ k-l' k-l P(Xkl2,Tk,X ) = P(Xkl2,Tk). .-DL.) - m»: 11' ' ' ‘lxll If Tk = 2, then the conditional probability is dependent on the last m2 — 1 digits of Xk-l‘ In general, 1_ k... P(Xkl2,Tk,X 1) is dependent on the last d = [m — T g k + 1] mod mfi digits of X While d is in fact a function of m2 and k-l' Tk' writing d rather than d(m£,Tk) is convenient because d will be used as an exponent in subsequent expressions. Taking into account the above description of how the value of T describes the dependence on xk-l’ one can k k-l see that there are 2n+d values of P(Xkl2,Tk,X ) for each value of 2 and Tk' Given 2 = A, there are mA values of T so for each source, A, there are kI n+0 Z 2 = 2 + 2n+mA—1 n+1 + ... + 2 106 k.1) to compute and store, which values of P(Xkl2,Tk,X could tax the capacity of a very large computer. For example, suppose one of the sources which might be ob- served uses a symbol code length, including parity digits, of 9 binary digits. Suppose further that this is the longest possible source code length, so n is chosen to be 9. Then - 1) = 2 - 2 > 2 . This shows that just storing the conditional distribution for a single source could use up all of the immediately accessible, individually addressable core storage of the latest model computers. The usual storage requirement for a Markov chain having 29 states is 218 values, but the analysis based on the value of Tk was aimed at identifying redundant values that need not be stored. The technique used by subroutine POST in the BMCIND program would reduce the storage requirements to 29 in the case cited above, or a reduction factor of over 2(n—1). Table 4.2 summarizes the results of one computer experiment with the binary symmetric channel. In this instance, each of three sources used three binary digits in its symbol codes. The channel error rate was to be either 0.05 or 0.10. Ten runs were made, and after the 500-th observation a decision was made about the channel 107 error rate, about which source was active, and about the synchronization instant. Results of the optimum processor (BSC MC), one subOptimum processor (BMCIND), and the mini- mum distance decision described earlier in this section are compared. Table 4.2 Comparison of Three Decision Processes. Number of Correct Decisions in 10 Monte Carlo Runs [fiver Au,- 1 Channel Bit Active Synchronization Complementation Rate Source Instant BSC MC 4 10 10 BMCIND 2 10 10 Minimum Distance 6 10 10 The Bayes decision process, a Bayes-like process, and a minimum distance decision process were used to decide the bit complementation rate, source, and synchronization. In these experiments, the prior distributions were randomly generated before the first run, and the same prior distributions were re-established for each of the following 9 runs. It is heartening to note that BSC MC, which uses the Bayes posterior distribution, performs somewhat better than BMCIND, which uses the Bayes formula with the marginals of the conditional distributions. But the superior performance of the minimum distance estimator was unexpected. 108 Examination of the history of the posterior distri— butions computed by both BSC.MC and BMCIND shows that the decisions made after the earlier observations vary, but after about 25 observations, all subsequent decisions are identical, whether they are correct or not. This suggests that a type of bias develOps that is unlikely to be over- ruled if the probability distributions are very close. The posterior distributions of quantities that are quite distinctive,such as the source and synchronization in- stant in this case, tend to be capable of producing correct decisions even if the decisions regarding other quantities are incorrect. 4.3 Error Estimates and Bounds This section presents numerical results obtained by applying the various error bounds of Chapter III to a specific example. Details of the example were chosen for computational expediency rather than to represent a par- ticular application. Specifically, the example is not typical of the models one would expect in the symbol synchronization problem. In the example there are three pattern classes, for which a pattern is one binary digit. Each class has a stationary first order dependent distribution. This is intended to mean that for fixed 2 and m--2,me{0,l}--and any class i IA- _ Dv". ‘ Pi(xk = I”XX-1 109 = m) is constant for all k = 2,3,.... The values used for the class conditional joint proba- bility distributions are tabulated below. Pi(Xk = 0,xk_l = 0) Pi(xk = 0,xk_1 = l) P.(xk 1,xk_1 = 0) Pi(Xk = 1,xk_l = l) The values of i=1 0.40 i=2 i=3 0.30 0.10 §_1 5 1 s 0.05 0.20 * 0.45 0.30 0.20 0.40 the joint distributions lead to the sample conditional distributions below, which are used to obtain the upper bound based on the equivocation measure. P.(xk = Ole-l = 0) P.(Xk = 0|xk__l = 1) P.(Xk — 1|xk_1 = 0) Pi(Xk = l|xk_1 = 1 0.375 0.333 0.625 i=2 i=3 0.400 0.250 0.200 0.333 0.600 0.750 0.800 0.667 An important part of the equivocation bound argu— ment, 3.2.1.6 of Chapter III, states that 110 Q < l for j#h [Pj(xklxk_l)]t E h PhIXkTXk-l) A matrix of the values Obtained for the above expectations Obtained for this example is given below. h=1 h=2 h=3 {7 j=1 1.000 .968 .963 i j=2 .971 1.000 .988 j=3 .945 .987 1.000 The diagonal terms should obviously be 1 and serve as a check on the computation. The classes were assumed to be equally probable. The prior class conditional distribution of the first observation and the corresponding interclass Bhattacharyya coefficients are tabulated below. The values of Pi(xl = -) were chosen to make Pi(Xk) stationary with respect to k. 1=l i=2 1=3 P1(X1 = 0) 0.529 0.25 0.308 P.(X1 = 1) 0.471 0.75 0.692 Prior Class Conditional Distributions of the First Observation 111 013 = 031 = 0.974 023 = 032 = 0.997 Interclass Bhattacharyya Coefficients for the Distribution of the First Observation Assuming that C of Lemma 1, Chapter III, is not greater than 2, the value of A in 3.2.1.10, is 3.907, and the value of q in 3.2.1.9 is 0.988. So the error bound is Pe < 1.953 x 0.988k'1. Under this approach, one is very much at the mercy of the value of q, which in this case is very close to the maxi- mum possible value, one. Figure 4.3 shows that for this example the equivocation bound gives the largest values of all the techniques which were compared. Approximately 367 observations would be required in order for this error bound to assure an error rate of less than 0.05. Comparing the above analysis based on the equi- vocation bound with the asymptotic distribution from the majority decision function approach reveals sizable differ- ences in the error estimates. Using the form for the m pattern class case, one calculates um .1 2'." 3n n- 1) ._.__.r - 112 Equivocation Bound Lainiotis' Bound (Average Interclass Bhattacharyya Coefficients) Toussaint's Variational Distance Bound Asymptotic Distribution from Majority Decision Function 1 1, .1 1 l __J 5 10 15 20 25 30 Number of Observations Figure 4.3 Comparison of Error Bounds 113 1. m P? X = —_ X .P. x 1( k) 1 7 Pi j=1 p] 3( k) j#i and Obtains the values tabulated below. i=1 i=2 i=3 Pi~(xk = 0) 0.279 0.419 0.390 f. 1 A = i1 Pi(Xk 1) 0.721 0.581 0.610 S The next task is to find 6 such that fiiPi(xk) - P§(Xk)l ; 26 where the Kolmogorov variational distances on the left turn out to be 0.501, 0.337, and 0.164, respectively, for i = 1,2,3. Taking one-half the minimum distance for 6 gives 6 = 0.082, and e = — - — = 0.479. The values of the asymptotic distribution of the error probability as a function of the number of Observations is given in the lowest curve of Figure 4.3 labeled "Majority Decision Function Asymptotic Distribution." The bound suggested by Lainiotis in 3.2.2.4 is plotted as the middle curve in Figure 4.3. The values are all smaller than the values obtained from the equivocation method, plotted in the top curve, for the number of observations investigated. Whether this is true for large numbers of Observations is not clear, since 114 the form of the Lainiotis bound does not reveal its asymptotic behavior. A consequent advantage of the equivocation bound is that, once the coefficient and base are calculated, one can estimate values for large numbers of samples by using a slide rule. Conversely, the Lainiotis bound requires a rather tedious recalculation, best done on a digital computer, for each successive number of observations. As one comes to the error bound proposed by Toussaint, the storage requirements for the probability distributions threaten to be very costly. However, the bound was computed on the basis of three observations and came out to be 0.707, the smallest value provided by any of the analyses. The value is spotted on Figure 4.3. i ...5 ‘1!” . - . CHAPTER V CONCLUSIONS AND RECOMMENDATIONS 5.1 Summary of the Thesis r7 Pattern recognition techniques offer a choice be- tween unsupervised learning, which requires prior knowl— edge of the probability distributions governing the events producing the patterns, and supervised learning, which re- quires a set of training data for computing parameters of the decision function. This thesis has ignored supervised learning techniques and the problems that arise in deter- mining that the training data are sufficiently representa- tive of the total population to assure a low probability of error. Instead, this work has demonstrated that cer— tain statements about probability of error for m-class (m > 2) unsupervised pattern recognition can be extended, in an apprOpriately modified form, to problems involving statistical dependence. The motivating problem for this research, symbol synchronization for an unknown source code length, has been shown to be a problem in unsupervised learning with sta— tistically dependent data. Several solutions to this problem have been presented, including a Bayes decision 115 116 process and a stochastic approximation technique, both of which were shown to produce a sequence of decisions which converge to the correct decision. A subOptimum technique, formally like the Bayes process for independent random variables,was suggested, and empirical results were en- couraging. New results include applying the interclass Bhattacharyya coefficient to a sequence of observations having first order dependence in order to bound the proba— bility of error of the Bayes decision process. Other error 11"». Y bounds are presented, one based on the eXpectation of the additional information in successive samples and one based on the Kolmogorov variational distance. The asymptotic distribution of the error probability for a subOptimum process provides an in-the-limit statement about the error rate of the Bayes decision procedure. It is inter- esting to note that the Bhattacharyya coefficient appears in the information theoretic bound and the Kolmogorov variational distance plays a role in the asymptotic distri— bution of the error probability. The probability of error statements which are applied to decision problems using dependent random variables have turned out to use these two measures on the distributions, the Central Limit Theorem and expectations of products of a very limited class of dependent random variables. In all of this, the computational expediency of the resulting algorithms has been a foremost consideration. 'II'I'II l1 JII 4}]. .l.ll I 'I'I. 117 5.2 Recommendations for Continued Research Unsupervised learning methodology would benefit from an influx of new decision processes which could be used as alternatives to the Bayes decision process. The majority decision techniques may be an important step in this direction. Chu and Chueh recommended that the error lg“- estimates derived from the majority decision process be ” considered as upper bounds on the error of the Bayes pro- , cess, but that decisions should be made by the Bayes pro- cess in order to achieve the minimum decision error probability. An alternative approach would be to use :— easily computed local decisions as each pattern is ob- served and refine the estimate of the error probability based on properties of those local decision functions. This is essentially the philOSOphy behind the Bayes majority decision function, and others might be feasible. The objective, of course, is to reduce the computational burden imposed by the Bayes decision process but with a technique whose convergence and error properties could be well defined. The "posterior distributions" described in Chapter IV in connection with the program called BMCIND provided a computational advantage to the Bayes posterior distributions, but the convergence properties of the BMCIND process need to be determined. The BMCIND process was a semi-Bayes process which would be the Bayes decision process if the observations were independent. As work moves toward decision processes which are more and more unlike the Bayes decision process,it will most likely be 118 conducted by considering i.i.d. random variables at first, with the hOpe of generalizing to dependent random variables at a later time. Since the Bhattacharyya coefficient has proved to have an important relationship to the error probability for the Bayes decision process, additional studies of its r“ properties might prove useful. In particular one might be . able to describe the class of distributions for which pij(k) defined by 3.2.2.1 is a monotonically decreasing function of k. One could also attempt to determine I..."M"D’.§‘;S -.‘ 0.- - -'. relationships between the Bhattacharyya coefficient and error rates for non-Bayes decision procedures in either a supervised or an unsupervised mode. Supervised learning techniques include the use of a linear combination of functions called potential functions [A-Z, B-l, P-4] to approximate unknown proba- bility distributions. Unsupervised learning has tra- ditionally proceeded by assuming that, if one did not have classified training data, then one would assume that the probability distributions for the pattern classes are known to provide a starting point around which a decision process can be built. Knowing the probability distributions has been a demanding assumption for unsupervised learning. It is tempting to try to relax that assumption, and per- haps that could be done by using potential functions. The first task would be to determine whether there are 119 any conditions for the traditional unsupervised learning problem (in which all pattern classes are represented randomly in the data) under which combinations of potential functions would provide useful estimates of unknown pattern class distributions. In general the properties of supervised learning I .1! 8 techniques using dependent observations are not described in existing literature. The work of C. K. Chow is an 'I “(Jr 9 mum's?) ' '1 exception, of course. Aside from the mathematical com- plexities introduced by considering dependent random vari- ables, there are problems in that a large number of dependency models are candidates for consideration. Chow has had some success in applying information theoretic methods to this type of problem. Perhaps other methods could be fruitful as well. BIBLIOGRAPHY [A-l] [A-Zl [A-3] [A-4] [A-S] [A-6] [A-7] [B-l] BIBLIOGRAPHY Abramson, N., and D. Braverman, "Learning to Recog- nize Patterns in a Random Environment," IRE Trans., Vol. IT-8, pp. 558—563, 1962. Aizerman, M. A., E. M. Brauerman, and L. I. Rozonoer, "Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning," Automatica i Telemekhanika, Vol. 25, June 1964, trans. Jan. 1965, pp. 821-837. Albert, A. "A Mathematical Theory of Pattern Recognition," Ann. Math. Stat., Vol. 34, pp. 284—299, March, 1963. Albert, Arthur E., and Leland A. Gardner, Jr. Stochastic Approximation and Nonlinear Regression, MIT Press, Cambridge, 1967. Amari, Shunichi, "A Theory of Adaptive Pattern Classifiers," IEEE Trans., Vol. EC-l6, No. 3, pp. 299-307, June 1967. Anderson, T. W., and R. R. Bahadur, "Classification Into Two Multivariate Normal Distributions with Different Covariance Matrices," Ann. Math. Stat., Vol. 33, pp. 420-431, June 1962. Aoki, Masanao, thimization of Stochastic Systems, Academic Press, New York, 1967. Bashkirov, O. A., E. M. Braverman, and I. B. Muchnik, "Potential Function Algorithms for Pattern Recognition Learning Machines," Automatic Machines and Remote Control, Vol. 25, No. 5, pp. 629-631, March, 1964, trans. from Aut. i Telemekh., Vol. 25, No. 5, pp. 692-695, May 1964. 120 . l [B-Zl [B-3] [C-l] [C-Z] [c—3] [C-4] [C-5] [C-6] [C-7] [C-8] [C-9] [C-lO] 121 Blaydon, Colin C.,"Recursive Algorithms for Pattern Classification," Thesis, Harvard University, Div. of Engineering and Applied Physics, March 1967. Bledsoe, W. W., "Some Results on Multicategory Pattern Recognition," J. ACM, Vol. 13, No. 2, pp. 304-316, April 1966. Chadwick, Henry D., and Ludwik Kurz, "Two Sequential Nonparametric Detection Procedures," Info. & Con— trol, Vol. 13, No. 5, pp. 403—428, November 1968. Chien, Y. T., and K. S. Fu, "On Bayesian Learning and Stochastic Approximation," IEEE Trans., Vol. SSC-3, No. 1, pp. 28-38, June 1967. Chien, Y. T., and K. S. Fu, "Selection and Order- ing of Feature Observation in a Pattern-Recognition System," Info. & Control, Vol. 12, Nos. 5/6, pp. 394-414, May-June 1968. Choi, Keewhan, "Estimators for the Parameters of a Finite Mixture of Distributions," Annals of the Inst. of Stat. Math., Tokyo, Vol. 21, No. 1, pp. 107-116, 1969. Choi, Keewhan, "Empirical Bayes Procedure for (Pattern) Classification with Stochastic Learning," Ann. of the Inst. of Stat. Math., Tokyo, Vol. 21, No. 1, pp. 117-125, 1969. Chow, C. K., "A Recognition Method Using Neighbor Dependence," IRE Trans., Vol. EC-ll, pp. 683-690, Oct. 1962. Chow, C. K., "Statistical Independence and Threshold Functions," IEEE Trans., Vol. EC-14, No. 1, pp. 66-68, Feb. 1965. Chow, C. K., and C. N. Kiu, Approximating Discrete Probability Distributions with Dependence Tree, IBM Research Report RC-1816, May 4, 1967. Chu, J. T., "Optimal Decision Functions for Com— puter Character Recognition," J. of ACM, Vol. 12, No. 2, pp. 213-226, 1965. Chu, J. T., and J. C. Chueh, "Error Probability in Decision Functions for Character Recognition," J.gf ACM, Vol. 14, No. 2, pp. 273-280, April 1967. [C-11] [C-12] [C-13] [C-14] [D-ll [D-2] [D-3] [D-4] [D-5] [F-l] [F-2] 122 Cooper, P. W., "The Hyperplane in Pattern Recog— nition," Cybernetica, Vol. 4, pp. 215-218, 1962. Cooper, D. B., and P. W. Cooper, "Nonsupervised Adaptive Signal Detection and Pattern Recognition," Info. & Control, Vol. 7, pp. 416—444, 1964. COOper, P. W., "Hyperplanes, Hyperspheres, and Hyperquadrics as Decision Boundaries," Computer and Information Sciences, Tou and Wilcox, Eds., Spartan, Washington, D. C., 1964. 9» Cover, Thomas M., "Geometrical and Statistical PrOperties of Systems of Linear Inequalities with Applications in Pattern Recognition," IEEE Trans., Vol. EC-14, No. 3, pp. 326-334, June 1965. Deely, J. J., and R. L. Kruse, "Construction of Sequences Estimating the Mixing Distribution," 1 Ann. Math. Statist., Vol. 39, No. 1, pp. 286-288, 1968. I .' .'. Dubes, R. C., The Theory of Applied Probability, Prentice-Hall, Englewood Cliffs, N. J., 1968. Dubes, R. C., and P. J. Donoghue, "Bayesian Learn- ing in Markov Chains with Observable States," Interim Report No. 5, Contract No. AFOSR-1023-67B, Division of Engineering Research, Michigan State University. Dubes, R. C., and E. Panayirci, "Pattern Recognition with Continuous Parameter, Observable Markov Chains," Interim Report No. 10, Contract No. AFOSR-1023-67B, Division of Engineering Research, Michigan State University. Duda, R. 0., and H. Fossum, "Pattern Classifi- cation by Iteratively Determined Linear and Piecewise Linear Discriminant Functions," IEEE Trans., Vol. EC-lS, No. 2, pp. 220-232. Farrell, J. L., and J. C. Murtha, "Statistical Band Synchronization in Digital Communication," Proceedings of the Univ. of Missouri, Rolla, Mervin J. Kellngommunications Conference, Oct. ‘- 1970, pp. 3—5-1 through 3-5-6. Feller, W., An Introduction to Probability Theory and Its Appligations, John Wiley & Sons, Inc., New York, 1957. [F-3] [F-4] [F-5] [G-l] [H-l] [H-2] [H-3] [H-4] [H-5] [H-6] [H-7] [H-8] [I-l] 123 Friedman, H. D., "On the Expected Error in the Probability of Misclassification," Proc. IEEE, Vol. 53, pp. 658-659, June 1965. Fraser, D. A. S., Non-Parametric Methods in Statistics, Wiley, New York, 1957. Fu, K. S., Sequential Methods in Pattern Recog- nition and Machine Learning, Academic Press, New York, 1968. Good, I. J., The Estimation of Probabilities, Research Monograph No. 30, MIT Press, Cambridge, 1965. .1): Hancock, J. C., and T. L. Stewart, "Parameter Estimation with Unknown Symbol Synchronization," Thesis, Purdue University, January 1967. “TI-- 5:?) " .. Hancock, J. C., and P. A. Wintz, Signal Detection Theory, McGraw-Hill, New York, 1966. Harnett, W. E., "Generalization of Tests for Certain PrOperties of Variable-Length Codes," Info. & Control, Vol. 13, No. 1, pp. 20-24, July 1968. Hellman, M. E., and T. M. Cover, "Learning with Finite Memory, Ann. Math. Stat., Vol. 41, No. 3, pp. 765-782, June 1970. Hellman, Martin E., and Josef Raviv, "Proba- bility of Error, Equivocation, and the Chernoff Bound," IEEE Trans., Vol. IT-16, No. 4, pp. 368-372, July 1970. Hilborn, C. G., and D. G. Lainiotis, "Optimum Unsupervised Learning Multicategory Dependent Hypothesis Pattern Recognition," IEEE Trans., Vol. IT-14, No. 3, pp. 468-470, May 1968. Ho, and Agrawala, "Summary of the State of the Art in Pattern Recognition," Proc. IEEE., Vol. 56, No. 12, pp. 2101-2114, Dec. 1968. Hoeffding, W., and H. Robbins, "The Central Limit Theorem for Dependent Random Variables, Duke Math 10, V01. 15' pp. 773-780, 1948. Irani, K. B., ”A Finite-Memory Adaptive Pattern Recognizer," IEEE Trans., Vol. SSC-4, No. 1, pp. 2-11, March 1968. [K-l] [K-2] [K-3] [K-4] [K-Sl [K-6] [K-7] [L-l] [L-Zl [L-3] [L-4] [M-l] [M-2] 124 Kadota, T. T., and L. A. Shepp, "On the Best Set of Linear Observables for Discriminating Two Gaussian Signals," IEEE Trans., Vol. IT-13, No. 3, pp. 278-284, April 1967. Kailath, Thomas, "The Divergence and Bhattacharyya Distance Measures in Signal Selection," IEEE Trans., Vol. COM-15, No. 1, pp. 52-60, February 1967. Kazmierczak, H., and K. Steinbuch, "Adaptive Systems in Pattern Recognition," IEEE Trans., Vol. {9 EC-12, No. 6, pp. 822-835, Dec. 1963. i Keehn, Daniel G., "A Note on Learning for Gaussian Properties," IEEE Trans., Vol. IT-ll, No. 1, pp. 126-132, Jan. 1965. Kobayashi, H., and J. B. Thomas, "Distance Measures and Related Criteria," Proceedings of the Fifth Allerton Conference on Cirguit and 1_ System Theory, pp. 491-500, Oct. 1967. Korsh, James F. "On Decisions and Information Concerning an Unknown Parameter," Info. & Control, Vol. 16, pp. 123-127, 1970. Ku, Harry S., and Solomon Kullback, "Approxi- mating Discrete Probability Distributions," IEEE Trans., Vol. IT-15, No. 4, pp. 444-447, July 1969. Lainiotis, D. G., "Optimal Feature Extraction in Pattern Recognition," IEEE Intern. Info. Theory Symp. Abstracts, San Remo, Italy, Sept. 1967. Lainiotis, Demetrios G., "On a General Relationship Between Estimation, Detection, and the Bhattacharyya Coefficient," IEEE Trans., Vol. IT-lS, No. 4, pp. 504-505, July 1969. Lainiotis, D. G. "A Class of Upper Bounds on Probability of Error for Multihypothesis Pattern Recognition," IEEE Trans., Vol. IT-lS, No. 6, pp. 730-731, Nov. 1969. Lainiotis, D. G., and S. K. Park, "Probability of Error Bounds," IEEE Trans., Vol. SMC-l, No. 2, pp. 175-178, April 19 1. Martin, J. J., Bayesian Decision Problems and Markov Chains, Wiley, New York, 1967. Martin, James. Telecommunications and the Com- puter, Prentice-Hall, Englewood Cliffs, N.J., 1969. [M-3] [M-4] [M-5] [M-6] [M-7] [M-8] [N-l] [N-Zl [P-l] [P-2] [P-3] [P-4] [R-l] [R-Zl 125 McBride, Alan L., and Andrew P. Sage, "Optimum Estimation of Bit Synchronization," IEEE Trans., Vol. ABS-5, No. 3, pp. 525-536, May 1969. McBride, Alan L., and Andrew P. Sage, "On Dis- crete Sequential Estimation of Bit Synchronization," IEEE Com., Vol. 18, No. 1, Feb. 1970. McLendon, J. R. "A Pseudo Bayes Approach to Digital Detection and Likelihood Ratio Computation," Thesis, Southern Methodist University, Dec. 1969. Meisel, William S. "Potential Functions in Mathe- matical Pattern Recognition," IEEE Trans., Vol. C-18, No. 10, pp. 911-918, Oct. 1969. Middleton, D., "On the Detection of Random Signals in Additive Normal Noise--Part 1," IREE Trans., Vol. IT-3, No. 2, pp. 86-122, June 1957. m‘ _ . J... I Middleton, D., An Introduction to Statistical Com- munication Theory, McGraw-Hill, New York, 1960. "State of the Art in Pattern Recog- IEEE., Vol. 56, pp. 836-862, 1968. Nagy, G., nition," Proc. Nilsson, N. J., Learning Machines, McGraw—Hill, New York, 1965. Panayirci, E., "Bayesian Decision Making and Learn- ing for Continuous-Time Markov Systems," Ph.D. dissertation, Michigan State University, 1970. Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, 1965. Patrick, E. A., and J. C. Hancock, "Nonsupervised Sequential Classification and Recognition of Patterns," IEEE Trans., Vol. IT-12, No. 3, pp. 362-372, July 1966. Pitt, J. M., and B. F. Womack, "Additional Features of an Adaptive, Multicategory Pattern Classifi- cation System," IEEE Trans., Vol. SSC-S, No. 3, pp. 183-191, July 1969. Raviv, J., "Decision Making in Incompletely Known Stochastic Systems," Int. J. Eng. Sci., Vol. 3, pp. 119-140, 1965. Raviv, J. "Decision Making in Markov Chains Applied to the Problem of Pattern Recognition," IEEE Trag§., Vol. IT-3, No. 4, pp. 536-551, Oct. 1967. [R-3] [R-4] [R-5] [R-6] [R-7] [S-l] [5-2] [5’3] [S-4] [5‘5] [S-6] [T-l] 126 Rényi, Alfréd, "On the Amount of Information Con- cerning an Unknown Parameter in a Sequence of Observations," Ma ar TudgAkad. Mat. Kutaté Int. K621., Vol. 9, pp. 617-625, 1964. Rényi, A., "On the Amount of Missing Information and the Neyman-Pearson Lemma," Research Papers in Statistics Festschrift for J. Neyman, Wiley, New York, 1966. Rényi, A. "On Some Basic Problems of Statistics {“ from the Point of View of Information Theory," Proc. 5th Berkeley Symp. on Math. Stat. and Prop., Vol. 1, Berkeley, Calif., U. of Cal. Press, 1967. Robbins, Herbert, "The Empirical Bayes Approach to , Statistical Decision Problems," Ann. Math. Statist., Vol. 35, pp. 1-20, 1964. Rolph, John E., "Bayesian Estimation of Mixing Distribution," Columbia Univ. Sebestyen, George S., Decision-Making Processes in Pattern Recognition, Macmillan, New York, 1962. Sebestyen, G., and J. Edie, "An Algorithm for Non-Parametric Pattern Recognition," IEEE Trans., Vol. EC-lS, No. 6, pp. 908-915, Dec. 1966. Signori, D. T., Jr., "Estimation and Adaptive Decision Making for Partially Observable Markov Systems," Ph.D. dissertation, Michigan State Uni- versity, 1968. Specht, Donald F., "Generation of Polynomial Dis— criminant Functions for Pattern Recognition," IEEE Trans., Vol. EC-l6, No. 3, June 1967. "Learning Without a Teacher," IT-IZ’ NO. 2, pp. 273-230, Spragins, J. D. IEEE Trans., Vol. April 1966. Spragins, J. D. Rpproducing Distributions for Machine Learning, Stanford Electronic Lab, Tech. Report No. 6103-7, Nov. 1963. Tallis, G. M., "The Identifiability of Mixtures of Distributions," J. Appl. Prob., Vol. 6, pp. 398, 1969. 389- [T-2] [T-3] [T-4] [W-l] [Y-l] [Y-2] [Y-3] [Y-4] 127 Teicher, H., "Identifiability of Mixtures," Ann. Math. Stat., Vol. 32, pp. 244-248, 1961. Toussaint, G. T., "Some Upper Bounds on Error Probability for Multiclass Feature Selection," IEEE Trans., Vol. C-20, No. 8, pp. 943-944, Aug. 1971. Tucker, Howard G., A Graduate Course in Probability, Academic Press, New York, 1967. Wainstein, L. A., and V. D. Zubakov, Extraction of Signals from Noise, Prentice-Hall, Englewood Cliffs, N.J., 1962. Yakowitz, Sidney J., "A Consistent Estimator for the Identification of Finite Mixtures," Ann. Math. Stat., Vol. 40, No. 5, pp. 1728-1735, 1969. ‘-.'.'\‘_"‘) H '1 )Li-"fn‘ :- f - J‘ Yakowitz, Sidney J., "Unsupervised Learning and the Identification of Finite Mixtures," IEEE Trans., Vol. IT-16, No. 3, pp. 330-338, May 1970. Yakowitz, Sidney J., and John D. Spragins, "On the Identifiability of Finite Mixtures," Ann. Math. Stat., Vol. 39, No. 1, pp. 209-214, 1968. Yau, S. 8., and J. M. Schumpert, "Design of Pattern Classifiers with the Updating Property Using Stochastic Approximation Techniques., IEEE Trans., Vol. C-l7, No. 9, pp. 861-872. Sept. 1968. WWW W “11111111111111 WIVES 3 1293 03056 5984