DISCRIMINANT GRAMMARS FOR PATTERN CLASSIFICATION

Dissertation for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
ALAN JAMES FILIPSKI
1976

This is to certify that the thesis entitled "Discriminant Grammars for Pattern Classification" presented by Alan James Filipski has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major professor

Date: May 17, 1976

ABSTRACT

DISCRIMINANT GRAMMARS FOR PATTERN CLASSIFICATION

By Alan James Filipski

This thesis introduces the idea of a "discriminant grammar" as a tool for syntactic pattern recognition. A discriminant grammar is defined as a context-free, unambiguous grammar together with a mapping from the productions into the reals. We associate a number with each sentence generated by the grammar by adding the numbers corresponding to each use of a production in the derivation of the sentence. The language is partitioned into three decision regions by comparing this sum to a cutpoint. It is shown that a discriminant grammar may be used to provide a sufficient statistic for decision between two stochastic grammars. Results are given in terms of a Bayesian formulation and a sequential scheme using top-down parsing and LL(k) grammars. It is shown that regular discriminant grammars can yield decision regions that are not recursively enumerable, but if all coefficients are rational, the regions are context-free. Results are obtained concerning the relationship of regular discriminant grammars to probabilistic automata. The use of perceptron-like methods to learn the coefficients from empirical samples is also discussed.

DISCRIMINANT GRAMMARS FOR PATTERN CLASSIFICATION

By Alan James Filipski

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Computer Science
1976

ACKNOWLEDGEMENTS

I would like to express thanks to all members of my committee, especially Dr. Richard Dubes for his careful reading of the various drafts of this thesis and my chairman, Dr. Carl Page, for his encouragement and assistance during the many false starts and doldrums preceding the idea which led to this work. I would also like to express thanks to those who helped in the typing of this thesis, namely Bob Frye and Annette Davis.

Special thanks are due to my wife, Lois, for her encouragement throughout my long period of graduate studies.

TABLE OF CONTENTS

List of Tables
List of Figures
I. INTRODUCTION
II. DISCRIMINANT GRAMMARS - DEFINITION AND INTERPRETATIONS
   A. Definition
   B. A Bayesian Interpretation
   C. Bounds on the Bayes Error (Linear Grammars)
   D. A Sequential Approach to Discriminative Parsing
      1. Introduction
      2. Markov Chain Model
      3. The SPRT
      4. Error Analysis
      5. Parsing Algorithm
   E. Summary
III. SOME LANGUAGE-THEORETIC RESULTS
   A. Regular Discriminant Grammars With Rational Coefficients
      1. Informal Description
      2. Detailed Construction
      3. Example
   B. Regular Discriminant Grammars With Unrestricted Coefficients
   C. Summary
IV. DISCRIMINANT GRAMMARS AND PROBABILISTIC AUTOMATA
V. SUMMARY AND RECOMMENDATIONS
   A. Summary
   B. Training Methods
   C. Other Future Work
APPENDIX A. LL(k) GRAMMARS AND TOP-DOWN PARSING
APPENDIX B. STOCHASTIC GRAMMARS
BIBLIOGRAPHY

LIST OF TABLES

Table 1. Two Stochastic Grammars and Their Log-Likelihood Ratio Representation
Table 2. Example of the SDPS
Table 3. PDA Transition Function (Example)
Table 4. Trace of PDA (Example)

LIST OF FIGURES

Figure 1. Syntactic Decision Strategies
Figure 2. Encoded Line Patterns
Figure 3. Acceptor for Production Sequences
Figure 4. Sequential Discriminative Parsing Scheme
Figure 5. Flowchart of Push-Down Automaton
Figure 6. Procedure ADDSTK(r)

I. INTRODUCTION

The term "Pattern Recognition", as applied to the more classical feature-space oriented techniques as well as to the newer syntactic methods, has tended to obscure a fundamental difference of intent between the two approaches. The former is almost exclusively concerned with the development of algorithms to extract or utilize discriminatory information, that is, information which pertains to the assignment of an object to one of several classes. Any information which does not further this end is considered a nuisance. Most work in "Syntactic Pattern Recognition", on the other hand, does not stress this distinction. The "Syntactic Pattern Recognizer" is generally expected to transduce the input object into a derivation, which is a representation of all structural information inherent in the original object in terms of some given grammar. This derivation may then be interrogated to answer questions about the object, produce a description of the object, etc.

We can thus distinguish between "Pattern Classification" and "Pattern Description", the former being primarily a process of information reduction and the latter a process of information re-organization. Both of these are useful ends toward which syntactic means may be applied. They are, however, distinct ends, and efficiency may be sacrificed if we confuse them. As an extreme example, suppose we are given a character string and we must decide whether it is a Fortran program or an Algol program. A very inefficient way to make this decision would be to feed the string into a Fortran compiler and then into an Algol compiler and then note which generated fewer error messages. It is obvious that the decision could be made instead by a very small and fast program which looked for a few characteristic structures. The important assumption here, of course, is that we are interested only in a decision and not in any by-products such as error messages or object programs.

It is the object of this work to show that syntactic methods may be applied to a purely classificatory end, i.e. that there is such a thing as structural discriminatory information and that this information may be effectively extracted and used by syntactic means. Hopefully, restricting our attention to discriminatory information will serve to make the path to the decision itself a direct one.
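To make the Fortran-versus-Algol example above concrete, the following is a minimal sketch of such a "decision-only" classifier in modern notation. It is not part of the original development; the particular keywords chosen are hypothetical stand-ins for "characteristic structures", not a serious language identifier.

```python
def classify_source(text: str) -> str:
    # Toy decision-only classifier: count a few characteristic tokens
    # instead of fully compiling the string under both languages.
    fortran_hints = ("SUBROUTINE", "DIMENSION", "GO TO", "FORMAT")
    algol_hints = ("begin", "end", ":=", "comment")
    fortran_score = sum(text.count(tok) for tok in fortran_hints)
    algol_score = sum(text.count(tok) for tok in algol_hints)
    return "Fortran" if fortran_score > algol_score else "Algol"

print(classify_source("      DIMENSION A(10)\n   10 FORMAT(I5)"))  # -> Fortran
print(classify_source("begin integer i; i := 1 end"))              # -> Algol
```

The point is only that a decision can be reached by inspecting a small amount of discriminatory evidence, with no attempt to recover the full structure of the input.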
It is evident that in many applications for which the use of syntactic methods has been proposed, only the ulti- mate decision is of any importance, and additional infor- mation supplied by the parse represents wasted effort. Examples are hand-printed character identification and automatic blood-cell counting. Most systems used on a production basis need only supply a decision (or possibly 3 report inability to make a decision). See, for example, Kanal(l2). Another instance in which descriptive information in addition to a decision is of very little value occurs when the grammar is obtained by means of the semi-automatic in- ductive methods now being explored by a number of re— searchers. (See Fu & Booth(8) for a survey.) In this case there would be no a priori correspondence of semantic with syntactic notions with the result that knowledge of the derivation becomes rather useless. In this case we should simply ask the system to report a decision rather than to exhibit a parse in an unfamiliar grammar. Figure 1 depicts three possible approaches to syntac- tic pattern classification. Figure l.a represents the most straightforward approach: The string is processed through a parser for each grammar; one, none, or both of the parsers then report success. Figure l.b is a modifi- cation of this in the case where probabilistic information is available. In this case two stochastic parses yield the probabilities that the string was generated by each of the two stochastic grammars. This information, together with the a priori likelihoods of the two classes, is fed into a Bayes decion-maker. These two methods are used, for example, by Lee & Fu(15). Figure 1.0 represents the approach introduced in this thesis. A single parse is performed using a "discriminant grammar" which maps the string x into a single real number, say f(x). The sign PARSER XEL(G1)? USING G1 WHICH PARSE P SUCCESSFUL, ' 9 ARSLR XEL(G2)' IF ANY? USING G2 (l.a) LIKELIHOOD OF Ll vs. L2 STOCHASTIC P(x|G ) PARSER FOR GI BAYES DECISION STOCHASTIC P(x|G2) . PARSER FOR (l.b) DISCRIMINANT GRAMMAR x PARSE FOR f(x) Ll vs. L2 (1.0) Figure l Syntactic Decision Strategies 5 of this number then determines the classification of x. The absolute value of f(x) is a measure of our confidence in the classification. (Under a statistical interpreta- tion of the discriminant grammar to be given later, f(x) relates to a quantity which I. J. Good calls the "amount of information in an observation about a hypothesis". (Good(9), Kullback(lA))) The potential advantages of the discriminant grammar approach are several: 1.) Only a single parse need to made. In fact, as we shall see later, if the user wishes to specify error tolerances, this parse need not even be completed. 2.) A discriminant grammar can usually be made simpler than a grammar describing either of the languages because only discriminatory information need be considered. For example, the two languages {aib1|1_>_1} L“ ll {biailigl} II" II are both context-free, but a discriminant grammar with regular underlying grammatical structure can be used to discriminate between them on the basis of whether the first character of a given string is an a or a b. 3.) The set of strings for which the discriminant grammar will yield a decision can be made larger than the union of 6 the two original languages, thus giving a means for the classification of noisy strings. As a trivial example, suppose we want to distinguish between the languages {a}* and {b}*. 
It will prove to be a simple matter to produce a discriminant grammar (again, with regular underlying structure) which will be able to classify any string from {a,b}* depending upon whether there are more a's or more b's in the string. The discriminant grammar approach may be regarded as a step towards a synthesis of the syntactic and feature- space formulations of the Pattern Recognition problem. We will show later how syntactic information may be repre- sented by means of a mapping from strings into a space of structural indices. Once this transformation is made, we then may apply results from feature-space theory, as pre- sented, for example, in Duda & Hart(4). It is hoped that this "structural statistics" approach will help the two methodologies of Pattern Recognition to cross-fertilize each other. One of the most important areas of syntactic Pattern Recognition concerns the processing of two-dimensional pictures. Since the early 1960's (e.g. Minsky(l8)) there has been considerable work done on languages for the re- presentation of various types of two-dimensional images. Freeman(5) proposed a language of concatenated vectors to describe connected curves. Kirsch(l3) suggested a grammar of two-dimensional arrays. Web grammars, Plex grammars, 7 Tree grammars and a number of picture description lan- guages have been proposed (see Fu(7) for a survey) in ad- dition to many special purpose languages, e.g., for structural formulae, flowcharts, and mathematical expres- sions. In some cases these grammars transduce the input picture into a one-dimensional string which can then be dealt with using all the apparatus of formal language theory. Methods of this type have been quite successful (e.g. Lee & Fu(15)). Other grammars, such as the Web and Plex grammars, seem to demand a more general notion of parsing. This thesis follows the lead of the former ap- proach and is set in the classical theory of formal lan— guages. It is easy to conceive of situations in which two-dimensional syntactic classification schemes may be put to use. In the game of Go, for example, a board con— figuration is essentially syntactic in that it is the spatial relationships between primitives (stones) which is of significance in determining the worth of a given board position to either player. Furthermore, if our task were to develop a static evaluation function for the game we would be interested only in the relative value of a position and not in its full syntactic description. This is an example of the possible use of a two-dimensional ex- tension of the discriminant grammar technique. II. DISCRIMINANT GRAMMARS-DEFINITION AND INTERPRETATIONS A. DEFINITION The basic structure introduced and developed in this thesis for use in syntactic pattern recognition is the "discriminant grammar" or "d-grammar". Formally, a dis- criminant grammar D is defined to be an ordered quintuple D = (VN,VT,TT ’SaR) The first four components are, respectively, a set of non-terminal symbols, a set of terminal symbols, a set of production rules, and a start symbol. It is assumed that the reader is familiar with the concepts and notational conventions of formal language theory as contained in, for example, Hopcroft and Ullman(10). The component R is de- fined as a mapping from the production setTT into the real numbers. It is often convenient to use the notation ri for R(ni), where nieTT. The ordinary phrase-structure grammar defined by the first four components of D will be denoted CHAR(D) and will be called the characteristic grammar of D. 
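As a concrete, purely illustrative rendering of this definition, a discriminant grammar can be represented directly by its five components. The Python sketch below is one convenient encoding assumed for later examples; it is not part of the formal development.

```python
from dataclasses import dataclass

@dataclass
class DiscriminantGrammar:
    """D = (V_N, V_T, Pi, S, R): a context-free grammar together with a
    real-valued weight R(pi_i) attached to each production pi_i."""
    nonterminals: set   # V_N
    terminals: set      # V_T
    productions: list   # Pi: list of (lhs, rhs) pairs, rhs a tuple of symbols
    start: str          # S
    weights: list       # R: weights[i] = R(productions[i]) = r_i

    def char_grammar(self):
        # CHAR(D): the ordinary phrase-structure grammar obtained by
        # forgetting the weight component R.
        return (self.nonterminals, self.terminals, self.productions, self.start)
```

Here weights[i] plays the role of r_i = R(pi_i).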
Some restrictions are usually put on the form of the production rules of a grammar. The restriction that has been found to strike a good balance between generative power and mathematical tractability is the context-free restriction, i.e. that the premise of each production rule consist of a single non-terminal symbol. In addition, it is reasonable to require that each string in the language have an essentially unique derivation in the given grammar. This property is called unambiguity. We will limit the grammars under consideration as follows: unless otherwise noted, all discriminant grammars used in this thesis will be assumed to have unambiguous, context-free characteristic grammars.

A mapping Q_D from L(CHAR(D)) into the reals is defined by

    Q_D(x) = Σ (k = 1 to n) r_{i_k}

where π_{i_1}, π_{i_2}, π_{i_3}, ..., π_{i_n} is the unique left-most canonical derivation (LMCD) of the string x. (Throughout this thesis, when we speak of "derivations" we shall mean sequences of productions, not sequences of sentential forms.) Note that it does not in fact matter whether we use the LMCD to compute Q(x) or whether we use any derivation of x, since all derivations of x in the unambiguous characteristic grammar are simply permutations of each other. When considered as a set of ordered pairs, the function Q is called the discriminant language generated by the discriminant grammar D and is denoted L(D), i.e.

    L(D) = { (x, Q(x)) | x ∈ L(CHAR(D)) }

Given any real number θ (a "cutpoint"), the language L(CHAR(D)) may be partitioned into three classes in the following way:

    L+(D,θ) = {x | x ∈ L(CHAR(D)) and Q(x) > θ}
    L0(D,θ) = {x | x ∈ L(CHAR(D)) and Q(x) = θ}
    L-(D,θ) = {x | x ∈ L(CHAR(D)) and Q(x) < θ}

This partition will be interpreted as a simple decision scheme which will allow us to distinguish between two languages. It is sometimes conceptually convenient to decompose this decision scheme into four stages: structural indexing, projection, linear translation and quantization. It is possible in this way to express the decision as a composition of four functions: x belongs to L+(D,θ), L0(D,θ), or L-(D,θ) according as sgn(T(P(S(x)))) is +1, 0, or -1, where S, P and T are described as follows.

Let N be the set of non-negative integers. For any discriminant grammar D = (V_N, V_T, Π, S, R), let k be the number of productions in Π, i.e.

    Π = {π_1, π_2, ..., π_k}

For any x ∈ L(CHAR(D)) and for i = 1 to k, let m_i equal the number of times π_i occurs in the LMCD of x. Then define S : L(CHAR(D)) → N^k as

    S(x) = (m_1, m_2, m_3, ..., m_k).

We may interpret S as a mapping from L(CHAR(D)) into a space of structural indices determined by the underlying characteristic grammar. Note that even though CHAR(D) is unambiguous, the mapping S is not necessarily one-to-one, as the following example shows:

EXAMPLE Let CHAR(D) be G = ({S,A},{b,c},Π,S) where Π consists of the productions

    S → AA
    A → b
    A → c

Then S(bc) = S(cb) = (1,1,1).

The projection function P : N^k → (-∞, +∞) is defined as

    P(m_1, m_2, ..., m_k) = Σ (i = 1 to k) m_i r_i

Finally the translation function T : (-∞, +∞) → (-∞, +∞) is defined as

    T(ξ) = ξ - θ

(Note that T is actually a function of θ also.) The output of this function is then quantized at zero, yielding the decision.

This exposition of simple discriminant grammar decision-making in terms of a composition of many-to-one functions is valuable because it defines exactly what sort of information-reduction is performed at each stage of the process from observation to decision. Clearly the most interesting function of this chain, and the one which is unique to this thesis, is the structural indexing function S.
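Assuming the representation sketched earlier, and assuming the LMCD of x is available as a sequence of production indices, Q_D and the four-stage decomposition can be written out directly. The names below are hypothetical; the sketch simply restates the definitions just given.

```python
def Q(D, derivation):
    # derivation: list of indices i_1, ..., i_n of the productions used in
    # the LMCD of x; Q_D(x) is the sum of the corresponding weights r_i.
    return sum(D.weights[i] for i in derivation)

def structural_index(D, derivation):
    # S(x) = (m_1, ..., m_k): m_i counts how often production i occurs.
    m = [0] * len(D.productions)
    for i in derivation:
        m[i] += 1
    return tuple(m)

def decide(D, derivation, theta):
    # The composition sgn(T(P(S(x)))): projection, translation, quantization.
    m = structural_index(D, derivation)
    p = sum(mi * ri for mi, ri in zip(m, D.weights))   # P(S(x))
    t = p - theta                                      # T(xi) = xi - theta
    return "+" if t > 0 else ("0" if t == 0 else "-")  # region L+, L0 or L-
```

Since P(S(x)) = Q_D(x), decide(D, derivation, theta) returns the same region as comparing Q(D, derivation) directly with the cutpoint θ.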
The functions S and P are implicit in the specifications of the discriminant grammar, while the function T depends upon some specified cutpoint. As a simple example of a discriminant grammar, con- sider the following: EXAMPLE Let D = ({S,A,B},{a,b};TF,R) WhereTTand R are given by the following table: N1 R(Ni) S + aA O S + bB O S + e O A + aA +1 A + bB —l A + e O B + aA -1 B + bB +1 B + e 0 Given any string xe{a,b}*, QD(x) tends to be positive if x contains long "runs" of either a's or b's and tends 13 to be negative if the a's and b's tend to alternate with a short period. A concrete interpretation of this discrimi- nant grammar would be to consider the a's and b's respec- tively as positive and non-positive slopes between appro- priately sampled points of some periodic function over some interval. Then for some arbitrary cutpoint 6, L+(D,6) would consist of relatively low frequency functions while L-(D,6) would contain functions with predominantly higher frequency components. Figure 2 gives an example of such a situation. The functions are sampled at the "0" point of the axis and at each "+" point. The sampled value at each "+" point is compared to the previously sampled value. If there has been an increase, an a is inserted into the string, otherwise a b is inserted. For example, the first character of 2.a is an a since the ordinate of the curve is greater at the first "+" than at the origin. The en- codings and their corresponding Q-values are given in the figure. Thus we see that the string represented in Figure 2.a is contained in L-(D,0) while the string re- presented in Figure 2.b is contained in L+(D,O). 11+ OOOOO+OOOO+IO00+...C+OOOC+OOOO+OOOO+OIOO+OOOO+OOOO+ 2.a x = abaabbaaba QD(X) = -3 OOOOO+OOOO+OOOO+OOOO+COOO+OOOO+IOOO+OOOO+OOOO+OOOO+ 2.b x = aaabbbbbbb QD(X) = 7 Figure 2 Encoded Line Patterns B. A BAYESIAN INTERPRETATION In this section we will investigate the use of the discriminant grammar in making a Bayes decision between two stochastic languages. (See the Appendix for defini- tions and notations involving stochastic grammars.) Definition Two stochastic grammars G1 and G2 are said to be commensurate if and only if CHAR(Gl) = CHAR(GZ) and all and G . production probabilities are nonzero under both G1 2 Definition A discriminant grammar D is said to be a lo - likelihood ratio representation of two commensurate sto- chastic grammars G1 and G2 if and only if: 1. The characteristic grammars of G1, G2 and D are the same. 2. For all nieTT, R(fli) = log P1(Ni) - log P2(fli) Where P1,P2 are the probability functions of G1 and G2 respectively. If D is the log-likelihood ratio representation of G1 and G2, we write D = LLRR(G1,G2). Note that in general, LLRR(G1,G2) does not equal LLRR(G2,G1). Suppose now that we are given a string x and two commensurate stochastic grammars G and G such that x can 15 l 2 16 be generated by their common characteristic grammar. Let [—1(x) be the probability of generating x under G Let i' the a priori probabilities of G1 and G2 be denoted Pr(Gl) and Pr(G2). Let Hi be the hypothesis that x was generated under G1. 
Finally, let LiJ be the loss incurred in de- ciding Hj when the true hypothesis actually is Hi' Given any x, the conditional expected loss incurred in deciding H is given by: l tl(x) = L lPr(G2 x) + LllPr(Gl x) 2 Using Bayes rule, by which Pr(ei) [:(x) Pr(Gilx) = Pr(x) We obtain L21Pr(62) [—;(x)+LllPr(Gl) [~;(x) tl(X) = Pr(x) Similarly, the conditional expected loss incurred in de- ciding H2 is given by L12Pr(Gl) rl(x)+L22Pr(G2) fq 2(x) t2(X) = Pr(x) 17 We can minimize the expected loss by deciding 1) H1 if tl(x) < t2(x) 2) H2 if t2(x) > tl(x) 3) making a random decision if tl(X) = t2(X) This is the same as deciding H if 1 (L2l—L22)Pr(G2) r-;(x) < (LlZ-L11)Pr(Gl) [’11(x) If we assume that L13 > Lii when i f 3, this inequality reduces to f—;(x) Pr(GZ) (L21-L22) -——-—- > Bu) Pr(Gl) (L12-Lll) Taking logs of both sides and expanding ‘Tfli yields the rule: decide Hl if m ZEE:: Pr(Gz) (L21’L22) (log Pl(nik)-log P2(nik)) > log Pr(Gl) (L12‘L11) k = l where Ni N1 “1 is the LMCD of x in i’ 2’ " ’ m CHAR(Gl). Now let D = LLRR(G ) and let 1’G2 18 Pr(G2) (L21'L22) 6 = log Pr(Gl) (L12'L11) Then we have shown that the unconditional expected loss (Bayes risk) is minimized by the rule: if st+(D,6) then decide Hl if xeL’(D,e) then decide H2 if xeLO(D,e) then decide arbitrarily. Using a loss function of L = L = l and 12 21 L11 = L22 = O, the Bayes rule minimizes the probability of misclassification. This Bayes error may be expressed as P(err) =Zmin [Pr(Gl) l—‘lm, Pr(G2) F200] xeL(CHAR(Gl)) The size of this probability is in general quite difficult to compute. In special cases, however, we may obtain bounds on the value of P(err). One such case is discussed in the next section. C. BOUNDS ON THE BAYES ERROR (LINEAR GRAMMARS) In this section we present a technique for the calcu- lation of upper and lower bounds on the Bayes error (P(err)) for linear grammars using the Bhattacharyya co- efficient. Definition A context-free grammar is linear if an only if the consequence of each production (i.e. the right hand side of that production) contains at most one non—terminal. Thus each production must be of the form A + xBy or A + x s s where erT , erT Definition Given two probability mass functions p(.) and q(.) defined on the same discrete sample space X, the Bhattacharyya coefficient p of p vs. q is o=E W) xeX It is known (see Kadota & Shepp(1l)) that the Bhattacharyya coefficient can be used to form an upper and 19 20 lower bound on P(err) as follows: pzminmwthrmznn _<_ P(err) 5, p/2 We now solve the following problem: Given two commen— surate stochastic grammars G1 = (VN,VT,—rr ,S,Pl) and G2 = (VN,VT,1TI,S,P2) with linear characteristic grammars, com- pute the Bhattacharyya coefficient of rfil vs. [—2 . First, to each production n ETT , assign a number qi 7JP1(W1)P2(wi) Then p may be expressed as k p =:E; [[ qimi i = l where the summation extends over all xeL(CHAR(G1)). As before, the m1 are the structural indices of x in the k-element production set. The key to the computation of this sum of products is the realization that the set of derivations of sentences in a linear grammar is itself a regular language over the production set TI. This fact will become apparent by the following construction. (For simplicity, we omit commas in the derivations.) Suppose VN = {A1 (=S),A1,A2,...Am} is the set of non- terminals for the linear grammar. 
Then the non—determinis— tic finite-state acceptor for derivations has m+l states {sl,s2,...sm+l} and on input "1 has a transition from sJ to s k where A3 is the non-terminal in the premise of Ni and Ak is the non-terminal in the consequence of N if the 13 21 consequence of Ni contains no non-terminals, then the transition should go to Sm+l' Let sl be the start state m+1 be the final state. Then a string «11 N12... “1r is accepted by this automation if and only if it is a and s valid derivation of some string in the original linear grammar. Note that the terminals in the original grammar have no bearing on the construction of this acceptor. EXAMPLE Suppose G = ({A,B,C}, {d,e,f,g},Tr,A) where the productions are given by + eeB + dAgg + fb 8 + eB '* 880 + Afg -q 0\ \n .= t» n) +4 0 0 tr: w w :> :b :1:- + + d Then the corresponding acceptor for production sequences is given by Figure 3. We now return to the original problem of calculating the Bhattacharyya coefficient. In the graph of the auto- maton Just constructed, associate with each transition “1 the number q1 defined previously. Our problem is now that 22 of summing the products of the q1 along all possible paths from the start state to the accept state. To accomplish this, construct the (m+l) by (m+l) matrix W such that Wij equals the sum of the qk corresponding to all single-step transitions between 81 and SJ. In the example, for in— stance, Wl2 would be set equal to ql+q3. In addition to this, set Wh equal to one. Now let W' be equal to +l,m+1 the limit of wn as n approaches infinity, if this limit exists. It is now an easily obtained combinatorial fact. (See, for example, Feller(3)) that p = W'l’m+1 Note that this same technique may be extended to the class of grammars for which the number of non-terminals in every possible sentential form has some upper bound de- pending only upon the grammar. (This is the class of "ultralinear" grammars.) In this case, we simply assign a name to each group of non-terminals which can appear in a sentential form and follow the above procedure using this "super-non-terminal". S 3 (From C) Figure 3 Acceptor for Production-Sequences 23 D. A SEQUENTIAL APPROACH TO DISCRIMINATIVE PARSING D. 1. Introduction Since Wald's introduction of the Sequential Probabil- ity Ratio Test (SPRT) (Wald(20)), the principle of optional stopping has become well known in the pattern recognition literature. See for example, Fu(6). The two principal approaches are the sequential Bayes approach and the SPRT. If only a finite number of features are available, these are not necessarily equivalent. The Bayes approach as- signs costs to both feature measurements and incorrect decisions and then selects a scheme which minimizes the expected total cost. This is usually done computationally by backwards dynamic programming. The SPRT, on the other hand, is more closely related to Neyman-Pearson theory and stops observing features when the value of the likelihood ratio guarantees certain bounds on misclassification pro- babilities. In this section we exhibit a scheme for sequential syntactic decision-making using the SPRT, discriminant grammars and LL(k) top-down parsing techniques. This scheme could save both parsing time, and, in the case of on—line systems, feature extraction time (and cost). The availability of such a sequential scheme is a unique 2D 25 aspect of the discriminant grammar approach. 
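As a preview of the sequential discriminative parsing scheme developed in the remainder of this section, the sketch below assumes an iterable parse_lmcd that yields, in top-down order and without backup, the index of each production in the LMCD of the input string. The handling of parses that terminate before either boundary is reached is shown only as one plausible completion (deciding on the sign of the accumulated sum); the exact treatment is discussed in the text.

```python
import math

def sdps(weights, parse_lmcd, A=100.0, B=0.01):
    # weights[i] = R(pi_i) = log P1(pi_i) - log P2(pi_i)
    # A, B: Wald-style stopping boundaries on the likelihood ratio.
    alpha, beta = math.log(A), math.log(B)
    h = 0.0
    for i in parse_lmcd:            # productions arrive one at a time
        h += weights[i]
        if h >= alpha:
            return "H1"             # stop early: decide the string came from G1
        if h <= beta:
            return "H2"             # stop early: decide the string came from G2
    return "H1" if h > 0 else "H2"  # parse finished inside the boundaries
```

With A = 100 and B = 0.01, as in the worked example below, the boundaries are alpha = 4.61 and beta = -4.61.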
Informally, the essence of the scheme is this: We are given some string x and we want to decide which of two commensurate stochastic grammars generated x. We use the LL(k) parser to supply us (in top-down sequence and with- out backup) with the sequence of productions which forms the LMCD of the string x. Each time we are informed of a production in the derivation we decide either to request the next production from the parser or to abort the parse and make a decision immediately as to which grammar gener- ated x. The stopping criteria will depend upon pre- assigned error tolerance. We now proceed to lay the statistical foundations of the sequential parsing technique in some detail. Crucial issues in the development include the necessity for top- down parsing and the treatment of parses which terminate before reaching a stopping boundary. Suppose that we are given two commensurate stochastic grammars G1 = (VN,VT,11-,S,Pl) and G2 = (VN,VT,11-,S,P2). Denote their common characteristic grammar by G and let D = LLRR(G1,G2). For any string xeL(G), denote by H1 (1 = 1,2) the hypothesis that x was generated by G1. As before, let r—1(x) denote the probability of the genera- tion of x under H . 1 Since derivations in G are strings over-II , we shall a use the customary notations-YT. and ‘FT+ to mean, respec- tively, the set of all production sequences and the set of 26 all non-empty production sequences. The symbols 0, p and w will usually be used to represent production-sequences; o(n) will represent the nth production in the sequence 0 The sample space )3 for our experiment will be the set of all LMCD's under G. We will ignore the null derivation, so ,3 QTY... Given any pa ,8 such that p is a LMCD for xeL(G), then for i = 1,2, define Pr(pIH1> = i—1 Since G is unambiguous, there is a one-to-one corres- pondence between points in ,3 and strings over L(G), hence the fact that r—g(.) is a true discrete probability mea- sure implies that Pr(.|Hi) is also a probability measure. SUPposetmat p = p(l)p(2)p(3)-.-p(n). Then, by the unrestrictedness assumption, we have Pr(pIHi) = Pi(p<1>)P1(p(2>)...pi(p 0, decide H1 and stop. This procedure is flow-charted in Figure A. EXAMPLE Suppose we are given two stochastic grammars ({S.B}.{c.e.r.h}. II ,P > C) II and G2 ({s.B}.{c.s,r.h}.TY .132) where Tr , P1 and P2 are given in Table 1. Table 1 also gives R(ni)=log Pl(wi) - log P2(ni). Under this defini— tion of R, D=LLRR(Gl,G2) = ({S,B},{c,g,f,h}, TT ,R). Both ( START [... a1 find next w c i [h + h+R(n) l ‘ _ STOP 4 0 Y decide H j 1 N | Y N done decide H2 Y | decide Hl ‘ N a Figure A Sequential Discriminative Parsing Scheme 36 37 G1 and G2 are consistent and proper and their underlying characteristic grammar G is LL(l). Let us arbitrarily set A = 100 and B = .01. The probability of the parse reaching the wrong stopping boundary is then less than 1%. (Recall that this is not a bound on the overall error of the procedure.) We then have o(= log 100 = “.61 and B = log .01 = -A.6l. Suppose now that we are given the string ccccgfhhhhh. The underlying characteristic grammar belongs to a partic- ularly simple class of LL(l) grammars for which we can de- termine one production of the LMCD for each symbol of the string scanned. Table 2 summarizes the results of the ap- plication of the SDPS to the given string. Recall that h is the accumulator variable representing the cumulative sum of R(ni). 
As indicated in the table, the parse was truncated and a decision was made that x was generated by Gl after observing only four of the eleven symbols of the string. This was possible because the pre-determined up- per stopping boundary A.6l was exceeded. Table 1 Two Stochastic Grammars and Their Log-Likelihood Ratio Representation i ni P1(ni) P2(w1) ’R(wi) l S + 08B .8 .2 1.39 2 s +g .2 .8 -1.39 3 B + fBB .3 .u -.288 4 B +h .7 .6 .151: 38 Table 2 Example of the SDPS Symbol Production Scanned Determined R(n1) h Action 0 "l 1.39 1.39 continue c "1 1.39 2.78 continue 0 n1 1.39 h.l7 continue 0 n1 1.39 5.56 stop; decide H1 39 E. SUMMARY In this chapter we have formulated the definition of the Discriminant Grammar and the decision regions which accompany it. We then demonstrated that the Discriminant Grammar has a natural application in providing a sufficient statistic for the two-class decision problem involving stochastic grammars. First a Bayesian approach was de- scribed and then it was shown how a sequential probability ratio test could be implemented using a top—down parsing technique. It should be noted that top-down parsing of an LL(k) language can be programmed quite readily using re- cursive descent. In all the above statistical interpretations it was assumed that the two stochastic grammars involved are com- mensurate. If they are not commensurate, we cannot apply the above results, but all is not lost, since we can use training methods to be discussed later. 40 III. SOME LANGUAGE-THEORETIC RESULTS A. REGULAR DISCRIMINANT GRAMMARS WITH RATIONAL COEFFICIENTS A. l. Informal Description Discriminant grammars whose underlying characteristic grammars are regular have many interesting properties. We shall call such discriminant grammars regular discriminant grammars. In this section we intend to show that given any regular discriminant grammar D = (VN,VT,'TY ,S,R) where all the R(ni) are rational, then for any rational number O, the decision region L-(D,O) is a context-free language. The following informal argument will serve as a basis for a proof of this theorem: Denote R(w1) by r1, as before. Convert the numbers O,rl,r2,...rn to integers by multiplying each by the least common multiple of their denominators. To simplify nota- tion, we will retain the same names for these (now inte— gral) numbers. In effect, we create a new discriminant grammar D with integral coefficients. This does not change the region L-(D,O). Now construct a non-deterministic finite-state automaton which accepts L(CHAR(D)). We shall now modify this into a non-deterministic push-down automa- ton which will accept L-(D,O). The existence of this U1 A2 automaton implies that L-(D,O) is a context-free language. We shall also show by example that L—(D,O) is not neces- sarily regular. We proceed as follows to construct the automaton: Let M be the maximum ab solute value of the (now integral) numbers rl,r2,...rn,O. Clearly, we can construct a finite- state machine which is capable of adding any two integers of opposite Sign and absolute value less than or equal to M and producing a result of absolute value less than or equal to M. Call this machine the "adder". Next we need a push-down store, each location of which is capable of storing an encoding of an integer of absolute value less than or equal to M. 
The final push—down automaton is an assemblage of these three components (non—deterministic recognizer, ad- der, push—down store) plus some mechanism for integrating these parts, which works as follows: Suppose the recog- nizer (in a non-deterministic mode) makes a transition which indicates that w is in the derivation of the input 1 string. If the push-down store is empty, the operation of the machine would be to place ri on the store. If the store is not empty, denote the value of the top element by K. If ri and K are of the same sign, then the machine should push r onto the stack. If r and K differ in sign, 1 i then the machine should add r1 and K, forming a number r' of lesser absolute value than either r1 or K, pop the stack, exposing a new top symbol K', and then compare r' A3 with K'. The machine should proceed in this way until r' is eventually added into the stack. This stack manipula- tion algorithm has two very important properties: 1) At no time is a number with absolute value greater than M on the stack. 2) All numbers on the stack have the same sign. When the recognizer reaches a final state, the adder is then used in exactly the same way to add 0 into the stack. After this operation, the input string is accepted as an element of L-(D,O) if and only if the top stack ele- ment is negative. This informal description of the operation of the machine is flow-charted in Figure 5 and Figure 6. The multiple arrows in Figure 5 denote the non-deterministic step in which the non-deterministic finite-state automa- ton discovers another step in the parse. Figure 6 de- fines the procedure ADDSTK with formal parameter r. In both figures, K refers to the top element of the stack. NDFA stands for non-deterministic finite-state automaton In Figure 6 the condition "K*r > 0" does not imply that an actual multiplication need be done, only that K and r are non-zero and have the same sign. A. 2. Detailed Construction We now give a detailed formal construction to imple- ment the foregoing informal description. A (non-determi— nistic) push-down automaton (PDA) is a 7-tup1e C ) NDFA CALL DONE? ADDSTK(—e) I GET “1 DECIDE FROM NDFA xeL (D39) , N r . CALL DECIDE STOP ADDSTK(Pi) xeL'(D,9) Figure 5 Flowchart of Push-Down Automaton 1H4 C 3 EEE: Yr PUSH r ONTO STACK ‘ RETURN 3 ' r + K+r ' POP STACK +— Figure 6 Procedure ADDSTK(r) as M = (192 . T" .6.q0.zO.F> where: K is a finite set of states :8. is a finite input alphabet f“ is a finite stack alphabet 6 is a function from KX(23 U{e})xr1 into CFYKXTT *) qOeK is the initial state Zoe r" is the initial stack symbol F E K is a set of final states The details of the operation of a PDA may be found in Hopcroft and Ullman(10). The fundamental result we will use is the fact that a language is accepted by a non-deter- ministic push-down automaton if and only if the language is context free. We will begin with a regular discriminant grammar with rational coefficients and a rational number 0 and we will construct a PDA which will accept L'(D,C). First, as before, we convert all coefficients, including O, to inte- gers in the range -M to +M. This does not alter the re— gion L-(D,O). VN = {Bi} and VT = {a1} Let A, A', and T be three symbols not in VN. 47 Then let K = (VNu{A,A',TDx{-M,-M+l,...o...M—1,M} §3= vT F‘= {-M,-M+1,...—l,l,2,...M-1,M,z } q0 = (830) F = {(Ts0)} The construction of the transition function 6 is somewhat more complex. 
For each non-terminal B and ter- 1 minal ak there is a (possibly empty) set of production A and possibly a production of the form B1 + ak (2) We will use the symbol C to represent either non- terminals (B3) or either of the special symbols A or A'. Specifically, the set designated {Cik} contains the A sym- bols {B 13:1 as represented by (1) and also the symbol A J if there is a production of the form (2). (In this case J the C1k Analogously, the set {nik} is the set of integers (r's) can be considered to have the superscript 2+1.) associated with the productions (l) and (2). For example, suppose that the productions of TI having B on the left and a on the right and their 1 k associated r-values are: A8 Production r-value Bi + akBl +5 Bi + akB2 —2 Bi + ak 0 Then {Cik}33l ‘ {Bl B2,A} and {hik132 -1 = {5,-2.0} In the following, m1k represents the number of pro— ductions in Tr with B1 on the left and ak on the right. To avoid subscripted superscripts, we will usually omit the subscripts on m when context makes them clear. Z represents an arbitrary stack symbol (either a number in the specified range or the stack start symbol 20). We now have sufficient notation to define 6. First, for each Bi’ 6 should contain the following non-deterministic, non-e transition rules (for all Zer| ): (1) 6((Bi,o), ak,Z) = {((Cikn nik), Z’}r m Informally, this is equivalent to saying that no matter what is on the stack, if the machine is in state (Bi,0) and ak is read, then, for each production n of the form Bi + akBJ with R(n)=n, the machine moves (non-determinis- tically) into state (BJ,n). If there is a production a of the form Bi + ak with r(n)=n, then the machine also moves into state (A,n). In all cases the stack symbol remains unchanged. The purpose of this class of instructions is 49 to allow the machine to recognize when a production rule might have been applied and to incorporate the r—value of that production rule into the memory of the PDA's finite state control. Next we need a mechanism to transfer these r-values from the finite-state control onto the push down stack as described informally earlier. For each C, therefore, we shall include the following e—rules: For all combinations of non-zero n and Z such that n and Z have the same Sign or Z=Z include the rules 0, (2) 6((Can),€,z) = {((C,0),nZ)} i.e., if the number in the finite-state control has the same sign as the number on top of the stack, or if the symbol on top of the stack is Z0, then the number from the finite-state control may be pushed onto the stack. For all combinations of n and Z such that n and Z dif- fer in Sign and n is not zero, include the rules: (3) 6((C,n).€.Z) = {((C,Z+n).6)} i.e., if the number in the finite state control differs in sign from the number on top of the stack, we add the top stack element into the finite state control and pop the stack. By using the above rules, we eventually arrive in the state (A,0) if and only if there is a derivation for the input string in CHAR(D). The next step is to put -0 into 50 the finite control. This may be accomplished by the sin- gle production: (A) 6((A,0),e,Z) = {((A',-9),Z)} for all Zei—‘. Previously defined transition rules may now be used to transfer -0 from the finite control to the stack, ultimately sending the machine into state (A',0). The additional symbol A' is necessary here to insure that this rule is used only once. Now it remains only to in- terrogate the top stack symbol and send the machine into the final state if and only if this top symbol is negative. 
This is accomplished by including the following set of productions for all Z less than zero: (5) 6((A',0),e.z) = {((T,0),Z)} The existence of the above construction establishes the following: THEOREM Given a regular discriminant grammar D with ra- tional values for R(n1) and given any rational numberO, the language L-(D,O) is context-free. Simple modifications to the above construction could be made to show that L+(D,O) and LO(D,O) are also context- free. A. 3. Example As an example, consider the following regular discri- minant grammar with rational coefficients: 51 D = ({S=Bl},{al,a2,a3},'TF,S,R) Where IT and R are given by Tri I'i Bl + alBl +1 Bl + a2Bl -1 Bl + a3 +1 Clearly, L-(D,l) = {xlxc{al,a2}*a3 and x(a2) > x(al)} where the notation x(a) means the number of occurrences of the symbol a in the string x. Since this language is not regular, this example shows that we cannot strengthen the theorem to say that L"(D,O) must be regular. According to the construction, M = (K. 2 . F" .6. q0.zo.F> where K = {(Bl,1),(Bl,O),(Bl,~l),(A,l),(A,O), (A.-l).(A'.1).(A'.0).(A'.-l).(T.l). (T30):Ts'l)} 2 = {al,a2,a3} T1= {l,-l,Z } q0 = (Blao) F = {(T,O)} 52 and the transition function 6 is given in Table 3. In that table, each element of the transition function is labelled I.J. where the I refers to one of the five rules used to generate it and the J is simply an ordinal number within a group. Table A gives a trace of the action of this automaton in accepting the string x=a2ala2a3. Following the conven— tion of Hopcroft and Ullman, the bottom of the stack is on the right and stack symbols are added or deleted from the left. Also remember that "—l" is a single stack symbol, not two. As expected, this string is accepted because the PDA finally enters state (T,O). O F’ Table 3 PDA Transition Function (Example) 6((Bl.0).al.-l) = {((Bl.l).-1)} 6((81.0),a1,1> = {((Bl.1),)} 6((81.0).al.z0> = {((Bl.1),z )} 6((Bl.0).a2.zo) = {((Bl.-1).Z0)} 6((Bl.0).a2.-l) = {((Bl.-l).-l)} 6((Bl.0).a2.1) = {((Bl.-l).l)} 6((Bl,o),a3,z0) = {((A,l),Z0)} 6((Bl,0)sa3,-l) = {((A,1),-l)} 6((Bl.0),a3,1) = {((A,l),l)} 6((Bl.l).e.zo) = {((Bl.0).lzo)} 6((Bl.-1>.e.zo) = {((Bl.0),-lzo)} 5((Aal)9€azo) = {((A30)31Z0)} 6((A9-l)9egzo) {((A,0),-lz0)} 6((A',l),e,Z0) {((A',O),1Z0)} 6((A'.-1>.e.zo> = (((A'.0).-lzo)} 53 .10 .ll .12 Table 3 (continued) 6((Bl,1),e,l) = {((Bl,0),11)} 5((Bl.-l).e,-l) = {((Bl.0).-1-1>} 6((A.l).e,l) = {((A,o),ll)} 6((A.-l).e.-l) = {((A.0).-l-l)} 6((A'.l).e.l) = {((A'.0).ll)} 6((A'.-l).6.-l) = {((A'.0).-l-l)} 6((Bl.1).e.-1) = {((B1.0).e)} 6((Bls‘l)sesl) = {((Blao)s€)} 6((A’l)353-1) {((A30)95)} 6((A.-1),e.1) {((A.0).e>} 6((A',1).e.-1> = {((A',o>,e)} 6((A'.-1>.e.1) = {((A',0),e)} 6((A.o>.e.zo> = {((A'.-1>.z )} 6((A.o>.e,1) = {((A',-l).l)} 6((A,O))€3-1) = {((A'3-l),-l)} 5((A' .0),6.-1)={((T.0).-l)} 514 Table A Trace of PDA (Example) 55 input 6-rule state stack contents (81.0) 20 a2 1.“ (Bl,-l) Z0 8 2.2 (Bl,0) -1ZO a1 1.1 (Bl,l) -1Z0 6 3.1 (81,0) ZO a2 1.4 (Bl,-l) Z0 8 2.2 (81,0) -1ZO a3 1.8 (A.l) -120 e 3.3 (A,O) 20 e ".1 (A',-1) Z0 6 2.6 (A',O) -1Z0 6 5.1 (T,O) -lZO B. REGULAR DISCRIMINANT GRAMMARS WITH UNRESTRICTED COEFFICIENTS In this section we show that if the ri and O are not constrained to be rational, then there are decision re- gions L_(D,O) which are not context-free. THEOREM The class of languages expressible in the form L-(D,O), where CHAR(D) is regular, is uncountable. PROOF For each real number ¢, let D(¢) represent the fol- lowing discriminant grammar: D(¢> = ({S}.{a.b.c}. 
1T .S.R) where TI and R are given by 1 Hi R(wi) 1 S + aS cos(¢) 2 S + bS -sin(¢) 3 S + c 0 ¢ may be regarded as a parameter of the discriminant gram— mar. We now define the uncountable class of languages. {L'(D(¢).0) | 0:¢_<_1r/2} 56 57 We now show that the languages in this class are distinct. AS before, let x(a) denote the number of occurrences of the symbol a in the string x. Then we can express _ a L (D(¢),O) x {xe{a,b} c|x(a)cos(¢)-x(b)sin(¢)<0} = {xe{a,b}*c|x(a)/x(b)A} Paz(l9) shows that this class of languages is not changed if we remove the restrictions that I and the tran- sition matrices be stochastic and if we allow F to be an arbitrary vector. In this case we use the terminology "pseudo-probabilistic automaton" (PPA) and "pseudo-proba- bilistic event". The language accepted by a PPA with cut- point A is called a "general cut-point event" (GCE). We will now show that, given any regular discrimi- nant grammar D, the region L+(D,O) is a GCE. Suppose that we are given a regular discriminant grammar D=(VN,VTJT,S,R). Since D is regular, all productions must be of the form B1 + akBJ OI' Bi+ak 61 In the former case denote the associated r-value as rij and in the latter case as rfio. Given any cutpoint O, we will now construct a PPA M and a cutpoint A such that L+(D,O) = T(M,A). Let K = VNU{T}, where T is some symbol not already in VN which will serve as an "acceptance state". Denote the cardinal— 1,B2,B3,.. set of input symbols for M be identical to the set of ter- ity of K by n, so that K = {S=B .Bn=T}. Let the minal symbols of D (Z =VT). Let I be the n-component row vector with a l in the first position and 0's else- where. F is defined as the n-component column vector with a one in the nth position and zeros elsewhere. Finally we construct the set of matrices {A(a):aez }. The essential construction here involves exponentiation so that the addi- tive structure of the discriminant grammar may be transla- ted to the essentially multiplicative structure of the PPA. To accomplish this, for each production of the form Bi + akBJ in TV , let the (1,3) element of A(ak) be :0). Let all other entries of the matrices be 0. (Note that the entire last row of each matrix is zero.) exp(r Finally, let A = exp(O). We now show that the above PPA with cutpoint A accepts all strings in L+(D,O) and that these are the only strings which it accepts. Given any string x in L+(D,O), there must be exactly one derivation for that string in CHAR(D). Let this deri- vation be given by 62 where the productions applied are “1’"2""’"r' Since xeL+(D,O), we must also have I' 2: R("1)>0 i=1 We must now show that r I “A(ak) FO k=1 and A = exp(O). 64 The monotonicity of the exponential function gives us the result that EM(X)>A. Thus we have shown that xeL+(D,O) :9 xeT(M,A) under the given construction. To establish the converse, we must consider two cases: 1) X€L(CHAR(D)) 2) X¢L(CHAR(D)) In the first case, x has a derivation in CHAR(D) and, as above, we can establish that r EM(x) = exp E R(1rk) k=1 Since we are assuming now that EM(x)>A and that A = exp(O), r ZR("k)>9 k=1 or, equivalently, that xeL+(D,O). The second case, xeL(CHAR(D)) cannot occur, since in this case no derivation for x in CHAR(D) exists, hence A(a = 0 k) x u ZZlH H contradicting the assumption that EM(x)>A. (Recall that A = exp(O), hence is always positive.) 
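A small numerical sketch of this construction may make the exponentiation step concrete. For each terminal a_k it builds the matrix A(a_k) whose (i,j) entry is exp(r) for each production B_i → a_k B_j (with the acceptance state T standing in when the production has no non-terminal on the right), forms E_M(x) = I (Π_k A(x_k)) F, and compares it with λ = exp(θ). The function names and the floating-point representation are illustrative assumptions; the construction itself follows the text.

```python
import math

def build_ppa(nonterminals, productions, weights, theta):
    # nonterminals: list [S = B1, B2, ...]; an acceptance state T is appended.
    # productions: list of (Bi, ak, Bj), with Bj = None for rules Bi -> ak.
    states = list(nonterminals) + ["T"]
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    A = {}                                    # A[ak] is an n x n matrix
    for (Bi, ak, Bj), r in zip(productions, weights):
        M = A.setdefault(ak, [[0.0] * n for _ in range(n)])
        j = idx[Bj] if Bj is not None else idx["T"]
        M[idx[Bi]][j] = math.exp(r)           # exp turns sums of r's into products
    I = [1.0] + [0.0] * (n - 1)               # start in S = B1
    F = [0.0] * (n - 1) + [1.0]               # accept in T
    return A, I, F, math.exp(theta)           # cutpoint lambda = exp(theta)

def E_M(A, I, F, x):
    v = I[:]
    for a in x:                               # v <- v * A(a) for each input symbol
        M = A.get(a)
        if M is None:
            return 0.0                        # symbol never generated by CHAR(D)
        v = [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(v))]
    return sum(vj * fj for vj, fj in zip(v, F))
```

For the example that follows, E_M(x) exceeds λ exactly when x ∈ L(CHAR(D)) and Q_D(x) > θ.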
65 Invoking the equivalence between PA and PPA, we have now proved the following: THEOREM Given any regular discriminant grammar D and cut- point O, there exists a PA M and a cutpoint A such that L+(D,O) = T(M,A). Consider the following example: Let D = ({S=Bl,B2},{al,a2,a3 LTT,S,R) where TT’ and R are given as follows: Hi R(fli) Bl + alB2 1 B2 + a2Bl 0 Bl + a3Bl -1 Bl + a3 —1 H R D - * * * d Then L(C A ( )) - ((a1a2) a3 ) a3 an L+(D,0) = {xlxeL(CHAR(D)) and x(al)> x(a3)} where x(ai) denotes the number of occurrences of ai in x. Then the pseudo-probabilistic automaton is constructed as follows: M = (K, 2 ,I,F,{A(a):aez, }) where K = {S=Bl,B2,T} Z= {al,a2,a3} I = (1,0,0) F = (0,0,1)T 66 0 e 0 A(al) = 0 0 0 O O O O O 0 A(a2) = l 0 0 O O 0 e'1 0 e"1 A(a3) = 0 O 0 0 O O and the generalized cut-point event equivalent to L+(D,O) is T(M,1). Let us check some sample strings: 1) x = (xeL+(D,O)) 3182313283 EM(x) = (1.0.0)AAT e 0 e O = (1,0,0) 0 0 0 0 = e 0 0 0 1 since e > 1, xeT(M,l). 2) x = a (xeL(CHAR(D)) but x¢L+(D,0)) la2a3 EM(x) = (1.0.o>A(al>A(a,>A(a3>(0.0.1>T l o 1 o = (1,0,0) 0 o o o = 1 o o o 1 hence x¢T(M,l). 67 3) x = ala3 (xeL(CHAR(D))) EMT O 0 O 0 = (1,0,0) 0 O O O O O O 1 hence, as predicted by the theorem, x¢T(M,l). V. SUMMARY AND RECOMMENDATIONS A. SUMMARY This thesis introduces the idea of a discriminant grammar as a tool for syntactic pattern recognition, ex- plores certain properties which derive from the definition, and explains certain techniques which the discriminant grammar makes possible. Chapter I provides a motivation for the discriminant grammar concept and provides some informal examples of the advantages of discriminant grammars. Chapter II gives the formal definition of discrimi- nant grammars and goes on to show how they may be used to provide a sufficient statistic for the two-class decision problem involving stochastic languages. Specific formu- lations are given for the Bayesian case and the SPRT. The existence of the Sequential Discriminative Parsing Scheme is one of the principal contributions of this thesis. This chapter also gives a technique for the calculation of bounds on the Bayes error for linear grammars. Chapter III proves some results on the nature of the decision regions of regular discriminant grammars. The result of greatest theoretical interest is the theorem which says that decision regions of regular discriminant grammars with rational coefficients are context free. 68 69 Chapter IV provides the proof of a result which es— tablishes a connection between the regular discriminant grammar and probabilistic automata. Finally, some suggestions are given for future work involving the learning of discriminant grammar coefficients from labeled empirical samples. B. TRAINING METHODS Insofar as pattern recognition is intended as a prac- tical art, it must incorporate a learning or inductive phase as well as a decision algorithm. The decision schemes presented in this thesis center around the context- free unambiguous discriminant grammar. It will be neces- sary to develop schemes for the inference of such a discri- minant grammar given some kind of empirical sample of strings from the classes between which we wish to discri- minate. Two distinct problems arise -- inference of the structural component and inference of the numerical compo- nent of the discriminant grammar. The former is a far more difficult problem than the latter; in fact, the latter sort of learning is usually designated by a less imposing word than "inference" such as "training". 
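As a preview of the numerical stage discussed below: once a characteristic grammar is fixed, each sample string reduces, via the structural-index mapping S, to an integer vector, and learning the coefficients r_i and the cutpoint θ is ordinary linear-threshold training. The fixed-increment perceptron sketch below is one standard way to do this; the function names and the trick of absorbing θ into an extra component are conventional choices assumed here, not prescriptions from the text.

```python
def perceptron_train(pos_indices, neg_indices, epochs=100):
    # pos_indices, neg_indices: lists of structural-index vectors S(x) for
    # sample strings from L+ and L-.  Returns (r, theta) such that
    # sum_i r[i]*m[i] > theta on the L+ samples and < theta on the L- samples,
    # whenever the two index sets are linearly separable.
    k = len(pos_indices[0])
    w = [0.0] * (k + 1)                       # last component absorbs -theta
    samples = [(m, +1) for m in pos_indices] + [(m, -1) for m in neg_indices]
    for _ in range(epochs):
        errors = 0
        for m, label in samples:
            a = list(m) + [1.0]               # augmented index vector
            s = sum(wi * ai for wi, ai in zip(w, a))
            if label * s <= 0:                # misclassified (or on the boundary)
                w = [wi + label * ai for wi, ai in zip(w, a)]
                errors += 1
        if errors == 0:
            break
    return w[:-1], -w[-1]                     # coefficients r_i and cutpoint theta
```

Applied to the trivial a's-versus-b's example worked out below, this procedure finds a separating hyperplane; the text exhibits (-1, +1, 0) with threshold 0 as one such solution.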
In general, the inference of a discriminant grammar is a two-stage process: determination of the structural component (the characteristic grammar) followed by the determination of the numerical component (The function R and the cutpoint 0). It is the purpose of this section to demonstrate the following two points: 1) Once the structural component has been deter- mined, the determination of the numerical component is mathematically trivial. 7O 71 2) The existence of the numerical component makes the determination of the structural component far less critical to the perfor- mance of the decision algorithm. In the inference of ordinary regular or context-free phrase structure grammars (see Fu and Booth(8) for a sur— vey), it is typical to have two finite empirical samples of strings, say L+ and LI, and be required to exhibit a grammar which accepts L+ but does not accept L”. The ex— ponential combinatorics of known methods for accomplish- ing this make the task infeasible for samples of any sub- stantial size, nor is it known how to make effective use of a_priori human knowledge about apparent differences be- tween the sample sets. In addition, the corruption of a single string by noise may sabotage the entire algorithm. In contrast, let us now consider the inference problem in terms of discriminant grammars. Suppose we are given two sample sets L+ and L- and we are asked to exhibit a discriminant grammar D and cutpoint 0 such that L+<_:_ L+(D,O) and L‘s; L‘(D,e). As a first step, we must determine a phrase structure grammar G with the property that L(G); L+UL- which will serve as a characteristic grammar for D. There are a great many obvious grammars which will satisfy this requirement and we are free to use our human powers of inference to select grammars which seem like good candidates. The primary criterion in which we are interested is that G have some productions 72 which will be of most use in generating strings from L+ and some productions more useful in generating strings from L . If we wish to allow for noisy strings, we can choose G such that L(G) is as large as possible, in fact, we may * T . Once we have chosen the characteristic grammar for D well choose G such that L(G) = V (or, more probably, several candidate characteristic grammars), we are now ready for the second (numerical) stage of the inference process. To accomplish this, we first determine the structural indices (using the mapping S defined previously) of all strings in L+ and L-. The problem of numerical training is now simply one of finding a separating hyperplane between these two sets of indices. If these sets happen to be linearly separable, we have made a good choice of G and the desired coefficients may be found by means of the perceptron algorithm. In the more likely case that the sets of indices are not linearly separable, (or in the case where the sample strings are probabilistically generated infinite sequences), we may find coefficients which are in some sense "good" by using well-known variants of the perceptron algorithm. (See Duda & Hart(A)). As a trivially simple example, consider the following two sample sets: L+ = {abbba , babba , bb} and 73 L- = {aaa , abaaa , abbaaa , a} We note that the strings in L- contain more a's than b's and vice versa for the strings in L+. 
We therefore con- struct the following tentative characteristic grammar G: G = ({S}>{aab}a.fi ,3) Where IT is given by l) S + aS 2) S + b3 3) 8+8 The structural indices of the sample strings may be sum— marized as follows: STRING CLASS S(x) abbba +1 (2.3.1) babba +1 (2.3.1) bb +1 (0,2,1) aaa -1 (3.0.1) abaaa —l (A,l,l) abbaaa -l (h,2,l) a -1 (1,0,1) The two classes are indeed linearly separable. A suitable set of coefficients might be (-1,+l,0) with threshold 0. In terms of the discriminant grammar we are seeking this gives us r1 = —1, r2 = +1, r3 = O and O = O. 74 Although this problem is admittedly trivial, the point is that it is usually far easier to separate L+ from L— than it is to analyze the structure of either L+ or L‘, thereby avoiding complex problems of structural inference, (or at least deferring the structural inference problem until the complexity of the problems rise to meet the capabilities of the analytical technique). It is hoped that investigation of the various aspects of this training problem will provide fruitful ground for further research. C. OTHER FUTURE WORK There are several other directions in which this work might be extended. First, an improved error analysis for the sequential discriminative parsing scheme would be desirable, including estimates of the amount of work saved by truncating the parse. It would also be valuable to see how the new field of grammatical inference could be brought to bear on the problem of constructing characteristic grammars which emphasize the differences between sets of sample strings rather than the regularities within a sam— ple set. On a more theoretical level, an investigation of further relationships between discriminant grammars and probabilistic automate would be interesting. For example, is it true that every cut-point event can be represented as a decision region of some discriminant grammar? It is also an interesting question whether a discriminant gram- mar can always be extended to accept arbitrarily noisy strings, that is, given a discriminant grammar D with char- acteristic grammar G and a cut-point O, is it always pos- sible to find a discriminant grammar D' with characteristic grammar G' and cut-point 6' such that L'(D,O)EE L'(D',0') * and L+(D,O)E L+(D',O') and L(G') = vT '2 75 76 It is not hard to imagine a host of other questions, such as the possibility of extending the sequential dis- criminative parsing scheme to grammars which are not LL(k) or the possibility of devising a discriminant grammar scheme based on ambiguous grammars. It is the authors hope that this thesis opens up a new area of study which will prove important in the future. APPENDICES APPENDIX A. LL(k) GRAMMARS AND TOP-DOWN PARSING Of the three general classes of parsing algorithms (top-down, bottom-up, and tabular), the top-down method is of greatest relevance to this thesis. This technique al- lows us to reconstruct a LMCD in a forward (start symbol to sentence) manner, using information obtained from the input string to guide the parser in selecting productions. The parsing method described here is the LL(k) technique, which allows us to parse a certain class of unambiguous context-free languages in linear time. This class of languages is defined as follows: Let G = (VN’VT’ TT ,8) be a context-free grammar. Let k be a natural number, and leto( be a string over VN u VT' Then define a a FIRSTk(ot) = {x e VT [either |x|xy for some y in VT} Informally, FIRSTk(dJ is the set of k—symbol terminal prefixes of strings derivable from (X. 
In the following, let the notation ⇒_lm stand for leftmost derivation. Then G is LL(k) if and only if the three conditions

    1) S ⇒*_lm xAα ⇒_lm xβα ⇒*_lm xv
    2) S ⇒*_lm xAα ⇒_lm xγα ⇒*_lm xw
    3) FIRST_k(v) = FIRST_k(w)

imply that β = γ.

LL(k) languages lend themselves in a natural way to deterministic top-down parsing. Informally, as we are reconstructing the derivation of a sentence z in L(G) and wish to know what production rule to apply to the non-terminal A in the sentential form xAα, the decision can be determined by looking at the k terminals following the prefix x of z. In practice, an LL(k) parser may easily be programmed using either table look-up or recursive descent. A complete discussion of LL(k) techniques may be found in Aho and Ullman(1).

It is known that for any k the class of LL(k) grammars forms a proper subset of the unambiguous context-free grammars, and furthermore that the class of LL(m) grammars properly includes the class of LL(n) grammars if m exceeds n. Given any language L, it is undecidable whether L is generated by an LL(k) grammar, but the class of languages parsable by LL(k) methods is large enough, for example, to allow ALGOL to be thus compiled (see Lewis and Rosenkrantz(16)).

An LL(1) grammar is called simple if and only if for each pair (A,a), where A ∈ V_N and a ∈ V_T, there is at most one production of the form A → aα. Many of the examples used in this thesis are simple LL(1) grammars.

APPENDIX B. STOCHASTIC GRAMMARS

In many situations, particularly in pattern classification problems, it is desirable to have a probability assignment over a language, that is, a mapping P: L → [0,1] such that

    Σ_{x ∈ L} P(x) = 1

Considered as a set of ordered pairs, this function is called a stochastic language. Since interesting languages are usually infinite, it behooves us to specify an algorithm for the computation of this function in order to define it. One way to do this is to combine the probability computation with the language generation by means of a "stochastic grammar". See, for example, Booth and Thompson(2).

A stochastic grammar is a quintuple F = (V_N, V_T, Π, S, P), where V_N, V_T, Π, and S are the non-terminal set, the terminal set, the production set, and the start symbol, respectively, and P is a mapping from Π to the unit interval [0,1]. The characteristic grammar CHAR(F) is the ordinary grammar formed from the first four components of F. The probability of a given derivation π1, π2, π3, ..., πn is defined to be equal to the product

    P(π1) P(π2) P(π3) ... P(πn)

Implicit in this definition is the following extremely important principle:

UNRESTRICTEDNESS ASSUMPTION: The probability of the application of any production depends only upon the presence of a given premise in a derivation and not upon how the premise was generated.

For all x ∈ L(CHAR(F)), the probability of x is defined to be the sum of the probabilities of all distinct LMCD's of x. Note that if the grammar is unambiguous, there is only one term in this sum. If these probabilities sum to one over L(CHAR(F)), then they form a true probability mass function and F is said to be consistent; in this case the resulting stochastic language is said to be generated by F and is denoted L(F).

For each A ∈ V_N, let Π(A) be the subset of Π consisting of all productions whose premise is A. Then F is said to be proper if, for all A ∈ V_N,

    Σ_{πi ∈ Π(A)} P(πi) = 1

All stochastic grammars used in this thesis will be unrestricted, consistent, and proper.
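To make the preceding definitions concrete, the following short sketch (added here as an illustration, with a hypothetical representation of productions as (premise, right-hand side) pairs) computes the probability of a derivation as the product of its production probabilities and checks the properness condition just stated.

    # Illustrative sketch: derivation probability and the properness check for a
    # stochastic grammar.  P maps each production (premise, rhs) to a probability.
    from functools import reduce
    from math import isclose

    def derivation_probability(derivation, P):
        # Probability of a derivation pi_1, ..., pi_n: the product P(pi_1)...P(pi_n).
        return reduce(lambda acc, prod: acc * P[prod], derivation, 1.0)

    def is_proper(P, nonterminals):
        # Proper: for every non-terminal A, the probabilities of the productions
        # with premise A sum to one.
        return all(
            isclose(sum(p for (A, _), p in P.items() if A == N), 1.0)
            for N in nonterminals
        )

    # Example: a stochastic version of the grammar S -> aS | bS | epsilon.
    P = {("S", ("a", "S")): 0.5, ("S", ("b", "S")): 0.3, ("S", ()): 0.2}
    print(is_proper(P, {"S"}))                                    # True
    print(derivation_probability(
        [("S", ("a", "S")), ("S", ("b", "S")), ("S", ())], P))    # 0.5 * 0.3 * 0.2

For an unambiguous grammar such as this one, the probability of a sentence is simply the probability of its single leftmost derivation; for an ambiguous grammar the probabilities of all distinct leftmost derivations would have to be summed, as noted above.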
BIBLIOGRAPHY

(1) Aho, A.V. and Ullman, J.D. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Englewood Cliffs, N.J., 1972.

(2) Booth, T.L. and Thompson, A. "Applying Probability Measures to Abstract Languages." IEEE Transactions on Computers C-22 (May 1973) 442-449.

(3) Feller, W. An Introduction to Probability Theory and Its Applications, v. 1. Wiley, New York.

(4) Duda, R.O. and Hart, P.E. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

(5) Freeman, H. "On the Encoding of Arbitrary Geometric Configurations." IEEE Transactions on Electronic Computers EC-10 (1961) 260-268.

(6) Fu, K.S. Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, New York, 1968.

(7) Fu, K.S. Syntactic Methods in Pattern Recognition. Academic Press, New York, 1973.

(8) Fu, K.S. and Booth, T.L. "Grammatical Inference: Introduction and Survey -- Part I." IEEE Transactions on Systems, Man, and Cybernetics SMC-5 (January 1975).

(9) Good, I.J. Probability and the Weighing of Evidence. Charles Griffin, London, 1950.

(10) Hopcroft, J.E. and Ullman, J.D. Formal Languages and Their Relation to Automata. Addison-Wesley, Reading, Mass., 1969.

(11) Kadota, T.T. and Shepp, L.A. "On the Best Finite Set of Linear Observables for Discriminating Two Gaussian Signals." IEEE Transactions on Information Theory IT-13 (1967) 278-284.

(12) Kanal, L.N. (ed.) Pattern Recognition. Thompson Book Co., Washington, D.C., 1968.

(13) Kirsch, R.A. "Computer Interpretation of English Text and Patterns." IEEE Transactions on Electronic Computers EC-13 (1964) 363-376.

(14) Kullback, S. Information Theory and Statistics. Wiley, New York, 1959.

(15) Lee, H.C. and Fu, K.S. Stochastic Languages for Picture Recognition. TR-EE 72-17, Purdue University, Lafayette, Indiana, June 1972.

(16) Lewis, P.M. and Rosenkrantz, D.J. "An Algol Compiler Designed Using Automata Theory." Proceedings Polytechnic Institute of Brooklyn Symposium on Computers and Automata (1971).

(17) Mendel, J.M. and Fu, K.S. Adaptive, Learning and Pattern Recognition Systems: Theory and Practice. Academic Press, New York, 1970.

(18) Minsky, M. "Steps Toward Artificial Intelligence." Proceedings IRE 49 (1961) 8-30.

(19) Paz, A. Introduction to Probabilistic Automata. Academic Press, New York, 1971.

(20) Wald, A. Sequential Analysis. Wiley, New York, 1947.