DISCRIMINANT GRAMMARS FOR PATTERN CLASSIFICATION

Dissertation for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
ALAN JAMES FILIPSKI
1976

This is to certify that the thesis entitled "Discriminant Grammars for Pattern Classification" presented by Alan James Filipski has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major professor

Date: May 17, 1976

ABSTRACT

DISCRIMINANT GRAMMARS FOR PATTERN CLASSIFICATION

By Alan James Filipski

This thesis introduces the idea of a "discriminant grammar" as a tool for syntactic pattern recognition. A discriminant grammar is defined as a context-free, unambiguous grammar together with a mapping from the productions into the reals. We associate a number with each sentence generated by the grammar by adding the numbers corresponding to each use of a production in the derivation of the sentence. The language is partitioned into three decision regions by comparing this sum to a cutpoint. It is shown that a discriminant grammar may be used to provide a sufficient statistic for decision between two stochastic grammars. Results are given in terms of a Bayesian formulation and a sequential scheme using top-down parsing and LL(k) grammars. It is shown that regular discriminant grammars can yield decision regions that are not recursively enumerable, but if all coefficients are rational, the regions are context-free. Results are obtained concerning the relationship of regular discriminant grammars to probabilistic automata. The use of perceptron-like methods to learn the coefficients from empirical samples is also discussed.

DISCRIMINANT GRAMMARS FOR PATTERN CLASSIFICATION

By Alan James Filipski

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Computer Science
1976

ACKNOWLEDGEMENTS

I would like to express thanks to all members of my committee, especially Dr. Richard Dubes for his careful reading of the various drafts of this thesis and my chairman, Dr. Carl Page, for his encouragement and assistance during the many false starts and doldrums preceding the idea which led to this work. I would also like to express thanks to those who helped in the typing of this thesis, namely Bob Frye and Annette Davis.

Special thanks are due to my wife, Lois, for her encouragement throughout my long period of graduate studies.

TABLE OF CONTENTS

List of Tables
List of Figures
I. INTRODUCTION
II. DISCRIMINANT GRAMMARS - DEFINITION AND INTERPRETATIONS
   A. Definition
   B. A Bayesian Interpretation
   C. Bounds on the Bayes Error (Linear Grammars)
   D. A Sequential Approach to Discriminative Parsing
      1. Introduction
      2. Markov Chain Model
      3. The SPRT
      4. Error Analysis
      5. Parsing Algorithm
   E. Summary
III. SOME LANGUAGE-THEORETIC RESULTS
   A. Regular Discriminant Grammars With Rational Coefficients
      1. Informal Description
      2. Detailed Construction
      3. Example
   B. Regular Discriminant Grammars With Unrestricted Coefficients
   C. Summary
IV. DISCRIMINANT GRAMMARS AND PROBABILISTIC AUTOMATA
V. SUMMARY AND RECOMMENDATIONS
   A. Summary
   B. Training Methods
   C. Other Future Work
APPENDIX A. LL(k) GRAMMARS AND TOP-DOWN PARSING
APPENDIX B. STOCHASTIC GRAMMARS
BIBLIOGRAPHY

LIST OF TABLES

Table 1. Two Stochastic Grammars and Their Log-Likelihood Ratio Representation
Table 2. Example of the SDPS
Table 3. PDA Transition Function (Example)
Table 4. Trace of PDA (Example)

LIST OF FIGURES

Figure 1. Syntactic Decision Strategies
Figure 2. Encoded Line Patterns
Figure 3. Acceptor for Production Sequences
Figure 4. Sequential Discriminative Parsing Scheme
Figure 5. Flowchart of Push-Down Automaton
Figure 6. Procedure ADDSTK(r)

I. INTRODUCTION

The term "Pattern Recognition", as applied to the more classical feature-space oriented techniques as well as to the newer syntactic methods, has tended to obscure a fundamental difference of intent between the two approaches. The former is almost exclusively concerned with the development of algorithms to extract or utilize discriminatory information, that is, information which pertains to the assignment of an object to one of several classes. Any information which does not further this end is considered a nuisance. Most work in "Syntactic Pattern Recognition", on the other hand, does not stress this distinction. The "Syntactic Pattern Recognizer" is generally expected to transduce the input object into a derivation, which is a representation of all structural information inherent in the original object in terms of some given grammar. This derivation may then be interrogated to answer questions about the object, produce a description of the object, etc.

We can thus distinguish between "Pattern Classification" and "Pattern Description", the former being primarily a process of information reduction and the latter a process of information re-organization. Both of these are useful ends toward which syntactic means may be applied. They are, however, distinct ends, and efficiency may be sacrificed if we confuse them. As an extreme example, suppose we are given a character string and we must decide whether it is a Fortran program or an Algol program. A very inefficient way to make this decision would be to feed the string into a Fortran compiler and then into an Algol compiler and then note which generated fewer error messages. It is obvious that the decision could be made instead by a very small and fast program which looked for a few characteristic structures. The important assumption here, of course, is that we are interested only in a decision and not in any by-products such as error messages or object programs.

It is the object of this work to show that syntactic methods may be applied to a purely classificatory end, i.e. that there is such a thing as structural discriminatory information and that this information may be effectively extracted and used by syntactic means. Hopefully, restricting our attention to discriminatory information will serve to make the path to the decision itself a direct one.
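To make the Fortran-versus-Algol example above concrete, the following is a minimal sketch of such a "decision-only" classifier in modern notation. It is not part of the original development; the particular keywords chosen are hypothetical stand-ins for "characteristic structures", not a serious language identifier.

```python
def classify_source(text: str) -> str:
    # Toy decision-only classifier: count a few characteristic tokens
    # instead of fully compiling the string under both languages.
    fortran_hints = ("SUBROUTINE", "DIMENSION", "GO TO", "FORMAT")
    algol_hints = ("begin", "end", ":=", "comment")
    fortran_score = sum(text.count(tok) for tok in fortran_hints)
    algol_score = sum(text.count(tok) for tok in algol_hints)
    return "Fortran" if fortran_score > algol_score else "Algol"

print(classify_source("      DIMENSION A(10)\n   10 FORMAT(I5)"))  # -> Fortran
print(classify_source("begin integer i; i := 1 end"))              # -> Algol
```

The point is only that a decision can be reached by inspecting a small amount of discriminatory evidence, with no attempt to recover the full structure of the input.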
It is evident that in many applications for which the use of syntactic methods has been proposed, only the ulti- mate decision is of any importance, and additional infor- mation supplied by the parse represents wasted effort. Examples are hand-printed character identification and automatic blood-cell counting. Most systems used on a production basis need only supply a decision (or possibly 3 report inability to make a decision). See, for example, Kanal(l2). Another instance in which descriptive information in addition to a decision is of very little value occurs when the grammar is obtained by means of the semi-automatic in- ductive methods now being explored by a number of re— searchers. (See Fu & Booth(8) for a survey.) In this case there would be no a priori correspondence of semantic with syntactic notions with the result that knowledge of the derivation becomes rather useless. In this case we should simply ask the system to report a decision rather than to exhibit a parse in an unfamiliar grammar. Figure 1 depicts three possible approaches to syntac- tic pattern classification. Figure l.a represents the most straightforward approach: The string is processed through a parser for each grammar; one, none, or both of the parsers then report success. Figure l.b is a modifi- cation of this in the case where probabilistic information is available. In this case two stochastic parses yield the probabilities that the string was generated by each of the two stochastic grammars. This information, together with the a priori likelihoods of the two classes, is fed into a Bayes decion-maker. These two methods are used, for example, by Lee & Fu(15). Figure 1.0 represents the approach introduced in this thesis. A single parse is performed using a "discriminant grammar" which maps the string x into a single real number, say f(x). The sign PARSER XEL(G1)? USING G1 WHICH PARSE P SUCCESSFUL, ' 9 ARSLR XEL(G2)' IF ANY? USING G2 (l.a) LIKELIHOOD OF Ll vs. L2 STOCHASTIC P(x|G ) PARSER FOR GI BAYES DECISION STOCHASTIC P(x|G2) . PARSER FOR (l.b) DISCRIMINANT GRAMMAR x PARSE FOR f(x) Ll vs. L2 (1.0) Figure l Syntactic Decision Strategies 5 of this number then determines the classification of x. The absolute value of f(x) is a measure of our confidence in the classification. (Under a statistical interpreta- tion of the discriminant grammar to be given later, f(x) relates to a quantity which I. J. Good calls the "amount of information in an observation about a hypothesis". (Good(9), Kullback(lA))) The potential advantages of the discriminant grammar approach are several: 1.) Only a single parse need to made. In fact, as we shall see later, if the user wishes to specify error tolerances, this parse need not even be completed. 2.) A discriminant grammar can usually be made simpler than a grammar describing either of the languages because only discriminatory information need be considered. For example, the two languages {aib1|1_>_1} L“ ll {biailigl} II" II are both context-free, but a discriminant grammar with regular underlying grammatical structure can be used to discriminate between them on the basis of whether the first character of a given string is an a or a b. 3.) The set of strings for which the discriminant grammar will yield a decision can be made larger than the union of 6 the two original languages, thus giving a means for the classification of noisy strings. As a trivial example, suppose we want to distinguish between the languages {a}* and {b}*. 
It will prove to be a simple matter to produce a discriminant grammar (again, with regular underlying structure) which will be able to classify any string from {a,b}* depending upon whether there are more a's or more b's in the string. The discriminant grammar approach may be regarded as a step towards a synthesis of the syntactic and feature- space formulations of the Pattern Recognition problem. We will show later how syntactic information may be repre- sented by means of a mapping from strings into a space of structural indices. Once this transformation is made, we then may apply results from feature-space theory, as pre- sented, for example, in Duda & Hart(4). It is hoped that this "structural statistics" approach will help the two methodologies of Pattern Recognition to cross-fertilize each other. One of the most important areas of syntactic Pattern Recognition concerns the processing of two-dimensional pictures. Since the early 1960's (e.g. Minsky(l8)) there has been considerable work done on languages for the re- presentation of various types of two-dimensional images. Freeman(5) proposed a language of concatenated vectors to describe connected curves. Kirsch(l3) suggested a grammar of two-dimensional arrays. Web grammars, Plex grammars, 7 Tree grammars and a number of picture description lan- guages have been proposed (see Fu(7) for a survey) in ad- dition to many special purpose languages, e.g., for structural formulae, flowcharts, and mathematical expres- sions. In some cases these grammars transduce the input picture into a one-dimensional string which can then be dealt with using all the apparatus of formal language theory. Methods of this type have been quite successful (e.g. Lee & Fu(15)). Other grammars, such as the Web and Plex grammars, seem to demand a more general notion of parsing. This thesis follows the lead of the former ap- proach and is set in the classical theory of formal lan— guages. It is easy to conceive of situations in which two-dimensional syntactic classification schemes may be put to use. In the game of Go, for example, a board con— figuration is essentially syntactic in that it is the spatial relationships between primitives (stones) which is of significance in determining the worth of a given board position to either player. Furthermore, if our task were to develop a static evaluation function for the game we would be interested only in the relative value of a position and not in its full syntactic description. This is an example of the possible use of a two-dimensional ex- tension of the discriminant grammar technique. II. DISCRIMINANT GRAMMARS-DEFINITION AND INTERPRETATIONS A. DEFINITION The basic structure introduced and developed in this thesis for use in syntactic pattern recognition is the "discriminant grammar" or "d-grammar". Formally, a dis- criminant grammar D is defined to be an ordered quintuple D = (VN,VT,TT ’SaR) The first four components are, respectively, a set of non-terminal symbols, a set of terminal symbols, a set of production rules, and a start symbol. It is assumed that the reader is familiar with the concepts and notational conventions of formal language theory as contained in, for example, Hopcroft and Ullman(10). The component R is de- fined as a mapping from the production setTT into the real numbers. It is often convenient to use the notation ri for R(ni), where nieTT. The ordinary phrase-structure grammar defined by the first four components of D will be denoted CHAR(D) and will be called the characteristic grammar of D. 
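As a concrete, purely illustrative rendering of this definition, a discriminant grammar can be represented directly by its five components. The Python sketch below is one convenient encoding assumed for later examples; it is not part of the formal development.

```python
from dataclasses import dataclass

@dataclass
class DiscriminantGrammar:
    """D = (V_N, V_T, Pi, S, R): a context-free grammar together with a
    real-valued weight R(pi_i) attached to each production pi_i."""
    nonterminals: set   # V_N
    terminals: set      # V_T
    productions: list   # Pi: list of (lhs, rhs) pairs, rhs a tuple of symbols
    start: str          # S
    weights: list       # R: weights[i] = R(productions[i]) = r_i

    def char_grammar(self):
        # CHAR(D): the ordinary phrase-structure grammar obtained by
        # forgetting the weight component R.
        return (self.nonterminals, self.terminals, self.productions, self.start)
```

Here weights[i] plays the role of r_i = R(pi_i).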
Some restrictions are usually put on the form of the production rules of a grammar. The restriction that has been found to strike a good balance between generative power and mathematical tractability is the context-free restriction, i.e. that the premise of each production rule consist of a single non-terminal symbol. In addition, it is reasonable to require that each string in the language have an essentially unique derivation in the given grammar. This property is called unambiguity. We will limit the grammars under consideration as follows: unless otherwise noted, all discriminant grammars used in this thesis will be assumed to have unambiguous, context-free characteristic grammars.

A mapping Q_D from L(CHAR(D)) into the reals is defined by

    Q_D(x) = Σ (k = 1 to n) r_{i_k}

where π_{i_1}, π_{i_2}, π_{i_3}, ..., π_{i_n} is the unique left-most canonical derivation (LMCD) of the string x. (Throughout this thesis, when we speak of "derivations" we shall mean sequences of productions, not sequences of sentential forms.) Note that it does not in fact matter whether we use the LMCD to compute Q(x) or whether we use any derivation of x, since all derivations of x in the unambiguous characteristic grammar are simply permutations of each other. When considered as a set of ordered pairs, the function Q is called the discriminant language generated by the discriminant grammar D and is denoted L(D), i.e.

    L(D) = { (x, Q(x)) | x ∈ L(CHAR(D)) }

Given any real number θ (a "cutpoint"), the language L(CHAR(D)) may be partitioned into three classes in the following way:

    L+(D,θ) = {x | x ∈ L(CHAR(D)) and Q(x) > θ}
    L0(D,θ) = {x | x ∈ L(CHAR(D)) and Q(x) = θ}
    L-(D,θ) = {x | x ∈ L(CHAR(D)) and Q(x) < θ}

This partition will be interpreted as a simple decision scheme which will allow us to distinguish between two languages. It is sometimes conceptually convenient to decompose this decision scheme into four stages: structural indexing, projection, linear translation and quantization. It is possible in this way to express the decision as a composition of four functions: x belongs to L+(D,θ), L0(D,θ), or L-(D,θ) according as sgn(T(P(S(x)))) is +1, 0, or -1, where S, P and T are described as follows.

Let N be the set of non-negative integers. For any discriminant grammar D = (V_N, V_T, Π, S, R), let k be the number of productions in Π, i.e.

    Π = {π_1, π_2, ..., π_k}

For any x ∈ L(CHAR(D)) and for i = 1 to k, let m_i equal the number of times π_i occurs in the LMCD of x. Then define S : L(CHAR(D)) → N^k as

    S(x) = (m_1, m_2, m_3, ..., m_k).

We may interpret S as a mapping from L(CHAR(D)) into a space of structural indices determined by the underlying characteristic grammar. Note that even though CHAR(D) is unambiguous, the mapping S is not necessarily one-to-one, as the following example shows:

EXAMPLE Let CHAR(D) be G = ({S,A},{b,c},Π,S) where Π consists of the productions

    S → AA
    A → b
    A → c

Then S(bc) = S(cb) = (1,1,1).

The projection function P : N^k → (-∞, +∞) is defined as

    P(m_1, m_2, ..., m_k) = Σ (i = 1 to k) m_i r_i

Finally the translation function T : (-∞, +∞) → (-∞, +∞) is defined as

    T(ξ) = ξ - θ

(Note that T is actually a function of θ also.) The output of this function is then quantized at zero, yielding the decision.

This exposition of simple discriminant grammar decision-making in terms of a composition of many-to-one functions is valuable because it defines exactly what sort of information-reduction is performed at each stage of the process from observation to decision. Clearly the most interesting function of this chain, and the one which is unique to this thesis, is the structural indexing function S.
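Assuming the representation sketched earlier, and assuming the LMCD of x is available as a sequence of production indices, Q_D and the four-stage decomposition can be written out directly. The names below are hypothetical; the sketch simply restates the definitions just given.

```python
def Q(D, derivation):
    # derivation: list of indices i_1, ..., i_n of the productions used in
    # the LMCD of x; Q_D(x) is the sum of the corresponding weights r_i.
    return sum(D.weights[i] for i in derivation)

def structural_index(D, derivation):
    # S(x) = (m_1, ..., m_k): m_i counts how often production i occurs.
    m = [0] * len(D.productions)
    for i in derivation:
        m[i] += 1
    return tuple(m)

def decide(D, derivation, theta):
    # The composition sgn(T(P(S(x)))): projection, translation, quantization.
    m = structural_index(D, derivation)
    p = sum(mi * ri for mi, ri in zip(m, D.weights))   # P(S(x))
    t = p - theta                                      # T(xi) = xi - theta
    return "+" if t > 0 else ("0" if t == 0 else "-")  # region L+, L0 or L-
```

Since P(S(x)) = Q_D(x), decide(D, derivation, theta) returns the same region as comparing Q(D, derivation) directly with the cutpoint θ.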
The functions S and P are implicit in the specifications of the discriminant grammar, while the function T depends upon some specified cutpoint. As a simple example of a discriminant grammar, con- sider the following: EXAMPLE Let D = ({S,A,B},{a,b};TF,R) WhereTTand R are given by the following table: N1 R(Ni) S + aA O S + bB O S + e O A + aA +1 A + bB —l A + e O B + aA -1 B + bB +1 B + e 0 Given any string xe{a,b}*, QD(x) tends to be positive if x contains long "runs" of either a's or b's and tends 13 to be negative if the a's and b's tend to alternate with a short period. A concrete interpretation of this discrimi- nant grammar would be to consider the a's and b's respec- tively as positive and non-positive slopes between appro- priately sampled points of some periodic function over some interval. Then for some arbitrary cutpoint 6, L+(D,6) would consist of relatively low frequency functions while L-(D,6) would contain functions with predominantly higher frequency components. Figure 2 gives an example of such a situation. The functions are sampled at the "0" point of the axis and at each "+" point. The sampled value at each "+" point is compared to the previously sampled value. If there has been an increase, an a is inserted into the string, otherwise a b is inserted. For example, the first character of 2.a is an a since the ordinate of the curve is greater at the first "+" than at the origin. The en- codings and their corresponding Q-values are given in the figure. Thus we see that the string represented in Figure 2.a is contained in L-(D,0) while the string re- presented in Figure 2.b is contained in L+(D,O). 11+ OOOOO+OOOO+IO00+...C+OOOC+OOOO+OOOO+OIOO+OOOO+OOOO+ 2.a x = abaabbaaba QD(X) = -3 OOOOO+OOOO+OOOO+OOOO+COOO+OOOO+IOOO+OOOO+OOOO+OOOO+ 2.b x = aaabbbbbbb QD(X) = 7 Figure 2 Encoded Line Patterns B. A BAYESIAN INTERPRETATION In this section we will investigate the use of the discriminant grammar in making a Bayes decision between two stochastic languages. (See the Appendix for defini- tions and notations involving stochastic grammars.) Definition Two stochastic grammars G1 and G2 are said to be commensurate if and only if CHAR(Gl) = CHAR(GZ) and all and G . production probabilities are nonzero under both G1 2 Definition A discriminant grammar D is said to be a lo - likelihood ratio representation of two commensurate sto- chastic grammars G1 and G2 if and only if: 1. The characteristic grammars of G1, G2 and D are the same. 2. For all nieTT, R(fli) = log P1(Ni) - log P2(fli) Where P1,P2 are the probability functions of G1 and G2 respectively. If D is the log-likelihood ratio representation of G1 and G2, we write D = LLRR(G1,G2). Note that in general, LLRR(G1,G2) does not equal LLRR(G2,G1). Suppose now that we are given a string x and two commensurate stochastic grammars G and G such that x can 15 l 2 16 be generated by their common characteristic grammar. Let [—1(x) be the probability of generating x under G Let i' the a priori probabilities of G1 and G2 be denoted Pr(Gl) and Pr(G2). Let Hi be the hypothesis that x was generated under G1. 
Finally, let LiJ be the loss incurred in de- ciding Hj when the true hypothesis actually is Hi' Given any x, the conditional expected loss incurred in deciding H is given by: l tl(x) = L lPr(G2 x) + LllPr(Gl x) 2 Using Bayes rule, by which Pr(ei) [:(x) Pr(Gilx) = Pr(x) We obtain L21Pr(62) [—;(x)+LllPr(Gl) [~;(x) tl(X) = Pr(x) Similarly, the conditional expected loss incurred in de- ciding H2 is given by L12Pr(Gl) rl(x)+L22Pr(G2) fq 2(x) t2(X) = Pr(x) 17 We can minimize the expected loss by deciding 1) H1 if tl(x) < t2(x) 2) H2 if t2(x) > tl(x) 3) making a random decision if tl(X) = t2(X) This is the same as deciding H if 1 (L2l—L22)Pr(G2) r-;(x) < (LlZ-L11)Pr(Gl) [’11(x) If we assume that L13 > Lii when i f 3, this inequality reduces to f—;(x) Pr(GZ) (L21-L22) -——-—- > Bu) Pr(Gl) (L12-Lll) Taking logs of both sides and expanding ‘Tfli yields the rule: decide Hl if m ZEE:: Pr(Gz) (L21’L22) (log Pl(nik)-log P2(nik)) > log Pr(Gl) (L12‘L11) k = l where Ni N1 “1 is the LMCD of x in i’ 2’ " ’ m CHAR(Gl). Now let D = LLRR(G ) and let 1’G2 18 Pr(G2) (L21'L22) 6 = log Pr(Gl) (L12'L11) Then we have shown that the unconditional expected loss (Bayes risk) is minimized by the rule: if st+(D,6) then decide Hl if xeL’(D,e) then decide H2 if xeLO(D,e) then decide arbitrarily. Using a loss function of L = L = l and 12 21 L11 = L22 = O, the Bayes rule minimizes the probability of misclassification. This Bayes error may be expressed as P(err) =Zmin [Pr(Gl) l—‘lm, Pr(G2) F200] xeL(CHAR(Gl)) The size of this probability is in general quite difficult to compute. In special cases, however, we may obtain bounds on the value of P(err). One such case is discussed in the next section. C. BOUNDS ON THE BAYES ERROR (LINEAR GRAMMARS) In this section we present a technique for the calcu- lation of upper and lower bounds on the Bayes error (P(err)) for linear grammars using the Bhattacharyya co- efficient. Definition A context-free grammar is linear if an only if the consequence of each production (i.e. the right hand side of that production) contains at most one non—terminal. Thus each production must be of the form A + xBy or A + x s s where erT , erT Definition Given two probability mass functions p(.) and q(.) defined on the same discrete sample space X, the Bhattacharyya coefficient p of p vs. q is o=E W) xeX It is known (see Kadota & Shepp(1l)) that the Bhattacharyya coefficient can be used to form an upper and 19 20 lower bound on P(err) as follows: pzminmwthrmznn _<_ P(err) 5, p/2 We now solve the following problem: Given two commen— surate stochastic grammars G1 = (VN,VT,—rr ,S,Pl) and G2 = (VN,VT,1TI,S,P2) with linear characteristic grammars, com- pute the Bhattacharyya coefficient of rfil vs. [—2 . First, to each production n ETT , assign a number qi 7JP1(W1)P2(wi) Then p may be expressed as k p =:E; [[ qimi i = l where the summation extends over all xeL(CHAR(G1)). As before, the m1 are the structural indices of x in the k-element production set. The key to the computation of this sum of products is the realization that the set of derivations of sentences in a linear grammar is itself a regular language over the production set TI. This fact will become apparent by the following construction. (For simplicity, we omit commas in the derivations.) Suppose VN = {A1 (=S),A1,A2,...Am} is the set of non- terminals for the linear grammar. 
Then the non—determinis— tic finite-state acceptor for derivations has m+l states {sl,s2,...sm+l} and on input "1 has a transition from sJ to s k where A3 is the non-terminal in the premise of Ni and Ak is the non-terminal in the consequence of N if the 13 21 consequence of Ni contains no non-terminals, then the transition should go to Sm+l' Let sl be the start state m+1 be the final state. Then a string «11 N12... “1r is accepted by this automation if and only if it is a and s valid derivation of some string in the original linear grammar. Note that the terminals in the original grammar have no bearing on the construction of this acceptor. EXAMPLE Suppose G = ({A,B,C}, {d,e,f,g},Tr,A) where the productions are given by + eeB + dAgg + fb 8 + eB '* 880 + Afg -q 0\ \n .= t» n) +4 0 0 tr: w w :> :b :1:- + + d Then the corresponding acceptor for production sequences is given by Figure 3. We now return to the original problem of calculating the Bhattacharyya coefficient. In the graph of the auto- maton Just constructed, associate with each transition “1 the number q1 defined previously. Our problem is now that 22 of summing the products of the q1 along all possible paths from the start state to the accept state. To accomplish this, construct the (m+l) by (m+l) matrix W such that Wij equals the sum of the qk corresponding to all single-step transitions between 81 and SJ. In the example, for in— stance, Wl2 would be set equal to ql+q3. In addition to this, set Wh equal to one. Now let W' be equal to +l,m+1 the limit of wn as n approaches infinity, if this limit exists. It is now an easily obtained combinatorial fact. (See, for example, Feller(3)) that p = W'l’m+1 Note that this same technique may be extended to the class of grammars for which the number of non-terminals in every possible sentential form has some upper bound de- pending only upon the grammar. (This is the class of "ultralinear" grammars.) In this case, we simply assign a name to each group of non-terminals which can appear in a sentential form and follow the above procedure using this "super-non-terminal". S 3 (From C) Figure 3 Acceptor for Production-Sequences 23 D. A SEQUENTIAL APPROACH TO DISCRIMINATIVE PARSING D. 1. Introduction Since Wald's introduction of the Sequential Probabil- ity Ratio Test (SPRT) (Wald(20)), the principle of optional stopping has become well known in the pattern recognition literature. See for example, Fu(6). The two principal approaches are the sequential Bayes approach and the SPRT. If only a finite number of features are available, these are not necessarily equivalent. The Bayes approach as- signs costs to both feature measurements and incorrect decisions and then selects a scheme which minimizes the expected total cost. This is usually done computationally by backwards dynamic programming. The SPRT, on the other hand, is more closely related to Neyman-Pearson theory and stops observing features when the value of the likelihood ratio guarantees certain bounds on misclassification pro- babilities. In this section we exhibit a scheme for sequential syntactic decision-making using the SPRT, discriminant grammars and LL(k) top-down parsing techniques. This scheme could save both parsing time, and, in the case of on—line systems, feature extraction time (and cost). The availability of such a sequential scheme is a unique 2D 25 aspect of the discriminant grammar approach. 
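As a preview of the sequential discriminative parsing scheme developed in the remainder of this section, the sketch below assumes an iterable parse_lmcd that yields, in top-down order and without backup, the index of each production in the LMCD of the input string. The handling of parses that terminate before either boundary is reached is shown only as one plausible completion (deciding on the sign of the accumulated sum); the exact treatment is discussed in the text.

```python
import math

def sdps(weights, parse_lmcd, A=100.0, B=0.01):
    # weights[i] = R(pi_i) = log P1(pi_i) - log P2(pi_i)
    # A, B: Wald-style stopping boundaries on the likelihood ratio.
    alpha, beta = math.log(A), math.log(B)
    h = 0.0
    for i in parse_lmcd:            # productions arrive one at a time
        h += weights[i]
        if h >= alpha:
            return "H1"             # stop early: decide the string came from G1
        if h <= beta:
            return "H2"             # stop early: decide the string came from G2
    return "H1" if h > 0 else "H2"  # parse finished inside the boundaries
```

With A = 100 and B = 0.01, as in the worked example below, the boundaries are alpha = 4.61 and beta = -4.61.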
Informally, the essence of the scheme is this: We are given some string x and we want to decide which of two commensurate stochastic grammars generated x. We use the LL(k) parser to supply us (in top-down sequence and with- out backup) with the sequence of productions which forms the LMCD of the string x. Each time we are informed of a production in the derivation we decide either to request the next production from the parser or to abort the parse and make a decision immediately as to which grammar gener- ated x. The stopping criteria will depend upon pre- assigned error tolerance. We now proceed to lay the statistical foundations of the sequential parsing technique in some detail. Crucial issues in the development include the necessity for top- down parsing and the treatment of parses which terminate before reaching a stopping boundary. Suppose that we are given two commensurate stochastic grammars G1 = (VN,VT,11-,S,Pl) and G2 = (VN,VT,11-,S,P2). Denote their common characteristic grammar by G and let D = LLRR(G1,G2). For any string xeL(G), denote by H1 (1 = 1,2) the hypothesis that x was generated by G1. As before, let r—1(x) denote the probability of the genera- tion of x under H . 1 Since derivations in G are strings over-II , we shall a use the customary notations-YT. and ‘FT+ to mean, respec- tively, the set of all production sequences and the set of 26 all non-empty production sequences. The symbols 0, p and w will usually be used to represent production-sequences; o(n) will represent the nth production in the sequence 0 The sample space )3 for our experiment will be the set of all LMCD's under G. We will ignore the null derivation, so ,3 QTY... Given any pa ,8 such that p is a LMCD for xeL(G), then for i = 1,2, define Pr(pIH1> = i—1 Since G is unambiguous, there is a one-to-one corres- pondence between points in ,3 and strings over L(G), hence the fact that r—g(.) is a true discrete probability mea- sure implies that Pr(.|Hi) is also a probability measure. SUPposetmat p = p(l)p(2)p(3)-.-p(n). Then, by the unrestrictedness assumption, we have Pr(pIHi) = Pi(p<1>)P1(p(2>)...pi(p 0, decide H1 and stop. This procedure is flow-charted in Figure A. EXAMPLE Suppose we are given two stochastic grammars ({S.B}.{c.e.r.h}. II ,P > C) II and G2 ({s.B}.{c.s,r.h}.TY .132) where Tr , P1 and P2 are given in Table 1. Table 1 also gives R(ni)=log Pl(wi) - log P2(ni). Under this defini— tion of R, D=LLRR(Gl,G2) = ({S,B},{c,g,f,h}, TT ,R). Both ( START [... a1 find next w c i [h + h+R(n) l ‘ _ STOP 4 0 Y decide H j 1 N | Y N done decide H2 Y | decide Hl ‘ N a Figure A Sequential Discriminative Parsing Scheme 36 37 G1 and G2 are consistent and proper and their underlying characteristic grammar G is LL(l). Let us arbitrarily set A = 100 and B = .01. The probability of the parse reaching the wrong stopping boundary is then less than 1%. (Recall that this is not a bound on the overall error of the procedure.) We then have o(= log 100 = “.61 and B = log .01 = -A.6l. Suppose now that we are given the string ccccgfhhhhh. The underlying characteristic grammar belongs to a partic- ularly simple class of LL(l) grammars for which we can de- termine one production of the LMCD for each symbol of the string scanned. Table 2 summarizes the results of the ap- plication of the SDPS to the given string. Recall that h is the accumulator variable representing the cumulative sum of R(ni). 
As indicated in the table, the parse was truncated and a decision was made that x was generated by Gl after observing only four of the eleven symbols of the string. This was possible because the pre-determined up- per stopping boundary A.6l was exceeded. Table 1 Two Stochastic Grammars and Their Log-Likelihood Ratio Representation i ni P1(ni) P2(w1) ’R(wi) l S + 08B .8 .2 1.39 2 s +g .2 .8 -1.39 3 B + fBB .3 .u -.288 4 B +h .7 .6 .151: 38 Table 2 Example of the SDPS Symbol Production Scanned Determined R(n1) h Action 0 "l 1.39 1.39 continue c "1 1.39 2.78 continue 0 n1 1.39 h.l7 continue 0 n1 1.39 5.56 stop; decide H1 39 E. SUMMARY In this chapter we have formulated the definition of the Discriminant Grammar and the decision regions which accompany it. We then demonstrated that the Discriminant Grammar has a natural application in providing a sufficient statistic for the two-class decision problem involving stochastic grammars. First a Bayesian approach was de- scribed and then it was shown how a sequential probability ratio test could be implemented using a top—down parsing technique. It should be noted that top-down parsing of an LL(k) language can be programmed quite readily using re- cursive descent. In all the above statistical interpretations it was assumed that the two stochastic grammars involved are com- mensurate. If they are not commensurate, we cannot apply the above results, but all is not lost, since we can use training methods to be discussed later. 40 III. SOME LANGUAGE-THEORETIC RESULTS A. REGULAR DISCRIMINANT GRAMMARS WITH RATIONAL COEFFICIENTS A. l. Informal Description Discriminant grammars whose underlying characteristic grammars are regular have many interesting properties. We shall call such discriminant grammars regular discriminant grammars. In this section we intend to show that given any regular discriminant grammar D = (VN,VT,'TY ,S,R) where all the R(ni) are rational, then for any rational number O, the decision region L-(D,O) is a context-free language. The following informal argument will serve as a basis for a proof of this theorem: Denote R(w1) by r1, as before. Convert the numbers O,rl,r2,...rn to integers by multiplying each by the least common multiple of their denominators. To simplify nota- tion, we will retain the same names for these (now inte— gral) numbers. In effect, we create a new discriminant grammar D with integral coefficients. This does not change the region L-(D,O). Now construct a non-deterministic finite-state automaton which accepts L(CHAR(D)). We shall now modify this into a non-deterministic push-down automa- ton which will accept L-(D,O). The existence of this U1 A2 automaton implies that L-(D,O) is a context-free language. We shall also show by example that L—(D,O) is not neces- sarily regular. We proceed as follows to construct the automaton: Let M be the maximum ab solute value of the (now integral) numbers rl,r2,...rn,O. Clearly, we can construct a finite- state machine which is capable of adding any two integers of opposite Sign and absolute value less than or equal to M and producing a result of absolute value less than or equal to M. Call this machine the "adder". Next we need a push-down store, each location of which is capable of storing an encoding of an integer of absolute value less than or equal to M. 
The final push—down automaton is an assemblage of these three components (non—deterministic recognizer, ad- der, push—down store) plus some mechanism for integrating these parts, which works as follows: Suppose the recog- nizer (in a non-deterministic mode) makes a transition which indicates that w is in the derivation of the input 1 string. If the push-down store is empty, the operation of the machine would be to place ri on the store. If the store is not empty, denote the value of the top element by K. If ri and K are of the same sign, then the machine should push r onto the stack. If r and K differ in sign, 1 i then the machine should add r1 and K, forming a number r' of lesser absolute value than either r1 or K, pop the stack, exposing a new top symbol K', and then compare r' A3 with K'. The machine should proceed in this way until r' is eventually added into the stack. This stack manipula- tion algorithm has two very important properties: 1) At no time is a number with absolute value greater than M on the stack. 2) All numbers on the stack have the same sign. When the recognizer reaches a final state, the adder is then used in exactly the same way to add 0 into the stack. After this operation, the input string is accepted as an element of L-(D,O) if and only if the top stack ele- ment is negative. This informal description of the operation of the machine is flow-charted in Figure 5 and Figure 6. The multiple arrows in Figure 5 denote the non-deterministic step in which the non-deterministic finite-state automa- ton discovers another step in the parse. Figure 6 de- fines the procedure ADDSTK with formal parameter r. In both figures, K refers to the top element of the stack. NDFA stands for non-deterministic finite-state automaton In Figure 6 the condition "K*r > 0" does not imply that an actual multiplication need be done, only that K and r are non-zero and have the same sign. A. 2. Detailed Construction We now give a detailed formal construction to imple- ment the foregoing informal description. A (non-determi— nistic) push-down automaton (PDA) is a 7-tup1e C ) NDFA CALL DONE? ADDSTK(—e) I GET “1 DECIDE FROM NDFA xeL (D39) , N r . CALL DECIDE STOP ADDSTK(Pi) xeL'(D,9) Figure 5 Flowchart of Push-Down Automaton 1H4 C 3 EEE: Yr PUSH r ONTO STACK ‘ RETURN 3 ' r + K+r ' POP STACK +— Figure 6 Procedure ADDSTK(r) as M = (192 . T" .6.q0.zO.F> where: K is a finite set of states :8. is a finite input alphabet f“ is a finite stack alphabet 6 is a function from KX(23 U{e})xr1 into CFYKXTT *) qOeK is the initial state Zoe r" is the initial stack symbol F E K is a set of final states The details of the operation of a PDA may be found in Hopcroft and Ullman(10). The fundamental result we will use is the fact that a language is accepted by a non-deter- ministic push-down automaton if and only if the language is context free. We will begin with a regular discriminant grammar with rational coefficients and a rational number 0 and we will construct a PDA which will accept L'(D,C). First, as before, we convert all coefficients, including O, to inte- gers in the range -M to +M. This does not alter the re— gion L-(D,O). VN = {Bi} and VT = {a1} Let A, A', and T be three symbols not in VN. 47 Then let K = (VNu{A,A',TDx{-M,-M+l,...o...M—1,M} §3= vT F‘= {-M,-M+1,...—l,l,2,...M-1,M,z } q0 = (830) F = {(Ts0)} The construction of the transition function 6 is somewhat more complex. 
For each non-terminal B and ter- 1 minal ak there is a (possibly empty) set of production A and possibly a production of the form B1 + ak (2) We will use the symbol C to represent either non- terminals (B3) or either of the special symbols A or A'. Specifically, the set designated {Cik} contains the A sym- bols {B 13:1 as represented by (1) and also the symbol A J if there is a production of the form (2). (In this case J the C1k Analogously, the set {nik} is the set of integers (r's) can be considered to have the superscript 2+1.) associated with the productions (l) and (2). For example, suppose that the productions of TI having B on the left and a on the right and their 1 k associated r-values are: A8 Production r-value Bi + akBl +5 Bi + akB2 —2 Bi + ak 0 Then {Cik}33l ‘ {Bl B2,A} and {hik132 -1 = {5,-2.0} In the following, m1k represents the number of pro— ductions in Tr with B1 on the left and ak on the right. To avoid subscripted superscripts, we will usually omit the subscripts on m when context makes them clear. Z represents an arbitrary stack symbol (either a number in the specified range or the stack start symbol 20). We now have sufficient notation to define 6. First, for each Bi’ 6 should contain the following non-deterministic, non-e transition rules (for all Zer| ): (1) 6((Bi,o), ak,Z) = {((Cikn nik), Z’}r m Informally, this is equivalent to saying that no matter what is on the stack, if the machine is in state (Bi,0) and ak is read, then, for each production n of the form Bi + akBJ with R(n)=n, the machine moves (non-determinis- tically) into state (BJ,n). If there is a production a of the form Bi + ak with r(n)=n, then the machine also moves into state (A,n). In all cases the stack symbol remains unchanged. The purpose of this class of instructions is 49 to allow the machine to recognize when a production rule might have been applied and to incorporate the r—value of that production rule into the memory of the PDA's finite state control. Next we need a mechanism to transfer these r-values from the finite-state control onto the push down stack as described informally earlier. For each C, therefore, we shall include the following e—rules: For all combinations of non-zero n and Z such that n and Z have the same Sign or Z=Z include the rules 0, (2) 6((Can),€,z) = {((C,0),nZ)} i.e., if the number in the finite-state control has the same sign as the number on top of the stack, or if the symbol on top of the stack is Z0, then the number from the finite-state control may be pushed onto the stack. For all combinations of n and Z such that n and Z dif- fer in Sign and n is not zero, include the rules: (3) 6((C,n).€.Z) = {((C,Z+n).6)} i.e., if the number in the finite state control differs in sign from the number on top of the stack, we add the top stack element into the finite state control and pop the stack. By using the above rules, we eventually arrive in the state (A,0) if and only if there is a derivation for the input string in CHAR(D). The next step is to put -0 into 50 the finite control. This may be accomplished by the sin- gle production: (A) 6((A,0),e,Z) = {((A',-9),Z)} for all Zei—‘. Previously defined transition rules may now be used to transfer -0 from the finite control to the stack, ultimately sending the machine into state (A',0). The additional symbol A' is necessary here to insure that this rule is used only once. Now it remains only to in- terrogate the top stack symbol and send the machine into the final state if and only if this top symbol is negative. 
This is accomplished by including the following set of productions for all Z less than zero: (5) 6((A',0),e.z) = {((T,0),Z)} The existence of the above construction establishes the following: THEOREM Given a regular discriminant grammar D with ra- tional values for R(n1) and given any rational numberO, the language L-(D,O) is context-free. Simple modifications to the above construction could be made to show that L+(D,O) and LO(D,O) are also context- free. A. 3. Example As an example, consider the following regular discri- minant grammar with rational coefficients: 51 D = ({S=Bl},{al,a2,a3},'TF,S,R) Where IT and R are given by Tri I'i Bl + alBl +1 Bl + a2Bl -1 Bl + a3 +1 Clearly, L-(D,l) = {xlxc{al,a2}*a3 and x(a2) > x(al)} where the notation x(a) means the number of occurrences of the symbol a in the string x. Since this language is not regular, this example shows that we cannot strengthen the theorem to say that L"(D,O) must be regular. According to the construction, M = (K. 2 . F" .6. q0.zo.F> where K = {(Bl,1),(Bl,O),(Bl,~l),(A,l),(A,O), (A.-l).(A'.1).(A'.0).(A'.-l).(T.l). (T30):Ts'l)} 2 = {al,a2,a3} T1= {l,-l,Z } q0 = (Blao) F = {(T,O)} 52 and the transition function 6 is given in Table 3. In that table, each element of the transition function is labelled I.J. where the I refers to one of the five rules used to generate it and the J is simply an ordinal number within a group. Table A gives a trace of the action of this automaton in accepting the string x=a2ala2a3. Following the conven— tion of Hopcroft and Ullman, the bottom of the stack is on the right and stack symbols are added or deleted from the left. Also remember that "—l" is a single stack symbol, not two. As expected, this string is accepted because the PDA finally enters state (T,O). O F’ Table 3 PDA Transition Function (Example) 6((Bl.0).al.-l) = {((Bl.l).-1)} 6((81.0),a1,1> = {((Bl.1),)} 6((81.0).al.z0> = {((Bl.1),z )} 6((Bl.0).a2.zo) = {((Bl.-1).Z0)} 6((Bl.0).a2.-l) = {((Bl.-l).-l)} 6((Bl.0).a2.1) = {((Bl.-l).l)} 6((Bl,o),a3,z0) = {((A,l),Z0)} 6((Bl,0)sa3,-l) = {((A,1),-l)} 6((Bl.0),a3,1) = {((A,l),l)} 6((Bl.l).e.zo) = {((Bl.0).lzo)} 6((Bl.-1>.e.zo) = {((Bl.0),-lzo)} 5((Aal)9€azo) = {((A30)31Z0)} 6((A9-l)9egzo) {((A,0),-lz0)} 6((A',l),e,Z0) {((A',O),1Z0)} 6((A'.-1>.e.zo> = (((A'.0).-lzo)} 53 .10 .ll .12 Table 3 (continued) 6((Bl,1),e,l) = {((Bl,0),11)} 5((Bl.-l).e,-l) = {((Bl.0).-1-1>} 6((A.l).e,l) = {((A,o),ll)} 6((A.-l).e.-l) = {((A.0).-l-l)} 6((A'.l).e.l) = {((A'.0).ll)} 6((A'.-l).6.-l) = {((A'.0).-l-l)} 6((Bl.1).e.-1) = {((B1.0).e)} 6((Bls‘l)sesl) = {((Blao)s€)} 6((A’l)353-1) {((A30)95)} 6((A.-1),e.1) {((A.0).e>} 6((A',1).e.-1> = {((A',o>,e)} 6((A'.-1>.e.1) = {((A',0),e)} 6((A.o>.e.zo> = {((A'.-1>.z )} 6((A.o>.e,1) = {((A',-l).l)} 6((A,O))€3-1) = {((A'3-l),-l)} 5((A' .0),6.-1)={((T.0).-l)} 514 Table A Trace of PDA (Example) 55 input 6-rule state stack contents (81.0) 20 a2 1.“ (Bl,-l) Z0 8 2.2 (Bl,0) -1ZO a1 1.1 (Bl,l) -1Z0 6 3.1 (81,0) ZO a2 1.4 (Bl,-l) Z0 8 2.2 (81,0) -1ZO a3 1.8 (A.l) -120 e 3.3 (A,O) 20 e ".1 (A',-1) Z0 6 2.6 (A',O) -1Z0 6 5.1 (T,O) -lZO B. REGULAR DISCRIMINANT GRAMMARS WITH UNRESTRICTED COEFFICIENTS In this section we show that if the ri and O are not constrained to be rational, then there are decision re- gions L_(D,O) which are not context-free. THEOREM The class of languages expressible in the form L-(D,O), where CHAR(D) is regular, is uncountable. PROOF For each real number ¢, let D(¢) represent the fol- lowing discriminant grammar: D(¢> = ({S}.{a.b.c}. 
1T .S.R) where TI and R are given by 1 Hi R(wi) 1 S + aS cos(¢) 2 S + bS -sin(¢) 3 S + c 0 ¢ may be regarded as a parameter of the discriminant gram— mar. We now define the uncountable class of languages. {L'(D(¢).0) | 0:¢_<_1r/2} 56 57 We now show that the languages in this class are distinct. AS before, let x(a) denote the number of occurrences of the symbol a in the string x. Then we can express _ a L (D(¢),O) x {xe{a,b} c|x(a)cos(¢)-x(b)sin(¢)<0} = {xe{a,b}*c|x(a)/x(b)A} Paz(l9) shows that this class of languages is not changed if we remove the restrictions that I and the tran- sition matrices be stochastic and if we allow F to be an arbitrary vector. In this case we use the terminology "pseudo-probabilistic automaton" (PPA) and "pseudo-proba- bilistic event". The language accepted by a PPA with cut- point A is called a "general cut-point event" (GCE). We will now show that, given any regular discrimi- nant grammar D, the region L+(D,O) is a GCE. Suppose that we are given a regular discriminant grammar D=(VN,VTJT,S,R). Since D is regular, all productions must be of the form B1 + akBJ OI' Bi+ak 61 In the former case denote the associated r-value as rij and in the latter case as rfio. Given any cutpoint O, we will now construct a PPA M and a cutpoint A such that L+(D,O) = T(M,A). Let K = VNU{T}, where T is some symbol not already in VN which will serve as an "acceptance state". Denote the cardinal— 1,B2,B3,.. set of input symbols for M be identical to the set of ter- ity of K by n, so that K = {S=B .Bn=T}. Let the minal symbols of D (Z =VT). Let I be the n-component row vector with a l in the first position and 0's else- where. F is defined as the n-component column vector with a one in the nth position and zeros elsewhere. Finally we construct the set of matrices {A(a):aez }. The essential construction here involves exponentiation so that the addi- tive structure of the discriminant grammar may be transla- ted to the essentially multiplicative structure of the PPA. To accomplish this, for each production of the form Bi + akBJ in TV , let the (1,3) element of A(ak) be :0). Let all other entries of the matrices be 0. (Note that the entire last row of each matrix is zero.) exp(r Finally, let A = exp(O). We now show that the above PPA with cutpoint A accepts all strings in L+(D,O) and that these are the only strings which it accepts. Given any string x in L+(D,O), there must be exactly one derivation for that string in CHAR(D). Let this deri- vation be given by 62 where the productions applied are “1’"2""’"r' Since xeL+(D,O), we must also have I' 2: R("1)>0 i=1 We must now show that r I “A(ak) FO k=1 and A = exp(O). 64 The monotonicity of the exponential function gives us the result that EM(X)>A. Thus we have shown that xeL+(D,O) :9 xeT(M,A) under the given construction. To establish the converse, we must consider two cases: 1) X€L(CHAR(D)) 2) X¢L(CHAR(D)) In the first case, x has a derivation in CHAR(D) and, as above, we can establish that r EM(x) = exp E R(1rk) k=1 Since we are assuming now that EM(x)>A and that A = exp(O), r ZR("k)>9 k=1 or, equivalently, that xeL+(D,O). The second case, xeL(CHAR(D)) cannot occur, since in this case no derivation for x in CHAR(D) exists, hence A(a = 0 k) x u ZZlH H contradicting the assumption that EM(x)>A. (Recall that A = exp(O), hence is always positive.) 
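A small numerical sketch of this construction may make the exponentiation step concrete. For each terminal a_k it builds the matrix A(a_k) whose (i,j) entry is exp(r) for each production B_i → a_k B_j (with the acceptance state T standing in when the production has no non-terminal on the right), forms E_M(x) = I (Π_k A(x_k)) F, and compares it with λ = exp(θ). The function names and the floating-point representation are illustrative assumptions; the construction itself follows the text.

```python
import math

def build_ppa(nonterminals, productions, weights, theta):
    # nonterminals: list [S = B1, B2, ...]; an acceptance state T is appended.
    # productions: list of (Bi, ak, Bj), with Bj = None for rules Bi -> ak.
    states = list(nonterminals) + ["T"]
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    A = {}                                    # A[ak] is an n x n matrix
    for (Bi, ak, Bj), r in zip(productions, weights):
        M = A.setdefault(ak, [[0.0] * n for _ in range(n)])
        j = idx[Bj] if Bj is not None else idx["T"]
        M[idx[Bi]][j] = math.exp(r)           # exp turns sums of r's into products
    I = [1.0] + [0.0] * (n - 1)               # start in S = B1
    F = [0.0] * (n - 1) + [1.0]               # accept in T
    return A, I, F, math.exp(theta)           # cutpoint lambda = exp(theta)

def E_M(A, I, F, x):
    v = I[:]
    for a in x:                               # v <- v * A(a) for each input symbol
        M = A.get(a)
        if M is None:
            return 0.0                        # symbol never generated by CHAR(D)
        v = [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(v))]
    return sum(vj * fj for vj, fj in zip(v, F))
```

For the example that follows, E_M(x) exceeds λ exactly when x ∈ L(CHAR(D)) and Q_D(x) > θ.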
65 Invoking the equivalence between PA and PPA, we have now proved the following: THEOREM Given any regular discriminant grammar D and cut- point O, there exists a PA M and a cutpoint A such that L+(D,O) = T(M,A). Consider the following example: Let D = ({S=Bl,B2},{al,a2,a3 LTT,S,R) where TT’ and R are given as follows: Hi R(fli) Bl + alB2 1 B2 + a2Bl 0 Bl + a3Bl -1 Bl + a3 —1 H R D - * * * d Then L(C A ( )) - ((a1a2) a3 ) a3 an L+(D,0) = {xlxeL(CHAR(D)) and x(al)> x(a3)} where x(ai) denotes the number of occurrences of ai in x. Then the pseudo-probabilistic automaton is constructed as follows: M = (K, 2 ,I,F,{A(a):aez, }) where K = {S=Bl,B2,T} Z= {al,a2,a3} I = (1,0,0) F = (0,0,1)T 66 0 e 0 A(al) = 0 0 0 O O O O O 0 A(a2) = l 0 0 O O 0 e'1 0 e"1 A(a3) = 0 O 0 0 O O and the generalized cut-point event equivalent to L+(D,O) is T(M,1). Let us check some sample strings: 1) x = (xeL+(D,O)) 3182313283 EM(x) = (1.0.0)AAT e 0 e O = (1,0,0) 0 0 0 0 = e 0 0 0 1 since e > 1, xeT(M,l). 2) x = a (xeL(CHAR(D)) but x¢L+(D,0)) la2a3 EM(x) = (1.0.o>A(al>A(a,>A(a3>(0.0.1>T l o 1 o = (1,0,0) 0 o o o = 1 o o o 1 hence x¢T(M,l). 67 3) x = ala3 (xeL(CHAR(D))) EMT O 0 O 0 = (1,0,0) 0 O O O O O O 1 hence, as predicted by the theorem, x¢T(M,l). V. SUMMARY AND RECOMMENDATIONS A. SUMMARY This thesis introduces the idea of a discriminant grammar as a tool for syntactic pattern recognition, ex- plores certain properties which derive from the definition, and explains certain techniques which the discriminant grammar makes possible. Chapter I provides a motivation for the discriminant grammar concept and provides some informal examples of the advantages of discriminant grammars. Chapter II gives the formal definition of discrimi- nant grammars and goes on to show how they may be used to provide a sufficient statistic for the two-class decision problem involving stochastic languages. Specific formu- lations are given for the Bayesian case and the SPRT. The existence of the Sequential Discriminative Parsing Scheme is one of the principal contributions of this thesis. This chapter also gives a technique for the calculation of bounds on the Bayes error for linear grammars. Chapter III proves some results on the nature of the decision regions of regular discriminant grammars. The result of greatest theoretical interest is the theorem which says that decision regions of regular discriminant grammars with rational coefficients are context free. 68 69 Chapter IV provides the proof of a result which es— tablishes a connection between the regular discriminant grammar and probabilistic automata. Finally, some suggestions are given for future work involving the learning of discriminant grammar coefficients from labeled empirical samples. B. TRAINING METHODS Insofar as pattern recognition is intended as a prac- tical art, it must incorporate a learning or inductive phase as well as a decision algorithm. The decision schemes presented in this thesis center around the context- free unambiguous discriminant grammar. It will be neces- sary to develop schemes for the inference of such a discri- minant grammar given some kind of empirical sample of strings from the classes between which we wish to discri- minate. Two distinct problems arise -- inference of the structural component and inference of the numerical compo- nent of the discriminant grammar. The former is a far more difficult problem than the latter; in fact, the latter sort of learning is usually designated by a less imposing word than "inference" such as "training". 
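As a preview of the numerical stage discussed below: once a characteristic grammar is fixed, each sample string reduces, via the structural-index mapping S, to an integer vector, and learning the coefficients r_i and the cutpoint θ is ordinary linear-threshold training. The fixed-increment perceptron sketch below is one standard way to do this; the function names and the trick of absorbing θ into an extra component are conventional choices assumed here, not prescriptions from the text.

```python
def perceptron_train(pos_indices, neg_indices, epochs=100):
    # pos_indices, neg_indices: lists of structural-index vectors S(x) for
    # sample strings from L+ and L-.  Returns (r, theta) such that
    # sum_i r[i]*m[i] > theta on the L+ samples and < theta on the L- samples,
    # whenever the two index sets are linearly separable.
    k = len(pos_indices[0])
    w = [0.0] * (k + 1)                       # last component absorbs -theta
    samples = [(m, +1) for m in pos_indices] + [(m, -1) for m in neg_indices]
    for _ in range(epochs):
        errors = 0
        for m, label in samples:
            a = list(m) + [1.0]               # augmented index vector
            s = sum(wi * ai for wi, ai in zip(w, a))
            if label * s <= 0:                # misclassified (or on the boundary)
                w = [wi + label * ai for wi, ai in zip(w, a)]
                errors += 1
        if errors == 0:
            break
    return w[:-1], -w[-1]                     # coefficients r_i and cutpoint theta
```

Applied to the trivial a's-versus-b's example worked out below, this procedure finds a separating hyperplane; the text exhibits (-1, +1, 0) with threshold 0 as one such solution.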
In general, the inference of a discriminant grammar is a two-stage process: determination of the structural component (the characteristic grammar) followed by the determination of the numerical component (The function R and the cutpoint 0). It is the purpose of this section to demonstrate the following two points: 1) Once the structural component has been deter- mined, the determination of the numerical component is mathematically trivial. 7O 71 2) The existence of the numerical component makes the determination of the structural component far less critical to the perfor- mance of the decision algorithm. In the inference of ordinary regular or context-free phrase structure grammars (see Fu and Booth(8) for a sur— vey), it is typical to have two finite empirical samples of strings, say L+ and LI, and be required to exhibit a grammar which accepts L+ but does not accept L”. The ex— ponential combinatorics of known methods for accomplish- ing this make the task infeasible for samples of any sub- stantial size, nor is it known how to make effective use of a_priori human knowledge about apparent differences be- tween the sample sets. In addition, the corruption of a single string by noise may sabotage the entire algorithm. In contrast, let us now consider the inference problem in terms of discriminant grammars. Suppose we are given two sample sets L+ and L- and we are asked to exhibit a discriminant grammar D and cutpoint 0 such that L+<_:_ L+(D,O) and L‘s; L‘(D,e). As a first step, we must determine a phrase structure grammar G with the property that L(G); L+UL- which will serve as a characteristic grammar for D. There are a great many obvious grammars which will satisfy this requirement and we are free to use our human powers of inference to select grammars which seem like good candidates. The primary criterion in which we are interested is that G have some productions 72 which will be of most use in generating strings from L+ and some productions more useful in generating strings from L . If we wish to allow for noisy strings, we can choose G such that L(G) is as large as possible, in fact, we may * T . Once we have chosen the characteristic grammar for D well choose G such that L(G) = V (or, more probably, several candidate characteristic grammars), we are now ready for the second (numerical) stage of the inference process. To accomplish this, we first determine the structural indices (using the mapping S defined previously) of all strings in L+ and L-. The problem of numerical training is now simply one of finding a separating hyperplane between these two sets of indices. If these sets happen to be linearly separable, we have made a good choice of G and the desired coefficients may be found by means of the perceptron algorithm. In the more likely case that the sets of indices are not linearly separable, (or in the case where the sample strings are probabilistically generated infinite sequences), we may find coefficients which are in some sense "good" by using well-known variants of the perceptron algorithm. (See Duda & Hart(A)). As a trivially simple example, consider the following two sample sets: L+ = {abbba , babba , bb} and 73 L- = {aaa , abaaa , abbaaa , a} We note that the strings in L- contain more a's than b's and vice versa for the strings in L+. 
We therefore con- struct the following tentative characteristic grammar G: G = ({S}>{aab}a.fi ,3) Where IT is given by l) S + aS 2) S + b3 3) 8+8 The structural indices of the sample strings may be sum— marized as follows: STRING CLASS S(x) abbba +1 (2.3.1) babba +1 (2.3.1) bb +1 (0,2,1) aaa -1 (3.0.1) abaaa —l (A,l,l) abbaaa -l (h,2,l) a -1 (1,0,1) The two classes are indeed linearly separable. A suitable set of coefficients might be (-1,+l,0) with threshold 0. In terms of the discriminant grammar we are seeking this gives us r1 = —1, r2 = +1, r3 = O and O = O. 74 Although this problem is admittedly trivial, the point is that it is usually far easier to separate L+ from L— than it is to analyze the structure of either L+ or L‘, thereby avoiding complex problems of structural inference, (or at least deferring the structural inference problem until the complexity of the problems rise to meet the capabilities of the analytical technique). It is hoped that investigation of the various aspects of this training problem will provide fruitful ground for further research. C. OTHER FUTURE WORK There are several other directions in which this work might be extended. First, an improved error analysis for the sequential discriminative parsing scheme would be desirable, including estimates of the amount of work saved by truncating the parse. It would also be valuable to see how the new field of grammatical inference could be brought to bear on the problem of constructing characteristic grammars which emphasize the differences between sets of sample strings rather than the regularities within a sam— ple set. On a more theoretical level, an investigation of further relationships between discriminant grammars and probabilistic automate would be interesting. For example, is it true that every cut-point event can be represented as a decision region of some discriminant grammar? It is also an interesting question whether a discriminant gram- mar can always be extended to accept arbitrarily noisy strings, that is, given a discriminant grammar D with char- acteristic grammar G and a cut-point O, is it always pos- sible to find a discriminant grammar D' with characteristic grammar G' and cut-point 6' such that L'(D,O)EE L'(D',0') * and L+(D,O)E L+(D',O') and L(G') = vT '2 75 76 It is not hard to imagine a host of other questions, such as the possibility of extending the sequential dis- criminative parsing scheme to grammars which are not LL(k) or the possibility of devising a discriminant grammar scheme based on ambiguous grammars. It is the authors hope that this thesis opens up a new area of study which will prove important in the future. APPENDICES APPENDIX A. LL(k) GRAMMARS AND TOP-DOWN PARSING Of the three general classes of parsing algorithms (top-down, bottom-up, and tabular), the top-down method is of greatest relevance to this thesis. This technique al- lows us to reconstruct a LMCD in a forward (start symbol to sentence) manner, using information obtained from the input string to guide the parser in selecting productions. The parsing method described here is the LL(k) technique, which allows us to parse a certain class of unambiguous context-free languages in linear time. This class of languages is defined as follows: Let G = (VN’VT’ TT ,8) be a context-free grammar. Let k be a natural number, and leto( be a string over VN u VT' Then define a a FIRSTk(ot) = {x e VT [either |x|xy for some y in VT} Informally, FIRSTk(dJ is the set of k—symbol terminal prefixes of strings derivable from (X. 
In the following, let the notation ⇒_lm stand for leftmost derivation. Then G is LL(k) if and only if the three conditions

    1) S ⇒*_lm xAα ⇒_lm xβα ⇒*_lm xv
    2) S ⇒*_lm xAα ⇒_lm xγα ⇒*_lm xw
    3) FIRST_k(v) = FIRST_k(w)

imply that β = γ.

LL(k) languages lend themselves in a natural way to deterministic top-down parsing. Informally, as we are reconstructing the derivation of a sentence z in L(G) and wish to know what production rule to apply to the non-terminal A in the sentential form xAα, the decision can be determined by looking at the k terminals following the prefix x of z. In practice, an LL(k) parser may easily be programmed using either table look-up or recursive descent. A complete discussion of LL(k) techniques may be found in Aho and Ullman(1).

It is known that for any k the class of LL(k) grammars forms a proper subset of the unambiguous context-free grammars, and furthermore that the class of LL(m) grammars properly includes the class of LL(n) grammars if m exceeds n. Given any language L, it is undecidable whether L is generated by an LL(k) grammar, but the class of languages parsable by LL(k) methods is large enough, for example, to allow ALGOL to be thus compiled (see Lewis and Rosenkrantz(16)).

An LL(1) grammar is called simple if and only if for each pair (A,a), where A ∈ V_N and a ∈ V_T, there is at most one production of the form A → aα. Many of the examples used in this thesis are simple LL(1) grammars.

APPENDIX B. STOCHASTIC GRAMMARS

In many situations, particularly in pattern classification problems, it is desirable to have a probability assignment over a language, that is, a mapping P: L → [0,1] such that

    Σ_{x ∈ L} P(x) = 1

Considered as a set of ordered pairs, this function is called a stochastic language. Since interesting languages are usually infinite, it behooves us to specify an algorithm for the computation of this function in order to define it. One way to do this is to combine the probability computation with the language generation by means of a "stochastic grammar". See, for example, Booth and Thompson(2).

A stochastic grammar is a quintuple F = (V_N, V_T, Π, S, P), where V_N, V_T, Π, and S are the non-terminal set, the terminal set, the production set, and the start symbol, respectively, and P is a mapping from Π to the unit interval [0,1]. The characteristic grammar CHAR(F) is the ordinary grammar formed from the first four components of F. The probability of a given derivation π1, π2, π3, ..., πn is defined to be equal to the product

    P(π1) P(π2) P(π3) ... P(πn)

Implicit in this definition is the following extremely important principle:

UNRESTRICTEDNESS ASSUMPTION: The probability of the application of any production depends only upon the presence of a given premise in a derivation and not upon how the premise was generated.

For all x ∈ L(CHAR(F)), the probability of x is defined to be the sum of the probabilities of all distinct LMCD's of x. Note that if the grammar is unambiguous, there is only one term in this sum. If these probabilities sum to one over L(CHAR(F)), then they form a true probability mass function and F is said to be consistent; in this case the resulting stochastic language is said to be generated by F and is denoted L(F).

For each A ∈ V_N, let Π(A) be the subset of Π consisting of all productions whose premise is A. Then F is said to be proper if, for all A ∈ V_N,

    Σ_{πi ∈ Π(A)} P(πi) = 1

All stochastic grammars used in this thesis will be unrestricted, consistent, and proper.
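To make the preceding definitions concrete, the following short sketch (added here as an illustration, with a hypothetical representation of productions as (premise, right-hand side) pairs) computes the probability of a derivation as the product of its production probabilities and checks the properness condition just stated.

    # Illustrative sketch: derivation probability and the properness check for a
    # stochastic grammar.  P maps each production (premise, rhs) to a probability.
    from functools import reduce
    from math import isclose

    def derivation_probability(derivation, P):
        # Probability of a derivation pi_1, ..., pi_n: the product P(pi_1)...P(pi_n).
        return reduce(lambda acc, prod: acc * P[prod], derivation, 1.0)

    def is_proper(P, nonterminals):
        # Proper: for every non-terminal A, the probabilities of the productions
        # with premise A sum to one.
        return all(
            isclose(sum(p for (A, _), p in P.items() if A == N), 1.0)
            for N in nonterminals
        )

    # Example: a stochastic version of the grammar S -> aS | bS | epsilon.
    P = {("S", ("a", "S")): 0.5, ("S", ("b", "S")): 0.3, ("S", ()): 0.2}
    print(is_proper(P, {"S"}))                                    # True
    print(derivation_probability(
        [("S", ("a", "S")), ("S", ("b", "S")), ("S", ())], P))    # 0.5 * 0.3 * 0.2

For an unambiguous grammar such as this one, the probability of a sentence is simply the probability of its single leftmost derivation; for an ambiguous grammar the probabilities of all distinct leftmost derivations would have to be summed, as noted above.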
BIBLIOGRAPHY

(1) Aho, A.V. and Ullman, J.D. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Englewood Cliffs, N.J., 1972.

(2) Booth, T.L. and Thompson, A. "Applying Probability Measures to Abstract Languages." IEEE Transactions on Computers C-22 (May 1973) 442-449.

(3) Feller, W. An Introduction to Probability Theory and Its Applications, v. 1. Wiley, New York.

(4) Duda, R.O. and Hart, P.E. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

(5) Freeman, H. "On the Encoding of Arbitrary Geometric Configurations." IEEE Transactions on Electronic Computers EC-10 (1961) 260-268.

(6) Fu, K.S. Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, New York, 1968.

(7) Fu, K.S. Syntactic Methods in Pattern Recognition. Academic Press, New York, 1973.

(8) Fu, K.S. and Booth, T.L. "Grammatical Inference: Introduction and Survey -- Part I." IEEE Transactions on Systems, Man, and Cybernetics SMC-5 (January 1975).

(9) Good, I.J. Probability and the Weighing of Evidence. Charles Griffin, London, 1950.

(10) Hopcroft, J.E. and Ullman, J.D. Formal Languages and Their Relation to Automata. Addison-Wesley, Reading, Mass., 1969.

(11) Kadota, T.T. and Shepp, L.A. "On the Best Finite Set of Linear Observables for Discriminating Two Gaussian Signals." IEEE Transactions on Information Theory IT-13 (1967) 278-284.

(12) Kanal, L.N. (ed.) Pattern Recognition. Thompson Book Co., Washington, D.C., 1968.

(13) Kirsch, R.A. "Computer Interpretation of English Text and Patterns." IEEE Transactions on Electronic Computers EC-13 (1964) 363-376.

(14) Kullback, S. Information Theory and Statistics. Wiley, New York, 1959.

(15) Lee, H.C. and Fu, K.S. Stochastic Languages for Picture Recognition. TR-EE 72-17, Purdue University, Lafayette, Indiana, June 1972.

(16) Lewis, P.M. and Rosenkrantz, D.J. "An Algol Compiler Designed Using Automata Theory." Proceedings Polytechnic Institute of Brooklyn Symposium on Computers and Automata (1971).

(17) Mendel, J.M. and Fu, K.S. Adaptive, Learning and Pattern Recognition Systems: Theory and Practice. Academic Press, New York, 1970.

(18) Minsky, M. "Steps Toward Artificial Intelligence." Proceedings IRE 49 (1961) 8-30.

(19) Paz, A. Introduction to Probabilistic Automata. Academic Press, New York, 1971.

(20) Wald, A. Sequential Analysis. Wiley, New York, 1947.