.‘ﬂﬂ-

THESIS

 

Date

" m “A «A «A? IllllﬂilllllllllllllllllllIllllllullilllllLllllllml

Eva-Aw “Vi“

3 1293 104945

This is to certify that the

thesis entitled

A LL GRAMMAR ANALYZER

presented by

ZANE C. MOTTELER

has been accepted towards fulﬁllment
of the requirements for

MS Computer Science

 

degree in

 

ﬁj/m

5.

Major professor

21 August 1981

 

0-7639

 

 

MSU

LIBRARIES

 

 

RETURNING MATERIALS:
Place in book drap.to
remove this checkout from
your record. FINES will
be charged if book is
returned after the date
stamped below.

 

 

 

 

 

 

 

A LL GRAMMAR ANALYZER

By

Zane C. Hotteler

A THESIS

Submitted to
Michigan State University
in partiat futfiltment of the requirements

for the degree of
MASTER or SCIENCE
Department of Computer Science

1981

ABSTRACT

A LL GRAMMAR ANALYZER

By

Zane C. Motteler

A grammar analyzer is a computer program which accepts
a grammar as input and decides whether or not the grammar is
of a certain type. The LL analyzer described in this thesis
makes it possible to design LL(1) grammars and recursive
descent parsers.

This analyzer accepts a modified Backus-Naur grammar as
its input language: syntax diagrams may easily be translated
to this grammar. Because the modified Backus-Naur grammar
is itself LL(1)¢ it is parsed by recursive descent. If the
grammar is correct. it is then translated into a graph,
which is searched in various ways in order to determine

whether it is LLtl) or not.

ACKNOHLEDGEMENTS

The author wishes to thank the personnel of the
Computer Science Department. especially its chairman.
Professor Harry G. Hedges. and my academic advisor.
Professor Carl Page. for their support and hospitality
during a very fruitful and beneficial sabbatical year. More
particularly I acknowledge with gratitude the help and
inspiration of my thesis advisor. colleague. and sometime
mentor. Professor Bernhard L. Ueinberg. without whom this
project would never have come to fruition. His help even
included drafting the figures‘ for this text and spending
hours entering programs and data into interactive terminals

for me. Greater love hath no man.
1

ii

TABLE OF CONTENTS

LIST OF FIGURES 0.0...0.0.0....OOOOOOOOOOOOOOOOO000...... TV

INTRODUCTION OO0..0.0.0.0....OOOOOOOOOOOOOOOOOOOCO00...... 1

I. THE GRAMMAR AND ITS GRAPH OOOOOCOOOOOOOCCCOOCOC00...... 5
II. GRAPH SEARCH TECHNIQUES OOOOOOOCCOOCCOOCOCCOOO0...... 18
III. RESULTS OF PROCESSING EXAMPLE GRAMMARS ............. 26

IV. CONCLUSIONS AND FUTURE DIRECTIONS ................... 32

BIBLIOGRAPHY OOOOOOOOOOOOOOOOOOOOOOCCOOOOOOOOOOCOOOCOOOOO 34

iii

Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure
Figure

Figure

1a.
1b.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12a.
12b.
12c.
12d.
13.
14.
15.
16.
17.
18.

19.

LIST OF FIGURES

The modified BNF syntax charts ...............
The modified BNF grammar .....................
Pgrammar ......................................
Pproduction ...................................
Pitem .........................................
Prightalternate ..............................
Palternation .................................
Palternate ...................................
A node in the production graph ...............
Graph of a list of symbols ...................
Graph of a list of alternate symbols ........
Graph of a repeat construct .................
Graphs of GRAMMAR and PRODUCTION ...........
Graph of ITEM ..............................
Graph of ALTERNATION .......................
Graph of ALTERNATE .........................
Mark nullable nonterminals ..................
Function isnullable .........................
Procedure getfirstset .......................
Procedure getfollowsets .....................
Procedure getlastset ........................
The grammar for PL/O ........................
Backhouse's “simple language” ...............

iv

10
10
11
12
12
13
14
15
16
16
17
20
21
22
23
24
26

27

INTRODUCTION

One of the key steps in the process of compiler design
and construction is the definition of the language to be
used. and the expression of its syntax in a grammar. The
designer of a language may wish to ensure that a grammar for
the language has certain special prOperties. since the
language chosen determines the parsing technique or
techniques. Some frequently used grammars for relatively

simple parsing schema are the following:

1. precedence (or the closely related Operator
precedence.) In these grammars. symbols (or more
generally. pairs of strings of length m and n)
adjacent to one another in a sentential form have
unique precedence relations. and a precedence
matrix can be calculated which directs the parsing.
No lookahead is required for precedence parsing.
which is a bottom-up parsing scheme.

2. Lle). Such grammars can be successfully parsed by
examining the current viable prefix together with a
k-symbol lookahead. by a method called recursive
descent. which will be described later in this
thesis. This parsing method works from the top
down.

3. Lle). These grammars include~ both of the
preceding types. so are more general. Like LLtk).
they can be successfully parsed by a k-symbol
lookahead. and like simple precedence. the parsing
is bottom-up.

A gggmmg; analyze; is a program which accepts the

productions of a grammar as its input and produces output

2

which enables the user to infer whether the grammar is of a
particular type; if so. he can proceed to design a parser
apprOpriate to the grammar: and. if not. the analyzer should
supply information on why the grammar fails to be of the
desired type. so that the user can repair to his workshop
and alter it in an attempt to put it into the requisite
form. From this standpoint. simple precedence and Operator
precedence grammars have been extensively studied
(2.3.6.7.10]. Techniques for the analysis of Lle) grammars
for any k are well known (see. for example. Backhouse [11.
pp. 82-116). although seldom used. since most realistic LL
grammars are LL(1) except for a very few productions. which
can be handled in an ad hoc manner. This thesis and the
associated computer project represent the deveLOpment of an
LL grammar analyzer. which in particular will detect whether
a given grammar is LLCl). and if not. precisely where and in
which productions it goes wrong. (See pages 8 and 19 for
definitions of LLll).)

A distinctive feature of this LL analyzer is that it
accepts an input grammar written in a modified Backus-Naur
form. which permits alternation and repetition constructs.
and therefore closely parallels syntax charts. It will
parse any language written correctly in such a grammar.
calculate first and follow sets of nonterminal symbols. and
trace productions to discover multiple paths to the same
symbol. Any place where the LL(1) conditions fail to hold

will be noted on the output. This input language is unique

3
among. grammar analyzers. Furthermore. the author has been
unable to find references to similar algorithms for LL
analyzers in the literature. even though the techniques
behind its construction are well known (see. for example.
Uirth [9]. pp. 280-349. and Backhouse E1]. pp.92-156).

Pyster [8] describes a program called ”ZUSE" which
accepts a grammar as input. verifies whether or not it is
LLll). and produces a parser for the language even if it
fails to be LL(1). by enforcing priorities on which
productions to try first in the event of ambiguities. The
techniques for designing a program to produce such a parser
are described in Birth [9] cited previously. ZUSE has few
similarites to the present effort. however. The input which
ZUSE accepts is more restrictive and less general (lacking
repetition and alternation constructs). Since ZUSE also
produces a parser. it demands more of the user in the way of
input. LLPARSE. the algorithm whose implementation is
described in this thesis. does only the grammar analysis.
Hhile a parser construction could have been included. it was
not felt that this was an essential part of the project at
this time.

Lalonde [5] describes an algorithm which does much the
same thing as ZUSE for LALR languages: it accepts a grammar
in a standard version of BNF as input. tests the grammar for
inclusion in the apprOpriate language class. and generates a
parser for it if possible. The input language is much

simpler than that of ZUSE. but the algorithm otherwise has

4
similar goals. and it differs from the algorithm described
in this thesis essentially in the same way as ZUSE.

LLPARSE reads the productions Of a grammar in modified
Backus-Naur form as its input; there are very few
re/stric/tions on the form Of this input. As the grammar is
read in. it is parsed by recursive descent.‘while syntax
errors (if any) are detected in the process. At the same
time. LLPARSE converts the productions into graphs. Subse-
quently these graphs are searched to determine nullable
symbols. last sets. first sets. and follow sets. the last
three Of which are stored as boolean matrices. Closure
Operations on these matrices determine complete first and
follow sets. which are the output Of LLPARSE. If anywhere
in the grammar there are two or more paths to the same sym-
bol. or if the first and follow sets Of any nullable symbols
overlap. then the grammar fails to be LLtl). and LLPARSE
pinpoints the locationls) of the problem(s). Thus the user
can either attempt to repair the grammar. or. if there are
only a small number Of problems. write ad hoc fixes for them
into a parser for the grammar. Portions of the algorithms
described in this thesis are illustrated in a Pascal-like
pseudo-language; the actual implementation is in machine
independent Pascal. Various constants and restrictions in
the program are controlled by Pascal constants. which can be
changed by the user to fit arbitrary predilections or

machine hardware constraints.

I. THE GRAMMAR AND ITS GRAPH

The building blocks Of the modified BNF grammar used by
LLPARSE are IDENTIFIERs and SYMBOLs. An IDENTIFIER begins
with an alphabetic character and the remainder Of its
characters are alphanumeric. Semantically an IDENTIFIER may
be either a terminal or a nonterminal. depending on whether
or not it appears on the left side Of a production. but this
distinction can not be made until the entire grammar has
been scanned. A SYMBOL begins with any character other than
an alphabetic one and consists Of non-alphabetic characters.
A SYMBOL will always be a terminal in the language of the
grammar. Other lexical details about the input language
will be described later. with the description Of the

implementation.

A PRODUCTION consists Of the left side IDENTIFIER (or
'mastersymbol') followed by the production symbol '::=' and
a list Of zero or more ITEMs. An ITEM may be either an
IDENTIFIER. a SYMBOL. a repetition construct. or an alter-
nation construct. A repetition construct is a list of ITEMs
enclosed between '[S' and "$3”. and semantically means to
repeat the list zero or more times. An alternation

construct is enclosed between the metasymbols "<5“ and "3)“.

GRAMMAR

 

 

 

 

T—‘l PRODUCTION I l —>
PRODUCTION
—-rf IDENTIFIER ‘ _ I,

w—

 

 

IDENTIFIER

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

ITEM f
4 IDENTIFIER 1|
ITEM
WALTERNATION
l
SYMHn. r"'
ALTERNATION
—.l ALTERNATE : 14®———cl ALTERNATE
ALTERNATE

 

 

ITEM

 

Eigggg 1a; The modified BNF syntax charts.

GRAMMAR :3: PRODUCTION CS PRODUCTION S]
PRODUCTION 3:: IDENTIFIER '::=' [3 ITEM $3
ITEM :2: (S IDENTIFIER

$3 '[3' ITEM [S ITEM 8] '53'
SS '(3' ALTERNATION 'S)'
SS SYMBOL S)
ALTERNATION 2:: ALTERNATE 'SS' ALTERNATE
ES '33' ALTERNATE S]
ALTERNATE 3:: [3 ITEM 3]

Figgre 1b; The modified BNF grammar.

and consists Of two or more fields (one Of which might be
null) separated by the metasymbol "SS". Not only is the
null production specifically allowed. but also. in addition.
repetition constructs are nullable. and alternate constructs
may be. so that nullable symbols can exist in the grammar.
Although the form Of the grammar makes more than one
production for each mastersymbol unnecessary. it is allowed.
and any subsequent production will be treated correctly as a
new alternate.

Figure 1a consists Of syntax charts for the modified
BNF grammar. and Figure 1b shows what we shall henceforth
refer to as the 'metagrammar': the modified BNF grammar
expressed in its own grammar. Note the close correspondence
between the syntax charts and the metagrammar. The
metasymbols of modified BNF have tO be enclosed within
quotes so that they will be taken as terminals Of the
metagrammar. He shall use this grammar during the remainder
Of this section to illustrate how LLPARSE analyzes a grammar

and converts it into a sequence of graphs that can be

8
searched for further information about the grammar.

First. note that the metagrammar is LL(1). He shall
define more precisely what this means later on. but for now.
we shall merely state informally that there is no more than
one path from any grammar item directly to any other: and
the only nullable nonterminal. ALTERNATE. has disjoint first
set {SYMBOL.'E$'.'<$'. SYMBOL} and follow set ('$>'.'$$'}.
Thus any modified BNF grammar can be parsed by recursive
descent. Backhouse [1] has an excellent description of this
parsing technique. which is used by LLPARSE.

Each component Of a production is examined by a
procedure whose name consists Of the letter 'p' followed by
the name of the component. such as 'pproduction'. 'pitem'.
etc.. with 'pleftalternate" for I'<$". "par“ for “SS". etc.
There is a procedure called “advanceinputpointer“ which
Obtains the next symbol: at the end Of a production or Of
the grammar it returns "null"; at the end of the grammar
“inputpointer" will be set tO zero. 'Advanceinputpointer'
sets a variable called 'nextsymbol" to one Of the values
I'qidentifier". 'qsymbol". 'qleftalternate'. 'qrightalter-
nate'. 'qleftrepeat". 'qrightrepeat'. 'qor". or 'qnull'.

Advanceinputpointer;
repeat
pproductioni
if inputpointer <> 0 then

advanceinputpointer
until inputpointer = 0;

5133;; g; Pgrammar.

If nextsymbol = qidentifier then
(* firstset Of pproduction is {qsymbol} *)
begin

advanceinputpointer:

pprodsymbolz

while nextsymbol <> qnull do

pitem

(. end of production if nextsymbol = qnull *)
else
end

goto 1 (. error *)3

_jgggg 3; Pproduction.

depending on the next token in the input. The main
Operation of the parser Of GRAMMAR then lOOks as in Figure
2. (Note that we are only describing the parser here. and
not the semantics Of graph construction which accompany it.

These will be described later.) Figure 3 shows the core of

If not (nextsymbol in Eqidentifier. qleftrepeat.
qleftalternate. qsymbOlJ) (* first set of item *) then
goto 1 (. error *)
else begin
case nextsymbol Of
qidentifier. qsymbol: advanceinputpointer 3
qleftrepeat: begin
advanceinputpointer:
repeat
pitem
until nextsymbol = qrightrepeati
(* in follow set of item *)
prightrepeat
end;
qleftalternate: begin
advanceinputpointer;
palternation: .
prightalternate (* in follow set Of alternation *)
end
end (* case *)

ﬁjgggg 3; Pitem.

10

If nextsymbol = qrightalternate then
advanceinputpointer

else
goto 1 (* error *)

Eigg£_ ﬁ; Prightalternate.

"pproduction". which parses an IDENTIFIER. the 'prodsymbol'
('::='). and then one or more ITEMs until the production
ends ('advanceinputpointer' returns the null symbol).

The procedure "pitem". shown in Figure 4. is the most
complicated. since ITEMs can take several forms. NO action
(except semantics) is necessary when an IDENTIFIER or SYMBOL
is recognized. but the beginning Of an alternation or a
repeat construct requires additional action. Note how
parsing a repeat construct causes a recursive call Of
I'pitem". 'Prightalternate' (Figure 5) is representative Of
the procedures which test individual tokens: if the expected
token is not next. an error routine is invoked which skips
the rest of the current production and begins parsing again
at the next. 'Prightrepeat' is similar in form.

Finally. Figure 6 and Figure 7 illustrate procedures
'palternation' and ”palternate". which parse the interior Of

Palternate;
if nextsymbol = qor then
repeat
pori
palternate

until nextsymbol <> qor;

5193;; g: Palternation.

11

If not nextsymbol in (qor. qrightalternate]
then
repeat
pitem
until nextsymbol in [qor.qrightalternate13

E1993: 1; Palternate.

an alternation construct. Since the various grammatical
constructs can be nested inside one another. recursive calls
Of 'pitem' are possible in "palternate". and ”pitem" may
again call 'palternation'. etc.. tO any depth desired.

SO far we have considered only the parsing function Of
LLPARSE. Naturally the actual program also includes
semantic routines. These routines construct graphical
representations Of the productions as they scan them. These
graphs are quite similar to the forms suggested by Uirth
([9]. pp. 295-299). but contain some refinements and
additions not suggested in his work which were made
necessary by the extreme complications which can arise in
some large programming languages (the reader is invited. for
instance. to reduce a Pascal fieldlist to the graphical form
described below: the difficulty is caused by the
non-structured form. See Jensen and Uirth E4]. p. 116 for
a syntax diagram.) Figure 8 shows the construction Of a
typical node in the graph. Each node contains a reference
to an IDENTIFIER. which is null if there is nO symbol
(recall that repeat constructs and some alternation

constructs may be nullable. so there has to be a way to

12

 

Sﬂﬂﬂl
APIEAIER

Eigggg a; A node in the production graph.

 

NP

 

 

 

 

represent this). There are two pointers in each node. NP to
the next node in sequence (if any). and AP to an alternate
of the current node (if any). In addition. there are two
flags: the EA flag signals whether or not the current node
is the last ITEM in an alternation. and the ER flag signals
the end Of a repetition construct. The latter flag is
particularly important in that the successor pointer“of the
last nOde in a repetition points back to an earlier node in
the same graph; the ER flag. if on. thus prevents endless
cycling through repeats when the graph is being searched.
Let us now indicate how a graph is put together. In
each of Figures 9. 10. and 11. a blank node signifies one
which can have any allowed value (T or F in EA and ER. a

pointer or NIL in AP and NP). Filled in nodes will always

 

 

 

SYMHM;1 I SYMHN.2 SYMHM.n

llFl—A-i I HAM” IFIA

E1933; 2; Graph Of a list Of symbols.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

SYMBOL n

MI

5193;; IQ; Graph Of a list Of alternate symbols.

 

 

 

 

contain the data indicated.

A list Of IDENTIFIERs. which may include SYMBOLS. is
simply a linked list Of nodes and successors. as illustrated
in Figure 9. Figure 10 shows the construction Of a list Of
alternate nodes. Finally. in Figure 11. we see a repeat
construct. Note that the last node points back to the
first. and is flagged as the end Of a repeat. The first
node has a null node as its alternate. and this null node is
flagged as the end Of an alternate. If a repeat construct
is followed by other items. this null node will have them as
its successors. while if a repeat construct is also the

beginning Of an ITEM in an alternation. the alternate

14

 

 

 

 

SYMBOL 1 SYMBOL 2 SYMBOL n

all H"—‘ H"!’“""[ l H“

 

 

 

 

 

 

 

 

 

 

 

‘

 

 

 

 

 

NULL
Figggg 1. Graph Of a repeat construct.

pointer Of the null node will point to the next alternate in

sequence.

In order to fix these ideas. we illustrate the graph Of
the metagrammar in the components Of Figure 12. Since this
is a relatively simple grammar. the graphs are unremarkable.
Hith the exception Of EA and ER flags. they are similar to
graphs in Hirth [9]. These flags turn out to be absolutely
crucial in the graph searching process. and represent one Of
the new techniques in LLPARSE.

Physically. it is quite simple for a person to trace
such graphs in order to detect nullable symbols. first sets.
and follow sets. Thus it seems that writing a program tO do
a graph search for this purpose should be, relatively
straightforward. Real computer languages. such as Pascal.

have a much more complicated graph structure. however. so

 

15

 

GRAMMAR—I I PRODUCTICEI PRODUCTION

 

 

 

 

 

[ .
[.[FIFIq._4. F FI...I_.[JIIFITI—

 

 

INULLI

I. TIP .4

 

 

 

 

 

d FRODUCTIONI I IDENTIFIER

 

 

 

I.IFIFI:I__4LF F

 

 

 

 

 

.I-l

 

 

I '::=' TI ITEM
,._.I.IFIFI._I._.IIIFIT
L .

LNULI- J

[.ITIFI.I

 

 

E1995; 123; Graphs of GRAMMAR and PRODUCTION.

the task is far less trivial than it may seem. He describe

this process in some detail in Section II Of this thesis.

16

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

ITEM IDENTIFIER
F Fwy—u FIFI.
'[3' ITEM ITEM I
FlFl”—'[FIFI" 11F T ‘-
NULL
I J,Tl F '-__’
r<$' ALTERNATION «$>«
RIF IF .__,.. IFI F r-r*'- IF IF

 

 

 

 

 

 

 

 

 

 

 

 

ALTERNATION ALTERNATE '$$'

; IFI Fl .—-—.-.I FI Fl ‘ . FITF .__]

 

 

 

 

 

 

 

 

 

 

 

 

 

 

i
ALTERNATE '$$' ALTERNATE

FIT? I FI FIE .I FI T ,3
l

 

 

 

 

 

 

 

 

 

 

 

NULL

.MF

 

 

 

 

£19352 125; Graph of ALTERNATION.

17

TALTERNATEI r ITEM |
D F F FIT _

NULL

1
TIFI.I

E13333 121; Graph Of ALTERNATE.

 

 

 

 

 

 

 

 

 

 

 

 

 

7-1-1:

 

II. GRAPH SEARCH TECHNIQUES

Once the graphs of the productions have been
constructed as described in the previous section. LLPARSE
first goes serially through the symbol table and marks as
terminals alt those IDENTIFIERs which did not occur on the
left side of any production. (Those which did. contain in
their symbol table entry a non-null pointer to a produc-
tion.) It is important for the graph searches which follow
to be able to distinguish between nonterminals and terminals
in the subject grammar. Four separate graph searches then
occur in order to obtain the following sets of symbols:

1. nullable symbols. These consist of all nontermi-
nals for which some path through the graph defining
them consists of all null or nullable symbols.

2. First sets for each nonterminal symbol. The first
set contains all terminals which can begin a sen-
tential form derivable from the nonterminal.

3. Follow sets for each nonterminal symbol. These
contain all terminals which can follow any senten-
tial form derivable from the nonterminal in a sen-
tence in the grammar.

4. Last sets for each nonterminal symbol. These con-
sist of all nonterminal symbols which can occur in
the last position in any sentential form derivable
from the symbol.

There are two steps in the graph search to construct

these sets: the first calculates the various sets from

18

19
individual productions and includes nonterminals in them.
The second combines this information into boolean matrices.
and by the use of transitive closure operations. determines
the final constitution of these sets. Note that the first
step can not be done recursively. for recursion in
productions is always possible. meaning that a recursive
graph search might never terminate. Instead. we must limit
recursion in graph searching to single productions. never
allowing recursion to proceed from inside the graph of one

production to the start of the same. or another. production.

Before we begin discussing the graph searching process.
it would be well to review the definition of an LL(1)
grammar and the significance of the above sets in checking
the definition. Basically. a grammar is said to be strong
LLLL; if at any point during the parse of an input string it
is possible to determine uniquely the production of the
grammar which led to the current sentential form. given only
the input up to and including the current input token and
the one immediately following it. A grammar containing the
following production would not be LL(1). then. because there
are two paths to the symbol C from 8: given current symbol 8
and next symbol C one could not determine uniquely which
alternate to use:

A 3:: B (S C D SS C E S)

Thus any graph search must trace out dual paths to the same

symbol. A second grammar fault which causes a grammar to be

20
non-Ll(1) occurs when the first and follow sets of a
nullable symbol have elements in common. Upon seeing one of
these elements of the intersection. the parser would not be
able to determine whether the null production or some
non-null production of the mastersymbol is the correct one.
The figures which accompany the text in this chapter
are written in 'pseudocode' for ease of understanding.
Figure 13 illustrates the procedure which marks all nullable
nonterminals in the symbol table. It cycles repeatedly
through all productions and marks nullable symbols until no
more can be marked. Its main subprocedure is the boolean
function “isnullable“ (Figure 14). which searches the graph
of a production trying to find a path which contains all
null or nullable symbols. If every path encounters either a
terminal or a nonterminal which is not known to be null.
then it returns "false". If the mastersymbol is defined by
an alternation structure which contains a null or nullable
alternate. then the graph search must proceed to the end of
the alternate structure. and then the successor of the end
node (if any) by a recursive call to "isnullable“. Since
Uhile there is a change do begin
set production pointer to first;
while production pointer is not nil do begin
if isnullable (production)
then mark mastersymbol nullable;
advance production pointer
end

end

Eigggg 1;; Mark nullable nonterminals.

21

Isnullable := false;
if current symbol is null or nullable
or has an alternate
then begin
repeat
searchsymbol := current symbol;
if searchsymbol is null or nullable
and has no successor
then isnullable := true
else if searchsymbol has successor then
if it is end of alternate and
not end of repeat
then
isnullable := isnullable(successor)
else
isnullable := isnullable or
isnullable(successor)
else if it is end of alternate
and not end of a repeat
then .
isnullable := isnullable and
isnullable(successor);
set current to alternate
until no more alternates

5.9253 $3; Function isnullable.

the grammar is finite. eventually every nullable symbol will
be marked. This process is completed before the search for
first sets and follow sets. since information on nullability
is important in determining the components of these sets.
Figure 15 shows the search technique for first sets.
This procedure is called once for each production with a
pointer to the first IDENTIFIER or SYMBOL on the right side.
Any non-null symbol is placed in the first set: if it is a
nonterminal. transitive closure will later be used to
generate terminals in the first set coming from these
nonterminals. If a nullable symbol is placed in the first

set. so are its successor(s). if any. The alternates of any

22

If startsymbol is not null
then put it in first set;
if it is null or nullable then
if it has a successor then
if it is not the end of a repeat construct then
getfirstset(successor)
else put successor in first set
else
if it is part of an alternate structure then
go to the end of the alternate;
if that node has a successor then
getfirstset (successor);
if startsymbol has an alternate then
getfirstset (alternate);

Eigggg 1;; Procedure getfirstset.

symbols in the first set also go there. as do the successors
of any alternate structures containing a null alternate.
This part of the graph search is quite complicated and uses
both the end alternate (EA) and end repeat (ER) flags. The
search is made more difficult by the fact that it is
possible for EA. ER. NP. and AP to occur in any of sixteen
possible combinations of values. Fortunately there are not
that many separate actions to take.

In Figure 16. we see the follow set procedure. whereas
the first set procedure never scans a graph beyond a
terminal symbol. the follow set procedure searches the
entire graph of a production for nonterminals (if any) and
adds their successors (if any) to their follow sets. It is
during this complete graph traversal that LLPARSE also
checks for multiple paths to any one symbol: this is done

during the construction of the first set of a successor to

23

if startsymbol is not null and not terminal then begin
if it has a successor then
put first set of successor in follow set;
set current to startsymbol:
if current has alternate and
has no. or nullable. successor then
repeat
set current to its alternate
until EA or no more alternates:
if current is EA then begin
if current has successor then
put first set of successor
in follow set of startsymbol
else
if current has successor then begin
while current has successor
and is not end of repeat
and is not end of alternate
and does not have alternate do
set current to successor:
if EA and successor exists then
out first set of successor
in follow set
else if ER then
put successor symbol in
follow set
end
end
end.
if startnode has successor then
getfollowsets (successor);
if startnode has alternate then
getfollowsets (successor);
end;

Eigygg 1g; Procedure getfollowsets.

the given nonterminal. by checking to see whether a symbol
being placed in this set is already there. If such are
‘ found. an informative error comment is printed to the effect
that the production for the mastersymbol contains multiple
paths to whatever particular symbol has been encountered.

This procedure is probably the most complex. since it

24

If current symbol has a successor and
and is not the end of a repeat construct then
getlastset(successor)3
if isnullable(successor) and
current is not terminal then
put current in last set
else
if symbol is not null and not terminal then
put it in last set
if current symbol has an alternate then
if its alternate is not the end
of an alternate construct then
getlastset(alternate)
else
if its alternate is not null then
.if it is not a terminal then
put it in last set;

5193;: 11; Procedure getlastset.

utilizes the results of marking nullable IDENTIFIERs as well
as procedure ”getfirstset”.

He need further information in order to be able to
calculate follow sets. however. Consider the following
production:

A 3:: [S B C D S] E F
If F is a nonterminal. then clearly its follow set must
include the follow set of A. since any occurrence of A in
any production can be replaced by the right side of this
one. For this reason. we need the last sets of each
nonterminal; Figure 17 shows the construction of these. In
this graph search. we proceed through the graph to the ends
of its branches. searching for nonterminals at their ends.
If such exist. they are added to the last set of the master

symbol: so is any nonterminal whose successor graph is

25
nullable.

Once the graph searches have been concluded. we have
constructed matrices of zeros and ones containing a row for
each nonterminal. The 'firstsets" matrix contains ones for
each symbol (terminal and nonterminal) in the first set of
each row symbol; the terminals in all first sets may be
computed‘ simply by transitive closure. The 'followsets'
matrix initially contains ones for each terminal or
nonterminal in the follow set of each row symbol; these are
augmented by ORing row 5 into row i if symbol 1 is in the
last set of symbol 1. The complete set of all terminals in
each follow set is obtained not by transitive closure. but
by ORing the first set of row i into the follow set of row i
if j is in the follow set of i and i is nullable.

This process is guaranteed to work regardless of the
input grammar. as long as it is syntactically correct
(Figure 1). Thus when the first and follow sets are printed
out at the conclusion of the program. these together with
any previous messages will enable the user to determine
whether the input grammar is LL(1). and if not. where
changes should be made. If the input grammar contains
syntax errors. the graph search can not be completed. but
each production will be scanned nevertheless. enabling the

user to detect as many syntax errors as possible.

III. RESULTS OF PROCESSING EXAMPLE GRAMMARS.

In this chapter we shall discuss how the implementation
of LLPARSE handles two fairly simple programming languages.
Hirth's PLO ([9]. pp. 308-310). and Backhouse's ”simple
programming language" ([1]. p.93). PLO is an LL(1)
language. while "simple" is not. Figure 18 and Figure 19
show the two grammars cast into the modified BNF form which
LLPARSE uses as its input.

Before we proceed with this discussion. it would be
useful (now that we actually see two input grammars) to
discuss some lexical features of the implementation. An

IDENTIFIER which begins in column 1 is taken to be the left

PROGRAM BLOCK .
BLOCK (3 CONST IDENT = NUMBER
[3 9 IDENT = NUMBER 5] 3 SS 3)
<3 VAR IDENT [S . IDENT SJ 3 SS 5)
[S PROCEDURE IDENT 3 BLOCK 3 $3
STATEMENT
STATEMENT <3 IDENT I: EXPRESSION 33 CALL IDENT
$3 BEGIN STATEMENT ES 3 STATEMENT 3] END
$3 IF CONDITION THEN STATEMENT
$3 HHILE CONDITION DO STATEMENT $3 3)
CONDITION <5 ODD EXPRESSION SS EXPRESSION <3 = 33 <)
33 < $3 > 33 <= 33 )= S) EXPRESSION S)
EXPRESSION (S + $3 $3 - S) TERM [3 (S + $3 - S) TERM $3
TERM FACTOR [S (S * SS 3) FACTOR S]
FACTOR <3 IDENT $3 NUMBER 33 ( EXPRESSION ) S)

Eiggng 1a; The grammar for PLO.

26

27

PROGRAM BLOCK .

BLOCK BEGIN BLOCKTAIL

BLOCKTAIL (S DECLARES BLOCKTAIL SS STMTLIST END S)

DECLARES (S LABEL IDLIST SS INTEGER IDLIST S)

IDLIST IDENTIFIER RIDLIST

RIDLIST (S 3 SS . IDLIST S)

STMTLIST OPTLABEL OPTSTMT RESTOFLIST

OPTLABEL (S SS IDENTIFIER I S)

OPTSTMT (S SS STATEMENT S)

RESTOFLIST (S SS 3 STMTLIST S)

STATEMENT (S ASSIGNMENT SS TRANSFER SS CONDIT SS
HRITE SS BLOCK S)

ASSIGNMENT EXPRESSION =) IDENTIFIER RASGNLST

RASGNLIST (S SS =) IDENTIFIER RASGNLIST S)

TRANSFER GOTO IDENTIFIER

CONDIT IF EXPRESSION THEN STMTLIST OPTELSE FI
OPTELSE ‘(S SS ELSE STMTLIST S)
URITE OUTPUT ( OUTPUTLIST )

OUTPUTLIST EXPRESSION MOREOUTPUT
MOREOUTPUT (S SS . EXPRESSION MOREOUTPUT S)
EXPRESSION EXP1 REXPR

REXPR (S SS RELOP EXPl S)

EXPl EXP2 RESTOFEXPI

RESTOFEXPl (S SS ADDOP EXP2 RESTOFEXPI S)
EXP2 EXP3 RESTOFEXP2

RESTOFEXP2 (S SS MULOP EXP3 RESTOFEXP2 S)
EXP3 (S POSEXP3 SS - POSEXP3 S)

POSEXPS (S INPUT SS IDENTIFIER SS CONSTANT
SS ( EXPRESSION ) S)

RELOP (S < SS ) SS 3 S)
ADDOP (S + SS - S)
MULOP (S * SS S)
£139;- 12; Backhouse's "simple language."

side of a production rule; because of this we can dispense
with the production symbol '::=' entirely. The right side
of a production may be continued onto an arbitrary number of
lines as long as the first nonblank character is not in
colhmn 1. Blanks and end of line are separators.

So much for lexical formalities; now let us proceed to
the grammars. Hhile these languages are both quite simple.

parsing them leads to difficulties which can severely test

28-
the ability of a programmer to produce a parser which
computes all the necessary information correctly. Let us
examine PLO first. STATEMENT is clearly a nullable symbol.
as is obvious from its definition. which contains a null
alternate. It takes a bit of work. however. to deduce that
BLOCK is also nullable: its graph consists of a nullable
alternation followed by another. followed by a repetition.
followed by a statement. The entire graph must be searched
to conclude that BLOCK is nullable. Because of this.
PROGRAM must include in its first set not only the first set
of BLOCK. but also the first set of anything which follows
BLOCK. namely the singleton {.}.

First sets give no particular problems in PLO) one does
not have to go very deep into the productions to encounter
terminals. Follow sets do present some subtle difficulties.
though. From the production for TERM. one can see that 't'
and " are in the follow set of FACTOR. But because a
FACTOR is also a TERM (what follows it in this production is
nullable). the follow set of TERM is also in its follow set.
From the production for EXPRESSION. we thus get '+' and '-'
in both follow sets. But then the next preceding production
tells -us that EXPRESSION derives TERM. so the follow set of
EXPRESSION must be included in that of TERM; from the
production for CONDITION we obtain the symbols { =. <). <.
>. <=. )= ) for the follow sets of both TERM and FACTOR.
Then the production for STATEMENT tells us that THEN and DO

can follow CONDITION. and that EXPRESSION is in the lastset

29

of STATEMENT. which has follow set { 3. END. . }. Thus.
what amounts to a backwards graph search is simulated by
manipulation of lastset matrices. and we find that the
follow sets of FACTOR. TERM. and EXPRESSION are thus as
follows:

1. EXPRESSION: {.. =. 3. END. THEN. DO. <). <. >. <=.

=9 )}.

2. TERM: the union of the above and (+. -).

3. FACTOR: the union of both of the above and {*. }.

Backhouse's “simple programming language" is not much
more complicated than PLO. although it has a considerably
simpler grammar. in that the number of nonterminals is
increased substantially in order to avoid complicated
repetition constructs. and right recursion is used
frequently. LLPARSE detects the following nullable
IDENTIFIERs: (STMTLIST. OPTLABEL. OPTSTMT. RESTOFLIST.
RASGNLIST. OPTELSE. MOREOUTPUT. REXPR. RESTOFEXPI.
RESTOFEXP2}. Of these. only OPTLABEL has intersecting first
and follow sets (both contain IDENTIFIER). This means that
a one-symbol lookahead does not suffice to determine which
of the productions for OPTLABEL. the null one or the
non-null one. should be applied at this point in the parse.
A second lookahead token would be necessary to determine the
appropriate production unambiguously. However. as Backhouse
points out in his treatment of this example. one could write
an LL(1) parser for everything else in this grammar. except

for the procedure 'poptlabel'. which would look ahead one

30

more character. Elsewhere in the grammar. one-character
lookahead suffices. since none of the other symbols
mentioned above has intersecting first and follow sets. and
there are no multiple paths to symbols. If a grammar of
this sort were presented to LLPARSE. the user could easily
make the necessary change in the grammar. or build an ad hoc
fix into his or her parser. In Backhouse's grammar. it
suffices to look ahead one more character. If the second
lookahead character is ':'. then the optional label is
present. otherwise not.

He shall conclude with a short mention of LLPARSE's
attempts at analyzing the full Pascal grammar. The syntax
diagrams in Jensen and Birth ([4]. pp. 116-118) were
translated into modified BNF; I have previously mentioned
the difficulties inherent in doing this. as the charts are
non-structured.

It should be noted that Jensen and Birth seem to make
Pascal an LL(1) language by including some semantics in
their syntax diagrams. During the parsing of executable
code. symbol table information is available on all iden-
tifiers used in it. since Pascal requires that all decla-
rations must precede references to identifiers. Thus the
authors include boxes labeled "type identifier.” "field
identifier.“ "variable identifier.“ "function identifier.“
etc.. in their syntax charts. However. from a purely
syntactic point of view. one can not distinguish among the

various identifiers. Thus a statement. for example. can be-

31
gin with an identifier. which may be a variable name. a
function name. or a procedure name. Even a second lookahead
character may not suffice to isolate the correct production
without symbol table information. since both a function name
and a variable name may be followed by the assignment
operator ':='.

LLPARSE detected two ways in which Pascal fails to be

an LL(1) grammar:

1. the Pascal “field list" has more than one path to
the symbol '3' and to “identifier“ (a construct
beginning with an ”identifier“ can be followed by a
case construct or not. so a parser does not know

which of two productions to choose).

2. As remarked earlier. “statement" has more than one
path to “identifier".

Both of these problems can be cured in a practical compiler;
the "field list" difficulty can be handled by a local fix
(an additional lookahead) in the procedures which parse the
construct. while the "statement" difficulty can be removed
by using semantic information already available from the

symbol table.

IV. CONCLUSIONS AND FUTURE DIRECTIONS.

One conclusion which can be quickly drawn from this
work is that it is onerous at best to translate unstructured
syntax diagrams (Pascal "field list" being a classical
example) into the structured input format necessary for this
algorithm. Syntax charts are helpful to the human being
attempting to analyze a grammar to determine what is
syntactically permitted. and probably are easier to
understand than a modified BNF grammar. But the latter is a
far more useful form for the computer input.

In conclusion. we briefly outline some facilities which
could be added to the present implementation. and some
directions for future research.

First. it would seem advantageous to kill two birds
with one stone by combining the LL detection apparatus of
LLPARSE with a simple precedence detector. All of the
information necessary to determine whether the grammar is
simple precedence or not is available in the graphs and in
the boolean matrices constructed by the graph searching
algorithm. It would be useful and challenging to devise a
way to use this information. (The reader is reminded that
LL(1) grammars and simple precedence grammars do not have a

superset-subset relationship).

32

33

It may be possible to improve the efficiency or
elegance of the algorithm by doing the calculations
exclusively with matrices. as is frequently done for simple
precedence grammars. Possibly these matrices could be
constructed during the first parse of the grammatical
productions. It is not clear to the author whether it is
possible to do this: as has been remarked above. LLPARSE
conducts several graph searches. in addition to performing
matrix closure operations. in finding its information.

Another addition that would simplify the form of
certain input grammars would be an additional construct
which means "repeat one or more times.“ currently this must
be done in the form A [S A S]. where A could itself be quite
complicated in form.

Finally. one could add a parser constructor to the
basic program if desired. This could then become a part of
a larger. and very general. compiler constructor. which
could work directly from the input grammar to a compiler.
This would certainly be a large project for some future
computer scientist to undertake: the author could see no
profit or advantage to be gained by attempting it at this

time.

BIBLIOGRAPHY

[1]

[2]

C3]

[4]

[5]

[6]

[71

[8]

[9]

BIBLIOGRAPHY

Backhouse. R- Co Sluts; 21 Ecsscammins Languages;
1339;; gng Pragtigg; Prentice-Hall. Englewood Cliffs.
1979.

Bell. J. A new method for determining linear precedence
functions for precedence grammars. Comm; ACM 12. 10
(Oct.. 1969).567-569.

Floyd. R. Syntactic analysis and Operator precedence.
g; ACM 10. 3 (July. 1863). 316-333.

Jensen. K.. and Birth. N. Eggggl Use; ﬁgnugl 39g
ﬁgggrt. Second edition. Springer-Verlag. New York.

1974.

Lalonde. H. An efficient LALR parser generator.
Technical Report CSRG-Z. Computer System Research
Group. University of Toronto. No date.

Martin. D. Boolean matrix methods for the detection of
simple precedence grammars. ggmm; ACM 11. 10 (Oct..
1968). 685-687.

Martin. O. A boolean matrix method for the computation
of simple precedence functions. 99mm; ACM 15. 6
(June. 1971). 448-454.

Pyster. A. ZUSE User's Manual. Report Number
TRCSBI-O4. Department of Computer Science. University
of California. Santa Barbara. January. 1981: revised
March. 1981.

Birth. N. AIsscithms 1 ﬂats
lif

Prentice-Hall. Englewood C

Sinsstysss s assesses;
TS. 1976

[10] Birth. N.. and Weber. H. EULER - an extension of ALGOL

and its formal definition. ngm; ACM 7. 1 (Jan.
1966). 13-25 and 89-99.

34

IGRN STRTE UNIV

IIIBIIIIIII IIIIIII IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII