.‘flfl- THESIS Date " m “A «A «A? IllllflilllllllllllllllllllIllllllullilllllLllllllml Eva-Aw “Vi“ 3 1293 104945 This is to certify that the thesis entitled A LL GRAMMAR ANALYZER presented by ZANE C. MOTTELER has been accepted towards fulfillment of the requirements for MS Computer Science degree in fij/m 5. Major professor 21 August 1981 0-7639 MSU LIBRARIES RETURNING MATERIALS: Place in book drap.to remove this checkout from your record. FINES will be charged if book is returned after the date stamped below. A LL GRAMMAR ANALYZER By Zane C. Hotteler A THESIS Submitted to Michigan State University in partiat futfiltment of the requirements for the degree of MASTER or SCIENCE Department of Computer Science 1981 ABSTRACT A LL GRAMMAR ANALYZER By Zane C. Motteler A grammar analyzer is a computer program which accepts a grammar as input and decides whether or not the grammar is of a certain type. The LL analyzer described in this thesis makes it possible to design LL(1) grammars and recursive descent parsers. This analyzer accepts a modified Backus-Naur grammar as its input language: syntax diagrams may easily be translated to this grammar. Because the modified Backus-Naur grammar is itself LL(1)¢ it is parsed by recursive descent. If the grammar is correct. it is then translated into a graph, which is searched in various ways in order to determine whether it is LLtl) or not. ACKNOHLEDGEMENTS The author wishes to thank the personnel of the Computer Science Department. especially its chairman. Professor Harry G. Hedges. and my academic advisor. Professor Carl Page. for their support and hospitality during a very fruitful and beneficial sabbatical year. More particularly I acknowledge with gratitude the help and inspiration of my thesis advisor. colleague. and sometime mentor. Professor Bernhard L. Ueinberg. without whom this project would never have come to fruition. His help even included drafting the figures‘ for this text and spending hours entering programs and data into interactive terminals for me. Greater love hath no man. 1 ii TABLE OF CONTENTS LIST OF FIGURES 0.0...0.0.0....OOOOOOOOOOOOOOOOO000...... TV INTRODUCTION OO0..0.0.0.0....OOOOOOOOOOOOOOOOOOOCO00...... 1 I. THE GRAMMAR AND ITS GRAPH OOOOOCOOOOOOOCCCOOCOC00...... 5 II. GRAPH SEARCH TECHNIQUES OOOOOOOCCOOCCOOCOCCOOO0...... 18 III. RESULTS OF PROCESSING EXAMPLE GRAMMARS ............. 26 IV. CONCLUSIONS AND FUTURE DIRECTIONS ................... 32 BIBLIOGRAPHY OOOOOOOOOOOOOOOOOOOOOOCCOOOOOOOOOOCOOOCOOOOO 34 iii Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure 1a. 1b. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12a. 12b. 12c. 12d. 13. 14. 15. 16. 17. 18. 19. LIST OF FIGURES The modified BNF syntax charts ............... The modified BNF grammar ..................... Pgrammar ...................................... Pproduction ................................... Pitem ......................................... Prightalternate .............................. Palternation ................................. Palternate ................................... A node in the production graph ............... Graph of a list of symbols ................... Graph of a list of alternate symbols ........ Graph of a repeat construct ................. Graphs of GRAMMAR and PRODUCTION ........... Graph of ITEM .............................. Graph of ALTERNATION ....................... Graph of ALTERNATE ......................... Mark nullable nonterminals .................. Function isnullable ......................... Procedure getfirstset ....................... Procedure getfollowsets ..................... Procedure getlastset ........................ The grammar for PL/O ........................ Backhouse's “simple language” ............... iv 10 10 11 12 12 13 14 15 16 16 17 20 21 22 23 24 26 27 INTRODUCTION One of the key steps in the process of compiler design and construction is the definition of the language to be used. and the expression of its syntax in a grammar. The designer of a language may wish to ensure that a grammar for the language has certain special prOperties. since the language chosen determines the parsing technique or techniques. Some frequently used grammars for relatively simple parsing schema are the following: 1. precedence (or the closely related Operator precedence.) In these grammars. symbols (or more generally. pairs of strings of length m and n) adjacent to one another in a sentential form have unique precedence relations. and a precedence matrix can be calculated which directs the parsing. No lookahead is required for precedence parsing. which is a bottom-up parsing scheme. 2. Lle). Such grammars can be successfully parsed by examining the current viable prefix together with a k-symbol lookahead. by a method called recursive descent. which will be described later in this thesis. This parsing method works from the top down. 3. Lle). These grammars include~ both of the preceding types. so are more general. Like LLtk). they can be successfully parsed by a k-symbol lookahead. and like simple precedence. the parsing is bottom-up. A gggmmg; analyze; is a program which accepts the productions of a grammar as its input and produces output 2 which enables the user to infer whether the grammar is of a particular type; if so. he can proceed to design a parser apprOpriate to the grammar: and. if not. the analyzer should supply information on why the grammar fails to be of the desired type. so that the user can repair to his workshop and alter it in an attempt to put it into the requisite form. From this standpoint. simple precedence and Operator precedence grammars have been extensively studied (2.3.6.7.10]. Techniques for the analysis of Lle) grammars for any k are well known (see. for example. Backhouse [11. pp. 82-116). although seldom used. since most realistic LL grammars are LL(1) except for a very few productions. which can be handled in an ad hoc manner. This thesis and the associated computer project represent the deveLOpment of an LL grammar analyzer. which in particular will detect whether a given grammar is LLCl). and if not. precisely where and in which productions it goes wrong. (See pages 8 and 19 for definitions of LLll).) A distinctive feature of this LL analyzer is that it accepts an input grammar written in a modified Backus-Naur form. which permits alternation and repetition constructs. and therefore closely parallels syntax charts. It will parse any language written correctly in such a grammar. calculate first and follow sets of nonterminal symbols. and trace productions to discover multiple paths to the same symbol. Any place where the LL(1) conditions fail to hold will be noted on the output. This input language is unique 3 among. grammar analyzers. Furthermore. the author has been unable to find references to similar algorithms for LL analyzers in the literature. even though the techniques behind its construction are well known (see. for example. Uirth [9]. pp. 280-349. and Backhouse E1]. pp.92-156). Pyster [8] describes a program called ”ZUSE" which accepts a grammar as input. verifies whether or not it is LLll). and produces a parser for the language even if it fails to be LL(1). by enforcing priorities on which productions to try first in the event of ambiguities. The techniques for designing a program to produce such a parser are described in Birth [9] cited previously. ZUSE has few similarites to the present effort. however. The input which ZUSE accepts is more restrictive and less general (lacking repetition and alternation constructs). Since ZUSE also produces a parser. it demands more of the user in the way of input. LLPARSE. the algorithm whose implementation is described in this thesis. does only the grammar analysis. Hhile a parser construction could have been included. it was not felt that this was an essential part of the project at this time. Lalonde [5] describes an algorithm which does much the same thing as ZUSE for LALR languages: it accepts a grammar in a standard version of BNF as input. tests the grammar for inclusion in the apprOpriate language class. and generates a parser for it if possible. The input language is much simpler than that of ZUSE. but the algorithm otherwise has 4 similar goals. and it differs from the algorithm described in this thesis essentially in the same way as ZUSE. LLPARSE reads the productions Of a grammar in modified Backus-Naur form as its input; there are very few re/stric/tions on the form Of this input. As the grammar is read in. it is parsed by recursive descent.‘while syntax errors (if any) are detected in the process. At the same time. LLPARSE converts the productions into graphs. Subse- quently these graphs are searched to determine nullable symbols. last sets. first sets. and follow sets. the last three Of which are stored as boolean matrices. Closure Operations on these matrices determine complete first and follow sets. which are the output Of LLPARSE. If anywhere in the grammar there are two or more paths to the same sym- bol. or if the first and follow sets Of any nullable symbols overlap. then the grammar fails to be LLtl). and LLPARSE pinpoints the locationls) of the problem(s). Thus the user can either attempt to repair the grammar. or. if there are only a small number Of problems. write ad hoc fixes for them into a parser for the grammar. Portions of the algorithms described in this thesis are illustrated in a Pascal-like pseudo-language; the actual implementation is in machine independent Pascal. Various constants and restrictions in the program are controlled by Pascal constants. which can be changed by the user to fit arbitrary predilections or machine hardware constraints. I. THE GRAMMAR AND ITS GRAPH The building blocks Of the modified BNF grammar used by LLPARSE are IDENTIFIERs and SYMBOLs. An IDENTIFIER begins with an alphabetic character and the remainder Of its characters are alphanumeric. Semantically an IDENTIFIER may be either a terminal or a nonterminal. depending on whether or not it appears on the left side Of a production. but this distinction can not be made until the entire grammar has been scanned. A SYMBOL begins with any character other than an alphabetic one and consists Of non-alphabetic characters. A SYMBOL will always be a terminal in the language of the grammar. Other lexical details about the input language will be described later. with the description Of the implementation. A PRODUCTION consists Of the left side IDENTIFIER (or 'mastersymbol') followed by the production symbol '::=' and a list Of zero or more ITEMs. An ITEM may be either an IDENTIFIER. a SYMBOL. a repetition construct. or an alter- nation construct. A repetition construct is a list of ITEMs enclosed between '[S' and "$3”. and semantically means to repeat the list zero or more times. An alternation construct is enclosed between the metasymbols "<5“ and "3)“. GRAMMAR T—‘l PRODUCTION I l —> PRODUCTION —-rf IDENTIFIER ‘ _ I, w— IDENTIFIER ITEM f 4 IDENTIFIER 1| ITEM WALTERNATION l SYMHn. r"' ALTERNATION —.l ALTERNATE : 14®———cl ALTERNATE ALTERNATE ITEM Eigggg 1a; The modified BNF syntax charts. GRAMMAR :3: PRODUCTION CS PRODUCTION S] PRODUCTION 3:: IDENTIFIER '::=' [3 ITEM $3 ITEM :2: (S IDENTIFIER $3 '[3' ITEM [S ITEM 8] '53' SS '(3' ALTERNATION 'S)' SS SYMBOL S) ALTERNATION 2:: ALTERNATE 'SS' ALTERNATE ES '33' ALTERNATE S] ALTERNATE 3:: [3 ITEM 3] Figgre 1b; The modified BNF grammar. and consists Of two or more fields (one Of which might be null) separated by the metasymbol "SS". Not only is the null production specifically allowed. but also. in addition. repetition constructs are nullable. and alternate constructs may be. so that nullable symbols can exist in the grammar. Although the form Of the grammar makes more than one production for each mastersymbol unnecessary. it is allowed. and any subsequent production will be treated correctly as a new alternate. Figure 1a consists Of syntax charts for the modified BNF grammar. and Figure 1b shows what we shall henceforth refer to as the 'metagrammar': the modified BNF grammar expressed in its own grammar. Note the close correspondence between the syntax charts and the metagrammar. The metasymbols of modified BNF have tO be enclosed within quotes so that they will be taken as terminals Of the metagrammar. He shall use this grammar during the remainder Of this section to illustrate how LLPARSE analyzes a grammar and converts it into a sequence of graphs that can be 8 searched for further information about the grammar. First. note that the metagrammar is LL(1). He shall define more precisely what this means later on. but for now. we shall merely state informally that there is no more than one path from any grammar item directly to any other: and the only nullable nonterminal. ALTERNATE. has disjoint first set {SYMBOL.'E$'.'<$'. SYMBOL} and follow set ('$>'.'$$'}. Thus any modified BNF grammar can be parsed by recursive descent. Backhouse [1] has an excellent description of this parsing technique. which is used by LLPARSE. Each component Of a production is examined by a procedure whose name consists Of the letter 'p' followed by the name of the component. such as 'pproduction'. 'pitem'. etc.. with 'pleftalternate" for I'<$". "par“ for “SS". etc. There is a procedure called “advanceinputpointer“ which Obtains the next symbol: at the end Of a production or Of the grammar it returns "null"; at the end of the grammar “inputpointer" will be set tO zero. 'Advanceinputpointer' sets a variable called 'nextsymbol" to one Of the values I'qidentifier". 'qsymbol". 'qleftalternate'. 'qrightalter- nate'. 'qleftrepeat". 'qrightrepeat'. 'qor". or 'qnull'. Advanceinputpointer; repeat pproductioni if inputpointer <> 0 then advanceinputpointer until inputpointer = 0; 5133;; g; Pgrammar. If nextsymbol = qidentifier then (* firstset Of pproduction is {qsymbol} *) begin advanceinputpointer: pprodsymbolz while nextsymbol <> qnull do pitem (. end of production if nextsymbol = qnull *) else end goto 1 (. error *)3 _jgggg 3; Pproduction. depending on the next token in the input. The main Operation of the parser Of GRAMMAR then lOOks as in Figure 2. (Note that we are only describing the parser here. and not the semantics Of graph construction which accompany it. These will be described later.) Figure 3 shows the core of If not (nextsymbol in Eqidentifier. qleftrepeat. qleftalternate. qsymbOlJ) (* first set of item *) then goto 1 (. error *) else begin case nextsymbol Of qidentifier. qsymbol: advanceinputpointer 3 qleftrepeat: begin advanceinputpointer: repeat pitem until nextsymbol = qrightrepeati (* in follow set of item *) prightrepeat end; qleftalternate: begin advanceinputpointer; palternation: . prightalternate (* in follow set Of alternation *) end end (* case *) fijgggg 3; Pitem. 10 If nextsymbol = qrightalternate then advanceinputpointer else goto 1 (* error *) Eigg£_ fi; Prightalternate. "pproduction". which parses an IDENTIFIER. the 'prodsymbol' ('::='). and then one or more ITEMs until the production ends ('advanceinputpointer' returns the null symbol). The procedure "pitem". shown in Figure 4. is the most complicated. since ITEMs can take several forms. NO action (except semantics) is necessary when an IDENTIFIER or SYMBOL is recognized. but the beginning Of an alternation or a repeat construct requires additional action. Note how parsing a repeat construct causes a recursive call Of I'pitem". 'Prightalternate' (Figure 5) is representative Of the procedures which test individual tokens: if the expected token is not next. an error routine is invoked which skips the rest of the current production and begins parsing again at the next. 'Prightrepeat' is similar in form. Finally. Figure 6 and Figure 7 illustrate procedures 'palternation' and ”palternate". which parse the interior Of Palternate; if nextsymbol = qor then repeat pori palternate until nextsymbol <> qor; 5193;; g: Palternation. 11 If not nextsymbol in (qor. qrightalternate] then repeat pitem until nextsymbol in [qor.qrightalternate13 E1993: 1; Palternate. an alternation construct. Since the various grammatical constructs can be nested inside one another. recursive calls Of 'pitem' are possible in "palternate". and ”pitem" may again call 'palternation'. etc.. tO any depth desired. SO far we have considered only the parsing function Of LLPARSE. Naturally the actual program also includes semantic routines. These routines construct graphical representations Of the productions as they scan them. These graphs are quite similar to the forms suggested by Uirth ([9]. pp. 295-299). but contain some refinements and additions not suggested in his work which were made necessary by the extreme complications which can arise in some large programming languages (the reader is invited. for instance. to reduce a Pascal fieldlist to the graphical form described below: the difficulty is caused by the non-structured form. See Jensen and Uirth E4]. p. 116 for a syntax diagram.) Figure 8 shows the construction Of a typical node in the graph. Each node contains a reference to an IDENTIFIER. which is null if there is nO symbol (recall that repeat constructs and some alternation constructs may be nullable. so there has to be a way to 12 Sflflfll APIEAIER Eigggg a; A node in the production graph. NP represent this). There are two pointers in each node. NP to the next node in sequence (if any). and AP to an alternate of the current node (if any). In addition. there are two flags: the EA flag signals whether or not the current node is the last ITEM in an alternation. and the ER flag signals the end Of a repetition construct. The latter flag is particularly important in that the successor pointer“of the last nOde in a repetition points back to an earlier node in the same graph; the ER flag. if on. thus prevents endless cycling through repeats when the graph is being searched. Let us now indicate how a graph is put together. In each of Figures 9. 10. and 11. a blank node signifies one which can have any allowed value (T or F in EA and ER. a pointer or NIL in AP and NP). Filled in nodes will always SYMHM;1 I SYMHN.2 SYMHM.n llFl—A-i I HAM” IFIA E1933; 2; Graph Of a list Of symbols. SYMBOL n MI 5193;; IQ; Graph Of a list Of alternate symbols. contain the data indicated. A list Of IDENTIFIERs. which may include SYMBOLS. is simply a linked list Of nodes and successors. as illustrated in Figure 9. Figure 10 shows the construction Of a list Of alternate nodes. Finally. in Figure 11. we see a repeat construct. Note that the last node points back to the first. and is flagged as the end Of a repeat. The first node has a null node as its alternate. and this null node is flagged as the end Of an alternate. If a repeat construct is followed by other items. this null node will have them as its successors. while if a repeat construct is also the beginning Of an ITEM in an alternation. the alternate 14 SYMBOL 1 SYMBOL 2 SYMBOL n all H"—‘ H"!’“""[ l H“ ‘ NULL Figggg 1. Graph Of a repeat construct. pointer Of the null node will point to the next alternate in sequence. In order to fix these ideas. we illustrate the graph Of the metagrammar in the components Of Figure 12. Since this is a relatively simple grammar. the graphs are unremarkable. Hith the exception Of EA and ER flags. they are similar to graphs in Hirth [9]. These flags turn out to be absolutely crucial in the graph searching process. and represent one Of the new techniques in LLPARSE. Physically. it is quite simple for a person to trace such graphs in order to detect nullable symbols. first sets. and follow sets. Thus it seems that writing a program tO do a graph search for this purpose should be, relatively straightforward. Real computer languages. such as Pascal. have a much more complicated graph structure. however. so 15 GRAMMAR—I I PRODUCTICEI PRODUCTION [ . [.[FIFIq._4. F FI...I_.[JIIFITI— INULLI I. TIP .4 d FRODUCTIONI I IDENTIFIER I.IFIFI:I__4LF F .I-l I '::=' TI ITEM ,._.I.IFIFI._I._.IIIFIT L . LNULI- J [.ITIFI.I E1995; 123; Graphs of GRAMMAR and PRODUCTION. the task is far less trivial than it may seem. He describe this process in some detail in Section II Of this thesis. 16 ITEM IDENTIFIER F Fwy—u FIFI. '[3' ITEM ITEM I FlFl”—'[FIFI" 11F T ‘- NULL I J,Tl F '-__’ r<$' ALTERNATION «$>« RIF IF .__,.. IFI F r-r*'- IF IF ALTERNATION ALTERNATE '$$' ; IFI Fl .—-—.-.I FI Fl ‘ . FITF .__] i ALTERNATE '$$' ALTERNATE FIT? I FI FIE .I FI T ,3 l NULL .MF £19352 125; Graph of ALTERNATION. 17 TALTERNATEI r ITEM | D F F FIT _ NULL 1 TIFI.I E13333 121; Graph Of ALTERNATE. 7-1-1: II. GRAPH SEARCH TECHNIQUES Once the graphs of the productions have been constructed as described in the previous section. LLPARSE first goes serially through the symbol table and marks as terminals alt those IDENTIFIERs which did not occur on the left side of any production. (Those which did. contain in their symbol table entry a non-null pointer to a produc- tion.) It is important for the graph searches which follow to be able to distinguish between nonterminals and terminals in the subject grammar. Four separate graph searches then occur in order to obtain the following sets of symbols: 1. nullable symbols. These consist of all nontermi- nals for which some path through the graph defining them consists of all null or nullable symbols. 2. First sets for each nonterminal symbol. The first set contains all terminals which can begin a sen- tential form derivable from the nonterminal. 3. Follow sets for each nonterminal symbol. These contain all terminals which can follow any senten- tial form derivable from the nonterminal in a sen- tence in the grammar. 4. Last sets for each nonterminal symbol. These con- sist of all nonterminal symbols which can occur in the last position in any sentential form derivable from the symbol. There are two steps in the graph search to construct these sets: the first calculates the various sets from 18 19 individual productions and includes nonterminals in them. The second combines this information into boolean matrices. and by the use of transitive closure operations. determines the final constitution of these sets. Note that the first step can not be done recursively. for recursion in productions is always possible. meaning that a recursive graph search might never terminate. Instead. we must limit recursion in graph searching to single productions. never allowing recursion to proceed from inside the graph of one production to the start of the same. or another. production. Before we begin discussing the graph searching process. it would be well to review the definition of an LL(1) grammar and the significance of the above sets in checking the definition. Basically. a grammar is said to be strong LLLL; if at any point during the parse of an input string it is possible to determine uniquely the production of the grammar which led to the current sentential form. given only the input up to and including the current input token and the one immediately following it. A grammar containing the following production would not be LL(1). then. because there are two paths to the symbol C from 8: given current symbol 8 and next symbol C one could not determine uniquely which alternate to use: A 3:: B (S C D SS C E S) Thus any graph search must trace out dual paths to the same symbol. A second grammar fault which causes a grammar to be 20 non-Ll(1) occurs when the first and follow sets of a nullable symbol have elements in common. Upon seeing one of these elements of the intersection. the parser would not be able to determine whether the null production or some non-null production of the mastersymbol is the correct one. The figures which accompany the text in this chapter are written in 'pseudocode' for ease of understanding. Figure 13 illustrates the procedure which marks all nullable nonterminals in the symbol table. It cycles repeatedly through all productions and marks nullable symbols until no more can be marked. Its main subprocedure is the boolean function “isnullable“ (Figure 14). which searches the graph of a production trying to find a path which contains all null or nullable symbols. If every path encounters either a terminal or a nonterminal which is not known to be null. then it returns "false". If the mastersymbol is defined by an alternation structure which contains a null or nullable alternate. then the graph search must proceed to the end of the alternate structure. and then the successor of the end node (if any) by a recursive call to "isnullable“. Since Uhile there is a change do begin set production pointer to first; while production pointer is not nil do begin if isnullable (production) then mark mastersymbol nullable; advance production pointer end end Eigggg 1;; Mark nullable nonterminals. 21 Isnullable := false; if current symbol is null or nullable or has an alternate then begin repeat searchsymbol := current symbol; if searchsymbol is null or nullable and has no successor then isnullable := true else if searchsymbol has successor then if it is end of alternate and not end of repeat then isnullable := isnullable(successor) else isnullable := isnullable or isnullable(successor) else if it is end of alternate and not end of a repeat then . isnullable := isnullable and isnullable(successor); set current to alternate until no more alternates 5.9253 $3; Function isnullable. the grammar is finite. eventually every nullable symbol will be marked. This process is completed before the search for first sets and follow sets. since information on nullability is important in determining the components of these sets. Figure 15 shows the search technique for first sets. This procedure is called once for each production with a pointer to the first IDENTIFIER or SYMBOL on the right side. Any non-null symbol is placed in the first set: if it is a nonterminal. transitive closure will later be used to generate terminals in the first set coming from these nonterminals. If a nullable symbol is placed in the first set. so are its successor(s). if any. The alternates of any 22 If startsymbol is not null then put it in first set; if it is null or nullable then if it has a successor then if it is not the end of a repeat construct then getfirstset(successor) else put successor in first set else if it is part of an alternate structure then go to the end of the alternate; if that node has a successor then getfirstset (successor); if startsymbol has an alternate then getfirstset (alternate); Eigggg 1;; Procedure getfirstset. symbols in the first set also go there. as do the successors of any alternate structures containing a null alternate. This part of the graph search is quite complicated and uses both the end alternate (EA) and end repeat (ER) flags. The search is made more difficult by the fact that it is possible for EA. ER. NP. and AP to occur in any of sixteen possible combinations of values. Fortunately there are not that many separate actions to take. In Figure 16. we see the follow set procedure. whereas the first set procedure never scans a graph beyond a terminal symbol. the follow set procedure searches the entire graph of a production for nonterminals (if any) and adds their successors (if any) to their follow sets. It is during this complete graph traversal that LLPARSE also checks for multiple paths to any one symbol: this is done during the construction of the first set of a successor to 23 if startsymbol is not null and not terminal then begin if it has a successor then put first set of successor in follow set; set current to startsymbol: if current has alternate and has no. or nullable. successor then repeat set current to its alternate until EA or no more alternates: if current is EA then begin if current has successor then put first set of successor in follow set of startsymbol else if current has successor then begin while current has successor and is not end of repeat and is not end of alternate and does not have alternate do set current to successor: if EA and successor exists then out first set of successor in follow set else if ER then put successor symbol in follow set end end end. if startnode has successor then getfollowsets (successor); if startnode has alternate then getfollowsets (successor); end; Eigygg 1g; Procedure getfollowsets. the given nonterminal. by checking to see whether a symbol being placed in this set is already there. If such are ‘ found. an informative error comment is printed to the effect that the production for the mastersymbol contains multiple paths to whatever particular symbol has been encountered. This procedure is probably the most complex. since it 24 If current symbol has a successor and and is not the end of a repeat construct then getlastset(successor)3 if isnullable(successor) and current is not terminal then put current in last set else if symbol is not null and not terminal then put it in last set if current symbol has an alternate then if its alternate is not the end of an alternate construct then getlastset(alternate) else if its alternate is not null then .if it is not a terminal then put it in last set; 5193;: 11; Procedure getlastset. utilizes the results of marking nullable IDENTIFIERs as well as procedure ”getfirstset”. He need further information in order to be able to calculate follow sets. however. Consider the following production: A 3:: [S B C D S] E F If F is a nonterminal. then clearly its follow set must include the follow set of A. since any occurrence of A in any production can be replaced by the right side of this one. For this reason. we need the last sets of each nonterminal; Figure 17 shows the construction of these. In this graph search. we proceed through the graph to the ends of its branches. searching for nonterminals at their ends. If such exist. they are added to the last set of the master symbol: so is any nonterminal whose successor graph is 25 nullable. Once the graph searches have been concluded. we have constructed matrices of zeros and ones containing a row for each nonterminal. The 'firstsets" matrix contains ones for each symbol (terminal and nonterminal) in the first set of each row symbol; the terminals in all first sets may be computed‘ simply by transitive closure. The 'followsets' matrix initially contains ones for each terminal or nonterminal in the follow set of each row symbol; these are augmented by ORing row 5 into row i if symbol 1 is in the last set of symbol 1. The complete set of all terminals in each follow set is obtained not by transitive closure. but by ORing the first set of row i into the follow set of row i if j is in the follow set of i and i is nullable. This process is guaranteed to work regardless of the input grammar. as long as it is syntactically correct (Figure 1). Thus when the first and follow sets are printed out at the conclusion of the program. these together with any previous messages will enable the user to determine whether the input grammar is LL(1). and if not. where changes should be made. If the input grammar contains syntax errors. the graph search can not be completed. but each production will be scanned nevertheless. enabling the user to detect as many syntax errors as possible. III. RESULTS OF PROCESSING EXAMPLE GRAMMARS. In this chapter we shall discuss how the implementation of LLPARSE handles two fairly simple programming languages. Hirth's PLO ([9]. pp. 308-310). and Backhouse's ”simple programming language" ([1]. p.93). PLO is an LL(1) language. while "simple" is not. Figure 18 and Figure 19 show the two grammars cast into the modified BNF form which LLPARSE uses as its input. Before we proceed with this discussion. it would be useful (now that we actually see two input grammars) to discuss some lexical features of the implementation. An IDENTIFIER which begins in column 1 is taken to be the left PROGRAM BLOCK . BLOCK (3 CONST IDENT = NUMBER [3 9 IDENT = NUMBER 5] 3 SS 3) <3 VAR IDENT [S . IDENT SJ 3 SS 5) [S PROCEDURE IDENT 3 BLOCK 3 $3 STATEMENT STATEMENT <3 IDENT I: EXPRESSION 33 CALL IDENT $3 BEGIN STATEMENT ES 3 STATEMENT 3] END $3 IF CONDITION THEN STATEMENT $3 HHILE CONDITION DO STATEMENT $3 3) CONDITION <5 ODD EXPRESSION SS EXPRESSION <3 = 33 <) 33 < $3 > 33 <= 33 )= S) EXPRESSION S) EXPRESSION (S + $3 $3 - S) TERM [3 (S + $3 - S) TERM $3 TERM FACTOR [S (S * SS 3) FACTOR S] FACTOR <3 IDENT $3 NUMBER 33 ( EXPRESSION ) S) Eiggng 1a; The grammar for PLO. 26 27 PROGRAM BLOCK . BLOCK BEGIN BLOCKTAIL BLOCKTAIL (S DECLARES BLOCKTAIL SS STMTLIST END S) DECLARES (S LABEL IDLIST SS INTEGER IDLIST S) IDLIST IDENTIFIER RIDLIST RIDLIST (S 3 SS . IDLIST S) STMTLIST OPTLABEL OPTSTMT RESTOFLIST OPTLABEL (S SS IDENTIFIER I S) OPTSTMT (S SS STATEMENT S) RESTOFLIST (S SS 3 STMTLIST S) STATEMENT (S ASSIGNMENT SS TRANSFER SS CONDIT SS HRITE SS BLOCK S) ASSIGNMENT EXPRESSION =) IDENTIFIER RASGNLST RASGNLIST (S SS =) IDENTIFIER RASGNLIST S) TRANSFER GOTO IDENTIFIER CONDIT IF EXPRESSION THEN STMTLIST OPTELSE FI OPTELSE ‘(S SS ELSE STMTLIST S) URITE OUTPUT ( OUTPUTLIST ) OUTPUTLIST EXPRESSION MOREOUTPUT MOREOUTPUT (S SS . EXPRESSION MOREOUTPUT S) EXPRESSION EXP1 REXPR REXPR (S SS RELOP EXPl S) EXPl EXP2 RESTOFEXPI RESTOFEXPl (S SS ADDOP EXP2 RESTOFEXPI S) EXP2 EXP3 RESTOFEXP2 RESTOFEXP2 (S SS MULOP EXP3 RESTOFEXP2 S) EXP3 (S POSEXP3 SS - POSEXP3 S) POSEXPS (S INPUT SS IDENTIFIER SS CONSTANT SS ( EXPRESSION ) S) RELOP (S < SS ) SS 3 S) ADDOP (S + SS - S) MULOP (S * SS S) £139;- 12; Backhouse's "simple language." side of a production rule; because of this we can dispense with the production symbol '::=' entirely. The right side of a production may be continued onto an arbitrary number of lines as long as the first nonblank character is not in colhmn 1. Blanks and end of line are separators. So much for lexical formalities; now let us proceed to the grammars. Hhile these languages are both quite simple. parsing them leads to difficulties which can severely test 28- the ability of a programmer to produce a parser which computes all the necessary information correctly. Let us examine PLO first. STATEMENT is clearly a nullable symbol. as is obvious from its definition. which contains a null alternate. It takes a bit of work. however. to deduce that BLOCK is also nullable: its graph consists of a nullable alternation followed by another. followed by a repetition. followed by a statement. The entire graph must be searched to conclude that BLOCK is nullable. Because of this. PROGRAM must include in its first set not only the first set of BLOCK. but also the first set of anything which follows BLOCK. namely the singleton {.}. First sets give no particular problems in PLO) one does not have to go very deep into the productions to encounter terminals. Follow sets do present some subtle difficulties. though. From the production for TERM. one can see that 't' and " are in the follow set of FACTOR. But because a FACTOR is also a TERM (what follows it in this production is nullable). the follow set of TERM is also in its follow set. From the production for EXPRESSION. we thus get '+' and '-' in both follow sets. But then the next preceding production tells -us that EXPRESSION derives TERM. so the follow set of EXPRESSION must be included in that of TERM; from the production for CONDITION we obtain the symbols { =. <). <. >. <=. )= ) for the follow sets of both TERM and FACTOR. Then the production for STATEMENT tells us that THEN and DO can follow CONDITION. and that EXPRESSION is in the lastset 29 of STATEMENT. which has follow set { 3. END. . }. Thus. what amounts to a backwards graph search is simulated by manipulation of lastset matrices. and we find that the follow sets of FACTOR. TERM. and EXPRESSION are thus as follows: 1. EXPRESSION: {.. =. 3. END. THEN. DO. <). <. >. <=. =9 )}. 2. TERM: the union of the above and (+. -). 3. FACTOR: the union of both of the above and {*. }. Backhouse's “simple programming language" is not much more complicated than PLO. although it has a considerably simpler grammar. in that the number of nonterminals is increased substantially in order to avoid complicated repetition constructs. and right recursion is used frequently. LLPARSE detects the following nullable IDENTIFIERs: (STMTLIST. OPTLABEL. OPTSTMT. RESTOFLIST. RASGNLIST. OPTELSE. MOREOUTPUT. REXPR. RESTOFEXPI. RESTOFEXP2}. Of these. only OPTLABEL has intersecting first and follow sets (both contain IDENTIFIER). This means that a one-symbol lookahead does not suffice to determine which of the productions for OPTLABEL. the null one or the non-null one. should be applied at this point in the parse. A second lookahead token would be necessary to determine the appropriate production unambiguously. However. as Backhouse points out in his treatment of this example. one could write an LL(1) parser for everything else in this grammar. except for the procedure 'poptlabel'. which would look ahead one 30 more character. Elsewhere in the grammar. one-character lookahead suffices. since none of the other symbols mentioned above has intersecting first and follow sets. and there are no multiple paths to symbols. If a grammar of this sort were presented to LLPARSE. the user could easily make the necessary change in the grammar. or build an ad hoc fix into his or her parser. In Backhouse's grammar. it suffices to look ahead one more character. If the second lookahead character is ':'. then the optional label is present. otherwise not. He shall conclude with a short mention of LLPARSE's attempts at analyzing the full Pascal grammar. The syntax diagrams in Jensen and Birth ([4]. pp. 116-118) were translated into modified BNF; I have previously mentioned the difficulties inherent in doing this. as the charts are non-structured. It should be noted that Jensen and Birth seem to make Pascal an LL(1) language by including some semantics in their syntax diagrams. During the parsing of executable code. symbol table information is available on all iden- tifiers used in it. since Pascal requires that all decla- rations must precede references to identifiers. Thus the authors include boxes labeled "type identifier.” "field identifier.“ "variable identifier.“ "function identifier.“ etc.. in their syntax charts. However. from a purely syntactic point of view. one can not distinguish among the various identifiers. Thus a statement. for example. can be- 31 gin with an identifier. which may be a variable name. a function name. or a procedure name. Even a second lookahead character may not suffice to isolate the correct production without symbol table information. since both a function name and a variable name may be followed by the assignment operator ':='. LLPARSE detected two ways in which Pascal fails to be an LL(1) grammar: 1. the Pascal “field list" has more than one path to the symbol '3' and to “identifier“ (a construct beginning with an ”identifier“ can be followed by a case construct or not. so a parser does not know which of two productions to choose). 2. As remarked earlier. “statement" has more than one path to “identifier". Both of these problems can be cured in a practical compiler; the "field list" difficulty can be handled by a local fix (an additional lookahead) in the procedures which parse the construct. while the "statement" difficulty can be removed by using semantic information already available from the symbol table. IV. CONCLUSIONS AND FUTURE DIRECTIONS. One conclusion which can be quickly drawn from this work is that it is onerous at best to translate unstructured syntax diagrams (Pascal "field list" being a classical example) into the structured input format necessary for this algorithm. Syntax charts are helpful to the human being attempting to analyze a grammar to determine what is syntactically permitted. and probably are easier to understand than a modified BNF grammar. But the latter is a far more useful form for the computer input. In conclusion. we briefly outline some facilities which could be added to the present implementation. and some directions for future research. First. it would seem advantageous to kill two birds with one stone by combining the LL detection apparatus of LLPARSE with a simple precedence detector. All of the information necessary to determine whether the grammar is simple precedence or not is available in the graphs and in the boolean matrices constructed by the graph searching algorithm. It would be useful and challenging to devise a way to use this information. (The reader is reminded that LL(1) grammars and simple precedence grammars do not have a superset-subset relationship). 32 33 It may be possible to improve the efficiency or elegance of the algorithm by doing the calculations exclusively with matrices. as is frequently done for simple precedence grammars. Possibly these matrices could be constructed during the first parse of the grammatical productions. It is not clear to the author whether it is possible to do this: as has been remarked above. LLPARSE conducts several graph searches. in addition to performing matrix closure operations. in finding its information. Another addition that would simplify the form of certain input grammars would be an additional construct which means "repeat one or more times.“ currently this must be done in the form A [S A S]. where A could itself be quite complicated in form. Finally. one could add a parser constructor to the basic program if desired. This could then become a part of a larger. and very general. compiler constructor. which could work directly from the input grammar to a compiler. This would certainly be a large project for some future computer scientist to undertake: the author could see no profit or advantage to be gained by attempting it at this time. BIBLIOGRAPHY [1] [2] C3] [4] [5] [6] [71 [8] [9] BIBLIOGRAPHY Backhouse. R- Co Sluts; 21 Ecsscammins Languages; 1339;; gng Pragtigg; Prentice-Hall. Englewood Cliffs. 1979. Bell. J. A new method for determining linear precedence functions for precedence grammars. Comm; ACM 12. 10 (Oct.. 1969).567-569. Floyd. R. Syntactic analysis and Operator precedence. g; ACM 10. 3 (July. 1863). 316-333. Jensen. K.. and Birth. N. Eggggl Use; fignugl 39g figggrt. Second edition. Springer-Verlag. New York. 1974. Lalonde. H. An efficient LALR parser generator. Technical Report CSRG-Z. Computer System Research Group. University of Toronto. No date. Martin. D. Boolean matrix methods for the detection of simple precedence grammars. ggmm; ACM 11. 10 (Oct.. 1968). 685-687. Martin. O. A boolean matrix method for the computation of simple precedence functions. 99mm; ACM 15. 6 (June. 1971). 448-454. Pyster. A. ZUSE User's Manual. Report Number TRCSBI-O4. Department of Computer Science. University of California. Santa Barbara. January. 1981: revised March. 1981. Birth. N. AIsscithms 1 flats lif Prentice-Hall. Englewood C Sinsstysss s assesses; TS. 1976 [10] Birth. N.. and Weber. H. EULER - an extension of ALGOL and its formal definition. ngm; ACM 7. 1 (Jan. 1966). 13-25 and 89-99. 34 IGRN STRTE UNIV IIIBIIIIIII IIIIIII IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII