GRAMMATICAL INFERENCE BY CONSTRUCTIVE METHOD

Dissertation for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
JIUNN-I LIOU
1977

This is to certify that the thesis entitled GRAMMATICAL INFERENCE BY CONSTRUCTIVE METHOD presented by JIUNN-I LIOU has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

ABSTRACT

GRAMMATICAL INFERENCE BY CONSTRUCTIVE METHOD

By Jiunn-I Liou

In this dissertation a grammatical inference problem is investigated in which constructive methods for inferring finite-state and Chomsky normal form p-grammars from a p-sample are developed. A heuristic approach to grammatical inference is taken. Complexity measures and an acceptance criterion are used in selecting a grammar. The solution p-grammar for a p-sample is defined as the p-grammar which has the least complexity and generates an acceptable language.

In the first part of the thesis, a procedure for analyzing sample structure is proposed. Five dissimilarity measures based on the minimal cost of a sequence of edit operations (insertion, deletion, change) are defined and discussed. The type of sequence, determined by the position and the depth of an edit operation in a digraphic representation of paths, is used in the assignment of cost. It is shown that these measures vary in their ability to discriminate among strings. A clustering algorithm based on a labelled MST and seven interpretations of inconsistent edges are proposed to break a large inference problem into several small ones. The algorithm uses the notions of "common substrings" and "length" and results in an inference tree. It is argued that the inference tree will provide sufficient information for finding recursive and other string structures, as well as promoting the efficiency of construction.

Techniques used as tools for the development of constructive methods are presented in the second part of the thesis. A size complexity measure is suggested to evaluate the performance of the five dissimilarity measures. A complexity measure and several difference measures based on information theory are defined and used to optimize the construction of p-grammars. The difference measures are shown to be bounded and more realistic than others in the literature. The Kolmogorov-Smirnov tests are suggested to measure the difference between language and sample and are compared with the chi-square goodness of fit test.

Finally, all the above techniques are integrated into constructive methods for the inference of finite-state and Chomsky normal form p-grammars. The method consists of two parts: construct an initial p-grammar for the p-sample according to the inference tree, then merge productions to generate candidate p-grammars. A set of rules based on six different situations is defined for merging pair-related productions. This constructive method has several advantages over other methods. First, it uses a very general, unique and computationally efficient method for analyzing sample structure. Second, it provides a more realistic difference measure for p-languages than those suggested in the literature. Third, it is reasonable and can be used in very general cases. The method is different from others in the way it analyzes sample structure, creates candidate p-grammars, and in its heuristic strategies.
GRAMMATICAL INFERENCE BY CONSTRUCTIVE METHOD

By

Jiunn-I Liou

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

January, 1977

ACKNOWLEDGEMENTS

I want to thank each of the members of my doctoral committee for his unique contribution. Thanks go to Dr. Mort A. Rahimi for serving as my academic advisor in the early stage of my program and for continually encouraging me during the course of this research. Thanks go to Dr. Anil K. Jain, Dr. John J. Forsyth, Dr. Lewis Greenberg and Dr. Dennis D. Gilliland for serving on my guidance committee and for their critical review of this thesis. My special thanks go to Dr. Herry Hedges, Dr. Carl V. Page and the Division of Engineering Research for providing me with financial assistance during my thesis writing.

Of course the committee member and faculty member I am most indebted to is my thesis advisor, Dr. Richard C. Dubes. Most of my research abilities were developed by him. Without his encouragement, patience and guidance I would never have had the chance to complete my degree.

I am also indebted to my wife, Alice Shih-Sing, for her love, patience and understanding throughout my years of study.

TABLE OF CONTENTS

CHAPTER I. INTRODUCTION
  1.1. Basic Notations
  1.2. General Discussion of Grammatical Inference Problem
  1.3. Statement of the Problem and Assumptions
  1.4. Organization of this Thesis
  1.5. Contributions of this Thesis

CHAPTER II. LITERATURE REVIEW
  2.1. Introduction
  2.2. Theoretical Models
  2.3. Probability Concepts
    2.3.1. Maximum Likelihood Estimation Techniques
    2.3.2. Acceptance Testing of Candidate Grammars
    2.3.3. Difference Measure for Languages
  2.4. Grammatical Complexity Measures
    2.4.1. Complexity Measure Based on Size
    2.4.2. Complexity Measure Based on Information Theory
  2.5. Constructive Methods for Grammatical Inference
  2.6. Summary

CHAPTER III. ANALYSIS OF SAMPLE STRINGS
  3.1. Introduction
  3.2. Syntactical Structure of Sample Strings
  3.3. Dissimilarity Measure
    3.3.1. Weighted Levenshtein Distance and Uniform Edge Weighting
    3.3.2. Modification I - Path Weight Adjustment
    3.3.3. Modification II - Normal Edge Weighting
    3.3.4. Modification III - Binormal Edge Weighting
    3.3.5. Modification IV - Depth Edge Weighting
  3.4. Cluster Analysis
    3.4.1. Labelled Minimal Spanning Tree
    3.4.2. Clustering Method
    3.4.3. Clustering Method for Grammatical Inference
  3.5. Summary

CHAPTER IV. GRAMMATICAL COMPLEXITY AND ACCEPTANCE CRITERION OF LANGUAGES
  4.1. Introduction
  4.2. Solution Grammar
    4.2.1. Halting Problem
    4.2.2. Relationship Between Sample Strings and Grammars
    4.2.3. The Definition of Solution Grammar
  4.3. Complexity of Grammars
    4.3.1. Size Complexity Measure
    4.3.2. General Complexity Measure Based on Size Measure
    4.3.3. Complexity Measure for Evaluation
    4.3.4. Complexity Measure for Selecting the Best Grammar
  4.4. Difference Measures of Languages
    4.4.1. General Discussion
    4.4.2. Definitions and Properties
  4.5. Statistical Acceptance Criterion of Languages
    4.5.1. Appropriateness of a Language for a Sample
    4.5.2. Acceptance Criteria Based on Statistical Tests
      4.5.2.1. Chi-Square Goodness of Fit Test
      4.5.2.2. The Kolmogorov-Smirnov Tests
  4.6. Summary

CHAPTER V. INFERENCE OF PROBABILISTIC GRAMMARS
  5.1. Introduction
  5.2. Assigning Production Probabilities
  5.3. Inference of Finite-State P-Grammars
    5.3.1. Constructing the Initial Finite-State P-Grammar
    5.3.2. Constructing the Candidate FSPG
      5.3.2.1. Rules for Merging Productions and Consolidation
      5.3.2.2. Search Strategy
  5.4. Inference of Context-Free P-Grammars
    5.4.1. Constructing the Initial Chomsky Normal Form P-Grammar
    5.4.2. Constructing the Candidate CNFPG
      5.4.2.1. Rules for Merging Productions and Consolidation
      5.4.2.2. Search Strategy
  5.5. Comparison
  5.6. Computational Considerations
  5.7. Summary

CHAPTER VI. SUMMARY, CONCLUSIONS AND FUTURE RESEARCH
  6.1. Summary and Conclusions
  6.2. Future Research

APPENDICES
BIBLIOGRAPHY

LIST OF FIGURES
  1. AB Digraph
  2. Uniform Edge Weighting
  3. AB Digraph Under Uniform Edge Weighting
  4. AB Digraph Under Normal Edge Weighting
  5. AB Digraph Under Binormal Edge Weighting
  6. AB Digraph Under Depth Edge Weighting
  7. A Labelled MST
  8. (a) A Labelled MST for S; (b) An Inference Tree for S

CHAPTER I
INTRODUCTION

The problem of learning a grammar based on a set of sample strings is called the grammatical inference problem. A survey of the literature /27/,/28/ shows that the problem of grammatical inference has recently received a great deal of attention. The importance of studying grammatical inference has been pointed out in several papers /25/,/27/,/28/,/30/,/61/. Based on this substantial information, this thesis will study a grammatical inference problem.

In this chapter, we will introduce some basic notations for grammatical inference, state the grammatical inference problem studied in this thesis, describe the ideas presented, lay out the organization of this thesis and claim its major contributions.

1.1. BASIC NOTATIONS

A number of notations are needed for understanding grammatical inference. These notations are described below.

A phrase structure grammar G, as defined in /32/, is a 4-tuple, G = (Vn, Vt, R, A), where Vn is a finite set of nonterminal symbols; Vt is a finite set of terminal symbols; A is the start symbol, A ∈ Vn; and R is a finite set of production rules of the form B → C, where B ∈ (Vn ∪ Vt)* Vn (Vn ∪ Vt)* and C ∈ (Vn ∪ Vt)*.

The language defined by the grammar G is denoted L(G):

L(G) = { x | x ∈ Vt*, A ⇒* x }

A ⇒* x means that x is derived from A by successive applications of productions in R.

A phrase structure grammar G is context-free if every production B → C in R satisfies ℓ(B) = 1, where ℓ(B) is the length of the string B. It is known /34/ that any context-free language can be generated by a grammar in which all productions are of the form B → CD or B → a, where B, C, D ∈ Vn and a ∈ Vt. Grammars of this type are called Chomsky normal form grammars.

A phrase structure grammar G is regular or finite-state if every production in R is of the form B → aC or B → a, where B, C ∈ Vn and a ∈ Vt.

A sample of a language L is any subset S of L. The variations "sample S" and "positive sample S+" will be used synonymously. The sample S is complete if S = L.

A denumerable class of grammars C is said to be admissible if for any x ∈ Vt* it is decidable whether or not x ∈ L(G), for any G ∈ C.

A grammar G is compatible with a sample S if S ⊆ L(G).

Tree representation is an effective way to represent a class of admissible grammars C, and is defined as follows /48/:
(1) Each node in the tree corresponds to a grammar G in C.
(2) A node Gt is a descendant of a node Gt-1 if and only if Gt can be obtained from Gt-1 under some conditions, such as adding or merging productions.

A probabilistic grammar /27/ (p-grammar) Gp = (G, PG) is a grammar G with production probabilities PG. A probabilistic (positive) sample Sp = (S, PS) of a language L is a sample S with string probabilities PS.
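To make the preceding notation concrete, the following is a minimal sketch (not part of the original thesis) of one way to represent a grammar as a 4-tuple and to test the two production forms defined above; the class and function names are illustrative only.

from dataclasses import dataclass

@dataclass
class Grammar:
    nonterminals: set      # Vn
    terminals: set         # Vt
    productions: list      # pairs (B, C) for rules B -> C; C is a tuple of symbols
    start: str             # A, the start symbol

def is_chomsky_normal_form(g: Grammar) -> bool:
    """True if every rule is B -> CD (C, D nonterminal) or B -> a (a terminal)."""
    for lhs, rhs in g.productions:
        if lhs not in g.nonterminals:
            return False
        if len(rhs) == 2 and all(s in g.nonterminals for s in rhs):
            continue
        if len(rhs) == 1 and rhs[0] in g.terminals:
            continue
        return False
    return True

def is_finite_state(g: Grammar) -> bool:
    """True if every rule is B -> aC or B -> a, with a terminal and C nonterminal."""
    for lhs, rhs in g.productions:
        ok_pair = len(rhs) == 2 and rhs[0] in g.terminals and rhs[1] in g.nonterminals
        ok_single = len(rhs) == 1 and rhs[0] in g.terminals
        if lhs not in g.nonterminals or not (ok_pair or ok_single):
            return False
    return True

# Example: a finite-state grammar with L(G) = { a, ab, abb, ... }
g = Grammar({'A', 'B'}, {'a', 'b'},
            [('A', ('a', 'B')), ('A', ('a',)), ('B', ('b', 'B')), ('B', ('b',))], 'A')
assert is_finite_state(g) and not is_chomsky_normal_form(g)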
1.2. GENERAL DISCUSSION OF GRAMMATICAL INFERENCE PROBLEM

The primary problem of grammatical inference can be formulated as follows: given a sample S, infer a grammar G to describe the given sample S and other strings which, in some sense, are of the same nature as S. Precisely the same problem arises in trying to choose a model (theory, function) to explain a collection of sample data.

The problem of inferring finite-state grammars can be approached analytically /3/,/27/,/50/. Since many properties of context-free grammars are undecidable /34/, most studies in inferring context-free grammars are limited to specific types of context-free grammars and often rely on heuristic methods /14/-/16/,/22/,/59/.

Two methods, the enumerative method and the constructive method, have been studied in grammatical inference. Enumerative methods simply map the set of positive integers to a class of grammars. Constructive methods are based on the syntactical structure of sample strings. In inferring or guessing a grammar based on a sequence of strings, Gold /30/ has shown that no guessing rules are uniformly better than enumeration.

Enumerative methods have been discussed, developed and investigated in the literature /35/,/36/,/46/,/50/,/67/. To enumerate all possible grammars in a tree, a tree representation for a class of grammars can be defined before enumeration, or an algorithm can be defined which visits all nodes in some order and defines the structure of the grammar found at each node. Enumerative methods admit several variations, such as the formalism of the state-space approach to problem solving /48/, which involves problem definition, searching strategy and heuristic function. Several enumerative methods have been developed and discussed to infer finite-state and context-free grammars /50/,/67/, but enumerative methods for p-grammars have not been investigated.

In inferring a grammar from a sample, Fu /27/ has stated that no methods will be more efficient than constructive methods in the sense of constructing an approximate grammar in a reasonable time. Constructive methods have been developed, discussed and investigated in the literature /11/,/16/. The common features of constructive methods are: (1) analyze the syntactical structure of the sample strings; (2) construct a grammar for the sample to reflect the syntactical structure; (3) merge or add productions to get a more acceptable grammar. In inferring a p-grammar for a p-sample, probability measurements can be used to evaluate the appropriateness of an inferred grammar. Constructive methods also admit several variations, such as methods for analyzing the syntactical structure of sample strings, methods for constructing candidate grammars and methods for testing constructed grammars. Several constructive methods for the inference of finite-state grammars have been proposed /3/,/20/.
The inference of context-free grammars by constructive methods for subsets of context-free languages has also been investigated /14/-/16/,/24/,/61/,/62/. The inference of finite-state /52/ and context-free /11/ p-grammars has been suggested. However, there is no general procedure for analyzing the syntactical structure of sample strings, and no inherent limitations on developing methods for generating candidate grammars. In addition, only a few statistical procedures have been applied to grammatical inference for decision making. Based on these facts, there is a big challenge in developing general procedures for analyzing sample strings, generating candidate grammars, and testing the acceptance of an inferred grammar. This thesis responds to this challenge.

1.3. STATEMENT OF THE PROBLEM AND ASSUMPTIONS

The grammatical inference problem for this thesis is formulated as follows: given a probabilistic positive sample Sp, construct an acceptable probabilistic grammar Gp to describe Sp. For convenience, we call Sp a probabilistic sample (p-sample). The basic assumptions for this study are:
(1) The probabilistic sample is finite.
(2) The probabilistic sample is randomly drawn from the source language.
(3) The size of the probabilistic sample is large enough to represent the source language.
(4) The types of grammar being inferred are finite-state and Chomsky normal form p-grammars.

1.4. ORGANIZATION OF THIS THESIS

This thesis includes six chapters. The first chapter introduces the concept of grammatical inference, defines the grammatical inference problem for this thesis and claims the major contributions of the thesis. The second chapter reviews appropriate literature and discusses the most important methods and techniques related to this thesis. The third chapter defines and discusses methods for analyzing the syntactical structure of sample strings. The fourth chapter defines grammatical complexity measures and acceptance criteria for grammatical inference. The fifth chapter discusses the problem of generating candidate grammars and presents the procedure for constructing candidate finite-state and Chomsky normal form p-grammars. Comparisons are then made between this procedure and other methods. The sixth chapter draws conclusions and suggests further research.

1.5. CONTRIBUTIONS OF THIS THESIS

The main contributions of this thesis are listed below.
(1) The models, methods and literature of grammatical inference are thoroughly reviewed and discussed.
(2) A new, general and reasonable method is proposed for analyzing the syntactical structure of sample strings.
(3) The concept of dissimilarity measure is applied to grammatical inference, so that sample strings can be analyzed in a new dimension.
(4) Five different dissimilarity measures for discriminating between strings are developed, so that strings with similar syntax can be identified.
(5) For the first time, the concept of clustering is applied to grammatical inference, based on a labelled minimum spanning tree (MST). The resultant inference tree provides the framework necessary for computationally efficient grammatical inference.
(6) The role of grammatical complexity in constructive methods for grammatical inference is thoroughly discussed and analyzed.
(7) Several difference measures between p-languages are defined and discussed. Modifications are suggested to make differences between p-languages more realistic.
(8) The application of non-parametric statistics to grammatical inference is expanded, and the Kolmogorov-Smirnov maximum deviation test is introduced and compared to the chi-square test.
(9) A general and practical constructive method, guided by an inference tree, is developed for inferring finite-state and Chomsky normal form p-grammars.

CHAPTER II
LITERATURE REVIEW

2.1. INTRODUCTION

Our purpose in this chapter is to outline the grammatical inference methods already developed by presenting their central concepts, results and techniques. We do not attempt to present each method in its entirety, nor to present them in historical order. Rather we concentrate on exhibiting those concepts and techniques which are important to this thesis, and on presenting them from a unified point of view.

The most significant application of grammatical inference methods has been to syntactical pattern recognition /25/. Other application areas include information retrieval /62/, programming language design /16/, translating and compiling /22/, graphic languages /39/, man-machine communication /54/, and artificial intelligence /23/. Three comprehensive surveys of grammatical inference /27/,/28/,/2/ have laid out the most important work in this area.

In this chapter, we will be presenting grammatical inference models, probabilistic concepts, measurement concepts and inference methods for both non-probabilistic and probabilistic grammars, concentrating on those ideas related to the work in this thesis.

2.2. THEORETICAL MODELS

In this section, we will formulate the basic theoretical framework for solving the problem of grammatical inference as it exists in the literature.

The first model of language identification was introduced by Solomonoff /61/, called inductive inference. The model consists of a teacher, a set of sample strings and an inductive procedure. This model only discovers certain recursive productions for a context-free grammar.

The concept of language learnability was first formulated by Gold /30/. He introduced exact language identification, which infers an exact, non-probabilistic grammar for an unknown language. The model consists of three components: a definition of learnability, a method of information presentation, and a naming relation which assigns names to languages. A language L is said to be identified in the limit if a grammatical inference machine M exists which guesses a name for L each time it receives a unit of information, and for which, after a finite time, the guesses are all the same and are a name of L. Gold also defined two other learnabilities: finite identification and fixed time identification. Two basic methods of presenting information are proposed: text presentation and informant presentation.

Gold's main results concern the conditions for information presentation under which a language can be identified in the limit by enumeration. The results show a great difference in the classes of languages that can be identified in the limit by effectively enumerating a class of admissible grammars under the two different methods of presenting information. Exact language identification has several disadvantages, such as only being valid through an informant presentation. In addition, inference procedures based on this model are limited to enumeration.
Feldman /21/ introduced a weaker notion of learnability called language approachability, which extends the class of languages learnable from a positive information sequence (text), reduces the information needed to identify a language, and enlarges the domain of language identification. This model requires a complexity measure on a class of grammars, so that the class of grammars can be effectively enumerated in the order of their complexities. The concept of inferring the best grammar for a language with regard to grammatical complexity is practicable because of the time and information needed to identify a language. Feldman's main results show that for any class C of grammars ordered by complexities, there is a machine which can infer the best grammar for any finite positive sample.

Wharton /68/ presented the theory of approximate language identification, which is analogous to the existing theory of exact language identification. Approximate language identification is an extension of language approachability. Defining different metrics on languages makes the difference between languages measurable, and indicates the degree of approximation between two languages. The difference between languages can be used as an acceptance criterion for an inferred grammar. With the help of grammatical complexity and language difference, grammatical inference problems can be solved by either an enumerative method or a constructive method. Approximate language identification is more practical than other models, not only because various kinds of procedures are applicable, but also because it only requires that the solution grammar generate a language which approximates the unidentified language. In most cases, the best grammar for a language can be found in a reasonable time.

In this research, we use the concept of approximate language identification to construct the best grammar for a set of sample strings, with respect to grammatical complexity and difference between languages. Therefore, the approximate language identification model will be the most significant model in this research.

2.3. PROBABILITY CONCEPTS

Statistical information is as important as structural information when inferring a probabilistic grammar based on a finite sample. For the past few years, probabilistic grammars and probabilistic automata have received increasing attention /9/,/26/,/64/,/65/. The mathematical formulations of probabilistic grammars and probabilistic automata have been studied in several papers /19/,/57/,/58/,/63/. The relations between probabilistic automata and probabilistic languages have been described /29/. The generative capacity of grammars can be measured by Shannon's entropy /59/. The information-theoretic concept of the entropy of a context-free language and its relation to the structure generating function have been investigated /43/. The information-theoretic concept of entropy has been related to probabilistic grammars /63/. Soule /63/ has calculated the entropies of a derivation of a sentence and of the average terminal symbol, and has proposed methods to maximize the information rate.

Booth and Thompson /9/ presented two methods for assigning probabilities to words in a language, developed several properties of probabilistic languages, and investigated the conditions under which a p-grammar is consistent. Horning /36/ and Patel /52/ presented maximum likelihood estimation techniques to determine production probabilities from an experimental set.
The concept of strong-approximation of a probabilistic language was introduced by Booth /7/. Other types of approximation based on some type of distance measure were developed by Maryanski /46/. Horning /36/ has described an acceptance technique based on the concept of hypothesis testing. In his method, a finite set of candidate probabilistic grammars is given and one must be selected from the set of grammars which best describes the experimental set.

In this section, we will relate these works to this thesis.

2.3.1. MAXIMUM LIKELIHOOD ESTIMATION TECHNIQUES

In inferring a p-grammar from a p-sample, it is necessary to minimize the difference between the resultant p-language and the p-sample. The p-language should match the p-sample, both in probability and in structure. Minimizing this difference is the same as estimating the production probabilities for a grammar from the p-sample. Maximum likelihood estimates of the production probabilities for a p-grammar G maximize the probability of a string being generated by G. This is a reasonable method for estimating the production probabilities of unambiguous grammars. This is not true for ambiguous grammars, since there exists more than one leftmost derivation for some strings in the language. This simple method can be described as follows.

Let G = (Vn, Vt, R, P, A) be an unambiguous p-grammar, and let E = { (x1,C1), ..., (xn,Cn) } be the sample information, where xi ∈ S and Ci is an associated estimated or a priori probability of xi. Let Vn = { A, A1, ..., Ar } be the finite set of nonterminals, and let Pij be the probability of the production Ai → Bj. Under single class estimation, letting P̂ij be the estimated production probability for Pij, the maximum likelihood estimate is

P̂ij = nij / Σj nij, where nij = Σ Ck Nij(xk), summed over xk ∈ S,

and Nij(xk) is the number of times that the production Ai → Bj is used in generating xk. For multiclass estimation, the procedure is similar and will not be presented here.
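The following is a sketch (not from the thesis) of single class estimation, assuming an unambiguous grammar so that each sample string has a unique, known derivation; names such as derivation_counts are illustrative.

from collections import defaultdict

def estimate_production_probabilities(sample, derivation_counts):
    """
    sample: list of (string, C_k) pairs, C_k the string's estimated probability.
    derivation_counts: maps (string, production) -> N_ij(x_k), the number of
        times that production is used in the unique derivation of the string.
    Returns the maximum likelihood estimates P_ij = n_ij / sum_j n_ij.
    """
    n = defaultdict(float)                      # n_ij per production (A_i -> B_j)
    for x, c in sample:
        for (xk, prod), count in derivation_counts.items():
            if xk == x:
                n[prod] += c * count            # n_ij = sum_k C_k * N_ij(x_k)

    totals = defaultdict(float)                 # sum over j of n_ij, per left side A_i
    for (lhs, rhs), nij in n.items():
        totals[lhs] += nij

    return {prod: nij / totals[prod[0]] for prod, nij in n.items()}

# Toy example: S -> aS is used twice and S -> b once in a one-string sample.
sample = [("aab", 1.0)]
counts = {("aab", ("S", "aS")): 2, ("aab", ("S", "b")): 1}
print(estimate_production_probabilities(sample, counts))
# {('S', 'aS'): 0.666..., ('S', 'b'): 0.333...}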
2.3.2. ACCEPTANCE TESTING OF CANDIDATE GRAMMARS

The purpose of statistical tests in grammatical inference is to choose a p-grammar that can adequately describe the given p-sample. The only two statistical tests that have been used in grammatical inference are Bayesian acceptance /51/ and the chi-square test /12/. There is a functional difference between these two tests. The chi-square test is used to judge the degree of closeness between a p-language and a p-sample, while the Bayes test is used to select the best grammar from a class of grammars for the sample. The Bayes test requires a prior probability distribution over a class of grammars, which is not compatible with the procedure presented in this thesis. The chi-square test is used to test the difference between the theoretical and observed distributions of a sample /46/ and is described below.

Let ei be the expected number of times that xi ∈ S appears in the sample S, let n be the size of S, and let k be the number of distinct categories. Thus ei = n Pr(xi). The d² statistic measures the difference between the expected frequency ei and the observed frequency Ci, and is defined as:

d² = Σ (Ci − ei)² / ei

Under general conditions, the distribution of d² can be closely approximated by a chi-square distribution with k−1 degrees of freedom (df) /12/. A hypothesis testing problem can be stated as:

H0: ei = Ci for all i, 1 ≤ i ≤ k
H1: ei ≠ Ci for some i, 1 ≤ i ≤ k

Let h² be the acceptance criterion for a chi-square test. Then the decision rule under a 0-1 loss function is: accept H0 if d² ≤ h²; accept H1 if d² > h².
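Below is a sketch of the test just described. The thesis leaves h² as a free acceptance criterion; here it is taken, as one common choice, to be the chi-square critical value at significance level alpha (the scipy call is an assumption of this sketch, not part of the thesis).

from scipy.stats import chi2

def chi_square_accepts(observed, expected, alpha=0.05):
    """observed[i] = C_i, expected[i] = e_i = n * Pr(x_i); k categories."""
    k = len(observed)
    d2 = sum((c - e) ** 2 / e for c, e in zip(observed, expected))
    h2 = chi2.ppf(1.0 - alpha, df=k - 1)   # critical value with k-1 df
    return d2 <= h2                        # accept H0: the language fits the sample

# Example: 100 strings over 3 categories; the language predicts (0.5, 0.3, 0.2).
print(chi_square_accepts(observed=[54, 27, 19], expected=[50, 30, 20]))  # True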
2.3.3. DIFFERENCE MEASURE FOR LANGUAGES

To decide whether an inferred language adequately approximates a sample, several authors have introduced the notion of a difference measure for languages.

The first approach to measuring differences between languages is given by Cook /11/ and is called the discrepancy measure. It is an information-theoretic measure based on the probability distributions of the sample and the language (cf. Sec. 4.4.1). A second type of difference measure was proposed by Wharton /68/, who defined metrics on the class of all languages U = { L | L ⊆ Vt+ } over a finite terminal vocabulary Vt. He considered two types of metrics, the discrete metric and the weight metric. The weight metric is intuitively reasonable in grammatical inference by constructive methods, since the grammar inferred is an approximation to an exact grammar for the sample. Assigning weights to strings in a language is difficult if the language is probabilistic. The third type of difference measure is based on the string probability /46/. This difference measure only considers the difference of string probabilities over all strings in one language, which will reduce the accuracy of the difference.

Each of the three different types of difference measures has some disadvantages. Other difference measures are possible, but it is difficult to establish another dimension that can be used to measure the difference between languages. The difference measure between p-languages based on information theory will be most significant to this thesis, since it involves both syntactical and probability differences.

2.4. GRAMMATICAL COMPLEXITY MEASURES

The idea of grammatical complexity can be used to select a good grammar to represent a sample /21/,/67/. Grammatical complexity measures have received increasing attention over the past few years /5/,/6/,/13/,/32/,/56/. Two different categories of grammatical complexity measures are those based on the size of grammars, and those based on information theory. In this section, we will present the concepts underlying this idea.

2.4.1. COMPLEXITY MEASURE BASED ON SIZE

The size measure of machines is based on the number of states in the machine /5/. The relative size of two machines is independent of the particular measure used /6/. The measure of complexity is the number of steps taken by machines to compute a function. Wharton /67/ uses the concept of size measure in approximate language identification (cf. Sec. 4.3.1). He introduces the maximum length of the right hand sides of productions as a parameter in the complexity measure. Thus the class of context-free grammars may be distinguished by their special normal form of productions. Feldman /21/ defined a general grammatical complexity measure (cf. Sec. 4.3.2). Based on this measure, there are many ways to define grammatical complexity measures, but there is no optimal grammatical complexity measure among all complexity measures. Feldman's and Wharton's complexity measures are for the purpose of effectively enumerating a class of grammars. They may also be used for other purposes such as evaluation. Salomaa /56/ defines the index of a grammar and the index of a language, which is similar to a size measure. Gruska /32/ presented several classifications of context-free languages which are analogous to the size measure.

2.4.2. COMPLEXITY MEASURES BASED ON INFORMATION THEORY

The information conveyed by an event is the decrease in uncertainty that comes about when one of the events does occur. The average information conveyed by a complete set of events is the entropy. Shannon /59/ notes several properties required of a measure of the information conveyed by a complete set of events. Cook /11/ defined information-theoretic complexity and discrepancy measures for grammatical inference (cf. Sec. 4.3.4). Cook's complexity measure assumes that the complexity of a context-free p-grammar is the sum of production complexities. Since context-free grammars have a single nonterminal on the left hand side of each production, the production complexity is determined by the right hand side. This assumption is reasonable only if productions in context-free grammars are statistically independent. The complexity measure only considers the complexities contributed by two sources: the production probability and the right hand side of each production. That is not enough, since there exists some degree of similarity among productions of any context-free grammar. Two context-free grammars can have productions very similar to each other in length and symbol distribution. However, all productions in one might be of the same type, as in a pivot grammar, while those in the other can be heterogeneous. Certainly, this fact should influence complexity.

Horning /36/ suggested that if complexity is defined as a monotonic function of the negative logarithm of the conditional probability Pr(Gi|S,C), where Gi ∈ C and C is a class of p-grammars which generate the sample S, then the intrinsic complexity of the grammar and the relative complexity of the sample given the grammar are defined as:

−log Pr(Gi)  and  −[log Pr(S|Gi) − log Pr(S)]

This is analogous to Cook's complexity measure, since it considers both the information needed to select the grammar Gi from C and the information needed to generate the sample S. This complexity measure is nonetheless different from Cook's: the complexity of a grammar Gi ∈ C depends on Pr(Gi) and Pr(S|Gi), while Cook's complexity measure considers the productions of a p-grammar.

Feldman et al. /24/ define a derivational complexity of a set of strings relative to a grammar. A real valued function similar to a probability function is defined over the productions of a grammar. The derivational complexity of a grammar is based on the values assigned by the function to productions. This complexity measure is similar to Horning's complexity measure, since both depend on the probability generated by a p-grammar.

Three different complexity measures based on information theory can thus be used to measure the complexity of a p-grammar. Horning's complexity measure requires the prior probability Pr(Gi) and the conditional probability distribution Pr(S|Gi). Feldman's complexity measure only concerns the derivational complexity of a sample relative to a grammar. Cook's complexity measure ignores the similarity among productions in a grammar. Therefore a complexity measure should be defined which is similar to Cook's complexity measure but will consider the difference among productions. Also, other complexity measures based on information theory can be defined that will possess desirable properties.

An information measure is completely different from a size measure. Different complexity measures can be defined for different purposes and interests. For measuring the complexity of a p-grammar, an information measure is better than a size measure since it considers the production probabilities as well as the productions themselves.
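As a small illustration of Horning's two quantities above, the following sketch (not from the thesis) computes them from a given prior Pr(Gi) and likelihoods Pr(S|Gi); computing Pr(S) by total probability over the class is an assumption of this sketch.

from math import log2

def horning_complexities(prior, likelihoods):
    """
    prior: dict grammar -> Pr(G_i); likelihoods: dict grammar -> Pr(S|G_i).
    Returns grammar -> (intrinsic, relative) complexities in bits.
    """
    pr_s = sum(prior[g] * likelihoods[g] for g in prior)   # Pr(S), assumed here
    return {g: (-log2(prior[g]),
                -(log2(likelihoods[g]) - log2(pr_s))) for g in prior}

prior = {"G1": 0.75, "G2": 0.25}
likelihoods = {"G1": 0.10, "G2": 0.40}
print(horning_complexities(prior, likelihoods))
# G1 has the lower intrinsic complexity; G2 the lower relative complexity.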
2.5. CONSTRUCTIVE METHODS FOR GRAMMATICAL INFERENCE

The various grammatical inference methods developed in the literature can be classified into two categories: enumerative methods and constructive methods. Since this thesis studies constructive methods, enumerative methods /33/,/35/,/36/,/46/,/48/,/67/ will not be discussed. Constructive methods are based on the structural and statistical properties of the strings in the sample. In this section, we will discuss some of the most important work in this area.

A constructive method for inferring a recursive, unambiguous finite-state grammar from a sample S has been presented /20/. The procedure first constructs a grammar that generates exactly S, then produces a simpler, recursive grammar that adds more strings to the generated language. Two constructive methods based on a sample S have been described for finite-state grammars /3/. The algorithms infer a grammar which generates S and strings similar to S. The first algorithm /3/ creates and compares sublanguages Sw = { x | wx ∈ S }. All strings in S are grouped together when they have the same initial substring w. Let Sw be one sublanguage, say the ith. A production Ai → aAj is created if Swa is another sublanguage, say the jth, and Ai → a is created if wa ∈ S. The second algorithm /3/ is called a linear grammatical inference program, and operates analogously to the finite-state program.
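The following is a sketch of the first algorithm as just described: nonterminals correspond to distinct sublanguages Sw, and productions are created as stated above. Details such as how nonterminals are named are illustrative, not taken from /3/.

def infer_finite_state(sample):
    # S_w = { x | w x in S }, the sublanguage for prefix w (empty tails dropped).
    def sub(w):
        return frozenset(x[len(w):] for x in sample if x.startswith(w) and x != w)

    prefixes = {""} | {x[:i] for x in sample for i in range(1, len(x))}
    classes = {}                       # sublanguage -> nonterminal name
    for w in sorted(prefixes):         # classes[sub("")] acts as the start symbol
        classes.setdefault(sub(w), "A%d" % len(classes))

    productions = set()
    for w in prefixes:
        Ai = classes[sub(w)]
        for a in {x[len(w)] for x in sample if x.startswith(w) and len(x) > len(w)}:
            if w + a in sample:
                productions.add((Ai, a))                        # A_i -> a
            if sub(w + a):
                productions.add((Ai, a + classes[sub(w + a)]))  # A_i -> a A_j
    return productions

print(sorted(infer_finite_state({"ab", "abab"})))
# [('A0', 'aA1'), ('A1', 'b'), ('A1', 'bA2'), ('A2', 'aA3'), ('A3', 'b')]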
A constructive method for the inference of finite-state p-grammars has been presented /52/ which starts with a small subset H of the positive sample S and generates the set of all possible deterministic derived grammars that could describe this subset. It then uses the maximum likelihood method to assign production probabilities. The search is terminated if one of the grammars is accepted according to a chi-square test. Otherwise, the size of H is enlarged. The search continues until either an acceptable grammar is found or the subset H is equal to S.

A method for discovering the nesting or recursive structure in a context-free grammar has been described /61/,/62/. The procedure cannot find the recursive property of a context-free grammar with self-embedding, e.g. A → aAb, a, b ∈ Vt+, A ∈ Vn. A constructive method for inferring a pivot grammar has been described /24/. The strategy is to find self-embedding in the positive sample. The inference of operator precedence grammars from a structural information sequence of a language has also been presented /14/-/16/.

One of the most important constructive methods for the inference of context-free p-grammars is the hill-climbing method /11/. In this procedure, a cost function is defined that measures the complexity of a grammar and the discrepancy between languages. The procedure starts with the grammar in which the initial symbol goes to each string in the sample with the given string probability. It then examines possible simplifications of the grammar iteratively. The inference procedure makes the simplification that yields the grammar of the lowest complexity, provided that it lowers the cost. The process continues until no simplifications that lead to lower-cost grammars are possible.

Another semi-heuristic constructive method for the inference of context-free p-grammars has been developed /46/ in which candidate grammars are generated by searching for various structural features of the sample, such as the "uvwxy" property /34/ of context-free grammars, and constructing corresponding grammars. This procedure is satisfactory for simple grammars. If the original grammar which generates the sample is ambiguous, the languages generated by the inferred grammars are statistically close to the sample, but in some cases their productions are quite different.

The problem of grammatical inference by constructive methods is still open for the following reasons:
(1) Other heuristic strategies exist for constructing grammars.
(2) No best evaluation procedure for improving the selection of grammars exists.
(3) No best difference measure between languages for deciding the appropriateness of a language exists.

2.6. SUMMARY

In this chapter, we have summarized the most important work in the area of grammatical inference. The concepts of exact language identification and approximate language identification have been described. Two major approaches to grammatical inference have been presented, and recent work on the inference of probabilistic and non-probabilistic grammars has been summarized.

Grammatical inference is still in its infancy. Much theoretical and experimental research needs to be done before we can establish the principles underlying grammatical inference. Many problems are still open, especially in the inference of probabilistic and high dimensional grammars. A big challenge exists in grammatical inference by constructive methods, and in the inference of context-free grammars.

CHAPTER III
ANALYSIS OF SAMPLE STRINGS

3.1. INTRODUCTION

The purpose of this chapter is to present efficient and general procedures for analyzing the syntactical structure of a given set of sample strings, called the sample structure. The results obtained from these procedures will provide a framework for constructing a grammar whose language is similar to the given set of sample strings (cf. Chap. V).

In the literature, a grammar for a set of sample strings is inferred in two steps. First, construct a grammar for each individual string. Then, merge all grammars into a single grammar. The number of grammars and productions constructed during the first step increases linearly with the size of the sample. Feldman /2/ proposed four steps for inferring grammars for a set of sample strings. His first step is to analyze the syntactical structure of the given set of sample strings. This can be accomplished in many ways and will be discussed in Sec. 3.2.

In this chapter, we will propose a cluster analysis method which partitions the set of sample strings according to their similarities and creates an inference tree. Then we will construct a grammar for each cluster. This procedure substantially reduces the number of grammars and productions constructed in the first step. Also, the cluster analysis will suggest recursive structures for strings, which will indicate recursive productions for the inferred grammar.

The cluster analysis is based on a measure of proximity between strings. A dissimilarity (distance) measure for strings of different lengths will be proposed in Sec. 3.3. The dissimilarity measure is defined as the minimum cost sequence of "edit operations" needed to change one string to another /49/,/66/. The inadequacy of assigning uniform cost for edit operations regardless of the position of the operation in the operation sequence will be discussed. Alternative ways to assign cost for edit operations will be presented, and their properties will be studied.
For the purpose of detecting certain syntactical relations between strings, the dissimilarity measure will be modified by letting the weights for edit operations depend on their positions in the operation sequence.

A cluster analysis method based on the minimal spanning tree is a powerful and general tool for detecting and describing the structure of point clusters /69/. In Sec. 3.4 we will describe this clustering procedure and discuss the significance of cutting inconsistent edges in the minimal spanning tree so that a number of meaningful clusters may be formed.

3.2. SYNTACTICAL STRUCTURE OF SAMPLE STRINGS

The syntactical structure of a finite-state language is different from that of a context-free language. In this section, we will review the methods which have already been developed to analyze the syntactical structure of a set of sample strings for inferring both finite-state and context-free grammars.

The analysis of the syntactical structure of a finite-state language is based on the particular form of productions for finite-state grammars (see Sec. 1.1). Two methods have been proposed to analyze the syntactical structure of a finite-state language. One uses the idea of the formal derivative /10/. The other uses the idea of the k-tails of a derivative /27/. These methods are described as follows.

The formal derivative of a set of strings S with respect to the symbol a ∈ Vt is defined as:

DaS = { x | ax ∈ S }

As an example, let S = { 01, 100, 111, 0100 }. Then D0S = { 1, 100 } and D1S = { 00, 11 }.

Based on the formal derivative, a canonical derivative finite-state grammar can be defined for a sample S /52/. The canonical derivative grammar can be used to define a class C of derived grammars /52/. In many situations, the set C will not contain the grammar that generated the sample S /27/.

Let u = a1a2...an ∈ Vt*, and let S ⊆ L(G). The k-tails of S with respect to u are defined as:

g(u,S,k) = { x | x ∈ DuS, ℓ(x) ≤ k }

For example, let S = { 01, 100, 111, 0100 }. Then g(0,S,2) = { 1 }, g(0,S,3) = { 1, 100 }, and g(1,S,2) = { 00, 11 }.

The k-tails method can be used to define equivalence classes on the states of the canonical derivative finite-state grammar generated by a complete sample S /27/. This method can only obtain rough solution grammars for the sample S.

The k-tails of a finite-state language are based on the formal derivative. These two methods are similar and lead to similar results. These methods can be easily programmed and used to analyze a set of sample strings from a finite-state language, but they are limited to sequentially scanning characters in strings from a sample. An adequate syntactic structure for the sample cannot be discovered by these methods. In addition, grammatical inference methods based on these methods are constricted in their procedures for merging productions. We want to develop a general technique for analyzing syntactic relations among strings so that a more reasonable sample structure can be found and mergers of productions are clear, flexible and permit heuristic strategies.
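The two analyses just defined are short enough to state directly in code; the following sketch (not from the thesis) computes formal derivatives DuS, extended from a symbol to a string u, and k-tails g(u,S,k).

def derivative(S, u):
    """D_u S = { x | u x in S }."""
    return {x[len(u):] for x in S if x.startswith(u)}

def k_tails(S, u, k):
    """g(u, S, k) = { x in D_u S : len(x) <= k }."""
    return {x for x in derivative(S, u) if len(x) <= k}

S = {"01", "100", "111", "0100"}
print(derivative(S, "0"))      # {'1', '100'}
print(derivative(S, "1"))      # {'00', '11'}
print(k_tails(S, "0", 2))      # {'1'}
print(k_tails(S, "0", 3))      # {'1', '100'}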
The syntactical structure of a context-free language is very complicated. To date, only limited research has been done in investigating subsets of a context-free language, such as operator precedence grammars and pivot grammars. For inferring operator precedence grammars, each string in the sample is evaluated to establish a set of precedence relationships that exist on the elements of the strings /14/-/16/. For inferring pivot grammars, the main strategy is to find self-embedding in the sample strings by using substring relationships /24/. There is no general method for analyzing the syntax of a context-free language. Existing techniques for the analysis of the syntactical structure of context-free languages are applicable only to small subsets of the language. Thus we are motivated to develop a general method for analyzing the syntactical structure of context-free languages.

3.3. DISSIMILARITY MEASURE

Using cluster analysis methods to analyze the syntactical structure of a set of sample strings requires a proximity measure which determines the syntactical difference between pairs of strings. Recent research /49/,/66/ leads to a reasonable concept of dissimilarity between strings. The dissimilarity between two strings is defined as the minimum cost sequence of "edit operations" needed to change one string to the other. Under this measure, three edit operations are considered: (1) change one character to another; (2) delete a character from a string; (3) insert a character into a string. A cost is assigned to each edit operation.

In this section we will adopt the dissimilarity measure proposed by Okuda et al. /49/, called the "Weighted Levenshtein Distance" (WLD), which defines a distance between words of different length. The WLD between strings is explained in terms of a digraph in Sec. 3.3.1. Wagner /66/ proposed an edit distance between strings which has the same configuration as the WLD, but Wagner used a trace idea to interpret the edit distance. Since the edit distance and the WLD are computed with the same algorithm, we will use the notation of Okuda et al.

3.3.1. WEIGHTED LEVENSHTEIN DISTANCE AND UNIFORM EDGE WEIGHTING

Before defining the dissimilarity measure, we need the following preliminary definitions.

Definition: An edit operation is a pair (a,b) ≠ (λ,λ) of strings of length less than or equal to 1, and is usually written a → b.

Let A, B be strings. A ⇒ B means that an edit operation is applied to string A to produce string B. In general, if A := σaτ and B := σbτ for some strings σ and τ, then A ⇒ B if a → b.

Definition: An edit operation a → b is (1) a change operation if a ≠ λ, b ≠ λ; (2) a delete operation if a ≠ λ, b = λ; (3) an insert operation if a = λ, b ≠ λ.

The three operations are interpreted consistently as follows: (1) an insertion operation inserts a character at the right end of a string; (2) a deletion operation deletes a character from the left end of a string; (3) a change operation changes the character on the left end of a string and moves the changed character to the right end of the string.

Definition: Let h be an arbitrary cost function which assigns to each edit operation a → b a nonnegative real number h(a,b). When h has the properties shown below, it is called the uniform cost function. Let h(a,b) = p, h(a,λ) = r, h(λ,b) = q, for a ≠ λ, b ≠ λ.

The basic dissimilarity measure is defined as follows.

Definition: Let A, B be strings. Then the dissimilarity between A and B is defined as:

D(A,B) = min over i of (p ki + q mi + r ni)

where ki, mi, ni are the numbers of changes, insertions and deletions, respectively, needed to transform the string A to the string B in the ith sequence of edit operations.

As a reasonable cost assignment, we will assume, unless otherwise noted, that q = r, p = q + r, and the cost of changing a character to itself is zero.
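The left-end/right-end interpretation of the operations is easy to miss; the following small sketch (names illustrative, not from the thesis) applies one operation at a time to a working string under exactly that interpretation.

def apply_op(s, op, ch=None):
    if op == "insert":            # insert ch at the right end
        return s + ch
    if op == "delete":            # delete the character at the left end
        return s[1:]
    if op == "change":            # change the left-end character to ch
        return s[1:] + ch         # and move it to the right end
    raise ValueError(op)

# Transforming A = "cb" into B = "cab" (cf. Examples 1 and 2 below):
s = apply_op("cb", "change", "c")   # zero-cost change c -> c: "bc"
s = apply_op(s, "insert", "a")      # insertion, cost q: "bca"
s = apply_op(s, "change", "b")      # zero-cost change b -> b: "cab"
assert s == "cab"                   # one sequence of total weight q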
If the lengths of strings A and B are n and m, respectively, then all possible ways that string A can be transformed into string B can be represented on a digraph with (n+1) × (m+1) nodes si,j and 3mn+m+n edges, called the AB digraph. The digraph can be understood as a rectangle of n rows and m columns and n·m city blocks with a diagonal for each block. In the digraph, (n+m−2) nodes have in-degree one and out-degree three, (n+m−2) nodes have in-degree three and out-degree one, one node is a source, representing string A, one node is a sink, representing string B, and the remaining nodes have in-degree three and out-degree three (see Fig. 1).

FIG. 1. The AB digraph: s11 = A at the source corner, sn+1,m+1 = B at the sink corner, with n+1 rows and m+1 columns of nodes.

Each node si,j in the digraph represents a string, and each edge represents an edit operation. As shown in Fig. 1, s11 represents the string A, sn+1,m+1 represents the string B, and si,j represents the string composed of the ith to the nth characters of A concatenated with the 1st to the (j−1)th characters of B. To obtain si+1,j from si,j, delete the left end character of si,j (which is the ith character of A); to obtain si,j+1 from si,j, insert the jth character of B at the right end of si,j; to obtain si+1,j+1 from si,j, move the left end character of si,j to the right end of si,j if the left end character of si,j (the ith character of A) is identical to the jth character of B; otherwise, change the left end character of si,j to the jth character of B and move it to the right end of si,j. These operations are pictured in Fig. 2.

The costs of edit operations are represented as edge weights, as in Fig. 2. Under the uniform cost function h, the edge weights are assigned as follows:
(1) Assign weight r to all edges (si,j, si+1,j) for 1 ≤ i ≤ n, 1 ≤ j ≤ m+1 (deletion).
(2) Assign weight q to all edges (si,j, si,j+1) for 1 ≤ i ≤ n+1, 1 ≤ j ≤ m (insertion).
(3) For all edges (si,j, si+1,j+1), 1 ≤ i ≤ n, 1 ≤ j ≤ m: if the ith character of A is equal to the jth character of B, then assign weight zero; otherwise assign weight p (change).

FIG. 2. Uniform edge weighting: from si,j, the deletion edge to si+1,j has weight r, the insertion edge to si,j+1 has weight q, and the change edge to si+1,j+1 has weight p.

Example 1: Let A := cb and B := cab. The AB digraph under uniform edge weighting is given in Fig. 3.

FIG. 3. AB digraph under uniform edge weighting (A = cb, B = cab; arrowheads omitted for clarity).

A "path" will mean a directed path from source to sink. The weight of a path in the digraph is the sum of the weights of its edges, and the WLD is the minimum path weight from source to sink.

Example 2: The dissimilarity between A := cb and B := cab is D(A,B) = min (p ki + q mi + r ni). Let q = r and p = 2q. Then a minimum weight path is s11 s22 s23 s34 with weight q (see Fig. 3).

Let A, B and C be strings. Then, under uniform cost, the dissimilarity measure has the following properties /51/:
(1) D(A,B) = 0 if and only if A = B.
(2) D(A,B) ≤ D(A,C) + D(C,B).
(3) If q = r, then D(A,B) = D(B,A).

The minimum weight path in the digraph can be found with algorithms based on the following two theorems /51/.
Theorem 1: Di 1 = (i-l)r, for lfifn+l Dl j = (j-1)q, for lfj3m+l Theorem 1 is for computing the weights of minimum weight paths composed of only deletions or insertions. Theorem 2: Diej = mln (Di'lej+r’ Di'lgj'lfipp’ Diyj'l-l-q) for zfifn+l, 25j5m+1 The proofs Of these theorems are trivial and can be found in Wagner's paper /66/ in different terminology. The above theorems provide a recursive structure Of computation. The uniform cost assignment uses the same weight for each edit operation regardless of the position Of the edge in the path. In some applications Of the WLD, the position of an edge in a path is not important, while in applying the WLD to grammatical inference the position of an edge in a path should be considered. The following example will show the difference between considering and not considering the position of an edge in a path. Example 3: Let A : = (a+b), B : = (a), C : = ((a)). Based on the uniform cost assignment, we find that D(A,B) = 2r,D(C,B) = Zr, and D(A,C) = 2q+2r. 37 When applied to Ex.3, the constructive procedure in Chap. 5 constructs grammars G1, G2, G3 for strings A, B, C, respectively. Then, according to the dissimilarities of their languages, grammars are merged in a stepwise fashion. Since D(A,B) = D(C,B) = 2r is the smallest dissimilarity, we either merge G1 with G2, or merge G2 with G3. If we merge G2 with G3, we may Obtain a self-embedded production (Sec. 2.5.2.), but if we merge G1 with G2, we will not Obtain a self-embedding production since A and B do not possess the substring property. TO force the merger of G2 with G3 first, D(C,B) should be less than D(A,B). This can be accomplished if we consider the position of an edge in a path as a factor in assigning weight to the edge. If, in Ex. 3, we assume that any insertion or deletion made at an end of the string has less effect on syn- tactical structure than one made inside Of the string, then D(C,B) can be made less than D(A,B). In the rest of this section we will propose several modifications Of the WLD. One of them.adjusts the weight of the minimum.weight paths found under the uniform.weight assignment, and is discussed in Sec. 3.3.2. Other modifica- tions adjust the edge weights in the digraph according to position and are presented in subsequent sections. 3.3.2. MODIFICATION I - PATH WEIGHT ADJUSTMENT In this section, we propose a method that adjusts the weight Of a minimum-weight path along with techniques for storing and retrieving paths. The idea is to subtract a 38 certain amount of weight from the minimum weight measured under the uniform weight assignment. The weight of a minimal weight path is adjusted if the path does not contain non-zero weight change operations, and if insertion and deletion operations do not appear in the same path. The weights assigned to the insertions and deletions depend on position in the string. We assume that any edit Operation occurring at the end of string has less effect than one occurring in the middle. The principle is to reduce the weight for those edit Operations that occur before the first or after the last zero-weight change Operation, as compared to those that occur at other positions. Example 4: Let S = { ab, cb, acbc }, p = 2, q = r = 1. Let D, I, C, C' denote deletion, insertion, non-zero weight change and zero weight change operations, respectively. Then the dissimilar- ities and all minimal paths for each pair Of strings are listed below. 
The uniform cost assignment uses the same weight for each edit operation regardless of the position of the edge in the path. In some applications of the WLD, the position of an edge in a path is not important, while in applying the WLD to grammatical inference the position of an edge in a path should be considered. The following example will show the difference between considering and not considering the position of an edge in a path.

Example 3: Let A := (a+b), B := (a), C := ((a)). Based on the uniform cost assignment, we find that D(A,B) = 2r, D(C,B) = 2r, and D(A,C) = 2q+2r.

When applied to Ex. 3, the constructive procedure in Chap. V constructs grammars G1, G2, G3 for strings A, B, C, respectively. Then, according to the dissimilarities of their languages, grammars are merged in a stepwise fashion. Since D(A,B) = D(C,B) = 2r is the smallest dissimilarity, we either merge G1 with G2, or merge G2 with G3. If we merge G2 with G3, we may obtain a self-embedding production (Sec. 2.5), but if we merge G1 with G2, we will not obtain a self-embedding production, since A and B do not possess the substring property. To force the merger of G2 with G3 first, D(C,B) should be less than D(A,B). This can be accomplished if we consider the position of an edge in a path as a factor in assigning weight to the edge. If, in Ex. 3, we assume that any insertion or deletion made at an end of the string has less effect on syntactical structure than one made inside the string, then D(C,B) can be made less than D(A,B).

In the rest of this section we will propose several modifications of the WLD. One of them adjusts the weight of the minimum weight paths found under the uniform weight assignment, and is discussed in Sec. 3.3.2. The other modifications adjust the edge weights in the digraph according to position, and are presented in subsequent sections.

3.3.2. MODIFICATION I - PATH WEIGHT ADJUSTMENT

In this section, we propose a method that adjusts the weight of a minimum-weight path, along with techniques for storing and retrieving paths. The idea is to subtract a certain amount of weight from the minimum weight measured under the uniform weight assignment. The weight of a minimal weight path is adjusted if the path does not contain non-zero weight change operations, and if insertion and deletion operations do not appear in the same path. The weights assigned to the insertions and deletions depend on position in the string. We assume that any edit operation occurring at the end of a string has less effect than one occurring in the middle. The principle is to reduce the weight for those edit operations that occur before the first or after the last zero-weight change operation, as compared to those that occur at other positions.

Example 4: Let S = { ab, cb, acbc }, p = 2, q = r = 1. Let D, I, C, C′ denote deletion, insertion, non-zero weight change and zero weight change operations, respectively. Then the dissimilarities and all minimal paths for each pair of strings are listed below.

a) D(ab,cb) = 2, with minimal paths P1 = s11 s21 s22 s33, P2 = s11 s12 s22 s33, P3 = s11 s22 s33 and corresponding operation sequences O1 = DIC′, O2 = IDC′, O3 = CC′. Since each minimal weight path contains either both deletions and insertions or a non-zero weight change, their weights are not candidates for adjustment.

b) D(ab,acbc) = 2, with minimal path P1 = s11 s22 s23 s34 s35 and operation sequence O1 = C′IC′I. The weight of the right end insertion can be adjusted, and D′(ab,acbc) = 2 − ε, 0 < ε < q.

3.3.3. MODIFICATION II - NORMAL EDGE WEIGHTING

Under normal edge weighting, the weights of deletion and insertion edges depend on the depth of the edge in the digraph: they increase toward the middle of a path, with e = [(m+n+3)/2] as the turning point, and decrease thereafter. Diagonal edges are weighted as follows:

(3) Assign weight (w1 + w2 + w3 + w4)/2 to edges (si,j, si+1,j+1), where w1, w2, w3, w4 are the weights of edges (si,j, si+1,j), (si,j, si,j+1), (si+1,j, si+1,j+1) and (si,j+1, si+1,j+1), respectively. If the ith character of A is equal to the jth character of B, then assign weight zero to edge (si,j, si+1,j+1).

The following example demonstrates normal edge weighting.

Example 8: Let A := cb, B := cab. The AB digraph and its normal edge weights are indicated in Fig. 4, and D(A,B) = 3q.

FIG. 4. AB digraph under normal edge weighting (all unlabelled diagonal edge weights equal 5(q+r)/2).

Let n, m be the lengths of strings A and B, respectively, and let q = r = 1. Then the normal edge weighting of the AB digraph has the following properties:
(1) The maximum weights for edges (si,j, si+1,j) and (si,j, si,j+1) are the same and equal [(m+n+1)/2].
(2) The maximum weight of edge (si,j, si+1,j+1) is (m+n).
(3) The maximum path weight is (m+n+1)²/4 if m+n is odd, and (m+n)(m+n+2)/4 if m+n is even.

Let e = [(m+n+3)/2], and let k = min {e,n} and k′ = min {e,m}. Under the normal edge weighting, Theorems 1 and 2 become the following. The proofs of these theorems are analogous to the proofs of Theorems 1 and 2.

Theorem 3: Di,1 = i(i−1)r/2, for 1 ≤ i ≤ k
           Di,1 = k(k−1)r/2 + (i−k)(2m+2n+3−i−k)r/2, for k < i ≤ n+1
           D1,j = j(j−1)q/2, for 1 ≤ j ≤ k′
           D1,j = k′(k′−1)q/2 + (j−k′)(2m+2n+3−j−k′)q/2, for k′ < j ≤ m+1

3.3.5. MODIFICATION IV - DEPTH EDGE WEIGHTING

Under depth edge weighting, the weight of a deletion or insertion edge is fd/T, where d = (i−1) + (j−1) is the number of steps from s11 to the terminal node si,j of the edge, fd is the number of deletion and insertion edges terminating at nodes d steps from s11, and T = 2mn+m+n is the total number of deletion and insertion edges in the digraph.

As an example, let A := cb and B := cab, so that T = 17. The edges terminating at nodes two steps from s11 are (s21,s31), (s21,s22), (s12,s22) and (s12,s13). Thus f2 = 4, and each of these edges has weight f2/T = 4/17. The depth edge weights in Fig. 6 are the numbers shown divided by 17, and D(A,B) = 5/17.

FIG. 6. AB digraph under depth edge weighting.

Let k = min {n,m}. The depth edge weights of the AB digraph have the following properties:
(1) The maximum weights for edges (si,j, si+1,j) and (si,j, si,j+1) are the same and equal (2k+1)/(2mn+m+n).
(2) The maximum weight of a diagonal edge is (4k+2)/(2mn+m+n) if n+m is even; otherwise it is (4k+1)/(2mn+m+n).
(3) The maximum weight of a path is 1.

Depth edge weighting depends heavily on the lengths of strings. In effect, depth edge weighting classifies edges into three categories: the weights of edges at both ends of a path vary gradually, while edges in the middle of a path have the same weight. Under depth edge weighting, Theorems 1 and 2 become the following. The proofs of these theorems are analogous to the proofs of Theorems 1 and 2.

Theorem 7: Di,1 = i(i−1)/T, for 1 ≤ i ≤ k+1
           Di,1 = (k(k+1) + (i−k−1)(2k+1))/T, for k+1 < i ≤ n+1

4.3.2. GENERAL COMPLEXITY MEASURE BASED ON SIZE MEASURE

Definition: A sequence <y1, y2, ...> is said to be approximately ordered by a function f(y) if and only if there is a function hf(i) such that for each i > 1, t > hf(i) implies f(yt) > f(yi). If hf(i) is effectively computable, then <y1, y2, ...> is said to be effectively approximately ordered (EAO) by f, and f is said to be EAO by <y1, y2, ...>.
Then a general complexity measure is a mapping f from Σ × C into the set of nonnegative rational numbers, satisfying the following conditions.

(C1) The function f is expressible in terms of the intrinsic complexity c(G,C) and the derivational complexity d(S,G), and is a computable, unbounded, increasing function of its two arguments.
(C2) The intrinsic complexity c(G,C) is a positive, computable, unbounded function EAO by the length of the grammar.
(C3) The derivational complexity d(S,G) is a positive function and is defined if and only if S ⊆ L(G).
(C4) There exists a computable function D(S,G,m) which is equal to zero if and only if d(S,G) ≤ m, and 1 otherwise.

Feldman's intrinsic grammatical complexity is similar to Blum's size measure /5/,/6/. Instead of C2, Blum's complexity measure requires that an algorithm exist for computing the finite number of grammars with any fixed complexity. Wharton /67/ defines a complexity measure based on Blum's measure and requires a condition similar to C2. Feldman's complexity measure is more restrictive than Blum's and Wharton's complexity measures, but it is still a general complexity measure. Many intuitive complexity measures will meet these requirements.

Salomaa /56/ defined an index of a context-free grammar which is analogous to the derivational complexity. Since the relationship between the number of productions and the index of a grammar is undecidable, the index of a context-free grammar will not meet the requirements for evaluating grammars.

4.3.3. COMPLEXITY MEASURE FOR EVALUATION

A simple complexity measure of grammars is needed to compare the sizes of grammars constructed from a cluster analysis on a set of sample strings. The general complexity measure given by Blum and modified by Wharton meets most requirements. The axioms given by Blum for his complexity measure effectively enumerate a class of grammars, which is not necessary for evaluating a set of grammars. We now modify and restate the size measure of grammars as follows.

Let C be a set of grammars G = (Vn, Vt, R, A). Then a general complexity measure on C is a mapping f from C into the nonnegative real numbers which satisfies the following axioms, where n(·) and m(·) are defined as before:
(1) f is a positive unbounded function of n(Vn), n(Vt), n(R) and m(G);
(2) the effect of m(G) on the function value is exponential;
(3) increasing any argument increases complexity.

Example 1: Let C be a set of context-free grammars. Then f(G) = n(Vn) + n(Vt) + n(R) + e^m(G) is a complexity measure on C.

Example 2: Let C be a set of Chomsky normal form grammars. Then f(G) = n(Vn) + n(Vt) + n(R) is a complexity measure on C.

The number of productions which can be constructed with the right hand side of each production having length less than or equal to m(G) increases exponentially with m(G). Therefore it is reasonable to assume that the contribution of m(G) is exponential.
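The size measures of Examples 1 and 2 are simple enough to record directly. The following sketch is ours and merely restates the two formulas for comparing grammars:

    import math

    def size_complexity(n_vn, n_vt, n_r, m_g):
        """Example 1: f(G) = n(Vn) + n(Vt) + n(R) + e**m(G) for a
        context-free grammar, where m(G) is the length of the
        longest right hand side."""
        return n_vn + n_vt + n_r + math.exp(m_g)

    def cnf_size_complexity(n_vn, n_vt, n_r):
        """Example 2: for Chomsky normal form grammars m(G) = 2 is
        fixed, so the exponential term can be dropped."""
        return n_vn + n_vt + n_r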
4.3.4. COMPLEXITY MEASURE FOR SELECTING THE BEST GRAMMAR

We will adopt complexity measures based on information theory to select the best grammar for a set of sample strings. There are at least three different approaches to such complexity measures (cf. Sec. 2.4.2.). Since the inference procedure proposed for this research is similar to Cook's hill-climbing method, we will adopt the complexity measure given by Cook /11/ in the sequel. The complexity of a grammar can be measured by the information required to specify that grammar.

The information required to specify a grammar can be determined in three different situations. If a probability distribution over a class of grammars C is given, then the information conveyed by the selection of any particular grammar G is -log Pr(G|C). In the second situation, we need a preliminary definition of a grammar-grammar, which is a grammar for generating grammars.

Definition: A grammar-grammar G' = (Vn', Vt', R', A') on the terminal alphabet Vt is defined to be a context-free grammar such that
(1) Vn' ∩ W = ∅;
(2) Vt' ⊆ W ∪ Vt ∪ {→} ∪ {,};
where A' is the starting symbol, W is the universe of nonterminal symbols, and "," is used to separate the rules of R.

If a grammar-grammar G' is given, the information required to select G depends on the probability of generating G by G'. The third situation is to define a standard stochastic grammar-grammar G' that generates grammars on the vocabulary of the given G, and to determine the probability of generating G using this particular G'.

Let G be the context-free grammar whose productions are A1 → x1, A2 → x2, ..., Ar → xr, where x1, x2, ..., xr are strings in Vt+ and A1, ..., Ar are elements of Vn. The complexity of G is the sum of the complexities of its productions. Since G is context-free, the complexity of a production is determined by the complexity of its right hand side. Thus

C(G) = Σ_{i=1}^{r} C(xi).

We now formulate the computation of C(y), y ∈ Vt+. Suppose that y has length K and involves the symbols y1, ..., ys, say k1, ..., ks times, respectively, so that Σ_{i=1}^{s} ki = K. Let "#" be a special "stop symbol". The string y# can be generated by the stochastic grammar G which has the single alternative set

A → y1A | ... | ysA | #    (k1/(K+1), ..., ks/(K+1), 1/(K+1)).

The probability of generating y#, i.e., of generating y and then stopping, is

(1/(K+1)) Π_{i=1}^{s} (ki/(K+1))^ki.

The complexity of y can be measured by the negative logarithm of this probability:

C(y) = log(K+1) + Σ_{i=1}^{s} ki log((K+1)/ki) = (K+1) log(K+1) - Σ_{i=1}^{s} ki log ki.

The complexity measure C(y) does not consider the order in which the symbols yi occur in y. A more complex measure would involve the conditional probabilities of the symbols, derived from their relative conditional frequencies in the order in which they appear in y.

If G is a stochastic context-free grammar and has the productions

A1 → x1,1 | x1,2 | ... | x1,m1    (P1,1, P1,2, ..., P1,m1)
...
Ar → xr,1 | xr,2 | ... | xr,mr    (Pr,1, Pr,2, ..., Pr,mr)

where Ai ≠ Aj for i ≠ j, and Σ_j Pi,j = 1 for 1 ≤ i ≤ r, then the complexity of G is the sum of the complexities of these alternative sets,

C(G) = Σ_{i=1}^{r} C(Ai → xi,1 | ... | xi,mi).

The information conveyed by choosing one of the alternatives, say xi,j, is -log Pi,j. Thus

C(G) = Σ_{i=1}^{r} Σ_{j=1}^{mi} (-log Pi,j + C(xi,j)),

where the first term represents the information inherent in the jth choice, and the second term represents the complexity of the chosen right hand side, as derived above.

Cook showed that if the length of y (the right hand side of a production) increases while the relative frequencies of the symbols remain the same, then C(y) increases. In addition, if the length of y remains constant while the number of symbols (other than #) that occur in y increases, then C(y) increases. Finally, Cook showed that for a given length of y and a given number of symbols, C(y) is greatest when all the symbols (except for #) have equal frequencies. These properties of C(y) are useful when trying to find grammars that are simpler than a given one.
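The computation of C(y) and C(G) can be sketched as follows; the encoding of a grammar as a list of (left hand side, list of (right hand side, probability)) alternative sets is our own, and logarithms are taken base 2.

    import math
    from collections import Counter

    def string_complexity(y, base=2):
        """Cook's complexity of a right hand side y:
        C(y) = (K+1) log(K+1) - sum_i k_i log k_i,
        the negative log-probability that the one-production
        stochastic grammar A -> y1 A | ... | ys A | # generates y#."""
        K = len(y)
        total = (K + 1) * math.log(K + 1, base)
        return total - sum(k * math.log(k, base) for k in Counter(y).values())

    def grammar_complexity(alternative_sets, base=2):
        """C(G) = sum over all alternatives of (-log P_ij + C(x_ij))."""
        return sum(-math.log(prob, base) + string_complexity(rhs, base)
                   for _, alternatives in alternative_sets
                   for rhs, prob in alternatives)

For instance, string_complexity("aab") evaluates to 4 log 4 - 2 log 2 = 6 bits, and doubling the string to "aabaab" raises the value, in line with Cook's first property above.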
These properties will not be of much use for finite-state grammars, since the right hand sides of productions have lengths less than three and all symbols occurring on the right hand side of a production are different. However, the complexity measure depends not only on the right hand sides of productions but also on their probabilities, so it is still valid for measuring the complexity of a finite-state grammar. The context-free grammar in Chomsky normal form has the same drawback. However, the second property is useful in searching for a less complex Chomsky normal form grammar.

Several remarks need to be made in connection with this complexity measure. These remarks will be the basis for the heuristic merging procedures in Chap. V.
(1) If a new production is shorter and decreases the number of symbols in the original production, then that production decreases the complexity.
(2) Decreasing the length of the right hand side of a production while increasing the number of different symbols increases complexity.
(3) The complexity of a production of the form X → a^n cannot be reduced by breaking it down into shorter productions.
(4) If a substring longer than two symbols occurs repeatedly on the right hand side of a production, a decrease in complexity is possible by substituting a symbol for the entire substring.
(5) If the given grammar's productions contain many disjuncts, a reduction in complexity may be achieved by adding a disjunctive production.

4.4. DIFFERENCE MEASURES BETWEEN LANGUAGES

In approximate language identification, the difference between the sample strings and a language can be used to decide the degree of appropriateness of the language in describing the sample. In grammatical inference by the constructive method, it is very important to determine whether or not the language generated by an inferred grammar can sufficiently represent the sample strings. This question will be studied for finite-state grammars and context-free grammars.

4.4.1. GENERAL DISCUSSION

Three different approaches to the notion of a difference measure between languages were introduced in Sec. 2.3.3. We will briefly describe these measures before we define another difference measure.

Cook /11/ defined a discrepancy measure for the difference between a probabilistic sample S and a p-language L(G), based on the probability distributions of a language and a set of sample strings, formulated as

D(L(G),S) = Σ_{x∈Vt+} |p(x)(-log p(x) + C(x)) - q(x)(-log q(x) + C(x))|,

C(x) = (K+1) log(K+1) - Σ_{j=1}^{s} kj log kj,

where p(x) and q(x) are the probabilities of x in S and L(G), respectively; K is the length of x; s is the number of different symbols in x; and kj is the number of times the jth symbol occurs in x.

The discrepancy between S and L(G) depends on the lengths, the numbers of different symbols, and the probabilities of the strings in S and L(G). For infinite languages, there is no guarantee that the discrepancy will converge. The value of the discrepancy ranges from zero to infinity. This discrepancy measure might not show the actual relationship between S and L(G). For instance, if L(G) and S have a complement relationship, the discrepancy can be any value from zero to infinity. Also, S and L(G) are probabilistic languages, but the sum taken over all x in Vt+ is not consistent with a probabilistic word function. The difference measure of languages will be defined to overcome these disadvantages.
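For illustration, once the word functions p(x) and q(x) have been tabulated over a finite set of strings, the discrepancy restricted to that set can be computed as below. The data structures are ours, and string_complexity is the sketch from Sec. 4.3.4.; restricting the sum to the strings actually supplied makes the truncation question discussed next explicit.

    import math

    def info_term(prob, x, base=2):
        """p(x) * (-log p(x) + C(x)); taken as zero when p(x) = 0."""
        if prob == 0.0:
            return 0.0
        return prob * (-math.log(prob, base) + string_complexity(x, base))

    def discrepancy(p, q, base=2):
        """Cook's discrepancy, summed over the finite support of the
        two dicts (string -> probability) instead of all of Vt+."""
        support = set(p) | set(q)
        return sum(abs(info_term(p.get(x, 0.0), x, base) -
                       info_term(q.get(x, 0.0), x, base))
                   for x in support)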
Wharton /67/ defines metrics on the class of languages over a finite terminal vocabulary Vt which measure the difference between two languages. Two kinds of metrics are defined: the discrete metric and the weight metric. The discrete metric is used in exact language identification and is not discussed here. The weight metric is based on a sequence of weights assigned to the words of Vt* according to their lexicographic order, and a norm. The sum of the weights in a sequence of weights is required to be bounded, and the individual weights wi must be positive. If the sum of the weights equals one, then the sequence of weights is normalized; in this case Vt* is a probabilistic language. This difference measure cannot reflect the difference between two probabilistic languages, since the weights do not depend on the original probabilities. Although the weight metric has several interesting properties, it is not appropriate for this research, because we infer p-grammars.

Maryanski /46/ defined several distance measures for p-languages, based on differences in word probabilities. Some of the distance measures have bounded values but some do not. The most useful distance measures are the absolute and square difference measures, defined as follows. Let pi(x) be the probability of x in the ith language. Then the absolute and square differences for approximating p-language L1 by L2 are:

Da(L1,L2) = Σ_{x∈L1} |p1(x) - p2(x)|

Ds(L1,L2) = Σ_{x∈L1} (p1(x) - p2(x))^2

These distance measures do not involve the subset of the p-language L2 whose strings are not in the p-language L1. Therefore the distance measures do not actually measure the absolute difference between two p-languages; that is, Da(L1,L2) ≠ Da(L2,L1), making these distance measures inadequate to some extent. If L1 is finite and the size of L2 increases monotonically, then the distance between L1 and L2 will increase monotonically. This property is not desirable when trying to select the best language to represent L1.

The difference between two languages L1 and L2 can be divided into three parts, related to the three subsets L1-L2, L2-L1, and L1 ∩ L2. The subset L1-L2 contains strings in L1 but not in L2; L2-L1 is interpreted analogously; L1 ∩ L2 contains strings in both L1 and L2. The difference contained in L1 ∩ L2 is the word probability difference. However, the difference provided by those strings in L1-L2 or L2-L1 consists of both the word probability difference and the string syntactical difference. If L1 and L2 contain the same strings, then either Cook's discrepancy measure or Maryanski's distance measures may indicate the difference between L1 and L2 properly. If there are some strings in L1 and not in L2, or vice versa, these two difference measures may not indicate the difference properly. Among all difference measures for languages, the discrepancy measure based on information theory best meets our intuitive requirements for a difference measure, since it considers both the word probability difference and the string difference.

4.4.2. DEFINITIONS AND PROPERTIES

Several difference measures between two p-languages will be defined in this section, based on Cook's discrepancy measure D(L(G),S) given in Sec. 4.4.1. As mentioned previously, there is no guarantee of convergence for an infinite language, and the value of the discrepancy ranges from zero to infinity. Since S has a finite cardinality, we could sum only over those x in S, so that the discrepancy would be bounded from above.
However, the loss of information would be significant, and the value of the discrepancy would not reflect the actual difference between S and L(G). In order to find a measure which will minimize the information loss and will converge, we limit the number of strings counted in measuring the discrepancy, as explained below.

For a finite terminal vocabulary Vt and a positive integer m, the number of strings of length at most m is finite. To restrict the computation by string probability instead, strings of L2 are accumulated stepwise: (4) if q' < q, a string in L2 is selected according to its lexicographic order, and its probability is added to q'; (5) step (4) is continued until q' ≥ q. The computation then involves all strings in L1 plus those strings selected from L2.

In computing the distance between S and L(G), we have proposed two different ways to limit the number of strings entering the computation. We now formalize these ideas and define a new distance measure. In using the length of a string to restrict the number of strings in the computation, let m = max_{x∈S} ℓ(x), and let L' = {x | x ∈ L(G), ℓ(x) ≤ m}. In using the string probability to restrict the number of strings in the computation, let the proportion of L(G) which will be in the computation be q, 0 ≤ q ≤ 1.

Definition: B(x) = p(x)(-log p(x) + C(x)) - q(x)(-log q(x) + C(x)), where p(x), q(x) and C(x) are defined as before.

Definition: The absolute distance between S and L(G) is

Da(S,L(G)) = Σ_{x∈L'} |B(x)|.

Lemma 1: The absolute distance between S and L(G) is bounded above.

Proof: Let m = max_{x∈S} ℓ(x), and let L' = {x | x ∈ L(G), ℓ(x) ≤ m}. G has terminal set Vt. Since Vt has finite cardinality, L' has finite cardinality. We only need to prove that the distance for an individual string is bounded above. Assume L' has cardinality n. Then

Da(S,L(G)) = Σ_{x∈L'} |p(x)(-log2 p(x) + C(x)) - q(x)(-log2 q(x) + C(x))|
≤ n · max{|(p(x) - q(x))C(x) - p(x) log2 p(x) + q(x) log2 q(x)|}
≤ n · max{|(p(x) - q(x))C(x)| + |p(x) log2 p(x)| + |q(x) log2 q(x)|}.

Since 0 ≤ p(x), q(x) ≤ 1, the terms |p(x) log2 p(x)| and |q(x) log2 q(x)| are bounded, and C(x) is bounded on the finite set L'. Hence there is a finite N with |B(x)| ≤ N for all x in L', and Da(S,L(G)) ≤ n·N.

Definition: The square distance between S and L(G) is

Ds(S,L(G)) = Σ_{x∈L'} (B(x))^2.

Lemma 2: The square distance is bounded above.
Proof: Analogous to the proof of Lemma 1; Ds(S,L(G)) ≤ n·N^2.

Definition: The mean square distance between S and L(G) is

Dms(S,L(G)) = Σ_{x∈L'} q(x)(B(x))^2.

Lemma 4: The mean square distance is bounded above.
Proof: Analogous to the proof of Lemma 1. Since 0 ≤ q(x) ≤ 1, Dms(S,L(G)) ≤ n·N^2.

The hypotheses for the Kolmogorov-Smirnov maximum deviation tests are:
H1: Fn(x) ≠ F0(x) for some x; use the K statistic.
H1: Fn(x) > F0(x) for some x; use the K+ statistic.
H1: Fn(x) < F0(x) for some x; use the K- statistic.

Birnbaum /4/ has provided tables to five decimal places. Accept H0 at significance level α if K is less than the critical value found in the table for the given n and α. For the one-sided test, accept H0 at significance level α if K+ is less than the critical value found in the table for the given n and α. For the K- statistic, accept H0 at significance level α if K- is greater than the critical value found in the table for the given n and α.

Since the sensitivity of K is not concentrated upon a particular type or class of alternatives, it is, in effect, a general test of goodness of fit.
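The computation of the maximum deviation statistics over a lexicographic enumeration of strings can be sketched as follows. The representation is ours, and since sign conventions for the one-sided statistics vary, both one-sided deviations are returned along with the two-sided K; the returned values are then compared with the tabled critical values of Birnbaum /4/ for the given n and α, as described above.

    def ks_statistics(sample_probs, language_probs, ordering):
        """K-S statistics between the empirical distribution F_n
        (from the p-sample) and the hypothesized F_0 (from the
        p-language), with word probabilities accumulated in the
        fixed lexicographic order `ordering`.  Returns the
        two-sided K and the two one-sided deviations."""
        fn = f0 = 0.0
        k = k_plus = k_minus = 0.0
        for x in ordering:
            fn += sample_probs.get(x, 0.0)    # empirical CDF F_n
            f0 += language_probs.get(x, 0.0)  # hypothesized CDF F_0
            k_plus = max(k_plus, fn - f0)     # deviation F_n above F_0
            k_minus = max(k_minus, f0 - fn)   # deviation F_n below F_0
            k = max(k, abs(fn - f0))          # two-sided deviation
        return k, k_plus, k_minus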
The most appropriate classical test against which to compare it is the chi-square test. The K-S test is superior to the chi-square test in the following ways:
(1) The K-S test requires only the relatively modest assumptions that sampling is random and that the sampled population is continuous, whereas the chi-square test assumes that the sample size is infinite.
(2) The exact distribution of K is known and tabled for small sample sizes, whereas the exact distribution of the chi-square statistic is known and tabled only for infinite sample size.
(3) The chi-square test is only an approximate test at all sample sizes, and the degree of approximation is difficult to assess, whereas the K-S test is exact at small sample sizes and its degree of approximation at large sample sizes is more readily assessable.
(4) The K+ and K- test statistics were designed to test for deviations in a given direction, and do so easily, whereas the chi-square test must be specially modified and conducted in an unconventional fashion to do so.
(5) The K-S test uses ungrouped data, whereas the chi-square test uses grouped data.

The chi-square test, on the other hand, is superior to the K-S test in the following ways:
(1) The chi-square test does not require that the hypothesized population be completely specified in advance of sampling.
(2) The chi-square test can be applied to discrete populations, but the K-S test cannot. When the assumption of continuity is not met, the K-S test is conservative.
(3) Chi-square values can be meaningfully added by making the appropriate reduction in degrees of freedom.

Since all the assumptions for using the K-S test are not usually met in grammatical inference, the K-S test, just like the chi-square test, is not an optimum test for grammatical inference. The major difficulty lies in the random experiment. In general, we assume that the experiment set is a random sample. But since an infinite language contains an infinite number of strings and the actual string distribution is unknown, it is technically impossible to decide the sample size needed for an unbiased estimate of string probability. Although both the chi-square and K-S tests have many disadvantages in grammatical inference, they are still valuable tests for deciding which language is most appropriate for a sample. Presumably, the inaccuracy caused by unsatisfied conditions on a test is the same from language to language. Since the conditions under which the test is applied are the same for all candidate languages of a sample, a constant error is expected in the tests.

4.6. SUMMARY

This chapter has concentrated on the definitions of the solution grammar, complexity measures and statistical acceptance criteria for grammatical inference. The problem of defining solution grammars for grammatical inference by constructive methods has been discussed, and the solution grammar for this thesis has been defined and characterized by grammatical complexity and acceptance criteria. The concept of grammatical complexity measures has been reviewed and discussed. A size measure of grammars is defined for evaluating the performance of dissimilarity measures, and a complexity measure based on information theory is defined which will be used to optimize the selection of grammars. Difference measures between p-languages have been reviewed and discussed; several difference measures based on information theory are defined and their properties investigated. The proposed difference measures are shown to be more realistic than those in the literature. The use of the chi-square goodness of fit test and its application to grammatical inference have been discussed. The Kolmogorov-Smirnov maximum deviation test is proposed and compared to the chi-square test, and its application to grammatical inference is discussed. Techniques presented in this chapter will be used to develop the constructive methods of Chap. V.

CHAPTER V

INFERENCE OF PROBABILISTIC GRAMMARS

5.1. INTRODUCTION

In the previous chapters a number of constructive methods for grammatical inference have been discussed.
In this chapter, we will integrate these ideas into a new technique for grammatical inference with finite-state and Chomsky normal form p-grammars. The essential features of any constructive method are: (1) grammars are constructed based on a sample of strings and heuristic procedures; (2) each constructed grammar is examined and a relatively better grammar is selected; (3) the language generated by each constructed grammar is compared with the sample, and an acceptance criterion for the constructed grammar is evaluated, based on the difference between the language and the sample.

The problem studied in this chapter is to infer a p-grammar for a p-sample by a constructive method. The p-sample is assumed to be randomly drawn from a p-language. The type of grammar being constructed is the same as the type of p-language assumed for the p-sample. The constructive procedure for this problem consists of the following components:
(1) Analyzing the syntactical structure of the sample.
(2) Constructing the initial grammar from the sample strings.
(3) Generating candidate grammars from the initial grammars.
(4) Evaluating an acceptance criterion.

Methods for analyzing the syntactical structure of sample strings were presented in Chap. III. Techniques for generating candidate grammars were discussed in Chaps. II and IV. Methods for acceptance testing of inferred grammars were described in Chap. IV.

The overall picture of the constructive method proposed in this thesis is described below.
(1) Cluster the sample strings to establish their syntactic structure in an inference tree.
(2) Construct an initial grammar for each cluster by merging the partial grammars for the strings, according to the subtree defining the cluster. Then generate candidate grammars for the clusters.
(3) Merge the candidate grammars for the clusters according to the inference tree to produce a candidate grammar for the sample.
(4) Compare the language generated by the candidate grammar with the p-sample. If the language is not acceptable, alter either the syntactic description of structure or the merging rules, and repeat.

The problem of assigning production probabilities when inferring a p-grammar for a p-sample will be discussed in Sec. 5.2. The procedure for generating candidate p-grammars is divided into two problems. The first problem is to generate an initial p-grammar from the p-sample, based on the inference tree (Step (2)). The second problem is to apply heuristic strategies to merge productions in the initial grammar and construct new p-grammars, called candidate p-grammars, for the p-sample (Step (3)). These two problems will be discussed in Secs. 5.3. and 5.4. for finite-state and context-free grammars.

5.2. ASSIGNING PRODUCTION PROBABILITIES

In this section, we discuss the procedure for assigning production probabilities. The techniques will be used in Secs. 5.3. and 5.4. There is no theoretically-based technique for assigning probabilities to the productions of an ambiguous grammar. Since the problem of assigning production probabilities is secondary in this thesis, the production probabilities will be assigned in a simple way, differently for different situations. In assigning production probabilities, only the following four situations will be considered; these conditions will apply later to both initial and candidate p-grammars. Throughout, A is the initial symbol.
(1) Let "ab" be a string with probability P, and let a grammar generating the string contain the productions: A —%?aB, B-—9 b where A, B are nonterminals and a, b are terminals. (2) (3) (4) 102 The first production in the derivation sequence of the string is assigned the string probability. In this case the production A -4'aB will have probability P and B -9 b, probability one. Let A 49 a3, A-—9 aC be two productions with probabil- ities P1 and P2 respectively. Let these two produc- tions be replaced by the productions A —9 aD, D-—9 BIC. Then Pr(A ——7> aD) = P1 + P2 Since Aa—a a3 is replaced by A -e9aD and D —a~B. Similarly, Pr(At—a aD) Pr(D —9 B) = P1 Pr(A ——9aD) Pr(D -—-> C) = P2 Thus, Pr(D —9 B) = P1/ (P1 + P2) Pr(D ——-,>C) =- P2/(P1 + P2) Let two productions and their probabilities be A"; B3131'32‘32 (P11:P12): C "’BsBl'Bsz (Pu-1’22) If A and C are replaced by a new nonterminal U, then the probability of the new production is assigned as the average of the old production probabilities mm a» BBBIIBZBZ) - ((1)11 + Pzp/z, (P12 + P22)/2) Let a subset of the productions for a grammar and their probabilities be e) Dl-<>aD2, f) D2":a e 103 If B1 and 32 are replaced by a new nonterminal B, then the productions a) and b) are merged into A -—9icB|dD1|dD2 (P11 + P21, P12, P22). Since the productions c) and d) are related to a) and b), a new production B —e>b|c|aB|a is created. The production probabilities are assigned as follows: Pr(B-»e»b) - P11 x P31/(P11 + P21) Pr(B~—e c) 8 P11 x P32/(P11 + P21) Pr(B~—é»aB) = P2l x P41/(P11 + P21) Pr<3“‘* a) ‘ P21 1 P42/(1’11 + P21) These four conditions include all cases occurring in the inference procedure in subsequent sections. gygg INFERENCE OF FINITE-STATE P-GRAMMARS According to Sec. 5.1. any constructive procedure usually consists of three subproblems: analyzing the syn- tactical structure Of a p-sample, constructing candidate p- grammars and acceptance testing of inferred p-grammars. The first and the third subproblems have been carefully discussed in Chap. III and IV. The subproblem of constructing candidate finite-state p-grammars (FSPG) will be discussed in this section. This subproblem is divided into two steps: (1) constructing initial finite-state p-grammar, and (2) heuristically merging productions to produce candidate p- grammars. These two steps will be described separately. The procedures described below are particularly effective with 104 the inference tree produced in Sec. 3.4. 5.3.1. CONSTRUCTING INITIAL FINITE-STATE P-GRAMMAR The initial finite-state p-grammar exactly generates the p-sample and is constructed based on the results of the cluster analysis of sample strings. The following pre- liminary definitions are needed: Definition: A canonical finite-state p-grammar (CFSPG) G = (Vn, V c R, P, A) for a string x = a1...an t) with probability p is defined below. Vn = (A, A A _1 }is a set of nonterminals; 1,000, n Vt = {a1, a2, ...... ,an} is a set of terminals; A is the start symbol; R is the set of productions, defined recursively as . <.< A —a>alAl, Ai-l"” aiAi for 2-1-n-l, An-lfififi'an P is the set of production probabilities. All productions have probability one except for the production A-—9 alAl’ which has probability p. Definition: Two productions are eqpivalent if and only if they have the same probabilities and identical right hand sides. The procedure for constructing the initial finite- state p-grammar from a p-sample consists of two steps. First, construct a CFSPG for each cluster or subtree of the labelled MST (Sec. 3.5.). 
5.3. INFERENCE OF FINITE-STATE P-GRAMMARS

According to Sec. 5.1., any constructive procedure consists of three subproblems: analyzing the syntactical structure of a p-sample, constructing candidate p-grammars, and acceptance testing of inferred p-grammars. The first and third subproblems were discussed in Chaps. III and IV. The subproblem of constructing candidate finite-state p-grammars (FSPG) will be discussed in this section. This subproblem is divided into two steps: (1) constructing an initial finite-state p-grammar, and (2) heuristically merging productions to produce candidate p-grammars. These two steps will be described separately. The procedures described below are particularly effective with the inference tree produced in Sec. 3.4.

5.3.1. CONSTRUCTING THE INITIAL FINITE-STATE P-GRAMMAR

The initial finite-state p-grammar exactly generates the p-sample and is constructed from the results of the cluster analysis of the sample strings. The following preliminary definitions are needed.

Definition: A canonical finite-state p-grammar (CFSPG) G = (Vn, Vt, R, P, A) for a string x = a1...an with probability p is defined as follows. Vn = {A, A1, ..., An-1} is the set of nonterminals; Vt = {a1, a2, ..., an} is the set of terminals; A is the start symbol; R is the set of productions, defined recursively as A → a1A1, A_{i-1} → aiAi for 2 ≤ i ≤ n-1, and A_{n-1} → an; P is the set of production probabilities. All productions have probability one except the production A → a1A1, which has probability p.

Definition: Two productions are equivalent if and only if they have the same probabilities and identical right hand sides.

The procedure for constructing the initial finite-state p-grammar from a p-sample consists of two steps. First, construct a CFSPG for each cluster or subtree of the labelled MST (Sec. 3.5.). Then combine all the CFSPG's and make the necessary consolidations. For each cluster, the construction starts at the shortest string in the cluster and then follows the depth-first order in the labelled MST (top to bottom first, left to right second) to construct a CFSPG for each string in the tree. After all strings in a cluster are visited, the productions are grouped together to form an initial p-grammar for the cluster. Since the initial finite-state p-grammar has to generate the sample strings, the consolidation procedure is limited to replacing equivalent productions without creating recursive productions. When two productions are equivalent, one of them is eliminated, and the occurrences of the left hand symbol of the eliminated production are replaced by the left hand symbol of the other production. The following example demonstrates the procedure; a code sketch of the single-string construction follows the example.

Example 1: Let S1 = {cb, cab, caab, caaab} be a set of strings with corresponding probabilities {1/4, 1/8, 1/16, 1/16}. The labelled MST connects the strings in the sequence shown, with root cb, string cab at depth 1, string caab at depth 2, and string caaab at depth 3. The stepwise construction is as follows, where capital letters are nonterminals and, unless otherwise stated, all productions have probability one.

(1) Construct the CFSPG for the string "cb": A → cB1 (1/4), B1 → b.
(2) Construct the CFSPG for the second string "cab": A → cB2 (1/8), B2 → aB3, B3 → b.
(3) The productions B1 → b and B3 → b are equivalent. Delete B3 → b and change all occurrences of B3 to B1. Thus we obtain: A → cB1 (1/4), B1 → b, A → cB2 (1/8), B2 → aB1.

Repeating this procedure, we obtain the initial finite-state p-grammar G1 for S1, which has the following set of productions:
A → cB1 (1/4), B1 → b, A → cB2 (1/8), B2 → aB1, A → cB3 (1/16), B3 → aB2, A → cB4 (1/16), B4 → aB3.
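A sketch of the single-string CFSPG construction follows; the naming scheme for nonterminals (a tag keeps names distinct across strings) is ours.

    def canonical_fs_pgrammar(x, p, tag=""):
        """Canonical finite-state p-grammar for string x = a1...an
        with probability p, as triples (lhs, rhs, probability):
        A -> a1 A1 (p), A_{i-1} -> ai Ai (1), A_{n-1} -> an (1)."""
        n = len(x)
        if n == 1:
            return [("A", x, p)]
        nt = lambda i: f"A{tag}{i}"
        rules = [("A", x[0] + nt(1), p)]          # A -> a1 A1 (p)
        for i in range(2, n):                     # A_{i-1} -> ai Ai (1)
            rules.append((nt(i - 1), x[i - 1] + nt(i), 1.0))
        rules.append((nt(n - 1), x[n - 1], 1.0))  # A_{n-1} -> an (1)
        return rules

For the string "cab" with probability 1/8 and tag = "'", this yields A → cA'1 (1/8), A'1 → aA'2 (1), A'2 → b (1), the shape of step (2) in Ex. 1.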
5.3.2. CONSTRUCTING THE CANDIDATE FSPG

In this section, we propose a procedure for constructing candidate finite-state p-grammars from an initial finite-state p-grammar. Candidate finite-state p-grammars are constructed to reduce the complexity of the initial finite-state p-grammar. According to the grammatical complexity measure defined in Sec. 4.3.4., the complexity of a grammar can be reduced by introducing recursive and disjunctive productions. A recursive production is a production which has the same symbol occurring on both sides. A disjunctive production is one with more than one branch on the right hand side. The heuristic constructive procedure consists of the following steps, discussed in this section:
(a) creating recursive and disjunctive productions by merging related productions;
(b) consolidating the new grammar and removing undesirable productions which contribute unnecessary complexity.

5.3.2.1. RULES FOR MERGING PRODUCTIONS AND CONSOLIDATION

In this section, we discuss the merger of productions to form recursive and/or disjunctive productions. In setting up the merger rules, we consider the relationships among symbols occurring in different productions. Since the length of the right hand side of each production in a finite-state grammar cannot exceed two, the symbol relationships between two productions are limited. All possible situations and the corresponding consolidations are listed below (a sketch in code is given after Ex. 2). When two productions are to be merged, the rules are applied in the order shown (see Appendix B).

(1) Eliminating one of the productions. When two productions are equivalent, for example A → aB and C → aB, one of them is eliminated and the occurrences of the left hand symbol of the eliminated production are replaced by the left hand symbol of the other production.
(2) Introducing a recursive production. When the terminals and nonterminals are related as in A → aB, B → aA, a recursive production A → aA is created.
(3) Introducing a recursive and disjunctive production. The situation is similar to (2), as in A → aB, B → aC. The production A → aA | aC is created.
(4) Introducing a disjunctive production. When two productions have identical left hand symbols, for example A → aB, A → bB, they are merged into A → aB | bB.
(5) Introducing a new nonterminal. When two productions differ only in the right hand nonterminal, for example A → aB, A → aX, a new nonterminal is introduced to replace all occurrences of B and X.
(6) Creating recursive productions. When the situation A → aB, B → bC arises, productions with C as the left hand symbol or A as a right hand symbol are considered. If applying the sequence of productions shows that the terminal symbols occur repeatedly with the same pattern, recursive productions may be created to replace the set of productions.

5.3.2.2. SEARCH STRATEGY

The rules in Sec. 5.3.2.1. show how pairs of productions are merged. In this subsection, strategies are proposed for selecting pairs of productions for merging from those in the initial finite-state p-grammar. The search strategy uses the labelled MST to investigate the possibility of merging productions. The search process uses a depth-first search to discover recursive as well as disjunctive productions. The constructive procedure first tries to merge productions within a cluster or subtree. According to the depth-first order, the productions generating a string are examined, along with the productions generating its predecessors, for merging. Productions from all clusters are then examined according to the inference tree for further consolidation. The procedures for merging productions from different clusters are similar to those for merging productions within a cluster. Several important heuristic tactics are considered:
(1) Productions with the initial symbol on the left hand side are not allowed to merge with other productions, except productions having the same left hand symbol, or when the merging process is at the final stage.
(2) If the set of strings generated by a set of productions is contained in the set of strings generated by another set of productions, then the first set of productions can be eliminated. This should be done whenever a merger is achieved.
(3) When a new nonterminal is introduced to replace a pair of symbols, the appropriateness of replacing each occurrence should be tested (Sec. 4.5.).

Example 2: Let sample S1 and the initial finite-state p-grammar G1 be defined as in Ex. 1. The merging process starts with the productions that generate the root and its immediate descendant, A → cB1 and A → cB2. According to Rule 5 in Sec. 5.3.2.1., B1 is replaced by B2. Thus we have A → cB2 (3/8), and by Rule 4, B2 → b | aB2 (2/3, 1/3) is created. Similarly, merging A → cB3 with A → cB2, we have A → cB3 (7/16), B3 → b | aB3 (4/7, 3/7). Finally, A → cB4 and A → cB3 are treated the same way, and the final set of productions is: A → cB4 (1/2), B4 → b | aB4 (1/2, 1/2).
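The pair-classification step implied by Rules 1-5 can be sketched as follows. A production is encoded as (left symbol, terminal, right nonterminal or None); the encoding is ours. Rule 6 requires examining chains of productions and is omitted, and probabilities are handled separately, as in Sec. 5.2.

    def classify_pair(p1, p2):
        """Return the first applicable rule of Sec. 5.3.2.1. for a
        pair of finite-state productions, or None.  Rule 5 is tested
        before Rule 4 in this sketch, since its pattern is a special
        case of identical left hand symbols; the order of the pair
        matters, so callers may try both orders."""
        (A, a, B), (C, c, D) = p1, p2
        if (a, B) == (c, D):
            return 1 if A != C else None   # equivalent productions
        if B == C and D == A and a == c:
            return 2                       # A -> aB, B -> aA
        if B == C and a == c:
            return 3                       # A -> aB, B -> aC
        if A == C and a == c and B and D:
            return 5                       # differ only in nonterminal
        if A == C:
            return 4                       # identical left hand sides
        return None

    # classify_pair(("A", "c", "B1"), ("A", "c", "B2")) -> 5
    # classify_pair(("A", "a", "B"), ("B", "a", "A"))   -> 2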
Example 3: Let S = {S1, S2} be a p-sample, where S1 is defined as in Ex. 1 and S2 = {ab, abb, abbb, abbbb} with associated probabilities {1/4, 1/8, 1/16, 1/16}. By a merging process similar to that in Ex. 2, the final finite-state p-grammar for S2 has the productions: A → aC4 (1/2), C4 → b | bC4 (1/2, 1/2). Now we merge productions from the two p-grammars in an heuristic manner. By introducing a new nonterminal B to replace B4 and C4, we obtain the productions:
A → aB | cB (1/2, 1/2), B → b | aB | bB (1/2, 1/4, 1/4).
If the language generated by this grammar is acceptable, then it is the solution grammar. Otherwise, different heuristics must be applied or the syntactic structure must be re-evaluated.

5.4. INFERENCE OF CONTEXT-FREE P-GRAMMARS

Since every context-free language can be generated by a Chomsky normal form grammar, for the inference of context-free p-grammars we propose a constructive method for constructing Chomsky normal form p-grammars (CNFPG) for a p-sample. The constructive procedure consists of two steps: first, construct the initial CNFPG from the p-sample, then merge productions to generate candidate p-grammars for the p-sample. These steps will be discussed separately in this section.

5.4.1. CONSTRUCTING THE INITIAL CNFPG FROM A P-SAMPLE

The steps for constructing the initial CNFPG from a set of sample strings are similar to those for constructing the initial finite-state p-grammar. First, construct a deterministic CNFPG for each cluster by merging the CNFPG's for the strings in the cluster, then combine them and make the necessary consolidations. The construction of a CNFPG for a string consists of the five steps described below.
(1) Construct a complete binary tree (CBT) or partial complete binary tree (PCBT) for each string in the cluster, based on the length of the string. Let n be the length of the string x, and let k, m be positive integers. Then:
(1a) if n = 2^k, construct a CBT with k levels;
(1b) if n = 2^k + m, where 1 ≤ m < 2^k, construct a PCBT with k+1 levels in which the leftmost m nodes at the kth level have two descendants.
(2) Assign a nonterminal symbol to each node in the tree.
(3) Add an edge to each terminal node and assign the symbols in the given string to the added nodes on a one-to-one basis from left to right. If the symbol in the given string is a nonterminal, do not add an edge but assign the nonterminal directly to the node in the CBT (or PCBT).
(4) Form productions from the tree. A node and its successors generate a production, using the nonterminal corresponding to the node as the left hand side and the nonterminals or terminal corresponding to its successors as the right hand side.
(5) Replace equivalent productions.

In the construction of a CNFPG for a cluster of strings, we exploit the self-embedding property of some context-free languages. The common substring relation between strings in the cluster is used in the construction. The constructive procedure for each cluster consists of the following steps (a code sketch of the single-string construction follows the list):
(1) Construct a CNFPG for the shortest string in the cluster.
(2) Perform a depth-first search in the labelled MST for successors. If no successor nodes exist, stop; otherwise continue.
(3) Substitute substrings in the successor for the existing nonterminals whose sentential forms are equal to the substrings. The substitution takes the longest substring first; if there are several alternatives, the leftmost substring has priority.
(4) Construct a CNFPG for the string in Step 3 and consolidate it with the existing productions. Then go to Step 2.
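A sketch of the (P)CBT construction of steps (1)-(4) follows. The recursive rendering and the naming are ours; splitting each span into a larger left half and a smaller right half reproduces the (P)CBT, in which the leftmost nodes of the last full level receive two descendants. Symbols that are already nonterminals (step (3)) are taken to be uppercase strings in this sketch.

    import itertools

    def cnf_for_string(symbols, counter):
        """CNF productions for a list of symbols (terminals and
        previously-introduced nonterminals), as (lhs, rhs-tuple)
        pairs.  Returns (root nonterminal, productions)."""
        if len(symbols) == 1:
            s = symbols[0]
            if s.isupper():              # reuse an existing nonterminal
                return s, []
            b = f"B{next(counter)}"      # preterminal above a terminal
            return b, [(b, (s,))]
        half = -(-len(symbols) // 2)     # ceiling: left half is larger
        left, lr = cnf_for_string(symbols[:half], counter)
        right, rr = cnf_for_string(symbols[half:], counter)
        root = f"B{next(counter)}"
        return root, lr + rr + [(root, (left, right))]

    # cnf_for_string(list("cb"), itertools.count(1)) yields
    # B1 -> c, B2 -> b, B3 -> B1 B2, the shape of step (1) of Ex. 4.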
The initial CNFPG for the p-sample is obtained by combining all the CNFPG's for the clusters and making all necessary final consolidations. In order to avoid the confusion caused by the use of the initial symbol in different places, we use different initial symbols for different strings in a cluster. These differences are eliminated during the mergers.

Example 4: Let S1 be defined as in Ex. 1. The construction of the initial CNFPG from S1 is shown below.
(1) Construct a CNFPG for the shortest string 'cb': A1 → B1B2 (1/4), B1 → c, B2 → b.
(2) The descendant of 'cb' is 'cab'. After replacing all substrings in 'cab' by previously-defined nonterminals, 'cab' becomes 'B1aB2', with probability 1/8.
(3) Construct a CNFPG for 'B1aB2': A2 → B3B2 (1/8), B3 → B1B4, B4 → a.
(4) The descendant of 'cab' is 'caab'. After substituting for all substrings as explained above, 'caab' becomes B3B4B2, with probability 1/16.
(5) Construct a CNFPG for 'B3B4B2': A3 → B5B2 (1/16), B5 → B3B4.

Repeating the procedure for the remaining string and renaming the nonterminals produces the initial CNFPG G1, containing the following productions:
A → B1B2 (1/4), B1 → c, B2 → b, A → B3B2 (1/8), B3 → B1B4, B4 → a, A → B5B2 (1/16), B5 → B3B4, A → B6B2 (1/16), B6 → B5B4.

5.4.2. CONSTRUCTING THE CANDIDATE CNFPG

In this section we present a procedure for constructing candidate CNFPG's from the initial CNFPG. The strategy for constructing a candidate CNFPG is the same as that for candidate finite-state p-grammars described in Sec. 5.3.2. Optimizing the grammatical complexity measure defined in Sec. 4.3.4. requires the constructive procedure to merge related productions into disjunctive and/or recursive productions. A procedure for removing undesirable productions which introduce unnecessary complexity is also needed. These two steps are described in the next two sections.

5.4.2.1. RULES FOR MERGING PRODUCTIONS AND CONSOLIDATIONS

In this section, all possible situations in which two productions from CNFPG's can be merged into disjunctive and/or recursive productions are discussed, and the consolidation for each situation is described. The rules are analogous to those for finite-state grammars. The strategy for using these rules is explained in Sec. 5.4.2.2. A complete listing of all situations is given in Appendix B.

(1) Eliminating one of the productions. When two productions are equivalent, one of them is eliminated. The necessary consolidation is to replace all occurrences of the left hand nonterminal of the eliminated production by the left hand nonterminal of the remaining production.
(2) Introducing a recursive production. Suppose two productions meet all the following conditions:
(2a) both productions have three different symbols;
(2b) both productions have an identical symbol at the same right hand side position;
(2c) the left hand symbol of a production equals the non-identical right hand symbol of the other production, and vice versa.
Then all occurrences of the left hand symbol of one production are replaced by the left hand symbol of the other production, and the second production is eliminated. For example, for A → BC, C → BA, the production A → BA is created.
(3) Introducing a recursive and a disjunctive production. Suppose two productions meet one of the following conditions:
(3a) conditions (2a) and (2b) hold, and the left hand symbol of a production equals the non-identical right hand symbol of the other production, or vice versa.
(3b) One production has all symbols different and the other production has identical right hand symbols; in addition, the left hand symbol of one production equals a right hand symbol of the other production, or vice versa.
Then the consolidation is the same as in situation (2), except that no production is eliminated. For example, for A → BC, D → BA, the production A → BC | BA is created.
(4) Introducing a new nonterminal. Suppose two productions meet all the following conditions:
(4a) the left hand symbols of both productions are identical;
(4b) one of the right hand symbols of both productions is identical, at the same position;
(4c) neither of the non-identical symbols on the right hand sides is identical to the left hand symbol.
Then a new nonterminal is introduced to replace the pair of distinct symbols. For example, for A → BC, A → BD, a new nonterminal E is introduced to replace all occurrences of C and D (a sketch of this consolidation in code is given at the end of Sec. 5.4.2.2.).
(5) Introducing a disjunctive production. Suppose two productions satisfy both of the following conditions:
(5a) the left hand symbols of both productions are identical;
(5b) condition (4b) is not satisfied.
Then a disjunctive production is generated. For example, for A → BC, A → CD, the disjunctive production A → BC | CD is generated.
(6) Creating recursive productions. Suppose two productions satisfy both of the following conditions:
(6a) both productions have three different nonterminals;
(6b) the left hand symbol of one production is identical to one of the right hand symbols, say the first symbol, of the other production.
Then we try to consolidate into recursive productions a set of existing productions having all of the following properties:
(Q1) all symbols in the production are different;
(Q2) either the left hand symbol of the production is identical to the first right hand symbol of one production, or the first right hand symbol of the production is identical to the left hand symbol of the other.
If a repeating pattern of nonterminals occurs when the productions are applied, then a recursive production is introduced; otherwise, the search for related productions continues until all productions have been examined. For example, for A → BC, B → DE, we look for productions which either have A as the first right hand symbol or B as the left hand symbol, and satisfy Q1.

In the inference process, these rules are applied in the order they are listed. All relationships between two productions are tabulated in the Appendix according to these situations.

5.4.2.2. SEARCH STRATEGY

Since strings having a close syntactical relationship are grouped together, the merging process first examines the productions for the strings in the same cluster. Productions corresponding to a string are compared with those for its successors in the labelled MST. Productions for different clusters are then examined according to the inference tree for further consolidation. The heuristic procedure for merging productions from different clusters is similar to that for merging productions within a cluster. Some of the heuristic tactics applied in Sec. 5.3.2.2. can also be applied here.
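The consolidation performed by Rule 4, replacing the two identified symbols by a new nonterminal throughout the grammar, can be sketched as follows (the encoding is ours); the duplicate productions that result are then removed by Rule 1.

    def introduce_new_nonterminal(rules, x, y, new):
        """Replace every occurrence of symbols x and y by `new` in a
        grammar given as (lhs, rhs-tuple) pairs."""
        sub = lambda s: new if s in (x, y) else s
        return [(sub(lhs), tuple(sub(s) for s in rhs)) for lhs, rhs in rules]

    # introduce_new_nonterminal(
    #     [("A", ("B", "C")), ("A", ("B", "D")), ("C", ("a",))], "C", "D", "E")
    # -> [("A", ("B", "E")), ("A", ("B", "E")), ("E", ("a",))]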
Example 5: Let sample S1 and its initial CNFPG G1 be defined as in Ex. 4. The merging process starts with the productions that generate the root and its immediate successors. According to Rule 4 in Sec. 5.4.2.1., A → B1B2 and A → B3B2 are merged into A → B1B2 (3/8), and by Rule 5, B1 → c | B1B4 (2/3, 1/3) is created. Similarly, B5 and then B6 are replaced by B1. After all consolidations, we obtain:
A → B1B2 (1/2), B1 → c | B1B4 (1/2, 1/2), B2 → b, B4 → a.
Assuming this is the best set of productions for S1, the merging process for the cluster stops.

Example 6: Let S = {S1, S2} be a p-sample, where S1 is defined as in Ex. 5 and S2 = {bbb, bbab, bbaab, bbaaab} with associated probabilities {1/4, 1/8, 1/16, 1/16}. By construction and merging processes similar to those in Ex. 4 and Ex. 5, assume the final CNFPG for S2 has the productions:
A → D1D1 (1/2), D1 → b | D1D1 | D1D2 (1/4, 1/4, 1/2), D2 → a.
Now we merge productions from the two p-grammars in an heuristic manner. According to Rule 1 in Sec. 5.4.2.1., D2 is replaced by B4, and by Rule 5, A → D1D1 | B1B2 (1/2, 1/2) is created. Assuming no more productions can be merged, we obtain the solution grammar with the productions:
A → B1B2 | D1D1 (1/2, 1/2), B1 → c | B1B4 (1/2, 1/2), B2 → b, B4 → a, D1 → b | D1D1 | D1B4 (1/4, 1/4, 1/2).

5.5. COMPARISON

In the preceding sections, constructive methods have been proposed for the inference of finite-state and Chomsky normal form p-grammars from a p-sample. In this section, these constructive methods are compared with other constructive methods. Based on the common features of constructive methods described in Sec. 5.1., their generality, complexity, limitations and completeness are compared. The constructive method proposed for the inference of finite-state grammars is compared with the three most significant existing constructive methods for finite-state grammars /3/,/13/,/52/.

In the analysis of sample structure, the proposed method uses a cluster analysis technique based on a labelled MST, while the other three methods /3/,/13/,/52/ use formal derivatives and k-tails (Sec. 3.2.). The clustering method uses dissimilarities as well as common substrings to analyze sample structure, while the formal derivative and k-tails methods only sequentially search for common characters of sample strings to decide the informational relationship among sample strings. The clustering method is more general and less limited than the formal derivative: it will produce a reasonable result for any set of sample strings, while the formal derivative and k-tails methods are inherently limited.

In the construction of finite-state grammars, the proposed method uses the inference tree as a guide, while the other three methods use an arbitrary subset of sample strings, or every string in the sample, guided by the results of the formal derivative, to construct grammars. The former is more efficient than the latter, since it uses organized subsets of sample strings and the inference tree to construct grammars with less effort.

In merging productions, the proposed method uses the relationship between two productions and the concept of grammatical complexity to merge related productions and to simplify the grammar, while the three methods use a derived grammar or merge equivalent nonterminals to produce grammars. The former is more complex, complete and general, but less limited, than the latter.

The proposed method searches for a solution grammar which has least complexity and generates an acceptable language, while the three methods obtain a solution grammar without the complexity criterion.
The constructive method proposed for the inference of Chomsky normal form p-grammars is compared with the three most significant constructive methods for subsets of context-free languages /14/-/16/,/24/,/61/,/62/, and with the two most significant constructive methods for context-free p-languages /11/,/46/. The results of the comparison are presented separately below.

Comparing the proposed method to the three constructive methods for subsets of context-free languages produces three main conclusions. In the analysis of sample structure, the cluster analysis method uses dissimilarities as well as substring relationships to classify sample strings, while the three methods use substring and structural relationships to analyze sample structure. The former is more general, complex and complete, but less limited, than the latter, since the clustering method considers more factors in analyzing sample structure and is not limited to analyzing subsets of context-free languages. The proposed method is more general than the three methods, since it is not limited to the inference of subsets of context-free languages.

The results of comparing the proposed method to the two constructive methods for context-free p-languages /11/,/46/ are summarized below. The clustering method considers various differences between strings to classify the sample strings, while the two methods only use a simple substring relationship to analyze sample structure. In the construction of context-free p-grammars, the proposed method uses the inference tree as a guide to construct grammars and to merge productions, while the two methods start with a simple grammar and use pattern matching techniques or heuristic search of common substrings to merge productions. The proposed method is more general, complex, particular and complete than the two methods, since it uses a well-formed framework to construct grammars. The proposed method also has a more reasonable difference measure for p-languages than those two methods (cf. Sec. 4.4.).

5.6. COMPUTATIONAL CONSIDERATIONS

In this section, we discuss the computational difficulties inherent in the application of the constructive method developed in this chapter. In the literature, all examples demonstrating the performance of constructive procedures either involve a small number of sample strings or are designed for a very specific type of grammar. Since the field of grammatical inference is still in its infancy, few restrictions are placed on solution grammars and no uniform standard of performance exists. Little optimization is possible. In addition, existing constructive methods cannot generate an adequate grammar even when the sample contains a moderate number of strings, much less when a few hundred strings are in the sample. There is no common ground for comparing two different constructive methods. In most cases, constructive methods are developed for specific purposes and under different assumptions and restrictions. A few examples are not enough to show that one constructive procedure is uniformly better than another.

As mentioned before, the constructive method developed in this thesis consists of three main components: analyzing the syntactic structure of sample strings, constructing candidate grammars, and testing acceptance criteria.
In analyzing the syntactic structure of sample strings, the clustering method in this thesis was developed particularly for handling a general situation and a large number of sample strings. There is a problem in obtaining a real and reasonably large set of sample strings. In selecting a real sample, the sample space is the power set of a terminal set, and strings of small length are more likely to be selected than long strings. Thus the syntactic structure of a sample in practice is not obvious. On the other hand, the artificial samples used to demonstrate constructive methods in the literature are usually taken from a hypothesized source language whose syntactic structure is apparent.

Since the construction of candidate grammars is guided by an inference tree, the number of grammars being constructed equals the number of clusters formed for the sample. For a large number of sample strings, a relatively large number of clusters will be formed. The problem of merging related productions to generate candidate grammars becomes more difficult as the number of initial grammars and productions increases. The nature of the difficulty lies both in the number of comparisons needed to find a merger and in the number of productions involved in the consolidation after a merger is adopted. In addition, the problem of assigning production probabilities becomes serious when the consolidations following mergers involve many productions.

The acceptance criterion depends on the type of mergers used and requires knowledge of the language generated by a grammar. Finding the probabilistic language generated by an inferred p-grammar requires a parser. A probabilistic finite-state automaton is needed for parsing a finite-state p-language, and a probabilistic pushdown automaton is needed for parsing a context-free p-language. Constructing and updating these parsers are complicated tasks. Besides, when the resulting p-grammar is not acceptable, it is difficult to decide what should be changed first: the dissimilarity measure, the clustering techniques, or the merging strategies. Defining heuristic strategies to optimize an acceptance criterion is beyond the scope of this thesis.

The entire constructive method is not easy to program. The main difficulties are in the selection of proper data structures for a p-grammar, in the construction of efficient parsers for recognizing strings, and in the implementation of heuristic strategies for optimizing an acceptance criterion.

5.7. SUMMARY

In this chapter we have discussed the problem of constructing p-grammars from a p-sample. A method of assigning production probabilities is described. The procedures for inferring finite-state and Chomsky normal form p-grammars from a p-sample are presented and compared with other constructive methods in the same category. In general, the proposed method is more efficient, complex, general and complete than other existing constructive methods. The constructive methods proposed are based on the techniques discussed in the previous chapters. The method consists of two steps: based on an inference tree, construct an initial p-grammar exactly describing the sample strings, then merge related productions to generate candidate p-grammars. Situations in which two productions may be merged have been discussed, and the necessary consolidations for each situation have been described.
Two examples have been presented to explain how the constructive methods work for the inference of finite-state and Chomsky normal form p-grammars. The computational difficulties in applying the constructive method have been discussed.

CHAPTER VI

SUMMARY, CONCLUSIONS AND FUTURE RESEARCH

This chapter summarizes the main results of the thesis and discusses possibilities for future research.

6.1. SUMMARY AND CONCLUSIONS

This thesis is concerned with studying constructive procedures for inferring a simple and acceptable p-grammar from a p-sample. A heuristic approach is taken to grammatical inference. Clustering methods, grammatical complexity measures, difference measures and statistical tests are used as the tools for developing constructive methods for grammatical inference. The objective was to develop general and efficient constructive methods for inferring finite-state and Chomsky normal form p-grammars that generate not only the p-sample but also a language that resembles the p-sample in some sense.

The most significant models, concepts and techniques available in the literature are reviewed and briefly discussed in Chapter II. Chapter III introduces a procedure for analyzing sample structure. This procedure is one of the major contributions of the thesis. Five measures of dissimilarity between strings, based on the minimal cost of a sequence of edit operations, are defined and compared, and their properties are investigated. Examples are presented to show the effects of different dissimilarity measures in discriminating among strings having different inherent relationships. A labelled MST for a set of sample strings, which leads to an inference tree, is the fundamental notion used to establish structure. A clustering algorithm based on the MST, which uses the notions of "common substrings" and "length" to define inconsistent edges, is proposed. Seven interpretations of inconsistent edges are defined. The removal of inconsistent edges generates the inference tree.

The main contributions of Chapter III are the development of dissimilarity measures and clustering algorithms for analyzing sample structure, none of which had been proposed in the literature. The resulting inference tree is unique to this thesis and is one of the key factors used in developing the general and computationally efficient constructive methods of Chapter V. The procedure is demonstrated to be more general than the methods for analyzing sample structure in the literature.

Grammatical complexity measures and acceptance criteria are investigated in Chapter IV. The problem of defining solution grammars is carefully examined, and a definition of solution grammar is adopted for this thesis. A size measure is suggested for evaluating the performance of dissimilarity measures. A complexity measure based on information theory is defined for optimizing the selection of a grammar. Several difference measures for languages are defined and investigated, especially those based on information theory. For the first time, the Kolmogorov-Smirnov maximum deviation statistic is applied to test the appropriateness of a p-language for a p-sample, and compared with the chi-square goodness of fit test. The main contributions of Chapter IV are the development of difference measures for p-languages and the suggestion of statistical tests and complexity measures for grammatical inference. The difference measures proposed are more realistic than those in the literature.
The development of the constructive methods depends heavily on the techniques in Chapter IV.

Procedures for actually constructing finite-state and context-free p-grammars are proposed in Chapter V. The constructive procedure for finite-state p-grammars is similar to that for Chomsky normal form p-grammars and consists of two parts: construction of an initial p-grammar, and the merging of related productions to generate candidate p-grammars. The construction is guided by the inference tree of Chapter III, and the merging rules optimize the complexity measure of Chapter IV. Rules for assigning production probabilities are described, and the techniques presented in previous chapters are integrated into constructive methods which are compared with the constructive methods in the literature. The main contribution of Chapter V is the synthesis of the constructive method itself. The merging rules are unique to this thesis. Examples are presented to demonstrate the constructive process. The constructive method proposed here is unique in its procedure for analyzing sample structure, its use of difference measures for p-languages, its type of statistical acceptance test, and its heuristic strategies for constructing initial and candidate p-grammars. The proposed constructive methods are more general and complete, handle more complex cases, and are less limited than other constructive methods.

The nature of this research precludes mathematical proofs of several claims made in this thesis. Introducing enough grammatical and mathematical structure to prove theorems would have destroyed the very problem under investigation. The true value of this thesis will only become apparent when it is applied to a real data base. Unfortunately, the generation of such a data base requires image processing equipment not currently available at M.S.U.

6.2. FUTURE RESEARCH

The following topics are suggested for future research in the area of grammatical inference.

(1) In measuring dissimilarity between strings, it is evident that the ability to discriminate among sample strings varies considerably with the dissimilarity measure chosen. The investigation of the effect of different dissimilarity measures on discrimination, and the optimization of dissimilarity measures, are important topics for future research.

(2) The nonoverlapping hierarchical clustering algorithm developed in this thesis is based on a labelled MST. The study of other clustering methods for analyzing sample structure is needed. In particular, the development of overlapping clustering methods for the inference of ambiguous grammars, and the effect of different clustering methods on grammatical inference, are important topics.

(3) The chi-square goodness of fit and Kolmogorov-Smirnov maximum deviation tests have some disadvantages in grammatical inference. Other statistics need to be investigated, particularly non-parametric statistics.

(4) In forming candidate p-grammars from initial p-grammars, the effect of different heuristic strategies, such as the order in which productions are merged, should be investigated. The same can be said for merging the grammars of different clusters.

APPENDIX A

CLUSTERING ALGORITHM

Given a subtree of the labelled MST consisting of node N with immediate successors N1, N2, ..., Nn, and immediate predecessor N' of N, let

kx = length of the maximal substrings common to N and Nx,
ux = number of symbols not used in both N and Nx,
dx = weight of edge (N, Nx),
ℓ(N) = length of node N.
The procedures P1 and P2 for obtaining an inference tree from a labelled MST are given below.

Procedure P1:

* Stack 1 : Labelled MST
* Stack 2 : Inference tree
* Stack 3 : Roots of subtrees not yet classified
* Stack 4 : Immediate successors of N in lexicographic order

1. Get root N from Stack 1
2. Set N' ← 0
3. Push N on Stack 2
4. Set j ← 0
5. Get immediate successors N1, N2, ..., Nn of N from Stack 1
6. Push N1, N2, ..., Nn on Stack 4
7. Call P2(N, N', j)
8. If j ≠ 0, set N' ← N, N ← Nj, go to 3
9. If Stack 3 is empty, stop
10. Pop Stack 3, get N, go to 2

Procedure P2(N, N', j):

* Stack 3 : Roots of subtrees not yet classified
* Stack 4 : Immediate successors of N in lexicographic order

1. If Stack 4 is empty, return; else pop Stack 4, get Nx
2. If ℓ(N) = ℓ(Nx), go to 10
3. If N' = 0, go to 5
4. If ℓ(N) > max(ℓ(N'), ℓ(Nx)) or ℓ(N) < min(ℓ(N'), ℓ(Nx)), go to 10
5. If j = 0, go to 12
6. If k > kx, go to 10
7. If kx > k or u > ux, go to 11
8. If ux > u or dx > d, go to 10
9. If d > dx, go to 11
10. Push Nx on Stack 3, go to 1
11. Push Nj on Stack 3
12. Set k ← kx, u ← ux, d ← dx, j ← x, go to 1

APPENDIX B

B1. RULES FOR MERGING FINITE-STATE PRODUCTIONS

Situation   Pairs of Candidate Productions
(1)  (A→aB, C→aB), (A→aB, B→aB), (A→a, B→a)
(2)  (A→aB, B→aA)
(3)  (A→aB, B→aC), (A→aB, B→a)
(4)  (A→aB, A→bC), (A→aB, A→bB), (A→aA, A→bB), (A→aB, A→b), (A→a, A→b)
(5)  (A→aB, A→aC), (A→aA, A→aC)
(6)  (A→aB, B→bC)

B2. RULES FOR MERGING CHOMSKY NORMAL FORM PRODUCTIONS

Situation   Pairs of Candidate Productions
(1)  (A→BC, D→BC), (A→BA, D→BA), (A→a, B→a), (B→AA, C→AA), (A→AB, B→AB), (A→AA, B→AA)
(2)  (A→BC, C→BA), (A→CB, C→AB)
(3)  (D→BA, A→BC), (D→AB, A→CB), (A→BC, D→AA), (A→BC, B→CC), (A→BC, C→BB), (A→a, B→AA)
(4)  (A→BC, A→BD), (A→CB, A→DB), (A→AB, A→AC), (A→BA, A→CA)
(5)  (A→BC, A→DE), (A→BC, A→CB), ...etc.
(6)  (A→BC, B→CD), (A→BC, C→DE), ...etc.
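For readers who prefer structured code to the goto-style steps of Procedures P1 and P2 above, the following Python sketch mirrors their intent under stated assumptions: the node layout is hypothetical, the ℓ-window test of P2 step 4 is omitted, the count u of unused symbols is approximated from the common-substring length, and the tie-breaking order (largest k, then smallest u, then smallest d) reconstructs the comparisons of steps 5 through 9. Each bypassed or rejected successor becomes the root of a new chain, exactly as Stack 3 collects unclassified subtree roots.

```python
# A structured sketch of Procedures P1/P2, not the thesis's code.
# A labelled-MST node: {"string": s, "succ": [(child_node, edge_weight), ...]}

def common_len(a, b):
    """k: length of the longest substring common to strings a and b."""
    best = 0
    for i in range(len(a)):
        for j in range(i + best + 1, len(a) + 1):
            if a[i:j] in b:
                best = j - i
    return best

def inference_chains(root):
    """Split the labelled MST into chains: from each subtree root, keep
    descending to the single most similar successor Nj; every bypassed
    successor becomes the root of a new chain (the role of Stack 3)."""
    chains, roots = [], [root]
    while roots:
        n = roots.pop()
        chain = [n["string"]]                 # the chain being grown (Stack 2)
        while True:
            best = None                       # ((-k, u, d), node) for Nj
            for x, d in n["succ"]:
                if len(x["string"]) == len(n["string"]):
                    roots.append(x)           # P2 step 2: equal lengths, reject
                    continue
                k = common_len(n["string"], x["string"])
                u = len(n["string"]) + len(x["string"]) - 2 * k
                cand = (-k, u, d)             # longer k, fewer unused, lighter edge
                if best is None or cand < best[0]:
                    if best is not None:
                        roots.append(best[1]) # dethroned Nj goes to Stack 3
                    best = (cand, x)
                else:
                    roots.append(x)           # rejected Nx goes to Stack 3
            if best is None:
                break                         # no acceptable successor: chain ends
            n = best[1]
            chain.append(n["string"])
        chains.append(chain)
    return chains

# Tiny example: "aab" continues the chain from "ab"; "cd" starts a new one.
leaf1 = {"string": "aab", "succ": []}
leaf2 = {"string": "cd", "succ": []}
print(inference_chains({"string": "ab", "succ": [(leaf1, 1.0), (leaf2, 3.0)]}))
```

Each resulting chain corresponds to one cluster of the inference tree, from which one initial p-grammar is constructed.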