3%? *h. ' 3°“ \1‘. ‘ v) A ' 1L. .4, > _ :- a .2?” lei-$3 ‘l C?....“‘ r’fil‘; t Wt???" E | ‘ Y1; ; 'L_> 4 r. . v 'C‘ 5-9 ‘ 1 '9 '3‘ u $33??? ~:-._-.2 '5 (6‘7" ‘-‘ «:5? -. . {ha-E '< ‘15 M. ”é AF" . .J V -..:v. r. - ,wifprr‘fifi' 4 mm; !.' "'r x _ . I" , 3,.) “.3. .g " ' - ‘rw‘v 9:2. U“. ”25,55“... u This is to certify that the thesis entitled The Application of Formal Methods to the Reverse Engineering of Imperative Program Code presented by Gerald C. Gannod has been accepted towards fulfillment of the requirements for M.S. degree in Computer Science M or professor 0-7639 MS U is an Affirmative Action/Equal Opportunity Institution |\\\\\\\\\\ \\ W W W \\\\\\\2\\‘2 ‘\\\\0\\\\2 ‘\\\\1\\\\\i 3 1293 0102 LIBRARY Michigan State Unlverslty PLACE EIN RETUR RN BOXto reomov othls hockoutfrom you r.record TO AVOID FINES returno nor before date duo. DATE DUE DATE DUE DATE DUE Luminous-M The Application of Formal Methods to the Reverse Engineering of Imperative Program Code By Gerald C. Gannod A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Computer Science Department May 1994 ABSTRACT The Application of Formal Methods to the Reverse Engineering of Imperative Program Code By Gerald C. Gannod Formal methods in software development provide many benefits in the forward engineering aspect of software development. One of the advantages of using formal methods in software development is that the formalnotations are precise, verifiable, and facilitate automated processing. Reverse engineering is the process of construct- ing high level representations from lower level instantiations of an existing system. There are two main objectives that the research described in this thesis covers. First, a new approach to the construction of formal specifications from program code has been developed. To this end, a propagational model of translation has been developed that provides the framework for performing reverse engineering using the direct transla- tion of programming constructs into corresponding formal specifications. Using this framework, a number of translation rules for constructing formal specifications for programming primitives, including assignment, alternation, iteration, sequence, and procedure call, have been developed. The second objective of this research involves the movement of procedural software towards ob ject-orientation. We have developed an approach that uses clustering and restructuring of formal specifications at the procedural level to identify objects embedded in programs. Copyright © by Gerald C. Gannod May 1994 To G.T. and RC, I always thought I’d see you again. iv ACKNOWLEDGMENTS I’d like to thank my parents, siblings, and friends for their understanding and support. Well deserved recognition goes to the Committee for Institutional Cooper- ation Summer Research Opportunity Program and the McNair Post Baccalaureate Research Program at Michigan State University (specifically, Mary Lee and Shawn). Without that program, none of this would have been possible. I would also like to thank Bob Bourdeau and the rest of the students in the software bullpen for their insights into this work. My utmost respect and gratitude goes to Dr. Cheng, my advisor and mentor, for helping me find the way. Finally, to the dead (and dread) poets and jazz musicians of the world, may you always say the right verse and play the right note. TABLE OF CONTENTS LIST OF TABLES LIST OF FIGURES 1 Introduction 1.1 Motivation ................................. 1.2 Research Contributions .......................... 1.3 Organization ............................... 2 Background 2.1 Software Engineering ........................... 2.2 Software Maintenance .......................... 2.2.1 A Model of Re-engineering ..... ' ............... 2.3 Applications of Reverse Engineering ................... 2.4 Formal Methods .............................. 2.4.1 Propositional Logic ........................ 2.4.2 Predicate Logic .......................... 2.4.3 Program Semantics ........................ 3 A Translational Approach to Reverse Engineering 3.1 Approach ................................. 3.2 Model of Programs with Annotations .................. 3 .2. 1 Variables .............................. 3.2.2 Statements ............................ 3.2.3 Annotations ............................ 3.2.4 Relating Statements and Annotations .............. 3.2.5 Annotation Interpretations .................... 3.3 An Annotation Interpretation based on sp ............... 3.3.1 Translational Propagation .................... 4 Primitive Constructs vi viii ix [emu—Iii CDCDWOSMIBQ 10 11 l4 14 15 16 18 19 19 20 20 21 23 4.1 Assignment ................................ 23 4.2 Alternation ................................ 24 4.3 Sequence .................................. 25 5 Iterative and Procedural Constructs 27 5.1 Iteration .................................. 27 5.2 Procedural Abstractions without Recursion ............... 30 6 Moving towards Object-Orientation 34 6.1 Identification of Classes Using Formal Specifications .......... 34 6.1.1 Guidelines ............................. 35 6.1.2 Specification Language ...................... 36 6.1.3 Example Identification of Candidate Objects .......... 37 7 Modeling the Representation and Abstraction of Imperative Lan- guages 41 7.1 Properties of Block Structured Languages ............... 42 7.1.1 Static Scope Rules ........................ 42 7.1.2 Statements ............................ 43 7.1.3 Procedural Abstractions ..................... 43 7.2 Modeling ................... 7 ............... 44 7.2.1 Variables, Symbols, and Types .................. 44 7.2.2 Statements, Rules, and Formal Specifications .......... 47 7.3 Example .................................. 54 8 Related Work 59 8.1 Formal Approaches ............................ 59 8.1.1 Function Abstraction ....................... 59 8.1.2 A Knowledge Based Transformational System ......... 62 8.2 Object-Orientation ............................ 65 8.2.1 Object Identification ....................... 65 8.2.2 REDO - Objects ......................... 68 8.3 Comparison of Approaches ........................ 70 9 Conclusions and I'hture Investigations 72 BIBLIOGRAPHY 75 vii LIST OF TABLES 2.1 BNF Grammar for Propositional Logic [1] ............... 10 2.2 Properties of the top and wlp predicate transformers .......... 12 6.1 Grammar for object specifications .................... 36 viii LIST OF FIGURES 2.1 Relationship between re—engineering terms ............... 7 2.2 Graphical Depiction of a Re-engineering Model [2] .......... 8 5.1 Steps for abstracting the effect of iteration statements ......... 29 5.2 Removal of procedure call p(h',5,E) abstraction ............. 33 6.1 Example Pascal Code ........................... 38 6.2 A Formal Specification of Example Code in Figure 6.1 ........ 39 6.3 Specification of Object St ........................ 40 7.1 Formal Specification of SymbolTableElement using LSL ...... 45 7.2 OMT Object Model of SymbolTable ................. 46 7.3 Example Sequence of Pascal Code .................... 47 7.4 Object Model of Statements ....................... 48 7.5 Object Instance Model for code sequence of Figure 7.3 ........ 49 7.6 Code sequences partially annotated with specifications ........ 51 7.7 Example with specifications annotated by the AUTOSPEC [3] system . 54 7.8 Example Pascal program ......................... 56 7.9 Output created by applying AUTOSPEC to example .......... 57 7.10 Output created by applying AUTOSPEC to example (cont.) ...... 58 8.1 Stages of Development .......................... 63 8.2 Languages for Development ....................... 63 8.3 COBOL File Operations [4] ....................... 69 ix CHAPTER 1 Introduction 1 . 1 Motivation The demand for software correctness becomes more evident when accidents, some- times fatal, are due to software errors. For example, recently it was reported that the software of a medical diagnostic system was the major source of a number of potentially fatal doses of radiation [5]. Other problems caused by software failure have been well documented and with the change in laws concerning liability [6], the need to reduce the number of problems due to software increases. Formal methods in software development provide many benefits in the forward engineering aspect of software development [7, 8, 9, 10, 11]. One of the advantages of using formal methods in software development is that the formal notations are precise, verifiable, and facilitate automated processing [12]. Reverse Engineering is the process of constructing high level representations from lower level instantiations of an existing system. One method for introducing formal methods, and therefore taking advantage of the benefits of formal methods, is through the reverse engineering of existing program code into formal specifications. 1.2 Research Contributions There are two main objectives that the research described in this thesis covers. First, a new approach to the construction of formal specifications from program code has been developed. To this end, a new propagational model of translation has been devel- oped that provides the framework for performing reverse engineering using the direct translation of programming constructs into corresponding formal specifications. Using this framework, a number of translation rules for constructing formal specifications for programming primitives, including assignment, alternation, iteration, sequence, and procedure call, have been developed. Finally, this translational framework is applied to the imperative programming paradigm, using Pascal as a model [13]. The second objective of this research involves the movement of procedural software towards ob ject-orientation. Other projects have focused on source code clustering [l4] and functional abstraction [4]. We have developed an approach that uses clustering and restructuring of formal specifications at the procedural level to identify objects embedded in programs [15]. 1 .3 Organization The remainder of this thesis is organized as follows. Chapter 2 provides background information pertinent to the area of software maintenance and the support of reverse engineering using formal methods. The research contributions are described in detail in Chapters 3 through 7. Chapter 3 describes the translational propagation model for the construction of formal specifications from program code. Chapters 4 and 5 discuss the translation rules for the assignment, alternation, iteration, sequence, and procedure call programming primitives. Chapter 6 discusses the application of the translational approach to the imperative programming paradigm. The method for identifying objects in programs using formal specifications is described in Chapter 7. Chapter 8 presents related work in the area of reverse engineering, and Chapter 9 draws conclusions and discusses future investigations. CHAPTER 2 Background 2.1 Software Engineering The following definition of software engineering has been offered by Fritz Bauer [16]: The establishment and use of sound engineering principles in order to obtain economically software that is reliable and works efficiently on real machines. In 1968 and 1969, workshops were held in Garmisch, West Germany and Rome, Italy with the intent of addressing the growing problems associated with the construction of computer software. The phrase “software crisis” was coined to describe the increas- ing difficulty associated with the development of software. It was recognized that processes for developing and managing software were desperately needed in order to respond to the software crisis. Since then, a number of software life-cycles have been developed with the intent of producing “software that is reliable and works efficiently on real machines”. The classical life-cycle, better known as the “waterfall” life-cycle, provides a sys- tematic process for developing software. It involves five phases including requirements analysis, design, implementation, testing, and maintenance [17]. The requirements phase is used to define the needs of a customer or user. The main task is to determine the required function of the software system based on information gathered about the problem domain. The design phase is used to refine the informa- tion gathered in the requirements phase by focusing on issues of software architecture, behavior, and presentation. The implementation phase involves the translation of the design into a machine executable form. Here the issues include efficiency considera- tions, low level algorithmic details, and language. The purpose of the testing phase is to verify that the software performs as expected. Testing strategies include black-box testing, white-box testing, and branch testing. The maintenance phase encompasses all operations that occur after the release of the software. Redesign, modification, and enhancement are all operations resulting from the maintenance of a system. Other types of life-cycles have been created including the reuse, prototype, and spiral models [17]. Each has advantages over the classic life-cycle, but in general incorporate, to some degree, the phases defined by the waterfall model. 2.2 Software Maintenance Software maintenance has long been a problem faced by software professionals, where the average age of software is between 10 to 15 years old [18]. With the development of new architectures and improvements in programming methods and languages, in- cluding formal methods in software development and object-oriented programming, there is a strong motivation to reverse engineer and re-engineer existing program code in order to preserve functionality, while exploiting the latest technology. Reverse engineering of program code is the process of constructing a higher level abstraction of an implementation in order to facilitate the understanding of a system that may be in a “legacy” or “geriatric” state. Re-engineering is the process of examining, understanding, and altering a system with the intent of implementing the system in a new form [19]. The benefits offered by re—engineering versus developing software from the original requirements is considered to be a solution for handling existing (legacy) code because much of the functionality of the existing software has been achieved over a period of time and must be preserved for many reasons, including providing continuity to current users of the software [20]. One of the most difficult aspects of re—engineering is the recognition of the func- tionality of existing programs. This step in re—engineering is known as reverse engi- neering. Identifying design decisions, intended use, and domain specific details are often the main obstacles to successfully re-engineering a system. Several terms are frequently used in the discussion of re-engineering. [19]. Forward engineering is the process of developing a system by moving from high level abstract specifications to detailed, implementation-specific manifestations [19]. The explicit use of the word “forward” is used to contrast the process with Reverse engineering, the process of analyzing a system in order to identify system components, component relationships, and intended behavior [19]. Restructuring is the process of creating a logically equivalent system at the same level of abstraction [19]. This process does not require a semantic understanding of the system and is best characterized by the act of transforming unstructured code into structured code. A diagram depicting the relationships between forward engineering, reverse engineering, restructuring, and re- engineering is shown in Figure 2.1. 2.2.1 A Model of Re-engineering Re-engineering can be described in terms of a model based on three concepts that are defined as follows [2]: Refinement is the gradual decrease in the abstraction level of a system representation and is caused by the successive replacement of existing Rm Famed Win [anthem Design Inflammation Figure 2.1. Relationship between re-engineering terms system information with more detailed information. Abstraction is the gradual increase in the abstraction level of a system representation and is created by the successive replacement of existing detailed information with information that is more abstract. Abstraction produces a representation that emphasizes certain system characteristics by suppressing information about others. Alteration is the incorporation of one or more changes to a system repre- sentation without changing the degree of abstraction. Alteration includes the addition, deletion, and modification of information. Figure 2.2 depicts the three concepts of refinement, abstraction, and alteration within a model of re-engineering, where the left triangle represents the original system, and the right triangle represents the new system. Abstraction is shown as the arrow angled upward and corresponds to reverse engineering. The arrow angled downward depicts refinement and corresponds to forward engineering. The horizontal arrow shows alteration and corresponds to restructuring. Alteration fl Requirements Abstraction Refinement Design / Implementation \ Figure 2.2. Graphical Depiction of a Re-engineering Model [2] 2.3 Applications of Reverse Engineering Reverse engineering has many areas of application, including redocumentation and design recovery. Redocumentation is the creation or revision of a semantically equiv- alent representation within the same relative abstraction level [19]. The purpose of redocumentation is to recover or revise documentation that is missing or obsolete. Often redocumentation is in the form of data flow and control flow diagrams. Design recovery is the process in which domain knowledge, external information, and deduction or fuzzy reasoning are added to the observations of a software system to identify meaningful higher level abstractions beyond those obtained directly by examining the system itself [19]. In addition to providing the abstractions for the re-engineering processes of rework and replacement, design recovery can facilitate the reuse of software components by providing the details of functionality. 2.4 Formal Methods Although the waterfall development life-cycle provides a structured process for devel- oping software, the design methodologies that support the life—cycle (i.e., Structured Analysis and Design) have shortcomings in that they have the potential for intro- ducing ambiguity, inconsistency, and incompleteness in designs and implementations. Formal methods used in software development are systematic techniques for specify- ing, developing, and verifying computer software. Formal methods aid in eliminating the properties of ambiguity, inconsistency, and incompleteness of a system through the use of a well-defined specification language with a set of well-defined inference rules that can be used to reason about the specification. The remainder of this sec- tion describes the notation used throughout the thesis and defines some key aspects of program semantics. 2.4.1 Propositional Logic A proposition is a logical expression that represents a truth value of true (T) or false (F). Operators consist of conjunction (A), disjunction (V), implication (=>), and equality (=) Propositional logic can be described using the following rules [1]: 1. Tand F are propositions 2. An identifier is a proposition. (An identifier is a sequence of one or more digits and letters, the first of which is a letter.) 3. If b is a proposition, then so is (ab). 4. If h and c are propositions, then so are (I) /\ c), (b V c), (b => c), and (b = c). Table 2.1 gives the BNF grammar for propositions using precedence rules to elim- inate parenthesized expressions [1]. 10 (proposition) ::= (imp-expr) (proposition) = (imp-expr) (imp-expr) ::= (expr) (imp-expr) => (expr) (erpr) — (term) I (expr) V (term) (term) ::= (factor) I (term) A (factor ) (factor) = -1(factor) | ((proposition)) I T I F I (identifier) Table 2.1. BNF Grammar for Propositional Logic [1] 2.4.2 Predicate Logic Propositional logic can be extended by replacing identifiers in a proposition with any expression (termed an atomic expression) that can be evaluated to true or false and by allowing the use of existential (3) and general (V) quantification. This extension is known as predicate logic. Atomic expressions are made up of four different types of symbols: individual symbols or constants, variables, functions, and predicates. A predicate is a mapping from a list of constants to the value true or false. Existential quantification can be defined in the following manner. Given the ex- pression EjVEj+1V...VEk, where j and k are integers, and E.- is a predicate, the existential quantification oper- ator (3) can be used to abbreviate the expression in the following way 11 General quantification (V) can be expressed in a similar fashion. Given the ex- pression EjAEj+1/\.../\Ek, general quantification has the following form (VisziSlczEg), where j and k are integers and E,- is a predicate. General quantification can be expressed in terms of existential quantification, and vice versa, as is shown below (Vi: R,- : E.) E '1(3i 3R1“ I“Ei)a where R.- is a predicate that describes the range, E,- is an arbitrary predicate, and i is the quantified variable 2.4.3 Program Semantics The notation Q { S } R was originally introduced by Hoare [21] to indicate a partial correctness model of execution, where given that logical condition Q holds, if the exe- cution of statement S terminates, then logical condition R will hold. A rearrangement of the braces in the form of { Q } S { R }, in contrast, represents a total correctness model of execution. That is, if logical condition Q holds, then statement S is guar- anteed to terminate with logical condition R true. If we consider a state space for computation then, in terms of some predicate R, we can partition computations into the following three mutually exclusive classes [22]: eternal: All computations that fail to terminate finally R: All computations terminating in a final state with R true 12 finally pR: All computations terminating in a final state with pR true. An analogous characterization of the computation state space in terms of the initial state of a program can be given as follows [22]: initially Q: All computations starting in an initial state with Q true. initially -Q: All computations starting in an initial state with -Q true. A precondition describes the initial state of a program, and a postcondition de- scribes the final state. Given a statement S and a postcondition R, the following predicates can be defined in terms of the characterization of the computation state space [22]: wp(S, R) The set of all states in which the computation under control of S belongs to the class “finally R”. wlp(S, R) The set of all states in which the computation under control of S belongs to the class “eternal” or “finally R”. That is, the weakest precondition wp(S, R) describes the set of all states in which the statement S can begin execution and terminate with R true, and the weakest liberal precondition wlp(S, R) is the set of all states in which the statement S can begin execution and establish R as true if S terminates. In this respect, wp(S, R) establishes the total correctness of S, and wlp(S, R) establishes the partial correctness of S. The wp and wlp are called predicate transformers because they take predicate R and, using the properties listed in Table 2.2, produce a new predicate. An interesting wp(S, A) E wp(S, true) A wlp(S, A) wp(S,/1) => “101145, “Al wp(S, false) 5 false wp(S, A A B) E wp(S, A) A wp(S, B) wp(S, A V B) => wp(S, A) V wp(S, B) wp(S,/1 -+ B) => wp(S,/1) -+ wp(S,B) Table 2.2. Properties of the wp and wlp predicate transformers l3 characterization of wlp(S, R) describes the set of computations for which “finally R” is merely possible [22]. pwlp(S, pH) The set of all states in which there exists a computation under control of S that belongs to the class “finally R”. This use of wlp is contrasted to wlp(S, R) which, recall, is the set states in which the computation under control of S belongs to the class “eternal” or “finally R”. An analogous characterization can be made in terms of the computation state space that describes initial conditions using the strongest postcondition sp(S, Q) pred- icate transformer [22]. sp(S, Q) The set of all states in which there exists a computation under control of S that belongs to the class “initially Q”. That is, given that Q holds, execution of S results in sp(S, Q) true. It can be proven that sp(S, Q) and wlp(S, R) are converses of one another [22]. Therefore, given the Hoare triple Q { S } R we observe that sp(S,Q) and wlp(S, R) have the following relationship: QawMSM, sp(S, Q) => R. The importance of this property is two-fold. First, it provides a basis for translating programming statements into formal specifications. Second, the symmetry of sp and wlp provides a method for verifying the correctness of a reverse engineering process that utilizes the properties of sp. CHAPTER 3 A Translational Approach to Reverse Engineering Reverse engineering of program code into formal specifications facilitates the utiliza- tion of the benefits of formal methods in projects where formal methods may not have previously been used. Program annotations in the form of Hoare triples [21] provide a natural way for presenting the formal specification of sequential programs. In this chapter we present an approach to reverse engineering that uses the properties of some well defined predicate transformers to specify programs. 3.1 Approach Chapter 2 introduced the notation Q { S } R and { Q } S { R } to denote partial and total correctness, respectively. We will use { Q } S { R } to indicate a general triple of precondition, statement, postcondition. When an interpretation is needed in a formal context (i.e., partial correctness, to- tal correctness), one will be explicitly stated. The general approach described herein for reverse engineering is a top-down, sequential translation (derivation) of imperative programming language primitives using the properties defined by the strongest post- 14 15 condition predicate transformer sp. The approach is: Given that some precondition Q is known, a postcondition R is constructed using the properties of sp as applied to a statement S. 3.2 Model of Programs with Annotations Program annotations in the form of preconditions and postconditions can be used to document, in-line, the specification of a program. Given a program P with a sequence of n statements SliS2i . - - ;S.--1;Sa; . . - ;Sn_1;Sn a sequence of specifications can be used to annotate the program such that the spec- ification indicates the “state” of the program at a given point of execution. For instance, the above sequence can be annotated to appear as {SP0}51;{SP1}52;~-i{SP2}Si—1i{SP3}Si;---i{SPn—2}S -1;{SPn-1}Sn{SPn} where specifications are delimited by the “{ )” notation, and S pj is the specification of program P after execution of statement Sj. Alternatively, we can annotate the program with comments in the following manner: {CO}SI;{CI}S2i---i{Ci-2}Si-li{Ci-1}Sii-~-;{Cn-2}S -1;{C' -1}Sn{C'n} where C.- is a comment annotated to the program after statement S.- and before statement 5,-4.1. We can model a program P with annotations as a 4-tuple P(V, S, A1,Value1) (3.1) where V is a set of variables, S is a sequence of statements, A I is a sequence of anno- tations, and Value, is an interpretation function, where I denotes the interpretation. 16 The model can be extended to accommodate multiple types of annotations, each with different interpretations. Consider the sequence {SPuo}{SPuo}Si; {5Pu1}{SPu1}52; - - - ; 55—1; {Spui-1}{Spvi-1}Si; {SPuiH-S'Pm'} - - - t (3.2) where S puj and Spa, each denote annotations for a statement 5,- with different inter- pretations. The appropriate model would appear as P(V, 5, (Au, Av}, {Valuem Valuev}) (3.3) where V and S are as before, A, and Au are annotation sequences, and Value, and Value” define the interpretations of Au and A”, respectively. The remainder of this section describes each of the components of the general program model and provides an interpretation of Value, for reverse engineering. 3.2.1 Variables Variable decorations are used to indicate a temporal ordering of logical instances of a variable. In Z, decorations v? and v! are used with a variable v to indicate logical input and logical output instances, respectively [23]. Other specification languages use prime (i.e. v’, v”, etc.) and vector decorations (‘5) to indicate temporal order- ings [24]. In our model, a variable is represented using subscripted integer indices (i.e. v0, 121,. . .), where an ordering is imposed by the index and each v,- represents a logical instance. There are three types of values that a logical instance can take. The first is as an unconditional expression. In this case, the statement generating the instance is an assignment statement. The second type of value is a conditional value, where the value of the logical instance depends on one or more Boolean conditions. Using disjunction, a single logical instance can be used to describe a value that is dependent 17 on a Boolean condition. Consider the simple statement if A then x :8 1 else 2 :8 2 ii We can describe the i“ instance of the variable x after execution of the statement by stating x.- = l V 2:.- = 2. The final type of instance is a range instance. Again a single logical instance can be used to describe the variable. Consider the sequence x :8 0; j :8 O; a[] :8 A; n :8 N; do j <> n -> if a[j] 8 T -> x,j :8 x + 1,j + 1 a[j] <> T -> j :8 j + 1 fi od where we are interested in the specification of the variable :1: after execution of the do-od statement. In this case, a: can take a value anywhere between 0 and N. We can describe the i‘“ instance of x by stating (31' : 0 Sj S N : x.- = j). Notationally, a variable v is a sequence with the following form ‘00, U], . . . , 0);, (3.4) where for 0 S i S lc, v,- represents a logical instance. The size k of a sequence v is bounded from above by the number of annotations in a program P. For a given program P( V, S, A1,Value1), V(P) denotes the set of all defined variables, V(P) = {ulu is a variable with form given by (3.4)} (3.5) This treatment of variables makes the following assumptions: 1. All variables in V(P) are local to P. 18 This condition ensures that a variable in V(P) is specified within the correct scope. The main motivation is to use the program model to specify subprograms of P. 2. All variables are untyped. Symbolically this assumption poses no problems. However, in practice, pro- gramming languages allow mixed type operations with appropriate transforma- tions. 3. All variables are simple. Structured types can be represented as simple variables by decomposing the structures into the appropriate components. Pointers are also excluded from the class of variables handled in this formalization. Pointers merit a more rigorous treatment and are beyond of the scope of this work. 3.2.2 Statements There are five main types of statements: assignment, alternation, iteration, sequence, and procedure call. 5 (P) denotes the sequence of statements of a program P. The ordering of the statement sequence given by S (P) corresponds directly to the order found in the source text of a program. The sequence is of the form S(P) = 81,82,...,Sn (3.6) where n is the number of statements in the text of the program. 19 3.2.3 Annotations A(P) denotes the sequence of annotations of a program P. The sequence has form A(P) = a1,a2, . . . ,an+1 (3.7) where n is the number of statements in the text of the program and n+1 is the number of annotations. As stated earlier, there can be any number of annotation sequences, each with a different interpretation. The interpretation function for predicate logic annotations has the form Value] : A —+ 2" (3.8) where the subscript I identifies a particular type of interpretation of the annotation, and 2" denotes the set of valid predicate logic expressions. The range for A with respect to Value, changes depending on the interpretation of the annotation. The notion of annotation interpretation is described in more depth in Section 3.2.5. 3.2.4 Relating Statements and Annotations A convenience operation in the model is the association of annotations with a given statement using prec and succ. Symbolically, prec and succ represent precedes and succeeds and are defined as follows, where (3.9) and (3.10) are function signatures and (3.11) and (3.12) define the semantics for prec and succ, respectively. The prec function returns the annotation preceding statement S and the succ function returns the annotation succeeding a statement S. prec : S —+ A (3.9) succ : S —> A (3.10) 20 where prec(s,-) E (lg, 13 i S n,n = [SI (3.11) and succ(s,) E cg“, 1 S i S n,n = [S] (3.12) 3.2.5 Annotation Interpretations Annotations can have many interpretations. For instance, suppose signature 3.8 is in- terpreted as a weakest precondition assertion using predicate logic, denoted Value”. Then for some statement 3 we have prec(s) = awcc and succ(s) = am“ such that Valuewp(a,ucc) = R and Valuewp(ap,cc) = wp(s,R). Additionally, recall that many different annotation sequences can exist for a given program (See Expressions (3.2) and (3.3)). So, in addition to the wp interpretation for predicate logic, a second interpretation to a program can be given using a second sequence of annotations. Allowing different interpretations for annotations faCilitates the use of different spec— ification techniques, each of which can provide an alternative perspective of a given program. 3.3 An Annotation Interpretation based on sp The previous section described an annotation interpretation based on the semantics of the weakest precondition predicate transformer wp. Another interpretation for annotations provides the framework for the reverse engineering techniques described in the following chapters using the strongest postcondition predicate transformer sp. When given a statement and a predicate logic expression, sp derives a predicate logic specification of the statement. The formal presentation of the semantics of sp with respect to primitive programming structures is given in Chapters 4 and 5. Using the 21 definition of sp we define an annotation interpretation, denoted Value”, with the signature given above by Expression (3.8) and semantics as follows: Value,p(succ(s,-)) E sp(sg, Value,p(pre(s,-))) A Value,p(pre(s,-)),, (3.13) where s.- 6 S (P),1 S i S |S(P)| for some program P. Note that the second conjunct (underlined) is a propagation of the precondition for a given statement 3;. How- ever, also note that we subscript the second conjunct with a p to indicate that the propagated conjunct is subject to a propagation interpretation, named so in order to distinguish it from annotation interpretations given by Value 1. The interpretation of Value, as a strongest postcondition annotation provides the basis of a model for the creation of formal specifications from program code. Thus, given some initial annotation a1, each step in the reverse engineering process involves the derivation of a formal specification of some statement using precondition assertions and a propagated expression that provides context for the derivation. 3.3.1 Translational Propagation There are two types of propagation that can be used to interpret the second conjunct of Expression (3.13). The first type of propagation is dependent propagation. In this interpretation, each propagated item is dependent on the value given by the translation of a statement by sp. The second type of propagation is independent propagation, where, as one might expect, the propagated item is independent of translation. The difference in the interpretations lies in the definition of the initial precondi— tion annotation a1 6 A(P), whose value is given by Value(a1). In the dependent 22 interpretation, the initial annotation has the following value Valtre,p(al) E (Vv : v E V(P) A k = |v| : (v0 = v") A (v; = e) . . . A (v,c = 6)) (3.14) where vi is the value of instance 12,- and 6 denotes an undefined value. In the depen- dence interpretation the initial condition of a program is a specification of all the instances of all the variables of a program. When the initial precondition of a pro- gram is defined as such, then as a formal specification is derived for each statement 3, propagation of Value,p(prec(s)) must be modified to reflect the changes to each instance of a variable. In the independent interpretation for propagated items the initial annotation a1 has the following value Valtte,p(al) E (Vv : v E V(P) : v0 = v“). (3.15) That is, the initial condition for a program is a specification of the values of the initial instances of all variables. This convention allows for the propagation of statement preconditions without requiring modifications to the propagated expression Qp. CHAPTER 4 Primitive Constructs In this chapter we discuss the derivation of formal specifications from the primitive programming constructs of assignment, alternation, and sequences. The Dijkstra language [25] is used to represent each primitive construct, but the techniques are applicable to the general class of imperative languages. For each primitive we first describe the semantics of the predicate transformers wlp and sp as they apply to each primitive and then describe the specification derivation using Hoare triples. 4.1 Assignment An assignment statement has the form 1:8 e; where x is a variable, and e is an expression. The wlp of an assignment statement is expressed as wlp(xz8e, R) = Rfi', which represents textual substitution, where every occurrence of a: is replaced by the expression e using the postcondition R[1]. If :1: corresponds to a vector '37 of variables and e represents a vector B of expressions, then the wlp of the assignment is of the form Ry , where each y,- is replaced by E;, respectively, in expression R. The sp of an assignment statement is expressed as sp(xz8e,Q) = (3v :: Q: A :r = 6:), (4.1) 23 24 where v is the previous value of x, and ‘::’ indicates that the range of the quantified variable x0 is not used in the current context. Recall from Expression (3.13) that we assume a propagation of preconditions in the construction of formal specifications and that the initial precondition of a program is given by Expression (3.15). In terms of specifying programs using variable instances, sp is expressed as sp(xz8e,Q) = (335-1 :: Q A x,- = eim), where x,- and 13-1 are the current and previous values of 1, respectively. We make the conjecture that the removal of the quantification for the initial values of a variable is valid if the precondition Q has a conjunct that specifies the textual substitution. That is, Q: in Expression (4.1) is a redundant operation if, initially, Q has a conjunct of the form a: = v. As such, if given the Hoare triple for a statement 3,, with Statement(s,-) = x:8e and U = prec(s,~-1), the following annotated sequence is created. q {(x, = X) A U} /* precondition */ x := e; {(xJ-H = e;1)A (at,- = X) A U} /* postcondition */ which satisfies the sp for assignment. 4.2 Alternation An alternation statement using the Dijkstra language [25] is expressed as if B1 —’ S] i ll Bn -* Sn; fi; where B,- —. S; is a guarded command such that S.- is only executed if logical expression (guard) B.- is true. The wlp for an alternation statement is given by wlp(IF, R) :3 (Vi: B.- : wlp(S,, R)), 25 where IF represents the alternation statement. The equation states that the necessary condition to satisfy R, if the alternation statement terminates, is that given 8; is true, the wlp for each guarded statement S,- with respect to R holds. The sp for alternation has form sp(IF,Q) E (3i :: sp(S,-,B.- A Q)), (4.2) The existential expression can be expanded to be of the form sp(IF, Q) E (sash 8. A Q) v - - . v sp(S., a. A 0)). (43) Expression (4.3) illustrates the disjunctive nature of alternation statements where each disjunct describes the postcondition in terms of both the precondition Q and the guard-guarded command pairs, given by B.- and S,-. This characterization is intuitive in that a statement 8.- is only executed if B,- is true, and that only one of 81-, 1 S j S n, is executed. The translation of alternation statements follows accordingly from Expression (4.3). Using the Hoare triple notation, a specification is constructed as follows {Q} if Bl —* 31; iii. —+ 3..- fi,’ {(3p(31,Bl A Q) V . . . V sp(SnaBn A Q» A QP} where Qp is a propagation of the precondition of the statement being specified. 4.3 Sequence For a given sequence of statements 31;. . . ; Sn, the notion that the postcondition for some statement S.- is the precondition for some other statement 5;“ is natural. The 26 wlp and sp for sequences follow accordingly. The wlp for sequences is defined in the following manner wlp(Sl 332, R) ’3 wlp(SlaU)lp(S2i R)) Conversely, the sp is 3P(S]§SQ,Q) E 31’(32,SP(SI,Q))- (4'4) In the case of wlp, the set of states for which the sequence 81 ;S; can execute with R true (if the sequence terminates) is equivalent to the wlp of S; with respect to the set of states defined by wlp(Sg, R). For sp, the derived state for the sequence 81:83 with respect to the precondition Q is equivalent to the derived postcondition for S; with respect to a precondition given by sp(S,, Q). The Hoare triple formulation and construction process is as follows {Q} 31; { 81451, Q) A Q10} 323 { sp(32,3p(31.Q)) A Q.» }- CHAPTER 5 Iterative and Procedural Constructs The programming constructs of assignment, alternation, and sequence can be com- bined to produce straight-line programs (programs without iteration or recursion). The introduction of iteration and recursion into programs provides for more power- ful computation ability. However, constructing formal specifications of iterative and recursive programs can be problematic, even for the human specifier. This section discusses the formal specification of iteration and procedural abstractions without recursion. We deviate from our previous convention of providing the formalisms for wlp and sp for each construct and use an operational definition of how specifications are constructed. This approach is necessitated by the fact that the formalisms for the wlp and sp for iteration are defined in terms of recursive functions [1, 22] that are in general difficult to practically apply. 5.1 Iteration Iteration allows for the repetitive application of a statement. Iteration, using the Dijkstra language, has the form 27 28 do 81 —* 31; II B. -> 8.: od; In general, the iteration statement may contain any number of guarded commands of the form 8,- -—> S,-, such that the loop is executed as long as any guard B.- is true. In the context of iteration, a bound function determines the upper bound on the number of iterations still to be performed on the loop. An invariant is a predicate that is true before and after each iteration of a loop. The problem of constructing formal specifications of iteration statements is difficult because the bound functions and the invariants must be identified. However, for a partial correctness model of execution, concerns of boundedness and termination fall outside of the interpretation, and thus can be relaxed. Gries defines guidelines for developing loops through the identification of loop invariants [1]. The methods of deleting a conjunct, replacing a constant by a variable, enlarging the range of a variable, and adding a disjunct can provide insight into the automated construction of a specification from program code. For instance, a loop written using the method of replacing a constant by a variable must identify the upper (lower) bound of an incremented (decremented) variable. Furthermore, determining the statements that ensure progress towards termination is facilitated by the properties associated with this class of loops. Figure 5.1 gives the steps for constructing a specification for a loop that was developed using the replace a constant by a variable strategy for the loop invariant. Although these characteristics are more total correctness in nature, the insight provided by identifying these properties in loops aids in specifying a loop using partial correctness interpretations. When no automated strategy can be applied to a loop, the domain expert‘is ‘A domain expert is a maintenance engineer who is familiar with the subject system. 29 prompted for the proper specification of the statement. The following items are then identified in order to confirm that the specification of the loop is complete: a invariant (P): an expression describing the conditions prior to entry and upon exit of the iterative structure. 0 guards (B): Boolean expressions that restrict the entry into the loop. Execution of each guarded command, B,- -> S.- terminates with P true, so that P is an invariant of the loop. {P /\ B;}S;{P}, for I S i S n When none of the guards is true and the invariant is true, then the postcondition of the loop should be satisfied (P A “BB -—. R, where BB = B1 V . . . V B" and R is the postcondition). 1. The abstraction algorithm begins with the template for a quantified expression of the form . . . . (Qt : range(z) : expression(z)), where Q represents one of the quantifier symbols V, 3, E. 2. The quantified variable(s) are determined by examining the identifiers occurring in guards Bj. 3. The ranges of the quantified variables are determined by finding statements occurring prior to entry into the loop that assign values to incremented (decre- mented) variables and their occurrences in the guards. 4. For each guarded command, the corresponding statement list includes state- ments that ensure progress towards termination; the postcondition for the re- maining statements constitutes expression(i). 5. The bound function becomes the difference between the upper (lower) bound for a variable that is being incremented (decremented) and its value during loop iterations. Figure 5.1. Steps for abstracting the effect of iteration statements 30 5.2 Procedural Abstractions without Recursion This section describes the construction of formal specifications from code containing the use of non-recursive procedural abstractions. A procedure declaration can be represented using the following notation proc p ( value if; value-result 1;; result 2 ); {PM body ){Q} where f, y, and '2' represent the value, value—result, and result parameters for the procedure, respectively. The notation ( body ) represents one or more statements making up the “procedure”, while {P} and {Q} are the precondition and postcondi- tion, respectively. The syntactic signature of a procedure appears as proc p : (inputJype)' —> (outputJype)" (5.1) where the Kleene star (*) indicates zero or more repetitions of the preceding unit, input-type denotes the name of an input parameter to the procedure p, and outputjype denotes the name of an output parameter of procedure p. A specification of a procedure can be constructed to be of the form { P: U } proc p : E0 —> E1 (body) { Qi sp(body,U) AU? } where ED is one or more input parameter types with attribute value or value-result, and E1 is one or more output parameter types with attribute value-result or result. The postcondition for the body of the procedure, sp (body, U), is constructed using the previously defined guidelines for assignment, alternation, and iteration as applied to the statements of the procedure body. Gries defines a theorem for specifying the effects of a procedure call [1] using a total correctness model of execution. Given a procedure declaration of the above 31 form, the following condition holds [1] {PRT : Pg? A (vine :: Q5" => 12%)} p(fi,b,E) {a} (5.2) for a procedure call p(a, b, E), where a, b, and E represent the actual parameters of type value, value-result, and result, respectively. Local variables of procedure p used to compute value-result and result parameters are represented using i and 5', re- spectively. Informally, the condition states that PRT must hold before the execution of procedure p in order to satisfy R. In addition, PRT states that the precondition to procedure p must hold for the parameters passed to the procedure and that the postcondition for procedure p implies R for each value-result and result parameter. The formulation in terms of a partial correctness model of execution is identical, as- suming that the procedure is straight-line, non-recursive, and terminates. Using this theorem for the procedure call, an abstraction of the effects of a procedure call can be derived using a specification of the procedure declaration. That is, the construction of a formal specification from a procedure call can be performed by inlining a pro- cedure call and using the strongest postcondition for assignment. A procedure call p(a,b,E) can be represented by the Pascal—like block [1] found in Figure 5.2, where (body) comprises the statements of the procedure declaration for p. By representing a procedure call in this manner, parameter binding can be achieved through multi- ple assignment statements and a postcondition R can be established by using the sp for assignment. Removal of a procedural abstraction allows for the extension of the notion of a straight-line program to include non-recursive straight-line procedures. Making the appropriate sp substitutions, we can annotate the code sequence from Figure 5.2 to appear as follows: 32 {PR } if :8E,b; __ __ _ { 19:35,? - PR§§A§=5§§A§=S;§} (body) {Q } 7,: was; .—- "i ___—z ___‘y,z $933734 9;; AY-“vz Az—sz} ,6 ='y.2; .-_ 8.3 —_.SE __.SE { R. 319,¢ QRE? Ab— YE? Ac— 23¢} where 6', B, 7, E, if, and ¢ are the initial values of i, 7 (before execution of the procedure body), y (after execution of the procedure body), 2, b, and E, respectively. Using the conjecture of Section 4.1 regarding assignments as well as the facts that formal and actual result parameters have no initial values, and local variables are used to compute the values of the value-result parameters, the above sequence can be simplified using the semantics of sp for assignments to obtain the following annotated code sequence: {PR} $595.3; {P.-PRAi=a/\y=3} (body) {Q} Yiiz'fifi. {QR.-QAy=fi¥/\z=v¥} bizrii; {12: QRAE=7AE=2} where Q is derived using sp((body), P). 33 begin w: NW" FA-JU a-hn ' A ' any Hap-4m end begin declare it, y, 2, fi, 7; { PR } 5,1); "x37 :8 {P} A u 8‘ a. Q I wav Isl wOIQNIQ some. «swat Q ‘0 I 8.. 2 Ph‘ a‘|~h'~<|.-h --v I and Figure 5.2. Removal of procedure call p(a,li,e) abstraction CHAPTER 6 Moving towards Ob ject- Orientation The main characteristics of object-oriented analysis, design, and programming are encapsulation, modularity, and inheritance. Well-defined object interfaces and a clear behavioral knowledge about an object can facilitate software reuse. In recent years, there has been an increase in the development and use of ob ject-oriented programming languages, including C++, Smalltalk, and Modula-3. Re-engineering efforts have been focused, in part, to the conversion of software from the imperative paradigm to the object-oriented paradigm [4, 14, 26]. This chapter discusses an approach for identifying objects embedded in programs and restructuring formal specifications to reflect that identification. 6.1 Identification of Classes Using Formal Speci- fications The methods described in Chapters 4 and 5 produce specifications that have proper- ties that are amenable to the identification of candidate objects in program code. An object is a self-contained module that includes both the data and procedures (meth- 34 35 ods) that operate on that data. An object can be considered to be an abstract data type (ADT), that is, a user-defined data type with a specific set of allowable oper- ations. A class is a collection of objects that have common use [27]. This section describes an aggregation heuristic for identifying objects based on the examination of procedure signatures. 6.1 .1 Guidelines Using the above definition of an object, a set of guidelines for identifying objects is as follows: 1. Construct a list of all data structures contained in a program. These data structures should not include primitive types. 2. For each data structure contained in the list of program data structures, group together the operations that refer to the data structure as input in the syntactic signature specification of the procedure (See Equation 5.1 for the format of signatures). This grouping along with the associated data structure is an object candidate. 3. In case of conflicts, (i.e., a procedure contains two non-primitive data structures as input in the signature) one of three actions can be taken (a) Determine if one data structure is composed of one or more occurrences of the other data structure. This step is performed by checking the definitions of data structures. (b) Determine whether the output of the procedure excludes either data struc- ture. If it can be shown that the procedure does not modify a data struc- ture then associate the procedure with the data structure that is modified. (c) In the cases where no determination can be made as to how to associate a procedure with a respective data structure, query the domain expert on the appropriate association. The process of identifying candidate objects is facilitated by the format of the formal specifications for procedures. It is emphasized that the datatypes identified by this technique are only candidate objects. However, once the object definition has been constructed, the formal nature of the specification facilitates formal reasoning about the objects and can aid in the validation of candidate objects as true objects. 36 6.1.2 Specification Language An object specification is constructed by collecting the procedure specifications asso- ciated to a data structure and declaring the data structure to be a candidate object. The BNF grammar for the language used to specify potential object classes is given in Figure 6.1. The definition uses the roman font to describe non-terminals; bold type defines keywords; the Kleene star (*) is used to denote one or more repetitions of the preceding unit; square brackets (I ]) indicate optional items; parentheses (‘()’) indi- cate groupings. The non-terminal, expression, represents a predicate logic expression. component = type typemame: has ( method )* has = has ( datajescription )* data.description = variable : type.name method = method methodmame: (type_name)* —. (typemame)* in((variable: type-name)*) local((variable: type.name)*) out((variable: type-name)*) { pre: expression } { post: expression } expression = true I false I ( expression ) I - expression I expression A expression I expression V expression I expression => expression I expression 4:) expression I ( V variable : type :: expression ) I ( 3 variable : type 2: expression ) I predicatemame [( term (, term)*)] I def . term = expression term 2 variable I functionmame [( term (, term)* )] Table 6.1. Grammar for object specifications 37 6.1.3 Example Identification of Candidate Objects The first step of the analysis is to identify the non-primitive data structures of the pro- gram. This process is performed by examining the signatures of the procedure spec- ifications and extracting the unique non-primitive data structure names. Analysis of the program QeS (shown in Figure 6.1) produces the specifications given in Figure 6.2. Analysis of the formal specifications identifies non-primitive data structures named St, On, and elementtype. In continuing the analysis of the data structure St, the procedures inns, nts, tp, and po are grouped with St through examination of the spec- ification signatures. Procedure pu is initially grouped with St and elementtype, but further analysis leads to a strict association of pa to St since the input elementtype is not modified by pu. The subsequent object definition for St appears in Figure 6.3. 38 progren Qe8(output); conet neulength 8 60; procedure nnq ( ver 0 : Qu ); begin Q.f :8 nenlength; Q.r :8 nenlength; tVPO end: 8t 8 record t = integer; function utq ( var Q : Qu ) : booleen; e : urruy[1..nuxlengthJ of elementtype 5.51. end; if Q.r 8 Q.f then Qu 8 record ntq :8 true e : urruyIi..nnnlengthJ of elenenttype; .1.. f. r : integer ntq :8 false 884: end; elementtype : Qu; function fr ( var Q : Qu ) : ole-enttype; procedure nun ( vur 8 : 8t ): begin 8.t :8 lenlength + 1; end; function nte ( 8 : 8t ) : booleun; begin if 8.t > nunlength then ntn :8 true elee nte :8 fulee end; function tp ( vur 8 : 8t ) : eleuenttype; begin if nte(8) then urite1n(’error’); elee tp :8 8.e[8.top] end; procedure po ( ver 8 : 8t ); begin if ntn(3) then urite1n('error’); elee 8.t :8 8.t # 1; end; procedure pu ( x : St begin if 8.t 8 1 then urite1n(’error’); elee begin 8.t :8 8.t - 1; 8.e[8.t] :8 3 end; end; elementtype; vur S : Figure 6.1. begin if ntq(Q) then urite1n(’error’) elee fr :- q..[q.r]; end; procedure en ( n : elementtype; vur Q : Qu ); begin if Q.r 8 nunlength then 0.x :8 1 else Q.r :8 Q.r + 1; if 0.x 8 Q.f then urite1n(’error’) elee Q.e[Q.r] :8 n; end; procedure de ( vur Q : Qu ); begin if ntq(Q) then urite1n(’error’) else if Q.f 8 nexlength then Q.f :8 1 else Q.f :8 Q.f + 1; end; var ex-e : St; en_q : Qu; begin (0 QeS Body o) end Example Pascal Code 39 proc rune : St _. St in( S : St ) out( S : St ) { pro: true } ( poet: S.t = maxlength + l A true } proc tp : St —o St x clementtype in( S : St ) out( 8 : St, tp : elementtype ) { pre: domain(S) ) { poet: (((S.t ) maxlength) A (mix = true))A sp(writeln( ’error ’), true))v ((«(SJ > maxlength) A (mts = false))A (tp = S.e[S.t])) A domain(S) } proc po : St _. St. in( S : St ) out( S : St ) { pre: domain(S) } { poet: (((S.t > maxlength) A (mts = true))A ep(writeln( ’error ’), true))V ((-(S.t > maxlength) A (mix = false))A (SJ, = 3.80 + l))/\ domain(S) } proc pu : St x elementtype —+ St in( S : St, x : elementtype) out( 8 : St ) { pre: domain(S) A domain(x) } { post: (34 = l A sp(writeln( 'error’), true)) v (8(S.t = 1)A(S.t1 = S.to A S.e[S.11 = x)) Adomain(S) A domain(x) } proc mnqz Qu _. Qu in( Q = Q“ ) out( a = on ) { pre: true } { poet: QJ = maxlengthA Q.r = maxlength A true } proc mtq : Qu -» Qu x boolean in( Q : Qu ) out( Q : Qu, mtq : boolean) { pre: domain(Q) } { poet: (Q.r = Q.f A mtq = true)v (-(Q.r = Q.f) A mtq = false) A domain(Q) } proc fr : Qu _. Qu x elementtype in( Q : Qu ) out( Q : Qu , fr : elementtypc) { pro: domain(Q) } { poet: ((Q.r = Q.f A mtq = true)A (sp(writeln( ’error ’), true)))v ((Q.r = Q.f A mtq = true)A (fr = Q.e[Q.f])) A domain(Q) } proc en : clementtype x Qu -¢ Qu in( x : elementtype , Q : Qu ) out( Q : Qu ) { pre: domain(Q) A domain(x) } { poet: ((Q.ro = maxlength A Q.r1 = 1) V (“(Qmo = maxlength)A (QJ’; = QJ'o + I)))/\ (((Q.r = QJ) A ep(writeln( ’error’), true)) V ("(QJ‘ = QJ) A (Q..e[Qr] = 1:)” A domain(Q) } proc dc : Qu —' Qu in( Q : Qu ) out( Q : Qu ) { pre: domain(Q) } { post: ((Q.r = Q.’ A mtq = truc)A (sp(writeln( ’error ’), true)))v ((H(Q.r = QJ) A mtq = false)A (((Q.f = maxlength) A QJ = 1)V (fi(Q.fo = maxlength)A (or. = are + 1))» A domino) } Figure 6.2. A Formal Specification of Example Code in Figure 6.1 40 typo St: hue t : integer; o : may of olemonttype; method mm 2 St -. St in( S : St ) out( S : St ) { pro: true } { pout: S.t = maxlength + 1 A true } method rnts : St -. boolean in( S : St ) out( mtu : boolcan ) { pro: domain(S) } { poet:(((S.t > maxlength) A (mts = true))V (H(S.t > maxlength) A (min = false)” A domain(S) } method tp : St -v St x elementtype in( S : St ) out( 8 : St, tp : olementtype) { pro: domain(S) } { poet: (((S.t > maxlength) A (mix = true)) A sp(writeln( ’error’), true))v ((-v(S.t > maxlength) A (mts = false)) A (tp = S.e[S.t])) A domain($) } method po : St —0 St in( S : St ) out( 8 : St ) { pro: domain(S) } { poet: (((S.t > maxlength) A (mts = true)) A sp(writeln( ’error’), true))v ((-(S.t ) maxlength) A (mts = false)) A (8.1; = S.to + 1)) A domain(S) } method pu : St x olementtypo -o St in( S : St, x : olernonttype ) out( S : St ) { pro: domain(S) A domain(x) } { poet: (SJ = 1 A sp(writeln( ’error’),true))v (H(S.t = 1) A (St; = S.to A S.e[S.t1 = x)) A domain(S) } Figure 6.3. Specification of Object St CHAPTER 7 Modeling the Representation and Abstraction of Imperative Languages This chapter presents the rules and representations that embody the application of the translational approach to reverse engineering for the Pascal language. For ease in understanding the representations, two complementary description mechanisms are used. First, a diagramming technique, known as the Object-Modeling Tech- nique (OMT) [28], is used to represent the object-oriented relationships of program- ming constructs? Second, the formal specification language Larch Shared Language (LSL) [29] is used to formally specify the abstract data types that support the auto- mated construction of formal specifications from program code. ‘Note that the OMT approach includes the use of object models, data flow diagrams, and state- charts. In this discussion, OMT refers exclusively to the use of object models. 41 42 7 .1 Properties of Block Structured Languages The fundamental concepts of block structured languages originated with the devel- opment of ALGOL 60 [30] and have since been incorporated into the design of many programming languages, such as Ada [31] and Pascal [32]. Programs written us- ing block structured languages are organized into nested blocks, where each block introduces a new local referencing environment. Because of their widespread and longstanding use, block structured languages have become a common object of study by maintenance engineers [33]. In order to analyze and maintain programs written using block structured languages, an understanding of the rules that govern block structured languages is necessary. The remainder of this section describes the funda- mental concepts underlying block structured languages, where Pascal is used as an example of imperative (procedural) languages. 7.1.1 Static Scope Rules Static scope rules provide the definition of the context of a variable declaration within a program. In order to determine the intended behavior of a subject system, it is important that the rules that define the scope of variables within a system are understood. When abstracting a formal specification from program code, the scope of a variable and its potential values play a major role in expressing the effects of statements. The static scope rules for block structured languages are as follows [30]: 1. The declarations at the beginning of any block define the local identifiers for a block. 2. Any identifiers referenced within a block for which no local declaration exists re for to the immediate parental block for a declaration. If no declaration is located in the parent block then the next ancestor is referenced. This identification pro- cess continues until the declaration is found (success) or no declaration is found (i.e., the top-most environment is reached and no declaration is present). 43 3. Declarations in nested child blocks are completely hidden from parent blocks and cannot be referenced by parents or ancestors. 4. Named subblocks in the form of subprograms are members of the parent’s local referencing environment. 7.1.2 Statements Imperative programming language constructs generally consist of four different types of basic statements regardless of whether the language is block structured or not. The programming constructs are assignment, alternation, iteration, and sequence. It is important to note that alternation, iteration, and sequence statements can contain one or more nested statements within the body of the statement. For instance, an alternation statement in Pascal can appear as if (a 8 b) then begin 1 :8 q; y :8 r; end; where the if statement contains a begin-end statement and, additionally, the begin-end statement contains two assignment statements. This property will be referred to as the nesting property. 7 .1.3 Procedural Abstractions A concern in the use of subprograms is parameter association. Knowing the meth- ods for binding formal and actual parameters as well as determining the parameter transmission schemes (i.e., value, value-result) of a language are necessary prerequi- site tasks for the correct abstraction of formal specifications from program code. The rules for parameter association in Pascal are as follows [34]: l. The number of formal and actual parameters for a given function or subprogram must be identical. 44 2. The types of the parameters must be the same. Actual parameters of type integer can be coerced to type real. 3. The actual parameters associated with var formal parameters must be variables. They cannot be constants or expressions. 7 .2 Modeling In order to facilitate the automated construction of formal specifications from Pascal programs, a model of the analysis of properties contained in Section 7.1 has been de- veloped. This section describes the formal and graphical models used to the represent Pascal language and to abstract formal specifications from Pascal programs. 7.2.1 Variables, Symbols, and Types Section 7.1 discussed properties (including static scope rules) relevant to the reverse engineering of program code written using imperative block structured languages. Static‘scope rules define the methods for determining the context of an identifier within a program. As is common in compiler construction [35], a hierarchical approach has been developed for both determining scope and recording histories of identifiers within a program. This approach is implemented in the form of an abstract data type (ADT) called SymbolTablo. A SymbolTable is an ADT containing a set of SymbolTabloEloments (re- ferred to as the symbols set) and a link to a parent referencing environment. Each SymbolTableElement object in the symbols set represents an identifier, a table of the values of the instances for that identifier (history), and an index used to reference the last instance of the identifier in the history table. The SymbolTableElement objects contained in the symbols set are uniquely identified within the set by the name given to each identifier. The notation convention for referring to the name, history table, and index of the last instance added to the history table for a given 45 identifier i will be i.name, i.history, and i.index, respectively. The formal specifica- tion of the SymbolTableEloment ADT, written in the Larch Shared Language, is given in Figure 7.2.1 and is used to define a method for determining the effects of certain programming constructs. In the specification, SymTabElement is the name SymTabElement : trait includes Integer introduces % % new defines a new element % % new : Sir —r Ste % % name gives the name of the identifier % % name : Ste -+ Str % % addHist adds history information to an identifier’s table % % addllist : Ste, Int, Val -o Ste % % retrieve gets a specific instance of an identifier % % retrieve : Ste, Int —. Val % % lastlndex gives the index to the last item for the % % identifier in the table % % lastlndex : Ste -+ Int % % 6 determines membership in a list % % .. E .: Int, Ste —* Baal asserts V ste,t : Ste, v : Val, str : Str, i, it : Int “(i E new(str)) t6 addHist(t, if, v) z: (i = if) V (i E t) retrieve(addHist(t, i, v), il) == ifi = it then v else retrieve(t, il) lastlndex(new(str)) == 0 lastlndex(addHist(t, i, v)) z: if (i 6 t) then lastlndex(t) else lastlndex“) + l name(new(str)) == str name(addHist(ste, i, v)) =2 name(ste) Figure 7.1. Formal Specification of SymbolTableElemont using LSL of the SymbolTableElement ADT, the includes keyword indicates that the def- 46 inition for the Integer type is used, the introduces keyword delimits the signature of the operators of the ADT, and the asserts keyword introduces the semantics of the operators, including the types of the arguments for the operators. The formal definitions for the operations of SymbolTableElement are used in the next section in the discussion of the abstraction process. Figure 7.2 contains a graphical depiction of the SymbolTable ADT using the OMT notation that shows SymbolTable as an aggregate of zero or more Symbol- TabloEloments and zero or one SymbolTables, where aggregation is symbolically represented by a diamond, the “zero or more” relation is represented by the filled circle, and the “zero or one” relation is represented by the hollow circle. [SymbolTable T SymbolTableElement Legend 0-— Aggregate Relation —0 One to Zero or One Relation —. One to Zero or More Relation Figure 7.2. OMT Object Model of SymbolTable 47 7 .2.2 Statements, Rules, and Formal Specifications Section 7.1.2 defined the nesting property of programming statements. For abstraction purposes, it becomes useful to recognize that nested statements are contained, to some extent, within a single statement. For instance, consider the code sequence given in Figure 7.3. This sequence has eleven separate statements including the begin-end sequences. At the highest level of abstraction this sequence has one statement, the outer begin-end statement. The next level of granularity contains three statements, the assignment statement at line 2, the if statement beginning at line 3, and the assignment statement at line 14. The if statement beginning at line 4 contains two statements in the form of an assignment statement and a while statement. 1 begin 2 n :8 y; 3 if D then 4 if L then 5 n :8 u; 6 else 7 uhile i <> n 8 begin 9 x:Ix+t[1]; 10 i :8 i + 1 11 end 12 else 13 n :8 e; 14 q :8 n; 15 end Figure 7.3. Example Sequence of Pascal Code An object model is a representation in OMT that defines the hierarchy and static relationships of object classes. An instance object model depicts a snapshot of the dy- namic relationships between objects in a system. Figure 7.4 contains the object model of the Statement ADT, where the triangle symbol denotes inheritance. As such, the diagram explicitly states that Sequence, Assignment, Alternation, Iteration, 48 Stucment .1... m... “.1... .1... [i _l in :3 ‘ c‘l‘t‘fis‘n Boolean l Exmsion , Legend BiennibclueedA 1* OuetoOnea'MoroRellia Figure 7.4. Object Model of Statements and ProcedureCall are Statements. A GuardodCommand, also depicted in the diagram, is a construct that is only executed when a given boolean expression (called the guard) is true. Furthermore, the diagram depicts the nesting property through the use of the aggregation symbol. Notice that Sequences can contain zero or more Statements, and Alternation and Iteration statements can contain one or more GuardedCommands. Figure 7.5 shows the object instance model for the code se- quence of Figure 7.3. Each oval represents a particular instance of an object class and the parenthesized words identify the type of the instance. The remaining label gives information about the instance. An implementation of the rules given by Expressions (4.1), (4.2) and (4.4) rely heavily on the assumption that nested programming statements can be abstracted 49 ns-xetlil in 1 e 1 Figure 7.5. Object Instance Model for code sequence of Figure 7.3 into single statements with autonomous contexts that depend only on sequence and procedure calls. In order to support the abstraction of specifications from program- ming constructs, the notion of a referencing environment local to a programming statement is introduced. It is important to note that in order to abstract away the details underlying an implementation, it becomes necessary to treat programming statements as “mini-programs” with the assumption that programming statements are single entry, single exit. Currently, this assumption effectively excludes goto state- 50 ments but facilitates the hierarchical management of the abstraction process using SymbolTable objects, where one SymbolTable is used per nested sequence. One of the main purposes of the SymbolTable ADT is to support the con- struction of specifications. Recall that the sp of the assignment statement is sp(x:8oxpr,Q) = (3x0 :: :0 A x = ego). In order for the conjunctive expression :0 A x = ego to be satisfied, expressions must be partially evaluated. The history capabilities of a SymbolTable directly supports the evaluation of expressions by pro- viding information about the values of various instances of a given identifier and by providing a means for storing the effects of an assignment statement. Expressions are evaluated using textual substitution of the value of the last instance of each identifier contained in the expression. For example, an expression typically found in a program might be as follows q+r—s+t (TU An expression can be processed such that the identifiers in the expression are replaced with the representation of the last instance of each identifier. Expression (7.1), for example, can be translated into an internal representation such that each identifier in (7.1) is replaced with the corresponding internal representation of the last instance for the identifier. If it is assumed that the last instance for q, r, s, and t are q;, r3, so, and t1, respectively, then Expression (7.1) would be translated into the following ‘11 + 7‘3 — so + ii, (72) ‘h instance of an identifier. For example, in where the subscripts represent the i Expression (7.2), r3 represents the third instance of identifier r. Once the initial substitution of the last instance for each identifier is completed, the representation of an instance of an identifier is replaced with the actual value of the instance. For illustration purposes, assume that the values of instances ql, r3, so, and t; are 5, 9, so, and x1, respectively. Then Expression (7.2) would be translated into the following 51 5+9—30-I-231. (7.3) This process repeats until all terms of the expression are either initial instances (e.g. so), conditional instances, (that is, the instance depends on the evaluation of a number of logical conditions, a concept explained in the following section), or constant values. Consider the code sequence in Figure 7.6(a). Using a bottom-up approach to spec- o. (oxo-xsuo) 1. if (e 8 b) then 2. if (c 8 d) then 0. (0 n0 8 X h 0 t) 3. begin 1 if I then 4. x :8 p; 2 ithheu 4.1 (0 :18p200) 3 begin 5. n :8 q 4. n :8 p; 5.1 (8 x2 8 q t 0 8) 5. n :8 q 6. end 6 end 7. else 7 else 8. begin 8 begin 9. n :8 r; 9 n :8 r; 9.1 (O xi 8 r I U 0) 10. n :8 x + q; 10. n :8 n + q; 11. n :8 n - p; 10.1 (8 (s2 8 xi + q) I 0 e) 12. end 11. n :8 n - p; 13. else 11.1 (8 (x3 8 n2 - p) I U 8) 14. x :8 s 12. end 18. (0 ??? 0) 13. else 14. n :8 s 14.1 (8 81 8 s 2 U 8) 15. (0 ??? e) (a) (b) Figure 7.6. Code sequences partially annotated with specifications ify the sequence requires that the deepest level of nesting be specified first, followed by the next to deepest level, and so on. This process yields the annotated code shown in Figure 7.6(b), where specifications are delimited with Pascal comment notation, ‘ (8 8) ’. In our notation we use ‘h’ and ‘ l ’ to denote logical ‘and’ and ‘or’, respectively. Combining the specifications of lines 4.1, 5.1, 9.1, 10.1, 11.1, and 14.1 can often lead to an incoherent specification for line 15 due to the multiple instance subscripts for 52 identifier x. As mentioned earlier, each level of nesting in our approach is managed by using separate referencing environments through the use of SymbolTable objects. Using this approach we define the notion of dirty sets and last instances, which will aid in the definition of methods for correctly combining the specifications of nested statements. These methods can then be used to determine the specification such as that required for line 15. Definition 1 (Last Instance) Assuming that a begin-end sequence is single entry, single exit, then a last instance is the most recent value of any identifier accessible during the context of the sequence. In some cases, the last instance may be the same as the initial instance, while in other cases the last instance is different. Given that a SymbolTable object, called symtab, is used to manage an arbitrary level of nesting, we define the last_instance of an identifier id to be retrieve(id,lastIndex(id)) id 6 symtab.symbols lastinstance(id, symtab) E { undefined otherwise where retrieve(id,lastIndex(id)) represents the value obtained by referencing the last item in the history table for identifier id, and the identifier id must be accessible in the current environment. Definition 2 (Dirty Set) A dirty set is a construct that can be used to determine whether an assign- ment statement has been performed on an identifier in a nested sequence of statements. In addition to being useful for combining specifications of nested statements, a dirty set also aids in simplifying specifications. Formally, a dirty set is defined in the following manner 53 lastinstance(y, orig) # lastinstance(y,target) ’ dirty_set(orig,target) E {y e orig I where orig and target are SymbolTablos of the containing and contained nested statements, respectively. The method for abstracting specifications from Pascal alternation statements uses three SymbolTablos. The main SymbolTable is used to manage the entire alterna- tion construct while two auxiliary SymbolTablos are used to manage the identifiers of the guarded command constructs, assuming that the alternation statement is com- posed of two guarded commands, that is, one guarded command corresponds to the if case, and the other corresponds to the else case. (For case statements, an auxil- iary SymbolTable object would be used to manage each separate case.) By using dirty sets, identifiers that are modified within the scope of the guarded commands contained within an alternation statement can be determined. That is, a set of all identifiers that are conditional are specified formally as follows I = {dirty_set(main, auxl) U dirty.set(main, aux2)}, where main represents the main SymbolTable, and aux] and aux; refer to the Sym- bolTablos used to manage the guarded commands of the alternation statement. Upon determining the identifiers that fit this criterion, the main SymbolTable is updated in order to satisfy the following condition: (Vi: i E I : (31'sz aux; :i.name = j.nameA lastinstance(j, auxl) = lastinstance(i, orig)) V (3j :j E aux; : i.name = j.name A lastinstance(j, auxg) = lastinstance(i, orig))) which states that all identifiers in the dirty set are in one of the auxiliary Symbol- Tables. Finally, with knowledge about the values of last instances of identifiers that would result from the execution of nested subblocks, a formal specification can be completed. 54 Consider again the sequence of code given in Figure 7.6. The final version of the code with annotated specifications of the alternation statements using the notation id{nost}instanco for identifiers, where id is the identifier in question, nest is the level of nesting, and instance is the instance number within the current context, is given in Figure 7.7. 1. if (e 8 b) then 2. if (c 8 d) then 3. begin 4. n :8 p; 4.1 (0 (1(311 8 p0) I 0 t) 6. n :8 q; 5.1 (0 (si3l2 8 qO) I 0 t) 6. end 6.1 (0 ((xtan - p0) s (11(3): - (10)) s u .) 7. else 8. begin 9. n :8 r; 9.1 (8 (si3)1 8 r0) I 0 0) 10. n :8 (n + q); 10.1 (0 (11(3): - to + «10) s u t) 11. n :8 (u - p); 11.1 (8 (si3l3 8 ((r0 + qO) - p0)) I 0 0) 12. end 12.1 (0 ((s{3}1 8 r0) I (1(312 8 r0 r qO) I (31313 8 r0 e qO - p0)) I U 0) 12.2 (0 (((co 8 d0) I ((x(3}1 8 p0) I (still 8 q0))) I 12.3 ( not(c0 - «10) e (txtsn - r0) s (11(3): - r0 + qO) 12.4 I (1(111 8 r0 4 qo - p0)))) I U 8) 13. else 14. n :8 s; 14.1 (0 (x(1)1 8 s0) I 0 t) 15. (0 (((eo - 1.0) e (((co . «10) 1 «1(3): - p0) s (xtoh - q0))) I 15.1 (not(c0 8 d0) I ((x{3}1 8 r0) I (3(3l2 8 r0 + qO) I 15.2 (1(011 8 (r0 + qO - p0)))))) I 15.3 ((not(u0 8 b0)) I (x{0)1 8 sO))) I 0 8) Figure 7.7. Example with specifications annotated by the AUTOSPEC [3] system 7 .3 Example The following example demonstrates the use of four major programming constructs described in this paper (assignment, alternation, sequence, and procedure call) along 55 with the application of the translation rules for abstracting formal specifications from code. The program, shown in Figures 7.8(a) and 7.8(b), has four procedures, including three different implementations of “swap”. AUTOSPBC [3, 15, 13] is a tool that we have developed to support the derivational approach to the reverse engineering of formal specifications from program code. Figures 7.9 and 7.10 depict the output of AUTOSPEC when applied to the program code given in Figures 7.8(a) and 7.8(b), respectively, where the notation id{scopo}instanco is used to indicate a variable id with scope defined by the referencing environment for scope. The instance identifier is used to provide an ordering of the assignments to a variable. The scope identifier has two purposes. When scope is an integer, it indicates the level of nesting within the current program or procedure. When scope is an identifier, it provides information about variables specified in a different context. Of particular interest are the specifications for the swap procedures given in Fig- ure 7.9 named suapa and svapb. The implementation of suapa uses an arithmetic based algorithm for swapping the values of two Variables, while swapb uses a tem- porary variable algorithm. Although each implementation of the swap operation is different, the code in each procedure effectively produces the same results, a prop- erty appropriately captured by the respective specifications for suapa and svapb. In addition, Figure 7.10 shows the formal specification of the funnyswap procedure. The parameter passing scheme used in this procedure is pass by value, a property reflected by the specification of the effects of the call to funnyswap in Figure 7.10. In the specification, no variables local to the scope of the call to funnysvap are af- fected, and thus the specification shows no change in variable values. The procedure Findl‘laxliin provides another example of the specification of alternation statements, with the specification of the procedure shown in Figure 7.9, and the effect of the call to the procedure given in Figure 7.10. This example illustrates the use of textual substitution in the specification of the eflects of a call to a procedure that computes 56 the maximum and minimum of two variables. progran Iainn ( input. output ); var a, b. c. Largest. Seallest : real; procedure Findlanlin( lunOne, lunfnozreal; var Ian. linzreal ); begin if lunone > luntuo then begin Ian :8 luIDne; Iia :8 lunfuo; end else procedure funnysnap( lzinteger; Yzinteger ); begin Nan :8 lunTno; var lin :8 lunnne; tenp : integer; end begin end; tenp :8 X; l :8 Y; procedure euapa( var Xzinteger; var Yzinteger ); Y :8 tenp end; begin 7 :8 Y 0 I; begin 1 :8 Y - X; a :8 6; Y :8 Y - l; b :8 10; end; suapa(a.b); euapb(a.b); procedure seapb( var Xzinteger; var Yzinteger ); funnysuap(a,b); VDI' tenp : integer; (a) Findlanflin(a,b,Largest.Snallest); c :8 Largest; end. (*3) Figure 7.8. Example Pascal program 57 program lanlin( input, output ); var a. b, c. Largest, Smallest : real; procedure Findlanlin( lumDne, lumfuozreal; var Ian, linzreal ); begin if (lun0ne > lumTIo) then begin lax :8 lumone; (8 lan{2)1 8 lumOneO I U 8) lin :8 lumTuo; (0 Nin{2}1 8 lumTuoO I 0 8) end (8 (lan(2}1 8 lunOneO I lin{2}1 8 lumTuoO) I U 0) else begin Ian :8 lamina; (0 lan(2}1 8 lumTuoO I U 8) Nin :8 lumDne; (0 Nini2ll 8 lumOneO I U 8) end (8 (lan{2}1 8 lumTuoO I Nin{2}1 8 lumOneO) I U 8) (8 (((lumOneO > lumTuoO) I (IIIIOI1 8 IanneO I lini011 8 IIITIOO)) l (not(lum0ne0 > lent-00) I (laxf0)1 8 lumfuoo I Iin{0}1 8 lum0ne0))) I U 8) end (8 (((lumOneo > lumTuoO) I (II8{O)1 8 Inloneo I Iini0}1 8 IuITI00)) I (not(lum0ne0 > lumTuoO) I (laxiol1 8 IumTuoO I lin{0}1 8 lumOne0))) I U 0) procedure suapa( var X:integer; var Yzinteger ); begin 7 :8 (Y + X); (0 (7(011 8 ('0 + 10)) I U a) X :8 (Y - X); (0 (1(0): - ((10 + 10) - 10)) a u a) Y :8 (Y - X); (8 (YIOI2 8 ((70 + X0) - ((70 + 10) - 10))) I 0 t) and (8 (Y{O}2 8 to I 1(0): 8 10 I 1(0}1 8 to + 10) I 0 8) procedure suapb( var lzinteger; var Yzinteger ); var temp : integer; begin temp :8 X; (0 (temp{0}1 - 10) a u o) X :8 Y; (0 (H0)! 8 to) a 0 0) Y :8 temp; (— (1(0): - 10) a u 0) end (0 (7(0)1 8 10 I 2(0)1 8 Y0 I tenp{0}1 8 10) I U 0) Figure 7.9. Output created by applying AUTOSPEC to example 58 procedure funnysuap( l:integer; Yzinteger ); var temp : integer; begin temp :8 X; (8 (temp{0}1 8 10) I U 8) l :8 Y; (8 (XiOI1 8 Y0) I 0 8) Y :8 temp; (8 (Yi0)1 8 20) I 0 e) and (8 (rte): . so a 1(0): - to a temp{0}1 - 10) a u .) begin a :8 5; (8 I(0}1 8 5 I U 8) b :8 10; (8 b{0}1 8 10 I U 8) suapa(a,b) (8 (DIOIZ 8 5 I (a(0)2 8 10 I (Yienapa)2 8 5 I (stuapa}1 8 10 I Y(suapa}1 8 15)))) I 0 8) suapb(a,b) (8 (5(0): 8 10 a (a(0)3 8 5 I (Y{suapb}1 8 10 I (steapbli 8 5 I tempfsuapbli 8 10)))) I U 8) funnysuap(a,b) (8 (Y{O}1 8 5 I lifunnysuapli 8 10 I temp{funnysuap}1 8 5) I U 8) Findlanflin(a,b,Largest,Snallest) (8 (3mallest{0}1 8 Nianindlaxlinli I Largest{0}1 8 Iaaniudhanlinli I (((s > 10) a (NasfFindlanflinli 8 5 I liniPindNanliali 8 10)) I (not(5 > 10) I (lan(Findlanlin)1 8 10 I lianindNasflinli 8 5)))) I U 8) c :8 Largest; (8 chI1 8 Ianffindlanlinll I U 8) end (8 ((CIOI1 8 IIIIFindlanlinII) I ( Smallest{0}1 8 lin{Findlanlin}1 I Largest{0}1 8 Ian{FindIanlin}1 I (((5 > 10) I (laxiFindlanlinl1 8 5 I IintFindlanfliII1 8 10)) l (not(5 > 10) I (laxifindlaxflinli 8 10 I liniFindflanlinli 8 5)))) I ( Y{funnysuap)1 8 5 I X{funnysuap)1 8 10 I temp{funnysuap}1 8 5 ) I ( 5(0}3 8 10 1 IN” 8 5 a (Yieuapbli 8 10 I stuapbll 8 5 I tempfsuapbli 8 10)) I ( bi0}2 8 s a a{0}2 8 10 I (stuapal2 8 5 I stuapal1 8 10 I Y{suapa)1 8 15)) I (b{0}1 8 10 I a(0)1 8 5))))) I U 8) Figure 7.10. Output created by applying AUTOSPEC to example (cont.) CHAPTER 8 Related Work Many approaches have been suggested for reverse engineering program code. This chapter discusses reverse engineering efforts that use formal and semi-formal ap- proaches for the reverse engineering of program code into procedural and object- oriented formal specifications. 8.1 Formal Approaches This section describes reverse engineering methods that take advantage of logical and mathematical properties of programs for the abstraction of program function as well as in the representation of program specifications. 8.1.1 Function Abstraction It has been shown that program flowcharts can be decomposed into basic constructs that represent sequence, alternation, and iteration [36]. Using these concepts, a strategy outlining an approach for abstracting function behaviour from programs has been suggested [37]. The strategy takes advantage of the mathematical properties of structured programs in order to to abstract program function. The end result is the determination of a precise representation of a program’s function. At the 59 60 heart of the approach is the concept of proper programs, prime programs, and small primes. A proper program is a flowchart program with a single entry and single exit. An irreducible proper program is known as a prime program, and a small prime consists of the most atomic of program statements including sequence, alternation, and iteration. Using the approach defined by Linger et al. a specification is constructed by de- termining the function of each small prime contained in a program. Overall, the approach to abstracting program function proceeds as follows: 1. Restructure the object program into a structured format. 2. Analyze the program data in order to localize the scope of variables. 3. Compute the function of each prime. The restructuring step is essential since the rest of the approach relies on structured programming constructs as the basis for function abstraction. Special emphasis is put on the second step in order to reduce the difficulty of abstracting the specifica- tions from the code. The final step involves a number of algorithms centered around identifying the function of each program prime. Sequence statements, at the most basic level, consist of assignment statements of the form: where x is a variable and e is some expression. A trace-table technique is used for determining the function of sequence statements. The purpose of the trace-table is to record the effect of an assignment on the state space of all variables in the same 1 1'. .._-.— fiud- " 61 scope of the statement. For example, consider the following sequence of code: X := X+Y; Y = X—Y, X := X—Y; A trace-table containing a column for each variable within the current scope of exe- cution can be constructed using a subscripted notation (i.e. X ,-) to denote the value of a variable X in each separate state i. The trace table for the sequence shown above would appear as [37]: Statement Value of X Value of Y Initially X0 Y0 X2=X+Y X1=X0+Y0 Yi=Yo Y2=X-Y X2=X1=X0+Yo Y2=X1-Y1=Xo X:=X-—Y X3=X2-Y2=Yo Y3=Y2=Xo Analysis of the trace-table allows for the determination that the sequence of code performs a swap of the values of X and Y. Alternation abstraction is based on analyzing the guarded command conditions of the statements. An alternative statement is one that takes the form: IF ETHEN CI ELSE C2 ENDIF where B is a guard and Cu is a guarded command. The specification created from abstracting the function from alternative statements involves the use of the conditions in the statement and the specification of the subsequent guarded statements. For example, consider the following statement IF b THEN g ELSE h END-IF, 62 where b is a guard, and g and h are statements. A specification is constructed with the following form ([b] = true -> [gIIIb] = false —+ [h]), which states that whenever [b] is true, perform statement g, and whenever [b] is false, perform statement h. Linger et al. describe a number of strategies for understanding the function of iterative constructs. One approach is through the use of program slicing. Program slicing is a technique whereby the effect of a loop on an isolated variable is specified for each of the variables in the loop. Upon completion, the specifications are com- bined to in order to determine a specification for the entire loop. Another proposed approach for understanding loops uses pattern matching with the help of superpat- terns. Superpatterns are templates that outline common formats of programming constructs, particularly loops. This proceeds by performing program slicing on a loop and comparing the results with a number of standard superpatterns. 8.1.2 A Knowledge Based Transformational System Transformational systems have been widely used for forward engineering. The appli- cation of transformational systems to reverse engineering has been proposed by Ward et al. [38, 39, 40]. The approach relies on proving that two versions of a program are equivalent. A kernel language that includes both imperative programming constructs and general specifications as part of a single language is defined. By using a kernel or Wide Spectrum Language (WSL) that is mathematically based, the problem of ab— stracting a specification of a program can be reduced to first transforming a program into another form (a specification) and then proving the equivalence of the original and the new form. The Maintainer’s Assistant is a system that incorporates the notions of Wide 63 Spectrum Languages and a Knowledge Base for performing transformations of pro— gram code into specifications. The rest of this section outlines the major aspects of the Maintainer’s Assistant. Wide Spectrum Languages Software development classically proceeds through multiple stages ranging from re- quirements specification to program development. Figure 8.1 illustrates this con- cept [39]. Typically each stage uses a different language to express the problem. For instance, natural language may be used for requirements stage So while a formal specification language may be used for specification stage Sn as can be seen in F ig- ure 8.2 [39]. A WSL provides a unified language for use in all stages starting from So and ending in P. The WSL encorporates both executable imperative constructs as SOHSIH52H...HSnHP Figure 8.1. Stages of Development SLOHSLlHSLzH...HSLnHPL Figure 8.2. Languages for Development well as non-executable specifications. By providing a strong mathematical basis for a 64 WSL, proving a transformation can be performed by proving the equivalence of two formulae [39]. Knowledge Base To assist in performing transformations of programs into specifications the Main- tainer’s Assistant encorporates the use of a knowledge base [40]. This knowledge base is centered around identifying programs plans. Program plans can be divided into two classes, General Program Plans and Program Class Knowledge. Use of a general plan knowledge base aids in identifying commonly occuring activities in pro- grams. Program Class knowledge is used for specific types of activities. For instance, program class knowledge could exist for maintaining a C compiler. This knowledge would include information on design decisions typically found in developing a com- piler. Tbansformations Transformation has been used in many instances for forward engineering tasks [41]. Transformation is the action of substituting part of a program with a new part that is meaning preserving (i.e., the new part computes the same thing as the old part). There are typically three steps to transformations. The first is a determination of where to apply a transformation, next is a determination of what transformations to apply, and finally creation of a new program part to replace the matched pattern. Transformations can take two forms, vertical and lateral. Vertical transformations with respect to program development are used to move from an abstract representa- tion to a concrete one while lateral transformations are used to specify equivalence between two expressions at the same level of abstraction. There are several types of transformations supported by the Maintainer’s Assis- tant. Two approaches are used to perform transformations; the catalog approach, 65 and the generative set approach. The catalog approach uses the knowledge base de- scribed in the previous section. The generative set approach uses a set of elementary transformations to build more complex transformations. The Maintainer’s Assistant has the ability to perform the following types of trans- formations: Syntactic Tl'ansformations: This type of transformation is synonymous with re- structuring. , Action Systems: These transformations are used to simplify languages that allow goto statements. Non-Terminal Recursion: Used to remove non-terminal recursion by using a stack of “postponed obligations”. Local Semantic 'D'ansformations: These are transformations for changing the flow of computation. 8.2 Ob ject-Orientation Object-Oriented Analysis, Design, and Programming have become increasingly pop- ular in recent years. The encapsulation, inheritance, and reuse properties of Object- Oriented Programming has made the approach attractive to many software devel- opers. A desire to migrate existing systems from an imperative paradigm to an ob ject-oriented paradigm has prompted the development of a number of reverse engi- neering methods for facilitating the migration. This section describes two approaches for abstracting objects from imperative programs. 8.2.1 Object Identification One approach to identifying objects in procedural code has been proposed by Liu and Wilde [14]. Identification is centered around the characterization of candidate objects based on common routines, data types, and data items. These properties are .__._..--“A A —— I -. l 66 arranged as a tuple of the following format: CandidateObject = (F, T, D) where F is a set of routines, T is a set of types, and D is a set of data objects. Candidate objects may not be completely disjoint which means that there may be an overlap between data objects and routines. Additionally, there is no clear distinction between an object and an object class. This definition of a candidate object differs from the classical notion of objects but is necessary. Candidate objects can overlap since rejection would limit the domain and exclude the manifestations of objects whose only failing is that they were produced using poor programming practices. The lack of a clear distinction between classes and class objects is due to the fact that it may be easier in some cases to identify object classes and then identify instances of the classes. In other cases it may be easier to reverse this process by identifying the instances of the classes, and then the classes. There are two methods of object identification. The first uses global and static data and relates operations that operate on that data. The second uses data types and relates routines that use the data types as parameters or return values. Global Based Object Identification The method for identifying candidate objects based on global and static data uses the following steps [14]: 1. For each global variable x, let P(x) be the set of routines that use x. 2. Using each P(x) as a vertex, construct a graph G = (V,E) such that: V = {P(x) I x is shared by at least two routines} E = {P($1)P($2) I (”(3001712) # Q} where P($1)P($2) is an edge joining two vertices P(xl) and P(l‘g). 67 3. Construct a candidate object tuple (F, T, D) from each strongly connected com- ponent C = (VC,E'C) in G such that: F = UP(x)ch P(J’) T 9 D = UP(x)€V¢{x} The major drawback to using the global based object identification method is that the candidate objects that are found can often be very large since all functions that reference a global variable will be included in the F component of the candidate object tuple. Other procedures will be needed to constrain the search so that acceptable objects are identified. Typos Based Object Identification The second type of object finding method is a types-based algorithm. The steps to this method are as follows [14]: 1. (Ordering) A topological order of all types must be defined such that: (a) If type x is used to define type y, then x is a sub-type of y and y is a supertype of x, denoted by x < y (b) If x < y and y < 2 then x < z (Transitivity). 2. (Initial Classification) A relationship matrix R(F, T) is constructed where rows are routines and columns are types of formal parameters and return values. Values of R(F, T) are either 0 or 1 with 1 indicating that for an entry R(f,t), t is a subtype of the type of a formal parameter or of a return value of routine f. 3. (Classification Adjustment) For each row f in the matrix R, set R(f,t) to 0 if there are any other supertypes of t in the same row that has been marked 1. This handles the fact that a supertype dominates the role of classification. 4. (Grouping) Collect routines into groups based on the sharing of types. Routines f1 and f; are related if there exists a type t such that R( f1, t) = R(f2,t) = l. 5. (Construction) Construct a candidate object (F, T,D) from each group where: F = {fl the routine [is a member of the group} T = {t I R(f,t) = I for somefin F} D = a 68 As was the case with the previous method, the types based object finding method can identify classes that are too large. 8.2.2 REDO - Objects REDO (Restructuring, Maintenance, Validation and Documentation of Software Sys- tems) is an Espirit II project whose objective is to improve applications by making them more maintainable through the use of reverse engineering techniques. The ap- proach in reverse engineering of COBOL involve the development of general guidelines for the process of deriving objects and specifications from program code as well as pro- viding a framework for formally reasoning about objects [4, 42, 43]. This approach also involves the identification of objects as well as the systematic specification of classes using an extension to the Z notation, known as Z++. The approach involves three stages defined as follows 1. Translation of COBOL to UNIFORM, an intermediate language 2. Creation of outline objects through the use of data flow diagrams. Code is partitioned into single-entry, single-exit functions called phases. These phases are abstracted into functions. 3. Simplifying transforms are applied to the abstracted functions and the objects identified in the previous stage. The functions are incorporated into the objects and a formal specification using Z and Z++ is created. 'D'anslation from COBOL to UNIFORM The translation to UNIFORM is performed in order to add extra information about the subject system and to eliminate redundant or obsolete programming structures. 69 Higher Level Abstractions A technique known as phasing is used to for identifying flows of information from data structure to data structure. For example, consider the code contained in Figure 8.3 [4]. The phasing process identifies the following phases: OPEN INPUT filo-1; PERFORM process file-1; OPEN OUTPUT file-2; OPEN OUTPUT filo-3; OPEN INPUT filo-4; PERFORM process filo-4; CLOSE filo-1; CLOSE file-2; CLOSE filo-3; CLOSE file-4; Figure 8.3. COBOL File Operations [4] e file-1, file-4 —> file-2, file-3 e file-4 —-> file-2, file-3 e file-4 —-) file-3 These phases identify the possible flows of information from the data structures on left hand side to data structures on the right hand side. Identifying phases in a program provides insight to the existence of objects. A technique for abstracting the functionality of program phases is used to reduce programming structures into two essential constructs [44]: functional composition and conditional expressions. 70 Simplification and specification Once data types, the attributes of the data types, and the operations of the data types have been abstracted from code they are formalized by generating a specification in a formal notation. The Z++ is an extension of the Z notation that adds object- oriented properties to the specification language. A Z++ schema consists of four distinct sections for describing attributes (OWNS), predicates of the properties of the internal state of a class object (IN VARIA N T), types of operations available to a class (OPERATIONS), and definitions of the operations available to a class (ACTIONS). The use of a formal notation provides advantages over the approach described in Section 8.2.1 by allowing for the formal reasoning about the objects once the specification has been created. 8.3 Comparison of Approaches The approach to reverse engineering described in this thesis uses a translational tech- nique based on the semantics of the strongest postcondition predicate transformer. This differs from the Functional Abstraction approach of Linger et al. in that their approach is informal and assumes that the combination of specifications constructed for small primes will result in a formal specification for a whole program. The Maintainer’s Assistant is a tool that uses a combination of transformation and knowledge based approaches. This differs greatly from the approach presented in this thesis in that much of the burden of specification construction relies on choosing transformations to apply to a program written using a WSL. The potential number of transformations needed to construct a specification for a program can be large, even for small programs. The approach suggested by Liu and Wilde for identifying objects in program code provides the basis for the approach described in this thesis. Our approach, however, 71 differs in that we use formal specifications as a means for identifying objects, and the level of granularity differs greatly due to the fact that we consider structured types only. The approach taken by the REDO project for identifying objects is tightly coupled to COBOL and UNIFORM. Their use of formal specifications in the form of Z and Z ++ is, however, attractive because the formal notations allow for formal reasoning. CHAPTER 9 Conclusions and Future Investigations The level of abstraction of specification constructed using the techniques described in this thesis are at the “as-built” level, that is, the specifications contain implemen- tation specific information. For straight-line programs (programs without iteration or recursion) the techniques described herein are complete in that a fully automated construction of a formal specification from program code is achievable. As such, automated techniques for verifying the correctness of straight-line programs can be facilitated. The Object Modeling Technique (OMT) is a method for modeling system require- ments through the use of three complementary diagramming notations. The OMT notation is attractive because it provides an easy to use, visually-oriented notation for capturing system requirements and other types of high level information about a system. In practice, the lack of formalism in the notations can be problematic. However, a formalization of OMT in terms of algebraic specifications has recently been developed [45] for describing the states and state transitions of a system. Future investigations will focus on three areas. First, a method of reverse en- gineering that encompasses all major facets of imperative programming constructs, 72 73 including iteration and recursion will be developed. Second, methods for constructing higher level abstractions from lower level abstractions will be investigated. Finally, a rigorous technique for re—engineering specifications from the imperative programming paradigm to the object-oriented programming paradigm will be created. Iteration and recursive procedure calls pose potential difficulties since we need to construct a postcondition with respect to bound functions, 100p invariants, and termination conditions. The investigations into this component will involve extending the existing rules for assignment, alternation, iteration, and procedure calls to include a generalized procedure for handling iteration and recursion. “As-built” specifications have the property of being too tightly coupled to the pro- grams that they describe. That is, they suffer from implementation bias. The second component of this research is an investigation into creating more abstract representa- tions of the “as-built” specifications. Diagrams, in the form of statecharts, data flow diagrams, and object diagrams, provide an informal specification of systems and can be constructed both manually and automatically. OMT is a notation that uses each of these diagramming techniques. A formal treatment of OMT in terms of algebraic specifications coupled with the constraints provided by “as-built” specifications will be used to facilitate the task of abstracting higher level specifications from the lower level “as-built” specifications. Additionally, this integration of techniques will provide the basis for constructing high level requirements in terms of object-oriented models and algebraic specifications. A prototype to support the construction of “as-built” specifications will be devel- oped. As each subsequent component of the proposed investigations is performed (i.e. construction of higher levels of abstraction and creation of object-oriented models), the prototype will evolve to reflect the new capabilities. The results of the investiga- tions will be encapsulated in the form of a prototype that will support the methods for the partial automation of the construction of object-oriented specifications by way of 74 a shift in paradigms. That is, the main target is to develop a tool that will aid in the construction of algebraic specifications from programs. This will involve using “as- built” specifications as constraints for identifying objects embedded in specifications of imperative program code and specifying objects using an algebraic specification language. That is, we will be performing the re—engineering of formal specifications for an imperative system into a formal specification for an object-oriented system. Using the integration described above for OMT and the formal strategy for reverse engineering, a set of transformations will be developed for reconstituting the high level specifications for the original imperative system into a high level specification for an object-oriented system. BIBLIOGRAPHY BIBLIOGRAPHY [l] D. Gries, The Science of Programming. Springer-Verlag, 1981. [2] E. Byrne, “A Conceptual Foundation for Software Re-engineering,” in Proceed- ings for the Conference on Software Maintenance, pp. 226—235, IEEE, 1992. [3] B. H. Cheng and G. C. Gannod, “Abstraction of Formal Specifications from Program Code,” in Proceedings for the IEEE 3rd International Conference on Tools for Artificial Intelligence, pp. 125—128, IEEE, 1991. [4] H. Haughton and K. Lano, “Objects Revisited,” in Proceedings for the Confer- ence on Software Maintenance, pp. 152—161, IEEE, 1991. [5] N. G. Leveson and C. S. Turner, “An Investigation of the Therac-25 Accidents,” IEEE Computer, pp. 18-41, July 1993. [6] V. S. Flor, “Ruling’s Dicta Causes Uproar,” The National Law Journal, July 1991. [7] J. M. Wing, “A Specifier’s Introduction to Formal Methods,” IEEE Computer, vol. 23, pp. 8—24, September 1990. [8] S. L. Gerhart, “Applications of formal methods: Developing virtuoso software,” IEEE Software, vol. 7, pp. 7—10, September 1990. [9] N. G. Leveson, “Formal Methods in Software Engineering,” IEEE Transactions on Software Engineering, vol. 16, pp. 929—930, September 1990. [10] R. A. Kemmerer, “Integrating Formal Methods into the Development Process,” IEEE Software, vol. 7, pp. 37—50, September 1990. [11] A. Hall, “Seven myths of formal methods,” IEEE Software, vol. 7, pp. 11—19, September 1990. [12] B. H. Cheng, “Synthesis of Procedural Abstractions from Formal Specifications,” in Proc. of COMPSAC’9I, pp. 149—154, September 1991. 75 76 [13] G. C. Gannod and B. H. Cheng, “Facilitating the Maintenance of Safety-Critical Systems Using Formal Methods,” The International Journal of Software Engi- neering and Knowledge Engineering, vol. 4, no. 2, 1994. [14] S. Liu and N. Wilde, “Identifying Objects in a Conventional Procedural Lan- guage: An Example of Data Design Recovery,” in Proceedings for the Conference on Software Maintenance, IEEE, 1990. [15] G. C. Gannod and B. H. Cheng, “A Two Phase Approach to Reverse Engineering Using Formal Methods,” Lecture Notes in Computer Science: Formal Methods in Programming and Their Applications, vol. 735, pp. 335—348, July 1993. [16] F. L. Bauer, “Software Engineering,” in Information Processing 71, p. 530, North Holland, 1972. [17] R. S. Pressman, Software Engineering A Practitioner’s Approach. McGraw-Hill, 1992. [18] W. M. Osborne and E. J. Chikofsky, “Fitting pieces to the maintenance puzzle,” IEEE Software, vol. 7, pp. 11-12, January 1990. [19] E. J. Chikofsky and J. H. Cross II, “Reverse Engineering and Design Recovery: A Taxonomy,” IEEE Software, vol. 7, pp. 13-17, January 1990. [20] E. J. Byrne and D. A. Gustafson, “A Software Re-engineering Process Model,” in COMPSAC, ACM, 1992. [21] C. A. R. Hoare, “An axiomatic basis for computer programming,” Communica- tions of the ACM, vol. 12, pp. 576-580, October 1969. [22] E. W. Dijkstra and C. S. Scholten, Predicate Calculus and Program Semantics. Springer-Verlag, 1990. [23] J. Spivey, The Z Notation: A Reference Manual. Prentice Hall, 1987. [24] C. B. Jones, Systematic Software Development Using VDM. Prentice Hall In- ternational Series in Computer Science, Prentice Hall International (UK) Ltd., second ed., 1990. [25] E. W. Dijkstra, A Discipline of Programming. Prentice Hall, 1976. [26] F. Lindstrom, “Re-engineering of old systems to an ob ject-oriented architecture,” in OOPSLA, ACM, 1991. 77 [27] A. Winblad, S. Edwards, and D. King, Object-Oriented Software. Addison- Wesley, 1990. [28] J. Rumbaugh, M. Blaha, W. Premerlani, F. Eddy, and W. Lorensen, Object- Oriented Modeling and Design. Englewood Cliffs, New Jersey: Prentice Hall, 1991. [29] J. Guttag and J. Horning, Larch: Languages and Tools for Formal Specification. Springer-Verlag, 1993. [30] T. Pratt, Programming Languages: Design and Implementation. Prentice-Hall, 1984. [31] J. Ichbiah, “Rationale for the Design of the Ada Programming Language,” SIG- PLAN Notices, vol. 14, 1979. [32] N. Wirth, “The programming language Pascal,” Acta Informatica, vol. 1, pp. 35- 63, 1971. [33] P. Benedusi, A. Cimitile, and U. DeCarlini, “A Reverse Engineering Methodology to Reconstruct Hierarchical Data Flow Diagrams for Software Maintenance,” in Proceedings for the Conference on Software Maintenance, pp. 180—189, IEEE, 1989. [34] S. Leestma and L. N yhoff, Pascal Programming and Problem Solving. Macmillan, second ed., 1987. [35] A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986. [36] R. Linger, H. Mills, and B. Witt, Structured Programming: Theory and Practice. Addison-Wesley, 1979. [37] P. Hausler, M. Pleszkock, R. Linger, and A. Hevner, “Using Function Abstraction to Understand Program Behavior,” IEEE Software, vol. 7, pp. 55—63, January 1990. [38] M. Ward, F. Calliss, and M. Munro, “The Maintainer’s Assistant,” in Proceedings for the Conference on Software Maintenance, IEEE, 1989. [39] T. Bull, “An Introduction to the WSL Program Transformer,” in Proceedings for the Conference on Software Maintenance, pp. 242-250, IEEE, 1990. 78 [40] F. Calliss, M. Khalil, M. Munro, and M. Ward, “A Knowledge-Based System for Software Maintenance,” in Proceedings for the Conference on Software Mainte- nance, pp. 319—324, IEEE, 1988. [41] D. R. Smith, “KIDS: A Semi-automatic Program Development System,” IEEE Transactions on Software Engineering, vol. 16, pp. 1024—1043, September 1990. [42] K. Lano, “Formal Approaches to Reverse-Engineering,” Tech. Rep. 2487—TN- PRC—1067, Oxford University, March 1991. [43] K. Lano, “Test Results for the Reverse-Engineering Tool Set,” Tech. Rep. 2487- TN-PRG-1074, Oxford University, September 1991. [44] K. Lano and H. Haughton, “Extracting Design and Functionality from Code,” in Proceedings for the 5th International Workshop on Computer Aided Software Engineering, pp. 74—82, IEEE, 1992. [45] R. H. Bourdeau and B. H. Cheng, “A Formal Semantics for the Integration of Object and Dynamic Models,” Tech. Rep. CPS-93-32, Michigan State University, December 1993. MICHIGAN STATE UNIV. LIBRARIES IIIIIIIIIIIII]IIIIIIIIIIIIIIIII 31293010220261