This is to certify that the dissertation entitled AUTOMATIC SYNTHESIS OF FAULT-TOLERANCE presented by ALI EBNENASIR has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major Professor's Signature
Date

Michigan State University, East Lansing, MICH 48824-1048
MSU is an Affirmative Action/Equal Opportunity Institution

AUTOMATIC SYNTHESIS OF FAULT-TOLERANCE

By

Ali Ebnenasir

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2005

ABSTRACT

AUTOMATIC SYNTHESIS OF FAULT-TOLERANCE

By Ali Ebnenasir

Fault-tolerance is an important property of today's software systems, as we rely on computers in our daily affairs (e.g., medical equipment, transportation systems, etc.). Since it is difficult (if not impossible) to anticipate all classes of faults that perturb a program while designing that program, it is desirable to incrementally add fault-tolerance concerns to an existing program as we encounter new classes of faults. Hence, in this dissertation, we concentrate on automatic addition of fault-tolerance to (distributed) programs; i.e., synthesizing fault-tolerant programs from their fault-intolerant version.
Such automated synthesis generates a fault-tolerant program that is correct by construction, thereby alleviating the need for its proof of correctness. Also, there exists a potential for reusing the computations of the fault-intolerant program during the synthesis of its fault-tolerant version. In the absence of faults, the synthesized fault-tolerant program should behave similar to the fault-intolerant program. In the presence of faults, the synthesized fault-tolerant program has to provide a desired level of fault-tolerance, namely failsafe, nonmasking, or masking fault-tolerance. A failsafe fault-tolerant program guarantees safety even in the presence of faults. In the presence of faults, a nonmasking fault-tolerant program recovers to states from where its safety and liveness specifications are satisfied. A masking fault-tolerant program always satisfies safety and recovers to states from where its safety and liveness specifications are satisfied.

To provide a foundation for automatic synthesis of fault-tolerant programs, we concentrate on two directions: theoretical aspects, and the development of a software framework for the synthesis of fault-tolerant programs. The main contributions of the dissertation regarding theoretical aspects are as follows:

• We identify the effect of safety specification modeling on the complexity of synthesizing fault-tolerant programs from their fault-intolerant version.

• We show the NP-completeness of synthesizing failsafe fault-tolerant distributed programs from their fault-intolerant version.

• We identify the sufficient conditions for polynomial-time synthesis of failsafe fault-tolerant distributed programs.

• We design a sound and complete synthesis algorithm for enhancing the fault-tolerance of high atomicity programs - where program processes can atomically read/write all program variables - from nonmasking to masking.
• We present a sound algorithm for enhancing the fault-tolerance of distributed programs - where program processes have read/write restrictions with respect to program variables.

• We present a synthesis method for providing reuse in the synthesis of different programs, where we automatically specify and add pre-synthesized fault-tolerance components to programs.

• We define and address the problem of synthesizing multitolerant programs that are subject to multiple classes of faults and provide (possibly) different levels of fault-tolerance corresponding to each fault-class.

To validate our theoretical results, we develop an extensible software framework, called Fault-Tolerance Synthesizer (FTSyn), where developers of fault-tolerance can interactively synthesize fault-tolerant programs. Also, FTSyn provides a platform for developers of heuristics to extend FTSyn by integrating their heuristics for the addition of fault-tolerance in FTSyn. Using FTSyn, we have synthesized several fault-tolerant distributed programs that demonstrate the applicability of FTSyn for the cases where we have different types of faults, and for the cases where a program is subject to multiple simultaneous faults.

© Copyright by ALI EBNENASIR 2005

To my parents Hussein and Ezzat and my wife Niloofar for all their love and sacrifices.

ACKNOWLEDGMENTS

All thanks go to the almighty God who has endowed us the blessing of existence. I extend my regards to all people who have contributed to my education in any way, from primary school to higher education. First, I am truly grateful to Dr. Sandeep Kulkarni, whose guidance was always enlightening throughout my PhD program. Also, I thank the members of my PhD committee, Dr. Laura Dillon, Dr. Betty Cheng, and Dr. Jonathan Hall, who have always supported me by their valuable comments.
Furthermore, I appreciate all the efforts of the Computer Science and Engineering Department at Michigan State University towards creating a productive environment for research and education. Moreover, I would like to thank my teachers and advisors, to whom I am indebted for all their hard work and sacrifices in educating me and my fellow students: Dr. Mohsen Sharifi, my advisor in my Master's program at Iran University of Science and Technology, Tehran, Iran; Dr. Abbas Vafaei and Dr. Mustafa Kermani, my professors at the University of Isfahan, Isfahan, Iran; Mr. Fereydani, Mr. Khayyam, Mr. Nahvi, Mr. Riazi, and Mr. Nasr, my teachers in high school; Mr. Saljooghian in middle school; and finally my first grade teacher, Mrs. Afshari. Last but not least, I thank my fellow graduate students at the Software Engineering and Network Systems Laboratory at Michigan State University who have always supported me by (i) proofreading my manuscripts, (ii) providing valuable feedback on my research work, (iii) engaging in discussions, and (iv) attending my not-so-attractive talks. In particular, I appreciate the sincere collaboration of Laura Anne Campbell, Karun Biyani, Bru Bezawada, Sascha Konrad, Mahesh Arumugam, and Borzoo Bonakdarpour. Thank you.

TABLE OF CONTENTS

LIST OF FIGURES x

1 Introduction 1
1.1 The Outline of the Dissertation 6

2 Preliminaries 8
2.1 Program 8
2.2 Issues of Distribution 10
2.3 Specification 11
2.4 Fault 13
2.5 Fault-Tolerance 14
2.6 The Problem of Adding Fault-Tolerance 15
2.7 Synthesis of Fault-Tolerance in High Atomicity 17
2.7.1 Synthesizing Failsafe Fault-Tolerance 17
2.7.2 Synthesizing Nonmasking Fault-Tolerance 18
2.7.3 Synthesizing Masking Fault-Tolerance 19
2.8 Synthesis of Fault-Tolerant Distributed Programs 21

3 The Effect of Safety Specification Model on the Complexity of Synthesis 24
3.1 NP-Completeness Proof 26
3.1.1 Mapping 3-SAT to the Addition of Masking Fault-Tolerance 26
3.1.2 Reduction from 3-SAT 28
3.2 Summary 32

4 Synthesizing Failsafe Fault-Tolerant Distributed Programs 34
4.1 Problem Statement 35
4.2 NP-Completeness Proof 37
4.2.1 Mapping 3-SAT to an Instance of the Synthesis Problem 37
4.2.2 Reduction from 3-SAT 42
4.3 Monotonic Specifications and Programs 45
4.3.1 Sufficiency of Monotonicity 46
4.3.2 Role of Monotonicity in Complexity of Synthesis 50
4.4 Examples of Monotonic Specifications 51
4.4.1 Byzantine Agreement 52
4.4.2 Consensus and Commit 55
4.5 Summary 56

5 Fault-Tolerance Enhancement 58
5.1 Problem Statement 59
5.2 Enhancement in High Atomicity Model 61
5.2.1 Example: Triple Modular Redundancy 66
5.3 Enhancement for Distributed Programs 69
5.3.1 Example: Byzantine Agreement 75
5.4 Using Monotonicity for the Enhancement of Fault-Tolerance 81
5.4.1 Monotonicity of Nonmasking Programs 81
5.4.2 Example: Distributed Counter 85
5.5 Enhancement versus Addition 88
5.6 Summary 90

6 Pre-Synthesized Fault-Tolerance Components 92
6.1 Problem Statement 94
6.2 The Synthesis Method 95
6.2.1 Overview of Synthesis Method 95
6.2.2 Token Ring Example 98
6.3 Specifying Pre-Synthesized Components 101
6.3.1 The Specification of Detectors 101
6.3.2 The Representation of Detectors 102
6.3.3 Token Ring Example Continued 105
6.4 Using Pre-Synthesized Components 106
6.4.1 Algorithmic Specification of the Fault-Tolerance Components 106
6.4.2 Token Ring Example Continued 108
6.4.3 Algorithmic Addition of The Fault-Tolerance Components 108
6.4.4 Token Ring Example Continued 117
6.5 Example: Alternating Bit Protocol 118
6.6 Adding Hierarchical Components 126
6.6.1 Specifying Hierarchical Components 127
6.6.2 Diffusing Computation 128
6.7 Discussion 136
6.8 Summary 139

7 Automated Synthesis of Multitolerance 140
7.1 Problem Statement 141
7.2 Addition of Fault-Tolerance to One Fault-Class 144
7.3 Nonmasking-Masking Multitolerance 146
7.4 Failsafe-Masking Multitolerance 149
7.5 Failsafe-Nonmasking-Masking Multitolerance 152
7.5.1 Non-Deterministic Synthesis Algorithm 152
7.5.2 Mapping 3-SAT to Multitolerance 154
7.5.3 Reduction From 3-SAT 156
7.5.4 Failsafe-Nonmasking Multitolerance 160
7.6 Summary 161
8 FTSyn: A Software Framework for Automatic Synthesis of Fault-Tolerance
8.1 Adding Fault-Tolerance to Distributed Programs
8.1.1 The Input/Output of the Framework
8.1.2 Framework Execution Scenario
8.1.3 User Interactions
8.2 Framework Internals
8.2.1 Class Modeling
8.2.2 Design Patterns
8.3 Integrating New Heuristics
8.4 Changing the Internal Representations
8.5 Example: Altitude Controller
8.6 Summary

9 Ongoing Research
9.1 Program Transformation
9.1.1 Problem Statement
9.1.2 Transformation Algorithm
9.1.3 Soundness
9.2 Specification Transformation
9.2.1 Problem Statement
9.2.2 Transformation Algorithm
9.3 Example: Distributed Control System
9.4 SAT-based Synthesis of Fault-Tolerance
9.4.1 Synthesis Method
9.4.2 Representing Synthesis Requirements as Boolean Formulas
9.4.3 Implementing SAT-based Synthesis
9.5 Summary

10 Conclusion and Future Work
10.1 Discussion
10.2 Contributions
10.3 Impact
10.4 Future Work

APPENDICES

BIBLIOGRAPHY

LIST OF FIGURES

2.1 Synthesizing failsafe fault-tolerance in the high atomicity model.
2.2 Synthesizing nonmasking fault-tolerance in the high atomicity model.
2.3 Synthesizing masking fault-tolerance in the high atomicity model.
2.4 A non-deterministic algorithm for adding fault-tolerance to distributed programs.
3.1 The states and the transitions corresponding to the propositional variables in the 3-SAT formula. (Except for transitions marked as fault, all are program transitions. Also, note that the program has no long transitions that originate from a, and no short transitions that originate from q.)
3.2 The partial structure of the fault-tolerant program.
4.1 The relation between the invariant of a fault-intolerant program p and a fault-tolerant program p'.
4.2 The transitions corresponding to the propositional variables in the 3-SAT formula.
4.3 The structure of the fault-intolerant program for a propositional variable bi and a disjunction cj = bm ∨ ¬bk ∨ bl.
4.4 The value assignment to variables.
5.1 The enhancement of fault-tolerance in high atomicity.
5.2 Constructing an invariant in the low atomicity model.
5.3 The enhancement of fault-tolerance for distributed programs.
6.1 Overview of the synthesis method.
6.2 Automatic specification of a component.
6.3 Verifying the interference-freedom conditions.
6.4 The automatic addition of a component.
7.1 Synthesizing nonmasking-masking multitolerance.
7.2 Synthesizing failsafe-masking multitolerance.
7.3 A non-deterministic polynomial algorithm for synthesizing multitolerance.
7.4 The states and the transitions corresponding to the propositional variables in the 3-SAT formula.
7.5 The partial structure of the multitolerant program.
7.6 A proof sketch for NP-completeness of synthesizing failsafe-nonmasking multitolerance.
8.1 A deterministic execution scenario for the framework FTSyn.
8.2 The class diagram of FTSyn.
8.3 The Bridge design pattern.
8.4 The FactoryMethod design pattern.
8.5 Integrating the deadlock resolution heuristics using the Strategy pattern.
9.1 Transforming non-monotonic programs to positive monotonic.
9.2 Algorithms for removing deadlock states and ensuring the closure of the invariant.
9.3 Transforming non-monotonic specifications to monotonic.
9.4 Non-deterministic algorithm for adding fault-tolerance to distributed programs.
9.5 Using SAT solvers for the synthesis of fault-tolerant programs.

Chapter 1

Introduction

The anticipation of all classes of faults that may perturb a program is difficult (if not impossible). Thus, it is desirable to synthesize fault-tolerant programs from their fault-intolerant version upon finding new classes of faults. Although there exist efficient approaches [1] for the synthesis of high atomicity fault-tolerant programs - where processes can read/write all program variables in an atomic step - there exists a well-defined need for developing efficient techniques for the synthesis of (i) fault-tolerant distributed programs - where processes have read/write restrictions with respect to program variables, and (ii) multitolerant programs - where a program simultaneously provides different levels of fault-tolerance to different classes of faults. In this dissertation, we concentrate on the theoretical and the practical aspects of synthesizing fault-tolerant distributed programs and multitolerant programs.

To synthesize a fault-tolerant program from its fault-intolerant version, Kulkarni and Arora [1] present a synthesis method that takes a given class of faults and a fault-intolerant program, and generates a program that is fault-tolerant to that class of faults.
The fault-intolerant program satisfies its (safety and liveness) specification in the absence of faults and provides no guarantees in the presence of faults. The synthesized fault-tolerant program provides a desired level of fault-tolerance in the presence of faults, and satisfies the safety and liveness specification of the fault-intolerant program in the absence of faults.

Such a synthesis approach has the potential to reuse the computations of the fault-intolerant program during the synthesis of its fault-tolerant version. As a result, reusing the computations of a fault-intolerant program preserves its important properties (e.g., efficiency) that are difficult to specify in a specification-based approach (e.g., [2, 3, 4]), where one synthesizes a fault-tolerant program from its temporal logic (respectively, automata-theoretic [5, 6, 7]) specification.

The synthesized fault-tolerant program provides one of three levels of fault-tolerance, namely failsafe, nonmasking, and masking [1]. Intuitively, in the presence of faults, a failsafe fault-tolerant program ensures that its safety specification is satisfied. In the presence of faults, a nonmasking fault-tolerant program recovers to states from where its safety and liveness specification is satisfied. A masking fault-tolerant program guarantees that in the presence of faults it recovers to states from where its safety and liveness specification is satisfied, while preserving safety during recovery.

The complexity of the synthesis presented in [1] depends on the program model. The authors of [1] show that the complexity of synthesis is polynomial in the state space of the fault-intolerant program in the high atomicity model. For distributed programs (i.e., the low atomicity model), Kulkarni and Arora show that the complexity of synthesizing masking fault-tolerance is exponential.
Also, in the specification-based approach, the synthesis of fault-tolerant distributed programs (with particular architectures) from their specification is known to be non-elementary decidable [6, 7]. A survey of the literature [7, 8] reveals that the complexity of synthesis and the inefficiency of the synthesized programs constitute the main obstacles in the automated synthesis of fault-tolerant programs. Moreover, to the best of our knowledge, no automated approach has been presented for adding multitolerance to programs, where a multitolerant program is subject to multiple classes of faults and provides (possibly) different levels of fault-tolerance corresponding to different classes of faults. Hence, in this dissertation, we focus our attention on theoretical and practical problems in the synthesis of fault-tolerant distributed programs and multitolerant programs.

Theoretical problems. Regarding theoretical aspects of synthesis, we address the following problems:

• Identify the effect of the safety specification model on the complexity of synthesis.

It is shown in the literature that the complexity of adding fault-tolerance to high atomicity programs is polynomial in the state space of the fault-intolerant program if the safety specification is represented as a set of bad transitions [1]. In [9], the authors conjecture that representing safety specification as a set of sequences of transitions results in exponential complexity for adding fault-tolerance. They validate their claim in the context of some examples. However, to the best of our knowledge, there exists no significant result that verifies the claim made in [9]. Thus, it is desirable to explore the complexity of synthesis in the case where safety specification is represented as a set of sequences of transitions. The significance of such complexity analysis is that it identifies the appropriate approach for modeling safety specification where automatic addition of fault-tolerance can be done efficiently.
• Find sufficient conditions for polynomial-time synthesis of distributed programs.

Since the complexity of synthesizing fault-tolerant distributed programs from their fault-intolerant version is exponential [1], we shall identify properties of programs and specifications where the synthesis can be done in polynomial time.

• Reduce the complexity of synthesis by reusing the computations of the fault-intolerant program.

During the synthesis of fault-tolerant programs, there exist situations where the computational structure of the fault-intolerant program provides the necessary means for satisfying fault-tolerance requirements in the presence of faults. Thus, it is desirable to design synthesis algorithms that take advantage of such situations to reduce the complexity of synthesis.

• Identify and reuse pre-synthesized fault-tolerance components.

There exist recurring sub-problems that arise in the synthesis of different programs (e.g., resolving deadlock states). Thus, it is desirable to generalize the solution to common synthesis problems so that we can develop generic solution strategies that are independent of the program at hand. In other words, we would like to reuse the effort put into the synthesis of one program for the synthesis of another program. To achieve this goal, we plan to identify commonly encountered patterns in the synthesis of programs in order to encapsulate those patterns in the form of pre-synthesized fault-tolerance components. Also, we would like to devise a synthesis method where we automatically specify and add the required pre-synthesized components to the fault-intolerant programs.

• Synthesize programs that tolerate multiple classes of faults and provide different levels of fault-tolerance to each fault-class.

Dependable and fault-tolerant systems are often subject to multiple classes of faults, and hence, these systems need to provide an appropriate level of fault-tolerance to each class of faults. Often it is undesirable or impractical to provide the same level of fault-tolerance to each class of faults. Hence, these systems need to tolerate multiple classes of faults, and provide a (possibly) different level of fault-tolerance to each class. To characterize such systems, the notion of multitolerance was introduced in [10]. The importance of such multitolerant systems can be easily observed from the fact that several methods for designing multitolerant programs, as well as several instances of multitolerant programs, can be found in the literature (e.g., [11, 12, 13, 10]).

Automated synthesis of multitolerant programs has the advantage of generating fault-tolerant programs that (i) are correct by construction, and (ii) tolerate multiple classes of faults. However, the complexity of such synthesis is an obstacle in the synthesis of multitolerant programs. Specifically, there exist situations where satisfying a specific fault-tolerance requirement for one class of faults conflicts with providing a different level of fault-tolerance to another fault-class. Hence, it is necessary to identify situations where synthesis of multitolerant programs can be performed efficiently and where heuristics need to be developed for adding multitolerance.

Practical problems. To reduce the exponential complexity of synthesis for practical purposes and to enable the synthesis of programs that have large state space, heuristic-based approaches are proposed in [14, 15, 9]. These heuristic-based approaches reduce the complexity of synthesis by forfeiting the completeness of synthesizing fault-tolerant distributed programs. In other words, if heuristics are applicable, then a heuristic-based algorithm will generate a fault-tolerant program efficiently. However, if the heuristics are not applicable, then the synthesis algorithm will declare failure even though it is possible to synthesize a fault-tolerant program from the given fault-intolerant program.
The development and the implementation of heuristics are complicated by the fact that, for a given heuristic, we need to determine how that heuristic reduces the complexity of synthesizing fault-tolerant distributed programs. Furthermore, we need to identify whether a heuristic is so restrictive that its use will cause the synthesis algorithm to declare failure very often. Also, in order to provide maximum efficiency, there exist situations where we need to apply heuristics in a specific order. Moreover, the developers of a fault-tolerant program may have additional insights about the order in which heuristics should be applied. Thus, we have to provide the possibility of changing the order of available heuristics (respectively, adding new heuristics) for the developers of fault-tolerance.

Therefore, there exists a substantial need for an extensible software framework where (i) developers of fault-tolerant programs can synthesize fault-tolerant programs from their fault-intolerant version; (ii) developers of heuristics can integrate new heuristics into the framework or modify existing heuristics; and (iii) developers can benefit from existing automated reasoning tools (e.g., SAT solvers) in the synthesis of fault-tolerant distributed programs.

1.1 The Outline of the Dissertation

In Chapter 2, we present preliminary concepts of programs, specifications, faults, and fault-tolerance. We also describe the synthesis algorithms presented by Kulkarni and Arora [1] in Chapter 2, as we reuse those algorithms in this dissertation. Then, we identify the effect of specification modeling on the complexity of synthesis in Chapter 3. Subsequently, in Chapter 4, we show that synthesizing a failsafe fault-tolerant distributed program from its fault-intolerant version is NP-complete.
We also present sufficient conditions for polynomial synthesis of failsafe fault-tolerant distributed programs. In Chapter 5, we define the enhancement problem, where we enhance the level of fault-tolerance from nonmasking to masking in polynomial time. We introduce the concept of pre-synthesized fault-tolerance components in Chapter 6, where we present a synthesis method for automatic specification and addition of pre-synthesized fault-tolerance components to programs during synthesis. Afterwards, in Chapter 7, we formally state the problem of adding multitolerance to programs, and we show that, in general, synthesizing multitolerant programs from their fault-intolerant version is NP-complete even in the high atomicity model. In Chapter 8, we present the design of our software framework for automatic synthesis of fault-tolerant distributed programs. In Chapter 9, we present some ongoing research work. Finally, in Chapter 10, we discuss related work, contributions, and the impact of this dissertation, and then we make concluding remarks.

Chapter 2

Preliminaries

In this chapter, we present formal definitions of programs, problem specifications, faults, fault-tolerance, and addition of fault-tolerance. Specifically, in Section 2.1, we present the formal definition of programs, state predicates, and the projection of program transitions on a state predicate. In Section 2.2, we present the issues of modeling distributed programs, adapted from [1, 4]. Then, in Section 2.3, we adapt the definition of specifications from Alpern and Schneider [16]. In Sections 2.4 and 2.5, we adapt the definitions of faults and fault-tolerance from Arora and Gouda [17] and Kulkarni [18]. We present the problem of adding fault-tolerance to fault-intolerant programs in Section 2.6; we have adapted the problem statement of fault-tolerance addition from [1].
In Section 2.7, we reiterate the results presented in [1] for the synthesis of fault-tolerant programs in the high atomicity model - where processes can read/write all program variables in an atomic step. Finally, in Section 2.8, we recall the results presented in [1] for the synthesis of distributed programs - where processes have read/write restrictions with respect to program variables.

2.1 Program

A program p is specified by a finite set of variables, say V = {v0, v1, .., vq}, and a finite set of processes, say P = {P0, ..., Pn}, where q and n are positive integers. Each variable is associated with a finite domain of values. Let v0, v1, .., vq be the variables of p, and let D0, D1, .., Dq be their respective domains. A state of p is obtained by assigning each variable a value from its respective domain. Thus, a state s of p has the form (l0, l1, .., lq), where ∀i : 0 ≤ i ≤ q : li ∈ Di. The state space of p, Sp, is the set of all possible states of p.

A process, say Pj, consists of a set of transitions δj; each transition has the form (s0, s1), where s0, s1 ∈ Sp. A process Pj in p is associated with a set of variables, say rj, that Pj can read, and a set of variables, say wj, that Pj can write. The set of transitions of program p, δp, is the union of the transitions of its processes. In most situations in this dissertation, we focus on the entire state space of a program and all its transitions. Hence, for simplicity, we rewrite program p as the tuple (Sp, δp), where Sp is a finite set of states and δp is a subset of Sp × Sp.

A state predicate X of p is any subset of Sp. We denote the cardinality of X by |X|, where |X| represents the number of states in X.
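As an illustration of the definitions above, a program can be represented as the tuple (Sp, δp) by enumerating Sp from the variable domains. The following sketch is our own; the variable names, domains, and transition relation are invented for the example and do not come from the dissertation:

```python
from itertools import product

# Toy variables and finite domains (illustrative only).
domains = {"x": [0, 1], "y": [0, 1, 2]}
names = sorted(domains)  # fix a variable order; a state is a tuple of values

# Sp: every assignment of a value to each variable.
Sp = list(product(*(domains[n] for n in names)))  # states are (x, y) tuples

# delta_p: a toy transition relation that increments y modulo 3.
delta_p = {((x, y), (x, (y + 1) % 3)) for (x, y) in Sp}

# A state predicate X is any subset of Sp; here, the states with x == 0.
X = {s for s in Sp if s[0] == 0}
```

Here |Sp| = 2 · 3 = 6 and |X| = 3, matching the cardinality notation |X| used above.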
A state predicate X is closed in a program p (respectively, δp) iff (if and only if) the following condition holds:

∀s0, s1 :: ((s0, s1) ∈ δp ⇒ (s0 ∈ X ⇒ s1 ∈ X))

A transition predicate Δp of p is any subset of Sp × Sp. We denote the cardinality of Δp by |Δp|, where |Δp| represents the number of transitions in Δp.

A sequence of states, σ = (s0, s1, ..), is a computation of p iff the following two conditions are satisfied (i.e., a computation is maximal):

1. If σ is infinite then ∀j : j > 0 : (s_(j-1), s_j) ∈ δp, and

2. If σ is finite and terminates in state sl then there does not exist a state s such that (sl, s) ∈ δp.

A finite sequence of states, (s0, s1, .., sn), is a computation prefix of p iff ∀j : 0 < j ≤ n : (s_(j-1), s_j) ∈ δp.

A sequence of states, σ = (s0, s1, ..), is a computation of p in the presence of faults f iff the following three conditions are satisfied:

1. ∀k : k > 0 : (s_(k-1), s_k) ∈ (δp ∪ f),

2. If σ is finite and terminates in state sl then there does not exist a state s such that (sl, s) ∈ δp, and

3. ∃n : n ≥ 0 : (∀k : k > n : (s_(k-1), s_k) ∈ δp).

The first requirement captures that in each step, either a program transition or a fault transition is executed. The second requirement captures that faults do not have to execute; i.e., if the program reaches a state where only a fault transition can be executed, then the fault transition need not be executed. It follows that fault transitions cannot be used to deal with deadlocked states. Finally, the third requirement captures that the number of fault occurrences in a computation is finite. Such an assumption also appears in previous work [19, 20, 17, 21].

Program and fault representation. We use Dijkstra's guarded commands [22] to represent the transitions of programs and faults. A guarded command (action) is of the form grd → st, where grd is a state predicate and st is a function from Sp to Sp (i.e., an assignment) that updates program variables. Specifically, the guarded command grd → st represents the following set of transitions:

{(s0, s1) : grd is true at s0, and the atomic execution of st at s0 takes the program to state s1}

2.5 Fault-Tolerance

In this section, we formally define what it means for a program to be fault-tolerant.
We define three levels of fault-tolerance: failsafe, nonmasking, and masking. In the absence of faults, irrespective of the level of fault-tolerance, a program should satisfy its specification, say spec, from its invariant. The level of fault-tolerance characterizes the extent to which the program satisfies spec in the presence of faults. Intuitively, a failsafe fault-tolerant program ensures that in the presence of faults, the safety of spec is maintained. A nonmasking fault-tolerant program ensures that in the presence of faults, the program recovers to states from where spec is satisfied. A masking fault-tolerant program ensures that in the presence of faults the safety of spec is maintained and the program recovers to states from where spec is satisfied. Thus, we formally define these three levels of fault-tolerance for a program p, its invariant S, its specification spec, and a class of faults f as follows:

Program p is failsafe f-tolerant for spec from S iff the following two conditions hold: (1) p satisfies spec from S, and (2) there exists T such that T is an f-span of p from S and p[]f maintains spec from T.

Program p is nonmasking f-tolerant for spec from S iff the following two conditions hold: (1) p satisfies spec from S, and (2) there exists T such that T is an f-span of p from S and every computation of p[]f that starts from a state in T has a state in S.

Program p is masking f-tolerant for spec from S iff the following two conditions hold: (1) p satisfies spec from S, and (2) there exists T such that T is an f-span of p from S, p[]f maintains spec from T, and every computation of p[]f that starts from a state in T has a state in S.

Note that a specification is a set of infinite sequences of states. Hence, if p satisfies spec from S then all computations of p that start in S must be infinite. In the context of nonmasking and masking fault-tolerance, every computation from the fault-span reaches a state in its invariant.
Hence, if fault-span T is used to show that p is nonmasking (respectively, masking) f-tolerant for spec from S then all computations of p that start in a state in T must also be infinite. Also, note that p is allowed to contain a self-loop of the form (s0, s0); we use such a self-loop whenever s0 is an acceptable fixpoint of p.

Notation. Henceforth, whenever the program p is clear from the context, we will omit it; thus, "S is an invariant" abbreviates "S is an invariant of p" and "f is a fault" abbreviates "f is a fault for p". Also, whenever the specification spec and the invariant S are clear from the context, we omit them; thus, "f-tolerant" abbreviates "f-tolerant for spec from S".

2.6 The Problem of Adding Fault-Tolerance

In this section, we reiterate the problem of adding fault-tolerance presented in [1]. The addition problem requires a fault-tolerant program p' (with its invariant S') to behave similar to its fault-intolerant version, say p, in the absence of a given class of faults f. In the presence of f, p' must provide a desired level of fault-tolerance, say L, where L could be failsafe, nonmasking, or masking. Since p' must behave similar to p in the absence of faults, Kulkarni and Arora [1] stipulate the following conditions:

1. S' must be a subset of S. Otherwise, if there exists a state s ∈ S' where s ∉ S then, in the absence of faults, p' can reach s and create new computations that do not belong to p. Thus, p' would include new ways of satisfying spec from s in the absence of faults.

2. p'|S' must be a subset of p|S'. If p'|S' includes a transition that does not belong to p|S' then p' can include new ways of satisfying spec in the absence of faults.

Thus, the formal definition of the problem of adding fault-tolerance is as follows:

The Addition Problem

Given p, S, spec, and faults f, identify p' and S' such that S' ⊆ S, p'|S' ⊆ p|S', and p' is L f-tolerant for spec from S', where L
can be failsafe, nonmasking, or masking. □

The decision problem of adding fault-tolerance to fault-intolerant programs (from [1]) is as follows:

The Decision Problem

For a given fault-intolerant program p, its invariant S, the specification spec, and faults f, does there exist a fault-tolerant program p' and the invariant S' such that S' ⊆ S, p'|S' ⊆ p|S', and p' is failsafe/nonmasking/masking fault-tolerant for spec from S'?

Remark. Given a program p' and its invariant S' that meet the requirements of the decision problem, every computation of p'[]f that starts in the fault-span reaches a state in S'. From that state in S', a computation of p' is also a computation of p (since S' ⊆ S and p'|S' ⊆ p|S'). Since the fault-intolerant program p satisfies its liveness specification from S, every computation of p has a suffix that is in the liveness specification. It follows that every computation of p' that starts in its fault-span will eventually reach a state from where it continuously satisfies its liveness specification. For this reason, the liveness specification is not included in the above problem statement.

2.7 Synthesis of Fault-Tolerance in High Atomicity

The properties of synthesized high atomicity fault-tolerant programs identify an upper bound on the abilities of fault-tolerant distributed programs. As a result, in the synthesis of fault-tolerant distributed programs, there exist situations where we need to verify the possibility of solving a problem in the high atomicity model (e.g., see Chapter 5). Hence, we recall the synthesis algorithms presented by Kulkarni and Arora [1] for the synthesis of fault-tolerant programs in the high atomicity model. We present three synthesis algorithms presented in [1] for adding three different levels of fault-tolerance to fault-intolerant programs.
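Before turning to the algorithms, note that the first two conditions of the addition problem are purely structural and can be checked mechanically. The following is a small sketch of such a check (our illustration; p|S is taken here to be the set of transitions of p that start and end in S):

```python
def restrict(delta, S):
    # p|S: the transitions of p that start and end in S
    return {(s0, s1) for (s0, s1) in delta if s0 in S and s1 in S}

def meets_addition_conditions(p, S, p_prime, S_prime):
    # Checks S' subset-of S and p'|S' subset-of p|S'.  The third condition,
    # that p' is L f-tolerant for spec from S', depends on the fault model
    # and the level L, and is not checked here.
    return S_prime <= S and restrict(p_prime, S_prime) <= restrict(p, S_prime)

# Hypothetical toy instance (states are plain integers):
S = {0, 1, 2}
p = {(0, 1), (1, 2), (2, 0)}
S_prime = {0, 1}
p_prime = {(0, 1), (3, 0)}   # (3, 0) starts outside S', so it is permitted
print(meets_addition_conditions(p, S, p_prime, S_prime))  # True
```

The transition (3, 0) of p' is allowed precisely because the second condition constrains p' only inside S'; outside the invariant, new (recovery) transitions may be added freely.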
These algorithms synthesize a (failsafe/nonmasking/masking) fault-tolerant program in the high atomicity model, where there exist no read/write restrictions for the program processes with respect to program variables. In particular, we present the Add_Failsafe algorithm in Subsection 2.7.1. Then, in Subsection 2.7.2, we show how one synthesizes a nonmasking fault-tolerant program. Finally, in Subsection 2.7.3, we describe the algorithm Add_Masking where one adds masking fault-tolerance to fault-intolerant programs.

Throughout this section, we denote a fault-intolerant program with p, its invariant with S, its specification with spec, and a given class of faults with f. Also, we denote a synthesized fault-tolerant program and its invariant with p' and S'.

2.7.1 Synthesizing Failsafe Fault-Tolerance

The algorithm Add_Failsafe (cf. Figure 2.1) takes p, S, spec, and faults f. It calculates a program p' with the invariant S' where p' is failsafe f-tolerant for spec from S'. To synthesize a fault-tolerant program p' from the given fault-intolerant program p, Add_Failsafe calculates a set of states, say ms, from where fault transitions alone may violate the safety of spec. The fault-tolerant program p' must never reach a state in ms; otherwise, faults may directly violate the safety of spec. Thus, p' should not include any transition that reaches a state in ms.

Add_Failsafe(p, f : transitions, S : state predicate, spec : specification)
{
  ms := {s0 : ∃s1, s2, ..., sn : (∀j : 0 ≤ j < n : (sj, sj+1) ∈ f) ∧ ((sn−1, sn) violates spec)};
  ...
}

Figure 2.1: Synthesizing failsafe fault-tolerance in the high atomicity model.

  ...
  ... Rank(s0) > Rank(s1)});
  (16) S' := S1;
  (17) T' := T1;
  (18) return p', S', T';
  (19) }

ConstructFaultSpan(T : state predicate, f : transitions)
// Returns the largest subset of T that is closed in f.
{
  while (∃s0, s1 : s0 ∈ T ∧ s1 ∉ T ∧ (s0, s1) ∈ f)
    T := T − {s0}
}

Figure 2.3: Synthesizing masking fault-tolerance in the high atomicity model.
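The idea behind Add_Failsafe can be sketched executably. In the sketch below (our illustration, not the exact code of [1]) we assume safety is given as a set of bad transitions, i.e., the BT model used later in this dissertation; ms is then a backward fixpoint over fault transitions, mt collects transitions that violate safety or reach ms, and S' is the largest subset of S − ms that is closed in the remaining program:

```python
def add_failsafe(delta_p, S, faults, bad):
    # ms: states from where fault transitions alone may violate safety
    ms = {s0 for (s0, s1) in faults if (s0, s1) in bad}
    changed = True
    while changed:
        changed = False
        for (s0, s1) in faults:
            if s1 in ms and s0 not in ms:
                ms.add(s0)
                changed = True
    # mt: transitions that violate safety or reach a state in ms
    mt = {(s0, s1) for (s0, s1) in delta_p if (s0, s1) in bad or s1 in ms}
    p_prime = delta_p - mt
    # S': largest subset of S - ms that is closed in p'
    S_prime = set(S) - ms
    while True:
        escaping = {s0 for (s0, s1) in p_prime
                    if s0 in S_prime and s1 not in S_prime}
        if not escaping:
            return p_prime, S_prime
        S_prime -= escaping

# Toy instance: faults can push state 1 to 3, from where a bad transition exists,
# so state 1 enters ms and the program transition (0, 1) must be removed.
delta_p = {(0, 0), (0, 1), (1, 0)}
faults = {(1, 3), (3, 4)}
bad = {(3, 4)}
p_prime, S_prime = add_failsafe(delta_p, {0, 1}, faults, bad)
# p' keeps the self-loop (0, 0) and the transition (1, 0); S' shrinks to {0}
```

The self-loop (0, 0) that remains in p' is the acceptable-fixpoint idiom mentioned in Section 2.5.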
In the iterative steps between Lines 5 to 13 in Figure 2.3, the Add_Masking algorithm searches for a valid invariant and its corresponding fault-span for the masking fault-tolerant program. Towards this end, in each iteration, Add_Masking identifies the set of transitions of p1 that consists of the transitions of p on the current invariant S1 (i.e., p|S1) and every transition in the fault-span T1 that does not violate the closure of S1 and does not belong to mt (cf. Line 7 in Figure 2.3). Afterwards, using the ConstructFaultSpan routine, the Add_Masking algorithm calculates the largest subset of T1 that is closed in p1[]f. Since the invariant of the masking program must be a subset of its fault-span, Add_Masking recalculates the invariant S1 considering the recalculated fault-span T1 (cf. Line 9 in Figure 2.3).

The Add_Masking algorithm continues the above iterative procedure until there exist no more changes in S1 and T1, or S1 becomes empty. When S1 becomes empty, the Add_Masking algorithm declares that there exists no masking fault-tolerant program synthesized from p. Otherwise, there must exist a non-empty subset of S that satisfies the requirements of the addition problem (cf. Section 2.6). If there exists such a subset S' of S then Add_Masking will guarantee safe recovery from states outside invariant S' to S', and there will be no cycles in T' − S' (cf. Lines 14-16 in Figure 2.3).

Soundness and completeness. The algorithm Add_Masking is sound; i.e., the synthesized program p' and its invariant S' satisfy the requirements of the addition problem.
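The ConstructFaultSpan routine of Figure 2.3 is short enough to transcribe directly; a Python rendering (with T as a set of states and the transitions of p1[]f as a set of pairs):

```python
def construct_fault_span(T, transitions):
    # Returns the largest subset of T that is closed in `transitions`:
    # repeatedly remove any state with an outgoing transition leaving T.
    T = set(T)
    changed = True
    while changed:
        changed = False
        for (s0, s1) in transitions:
            if s0 in T and s1 not in T:
                T.discard(s0)
                changed = True
    return T

# Removals cascade: dropping 1 (its transition leaves T) exposes 0, which
# is then dropped as well.
print(construct_fault_span({0, 1, 2}, {(1, 3), (0, 1)}))  # {2}
```

The result is the greatest fixpoint: it is independent of the order in which violating states are examined, which is why the while-loop formulation in Figure 2.3 is well defined.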
Also, Add_Masking is complete; i.e., if there exists a masking fault-tolerant program p'' derived from p that satisfies the requirements of the addition problem then Add_Masking will find p'' and its invariant S'' [1].

2.8 Synthesis of Fault-Tolerant Distributed Programs

In this section, we present the non-deterministic algorithm presented by Kulkarni and Arora [1] for the synthesis of distributed fault-tolerant programs. We also recall a theorem from [1] about the complexity of synthesizing fault-tolerant distributed programs.

Kulkarni and Arora [1] present the non-deterministic algorithm Add_ft (cf. Figure 2.4) for the addition of fault-tolerance to distributed programs in polynomial time. The Add_ft algorithm takes the transition groups g0, · · · , gmax (that represent a fault-intolerant distributed program p), its invariant S, its specification spec, and a class of faults f. Afterwards, Add_ft calculates the set of states ms from where safety can be violated by the execution of fault transitions alone. Also, Add_ft computes the set of transitions mt that violate safety or reach a state in ms. Then, the Add_ft algorithm non-deterministically guesses the fault-tolerant program p', its invariant S', and its fault-span T'.

Add_ft(p, f : set of transitions, S : state predicate, spec : specification, g0, g1, ..., gmax : groups of transitions)
{
  ms := {s0 : ∃s1, s2, ..., sn : (∀j : 0 ≤ j < n : (sj, sj+1) ∈ f) ∧ ((sn−1, sn) violates spec)};
  mt := ...;
  Guess p', S', T';
  Verify ...
}

Figure 2.4: Adding fault-tolerance to distributed programs.
Figure 3.1: The states and the transitions corresponding to the propositional variables in the 3-SAT formula. (Except for transitions marked as fault, all are program transitions. Also, note that the program has no long transitions that originate from ai and no short transitions that originate from ci.)

Fault transitions. The class of faults f is equal to the set of medium transitions {(s, dj) : 1 ≤ j ≤ M}.

The safety specification of the fault-intolerant program, p. Safety will be violated if a short (respectively, long) transition is followed by another short (respectively, long) transition. Note that (s, s) and fault transitions are medium transitions (cf. Figure 3.1). Hence, they can be followed by (respectively, preceded by) any transition. Also, all transitions except those identified above violate the safety specification. This is to ensure that transitions such as (dj, s), (ai, s), (bi, s), and (ci, s) ((1 ≤ j ≤ M) ∧ (1 ≤ i ≤ n)) cannot be used for recovery.

3.1.2 Reduction from 3-SAT

In this section, we show (with Lemmas 3.1 and 3.2) that the given instance of 3-SAT is satisfiable iff masking fault-tolerance can be added to the problem instance identified in Section 3.1.1.

Lemma 3.1 If the given 3-SAT formula is satisfiable then there exists a masking fault-tolerant program for the instance of the decision problem identified in Section 3.1.1.

Proof. Since the 3-SAT formula is satisfiable, there exists an assignment of truth values to the propositional variables xi, 1 ≤ i ≤ n, such that each yj, 1 ≤ j ≤ M, is true. Now, we identify a masking fault-tolerant program, p', that is obtained by adding fault-tolerance to the fault-intolerant program p identified in Section 3.1.1.
The invariant of p' is the same as the invariant of p (i.e., {s}). We derive the transitions of the fault-tolerant program p' as follows. (As an illustration, we have shown the partial structure of p' where x1 = true, x2 = false, and x3 = true in Figure 3.2.)

• For each propositional variable xi, 1 ≤ i ≤ n, if xi is true then we include the short transition (ai, bi). In this case, we also include the long transition (bi, ai+1) if xi+1 is true, or (bi, bi+1) if xi+1 is false.

• For each propositional variable xi, 1 ≤ i ≤ n, if xi is false then we include the short transition (bi, ci). In this case, we also include the long transition (ci, ai+1) if xi+1 is true, or (ci, bi+1) if xi+1 is false.

• We include the transitions (an+1, bn+1) and (bn+1, s) corresponding to xn+1.

• For each disjunction yj that includes xi, we include the transition (dj, ai) iff xi is true.

• For each disjunction yj that includes ¬xi, we include the transition (dj, bi) iff xi is false.

Figure 3.2: The partial structure of the fault-tolerant program.

Now, we show that p' is masking fault-tolerant in the presence of faults f.

• p' in the absence of faults. p'|S = p|S. Thus, p' satisfies spec in the absence of faults.

• p' is masking f-tolerant for spec from S. To show this result, we let T' be the set of states reached in the computations of p'[]f starting from s.

  - p' satisfies its safety specification from T'. Since the instance of the 3-SAT formula is satisfiable, each propositional variable xi is assigned a unique truth value. Thus, for each pair of transitions (ai, bi)
and (bi, ci), one of them is excluded from the set of transitions of p'. Hence, a computation of p' cannot include two consecutive short transitions. Also, the only way to execute two consecutive long transitions in the original fault-intolerant program is to execute a long transition that terminates in state bi, 1 ≤ i ≤ n, and then execute a long transition that originates in bi. If the former transition is included then xi is assigned the truth value false. However, in this case, no outgoing long transition from bi is included. Thus, p' cannot execute two consecutive long transitions.

  - Starting from every state in T', a computation of p' reaches s. By construction, p' contains no cycles outside the invariant. Hence, it suffices to show that p' does not deadlock in T' − S'. Now, let yj = xi ∨ ¬xk ∨ xr be a disjunction in the 3-SAT formula. Since yj evaluates to true, p' includes a transition from {(dj, ai), (dj, bk), (dj, ar)}. Also, by considering the truth values of xi and xi+1, 1 ≤ i ≤ n, we observe that for every state in {ai, bi, ci} in T' there is a path that reaches a state in {ai+1, bi+1, ci+1}. Finally, from an+1 (respectively, bn+1) there is an outgoing transition to bn+1 (respectively, s). It follows that p' does not deadlock in T' − S. □

Lemma 3.2 If there exists a masking fault-tolerant program for the instance of the decision problem identified earlier then the given 3-SAT formula is satisfiable.

Proof. Before we use the masking fault-tolerant program p' to identify the truth-value assignment to the propositional variables in the 3-SAT formula, we make some observations about p'. Let S' be the invariant of p' and let T' be the fault-span used to show the masking fault-tolerance property of p'. Since S' ≠ {} and S' ⊆ S, the conditions S' = S and p|S' = p'|S' hold.

Since faults may directly perturb p' to dj (1 ≤ j ≤ M), the condition dj ∈ T' holds. Thus, p' must provide safe recovery from each dj.
As a result, for each dj, there exists 1 ≤ i ≤ n such that either (dj, ai) or ((dj, bi) and (bi, ci)) is included in p'|T'; i.e., either ai or ci must be reachable. Hence, we have

Observation 3.3. There exists 1 ≤ i ≤ n such that either ai ∈ T' or ci ∈ T'. □

Now, consider the case where ai ∈ T' and ci ∈ T'. In this case, (ai, bi) must be included as all transitions terminating in ai are long transitions. Further, if ci ∈ T' then (bi, ci) must be included since it is the only transition that reaches ci. In this case, p'[]f can violate safety by executing (ai, bi) and (bi, ci). Hence, we have

Observation 3.4. If ai ∈ T' then ci ∉ T'. □

Moreover, if ai ∈ T' then (ai, bi) ∈ p'|T' since all transitions terminating in ai are long transitions. Hence, bi ∈ T'. Now, to guarantee safe recovery from bi, p' must include either (bi, ai+1) or ((bi, bi+1) and (bi+1, ci+1)). Thus, either ai+1 ∈ T' or ci+1 ∈ T'. Also, if ci ∈ T' then either (ci, ai+1) or ((ci, bi+1) and (bi+1, ci+1)) must be included. Thus, we have

Observation 3.5. If (ai ∈ T') ∨ (ci ∈ T') holds then we have (∀l : i < l ≤ n : ((al ∈ T') ∨ (cl ∈ T'))). □

Now, let sm be the smallest value for which ((a_sm ∈ T') ∨ (c_sm ∈ T')) holds. Based on Observation 3.5, we have (∀l : sm < l ≤ n : (al ∈ T') ∨ (cl ∈ T')). Hence, we make the value assignment to the literals of the 3-SAT formula as follows:

• For t < sm, we assign true to xt.

• For sm ≤ t, if at ∈ T' then xt = true. And, if ct ∈ T' then xt = false.

Based on Observations 3.3-3.5, it is straightforward to observe that a unique value is assigned to each xi (1 ≤ i ≤ n). To complete the proof, we need to show that, with this truth-value assignment, the 3-SAT formula is satisfiable. We show this for a disjunction yj (1 ≤ j ≤ M). Wlog, let yj = xi ∨ ¬xk ∨ xr. Since state dj can be reached by the occurrence of a fault from s, p' must provide safe recovery from dj.
Since the only safe transitions from dj are those corresponding to states ai, bk and ar, p' must include at least one of the transitions (dj, ai), (dj, bk), or (dj, ar). Now, if (dj, ai) ∈ p' then ai ∈ T', and hence, xi is assigned true. Further, if (dj, bk) ∈ p' then no long transition from bk can be included as it would allow p' to execute two long transitions successively. Hence, p' must include (bk, ck). Thus, ck ∈ T', and hence, xk is assigned false. It follows that irrespective of which transition is included from dj, yj evaluates to true. Therefore, the 3-SAT formula is satisfiable. □

Theorem 3.6 If the safety specification is specified in the BP model then the problem of adding masking fault-tolerance to high atomicity programs is NP-complete.

Proof. The NP-hardness of adding masking fault-tolerance in the BP model follows from Lemmas 3.1 and 3.2. To show that this problem is in NP, we proceed as follows: Given an input for the problem of adding fault-tolerance, we guess a fault-tolerant program p', its invariant S' and its fault-span T'. Now, we need to verify that (1) S' ⊆ S, (2) S' is closed in p', (3) p'|S' ⊆ p|S', (4) T' is closed in p'[]f, (5) p'[]f does not violate safety in T', (6) p' does not deadlock in T' − S', and (7) p'|(T' − S') is acyclic. Since each of these conditions can be verified in polynomial time in the state space, the theorem follows. □

Corollary 3.7 If the safety specification is specified by a set of computational prefixes that should not occur in program computations (as in [23]) then the problem of adding masking fault-tolerance is NP-hard in the program state space. □
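The membership argument of Theorem 3.6 is constructive: each condition is a polynomial scan of the guessed p', S', T'. The following sketch covers the structural checks (our illustration; condition (5), safety of p'[]f in T', depends on the specification model and is omitted here, since in the BP model it is a check over consecutive pairs of transitions):

```python
def restrict(delta, X):
    return {(s0, s1) for (s0, s1) in delta if s0 in X and s1 in X}

def closed(X, delta):
    return not any(s0 in X and s1 not in X for (s0, s1) in delta)

def acyclic(X, delta):
    # DFS cycle detection on delta restricted to X (condition 7)
    succ = {s: [] for s in X}
    for (s0, s1) in restrict(delta, X):
        succ[s0].append(s1)
    color = dict.fromkeys(X, 0)    # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(u):
        color[u] = 1
        for v in succ[u]:
            if color[v] == 1 or (color[v] == 0 and not dfs(v)):
                return False
        color[u] = 2
        return True
    return all(color[s] != 0 or dfs(s) for s in X)

def verify_guess(p, S, faults, p_prime, S_prime, T_prime):
    outside = T_prime - S_prime
    return (S_prime <= S                                              # (1)
            and closed(S_prime, p_prime)                              # (2)
            and restrict(p_prime, S_prime) <= restrict(p, S_prime)    # (3)
            and closed(T_prime, p_prime | faults)                     # (4)
            and all(any(s0 == s for (s0, _) in p_prime)
                    for s in outside)                                 # (6)
            and acyclic(outside, p_prime))                            # (7)

# Toy guess: invariant {0}, fault-span {0, 1}, recovery transition (1, 0).
p = {(0, 0)}
p_prime = {(0, 0), (1, 0)}
print(verify_guess(p, {0}, {(0, 1)}, p_prime, {0}, {0, 1}))  # True
```

Every check is a single pass over the transition sets (or a DFS), so the whole verification is polynomial in the state space, which is what places the problem in NP.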
3.2 Summary

In this chapter, we investigated the effect of the representation of the safety specification on the complexity of adding masking fault-tolerance. It is shown in the literature [1] that if one represents the safety specification as a set of bad transitions (denoted the BT model) that must not occur in program computations then adding fault-tolerance to high atomicity programs - where processes can read/write all program variables in an atomic step - can be done in polynomial time in the state space of the input fault-intolerant program. However, in this chapter, we showed that if safety is represented by a set of sequences of transitions, where each sequence contains at most two transitions (denoted the bad pair (BP) model), then adding fault-tolerance to programs is NP-complete. With this result, we argue that adding fault-tolerance to existing programs can be done more efficiently if we focus on the BT model.

Although the BT model is a restricted version of the BP model, it is general enough to capture other representations for modeling safety considered in the literature. For example, in the bad state (BS) model (e.g., [2, 4]), a computation violates safety if it reaches a state that is ruled out by the safety specification. The BS model is a restrictive version of the BT model. Hence, the algorithms in [1] can be extended to the BS model. Thus, the complexity for the BS model is (approximately) in the same complexity class as that of the BT model.

Also, we observe that the expressiveness of the BT model has the potential to capture the safety specification of practical problems. As an illustration, we model the safety specification of several examples including a simplified version of an aircraft altitude switch (cf. Section 8.5) throughout this dissertation.
As a result, we argue that although the results of this chapter limit the applicability of efficient addition of fault-tolerance to the BT model, this model can capture a broad range of interesting problems in the synthesis of fault-tolerant programs. Therefore, in the rest of this dissertation, we represent the safety specification of programs in the BT model.

Chapter 4

Synthesizing Failsafe Fault-Tolerant Distributed Programs

In this chapter, we focus on the synthesis of failsafe fault-tolerant distributed programs from their fault-intolerant versions. First, we show that synthesizing a failsafe fault-tolerant distributed program from its fault-intolerant version (i.e., adding failsafe fault-tolerance to distributed fault-intolerant programs) is NP-complete. To achieve this goal, we reduce the 3-SAT problem to the decision problem of synthesizing a failsafe fault-tolerant program. Second, we identify the restrictions that can be imposed on specifications and fault-intolerant programs in order to ensure that failsafe fault-tolerance can be synthesized in polynomial time. Towards this end, we identify a class of specifications, namely monotonic specifications, and a class of programs, namely monotonic programs. We show that failsafe fault-tolerance can be synthesized in polynomial time if the monotonicity restrictions on the program and the specification are met.

As another important contribution of this chapter, we evaluate the role of the restrictions imposed on the specification and the fault-intolerant program. In this context, we show that if monotonicity restrictions are imposed only on the specification (respectively, the fault-intolerant program) then the problem of adding failsafe fault-tolerance remains NP-complete. Finally, we show that the class of monotonic specifications contains well-recognized [24, 25, 26, 27, 28] problems of distributed consensus, atomic commitment, and Byzantine agreement.
We proceed as follows: In Section 4.1, we state the problem of adding failsafe fault-tolerance to fault-intolerant programs. In Section 4.2, we show the NP-completeness of the problem of adding failsafe fault-tolerance to distributed programs. In Section 4.3, we precisely define the notion of monotonic specifications and monotonic programs, and identify their role in reducing the complexity of synthesizing failsafe fault-tolerance. Finally, we give examples of monotonic specifications and monotonic programs in Section 4.4, and summarize this chapter in Section 4.5.

4.1 Problem Statement

In this section, we formally state the problem of synthesizing failsafe fault-tolerance. Our goal is to only add failsafe fault-tolerance to generate a program that reuses a given fault-intolerant program. In other words, we require that any new computations that are added in the fault-tolerant program are solely for the purpose of dealing with faults; no new computations are introduced when faults do not occur.

Now, consider the case where we begin with the fault-intolerant program p, its invariant S, its specification spec, and faults f. Let p' be the fault-tolerant program derived from p, and let S' be an invariant of p'. Since S is an invariant of p, all the computations of p that start from a state in S satisfy the specification spec. Since we have no knowledge about the computations of p that start outside S, and we are interested in deriving p' such that the correctness of p' in the absence of faults is derived from the correctness of p, we must ensure that p' begins in a state in S; i.e.,
the invariant of p', say S', must be a subset of S (cf. Figure 4.1).

Figure 4.1: The relation between the invariant of a fault-intolerant program p and a fault-tolerant program p'. (The invariant of p' lies inside the invariant of p; no new transitions are introduced inside the invariant, and new transitions are added only outside it.)

Likewise, to show that p' is correct in the absence of faults, we need to show that the computations of p' that start in states in S' are in spec. We only have knowledge about computations of p that start in a state in S (cf. Figure 4.1). Hence, we must not introduce new transitions in the absence of faults. Thus, we define the problem of synthesizing failsafe fault-tolerance as follows:

The Problem of Synthesizing Failsafe Fault-Tolerance

Given p, S, spec and f such that p satisfies spec from S,
identify p' and S' such that
S' ⊆ S, p'|S' ⊆ p|S', and
p' is failsafe fault-tolerant to spec from S'. □

This problem statement is taken from [1]. In [1], a generalized definition that applies to other types of fault-tolerance is presented. However, we use this restrictive definition as it suffices in this chapter. Also, to show that the problem of synthesizing failsafe fault-tolerance is NP-complete, we state the corresponding decision problem: for a given fault-intolerant program p, its invariant S, the specification spec, and faults f, does there exist a failsafe fault-tolerant program p' and the invariant S' that satisfy the three conditions of the synthesis problem?

Notation. Given a fault-intolerant program p, specification spec, invariant S and faults f, we say that program p' and predicate S' solve the synthesis problem for a given input iff p' and S' satisfy the three conditions of the synthesis problem.
We say p' (respectively, S') solves the synthesis problem iff there exists S' (respectively, p') such that p', S' solve the synthesis problem.

4.2 NP-Completeness Proof

In this section, we prove that the problem of synthesizing failsafe fault-tolerant distributed programs from their fault-intolerant version is NP-complete. Towards this end, we reduce the 3-SAT problem to the problem of synthesizing failsafe fault-tolerance. In Subsection 4.2.1, we present the mapping of the given 3-SAT formula into an instance of the synthesis problem. Afterwards, in Subsection 4.2.2, we show that the 3-SAT formula is satisfiable iff a failsafe fault-tolerant program can be synthesized from this instance of the synthesis problem. Before presenting the mapping, we state the 3-SAT problem:

3-SAT problem.

Given is a set of propositional variables, b1, b2, ..., bn and ¬b1, ¬b2, ..., ¬bn, where bi and ¬bi are complements of each other, and a Boolean formula c = c1 ∧ c2 ∧ ... ∧ cM, where each cj is a disjunction of exactly three propositional variables. Does there exist an assignment of truth values to b1, b2, ..., bn such that c is satisfiable?

4.2.1 Mapping 3-SAT to an Instance of the Synthesis Problem

In this subsection, we map the given 3-SAT formula into an instance of the synthesis problem. The instance of the synthesis problem includes the fault-intolerant program, its specification, its invariant, and a class of faults. Corresponding to each propositional variable and each disjunction in the 3-SAT formula, we specify the states and
Subsequently, we identify the safety specification and the invariant of the fault-intolerant program and determine the value of each program variable in every state. The states of the fault-intolerant program. Corresponding to each propo— sitional variable b,- and its complement -b,~, we introduce the following states (see Figure 4.2): x,,x§,ai, y,,y,’-, 2,, and 2;. For each disjunction, cj = bm V-wbk Vb; (cf. Figure 4.3), we introduce the following states (k 7E m): cgm, dgm, cjk, djk, c3.“ and d31- The transitions of the fault-intolerant program. In the fault-intolerant program, corresponding to each propositional variable b,- and its complement -»b,-, we introduce the following transitions (cf. Figure 4.2): (a,_1,:r,-),(:r,,a,-),(y,’-,z:), (air-13x2), ($2, at), and (yia Zi)’ y ...bilq....>z y. Wbfid...) Z_ yn ....bficl...>zn 1 1 l 1 X1 X I X n \ a ......... a \. a ......... a \ a0 3‘ 1 1-1 3* i n-I 3 a," L a 5 , x . E "1 ' x “ y’ ---liai‘l_-,_z’ y’ ---.lZ‘Ed.-_,. z’ yin __.tfa_d___>zr’l Figure 4.2: The transitions corresponding to the propositional variables in the 3-SAT formula. Also, we introduce a transition from an to ac in the fault-intolerant program. Cor- responding to each Cj = bm V fibk V b,, we introduce the following program transitions (cf. Figure 4.3): (cg-dem), (Cy-ham), and (c;,,d;,). Fault transitions. We introduce the following fault transitions: From state :r,, the fault-intolerant program can reach y,- by the execution of faults. From state .73: 38 y. ...................... > z. l l c’ i .” jm I f .’ s’ g ' f .I" bm 2 x, ‘y’ v \ .I ‘. ,-’ d’ \\ " jm 21'] a ! " .4 ! . " c. . - .- x k 9 9 ! <1, 1,1 l,0> i J ! l s . l dk . bad 3 r. ........................ yl ---------- > z] ' 3 Legend 3 ' . (6, f, g,h> i 3 ------- pl : c' . 2 ............. p2 ; jl p3 b'1; bd - : a P4 2 v u-o-u—o- Fault d, . . j <1, 0,1.j+l+1> Figure 4.3: The structure of the fault-intolerant program for a propositional variable b,- and a disjunction cj = bm V fibk V 1),. 
the faults can perturb the program to state y'_i. Thus, for each propositional variable b_i and its complement ¬b_i, we introduce the following fault transitions: (x_i, y_i) and (x'_i, y'_i). In addition, for each disjunction c_j = (b_m ∨ ¬b_k ∨ b_l), we introduce a fault transition that perturbs the program from state a_i, 0 ≤ i < n, to c'_jm. We also introduce the fault transition that perturbs the program from d'_jm to c_jk, and the transition that perturbs the program from d_jk to c'_jl. Thus, the fault transitions for c_j are as follows: (a_i, c'_jm), (d'_jm, c_jk), and (d_jk, c'_jl). (Note that the fault transition can perturb the program from state a_i only to the first state introduced for c_j; i.e., c'_jm.)

The invariant of the fault-intolerant program. The invariant of the fault-intolerant program consists of the following set of states: {x_1, ..., x_n} ∪ {x'_1, ..., x'_n} ∪ {a_0, ..., a_{n-1}}.

Safety specification of the fault-intolerant program. For each propositional variable b_i and its complement ¬b_i, the following two transitions violate the safety specification: (y_i, z_i) and (y'_i, z'_i). Observe that in state x_i (respectively, x'_i) safety may be violated if the fault perturbs the program to y_i (respectively, y'_i) and then the program executes the transition (y_i, z_i) (respectively, (y'_i, z'_i)) (cf. Figure 4.3). For each disjunction c_j = b_m ∨ ¬b_k ∨ b_l, only the last program transition (c'_jl, d'_jl) added for c_j violates the safety specification. Thus, if all three program transitions corresponding to c_j are included then safety may be violated by the execution of program and fault transitions (cf. Figure 4.3).

Variables. Now, we specify the variables used in the fault-intolerant program and their respective domains. These variables are assigned in such a way that allows us to group transitions appropriately. The fault-intolerant program has 4 variables: e, f,
g, and h. The domains of these variables are respectively as follows: {0, ..., n}, {-1, 0, 1}, {0, ..., n}, and {0, ..., M+n+1}.

Value assignments. The value assignments are as follows (cf. Figure 4.4):

Figure 4.4: The value assignment to variables.

Processes and read/write restrictions. The fault-intolerant program consists of five processes, P1, P2, P3, P4, and P5. The read/write restrictions on these processes are as follows:

• Processes P1 and P2 can read and write variables f and g. They can only read variable e and they cannot read or write h.

• Processes P3 and P4 can read and write variables e and f. They can only read variable g and they cannot read or write h.

• Process P5 can read all program variables and it can only write e and g.

Remark. We could have used one process for the transitions of P1 and P2 (respectively, P3 and P4); however, we have separated them into two processes in order to simplify the presentation.

Grouping of transitions. Based on the above read/write restrictions, we identify the transitions that are grouped together. We illustrate the grouping of the program transitions and the values assigned to the program variables in Figure 4.3.

Observation 4.1 Based on the inability of P3 and P4 to write g, the transitions (x_i, a_i), (x'_i, a_i), (y_i, z_i) and (y'_i, z'_i) can only be executed by P1 or P2. □

Observation 4.2 Based on the inability of P1 and P2 to write e, the transitions (a_{i-1}, x_i) and (a_{i-1}, x'_i) can only be executed by P3 or P4. □

Observation 4.3 Based on the inability of P1 to read h, the transitions (x_i, a_i) and (y'_i, z'_i) are grouped in P1. Moreover, this group also includes the transition (c_ji, d_ji) for each c_j that includes ¬b_i. □

Observation 4.4 Based on the inability of P2 to read h, the transitions (x'_i, a_i) and (y_i, z_i) are grouped in P2. Moreover, this group also includes the transition (c'_ji, d'_ji) for each c_j that includes b_i. □

Observation 4.5 (a_{i-1}, x_i) is grouped in P3.
□

Observation 4.6 (a_{i-1}, x'_i) is grouped in P4. □

Observation 4.7 Since process P5 cannot write f, it cannot execute the following transitions: (a_{i-1}, x_i), (a_{i-1}, x'_i), (x_i, a_i), (x'_i, a_i), (y_i, z_i), and (y'_i, z'_i), for 1 ≤ i ≤ n. Process P5 can only execute the transition (a_n, a_0). □

For i, 1 ≤ i ≤ n, the set of transitions for each process is the union of the transitions mentioned above.

4.2.2 Reduction from 3-SAT

In this subsection, we show that 3-SAT has a satisfying truth value assignment if and only if there exists a failsafe fault-tolerant program derived from the instance introduced in Section 4.2.1. Towards this end, we prove the following lemmas:

Lemma 4.8 If the given 3-SAT formula is satisfiable then there exists a failsafe fault-tolerant program that solves the instance of the addition problem identified in Section 4.2.1.

Proof. Since the 3-SAT formula is satisfiable, there exists an assignment of truth values to the propositional variables b_i, 1 ≤ i ≤ n, such that each c_j, 1 ≤ j ≤ M, is true. Now, we identify a fault-tolerant program, p', that is obtained by adding failsafe fault-tolerance to the fault-intolerant program, p, identified earlier in this section.

The invariant of p' is:

S' = {a_0, ..., a_{n-1}} ∪ {x_i | propositional variable b_i is true in 3-SAT} ∪ {x'_i | propositional variable b_i is false in 3-SAT}

The transitions of the fault-tolerant program p' are obtained as follows:

• For each propositional variable b_i, 1 ≤ i ≤ n, if b_i is true, we include the transition (a_{i-1}, x_i) that is grouped in process P3. We also include the transition (x_i, a_i). Based on Observation 4.3, as we include (x_i, a_i), we have to include (y'_i, z'_i). Also, based on Observation 4.3, for each disjunction c_j that includes ¬b_i, we have to include the transition (c_ji, d_ji).
• For each propositional variable b_i, 1 ≤ i ≤ n, if b_i is false, we include the transition (a_{i-1}, x'_i) that is grouped in process P4. We also include the transition (x'_i, a_i). Based on Observation 4.4, as we include (x'_i, a_i), we have to include (y_i, z_i). Also, for each disjunction c_j that includes b_i, we have to include the transition (c'_ji, d'_ji).

• We include the transition (a_n, a_0) to ensure that p' has infinite computations in its invariant.

Now, we show that p' does not violate safety even if faults occur. Note that we introduced safety-violating transitions for each propositional variable and for each disjunction. We show that none of these can be executed by p'.

• Safety-violating transitions related to propositional variable b_i. If the value of propositional variable b_i is true then the safety-violating transition (y'_i, z'_i) is included in p'. However, in this case, we have removed the state x'_i from the invariant of p' and, hence, p' cannot reach state y'_i. It follows that p' cannot execute the transition (y'_i, z'_i). By the same argument, p' cannot execute the transition (y_i, z_i) when b_i is false.

• Safety-violating transitions related to disjunction c_j. Since the 3-SAT formula is satisfiable, every disjunction in the formula is true. Let c_j = b_m ∨ ¬b_k ∨ b_l. Without loss of generality, let b_m be true in c_j. Therefore, the transition (c'_jm, d'_jm) is not included in p'. It follows that p' cannot reach the state c'_jl and, hence, it cannot violate safety by executing the transition (c'_jl, d'_jl).

Since S' ⊆ S, p'|S' ⊆ p|S', p' does not deadlock in the absence of faults, and p' does not violate safety in the presence of faults, p' and S' solve the synthesis problem. □

Lemma 4.9 If there exists a failsafe fault-tolerant program that solves the instance of the addition problem identified in Section 4.2.1 then the given 3-SAT formula is satisfiable.

Proof.
Suppose that there exists a failsafe fault-tolerant program p' derived from the fault-intolerant program, p, identified in Section 4.2.1. Since the invariant of p', S', is not empty and S' ⊆ S, S' must have at least one state in S. Since the computations of the fault-tolerant program in S' should not deadlock, for 0 ≤ i ≤ n-1, every a_i must be included in S'. For the same reason, since P5 cannot execute from a_{i-1} (cf. Observation 4.7), one of the transitions (a_{i-1}, x_i) or (a_{i-1}, x'_i) should be in p' (1 ≤ i ≤ n). If p' includes (a_{i-1}, x_i) then we will set b_i = true in the 3-SAT formula. If p' contains the transition (a_{i-1}, x'_i) then we will set b_i = false. Hence, each propositional variable will be assigned a truth value. Now, we show that it is not the case that b_i is assigned true and false simultaneously, and that each disjunction is true.

• Each propositional variable gets a unique truth assignment. We prove this by contradiction. Suppose that there exists a propositional variable b_i which is assigned both true and false; i.e., both (a_{i-1}, x_i) and (a_{i-1}, x'_i) are included in p'. Based on Observations 4.1 and 4.3, the transitions (a_{i-1}, x_i), (x_i, a_i) and (y'_i, z'_i) must be included in p'. Likewise, based on Observations 4.2 and 4.4, the transitions (a_{i-1}, x'_i), (x'_i, a_i) and (y_i, z_i) must also be included in p'. Hence, in the presence of faults, p' may reach y_i and violate safety by executing the transition (y_i, z_i). This is a contradiction since we assumed that p' is failsafe fault-tolerant.

• Each disjunction is true. Suppose that there exists a c_j = b_m ∨ ¬b_k ∨ b_l which is not true. Therefore, b_m = false, b_k = true and b_l = false. Based on the grouping discussed earlier, the transitions (c'_jm, d'_jm), (c_jk, d_jk), and (c'_jl, d'_jl) are included in p'. Thus, in the presence of faults, p' can reach c'_jl and violate the safety specification by executing the transition (c'_jl, d'_jl).
Since this is a contradiction, it follows that each disjunction in the 3-SAT formula is true. □

Theorem 4.10 The problem of synthesizing failsafe fault-tolerant distributed programs from their fault-intolerant version is NP-complete.

Proof. The NP-hardness of synthesizing failsafe fault-tolerant distributed programs follows from Lemmas 4.8 and 4.9. Also, using Theorem 2.1 presented in Section 2.8, it follows that the problem of synthesizing failsafe fault-tolerant distributed programs is NP-complete. □

4.3 Monotonic Specifications and Programs

Since the synthesis of failsafe fault-tolerance is NP-complete, as discussed earlier, we focus on this question: What restrictions can be imposed on specifications, programs and faults in order to guarantee that the addition of failsafe fault-tolerance can be done in polynomial time?

As seen in Section 4.2, one of the reasons behind the complexity involved in the synthesis of failsafe fault-tolerance is the inability of the fault-intolerant program to execute certain transitions even when no faults have occurred. More specifically, if a group of transitions includes a transition within the invariant of the fault-intolerant program and a transition that violates safety, then it is difficult to determine whether that group should be included in the failsafe fault-tolerant program.

To identify the restrictions that need to be imposed on the specification, the fault-intolerant program and the faults, we begin with the following question: Given a program p with invariant S, under what conditions can we design a failsafe fault-tolerant program, say p', that includes all transitions in p|S? If all transitions in p|S are included then it follows that p' will not deadlock in any state in S. Moreover, p' will satisfy its specification from S; if a computation of p' begins in S then it is also a computation of p.
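Here p|S denotes the transitions of p that start and end in states of S. Over an explicitly enumerated state space, this projection is a simple filter; the following sketch is our own illustration of the notation, not code from the dissertation:

```python
def project(p, S):
    """Compute p|S: the transitions of program p whose source and target
    states both lie in the state predicate S (represented here as a set)."""
    return {(s0, s1) for (s0, s1) in p if s0 in S and s1 in S}
```

For instance, projecting a four-state cycle onto three of its states keeps only the transitions that stay inside those states.
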
Now, we need to ensure that safety will not be violated due to fault transitions and the transitions that are grouped with those in p|S.

In this section, we identify the situations under which the addition of failsafe fault-tolerance can be achieved in polynomial time. Towards this end, in Subsection 4.3.1, we define a class of specifications, monotonic specifications, and a class of programs, monotonic programs, for which failsafe fault-tolerance can be synthesized in polynomial time. The intent of these definitions is to identify conditions under which a process can make safe estimates of variables that it cannot read. Also, we introduce the concept of fault-safe specifications. Subsequently, in Subsection 4.3.2, we show the role of the monotonicity restrictions imposed on specifications and programs in adding failsafe fault-tolerance. When these restrictions are satisfied, we show that the transitions in p|S and the transitions grouped with them form the failsafe fault-tolerant program.

4.3.1 Sufficiency of Monotonicity

In this section, we identify sufficient conditions for polynomial-time synthesis of failsafe fault-tolerant distributed programs from their fault-intolerant version.

In a program with a set of processes {P_0, ..., P_n}, consider the case where process P_j (0 ≤ j ≤ n) cannot read the value of a Boolean variable x. The definition of (positive) monotonicity captures the case where P_j can safely assume that x is false, and even if x were true when P_j executes, the corresponding transition would not violate safety. Thus, we define monotonic specifications as follows:

Definition.
A specification spec is positive monotonic on a state predicate Y with respect to a Boolean variable x iff the following condition is satisfied:

∀s_0, s_1, s'_0, s'_1 ::
  x(s_0) = false ∧ x(s_1) = false ∧ x(s'_0) = true ∧ x(s'_1) = true
  ∧ the values of all other variables in s_0 and s'_0 are the same
  ∧ the values of all other variables in s_1 and s'_1 are the same
  ∧ (s_0, s_1) does not violate spec ∧ s_0 ∈ Y ∧ s_1 ∈ Y
  ⇒ (s'_0, s'_1) does not violate spec

Likewise, we define monotonicity for programs by considering transitions within a state predicate, and define monotonic programs as follows:

Definition. A program p is positive monotonic on a state predicate Y with respect to a Boolean variable x iff the following condition is satisfied:

∀s_0, s_1, s'_0, s'_1 ::
  x(s_0) = false ∧ x(s_1) = false ∧ x(s'_0) = true ∧ x(s'_1) = true
  ∧ the values of all other variables in s_0 and s'_0 are the same
  ∧ the values of all other variables in s_1 and s'_1 are the same
  ∧ (s_0, s_1) ∈ p|Y
  ⇒ (s'_0, s'_1) ∈ p|Y

Negative monotonicity and monotonicity with respect to non-Boolean variables. We define negative monotonicity by swapping the words false and true in the above definitions. Also, although we defined monotonicity with respect to Boolean variables, it can be extended to deal with non-Boolean variables. One approach is to replace x = false with x = 0 and x = true with x ≠ 0 in the above definitions. In this case, the estimate for x is 0. We use this definition later in the section where we discuss the necessity of monotonic programs and specifications.

Definition. Given a specification spec and faults f, we say that spec is f-safe iff the following condition is satisfied:
∀s_0, s_1 :: ((s_0, s_1) ∈ f ∧ (s_0, s_1) violates spec) ⇒ (∀s_{-1} :: (s_{-1}, s_0) violates spec)

The above definition states that if a fault transition (s_0, s_1) violates spec then all transitions that reach state s_0 violate spec. The goal of this definition is to capture the requirement that if a computation prefix violates safety and the last transition in that prefix is a fault transition, then safety was violated even before the fault transition was executed. Another interpretation of this definition is that if a computation prefix maintains safety then the execution of a fault action cannot violate safety. Yet another interpretation is that the first transition that causes safety to be violated is a program transition.

We would like to note that for most problems, the specifications being considered are fault-safe. To understand this, consider the problem of mutual exclusion where a fault may cause a process to fail. In this problem, failure of a process does not violate safety; safety is violated if some process subsequently accesses its critical section even though some other process is already in the critical section. Thus, the first transition that causes safety to be violated is a program transition. We also note that the specifications for Byzantine agreement, consensus and commit are f-safe for the corresponding faults (cf. Section 4.4).

In fact, given a specification spec and a class of faults f, we can obtain an equivalent specification spec_f that prohibits the execution of the following transitions:

{(s_0, s_1) : (s_0, s_1) violates spec ∨ (∃s_2 :: (s_1, s_2) ∈ f ∧ (s_1, s_2) violates spec)}

We leave it to the reader to verify that 'p is failsafe f-tolerant to spec from S' iff 'p is failsafe f-tolerant to spec_f from S'.
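Over an explicitly enumerated finite state space, both conditions defined above (positive monotonicity of a specification and f-safety) can be checked mechanically. The sketch below is our own illustration, not code from the dissertation; states are dicts mapping variable names to values, and the caller supplies the predicates in_Y, spec_ok (true iff a transition does not violate spec), and violates:

```python
def is_positive_monotonic_spec(spec_ok, in_Y, states, x):
    """Positive monotonicity of spec on Y w.r.t. Boolean variable x:
    for all s0, s1 in Y with x false in both, if (s0, s1) does not violate
    spec, then the transition with x flipped to true must not violate it."""
    def flip(s):
        t = dict(s)
        t[x] = True
        return t
    for s0 in states:
        for s1 in states:
            if not s0[x] and not s1[x] and in_Y(s0) and in_Y(s1):
                if spec_ok(s0, s1) and not spec_ok(flip(s0), flip(s1)):
                    return False
    return True

def is_f_safe(violates, fault_transitions, states):
    """spec is f-safe iff for every fault transition (s0, s1) that violates
    spec, every transition (s, s0) reaching s0 violates spec as well."""
    return all(all(violates(s, s0) for s in states)
               for (s0, s1) in fault_transitions if violates(s0, s1))
```

Such direct checks are exponential in the number of variables, of course; the point of the monotonicity conditions is that, once established (often by inspection, as in Section 4.4), they license a polynomial-time synthesis step.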
With this observation, in the rest of this section, we assume that the given specification, spec, is f-safe. If this is not the case, Theorem 4.11 and Corollary 4.12 can be used if one replaces spec with spec_f.

Using monotonicity of specifications/programs for polynomial-time synthesis. We use the monotonicity of specifications and programs to show that even if the fault-intolerant program executes after faults occur, safety will not be violated. More specifically, we prove the following theorem:

Theorem 4.11 Given is a fault-intolerant program p, its invariant S, faults f and an f-safe specification spec.

If   ∀P_j, x : P_j is a process in p and x is a Boolean variable that P_j cannot read :
     spec is positive monotonic on S with respect to x
     ∧ the program consisting of the transitions of P_j is negative monotonic on S with respect to x

Then a failsafe fault-tolerant program that solves the synthesis problem can be obtained in polynomial time.

Proof. Let (s_0, s_1) be a transition of process P_j and let (s_0, s_1) be in p|S. Let x be a Boolean variable that P_j cannot read. Since we are considering programs where a process cannot blindly write a variable, it follows that x(s_0) equals x(s_1). Now, we consider the transition (s'_0, s'_1) where s'_0 (respectively, s'_1) is identical to s_0 (respectively, s_1) except for the value of x. We show that (s'_0, s'_1) does not violate spec by considering the value of x(s_0).

• x(s_0) = false. Since (s_0, s_1) ∈ p|S, it follows that (s_0, s_1) does not violate safety. Hence, from the positive monotonicity of spec on S, it follows that (s'_0, s'_1) does not violate spec.

• x(s_0) = true. From the negative monotonicity of p on S, (s'_0, s'_1) is in p|S. Hence, (s'_0, s'_1) does not violate spec.
The above discussion leads to a special case of solving the synthesis problem where the transitions in p|S and the transitions grouped with them can be included in the failsafe fault-tolerant program. Since p'|S equals p|S and p satisfies spec from S, it follows that p' satisfies spec from S. Moreover, as shown above, no transition of p' violates spec. And, since spec is f-safe, the execution of fault actions alone cannot violate spec. It follows that p' is failsafe f-tolerant to spec from S. □

We generalize Theorem 4.11 as follows:

Corollary 4.12 Given is a fault-intolerant program p, its invariant S, faults f and an f-safe specification spec.

If   ∀P_j, x : P_j is a process in p and x is a Boolean variable that P_j cannot read :
     (spec is positive monotonic on S with respect to x
      ∧ the program consisting of the transitions of P_j is negative monotonic on S with respect to x)
     ∨
     (spec is negative monotonic on S with respect to x
      ∧ the program consisting of the transitions of P_j is positive monotonic on S with respect to x)

Then a failsafe fault-tolerant program that solves the synthesis problem can be obtained in polynomial time. □

4.3.2 Role of Monotonicity in Complexity of Synthesis

In Section 4.3.1, we showed that if the given specification is positive (respectively, negative) monotonic and the fault-intolerant program is negative (respectively, positive) monotonic then the problem of adding failsafe fault-tolerance can be solved in polynomial time. In this section, we consider the question: What can we say about the complexity of adding failsafe fault-tolerance if only one of these conditions is satisfied? Specifically, in Observations 4.13 and 4.14, we show that if only one of these conditions is satisfied then the problem remains NP-complete.

Observation 4.13 Given is a fault-intolerant program p, its invariant S, faults f and an f-safe specification spec.
If the monotonicity restrictions (from Corollary 4.12) are satisfied for p and no restrictions are imposed on the monotonicity of spec then the problem of adding failsafe fault-tolerance to p remains NP-complete.

Proof. This proof follows from the fact that the program obtained by mapping the 3-SAT problem in Section 4.2 is negative monotonic with respect to h. Moreover, all processes can read all variables except h (i.e., e, f, and g). It follows that the proof in Section 4.2 maps an instance of the 3-SAT problem to an instance of the problem of adding failsafe fault-tolerance where the monotonicity restrictions from Corollary 4.12 hold for the program and no assumption is made about the monotonicity of the specification. Therefore, based on Lemmas 4.8 and 4.9, the proof follows. □

Furthermore, the specification obtained by mapping the 3-SAT problem in Section 4.2 is negative monotonic with respect to h. Hence, similar to Observation 4.13, we have

Observation 4.14 Given is a fault-intolerant program p, its invariant S, faults f and an f-safe specification spec.

If the monotonicity restrictions (from Corollary 4.12) are satisfied for spec and no restrictions are imposed on the monotonicity of p on S then the problem of adding failsafe fault-tolerance to p remains NP-complete.

Proof. The proof is similar to the proof of Observation 4.13. □

Based on the above discussion, it follows that the monotonicity of both programs and specifications is necessary in the proof of Theorem 4.11. If only one of these properties is satisfied then the problem of adding failsafe fault-tolerance remains NP-complete.

Comment on the monotonicity property. The monotonicity requirements are simple, and if a program and its specification meet the monotonicity requirements then the synthesis of failsafe fault-tolerance will be simple as well.
Nevertheless, the significance of such sufficient conditions lies in developing heuristics by which we transform specifications (respectively, programs) to monotonic specifications (respectively, programs) so that polynomial-time addition of failsafe fault-tolerance becomes possible. While the issue of designing such heuristics is outside the scope of this chapter, we note that we have developed such heuristics in Chapter 9 and [29], where we automatically transform specifications (respectively, programs) to monotonic specifications (respectively, programs) for the sake of polynomial-time addition of failsafe fault-tolerance to distributed programs.

4.4 Examples of Monotonic Specifications

In this section, we present three problems, Byzantine agreement, consensus and commit, for which the specifications and fault-intolerant programs are monotonic. In the case of Byzantine agreement, we first identify the variables and their respective domains. Then, we provide the fault-intolerant program and its invariant. Subsequently, we present the specification and faults. Finally, we show the monotonicity with respect to the appropriate variables. Since the arguments for consensus and commit are similar to those in the Byzantine agreement problem, we simply sketch the arguments for those two problems.

4.4.1 Byzantine Agreement

For simplicity, we consider the canonical version where there are 4 distributed processes g, j, k, and l such that g is the general and j, k, l are the non-generals. (An identical explanation is applicable if we consider an arbitrary number of non-generals.) In the agreement program, the general sends its decision to the non-generals and subsequently the non-generals output their decisions. Hence, each process has a variable d to represent its decision, a Boolean variable b to represent whether that process is Byzantine, and a variable f to represent whether that process has finalized (output) its decision.
The program variables and their domains are as follows:

d.g : {0, 1}
d.j, d.k, d.l : {0, 1, ⊥}           // ⊥ denotes an uninitialized decision
b.g, b.j, b.k, b.l : {true, false}  // b.j = true iff j is Byzantine
f.j, f.k, f.l : {0, 1}              // f.j = 1 iff j has finalized its decision

The fault-intolerant Byzantine agreement program, IB. Each non-Byzantine process j is represented by the following actions:

d.j = ⊥ ∧ f.j = 0  →  d.j := d.g
d.j ≠ ⊥ ∧ f.j = 0  →  f.j := 1

Invariant of IB. The invariant of IB, S_IB, is as follows:

S_IB = (∀p :: ¬b.p ∧ (d.p = ⊥ ∨ d.p = d.g) ∧ (f.p ⇒ d.p ≠ ⊥))

Safety specification of Byzantine agreement. The safety specification requires that validity and agreement be satisfied. Validity requires that if the general is not Byzantine and a non-Byzantine non-general has finalized its decision then the decision of that non-general process is the same as that of the general. Agreement requires that if two non-Byzantine non-generals have finalized their decisions then their decisions are identical. Hence, the program should not reach a state in S_sf, where

S_sf = (∃p, q :: ¬b.p ∧ ¬b.q ∧ d.p ≠ ⊥ ∧ d.q ≠ ⊥ ∧ d.p ≠ d.q ∧ f.p ∧ f.q)
     ∨ (∃p :: ¬b.g ∧ ¬b.p ∧ d.p ≠ ⊥ ∧ d.p ≠ d.g ∧ f.p)

In addition, when a non-Byzantine process finalizes, it is not allowed to change its decision. Therefore, the set of transitions that should not be executed is as follows:

ts_f = {(s_0, s_1) : s_1 ∈ S_sf}
     ∪ {(s_0, s_1) : ¬b.j(s_0) ∧ ¬b.j(s_1) ∧ f.j(s_0) = 1
                     ∧ (d.j(s_0) ≠ d.j(s_1) ∨ f.j(s_0) ≠ f.j(s_1))}

Faults. The Byzantine faults, f_B, can affect at most one process, and a Byzantine process can change its decision arbitrarily. Hence, the Byzantine faults are represented by the following actions:

¬b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l  →  b.j := true
b.j                        →  d.j, f.j := 0|1, 0|1

Read/write restrictions. Each non-general non-Byzantine process j is allowed to read r_j = {b.j, d.j, f.j, d.k, d.l, d.g} and it can only write w_j = {d.j, f.j}. Hence, in this case, w_j ⊆ r_j.
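The two guarded commands of IB above can be read operationally as a step function. The following sketch is our own illustration, not code from the dissertation; the dict-based state encoding and the use of None for the ⊥ value are our assumptions:

```python
BOT = None  # models the uninitialized decision value ⊥

def ib_step(state, j):
    """Execute one enabled action of non-general j in the fault-intolerant
    program IB; returns True iff some action was enabled and executed."""
    if state[f"d.{j}"] is BOT and state[f"f.{j}"] == 0:
        state[f"d.{j}"] = state["d.g"]   # first action: copy the general's decision
        return True
    if state[f"d.{j}"] is not BOT and state[f"f.{j}"] == 0:
        state[f"f.{j}"] = 1              # second action: finalize the decision
        return True
    return False
```

In the absence of faults, repeatedly applying ib_step makes j first copy d.g and then finalize, after which neither guard is enabled.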
Also, the variables that j is not allowed to read are nr_j = {b.g, b.k, b.l, f.k, f.l}.

Monotonicity of the specification and the program. We make the following observations.

Observation 4.15 The specification of Byzantine agreement is positive monotonic with respect to b.k (respectively, b.j and b.l).

Proof. Consider a transition (s_0, s_1) of some non-general process, say j, where validity and agreement are not violated when k is not Byzantine. Let (s'_0, s'_1) be the corresponding transition where k is Byzantine. Since validity and agreement impose no restrictions on what a Byzantine process may do, it follows that (s'_0, s'_1) does not violate validity and agreement. □

Observation 4.16 The specification of Byzantine agreement is negative monotonic with respect to f.k (respectively, f.j and f.l).

Proof. Consider a transition (s_0, s_1) of some non-general process, say j, where validity and agreement are not violated when f.k is 1, i.e., k has finalized its decision. Let (s'_0, s'_1) be the corresponding transition where f.k is 0. Since validity and agreement impose no restrictions on processes that have not finalized their decisions, it follows that (s'_0, s'_1) does not violate validity and agreement. □

Observation 4.17 The program IB_j, consisting of the transitions of j, with invariant S_IB is negative monotonic with respect to b.k (respectively, b.j and b.l).

Proof. Follows from the fact that IB|S_IB contains no transitions when b.k is true. □

Observation 4.18 The program IB_j, consisting of the transitions of j, with invariant S_IB is positive monotonic with respect to f.k (respectively, f.j and f.l).

Proof. We leave it to the reader to observe this by considering all transitions of j. □

Observation 4.19 The specification of Byzantine agreement is f_B-safe.

Proof.
Follows from the fact that a fault only affects the variables of a Byzantine process and, hence, cannot violate safety; safety may only be violated if a non-Byzantine process changes its state based on the variables of the Byzantine process. □

Now, using Observations 4.15-4.19 and Corollary 4.12, we have

Theorem 4.20 A failsafe fault-tolerant Byzantine agreement program can be obtained in polynomial time. □

To obtain the failsafe fault-tolerant program, we calculate the transitions of the fault-tolerant program inside the invariant S_IB. The groups of transitions associated with them form the failsafe fault-tolerant program, FSB. Thus, the actions of a non-general process j in the fault-tolerant program are as follows:

FSB1: d.j = ⊥ ∧ f.j = 0                          →  d.j := d.g
FSB2: (d.j = 0) ∧ (d.k ≠ 1) ∧ (d.l ≠ 1) ∧ f.j = 0  →  f.j := 1
FSB3: (d.j = 1) ∧ (d.k ≠ 0) ∧ (d.l ≠ 0) ∧ f.j = 0  →  f.j := 1

The first action remains unchanged; the second and third actions determine when a process can safely finalize its decision so that validity and agreement are preserved. Note that if the general is Byzantine and casts two different decisions to two non-general processes then the non-general processes may never finalize their decisions. Nonetheless, the program FSB will never violate the safety specification (i.e., FSB is failsafe fault-tolerant).

4.4.2 Consensus and Commit

We now discuss the problems of distributed consensus and atomic commit to show that their specifications and fault-intolerant programs satisfy the monotonicity requirements. Since the arguments involved in these problems are similar to those in Byzantine agreement, we simply outline the reasoning behind the monotonicity.

Consensus. In distributed consensus, each process begins with a vote. Initially, the votes of processes may be different.
It is required that all non-faulty processes agree on the same value (agreement) and that if the vote of every process is v then the agreed value be the same as v (validity). A fault can cause a process to crash (undetectably). Upon failure, the vote (and the decision) of the failed process is reset to ⊥ so that other processes cannot distinguish between the failed process and a process that has yet to vote. In this problem, we introduce a variable up.j for every process j; j can read its own up value but not the up value of other processes. It is straightforward to see that the specification of consensus is negative monotonic with respect to up. Likewise, in the absence of faults, all up values are true and, hence, in the absence of faults, a fault-intolerant program has no transitions that execute when an up value is false. It follows that a fault-intolerant program for consensus is positive monotonic with respect to up.

Commit. In the commit problem, the agreement requirement is the same as that in consensus. However, validity requires that if the vote of any process is 0 then the agreed value must be 0. And, if all processes vote 1 and no failures occur then it is required that the agreed value must be 1. Again, the fault considered for this problem is the crash fault and, hence, we introduce the variable up for every process to denote whether the process is up or not. The argument that the monotonicity requirements are met in the commit problem is the same as that in the consensus problem.

4.5 Summary

In this chapter, we focused on the problem of adding failsafe fault-tolerance to an existing fault-intolerant distributed program. A failsafe fault-tolerant program satisfies its specification (including safety and liveness) when no faults occur. However, if faults occur, it satisfies at least the safety specification.

We showed, in Section 4.2, that the problem of adding failsafe fault-tolerance to distributed programs is NP-complete.
Towards this end, we reduced the 3-SAT problem to the problem of adding failsafe fault-tolerance. In a broader perspective, we are interested in identifying the problems for which fault-tolerant programs can be synthesized efficiently (in polynomial time) and the problems for which exponential complexity is inevitable (unless P = NP). By identifying such a boundary, we can determine the problems that can reap the benefits of automation and the problems for which heuristics need to be developed in order to benefit from automation. This chapter helps to make this boundary more precise than [1] in three ways. For one, the proof in [1] is for masking fault-tolerance, where both safety and liveness need to be satisfied. By contrast, the NP-completeness result in this chapter applies to the class of programs where only safety is satisfied. Also, the proof in [1] relies on the ability of a process to blindly write some variables. By contrast, the proof in this chapter does not rely on such an assumption. The third, and the most important, step in identifying the boundary is addressed in Section 4.3, where we identified a class of specifications and a class of programs for which failsafe fault-tolerance can be added in polynomial time. Essentially, this class captures the intuition that to obtain a failsafe fault-tolerant program, we can let the fault-intolerant program execute in the presence of faults and ensure that a program transition is executed only if its execution will be safe even if faults have occurred. Towards this end, we imposed two restrictions: positive monotonicity of the specification and negative monotonicity of the fault-intolerant program. We showed that these restrictions are sufficient for polynomial synthesis of failsafe fault-tolerant distributed programs.
To show the sufficiency, in Section 4.3, we showed how a failsafe fault-tolerant program can be designed if one begins with a positive monotonic specification and a negative monotonic program. Also, we proved that if only the input program (respectively, specification) is monotonic and there exists no assumption about the monotonicity of the specification (respectively, program), then the synthesis of failsafe fault-tolerance remains NP-complete.

Chapter 5

Fault-Tolerance Enhancement

In this chapter, we concentrate on automated techniques to enhance the fault-tolerance level of a program from nonmasking to masking. Given the complexity of adding fault-tolerance to a fault-intolerant distributed program, in this chapter, we address the following question: Is it possible to reduce the complexity of adding masking fault-tolerance if we begin with a program that provides additional guarantees about its behavior in the presence of faults? Towards this end, we formally define the problem of enhancing the fault-tolerance of nonmasking programs to masking. Then, we present a sound and complete algorithm for the enhancement of fault-tolerance in the high atomicity model. We also present a sound algorithm for enhancing the fault-tolerance of nonmasking distributed programs. We illustrate our algorithms by enhancing the fault-tolerance of the triple modular redundancy (TMR) program and the Byzantine agreement program.

This chapter is organized as follows: In Section 5.1, we state the problem of enhancing fault-tolerance from nonmasking to masking. In Section 5.2, we present our solution for the high atomicity model. In Section 5.3, we present our solution for distributed programs. Finally, we summarize this chapter in Section 5.6.

5.1 Problem Statement

In this section, we formally define the problem of enhancing fault-tolerance from nonmasking to masking.
The input to the enhancement problem includes the (transitions of the) nonmasking program, p, its invariant, S, faults, f, and specification, spec. Given p, S, and f, we can calculate an f-span, say T, of p by starting at a state in S and identifying the states reached in the computations of p[]f. Hence, we include the fault-span T in the inputs of the enhancement problem. The output of the enhancement problem is a masking fault-tolerant program, p', its invariant, S', and its f-span, T'.

Since p is nonmasking fault-tolerant, in the presence of faults, p may temporarily violate safety. More specifically, faults may perturb p to a state in T − S. After faults stop occurring, p will eventually reach a state in S. However, p may violate spec while it is in T − S. By contrast, a masking fault-tolerant program p' must satisfy its safety specification even during recovery from T − S to S.

The goal of the enhancement problem is to separate the tasks involved in adding recovery transitions from the tasks involved in ensuring safety. The enhancement problem deals only with adding safety to a nonmasking fault-tolerant program. With this intuition, we define the enhancement problem in such a way that only safety may be added while adding masking fault-tolerance. In other words, we require that during the enhancement, no new transitions are added to deal with functionality or to deal with recovery. Towards this end, we identify the relation between the state predicates T and T', and the relation between the transitions of p and p'. If p'[]f reaches a state that is outside T then new recovery transitions must be added while obtaining the masking fault-tolerant program. Hence, we require that the fault-span of the masking fault-tolerant program, T', be a subset of T. Likewise, if p' does not introduce new recovery transitions then the transitions included in p'|T' must be a subset of p|T'.
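In a transition-set encoding (an assumption of this sketch, not the dissertation's formal notation), the two structural requirements just derived, T' ⊆ T and p'|T' ⊆ p|T', can be checked mechanically:

```python
def restrict(trans, pred):
    # p|T: the transitions of p that start in a state of predicate T
    return {(s0, s1) for (s0, s1) in trans if s0 in pred}

def respects_enhancement(T, p, T_new, p_new):
    """Check the two structural requirements of the enhancement problem:
    T' is a subset of T, and p'|T' is a subset of p|T'.  (That p' is
    masking f-tolerant from T' is a separate behavioral check, not
    attempted here.)"""
    return T_new <= T and restrict(p_new, T_new) <= restrict(p, T_new)

# toy instance: the fault-span may shrink, but no new transition may
# appear inside the new fault-span
T, p = {0, 1, 2}, {(0, 1), (1, 2), (2, 0)}
print(respects_enhancement(T, p, T_new={0, 1}, p_new={(0, 1)}))  # True
print(respects_enhancement(T, p, T_new={0, 1}, p_new={(0, 2)}))  # False
```

The behavioral requirement (masking f-tolerance of p' from T') is intentionally left out: it concerns computations rather than individual transitions, and is established separately by the algorithms that follow.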
Thus, the enhancement problem is as follows:

The Enhancement Problem

Given p, S, spec, f, and T such that p satisfies spec from S and T is an f-span used to show that p is nonmasking fault-tolerant for spec from S,
identify p' and T' such that

T' ⊆ T, p'|T' ⊆ p|T', and
p' is masking f-tolerant from T' for spec. □

Comments on the Problem Statement

1. While the invariant, S, of the nonmasking fault-tolerant program is an input to the enhancement problem, it is not used explicitly in the requirements of the enhancement problem. The knowledge of S permits us to identify the transitions of p that provide functionality and the transitions of p that provide recovery. We find that such a classification of transitions is useful in solving the enhancement problem. Hence, we include S in the problem statement.

2. If S' is an invariant of p', S' ⊆ T', every computation of p' that starts from a state in T' maintains safety, and every computation of p' that starts from a state in T' eventually reaches a state in S', then every computation of p' that starts in a state in T' also satisfies its specification. In other words, in this situation, T' is also an invariant of p'. (This result has been previously shown in [18]; we repeat the proof in Section 5.2.) Hence, we do not explicitly identify an invariant of p'. The predicates T' and T' ∩ S can be used as invariants of p'.

3. The above problem statement assumes that no new states/variables are added while enhancing fault-tolerance. This assumption can be removed by allowing systematic addition of new variables [1]. Another approach is to pretend that a process can read certain private variables of other processes. Then, we design a masking program that uses such private variables. The transitions of such a masking program will require the detection of predicates involving the private variables of other processes; one can use refinement techniques to detect these non-local predicates appropriately.
These refinement techniques, in turn, will determine the new variables that need to be added to detect these non-local predicates. Several such refinement techniques have been discussed in the literature (e.g., [30, 18]).

5.2 Enhancement in High Atomicity Model

In this section, we present our algorithm for solving the enhancement problem in the high atomicity model. Thus, given a high atomicity nonmasking fault-tolerant program p, our algorithm derives a masking fault-tolerant program p' in such a way that safety is added while the recovery provided by p is preserved.

The goal of the enhancement problem is to add safety while preserving recovery. Hence, we obtain a solution for the enhancement problem by tailoring the algorithm Add_failsafe (see Section 2.7.1); Add_failsafe deals with the addition of safety to a fault-intolerant program in the presence of faults. In our algorithm (cf. Figure 5.1), first, we compute the set of states, ms, from where fault actions alone violate safety. Clearly, we must ensure that the program never reaches a state in ms. Hence, in addition to the transitions that violate safety, we cannot use the transitions that reach a state in ms. We use mt to denote the transitions that cannot be used while adding safety.

Using ms and mt, we compute the fault-span of p', T', by calling the function HighAtomicityConstructInvariant (HACI). The first guess for T' is T − ms. However, due to the removal of the transitions in mt, it may not be possible to provide recovery from some states in T − ms. Hence, we remove such states while obtaining T'. If the removal of such states causes other states to become deadlocked, we remove those states as well. Moreover, if (s0, s1) is a fault transition such that s1 was removed from T', then we remove s0 to ensure that T' is closed in f. We continue the removal of states from T' until a fixed point is established.
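The fixpoint computation just described can be sketched as follows. The explicit set-of-pairs encoding of p and f is an assumption of this sketch; on realistic state spaces one would use a symbolic representation instead.

```python
def haci(states, p, f):
    """HighAtomicityConstructInvariant, per the description above: drop
    deadlocked states (no outgoing program transition staying inside the
    set) and states with a fault transition leaving the set, until a
    fixpoint is reached."""
    T = set(states)
    changed = True
    while changed:
        changed = False
        for s in list(T):
            deadlocked = not any(s0 == s and s1 in T for (s0, s1) in p)
            fault_escape = any(s0 == s and s1 not in T for (s0, s1) in f)
            if deadlocked or fault_escape:
                T.discard(s)
                changed = True
    return T

# state 2 is deadlocked; removing it makes the fault transition (1, 2)
# leave the set, so state 1 is removed in a later pass
print(haci({0, 1, 2}, p={(0, 0), (1, 0)}, f={(1, 2)}))  # {0}
```

The example illustrates why the two removal rules must be iterated together: removing a deadlock state can expose a fault transition that now leaves the set, and vice versa.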
After computing T', we compute the transitions of p' by removing all the transitions of p − mt that start in a state in T' but reach a state outside T'. Thus, our algorithm is as follows:

High_Atomicity_Enhancement(p, f: set of transitions, T: state predicate, spec: specification)
{
  ms := {s0 : ∃s1, s2, ..., sn : (∀j : 0 ≤ j < n : (sj, s(j+1)) ∈ f) ∧ ((s(n−1), sn) violates spec)};
  mt := {(s0, s1) : (s1 ∈ ms) ∨ ((s0, s1) violates spec)};
  T' := HACI(T − ms, p − mt, f);
  p' := (p − mt) − {(s0, s1) : (s0 ∈ T') ∧ (s1 ∉ T')};
  return p', T';
}

Figure 5.1: The enhancement of fault-tolerance in the high atomicity model.

5.2.1 Example: Triple Modular Redundancy

As an illustration, we enhance the fault-tolerance of a nonmasking version of the triple modular redundancy (TMR) program. The program consists of three processes j, k, and l; each process j has an input variable in.j, and the processes share an output variable out, which is initially undefined (⊥). When out is undefined, a process may copy its input to out; the nonmasking program may also correct out when it differs from an input value shared by two processes. Thus, the actions of process j are as follows (the actions of k and l are analogous, and ⊕ denotes modulo-3 addition over the process indices):

N1: (out = ⊥) → out := in.j
N2: (out ≠ ⊥) ∧ (out ≠ in.j) ∧ ((in.j = in.(j⊕1)) ∨ (in.j = in.(j⊕2))) → out := in.j

Faults. Faults may perturb one of the inputs when all of them are equal. Thus, the fault action that affects j is represented by the following action:

F: (∀p :: in.j = in.p) → in.j := 0 | 1

Invariant. The following state predicate is an invariant of TMR:

S_TMR = (out = ⊥ ∧ (∀p, q :: in.p = in.q)) ∨ (∃p, q : p ≠ q : out = in.p = in.q)

Safety specification. The safety specification of TMR requires the program not to reach states in which there exist two processes whose input values are equal but these inputs are not equal to out (where out ≠ ⊥). The safety specification also stipulates that variable out cannot change if it is different from ⊥. Thus, the safety specification requires that the following transitions not be included in a program computation:

sf_TMR = sf1 ∪ sf2, where
sf1 = {(s0, s1) | ∃p, q : p ≠ q : (in.p(s1) = in.q(s1)) ∧ (in.q(s1) ≠ out(s1)) ∧ (out(s1) ≠ ⊥)}, and
sf2 = {(s0, s1) | (out(s0) ≠ ⊥) ∧ (out(s0) ≠ out(s1))}

Fault-span. If all the inputs are equal then the value of out is either ⊥ or equal to those inputs. Thus, the fault-span of the nonmasking version of TMR is T_TMR, where

T_TMR = (∀p, q :: in.p = in.q) ⇒ ((out = ⊥) ∨ (∀p :: out = in.p))

Remark. The TMR program consists of three variables whose domain is {0, 1} and one variable whose domain is {0, 1, ⊥}. Enumerating the states associated with these variables, the state space of the TMR program includes 24 states. Of these, 10 states are in the invariant, 12 additional states are in the fault-span, and two states are outside the fault-span.
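The counts in the remark above can be confirmed by brute-force enumeration; the tuple encoding of TMR states below is our own, not the dissertation's.

```python
from itertools import product

BOT = None  # represents the undefined value ⊥

def in_invariant(ins, out):
    # S_TMR: either no output yet and all inputs agree, or out matches
    # the inputs of at least two of the three processes
    if out is BOT:
        return len(set(ins)) == 1
    return sum(1 for v in ins if v == out) >= 2

def in_fault_span(ins, out):
    # T_TMR: (all inputs equal) implies (out = ⊥ or out equals every input)
    if len(set(ins)) == 1:
        return out is BOT or all(out == v for v in ins)
    return True

states = [(ins, out) for ins in product((0, 1), repeat=3)
                     for out in (0, 1, BOT)]
inv = [s for s in states if in_invariant(*s)]
span = [s for s in states if in_fault_span(*s)]
print(len(states), len(inv), len(span) - len(inv), len(states) - len(span))
# 24 10 12 2
```

The two states outside the fault-span are exactly those where all inputs are equal yet out holds the opposite value, which the fault action F cannot produce.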
The program consisting of actions N1 and N2 is nonmasking fault-tolerant in that if it begins in a state where S_TMR is true then it satisfies its specification. Moreover, if the faults perturb it to a state in T_TMR − S_TMR then it eventually recovers to a state where S_TMR is true. Nonetheless, until such a state is reached, the safety specification may be violated.

Enhancing the tolerance of TMR. We trace the execution of our high atomicity algorithm on the nonmasking TMR program.

1. Compute ms. ms includes all the states from where one or more fault transitions violate safety. In the case of TMR, fault transitions do not violate safety if they execute in a state in T_TMR; faults only change the value of one of the inputs, and safety may subsequently be violated only if the corresponding process executes guarded command N1. Thus, T_TMR ∩ ms = {}.

2. Compute mt. From the definition of ms, mt = sf_TMR.

3. Construct T'_TMR and p'. After removing the transitions in mt, states where out differs from ⊥ and out differs from the majority of the inputs are deadlocked. Hence, we need to remove those states while obtaining T'_TMR. After the removal of those states, there are no other deadlock states. Hence, our algorithm lets T'_TMR be the state predicate:

T'_TMR = T_TMR − {s : ∃p, q : p ≠ q : (in.p(s) = in.q(s)) ∧ (out(s) ≠ ⊥) ∧ (out(s) ≠ in.p(s))}

Moreover, to obtain the transitions of the masking version of TMR, we consider the transitions of p that preserve the closure of T'_TMR. Thus, the masking version of TMR consists of the following guarded command:

M1: (out = ⊥) ∧ ((in.j = in.(j⊕1)) ∨ (in.j = in.(j⊕2))) → out := in.j

The predicate T'_TMR computed by our algorithm is both an invariant and a fault-span for the above program; every computation of the above program satisfies the specification if it begins in a state in T'_TMR. Moreover, T'_TMR is closed in both the program and fault transitions.

Remark.
Note that the transitions included in N2 are removed from the above masking fault-tolerant program, as those transitions violate sf2. However, if safety consisted of only sf1 then the fault-tolerant program would include the transitions of N2. While a masking fault-tolerant program can be obtained without using the transitions in N2, their inclusion follows from the heuristic in [1] that the output program should be maximal. In [1], Kulkarni and Arora have argued that if the output of a synthesis algorithm is to be used as an input, say to add fault-tolerance for a new fault, it is desirable that the intermediate program be maximal.

5.3 Enhancement for Distributed Programs

In this section, we present an algorithm to enhance the fault-tolerance level of a distributed nonmasking fault-tolerant program to masking. First, we discuss the issues involved in the enhancement problem for distributed programs. Then, we present our algorithm. As a case study, we apply our algorithm to the Byzantine agreement problem.

In the high atomicity model, the main issue in enhancing the fault-tolerance level of a nonmasking fault-tolerant program p was to ensure that p does not execute a safety-violating transition (s0, s1). In order to achieve this goal, we can either (i) ensure that p will never reach s0, or (ii) remove (s0, s1). For the high atomicity model, we chose the latter option, as it was strictly a better choice. However, for distributed programs, we cannot simply remove a safety-violating transition (s0, s1), as (s0, s1) could be grouped with some other transitions (due to read restrictions). Thus, removal of (s0, s1) would also remove other transitions that are potentially useful recovery transitions. In other words, for distributed programs, the second choice is not necessarily the best option. Since an appropriate choice between the above two options cannot be identified easily for distributed programs, the synthesis of distributed programs becomes more difficult.
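The effect of grouping can be made concrete with a small sketch. Below, states are tuples and the acting process can read/write only the variable positions listed in vis; the helper group_of is hypothetical, not part of the dissertation's framework.

```python
def group_of(t, trans, vis):
    """The transitions that stand or fall together with t: those that
    look identical to t on the variables the process can read/write."""
    proj = lambda s: tuple(s[i] for i in vis)
    s0, s1 = t
    return {(a, b) for (a, b) in trans
            if proj(a) == proj(s0) and proj(b) == proj(s1)}

# states are (x, y); the process reads/writes only y (position 1)
trans = {((0, 0), (0, 1)),   # useful recovery transition
         ((1, 0), (1, 1))}   # same action when x = 1; suppose it is unsafe
bad = ((1, 0), (1, 1))
print(group_of(bad, trans, vis=(1,)))
# both transitions are in the group: removing the unsafe one also
# removes the useful recovery transition
```

Since the process cannot observe x, the two transitions are indistinguishable to it, which is precisely why a safety-violating transition cannot be removed in isolation.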
We develop our low atomicity algorithm (cf. Figure 5.3) by tailoring the high atomicity algorithm to deal with the grouping of transitions. More specifically, given a nonmasking fault-tolerant program p, we first start by calculating a high atomicity fault-span, T'high, which is closed in p[]f. Since the low atomicity model is more restrictive than the high atomicity model and T'high is the largest fault-span for a high atomicity program, we use T'high as the domain of the states that may be included in the fault-span of our low atomicity program. In other words, if a transition, say (s0, s1), violates the safety specification and s0 ∉ T'high, then we may include the group associated with (s0, s1) and ensure that state s0 is never reached.

Then, we call the function LowAtomicityConstructInvariant (LACI) to calculate a low atomicity invariant, S'init, for p' (cf. Figure 5.2). In the body of the algorithm in Figure 5.3, to calculate S'init, we first call the function LACI with T'high ∩ S as its first argument. Inside LACI, we ignore the fault transitions during the call to HACI; we consider the effect of fault transitions subsequently. In this call to HACI, we also ignore the grouping of transitions. The grouping requirements are instead checked on the value of S'high returned by HACI. Specifically, if there exists a group containing transitions (s0, s1) and (s0', s1') such that s0, s0', s1' ∈ S'high and s1 ∉ S'high, we remove s0 from S'high and recalculate the invariant. If no such group exists, LACI returns S'high. Thus, the function LACI is as follows:

LACI(S: state predicate, p: transitions, g0, ..., gm: groups of transitions)
{
  S'high := HACI(S, p, ∅);
  if (∃gi, s0, s1, s0', s1' : (s0, s1), (s0', s1') ∈ gi : s0, s0', s1' ∈ S'high ∧ s1 ∉ S'high)
    then return LACI(S'high − {s0}, p, g0, ..., gm);
  else return S'high;
}

Figure 5.2: Constructing an invariant in the low atomicity model.

In Figure 5.3, the value returned by LACI, S'init, is used as an estimate of the invariant of the masking fault-tolerant distributed program. To compute T', we identify the effect of the fault transitions and the program transitions from states in S'init. We use the variable S'low to keep track of the states reached in the execution of the program and fault transitions from S'init. Our first estimate for S'low is the same as S'init. Now, we compute S2 as the set of states reached in one step (of the program or faults). Regarding fault transitions, if (s0, s1) is a fault transition, s0 ∈ S'low, and s1 ∈ (T'high − S'low), then we add state s1 to the set S2. Regarding program transitions, we only consider a group if the following three conditions are satisfied: (1) at least one of the transitions in it begins and ends in S'low; (2) if a transition in that group begins in a state in T'high then it terminates in a state in T'high and it does not violate safety; and (3) if a transition in that group begins in a state in S'init then it terminates in a state in S'init. If such a group has another transition (s0', s1') such that s0' ∈ S'low and s1' ∉ S'low then we include state s1' in the set S2. (Note that in the first iteration, S'init equals S'low. Hence, expansion by program transitions need not be considered. However, this expansion may be necessary in subsequent iterations.) Thus, S2 identifies the states from where recovery must be preserved.
Low_Atomicity_Enhancement(p: transitions, g0, ..., gm: groups of transitions, f: faults, T, S: state predicate, spec: specification)
// p = g0 ∪ g1 ∪ ... ∪ gm
{
  Calculate ms and mt as in High_Atomicity_Enhancement;
  T'high := HACI(T − ms, p − mt, f);
  S'init := S'low := LACI(S ∩ T'high, p − mt, g0, ..., gm);
  repeat {
    S2 := {s1 : s1 ∈ (T'high − S'low) : (∃s0 : s0 ∈ S'low : (s0, s1) ∈ f ∨
           (∃gi : (s0, s1) ∈ gi : ((gi|S'low) ∩ (p − mt)) ≠ ∅ ∧
             (∀s2, s3 : (s2, s3) ∈ gi ∧ s2 ∈ T'high : s3 ∈ T'high ∧ (s2, s3) ∉ mt) ∧
             (∀s2, s3 : (s2, s3) ∈ gi ∧ s2 ∈ S'init : s3 ∈ S'init)))};
    S3 := {s0 : s0 ∈ (T'high − S'low) : (∃s1, gi : (s0, s1) ∈ gi ∧ s1 ∈ S'low :
             (∀s2, s3 : (s2, s3) ∈ gi ∧ s2 ∈ T'high : s3 ∈ T'high ∧ (s2, s3) ∉ mt) ∧
             (∀s2, s3 : (s2, s3) ∈ gi ∧ s2 ∈ S'init : s3 ∈ S'init))};
    S'low := S'low ∪ S3;
  } until (S3 = ∅);
  if (S2 ≠ ∅) then declare that fault-tolerance cannot be enhanced; exit();
  T' := S'low;
  p' := {gi : (∀s0, s1 : (s0, s1) ∈ gi : (s0 ∈ T' ⇒ (s1 ∈ T' ∧ (s0, s1) ∈ (p − mt))) ∧ (s0 ∈ S'init ⇒ s1 ∈ S'init))};
  return p', T';
}

Figure 5.3: The enhancement of fault-tolerance for distributed programs.

We then calculate the set of states from where recovery can be added in one step. Specifically, if there is a transition (s0, s1) such that s0 ∉ S'low and s1 ∈ S'low then we include s0 in the set S3. We require that T'high and S'init are closed in the group being considered for recovery and that safety is not violated by any transition (that starts in a state in T'high) in that group (see the constraints of S3 in Figure 5.3). Subsequently, we add S3 to S'low. The goal of this step is to ensure that infinite computations are possible from all states in S'low. This property is true about the initial value (S'init) of S'low. Moreover, this property continues to be true since there is an outgoing transition from every state in S3. We continue this calculation until no new states can be added to S'low.
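The construction performed by LACI (Figure 5.2) can be sketched as follows, again assuming an explicit set-of-pairs encoding of transitions and groups:

```python
def prune_deadlocks(S, p):
    # HACI with an empty fault set: drop states with no outgoing program
    # transition inside S, until a fixpoint
    S = set(S)
    while True:
        dead = {s for s in S
                if not any(a == s and b in S for (a, b) in p)}
        if not dead:
            return S
        S -= dead

def laci(S, p, groups):
    """Figure 5.2, sketched: if some group mixes a transition (s0, s1)
    that leaves the candidate invariant with a transition that stays
    inside it, remove s0 and recompute."""
    S = prune_deadlocks(S, p)
    for g in groups:
        for (s0, s1) in g:
            if s0 in S and s1 not in S and \
               any(a in S and b in S for (a, b) in g):
                return laci(S - {s0}, p, groups)
    return S

# the second group mixes (0, 4), which leaves S, with (2, 3), which stays
# inside S, so state 0 must be removed from the invariant
p = {(0, 0), (2, 3), (3, 3)}
groups = [{(0, 0)}, {(0, 4), (2, 3)}]
print(laci({0, 2, 3}, p, groups))  # {2, 3}
```

Note that the removal of s0 is one non-deterministic choice among several possible repairs; as discussed below, this choice is one source of the algorithm's incompleteness.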
At this point, if S2 is nonempty, i.e., there are states from where recovery needs to be added but no new recovery transitions can be added, we declare failure. Otherwise, we identify the transitions of the fault-tolerant program p' by considering the transitions of p − mt that start in a state in S'low. Hence, our low atomicity algorithm is as shown in Figure 5.3.

Before we discuss the soundness and the complexity of Low_Atomicity_Enhancement, we first make some observations about our low atomicity algorithm. Then, we present three lemmas that are used in the soundness proof. Similar to the proof for the high atomicity algorithm, we have

Observation 5.12 T' ⊆ (T − ms), T' ∩ ms = {}, and (p' | T') ∩ mt = ∅. □
Observation 5.13 S'init ⊆ S, and S'low ∩ ms = {}. □
Observation 5.14 T'high ⊆ T. □
Observation 5.15 (p' | T') ⊆ (p | T'). □

In the main loop of the algorithm, S2 and S3 are subsets of T'high. Hence, the relation S'low ⊆ T'high remains true throughout our algorithm. The value of T' equals the value of S'low when the loop terminates. Hence, we have

Observation 5.16 T' ⊆ T'high ⊆ T. □

Lemma 5.17 p'[]f maintains spec from T'.
Proof. By construction, when T' is assigned the value S'low, the value of S2 is the empty set. Thus, starting from a state in T', no transition of p'[]f can perturb the program to a state that is outside T'. It follows that T' is closed in p'[]f. Now, let c be a computation of p'[]f that starts from a state in T'. Just as in the proof of Lemma 5.6, it can be shown that each prefix of c maintains spec. Thus, p'[]f maintains spec from T'. □

Lemma 5.18 p' satisfies spec from S'init.
Proof. Since S'init is a subset of S, S'init ⊆ S'low ⊆ T', and (p'|T') ⊆ (p|T'), every computation of p' that starts from a state in S'init is also a computation of p. Hence, every computation of p' that starts from a state in S'init is in spec. Also, by construction of p', S'init is closed in p'. Thus, p' satisfies spec from S'init.
□

Lemma 5.19 Every computation of p' that starts in a state in T' is infinite.
Proof. By construction of LACI, this property is true about S'init. Now, a state, say s, is added to S3 only if there is a recovery transition, say t, from that state. Moreover, when the transitions of p' are computed, the value of S2 is the empty set. Hence, the group(s) of transitions containing t is included in p'. Thus, from every state in T', there is an outgoing transition in p'. It follows that every computation of p' that starts in a state in T' is infinite. □

Theorem 5.20 T' is (also) an invariant of p' for spec.
Proof. From Observation 5.15, every computation of p' that starts in a state in T' is a computation of p. Thus, every computation of p' that starts from a state in T' reaches a state in S. Hence, a computation c of p' from T' is of the form ⟨s0, s1, ..., sn, s(n+1), ...⟩, where sn ∈ S. By Lemma 5.17, ⟨s0, s1, ..., sn⟩ maintains spec, and ⟨sn, s(n+1), ...⟩ is in spec. Now, similar to the proof of Theorem 5.8, we can show that c is in spec. Thus, T' is also an invariant of p' for spec. □

Theorem 5.21 The algorithm Low_Atomicity_Enhancement is sound, and its complexity is polynomial in the state space of the nonmasking fault-tolerant program.
Proof. Regarding soundness, we have to show that the conditions of the enhancement problem are satisfied:
1. T' ⊆ T (cf. Observation 5.16).
2. p'|T' ⊆ p|T' (cf. Observation 5.15).
3. p' is masking f-tolerant for spec from T'. By letting the fault-span be T' itself, the proof follows.
Regarding complexity, we observe that the number of iterations of the main loop is at most |T'high|, and each statement in the low atomicity algorithm requires only polynomial time. □

Modifications/Improvements for Low_Atomicity_Enhancement. There are several improvements that can be made to the above algorithm. We discuss these improvements and issues related to completeness below.

1.
In the low atomicity enhancement algorithm, if the value of S2 is the empty set then we can break out of the loop before computing S3. Subsequently, we can use the value of S'low at that time to compute p' and T'. However, we continue in the loop to determine whether recovery can be added from new states. This allows the possibility that a larger fault-span is computed and additional transitions are included in the masking fault-tolerant program. As mentioned in [1], if the output of a synthesis algorithm is used as an input to another synthesis algorithm, say to add fault-tolerance for a new fault, then it is desirable that the fault-span and the transitions of the intermediate program be maximal. For this reason, we have allowed the algorithm to expand the fault-span and to add new transitions.

2. In the low atomicity enhancement algorithm, in the calculation of S3, we calculate the states from where recovery is possible. One heuristic is to focus on the states in S2 first, as recovery must be added from the states in S2. If recovery from the states in S2 is not possible then other states in T'high − S'low should be considered. However, considering the states in S2 alone may be insufficient, as it may not be possible to add recovery from those states in one step; adding recovery from other states can help in recovering from the states in S2.

3. Our algorithm is incomplete in that it may be possible to enhance the fault-tolerance of a given nonmasking program although our algorithm fails to find a solution. One of the causes of incompleteness is in our calculation of LACI; when LACI needs to remove states/transitions to deal with the grouping of transitions, the choice is non-deterministic. Since this choice may be inappropriate, the algorithm is incomplete.
As we showed in Chapter 4 that adding failsafe fault-tolerance to distributed programs is NP-complete, it is expected that the complexity of a deterministic, sound, and complete algorithm for enhancing the fault-tolerance of a distributed nonmasking program will be exponential unless P = NP.

5.3.1 Example: Byzantine Agreement

We show how our algorithm for the low atomicity model is used to enhance the fault-tolerance level of a nonmasking Byzantine agreement program to masking. First, we present the nonmasking program, its invariant, its safety specification, the faults, the fault-span for the given faults, and the read/write restrictions. Finally, we show how our algorithm is used to obtain the masking program (in [26]) for Byzantine agreement.

Variables for Byzantine agreement. The nonmasking program consists of three non-general processes j, k, and l, and a general g. Each non-general process has three variables d, f, and b. Variable d.j represents the decision of a non-general process j, f.j denotes whether j has finalized its decision, and b.j denotes whether j is Byzantine or not. Process g has the variables d.g and b.g. Thus, the variables in the Byzantine agreement program are as follows:

• d.g : {0, 1}
• d.j, d.k, d.l : {0, 1, ⊥}
• b.g, b.j, b.k, b.l : {true, false}
• f.j, f.k, f.l : {0, 1}

Transitions of the nonmasking program. If process j has not copied a value from the general, action NB1 copies the decision of the general. If j has copied a decision and, as a result, d.j is different from ⊥, then j can finalize its decision by action NB2. If process j reaches a state where its decision is not equal to the majority of decisions and all the non-general processes have decided, then j corrects its decision by actions NB3 or NB4.
Thus, the actions of each process j in the nonmasking program are as follows:

NB1: d.j = ⊥ ∧ f.j = 0 → d.j := d.g
NB2: d.j ≠ ⊥ ∧ f.j = 0 → f.j := 1
NB3: (d.j = 1) ∧ (d.k = 0) ∧ (d.l = 0) → d.j := 0
NB4: (d.j = 0) ∧ (d.k = 1) ∧ (d.l = 1) → d.j := 1

Safety specification. The safety specification requires that if g is Byzantine, all the non-general processes finalize with the same decision (agreement). If g is not Byzantine, then the decision of every non-general non-Byzantine process that has finalized should be the same as d.g (validity). Thus, safety is violated if the program reaches a state in SSf, where (in this section, unless otherwise specified, quantifications are over non-general processes)

SSf = (∃p, q :: ¬b.p ∧ ¬b.q ∧ d.p ≠ ⊥ ∧ d.q ≠ ⊥ ∧ d.p ≠ d.q ∧ f.p ∧ f.q)
    ∨ (∃p :: ¬b.g ∧ ¬b.p ∧ d.p ≠ ⊥ ∧ d.p ≠ d.g ∧ f.p)

Also, a transition violates safety if it changes the decision of a process after it has finalized. Thus, the set of transitions that violate safety is equal to tsf, where

tsf = {(s0, s1) : s1 ∈ SSf} ∪ {(s0, s1) : ∃p :: ¬b.p(s0) ∧ ¬b.p(s1) ∧ f.p(s0) = 1 ∧ ((d.p(s0) ≠ d.p(s1)) ∨ (f.p(s0) ≠ f.p(s1)))}

Invariant. The invariant of nonmasking Byzantine agreement is the state predicate S_NB = S_NB1 ∨ S_NB2, where

S_NB1 = ¬b.g ∧ (¬b.j ∨ ¬b.k) ∧ (¬b.k ∨ ¬b.l) ∧ (¬b.l ∨ ¬b.j) ∧
        (∀p :: ¬b.p ⇒ (d.p = ⊥ ∨ d.p = d.g)) ∧ (∀p :: (¬b.p ∧ f.p) ⇒ (d.p ≠ ⊥))
S_NB2 = b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l ∧ (d.j = d.k = d.l ∧ d.j ≠ ⊥)

Read/write restrictions. Each non-general process j is allowed to read {b.j, d.j, f.j, d.k, d.l, d.g}. Thus, j can read the d values of other processes and all of its own variables. The set of variables that j can write is {d.j, f.j}.

Faults for Byzantine agreement. A fault transition can cause a process to become Byzantine if no process is initially Byzantine. A fault can also change the d and f values of a Byzantine process.
Thus, the fault transitions that affect j are as follows (we include similar fault transitions for k, l, and g):

F1: ¬b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l → b.j := true
F2: b.j → d.j, f.j := 0|1, 0|1

Fault-span. Starting from a state in S_NB1, if no process is Byzantine then a fault transition can cause one process to become Byzantine. Then, faults can change the d and f values of the Byzantine process. Now, if the faults do not cause g to become Byzantine then the set of states reached from S_NB1 is the same as S_NB1. However, if the faults cause g to become Byzantine then the d and f values of the non-general processes may be arbitrary. Nonetheless, the b values of the non-general processes will remain false. Thus, the set of states reached from S_NB1 is (S_NB1 ∪ (b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l)). Starting from S_NB2, no process can become Byzantine. Hence, the d values of the non-general processes will remain unchanged. It follows that the set of states reached from S_NB2 is S_NB2. Finally, since S_NB2 is a subset of (b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l), the set of states reached from S_NB is T_NB, where

T_NB = S_NB1 ∪ (b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l)

Application of our algorithm. First, we compute ms and mt, which are needed by our algorithm. Every fault transition originating at a state in SSf reaches a state in SSf, because such a transition only affects the Byzantine process and the destination state remains in SSf. Since the destination of these fault transitions is in SSf, they violate safety. Thus, the set of states from where faults alone violate safety is equal to SSf, and as a result ms = SSf. Since tsf includes all the transitions that reach SSf (which is equal to ms) or violate safety, mt = tsf. To calculate T'high, we use the HACI function of our high atomicity algorithm. This function removes deadlock states and states from where the closure of T'high is violated by fault transitions.
Since we have removed the states of ms, and no fault transition can reach a state in ms from a state outside ms, there exists no state from where the closure of T'high can be violated by fault transitions. Now, consider a state, say s0, where d.j = 0, d.k = 0, d.l = 1, b.l = false, and f.l = 1. Clearly, s0 is a deadlock state, as no process can execute a safe transition from s0. Hence, such states must be removed while obtaining T'high. Next, consider a state, say s1, where d.j = ⊥, d.k = 0, d.l = 1, b.l = false, and f.l = 1. In state s1, only process j can execute a transition (by copying d.g) without violating safety. However, if j copies the value of the general and d.g = 0, the program reaches a state that was removed earlier. Hence, such states must also be removed while obtaining T'high. Continuing thus, we remove all states where a process in the minority has finalized its decision. In other words, T'high is equal to T_NB − X, where

X = {s : (∃p :: f.p(s) = 1 ∧ (∀q : q ≠ p : d.q(s) ≠ d.p(s)))}

After this step, the function LACI returns S'init = T'high ∩ S_NB. Now, we trace two iterations of the main loop of our algorithm in order to illustrate the way our algorithm works.

1. First iteration. To calculate S2, we search for states in S'init from where we can directly reach a state in T'high − S'init by fault transitions or by program transitions. From S'init, no program transition can reach a state that is outside S'init. However, from a state s where (¬(d.j(s) = d.k(s) = d.l(s)) ∨ (∃p :: d.p(s) = ⊥)), a fault transition can cause the general to become Byzantine, and then the program is outside S'init. Hence, in the first iteration,

S2 = {s : s ∈ (T'high − S'init) : b.g(s) ∧ (¬(d.j(s) = d.k(s) = d.l(s)) ∨ (∃p :: d.p(s) = ⊥)) ∧ (∀p : (d.p(s) ≠ ⊥) ⇒ (d.p(s) = d.g(s)))}

Now, we compute S3. Consider a state, say s0, where d.j = 0, d.k = 0, d.l = 1, b.l = false, and f.l = 0. In s0, l can change d.l to 0 and reach a state in S'init. Hence, such states are included in S3.
Also, consider a state, say s1, where d.j = ⊥, d.k = 1, d.l = 1, and d.g = 1. In s1, process j can copy the value of d.g and take the program to S'_init. Therefore, in this iteration S3 = P1 ∪ P2, where

P1 = {s : s ∈ (T'_high − S'_init) : (∃p : (d.p(s) ≠ ⊥) ∧ (f.p(s) = 0) : (∀q : q ≠ p : (d.q(s) ≠ ⊥) ∧ (d.p(s) ≠ d.q(s))))}

and

P2 = {s : s ∈ (T'_high − S'_init) : (∃p : d.p(s) = ⊥ : (∀q : q ≠ p : (d.q(s) ≠ ⊥) ∧ (d.q(s) = d.g(s))))}

Then, we add the S3 states to S'_init.

Remark. In the case of Byzantine agreement, the only states from where recovery to S'_init can be achieved in a single step are the states of S3 in the first iteration. Every other recovery path includes these states as its final step to S'_init.

2. Second iteration. In the second iteration, S'_low = S'_init ∪ S3 (S3 in the first iteration). To calculate S2 in the second iteration, we search for states in S'_low from where we can directly reach a state in T'_high − S'_low by fault transitions or by program transitions. Thus, we need to calculate the set of states in T'_high − S'_low that are reachable by a fault transition from S'_low. From the first iteration, we already know the set of states reachable from S'_init. Thus, we only need to calculate the states of T'_high − S'_low that are reachable by a fault transition from the recently joined states (i.e., S3 = P1 ∪ P2 of the first iteration). Since in P1 the general process is Byzantine and all non-generals have decided, P1 is closed in fault transitions. However, in a state in P2, since g is Byzantine, faults may change the value of d.g and take the program outside S'_low. In these states, the condition (∃p : d.p = ⊥ : (∀q : q ≠ p : (d.q ≠ ⊥) ∧ (d.q ≠ d.g))) holds.
Therefore, in this iteration, the program can reach the states of S2 by a fault transition, where

S2 = {s : s ∈ (T'_high − S'_low) : b.g(s) ∧ (∃p : d.p = ⊥ : (∀q : q ≠ p : (d.q ≠ ⊥) ∧ (d.q ≠ d.g)))}

To calculate S3, we find states from where recovery is possible to S'_low. Thus, we search for states from where we can reach the states of S3 calculated in the first iteration. Hence, in this iteration, single-step recovery to S'_low is possible from S3, where

S3 = {s : s ∈ (T'_high − S'_low) : (∃p : (d.p(s) = ⊥) : (∀q : q ≠ p : (d.q(s) ≠ ⊥))) ∨ (∃p : (d.p(s) ≠ ⊥) ∧ (d.p(s) = d.g(s)) : (∀q : q ≠ p : d.q(s) = ⊥))}

Continuing thus, we get the masking fault-tolerant Byzantine agreement; this program is the same as that in [26]. The actions of this program are as follows:

MB1: (d.j = ⊥) ∧ (f.j = 0) → d.j := d.g
MB2: (d.j ≠ ⊥) ∧ (f.j = 0) ∧ ((d.j = d.k) ∨ (d.j = d.l)) → f.j := 1
MB3: (d.j = 1) ∧ (d.k = 0) ∧ (d.l = 0) ∧ (f.j = 0) → d.j := 0
MB4: (d.j = 0) ∧ (d.k = 1) ∧ (d.l = 1) ∧ (f.j = 0) → d.j := 1

5.4 Using Monotonicity for the Enhancement of Fault-Tolerance

In this section, we illustrate how we use monotonicity of programs and specifications to enhance the fault-tolerance of nonmasking fault-tolerant distributed programs to masking fault-tolerance in polynomial time (in the state space of the nonmasking program). Towards this end, in Subsection 5.4.1, we present a theorem that identifies the sufficient conditions for enhancing the fault-tolerance of nonmasking programs in polynomial time. Then, in Subsection 5.4.2, we present an example to illustrate the application of the theorem presented in Subsection 5.4.1.

5.4.1 Monotonicity of Nonmasking Programs

In this section, our goal is to identify properties of programs and specifications where enhancing the fault-tolerance of nonmasking fault-tolerant programs can be done in polynomial time. Specifically, we present a theorem that identifies the sufficient conditions for polynomial-time enhancement of the fault-tolerance of nonmasking distributed programs to masking.
As we have shown in Section 4.2, in general, adding failsafe fault-tolerance to a distributed program is NP-complete. Thus, it is expected that the enhancement problem is also NP-complete. Hence, we focus on the following question:

Given is a nonmasking program p, its specification spec, its invariant S, a class of faults f, and its fault-span T: under what conditions can one derive a masking fault-tolerant program p' from the nonmasking fault-tolerant program p in polynomial time?

To address the above question, we sketch a simple scenario where we can easily derive a masking fault-tolerant program from p. Specifically, we investigate the case where we only remove groups of transitions of p that include safety-violating transitions, and the remaining groups of transitions constitute the set of transitions of the masking fault-tolerant program p'. However, removing a group of transitions may result in creating states with no outgoing transitions (i.e., deadlock states) in the fault-span T or the invariant S. In order to resolve deadlock states, we need to add recovery transitions; in turn, adding recovery transitions may create non-progress cycles in (T − S). When we remove a non-progress cycle, we may create more deadlock states. This way, removing a group of safety-violating transitions may lead us into a cycle of complex actions of adding and removing (groups of) transitions.

To address the above problem, we require the set of transitions of p to be structured in such a way that removing safety-violating transitions (and their associated groups of transitions) does not create deadlock states. Towards this end, we define potentially safe nonmasking programs as follows:

Definition. A nonmasking program p with the invariant S and the specification spec is potentially safe iff the following condition is satisfied.
∀s0, s1 :: (((s0, s1) ∉ p|S) ∧ ((s0, s1) violates spec)) ⇒ (∃s2 :: ((s0, s2) ∈ p) ∧ ((s0, s2) does not violate spec)) □

Moreover, we require that the removal of a safety-violating transition and its associated group of transitions does not remove good transitions that are useful for the purpose of recovery. Thus, if a transition violates the safety of spec then we require that no good transition exists in its associated group of transitions. To address this issue (i.e., to ensure that safety-violating transitions are not grouped with good transitions), we use the monotonicity property to define independent programs and specifications as follows.

Definition. A nonmasking program p is independent of a Boolean variable x on a predicate Y iff p is both positive and negative monotonic on Y with respect to x. □

Intuitively, the above definition captures that if there exists a transition (s0, s1) ∈ p|Y and (s0, s1) belongs to a group of transitions g that is created due to the inability of reading x, then for all transitions (s0', s1') ∈ g we will have (s0', s1') ∈ p|Y, regardless of the value of the variable x in s0' and s1'. Likewise, we define the notion of independence for specifications as follows:

Definition. A specification spec is independent of a Boolean variable x on a predicate Y iff spec is both positive and negative monotonic on Y with respect to x. □

Based on the above definition, if a transition (s0, s1) belongs to a group of transitions g that is created due to the inability of reading x, and (s0, s1) does not violate safety, then no transition (s0', s1') ∈ g will violate safety, regardless of the value of the variable x in s0' and s1'. Now, using the above definitions, we present the following theorem.
Theorem 5.22 Given is a nonmasking fault-tolerant program p, its invariant S, its fault-span T, faults f, and an f-safe specification spec:

If p is potentially safe, and
∀Pj, x : Pj is a process in p, x is a Boolean variable such that Pj cannot read x :
  spec is independent of x on T ∧
  the program consisting of the transitions of Pj is independent of x on S
Then a masking fault-tolerant program p' can be derived from p in polynomial time.

Proof. Let (s0, s1) be a transition of process Pj. We consider the two cases where (s0, s1) ∈ (p|S) or (s0, s1) ∉ (p|S).

1. Let (s0, s1) ∈ (p|S) and let x be a variable that Pj cannot read. Since we consider programs where a process cannot blindly write a variable, it follows that x(s0) equals x(s1). Now, we consider the transition (s0', s1') where s0' (respectively, s1') is identical to s0 (respectively, s1) except for the value of x. Since p is independent of x on S, for every value of x(s0) we will have (s0', s1') ∈ (p|S). Thus, we include the group associated with (s0, s1) in the set of transitions of p'.

2. Let (s0, s1) ∉ (p|S). Again, due to the inability of Pj to read x, we consider the transition (s0', s1') where s0' (respectively, s1') is identical to s0 (respectively, s1) except for the value of x. By the definition of spec independence, if (s0, s1) violates spec then, regardless of the value of x, every transition (s0', s1') in the group associated with (s0, s1) violates spec; as a result, we exclude this group of transitions from the set of transitions of p'.

p' satisfies spec from S. Now, let p' be the program that consists of the transitions remaining in p|T after excluding some groups of transitions. Since p'|S equals p|S and p satisfies spec from S, it follows that p' satisfies spec from S in the absence of f.

Every computation prefix of p'[]f that starts in T maintains spec.
Since we have removed the safety-violating transitions in p|T, when f perturbs p within T, every computation prefix of p'[]f maintains the safety of the specification.

Every computation of p'[]f that starts in T has a state in S. When we remove a safety-violating transition (s0, s1) ∈ p|T, we actually remove all transitions (s0', s1'), where s0' (respectively, s1') is identical to s0 (respectively, s1) except for the value of x. Note that since spec is independent of x, all transitions (s0', s1') that are grouped with (s0, s1) violate the safety of spec if (s0, s1) violates the safety of spec. Now, since p is potentially safe, by definition, for every removed transition (s0, s1) (respectively, (s0', s1')) there exists a safe transition (s0, s2) (respectively, (s0', s2')) that guarantees s0 (respectively, s0') has at least one outgoing transition (i.e., s0 (respectively, s0') is not a deadlock state). Thus, if we remove the safety-violating transitions then we will not create any deadlock state in T. It follows that the recovery from T − S to S, provided by the nonmasking program p, is preserved. Also, we have shown that p' satisfies spec from S and every computation prefix of p'[]f maintains spec. Therefore, p' is masking f-tolerant to spec from S. □

5.4.2 Example: Distributed Counter

In this section, we present an example of enhancing the fault-tolerance of nonmasking distributed programs to masking using the monotonicity property. Towards this end, we first introduce the nonmasking program, its invariant, its safety specification, and the faults that perturb the program. Then, we synthesize the masking fault-tolerant program using Theorem 5.22.

Nonmasking program. The nonmasking program p represents an even counter. The program p consists of two processes, namely P0 and P1, where P0 is responsible for resetting the least significant bit (denoted x0) whenever it is not equal to zero, and P1 is responsible for toggling the value of the most significant bit (denoted x1), continuously.
Process P0 can only read/write x0; P1 is able to read x0 and x1, and P1 can only write x1. The only action of P0 is as follows:

P0 : x0 ≠ 0 → x0 := 0

The following two actions represent the transitions of P1:

(x1 = 1) ∧ (x0 = 0) → x1 := 0
(x1 = 0) → x1 := 1

For simplicity, we represent a state of the program by a tuple (x1, x0).

Invariant. Since the program simulates an even counter, we represent the invariant of the program by the state predicate S_ctr ≡ (x0 = 0).

Faults. Fault transitions perturb the value of x0 and arbitrarily change its value from 0 to 1 and vice versa. The following action represents the fault transitions:

true → x0 := 0 | 1

Fault-span. The entire state space is the fault-span for faults that perturb x0. Thus, we represent the fault-span of the program by the state predicate T_ctr ≡ true.

Safety specification. Intuitively, the safety specification specifies that whenever faults perturb the counter, the counting operation should stop until the program returns to its invariant. In other words, the counter must not count from an odd value to another odd value. We identify the safety of the specification spec_ctr by the following set of transitions that the program is not allowed to execute:

spec_ctr = {(s0, s1) : (x0(s0) = 1) ∧ (x0(s1) = 1) ∧ (x1(s1) ≠ x1(s0))}

Observe that p is potentially safe and spec_ctr is f-safe.

The nonmasking program p is independent of x1 on S_ctr. For two arbitrary transitions of P0, say (s0, s1) and (s0', s1'), that are grouped due to the inability of P0 to read x1, we show that the nonmasking program is independent of x1 on S_ctr. Towards this end, we first show that p is negative monotonic on S_ctr with respect to x1, and then we show that p is positive monotonic on S_ctr with respect to x1.

1. Negative monotonicity of p on S_ctr with respect to x1. Consider (s0, s1), where (x1(s0) = 1) and (x1(s1) = 1). Since there is no transition (s0, s1) in p|S where (x1(s0) = 1) and (x1(s1) = 1), p is negative monotonic on S_ctr with respect to x1.
2. Positive monotonicity of p on S_ctr with respect to x1. Consider (s0, s1), where (x1(s0) = 0) and (x1(s1) = 0). Since there is no transition (s0, s1) in p|S where (x1(s0) = 0) and (x1(s1) = 0), p is positive monotonic on S_ctr with respect to x1.

As a result of the above argument, p is independent of x1 on S_ctr.

Now, we show that spec_ctr is independent of x1 on the fault-span T_ctr. For a given transition (s0, s1) of process P0, we let (x0(s0) = 1) and (x0(s1) = 0). Since P0 cannot read x1, the transition (s0, s1) is grouped with a transition (s0', s1'), where the value of x1 remains unchanged in (s0', s1'). Now, using the definition of specification monotonicity, we show that spec_ctr is independent of x1 on T_ctr. Given two arbitrary transitions of P0, say (s0, s1) and (s0', s1'), that are grouped due to the inability of P0 to read x1, we show that the specification is both negative and positive monotonic on T_ctr with respect to x1.

1. Positive monotonicity of spec_ctr. Consider (s0, s1), where (x1(s0) = 0) and (x1(s1) = 0), and (s0, s1) does not violate safety. If (x1(s0') = 1) and (x1(s1') = 1) then (s0', s1') will not violate safety (because the value of x1 does not change during this transition). Since we have chosen (s0, s1) and (s0', s1') arbitrarily, the specification is positive monotonic on T_ctr with respect to x1.

2. Negative monotonicity of spec_ctr. A similar argument shows that the specification is negative monotonic on T_ctr with respect to x1.

Based on the above discussion, the specification is independent of x1 on T_ctr.

Masking fault-tolerant program. The nonmasking program presented in this section is potentially safe. Also, process P0 is independent of x1 on S_ctr. Moreover, the specification spec_ctr is f-safe and is independent of x1 on T_ctr. Therefore, using Theorem 5.22, we can derive a masking fault-tolerant version of p in polynomial time. In the synthesis of the masking program, we remove the transition from (0,1) to (1,1).
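That this is the only transition that must be removed can be confirmed by brute force over the four states of the counter. A small Python sketch (states encoded as (x1, x0) pairs, following the text; the helper names are ours):

```python
# Enumerate the nonmasking counter's transitions and flag those in spec_ctr.

def transitions():
    ts = []
    for x1 in (0, 1):
        for x0 in (0, 1):
            if x0 != 0:                  # P0: x0 != 0 -> x0 := 0
                ts.append(((x1, x0), (x1, 0)))
            if x1 == 1 and x0 == 0:      # P1: toggle x1 down
                ts.append(((x1, x0), (0, x0)))
            if x1 == 0:                  # P1 (nonmasking): toggle x1 up
                ts.append(((x1, x0), (1, x0)))
    return ts

def violates_spec(s0, s1):
    # spec_ctr: never count from an odd value to another odd value
    return s0[1] == 1 and s1[1] == 1 and s0[0] != s1[0]

bad = [t for t in transitions() if violates_spec(*t)]
# bad contains exactly the transition ((0, 1), (1, 1))
```

Removing that single transition yields the masking program given next, so the brute-force check agrees with the derivation by Theorem 5.22.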
The action of P0 remains as is, and the actions of P1 are as follows:

(x1 = 1) ∧ (x0 = 0) → x1 := 0
(x1 = 0) ∧ (x0 = 0) → x1 := 1

5.5 Enhancement versus Addition

In this section, we compare the complexity of enhancement with that of adding masking fault-tolerance. Specifically, we first discuss enhancement in high atomicity with respect to the Add_Masking algorithm presented in Subsection 2.7.3. Subsequently, we compare the complexity of these two algorithms for distributed programs (i.e., the low atomicity model).

Complexity of enhancement versus addition in high atomicity. Since Add_Masking tries to add both safety and recovery simultaneously, it is more complex than the High_Atomicity_Enhancement algorithm presented in this chapter. More specifically, the asymptotic complexity of High_Atomicity_Enhancement is less than that of Add_Masking. Thus, if the state space of the problem at hand prevents the addition of masking fault-tolerance to a fault-intolerant program, it may be possible to partially automate the design of a masking fault-tolerant program by manually designing a nonmasking fault-tolerant program and enhancing its fault-tolerance to masking using automated techniques.

The algorithm High_Atomicity_Enhancement adds safety to a nonmasking fault-tolerant program while ensuring that the recovery provided by it continues to be satisfied. We note that the asymptotic complexity of High_Atomicity_Enhancement is the same as the complexity of adding failsafe fault-tolerance to a fault-intolerant program. In other words, in High_Atomicity_Enhancement, the recovery is preserved for free!

Complexity of enhancement versus addition in low atomicity. We compare the cost of adding masking fault-tolerance to a fault-intolerant distributed program and the cost of enhancing the fault-tolerance of a nonmasking fault-tolerant distributed program to masking.
Asymptotically speaking, adding masking (respectively, failsafe) fault-tolerance to a fault-intolerant distributed program is NP-complete [1, 31]. Therefore, it is expected that the enhancement problem (which adds safety while preserving recovery) for distributed programs will also be NP-complete.

Although the enhancement problem may not provide relief in terms of worst-case complexity, we find that it helps in developing heuristics that determine if safe recovery is possible from states that are reached in the presence of faults. More specifically, consider a state, say s, that is reached in a computation of the fault-intolerant program in the presence of faults. While adding masking fault-tolerance to a fault-intolerant program, we need to exhaustively search all possible transition sequences from s to determine if recovery from s is possible. By contrast, while enhancing the fault-tolerance of a nonmasking fault-tolerant program, we reuse the recovery provided by the nonmasking fault-tolerant program. Hence, we need to check only the transition sequences that the nonmasking fault-tolerant program can produce. It follows that deriving heuristics that determine if safe recovery is possible from a given state is simpler in the enhancement problem.

The enhancement problem also allows us to deduce additional information about states by reasoning in the high atomicity model. We illustrate this with an example that occurs in Byzantine agreement. Consider a state s0 where all processes are non-Byzantine, d.j = d.k = ⊥, d.g = 1, d.l = 1, and f.l = 0. Let s1 be a state that is identical to s0 except that the value of f.l in s1 is 1. Now, consider the transition (s0, s1). Note that both s0 and s1 are in the invariant S_NB. Hence, for a synthesis algorithm, this appears as a good transition that should be retained in the fault-tolerant program. However, from s1, if g becomes Byzantine and changes d.g, we can reach a state where d.g, d.j, and d.k become 0.
The resulting state is a deadlock state. While adding masking fault-tolerance to a fault-intolerant program, it is difficult to check that all computations that (1) start from s1, (2) in which g becomes Byzantine, and (3) in which g changes d.g to 0 lead to deadlock states. Moreover, if we ignore the grouping restrictions imposed by the low atomicity model, i.e., if we could read and write all variables in one atomic step, then recovery would be possible from s1. However, in the context of the enhancement problem, we concluded that even in the high atomicity model we could not recover from state s1 by reusing the transitions of the nonmasking fault-tolerant program.

We expect that such high atomicity reasoning will play an important role in reducing complexity in the enhancement problem. To reduce the complexity of adding fault-tolerance in the low atomicity model, it is desirable to reason about the input program in the high atomicity model, obtain a high atomicity masking fault-tolerant program, and modify that high atomicity masking fault-tolerant program so that the restrictions of the low atomicity model are satisfied while preserving the masking fault-tolerance. As the Byzantine agreement example illustrates, this approach can be followed while enhancing the fault-tolerance of a nonmasking fault-tolerant program. However, this approach could not be used while adding masking fault-tolerance to a fault-intolerant program.

5.6 Summary

In this chapter, we defined the problem of enhancing the fault-tolerance level of a nonmasking program to masking. This problem separates (1) the task of adding recovery, and (2) the task of maintaining the safety specification during recovery. For the high atomicity model, we presented a sound and complete algorithm for the enhancement problem. We showed that the complexity of our high atomicity algorithm is asymptotically less than that of the Add_Masking algorithm (cf. Subsection 2.7.3).
For distributed programs, we presented a sound algorithm for the enhancement problem. We also showed that our fault-tolerance enhancement algorithm for distributed programs resolves some of the difficulties encountered in adding safe recovery transitions in [14].

As an illustration of our algorithms, we showed how masking fault-tolerant programs for TMR (in the high atomicity model) and Byzantine agreement (for distributed programs) can be designed by enhancing the fault-tolerance of the corresponding nonmasking programs. We chose these examples as masking fault-tolerant versions of these programs have been manually designed from the corresponding nonmasking fault-tolerant versions [32]. The results in this chapter show that those enhancements can in fact be automated as well.

Also, we argued that enhancing the fault-tolerance of a distributed program is simpler than adding masking fault-tolerance to its fault-intolerant version. We validated this result by comparing the derivation of a masking fault-tolerant Byzantine agreement program from the corresponding fault-intolerant version and from the corresponding nonmasking version.

Moreover, we have used the monotonicity property (presented in Section 4.3) to identify sufficient conditions under which the enhancement of fault-tolerance can be done in polynomial time. Specifically, we presented a sufficiency theorem and we enhanced the fault-tolerance of a distributed counter to masking fault-tolerance using our sufficiency theorem.

Chapter 6

Pre-Synthesized Fault-Tolerance Components

In this chapter, we present a synthesis approach that adds pre-synthesized fault-tolerance components to a given fault-intolerant program in the synthesis of its fault-tolerant version. Techniques presented in [14] and Chapters 4 and 5 respectively reduce the complexity of synthesis by using heuristics and by identifying classes of programs and specifications for which efficient synthesis is possible.
However, these techniques cannot apply the lessons learnt in synthesizing one fault-tolerant program while synthesizing another fault-tolerant program. The synthesis method presented in this chapter allows us to recognize the patterns that we often apply in the synthesis of fault-tolerant distributed programs. Then, we organize those patterns in terms of fault-tolerance components and reuse them in the synthesis of new problems.

To investigate the use of pre-synthesized fault-tolerance components in the synthesis of fault-tolerant programs from their fault-intolerant version, we use the detectors and correctors identified in [33, 10]. Specifically, in [33, 10], Arora and Kulkarni have shown that detectors and correctors suffice in the manual design of a rich class of fault-tolerant programs. Hence, we expect to benefit from the generality of such components in the automated synthesis of fault-tolerant programs. Thus, in this chapter, we present a synthesis approach that adds pre-synthesized detectors and correctors to a given fault-intolerant program in order to synthesize its fault-tolerant version. In particular, we focus on adding masking fault-tolerance, where we address issues regarding the representation, the specification, and the addition of pre-synthesized fault-tolerance components. In general, our synthesis method is applicable for adding failsafe and nonmasking fault-tolerance as well. As a running example, we synthesize a token ring program that consists of 4 processes and is subject to process-restart faults. The masking fault-tolerant (token ring) program can recover even from the situation where every process is corrupted. We note that the previous approaches that added fault-tolerance to the token ring program presented in this chapter assumed that at least one process is not corrupted.

We proceed as follows: in Section 6.1, we formally state the problem of adding fault-tolerance components to fault-intolerant programs.
Then, in Section 6.2, we present a synthesis method that identifies when and how the synthesis algorithm decides to add a component. Subsequently, in Section 6.3, we formally describe how we represent a fault-tolerance component. In Section 6.4, we show how we automatically specify a component and add it to a program. In Section 6.5, we show how we reuse a linear pre-synthesized component in the synthesis of an alternating bit protocol. Afterwards, in Section 6.6, we apply our synthesis method for adding nonmasking fault-tolerance to a diffusing computation program with a tree-like structure, where we show that our synthesis method is applicable for programs with hierarchical topologies. In Section 6.7, we address some of the questions raised by the synthesis method presented in this chapter. Finally, we summarize our discussion in Section 6.8.

6.1 Problem Statement

In this section, we formally define the problem of adding fault-tolerance components to a fault-intolerant program. We identify the conditions of the addition problem by which we can verify the correctness of the synthesized fault-tolerant program after adding fault-tolerance components.

Given a fault-intolerant program p, its state space S_p, its invariant S, its specification spec, and a class of faults f, we add pre-synthesized fault-tolerance components (i.e., detectors and correctors) to p in order to synthesize a fault-tolerant program p' with the new invariant S'. When we add a fault-tolerance component to p, we also add the variables associated with that component. As a result, we expand the state space of p. The new state space, say S_p', is actually the state space of the synthesized fault-tolerant program p'. After the addition, we require the fault-tolerant program p' to behave similar to p in the absence of faults f. In the presence of faults f, p' should satisfy masking fault-tolerance.
To ensure the correctness of the synthesized fault-tolerant program in the new state space, we need to identify the conditions that have to be met by the synthesized program p'. Towards this end, we define a projection from S_p' to S_p using an onto function H : S_p' → S_p. We apply H on states, state predicates, transitions, and groups of transitions in S_p' to identify their corresponding entities in S_p. Let the invariant of the synthesized program be S' ⊆ S_p'. If there exists a state s0' ∈ S' where H(s0') ∉ S then in the absence of faults p' can start at s0', whose image H(s0') is outside S. As a result, in the absence of faults, p' will include computations in the new state space S_p' that do not have corresponding computations in p. These new computations amount to new behaviors in the absence of faults, which is not desirable. Therefore, we require that H(S') ⊆ S. Also, if p' contains a transition (s0', s1') in p'|S' that does not have a corresponding transition (s0, s1) in p|H(S') (where H(s0') = s0 and H(s1') = s1) then p' can take this transition and create a new way of satisfying spec in the absence of faults. Therefore, we require that H(p'|S') ⊆ p|H(S').

Now, we present the problem of adding fault-tolerance components to p.

The Addition Problem. Given p, S, spec, and f, with state space S_p, such that p satisfies spec from S,
S_p' is the new state space due to adding fault-tolerance components to p,
H : S_p' → S_p is an onto function;
Identify p' and S' ⊆ S_p' such that
H(S') ⊆ S,
H(p'|S') ⊆ p|H(S'), and
p' is masking f-tolerant for spec from S'. □

6.2 The Synthesis Method

In this section, we present a synthesis method to solve the addition problem of Section 6.1. In Section 6.2.1, we present a high-level description of our synthesis method and express our approach for combining heuristics from [14] (cf. Section 6.2.2 for an example heuristic) with pre-synthesized components.
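Before turning to the method itself, note that for finite transition sets the two projection conditions of the Addition Problem can be checked mechanically. A minimal Python sketch (states are hashable values, H is any function from the new state space to the old one; all names here are illustrative, not part of the dissertation's tool):

```python
# Check H(S') ⊆ S and H(p'|S') ⊆ p|H(S') for explicit finite transition sets.

def restrict(p, X):
    """p|X: transitions of p whose source and destination both lie in X."""
    return {(s0, s1) for (s0, s1) in p if s0 in X and s1 in X}

def check_addition_conditions(H, S, p, S_new, p_new):
    HS = {H(s) for s in S_new}            # H(S')
    if not HS <= S:                       # first condition: H(S') ⊆ S
        return False
    image = {(H(s0), H(s1)) for (s0, s1) in restrict(p_new, S_new)}
    return image <= restrict(p, HS)       # second: H(p'|S') ⊆ p|H(S')
```

For instance, if adding a component variable turns each old state s into pairs (s, v) and H((s, v)) = s, the conditions say that on its invariant the enriched program projects onto a sub-behavior of p.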
Then, in Section 6.2.2, we illustrate our synthesis method using a simple example, a token ring program with 4 processes. We use the token ring program as a running example in the rest of the chapter, where we synthesize a token ring program that is masking fault-tolerant to process-restart faults.

6.2.1 Overview of Synthesis Method

Our synthesis method takes as its input a fault-intolerant program p with a set of processes P0 ... Pn (n > 1), its specification spec, its invariant S, a set of read/write restrictions r0 ... rn and w0 ... wn, and a class of faults f to which we intend to add fault-tolerance. The synthesis method outputs a fault-tolerant program p' and its invariant S'.

The heuristics in [14] (i) add safety to ensure that the masking fault-tolerant program never violates its safety specification, and (ii) add recovery to ensure that the masking fault-tolerant program never deadlocks (respectively, livelocks). Moreover, while adding a recovery transition, it is necessary to ensure that all the transitions grouped with that recovery transition are safe, unless it can be guaranteed (with the help of heuristics) that those transitions cannot be executed. Thus, adding recovery transitions from deadlock states is one of the important issues in adding fault-tolerance. Hence, the method presented in this chapter focuses on adding pre-synthesized components for resolving deadlock states.

Now, in order to resolve a deadlock state, say s_d, using our hybrid approach, we proceed as follows: first, for each process Pi in the given fault-intolerant program, we introduce a high atomicity pseudo process PSi. Initially, PSi has no action to execute; however, we allow PSi to read all program variables and write only those variables that Pi can write. Using these special processes, we now present the ResolveDeadlock routine (cf. Figure 6.1) that is the core of our synthesis method. The input of ResolveDeadlock consists of the deadlock state that needs to be resolved, s_d, and the set of high atomicity pseudo processes PSi (0 ≤ i ≤ n).

ResolveDeadlock(s_d: state; PS0, ..., PSn: high atomicity pseudo process)
Step 1. If Add_Recovery(s_d) then return true.
Step 2. Else non-deterministically choose a PS_index, where 0 ≤ index ≤ n and PS_index adds a high atomicity recovery action grd → st.
Step 3. If (there exists such a PS_index) and (there exists a detector d in the component library that suffices to refine grd → st without interfering with the program)
        then add d to the program, and return true;
        else return false. // Subsequently, we remove some transitions to make s_d unreachable.

Figure 6.1: Overview of the synthesis method.

First, in Step 1, we invoke a heuristic-based routine Add_Recovery to add recovery from s_d under the distribution restrictions (i.e., in the low atomicity model), where
The input of ResolveDeadlock consists of the deadlock state that needs to be resolved, 3d, and the set of high atomicity pseudo processes PS,- (0 S i S n). ResolveDeadlock(sd: state, PSo, - - ~ , PS7,: high atomicity pseudo process) Step 1. If Add_Recovery (3d) then return true. Step 2. Else non-deterministically choose a PSz-ndex, where 0 3 index 3 n and PSindex adds a high atomicity recovery action grd ——* st Step 3. If (there exists a PSmdel.) and (there exists a detector (1 in the component library that suffices to refine grd -+ st without interfering with the program) then add d to the program, and return true. else return false. // Subsequently, we remove some transitions to make 3,) unreachable. Figure 6.1: Overview of the synthesis method. First, in Step 1, we invoke a heuristic-based routine Add_Recovery to add recovery from 3,, under the distribution restrictions (i.e., in the low atomicity model) — where 96 program processes have read / write restrictions with respect to the program variables. Add.Recovery explores the ability of each process P,- to add recovery transition from 3,; under the distribution restrictions. If Add_Recovery fails then we will choose to add a fault-tolerance component in Steps 2 and 3. In Steps 2 and 3, we identify a fault-tolerance component and then add it to p in order to resolve 3d. To add a fault-tolerance component, the synthesis algorithm should (i) specify the required component; (ii) retrieve the specified component from a given library of components; (iii) ensure the interference freedom of the composition of the component and the program, and finally (iv) add the extracted component to the program. As a result, adding a pre—synthesized component is a costly opera- tion. Hence, we prefer to add a component during the synthesis only when available heuristics for adding recovery fail in Step 1. 
To identify the required fault-tolerance components, we use the pseudo processes PS_i that can read all program variables and write w_i (i.e., the set of variables that P_i can write). In other words, we check the ability of each PS_i to add high atomicity recovery (where we have no read restrictions) from s_d. If no PS_i can add recovery from s_d then our algorithm fails to resolve s_d. If there exist one or more pseudo processes that add recovery from s_d then we non-deterministically choose a process PS_index with high atomicity action ac : grd → st. Since we give PS_index the permission to read all program variables for adding recovery from s_d, the guard grd is a global state predicate that we need to refine. If there exists a detector that can refine grd without interfering with the program execution then we add that detector to the program. (We discuss how to specify the required detector d and how to add d to the fault-intolerant program in Sections 6.3 and 6.4.)

In cases where ResolveDeadlock returns false, we remove some transitions to make s_d unreachable. If we fail to make s_d unreachable then we declare failure in the synthesis of the masking fault-tolerant program p'. Observe that by using pre-synthesized components, we increase the chance of adding recovery from s_d, and as a result, we reduce the chance of reaching a point where we declare failure to synthesize a fault-tolerant program.

6.2.2 Token Ring Example

In this subsection, we introduce a token ring program with 4 processes that is subject to process-restart faults. Using our synthesis method (cf. Figure 6.1), we synthesize a token ring program that is masking fault-tolerant for the case where all processes are corrupted.

The token ring program. The fault-intolerant program consists of four processes P_0, P_1, P_2, and P_3 arranged in a ring. Each process P_i has a variable x_i (0 ≤ i ≤ 3) with the domain {⊥, 0, 1}.
Due to distribution restrictions, process P_i can read x_i and x_{i-1} and can only write x_i (1 ≤ i ≤ 3). P_0 can read x_0 and x_3 and can only write x_0. We say a process P_i (1 ≤ i ≤ 3) has the token iff x_i ≠ x_{i-1} and fault transitions have not corrupted P_i and P_{i-1}. And P_0 has the token iff x_3 = x_0 and fault transitions have not corrupted P_0 and P_3. A process P_i (1 ≤ i ≤ 3) copies x_{i-1} to x_i if the value of x_i is different from x_{i-1}. Also, if x_0 = x_3 then process P_0 copies the value of (x_3 ⊕ 1) to x_0, where ⊕ denotes addition modulo 2. This way, a process passes the token to the next process.

We represent a state s of the token ring program by a 4-tuple (x_0, x_1, x_2, x_3). Each element of the 4-tuple represents the value of x_i in s (0 ≤ i ≤ 3). Thus, if we start from the initial state (0, 0, 0, 0) then process P_0 has the token and the token circulates along the ring. We represent the transitions of the fault-intolerant program TR by the following actions (1 ≤ i ≤ 3).

TR_0:  (x_0 = 1) ∧ (x_3 = 1)  →  x_0 := 0;
TR_0': (x_0 = 0) ∧ (x_3 = 0)  →  x_0 := 1;
TR_i:  (x_i = 0) ∧ (x_{i-1} = 1)  →  x_i := 1;
TR_i': (x_i = 1) ∧ (x_{i-1} = 0)  →  x_i := 0;

Faults. Faults can restart a process P_i. Thus, the value of x_i becomes unknown. Hence, we model faults by setting the value of x_i to an unknown value ⊥.

Specification. The problem specification requires that the corrupted value of one process does not affect a non-corrupted process, and that there is only one process that has the token.

Invariant. The invariant of the above program includes the states (0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0), (1,1,1,1), (0,1,1,1), (0,0,1,1), and (0,0,0,1).

A heuristic for adding recovery. In the presence of faults, the program TR may reach states where there exists at least one process P_i (0 ≤ i ≤ 3) whose x_i is corrupted (i.e., x_i = ⊥).
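To make the behavior of TR tangible, the following Python sketch simulates its guarded commands. This is an illustration only; the function names are ours, and the unknown value ⊥ is encoded as the string "bot".

```python
BOT = "bot"   # encodes the unknown value ⊥

def enabled_actions(s):
    """Return the (process, next_state) pairs of TR enabled in state s."""
    x = list(s)
    moves = []
    # TR_0 / TR_0': P0 flips x0 when x0 = x3 (the token is at P0).
    if x[0] == 1 and x[3] == 1:
        moves.append((0, tuple([0] + x[1:])))
    if x[0] == 0 and x[3] == 0:
        moves.append((0, tuple([1] + x[1:])))
    # TR_i / TR_i': Pi copies x(i-1) when the two values differ.
    for i in (1, 2, 3):
        if x[i] == 0 and x[i - 1] == 1:
            t = x[:]; t[i] = 1; moves.append((i, tuple(t)))
        if x[i] == 1 and x[i - 1] == 0:
            t = x[:]; t[i] = 0; moves.append((i, tuple(t)))
    return moves

def run(s, steps):
    """Run TR; inside the invariant exactly one action is enabled per state."""
    trace = [s]
    for _ in range(steps):
        acts = enabled_actions(s)
        if not acts:          # deadlock: only reachable via faults
            break
        _, s = acts[0]
        trace.append(s)
    return trace
```

Starting from (0, 0, 0, 0), exactly one action is enabled in each invariant state, so the token circulates deterministically around the ring; once a fault sets some x_i to ⊥, as in (0, ⊥, 1, 1), no guard holds and the simulation exhibits the deadlock discussed next.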
In such cases, processes P_i and P_{(i+1) mod 4} cannot take any transition, and as a result, the propagation of the token stops (i.e., the whole program deadlocks). In order to recover from the states where there exist some corrupted processes, we apply the heuristic for single-step recovery from [14] in an iterative fashion. Specifically, we identify states from where single-step recovery to a set of states RecoverySet is possible. The initial value of RecoverySet is equal to the program invariant. At each iteration, we include in RecoverySet a set of states from where single-step recovery to RecoverySet is possible.

In the first iteration, we search for deadlock states where there is only one corrupted process in the ring. For example, consider a state s_1 = (1, ⊥, 1, 0). In state s_1, P_1 and P_2 cannot take any transitions. However, P_3 can copy the value of x_2 and reach s_2 = (1, ⊥, 1, 1). Subsequently, P_0 changes x_0 to 0, and as a result, the program reaches state s_3 = (0, ⊥, 1, 1). The state s_3 is a deadlock state since no process can take any transition at s_3. To add recovery from s_3, we allow P_1 to correct itself by copying the value of x_0, which is equal to 0. Thus, by copying the value of x_0, P_1 adds a recovery transition to the invariant state (0, 0, 1, 1). Therefore, we include s_3 in the set RecoverySet in the first iteration. Note that this recovery transition is added in low atomicity in that all the transitions grouped with the action (x_0 = 0) ∧ (x_1 = ⊥) → x_1 := 0 can be included in the fault-tolerant program without violating safety.

In the second and third iterations, we follow the same approach and add recovery from states where there are two or three corrupted processes to states that we have already resolved in the previous iterations. Adding recovery up to the fourth iteration of our heuristic results in the intermediate program ITR (1 ≤ i ≤ 3).
ITR_0:  ((x_0 = 1) ∨ (x_0 = ⊥)) ∧ (x_3 = 1)  →  x_0 := 0;
ITR_0': ((x_0 = 0) ∨ (x_0 = ⊥)) ∧ (x_3 = 0)  →  x_0 := 1;
ITR_i:  ((x_i = 0) ∨ (x_i = ⊥)) ∧ (x_{i-1} = 1)  →  x_i := 1;
ITR_i': ((x_i = 1) ∨ (x_i = ⊥)) ∧ (x_{i-1} = 0)  →  x_i := 0;

Using the above heuristic, we can only add recovery from the states where there exists at least one uncorrupted process. If there exists at least one uncorrupted process P_j (0 ≤ j ≤ 3) then P_{(j+1) mod 4} will initiate the token circulation throughout the ring, and as a result, the program recovers to its invariant. However, in the fourth iteration of the above heuristic, we reach a point where we need to add recovery from the state where all processes are corrupted; i.e., we reach the program state s_d = (⊥, ⊥, ⊥, ⊥). In such a state, the program ITR deadlocks, as an action of the form (x_0 = ⊥) ∧ (x_1 = ⊥) → x_1 := 0 cannot be included in the fault-tolerant program. Such an action can violate safety if x_2 and x_3 are not corrupted. In fact, no process can add safe recovery from s_d in low atomicity. Thus, Add_Recovery returns false for (⊥, ⊥, ⊥, ⊥).

Adding the actions of the high atomicity pseudo process. In order to add masking fault-tolerance to the program ITR, a process P_index (0 ≤ index ≤ 3) should set its x value to 0 (respectively, 1) when all processes are corrupted. Hence, we follow our synthesis method (cf. Figure 6.1), where the pseudo process PS_0 takes the high atomicity action HTR and recovers from s_d. Thus, the actions of the masking program MTR are as follows (1 ≤ i ≤ 3).

MTR_0:  ((x_0 = 1) ∨ (x_0 = ⊥)) ∧ (x_3 = 1)  →  x_0 := 0;
MTR_0': ((x_0 = 0) ∨ (x_0 = ⊥)) ∧ (x_3 = 0)  →  x_0 := 1;
MTR_i:  ((x_i = 0) ∨ (x_i = ⊥)) ∧ (x_{i-1} = 1)  →  x_i := 1;
MTR_i': ((x_i = 1) ∨ (x_i = ⊥)) ∧ (x_{i-1} = 0)  →  x_i := 0;
HTR:    (x_0 = ⊥) ∧ (x_1 = ⊥) ∧ (x_2 = ⊥) ∧ (x_3 = ⊥)  →  x_0 := 0;

In order to refine the high atomicity action HTR, we need to add a detector that detects the state predicate (x_0 = ⊥) ∧ (x_1 = ⊥) ∧ (x_2 = ⊥) ∧ (x_3 = ⊥).
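The effect of adding HTR can be checked with a small simulation. This is again an illustrative sketch with our own function names; ⊥ is encoded as the string "bot". With htr=False the program behaves like ITR and deadlocks in the all-corrupted state; with the high atomicity action enabled, it recovers to the invariant.

```python
BOT = "bot"   # encodes the unknown value ⊥

def mtr_step(s, htr=True):
    """Apply the first enabled MTR action to state s; None means deadlock."""
    x = list(s)
    if x[0] in (1, BOT) and x[3] == 1:          # MTR_0
        x[0] = 0; return tuple(x)
    if x[0] in (0, BOT) and x[3] == 0:          # MTR_0'
        x[0] = 1; return tuple(x)
    for i in (1, 2, 3):
        if x[i] in (0, BOT) and x[i - 1] == 1:  # MTR_i
            x[i] = 1; return tuple(x)
        if x[i] in (1, BOT) and x[i - 1] == 0:  # MTR_i'
            x[i] = 0; return tuple(x)
    if htr and all(v == BOT for v in x):        # HTR (high atomicity)
        x[0] = 0; return tuple(x)
    return None

def recovers(s, invariant, bound=20):
    """True iff the simulation reaches the invariant within `bound` steps."""
    for _ in range(bound):
        if s in invariant:
            return True
        s = mtr_step(s)
        if s is None:
            return False
    return False
```

From (⊥, ⊥, ⊥, ⊥), HTR yields (0, ⊥, ⊥, ⊥), after which the low atomicity actions propagate the correction around the ring until an invariant state is reached.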
In Section 6.3, we describe the specification of fault-tolerance components, and we show how we use a distributed detector to refine high atomicity actions.

Remark. Had we non-deterministically chosen to use PS_i (i ≠ 0) as the process that adds the high atomicity recovery action, then the high atomicity action HTR would have been different in that HTR would write x_i. (We refer the reader to Section 6.7 for a discussion about this issue.)

6.3 Specifying Pre-Synthesized Components

In this section, we describe the specification of fault-tolerance components (i.e., detectors and correctors). Specifically, we concentrate on detectors, and we consider a special subclass of correctors where a corrector consists of a detector and a write action on the local variables of a single process.

6.3.1 The Specification of Detectors

We recall the specification of a detector component presented in [34, 33]. Towards this end, we describe detection predicates and witness predicates. A detector, say d, identifies whether or not a global state predicate, X, holds. The global state predicate X is called a detection predicate in the global state space of a distributed program [34, 33]. It is often difficult to evaluate the truth value of X in an atomic action. Thus, we (i) decompose the detection predicate X into a set of smaller detection predicates X_0 ... X_n, where the compositional detection of X_0 ... X_n leads us to the detection of X, and (ii) provide a state predicate, say Z, whose value leads the detector to the conclusion that X holds. Since when Z becomes true its value witnesses that X is true, we call Z a witness predicate. If Z holds then X must hold as well. If X holds then Z will eventually hold and continuously remain true. Hence, corresponding to each detection predicate X_i, we identify a witness predicate Z_i such that if Z_i is true then X_i will be true. The detection predicate X is either the conjunction of the X_i (0 ≤ i ≤ n) or the disjunction of the X_i.
Since the detection predicates that we encounter represent deadlock states, they are inherently in conjunctive form, where each conjunct represents the valuation of the program variables at some process. Hence, in the rest of this chapter, we consider the case where X is a conjunction of X_i, for 0 ≤ i ≤ n.

Specification. Let X and Z be state predicates. Let 'Z detects X' be the problem specification. Then, 'Z detects X' stipulates that

• (Safety) When Z holds, X must hold as well.
• (Liveness) When the predicate X holds and continuously remains true, Z will eventually hold and continuously remain true. □

We represent the safety specification of a detector as a set of transitions that the detector is not allowed to execute. Thus, the following set of transitions represents the safety specification of a detector:

spec_d = {(s_0, s_1) : Z(s_1) ∧ ¬X(s_1)}

6.3.2 The Representation of Detectors

In this section, we describe how we formally represent a distributed detector. While our method allows one to use detectors of different topologies (cf. Section 6.4.1), in this section, we comprehensively describe the representation of a linear (sequential) detector, as such a detector will be used in our token ring example.

The composition of detectors. A detector, say d, with the detection predicate X ≡ X_0 ∧ ... ∧ X_n is obtained by composing d_i, 0 ≤ i ≤ n, where d_i is responsible for the detection of X_i using a witness predicate Z_i (0 ≤ i ≤ n). The elements of d can execute in parallel or in sequence. More specifically, parallel detection of X requires d_0 ... d_n to execute concurrently. As a result, the state predicate (Z_0 ∧ ... ∧ Z_n) is the witness predicate for detecting X.

A sequential detector requires the detectors d_0, ..., d_n to execute one after another. For example, given a linear arrangement d_n ... d_0, a detector d_i (0 ≤ i < n) detects its detection predicate, using Z_i, after d_{i+1} witnesses. Thus, when Z_i becomes true, it shows that Z_{i+1} already holds.
Since when Z_i becomes true X_i must also be true, it follows that the detection predicates X_n ... X_i hold. Therefore, we can atomically check the witness predicate Z_0 in order to identify whether or not X ≡ (X_n ∧ ... ∧ X_0) holds.

The detection of global state predicates of programs that have a hierarchical topology (e.g., tree-like structures) requires parallel and sequential detectors. In this section, we demonstrate our method in the context of a linear detector, as such a detector suffices for the token ring example. In Section 6.6, we apply our synthesis method to the synthesis of a diffusing computation program using components with hierarchical topology.

A linear detector. We consider a detector d with linear topology. The detector d consists of n+1 elements (n > 0), its specification spec_d, its variables, and its invariant U. Since the structure of the detector is linear, without loss of generality, we consider an arrangement d_n ... d_0 for the elements of the distributed detector, where the left-most element is d_n and the right-most element is d_0.

Component variables. Each element d_i, 0 ≤ i ≤ n, of the detector has a Boolean variable y_i.

Read/write restrictions. Element d_i can read y_i and y_{i+1}, and can only write y_i (0 ≤ i < n). d_n reads and writes y_n. Also, d_i is allowed to read all variables that P_i can read (i.e., the process with which d_i is composed).

Witness predicates. The witness predicate of each d_i, say Z_i, is equal to (y_i = true).

The detector actions. The actions of the linear detector are as follows (0 ≤ i < n).

DA_n : LC_n ∧ (y_n = false)  →  y_n := true;
DA_i : LC_i ∧ (y_i = false) ∧ (y_{i+1} = true)  →  y_i := true;

Using action DA_i (0 ≤ i < n), each element d_i of the linear detector witnesses (i.e., sets the value of y_i to true) whenever (i) the condition LC_i becomes true, where LC_i represents a local condition that d_i atomically checks (by reading the variables of P_i), and (ii) its neighbor d_{i+1} has already witnessed.
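The sequential witnessing of the DA actions can be sketched as follows. This is illustrative only; we execute one enabled action per step and return how many actions fire before Z_0 ≡ (y_0 = true) holds.

```python
def witness_steps(lc):
    """Run the DA actions of the linear detector d_n ... d_0 until y[0] holds.

    lc[i] is the (fixed) truth value of the local condition LC_i.
    Returns the number of actions executed, or None if y[0] can never be set.
    """
    n = len(lc) - 1
    y = [False] * (n + 1)
    steps = 0
    while not y[0]:
        if lc[n] and not y[n]:                       # DA_n: the base element
            y[n] = True
        else:
            fired = False
            for i in range(n - 1, -1, -1):           # DA_i: needs y[i+1]
                if lc[i] and not y[i] and y[i + 1]:
                    y[i] = True
                    fired = True
                    break
            if not fired:
                return None   # some LC_i is false: Z_0 can never witness
        steps += 1
    return steps
```

With all local conditions true, the n+1 elements witness one after another, from d_n down to d_0; if some LC_i is false, y_0 is never set, which reflects the safety of 'Z detects X'.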
The detector d_n witnesses (using action DA_n) when LC_n becomes true.

Detection predicates. The detection predicate X_i for element d_i is equal to (LC_n ∧ ... ∧ LC_i) (0 ≤ i ≤ n). Therefore, d_0 detects the global detection predicate LC_n ∧ ... ∧ LC_0.

Invariant. During the detection, when an element d_i sets y_i to true, the elements d_j, for i < j ≤ n, have already set their y values to true. Hence, we represent the invariant of the linear detector by the predicate U, where

U = {s : (∀i : 0 ≤ i ≤ n : (y_i(s) ⇒ (∀j : i < j ≤ n : y_j(s))))}

Faults, say F, may corrupt the y values of the detector elements. An element whose witness no longer has a valid premise withdraws it by an action of the form

... →  y_i := false;

Theorem 6.1 The linear detector is masking F-tolerant for 'Z detects X' from U.

Proof. The linear detector satisfies 'Z detects X' from U. Also, in the presence of F, no element d_i (0 ≤ i ≤ n) of the detector reaches a state where d_i witnesses incorrectly. As a result, the linear detector never violates the safety of 'Z detects X' in the presence of F. Also, when faults stop occurring, the actions of the linear detector correct the corrupted values of y_i if necessary. Thus, every computation of the linear detector in the presence of F eventually reaches a state in U. Therefore, the linear detector component is masking F-tolerant for 'Z detects X' from U. □

6.3.3 Token Ring Example Continued

In Section 6.2.2, we added the following high atomicity action to the token ring program ITR; this action is executed by the pseudo process PS_0.

HTR: (x_0 = ⊥) ∧ (x_1 = ⊥) ∧ (x_2 = ⊥) ∧ (x_3 = ⊥)  →  x_0 := 0

In order to synthesize a distributed program (that includes low atomicity actions), we need to refine the guard of the above action. The read/write restrictions of the processes in the token ring program identify the underlying communication topology of the fault-intolerant program, which is a ring. Hence, we select a linear detector, d, so that we can organize its elements, d_3, d_2, d_1, d_0, in the ring. Each detector element d_i is responsible for detecting whether or not the local conditions LC_3 to LC_i hold (LC_i ≡ (x_i = ⊥)), for 0 ≤ i ≤ 3.
Thus, the detection predicate X_i is equal to ((x_3 = ⊥) ∧ ... ∧ (x_i = ⊥)), for 0 ≤ i ≤ 3. As a result, the global detection predicate of the linear detector is ((x_3 = ⊥) ∧ (x_2 = ⊥) ∧ (x_1 = ⊥) ∧ (x_0 = ⊥)). The witness predicate of each d_i, say Z_i, is equal to (y_i = true), and the actions of the sequential detector are as follows (0 ≤ i ≤ 2).

DA_3 : (x_3 = ⊥) ∧ (y_3 = false)  →  y_3 := true;
DA_i : (x_i = ⊥) ∧ (y_i = false) ∧ (y_{i+1} = true)  →  y_i := true;

Note that we replace LC_i with (x_i = ⊥) in the above actions. During the synthesis, after the synthesis algorithm acquires the actions of its required component, it replaces each LC_i with the appropriate condition in order to create the transition groups corresponding to each action of the component.

6.4 Using Pre-Synthesized Components

In this section, we describe how we perform the second and third steps of our synthesis approach presented in Figure 6.1. In particular, in Section 6.4.1, we show how we automatically specify the required components during the synthesis. Then, in Section 6.4.3, we show how we ensure that no interference exists between the program and the fault-tolerance component. Afterwards, we present an algorithm for the addition of fault-tolerance components. In Sections 6.4.2 and 6.4.4, we respectively present the algorithmic specification and the algorithmic addition of a linear detector to the token ring program.

6.4.1 Algorithmic Specification of the Fault-Tolerance Components

We present the Component_Specification algorithm (cf. Figure 6.2) that takes a deadlock state s_d, the distribution restrictions (i.e., the read/write restrictions) of the program being synthesized, and the set of high atomicity pseudo processes PS_i (0 ≤ i ≤ n).
First, the algorithm searches for a high atomicity process PS_index that is able to add a high atomicity recovery action, ac : grd → st, from s_d to a state in the state predicate S_rec, where S_rec represents the set of states from where there exists a safe recovery path to the invariant. Also, we verify the closure of S_rec ∪ {s_d} in the computations of p[]f. If there exists such a process PS_index then the algorithm returns a triple ⟨X, R, index⟩, where (i) X is the detection predicate that should be refined in the refinement of the action ac; (ii) R is a relation that represents the topology of the program, and (iii) index is an integer that identifies the process that should detect grd and execute st.

The Component_Specification algorithm constructs the state predicate X using the LC_i conditions. Each LC_i condition is by itself a conjunction over the program variables readable by process P_i. Therefore, the predicate X is the conjunction of the LC_i conditions (0 ≤ i ≤ n).

Component_Specification(s_d: state, S_rec: state predicate,
    PS_0, ..., PS_n: high atomicity pseudo process, spec: safety specification,
    r_0, ..., r_n: read restrictions, w_0, ..., w_n: write restrictions)
{ // n is the number of processes.
  if (∃index : 0 ≤ index ≤ n :
        (∃s : s ∈ S_rec : (s_d, s) ∈ PS_index ∧ ((s_d, s) does not violate spec) ∧
         (∀x : (x(s_d) ≠ x(s)) : x ∈ w_index)))
  then X := ∧_{i=0..n} LC_i, where LC_i = (∧ x : x ∈ r_i : (x = x(s_d)));
       R := {(i, j) : (0 ≤ i ≤ n) ∧ (0 ≤ j ≤ n) : w_i ⊆ r_j};
       return X, R, index;
  else return false, ∅, -1;
}

Figure 6.2: Automatic specification of a component.

The relation R ⊆ (P × P) identifies the communication topology of the distributed program, where P is the set of program processes. We represent R by the finite set {(i, j) : (0 ≤ i ≤ n) ∧ (0 ≤ j ≤ n) : w_i ⊆ r_j} that we create using the read/write restrictions among the processes. The presence of a pair (i, j) in R shows that there exists a communication link between P_i and P_j.
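The algorithm of Figure 6.2 can be sketched in Python as follows. States are dictionaries from variable names to values; the predicate violates_spec and all names are our own illustrative stand-ins, and self-pairs are omitted from R as in the token ring example.

```python
def component_specification(sd, s_rec, writes, reads, violates_spec):
    """Sketch of Component_Specification (Figure 6.2).

    sd: deadlock state; s_rec: iterable of states in S_rec;
    writes[i]/reads[i]: variable-name sets w_i/r_i of process i.
    Returns (X, R, index) on success, (False, set(), -1) otherwise.
    """
    n = len(writes)
    for index in range(n):
        for s in s_rec:
            if violates_spec(sd, s):
                continue
            # the candidate recovery step may change only writable variables
            changed = {v for v in sd if sd[v] != s[v]}
            if changed <= writes[index]:
                # LC_i fixes the sd-values of the variables P_i can read
                X = [{v: sd[v] for v in reads[i]} for i in range(n)]
                R = {(i, j) for i in range(n) for j in range(n)
                     if i != j and writes[i] <= reads[j]}
                return X, R, index
    return False, set(), -1
```

On the token ring deadlock state, this sketch selects index 0 and reproduces the ring topology {(0,1), (1,2), (2,3), (3,0)}.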
Since we internally represent R by an undirected graph, we consider the pair (i, j) as an unordered pair.

The interface of the fault-tolerance components. The format of the interface of each component is the same as the output of the Component_Specification algorithm, which is a triple ⟨X, R, index⟩ as described above. We use this interface to extract a component from the component library using a pattern-matching algorithm. To achieve this goal, we use existing specification-matching techniques [35] for extracting components from the component library.

The output of the component library. Given the interface ⟨X, R, index⟩ of a required component, the component library returns the witness predicate, Z, the invariant, U, and the set of transition groups, gd_0 ∪ ... ∪ gd_k ∪ g_index, of the pre-synthesized component (k ≥ 0). The group of transitions g_index represents the low atomicity write action that should be executed by process P_index.

Complexity. Since the algorithm Component_Specification checks the possibility of adding a high atomicity recovery action to each state of S_rec, its complexity is polynomial in the number of states of S_rec.

6.4.2 Token Ring Example Continued

We trace the algorithm of Figure 6.2 for the case of the token ring program. First, we non-deterministically identify PS_0 as the process that can read every program variable and can add a high atomicity recovery transition from the deadlock state s_d = (⊥, ⊥, ⊥, ⊥). Thus, the value of index will be equal to 0. Second, we construct the detection predicate X, where X ≡ ((x_0 = ⊥) ∧ (x_1 = ⊥) ∧ (x_2 = ⊥) ∧ (x_3 = ⊥)). Finally, using the read/write restrictions of the processes in the token ring program, the relation R will be equal to {(0,1), (1,2), (2,3), (3,0)}.

6.4.3 Algorithmic Addition of the Fault-Tolerance Components

In this section, we present an algorithm for adding a fault-tolerance component to a fault-intolerant distributed program to resolve a deadlock state s_d.
Before the addition, we ensure that no interference exists between the program and the fault-tolerance component that we add. We show that our addition algorithm is sound; i.e., the synthesized program satisfies the requirements of the addition problem (cf. Section 6.1).

We recall the structure of the fault-intolerant program, p, from the first paragraph of Section 6.2.1. We represent the transitions of p by the union of its groups of transitions (i.e., g_0 ∪ ... ∪ g_m). We also assume that we have extracted the required pre-synthesized component, c, as described in Section 6.4.1. The component c consists of a detector d that includes a set of transition groups gd_0 ∪ ... ∪ gd_k, and the write action of the pseudo process PS_index represented by a group of transitions g_index in low atomicity.

The state space of the composition of p and d is the new state space S_p'. We introduce an onto function H1 : S_p' → S_p (respectively, H2 : S_p' → S_d, where S_d is the state space of the detector d) that maps the states in the new state space S_p' to the states in the old state space S_p (respectively, S_d). Now, we show how we verify the interference-freedom of the composition of c and p.

Interference-freedom. We say the program p and the fault-tolerance component c interfere iff the execution of one of them violates the (safety or liveness) specification of the other one. In order to ensure that no interference exists between p and c, we verify the following three conditions in the new state space S_p': (i) transitions of p do not interfere with the execution of d; (ii) transitions of d do not interfere with the execution of p, and (iii) the low atomicity write action associated with c does not interfere with the execution of p and d. Towards this end, we present the algorithm Interfere in Figure 6.3.
Interfere(S, S_rec, U: state predicate, H1, H2: onto mapping function,
    spec, spec_d: safety specification,
    g_0, ..., g_m, gd_0, ..., gd_k, g_index: groups of transitions)
// Checks the interference-freedom between the program and
// the fault-tolerance component.
{ // p = g_0 ∪ ... ∪ g_m, and d = gd_0 ∪ ... ∪ gd_k ∪ g_index
  // P_0 ... P_n are the processes of p, and d_0 ... d_n are the elements of d
  I1 = {g : (∃g_j : (g_j ∈ p) ∧ (0 ≤ j ≤ m) : (H1(g) = g_j)) ∧
        (∃(s'_0, s'_1) : (s'_0, s'_1) ∈ g : ((s'_0, s'_1) violates spec_d) ∨
         (H2(s'_0) ∈ U ∧ H2(s'_1) ∉ U))}
  if (I1 ≠ ∅) then return true;
  I2 = {gd : (∃gd_j : (gd_j ∈ d) ∧ (0 ≤ j ≤ k) : (H2(gd) = gd_j)) ∧
        (∃(s'_0, s'_1) : (s'_0, s'_1) ∈ gd : ((s'_0, s'_1) violates spec) ∨
         (H1(s'_0) ∈ S ∧ H1(s'_1) ∉ S))}
  if (I2 ≠ ∅) then return true;
  I3 = {g : (H2(g) = g_index) ∧
        (∃(s'_0, s'_1) : (s'_0, s'_1) ∈ g : ((s'_0, s'_1) violates spec_d) ∨
         (H1(s'_1) ∉ S_rec) ∨ (H1(s'_0) ∈ S ∧ H1(s'_1) ∉ S) ∨
         (H2(s'_0) ∈ U ∧ H2(s'_1) ∉ U) ∨ ((s'_0, s'_1) violates spec))}
  if (I3 ≠ ∅) then return true;
  return false;
}

Figure 6.3: Verifying the interference-freedom conditions.

First, we ensure that the transitions of p do not interfere with the execution of d by constructing the set of groups of transitions I1, where I1 contains those groups of transitions in the new state space S_p' that violate either the safety of d or the closure of its invariant U. The transitions of p do not interfere with the liveness of d because d executes only when p is deadlocked in the state s_d. Hence, we are only concerned with the safety of the detector d and the closure of U. When we map the transitions of p to the new state space, the mapped transitions should preserve the safety of d. Moreover, if the image of a transition (s'_0, s'_1) starts in U (i.e., H2(s'_0) ∈ U) then the image of (s'_0, s'_1) must also end in U (i.e., H2(s'_1) ∈ U). The emptiness of I1 shows that the transitions of p do not interfere with the execution of d.
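The I1 check just described can be sketched as follows. This is illustrative only; violates_spec_d, in_U, and h2 are hypothetical stand-ins for membership in spec_d, the detector invariant U, and the mapping H2.

```python
def program_interferes_with_detector(groups, h2, violates_spec_d, in_U):
    """Sketch of the I1 test: does some program transition, mapped into the
    new state space, violate the detector's safety or the closure of U?

    groups: iterable of transition groups; each group is a set of (s0, s1)
    pairs of composed states. h2 projects a composed state onto the detector.
    """
    for g in groups:
        for (s0, s1) in g:
            if violates_spec_d(s0, s1):
                return True          # safety of the detector violated
            if in_U(h2(s0)) and not in_U(h2(s1)):
                return True          # closure of U violated
    return False
```

Composed states here are pairs (program part, detector part); interference is reported exactly when I1 would be non-empty.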
Second, using a similar argument, we construct the set of groups of transitions I2 in the new state space S_p' whose every transition is a mapping of the transitions of d that violate either the safety of spec or the closure of the program invariant S.

Third, if I1 and I2 are empty then it follows that the detector d is able to detect s_d without interfering with p. However, after d detects its detection predicate, the component c performs a write action to change the state of the program from s_d to a state s ∈ S_rec, where S_rec is the set of states from where safe recovery has already been added. If a transition in the group associated with the write transition (s_d, s) violates (i) the safety of the detector; (ii) the safety of the program; (iii) the closure of U, or (iv) the closure of S, then the recovery action interferes with the program (see the construction of I3 in Figure 6.3). If I1, I2, and I3 are empty then the Interfere algorithm declares that no interference will happen due to the addition of c to p.

Addition. We present the Add_Component algorithm for an interference-free addition of the fault-tolerance component c to p. Thus, if the Interfere algorithm returns false then we invoke Add_Component (cf. Figure 6.4). In the new state space S_p', we construct a set of transition groups p_H1 (respectively, d_H2) that includes all groups of transitions, g, whose images in S_p (respectively, S_d) belong to p (respectively, d). Moreover, no transition (s'_0, s'_1) ∈ g violates the safety specification of d (respectively, p) or the closure of the invariant of d (respectively, p), i.e., U (respectively, S). In the calculation of d_H2, we note that the image of every group g in d and p must belong to the same process (cf. the condition (l = i) in the construction of d_H2).

Add_Component(S, S_rec, U: state predicate, H1, H2: onto mapping function,
    spec, spec_d: safety specification,
    g_0, ..., g_m, gd_0, ..., gd_k, g_index: groups of transitions)
{ // p = g_0 ∪ ... ∪ g_m, and d = gd_0 ∪ ... ∪ gd_k ∪ g_index
  // P_0 ... P_n are the processes of p, and d_0 ... d_n are the elements of d
  p_H1 = {g : (∃g_j : (g_j ∈ p) ∧ (0 ≤ j ≤ m) : (H1(g) = g_j)) ∧
       (∀(s'_0, s'_1) : (s'_0, s'_1) ∈ g : ((s'_0, s'_1) does not violate spec_d) ∧
        (H2(s'_0) ∈ U ⇒ H2(s'_1) ∈ U))}
  d_H2 = {gd : (∃gd_j : (gd_j ∈ d) ∧ (0 ≤ j ≤ k) : (H2(gd) = gd_j)) ∧
       (∃d_i, P_l : (0 ≤ i ≤ n) ∧ (0 ≤ l ≤ n) :
        (H2(gd) ∈ d_i) ∧ (H1(gd) ∈ P_l) ∧ (l = i)) ∧
       (∀(s'_0, s'_1) : (s'_0, s'_1) ∈ gd : ((s'_0, s'_1) does not violate spec) ∧
        (H1(s'_0) ∈ S ⇒ H1(s'_1) ∈ S))}
  p_c = {g : (H2(g) = g_index) ∧
       (∀(s'_0, s'_1) : (s'_0, s'_1) ∈ g : ((s'_0, s'_1) does not violate spec) ∧
        (H1(s'_1) ∈ S_rec) ∧ (H2(s'_0) ∈ U ⇒ H2(s'_1) ∈ U) ∧
        ((s'_0, s'_1) does not violate spec_d))}
  S' := {s : s ∈ S_p' : H1(s) ∈ S ∧ H2(s) ∈ U}
  p' := p_H1 ∪ d_H2 ∪ p_c;
  return p', S';
}

Figure 6.4: The automatic addition of a component.

The set p_c includes all groups of transitions, g, whose every transition has an image in g_index under the mapping H2. Further, no transition (s'_0, s'_1) ∈ g violates the safety of spec or the closure of S. The set of states of the invariant of the synthesized program, S', consists of those states whose images in S_p belong to the program invariant S and whose images in the state space of the detector, S_d, belong to the detector invariant U.

Theorem 6.2 The algorithm Add_Component is sound. □

Theorem 6.3 The complexity of Add_Component is polynomial in |S_p'|. □

Before we show the soundness of Add_Component, we make some observations and present the following preliminary lemmas and theorems. Towards this end, we assume that we are given a program p, its specification spec, its invariant S, its state space S_p, faults f, and a deadlock state s_deadlock ∉ S. We consider the case where we have already added safety to p and we only need to resolve s_deadlock to synthesize the masking fault-tolerant program p' with the invariant S' in the new state space S_p'. Towards this end, we use the Add_Component algorithm for adding a fault-tolerance component c to p.
The component c consists of a distributed detector d, with the detection predicate X, the witness predicate Z, an invariant U, and a low atomicity write action Z → st that takes p from the state s_deadlock to a state s ∈ S_rec. The state predicate S_rec represents the set of states from where a safe recovery to the invariant S is guaranteed. By definition, the set of states S_rec includes the invariant S; i.e., S ⊆ S_rec. Also, the set S_rec ∪ {s_deadlock} is closed in the computations of p[]f. However, because of the deadlock state s_deadlock, recovery to S is not guaranteed from S_rec ∪ {s_deadlock}. We define two mapping functions H1 and H2, respectively from S_p' to S_p and from S_p' to S_d, where S_d is the state space of the distributed detector d included in c.

In the Add_Component algorithm, based on the construction of S', we include those states in S' whose images in S_p belong to S. Thus,

Observation 6.4 ∀s : s ∈ S' : H1(s) ∈ S □

Now, we present the following theorem.

Theorem 6.5 H1(S') ⊆ S.
Proof. The proof follows from Observation 6.4. □

By construction, for every arbitrary group of transitions g ∈ p_H1 (cf. Figure 6.4), there exists a group of transitions g_j ∈ p (0 ≤ j ≤ m). Now, if we consider a transition (s'_0, s'_1) ∈ g such that s'_0 ∈ S' and s'_1 ∈ S', then using Observation 6.4, H1(s'_0) ∈ S and H1(s'_1) ∈ S. As a result, the condition (H1(s'_0), H1(s'_1)) ∈ p|H1(S') holds. Thus, we have

Observation 6.6 ∀(s'_0, s'_1) : (s'_0, s'_1) ∈ p_H1 : (((s'_0, s'_1) ∈ p'|S') ⇒ (H1((s'_0, s'_1)) ∈ p|H1(S'))) □

(H1((s'_0, s'_1)) denotes the transition (H1(s'_0), H1(s'_1)) in the state space S_p.)

Using a similar argument, we present the following observation.

Observation 6.7 ∀(s'_0, s'_1) : (s'_0, s'_1) ∈ d_H2 : (((s'_0, s'_1) ∈ p'|S') ⇒ (H1((s'_0, s'_1)) ∈ p|H1(S'))) □

The transition groups of p_c add recovery from s_deadlock. Also, by construction, for every transition (s'_deadlock, s'_1) ∈ p_c, Z(s'_deadlock) holds. Thus, at s'_deadlock, the detector detects the deadlock state s_deadlock.
Since s_deadlock ∉ S, the state s'_deadlock does not belong to S'. It follows that (s'_deadlock, s'_1) ∉ p'|S'. Therefore, we observe that

Observation 6.8 ∀(s'_0, s'_1) : (s'_0, s'_1) ∈ p_c : (((s'_0, s'_1) ∈ p'|S') ⇒ (H1((s'_0, s'_1)) ∈ p|H1(S'))) □

Using the above observations, we present the second theorem.

Theorem 6.9 H1(p'|S') ⊆ p|H1(S').
Proof. By the construction of p', the proof follows from Observations 6.6, 6.7, and 6.8. □

To show that p' is masking f-tolerant for spec, we prove the following lemmas.

Lemma 6.10 From every state of S'_rec, safe recovery to S' with respect to spec is guaranteed.
Proof. By definition, from every state of S_rec safe recovery to S is guaranteed with respect to spec. Now, let cmp be a computation of p'[]f that starts from a state in S'_rec. If cmp violates spec then there exists a computation prefix of cmp that violates spec. Let (s'_0, s'_1, ..., s'_n) be the smallest such prefix. It follows that (s'_{n-1}, s'_n) violates the safety of spec. As a result, (H1(s'_{n-1}), H1(s'_n)) is a transition of program p that violates spec. Thus, the corresponding computation prefix (H1(s'_0), H1(s'_1), ..., H1(s'_n)) violates spec. Hence, we find a computation prefix in S_rec that is not safe. This contradicts the assumption that from every state of S_rec safe recovery to S with respect to spec is guaranteed.

If (s'_{n-1}, s'_n) is a fault transition then the corresponding fault transition (H1(s'_{n-1}), H1(s'_n)) violates spec. Hence, we could find a state of p in the state space S_p (i.e., H1(s'_{n-1})) from where faults alone violate spec. This contradicts the assumption that we have already added safety to p.

Now, let cmp be a computation of p' that starts from a state in S'_rec and never reaches S'. Since the computations of p' are infinite, there must exist a prefix (s'_0, s'_1, ..., s'_n, s'_0) of cmp that includes a cycle. Now, using function H1, we calculate the computation prefix (s_0, s_1, ..., s_n, s_0) in the old state space S_p, where H1(s'_i) = s_i (0 ≤ i ≤ n).
As a result, starting at s0 ∈ S_rec, we find a computation prefix that includes a cycle and never reaches S, which contradicts the definition of S_rec. Therefore, from every state of S'_rec, safe recovery to S' with respect to spec is guaranteed.  □

Lemma 6.11  From every state of S'_rec, no computation prefix of p'[]f that ends in S' violates the safety specification of the detector d (i.e., spec_d).
Proof. Let cmp be a computation of p'[]f that starts from a state in S'_rec. If cmp violates spec_d then there exists a computation prefix of cmp that violates spec_d. Let (s'0, s'1, ..., s'n) be the smallest such prefix. It follows that (s'(n-1), s'n) violates spec_d. Thus, the transition (H2(s'(n-1)), H2(s'n)) violates spec_d; i.e., the detector d and the program p interfere. By the construction of the transitions of p', no transition of p' interferes with the execution of d. Thus, the computation prefix cmp does not violate spec_d. Also, since we showed (cf. Theorem 6.1) that the fault-tolerance component d is by itself F-tolerant, (H2(s'(n-1)), H2(s'n)) cannot be a fault transition that violates spec_d. Therefore, starting from every state in S'_rec, every computation of p'[]f satisfies spec_d.  □

Lemma 6.12  T' = S'_rec ∪ {s'_deadlock} is a valid fault-span for p' in the new state space Sp' (i.e., H1(T') = S_rec ∪ {s_deadlock}).
Proof. By construction, we have S ⊆ S_rec. Hence, using the function H1, we have S' ⊆ S'_rec. Otherwise, if there exists a state s'0 ∈ S' such that s'0 ∉ S'_rec then we would have a state s0 ∈ S, where H1(s'0) = s0, that is not in S_rec, which contradicts S ⊆ S_rec. Hence, we have S' ⊆ S'_rec. Also, by assumption, the set S_rec ∪ {s_deadlock} is closed in the computations of p[]f. As a result, S'_rec ∪ {s'_deadlock} is closed in the computations of p'[]f. It follows that T' is a valid fault-span since it is closed in p'[]f and S' ⊆ T'.  □

Using T', we present the following lemmas.

Lemma 6.13  p'[]f satisfies spec and spec_d from T'.
Proof.
Using Lemmas 6.10 and 6.11, p'[]f satisfies spec and spec_d from S'_rec. We only need to show that p'[]f satisfies spec and spec_d from s'_deadlock, where H1(s'_deadlock) = s_deadlock. By the construction of p_c, no transition originating at s'_deadlock violates spec or spec_d. Therefore, starting from every state of T', p'[]f satisfies spec and spec_d.  □

Lemma 6.14  Every computation of p'[]f that starts from a state in T', where H1(T') = S_rec ∪ {s_deadlock}, contains a state in S'.
Proof. Using Lemma 6.10, it follows that every computation of p'[]f that starts from a state in S'_rec, where H1(S'_rec) = S_rec, reaches a state in S'. Moreover, by the construction of p', the transitions of p_c provide safe recovery from s'_deadlock to a state in S'_rec, where H1(s'_deadlock) = s_deadlock. Since safe recovery from every state of S'_rec is guaranteed, every computation of p' that starts from a state in T' contains a state in S'.  □

Theorem 6.15  p' is masking f-tolerant for spec from S'.
Proof. First, we show that S' is an invariant of p'. Consider a transition (s'0, s'1) of p' that starts in S' and ends outside S'. Since s'0 ∈ S', by Observation 6.4, we have H1(s'0) ∈ S. Also, from the construction of S', we have H1(s'1) ∉ S. As a result, we find a transition (H1(s'0), H1(s'1)) of p that starts in S and ends outside S, which contradicts the closure of S in p. Thus, the execution of p' is closed in S'. From Theorem 6.9, it follows that p' satisfies spec from S'. Thus, S' is an invariant of p'. Therefore, using S' as an invariant and T' as a fault-span, and based on Lemmas 6.13 and 6.14, we have shown that p' is masking f-tolerant for spec from S'.  □

Theorem 6.2 (Soundness)  The algorithm Add_Component is sound.
Proof. To prove that our algorithm is sound, we have to show that the conditions of the addition problem are satisfied.
1. H1(S') ⊆ S (cf. Theorem 6.5).
2. H1(p'|S') ⊆ p|H1(S') (cf. Theorem 6.9).
3. p' is masking f-tolerant for spec from S' (cf. Theorem 6.15).  □

Theorem 6.3  The complexity of Add_Component is polynomial in |Sp'|.
Proof. The Add_Component algorithm consists of three parts, where we construct the sets of transitions p_H1, d_H2, and p_c. Each of these sets contains a set of transition groups in the new state space Sp'. The size of the new state space is of the order of |Sp| · |Sd| (i.e., |Sp'| = |Sp| · |Sd|). As a result, the size of each transition group cannot be more than |Sp'| · |Sp'| in Sp'. To construct p_H1, we process all groups of transitions that belong to p_H1. Thus, in the worst case, we need to process m groups of transitions in the new state space Sp', where m is the number of groups. As a result, the worst-case complexity of constructing p_H1 is of the order of m · |Sp'|². The same reasoning holds for the worst-case complexity of constructing d_H2 and p_c. Therefore, the complexity of the Add_Component algorithm is polynomial in the size of Sp'; i.e., |Sp'|.  □

6.4.4 Token Ring Example Continued

Using Add_Component, we add the detector specified in Section 6.4.2 to the token ring program MTR introduced in Section 6.2.2. The resulting program, consisting of the processes P0 ... P3 arranged in a ring, is masking fault-tolerant to process-restart faults. We represent the transitions of P0 by the following actions.

MTR0:  ((x0 = 1) ∨ (x0 = ⊥)) ∧ (x3 = 1)  →  x0 := 0;
MTR0': ((x0 = 0) ∨ (x0 = ⊥)) ∧ (x3 = 0)  →  x0 := 1;
D0:    (x0 = ⊥) ∧ (y0 = false) ∧ (y1 = true)  →  y0 := true;
C0:    (y0 = true)  →  x0 := 0; y0 := false;

The actions MTR0 and MTR0' are the same as the actions of the MTR program presented in Section 6.2.2. The action D0 belongs to the sequential detector that sets the witness predicate Z0 to true. The action C0 is the recovery action that P0 executes whenever the witness predicate (y0 = true) becomes true. Now, we present the actions of P3.
MTR3:  ((x3 = 0) ∨ (x3 = ⊥)) ∧ (x2 = 1)  →  x3 := 1; y3 := false;
MTR3': ((x3 = 1) ∨ (x3 = ⊥)) ∧ (x2 = 0)  →  x3 := 0; y3 := false;
D3:    (x3 = ⊥) ∧ (y3 = false)  →  y3 := true;

The action D3 belongs to the detector that sets Z3 to true. We present the actions of P1 and P2 as the following parameterized actions (for i = 1, 2).

MTRi:  ((xi = 0) ∨ (xi = ⊥)) ∧ (x(i-1) = 1)  →  xi := 1; yi := false;
MTRi': ((xi = 1) ∨ (xi = ⊥)) ∧ (x(i-1) = 0)  →  xi := 0; yi := false;
Di:    (xi = ⊥) ∧ (yi = false) ∧ (y(i+1) = true)  →  yi := true;

The above program is masking fault-tolerant for the faults that corrupt one or more processes. Note that when a process Pi (1 ≤ i ≤ 3) changes the value of xi to a non-corrupted value, it falsifies Zi (i.e., yi). The falsification of Zi is important during recovery from ss = (⊥, ⊥, ⊥, ⊥) in that when xi takes a non-corrupted value, the detection predicate Xi no longer holds. Thus, if Zi remained true then the detector di would witness incorrectly and, as a result, violate the safety of the detector. However, P0 does not need to falsify its witness predicate Z0 in actions MTR0 and MTR0' because the action C0 has already falsified Z0 during a recovery from ss.

Remark. One could argue that we could have selected a different linear order d0 ... d3 for the detector added to the token ring program. To address this issue, we note that, in the case of the token ring program, a detector with such a linear arrangement would interfere with the execution of the program (cf. Section 6.7 for details).

6.5 Example: Alternating Bit Protocol

In this section, we reuse the linear component used in the synthesis of the token ring program in the synthesis of a fault-tolerant alternating bit protocol (ABP). The ABP program consists of a sender process and a receiver process connected by a communication link that is subject to message loss faults.
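Before developing the ABP example, the recovery behavior of the masking token-ring program above can be made concrete with a small simulation. The sketch below is my own illustration, not code from the dissertation: it encodes the actions MTR, D, and C given above, picks one arbitrary serial schedule (detectors before the corrector before the ring actions), and drives the ring from the fully corrupted state ss = (⊥, ⊥, ⊥, ⊥) back to a legitimate state.

```python
BOT = "?"  # models the corrupted value (the dissertation's ⊥)

def step(x, y):
    """Execute the first enabled action under a fixed priority and return
    its name, or None if no action is enabled. x[i] is the token variable
    of P_i; y[i] is the witness predicate Z_i of detector element d_i."""
    # Detector actions: D3 waits for nobody; D2, D1 wait for y[i+1]; D0 waits for y[1].
    if x[3] == BOT and not y[3]:
        y[3] = True; return "D3"
    for i in (2, 1):
        if x[i] == BOT and not y[i] and y[i + 1]:
            y[i] = True; return f"D{i}"
    if x[0] == BOT and not y[0] and y[1]:
        y[0] = True; return "D0"
    if y[0]:                              # corrector C0: restore x0, falsify Z0
        x[0] = 0; y[0] = False; return "C0"
    # Program actions MTR0/MTR0' of P0 and MTRi/MTRi' of P1..P3 (the latter
    # falsify the witness predicate, as in the actions above).
    if x[0] in (0, BOT) and x[3] == 0:
        x[0] = 1; return "MTR0'"
    if x[0] in (1, BOT) and x[3] == 1:
        x[0] = 0; return "MTR0"
    for i in (1, 2, 3):
        if x[i] in (0, BOT) and x[i - 1] == 1:
            x[i] = 1; y[i] = False; return f"MTR{i}"
        if x[i] in (1, BOT) and x[i - 1] == 0:
            x[i] = 0; y[i] = False; return f"MTR{i}'"
    return None

# Recovery from the fully corrupted state ss = (BOT, BOT, BOT, BOT):
x, y = [BOT] * 4, [False] * 4
while BOT in x:
    assert step(x, y) is not None, "deadlock during recovery"
# P0 holds the token iff x0 = x3; P_i (i > 0) holds it iff x_i differs from x_(i-1).
tokens = [x[0] == x[3]] + [x[i] != x[i - 1] for i in (1, 2, 3)]
print(x, "tokens:", sum(tokens))   # [0, 0, 0, 0] tokens: 1
```

Under this schedule the detector chain fires d3, d2, d1, d0, the corrector C0 restores x0, and the ordinary ring actions then clean up the remaining corrupted values, leaving all witnesses false and exactly one token in the ring.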
Using the synthesis method presented in this chapter, we add pre-synthesized components to synthesize an alternating bit protocol that is nonmasking fault-tolerant; i.e., when faults occur, the program guarantees recovery to its invariant. However, during recovery, the nonmasking fault-tolerant protocol may violate its safety specification.

The alternating bit protocol (ABP). The fault-intolerant program consists of two processes: a sender and a receiver. The sender reads from an infinite input stream of data packets and sends the newly read packet to the receiver. The receiver copies each received packet into an infinite output stream. When the sender sends a data packet, it waits for an acknowledgment from the receiver before it sends the next packet. Also, when the receiver receives a new data packet, it sends an acknowledgment bit back to the sender. A one-bit message header suffices to identify the data packet currently being sent, since at every moment there exists at most one unacknowledged data packet. Using this identifier bit, the sender (respectively, the receiver) does not need to count the total number of packets sent (respectively, received).

Both processes have read/write access to a send channel and a receive channel. The send channel is represented by an integer variable cs, and the variable cr models the receive channel. The domain of cs (respectively, cr) is {-1, 0, 1}, where 0 and 1 represent the value of the data bit in the channel and -1 represents an empty channel. Since we are only concerned with the synchronization between the sender and the receiver, we do not explicitly model the actual data being sent. Thus, we consider the contents of cs and cr to be a single binary digit. The sender process has a Boolean variable bs that stores the data bit identifying the packet currently being sent to the receiver. Correspondingly, the receiver process has a Boolean variable br that represents the value that is supposed to be received.
When the sender process transmits a data packet, it waits for a confirmation from the receiver before it sends the next packet. To represent its mode of operation, the sender process uses a Boolean variable rs. The value of rs is 0 iff the sender is waiting for an acknowledgment. Likewise, the receiver process uses a Boolean variable rr such that the value of rr is 0 iff the receiver is waiting for a new packet. We represent a state s of the ABP program by a 6-tuple (rs, bs, rr, br, cs, cr). Thus, if we start from the initial state (1, 1, 0, 0, -1, -1), then the sender process begins to send data bit 1 while the receiver waits to receive it. We represent the transitions of the sender process in the fault-intolerant program ABP by the following actions.

Send0: (rs = 1)   →  rs := 0; cs := bs;
Send1: (cr ≠ -1)  →  rs := 1; cr := -1; bs := (bs + 1) mod 2;

Using action Send0, the sender sends another packet to the receiver when it is not waiting for an acknowledgment. Thus, by setting rs to 0, the sender moves to the state where it waits for an acknowledgment from the receiver. If the receive channel is non-empty (i.e., (cr ≠ -1)) then the sender reads the receive channel and becomes ready to send the next packet. The actions of the receiver process in the fault-intolerant program ABP are as follows:

Rec0: (cs ≠ -1)  →  cs := -1; rr := 1; br := (br + 1) mod 2;
Rec1: (rr = 1)   →  rr := 0; cr := br;

The receiver reads the send channel cs when it is non-empty (cf. action Rec0). Then, the receiver toggles the value of br, whereby it becomes ready to send an acknowledgment to the sender (in action Rec1).

Read/write restrictions. The sender can read/write rs, cs, bs, and cr, but it is not allowed to read rr and br. The receiver is allowed to read/write rr, cs, br, and cr. The receiver is not allowed to read rs and bs.

Faults. Faults can remove a data bit from either one of the communication channels, causing the loss of that data bit.
Hence, we model faults by setting the value of cs (respectively, cr) to -1.

F0: (cs ≠ -1)  →  cs := -1;
F1: (cr ≠ -1)  →  cr := -1;

We assume that the fault actions are executed a finite number of times; i.e., eventually faults stop occurring.

Safety specification. The problem specification requires that the receiver receives no duplicate packets.

Invariant. The state of the ABP program should satisfy the following conditions: (i) if the receiver is ready to send an acknowledgment message or it has already sent an acknowledgment, then the receive bit br and the send bit bs must be equal; (ii) if the sender is ready to send a new packet or it has already sent a new packet, then bs and br must not be equal; (iii) it is always the case that either the send channel cs is empty or it contains the sent bit bs; (iv) if both channels are empty then only one of the processes (i.e., the sender or the receiver) should be waiting; (v) if one of the channels is empty and the other one contains some data then both processes are waiting. Hence, we specify the invariant of the ABP program, S_ABP, as follows:

S_ABP = { s | (((rr(s) = 1) ∨ (cr(s) ≠ -1)) ⇒ (br(s) = bs(s))) ∧
              (((rs(s) = 1) ∨ (cs(s) ≠ -1)) ⇒ (br(s) ≠ bs(s))) ∧
              ((cs(s) = -1) ∨ (cs(s) = bs(s))) ∧
              (((cs(s) = -1) ∧ (cr(s) = -1)) ⇒ ((rr(s) + rs(s)) = 1)) ∧
              (((cs(s) ≠ -1) ∧ (cr(s) = -1)) ⇒ ((rr(s) + rs(s)) = 0)) ∧
              (((cs(s) = -1) ∧ (cr(s) ≠ -1)) ⇒ ((rr(s) + rs(s)) = 0)) }

Fault-span. The state of the ABP program may be perturbed into the state predicate T_ABP by fault transitions, where

T_ABP = { s | ((cs(s) = -1) ∨ (cs(s) = bs(s))) ∧
              (((cs(s) = -1) ∨ (cr(s) = -1)) ⇒ (((rr(s) + rs(s)) = 1) ∨ ((rr(s) + rs(s)) = 0))) }

The state predicate T_ABP includes states where (i) the send channel is empty or it is equal to the sent bit bs, and (ii) if at least one of the channels is empty then at least one of the processes is waiting.

Adding the actions of the high atomicity pseudo process.
Faults may perturb the program to states where the sender has sent a new packet and the receiver is waiting for its arrival. As a result, the sent message is lost in the send channel (i.e., cs becomes -1) and the receiver is waiting for a lost message. Likewise, the acknowledgment sent by the receiver might be lost in cr. Thus, the program may reach states where both channels are empty and both processes are waiting. For example, when the sent message is lost, the receiver is waiting for the lost message and the sender is waiting for its acknowledgment. In such states the program takes no action; i.e., they are deadlock states. Since the processes are not allowed to read the global state of the program, they cannot detect such global deadlock states. Using our synthesis method, we use high atomicity processes to identify the following high atomicity actions that are added to the program for recovery.

HAC0: (rs = 0) ∧ (rr = 0) ∧ (bs = 1) ∧ (br = 0) ∧ (cs = -1) ∧ (cr = -1)  →  cs := 1;
HAC1: (rs = 0) ∧ (rr = 0) ∧ (bs = 0) ∧ (br = 1) ∧ (cs = -1) ∧ (cr = -1)  →  cs := 0;
HAC2: (rs = 0) ∧ (rr = 0) ∧ (bs = 1) ∧ (br = 1) ∧ (cs = -1) ∧ (cr = -1)  →  cr := 1;
HAC3: (rs = 0) ∧ (rr = 0) ∧ (bs = 0) ∧ (br = 0) ∧ (cs = -1) ∧ (cr = -1)  →  cr := 0;

The guards of the above actions are global state predicates that we refine using linear distributed detectors. Let Gi be the guard of the action HACi, where 0 ≤ i ≤ 3. For example, we have G0 ≡ ((rs = 0) ∧ (rr = 0) ∧ (bs = 1) ∧ (br = 0) ∧ (cs = -1) ∧ (cr = -1)). Corresponding to each global state predicate Gi, we use a distributed detector with two elements dsi and dri, where dsi is the local detector installed on the sender side and dri is the local detector installed on the receiver side. Next, we show how we add a linear distributed detector for the detection of G0. We omit the presentation of the refinement of G1, G2, and G3 as it is similar to the refinement of G0.

Adding fault-tolerance components. Due to read restrictions, the sender (respectively, the receiver) cannot atomically detect G0.
However, the sender can detect a local condition LCs ≡ ((rs = 0) ∧ (bs = 1) ∧ (cs = -1)). Respectively, the receiver can detect a local condition LCr' ≡ ((rr = 0) ∧ (br = 0) ∧ (cr = -1)), where G0 ≡ (LCs ∧ LCr'). Now, we instantiate the required distributed detector by reusing the code of the pre-synthesized linear detectors presented in Section 6.3.

DAr0: (LCr') ∧ (yr' = false)               →  yr' := true;
DAs0: (LCs) ∧ (ys = false) ∧ (yr' = true)  →  ys := true;

The action DAs0 belongs to detector ds0, which is allowed to read the witness predicate yr' of the detector element dr0 on the receiver side. If the detector element dr0 detects its local predicate LCr' then it sets its witness predicate yr' to true. Then, if the condition LCs holds on the sender side, the detector element ds0 detects the global state predicate G0 by setting its witness predicate ys to true. Afterwards, the synthesis algorithm adds the following write action to the sender process.

Cs0: (ys = true)  →  cs := 1; ys := false;

The synthesis algorithm adds similar distributed detectors to ABP in order to refine the global state predicates G1, G2, and G3. Given the local conditions LCs' ≡ ((rs = 0) ∧ (bs = 0) ∧ (cs = -1)) and LCr ≡ ((rr = 0) ∧ (br = 1) ∧ (cr = -1)), we have the following logical equivalences:

• G1 ≡ (LCs' ∧ LCr)
• G2 ≡ (LCs ∧ LCr)
• G3 ≡ (LCs' ∧ LCr')

Corresponding to the global detection predicates G1 ... G3, we respectively add the following linear distributed detectors, along with the necessary correcting actions for recovery to the invariant. Note that each added component has its own variables for representing its witness predicates.

Detecting G1. This linear detector refines the guard of the action HAC1 added by our synthesis algorithm.

DAr1: (LCr) ∧ (yr = false)                  →  yr := true;
DAs1: (LCs') ∧ (ys' = false) ∧ (yr = true)  →  ys' := true;

Correcting G1. After the detection of G1, the following write action takes place.

Cs1: (ys' = true)  →  cs := 0; ys' := false;

Detecting G2.
We use the following linear detector to refine the guard of the action HAC2.

DAr2: (LCr) ∧ (ur = false) ∧ (us = true)  →  ur := true;
DAs2: (LCs) ∧ (us = false)                →  us := true;

Correcting G2. The following action, composed with the receiver, recovers the state of the ABP program to the invariant S_ABP after the detection of the global state predicate G2.

Cr2: (ur = true)  →  cr := 1; ur := false;

Detecting G3. To detect the global state predicate G3 (i.e., the guard of the high atomicity action HAC3), we add the following detector to ABP.

DAr3: (LCr') ∧ (ur' = false) ∧ (us' = true)  →  ur' := true;
DAs3: (LCs') ∧ (us' = false)                 →  us' := true;

Correcting G3. This action changes the state of the ABP program to a state in S_ABP after the detection of G3.

Cr3: (ur' = true)  →  cr := 0; ur' := false;

The fault-tolerant ABP program. Next, we present the actions of the sender process in the resulting nonmasking fault-tolerant program.

Send0': (rs = 1)   →  rs := 0; cs := bs; us' := false; us := false;
Send1': (cr ≠ -1)  →  rs := 1; cr := -1; bs := (bs + 1) mod 2; us' := false; us := false;
DAs0:   (LCs) ∧ (ys = false) ∧ (yr' = true)   →  ys := true;
Cs0:    (ys = true)   →  cs := 1; ys := false;
DAs1:   (LCs') ∧ (ys' = false) ∧ (yr = true)  →  ys' := true;
Cs1:    (ys' = true)  →  cs := 0; ys' := false;
DAs2:   (LCs) ∧ (us = false)                  →  us := true;
DAs3:   (LCs') ∧ (us' = false)                →  us' := true;

The synthesis algorithm has added new assignments to the actions Send0' and Send1' for the falsification of the witness predicates. For example, in action Send0', when cs is assigned a value other than -1, the predicates LCs and LCs' no longer hold. Thus, the witness predicates us' and us must be falsified.
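Why the falsifying assignments matter can be seen in a short experiment. The following snippet is my own illustration, not from the dissertation: if the sender omitted the added "us := false", a witness set during an earlier deadlock would survive into a state where its detection predicate no longer holds, and the corrector Cr2 could then inject a spurious acknowledgment.

```python
def LCs(rs, bs, cs):  # the sender's local condition for G0 and G2
    return rs == 0 and bs == 1 and cs == -1

def LCr(rr, br, cr):  # the receiver's local condition for G1 and G2
    return rr == 0 and br == 1 and cr == -1

# In a G0 deadlock, LCs holds, so DAs2 sets the witness us of detector ds2.
rs, bs, cs = 0, 1, -1
us = LCs(rs, bs, cs)                  # us := true
# Recovery from G0 and normal protocol steps follow. Suppose the sender has
# executed Send1 WITHOUT the added "us := false" and the run has reached:
rs, bs, cs = 1, 0, -1
rr, br, cr = 0, 1, -1
assert us and not LCs(rs, bs, cs)     # stale witness: us holds, LCs does not
# DAr2 is now enabled even though G2 does not hold globally, so Cr2 would
# fire and inject a spurious acknowledgment into cr:
ur = LCr(rr, br, cr) and us           # DAr2 fires on the stale witness
assert ur
```

With Send1' of the fault-tolerant program, the final assignment "us := false" removes the stale witness, restoring the detector's safety condition (a witness implies its detection predicate).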
The actions of the receiver in the synthesized fault-tolerant program are as follows:

Rec0: (cs ≠ -1)  →  cs := -1; rr := 1; br := (br + 1) mod 2; yr := false; yr' := false;
Rec1: (rr = 1)   →  rr := 0; cr := br; yr := false; yr' := false;
DAr0: (LCr') ∧ (yr' = false)                →  yr' := true;
DAr1: (LCr) ∧ (yr = false)                  →  yr := true;
DAr2: (LCr) ∧ (ur = false) ∧ (us = true)    →  ur := true;
Cr2:  (ur = true)   →  cr := 1; ur := false;
DAr3: (LCr') ∧ (ur' = false) ∧ (us' = true) →  ur' := true;
Cr3:  (ur' = true)  →  cr := 0; ur' := false;

Observe that in action Rec0 (respectively, Rec1), we falsify the witness predicates yr and yr' once the program changes the value of rr to 1 (respectively, the value of cr to 0 or 1). This falsification is necessary since once the condition (rr = 1) holds, the predicates LCr and LCr' no longer hold. Also, this example illustrates the case where we simultaneously add multiple pre-synthesized components to a distributed program to add fault-tolerance. We have verified the interference-freedom requirements using the SPIN model checker [36] to gain more confidence in the implementation of our synthesis framework, FTSyn (see Appendix A for the Promela [37] code of this example).

6.6 Adding Hierarchical Components

In this section, we show how we add components with a hierarchical topology to a diffusing computation program to provide recovery in the presence of faults. In the earlier sections, we showed how we apply the synthesis algorithm presented in this chapter to programs where the underlying communication topology between processes is linear. In this section, we show how we add hierarchical pre-synthesized components to distributed programs. Specifically, we add tree-structured components to a diffusing computation program whose processes are arranged in an out-tree, where the indegree of each node is at most one. A diffusing computation starts at the root and propagates throughout the tree, and then reflects back up to the root of the tree.
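Before developing the hierarchical case, the fault-tolerant ABP assembled above can be exercised end to end. The sketch below is my own illustration (it models only the G0 detector/corrector pair, which is the component this fault scenario triggers, and uses a fixed serial schedule of my choosing; `yrp` stands for the primed witness yr'): a data message is lost by fault F0, the linear detector dr0/ds0 detects the deadlock, and Cs0 re-sends the lost bit.

```python
def invariant(rs, bs, rr, br, cs, cr):
    """Membership in S_ABP as defined above."""
    return ((br == bs if (rr == 1 or cr != -1) else True) and
            (br != bs if (rs == 1 or cs != -1) else True) and
            (cs == -1 or cs == bs) and
            ((rr + rs == 1) if (cs == -1 and cr == -1) else True) and
            ((rr + rs == 0) if (cs != -1 and cr == -1) else True) and
            ((rr + rs == 0) if (cs == -1 and cr != -1) else True))

# Program state plus the witnesses of the G0 component.
s = dict(rs=1, bs=1, rr=0, br=0, cs=-1, cr=-1, ys=False, yrp=False)
s['rs'], s['cs'] = 0, s['bs']     # Send0': the sender transmits data bit 1
s['cs'] = -1                      # fault F0: the data message is lost
assert not invariant(s['rs'], s['bs'], s['rr'], s['br'], s['cs'], s['cr'])

def step(s):
    """Apply the first enabled action under a fixed priority; return its name."""
    LCs  = s['rs'] == 0 and s['bs'] == 1 and s['cs'] == -1
    LCrp = s['rr'] == 0 and s['br'] == 0 and s['cr'] == -1
    if s['cs'] != -1:             # Rec0 (with witness falsification)
        s.update(cs=-1, rr=1, br=(s['br'] + 1) % 2, yrp=False); return 'Rec0'
    if s['rr'] == 1:              # Rec1
        s.update(rr=0, cr=s['br'], yrp=False); return 'Rec1'
    if s['cr'] != -1:             # Send1'
        s.update(rs=1, cr=-1, bs=(s['bs'] + 1) % 2); return 'Send1'
    if LCrp and not s['yrp']:     # DAr0: the receiver witnesses LCr'
        s['yrp'] = True; return 'DAr0'
    if LCs and not s['ys'] and s['yrp']:  # DAs0: the sender detects G0
        s['ys'] = True; return 'DAs0'
    if s['ys']:                   # Cs0: re-send the lost data bit
        s.update(cs=1, ys=False); return 'Cs0'
    return None

trace = []
while not invariant(s['rs'], s['bs'], s['rr'], s['br'], s['cs'], s['cr']):
    trace.append(step(s))         # recovery: DAr0, DAs0, Cs0
for _ in range(3):                # delivery then resumes: Rec0, Rec1, Send1
    trace.append(step(s))
print(trace)   # ['DAr0', 'DAs0', 'Cs0', 'Rec0', 'Rec1', 'Send1']
```

Three detector/corrector steps return the program to S_ABP (the legitimate "message in flight" state), after which the ordinary protocol actions deliver the re-sent bit and the sender moves on to the next packet.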
The fault-intolerant program is subject to faults that perturb the state of the diffusing computation and the topology of the program (i.e., the parenting relationship amongst processes). This case study shows that the synthesis method presented in this chapter handles pre-synthesized components (respectively, distributed programs) with different topologies, as we have already reused a particular linear component in the synthesis of a token ring program and an alternating bit protocol in this chapter. Next, in Subsection 6.6.1, we describe how we formally represent a hierarchical fault-tolerance component. Subsequently, in Subsection 6.6.2, we show how we automatically add a hierarchical component to a diffusing computation program.

6.6.1 Specifying Hierarchical Components

In this section, we describe the representation of hierarchical fault-tolerance components (i.e., detectors and correctors). We focus on the representation of a detector with a tree-like structure as a special case of hierarchical detectors. The hierarchical detector d consists of n elements di (0 ≤ i < n), its specification spec_d (specified in Subsection 6.3.1), its variables, and its invariant U. We introduce a relation ≼ on the elements di that represents the parenting relation between the nodes of the tree; e.g., i ≼ j means di is the parent of dj. The element d0 is placed at the root of the tree and the other elements of the detector are placed at the other nodes of the tree. Each node di has its own detection predicate Xi and witness predicate Zi. The siblings of a node can detect their detection predicates in parallel. However, the truth-value of the detection predicate of each node depends on the truth-values of its children. In other words, node di can witness only if all its children have already witnessed.
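This bottom-up witness discipline can be sketched generically. The following is my own illustration (the inputs `parent` and `lc` are my encoding, not the dissertation's notation): each element witnesses once its local condition holds and all its children have witnessed, so the root's witness ends up equivalent to the conjunction of all local conditions.

```python
def run_detector(parent, lc):
    """parent[i] is the parent of element i (parent[0] == 0 for the root);
    lc[i] is the truth-value of element i's local condition LC_i.
    Repeatedly apply the template action
        DA_i: LC_i and (conjunction of children's y) and not y_i  ->  y_i := true
    until no element can witness; return the witness vector y."""
    n = len(parent)
    children = [[j for j in range(n) if parent[j] == i and j != i] for i in range(n)]
    y = [False] * n
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if lc[i] and not y[i] and all(y[j] for j in children[i]):
                y[i] = changed = True
    return y

# The out-tree of the upcoming diffusing-computation example: P1, P2 under P0; P3 under P2.
parent = [0, 0, 0, 2]
print(run_detector(parent, [True, True, True, True])[0])   # True: the root witnesses
print(run_detector(parent, [True, True, True, False])[0])  # False: leaf d3 blocks d2 and d0
```

Note that one false local condition at a leaf prevents every ancestor from witnessing, which is exactly the dependence on children described above.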
Each element di (0 ≤ i < n) of the detector has a Boolean variable yi that represents its witness predicate; i.e., the witness predicate Zi of each di is equal to (yi = true). Also, the element di can read/write the y values of its children and of its parent (0 ≤ i < n). Moreover, each element di is allowed to read the variables that Pi can read, where Pi is the process with which di is composed. Now, we present the template action of the detector di as follows ((0 ≤ i, j, k < n) ∧ (j ≤ k) ∧ (∀r : j ≤ r ≤ k : i ≼ r)):

DAi: (LCi) ∧ (yj ∧ ... ∧ yk) ∧ (yi = false)  →  yi := true;

Using action DAi (0 ≤ i < n), each element di of the hierarchical detector witnesses (i.e., sets the value of yi to true) whenever (i) the condition LCi becomes true, where LCi represents a local condition that di atomically checks (by reading the variables of Pi), and (ii) its children dj, ..., dk have already witnessed ((0 ≤ j, k < n) ∧ (j ≤ k)). The detection predicate Xi for element di is equal to (LCi ∧ LCj ∧ ... ∧ LCk). Therefore, d0 detects the global detection predicate LC0 ∧ ... ∧ LC(n-1).

The above action is an abstract template that the synthesis algorithm instantiates during the synthesis of a specific program in such a way that the program and the detector do not interfere. For automatic addition of nonmasking fault-tolerance, the interference-freedom of the program and the detector requires that (i) in the absence of faults, the program specification and the safety specification of the detectors are satisfied, and (ii) in the presence of faults, recovery is provided by the composition of the program and the detectors. During the detection, when di sets yi to true, its children have already set their y values to true.
Hence, we represent the invariant of the hierarchical detector by the predicate U, where

U = { s : (∀i : 0 ≤ i < n : (Zi ⇒ (LCi ∧ (∀j : i ≼ j : Zj)))) }

6.6.2 Diffusing Computation

In this section, we present the addition of a hierarchical pre-synthesized component to a fault-intolerant diffusing computation. We have adapted the diffusing computation program from [38]. First, in Subsection 6.6.2.1, we give the specification of the diffusing computation program. Then, in Subsection 6.6.2.2, we present the synthesized nonmasking fault-tolerant program before the addition of the hierarchical component, which includes high atomicity recovery actions. Finally, in Subsection 6.6.2.3, we show how we add pre-synthesized components to refine the high atomicity actions added during synthesis.

6.6.2.1 Diffusing Computation Program

The diffusing computation (DC) program consists of four processes {P0, P1, P2, P3} whose underlying communication is based on a tree topology. The process P0 is the root of the tree. Processes P1 and P2 are the children of P0 (i.e., (0 ≼ 1) ∧ (0 ≼ 2)) and P3 is the child of P2 (i.e., 2 ≼ 3). Starting from a state where every process is green, P0 initiates a diffusing computation throughout the tree by propagating the red color towards the leaves. The leaves reflect the diffusing computation back to the root by coloring the nodes green. Afterwards, when all processes become green again, the cycle of diffusing computation repeats. Each process Pj (0 ≤ j ≤ 3) has a variable cj that represents its color and whose domain is {0, 1}, where 0 represents red and 1 represents green. Also, process Pj has a Boolean variable snj that represents the session number of the diffusing computation in which Pj is currently participating. Thus, we use snj to distinguish the case where Pj has not started to participate in the current diffusing computation from the case where Pj has completed the current session of diffusing computation.
Moreover, each process has a variable parj that represents the parent of Pj. The domain of parj is {0, 1, 2, 3}. The value of parj identifies the node from which there exists an edge to Pj in the out-tree. For example, since the parent of P0 is itself, we have par0 = 0.

Program actions. The actions of the process Pj (0 ≤ j < 4) are as follows:

DCj1: (cj = 1) ∧ (parj = j)  →  cj := 0; snj := ¬snj;
DCj2: (cj = 1) ∧ (c_parj = 0) ∧ (snj ≢ sn_parj)  →  cj := c_parj; snj := sn_parj;
DCj3: (cj = 0) ∧ (∀k : (par_k = j) ⇒ ((c_k = 1) ∧ (snj ≡ sn_k)))  →  cj := 1;

Read/write restrictions. Each process Pj is allowed to read/write the variables of its children and of its parent. For example, process P0 can read/write its local variables and the local variables of P1 and P2. However, P0 is not allowed to read/write the variables of P3. Also, P3 cannot read/write the variables of P0 and P1.

Invariant. In each session of diffusing computation, every process Pj meets one of the following requirements: (i) Pj and P_parj have both started participating in the current session of diffusing computation; (ii) Pj and P_parj have both completed the current session of diffusing computation; (iii) Pj has not started participating in the current session whereas P_parj has; and (iv) Pj has completed participating in the current session whereas P_parj has not. Hence, the invariant of the program contains all states where S_DC holds, where

S_DC = (∀j : (0 ≤ j ≤ 3) : ((cj = c_parj ∧ snj ≡ sn_parj) ∨ (cj = 1 ∧ c_parj = 0)))
       ∧ (par0 = 0 ∧ par1 = 0 ∧ par2 = 0 ∧ par3 = 2)

Faults. Fault transitions can perturb the values of cj and snj (0 ≤ j ≤ 3), and the underlying communication topology of the program.
We represent the fault transitions by the following actions:

Fj:  (true)  →  cj := 0 | 1;
Fj': (true)  →  snj := false | true;
F0:  (true)  →  par0 := 0 | 1 | 2;

The actions Fj and Fj' represent the fault transitions that perturb a process Pj, whereas action F0 only affects P0. The class of faults F0 perturbs the parenting relationship by changing the value of par0 to one of the values {0, 1, 2}. We have included the fault-class F0 since it perturbs the DC program to states where we can demonstrate the advantages of using pre-synthesized components in dealing with deadlock states.

6.6.2.2 Intermediate Nonmasking Program

Now, we present the intermediate nonmasking fault-tolerant program that includes high atomicity recovery actions. We have synthesized this intermediate program using our software framework FTSyn (cf. Chapter 8). The faults may perturb the state of the DC program outside S_DC, where the program may fall into a non-progress cycle or reach a deadlock state. For example, faults F0 may perturb the program to states where the condition T_deadlock ≡ ((c0 = 1) ∧ (c1 = 1) ∧ (c2 = 1) ∧ (c3 = 1)) ∧ (par0 ≠ 0) holds. The state predicate T_deadlock represents states from where no program action is enabled; i.e., deadlock states. Now, to add recovery from a state in T_deadlock, FTSyn assigns a high atomicity process P_high_j to each process Pj (0 ≤ j < 4).

To illustrate our approach of adding hierarchical pre-synthesized detectors (respectively, correctors), we only focus on one of the high atomicity recovery actions added by process P_high_0, as the refinement of the other high atomicity actions is similar. The actions of the other high atomicity processes in the intermediate nonmasking program are available in Appendix A.
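The deadlock claim about T_deadlock is easy to check by executing the guards of DCj1-DCj3. The sketch below is my own illustration, not the dissertation's code: in an all-green state with par0 corrupted by F0, no program action is enabled.

```python
def enabled(c, sn, par):
    """Return the names of the enabled DC actions in state (c, sn, par)."""
    n, acts = 4, []
    for j in range(n):
        kids = [k for k in range(n) if par[k] == j and k != j]
        if c[j] == 1 and par[j] == j:                      # DCj1: root starts a session
            acts.append(f"DC{j}1")
        if c[j] == 1 and c[par[j]] == 0 and sn[j] != sn[par[j]]:   # DCj2: propagate red
            acts.append(f"DC{j}2")
        if c[j] == 0 and all(c[k] == 1 and sn[j] == sn[k] for k in kids):  # DCj3: reflect green
            acts.append(f"DC{j}3")
    return acts

c, sn = [1, 1, 1, 1], [0, 0, 0, 0]
print(enabled(c, sn, [0, 0, 0, 2]))   # ['DC01']: in a legitimate all-green state,
                                      # only the root can start the next session
print(enabled(c, sn, [2, 0, 0, 2]))   # []: fault F0 set par0 := 2, so T_deadlock
                                      # holds and no program action is enabled
```

With par0 ≠ 0, the root's guard (par0 = 0) fails and no other guard can fire in an all-green state, which is exactly why the high atomicity recovery action and its refining detectors are needed.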
The action H AC is as follows: HAC : (co = 1) A (c1 = 1) A (c2 = 1) A (C3 = 1) A (sno = 1) A ((paro = 2) V (paro = 1)) A ((sn3 = 0) V (snl = 0) V (8712 = 0)) ——> sno := 0; The guard of H AC identifies a subset of Tdeadlock for which H AC provides recovery to states from where recovery to S 00 has been already established. The write action performed by H AC is a local write operation in process Po, whereas the guard of H AC is a global state predicate that should be refined in the distributed program. Thus, we only need to add detectors for the refinement of the guard of H AC. In the next subsection, we show how FTSyn uses the guard of H AC to automatically specify the required detectors. 6.6.2.3 Adding Pre—synthesized Detectors To refine the guard of H AC, the synthesis algorithm presented in this chater auto- matically identifies the interface of the required component. The component interface is a triple (X, R, i), where X is the detection predicate of the required component, R is a relation that represents the topology of the required component, and i is the index of the process that performs the local write action after the detection of X. For example, for action H AC, X is equal to the state predicate Xo as we describe next in this section, R is a set of pairs where each pair represents the existence of a com- munication link between two processes, and i is equal to 0 since Po should perform 131 the local writ I Using the rithm queries the option of of H AC and helps in minii: ponent adds it example. in 111 one componen by the guard . variables that . detectors (1 an X0 5 (If-3 : I X, E l (3123 The preSVU and d3 (res M03133). p (X? l. to the topologV lead/ll Write I“. 1.11. ' -« a . The synthesi at ' tion presented P . . ..ontlmons are 1 a1 «was (((‘3 = (It - - . L’Sll BSLC: I ]r the local write action. Using the interface of the required pre-synthesized component, the synthesis algo- rithm queries an existing library of pre-synthesized components. 
At this step, we have the option of supervising the synthesis algorithm in that we can observe the guard of HAC and manually identify the required components. This manual intervention helps in minimizing the number of components added to the program, since each component adds its associated variables to the program and expands the state space. For example, in the case of action HAC, the synthesis algorithm automatically identifies one component corresponding to each deadlock state in the set of states represented by the guard of HAC, whereas by manual intervention, we observe that the only variables that are not readable for P0 are c3 and sn3. Hence, we add two distributed detectors d and d' to simultaneously detect the predicates X0 and X0', where

X0 ≡ ((c3 = 1) ∧ (c0 = 1) ∧ (c1 = 1) ∧ (c2 = 1) ∧ (sn0 = 1) ∧ ((par0 = 2) ∨ (par0 = 1)))
X0' ≡ ((sn3 = 0) ∧ (c0 = 1) ∧ (c1 = 1) ∧ (c2 = 1) ∧ (sn0 = 1) ∧ ((par0 = 2) ∨ (par0 = 1)))

The pre-synthesized detector d (respectively, d') includes four elements d0, d1, d2, and d3 (respectively, d0', d1', d2', and d3'), where d_i (respectively, d_i') is composed with P_i (0 ≤ i ≤ 3). Thus, the topologies of the distributed detectors d and d' are similar to the topology of the DC program. Also, the parenting relationship (respectively, read/write restrictions) between d0, d1, d2, and d3 (respectively, d0', d1', d2', and d3') follows the parenting relationship (respectively, read/write restrictions) of P0, P1, P2, and P3.

The synthesis algorithm automatically instantiates an instance of the template action presented in Section 6.6.1 with the appropriate local condition. The local conditions are automatically identified based on the set of readable variables of each process. For example, the part of X0 that is readable for detector d3 is identified as LC3 ≡ ((c3 = 1) ∧ (c2 = 1)). Thus, the instantiation of the template action for detector d3 results in the following action:
D31 : (c3 = 1) ∧ (c2 = 1) ∧ (y3 = false) → y3 := true;

Likewise, the part of X0' that is readable for detector d3' is automatically identified as LC3' ≡ ((sn3 = 0) ∧ (c2 = 1)). Hence, the action of d3' is as follows:

D31' : (sn3 = 0) ∧ (c2 = 1) ∧ (y3' = false) → y3' := true;

The detector d3 (respectively, d3') sets y3 (respectively, y3') to true if the local condition LC3 (respectively, LC3') holds and y3 (respectively, y3') is false. The predicate Z3 ≡ (y3 = true) (respectively, Z3' ≡ (y3' = true)) is the witness predicate of d3 (respectively, d3'), and the predicate X3 ≡ LC3 (respectively, X3' ≡ LC3') constructs the detection predicate of d3 (respectively, d3'). Note that since d3 (respectively, d3') is the leaf of the tree, it does not have any children to wait for before it witnesses. Next, we present the actions of d2 and d2' (i.e., actions D21 and D21') as follows:

D21 : (y3 = true) ∧ (c2 = 1) ∧ (sn0 = 1) ∧ (c0 = 1) ∧ ((par0 = 2) ∨ (par0 = 1)) ∧ (y2 = false) → y2 := true;

D21' : (y3' = true) ∧ (c2 = 1) ∧ (sn0 = 1) ∧ (c0 = 1) ∧ ((par0 = 2) ∨ (par0 = 1)) ∧ (y2' = false) → y2' := true;

The recovery action that replaces HAC executes the statement:

sn0 := 0; y0 := false; y0' := false; y2 := false; y2' := false;

When the program executes the above recovery action, the predicates X0 and X2 (respectively, X0' and X2') no longer hold. Thus, the witness predicates of d0 and d2 (respectively, d0' and d2') must be falsified; i.e., y0 and y2 (respectively, y0' and y2') should become false.

The composition of the DC program and the pre-synthesized detectors. Now, we present the actions of the process P0 of the nonmasking DC program that is a composition of the actions of the pre-synthesized detectors and the actions of the processes in the intermediate fault-intolerant program.
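The need for falsifying witness predicates can be shown concretely: once the recovery statement sets sn0 to 0, the detection predicate X0 (which contains sn0 = 1) no longer holds, so leaving y0 true would let d0 witness incorrectly. A minimal sketch under our own state encoding:

```python
# Why the recovery statement also falsifies witness predicates.

def X0_holds(s):
    """X0 ≡ (c3=1) ∧ (c0=1) ∧ (c1=1) ∧ (c2=1) ∧ (sn0=1) ∧ (par0 ∈ {1,2})."""
    return (s["c3"] == 1 and s["c0"] == 1 and s["c1"] == 1 and s["c2"] == 1
            and s["sn0"] == 1 and s["par0"] in (1, 2))

s = {"c0": 1, "c1": 1, "c2": 1, "c3": 1, "sn0": 1, "par0": 2, "y0": True}
assert X0_holds(s)
s["sn0"] = 0          # the recovery statement executes: sn0 := 0
s["y0"] = False       # ... and must falsify the witness predicate y0
assert not X0_holds(s) and s["y0"] is False
```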
Since the actions of P1 and P2 are structurally similar to P0's actions, we refer the interested reader to Appendix A for the actions of P1 and P2. Note that since no detection is done by d1, the synthesized program does not have any new actions in process P1. Thus, the actions of P1 remain similar to the fault-intolerant program. The actions of process P0 composed with the actions of d0, d0', and the recovery action Rec are as follows:

DC01 : (c0 = 1) ∧ (par0 = 0) → c0 := 0; y0 := false; y0' := false;

DC02 : (c0 = 1) ∧ (c_par0 = 0) ∧ (sn0 ≠ sn_par0) → c0 := c_par0; sn0 := sn_par0; if ((c0 = 0) ∧ (y0 = true)) then y0 := false; y0' := false;

DC03 : (c0 = 0) ∧ (∀k : (par_k = 0) ⇒ (c_k = 1 ∧ sn0 = sn_k)) → c0 := 1; if (((y1 = false) ∨ (y2 = false)) ∧ (y0 = true)) then y0 := false; if (((y1' = false) ∨ (y2' = false)) ∧ (y0' = true)) then y0' := false;

D01 : (y1 = true) ∧ (y2 = true) ∧ LC0 ∧ (y0 = false) → y0 := true;

D01' : (y1' = true) ∧ (y2' = true) ∧ LC0' ∧ (y0' = false) → y0' := true;

Rec : (y0 = true) ∧ ((y0' = true) ∨ (sn1 = 0) ∨ (sn2 = 0)) → sn0 := 0; y0 := false; y0' := false; y2 := false; y2' := false;

The actions of process P0 are composed with the actions of detectors d0 and d0' (i.e., D01 and D01') and the recovery action Rec presented in this section. Observe that the statements of actions DC01 and DC02 of P0 are composed with assignments that falsify the witness predicates of the corresponding detectors. Such falsification of the witness predicates is necessary so that program execution preserves the safety of detectors. For example, when c0 becomes 0, the state predicate LC0 no longer holds. Thus, the witness predicate y0 must be falsified to ensure the interference-freedom of the program and the pre-synthesized detectors. Interference-freedom.
The interference-freedom requirement obliges the synthesized program to provide recovery in the presence of faults, and to satisfy the specification of the DC program in the absence of faults. In the presence of faults, if faults perturb the program outside the invariant S_DC, then the synthesized program satisfies the requirements of nonmasking fault-tolerance; i.e., recovery to S_DC is guaranteed. In the absence of faults, the added detectors do not interfere with the program execution. Thus, in the absence of faults, the above program satisfies the specification of the diffusing computation program and the safety of detectors.

We would like to note that when faults occur, fault transitions may directly violate the safety specification of detectors; e.g., after d3 witnesses that (c3 = 1) holds, faults may change the value of c3 to 0, and as a result, d3 witnesses incorrectly; i.e., the safety of d3 will be violated by fault transitions. Since nonmasking fault-tolerance only requires recovery to the invariant, the violation of safety does not violate the nonmasking fault-tolerance property. Thus, the only requirement is that the composition of the program and the pre-synthesized detectors provides recovery in the presence of faults.

Although the synthesized nonmasking program is correct by construction, we verified the interference-freedom requirements of the above program in the SPIN model checker to gain more confidence in the implementation of the framework FTSyn presented in Chapter 8. We refer the reader to Appendix A for the source of the Promela model.

6.7 Discussion

In this section, we address some of the questions raised by our synthesis method.
Specifically, we discuss the following issues: the fault-tolerance of the components, the choice of detectors and correctors, and pre-synthesized components with non-linear topologies.

Can the synthesis method deal with faults that affect the fault-tolerance components? Yes. The added component may itself be perturbed by the faults to which fault-tolerance is added. Hence, the added component must itself be fault-tolerant. For example, in our token ring program, we modeled the effect of the process restart on the added component and ensured that the component is fault-tolerant to that fault (cf. Theorem 6.1). For the fault-classes that are commonly used, e.g., process failure, process restart, input corruption, and Byzantine faults, such modeling is always possible. For arbitrary fault-classes, however, some validation may be required to ensure that the modeling is appropriate for that fault.

How does the choice of detectors and correctors help in the synthesis of fault-tolerant programs? While there are several approaches (e.g., [39]) that manually transform a fault-intolerant program into a fault-tolerant program, we use detectors and correctors in this chapter based on their necessity and sufficiency for manual addition of fault-tolerance [18]. The authors of [18] have also shown that detectors and correctors are abstract enough to generalize other components (e.g., comparators and voters used in replication-based approaches) for the design of fault-tolerant programs.
Hence, we expect that our synthesis method can benefit from the generality of detectors and correctors in the automated synthesis of fault-tolerant programs, as there is a potential to provide a rich library of fault-tolerance components. Moreover, pre-synthesized detectors provide the kind of abstraction by which we can integrate efficient existing detection approaches (e.g., [40, 41]) in pre-synthesized fault-tolerance components.

Does the synthesis method support pre-synthesized components with non-linear topologies? Yes. As we demonstrated in Sections 6.5 and 6.6.2, we have applied the synthesis method of this chapter to add pre-synthesized fault-tolerance components with linear and hierarchical topologies. These examples show the applicability of our synthesis method for distributed programs (respectively, distributed fault-tolerance components) with linear and hierarchical topologies.

In the token ring example, will the synthesis succeed if we select PS_index (1 ≤ index ≤ 3), instead of PS0, as the pseudo process that adds a high atomicity recovery transition from the deadlock state s_d = (⊥, ⊥, ⊥, ⊥)? Yes. We argue that if we select a detector d with the arrangement d_(index-1), ..., d0, d3, ..., d_index, where index ≠ 0, then the synthesis will succeed and the detector d will not interfere with the token ring program. In this arrangement, the element d_(index-1) is allowed to read and write y_(index-1). Every element d_j, 0 ≤ j < index-1, is allowed to read y_j and y_(j+1), and write y_j. d0 is allowed to read y0 and y3, and write y0. Elements d_k, index ≤ k < 3, are allowed to read y_k and y_(k+1), and write y_k.
Using the above arrangement, Z_index witnesses the detection predicate X ≡ ((x0 = ⊥) ∧ (x1 = ⊥) ∧ (x2 = ⊥) ∧ (x3 = ⊥)), and afterwards, PS_index adds a high atomicity recovery action to the program. The proof of non-interference is similar to the case where PS0 is selected as the pseudo process that adds the high atomicity action.

In the token ring example, will the synthesis succeed if we add a sequential detector with a different linear order d0 ··· d3, where Z_i witnesses the detection predicate X ≡ ((x0 = ⊥) ∧ (x1 = ⊥) ∧ (x2 = ⊥) ∧ (x3 = ⊥))? No. We show that if we use the above order, then the Interfere algorithm returns true as the set of interfering transitions it computes becomes non-empty; i.e., the execution of the token ring program interferes with the added pre-synthesized component. In a state s = (⊥, ⊥, 0, 0), the elements d0 and d1 of the linear detector witness their detection predicates X0 and X1, where X0 ≡ (x0 = ⊥) and X1 ≡ ((x0 = ⊥) ∧ (x1 = ⊥)). Now, if P0 executes and sets x0 to 1, then X1 no longer holds. As a result, the program reaches a state where d1 incorrectly witnesses its detection predicate and violates the specification of the linear detector.

6.8 Summary

In this chapter, we presented an approach for the synthesis of a fault-tolerant program from its fault-intolerant version and pre-synthesized fault-tolerance components. Specifically, we presented an algorithm for automatic specification of the required fault-tolerance components during the synthesis. We also presented a sound algorithm for automatic addition of pre-synthesized fault-tolerance components to a distributed program. Before adding a component, we verified the interference-freedom of the composition of the program and the fault-tolerance component.
Using our synthesis algorithm, we showed how we could add masking fault-tolerance to a token-ring program where all processes might be corrupted. By contrast, previous work on automatic addition of fault-tolerance to the token ring program assumed that at least one process is not corrupted. Also, we demonstrated how we reuse the same component used in the synthesis of the token ring program for the synthesis of an alternating bit protocol that is nonmasking fault-tolerant to message loss faults. Moreover, we showed that our synthesis method is applicable for adding pre-synthesized components with different topologies (e.g., linear and hierarchical), where we added tree-like components to a diffusing computation program.

Chapter 7

Automated Synthesis of Multitolerance

In this chapter, we focus on automated synthesis of multitolerant programs. Such automated synthesis has the advantage of generating fault-tolerant programs that (i) tolerate multiple classes of faults, and (ii) are correct by construction. Automatic synthesis of multitolerance is desirable as (i) today's systems are often subject to multiple classes of faults, and (ii) it is often undesirable or impractical to provide the same level of fault-tolerance to each class of faults. Hence, these systems need to tolerate multiple classes of faults, and (possibly) provide a different level of fault-tolerance to each class. To characterize such systems, the notion of multitolerance was introduced in [34]. The importance of such multitolerant systems can be easily observed from the fact that several methods for designing multitolerant programs, as well as several instances of multitolerant programs, can be readily found in the literature (e.g., [11, 12, 13, 34]). We focus on automated synthesis of high atomicity multitolerant programs in a stepwise fashion.
Specifically, we (i) present a sound and complete stepwise algorithm for the case where we add nonmasking fault-tolerance to one class of faults and masking fault-tolerance to another class of faults, and (ii) present a sound and complete stepwise algorithm for the case where we add failsafe fault-tolerance to one class of faults and masking fault-tolerance to another class of faults. The complexity of these algorithms is polynomial in the state space of the fault-intolerant program. For the case where failsafe fault-tolerance is added to one fault-class and nonmasking fault-tolerance is added to another fault-class, we find a somewhat surprising result: this problem is NP-complete. This result is surprising in that automating the addition of failsafe and nonmasking fault-tolerance to the same class of faults can be performed in polynomial time. However, addition of failsafe fault-tolerance to one class of faults and nonmasking fault-tolerance to a different class of faults is NP-complete.

In the rest of this chapter, we proceed as follows: In Section 7.1, we present the formal definition of multitolerance and the problem of synthesizing a multitolerant program from a fault-intolerant program. Subsequently, in Section 7.2, we recall the relevant properties of the algorithms in Section 2.7 that we use in automated addition of multitolerance. In Section 7.3, we present a sound and complete algorithm for the synthesis of multitolerant programs that provide nonmasking-masking multitolerance. Then, in Section 7.4, we present a sound and complete algorithm for the synthesis of multitolerant programs that provide failsafe-masking multitolerance.
In Section 7.5, we present the NP-completeness proof for the case where failsafe-nonmasking multitolerance is added to fault-intolerant programs. Finally, in Section 7.6, we make concluding remarks and discuss future work.

7.1 Problem Statement

In this section, we formally define the problem of synthesizing multitolerant programs from their fault-intolerant versions. Before defining the synthesis problem, we present our definition of multitolerance; i.e., we identify what it means for a program to be multitolerant in the presence of multiple classes of faults.

As mentioned in Section 2.5, a failsafe/nonmasking/masking fault-tolerant program guarantees to provide a desired level of fault-tolerance (i.e., failsafe/nonmasking/masking) in the presence of a specific class of faults. Now, we consider the case where faults from multiple fault-classes, say f1 and f2, occur in a given program computation.

There exist several possible choices in deciding the level of fault-tolerance that should be provided in the presence of multiple fault-classes. One possibility is to provide no guarantees when f1 and f2 occur in the same computation. With such a definition of multitolerance, the program would provide fault-tolerance if faults from f1 occur or if faults from f2 occur. However, no guarantees will be provided if both faults occur simultaneously. Another possibility is to require that the fault-tolerance provided for the case where f1 and f2 occur simultaneously should be equal to the minimum level of fault-tolerance provided when either f1 occurs or f2 occurs.
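This minimum-level rule can be encoded as set intersection over the guarantees each level provides (failsafe guarantees safety, nonmasking guarantees recovery, masking guarantees both). A small sketch of that encoding:

```python
# Minimum level of fault-tolerance when two fault classes co-occur,
# computed as the intersection of the guarantees of the two levels.

GUARANTEES = {
    "failsafe": {"safety"},
    "nonmasking": {"recovery"},
    "masking": {"safety", "recovery"},
}
NAME = {frozenset(v): k for k, v in GUARANTEES.items()}
NAME[frozenset()] = "intolerant"

def min_tolerance(a, b):
    """Level guaranteed when faults tolerated at levels a and b occur together."""
    return NAME[frozenset(GUARANTEES[a] & GUARANTEES[b])]

assert min_tolerance("failsafe", "nonmasking") == "intolerant"
assert min_tolerance("masking", "failsafe") == "failsafe"
assert min_tolerance("masking", "nonmasking") == "nonmasking"
```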
For example, if masking fault-tolerance is provided to f1 and failsafe fault-tolerance is provided to f2, then failsafe fault-tolerance should be provided for the case where f1 and f2 occur simultaneously. However, if nonmasking fault-tolerance is provided to f1 and failsafe fault-tolerance is provided to f2, then no level of fault-tolerance will be guaranteed for the case where f1 and f2 occur simultaneously. We note that this assumption is not required in our proof of NP-completeness in Section 7.5. In our definition, we follow the latter approach. The following table illustrates the minimum level of fault-tolerance provided for different combinations of levels of fault-tolerance provided to individual classes of faults.

Fault-Tolerance   Failsafe     Nonmasking   Masking
Failsafe          Failsafe     Intolerant   Failsafe
Nonmasking        Intolerant   Nonmasking   Nonmasking
Masking           Failsafe     Nonmasking   Masking

In a special case, consider the situation where failsafe fault-tolerance is provided
Now, given (the transitions of ) a fault-intolerant program, p, its invariant, S, its specification, spec, and a set of distinct classes of faults ffausafe, fnmmasking, and fmaskmg, we define what it means for a synthesized program p’, with invariant S", to be multitolerant by considering how p’ behaves when (i) no faults occur; (ii) only one class of faults happens, and (iii) multiple classes of faults happen. Definition. Program p’ is multitolerant to f 10,130 f6, fnonmashm, and _f'ma.,k,,,g from S’ for spec iff (if and only if) the following conditions hold: 1. p’ satisfies spec from S" in the absence of faults. 2. p’ is masking fmask,,,g-tolerant from S’ for spec. 3. p’ is failsafe ( f [0,130 fe U fmasking)-tolerant from S" for spec. 4. p’ is nonmasking (fnonmasking U fmaskinghtolerant from S’ for spec. C] Remark. Since every program is failsafe/ nonmasking/ masking fault-tolerant to a class of faults whose set of transitions is empty, the above definition generalizes the cases where one of the classes of faults is not specified (e.g., fmaski-ng = {}). Now, using the definition of multitolerant programs, we identify the requirements of the problem of synthesizing a multitolerant program, p’, from its fault-intolerant version, p. The problem statement is motivated by the goal of simply adding multi- tolerance and introducing no new behaviors in the absence of faults. This problem statement is the natural extension to the problem statement in Section 2.6 where fault-tolerance is added to a single class of faults. 143 Since following if there 6 and crea of satisfy If p']5’ i1 ways for : synthesis The l\‘Iul Given p, 1 Identify I 5’ g s P']S’ g I . P IS 1111 We State r] The DeCiE Chen 1) DOG: the , Since we require p’ to behave similar to p in the absence of faults, we stipulate the following conditions: First, we require S’ to be a subset of S (i.e., S’ Q S). Otherwise, if there exists a state 8 E S’ where s g! 
S then, in the absence of faults, p’ can reach 3 and create new computations that do not belong to p. Thus, p’ will include new ways of satisfying spec from s in the absence of faults. Second, we require (p’IS’) Q (p[S’). If p’IS’ includes a transition that does not belong to pIS’ then p’ can include new ways for satisfying spec in the absence of faults. Thus, the problem of multitolerance synthesis is as follows: The Multitolerance Synthesis Problem Given 1), S, spec, ffailsafea fnonmaskmg, and fmasking Identify p’ and S’ such that S’ g s,’ p’IS’ Q pIS’, and p’ is multitolerant to f failsa fe, fnmmasking, and fmasking from S’ for spec. [I We state the corresponding decision problem as follows: The Decision Problem Given 1), 5, SPEC, ffailsafea fnonmaskinga and fmasking: Does there exist a program p’, with its invariant S’ that satisfies the requirements of the synthesis problem? Cl 7 .2 Addition of Fault-Tolerance to One Fault- Class In the synthesis of multitolerant programs, we reuse algorithms Add-Failsafe, Add-Nonmasking, and Add_Masking, presented by Kulkarni and Arora [1] (cf. Section 144 2,7). The to a sing/ in this se< The a specificati with the ; following 1 nonmaskin The in property ( masking) I program. invariant 5 5” S S’. A from any st; of the fault- Observatic 0f Addfaus; invariant Sn fme Sr {01‘ ‘5- 2.7). These algorithms respectively add failsafe/ nonmasking/ masking fault-tolerance to a single class of faults. Hence, we recall the relevant properties of these algorithms in this section. The algorithms represented in Section 2.7 take a program p, its invariant S, its specification spec, a class of faults f, and synthesize an f—tolerant program p’ (if any) with the invariant S’. The synthesized program p’ and its invariant S’ satisfy the following requirements: (i) S’ Q S; (ii) p’ IS’ C; plS’, and (iii) p’ is failsafe (respectively, nonmasking or masking) f-tolerant from S’ for spec. 
The invariant S’, calculated by Add-Fai|safe (respectively, Add_Masking), has the property of being the largest such possible invariant for any failsafe (respectively, masking) program obtained by adding fault-tolerance to the given fault-intolerant program. In other words, if there exists a failsafe fault-tolerant program p”, with invariant S” that satisfies the above requirements for adding fault-tolerance then S” Q S’. Also, if no sequence of fault transitions can violate the safety of specification from any state inside S then Add-Failsafe (cf. Section 2.7) will not change the invariant of the fault-intolerant program. Hence, we make the following observations: Observation 7.1. Let the input for Add_Fai|safe be p, S, spec and f. Let the output of Add-Fai|safe be fault-tolerant program p’ and invariant S’. If any program p” with invariant 8” satisfies (i) S” Q S; (ii) p”|S” g pIS”, and (iii) p” is failsafe f-tolerant from S’ for spec then S” _C_ S’. D Observation 7 .2. Let the input for Add_Failsafe be p, S, spec and f. Let the output of Add_Failsafe be fault-tolerant program p’ and invariant S’. Unless there exists states in S from where a sequence of f transitions alone violates safety, S’= S. [:1 Likewise, the f—span of the masking f—tolerant program, say T’, synthesized by the algorithm Add-Masking (cf. Section 2.7) is the largest possible f—span. Thus, we make the following observation: Observation 7.3. Let the input for Add_Masking be p, S, spec and f. Let the 145 output Of If any Pro masking f the mask” The all the inVaria ObserVat Observatj Based ‘ rithmS Add the outl)Ut a 5272915 cla exists. Theorem I sound and < l 7.3 N1 In this secti grams that respectively our S)’l1lh€8li Given a synthesize a frnasking- Bl both f,, 071 "IOU \- kl lie prom output of Add-Masking be fault—tolerant program 19’, invariant S’, and fault-span T’. 
If any program p” with invariant S” satisfies (i) S” Q S; (ii) p”|S" Q plS”, (iii) p” is masking f-tolerant from S’ for spec, and (iv) T” is the fault-span used for verifying the masking fault-tolerance of p” then S” _C_ S’ and T” g T’. C] The algorithm Add-Nonmasking only adds recovery transitions from states outside the invariant S to S. Thus, we make the following observations: Observation 7.4. Add_Nonmasking does not add or remove any state of S. 0 Observation 7 .5. Add_Nonmasking does not add or remove any transition of pIS. El Based on the Observations 7.1- 7.5, Kulkarni and Arora [1] show that the algo- rithms Add_Failsafe, Add_Nonmasking, and Add_Masking are sound and complete, i.e., the output of these algorithms satisfy the requirements for adding fault-tolerance to a single class of faults and these algorithms can find a fault-tolerant program if one exists. Theorem 7.6. The algorithms Add_Fai|safe, Add-Nonmasking, and Add_Masking are sound and complete. [:1 7 .3 Nonmasking-Masking Multitolerance In this section, we present an algorithm for stepwise synthesis of multitolerant pro- grams that are subject to two classes of faults fnmmasking and fmasking for which respectively nonmasking and masking fault-tolerance is required. We also show that our synthesis algorithm is sound and complete. Given a program p, with its invariant S, its specification spec, our goal is to synthesize a program p’, with invariant S’ that is multitolerant to fnmmashng and fmasking- By definition, p’ must be masking fmasking—tolerant. In the presence of both fnmmaskmg and fmaskmg (i.e., fnmmaskmg U fmaskmg), 13’ must provide nonmasking fnonmasking U fmasking’tOlerance. We proceed as follows: Using the algorithm Add-Masking, we synthesize a masking 146 fnm-5k1n9-tl program 1 from ever} perturbed tolerance I to states 5 on the Ob recovery b; tolerance. Figure 7.1. 
f_masking-tolerant program p1, with invariant S' and fault-span T_masking. Now, since program p1 is masking f_masking-tolerant, it provides safe recovery to its invariant S' from every state in (T_masking - S'). Thus, in the presence of f_nonmasking ∪ f_masking, if p1 is perturbed to (T_masking - S'), then p1 will satisfy the requirements of nonmasking fault-tolerance (i.e., recovery to S'). However, if f_nonmasking ∪ f_masking transitions perturb p1 to states s, where s ∉ T_masking, then recovery must be added from those states. Based on Observations 7.4 and 7.5, it suffices to add recovery to T_masking, as the recovery provided by p1 from T_masking to S' can be reused even after adding nonmasking fault-tolerance. Thus, the synthesis algorithm Add_Nonmasking_Masking is as shown in Figure 7.1.

Add_Nonmasking_Masking(p: transitions, f_nonmasking, f_masking: fault,
                       S: state predicate, spec: safety specification)
{
  p1, S', T_masking := Add_Masking(p, f_masking, S, spec);
  if (S' = {}) declare no multitolerant program p' exists; return ∅, ∅;
  p', T' := Add_Nonmasking(p1, f_nonmasking ∪ f_masking, T_masking, spec);
  return p', S';
}

Figure 7.1: Synthesizing nonmasking-masking multitolerance.

Now, in Theorem 7.7, we show the soundness of Add_Nonmasking_Masking, i.e., we show that the output of Add_Nonmasking_Masking satisfies the requirements of the problem statement in Section 7.1. Subsequently, in Theorem 7.8, we show the completeness of Add_Nonmasking_Masking, i.e., we show that if a multitolerant program can be designed for the given fault-intolerant program, then Add_Nonmasking_Masking will not declare failure.

Theorem 7.7. The algorithm Add_Nonmasking_Masking is sound.

Proof. Based on the soundness of Add_Masking (cf. Theorem 7.6), S' ⊆ S. Also, using the soundness of Add_Masking, we have p1|S' ⊆ p|S'. In addition, based on Observation 7.5, we have p1|S' = p'|S'. As a result, we have p'|S' ⊆ p|S'.
Now, we show that p' is multitolerant to f_nonmasking and f_masking from S' for spec:

1. Absence of faults. From the soundness of Add_Masking, it follows that p1 satisfies spec from S' in the absence of faults. Since Add_Nonmasking does not add (respectively, remove) any transitions to (respectively, from) p1|S' (cf. Observation 7.5), it follows that p' satisfies spec from S'.

2. Masking f_masking-tolerance. From the soundness of Add_Masking, p1 is masking f_masking-tolerant from S' for spec. Also, based on Observations 7.4 and 7.5, Add_Nonmasking preserves the masking f_masking-tolerance property of p1, since p1|T_masking = p'|T_masking. Therefore, p' is masking f_masking-tolerant from S' for spec.

3. Nonmasking (f_nonmasking ∪ f_masking)-tolerance. From the soundness of Add_Nonmasking, we know that p' is nonmasking (f_nonmasking ∪ f_masking)-tolerant from T_masking for spec. Also, based on Observations 7.4 and 7.5, Add_Nonmasking preserves the masking f_masking-tolerance property of p1, since p1|T_masking = p'|T_masking. Thus, recovery from T_masking to S' is guaranteed in the presence of f_nonmasking ∪ f_masking. Therefore, p' is nonmasking (f_nonmasking ∪ f_masking)-tolerant from S' for spec.

Based on the above discussion, it follows that p' is multitolerant to f_nonmasking and f_masking from S' for spec. Therefore, Add_Nonmasking_Masking is sound. □

Theorem 7.8. The algorithm Add_Nonmasking_Masking is complete.

Proof. Add_Nonmasking_Masking declares that a multitolerant program does not exist only when Add_Masking does not find a masking f_masking-tolerant program. Since the synthesized program must be masking f_masking-tolerant, from the completeness of Add_Masking, the completeness of Add_Nonmasking_Masking follows. □
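The stepwise structure of this section can be sketched in a few lines of explicit-state code. The Add_Masking stage is stubbed out (its output triple is assumed given), so that the key point remains visible: Add_Nonmasking only bolts recovery transitions onto states outside the fault-span, leaving the invariant and the transitions inside the fault-span untouched. The single-step recovery below is a simplifying assumption.

```python
# A toy sketch of the stepwise composition of Figure 7.1.

def add_nonmasking(p, T, states):
    """Add one-step recovery transitions from states outside T back into T."""
    return p | {(s, min(T)) for s in states - T}

def add_nonmasking_masking(p1, S1, T_masking, states):
    # (p1, S1, T_masking): assumed output of Add_Masking for f_masking.
    if not S1:
        return None  # no multitolerant program exists
    return add_nonmasking(p1, T_masking, states), S1

states = {0, 1, 2, 3, 4}
p1, S1, T = {(0, 1), (1, 0), (2, 0)}, {0, 1}, {0, 1, 2}
p_prime, S_prime = add_nonmasking_masking(p1, S1, T, states)

assert S_prime == S1                                             # invariant unchanged
assert {(a, b) for (a, b) in p_prime if a in T and b in T} == p1  # p1|T_masking reused
assert all(b in T for (a, b) in p_prime - p1)                     # new recovery targets T_masking
```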
7.4 Failsafe-Masking Multitolerance

In this section, we investigate the stepwise synthesis of programs that are multitolerant to two classes of faults f_failsafe and f_masking, for which we respectively require failsafe and masking fault-tolerance. We present a sound and complete algorithm for synthesizing failsafe-masking multitolerant programs.

Let p be the input fault-intolerant program with its invariant S and its specification spec, and let p' be the synthesized multitolerant program with its invariant S'. Since the multitolerant program p' must maintain safety of spec from every reachable state in the computations of p'[](f_failsafe ∪ f_masking), p' must not reach a state from where safety is violated by a sequence of f_failsafe ∪ f_masking transitions. Hence, we calculate a set of states, say ms (cf. Figure 7.2), from where safety of spec is violated by a sequence of transitions of f_failsafe ∪ f_masking. Also, p' must not execute transitions that take p' to a state in ms. Hence, we define mt to include these transitions as well as the transitions that violate safety of spec.

Now, since p' should be masking f_masking-tolerant, we use the algorithm Add_Masking to synthesize a program p1 given the input parameters p − mt, f_masking, S − ms, and mt. We only consider faults f_masking because p1 need not be masking fault-tolerant to f_failsafe. Since a multitolerant program must not reach a state of ms, we use the state predicate S − ms as the input invariant to Add_Masking. Finally, we use the mt transitions in place of the spec parameter (i.e., the fourth parameter of Add_Masking). Since Add_Masking treats mt as a set of safety-violating transitions, it does not include them in the synthesized program p1. Thus, starting from a state in S', a computation of p1[]f_masking does not reach a state in ms.
As a result, if T_masking contains a state s in ms, s can be removed while preserving the masking f_masking-tolerance property of p1. Hence, we make the following observation:

Observation 7.9 In the output of the algorithm Add_Masking (cf. Figure 7.2), removing ms states from T_masking preserves the masking f_masking-tolerance property of p1.

Now, if faults f_failsafe ∪ f_masking perturb p1 to a state s, where s ∉ T_masking, then our synthesis algorithm will have to ensure that safety is maintained. To achieve this goal, we add failsafe (f_failsafe ∪ f_masking)-tolerance to p1 from (T_masking − ms) using the algorithm Add_Failsafe.

Add_Failsafe_Masking(p: transitions, f_failsafe, f_masking: fault,
                     S: state predicate, spec: safety specification)
{
  ms := {s0 : ∃s1, s2, ..., sn : (∀j : 0 ≤ j < n : (sj, sj+1) ∈ (f_failsafe ∪ f_masking))
              ∧ (sn−1, sn) violates spec};
  mt := {(s0, s1) : (s1 ∈ ms) ∨ ((s0, s1) violates spec)};
  S', T_masking, p1 := Add_Masking(p − mt, f_masking, S − ms, mt);
  T', p' := Add_Failsafe(p1, f_failsafe ∪ f_masking, T_masking − ms, mt);
  return p', S', T';
}

Figure 7.4: The states and the transitions corresponding to the propositional variables in the 3-SAT formula.

The transitions of f_failsafe. The transitions of f_failsafe can perturb the program from x_i to v_i. Thus, the class of faults f_failsafe is equal to the set of transitions {(x_i, v_i) : 1 ≤ i ≤ n}.

The transitions of f_nonmasking. The transitions of f_nonmasking can perturb the program from x'_i to v_i. Thus, we have f_nonmasking = {(x'_i, v_i) : 1 ≤ i ≤ n}.

The transitions of f_masking. The transitions of f_masking can take the program from s to y_i. Also, for each disjunction c_j, we introduce a fault transition that perturbs the program from state s to state z_j (1 ≤ j ≤ M). Thus, the class of faults f_masking is equal to the set of transitions {(s, y_i) : 1 ≤ i ≤ n} ∪ {(s, z_j) : 1 ≤ j ≤ M}.

The safety specification of the fault-intolerant program, p.
None of the fault transitions, namely f_failsafe, f_nonmasking, and f_masking, identified above directly violates safety. In addition, for each propositional variable a_i and its complement ¬a_i (1 ≤ i ≤ n), the following transitions do not violate safety (cf. Figure 7.4):

• (y_i, x_i), (x_i, s), (y_i, x'_i), (x'_i, s)

And, for each disjunction c_j = a_i ∨ ¬a_k ∨ a_r, the following transitions do not violate safety:

• (z_j, x_i), (z_j, x'_k), (z_j, x_r)

All transitions except those identified above violate safety of specification. Also, observe that the transition (v_i, s), shown in Figure 7.4, violates safety.

7.5.3 Reduction From 3-SAT

In this section, we show that the given instance of 3-SAT is satisfiable iff multitolerance can be added to the problem instance identified in Section 7.5.2. Specifically, in Lemma 7.14, we show that if the given instance of the 3-SAT formula is satisfiable then there exists a multitolerant program that solves the instance of the multitolerance synthesis problem identified in Section 7.5.2. Then, in Lemma 7.15, we show that if there exists a multitolerant program that solves the instance of the multitolerance synthesis problem identified in Section 7.5.2, then the given 3-SAT formula is satisfiable.

Lemma 7.14 If the given 3-SAT formula is satisfiable then there exists a multitolerant program that solves the instance of the addition problem identified in Section 7.5.2.

Proof. Since the 3-SAT formula is satisfiable, there exists an assignment of truth values to the propositional variables a_i, 1 ≤ i ≤ n, such that each c_j, 1 ≤ j ≤ M, is true. Now, we identify a multitolerant program, p', that is obtained by adding multitolerance to the fault-intolerant program p identified in Section 7.5.2. The invariant of p' is the same as the invariant of p (i.e., {s}). We derive the transitions of the multitolerant program p' as follows. (As an illustration, we have shown the partial structure of p' where a_i = true, a_k = false, and a_r
= true (1 ≤ i, k, r ≤ n) in Figure 7.5.)

• For each propositional variable a_i, 1 ≤ i ≤ n, if a_i is true then we will include the transitions (y_i, x_i) and (x_i, s). Thus, in the presence of f_masking alone, p' provides safe recovery to s through x_i.

• For each propositional variable a_i, 1 ≤ i ≤ n, if a_i is false then we will include (y_i, x'_i) and (x'_i, s) to provide safe recovery to the invariant. In this case, since state v_i can be reached from x'_i by faults f_nonmasking, we include transition (v_i, s) so that in the presence of f_masking and f_nonmasking program p' provides nonmasking fault-tolerance.

• For each disjunction c_j that includes a_i, we include the transition (z_j, x_i) iff a_i is true. And, for each disjunction c_j that includes ¬a_i, we include transition (z_j, x'_i) iff a_i is false.

Figure 7.5: The partial structure of the multitolerant program

Now, we show that p' is multitolerant in the presence of faults f_failsafe, f_nonmasking, and f_masking.

• p' in the absence of faults. p'|S = p|S. Thus, p' satisfies spec in the absence of faults.

• Masking tolerance to f_masking. If the faults from f_masking occur then the program can be perturbed to (1) y_i, 1 ≤ i ≤ n, or (2) z_j, 1 ≤ j ≤ M. In the first case, if a_i is true then there exists exactly one sequence of transitions, ⟨(y_i, x_i), (x_i, s)⟩, in p'[]f_masking. Thus, any computation of p'[]f_masking eventually reaches a state in the invariant. Moreover, starting from y_i the computations of p'[]f_masking do not violate the safety specification.
And, if a_i is false then there exists exactly one sequence of transitions, ⟨(y_i, x'_i), (x'_i, s)⟩, in p'[]f_masking. By the same argument, even in this case, any computation of p'[]f_masking reaches a state in the invariant and does not violate the safety specification during recovery.

In the second case, since c_j evaluates to true, one of the terms in c_j (a propositional variable or its complement) evaluates to true. Thus, there exists at least one transition from z_j to some state x_k (respectively, x'_k) where a_k (respectively, ¬a_k) is a propositional variable in c_j and a_k (respectively, ¬a_k) evaluates to true. Moreover, the transition (z_j, x_k) is included in p' iff a_k evaluates to true. Thus, (z_j, x_k) (respectively, (z_j, x'_k)) is included in p' iff (x_k, s) (respectively, (x'_k, s)) is included in p'. Since from x_k (respectively, x'_k) there exists no other transition in p'[]f_masking except (x_k, s) (respectively, (x'_k, s)), every computation of p' reaches the invariant without violating safety.

Based on the above discussion, p' is masking tolerant to f_masking.

• Failsafe tolerance to f_masking ∪ f_failsafe. Clearly, based on the case considered above, if only faults from f_masking occur then the program is also failsafe fault-tolerant. Hence, we consider only the case where at least one fault from f_failsafe has occurred. Faults in f_failsafe occur only in state x_i, 1 ≤ i ≤ n. And, p' reaches x_i iff a_i is assigned true in the satisfaction of the given 3-SAT formula. Moreover, if a_i is true then there is no transition from v_i. Thus, after a fault transition of class f_failsafe occurs, p' simply stops. Therefore, p' does not violate safety.

• Nonmasking tolerance to f_masking ∪ f_nonmasking. This proof is similar to the proof of failsafe fault-tolerance shown above. Specifically, we only need to consider the case where at least one fault transition of class f_nonmasking has occurred. Faults in f_nonmasking occur only in state x'_i, 1 ≤ i ≤ n.
And, p' reaches x'_i iff a_i is assigned false in the satisfaction of the given 3-SAT formula. Moreover, if a_i is false then the only transition from v_i is (v_i, s). Thus, in the presence of f_masking and f_nonmasking, p' recovers to its invariant. (Note that the recovery in this case violates safety.) □

Lemma 7.15 If there exists a multitolerant program that solves the instance of the synthesis problem identified earlier then the given 3-SAT formula is satisfiable.

Proof. Suppose that there exists a multitolerant program p' derived from the fault-intolerant program, p, identified in Section 7.5.2. Since the invariant of p', S', is non-empty and S' ⊆ S, S' must include state s. Thus, S' = S. Also, since each y_i, 1 ≤ i ≤ n, is directly reachable from s by a fault from f_masking, p' must provide safe recovery from y_i to s. Thus, p' must include either (y_i, x_i) or (y_i, x'_i). We make the truth assignment as follows: if p' includes (y_i, x_i) then we assign a_i to be true, and if p' includes (y_i, x'_i) then we assign a_i to be false. Clearly, each propositional variable in the 3-SAT formula will get at least one truth assignment. Now, we show that the truth assignment to each propositional variable is consistent and that each disjunct in the 3-SAT formula evaluates to true.

• Each propositional variable gets a unique truth assignment. Suppose that there exists a propositional variable a_i which is assigned both true and false, i.e., both (y_i, x_i) and (y_i, x'_i) are included in p'. Now, v_i can be reached by the transitions (s, y_i), (y_i, x'_i), and (x'_i, v_i). In this case, only faults from f_masking and f_nonmasking have occurred. Hence, p' must provide recovery from v_i to the invariant. Also, v_i can be reached by the transitions (s, y_i), (y_i, x_i), and (x_i, v_i). In this case, only faults from f_masking and f_failsafe have occurred. Hence, p' must ensure safety.
Based on the above discussion, p' must provide a safe recovery to the invariant from v_i. Based on the definition of the safety specification identified in Section 7.5.2, this is not possible. Thus, propositional variable a_i is assigned only one truth value.

• Each disjunction is true. Let c_j = a_i ∨ ¬a_k ∨ a_r be a disjunction in the given 3-SAT formula. The corresponding state added in the instance of the multitolerance problem is z_j. Note that state z_j can be reached by the occurrence of a fault from f_masking from s. Hence, p' must provide safe recovery from z_j. Since the only safe transitions from z_j are those corresponding to states x_i, x'_k, and x_r, p' must include at least one of the transitions (z_j, x_i), (z_j, x'_k), or (z_j, x_r).

Now, we show that the transition included from z_j is consistent with the truth assignment of propositional variables. Specifically, consider the case where p' contains transition (z_j, x_i) and a_i is assigned false. Then p' can reach x_i in the presence of faults from f_masking alone. Moreover, if a_i is assigned false then p' contains the transition (y_i, x'_i). Thus, x'_i can also be reached by the occurrence of faults from f_masking alone. Based on the above proof for unique assignment of truth values to propositional variables, p' cannot reach both x_i and x'_i in the presence of f_masking alone. Hence, if (z_j, x_i) is included in p' then a_i must have been assigned truth value true. Likewise, if (z_j, x'_k) is included in p' then a_k must be assigned truth value false. Thus, with the truth assignment considered above, each disjunction must evaluate to true. □

Theorem 7.16 The problem of synthesizing multitolerant programs from their fault-intolerant versions is NP-complete.
□

7.5.4 Failsafe-Nonmasking Multitolerance

In this section, we extend the NP-completeness proof of synthesizing multitolerance to the case where we add failsafe fault-tolerance to one class of faults, say f_failsafe, and we add nonmasking fault-tolerance to another class of faults, say f_nonmasking. Our mapping for this case is similar to that in Section 7.5.2. We replace the f_masking fault transition (s, y_i) with a sequence of transitions of f_failsafe and f_nonmasking, as shown in Figure 7.6. Likewise, we replace fault transition (s, z_j) with a structure similar to Figure 7.6. Thus, y_i (respectively, z_j) is reachable by f_failsafe faults alone and by f_nonmasking faults alone. As a result, v_i is reachable in the computations of p'[]f_failsafe and in the computations of p'[]f_nonmasking. Thus, to add multitolerance, safe recovery must be added from v_i to s (cf. Figure 7.4). Now, we note that with this mapping, the proofs of Lemmas 7.14 and 7.15 and Theorem 7.16 can be easily extended to show that synthesizing failsafe-nonmasking multitolerance is NP-complete. Thus, we have

Corollary 7.17 The problem of synthesizing failsafe-nonmasking multitolerant programs from their fault-intolerant version is NP-complete. □

Figure 7.6: A proof sketch for NP-completeness of synthesizing failsafe-nonmasking multitolerance.

7.6 Summary

In this chapter, we investigated the problem of synthesizing multitolerant programs from their fault-intolerant versions. The input to the synthesis algorithm included the fault-intolerant program, different classes of faults to which fault-tolerance had to be added, and the level of tolerance provided for each class of faults. Our algorithms ensured that the synthesized program provided (i) the specified level of fault-tolerance if a fault from any single class had occurred, and (ii) the minimal level of fault-tolerance if faults from multiple classes occurred.
We presented a sound and complete algorithm for the case where failsafe (respectively, nonmasking) fault-tolerance is added to one class of faults and masking fault-tolerance is provided to another class of faults. Thus, in these cases, if a multitolerant program could be synthesized for the given input program, our algorithms would always produce one such multitolerant program. The complexity of these algorithms is polynomial in the state space of the fault-intolerant program. For the case where one needs to add failsafe fault-tolerance to one class of faults and nonmasking fault-tolerance to another class of faults, we showed that this problem is NP-complete. As mentioned earlier, this result is counterintuitive: adding failsafe and nonmasking fault-tolerance to the same class of faults can be done in polynomial time, whereas adding failsafe fault-tolerance to one class of faults and nonmasking fault-tolerance to another class of faults is NP-complete.

Although the results presented in this chapter deal with the high atomicity model, we note that the algorithms in the high atomicity model are important in synthesizing distributed fault-tolerant programs as well. Specifically, our algorithms identify a limit up to which even highly powerful processes can add the necessary multitolerance. Thus, the output of these algorithms can be used in identifying the limits that distributed processes, along with their restrictions on reading and writing variables of the program, can achieve in terms of adding the necessary multitolerance. As an illustration, we note that in Chapter 5, we have identified how algorithms in high atomicity can be systematically used in enhancing the level of fault-tolerance to a single class of faults.
Chapter 8

FTSyn: A Software Framework for Automatic Synthesis of Fault-Tolerance

In this chapter, we present the design and the internal working of the software framework Fault-Tolerance Synthesizer (FTSyn) that we have developed for the synthesis of fault-tolerant distributed programs. This framework allows its users to automatically (respectively, interactively) add fault-tolerance. We also show that our framework permits one to add new heuristics for adding fault-tolerance. Towards this end, we describe the addition of several heuristics (based on the algorithms proposed in [14] and in Chapter 5) for different steps involved in adding fault-tolerance. Further, we show how one can easily change the internal representation of different entities in the framework.

We have used our framework to synthesize several fault-tolerant programs, among them (i) a simplified version of an altitude switch that controls the altitude of an aircraft by monitoring the altitude sensors and generating necessary command signals, where the altitude switch tolerates the corruption of altitude sensors; (ii) a token ring protocol that tolerates process-restart faults; (iii) an agreement protocol that tolerates Byzantine faults; (iv) an agreement program that tolerates both Byzantine faults and fail-stop faults; (v) an alternating bit protocol program that tolerates message-loss faults; and (vi) a Triple Modular Redundancy program that tolerates input-corruption faults. These examples illustrate the potential of our framework in adding tolerance to different types of faults with different natures.

We proceed as follows: in Section 8.1, we illustrate how the developers of fault-tolerance can synthesize fault-tolerant programs using our framework. In Section 8.2, we present the design of the framework and discuss its internal working. In Section 8.3, we show how one can integrate new heuristics into our framework.
In Section 8.4, we present the way in which one can change the internal representation of entities involved in the framework. In Section 8.5, we present a simplified version of an altitude switch synthesized using our framework. We make concluding remarks and discuss future work in Section 8.6.

8.1 Adding Fault-Tolerance to Distributed Programs

In this section, we first describe the input and the output of our framework (cf. Section 8.1.1). Then, in Section 8.1.2, we give an overview of the framework fractions that participate in the automatic synthesis of fault-tolerant programs. We implement a deterministic version of the Add_ft algorithm (cf. Section 2.8) and a set of heuristics developed in [14, 15] to synthesize a fault-tolerant program. Further, in Section 8.1.3, we illustrate how the users can interact with the framework in order to semi-automatically synthesize a fault-tolerant program from its fault-intolerant version.

8.1.1 The Input/Output of the Framework

In this subsection, we explain how developers of fault-tolerance should prepare the input to our framework and how the framework provides the output to its users. The input of our framework consists of the abstract structure of the fault-intolerant program, its invariant, its safety specification, its initial states, and a class of faults. The output of our framework is the abstract structure of the fault-tolerant program, represented by guarded commands.

We note that there exist automated techniques (e.g., [42, 43]) by which we can extract the abstract structure of programs written in common programming languages, and then provide our framework with the abstract structure of programs. Moreover, after the synthesis of a fault-tolerant program, there exist automated techniques (e.g., [44, 45, 46]) that allow us to refine the abstract structure of the fault-tolerant program while preserving its correctness and fault-tolerance properties.
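To make the notion of an "abstract structure represented by guarded commands" concrete, the following Python sketch shows one possible representation of a guarded command as a guard predicate plus a statement over named variables. The class and field names are our own illustration, not FTSyn's actual internal classes.

```python
# Hypothetical representation (not FTSyn's internals) of a guarded command:
# a guard predicate over a state, and a statement producing the next state.
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, int]

@dataclass
class GuardedCommand:
    guard: Callable[[State], bool]
    statement: Callable[[State], State]  # returns the successor state

    def enabled(self, s: State) -> bool:
        return self.guard(s)

    def execute(self, s: State) -> State:
        return self.statement(s)

# Example: an action of the form (x0 == x3) -> x0 := (x3 + 1) % 2.
p0_action = GuardedCommand(
    guard=lambda s: s["x0"] == s["x3"],
    statement=lambda s: {**s, "x0": (s["x3"] + 1) % 2},
)
s = {"x0": 0, "x1": 0, "x2": 0, "x3": 0}
assert p0_action.enabled(s)
assert p0_action.execute(s)["x0"] == 1
```

A program is then simply a set of such commands per process, which is the form the synthesis algorithm manipulates.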
Next, we present a very simple example of a token ring program to illustrate the way developers can communicate with our framework to add fault-tolerance. Our goal is to provide an overall picture of the input/output of our framework. Afterwards, in Subsection 8.1.2, we show the internal working of our framework and how it synthesizes the fault-tolerant token ring program.

Token ring program. The fault-intolerant program consists of four processes P0, P1, P2, and P3 arranged in a ring. Each process P_i, 0 ≤ i ≤ 3, has a variable x_i with the domain {−1, 0, 1}. We say that process P_i, 1 ≤ i ≤ 3, has the token if and only if (x_i ≠ x_{i−1}) and fault transitions have not corrupted P_i and P_{i−1}. And, P0 has the token if (x3 = x0) and fault transitions have not corrupted P0 and P3. Process P_i, 1 ≤ i ≤ 3, copies x_{i−1} to x_i if the value of x_i is different from x_{i−1}. This action passes the token to the next process. Also, if (x0 = x3) holds then process P0 copies the value of (x3 ⊕ 1) to x0, where ⊕ is addition modulo 2. Now, if we initialize every x_i, 0 ≤ i ≤ 3, with 0 then process P0 has the token and the token circulates along the ring.

In the input file of our framework, we specify the actions of P0 as follows (keywords are shown in italics):

1 process P0
2 begin
3   (x0 == x3) -> x0 = ((x3+1)%2);
4   read x0, x3;
5   write x0;
6 end

Since processes P1, P2, and P3 are similar, we only present the action of process P1:

1 process P1
2 begin
3   (x1 != x0) -> x1 = x0;
4   read x1, x0;
5   write x1;
6 end

Read/Write restrictions. Each process P_i, 1 ≤ i ≤ 3, is only allowed to read x_{i−1} and x_i, and allowed to write x_i. Process P0 is allowed to read x3 and x0, and write x0. We specify the read/write restrictions of a process by the read and write keywords inside the body of the process (cf. lines 4 and 5 in the body of P1).

Faults. The faults are also modeled as a set of guarded commands that change the values of program variables.
In the case of the token ring program, the faults may corrupt at most three processes. Also, in this example, the faults are detectable in that a process that is corrupted can detect that it is in a corrupted state. Hence, we model the fault at process P_i by setting x_i = −1. Thus, one of the fault actions, which corrupts x0, is represented as follows:

1 fault TokenCorruption
2 begin
3   ( ((x0!=-1)&&(x1!=-1)) || ((x0!=-1)&&(x2!=-1)) ||
4     ((x0!=-1)&&(x3!=-1)) || ((x1!=-1)&&(x2!=-1)) ||
5     ((x1!=-1)&&(x3!=-1)) || ((x2!=-1)&&(x3!=-1)) )
6   -> x0 = -1;
7 end

Note that there exist no read/write restrictions for the fault transitions because we assume that fault transitions can read and write arbitrary program variables.

Safety specification. The safety specification of the fault-intolerant program is represented as a Boolean expression over program variables. In the token ring program, the problem specification stipulates that the fault-tolerant program is not allowed to take a transition where a non-corrupted process copies a corrupted value from its neighbor. In the input of the framework, we represent the specification as follows:

1 ( ((x1s!=-1)&&(x1d==-1)) || ((x2s!=-1)&&(x2d==-1)) ||
2   ((x3s!=-1)&&(x3d==-1)) || ((x3s==-1)&&(x0s!=x0d)) )

Note that we have added a suffix "s" (respectively, suffix "d") to the variable names, which stands for source (respectively, destination). Since the above condition specifies a set of transitions t_spec using their source and destination states, we need to distinguish between the value of a specific variable x_i in the source state of t_spec (i.e., xis means the value of x_i in the source state of t_spec) and in the destination state of t_spec (i.e., xid means the value of x_i in the destination state of t_spec).

Invariant. The invariant is also specified as a Boolean expression over program variables. The invariant of the token ring program consists of the states where no process is corrupted and there exists only one token in the ring.
We represent the invariant of the program using the invariant keyword followed by a state predicate:

1 invariant
2   ((x0==1)&&(x1==0)&&(x2==0)&&(x3==0)) ||
3   ((x0==1)&&(x1==1)&&(x2==0)&&(x3==0)) ||
4   ((x0==1)&&(x1==1)&&(x2==1)&&(x3==0)) ||
5   ((x0==1)&&(x1==1)&&(x2==1)&&(x3==1)) ||
6   ((x0==0)&&(x1==0)&&(x2==0)&&(x3==0)) ||
7   ((x0==0)&&(x1==0)&&(x2==0)&&(x3==1)) ||
8   ((x0==0)&&(x1==0)&&(x2==1)&&(x3==1)) ||
9   ((x0==0)&&(x1==1)&&(x2==1)&&(x3==1))

Initial states. We also specify some initial states in the input of the synthesis framework. While these initial states are included in the invariant of the fault-intolerant program, we find that explicitly listing them assists in adding fault-tolerance. The initial states of the token ring program are as follows (init and state are keywords):

1 init
2   state x0 = 0; x1 = 0; x2 = 0; x3 = 0;
3   state x0 = 1; x1 = 1; x2 = 1; x3 = 1;

The output fault-tolerant program. Finally, the output of our framework is also generated in guarded commands. For the token ring program, the actions of process P0 in the synthesized fault-tolerant program are as follows:

1 (x0==-1) && (x3==1) -> x0 := 0;
2 |
3 (x0==1) && (x3==1) -> x0 := 0;
4 |
5 (x0==0) && (x3==0) -> x0 := 1;
6 |
7 (x0==-1) && (x3==0) -> x0 := 1;

The above actions mean that P0 can copy the value of (x3 ⊕ 1) to x0 as long as x3 ≠ −1. Next, we present the actions of the synthesized process P1:

1 (x1==1) && (x0==0) -> x1 := 0;
2 |
3 (x1==-1) && (x0==0) -> x1 := 0;
4 |
5 (x1==0) && (x0==1) -> x1 := 1;
6 |
7 (x1==-1) && (x0==1) -> x1 := 1;

The above actions stipulate that process P1 can copy the value of x0 to x1 if ((x0 ≠ −1) ∧ (x1 ≠ x0)) holds (i.e., P0 is not corrupted). Likewise, the synthesis framework generates similar actions for the synthesized processes P2 and P3. We would like to note that the token ring program that we have automatically synthesized using our framework is the same as the program that was manually designed in [10].
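To see how the synthesized actions behave, the following hypothetical Python sketch (our own illustration, not FTSyn output) encodes the synthesized guards together with the safety specification above, and checks that recovery from a state with three corrupted processes never violates safety. The serialization order of the processes is an assumption made for the sketch.

```python
# Hypothetical sketch: the synthesized token ring actions and the safety
# specification (suffix "s" = source state, "d" = destination state).

def violates_safety(src, dst):
    # Bad transition: a process copies -1, or P0 writes x0 while x3 == -1.
    return any(src[i] != -1 and dst[i] == -1 for i in (1, 2, 3)) \
        or (src[3] == -1 and src[0] != dst[0])

def step(x):
    # Synthesized actions: P0 writes (x3 (+) 1) when x3 is uncorrupted and
    # x0 is corrupted or equals x3; Pi copies an uncorrupted x(i-1).
    x = list(x)
    if x[3] != -1 and (x[0] == -1 or x[0] == x[3]):
        x[0] = (x[3] + 1) % 2
    else:
        for i in (1, 2, 3):                 # assumed scheduling order
            if x[i - 1] != -1 and x[i] != x[i - 1]:
                x[i] = x[i - 1]
                break
    return tuple(x)

# Faults corrupted P0, P1, and P2; the synthesized program recovers safely.
x = (-1, -1, -1, 0)
for _ in range(10):
    nxt = step(x)
    assert not violates_safety(x, nxt)
    x = nxt
assert -1 not in x                          # all processes recovered
```

Running the loop, the program first repairs x0 from the uncorrupted x3, then propagates the repaired value around the ring, exactly the recovery behavior described for the synthesized program.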
8.1.2 Framework Execution Scenario

In this subsection, we discuss a sample execution scenario for the case where fault-tolerance is added without any user interaction. Also, we use the token ring example to illustrate the execution of the synthesis algorithm. In this execution scenario, the synthesis algorithm consists of four fractions: Initialize, PreserveInvariant, ModifyInvariant, and ResolveCycles (cf. Figure 8.1).

Expanding the reachability graph. Before the execution of the synthesis algorithm, the framework uses the initial states and the program/fault transitions to generate the state-transition graph of the fault-intolerant program. Since this directed graph only includes those states of the state space that are reachable by program/fault transitions from initial states, we call it a reachability graph of the fault-intolerant program. (It also represents a reachable subset of the fault-span of the fault-intolerant program.)

The reachability graph of the token ring program. For the token ring program presented in Section 8.1.1, the reachability graph is equal to its state space and includes 81 states. Let (x0, x1, x2, x3) denote a state of the token ring program. Thus, starting from the initial state s0 = (0, 0, 0, 0), fault transitions may perturb the program to s1 = (−1, 0, 0, 0), where process P0 is corrupted. From s1, process P1 copies the corrupted value and the fault-intolerant program reaches state s2 = (−1, −1, 0, 0). As a result, starting from the given initial states, a combination of program and fault transitions can take the state of the program to any possible state in the whole state space.
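The expansion step above is a standard breadth-first exploration. The following minimal sketch (assumed representation: each action is a function mapping a state to its successor states; not FTSyn's actual code) builds the reachability graph from the initial states.

```python
# Minimal sketch of reachability-graph expansion: BFS from the initial
# states over program and fault actions.
from collections import deque

def expand(init_states, actions):
    """actions: list of functions state -> iterable of successor states
    (the guarded commands of the program and of the fault classes)."""
    reached = set(init_states)
    frontier = deque(init_states)
    edges = set()
    while frontier:
        s = frontier.popleft()
        for act in actions:
            for t in act(s):
                edges.add((s, t))           # record the transition
                if t not in reached:        # explore each state once
                    reached.add(t)
                    frontier.append(t)
    return reached, edges

# Tiny abstract example: a program counting 0 -> 1 -> 2 and a fault 2 -> 0.
prog_act = lambda s: [s + 1] if s < 2 else []
fault_act = lambda s: [0] if s == 2 else []
reached, edges = expand([0], [prog_act, fault_act])
# reached covers all states hit by program/fault steps from the initial state
```

For the token ring, the actions would be the four process actions plus the TokenCorruption fault actions, and this exploration yields the 81-state graph mentioned above.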
Figure 8.1: A deterministic execution scenario for the framework FTSyn.

Execution of fraction (I). After the expansion of the reachability graph, the framework executes every step of the synthesis algorithm (i.e., F1-F6 in Figure 2.4) on the reachability graph of the fault-intolerant program in order to derive a reachability graph of the fault-tolerant program. First, in fraction (I) (cf. Figure 8.1), the synthesis algorithm calculates the sets of ms states and mt transitions (in the reachability graph).

The token ring program in fraction (I). In the case of the token ring program, safety is violated when a process copies a corrupted value from its neighbor. Thus, fault transitions do not directly violate safety, and as a result, the set of ms states is empty. Also, since ms is empty, the set of mt transitions is equal to the set of program transitions that directly violate safety.

Execution of fraction (II). Then, the synthesis algorithm moves to fraction (II), where we attempt to identify a valid fault-span T' that (i) is closed in p'[]f; (ii) does not include any ms states or safety-violating transitions of mt; and (iii) does not include any deadlock states outside the invariant. While executing in fraction (II), we leave the invariant S' unchanged. This is due to the fact that the addition problem requires that the invariant of the fault-tolerant program be a subset of the invariant of the fault-intolerant program. Thus, states inside the invariant of the fault-intolerant program are important; removing them prematurely can cause the automated synthesis to fail. Also, when we remove ms states (respectively, mt transitions) from T' in order to satisfy F3, the new fault-span will be a subset of the initial T'.
As a result, those transitions that start in the new fault-span and end in the part of T' that is not in the new fault-span violate the closure of the fault-span (i.e., F2) and must be removed. Hence, after satisfying F3, we may need to re-satisfy F2. A similar scenario can happen while resolving deadlock states (i.e., satisfying F4). Hence, fraction (II) is an iterative procedure. The execution continues in fraction (II) until an iteration does not cause any changes or until the number of iterations exceeds a predetermined bound.

The token ring program in fraction (II). For the token ring program, the framework removes (groups of) program transitions that violate safety of specification. For example, the transition that process P1 takes from s1 to s2 violates the safety of specification. Hence, the synthesis algorithm removes (s1, s2) in fraction (II). As a result, s1 = (−1, 0, 0, 0) becomes a state without any outgoing transition, i.e., a deadlock state. The execution of fraction (II) does not create any deadlock states inside the invariant of the token ring program since ms is empty and no mt transition exists inside the invariant. Thus, in the first iteration, the synthesis algorithm only removes a set of transitions in the fault-span outside the invariant (i.e., mt transitions and the transitions that violate the closure of the fault-span).

Execution of fraction (III). At the end of fraction (II), if the resulting program does not satisfy F1-F6, we modify the invariant S' in fraction (III) to ensure that the invariant S' is closed in the program p', i.e., F5 is satisfied. In fraction (III), we recalculate a valid invariant. In this fraction, the newly added transitions may violate the closure of the fault-span. Thus, when we exit fraction (III), the conditions F2-F4 may need to be re-satisfied. Hence, we jump to fraction (II) and attempt to re-satisfy F2-F4.
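The computations performed by fractions (I) and (II) can be sketched as two fixpoint loops. The following Python sketch uses assumed names and an explicit-state representation (not FTSyn's internals); in particular, it resolves deadlock states outside the invariant simply by removing them, whereas adding recovery instead is the job of fraction (III).

```python
# Fraction (I): ms is the backward closure, under fault transitions, of
# states from which a fault step can directly violate safety.
def compute_ms_mt(faults, violates_safety):
    """faults: set of (src, dst) pairs; violates_safety: predicate over a
    transition. Returns ms and a membership test for mt."""
    ms = set()
    changed = True
    while changed:                        # backward fixpoint over fault steps
        changed = False
        for (s, t) in faults:
            if s not in ms and (violates_safety((s, t)) or t in ms):
                ms.add(s)
                changed = True
    return ms, lambda tr: violates_safety(tr) or tr[1] in ms

# Fraction (II): iterate until closure (F2) and deadlock-freedom outside the
# invariant (F4) hold, assuming ms states and mt transitions were removed.
def prune_fault_span(T, S, prog, faults, max_iters=100):
    T, prog = set(T), set(prog)
    for _ in range(max_iters):
        prog2 = {(u, v) for (u, v) in prog if u in T and v in T}
        # a state whose fault transition leaves T violates closure in p'[]f
        T2 = {s for s in T if all(v in T for (u, v) in faults if u == s)}
        # a state in T - S with no outgoing program transition is a deadlock
        enabled = {u for (u, v) in prog2}
        T2 = {s for s in T2 if s in S or s in enabled}
        if T2 == T and prog2 == prog:
            break                         # fixpoint: this iteration changed nothing
        T, prog = T2, prog2
    return T, prog

# Tiny abstract example: fault 'b' -> 'bad' violates safety; 'a' reaches 'b'.
ms, in_mt = compute_ms_mt({('a', 'b'), ('b', 'bad')},
                          lambda tr: tr == ('b', 'bad'))
T, prog = prune_fault_span(T={'s0', 'p', 'q'}, S={'s0'},
                           prog={('p', 's0'), ('q', 'r')},
                           faults={('s0', 'p')})
```

In the example, `q` loses its only transition (its target `r` is outside T), becomes a deadlock state, and is removed, mirroring how s1 is handled for the token ring.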
Notice that in fraction (III), we satisfy F4 only for the invariant states; i.e., we ensure that there is no deadlock state inside the invariant, whereas in fraction (II), we resolve deadlock states that are in the fault-span but outside the invariant.

The token ring program in fraction (III). As we mentioned earlier, the removal of mt transitions creates deadlock states outside the invariant of the token ring program. For example, state s1 = (-1,0,0,0) became a deadlock state since the framework removed a transition to s2 = (-1,-1,0,0) taken by P1. Now, in fraction (III), the framework adds recovery transitions to the invariant by allowing a corrupted process to copy an uncorrupted value from its predecessor. Thus, from s2, process P0 can toggle the value of x0 and correct itself by moving to state s3 = (1,-1,0,0). Now, from s3, process P1 copies x0 and takes the program to state s4 = (1,1,0,0), which is in the invariant. Note that since P1 cannot read variables x2 and x3, the group of transitions associated with the transition (s3, s4), say g34, includes 9 transitions. By definition, the values of x2 and x3 remain unchanged in each transition of g34. Also, P1 does not propagate a corrupted value by executing transition (s3, s4). Thus, no transition in g34 violates the safety of specification.

Execution of fraction (IV). If the values of p', S', and T' satisfy formulae F2-F5 at the end of fraction (III), then we will ensure that p' does not stay outside its invariant forever. Toward this end, we move into fraction (IV), where we remove reachable non-progress cycles in T' - S' (if any).

The token ring program in fraction (IV). As long as there exists an uncorrupted value, the token ring program can propagate that value along the ring and recover to the invariant. Since faults can perturb at most three processes, the existence of an uncorrupted process is always guaranteed.
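The grouping argument above (g34 containing 9 transitions) follows directly from the read restrictions: the two variables P1 cannot read are left free but unchanged, and each ranges over three values. A small sketch, assuming the domain {-1, 0, 1} for every x_i as in this example:

```java
// Sketch of how read restrictions induce transition groups: a transition of P1
// is grouped with every transition that agrees on the variables P1 can read
// and lets the unreadable variables (here x2 and x3) take any value, unchanged.
// Assumes each x_i ranges over {-1, 0, 1}, as in the token ring example.
import java.util.ArrayList;
import java.util.List;

public class TransitionGroups {
    static final int[] DOMAIN = {-1, 0, 1};

    // Enumerate the source states of the group for a P1 transition whose
    // readable part fixes x0 and x1; x2 and x3 are free but unchanged.
    public static List<int[]> groupSources(int x0, int x1) {
        List<int[]> sources = new ArrayList<>();
        for (int x2 : DOMAIN)
            for (int x3 : DOMAIN)
                sources.add(new int[]{x0, x1, x2, x3});
        return sources;
    }

    public static void main(String[] args) {
        // The group g34 associated with (s3, s4): x0 = 1, x1 = -1 fixed.
        System.out.println(groupSources(1, -1).size()); // 9
    }
}
```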
Also, no non-progress cycles exist outside the invariant of the token ring program. Thus, in this automatic execution scenario, our framework generates the fault-tolerant token ring program presented in Section 8.1.1 by adding safe recovery from deadlock states outside the invariant.

8.1.3 User Interactions

Although the framework can automatically synthesize a fault-tolerant program without user intervention, there are some situations where (i) user intervention can help to speed up the synthesis of fault-tolerant programs, or (ii) a fully automatic approach fails. In this subsection, we present the nature of the interactions that fault-tolerance developers can have with our framework.

Our framework permits developers to semi-automatically supervise the synthesis procedure. In such supervised synthesis, fault-tolerance developers interact with the framework and apply their insights during synthesis. In order to achieve this goal, we have devised some interaction points (cf. Figure 8.1) where the developers can stop the synthesis algorithm and query it. At each interaction point, the users can make the following queries: (i) apply a specific heuristic for a particular task; (ii) apply some heuristics in a particular order; (iii) view the incoming program (respectively, fault) transitions to a particular state; (iv) view the outgoing program (respectively, fault) transitions from a particular state; (v) check the membership of a particular state (respectively, transition) in a specific set of states (respectively, transitions); e.g., check the membership of a given state s in the set of ms states, and finally (vi) view the intermediate representation of the program that is being synthesized. Since our goal is to focus on the technical details of the framework and its application in adding fault-tolerance, we omit the details about the user interface of the framework. We refer the reader to the tutorial on using this framework in Appendix B.
While we expect that the queries included in this version will be sufficient for a large class of programs, we also provide an alternative for the cases where the heuristics fail and these queries are insufficient. Specifically, in such cases, the users of our framework need to determine what went wrong during synthesis. The answer to this question is very difficult without the help of automated techniques, especially for programs with large state space. To address this issue, developers of fault-tolerance can obtain the corresponding intermediate program in a syntax compatible with the Promela modeling language [37]; this program can then be checked by the SPIN model checker to determine the exact scenario where the intermediate program does not provide the required fault-tolerance property. The counterexamples generated by SPIN enable the users to identify the appropriate heuristics that should be applied in subsequent steps of synthesis.

8.2 Framework Internals

The integration of new heuristics into our framework (respectively, modifying the internal representation of framework entities) requires some background knowledge about the design and the internal working of our framework. Hence, in this section, we present preliminary information that helps the users of the framework (especially the developers of heuristics) to understand its internal working. We use this information in Sections 8.3 and 8.4 to describe how the framework permits the addition of new heuristics and the ability to change the internal representation of its entities. We organize this section as follows: In Section 8.2.1, we introduce the important classes (i.e., abstract data structures) used in the design of the framework and their relationships. Then, in Section 8.2.2, we identify three important design patterns that help to make the design of the framework extensible.

8.2.1 Class Modeling

The input to the synthesis algorithm consists of the following entities: program, process, fault, safety specification, invariant, and initial states. Hence, we create the following classes corresponding to each entity: Program, Process, Fault, SafetySpecification, Invariant, and InitialStates. Also, since we can generate the fault-span (i.e., reachability graph) of the fault-intolerant program using the initial states and the program (respectively, fault) transitions, we regard the fault-span of the fault-intolerant program as an input entity. Thus, we model the fault-span of the fault-intolerant program using the ReachabilityGraph (RG) class. The synthesis framework takes the input entities and then executes the synthesis algorithm in order to generate a fault-tolerant program, its invariant, and its fault-span. Thus, we model the output entities using the same classes Program, Invariant, and RG.

We depict the class diagram of the synthesis framework in Figure 8.2.

Figure 8.2: The class diagram of FTSyn.

This figure identifies the important classes and their relationships. For example, each Process is composed of one or more Action objects. (We annotate the composition relation by black diamonds attached to an arrowed line.) Every Process is associated with zero or more TransitionGroup objects that are created due to the read restrictions of that process. (We illustrate associations by solid lines.) Finally, we have derived some new classes from the original classes of our abstract design by the inheritance relationship. (We annotate inheritance by a solid line attached to a triangle.)
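Since the fault-span is just the set of states reachable from the initial states via program and fault transitions, the RG object can be derived by a standard worklist traversal; a sketch with a hypothetical integer encoding of states:

```java
// Sketch of why the fault-span can be treated as a derived input: it is the
// set of states reachable from the initial states via program and fault
// transitions, computable by a standard worklist traversal. The encoding
// (states as integers, edges as a successor map) is hypothetical.
import java.util.*;

public class ReachabilitySketch {
    // trans: state -> successor states (program and fault edges merged)
    public static Set<Integer> faultSpan(Set<Integer> init, Map<Integer, List<Integer>> trans) {
        Set<Integer> reached = new HashSet<>(init);
        Deque<Integer> work = new ArrayDeque<>(init);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : trans.getOrDefault(s, List.of()))
                if (reached.add(t)) work.push(t); // explore newly reached states
        }
        return reached;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> trans = Map.of(0, List.of(1), 1, List.of(2), 3, List.of(4));
        // State 3 is never reached from the initial state 0.
        System.out.println(new TreeSet<>(faultSpan(Set.of(0), trans))); // [0, 1, 2]
    }
}
```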
For example, we have an abstract class Transition from which we have inherited two concrete classes ProgramTransition and FaultTransition.

8.2.2 Design Patterns

In this section, we identify three important design patterns [47], Bridge, FactoryMethod, and Strategy, that we use in our framework. The advantage of using design patterns over traditional abstract data types stems from the level of flexibility and reusability that these patterns provide in the design and implementation of our framework.

We use the Bridge design pattern (cf. Figure 8.3) in order to achieve extensibility. The Bridge pattern is a structural design pattern [47] that allows us to separate the design class hierarchy from the implementation class hierarchy. This way, we can independently extend the design and the implementation of the framework by subclassing. For example, we can introduce different implementation hierarchies corresponding to the AbstractProgram class, where these implementation hierarchies implement a common interface ProgramImplementor (cf. Figure 8.3).

Figure 8.3: The Bridge design pattern.

Another requirement for the developers of fault-tolerance is the ability to apply a specific heuristic at a particular stage of synthesis. Hence, the framework has to dynamically instantiate different classes that represent different heuristics at run-time. In order to achieve this goal, we use the FactoryMethod design pattern (cf. Figure 8.4). The FactoryMethod pattern is a creational pattern [47] that facilitates the dynamic instantiation of objects at run-time.
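A minimal illustration of the dynamic instantiation that the FactoryMethod pattern provides (class names mirror Figure 8.5, but the factory body is a hypothetical sketch, not the framework's actual code):

```java
// Hypothetical sketch of a factory method: the concrete heuristic class is
// chosen by name at run-time, so a newly added subclass becomes available
// without changing the caller. Class and method names are illustrative.
public class HeuristicFactory {
    public abstract static class DeadlockResolver { public abstract String resolve(); }

    public static class DeadlockResolver1 extends DeadlockResolver {
        public String resolve() { return "single-step recovery"; }
    }

    public static class DeadlockResolver2 extends DeadlockResolver {
        public String resolve() { return "multi-step recovery"; }
    }

    // The factory method: map a run-time name to a concrete instance.
    public static DeadlockResolver create(String name) {
        switch (name) {
            case "resolver1": return new DeadlockResolver1();
            case "resolver2": return new DeadlockResolver2();
            default: throw new IllegalArgumentException("unknown heuristic: " + name);
        }
    }

    public static void main(String[] args) {
        System.out.println(create("resolver2").resolve()); // multi-step recovery
    }
}
```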
Hence, if one adds a new heuristic in the form of a new class, which is extended from the abstract design of the framework, then the users of the framework can activate the newly added heuristic at run-time.

Figure 8.4: The FactoryMethod design pattern.

As we mentioned in the Introduction, the developers of heuristics should be able to easily integrate new heuristics into the framework. We presented the contribution of the Bridge and the FactoryMethod patterns in achieving extensibility and dynamic instantiation of heuristics at run-time, respectively. Yet another issue is the design of different versions of a heuristic. In the case where there are different algorithms for a specific step of the synthesis algorithm, we need to implement different versions of a particular class (respectively, method). For example, in resolving deadlock states, we may have different heuristics for dealing with a deadlock state. Hence, we need different versions of the solveDeadlock method of the RG class (cf. Figure 8.5).

Figure 8.5: Integrating the deadlock resolution heuristics using the Strategy pattern.

We use the Strategy pattern [47] to provide a flexible solution to the above-mentioned problem. In particular, we design a DeadlockResolver class for deadlock resolution (cf. Figure 8.5). This class has a method called Resolve, where we implement our deadlock resolution heuristic. Then, we apply the Strategy pattern to DeadlockResolver so that the developers of heuristics can extend new classes from the DeadlockResolver class and integrate their own heuristics in the Resolve method (cf. Figure 8.5).
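The resulting Strategy arrangement can be sketched as follows: RG delegates to whichever DeadlockResolver subclass is currently installed (names follow Figure 8.5; the bodies are illustrative placeholders):

```java
// Sketch of the Strategy arrangement described above: RG delegates deadlock
// resolution to a DeadlockResolver, and heuristic developers subclass
// DeadlockResolver to supply their own Resolve. Bodies are placeholders.
public class StrategySketch {
    public abstract static class DeadlockResolver { public abstract String resolve(); }

    public static class DeadlockResolver1 extends DeadlockResolver {
        public String resolve() { return "single-step recovery"; }
    }

    public static class DeadlockResolver2 extends DeadlockResolver {
        public String resolve() { return "multi-step recovery"; }
    }

    public static class RG {
        private DeadlockResolver resolver;
        public void setResolver(DeadlockResolver r) { resolver = r; }
        public String solveDeadlock() { return resolver.resolve(); } // delegate
    }

    public static void main(String[] args) {
        RG g = new RG();
        g.setResolver(new DeadlockResolver2()); // swap heuristics at run-time
        System.out.println(g.solveDeadlock()); // multi-step recovery
    }
}
```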
Finally, in the solveDeadlock method of the RG class, we use the FactoryMethod design pattern in order to dynamically instantiate different subclasses of the DeadlockResolver class at run-time.

8.3 Integrating New Heuristics

In this section, we address the problem of adding new heuristics to our framework (i.e., the second goal mentioned in the Introduction). Specifically, we show how one can integrate a new heuristic into our framework so that the added heuristic will be available to the developers of fault-tolerance during synthesis. Since a new heuristic will be integrated into a new class or into a method of an existing class, the problem of adding new heuristics to the framework reduces to the problem of adding new classes (respectively, methods) to the framework. We have used this ability to add several heuristics from [14, 31, 15]. Of these heuristics, we now present the integration of the three heuristics that we added for resolving deadlocks and discuss our experience in adding them.

First heuristic. Kulkarni, Arora, and Chippada [14] present a heuristic for deadlock resolution that consists of two passes. In the first pass, their heuristic tries to add single-step recovery transitions from a given deadlock state, s_d, to the invariant. Due to distribution restrictions, when their heuristic adds a recovery transition, t_rec, it has to add the group, g_rec, of transitions that is associated with t_rec. Moreover, the addition of g_rec is not allowed if there exists a transition (s0, s1) ∈ g_rec such that (i) (s0, s1) ∈ mt; (ii) (s0 ∈ S) ∧ (s1 ∈ S) ∧ ((s0, s1) ∉ p); (iii) (s0 ∈ T') ∧ (s1 ∉ T'), or (iv) (s0 ∈ S) ∧ (s1 ∉ S). If adding recovery from s_d is not possible, and s_d is directly reachable from the invariant by fault transitions, then their heuristic does nothing in the first pass. Otherwise, their heuristic makes s_d unreachable.
In the second pass, if there still exists a deadlock state s_d that is directly reachable from the invariant by fault transitions, then their heuristic makes s_d unreachable by removing the corresponding invariant state. At the end of deadlock resolution, if the invariant is empty, then they declare that their heuristic could not synthesize a fault-tolerant program. We have integrated their heuristic into the framework using the DeadlockResolver1 class (cf. Figure 8.5), which inherits from the DeadlockResolver class.

Second heuristic. The first heuristic only adds single-step recovery to deadlock states. As a result, it fails in cases where single-step recovery is not possible. For example, the first heuristic fails in the case where recovery from a deadlock state, say s_d, is possible only via another deadlock state, say s_d', from where we have already added a recovery transition to the invariant. Hence, we develop a new heuristic for adding multi-step recovery to deadlock states for the cases where single-step recovery to the invariant is not possible.

Our new heuristic also consists of two passes. In the first pass, we conduct a fixpoint computation that searches through the deadlock states outside the invariant in the fault-span. In the first iteration of the fixpoint computation, we find all deadlock states from where single-step recovery to the invariant is possible. In the second iteration, we find all deadlock states from where single-step recovery is possible to the recovery states explored in the first iteration. Continuing thus, we reach an iteration of the fixpoint computation where either no more deadlock states exist or no more recovery is possible. In the latter case, we choose to deal with the remaining deadlock states in the second pass.
In the former case, at the end of the fixpoint computation, we will have a set of states, RecoveryStates, from where there exists a multi-step recovery path to the invariant. (Notice that adding a recovery transition in a distributed program requires the satisfaction of the grouping requirements described in the first heuristic.) In the second pass, we try to remove s_d if s_d is directly reachable by fault transitions from the invariant and no recovery can be added to s_d. If the removal of s_d requires the removal of one or more invariant states, then we remove those invariant states. During deadlock resolution, if the invariant becomes empty, then we declare that the synthesis framework failed to synthesize a fault-tolerant program. In order to integrate this new heuristic into our framework, we extended a new class DeadlockResolver2 (cf. Figure 8.5) from the abstract class DeadlockResolver and then implemented our new heuristic in its Resolve method.

Third heuristic. The strategy of the third heuristic is similar to that of the second heuristic, except that the domain of the fixpoint computation includes all the states outside the invariant in the fault-span (i.e., T' - S'). In other words, the third heuristic is more general than the second heuristic. (Likewise, the second heuristic is more general than the first heuristic.) We have also used this heuristic for enhancing the fault-tolerance of nonmasking programs, where the program only guarantees recovery to the invariant in the presence of faults and not necessarily a safe recovery, to masking fault-tolerance [15]. The integration of the third heuristic was fairly simple. We integrated the third heuristic into a class DeadlockResolver3 (cf. Figure 8.5) extended from the abstract class DeadlockResolver.
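The first pass of the second (and third) heuristic is a standard backward fixpoint; a sketch with states encoded as integers and candidate recovery edges as a successor map (both encodings hypothetical):

```java
// Sketch of the fixpoint in the second heuristic's first pass: iteration 1
// collects deadlock states with single-step recovery to the invariant; each
// later iteration adds deadlock states that can reach an already collected
// recovery state in one step. Grouping constraints are omitted for brevity.
import java.util.*;

public class MultiStepRecovery {
    public static Set<Integer> recoveryStates(Set<Integer> invariant,
                                              Set<Integer> deadlocks,
                                              Map<Integer, List<Integer>> step) {
        Set<Integer> recovery = new HashSet<>();
        boolean changed = true;
        while (changed) {                       // fixpoint computation
            changed = false;
            for (int d : deadlocks) {
                if (recovery.contains(d)) continue;
                for (int succ : step.getOrDefault(d, List.of())) {
                    if (invariant.contains(succ) || recovery.contains(succ)) {
                        changed |= recovery.add(d);
                        break;
                    }
                }
            }
        }
        return recovery;
    }

    public static void main(String[] args) {
        // Deadlock state 2 recovers only via deadlock state 1, which
        // recovers to the invariant state 0 in a single step.
        Map<Integer, List<Integer>> step = Map.of(1, List.of(0), 2, List.of(1));
        System.out.println(new TreeSet<>(recoveryStates(Set.of(0), Set.of(1, 2), step))); // [1, 2]
    }
}
```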
The application of heuristics. The second heuristic suffices for the synthesis of the fault-tolerant token ring program presented in Subsection 8.1.1. However, in the synthesis of a version of the Byzantine agreement program containing four non-general processes, the second heuristic failed, so we applied the third heuristic (see Appendix B for this program).

The developers of fault-tolerance have the option to select one of the above heuristics during synthesis. Despite the generality of the third heuristic, it is not as efficient as the first two heuristics. Therefore, given a particular problem, the developers can either use their insight to choose the appropriate heuristic or rely on the framework to make that choice. The former choice provides more efficiency, whereas the latter choice allows more automation.

8.4 Changing the Internal Representations

As we mentioned in the Introduction, it is difficult to determine a priori the internal representation that one should use for the different entities, namely Program, Fault, Specification, and Invariant, involved in the synthesis of fault-tolerant programs. Thus, it is necessary to provide the ability to modify the internal representation of these entities while reusing the remaining parts of the framework. In fact, there are situations where one needs to use one internal representation while executing in one fraction of the framework, and a different internal representation for the same entity while executing in another fraction. In this section, we argue that our framework enables such a change of internal representation for the entities involved in our framework.
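One way such a representation swap can be structured is a Bridge-style implementor interface behind a stable entity class; the sketch below (names and encodings hypothetical) swaps a list-based check for a predicate-based one at run-time:

```java
// Hypothetical sketch of swapping an entity's internal representation at
// run-time behind a Bridge-style implementor interface: the same violates()
// call is served either by scanning a list of bad transitions or by
// evaluating a predicate on the transition's source and destination states.
import java.util.List;
import java.util.function.BiPredicate;

public class RepresentationSwap {
    public interface SpecImplementor { boolean violatesImp(int src, int dst); }

    public static class ListImplementor implements SpecImplementor {
        private final List<int[]> badTransitions;
        public ListImplementor(List<int[]> bad) { badTransitions = bad; }
        public boolean violatesImp(int src, int dst) {
            for (int[] t : badTransitions)          // traversal-based membership test
                if (t[0] == src && t[1] == dst) return true;
            return false;
        }
    }

    public static class PredicateImplementor implements SpecImplementor {
        private final BiPredicate<Integer, Integer> bad;
        public PredicateImplementor(BiPredicate<Integer, Integer> bad) { this.bad = bad; }
        public boolean violatesImp(int src, int dst) { return bad.test(src, dst); }
    }

    public static class SafetySpecification {
        private SpecImplementor imp;
        public SafetySpecification(SpecImplementor imp) { this.imp = imp; }
        public void switchTo(SpecImplementor newImp) { imp = newImp; } // run-time swap
        public boolean violates(int src, int dst) { return imp.violatesImp(src, dst); }
    }

    public static void main(String[] args) {
        SafetySpecification spec =
            new SafetySpecification(new ListImplementor(List.of(new int[]{2, 1})));
        boolean before = spec.violates(2, 1);
        // Swap to a predicate: "from a faulty state, only Status = -1 is allowed".
        spec.switchTo(new PredicateImplementor((s, d) -> s == 2 && d != -1));
        System.out.println(before + " " + spec.violates(2, 1)); // true true
    }
}
```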
Towards this end, we discuss our experience in changing the internal representation of SafetySpecification and Invariant in our framework. We find that the ability to modify the representation of entities in this fashion is especially useful for improving the efficiency of the framework as well as for simplifying the tasks involved in responding to user queries at interaction points. We discuss these applications next.

Improving the efficiency. The initial implementation of the SafetySpecification class consisted of a linked list whose elements would each represent a set of safety-violating transitions. The SafetySpecification class includes a method violates by which we verify whether a given transition t violates the safety specification. In order to verify the safety of t, we needed to traverse the linked list structure of SafetySpecification. The traversal of the SafetySpecification structure was very time-consuming, especially when the size of the state space became large. Since during the synthesis of a fault-tolerant program we need to invoke the method violates in many places, the efficiency of this method significantly affects the overall efficiency of the synthesis. Hence, we changed the data structure used for the internal representation of the SafetySpecification class.

We replaced the linked list structure of the SafetySpecification class with a dummy data structure. Now, for a given transition t, we first take the source and destination states of t (denoted s_t and d_t). In order to verify the safety of t, we then substitute the values of the program variables at s_t and d_t into the state predicates that represent the safety specification (e.g., refer to Section 8.5 or Subsection 8.1.1). If the specification predicate holds for s_t and d_t, then t violates safety. (Note that we represent the safety specification as a set of transitions that the program is not allowed to execute.)
We have applied the same approach for the Invariant class. Therefore, instead of traversing a huge linked list data structure, we check only a predicate in order to determine the safety of a transition or the membership of a state in the invariant.

Reasoning about a query. As we discussed in this section, we have two different implementations of the SafetySpecification class, based on the linked list and the dummy data structures. The latter data structure helps to improve the efficiency of the synthesis when we need to automatically synthesize a fault-tolerant program without user intervention. On the other hand, when users interact with our framework, they may need to know why a particular transition violates the safety specification. To answer this query, the framework uses the information stored in the linked list data structure in order to provide the required reasoning for the users. Thus, in such situations, the framework switches the implementation of the SafetySpecification class from the dummy to the linked list data structure to provide the required reasoning for the developers of fault-tolerance.

8.5 Example: Altitude Controller

In this section, we show how we used our framework to synthesize a simplified version of an altitude switch (ASW) used in an aircraft altitude controller. We have adapted this example from [48], and the output program of our framework is the same as the fault-tolerant program that is manually designed in [48]. This example illustrates the applicability of our framework in the automatic synthesis of practical applications. The program of the altitude switch reads a set of input variables coming from two analog altitude sensors and a digital altitude sensor. Then, the ASW program activates an actuator when the altitude is less than a pre-determined threshold.

The fault-intolerant altitude switch (ASW). The ASW program monitors a set of input variables and generates an output.
There exist five internal variables, a mode variable that determines the operating mode of the program, and four input variables that represent the state of the altitude sensors. The internal variables are as follows: (i) AltBelow is equal to 1 if the altitude is below a specific threshold; otherwise, it is equal to 0; (ii) ActuatorStatus is equal to 1 if the actuator is powered on; otherwise, it is equal to 0; (iii) Init represents the system initialization when it is equal to 1; otherwise, it is equal to 0; (iv) Inhibit is equal to 1 when the actuator power-on is inhibited; otherwise, it is equal to 0, and (v) Reset is equal to 0 if the system is being reset.

The ASW program can be in three different modes: (i) the Initialization mode when the ASW system is initializing; (ii) the Await-Actuator mode if the system is waiting for the actuator to power on, and (iii) the Standby mode. We use an integer variable Status with domain {-1, 0, 1, 2} to represent the system modes in the program, where (i) Status = -1 if the system is in the Initialization mode; (ii) Status = 0 if the system is in the Await-Actuator mode; (iii) Status = 1 if the system is in the Standby mode, and (iv) Status = 2 if the system is in a faulty state.

Moreover, we model the signals that come from the input (analog and digital) altitude sensors using the following variables: (i) AltFail is equal to 1 when the analog and digital altitude meters have failed; (ii) if the system remains in the Initialization mode for more than 0.6 seconds, then the variable InitFailed will be set to 1; otherwise, InitFailed remains 0; (iii) if the condition AltFail = 1 remains true for more than 2 seconds, then the variable AltFailOver will be equal to 1; otherwise, AltFailOver remains 0, and (iv) if the system remains in the Await-Actuator mode for more than 2 seconds, then the variable AwaitOver will be equal to 1; otherwise, AwaitOver remains 0.

The output of the ASW program is identified based on the system mode.
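The Status encoding above can be summarized by a small helper (hypothetical, purely illustrative):

```java
// Sketch of the Status encoding described above: -1 Initialization,
// 0 Await-Actuator, 1 Standby, 2 faulty. The helper itself is hypothetical
// and not part of the framework or the ASW program.
public class AswModes {
    public static String mode(int status) {
        switch (status) {
            case -1: return "Initialization";
            case 0:  return "Await-Actuator";
            case 1:  return "Standby";
            case 2:  return "Faulty";
            default: throw new IllegalArgumentException("status=" + status);
        }
    }

    public static void main(String[] args) {
        System.out.println(mode(-1)); // Initialization
    }
}
```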
The ASW program has an output integer variable WakeupActuator that is equal to 1 if the system is in the Await-Actuator mode and is equal to 0 otherwise. The domain of all variables except Status is equal to {0, 1}. The fault-intolerant program consists of only one process, called Controller. In the input of our framework, we specify the Controller process as follows:

1  process Controller
2  begin
3
4    ((Status == -1) && (Init == 1)) -> Status = 1; Init = 0;
5    |
6    ((Status == 1) && (Reset == 0)) -> Status = -1; Reset = 1;
7    |
8    ((Status == 1) && (AltBelow == 0) && (Inhibit == 0)
9      && (ActuatorStatus == 0)) -> Status = 0; AltBelow = 1;
10   |
11   ((Status == 0) && (ActuatorStatus == 0)) -> Status = 1; ActuatorStatus = 1;
12   |
13   ((Status == 0) && (Reset == 0)) -> Status = -1; Reset = 1;
14
15 read AltBelow, ActuatorStatus, Init, Inhibit, Reset,
16      AltFail, InitFailed, AltFailOver, AwaitOver, Status;
17
18 write WakeupActuator, AltBelow, ActuatorStatus,
19      Init, Inhibit, Reset, Status;
20 end

The program changes its mode from Initialization to Standby when the Init variable is equal to 1. Also, the program goes to the Initialization mode when it is either in the Standby or in the Await-Actuator mode and the reset signal is received. If the program is in the Standby mode, the actuator power-on is not inhibited, and the actuator is not powered on, then the program goes to the Await-Actuator mode. In the Await-Actuator mode, the program either (i) powers on the actuator and goes to the Standby mode, or (ii) goes to the Initialization mode upon receiving the reset signal. The read/write sections in the body of the Controller process identify its read/write restrictions on the program variables.

Faults. If the altitude sensors incur a malfunction, then the state of the program will
We represent the fault actions as follows: 1 .flndt Malfunction 2 begin 4 (InitFailed == 1 ) -> InitFailed = 0; Status = 2; 5 l s (AltFailOver == 1 ) -> AltFaileer = 0; Status = 2; 7 l s (AwaitOver == 1 ) -> AwaitOver = 0; Status = 2; 10 end Safety specification. The problem specification requires that the program does not change its mode from Standby to Await-Actuator if the altitude sensors are failed; i.e., AltF ail is equal to 1. Also, from the faulty state, the program can only go to the Initialization mode. Moreover, in the faulty state, the program can recover if it is not reset. In the input file, we represent the specification as a state predicate. 1 0)) ll 0)))ll 2 ((AltFails == 1) && (Statuss == 1) && (Statusd 1) ll (Statusd 3 ((Statuss == 2) && ((Statusd = 4 ((Statuss == 2) && (Resets == 1)) As we described in Subsection 8.1.1, to distinguish the value of a variable (e.g., AltF ail) at the source of a transition from its value at the destination, we append the variable names with suffixes ’s’ and ’d’ (e.g., AltF ails and AltF ails). Invariant. The invariant of the program consists of the states where the program is not in the faulty state; i.e., Status aé 2. We specify the invariant as follows: 1 invariant 2 3(Status != 2) 187 Initial states. We specify the initial state as follows: 1 init 2 3 sane 4 WakeupActuator = O; 5 AltBelow = 1; 6 ActuatorStatus = O; 7 Init = 1; s Inhibit = O; 9 Reset = 0; 1o AltFail = O; 11 InitFailed = 1; 12 AwaitOver = 1; 13 AltFaileer = 1; 14 Status = -1; Fault-tolerant program. The framework automatically generates the following fault—tolerant program. 
We present the actions of the Controller process as follows:

1  ((Status == -1) && (Init == 1)) -> Status = 1; Init = 0;
2  |
3  ((Status == 1) && (Reset == 0)) -> Status = -1; Reset = 1;
4  |
5  ((Status == 1) && (AltBelow == 0) && (Inhibit == 0)
6    && (ActuatorStatus == 0) && (AltFail == 0))
7    -> Status = 0; AltBelow = 1;
8  |
9  ((Status == 0) && (ActuatorStatus == 0)) -> Status = 1; ActuatorStatus = 1;
10 |
11 ((Status == 0) && (Reset == 0)) -> Status = -1; Reset = 1;
12 |
13
14 ((Status == 2) && (Reset == 0)) -> Status = -1; Reset = 1;

The fault-tolerant program has a new recovery action (cf. Line 14), where it recovers to the Initialization mode from the faulty state (i.e., states where Status = 2 holds). Also, a new constraint has been added to the third action (cf. Lines 5-7), where the program is allowed to change its state to the Await-Actuator mode only when the input sensors are not corrupted; i.e., the condition (AltFail = 0) holds.

8.6 Summary

In this chapter, we presented a framework for adding fault-tolerance to existing fault-intolerant programs. We showed that our framework is extensible in that it permits easy addition of new heuristics that help in reducing the complexity of adding fault-tolerance. The framework also allows one to partially change the internal representation of the different entities used in the synthesis while reusing the other entities. These abilities are especially useful for testing different heuristics as well as for testing the effect (in terms of space, time, etc.) of different internal representations of the entities involved in synthesis. Finally, since we have developed the framework in Java, it is platform-independent; we have used it on Windows and Solaris environments. We also find that this implementation choice makes our framework suitable for pedagogical purposes.
Using our framework, we have synthesized fault-tolerant programs for, among others, token ring, agreement in the presence of Byzantine faults, and agreement in the presence of Byzantine and failstop faults. Thus, these examples demonstrate that the framework can be applied for the cases where we have different types of faults (process restart, Byzantine and failstop), and for the cases where a program is subject to multiple simultaneous faults. 189 Chapter 9 Ongoing Research In this chapter, we present ongoing research work, where we have developed prelimi- nary results. Specifically, we focus on developing heuristics that can extend the scope of efficient synthesis by transforming non-monotonic programs (respectively, specifi- cations) to monotonic. Such heuristics are especially beneficial where for a specific program the monotonicity property (defined in Section 4.3) holds, whereas no guar- antees are provided for the monotonicity of its specification (or vice versa). Towards this end, we present a set of heuristics for transforming non-monotonic programs (respectively, specifications) to monotonic where we benefit from Theorem 4.11 and synthesize fault-tolerant distributed programs in polynomial time. Moreover, in this chapter, we present a SAT-based synthesis approach where we use state-of-the-art SAT solvers to synthesize fault-tolerant distributed programs. In particular, we show how we reduce different sub-problems in the synthesis of fault- tolerant programs to the satisfiability problem. Afterwards, we show how we im- plement our SAT-based approach in the FTSyn framework (presented in Chapter 8). We proceed as follows: In Section 9.1, we present our heuristics for transforming non-monotonic programs (respectively, specifications) to monotonic. Then, in Sec- tion 9.2, we present an algorithm for transforming non—monotonic specifications to 190 monotonic. We demonstrate our transformation algorithms by an example in Section 9.3. 
Subsequently, in Section 9.4, we present our SAT-based synthesis method. We summarize this chapter in Section 9.5.

9.1 Program Transformation

In this section, our goal is to address the following question: Given a fault-intolerant distributed program and its invariant that do not satisfy the monotonicity requirements, how can one modify the program and its invariant such that the monotonicity requirements are met while ensuring that the program satisfies its specification from the modified invariant? To address this question, we first formally define the problem of transforming programs to monotonic (failsafe-ready) programs in Subsection 9.1.1. Then, in Subsection 9.1.2, we present an algorithm for solving the transformation problem. Finally, in Subsection 9.1.3, we show the soundness of our transformation algorithm.

9.1.1 Problem Statement

Given a program p, a state predicate Y, and a Boolean variable x, if p is not positive (respectively, negative) monotonic on Y with respect to x, then our goal is to identify a program p' and a state predicate Y' such that p' is positive (respectively, negative) monotonic on Y' with respect to x. We require p' not to add new computations to the set of computations of p during such a transformation. Thus, Y' should be a subset of Y. Otherwise, if Y' includes a state s, where s ∉ Y, then p' may create new computations from s, which is not desirable. Also, for the same reason, p' must not include new transitions. Thus, we require that the set of transitions of p' on Y' is a subset of the set of transitions of p on Y' (i.e., p'|Y' ⊆ p|Y').
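These two requirements can be stated operationally. The following sketch assumes, as is common in this line of work, that p|Y denotes the transitions of p whose source and target both lie in Y; the helper names are illustrative, not from FTSyn.

```python
# Sketch of the restriction operator and the containment checks of the
# problem statement. States are plain hashable values; p is a set of
# (source, target) pairs; Y is a set of states.

def restrict(p, Y):
    """p|Y: transitions of p that start and end inside the predicate Y
    (assumed definition of the restriction operator)."""
    return {(s0, s1) for (s0, s1) in p if s0 in Y and s1 in Y}

def no_new_computations(p_new, p_old, Y_new, Y_old):
    """The two requirements on the transformed program:
    Y' ⊆ Y  and  p'|Y' ⊆ p|Y'."""
    return Y_new <= Y_old and restrict(p_new, Y_new) <= restrict(p_old, Y_new)

# Tiny example over abstract states a, b, c.
p  = {("a", "b"), ("b", "c"), ("c", "a")}
Y  = {"a", "b", "c"}
p2 = {("a", "b"), ("b", "c")}    # one transition removed
Y2 = {"a", "b"}                  # predicate shrunk
```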
Hence, we state the problem of transforming non-monotonic programs as follows:

Problem 9.1.1 Transforming Non-Monotonic Programs to Monotonic

Given p, Y, spec, and x such that p satisfies spec from Y, and p is not positive (respectively, negative) monotonic on Y with respect to x,

Identify p' and Y' such that

  Y' ⊆ Y, p'|Y' ⊆ p|Y',

  p' is positive (respectively, negative) monotonic on Y' with respect to x, and

  p' satisfies spec from Y'.  □

Before we present our algorithms, we recall the definition of the monotonicity property from Section 4.3. Observe that in the definition of monotonicity, we implicitly refer to transitions (s0, s1) and (s0', s1') where the value of all variables except x is the same in s0 and s0' (respectively, in s1 and s1'). Hence, we introduce the concept of symmetric transitions with respect to x as follows:

Definition 9.1.2 We say two transitions t = (s0, s1) and t' = (s0', s1') are symmetric with respect to a Boolean variable x (denoted t =_x t') iff the condition ((x(s0) = x(s1)) ∧ (x(s0') = x(s1')) ∧ (x(s0) ≠ x(s0'))) holds and the values of all variables other than x in s0 and s0' (respectively, in s1 and s1') are the same.  □

9.1.2 Transformation Algorithm

In this subsection, we present a sound algorithm to solve Problem 9.1.1. We use Definition 9.1.2 in the design of our transformation algorithm (see Figure 9.1). The algorithm To_Positive_Monotonic_Program is an iterative procedure that takes the set of groups of transitions of a distributed program, a state predicate Y, and a Boolean variable x, and generates a distributed program p' and a state predicate Y' such that p' is positive monotonic on Y' with respect to x. Intuitively, our algorithm removes the program transitions that go against the monotonicity property. Removing such transitions may create deadlock states in the program invariant. Hence, we recalculate another invariant to guarantee that no deadlock states exist in the new invariant.
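Definition 9.1.2 can be checked mechanically. A small sketch, assuming states are maps from variable names to values and x names the Boolean variable of interest:

```python
# Sketch: checking Definition 9.1.2 on explicit states represented as
# dictionaries from variable names to values.

def symmetric_wrt(t, t_prime, x):
    """t =_x t': x is unchanged along each transition, differs between the
    two transitions, and all other variables agree pairwise."""
    (s0, s1), (s0p, s1p) = t, t_prime
    same_along = (s0[x] == s1[x]) and (s0p[x] == s1p[x])
    flipped = s0[x] != s0p[x]
    others_equal = all(
        s0[v] == s0p[v] and s1[v] == s1p[v]
        for v in s0 if v != x
    )
    return same_along and flipped and others_equal

s0  = {"x": False, "y": 0}; s1  = {"x": False, "y": 1}
s0p = {"x": True,  "y": 0}; s1p = {"x": True,  "y": 1}
```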
If our algorithm succeeds in finding such an invariant, then we generate a monotonic (failsafe-ready) program. Otherwise, our algorithm declares failure in generating a monotonic program.

To_Positive_Monotonic_Program(p: set of transitions, x: Boolean variable, Y: state predicate)
// p is the union of a set of groups of transitions g0, ..., gm.
{
 Step 1: p' := p; Y' := Y;
 Step 2: repeat {
   Step 2-1: TRrem := {(s0,s1) : (x(s0) = false) ∧ (x(s1) = false) ∧ ((s0,s1) ∈ p'|Y') ∧
                       (∃(s0',s1') : (s0',s1') =_x (s0,s1) : (s0',s1') ∉ p'|Y')};
   Step 2-2: if (TRrem = ∅) then
     Step 2-2-1: Y', p' := Recalculate_Invariant(p', Y');
     Step 2-2-2: if (Y' ≠ ∅) return p', Y';
                 else declare failure in finding a monotonic program;
   Step 2-3: t := (s0,s1), where (s0,s1) ∈ TRrem and s0 has the maximum outdegree;
   Step 2-4: p' := p' − {(s2,s3) : (∃gi : gi ∈ p' : t ∈ gi ∧ (s2,s3) ∈ gi)};
   Step 2-5: Y1 := RemoveDeadlocks(p', Y');
   Step 2-6: p1 := EnsureClosure(p', Y1);
   Step 2-7: p' := p1; Y' := Y1;
 Step 3: } until (Y' = ∅);
 Step 4: declare failure in finding a monotonic program;
}

Figure 9.1: Transforming non-monotonic programs to positive monotonic.

After the initialization, in Step 2-1 (cf. Figure 9.1), we calculate the set of transitions that violate the definition of positive monotonicity. If there exist no such transitions (i.e., TRrem = ∅), then we verify (i) the non-existence of deadlock states in Y', and (ii) the closure of p' in Y'. When we reach Step 2-2-1, we recalculate a valid invariant for p' by invoking the function Recalculate_Invariant (cf. Figure 9.2). Obviously, if we reach Step 2-2-1 in the first iteration, then the input program p and Y inherently satisfy the monotonicity requirements. Note that Steps 2-1 and 2-2 verify the monotonicity of the input program, and hence, we need not develop a separate verification algorithm.
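Step 2-1 can be sketched as follows, assuming each state is a pair (x, rest) of the Boolean x and the remaining variable values; this is an illustrative encoding, not FTSyn's internal representation.

```python
# Sketch of Step 2-1: TR_rem, the x-false transitions in p'|Y' whose
# x-symmetric counterpart is missing from p'|Y'.

def restrict(p, Y):
    return {(s0, s1) for (s0, s1) in p if s0 in Y and s1 in Y}

def flip(s):
    x, rest = s
    return (not x, rest)

def tr_rem(p, Y, positive=True):
    """Violations of positive (x = false side) or negative (x = true side)
    monotonicity, per Step 2-1 and its negative variant."""
    val = not positive            # False for positive monotonicity
    pY = restrict(p, Y)
    return {(s0, s1) for (s0, s1) in pY
            if s0[0] == val and s1[0] == val
            and (flip(s0), flip(s1)) not in pY}

# The x=false transition b0->a0 has no x=true mirror: a violation.
a0, b0 = (False, "a"), (False, "b")
a1, b1 = (True, "a"), (True, "b")
p = {(a0, b0), (a1, b1), (b0, a0)}
Y = {a0, b0, a1, b1}
```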
To recalculate the invariant, we develop an iterative procedure where we first use the function RemoveDeadlocks to remove the existing deadlock states of p in a state predicate S (cf. Figure 9.2). The RemoveDeadlocks function returns the largest subset S1 of S in which there exist no deadlock states; i.e., the computations of p are infinite in S1. After removing the deadlock states of S, there might exist transitions of p that start in S1 and reach the removed states of S. Such transitions violate the closure of S1. Using the function EnsureClosure (cf. Figure 9.2), we remove the (groups of) transitions that violate the closure of S1. We repeat this procedure until no deadlock states remain or we have removed all states of S. (We invoke the function HasDeadlocks to verify whether there exist deadlock states in a state predicate S of a program p.)

Recalculate_Invariant(p: set of transitions, S: state predicate)
// p is the union of a set of groups of transitions g0, ..., gm.
{ S' := S; p' := p;
  repeat {
    S1 := RemoveDeadlocks(p', S');
    p1 := EnsureClosure(p', S1);
    p' := p1; S' := S1;
  } until (¬HasDeadlocks(p', S') ∨ S' = ∅);
  return S', p';
}

RemoveDeadlocks(p: set of transitions, S: state predicate)
// Returns the largest subset of S such that computations of p within that subset are infinite.
{ S' := S;
  while (∃s0 : s0 ∈ S' : (∀s1 : s1 ∈ S' : (s0,s1) ∉ p))
    S' := S' − {s0};
  return S';
}

HasDeadlocks(p: set of transitions, S: state predicate)
// Verifies the existence of deadlock states in S.
{ if (∃s0 : s0 ∈ S : (∀s1 : s1 ∈ S : (s0,s1) ∉ p)) return true;
  return false;
}

EnsureClosure(p: set of transitions, S: state predicate)
// p is the union of a set of groups of transitions g0, ..., gm.
{ return p − {(s0,s1) : (∃gi : gi ∈ p : ((s0,s1) ∈ gi) ∧ (∃(s0',s1') : (s0',s1') ∈ gi : (s0' ∈ S ∧ s1' ∉ S)))};
}

Figure 9.2: Algorithms for removing deadlock states and ensuring the closure of the invariant.
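The fixpoint structure of Figure 9.2 can be sketched as follows. Groups are ignored here for brevity, so this EnsureClosure drops individual closure-violating transitions rather than whole groups.

```python
# Sketch of Figure 9.2 over explicit transition sets (no grouping).

def remove_deadlocks(p, S):
    """Largest subset of S in which every state has an outgoing
    transition that stays inside the subset."""
    S = set(S)
    changed = True
    while changed:
        changed = False
        for s0 in list(S):
            if not any(s1 in S for (t0, s1) in p if t0 == s0):
                S.discard(s0)
                changed = True
    return S

def ensure_closure(p, S):
    """Drop transitions that leave S from inside S."""
    return {(s0, s1) for (s0, s1) in p if not (s0 in S and s1 not in S)}

def recalculate_invariant(p, S):
    while True:
        S1 = remove_deadlocks(p, S)
        p = ensure_closure(p, S1)
        if S1 == S or not S1:
            return S1, p
        S = S1

# 'c' has no outgoing transition, so it is a deadlock state and must go;
# the transition b -> c then violates closure and is dropped as well.
p = {("a", "b"), ("b", "a"), ("b", "c")}
S = {"a", "b", "c"}
S2, p2 = recalculate_invariant(p, S)
```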
In Step 2-3 (see Figure 9.1), we select one of the transitions of TRrem, say t, whose source state has the maximum number of outgoing transitions (i.e., the maximum outdegree). Afterwards, we remove the group of transitions associated with t (cf. Step 2-4). In this way, we reduce the chance of creating more deadlock states. Then, since the removal of transitions may create deadlock states, we invoke RemoveDeadlocks (in Step 2-5). Afterwards, we use EnsureClosure to remove the transitions (and their associated groups) that violate the closure of Y1. We continue the iterative procedure of the algorithm To_Positive_Monotonic_Program until, in some iteration, either (i) the state predicate Y' becomes empty (in Step 3 or in Step 2-2-2), or (ii) we find a positive monotonic program (in Step 2-2-2).

Likewise, we design an algorithm To_Negative_Monotonic_Program for transforming distributed programs to negative monotonic programs. The only difference between this algorithm and To_Positive_Monotonic_Program is in calculating the set of transitions TRrem (see Step 2-1 in Figure 9.1), where we replace the condition ((x(s0) = false) ∧ (x(s1) = false)) with ((x(s0) = true) ∧ (x(s1) = true)).

9.1.3 Soundness

In this subsection, we show that the algorithm To_Positive_Monotonic_Program (cf. Figure 9.1) is sound; i.e., the transformed program satisfies the requirements of Problem 9.1.1. Towards this end, we make the following observations:

Observation 9.1.3 The function RemoveDeadlocks returns a subset S' of a predicate S in which the computations of program p are infinite.

Proof. Since RemoveDeadlocks only removes states with no outgoing program transitions, it follows that S' does not contain new states (i.e., S' ⊆ S). Also, every state that remains in S' has at least one outgoing transition in p; otherwise, it would have been removed. Therefore, the computations of p are infinite in S'.
□

Observation 9.1.4 The functions RemoveDeadlocks and EnsureClosure do not add any new transitions to the set of transitions of program p.

Proof. The proof follows by construction.  □

Observation 9.1.5 The function Recalculate_Invariant does not add any new states (respectively, transitions) to the invariant (respectively, the set of transitions) of program p.

Proof. The proof follows from Observations 9.1.3 and 9.1.4.  □

Theorem 9.1.6 The algorithm To_Positive_Monotonic_Program is sound.

Proof. We show that the program generated by To_Positive_Monotonic_Program satisfies the requirements of Problem 9.1.1.

• Y' ⊆ Y. The algorithm To_Positive_Monotonic_Program calculates the state predicate Y' by invoking Recalculate_Invariant (in Step 2-2-1) and RemoveDeadlocks (in Step 2-5). Hence, using Observations 9.1.3 - 9.1.5, it follows that Y' ⊆ Y.

• p'|Y' ⊆ p|Y'. The algorithm To_Positive_Monotonic_Program modifies the transitions of the input program p in Steps 2-2-1, 2-4, and 2-6. Based on Observations 9.1.4 and 9.1.5, Steps 2-2-1 and 2-6 do not add any new transitions to the set of transitions p|Y'. Also, by construction, Step 2-4 does not add new transitions to p|Y' either. Thus, it follows that p'|Y' ⊆ p|Y'.

• p' is positive monotonic on Y' with respect to x. The set of transitions TRrem identifies the transitions of p|Y that violate the definition of positive monotonicity of p, and in the final iteration of the algorithm To_Positive_Monotonic_Program the set TRrem becomes empty. It follows that when the algorithm terminates, there exist no transitions in p'|Y' that violate the positive monotonicity of p' on Y'. As a result, the program p' returned by To_Positive_Monotonic_Program is positive monotonic on Y' with respect to x.

• p' satisfies spec from Y'. Based on Observation 9.1.3, Y' is a subset of Y in which the computations of p are infinite.
Also, using the requirements Y' ⊆ Y and p'|Y' ⊆ p|Y', it follows that the computations of p' in Y' are a subset of the computations of p in Y'. Since, starting in Y, every computation of p is in spec, it follows that, starting in Y', every computation of p' is in spec. Also, by construction, Y' is closed in p'. Thus, p' satisfies spec from Y'.

Based on the above discussion, it follows that To_Positive_Monotonic_Program is sound.  □

Theorem 9.1.7 The complexity of the algorithm To_Positive_Monotonic_Program is polynomial in the state space of the input program.

Proof. The maximum number of iterations of the while loop in the body of the RemoveDeadlocks function (cf. Figure 9.2) is in the order of |S|. Also, for program p, since S ⊆ Sp, it follows that the worst-case complexity of RemoveDeadlocks is O(|Sp|). A similar reasoning shows that the worst-case complexity of HasDeadlocks is O(|Sp|). Also, the number of groups of transitions of p is polynomial in |Sp| since, in a distributed program, each transition is associated with a group of transitions, and the number of transitions included in each process is in the order of |Sp|². Moreover, by construction, the size of each group is in the order of |Sp| as well. As a result, the worst-case complexity of EnsureClosure (cf. Figure 9.2) is polynomial in |Sp|.

Based on the above discussion, the complexity of Recalculate_Invariant is polynomial in |Sp| since the loop inside this function can iterate at most |Sp| times. Now, in the To_Positive_Monotonic_Program algorithm, the maximum number of iterations of the main loop cannot exceed |Y|; in the worst case, the algorithm removes all states in Y and declares failure in Step 4. Also, each step of the algorithm has polynomial-time complexity based on the above discussion. Therefore, the complexity of To_Positive_Monotonic_Program is polynomial in the state space of the input program.
□

9.2 Specification Transformation

In this section, our goal is to address the following question: How can safety specifications be strengthened to meet the monotonicity requirements? To address this question, in Subsection 9.2.1, we present a formal definition of the problem of transforming non-monotonic specifications to monotonic ones. Then, in Subsection 9.2.2, we present a sound algorithm for solving the transformation problem.

9.2.1 Problem Statement

Given a safety specification spec_sf, a state predicate Y, and a Boolean variable x, if spec_sf is not positive (respectively, negative) monotonic on Y with respect to x, then our goal is to derive a specification spec'_sf that is positive (respectively, negative) monotonic on Y with respect to x. In such a derivation, we require that if a transition t satisfies spec'_sf, then t satisfies spec_sf as well. As a result, spec'_sf will be a strengthened version of spec_sf. Hence, we state the problem of transforming non-monotonic specifications to monotonic as follows:

Problem 9.2.1 Transforming Non-Monotonic Specifications to Monotonic

Given Y, spec_sf, and x such that spec_sf is not positive (respectively, negative) monotonic on Y with respect to x,

Identify spec'_sf such that

  spec_sf ⊆ spec'_sf, and

  spec'_sf is positive (respectively, negative) monotonic on Y with respect to x.  □

Note that we represent the safety specifications spec_sf and spec'_sf as two sets of bad transitions in the state space that must not occur in program computations (cf. Section 2). Thus, the condition spec_sf ⊆ spec'_sf states that spec'_sf is a restricted version of spec_sf obtained by adding more transitions to spec_sf; i.e., by strengthening spec_sf.

9.2.2 Transformation Algorithm

To address the transformation Problem 9.2.1 for positive monotonicity, we present an algorithm that takes a safety specification spec_sf, a state predicate Y, and a Boolean variable x, and generates a safety specification spec'_sf
such that spec'_sf is positive monotonic on Y with respect to x.

To_Positive_Monotonic_Specification(spec_sf: safety specification, Y: state predicate, x: Boolean variable)
{
 Step 1: TRadd := {(s0,s1) : (x(s0) = false) ∧ (x(s1) = false) ∧ (s0 ∈ Y) ∧ (s1 ∈ Y) ∧ ((s0,s1) ∉ spec_sf) ∧
                   (∃(s0',s1') : (s0',s1') =_x (s0,s1) : (s0',s1') ∈ spec_sf)};
 Step 2: return spec_sf ∪ TRadd;
}

Figure 9.3: Transforming non-monotonic specifications to monotonic.

In Step 1, the algorithm To_Positive_Monotonic_Specification calculates the set of transitions that violate the definition of positive monotonicity of the specification. Then, the algorithm strengthens the specification spec_sf by adding the set of good transitions TRadd to the existing set of bad transitions (specified by spec_sf) in order to construct a new safety specification spec'_sf. The new specification spec'_sf is represented by the new set of bad transitions spec_sf ∪ TRadd. Since the specification returned by To_Positive_Monotonic_Specification is a strengthened version of the original specification spec_sf, the soundness of the algorithm follows accordingly. (In the case of negative monotonic specifications, we obtain a similar algorithm by replacing the condition ((x(s0) = false) ∧ (x(s1) = false)) with ((x(s0) = true) ∧ (x(s1) = true)) in Step 1 in Figure 9.3.)

Theorem 9.2.2 The algorithm To_Positive_Monotonic_Specification is sound.  □

Theorem 9.2.3 The complexity of the algorithm To_Positive_Monotonic_Specification is polynomial in the size of Y.  □

Comment on strengthening the specification. Strengthening the specification does not destroy the fault-safe property of the specification. Specifically, the transformation of a specification to a monotonic one adds new transitions to the set of bad transitions that must not occur in program computations. Since such new transitions are program transitions, no fault transition will be included as a safety-violating transition.
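Figure 9.3 can be sketched directly over explicit sets. As before, states are (x, rest) pairs; this encoding is illustrative, not FTSyn's.

```python
# Sketch of Figure 9.3: strengthening spec_sf so that an x-false
# transition inside Y becomes bad whenever its x-symmetric counterpart
# is already bad.

def flip(s):
    x, rest = s
    return (not x, rest)

def to_positive_monotonic_spec(spec_sf, Y):
    tr_add = {
        (s0, s1)
        for s0 in Y for s1 in Y
        if s0[0] is False and s1[0] is False
        and (s0, s1) not in spec_sf
        and (flip(s0), flip(s1)) in spec_sf
    }
    return spec_sf | tr_add

a0, b0 = (False, "a"), (False, "b")
a1, b1 = (True, "a"), (True, "b")
Y = {a0, b0, a1, b1}
spec = {(a1, b1)}                 # only the x=true copy is bad so far
spec2 = to_positive_monotonic_spec(spec, Y)
```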
As a result, the fault-safe property of the specification is preserved by the transformation.

Also, since we add new transitions to the specification during the transformation, there may exist program transitions in the invariant that do not violate the original specification but violate the strengthened monotonic specification. Such transitions must not occur in the computations of the transformed program; otherwise, the program would violate the safety of the strengthened specification. In the next section, in the context of an example, we illustrate how we identify and remove such transitions from the invariant and then recalculate a new invariant.

9.3 Example: Distributed Control System

In this section, we present an example where we use our transformation algorithms for the efficient addition of failsafe fault-tolerance. Specifically, we first present a distributed control program that is subject to input faults; i.e., faults that perturb the input sensors of the program. Then, we transform the specification of the control program to a positive monotonic specification. Since the program is negative monotonic, efficient (i.e., polynomial-time) addition of failsafe fault-tolerance to it becomes possible.

The fault-intolerant process-control program (PC). The program PC consists of three processes P1, P2, and P3 connected by a loosely-coupled network. The processes P1 and P2 respectively control the speeds of two electro motors M1 and M2 located in the same environment but in distant places. The motors M1 and M2 provide the driving force of a conveyer belt that can move in two different directions: left-to-right and right-to-left. The conveyer belt carries fragile objects that are loaded when the belt is stationary. Once the objects are loaded, the conveyer belt moves with an increasing speed up to a maximum speed. Then, the belt stops so that the already loaded objects can be unloaded and new objects can be loaded.
The speed of the conveyer belt depends on the speeds of M1 and M2. The speeds of M1 and M2 should be synchronous; i.e., the speed of M1 is equal to the speed of M2 or is at most one unit more than the speed of M2. When the two electro motors reach their maximum speed, process P3 resets their speeds to 0 and the whole process repeats. It is required that the temperature of the environment in which the electro motors function does not exceed a pre-determined threshold.

The program PC has four integer variables x, y, z, and w. The variable x (respectively, z) is a counter that contains the speed of M1 (respectively, M2). The domain of x (respectively, z) is equal to {0, ..., c}, where c is an integer constant. The variable y is used to represent the movement direction of the conveyer belt. Specifically, if the direction of the conveyer belt is from left to right, then the value of y alternates between 1 and 0. In the case where the conveyer belt moves from right to left, the value of y alternates between -1 and 0. Moreover, the value of y is equal to 0 if x = z; otherwise, y could be 1 or -1. As a result, the domain of y is equal to {-1, 0, 1}. The variable w represents the temperature of the environment, which could be at three different levels, normal, alarming, and critical, respectively represented by the three values 0, 1, and 2.

Let (x, y, z, w) denote the global state of the distributed program. The initial state of the program is (0, 0, 0, 0), where process P1 starts to speed up (i.e., increment its counter). Process P1 is responsible for incrementing x, and process P2 increments z. When both counters reach the maximum value c (i.e., (x = c) ∧ (z = c)), the counting operation is restarted by process P3.

Read/write restrictions. Process P1 is allowed to read x, y, and z, and it can only write x and y. Process P2 can read x, y, and z, but it is only allowed to write y and z.
Process P3 is allowed to read all program variables; however, it can only write x and z. Note that P1 and P2 cannot read w due to the distribution restrictions.

Program actions. We present the action of process P1 as follows:

PC1: (x = z) ∧ (x < c)  →  x := x + 1; y := 1 | -1;

When M1 and M2 have the same speed (i.e., x = z), P1 increments the value of x (i.e., the speed of M1). The action PC1 indeed represents two actions depending on the direction of the belt (i.e., the value of y). The action of process P2 is as follows:

PC2: (x = z + 1)  →  y := 0; z := z + 1;

Process P2 increments the value of z (i.e., the speed of M2) and resets y to zero since z has become equal to x. Finally, the transitions of P3 are represented by the following action:

PC3: (x = c) ∧ (z = c)  →  x := 0; z := 0;

If both counters have reached the maximum value c (i.e., M1 and M2 have reached their maximum speed), then P3 resets their values to 0.

Safety specification. For application-specific purposes, the safety specification stipulates that in the case where the belt is moving in the right-to-left direction and the temperature is at the critical level (i.e., w = 2), the speed of M2 must remain less than the speed of M1; i.e., the speed of the belt must not be increased at the critical temperature. We represent the safety specification of PC by spec_PC, where

spec_PC = {(s0, s1) : (y(s0) = -1) ∧ (x(s1) = z(s1)) ∧ (w(s1) = 2)}

Invariant. The temperature should be at the normal level under ordinary working conditions. Hence, we represent the invariant of the program by the state predicate S_PC, where

S_PC = {s : (w(s) = 0) ∧ ((x(s) = z(s)) ∨ (x(s) = z(s) + 1))}

M1 and M2 are synchronized in the invariant; i.e., ((x = z) ∨ (x = z + 1)).

Faults. Faults may change the value of the temperature sensor to 1 or 2 when the speed of M1 is ahead of M2. We represent the fault transitions by the following action:

F: (x = z + 1)  →  w := 1 | 2;

Fault-span.
We represent the fault-span of the program PC by the following state predicate:

T_PC = {s : ((x(s) = z(s)) ∨ (x(s) = z(s) + 1)) ∧ ((x(s) = z(s)) ⇒ (y(s) = 0)) ∧ ((x(s) = z(s) + 1) ⇒ ((y(s) = 1) ∨ (y(s) = -1)))}

Note that the value of w may vary over its entire domain {0, 1, 2} in the fault-span.

Negative monotonicity of the program PC. Since w is not a Boolean variable, we apply the definition of program monotonicity to the program PC by partitioning the domain of w into zero and non-zero values. We let the Boolean value true correspond to the non-zero values of w and the Boolean value false correspond to (w = 0). Since there exists no transition in PC|S_PC where the value of w is non-zero, it follows that the definition of negative monotonicity holds for the program PC. Thus, the program PC is negative monotonic on S_PC with respect to w.

Positive monotonicity of spec_PC. Now, we investigate the positive monotonicity of spec_PC on S_PC with respect to w. First, we identify the set of transitions (s0, s1) that satisfy the following conditions: (i) s0, s1 ∈ S_PC; (ii) (w(s0) = 0) ∧ (w(s1) = 0); (iii) (s0, s1) does not violate spec_PC; and (iv) there exists a transition (s0', s1') that is grouped with (s0, s1) due to the inability to read w, where (s0', s1') violates spec_PC and (w(s0') ≠ 0) ∧ (w(s1') ≠ 0). Thus, for the specification spec_PC, the algorithm To_Positive_Monotonic_Specification calculates the set of transitions TRadd (cf. Figure 9.3) as follows:
Hence, we strengthen the safety specification by including the set of transitions TRadd in the set of transitions that violate safety. As a result, the new safety specification spec’PC = specpc U TRadd satisfies the definition of positive monotonicity for spec’PC on Spa with respect to w. Recalculating the invariant of the program PC. After strengthening specPC, the program transitions in TRadd (1 (PC [3120) violate spec’PC although they do not violate Sp€Cpc. When we remove the set of transitions TRadd 0 (PC ISpC), we create the following deadlock states in the invariant S p0. Deadlocks = {s : (a:(s) = z(s) + 1) A (y(s) = —-1) A (111(3) = 0)} We invoke the algorithm RecalculateJnvariant (cf. Figure 9.2) to recalculate a new invariant Sgc where the computations of PC are infinite in S,Pc. In the first iteration of the algorithm RecalculateJnvariant, we remove the states in Deadlocks from the invariant S P0. Since the removal of the above deadlock states does not introduce new deadlock states, we calculate the new invariant 31301 where Sic = {s = «(1(3) = 2(8)) A1118) = 0)) v ((113) = 2(8) + 1) A (y(s) ll ._.1 V v v > A 8 A CO V II C v Wu The action of the process P1 in the new invariant is as follows: PCi: (:c=z)A(:1: x:=:1:+1;y;=1; 204 Note that the above action only assigns 1 to y; i.e., all transitions corresponding to the action that assigns -1 to y have been removed during synthesis. Now, we represent the transitions of the process P2 by the following action: PCéz (y=1)A(:1:=z+1) ——+ y:=0;z:=z+1; The action of process P3 remains as is. Since program PC’ is negative monotonic on Sfisc with respect to w and its new specification spec’PC is positive monotonic on S},C with respect to w, failsafe fault-tolerance can be added to PC’ in polynomial time (using Theorem 4.11). 
In fact, in this case, the program PC' is failsafe F-tolerant to spec'_PC from S'_PC.

9.4 SAT-based Synthesis of Fault-Tolerance

In this section, we investigate the use of automated reasoning techniques in the synthesis of fault-tolerant distributed programs. There exist several heuristic-based approaches [14] (also see Chapter 5) for the polynomial-time synthesis of fault-tolerant distributed programs. Each heuristic identifies a deterministic order for the verification of the synthesis requirements, where the synthesis requirements are conditions that have to be met by program states and transitions during synthesis so that the synthesized fault-tolerant program is correct by construction. As a result, the efficiency of synthesis is directly affected by the efficiency of verifying such synthesis requirements. Thus, it is desirable to benefit from existing automated reasoning tools to efficiently verify the synthesis requirements. Specifically, in this section, we focus our attention on using state-of-the-art SAT solvers during synthesis, where we express different synthesis requirements in terms of the satisfiability problem and use existing SAT solvers to efficiently verify those requirements.

We organize this section as follows: First, in Subsection 9.4.1, we give an overview of our SAT-based approach for the synthesis of fault-tolerant distributed programs. In Subsection 9.4.2, we show how we formulate each synthesis requirement as an instance of the satisfiability problem. In Subsection 9.4.3, we discuss the implementation of our SAT-based synthesis method in the FTSyn framework.

9.4.1 Synthesis Method

In this subsection, we present a general overview of our SAT-based synthesis method. Specifically, in Subsection 9.4.1.1, we state the problem of reducing the synthesis requirements to the satisfiability problem. Subsequently, in Subsection 9.4.1.2, we provide a strategy for using SAT solvers during synthesis for the verification of the synthesis requirements.
9.4.1.1 Synthesis Requirements Verification

The non-deterministic synthesis algorithm presented in Section 2.8 identifies six requirements that must be verified during the synthesis of a fault-tolerant program from its fault-intolerant version. For the reader's convenience, we repeat the Add_ft algorithm in Figure 9.4:

Add_ft(p, f: set of transitions, S: state predicate, spec: specification, g0, g1, ..., gmax: groups of transitions)
{
  ms := {s0 : ∃s1, s2, ..., sn : (∀j : 0 ≤ j < n : (sj, s(j+1)) ∈ f) ∧ ((s(n-1), sn) violates spec)};
  ...
}

Figure 9.4: The non-deterministic algorithm Add_ft (see Section 2.8).

Let v0, ..., vq denote the variables of the program p, where the domain of vi is Di. To represent each state s ∈ Sp, we define a function SBF : Sp → B, where B is the set of Boolean formulas over program variables:

SBF(s) = ∧_{i=0}^{q} (vi = li), where li ∈ Di

The SBF transformation generates a unique Boolean formula corresponding to each state s ∈ Sp; i.e., SBF is a one-to-one function. However, the formula SBF(s) is specified in terms of equalities over program variables, i.e., terms of the form (vi = li). To generate a formula that consists of Boolean variables only, we have to transform each term (vi = li) in SBF(s) into a formula over Boolean variables. Towards this end, we introduce ⌈log(|Di|)⌉ Boolean variables corresponding to each program variable vi, where |Di| represents the size of the domain of vi. In other words, if the domain of vi includes |Di| distinct values, then we need ⌈log(|Di|)⌉ Boolean variables to encode each value assignment to vi by a unique binary code of length ⌈log(|Di|)⌉. Therefore, the maximum size of SBF(s) is equal to (q + 1) · ⌈log(K)⌉, where K is the size of the domain of a variable vj (0 ≤ j ≤ q) that has the largest domain.

Representing a state predicate. By definition, a state predicate is the union of a set of states in the state space of p (i.e., Sp). Thus, to represent a state predicate X ⊆ Sp, we use the function SBF to define a function SPBF : Pow(Sp) → B as follows:

SPBF(X) = ∨_{s : s ∈ X} SBF(s)

The transformation SPBF takes the disjunction of the Boolean formulas corresponding to all states in X.
The resulting formula is a formula c0 ∨ c1 ∨ ... ∨ c|X| in disjunctive normal form, where each conjunction cj (0 ≤ j ≤ |X|) represents a state.

Representing a transition. To represent a transition (s0, s1) ∈ Sp × Sp, we use the SBF function and define the function TBF : Sp × Sp → B, where

TBF((s0, s1)) = SBF(s0) ∧ SBF(s1)

We represent a transition (s0, s1) as the conjunction of the Boolean formula that represents its source state s0 and the Boolean formula that represents its destination state s1. One could argue that TBF should be defined as SBF(s0) ⇒ SBF(s1). However, in that case TBF((s0, s1)) would hold for all transitions terminating at s1, and the Boolean formula SBF(s0) ⇒ SBF(s1) would represent more than a single transition. Hence, to represent an individual transition (s0, s1), we use the conjunction of SBF(s0) and SBF(s1).

Representing a transition predicate. We use an approach similar to the one we used for defining state predicates. In other words, a transition predicate Ap ⊆ Sp × Sp is the union of a set of transitions in the state space Sp. Hence, we define the function TPBF : Pow(Sp × Sp) → B to represent a set of transitions Ap, where

TPBF(Ap) = ∨_{(s0,s1) : (s0,s1) ∈ Ap} TBF((s0, s1))

Note that we use transition predicates to model the set of program transitions, a group of transitions, and the safety specification. For example, if spec_sf represents the safety specification of a program p, then TPBF(spec_sf) generates a Boolean formula corresponding to spec_sf.

9.4.2.2 Formulating Synthesis Requirements

In this subsection, we show how we formulate the requirements F1-F6 of the non-deterministic algorithm presented in Subsection 9.4.1. Towards this end, we use the functions presented in the previous subsection.

We observe that the condition F1 ≡ (p'|S' ⊆ p|S') verifies whether the set of transitions p'|S' is a subset of the set of transitions p|S'.
Since p′|S′ and p|S′ are transition predicates, we use TPBF to generate the Boolean formulas corresponding to p′|S′ and p|S′. Hence, to verify F1, we verify the validity of TPBF(p′|S′) ⇒ TPBF(p|S′) (i.e., the unsatisfiability of TPBF(p′|S′) ∧ ¬TPBF(p|S′)). Likewise, for the requirements F2 ≡ (S′ ⇒ T′) and F6 ≡ (S′ ⇒ S), we respectively verify the validity of SPBF(S′) ⇒ SPBF(T′) and SPBF(S′) ⇒ SPBF(S).

To verify the closure of the state predicate S′ in the set of transitions of p′ (cf. Figure 9.4), we verify the validity of CLBF(S′, p′), where

CLBF(S′, p′) = ∧_{(s0,s1) : (s0,s1) ∈ p′} ((SBF(s0) ⇒ SPBF(S′)) ⇒ (SBF(s1) ⇒ SPBF(S′)))

To verify F3, we simply verify the satisfiability of SPBF(T′) ∧ SPBF(ms) and TPBF(p′|T′) ∧ TPBF(mt). If these two formulas are unsatisfiable then F3 is satisfied.

The requirement F5 stipulates that there exist no cycles in the set of transitions of p′|(T′−S′). As a result, we have to formulate the cycle detection problem in terms of a Boolean formula. To achieve this goal, we adopt the techniques used in existing approaches for symbolic cycle detection [49, 50, 51], where one generates a Boolean formula whose satisfiability shows the existence of a non-progress cycle in p′|(T′−S′). Towards this end, we define a transformation Reach(s, Ap) from Sp × Pow(Sp × Sp) to the set of Boolean formulas B, where Pow(Sp × Sp) is the power set of (Sp × Sp), and

Reach(s, Ap) = SPBF(R), where R = {s′ : s′ is reachable from s by transitions of Ap}

Using the function Reach, we can construct a Boolean formula that represents the set of states reachable from a particular state s ∈ Sp. Now, to verify whether s is in a cycle, we only need to verify the satisfiability of Cycle(s, Ap), where

Cycle(s, Ap) ≡ (SBF(s) ∧ Reach(s, Ap))

If Cycle(s, Ap) is satisfiable then s is in a cycle in the graph constructed by the set of transitions Ap.
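Operationally, Reach and Cycle correspond to a plain graph search: s lies on a cycle iff s can reach itself, and Reach(s, Ap) ≡ false iff s has no outgoing transition. A minimal sketch over explicitly enumerated states (the symbolic approaches [49, 50, 51] exist precisely to avoid this enumeration); the function names are illustrative:

```python
from collections import deque

def reach(s, transitions):
    """R = {s' : s' reachable from s by one or more transitions of Ap}."""
    succ = {}
    for (s0, s1) in transitions:
        succ.setdefault(s0, set()).add(s1)
    seen, frontier = set(), deque(succ.get(s, ()))
    while frontier:
        cur = frontier.popleft()
        if cur not in seen:
            seen.add(cur)
            frontier.extend(succ.get(cur, ()))
    return seen

def in_cycle(s, transitions):
    """Cycle(s, Ap) is satisfiable iff s can reach itself."""
    return s in reach(s, transitions)

def is_deadlock(s, transitions):
    """Reach(s, Ap) = false iff s has no outgoing path: a deadlock state (F4)."""
    return not reach(s, transitions)
```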
In the case where Reach(s, Ap) ≡ false, it follows that s is a deadlock state in the state transition graph of Ap. Thus, using the invalidity of Reach(s, Ap), we conclude that s is a deadlock state (i.e., we verify F4 in Figure 9.4).

9.4.3 Implementing SAT-based Synthesis

In this subsection, we present an overview of our implementation strategy, where we implement our SAT-based synthesis method in the FTSyn framework presented in Chapter 8. Towards this end, we focus only on the part of the implementation related to the verification of requirement F3 (cf. Figure 9.4), since the implementation approach for verifying the other synthesis requirements is similar.

Given a program p, its groups of transitions g0, ..., gm, and its safety specification spec_sf, our goal is to identify the groups of transitions whose transitions do not violate spec_sf; i.e., the safe groups. In the initial implementation of FTSyn, we exhaustively verify the safety of the transitions of a group gi ∈ p (0 ≤ i ≤ m). Exhaustive verification is inefficient when the size of a group is very large. Hence, we expect our SAT-based approach to provide better performance in verifying the safety of transition groups.

In the rest of this section, we proceed as follows: First, we present the necessary transformation for formulating the safety verification problem. Then, we introduce the different layers of our implementation in FTSyn for solving this problem.

Safety verification problem. For the program p, we say a group gi of transitions is safe iff no transition (s0, s1) ∈ gi violates spec_sf. Since we represent spec_sf as a set of transitions that must not occur in program computations, we say gi is safe iff the set of transitions of gi does not intersect spec_sf.
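Because both the group and the safety specification are sets of transitions, the satisfiability check formalized in the following paragraphs amounts to an intersection-emptiness test. A sketch, with transitions over a single hypothetical variable x written as (x_source, x_destination) pairs:

```python
def is_safe(group, spec_sf):
    """A group is safe iff none of its transitions occurs in spec_sf,
    i.e., iff the conjunction of the two transition predicates is unsatisfiable."""
    return not (set(group) & set(spec_sf))

spec_sf = {(-1, 1)}             # transitions that must never occur
g0 = {(-1, 0), (0, 1)}          # shares nothing with spec_sf: safe
g1 = {(-1, 1), (0, 0)}          # contains a forbidden transition: unsafe
```

The SAT query delegated to zChaff decides exactly this question, but symbolically, without enumerating the (possibly very large) group.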
Formally, we use the transformation Safe(gi, spec_sf) to represent the safety of gi, where

Safe(gi, spec_sf) = TPBF(gi) ∧ TPBF(spec_sf)

To verify the safety of gi, we verify the satisfiability of Safe(gi, spec_sf). If Safe(gi, spec_sf) is satisfiable then the group gi intersects spec_sf; i.e., gi includes a transition that violates safety. Thus, Safe(gi, spec_sf) is satisfiable iff gi is not safe.

The layers of SAT-based safety verification. To solve the safety verification problem in FTSyn, we implement the following three layers: Boolean formula generation, CNF formula generation, and native method invocation. In the first layer, we use a Java API package provided by the Alloy analyzer [52] of MIT to formulate the safety verification problem in terms of a Boolean formula. Then, in the CNF formula generation layer, we transform the Boolean formula to Conjunctive Normal Form (CNF), as existing SAT solvers only accept formulas in CNF format. We use the SAT solver zChaff [53], since zChaff was one of the most efficient SAT solvers at the time we implemented our SAT-based approach. Towards this end, we implement a Java native method in which we invoke zChaff to verify the satisfiability of the computed CNF formulas. The CNF formula is satisfiable iff the group of transitions whose safety is being verified is not safe. Now, we discuss the implementation of each layer.

• Boolean formula generation. To generate the Boolean formulas, we first introduce a set of Boolean variables by which we encode the value assignments to program variables. For example, if a program p has an integer variable x with the domain {−1, 0, 1} then we use two Boolean variables a1 and a2 to represent the terms (x = −1), (x = 0), and (x = 1) respectively by the following Boolean formulas: (a1 ∧ a2), (¬a1 ∧ a2), and (a1 ∧ ¬a2), where ¬aj is the complement of aj (1 ≤ j ≤ 2).
Hence, we represent the state predicate (x = 0) ∨ (x = 1) by the Boolean formula (¬a1 ∧ a2) ∨ (a1 ∧ ¬a2). Note that since the domain of x contains only three values, the term (¬a1 ∧ ¬a2) is never used in the transformation of state predicates to Boolean formulas.

In the generation of a Boolean formula corresponding to a transition, say (s0, s1), the value of a specific variable may differ between s0 and s1. Thus, using a single set of Boolean variables (e.g., a1 and a2 in the above example) to represent both the source and the destination states may result in the generation of contradictory Boolean formulas. To illustrate this problem, consider the above-mentioned example where we use two Boolean variables a1 and a2 to represent value assignments to an integer variable x. Suppose that we need to generate the Boolean formula corresponding to a transition (s0, s1) where the value of x at s0 is −1 (denoted x(s0) = −1) and the program changes the value of x to 0 during the transition (s0, s1) (i.e., x(s1) = 0). Now, if we formulate (s0, s1) using only the Boolean variables a1 and a2, the resulting formula equals (a1 ∧ a2) ∧ (¬a1 ∧ a2), which is a logical contradiction. Hence, we need to distinguish the value assignments to variables at the source and at the destination of program transitions.

To distinguish the value assignment to a specific variable in a transition, we introduce two separate sets of Boolean variables representing the value of that variable at the source and at the destination state. For example, we introduce two new Boolean variables b1 and b2 to represent the value assignment to variable x in the destination of transitions. Thus, the transition (s0, s1), where x(s0) = −1 and x(s1) = 0, is formulated as (a1 ∧ a2) ∧ (¬b1 ∧ b2).

• CNF formula generation.
Using the approach presented above, we transform the safety specification and each group of transitions into a Boolean formula over the variables introduced for encoding the value assignments to program variables. Since zChaff requires its input formula in the DIMACS CNF format [54], we have to transform the generated Boolean formulas to CNF. Towards this end, we use an API provided by the Alloy analyzer [52] and integrate it into FTSyn. Using this API, we transform the generated Boolean formulas to CNF format, which can be directly delivered to the SAT solver zChaff. For example, in DIMACS format, the formula (a1 ∨ ¬a2 ∨ a3) ∧ (¬a1 ∨ a2 ∨ ¬a3) is represented as follows:

p cnf 3 2
1 -2 3 0
-1 2 -3 0

The first line identifies that a CNF formula with 3 variables and 2 clauses is being specified. Each clause (i.e., disjunction) must be specified on a separate line and terminated by 0. Also, variables and their complements are distinguished by a minus sign.

• Native method invocation. In FTSyn, after we automatically generate a CNF formula corresponding to TPBF(spec_sf) ∧ TPBF(gi), we invoke a native method in which we query zChaff with the generated CNF formula. The source code of zChaff is available for educational purposes. Hence, we have generated a Dynamic Link Library so that we can invoke zChaff from the Java environment when we instantiate an instance of our framework FTSyn. Therefore, for every group of transitions gi, we invoke zChaff once to verify the safety of gi.

Using the implementation of our SAT-based approach, we have synthesized the token ring program presented in Chapter 6. Since we invoke zChaff from the Java environment, the current implementation of our SAT-based approach suffers from the performance of the Java Native Interface. Nonetheless, our implementation provides a platform for SAT-based synthesis of fault-tolerant (distributed) programs, and the efficiency of this platform can improve as the software technology improves.
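The pipeline from the worked example to the zChaff input can be sketched end to end. Here the source bits a1, a2 map to DIMACS variables 1 and 2, and the destination bits b1, b2 to variables 3 and 4; the variable numbering and function names are illustrative choices, not FTSyn's:

```python
def to_dimacs(clauses, num_vars):
    """Serialize clauses (lists of nonzero ints; a negative literal is a
    complemented variable) into DIMACS CNF, the format zChaff consumes."""
    lines = [f"p cnf {num_vars} {len(clauses)}"]
    for clause in clauses:
        lines.append(" ".join(str(lit) for lit in clause) + " 0")
    return "\n".join(lines)

# Encoding from the text for x in {-1, 0, 1}:
#   x = -1 -> a1 & a2,  x = 0 -> ~a1 & a2,  x = 1 -> a1 & ~a2
SRC = {-1: [1, 2], 0: [-1, 2], 1: [1, -2]}    # a1, a2 as DIMACS vars 1, 2
DST = {-1: [3, 4], 0: [-3, 4], 1: [3, -4]}    # b1, b2 as DIMACS vars 3, 4

def transition_cnf(x_src, x_dst):
    """A transition cube (a-literals for the source, b-literals for the
    destination) as a CNF of unit clauses; disjoint variables keep it consistent."""
    return [[lit] for lit in SRC[x_src] + DST[x_dst]]

# The transition x: -1 -> 0, i.e., (a1 & a2) & (~b1 & b2)
cnf = to_dimacs(transition_cnf(-1, 0), num_vars=4)
# The worked formula from the text: (a1 | ~a2 | a3) & (~a1 | a2 | ~a3)
example = to_dimacs([[1, -2, 3], [-1, 2, -3]], num_vars=3)
```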
9.5 Summary

In this chapter, we presented two directions of research in progress. Specifically, we discussed the development of heuristics that can transform non-monotonic programs (respectively, specifications) into monotonic ones. Since adding failsafe fault-tolerance to distributed programs that satisfy the monotonicity requirements can be done in polynomial time (cf. Chapter 4), such heuristics extend the scope of programs that can reap the benefits of efficient synthesis.

Also, we presented a technique for using SAT solvers in the synthesis of fault-tolerant distributed programs from their fault-intolerant version. We reduce the synthesis requirements to satisfiability problems and then invoke SAT solvers to solve those problems. This way, we benefit from the efficiency of state-of-the-art SAT solvers during the synthesis of fault-tolerant distributed programs. Currently, we have created a centralized implementation of our approach; however, we plan to extend this work to the case where we deploy our synthesis algorithm on a distributed platform. Also, we plan to investigate the applicability of other decision procedures [55] in the synthesis of fault-tolerant distributed programs.

Chapter 10

Conclusion and Future Work

In this chapter, we discuss related work, make concluding remarks, and provide some insight for future research. Specifically, in Section 10.1, we compare our synthesis approach to the existing approaches in the literature. Then, in Section 10.2, we present the contributions of this dissertation. In Section 10.3, we discuss the impact of the synthesis approach presented in this dissertation. Finally, in Section 10.4, we present open problems and future research directions.

10.1 Discussion

In this section, we discuss issues related to the approach presented in this dissertation. Specifically, we compare our synthesis method with the existing synthesis approaches in the literature.
Towards this end, we address some questions raised regarding our synthesis method and the framework FTSyn that we have developed for the synthesis of fault-tolerant (distributed) programs.

How does the synthesis method presented in this dissertation differ from the model-theoretic synthesis approach?

The synthesis method in the model-theoretic approach [2, 56, 3, 57, 4] is based on a decision procedure for the satisfiability proof of the specification. Although such synthesis methods may have slight differences with respect to the input specification language and the program model that they synthesize, the general approach is based on the satisfiability proof of the specification. This makes it difficult to provide reuse in the synthesis of programs; i.e., any change in the specification requires the synthesis to be restarted from scratch. By contrast, since the input to our synthesis method is the set of transitions of a fault-intolerant program, our approach has the potential to reuse those transitions in the synthesis of the fault-tolerant version of the input program.

Nevertheless, similar to the above-mentioned methods that generate the synchronization skeleton (i.e., abstract structure) of programs, we also generate the abstract structure of programs. Synthesizing the abstract structure of programs allows us to (i) focus on concurrency issues in the synthesis of fault-tolerant distributed programs instead of their functional properties, and (ii) provide the potential of translating the abstract structure of the synthesized program to multiple programming languages, unlike approaches that focus on the synthesis of programs in a specific programming language [58].

Model-theoretic approaches model distribution by atomic read/write actions [4], where in an atomic action a process performs either a read or a write operation.
Kulkarni and Arora [1] present a more general way of modeling distribution restrictions, where a process is allowed to read/write only a subset of the program variables. Since we have adapted Kulkarni and Arora's approach for modeling distribution, our synthesis algorithms benefit from the generality of their modeling.

In addition to the above-mentioned issues, the only implementation of model-theoretic synthesis approaches that we are aware of is an implementation of Emerson and Clarke's method for the synthesis of a mutual exclusion protocol [59]. On the other hand, we have implemented an extensible framework (cf. Chapter 8) in which developers of fault-tolerance synthesize fault-tolerant distributed programs. Our framework is not problem-dependent, and developers of fault-tolerance can use it for the synthesis of a variety of programs [60]. Also, due to the incompleteness of the heuristics integrated in our framework, we have chosen to design our framework for change, so that if the existing heuristics fail to synthesize a program then developers can integrate their new heuristics in the framework without expensive overhead.

How does the synthesis method presented in this dissertation differ from the automata-theoretic approach, where one synthesizes reactive distributed programs [5, 6, 7] that interact with a non-deterministic environment?

The automata-theoretic approach is a specification-based synthesis method where one synthesizes a program from its tree automaton specification. Also, automata-theoretic approaches are mostly used for the synthesis of reactive systems that interact with a non-deterministic environment [5, 61, 6], whereas in the case of our synthesis problem, we have complete information about the behavior of the environment (i.e., faults) with which the program interacts.

Since our approach supports incremental synthesis of multitolerant programs (cf.
Chapter 7), it has the potential to incrementally add desired fault-tolerance properties to programs once a new behavior of the environment (i.e., a new class of faults) is discovered. This way, we decompose the problem of synthesizing reactive programs into simpler problems. As a result, we do not encounter the complexity of synthesizing a reactive distributed program [6, 7] that interacts with a hostile environment.

How does the synthesis method presented in this dissertation differ from synthesizing proof-carrying (certified) code?

In the synthesis of proof-carrying code, the synthesis method takes the input specification and generates the code of the program annotated with its proof of correctness [62, 63]. Also, the synthesis method generates a proof checker that is delivered to the program user. Then, using the proof checker, users verify the correctness of the synthesized program to gain high assurance in safety-critical systems. Also, in the synthesis of certified code, there exists an option for adding domain-specific knowledge in order to derive more efficient programs. However, such approaches mostly focus on safety properties of programs, whereas our focus is to add all levels of fault-tolerance to programs.

How does the synthesis method presented in this dissertation differ from synthesizing controllers in control theory?

Synthesizing discrete-event controllers in control theory is indeed an automata-theoretic approach. Our approach has several advantages with respect to existing approaches for the synthesis of controllers. First, the general-case complexity of synthesizing controllers is PSPACE-complete [64, 65, 66, 67, 68] in the size of the uncontrolled automaton, whereas our problem is NP-complete.
Second, our model of distribution is general enough to capture different modeling cases in distributed computing, whereas in control theory each controller performs its controlling task individually and there exist limitations on the communication between controllers. Finally, our approach is incremental in that we reuse the computations of the fault-intolerant program for the synthesis of its fault-tolerant version. Such reuse of computations is expected to be helpful in cases where the state space is large.

How does the synthesis method presented in this dissertation differ from synthesizing strategies for two-player games?

Regarding two-player games, most of the approaches in the literature [5, 61, 69, 70] for the synthesis of winning strategies focus on cases where the program interacts with an adversary via input/output variables. Such a model restricts us to cases where faults can only affect a subset of program variables, whereas in our model faults can perturb the state of the program to any state. Although the authors of [71] address this shortcoming of two-player games, the language chosen for expressing the winning strategy is Propositional Linear Temporal Logic (PLTL) [72]. Since fault-tolerance properties are existential properties, PLTL does not have the expressive power to capture such properties.

Does the fault model used in this dissertation enable us to capture different types of faults?

Yes. The notion of state perturbation is general enough to model different types of faults (namely, stuck-at, crash, fail-stop, omission, timing, or Byzantine) with different natures (intermittent, transient, and permanent faults). As an illustration of the generality of the notion of state perturbation, we have modeled (i) Byzantine faults (cf. Subsections 4.4.1 and 5.3.1); (ii) fail-stop faults (cf. Subsection 4.4.2); (iii) input-corruption faults (cf.
Subsection 5.2.1); and (iv) the process-restart faults that affect the token ring program synthesized in Chapter 6. The state-perturbation model has also been used in designing fault-tolerance to (i) omission faults (e.g., [17]), and (ii) transient faults and improper initialization (e.g., [19]).

How does FTSyn scale as the state space of programs increases?

In this dissertation, we showed that, using FTSyn, we synthesize fault-tolerant programs that tolerate different types of faults and are simultaneously subject to multiple faults. The largest state space among the programs that we have synthesized belongs to an agreement program (see Appendix B for this program) that is simultaneously perturbed by Byzantine and fail-stop faults (1.3 million states) [73, 60]. Also, in Section 8.5, we synthesized a simplified version of an altitude switch used in the altitude controller of an aircraft. Although a state space of 1.3 million is much smaller than the state space of many practical applications, we argue that our synthesis framework has the potential to add fault-tolerance to real-world applications. Towards this end, we discuss the following three points:

1. We argue that model checkers also faced problems similar to those our framework faces regarding state space explosion. Researchers used early versions of model checkers for checking small protocols and verifying the correctness of operating system kernels [74, 75] despite a state space limit of about 500,000 states on an average workstation (in the early 1990s) [74]. The state space handled by our framework is comparable to that reported by early model checkers. We expect that by incorporating the recent optimizations developed for model checking, it will be possible to increase the state space for which fault-tolerance can be added using our framework.
2. We have not included these optimizing techniques in the current version of the synthesis framework, as the goal of the framework is to study the effectiveness of different heuristics, different internal representations of programs and faults, and the ability to add fault-tolerance to different types of faults.

3. There exist several possible optimizations that can be applied to the framework to reduce the synthesis time. However, these optimizations are orthogonal to the issues at hand. For example, the techniques that are used to determine if a given group of transitions violates safety, or if a given group of transitions is appropriate for adding recovery, equally affect the above-mentioned goals. (One can either take advantage of the SAT-based approach presented in Section 9.4 to check the safety of a group of transitions, or exhaustively check every transition of a given group.) While the design of the framework permits one to use these techniques, they are orthogonal to the issue of adding heuristics that focus on (i) which recovery transitions should be added, and (ii) how one should deal with safety-violating transitions. In other words, it is expected that the relative improvement from these optimizations will have the same effect on different heuristics.

10.2 Contributions

The contributions of this dissertation are twofold: theoretical and practical. Regarding theoretical contributions, we showed that the problem of synthesizing failsafe fault-tolerant distributed programs from their fault-intolerant version is NP-complete. This result was counterintuitive in the sense that Kulkarni and Arora [1] had already conjectured that adding failsafe fault-tolerance to distributed programs would be polynomial. Subsequently, in Section 4.3, we identified sufficient conditions for polynomial-time synthesis of failsafe fault-tolerant distributed programs.
Specifically, we identified monotonic programs and specifications for which the addition of failsafe fault-tolerance to distributed programs can be done in polynomial time. We showed that if only programs (respectively, specifications) are monotonic then the synthesis of failsafe fault-tolerant distributed programs remains NP-complete.

Another theoretical contribution of this dissertation is the set of enhancement synthesis algorithms presented in Chapter 5. We showed that one approach for reducing the complexity of synthesis is to reuse the computational structure of fault-intolerant programs in the synthesis of their fault-tolerant version. In particular, we formalized the problem of enhancing the fault-tolerance of nonmasking fault-tolerant programs to masking fault-tolerance. Also, we presented a sound and complete algorithm for enhancing the fault-tolerance of programs in the high atomicity model, where processes can atomically read/write program variables. Then, we designed a sound algorithm for the enhancement of the fault-tolerance of nonmasking distributed programs.

The enhancement technique allows us to partially automate the design of masking fault-tolerant programs and reap the benefits of automation. Specifically, in the synthesis of masking fault-tolerant programs, if automatic synthesis of the fault-tolerant program fails due to the large state space of the fault-intolerant program, then one can manually design a nonmasking program and then automatically enhance its level of fault-tolerance to masking using the enhancement algorithms of Chapter 5.

We used the monotonicity property to extend the scope of programs and specifications that can reap the benefits of efficient automation. Specifically, we developed heuristics (cf. Sections 9.1 and 9.2) for the transformation of non-monotonic programs (respectively, specifications) to monotonic ones, so that Theorem 4.11 can be applied for efficient addition of failsafe fault-tolerance to distributed programs.
In other words, given a monotonic program (respectively, specification) and a non-monotonic specification (respectively, program), we design heuristics that transform the non-monotonic specification (respectively, program) into a monotonic one so that failsafe fault-tolerance can be added in polynomial time. To show the advantage of developing such heuristics, we enhanced the fault-tolerance of a nonmasking distributed program using our heuristics (cf. Section 9.3).

We also presented a synthesis method for automatic addition of pre-synthesized fault-tolerance components to fault-intolerant programs (cf. Chapter 6). Our method enables us to identify commonly encountered patterns in the synthesis of fault-tolerant distributed programs, and to reuse those patterns in the synthesis of different programs. In other words, to reuse the effort put into the synthesis of one program for the synthesis of another program, we introduced the notion of pre-synthesized fault-tolerance components.

Moreover, we presented algorithms for automatic specification of pre-synthesized components during synthesis, where we extract a specified component from a library of pre-synthesized components. Afterwards, in Chapter 6, we presented an algorithm for ensuring interference-freedom between the program being synthesized and the fault-tolerance components being added to that program. Finally, we designed an algorithm for automatic addition of a pre-synthesized component to a fault-intolerant program. Since the existing algorithms for the synthesis of fault-tolerant distributed programs are not complete (i.e., the algorithms may fail to synthesize a fault-tolerant program from a given fault-intolerant program even though a fault-tolerant program exists), the usage of pre-synthesized components allows us to reduce the chance of failure in the synthesis of fault-tolerant distributed programs.
Furthermore, we have added pre-synthesized fault-tolerance components with different topologies (e.g., linear and hierarchical) to different programs (cf. Chapter 6). These examples illustrate the applicability of pre-synthesized fault-tolerance components in the synthesis of a variety of fault-tolerant distributed programs with different topologies.

Using pre-synthesized fault-tolerance components, we also extended the problem of adding fault-tolerance to the case where new variables can be introduced while synthesizing fault-tolerant programs. By contrast, previous algorithms required that the state space of the fault-tolerant program be the same as that of the fault-intolerant program. Moreover, our synthesis method controls the way new variables are introduced; new variables are determined based on the added components. Hence, the synthesis method of Chapter 6 controls the way in which the state space is expanded.

Also, in this dissertation, we investigated the problem of synthesizing multitolerant programs from their fault-intolerant versions (cf. Chapter 7). Specifically, we formally defined what multitolerance means: a multitolerant program provides (i) the specified level of fault-tolerance if a fault from any single class of faults occurs, and (ii) a minimal level of fault-tolerance if faults from multiple classes occur. Then, we showed that, in general, the problem of adding multitolerance to high atomicity programs is NP-complete in the state space of the fault-intolerant program. Subsequently, we presented sound and complete synthesis algorithms for special cases of adding multitolerance, where one incrementally adds failsafe (respectively, nonmasking) fault-tolerance to one class of faults and masking fault-tolerance to another fault-class.
Regarding the practical contributions of this dissertation, we developed the synthesis framework FTSyn (presented in Chapter 8), with which developers of fault-tolerance can synthesize fault-tolerant programs. FTSyn integrates existing algorithms and heuristics for the synthesis of fault-tolerant distributed programs and allows developers to automatically synthesize fault-tolerant programs from their fault-intolerant version. Also, FTSyn is extensible in the sense that developers of heuristics can easily integrate new heuristics into the framework. Moreover, FTSyn is changeable in the sense that developers can easily change its implementation without changing its design. The changeability of FTSyn is important since changing the implementation of FTSyn may help to increase the efficiency of the synthesis; thus, any change in the implementation should be simple and cheap. Furthermore, we have integrated a SAT-based synthesis approach in FTSyn, where we use efficient SAT solvers in the synthesis of fault-tolerant distributed programs (cf. Section 9.4).

10.3 Impact

In this section, we discuss the impact of this dissertation on research and education. Regarding research, this dissertation has significant impact on the development of fault-tolerant and dependable distributed programs, as the extensible and changeable design of our software framework will help to develop a rich integrated framework of heuristics for the development of fault-tolerant distributed programs.

Moreover, the approach presented in this dissertation for the synthesis of fault-tolerant programs can be extended to the synthesis of reactive programs [5]. Towards this end, we have designed a hybrid synthesis method that benefits from specification-based approaches [76, 2, 77, 56, 78, 79, 80, 57, 4, 5, 6, 71, 81, 7] and the synthesis approach presented in this dissertation.
Specifically, we have developed an incremental synthesis method [82] for automatic addition of liveness properties to finite-state concurrent programs. In particular, in [82], we present a sound and complete algorithm for adding Leads-to [30] properties to programs. The incremental approach of [82] has the potential to reuse the effort put into the synthesis of a program for the synthesis of its improved version.

Furthermore, the synthesis algorithm in [82] can be integrated with model checkers to provide automated assistance beyond generating counterexamples; i.e., in the cases where a model fails to satisfy a property, our synthesis algorithm automatically (i) identifies the fixability of the model, and (ii) fixes the model if it is fixable. Hence, we believe the synthesis method presented in this dissertation has the potential to provide a practical methodology for the synthesis of reactive programs.

Regarding educational impact, we note that using our framework provides the opportunity to experience non-trivial concepts regarding distributed and fault-tolerant systems. We have used the synthesis framework in the graduate distributed systems class as well as in a seminar on fault-tolerance.

In the class on distributed systems, the students found the interactive nature of the framework extremely useful in understanding several concepts about fault-tolerant programs. In this class, the students focused on re-synthesizing a fault-tolerant program for which the framework had been used successfully. In this case, the students began with the fault-intolerant program. First, they used the automated approach to obtain the fault-tolerant program. Subsequently, they focused on interactive synthesis of the same fault-tolerant program. During this interactive synthesis, they applied different heuristics and observed the intermediate program.
They explored the state transition diagram of the intermediate program and used the framework to understand why the intermediate program was not fault-tolerant. This allowed them to experience the non-deterministic execution of the different processes of the program. Moreover, they could observe individual states and transitions in the global state transition diagram and could experience the effect of distribution restrictions on the complexity of the synthesis of fault-tolerant distributed programs.

10.4 Future Work

In this section, we present open theoretical problems in the synthesis of fault-tolerant distributed programs. Also, we discuss future extensions and modifications to the FTSyn framework presented in Chapter 8. First, we discuss the open theoretical problems:

• Identify the polynomial boundary of synthesizing nonmasking fault-tolerant distributed programs.

As we identified sufficient conditions for the synthesis of failsafe fault-tolerant distributed programs in Chapter 4, we would like to at least identify the sufficient conditions for polynomial synthesis of nonmasking fault-tolerant programs. Although we do not have a proof of the NP-hardness of the problem of synthesizing nonmasking fault-tolerant distributed programs from their fault-intolerant version, we already know that this problem is in NP (cf. Section 2.8). To the best of our knowledge, no polynomial-time algorithm has yet been presented for the synthesis of nonmasking distributed programs. Thus, finding properties of programs that identify sufficient conditions for polynomial-time synthesis of nonmasking distributed programs remains an open problem.

• Develop nonmasking programs that satisfy the monotonicity requirements.

Since the worst-case complexity of enhancing the fault-tolerance of nonmasking fault-tolerant distributed programs to masking is exponential (cf.
Chapter 5), we would like to use the notion of monotonicity in order to identify nonmasking programs whose level of fault-tolerance can be enhanced to masking in polynomial time. Thus, it is desirable to develop a methodology for the design of nonmasking programs that satisfy the requirements of program monotonicity. Such a design methodology provides a framework for partial automation in the design of masking programs, where one manually develops a nonmasking monotonic program and then applies Theorem 4.11 to automatically enhance the level of fault-tolerance to masking.

• Identify the necessary and sufficient conditions for the simultaneous addition of multiple pre-synthesized components.

In Chapter 6, we showed how we add a pre-synthesized corrector to the program being synthesized in order to resolve a deadlock state from which existing heuristics fail to add recovery. Also, we ensured that the execution of the pre-synthesized component does not interfere with the execution of the program. Now, since there exist many situations where we need to simultaneously add such correctors to the program being synthesized, we plan to identify necessary and sufficient conditions for an interference-free addition of multiple pre-synthesized components to a program.

• Develop a platform for providing automated assistance in model checking beyond generating counterexamples.

Although model checkers provide user-friendly counterexamples in cases where a model fails to satisfy a desired property, it is difficult to manually fix a failed model so that it satisfies a desired property while preserving its existing properties. We have developed a synthesis algorithm [82] that has the potential to provide such automated assistance for developers when the model checking of the program at hand fails.
Using the synthesis algorithm of [82], we automatically (i) identify whether or not a model is fixable to satisfy a particular property in addition to its existing properties, and (ii) fix the model if it is fixable so that it satisfies the new property in addition to its existing properties. However, currently, the synthesis algorithm in [82] can only be used for the linear computation model, where program properties are specified in Linear Temporal Logic [72]. To develop a platform for automatic model correction, it is desirable to (i) integrate the algorithm of [82] into one of the existing model checkers (e.g., SPIN [36]) to investigate its practicality, and (ii) extend the results of [82] to the case where the program computation is non-linear (e.g., tree-like computation) and program properties are specified in Computation Tree Logic (CTL) [72].

Now, we discuss issues related to the extensions and improvements of the synthesis framework FTSyn presented in this dissertation.

• Use model checkers in the synthesis of fault-tolerant programs in order to reduce the complexity of synthesis.

As mentioned in Chapter 8, the FTSyn framework has the ability to interact with developers of fault-tolerance. If the synthesis of a fault-tolerant program fails, then developers can ask FTSyn to generate an intermediate version of the program being synthesized in order to identify what went wrong during synthesis. FTSyn generates the intermediate program in the Promela [37] modeling language. Thus, developers can benefit from the SPIN model checker to verify the fault-tolerance properties. The SPIN model checker returns counterexamples that are enlightening for developers in that they can identify which heuristic should be applied next in synthesis. Currently, the users of FTSyn have to perform this verification manually. We plan to develop an automated approach for the communication between FTSyn and model checkers.
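The intended check-and-fix interplay between a model checker and the synthesis algorithm can be sketched as a loop: check the model; if a counterexample is found, decide fixability and repair, then re-check. The sketch below is schematic only; `check` and `repair` are toy stand-ins (for a single "never reach a bad state" property) for a real model checker and for the algorithm of [82].

```python
def check(delta, bad):
    """Toy 'model checker': return a counterexample transition into a bad
    state, or None if the property 'never reach bad' holds."""
    for (s, t) in delta:
        if t in bad:
            return (s, t)
    return None

def repair(delta, bad, init):
    """Toy 'repair': drop transitions into bad states; the model is fixable
    only if no initial state is itself bad."""
    if init & bad:
        return None
    return {(s, t) for (s, t) in delta if t not in bad}

def check_and_fix(delta, bad, init):
    """(i) identify whether the model is fixable; (ii) fix it if it is."""
    cex = check(delta, bad)
    while cex is not None:
        fixed = repair(delta, bad, init)
        if fixed is None:
            return None          # (i) the model is not fixable
        delta, cex = fixed, check(fixed, bad)
    return delta                 # (ii) the fixed (or already correct) model
```

A real platform would replace `check` with a call into SPIN (or another checker) and `repair` with the synthesis step, preserving the previously verified properties across iterations.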
Such communication has an important impact on reducing the complexity of synthesis, as model checkers can provide behavioral information about the program at hand. The synthesis algorithm uses this behavioral information to make more intelligent decisions during synthesis.

• Develop a distributed synthesis platform.

Currently, the implementation of FTSyn is centralized. To extend the scope of synthesis to real-world applications, we adopt two directions: developing a scalable parallel synthesis algorithm, and extending FTSyn for deployment on a distributed platform. In the first direction, we plan to conduct a survey of the existing approaches [83, 84, 85, 86, 87] for parallel and distributed model checking, where one distributes the reachability graph of the model at hand over a network. Towards this end, we note that the synthesis problem differs from the model checking problem in that during synthesis we modify the program model to satisfy specific synthesis requirements, whereas model checkers only verify the program model without performing any modification. We conjecture that scalable synthesis will be in a higher complexity class than scalable model checking, thus making the development of a scalable synthesis algorithm more challenging. In the second direction, we plan to simultaneously implement the achievements in the design of the scalable synthesis algorithm in FTSyn. As a result, we can experience the applicability of our theoretical results in the context of a distributed FTSyn.

• Develop an on-the-fly synthesis method.

In the synthesis of a fault-tolerant program, FTSyn initially expands the reachability graph of the fault-intolerant program using program and fault transitions. For real-world applications, the size of the reachability graph is very large, and as a result of the space complexity of synthesis, FTSyn may fail to synthesize a fault-tolerant program.
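Distributing the reachability graph, as done in the parallel model checking work cited above, typically assigns each state to a worker by a partition function (e.g., a hash of the state); a worker expands only the states it owns and forwards cross-partition successors to their owners. The following single-process simulation of that partitioning scheme is illustrative only and is not FTSyn code.

```python
def owner(state, n_workers):
    # Static partition function: each state is owned by exactly one worker.
    return hash(state) % n_workers

def distributed_reachability(init, successors, n_workers=3):
    """Simulate partitioned reachability: each worker stores only the states
    it owns; successors owned elsewhere are 'sent' to that worker's queue."""
    visited = [set() for _ in range(n_workers)]
    queues = [[] for _ in range(n_workers)]
    for s in init:
        queues[owner(s, n_workers)].append(s)
    while any(queues):
        for w in range(n_workers):            # round-robin over workers
            while queues[w]:
                s = queues[w].pop()
                if s in visited[w]:
                    continue
                visited[w].add(s)
                for t in successors(s):
                    # cross-partition successors become messages
                    queues[owner(t, n_workers)].append(t)
    return set().union(*visited)
```

For synthesis (as opposed to pure verification), each worker would additionally have to agree with the others on which of its local transitions are removed or added, which is where we expect the extra difficulty to lie.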
To remedy this problem, we plan to develop a space-efficient synthesis algorithm where FTSyn only partially generates the reachability graph of the program. Towards this end, we benefit from existing techniques [88] in the model checking literature for providing space efficiency. Such space-efficient techniques are orthogonal to the development of a distributed synthesis algorithm in that we can deploy the space-efficient synthesis algorithm on each node of the scalable synthesis platform discussed above.

APPENDICES

Appendix A: Programs Synthesized Using Pre-Synthesized Components

In this appendix, we present the programs that we have synthesized using pre-synthesized components. Specifically, we first present an Alternating Bit Protocol (in Section A.1) that is nonmasking fault-tolerant to message loss faults. Then, in Section A.2, we present an intermediate diffusing computation program synthesized by our synthesis framework, FTSyn. Subsequently, in Section A.3, we present the synthesized diffusing computation program after we have added pre-synthesized components to refine one of the high atomicity recovery actions in the intermediate program. Finally, in Section A.4, we present a refined version of the synthesized diffusing computation program in the syntax of the Promela modeling language [37], where we have verified the synthesized program in the SPIN model checker to gain more confidence in the implementation of FTSyn.

A.1 The Promela Model of the Alternating Bit Protocol

In this section, we present the Promela model of the alternating bit protocol (ABP) synthesized by adding linear pre-synthesized components to the fault-intolerant ABP program presented in Section 6.5.
#define inv \
  ( (((rr != 1) && (cr == -1)) || (br == bs)) && \
    (((rs != 1) && (cs == -1)) || (br != bs)) && \
    ((cs == -1) || (cs == bs)) && \
    ((cs != -1) || (cr != -1) || ((rr + rs) == 1)) && \
    ((cs == -1) || (cr != -1) || ((rr + rs) == 0)) && \
    ((cs != -1) || (cr == -1) || ((rr + rs) == 0)) )

#define fs \
  ( ((cs == -1) || (cs == bs)) && \
    (((cs != -1) && (cr != -1)) || (((rr + rs) == 1) || ((rr + rs) == 0))) )

/* The property to be verified:  [](fs -> <> inv)  */

#define Zs  ((rs == 0) && (bs == 1) && (cs == -1))   /* LCs  */
#define Zr  ((rr == 0) && (br == 1) && (cr == -1))   /* LCr  */
#define ZPs ((rs == 0) && (bs == 0) && (cs == -1))   /* LC's */
#define ZPr ((rr == 0) && (br == 0) && (cr == -1))   /* LC'r */

#define Xs  (Zs && ZPr)
#define XPs (ZPs && Zr)
#define Xr  (Zs && Zr)
#define XPr (ZPs && ZPr)

bool rs = 1;
bool rr = 0;
bool bs = 1;
bool br = 0;

bool ypr;  /* y'r */
bool ys;
bool yr;
bool yps;  /* y's */

bool us;
bool ur;
bool ups;  /* u's */
bool upr;  /* u'r */

int cs = -1;
int cr = -1;

proctype sender() {
do
:: atomic { (rs == 1) -> rs = 0; cs = bs;
            us = 0; ups = 0; }
:: atomic { (cr != -1) -> rs = 1; cr = -1;
            bs = (bs+1)%2; us = 0; ups = 0; }

:: atomic { Zs && !ys && ypr -> ys = 1; }
:: atomic { ys -> cs = 1; ys = 0; }

:: atomic { ZPs && !yps && yr -> yps = 1; }
:: atomic { yps -> cs = 0; yps = 0; }

:: atomic { Zs && !us -> us = 1; }
:: atomic { ZPs && !ups -> ups = 1; }
od;
}

proctype receiver() {
do
:: atomic { (cs != -1) -> cs = -1; rr = 1;
            br = (br+1)%2; yr = 0; }
:: atomic { (rr == 1) -> rr = 0; cr = br;
            yr = 0; }

:: atomic { ZPr && !ypr -> ypr = 1; }
:: atomic { Zr && !yr -> yr = 1; }

:: atomic { Zr && !ur && us -> ur = 1; }
:: atomic { ur -> cr = 1; ur = 0; }

:: atomic { ZPr && !upr && ups -> upr = 1; }
:: atomic { upr -> cr = 0; upr = 0; }
od;
}

proctype MessageLossFaults() {
if
:: ((cs != -1)) -> cs = -1;
:: ((cr != -1)) -> cr = -1;
:: skip;
fi;
}

init {
run sender(); run receiver(); run MessageLossFaults();
}

A.2 The Synthesized Intermediate Diffusing Computation Program

In this section, we present the intermediate diffusing computation program that we have synthesized using FTSyn. This program includes the actions of the high atomicity processes added for the purpose of adding recovery. FTSyn represents the synthesized program in a syntax close to the syntax of the Promela modeling language [37]. The semantics of the output program is based on Dijkstra's guarded commands, where each guarded command grd -> st represents the set of transitions {(s0, s1) : grd holds at s0 and the atomic execution of st at s0 takes the state of the program to s1}. In the following program, ci, pi, and sni respectively represent the color, the parent, and the session number of process Pi. Also, cpi and snpi respectively represent the color and the session number of the parent of Pi (0 <= i <= 3).

---------- The actions of Process P0 ----------

(c0 == 1) &&
((p0 == 0) && (sn0 == 1)) -> c0 := 0; sn0 := 0;

(c0 == 1) &&
((p0 == 0) && (sn0 == 0)) -> c0 := 0; sn0 := 1;

(c0 == 1) &&
( ((c1 == 0) && (c2 == 0) && (sn0 == 1) && (sn1 == 0) &&
   (sn2 == 0) && ((p0 == 1) || (p0 == 2)) ) ||
  ((c2 == 0) && (sn0 == 1) && (sn2 == 0) && (p0 == 2) ) ||
  ((c1 == 0) && (sn0 == 0) && (sn1 == 1) && (p0 == 1) ) ||
  ((c2 == 0) && (sn0 == 0) && (sn2 == 1) && (p0 == 2) ) ||
  ((c1 == 0) && (sn0 == 1) && (sn1 == 0) && (p0 == 1) ) ||
  ((c1 == 0) && (c2 == 0) && (sn0 == 0) && (sn1 == 1) &&
   (sn2 == 1) && ((p0 == 1) || (p0 == 2))) )
    -> c0 := cp0; sn0 := snp0;

(c0 == 0) &&
( ((c1 == 1) && (c2 == 1) && (sn0 == 0) && (sn1 == 0) && (sn2 == 0)) ||
  ((c1 == 1) && (c2 == 1) && (sn0 == 1) && (sn1 == 1) && (sn2 == 1)) )
    -> c0 := 1;

---------- The actions of Process P1 ----------

(c1 == 1) &&
( ((cp1 == 0) && (sn1 == 0) && (snp1 == 1)) ||
  ((cp1 == 0) && (sn1 == 1) && (snp1 == 0)) )
    -> c1 := cp1; sn1 := snp1;

(c1 == 0) && ((sn1 == 1) || (sn1 == 0)) -> c1 := 1;

---------- The actions of Process P2 ----------

(c2 == 1) &&
( ((cp2 == 0) && (sn2 == 0) && (snp2 == 1)) ||
  ((cp2 == 0) && (sn2 == 1) && (snp2 == 0)) )
    -> c2 := cp2; sn2 := snp2;

(c2 == 0) &&
( ((sn2 == 0) && (c3 == 1) && (sn3 == 0) && (p3 == 2)) ||
  ((sn2 == 1) && (c3 == 1) && (sn3 == 1) && (p3 == 2)) )
    -> c2 := 1;

---------- The actions of Process P3 ----------

(c3 == 1) &&
( ((cp3 == 0) && (sn3 == 0) && (snp3 == 1)) ||
  ((cp3 == 0) && (sn3 == 1) && (snp3 == 0)) )
    -> c3 := cp3; sn3 := snp3;

(c3 == 0) &&
((sn3 == 0) || (sn3 == 1)) -> c3 := 1;

---------- The actions of the high atomicity Process 0 ----------

(c0 == 1) &&
( ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 1) && (sn3 == 0) &&
   ((p0 == 2) || (p0 == 1)) && (p1 == 0) && (p2 == 0) && (p3 == 2) ) ||
  ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 1) && (sn1 == 0) &&
   ((p0 == 2) || (p0 == 1)) && (p1 == 0) && (p2 == 0) && (p3 == 2) ) ||
  ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 1) && (sn2 == 0) &&
   ((p0 == 2) || (p0 == 1)) && (p1 == 0) && (p2 == 0) && (p3 == 2)) )
    -> sn0 := 0;

(c0 == 1) &&
( ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 0) &&
   (sn2 == 0) && (sn3 == 0) && ((p0 == 2) || (p0 == 1)) && (p1 == 0) &&
   (p2 == 0) && (p3 == 2) ) ||
  ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 1) && (sn1 == 1) &&
   (sn2 == 1) && (sn3 == 1) && ((p0 == 1) || (p0 == 2)) && (p1 == 0) &&
   (p2 == 0) && (p3 == 2)) )
    -> p0 := 0;

(c0 == 1) &&
( ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 0) &&
   (sn2 == 1) && (p0 == 2) && (p1 == 0) && (p2 == 0) && (p3 == 2)) ||
  ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 1) &&
   (sn2 == 0) && (p0 == 2) && (p1 == 0) && (p2 == 0) && (p3 == 2)) ||
  ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 0) &&
   (sn3 == 1) && (p0 == 2) && (p1 == 0) && (p2 == 0) && (p3 == 2)) ||
  ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 1) &&
   (sn3 == 0) && (p0 == 2) && (p1 == 0) && (p2 == 0) && (p3 == 2)) ||
  ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn2 == 1) &&
   (sn3 == 0) && (p0 == 2) && (p1 == 0) && (p2 == 0) && (p3 == 2)) ||
  ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn2 == 0) &&
   (sn3 == 1) && (p0 == 2) && (p1 == 0) && (p2 == 0) && (p3 == 2)) )
    -> p0 := 1;

(c0 == 1) &&
((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 1) &&
 (sn2 == 1) && (sn3 == 1) && ((p0 == 2) || (p0 == 1)) && (p1 == 0) &&
 (p2 == 0) && (p3 == 2))
    -> c0 := 1; sn0 := 1; p0 := 0;

---------- The actions of the high atomicity Process 1 ----------

(c0 == 1) &&
( (c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 1) &&
  (sn2 == 0) && (sn3 == 0) && (p0 == 1) && (p1 == 0) && (p2 == 0) &&
  (p3 == 2) ) -> sn1 := 0;

(c0 == 1) &&
( ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 0) &&
   (sn2 == 1) && (p0 == 1) && (p1 == 0) && (p2 == 0) && (p3 == 2)) ||
  ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 0) &&
   (sn3 == 1) && (p0 == 1) && (p1 == 0) && (p2 == 0) && (p3 == 2)) )
    -> sn1 := 1;

---------- The actions of the high atomicity Process 2 ----------

(c0 == 1) &&
((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 1) &&
 (sn2 == 0) && (sn3 == 1) && (p0 == 1) && (p1 == 0) && (p2 == 0) &&
 (p3 == 2)) -> sn2 := 1;

---------- The actions of the high atomicity Process 3 ----------

(c0 == 1) &&
((c1 == 1) && (c2 == 1) && (c3 == 1) && (sn0 == 0) && (sn1 == 1) &&
 (sn2 == 1) && (sn3 == 0) && (p0 == 1) && (p1 == 0) && (p2 == 0) &&
 (p3 == 2)) -> sn3 := 1;

A.3 The Actions of the Refined Diffusing Computation Program

In this section, we present the actions of processes P2 and P3 in the DC program (from Section 6.6.2).
These actions construct the actions of the synthesized program. We presented the actions of P0 in Section 6.6.2. DC31 : (C3 = 1) A (pan, 2 3) ——> C3 := 0; 3713 = -isn3; y3 := false; y2 := false; if ((3713 = 1) A (yg :2 true)) then ya := false; ya := false D032 : (C3 = 1) /\ (cpar3 : 0) A (3713 as snpara) ——8 C3 :2 cpara; 3113 = 377mm; if ((03 = 0) A (313 = true)) then 3);; := false; y2 := false; if ((sn3 = 1) A (95 = true)) then y; := false; y; := false; DC33: (C3 = 0) A (Vk up;c = 3 => (c;C : 1 A3113 E snk)) ——8 C31: 1; D31 : ((‘3 :— 1) A (C2 = 1) A (313 = false) ——+ m 2: true; D31 : (3713 = 0) A (62 = 1) A (313 = false) —-8 ya :=t1'ue; Note that, in action DC31, our synthesis method has added new statements to the statements of the first action in the fault-intolerant DC program. These new 241 statements falsify the witness predicates of the detectors. For example, when c3 becomes 0 the state predicate LC3 no longer holds. Thus, the witness predicate 3);; must be falsified to ensure the interference—freedom of the program and the pres- synthesized detectors. Now, we present the actions of process P2 composed with the detectors d2 and (1’2. D021 : (c2 = 1) /\ (parg = 2) ———+ C2 := 0; 3712 = “18712; 1121: 0:110 1= 0; if ((y3 = false) A (sn2 = 1) A((yé = trWE) V (316 = true») then y; := false; 3,16 := false; 0022 : (c2 = 1) /\ (elmr2 : 0) A (3712 gé snparz) —+ C2 == 6pm,; sn2 = snpm; if ((62 = 1) V (ya = false)) A((y2 = true) V (yo = true))) then y2 := false; yo := false; if ((3712 = 1) V (ya = false)) A((yé = true) V (316 = true») then 3,”? 
:= false; 311’) := false; DC23: (C2 : 0) A (Vk :: pk = 2 => (C;‘ =1A sn2 E s-nk)) ——* C2 := 1; if (y;; = false)) A ((312 = true) V (yo =tr1t€))) then y2 :2 false;y0 := false; if (y3 =2 false)) A «y; = true) v as = true))) then y; := false;y6 2: false; 021 : (y3 = true) A (02 = 1) A (sno = 1) A (c0 = 1) A((paro = 2) V (para 2 1)) A (y2 = false) —+ yg :2 true; 0’21 3 (315, = true) A (02 = 1) /\ (3710 = 1) /\ (Co = 1) A((Pm'o = 2) V (Faro = 1))/\(1/5 = false) -——> y; := true; 242 AA The Promela Model of the Synthesized Diffus- ing Computation Program In this section, we present the Promela model of the synthesized diffusing computation program where we verify the nonmasking fault-tolerance property of the synthesized program. Although the synthesized program is correct by construction, we have con- ducted this formal verification in order to gain more confidence in the implementation of F TSyn. 1#define inv 2 ((( 3 (((C[0] == c[pO]) && (C[4] == C[p0+4])) ll ((c[O] ==1) && (c[pO] == 0)))&& 4 (((c[1] == c[p1]) && (c[5] == c[p1+4])) ll ((c[1] ==1) && (c[p1] == 0)))&& 5 (((c[2] == c[p2]) && (c[6] == c[p2+4])) ll ((c[2] ==1) && (c[p2] == 0)))&& s (((c[3] == c[p3]) && (c[7] == c[p3+4])) ll ((c[3] ==1) && (c[p3] == 0))) 7)) && 8 ((p0 ==O) && (p1 == 0) 88 (p2 == 0) && (p3 == 2)) ) 9 io#define safetyO (!20 ll X0) 11#define safetyOp (!ZOp ll XOp) 12 13#define safety2 (!22 II X2) 14#define safety2p (122p ll X2p) 15 16#define safety3 (123 || X3) 17#define safety3p (!Z3p ll X3p) 18 n)#define X0 (C[3] == 1) && (C[1] == 1) && (C[2] == 1) && (C[0] == 1) && 20 (C[4]==1)&&((p0 == 2) ll (p0 == 1)) 21#define 20 (yo == 1) 22 z;#define XOp (c[7] == 0) && (c[1] == 1) && (c[2] == 1) && (c[O] == 1) && 243 24 25 #define 26 27 28#define 29 30#define 31 32 #define as 34#define 35 36#define 37#define 38 39#define 4o#define 41 ( c[4] == 1 ) && ((p0 == 2 ) II (p0 == 1)) 1) ZOp (yOp X2 (c[3] == 1) && (c[2] == 1) && (c[4] == 1) && (c[0] == 1) && ((p0 == 2 ) ll (p0 == 1)) 22 (y2 ==1) x2p (c[7] == 0) 
88 (c[2] == 1) 88 (c[4] == 1) 88 (c[0] == 1) 88 ((p0 == 2 ) ll (p0 == 1)) =1) 22p (y2p X3 (c[3] == 1) && (c[2] == 1) 23 (y3 ==1) X3p (c[7] == 0) 88 (c[2] == 1) 23p (y3p ==1) 42/* Properties to be verified as [] safety 44 [] (linv -> <> inv) 45 [] (O 46*/ 47 inv) 48 bool c [8] ; 49bool y3 50 =0, y2=0, y3p=0, y2p=0, yO =0, yOp =0; 51/* The cells of this array respectively represent 52 53 54 55 56 c0, c1, c2, c3, snO, snl, sn2, sn3 // CO ---> c[0] // c1 ---> c[1] // c2 ———> c[2] // c3 ---> c[3] 244 57 // 58 // 59 // 60 // 61*/ 62 saint p0 = O; 64int p1 = O; 65int p2 = O; saint p3 = 2; 67 88proctype PO() { 69do 70:: 71 72 73:: 74 75 76 77 78} 79 80:: 81 82 83 84 85 86 87 88 89 snO sn1 sn2 sn3 ---> c[4] ---> c[5] ---> c[6] ---> c[7] atomic{ ((c[O] ==1) && (p0 == 0) ) -> c[0] = O; c[4] = !c[4]; YO = 0; yOP =0; atomic{ ((c[O] == 1) && (c[pO] { CEO] = c[pOJ; C[4] if :: (c[0] == 0) 88 (yO ': else skip; fi; = c[p0+4]; } == 0) && (c[4] != c[p0+4])) -> -=1) -> y0 = 0; y0p =0; atomic{ ((c[O] == 0) && ((p1 != 0) II ((c[1] == 1) && (c[4] == c[5] ))) && ((p2 != 0) ll ((c[2] == 1) && if :: (c[4] == c[6])) ) ) -> { c[0] = 1; (y2 == 0) 88 (yO ==1) —> yO =0; -: else skip; fi; if :: (y2p == 0) 88 (yop ==1)-> yOp =0; '2 else skip; fi; 245 90} 91/* component-based actions of PO */ 92 93 :: atomic { ( ( yO == 1 ) && 94( ( y0p == 1 ) ||( c[5] == 0 ) ||( c[6] == 0 )) ) -> c[4] = 0; 95 y0 =0; y0p = 0; y? 
=0; y2P = 0; } 96 97:: atomic { (y2 == 1) 88 ( c[1] == 1 ) 88 (c[2] == 1) 88 98 (c[0] == 1) && ( c[4] == 1 ) && 99 ((p0 == 2 ) ll (p0 == 1)) && (yO == 0) -> yo = 1; } 100 101:: atomic { (y2p == 1) 88 ( c[1] == 1 ) 88 (c[2] == 1) 88 102 (c[0] == 1) && ( c[4] == 1 ) && um ((p0 == 2 ) II (p0 == 1)) 88 (y0p == 0) —> y0p = 1; } 104 od; 105} 106 107proctype P1() { lmsdo 109:: atomic { ((c[1] ==1) && (p1 =2 1) ) -> c[1] = 0; C[5] = !C[5]; } 1u1:: atomic { ((c[1] == 1) && (c[p1] == 0) && (c[5] != c[p1+4]) ) IN -> c[1] - c[pl]; c[5] = c[p1+4]; } 112:: atomic { (c[1] == 0) -> cu] = 1; } 1130(1; 114} n5 IJGPIOCtype P2() { 117 do 118:: atomic{ ((c[2] ==1) && (p2 == 2) ) -> { C[2] = 0; c[6]= !C[5]; U9 y? =0; y0 =0; y3 =0; y3p =0; 120 if :: ((y3p == 0) && (c[6] == 1)) && ((y2p ==1)|| (y0p ==1)) .21 -> y2p =0; y0p =0; y3p =0; n2 :: else skip; 246 123 fi; 124 } 125 } 126 127:: atomic { ((C[2] == 1) && (C[p2] == 0) && (C[6] != C[p2+4])) 123 -> { c[2] = c[p2]; c[6] = c[p2+4]; 129 if :: ((c[2] == 0) ll (y3 == 0)) && mo ((y2 ==1) ll (yO ==1) ll (y3 ==1)) w) -> y2 =0; yO =0; y3 =0; 132 :: else skip; 133 fi; 134 if :: ((y3p == 0) ll (c[7] == 1) || (c[3] == 0) || 135 (c[2] == 0)) && ((y2p ==1)|| (y0p ==1)|| (y3p ==1)) we -> y2p =0; y0p =0; y39 =0; 137 :: else skip; 138 fi; w9 } M0} 141 142:: atomic { ((c[2] == 0) && ((p3 != 2) || 143 ((CEBJ == 1) && (c{7] == c{6])))) -> { CD] = 1; 111 if :: (y3 == 0) 88 ((y2 ==1)Il(yo ==1)) -> y2 =0; yO =0; y3 =0; M5 :: else skip; 146 fi; 147 if :: (y3p == O)&&((y2p ==1) I I (y0p ==1))-> y2p =0; y0p =0; y3p =0; M8 :: else skip; 149 fi; 150 } 151 } 152 153 :: atomic { (y3 == 1) && (c[2] == 1) && (c[4] == 1) 818: 154 (c[0] == 1) && ((p0 == 2) II (p0 == 1)) MI 155 (y2 == 0) "> y2 = 1; } 247 156 157:: atomic { (y3p == 1) 88 (c[2] == 1) 88 (c[4] == 1) 88 158 (c[0] == 1) 88 ((p == 2 ) ll (p0 == 1)) && 159 (y2p == 0) -> y2p = 1: } mood; 161} 162 183proctype P3() { 164 do 165:: atomic { ((c[3] ==1) && (p3 == 3) ) -> { c[3] = O; c[7] = !c[7]; 186y3 = o; y2 = o; 
167 1881f :: ((c[7] == 1) ll (c[2] ==O)) && (y3p ==1) -> y3p =0; y2p =0; 169:! else skip; 170 fi; 171 } 172} 173 174 175 178:: atomic { ((c[3] == 1) && (c[p3] == 0) && (c[7] != c[p3+4])) 177 -> { CE3] = c[p3]; CU] = ctp3+4]; 178 if :: ((c[3 == 0) ll (c[2] ==O)) && (y3 ===1) -> y3 = 0; y2 =0; 179 :: else skip; 180 fi; 181 if :: ((c[7] == 1) ll (c[2] ==O)) && (y3p ==1)-> y3p =0; y2p =0; 182 :: else skip; 187:: atomic { (c[3] == 0) -> { c[3] = 1; 188 if :: ((c[7] == 1) ll (c[2] ==O)) && (y3p ==1)-> y3p =0; y2p =0; 248 189 :: else skip; 190 fi; 191 } 192 } 193 194:: atomic { (c[3] == 1) && (c[2] == 1) && 195 ((c[6] ==o) ll (c[7] ==O)) 88 (y3 = o) —> y3 = 1; } 196 197:: atomic { (c[7] == 0) && (c[2] == 1) && (y3p I! H ‘- \--’ 0)-> y3p 198 0d; 199 } 200 201 202 203 zoaproctype Pseud00() { 205 do 2mi/* This high atomicity recovery action has been refined by adding an the pre-synthesized components. Thus, we comment it out. mm :: atomic { ( c[0] == 1) && mm ( ( ( c[1] == 1 ) ha ( c[2] == 1 ) && ( c[3] == 1 ) && ( c[4] == 1 ) && 2u1( ( p0 == 2 ) ll ( p0 == 1 ) ) ) && 2n (( c[7] == 0 ) ll ( c[5] == 0 ) II ( c[6] == 0 )) ) -> CE4] = 0; } 212 */ m8 :: atomic{ ((c[O] == 1) && (c[1] == 1) && (c[2] == 1) && m4 (c[3] == 1) && ((p0 == 2) ll (p0 == 1)) ) && 2m ( mo ((c[4] == 0) && (c[5] == 0) && (c[6] == 0) eh (c[7] == 0)) ll 2r7((C[4] == 1) && (c[5] == 1) && (c[6] == 1) && (c[7] == 1)) m8) 219 -> p0 = O; } 220 221:: atomic { 249 222 (c[0] == 1) && (c[1] == 1 ) && (c[2] = 1) && (c[3] == 1) && 223 (c[4] == 0) && (p == 2) && 224 ( 225 ((c[4] == 0) 8m (c[5] == 0) && (c[6] == 1)) II 226 ((c[4] == 0) && (c[5] == 1) && (c[6] == 0)) II 227 ((c[4] == 0) 8188 (c[5] == 0) && (c[7] == 1)) II 228 ((c[4] == 0) && (c[5] == 1) && (c[7] == 0)) II 229 ((c[4] == 0) && (c[6] == 1) 88: (c[7] == 0)) II 230 ((c[4] == 0) he (c[5] == 0) && (c[7] == 1)) 231) —> p0= 1; } 232 233 :: atomic { 234 (c[0] == 1) 818: mm ((C[1] == 1) && (C[2] == 1) && (C[3] == 1) && mm (C[4] == 0)&& (C[5] == 1) && (C[6] == 1) && 
237 (c[7] == 1) 8181 ((p0 == 2) ll (p0 == 1)) ) 211 -> c[0] =1; c[4] = 1; p0 = o; 239 } 240 0d; 241 } 242 243 proctype PseudolC) { 244 do 245 :: atomic { 24o (c[0] == 1) 8:81 247 ((c[1] == 1) && (c[2] == 1) as: (c[3] == 1) && 248 (CM) == 0) && (c[5] == 1) 8:81 (c[6] == 0) && 249 (cm == 0) 88 (p0 == 1) 88 (p1 == 0) 88 :50 (p2 == 0) 88 (p3 == 2)) -> c[5] =0; 251 } mm 253:: atomic{(c[0] == 1) && (c[1] == 1) && (c[2] == 1) && 254 (c[3] == 1)&& (c[4] == 0) && (c[5] == 0) && 250 255 (p0 == 1) 88 ((C[6] == 1) ll (C[7] == 1)) 256 257 od; 258 } 259 250 proctype Pseudo2() { 261 do -> c[5] = 1; } 262 :: atomic {(c[O] == 1) 88 (c[1] == 1) 88 (c[2] == 1) 88 263 (c[3] == 1) 88 (CM) == 0) 88 (c[5] == 1) 88 21:1 (c[6] == 0) 88 (cm == 1) 88 (p0 == 1) 265 -> { c[6] = 1; 266} 267} 268 od; 269} 270 271 proctype Pseud03() { 272 do 273:: atomic { (c[0] == 1) && (c[1] == 1) && (c[2] == 1) && 271 (c[3] == 1) 88 (c[4] == 0) 88 (c[5] 175 (c[6] == 1) 88 (CW) == 0) 88 (p0 2715 (p1 == 0) 88 (p2 == 0) 88 (p3 277 -> } 2790d; 280} 281 282 proctype Faults() { 283 if 284 :: atomic { (true) -> 285 :: atomic { (true) -> 286 :: atomic { (true) -> 287:: atomic { (true) -> c[0] c[0] c[l] c[1] 251 1) 88 1) 88 2) CD] = 1; 288 rhrhfir-M (true) (true) (true) (true) (true) (true) (true) (true) (true) (true) (true) (true) atomic{ (true) ‘> atomic{ (true) -> atomic{ (true) -> 289 :: atomic 2m1:: atomic 291 :: atomic 292 :: atomic an an :: atomic an :: atomic 296 :: atomic 297 :: atomic mm mm 2: atomic 3m1:: atomic 301:: atomic an :: atomic mm mm :: um :: mm :: 307 fi; mm ama} am 311 init{ :n2run Faults(); c[2] c[2] c[3] c[3] c[4] c[4] c[5] c[5] c[6] c[6] C[7] CE7] p0 pO= = o; } = 1; } = o; } = 1; } = o; } = 1; } a o; } = 1; } = o; } = 1; } a o; } = 1; } o; } 1; } 2; } auirun P0(); run P1(); run P2(); run P3(); 384run PseudoO(); :n5run Pseudo2(); ausrun Pseud03(); 317 } run Pseud01(); 252 Appendix B: Agreement in the Presence of Byzantine and Failstop Faults In this section, we present a comprehensive example 
of adding fault-tolerance to a fault-intolerant program using our software framework FTSyn. Specifically, we show how developers of fault-tolerance can interact with FTSyn in order to add masking fault-tolerance to an agreement program. This example may be thought of as a brief version of the user manual for our framework. A more detailed user manual including the source code of FTSyn is available at [73]. The fault-intolerant program consists of a general process and four non-general processes that are perturbed by Byzantine and fail-stop faults. The user should specify the input fault-intolerant program, its variables, its invariant, its specification, and the faults in a text file. The input file of the agreement program is as follows: 1 program Byzant ine-Failstop 2 var 3 bool bi; 4 bool bj ; 5 bool bk; 5 bool b1; 7 bool bg; 253 9 int dg=0, domain 0 .. 1; 10 int di, domain -1 .. 1; 11 // (di == -1) means process $i$ has not yet decided. 12 int dj, domain -1 .. 1; 13 int dk, domain -1 .. 1; 14 int d1, domain -1 .. 1; 15 16 bool fi; 17 bool fj ; 18 bool fk; 19 bool f1; 20 21 bool upi; 22 bool upj; 23 bool upk; 24 bool upl; 25 26 // The structure of process i. 27 process i 28 begin 29 ((di == -1) 88 (fi == 0) 88 (upi == 0)) -> di = dg ; 30 I 31 ((di != -1) && (fi == 0) && (upi == 0)) -> fi = 1 ; 32 33 read di, dj, dk, d1, dg, fi, upi, bi; 34 write di, fi; 35 end 36 37 // The structure of process 3'. 38 process 3' 39 begin 40 ((dj == -1) && (fj == 0) && (upj == 0)) -> dj = dg; 41I 254 42((dj 1= -1) 88 (fj == 0) 88 (upj == 0)) -> fj = 1; 43 44 788d di, dj, dk, d1, dg, fj, upj, bj; 45 write dj, fj; 46 end 47 48 // The structure of process k. 49 process k 50 begin 51((dk == -1) 88 (fk == 0) 88 (upk == 0)) -> dk = dg; 52 I 53 ((di: != -1) && (fk == 0) && (upk == 0)) -> fk = 1; 54 55 786d di, dj, dk, d1, dg, fk, upk, bk; 56 write dk, fk; 57 end 58 59 // The structure of process 1. 
60 process 1 61 begin 62 ((d1 == -1) 88 (f1 == 0) 88 (upl == 0)) -> d1 = dg; 63 l 44((41 1= -1) 88 (11 == 0) 88 (upl == 0)) -> 11 = 1; 65 66 read di, dj, dk, d1, dg, fl, upl, b1; 67 write (11, f1; 68 end 69 70 // Faults are represented as a process. 71 72 fault FailstopAndByzantine 73 begin 74 ((upi == 1)&&(upj == 1)&&(upk == 1)&&(up1 == 1)) 255 -> upi = 0, upj = 0, upk = 0, upl 77((bi == 0)88(bj == 0)88(bk == 0)88(b1 == 0)88(bg == 0)) -> bi = 1, bj = 1, bk = 1, bl = 1, bg = 1, 80((bi == 1)) -> di = 1 , di =0 , 81| 82((bj == 1)) -> dj = 1 , dj =0 , 83l 84((bk == 1)) -> dk = 1 , dk =0 , 85| 86((b1 == 1)) -> d1 = 1 , d1 =0 , 87I 88((bg == 1)) -> dg = 1 , dg =0 , 89 90 end 91// The invariant of the program. 92 invariant 93( ( 94 ((bg==0) 88 95 (((bi == 1) 88 (bj == 0)88 (bk =2 0)88 (bl 96 ((bj == 1) 88 (bi == 0)88 (bk == 0)88 (b1 97 ((bk == 1) 88 (bj == 0)88 (bi == 0)88 (bl 94 ((b1 == 1) 88 (bj == 0)88 (bk == 0)88 (bi 99 100 101 102 103 104 105 106 107 ((bi == 0) 88 (bj = 0)88 (bk == 0)88 (bl ((bi==1)l|(di==-1)||(di==dg))88 ((bj==1)||(dj==-1)|l(dj==dg))88 ((b ==1)ll(dk==-1)||(dk==dg))&8 ((bl==1)lI(dl==-1)ll(d1==dg))88 ((bi==1)|l(fi==0)ll(di!=-1) )88 ((bj==1)l|(fj==0)||(dj!=-1) )88 ((bk==1)l|(fk==0)l|(dk!=-1) )88 ((b ==1)||(f1==0)l|(d1!=-1) ) ) II 256 0)) ll 0)) ll 0)) ll 0)) ll 0)) ) 88 108 109 ((bg==1)&& (bi==0)8&(bj==0)&&(bk==0)&&(b1==0)88 ( no ((((upi == 1) 88 (upj == 1)88 (upk == 1)88 (upl == 1))) 88 111 ((d'==dj)&&(dj==dk)&&(dk==dl)&&(di.'=-1)) ) II 112 ((((upi == 1) 88 (upj == 1)88 (upk == 1)88 (upl == 0))) 88 n3 ((di==dj)88(dj==dk)88(di!=-1)) ) ll 1m ((((upi == 1) 88 (upj == 1)88 (upk == 0)88 (upl == 1))) 88 115 ((di==dj)&&(dj==d1)&&(di!=-1)) ) ll 116 ((((upi == 1) 88 (upj == 0)88 (upk == 1)88 (upl == 1))) 88 117 ((di==dk)&&(dk==dl)&&(di!=-1)) ) ll 11s ((((upi == 0) 88 (upj == 1)88 (upk == 1)88 (upl == 1))) 88 119 ((dj==dk)&&(dk==dl)&&(dj!=-1)) ) no )) 1m ) 122 && us ( 124 ((upi == 0) 88 (upj == 1) 88 (upk ==1) 88 (upl == 1)) || 125 ((upi == 1) 88 (upj == 0) 88 
126  ((upi == 1) && (upj == 1) && (upk == 0) && (upl == 1)) ||
127  ((upi == 1) && (upj == 1) && (upk == 1) && (upl == 0)) ||
128  ((upi == 1) && (upj == 1) && (upk == 1) && (upl == 1)) )
129
130 // The specification of the program is specified in three parts starting
131 // with the specification keyword.
132
133 specification
134
135 // The destination part identifies a set of states such that every
136 // transition reaching them violates safety.
137
138 destination
139 (
140 ((bid == 0) && (bjd == 0) && (upid == 1) && (upjd == 1) && (did != -1) &&
141  (djd != -1) && (did != djd) && (fid == 1) && (fjd == 1)) ||
142 ((bid == 0) && (bkd == 0) && (upid == 1) && (upkd == 1) && (did != -1) &&
143  (dkd != -1) && (did != dkd) && (fid == 1) && (fkd == 1)) ||
144 ((bid == 0) && (bld == 0) && (upid == 1) && (upld == 1) && (did != -1) &&
145  (dld != -1) && (did != dld) && (fid == 1) && (fld == 1)) ||
146 ((bjd == 0) && (bkd == 0) && (upjd == 1) && (upkd == 1) && (djd != -1) &&
147  (dkd != -1) && (djd != dkd) && (fjd == 1) && (fkd == 1)) ||
148 ((bjd == 0) && (bld == 0) && (upjd == 1) && (upld == 1) && (djd != -1) &&
149  (dld != -1) && (djd != dld) && (fjd == 1) && (fld == 1)) ||
150 ((bkd == 0) && (bld == 0) && (upkd == 1) && (upld == 1) && (dkd != -1) &&
151  (dld != -1) && (dkd != dld) && (fkd == 1) && (fld == 1)) ||
152 ((bgd == 0) && (bid == 0) && (did != -1) && (did != dgd) && (fid == 1)) ||
153 ((bgd == 0) && (bjd == 0) && (djd != -1) && (djd != dgd) && (fjd == 1)) ||
154 ((bgd == 0) && (bkd == 0) && (dkd != -1) && (dkd != dgd) && (fkd == 1)) ||
155 ((bgd == 0) && (bld == 0) && (dld != -1) && (dld != dgd) && (fld == 1))
156 )
157
158
159 // The relation part identifies a set of transitions that violate safety.
160
161
162 relation
163 ( (((bis == 0) && (bid == 0) && (fis == 1) && (dis != did))) ||
164   (((bjs == 0) && (bjd == 0) && (fjs == 1) && (djs != djd))) ||
165   (((bks == 0) && (bkd == 0) && (fks == 1) && (dks != dkd))) ||
166   (((bls == 0) && (bld == 0) && (fls == 1) && (dls != dld))) ||
167   (((bis == 0) && (bid == 0) && (fis == 1) && (fid == 0))) ||
168   (((bjs == 0) && (bjd == 0) && (fjs == 1) && (fjd == 0))) ||
169   (((bks == 0) && (bkd == 0) && (fks == 1) && (fkd == 0))) ||
170   (((bls == 0) && (bld == 0) && (fls == 1) && (fld == 0))) )
171
172 // The init section is used for specifying the initial states.
173 init
174
175 // Each initial state is specified using the state keyword.
176
177 state
178 bi = 0; bj = 0; bk = 0; bl = 0; bg = 0; dg = 0;
179 di = -1; dj = -1; dk = -1; dl = -1;
180 fi = 0; fj = 0; fk = 0; fl = 0; upi = 1; upj = 1;
181 upk = 1; upl = 1;
182
183
184 state
185 bi = 0; bj = 0; bk = 0; bl = 0; bg = 0; dg = 1;
186 di = -1; dj = -1; dk = -1; dl = -1;
187 fi = 0; fj = 0; fk = 0; fl = 0; upi = 1; upj = 1;
188 upk = 1; upl = 1;

B.1 The Description of the Input File

The fault-intolerant agreement program consists of four non-general processes Pi, Pj, Pk, Pl and a general Pg. Each non-general process has four variables d, f, b, and up. Variable di represents the decision of a non-general process Pi, fi denotes whether Pi has finalized its decision, bi denotes whether Pi is Byzantine or not, and upi states whether Pi has failed or not. Process Pg also has variables dg and bg. We assume that the process Pg never fails. Thus, the variables of the agreement program are as shown in the var section (cf. Lines 2-24).

Transitions of the fault-intolerant program. If process Pi has not copied a value from the general and Pi has not failed (i.e., upi = 1) then Pi copies the decision of the general (first action in the body of process Pi (cf. Line 29)). If Pi has copied a decision and as a result di is different from -1 then Pi can finalize its decision if it has not failed (second action in the body of process Pi (cf. Line 31)).
Other non-general processes (Pj, Pk, and Pl) have a similar structure, as shown in the input file (cf. Lines 37-68).

Read/Write restrictions. Each non-general process Pi is allowed to read {di, dj, dk, dl, dg, fi, upi, bi}. Thus, Pi can read the d values of the other processes and all of its own variables. The set of variables that Pi can write is {di, fi}. Read/write restrictions of each process are specified in its body after the program actions (using the read and write keywords (e.g., Lines 33-34)).

Faults. A Byzantine fault transition can cause a process to become Byzantine if no process is initially Byzantine. A Byzantine process can arbitrarily change its decision (i.e., the value of d). Moreover, the program is subject to fail-stop faults by which at most one of the non-general processes can fail and, as a result, stop executing any action. The developers of fault-tolerance should specify the faults similar to an independent process that can perturb program variables (cf. Lines 72-89).

Invariant. The developers of fault-tolerance should represent the invariant of the program as a state predicate. In particular, the invariant is a Boolean function (over program variables) that takes a state s and identifies whether s is an invariant state or not. In the agreement program, the bg variable partitions the invariant into two parts: the set of states where Pg is non-Byzantine (cf. Line 94), and the set of states where Pg is Byzantine (cf. Line 109). When Pg is non-Byzantine, at most one of the non-generals can be Byzantine (cf. Lines 95-107). Also, for every non-general process Pi that is non-Byzantine, (i) Pi has not yet decided or it has copied the value of dg (cf. Lines 100-103), and (ii) Pi has not yet finalized or Pi has decided (cf. Lines 104-107). When Pg becomes Byzantine, all the non-general processes are non-Byzantine and all the processes that have not failed agree on the same decision (cf. Lines 109-119).
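Operationally, such a state predicate is just a Boolean function over the program variables. The following is a simplified Python sketch of the branch where Pg is non-Byzantine (a hypothetical illustration whose variable names mirror the input file; it is ordinary Python, not FTSyn syntax):

```python
# Sketch of the bg == 0 branch of the invariant as a Boolean function
# over a state (hypothetical helper; not FTSyn syntax).
def invariant_bg0(s):
    procs = ["i", "j", "k", "l"]
    byzantine = [p for p in procs if s["b" + p] == 1]
    # the general is non-Byzantine and at most one non-general is Byzantine
    if s["bg"] != 0 or len(byzantine) > 1:
        return False
    for p in procs:
        if s["b" + p] == 1:
            continue                      # no constraint on a Byzantine process
        d, f = s["d" + p], s["f" + p]
        if d != -1 and d != s["dg"]:      # a decided value must be the general's
            return False
        if f == 1 and d == -1:            # finalized implies decided
            return False
    return True

state = {"bg": 0, "dg": 1, "bi": 0, "bj": 0, "bk": 0, "bl": 0,
         "di": 1, "dj": -1, "dk": 1, "dl": -1,
         "fi": 1, "fj": 0, "fk": 0, "fl": 0}
```

Calling invariant_bg0(state) returns True for the state above; flipping dk to 0 (a decision that disagrees with dg) makes it return False, as does making two non-generals Byzantine.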
The invariant of the agreement program stipulates the above conditions on the states where at most one non-general process has failed (cf. Lines 124-128).

Safety specification. The safety specification requires that if Pg is Byzantine, all the non-general non-Byzantine processes that have not failed should finalize with the same decision (agreement). If Pg is not Byzantine, then the decision of every finalized non-general non-Byzantine process should be the same as dg (validity). Thus, safety is violated if the program executes a transition that satisfies at least one of the conditions specified in the specification section of the input file (cf. Lines 133-169).

The specification section is divided into two parts: the destination part and the relation part. Intuitively, in the destination part (cf. Lines 138-158), we write a state predicate that identifies a set of states S_destination such that if a transition t reaches a state in S_destination then t violates safety. In the relation part (cf. Lines 162-169), we specify a condition that identifies a set of transitions that should not be executed by the program. Note that we have added a suffix "d" (respectively, a suffix "s") to the variable names in the specification section that stands for destination (respectively, source). Since the relation condition specifies a set of transitions t_spec using their source and destination states, we need to distinguish between the value of a specific variable x in the source state of t_spec (i.e., xs denotes the value of x in the source state of t_spec) and in the destination state of t_spec (i.e., xd denotes the value of x in the destination state of t_spec).

In the case that the program specification does not stipulate any destination condition on safety-violating transitions, we leave the destination section empty with the keyword noDestination. We use the similar keyword noRelation for the case where we do not have relational conditions in the specification.

Initial states. The keyword init (cf.
Line 173) identifies the section of the input file where the user has to specify some initial states. These initial states should belong to the invariant. For each initial state, the user should use the reserved word state (cf. Line 177). In each state section (cf. Lines 177-181 and 185-188), the user should assign to every program variable a value from its corresponding domain.

B.2 The Output of the Framework

In this section, we present the output of the synthesis framework. In particular, we present the actions of the non-general processes. Observe that the structures of the non-generals are not symmetric. In the rest of this section, we describe the structure of each non-general process that is subject to Byzantine and fail-stop faults. Note that each non-general process can take an action if and only if it has not yet finalized and also has not failed due to fail-stop faults.

The description of process Pi. Process Pi of the fault-tolerant agreement program consists of 5 actions. We describe each action as a separate item.

1. If process Pi has not yet decided then it performs one of the following actions: either Pi copies the decision of the general, or if at least two other non-generals have decided on the same value then Pi copies their decision.

(di == -1) && (
  ((dk == 0) && (dl == 0) && (fi == 0) && (upi == 1)) ||
  ((dg == 0) && (fi == 0) && (upi == 1)) ||
  ((dj == 0) && (dl == 0) && (fi == 0) && (upi == 1)) ||
  ((dj == 0) && (dk == 0) && (fi == 0) && (upi == 1)) ) -> set_di_val0

(di == -1) && (
  ((dk == 1) && (dl == 1) && (fi == 0) && (upi == 1)) ||
  ((dg == 1) && (fi == 0) && (upi == 1)) ||
  ((dj == 1) && (dl == 1) && (fi == 0) && (upi == 1)) ||
  ((dj == 1) && (dk == 1) && (fi == 0) && (upi == 1)) ) -> set_di_val1

2.
If process Pi has copied 1, and at least one of the following conditions holds then process Pi changes its decision to 0: (i) Pk and Pl have decided on 0 and Pj has decided; (ii) Pj and Pl have decided on 0, or (iii) Pj and Pk have decided on 0 and Pl has decided.

(di == 1) && (
  (((dj == 0) || (dj == 1)) && (dk == 0) && (dl == 0) && (fi == 0) && (upi == 1)) ||
  ((dj == 0) && (dl == 0) && (fi == 0) && (upi == 1)) ||
  ((dj == 0) && (dk == 0) && ((dl == 0) || (dl == 1)) && (fi == 0) && (upi == 1)) )
  -> set_di_val0

3. If process Pi has copied 0, and at least one of the following conditions holds then process Pi changes its decision to 1: (i) Pj and Pk have decided on 1; (ii) Pl and Pg have decided on 1; (iii) Pj and Pl have decided on 1, or (iv) Pk and Pl have decided on 1.

(di == 0) && (
  ((dj == 1) && (dk == 1) && (fi == 0) && (upi == 1)) ||
  ((dl == 1) && (dg == 1) && (fi == 0) && (upi == 1)) ||
  ((dj == 1) && (dl == 1) && (fi == 0) && (upi == 1)) ||
  ((dk == 1) && (dl == 1) && (fi == 0) && (upi == 1)) ) -> set_di_val1

4. Process Pi finalizes with decision 0 if at least one of the following conditions holds: (i) Pj has decided on 0 or Pj has not yet decided, and Pk has decided on 0, and Pl has decided on 0 or Pl has not yet decided; (ii) Pj has decided on 0 or Pj has not yet decided, and Pk has decided on 0 or Pk has not yet decided, and Pl has decided on 0; (iii) Pj has decided on 0, and Pk has decided on 0 or Pk has not yet decided, and Pl has decided on 0 or Pl has not yet decided.

(di == 0) && (
  (((dj == 0) || (dj == -1)) && (dk == 0) && ((dl == 0) || (dl == -1)) &&
   (fi == 0) && (upi == 1)) ||
  (((dj == 0) || (dj == -1)) && (dl == 0) && ((dk == 0) || (dk == -1)) &&
   (fi == 0) && (upi == 1)) ||
  ((dj == 0) && ((dk == 0) || (dk == -1)) && ((dl == 0) || (dl == -1)) &&
   (fi == 0) && (upi == 1)) )
  -> set_fi_val1

5. Process Pi finalizes with decision 1 if at least one of the following conditions holds.
(i) Pj has decided on 1, and Pk has decided on 1 or Pk has not yet decided, and Pl has decided on 1 or Pl has not yet decided; (ii) Pj has decided on 1 or Pj has not yet decided, and Pl has decided on 1 or Pl has not yet decided, and Pk has decided on 1; (iii) Pj has decided on 1 or Pj has not yet decided, and Pk has decided on 1 or Pk has not yet decided, and Pl has decided on 1.

(di == 1) && (
  ((dj == 1) && ((dk == 1) || (dk == -1)) && ((dl == 1) || (dl == -1)) &&
   (fi == 0) && (upi == 1)) ||
  (((dj == 1) || (dj == -1)) && (dk == 1) && ((dl == 1) || (dl == -1)) &&
   (fi == 0) && (upi == 1)) ||
  (((dj == 1) || (dj == -1)) && ((dk == 1) || (dk == -1)) && (dl == 1) &&
   (fi == 0) && (upi == 1)) )
  -> set_fi_val1

The description of process Pj. The actions of process Pj in the fault-tolerant agreement program are as follows:

1. If process Pj has not yet decided then it performs one of the following actions: Pj either copies the decision of the general, or if at least two other non-generals have decided on the same value then Pj copies their decision.

2. If process Pj has copied 1, and at least one of the following conditions holds then process Pj changes its decision to 0: (i) Pi and Pl have decided on 0; (ii) Pk and Pl have decided on 0, or (iii) Pi and Pk have decided on 0.

3. If process Pj has copied 0, and at least one of the following conditions holds then process Pj changes its decision to 1: (i) Pi and Pk have decided on 1; (ii) Pi and Pl have decided on 1, or (iii) Pk and Pl have decided on 1.

4. Process Pj finalizes with decision 0 if at least one of the following conditions holds: (i) Pi has decided on 0 or Pi has not yet decided, and Pk has decided on 0 or Pk has not yet decided, and Pl has decided on 0; (ii) Pi has decided on 0, and Pk has decided on 0 or Pk has not yet decided, and Pl has decided on 0 or Pl has not yet decided; (iii) Pi has decided on 0 or Pi has not yet decided, and Pk has decided on 0, and Pl has decided on 0 or Pl has not yet decided.

5.
Process Pj finalizes with decision 1 if at least one of the following conditions holds: (i) Pi has decided on 1 or Pi has not yet decided, and Pk has decided on 1 or Pk has not yet decided, and Pl has decided on 1; (ii) Pi has decided on 1 or Pi has not yet decided, and Pl has decided on 1 or Pl has not yet decided, and Pk has decided on 1; (iii) Pi has decided on 1, and Pk has decided on 1 or Pk has not yet decided, and Pl has decided on 1 or Pl has not yet decided.

The description of process Pk. The actions of process Pk in the fault-tolerant agreement program are as follows:

1. If process Pk has not yet decided then it performs one of the following actions: Pk either copies the decision of the general, or if at least two other non-generals have decided on the same value then Pk copies their decision.

2. If process Pk has copied 1, and at least one of the following conditions holds then process Pk changes its decision to 0: (i) Pi and Pj have decided on 0; (ii) Pi and Pl have decided on 0; (iii) Pi and Pg have decided on 0; (iv) Pj and Pl have decided on 0; (v) Pj and Pg have decided on 0, or (vi) Pl and Pg have decided on 0.

3. If process Pk has copied 0, and at least one of the following conditions holds then process Pk changes its decision to 1: (i) Pl and Pg have decided on 1; (ii) Pi and Pl have decided on 1; (iii) Pi and Pj have decided on 1; (iv) Pj and Pl have decided on 1, or (v) Pj and Pg have decided on 1.

4. Process Pk finalizes with decision 0 if at least one of the following conditions holds: (i) Pi has decided on 0, and Pj has decided on 0 or Pj has not yet decided, and Pl has decided on 0 or Pl has not yet decided; (ii) Pi has decided on 0 or Pi has not yet decided, and Pj has decided on 0, and Pl has decided on 0 or Pl has not yet decided; (iii) Pi has decided on 0 or Pi has not yet decided, and Pl has decided on 0, and Pj has decided on 0 or Pj has not yet decided.

5.
Process Pk finalizes with decision 1 if at least one of the following conditions holds: (i) Pi has decided on 1, and Pj has decided on 1 or Pj has not yet decided, and Pl has decided on 1 or Pl has not yet decided; (ii) Pi has decided on 1 or Pi has not yet decided, and Pj has decided on 1 or Pj has not yet decided, and Pl has decided on 1; (iii) Pi has decided on 1 or Pi has not yet decided, and Pl has decided on 1 or Pl has not yet decided, and Pj has decided on 1.

The description of process Pl. The actions of process Pl in the fault-tolerant agreement program are as follows:

1. If process Pl has not yet decided then it performs one of the following actions: Pl either copies the decision of the general, or if at least two other non-generals have decided on the same value then Pl copies their decision.

2. If process Pl has copied 1, and at least one of the following conditions holds then process Pl changes its decision to 0: (i) Pi and Pj have decided on 0; (ii) Pj and Pk have decided on 0; (iii) Pg and Pi have decided on 0; (iv) Pi and Pk have decided on 0.

3. If process Pl has copied 0, and at least one of the following conditions holds then process Pl changes its decision to 1: (i) Pi and Pj have decided on 1; (ii) Pj and Pk have decided on 1; (iii) Pi and Pk have decided on 1.

4. Process Pl finalizes with decision 0 if at least one of the following conditions holds: (i) Pi has decided on 0, and Pj has decided on 0 or Pj has not yet decided, and Pk has decided on 0 or Pk has not yet decided; (ii) Pi has decided on 0 or Pi has not yet decided, and Pj has decided on 0 or Pj has not yet decided, and Pk has decided on 0; (iii) Pi has decided on 0 or Pi has not yet decided, and Pj has decided on 0, and Pk has decided on 0 or Pk has not yet decided.

5.
Process Pl finalizes with decision 1 if at least one of the following conditions holds: (i) Pi has decided on 1, and Pj has decided on 1 or Pj has not yet decided, and Pk has decided on 1 or Pk has not yet decided; (ii) Pi has decided on 1 or Pi has not yet decided, and Pj has decided on 1 or Pj has not yet decided, and Pk has decided on 1; (iii) Pi has decided on 1 or Pi has not yet decided, and Pk has decided on 1 or Pk has not yet decided, and Pj has decided on 1.
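The synthesized actions above share one voting pattern: an undecided process trusts the general only when no two of its peers already agree on a value, and a decided process switches when two peers agree on the opposite value. The following deliberately simplified Python simulation (hypothetical helper names; it abstracts away the exact synthesized guards, the fi/upi checks, and fail-stop faults) shows why this pattern yields agreement even under a Byzantine general:

```python
# Simplified illustration of the voting pattern in the synthesized actions
# (hypothetical helper; NOT the exact synthesized guards).
# An undecided process copies the general unless two peers already agree on
# a value; a decided process switches when two peers agree on the other value.
def converge(decisions, dg_seen):
    changed = True
    while changed:
        changed = False
        for p in sorted(decisions):
            peers = [decisions[q] for q in decisions if q != p]
            for v in (0, 1):
                if peers.count(v) >= 2 and decisions[p] != v:
                    decisions[p] = v           # adopt the two-peer quorum value
                    changed = True
                    break
            else:
                if decisions[p] == -1:
                    decisions[p] = dg_seen[p]  # no quorum yet: trust the general
                    changed = True
    return decisions

# A Byzantine general sends 0 to Pi but 1 to Pj, Pk, and Pl:
d = converge({"i": -1, "j": -1, "k": -1, "l": -1},
             {"i": 0, "j": 1, "k": 1, "l": 1})
```

Although the Byzantine general sends conflicting values, the two-peer quorum rule pulls Pi back to the value the other non-generals hold, so all four non-generals end up agreeing; this is the intuition behind the decision-changing actions (items 2 and 3) of each synthesized process.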