145.33....“ ‘ , ‘ . . . . ‘ . . ﬁgzﬁg=ﬁm§
J; - A . . , ‘ : . . . . ‘ ,. 155%”

 

hr.
Yam-max:

. .
.1. .2

swﬂﬂ.
.8.”

".3"

 

i

Emu»: 5.;

. .33.}?
53.45
:99
i. a
‘

 

‘ “awn...
s ”:31... .1
ESQ. ~

 

 

 

 

 

 

 

     

    

    

«f

Dy
.1

         

2|.

  

 

um

 

1.. 1:! . 5:».

“ﬁaaﬁeiﬁn‘iﬁ

{- if :25! s
. 4 {3233.4 33.

C .9 la. ‘oo

.3 , :rr

   
 

~ I

v

 

v:
“at

   
 
 

auaaﬁgﬁ

‘ 7W? v

\
\J

1 , f L
Qf‘lfi/l7&

LIBRARIES
MICHIGAN STATE UNIVERSITY
EAST LANSING, MICH 48824-1048

This is to certify that the
dissertation entitled

AUTOMATIC SYNTHESIS OF FAULT-TOLERANCE

Ph.D.

 

presented by

ALI EBNENASIR

has been accepted towards fulﬁllment
of the requirements for the

degree in Computer Science

 

 

Major Professor’s Signature

€149"

 

Date

MSU is an Afﬁrmative Action/Equal Opportunity Institution

 

PLACE IN RETURN BOX to remove this checkout from your record.
To AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE

DATE DUE

DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2/05 cﬁnmteDueJndd-p. 15

 

AUTOMATIC SYNTHESIS OF
FAULT-TOLERANCE

By

Ali Ebnenasir

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY
Department of Computer Science and Engineering

2005

if):
an

Siez‘

(Um
fram

the d

ABSTRACT
AUTOMATIC SYNTHESIS OF FAULT-TOLERANCE
By
All Ebnenasir

Fault-tolerance is an important property of today’s software systems as we rely
on computers in our daily affairs (e.g., medical equipments, transportation systems,
etc). Since it is difﬁcult (if not impossible) to anticipate all classes of faults that
perturb a program while designing that program, it is desirable to incrementally add
fault-tolerance concerns to an existing program as we encounter new classes of faults.
Hence, in this dissertation, we concentrate on automatic addition of fault-tolerance
to (distributed) programs; i.e., synthesizing fault-tolerant programs from their fault-
intolerant version. Such automated synthesis generates a fault-tolerant program that
is correct by construction, thereby alleviating the need for its proof of correctness.
Also, there exists a potential for reusing the computations of the fault-intolerant

program during the synthesis of its fault-tolerant version.

In the absence of faults, the synthesized fault-tolerant program should behave
similar to the fault-intolerant program. In the presence of faults, the synthesized
fault—tolerant program has to provide a desired level of fault-tolerance, namely failsafe,
nonmasking, or masking fault-tolerance. A failsafe fault-tolerant program guarantees
safety even in the presence of faults. In the presence of faults, a nonmasking fault-
tolerant program recovers to states from where its safety and liveness speciﬁcations
are satisﬁed. A masking fault-tolerant program always satisﬁes safety and recovers to

states from where its safety and liveness speciﬁcations are satisﬁed.

To provide a foundation for automatic synthesis of fault-tolerant programs, we
concentrate on two directions: theoretical aspects, and the development of a software
framework for the synthesis of fault-tolerant. programs. The main contributions of

the dissertation regarding theoretical aspects are as follows:

fl

0 We identify the effect of safety speciﬁcation modeling on the complexity of

synthesizing fault-tolerant programs from their fault-intolerant version.

0 We show the N P-completeness of synthesizing failsafe fault-tolerant distributed

programs from their fault-intolerant version.

0 We identify the sufficient conditions for polynomial-time synthesis of failsafe

fault-tolerant distributed programs.

0 we design a sound and complete synthesis algorithm for enhancing the fault-
tolerance of high atomicity programs — where program processes can atomically

read / write all program variables ~ from nonmasking to masking.

0 We present a sound algorithm for enhancing the fault-tolerance of distributed
programs — where program processes have read / write restriction with respect

to program variables.

0 We present a synthesis method for providing reuse in the synthesis of differ-
ent programs where we automatically specify and add pre-synthesized fault-

tolerance components to programs.

0 We deﬁne and address the problem of synthesizing multitolemnt programs that
are subject to multiple classes of faults and provide (possibly) different levels

of fault-tolerance corresponding to each fault-class.

To validate our theoretical results, we develop an extensible software framework,
called Fault-Tolerance Synthesizer (FTSyn), where developers of fault-tolerance can
interactively synthesize fault-tolerant programs. Also, FTSyn provides a platform
for developers of heuristics to extend F TSyn by integrating their heuristics for the
addition of fault-tolerance in F TSyn. Using FTSyn, we have synthesized several
fault-tolerant distributed programs that demonstrate the applicability of FTSyn for
the cases where we have different types of faults, and for the cases where a program

is subject to multiple simultaneous faults.

© Copyright by
ALI EBNENASIR
2005

To my parents Hussein and Ezzat and my wife N iloofar for all their

love and sacriﬁces.

 

lfl

in;
5‘11;

Oil

ACKNOWLEDGMENTS

All thanks go to the almighty God who has endowed us the blessing Of existence.
I extend my regards to all people who have contributed to my education in anyway
from primary school to higher educations. First, I am truly grateful to Dr. Sandeep
Kulkarni whose guidance was always enlightening throughout my PhD program. Also,
I thank the members of my PhD committee, Dr. Laura Dillon, Dr. Betty Cheng,
and Dr. Jonathan Hall, who have always supported me by their valuable comments.
Furthermore, I appreciate all the efforts of the Computer Science and Engineering
Department at Michigan State University towards creating a productive environment
for research and education.

Moreover, I would like to thank my teachers and advisors to whom I am in debt
for all their hard work and sacriﬁces for educating me and my fellow students: Dr.
Mohsen Shariﬁ my advisor in my Master program at Iran University of Science and
Technology, Tehran - Iran, Dr. Abbas Vafaei and Dr. Mustafa Kermani my professors
at the University of Isfahan, Isfahan - Iran, Mr. Fereydani, Mr. Khayyam, Mr.
Nahvi, Mr. Riazi, and Mr. Nasr my teachers in high school, Mr. Saljooghian in
middle school, and ﬁnally my first grade teacher Mrs. Afshari.

Last but not least, I thank my fellow graduate students at the Software Engineer-
ing and Network Systems Laboratory at Michigan State University who have always
supported me by (i) proofreading my manuscripts, (ii) providing valuable feedback
on my research work, (iii) engaging in discussions, and (iv) attending my not-so—
attractive talks. In particular, I appreciate the sincere collaboration Of Laura Anne
Campbell, Karun Biyani, Bru Bezawada, Sascha Konrad, Mahesh Arumugam, and
Borzoo Bonakdarpour.

Thank you.

vi

 

 

 

(0 IO L's Ln -

IQ If)

00

TABLE OF CONTENTS

LIST OF FIGURES x
1 Introduction 1
1.1 The Outline Of the Dissertation ....................... 6
2 Preliminaries 8
2.1 Program .................................... 8
2.2 Issues of Distribution ............................. 10
2.3 Speciﬁcation .................................. 11
2.4 Fault ...................................... 13
2.5 Fault-Tolerance ................................ 14
2.6 The Problem of Adding Fault-Tolerance .................. 15
2.7 Synthesis of Fault-Tolerance in High Atomicity .............. 17
2.7.1 Synthesizing Failsafe Fault-Tolerance ................... 17
2.7.2 Synthesizing Nonmasking Fault-Tolerance ................ 18
2.7.3 Synthesizing Masking Fault-Tolerance .................. 19
2.8 Synthesis of Fault-Tolerant Distributed Programs ............. 21
3 The Effect of Safety Speciﬁcation Model on the Complexity of Syn-
thesis 24
3.1 N P-Completeness Proof ........................... 26
3.1.1 Mapping 3-SAT to the Addition of Masking Fault-Tolerance ...... 26
3.1.2 Reduction from 3-SAT ........................... 28
3.2 Summary ................................... 32
4 Synthesizing Failsafe Fault-Tolerant Distributed Programs 34
4.1 Problem Statement .............................. 35
4.2 NP-Completeness Proof ........................... 37
4.2.1 Mapping 3-SAT to an Instance of the Synthesis Problem ........ 37
4.2.2 Reduction from 3-SAT ........................... 42
4.3 Monotonic Speciﬁcations and Programs ................... 45
4.3.1 Sufficiency of Monotonicity ........................ 46
4.3.2 Role of Monotonicity in Complexity Of Synthesis ............ 50
4.4 Examples of Monotonic Speciﬁcations .................... 51
4.4.1 Byzantine Agreement ........................... 52
4.4.2 Consensus and Commit .......................... 55
4.5 Summary ................................... 56

vii

5 Fault-Tolerance Enhancement 58

5.1 Problem Statement .............................. 59
5.2 Enhancement in High Atomicity Model ................... 61
5.2.1 Example: Triple Modular Redundancy .................. 66
5.3 Enhancement for Distributed Programs ................... 69
5.3.1 Example: Byzantine Agreement ...................... 75
5.4 Using Monotonicity for the Enhancement Of Fault-Tolerance ....... 81
5.4.1 Monotonicity of Nonmasking Programs .................. 81
5.4.2 Example: Distributed Counter ...................... 85
5.5 Enhancement versus Addition ........................ 88
5.6 Summary ................................... 90
6 Pre—Synthesized Fault-Tolerance Components 92
6.1 Problem Statement .............................. 94
6.2 The Synthesis Method ............................ 95
6.2.1 Overview of Synthesis Method ....................... 95
6.2.2 Token Ring Example ............................ 98
6.3 Specifying Pre—Synthesized Components .................. 101
6.3.1 The Speciﬁcation of Detectors ....................... 101
6.3.2 The Representation of Detectors ..................... 102
6.3.3 Token Ring Example Continued ...................... 105
6.4 Using Pre—Synthesized Components ..................... 106
6.4.1 Algorithmic Speciﬁcation of the Fault-Tolerance Components ..... 106
6.4.2 Token Ring Example Continued ...................... 108
6.4.3 Algorithmic Addition of The Fault—Tolerance Components ....... 108
6.4.4 Token Ring Example Continued ...................... 117
6.5 Example: Alternating Bit Protocol ..................... 118
6.6 Adding Hierarchical Components ...................... 126
6.6.1 Specifying Hierarchical Components ................... 127
6.6.2 Diffusing Computation ........................... 128
6.7 Discussion ................................... 136
6.8 Summary ................................... 139
7 Automated Synthesis of Multitolerance 140
7.1 Problem Statement .............................. 141
7.2 Addition of Fault—Tolerance to One Fault-Class .............. 144
7.3 Nonmasking-Masking Multitolerance .................... 146
7.4 Failsafe—Masking Multitolerance ....................... 149
7.5 Failsafe-NonmaskingMasking Multitolerance ............... 152
7.5.1 Non-Deterministic Synthesis Algorithm .................. 152
7.5.2 Mapping 3-SAT to Multitolerance ..................... 154
7.5.3 Reduction From 3—SAT ........................... 156
7.5.4 Failsafe-Nonmasking h’lultitolerance .................... 160
7.6 Summary ................................... 161

 

 

 

CO CD (c (3 ch M~

1(

ll.)
10.
10.
AP

Bll

8 FTSyn: A Software Framework for Automatic Synthesis
Tolerance
8.1 Adding Fault—Tolerance to Distributed Programs
8.1.1 The Input / Output Of the Framework ..............
8.1.2 Framework Execution Scenario
8.1.3 User Interactions .........................
8.2 Framework Internals
8.2.1 Class Modeling ..........................
8.2.2 Design Patterns ..........................
8.3 Integrating New Heuristics .....................
8.4 Changing the Internal Representations ..............
8.5 Example: Altitude Controller ...................
8.6 Summary

oooooooooooooooooooooooooooooo

9 Ongoing Research

9.1 Program Transformation ......................
9.1.1 Problem Statement ........................
9.1.2 'ITansformation Algorithm ....................
9.1.3 Soundness .............................
9.2 Speciﬁcation Transformation ....................
9.2.1 Problem Statement ........................
9.2.2 Transformation Algorithm ....................
9.3 Example: Distributed Control System
9.4 SAT-based Synthesis of Fault-Tolerance
9.4.1 Synthesis Method .........................
9.4.2 Representing Synthesis Requirements as Boolean Formulas . .
9.4.3 Implementing SAT-based Synthesis ...............
9.5 Summary ..............................

10 Conclusion and Future Work

10.1 Discussion ..............................
10.2 Contributions ............................
10.3 Impact ................................
10.4 Future Work

APPENDICES
BIBLIOGRAPHY

ix

Of Fault-

163
164
164
169
173
175
175
177
179
182
184
189

.....

190
191
191
192
195
197
198
198

208
212
216

217
217
223
226
228

232

268

2.1
2.2
2.3
2.4

3.1

3.2

4.1

4.2

4.3

4.4

5.1
5.2
5.3

6.1
6.2
6.3

LIST OF FIGURES

Synthesizing failsafe fault-tolerance in the high atomicity model. .......
Synthesizing nonmasking fault-tolerance in the high atomicity model.

Synthesizing masking fault—tolerance in the high atomicity model. ......
A non-deterministic algorithm for adding fault-tolerance to distributed pro-

grams. ...................................

The states and the transitions corresponding to the propositional variables in
the 3—SAT formula. (Except for transitions marked as fault all are pro-
gram transitions. Also, note that the program has no long transitions that
originate from a, and no short transitions that originate from q.) .....

The partial structure of the fault-tolerant program ..............
The relation between the invariant of a fault-intolerant program p and a fault-
tolerant program 17’ . ............................

The transitions corresponding to the propositional variables in the 3-SAT for—
mula. ...................................

The structure Of the fault-intolerant program for a propositional variable I),- and
a disjunction cj = bm V 55;, V ()1. ......................

The value assignment to variables ........................

The enhancement of fault—tolerance in high atomicity. ..........
Constructing an invariant in the low atomicity model. ..........

The enhancement Of fault-tolerance for distributed programs. ......

Overview of the synthesis method. ......................
Automatic speciﬁcation Of a component. ...................

Verifying the interference-freedom conditions. ................

18
19
20

22

27
29

36

38

39
40

62
70
71

6.4

7.1
7.2
7.3
7.4

7.5
7.6

8.1
8.2
8.3
8.4
8.5

9.1
9.2

9.3
9.4
9.5

The automatic addition of a component. ...................

Synthesizing nonmasking-masking multitolerance. ...............
Synthesizing failsafe-masking multitolerance. .................
A non—deterministic polynomial algorithm for synthesizing multitolerance.

The states and the transitions corresponding to the propositional variables in
the 3-SAT formula. ............................

The partial structure Of the multitolerant program ..............
A proof Sketch for NP-completeness of synthesizing failsafe-nonmasking multi-

tolerance. .................................
A deterministic execution scenario for the framework FTSyn. ........
The class diagram of F TSyn. .........................
The Bridge design patterns. ..........................
The FactoryMethod design patterns. ......................

Integrating the deadlock resolution heuristics using Strategy pattern. .....

Transforming non-monotonic programs to positive monotonic. ........

Algorithms for removing deadlock states and ensuring the closure Of the invari-
ant. ....................................

Transforming non-monotonic specifications to monotonic. ..........

Non-deterministic algorithm for adding fault-tolerance to distributed programs.

Using SAT solvers for the synthesis Of fault-tolerant programs. ........

xi

147
150
153

155
157

161

170
176
178
178
179

193

194
199
206

Chapter 1

Introduction

The anticipation of all classes of faults that may perturb a program is difﬁcult (if
not impossible). Thus, it is desirable to synthesize fault-tolerant programs from
their fault-intolerant version upon ﬁnding new classes of faults. Although there exist
efﬁcient approaches [1] for the synthesis of high atomicity fault—tolerant programs ~—
where processes can read/ write all program variables in an atomic step, there exists
a well-deﬁned need for developing efﬁcient techniques for the synthesis of (i) fault-
tolerant distributed programs — where processes have read/ write restrictions with
respect to program variables, and (ii) multitolerant programs - where a program
simultaneously provides different levels of fault—tolerance to different classes of faults.
In this dissertation, we concentrate on the theoretical and the practical aspects of

synthesizing fault—tolerant distributed programs and multitolerant programs.

To synthesize a fault-tolerant program from its fault-intolerant version, Kulkarni
and Arora [1] present a synthesis method that takes a given class of faults and a
fault-intolerant program, and generates a program that is fault-tolerant to that class
of faults. The fault-intolerant program satisﬁes its (safety and liveness) speciﬁcation
in the absence of faults and provides no guarantees in the presence of faults. The

synthesized fault-tolerant program provides a desired level of fault-tolerance in the

 

p;

SH

Spa
Dru
0f 5
bat,

aU'li

presence of faults, and satisﬁes the safety and liveness speciﬁcation of the fault-

intolerant program in the absence of faults.

Such synthesis approach has the potential to reuse the computations of the fault-
intolerant program during the synthesis of its fault-tolerant version. As a result,
reusing the computations of a fault-intolerant program preserves its important prop-
erties (e.g., efﬁciency) that are difﬁcult to specify in a speciﬁcation-based approach
(e. g., [2, 3, 4]) where one synthesizes a fault-tolerant program from its temporal logic

(respectively, automata—theoretic [5, 6, 7]) specification.

The synthesized fault-tolerant program provides one of the three levels of fault-
tolerance namely, failsafe, nonmasking, and masking [1]. Intuitively, in the presence of
faults, a failsafe fault-tolerant program ensures that its safety speciﬁcation is satisﬁed.
In the presence of faults, a nonmasking fault-tolerant program recovers to states
from where its safety and liveness speciﬁcation is satisﬁed. A masking fault-tolerant
program guarantees that in the presence of faults it recovers to states from where its

safety and liveness speciﬁcation is satisﬁed while preserving safety during recovery.

The complexity of the synthesis presented in [1] depends on the program model.
The authors of [1] Show that the complexity of synthesis is polynomial in the state
space of the fault-intolerant program in the high atomicity model. For distributed
programs (i.e., low atomicity model), Kulkarni and Arora Show that the complexity
of synthesizing masking fault-tolerance is exponential. Also, in the speciﬁcation-
based approach, the synthesis of fault-tolerant distrIbuted programs (with particular

architectures) from their speciﬁcation is known to be non-elementary decidable [6, 7].

A survey of the literature [7, 8] reveals that the complexity of synthesis and the
inefﬁciency Of the synthesized programs construct the main obstacles in the automated
synthesis of fault-tolerant programs. Moreover, to the best of our knowledge, no
automated approach has been presented for adding multitolerance to programs where

a multitolerant program is subject to multiple classes of faults and provides (possibly)

2

different levels of fault-tolerance corresponding to different classes of faults. Hence,
in this dissertation, we focus our attention on theoretical and practical problems in
the synthesis of fault-tolerant distributed programs and multitolerant programs.

Theoretical problems. Regarding theoretical aspects of synthesis, we address the

following problems:

0 Identify the eﬂect of safety speciﬁcation model on the complexity of synthesis

It is shown in the literature that the complexity of adding fault-tolerance to
high atomicity programs is polynomial in the state space of the fault-intolerant
program if the safety speciﬁcation is represented as a set of bad transitions
[1]. In [9], the authors conjecture that representing safety speciﬁcation as a set
of sequences of transitions results in exponential complexity for adding fault-
tolerance. They validate their claim in the context of some examples. However,
to the best of our knowledge, there exist no signiﬁcant result to verify the
claim made in [9]. Thus, it is desirable to explore the complexity of synthesis
in the case where safety speciﬁcation is represented as a set of sequences of
transitions. The signiﬁcance of such complexity analysis is in that it identiﬁes
the appropriate approach for modeling safety Speciﬁcation where automatic

addition of fault-tolerance can be done efﬁciently.
0 Find sufﬁcient conditions for polynomial-time synthesis of distributed programs

Since the complexity Of synthesizing fault-tolerant distributed programs from
their fault-intolerant version is exponential [1], we Shall identify properties of

programs and specifications where the synthesis can be done in polynomial time.

0 Reduce the complexity of synthesis by reusing the computations of the fault-

intolerant program

During the synthesis Of fault~tolerant programs, there exist situations where

the computational structure of the fault-intolerant program provides necessary

3

means for satisfying fault—tolerance requirements in the presence of faults. Thus,
it is desirable to design synthesis algorithms that take advantage of such situa-

tions to reduce the complexity Of synthesis.

Identify and reuse pre-synthesized fault-tolerance components

There exist recurring sub-problems of synthesis that arise in the synthesis of
different programs (e.g., resolving deadlock states). Thus, it is desirable to
generalize the solution to common synthesis problems so that we can develop
generic solution strategies that are independent of the program at hand. In other
words, we would like to reuse the effort put in the synthesis of one program for
the synthesis of another program. To achieve this goal, we plan to identify com-
monly encountered patterns in the synthesis of programs in order to encapsulate
those patterns in the form of pre-synthesized fault-tolerance components. Also,
we would like to devise a synthesis method where we automatically specify and

add the required pre—synthesized components to the fault-intolerant programs.

Synthesize programs that tolerate multiple classes of faults and provide diﬂerent

levels of fault-tolerance to each fault-class

Dependable and fault-tolerant systems are often subject to multiple classes of
faults, and hence, these systems need to provide appropriate level of fault-
tolerance to each class of faults. Often it is undesirable or impractical to provide
the same level of fault-tolerance to each class of faults. Hence, these systems
need to tolerate multiple classes of faults, and provide a (possibly) different
level of fault-tolerance to each class. To characterize such systems, the notion
of multitolerance was introduced in [10]. The importance of such multitolerant
systems can be easily observed from the fact that several methods for designing
multitolerant programs as well as several instances of multitolerant programs

can be found (e.g., [11, 12, 13, 10]) in the literature.

4

Automated synthesis of multitolerant programs has the advantage Of generat-
ing fault—tolerant programs that (i) are correct by construction, and (ii) tol-
erate multiple classes of faults. However, the complexity Of such synthesis is
an Obstacle in the synthesis of multitolerant programs. Speciﬁcally, there exist
situations where satisfying a speciﬁc fault-tolerance requirement for one class
of faults conflicts with providing a different level of fault-tolerance to another
fault-class. Hence, it is necessary to identify situations where synthesis of mul-
titolerant programs can be performed efﬁciently and where heuristics need to

be developed for adding I‘nultitolerance.

Practical problems. To reduce the exponential complexity Of synthesis for prac-
tical purposes and to enable the synthesis of programs that have large state space,
heuristic-based approaches are proposed in [14, 15, 9]. These heuristic-based ap—
proaches reduce the complexity of synthesis by forfeiting the completeness of synthe—
sizing fault-tolerant distributed programs. In other words, if heuristics are applicable
then a heuristic-based algorithm will generate a fault-tolerant program efﬁciently.
However, if the heuristics are not applicable then the synthesis algorithm will declare
failure even though it is possible to synthesize a fault-tolerant program from the given
fault-intolerant program.

The development and the implementation Of heuristics are complicated by the
fact that, for a given heuristic, we need to determine how that heuristic reduces the
complexity of synthesizing fault-tolerant distributed programs. Furthermore, we need
to identify if a heuristic is so restrictive that its use will cause the synthesis algorithm
to declare failure very often. Also, in order to provide maximum efﬁciency, there
exist situations where we need to apply heuristics in a specific order. Moreover, the
developers of a fault-tolerant program may have additional insights about the order
in which heuristics should be applied. Thus, we have to provide the possibility of

changing the order of available heuristics (respectively, adding new heuristics) for the

5

 

 

In
am
all“

we
Cher
mlf'l‘
also j
distr‘il
enhant
We inn
where ii
slutme
in Chain
and We s}

llllOledlll
We present
[ole

rant rim

developers of fault-tolerance.

Therefore, there exists a substantial need for an extensible software framework
where (i) developers of fault- tolerant programs can synthesis fault-tolerant programs
from their fault-intolerant version; (ii) developers of heuristics can integrate new
heuristics into the framework or modify exiting heuristics, and (iii) developers can
beneﬁt from existing automated reasoning tools (e.g., SAT solvers) in the synthesis

of fault—tolerant distributed programs.

1.1 The Outline Of the Dissertation

In Chapter 2, we present preliminary concepts of programs, speciﬁcations, faults,
and fault-tolerance. We also describe synthesis algorithms presented by Kulkanri
and Arora [1] in Chapter 2 as we reuse those algorithms in this dissertation. Then,
we identify the effect of speciﬁcation modeling on the complexity of synthesis in
Chapter 3. Subsequently, in Chapter 4, we Show that synthesizing a failsafe fault-
tolerant distributed program from its fault—intolerant version is NP-complete. We
also present sufﬁcient conditions for polynomial synthesis of failsafe fault-tolerant
distributed programs. In Chapter 5, we deﬁne the enhancement problem where we
enhance the level of fault—tolerance from nonmasking to masking in polynomial time.
We introduce the concept of pre—synthesized fault-tolerance components in Chapter 6,
where we present a synthesis method for automatic speciﬁcation and addition of pre-
synthesized fault-tolerance components to programs during synthesis. Afterwards,
in Chapter 7, we formally state the problem of adding multitolerance to programs,
and we Show that, in general, synthesizing multitolerant programs from their fault—
intolerant version is NP-complete even in the high atomicity model. In Chapter 8,
we present the design of our software framework for automatic synthesis Of fault-

tolerant distributed programs. In Chapter 9, we present some ongoing research work.

6

Finally, in Chapter 10, we discuss related work, contributions, and the impact of this

dissertation, and then we make concluding remarks.

k‘x

 

s.

 

 

Chapter 2

Preliminaries

111 this chapter, we present formal definitions of programs, problem speciﬁcations,
faults, fault-tolerance, and addition of fault-tolerance. Speciﬁcally, in Section 2.1, we
present the formal deﬁnition of programs, state predicates, and projection of program
transitions on a state predicate. In Section 2.2, we present the issues of modeling
distributed programs that is adapted from [1, 4]. Then, in Section 2.3, we adapt the
deﬁnition of speciﬁcations from Alpern and Schneider [16]. In Sections 2.4 and 2.5,
we adapt the deﬁnition of faults and fault-tolerance from Arora and Gouda [17] and
Kulkarni [18]. We represent the problem of adding fault-tolerance to fault-intolerant
programs in Section 2.6. We have adapted the problem statement of fault-tolerance
addition from [1]. In Section 2.7, we reiterate the results presented in [1] for the
synthesis of fault-tolerant programs in high atomicity model - where processes can
read / write all program variables in an atomic step. Finally, in Section 2.8, we recall
the results presented in [1] for the synthesis Of distributed programs — where processes

have read/ write restrictions with respect to program variables.

2.1 Program
A program p is speciﬁed by a ﬁnite set of variables, say V = {2)0, v2, .., vq}, and a ﬁnite

set of processes, say P = {P0, - -- ,P,,}, where q and n are positive integers. Each

8

 

variab
and lt"

A s
doluaii
The st;

A p
[50-91]
T]. that
Of ping]
IhlS ('lls‘s
Hence. 1
set of st

A Ste
[H Wllt

ll] 3 [.lflflg

of Ar by
A 30([1

Condlllfjng

l-lqu

2. ([0 is

that {

l SHl'lf‘i’i

l5 .. ‘
l‘l“SJ]EnI’

.

variable is associated with a ﬁnite domain of values. Let v0, v2, .., vq be variables of p,
and let D0, D2, .., D, be their respective domains.

A state of p is obtained by assigning each variable a value from its respective
domain. Thus, a state s Ofp has the form: ([0, ll, .., lq) where Vi : 0 S i s q : l, E D,.
The state space of p, Sp, is the set of all possible states Of p.

A process, say Pj, consists of a set of transitions 6,; each transition has the form
(30,31) where so, 31 6 SP. A process P] in p is associated with a set of variables, say
rj, that Pj can read and a set of variables, say 10,-, that P, can write. The transitions
of program p, (SP, is the union of the transitions of its processes. In most situations in
this dissertation, we focus on the entire state space of a program and all its transitions.
Hence, for simplicity, we rewrite program p as the tuple (Sp, 61,), where Sp is a ﬁnite
set of states and 6,, is a subset of SI) x Sp.

A state predicate X of p is any subset of Sp. We denote the cardinality of X by
[X |, where [X] represents the number of states in X. A state predicate X is closed
in a program p (respectively, 6p) iff (if and only if) the following condition holds.

Vso,s1 :: ((so,sl)E(5p) => (sOEX => 81 EX)

A transition predicate AP of p is any subset of 3,, x Sp. We denote the cardinality
of Ar by [API, where [AP] represents the number of transitions in AP.

A sequence Of states, 0 = (30,31, ..), is a computation of p iff the following two

conditions are satisﬁed (i.e., a computation is maximal):

1. If 0 is inﬁnite then Vj :j > 0 : (sH,sJ-)€6p, and

2. If at is ﬁnite and terminates in state 31 then there does not exist state 3 such

that (sl,s)€6p, and \7’j :0 <j£l:(sj_1,sj)E6p.

A sequence of states, (50,51, ..., .s,,), is a computation prefix of p iff Vj : 0 < j g n :

(SJ--1, 83-) 66,0; i.e., a computation preﬁx need not be maximal.

9

 

 

a"

 

 

(f)

1"“
*1»:

O

9“.“
""5

W9 i
‘\.L:l 1f

v59 5

2.2

In if
of fa
pflf’fp
It
1d, Hill
50 to f
a iaib‘
Obldlull

{lien pi

Notation.

Read I

3]] the ratio“.

EMQWD

group is €‘Xt‘l

rand b. witl

.lOli'. observe l

The projection Of program p on state predicate X , denoted as pIX, is the program
(Sp, {(30,31) : (so, 31) 66,, /\ 80,81EX}). In other words, pIX consists Of transitions
of p that start in X and end in X. Given two programs, p: (Sp, 6,) and p’ = (Spr, bpr),
we say p’ C_: p iff Sp, 25,, and 6p, g 6,.

Notation. When it is clear from the context, we use p and 6p interchangeably. Also,

we say that a state predicate X is true in a state 3 iff 36 X.

2.2 Issues of Distribution

In this section, we present the issues that distribution introduces during the addition
of fault-tolerance. More speciﬁcally, we identify how read/write restrictions on a
process affect its transitions.

Write restrictions. Given a transition (so, 81) of a program p, we can easily
identify the variables that need to be changed in order to modify the state of p from
so to 31. Hence, if process P,- can write only the variables in w,- and the value of
a variable :1: ¢ 10, is changed in transition (30,51) then (30,31) cannot be used in
Obtaining the transitions of P,. In other words, if P,- can write only variables in to,-

then P, cannot use the transitions in mv(uv,-), where

nw(w,-) = {(30,31) : (3x ICC¢1U, : :r(so)¢;r(sl))}

w,- is the set of variables that process P,- is allowed to write.
Notation. $(so) represents the value of a variable :1: in state so.

Read restrictions. Given a single transition (so. 31), the program p must read
all the variables in order to execute (30,31). For this reason, read restrictions require
us to group transitions and ensure that the entire group is included or the entire
group is excluded. As an example, consider a program consisting of two variables
a and b, with domains {0,1}. Suppose that we have a process that cannot read b.

NOW, Observe that the transition from the state (a = 0, b = 0) to (a = 1, b = 0) can

10

:

- :-

ff.
_.
s:
:<

“I“ .t' ~-

 

be included iff the transition from (a = 0, b = 1) to (a = 1, b = 1) is also included.
If we were to include only one Of these transitions then we would need to read both
a and b. However, when these two transitions are grouped, the value of b becomes
irrelevant, and hence, we do not need read it.

More generally, consider the case where r,- is the set of variables that P,- can
read, to,- is the set of variables that P,- can write, and to,- Q r,. (In this dissertation,
we assume that to,- Q r,; i.e., j cannot blindly write any variable. A more general
case is discussed in [1]; we omit it here as this case sufﬁces for our presentation.)
Now, process P,- can include the transition (so, 51) iff P,- also includes the transition
(32,, 3’1) where so (respectively, 31) and sf, (respectively, s’l) are identical as far as the
variables in r, are considered, and so (respectively, 86) and s1 (respectively, s’l) are
identical as far as the variables not in r, are considered. We deﬁne these transitions

as group(r,)(so, 31) for the case to,- Q r,, where

group(r,-)(so,sl) = {(86,8’1) : (Va: : :rEr, : a:(so)=:c(s[,) /\ r(sl)=;r(s’1)) /\

(Va: : mgr, : 27(36) = :r(s’,) /\ a:(so) = :r(sl)) }

2.3 Speciﬁcation

A specification is a set of inﬁnite sequences of states that is suffix-closed and fusion-
closed. Suffix-closure of the set means that if a state sequence 0' is in that set then so
are all the sufﬁxes of o. Fusion-closure of the set means that if state sequences (a, s, '7)
and (H, s, 6) are in that set then so are the state sequences (a, s, 6) and (B, s, 'y), where
a and ,8 are ﬁnite preﬁxes of state sequences, '7 and 6 are suffixes of state sequences,
and s is a program state. Intuitively, fusion closure of the speciﬁcation means that
an implementation of the speciﬁcation must execute its next transition only based on
its current state; i.e., the history Of a computation does not affect the next move of

the program.

11

 

 

 

Full
of a sat
the safe
occur i1
subset (
temple)
as a set
complex
except i;

of progr,

lII th.
“Lilith s
in the a},
SPH‘lﬁcar
tolerant f
5,-

J

Following Alpern and Schneider [16], we rewrite the speciﬁcation as a conjunction
of a safety specification and a liveness specification. For a sufﬁx-closed Speciﬁcation,
the safety speciﬁcation can be speciﬁed as a set of bad transitions [18] that must not
occur in program computations; that is, for program p, its safety speciﬁcation is a
subset of S, x Sp. To investigate the effect of the safety speciﬁcation model on the
complexity of synthesis, we Show, in Chapter 3, that if the speciﬁcation is represented
as a set of computation preﬁxes (i.e., a set of ﬁnite sequences of transitions), the
complexity of synthesis signiﬁcantly increases to a higher complexity class. Hence,
except in Chapter 3, in the rest of this dissertation, we represent safety speciﬁcation

Of programs as a set Of bad transitions.

In the synthesis algorithms presented in this dissertation, we do not require the
explicit speciﬁcation of the liveness properties. More speciﬁcally, we require that,
in the absence Of faults, the synthesized fault-tolerant program satisﬁes the liveness
speciﬁcation of the fault-intolerant program. In the presence of faults, the fault-
tolerant program must satisfy desired fault-tolerance properties deﬁned in Section

2.5.

Given a program p, a state predicate S, and a speciﬁcation spec, we say that p
satisfies spec from S iff (1) S is closed in p, and (2) every computation of p that starts
in a state of S is in spec. If p satisﬁes spec from S and S 74— {}, we say that S is an

invariant of p for spec.

For a ﬁnite sequence (of states) a, we say that (it maintains (does not violate) spec
iff there exists a sequence of states [3 such that (r3 E spec. We say that p maintains
(does not violate) spec from a state predicate X iff (1) X is closed in p, and (2) every

computation preﬁx Of p that starts in a state in X I’I'lalllt‘dll'ls spec.

12

 

 

 

2.4

We sys
A class
to (lean
transiti:
fault-5p;
(2) T i
Where 5
Which) I
XUW,
a sequent

following

1. Ho

Zlfo

llldl

2.4 Fault

W'e systematically represent the faults that perturb a program by a set of transitions.
A class of faults f for program p: (Sp, 6,) is a subset of the set 5,, x 5,. We use p[] f
to denote the transitions Obtained by taking the union of the transitions in p and the
transitions in f (i.e., 6,, U f). We say that a state predicate T is an f-span (read as
fault-span) of p from S iff the following two conditions are satisﬁed: (1) S g T, and
(2) ~ T is closed in p[] f. Observe that for all computations of p that start at states
where S is true, T is a boundary in the state space of p up to which (but not beyond
which) the state of p may be perturbed by the occurrence of the transitions in f.
Now, we deﬁne the computations Of p in the presence of faults, f. We say that
a sequence of states, a = (so, 81, ...), is a computation of p in the presence of f iff the

following three conditions are satisﬁed.

1. If 0 is inﬁnite then Vk : k > 0: (sk_1, sk)E ((5,, U f),

2. If a is ﬁnite and terminates in state s; then there does not exist state 8 such

that (3,, s) 6 6p, and

3. 3nzn20: (Vk:k>n:(sk_1,sk)€6p).

The ﬁrst requirement captures that in each step, either a program transition or
a fault transition is executed. The second requirement captures that faults do not
have to execute; i.e., if the program reaches a state where only a fault transition
can be executed then the fault transition need not be executed. It follows that
fault transitions cannot be used to deal with deadlocked states. Finally, the third
requirement captures that the number of fault occurrences in a computation is ﬁnite.

Such assumption also appears in previous work [19, 20, 17, 21].

Program and faults representation. We use Dijkstra’s guarded commands [22] to

represent the transitions of programs and faults. A guarded command (action) is of

13

 

the form grd —+ st, where grd is a state predicate and st is a function from 3,, to
Sp (i.e., an assignment) that updates program variables. Speciﬁcally, the guarded

command grd ——> st represents the following set of transitions:

{(80,81) : grd is true at so and the atomic execution Of st at so takes the program

to state 31}

2.5 Fault-Tolerance

In this section, we formally deﬁne what it means for a program to be fault-tolerant.
We deﬁne three levels of fault—tolerance; failsafe, nonmasking, and masking. In the
absence of faults, irrespective of the level of fault-tolerance, a program should satisfy
its speciﬁcation, say spec, from its invariant. The level of fault-tolerance characterizes
the extent to which the program satisﬁes spec in the presence of faults. Intuitively,
a failsafe fault-tolerant program ensures that in the presence of faults, the safety
of spec is maintained. A nonmasking fault-tolerant program ensures that in the
presence of faults, the program recovers to states from where spec is satisﬁed. A
masking fault-tolerant program ensures that in the presence of faults the safety of
spec is maintained and the program recovers to states from where spec is satisﬁed.
Thus, we formally deﬁne these three levels of fault-tolerance for a program p, its
invariant S, its speciﬁcation spec, and a class of faults f as follows:

Program p is failsafe f-tolerant for spec from S iff the following two conditions hold:
(1) p satisﬁes spec from S, and (2) there exists T such that T is an f—span of p from
S and p[] f maintains spec from T.

Program p is nonmasking f-tolerant for spec from S iff the following two conditions
hold: (1) p satisﬁes spec from S, and (2) there exists T such that T is an f-span of p
from S and every computation of p[] f that starts from a state in T has a state in S.

Program p is masking f-tolerant for spec from S iff the following two conditions

hold: (1) p satisﬁes spec from S, and (2) there exists T such that T is an f-span of p

14

 

 

from S, p[] f maintains spec from T, and every computation of p[] f that starts from
a state in T has a state in S .

Note that a speciﬁcation is a set of inﬁnite sequences of states. Hence, if p satisﬁes
spec from S then all computations of p that start in S must be inﬁnite. In the context
of nonmasking and masking fault—tolerance, every computation from the fault-span
reaches a state in its invariant. Hence, if fault-span T is used to Show that p is
nonmasking (respectively, masking) f—tolerant for spec from S then all computations
of p that start in a state in T must also be inﬁnite. Also, note that p is allowed
to contain a self-loop of the form (so, so); we use such a self-loop whenever so is an
acceptable ﬁxpoint of p.

Notation. Henceforth, whenever the program p is clear from the context, we will
omit it; thus, “S is an invariant” abbreviates “S is an invariant of p” and “f is a
fault” abbreviates “f is a fault for p”. Also, whenever the speciﬁcation spec and the
invariant S are clear from the context, we omit them; thus, “f-tolerant” abbreviates

“f-tolerant for spec from S”.

2.6 The Problem of Adding Fault-Tolerance

In this section, we reiterate the problem of adding fault-tolerance presented in [1].
The addition problem requires a fault—tolerant program p’ (with its invariant S’) to
behave similar to its fault-intolerant version, say p, in the absence of a given class of
faults f. In the presence of f, p’ must provide a desired level of fault-tolerance, say .C,
where [I could be failsafe, nonmasking, or masking. Since p’ must behave similar to

p in the absence of faults, Kulkarni and Arora [1] stipulate the following conditions:

1. S’ must be a subset of S. Otherwise, if there exists a state 8 E S’ where 3 ¢ S
then, in the absence of faults, p’ can reach 5 and create new computations that
do not belong to p. Thus, p’ will include new ways of satisfying spec from s in

the absence of faults.

 

 

The
Giver

(If,

The D
For g
and i
8' so,

fault-

2. p’lS’ must be a subset of p[S’. If p’lS’ includes a transition that does not belong

to p[S’ then p’ can include new ways for satisfying spec in the absence of faults.

Thus, the formal deﬁnition of the problem of adding fault-tolerance is as follows:

The Addition Problem

Given p, S, spec, and faults f, identify p’ and S’ such that
S’ C; S,
p’lS’ Q pIS’, and
p’ is [I f-tolerant for spec from S’, where

I. can be failsafe, nonmasking, or masking. E]

The decision problem of adding fault-tolerance to fault-intolerant programs (from

[1]) is as follows:

The Decision Problem
For a given fault-intolerant program p, its invariant S, the speciﬁcation spec,
and faults f, does there exist a fault-tolerant program p’ and the invariant
S’ such that S’ g S, p’IS’ g pIS’, and p’ is failsafe/noun)asking/masking

fault-tolerant for spec from 5’?

Remark. Given a program p’ and its invariant S’ that meet the requirements of
the decision problem, every computation of p’ [] f that starts in the fault-span reaches
a state in S’. From that state in S’, a computation of p’ is also a computation of p
(Since 5’ Q S and p’ IS’ Q p[S’). Since the fault-intolerant program p satisﬁes its
liveness speciﬁcation from S, every computation of p has a sufﬁx that is in the liveness
speciﬁcation. It follows that every computation of p’ that starts in its fault-span will
eventually reach a state from where it continuously satisﬁes its liveness specification.

For this reason, liveness speciﬁcation is not included in the above problem statement.

16

 

2.7

Tlif? l
b19111“
Syllfll‘
to V”
Chat”
[[3 for
W“
[ext-”l8 ‘-
[fallSaft
where l
to progl
2.7.l. I
tolerant
where or
Thrill
tith 8. ii

a S)‘iiilltf$

2.7 Synthesis of Fault-Tolerance in High Atomic-
ity

The properties of synthesized high atomicity fault-tolerant programs identify an upper
bound on the abilities of fault-tolerant distributed programs. As a result, in the
synthesis of fault-tolerant distributed programs, there exist situations where we need
to verify the possibility of solving a problem in the high atomicity model (e.g., see
Chapter 5). Hence, we recall synthesis algorithms presented by Kulkarni and Arora
[1] for the synthesis of fault-tolerant programs in the high atomicity model.

We represent three synthesis algorithms presented in [1] for adding three different
levels of fault-tolerance to fault-intolerant programs. These algorithms synthesize a
(failsafe/nonmasking/masking) fault-tolerant program in the high atomicity model
where there exist no read/write restrictions for the program processes with respect
to program variables. In particular, we present Add_Failsafe algorithm in Subsection
2.7.1. Then, in Subsection 2.7.2, we show how one synthesizes a nonmasking fault-
tolerant program. Finally, in Subsection 2.7.3, we describe the algorithm Add-Masking
where one adds masking fault-tolerance to fault-intolerant programs.

Throughout this section, we denote a fault—intolerant program with p, its invariant
with S, its speciﬁcation with spec, and a given class of faults with f. Also, we denote

a synthesized fault-tolerant program and its invariant with p’ and S’.

2.7 .1 Synthesizing Failsafe Fault-Tolerance
The algorithm Add_Failsafe (cf. Figure 2.1) takes p, S, spec, and faults f. It calculates
program p’ with the invariant S’ where p’ is failsafe f—tolerant for spec from S’.

To synthesize a fault-tolerant program p’ from the given fault-intolerant program
p, Add_Failsafe calculates a set of states, say ms, from where fault transitions alone
may violate safety of spec. The fault—tolerant program p’ must never reach a state

in ms, otherwise, faults may directly violate the safety of spec. Thus, p’ should not

17

 

 

 

Fi

include I
transitio

To (a
of S ~m~~
mt (Cf. F
the imari.
hmmﬁmn
Soundnes
antliesizet
problem st
failsafe faiil

addition pr:

 

Addjailsafe(p, f : transitions, 5 : state predicate, spec : speciﬁcation)

m3 := {so : 331,32, ...sn :
(Vj :0$j<n : (3,,s(,-H)) E f) A (s(,,_1),sn) violates spec};

mt := {(30,31) : ((31 Ems) V (30,31) violates spec) };
S’ := ConstructInvariant(S — ms, p—mt);
if (S’={}) declare no failsafe f-tolerant program p’;

return 0, 0;
else p’ :=Construct'1ransitions(p-mt, S’)
return p’, S’;

}

ConstructInvariant(S : state predicate, p : transitions)
// Returns the largest subset of S such that computations of p within that subset are inﬁnite
{while (330 : 306$ : (V31 :31 ES : (so,sl)¢p)) S := S - {so} }

ConstructTransitions(p : transitions, 5 : set of states)
{return p_{(3013l):30€S A 81¢ 5}}

 

 

 

Figure 2.1: Synthesizing failsafe fault-tolerance in the high atomicity model.

include the transitions that reach ms or directly violate safety of spec (i.e., set of mt
transitions).

To calculate the invariant S’, the algorithm Add_Fai|safe returns the largest subset
of S —-ms where the computations of p—mt are inﬁnite and include no transitions Of
mt (cf. Figure 2.1). The routine ConstructJnvariant calculates such a subset of S as
the invariant of p’. Since 5’ must be closed in transitions of p’, Add_Fai|safe removes
transitions that start in S’ and end outside S’ using the routine Construct.Transition.
Soundness and completeness. The algorithm Add-Failsafe is sound; i.e., the
synthesized program p’ and its invariant S’ satisfy the requirements of the addition
problem stated in Section 2.6. Also, Add-Failsafe is complete; i.e., if there exists a
failsafe fault—tolerant program p” derived from p that satisﬁes the requirements of the

addition problem then Add-Fai|safe will ﬁnd p” and its invariant S” [1].

2.7 .2 Synthesizing Nonmasking Fault-Tolerance

To add nonmasking fault-tolerance to fault-intolerant programs, Kulkarni and Arora
present algorithm Add_Nonmasking (cf. Figure 2.2). The Add_Nonmasking algorithm

takes p, S, spec, and faults f, and then, synthesizes program p’ with its invariant S’.

18

 

 

(f,

to
trans
Soun
Sji'nt lit
pf 0 ll l r
HOIIUM

Of flit" .'

 

Add_nonmasking(p, f : transitions, S : state predicate, spec : speciﬁcation)
{

S’ := S;

P, 1: (PlS) U {(50,191) 3 30955 /\ SIES}

return 12’, S’;
}

 

 

 

Figure 2.2: Synthesizing nonmasking fault-tolerance in the high atomicity model.

The invariant S’ is equal to S since Add_Nonmasking only adds recovery transitions
to S. The set of transitions of p’ is the union of transitions of p[S and recovery
transitions.

Soundness and completeness. The algorithm Add_Nonmasking is sound; i.e., the
synthesized program p’ and its invariant S’ satisfy the requirements of the addition
problem (cf. Section 2.6). Also, Add-Nonmasking is complete; i.e., if there exists a
nonmasking fault-tolerant program p” derived from p that satisﬁes the requirements

of the addition problem then Add_Nonmasking will ﬁnd p” and its invariant S” [1].

2.7 .3 Synthesizing Masking Fault-Tolerance

In the presence of faults, a masking fault-tolerant program must maintain safety
of spec and provide safe recovery to its invariant. The Add_Masking algorithm (cf.
Figure 2.3) takes p, S, spec, and faults f, and then generates masking fault-tolerant
program p’ with its invariant S’ and its f-span T’.

Since no masking fault-tolerant program is allowed to reach a state from where
fault transitions may violate safety, the invariant of the masking fault-tolerant pro-
gram must include no ms state. Moreover, the fault-span of the masking program p’
must not include any state of ms. Hence, Add_Masking sets the initial value of fault-
span T1 to true—ms (cf. Line 4 in Figure 2.3). Also, since a masking fault-tolerant
program should satisfy safety of spec from every state in its fault-span (i.e., in the
presence Of faults), the set of transitions of the masking program must not include

a transition of mt. Thus, Add_Masking calculates the initial invariant S1 (cf. Figure

19

 

 

2.3?» 14’

unll

For I

Construct l
[ [Raunh
i

[ while (I
l l
’\

Figure

In the it

titlim search

faulttolerant

the Set of tra

She. [151)
sure of S

1 and
ConstructFat/lt
if 1] ti

2.3) by removing ms states from S and mt transitions from the set of transitions of

p (cf. Line 3 in Figure 2.3).

 

 

 

Add-masking(p, f : transitions, S : state predicate, spec : speciﬁcation)
ms 2: {so : 3131,32,...3": (1)
(Vj : 0_<_j<n : (3,,s(,H)) E f) A (s(,,_1),s,,) violates spec};
mt := {(so,sl) : ((31 Ems) V (30,31) violates spec) }; (2)
S1 :2 ConstructInvariant(S — ms,p—mt); (3)
T1 := true—ms; (4)
repeat (5)
712,52 i: T1: 31; (5)
pl: p|51 U {(30,31) : so ¢Sl A 306T; A 31€T1}—mt; (7)
T1 := ConstructFaultSpan(T1 — {s : SI is not reachable from s in p1 }, f); (8)
81 := ConstructInvariant(Sl A T1,p1); (9)
iftsl=ti v TI={}) (10)
declare no masking f-tolerant program p’ exists; (11)
return 0, 0, 0; (12)
until (T1=T2 A 51:52); (13)
For each state 3 : SET} : (14)
Rank(s) = length of the shortest computation preﬁx of pl (15)
that starts from s and ends in a state in SI;
p’ := {(so,sl) : ((so,sl)Ept) A (soE51 V Rank(so)>Rank(sl)}); (16)
S’ 2: SI; (17)
T’ :2 T1 (18)
return p’, S’,T’; (19)
}
ConstructFaultSpan(T : state predicate, f : transitions)
// Returns the largest subset of T that is closed in f.
while (330,31:soeTAsl¢TA(so,sl)€f) T:=T—{so}
}

 

Figure 2.3: Synthesizing masking fault-tolerance in the high atomicity model.

In the iterative steps between Lines 5 to 13 in Figure 2.3, the Add_Masking algo-
rithm searches for a valid invariant and its corresponding fault-span for the masking
fault-tolerant program. Towards this end, in each iteration, Add-Masking identifies
the set of transitions of p1 that consists of transitions of p on the current invariant
S] (i.e., p[Sl) and every transition in the fault-span T j that does not violate the clo-
sure Of 81 and does not belong to mt (cf. Line 7 in Figure 2.3). Afterwards, using
Construct-FaultSpan routine, the Add_Masking algorithm calculates the largest subset

of T1 that is closed in p1[]f. Since the invariant of the masking program must be

20

 

a. snlset of
raakuhnt
The Adt
exist no 111”
the Add_Ma
grain synt llt
satisﬁes the
such suliiset J
invariant S’
2.3).
Soundness I
S'llll‘lf'Sthtl l”:
problem. Ale
Program p” d.

then Add_Masl

2.8 Syn

grar
In this section.

and Arora [1] ft

 

a theorem fron-

programs.

 

Kulkarni anti
2.4) for the add.
llit: Addft algm

titolerant dist rill

{f
Idll '
.lts f. After“?!

 

a subset of its fault-Span, Add_Masking recalculates the invariant S1 considering the
recalculated fault-span T1 (cf. Line 9 in Figure 2.3).

The Add_Masking algorithm continues the above iterative procedure until there
exist no more changes in S1 and T1, or S1 becomes empty. When 51 becomes empty,
the Add-Masking algorithm declares that there exists no masking fault-tolerant pro-
gram synthesized from p. Otherwise, there must exist a non-empty subset of S that
satisﬁes the requirements of the addition problem (cf. Section 2.6). If there exists
such subset S’ of S then Add_Masking will guarantee safe recovery from states outside
invariant S’ to S’, and there will be no cycles in T’ —S’ (cf. Lines 14-16 in Figure
2.3).

Soundness and completeness. The algorithm Add-Masking is sound; i.e., the
synthesized program p’ and its invariant S’ satisfy the requirements of the addition
problem. Also, Add-Masking is complete; i.e., if there exists a masking fault-tolerant
program p” derived from p that satisﬁes the requirements of the addition problem

then Add_Masking will ﬁnd p” and its invariant S” [1].

2.8 Synthesis of Fault-Tolerant Distributed Pro-

grams

In this section, we represent the non-deterministic algorithm presented by Kulkarni
and Arora [1] for the synthesis of distributed fault-tolerant programs. We also recall
a theorem from [1] about the complexity of synthesizing fault-tolerant distributed
programs.

Kulkarni and Arora [1] present the non-deterministic algorithm Add-ft (cf. Figure
2.4) for the addition of fault-tolerance to distributed programs in polynomial time.
The Add_ft algorithm takes the transition groups go, - - - , gm“ (that represent a fault-
intolerant distributed program p), its invariant S, its speciﬁcation spec, and a class of

faults f. Afterwards, Add_ft calculates the set of ms states from where safety can be

21

 

t'it‘ilﬂWd ”-
transit it)!“
nt‘In-deifl“

ﬁiult-Sl”l’L

  

Add—ft in f

 

"15:: l
[ mt I: l’

Guess 5’
l'erifi' [h
(PM
[ {FE}
(F53:

[ (Hi
. ['Fjl .
[} (F011
-’\§\

figure 2.4: A
grams.

The algorit
satisﬁes the th
the required In
Flfti. The ﬁr
F2. checks that
not violated fro
does not deadln
luxuriant. i.e.. b
5’ is a subset of

l’ ~ ’
5 forever.

 

For synthtsiz;

dea failsafe prw

 

ﬂu: '1 : ‘
thrashing faul‘

violated by the execution of fault transitions alone. Also, Add_ft computes the set of
transitions mt that violate safety or reach a state in ms. Then, the Add-ft algorithm
non- -deterministicallyg guesses the fault- tolerant program p’ its invariant, S’ and its

fault-span, T’.

 

 

Add_ft(-p, f : set of transitions, 5' : state predicate, spec : speciﬁcation,
go, 91, ..., gm” : groups of transitions)
{
:s={ s.o 3131,32,...sn: (Vj : OSj<n : (sj,s(]~+1)) E f) /\ (s(,,_1),sn) violates spec};
:2 {(s:o,sl) ((316ms) V (50,31) violates spec) };
Guess S’,T’, and p’ :2 U (9,- : g, is chosen to be included in the fault-tolerant program);
Verify the following
(F1) p’IS’QPIS’;
(F2) 5’ g T’; T’ is closed in p’[]f; // T’ is a fault-span of p’.
(F3) T’ 0 ms 2 {}; (p’lT’) 0 mt = {}; // Safety cannot be violated from states in T’.
(F4) (V30 : 306 T’: (331 :: (so,sl)Ep’)); // T’ does not have deadlocks.
(F5) 3’94}; 8’ Q S; S’ is closed in p’; // S" is an invariant of 11’.
(F6) p’|(T’—S") is acyclic; // p’ cannot stay in (T’ — S’) forever.
l

 

Figure 2.4: A non-deterministic algorithm for adding fault—tolerance to distributed pro-
grams.

The algorithm Addﬁ veriﬁes that the synthesized (guessed) fault-tolerant program
satisﬁes the three conditions of the addition problem (cf. Section 2.6) depending on
the required level of fault-tolerance. This goal is achieved by verifying the six formulae
F l-F 6. The ﬁrst formula F 1 veriﬁes that p’IS’ Q PIS" is true. The second formula,
F2, checks that T’ is a valid fault-span. The third formula, F3, ensures that safety is
not violated from any state in T’. The fourth formula, F4, veriﬁes that the program
does not deadlock in a state in T’. The ﬁfth formula, F5, checks that S" is a valid
invariant, i.e., S’ is nonempty and S’ is closed in p’. The formula F5 also veriﬁes if
S’ is a subset of S. Finally, the formula F6 veriﬁes that the program cannot stay in
T’ — S’ forever.

For synthesizing failsafe fault-tolerant programs, we do not need verify F4 and F6
as a failsafe program need not provide recovery to 8’. Likewise, in the synthesis of a

nonmasking fault-tolerant program, there exists no need to verify F3 as a nonmasking

22

 

program is allowed to temporarily violate safety of spec in the presence of faults.

Since the algorithm Add-ft is non-deterministic, there exists no speciﬁc order in
the veriﬁcation of F 1-F 6. However, a deterministic implementation of Add_ft im-
poses a speciﬁc order for verifying (a subset of) F1—F6 in order to satisfy one of
the requirements of the addition problem. We call such a deterministic strategy a
heuristic.

Regarding the complexity of Add_ft, Kulkarni and Arora [1] show that each one of
the conditions F1-F6 can be veriﬁed in polynomial time in the state space of p. As
a result, Add_ft is in NP. We reiterate this result in the following theorem.
Theorem 2.1 The problem of synthesizing failsafe/nonmasking/masking fault-

tolerant distributed programs is in NP. C]

23

Chapter 3

The Effect of Safety Speciﬁcation
Model on the Complexity of

Synthesis

In this chapter, we focus on the effect of safety speciﬁcation model on the complex-
ity of adding masking fault—tolerance to high atomicity programs. We consider two
approaches for modeling safety speciﬁcations. The ﬁrst approach is based on the mod-
eling used in [1], where the safety speciﬁcation is speciﬁed in terms of a set of bad
transitions that must not occur in program computations. In other words, intuitively,
a program computation violates safety speciﬁcation if there exists a bad transition
in that computation. We denote this model as the bad transit-ion (BT) model (cf.

Section 2.3 for precise deﬁnition).

The second approach is a restricted model of the safety speciﬁcation speciﬁed by
Alpern and Schneider [23]. In [23], the safety speciﬁcation is speciﬁed as a set of
computation preﬁxes, where a computation preﬁx is a ﬁnite sequence of transitions.
A computation violates the safety speciﬁcation if one of its preﬁxes is ruled out by

the safety speciﬁcation. This model is more general than that in [1]; given the safety

24

 

Shetllll “1 l'
cornptttati

in progft‘ill

As a sp
(hire a 111ml
most two tr.-
and only if i
this tnodel a
model is a ge

by Alpern an

We Show
intolerant vets
fault~tolerant l
synthesis in th
result has bet-
the synthesis 1
in this Chapter
Where safety is
aprttgratn mitt;
faulttolerant p'
lllt'ations from t

*alf‘tl' Spf‘t'lfltfa’ -

 

The organizal
mashing fa ttl

lﬂ Se(

Jllull 3'2

l‘lfil

“If

l

 

speciﬁcation speciﬁed in terms of ‘bad transitions’ that should not occur in program
computations, we can obtain the corresponding set of preﬁxes that should not occur

in program computations.

As a special case of the model presented by Alpern and Schneider [23], we intro-
duce a model where safety speciﬁcation is speciﬁed in terms of a set of sequences of at
most two transitions. In this model, a computation violates the safety speciﬁcation if
and only if it contains any sequence ruled out by the safety speciﬁcation. We denote
this model as the bad pair (BP) model. It is straightforward to observe that the BP
model is a generalization of the BT model and a specialization of the model presented

by Alpern and Schneider.

We show that synthesizing a masking fault-tolerant program from its fault-
intolerant version in the BP model is signiﬁcantly more complex than synthesizing a
fault-tolerant program in the BT model. Speciﬁcally, for high atomicity programs, the
synthesis in the BT speciﬁcation model can be performed in polynomial time. (This
result has been previously shown in [1].) However, for the same program model,
the synthesis in the BP speciﬁcation model is NP-complete. (This result is shown
in this chapter.) It follows that the problem of adding fault-tolerance for the case
where safety is represented as a set of computation preﬁxes that should not occur in
a program computation is N P-hard. With this result, we argue that the synthesis of
fault—tolerant programs will be more successful if we focus on more restrictive spec-
iﬁcations from the BT model. Hence, in the rest of this dissertation, we represent

safety speciﬁcation in the BT model.

The organization of this chapter is as follows: In Section 3.1, we show that adding
masking fault-tolerance to high atomicity programs is N P-completc for the BP model.

In Section 3.2, we present a summary of this chapter.

25

 

 

3.1 T

In tllis 5”
tolerant I)
safeii' "l“
we DrewIlt
0f the lid“.
WP Show [1

Problem i5 '

3.1.1 1“
Tt
Tl]? problf'll

Given i8 ‘d 9*
and ‘1'

variable

Dries there (‘-

able 1’

Next, WP l
tolerant-e. hast”-

of synthesizing

invariant, S. its

The state spat
invariant. S. U

s. '
pending to he,
Wei .
ntlurle arlrlit
ear-
.1 proposition

 

3.1 NP-Completeness Proof

In this section, we Show that, in general, the problem of synthesizing masking fault—
tolerant programs from their fault-intolerant version becomes NP-complete if the
safety speciﬁcation is speciﬁed in the BP model. Towards this end, in Section 3.1.1,
we present a mapping between a given instance of the 3-SAT problem and an instance
of the (decision) problem of adding masking fault-tolerance. Then, in Section 3.1.2,
we show that the given 3-SAT instance is satisﬁable iff the answer to the decision

problem is afﬁrmative.

3.1.1 Mapping 3-SAT to the Addition of Masking Fault-

Tolerance
The problem statement for the 3-SAT problem is as follows:
Given is a set of propositional variables, $1,232, ..., :r.n and ﬁnal, 12:2, ..., or”, where x,-
and “ICE,- are complements of each other, and a Boolean formula y = 311 Ayz /\ /\
yM, where each y]- (1 S j _<_ M) is a disjunction of exactly three propositional

variables.

Does there exist an assignment of truth values to $1,172, ..., 33,, such that y is satisﬁ-

able? C]

Next, we identify each entity of the instance of the problem of adding fault-
tolerance, based on the given 3-SAT formula. The instance of the decision problem
of synthesizing masking fault-tolerance consists of the fault-intolerant program, p, its

invariant, 3, its speciﬁcation spec, and a class of faults f.

The state space and the invariant of the fault-intolerant program, p. The
invariant, S, of the fault-intolerant program, p, includes only one state, say 3. Corre-
sponding to the propositional variables and disjunctions of the given 3-SAT instance,
we include additional states outside the invariant (cf. Figure 3.1). Speciﬁcally, for

each propositional variable :16,- and its complement -cr,-, we introduce three states

26

 

st‘ﬂlll
tions.
mt‘t'l l 1.1
with l)
a Shorl
MOI
Wf‘ llllff
Front (in
to ear-l1 <
o lf 1

olf‘

Long

l Short "
l Medium \

figure 3.1:
the &SAT ft
Also, note tl

transitions th

Fault transi

(1,511,) I l S,

The safety 3)

ai,b,-, and c,- (1 S i g 71). Also, for simplicity, we introduce a propositional vari-
able In+1 which is always true, and corresponding to l‘n+1, we introduce two states
an+1 and bn+1. For each disjunction yj, we introduce a state (1]- outside the invariant
(1 s j s M )-
The transitions of the fault-intolerant program. For the convenience of repre—
senting safety speciﬁcation, we classify transitions as short, long, and medium transi-
tions. The only transition inside the invariant of the fault-intolerant program is the
medium transition (3, 3). Also, we introduce short transitions (a,,b,-) and (b,,c,-) for
each propositional variable :15,- and its complement ‘11‘1- (1 S 2' _<_ n). We also introduce
a short transition (an+1,bn+1) for an“.

Moreover, corresponding to each propositional variable :13,- and its complement in,
we introduce long transitions (ohm-+1), (bum-+1), (c,, a,-+1), and (c,,b,-+1) (1 g 2' S n).
From bu“, we introduce a long transition (bn+1, s) to the invariant. Corresponding
to each disjunction y], we have the following long transitions:

0 If x,- is a propositional variable in y, then we include the long transition ((1,, a,).

o If or, is a propositional variable in y,- then we include the long transition (dj, bi).

 

 

 

 

 

d.
J
r ‘ lr
Legend : ........................................................................................
Long ......... > v '
fault ' ------ 3' ------ >. o ...... )o ...... >.
Short """" > a b c
a- b- 9 HI i+l 1+1
Medium . I I l
P 4 s ‘ ‘

 

Figure 3.1: The states and the transitions corresponding to the propositional variables in
the 3—SAT formula. (Except for transitions marked as fault all are program transitions.
Also, note that the program has no long transitions that originate from a,- and no short
transitions that originate from (3.)

Fault transitions. The class of faults f is equal to the set of medium transitions
{(s,dj):1_<_ j 3 WI}.

The safety speciﬁcation of the fault-intolerant program, p. Safety will be

27

 

 

 

 

i'jol'dlf‘
tiW‘lV.

(Cl. F
trausitl
ifit‘atiot

t l1 5 J

3.1.2
In this se
satisﬁaltlt
in Set-tint.
Lemma 5
fault-tolert
3.1.1.
Proof. Sir
values to tl
is true. No
adding fault
The trim
transitions (;

SllOWH [l [(3 l)

violated if a short (respectively, long) transition is followed by another short (respec-
tively, long) transition. Note that (s, s) and fault transitions are medium transitions
(cf. Figure 3.1). Hence, they can be followed by (respectively, preceded by) any
transition. Also, all transitions except those identiﬁed above violate the safety spec-
iﬁcation. This is to ensure that transitions such as ((1,, s), (a,,s), (b,, s), and (Ci, 3)

((1 _<_j _<_ M) /\ (1 S 2' g 17.)) cannot be used for recovery.

3.1.2 Reduction from 3—SAT
In this section, we show (with Lemmas 3.1 and 3.2) that the given instance of 3-SAT is
satisﬁable iff masking fault-tolerance can be added to the problem instance identiﬁed

in Section 3.1.1.

Lemma 3.1 If the given 3—SAT formula is satisﬁable then there exists a masking
fault-tolerant program for the instance of the decision problem identiﬁed in Section
3.1.1.
Proof. Since the 3-SAT formula is satisﬁable, there exists an assignment of truth
values to the propositional variables 23,-, 1 S 2' g n, such that each y,, 1 S j _<_ M,
is true. Now, we identify a masking fault-tolerant program, p’, that is obtained by
adding fault-tolerance to the fault-intolerant program p identiﬁed in Section 3.1.1.
The invariant of p’ is the same as the invariant of p (i.e., {3}). We derive the
transitions of the fault-tolerant program p’ as follows. (As an illustration, we have

shown the partial structure of p’ where 271 = true, 1:2 = false, and x3 =2 true in

Figure 3.2.)

c For each propositional variable 37,-, 1 _<_ 2' g n, if 1?,- is true then we include the
short transition (a,, b,). In this case, we also include the long transition (b,,a,~11)

if :r,+1 is true, or (1),, 1),“) if 1:,“ is false.

0 For each propositional variable 35,-, 1 S i S n, if 27,- is false then we include the

short transition (b,, (1,). In this case, we also include the long transition (c,,a,-+1)

28

 

.\‘.r

if 17,-+1 is true, or ((1,, bi“) if 13,-+1 is false.
0 We include the transitions (a,,+1, bn+1) and (bn+1,s) corresponding to :r,,+1.

o For each disjunction 3;, that includes xi, we include the transition (d,, a,) iff .r,

is true.

0 For each disjunction yj that includes -:1:,-, we include the transition (dj, 1),) iff

:13,- is false.

 

d_ ..........................................................................................................................
j . .......................................................................................
ll
s . W . n
THU" .- - ‘3'- - -, o . . ------ >0 ------- ,
a b c a b c
a, b, C, 2 2 2 3 3 3
D

Figure 3.2: The partial structure of the fault-tolerant program

Now, we show that p’ is masking fault-tolerant in the presence of faults f.

o p’ in the absence of faults. p’ IS = pIS. Thus, p’ satisﬁes spec in the absence

of faults.

«- p’ is masking f-tolerant for spec from S. To show this result, we let T’ be

the set of states reached in the computations of p’ [] f starting from s.

-— p’ satisﬁes its safety speciﬁcation from T’. Since the instance of the
3-SAT formula is satisﬁable, each propositional variable :19,- is assigned a
unique truth value. Thus, for each pair of transitions (a,, b.) and (1),, Ci),
one of them is excluded in the set of transitions of p’. Hence, a computation
of p’ cannot include two consecutive short transitions. Also, the only way

to execute two consecutive long transitions in the original fault-intolerant

29

program is to execute a long transition that terminates in state b,, 1 g 2' g
n, and then execute a long transition that originates in 1),. If the former
transition is included then 2:.- is assigned the truth value false. However, in
this case, no outgoing long transition from b, is included. Thus, p’ cannot

execute two consecutive long transitions.

— Starting from every state in T’, a computation of p’ reaches 5. By
construction, p’ contains no cycles outside the invariant. Hence, it sufﬁces
to show that p’ does not deadlock in T’ — S’. Now, let y,- 2 £13,: V ark V 33,.
be a disjunction in the 3-SAT formula. Since y,- evaluates to true, p’
includes a transition from {((1}, a,), (d), bk), (dj, ar)}. Also, by considering
the truth values of :13,- and my“, 1 g i g n, we observe that for every state
in {a,-, b,, 6,} in T’ there is a path that reaches a state in {a,+1, bi+1,c,-+1}.
Finally, from an+1 (respectively, bu“) there is an outgoing transition to

bn+1 (respectively, 3). It follows that p’ does not deadlock in T’ — S. E]

Lemma 3.2 If there exists a masking fault-tolerant program for the instance of the
decision problem identiﬁed earlier then the given 3-SAT formula is satisﬁable.
Proof. Before we use the masking fault-tolerant program p’ to identify the
tr L1 th value assignment to the propositional variables in the 3-SAT formula, we make
some observations about p’. Let S’ be the invariant of p’ and let T’ be the fault-span
used to show the masking fault-tolerance property of p’. Since 5’ 74 {} and S’ Q S,
tile Conditions 5’ = S and plS’ == p’IS’ hold.

Since faults may directly perturb p’ to dj (1 g j g M), the condition dj 6 T’
holds _ Thus, p’ must provide safe recovery from each dj. As a result, for each d],
tllere exists 1 S i S 72. such that either (dj,a,-) or ((dj,b,) and (b,,c,-)) is included in
p’ f T, ; i.e., either a,- or c, must be reachable. Hence, we have

C) .
b8Brvation 3.3. There exists 1 g 2' g n such that either a, E T’ or c,- E T’. El

30

Now, consider the case where a,- E T’ and C,- E T’. In this case, (a,,b,-) must be
included as all transitions terminating in a, are long transitions. Further, if c,- E T’
then (1),, C.) must be included since it is the only transition that reaches Q. In this
case, p’[] f can violate safety by executing (a,, b.) and (6,, 0,). Hence, we have
Observation 3.4. If a,- E T’ then c,- ¢ T’. [:1

Moreover, if a,- E T’ then (a,,b,-) E p’IT’ since all transitions terminating in a,-
are long transitions. Hence, b,- E T’. Now, to guarantee safe recovery from b,, p’

must include either (b,,a,-+1) or ((b,,b,-+1) and (b,+1,c,-+1)). Thus, either a,“ E T’ or

 

Ci+l E T’. Also, if c,- E T’ then either (chm-+1) or ((c,,b,-+1) and (b,+1,c,-+1)) must be

included. Thus, we have

 

Observation 3.5. If (a,- E T’) V (c,- E T’) holds then we have (Vl : z' < l S n : ((a1 6

T’) V (c; E T’))). C)

Now, let sm be the smallest value for which ((asm E T’) V(csm E T’)) holds. Based

on the Observation 3.5, we have (Vl : sm < l S n : (a; E T’) V (c; E T’)). Hence, we

Inake value assignment to the literals of the 3-SAT formula as follows:

0 For t < sm, we assign true to 113;.
O For sm S t, if at E T’ then at, = true. And, if c, E T’ then :1.“ = false.

Based on the observations 3.3-3.5, it is straightforward to observe that a unique

val me is assigned to each x,- (1 S 2' S n). To complete the proof, we need to show
that, with this truth-value assignment, the 3-SAT formula is satisﬁable. We show
this for a disjunction y, (1 S j S M). Wlog, let y]- = 51:,- V 1'2. V 33,. Since state (1,- can
be 1" eached by the occurrence of a fault from s, p’ must provide safe recovery from dj.
Since the only safe transitions from dj are those corresponding to states a,, bk and
0’" ’ Z)’ must include at least one of the transitions (dj, ai), ((1,,bk), or (dj, ar). Now, if
(dd 1» (1i) 6 p’ then a, E T’, and hence, 1:,- is assigned true. Further, if (dj, bk) 6 p’ then
n9 10mg transition from bk can be included as it would allow p’ to execute two long

traxl -. 1 , - raw T1 - T’ ih
Sltlons successive y. Hence, p must me u e ()k, ck). 111s, ck E , an( ence, wk

31

 

 

 

 

is assigllt“

 

y) evaluatt
Theorent
of adding

Proof. T
front LPN“
Given an

prograin [1
5’ Q S. (‘3
notviolatt
Since earh
memami
Coronary
that shoult
masking f ‘

‘(l

3.2 S

In this eha
tion on th‘
ll] that. if
BTtnodel
to high at
in an atom
laultintoh
sentetl live
transitions

5 ‘\P‘(‘Utn

is assigned false. It follows that irrespective of which transition is included from d »,
y] evaluates to true. Therefore, the 3-SAT formula is satisﬁable. E]
Theorem 3.6 If the safety speciﬁcation is speciﬁed in the BP model then the problem
of adding masking fault-tolerance to high atomicity programs is N P-complete.

Proof. The NP-hardness of adding masking fault-tolerance in the BP model follows
from Lemmas 3.1 and 3.2. To show that this problem is in NP, we proceed as follows:
Given an input for the problem of adding fault—tolerance, we guess fault-tolerant
program p’, its invariant S’ and its fault-span T’. Now, we need to verify that (1)
S’ Q S, (2) S’ is closed in p’, (3) p’IS’ E MS", (4) T’ is closed in p’[]f, (5) p’[]f does
not violate safety in T’, (6) p’ does not deadlock in T’ — S’, (7) p’|(T’ — S’) is acyclic.
Since each of these conditions can be veriﬁed in polynomial time in the state space,
the theorem follows. E]
Corollary 3.7 If the safety speciﬁcation is speciﬁed by a set of computational preﬁxes
that should not occur in program computations (as in [23]) then the problem of adding

masking fault-tolerance is NP-hard in the program state space. CI

3.2 Summary

In this chapter, we investigated the effect of the representation of the safety speciﬁca-
tion on the complexity of adding masking fault-tolerance. It is shown in the literature
[1] that if one represents the safety speciﬁcation as a set of bad transitions (denoted
BT model) that must not occur in program computations then adding fault-tolerance
to high atomicity programs — where processes can read/ write all program variables
in an atomic step — can be done in polynomial time in the state space of the input
fault-intolerant program. However, in this chapter, we showed that if safety is repre-
sented by a set of sequences of transitions, where each sequence contains at most two
transitions (denoted bad pair (BP) model), then adding fault-tolerance to programs

is N P—complete. With this result, we argue that adding fault—tolerance to existing

32

 

 

 

 

 

 

 

 

 

progt

A
to (a
exan‘
it re:
restr
the l

corn]

ptoh

(hsst

programs can be done more efﬁciently if we focus on the BT model.

Although the BT model is a restricted version of the BP model, it is general enough
to capture other representations for modeling safety considered in the literature. For
example, in the bad state (BS) model (e.g., [2, 4]), a computation violates safety if
it reaches a state that is ruled out by the safety speciﬁcation. The BS model is a
restrictive version of the BT model. Hence, the algorithms in [1] can be extended to
the BS model. Thus, the complexity for the BS model is (approximately) in the same
complexity class as that of the BT model.

Also, we observe that the expressiveness of the BT model has the potential to
capture the safety speciﬁcation of practical problems. As an illustration, we model
the safety speciﬁcation of several examples including a simpliﬁed version of an aircraft
altitude switch (cf. Section 8.5) throughout this dissertation. As a result, we argue
that although the results of this chapter limit the applicability of eﬁ'icient addition of
fault-tolerance to the BT model, this model can capture a broad range of interesting
problems in the synthesis of fault-tolerant programs. Therefore, in the rest of this

dissertation, we represent safety speciﬁcation of programs in the BT model.

33

 

 

Chapter 4

Synthesizing Failsafe
Fault-Tolerant Distributed

Programs

In this chapter, we focus on the synthesis of failsafe fault-tolerant distributed pro-
grams from their fault-intolerant versions. First, we show that synthesizing a failsafe
fault-tolerant distributed program from its fault-intolerant version (i.e., adding failsafe
fault-tolerance to distributed fault-intolerant programs) is NP-complete. To achieve
this goal, we reduce the 3-SAT problem to the decision problem of synthesizing a
failsafe fault-tolerant program. Second, we identify the restrictions that can be im—
posed on speciﬁcations and fault-intolerant programs in order to ensure that failsafe
fault-tolerance can be synthesized in polynomial time. Towards this end, we iden-
tify a class of speciﬁcations, namely monotonic speciﬁcations, and a class of programs,
namely monotonic programs. We show that failsafe fault-tolerance can be synthesized
in polynomial time if monotonicity restrictions 011 the program and the speciﬁcation

are met.

As another important contribution of this chapter, we evaluate the role of restric—

34

 

 

tions imposed on speciﬁcation and fault-intolerant progratn. In this context, we show
that if monotonicity restrictions are imposed only on the speciﬁcation (respectively,
the fault-intolerant program) then the problem of adding failsafe fault-tolerance will
remain NP-complete. Finally, we show that the class of monotonic speciﬁcations con-
tains well-recognized [24, 25, 26, 27, 28] problems of distributed consensus, atomic

commitment and Byzantine agreement.

We proceed as follows: In Section 4.1, we state the problem of adding failsafe fault-
tolerance to fault-intolerant programs. In Section 4.2, we show the NP-completeness
of the problem of adding failsafe fault-tolerant distributed programs. In Section
4.3, we precisely deﬁne the notion of monotonic specifications and monotonic pro-
grams, and identify their role in reducing the complexity of synthesizing failsafe
fault-tolerance. Finally, we give examples of monotonic speciﬁcations and monotonic

programs in Section 4.4, and summarize this chapter in Section 4.5.

4. 1 Problem Statement

In this subsection, we formally state the problem of synthesizing failsafe fault-
tolerance. Our goal is to only add failsafe fault-tolerance to generate a program
that reuses a given fault-intolerant program. In other words, we require that any new
computations that are added in the fault—tolerant program are solely for the purpose

of dealing with faults; no new computations are introduced when faults do not occur.

Now, consider the case where we begin with the fault-intolerant program p, its
invariant S, its speciﬁcation, spec, and faults f. Let p’ be the fault-tolerant program
derived from p, and let S’ be an invariant of p’. Since S is an invariant of p, all the
computations of p that start from a state in S satisfy the speciﬁcation, spec. Since
we have no knowledge about the computations of p that start outside S and we are
interested in deriving p’ such that the correctness of p’ in the absence of faults is

derived from the correctness of p, we must ensure that p’ begins in a state in S; i.e.,

35

 

 

the invari

 

lni

figure 4.1
tolerant pr

Lihevvi
the erunpt
about (t ,1,
ll()l llll l‘t _lt'

0f Sj’lllllt‘s

The Pro

Given 1,

Identifi- l
3’ E .

p’fS'

the invariant of p’, say S’, must be a subset of S (cf. Figure 4.1).

Invariant of fault-intolerant program Invariant of fault-tolerant program

 

 

 

 

 

No new transitions here New transitions added here

Figure 4.1: The relation between the invariant of a fault-intolerant program p and a fault-
tolerant program p’.

Likewise, to show that p’ is correct in the absence of faults, we need to show that
the computations of p’ that start in states in S’ are in spec. We only have knowledge
about computations of p that start in a state in S (cf. Figure 4.1). Hence, we must
not introduce new transitions in the absence of faults. Thus, we deﬁne the problem

of synthesizing failsafe fault-tolerance as follows:

The Problem of Synthesizing Failsafe Fault-Tolerance
Given p, S, spec and f such that p satisﬁes spec from S
Identify p’ and S’ such that

S’ Q S,

p’IS’ (_Z p[S’, and

p’ is failsafe fault-tolerant to spec from S’. D

This problem statement is taken from [1]. In [1], a generalized definition that
applies to other types of fault-tolerance is presented. However, we use this restrictive
deﬁnition as it sufﬁces in this chapter. Also, to show that the problem of synthesizing
failsafe fault-tolerance is NP-complete, we state the corresponding decision problem:
for a given fault-intolerant program p. its invariant S . the speciﬁcation spec, and
faults f, does there ezrist a failsafe fault-tolerant program p’ and the invariant S’ that

satisfy the three conditions of the syntheszs problem?

36

 

 

 

J

Nomi“
{311115 J
gin’Il H

4.2

In thlS 5
tributcd
end- we
tolerance
into an l1.
that [lie :
this-sized {I
we state tl
3~SAT pr
Given is a
and '
/‘=, ('

Does then:

4.2.1 It}

lei
In this Siibsr-i
problem. Th!
its. speriﬁratir

tioiial i'ariablr‘

Notation. Given a fault-intolerant program p, speciﬁcation spec, invariant S and
faults f, we say that program p’ and predicate S’ solve the synthesis problem for a
given input iff p’ and S’ satisfy the three conditions of the synthesis problem. We say
p’ (respectively, S’) solves the synthesis problem iff there exists S’ (respectively, p’)

such that p’, S’ solve the synthesis problem.

4.2 NP-Completeness Proof

In this section, we prove that the problem of synthesizing failsafe fault-tolerant dis-
tributed programs from their fault-intolerant version is NP-complete. Towards this
end, we reduce the 3—SAT problem to the problem of synthesizing failsafe fault-
tolerance. In Subsection 4.2.1, we present the mapping of the given 3-SAT formula
into an instance of the synthesis problem. Afterwards, in Subsection 4.2.2, we show
that the 3-SAT formula is satisﬁable iff a failsafe fault-tolerant program can be syn-
thesized from this instance of the synthesis problem. Before presenting the mapping,
we state the 3-SAT problem:

3—SAT problem.

Given is a set of propositional variables, b1, b2, ..., b" and ﬁb1,-ib2,...,—ib,,, where b,-
and pi), are complements of each other, and a Boolean formula c 2 c1 /\ c2 /\

/\ CM, where each cj is a disjunction of exactly three propositional variables.

Does there exist an assignment of truth values to b1, b2, ..., bn such that c is satisﬁable?

4.2.1 Mapping 3-SAT to an Instance of the Synthesis Prob-

lem

In this subsection, we map the given 3—SAT formula into an instance of the synthesis
problem. The instance of the synthesis problem includes the fault-intolerant program,
its speciﬁcation, its invariant, and a class of faults. Corresponding to each proposi—

tional variable and each disjunction in the 3-SAT formula, we specify the states and

37

 

_J

 

{he 5"
[rallslI
”mu-la
VarjalJl
The 51
sjtjonéil
Figure-
Fort
stat?S ll
The tr;
pmgralll.
we intrm

larl‘II’l.

Figure 4.2‘
formula.

Also. we

rsponding I

(cf. Figure 4

Fault trans

the

. faultintoi

the set of transitions of the fault-intolerant program. Then, we identify the fault
transitions of this instance. Subsequently, we identify the safety speciﬁcation and the
invariant of the fault-intolerant program and determine the value of each program
variable in every state.
The states of the fault-intolerant program. Corresponding to each propo—
sitional variable b,- and its complement -b,~, we introduce the following states (see
Figure 4.2): x,,x§,ai, y,,y,’-, 2,, and 2;.

For each disjunction, cj = bm V-wbk Vb; (cf. Figure 4.3), we introduce the following
states (k 7E m): cgm, dgm, cjk, djk, c3.“ and d31-
The transitions of the fault-intolerant program. In the fault-intolerant
program, corresponding to each propositional variable b,- and its complement -»b,-,
we introduce the following transitions (cf. Figure 4.2): (a,_1,:r,-),(:r,,a,-),(y,’-,z:),

(air-13x2), ($2, at), and (yia Zi)’

y ...bilq....>z y. Wbﬁd...) Z_ yn ....bﬁcl...>zn
1 1 l 1
X1 X I X n
\ a ......... a \. a ......... a \

a0 3‘ 1 1-1 3* i n-I 3 a,"
L a

5 , x . E

"1 ' x “

y’ ---liai‘l_-,_z’ y’ ---.lZ‘Ed.-_,. z’ yin __.tfa_d___>zr’l

Figure 4.2: The transitions corresponding to the propositional variables in the 3-SAT
formula.

Also, we introduce a transition from an to ac in the fault-intolerant program. Cor-
responding to each Cj = bm V ﬁbk V b,, we introduce the following program transitions
(cf. Figure 4.3): (cg-dem), (Cy-ham), and (c;,,d;,).

Fault transitions. We introduce the following fault transitions: From state :r,,

the fault-intolerant program can reach y,- by the execution of faults. From state .73:

38

 

 

 

<i-l

Figure 4
and a (hr
the fault
I), and it:
it. 99- I
traiisitiiﬁiu
intrurlure

transition

C, are as f(
can perturl;
4..)
The invaria

intolerant pm

cj = meb’kaI

 

 

 

<i,-I,i-l,2> y. ...................... > z. <i,0,i.2>
l l c’ <m,-l,m-l,j+m+l>
i .” jm
I f .’
s’ g
' f .I" bm 2
<i,l.i-l.0> x, ‘y’ v
\ .I
‘. ,-’ d’ <m,0.m,j+m+l>
\\ " jm
<i-l,0,i-l,0> 21'] a <i,0,i,0> !
" .4 !
. " c. <k 1 k-l j+k+l>
. - .- x k 9 9 !
<1, 1,1 l,0> i J
! l
s .
l dk <k,0,k.j+k+l >
. bad 3
r. ........................ yl ---------- > z] '
3 Legend
3 <i,l,i-1,1> <i,0,i,l> '
. (6, f, g,h> i
3 ------- pl : c' .
2 ............. p2 ; jl <L-1.I-1.i+l+1>
p3 b'1; bd
- : a
P4 2 v
u-o-u—o- Fault d,

. . j <1, 0,1.j+l+1>

Figure 4.3: The structure of the fault-intolerant program for a propositional variable b,-
and a disjunction cj = bm V ﬁbk V 1),.

the faults can perturb the program to state yi. Thus, for each propositional variable
b,- and its complement -b,-, we introduce the following fault transitions: (33,331,), and
(:rg, yg). In addition, for each disjunction cj = (bm V ﬁbk V b1), we introduce a fault
transition that perturbs the program from state a,, 0 S i < n, to cgm. We also
introduce the fault transition that perturbs the program from d3", to cjk, and the
transition that perturbs the program from djk to 631. Thus, the fault transitions for
cj are as follows: (a,,c3m), (d3WCJ-k), and (djk,c3-,). (Note that the fault transition
can perturb the program from state a,- only to the ﬁrst state introduced for C]; i.e.,

cg...)
The invariant of the fault-intolerant program. The invariant of the fault-

intolerant program consists of the following set of states: {:rl, - - - ,xn}U{:r’l, - - - ,arﬁ,}U

39

 

{0,
Sa;
int
Spf‘t
iuay
the]
Fhre
addec
corres
progra
hark“
theh n
to grin 11'
l 9 ant
l ~1. 0. I

lhhnia

State/V;

{am . . . ,an_1}.

Safety speciﬁcation of the fault-intolerant program. For each propositional
variable b,- and its complement ﬁbi, the following two transitions violate the safety
speciﬁcation: (yi,z,-), and (y,’-,z,’). Observe that in state 1r,- (respectively, 17:) safety
may be violated if the fault perturbs the program to g,- (respectively, yi) and then
the program executes the transition (y,,z,-) (respectively, (y,’.,z,’)) (cf. Figure 4.3).
For each disjunction cj = bm V ﬁbk V b,, only the last program transition (c3), 3,)
added for cj violates the safety of speciﬁcation. Thus, if all three program transitions
corresponding to cj are included then safety may be violated by the execution of
program and fault transitions (cf. Figure 4.3).

Variables. Now, we specify the variables used in the fault-intolerant program and
their respective domains. These variables are assigned in such a way that allows us
to group transitions appropriately. The fault-intolerant program has 4 variables: 6,
f. g, and h. The domains of these variables are respectively as follows: {0, - -- ,n},
{——I,O,1}, {0,--- ,n}, and {O,--- ,M-l-n+1}.

Value assignments. The value assignments are as follows (cf. Figure 4.4):

 

Figure 4.4: The value assignment to variables.

Processes and read / write restrictions. The fault-intolerant program consists of
ﬁve processes, P1, P2, P3, P4, and P5. The read/write restrictions on these processes

are as follows:

0 Processes P1 and P2 can read and write variables f and g. They can only read

variable 6 and they cannot read or write h.

40

o Processes P3 and P4 can read and write variables e and f. They can only read

variable g and they cannot read or write h.

0 Process P5 can read all program variables and it can only write e and 9.

Remark. We could have used one process for transitions of P1 and P2 (respectively,
P3 and P4) however, we have separated them in two processes in order to simplify the
presentation.

Grouping of Transitions. Based on the above read / write restrictions, we identify
the transitions that are grouped together. We illustrate the grouping of the program
transitions and the values assigned to the program variables in Figure 4.3.
Observation 4.1 Based on the inability of P3 and P4 to write 9, the transitions
(3:,, a1), (11:2,ai), (y,, z.) and (g,’, z_,’) can only be executed by P1 or P2. D
Observation 4.2 Based on the inability of P1 and P2 to write e, the transitions
(a,-1, 513,-) and (a,_1, $1) can only be executed by P3 or P4. [1
Observation 4.3 Based on the inability of P1 to read h, the transitions (25,-, a.) and
(y,’, z,’) are grouped in P1. Moreover, this group also includes the transition (c,,-, d,-,-)
for each cj that includes ﬁbi. Cl
Observation 4.4 Based on the inability of P2 to read h, the transitions (ch, a.) and

(y,, z.) are grouped in P2. Moreover, this group also includes the transition (031', (132')

for each c,- that includes bi. CI
Observation 4.5 (a,_1, 513,-) is grouped in P3. El
Observation 4.6: (a,_1, 23;) is grouped in P4. Cl

Observation 4.7: Since process P5 cannot write f, it cannot execute the following
transitions: (ai_1,:r,-),(a,-_1,a:;),(x,,a,-),(2:2,ai),(y,-,z,-), and (y£,z,’-), for I S i g n.
Process P5 can only execute transition (0.", a0). [I]

For i, 1 S i g n, the set of transitions for each process is the union of the

transitions mentioned above.

41

 

 

 

h:
an:
int
Le.
faui
Sect
Pro
\Hlllf
is m
failsaf

SPt‘i l (u

g!
y:

tFausi
r

{351,01}

lye 3,)

”alum,

4.2.2 Reduction from 3-SAT

In this subsection, we show that 3—SAT has a satisfying truth value assignment if
and only if there exists a failsafe fault-tolerant program derived from the instance
introduced in Section 4.2.1. Towards this end, we prove the following lemmas:
Lemma 4.8 If the given 3-SAT formula is satisﬁable then there exists a failsafe
fault-tolerant program that solves the instance of the addition problem identiﬁed in
Section 4.2.1.

Proof. Since the 3—SAT formula is satisﬁable, there exists an assignment of truth
values to the propositional variables bi, 1 S i S n, such that each c,-, I g j S M,
is true. Now, we identify a fault-tolerant program, p’, that is obtained by adding
failsafe fault-tolerance to the fault-intolerant program, p, identiﬁed earlier in this

section. The invariant of p’ is:

S’ = {a0, ..,an_1} U {:c, | propositional variable b,- is true in 3-SAT } U

{:r; I propositional variable b,- is false in 3-SAT }

The transitions of the fault-tolerant program p’ are obtained as follows:

0 For each propositional variable b,, 1 S i g n, if b,- is true, we include the
transition (a,_1, 23,-) that is grouped in process P3. We also include the transition
(17,, a,). Based on Observation 4.3, as we include (23,-,ai), we have to include

I

(y,’., :4). Also, based on Observation 4.3, for each disjunction c, that includes

—1b,, we have to include the transition (cg-i, (1],).

o For each propositional variable b,, I g i g n, if b,- is false, we include the
transition (ai_1,:I::) that is grouped in process P4. We also include the transition
(.132, ai). Based on Observation 4.4, as we include (221,01), we have to include
(11,-, 2,). Also, for each disjunction c, that includes 1),, we have to include the
transition (C3,,d3-i).

42

4"

 

0 We include the transition (an, a0) to ensure that p’ has inﬁnite computations in

its invariant.

Now, we show that p’ does not violate safety even if faults occur. Note that we
introduced safety-violating transitions for each propositional variable and for each

disjunct. We show that none of these can be executed by p’.

o Safety-violating transitions related to propositional variable b,- . If the value
of propositional variable b, is true then the safety-violating transition (311,22) is
included in p’. However, in this case, we have removed the state :17; from the
invariant of p’ and, hence, p’ cannot reach state yﬁ. It follows that p’ cannot exe-
cute the transition (y,’-, 2;). By the same argument, p’ cannot execute transition

(y,, 2,) when I),- is false.

0 Safety-violating transitions related to disjunction cj. Since the 3-SAT formula
is satisﬁable, every disjunction in the formula is true. Let c,- = bm V ﬁbk V
bi. Without loss of generality, let bm be true in cj. Therefore, the transition
(cg-m, dgm) is not included in p’. It follows that p’ cannot reach the state C}, and,

hence, it cannot violate safety by executing the transition (c3,, 3,).

Since S’ _C_ S, p’ I 3’ Q I? l 3’, p’ does not deadlock in the absence of faults, and p’
does not violate safety in the presence of faults, p’ and S’ solve the synthesis problem.
Cl
Lemma 4.9 If there exists a failsafe fault-tolerant program that solves the instance
of the addition problem identiﬁed in Section 4.2.1 then the given 3-SAT formula is
satisﬁable.

Proof. Suppose that there exists a failsafe fault-tolerant program p’ derived from
the fault-intolerant program, p, identiﬁed in Section 4.2.1. Since the invariant of p’, S ’,
is not empty and S’ g S, S’ must have at least one state in S. Since the computations

of the fault—tolerant program in S’ should not deadlock, for O S i S n —— 1, every

43

a, must be included in S’. For the same reason, since P5 cannot execute from a,_1
(cf. Observation 4.7), one of the transitions (a,_.1,:1:,) or (a,_1,:r;) should be in p’
(1 S i S n). If p’ includes (a,_1, 22,) then we will set b, = true in the 3-SAT formula.
If p’ contains the transition (a,_1,:r§) then we will set b, = false. Hence, each
propositional variable will be assigned a truth value. Now, we show that it is not

the case that b, is assigned true and false simultaneously, and that each disjunction

is true.

0 Each propositional variable gets a unique truth assignment. We prove this by
contradiction. Suppose that there exists a propositional variable b,, which is
assigned both true and false; i.e., both (a,_1,:r,) and (a,..1,:r;) are included in
p’. Based on the Observations 4.1 and 4.3, the transitions (a,_1, 513,), (:r,, a,) and
(y,’-, 2;) must be included in p’. Likewise, based on the Observations 4.2 and 4.4,
the transitions (a,_1, mg), ($1, a.) and (311.2,) must also be included in p’. Hence,
in the presence of faults, p’ may reach 3;, and violate safety by executing the
transition (y,, 2,). This is a contradiction since we assumed that p’ is failsafe

fault-tolerant.

0 Each disjunction is true. Suppose that there exists a c,- = bm V ﬁbk V b,,
which is not true. Therefore, bm = false, bk =2 true and b, = false. Based on
the grouping discussed earlier, the transitions (cg-WdS-m), (cjk,djk), (c3,,d;-,) are

included in p’. Thus, in the presence of faults, p’ can reach c3, and violate safety

speciﬁcation by executing the transition (0’

ﬂ, d3,). Since this is a contradiction,

it follows that each disjunct in the 3-SAT formula is true. [:1

Theorem 4.10 The problem of synthesizing failsafe fault-tolerant distributed pro—
grams from their fault-intolerant version is NP-complete.
Proof. The NP-hardness of synthesizing failsafe fault~tolerant distributed programs

follows from Lemmas 4.8 and 4.9. Also, using Theorem 2.] presented in Section 2.8,

44

 

 

 

 

(J)

is
(17

(iii.

gin
pm.

that

lauh
(1 pr
lOlEr
are i
hlh s
a con

fauh

fault-i
4,3,1,
”0.1%de

m Wily,

it follows that the problem of synthesizing failsafe fault-tolerant distributed programs

is N P—complete. E]

4.3 Monotonic Speciﬁcations and Programs

Since the synthesis of failsafe fault-tolerance is NP-complete, as discussed earlier, we
focus on this question: What restrictions can be imposed on speczﬁcations, programs
and faults in order to guarantee that the addition of failsafe fault-tolerance can be
done in polynomial time?

As seen in Section 4.2, one of the reasons behind the complexity involved in the
synthesis of failsafe fault-tolerance is the inability of the fault-intolerant program to
execute certain transitions even when no faults have occurred. More speciﬁcally, if a
group of transitions includes a transition within the invariant of the fault-intolerant
program and a transition that violates safety, then it is difﬁcult to determine whether

that group should be included in the failsafe fault-tolerant program.

To identify the restrictions that need to be imposed on the speciﬁcation, the
fault-intolerant program and the faults, we begin with the following question: Given
a program p with invariant S, under what conditions, can we design a failsafe fault—
tolerant program, sag p’, that includes all transitions in pIS? If all transitions in pIS
are included then it follows that p’ will not deadlock in any state in S. Moreover, p’
will satisfy its speciﬁcation from S; if a computation of p’ begins in S then it is also
a computation of p. Now, we need to ensure that safety will not be violated due to
fault transitions and the transitions that are grouped with those in plS.

In this section, we identify the situations under which the addition of failsafe
fault-tolerance can be achieved in polynomial time. Towards this end, in Subsection
4.3.1, we deﬁne a class of speciﬁcations, monotonic speciﬁcations, and a class of
programs, monotonic programs, for which failsafe fault—tolerance can be synthesized

in polynomial time. The intent of these deﬁnitions is to identify conditions under

45

 

 

 

 

\i‘hii‘h
we int
4.3.2.

progra
Show I

fault-ti.

4.3.1

lIl llllS :
safe fau
ngrau;
WE}:
[lull me
even if I
safety. T
D€ﬁmtia

respect u

H .
“50,51.

LikOWle-

Dan-2220,,

..,
lJa B()0](,a“

which a process can make safe estimates of variables that it cannot read. Also,
we introduce the concept of fault-safe speciﬁcations. Subsequently, in Subsection
4.3.2, we Show the role of monotonicity restrictions imposed on speciﬁcations and
programs in adding failsafe fault-tolerance. When these restrictions are satisﬁed, we
show the transitions in plS and the transitions grouped with them form the failsafe

fault-tolerant program.

4.3.1 Sufﬁciency of Monotonicity

In this section, we identify sufﬁcient conditions for polynomial-time synthesis of fail-
safe fault-tolerant distributed programs from their fault-intolerant version. In a
program with a set of processes {P0, - -- ,Pn}, consider the case where process P,
(0 S j g 72.) cannot read the value of a Boolean variable :23. The deﬁnition of (posi-
tive) monotonicity captures the case where P,- can safely assume that :r is false, and
even if it were true when P,- executes, the corresponding transition would not violate
safety. Thus, we deﬁne monotonic speciﬁcation as follows:

Deﬁnition. A speciﬁcation spec is positive monotonic on a state predicate Y with

respect to a Boolean variable :1: iff the following condition is satisﬁed:

V30, 51, 36, 3’1 :: a:(so) = false A 22(81) 2 false A 56(86) = true A x(s’,) = true
Athe value of all other variables in so and 36 are the same
A the value of all other variables in 31 and s’l are the same
A(so, 51) does not violate spec A so 6 Y A 31 E Y
:>

(56, s’,) does not violate spec

Likewise, we deﬁne monotonicity for programs by considering transitions within
a state predicate, and deﬁne monotonic programs as follows:
Deﬁnition. A program p is positive monotonic on a state predicate Y with respect

to a Boolean variable :1: iff the following condition is satisﬁed.

46

 

 

 

 

Negat.
aMm.

the aim
vanme
tomMa
[his ('aSt
damxt
deuu

hhmmg

\i

Una

nasmm

mMmmr
mﬁxaa
enema

mamas:

meanest

WMMWL

lhumu

Vso.sl, 36, 3’1 :: :r(so) = false A :1:(s1) = false A :r(sf,) = true A 112(3’1) = true
Athe value of all other variables in so and sf, are the same
Athe value of all other variables in 31 and s’1 are the same
A(so, 31) E pIY
2:)

(86a 8'1) 6 MY

Negative monotonicity and monotonicity with respect to non-Boolean vari-
ables. We deﬁne negative monotonicity by swapping the words false and true in
the above deﬁnitions. Also, although we deﬁned monotonicity with respect to Boolean
variables, it can be extended to deal with non-Boolean variables. One approach is
to replace a: = false with :r = 0 and a: = true with a: 74 0 in the above deﬁnition. In
this case, the estimate for :1: is 0. We use this deﬁnition later in the section where we
discuss the necessity of the monotonic programs and speciﬁcations.

Deﬁnition. Given a speciﬁcation spec and faults f, we say that spec is f -safe iff the

following condition is satisﬁed.
Vso, 31 :: ((so, 81) E f A (so, 31) violates spec) => (Vs_1 :: (s_1, so) violates spec)

The above deﬁnition states that if a fault transition (so, 81) violates spec then all
transitions that reach state so violate spec. The goal of this deﬁnition is to capture the
requirement that if a computation preﬁx violates safety and the last transition in that
preﬁx is a fault transition then the safety is violated even before the fault transition
is executed. Another interpretation of this deﬁnition is that if a computation preﬁx
maintains safety then the execution of a fault action cannot violate safety. Yet another
interpretation is that the ﬁrst transition that causes safety to be violated is a program
transition.

We would like to note that for most problems, the speciﬁcations being considered

are fault-safe. To understand this, consider the problem of mutual exclusion where

47

 

 

 

a tau.
\‘iolat
SGt'tlo:
first tr
that tl
the (or
class 0.
exa‘utii
ll-‘ti.
We 1
‘11 is full
Stt‘tlon, '
Theorem
Using
thesis.
lllltf’ lair]
.llere spet
Theorem

an l‘Safe 5

If

ill) .J

Tllffn
Fail

\afp

lll poly“!

a fault may cause a process to fail. In this problem, failure of a process does not
violate the safety; safety is violated if some process subsequently accesses its critical
section even though some other process is already in the critical section. Thus, the
ﬁrst transition that causes safety to be violated is a program transition. We also note
that the speciﬁcations for Byzantine agreement, consensus and commit are f—safe for
the corresponding faults (cf. Section 4.4). In fact, given a speciﬁcation spec and a
class of fault f, we can obtain an equivalent speciﬁcation spec, that prohibits the
execution of the following transitions.
{(so,sl) : (so,sl) violates spec V (332 :: (81,82)€f A (51,32) violates spec) }

We leave it to the reader to verify that ‘p is failsafe f-tolerant to spec from S ’ iff
‘p is failsafe f-tolerant to spec, from S’. With this observation, in the rest of this
section, we assume that the given speciﬁcation, spec, is f-safe. If this is not the case,
Theorem 4.11 and Corollary 4.12 can be used if one replaces spec with spec f.

Using monotonicity of speciﬁcations / programs for polynomial time syn-
thesis. We use the monotonicity of speciﬁcations and programs to show that even
if the fault-intolerant program executes after faults occur, safety will not be violated.
More speciﬁcally, we prove the following theorem:
Theorem 4.11 Given is a fault-intolerant program p, its invariant S, faults f and

an f—safe speciﬁcation spec,

If
VP,, :1: : P,- is a process in p, :L‘ is a Boolean variable such that P,- cannot read a: :
spec is positive monotonic on S with respect to :1:
A The program consisting of the transitions of P,- is negative monotonic on S
with respect to :1:
Then

Failsafe fault-tolerant program that solves the synthesis problem can be obtained

in polynomial time.

48

 

 

Free

I be
where
New.
(respi-
sprc l).
0 J

h

HI

. .1" l

Hi.

The

the trari
failsafe f
fOllt‘iWs t,
l'lOlates 5
Spec. 1, f
We gr
Corona,

an f‘Safe

If

”Y

Proof. Let (so, 51) be a transition of process P, and let (50,31) be in plS. Let
.r be a Boolean variable that P,- cannot read. Since we are considering programs
where a process cannot blindly write a variable, it follows that :r(so) equals 33(31).
Now, we consider the transition (36, s’,) where 3:, (respectively, s’,) is identical to so
(respectively, 31) except for the value of :r. We show that (36, s’,) does not violate
spec by considering the value of 1'(so).

o :r(so) = false. Since (so, 31) E pIS, it follows that (so, 81) does not violate safety.

Hence, from the positive monotonicity of spec on S, it follows that (56. s’,) does

not violate spec.

S.

 

o :r:(so) = true. From the negative monotonicity of p on S, (36,3’1) is in p
Hence, (36, s’,) does not violate spec.

The above discussion leads to a special case of solving the synthesis problem where
the transitions in pIS and the transitions grouped with them can be included in the
failsafe fault-tolerant program. Since p’ |S equals plS and p satisﬁes spec from S, it
follows that p’ satisﬁes spec from S. Moreover, as shown above, no transition in p’
violates spec. And, since spec is f-safe, execution of fault actions alone cannot violate
spec. It follows that p’ is failsafe f-tolerant to spec from S. C]

We generalize Theorem 4.11 as follows:

Corollary 4.12 Given is a fault-intolerant program p, its invariant S, faults f and
an f-safe speciﬁcation spec,
If
VP,, x : P,- is a process in p, a: is a Boolean variable such that P,- cannot read 11: :
(spec is positive monotonic on S with respect to a:
A The program consisting of the transitions of P,- is negative monotonic on S

with respect to 17)

(spec is negative monotonic on S with respect to :1:

A The program consisting of the transitions of P, is positive monotonic on S

49

 

e01
Ol;
an

sati
pro
Pro

pr 0(-
in S.

of ad

Spirit

with respect to :17)
Then
Failsafe fault-tolerant program that solves the synthesis problem can be obtained

in polynomial time. D

4.3.2 Role of Monotonicity in Complexity of Synthesis
In Section 4.3.1, we showed that if the given speciﬁcation is positive (respectively,
negative) monotonic and the fault-intolerant program is negative (respectively, posi-
tive) monotonic then the problem of adding failsafe fault-tolerance can be solved in
polynomial time. In this section, we consider the question: What can we say about
the complexity of adding failsafe fault-tolerance if only one of these conditions is sat-
isﬁed? Speciﬁcally, in Observations 4.13 and 4.14, we show that if only one of these
conditions is satisﬁed then the problem remains NP-complete.
Observation 4.13 Given is a fault-intolerant program p, its invariant S, faults f and
an f—safe speciﬁcation spec. If the monotonicity restrictions (from Corollary 4.12) are
satisﬁed for p and no restrictions are imposed on the monotonicity of spec then the
problem of adding failsafe fault-tolerance to p remains NP-complete.
Proof. This proof follows from the fact that the program obtained by mapping the
3—SAT problem in Section 4.2 is negative monotonic with respect to h. Moreover, all
processes can read all variables except h (i.e., e, f, and g). It follows that the proof
in Section 4.2 maps an instance of the 3—SAT problem to an instance of the problem
Of adding failsafe fault—tolerance where the monotonicity restrictions from Corollary
4.12 holds for the program and no assumption is made about the monotonicity of the
Speciﬁcation. Therefore, based on Lemmas 4.8 and 4.9, the proof follows. E)
Furthermore, the speciﬁcation obtained by mapping the 3-SAT problem in Section

4.2 IS negative monotonic with respect to h. Hence, similar to Observation 4.13, we

have

50

 

’I)

trar
the]

t0 (l.

4.4

in thi
mit f
the (a
Cle iii ,
(lllf‘llllj;

“lib res:

Observation 4.14 Given is a fault-intolerant program p, its invariant S, faults f and
an f-safe speciﬁcation spec. If the monotonicity restrictions (from Corollary 4.12) are
satisﬁed for spec and no restrictions are imposed on the monotonicity of p on S then

the problem of adding failsafe fault—tolerance to p remains NP-complete.
Proof. The proof is similar to the proof of Observation 4.13. E]

Based on the above discussion, it follows that monotonicity of both programs and
speciﬁcations is necessary in the proof of Theorem 4.11. If only one of these properties
is satisﬁed then the problem of adding failsafe fault-tolerance remains N P-complete.
Comment on the monotonicity property. The monotonicity requirements are simple
and if a program and its speciﬁcation meet the monotonicity requirements then the
synthesis of failsafe fault-tolerance will be simple as well. Nevertheless, the signiﬁ-
cance of such sufﬁcient conditions lies in developing heuristics by which we transform
speciﬁcations (respectively, programs) to monotonic speciﬁcations (respectively, pro-
grams) so that polynomial—time addition of failsafe fault-tolerance becomes possible.
While the issue of designing such heuristics is outside the scope of this paper, we note
that we have developed such heuristics in Chapter 9 and [29], where we automatically
transform speciﬁcations (respectively, programs) to monotonic speciﬁcations (respec-
tively, programs) for the sake of polynomial-time addition of failsafe fault-tolerance

to distributed programs.

4.4 Examples of Monotonic Speciﬁcations

In this section, we present three problems, Byzantine agreement, consensus and corn-
rnit, for which the speciﬁcations and fault-intolerant programs are monotonic. In
the case of Byzantine agreement, we ﬁrst identify the variables and their respective
domains. Then, we provide the fault—intolerant program and its invariant. Subse-
quently, we present the speciﬁcation and faults. Finally, we show the monotonicity

with reSpect to appropriate variables. Since the arguments for consensus and corn-

51

 

Th.

j is

mit are similar to those in the Byzantine agreement problem, we simply sketch the

arguments for those two problems.

4.4.1 Byzantine Agreement

For simplicity, we consider the canonical version where there are 4 distributed pro-
cesses g, j, k, and l such that g is the general and j, k, l are the non-generals. (An
identical explanation is applicable if we consider arbitrary number of non-generals.)
In the agreement program, the general sends its decision to non-generals and subse-
quently non-generals output their decisions. Hence each process has a variable d to
represent its decision, a boolean variable b to represent if that process is Byzantine,
and a variable f to represent that process has ﬁnalized (output) its decision. The

program variables and their domains are as follows:

d.g : {0,1}

d. j, d.k, d.l : {0,1,1} // 1 denotes uninitialized decision
b.g,b.j,b.k,b.l : {true, false} // b.j=true iffj is Byzantine

f.j, f.k,f.l : {0,1} // f.j=1 iffj has ﬁnalized its decision

The fault-intolerant Byzantine Agreement, IB. Each non-Byzantine process

j is represented by the following actions:

d.j=_L A f.j=0 —+ d.j:= d.g
d.j7é1 A f.j=0 —+ f.j:=1

Invariant of 13. The invariant of I B , S I 3, is as follows:
SIB = (Vp :: -wb.p A (d.p = _L V d.p = d.g) A (f.p => d.p 75 1))

Safety speciﬁcation of Byzantine agreement. The safety speciﬁcation requires
that Validity and Agreement be satisﬁed. Validity requires that if the general is not

Byzantine and a non—Byzantine non-general has ﬁnalized its decision then the decision

52

of that non-general process is the same as that of the general. Agreement requires that
if two non-Byzantine non-generals have ﬁnalized their decisions then their decisions

are identical. Hence, the program should not reach a state in 5,], where

S3; = (3p,q :: ﬁbp A ﬁbq A d.p 75 _L A d.g 75 _L A d.p 36 do A f.p A f.q)
V (3p :: -ib.g A -ib.p A d.p 7g 1 A d.p at d.g A f.p)

In addition, when a non-Byzantine process ﬁnalizes, it is not allowed to change it

decision. Therefore, the set of transitions that should not be executed is as follows:

ts! = {(30:51) 3 31 E szl U {(50,31) 3 ﬁb-Jlsol A _‘b-Jisl) Af-j(30) =1
A ((1-“50) 74 d-.l(31) V f-j(30) 7‘5 fj(31))}

Faults. The Byzantine faults, f3, can affect one process at most and a Byzantine
process can change its decision arbitrarily. Hence, the Byzantine faults are shown by

the following actions:

-1b.g A ﬁb.j A ﬁbk A ﬁb.l ——+ b.j := true

b.j ——> d.j,f.j:= OII,OI1

The read / write restrictions: Each non-general non-Byzantine process j is allowed
to read r,- = {b.j,d.j, f.j,d.k,d.l,d.g} and it can only write w,- = {d.j,f.j}. Hence,
in this case w,- C_: r,. And, the variables that j is not allowed to read are nr, =
{b.g,b.k,b.l,f.k,f.l}.

Monotonicity of the speciﬁcation and the program. We make the following
observations.

Observation 4.15 The speciﬁcation of Byzantine agreement is positive monotonic
with respect to bk (respectively, b. j and bl)

Proof. Consider a transition (so, 31) of some non—general process, say 3', where

Validlty and agreement are not violated when k is not Byzantine. Let (36, s’,) be the

53

corresponding transition where k is Byzantine. Since validity and agreement impose
no restrictions on what a Byzantine process may do, it follows that (86, s’,) does not

violate validity and agreement. [I]

Observation 4.16 The speciﬁcation of Byzantine agreement is negative monotonic

with respect to fit (respectively, f. j and fl)

Proof. Consider a transition (so, 31) of some non-general process, say j , where
validity and agreement are not violated when fit is I. i.e., I»: has ﬁnalized its deci-
sion. Let (56, s’,) be the corresponding transition where f.k is 0. Since validity and
agreement impose no restrictions on processes that have not ﬁnalized their decision,

it follows that (.96, s’,) does not violate validity and agreement. [3

Observation 4.17 The program I 8,, consisting of the transitions of j, with invariant

SIB is negative monotonic with respect to bk (respectively, b. j and b.l)
Proof. Follows from the fact that I B IS 1 3 contains no transitions when bk is true.

[I

Observation 4.18 The program 18,-, consisting of the transitions of j, with invariant
S13 is positive monotonic with respect to f.k (respectively. f. j and fl)

Proof. We leave it to the reader to observe this by considering all transitions in 3'.
Cl

Observation 4.19 The speciﬁcation of Byzantine agreement is fB-safe.

Proof. Follows from the fact that a fault only affects the variables of a Byzan-
tine process and, hence, cannot violate safety; safety may only be violated if a non-

Byzantine process changes its state based on the variables of the Byzantine process.

El

Now, using Observations 4.15-4.19 and Corollary 4.12, we have

Theorem 4.20 Failsafe fault-tolerant Byzantine agreement program can be obtained

in polynomial time. D

54

To obtain the failsafe fault-tolerant program, we calculate the transitions of the
fault-tolerant program inside the invariant S 1 B. The groups of transitions associated
with them form the failsafe fault-tolerant program, F SB. Thus, the actions of a

non-general process P,- in the fault-tolerant program are as follows:

FSBI: d.j=.L A f.j=0 ——+ d.j:=d.g
FSBg: (d.j = 0) A ((d.k # I) A (d.l 75 1)) A f.) = 0 ——> f.j :=1
FSBgI (d.j=1) A ((d.k # 0) A (d.l 75 0)) A f.j = 0 ——-—> f.]' :=

The ﬁrst action remains unchanged, and the second and the third actions deter-
mine when a process can safely ﬁnalize its decision so that the validity and agreement
are preserved. Note that if the general is Byzantine and casts two different decisions
to two non-general processes then the non-general processes may never ﬁnalize their
decisions. Nonetheless, the program F S B will never violate the safety of speciﬁcation

(i.e., F S B is failsafe fault-tolerant).

4.4.2 Consensus and Commit

We now discuss the problems of distributed consensus and atomic commit to show
that their speciﬁcations and fault-intolerant programs satisfy the monotonicity re-
quirements. Since the arguments involved in these problems are similar to those in
Byzantine agreement, we simply outline the reasoning behind the monotonicity.
Consensus. In distributed consensus, each process begins with a vote. Initially,
the votes of processes may be different. It is required that all non-faulty processes
agree on the same value (agreement) and that if the vote of every process is v then
the agreed value be the same as v (validity). A fault can cause a process to crash
(undetectably). Upon failure, the vote (and the decision) of the failed process is

reset to 1 so that other processes cannot distinguish between the failed process and

55

a process that has yet to vote.

In this problem, we introduce a variable, up. j for every process j; j can read its
own up value but not the up value of other processes. It is straightforward to see that
the speciﬁcation of consensus is negative monotonic with respect to up. Likewise, in
the absence of faults, all up values are true and, hence, in the absence of faults, a
fault-intolerant program has no transitions that execute when an up value is false.
It follows that a fault-intolerant program for consensus is positive monotonic with

respect to up.

Commit. In the commit problem, the agreement requirement is the same as
that in consensus. However, validity requires that if the vote of any process is 0 then
the agreed value must be 0. And, if all processes vote 1 and no failures occur then it is
required that the agreed value must be 1. Again, the fault considered for this problem
is the crash fault and, hence, we introduce the variable up for every process to denote
whether the process is up or not. The argument that monotonicity requirements are

met in the commit problem is the same as that in the consensus problem.

4.5 Summary

In this chapter, we focused on the problem of adding failsafe fault—tolerance to an
existing fault-intolerant distributed program. A failsafe fault-tolerant program satis-
ﬁes its speciﬁcation (including safety and liveness) when no faults occur. However,
if faults occur, it satisﬁes at least the safety speciﬁcation. We showed, in Section
4.2, that the problem of adding failsafe fault-tolerance to distributed programs is
NP-complete. Towards this end, we reduced the 3-SAT problem to the problem of
adding failsafe fault-tolerance.

In a broader perspective, we are interested in identifying the problems for which

the synthesis of fault-tolerant programs can be designed efﬁciently (in polynomial

56

time) and the problems for which exponential complexity is inevitable (unless P =
NP). By identifying such a boundary, we can determine the problems that can reap
the beneﬁts of automation and the problems for which heuristics need to be developed
in order to beneﬁt from automation. This chapter helps to make this boundary more
precise than [I] in three ways. For one, the proof in [1] is for masking fault-tolerance
where both safety and liveness need to be satisﬁed. By contrast, the NP-completeness
in this chapter applies to the class of programs where only safety is satisﬁed. Also,
the proof in [I] relies on the ability of a process to blindly write some variables. By
contrast, the proof in this chapter does not rely on such an assumption.

The third —and the most important— step in identifying the boundary is addressed
in Section 4.3 where we identiﬁed a class of speciﬁcations and a class of programs
for which failsafe fault-tolerance can be added in polynomial time. Essentially, this
class captures the intuition that to obtain a failsafe fault-tolerant program, we can
let the fault-intolerant program execute in the presence of faults and ensure that a
program transition is executed only if its execution will be safe even if faults have
occurred. Towards this end, we imposed two restrictions: positive monotonicity of the
speciﬁcation and negative monotonicity of the fault-intolerant program. We showed
that these restrictions are sufﬁcient for polynomial synthesis of failsafe fault-tolerant
distributed programs.

To show the sufficiency, in Section 4.3, we showed how a failsafe fault-tolerant
program can be designed if one begins with a positive monotonic speciﬁcation and
a negative monotonic program. Also, we proved that if only the input program
(respectively, speciﬁcation) is monotonic and there exist no assumption about the
monotonicity of the speciﬁcation (respectively, program) then the synthesis of failsafe

fault-tolerance remains NP-complete.

57

Chapter 5

Fault-Tolerance Enhancement

In this chapter, we concentrate on automated techniques to enhance the fault-
tolerance level of a program from nonmasking to masking. Given the complexity
of adding fault-tolerance to a fault-intolerant distributed program, in this chapter,
we address the following question. Is it possible to reduce the complexity of adding
masking fault-tolerance if we begin with a program that provides additional guarantees
about its behavior in the presence of faults? Towards this end, we formally deﬁne
the problem of enhancing the fault-tolerance of nonmasking programs to masking.
Then, we present a sound and complete algorithm for the enhancement of fault-
tolerance in high atomicity model. We also present a sound algorithm for enhancing
the fault—tolerance of nonmasking distributed programs. We illustrate our algorithms
by enhancing the fault-tolerance of the triple modular redundancy (TMR) program

and the Byzantine agreement program.

This chapter is organized as follows: In Section 5.1, we state the problem of
enhancing the fault-tolerance from nonmasking to masking. In Section 5.2, we present
our solution for the high atomicity model. In Section 5.3, we present our solution for

distributed programs. Finally, we summarize this chapter in Section 5.6.

58

5. 1 Problem Statement

In this section, we formally deﬁne the problem of enhancing fault-tolerance from non-
masking to masking. The input to the enhancement problem includes the (transitions
of) nonmasking program, p, its invariant, S, faults, f, and speciﬁcation, spec. Given
p, S, and f, we can calculate an f-span, say T, of p by starting at a state in S and
identifying states reached in the computations of p[] f. Hence, we include fault-span T
in the inputs of the enhancement problem. The output of the enhancement problem
is a masking fault—tolerant program, p’, its invariant, S’, and its f-span, T’.

Since p is nonmasking fault—tolerant, in the presence of faults, p may temporarily
violate safety. More speciﬁcally, faults may perturb p to a state in T—S. After faults
stop occurring, p will eventually reach a state in S. However, p may violate spec
while it is in T—S. By contrast, a masking fault-tolerant program p’ must satisfy its
safety speciﬁcation even during recovery from T—S to S.

The goal of the enhancement problem is to separate the tasks involved in adding
recovery transitions and the tasks involved in ensuring safety. The enhancement
problem deals only with adding safety to a nonmasking fault-tolerant program. With
this intuition, we deﬁne the enhancement problem in such a way that only safety
may be added while adding masking fault-tolerance. In other words, we require that
during the enhancement, no new transitions are added to deal with functionality
or to deal with recovery. Towards this end, we identify the relation between state
predicates T and T’, and the relation between the transitions of p and p’.

If p’[] f reaches a state that is outside T then new recovery transitions must be
added while obtaining the masking fault-tolerant program. Hence, we require that
the fault-span of the masking fault-tolerant program, T’, be a subset of T. Likewise,
if p’ does not introduce new recovery transitions then all the transitions included in

p’IT’ must be a subset of pIT’. Thus, the enhancement problem is as follows:

59

The Enhancement Problem
Given p, S, spec, f, and T such that p satisﬁes spec from S and

T is an f—span used to show that p is nonmasking fault-tolerant for spec from S
Identify p’ and T’ such that

T’ (_I T,

p’lT’ Q. plT’, and

p’ is masking f-tolerant from T’ for spec. [3

Comments on the Problem Statement

1. While the invariant, S, of the nonmasking fault-tolerant program is an input to
the enhancement problem, it is not used explicitly in the requirements of the
enhancement problem. The knowledge of S permits us to identify the transitions
of p that provide functionality and the transitions of p that provide recovery. We
ﬁnd that such classiﬁcation of transitions is useful in solving the enhancement

problem. Hence, we include S in the problem statement.

2. If S’ is an invariant of p’, S’ (_I T’, every computation of p’ that starts from
a state in T’ maintains safety, and every computation of p’ that starts from a
state in T’ eventually reaches a state in S’ then every computation of p’ that
starts in a state in T’ also satisﬁes its speciﬁcation. In other words, in this
situation, T’ is also an invariant of p’. (This result has been previously shown
in [18]; we repeat the proof in Section 5.2.) Hence, we do not explicitly identify

an invariant of p’. Predicates T’ and T’ D S can be used as the invariants of p’.

3. The above problem statement assumes that no new states/ variables are added
while enhancing fault-tolerance. This assumption can be removed by allowing
systematic addition of new variables [1]. Another approach is to pretend that a
process can read certain private variables of other processes. Then, we design

a masking program that uses such private variables. The transitions of such

60

a masking program will require the detection of predicates involving the pri-
vate variables of other processes; one can use reﬁnement techniques to detect
these non-local predicates appropriately. These reﬁnement techniques, in turn,
will determine the new variables that need to be added to detect these non-
local predicates. Several such reﬁnement techniques have been discussed in the

literature (e.g., [30, 18]).

5.2 Enhancement in High Atomicity Model

In this section, we present our algorithm for solving the enhancement problem in high
atomicity model. Thus, given a high atomicity nonmasking fault-tolerant program p,
our algorithm derives masking fault-tolerant program p’ that ensures that safety is
added while the recovery provided by p is preserved. The goal of the enhancement
problem is to add safety while preserving recovery. Hence, we obtain a solution for
the enhancement problem by tailoring the algorithm Add- f ailsa f 6 (see Section 2.7.1);
Add- f ailsa f e deals with the addition of safety to a fault-intolerant program in the
presence of faults.

In our algorithm (cf. Figure 5.1), ﬁrst, we compute the set of states, ms, from
where fault actions alone violate safety. Clearly, we must ensure that the program
never reaches a state in ms. Hence, in addition to the transitions that violate safety,
we cannot use the transitions that reach a state in ms. We use mt to denote the
transitions that cannot be used while adding safety. Using ms and mt, we compute the
fault-span of p’, T’, by calling function HighAtomicityConstructInvariant (H ACI ).
The ﬁrst guess for T’ is T —ms. However, due to the removal of transitions in mt,
it may not be possible to provide recovery from some states in T—ms. Hence, we
remove such states while obtaining T’. If the removal of such states causes other
states to become deadlocked, we remove those states as well. Moreover, if (so, 81)

is a fault transition such that 31 was removed from T’ then we remove so to ensure

61

that T’ is closed in f. We continue the removal of states from T’ until a ﬁxed point
is established. After computing T’, we compute the transitions of p’ by removing all
the transitions of p-mt that start in a state in T’ but reach a state outside T’. Thus,

our algorithm is as follows:

 

High-Atomicity_Enhancement(p, f: set of transitions, T: state predicate,
spec: speciﬁcation)
{ ms := {so : 331,32, ...sn:
(Vj :05j<n : (s,-,s(,-+1)) E f) A (s(,,_1),s.,,) violates spec};
mt := {(30,31) : ((31 Ems) V (so,s-1) violates spec) };
T’ := HACI(T - ms,p-—mt,f);
if (T’={}) declare no masking f-tolerant program p’ exists;
else p’ := (p— mt) -— {(30,31) : soeT’ A 31¢T’}

}

H ACI (T : state predicate, p, f: set of transitions)
{while (380 Z 806T: (V31 1 816T: (So,81)¢p)V (381 i 81 e? T I (80,81)€f))
T :2 T — {so} }

 

 

 

Figure 5.1: The enhancement of fault-tolerance in high atomicity.

Before showing that the algorithm High_Atomicity-Enhancement is sound and com—
plete and its complexity is polynomial in the state space of the nonmasking fault—
tolerant program, we present a set of observations about our high atomicity algo—
rithm. We use these observations to prove two lemmas about the computations of
the synthesized masking fault-tolerant program in the presence of faults. Then, we
use these lemmas to prove the soundness and completeness of our algorithm in the
high atomicity model. To prove the soundness of our algorithm, we have to show that
p’ and T’ satisfy the conditions of the enhancement problem. To prove the complete-
ness, we show that if there exists any masking fault-tolerant program that enhances
the fault—tolerance of the given nonmasking program then our algorithm will succeed
in ﬁnding one.

We use the following notation in the rest of this section: Given a fault-intolerant

program p, speciﬁcation spec, invariant S, faults f, and fault—span T, we say that

62

program p’ and predicate T’ solve the enhancement problem for the given input
iff p’ and T’ satisfy the three conditions of the enhancement problem. we say p’
(respectively, T’) solves the enhancement problem iff there exists T’ (respectively, p’)
such that p’, T’ solve the enhancement problem.
In the high atomicity algorithm, based on the the construction of T’, we have:
Observation 5.1 T’ 0 ms = {}, D
By the construction of T’, T’ is obtained by removing zero or more states in T.
Thus, we have:
Observation 5.2 T’ (_Z T. E]
The transitions of p’ are a subset of the transitions of p. Thus, we have:
Observation 5.3 (p’lT’) Q (plT’). [I]
From the deﬁnition of H AC1 , T’ is closed in p’ and f. Thus, we have:
Observation 5.4 T’ is closed in p’ [l f. E]
If faults perturb p to a state in T, eventually p will return to a state in S. Also,
by deﬁnition, S Q T and by Observation 5.2, T’ Q T. Now, if T’ 0 S = {}, and a
computation c of p’[] f reaches a state in T’ — S then p’ will never have a chance to
return to a state of S. By Observation 5.3, c is also a computation of p. Thus, if
T’ D S = {} then there exists a computation of p[] f that starts in a state in T and
never reaches a state in S. Since this is a contradiction, we have
Observation 5.5 T’ H S # {} . Cl
Deﬁnition. For the rest of the section, we let S’ to be equal to T’ D S. C]
Now, we use these observations to present two lemmas that are used in the sound-
ness proof of the algorithm. First, in Lemma 5.6, we show that in the presence of
faults safety speciﬁcation is not violated. Then, in Lemma 5.7, we show that if faults
perturb p’ to a state in T’ then every computation of p’ starting at T’ will reach a
State in 5’.

Lemma 5.6 p’[] f maintains spec from T’.

63

Proof. By construction, T’ is closed in p’[] f. Let c be a computation of p’[] f
that starts from a state in T’. If e violates the safety of spec, there exists a preﬁx,
say (30,31, ...,s,,), that violates the safety of spec. \Nlog, let (so,sl, ...,s,,) be the
smallest such preﬁx. It follows that (s(,,_1), sn) violates the safety of spec and, hence,
s(,,_1), s") 6 mt. By construction, p’ does not contain any transition in mt. Thus,
s(),,_1,s,,) is a transition of f. If (s(,,.1), 3") is a transition of f then s(,,_1) 6 ms and

(
(
(s(,,_2),s,,_1)) 6 mt and, henc,e (s(,,_2),s(,,-_1)) is a transition of f. By induction, if
(so, 31, .. (.,sn) violates the safety of spec, so 6 ms, which is not possible since so 6 T’
(

cf. Observation 5.1). Thus, p’[] f maintains spec from T’. [:1

Lemma 5.7 Every computation of p’ that starts from a state in T’ contains a state
in S’.

Proof. Consider a computation of p’, say c, that starts from a state so in T’. Since
c is also a computation of p, it eventually reaches in a state, say 3”, in S (0 g n). By

the deﬁnition of S’ and the closure of T’ in p’, it follows that 3,, is in S’. 1:]
Theorem 5.8 T’ is an invariant of p’ for spec.

Proof. Let c be a computation of p’ that starts from a state in T’. By Lemma
5.6, c maintains spec and by Lemma 5.7, c contains a state 3", where 8,, € 3’. Thus,
c is of the form (so, 31, ..., sn, s,,+1, ).., where the preﬁx (so, 81, ..., s") maintains spec
and (sn,sn+1, ...) is in spec. By deﬁnition of maintains, there exists a suffix, say
B, such that (so,sl,...,s,,)ﬁ is in spec. Now, from fusion closure, it follows that
(so, 51, ..., s", sn+1, ...) is also in spec. Thus, every computation of p’ that starts in a
state in T’ is in spec. Also, T’ is closed in p’ (cf. Observation 5.4). It follows that T’

is an invariant of p’ for spec. [3

Theorem 5.9 (Soundness) The algorithm High_Atomicity-Enhancement is
sound.
Proof. To prove that our algorithm is sound, we have to show that the conditions

of the enhancement problem are satisﬁed.

64

 

auj
tra
lie
I"!

in j

(011}

“it h;

90111]:

plea”, .
Spam

Pr.
it], ,

at 5,,

1. T’ (_Z T. (cf. Observation 5.2).

2. p’]T’ g plT’. (cf. Observation 5.3).

3. p’ is masking f—tolerant to spec from T’. By letting the fault-span to be T’

itself, the proof follows.

Theorem 5.10 (Completeness) The algorithm H igh_Atomicity-Enhancement

is complete.

Proof. Let program p” and predicate T” solve the enhancement problem. Clearly,
T” 0 ms: {}; if so 6 (T” 0 ms) then the execution of faults alone from so can violate
the safety of spec. It follows that T” g (T -ms). Moreover, p” IT” cannot include
any transitions in mt; if p” |T” contains a transition in mt then the execution of this
transition followed by zero or more fault transitions can violate the safety of spec.
Thus, p” |T” Q (p—mt). Finally, every computation of p” that starts from a state in
T” must be an inﬁnite computation, if it were to be in spec, and T” must be closed
in f. It follows that there exists a nonempty subset of T (namely, T”) such that all

computations of p—mt within that subset are inﬁnite.

Our algorithm declares that no solution for the enhancement problem exists only
when there is no nonempty subset of T—ms such that all the computations of p—mt
within that subset are inﬁnite, and that set is closed. It follows that the algorithm is
complete. [:1

Theorem 5.11 The algorithm H igh_Atomicity_E nhancement is sound and corn-
plete and the complexity of H igh_Atomic-ity_E-nhancement is polynomial in the state
space of the nonmasking fault-tolerant program.

Proof. The soundness and completeness proof follows from Theorems 5.9 and
5.10. Regarding complexity, note that the computation of ms as well as computation
of H AC1 are. both polynomial in the state space of the input program. I]

65

5.2.1 Example: Triple Modular Redundancy

As an illustration of our high atomicity algorithm, we show how the masking
triple modular redundancy (TMR) program can be designed by enhancing the fault—
tolerance level of the corresponding nonmasking program.

First, we present the nonmasking version of TMR program, the speciﬁcation of
TMR, and the fault actions for TMR. Then, we show how our high atomicity algo-
rithm is used to enhance the level of fault-tolerance to masking.

N onmasking TMR program. N onmasking version of TMR program consists of
three processes j, k, and I that share an output variable out. Each process j has an
input variable 2124'. The values of these input variables are obtained from a common
sensor. The domain of each input variable is {0, 1} and the domain of out is {0, 1, i}
(1 means no value has been assigned to out). For each process j, if the value of out
is not yet assigned, 3' copies (using guarded command N 1) its input 212. j to out. And,
if out is assigned a wrong value, i.e., the value other than the majority value, and the
value of 272. j is not corrupted then process j corrects (by guarded command N2) out
by copying 2n.j to out. Both nonmasking and masking programs for TM R include
a self-loop for states in which out has been assigned a correct value. However, for
brevity, in this section, we keep such self-loops implicit. Thus, the actions of each
process j in the nonmasking version of TM R are as follows (in this section, {9 denotes

modulo 3 addition):

1 N1: (outzi) —+ out ~.= my
N2: (out#_L) /\ (out ¢ 2n.j) /\ ((in.j = 2'22.(j 6‘1 1)) V (2724' = 272.(j ED 2)))

—-> out :2 in.j

Faults. Faults may perturb one of the inputs when all of them are equal. Thus,

the fault action that affects j is represented by the following action:
F: (Vp :: 2n.j 2 mp) ——+ 2'n..j :2 0 | 1

Invariant. The following state predicate is an invariant of TMR.

66

.., v. ‘- a“ I

 

STMR 2 (out =_L /\ (Vp, q :: 2'n.p = in.q)) V (3p, q : p 75 q : out = 212p =2 272.21)

Safety speciﬁcation. The safety speciﬁcation of TN! R requires the program not to
reach states in which there exist two processes whose input values are equal but these
inputs are not equal to out (where out 7E_L). The safety speciﬁcation also stipulates
that variable out cannot change if it is different from _L. Thus, safety speciﬁcation

requires that following transitions are not included in a program computation.

szMR = sf1 u sfg, where
sfl = {(30, 81) I (31w 2 (p sé q) = (meta) =in-(1(31)) /\
(in.q(sl) 7e out) /\ (0ut(sl) 7.41)», and
sf2 = {(30, 81) | (out(80) 7%) A (out(So) 3* out(81))}

Fault-span. If all the inputs are equal then the value of out is either 1 or equal

to those inputs. Thus, fault-span of the nonmasking version of TA! R is T, where
TTMR = (Vp, q :: in.p = 2n.q) :> ((out =_L) V (Vp :: out = 2n.p))

Remark. The TMR program consists of three variables whose domain is {0, 1}
and one variable whose domain is {0, 1, _L}. Enumerating the states associated with
these variables, the state space of TMR program includes 24 states. Of these, 10
states are in the invariant, 12 additional states are in the fault-span, and two states
are outside the fault-span.

The program consisting of actions N 1 and N2 is nonmasking fault—tolerant in that
if it begins in a state where STMR is true then it satisﬁes its speciﬁcation. However, if
the faults perturb it to a state in TTMR-STMR then it eventually recovers to a state
where STMR is true. Nonetheless, until such a state is reached, safety speciﬁcation
may be violated.

Enhancing the tolerance of TMR. We trace the execution of our high atomicity

algorithm for nonmasking TMR. program.

67

1. Compute ms. ms includes all the states from where one or more fault
transitions violate safety. In case of TMR, fault transitions do not violate safety
if they execute in a state in TTMR . Faults only change the value of one of the
inputs and then safety may be violated if the corresponding process executes

guarded command N 1. Thus, TTMR 0 ms 2 {}.
2. Compute mt. From the deﬁnition of ms, mt: sfnm.

3. Construct TfMR and p’. After removing transitions in mt, states where out
differs from _L and out differs from the majority of the inputs are deadlocked.
Hence, we need to remove those states while obtaining T } M 3. After removal of
those states, there are no other deadlock states. Hence, our algorithm will let

iwa to be the state predicate:

T7,"an = TTMR _ {3 : (3p, (15 (P 7f Q) 5 (271.1)(8) = 2724(8)) A (“at“) #J—l A
<0ut<s> ¢ m.p<s>>)}

Moreover, to obtain the transitions of masking version of TA! R, we consider the
transitions of p that preserve the closure property of T7’~MR. Thus, the masking

version of TM R consists of the following guarded command:
M1: (022t=_L)/\((2'n.j = 271.069 1)) V (2724' = 2'72.(j {B 2))) ——> out :2 in.j

The predicate Tﬁm computed by our algorithm is both an invariant and a fault-
span for the above program; every computation of the above program satisﬁes
the speciﬁcation if it begins in a state in T h, 3. Moreover, Th“? is closed in

both the program and fault transitions.

Remark. Note that transitions included in N2 are removed from the above
masking fault—tolerant program as those transitions violate ng. However, if

safety consisted of only 3 f1 then the fault-tolerant program would include the

68

transitions included in N2. While a masking fault-tolerant program can be
obtained without using the transitions in N2, their inclusion follows from the
heuristic in [1] that the output program should be maximal. In [1], Kulkarni
and Arora have argued that if the output of a synthesis algorithm is to be used
as an input, say to add fault-tolerance for a new fault, it is desirable that the

intermediate program be maximal.

5.3 Enhancement for Distributed Programs

In this section, we present an algorithm to enhance the fault-tolerance level of a
distributed nonmasking fault-tolerant program to masking. First, we discuss the
issues involved in the enhancement problem for distributed programs. Then, we
present our algorithm. As a case study, we apply our algorithm to the Byzantine
agreement problem.

In high atomicity model, the main issue in enhancing the fault—tolerance level of a
nonmasking fault-tolerant program p was to ensure that p does not execute a safety
violating transition (so, 31). In order to achieve this goal, we can either (i) ensure that
p will never reach 30, or (ii) remove (30, 31). For the high atomicity model, we chose the
latter option as it was strictly a better choice. However, for distributed programs, we
cannot simply remove a safety violating transition (80,81) as (so, 31) could be grouped
with some other transitions (due to read restrictions). Thus, removal of (so, 31) will
also remove other transitions that are potentially useful recovery transitions. In other
words, for distributed programs, the second choice is not necessarily the best option.
Since an appropriate choice from the above two options cannot be identiﬁed easily for
distributed programs, the synthesis of distributed programs becomes more difficult.

We develop our low atomicity algorithm (cf. Figure 5.3) by tailoring the high
atomicity algorithm to deal with the grouping of transitions. More speciﬁcally, given

a nonmasking fault-tolerant program p, we ﬁrst start by calculating a high atomicity

69

fault-span, T,’,,g,, , which is closed in p[] f. Since the low atomicity model is more
restrictive than the high atomicity model and T f“. 9,, is the largest fault-span for a high
atomicity program, we use Til-9,, as the domain of the states that may be included
in the fault-span of our low atomicity program. In other words, if a transition, say
(so,sl) violates the safety speciﬁcation and so 5! T,’,,.g,, then we include the group
associated with (so, 81) and ensure that state so is never reached.

Then, we call function LowAtomicityConstructInvariant (LACI) to calculate a
low atomicity invariant SIM, for p’ (cf. Figure 5.2). In the body of the algorithm in
Figure 5.3, to calculate Sfow, we ﬁrst call function LAC] with T,’,,.g,,os as its ﬁrst
argument. Inside LACI, we ignore the fault transitions during the call to H ACI ; we
consider the effect of fault transitions subsequently. In this call to H AC'I , we also
ignore the grouping of transitions. These requirements are checked on the value of
8,2,5”, returned by H ACI . Speciﬁcally, if there exists a group containing transitions

(so, 31) and (86, 3’1) such that so, 36, 3’1 6 SAW, and 31 ¢ 8,2,9,“ we remove so from 83,9,

 

and recalculate the invariant. If no such group exists, LAC'I returns Stag}:- Thus, the
function LACI is as follows:
LACI(S : state predicate, p: transitions, go, ' -- , gm: groups of transitions )

{ Sfu'gh = HACI(Sapa 0);
if (391350: 31,86, 3’1 : (30131)1(5f)33,1) E 9i i (803 8f), 3,1 e an'gh A '31 ¢ an'gh) )
then return LACI(S,’,,-g,, — {so},p, go, - -~ ,gm);
else return Saigh;

 

 

 

Figure 5.2: Constructing an invariant in the low atomicity model.

In Figure 5.3, the value returned by LACI, S,’ is used as an estimate of the in-

nit?

variant of the masking fault-tolerant distributed program. To compute T’, we identify

the effect of the fault transitions and the program transitions from states in S’ We

init'
use the variable Sfow to keep track of states reached in the execution of the program

and fault transitions from S,’,,,-,. Our ﬁrst estimate for Sfow is the same as S,’,,,-,. Now,

70

we compute So as the set of states reached in one step (of program or fault). Regard-
ing fault transitions, if (so, 31) is a fault transition, so 6 Sfow and .91 E (Téigh-Sl’ow)
then we add state 31 to the set So. Regarding program transitions, we only consider a
group if the following three conditions are satisﬁed: (1) at least one of the transitions
in it begins and ends in Sfow’ (2) if a transition in that group begins in a state in

high then it terminates in a state in my, and it does not violate safety, and (3) if
a transition in that group begins in a state in 85,,“ then it terminates in a state in

5!

mt. If such a group has another transition (36, 3’1) such that 36 6 SW and s’l ¢ Sfaw

then we include state s'1 in the set 5'2. (Note that in the ﬁrst iteration, 5,2,“ equals
Sfow. Hence, expansion by program transitions need not be considered. However, this
expansion may be necessary in subsequent iterations.) Thus, .92 identiﬁes states from

where recovery must be preserved.

 

Low_Atomicity_Enhancement(p : transitions, go, - - - , gm: groups of transitions,
f: faults, T, .S' : state predicate, spec : speciﬁcation)
// P: gOUQI U ngm
{ Calculate ms and mt as in High_Atomicity-Enhancement
Tﬁigh = HACI(T - ms,p - mt, f);
Sim-t = Sfow = LACI(SﬂT,’u-gh,p — mt,go, - -- ,gm);
repeat {
52 ={81381€( liigh— (low): (380180 E Sim” 2 (80,81) 6 f V
(395 1(80,81)6 91' I (((gilsz’ow) O (P‘mtll 7'5 4’) A
0732,83 2 (32,83) E 9, /\ 32 E Tfn’gh : 33 E Tfu’gh /\ (82,83) ¢ mt) /\
0732,33 3 (32933) 5 9i A 32 E Sinit ‘ 33 E Sim't» )}
S3 = {80 :80 E (Tﬁi9h~Sl’mv) 1 (381,91' 1 (80,81) 6 g,- /\ 81 6 81,011):
(V32,S3 : (82,83) 6 gi /\ 32 E Ti’u'gh : 33 6 Tgigh /\ (32,33) g mt) /\
0132,33 : (32,33) E 9. /\ 32 E 55,,“ :83 E Sinulll
(low = Sfow U S3;
} until (53 = 0);
if (52 95 (0) then declare fault-tolerance cannot be enhanced; exit().
T’ = Slow;
P’ = {91' 1 (V80»81 I (80.81) E 92‘ 3 (80 E T’ => (81 E T’ A (80.81) E (P " mill) A
(30 E Sfm't => 31 E Sinitllli
return p’, T’;

 

 

 

Figure 5.3: The enhancement of fault-tolerance for distributed programs.

71

 

 

0f.

l

we then calculate the set of states from where recovery can be added, in one step.
Speciﬁcally, if there is a transition (so, sl) such that so ¢ $wa and 31 E Sfow then we
include so in set 53. We require that Tam, and S,’,,,-, are closed in the group being
considered for recovery and that safety is not violated by any transition (that starts in
a state in T122921) in that group (see the constraints of S3 in Figure 5.3). Subsequently,
we add 53 to Sfow. The goal of this step is to ensure that inﬁnite computations are

possible from all states in Sl’mu' This result is true about the initial value (Sﬁmt) of

I

10w. Moreover, this property continues to be true since there is an outgoing transition

from every state in S3.

We continue this calculation until no new states can be added to Sfow. At this
point, if So is nonempty, i.e., there are states from where recovery needs to be added
but no new recovery transitions can be added, we declare failure. Otherwise, we
identify the transitions of fault-tolerant program p’ by considering transitions of p—mt
that start in a state in Sfmu. Hence, our low atomicity algorithm is as shown in Figure

5.3.

Before we discuss the soundness and the complexity of
LowAtomicity-Enhancement, we ﬁrst make some observations about our low
atomicity algorithm. Then, we present three lemmas that are used in the soundness

proof. Similar to the proof in the high atomicity algorithm, we have

Observation 5.12 T’ g (T —- ms), T’ 0 ms 2 {}, and (p | T’) 0 mt = 0. El
Observation 5.13 S,’,,,-, g S, and {0w 0 ms 2 {}. El
Observation 5.14 mg, Q T. [:1
Observation 5.15 (p’ | T’) g (p | T’). D

In the main loop of the algorithm, So and 5'3 are subsets of T,’, Hence, the

igh '
I
[010

relations _C; T ,1, qh remains true throughout our algorithm. The value of T’ equals

72

the value of {0w when the loop terminates. Hence, we have
Observation 5.16 T’ Q T,’,,.g,, g T . [:1

Lemma 5.17 p’ [] f maintains spec from T’.

Proof. By construction, when T’ is assigned the value Sl’ow, the value of 32 is the
empty set. Thus, starting from a state in T’, p’ [] f cannot perturb p’ to a state that is
outside T’. It follows that T’ is closed in p’ [] f. Now, let c be a computation of p’[] f
that starts from a state in T’. Just as in the proof of Lemma 5.6, it can be shown

that each preﬁx of c maintains spec. Thus, p’ [] f maintains spec from T’. [:1

Lemma 5.18 p’ satisﬁes spec from S,’

nit '

Proof. Since Sfm-t is a subset of S, S’

mit

Q Sig... Q T’, (p’lT’) Q (pIT’) and

I
ini

every computation of p’ that starts from a state in S t is also a computation of p.
Hence, every computation of p’ that starts from a state in S,’,,,-, is in spec. Also, by

construction of p’, S’ is closed in p’. Thus, p’ satisﬁes spec from 5;,“ . E]

Lemma 5.19 Every computation of p’ that starts in a state in T’ is inﬁnite.

Proof. By construction of LACI, this property is true about S,’ Now, a state,

nit'
say s, is added to 53 only if there is a recovery transition, say t, from that state.
Moreover, when transitions of p’ are computed, the value of .32 is the empty set.
Hence, the group(s) of transitions containing t is included in p’. Thus, from every

state in T’, there is an outgoing transition in p’. It follows that every computation of

p’ that starts in a state in T’ is inﬁnite. [:1

Theorem 5.20 T’ is (also) an invariant of p’ for spec.

Proof. From Observation 5.15, every computation of p’ that starts in a state
in T’ is a computation of p. Thus, every computation of p’ that starts from a
state in T’ reaches a state in S. Thus, a computation of p from T is of the form
(so,sl,...,sn,s,,+1,...) where 3,, E 5. By Lemma 5.17, (so,sl, ...,s,,) maintains spec

and (3", 3M1, ...) is in spec. Now, similar to the proof in Theorem 5.8, we can show

73

that c is in spec. Thus, T’ is also an invariant of p’ for spec. [2]

Theorem 5.21 The algorithm Low_Atomicity-Enhancement is sound and its com-
plexity is polynomial in the state space of the nonmasking fault— tolerant program.
Regarding soundness, we have to show that the conditions of the enhancement

problem are satisﬁed.

1. T’ g T. (cf. Observation 5.16).
2. p'lT’ g plT’. (cf. Observation 5.15).

3. p’ is masking f-tolerant to spec from T’. By letting the fault-span to be T’

itself, the proof follows.

Regarding, complexity, we observe that the number of iterations for the main loop

are at most |T,’, | and each statement in the low atomicity algorithm requires only

igh
polynomial time. [I]
Modiﬁcations / Improvements for LowAtomicityEnhancement. There are

several improvements that can be made for the above algorithm. We discuss these

improvements and issues related to completeness below.

1. In the low atomicity enhancement algorithm, if the value of 32 is the empty set
then we can break out of the 100p before computing S3. Subsequently, we can
use value of SW at that time to compute p’ and T’. However, we continue in the
loop to determine whether recovery can be added from new states. This allows
the possibility that a larger fault-span is computed and additional transitions
are included in the masking fault-tolerant program. As mentioned in [1], if
the output of a synthesis algorithm is used as an input to another synthesis
algorithm, say to add fault-tolerance for a new fault, then it is desirable that
the fault-span and the transitions of the intermediate program be maximal. For
this reason, we have allowed the algorithm to expand the fault-span and to add

new transitions.

74

2. In the low atomicity enhancement algorithm, in the calculation of S3, we calcu-
late states from where recovery is possible. One heuristic is to focus on states in
So ﬁrst as recovery must be added from states in 52- If recovery from states in
So is not possible then other states in T,',,.g,,— {0w should be considered. However,
considering states in 82 alone may be insufficient as it may not be possible to
add recovery from those states in one step; adding recovery from other states

can help in recovering from states in S2.

3. Our algorithm is incomplete in that it may be possible to enhance the fault-
tolerance of a given nonmasking program although our algorithm fails to ﬁnd
it. One of the causes for incompleteness is in our calculation of LACI; when
LACI needs to remove states/ transitions to deal with grouping of transitions,
the choice is non-deterministic. Since this choice may be inapprOpriate, the
algorithm is incomplete. As we showed in Chapter 4 that adding failsafe
fault-tolerance to distributed programs is N P-complete, it is expected that the
complexity of a deterministic sound and complete algorithm for enhancing the

fault—tolerance of a distributed nonmasking program will be exponential unless

P=NP.

5.3.1 Example: Byzantine Agreement

We show how our algorithm for the low atomicity model is used to enhance the
fault—tolerance level of a nonmasking Byzantine agreement program to masking. First,
we present the nonmasking program, its invariant, its safety speciﬁcation, faults, the
fault-span for the given faults, and read / write restrictions. Finally, we show how our
algorithm is used to obtain the masking program (in [26]) for Byzantine agreement.
Variables for Byzantine agreement. The nonmasking program consists of
three non-general processes j, k,l and a general 9. Each non—general process has

three variables (1, f, and b. Variable d. j represents the decision of a non-general

75

process j, f. j denotes whether 3' has ﬁnalized its decision, and b. j denotes whether 3'
is Byzantine or not. Process 9 also has a variable d. g and b. 9. Thus, the variables in
the Byzantine agreement program are as follows:

a d.g : {0,1}

0 d.j,d.k,d.l : {0,1,1}

0 b.g,b.j,b.k,b.l : {true,false}

o f.j,f.k,f.l : {0,1}
Transitions of the nonmasking program. If process j has not copied a value
from the general, action N Bl copies the decision of the general. If j has copied a
decision and as a result d.j is different from J. then j can ﬁnalize its decision by action
N B2. If process j reaches a state, where its decision is not equal to the majority of
decisions and all the non-general processes have decided then j corrects its decision
by actions N B3 or N B4. Thus, the actions of each process j in the nonmasking

program are as follows:

NBl: d.j=1 A f.j=0 —> d.j:= d.g
NBZ:d.j7£_L/\f.j=0 —s f.j:=1
N83: (dj: 1) /\ (d.k=0) /\ (d.l=0) —> d.j:=0
NB4: (d.j=0) A (d.k= 1) A ((1.121) ——» d.j:=1

Safety speciﬁcation. The safety speciﬁcation requires that if g is Byzantine,
all the non-general processes should ﬁnalize with the same decision (agreement). If
g is not Byzantine, then the decision of every non-general non-Byzantine process
that has ﬁnalized should be the same as d.g (validity). Thus, safety is violated if
the program reaches a state in SS}, where (in this section, unless otherwise speciﬁed,

quantiﬁcations are on non-general processes)

SS} = (3p,q :: ﬂap /\ ﬁbg /\ d.p # .L /\ d.g 75 _L /\ d.p 75 d.g /\ f.p /\ f.q)

V (3p :: ﬁbg /\ -vb.p /\ d.p 3A _L /\ d.p # d.g /\ f.p)

76

Also, a transition violates safety if it changes the decision of a process after it has

ﬁnalized. Thus, the set of transitions that violate safety is equal to t3), where

t8, = {(so,sl) : 31 E 55)} U {(so,sl) : 3p :: —wb.p(so) /\ -:b.p(sl) /\ f.p(so) = 1
A ((1-MSU) 75 d-P(31) V f-P(30) 7e f~P(51))l

Invariant. The invariant of nonmasking Byzantine agreement is the state predicate
SNB = SNB, V SNBga where
SNB, = ﬁbg /\ (obj V -nb.k) /\ (ﬁbk V —wb.l) /\ (ﬁbl V -1b.j)
/\ (\7’p :: -b.p => (d.p = .J. V d.p = d.g)) /\ (Vp :: (ﬁbp /\ f.p) => (d.p 79 _L))
3N3, = b.g /\ obj /\ ﬁbk /\ -1b.l /\ (d.j 2 dis: 2 d.l /\ d.j 76 1)

Read/ Write restrictions. Each non—general process j is allowed to read {b. j ,
d.j, f.j, d.k, d.l, d.g}. Thus, j can read the d values of other processes and all its
variables. The set of variables that j can write is {d. j, f. j }

Faults for Byzantine agreement. A fault transition can cause a process to
become Byzantine if no process is initially Byzantine. A fault can also change the d
and f values of a Byzantine process. Thus, the fault transitions that affect j are as

follows (We include similar fault-transitions for k, l, and g):

F1 : -wb.g /\ -ab.j /\ -1b.k /\ —vb.l ——> b.j :2 true
F2:b.j —> d.j,f.j:= 0|1,0l1

Fault-Span. Starting from a state in S N 3,, if no process is Byzantine then a fault
transition can cause one process to become Byzantine. Then, faults can change d
and f values of the Byzantine process. Now, if the faults do not cause 9 to become
Byzantine then the set of states reached from S N B, is the same as S N 3,. However, if
the faults cause g to become Byzantine then (I and f values of non-general processes
may be arbitrary. Nonetheless, the b values of non-general processes will remain false.

Thus, the set of states reached from SNB, is (5N3, U (b.g /\ —1b.j /\ -wb.k /\ -1b.l)).

77

Starting from SNBz, no process can become Byzantine. Hence, the d values of
non-general processes will remain unchanged. It follows that the set of states reached
from SNB, is 5N3? Finally, since SNB2 is a subset of (b.g /\ obj /\ obk /\ ob.l), the
set of states reached from S N 3 is TNB, where

TNB = SNB, U (b.g /\ obj /\ obk /\ obl)

Application of our algorithm. First, we compute ms and mt that are needed by
our algorithm. Every fault transition originating at SS; reaches S3} because it only
affects the Byzantine process and the destination state will remain in 83,. Since the
destination of these fault transitions is 53), they violate the safety. Thus, the set of
states from where faults alone violate safety is equal to S S f, and as a result ms = SS).
Since t3; includes all the transitions that reach SS; (which is equal to ms) or violate
safety, mt = ts).

To calculate Téigh, we use the H AC1 function of our high atomicity algorithm.
This function removes deadlock states and states from where the closure of T,’,,.g,, is
violated by fault transitions. Since we have removed ms states and no fault transition
can reach a state in ms from a state outside ms, there exists no state from where the
closure of T,',,.g,, can be violated by fault transitions. Now, consider a state, say so,
where d.j=0, d.k=0, d.l: 1, b.l=false, and f.l= 1. Clearly, so is a deadlock state as
no process can execute a safe transition from so. Hence, such states must be removed
while obtaining T 1119»

Now, consider a state, say .91, where d.j = 1,d.k = 0,d.l 2 Lb! = false, and
f.l = 1. In state 31, only process j can execute a transition (by copying d.g) without
violating safety. However, if j copies the value of the general and d.g: O, the program
reaches a state that was removed earlier. Hence, such states must also be removed
while obtaining T (igh. Continuing thus, we remove all states where a process in the

minority has ﬁnalized its decision. In other words, T,’,,.g,, is equal to TNB-X, where

X = {s: (3p::f.p(8) = 1A(V(13P 3‘é CI 3 CAP“) 75 (11103))»

78

After this step, function LAC I returns 8’

2111!

= T2229). 0 SNB. Now, we trace two
iterations of the main loop in our algorithm in order to illustrate the way that our
algorithm works.

1. First iteration. To calculate SQ, we search for states in S’m-t from where

_S!

we can directly reach a state in T ,’, m, by fault transitions or by program

257/1

transitions. From S’ no program transition can reach a state that is outside

init?
St

21Lit'

However, from a state 8, where (o(d.j(s) =d.k(s) =d.l(s)) V(3p :: d.p(s) ==
_L)), a fault transition can cause the general to become Byzantine and then the
program is outside S3,“.

b.g(s) /\ (o(d.j(s) =d.k(s) =d.l(s)) V (3p :: d.p(s) = i)) /\ (Vp: (d.p(s) 75 _L) :>
(cl-MS) = d-9(8)))}-

Hence, in the ﬁrst iteration, 82 = {s : s 6 (Téigh—S’ ) :

zviit

Now, we compute S3. Consider a state, say so, where d.j = 0,d.k 2: 0,d.l =

I
irrit'

1, blzfalse, and f.l-—=O. In so, I can change d.l to 0 and reach a state in S
Hence, such states are included in So. Also, consider a state, say 31, where
d.j==_L,d.k.= 1,d.l= 1, and d.g: 1. In 31, process j can copy the value of d.g

I

mit. Therefore, in this iteration So 2 P1 U P2, where

and take the program to S

P1 = {8 = 8 6 ( lion-5.3...) 2 (3p: (d-p(8) 75 i) /\ (f-p(8) = 0) :
(Vq : (q 7e p) = (cl-(1(8) 7e 1) A d.p(s) 7e d.q<s>)>}. and
P2 = {3 3 5 E( bighusf'nit):

(3P : d-p(8)=i = (Vq : q 7e 19: ((1-(1(8) at J—) A (d-q(8)=d-g(8))))}

Then, we add So states to I’m”.

Remark. In the case of Byzantine agreement, the only states from where
recovery to S,’,,,-t can be achieved in a single step are the states of S3 in the ﬁrst
iteration. Every other recovery path includes these states as its ﬁnal step to

5!

1712t°

79

2. Second iteration. In the second iteration Sfow = S’

U S3 (S3 in the ﬁrst

mit

I

iteration). To calculate So in the second iteration, we search for states in law

from where we can directly reach a state in T Iiigh-Sl’ow by fault transitions or

by program transitions.

To calculate S2 in the second iteration, we need to calculate the set of states
in T,',,g,,—s;,,w that are reachable by a fault transition from Sl’ow. From the ﬁrst

iteration, we already know the set of states reachable from 3’ Thus, we only

Init'
need to calculate the states of T,’,,gh-Sl’ow that are reachable by a fault transition
from recently joined states (i.e., S3 = P1 UP2 of the ﬁrst iteration) to S,’0w. Since
in P1 the general process is Byzantine and all non-generals have decided, P1 is
closed in fault transitions. However, in a state in P2, since 9 is Byzantine,
faults may change the value of d.g and take the program outside Sfow. In these
states, the condition (3p : d.p: _L : (Vq : q 75 p : (d.g aé _L) /\ (d.g 71$ d.g)))
holds. Therefore, in this iteration, the program can reach states of 82 by a fault

transition, where

S2 = {318 E (Tiltigh—Sl’ow) 3
129(8) /\ (3p 2 d-p=i = (W = q 74 p: (M at _L) /\ ((1-61 at d-g)))}

To calculate So, we ﬁnd states from where recovery is possible to Sfow' Thus,
we search for states from where we can reach the states of 53 calculated in the
ﬁrst iteration. Hence, in this iteration, single—step recovery to {ow is possible

from S3, where

83 = {s = s E (Tiligh — to...) 1 (3p ; (d~p(8)=i) : (Vq : q ¢ 19 = (14(8) at i)) V
(31): ((1-12(8) at J—) /\ ((1-12(8)=d-9(8)) = (W = (1% p 2 d-q(8) =i))}

80

Continuing thus, we get the masking fault—tolerant Byzantine agreement; this

program is the same as that in [26]. The actions of this program are as follows:

MBl: d.j=_L A f.j=0 ——» d.j:= d.g
MBZ: d.j;«éJ. A f.j=0 A ((d.j=d.k)v(d.j=d.l))—> f.j:=1
MB3: (d.j: 1) A (d.k=0) A (d.le) A (f.j=0) ——> (1.32:0
MB4: (d.j=0) A (d.k= 1) A (d.lzl) A (f.j=0) ———> d.j:=1

5.4 Using Monotonicity for the Enhancement of

Fault-Tolerance
In this section, we illustrate how we use monotonicity of programs and speciﬁcations
to enhance the fault—tolerance of nonmasking fault-tolerant distributed programs to
masking fault-tolerance in polynomial-time (in the state space of the nonmasking
program). Towards this end, in Subsection 5.4.1, we present a theorem that identiﬁes
the sufficient conditions for enhancing the fault-tolerance of nonmasking programs in
polynomial time. Then, in Subsection 5.4.2, we present an example to illustrate the

application of the theorem presented in Section 5.4.1.

5.4.1 Monotonicity of Nonmasking Programs

In this section, our goal is to identify properties of programs and speciﬁcations where
enhancing the fault-tolerance of nonmasking fault-tolerant programs can be done
in polynomial time. Speciﬁcally, we present a theorem that identiﬁes the sufﬁcient
conditions for polynomial—time enhancement of the fault-tolerance of nonmasking
distributed programs to masking. As we have shown in Section 4.2, in general, adding
failsafe fault-tolerance to a distributed program is NP-complete. Thus, it is expected
that the enhancement problem is also NP-complete. Hence, we focus on the following

question:

81

 

Given is a nonmasking program, p, its speciﬁcation, spec, its invariant, S,
a class of faults f, and its fault-span, T:
Under what conditions can one derive a masking fault-tolerant program p’

from a nonmasking fault-tolerant program p in polynomial time?

To address the above question, we sketch a simple scenario where we can eas-
ily derive a masking fault-tolerant program from p. Speciﬁcally, we investigate the
case where we only remove groups of transitions of p that include safety-violating
transitions and the remaining groups of transitions construct the set of transitions
of the masking fault-tolerant program p’. However, removing a group of transitions
may result in creating states with no outgoing transitions (i.e., deadlock states) in
the fault-span T or the invariant S. In order to resolve deadlock states, we need
to add recovery transitions, and as a result, adding recovery transitions may create
non-progress cycles in (T — S). When we remove a non-progress cycle, we may create
more deadlock states. This way, removing a group of safety-violating transitions may
lead us to a cycle of complex actions of adding and removing (groups of ) transitions.

To address the above problem, we require the set of transitions of p to be structured
in such a way that removing safety-violating transitions (and their associated group of
transitions) does not create deadlock states. Towards this end, we deﬁne potentially
safe nonmasking programs as follows:

Deﬁnition. A nonmasking program p with the invariant S and the speciﬁcation

spec is potentially safe iff the following condition is satisﬁed.

Vso,31 :: ((so,sl) ¢ plS /\ ((so,sl) violates spec) )

:> ( 352 :: ((80,82) 6 p) /\ (80,82) does not violate spec) [I]

Moreover, we require that the removal of a safety-violating transition and its
associated group of transitions does not remove good transitions that are useful for

the purpose of recovery. Thus, if a transition violates the safety of spec then we require

82

that no good transition exists in its associated group of transitions. To address this
issue (i.e., safety-violating transitions are not grouped with good transitions), we
use the monotonicity property to deﬁne independent programs and speciﬁcations as
follows.
Deﬁnition. A nonmasking program p is independent of a Boolean variable a: on a
predicate Y iff p is both positive and negative monotonic on Y with respect to 3:. D

Intuitively, the above deﬁnition captures that if there exists a transition (so, .91) E
plY and (so, 51) belongs to a group of transitions g that is created due to inability of
reading a: then for all transitions (86, 3’1) 6 9 we will satisfy (36, 3’1) 6 plY, regardless of
the value of the variable a: in 86 and Si- Likewise, we define the notion of independence
for speciﬁcations as follows:
Deﬁnition. A speciﬁcation spec is independent of a Boolean variable x on a
predicate Y iff spec is both positive and negative monotonic on Y with respect to or.
Cl

Based on the above deﬁnition, if a transition (so, 31) belongs to a group of tran—
sitions g that is created due to inability of reading 1:, and (so, 31) does not violate
safety then no transition (36, 3’1) 6 9 will violate safety, regardless of the value of the
variable a: in sf, and s’l.

Now, using the above deﬁnitions, we present the following theorem.
Theorem 5.22 Given is a nonmasking fault-tolerant program p, its invariant S, its
fault-span T, faults f and f—safe speciﬁcation spec,
If p is potentially safe, and

VPj, :1: : P]- is a process in p, :1: is a Boolean variable such that Pj cannot read :1: :

spec is independent of :r on T
A The program consisting of the transitions of P] is independent of 1: on S

Then

A masking fault-tolerant program p’ can be derived from p in polynomial time.

83

Proof. Let (so, 31) be a transition of process Pj. We consider two case where

(30,31) 6 (pIS) 01" (30,31) ¢ (pIS)-

1. Let (so, 31) E (pIS) and :1: be a variable that P] cannot read. Since we consider
programs where a process cannot blindly write a variable, it follows that x(so)
equals $(sl). Now, we consider the transition (36,3’1) where 36 (respectively,
3’1) is identical to so (respectively, 31) except for the value of :23. Since p is

independent of :r on S, for every value of :r(so) we will have (36, 3’1) 6 (pIS).

Thus, we include the group associated with (so, 31) in the set of transitions of

I

p .

2. Let (so,sl) ¢ (pIS). Again, due to the inability of P,- to read :13, we consider
the transition (36, 3’1) where 36 (respectively, 3’1) is identical to so (respectively,
31) except for the value of :r. By the deﬁnition of spec independence, if (so, 31)
violates spec then regardless of the value of at every transition (36, 3’1) in the
group associated with (30,31) violates spec, and as a result, we exclude this

group of transitions in the set of transitions of p’.

p’ satisﬁes spec from S. Now, let p’ be the program that consists of the transitions
remained in pIT after excluding some groups of transitions. Since p’ IS equals pIS and
p satisﬁes spec from S, it follows that p’ satisﬁes spec from S in the absence of f.
Every computation preﬁx of p’[] f that starts at T maintains spec. Since we
have removed the safety-violating transitions in pIT, when f perturbs p to T every
computation preﬁx of p’[] f maintains safety of speciﬁcation.

Every computation of p’[] f that starts in T has a state in S. When we remove
a safety-violating transition (so, 31) E pIT, we actually remove all transitions (36, 3’1),
where 36 (respectively, 3’1) is identical to so (respectively, 31) except for the value of
2:. Note that since spec is independent of :r, all transitions (36, 3’1) that are grouped

with (so, 31) violate the safety of spec if (so, 31) violates the safety of spec. N ow, since

84

 

p is potentially safe, by deﬁnition, for every removed transition (so, 31) (respectively,
(36,3’1)) there exist safe transitions (so,s-2) (respectively, (36, 36)) that guarantee so
(respectively, 36) has at least one outgoing transition (i.e., so (respectively, 36) is not
a deadlock state). Thus, if we remove the safety-violating transitions then we will
not create any deadlock state in T. It follows that the recovery from T—S to S,
provided by the nonmasking program p, is preserved. Also, we have shown that p’
satisﬁes spec from S and every computation preﬁx of p’ [] f maintains spec. Therefore,

p’ is masking f-tolerant to spec from S. E]

5.4.2 Example: Distributed Counter
In this section, we present an example for enhancing the fault-tolerance of nonmasking
distributed programs to masking using the monotonicity property. Towards this end,
we ﬁrst introduce the nonmasking program, its invariant, its safety speciﬁcation, and
the faults that perturb the program. Then, we synthesize the masking fault-tolerant
program using Theorem 5.22.
Nonmasking program. The nonmasking program p represents an even counter.
The program p consists of two processes namely, Po and P1, where Po is responsible to
reset the least significant bit (denoted xo) whenever it is not equal to zero. And, P1 is
responsible to toggle the value of the most signiﬁcant bit (denoted 1:1), continuously.
Process Po can only read/ write :ro, P1 is able to read do and 31:1, and P1 can only write
161. The only action of Po is as follows:

Po : xo 74 0 ——+ $0 1: O

The following two actions represent the transitions of P1.

(21:1)A(a:o=0) ——+ 21:20

£120 ——+ 171221

For simplicity, we represent a state of the program by a tuple ('21, mo).
Invariant. Since the program simulates an even counter, we represent the invariant

of the program by the state predicate Sm E (To = 0).

85

 

Faults. Fault transitions perturb the value of :ro and arbitrarily change its value

from O to 1 and vice versa. The following action represents the fault transitions.

true -—> do := 0 | 1

Fault-span. The entire state space is the fault-span for faults that perturb cro.
Thus, we represent the fault-span of the program by the state predicate Tc“. E true.
Safety speciﬁcation. Intuitively, the safety speciﬁcation specifies that whenever
faults perturb the counter, the counting operation should stop until the program
returns to its invariant. In other words, the counter must not count from an odd
value to another odd value. We identify the safety of speciﬁcation speed, by the

following set of transitions that the program is not allowed to execute:

3P€Cctr = {(80.31) I (330(80) = 1)/\(5130(81)=1)A(5171(31)3Jé $1090)”

Observe that, p is potentially safe and speech. is f-safe.
The nonmasking program p is independent of $1 on Sm. For two arbitrary
transitions of Po, say (so, 31) and (36,3’1), that are grouped due to inability of Po to
read x1, we show that the nonmasking program is independent of cm on Sm. Towards
this end, we ﬁrst show that p is negative monotonic on Sm with respect to $1, and

then, we show that p is positive monotonic on Sm with respect to $1.

1. Negative monotonicity of p on Set, with respect to 1:1. Consider (so, 31),
where (x1(so) = 1) and (21(31) = 1). Since there is no transition (so, 31) in plS,
where (231(3o) = 1) and (231(31) = 1), p is negative monotonic on Sm with
respect to 3:1.

2. Positive monotonicity of p on Sm with respect to 21. Consider (so, 31),
where (11:1(so) = 0) and (21(31) = 0). Since there is no transition (so,sl) in
plS, where (2:1(so) = 0) and (21(31) 2 0), p is positive monotonic on Sm with

respect to $1.

86

 

As a result of the above argument, p is independent of $1 on Sm. Now, we show
that speech. is independent of 51:1 on the fault-span Tm.

For a given transition (so, 31) of process Po, we let (:ro(3o) = 1) and (aro(sl) = 0).
Since Po cannot read 2:1, the transition (so, 31) is grouped with a transitions (36, 3’1),
where the value of .171 remains unchanged in (36,3’1). Now, using the deﬁnition of
program monotonicity,
speech. is independent of .231 on Tc". Given two arbitrary transitions of Po, say
(so, 31) and (36, 3’1), that are grouped due to inability of Po to read :51, we show that

the speciﬁcation is both negative and positive monotonic on Too. with respect to 51:1.

1. Positive monotonicity of speed... Consider (30,31), where (1271(so) = 0)
and (231(31) = 0), and (so,sl) does not violate safety. If (131(36) = 1) and
(131(3’1) = 1) then (36, 3’1) will not violate safety (because the value of $1 does
not change during this transition). Since we have chosen (so, 31) and (36, 3’1)

arbitrarily, the speciﬁcation is positive monotonic on Too. with respect to 1:1.

2. Negative monotonicity of specctr. A similar argument shows that the

speciﬁcation is negative monotonic on Tm with respect to 331.

Based on the above discussion, the speciﬁcation is independent of 231 on Tm.
Masking fault-tolerant program. The nonmasking program presented in this
section is potentially safe. Also, process Po is independent of 2:1 on Sm. Moreover,
the speciﬁcation, speed, is f—safe and is independent of :51 on Tm. Therefore, using
Theorem 5.22, we can derive a masking fault-tolerant version of p in polynomial time.
In the synthesis of masking program, we remove the transition from (0,1) to (l, 1).

The action of Po remains as is, and the actions of P1 are as follows:

($1=1)/\(IL‘0=0) ——') 1131 2:0

(1'1=0)/\(IL'0=O) ——* 11312:].

87

5.5 Enhancement versus Addition

In this section, we compare the complexity of enhancement with adding masking fault-
tolerance. Speciﬁcally, we ﬁrst discuss enhancement in high atomicity with respect
to Add_Masking algorithm represented in Subsection 2.7.3. Subsequently, we compare
the complexity of these two algorithms for distributed programs (i.e., low atomicity

model).

Complexity of enhancement versus addition in high atomicity. Since
Add_Masking tries to add both safety and recovery simultaneously, it is more com-
plex than HighAtomicityEnhancement presented in this chapter. More speciﬁ-
cally, the asymptotic complexity of High_Atomicity-Enhancement is less than that of
Add_masking. Thus, if the state space of the problem at hand prevents the addition of
masking fault-tolerance to a fault-intolerant program, it may be possible to partially
automate the design of a masking fault-tolerant program by manually designing a
nonmasking fault-tolerant program and enhancing its fault-tolerance to masking us-
ing automated techniques.

The algorithm High_Atomicity-Enhancement adds safety to a nonmasking fault-
tolerant program while ensuring that the recovery provided by it continues to be
satisﬁed. We note that the asymptotic complexity of HighAtomicity-Enhancement
is the same as the complexity of adding failsafe fault-tolerance to a fault-intolerant
program. In other words, in High_Atomicity_Enhancement, the recovery is preserved

for free!

Complexity of enhancement versus addition in low atomicity. We com-
pare the cost of adding masking fault-tolerance to a fault-intolerant distributed pro-
gram and the cost of enhancing the fault-tolerance of a nonmasking fault-tolerant
distributed program to masking. Asymptotically speaking, adding masking (re-
spectively, failsafe) fault-tolerance to a fault-intolerant distributed program is N P-

complete [1, 31]. Therefore, it is expected that the enhancement problem ——that adds

88

safety while preserving recovery— for distributed programs will also be NP-complete.

Although the enhancement problem may not provide relief in terms of the worst-
case complexity, we ﬁnd that it helps in developing heuristics that determine if safe
recovery is possible from states that are reached in the presence of faults. More
speciﬁcally, consider a state, say 3, that is reached in a computation of the fault-
intolerant program in the presence of faults. While adding masking fault-tolerance
to a fault—intolerant program, we need to exhaustively search all possible transition
sequences from s to determine if recovery from s is possible. By contrast, while
enhancing the fault-tolerance of a nonmasking fault-tolerant program, we reuse the
recovery provided by the nonmasking fault-tolerant program. Hence, we need to
check only the transition sequences that the nonmasking fault—tolerant program can
produce. It follows that deriving heuristics that determine if safe recovery is possible

from a given state is simpler in the enhancement problem.

The enhancement problem also allows us to deduce additional information about
states by reasoning in the high atomicity model. We illustrate this by one example
that occurs in Byzantine agreement. Consider a state so where all processes are non-
Byzantine, d.j=d.k=_L, d.g: 1, d.l=1 and f.l-=0. Let 31 be a state that is identical
to so except that the value of f.l in 31 is 1. Now, consider the transition (so, 31). Note
that both so and 31 are in the invariant, S N 3. Hence, for a synthesis algorithm, this
appears as a good transition that should be retained in the fault-tolerant program.
However, from 31, if 9 becomes Byzantine and changes d.g, we can reach a state where

d.g, d. j , and d.l: become 0. The resulting state is a deadlock state.

While adding masking fault-tolerance to a fault-intolerant program, it is difﬁcult to
check that all computations that (1) start from s], (2) in which 9 becomes Byzantine,
and (3) in which 9 changes d.g to 0 are deadlock states. Moreover, if we ignore
the grouping restrictions imposed by the low atomicity model, i.e., if we could read

and write all variables in one atomic step then recovery would be possible from 31.

89

However, in the context of the enhancement problem, we concluded that even in the
high atomicity model, we could not recover from state 31 by reusing the transitions
of the nonmasking fault-tolerant program.

We expect that such high atomicity reasoning will play an important role in re-
ducing complexity in the enhancement problem. To reduce the complexity of adding
fault—tolerance in the low atomicity model, it is desirable to reason about the input
program in the high atomicity model, obtain a high atomicity masking fault-tolerant
program, and modify that high atomicity masking fault-tolerant program so that
the restrictions of the low atomicity model are satisﬁed while preserving the masking
fault-tolerance. As the Byzantine agreement example illustrates, this approach can be
followed while enhancing the fault-tolerance of a nonmasking fault-tolerant program.
However, this approach could not be used while adding masking fault-tolerance to a

fault-intolerant program.

5.6 Summary

In this chapter, we deﬁned the problem of enhancing the fault-tolerance level of a
nonmasking program to masking. This problem separates ( 1) the task of adding re-
covery, and (2) the task of maintaining the safety speciﬁcation during recovery. For
the high atomicity model, we presented a sound and complete algorithm for the en-
hancement problem. We showed that the complexity of our high atomicity algorithm
is asymptotically less than Add-Masking algorithm (cf. Subsection 2.7.3). For dis-
tributed programs, we presented a sound algorithm for the enhancement problem.
We also showed that our fault-tolerance enhancement algorithm for distributed pro-
grams resolves some of the difﬁculties encountered in adding safe recovery transitions
in [14].

As an illustration of our algorithms, we showed how masking fault-tolerant pro-

grams for TMR (in high atomicity model) and Byzantine agreement (for distributed

90

programs) can be designed by enhancing the fault-tolerance of the corresponding
nonmasking programs. We chose these examples as masking fault-tolerant versions
of these programs have been manually designed from the corresponding nonmasking
fault-tolerant versions [32]. The results in this chapter show that those enhancements
can in fact be automated as well.

Also, we argued that enhancing the fault-tolerance of a distributed program is
simpler than adding masking fault-tolerance to its fault-intolerant version. We vali-
dated this result by comparing the derivation of a masking fault-tolerant Byzantine
agreement program from the corresponding fault-intolerant version and from the cor-
responding nonmasking version.

Moreover, we have used the monotonicity property (presented in Section 4.3)
to identify sufﬁcient conditions under which the enhancement of fault-tolerance can
be done in polynomial time. Speciﬁcally, we presented a sufﬁciency theorem and we
enhanced the fault-tolerance of a distributed counter to masking fault-tolerance using

our sufﬁciency theorem.

91

Chapter 6

Pre—Synthesized Fault-Tolerance

Components

In this chapter, we present a synthesis approach that adds pre-synthesized fault—
tolerance components to a given fault—intolerant program in the synthesis of its fault-
tolerant version. Techniques presented in [14] and Chapters 4 and 5 respectively
reduce the complexity of synthesis by using heuristics and by identifying classes of
programs and speciﬁcations for which efﬁcient synthesis is possible. However, these
techniques cannot apply the lessons learnt in synthesizing one fault-tolerant program
while synthesizing another fault-tolerant program. The synthesis method presented
in this chapter allows us to recognize the patterns that we often apply in the synthesis
of fault-tolerant distributed programs. Then, we organize those patterns in terms of

fault-tolerance components and reuse them in the synthesis of new problems.

To investigate the use of pie-synthesized fault-tolerance components in the syn-
thesis of fault-tolerant programs from their fault-intolerant version, we use detectors
and correctors identiﬁed in [33, 10]. Speciﬁcally, in [33, 10], Arora and Kulkarni
have shown that detectors and correctors sufﬁce in the manual design of a rich class

of fault-tolerant programs. Hence, we expect to beneﬁt from the generality of such

92

components in automated synthesis of fault—tolerant programs. Thus, in this chapter,
we present a synthesis approach that adds pre—synthesized detectors and correctors
to a given fault-intolerant program in order to synthesize its fault-tolerant version.
In particular, we focus on adding masking fault—tolerance where we address issues
regarding the representation, the speciﬁcation, and the addition of pre-synthesized
fault-tolerance components. In general, our synthesis method is applicable for adding

failsafe and nonmasking fault-tolerance as well.

As a running example, we synthesize a token ring program that consists of 4
processes and is subject to process-restart faults. The masking fault—tolerant (token
ring) program can recover even from the situation where every process is corrupted.
We note that the previous approaches that added fault-tolerance to the token ring

program presented in this chapter assumed that at least one process is not corrupted.

We proceed as follows: in Section 6.1, we formally state the problem of adding
fault-tolerance components to fault-intolerant programs. Then, in Section 6.2, we
present a synthesis method that identiﬁes when and how the synthesis algorithm de-
cides to add a component. Subsequently, in Section 6.3, we formally describe how we
represent a fault-tolerance component. In Section 6.4, we show how we automatically
specify a component and add it to a program. In Section 6.5, we show how we reuse
a linear pre-synthesized component in the synthesis of an alternation bit protocol.
Afterwards, in Sections 6.6, we apply our synthesis method for adding nonmasking
fault-tolerance to a diffusing computation program with a tree-like structure where we
show that our synthesis method is applicable for programs with hierarchical topolo-
gies. In Section 6.7, we address some of the questions raised by the synthesis method

presented in this chapter. Finally, we summarize our discussion in Section 6.8.

93

6. 1 Problem Statement

In this section, we formally deﬁne the problem of adding fault-tolerance components
to a fault-intolerant program. We identify the conditions of the addition problem by
which we can verify the correctness of the synthesized fault-tolerant program after

adding fault-tolerance components.

Given a fault-intolerant program p, its state space Sp, its invariant S, its speciﬁca-
tion spec, and a class of faults f, we add pre—synthesized fault-tolerance components
(i.e., detectors and correctors) to p in order to synthesize a fault-tolerant program p’
with the new invariant S’. When we add a fault-tolerance component to p, we also
add the variables associated with that component. As a result, we expand the state
space of p. The new state space, say Sp], is actually the state space of the synthesized

fault-tolerant program p’.

After the addition, we require the fault-tolerant program p’ to behave similar to
p in the absence of faults f. In the presence of faults f, p’ should satisfy masking
fault-tolerance. To ensure the correctness of the synthesized fault—tolerant program
in the new state space, we need to identify the conditions that have to be met by the
synthesized program, p’. Towards this end, we deﬁne a projection from Sp, to 3,, using
onto function H : 510' ——+ Sp. We apply H on states, state predicates, transitions, and

groups of transitions in Sp: to identify their corresponding entities in Sp.

Let the invariant of the synthesized program be S’ Q Sp]. If there exists a state
36 E S’ where H (36) ¢ S then in the absence of faults p’ can start at 36 whose image,
H (56), is outside S. As a result, in the absence of faults, p’ will include computations
in the new state space Sp: that do not have corresponding computations in p. These
new computations resemble new behaviors in the absence of faults, which is not
desirable. Therefore, we require that H (S’ ) Q S. Also, if p’ contains a transition
(36, 3’1) in p’ [5’ that does not have a corresponding transition (so, 31) in p[H(S’) (where

H (36) = so and H (3’1) = 31) then p’ can take this transition and create a new way for

94

satisfying spec in the absence of faults. Therefore, we require that H (p’ [5’ ) Q p|H (S ’ )

Now, we present the problem of adding fault-tolerance components to p.

The Addition Problem.

Given p, S, spec, f, with state space Sp such that p satisﬁes spec from S,
Sp! is the new state space due to adding fault—tolerance components to p,
H : Sp, -—> 3,, is an onto function,

Identify p’ and S’ Q Spr such that
H (3') Q S ,
H(p’IS’) C; p|H(S’), and

p’ is masking f-tolerant for spec from S’. E]

6.2 The Synthesis Method

In this section, we present a synthesis method to solve the addition problem of Section
6.1. In Section 6.2.1, we present a high level description of our synthesis method
and express our approach for combining heuristics from [14] (cf. Section 6.2.2 for
an example heuristic) with pre—synthesized components. Then, in Section 6.2.2, we
illustrate our synthesis method using a simple example, a token ring program with
4 processes. We use the token ring program as a running example in the rest of the
chapter, where we synthesize a token ring program that is masking fault—tolerant to

process-restart faults.

6.2.1 Overview of Synthesis Method

Our synthesis method takes as its input a fault-intolerant program p with a set of
processes Po - - - R, (n > 1), its speciﬁcation spec, its invariant S, a set of read/write
restrictions ro - - ~rn and mo - - own, and a class of faults f to which we intend to add
fault-tolerance. The synthesis method outputs a fault-tolerant program p’ and its

invariant S’.

95

The heuristics in [14] (i) add safety to ensure that the masking fault-tolerant
program never violates its safety speciﬁcation, and (ii) add recovery to ensure that the
masking fault-tolerant program never deadlocks (respectively, livelocks). Moreover,
while adding recovery transitions, it is necessary to ensure that all the groups of
transitions included along that recovery transition are safe unless it can be guaranteed
(with the help from heuristics) that those transitions cannot be executed. Thus,
adding recovery transitions from deadlock states is one of the important issues in
adding fault-tolerance. Hence, the method presented in this chapter, focuses on
adding pre—synthesized components for resolving deadlock states.

Now, in order to resolve a deadlock state, say 34, using our hybrid approach, we
proceed as follows: First, for each process P,- in the given fault-intolerant program, we
introduce a high atomicity pseudo process PSi. Initially, PS,- has no action to execute,
however, we allow PS, to read all program variables and write only those variables
that P,- can write. Using these special processes, we now present the ResolveDeadlock
routine (cf. Figure 6.1) that is the core of our synthesis method. The input of
ResolveDeadlock consists of the deadlock state that needs to be resolved, 3d, and the

set of high atomicity pseudo processes PS,- (0 S i S n).

 

ResolveDeadlock(sd: state, PSo, - - ~ , PS7,: high atomicity pseudo process)

Step 1. If Add_Recovery (3d) then return true.

Step 2. Else non-deterministically choose a PSz-ndex, where 0 3 index 3 n and PSindex
adds a high atomicity recovery action grd ——* st

Step 3. If (there exists a PSmdel.) and (there exists a detector (1 in the component
library that sufﬁces to reﬁne grd -+ st without interfering with the program)
then add d to the program, and return true.
else return false.

// Subsequently, we remove some transitions to make 3,) unreachable.

 

 

 

Figure 6.1: Overview of the synthesis method.

First, in Step 1, we invoke a heuristic-based routine Add_Recovery to add recovery

from 3,, under the distribution restrictions (i.e., in the low atomicity model) — where

96

program processes have read / write restrictions with respect to the program variables.
Add.Recovery explores the ability of each process P,- to add recovery transition from
3,; under the distribution restrictions. If Add_Recovery fails then we will choose to

add a fault-tolerance component in Steps 2 and 3.

In Steps 2 and 3, we identify a fault-tolerance component and then add it to p
in order to resolve 3d. To add a fault-tolerance component, the synthesis algorithm
should (i) specify the required component; (ii) retrieve the speciﬁed component from
a given library of components; (iii) ensure the interference freedom of the composition
of the component and the program, and ﬁnally (iv) add the extracted component to
the program. As a result, adding a pre—synthesized component is a costly opera-
tion. Hence, we prefer to add a component during the synthesis only when available

heuristics for adding recovery fail in Step 1.

To identify the required fault—tolerance components, we use pseudo process PS,-
that can read all program variables and write w,- (i.e., the set of variables that H can
write). In other words, we check the ability of each PS,- to add high atomicity recovery
— where we have no read restrictions — from 30;. If no PS, can add recovery from 3,,
then our algorithm fails to resolve 34. If there exist one or more pseudo processes
that add recovery from 34 then we non-deterministically choose a process PSz-ndex
with high atomicity action ac : grd —* st. Since we give PSmdex the permission to
read all program variables for adding recovery from 3d, the guard grd is a global state
predicate that we need to reﬁne. If there exists a detector that can reﬁne grd without
interfering with the program execution then we will add that detector to the program.
(We present the discussion about how to specify the required detector (1 and how to

add d to the fault—intolerant program in Sections 6.3 and 6.4.)

In cases where ResolveDeadlock returns false, we remove some transitions to
make 3,, unreachable. If we fail to make so unreachable then we will declare failure

in the synthesis of the masking fault-tolerant program p’. Observe that by using pre-

97

synthesized components, we increase the chance of adding recovery from so, and as a
result, we reduce the chance of reaching a point where we declare failure to synthesize

a fault-tolerant program.

6.2.2 Token Ring Example

In this subsection, we introduce a token ring program with 4 processes that is subject
to process restart faults. Using our synthesis method (cf. Figure 6.1), we synthesize
a token ring program that is masking fault—tolerant for the case where all processes
are corrupted.

The token ring program. The fault-intolerant program consists of four processes
Po, P1, P2, and P3 arranged in a ring. Each process P,- has a variable 2:,- (O S i S 3)
with the domain {1,0,1}. Due to distribution restrictions, process P, can read I,
and 35,--1 and can only write 2:,- (1 S i S 3). Po can read zoo and 2:3 and can only
write do. We say, a process P,- (1 g i g 3) has the token iff $.- 7$ 2,4 and fault
transitions have not corrupted P,- and Pi_1. And, Po has the token iff 1133 2 :co and
fault transitions have not corrupted Po and P3. A process P,- (1 _<_ i S 3) copies :r,_1
to r,- if the value of .73,- is different from 33-1. Also, if do = (133 then process Po copies
the value of (2:3 63 1) to do, where 89 is addition in modulo 2. This way, a process
passes the token to the next process.

We represent a state 3 of the token ring program by a 4-tup1e (120, 1:1, 3:2, 2:3). Each
element of the 4—tuple ((130, 321, 332,13) represents the value of z, in s (0 S i S 3). Thus,
if we start from initial state (0, 0, 0,0) then process Po has the token and the token
circulates along the ring. We represent the transitions of the fault-intolerant program

TR by the following actions (1 g i _<_ 3).

TRo: (:ro =1) A (11:3 =1) ——§ I130 z: 0; TR,: (1:1: 0) A (:r,--1 = 1) —+ 3:,- := 1;
TR6: (:ro=0) A ($320) -—>:ro:= 1; TR;: (113,-:1) A (1r,_1=0)——>a:,-:=0;
Faults. Faults can restart a process H. Thus, the value of it,- becomes unknown.

98

Hence, we model faults by setting the value of 1',- to an unknown value 1.

Speciﬁcation. The problem speciﬁcation requires that the corrupted value of one
process does not affect a non-corrupted process, and there is only one process that

has the token.

Invariant. The invariant of the above program includes states (0, 0, 0, 0), (1,0, 0, 0),
(1, 1,0,0), (1, 1,1,0), (1,1, 1, 1), (O, 1, 1, 1), (0,0, 1, 1), and (0,0,0, 1).

A heuristic for adding recovery. In the presence of faults, the program TR may
reach states where there exists at least a process P,- (0 S i S 3) whose :16,- is corrupted
(i.e., 2:,- = _L). In such cases, processes P, and PW“) mod 4) cannot take any transition,

and as a result, the propagation of the token stops (i.e., the whole program deadlocks).

In order to recover from the states where there exist some corrupted processes, we
apply the heuristic for single—step recovery from [14] in an iterative fashion. Speciﬁ-
cally, we identify states from where single-step recovery to a set of states RecoverySet
is possible. The initial value of RecoverySet is equal to the program invariant. At
each iteration, we include a set of states in RecoverySet from where single-step re-

covery to RecoverySet is possible.

In the ﬁrst iteration, we search for deadlock states where there is only one cor-
rupted process in the ring. For example, consider a state so = (1,1, 1,0). In state
so, P1 and P2 cannot take any transitions. However, P3 can copy the value of 1:2
and reach 32 = (1,1, 1,1). Subsequently, Po changes 1‘0 to 0, and as a result, the
program reaches state 33 = (0,1,1,1). The state 33 is a deadlock state since no
process can take any transition at 33. To add recovery from 33, we allow P1 to correct
itself by copying the value of do, which is equal to 0. Thus, by copying the value
of do, P1 adds a recovery transition to an invariant state (0,0, 1,1). Therefore, we
include 33 in the set of states RecoverySet in the ﬁrst iteration. Note that this recov-
ery transition is added in low atomicity in that all the transitions included in action

(1‘0 2 0) A (1:1 2 i) —+ x1 := 0 can be included in the fault-tolerant program without

99

violating safety.

In the second and third iterations, we follow the same approach and add recovery
from states where there are two or three corrupted processes to states that we have
already resolved in the previous iterations. Adding recovery up to the fourth iteration
of our heuristic results in the intermediate program I TR (1 g i S 3).

ITRo: ((xo=l)V(:z:o=_L)) A (37321) ——>:ro:= 0;

(TBS: ((330 = 0) V (1‘0 = l)) A (1‘3 = 0) '—+ 130 == 1;

ITR,: (((1:.-= 0) V (3:,- = 1)) A ($1-1 = 1) —. 11:,- := 1;
ITR2: ((23,- : 1) V (so.~ = _L)) A (27,4 = 0) —+ 2:, z: 0;

Using above heuristic, we can only add recovery from the states where there exists
at least one uncorrupted process. If there exists at least one uncorrupted process P]-
(0 S j S 3) then P((j+1)mod 4) will initiate the token circulation throughout the
ring, and as a result, the program recovers to its invariant. However, in the fourth
iteration of the above heuristic, we reach a point where we need to add recovery
from the state where all processes are corrupted; i.e., we reach the program state
.53 = (_L, .L, _L, _L). In such a state, the program I TR deadlocks as an action of the
form (do = .L)A(xl = .l.) —> 5101 := 0 cannot be included in the fault-tolerant program.
Such an action can violate safety if 1:2 and 333 are not corrupted. In fact, no process
Can add safe recovery from 33 in low atomicity. Thus, Add-Recovery returns false for
(1,1,1,1).
Adding the actions of the high atomicity pseudo process. In order to add
IIlasking fault-tolerance to the program I TR, a process Pindex (0 5 index _<_ 3) should
Set its a: value to 0 (respectively, 1) when all processes are corrupted. Hence, we
f()llow our synthesis method (cf. Figure 6.1), where the pseudo process PSo takes the

high atomicity action H TR and recovers from 33. Thus, the actions of the masking
program MTR are as follows (1 g i _<_ 3).
MTRo: ((xo=1)V(:1:o=.L)) A (1:321) ——>:ro:= 0;

100

MTRg: ((xo _ 0) v (:ro _ 1)) A (2:3 2 0) ——3 170:: 1;
MTR.: ((22.- =0)V(:r, = )) A (2..-, = ) .33, = 1,
MTRgz ((11:,=1)V(:r.= )) A (an—1 = 0) __. 1r,- 2: 0;
HTR: (.ro=_L)A(:1:1—_L)A(T2—.L)A(:r3=L)——>afo:—-0;

In order to reﬁne the high atomicity action H T R, we need to add a detector that
detects the state predicate (do = 1) /\ (.231 = 1) A (12 = _L) A (51:3 2 _L). In Section
6.3, we describe the speciﬁcation of fault-tolerance components, and we show how we
use a distributed detector to reﬁne high atomicity actions.

Remark. Had we non—deterministically chosen to use PS,- (i # 0) as the process that

adds the high atomicity recovery action then the high atomicity action H TR would

have been different in that H TR would write x,» (We refer the reader to Section 6.7

for a discussion about this issue.)

6.3 Specifying Pre—Synthesized Components
In this section, we describe the speciﬁcation of fault-tolerance components (i.e., de-
tectors and correctors). Speciﬁcally, we concentrate on detectors and we consider

a special subclass of correctors where a corrector consists of a detector and a write

action on the local variables of a single process.

6.3.1 The Speciﬁcation of Detectors

We recall the speciﬁcation of a detector component presented in [34, 33]. Towards
this end, we describe detection predicates, and witness predicates. A detector, say d,
identiﬁes whether or not a global state predicate, X, holds. The global state predicate
X is called a detection predicate in the global state space of a distributed program
[34, 33].

It is often difﬁcult to evaluate the truth value of X in an atomic action. Thus,
We (i) decompose the detection predicate X into a set of smaller detection predicates

X0 -~Xn where the compositional detection of Xo - - -X,, leads us to the detection

101

of X, and (ii) provide a state predicate, say Z, whose value leads the detector to
the conclusion that X holds. Since when Z becomes true its value witnesses that
X is true, we call Z a witness predicate. If Z holds then X will have to hold as
well. If X holds then Z will eventually hold and continuously remain true. Hence,
corresponding to each detection predicate X ,, we identify a witness predicate Z,- such
that if Z,- is true then X, will be true.

The detection predicate X is either the conjunction of X,- (0 S i S n) or the
disjunction of X,. Since the detection predicates that we encounter represent deadlock
states, they are inherently in conjunctive form where each conjunct represents the
valuation to program variables at some process. Hence, in the rest of this chapter,
we consider the case where X is a conjunction of X;, for 0 S i S n.

Speciﬁcation. Let X and Z be state predicates. Let ‘Z detects X’ be the problem

speciﬁcation. Then, ‘Z detects X’ stipulates that

0 (Safety) When Z holds, X must hold as well.

0 (Liveness) W’ hen the predicate X holds and continuously remains true, Z will

eventually hold and continuously remain true. [I

We represent the safety speciﬁcation of a detector as a set of transitions that a
detector is not allowed to execute. Thus, the following set of transitions represents

the safety speciﬁcation of a detector.

speed 2 {(30,31) : (2(31) A oX(31))}

6-3.2 The Representation of Detectors

I n this section, we describe how we formally represent a distributed detector. While
Our method allows one to use detectors of different topologies (cf. Section 6.4.1), in
t31113 section, we comprehensively describe the representation of a linear (sequential)

detector as such a detector will be used in our token ring example.

102

The composition of detectors. A detector, say d, with the detection predicate
X .=_ Xo A. . . AX" is obtained by composing d,, O S i S n, where d,- is responsible for
the detection of X,- using a witness predicate Z,- (0 S i S n). The elements of d can
execute in parallel or in sequence. More speciﬁcally, parallel detection of X requires
do - - ~dn to execute concurrently. As a result, the state predicate (Zo A - -- A Zn) is
the witness predicate for detecting X.

A sequential detector requires the detectors do, - -- ,dn to execute one after an-
other. For example, given a linear arrangement (1,, - . -do, a detector d, (0 S i < n)
detects its detection predicate, using 2,, after (1,-+1 witnesses. Thus, when Z, be—
comes true, it shows that Z,-+1 already holds. Since when Z, becomes true X,- must
be also true, it follows that the detection predicates Xn - - ~X,- hold. Therefore, we
can atomically check the witness predicate Zo in order to identify whether or not
X E (Xn A---AXo) holds.

The detection of global state predicates of programs that have a hierarchical topol-
og (e.g., tree-like structures) requires parallel and sequential detectors. In this sec-
tion, we demonstrate our method in the context of a linear detector as such a detector
suffices for the token ring example. In Section 6.6, we apply our synthesis method for

the synthesis of a diffusing computation program using components with hierarchical
tOpology.
A linear detector. We consider a detector d with linear topology. The detector
d consists of n + 1 elements (72 > 0), its speciﬁcation speed, its variables, and its
invariant U. Since the structure of the detector is linear, without loss of generality,
We consider an arrangement dn- - -do for the elements of the distributed detector,
Where the left-most element is dn and the right-most element is do.
Component variables. Each element d,, 0 S i S n, of the detector has a Boolean
Variable y,.

Read/ write restrictions. Element (1, can read y; and yi+1, and can only write y,

103

(0 S i < n). dn reads and writes yn. Also, d,- is allowed to read all variables that P,-
can read (i.e., the process with which d,- is composed).

Witness predicates. The witness predicate of each (1,, say Z,, is equal to (y, =
true).

The detector actions. The actions of the linear detector are as follows (0 S i < n).

DA" : (LCn) A (yn = false) ——> y." := true;
DA,- : (L0,) A (y, = false) A (y,-+1 = true) —2 y, := true;

Using action DA,- (0 S i < n), each element d,- of the linear detector witnesses
(i.e., sets the value of y,- to true) whenever (i) the condition LC,- becomes true, where
LC,- represents a local condition that d,- atomically checks (by reading the variables
of R), and (ii) its neighbor d,+1 has already witnessed. The detector (1,, witnesses
(using action DA") when LC" becomes true.

Detection predicates. The detection predicate X,- for element d,- is equal to
(L0,, A - - - A LC.) (0 S i S n). Therefore, do detects the global detection predicate
LCnAu-ALCo.

Invariant. During the detection, when an element d, sets y, to true, the elements
d,-, for i < j S n, have already set their y values to true. Hence, we represent the

invariant of the linear detector by the predicate U, where

U={s:(Vi:(OSiSn):(y.(8)=>(Vj=(i<an)=LCj))}

Faults. We model the fault transitions that affect the linear detector using the
following action (cf. Section 6.7 for a discussion about the way that we have modeled

the faults).

F : true ——> y,- := false;

Theorem 6.1 The linear detector is masking F -tolerant for ‘Z detects X ’ from U.

104

Proof. The linear detector satisﬁes ‘Z detects X’ from U. Also, in the presence of
F, no element (1,- (0 S i S n) of the detector will reach a state where d,- witnesses
incorrectly. As a result, the linear detector never violates the safety of ‘Z detects
X’ in the presence of F. Also, when faults stop occurring, the actions of the linear
detector correct the corrupted values of y,- if necessary. Thus, every computation of
the linear detector in the presence of F will eventually reach a state in U. Therefore,

the linear detector component is masking F -tolerant for ‘Z detects X’ from U. [3

6.3.3 Token Ring Example Continued
In Section 6.2.2, we added the following high atomicity action to the token ring
program I TR that is executed by the pseudo process PSo.

HTR: (x021)A(:cl=1)A(:rg=J_)A(at3=_L) —+ $0220

In order to synthesize a distributed program (that includes low atomicity actions),
we need to reﬁne the guard of the above action. The read/write restrictions of the
processes in the token ring program identify the underlying communication topology
of the fault—intolerant program, which is a ring. Hence, we select a linear detector,
d, so that we can organize its elements, d3, d2, d1,do, in the ring. Each detector d2
is responsible to detect whether or not the local conditions LC3 to LC,- hold (LC,- E
(”~731- = _L)), for 0 S i S 3. Thus, the detection predicate X,- is equal to ((:r3 =
J.) A - - - A (51:,- = _L)), for 0 S i S 3. As a result, the global detection predicate of the

linear detector is ((:r3 = _L) A ($2 2 _L) A (331 = .L) A (do = 1)). The witness predicate
0f each d,, say Z), is equal to (y.- = true), and the actions of the sequential detector
are as follows (0 S i S 2).

DA3 : (x3 = _L) A (1);; = false) ——+ 313 3= true;

DA,- : (x,- = 1) A (y, = false) A (y,+1 = true) ——v y, :: true;

Note that we replace (LG) with (:13,- = _L) in the above actions. During the

Syll thesis, after the synthesis algorithm acquires the actions of its required component,

105

it replaces each (LC,) with the appropriate condition in order to create the transition

groups corresponding to each action of the component.

6.4 Using Pre-Synthesized Components

In this section, we describe how we perform the second and the third step of our
synthesis approach presented in Figure 6.1. In particular, in Section 6.4.1, we show
how we automatically specify the required components during the synthesis. Then, in
Section 6.4.3, we show how we ensure that no interference exists between the program
and the fault-tolerance component. Afterwards, we present an algorithm for the
addition of fault—tolerance components. In Sections 6.4.2 and 6.4.4, we respectively
present the algorithmic speciﬁcation and the algorithmic addition of a linear detector

to the token ring program.

6.4.1 Algorithmic Speciﬁcation of the Fault-Tolerance Com-

ponents

We present the Componentﬁpecification algorithm (cf. Figure 6.2) that takes a dead-
lock state 33, the distribution restrictions (i.e., the read/write restrictions) of the
program being synthesized, and the set of high atomicity pseudo processes PS,-
(0 S i S n). First, the algorithm searches for a high atomicity process PSmdex
that is able to add a high atomicity recovery action, ac : grd —-> 3t, from 33 to a state
in the state predicate 5.80, where Sm represents the set of states from where there
exists a safe recovery path to the invariant. Also, we verify the closure of Sm. U 33 in
the computations of p[] f. If there exists such a process PSindex then the algorithm
returns a triple (X, R, inderr), where (i) X is the detection predicate that should be
reﬁned in the reﬁnement of the action ac; (ii) R is a relation that represents the
topology of the program, and (iii) the index is an integer that identiﬁes the process
that should detect grd and execute st.

The Component-Specification algorithm constructs the state predicate X using

106

the LC,- conditions. Each LC,- condition is by itself a conjunction that consists of
the program variables readable for process Pi. Therefore, the predicate X will be the

conjunction of LC,- conditions (0 S i S n).

 

ComponentSpeciﬁcation(sd: state, Sr“: state predicate, PSo, - - - , PSn: high atomicity pseudo
process, spec: safety speciﬁcation, ro, - - -,rn: read restrictions, wo, - - . ,wn: write restrictions)
{ // n is the number of processes.
if ( Bindex : 0 S index S n : (33 : 3 E Srec : (3d, 3) E PSmdex A ((33, 3) does not violate spec) A
(Vx : (x(sd) 7é 33(3)) : x E wmde,» )
then X := A?=0(LC.-), where LC,- = (A"‘|(x = x(sd)));
R={(2’.J'>:(OSiSn)A(OsJ'sn)=w. gm};
return X, R, index;
else return false, 0, —1;

}

 

 

 

Figure 6.2: Automatic speciﬁcation of a component.

The relation R Q (P x P) identiﬁes the communication topology of the distributed
program, where P is the set of program processes. We represent R by a ﬁnite set
{(i,j) : (0 S i S n) A (0 S j S n) : w,- Q r,} that we create using the read/write
restrictions among the processes. The presence of a pair (2, j) in R shows that there
exists a communication link between P,- and P,-. Since we internally represent R by
an undirected graph, we consider the pair (2, j) as an unordered pair.

The interface of the fault-tolerance components. The format of the interface
of each component is the same as the output of the Component-Specification algorithm,
which is a triple (X, R, index) as described above. We use this interface to extract
a component from the component library using a pattern-matching algorithm. To
achieve this goal, we use existing speciﬁcation-matching techniques [35] for extracting
components from the component library.

The output of the component library. Given the interface (X, R, index) of
a required component, the component library returns the witness predicate, Z, the
invariant, U, and the set of transition groups, gdo U - ~ - U gd;c U 92mm, of the pre-
synthesized component (k 2 0). The group of transitions gmdex represents the low
atomicity write action that should be executed by process Pmdem.

Complexity. Since the algorithm Component-Specification checks the possibility

107

of adding a high atomicity recovery action to each state of Sm, its complexity is

polynomial in the number of states of Srec.

6.4.2 Token Ring Example Continued

We trace the algorithm of Figure 6.2 for the case of the token ring program. First,
we non-deterministically identify PSo as the process that can read every program
variable and can add a high atomicity recovery transition from the deadlock state
34 = (_L, _L, _L, 1). Thus, the value of index will be equal to 0. Second, we construct
the detection predicate X, where X E ((xo = _L) A (x1 = J.) A (x2 = _L) A (x3 = _L)).
Finally, using the read/ write restrictions of the processes in the token ring program,

the relation R will be equal to {(0, 1), (1,2), (2,3), (3,0)}.

6.4.3 Algorithmic Addition of The Fault-Tolerance Compo-

nents

In this section, we present an algorithm for adding a fault—tolerance component to
a fault-intolerant distributed program to resolve a deadlock state 33. Before the
addition, we ensure that no interference exists between the program and the fault—
tolerance component that we add. We show that our addition algorithm is sound;
i.e., the synthesized program satisﬁes the requirement of the addition problem (cf.
Section 6.1).

We recall the structure of the fault-intolerant program, p, from the ﬁrst paragraph
of Section 6.2.1. We represent the transitions of p by the union of its groups of
transitions (i.e., 11,103). We also assume that we have extracted the required pre-
SYnthesized component, c, as described in Section 6.4.1. The component c consists
Of a detector d that includes a set of transition groups Ufzogdi, and the write action
Of the pseudo process PSmdeI represented by a group of transitions gmdex in the low

atomicity.

The state space of the composition of p and d is the new state space Spr. We

108

introduce an onto function H1 : Sp, ——> 5,, (respectively, H2 : Sp, —> Sd, where S3 is
the state space of the detector d) that maps the states in the new state space Sp, to
the states in the old state space Sp (respectively, Sd). Now, we show how we verify

the interference-freedom of the composition of c and p.

Interference-freedom. We say the program p and the fault—tolerance component c
interfere iff the execution of one of them violates the (safety or liveness) speciﬁcation
of the other one. In order to ensure that no interference exists between p and c, we
verify the following three conditions in the new state space SP1: (i) transitions of p
do not interfere with the execution of d; (ii) transitions of d do not interfere with the
execution of p, and (iii) the low atomicity write action associated with c does not
interfere with the execution of p and (1. Towards this end, we present the algorithm

Interfere in Figure 6.3.

 

Interfere(S, Srec, U : state predicate, H1, H2: onto mapping function,
spec, speed: safety speciﬁcation,
go, - - - ,gm,gdo, - - - ,gdk,g;ndex: groups of transitions)
// Checks the interference-freedom between the program and
// the fault-tolerance component.
{//p=90U"'Ugma and d=ngU"'Ugdegindex
// Po - - - P" are the processes of p, and do - - - d" are the elements of d

11 = {9 = (391 = (93' EP)A(0 32' S m) = (H1(g) =91)A
(3(36,s’1) : (36.3’1) E g : ((36,3’1) violates speed) V
(H260) e U A H2(s ’)¢ U))}

 

if (11 76 (2)) then return true;
12 = {gd: (Egdj (adj 6 d) A (0< jS k) = (H2(gd) = adj) A
(3(36, 3’1) : (3’ 0,3’1) E gd: ((36,3’1) violates spec) V
(H1096) E 5 A H1(3’1)¢ 5))l
if (12 79 (ll) then return true;

= {g : (H2(g )= gmdex) A (3 (36,3’1) : (96,3’1) E g. ((36.3 ’) violates speed) V
(Hl(s’1)¢5rcc) V (H1(30)€S A H1(3’1)¢S)V
(H2(s6) E U A H2(3’1)¢ U) V ((36,3’1) violates spec))}

if ([3 aé 0) then return true;
return false;

L

 

 

 

Figure 6.3: Verifying the interference-freedom conditions.

109

First, we ensure that the set of transitions of p do not interfere with the execution
of d by constructing the set of groups of transitions 11, where 11 contains those groups
of transitions in the new state space Sp, that violate either the safety of d or the closure
of its invariant U. The transitions of p do not interfere with the liveness of d because
d executes only when p is deadlocked in the state 33. Hence, we are only concerned
with the safety of the detector d and the closure of U. When we map the transitions
of p to the new state space, the mapped transitions should preserve the safety of d.
Moreover, if the image of a transition (36, 3’1) starts in U (i.e., H2(36) E U) then the
image of (36, 3’1) will have to end in U (i.e., H2(s'1) E U). The emptiness of 11 shows

that the transitions of p do not interfere with the execution of (1.

Second, using a similar argument, we construct the set of groups of transitions 12
in the new state space Sp: whose every transition is a mapping of the transitions of d

that violate either the safety of spec or violate the closure of the program invariant

S.

Third, if 11 and 12 are empty then it will follow that the detector d is able to
detect 3,; without interfering with p. However, after d detects its detection predicate,
the component c performs a write action to change the state of the program from
33 to a state 3 E Sm, where Srec is the set of states from where safe recovery has
already been added. If a transition in the group associated with the write transition
(33, s) violates (i) the safety of the detector; (ii) the safety of the program; (iii) the
closure of U, or (iv) the closure of S then the recovery action will interfere with the
program (see the construction of 13 in Figure 6.3). If 11, 12, and [3 are empty then
the Interfere algorithm declares that no interference will happen due to the addition

ofc to p.

Addition. We present the Add_Component algorithm for an
interference-free addition of the fault-tolerance component c to p.

Thus, if the Interfere algorithm returns false then we will invoke

110

Add_Component (cf. Figure 6.4). In the new state space 519:, we construct a
set of transition groups pH, (respectively, de) that includes all groups of transitions,
9, whose images in Sp (respectively, 5.1) belong to p (respectively, d). Besides, no
transition of (36,3’1) E g violates the safety speciﬁcation of d (respectively, p) or
the closure of the invariant of d (respectively, p), i.e., U (respectively, S). In the

calculation of dH,, we note that the image of every group g in d and p must belong

to the same process (cf. condition (I = i) in the construction of (1112)-

 

Add_Component(S, Srec, U: state predicate, H1, H2: onto mapping function,

spec, speed: safety speciﬁcation,

9... - - - .9... gal... - - - gotta-m: groups of transitions)
{// p=goU---U9m, and dzngU"'Ugde9index
// Po - - - Pn are the processes of p, and do - - -dn are the elements of d

PH1 = {g = (391- = (93- EMA (OSJ' S m) = (H1(g) =91) A
(V(s6,3’1) : (36, 3’1) 6 g : ((36, 3’1) does not violate speed) A
(H2(86) 6 U => H2(8’1) E U))}

dug = {gd: (Bydj = (de E d) A (0 S j S k) = (H2(gd) = gdj) A
(3d,,P):(0SiSn)A(OSlSn):
(H2(9d) 6 di) A (H1(9d) 6 Pl) A (1:17) A
(V(s6,3’1) : (36, 3’1) 6 gd : ((36, 3’1) does not violate spec) A
(H1096) E 5 => H1(8’1) E S))}

PC 3: {9 3 (H2(9) = gindex) A (V(5(),3,1):
(36,3’1) E g : ((36,3’1) does not violate spec) A (H1(3’1) E Srec) A
(H2(s6) E U => H2(3’1) E U) V ((36,3’1) does not violate speed))}
S’ := {3:3ESp' :H1(s) E S A H2(s) E U}
p, I: le U ng UPC;
return p’,S’;

}

 

 

 

Figure 6.4: The automatic addition of a component.

The set pC includes all groups of transitions, 9, whose every transition has an
image in 91mm under the mapping H2. Further, no transition (36, 3’1) 6 g violates the
safety of spec or the closure of S.

The set of states of the invariant of the synthesized program, 5’, consists of those

states whose images in Sp belong to the program invariant S and whose images in

111

the state space of the detector, 53, belong to the detector invariant U.
Theorem 6.2 The algorithm Add_Component is sound. [3
Theorem 6.3 The complexity of Add_Component is polynomial in [S6]. CI

Before we show the soundness of Add..Component, we make some observations
and present the following preliminary lemmas and theorems. Towards this end, we
assume that we are given a program p, its speciﬁcation spec, its invariant S, its state
space Sp, faults f, and a deadlock state Sdeadlock ¢ S. we consider the case where
we have already added safety to p and we only need to resolve 33803100), to synthesize
the masking fault-tolerant program p’ with the invariant S’ in the new state space
53’. Towards this end, we use Add_Component algorithm for adding a fault-tolerance
component c to p.

The component c consists of a distributed detector d, with the detection predicate
X, the witness predicate Z, an invariant U, and a low atomicity write action Z —-> st
that takes p from state 3380,1100), to a state 3 E Srec. The state predicate Sm.C represents
the set of states from where a safe recovery to the invariant S is guaranteed. By
deﬁnition, the set of states Sr,c includes the invariant S; i.e., S Q Sm. Also, the set
S,“ USdeadlock is closed in the computations of p[] f. However, because of the deadlock
state Sdeadlocka recovery to S is not guaranteed from Srec U Sdeadlock-

We deﬁne two mapping functions H1 and H2 respectively from Sp, to S1, and from
Sp, to S1, where 5,; is the state space of the distributed detector d included in c.
In the Add_Component algorithm, based on the construction of S’, we include those

states in S’ whose images in Sp belong to S. Thus,

Observation 6.4 V3 : 3 E S’ : H1(s) E S C]
Now, we present the following theorem.
Theorem 6.5 H1(S’) Q S.

Proof. The proof follows from Observation 6.4. C]
By construction, for every arbitrary group of transitions g E p”, (cf. Figure 6.4)

112

there exists a group of transitions gj E p (0 S j S m). Now, if we consider a transition
(36, 3’1) E 9 such that 36 E S’ and 3’1 E S’ then using Observation 6.4, H1(s6) E S and
H1(3’2) E S. As a result, the condition (H1(s6),H1(36)) E p[H1(S’) holds. Thus, we
have

Observation 6.6 V(s6,s’1) : (36,3’1) E pH, : (((s6,3’1) E p’IS’)=> (H1((s6,s ’)) E
p|H1(S’))) 1:1

(H1((s6, 3’1)) denotes the transition (H1(s6), H1(s’1)) in the state space Sp.)

Using a similar argument, we present the following observation.
Observation 6.7 V(s6,3’1) : (36,3’1) E de : (((36,3 ’) E p’]S’)=> (H1((s6,s ’)) E
PIH1(S’))) D

The transition groups of pc add recovery to Sdeadzock- Also, by construction, for
every transition (s6eadlock, 3’1) E pc, Z (Sﬁeadlock) holds. Thus, at 3660,1106), the detector
detects the deadlock state sdeadlock. Since 33603100), E S, the state 3680,11,,“ does not
belong to S’. It follows that (36eadlock, 3’1) E p’ [S’. Therefore, we observe that
Observation 6.8 V(s6,s’1) : (36,3’1) E pc : (((s6,s’1) E p’IS’) => (H1((s6,s’1)) E
PIH1(S'))) 1:]

Using above observations, we present the second theorem.
Theorem 6.9 H1(p’|S’) Q p[H1(S’).
Proof. By the construction of p’, the proof follows from Observations 6.6, 6.7, and
6.8. D

To show that p’ is masking f-tolerant for spec, we prove the following lemmas.

Lemma 6.10 From every state of 5’

rec safe recovery to S’ with respect to spec is

guaranteed.
Proof. By definition, from every state of Sm safe recovery to S is guaranteed with
respect to spec. Now, let cmp be a computation of p’ [] f that starts from a state in

SI

rec. If cmp violates spec then there exists a computation preﬁx of cmp that violates

113

spec. Let (36, 3’1,. 3’ ) be the smallest such preﬁx. It follows that (Sin-l)’ 36) violates

the safety of spec. As a result, (H1(36,__1)), H1(s6)) is a transition of program p that
violates spec. Thus, the corresponding computation preﬁx (H1(s6), H1(s’1), ..., H1(36))
violates spec. Hence, we ﬁnd a computation preﬁx in Sm, that is not safe. This

contradicts with the assumption that from every state of Sm safe recovery to S with

respect to spec is guaranteed.

If (3[n_1),36) is a fault transition then the corresponding fault transition
(H1(36H)),H1(36)) violates spec. Hence, we could ﬁnd a state of p in the state
space Sp (i.e., H1(3[n_1))) from where faults alone violate spec. This contradicts with

the assumption that we have already added safety to p.

Now, let cmp be a computation of p’ that starts from a state in S’ and never

TCC

reaches S’. Since the computations of p’ are inﬁnite, there must exist a preﬁx

(36, 3’1, 3’ 36) of cmp that includes a cycle. Now, using function H1, we calculate

I n?
the computation preﬁx (3o, 31, ..., 3”, so) in the old state space Sp, where H1(s;) = s,-
(0 S i S 71). As a result, starting at so E 5.36, we ﬁnd a computation preﬁx that
includes a cycle and never reaches S, which is a contradiction with the deﬁnition of

Sm. Therefore, from every state of S’ safe recovery to S’ with respect to spec is

T€C

guaranteed. 1:]

Lemma 6.11 From every state of S’ no computation preﬁx of p’[] f that ends in

rec,
S’ violates the safety speciﬁcation of the detector d (i.e., speed).

Proof. Let cmp be a computation of p’[] f that starts from a state in SC“. If cmp
violates speed then there exists a computation preﬁx of cmp that violates spec. Let

(36,3’1,...,3’,,) be the smallest such preﬁx. It follows that (s’( 36) violates speed.

n—l)’
Thus, the transition (H2(s[n_1)), H2(s6)) violates speed; i.e., the detector (1 and the
program p interfere. By the construction of the transitions of p’, no transition of p’

interferes with the execution of d. Thus, the computation preﬁx cmp does not violate

SpCCd.

114

Also, since we showed (cf. Theorem 6.1) that the fault-tolerance component (1
is by itself F -tolerant, (H2(s’(n_1)), H2(3’,,)) cannot be a fault transition that violates

speed. Therefore, starting from every state in 5’ every computations of p’[] f satisfy

rec’

speed. 1:]

Lemma 6.12 T’ = S’

TCC

3,. (i.e., H1(T’) = s... u {33).

U {36md10ck} is a valid fault-span for p’ in the new state space

Proof. By construction, we have S Q Srec. Hence, using function H1, we have

S’QS’

TCC'

Otherwise, if there exists a state 36 E S’ such that 36 E S,’. then we will

8C

have a state so E S, where H1(s6) = so, that is not in Sm, which is a contradiction

with S Q Sm. Hence, we have 8’ Q S’

rec. Also, by assumption, the set Sm U 33803106),

is closed in the computations of p[] f. As a result, S’ U Sfieadlock is closed in the

TCC

computations of p’[] f. It follows that T’ is a valid fault-span since it is closed in p’[] f,

and S’ Q T’. C!

Using T’, we present the following lemmas.
Lemma 6.13 p’[] f satisﬁes spec and speed from T’.

Proof. Using Lemmas 6.10 and 6.11, p’[] f satisﬁes spec and speed from S’

TCC'

We only need to show that p’[] f satisﬁes spec and speed from sﬁieadlock, where
H1(s’deadlock) = Sdeadlock~ By the construction of pa, no transition originated at 366,,ka
violates spec or speed. Therefore, starting from every state at T’, p’[] f satisﬁes spec

and speed. Cl

Lemma 6.14 Every computation of p’[] f that starts from a state in T’, where

H1(T’) = Srec U {33}, contains a state in S’.

Proof. Using Lemma 6.10, it follows that every computation of p’[] f that starts

where H1(S’

rec) = Srec, reaches a state in 8’. Moreover, by the

from a state in 5’

rec’

construction of p’, transitions of pC provide safe recovery from s6mdlock to a state in

S, to S’

I _ . 3 I
rec, where H 1(3dmdlock) — sdeadlock. Since safe recovery from every state of S

TGC

115

is guaranteed, every computation of p’ that starts from a state in T’ contains a state
in S’. [:1
Theorem 6.15 p’ is masking f-tolerant for spec from S’.
Proof. First, we show that S' is an invariant of p’. We consider a transition (36, 3’1)
of p’ that starts in S’ and ends outside S’. Since 36 E S’, by Observation 6.4, we have
H1(36) E S. Also, from the construction of S’, we have H1(s’1) E S. As a result, we
ﬁnd a transition (H1(s6), H1(s’1)) of p that starts in S and ends outside S, which is a
contradiction with the closure of S in p. Thus, the execution of p’ is closed in S’.
From Theorem 6.9, it follows that p’ satisﬁes spec from 8’. Thus, 3’ is an invariant
of p’. Therefore, using S’ as an invariant and T’ as a fault-span, and based on Lemmas

6.13, and 6.14, we have shown that p’ is masking f-tolerant for spec from S’. [:1

Theorem 6.2 (Soundness) The algorithm Add_Component is sound.
Proof. To prove that our algorithm is sound, we have to show that the conditions

of the addition problem are satisﬁed.

1. H1(S’) Q S. (cf. Theorem 6.5).
2. H1(p’]S’) Q p|H1(S’). (cf. Theorem 6.9).

3. p’ is masking f—tolerant for spec from S’. (cf. Theorem 6.15). 1:]

Theorem 6.3 The complexity of Add_Component is polynomial in SP1.
Proof. The Add-Component algorithm consists of three parts where we construct
the set of transitions pH“ dH2, and pc. Respectively, each one of these sets contains
a set of transition groups in the new state space Spr. The size of the new state space
is in the order of [Sp] - [Sd] (i.e., |Spr| = [Sp] - |Sd|). As a result, the size of each
transition group cannot be more than [SP1] - [SPA] in SP2.

To construct pm, we process all groups of transitions that belong to pH,. Thus,

in the worst case, we need to process m groups of transitions in the new state space

116

810,, where m is the number of groups. As a result, the worst-case complexity for
constructing pH, is in the order of m - |Sp2|2. The same reasoning holds for the
worst-case complexity for constructing (1112 and pc. Therefore, the complexity of the

Add-Component algorithm is polynomial in the size of the SP1; i.e., [SW]. D

6.4.4 Token Ring Example Continued

Using Add_Component, we add the detector speciﬁed in Section 6.4.2 to the token
ring program M TR introduced in Section 6.2.2. The resulting program, consisting of
the processes Po - - - P3 arranged in a ring, is masking fault-tolerant to process-restart

faults. We represent the transitions of Po by the following actions.

MTRoi (($o=1)V($0=i)) A (173 =1) —’~To 1:0;

ZIITR6: ((xo = 0) V (xo = 1)) A (x3 = 0) —+ x0221;

Do : (xo 2 J.) A (yo = false) A (y1 = true) ——s yo := true;

Co : (yo = true) ———. xo 2: 0; yo 2: false;

The actions M TRo and M TR6 are the same as the actions of the [VI TR program
presented in Section 6.2.2. The action Do belongs to the sequential detector that
sets the witness predicate Zo to true. The action Co is the recovery action that Po
executes whenever the witness predicate (yo = true) becomes true. Now, we present

the actions of P3.

WITR3: ((x3 = 0) V (x3 2 1)) A (x2 = 1) —> 173 :2 1;y3 :2 false;
ll/ITR6: ((x3 = 1) V (x3 = 1)) A (x2 = 0) —+ x3 := 0;y3 :2 false;
D3: (x3 = .L) A (y3 = false) ——+ y3 := true;

The action D3 belongs to the detector that sets Z3 to true. We present the actions

of P1 and P2 as the following parameterized actions (for i = 1,2).

1WTR1-z ((x, = 0) V (x, = _L)) A (x,_1 =1) -_. x, :=1;y,~:= false;
AITR6: ((x, = 1) V (x1 = 1)) A (13-1: 0) ——s x,- := 0:1, :: false;
D, : (x, = _L) A (y, = false) A (y,+1 = true) —-+ y, := true;

117

The above program is masking fault-tolerant for the faults that corrupt one or
more processes. Note that when a process P,- (1 S i S 3) changes the value of x,-
to a non-corrupted value, it falsiﬁes Z,- (i.e., y,). The falsiﬁcation of Z, is important
during the recovery from 33 = (_L, _L, _L, _L) in that when x,- takes a non-corrupted
value, the detection predicate X,- no longer holds. Thus, if Z,- remains true then the
detector d,- witnesses incorrectly, and as a result, violates the safety of the detector.
However, Po does not need to falsify its witness predicate Zo in actions M TRo and
M TR6 because the action Co has already falsiﬁed Zo during a recovery from 33.
Remark. One could argue that we could have selected a different linear order do - - - d3
for the detector added to the token ring program. To address this issue, we note that
in the case of token ring program a detector with such linear arrangement would

interfere with the execution of the program (cf. Section 6.7 for details).

6.5 Example: Alternating Bit Protocol

In this section, we reuse the linear component used in the synthesis of the token
ring program presented in this chapter in the synthesis of a fault-tolerant alternating
bit protocol (ABP). The ABP program consists of a sender and a receiver processes
connected by a communication link that is subject to message loss faults. Using the
synthesis method presented in this chapter, we add pre-synthesized components to
synthesize an alternating bit protocol that is nonmasking fault-tolerant; i.e., when
faults occur the program guarantees recovery to its invariant. However, during recov-

ery, the nonmasking fault-tolerant protocol may violate its safety speciﬁcation.

The alternating bit protocol (ABP). The fault-intolerant program consists of two
processes: a sender and a receiver. The sender reads from an inﬁnite input stream of
data packets and sends the newly read packet to the receiver. The receiver copies each
received packet into an inﬁnite output stream. When the sender sends a data packet,

it waits for an acknowledgement from the receiver before it sends the next packet.

118

Also, when the receiver receives a new data packet, it sends an acknowledgment bit
back to the sender. A one-bit message header sufﬁces to identify the data packet
currently being sent since at every moment there exists at most one unacknowledged
data packet. Using this identiﬁer bit, the sender (respectively, the receiver) does not
need count the total number of packets sent (respectively, received).

Both processes have read / write access to a send channel and a receive channel.
The send channel is represented by an integer variable es and the variable or models
the receive channel. The domain of cs (respectively, cr) is {—1,0, 1}, where 0 and
1 represent the value of the data bit in the channel and -1 represents an empty
channel. Since we are only concerned about the synchronization between the sender
and the receiver, we do not explicitly consider the actual data being sent. Thus, we
consider the contents of CS and cr to be a single binary digit. The sender process
has a Boolean variable b3 that stores the data bit that identiﬁes the data packet
currently being sent to the receiver. Correspondingly, the receiver process has a
Boolean variable br that represents the value that is supposed to be received. When
the sender process transmits a data packet, it waits for a conﬁrmation from the
receiver before it sends the next packet. To represent the mode of operation, the
sender process uses a Boolean variable rs. The value of rs is 0 iff the sender is waiting
for an acknowledgement. Likewise, the receiver process uses a Boolean variable rr
such that the value of rr is 0 iff the receiver is waiting for a new packet.

We represent a state 3 of the ABP program by a 6-tuple (rs,bs,rr,br,cs,cr).
Thus, if we start from initial state (1, 1, 0,0, -1, —l), then the sender process begins
to send a data bit 1 while the receiver waits to receive it. We represent the transitions
of the sender process in the fault-intolerant program ABP by the following actions.

Sendo: (rs =1) ——+ rs := 0; cs 2: b3;

Sendlz (cr 7f —1) ———s rs := 1; cr 2: —1;bs:= (b3 + 1) mod 2;

Using action S endo, the sender sends another packet to the receiver when it is not

119

waiting for an acknowledgment. Thus, by setting r3 to 0, the sender moves to the sate
where it waits for an acknowledgment from the receiver. If the receive channel is non-
empty (i.e., (cr aé —1)) then the sender reads the receive channel and becomes ready
for sending the next packet. The actions of the receiver process in the fault-intolerant
program ABP are as follows:

Reco: (cs aé —1) ——> c3 :2 —1;rr:= 1; br :2 (br + 1) mod 2;

Reel : (rr = 1) -——» rr :2 0; cr := br;

The receiver reads the send channel c3 when it is non-empty (cf. Action Reco).
Then, the receiver toggles the value of br where it becomes ready to send an acknowl-
edgment to the receiver (in Action Reel).

Read/ Write restrictions. The sender can read/ write rs, c3, b3, and er, but it is not
allowed to read rr and br. The receiver is allowed to read/ write rr, cs, br, and or.
The receiver is not allowed to read rs and b3.

Faults. Faults can remove a data bit from either one of the communication channels
causing the loss of that data bit. Hence, we model faults by setting the value of cs
(respectively, cr) to -1.

F0: (cs;£—l) ———> cs 2: —1;
F1: (craé—l) ——2 cr:=—1;

We assume that the fault actions will be executed a ﬁnite number of times; i.e.,

eventually faults stop occurring.

Safety speciﬁcation. The problem speciﬁcation requires that the receiver receives

no duplicate packets.

Invariant. The state of the ABP program should satisfy the following conditions:
(i) If the receiver is ready to send an acknowledgement message or it has already sent

an acknowledge then the receive bit br and the send bit bs must be equal; (ii) If the

120

sender is ready to send a new packet or it has already sent a new packet then the
b3 and br must not be equal; (iii) It is always the case that either the send channel
cs is empty or it contains the sent bit bs; (iv) If both channels are empty then only
one of the processes (i.e., the sender or the receiver) should be waiting; (v) If one of
the channels is empty and the other one contains some data then both processes are
waiting. Hence, we specify the invariant of the ABP program, S A 31?, as follows:
SABP = {S l (((7‘7‘(3) = 1) V (“(8) r5 -1)) => (57(3) = 59(8)» A
76-1)) => (1’7"(8) 5‘ b8(8 1)) A ((CS(S) = —1)V(CS(S) = bC(51)) A
(C3 S): -1) A (CT 3): -1)) => ( 7"(31+ 719(3)) = 1)) A

( ( >75

(( ( ( (

(((C 8(8175- 1)A (CT (9)= -1))=>((T7‘(S)+ rs(8))=0))A
(((C 3(3): —1)A(Cr(3)7'é -1)) => (('rr (8 )+r8(9)) =0))}

((r3 3): 1)V (cs(s

Fault-span. The state of the ABP program may be perturbed to the state predicate
TABp due to fault transitions, where

TABp = {s I ((cs(s) = —1)V(cs(3) = bs(s))) A
(((CS(8 1: —1)V(C7‘(3)= —1))=> (((7‘7‘8( l+ rs‘>‘(‘>‘))- “ 1) V ((7‘T(S )+ ”(8 l) = 0)))1

The state predicate TABp includes states where (i) the send channel is empty or
it is equal to the sent bit b3, and (ii) if at least one of the channels is empty then at

least one of the processes is waiting.

Adding the actions of the high atomicity pseudo process. Faults may perturb
the program in the states where sender has sent a new packet and the receiver is
waiting for its arrival. As a result, the sent message is lost in the sender channel
(i.e., cs becomes —1) and the receiver is waiting for a lost message. Likewise, the
acknowledgement sent by the receiver might be lost in cr. Thus, the program may
reach states where both channels are empty and both processes are waiting. For
example, when the sent message is lost, the receiver is waiting for the lost message

and the sender is waiting for its acknowledgement. In such states the program takes

121

no action; i.e., deadlock state. Since the processes are not allowed to read the global
state of the program, they cannot detect such global deadlock states. Using our
synthesis method, we use high atomicity processes to identify the following high

atomicity actions that are added to the program for recovery.

HAC'o:(rs =0)A(rr=0)A (b3:1)A(br=0)A(cs=—1)A(cr= —1) ._. cs=1,
HAClz(rs =0)A(rr=0)A(b3=0)A(br=1)A(cs=—1)A(cr=—1) _. cs:=-0;
HACg:(rs =0)A(rr=0)A(bs=-1)A (br=1)A(cs=—1)A(cr=—1) ... 3:1,
HAC’3:(rs=0)A(rr=0)A(bs= 0)A(br=0)A(c3=—1)A(cr=—1) __. 3:0,

The guards of the above actions are global state predicates that we reﬁne using
linear distributed detectors. Let G,- be the guard of the action H A0,, where 0 S i S 3.
For example, we have Co _=_ ((r3 = 0)A(rr == 0)A(bs =1)A(br = 0)A(cs = —1)A(cr =
—1)). Corresponding to each global state predicate G,, we use a distributed detector
with two elements d3,- and dr,, where d3,- is the local detector installed in the sender
side and dr, is the local detector installed in the receiver side. Next, we show how we
add a linear distributed detector for the detection of Go. We omit the presentation

of the reﬁnement of 01,02, and G3 as it is similar to the reﬁnement of Go.

Adding fault-tolerance components. Due to read restrictions, the sender (re-
spectively, the receiver) cannot atomically detect Go. However, the sender can detect
a local condition LC, E ((rs = 0) A (b3 = 1) A (cs = —1)). Respectively, the re—
ceiver can detect a local condition LC; E ((rr 2 0) A (br = 0) A (or = —1)), where
Co E (LC, A LC,’.). Now, we instantiate the required distributed detector by reusing

the code of the pre—synthesized linear detectors presented in Section 6.3.

DAro : (LC,’.) A (y; : false) —-—+ y; :: true;
DAso : (LCS) A (y, = false) A (y; = true) ——s y, := true;

The action DAso belongs to detector dso that is allowed to read the witness

predicate y; of the detector element dro in the receiver side. If the detector element

122

dro detects its local predicate LC; then it will set its witness predicate y; to true.
Then, if the condition LC, holds in the sender side then the detector element dso
will detect the global state predicate Co by setting its witness predicate y, to true.
Afterwards, the synthesis algorithm adds the following write action to the sender

process.

Cso: (y, = true) -——+ cs 1: 1; y, := false;

The synthesis algorithm adds similar distributed detectors to ABP in order to
reﬁne the global state predicates C1, C2, and C3. Given the local conditions LC; E
((rs = 0) A (b3 2 0) A (c3 = —1)) and LC, E ((rr = 0) A (br = 1) A (cr = —1)), we

have the following logical equivalences:
o 01 E (LC; A LC,)
0 G2 E (LC, A LC,)
0 C3 E (LC; A LC;).

Corresponding to global detection predicates C1---C3, we respectively add the
following linear distributed detectors and also the necessary correcting action for
recovery to the invariant. Note that each added component has its own variables for
representing the witness predicates.

Detecting C 1. This linear detector reﬁnes the guard of the action H AC1 added by
our synthesis algorithm.

DArl : (LCr) A (yr 2 false) —+ yr 1: true;

DA31 : (LC;) A (y; = false) A (yr :— true) ——+ y; 2: true;

Correcting C 1. After the detection of C 1, the following write action takes place.

031 : (y; = true) _—. cs 1: 0; y; := false:

123

Detecting Co. We use the following linear detector to reﬁne the guard of the action

HACg.

DAr2 : (LCr) A (u, = false) A (u, = true) ——) u, := true;

DA32 : (LC,) A (u, = false) ——s u, := true;

Correcting 02. The following action, composed with the receiver, recovers the state
of the ABP program to the invariant S ABP after the detection of the global state

predicate Co.

Cr,» : (ii, = true) -—> cr := 1; ur := false;

Detecting G3. To detect the global state predicate C3 (i.e., the guard of the high

atomicity action H AC3), we add the following detector to ABP.

DAr3 : (LC;) A (u; = false) A (u; = true) —s u’r :2 true;

DA33 : (LC;) A (u; = false) ——> u; := true;

Correcting G3. This action changes the state of the ABP program to a state in

S ABP after the detection of 03.

Cr3 : (u; = true) -—» cr := 0; u’, := false;

The fault-tolerant ABP program. Next, we present the actions of the sender

process in the resulting nonmasking fault-tolerant program.

Send6: (rs = 1) ——+ rs := 0; cs :2 b3; cs 2: b3;
u; :== false; 11., :2 false;

Send’lz (cr 75 —1) —» rs := 1; cr :: —1;bs::(bs +1) mod 2;

I
3

u := false; u, := false;
DAso: (LC,) A (y, = false) A (y; = true)

———> y, 2: true;

Cso : (y, = true) ———s cs :2 1; y, := false;

124

DAsl : (LC;) A (y; = false) A (y, = true)
—-s y; := true;
C31 : (y; = true) ——+ c3 :2 0; y; := false;
DA32 : (LC,) A (u, = false) ——+ u, 2: true;
DA33 : (LC;) A ( ’3 = false) —-s u’ := true;

8

The synthesis algorithms has added new assignments to the actions Send6 and
Send’1 for the falsiﬁcation of the witness predicates. For example, in action Send6,
when c3 is assigned a value other than -1, the predicates LC, and LC; no longer hold.
Thus, the witness predicates u; and u, must be falsiﬁed. The actions of the receiver

in the synthesized fault-tolerant program are as follows:

Reco: (cs f -—1) —s c3 :2 —1;rr:= 1; br 2: (br + 1) mod 2;
y,- := false; 3;; := false;

Reel : (rr 2 1) ——> rr :2 0; cr := br;
y, := false; 11; := false;

DAro : (LC;) A (y; = false) ——> y; := true;

DAr1 : (LCr) A (yr 2 false) ——s yr := true;

DAr2 : (LCT) A (u, = false) A

(u, = true) ——> u, := true;

Cr2: (u, = true) ——r cr := 1; u, 2: false;

DAr3 : (LC;) A (11’, = false) A

(u; :2 true) ——> u := true:

Cr3 : (u; = true) -—s or z: 0: 12;. :2 false;

Observe that in actions Reco (respectively, Reel), we falsify the witness predicate
yr and y; once the program changes the value of rr to 1 (respectively, or to 0 or
1). This falsiﬁcation is necessary since once the condition (rr = 1) holds, the predi-
cates LC, and LC; no longer hold. Also, this example illustrates the case where we
simultaneously add multiple pre-synthesized components to a distributed program

to add fault—tolerance. We have veriﬁed the interference—freedom requirements using

125

the SPIN model checker [36] to gain more conﬁdence with the implementation of
our synthesis framework, FTSyn (see Appendix A for the Promela [37] code of this

example).

6.6 Adding Hierarchical Components

In this section, we show how we add components with hierarchical topology to a dif-
fusing computation program to provide recovery in the presence of faults. In earlier
sections, we showed how we apply the synthesis algorithm presented in this chapter
to programs where the underlying communication topology between processes is lin-
ear. In this section, we show how we add hierarchical pre-synthesized components
to distributed programs. Speciﬁcally, we add tree-like structured components to a
diffusing computation program where processes are arranged in an out-tree, where
the indegree of each node is at most one. A diffusing computation starts at the root
and propagates throughout the tree, and then, reflects back up to the root of the tree.
The fault-intolerant program is subject to faults that perturb the state of the diffus-
ing computation and the topology of the program (i.e., the parenting relationship

amongst processes).

This case study shows that the synthesis method presented in this chapter han-
dles pre-synthesized components (respectively, distributed programs) with different
topologies as we have already reused a particular linear component in the synthesis
of a token ring program and an alternating bit protocol in this chapter. Next, in
Subsection 6.6.1, we describe 110w we formally represent a hierarchical fault—tolerance
component. Subsequently in Subsection 6.6.2, we show how we automatically add a

hierarchical component to a diffusing computation program.

126

6.6.1 Specifying Hierarchical Components

In this section, we describe the representation of hierarchical fault—tolerance compo-
nents (i.e., detectors and correctors). We focus on the representation of a detector
with a tree-like structure as a special case of hierarchical detectors. The hierarchical
detector d consists of n elements (1,- (0 S i < n), its speciﬁcation speed (speciﬁed in
Subsection 6.3.1), its variables, and its invariant U. We introduce a relation j on
the elements d,- that represents the parenting relation between the nodes of the tree;
e.g., i :5 j means d.- is the parent of dj.

The element do is placed at the root of the tree and other elements of the detector
are placed in other nodes of the tree. Each node d,- has its own detection predicate X ,-
and witness predicate Z,. The siblings of a node can detect their detection predicate
in parallel. However, the truth-value of the detection predicate of each node depends
on the truth-value of its children. In other words, node d,- can witness if all its children
have already witnessed.

Each element d,, 0 S i < n, of the detector has a Boolean variable y.- that
represents its witness predicate; i.e., the witness predicate of each d,, say Z,, is equal
to (y,- = true). Also, the element d,- can read/ write the y values of its children and its
parent (0 S i < n). Moreover, each element d,- is allowed to read the variables that
P,- can read, where P,- is the process with which d,- is composed. Now, we present the
template action of the detector (1,- as follows ((0 S i,j, k < n) A (j < k) A (Vr : j S

rSk:ijr)):

DA,: (LC;) A (y,. A---Ayk) A (1},: false) ——s y, 2: true;

Using action DA,- (0 S i < n), each element (1,- of the hierarchical detector
witnesses (i.e., sets the value of y,- to true) whenever (i) the condition LC,- be-
comes true, where LC, represents a local condition that d,- atomically checks (by

reading the variables of P,), and (ii) its children (1;, - -- ,dk have already witnessed

127

((0 S j,k < n) A (j < k)). The detection predicate X,- for element d,- is equal to
(LC,- A LCj A A LCk). Therefore, do detects the global detection predicate
LCo A - - - A LCn_1.

The above action is an abstract template that should be instantiated by the syn-
thesis algorithm during the synthesis of a speciﬁc program in such a way that the
program and the detector do not interfere. For automatic addition of nonmasking
fault-tolerance, the interference-freedom of the program and the detector requires
that (i) in the absence of faults, the program speciﬁcation and the safety speciﬁcation
of detectors are satisﬁed, and (ii) in the presence of faults, recovery is provided by
the composition of the program and the detectors.

During the detection, when d, sets y; to true, its children have already set their y
values to true. Hence, we represent the invariant of the hierarchical detector by the
predicate U, where

U={s:(v2':(0Sz'<n):(y.(s)=>(vj=i:j=LCj))}

6.6.2 Diffusing Computation

In this section, we present the addition of a hierarchical pre—synthesized component
to a fault-intolerant diffusing computation. We have adapted the diffusing compu-
tation program from [38]. First, in Subsection 6.6.2.1, we give the speciﬁcation of
the diffusing computation program. Then, in Subsection 6.6.2.2, we present the syn-
thesized nonmasking fault-tolerant program before the addition of the hierarchical
component, which includes high atomicity recovery actions. Finally, in Subsection
6.6.2.3, we show how we add pre—synthesized components to reﬁne the high atomicity

actions added during synthesis.

6.6.2.1 Diffusing Computation Program
The diffusing computation (DC) program consists of four processes {Po, P1, P2, P3}

whose underlying communication is based on a tree topology. The process Po is the

128

root of the tree. Processes P1 and P2 are the children of Po (i.e., (0 j 1) A (0 j 2))
and P3 is the child of P2 (i.e., 2 j 3).

Starting from a state where every process is green, Po initiates a diffusing com-
putation throughout the tree by propagating the red color towards the leaves. The
leaves reﬂect the diffusing computation back to the root by coloring the nodes green.
Afterwards, when all processes become green again, the cycle of diffusing computation
repeats.

Each process P,- (0 S j S 3) has a variable c,- that represents its color and whose
domain is {0, 1}, where 0 represents the red and 1 represents the green. Also, process
P,- has a Boolean variable 3n,- that represents the session number of the diffusing
computation where P,- is currently participating. Thus, we use sn, to distinguish
the case where P, has not started to participate in the current diffusing computation
from the case where P,- has completed the current session of diffusing computation.
Moreover, each process has a variable par,- that represents the parent of P,-. The
domain of par,- is equal to {0, 1,2,3}. The value of par,- identiﬁes the node from
where there exists an edge to P,- in the out-tree. For example, since the parent of Po
is itself, we have paro = 0.

Program actions. The actions of the process P,- (0 S j < 4) are as follows:
190,1: (c, = 1) A (par, =j) ——+ c,- :2 0; 3n, = 3371,;

DC,2 : (c, = 1) A (cparj == 0) A (371., E snparj) —s c,- := cpar); 3n,- = snparj;

DC,3: (c, = 0) A (We :: (par)c = j) => (c)C = 1 A 3n, E snk)) ———. c,- :=1;

Read / write restrictions. Each process P,- is allowed to read / write the variables of
its children and its parent. For example, process Po can read / write its local variables
and the local variables of P1 and P2. However, Po is not allowed to read/ write the
variables of P3. Also, P3 cannot read / write the variables of Po and P1.

Invariant. In each session of diffusing computation, every process P,- meets one

of the following requirements: (i) P, and PPM]. have both started participating in

129

the cum-
the cum
the curre
current 31

all state 1

50c = I?

Faults.

underlyin

tions by 7,

The a(

 

Whereas a
tIOnSlllp I)
falllt-cl

218$

the current session of diffusing computation; (ii) P,- and PPM, have both completed
the current session of diffusing computation; (iii) P,- has not started participating in
the current session whereas PPM, has, and (iv) P, has completed participating in the
current session whereas Ppar, has not. Hence, the invariant of the program contains
all state where SDC holds, where

800 = (Vj : (0 S j S 3): ((c, = cpm, A sn, E snparj) V (c,- = 1 A cpar, = 0))) A

(paro = 0Apar1= 0Apar2 = 0Apar3 = 2)

Faults. Fault transitions can perturb the values of c,- and 3n,- (0 S j S 3), and the
underlying communication topology of the program. We represent the fault transi-
tions by the following actions:

F,- : (true) —2 c,- = 0|1;

F,: (true) —-s sn,=false[true;

Fo: (true) ——) paro=0]1[2;

The actions F ,-, and 17,, represent the fault transitions that perturb a process P,
whereas action Fo only affects Po. The class of faults Fo perturbs the parenting rela-
tionship by changing the value of paro to one of the values {0, 1, 2}. We have included
fault-class Fo since it perturbs the DC program to states where we can demonstrate

the advantages of using pre-synthesized components in dealing with deadlock states.

6.6.2.2 Intermediate Nonmasking Program

Now, we present the intermediate nonmasking fault-tolerant program that includes
high atomicity recovery actions. We have synthesized this intermediate program using
our software framework FTSyn (cf. Chapter 8).

The faults may perturb the state of the DC program outside 500 where the
program may fall in a non-progress cycle or reach a deadlock state. For example,
faults Fo may perturb the program to states where the condition Tdeadlock E ((co =

1) A (c1 = 1) A (c2 = 1) A (c3 = 1)) A (paro # 0) holds. The state predicate Tdeadlock.

130

represents states from where no program action is enabled; i.e., deadlock states. Now,
to add recovery from a state in Tdeadlocka FTSyn assigns a high atomicity process Phigh,
to each process P,- (0 S j < 4).

To illustrate our approach of adding hierarchical pre—synthesized detectors (re-
spectively, correctors), we only focus on one of the high atomicity recovery actions
added by process Phigho as the reﬁnement of other high atomicity actions is simi-
lar. The actions of other high atomicity processes in the intermediate nonmasking

program are available in the Appendix A. The action H AC is as follows:

HAC : (co = 1) A (c1 = 1) A (c2 = 1) A (C3 = 1) A (sno = 1) A ((paro = 2) V (paro = 1)) A

((sn3 = 0) V (snl = 0) V (8712 = 0)) ——> sno := 0;

The guard of H AC identiﬁes a subset of Tdeadlock for which H AC provides recovery
to states from where recovery to S 00 has been already established. The write action
performed by H AC is a local write operation in process Po, whereas the guard of
H AC is a global state predicate that should be reﬁned in the distributed program.
Thus, we only need to add detectors for the reﬁnement of the guard of H AC. In
the next subsection, we show how FTSyn uses the guard of H AC to automatically

specify the required detectors.

6.6.2.3 Adding Pre—synthesized Detectors

To reﬁne the guard of H AC, the synthesis algorithm presented in this chater auto-
matically identiﬁes the interface of the required component. The component interface
is a triple (X, R, i), where X is the detection predicate of the required component,
R is a relation that represents the topology of the required component, and i is the
index of the process that performs the local write action after the detection of X. For
example, for action H AC, X is equal to the state predicate Xo as we describe next
in this section, R is a set of pairs where each pair represents the existence of a com-

munication link between two processes, and i is equal to 0 since Po should perform

131

the local writ I
Using the
rithm queries
the option of
of H AC and
helps in minii:
ponent adds it
example. in 111
one componen
by the guard .
variables that .

detectors (1 an

X0 5 (If-3 : I

X, E l

(3123

The preSVU
and d3 (res

M03133).

p (X? l.

to the topologV

lead/ll Write I“. 1.11.
' -« a .

The synthesi

at '
tion presented

P . .
..ontlmons are
1 a1

«was

(((‘3 =

(It -
- . L’Sll

BSLC:
I

 

]r

the local write action.

Using the interface of the required pre-synthesized component, the synthesis algo-
rithm queries an existing library of pre-synthesized components. At this step, we have
the option of supervising the synthesis algorithm in that we can observe the guard
of H AC and manually identify the required components. This manual intervention
helps in minimizing the number of components added to the program since each com-
ponent adds its associated variables to the program and expands the state space. For
example, in the case of action H AC , the synthesis algorithm automatically identifies
one component corresponding to each deadlock state in the set of states represented
by the guard of H AC, whereas by manual intervention, we observe that the only
variables that are not readable for P0 are 03 and 3723. Hence, we add two distributed

detectors d and d’ to simultaneously detect the predicates X0 and X6, where

X0 5 ((03 = 1) A (00 =1)A(C1= 1) A (02 = 1) A (8710 = 1) A ((Paro = 2) V (Para = 1)))

x5 2 ((3713 = 0) A (Co = 1) A (c1 = 1) A (c2 = 1) A (sno = 1) A ((Pm‘o = 2) v (pare =1»)

The pre—synthesized detector (1 (respectively, (1’) includes four elements d0, d1, (12,
and (1;, (respectively, 6, ’1, ’2, and d3), where (1,- (respectively, d;) is composed with
P,- (0 S i S 3). Thus, the topologies of the distributed detectors (1 and d’ are similar
to the topology of the DC program. Also, the parenting relationship (respectively,
read/write restrictions) between do,d1,d2,d3 (respectively, 6, ’1, ’2, and d3) follows
the parenting relationship (respectively, read /write restrictions) of P0, P1, P2, and P3.

The synthesis algorithm automatically instantiates an instance of the template
action presented in Section 6.6.1 with the appropriate local condition. The local
conditions are automatically identiﬁed based on the set of readable variables of each
process. For example, the part of X0 that is readable for detector (1;; is identiﬁed

as LC3 E ((03 = 1) /\ (c2 = 1)). Thus, the instantiation of the template action for

detector ([3 results in the following action:

132

Likewise.

as LC; 3 ((1”.

1);, : (3'13 =‘

The detect
dition LC3 (re
23 _=_ (y3 = 1
(respectively. r
the detection l
is the leaf of 1

Next, we prese

021 I (y3 2 (Hr:

Dir 3 (.113 = trur

The local cc
1)A(c0 21M (
to X2 3 (LC; A
condition of [llt
predicate of d3 1':

( I
"y? = true)

9

 

Likewise.

lllr
l predicate 0|

dCllOIlS 0f (11 a . I
3 II

J

“Fss

D31 : (C3 = 1) /\ (c2 = 1) A (y3 = false) —> y3 := true;

Likewise, the part of X 6 that is readable for detector d.g is automatically identified

as LC’ E ((3713 = 0) /\ (02 = 1)). Hence, the action of d'3 is as follows:

D51 : (3713 = 0) /\ (62 z: 1) A (y3 = false) ——> yg :2 true;

The detector d3 (respectively, d3) sets 11;, (respectively, yg) to true if the local con—
dition LC'3 (respectively, L05) holds and 313 (respectively, yg) is false. The predicate
Z3 E (3);, = true) (respectively, 2!, E (313 = true)) is the witness predicate of d3
(respectively, d3), and the predicate X3 E L03 (respectively, 5 E LCg) constructs
the detection predicate of d3 (respectively, d3). Note that since d3 (respectively, 3)
is the leaf of the tree, it does not have any children to wait for before it witnesses.

Next, we present the actions of d2 and (1’2 (i.e., actions D21 and D31) as follows:

D21 : (y3 = true) /\ (c2 = 1) /\ (sno = 1) /\ (c0 = 1) A((paro = 2) V (pa-r0 =1))/\(y2 = false)

—-> yg :2 true;

Dar: (y§=true)/\(C2 =1)A<sno= )A(Co=1)A((PaTo=2)V(P07‘0=1))A(y§=f013€)

-—-—* y’2 := true;

The local condition of the action D21 (i.e., LC2) is equal to (C2 = 1)A(sno =
1) /\ (co = 1)/\ ((paro :2 2) V (paro = 1)). Thus, the detection predicate of d2 is equal
to X2 E (L02 /\ LC3) and its witness predicate Z; is equal to (7J2 = true). The local
condition of the action 0'2, (i.e., LCé) is also equal to LC2. Hence, the detection
predicate of d; is equal to X; E (LC; /\ LCé) and its witness predicate Z; is equal to
(ya = true).

Likewise, the synthesis algorithm identiﬁes the detection (respectively, the wit-
ness) predicate of (11 based on identifying L01 E (cl = 1). We omit the details of the

actions of (11 as it is straightforward and similar to the actions of (12 and (13. The local

133

(‘OIldjt l0“
0f detecto

as full0W5

DOT : (I’ll I
DE” : (yl :
The trl

value of Xi

and do 31"}

Rec: (yo :

When tl
(respectivel)
(respectivel)
should bet'or
The compc
Now. we pre
is a cornposi
the processes
and P2 are st
Appendix A i

d
1. the synthe

condition LCo of detector do is equal to (01 : 1) A LCo. Also, the local condition LC6
of detector d6 is equal to (01 = 1) A L06. Thus, the actions of detectors do and (16 are
as follows:

D01 : (y1 = true) A (yg = true) A (c1 = 1) A LC2 A (yo = false) ——) yo := true;

D61 : (y; = true) A (y6 = true) A (c1 = 1) A L06 A (y6 = false) —+ y6 ::= true;

The truth-value of yo witnesses the truth-value of Xo and y6 witnesses the truth-
value of X6. Now, we add a recovery action that only reads the local variables of Po
and do and writes the local variables of Po. The recovery action is as follows:

Rec: (yo = true) A ((y6 = true) V (sm = 0) V (3722 = 0)) -——> sno := 0; yo := false; y6 := false;

yo :2 false;y6_, :2 false;

When the program executes the above recovery action, the predicates Xo and X 2
(respectively, X6 and X6) no longer hold. Thus, the witness predicates of do and d2
(respectively, d6 and d6) must be falsiﬁed; i.e., yo and y2 (respectively, y6 and y6)
should become false.

The composition of the DC program and the pre—synthesized detectors.
Now, we present the actions of the process Po of the nonmasking DC program that
is a composition of the actions of the pro-synthesized detectors and the actions of
the processes in the intermediate fault-intolerant program. Since the actions of P1
and P2 are structurally similar to Po’s actions, we refer the interested reader to the
Appendix A for the actions of P1 and P2. Note that since no detection is done by
d1, the synthesized program does not have any new actions in process P1. Thus, the
actions of P1 remain similar to the fault-intolerant program. The actions of process
Po composed with the actions of do, (16, and the recovery action Ree are as follows:

Dem : (co = 1) A (paro = 0) ———+ co := 0;

yo := false;y6 2: false;

DC02 2 (C0 = 1) A (Cparo : 0) A (3710 35‘ suparn)

134

D
Co
13 l

D
01'
-(

(«a

06
1 .
(34

Reic-
: (,1

 

 

——> co := cparo; sno = snpam;
if ((co = 0) A (yo = true))
then yo := false;y6 :2 false;
DCo3: (co = 0) A (We :: (par;c = 0) :> (ck : 1A sno E snk))
—-> co := 1;

if (((y1 =- false) V (y2 = false)) A (yo = true))

then yo := false;

if (((y; = false) v as -—- false)) A (ya -—- true))

then 316 := false;

Dol : (y1 = true) A (yo = true) A LCo A (yo : false)
——> yo := true;
D61 : (y6 = true) A (ya, = true) A L06 A (y6 = false)
——r y6 := true;
Rec: (yo = true) A ((y6 = true) V (3721 = 0) V (8712 = 0))
—+sno := 0;yo := false; y6 := false;

y2 := false;y6 :2 false;

The actions of process Po are composed with the actions of detectors do and d6
(i.e., 001 and D61) and the recovery action Rec presented in this section. Observe
that the statement of actions D001 and D002 of Po are composed with assignments
that falsify the witness predicates of the corresponding detectors. Such falsiﬁcation of
the witness predicates is necessary so that program execution preserves the safety of
detectors. For example, when co becomes 0 the state predicate LCo no longer holds.
Thus, the witness predicate yo must be falsiﬁed to ensure the interference-freedom of
the program and the pres-synthesized detectors.

Interference-freedom. The interference-freedom requires the synthesized program
to provide recovery in the presence of faults, and satisfy the speciﬁcation of the
DC program in the absence of faults. In the presence of faults, if faults perturb

the program outside the invariant 500 then the synthesized program satisﬁes the

135

requireme:
the absenc
Thus. in tl
computatii
“cunt
the safety
faults may
i.e., the sa
tolerance o
Violate the
the compos
in the presg
Althoug
ified the int
Chet'ker to g
seated in C:

Promela 1110

Can. the 'S'yTlf,
"C7228?

hes, The
tolerant

requirements of nonmasking fault-tolerance; i.e., recovery to S DC is guaranteed. In
the absence of faults, the added detectors do not interfere with the program execution.
Thus, in the absence of faults, the above program satisﬁes the speciﬁcation of diffusing

computation program and the safety of detectors.

We would like to note that when faults occur, fault transitions may directly violate
the safety speciﬁcation of detectors; e.g., after d3 witnesses that (Co = 1) holds,
faults may change the value of C3 to 0, and as a result, d3 witnesses incorrectly;
i.e., the safety of do will be violated by fault transitions. Since nonmasking fault—
tolerance only requires recovery to the invariant, the violation of safety does not
violate the nonmasking fault-tolerance property. Thus, the only requirement is that
the composition of the program and the pre-synthesized detectors provides recovery
in the presence of faults.

Although the synthesized nonmasking program is correct by construction, we ver-
iﬁed the interference-freedom requirements of the above program in the SPIN model
checker to gain more conﬁdence on the implementation of the framework FTSyn pre-
sented in Chapter 8. We refer the reader to the Appendix A for the source of the

Promela model.

6.7 Discussion

In this section, we address some of the questions raised by our synthesis method.
Speciﬁcally, we discuss the following issues: the fault-tolerance of the components,
the choice of detectors and correctors, and pre—synthesized components with non-
linear topologies.
Can the syntheszs method deal with the faults that affect the fault-tolerance compo-
nents?

Yes. The added component may itself be perturbed by the fault to which fault-

tolerance is added. Hence, the added component must itself be fault-tolerant. For

136

example.
the adde
(cf. Thee
process r

For arbit
the mode
How does
programs

While
intolerant
this chap
tolerance
abstract, (
in replica-
We expect
CTOrrec-t 01-8
to providf
delE‘Ctors ]
detectiOns
Does the s
flies?

Yes. Ar
method Of
and hlefan
sis method

Dents) wit it

In the toffe‘r

example, in our token ring program, we modeled the effect of the process restart on
the added component and ensured that the component is fault-tolerant to that fault
(cf. Theorem 6.1). For the fault-classes that are commonly used, e.g., process failure,
process restart, input corruption, Byzantine faults, such modeling is always possible.
For arbitrary fault-classes, however, some validation may be required to ensure that

the modeling is appropriate for that fault.

How does the choice of detectors and correctors help in the synthesis of fault-tolerant
programs?

While there are several approaches (e.g., [39]) that manually transform a fault-
intolerant program into a fault-tolerant program, we use detectors and correctors in
this chapter, based on their necessity and sufﬁciency for manual addition of fault-
tolerance [18]. The authors of [18] have also shown that detectors and correctors are
abstract enough to generalize other components (e.g., comparators and voters used
in replication-based approaches) for the design of fault—tolerant programs. Hence,
we expect that our synthesis method can benefit from the generality of detectors and
correctors in the automated synthesis of fault-tolerant programs as there is a potential
to provide a rich library of fault—tolerance components. Moreover, pre—synthesized
detectors provide the kind of abstraction by which we can integrate efﬁcient existing
detections approaches (e.g., [40, 41]) in pre—synthesized fault-tolerance components.
Does the synthesis method support pre-synthesized components with non-linear t0polo-
gies?

Yes. As we demonstrated in Sections 6.5 and 6.6.2, we have applied the synthesis
method of this chapter to add pre—synthesized fault-tolerance components with linear
and hierarchical topologies. These examples show the applicability of our synthe-
sis method for distributed programs (respectively, distributed fault-tolerance compo—

nents) with linear and hierarchical topologies.
In the token ring example, will the synthesis succeed if we select PSmdeI (1 5 index 3

137

3). luster

from the

Yes.
(Imam—1,
the detec
the elemc
index — 1
d3. and n

write dk.

Using
“Ll A (11
atornicitv

the case i

aCtion.

In the tell,”

3 ), instead ofPSo, as the pseudo process that adds a high atomicity recovery transition

from the deadlock state 3,, = (J_, _L, _L, 1)?

Yes. We argue that if we select a detector d with the following arrangement,
dmdex_1, - - -, do, do, - - o, dmdex, where index 75 0, then the synthesis will succeed and
the detector d will not interfere with the token ring program. In this arrangement,
the element denim—1 is allowed to read and write yindex_1. Every element dj, 0 S j <
index —- 1, is allowed to read yj and yjH, and write y]. do is allowed to read do and
do, and write d3. Elements dk, index 3 k < 3, are allowed to read dk and dk+1, and

write dk.

Using the above arrangement, Zmdez witnesses the detection predicate X E ((xo =
_L) A (x1 = _L) A (x2 = _L) A (x3 = .L)), and afterwards, the PSindex adds a high
atomicity recovery action to the program. The proof of non-interference is similar to
the case where PSo is selected as the pseudo process that adds the high atomicity

action.

In the token ring example, will the synthesis succeed if we add a sequential detector
with a diﬁ'erent linear order do . - -d3, where 2:, witnesses for the detection predicate

X 5((1‘0 = 1)A(x1: _L)A(x2 = _L)A(x3 :2 1))?

No. We show that if we use the above order then the Interfere algorithm returns
true as 11 becomes non-empty; i.e., the execution of the token ring program interferes
with the added pre—synthesized component. In a state 3 = (1, 1,0,0), the elements
do and d1 of the linear detector witness their detection predicates Xo and X1, where
Xo E (xo 2 _L) and X1 E ((xo 2 1)A(x1 = 1)). Now, if Po executes and sets xo to 1
then X1 no longer holds. As a result, the program reaches a state where (11 incorrectly

witnesses its detection predicate and violates the speciﬁcation of the linear detector.

138

6.8 E
In this ch.-
grain from
Specificallj
fault-tolera
rithm for a
tributed p:
of the com;
thesis algoi
Program w
matic addii
Process is 1
lLS‘Gd in the
bit protoco
showed tha
Dents Will] (

Components

6.8 Summary

In this chapter, we presented an approach for the synthesis of a fault-tolerant pro
gram from its fault—intolerant version and pre—synthesized fault-tolerance components.
Speciﬁcally, we presented an algorithm for automatic speciﬁcation of the required
fault-tolerance components during the synthesis. We also presented a sound algo—
rithm for automatic addition of pre—synthesized fault-tolerance components to a dis-
tributed program. Before adding a component, we veriﬁed the interference-freedom
of the composition of the program and the fault-tolerance component. Using our syn-
thesis algorithm, we showed how we could add masking fault-tolerance to a token-ring
program where all process might be corrupted. By contrast, previous work on auto-
matic addition of fault-tolerance to the token ring program assumed that at least one
process is not corrupted. Also, we demonstrated how we reuse the same component
used in the synthesis of the token ring program for the synthesis of an alternating
bit protocol that is nonmasking fault-tolerant to message loss faults. Moreover, we
showed that our synthesis method is applicable for adding pre—synthesized compo-
nents with different topologies (e.g., linear and hierarchical) where we added tree—like

components to a diffusing computation program.

139

Chapter 7

Automated Synthesis of

Multitolerance

In this chapter, we focus on automated synthesis of multitolerant programs. Such
automated synthesis has the advantage of generating fault-tolerant programs that (i)
tolerate multiple classes of faults, and (ii) are correct by construction. Automatic
synthesis of multitolerance is desirable as (i) today’s systems are often subject to
multiple classes of faults, and (ii) it is often undesirable or impractical to provide
the same level of fault-tolerance to each class of faults. Hence, these systems need
to tolerate multiple classes of faults, and (possibly) provide a different level of fault-
tolerance to each class. To characterize such systems, the notion of multitolerance
was introduced in [34]. The importance of such multitolerant systems can be easily
observed from the fact that several methods for designing multitolerant programs

as well as several instances of multitolerant programs can be readily found (e.g.,
[11, 12, 13, 34]) in the literature.

We focus on automated synthesis of high atomicity multitolerant programs in a
stepwise fashion. Speciﬁcally, we (i) present a sound and complete stepwise algorithm
for the case where we add nonmasking fault-tolerance to one class of faults and mask-

ing fault-tolerance to another class of faults, and (ii) present a sound and complete

140

stepwise "
faults am
algorithm
case Whel‘
tolerance

ﬁnd that

the additi
can be P9
to one cl'd
XP-compl

In the

formal def
program f
the relevai
multitolere
synthesis (;
Then. in S
of Inultitol
7.5, we pr.
multitolera

concluding

7.1 p

In this secti

fTOm their f-

o
In deﬁnitk

IflllllllO]QraI-l

stepwise algorithm for the case where we add failsafe fault—tolerance to one class of
faults and masking fault-tolerance to another class of faults. The complexity of these
algorithms is polynomial in the state space of the fault-intolerant program. For the
case where failsafe fault—tolerance is added to one fault-class and nonmasking fault-
tolerance is added to another fault-class, we ﬁnd a somewhat surprising result. We
ﬁnd that this problem is NP-complete. This result is surprising in that automating
the addition of failsafe and nonmasking fault-tolerance to the same class of faults
can be performed in polynomial time. However, addition of failsafe fault-tolerance
to one class of faults and nonmasking fault-tolerance to a different class of faults is

N P-complete.

In the rest of this chapter, we proceed as follows: In Section 7.1, we present the
formal deﬁnition of multitolerance and the problem of synthesizing a multitolerant
program from a fault-intolerant program. Subsequently, in Section 7.2, we recall
the relevant properties of algorithms in 2.7 that we use in automated addition of
multitolerance. In Section 7.3, we present a sound and complete algorithm for the
synthesis of multitolerant programs that provide nonmasking-masking multitolerance.
Then, in Section 7.4, we present a sound and complete algorithm for the synthesis
of multitolerant programs that provide failsafe—masking multitolerance. In Section
7.5, we present the NP-completeness proof for the case where failsafe-nonmasking
multitolerance is added to fault-intolerant programs. Finally, in Section 7.6, we make

concluding remarks and discuss future work.

7. 1 Problem Statement

In this section, we formally define the problem of synthesizing multitolerant programs
from their fault-intolerant versions. Before defining the synthesis problem, we present
our deﬁnition of multitolerance; i.e., we identify what it means for a program to be

multitolerant in the presence of multiple classes of faults.

141

As 1
program
safe/ non
sider the
program

There
should b
provide 1'
definitior:
fl occur
faults occ

Anotl:
Where f 1
tolerance

tOIerance

H O‘K’G \Pe [- 3

IS pr0\'ide(
fl and f2
DTOOf Of N

the minirm

fElul

In a SDW

t‘tolt‘l‘a

As mentioned in Section 2.5, a failsafe/ nonmasking/ masking fault-tolerant
program guarantees to provide a desired level of fault-tolerance (i.e., fail-
safe/ nonmasking/ masking) in the presence of a speciﬁc class of faults. Now, we con-
sider the case where faults from multiple fault-classes, say f1 and f2, occur in a given
program computation.

There exist several possible choices in deciding the level of fault-tolerance that
should be provided in the presence of multiple fault-classes. One possibility is to
provide no guarantees when f 1 and f2 occur in the same computation. With such a
deﬁnition of multitolerance, the program would provide fault-tolerance if faults from
f 1 occur or if faults form f2 occur. However, no guarantees will be provided if both
faults occur simultaneously.

Another possibility is to require that the fault—tolerance provided for the case
where f1 and f2 occur simultaneously should be equal to the minimum level of fault-
tolerance provided when either f 1 occurs or f2 occurs. For example, if masking fault-
tolerance is provided to f 1 and failsafe fault-tolerance is provided to f2 then failsafe
fault-tolerance should be provided for the case where f 1 and f2 occur simultaneously.
However, if nonmasking fault-tolerance is provided to f 1 and failsafe fault-tolerance
is provided to f2 then no level of fault-tolerance will be guaranteed for the case where
f1 and f2 occur simultaneously. We note that this assumption is not required in our
proof of NP-completeness in Section 7.5.

In our deﬁnition, we follow the latter approach. The following table illustrates
the minimum level of fault-tolerance provided for different combinations of levels of

fault-tolerance provided to individual classes of faults.

 

 

 

 

 

 

Fault-Tolerance Failsafe Non masking Masking

Failsafe F ailsafe Intolerant Failsafe
Non masking Intolerant Nonmasking Nonmasking

Masking Failsafe Nonmasking Masking

 

 

 

 

In a special case, consider the situation where failsafe fault-tolerance is provided

142

to both

provided

 

for whici
which fa
fnonmaskz
tolerance

Now.
speciﬁcat
fmaskz'ng. ‘
be multitr
class of fa
DEﬁnitio

S, for 8‘1)E(_'

ﬂags of fault

N .. .
0“: USU]

0f ”‘9 PFOhler

 

to both f 1 and f2. From the above description, failsafe fault-tolerance should be
provided for the fault class f 1 U f2. By taking the union of all the fault-classes
for which failsafe fault-tolerance is provided, we get one fault-class, say f fags“ f8, for
which failsafe fault-tolerance needs to be added. Likewise, we obtain the fault-class
fnmmasking (respectively, fmasking) for which nonmasking (respectively, masking) fault-
tolerance is provided.

Now, given (the transitions of ) a fault-intolerant program, p, its invariant, S, its
speciﬁcation, spec, and a set of distinct classes of faults ffausafe, fnmmasking, and
fmaskmg, we deﬁne what it means for a synthesized program p’, with invariant S", to
be multitolerant by considering how p’ behaves when (i) no faults occur; (ii) only one
class of faults happens, and (iii) multiple classes of faults happen.

Deﬁnition. Program p’ is multitolerant to f 10,130 f6, fnonmashm, and _f'ma.,k,,,g from

S’ for spec iff (if and only if) the following conditions hold:

1. p’ satisﬁes spec from S" in the absence of faults.
2. p’ is masking fmask,,,g-tolerant from S’ for spec.
3. p’ is failsafe ( f [0,130 fe U fmasking)-tolerant from S" for spec.

4. p’ is nonmasking (fnonmasking U fmaskinghtolerant from S’ for spec. C]

Remark. Since every program is failsafe/ nonmasking/ masking fault-tolerant to a
class of faults whose set of transitions is empty, the above deﬁnition generalizes the
cases where one of the classes of faults is not speciﬁed (e.g., fmaski-ng = {}).

Now, using the deﬁnition of multitolerant programs, we identify the requirements
of the problem of synthesizing a multitolerant program, p’, from its fault-intolerant
version, p. The problem statement is motivated by the goal of simply adding multi-
tolerance and introducing no new behaviors in the absence of faults. This problem
statement is the natural extension to the problem statement in Section 2.6 where

fault-tolerance is added to a single class of faults.

143

 

Since
following
if there 6
and crea
of satisfy
If p']5’ i1
ways for :

synthesis

The l\‘Iul
Given p, 1
Identify I
5’ g s
P']S’ g

I .
P IS 1111

We State r]
The DeCiE
Chen 1)
DOG:

the ,

Since we require p’ to behave similar to p in the absence of faults, we stipulate the
following conditions: First, we require S’ to be a subset of S (i.e., S’ Q S). Otherwise,
if there exists a state 8 E S’ where s g! S then, in the absence of faults, p’ can reach 3
and create new computations that do not belong to p. Thus, p’ will include new ways
of satisfying spec from s in the absence of faults. Second, we require (p’IS’) Q (p[S’).
If p’IS’ includes a transition that does not belong to pIS’ then p’ can include new
ways for satisfying spec in the absence of faults. Thus, the problem of multitolerance

synthesis is as follows:

The Multitolerance Synthesis Problem

Given 1), S, spec, ffailsafea fnonmaskmg, and fmasking
Identify p’ and S’ such that

S’ g s,’

p’IS’ Q pIS’, and

p’ is multitolerant to f failsa fe, fnmmasking, and fmasking from S’ for spec. [I

We state the corresponding decision problem as follows:

The Decision Problem

Given 1), 5, SPEC, ffailsafea fnonmaskinga and fmasking:
Does there exist a program p’, with its invariant S’ that satisﬁes

the requirements of the synthesis problem? Cl

7 .2 Addition of Fault-Tolerance to One Fault-

Class

In the synthesis of multitolerant programs, we reuse algorithms Add-Failsafe,

Add-Nonmasking, and Add_Masking, presented by Kulkarni and Arora [1] (cf. Section

144

 

2,7). The
to a sing/
in this se<
The a
specificati
with the ;
following 1
nonmaskin
The in
property (
masking) I
program.
invariant 5
5” S S’. A
from any st;
of the fault-
Observatic
0f Addfaus;
invariant Sn

fme Sr {01‘ ‘5-

 

2.7). These algorithms respectively add failsafe/ nonmasking/ masking fault-tolerance
to a single class of faults. Hence, we recall the relevant properties of these algorithms

in this section.

The algorithms represented in Section 2.7 take a program p, its invariant S, its
speciﬁcation spec, a class of faults f, and synthesize an f—tolerant program p’ (if any)
with the invariant S’. The synthesized program p’ and its invariant S’ satisfy the
following requirements: (i) S’ Q S; (ii) p’ IS’ C; plS’, and (iii) p’ is failsafe (respectively,
nonmasking or masking) f-tolerant from S’ for spec.

The invariant S’, calculated by Add-Fai|safe (respectively, Add_Masking), has the
property of being the largest such possible invariant for any failsafe (respectively,
masking) program obtained by adding fault-tolerance to the given fault-intolerant
program. In other words, if there exists a failsafe fault-tolerant program p”, with
invariant S” that satisﬁes the above requirements for adding fault-tolerance then
S” Q S’. Also, if no sequence of fault transitions can violate the safety of speciﬁcation
from any state inside S then Add-Failsafe (cf. Section 2.7) will not change the invariant
of the fault-intolerant program. Hence, we make the following observations:
Observation 7.1. Let the input for Add_Fai|safe be p, S, spec and f. Let the output
of Add-Fai|safe be fault-tolerant program p’ and invariant S’. If any program p” with
invariant 8” satisﬁes (i) S” Q S; (ii) p”|S” g pIS”, and (iii) p” is failsafe f-tolerant
from S’ for spec then S” _C_ S’. D
Observation 7 .2. Let the input for Add_Failsafe be p, S, spec and f. Let the output
of Add_Failsafe be fault-tolerant program p’ and invariant S’. Unless there exists states
in S from where a sequence of f transitions alone violates safety, S’= S. [:1

Likewise, the f—span of the masking f—tolerant program, say T’, synthesized by
the algorithm Add-Masking (cf. Section 2.7) is the largest possible f—span. Thus, we

make the following observation:

Observation 7.3. Let the input for Add_Masking be p, S, spec and f. Let the

145

output Of
If any Pro
masking f
the mask”

The all
the inVaria
ObserVat
Observatj

Based ‘
rithmS Add
the outl)Ut
a 5272915 cla

exists.

Theorem I

sound and <

l
7.3 N1

In this secti

grams that
respectively
our S)’l1lh€8li

Given a

synthesize a

 

frnasking- Bl

both f,,

071 "IOU

\- kl

lie prom

output of Add-Masking be fault—tolerant program 19’, invariant S’, and fault-span T’.
If any program p” with invariant S” satisﬁes (i) S” Q S; (ii) p”|S" Q plS”, (iii) p” is
masking f-tolerant from S’ for spec, and (iv) T” is the fault-span used for verifying
the masking fault-tolerance of p” then S” _C_ S’ and T” g T’. C]
The algorithm Add-Nonmasking only adds recovery transitions from states outside
the invariant S to S. Thus, we make the following observations:
Observation 7.4. Add_Nonmasking does not add or remove any state of S. 0
Observation 7 .5. Add_Nonmasking does not add or remove any transition of pIS. El
Based on the Observations 7.1- 7.5, Kulkarni and Arora [1] show that the algo-
rithms Add_Failsafe, Add_Nonmasking, and Add_Masking are sound and complete, i.e.,
the output of these algorithms satisfy the requirements for adding fault-tolerance to
a single class of faults and these algorithms can ﬁnd a fault-tolerant program if one

exists.

Theorem 7.6. The algorithms Add_Fai|safe, Add-Nonmasking, and Add_Masking are

sound and complete. [:1

7 .3 Nonmasking-Masking Multitolerance

In this section, we present an algorithm for stepwise synthesis of multitolerant pro-
grams that are subject to two classes of faults fnmmasking and fmasking for which
respectively nonmasking and masking fault-tolerance is required. We also show that
our synthesis algorithm is sound and complete.

Given a program p, with its invariant S, its speciﬁcation spec, our goal is to
synthesize a program p’, with invariant S’ that is multitolerant to fnmmashng and
fmasking- By deﬁnition, p’ must be masking fmasking—tolerant. In the presence of
both fnmmaskmg and fmaskmg (i.e., fnmmaskmg U fmaskmg), 13’ must provide nonmasking

fnonmasking U fmasking’tOlerance.

We proceed as follows: Using the algorithm Add-Masking, we synthesize a masking

146

 

 

fnm-5k1n9-tl
program 1
from ever}
perturbed
tolerance I
to states 5
on the Ob
recovery b;

tolerance.

Figure 7.1.

 

fmasking-tolerant program p], with invariant S’, and fault-span Tmasking- Now, since
program p1 is masking fmasking-tolerant, it provides safe recovery to its invariant, S’,
from every state in (Tmasking—S’ ). Thus, in the presence of fnonmaskmgUfmasm-ng, if [)1 is
perturbed to (Tmasking_S’) then p1 will satisfy the requirements of nonmasking fault-
tolerance (i.e., recovery to S ’ ) However, if fnmmaskingu fmasking transitions perturb p1
to states 3, where 5 ¢ Tmaskmg, then recovery must be added from those states. Based
on the Observations 7.4 and 7.5, it sufﬁces to add recovery to Tmasking as provided
recovery by p1 from Tmasking to S’ can be reused even after adding nonmasking fault-
tolerance. Thus, the synthesis algorithm Add_Nonmasking-Masking is as shown in

Figure 7 . 1.

 

Add_Nonmasking_Masking(p: transitions, fnonmadcinm fmasking: fault,
S: state predicate, spec: safety speciﬁcation)
{

131,5", Tmasking 3: Add-Masking(p, fmaskinga 5: spec),

if (S’ = {}) declare no multitolerant program p’ exists;
return 0, Q);

19', T, ;: Add-NoanSk’ingUm fnonmasking U fmasking , Tmaskinga Spec);

return p’, S";

}

 

 

 

Figure 7.1: Synthesizing nonmasking-masking multitolerance.

Now, in Theorem 7.7, we show the soundness of Add_Nonmasking-Masking, i.e.,
we show that the output of Add-Nonmasking_Masking satisﬁes the requirements of the
problem statement in Section 7.1. Subsequently, in Theorem 7.8, we show the com-
pleteness of Add-Nonmasking-Masking, i.e., we show that if a multitolerant program
can be designed for the given fault-intolerant program then Add-Nonmasking-Masking

will not declare failure.

Theorem 7.7. The algorithm Add_Nonmasking_Masking is sound.
Proof. Based on the soundness of Add-Masking (cf. Theorem 7.6), S’ Q S.
Also, using the soundness of Add_Masking, we have p1 IS" Q pIS’. In addition, based

on the Observation 7.5, we have 191 IS’ = p’lS’. As a result, we have p’IS’ Q plS’.

147

NO“?! ‘

1. Abs
saris
not .

Obsc

2. Mas

ing /

“J
.

0‘!
x.

PilTn

spec.

3~ Noni
Add-l\

tOlOI‘a

 

Now, we show that p’ is multitolerant to fmmnaskmg and fmaski-ng from S’ for spec:

1. Absence of faults. From the soundness of Add-Masking, it follows that p1
satisﬁes spec from S’ in the absence of faults. Since Add_Nonmasking does
not add (respectively, remove) any transitions to (respectively, from) pllS’ (cf.

Observation 7.5), it follows that p’ satisﬁes spec from S’.

2. Masking fmaskmg-tolerance. From the soundness of Add-Masking, p1 is mask-
ing fmaskmg-tolerant from S’ for spec. Also, based on the Observation 7.4 and
7.5, Add_Nonmasking preserves masking fmaskmg-tolerance property of p1 since
plleaskz'ng = p’ ITmasking- Therefore, p’ is masking fmaskmg—tolerant from S’ for

spec.

3. Nonmasking (fnmmaskmg U fmaskmg)-tolerance. From the soundness of
Add_Nonmasking, we know that p’ is nonmasking (fnonmaskmg U fmasking}
tolerant from Tmaskmg for spec. Also, based on the Observation 7.4 and
7.5, Add_Nonmasking preserves masking fmasking-tolerance property of 101 since
plleasking = p’ leasking. Thus, recovery from Tmasking to S’ is guaran—
teed in the presence of fnmmasking U fmasking- Therefore, p’ is nonmasking

(fnonmasking U f masking)'tOl€rant from S’ for spec.

Based on the above discussion, it follows that p’ is multitolerant to fnonmaskmg and

fmaskmg from S’ for spec. Therefore, Add_Nonmasking_Masking is sound. Cl
Theorem 7.8. The algorithm Add_Nonmasking_Masking is complete.

Proof. Add-Nonmasking_Masking declares that a multitolerant program does not
exist only when Add-Masking does not ﬁnd a masking fmaskmg-tolerant program. Since
the synthesized program must be masking fmaskmg-tolerant, from the completeness of

Add-Masking, completeness of Add_Nonmasking-Masking follows. E]

148

7.4 I

In this sc
erant to t
failsafe an
synthesizi:
Let p l
spec, and
the “IUltlll
I

in the corn
safety is vi.
a set Of St;
sequence 0
that take p
as the trail.
Now. 5
AddMaskin
S

‘7718, am~
fault-tOJQFa,
IDS, we USO t
we use mt t
Add-Masking

m5,

’ a C01).
masking C011;

masking‘tOlel'

 

 

lemo _- I
- \

Observat i 01

 

7 .4 Failsafe-Masking Multitolerance

In this section, we investigate the stepwise synthesis of programs that are multitol—
erant to two classes of faults f fan” ,8 and fmasking for which we respectively require
failsafe and masking fault—tolerance. We present a sound and complete algorithm for

synthesizing failsafe-masking multitolerant programs.

Let p be the input fault-intolerant program with its invariant S, its speciﬁcation
spec, and p’ be the synthesized multitolerant program with its invariant 5’. Since
the multitolerant program p' must maintain safety of spec from every reachable state
in the computations of p’ [K f failsa fe U fmaskmg), p’ must not reach a state from where
safety is violated by a sequence of f fads“ [e U fmasking transitions. Hence, we calculate
a set of states, say ms (cf. Figure 7.2), from where safety of spec is violated by a
sequence of transitions of f {0,-Isa fa U fmasking. Also, p’ must not execute transitions
that take p’ to a state in ms. Hence, we deﬁne mt to include these transitions as well

as the transitions that violate safety of spec.

Now, since p’ should be masking fmaskmg-tolerant, we use the algorithm
Add_Masking to synthesize a program p1 given the input parameters p—mt, fmasking,
S —ms, and mt. We only consider faults fmasking because p1 need not be masking
fault-tolerant to f failsa f8. Since a multitolerant program must not reach a state of
ms, we use the state predicate S -ms as the input invariant to Add-Masking. Finally,
we use mt transitions in place of the spec parameter (i.e., the fourth parameter of
Add-Masking). Since Add_Masking treats mt as a set of safety-violating transitions,
it does not include them in the synthesized program p1. Thus, starting from a state
in S’, a computation of p1[]fmask,-,,g does not reach a state in ms. As a result, if
Tmasking contains a state 3 in ms, 3 can be removed while preserving the masking
fmaskmg—tolerance property of p1. Hence, we make the following observation:
Observation 7.9. In the output of the algorithm Add_Masking (cf. Figure 7.2),

removing ms states from Tmskmg preserves masking fmmkmg-tolerance property of

149

P1-

Now. 1
our synthi
goal. we a

the algorit

 

Add_Fail:

"133:1

mt 2:{
PLS’J
H(s':

pIaT, ::
return I

 

 

\

The 31%
State Dredic
that the nu
‘0 Addie”:

(T(
O

Add‘FallSa f8.
‘e’ltr

mask? "9
Theorem

PFOoﬁ Usi

S, g S. Bag,

Also, f1. 01
bserl'ati on

NO“?! “‘9

 

171- C]

Now, if faults f failsa [6 U fmaskz'ng perturb pl to a state 3, where 8 ¢ Tmasking then
our synthesis algorithm will have to ensure that safety is maintained. To achieve this
goal, we add failsafe ( f failsa [e U fmask,,,g)-tolerance to p1 from (Tmaskmq—ms) using

the algorithm Add_Failsafe.

 

Add.Failsafe.Masking(p: transitions, ffausafe, fmaskmg! fault, S: state predicate,
spec: safety speciﬁcation)
{

ms :2 {so : 331,32, ...sn : (Vj : 05j<n : (sj,s(j+1)) E (ffailsafe Ufmuking» A
(s(,.,_1), 3,.) violates spec };
mt := {(30,31) : ((31 Ems) V (30,31) violates spec) };
P1 a SI) Tmasking 1: Add-M03king(p - mt» fmaskz‘nga S-ms, mt);
if (S’={}) declare no multitolerant program p’ exists;
return 0,0;
P” T, := Add—FailsafCWIa ffailsafe U fmaskings Tmaskz'ng —m37 mt);
return p’, S’;

 

 

 

Figure 7.2: Synthesizing failsafe-masking multitolerance.

The algorithm Add-Fai|safe takes the program p1, faults ffailsafe U fmaskmg, the
state predicate (Tmask,,,g-ms), and the set of mt transitions as the set of transitions
that the multitolerant program is not allowed to execute. Since the input invariant
to Add_Failsafe (i.e., (Tmaskmg—ms» has no ms state, based on the Observation
7.2, the algorithm Add-Failsafe does not remove any state of (Tmask,,,g—n'zs). Also,
Add-Failsafe does not remove any transition of p1|(’l",.,,ask,-,,g —ms). Thus, we have
(p’l(Tmask.-ng—m8)) = (p1|(Tmask.-ng-ms)) and p’lS’ = I); IS’.

Theorem 7.10. The algorithm Add_Failsafe-Masking is sound.
Proof. Using the soundness of Add_Masking, we have 5’ Q (S —ms), and as a result,
S’ Q S. Based on the Observation 7.2, it follows that Add_Failsafe preserves S’ Q S.
Also, from the soundness of Add_Masking, it follows that pllS’ Q PIS’. Using the
Observation 7.9, we have p’lS’ Q PlS’.

N 0w, we show that p’ (cf. Figure 7.2) is indeed nmltitolerant to ffailsnfe and

150

fmasking frOIn S, for spec.

1. Absence of faults. From the soundness of Add_Masking (cf. Theorem 7.6),
it follows that p1 satisﬁes spec from S’ in the absence of faults. Thus, using
Observations 7.2 and 7.9, it follows that p’ satisﬁes spec from S’ in the absence

of faults.

2. Masking fmaskmg-tolerance. Based on the soundness of Add_Masking, p1 is
masking fmaskmg-tolerant from S’ for spec. Also, using the Observations 7.2

and 7.9, it follows that p’ is masking fmaskmg-tolerant from S’ for spec.

3. Failsafe ( f failsa fe U fmasking)-tolerance. From the soundness of Add-Fai|safe,
it follows that p’ is failsafe ( f Muse fe U fmaskmg)-tolerant from T’ for spec. Us-
ing Observation 7.2 and 7.9, since S’ Q (Tmaskmg—ms), no ffausafe U fmaskmg
transition can directly violate safety of spec from S’. Also, since (p’IS’) Q
(p’|(Tmask,-ng—ms)), p’ IS’ does not include any mt transitions. Thus, p’ is fail-

safe ( f failsa fe U fmaskmg)-tolerant from S’ for spec.

Based on the above discussion, it follows that p’ is multitolerant to ffailsafe and

fmasking from S" for spec. C]

Now, we present the completeness proof for Add_Masking algorithm.

Theorem 7.11. The algorithm Add-Fai|safe-Masking is complete.

Proof. If there exists a program p”, with invariant S”, and fault-span T” that is

multitolerant to f fails“ f8 and fmasking then p” must be masking fmasking-tolerant from

S” for spec. Thus, there must exist a program synthesized from p that is masking

fault-tolerant to fmasking faults. Also, since p” is multitolerant, it must maintain the
safety of spec in the presence of f [(1,-Isa fe and fmasking. Thus, we have T” 0 ms == (0
and p”|T” [’1 mt = (0. Now, the completeness of Add_Failsafe_Masking follows from the

completeness of Add_Masking and Add_Failsafe. C]

151

7.5

In this 5
program
SCCtiOn ‘
their fall
goritllm'
of the 3—5
tOkrance
iff the all
ant progr

synthcsis

7.5.1

In this 3‘
distinct C
deterrnini
For a
fmaskttlg’ '
the comp
computat
possible f
In suc
in the CO?
\-‘iolates S
cornputat.
the choice

all possibl

7 .5 Failsafe-Nonmasking-Masking Multitolerance

In this section, we show that, in general, the problem of synthesizing multitolerant
programs from their fault-intolerant version is N P-complete. Towards this end, in
Section 7.5.1, we show that the problem of synthesizing multitolerant programs from
their fault-intolerant version is in NP by designing a non-deterministic polynomial al—
gorithm. Afterwards, in Section 7.5.2, we present a mapping between a given instance
of the 3—SAT problem and an instance of the (decision) problem of synthesizing multi-
tolerance. Then, in Section 7.5.3, we show that the given 3-SAT instance is satisﬁable
iff the answer to the decision problem is afﬁrmative; i.e., there exists a multitoler-
ant program synthesized from the instance of the decision problem of multitolerance

synthesis.

7 .5.1 Non-Deterministic Synthesis Algorithm

In this section, we ﬁrst identify the difﬁculties of adding multitolerance to three
distinct classes of faults f {0,-Isa fa, fnmmasking, and fmasking. Then, we present a non-
deterministic solution for adding multitolerance to fault-intolerant programs.

For a program p that is subject to three classes of faults f failsa f6, fnonmaskmg, and
fmaskmg, consider the cases where there exists a state 3 such that (i) s is reachable in
the computations of p[]( f fail” f8 U fmaskmg) from invariant, (ii) 8 is reachable in the
computations of p[](fnonmasking U fmasking) from invariant, and (iii) no safe recovery is
possible from s to the invariant.

In such cases, we have the following options: (i) ensure that s is unreachable
in the computations of p[](ffailsafe U fmasking) and add a recovery transition (that
violates safety) from s to the invariant, or (ii) ensure that s is unreachable in the
computations of p[](fnonmaskmg U fmaskmg) and leave 3 as a deadlock state. Moreover,
the choice made for this state affects other similar states. Hence, one needs to explore

all possible choices for each such state 5, and as a result, brute-force exploration of

152

these or

Now
classes (
algorithi
program
Then, an

have sho
Adelu

{

 

 

 

 

 

 

 

 

 

 

 

 

 

Fio-

olll‘e 7
SlHC’e t

hQOPQIn

these options requires exponential time in the state space.

Now, given a program p, with its invariant S, its speciﬁcation spec, and three
classes of faults f 10,130 f6, fnmmaskmg, and fmasking, we present the non—deterministic
algorithm Add-Multitolerance. In our non-deterministic algorithm, ﬁrst, we guess a
program p’, its invariant S’ , and three fault-spans Tfailsafe, Tnonmaskmg, and Tmasking.
Then, we verify a set of conditions that ensure the multitolerance property of p’. We

have shown our algorithm in Figure 7.3.

 

Add.Multitolerance (p: transitions, f fad,“ fe, fnonmaskmg, fmaskmg: fault, S: state predicate,
spec: safety speciﬁcation)
{
ms := {so : 331,32, ...sn : (Vj : 0§j<n : (33,301”) E (ffausafe U fmasking)) A
(s(,,_1), sn) violates spec }; (1)
mt :2 {(30,31) : ((3167713) V (30.51) violates spec) }; (2)
GUESS P’, S” Tfailsafe, Tnonmaskz’ngv Tmaskz'ng; (3)
Verify the following conditions:
S, g. S; S, ¢ {}; S, g Tfailsafe; S, g Tnonmasking; S, g Tmasking; (4)
(V30 : so 6 S’ : (331 2: (30,31) 6 p’)); (5)
p’lS’ Q plS’; S’ is closed in p’; (6)
Tmaskmg is Closed in lelfmaskmg; (7)
Tmasking 0 m3 = 0; (plleasking) m "It : 0; (8)
(V30 : so 6 Tmaskmg : (331 :: (30,51) 6 p’)); (p’|(Tmask,~ng—S’)) is acyclic; (9)
Tfailsafe is Closed in p,[l(ffailsafe U fmaskz'ng); (10)
Tfaz'lsafe ﬂ m3 = 0; (plleailsafe) 0 mt = 0; (11)
Tnonmasking iS Closed in p[H(fnonmaskmg U fmasking); (12)
(V30 3 30 E Tnonmaskrng : (331 :5 (30:31) E P’)); (pluTnonmasking—S’» lS acyCIiC§ (13)
}

 

 

 

Figure 7.3: A non-deterministic polynomial algorithm for synthesizing multitolerance.

Theorem 7.12 The algorithm Add-Mu|titolerance is sound and complete. Cl
Since this algorithm simply veriﬁes the conditions needed for multitolerance in

polynomial time in the state space of the program, the proof is straightforward.

Theorem 7.13 The problem of synthesizing nmltitolerai‘it programs from their fault-

intolerant versions is in NP. [:1

153

7.5-32

In thig
stance
tio11 7':
of the i
of fault
the 3-5
ssAT

G1\'(‘ll

at

D098 t

a b

Next

synt hC’SlI

The Sta
invariant
on the pr
addithIli

and its 0

.1",

And. i

lgr<n

The tran

faultintole

7 .5.2 Mapping 3—SAT to Multitolerance

In this section, we give an algorithm for polynomial-time mapping of any given in—
stance of the 3—SAT problem into an instance of the decision problem deﬁned in Sec-
tion 7.1. The instance of the decision problem of synthesizing multitolerance consists
of the fault—intolerant program, p, its invariant, S, its speciﬁcation, and three classes
of faults f [aim fe, fnmmaskmg, and f-maskz'ng that perturb p. The problem statement for
the 3-SAT problem is as follows:

3-SAT problem.

Given is a set of propositional variables, a1, a2, ..., an and pal, ﬂag, ..., pan, where a,-
and pa,- are complements of each other, and a Boolean formula c = c1 /\ c2 /\

/\ CM, where each c]- is a disjunction of exactly three propositional variables.

Does there exist an assignment of truth values to 0.1,(12, ..., an such that c is satisﬁ-

able?

Next, we identify each entity of the instance of the problem of multitolerance

synthesis, based on the given instance of the 3-SAT formula.

The state space and the invariant of the fault-intolerant program, p. The
invariant, S, of the fault-intolerant program, p, includes only one state, say 3. Based
on the prOpositional variables and disjunctions of the given 3—SAT instance, we include
additional states outside the invariant. Speciﬁcally, for each propositional variable a,-

and its complement, we introduce the following states (cf. Figure 7.4):

' $1,33layiivi

And, for each disjunction c, = (a,- V ﬂak V or), where 1 g i g n, 1 g k g n, and
1 g r S n, we introduce a state zj outside the invariant (1 _<_ j S M).

The transitions of the fault-intolerant program. The only transition in the

fault-intolerant program is a self-loop (s, s).

154

 

 

Legend

 

 

frn """ > Masking faults
ff """" * Failsafe faults
fn “ """"" V Nonmasking faults

——> Program transition

 

 

 

 

Figure 7.4: The states and the transitions corresponding to the propositional variables in
the 3-SAT formula.

The transitions of ffailsafe- The transitions of f fail“ fe can perturb the program
from :17,- to 12,. Thus, the class of faults f fails“ fe is equal to the set of transitions

{(5r,,v,-):1_<_z'§ n}.

The transitions of fnonmasking° The transitions of fnmmaskmg can perturb the

program from $2 to 12,. Thus, we have fnmmaskmg = {(12, vi) : 1 g 2' S n}.

The transitions of fmaskmg. The transitions of fmaskmg can take the program from
s to y,. Also, for each disjunction c], we introduce a fault transition that perturbs
the program from state 3 to state 23- (1 _<_ j g M). Thus, the class of faults fmaskmg

is equal to the set of transitions {(s,y,-) : 1 S 2' g n} U {(s, zj) : 1 g j S 1%}.

The safety speciﬁcation of the fault-intolerant program, p. None of the
fault transitions, namely f foam 1e, fnmmaskmg, and fmasking identiﬁed above directly
violate safety. In addition, for each propositional variable a, and its complement pa,-

(1 S i g n), the following transitions do not violate safety (cf. Figure 7.4):

° (yiaxi)a($irs)a(yia$;)a(332:3)

And, for each disjunction cj = aivﬁakVar, the following transitions do not violate
safety:

. (zj) 1131‘), (er 1‘2) (Zjv (137-)

All transitions except those identiﬁed above violate safety of speciﬁcation. Also,

observe that the transition (1),-,3), shown in Figure 7.4, violates safety.

7 .5.3 Reduction From 3-SAT

In this section, we show that the given instance of 3-SAT is satisﬁable iff multitoler-
ance can be added to the problem instance identiﬁed in Section 7.5.2. Speciﬁcally, in
Lemma 7 .14, we show that if the given instance of the 3-SAT formula is satisﬁable
then there exists a multitolerant program that solves the instance of the multitoler-
ance synthesis problem identiﬁed in Section 7.5.2. Then, in Lemma 7.15, we show
that if there exists a multitolerant program that solves the instance of the multitol-
erance synthesis problem, identiﬁed in Section 7.5.2, then the given 3—SAT formula is

satisﬁable.

Lemma 7.14 If the given 3—SAT formula is satisﬁable then there exists a multitol-

erant program that solves the instance of the addition problem identiﬁed in Section
7.5.2.

Proof. Since the 3-SAT formula is satisﬁable, there exists an assignment of truth
values to the propositional variables a,, 1 S i S n, such that each C], 1 g j _<_ 1%,
is true. Now, we identify a multitolerant program, p’, that is obtained by adding
multitolerance to the fault-intolerant program p identiﬁed in Section 7.5.2.

The invariant of p’ is the same as the invariant of p (i.e., {3}). We derive the
transitions of the multitolerant program p’ as follows. (As an illustration, we have
shown the partial structure of p’ where a,- = true, (2;, = false, and a... = true (1 S

2’, k,r S n) in Figure 7.5.)

o For each propositional variable (1,, 1 g 2' g n, if a,- is true then we will include
the transitions (y,,a:,-) and (x,,s). Thus, in the presence of fnmskmg alone, p’

provides safe recovery to 3 through 23,-.

156

o For each propositional variable a,, 1 S 2' S n, if a,- is false then we will include
(y,, mi) and (23;, s) to provide safe recovery to the invariant. In this case, since
state 1),- can be reached from :13; by faults fmmmaskmg, we include transition
(1),-,3) so that in the presence of fmasking and fnmmaskmg program p’ provides

nonmasking fault-tolerance.

o For each disjunction cJ- that includes a,-, we include the transition (zj,:1:,-) iff a,-
is true. And, for each disjunction cj that includes pai, we include transition

(23,14) iff a,- is false.

 

 

 

 

 

 

 

‘,->-
”I v
Vi / ”k r.
o ’ o
I \ ,' I \__ _1 \
ft _fn .' ff ‘ zfn ff. fn
x-V Y: ‘. l " yk ll x V yr
lo‘__ . oxl “ xko o _. oxk r0...— 0 oxr
a ‘. I A
‘. I f I
{ma fm‘, fm: m.
. ____________________________ 1
> s Vg

 

 

 

Figure 7.5: The partial structure of the multitolerant program

Now, we show that p’ is multitolerant in the presence of faults f {0,130 f6, fnmmaskmg,

and fmasking -

S = pIS. Thus, p’ satisﬁes spec in the absence

 

O p’ in the absence of faults. p’

of faults.

o Masking tolerance to fmaskmg. If the faults from fmashng occur then the

program can be perturbed to (1) 31,-, 1S'iSn, or (2) z], lSj SIM.

In the ﬁrst case, if a, is true then there exists exactly one sequence of transitions,
((y,, mi), (:1:,-, 5)), in p’ []fma,k,-,,g. Thus, any computation of p’[]fmask,:ng eventually

reaches a state in the invariant. Moreover, starting from y,- the computations of

157

p’ [] fmasking do not violate the safety speciﬁcation. And, if a, is false then there
exists exactly one sequence of transitions, ((y,,:r§), (23:, 3)), in p’ [] fmaskmg- By
the same argument, even in this case, any computation of p’ [] fmaskmg reaches
a state in the invariant and does not violate the safety speciﬁcation during

recovery.

In the second case, since c, evaluates to true, one of the term in c] (a proposi-
tional variable or its complement) evaluates to true. Thus, there exists at least
one transition from zj to some state ark (respectively, $1,) where a], (respectively,
ﬂak) is a propositional variable in cj and ak (respectively, ﬂak) evaluates to true.
Moreover, the transition (23,261,) is included in p’ iff ak evaluates to true. Thus,
(2,-,rrk) (respectively, (23,132)) is included in p’ iff (anus) (respectively, (xL,s))
is included in p’. Since from :13,c (respectively, 2:3,), there exists no other transi-
tion in p’ﬂfmaskmg except (2:1,, 3), every computation of p’ reaches the invariant

without violating safety. Based, on the above discussion, p’ is masking tolerant

to fmasking-

Failsafe tolerance to fmaskmg U f fausa f6. Clearly, based on the case consid-
ered above, if only faults from fmasking occur then the program is also failsafe
fault-tolerant. Hence, we consider only the case where at least one fault from

f failsa fe has occurred.

Faults in ffa-ilsafe occur only in state a}, 1 S 2' S n. And, p’ reaches 1:,- iff a,- is
assigned true in the satisfaction of the given 3—SAT formula. Moreover, if a,- is
true then there is no transition from 22,-. Thus, after a fault transition of class

f failsa f3 occurs p’ simply stops. Therefore, p’ does not violate safety.

Nonmasking tolerance to fmaskmg U fnmmaskmg. This proof is similar
to the proof of failsafe fault-tolerance shown above. Speciﬁcally, we only need

to consider the case where at least one fault transition of class fnonmasking has

158

occurred.

Faults in fnonmasking occur only in state 17;, 1Si Sn. And, p’ reaches 2:; iff a, is
assigned false in the satisfaction of the given 3-SAT formula. Moreover, if a, is
false then the only transition from v,- is (12,, 3). Thus, in the presence of fmasking
and fnmmaskmg, p’ recovers to its invariant. (Note that the recovery in this case

violates safety.) D

Lemma 7.15 If there exists a multitolerant program that solves the instance of the
synthesis problem identiﬁed earlier then the given 3-SAT formula is satisﬁable.
Proof. Suppose that there exists a multitolerant program p’ derived from the
fault-intolerant program, p, identiﬁed in Section 7.5.2. Since the invariant of p’, S’,
is non-empty and S’ Q S, 5’ must include state 3. Thus, S’ = S. Also, since each
y,, 1 S i S n, is directly reachable from s by a fault from fmaskmg, p’ must provide
safe recovery from y,- to 3. Thus, p’ must include either ($11,561:) or (yr, 23:). We make
the following truth assignment as follows: If p’ includes (y,,r,) then we assign a,-
to be true. And, if p’ includes (yr-513;) then we assign a,— to be false. Clearly, each
propositional variable in the 3-SAT formula will get at least one truth assignment.
Now, we show that the truth assignment to each propositional variable is consistent

and that each disjunct in the 3-SAT formula evaluates to true.

0 Each propositional variable gets a unique truth assignment. Suppose that
there exists a propositional variable a,, which is assigned both true and false,
i.e., both (31,-, 15,) and (ft/r, 11:2) are included in p’. Now, 221 can be reached by the
following transitions (3, 31,-), (ping), and ((132, 11,-). In this case, only faults from
fmasking and fnonmasking have occurred. Hence, p’ must provide recovery from v,-
to invariant. Also, u, can be reached by the following transitions (3, 31,-), (y,, mi),
and (:r,,u,-). In this case, only faults from fmaskmg and f failsa fe have occurred.

Hence, p’ must ensure safety. Based on the above discussion, p’ must provide

159

a safe recovery to the invariant from 11,-. Based on the deﬁnition of the safety
speciﬁcation identiﬁed in Section 7.5.2, this is not possible. Thus, propositional

variable a,- is assigned only one truth value.

Each disjunction is true. Let c, = a,- V ﬂak V a, be a disjunction in the given
3—SAT formula. The corresponding state added in the instance of the multitol-
erance problem is 2,. Note that state 2,- can be reached by the occurrence of a
fault from fmasking from 3. Hence, p’ must provide safe recovery from zj. Since
the only safe transitions from 2,- are those corresponding to states Ii, 23;, and

(Er, p’ must include at least one of the transitions (zj, cm), (2,, $1,), or (23,113,).

Now, we show that the transition included from z]- is consistent with the truth
assignment of propositional variables. Speciﬁcally, consider the case where p’
contains transition (2,, 33,-) and a,- is assigned false, p’ can reach as,- in the presence
of faults from fmaskmg alone. Moreover, if a,- is assigned false then p’ contains the
transition (311,222). Thus, 51:; can also be reached by the occurrence of faults from
fmaskmg alone. Based on the above proof for unique assignment of truth values
to propositional variables, p’ cannot reach :13,- and as; in the presence of fmaskmg
alone. Hence, if (23-, 1:,) is included in p’ then a,- must have been assigned truth
value true. Likewise, if (zj, 2:2) is included in p’ then ak must be assigned truth
value false. Thus, with the truth assignment considered above, each disjunction

must evaluate to true. D

Theorem 7.16 The problem of synthesizing multitolerant programs from their fault-

intolerant versions is NP-complete. [3

7.5.4 Failsafe-Nonmasking Multitolerance

In this section, we extend the NP-completeness proof of synthesizing multitolerance

for the case where we add failsafe fault—tolerance to one class of faults, say f fags, f,,

160

and we add nonmasking fault—tolerance to another class of faults, say fnonmaskmg.
Our mapping for this case is similar to that in Section 7.5.2. We replace the
fmaskmg fault transition (3, 31,-) with a sequence of transitions of f fags, f, and fnmmashng
as shown in Figure 7.6. Likewise, we replace fault transition (s, 2,) with a structure
similar to Figure 7.6. Thus, y,- (respectively, z,) is reachable by f fausafe faults alone
and by fnmmasking faults alone. As a result, 2),- is reachable in the computations of
p’ﬂffailsafe and in the computations of p’ﬂfnonmask,,,g. Thus, to add multitolerance,
safe recovery must be added from u,- to 3 (cf. Figure 7.4). Now, we note that with
this mapping, the proofs of Lemmas 7.14 and 7.15 and Theorem 7.16 can be easily
extended to show that synthesizing failsafe-nonmasking multitolerance is NP-complete.

Thus, we have

Corollary 7.17. The problem of synthesizing failsafe-nonmasking multitolerant pro-

grams from their fault-intolerant version is NP—complete. D
yi
1 ' \,
wi o o W}
\ I
.ff fn
C

Figure 7.6: A proof sketch for NP-completeness of synthesizing failsafe—nonmasking multi-
tolerance.

7 .6 Summary

In this chapter, we investigated the problem of synthesizing multitolerant programs
from their fault-intolerant versions. The input to the synthesis algorithm included
the fault-intolerant program, different classes of faults to which fault-tolerance had to

be added, and the level of tolerance provided for each class of faults. Our algorithms

161

ensured that the synthesized program provided (i) the speciﬁed level of fault-tolerance
if a fault from any single class had occurred, and (ii) the minimal level of fault-
tolerance if faults from multiple classes occurred.

We presented a sound and complete algorithm for the case where failsafe (respec-
tively, nonmasking) fault-tolerance would be added to one class of faults and masking
fault-tolerance would be provided to another class of faults. Thus, in these cases, if
a multitolerant program could be synthesized for the given input program, our algo—
rithms would always produce one such fault-tolerant algorithm. The complexity of
these algorithms is polynomial in the state space of the fault-intolerant program.

For the case where one needs to add failsafe fault-tolerance to one class of faults
and nonmasking fault-tolerance to another class of faults, we showed that this problem
is NP-complete. As mentioned earlier, this result was counterintuitive as adding
failsafe and nonmasking fault-tolerance to the same class of faults can be done in
polynomial time. However, adding failsafe fault-tolerance to one class of faults and
nonmasking fault-tolerance to another class of faults is N P-complete.

Although the results focused in this chapter deal with the high atomicity model,
we note that the algorithms in high atomicity model are important in synthesizing
distributed fault-tolerant programs as well. Speciﬁcally, our algorithms identify a
limit up to which even highly powerful processes can add the necessary multitoler-
ance. Thus, the output of these algorithms can be used in identifying the limits that
distributed processes —along with their limitation on reading and writing variables
of the program— can achieve in terms of adding the necessary multitolerance. As an
illustration, we note that in Chapter 5, we have identiﬁed how algorithms in high
atomicity can be systematically used in enhancing the level of fault-tolerance to a

single class of faults.

162

Chapter 8

FTSyn: A Software Framework for
Automatic Synthesis of

Fault-Tolerance

In this chapter, we present the design and the internal working of the software frame-
work Fault—Tolerance Synthesizer (FTSyn) that we have developed for the synthesis
of fault-tolerant distributed programs. This framework allows the users to automat-
ically (respectively, interactively) add fault-tolerance. We also show that our frame-
work permits one to add new heuristics for adding fault-tolerance. Towards this end,
we describe the addition of several heuristics (based on the algorithms proposed in
[14] and in Chapter 5) for different steps involved in adding fault-tolerance. Further,
we show how one can easily change the internal representation of different entities in

the framework.

We have used our framework to synthesize several fault-tolerant programs among
them (i) a simpliﬁed version of an altitude switch that controls the altitude of an
aircraft by monitoring the altitude sensors and generating necessary command signals,
where the altitude switch tolerates the corruption of altitude sensors; (ii) a token

ring protocol that tolerates process-restart faults; (iii) an agreement protocol that

163

tolerates Byzantine faults; (iv) an agreement program that tolerates both Byzantine
faults and fail-stop faults; (v) an alternating bit protocol program that tolerates
message-loss faults, and (vi) a Triple Modular Redundancy program that tolerates
input—corruption faults. These examples illustrate the potential of our framework in
adding fault-tolerance to different types of faults with different natures.

We proceed as follows: in Section 8.1, we illustrate how the developers of fault-
tolerance can synthesize fault-tolerant programs using our framework. In Section
8.2, we present the design of the framework, and discuss the internal working of the
framework. In Section 8.3, we show how one can integrate new heuristics into our
framework. In Section 8.4, we present the way in which one can change the internal
representation of entities involved in the framework. In Section 8.5, we present a
simpliﬁed version of an altitude switch synthesized using our framework. We make

concluding remarks and discuss future work in Section 8.6.

8.1 Adding Fault-Tolerance to Distributed Pro-

grams

In this section, we ﬁrst describe the input and the output of our framework (of.
Section 8.1.1). Then, in Section 8.1.2, we give an overview of framework fractions
that participate in the automatic synthesis of fault-tolerant programs. We implement
a deterministic version of Add- ft algorithm (cf. Section 2.8) and a set of heuristics
developed in [14, 15] to synthesize a fault-tolerant program. Further, in Section
8.1.3, we illustrate how the users can interact with the framework in order to semi-

automatically synthesize a fault-tolerant program from its fault-intolerant version.

8.1.1 The Input / Output of the Framework
In this subsection, we explain how developers of fault-tolerance should prepare the

input to our framework and how the framework provides the output to its users. The

164

input of our framework consists of the abstract structure of the fault-intolerant pro-
gram, its invariant, its safety speciﬁcation, its initial states, and a class of faults. The
output of our framework is also the abstract structure of the fault—tolerant program,
represented by guarded commands.

We note that there exist automated techniques (e.g., [42, 43]) by which we can ex-
tract the abstract structure of programs written in common programming languages,
and then provide our framework with the abstract structure of programs. Moreover,
after the synthesis of a fault-tolerant program, there exist automated techniques (e.g.,
[44, 45, 46]) that allow us to reﬁne the abstract structure of the fault-tolerant pro—
gram while preserving its correctness and fault-tolerance properties. Next, we present
a very simple example of a token ring program to illustrate the way developers can
communicate with our framework to add fault-tolerance. Our goal is to provide an
overall picture about the input / output of our framework. Afterwards, in Subsection
8.1.2, we show the internal working of our framework and how it synthesizes the
fault-tolerant token ring program.

Token ring program The fault-intolerant program consists of four processes
P0,P1,P2, and P3 arranged in a ring. Each process P,, 0 S i S 3, has a variable
2:,- with the domain {—1,0, 1}. We say that process 1",, 1 S i S 3, has the token if
and only if (2:,- 9é :r,_1) and fault transitions have not corrupted P,- and P,_1. And,
P0 has the token if (2:3 = 2:0) and fault transitions have not corrupted P0 and P3.
Process P,, 1 S i S 3, copies 23,--1 to r,- if the value of :r, is different than x,_1. This
action passes the token to the next process. Also, if (20 = 2:3) holds then process
P0 copies the value of (2:3 69 1) to 230, where 69 is addition in modulo 2. Now, if we
initialize every 23,-, O S i S 3, with 0 then process P0 has the token and the token
circulates along the ring. In the input ﬁle of our framework, we specify the actions of

P0 as follows (keywords are shown in italic):

1 process PO

165

2 begin

3 (x0 == x3) -> x0 = ((x3+1)%2);
4 read x0, x3;

5 write x0;

6 end

Since processes P1, P2, and P3 are similar, we only present the action of process

P1 as follows.

1 process P1

2 begin

3 (x1 != x0) -> x1 = x0;
4 read x1, x0;

5 write x1;

6 end

Read / Write restrictions. Each process P,, 1 S i S 3, is only allowed to read 23,-_1
and :r,, and allowed to write 23,-. Process P0 is allowed to read 13 and 2:0, and write
2:0. We specify the read/ write restrictions of a process by read and write keywords
inside the body of the process (cf. lines 4 and 5 in the body of P1).

Faults. The faults are also modeled as a set of guarded commands that change the
values of program variables. In the case of the token ring program, the faults may
corrupt at most three processes. Also, in this example, the faults are detectable in
that a process that is corrupted can detect if it is in a corrupted state. Hence, we
model the fault at process P,- by setting :r, = —1. Thus, one of the fault actions that

corrupts 2:0 is represented as follows:

1 fault TokenCorruption

2 begin
3 ( ((xO!=-1)&&(x1!=-1)) II ((xO!=-1)&&(x2!=-1)) II
4 ((x0!=-1)&&(x3!=-1)) ll ((x1!=-1)&&(x2!=-1)) II
5 ((x1!=-1)&&(x3!=-1)) II ((x2!=-1)&&(x3!=-1)) )

166

6 -> x0 = -1;

7 end

Note that there exist no read/ write restrictions for the fault transitions because
we assume that fault transitions can read and write arbitrary program variables.
Safety speciﬁcation. The safety speciﬁcation of the fault-intolerant program is rep—
resented as a Boolean expression over program variables. In the token ring program,
the problem speciﬁcation stipulates that the fault-tolerant program is not allowed
to take a transition where a non-corrupted process copies a corrupted value from its
neighbor. In the input of the framework, we represent the speciﬁcation as follows.

1( ((XIS!=-1)&&(x1d==-1)) ll ((X2S!=-1)&&(x2d==-1)) I]
2 ((x33!=-1)&&(x3d==-1)) ll ((x38==-1)&&(x03!=x0d)) )

Note that we have added a sufﬁx “s” (respectively, suffix “d”) to the variable
names that stands for source (respectively, destination). Since the above condition
speciﬁes a set of transitions tspec using their source and destination states, we need to
distinguish between the value of a speciﬁc variable mi in the source state of tsp,c (i.e.,
23is means the value of mi in the source state of tap“) and in the destination state of
tsp“ (i.e., 2:id means the value of mi in the destination state of tspec).

Invariant. The invariant is also speciﬁed as a Boolean expression over program
variables. The invariant of the token ring program consists of the states where no
process is corrupted and there exists only one token in the ring. We represent the
invariant of the program using the invariant keyword followed by a state predicate.
1 invariant

2 ((XO==1)&&(x1==0)&&(x2==0)&&(x3==0)) ll

3 ((X0==1)&&(x1==1)&&(x2==0)&&(x3==0)) ll

4 ((X0==1)&&(x1==1)&&(x2==1)&&(x3==0)) ii

5 ((x0==1)&&(x1==1)&&(x2==1)&&(x3==1)) ll

6 ((x0==0)&&(x1==0)&&(x2==0)&&(x3==0)) ll

7 ((x0==0)&&(x1==0)&&(x2==0)&&(x3==1)) II

167

s ((x0==0)&&(x1==0)&&(x2==1)&&(x3==1)) ll
9 ((x0==0)&&(x1==1)&&(x2==1)&&(x3==1))

Initial states. We also specify some initial states in the input of the synthesis frame-
work. While these initial states are included in the invariant of the fault-intolerant
program, we ﬁnd that explicitly listing them assists in adding fault-tolerance. The

initial states of the token ring program are as follows (init and state are keywords):

1inﬁ

l
0
>4
M

ll
0
>4
(0

ll

2 state x0 = 0; x1 - O;

3 state x0

ll
H
>4
H

II
H
N
M

II
H
:14
(A)

ll

1;

The output fault-tolerant program. Finally, the output of our framework is also
generated in guarded commands. For the token ring program, the actions of process

P0 in the synthesized fault-tolerant program are as follows:

1 (x ==-1) && (x ==1) -> x0 := O;
2 l
3 (x0==1) && (x3==1) -> x0 := O;
4 l
5 (x0==0) && (x3==0) -> x0 := 1;
6 I
7 (x0==-1) && (x3==0) -> x0 := 1;

The above actions mean that P0 can copy the value of (2:3 6) 1) to 20 as long as

2:3 # -1. Next, we present the actions of synthesized process P1.

1 (x1==1) && (x0==0) -> x1 := O;
2 I
3 (x1==-1) && (x0==0) -> x1 := 0;
4 I
5 (X1==0) && (x0==1) -> 111 := 1;
6 l

7 (x1==-1) && (x0==1) -> x1 := 1;

168

The above actions stipulate that process P1 can copy the value of 2:0 to 271 if
((230 74 —1)/\ (2:1 74 230)) holds (i.e., P0 is not corrupted). Likewise, the synthesis
framework generates similar actions for the synthesized processes P2 and P3. We
would like to note that the token ring program that we have automatically synthesized

using our framework is the same as the program that was manually designed in [10].

8.1.2 Framework Execution Scenario

In this subsection, we discuss the sample execution scenario for the case where fault-
tolerance is added without any user interaction. Also, we use the token ring example
to illustrate the execution of the synthesis algorithm. In this execution scenario, the
synthesis algorithm consists of four fractions: Initialize, Preservelnvariant, Modify-
Invariant, and ResolveCycles (cf. Figure 8.1).

Expanding the reachability graph. Before the execution of the synthesis algo-
rithm, the framework uses initial states and program / fault transitions to generate the
state-transition graph of the fault-intolerant program. Since this directed graph only
includes those states of the state space that are reachable by program / fault transitions
from initial states, we call it a reachability graph of the fault-intolerant program. (It
also represents a reachable subset of the fault-span of the fault-intolerant program.)
The reachability graph of the token ring program. For the token ring program pre-
sented in Section 8.1.1, the reachability graph is equal to its state space and includes
81 states. Let (2:0, 2:1, 2:2, 2:3) denote a state of the token ring program. Thus, starting
from the initial state so = (0,0,0,0), fault transitions may perturb the program to
31 = (—1,0,0,0), where process P0 is corrupted. From 31, process P1 copies the cor—
rupted value and the fault-intolerant program reaches state 32 = (—1,—1,0,0). As
a result, starting from the given initial states, a combination of program and fault
transitions can take the state of the program to any possible state in the whole state

space.

169

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Smeazﬂoﬁvzae a w 88?? Sac—92:8 8 o a
0585:? 8:50 52 .N ..o homo—goo 2:. 8 can ﬂag
3860.5 28m a 8 .m 88»an m o How
one “8:8 98888 .025 3.33. . a ESE .
Boas» E ... ”a; A 3.5:» E u can a _
coats:— new:
058.82:
.m . a $5 232.3. a 8% .m a
_ om buxom ”w 35 mammawuowﬁuww .m 028.5 833. .8“
mu E 5.8 a 35
IN

 

 

 

 

.m 02%: noon: 8..
E @228 a 9%

 

 

_

 

l0

 

 

mn— bmaam no mam W

...atgségz a: 8.62... ..

 

 

 

 

 

m

m... Seam s new

«0

......Ma £3 a new

I
r

 

 

 

 

 

Ewing—atone...— 5 553....—

 

 

..r .8 SBSEUHN mam _

 

_ m3 BEBEUHESm T

._,

 

 

 

3.3.3 a 85......

 

 

 

 

 

 

55:52 285:3

 

 

1‘

.Eom 553.35

Figure 8.1: A deterministic execution scenario for the framework FTSyn.

170

Execution of fraction (I). After the expansion of the reachability graph, the
framework executes every step of the synthesis algorithm (i.e., F 1-F 6 in Figure 2.4)
on the reachability graph of the fault—intolerant program in order to derive a reach-
ability graph of the fault-tolerant program. First, in fraction (1) (cf. Figure 8.1),
the synthesis algorithm calculates the sets of ms states and mt transitions (in the

reachability graph).

The token ring program in fraction (1)111 the case of the token ring program, safety
is violated when a process copies a corrupted value from its neighbor. Thus, fault
transitions do not directly violate safety, and as a result, the set of ms states is
empty. Also, since ms is empty, the set of mt transitions is equal to the set of

program transitions that directly violate safety.

Execution of fraction (II). Then, the synthesis algorithm moves to fraction (II)
where we attempt to identify a valid fault-span T’ that (i) is closed in p’ [] f; (ii)
does not include any ms states or safety-violating transitions of mt, and (iii) does
not include any deadlock states outside the invariant. While executing in fraction
(II), we leave the invariant S’ unchanged. This is due to the fact that the addition
problem requires that the invariant of the fault-tolerant program is a subset of the
invariant of the fault-intolerant program. Thus, states inside the invariant of the
fault-intolerant program are important; removing them prematurely can cause the

automated synthesis to fail.

Also, when we remove ms states (respectively, remove mt transitions) from T’ in
order to satisfy F3, the new fault-span will be a subset of initial T’. As a result,
those transitions that start in the new fault—span and end in the part of T’ that is
not in the new fault-span violate the closure of the fault-span (i.e., F2) and must be
removed. Hence, after satisfying F3, we may need to re-satisfy F2. A similar scenario
can happen while resolving deadlock states (i.e., satisfying F4). Hence, fraction (II)

is an iterative procedure. The execution continues in fraction (II) until an iteration

171

does not cause any changes or until the number of iterations exceeds a predetermined

bound.

The token ring program in fraction (II). For the token ring program, the framework
removes (groups of) program transitions that violate safety of speciﬁcation. For
example, the transition that process P1 takes from 51 to so violates the safety of
speciﬁcation. Hence, the synthesis algorithm removes (31,32) in fraction (II). As
a result, 51 = (—1,0,0,0) becomes a state without any outgoing transition; i.e.,

deadlock state.

The execution of fraction (II) does not create any deadlock states inside the invari-
ant of the token ring program since ms is empty and no mt transition exists inside
the invariant. Thus, in the ﬁrst iteration, the synthesis algorithm only removes a
set of transitions in the fault-span outside the invariant (i.e., mt transitions and the

transitions that violate the closure of fault—span).

Execution of fraction (III). At the end of fraction (II), if the resulting program
does not satisfy F 1-F 6, we modify the invariant S’ in fraction (III) to ensure that
the invariant S’ is closed in the program p’, i.e., F5 is satisﬁed. In fraction (III), we
recalculate a valid invariant. In this fraction, the newly added transitions may violate
the closure of the fault-span. Thus, when we exit fraction (III), the conditions F 2-F 4
may need to be re—satisﬁed. Hence, we jump to fraction (II) and attempt to re—satisfy
F 2-F 4. Notice that in fraction (III), we satisfy F4 only for the invariant states; i.e.,
we ensure that there is no deadlock state inside the invariant whereas in fraction (II),

we resolve deadlock states that are in the fault-span but outside the invariant.

The token ring program in fraction (111). As we mentioned earlier, the removal of mt
transitions creates deadlock states outside the invariant of the token ring program.
For example, state 31 = (—1,0,0,0) became a deadlock state since the framework
removed a transition to 32 = (—1,—1,0,0) taken by P1. Now, in the fraction (III),

the framework adds recovery transitions to the invariant by allowing a corrupted

172

process to copy an uncorrupted value from its predecessor. Thus, from 32, process Po
can toggle the value of 2:3 and correct itself by moving to state 83 = (1, —1,0,0). Now,
from 33, process P1 copies 2:0 and takes the program to state 34 = (1, 1, 0, 0), which
is in the invariant. Note that since P1 cannot read variables 2:2 and 233, the group of
transitions associated with the transition (53, 84), say g34, includes 9 transitions. By
deﬁnition, the values of 2:3 and 2:4 remain unchanged in each transition of 934. Also,
P1 does not propagate a corrupted value by executing transition (s3, 84). Thus, no
transition in g34 violates safety of speciﬁcation.

Execution of fraction (IV). If the values of p’, S’, and T’ satisfy formulae F 2-
F5 at the end of fraction (III) then we will ensure that p’ will not stay outside its
invariant forever. Toward this end, we move into fraction (IV) where we remove
reachable non—progress cycles in T’-—S’ (if any).

The token ring program in fraction (IV). As long as there exists an uncorrupted
value, the token ring program can propagate that value along the ring and recover to
the invariant. Since faults can perturb at most three processes, the existence of an
uncorrupted process is always guaranteed. Also, no non-progress cycles exist outside
the invariant of the token ring program. Thus, in this automatic execution scenario,
our framework generates the fault-tolerant token ring program presented in Section

8.1.1 by adding safe recovery from deadlock states outside the invariant.

8.1.3 User Interactions

Although the framework can automatically synthesize a fault-tolerant program with-
out user intervention, there are some situations where (1) user intervention can help to
speed up the synthesis of fault-tolerant programs, or (ii) a fully automatic approach
fails. In this subsection, we present the nature of the interactions that fault-tolerance
developers can have with our framework.

Our framework permits developers to semi-automatically supervise the synthesis

173

procedure. In such supervised synthesis, fault-tolerance developers interact with the
framework and apply their insights during synthesis. In order to achieve this goal, we
have devised some interaction points (cf. Figure 8.1) where the developers can stop

the synthesis algorithm and query it.

At each interaction point, the users can make the following queries: (1) apply a
speciﬁc heuristic for a particular task; (ii) apply some heuristics in a particular order;
(iii) view the incoming program (respectively, fault) transitions to a particular state;
(iv) view the outgoing program (respectively, fault) transitions from a particular state;
(v) check the membership of a particular state (respectively, transition) to a speciﬁc
set of states (respectively, transition); e.g., check the membership of a given state 3
in the set of ms states, and ﬁnally (vi) view the intermediate representation of the
program that is being synthesized. Since our goal is to focus on the technical details
of the framework and its application in adding fault-tolerance, we omit the details
about the user interface of the framework. We refer the reader to the tutorial about

using this framework in the Appendix B.

While we expect that the queries included in this version will be sufﬁcient for
a large class of programs, we also provide an alternative for the cases where the
heuristics fail and these queries are insufﬁcient. Speciﬁcally, in such cases, the users
of our framework need to determine what went wrong during synthesis. The answer to
this question is very diﬁicult without the help of automated techniques, especially for
programs with large state space. To address this issue, developers of fault-tolerance
can obtain the corresponding intermediate program in a syntax compatible with the
Promela modeling language [37]; this program can then be checked by the SPIN
model checker to determine the exact scenario where the intermediate program does
not provide the required fault-tolerance property. The counterexamples generated by
SPIN enable the users to identify the appropriate heuristics that should be applied

in subsequent steps of synthesis.

174

8.2 Framework Internals

The integration of new heuristics into our framework (respectively, modifying the
internal representation of framework entities) requires some background knowledge
about the design and the internal working of our framework. Hence, in this section,
we present preliminary information that helps the users of the framework (especially
the developers of heuristics) to understand the internal working of the framework. We
use this information in Sections 8.3 and 8.4 to describe how the framework permits
the addition of new heuristics and the ability to change the internal representation of
its entities.

We organize this section as follows: In Section 8.2.1, we introduce the important
classes (i.e., abstract data structures) used in the design of the framework and their
relationship. Then, in Section 8.2.2, we identify three important design patterns that

help to make the design of the framework extensible.

8.2. 1 Class Modeling

The input to the synthesis algorithm consists of the following entities: program, pro-
cess, fault, safety specification, invariant, and initial states. Hence, we create the follow-
ing classes corresponding to each entity: Program, Process, Fault, SafetySpecification,
Invariant, and InitiaIStates. Also, since we can generate the fault-span (i.e., reachability
graph) of the fault-intolerant program using the initial states and program (respec-
tively, fault) transitions, we regard the fault-span of the fault-intolerant program as
an input entity. Thus, we model the fault—span of the fault-intolerant program using
ReachabilityGraph (RG) class. The synthesis framework takes the input entities and
then executes the synthesis algorithm in order to generate a fault-tolerant program,
its invariant, and its fault-span. Thus, we model the output entities using the same

category of classes Program, Invariant, and RG.

We depict the class diagram of the synthesis framework in Figure 8.2. This ﬁgure

175

 

conicaabﬁﬂmok— conﬁne—tin".

 

 

 

 

 

 

 

 

 

 

 

*2— r

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

zocooSowaoxow V .6333: _ {AV 033:;
_ .....o
azo ....._ .....—
_ _ _ ll. :3".
_qll
gasogsaﬁsm 552:
m:90=o:_m§.F .._ EoEoEm
a.
.. o _ ...:
Eﬁwoi 289i coco< “5:0

 

  

 

  
   

 

 

 

 

\

 

A Ste—.5

8868mm

 

Ill

..oéomomEOo

J

”ccowo...

k

 

Figure 8.2: The class diagram of FTSyn.

176

identiﬁes the important classes and their relationship. For example, each Process is
composed of one or more Action objects. (We annotate the composition relation by
black diamonds attached to an arrowed line.) Every Process is associated with zero
or more TransitionGroup objects that are created due to the read restrictions of that
process. (We illustrate associations by solid lines.) Finally, we have derived some new
classes from the original classes of our abstract design by inheritance relationship.
(We annotate inheritance by a solid line attached to a triangle.) For example, we
have an abstract class Transition from which we have inherited two concrete classes

ProgramTransition and FaultTransition.

8.2.2 Design Patterns

In this section, we identify three important design patterns [47], Bridge, Facto-
ryMethod, and Strategy, that we use in our framework. The advantage of using design
patterns with respect to traditional abstract data types stems in the level of ﬂexibility
and reusability that these design patterns provide in the design and implementation
of our framework.

We use the Bridge design pattern (cf. Figure 8.3) in order to achieve extensibil-
ity. The Bridge pattern is a structural design pattern [47] that allows us to sepa-
rate the design class hierarchy from the implementation class hierarchy. This way,
we can independently extend the design and the implementation of the framework
by subclassing. For example, we can introduce different implementation hierarchies
corresponding to the AbstractProgram class, where these implementation hierarchies
implement a common interface Programeplementor (cf. Figure 8.3).

Another requirement for the developers of fault-
tolerance is the ability to apply a speciﬁc heuristic at a particular stage of
synthesis. Hence, the framework has to dynamically instantiate different classes
that represent different heuristics at run-time. In order to achieve this goal, we

use the FactoryMethod design pattern (cf. Figure 8.4). The FactoryMethod pattern

177

Abstraction Hierarchy- Implementation Hierarchy
L—-Client

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

AbstractPrograni [ «interface»
impRet O— Program_lmplemento
+isDeadlock() E +isDeadIockImp0
Program Brogramlmplomentatlom]
+isDeadlock() +isDead|ocklmp()

 

 

 

 

 

 

 

Figure 8.3: The Bridge design patterns.

is a creational pattern [47] that facilitates the dynamic instantiation of objects at
run—time. Hence, if one adds a new heuristic in the form of a new class, which is
extended from the abstract design of the framework, then the users of the framework

can activate the newly added heuristic at run-time.

    
 

Client

A

  

 

 

Graph

 

Iristantiates

 

+solveDeadlock()

 

 

 

Figure 8.4: The FactoryMethod design patterns.

As we mentioned in the Introduction, the developers of heuristics should be able
to easily integrate new heuristics into the framework. We presented the contribution
of the Bridge and the FactoryMethod patterns respectively in achieving extensibility
and dynamic instantiation of heuristics at run-time. Yet another issue is the design

of different versions of a heuristic. In the case where there are different algorithms for

178

a speciﬁc step of the synthesis algorithm, we need to implement different versions of
a particular class (respectively, method). For example, in resolving deadlock states,
we may have different heuristics for dealing with a deadlock state. Hence, we need to

have different versions of the solveDeadIock method of the RG class (cf. Figure 8.5).

 

 

no lDeadlockResolveIl

 

 

 

 

 

 

 

 

+solveDeadlock() J +Resolve()

so

 

 

 

 

 

 

 

[DeadlockResolven Ineadlocknesolverz] Ineadlocknesolverd

 

 

 

 

 

 

 

 

 

 

+Resolve() +Resolve() +Resolve()

 

 

Figure 8.5: Integrating the deadlock resolution heuristics using Strategy pattern.

We use the Strategy pattern [47] to provide a flexible solution to the above-
mentioned problem. In particular, we design a DeadlockResolver class for deadlock
resolution (cf. Figure 8.5). This class has a method called Resolve, where we im-
plement our deadlock resolution heuristic. Then, we apply the Strategy pattern to
DeadlockResolver so that the developers of heuristics can extend new classes from the
DeadlockResolver class and integrate their own heuristic in the Resolve method (cf.
Figure 8.5). Finally, in the solveDeadIock method of the RG class, we use the Fac-
toryMethod design pattern in order to dynamically instantiate different subclasses of

the DeadlockResolver class at run-time.

8.3 Integrating New Heuristics

In this section, we address the problem of adding new heuristics into our framework
(i.e., the second goal mentioned in the Introduction). Speciﬁcally, we show how one
can integrate a new heuristic into our framework so that the added heuristic will be
available to the developers of fault-tolerance during synthesis. Since a new heuristic

will be integrated into a new class or into a method of an existing class, the problem of

179

adding new heuristics to the framework reduces to the problem of adding new classes

(respectively, methods) to the framework.

We have used the ability to add heuristics for adding several heuristics from [14,
31, 15]. Of these heuristics, we now present the integration of the three heuristics

that we added for resolving deadlocks and discuss our experience in adding them.

First heuristic. Kulkarni, Arora, and Chippada [14] present a heuristic for deadlock
resolution that includes two passes. In the ﬁrst pass, their heuristic tries to add single-
step recovery transitions from a given deadlock state, 50,, to the invariant. Due to
distribution restrictions, when their heuristic adds a recovery transition, tree , it has
to add the group, grec , of transitions that is associated with tree. Moreover, the
addition of gm, is not allowed if there exists a transition (30,31) 6 gm, such that (i)
(30,31) 6 mt; (ii) (30,31 6 S) /\ (30,31) ¢ p; (iii) (so E T’) A (s, 9! T’), or (iv)
(so E S) /\ (31 at S). If adding recovery from so is not possible, and so is directly
reachable from the invariant by fault transitions then their heuristic does nothing in

the ﬁrst pass. Otherwise, their heuristic makes 3,, unreachable.

In the second pass, if there still exists a deadlock state 3,, that is directly reachable
from the invariant by fault transitions then their heuristic makes 5,, unreachable by
removing the corresponding invariant state. At the end of deadlock resolution, if
the invariant is empty then they declare that their heuristic could not synthesize a
fault-tolerant program. We have integrated their heuristic into the framework using
the DeadlockResolverl class (cf. Figure 8.5) that inherits from the DeadlockResolver

class.

Second heuristic. The ﬁrst heuristic only adds single-step recovery to deadlock
states. As a result, it fails in cases where single-step recovery is not possible. For
example, the ﬁrst heuristic fails in the case where recovery from a deadlock state, say
32,, is possible via another deadlock state, say 3.1, from where we have already added

a recovery transition to the invariant. Hence, we develop a new heuristic for adding

180

 

Tll'll -

inva

poin
the f
state
tion,
states
ﬁxpoi
is pos
in the
will he
ery pa
PFOgra.

heurisr
|

In t
sitions ,
r e(Wires
States_

In of

 

class Dee
then impl
Third In
heurispic

outside t t

heufl-Stic 1

multi-step recovery to deadlock states for the cases where single-step recovery to the

invariant is not possible.

Our new heuristic also consists of two passes. In the ﬁrst pass, we conduct a ﬁx-
point computation that searches through the deadlock states outside the invariant in
the fault-span. In the ﬁrst iteration of the ﬁxpoint computation, we ﬁnd all deadlock
states from where single-step recovery to the invariant is possible. In the second itera-
tion, we ﬁnd all deadlock states from where single-step recovery is possible to recovery
states explored in the ﬁrst iteration. Continuing thus, we reach an iteration of the
ﬁxpoint computation where either no more deadlock states exist or no more recovery
is possible. In the latter case, we choose to deal with the remaining deadlock states
in the second pass. In the former case, at the end of the ﬁxpoint computation, we
will have a set of states, RecoveryStates, from where there exists a multi—step recov-
ery path to the invariant. (Notice that adding a recovery transition in a distributed
program requires the satisfaction of the grouping requirements described in the ﬁrst

heuristic.)

In the second pass, we try to remove 3,, if 5,, is directly reachable by fault tran-
sitions from the invariant and no recovery can be added to sd. If the removal of 3,,
requires the removal of one or more invariant states then we remove those invariant
states. During deadlock resolution, if the invariant becomes empty then we declare

that the synthesis framework failed to synthesize a fault-tolerant program.

In order to integrate this new heuristic into our framework, we extended a new
class DeadlockResolver2 (cf. Figure 8.5) from the abstract class DeadlockResolver and

then implemented our new heuristic in its Resolve method.

Third heuristic. The strategy of the third heuristic is similar to that in the second
heuristic, except that the domain of the ﬁxpoint computation includes all the states
outside the invariant in the fault-span (i.e., (T’ —— S’ )) In other words, the third

heuristic is more general than the second heuristic. (Likewise, the second heuristic is

181

 

11101
the
reco
— t0
simp
8.5) 1
The
0f flit
the s]
gener:
Appm
Tl.-
tICS (111
as the
either
frame“

the Iatt

AS We 11
ificatgn

 

“E’s Whit.
Where 0“
0f the {Tel

95,901“in

In tlii.

—
——-+

 

more general than the ﬁrst heuristic.) We have also used this heuristic for enhancing
the fault-tolerance of nonmasking programs —— where the program only guarantees
recovery to the invariant in the presence of faults and not necessarily a safe recovery
- to masking fault-tolerance [15]. The integration of the third heuristic was fairly
simple. We integrated the third heuristic into a class DeadlockResolver3 (cf. Figure
8.5) extended from the abstract class DeadlockResolver.

The application of heuristics. The second heuristic sufﬁces for the synthesis
of the fault-tolerant token ring program presented in Subsection 8.1.1. However, in
the synthesis of a version of the Byzantine agreement program containing four non-
general processes, since the second heuristic failed, we applied the third heuristic (see
Appendix B for this program).

The developers of fault-tolerance have the option to select one of the above heuris-
tics during synthesis. Despite the generality of the third heuristic, it is not as efﬁcient
as the ﬁrst two heuristics. Therefore, given a particular problem, the developers can
either use their insight to choose the appropriate heuristic or they can rely on the
framework to make that choice. The former choice provides more efficiency whereas

the latter choice allows more automation.

8.4 Changing the Internal Representations

As we mentioned in the Introduction, it is difﬁcult to determine a priori the internal
representation that one should use for different entities, namely Program, Fault, Spec-
ification, and Invariant, involved in the synthesis of fault-tolerant programs. Thus, it
is necessary to provide the ability to modify the internal representation of these enti-
ties while reusing the remaining parts of the framework. In fact, there are situations
where one needs to use one internal representation while executing in one fraction
of the framework, and a different internal representation for the same entity while
executing in another fraction of the framework.

In this section, we argue that our framework enables such a change of internal

182

 

repi

exp

in O

in tl
well

point.

Imp]
class

violar

   
  
  
  
  
   
 

We V0;
to ver
ificatic
especi;
S}'Iltl1e

Places,

 

SFTltlleg

0fthe

we,
data 3”
states 0
311bst itu
that, Wm
H the SD'
repregem
t0 eXPC‘m
instead 0-

representation for entities involved in our framework. Towards this end, we discuss our
experience in changing the internal representation of SafetySpecification and Invariant
in our framework. We ﬁnd that the ability to modify the representation of entities
in this fashion is especially useful for improving the efﬁciency of the framework as
well as in simplifying the tasks involved in responding to user queries at interaction

points. We discuss these applications next.

Improving the efﬁciency. The initial implementation of the SafetySpecification
class consisted of a linked list whose elements would each represent a set of safety-
violating transitions. The SafetySpecification class includes a method violates by which
we verify whether a given transition t violates the safety speciﬁcation or not. In order
to verify the safety of t, we needed to traverse the linked list structure of SafetySpec—
ification. The traversal of the SafetySpecification structure was very time-consuming,
especially when the size of the state space would become large. Since during the
synthesis of a fault-tolerant program we need to invoke the method violates in many
places, the efﬁciency of this method signiﬁcantly degrades the overall efﬁciency of the
synthesis. Hence, we changed the data structure used for the internal representation

of the SafetySpecification class.

We replaced the linked list structure of the SafetySpecification class with a dummy
data structure. Now, for a given transition t, we ﬁrst take the source and destination
states of t (speciﬁed as st and (1,). In order to verify the safeness of t, we then
substitute the values of the program variables at 3, and d, into the state predicates
that represent the safety speciﬁcation (e.g., refer to Section 8.5 or Subsection 8.1.1 ).
If the speciﬁcation predicate holds for st and dt then t violates safety. (Note that we
represent safety speciﬁcation as a set of transitions that the program is not allowed
to execute.) We have applied the same approach for the Invariant class. Therefore,
instead of traversing a huge linked list data structure, we check only a predicate in

order to ﬁnd out the safeness of a transition or the membership of a state to the

183

 

invariant.

Reasoning about a query. As we discussed in this section, we have two differ-
ent implementations for the SafetySpecification class based on the linked list and the
dummy data structures. The latter data structure helps to improve the efﬁciency of
the synthesis when we need to automatically synthesize a fault-tolerant program with-
out user intervention. On the other hand, when users interact with our framework,
they may need to know why a particular transition violates the safety speciﬁcation.
To answer this query, the framework uses the information stored in the linked list
data structure in order to provide the required reasoning for the users. Thus, in such
situations, the framework switches the implementation of the SafetySpecification class
from a dummy to a linked list data structure to provide the required reasoning for

the developers of fault-tolerance.

8.5 Example: Altitude Controller

In this section, we show how we used our framework to synthesize a simpliﬁed version
of an altitude switch (ASW) used in aircraft altitude controller. We have adapted
this example from [48] and the output program of our framework is the same as the
fault-tolerant program that is manually designed in [48]. This example illustrates the

applicability of our framework in automatic synthesis of practical applications.

The program of the altitude switch reads a set of input variables coming from
two analog altitude sensors and a digital altitude sensor. Then, the ASW program

activates an actuator when the altitude is less than a pre-determined threshold.

The fault-intolerant altitude switch (ASW). The ASW program monitors a
set of input variables and generates an output. There exist ﬁve internal variables, a
mode variable that determines the operating mode of the program, and four input
variables that represent the state of the altitude sensors. The internal variables are

as follows: (i) AltBelow is equal to 1 if the altitude is below a speciﬁc threshold,

184

otherwise, it is equal to 0; (ii) ActuatorStatus is equal to 1 if the actuator is powered
on, otherwise, it is equal to 0; (iii) 1 nit represents the system initialization when it
is equal to 1; otherwise, it is equal to 0; (iv) Inhibit is equal to 1 when the actuator
power-on is inhibited; otherwise, it is equal to 0, and (v) Reset is equal to 0 if the

system is being reset.

The ASW program can be in three different modes: (i) the Initialization mode
when the ASW system is initializing; (ii) the Await-Actuator mode if the system is
waiting for the actuator to power on, and (iii) the Standby mode. We use an integer
variable Status with domain {—1,0, 1,2} to show the system modes in the program
where (i) Status = —1 if the system is in the initialization mode; (ii) Status = 0 if
the system is in the Await-Actuator mode; (iii) Status = 1 if the system is in the

Standby mode, and (iv) Status = 2 if the system is in a faulty state.

Moreover, we model the signals that come from the input (analog and digital)
altitude sensors using the following variables: (i) AltF ail is equal to 1 when analog
and digital altitude meters are failed; (ii) if the system remains in the Initialization
mode more than 0.6 second then the variable I nitF ailed will be set to 1. Otherwise,
I nitF ailed remains 0; (iii) if the condition AltF ail = 1 remains true more than 2
seconds then the variable AltFailOver will be equal to 1. Otherwise, AltFailOuer
remains O, and (iv) if the system remains in the Await-Actuator mode more than
2 seconds then the variable AwaitOver will be equal to 1. Otherwise, AwaitOver

remains 0.

The output of the ASW program is identiﬁed based on the system mode. The
ASW program has an output integer variable WakeupActuator that is equal to 1 if
the system is in the Await-Actuator mode and is equal to 0 otherwise. The domain

of all variables except Status is equal to {0, 1}.

The fault-intolerant program consists of only one process, called Controller. In the

input of our framework, we specify the Controller process as follows:

185

 

1 pnxmss Controller

2 begin

3

4 ((Status == -1) && (Init == 1)) -> Status = 1; Init = O;

5 l

6 ((Status == 1) && (Reset == 0)) -> Status = -1; Reset = 1;

7 l
a ((Status == 1) && (AltBelow == 0) && (Inhibit == 0)

9 && (ActuatorStatus ==O)) -> Status = O; AltBelow

ll
H
.0

10 l

11 ((Status == 0) && (ActuatorStatus == 0)) -> Status = 1; ActuatorStatus

ll
p
b-

12 l

13 ((Status == 0) && (Reset == 0)) -> Status = -1; Reset = 1;

15 wead AltBelow, ActuatorStatus, Init, Inhibit, Reset,

1s AltFail, InitFailed, A1tFailOver, AwaitDver, Status;

13 uwne WakeupActuator, AltBelow, ActuatorStatus,
19 Init, Inhibit, Reset, Status;

20 end

The program changes its mode from Initialization to Standby when the I nit vari-
able is equal to 1. Also, the program goes to the Initialization mode when it is either
in Standby or in Await-Actuator mode and the reset Signal is received. If the pro-
gram is in the Standby mode and the actuator power-011 is not inhibited and the
actuator is not powered on then the program goes to Await-Actuator mode. In the
Await-Actuator mode, the program either (i) powers on the actuator and goes to the
standby mode, or (ii) goes to the Initialization mode upon receiving the reset signal.

The read / write sections in the body of the Controller process identify its read / write

restrictions on the program variables.

Faults. If the altitude sensors incur malfunction then the state of the program will

186

 

be

Safet

not ('l

.5

i.e_, ,
the 111

is not

2 ((A1
3 ((St

4 ((St.

S

 

be perturbed to a faulty state. We represent the fault actions as follows:

1 .ﬂndt Malfunction

2 begin

4 (InitFailed == 1 ) -> InitFailed = 0; Status = 2;
5 l

s (AltFailOver == 1 ) -> AltFaileer = 0; Status = 2;
7 l

s (AwaitOver == 1 ) -> AwaitOver = 0; Status = 2;

10 end

Safety speciﬁcation. The problem speciﬁcation requires that the program does
not change its mode from Standby to Await-Actuator if the altitude sensors are failed;
i.e., AltF ail is equal to 1. Also, from the faulty state, the program can only go to
the Initialization mode. Moreover, in the faulty state, the program can recover if it

is not reset. In the input ﬁle, we represent the speciﬁcation as a state predicate.

1

0)) ll
0)))ll

2 ((AltFails == 1) && (Statuss == 1) && (Statusd

1) ll (Statusd

3 ((Statuss == 2) && ((Statusd =

4 ((Statuss == 2) && (Resets == 1))

As we described in Subsection 8.1.1, to distinguish the value of a variable (e.g.,
AltF ail) at the source of a transition from its value at the destination, we append
the variable names with sufﬁxes ’s’ and ’d’ (e.g., AltF ails and AltF ails).
Invariant. The invariant of the program consists of the states where the program

is not in the faulty state; i.e., Status aé 2. We specify the invariant as follows:

1 invariant
2

3(Status != 2)

187

Initial states. We specify the initial state as follows:

1 init

2

3 sane

4 WakeupActuator = O;
5 AltBelow = 1;

6 ActuatorStatus = O;
7 Init = 1;

s Inhibit = O;

9 Reset = 0;

1o AltFail = O;

11 InitFailed = 1;

12 AwaitOver = 1;

13 AltFaileer = 1;

14 Status = -1;

Fault-tolerant program. The framework automatically generates the following

fault—tolerant program. We present the actions of the Controller process as follows:

1 ((Status == -1) && (Init = 1)) -> Status = 1; Init = O;

2 l

3 ((Status == 1) && (Reset 0)) -> Status -1; Reset = 1;
4 |

5 ((Status == 1) && (AltBelow == 0) && (Inhibit == 0)

5 && (ActuatorStatus ==0) && ( AltFail == 0))

7 -> Status = 0; AltBelow = 1;
8 l

9 ((Status == 0) && (ActuatorStatus == 0)) -> Status = 1; ActuatorStatus = 1;

10 l
11 ((Status == 0) && (Reset == 0)) -> Status = -1; Reset = 1;

12 l

188

14 (Status == 2) w (Reset == 0) -> Status = -1; Reset = 1;

The fault-tolerant program has a new recovery action (cf. Line 14), where it
recovers to the initialization mode from faulty state (i.e., states where Status = 2
holds). Also, a new constraint has been added to the third action (cf. Lines 7-9)
where the program is allowed to change its state to the Await-Actuator mode only

when the input sensors are not corrupted; i.e., the condition (AltF ail = 0) holds.

8.6 Summary

In this chapter, we presented a framework for adding fault-tolerance to existing fault-
intolerant programs. We showed that our framework is extensible in that it permits
easy addition of new heuristics that help in reducing the complexity of adding fault-
tolerance. The framework also allows one to partially change the internal represen-
tation of different entities used in the synthesis while reusing other entities. These
abilities are especially useful for testing different heuristics as well as testing the effect
(in terms of space, time, etc.) of different internal representations of entities involved
in synthesis. Finally, since we have developed the framework in Java, it is platform-
independent; we have used this framework on Windows/Solaris environment. We
also ﬁnd that the choice of this implementation makes our framework suitable for
pedagogical purposes.

Using our framework, we have synthesized fault-tolerant programs for, among
others, token ring, agreement in the presence of Byzantine faults, and agreement in
the presence of Byzantine and failstop faults. Thus, these examples demonstrate that
the framework can be applied for the cases where we have different types of faults
(process restart, Byzantine and failstop), and for the cases where a program is subject

to multiple simultaneous faults.

189

 

Chapter 9

Ongoing Research

In this chapter, we present ongoing research work, where we have developed prelimi-
nary results. Speciﬁcally, we focus on developing heuristics that can extend the scope
of efﬁcient synthesis by transforming non-monotonic programs (respectively, speciﬁ-
cations) to monotonic. Such heuristics are especially beneﬁcial where for a speciﬁc
program the monotonicity property (deﬁned in Section 4.3) holds, whereas no guar-
antees are provided for the monotonicity of its speciﬁcation (or vice versa). Towards
this end, we present a set of heuristics for transforming non-monotonic programs
(respectively, speciﬁcations) to monotonic where we beneﬁt from Theorem 4.11 and

synthesize fault-tolerant distributed programs in polynomial time.

Moreover, in this chapter, we present a SAT-based synthesis approach where we
use state-of-the-art SAT solvers to synthesize fault-tolerant distributed programs. In
particular, we show how we reduce different sub-problems in the synthesis of fault-
tolerant programs to the satisﬁability problem. Afterwards, we show how we im-
plement our SAT-based approach in the FTSyn framework (presented in Chapter
8).

We proceed as follows: In Section 9.1, we present our heuristics for transforming
non-monotonic programs (respectively, speciﬁcations) to monotonic. Then, in Sec-

tion 9.2, we present an algorithm for transforming non—monotonic speciﬁcations to

190

monotonic. We demonstrate our transformation algorithms by an example in Section
9.3. Subsequently, in Section 9.4, we present our SAT-based synthesis method. We

summarize this chapter in Section 9.5.

9.1 Program Transformation

In this section, our goal is to address the following question: Given a fault~intolerant
distributed program and its invariant that do not satisfy monotonicity requirements,
how can one modify the program and its invariant such that monotonicity requirements
are met while ensuring that the program satisﬁes its speciﬁcation from the modiﬁed
invariant? To address this question, ﬁrst, we formally deﬁne the problem of trans-
forming programs to monotonic (failsafe-ready) programs in Subsection 9.1.1. Then,
in Subsection 9.1.2, we present an algorithm for solving the transformation problem.

Finally, in Subsection 9.1.3, we show the soundness of our transformation algorithm.

9.1.1 Problem Statement

Given a program p, a state predicate Y, and a Boolean variable 11:, if p is not positive
(respectively, negative) monotonic on Y with respect to :1: then our goal is to identify
a program p’ and a state predicate Y’ such that p’ is positive (respectively, negative)
monotonic on Y’ with respect to at. We require p’ not to add new computations to
the set of computations of p during such transformation. Thus, Y’ should be a subset
of Y. Otherwise, if Y’ includes a state 3, where s Q Y, then p’ may create new
computations from s, which is not desirable. Also, for the same reason, p’ must not
include new transitions during such transformation. Thus, we require that the set of
transitions of p’ on Y’ is a subset of the set of transitions of p on Y’ (i.e., p’ IY’ E PlY’).

Hence, we state the problem of transforming non-monotonic programs as follows:

191

Problem 9.1.1 'Itansforming Non-Monotonic Programs to Monotonic

Given p, Y, spec, and :1: such that p satisﬁes spec from Y, and
p is not positive (respectively, negative) monotonic on Y with respect to :1:
Identify p’ and Y’ such that
Y’ (_Z Y,
p’lY’ Q plY’, and
p’ is positive (respectively, negative) monotonic on Y’ with respect to a:

p’ satisﬁes spec from Y’. [:1

Before we present our algorithms, we recall the deﬁnition of the monotonicity
property from Section 4.3. Observe that in the deﬁnition of monotonicity, we implic-
itly refer to transitions (30,31) and (33, 31) where the value of all variables except :1:
is the same in so and 56 (respectively, in 31 and s’l). Hence, we introduce the concept
of symmetric transitions with respect to :1: as follows:

Deﬁnition 9.1.2. We say two transitions t = (30,51) and t’ = (36, 3’1) are symmetric
with respect to a Boolean variable :5 (denoted t =x t’) iff the condition ((2:(so) 2:
:r(sl))/\(:r(s()) = a:(s’1))/\(a:(so) aé :r(s{,))) holds and the value of all variables in so and

I ‘ ' I
30 (respectively, 1n 31 and 51) are the same. [I]

9.1.2 'IYansformation Algorithm

In this subsection, we present a sound algorithm to solve Problem 9.1.1. We use the
Deﬁnition 9.1.2 in the design of our transformation algorithm (see Figure 9.1). The
algorithm To_Positive_Monotonic-Programs is an iterative procedure that takes the set
of groups of transitions of a distributed program, a state predicate Y, and a Boolean
variable :1: and generates a distributed program p’ and a state predicate Y’ such that
p’ is positive monotonic on Y’ with respect to r. Intuitively, our algorithm removes

the program transitions that go against the monotonicity property. Removing such

192

transitions may create deadlock states in program invariant. Hence, we recalculate
another invariant to guarantee that no deadlock states exist in the new invariant. If
our algorithm succeeds in ﬁnding such an invariant then we generate a monotonic
(failsafe-ready) program. Otherwise, our algorithm declares failure in generating a

monotonic program.

 

T0.Positive_Monotonic.Program(p: set of transitions, 1:: Boolean variable, Y: state predicate )
// p is the union of a set of groups of transitions go, - - - ,gm.
{
Step 1: p’ := p; Y’ := Y;
Step 2: repeat {
Step 2-1: TRrem := {(30,31) : (1(30) = false) /\ (17(31) = false) /\ ((30,331) E p’IY’) /\

(3(3613'1li(36139::(30131)3(3613'1)¢ P'lY')};
Step 2-2: if (TRrem = 0) then

Step 2-2-1: Y’ ,p’ 2: Recalculatelnvariant(p’, Y’);
Step 2—2-2: if ((Y’ 95 0)) return p’, Y’;
else declare failure in finding a monotonic program;
Step 2-3: t := (so, 31), where (so, 31) E TRrem and so has the maximum outdegree;
Step 2-4: 1” == P"{(82,83)1(391391€ p’ = t E g:- A(82183) E 91)}
Step 2-5: Y1 := RemoveDeadlocks(p’ , Y’);
Step 2-6: p1 := EnsureClosure(p’, Y1);
Step 2—5: p’ := p1; Y’ := Y1;
Step 3: } until (Y’ = );
Step 4: declare failure in finding a monotonic program;

}

 

 

 

 

Figure 9.1: Transforming non-monotonic programs to positive monotonic.

After the initialization, in Step 2-1 (cf. Figure 9.1), we calculate the set of tran-
sitions that violate the deﬁnition of positive monotonicity. If there exist no such
transitions (i.e., TRrem = (ll) then we will verify (i) the non-existence of deadlock
states in Y’, and (ii) the closure of p’ in Y’. When we reach Step 2-2-1, we recalculate
a valid invariant for p’ by invoking the function RecalculateJnvariant (cf. Figure 9.2).
Obviously, if we reach Step 2—2-1 in the ﬁrst iteration then that means the input
program p and Y inherently satisfy the monotonicity requirements. Note that Steps
2-1 and 2-2 verify the monotonicity of the input program, and hence, we do not need

develop a separate veriﬁcation algorithm.

To recalculate the invariant, we develop an iterative procedure where we ﬁrst

193

use function RemoveDeadlocks to remove the existing deadlock states of p in a state
predicate S (cf. Figure 9.2). The RemoveDeadlocks function returns the largest subset
S1 of S where there exist no deadlock states; i.e., the computations of p are inﬁnite
in S1. After removing the deadlock states of S, there might exist transitions of p that
start in S1 and reach the removed states of S. Such transitions violate the closure of
Sl. Using function EnsureClosure (cf. Figure 9.2), we remove (groups of) transitions
that violate the closure of S. We repeat this procedure until there exist no more
deadlock states or we remove all states of S. (We invoke the function HasDeadlocks

that veriﬁes if there exist deadlock states in a state predicate S of a program p.)

 

Recalculatelnvariant(p : set of transitions, S : state predicate)

// p is the union of a set of groups of transitions go, - - - ,gm.
{

5’ ‘-= S; p’ := 10;

repeat {

51 :2 RemoveDeadlocks(p’, S’ );

p1 :2 EnsureClosure(p’,Sl);

P’ == 101; 5" == 31;
} until (-1 HasDeadlocks(p’,S’) V S’ = (l );
return S’,p’;

}

RemoveDeadlocks( p : set of transitions, S : state predicate)
// Returns the largest subset of S such that computations of p within that subset are inﬁnite

{ S’ := S
while (330 : soeS’ : (V31 :31 65’ : (80,31)¢p)) 5’ == 5’ - {30};
return 5’; }

HasDeadlocks(p : set of transitions, S : state predicate)
// Verify the existence of deadlock states in S.
{ if (330 : so€S : (V31 : 3165 : (so,sl)¢p)) return true;
return false; }

EnsureClosure(p : set of transitions, S : state predicate)
// p is the union of a set of groups of transitions go, - - - ,gm.
{return p—{(so,s1) : (39,- :gi E p: ((so,sl) E 93-) A
(saga) = (861%) e 91-: (st 6 S A si e S)))} }

 

 

 

Figure 9.2: Algorithms for removing deadlock states and ensuring the closure of the in- .
variant.

In Step 2—3 (see Figure 9.1), we select one of the transitions of T Rrem, say t,

whose source state has the maximum number of outgoing transitions (i.e., outdegree).

194

Afterwards, we remove the group of transitions associated with t (cf. Step 2-4). In
this way, we reduce the chance of creating more deadlock states. Then, since the
removal of transitions may create deadlock states, we invoke RemoveDeadlocks (in
Step 2-5). Afterward, we use EnsureClosure to remove the transitions (and their
associated groups) that violate the closure of Y1. We continue the iterative procedure
of the algorithm To-Positive_Monotonic_Program until in an iteration either (i) the
state predicate Y’ becomes empty (in Step 3 or in Step 2—2-2), or (ii) we ﬁnd a
positive monotonic program (in Step 2-2-2).

Likewise, we design an algorithm To-Negative_Monotonic_Programs for transform-
ing distributed programs to negative monotonic programs. The only difference be-
tween such algorithm and To_Positive_Monotonic_Programs is in calculating the set
of transitions TRmm (see Step 2-1 in Figure 9.1), where we replace the condition

(($(so) = false) /\ (1(31) 2 false)) with (($(so) = true) /\ (23(31) = true)).

9.1.3 Soundness

In this subsection, we show that the algorithm To-Positive-Monotonic-Programs (cf.
Figure 9.1) is sound; i.e., the transformed program satisﬁes the requirements of Prob—
lem 9.1.1. Towards this end, we make the following observations:

Observation 9.1.3 The function RemoveDeadlocks returns a subset S’ of a predicate
S where the computations of program p in S’ are inﬁnite.

Proof. Since RemoveDeadlocks only removes states with no outgoing program tran-
sitions, it follows that S’ does not have new states (i.e., S’ Q S). Also, every state
that remains in S’ has at least one outgoing transition in p. Otherwise, it would have
been removed. Therefore, the computations of p are inﬁnite in S’. C]
Observation 9.1.4 The functions RemoveDeadlocks and EnsureClosure do not add
any new transitions to the set of transitions of program p.

Proof. The proof follows by construction. (:1

Observation 9.1.5 The function RecalculateJnvariant does not add any new states

195

(respectively, transitions) to the invariant (respectively, the set of transitions) of pro-
gram p.

Proof. The proof follows from Observations 9.1.3 and 9.1.4. [3
Theorem 9.1.6 The algorithm To_Positive_Monotonic_Programs is sound.

Proof. We show that the program generated by To-Positive-Monotonic-Program

satisﬁes the requirements of Problem 9.1.1.

0 Y’ _C_ Y. The algorithm To_Positive_Monotonic-Program calculates state predi-
cate Y’ by invoking RecalculateJnvariant (in Step 2-2-1) and RemoveDeadlocks

(in Step 2-5). Hence, using Observations 9.1.3 - 9.1.5, it follows that Y’ E Y.

, o p’IY’ (_i pIY’. The algorithm To-Positive_Monotonic-Program modiﬁes the tran-
sitions of the input program p in Steps 2-2-1, 2-4, and 2-6. Based on observations
9.1.4 and 9.1.5, Steps 2-2-1 and 2—6 do not add any new transitions to the set of
transitions plY’. Also, by construction, Step 2-6 does not add new transitions

to pIY’ as well. Thus, it follows that p’ IY’ Q plY’.

o p’ is positive monotonic on Y’ with respect to it. Since the set

of transitions TR¢8m identiﬁes transitions of plY that violate the deﬁnition
of positive monotonicity of p, and in the ﬁnal iteration of the algorithm
To_Positive_Monotonic_Program the set of transitions TRrem becomes empty,
it follows that when the algorithm To_Positive_Monotonic-Program terminates
there exist no transitions in p’IY’ that violate the positive monotonicity of p’
on Y’. As a result, the program p’ returned by To_Positive-Monotonic_Program

is positive monotonic on Y’ with respect to :r.

o p’ satisﬁes spec from Y’. Based on Observation 9.1.3, Y’ is a subset of Y

where the computations of p are inﬁnite. Also, using the requirements Y’ _C_ Y
and p’ IY’ g p|Y’, it follows that the computations of p’ in Y’ are a subset

of computations of p in Y’. Since starting in Y every computation of p is in

196

spec, it follows that starting in Y’ every computation of p’ is in spec. Also, by

construction, Y’ is closed in p’. Thus, p’ satisﬁes spec from Y’.

Based on the above discussion, it follows that To_Positive_Monotonic-Program is

sound. CI

Theorem 9.1.7 The complexity of algorithm To_Positive_Monotonic-Programs is poly-
nomial in the state space of the input program.
Proof. The maximum number of iterations of the while loop in the body of Re-
moveDeadlocks function (cf. Figure 9.2) is in the order of ISI. Also, for program p,
since S g Sp, it follows that the worst-case complexity of RemoveDeadlocks is O(|Sp|).
A similar reasoning shows that the worst-case complexity of HasDeadlocks is 0(ISPI).
Also, the number of groups of transitions of p is polynomial in lSpl since in a
distributed program each transition is associated with a group of transitions, and the
number of transitions included in each process is in the order of ISplz. Moreover,
by construction, the size of each group is in the order of lSpl as well. As a result,
the worst-case complexity of the EnsureClosure (cf. Figure 9.2) will be polynomial in
ISp|. Based on the above discussion, the complexity of RecalculateJnvariant will be
polynomial in ISpl since the loop inside this function can iterate at most ISPI times.
Now, in the To-Positive_Monotonic_Programs algorithm, the maximum number
of iterations of the main loop cannot exceed IYI, where the algorithm removes all
states in Y and declares failure in Step 4. Also, each step of the algorithm has a
polynomial-time complexity based on the above discussion. Therefore, the complex-
ity of To-Positive_Monotonic_Programs is polynomial in the state space of the input

program. [I]

9.2 Speciﬁcation Transformation
In this section, our goal is to address the following question: How can safety spec-

iﬁcations be strengthened to meet the monotonicity requirements? To address this

197

question, in Subsection 9.2.1, we present a formal deﬁnition for the problem of trans-
forming non-monotonic speciﬁcations to monotonic. Then, in Subsection 9.2.2, we

present a sound algorithm for solving the transformation problem.

9.2.1 Problem Statement

Given a safety speciﬁcation specsf, a state predicate Y, and a Boolean variable x, if
specsf is not positive (respectively, negative) monotonic on Y with respect to a: then
our goal is to derive a speciﬁcation specgf that is positive (respectively, negative)
monotonic on Y with respect to :r. In such derivation, we require that if a transition
t satisﬁes spec’s, then t will satisfy specs, as well. As a result, specgf will be a
strengthened version of specs]. Hence, we state the problem of transforming non-

monotonic speciﬁcations to monotonic as follows:

Problem 9.2.1 Transforming Non-Monotonic Speciﬁcations to Monotonic

Given Y, specsf, and :1: such that specsf is not positive (respectively, negative)
monotonic on Y with respect to :1:
Identify spec’s! such that
specsf Q spec’s,

specgf is positive (respectively, negative) monotonic on Y’ with respect to x [3

Note that we represent safety speciﬁcations spec_,f and spec’s! as two sets of bad
transitions in the state space that must not occur in program computations (cf. Sec-
tion 2). Thus, the condition specs; Q spec’s, states that spec’sf is a restricted version

of spec,, by adding more transitions to specs]; i.e., strengthening specsf.

9.2.2 Transformation Algorithm

To address the transformation Problem 9.2.1 for positive monotonicity, we present an

algorithm that takes a safety speciﬁcation spec”, a state predicate Y, and a Boolean

198

variable :12, and generates a safety speciﬁcation spec’s! such that spec’s, is positive

monotonic on Y with respect to :13.

 

To_Positive_Monotonic_Speciﬁcation(specsf: safety specification, Y: state predicate,
:13: Boolean variable)
{

Step 1: TRadd := {(so,sl) : (:r(so) = false) A (:1:(sl) = false) /\
(so 6 Y) /\ (31 E Y) /\ ((so,sl) ¢ specsf) /\
(3(8618'1) =(8613'1)=x (30,31) =(5618'1) E Specsfil;

 

Step 2: return specs; U TRadd;

}

 

 

 

Figure 9.3: Transforming non-monotonic speciﬁcations to monotonic.

In Step 1, the algorithm To-Positive_Monotonic_Specification calculates the set of
transitions that violate the deﬁnition of positive monotonicity of speciﬁcation. Then,
the algorithm strengthens the speciﬁcation specs; by adding the set of good tran-
sitions TRadd to the existing set of bad transitions (speciﬁed by specsf) in order
to construct a new safety speciﬁcation spec’qf. The new speciﬁcation specgf is repre-
sented by a new set of bad transitions specstTRadd. Since the speciﬁcation returned
by To-Positive_MonotonicSpecification is a strengthened version of the original speci-
ﬁcation specs], the soundness of the above algorithm follows accordingly. (In the case
of negative monotonic speciﬁcations, we present a similar algorithm by replacing the
condition ((a:(so) = false) /\ (513(31) = false)) with ((:1:(so) = true) /\ (:1:(sl) = true))
in Step 1 in Figure 9.3. )

Theorem 9.2.2 The algorithm To_Positive_Monotonic_Specification is sound. El

Theorem 9.2.3 The complexity of algorithm To_Positive_Monotonic_Specification is
polynomial in the size of Y. D
Comment on strengthening the speciﬁcation. Strengthening the speciﬁcation does not
destroy the fault-safe property of the speciﬁcation. Speciﬁcally, the transformation
of a speciﬁcation to a monotonic speciﬁcation adds new transitions to the set of bad

transitions that must not occur in program computations. Since such new transitions

199

are program transitions, no fault transition will be included as a safety-violating
transition. As a result, the fault-safe property of the speciﬁcation will be preserved
during the transformation.

Also, since we add new transitions to the speciﬁcation during transformation,
there may exist program transitions in the invariant that do not violate the original
speciﬁcation but violate the strengthened monotonic speciﬁcation. Such transitions
must not occur in the computations of the transformed program, otherwise the pro—
gram will violate the safety of the strengthened speciﬁcation. In the next section, in
the context of an example, we illustrate how we identify and remove such transitions

from the invariant and then recalculate a new invariant.

9.3 Example: Distributed Control System

In this section, we present an example where we use our transformation algorithms for
efﬁcient addition of failsafe fault-tolerance. Speciﬁcally, we ﬁrst present a distributed
controlling program that is subject to input faults; i.e., the faults that perturb the
input sensors of the program. Then, we transform the speciﬁcation of the controlling
program to a positive monotonic speciﬁcation. Since the program is negative mono—
tonic, efﬁcient (i.e., polynomial-time) addition of failsafe fault-tolerance to it becomes
possible.

The fault-intolerant process-control program (PC). The program PC con-
sists of three processes P1, P2, and P3 connected by a loosely-coupled network. The
processes P1 and P2 respectively control the speeds of two electro motors M1 and
M2 located in the same environment but in distant places. The motors M1 and M2
provide the driving force of a conveyer belt that can move in two different directions:
left-to-right and right-to—left. The conveyer belt carries fragile objects that are loaded
when the belt is stationary. Once the objects are loaded, the conveyer belt moves

with an increasing speed up to a maximum speed. Then, the belt stops so that the

200

already loaded objects can be unloaded and new objects are loaded.

The speed of the conveyer belt depends 011 the speeds of All and A42. The speeds
of MI and M2 should be synchronous; i.e., the speed of M1 is equal to the speed of
Mg or is at most one unit more than the speed of M2. When the two electro motors
reach their maximum speed, process P3 resets their speed to O and the whole process
repeats. It is required that the temperature of the environment where electro motors

function should not exceed a pre—determined threshold.

The program PC has four integer variables :r, y, z, and w. The variable 2: (respec-
tively, z) is a counter that contains the speed of M1 (respectively, M2). The domain of
:1: (respectively, .2) is equal to {0, - - ' ,c}, where c is an integer constant. The variable
y is used to represent the movement direction of the conveyer belt. Speciﬁcally, if
the direction of the conveyer belt is from left to right then the value of y alternates
between 1 and 0. In the case where the conveyer belt moves from right to left, the
value of y alternates between -1 and 0. Moreover, the value of y is equal to 0 if a: = 2.
Otherwise, y could be 1 or -1. As a result, the domain of y is equal to {—1,0, 1}. The
variable 11) represents the temperature of the environment, which could be in three
different levels of normal, alarming, and critical that are respectively represented by

three values 0, 1, and 2.

Let (:13,y,z,w) denote the global state of the distributed program. The initial
state of the program is (0, 0, 0,0), where process P1 starts to speed up (i.e., increment
its counter). Process P1 is responsible to increment :1: and process P2 increments .2.
When both counters reach the maximum value c (i.e., (:1: = c) /\ (z = c)) the counting

operation will be restarted by process P3.

Read/write restrictions. Process P1 is allowed to read :1:, y, and z and it can
only write :1: and y. Process P2 can read 2:, y, and 2, but it is only allowed to write y
and 2:. Process P3 is allowed to read all program variables, however, it can only write

:1: and 2. Note that P1 and Po cannot read 111 due to distribution restrictions.

201

Program actions. We present the action of process P1 as follows:

PC1: (:1:=z)/\(:1:<c) ——-> .r:=:r+1;y:=1|—1;

When 1% and 1112 have the same speed (i.e., at = z), P1 increments the value of at
(i.e., the speed of M1). The action PCl indeed represents two actions depending on

the direction of the belt (i.e., the value of y). The action of process P2 is as follows:

P02: (:1:=z+1) ——> y:=0;z:=z+1;

Process P2 increments the value of z (i.e., the speed of M2) and resets y to zero
since 2 has become equal to :13. Finally, the transitions of P3 are represented by the

following action:

If both counters have reached the maximum value c (i.e., 1111 and M2 have reached
their maximum speed) then P3 resets their values to 0.
Safety speciﬁcation. For application-speciﬁc purposes, the safety speciﬁcation
stipulates that in the case where the belt is moving in the right-to—left direction and
the temperature level is in a critical level (i.e., w = 2), the speed of M2 must remain
less than the speed of M1; i.e., speed of the belt must not be increased in critical

temperature. We represent the safety speciﬁcation of PC with specPC, where

3P€CPC = {(50131) 3 (31(50): “1)A (13(31) = 431)) /\ (“1131) = 2)}

Invariant. The temperature should be in the normal level in ordinary working
conditions. Hence, we represent the invariant of the program by the state predicate

S pC, where

5pc = {81 (111(3) = 0) /\ (($(8) = Z(8)) V (11(3) = 2(8) +1))}

202

 

All and .Mg are synchronized in the invariant; i.e., ((:r = z) V (:1: = z + 1)).
Faults. Faults may change the value of the temperature sensor to 1 or 2 when
the speed of All is ahead of A12. We represent the fault transitions by the following

action:

F: (:r=z+1) ——> w:=1|2;

Fault-span. We represent the fault-span of program PC by the following state

predicate:

no = {s : (Ms) = z<s>> V(a:(s) = z<s>+1>> A ((2r(s) = 2(8)) => (11(3) = 0)>v
((2213) = z<s>+11 => ((1/(s) = 11v<y<s>= -1))> }

Note that the value of 11) could vary in its domain {0, 1,2}.

Negative monotonicity of program PC. Since w is not a Boolean variable,
we apply the deﬁnition of program monotonicity on the program PC by partitioning
the domain of w to zero and non-zero values. We consider the Boolean value true
corresponding to non-zero values of w and the Boolean value false corresponding to
(w = 0). Since there exists no transition in PC ISPC where the value of w is non-
zero, it follows that the deﬁnition of negative monotonicity holds for the program PC.
Thus, the program PC is negative monotonic on S [)0 with respect to w.

Positive monotonicity of specpc. Now, we investigate the positive monotonicity
of specPC on S [DC with respect to 111. First, we identify the set of transitions (so, 3 1)
that satisfy the following conditions: (i) so, 31 6 Spa; (ii) (w(so) = 0) A (1U(81) = 0);
(iii) (so,sl) does not violate specPC, and (iv) there exists transition (36,3’1) that is
grouped with (so,sl) due to inability of reading w, where (56,3’1) violates specPC
and (111(36) 75 0) A (111(3’1) 76 0). Thus, for the speciﬁcation specPC, the algorithm

To_Positive-Monotonic_Specification calculates the set of transitions TRadd (cf. Figure

9.3) as follows:

203

TRadd = {(so, 31) : (:r(so) = z(so) + 1) A (y(so) = —1)A(w(so) = 0) A
(33(31) = Z(51))A(y(31) = 0)A(wl31) = 0)}

The set of transitions TRadd includes those transitions of PC ISPC in which the
values of :1: and 2 become equal in their destination state. Although the transitions
of TRadd do not violate specPC by themselves, they are grouped with unsafe tran-
sitions that reach a state where the condition ((111 = 2) A (:r = 2)) holds. Hence,
we strengthen the safety speciﬁcation by including the set of transitions TRadd in
the set of transitions that violate safety. As a result, the new safety speciﬁcation
spec’PC = specpc U TRadd satisﬁes the deﬁnition of positive monotonicity for spec’PC
on Spa with respect to w.

Recalculating the invariant of the program PC. After strengthening specPC,
the program transitions in TRadd (1 (PC [3120) violate spec’PC although they do not
violate Sp€Cpc. When we remove the set of transitions TRadd 0 (PC ISpC), we create

the following deadlock states in the invariant S p0.

Deadlocks = {s : (a:(s) = z(s) + 1) A (y(s) = —-1) A (111(3) = 0)}

We invoke the algorithm RecalculateJnvariant (cf. Figure 9.2) to recalculate a new
invariant Sgc where the computations of PC are inﬁnite in S,Pc. In the ﬁrst iteration
of the algorithm RecalculateJnvariant, we remove the states in Deadlocks from the
invariant S P0. Since the removal of the above deadlock states does not introduce new

deadlock states, we calculate the new invariant 31301 where

Sic = {s = «(1(3) = 2(8)) A1118) = 0)) v
((113) = 2(8) + 1) A (y(s)

ll
._.1
V
v
v
>
A
8
A
CO
V
II
C
v
Wu

The action of the process P1 in the new invariant is as follows:

PCi: (:c=z)A(:1:<c) ——> x:=:1:+1;y;=1;

204

 

Note that the above action only assigns 1 to y; i.e., all transitions corresponding
to the action that assigns -1 to y have been removed during synthesis. Now, we

represent the transitions of the process P2 by the following action:

PCéz (y=1)A(:1:=z+1) ——+ y:=0;z:=z+1;

The action of process P3 remains as is. Since program PC’ is negative monotonic
on Sﬁsc with respect to w and its new speciﬁcation spec’PC is positive monotonic on
S},C with respect to w, failsafe fault-tolerance can be added to PC’ in polynomial
time (using Theorem 4.11). In fact, in this case, the program PC’ is failsafe F-tolerant

I I
to specPC from S P0-

9.4 SAT-based Synthesis of Fault-Tolerance

In this section, we investigate the use of automated reasoning techniques in the syn-
thesis of fault-tolerant distributed programs. There exist several heuristics-based
approaches [14] (also see Chapter 5) for polynomial-time synthesis of fault-tolerant
distributed programs. Each heuristic identiﬁes a deterministic order for the veriﬁ-
cation of synthesis requirements, where synthesis requirements are conditions that
have to be met by program states and transitions during synthesis so that the syn-
thesized fault-tolerant program is correct by construction. As a result, the efﬁciency
of synthesis is directly affected by the efﬁciency of verifying such synthesis require—
ments. Thus, it is desirable to beneﬁt from the existing automated reasoning tools
to efﬁciently verify synthesis requirements. Speciﬁcally, in this section, we focus our
attention on using state-of-the-art SAT solvers during synthesis where we express
different synthesis requirements in terms of the satisﬁability problem and use existing
SAT solvers to efﬁciently verify those requirements.

We organize this section as follows: First, in Subsection 9.4.1, we give an overview
of our SAT-based approach for the synthesis of fault-tolerant distributed programs. In

Subsection 9.4.2, we show how we formulate each synthesis requirement as an instance

205

of the satisﬁability problem. In Subsection 9.4.3, we discuss the implementation of

our SAT-based synthesis method in the FTSyn framework.

9.4.1 Synthesis Method

In this subsection, we present a general overview of our SAT-based synthesis method.
Speciﬁcally, in Subsection 9.4.1.1, we state the problem of reducing synthesis require-
ments to the satisﬁability problem. Subsequently, in Subsection 9.4.1.2, we provide
a strategy for using SAT solvers during synthesis for the veriﬁcation of the synthesis

requirements.

9.4.1.1 Synthesis Requirements Veriﬁcation

The non-deterministic synthesis algorithm presented in Section 2.8 identiﬁes six re—
quirements that must be veriﬁed during the synthesis of a fault-tolerant program from

its fault-intolerant version. For reader’s convenience, we repeat the Add_ft algorithms

 

 

 

in Figure 9.4:
Adet(p, f : set of transitions, S : state predicate, spec : speciﬁcation,
go, 91, ..., gm” : groups of transitions)
{
ms := {so : 331,512,...sn : (Vj :0$j<n : (Sj,3(j+1)) E f) A (s(,,_1),sn) violates spec};
mt := {(80,81) : ((3161113) V (so,sl) violates spec) };
Guess S’ ,T’ , and p’ := U (9,- : g,- is chosen to be included in the fault-tolerant program);
Verify the following
(F1) P’IS’QPIS’;
(F2) 8’ Q T’; T’ is closed in p’[]f; // T’ is a fault-span of p’.
(F3) T’ 0 ms = {}; (p’lT’) (1 mt = {}; // Safety cannot be violated from states in T’.
(F4) (Vso : soE T’ : (331 :: (so, 31)Ep’)); // T’ does not have deadlocks.
(F5) S’7é{}; S’ Q S; S’ is closed in p’; // S’ is an invariant of p’.
(F6) p’ |(T’ —S’ ) is acyclic; // p’ cannot stay in (T’ — S’) forever.
}

 

Figure 9.4: N on-deterministic algorithm for adding fault-tolerance to distributed programs.

Each one of the conditions F 1—F 6 identiﬁes a property ’P of the states (respec-
tively, transitions) of the synthesized program. If a program p’ (consisting of processes

{Po,--- ,Pn}) satisﬁes all these requirements then that program is fault-tolerant.

206

 

Given a process P]- (0 S j g n) that consists of a set of groups of transitions go, - - - , gm
(0 S m) and a property P, we say P, has the property P iff each group of transitions
go, - -- , gm has the property P. Also, a group of transition 9,- (0 S i g 771.) has the
property P iff each transition of g,- has the property P.

Also, given a state predicate X that consists of a set of program states, we say X
has the property P iff each state s E X has the property P. Hence, we present the

veriﬁcation problem as follows:

The veriﬁcation problem
Given a group of transitions g (respectively, a state predicate X), and
a property P:

Does g (respectively, X) have the property P? C]

9.4.1.2 Using SAT Solvers

The veriﬁcation of the conditions F 1-F 6 requires an exhaustive enumeration of the
states (respectively, transitions) of the program being synthesized, and as a result,
such veriﬁcation is not efﬁcient for programs with large state space. In this subsection,
we present a SAT-based solution for efﬁcient veriﬁcation of the synthesis requirements.

To verify the synthesis requirements, we transform the problem of verifying a
property P for a group of transitions g (respectively, a state predicate X) to a Boolean
formula whose satisﬁability can be veriﬁed by SAT solvers. Speciﬁcally, we deﬁne a
function BF that takes a group of transitions g (respectively, a state predicate X)
and a property P and generates a Boolean formula. Such transformation can be done
in polynomial time in the state space of the program (cf. Section 9.4.2).

Now, given the function BF, we design a veriﬁcation sub-layer that provides
veriﬁcation abilities for the synthesis algorithm (cf. Figure 9.5). Speciﬁcally, every
time the synthesis algorithm needs to verify a property P of a group of transitions 9

(respectively, a state predicate X), it queries the veriﬁcation sub—layer with P and g

207

 

Synthesis Algorithm

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Verify ll Verify ll Verify ll
V P for g Y’N l P for g Y’N v P for g Y’N
[ Veriﬁcation Sub-Layer J
1 ll ll
BF(P, g) YIN BF(P, g) Y/N { BF(P, g) YIN
V V
SAT ‘ ’ ’ SAT ‘ ' ' SAT

 

 

 

 

 

 

 

 

 

Figure 9.5: Using SAT solvers for the synthesis of fault-tolerant programs.

(respectively, X). The veriﬁcation sub-layer transforms the request of the synthesis
algorithm to a Boolean formula BF (P, g) and delivers it to the SAT solver. After
the SAT solver provides the result of the satisﬁability of BF (P, g), the veriﬁcation
sub-layer forwards this result to the synthesis algorithm. The veriﬁcation sub—layer
has the potential to create multiple instances of the SAT solver to verify P for a set

of groups of transitions in parallel.

9.4.2 Representing Synthesis Requirements as Boolean For-

mulas
In this section, we show how we formulate the veriﬁcation of a synthesis requirement
in terms of a Boolean formula. First, in Subsection 9.4.2.1, we focus on representing
the basic entities of our formal model (i.e., states, transitions, state predicates, and
transitions predicates) in terms of Boolean formulas. Then, in Subsection 9.4.2.2, we
use the representation of states and transitions to formulate synthesis requirements

F 1-F 6 (shown in Figure 9.4) in terms of Boolean formulas.

9.4.2.1 Formulating State Transition Graphs

In this subsection, we show how we formulate states, state predicates, transitions, and
transition predicates in terms of Boolean formulas. In the next subsection, we use the

transformations presented in this subsection to formulate the synthesis requirements.

208

Representing a state. We recall the deﬁnition of a state from Chapter 2, where
we deﬁne a state as a value assignment to program variables. Formally, a state 5 of
a typical program p with the set of variables {vo, - -~ ,vq} has the form: (lo, l1, .., lq)
where Vi : 0 S i _<_ q : l,- E D,, D,- is the domain of v,, and q is a positive integer.
Thus, to represent a state s as a Boolean formula, we introduce the transformation
S BF : 5,, ——> B, where B is the set of Boolean formulas over program variables.

SBF(s) = A::g(v,- = l,), where l,- E D,-

The S BF transformation generates a unique Boolean formula corresponding to
each state 3 E Sp; i.e., SBF is a one—to—one function. However, the formula S BF (3)
is speciﬁed in terms of equalities over program variables; i.e., (v,- = l,-). To generate a
formula that consists of Boolean variables, we have to transform each term (v,- = l,)
in S BF (3) into a formula that only consists of Boolean variables. Towards this end,
we introduce [log(|D,-|)] Boolean variables corresponding to each program variable
v,, where ID,| represents the size of the domain of 11,-. In other words, if the domain
of v,- includes |D,-| distinct values then we will need [log(|D,|)l Boolean variables to
encode each value assignment to v,- by a unique binary code with length [log(|D,—|)].
Therefore, the maximum size of SBF(s) is equal to (q + 1) - [log(K)l, where K is
the size of the domain of a variable vj (0 S j _<_ n) that has the largest domain.
Representing a state predicate. By deﬁnition, a state predicate is the union of
a set of states in the state space of p (i.e., Sp). Thus, to represent a state predicate
X Q Sp, we use the function SBF to deﬁne a function SPBF : Pou1(Sp) —> B as

follows:
SPBF(X) :2 VVs::sEX SBF(s)

The transformation S PBF takes the disjunction of all the Boolean formulas cor-
responding to all states in X. The resulting formula will be a formula co V cl V - - °C|X|
in disjunctive normal form where each conjunction cj (0 S j g |X I) represents a

state.

209

Representing a transition. To represent a transition (so, 31) 6 8,, x Sp, we use

the SBF function and deﬁne the function T BF : 8,, x Sp -—> B, where
TBF((80, 81)) = SBF(So) /\ SBF(SI)

We represent a transitions (so, 81) as a conjunction of the Boolean formula that
represents its source state so and the Boolean formula that represents its destination
state 31. One could argue that TBF should be deﬁned as SBF (so) :> SBF(sl).
This way, T BF ((so, 31)) holds for all transitions terminating at 31 and the Boolean
formula SBF (so) => SBF (31) represents more than a single transition. Hence, to
represent an individual transitions (so, 31), we use the conjunction of SBF (so) and
S BF (31).

Representing a transition predicate. We use an approach similar to the one we
used for deﬁning state predicates. In other words, a transitions predicate AP 6 5,, x S,D
is the union of a set of transitions in the state space Sp. Hence, we deﬁne function

TPBF : Pow(Sp x Sp) -—> B to represent a set of transitions A111 where
TPBF(A,,) = vvootoiootoesp TBF((80, 81))

Note that we use transition predicates to model the set of program transitions, a
group of transitions, and the safety speciﬁcation. For example, if specs; represents the
safety speciﬁcation of a program p then TPBF(specsf) generates a Boolean formula

corresponding to specsf.

9.4.2.2 Formulating Synthesis Requirements

In this subsection, we show how we formulate the requirements F LP 6 of the non-
deterministic algorithm presented in Subsection 9.4.1. Towards this end, we use the
functions presented in the previous subsection.

We observe that the condition F 1 E (p’IS’ Q plS’) veriﬁes whether the set of
transitions p’ [5’ is a subset of the set of transitions pIS’. Since p’IS’ and plS’ are

transition predicates, we use TPBF to generate the Boolean formulas corresponding

210

 

to p’lS’ and plS’. Hence, to verify F1 we verify the satisﬁability of TPBF(p’|S’) =>
TPBF(plS’).

Likewise, for the requirements F2 E (5’ => T’) and F5 E (5’ => S), we re-
spectively verify the satisﬁability of SPBF(S’) => SPBF(T’) and SPBF(S’) =>
SPBF (S) To verify the closure of the state predicate S’ in the set of transitions of

p’ (cf. F5 in Figure 9.4), we verify the satisﬁability of CLBF(S’,p’), where

CLBF(S’,p’) = AV(so,sl)::(so,sl)€p’
‘ (SBF(so) => spams» => (SBF(sl) : SPBF(S’))

To verify F3, we simply verify the satisﬁability of SPBF(T’) A SPBF (ms) and
TPBF(p’IT’) A TPBF (mt). If these two formulas are not satisﬁable then F3 is
satisﬁed.

The requirements F5 stipulates that there exists no cycles in the set of transitions
of p’| (T’—S’). As a result, we have to formulate the cycle detection problem in terms of
a Boolean formula. To achieve this goal, we adopt the techniques used in the existing
approaches for symbolic cycle detection [49, 50, 51] where one generates a Boolean
formula whose satisﬁability shows the existence of a non-progress cycle in p’|(T’—S’).
Towards this end, we deﬁne a transformation Reach(s, A11) from 3,, x Pow(S,, x Sp)
to the set of Boolean formulas B, where Pow(Sp x Sp) is the power set of (S; x Sp),

and

Reach(s, A1») = SPBF(R) , where

R = {s’ : s’ is reachable from s by transitions of Ap}

Using function Reach, we can construct a Boolean formula that represents the set
of states reachable from a particular state 8 6 SP. Now, to verify if s is in a cycle, we

only need to verify the satisﬁability of Cycle(s, AP), where

Cycle(s, A11) E (SBF(s) => Reach(s, Ap))

211

If Cycle(s, AP) is satisﬁable then s is in a cycle in the graph constructed by the
set of transitions AP. In the case where Reach(s, A11) E false then it follows that s
is a deadlock state in the state transition graph of Air Thus, using the invalidity of

Reach(s, Ap), we conclude that s is a deadlock state (i.e., verifying F4 in Figure 9.4).

9.4.3 Implementing SAT-based Synthesis

In this subsection, we present an overview of our implementation strategy where we
implement our SAT-based synthesis method in the FTSyn framework presented in
Chapter 8. Towards this end, we only focus on the part of implementation that is
related to the veriﬁcation of the requirement F3 (cf. Figure 9.4) since the implemen-
tation approach for verifying other synthesis requirements is similar.

Given a program p, its groups of transitions go, - - - , gm and its safety speciﬁcation
specsf, our goal is to identify the groups of transitions whose transitions do not violate
specs]; i.e., safe groups. In the initial implementation of FTSyn, we exhaustively
verify the safety of the transitions of a group g,- E p (0 S i S m). The exhaustive
veriﬁcation is inefﬁcient for the cases where the size of a group is very large. Hence,
we expect that our SAT-based approach provides a better performance in verifying
the safety of the transition groups.

In the rest of this section, we proceed as follows: First, we present the necessary
transformation for formulating the safety veriﬁcation problem. Then, we introduce
different layers of our implementation in FTSyn for solving the safety veriﬁcation
problem.

Safety veriﬁcation problem. For the program p, we say a group 9,- of transitions
is safe iff no transition (so, 31) E g,- violates specsf. Since we represent specs; as a
set of transitions that must not occur in program computations, we say g,- is safe
iff the set of transitions of 9,- does not intersect with specsf. Formally, we use the

transformation Sa f e( g,-) to represent the safety of g,, where

212

Safe(g,-, specsf) = TPBF(g,) A TPBF(specsf)

To verify the safety of 9,, we verify the satisﬁability of Sa f e(g,-, specs f). If
Sa f 6(91, Specs f) is satisﬁable then it follows that the group 9,- intersects specsf; i.e.,
g,- includes a transition that violates safety. Thus, Saf€(g.1,8pecsf) is satisﬁable iff g,-
is not safe.

The layers of SAT-based safety veriﬁcation. To solve the safety veriﬁcation
problem in FTSyn, we implement the following three layers Boolean formula genera-
tion, CNF formula generation, and native method invocation. In the ﬁrst layer, we use
a Java API package provided by Alloy analyzer [52] of MIT to formulate the safety
veriﬁcation problem in terms of a Boolean formula. Then, in the CN F formula gen-
eration layer, we transform the Boolean formula to Conjunctive Normal Form (CNF)
as the existing SAT solvers only accept formulas in CNF format. We use the SAT
solver zChaff [53] since zChaff is one of the most efﬁcient SAT solvers at the time of
implementing our SAT-based approach. Towards this end, we implement a Java na-
tive method where we invoke zChaff to verify the satisﬁability of the calculated CN F
formulas. The CNF formula is satisﬁable iff the group of transitions whose safety is

being veriﬁed is not safe. Now, we discuss the implementation of each layer.

0 Boolean formula generation. To generate the Boolean formulas, we ﬁrst intro-
duce a set of Boolean variables by which we encode the value assignment to
program variables. For example, if a program p has an integer variables :1: with
the domain {—1,0, 1} then we use two Boolean variables a1 and a2 to represent
the terms (:1: = —1), (r = 0), and ( = 1) respectively by the following Boolean
formulas: (al A a2), (pal A a2), and (al A ﬂag), where -1aj is the complement
of aj (1 S j S 2). Hence, we represent a state predicate (:1: = 0) V (:1: = 1)
by the Boolean formula (ﬁal A a2) V (a1 A -1a2). Note that since the domain
of :1: contains only three values, the term (-1a1 A -1a2) will never be used in the

transformation of state predicates to Boolean formulas.

213

In the generation of a Boolean formula corresponding to a transition, say (so, 31),
the value of a speciﬁc variable may be different in so and 81. Thus, using a set
of Boolean variables (e.g., a1 and a2 in the above example) for the represen-
tation of the source and the destination states may result in the generation of
contradictory Boolean formulas. To illustrate this problem, consider the above-
mentioned example where we use two Boolean variables a1 and a2 to represent
value assignments to an integer variable 2:. Suppose that we need to generate
the Boolean formula corresponding to a transition (so, 31) where the value of
:1: at so is —1 (denoted :1:(so) = —1) and the program changes the value of :1:
to 0 during the transition (so,sl) (i.e., :1:(sl) = 0). Now. to formulate (so, 31)
using Boolean variables a1 and a2, the resulting formula would be equal to
((11 A a2) A (ﬁal A a2), which is a logical contradiction. Hence, we need to dis-
tinguish the value assignment to variables at the source and the destination of

rooram transitions.
0

To distinguish the value assignment to a speciﬁc variable in a transition, we
introduce two separate sets of Boolean variables for representing the value of
that variable at the source and at the destination state. For example, we intro-
duce two new Boolean variables b1 and b2 to represent the value assignment to
variable :1: in the destination of transitions. Thus, the transition (so, 31), where

:1:(so) = —1 and :1:(sl) = 0, will be formulated as (al A a2) A (-1b1 A b2).

CNF formula generation. Using the approach presented above, we transform
the safety speciﬁcation and each group of transitions to a Boolean formula
in terms of variables introduced for encoding the value assignments to program
variables. Since zChaff requires the input formula in DIMACS CN F format [54],
we have to transform the generated Boolean formulas to CNF format. Towards
this end, we use an API provided in the Alloy analyzer [52] and integrate it

in F TSyn. Using this API, we transform the generated Boolean formulas to

214

CNF format, which can be directly delivered to the SAT solver zChaff. For
example, in DIMACS format, the formula ((11 V ﬂag V a3) A (-1a1 V a2 V -1a3) will

be represented as follows:

pcnf32

The ﬁrst line identiﬁes that a CN F formula with 3 variables and two clauses is
being speciﬁed. Each clause (i.e., disjunction) must be speciﬁed on a separate
line. Also, the variables and their complements are distinguished by a minus

sign.

0 Native method invocation. In FTSyn, after we automatically generate a
CNF formula corresponding to TPBF(specsf) ATPBF(g,-), we invoke a native
method where we query zChaff with the generated CNF formula. The source
code of zChaff is available for educational purposes. Hence, we have generated
a Dynamic Link Library so that we invoke zChaff from Java environment when
we instantiate an instance of our framework F TSyn. Therefore, for every group

of transitions g,, we invoke zChaff once to verify the safety of 9,.

Using the implementation of our SAT-based approach, we have synthesized the
token ring program presented in Chapter 6. Since we invoke zChaff from Java en-
vironment, the current implementation of our SAT-based approach suffers from the
performance of the Java Native Interface. Nonetheless, our implementation provides
a platform for SAT-based synthesis of fault-tolerant (distributed) programs and the

efﬁciency of this platform can be improved as the software technology improves.

215

 

9.5 Summary

In this chapter, we presented two directions of research in progress. Speciﬁcally, we
discussed the development of heuristics that can transform non-monotonic programs
(respectively, speciﬁcations) to monotonic. Since adding failsafe fault-tolerance to
distributed programs that satisfy the monotonicity requirements can be done in poly-
nomial time (cf. Chapter 4), such heuristics extend the scope of programs that can
reap the beneﬁts of efﬁcient synthesis.

Also, we presented a technique for using SAT solvers in the synthesis of fault-
tolerant distributed programs from their fault-intolerant version. We reduce the syn-
thesis requirements to the satisﬁability problem and then invoke SAT solvers to solve
those problems. This way, we beneﬁt from the efﬁciency of the state-of-the-art SAT
solvers during the synthesis of fault-tolerant distributed programs. Currently, we
have created a centralized implementation of our approach, however, we plan to ex-
tend this work for the cases where we deploy our synthesis algorithm on a distributed
platform. Also, we plan to investigate the applicability of other decision procedures

[55] in the synthesis of fault-tolerant distributed programs.

216

Chapter 10

Conclusion and Future Work

In this chapter, we discuss related work, make concluding remarks, and provide some
insight for future research work. Speciﬁcally, in Section 10.1, we compare our synthesis
approach to the existing approaches in the literature. Then, in Section 10.2, we
present the contributions of this dissertation. In Section 10.3, we demonstrate the
impacts of the synthesis approach presented in this dissertation. Finally, in Section

10.4, we present open problems and future research directions.

10.1 Discussion

In this section, we discuss issues related to the approach presented in this dissertation.
Speciﬁcally, we compare our synthesis method with the existing synthesis approaches
in the literature. Towards this end, we address some questions raised regarding our
synthesis method and the framework FTSyn that we have developed for the synthesis

of fault-tolerant (distributed) programs.

How does the synthesis method presented in this dissertation diﬂer from model-
theoretic synthesis approach?
The synthesis method in model-theoretic approach [2, 56, 3, 57, 4] is based on

a decision procedure for the satisﬁability proof of the speciﬁcation. Although such

217

synthesis methods may have slight differences with respect to the input speciﬁcation
language and the program model that they synthesize, the general approach is based
on the satisﬁability proof of the speciﬁcation. This makes it difficult to provide reuse
in the synthesis of programs; i.e., any changes in the speciﬁcation require the synthesis
to be restarted from scratch. By contrast, since the input to our synthesis method
is the set of transitions of a fault-intolerant program, our approach has the potential
to reuse those transitions in the synthesis of the fault—tolerant version of the input

program.

Nevertheless, similar to the above-mentioned methods that generate the synchro-
nization skeleton (i.e., abstract structure) of programs, we also generate the abstract
structure of programs. Synthesizing the abstract structure of programs allows us to
(1) focus on concurrency issues in the synthesis of fault-tolerant distributed programs
instead of their functional properties, and (ii) provide the potential of translating the
abstract structure of the synthesized program to multiple programming languages
unlike approaches that focus on the synthesis of programs in a speciﬁc programming

language[58]

Model-theoretic approaches model distribution by atomic read/write actions [4]
where in an atomic action a process performs either a read or a write operation.
Kulkarni and Arora [1] present a more general way for modeling distribution restric—
tions where a process is allowed to read/write only a subset of program variables.
Since we have adapted Kulkarni and Arora’s approach for modeling distribution, our

synthesis algorithms beneﬁt from the generality of their modeling.

In addition to the above-mentioned issues, the only implementation of model—
theoretic synthesis approaches that we are aware of is an implementation of Emerson
and Clarke’s method for the synthesis of mutual exclusion protocol [59]. On the other
hand, we have implemented an extensible framework (cf. Chapter 8) where developers

of fault-tolerance synthesize fault-tolerant distributed programs. Our framework is

218

not problem-dependent and developers of fault-tolerance can use our framework for
the synthesis of a variety of programs [60]. Also, due to the incompleteness of the
heuristics integrated in our framework, we have chosen to design our framework for
change so that if the existing heuristics fail to synthesize a program then developers

can integrate their new heuristics in the framework without an expensive overhead.

How does the synthesis method presented in this dissertation differ from automata-
theoretic approach where one synthesizes reactive distributed programs [5, 6, 7/ that

interact with a non-deterministic environment?

The automata—theoretic approach is a speciﬁcation-based synthesis method where
one synthesizes a program from its tree automaton speciﬁcation. Also, automata-
theoretic approaches are mostly used for the synthesis of reactive systems that interact
with a non-deterministic environment [5, 61, 6] whereas in the case of our synthesis
problem, we have complete information about the behavior of the environment (i.e.,

faults) with which the program interacts.

Since our approach supports incremental synthesis of multitolerant programs (cf.
Chapter 7), it has the potential to incrementally add desired fault-tolerance properties
to programs once a new behavior of the environment (i.e., a new class of faults) is
discovered. This way, we decompose the problem of synthesizing reactive programs
into simpler problems. As a result, we do not encounter the complexity of synthesizing

a reactive distributed program [6, 7] that interacts with a hostile environment.

How does the synthesis method presented in this dissertation differ from synthesizing

proof-carrying (certiﬁed) code?

In the synthesis of proof-carrying code, the synthesis method takes the input spec-
iﬁcation and generates the code of the program annotated by its proof of correctness
[62, 63]. Also, the synthesis method generates a proof checker that is delivered to

the program user. Then, using the proof checker, users verify the correctness of the

219

 

synthesized program to gain high assurance in safety-critical systems. Also, in the
synthesis of certiﬁed code, there exists an option for adding domain-speciﬁc knowledge
in order to derive more efﬁcient programs. However, such approaches mostly focus on
safety properties of programs whereas our focus is to add all levels of fault—tolerance

to programs.

How does the synthesis method presented in this dissertation diﬁfer from synthesizing

controllers in control theory?

Synthesizing discrete-event controllers in control theory is indeed an automata-
theoretic approach. Our approach has several advantages with respect to existing
approaches for the synthesis of controllers. First, the general-case complexity of
synthesizing controllers is PSPACE—complete [64, 65, 66, 67, 68] in the size of the
uncontrolled automaton, whereas our problem is NP-complete. Second, our model of
distribution is general enough to capture different modeling cases in distributed com-
puting whereas in Control theory each controller performs its controlling task individ-
ually and there exists limitations on the communication between controllers. Finally,
our approach is incremental in that we reuse the computations of the fault-intolerant
program for the synthesis of its fault-tolerant version. Such reuse of computations is

expected to be helpful in the cases where the state space is large.

How does the synthesis method presented in this dissertation diﬂer from synthesizing

strategies for two-player games?

Regarding two-player games, most of the approaches in the literature [5, 61, 69, 70]
for the synthesis of winning strategies are focused on the cases where the program
is interacting with an adversary via input / output variables. Such model restricts us
to the cases where faults can only affect a subset of program variables, whereas in
our model faults can perturb the state of the program to any state. Although the

authors of [71] address this shortcoming of two—player games, the language chosen for

220

expressing the winning strategy is Propositional Linear Temporal Logic (PLTL) [72].
Since fault-tolerance properties are existential properties, PLTL does not have the

expressiveness power to capture such properties.

Does the fault model used in this dissertation enable us to capture diﬁerent types of
faults?

Yes. The notion of state perturbation is general enough to model different types
of faults (namely, stuck-at, crash, fail-stop, omission, timing, or Byzantine) with
different natures (intermittent, transient, and permanent faults). As an illustration
of the generality of the notion of state perturbation, we have modeled (i) Byzantine
faults (cf. Subsections 4.4.1 and 5.3.1); (ii) fail-stop faults (cf. Subsection 4.4.2); (iii)
input-corruption faults (cf. Subsection 5.2.1), and (iv) the process-restart faults that
affect the token ring program synthesized in Chapter 6. State-perturbation model
has also been used in designing fault-tolerance to (i) omission faults (e.g., [17]), and

(ii) transient faults and improper initialization (e.g., [19]).

How does F TSyn scale as the state space of programs increase?

In this dissertation, we showed that using FTSyn, we synthesize fault-tolerant pro—
grams that tolerate different types of faults and are simultaneously subject to multiple
faults. The largest state space among the programs that we have synthesized belongs
to an agreement program (see Appendix B for this program) that is simultaneously
perturbed by Byzantine and fail-stop faults (1.3 million states) [73, 60]. Also, in Sec-
tion 8.5, we synthesized a simpliﬁed version of an altitude switch used in the altitude
controller of an aircraft. Although the state space of 1.3 million is much smaller than
the state space of many practical applications, we argue that our synthesis framework
has the potential in adding fault-tolerance to real-world applications. Towards this

end, we discuss the following three points:

1. We argue that model checkers were also faced with similar problems with which

221

our framework faces regarding the state space explosion. Researchers were using
early versions of model checkers for checking small protocols and verifying the
correctness of operating system kernels [74, 75] despite a state space limit of
about 500,000 states on an average workstation (in the early 90$) [74]. The state
space handled by our framework is comparable to that reported by early model
checkers. We expect that by incorporating the recent optimizations developed
for model checking, it will be possible to increase the state space for which

fault-tolerance can be added using our framework.

. We have not currently included these optimizing techniques in the current ver-
sion of the synthesis framework as the goal of the framework is to study the
effectiveness of different heuristics, different internal representation of programs,
faults, and the ability to add fault—tolerance to different types of faults. There
exist several possible optimizations that can be applied to the framework to
reduce the synthesis time. However, these optimizations are orthogonal to the
issues at hand. For example, the techniques that are used to determine if a given
group of transitions violates safety or if a given group of transitions is appro-
priate for adding recovery equally affect the above—mentioned goals. (One can
either take advantage of the SAT-based approach (presented in Section 9.4) to
check the safety of a group of transitions, or exhaustively check every transition
of a given group of transitions.) While the design of the framework permits one
to use these techniques, these techniques are orthogonal to the issue of adding
heuristics that focuses on (i) which recovery transitions should be added, and
(ii) how one should deal with safety-violating transitions. In other words, it
is expected that the relative improvement of these optimizations will have the

same effect on different heuristics.

222

10.2 Contributions

The contributions of this dissertation are two—fold: theoretical and practical. Re-
garding theoretical contributions, we showed that the problem of synthesizing failsafe
fault-tolerant distributed programs from their fault-intolerant version is NP-complete.
This result was counterintuitive in the sense that Kulkarni and Arora [1] had al-
ready conjectured that adding failsafe fault-tolerance to distributed programs would
be polynomial. Subsequently, in Section 4.3, we identiﬁed sufﬁcient conditions for
polynomial-time synthesis of failsafe fault-tolerant distributed programs. Speciﬁcally,
we identiﬁed monotonic programs and speciﬁcations where the addition of failsafe
fault-tolerance to distributed programs can be done in polynomial time. We showed
that if only programs (respectively, speciﬁcations) are monotonic then the synthesis

of failsafe fault-tolerant distributed programs will remain NP-complete.

Another theoretical contribution of this dissertation is the enhancement synthesis
algorithms presented in Chapter 5. We showed that one approach for reducing the
complexity of synthesis is to reuse the computational structure of the fault-intolerant
programs in the synthesis of their fault-tolerant version. In particular, we formalized
the problem of enhancing the fault-tolerance of nonmasking fault-tolerant programs
to masking fault-tolerance. Also, we presented a sound and complete algorithm for en-
hancing the fault-tolerance of programs in the high atomicity model — where processes
can atomically read / write program variables. Then, we designed a sound algorithm

for the enhancement of the fault-tolerance of nonmasking distributed programs.

The enhancement technique allows us to partially automate the design of masking
fault-tolerant programs and reap the beneﬁts of automation. Speciﬁcally, in the syn-
thesis of masking fault-tolerant programs, if automatic synthesis of the fault-tolerant
program fails due to the large state space of the fault-intolerant program then one
can manually design a nonmasking program and then automatically enhance the level

of fault-tolerance to masking using the enhancement algorithms of Chapter 5.

223

We used monotonicity property to extend the scope of programs and speciﬁcations
that can reap the beneﬁts of efﬁcient automation. Speciﬁcally, we developed heuristics
(cf. Sections 9.1 and 9.2) for the transformation of non—monotonic programs (respec-
tively, speciﬁcation) to monotonic where Theorem 4.11 can be applied for efﬁcient
addition of failsafe fault-tolerance to distributed programs. In other words, given
a monotonic program (respectively, speciﬁcation) and a non-monotonic speciﬁcation
(respectively, program), we design heuristics that transform a non-monotonic speci-
ﬁcation (respectively, program) to a monotonic speciﬁcation (respectively, program)
so that failsafe fault-tolerance can be added in polynomial time. To show the advan-
tage of developing such heuristics, we enhanced the fault-tolerance of a nonmasking

distributed program using our heuristics (cf. Section 9.3).

We also presented a synthesis method for automatic addition of pre-synthesized
fault-tolerance components to fault-intolerant programs (cf. Chapter 6). Our method
enables us to identify commonly encountered patterns in the synthesis of fault-tolerant
distributed programs, and reuse those patterns in the synthesis of different programs.
In other words, to reuse the effort put in the synthesis of one program for the synthe-
sis of another program, we introduced the notion of pre-synthesized fault-tolerance

components.

Moreover, we presented algorithms for automatic speciﬁcation of pre—synthesized
components during synthesis where we extract a speciﬁed component from a library
of pre-synthesized components. Afterwards, in Chapter 6, we presented an algorithm
for ensuring the interference-freedom between the program being synthesized and the
fault-tolerance components being added to that program. Finally, we designed an al-
gorithm for automatic addition of a pre-synthesized component to a fault-intolerant
program. Since the existing algorithms for the synthesis of fault-tolerant distributed
programs are not complete (i.e., the algorithms may fail to synthesize a fault-tolerant

program from a given fault—intolerant program although there exists a fault-tolerant

224

program), usage of pre—synthesized components allows us to reduce the chance of
failure in the synthesis of fault-tolerant distributed programs. Furthermore, we have
added pre-synthesized fault-tolerance components with different topologies (e.g., lin-
ear and hierarchical) to different programs (cf. Chapter 6). These examples, illustrate
the applicability of pre—synthesized fault-tolerance components in the synthesis of a

variety of fault-tolerant distributed programs with different topologies.

Using pre—synthesized fault-tolerance components, we also extended the problem
of adding fault-tolerance to the case where new variables can be introduced while
synthesizing fault-tolerant programs. By contrast, previous algorithms required that
the state space of the fault-tolerant program is the same as that of the fault-intolerant
program. Moreover, our synthesis method controls the way new variables are intro-
duced; new variables are determined based on the added components. Hence, the

synthesis method of Chapter 6 controls the way in which the state space is expanded.

Also, in this dissertation, we investigated the problem of synthesizing multitol-
erant programs from their fault-intolerant versions (cf. Chapter 7). Speciﬁcally, we
formally deﬁned what multitolerance means where a multitolerant program provides
(1) the speciﬁed level of fault-tolerance if a fault from any single class of faults occurs,
and (ii) the minimal level of fault-tolerance if faults from multiple classes occur. Then,
we showed that, in general, the problem of adding multitolerance to high atomicity
programs is NP-complete in the state space of the fault-intolerant program. Subse-
quently, we presented sound and complete synthesis algorithms for special cases of
adding multitolerance where one incrementally adds failsafe (respectively, nonmasking)

fault-tolerance to one class of faults and masking fault-tolerance to another fault-class.

Regarding the practical contributions of this dissertation, we developed the syn-
thesis framework FTSyn (presented in Chapter 8) for developers of fault-tolerance
where they can synthesize fault-tolerant programs. F TSyn integrates existing al-

gorithms and heuristics for the synthesis of fault—tolerant distributed programs and

225

allows developers to automatically synthesize fault-tolerant programs from their fault-
intolerant version. Also, FTSyn is extensible in the sense that developers of heuristics
can easily integrate new heuristics into the framework.

Moreover, FTSyn is changeable in the sense that developers can easily change its
implementation, without changing the design of FTSyn. The changeability of F TSyn
is important since changing the implementation of FTSyn may help to increase the
efﬁciency of the synthesis. Thus, any changes in the implementation should be simple
and cheap. Furthermore, we have integrated a SAT—based synthesis approach in
FTSyn where we use efﬁcient SAT solvers in the synthesis of fault—tolerant distributed

programs (cf. Section 9.4).

10.3 Impact

In this section, we discuss the impacts of this dissertation in research and education.
Regarding research, this dissertation has signiﬁcant impacts on the development of
fault-tolerant and dependable distributed programs as the extensible and changeable
design of our software framework will help to develop a rich integrated framework of
heuristics for the development of fault-tolerant distributed programs.

Moreover, the approach presented in this dissertation for the synthesis of fault-
tolerant programs can be extended for the synthesis of reactive programs [5]. Towards
this end, we have designed a hybrid synthesis method that beneﬁts from speciﬁcation-
based approaches [76, 2, 77, 56, 78, 79, 80, 57, 4, 5, 6, 71, 81, 7] and the synthesis
approach presented in this dissertation. Speciﬁcally, we have developed an incremen-
tal synthesis method [82] for automatic addition of liveness properties to ﬁnite-state
concurrent programs. In particular, in [82], we present a sound and complete algo-
rithm for adding Leads-to [30] properties to programs. The incremental approach of
[82] has the potential to reuse the efforts put in the synthesis of a program for the

synthesis of its improved version.

226

Furthermore, the synthesis algorithm in [82] can be integrated with model checkers
to provide automated assistance beyond generating counterexamples; i.e., in the cases
where a model fails to satisfy a property, our synthesis algorithm automatically (1)
identiﬁes the ﬁxability of the model, and (ii) ﬁxes the model if it is ﬁxable. Hence,
we believe the synthesis method presented in this dissertation has the potential to

provide a practical methodology for the synthesis of reactive programs.

Regarding educational impact, we note that using our framework provides the op-
portunity to experience non-trivial concepts regarding distributed and fault-tolerant
systems. We have used the synthesis framework in the graduate distributed system

class as well as in a seminar on fault-tolerance.

In the class on distributed systems, the students ﬁnd that the interactive na-
ture of the framework is extremely useful in understanding several concepts about
fault-tolerant programs. In this class, the students focused on re-synthesizing a fault-
tolerant program for which the framework had been used successfully. In this case,
the students began with the fault—intolerant program. First, they used the auto—
mated approach to obtain the fault-tolerant program. Subsequently, they focused
on interactive synthesis of the same fault-tolerant program. During this interactive
synthesis, they applied different heuristics and observed the intermediate program.
They explored the state transition diagram of the intermediate program and used the
framework to understand why the intermediate program was not fault-tolerant. This
allowed them to experience the non-deterministic execution of different processes of
the program. Moreover, they could observe individual states and transitions in the
global state transition diagram and could experience the effect of distribution restric-

tions on the complexity of the synthesis of fault-tolerant distributed programs.

227

10.4 Future Work

In this section, we present open theoretical problems in the synthesis of fault-tolerant
distributed programs. Also, we discuss future extensions and modiﬁcation to the FT-

Syn framework presented in Chapter 8. First, we discuss open theoretical problems:

0 Identify the polynomial boundary of synthesizing nonmasking fault-tolerant dis-

tributed programs.

As we identiﬁed sufﬁcient conditions for the synthesis of failsafe fault-tolerant
distributed programs in Chapter 4, we would like to at least identify the suf-
ﬁcient conditions for polynomial synthesis of nonmasking fault-tolerant pro—
grams. Although we do not have a proof for the NP-hardness of the problem of
synthesizing nonmasking fault-tolerant distributed programs from their fault-
intolerant version, we already know that this problem is in NP (cf. Section
2.8). To the best of our knowledge, no polynomial-time algorithm has yet been
presented for the synthesis of nonmasking distributed programs. Thus, ﬁnding
properties of programs that identify sufficient conditions for polynomial-time

synthesis of nonmasking distributed programs remains an open problem.

0 Develop nonmasking programs that satisfy the monotonicity requirements.

Since the worst-case complexity of enhancing the fault-tolerance of nonmasking
fault-tolerant distributed programs to masking is exponential (cf. Chapter 5),
we would like to use the notion of monotonicity in order to identify nonmasking
programs whose level of fault-tolerance can be enhanced to masking in poly-
nomial time. Thus, it is desirable to develop a methodology for the design of
nonmasking programs that satisfy the requirements of program monotonicity.
Such design methodology provides a framework for partial automation in the
design of masking programs where one manually develops a nonmasking mono—

tonic program and then applies Theorem 4.11 to automatically enhance the

228

level of fault-tolerance to masking.

Identify the necessary and sufﬁcient conditions for simultaneous addition of

multiple pre-synthesized components.

In Chapter 6, we showed how we add a pie-synthesized corrector to the pro—
gram being synthesized in order to resolve a deadlock state from which existing
heuristics fail to add recovery. Also, we ensured that the execution of the pre-
synthesized component does not interfere with the execution of the program.
Now, since there exist many situations where we need to simultaneously add
such correctors to the program being synthesized, we plan to identify neces-
sary and sufﬁcient conditions for an interference-free addition of multiple pre-

synthesized components to a program.

Develop a platform for providing automated assistance in model checking beyond

generating counterexamples.

Although model checkers provide user—friendly counterexamples in cases where
a model fails to satisfy a desired property, it is difﬁcult to manually ﬁx a failed
model so that it satisﬁes a desired property while preserving its existing prop-
erties. We have developed a synthesis algorithm [82] that has the potential
to provide such automated assistance for developers when the model checking
of the program at hand fails. Using the synthesis algorithm of [82], we auto-
matically (1) identify whether or not a model is ﬁxable to satisfy a particular
property in addition to its existing properties, and (ii) ﬁx the model if it is
ﬁxable so that it satisﬁes a new property in addition to its existing properties.
However, currently, the synthesis algorithm in [82] can only be used for linear
computation model where program properties are speciﬁed in Linear Temporal
Logic [72]. To develop a platform for automatic model correction, it is desir-

able to (i) integrate the algorithm [82] in one of the existing model checkers

229

(e.g., SPIN [36]) to investigate the practicality of the algorithm of [82], and
(ii) extend the results of [82] for the case where the program computation is
non-linear (e.g., tree-like computation) and program properties are speciﬁed in

Computation Tme Logic (CTL) [72].

Now, we discuss issues related to the extensions and improvements of the synthesis

framework FTSyn presented in this dissertation.

0 Use model checkers in the synthesis of fault-tolerant programs in order to reduce

the complexity of synthesis.

As mentioned in Chapter 8, the FTSyn framework has the ability to interact
with developers of fault-tolerance. If the synthesis of a fault-tolerant program
fails then developers can ask FTSyn to generate an intermediate version of
the program being synthesized in order to identify what went wrong during
synthesis. F TSyn generates the intermediate program in Promela [37] model-
ing language. Thus, developers can beneﬁt from the SPIN model checker and
verify the fault-tolerance properties. The SPIN model checker returns coun-
terexamples that are enlightening for developers in that they can identify what
heuristic should be applied next in synthesis. Currently, the users of FTSyn
should perform this veriﬁcation manually. We plan to develop an automated
approach for the communication between FTSyn and model checkers. Such
communication has an important impact on reducing the complexity of synthe-
sis as model checkers can provide behavioral information about the program at
hand. The synthesis algorithm uses this behavioral information to make more

intelligent decisions during synthesis.

0 Develop a distributed synthesis platform.

Currently, the implementation of F TSyn is centralized. To extend the scope

of synthesis for real-world applications, we adopt two directions: developing a

230

scalable parallel synthesis algorithm, and extending F TSyn for deployment on
a distributed platform. In the ﬁrst direction, we plan to conduct a survey 011
the existing approaches [83, 84, 85, 86, 87] for parallel and distributed model
checking, where one distributes the reachability graph of the model at hand
on a network. Towards this end, we note that the synthesis problem differs
from model checking problem in that during synthesis we modify the program
model to satisfy speciﬁc synthesis requirements, whereas model checkers only
verify the program model without performing any modiﬁcation. We conjecture
that the scalable synthesis will be in a higher complexity class than the scalable
model checking, thus making the development of a scalable synthesis algorithm
more challenging. In the second direction, we plan to simultaneously implement
the achievements in the design of the scalable synthesis algorithm in FTSyn.
As a result, we can experience the applicability of our theoretical results in the

context of distributed FTSyn.

Develop an on-the-ﬁy synthesis method.

In the synthesis of a fault-tolerant program, F TSyn initially expands the reacha-
bility graph of the fault-intolerant program using program and fault transitions.
For real-world applications, the size of the reachability graph is very large, and
as a result of the space complexity of synthesis, FTSyn may fail to synthesize
a fault-tolerant program. To remedy this problem, we plan to develop a space-
efﬁcient synthesis algorithm where FTSyn partially generates the reachability
graph of the program. Towards this end, we beneﬁt from existing techniques
[88] in the model checking literature for providing space efﬁciency. Such space-
efﬁcient techniques are orthogonal to the development of a distributed synthesis
algorithm in that we deploy the space-efﬁcient synthesis algorithm on each node

of the scalable synthesis platform discussed above.

231

APPENDICES

232

Appendix A: Programs
Synthesized Using Pre-Synthesized

Components

In this appendix, we present the programs that we have synthesized using pre-
synthesized components. Speciﬁcally, we ﬁrst present an Alternating Bit Protocol
(in Section A.1) that is nonmasking fault-tolerant to message loss faults. Then,
in Section A.2, we present an intermediate diffusing computation program synthe-
sized by our synthesis framework, F TSyn. Subsequently, in Section A.3, we present
the synthesized diffusing computation program after we have added pre—synthesized
components to reﬁne one of the high atomicity recovery actions in the intermediate
program. Finally, in Section A.4, we present a reﬁned version of the synthesized
diffusing computation program in the syntax of the Promela modeling language [37]
where we have veriﬁed the synthesized program in the SPIN model checker to gain

more conﬁdence in the impleinentation of FTSyn.

233

A.1 The Promela Model of the Alternating Bit Pro-

tocol

In this section, we present the Promela model of the alternating bit protocol (ABP)

synthesized by adding linear pre-synthesized components to the fault-intolerant ABP

program presented in Section 6.5.

1

2#define inv

3( (((rr != 1) && (cr == -1)) ll (br == bs)) &&

4 (((rs != 1) && (cs == -1)) ll (hr != bs)) &&

5 ((cs == -1) ll (cs == bs)) &&

6 ((cs != -1) ll (cr != -1) ll ((rr + rs) == 1)) &&
7 ((cs == -1) ll (cr != -1) ll ((rr + rs) == 0)) &&
8 ((cs != —1) ll (cr == -1) ll ((rr + rs) ==

9

1o#define fS ((CS == '1) ll (CS == bs)) &&

n (((cs != -1) && (cr != -1)) ll

12

13

14/* The property to be verified

15

16#define
17#define
13#define
19#define
20

21#define
zzﬂdefine
23#define
24#define

25

0)))

(((rr + rs) == 1) ll ((rr + rs) == 0)))

Zs (rs == 0) && (bs
Zr (rr == 0) && (br
ZPs (rs == 0) && (bs

ZPr (rr == 0) && (br

X3 Z3 && ZPr
XPs ZPs && Zr
Xr Zs && Zr

XPr ZPs && ZPr

==1) && (cs == -1)
==1) && (cr == -1)
==O) && (cs == -1)
==O) && (cr == -1)

234

[](fs -> <> inv)

/*
/*
/*
/*

*/

LCs */
LCr */
LC’s */

LC’r */

26bOOl rs = 1;

27bool II = O;

2sbool bs = 1;

29bool hr = 0;

3O

31bool ypr; /* y’r */
32bool ys;

33bool yr;

34bool yps; /* y’s */
35

3ebool us;

37bool ur;

3sbool ups; /* u’s */
39bool upr; /* u’r */

40

II
I
H

41int cs

ll
l
H

4zint cr
43

44proctype sender() {

45do

45:: atomic { ((rs == 1)) -> rs = 0; cs = bs ;
47 us =0; ups =0; }
43:: atomic { (cr != -1) -> rs = 1; cr = -1;

,9 b3 = (bs+1)%2 ; us =0; ups =0; }
50

51:: atomic { Zs && lys && ypr -> ys = 1; }
52:: atomic { ys -> cs = 1; YS=03 }

53

54:: atomic { ZPs as lyps && yr -> yps = 1; }

55:: atomic { yps -> cs = O; yps=0; }

56

57:: atomic { Zs && lus -> us = 1; }
1

53:: atomic { ZPs && lups -> ups =

590d;

60}

61

s2proctype receiver() {

63do

54:: atomic { ( cs != -1) -> cs = -1; rr
65 hr = (br+1)%2 ; yr =0;
56:: atomic { ( rr == 1) -> rr = 0; cr =
67 yr =0;

68

ll
H
-
w—I

59:: atomic { ZPr && lypr -> ypr
7oz: atomic { Zr && !yr -> yr = 1; }

71

72:: atomic { Zr && lur && us -> ur = 1; }
73:: atomic { ur -> cr = 1; ur=0; }

74

75:: atomic { ZPr && lupr && ups -> upr = 1; }

76:: atomic { upr -> cr = 0; upr=0; }
770d;

78}

79

soproctype MessageLossFaults() {

siif

82:: ((cs != -1)) -> cs = —1;
83:: ((cr != -1)) -> cr = -1;
84:: skip;

85fi;

86}

87

sainit{

1;

89run sender(); run receiver(); run MessageLossFaults();

90}

236

A.2 The Synthesized Intermediate Diffusing Com-
putation Program

In this section, we present the intermediate diffusing computation program that we
have synthesized using F TSyn. This program includes the actions of the high atomic-
ity processes added for the purpose of adding recovery. FTSyn represents the synthe-
sized program in a syntax close to the syntax of the Promela modeling language [37].
The semantic of the output program is based on the Dijkstra’s guarded commands,
where each guarded command grd ——> st represents a set of transitions {(so, 31) : grd
holds at so and the atomic execution of st at so takes the state of the program to 81 }.
In the following program, ci, pi, and sni respectively represent the color, the parent,
and the session number of process P,. Also, cpi and snpi respectively represent the

color and the session number of the parent of P,- (0 S i S 3).

1 ---------- The actions of Process PO ----------

2(c0 == 1) &&

3((p0 == 0) && (snO == 1)) -> CO := 0; snO := O;
4

5(C0 == 1) &&

6((p0 == 0) && (snO == 0)) -> CO := O; snO := 1;

7
8(c0 == 1) &&

9( ((c1 == 0) && (c2 == 0) && (sno

1) && (sni == 0) &&

10 (sn2 == 0) && ((p0 == 1) ll (p0 == 2)) ) II
n ((c2 == 0) && (snO == 1) && (sn2 == 0) && (p0 == 2) ) ||
12 ((c1 == 0) && (sno == 0) && (sni == 1) && (p0 == 1) ) ll
13 ((c2 == 0) && (sno == 0) && (sn2 == 1) && (p0 == 2) ) ||
14 ((c1 == 0) as (snO == 1) && (sn1 == 0) && (p0 == 1) ) ll
15 ((c1 == 0) && (c2 == 0) && (sno == 0) && (sni == 1) &&

15 (sn2 == 1) && ((p0 == 1) ll (p0 == 2))) )

17 -> CO := ch; snO := snpO;

237

18

19(C0 == 0) &&

20( ((c1 == 1) && (c2 == 1) && (sno == 0) && (sni == 0) && (sn2 == 0)) ll
21 ((c1 == 1) && (c2 == 1) && (snO == 1) && (sni == 1) && (sn2 == 1)) )
22 > CO = 1;
23

24 ---------- The actions of Process P1 ----------

25

26(C == 1) &&

27( ((cpl == 0) && (sn1 == 0) k& (snpl == 1)) ll

28 ((cpl == 0) && (sni == 1) && (snp1 == 0)) )

29 -> c1 := cpl; 8n1 := snpl;

30

31(c1 == 0) && ((snl == 1) ll (sni == 0)) -> c1 := 1;

32

33 ---------- The actions of Process P2 ----------

34

35 (c2 == 1) &&

35( ((cp2 == 0) && (sn2 == 0) && (snp2 == 1)) ll

37 ((cp2 == 0) && (sn2 == 1) && (snp2 == 0)) )

38 -> c2 := cp2; sn2 := snp2;

39

4o(c2 == 0) &&

41C ((sn2 == 0) && (c3 == 1) && (sn3 == 0) && (p3 == 2)) ll
42 ((sn2 == 1) && (03 == 1) && (sn3 == 1) && (p3 == 2)) )
43 -> c2 := 1;
44

45 ---------- The actions of Process P3 ----------

46

47(c3 == 1) &&

48( ((cp3 == 0) && (sn3 == 0) && (snp3 == 1)) ll

49 ((cp3 == 0) && (sn3 == 1) && (snp3 == 0)) )

a) —> c3 := cp3; sn3 := snp3;

238

m

52(C3 == 0) &&

53 ((sn3 == 0) ll (Sn3 == 1)) -> C3 := 1;
54

55

55 ---------- The actions of the high atomicity Process 0 ----------

57

58(CO == 1) &&

59( ((c1 == 1) && (c2 == 1) && (c3 == 1) && (snO == 1) && (sn3 == 0) &&

60 ((p0 == 2) I1 (po == 1)) 55 (p1 == 0) 32 (p2 == 0) 25 (p3 == 2) ) II

51 ((c1 == 1) && (c2 == 1) && (c3 == 1) && (sno == 1) && (sni == 0) &&

52 ((p0 == 2) ll (p0 == 1)) && (p1 == 0) && (p2 == 0) 52 (p3 as 2) ) ll

53 ((c1 == 1) && (c2 == 1) && (c3 == 1) && (snO == 1) && (sn2 == 0) &&

54 ((p0 == 2) ll (p0 == 1)) && (p1 == 0) && (p2 == 0) && (p3 == 2)) )

67(c0 == 1) &&

-> sn.0 := 0;

58( ((Cl == 1) && (C2 == 1) && (c3 == 1) && (snO == 0) && (sni == 0) &&

59 (sn2 == 0) && (sn3 == 0) && ((p0 == 2) II (p0 == 1))

25 (p1 == 0) as

70 (p2 == 0) 22 (p3 == 2) ) ll

71 ((c1 == 1) && (C2 == 1) && (c3 == 1) && (snO =3 1) && (sni == 1) &&

72 (sn2 == 1) as (sn3 == 1) && ((p0 == 1) || (p0 == 2)) 22 (p1 == 0) as

73 (p2 == 0) 21 (p3 == 2)) ) -> po := o;

75(c0 == 1) &&

75( ((C == 1) && (c == 1) && (c == 1) && (snO == 0)
77 (sn2 == 1) && (p0 == 2) && (p1 == 0) && (p2 == 0)
78 ((CI == 1) && (c2 == 1) && (c3 == 1) && (snO == 0)
79 (sn2 == 0) && (p0 == 2) && (p1 == 0) && (p2 == 0)
80 ((Cl == 1) && (c2 == 1) && (c3 == 1) && (snO == 0)
91 (sn3 == 1) && (p0 == 2) && (p1 == 0) && (p2 == 0)

82 ((c1 == 1) && (C2 == 1) && (C3 == 1) && (sno

O)
83 (sn3 == 0) 32 (p0 == 2) 22 (p1 == 0) 22 (p2 == 0)

239

&&
&&
&&
&&
&&
&&
&&
&&

(sn == 0) &&
(p3 == 2)) ||
(sn1 == 1) &&
(p3 == 2)) ll
(sn2 == 0) &&
(p3 == 2)) ll
(sn1 == 1) &&

(p3 == 2)) ||

84 ((c1 == 1) && (C2 == 1) && (c3 == 1) && (snO

0) && (sn2 == 1) &&
85 (sn3 == 0) && (p0 == 2) && (p1 == 0) && (p2 == 0) && (p3 == 2)) ll
86 ((c1 == 1) && (c2 == 1) && (c3 == 1) && (snO == 0) 92 (sn1 == 0) &&

87 (sn3 == 1) && (p0 == 2) && (p1 == 0) && (p2

O) && (p3 == 2)) )
as -> p0 := 1;
89

9o(c0 == 1) &&

91((c1 == 1) && (c2 == 1) && (c3 == 1) && (sno == 0) && (sni == 1) &&

92 (sn2 == 1) && (sn3 == 1) && ((pO

2) II (p0 == 1)) 25 (p1 == 0) 25

93 (p2 == 0) && (p3 2)) ) -> CO := 1; snO :=1; p0 := 0;
94
95 ---------- The actions of the high atomicity Process 1 ----------

96

97(CO == 1) &&

98( (C1 == 1) && (c2 == 1) && (C3 == 1) && (snO == 0) && (Snl == 1) &&
99 (sn2 == 0) && (sn3 == 0) && (p0 == 1) && (p1 == 0) && (p2 == 0) &&
1m) (p3 == 2) ) -> sni := 0;

101

102(CO == 1) &&

103( ((c1 == 1) && (c2 == 1) && (C3 == 1) && (snO 0) && (sni == 0) &&

104 (sn2 == 1) && (p0 == 1) && (p1 == 0) && (p2

O) && (p3 == 2)) II
105 ((C1 == 1) && (c2 == 1) && (c3 == 1) && (snO == 0) && (sni == 0) &&

106 (sn3 == 1) 55 (p0 == 1) 55 (p1 == 0) 45 (p2

0) && (p3 == 2)) )
107 -> sn1 := 1;
108

109

no ---------- The actions of the high atomicity Process 2 ----------

111

112 (CO == 1) &&

113((c1 == 1) && (c2 == 1) && (c3 == 1) && (sno == 0) && (sni == 1) &&

114 (sn2 == 0) && (sn3 == 1) && (p0 == 1) && (p1 == 0) && (p2 == 0) &&

n5 (p3 == 2)) ) -> sn2 := 1;

116

240

117 ---------- The actions of the high atomicity Process 3 ----------

118

119(C0 == 1) &&

120 ((c1 == 1) && (c2 == 1) && (c3 == 1) && (snO == 0) && (snl == 1) &&
121 (sn2 == 1) 88 (sn3 == 0) 88 (p0 == 1) 88 (p1 == 0) 88 (p2 == 0) 88
m2 (p3 == 2)) -> sn3 := 1;

A.3 The Actions of the Reﬁned Diffusing Compu-
tation Program

In this section, we present the actions of processes P2 and P3 in the DC program
(from Section 6.6.2). These actions construct the actions of the synthesized program.

We presented the actions of P0 in Section 6.6.2.

DC31 : (C3 = 1) A (pan, 2 3) ——> C3 := 0; 3713 = -isn3;
y3 := false; y2 := false;
if ((3713 = 1) A (yg :2 true))

then ya := false; ya := false

D032 : (C3 = 1) /\ (cpar3 : 0) A (3713 as snpara) ——8 C3 :2 cpara; 3113 = 377mm;
if ((03 = 0) A (313 = true))
then 3);; := false; y2 := false;
if ((sn3 = 1) A (95 = true))

then y; := false; y; := false;

DC33: (C3 = 0) A (Vk up;c = 3 => (c;C : 1 A3113 E snk)) ——8 C31: 1;
D31 : ((‘3 :— 1) A (C2 = 1) A (313 = false) ——+ m 2: true;

D31 : (3713 = 0) A (62 = 1) A (313 = false) —-8 ya :=t1'ue;

Note that, in action DC31, our synthesis method has added new statements to

the statements of the ﬁrst action in the fault-intolerant DC program. These new

241

statements falsify the witness predicates of the detectors. For example, when c3
becomes 0 the state predicate LC3 no longer holds. Thus, the witness predicate 3);;
must be falsiﬁed to ensure the interference—freedom of the program and the pres-
synthesized detectors. Now, we present the actions of process P2 composed with the
detectors d2 and (1’2.

D021 : (c2 = 1) /\ (parg = 2) ———+ C2 := 0; 3712 = “18712;
1121: 0:110 1= 0;
if ((y3 = false) A (sn2 = 1)
A((yé = trWE) V (316 = true»)
then y; := false; 3,16 := false;
0022 : (c2 = 1) /\ (elmr2 : 0) A (3712 gé snparz)
—+ C2 == 6pm,; sn2 = snpm;
if ((62 = 1) V (ya = false))
A((y2 = true) V (yo = true)))
then y2 := false; yo := false;
if ((3712 = 1) V (ya = false))
A((yé = true) V (316 = true»)

then 3,”? := false; 311’) := false;

DC23: (C2 : 0) A (Vk :: pk = 2 => (C;‘ =1A sn2 E s-nk))
——* C2 := 1;
if (y;; = false)) A
((312 = true) V (yo =tr1t€)))
then y2 :2 false;y0 := false;
if (y3 =2 false)) A
«y; = true) v as = true)))
then y; := false;y6 2: false;
021 : (y3 = true) A (02 = 1) A (sno = 1) A (c0 = 1) A((paro = 2) V (para 2 1)) A (y2 = false)
—+ yg :2 true;
0’21 3 (315, = true) A (02 = 1) /\ (3710 = 1) /\ (Co = 1) A((Pm'o = 2) V (Faro = 1))/\(1/5 = false)

-——> y; := true;

242

AA The Promela Model of the Synthesized Diffus-
ing Computation Program

In this section, we present the Promela model of the synthesized diffusing computation
program where we verify the nonmasking fault-tolerance property of the synthesized
program. Although the synthesized program is correct by construction, we have con-

ducted this formal veriﬁcation in order to gain more conﬁdence in the implementation

of F TSyn.

1#define inv

2 (((

3 (((C[0] == c[pO]) && (C[4] == C[p0+4])) ll ((c[O] ==1) && (c[pO] == 0)))&&
4 (((c[1] == c[p1]) && (c[5] == c[p1+4])) ll ((c[1] ==1) && (c[p1] == 0)))&&
5 (((c[2] == c[p2]) && (c[6] == c[p2+4])) ll ((c[2] ==1) && (c[p2] == 0)))&&
s (((c[3] == c[p3]) && (c[7] == c[p3+4])) ll ((c[3] ==1) && (c[p3] == 0)))
7)) &&

8 ((p0 ==O) && (p1 == 0) 88 (p2 == 0) && (p3 == 2)) )

9
io#define safetyO (!20 ll X0)

11#define safetyOp (!ZOp ll XOp)

12

13#define safety2 (!22 II X2)

14#define safety2p (122p ll X2p)

15

16#define safety3 (123 || X3)

17#define safety3p (!Z3p ll X3p)

18

n)#define X0 (C[3] == 1) && (C[1] == 1) && (C[2] == 1) && (C[0] == 1) &&
20 (C[4]==1)&&((p0 == 2) ll (p0 == 1))
21#define 20 (yo == 1)

22

z;#define XOp (c[7] == 0) && (c[1] == 1) && (c[2] == 1) && (c[O] == 1) &&

243

24
25 #define
26
27
28#define
29
30#define
31
32 #define
as
34#define
35
36#define
37#define
38
39#define
4o#define

41

( c[4] == 1 ) && ((p0 == 2 ) II (p0 == 1))
1)

ZOp (yOp

X2 (c[3] == 1) && (c[2] == 1) && (c[4] == 1) && (c[0] == 1) &&
((p0 == 2 ) ll (p0 == 1))
22 (y2 ==1)

x2p (c[7] == 0) 88 (c[2] == 1) 88 (c[4] == 1) 88 (c[0] == 1) 88
((p0 == 2 ) ll (p0 == 1))
=1)

22p (y2p

X3 (c[3] == 1) && (c[2] == 1)

23 (y3 ==1)

X3p (c[7] == 0) 88 (c[2] == 1)

23p (y3p ==1)

42/* Properties to be verified

as [] safety

44 [] (linv -> <> inv)

45 [] (O
46*/

47

inv)

48 bool c [8] ;

49bool y3

50

=0, y2=0, y3p=0, y2p=0, yO =0, yOp =0;

51/* The cells of this array respectively represent

52

53

54

55

56

c0, c1, c2, c3, snO, snl, sn2, sn3

// CO ---> c[0]
// c1 ---> c[1]
// c2 ———> c[2]
// c3 ---> c[3]

244

57 //
58 //
59 //
60 //
61*/

62

saint p0 = O;

64int p1 = O;

65int p2 = O;

saint p3 = 2;

67

88proctype PO() {

69do
70::
71

72
73::
74

75

76

77
78}
79
80::
81

82

83

84

85

86

87

88

89

snO
sn1
sn2

sn3

---> c[4]
---> c[5]
---> c[6]

---> c[7]

atomic{ ((c[O] ==1) && (p0 == 0) ) -> c[0] = O; c[4] = !c[4];

YO = 0; yOP =0;

atomic{ ((c[O] == 1) && (c[pO]

{ CEO] = c[pOJ; C[4]
if :: (c[0] == 0) 88 (yO

': else skip;

fi;

= c[p0+4];

}

== 0) && (c[4] != c[p0+4])) ->

-=1) -> y0 = 0; y0p =0;

atomic{ ((c[O] == 0) && ((p1 != 0) II ((c[1] == 1) &&

(c[4] == c[5] ))) && ((p2 != 0) ll ((c[2] == 1) &&

if ::

(c[4] == c[6])) ) ) -> { c[0] = 1;
(y2 == 0) 88 (yO ==1) —> yO =0;

-: else skip;

fi;

if ::

(y2p == 0) 88 (yop ==1)-> yOp =0;

'2 else skip;

fi;

245

90}

91/* component-based actions of PO */
92
93 :: atomic { ( ( yO == 1 ) &&

94( ( y0p == 1 ) ||( c[5] == 0 ) ||( c[6] == 0 )) ) -> c[4] = 0;

95 y0 =0; y0p = 0; y? =0; y2P = 0; }
96

97:: atomic { (y2 == 1) 88 ( c[1] == 1 ) 88 (c[2] == 1) 88

98 (c[0] == 1) && ( c[4] == 1 ) &&

99 ((p0 == 2 ) ll (p0 == 1)) && (yO == 0) -> yo = 1; }
100

101:: atomic { (y2p == 1) 88 ( c[1] == 1 ) 88 (c[2] == 1) 88

102 (c[0] == 1) && ( c[4] == 1 ) &&

um ((p0 == 2 ) II (p0 == 1)) 88 (y0p == 0) —> y0p = 1; }
104 od;

105}

106
107proctype P1() {

lmsdo

109:: atomic { ((c[1] ==1) && (p1 =2 1) ) -> c[1] = 0; C[5] = !C[5]; }
1u1:: atomic { ((c[1] == 1) && (c[p1] == 0) && (c[5] != c[p1+4]) )

IN -> c[1] - c[pl]; c[5] = c[p1+4]; }
112:: atomic { (c[1] == 0) -> cu] = 1; }

1130(1;

114}

n5

IJGPIOCtype P2() {

117 do

118:: atomic{ ((c[2] ==1) && (p2 == 2) ) -> { C[2] = 0; c[6]= !C[5];

U9 y? =0; y0 =0; y3 =0; y3p =0;
120 if :: ((y3p == 0) && (c[6] == 1)) && ((y2p ==1)|| (y0p ==1))
.21 -> y2p =0; y0p =0; y3p =0;
n2 :: else skip;

246

123 fi;
124 }

125 }

126

127:: atomic { ((C[2] == 1) && (C[p2] == 0) && (C[6] != C[p2+4]))

123 -> { c[2] = c[p2]; c[6] = c[p2+4];
129 if :: ((c[2] == 0) ll (y3 == 0)) &&

mo ((y2 ==1) ll (yO ==1) ll (y3 ==1))
w) -> y2 =0; yO =0; y3 =0;
132 :: else skip;

133 fi;

134 if :: ((y3p == 0) ll (c[7] == 1) || (c[3] == 0) ||
135 (c[2] == 0)) && ((y2p ==1)|| (y0p ==1)|| (y3p ==1))
we -> y2p =0; y0p =0; y39 =0;
137 :: else skip;

138 fi;

w9 }

M0}

141
142:: atomic { ((c[2] == 0) && ((p3 != 2) ||

143 ((CEBJ == 1) && (c{7] == c{6])))) -> { CD] = 1;
111 if :: (y3 == 0) 88 ((y2 ==1)Il(yo ==1)) -> y2 =0; yO =0; y3 =0;
M5 :: else skip;

146 fi;

147 if :: (y3p == O)&&((y2p ==1) I I (y0p ==1))-> y2p =0; y0p =0; y3p =0;

M8 :: else skip;
149 fi;

150 }

151 }

152
153 :: atomic { (y3 == 1) && (c[2] == 1) && (c[4] == 1) 818:
154 (c[0] == 1) && ((p0 == 2) II (p0 == 1)) MI

155 (y2 == 0) "> y2 = 1; }

247

156

157:: atomic { (y3p == 1) 88 (c[2] == 1) 88 (c[4] == 1) 88

158 (c[0] == 1) 88 ((p == 2 ) ll (p0 == 1)) &&

159 (y2p == 0) -> y2p = 1: }
mood;

161}

162
183proctype P3() {

164 do

165:: atomic { ((c[3] ==1) && (p3 == 3) ) -> { c[3] = O; c[7] = !c[7];
186y3 = o; y2 = o;

167

1881f :: ((c[7] == 1) ll (c[2] ==O)) && (y3p ==1) -> y3p =0; y2p =0;
169:! else skip;

170 fi;

171 }

172}

173

174

175

178:: atomic { ((c[3] == 1) && (c[p3] == 0) && (c[7] != c[p3+4]))

177 -> { CE3] = c[p3]; CU] = ctp3+4];
178 if :: ((c[3 == 0) ll (c[2] ==O)) && (y3 ===1) -> y3 = 0; y2 =0;
179 :: else skip;

180 fi;

181 if :: ((c[7] == 1) ll (c[2] ==O)) && (y3p ==1)-> y3p =0; y2p =0;

182 :: else skip;

187:: atomic { (c[3] == 0) -> { c[3] = 1;

188 if :: ((c[7] == 1) ll (c[2] ==O)) && (y3p ==1)-> y3p =0; y2p =0;

248

189 :: else skip;
190 fi;

191 }

192 }

193

194:: atomic { (c[3] == 1) && (c[2] == 1) &&

195 ((c[6] ==o) ll (c[7] ==O)) 88 (y3 = o) —> y3 = 1; }

196

197:: atomic { (c[7] == 0) && (c[2] == 1) && (y3p

I!
H
‘-
\--’

0)-> y3p
198 0d;

199 }

200

201

202

203

zoaproctype Pseud00() {

205 do

2mi/* This high atomicity recovery action has been refined by adding

an the pre-synthesized components. Thus, we comment it out.

mm :: atomic { ( c[0] == 1) &&

mm ( ( ( c[1] == 1 ) ha ( c[2] == 1 ) && ( c[3] == 1 ) && ( c[4] == 1 ) &&
2u1( ( p0 == 2 ) ll ( p0 == 1 ) ) ) &&

2n (( c[7] == 0 ) ll ( c[5] == 0 ) II ( c[6] == 0 )) ) -> CE4] = 0; }

212 */

m8 :: atomic{ ((c[O] == 1) && (c[1] == 1) && (c[2] == 1) &&

m4 (c[3] == 1) && ((p0 == 2) ll (p0 == 1)) ) &&
2m (

mo ((c[4] == 0) && (c[5] == 0) && (c[6] == 0) eh (c[7] == 0)) ll
2r7((C[4] == 1) && (c[5] == 1) && (c[6] == 1) && (c[7] == 1))
m8)

219 -> p0 = O; }
220

221:: atomic {

249

222 (c[0] == 1) && (c[1] == 1 ) && (c[2] = 1) && (c[3] == 1) &&

223 (c[4] == 0) && (p == 2) &&

224 (

225 ((c[4] == 0) 8m (c[5] == 0) && (c[6] == 1)) II

226 ((c[4] == 0) && (c[5] == 1) && (c[6] == 0)) II

227 ((c[4] == 0) 8188 (c[5] == 0) && (c[7] == 1)) II

228 ((c[4] == 0) && (c[5] == 1) && (c[7] == 0)) II

229 ((c[4] == 0) && (c[6] == 1) 88: (c[7] == 0)) II

230 ((c[4] == 0) he (c[5] == 0) && (c[7] == 1))

231) —> p0= 1; }

232

233 :: atomic {

234 (c[0] == 1) 818:

mm ((C[1] == 1) && (C[2] == 1) && (C[3] == 1) &&
mm (C[4] == 0)&& (C[5] == 1) && (C[6] == 1) &&
237 (c[7] == 1) 8181 ((p0 == 2) ll (p0 == 1)) )
211 -> c[0] =1; c[4] = 1; p0 = o;
239 }

240 0d;

241 }

242

243 proctype PseudolC) {

244 do

245 :: atomic {

24o (c[0] == 1) 8:81

247 ((c[1] == 1) && (c[2] == 1) as: (c[3] == 1) &&
248 (CM) == 0) && (c[5] == 1) 8:81 (c[6] == 0) &&
249 (cm == 0) 88 (p0 == 1) 88 (p1 == 0) 88
:50 (p2 == 0) 88 (p3 == 2)) -> c[5] =0;
251 }

mm
253:: atomic{(c[0] == 1) && (c[1] == 1) && (c[2] == 1) &&

254 (c[3] == 1)&& (c[4] == 0) && (c[5] == 0) &&

250

255 (p0 == 1) 88 ((C[6] == 1) ll (C[7] == 1))

256

257 od;

258 }

259

250 proctype Pseudo2() {

261 do

-> c[5] = 1; }

262 :: atomic {(c[O] == 1) 88 (c[1] == 1) 88 (c[2] == 1) 88

263 (c[3] == 1) 88 (CM) == 0) 88 (c[5] == 1) 88
21:1 (c[6] == 0) 88 (cm == 1) 88 (p0 == 1)
265 -> { c[6] = 1;
266}

267}

268 od;

269}

270
271 proctype Pseud03() {

272 do

273:: atomic { (c[0] == 1) && (c[1] == 1) && (c[2] == 1) &&

271 (c[3] == 1) 88 (c[4] == 0) 88 (c[5]
175 (c[6] == 1) 88 (CW) == 0) 88 (p0
2715 (p1 == 0) 88 (p2 == 0) 88 (p3
277 ->
}

2790d;

280}

281

282 proctype Faults() {

283 if

284 :: atomic { (true) ->
285 :: atomic { (true) ->
286 :: atomic { (true) ->
287:: atomic { (true) ->

c[0]
c[0]
c[l]
c[1]

251

1) 88

1) 88

2)

CD] = 1;

288

rhrhﬁr-M

(true)
(true)
(true)

(true)

(true)
(true)
(true)

(true)

(true)
(true)
(true)

(true)

atomic{ (true) ‘>

atomic{ (true) ->

atomic{ (true) ->

289 :: atomic
2m1:: atomic
291 :: atomic
292 :: atomic
an

an :: atomic
an :: atomic
296 :: atomic
297 :: atomic
mm

mm 2: atomic
3m1:: atomic
301:: atomic
an :: atomic
mm

mm ::

um ::

mm ::

307 fi;

mm

ama}

am

311 init{

:n2run Faults();

c[2]
c[2]
c[3]
c[3]

c[4]
c[4]
c[5]
c[5]

c[6]
c[6]
C[7]

CE7]

p0

pO=

= o; }
= 1; }
= o; }
= 1; }
= o; }
= 1; }
a o; }
= 1; }
= o; }
= 1; }
a o; }
= 1; }
o; }
1; }

2; }

auirun P0(); run P1(); run P2(); run P3();

384run PseudoO();

:n5run Pseudo2();

ausrun Pseud03();

317 }

run Pseud01();

252

Appendix B: Agreement in the
Presence of Byzantine and Failstop

Faults

In this section, we present a comprehensive example of adding fault-tolerance to a
fault-intolerant program using our software framework FTSyn. Speciﬁcally, we show
how developers of fault-tolerance can interact with FTSyn in order to add masking
fault-tolerance to an agreement program. This example may be thought of as a brief
version of the user manual for our framework. A more detailed user manual including
the source code of FTSyn is available at [73].

The fault-intolerant program consists of a general process and four non-general
processes that are perturbed by Byzantine and fail-stop faults. The user should
specify the input fault-intolerant program, its variables, its invariant, its speciﬁcation,

and the faults in a text ﬁle. The input ﬁle of the agreement program is as follows:

1 program Byzant ine-Failstop

2 var
3 bool bi;
4 bool bj ;
5 bool bk;
5 bool b1;
7 bool bg;

253

9 int dg=0, domain 0 .. 1;

10 int di, domain -1 .. 1;
11 // (di == -1) means process $i$ has not yet decided.
12 int dj, domain -1 .. 1;
13 int dk, domain -1 .. 1;
14 int d1, domain -1 .. 1;

15

16 bool fi;

17 bool fj ;
18 bool fk;
19 bool f1;

20

21 bool upi;

22 bool upj;

23 bool upk;

24 bool upl;

25

26 // The structure of process i.

27 process i

28 begin

29 ((di == -1) 88 (fi == 0) 88 (upi == 0)) -> di = dg ;
30 I

31 ((di != -1) && (fi == 0) && (upi == 0)) -> fi = 1 ;

32

33 read di, dj, dk, d1, dg, fi, upi, bi;

34 write di, fi;

35 end

36

37 // The structure of process 3'.

38 process 3'

39 begin

40 ((dj == -1) && (fj == 0) && (upj == 0)) -> dj = dg;

41I

254

42((dj 1= -1) 88 (fj == 0) 88 (upj == 0)) -> fj = 1;
43

44 788d di, dj, dk, d1, dg, fj, upj, bj;

45 write dj, fj;

46 end

47

48 // The structure of process k.

49 process k

50 begin

51((dk == -1) 88 (fk == 0) 88 (upk == 0)) -> dk = dg;
52 I

53 ((di: != -1) && (fk == 0) && (upk == 0)) -> fk = 1;
54

55 786d di, dj, dk, d1, dg, fk, upk, bk;

56 write dk, fk;

57 end

58

59 // The structure of process 1.

60 process 1

61 begin

62 ((d1 == -1) 88 (f1 == 0) 88 (upl == 0)) -> d1 = dg;
63 l

44((41 1= -1) 88 (11 == 0) 88 (upl == 0)) -> 11 = 1;

65
66 read di, dj, dk, d1, dg, fl, upl, b1;
67 write (11, f1;

68 end

69

70 // Faults are represented as a process.
71

72 fault FailstopAndByzantine

73 begin

74 ((upi == 1)&&(upj == 1)&&(upk == 1)&&(up1 == 1))

255

-> upi = 0, upj = 0, upk = 0, upl

77((bi == 0)88(bj == 0)88(bk == 0)88(b1 == 0)88(bg == 0))

-> bi = 1, bj = 1, bk = 1, bl = 1, bg = 1,

80((bi == 1)) -> di = 1 , di =0 ,

81|

82((bj == 1)) -> dj = 1 , dj =0 ,

83l

84((bk == 1)) -> dk = 1 , dk =0 ,

85|

86((b1 == 1)) -> d1 = 1 , d1 =0 ,

87I

88((bg == 1)) -> dg = 1 , dg =0 ,

89

90

end

91// The invariant of the program.

92 invariant

93( (

94 ((bg==0) 88

95 (((bi == 1) 88 (bj == 0)88 (bk =2 0)88 (bl
96 ((bj == 1) 88 (bi == 0)88 (bk == 0)88 (b1
97 ((bk == 1) 88 (bj == 0)88 (bi == 0)88 (bl
94 ((b1 == 1) 88 (bj == 0)88 (bk == 0)88 (bi

99

100

101

102

103

104

105

106

107

((bi == 0) 88 (bj = 0)88 (bk == 0)88 (bl
((bi==1)l|(di==-1)||(di==dg))88
((bj==1)||(dj==-1)|l(dj==dg))88
((b ==1)ll(dk==-1)||(dk==dg))&8
((bl==1)lI(dl==-1)ll(d1==dg))88
((bi==1)|l(fi==0)ll(di!=-1) )88
((bj==1)l|(fj==0)||(dj!=-1) )88
((bk==1)l|(fk==0)l|(dk!=-1) )88

((b ==1)||(f1==0)l|(d1!=-1) ) ) II

256

0)) ll
0)) ll
0)) ll
0)) ll
0)) ) 88

 

108

109 ((bg==1)&& (bi==0)8&(bj==0)&&(bk==0)&&(b1==0)88 (

no ((((upi == 1) 88 (upj == 1)88 (upk == 1)88 (upl == 1))) 88
111 ((d'==dj)&&(dj==dk)&&(dk==dl)&&(di.'=-1)) ) II
112 ((((upi == 1) 88 (upj == 1)88 (upk == 1)88 (upl == 0))) 88
n3 ((di==dj)88(dj==dk)88(di!=-1)) ) ll

1m ((((upi == 1) 88 (upj == 1)88 (upk == 0)88 (upl == 1))) 88
115 ((di==dj)&&(dj==d1)&&(di!=-1)) ) ll

116 ((((upi == 1) 88 (upj == 0)88 (upk == 1)88 (upl == 1))) 88
117 ((di==dk)&&(dk==dl)&&(di!=-1)) ) ll

11s ((((upi == 0) 88 (upj == 1)88 (upk == 1)88 (upl == 1))) 88
119 ((dj==dk)&&(dk==dl)&&(dj!=-1)) )

no ))

1m )

122 &&

us (

124 ((upi == 0) 88 (upj == 1) 88 (upk ==1) 88 (upl == 1)) ||
125 ((upi == 1) 88 (upj == 0) 88 (upk ==1) 88 (upl == 1)) II
no ((upi == 1) 88 (upj == 1) 88 (upk ==0) 88 (upl == 1)) II
127 ((upi == 1) 88 (upj == 1) 88 (upk ==1) 88 (upl == 0)) II
128 ((upi == 1) 88 (upj == 1) 88 (upk ==1) 88 (upl == 1)) ))

129

130 // The specification of the program is specified in three parts starting
131 // with speciﬁcation keyword.

132

133 speciﬁcation

134

135 // The destination part identifies a set of states that every
136 // transition reaching them violates safety.

137

138 destination

139 (

141( (bid == 0) 88 (bjd == 0) 88 (upid == 1) 88 (upjd == 1) 88 (did 1= —1) 88

257

: ”tum—4n1:37.142: «:1 '.' " '

*-
5.3
- ...
.1.
.-
.
-
,.
w
«s
'-
o
v
o
.1
o
c
-
-
o
.

 

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

( (bid

( (bid

( (bjd

( (bjd

( (bkd

((bgd

((bgd

((bgd

((bgd
)

(djd 1= —1) 88

== 0) 88 (bkd == 0) 88

(dkd != -1) 88

= O) 88 (bld == 0) 88

(dld != -1) 88

0) 88 (bkd == 0) 88

(dkd != -1) 88

0) 88 (bld == 0) 88

(dld != ~1) 88

0) 88 (bld == 0) 88

(dld != -1) 88

== 0) 88 (bid

0) 88 (bjd

O) 88 (bkd

0) 88 (bld

159 // The relation part

160

161

162

163

164

165

166

167

168

169

171

172

173

relatiorz

0)
0)

0)

0)

88
88
88
88

(did 1= djd) 88 (fid
(upid == 1) 88 (upkd
(did 1= dkd) 88 (fid
(upid == 1) 88 (upld
(did l= dld) 88 (fid
(upkd == 1) 88 (upjd
(djd 1= dkd) 88 (fjd
(upld == 1) 88 (upjd
(djd 1= dld) 88 (fjd
(upkd == 1) 88 (upld

(dkd != dld) 88 (fkd =

(did 1= -1) 88 (did
(djd 1= -1) 88 (djd
(dkd 1= -1) 88 (dkd
(dld 1= -1) 88 (dld

!= dgd) 88 (fid

1)
1)
1)
1)
1)
1)
1)
1)
1)
1)
1)

88
88
88
88
88
88
88
88
88
88
88

(fj
(did
(fkd
(did
(fld
(434
(fkd
(djd
(fld
(dkd
(fld

== 1)) ll
1= -1) 88
== 1)) ll
1= -1) 88
== 1)) II
!= -1) 88
== 1)) II
!= -1) 88
== 1)) II
!= -1) 88
== 1)) ll

1)) ||

1= dgd) 88 (fjd == 1)) II

1= dgd) 88 (fkd == 1)) II

1= dgd) 88 (fld

1))

identifies a set of transitions that violate safety.

((((bis == 0)88 (bid == 0) 88

// The

init

(((bjs
(((bks
(((bls
(((bis
(((bjs
(((bks
(((bls

inn section is used for specifying the

== 0)
== 0)
== 0)

- O)

0)
0)

0)

(fis == 1) 88 (dis .

88 (bjd == 0) 88 (fjs ==
88 (bkd == 0) 88 (fks ==
88 (bld == 0) 88 (£13 ==
88 (bid == 0) 88 (fis ==
88 (bjd == 0) 88 (fjs ==
88 (bkd == 0) 88 (fks ==
88 (bld == 0) 88 (fls ==

258

1)
1)
1)
1)
1)
1)

1)

did))) II

88
88
88
88
88
88
88

(djs

(dks .
(dls .

(fid
(fjd
(fkd

(fld

initial states.

djd)))||
dkd)))||
d1d)))||
0)))Il
0)))ll
0)))ll
0))))

174

175 // Each initial state is specified using the state keyword.
176

177 state

178bi=0; bj=0; bk=0;b1=0; bg=0; dg=0;

179di = -1; dj = -1; dk = -1; bl = -1;

isofi=0; fj =0; fk=0; f1=0; upi= 1; upj = 1;
1811.1pk = 1; upl = 1;

182

183

134 state

185bi=0; bj=0;bk=0;b1=0, bg=0; dg=1,
issdi = -1; dj = -1; dk = -1; bl = -1;

187fi=0; fj =0; fk=0; f1=0; upi= 1; upj = 1;

189

B.1 The Description of the Input File

The fault-intolerant agreement program consists of four non-general processes
B, Pj, Pk, H and a general Pg. Each non-general process has four variables d, f, b,
and up. Variable dz’ represents the decision of a non-general process H, fz' denotes
whether R- has ﬁnalized its decision, ()1 denotes whether P,- is Byzantine or not, and
upi states whether P,- has failed or not. Process Pg also has variables (19 and bg. We
assume that the process Pg never fails. Thus, the variables of the agreement program
are as shown in the var section (cf. Lines 2-24).

Transitions of the fault-intolerant program. If process P.- has not copied a
value from the general and P; has not failed (i.e., upi = 1) then P,- copies the decision
of the general (ﬁrst action in the body of process H- (cf. Line 29)). If P,- has copied a

decision and as a result (12' is different from -1 then P,- can ﬁnalize its decision if it has

259

not failed (second action in the body of process P, (of. Line 31)). Other non—general
processes (Pj, Pk, and B) have a similar structure as shown in the input ﬁle (cf. Lines
37-68).

Read / Write restrictions. Each non-general process P,- is allowed to read
{di, dj, dk, dl, dg, fi, upi, bz'}. Thus, P,- can read the d values of other processes and all
its variables. The set of variables that P,- can write is {di, f 2} Read / write restrictions
of each process are speciﬁed in its body after the program actions (usng read and

write keywords (e.g., Lines 33-34)).

Faults. A Byzantine fault transition can cause a process to become Byzantine
if no process is initially Byzantine. A Byzantine process can arbitrarily change its
decision (i.e., the value of d). Moreover, the program is subject to fail-stop faults
such that at most one of the non-general processes can be failed, and as a result, it
will stop executing any action. The developers of fault-tolerance should specify the
faults similar to an independent process that can perturb program variables (cf. Lines
72-89).

Invariant. The developers of fault-tolerance should represent the invariant of the
program as a state predicate. In particular, the invariant is a Boolean function (over
program variables) that takes a state 3 and identiﬁes whether .9 is an invariant state

or not.

In the agreement program, the bg variable partitions the invariant into two parts:
the set of states where P9 is non-Byzantine (cf. Line 94), and the set of states where
P9 is Byzantine (cf. Line 109). When P9 is non-Byzantine, at most one of the non-
generals could be Byzantine (cf. Lines 95-107). Also, for every non-general process
P,- that is non-Byzantine (i) P,- has not yet decided or it has copied the value of dg (cf.
Lines 100—103), and (ii) P,- has not yet ﬁnalized or P,- has decided (cf. Lines 104-107).
When Pg becomes Byzantine, all the non-general processes are non-Byzantine and

all the processes that have not failed agree on the same decision (cf. Lines 109—119).

260

The invariant of the agreement program stipulates the above conditions on the states

where at most one non-general process has failed (cf. Lines 124-128).

Safety speciﬁcation. The safety speciﬁcation requires that if Pg is Byzantine, all
the non-general non-Byzantine processes that have not failed should ﬁnalize with the
same decision (agreement). If P9 is not Byzantine, then the decision of every ﬁnalized
non-general non-Byzantine process should be the same as dg (validity). Thus, safety
is violated if the program executes a transition that satisﬁes at least one of the

conditions speciﬁed in the speciﬁcation section of the input ﬁle (cf. Lines 133-169).

The speciﬁcation section is divided into two parts: destination and relation parts.
Intuitively, in the destination part (cf. Lines 138-158), we write a state predicate that
identiﬁes a set of states Sdesﬁnation, where if a transition t reaches sdesunatim then t
violates safety. In the relation part (cf. Lines 162—169), we specify a condition that
identiﬁes a set of transitions that should not be executed by the program. Note,
that we have added a sufﬁx “(1” (respectively, sufﬁx “8”) to the variable names in
the speciﬁcation section that stands for destination (respectively, source ). Since the
relation condition speciﬁes a set of transitions tsp“ using their source and destination
states, we need to distinguish between the value of a speciﬁc variable :1: in the source
state of tspec (i.e., .125 means the value of a: in the source state of tsp“) and in the
destination state of tsp“ (i.e., :rd means the value of :1: in the destination state of
Its-pee)-

In the case that the program speciﬁcation does not stipulate any destination con-
dition on safety-violating transitions, we leave the destination section empty with the
keyword noDestination . We use similar keyword noRelation for the case where we

do not have relational conditions in the speciﬁcation.

Initial states. The keyword init (cf. Line 173) identifies the section of the input
ﬁle where the user has to specify some initial states. These initial states should belong

to the invariant. For each initial state, the user should use the reserved word state (cf.

261

Line 177). In the state section (cf. Lines 177-181 and 185-188), the user should assign

some values to the program variables that belong to their corresponding domain.

B.2 The Output of the Framework

In this section, we present the output of the synthesis framework. In particular,
we present the actions of non-general processes. Observe that the structures of the
non-generals are not symmetric.

In the rest of this section, we describe the structure of each non-general process
that is subject to Byzantine and fail-stop faults. Note that each non-general process
can take an action if and only if it has not yet ﬁnalized and also has not failed due
to fail-stop faults.

The description of process Pi. Process P, of the fault-tolerant agreement

program consists of 5 actions. We describe each action as a separate item.

1. If process P,- has not yet decided then it performs one of the following actions:
either P,- copies the decision of the general, or if at least two other non—generals

have decided on the same value then P,- copies their decision.

1(di == -1) 88 (

2 ((dk == 0)88(d1 == 0)88(fi == 0)88(upi == 1)) II

3 ((dg == 0)88(fi == 0)88(upi == 1)) ll

4 ((dj == 0)88(d1 == 0)88(fi == 0)88(upi == 1)) ll

5 ((dj == 0)88(dk == 0)88(fi == 0)88(upi == 1)) ) -> set_di_va10
6

7(di == -1) 88 (

a ((dk == 1)88(d1 == 1)88(fi == 0)88(upi == 1)) II

9 ((dg == 1)88(fi == 0)88(upi == 1)) ll

10 ((dj == 1)88(d1 == 1)88(fi == 0)88(upi == 1)) ll

11 ((dj == 1)88(dk == 1)88(fi == 0)88(upi == 1)) ) -> set_di_va11

262

2. If process P,- has copied 1, and at least one of the following conditions holds
then process P,- changes its decision to 0: (i) Pk and B have decided on 0 and
P]- has decided; (ii) P]- and P, have decided on 0, or (iii) Pj and Pk have decided
on 0 and P, has decided.

1(di == 1) 88 (

2 (((dj ==O )||(dj == 1))88(dk == 0)88(d1 == 0)88(fi 0)88(upi == 1)) ll

3 ((dj == 0)88(d1 == 0mm == 0)88(upi == 1)) II
4 ((dj ==O )88(dk == 0)88((d1 == 0)88(d1 == 1))88(fi == 0)88(upi == 1)) )
5 -> set_di_va10

3. If process P,- has copied 0, and at least one of the following conditions holds
then process P,- changes its decision to 1: (i) P,- and Pk have decided on 1; (ii)
P; and P9 have decided on 1; (iii) PJ- and P, have decided on 1, or (iv) Pk and
P, have decided on 1.

1(di == 0) 88 (

2 ((d3 == 1 )88(dk == 1)88(fi == 0)88(upi == 1)) ||
3 ((dl == 1 )88(dg == 1)88(fi == 0)88(upi == 1)) ll
4 ((dj == 1 )88(d1 == 1)88(fi == 0)88(upi == 1)) ll
5 ((dk == 1 )88(d1 == 1)88(fi == 0)88(upi == 1)) ) -> Set_di_va11

4. Process P,- ﬁnalizes with decision 0 if at least one of the following conditions
holds. (i) P]- has decided on 0 or P]- has not. yet decided, and Pk has decided on
0, and P, has decided on 0 or P; has not yet decided; (ii) P,- has decided on 0
or P]- has not yet decided, and Pk has decided on 0 or Pk has not yet decided,
and P, has decided on 0; (iii) P]- has decided on O, and Pk has decided on O or

Pk has not yet decided, and P, has decided 011 O or P, has not yet decided.

1(di == 0) 88 (
2 (((dj == O)||(dj == ~1))88(dk == 0)88((d1 == 0)|l(dl == -1))88

3 (fi == 0)88(upi == 1)) II

263

4 (((dj == 0)|l(dj == -1))88(d1 == 0)88((dk == 0)|l(dk == -1))88

5 (fi == 0)88(upi == 1)) ll

6 ((dj == 0)88((dk == 0)|l(dk == -1))8&((d1 == 0)|l(d1 == -1))88

7 (fi == 0)88(upi == 1)) )

s -> set_fi_va11

5. Process P, ﬁnalizes with decision 1 if at least one of the following conditions
holds. (1) P,- has decided on 1, and Pk has decided on 1 or Pk has not yet
decided, and P; has decided on 1 or P, has not yet decided; (ii) P,- has decided
on 1 or P, has not yet decided, and P, has decided on 1 or H has not yet decided,
and Pk has decided on 1; (iii) P,- has decided on 1 or P,- has not yet decided,

and Pk has decided on 1 or Pk has not yet decided, and B has decided on 1.

1(di == 1) 88(

2 ((dj == 1)88((dk == 1)Il(dk == -1))88((d1 == 1)Il(d1 == —1))88

3 (fi .. 0)88(upi == 1)) II

4 (((dj == 1)Il(dj == -1))88(dk == 1)88((d1 == 1)Il(d1 == -1))88

5 (fi == 0)88(upi == 1)) II

6 (((dj == 1)Il(dj == -1))88((dk == 1)Il(dk == -1))88(d1 == 1)88

7 (fi == 0)88(upi == 1)) >

a -> set_fi_va11

The description of process P,-. The actions of process P, in the fault-tolerant

agreement program are as follows:

1. If process P,- has not yet decided then it performs one of the following actions:
P,- either copies the decision of the general, or if at least two other non-generals

have decided on the same value then P,- copies their decision.

2. If process P,- has copied 1, and at least one of the following conditions holds
then process P,- changes its decision to 0: (i) P, and B have decided on 0; (ii)

Pk and B have decided on 0, or (iii) P,- and Pk have decided on 0.

264

3. If process P,- has copied O, and at least one of the following conditions holds
then process P,- changes its decision to 1: (i) R and Pk have decided on 1; (ii)

P,- and B have decided on 1, or (iii) Pk and P; have decided on 1.

4. Process P, ﬁnalizes with decision 0 if at least one of the following conditions
holds: (i) P,- has decided on 0 or H has not yet decided, and Pk has decided on
O or Pk has not yet decided, and H has decided on 0; (ii) P,- has decided on 0,
and Pk has decided on O or Pk has not yet decided, and B has decided on O or
B has not yet decided; (iii) P,- has decided on 0 or P,- has not yet decided, and
Pk has decided on 0, and H has decided on O or B has not yet decided.

5. Process P,- ﬁnalizes with decision 1 if at least one of the following conditions
holds: (i) P,- has decided on 1 or P,- has not yet decided, and P1, has decided on
1 or Pk has not yet decided, and B has decided on 1; (ii) P,- has decided on 1
or P,- has not yet decided, and B has decided on 1 or B has not yet decided,
and Pk has decided on 1; (iii) i has decided on 1, and It has decided on 1 or A:

has not yet decided, and I has decided on 1 or I has not yet decided.

The description of process Pk. The actions of process Pk in the fault-tolerant

agreement program are as follows:

1. If process Pk has not yet decided then it performs one of the following actions:
Pk either copies the decision of the general, or if at least two other non-generals

have decided on the same value then Pk copies their decision.

2. If process Pk has copied 1, and at least one of the following conditions holds
then process Pk changes its decision to 0: (i) B and P, have decided on 0; (ii)
P,- and P,- have decided on 0; (iii) P,- and B have decided on 0; (iv) P,- and B
have decided on 0; (v) P,- and P, have decided on 0, or (vi) P,- and P, have
decided on O.

3. If process Pk has copied 0, and at least one of the following conditions holds
then process Pk changes its decision to 1: (i) P, and P, have decided on 1; (ii)
B and P, have decided on 1; (iii) P,- and P,- have decided on 1; (iv) P,- and P,

have decided on 1, or (v) P, and B have decided on 1.

4. Process Pk ﬁnalizes with decision 0 if at least one of the following conditions
holds: (i) P,- has decided on 0, and P,- has decided on 0 or P,- has not yet decided,
and B has decided on 0 or H has not yet decided; (ii) P,- has decided on 0 or
P, has not yet decided, and P, has decided on O, and H has decided on 0 or B
has not yet decided; (iii) P,- has decided on O or P,- has not yet decided, and H

has decided on O, and P,- has decided on 0 or P, has not yet decided.

5. Process Pk ﬁnalizes with decision 1 if at least one of the following conditions
holds: (i) P,- has decided on 1, and P, has decided on 1 or P, has not yet decided,
and H has decided on 1 or H has not yet decided; (ii) P,- has decided on 1 or
P,- has not yet decided, and P,- has decided on 1 or P,- has not yet decided, and
B has decided on 1; (iii) P,- has decided on 1 or P,- has not yet decided, and B

has decided on 1 or H has not yet decided, and P, has decided on 1.

The description of process P,. The actions of process B in the fault-tolerant

agreement program are as follows:

1. If process H has not yet decided then it performs one of the following actions:
P, either copies the decision of the general, or if at least two other non-generals

have decided on the same value then B cepies their decision.

2. If process B has copied 1, and at least one of the following conditions holds
then process P, changes its decision to 0: (i) P,- and P, have decided on 0; (ii)
P, and Pk have decided on 0; (iii) P, and P,- have decided on 0; (iv) P,- and Pk

have decided on 0.

266

3. If process P, has copied 0, and at least one of the following conditions holds
then process P, changes its decision to 1: (i) P, and P,- have decided on 1; (ii)

P, and Pk have decided on 1; (iii) P, and Pk have. decided 011 1.

4. Process P, ﬁnalizes with decision 0 if at least one of the following conditions
holds: (i) P, has decided on 0, and P, has decided on O or P,- has not yet decided,
and Pk has decided on 0 or Pk has not yet decided; (ii) P, has decided on 0 or
P, has not yet decided, and P,- has decided on 0 or P,- has not yet decided, and
Pk has decided on 0; (iii) P, has decided on 0 or P, has not yet decided, and P,-

has decided on O, and Pk has decided on 0 or Pk has not yet decided.

5. Process P, ﬁnalizes with decision 1 if at least one of the following conditions
holds: (1) P, has decided on 1, and P,- has decided on 1 or P,- has not yet dec1ded,
and Pk has decided on 1 or Pk has not yet decided; (ii) P, has decided on 1 or
P, has not yet decided, and P,- has decided on 1 or P,- has not yet decided, and
Pk has decided on 1; (iii) P, has decided on 1 or P, has not yet decided, and Pk

has decided on 1 or Pk has not yet decided, and P, has decided on 1.

267

BIBLIOGRAPHY

268

Bibliography

[1] S. S. Kulkarni and A. Arora. Automating the addition of fault-tolerance. Formal
Techniques in Real- Time and Fault- Tolerant Systems, page 82, 2000.

[2] EA. Emerson and EM. Clarke. Using branching time temporal logic to synthe-
size synchronization skeletons. Science of Computer Programming, 2(3):241—266,
1982.

[3] P. C. Attie, A. Arora, and E. A. Emerson. Synthesis of fault-tolerant concur-
rent programs. ACM Transactions on Programming Languages and Systems
(TOPLAS), 26:125 — 185, 2004.

[4] P. Attie and A. Emerson. Synthesis of concurrent programs for an atomic
read/write model of computation. ACM TOPLAS (a preliminary version of
this paper appeared in PODC96), 23(2), March 2001.

[5] A. Pnueli and R. Rosner. On the synthesis of a reactive module. In Proceedings
of the 16th ACM Symposium on Principles of Programming Languages, pages
179—190, 1989.

[6] A. Pnueli and R. Rosner. Distributed reactive systems are hard to synthesis.
In Proc. of 313t IEEE Symposium on Foundation of Computer Science, pages
746—757, 1990.

[7] O. Kupferman and M.Y. Vardi. Synthesizing distributed systems. In Proc. 16th
IEEE Symp. on Logic in Computer Science, July 2001.

[8] Ali Ebnenasir. Automatic synthesis of distributed programs: A survey. http:
//www.cse.msu.edu/"ebnenasi/survey.pdf,2002.

[9] Felix C. Gartner and Arshad Jhumka. Automating the addition of fault-
tolerance: Beyond fusion-closed speciﬁcations. Formal Modeling and Analysis
of Timed Systems - Formal Techniques in Real— Time and Fault Tolerant System

(FORMA TS-F TRTF T 2004), Grenoble, France, September 22-24, 2004.

[10] A. Arora and S. S. Kulkarni. Component based design of multitolerant systems.
IEEE Transactions on Software Engineering, 24(1):63—-78, January 1998.

269

[11] V. Hadzilacos E. Anagnostou. Tolerating transient and permanent failures. Pro-
ceedings of the 7th International Workshop on Distributed Algorithms. Les Dia-
blerets, Switzerland, pages 174—188, September 1993.

[12] S. Dolev and T. Herman. Superstabilizing protocols for dynamic distributed
systems. Proceedings of the Second Workshop on Self-Stabilizing Systems, pages
3.1 — 3.15, 1995.

[13] S. Tsang and E. Magill. Detecting feature interactions in the intelligent network.
Feature Interactions in Telecommunications Systems II, I OS Press, pages 236 —
248, 1994.

[14] S. S. Kulkarni, A. Arora, and A. Chippada. Polynomial time synthesis of byzan-
tine agreement. Symposium on Reliable Distributed Systems, page 130, 2001.

[15] S. S. Kulkarni and A. Ebnenasir. Enhancing the fault—tolerance of nonmasking
programs. In Proceedings of International Conference on Distributed Computing
Systems, page 441, 2003.

[16] B. Alpern and F. B. Schneider. Deﬁning liveness. Information Processing Letters,
21:181—185, 1985.

[17] A. Arora and M. G. Gouda. Closure and convergence: A foundation of fault-
tolerant computing. IEEE Transactions on Software Engineering, 19(11):1015—
1027,1993.

[18] S. S. Kulkarni. Component-based design of fault-tolerance. PhD thesis, Ohio
State University, 1999.

[19] E. W. Dijkstra. Self-stabilizing systems in spite of distributed control. Commu-
nications of the ACM, 17:643—644, November 1974.

[20] A. Arora and S. Kulkarni. Designing masking fault-tolerance via nonmasking
fault-tolerance. Revised for IEEE Transactions on Software Engineering, 1995.
A preliminary version appears in the Proceedings of the Fourteenth Symposium
on Reliable Distributed Systems, Bad Neuenahr, 174—185, 1995.

[21] G. Varghese. Self-stabilization by local checking and correction. PhD thesis,
MIT/LCS/TR—583, 1993.

[22] E. W. Dijkstra. A Discipline of Programming. Prentice-Hall, 1990.

[23] B. Alpern and F. B. Schneider. Deﬁning liveness. Information Processing Letters,
21(4):181—185, 7 October 1985.

[24] M. Barborak, A. Dahbura, and M. Malek. The consensus problem in fault-
tolerant computing. ACM Computing Surveys, 25(2):17l—220, 1993.

270

[25] A. Doudou, B. Garbinato, R. Guerraoui, and A. Schiper. Muteness failure de-
tectors: Speciﬁcation and implementation. In European Dependable Computing
Conference, pages 71—87, 1999.

[26] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM
Transactions on Programming Languages and Systems, 4:382 — 401, July 1982.

[27] L. Gong, P. Lincoln, and J. Rushby. Byzantine agreement with authentication:
Observations and applications in tolerating hybrid and link faults. In Proceedings
Dependable Computing for Critical Applications-5, Champaign, IL, pages 139—
157, September 1995.

[28] M. Singhal and N. Shivaratri. Advanced Concepts in Operating Systems: Dis-
tributed, Database, and Multiprocessor Operating Systems. McGraw—Hill Pub-
lishing Company, 1994.

[29] Ali Ebnenasir and Sandeep S. Kulkarni. Efﬁcient synthesis of failsafe fault-
tolerant distributed programs. Technical Report MSU-CSE—05-13, Computer

Science and Engineering, Michigan State University, East Lansing, Michigan,
April 2005.

[30] K. M. Chandy and J. Misra. Parallel Program Design: A Foundation. Addison—
Wesley, 1988.

[31] S. S. Kulkarni and A. Ebnenasir. The complexity of adding failsafe fault-
tolerance. In Proceedings of International Conference on Distributed Computing
Systems, page 337, 2002.

[32] A. Arora and S. S. Kulkarni. Designing masking fault-tolerance via nonmasking
faulttolerance. IEEE Transactions on Software Engineering, pages 435—450,
June 1998. A preliminary version appears in the Proceedings of the Fourteenth
Symposium on Reliable Distributed Systems, Bad Neuenahr, 1995, pages 174—
185.

[33] A. Arora and S. S. Kulkarni. Detectors and correctors: A theory of fault-tolerance

components. International Conference on Distributed Computing Systems, pages
436—443, May 1998.

[34] A. Arora and S. S. Kulkarni. Component based design of multi-tolerant systems.
IEEE Transactions on Software Engineering, 24:63-78, January 1998.

[35] A. Moormann Zaremski and J .M. Wing. Speciﬁcation matching of software com-
ponents. ACM Transactions on Software Engineering Methods (A preliminary
version appeared in Proceedings of the 3rd ACM SICSOF T Symposium on the
Foundations of Software Engineering, 1995), 6(4):333 - 369, 1997.

[36] G. Holzmann. The model checker spin. IEEE Transactions on Software Engi-
neering, 1997.

271

[37] Spin language reference. http://spinroot . com/spin/Man/promela.html.

[38] Anish Arora, Mohamed G. Gouda, and George Varghese. Constraint satisfaction
as a basis for designing nonmasking fault-tolerant systems. Journal of High Speed
Networks, 5(3):293-—306, 1996.

[39] Z. Liu and M. Joseph. Transformations of programs for fault-tolerance. Formal
Aspects of Computing, 4(5):442-469, 1992.

[40] A.I. Tomlinson and V.K. Garg. Detecting relational global predicates in dis-
tributed systems. In proceedings of the ACM/ONR Workshop on Parallel and
Distributed Debugging, San Diego, California, pages 21-31, May 1993.

[41] Neeraj Mittal. Techniques for Analyzing Distributed Computations. PhD thesis,
The University of Texas at Austin, 2002.

[42] Klaus Havelund and Tom Pressburger. Model checking java programs using java
pathﬁnder. International Journal on Software Tools for Technology Transfer
(STTT), 2(4), April 2000.

[43] Gerard J. Holzmann. From code to models. In Proceedings of the Second Interna-
tional Conference on Application of Concurrency to System Design (A CSD ’01),
pages 3—10, 2001.

[44] MG. Gouda and T. McGuire. Correctness preserving transformations for net-
work protocol compilers. Prepared for the Workshop on New Visions for Software
Design and Productivity: Research and Applications, 2001.

[45] M N esterenko and A Arora. Stabilization-preserving atomicity reﬁnement. Jour-
nal of Parallel and Distributed Computing, 62(5):766—791, 2002.

[46] M. Demirbas and A. Arora. Convergence reﬁnement. International Conference
on Distributed Computing Systems, pages 589 — 596, 2002.

[47] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of
Reusable Object- Oriented Software. Addison-Wesley Publishing Company, 1995.

[48] R. Bharadwaj and C. Heitmeyer. Developing high assurance avionics systems
with the SCR requirements method. In Proceedings of the 19th Digital Avionics
Systems Conference, Philadelphia, PA, October 2000.

[49] R. Hardin, R. Kurshan, S. Shukla, and M. Vardi. A new heuristic for bad cycle
detection using bdds. Computer Aided Veriﬁcation (CAV’97). LNCS Springer-
Verlag, 12542268 - 278, 1997.

[50] R. Bloem, H.N. Gabow, and F. Somenzi. An algorithm for strongly connected
component analysis in n log n symbolic steps. In Proc. F M CAD, LN CS Springer-
Verlag, 1954:37—54, 2000.

272

[51]

[52]

[53]

[54]

[55]

[56]

[57]

[58]

[59]

[60]

[61]

[62]

K. Fisler, R. Fraer, G. Kamhi, Y. Vardi, and Z. Yang. Is there a best symbolic
cycle-detection algorithm? In Proc. Tools and Algorithms for Construction and
Analysis of Systems, LNCS, 2031:420—434, 2001.

The alloy analyzer. http:/lalloy.mit.edu.

M. Moskewicz, C. Madigan, Y. Zhao, L. Zhang, and S. Malik. Chaff: Engineering
an efﬁcient sat solver. .3ch Design Automation Conference, Las Vegas, 2001.

Satisﬁability suggested format dimacs, may. ftp://dimacs.rutgers.edu/pub/
challenge/satisfiability/doc/satformat.tex,1993.

Jean-Christophe Filliétre, Sam Owre, Harald RueB, and N. Shankar. ICS: inte-
grated canonizer and solver. Proceedings of the 13th Conference on Computer-
Aided Veriﬁcation ( CA V’01 ), volume 2102 of Lecture Notes in Computer Science.
Springer— Verlag, 2001.

Z. Manna and P. Wolper. Synthesis of communicating processes from temporal
logic speciﬁcations. ACM Transactions on Programming Languages and Systems,
6(1):68-93, 1984.

P. Attie. Synthesis of large concurrent programs via pairwise composition. CON-
CUR ’99: 10th International Conference on Concurrency Theory, 1999.

Xinghua Deng, Matthew B. Dwyer, John Hatcliff, and Masaaki Mizuno.
Invariant-based speciﬁcation, synthesis and veriﬁcation of synchronization in con-
current programs. Proceedings of the 24th International Conference on Software
Engineering, May 2002.

Y.Inaba. An implementation of synthesizing synchronization skeletons using

temporal logic speciﬁcations. Master Thesis, The University of Texas at Austin,
1984.

Sandeep S. Kulkarni and Ali Ebnenasir. A framework for automatic synthesis
of fault-tolerance. Technical Report MSU-CSE—03-16, Computer Science and
Engineering, Michigan State University, East Lansing MI 48824, Michigan, July
2003.

A. Pnueli and R. Rosner. On the synthesis of an asynchronous reactive module.
In Proceeding of 16th International Colloqium on Automata, Languages, and
Programming, Lec. Notes in Computer Science 372, Springer-Verlagz652—671,
1989.

A.W. Appel and AP. Felty. A semantic model of types and machine instructions
for proof-carrying code. In Proceedings of the 27th ACM Symposium of Principles
of Programming Languages, ACM Press, pages 243—253, 2001.

273

[63] Bernd Fisher, Johann Schumann, and Mike Whalen. Synthesizing certiﬁed code.
In Proceedings Formal Methods Europe( F ME ’02). Copenhagen, Denmark. LNAI,
Springer, 2002.

[64] P.J. Ramadge and W.M. Wonham. The control of discrete event systems. Pro-
ceedings of the IEEE, 77(1):81—98, 1989.

[65] S. Lafortune and F. Lin. On tolerable and desirable behaviors in supervisory
control of discrete event systems. Discrete Event Dynamic Systems: Theory and
Applications, 1(1):61—92, 1992.

[66] Feng Lin and W. Murray Wonham. Decentralized control and coordination of
discrete-event systems with partial observation. IEEE Transactions 0n Auto-
matic Control, 35(12), December 1990.

[67] Karen Rudie and W.M. Wonham. Think globally, act locally: Decentralized
supervisory control. IEEE Transactions 0n Automatic Control, 37(11):1692—
1708, 1992.

[68] Kurt Ryan Rohloff. Computations on distributed discrete-event systems. Ph.D.
thesis, University of Michigan, 2004.

[69] Wolfgang Thomas. On the synthesis of strategies in inﬁnite games. S TACS,
pages 1—13, 1995.

[70] R. McNaughton. Inﬁnite games played on ﬁnite graphs. Annals of Pure and
Applied Logic, 65(2):149—184, 1993.

[71] A. Anuchitanukul and Z. Manna. Realizability and synthesis of reactive modules.
International Conference on Computer-Aided Veriﬁcation, pages 156—169, 1994.

[72] EA. Emerson. Handbook of Theoretical Computer Science: Chapter 16, Tempo-
ral and Modal Logic. Elsevier Science Publishers B. V., 1990.

[73] F TSyn: A framework for automatic synthesis of fault-tolerance. http:/hmw.
cse.msu.edu/"ebnenasi/research/tools/ftsyn.htm.

[74] Audun Jusang. Security protocol veriﬁcation using spin. The First SPIN Work-
shop,1995.

[75] Gregory Duval and Jacques Julliand. Modeling and veriﬁcation of rubis micro—
kernel with spin. The First SPIN Workshop, 1995.

[76] MS. Laventhal. Synthesis of Synchronization Code for Data Abstraction. PhD
thesis, MIT, 1978.

[77] Z. Manna and R. Waldinger. A deductive approach to program synthesis. ACM
Transactions on Programming Languages and Systems, 2(1):90—121, 1980.

274

[78] M. Abadi, L. Lamport, and P. Wolper. Realizable and unrealizable concurrent
program speciﬁcation. In Proceeding of 16th International Colloqium on Au-
tomata, Languages, and Programming, volume 372 of LN CS, pages 1-17, 1989.

[79] H. Wong-Toi and D. Dill. Synthesizing processes and schedulers from temporal
logic speciﬁcations. Computer-Aided Veriﬁcation (Proceeding of CAV90 Work-
shop, DIMACS Series in Discrete Mathematics and Theoretical Computer Sci-
ence Vol. 3, 1991.

[80] E. Asarin, O. Maler, and A. Pnueli. Symbolic controller synthesis for discrete
and timed systems. In Hybrid System II, LNCS 999, pages 1 — 20, 1995.

[81] M. Y. Vardi. An automata—theoretic approach to fair realizability and synthesis.
Computer Aided Verﬁﬁcation, volume 939 of LNCS,, pages 267—278, 1995.

[82] Ali Ebnenasir and Sandeep Kulkarni. Automatic addition of liveness. Techni-
cal Report MSU-CSE—04-22, Department of Computer Science, Michigan State
University, East Lansing, Michigan, June 2004.

[83] Jiri Barnat, Lubos Brim, and Jitka Stfibrna. Distributed LTL model-checking
in SPIN. Lecture Notes in Computer Science, 2057:200—216, 2001.

[84] U.Stern and D. L. Dill. Parallelizing the murphi veriﬁer. Proceedings of Computer
Aided Veriﬁcation (CAV ’97), 1254 of LNCS:256—267, 1997.

[85] S. Ben-David, T. Heyman, and O. Grumberg. Scalable distributed on—the—ﬁy
symbolic model checking. In third International Conference on Formal methods
in Computer-Aided Design (FMCAD’UO), Austin, Texas, 2000.

[86] S. Aggarwal, R. Alonso, and C. Courcoubetis. Distributed reachability analysis
for protocol veriﬁcation environments. in Discrete Event Systems: Models and
Applications, IIASA Conference, pages 40—56, 1987.

[87] H. Garavel, R. Mateescu, and I. Smarandache. Parallel state space construction
for model checking. In Proc. SPIN Workshop on Model Checking of Software,
2057 of LNCS:215+, 2001.

[88] Gerard Holzmann. State compression in spinzrecursive indexing and compression
training runs. The Third SPIN Workshop, 1997.