AUTOMATIC VERIFICATION AND REVISION FOR
MULTITOLERANT PROGRAMS
By
Jingshu Chen

A DISSERTATION
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of
Computer Science - Doctor of Philosophy
2013

ABSTRACT

AUTOMATIC VERIFICATION AND REVISION FOR
MULTITOLERANT PROGRAMS
By
Jingshu Chen
The notion of multitolerance is based on the observation that modern programs are often
subject to multiple faults. And, the requirements in the presence of these faults vary based
on the nature of the faults, their severity and the cost of providing fault-tolerance to them.
Hence, assurance of multitolerant systems is necessary and challenging. This dissertation
proposes to provide such assurance via automated veriﬁcation and revision.
Regarding veriﬁcation, we focus on veriﬁcation of self-stabilization, which is the ability
of the program to recover from arbitrary states. Most of literature on veriﬁcation of faulttolerance focuses on safety property; our work complements it by considering liveness properties. Hence, we envision veriﬁcation of multitolerant programs by using existing approaches
for verifying safety and using the results from this dissertation for verifying liveness. We propose a technique that is based on a bottleneck (fairness requirements) identiﬁed in existing
work on veriﬁcation of stabilization. Our approach uses the eﬀectiveness of fairness along
with symbolic model checking, and hence reduces the cost of veriﬁcation substantially. We
also propose a constraint-based approach that reduces the task of verifying self-stabilization
into a well-studied problem of constraint solving, so that one can leverage existing highly
optimized solutions (SAT/SMT solvers) to reduce the veriﬁcation cost.
Regarding revision, we focus on revising existing programs to add multitolerance in an
automatic way. Revising the program manually is expensive since it requires additional
veriﬁcation steps to guarantee correctness. Also, manual revision may violate existing re-

quirements. For these reasons, we propose an automatic approach to revise a given program
to add multitolerance for the identiﬁed faults. We investigate the complexity of automatic revision for adding multitolerance. We also develop algorithms (and heuristics) for automatic
revision for adding multitolerance to existing programs. We implement these algorithms
in a model repair tool for automatically adding multitolerance. Additionally, we build a
lightweight framework for automatically revising UML state diagram to add fault-tolerance.
Speciﬁcally, this framework allows designers to revise an existing UML model to add faulttolerance without a detailed knowledge of the formalism behind model repair algorithms.

Copyright by
JINGSHU CHEN
2013

ACKNOWLEDGMENTS
First of all, I would like to thank my advisor, Dr. Sandeep Kulkarni, for the incredible
guidance, support and patience he provided during my Ph.D. life. Sandeep introduced me
to the research area of automatic program revision and veriﬁcation. He was always there for
support whenever I needed some fresh insights about how to improve a paper or a talk.
I am extremely grateful to Dr. Laura Dillon for the valuable guidance from her during
years of my Ph.D. program. I also express my thanks to Dr. Abdol-Hossein Esfahanian and
Dr. Rajesh Kulkarni, for their critical and insightful questions/comments during my defense.
Also, I would like to truly thank the Department of Computer Science and Engineering at
Michigan State University for oﬀering me ﬁnancial support through teaching assistantships
and fellowships. In particular, I express my thanks to Dr. Eric Torng for approving the
department’s ﬁnancial support for my Ph.D. education.
Finally, I would like to use this opportunity to thank all the faculties and friends I met
at Michigan State University.

v

TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . .
LIST OF FIGURES

. . . . . . . . . . . . . . . . . . . . . . . . . . .

x
xii

Chapter 1 Introduction
1.1 Fault-tolerance . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Multitolerance . . . . . . . . . . . . . . . . . . . . . . . . .
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . .
1.3.1 Automatic Revision for Adding Multitolerance . . .
1.3.2 Veriﬁcation of Stabilization . . . . . . . . . . . . .
1.3.3 Using Model Repair to Add Fault-tolerance in UML
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
State Diagram
. . . . . . . . .

.
.
.
.
.
.
.

1
1
3
4
4
5
5
6

Chapter 2 Preliminaries
2.1 Models and Programs . . .
2.2 Speciﬁcation . . . . . . . .
2.3 Faults . . . . . . . . . . .
2.3.1 Fault-Tolerance . .
2.4 Model Revision for Adding

.
.
.
.
.

.
.
.
.
.

8
8
10
12
13
14

. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
Fault-tolerance

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

Chapter 3 Eﬀectiveness of Transition Systems to Model Faults
16
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 A Taxonomy of Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Using Transition Systems to Model Faults . . . . . . . . . . . . . . . . . . . 24
3.3.1 Operational, External, Human-made, Persistent, Malicious and Software Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1.1 Abstract Model . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.1.2
Mapping to Abstract Model - From a Concrete Example . . 26
3.3.1.3 Modeling Variations . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1.4 Modeling Faults from Related Categories . . . . . . . . . . . 30
3.3.1.5 Eﬀect during Veriﬁcation of Fault-tolerance . . . . . . . . . 31
3.3.1.6 Eﬀect during Revision for Adding Fault-tolerance . . . . . . 31
3.3.2 Operational, Internal, Natural, Hardware, Non-malicious, Non-deliberate,
Accidental and Persistent Faults . . . . . . . . . . . . . . . . . . . . . 31
3.3.2.1 Abstract Model . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.2.2 Mapping to Abstract Model - From A Concrete Example . . 32
3.3.2.3 Modeling Variations . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2.4 Modeling Faults from Related Categories . . . . . . . . . . . 34
3.3.2.5 Eﬀect during Veriﬁcation and Revision of Fault-tolerance . . 35
vi

3.3.3

3.4

3.5
3.6

Operational, External, Natural, Hardware, Non-malicious, Non-deliberate,
Accidental and Transient Faults . . . . . . . . . . . . . . . . . . . . . 35
3.3.3.1 Abstract Model . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.3.2 Mapping to Abstract Model- From A Concrete Example . . 36
3.3.3.3 Modeling Variations . . . . . . . . . . . . . . . . . . . . . . 37
3.3.3.4 Modeling Faults from Related Categories . . . . . . . . . . . 37
3.3.3.5 Eﬀect during Veriﬁcation and Revision of Fault-tolerance . 38
3.3.4 Operational, External, Hardware, Non-malicious and Persistent Faults 38
3.3.4.1 Abstract Model . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.4.2 Mapping to Abstract Model- From A Concrete Example. . . 39
3.3.4.3 Modeling Variations . . . . . . . . . . . . . . . . . . . . . . 40
3.3.4.4 Modeling Faults from Related Categories . . . . . . . . . . . 41
3.3.4.5 Eﬀect during Veriﬁcation and Revision of Fault-tolerance . . 41
3.3.5 Development Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Practicability during Veriﬁcation and Revision . . . . . . . . . . . . . . . . . 43
3.4.1 Cost of Modeling Faults during Model Checking . . . . . . . . . . . . 43
3.4.2 Cost of Modeling Faults During Model Revision . . . . . . . . . . . . 45
Relative Completeness with Recent Literature . . . . . . . . . . . . . . . . . 46
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

Chapter 4 Automatic Veriﬁcation of Self-Stabilizing Programs
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1.1 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3 An Approach of Using Symbolic Model Checking to Verify Self-stabilizing
Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.1 Modeling Self-stabilizing Program . . . . . . . . . . . . . . . . . . . .
4.3.2 Case Study 1: K-State Token Ring Program . . . . . . . . . . . . . .
4.3.3 Case Study 2: Ghosh’s Binary Mutual Exclusion Protocol . . . . . .
4.3.4 Case Study 3: Hoepman’s Uniform Ring-orientation Program . . . .
4.3.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.4 Eﬀect of Fairness in Model Checking of Self-Stabilizing Programs . . . . . .
4.4.1 Using Symbolic Model Checking to Verify Self-stabilizing Program Under Unfair Computation . . . . . . . . . . . . . . . . . . . . . . . . .
4.4.1.1 Modeling Self-stabilizing Program under Unfair Computation
4.4.1.2 Case Study 1: K-State Token Ring Program (Cont’d) . . . .
4.4.1.3 Case Study 2: Ghosh’s Binary Mutual Exclusion Protocol
(Cont’d) . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.4.1.4 Case Study 3: Hoepman’s Uniform Ring-orientation Program
(Cont’d) . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.4.1.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.4.2 Utilizing Decomposition to Reduce the Cost of Using Symbolic Model
Checking to Verify Self-stabilizing Program . . . . . . . . . . . . . . .
4.4.2.1 Case Study 4: Huang’s Mutual Exclusion in Uniform Rings
vii

50
50
52
52
53
54
55
57
58
61
61
62
62
64
66
67
75
76
79

4.4.2.2

4.5

4.6

Case Study 5: Self-stabilizing Program based on Raymond’s
Tree algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4.2.3 Other Examples and Approaches for Identifying Components 83
4.4.3 Utilizing Weak Stabilization to Improve Scalability of Model Checking
of Self-stabilizing Program . . . . . . . . . . . . . . . . . . . . . . . . 84
A Constraint-based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5.1 Approach for Verifying Stabilization with SMT Solvers . . . . . . . . 88
4.5.1.1 Veriﬁcation of Closure . . . . . . . . . . . . . . . . . . . . . 89
4.5.1.2 Veriﬁcation of Convergence . . . . . . . . . . . . . . . . . . 89
4.5.1.3 Resolving Ambiguity by Cycles Detection . . . . . . . . . . 91
4.5.1.4 Combining Veriﬁcation of Convergence and Cycle Detection
92
4.5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5.2.1 K-State Token Ring Program . . . . . . . . . . . . . . . . . 94
4.5.2.2 Ghosh’s Binary Mutual Exclusion Protocol . . . . . . . . . 95
4.5.2.3 Stabilizing Tolerant Version of Tree-based Mutual Exclusion
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.5.3 Veriﬁcation of Token Ring in Synchronous Semantics . . . . . . . . . 97
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

Chapter 5 Automatic Revision for Adding Weak Multitolerance
5.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2 Illustrating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2.1 Failsafe-Failsafe Weak Multitolerance . . . . . . . . . . . . . . . . . .
5.2.2 Masking-Masking Weak Multitolerance . . . . . . . . . . . . . . . . .
5.2.3 Failsafe-Nonmasking Weak Multitolerance . . . . . . . . . . . . . . .
5.2.4 Masking-Masking Weak Multitolerance . . . . . . . . . . . . . . . . .
5.3 Complexity Analysis of FF Weak Multitolerance . . . . . . . . . . . . . . . .
5.3.1 Application of Add FF Weakmulti . . . . . . . . . . . . . . . . . . . .
5.4 Complexity Analysis of MM Weak Multitolerance . . . . . . . . . . . . . . .
5.4.1 A Heuristic for MM Weak Multitolerance . . . . . . . . . . . . . . . .
5.4.2 Application of Add MM Weakmulti . . . . . . . . . . . . . . . . . . .
5.5 Complexity Analysis of FM Weak Multitolerance . . . . . . . . . . . . . . . .
5.6 Complexity Analysis of MN Weak Multitolerance . . . . . . . . . . . . . . .
5.7 Complexity Analysis of NN Weak Multitolerance . . . . . . . . . . . . . . . .
5.8 Comparison of Feasibility of Strong Multitolerance and Weak Multitolerance
5.8.1 Feasibility Comparison of FF Strong/Weak Multitolerance . . . . . .
5.8.2 Feasibility Comparison of MM Strong/Weak Multitolerance and MF
Strong/Weak Multitolerance. . . . . . . . . . . . . . . . . . . . . . . .
5.8.3 Feasibility Comparison of MN Strong/Weak Multitolerance and NN
Strong/Weak Multitolerance. . . . . . . . . . . . . . . . . . . . . . . .
5.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.10 The Tool RM 2 : Model Revision for Adding Multitolerance . . . . . . . . . .
5.10.1 Input Program Language . . . . . . . . . . . . . . . . . . . . . . . . .
5.11 Functionality and Output Program . . . . . . . . . . . . . . . . . . . . . . .
viii

104
105
106
107
110
114
118
121
126
128
133
139
142
143
145
145
147
148
149
149
153
154
159

5.11.1 Example 1: Two-Sensors Program . . . . . . . . . . . . . . . . . . . . 159
5.11.2 Example 2: Byzantine Agreement . . . . . . . . . . . . . . . . . . . . 159
5.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Chapter 6 Automatic Revision of UML State Diagrams
166
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.2 Motivating Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.2.1 Need for Model Revision for Tolerating Sensor Failure . . . . . . . . . 170
6.3 An Overview of UML State Diagram . . . . . . . . . . . . . . . . . . . . . . 172
6.4 Framework Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.4.1 Step A: Translating from UML State Diagram to UCM. . . . . . . . . 174
6.4.2 Step B: Generating Fault Actions, Speciﬁcation and Invariants from
Parameters speciﬁed by Designer . . . . . . . . . . . . . . . . . . . . 175
6.4.3 Step C: Model Revision for Adding Fault-tolerance . . . . . . . . . . 178
6.4.4 Step D: Translating the Revised Program in UCM to UML state diagram.182
6.5 Case Study 1: The Adaptive cruise control system . . . . . . . . . . . . . . . 182
6.5.1 Fault-intolerant UML model for ACC . . . . . . . . . . . . . . . . . . 183
6.5.2 Application of Step A: Generating UCM of the ACC System . . . . . 185
6.5.3 Application of Step B: Generating Remaining Inputs for Model Revision187
6.5.4 Application of Step C: Generation of Fault-Tolerant UCM . . . . . . 189
6.5.5 Application of Step D: Generating Fault-tolerant UML model for ACC
System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.6 Case Study 2: The Altitude Switch Controller . . . . . . . . . . . . . . . . . 192
6.6.1 Fault-intolerant UML model for ASW . . . . . . . . . . . . . . . . . . 192
6.6.2 Application of Step A: Generating UCM of the ASW Program . . . . 194
6.6.3 Application of Step B: Generating Remaining Inputs for Model Revision196
6.6.4 Application of Step C: Generation of Fault-Tolerant UCM . . . . . . 198
6.6.5 Application of Step D: Generating Fault-tolerant UML model for ASW
Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.7 Discussion and Lessons Learnt . . . . . . . . . . . . . . . . . . . . . . . . . . 200
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Chapter 7 Related Work
203
7.0.1 Automatic Veriﬁcation of Stabilizing Programs . . . . . . . . . . . . . 203
7.0.2 Automatic Revision for Multitolerant Programs . . . . . . . . . . . . 204
7.0.3 Model Revision of UML State Diagrams for Adding Fault-tolerance . 206
Chapter 8 Conclusion and Future Work
208
8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
BIBLIOGRAPHY

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

ix

LIST OF TABLES
Table 3.1: Fault classiﬁcation (1).

. . . . . . . . . . . . . . . . . . . . . . . . . .

20

Table 3.2: Fault classiﬁcation (2).

. . . . . . . . . . . . . . . . . . . . . . . . . .

21

Table 3.3: Fault classiﬁcation (3).

. . . . . . . . . . . . . . . . . . . . . . . . . .

22

Table 3.4: Cost of modeling faults during model checking. . . . . . . . . . . . . .

44

Table 3.5: Complexity of modeling faults during model revision. . . . . . . . . . .

46

Table 3.6: A classiﬁcation of faults proposed in DSN & ICDCS 2007-2010 (1). . .

47

Table 3.7: A classiﬁcation of faults proposed in DSN & ICDCS 2007-2010 (2). . .

48

Table 4.1: Veriﬁcation results for the k-state program. . . . . . . . . . . . . . . .

57

Table 4.2: Veriﬁcation results for Ghosh’s mutual exclusion program. . . . . . . .

59

Table 4.3: Veriﬁcation results for Hoepman’s ring-orientation program. . . . . . .

61

Table 4.4: Veriﬁcation results for the k-state program. . . . . . . . . . . . . . . .

66

Table 4.5: Veriﬁcation results for Ghosh’s mutual exclusion program. . . . . . . .

67

Table 4.6: Veriﬁcation results for Hoepman’s ring-orientation program. . . . . . .

72

Table 4.7: Veriﬁcation results for Huang’s mutual exclusion program. . . . . . . .

80

Table 4.8: Veriﬁcation results for Raymond-tree based program. . . . . . . . . . .

83

Table 4.9: Veriﬁcation cost of weak stabilization vs. stabilization (1). . . . . . . .

86

Table 4.10: Veriﬁcation cost of weak stabilization vs. stabilization (2). . . . . . . .

86

Table 4.11: Veriﬁcation cost of weak stabilization vs. stabilization (3). . . . . . . .

87

Table 4.12: Veriﬁcation cost of weak stabilization vs. stabilization (4). . . . . . . .

87

x

Table 4.13: Veriﬁcation cost of weak stabilization vs. stabilization (5). . . . . . . .

87

Table 4.14: Veriﬁcation time for Ψv for token ring with unbounded variables. . . .

94

Table 4.15: Veriﬁcation time for Ψv for token ring with bounded variables. . . . .

95

Table 4.16: Veriﬁcation time for Ψv for token ring with split actions for K0 . . . .

95

Table 4.17: Veriﬁcation results for Ghosh’s program using SMT solver. . . . . . . .

96

Table 4.18: Veriﬁcation results for Raymond tree-based program. . . . . . . . . . .

96

Table 4.19: Veriﬁcation results for token ring under synchronous semantics. . . . .

98

Table 4.20: Veriﬁcation result for cycle detection. . . . . . . . . . . . . . . . . . . 100

xi

LIST OF FIGURES
Figure 4.1: k-state program under unfair computation.

. . . . . . . . . . . . . .

65

Figure 4.2: Ghosh’s mutual protocol under unfair computation (1).

. . . . . . .

68

Figure 4.3: Ghosh’s mutual protocol under unfair computation (2).

. . . . . . .

69

Figure 4.4: Ghosh’s mutual protocol under unfair computation (3).

. . . . . . .

70

Figure 4.5: Hoepman’s ring program under unfair computation (1).

. . . . . . .

70

Figure 4.6: Hoepman’s ring program under unfair computation (2).

. . . . . . .

71

Figure 4.7: Hoepman’s ring program under unfair computation (3).

. . . . . . .

72

Figure 4.8: Hoepman’s ring program under unfair computation (4).

. . . . . . .

73

Figure 4.9: Hoepman’s ring program under unfair computation (5).

. . . . . . .

74

Figure 4.10: Hoepman’s ring program under unfair computation (6).

. . . . . . .

75

Figure 4.11: The algorighm for determing stabilization. . . . . . . . . . . . . . . .

93

Figure 5.1: Model revision for adding FF weak multitolerance.

. . . . . . . . . . 124

Figure 5.2: Mapping from an instance of the SAT problem. . . . . . . . . . . . . 129
Figure 5.3: Model revision for adding MM weak multitolerance.

. . . . . . . . . 136

Figure 5.4: The recovery algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . 137
Figure 5.5: Model revision for adding MN weak multitolerance.

. . . . . . . . . 144

Figure 5.6: A case of FF Weak multitolerance (not Strong). . . . . . . . . . . . . 148
Figure 5.7: Input ﬁle of two-sensors program.
Figure 5.8: Output of two-sensors Program.
xii

. . . . . . . . . . . . . . . . . . . 160
. . . . . . . . . . . . . . . . . . . . 161

Figure 5.9: Program actions of the Byzantine agreement program. . . . . . . . . 162
Figure 5.10: Fault actions of the Byzantine agreement program. . . . . . . . . . . 163
Figure 5.11: Output of two-sensors program. . . . . . . . . . . . . . . . . . . . . . . 164
Figure 6.1: Logic design of the ACC program. . . . . . . . . . . . . . . . . . . . . 170
Figure 6.2: A case to illustrate modeling program in UML state diagram. . . . . 172
Figure 6.3: Model revision for adding fault-tolerance. . . . . . . . . . . . . . . . . 181
Figure 6.4: ACC system modeled in UML state diagram.
Figure 6.5: Annotation in formal expression.

. . . . . . . . . . . . . 184

. . . . . . . . . . . . . . . . . . . . 184

Figure 6.6: The revised ACC program in UML state diagram. . . . . . . . . . . . 191
Figure 6.7: UML state diagram of the ASW program. . . . . . . . . . . . . . . . 193
Figure 6.8: the ASW program with formal annotation. . . . . . . . . . . . . . . . 194
Figure 6.9: The revised ASW program in UML state diagram. . . . . . . . . . . . 199
Figure 6.10: The stepwise procedure of MR4UM.

xiii

. . . . . . . . . . . . . . . . . . 201

Chapter 1
Introduction

1.1

Fault-tolerance

The past decade has witnessed the increasing deployment of software systems in our daily
lives. Every call we receive, every email we send, every bank transaction we request, operating systems used for oﬃce and study, adaptive cruise control in vehicles, all rely on software
systems. Ensuring the reliability of software systems is more critical than ever before.
Fault-tolerance is crucial to the reliability of software systems. Generally, there are two
main requirements in providing fault-tolerance. The ﬁrst requirement is that the program
should preserve its safety properties in the presence of faults. The second is that the program
should recover from faults so that its subsequent computation is correct. That is, the program
should meets its safety and livenss properties after recovery. Intuitively, safety states that
nothing bad ever happens in program computations, and liveness states that something good
will eventually occur in every program computation. When both of these requirements are
met in the presence of faults, we denote the corresponding program as masking fault-tolerant.
1

While masking fault-tolerance is ideal, due to feasibility and/or cost issues, one may
choose to provide a weaker level of tolerance. One weaker level of fault-tolerance is nonmasking. In this level, the program provides recovery but may violate safety during recovery. Nonmasking fault-tolerance is desirable when the design of masking fault-tolerance
is either expensive or impossible. For example, in [46], authors provide nonmasking faulttolerance to memory safety bugs in Neutron, a version of the TinyOS operating system [147].
In this example, while one could technically design a masking fault-tolerant system, it is
very expensive in terms of human eﬀort. Other examples include algorithms for clock
synchronization [151, 191], where clock values could be corrupted by faults, such as node
failure, random re-starts and initial lack of synchronization. In these examples, it is impossible or expensive to guarantee the safety property (e.g., clock drift is always limited).
Hence, nonmasking fault-tolerance is preferred so that the program will eventually recover
to states where clocks remain synchronized. Other examples of nonmasking fault-tolerance
include [47] [78] [215] [214].

Another weaker level of fault-tolerance is failsafe. In this level, the program always satisﬁes its safety properties in the presence of faults, but it may not satisfy its liveness properties.
Failsafe fault-tolerance is applicable in situations where safety is much more important than
liveness (e.g., safety-critical systems). Since liveness is not guaranteed, implementation costs
of failsafe fault-tolerance are typically lower than those of masking fault-tolerance. Failsafe
fault-tolerance is also utilized in systems at the component level, e.g., one may choose to
ensure that in case of faults, a component guarantees its own safety constraints although it
may not satisfy its liveness constraints. Upon noticing this, other components could ensure
that safety and liveness are satisﬁed for the overall system. Examples of such approach
2

include [222], where the authors present a mechanism to prevent a single faulty node from
monopolizing the communication bus in a distributed hard real-time system. In this example, failsafe fault-tolerance is imposed to enforce fail-silent behavior of the node. Other
examples of failsafe systems include [166] [204] [106] [119].

1.2

Multitolerance

A program is often subject to multiple faults. Moreover, the level of expected tolerance to
these faults is diﬀerent. To illustrate this, consider a simple networking system, in which
nodes communicate with each other by message passing. Two possible faults, message loss
and node failure could aﬀect the program. The program is designed to mask the occurrence
of message loss, i.e., ensure that the program continues to satisfy its speciﬁcation even if
messages are lost. However, for more serious faults, e.g., node failure, the program may
only provide nonmasking fault-tolerance, where the program eventually reorganizes itself to
legitimate states while some other properties (e.g., safety properties, satisfaction of requests
generated during recovery) may not be met during recovery.
Unfortunately, there exist situations where multiple faults occur simultaneously. By simultaneous, we mean that a fault from one class occurs before the system has recovered from
a fault from another class. Considering the above example, clearly, if the program provides
nonmasking fault-tolerance for node failure, it cannot guarantee masking fault-tolerance if
message loss occurs during recovery from node failure. Hence, when these two faults occur
simultaneously, it follows that the best one can do is to ensure that the program eventually recovers to states from where it satisﬁes its speciﬁcation. In other words, the tolerance
provided for the case where faults from both classes occur simultaneously is equal to the
3

‘minimum’ level of fault-tolerance provided to each class of faults. We denote such multitolerance as strong multitolerance. (See Chapter 5.8 for a precise deﬁnition.) In the above
example, a strongly multitolerant program would ensure that nonmasking fault-tolerance
would be provided if node failure and message loss occur simultaneously.
Instead of guaranteeing the minimum level of fault tolerance, another possible solution
is to ensure that the program provides an appropriate level of fault-tolerance if faults from
only one class occur. However, it may not provide any tolerance if faults from two classes
occur simultaneously. This scenario is possible when recovery from node failure requires that
message delivery is reliable and no messages are lost. We denote such multitolerance as weak
multitolerance.

1.3

Contributions

The goal of this dissertation is to build a model-based framework that helps to provide
multitolerance in an automatic way. To achieve this, our work addresses two important
problems: (1) automatic program revision for adding multitolerance, and (2) eﬃcient veriﬁcation of multitolerance. Next, we will give a brief overview of the key contributions of this
dissertation.

1.3.1

Automatic Revision for Adding Multitolerance

It is desirable to revise existing program designs for adding multitolerance when faults are
newly identiﬁed. Manual revision is expensive because (1) manual revision requires another
round of veriﬁcation to ensure that the identiﬁed errors are ﬁxed, (2) manual revision may
not be successful and may introduce new human-induced errors to the existing program
4

design. It is a tedious and expensive task to repeat veriﬁcation and manual revision if the
veriﬁcation step does not succeed. This motivates the need for automating the revision
phase. We address the problem of automatic program revision for adding multitolerance.
We characterize multitolerance and diﬀerentiate our proposed multitolerance from previous
approaches. We investigate the complexity of automatic addition for multitolerance, and
develop novel algorithms and heuristics for automatic addition of multitolerance to existing
programs.

1.3.2

Veriﬁcation of Stabilization

We also investigate the problem of verifying self-stabilization, which is the ability to recover
from arbitrary state. Most of literature on veriﬁcation of fault-tolerance focuses on guaranteeing that a program satisﬁes its safety properties, our work complements the existing
literature by considering liveness properties. Thus, we envision veriﬁcation of multitolerant
programs by using existing approaches for verifying safety and using the results from this
dissertation for verifying liveness. One key challenge in veriﬁcation is how to avoid state
space explosion. The problem of ‘state explosion’ is worsened in veriﬁcation of stabilizing
programs, because verifying stabilization requires considering all possible states.

1.3.3

Using Model Repair to Add Fault-tolerance in UML State
Diagram

State diagrams in the Uniﬁed Modeling language (UML) are used to describe the states as
well as the transitions between states of an object in the software system. Hence, UML
state diagrams is widely used in modeling program behaviors for helping programmers to
5

better understand the intricacies of their codes. Since the language of state diagrams is
not a computational one, using model repair to automatically revise the state diagram is a
challenging task. Furthermore, the use of the model repair technique requires the knowledge
of logic and automated reasoning, which makes it harder for designers to apply the model
repair technique in some degree. Thus it is desirable if the details of model repair are hidden
from designers so that they can enjoy the beneﬁts of applying model revision and continue
to work with a familiar language to specify the artifacts of the system.
With this motivation, we propose a lightweight repair framework for UML state diagrams,
which takes a UML state diagram as input, utilizes the techniques of model repair to revise
UML state diagram, and produces a fault-tolerant UML state diagram. Complicated details,
such as the model repair process, and the transformation between UML state diagram and
underlying computing models, are hidden from users in the framework. This feature reduces
the learning curve for applying model repair in UML state diagram to add fault-tolerance.
To demonstrate the feasibility of this framework, we applied the proposed approach in two
scenarios, including the classic problem of Byzantine agreement and the adaptive cruise
control systems in vehicles.

1.4

Organization

The rest of this dissertation is organized as follows. Chapter 2 is dedicated to formalizing the
preliminary concepts which are used through this dissertation. Chapter 3 investigates the
eﬀectiveness of transition systems to model faults. Chapter 4 presents our approaches for
automatic veriﬁcation of multitolerant programs. Chapter 5 presents our work on automatic
revision for adding multitolerance. Chapter 6 introduces our work on using the model repair
6

technique to revise UML state diagram for adding fault-tolerance. Finally, in Chapter 7, we
provide concluding remarks and discuss future work.

7

Chapter 2
Preliminaries
This chapter is devoted to formalizing the fundamental concepts through this dissertation,
namely, programs, speciﬁcation, faults, and fault-tolerance. The notion of programs is
adapted from Kulkarni and Arora [134]. The deﬁnition of speciﬁcation is due to Alpern and
Schneider [7]. The deﬁnition of faults and fault-tolerance are from Arora and Gouda [17],
Kulkarni [133] and Bonakdarpour [35].

2.1

Models and Programs

This section formally deﬁnes models and programs. A model is described by an abstract
program.
A program, P, is described using a ﬁnite set of variables VP = {v0 , v1 , . . . , vn }, and a
ﬁnite set of program actions AP = {ac0 , ac1 , . . . , acm }, where n ≥ 0 and m ≥ 0. Each
variable, vi ∈ VP , is associated with a ﬁnite domain of values, Di . Each action, aci ∈ AP ,
is deﬁned as follows: aci :: gi −→ sti ; where gi is a Boolean formula involving program
variables and sti is a statement that updates a subset of program variables.
8

For such a program, we deﬁne the notion of state, state space and state predicate.
Deﬁnition 2.1.1. (State) A state, s, of program P assigns each variable vi ∈ VP a value
from its respective domain Di .
Deﬁnition 2.1.2. (State space) The state space, SP , of a program P is the set of all
possible states of P.
Deﬁnition 2.1.3. (State predicate) A state predicate of P is a Boolean expression deﬁned
over the program variables VP . Thus, a state predicate C of P identiﬁes the subset, SC ⊆
SP , where C is true in a state s iﬀ s ∈ SC .
Deﬁnition 2.1.4. (Enabled) The action ai :: gi → sti , is enabled in a state s iﬀ gi is
true in s.
Observe that an action ac in a program corresponds to a set of transitions (s0 , s1 ), where
s0 is the initial state and s1 is the next state that is obtained by executing the statement of
the action that is enabled in s0 .
Deﬁnition 2.1.5. (Computation) A computation of P is a ﬁnite or inﬁnite state sequence: σ = s0 , s1 , . . . s.t. the following conditions are satisﬁed: (1) ∀j : 0 < j <
lengthof (σ) : (sj−1 , sj ) is a transition of P, (2) if σ is ﬁnite and terminates in sf then
there does not exist any state s such that (sf , s) is a transition of P.
A program that is described in terms of its variables VP and actions AP can alternatively
be viewed in terms of its state space and its transitions. In particular, VP and corresponding
domain of each variable identiﬁes the state space, SP . And, each action ac in AP is a
subset of SP × SP . Since the subsequent deﬁnitions of speciﬁcation, fault-tolerance and
9

multitolerance do not consider the structure of the program in terms of its variables and
actions, we deﬁne program in an abstract fashion in terms of its state space and transitions.
Deﬁnition 2.1.6. (Program) The program P is deﬁned as the tuple SP , ψP , where SP
is the state space, ψP ⊆ SP × SP .
Remark 2.1.1. In the above deﬁnition, ψP is the transitions of P. In most cases, we need
to refer to transitions of the program as state space is obvious from the context. Hence, we
use P and its transitions ψP interchangeably.
Remark 2.1.2. In describing the program in subsequent chapters, for ease of understanding,
we will describe it in terms of its variables and actions. The conversion from this notation
to state space and transitions is straightforward.
Remark 2.1.3. In our work on veriﬁcation of multitolerant programs, we need a more
detailed structure of the program transitions. Speciﬁcally, we need to split transitions ψP into
set of transitions that correspond to individual actions, ac1 , ac1 , . . . , such that ac1 ∪ac2 ∪. . .
= ψP .

2.2

Speciﬁcation

In this section, we formally present the notions of speciﬁcation and the related concepts that
are used in this work.
Deﬁnition 2.2.1. (Safety Speciﬁcation) The safety speciﬁcation for program P is speciﬁed as a set of bad transitions [133], i.e., for program P, its safety speciﬁcation is a subset
of SP × SP .
10

We say a transition (s0 , s1 ) violates the safety speciﬁcation sspec iﬀ (if and only if) (s0 , s1 ) ∈
sspec . A sequence σ

=

s0 , s1 , . . . satisﬁes sspec iﬀ ∀j : 0 < j < lengthof (σ) :

(sj−1 , sj ) ∈ sspec.
/
Deﬁnition 2.2.2. (Liveness Speciﬁcation) A liveness speciﬁcation is speciﬁed in terms
of a set of inﬁnite sequences.
A sequence σ = s0 , s1 , . . . satisﬁes a liveness speciﬁcation lspec iﬀ some suﬃx of σ is in
lspec.
A speciﬁcation spec for program P is of the form sspec, lspec , where sspec is a safety speciﬁcation of P, and lspec is a liveness speciﬁcation of P. A sequence satisﬁes the speciﬁcation
spec iﬀ it satisﬁes the corresponding safety and liveness speciﬁcation. Hence, for ease of
understanding, we say that a speciﬁcation is an intersection of a safety speciﬁcation and a
liveness speciﬁcation.
We now deﬁne what it means for a program P to satisfy a speciﬁcation.
Deﬁnition 2.2.3. (Satisﬁes) Let P = SP , ψP be a program, S be a state predicate, and
spec be a speciﬁcation for P. We write P |=S spec and say that P satisﬁes spec from S iﬀ
(1) S is closed in ψP , and (2) every computation of P that starts from a state in S satisﬁes
spec.
Assumption 2.2.1. For simplicity of subsequent deﬁnitions, if P satisﬁes spec from S, we
assume that P includes at least one transition from every state in S. If P does not include
a transition from state s then we add the transition (s, s) to P. Note that this assumption
is not restrictive in any way. It simpliﬁes subsequent deﬁnitions, as one does not have to
model terminating computations explicitly.
11

Deﬁnition 2.2.4. (Invariant) Let P = SP , ψP be a program, S be a state predicate,
and spec be a speciﬁcation for P. If P |=S spec and S = {}, we say that S is an invariant
of P for spec.
Whenever the speciﬁcation is clear from the context, we shall omit it; thus, “S is an invariant
of P” abbreviates “S is an invariant of P for spec”. Note that Deﬁnition 2.2.3 introduces
the notion of satisfaction with respect to computations. In case of computation preﬁxes
that are not necessarily maximal, we characterize them by determining whether they can be
extended to an inﬁnite computation that satisﬁes the speciﬁcation.

Deﬁnition 2.2.5. (Maintains) Program P maintains spec from S iﬀ (1) S is closed in
ψP , and (2) for all computation preﬁxes α of P starting in S, there exists a computation
suﬃx β such that αβ ∈ spec. We say that P violates spec from S iﬀ P does not maintain
spec from S.

We note that if P satisﬁes spec from S then P maintains spec from S as well, but the
reverse direction does not hold. We, in particular, introduce the notion of maintains for
computations that a (fault-intolerant) program cannot produce, but the computation can be
extended to one that is in spec by adding recovery (see Section 2.3 for details).

2.3

Faults

Each class of fault f that a program is subject to is systematically represented by a set of
transitions. Formally, a fault class for program P = SP , ψP is a subset of SP × SP . The
eﬀectiveness of this representation is presented in Chapter 3.
12

Deﬁnition 2.3.1. (Fault-span) A state predicate T is an f -span (read as fault-span for
f ) of P = SP , ψP from S iﬀ the following conditions are satisﬁed: (1) S ⊆ T , and (2) T
is closed in ψP ∪ f .
Observe that for all computations of P that start from states in S, T is a boundary in the
state space of P up to which (but not beyond which) the states of P may be perturbed by
the occurrence of the transitions in f . Subsequently, as we deﬁned the computations of P,
one can deﬁne computations of program P in the presence of faults f by simply substituting
ψP with ψP ∪ f in Deﬁnition 2.1.5.

2.3.1

Fault-Tolerance

We now deﬁne what it means for a program to be failsafe/nonmasking/masking f -tolerant
(read as fault-tolerant to fault class f ).
Deﬁnition 2.3.2. (Masking f-tolerance) A program P is masking f -tolerant from S for
spec, iﬀ the following conditions hold:
1. P |=S spec;
2. There exists T such that:
(a) T is an f -span of P from S;
(b) SP , ψP ∪ f maintains spec from T ;
(c) Every computation of SP , ψP that starts from a state in T eventually reaches
a state of S.
Thus, if program P is masking f -tolerant from S for spec then S is closed in ψP and
every computation of P that starts from a state in S satisﬁes spec in the absence of faults.
13

Additionally, in the presence of faults, there is a fault-span predicate T (S ⊆ T ) that is
closed in ψP ∪ f .
Deﬁnition 2.3.3. (Failsafe f-tolerance) A program P is failsafe f -tolerant from S for
spec, iﬀ conditions 1, 2a and 2b in Deﬁnition 2.3.2 hold.
Deﬁnition 2.3.4. (Nonmasking f-tolerance) A program P is nonmasking f -tolerant
from S for spec, iﬀ conditions 1, 2a and 2c in Deﬁnition 2.3.2 hold.
Notation. Whenever the speciﬁcation spec and the invariant S are clear from the context,
we omit them; thus, “f -tolerant” abbreviates “f -tolerant from S for spec”.

2.4

Model Revision for Adding Fault-tolerance

In this section, we deﬁne the model revision problem in the context of adding fault-tolerance.
Problem 2.4.1. The Model Revision Problem for Adding Fault-tolerance
Given a program P, a safety speciﬁcation φ, an invariant I of P from where P satisﬁes
φ and a set of fault actions F : Does there exist a P ′ with an invariant I ′ such that
• (C1) I ′ ⊆ I,
• (C2) (s0 , s1 ) ∈ P ′ ∧ s0 ∈ I ′ ⇒ (s0 , s1 ) ∈ P, and
• (C3) P ′ is fault tolerant to F from I ′ for φ.
Since the goal of this problem is to add fault-tolerance, the revision for adding faulttolerance is not permitted to add new behaviors in the absence of faults. To meet this
requirement, we include two constraints C1 and C2. Speciﬁcally, constraint C1 states that
14

I ′ is a subset of I. If C1 is not true then it implies that the fault-tolerant program could
begin in a state from where the original fault-intolerant program violates its speciﬁcation.
Hence, we cannot conclude correctness of the program behavior (in the absence of faults) if
it starts from a state in I ′ − I. Thus, constraint C1 is required. Likewise, if P ′ has new
transitions in I ′ then it would imply that P ′ could have behaviors that are not in P. In
other words, the computation of P ′ may generate new ways to satisfy spec in the absence of
faults. Hence, constraint C2 is required.
Based on the above problem statement, we can view transitions of P ′ in two parts:
(1) transitions of P that are preserved and (2) new transitions that are added to provide
recovery from faults. Furthermore, since a transition in a UML model may correspond to
several transitions in the underlying computation model. For this reason, the ﬁrst part can
be subdivided into transitions that are preserved as is and transitions that are preserved in
part, i.e., they are restricted to execute under certain conditions. Thus, the transitions of
the fault-tolerant program can be partitioned into three types:
1. Original Transitions To . To corresponds to transitions that are preserved as is in the
fault-tolerant program.
2. Strong Transitions Ts . Ts corresponds to transitions of the original program that are
restricted to execute under certain constraints. In other words, these transitions are
strengthened version of the original actions.
3. Recovery Transitions Tr . Tr does not have the counterpart actions in the input program. Tr is added by model revision to provide recovery in the presence of faults.

15

Chapter 3
Eﬀectiveness of Transition Systems to
Model Faults
In this chapter, we present an overview of our fault modeling approach, that is, using transition systems. In particular, we focus on whether/how faults from various categories can
be modeled using transition systems. We also discuss the practicability of such modeling
approach in two contexts: model checking and model repair, which are important techniques
for automatic veriﬁcation and revision. Besides, we evaluate the completeness of using transition systems to model faults introduced from recent literatures.

3.1

Motivation

This research has been motivated in part due to questions raised by readers and reviewers
about fault modeling approach used in our (and other similar) work. In particular, the formal
methods community typically takes this approach for granted (e.g., [57, 85, 90, 140, 240]).
However, some researchers argue that this approach would have limited application with
16

respect to diﬀerent classes of faults. Our goal in this chapter is to answer questions such
as these and identify feasibility of modeling fault-tolerant systems using transition systems.
In particular, we focus on three questions. The ﬁrst question deals with the feasibility and
limitation of using transition systems to model faults. Speciﬁcally, this includes the following
concerns.
• The fault deﬁnition in Section 2.3 assumes that faults are a set of new transitions that
are added to the given program. How can faults such as stuck-at faults be modeled
as added transitions that perturb the given program, since stuck-at faults intuitively
remove transitions from the original program. The same question is also extended to
other types of faults such as node failure since these faults do not seem to add any new
transitions to the original program.
• How can faults such as buﬀer overﬂow be modeled in terms of new transitions, as these
faults are essentially (intentional or unintentional) bugs in the software.
• A permanent fault such as byzantine fault causes a process to behave arbitrarily. Moreover, this fault is permanent in that the malicious process can perturb the program
several times. How can such faults be modeled using added transitions?
• How to model faults such as physical degradation, which are continuous in nature, and
hence, may not suitable for being modeled as discrete transition systems.
The second question regarding fault modeling is: Even if it is possible to model faults
using transition systems, is it appropriate and reasonable. Speciﬁcally, this question compares
the complexity of veriﬁcation of a fault-intolerant program with the complexity of veriﬁcation
of the corresponding fault-tolerant program (or the complexity of the corresponding repair
17

for adding fault-tolerance) that uses transition systems to model faults. The answer to this
question will identify circumstances where modeling of faults in terms of transition systems
may be feasible but expensive thereby making it diﬃcult (or impossible) to utilize it.
The third question regarding fault modeling is: How can we know if we have represented
all types of faults that may aﬀect a computer system? This question is crucial since it
characterizes the completeness of transitions systems to be used to model faults. While
answering this question is beyond the scope of formal methods, we can consider this question
in the context of a fault classiﬁcation for practical systems. In particular, we can begin with
a classiﬁcation of faults from the perspective of practitioners and then evaluate whether
faults from those models can be eﬀectively represented using transition systems.
Based on these questions, in this work, we focus on the feasibility and practicality of modeling faults as transition systems. We begin with the third question. Speciﬁcally, we utilize
the classiﬁcation from seminal paper by Avizienis et al. [24]. This classiﬁcation provides
diﬀerent causes of faults and their eﬀects. Since this classiﬁcation is based on practitioner’s
viewpoint, it provides the basis for answering the ﬁrst two questions. Speciﬁcally, in this
work, we focus on how/if each category of fault identiﬁed here can be modeled in terms of
transitions. From the point of view of formal methods, it is necessary to model the eﬀect of
the fault as opposed to the cause of the fault. Hence, our work will focus on how eﬀect of
faults from each category can be modeled.
Additionally, we also evaluate the complexity introduced by modeling faults as transition
systems. We consider the complexity in terms of two objectives: model checking and model
revision. Speciﬁcally, model checking [110, 174] is one of the most successful strategies for
providing assurance for model of hardware and software design. It focuses on deciding
18

whether a given model of system, say M satisﬁes the given property pr. Since model checking
computes (directly or indirectly) all computations of M to determine whether pr is satisﬁed,
it is especially useful in providing assurance about a system developed from that model. The
related problem of model revision [36,37] focuses on scenarios where model checking produces
a counterexample or where an existing model needs to be revised to add new properties (such
as safety, liveness and timing constraints). Thus, the goal in model revision is to modify the
given model M so that it satisﬁes the given property pr. Since the revised model is correct
by construction, it can assist us in obtaining a correct model of system when model checking
ends up ﬁnding a counterexample.
We view the fault modeling in online setting where faults occur during system execution.
In oﬄine setting, there is often no need to model faults explicitly. This is due to the fact that
in the oﬄine setting, faults have already occurred at the beginning and the goal of the system
is to provide acceptable service even in the presence of faults. Hence, oﬄine setting often
requires us to model a degraded version of the original system itself rather than modeling
faults. In online setting, however, system may initially execute without faults. Then, one or
more faults could occur and change subsequent system behavior. Hence, in online setting, it
is often necessary to model faults explicitly.

3.2

A Taxonomy of Faults

In this section, we recall the terminology of fault classiﬁcation from [24]. The classiﬁcation
is based on eight basic viewpoints about faults. Table 3.1, 3.2 and 3.3 illustrate this classiﬁcation as well as results in this chapter about the ability and eﬀect of modeling faults from
the respective category.
19

Fault Characteristics
Phase
System Boundaries
Phenomenological Cause
Dimension
Objective
Intent
Capability
Persistence

Development
Operational
Internal
External
Human-made
Natural
Software
Hardware
Non-Malicious
Malicious
Non-Deliberate
Deliberate
Accidently
Incompetence
Persistent
Transient

Fault Characteristics
Phase
System Boundaries
Phenomenological Cause
Dimension
Objective
Intent
Capability
Persistence

Development
Operational
Internal
External
Human-made
Natural
Software
Hardware
Non-Malicious
Malicious
Non-Deliberate
Deliberate
Accidently
Incompetence
Persistent
Transient

2
√

3
√

√

√

√

√

√

√

√

√

√

√

√

√

√

√

√

√

√

√

√

√

√

√
√

√
√

√
√

√

7
√

8
√

√

√

√

√

√

√

√

√
√

√
√

√

√

√

√

√

√

√

√
√

√

√

√

Categories
10 11 12 13 14
√ √
√ √ √
√ √ √ √
√
√ √ √ √ √

√
√

√
√

√

√

9
√
√

√
√

√
√

√
√
√
√
√

Table 3.1: Fault classiﬁcation (1).

20

Categories
4 5 6
√ √ √

1
√

√
√
√
√
√

√
√

√
√

15 16
√

√

√
√

√
√

√
√

√
√

√
√

√
√

√

√

√

√

√

√

√

√

√

√

√

√

Categories
17 18 19 20 21 22 23 24

Fault Characteristics
Phase
System Boundaries
Phenomenological Cause
Dimension
Objective
Intent
Capability
Persistence

Development
Operational
Internal
External
Human-made
Natural
Software
Hardware
Non-Malicious
Malicious
Non-Deliberate
Deliberate
Accidently
Incompetence
Persistent
Transient

System Boundaries
Phenomenological Cause
Dimension
Objective
Intent
Capability
Persistence

√

√

√

√

√

√

√

√
√

√
√

√
√

√
√

√
√

√
√

√
√

√
√

√
√
√
√
√

Development
Operational
Internal
External
Human-made
Natural
Hardware
Software
Non-Malicious
Malicious
Non-Deliberate
Deliberate
Accidently
Incompetence
Persistent
Transient

√
√

√
√

√
√

√

√

√

√

√

√

√
√

√

√

√

√

√

√
√

√

√

-

√

√
√

√

√

√

√

√

√

√

√

√

√

√
√

√
√

√
√

√
√

√
√

√
√

√
√

√

√

√

√

√
√

√
√

√
√

√
√

√
√

√

√

√

√

√

√

√

-

√
√

√

√
√

√

√
√

Table 3.2: Fault classiﬁcation (2).

21

√

√
√

Categories
25 26 27 28 29 30 31

Fault Characteristics
Phase

√

√

√

√
√

-

Section

3.3.5

3.3.2

3.3.4

3.3.1
3.3.3

Eﬀect in Model
Veriﬁcation and
Categories
Revision
Fault prevention is more
suitable.
No need to model faults
1-11
Explicitly.
Signiﬁcant increase in reachable
states.
Need for modeling continuous
behavior.
12-13
Small increase in reachable
states.
Potential use for modeling blind
14,17,
writes.
20, 27, 30
Small increase in reachable
states.
Need for modeling read/
5,6,
write restriction.
22-25
15-21,
Small increase in reachable
26-31
states.

Feasibility
Practicability

Not feasible

Feasible
Not Practical
Feasible
Practical
Feasible
Practical
Feasible
Practical

Table 3.3: Fault classiﬁcation (3).
Next, we brieﬂy describe the viewpoints considered in this classiﬁcation.
One viewpoint is based on how the fault occurs. In this viewpoint, there are two possibilities: development faults and operational faults. The former corresponds to the case
where fault occurs due to mistakes during development. This category includes faults such
as buﬀer overﬂows, incorrect results of certain ﬂoating point division caused by Pentium
FDIV bug etc. The latter corresponds to the case where fault occurs while the system is
deployed.
The second viewpoint is based on whether the fault occurs inside system boundary (internal faults) or whether it occurs outside system boundary, i.e., in the environment (external faults). The former corresponds to faults such as physical deterioration of brakes in
the vehicle and logic bomb. And, the latter corresponds to faults such as bit ﬂip in memory
caused by cosmic ray and wrong parameter conﬁguration.
22

The third viewpoint is based on the cause of the fault. In this viewpoint, a fault can be
either natural fault or human made fault. The former corresponds to random events
that may occur naturally. Examples of such faults include internet and telecoms connectivity disrupted by Taiwan earthquake. The latter corresponds to (intentional or otherwise)
mistakes caused by humans. Examples of such faults include development failure of the AAS
system [181].
The fourth viewpoint is based on how the fault aﬀects the system. In this viewpoint, a
fault can be either hardware fault or a software fault. Examples of former include loss
of network switch and deﬂated car tie whereas examples of latter include Y2038 problem of
software system and Trojan horses.
The ﬁfth viewpoint is based on the objective of the fault. In this viewpoint, the fault can
be either a malicious fault or a non-malicious fault. In the former case, the goal of the
(human responsible for the) fault is to intentionally disrupt the system execution. Examples
of such faults include attacker and worm. In the latter case, there is no malicious objective.
Examples of such faults include physical deterioration and heating/cooling caused by natural
environments.
The sixth viewpoint is based on in the intent of the fault. In this viewpoint, the fault
can be either a deliberate fault or a non-deliberate fault. The former case is due to
bad decisions. Examples of such faults include wrong conﬁguration that can aﬀect security,
networking, storage, middleware, etc [96]. The latter one is caused by mistakes. Examples
of such faults include software ﬂaws and physical production defects.
The seventh viewpoint is based on whether the fault is caused accidently or due to incompetence. And, ﬁnally, the eighth viewpoint is based on whether the fault is a transient
23

fault that occurs occasionally and does not persist for a long duration. Or a persistent
fault where the fault persists for a long time (possibly forever).

3.3

Using Transition Systems to Model Faults

In this section, we discuss modeling of faults from diﬀerent categories in Table 3.1 - 3.3.
Based on the characteristics of the faults, we partition our discussion among Sections 3.3.13.3.5. For each section, we identify the abstract model of fault from one category ﬁrst. Then
we introduce how to mapping to the abstract model with concrete example. We also discuss
variations of the same fault that are considered in the literature and evaluate its eﬀect on
modeling that fault. Subsequently, we identify related fault categories, i.e., fault categories
where the modeling would be similar. Finally, we identify the eﬀect of such modeling in
veriﬁcation and revisions.

3.3.1

Operational, External, Human-made, Persistent, Malicious
and Software Faults

In this section, we focus on faults that are operational, external, human made, persistent,
malicious and software. (This corresponds to category 24 in Table 3.2.) Thus, these faults
occur during the system execution. Generally, they are caused by malicious users or attackers. And they persist permanently (or long enough time). It is expected that the system
will continue to function even when the faults are present. While some attempts may be
made to prevent such faults from occurring, it is possible that such preventions would not be
fully successful. Hence, for assurance in the context of these faults, it is necessary that one
considers the system execution where the faults (exhibited in terms of compromised hosts,
24

malicious users or attacks) continue to occur potentially frequently. Often, in such systems,
assumptions are made about the number of faults that may exist at a given time. These
assumptions ensure that suﬃciently many ‘good resources’ are available to solve the problem
at hand.

3.3.1.1

Abstract Model

To capture the impact that this type of fault has on the underlying system state, there are
three types of abstract actions in the model for fault from this category.
• Access Actions. These actions allow user to get knowledge about the current system
state. An example of access action is that an user get the data from remote server
and make a local copy. We use variable v to denote user’s copy of system state and
variable s to denote the value of the current system state. Hence this type of actions
can be modeled as follows:

ACCESS action:
v := s ;

• Update Actions. These actions create a change to the system state. Examples of update
actions include writing data to ﬁle in the server and updating value of a ﬂag which is
used to denote whether a ﬁle is changed. We use variable v to denote the system state
and x to denote the value which is used to update system state by UpdateActions.
Hence this type of actions can be modeled as follows:

Update action:
25

v := x ;

• Fault Actions. These actions, (which may be conducted by malicious users or attackers), perturb the system to a random state. An example of a fault action is that a
malicious user changes the value of accessible data randomly. We introduce variable
v to denote the state which is corrupted by fault actions. We also introduce an extra variable m to model whether malicious user exists in the environment. The fault
actions can be modeled as follows:
Fault action 1:
true −→ m := true;
Fault action 2:
m

3.3.1.2

−→ v := random();

Mapping to Abstract Model - From a Concrete Example

Now, we discuss how to mapping raw actions in the real example to the abstract actions in
the model described above.
A typical example of the faults from this category occurs in the context of distributed
system where the system is organized as a network of nodes and some of these nodes may not
work as expected and behave arbitrarily because of some reasons, such as attackers. One such
example is Farsite [34]. Farsite is a serverless distributed ﬁle system that provides centralized
ﬁle-system service. Farsite aims at providing secure, scalable and strongly consistent ﬁle
storage service. In Farsite, the concept of servers in the system is virtual and the system
26

actually runs on a network of untrusted PCs some of which may be controlled by a malicious
user. Hence, the whole infrastructure is highly susceptible to a fault that the virtual server
behaves arbitrarily. This fact brings the challenge of guaranteeing the delivery of correct
service in the presence of such faults.
Since our goal is to illustrate the modeling of such fault and not the details of Farsite,
without loss of generality, we describe how one can model operations for a single ﬁle, say f l.
In particular, labeling UpdateAction and AccessAction is straightforward. In this example,
for these Read operations that allows users to get the data from servers, we mapping them
as AccessAction whereas we mapping Write operation that allows users to write data to
servers to UpdateAction. In each operation, the client identiﬁes the list of server nodes that
contain (or are likely to contain) a copy of the ﬁle. We use index j to quantify over the list of
such servers. Furthermore, we use the term data.j to denote data maintained by the server
j for ﬁle f l and the term copy.j to denote local copy of ﬁle f l.
In read operation, the client obtains a copy of the ﬁle from all (or a subset of) servers.
In write operation, the data is written to the respective servers. In order to capture system semantics of fault tolerance mechanism designed in this example, we make extend of
AccessAction in the basic abstract model. There are several copies of stored ﬁles. Since
some of the copies may be incorrect due to malicious users, the client performs an agreement
algorithm (e.g., majority computation) ﬁrst, then return the agreed data to the user who
requests ﬁle f l.
Thus, the operations in Farsite can be modeled as shown next.

Acess action:
∀j :: copy.j := data.j ;
27

getResouce := agreement(copy.j);
return getResource;
Update action to write v to ﬁle:
∀j :: data.j := v;

Obviously our above model in transition system can capture system semantic of this example.
Observe that the above model only describes the program actions. Next, we show that
the actions of malicious users can also be modeled in terms of f aultaction in the abstract
model. In particular, to denote whether process j is malicious, we introduce a variable m.j
that denotes whether j is malicious. The fault actions can be modeled as follows:
Fault action 1:
true −→ m.j := true;
Fault action 2:
m.j −→ data.j := random();

Thus, Fault action 1 denotes that machine j is compromised. Fault action 2 denotes that
the compromised machine can change the local data arbitrarily. Observe that along with
the write action that assumes that data is written to all replicas, the fault action allows the
malicious user to change the data stored in a compromised machine.
Next, we discuss the completeness of this approach. In particular, we argue that this
approach is generic for faults from this category. Speciﬁcally, the faults are operational in
nature, i.e., at the time system is created (and deployed), there are no faults perturbing the
system. (Note that this is not true for development faults.) Thus, faults become operational
28

at some point after (possibly, before the system executes even one of its actions) the system
is deployed. Any modeling must include an action by which the malicious components/processes appear in the system. In other words, there must be one action (similar to Fault
action 1) where some components/process becomes faulty. Moreover, since the faults are
permanent, it is required that it must be possible for the fault to persist forever. In other
words, the actions modeling faults should be such that they can execute forever (such as
that indicated by the guard of Fault action 2).
Finally, because the fault is human made and malicious, the faulty component/process
can change the data that it controls arbitrarily (as in the statement of Fault action 2). Thus,
any fault that is operational, human made, persistent and malicious, can be modeled using
an approach similar to the one presented earlier.

3.3.1.3

Modeling Variations

There are several possible variations that one may consider in this category of faults, as
follows.
• A malicious user may intend to remain hidden. In this case, Fault action 2 will have
to be changed so that instead of changing the data arbitrarily, it will change it only in
a way that allows prevents it from being discovered.
• The system may have a mechanism to clean up aﬀected nodes (e.g., through antivirus
programs etc.) If such a mechanism exists then it can be modeled by the dual of Fault
action 1 where m.j is changed from true to f alse.
• Also, often assumptions are made about the number of malicious users that can exist
in the system at a given time. For example, a standard assumption is that the number
29

of malicious replicas is less than a third of the total replicas. If such an assumption
is desired, it can be achieved by changing Fault action 1 so that a node can become
malicious only if the total number of malicious nodes will not exceed the bound. Note
that this will require one to read the value m variable of all nodes. However, this is
acceptable since we are simply recording the assumption.

3.3.1.4

Modeling Faults from Related Categories

Approach similar to the one in this section also applies for faults from other categories. For
example, the above modeling can be applied when the fault is caused by hardware. Also,
if one were to model the fault which persists for some duration and then disappears, we
only need to slightly modify the aboved modeling (the case where fault is permanent (longlasting) in nature), by adding the modeling of another fault action (similar to Fault action
1) where m.j changes from true to f alse. Based on this discussion, we can model faults
from category 22-25 (cf. Table 3.2) using the approach in this section. A special case when
the fault is transient and occurs only once, we can simplify the modeling of faults by the
approach in Section 3.3.3.
Additionally, a similar approach could also be used for modeling malicious and deliberate
development faults (category 5 and 6 in Table 3.1). Examples of faults from these categories
include logic bomb. For faults from these categories, the modeling in this section may be
used if there are several designer teams producing diﬀerent parts of the system (e.g., with Nversion programming) and suﬃcient redundancy exists to deal with such faults. However, for
the case where one design team is creating the given system or where suﬃcient redundancy is
unavailable, the only suitable approach is to use fault prevention where the goal is to ensure
that the fault cannot happen. And, when fault prevention is used, there is no need to model
30

faults explicitly, as proving that fault prevention work requires one to only consider system
behavior in the absence of faults.

3.3.1.5

Eﬀect during Veriﬁcation of Fault-tolerance

If one were to verify a program that models malicious users in the above fashion, we can
observe that the introduction of the variable m.j increases the state space of the program.
Additionally, to ensure that the variable m.j is not used improperly, we need to syntactically
check the program; speciﬁcally, we need to ensure that actions at machine j do not reference
variable m.k, where j = k.

3.3.1.6

Eﬀect during Revision for Adding Fault-tolerance

One important characteristic of above formulation is that variable m.j is readable only
to node j. Other nodes cannot read it. If model revision is used for a fault from this
category, it is necessary that this restriction is continued to be satisﬁed in the revised model.
Additionally, variable m.j cannot be changed by any node; it may be changed only by a fault
action. These restrictions have been shown to increase complexity from P to NP-complete
in some instances [134].

3.3.2

Operational, Internal, Natural, Hardware, Non-malicious,
Non-deliberate, Accidental and Persistent Faults

In this section, we focus on faults that are operational, internal, natural, hardware, nonmalicious, non-deliberate, accidental and persistent. (This corresponds to category 12 (cf.
Table 3.1).)
31

3.3.2.1

Abstract Model

Obviously, the fault from this category occurs during the system execution. Generally, they
occur due to hardware degradation over time. And, it is expected that they last for a
signiﬁcant duration of time. Similar to the faults from Section 3.3.1, it is expected that the
system will continue to function even when the faults are present. Periodic maintenance
would be used to replace the hardware as necessary to correct this fault.
Taking these impacts that this category of fault has on the underlying system state into
account, we introduce g to denote the hardware state and then the corresponding abstract
action in our model is as follows, where ǫ is a small real number.

Fault action 1:
g ≥ ǫ −→ g := g − ǫ;

3.3.2.2

Mapping to Abstract Model - From A Concrete Example

A typical example from this category occurs in the context of a braking control system
where the brake may wear out due to physical deterioration thus system failures may occur
with accidents and severe damage and human injuries as consequences.
The basic modeling of braking system includes the speed of the vehicle, the status of
brakes (e.g., applied or not) and their reliability.
One such fault in this system can be caused by the case where wear and tear on brakes
(or other factors) reduces their eﬀectiveness. Modeling such fault into abstract action is
straightforward. If we use g to denote the speciﬁed level of performance and integrity of the
brake, we can apply abstract action in the basic model directly. To capture system semantics
32

of this example, we assume value of g is in the range [0, 1] where g = 1 corresponds the ideal
brake and g = 0 corresponds to a nonfunctioning brake in this case.
Besides, in order to model program actions, we use variable s to denote the speed of
the vehicle. And, we use a Boolean variable b to denote whether the brake is pressed.
(Similar to g, b could also be modeled as a continuous variable. However, this extension is
straightforward and, hence omitted.) Thus, when the brake is pressed, the vehicle reduces
the speed. We model this as a reduction in speed by a ﬁxed value (denoted as C in the below
program action) that depends on the reliability of brakes. Hence, one possible way to deﬁne
such program action in this context is as follows:
Brake Action:
b == true −→ s := MAX(s − g ∗ C, 0);

Next, we discuss the completeness of this approach. In particular, we argue that this
approach is generic for faults from this category. Since the faults are operational in nature,
faults occur at some point after the system is deployed and thus any modeling must include
some action where some component/process becomes faulty. Also considering the faults are
internal, natural, non-malicious and non-deliberate, the fault action should not be triggered
by outsider. In other words, the guard of the fault action should be some status of fault
component or process itself, like Fault action 1. Moreover, since the faults are permanent, it
is required that it must be possible for the fault to persist forever (until the faulty component
loses its capability totally).
Thus, any fault that is operational, internal, natural, hardware, nonmalicious, nondeliberate, accidental and persistent Faults, can be modeled using an approach similar to
33

the one presented earlier.

3.3.2.3

Modeling Variations

The approach proposed in this section is discrete in nature. It models that when the brake
is pressed, speed reduces by a certain amount. A more accurate model would be to utilize timing based information for modeling both programs as well as faults. Examples of
such approach include the timed automata [9] that combines transition systems with time.
In particular, with such an approach, we could model speed reducing at a particular rate
depending upon the reliability of brakes and the pressure applied on brakes.

3.3.2.4

Modeling Faults from Related Categories

Approach similar to the one in this section also applies for faults from other categories. The
above modeling is for the case where fault is permanent in nature. If the faults persist for
some duration and then disappears then we need to model another fault action (similar to
the above fault action g ≥ ǫ −→ g := g − ǫ) where g can be increased ǫ and reach at most
1 as range required. Based on this discussion, we can model faults from category 12-13 (cf.
Table 3.1) using the approach in this section.
However, similar to Section 3.3.1, for the case where physical elements play a sole role
in the structure of the whole system, when that elements totally loss the functions, e.g. a
deﬂated tire of car, there is no need to model the fault explicitly. And, the only suitable
approach is to use fault prevention to ensure such faults cannot happen.
34

3.3.2.5

Eﬀect during Veriﬁcation and Revision of Fault-tolerance

It is expected that the fault from this category will generally require one to consider hybrid
models where both discrete and continuous variables are considered. The eﬀect/eﬀects of
such hybrid model in veriﬁcation is/are considered in [8]. Regarding revision, revising timed
automata is considered in [35]. However, the cost of such revision is often higher (exponential
in the constants involved in specifying timing constraints).

3.3.3

Operational, External, Natural, Hardware, Non-malicious,
Non-deliberate, Accidental and Transient Faults

In this section, we focus on faults that are operational, external, natural, hardware, nonmalicious, non-deliberate, accidental and transient. (This corresponds to category 15 in
Table 3.1.)

3.3.3.1

Abstract Model

Obviously, the fault from this category occurs during the system execution. Generally, they
occur due to unexpected transient issues that are unlikely to happen again. They are not
malicious in nature but may cause system to behave in an incorrect manner. One approach
for dealing with such faults is self-stabilization [59] where the system is guaranteed to recover
from an arbitrary state to a legitimate state.
Modeling of the fault from this category in terms of transition systems is straightforward
since the fault occurs once (or rarely) and the impact is to change the system state into
random value. Speciﬁcally, we introduce x to denote system state then the fault can be
modeled by the following action:
35

true −→ x := random();

3.3.3.2

Mapping to Abstract Model- From A Concrete Example

A typical example from this category includes faults that cause memory to change to an arbitrary state. A possible reason why such errors occur is cosmic rays, for example. Typically,
it is assumed that such faults are not detectable and, hence, the system continues to operate
in spite of them. Another related example in this class includes the use of uninitialized
variables. Such a fault will result in the program starting in an arbitrary state. Moreover,
other types of faults such as crash and message loss can exhibit a behavior that is similar to
a fault from this category. Speciﬁcally, in [120], authors have shown that in the presence of
these faults, the program may be (essentially) perturbed to an arbitrary state.
Mapping the raw actions in the real example to the abstract action in the model is
straightforward. First, we map the variable (denoted as v) changed by the raw actions into
x in the abstract action ﬁrst. Second, we map the actual way to change the value of v to
random() in the abstract action. By these two steps, we can apply the basic abstract model
to model fault actions in the real example directly.
Next, we discuss the completeness of this approach. In particular, we argue that this
approach is generic for faults from this category. Consider the faults are operational in
nature, there must be some fault action (similar to Fault action 1) occuring after the system
is deployed. These faults may aﬀect the entire system or a part of it. Since the faults are
transient, this implies that their eﬀect is temporary and the system continues to execute
its actions after faults occur. Moreover, since the fault is accidental and not deliberate,
the eﬀect of fault on the aﬀected part (like var x in Fault action 1) is random. Thus, any
36

fault that is operational, human made, persistent and malicious, can be modeled using an
approach similar to the one presented earlier.

3.3.3.3

Modeling Variations

There are several possible variations that one may consider in this category. For example,
the most common assumption used in the literature states that the fault will change the
variable to only a value that is legitimate in some conﬁguration. In other words, the fault
will assign the variable a value that is from its domain; intuitively, this assumption is based
on the fact that if the value assigned to the variable is outside the domain then it would be
detected before the variable value is used. Also, often assumptions are made about inability
of the fault to change the code itself, i.e., the fault only changes data. If faults are allowed to
disrupt code then this can be modeled using the approach in Section 3.3.1. Alternatively, it
could also be modeled using the approach in Section 3.3.4 if it is expected that the fault will
render the corresponding component in a state from where it cannot continue its execution.

3.3.3.4

Modeling Faults from Related Categories

The approach similar to the one in this section also applies for faults from other categories.
For example, the above modeling can also be applied when the fault is caused by human
being’s non-malicious action. Also, the modeling can be applied where fault is caused by
software. Hence, we can model faults from category 15-16, 18-19, 21, 26, 28-29, 31 (cf. Table
3.2) using the approach in this section.
37

3.3.3.5

Eﬀect during Veriﬁcation and Revision of Fault-tolerance

If one were to verify a system that models transient faults in the above fashion, then one
has to consider system execution from arbitrary (respectively, large number of) states. In
particular, if the fault can corrupt all program variables then this corresponds to considering
execution from arbitrary states. The topic of such veriﬁcation is considered in the context
of self-stabilization; in [225], authors have shown the feasibility of verifying self-stabilization
using model checker SMV [174].
There are no new diﬃculties introduced when applying model revision to address faults
from this category. This is due to the fact that no auxiliary variables are needed to model
this fault. An example where such fault modeling is used in model revision includes [1].

3.3.4

Operational, External, Hardware, Non-malicious and Persistent Faults

In this section, we focus on the class of faults that are operational, external, hardware, nonmalicious and persistent. (They correspond to category 14, 17 and 20 (cf. Table 3.1 and
3.2).) These categories diﬀer in terms of whether the fault is human-made or natural as
well as whether the fault is deliberate or non-deliberate. However, the exact cause is not
important in modeling eﬀect of such faults.

3.3.4.1

Abstract Model

Unlike the faults from Section 3.3.1, where faults are malicious in nature, the faults in
category 14, 17 and 20 are non-malicious. It is expected that these faults are persistent in
nature. Examples of such faults include failure of nodes, failure of channels etc.
38

Modeling of such fault in terms of transition is similar to that in Section 3.3.1. Speciﬁcally,
since the faulty component is permanently ‘killed’ we can model it with an auxiliary variable
C that denotes whether the given component is correct or whether it has failed. Also, let
data denote state of a node (including its buﬀered messages). Now, the fault action can be
modeled as follows:
Fault action 1:
C = true −→C := f alse, data := empty;

Additionally, all program actions would have to be modiﬁed so that they only execute
when the switch is functioning correctly. In other words, all actions by which one node
communicates with another would have to be modiﬁed so that it can occur only if the
respective nodes have not failed.

3.3.4.2

Mapping to Abstract Model- From A Concrete Example.

A typical example of the faults from this category occurs in context of networking systems
where a fault may cause one or more nodes to fail in a manner where it completely stops
working.
One example of such a system is from [206] where authors have focused on developing a
failstop processor: A failstop processor works correctly before a fault occurs and it performs
no actions when it fails. Moreover, data maintained at the failed processor is lost.
Another example is SafetyNet [217]. SafetyNet is a lightweight global checkpoint/recovery
scheme. It aims at providing stable and reliable system services even in the situation of either
dropped coherence messages or the loss of an interconnection network switch (and its buﬀered
39

messages).
Mapping raw actions in the real example to abstract action in the model described above
is very straightforward. The steps is similar with Section 3.3.3(, and hence omitted).
Next, we discuss the completeness of this approach. In particular, we argue that this
approach is generic for faults from this category. Speciﬁcally, considering the faults are
operational, it is required to include in the modeling such an action to denote faulty components/processes appear in the system. Also, since faults of this category are persistent, it
is required that it must be possible for the fault to persist forever (such as that indicated
by Fault action 1). Since the fault is external and non-malicious, the occurrence of fault
is due to some change of environment condition and hence there must be some variable to
denote this change in the modeling (like var c in the Fault action 1). Thus, any fault that
is operational, external, persistent, hardware and non-malicious, can be modeled using an
approach similar to the one presented earlier.

3.3.4.3

Modeling Variations

There are several possible variations that one may consider in this category of faults. For
example, a 2D torus topology is considered in [217] to prevent a single point-of-failure by
splitting each switch into two half-switches. Execution may resume after reconﬁguration to
route around the lost switch [118] although at reduced bandwidth. In this case, we can use
two variables to denote the link status and buﬀered data of each half-switch independently.
The link status of the whole switch can be modeled as disjunction of the link status of
each half element. Also often redundancy is used in the networking system. For example,
the conﬁguration of a speciﬁc point-to-point path may consist of several available links. If
such an assumption is desired, one can model the fault action for each available switch ﬁrst.
40

Then, the link status of the whole conﬁguration can be modeled as disjunction value of these
available switches.
The fault modeling from this section assumes that the fault is permanent, i.e., the failed
node remains failed forever. A variation of this model is one where the faulty node is repaired
and integrated in the system. In this case, a dual of the fault action where C is set to true
must be added. Depending upon how such an action restores the node, the data associated
with the node would change. For example, if the restore is equivalent to reboot where the
node starts from some ﬁxed state, data will be changed accordingly.

3.3.4.4

Modeling Faults from Related Categories

Approach similar to the one in this section also applies for faults from other categories. For
example, the above modeling can be used whether faults aﬀect hardware or software. Hence,
we can model faults from category 27 and 30 (cf. Table 3.2) using the approach in this
section.

3.3.4.5

Eﬀect during Veriﬁcation and Revision of Fault-tolerance

The eﬀect on veriﬁcation and revision is similar to that in Section 3.3.1. There are some
possible changes depending upon the assumptions about faults. In particular, variable C
introduced in fault actions may or may not be readable by other processes depending upon
whether one assumes that a fault is detectable or not. Furthermore, approaches such as
failure detectors [220] could be used to model more ﬁne tuned assumptions about delectability
of the fault.
41

3.3.5

Development Faults

In this section, we focus on faults that occur during development. This corresponds to
category 1-11 in the Table 3.1. Typical examples of such faults cause software ﬂaws, logic
bombs and hardware errata, etc. If the developer(s) or operator(s) did not realize at the
time that the consequence of their decision was a fault (or conceal faulty actions like logic
bombs with the malicious purpose), and furthermore, design or decision is accepted for use,
these faults can be treated as intrinsic nature of system and the occurrence of the faults is
unavoidable.
A typical example from this category includes Pentium FDIV bug in the Intel P5 Pentium
ﬂoating point unit(FPU) that was caused by few missing entries in the lookup table used by
the divide operation. Another example is that of buﬀer overﬂows where one copies a longer
string into a shorter string thereby aﬀecting other parts of memory.
We argue that for such faults, formal methods for modeling faults explicitly are either
undesirable or impossible. For such faults, a more practical approach is fault prevention
where the goal is to ensure that the fault does not occur. For example, a thorough analysis
of code could be useful to ensure that logic bombs do not occur. Likewise, analysis of Pentium
FPU with theorem provers [198] has been successful in identifying the missing table entries
in Pentium. Likewise, approaches such as [226] can be used to prevent buﬀer overﬂows.
In other words, for faults from this category, one needs to consider fault-free execution
to show that the fault does not occur. Since this requires consideration of only fault-free
execution, it does not require one to model faults explicitly.
That said, in certain instances, one may consider these faults explicitly and tolerate them,
as preventing them may be impossible. In such cases, one needs to consider the eﬀect of
42

the development faults. We expect most such development faults to exhibit themselves as
malicious faults (cf. Section 3.3.1) or as failstop (cf. Section 3.3.4) faults. In particular, we
have discussed modeling of faults from category 5 and 6 in Section 3.3.1.

3.4

Practicability during Veriﬁcation and Revision

In Section 3.3, we considered the feasibility of modeling faults in terms of transition systems.
In this section, we utilize those results in identifying the practicability of such modeling.
Speciﬁcally, our goal is to evaluate the cost of such modeling in two contexts: model-checking
and model revision. In case of model checking, we want to compare the cost of verifying a
fault-intolerant program with that of verifying fault-tolerant program. And, in case of model
revision, we want to compare the cost of verifying fault-intolerant program with that of
revising it to add fault-tolerance. Both these tasks are feasible only if we can model the
faults during the veriﬁcation and/or revision process.

3.4.1

Cost of Modeling Faults during Model Checking

Model checking focuses on deciding whether a given model of system, say M satisﬁes the
given property pr. While the cost of model checking depends upon several factors, one
important factor is the state space of the resulting model. Since modeling of faults in terms
of transitions has the potential to increase the state space of the program, we evaluate the
cost in terms of the increased state space. Speciﬁcally, we consider the increased cost of
modeling faults from Sections 3.3.1-3.3.4.
Observe that for modeling persistent and malicious faults, in Section 3.3.1, we needed to
add a variable m.j for every user. Essentially, this would double the state space of that user.
43

Faults From
Section 3.3.1
Section 4.4.2
Section 3.3.3
Section 3.3.4

Increased Cost of Modeling Faults
during Model Checking
(in terms of the increased state space)
Increased by a factor of 2n .
Increased by a constant factor
if discrete degradation is considered.
Potentially undecidable
if continuous degradation is considered.
Unchanged in statespace.
Increased by a factor of 2n .

Table 3.4: Cost of modeling faults during model checking.
Moreover, if there are n users in the system, then the total state space will be 2n times more
than that of the fault-intolerant system.
For faults from Section 3.3.2, the cost depends upon whether the physical degradation can
be modeled using discrete values or whether continuous modeling is required. If the physical
degradation is modeled using discrete values as in Section 3.3.2, the total state space will
increase by a constant factor when compared with that of the fault-intolerant system. If the
physical is modeled using continuous values, especially modeling in hybrid automata [8], the
total reachability problem is potentially undecidable.
If one were to verify a system that models transient fault as in Section 3.3.3, then one
has to consider all possible states that could be subtantially larger. However, the total state
space will be unchanged compared with that of fault-intolerant system.
Since the modeling approach of faults from Section 3.3.4 is similar to that in Section 3.3.1,
the increased state space on veriﬁcation of modeling faults from Section 3.3.4 is similar to
that of Section 3.3.1.
Based on the above discussion, we summarize the increased cost of modeling faults from
Sections 3.3.1-3.3.4 in Table 3.4.
44

According to the results in Table 3.4, we argue that the increased cost due to modeling
faults in 18 categories (identifed in Section 3.3.1, 3.3.3 and 3.3.4) is reasonable. However,
the increased cost due to modeling faults in 2 categories (identiﬁed in Section 3.3.2) is high.

3.4.2

Cost of Modeling Faults During Model Revision

Since model checking computes (directly or indirectly) all computations of M to determine
whether pr is satisﬁed, it is especially useful in providing assurance about a system developed
from that model. The related problem of model revision [36, 37] focuses on scenarios where
model checking produces a counterexample or where an existing model needs to be revised
to add new properties (such as safety, liveness and timing constraints). Thus, the goal in
model revision is to modify the given model M so that it satisﬁes the given property pr.
Since the revised model is correct by construction, it can assist us in obtaining a correct
model of system when model checking ends up ﬁnding a counterexample.
Considering read-write restriction must be continued to be satisﬁed in the revised model,
for modeling persistent and malicious faults in Section 3.3.1, variable m.j is readable only to
node j. Other nodes cannot read it. Besides, variable m.j cannot be changed by any node;
it may be changed only by a fault action. These restrictions have been shown to increase
complexity from P to NP-complete in some instances [134].
Regarding revision for the modeling of faults from Section 3.3.2, revising timed automata
is considered in [35]. However, the cost of such revision is often higher (exponential in the
constants involved in specifying timing constraints. In some circumstance, the problem is
even undecidable.
For modeling transient faults in Section 3.3.3, it does not introduce new diﬃculties in
45

Faults from
Section 3.3.1
Section 3.3.2
Section 3.3.3

Section 3.3.4

Complexity of Modeling Faults
during Model Revision
(in terms of the increased state space)
From P to NP-complete
in size of statespace.
Undecidable in certain circumstance.
For centralized system,
unchanged in complexity class.
For distributed system,
conjectured to NP-complete.
From P to NP-complete
in size of state space.

Table 3.5: Complexity of modeling faults during model revision.
model revision. This is due to the fact that no auxiliary variables are needed to model this
fault. An example where such fault modeling is used in model revision includes [2].
Similarly to modeling of faults from Section 3.3.1, the complexity of modeling faults from
Section 3.3.4 is increased from P to NP-complete in certain circumstance.
As discussed above, we summarize the complexity issues caused by modeling faults from
Sections 3.3.1-3.3.4 during model revision in Table 3.5.
Based on these results in Table 3.5, the complexity increases substantially for model
revision. However, eﬃcient heuristics have been found to mitigate the complexity for modeling faults identiﬁed in Section 3.3.1, 3.3.3 and 3.3.4. Hence, we argue that model revision
for faults from these categories is practical. And for faults discussed in Section 3.3.2, the
increased complexity may be too high.

3.5

Relative Completeness with Recent Literature

In this section, we evaluate relative completeness of modeling approach proposed in this
dissertation with recent literature. In particular, we study the papers from two premier
46

conferences in the area of fault tolerance: the International Conference On Distributed
Computing Systems (ICDCS) and the IEEE/IFIP International Conference on Dependable
Systems and Networks (DSN) from 2007 to 2010.
Of 135 fault-tolerant relevant papers, we evaluate each paper whether modeling of those
faults is feasible using the approaches mentioned in our work. We ﬁnd that faults mentioned
from 51 papers can be modeled in the approach proposed in Section 3.3.1. Faults mentioned
from 13 papers can be modeled in the approach proposed in Section 3.3.2. The fault model
in the Section 3.3.3 can be applied to modeling faults from 40 papers. And the fault model in
the Section 3.3.4 can be used in modeling faults from 31 papers.We summarize our evaluation
that how these faults can be modeled in the our proposed approaches in Table 3.6 and 3.7.

Appropriate Model

Approach in Section 3.3.1

Approach in Section 3.3.2

Faults proposed in the publications
from DSN and ICDCS since 2007
DDoS Attacks in [84], [179]
Selﬁshness of Node/Host in distributed
system in [45], [178], [155], [219], [212],
[182], [162], [121], [245], [29], [203],
[259], [234], [184], [177], [201], [208],
[108], [27]
Attacks in [161], [239], [99], [170], [40],
[223], [183], [144], [65], [210], [252],
[192], [11], [12], [81], [248], [185], [255],
[127], [129], [218], [44], [230], [4], [250],
[30]
Worms in [242], [82], [241],
Power degradation of Battery in
MANETs mentioned in [235],
Energy consumption in [207], [153],
[236], [148], [149], [91], [157], [126],
[249], [224], [100],
Aging-related bug in [98],

Table 3.6: A classiﬁcation of faults proposed in DSN & ICDCS 2007-2010 (1).

47

Appropriate Model

Approach in Section 3.3.3

Approach in Section 3.3.4

Faults proposed in the publications
from DSN and ICDCS since 2007
Data Incoherency in [32], [28], [160],
[243], [244], [49], [232], [5], [247], [152],
[39], [229], [233], [97], [3], [73], [14], [122]
Noise in WSN [258],
User’s misbehavior in [251], [252], [128],
[253]
Uncertain data in [62], [48], [25], [197],
[187]
Transient bugs in [256], [117], [190],
[202], [211], [257], [195], [13], [109], [95],
[193], [216], [94], [86]
Message Loss in [231], [75], [176], [180]
Crash Failure in [103], [83], [237], [55],
[165], [238], [104], [124], [196], [209],
[74], [246], [131], [168], [150], [175],
[172], [43], [228], [167], [94]
Disconnection in distributed system in
[28], [112], [254], [205] [221], [154], [105].

Table 3.7: A classiﬁcation of faults proposed in DSN & ICDCS 2007-2010 (2).

3.6

Summary

The main contributions of this chapter are as follows: i) starting from a taxonomy of faults
in [24], which classiﬁes faults into 31 categories, we showed that, faults from 20 categories of
these 31 categories can be modeled using transition systems. These faults include Byzantine
actions in a networked system, physical deterioration of brake in a braking control system and
so on. And, we showed that faults from 11 categories cannot (or should not be) represented
using transition systems. These include faults such as buﬀer overﬂow, hardware errata and
so on. ii) We also investigated the feasibility and practicability of using transition systems
to model faults. We showed that (1) the modeling of faults from 18 categories as transition
systems is practical and feasible and (2) the modeling of faults from 2 categories as transition
systems is not practical although feasible. iii) We investigated the relative completeness of
48

the proposed approach with recent literature.
Our approach for modeling faults is analogous with the approach for modeling faultfree behavior in model checking, and the latter has been shown to be one of the most
successful strategies for analyzing fault-free models. Also, our approach has been used in
verifying and revising fault-tolerant programs in the context of speciﬁc instances of faults.
Hence, we expect the results in this chapter to bridge the gap between theory and practice
in providing assurance about fault-tolerant system design. Moreover, our fault-modeling
approach is beneﬁcial in the situation where the system is subject to multiple faults from
diﬀerent classes (e.g., node crash, and message loss). In what follows, we are going to use
this fault modeling approach to assist in analyzing the system that is subject to multiple
faults.

49

Chapter 4
Automatic Veriﬁcation of
Self-Stabilizing Programs

4.1

Introduction

In this chapter, we describe our approach for automatic veriﬁcation of self-stabilizing programs. Self-stabilization [60], an ability to converge to a legitimate state from arbitrary
initial state(s), enables a program to automatically recover from the occurrence of (transient) faults. In particular, if a self-stabilizing program is perturbed by a transient fault
then the program is guaranteed to recover to a legitimate state after faults stop. This property is especially useful in a large distributed network, where it is diﬃcult or impossible to
predict the exact dynamic situation. Hence, several algorithms such as routing, leader election, mutual exclusion [60, 63, 89] are designed to be self-stabilizing. Also, self-stabilization
is an eﬀective way to accommodate faults from multiple classes. In particular, a stabilizing
design is able to tolerate multiple classes of faults that require nonmasking fault-tolerance.
50

However, veriﬁcation of self-stabilizing programs is a challenging task [64,189,225]. Contrary to traditional veriﬁcation that considers only a subset of reachable states, veriﬁcation
of stabilization requires one to consider all possible states that could be substantially larger,
leading to worsening of the fundamental problem of ‘state explosion’.
We explore the automatic veriﬁcation approach that utilizes the power of model checking,
which is one of the most successful automatic veriﬁcation techniques and aims at automatically verify whether a given program meets the given property. Previously, Tsuchiya et
al [225] have proposed an approach for model checking of self-stabilizing programs. This
work addresses the problem of state space explosion with the help of symbolic model checking techniques, which use Ordered Binary Decision Diagrams (OBDDs) [41]. This work
demonstrates feasibility of applying model checking for verifying self-stabilizing programs.
Unfortunately, it only works for programs with a small number of processes. To overcome
this limitation, our work identiﬁes a key bottleneck that limits the scalability of model checking of verifying stabilization. Speciﬁcally, we propose a novel approach that beneﬁts from
the eﬀectiveness of diﬀerent fairness schedulers while employing the power of symbolic model
checking.
We also propose a constraint-based approach to analyze stabilizing programs without
introducing fairness scheduler in the program model. The key insight is to reduce the veriﬁcation task into a well-studied problem of constraint solving. In turn, this problem can
be solved by many existing highly optimized solutions. Specially, our approach leverages
the power of oﬀ-the-shelf SMT solvers that have demonstrated the ability to solve industrial
sized-satisﬁability instances. To the best of our knowledge, our research is the ﬁrst one that
analyzes stabilization with the help of SMT solvers.
51

4.1.1

Organization

The rest of this chapter is organized as follows. Section 4.2 presents the formal deﬁnition of
fairness constraints and stabilization. Section 4.3 recalls existing approaches that use model
checking to verify stabilization. In Section 4.4, we propose a veriﬁcation approach that uses
the power of symbolic model checking as well as the eﬀectiveness of fairness schedulers. In
Section 4.5, we present a constraint-based veriﬁcation approach that utilizes the power of
the oﬀ-the-shelf SMT solver to analyze stabilization.

4.2

Background

A fair computation allows for a fair resolution of non-determinism. Next, we introduce
the deﬁnition of weakly-fair computation. Intuitively, in a weakly-fair computation, if a
guard of an action is continuously true then that action must be executed. Thus weakly-fair
computation is deﬁned as follows.
Deﬁnition 4.2.1. (Weakly-fair computation) σ = s0 , s1 , ... is weakly-fair computation of p iﬀ:
1. σ is a computation of p, and
2. if any action, say ai , of p is enabled in all states sj , sj+1 , sj+2 . . . then ∃k : k ≥ j :
sk+1 is obtained by executing sti from state sk .
Remark 4.2.1. Note that Deﬁnition 2.1.5 does not consider fairness. Hence, as needed, we
use the term unfair computation to distinguish the computation without fairness from the
one with fairness. The term unfair computation has the same meaning as computation in
Deﬁnition 2.1.5.
52

Deﬁnition 4.2.2. (Stabilization) Let p be a program and let I be a state predicate of p.
We say that p is self-stabilizing for I iﬀ:
1. closure: if (s0 , s1 ) is a transition of p and s0 ∈ I, then s1 ∈ I;
2. convergence: every computation of p reaches I, i.e., ∀σ : σ is of the form s0 , s1 , s2 , . . .
and σ is computation of p : (∃j :: sj ∈ I).
Note that the above deﬁnition can be instantiated with unfair computations or with
weakly-fair computations. In the former case, we say that the program p is self-stabilizing
for I without fairness. And, in the latter case, we say that program p is self-stabilizing for I
under weak fairness. Finally, whenever I or fairness level is clear from the context, we omit
it.

4.3

An Approach of Using Symbolic Model Checking
to Verify Self-stabilizing Programs

In this section, we recall the existing approach of using model checking to verify selfstabilization proposed in [225]. We evaluate this approach with three classic examples in the
literature on self-stabilization: Dijkstra’s K-state program [59], Ghosh’s mutual exclusion
program [87] and Hoepman’s ring-orientation program [113]. Our case studies demonstrate:
(i) this approach is limited in time performance and only works for programs consisting of a
small number of processes while showing feasibility of using symbolic model checking to verify stabilizing programs, (ii) the ability of using this approach to verify stabilizing programs
remain essentially the same despite the improvement of hardware.
53

4.3.1

Modeling Self-stabilizing Program

In this section, we review the program model introduced in the approach proposed in [225],
which utilize SMV [174] , a symbolic model checker to verify stabilizing programs.
In a SMV program, the behavior of each process is expressed by a module. Given a
program action g1 −→ st1 , we model it inside a module by using keyword ASSSIGN, as
follows:

ASSIGN
init(v) := initial values;
next(v) := case g1 : Fst (v);
1
1:v;
esac;

where v denotes variable changed in st1 and Fst denotes the assignment function used in
1
st1 .
SMV provides each module a special variable running that is true iﬀ the module is
currently being executed. Hence, we can force each process to be selected to run inﬁnitely
often by adding the declaration

FAIRNESS running

to each process. Clearly this models a weakly-fair computation.
54

4.3.2

Case Study 1: K-State Token Ring Program

The K-state program consists of N + 1 processes, numbered from 0 to N. The program
topology is a unidirectional ring. Each process p.i, 0 ≤ i ≤ N, has one variable x.i that
denotes the current state value. Each variable has the domain [0, . . . , K − 1].
The program consists of two types of actions. The ﬁrst type is for process 0. This action
is enabled when x.0 equals x.N. When p.0 executes its action, it increments x.0 by 1 in
modulo K arithmetic. The second type of action is for process p.i, i = 0. This action is
enabled when x.i is not equal to x.(i − 1). When p.i executes its action, it copies x.(i − 1).
Thus, the actions are as follows:

−→

K0 ::

x.0= x.N

Ki ::

x.i = x.(i − 1) −→

x.0 = (x.0 + 1) mod K;
x.i = x.(i − 1);

Remark 4.3.1. This program is known to be self-stabilizing if K > N. In subsequent
discussion, we let K = N + 1.
The state where x values of all processes is 0 is a legitimate state. In this state, only
process 0 is enabled. After process 0 executes, x.0 changes to 1 and all other x values are
still 0. In this state, only process 1 is enabled. Hence, it can execute and change x.1 to 1.
Continuing this further, eventually, we reach a state where all x values are 1 where process
0 is the only enabled process and process 0 will increment x.0 to 2. The legitimate states of
the K-state program are equal to all the states reached in such subsequent execution.
SMV provides a simple approach for modeling weak fairness. In particular, the behavior
of each process can be instantiated from a speciﬁc module. As the program requires, there
55

are two types of actions and hence we use two modules, one for K0 and one for Ki . The
module for K0 speciﬁes variable x.0 and takes one parameter, the x value of its predecessor.
In SMV, the transition (s.0, s.1) for action K.0 is speciﬁed by the keyword ASSIGN. Within
ASSIGN, ‘init(x.0) := 0, 1, 2;’ speciﬁes the value of the variable in the source state, i.e., s.0.
Moreover, ‘next(x.0) := case x.0 = x.N : (x.0 + 1) mod 3; 1 : x.0; esac; ’ speciﬁes the value
in the target state, i.e., s.1. If the guard (x.0 = x.N) is true and s.1 is obtained by executing
x.0 = x.0 + 1 mod 3 from state s.0, otherwise the value of x.0 remains unchanged. Thus,
module for action K.0 can be written as follows:

MODULE type K0 (x.N )
VAR x.0 : 0, 1, 2;
ASSIGN init(x.0) := 0, 1, 2;
next(x.0) := case (x.0 = x.N ) : (x.0 + 1) mod 3 ; 1 : x.0; esac;
FAIRNESS running

Thus, action K0 can be instantiated from module type K0 as follows:
K0 : process type K0 (x.N);.
The module for Ki is similar to the one for K0 . It speciﬁes variable x.i and takes x.(i−1),
as parameter. The transition (s.0, s.1) for action K.i is speciﬁed by the keyword ASSIGN.
Within ASSIGN statement, init(x.i) := 0, 1, 2;, speciﬁes the value of the variable in the
source state. And next(x.i) := case !(x.i = x.j) : x.i = x.(i − 1); 1 : x.i; esac; , speciﬁes the
value in the target state. Hence Action Ki is instantiated from module type Ki as follows:
Ki : process type Ki (x.(i − 1));.
Finally, each process has the declaration FAIRNESS RUNNING to ensure that SMV only
56

weakly-fair
results reported in [225]
approximate state space

weakly-fair
results reported in [225]
approximate state space

Execution time(s)
K=3 K=4
K=5
K=6
0
0.03
0.63
5.33
0.1
0.4
4.6
43.5
101
102
103
104
Execution time(s)
K=7 K=8
K=9
K=10
34.30 139.10 1276.08 N/A
285.2 1836.0
N/A
N/A
105
107
108
1010

Table 4.1: Veriﬁcation results for the k-state program.
considers computation paths where each process executes inﬁnitely often.
We veriﬁed the K-state program for 3 ≤ K ≤ 10. Table 4.4 gives the veriﬁcation time
for model checking the K-state program for diﬀerent values of K. N/A in this table means
the result was not available within an admissible amount of time (1 hour).

4.3.3

Case Study 2: Ghosh’s Binary Mutual Exclusion Protocol

In this section, we present our second case study, namely, Ghosh’s binary mutual exclusion
protocol [87]. Ghosh’s binary mutual exclusion protocol considers a network system of
2m − 1(m ≥ 2) nodes, numbered from 0 to 2m − 1. The neighbor relation is deﬁned as
follows:
• n0 has one neighbor n1 ;
• n2i−1 (1 ≤ i ≤ m − 1) has three neighbors n2i−2 , n2i , and n2i+1 ;
• n2i (1 ≤ i ≤ m − 1) has three neighbors n2i−2 , n2i−1 , and n2i+1 ;
• n2m−1 has one neighbor n2m−2 .
57

The state si of each node ni can be either 0 or 1. Each node can read its own state and
the state of its neighbor nodes. The protocol deﬁnes the four types of actions as follows:
for n0 :
s0 = s1

−→

s0 = 1 − s0 ;

−→

s2m−1 =1 − s2m−1 ;

for n2m−1 :
s2m−1 =s2m−2
for n2i−1 (1 ≤ i ≤ m − 1):
s2i−2 = s2i−1 = s2i ∧ s2i−1 = s2i+1
−→

s2i−1 =1 − s2i−1 ;

for n2i (1 ≤ i ≤ m − 1):
s2i−2 = s2i−1 = s2i+1 ∧ s2i = s2i+1
−→

s2i =1 − s2i ;

We model the program and check the self-stabilization property of the protocol using
the same approach as mentioned in Section 4.4.1.2. The veriﬁcation results for this case are
shown in Table 4.5.

4.3.4

Case Study 3: Hoepman’s Uniform Ring-orientation Program

In this section, we present our third case study, namely, Hoepman’s uniform ring-orientation
program [113]. Hoepman’s uniform deterministic ring-orientation program considers a system of n nodes, numbered from 0 to n − 1, which are organized as a uniform ring of odd
length. Each node ni has a color, Colori , with domain {0, 1}. To impose a direction, each
58

weakly-fair
results reported in [225]
approximate state space

weakly-fair
results reported in [225]
approximate state space

Execution
n=8
n=10
0.4
2.93
3.1
22.9
102
103
Execution
n=16
n=18
693.27 2819.05
N/A
N/A
104
105

time(s)
n=12 n=14
22.43 138.05
182.0 1161.5
103
104
time(s)
n=20
N/A
N/A
-

-

Table 4.2: Veriﬁcation results for Ghosh’s mutual exclusion program.
node stores a phase, P hasei , with domain {0, 1}. A global legitimate state is one where all
the nodes are oriented in the same direction. A ring orientation program is self-stabilizing iﬀ
it reaches a legitimate state from any initial state. To achieve self-stabilization, this program
deﬁnes the following four types of actions for each node ni :
Colorneighbor1 = Colorneighbor2
−→ Colori = 1 − Colorneightbor1 ,
P hasei = 1;

Colorneighbor1 = Colori = 1 − Colorneighbor2
∧ P hasei = P haseneighbor2 = 1
∧ P haseneighbor1 = 0
−→ Colori = 1 − Colori ,
(P hasei ) = 0,
directioni = nneighbor1 ֒→ ni ֒→ nneightbor2 ;

Colorneighbor2 = Colori = 1 − Colorneighbor1
59

∧P hasei = P haseneighbor1 = 1
∧P haseneighbor2 = 1
−→ Colori = 1 − Colori ,
P hasei = 0,
directioni =nneighbor1 ←֓ ni ←֓ nneightbor2 ;

(Colorneighbor1 = Colori = 1 − Colorneighbor2
∧P haseneighbor1 = P hasei )
|(Colorneighbor2 = Colori = 1 − Colorneighbor1
∧P hasei = P haseneighbor2 )
−→ P hasei = 1 − P hasei ;

In the above actions, Action 1 requires that if a node has the same color as both its neighbors,
then it inverts its color. This action creates patterns such as 001 and 110 around the ring
since it is of odd length. Actions 2 and 3 require that if one node has the same color and
the opposite phase as one of its neighbors, then the direction is from the node with phase
0 to the node with phase 1. Action 4 requires that if one node has the same color and the
same phase as one of its neighbors, then it inverts its phase.
We model the program and check the self-stabilization property of the protocol using
the same approach as mentioned in Section 4.4.1.2. The veriﬁcation results for this case are
shown in Table 4.6.
60

weakly-fair
results reported in [225]
approximate state space

Execution time(s)
n=3 n=5
n=7
0.17 18.23 1113.77
1.3 128.1
N/A
101
103
104

Table 4.3: Veriﬁcation results for Hoepman’s ring-orientation program.

4.3.5

Analysis

From these case studies, we observe that the approach proposed in [225] has limited scalability. In the ﬁrst case study, in [225], authors have shown the feasibility of veriﬁcation of
k-state program for upto K = 8 . In particular, the time reported in [225] for K = 8 is
1836.0s whereas the time for the corresponding veriﬁcation is 139.1s. Since the underlying
tool as well as the program remains the same, this change is due to improved hardware over
last few years. However, what this result does show is that in spite of the improved hardware, the ability to verify under weak fairness remains essentially the same. Speciﬁcally, if
we assume a reasonable time constraint permissible (e.g., one hour) for veriﬁcation then the
change in hardware made it possible to achieve veriﬁcation for K = 9 as opposed to K = 8.
As discussed in [173], one of the reasons for this is that to model fairness, one needs to
ensure that each process can execute inﬁnitely often. Achieving this increases the OBDD
size for reachable states quadratically.

4.4

Eﬀect of Fairness in Model Checking of Self-Stabilizing
Programs

In this section, we introduce our approach which utilizes the eﬀectiveness of fairness as well
as the power of symbolic model checking to verify stabilizing program. We ﬁrst present our
61

approach of model checking of stabilizing programs under unfair computation. Then, we
introduce our other two approaches for reducing veriﬁcation cost purpose, including, (i) decomposing the program into sub-components and verifying each sub-components separately;
and (ii) verifying a weaker version of stabilizing programs, namely weak stabilization, to
trade-oﬀ the cost with precision.

4.4.1

Using Symbolic Model Checking to Verify Self-stabilizing
Program Under Unfair Computation

In this section, we propose an approach of model checking self-stabilization under unfair
computation for the case where weak fairness is not essential for self-stabilization. To compare the veriﬁcation performance of this approach with the approach under weak fairness
computation mentioned in Section 4.3, we illustrate this approach with the three same case
studies. We perform the experiments in the same hardware setting. The results show that
the approach of model checking self-stabilization under unfair computation is signiﬁcantly
more scalable.

4.4.1.1

Modeling Self-stabilizing Program under Unfair Computation

Now, we describe how to model self-stabilizing program under unfair computation in SMV.
Recall that a self-stabilizing program P consists of a set of guarded commands of the following
form:
action1 :

g1 −→ st1 ;

action2 :

g2 −→ st2 ;

...
62

actioni :

gi −→ sti ;

...
actionn : gn −→ stn ;

To model P under unfair computation in SMV, we use the TRANS keyword and model
actioni as follows:
conjunctioni : gi ∧(
∧(

sti updates vj

next(vj ) = Fst (vj ))
i

sti does not update vj

next(vj ) = vj )

The above formula requires that gi must be true in the initial state. Moreover, if vj
updated by actioni then next(vj ) corresponds to the value given by sti . Otherwise, vj
remains unchanged.
Since the use of TRANS in SMV requires the user to explicltly ensure that the transition
relation is total. The transition relation is total if every state has a successor state. Hence,
gi ) −→ skip;” to the program. Using the approach
i=1,...,n
for modeling actions, this action is modeled as follows:
we add an additional action “¬(

conjunctionaddtional : ¬(

i=1,...,n

gi ) ∧ (

next(vj ) = vj );

Based on the above modeling, we model the whole program actions as a disjunction of
conjunction1 , conjunction2 , . . . conjunctionn and conjunctionaddtional , as follows:
63

g1 −→next(v1 ) = Fst (v1 ) &
1
∨g2 −→next(v2 ) = Fst (v2 ) &
2

j=2,...,n

next(vj ) = vj

j=1,...,n&j=2

next(vj ) = vj

...
∨gi −→next(vi ) = Fst (vi ) &
i

j=1,...,n&j=i

next(vj ) = vj

...
∨gn −→next(vn ) = Fstn (vn ) &
∨¬

i=1,...,n

gi &

j=1,...,n

j=1,...,n−1

next(vj ) = vj

next(vj ) = vj ;

Hence, in the modeling approach under unfair computation, the whole program is modeled as one transition relation in SMV.

4.4.1.2

Case Study 1: K-State Token Ring Program (Cont’d)

In this section, we continue with the veriﬁcation of K-state program. We ﬁrst discuss how
we model the K-state program in SMV under unfair computation. Then, we provide the veriﬁcation results under unfair computation. We compare these results with the corresponding
veriﬁcation results under weak computation. The comparison results show that for K-state
program, the veriﬁcation performance is substantially improved under unfair computation.
Figure 4.1 gives a SMV program of K-state program with k=3 under unfair computation.
As shown in Figure 4.1, x0, x1 and x2 denotes the states of the three processes. Lines 1-6
deﬁne x0, x1, x2 and initialize them. Lines 7-9 deﬁne x0priv, x1priv and x2priv, that
are used to describe privilege condition for each process. Lines 10-13 deﬁne condition1 and
condtion2 to describe the self-stabilization. Line 15 speciﬁes the program action, which is a
64

disjunction of three possible cases. These three cases include: 1)process 0 is privileged, x0
is assigned new value and states of other two processes x1 and x2 remains unchanged; 2)
process 1 is privileged, x1 is assigned new value and states of other two processes x0 and
x2 remains unchanged; and, 3) process 2 is privileged, x2 is assigned new value and states
of other two processes x1 and x0 remains unchanged. Note that if multiple processes are
privileged then one of them is non-deterministically chosen for execution.

MODULE main
VAR
x0 : {0, 1, 2};
x1 : {0, 1, 2};
x2 : {0, 1, 2};
INIT
x0 = {0, 1, 2}&
x1 = {0, 1, 2}&
x2 = {0, 1, 2}
DEFINE
x0priv := (x0 = x2);
x1priv :=!(x1 = x0);
x2priv :=!(x2 = x1);
condition1 := ((x0priv&!x1priv&!x2priv)|
(!x0priv&x1priv&!x2priv)|
(!x0priv&!x1priv&x2priv));
condition2 := AF x0priv&AF x1priv&AF x2priv;
legitimate := condition1&condition2;
SPEC AF legitimate
TRANS

(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)

(10)
(11)
(12)
(13)
(14)

((x0 = x2)&next(x0) = (x0 + 1)mod3&next(x1) = x1&next(x2) = x2)
|(!(x1 = x0)&next(x1) = x0&next(x2) = x2&next(x0) = x0)
|(!(x2 = x1)&next(x2) = x1&next(x0) = x0&next(x1) = x1)
|(next(x2) = x2&next(x0) = x0&next(x1) = x1)
(15)

Figure 4.1: k-state program under unfair computation.
65

unfair
weakly-fair

Execution time(s)
K=3
K=4
0
0
0
0.03
101
102

K=5
0
0.63
103

K=6
0
5.33
104

K=7
0.02
34.30
105

approximate state space
size of BDD nodes(under unfair)
292
786
2423
4373
8346
size of BDD nodes(under weakfair)
680
3435
11251
17880
42131
Execution time(s)
K=8
K=9
K=10 K=50 K=51
unfair
0.03
0.05
0.08 3466.30 N/A
weakly-fair
139.10 1276.08 N/A
N/A
N/A
approximate state space
107
108
1010
1084
size of BDD nodes(under unfair)
10067
11475 14870 1842498
size of BDD nodes(under weakfair) 108723 564794 N/A
N/A
Table 4.4: Veriﬁcation results for the k-state program.
Table 4.4 gives the veriﬁcation time for model checking the K-state program for diﬀerent
values of K, where 3 ≤ K ≤ 9 or K = 50. Here, N/A means that the result is not available
within an admissible amount of time (1 hour).

4.4.1.3

Case Study 2: Ghosh’s Binary Mutual Exclusion Protocol (Cont’d)

In this section, we consider Ghosh’s mutual protocol under unfair computation. We ﬁrst
describe how we model Ghosh’s mutual protocol in SMV under unfair computation. Then,
we provide the veriﬁcation results under unfair computations and compare these results with
the corresponding results under weak fair computations. The comparison results show the
scalability of modeling self-stabilizing program under unfair computation for Ghosh’s mutual
protocol.
Figure 4.2 - 4.4 shows a SMV program of Ghosh’s mutual exclusion protocol with n = 8
modeled under unfair computation. As shown in Figure 4.2, xi (i = 0 . . . 7) denotes the
states of the eight processes. Lines 1-4 deﬁne and initialize these state variables. Lines 5-12
66

unfair
weakly-fair
approximate state space
size of BDD nodes(under unfair)
size of BDD nodes(under weakfair)

unfair
weakly-fair
approximate state space
size of BDD nodes(under unfair)
size of BDD nodes(under weakfair)

n=8
0
0.4
102

Execution time(s)
n=10 n=12 n=14
0
0
0
2.93 22.43 138.05
103
103
104

1099
10082

1811 2831
4153
10483 11693 20786
Execution time(s)
n=18
n=20 n=50 n=100
0.02
0.03
0.35
4.77
2819.05 N/A N/A
N/A
105
106
1016
1031
7847
10000 10134 10288
99088
N/A N/A
N/A

n=16
0.02
693.27
104
5813
43294
-

Table 4.5: Veriﬁcation results for Ghosh’s mutual exclusion program.
deﬁne xipriv(i = 0 . . . 7), that are used to describe privilege condition for each process. Lines
13-17 deﬁne condition1 and condtion2 to describe the self-stabilization. Line 18 speciﬁes the
program action, which is a disjunction of eight possible cases. These possible cases include:
1) process 0 is privileged, x0 is assigned new value and states of other processes remains
unchanged; 2) process 1 is privileged, x1 is assigned new value and states of other processes remains unchanged; and so on. Once again, the modeling captures non-deterministic
execution one of the privileged process.
Our case study veriﬁed the Ghosh’s mutual protocol for n = 2i where 4 ≤ i ≤ 10 or
i = 25, 50. The veriﬁcation results are shown in Table 4.5. N/A in this table means the
result was not available within an admissible amount of time (1 hour).

4.4.1.4

Case Study 3: Hoepman’s Uniform Ring-orientation Program (Cont’d)

In this section, we model Hoepman’s uniform ring program under unfair computation in
SMV. We describe the modeling with 3 processes (Figure 4.5 - 4.10).
67

MODULE main
VAR
x0 : {0, 1}; x1 : {0, 1}; x2 : {0, 1}; x3 : {0, 1};
(1)
x4 : {0, 1}; x5 : {0, 1}; x6 : {0, 1}; x7 : {0, 1};
(2)
INIT
x0 = {0, 1}&x1 = {0, 1}&x2 = {0, 1}&x3 = {0, 1}&
(3)
x4 = {0, 1}&x5 = {0, 1}&x6 = {0, 1}&x7 = {0, 1}
(4)
DEFINE
x0priv := (x0! = x1);
(5)
x1priv := ((x1 = x0)&(x1 = x2)&!(x1 = x3));
(6)
x2priv := (!(x2 = x0)&!(x2 = x1)&!(x2 = x3));
(7)
x3priv := ((x3 = x2)&(x3 = x4)&!(x3 = x5));
(8)
x4priv := (!(x4 = x2)&!(x4 = x3)&!(x4 = x5));
(9)
x5priv := ((x5 = x4)&(x5 = x6)&!(x5 = x7));
(10)
x6priv := (!(x6 = x4)&!(x6 = x5)&!(x6 = x7));
(11)
x7priv := (x7 = x6);
(12)
condition1 := (
(x0priv&!x1priv&!x2priv&!x3priv&!x4priv&!x5priv&!x6priv&!x7priv)|
(!x0priv&x1priv&!x2priv&!x3priv&!x4priv&!x5priv&!x6priv&!x7priv)|
(!x0priv&!x1priv&x2priv&!x3priv&!x4priv&!x5priv&!x6priv&!x7priv)|
(!x0priv&!x1priv&!x2priv&x3priv&!x4priv&!x5priv&!x6priv&!x7priv)|
(!x0priv&!x1priv&!x2priv&!x3priv&x4priv&!x5priv&!x6priv&!x7priv)|
(!x0priv&!x1priv&!x2priv&!x3priv&!x4priv&x5priv&!x6priv&!x7priv)|
(!x0priv&!x1priv&!x2priv&!x3priv&!x4priv&!x5priv&x6priv&!x7priv)|
(!x0priv&!x1priv&!x2priv&!x3priv&!x4priv&!x5priv&!x6priv&x7priv))
(13)
condition2 := AF x0priv&AF x1priv&AF x2priv&AF x3priv
&AF x4priv&AF x5priv&AF x6priv&AF x7priv;
(14)
legitimate := condition1&condition2;
(15)

Figure 4.2: Ghosh’s mutual protocol under unfair computation (1).

We veriﬁed the Hoepman’s uniform ring program for n = 2i + 1 where 1 ≤ i ≤ 4 and
i = 50, 100, 150 and 200. The veriﬁcation results for this case are shown in Table 4.6. N/A in
this table means the result was not available within an admissible amount of time (1 hour).
68

SPEC AF legitimate
TRANS
(((x0 =!x1) & next(x0) = x1) & next(x1) = x1
& next(x2) = x2& next(x3) = x3 & next(x4) = x4
& next(x5) = x5 & next(x6) = x6 & next(x7) = x7)
|(next(x0) = x0 & (((x1 = x0) & (x1 = x2) & !(x1 = x3)) )
& next(x1) = 1 − x1 & next(x2) = x2
& next(x3) = x3 & next(x4) = x4
& next(x5) = x5 & next(x6) = x6
& next(x7) = x7)
|(next(x0) = x0 & next(x1) = x1 & ((!(x2 = x0)
& !(x2 = x1)& !(x2 = x3)) & next(x2) = 1 − x2)
& next(x3) = x3 & next(x4) = x4
& next(x5) = x5 & next(x6) = x6 & next(x7) = x7)
|(next(x0) = x0 & next(x1) = x1 & next(x2) = x2)
& (((x3 = x2) & (x3 = x4)
& !(x3 = x5))& next(x3) = 1 − x3)
& next(x4) = x4 & next(x5) = x5
& next(x6) = x6 & next(x7) = x7)
|(next(x0) = x0 & next(x1) = x1 & next(x2) = x2
& next(x3) = x3 & ((!(x4 = x2)
& !(x4 = x3) & !(x4 = x5))
& next(x4) = 1 − x4) & next(x5) = x5
& next(x6) = x6 & next(x7) = x7)
|(next(x0) = x0 & next(x1) = x1 & next(x2) = x2
& next(x3) = x3 & next(x4) = x4
& (((x5 = x4) & (x5 = x6))
& !(x5 = x7)) & next(x5) = 1 − x5
& next(x6) = x6 & next(x7) = x7)

(16)

Figure 4.3: Ghosh’s mutual protocol under unfair computation (2).

69

|(next(x0) = x0 & next(x1) = x1 & next(x2) = x2
& next(x3) = x3 & next(x4) = x4
& next(x5) = x5& ((!(x6 = x4)
& !(x6 = x5) & !(x6 = x7))
& next(x6) = 1 − x6) & next(x7) = x7)
|(next(x0) = x0 & next(x1) = x1 & next(x2) = x2
& next(x3) = x3 & next(x4) = x4
& next(x5) = x5& next(x6) = x6
& ((x7 = x6) & next(x7) = 1 − x7))
|(next(x0) = x0 & next(x1) = x1 & next(x2) = x2
& next(x3) = x3 & next(x4) = x4
& next(x5) = x5& next(x6) = x6 & next(x7) = x7)

(17)

Figure 4.4: Ghosh’s mutual protocol under unfair computation (3).

MODULE main
VAR
p0 s : {0, 1}; p0 t : {0, 1}; p0 dir : {B, F }; p0 AP 1 pc : {−1, 1};
p1 s : {0, 1}; p1 t : {0, 1}; p1 dir : {B, F }; p1 AP 1 pc : {−1, 1};
p2 s : {0, 1}; p2 t : {0, 1}; p2 dir : {B, F }; p2 AP 1 pc : {−1, 1};

(1)
(2)
(3)

INIT
p0 s = {0, 1}&p0 t = {0, 1}&p0 dir = {B, F }&p0 AP 1 pc = {−1, 1}&(4)
p1 s = {0, 1}&p1 t = {0, 1}&p1 dir = {B, F }&p1 AP 1 pc = {−1, 1}&(5)
p2 s = {0, 1}&p2 t = {0, 1}&p2 dir = {B, F }&p2 AP 1 pc = {−1, 1} (6)
Figure 4.5: Hoepman’s ring program under unfair computation (1).

70

DEFINE
p0 des := ((p0 dir = B)&(p0 AP 1 pc = −1))
|((p0 dir = F )&(p0 AP 1 pc = 1));
p1 des := ((p1 dir = B)&(p1 AP 1 pc = −1))
|((p1 dir = F )&(p1 AP 1 pc = 1));
p2 des := ((p2 dir = B)&(p2 AP 1 pc = −1))
|((p2 dir = F )&(p2 AP 1 pc = 1));
p0 S1 := case(p0 AP 1 pc = −1) : p2 s;
1 : p1 s; esac;
p0 T 1 := case(p0 AP 1 pc = −1) : p2 t;
1 : p1 t; esac;
p0 S2 := case(p0 AP 1 pc = −1) : p1 s;
1 : p2 s; esac;
p0 T 2 := case(p0 AP 1 pc = −1) : p1 t;
1 : p2 t; esac;
p0 r1 := (p0 S1 = p0 S2);
p0 r2 := (p0 S1 = p0 s)&(p0 S2 =!p0 s)
&(p0 T 2 = p0 t)&(p0 T 1 =!p0 t)&(p0 t = 1);
p0 r3 := (p0 S2 = p0 s)&(p0 S1 =!p0 s)
&(p0 T 1 = p0 t)&(p0 T 2 =!p0 t)&(p0 t = 1);
p0 r4 := ((p0 S1 = p0 s)&(p0 S2 =!p0 s)
&(p0 T 1 = p0 t))|((p0 S2 = p0 s)
&(p0 S1 =!p0 s)&(p0 T 2 = p0 t));

(7)
(8)
(9)

(10)
(11)
(12)
(13)
(14)
(15)
(16)

(17)

Figure 4.6: Hoepman’s ring program under unfair computation (2).

71

DEFINE
p1 S1 := case(p1
1 : p2 s; esac;
p1 T 1 := case(p1
1 : p2 t; esac;
p1 S2 := case(p1
1 : p0 s; esac;
p1 T 2 := case(p1
1 : p0 t; esac;

AP 1 pc = −1) : p0 s;
AP 1 pc = −1) : p0 t;

(18)
(19)

AP 1 pc = −1) : p2 s;

(20)

AP 1 pc = −1) : p2 t;

(21)

p1 r1 := (p1 S1 = p1 S2);
p1 r2 := (p1 S1 = p1 s)&(p1 S2 =!p1 s)
&(p1 T 2 = p1 t)&(p1 T 1 =!p1 t)&(p1 t = 1);
p1 r3 := (p1 S2 = p1 s)&(p1 S1 =!p1 s)
&(p1 T 1 = p1 t)&(p1 T 2 =!p1 t)&(p1 t = 1);
p1 r4 := ((p1 S1 = p1 s)&(p1 S2 =!p1 s)
&(p1 T 1 = p1 t))|((p1 S2 = p1 s)
&(p1 S1 =!p1 s)&(p1 T 2 = p1 t));

(22)
(23)
(24)

(25)

Figure 4.7: Hoepman’s ring program under unfair computation (3).

unfair
weakly-fair
approximate state space
size of BDD nodes(under unfair)
size of BDD nodes(under weakfair)

unfair
weakly-fair
approximate state space
size of BDD nodes(under unfair)
size of BDD nodes(under weakfair)

n=3
0
0.17
101

Execution time(s)
n=5
n=7
n=9
0.03
0.08
0.12
18.23 1113.77 N/A
103
104
105

n=51
11.95
N/A
1031

4090
10425

10047
11680
10543 63468
61570 776284
N/A
N/A
Execution time(s)
n=101 n=201 n=301 n=303
95.65 875.23 3420.98 N/A
N/A
N/A
N/A
N/A
60
121
181
10
10
10
125809 324420 726620
N/A
N/A
N/A
-

Table 4.6: Veriﬁcation results for Hoepman’s ring-orientation program.

72

DEFINE
p2 S1 := case(p2
1 : p0 s; esac;
p2 T 1 := case(p2
1 : p0 t; esac;
p2 S2 := case(p2
1 : p1 s; esac;
p2 T 2 := case(p2
1 : p1 t; esac;

AP 1 pc = −1) : p1 s;
AP 1 pc = −1) : p1 t;

(26)
(27)

AP 1 pc = −1) : p0 s;

(28)

AP 1 pc = −1) : p0 t;

(29)

(30)
p2 r1 := (p2 S1 = p2 S2);
p2 r2 := (p2 S1 = p2 s)&(p2 S2 =!p2 s)
&(p2 T 2 = p2 t)&(p2 T 1 =!p2 t)&(p2 t = 1); (31)
p2 r3 := (p2 S2 = p2 s)&(p2 S1 =!p2 s)
&(p2 T 1 = p2 t)&(p2 T 2 =!p2 t)&(p2 t = 1); (32)
p2 r4 := ((p2 S1 = p2 s)&(p2 S2 =!p2 s)&(p2 T 1 = p2 t))
|((p2 S2 = p2 s)&(p2 S1 =!p2 s)
&(p2 T 2 = p2 t));
(33)
legitimate := AG(p0 des&p1 des&p2 des)
|AG(!p0 des&!p1 des&!p2 des);

(34)

Figure 4.8: Hoepman’s ring program under unfair computation (4).

73

SPEC AF legitimate

(35)

TRANS
(((p0 r1&next(p0 s) =!p0 S1)|((p0 r2|p0 r3)&next(p0 s) =!p0 s)
|next(p0 s) = p0 s)
&((p0 r1&next(p0 t) = 1)|((p0 r2|p0 r3|p0 r4)
&next(p0 t) =!p0 t)|next(p0 t) = p0 t)
&((p0 r2&next(p0 dir) = F )|(p0 r3&next(p0 dir) = B)
|next(p0 dir) = p0 dir)&(next(p0 AP 1 pc) = p0 AP 1 pc)
&next(p1 s) = p1 s&next(p1 t) = p1 t&next(p1 dir) = p1 dir
&next(p1 AP 1 pc) = p1 AP 1 pc&next(p2 s) = p2 s
&next(p2 t) = p2 t&next(p2 dir) = p2 dir
&next(p2 AP 1 pc) = p2 AP 1 pc)&(p2 T 2 = p2 t));
(36)
|
(37)
(((p1 r1&next(p1 s) =!p1 S1)|((p1 r2|p1 r3)
&next(p1 s) =!p1 s)|next(p1 s) = p1 s)
&((p1 r1&next(p1 t) = 1)|((p1 r2|p1 r3|p1 r4)
&next(p1 t) =!p1 t)
|next(p1 t) = p1 t)&((p1 r2&next(p1 dir) = F )|(p1 r3
&next(p1 dir) = B)
|next(p1 dir) = p1 dir)&(next(p1 AP 1 pc) = p1 AP 1 pc)
&next(p0 s) = p0 s
&next(p0 t) = p0 t&next(p0 dir) = p0 dir
&next(p0 AP 1 pc) = p0 AP 1 pc
&next(p2 s) = p2 s&next(p2 t) = p2 t
&next(p2 dir) = p2 dir
&next(p2 AP 1 pc) = p2 AP 1 pc)
(38)

Figure 4.9: Hoepman’s ring program under unfair computation (5).

74

|
(39)
(((p2 r1&next(p2 s) =!p2 S1)|((p2 r2|p2 r3)
&next(p2 s) =!p2 s)
|next(p2 s) = p2 s)
&((p2 r1&next(p2 t) = 1)|((p2 r2|p2 r3|p2 r4)
&next(p2 t) =!p2 t)|next(p2 t) = p2 t)
&((p2 r2&next(p2 dir) = F )|(p2 r3&next(p2 dir) = B)
|next(p2 dir) = p2 dir)&(next(p2 AP 1 pc) = p2 AP 1 pc)
&next(p0 s) = p0 s&next(p0 t) = p0 t
&next(p0 dir) = p0 dir&next(p0 AP 1 pc) = p0 AP 1 pc
&next(p1 s) = p1 s&next(p1 t) = p1 t
&next(p1 dir) = p1 dir&next(p1 AP 1 pc) = p1 AP 1 pc) (40)

Figure 4.10: Hoepman’s ring program under unfair computation (6).

4.4.1.5

Analysis

Based on the results in Tables 4.4, 4.5 and 4.6, veriﬁcation of signiﬁcantly large system is
possible if we consider unfair computations. As an illustration, consider K-state program:
It was possible to achieve veriﬁcation for K = 50 in less than 1 hour. And, in this case,
the corresponding state space is 1085 . By contrast, veriﬁcation with weak fairness could not
complete when state space was 1011 . This diﬀerence in performance can be explained by
observing the size of OBDDs representing state sets in the forward search.

Compared with the approach in [225], the approach presented here has the OBDD size
for the reached states running linearly. The overall time complexity for this approach is
O(n3 ). As introduced in [173], this performance dues to three factors: a linear increase in
the transition relation OBDD, a linear increase in the state set OBDD, and a linear increase
in the number of iterations required for successful veriﬁcation.
75

4.4.2

Utilizing Decomposition to Reduce the Cost of Using Symbolic Model Checking to Verify Self-stabilizing Program

The results in Section 4.3 show that veriﬁcation of self-stabilization under unfair computation
is substantially faster than that under weakly-fair computation. Thus, the natural question
is what can a designer do if the program at hand requires weak fairness to provide selfstabilization, i.e., the program is not self-stabilizing under unfair computations. Examples
of such programs include [18, 19, 116].
Generally, there are two approaches that can be used when model checking fails due to
lack of suﬃcient time or space. The ﬁrst approach is to require the designer to perform
extra work (e.g., abstraction, decomposition, identifying partial order reductions, etc.) that
will reduce the cost of veriﬁcation. The second approach is to verify a variation of the
model/property such that the variation will still provide a reasonable assurance about the
goal at hand.
In this section, we focus on the ﬁrst such approach where we show that decomposition of
a self-stabilizing program can provide substantial beneﬁt in reducing the cost of veriﬁcation.
We note that while decomposition is one of the approaches in reducing cost of veriﬁcation,
the eﬀect of this approach in model checking of self-stabilizing programs is not addressed.
To address this limitation, in this approach, designer needs to partition the program into
components such that each component satisﬁes its property without fairness. Subsequently,
we can use existing composition results to show that their composition is correct under
fair execution. There are several such approaches to show that the composed program is
self-stabilizing based on the properties of individual components. Since the focus of our
work is not to identify new strategies of interference freedom, we only consider some of the
76

simple and commonly used approaches and describe them, next. We note, however, that the
subsequent discussion also applies to other approaches [64, 88] for proving self-stabilization
of a program that consists of several components.
Let C1 and C2 be two components (programs) such that variables of C1 and C2 are
disjoint. Let p be the program obtained by combining the actions of C1 and C2. Let
legitimate states of C1 and C2 be I1 and I2 respectively. Then, the wellknown and simple
theorem about such composition is as follows:
Theorem 4.4.1. If
• C1 is weakly-fair stabilizing for I1
• C2 is weakly-fair stabilizing for I2
Then
• C1[]C2 is weakly-fair stabilizing for I1 ∧ I2, where C1[]C2 is the program obtained by
taking union of actions of C1 and C2.

Although straightforward, this theorem can assist in reducing veriﬁcation time if the
fairness requirement is needed essentially to ensure that both components get a chance
to execute. In other words, if the components themselves are self-stabilizing under unfair
computations then the designer can verify the preconditions of this theorem easily under
unfair computation. Self-stabilization under unfair computations implies self-stabilization
under weak fairness. Hence, preconditions of the theorem can be proven easily. Moreover, the
conclusion of the theorem allows us to ensure that the composed program is self-stabilizing
under weak fairness.
77

There are several such theorems that provide the ability to conclude self-stabilization
property of the composed program by self-stabilization property of the components. Another
wellknown theorem relates to superposition where program p consists of two components C1
and C2, where C1 is superposed on C2. In other words, C1 can only read the variables of
C2 and C2 can neither read nor write variables of C1. Then, the wellknown theorem about
superposition is as follows:
Theorem 4.4.2. If
• C1 is weakly-fair stabilizing for I1
• C2 is weakly-fair stabilizing for I2
• After a state in I2 is reached, no action in C2 is enabled
Then
• C1[]C2 is weakly-fair stabilizing for I1 ∧ I2, where C1[]C2 is the program obtained by
taking union of actions of C1 and C2.
Again, similar to the approach above, we may be able to verify each component without
fairness assumption. However, fairness is required for the composed program to ensure that
each component gets a chance to execute. Again, this will allow us to conclude correctness of
the composed program under weak fairness by expediting the veriﬁcation time for individual
components.
There are several instances where such superposition or variations thereof are used. In
particular, one variation is that it suﬃces if C1 ensure convergence to I1 by assuming that
I2 holds already. Also, the third condition (termination of C2) can also be replaced by
78

other non-interference conditions that are less restrictive. Next, we discuss some of these
examples.

4.4.2.1

Case Study 4: Huang’s Mutual Exclusion in Uniform Rings

In [116], authors propose a self-stabilizing mutual exclusion program that consists of two
components: (1) leader election component and (2) token circulation component. The ﬁrst
component consists of a leader election program on an oriented uniform ring where the
number of nodes is prime. The second component consists of a token circulation component
that requires a unique process (such as process 0 in Case Study 1.) Since veriﬁcation of the
second component is similar to that in Section 4.4.1.2, we only focus on the leader election
component.
The leader election component maintains a variable v.j at every process j. The actions
of the processes are as follows (lef t and right denote the left and right neighbor of process
j in the ring):

−→ v.j = (v.j + 1) mod n;

K1 ::

v.lef t = v.j = v.right

K2 ::

gap1 (lef t, j) < gap(j, right) −→ v.j = (v.j + 1) mod n;

The above actions require that a node increments its value if either (1) its value equals
that of its left and its right neighbor or (2) gap with the left node is less than the gap with
the right node.
1
gab(a, b) =

n
(b − a) mod n
79

if a = b
otherwise

Execution time(s)
n=3
unf airleader election
0
unf airtoken circulation
0
unf airtotal with decomposition
0
weakly − f airleader election
0
weakly − f airtoken circulation
0
weakly − f airtotal with decomposition
0
weakly − f airtotal without decomposition
weakly − f airtotal without decomposition 0.17
approximate state space
104

n=5
0
0
0
4.15
0.63
4.79
N/A
1010

n=7
0.05
0.02
0.07
N/A
N/A
N/A
1017

n=11
0.48
0.15
0.63
N/A
N/A
N/A
1034

Execution time(s)
n=23 n=29 n=31
unf airleader election
47.12 271.05 704.48
unf airtoken circulation
14.57 70.18 103.8
unf airtotal with decomposition
61.69 341.23 808.28
weakly − f airleader election
N/A
N/A
N/A
weakly − f airtoken circulation
weakly − f airtotal with decomposition
N/A
N/A
N/A
weakly − f airtotal without decomposition
weakly − f airtotal without decomposition N/A
N/A
N/A
approximate state space
1093 10127 10138
Table 4.7: Veriﬁcation results for Huang’s mutual exclusion program.

80

-

This program requires fairness for correctness. Without fairness, leader election component may not be able to execute. However, each component can be veriﬁed separately
without fairness. Finally, based on Theorem 4.4.2, we can conclude that the overall program is self-stabilizing under weak fairness. Table 4.7 gives the veriﬁcation performance by
utilizing symbolic model checking procedure for veriﬁcation of leader election component.
From this table, we can see that veriﬁcation of self-stabilization is signiﬁcantly more scalable
with decomposition and the use of unfair computation for verifying self-stabilization of each
component. Moreover, the signiﬁcant beneﬁt in reduction in time is based on the use of
unfair scheduler as opposed to the use of decomposition. (In Table 4.7, ‘-’ denotes that the
experiment was not performed for token circulation since the corresponding experiment for
leader election could not be completed in the permissible time.)

4.4.2.2

Case Study 5: Self-stabilizing Program based on Raymond’s Tree algorithm

This second example is the self-stabilizing program based on Raymond’s tree algorithm for
mutual exclusion [194]. In this program, the processes are arranged in a ﬁxed 2 tree, called
the parent tree. On this ﬁxed tree, a dynamic holder tree is superposed such that the holder
of a node is one of its tree neighbors (including itself). A node j has a token iﬀ its holder
(h.j) equals j. There is one action that allows a node to send the token to its neighbors
(Kpassing ). In this action, if node k has a token then it can pass it to its neighbor j by
changing the holder relation of j and k. Additionally, there are three convergence actions.
The ﬁrst action ensures that the holder of a node is a tree neighbor. The second action
2 By ﬁxed, we mean that p.j is ﬁxed and hard coded in the actions themselves and, hence,
cannot be corrupted.
81

ensures that on any edge between j and (p.j), either holder of j is same as p.j or the holder
of p.j is j. And, the third action ensures that holder relation does not have cycles. Thus,
the actions of the self-stabilizing tree program are as shown next:

Kpassing ::

h.k = k ∧ h.j = k

−→ h.j = j, h.k = j;

Kconvergence::h.j = NBR.j ∪ j

−→ h.j = p.j;

j = p.j ∧ h.j = p.j ∧ h.(p.j) = j

−→ h.j = p.j;

j = p.j ∧ h.j = h.(p.j) ∧ h.(p.j) = j −→ h.j = p.j;

This program requires fairness for stabilization; without stabilization, nodes could simply execute the token passing action (Kpassing ) thereby preventing stabilization. However,
correctness of the convergence actions and the correctness of closure actions can be independently veriﬁed without fairness. Furthermore, the results from [1] can be used to show that
these two components do not interfere and, hence, the overall program is self-stabilizing.
Table 4.18 gives the veriﬁcation time for each component under unfair scheduler. It also
gives veriﬁcation time for convergence under weakly-fair scheduler. Since the token passing
component changes the variables from two diﬀerent nodes, we were not able to implement
it under weakly-fair scheduler. However, it is straightforward to observe that the time for
veriﬁcation of the composed program (with token passing and convergence actions) will be
more than the time for veriﬁcation with convergence actions alone. Hence, the beneﬁt of
decomposition and use of unfair scheduler in reducing the cost of veriﬁcation follows from
the results in Table 4.18.
82

Execution time(s)
n=7 n=15
unf airconvergence
0.02 0.10
unf airpassing
0
0.10
weakly − f airconvergence 0.1 17.67
approximate state space
105 1017

n=31 n=63
1.95
N/A
39.97 N/A
N/A N/A
1046 10113

Table 4.8: Veriﬁcation results for Raymond-tree based program.

4.4.2.3

Other Examples and Approaches for Identifying Components

Another example is that of distributed reset [18] where the program consists of a tree layer
and a wave layer. The tree layer constructs a tree from the processes that are still up.
Subsequently, the wave layer utilizes this tree to achieve distributed reset. Again, weak
fairness ensures that each component can always execute although the component itself can
be veriﬁed without fairness.

For the case where decomposition is not straightforward the proof of stabilization can
assist in identifying the desired decomposition. Speciﬁcally, one common way to prove selfstabilization is to use the the approach of Gouda and Multari [93]. Speciﬁcally, in this
approach, the state space itself is partitioned into concentric circles, R0 , R1 , . . . , Rn , where
R0 corresponds to the entire state space, Rn corresponds to the set of legitimate states and
Ri ⊃ Rj if 0 < i < j < n. It is required that if the program starts in any state in Ri ,
0 ≤ i < n then (1) it always stays in states in Ri , and (2) it eventually reaches a state
in Ri+1 . Again, fairness can assist in this approach in ensuring that the overall program
is self-stabilizing although one or more convergence requirements can be veriﬁed without
fairness thereby reducing the time for veriﬁcation.
83

4.4.3

Utilizing Weak Stabilization to Improve Scalability of Model
Checking of Self-stabilizing Program

In this section, we focus on the second approach for improving scalability for veriﬁcation
of self-stabilizing programs. Speciﬁcally, if the program at hand requires weakly-fair computations to provide self-stabilization and the time for such veriﬁcation is prohibitive, the
designer can focus on a variation of self-stabilization, namely weak self-stabilization [92],
where weak stabilization has been demonstrated as a ‘good approximation’ of stabilization.
Furthermore, in [58], Devismes et al have shown how to transform a weak-stabilizing program into a probabilistic stabilizing program. Thus, if the assurance of self-stabilization is
not possible, the designer can obtain a slightly lower assurance provided by weak stabilization. Moreover, the designer can utilize the transformation in [58] to obtain probabilistic
assurance regarding self-stabilization. Next, we recall the deﬁnition of weak stabilization
and a relevant theorem about it from [92].
Deﬁnition 4.4.1. (Weak stabilization)
Let p be a program and let I be a state predicate of p. We say that p is weakly stabilizing
for I iﬀ:
1. closure: if (s0 , s1 ) is a transition of p and s0 ∈ I, then s1 ∈ I;
2. weak convergence: for every state s, there is a computation of p that starts at s and
reaches I.
Deﬁnition 4.4.2. (Strongly-fair computation)
σ = s0 , s1 , ... is strongly-fair computation iﬀ:
1. σ is an unfair computation of p, and
84

2. If any state s is included inﬁnitely often in σ and (s, s′ ) is a transition of p then the
subsequence s, s′ must be included inﬁnitely often in σ.
Theorem 4.4.3. A weakly-stabilizing system is also a self-stabilizing system if:
1. The system has a ﬁnite number of states, and
2. Every computation is under strong fairness.
We can utilize the above result as follows: If we cannot verify that p is self-stabilizing
due to time/space limitations, we can verify that p is weakly stabilizing. By Deﬁnition 4.4.1,
this does not require one to model fairness explicitly. This will allow us to obtain some
assurance (although somewhat weaker) about p. Additionally, the designer can utilize the
transformation from [58] to obtain program p′ that is probabilistically stabilizing.
Next, we revisit case studies 1-5 to evaluate the cost of verifying weak stabilization. Tables 4.9-4.13 compare the cost of verifying self-stabilization under weak fairness with that of
verifying weak stabilization. As we can see, the cost of verifying weak stabilization is substantially less (and is very close to the cost of verifying self-stabilization without fairness). Also,
veriﬁcation of weak stabilization is signiﬁcantly more scalable than that of self-stabilization.
For example, in Dijkstra’s K-state program, it was possible to verify self-stabilization for
only 9 processes (state space 108 ) whereas it was possible to verify weak stabilization for 50
processes (state space 1084 ).

4.5

A Constraint-based Approach

The approach in Section 4.3 assumes fairness in the program model. And, modeling of
a (weak) fairness scheduler (if essential) aﬀects the veriﬁcation performance signiﬁcantly.
85

K-state Program
Execution Time(s)
Number of Processes
3
weak stabilization
0
stablization under weak fairness
0
approximate state space
101

4
0
0.03
102

5
0
0.63
103

Execution time(s)
Number of Processes
6
weak stabilization
0
stablization under weak fairness
5.33
approximate state space
104

7
0.02
34.30
105

8
0.03
139.10
107

Execution time(s)
Number of Processes
9
weak stabilization
0.05
stablization under weak fairness 1276.08
approximate state space
108

10
0.08
N/A
1010

50
3485.27
N/A
1084

Table 4.9: Veriﬁcation cost of weak stabilization vs. stabilization (1).

Ghosh’s mutual exclusion program
Execution time(s)
Number of Processes
8
10
12
weak stabilization
0
0
0
stablization under weak fairness
0.4
2.93
22.43
2
3
approximate state space
10
10
103
Execution time(s)
Number of Processes
14
16
18
weak stabilization
0
0
0.02
stablization under weak fairness 138.05 693.27 2819.05
approximate state space
104
104
105
Execution time(s)
Number of Processes
20
50
100
weak stabilization
0.03
0.35
4.9
stablization under weak fairness N/A
N/A
N/A
approximate state space
106
1016
1031
Table 4.10: Veriﬁcation cost of weak stabilization vs. stabilization (2).

86

Hoepman’s ring-oritentation program
Number of Processes
3
5
7
Execution time(s)
weak stabilization
0
0.037
0.08
stablization under weak fairness 0.17
18.23 1113.77
1
approximate state space
10
103
104
Number of Processes
9
51
101
Execution time(s)
weak stabilization
0.13
11.88
95.9
stablization under weak fairness
5
31
approximate state space
10
10
1060
Number of Processes
201
301
Execution time(s)
weak stabilization
881.5 3442.18
stablization under weak fairness
approximate state space
10121 10181
10241
Table 4.11: Veriﬁcation cost of weak stabilization vs. stabilization (3).

Huang’s ring program
Execution time(s)
n=3 n=5
weak stabilization
0
0.07
stablization under weak fairness 0.17 N/A
approximate state space
104 1010

n=7
0.63
N/A
1017

n=11
N/A
N/A
1034

Table 4.12: Veriﬁcation cost of weak stabilization vs. stabilization (4).

Raymond-tree based mutual exclusion
Execution time(s)
n=7 n=15
weak stabilization
0
0.12
weakly − f airconvergence 0.1 17.67
approximate state space
1011 1035

program
n=31 n=63
2.15
N/A
N/A N/A
1092 10226

Table 4.13: Veriﬁcation cost of weak stabilization vs. stabilization (5).

87

Moreover, the symbolic model checking approach relies on an optimal variable ordering since
this order could have a signiﬁcant impact on the size of the BDD graph. Unfortunately, the
problem of ﬁnding the best variable ordering is NP-hard. To address these issues, in this
section, we consider a constraint-based approach to analyze stabilization. This approach
does not require modeling fairness and does not assume any order information of variables.
The key insight is to reduce the task of verifying whether a program is stabilizing into a wellstudied problem: constraint solving, which can be solved by many existing highly optimized
solutions. Speciﬁcally, our approach leverages the power of the oﬀ-the-shelf SMT solvers
that have demonstrated the ability of solving industrial sized-satisﬁability instances in the
last decade.

4.5.1

Approach for Verifying Stabilization with SMT Solvers

In this section, we present an constraint approach of verifying self-stabilization properties.
The key idea of this approach is to translate the problem into a formula and then to use
SMT solvers to analyze the formula.
Veriﬁcation of stabilization consists of two parts: (1) veriﬁcation of closure and (2) veriﬁcation of convergence. In Section 4.5.1.1, we identify the formula whose satisﬁability can
be used to determine whether closure property is satisﬁed. In Section 4.5.1.4, we identify an
algorithm for verifying convergence by using the formulae developed in Sections 4.5.1.2 and
4.5.1.3.
88

4.5.1.1

Veriﬁcation of Closure

Let P be the given program and let I be the legitimate state predicate used in Deﬁnition
4.2.2 to conclude that P is stabilizing. Let T be the predicate that characterizes transitions
of P. Observe that the closure property requires that if (s0 , s1 ) is a transition of program
P and state s0 is a legitimate state then state s1 is also a legitimate state. Thus, this can
be captured by formula ¬Ψl , where
Ψl = (I(s0 ) ∧ T (s0 , s1 ) ∧ ¬I(s1 ))
Remark 4.5.1. For compactness, the formula Ψl does not explicitly specify the program or
the set of legitimate states that are inputs in deciding closure. In this dissertation, these
two parameters can be determined based on the context. We use similar approach for other
formulae as well.
Based on whether Ψl is satisﬁable or not, we have two scenarios, SC1 and SC2 :
1. SC1 : if Ψl is satisﬁable then it proves that it is possible to begin in a legitimate state,
execute a program transition and be in a state that is not a legitimate state. This
implies that the closure property is not satisﬁed. Moreover, in this case, assignment
to s0 and s1 (which in turn includes values of variables of the program in state s0 and
s1 ) provides a counterexample.
2. SC2 : if Ψl is unsatisﬁable then this implies that the closure property is satisﬁed.
4.5.1.2

Veriﬁcation of Convergence

To verify convergence, we use approach from bounded model checking [33]. We verify convergence by checking that starting from an arbitrary state, the program, say P, reaches a
89

legitimate state (in I) in k steps, where k is a given parameter used in the veriﬁcation. Observe that the convergence property requires us to consider a sequence of states, s0 , s1 , · · · , sk
such that each successive transitions are program transitions. Moreover, to verify (negation
of) convergence requirement, we require that I(sk ) should be false. Additionally, in this veriﬁcation, we can utilize the closure requirement to add additional constraints requiring that
I(sj ), 0 ≤ j ≤ k, should be false. Additionally, in bounded model checking, one typically
adds constraint about what the initial state should be. However, in convergence, the initial
state can be arbitrary and, hence, there is no corresponding constraint for the initial state.
Thus, the formula used for verifying convergence is as follows:

Ψv = T (s0 , s1 ) ∧ T (s1 , s2 ) ∧ · · · ∧ T (sk−1 , sk )
¬I(s0 ) ∧ ¬I(s1 ) ∧ · · · ∧ ¬I(sk )

Based on whether Ψv is satisﬁable or not, we have the following two scenarios:

1. SC3 : if Ψv is satisﬁable, convergence cannot be achieved in k steps. In this case, the
number of steps needs to be increased. If the state space of the program is ﬁnite and
k equals the number of states in the program then this implies that the convergence
property is not satisﬁed. However, a simple cycle detection algorithm (discussed next)
can be used to conclude that the program is not stabilizing for smaller values of k.

2. SC4 : if Ψv is unsatisﬁable, then it proves that even if we begin in an arbitrary state,
it is impossible for the program to be in an illegitimate state if it executes for k steps.
In other words, the convergence property is satisﬁed.
90

4.5.1.3

Resolving Ambiguity by Cycles Detection

As discussed in Section 4.5.1, when Ψv is satisﬁable, either the given program is not stabilizing or the value of k is too small. To distinguish between these scenarios, we use an
approach of resolving ambiguity by checking for an existence of a cycle outside legitimate
states. The main idea of this approach is to check whether the given program can run into
a cycle that is outside legitimate states.
To check whether the given program can execute a cycle that is outside legitimate states,
we consider the behavior of the given program for k steps. Hence, we construct a formula
similar to that of Ψv . Additionally, the computation created by Ψv creates a cycle iﬀ some
state is repeated in this path. This can be checked by adding another constraint that two
of the states visited are identical. In other words, the added constraint is that there exists
two states sj and sk , where j < k and sj = sk .
Note that in case of stabilization, the initial state is arbitrary. Hence, if there exists a
cycle where sj = sk then there exists a suﬃx of the given computation that begins with sj .
In other words, it suﬃces to check whether state s0 is repeated in the given computation.
Hence, the formula used for detecting cycle is as follows:

Ψy = Ψv ∧ (s0 ∼ s1 ) ∨ (s0 ∼ s2 ) ∨ · · · ∨ (s0 ∼ sk )
=
=
=
In the above formula, we use (s0 ∼ s1 ). One implementation of this is to require s0 = s1 .
=
However, in certain cases, it may be suﬃcient to reach a state that is similar to the initial
state. Such similarity is application dependent. One example of this is based on symmetry
of processes and values.
Based on whether Ψy is satisﬁable or not, we have two scenarios: SC5 and SC6 :
91

1. SC5 : if Ψy is satisﬁable then this implies that there is a computation of the given
program that starts in state s0 and revisits state s0 without reaching a legitimate state
in between. This implies that there is a possibility that the program may never reach
a legitimate state. In other words, the given program is not stabilizing.

2. SC6 : if Ψy is unsatisﬁable then there are two possibilities: either the given program is
stabilizing and, hence, such a cycle cannot exist or the number of steps is insuﬃcient
to create a cycle.

4.5.1.4

Combining Veriﬁcation of Convergence and Cycle Detection

Depending upon the satisﬁabilty of Ψv and Ψy , we have four possibilities. Considering
these four possibilities, we can determine whether the given program is stabilizing or not. We
illustrate this checking process by Algorithm presented in Figure 4.11. Line 1 ﬁrst constructs
formula Ψv and Ψy . Then, the algorithm utilizes bounded model checking techniques to check
the satisﬁability of the two formulas in the loop starting from Line 2 and ending at Line 12.
If the condition identiﬁed in Line 3 is satisﬁed, the algorithm concludes that the program is
stabilizing. If the condition identiﬁed in Line 6 is satisﬁed, the algorithm concludes that the
program is not stabilizing. And if the condition identiﬁed in Line 9 is satisﬁed, the algorithm
simply increases the value of k and repeat checking the conditions identiﬁed in Line 3, 6 and
9.
Note that the above algorithm begins with k = 1. However, a better approach is to
begin with k to be the expected number of steps for convergence. Also, the algorithm can be
tuned in terms of how k is increased. Note that for ﬁnite state program, the above program is
guaranteed to terminate. For inﬁnite state programs, however, it may not. This is expected
92

Algorithm 1: Stabilization Veriﬁcation
Input: P: program to be veriﬁed;
I: set of legitimate states.
Construct Ψv and Ψy for P&I.
(1)
For {k = 1 → · · ·}{
(2)
IF {Ψv is unsatisﬁable }
(3)
Print given program is stabilizing.
(4)
ENDIF
(5)
IF {Ψy is satisﬁable }
(6)
Print given program is not stabilizing. (7)
ENDIF
(8)
// Since Ψy ⇒ Ψl
(9)
// it is impossible for Ψy to be satisﬁable
// and Ψl to be unsatisﬁable.
IF {Ψv is satisﬁable and Ψy is unsatisﬁable} (10)
increase the value of k
(11)
ENDIF
(12)
} // endfor
(13)
Figure 4.11: The algorighm for determing stabilization.

given that the halting problem can be trivially reduced to veriﬁcation of stabilization.

4.5.2

Experimental Results

In this section, we present our case studies. These case studies include K-state token ring
program [59] and Ghosh’s mutual exclusion program [87] and Stabilizing Tree based mutual
exclusion [194]. We use the SMT solver Yices [67] to verify the stabilization property. We
note that we have also used Z3 [52]. While the exact numbers associated with Z3 are diﬀerent,
the observations and conclusions from this section still hold. Hence, the results for Z3 are
not presented in this dissertation.
93

Number of nodes
3
4
5
6

Number of steps
for convergence
4
14
25
39

Execution time(s)
for convergence
0.0044
0.01229
0.209468
154.1079279

Execution time(s)
for closure
0.003928
0.004257
0.005399
0.004608

Table 4.14: Veriﬁcation time for Ψv for token ring with unbounded variables.
4.5.2.1

K-State Token Ring Program

In this section, we illustrate the proposed constraint-based approach by studying k-state
token ring program. Although the algorithm in Section 4.5.1.4 attempts diﬀerent values of k
to decide whether the program is stabilizing, in this section, we only focus on the value of k
for which we can conclude that the program is stabilizing (respectively, not stabilizing). This
is due to the fact that several heuristics (e.g., analysis of the program to evaluate expected
number of steps) can be used to limit the values of k used in Section 4.5.1.4. Hence, we only
focus on the value of k for which the algorithm terminates.
Remark 4.5.2. We consider three variations of k-state token ring program: In the ﬁrst
variation, K is set to N + 1. In the second variation, value of x is unbounded and, hence,
K0 simply increments x.0. Finally, we consider the value of K = 2 in Section 4.5.3.
We evaluate the performance of the token ring program for both bounded and unbounded
setting. Tables 4.14 and 4.15 respectively illustrate the time for verifying the closure and
the convergence property for the bounded and unbounded version of the token ring.
As we can observe, the veriﬁcation time is signiﬁcantly lower for unbounded version of
the token ring. In particular, for the case where x values are unbounded, it is possible to
verify the convergence property of a ring with 5 processes in less than a second. However,
the corresponding time for program with bounded x value is 214 seconds.
94

Number of nodes
State space
Number of steps
for convergence
Execution time(s)
for convergence
Execution time(s)
for closure

3
101

4
102

5
103

4

14

25

0.008944 0.494496 214.0957
0.005617 0.005979 0.013349

Table 4.15: Veriﬁcation time for Ψv for token ring with bounded variables.
Number of nodes
State space
Number of steps
for convergence
Execution time(s)
for convergence
Execution time(s)
for closure

3
101

4
102

5
103

4

14

25

0.005855 0.090116 33.526028
0.005387 0.006480

0.006716

Table 4.16: Veriﬁcation time for Ψv for token ring with split actions for K0 .
One of the reasons for this is that the bounded version utilizes a modulo operation.
One can attempt to revise the token ring program to simplify the mod operation to gain a
substantial beneﬁt. Speciﬁcally, Table 4.16 considers the case where action K0 is split into
two actions: The ﬁrst action executes only if x.0 is not equal to K − 1. And, it increments
the value of x.0. The second action executes only if x.0 equals K − 1. And, it resets x.0 to
0. With this change, the veriﬁcation time for ﬁve processes reduces from 214 seconds to 33
seconds.

4.5.2.2

Ghosh’s Binary Mutual Exclusion Protocol

Table 4.17 gives the performance results of Ghosh’s program. The result demonstrates
that the veriﬁcation cost (along with number of steps necessary for convergence) increases
95

Number of nodes
State Space
Number of steps
for convergence
Execution time(s)
for convergence
Execution time(s)
for closure

8
102

10
103

12
103

14
104

9

16

25

36

0.040316 0.470958 11.166705 314.851055
0.007433 0.012232

0.009348

0.015544

Table 4.17: Veriﬁcation results for Ghosh’s program using SMT solver.

Number of nodes
State space
Number of steps
for convergence
Execution time(s)
for convergence
Execution time(s)
for closure

3
101

7
105

9
108

5

8

9

0.007162 1.377709 125.633908
0.005046 0.004943

0.005432

Table 4.18: Veriﬁcation results for Raymond tree-based program.

substantially when the number of processes is more than 12.

4.5.2.3

Stabilizing Tolerant Version of Tree-based Mutual Exclusion Algorithm

In Table 4.18, we present the performance results for verifying the stabilizing tolerant version
of tree-based mutual exclusion algorithm. The results demonstrate that the veriﬁcation cost
is very low when the number of processes is no more than 9. The veriﬁcation cost (along
with number of steps necessary for convergence) increases substantially when the number of
processes is more than 9.
96

4.5.3

Veriﬁcation of Token Ring in Synchronous Semantics

The computation model we considered in Deﬁnition 2.1.5 corresponds to interleaving semantics where in each step one of the actions is executed. Another computation model uses
synchronous semantics. Here, the program actions are partitioned into groups and for every
group there is a corresponding process responsible for executing those actions. Furthermore,
in each step, every process executes one of its enabled actions (unless it has no enabled action in that state). Since the number of steps needed for convergence aﬀects the veriﬁcation
time with SMT solvers signiﬁcantly, in this section, we consider execution of the program in
synchronous semantics and evaluate its eﬀect on veriﬁcation.
Veriﬁcation of the program under synchronous semantics can assist in two scenarios: One
scenario is that one can verify the stabilization property under synchronous semantics. In
this case, the program is guaranteed to reach a legitimate state under synchronous semantics.
One can utilize a program that is correct under synchronous semantics and compose it with
the alternator in [138]. This alternator ensures that given a program that is stabilizing under
synchronous semantics, it transforms into a program that is correct under read/write model.
Speciﬁcally, in this case, the process in the transformed program either reads the state of its
neighbor or writes its own state. In other words, the transformed program guarantees that
it will reach a legitimate state even if one process executes at a time, i.e., the transformed
program is stabilizing under interleaving semantics. Hence, in this scenario, we can obtain
a program that is stabilizing. Another scenario is that one can identify a counterexample
(illustrating the lack of stabilization) for the synchronous program. This counterexample
can in turn be transformed into a counterexample for the original program.
We begin with the ﬁrst scenario, where using synchronous semantics can improve veriﬁ97

cation performance. We consider the execution of the token ring protocol under synchronous
semantics. Results from Table 4.5.3 show the veriﬁcation cost under synchronous semantics.
These results show that the veriﬁcation under synchronous semantics is substantially faster
for both bounded and unbounded token ring. As discussed above, this can allow us to obtain
a stabilizing program that ensures that a legitimate state is reached even if only one process
executes at a time.

Numbers
of Nodes
3
4
5
6
7
8
9
10
11
12
17

Number of Steps
for Convergence
3
5
7
9
11
13
15
17
19
21
23

Execution time(s)
for Convergence
(for Bounded Variables)
0.006714
0.049422
0.306879
1.150117
11.837923
7.741610
18.389602
42.7424
789.75117
324.921817
N/A

Execution time(s)
for Convergence
(for UnBounded Variables)
0.005945
0.008152
0.012003
0.018864
0.030548
0.049720
0.077605
0.130185
0.241754
0.907359
22.63948

Table 4.19: Veriﬁcation results for token ring under synchronous semantics.

To illustrate the second scenario, we consider the case where the given program works in
read/write model where in each step, a process can read the state of its neighbor or write
its own state but not both. In such a program, the variables can be partitioned into a
public variables (variables that can be read by more than one process) and private variables
(variables that can be read by only one process. Thus, in read/write model, the read action
corresponds to the case where a process reads public variable(s) of one of its neighbors and
saves a copy of it in its private variable(s). In write action, the process utilizes its own
public/private variables to update them.
98

Observe that if P is a program in read/write model then for each computation of P
in synchronous semantics there is a corresponding computation in interleaving semantics.
Intuitively, in the computation in interleaving semantics, one transition of the program in
synchronous semantics is split into several steps. Hence, if we can ﬁnd a counterexample to
show that P is not stabilizing under synchronous semantics then it implies that P is not
stabilizing under interleaving semantics either.
To exploit this observation, we consider the execution of the token ring protocol under
read/write model. In this case, x.j is a public variable of process j. The action K0 in Section
4.5.2.1 is not in read/write model since it reads x.N and updates x.0. Read/write model
requires that these two tasks be separated into one read action (of x.N) and one write action
(of x.0). To obtain the corresponding program in read/write model, we introduce y.j that
maintains a copy of the x value of the predecessor. Furthermore, each action is split into
a read action to read the value of the predecessor and a write action that utilizes the local
copy. Thus, the actions of the token ring protocol in read/write model are as follows:
K0r ::

y.0 = x.N

−→

y.0 = x.N;

K0w ::

x.0 = y.0

−→

x.0 = (x.0 + 1) mod K;

Kjr ::

y.j = x.(j − 1) −→

Kjw ::

x.j = y.j

−→

y.j = x.(j − 1);

// read x.N

//read x.(j − 1)

x.j = y.j;

It is well-known that the above protocol can be thought of as the original token ring
protocol with 2(N + 1) processes, where the variables of these processes are x.0, y.1, x.1,
y.2, · · · , y.N, x.N, y.0.
Now, we consider the execution of the token ring protocol with N processes under inter99

Numbers
of Nodes
10
20
30
50
100
200

Under Interleaving Semantics
Numbers
of Steps Execution time(s)
20
0.253376
40
100.605236
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A

Under Synchronous Semantics
Numbers
of Steps Execution time(s)
4
0.009663
8
0.048594
4
0.040200
4
0.103919
8
0.849282
16
9.9811778

Table 4.20: Veriﬁcation result for cycle detection.
leaving semantics. If we choose K = 2 this program is not stabilizing. We can identify this
by checking that Ψy is satisﬁable when it is used in the context of the token ring program
with N processes using interleaving semantics. Alternatively, the lack of stabilization can
also be proved by considering execution of the token ring program with 2N processes under
synchronous semantics.
With this intuition, we evaluate the time for verifying satisﬁability of Ψy for diﬀerent
processes. Table 4.20 shows the veriﬁcation time with interleaving and synchronous semantics respectively. We observe that for 20 processes under interleaving semantics, it took 40
steps to detect a cycle and the time was 100.605236 seconds. However, the same property
can also be veriﬁed under synchronous semantics with 30 processes in 4 steps and the time
was 0.040200 seconds. Moreover, as discussed above, the latter veriﬁcation suﬃces to conclude that the 20 process token ring program is not stabilizing under interleaving semantics
if K = 2.

4.6

Summary

This chapter focused on automatic veriﬁcation of self-stabilization. One key challenge is
raised by the prohibitive cost incurred during veriﬁcation. Contrary to traditional veriﬁcation
100

that considers only a subset of reachable states starting from some initial state(s), veriﬁcation
of stabilization must consider all possible states, leading to the fundamental problem of ‘state
explosion. To address this challenge, we leveraged BDD-based symbolic model checking to
overcome the scalability constraint of existing approaches. Also, a novel technique was
proposed to analyze program actions against stabilization, which reduces the veriﬁcation
problem into a simpler problem called constraint solving, allowing the use of existing highly
optimized solutions to reduce veriﬁcation cost.
In particular, in Section 4.4, we focused on using symbolic model checking for verifying self-stabilizing algorithms. While a signiﬁcant percentage of the literature on selfstabilization routinely assumes weak fairness, where if an action is continuously enabled, it
is guaranteed to be executed, we argued that veriﬁcation under such weak fairness is not
scalable. Our observation was that in many cases, the assumption of weak fairness is superﬂuous. And, in these cases, scalable veriﬁcation of self-stabilization is possible under
unfair computation model. We illustrated this in the context of three case studies, Dijkstra’s K-state program, Ghosh’s mutual exclusion program and Hoepman’s ring-orientation
program. In particular, we showed that the time for veriﬁcation with unfair computations
is approximately 0.001% − 0.1% of that for weakly-fair computations.
For the case where program cannot preserve self-stabilization property under unfair computations, we proposed two approaches, including decomposition and using weak stabilization.
The ﬁrst approach is to decompose the program into parts where each part can be veriﬁed
under unfair computation. Subsequently, composition of these parts can be proved to be
correct using existing theorems in the literature. We showed how this approach can be used
in the context of two case studies, Huang’s mutual exclusion program and self-stabilizing
101

mutual exclusion program based on Raymond’s tree algorithm. In both case studies, scalability of veriﬁcation increased substantially (e.g., from 104 states to 10138 states for Huang’s
mutual exclusion algorithm.) Also, the approach in [93] can be used to identify layers that
assist in self-stabilization. These layers, in turn, can form the components that one can
verify independently. Since most of reduction in time is obtained by the use of unfair scheduler, one can obtain the savings in this manner even if the components themselves are not
substantially smaller than the original program.
The second approach is to utilize weak stabilization that has been proved to be a reasonable implementation of stabilization [92]. We also showed that veriﬁcation of weak stabilization is substantially more scalable. This validates the suggestion in [92] that weak
stabilization is easier to verify than self-stabilization. However, this suggestion is only valid
for the scenario where weak fairness is necessary for the correctness of stabilization. The
time cost for stabilization veriﬁcation under unfair computation is essentially equal to that
for weak stabilization when unfair computation is possible. Furthermore, a weak stabilizing
program can be transformed into a probabilistically stabilizing program thereby providing
additional assurance to designer.
In Section 4.5, we proposed a constraint-based approach for verifying stabilizing programs. In particular, we investigated the eﬀectiveness of SMT solvers in veriﬁcation of
stabilization. We found that the eﬀectiveness of SMT solvers in this context is mixed.
Speciﬁcally, compared with the approach that utilize symbolic model checkers to verify stabilization, the time for veriﬁcation is larger with SMT solvers. However, BDD based tools
require one to identify the order of program variables for constructing BDDs. An incorrect
ordering of variables can increase the veriﬁcation time by orders of magnitude making it
102

signiﬁcantly worse than the corresponding veriﬁcation time with SMT solvers. Also, the
approach in Section 4.4 only works for verifying ﬁnite state programs. By contrast, our
constraint-based approach demonstrated the feasibility of verifying inﬁnite state program.
We also considered execution of the given program under synchronous semantics. We
argued that this has a potential to reduce the cost of veriﬁcation and utilize a transformation approach to achieve a program that is stabilizing under interleaving semantics and/or
read/write model. We showed that execution under synchronous semantics can reduce the
time for identifying a counterexample illustrating that the given program is not stabilizing.

103

Chapter 5

Automatic Revision for Adding Weak
Multitolerance

Intuitively, a weak multitolerant program assumes that faults from multiple classes will not
occur simultaneously (By simultaneous, we mean that a fault from one class occurs before
the system has recovered from a fault from another class). Thus, while the program tolerates
multiple classes of faults, it tolerates them ‘one at a time’. It follows that if the program is
subject to fault classes f1 , f2 , . . . fn and the program is perturbed by fi , (0 ≤ i ≤ n) then
the program provides the desired level of fault-tolerance to fi . However, if faults from fj
(0 ≤ j ≤ n, j = i) occur while the program is recovering from fault class fi then the program
may not guarantee fault-tolerance. Thus, a weak multitolerant program that tolerates fault
classes f1 , f2 , . . . fn provides fault-tolerance to each fi (0 ≤ i ≤ n) respectively.
104

5.1

Problem Statement

In this section, we ﬁrst present the deﬁnition of weak multitolerance. As mentioned in
Section 2.3.1, a fault-tolerant program guarantees a desired level of fault-tolerance (i.e.,
failsafe/nonmasking/masking) in the presence of a speciﬁc class of faults. Now, we consider
the case where the program is subject to faults from multiple fault-classes.
Deﬁnition 5.1.1. Let fδ = { fi , li | 0 < i ≤ n, li ∈ {f ailsaf e, nonmasking, masking}}
where n ≥ 0. Program P is weak multitolerant to fault set fδ from S for spec iﬀ the
following conditions hold:
1. (In the absence of faults) P |=S spec.
2. For each i, 0 < i ≤ n, P is li fi -tolerant from S for spec respectively.
Remark 5.1.1. Whenever the level of fault tolerance to a given fault class is clear from the
context, for brevity, we omit it.
We note that the deﬁnition of weak multitolerance can also be used to deﬁne intermediate
multitolerance. For example, consider the case where masking fault-tolerance is required
for both f1 and f2 . However, if f2 occurs while the program is recovering from f1 then
nonmasking fault-tolerance is provided. For this case, we can model such requirements by
letting fδ = { f1 , masking , f2 , masking , f1 ∪ f2 , nonmasking }.
Now, using the deﬁnition of weak multitolerant programs, we identify the requirements of
the problem of synthesizing a weak multitolerant program P ′ with invariant S ′ from its faultintolerant version P with invariant S. We require that P ′ only adds weak multitolerance and
introduces no new behaviors in the absence of faults. This problem statement is a natural
extension of the problem statement in [135] where fault-tolerance is added to a single class
105

of faults. More speciﬁcally, we stipulate the following two conditions: (1) S ′ ⊆ S, i.e.,
the invariant S ′ of the weak multitolerant program P ′ is a subset of the invariant S of the
given program P; (2) (s0 , s1 ) ∈ P ′ ∧ s0 ∈ S ′ ⇒ (s0 , s1 ) ∈ P. Thus, the problem of weak
multitolerance synthesis is as follows:
Problem 5.1.1. The Weak Multitolerance Synthesis Problem.
Given P, S, spec and fδ : Identify P ′ and S ′ such that
• (C1) S ′ ⊆ S,
• (C2) (s0 , s1 ) ∈ P ′ ∧ s0 ∈ S ′ ⇒ (s0 , s1 ) ∈ P, and
• (C3) P ′ is weak multitolerant to fδ from S ′ for spec.
We state the corresponding decision problem as follows:
Problem 5.1.2. The Weak Multitolerance Decision Problem.
Given P, S, spec and fδ : Does there exist a program P ′ , with its invariant S ′ that
satisﬁes the requirements of Problem 5.1.1?

5.2

Illustrating Examples

In this section, we present examples to illustrate the scenarios where weak multitolerance is
used and where weak multitolerance is essential for feasibility of solution and/or performance.
We present four examples. The ﬁrst example is a two-sensor agreement protocol that has
the property of failsafe-failsafe weak multitolerance. The second example is a leader election
protocol that provides masking-masking weak multitolerance. The third example is a vertex
coloring protocol where failsafe-nonmasking weak multitolerance is guaranteed. Finally, the
106

fourth example provides masking-masking weak multitolerance and illustrates a case where
strong multitolerance is impossible.
We use the non-deterministic guarded commands language to represent programs. A
guarded command (a.k.a. action) is of the form g −→ st where g identiﬁes constraints under
which the guarded command can be executed and st identiﬁes the eﬀect of such execution.
Thus, guarded commands are suitable for programs that are ‘event-based’ where g identiﬁes
the event and st identiﬁes the action taken by the program while responding to the event.
Guarded commands are also suitable for ‘time-based’ programs where g corresponds to a
clock tick and st denotes the action taken by the program for that clock tick. A guarded
command g −→ st is a compact representation for transitions {(s0 , s1 )|g is true in s0 , s1 is
obtained by atomic execution of st from s0 }.

5.2.1

Failsafe-Failsafe Weak Multitolerance

In this section, we present a protocol that consists of two sensors. Each sensor is used
to detect the status of the environment. The program outputs the detected status of the
environment. To model this protocol, we introduce four variables, in, sensor0, sensor1 and
out. The details are as follows:
1. in denotes the real status of the environment. The domain of in in this model is {0, 1}.
2. sensor0 denotes the value detected by the sensor 0. The domain of sensor0 is {0, 1}.
The value of sensor0 should be the same as the value of in when no fault occurs, that
is, the value detected by the sensor 0 should reﬂect the real status of the environment.
3. sensor1 denotes the value detected by the sensor 1. The domain of sensor1 is {0, 1}.
Again, the value of sensor1 should be the same as the value of in when no fault occurs.
107

4. out denotes the output of the system. The domain of out is {0, 1, ⊥}, where ⊥ represents that the system has not made a decision yet. If the system makes a decision,
the output of the system should be the same as the value of in when no faults occur
in the sensors.
The legitimate states of the program are those where values of two sensors are equal to
the value of the real status in the environment and the output is equal to the value of the
real status in the environment if the output has been decided. Hence, the invariant is as
follows:
Invf f = (sensor0 == sensor1 == in) ∧ (out == ⊥ ∨ out == in)

The program includes the following action: when no output has been made by the program, i.e., out = ⊥ and both sensor0 and sensor1 have the same values, the output will be
set to the value of sensor0.
out == ⊥ ∧ sensor0 == sensor1 −→ out := sensor0;

Although this example models the values of two sensors in a similar fashion, they could
be providing sensor readings based on diﬀerent approaches. For example, the altitude switch
controller in [145] utilizes two types of altimeters, a digital and an analog. One of the reasons
for choosing diﬀerent types of sensing modalities is to ensure that one root cause does not
aﬀect all sensors. Thus, although faults that aﬀect each sensor appear similar, they are
considered as two diﬀerent types of faults since it is expected that both faults do not occur
in the same computation. In this program, we consider two classes of faults: a transient fault
108

that occurs at sensor 0 and a transient fault that occurs at sensor 1. When this transient
fault occurs at a sensor, it changes its value to the opposite of that of the input. Thus, the
fault actions for this program are as follows:
Action of Fault Class 1: true −→ sensor0 := 1 − in;
Action of Fault Class 2: true −→ sensor1 := 1 − in;

The safety speciﬁcation requires the program never reaches a state that the output of
the system is not the same as the real status of the environment. Formally, the safety
speciﬁcation states that the program should never reach a state where the following formula
is true:
specf f = (out == 1 ∧ in == 0) ∨ (in == 1 ∧ out == 0).

The program provides failsafe-failsafe weak multitolerance, which is deﬁned as follows.
1. When no fault occurs, the values of the two sensors sensor0 and sensor1 are equal
and the output is set to the value of sensor0. The output of the system is equal to the
value of real status in the environment.
2. When transient faults corrupt one sensor, failsafe fault-tolerance should be provided.
Since the faults corrupt the value of one sensor, the two values of sensor0 and sensor1
are not equal and, hence, the program action is not executed. That is, the safety
speciﬁcation is not violated.
3. When both faults occur simultaneously, no guarantee is provided.
109

The case discussed above could be extended to be the one which has three or more sensors.
The agreement is reached by the majority of sensor values. In that case, the program will
have the masking-masking weak multitolerance property. When transient fault corrupts one
of the sensors, the majority agreement will guarantee the ﬁnal decision will reﬂect the real
status of the environment.
Although the two fault classes considered here are similar, they are considered separately
due to the source of these faults. The next example illustrates the case where the faults in
diﬀerent fault classes have diﬀerent eﬀects.

5.2.2

Masking-Masking Weak Multitolerance

In this section, we present a leader election program that provides masking-masking weak
multitolerance to two types of faults fm1 and fm2 . The fault-type fm1 manifests itself as
if the leader leaves, called the leader leave fault, and fm2 denotes the message loss fault.
The program consists of n processes (p1 , p2 , . . . , pn ) that are organized in a connected
network. Each process has an unique ID, numbered from 0, . . . , n−1. One of these processes
is selected as the leader. When no faults occur, there are no actions that are executed.
However, when the current leader leaves, other processes can compete to be the leader.
When a process wants to be the leader, it starts a diﬀusing computation [61] and declares
itself to be the leader when the diﬀusing computation completes successfully. If multiple
processes start diﬀusing computations, then the process with higher ID wins.
A process initiates a diﬀusing computation when it receives a leave message from the
current leader and if it is not already participating in a diﬀusing computation. To initiate a
diﬀusing computation, process j sends a message j to all its neighbors. It also sets its own
110

root value and its parent to be equal to j. The root value keeps track of the initiator of a
diﬀusing computation and the parent value keeps track of the node that sent this diﬀusing
computation to j.
When process j receives a diﬀusing computation message of the form ID from k, it
does the following: If j is already participating in a diﬀusing computation such that the
initiator of that diﬀusing computation (stored in root.j) is higher than ID then j ignores
the new message. If the value of root.j is equal to ID then j is receiving the same diﬀusing
computation twice. Hence, it only replies to k. Otherwise, it forwards this diﬀusing computation to all its neighbors. To do so, it forwards the message ID and sets its own parent
variable to k and its own root variable to ID.
Finally, when j receives a reply message from all neighbors except the parent, j sends a
reply to its parent. Moreover, if j is the initiator (p.j == j) then it declares itself to be the
leader. Thus, the actions in this program are as follows:

upon receiving leave message from departing leader
−→//node j sends diﬀusing computation message
if root.j == −1
send j to nbrs.j,
p.j := j,
root.j := j;
else //ignore

upon receiving ID from k
111

−→if root == ID //duplicate
send reply to k with ID ;
else if ID > root.j
p.j := k,
root.j := ID,
send ID to all neighbors except k;
else //ignore;

upon receiving reply from all neighbours except p.j with root.j
−→root.j := −1;
if p.j = j
send reply to p.j with root.j ;
else
leader.j := true;

The legitimate states of the program include those states where there is a unique leader,
there are no ongoing diﬀusing computations to elect a leader, and, the root value of every
process is −1.
The program is subject to two classes of faults, fm1 and fm2 . fm1 is the one where
the leader node leaves. We assume that when the leader node leaves the network, it notiﬁes
its neighbors. fm2 causes messages to be lost. Variable channeli,j denotes the sequence of
messages in the channel between i and j. Hence, the two types of fault actions are as follows:
fm1 (Leader node leave):
112

leader.j == true
−→send leave to nbrs.j, leader.j := f alse ;
fm2 (Message loss):
channeli,j =

//

denotes an empty channel

−→ channeli,j := tail(channeli,j );

The safety speciﬁcation requires that in any state there is at most one leader. More
precisely, the program should never reach a state where:

specmm = (∃j, k : j = k : leader.j ∧ leader.k)
The program provides maksing-masking weak multitolerance to leader node leave and
message loss. More speciﬁcally, it is deﬁned as follows.

1. In the absence of faults, no action is executed. And, there is an unique leader in the
network.
2. When fault fm1 occurs, i.e., the current leader leaves, one or more of its neighbors
initiate a diﬀusing computation. Among the nodes that initiate the diﬀusing computation, the one with highest ID is elected as the leader. Hence, masking fault-tolerance
is guaranteed when fm1 occurs.
3. When fault fm2 causes messages to be lost, there is no eﬀect on the number of leaders.
Thus, there is a unique leader in the network. Hence, masking fault tolerance is
guaranteed when fm2 occurs.
113

4. If fault fm1 and fm2 occur simultaneously, no fault-tolerance is guaranteed. Specifically, if a message is lost during the diﬀusing computation, it is possible that the
diﬀusing computation never completes, and hence, no leader is elected.

A more careful analysis of this program shows that if faults fm1 and fm2 occur together,
it causes the diﬀusing computation to be blocked thereby resulting in states where there is
no leader. In this case, failsafe fault-tolerance is provided. Thus, this is also an instance of
FM weak multitolerance.
Moreover, as one can imagine it is possible to design a fault-tolerant program that provides masking fault-tolerance to both fm1 and fm2 simultaneously. However, providing
such tolerance is expensive. For example, it requires mechanisms to detect message losses
(or potentially failure of a node). Additionally, it requires an overhead in terms of message
retransmission etc. Also, if such faults are considered during diﬀusing computation, there
is a need for keeping track of multiple diﬀusing computations initiated by the same node,
e.g., with the use of sequence numbers. Thus, this example illustrates the case where a
stronger tolerance property is possible although weak multitolerance can provide a slightly
lower assurance at a reduced cost (in terms of complexity of the code, performance, etc.).
The next example illustrates the case where providing weak multitolerance is essential.

5.2.3

Failsafe-Nonmasking Weak Multitolerance

In this section, we present a vertex coloring protocol that provides failsafe-nonmasking
weak multitolerance to Byzantine fault and transient fault, i.e., failsafe fault-tolerance when
Byzantine fault occurs, nonmasking fault-tolerance when transient fault occurs and no guarantees when both faults occur simultaneously.
114

The vertex coloring of the program is an assignment of colors to each process of the
system. The goal of the program is that every process is assigned a color and no two
neighboring processes are assigned the same color. One assumption is that the degree of
each process is at most d and d + 1 colors are to be used.
The legitimate states of the program are those whose colors are assigned appropriately.
The program action for each node j is deﬁned as follows.
color.j == color.k

−→

color.j := available color(j) ;

In the above action, color.j denotes the color assigned to process j. available color(j) is
used to denote the method that ﬁnds a new color among the available ones and assigns it to
process j. If the color of any process j is the same as one of its neighbors, the above action
is enabled and the color of the process will be assigned another color that does not conﬂict
with its neighbors. No action is executed in the absence of faults.
In this program, we consider two types of faults: (1) Byzantine faults and (2) transient
faults. Both these faults result in changing the color of the aﬀected process. However,
the main diﬀerence between these faults is that the former is a permanent fault, i.e., the
aﬀected process can change the color as often as it desires whereas the latter is a transient
fault where there is a bound on the number of times the color of some process is aﬀected.
Another diﬀerence is that the former only aﬀects a subset of (chosen) processes whereas the
latter can aﬀect all processes at once. To model these faults, we introduce a variable b.j that
denotes whether j is Byzantine, variable count.j that denotes the number of times color.j
is aﬀected by transient faults, and MAX that denotes the number of permitted transient
faults. (Note that all these variables are auxiliary variables, i.e., variables used in the proof
115

but not explicitly by the program itself.) Thus, the fault actions are as follows:

ff (Byzantine fault):
b.j == true
−→ colorj := random(0,d); // return a random value from 0 to d;
fn (transient fault):
count.j < MAX
−→ color.j := random(0,d);

The above fault actions may corrupt the color of a process to be the same as that of one
of its neighbors.
The safety speciﬁcation requires that, any two neighboring nodes that are non-Byzantine
have diﬀerent colors. Thus, the program should never reach a state in the state predicate
specf n , where
specf n = (∃j, k :: (k ∈ nbrs.j) ∧ (b.j == f alse)
∧(b.k == f alse) ∧ (color.j == color.k))

The program provides failsafe and nonmasking weak multitolerance to Byzantine fault
and transient fault, respectively. Speciﬁcally, it is deﬁned as follows.
1. In the absence of faults, there is no action. Thus, the program keeps a correct color
assignment to all nodes.
2. When ff corrupts color assignment of the Byzantine node, no non-Byzantine node is
116

aﬀected and hence the safety speciﬁcation is not violated. Thus, failsafe fault-tolerance
is provided.

3. When fn causes one node to change its color transiently, safety speciﬁcation may be
violated at that time. Then, the recovery action will reassign its color to be the correct
one and ﬁnally the program will recover to a correct assignment to all the nodes. Thus,
nonmasking fault-tolerance is provided.

4. When both faults ff and fn occur simultaneously, safety speciﬁcation may be violated
due to the occurrence of transient fault. Thus, failsafe fault-tolerance is not guaranteed.
Also the program may not recover to a correct assignment to all nodes since the
Byzantine node is corrupted permanently. Hence, nonmasking fault tolerance is also
not guaranteed.

This example illustrates the need for weak multitolerance. In particular, it is not possible
to provide a uniform tolerance to both types of faults. Speciﬁcally, with Byzantine faults, the
faults can prevent the program to recover to legitimate state where the colors of all nodes are
assigned properly, i.e., no two neighboring nodes have the same color. Hence, in this example,
providing masking or nonmasking fault-tolerance to Byzantine faults is impossible. Likewise,
execution of the transient fault itself can violate the safety speciﬁcation. Thus, providing
failsafe or masking fault-tolerance to transient faults is impossible. For this reason, the only
possible solution is to provide failsafe fault-tolerance to Byzantine faults and nonmasking
fault-tolerance to transient faults.
117

5.2.4

Masking-Masking Weak Multitolerance

This section presents an example scenario where MM weak multitolerance is the best possible option for designers due to the impossibility of providing MM strong multitolerance.
Intuitively, strong MM multitolerance guarantees that if both faults occur in the same computation (i.e., simultaneously), then masking fault tolerance to both of them will be provided.
(Please see Deﬁnition 5.8.1 in Section 5.8 for a more precise deﬁnition of strong multitolerance.)
The Agreement Program (AP) includes a general process and three non-general processes.
Initially, all non-generals are undecided. The general casts a decision and each non-general
copies the decision of the general and terminates; i.e., ﬁnalizes its decision.
The AP program [141] has to satisfy two safety properties, namely agreement and validity.
Agreement requires that if the general is faulty, then all non-faulty non-generals that have
ﬁnalized agree on the same decision. Validity stipulates that if the general is not faulty, then
any non-faulty non-general that has ﬁnalized has the same decision as that of the general.
The AP program is subject to two classes of faults: Byzantine and fail-stop. The Byzantine faults can perturb the state of at most one process and make it behave arbitrarily. That
is, if a process is aﬀected by Byzantine faults (i.e., a process is Byzantine), then it can cast
diﬀerent decisions lying about its decision to diﬀerent processes. The fail-stop faults could
cause a process to crash in a detectable fashion. A fail-stopped process does not execute any
actions once it crashes. We assume that fail-stop faults only aﬀect the non-general processes.
The set of program variables is {dg , d0 , f0 , up0 , d1 , f1 , up1 , d2 , f2 , up2 }, where (i) dg
denotes the decision of the general, which cou1d be 0 or 1, and di represents the decision of
process i (0 ≤ i ≤ 2), where the domain of di is {0, 1, ⊥}, and ⊥ means that process i is
118

undecided; (ii) fi is a Boolean variable representing whether or not process i has ﬁnalized
its decision after copying a decision from the general, and (iii) upi is also a Boolean variable
that denotes whether process i has crashed in a detectable fashion; i.e., has fail-stopped.
The actions of process i in the AP program are as follows (⊕ denotes addition in modulo 3):
Ai1 : di = ⊥ ∧ ¬fi ∧ upi
−→ di := dg
Ai2 : (di = ⊥ ∧ ¬fi ) ∧ (di⊕1 = ⊥ ∨ di = di⊕1 ) ∧
(di⊕2 = ⊥ ∨ di = di⊕2 ) ∧ (di⊕1 = ⊥ ∨ di⊕2 = ⊥) ∧ upi
−→ fi := true
Ai3 : (di = ⊥ ∧ ¬fi ) ∧ (di⊕1 = ⊥) ∧ (di⊕2 = ⊥) ∧ upi
−→ di := majority(d1 , d2 , d3 ))
fi := true

If process i is undecided and not crashed, then it can copy the decision of the general (see
action Ai1 ). Once decided, process i can ﬁnalize its decision if at least another non-general
has made the same decision (action Ai2 ). If all non-generals have copied a decision from the
general, then process i can ﬁnalize by setting di to the majority of decisions (action Ai3 ).
That is, if di diﬀers from the majority, then process i corrects di by setting it to the majority
of decisions and ﬁnalizing. Otherwise, di is equal to the majority and action Ai3 ﬁnalizes the
decision of process i. Notice that a crashed process can execute none of its actions because
upi becomes false.
An invariant of the AP program includes states in which validity and agreement are
satisﬁed and at most one process is faulty.
119

The AP program provides masking-masking weak multitolerance, including the following
properties:
1. In the absence of Byzantine and fail-stop faults, the program AP satisﬁes both validity
and agreement.
2. In the presence of Byzantine faults, if the general is Byzantine then validity vacuously
holds, and agreement is guaranteed by the majority of decisions. If a non-general has
become Byzantine, then validity holds for non-faulty non-generals, and agreement is
vacuously satisﬁed. Therefore, the AP program is masking fault-tolerant to Byzantine
faults.
3. In the presence of fail-stop faults, one of the non-generals stops executing (i.e., its up
variable becomes false). Thus, the other non-generals satisfy validity. Agreement is
satisﬁed as well since a majority of non-generals exists and the general is not faulty.
Therefore, the AP program is masking fault-tolerant to fail-stop faults.
4. When both faults occur, the program may reach a state where a non-general has
crashed and another non-general has become Byzantine. If the fail-stop faults have
occurred before a process makes a decision, then the guard of action Ai3 in the other
two processes is false. Since another non-general has become Byzantine, its decision
may not be the same as the decision of the non-faulty non-general. Thus, the guard
of action Ai2 of the non-faulty non-general is also false. Therefore, the entire program
deadlocks, i.e., masking fault tolerance cannot be guaranteed for Byzantine and failstop faults.
This example demonstrates a case where it is impossible to design a MM strong multitol120

erant program, and designing MM weak multitolerance is the next best option. The reason
behind it is that the simultaneous occurrence of both faults may get the AP program to a
state where no majority of decisions exists amongst the non-faulty processes. One approach
for enabling recovery from such a state is to add redundancy by including an additional
non-general process in the AP program, which may not be always feasible. Nonetheless,
before resorting to redundancy, we provide the following options for designers. First, one
can design a MM weak multitolerant program (similar to the AP program) where masking
fault tolerance is guaranteed if the faults occur one at a time. The second option is a FM
weak multitolerant program, where masking fault tolerance to Byzantine faults and failsafe
fault tolerance to fail-stop faults is provided (or vice versa). Notice that the AP program
guarantees FM weak multitolerance as well. Third, we can lower our expectation to providing FF strong multitolerance to both faults where the program guarantees that in the
presence of both faults, validity and agreement are not violated, however, agreement may
never be reached (i.e., the program never recovers). Therefore, MM weak multitolerance
provides the best remaining option where MM strong multitolerance is impossible using the
available resources; i.e., masking fault tolerance is guaranteed to both classes of faults if they
do not occur simultaneously.

5.3

Complexity Analysis of FF Weak Multitolerance

In this section, we investigate the synthesis problem of programs that are weak multitolerant
to two classes of faults f1 and f2 for which failsafe fault-tolerance is required. That is,
fδ

= { f1 , f ailsaf e , f2 , f ailsaf e } in Deﬁnition 5.1.1. We show that such a F F

(F ailsaf e-F ailsaf e) weak multitolerant program can be synthesized in polynomial time in
121

program state space. Towards this end, we present a sound and complete algorithm. We
note that this algorithm can be easily generalized for the case where fδ includes three or
more fault classes for which failsafe fault-tolerance is desired.

Given is a program P, with its invariant S and its speciﬁcation spec. Let P ′ be the
synthesized program with invariant S ′ that is weak multitolerant to f1 and f2 . By deﬁnition,
P ′ must maintain spec from every reachable state in the computations of P ′ ∪f1 (respectively,
P ′ ∪ f2 ). To this end, on Line 1 of Algorithm 2 in Figure 5.1, we ﬁrst identify ms1 , a set
of states from where execution of one or more f1 transitions violates safety. Clearly P ′
cannot reach a state in ms1 either in the absence of faults or in the presence of f1 alone.
Likewise, we compute ms2 on Line 2. Next, we compute mt to be a set of transitions that
reach ms1 ∪ms2 or those that violate spec. If there exist states in the invariant such that
execution of one or more fault actions from those states violates spec, we recalculate the
invariant by removing those states. In this recalculation, we ensure that all computations of
P − mt within the new invariant, S ′ , are inﬁnite. By the constraints of Deﬁnition 5.1.1 and
the deﬁnition of ms1 and ms2 , S ′ must be a subset of S −ms1 −ms2 . Likewise, P ′ cannot
include transitions that begin in S ′ and are in mt. Hence, the only transitions P ′ can use
inside S ′ are a subset of P −mt. Removal of states in ms1 ∪ms2 or transitions in mt may
create some deadlock states in S −ms1 −ms2 , i.e., where P ′ has no outgoing transitions.
Since P ′ cannot deadlock in the absence of faults, we remove such deadlock states recursively
to construct S ′ (Line 6 and 7 of Algorithm 2). As shown in Line 8, if the invariant becomes
an empty set after reconstruction, we cannot ﬁnd a F F weak multitolerant program P ′ . If
the invariant is not empty, we remove transitions that start in S ′ and terminate outside S ′ ;
i.e., violate the closure of S ′ . Notice that the removal of such transitions does not introduce
122

any deadlock states in S ′ .
Theorem 5.3.1. The algorithm Add FF Weakmulti is sound and complete.
Proof. To show the soundness of our algorithm, we need to show that constraints C1, C2
and C3 of the Problem 5.1.1 are satisﬁed.
1. S ′ ⊆ S. By the construction of S ′ , S ′ is obtained by removing zero or more states in
S. Thus, C1 is trivially satisﬁed.
2. (s0 , s1 ) ∈ P ′ ∧ s0 ∈ S ′ ⇒ (s0 , s1 ) ∈ P. By the construction of P ′ , P ′ does not have
any new transitions in the absence of faults. Therefore, C2 is trivially satisﬁed.
3. P ′ is FF weak multitolerant to spec from S ′ . Consider a computation c of P ′ that starts
from a state in S ′ : From 1, c starts in a state in S, and from 2, c is a computation of
P. It follows that c satisﬁes spec. Hence, every computation of P ′ that starts from a
state in S ′ is in spec, i.e, P ′ reﬁnes spec from S ′ . We discuss the following two cases:
(a) Failsafe f1 -tolerance to spec from S ′ . We let the fault-span T1 to be the set of
states reached in any computation of ψ ′ ∪ f1 that starts from a state in S ′ .
P

Consider a computation preﬁx c of ψ ′ ∪ f1 that starts from a state in T1 . From
P
the deﬁnition of T1 there exists a computation preﬁx c′ of ψ ′ ∪ f1 such that c
P
is a suﬃx of c′ and c′ starts from a state in S ′ . If c′ violates the safety of spec
then there exists a preﬁx of c′ , say s0 , s1 , . . . , sn , such that s0 , s1 , . . . , sn
violates the safety of spec. Let s0 , s1 , . . . , sn be the smallest such preﬁx, it
follows that (sn−1 , sn ) violates the safety of spec and hence, (sn−1 , sn ) ∈ mt.
By construction, P ′ does not contain any transition in mt1 . Thus (sn−1 , sn )
is a transition of f1 . If (sn−1 , sn ) is a transition of f1 then sn−1 ∈ ms1
123

Algorithm 2: Add FF Weakmulti
Input: P:transitions,
f1 , f2 :faults of two classes
that need failsafe f -tolerance,
S: state predicate, spec: safety speciﬁcation
Ouput: If successful, a fault-tolerant P ′ with invariant S ′
that is weak multitolerant to f1 and f2
ms1 := {∫0 : ∃∫1 , ∫2 , . . . ∫n :
(∀j : 0 ≤ j < n : (∫j , ∫j + 1 ) ∈ f1 )∧
(∫n − 1 , ∫n ) violates spec};
(1)
ms2 := {∫0 : ∃∫1 , ∫2 , . . . ∫n :
(∀j : 0 ≤ j < n : (∫j , ∫j + 1 ) ∈ f2 )∧
(∫n − 1 , ∫n ) violates spec};
(2)
mt := {(∫0 , ∫1 ) : ((∫1 ∈ ms1 ∪ ms2 ) ∨
(∫0 , ∫1 ) violates spec)};
(3)
′
S := S −ms1 −ms2 ;
(4)
P1 := P −mt;
(5)
WHILE(∃∫0 : ∫0 ∈ S ′ :
(∀∫1 : ∫1 ∈ S ′ : (∫0 , ∫1 ) ∈ P1 ))
/
(6)
′
′
{S := S − {∫0 }}
(7)
If(S ′ = {})
(8)
declare no weak multitolerant program P’ exists;
(9)
{ Return ∅, ∅}
(10)
′
P := {(∫0 , ∫1 )|
(∫0 , ∫1 ) ∈ P1 , ∫0 ∈ S ′ , ∫1 ∈ S ′ }
(11)
// P ′ only speciﬁes transitions inside invariant S ′ ;
//P ′ can be modiﬁed to include any subset of
// {(s0 , s1 )|s0 ∈ S ′ ∧ (s0 , s1 ) ∈ mt};
/
/
′
//T = Reachable (S , P ∪ f1 ) ∪
//
Reachable (S ′ , P ∪ f2 ), i.e., T be states
//
reached by starting from S ′ and using transitions of
//
P ∪ f1 (respectively, P ∪ f2 );
//P ′ can include any subset of {(s0 , s1 )|s0 ∈ T };
/
′
′
RETURN P , S ;
(12)
Figure 5.1: Model revision for adding FF weak multitolerance.

124

and (sn−2 , sn−1 ) ∈ mt1 and hence, (sn−2 , sn−1 ) is a transition of f1 . By
induction, if s0 , s1 , . . . , sn violates the safety of spec, s0 ∈ ms1 , which is a
contradiction since s0 ∈ S ′ and S ′ ∩ ms1 = ∅. Thus, each preﬁx of c′ maintains
spec. Since c is a suﬃx of c′ , each preﬁx of c also maintains spec. Thus, ψ ′ ∪ f1
P

maintains spec from T1 .

(b) Failsafe f2 -tolerance to spec from S ′ . The argument is similar to part 3a.

Now we show that if a FF weak multitolerant program can be designed for the given
fault-intolerant program then Add FF Weakmulti will not declare failure. Let program P ′′
and predicate S ′′ solve Problem 5.1.1. Clearly, S ′′ ∩ms1 = ∅; if s0 ∈ (S ′′ ∩ms1 ) then the
execution of faults alone from s0 can violate the safety of spec. It follows that S ′′ ⊆ (S−ms1 ).
Likewise, S ′′ ⊆ (S −ms2 ). Moreover, P ′′ |S ′′ cannot include any transitions in mt; if P ′′ |S ′′
contains a transition in mt then the execution of this transitions can violate the safety of
spec. Thus, P ′′ |S ′′ ⊆ (P −mt). Finally, every computation of P ′′ that starts in a state in
S ′′ must be an inﬁnite computation, if it were to be in spec. It follows that there exists
a subset of S such that all computations of P − mt within that subset are inﬁnite. Our
algorithm declares that no solution for the Problem 5.1.1 exists only when there is no subset
of S −ms1 −ms2 such that all the computations of P −mt within that subset are inﬁnite.
It follows that our algorithm declares that no F F weak multitolerant program exists only if
the answer to Problem 5.1.2 is false.

Remark 5.3.1. Algorithm Add FF Weakmulti can be extended to design a multitolerant program that is subject to 3 or more fault classes. Towards this, we specify ms3 for the third
fault class and msi for the ith fault class. Then, we calculate mt like Line 3 in Algorithm
125

Add FF Weakmulti to specify these transitions that lead to state in

msi (n is the

i=1,...,n
number of fault classes). Besides, we need to recalculate invariant S ′ by removing states in
i=1,...,n

5.3.1

msi . The remaining steps are similar to Algorithm Add FF Weakmulti.

Application of Add FF Weakmulti

Now we present a simple example to demonstrate how the algorithm Add FF Weakmulti
facilitates automated synthesis of FF weak multitolerance. We use the example of two-sensors
agreement protocol introduced in Section 5.2. Speciﬁcally, we apply the Add FF Weakmulti
algorithm in Section 5.3 to the fault-intolerant version of the input-output program.
In the absence of faults, sensor0 produces the correct value of in. Hence, in this program,
if no decision has been made, i.e., out == ⊥, the output will be set to the value of sensor0.
Hence, the program action is as follows:

Pf i : out == ⊥ −→ out := sensor0;

Remark 5.3.2. The program action could also be out == ⊥ −→ out := sensor1 or nondeterminism execution of both actions, i.e., out == ⊥ −→ out := sensor0|sensor1. For
these cases, a similar analysis can be used to design FF weak multitolerant program.
Next, we show the execution of Add FF Weakmulti with input program Pf i , f1 , f2 ,
Invf f and specf f . The fault actions f1 and f2 and safety speciﬁcation specf f are deﬁned
in Section 5.2.1.
Speciﬁcally, our algorithm works as follows: ﬁrst, Add FF Weakmulti computes ms1 (Line
1) from where execution of f1 violates safety. Fault f1 changes the value of sensor0. The
126

safety speciﬁcation is independent of the value of sensor0. Thus, if (s0 , s1 ) is a transition of
f1 then (s0 , s1 ) violates safety iﬀ specf f is true in state s0 . As such, ms1 = specf f where
specf f ≡ ((out == 1 ∧ in == 0) ∨ (int == 0 ∧ out == 0)). Likewise, ms2 = specf f .
Next, mt = {(s0 , s1 )|s1 ∈ specf f } (Line 3). Since Invf f ∩ms1 = ∅ and Invf f ∩ms2 =
∅, the new invariant S ′ is equal to Invf f (Line 4). Then, by removing transitions in mt
(Line 5), we have P1 := “out == ⊥ ∧ sensor0 == in −→ out = sensor0′′ , i.e., P1 assigns
the output only when sensor0 is not corrupted.
On Line 9, we compute the transitions of P1 that start and end in S ′ . Thus, on Line 9,
we get the program, P1 := “out == ⊥ ∧ sensor0 == sensor1 == in −→ out = sensor0”.
Moreover, after computing the fault-span of this program, we observe that a state where
sensor0 is equal to sensor1 but sensor0 is not equal to in is not reached in the presence of
f1 alone (or in the presence of f2 alone). Hence, we can add the transition corresponding
to the following action to P1 :
out == ⊥ ∧ sensor0 == sensor1 = in −→ out = sensor0.
Thus, the FF weak multitolerant program is as follows: (Note that this is the same
program from Section 5.2)
P1 := out == ⊥ ∧ sensor0 == sensor1 −→ out = sensor0.
Remark 5.3.3. The extra transitions (added in Line 9) cannot be executed but are useful in
simplifying the fault-tolerant program. Speciﬁcally, in this example, it allowed us to construct
a fault-tolerant program that does not read the value of the variable in. In a more general
setting, to ensure that the synthesized program reads only the variables that it is allowed
to read, one can utilize the approach in [135] to synthesize distributed programs, where the
program consists of multiple processes and each process can read some subset of program
127

variables. Since this paper does not focus on the issue of distribution, we omit details of
modeling read/write restrictions in synthesizing distributed programs.

5.4

Complexity Analysis of MM Weak Multitolerance

In this section, we investigate Problem 3.2 for cases where we want to add weak multitolerance
for fδ = { fm1 , masking , fm2 , masking }. We ﬁnd a surprising result that the MM
(Masking-Masking) weak multitolerant synthesis problem is NP-complete, even though, as
shown in [71], the synthesis problem of the corresponding strong multitolerant program is in
P.
Before we present the formal proof, we give an intuition behind this complexity. Consider
the case where there exists a transition (s1 , s2 ) of fm2 that violates the safety speciﬁcation.
We have the following two options: (i) ensure that s1 is unreachable in the computations of
P ∪ fm2 . (ii) allow s1 to be reached only while program is ‘recovering’ from fm1 . Moreover,
the choice made for this state aﬀects other similar states. In our proof, we relate the choice
made between these two options to the values of Boolean variables in the SAT formula. This
allows us to reduce the SAT problem to the MM weak multitolerant synthesis problem.
Theorem 5.4.1. The problem of synthesizing MM weak multitolerant programs from their
fault-intolerant version is NP-complete.
Proof. Given a program P, with its invariant S, its speciﬁcation spec, and two classes
of faults fm1 and fm2 , we prove that the decision Problem 3.2 is NP-hard when fδ =
{ fm1 , masking , fm2 , masking }. Illustrating the NP membership of Problem 3.2 is
straightforward; hence omitted.
128

Figure 5.2: Mapping from an instance of the SAT problem.

Now, we present a polynomial-time mapping from an instance of the SAT problem to
a corresponding instance P, S, spec, fm1 , fm2 of Problem 3.2. An instance of the SAT
problem is speciﬁed in terms of a set of literals x1 , x2 , . . . , xn and ¬x1 , ¬x2 , . . . , ¬xn where
xi and ¬xi are complements of each other. The SAT formula is of the form φ = C1 ∧ C2 ∧
C3 ∧ . . . ∧ Ck , where each clause Ci is a disjunction of several literals. Then we show that
the given SAT formula is satisﬁable if and only if there exists a solution for the mapped
instance of Problem 3.2. We construct the mapped instance as follows (see Figure 5.2):
The state space of P is as follows:

• We introduce a state s. This is the only state in the invariant S.
• For each propositional variables xi , 1 ≤ i ≤ n, and its complement ¬xi in the SAT
instance, we introduce the following states: ei , ti , gi , hi , ai and bi .
129

• For each clause Cr , 1 ≤ r ≤ k, we introduce states wr and zr .
• If clause Cr includes literal xi , we introduce a state dri. If clause Cr includes literal
¬xi , we introduce a state d′ .
ri
The transitions of P|S include only a self-loop (s, s). The transitions of fm1 and fm2
are as follows:
• For each clause Cr , we include the fault transition (s, wr ) in fm1 and the fault transition (s, zr ) in fm2 .
• If the clause Cr includes the literal xi , then we include the fault transition (dri, ei ) in
fm1 .
• If the clause Cr includes the literal ¬xi , then we include the fault transition (d′ , ti )
ri
in fm2 .
• For each propositional variable xi and its complement ¬xi , we include the fault transition (gi , ai ) in fm1 and (hi , bi ) in fm2 .
The safety speciﬁcation is deﬁned as follows.
• Transitions (gi , ai ) and (hi , bi ) violate safety.
• Transitions (s, s), (s, wr ), (s, zr ), (dri, ei ) and (d′ , ti ) do not violate safety.
ri
• For each clause Cr , each propositional variable xi and its complement ¬xi , the following transitions do not violate safety:
– (wr , zr ), (zr , dri ), (zr , d′ ), (ei , ti ), (ti , ei ), (ei , gi ), (ti , hi ), (gi, s), and (hi , s).
ri
130

• All transitions except those identiﬁed above (e.g., (zr , wr ), (zr , s), etc) violate safety
speciﬁcation.
Now, we show that the given SAT formula is satisﬁable if and only if the answer to Problem 3.2 for the mapped instance is aﬃrmative where fδ = { fm1 , masking , fm2 , masking }.
• (=⇒) First, we show if the given SAT formula is satisﬁable, then there exists a solution
that meets the requirements of the synthesis problem. Since φ has a satisfying truth
assignment, there exists an assignment of truth values to the literals xi , such that
each Cr evaluates to true. Now, we identify the program P ′ , that solves MM weak
multitolerant problem.
The invariant of P ′ is the same as the invariant of P (i.e., {s}). We derive the transitions of the weak multitolerant program P ′ as follows:
– For each disjunction Cr , we include the transition (wr , zr ).
– If xi is assigned true:
∗ We introduce (ei , ti ), (ti , hi ), (hi , s).
∗ For each disjunction Cr that includes xi , we introduce (zr , dri ) and (dri, s).
– If xi is assigned f alse:
∗ We include (ti , ei ), (ei , gi ) and (gi , s).
∗ For each disjunction Cr that includes ¬xi , we introduce (zr , d′ ) and (d′ , s).
ri
ri
Thus, in the presence of fm1 alone, P ′ provides safe recovery to s through dri , ei , ti , hi .
In the presence of fm2 alone, P ′ provides safe recovery to s through d′ , ti , ei , gi.
ri
Now, we show that P ′ is weak multitolerant in the presence of faults fm1 , fm2 .
131

– (In the absence of faults) P ′ |S = P|S. Thus, P ′ satisﬁes spec in the absence of
faults.
– Masking fm1 -tolerance. If the faults from fm1 occur then the program can
be perturbed to state wr , 1 ≤ r ≤ k. From wr , P ′ has only one transition
that reaches zr . Since Cr evaluates to true, there exists i such that either
xi is a literal in Cr and xi is assigned the truth value true or ¬xi is a literal in Cr and xi is assigned the truth value f alse. In the former case, P ′
can recover to s using the two sequences of transitions, (zr , dri ), (dri, s) , or
(zr , dri ), (dri, ei ), (ei , ti ), (ti , hi ), (hi , s) . In the latter case, P ′ can recover to
s using exactly one sequence of transitions (zr , d′ ), (d′ , s) . Note that if xi
ri
ri
is true then P ′ cannot reach gi from where it can violate safety speciﬁcation.
Thus, any computation of P ′ ∪ fm1 eventually reaches a state in the invariant.
Moreover, from zr , every computation of P ′ ∪ fm1 does not violate the safety
speciﬁcation. Based on the above discussion, P ′ is masking tolerant to fm1 .
– Masking fm2 -tolerance. The argument is similar to the one showing that P ′ is
masking tolerant to fm1 .
• (⇐=) Second, we show that if there exists a weak multitolerant program that solves
the instance of the synthesis problem 3.2, then the given SAT formula is satisﬁable.
Let P ′ be the weak multitolerant program derived from the fault-intolerant program
P. The invariant of P ′ , S ′ , is not empty and S ′ ⊆ S, S ′ must include state s. Thus,
S ′ =S.
Let Cr be a clause in the given SAT formula. The corresponding states added in the
instance of the synthesis problem are wr and zr . Note that wr can be reached from
132

s by a transition in fm1 . Hence, P ′ must include the transition (wr , zr ). Thus zr is
reached in the computation of P ′ ∪ fm1 . Hence, P ′ must recover to s from zr without
violating spec. Therefore, for some i, P ′ has to have a transition of the form (zr , dri) or
(zr , d′ ). If P ′ includes (zr , dri ), then the clause Cr contains literal xi and we assign
ri
xi the truth value true. Likewise, if P ′ includes (zr , d′ ) for some i, then the clause
ri
Cr contains literal ¬xi and we assign xi the truth value f alse. Thus, by construction,
Cr evaluates to true.
Now, to complete the proof, we have to show that the truth values assigned to all
literals are consistent, i.e., it is not the case that xi is assigned true in one clause and
f alse in another clause. We show this by a proof by contradiction. If xi is assigned
true in clause Cr and f alse in clause Cq then P ′ includes both transitions (zr , dri)
and (zq , d′ ). Now, from dri , the program can reach ei by the occurrence of fm1
qi
alone. Hence, the program P ′ cannot include the transition (ei , gi ), as including this
transition will allow the program to reach gi in a computation of P ′ ∪ fm1 and violate
safety by executing (gi , ai ). Likewise, P ′ can reach ti by the occurrence of fm2 alone.
Hence, P ′ cannot include the transition (ti , hi ). If both transitions (ei , gi ) and (ti , hi )
are not included then P ′ cannot recover from ei to the state in the invariant. This
contradicts the assumption that P ′ is masking fm1 -tolerant. Thus, the truth value
assignment to all literals is consistent.

5.4.1

A Heuristic for MM Weak Multitolerance

The previous result showed that in general, the problem of adding MM weak multitolerance
is NP-hard. In this section, we present a sound (but incomplete) algorithm that adds MM
133

weak multitolerance to a given program P that is subject to two classes of faults fδ =
{ f1 , masking , f2 , masking } in polynomial time.

Our algorithm Add MM Weakmulti

takes program actions, faults, invariant and safety speciﬁcation as input and generates a
MM weak multitolerant program. The basic idea of Add MM Weakmulti is to ﬁrst construct
the corresponding FF weak multitolerant program that ensures safety. Then we use the fault
span of the FF weak multitolerant program to add recovery. Speciﬁcally, let T1 and T2 be
the fault-spans in the presence of f1 and f2 respectively. We ensure that every path from T1
reaches a state in S1 , and likewise, every path from T2 reaches a state in S1 . Additionally,
we ensure that T1 (respectively, T2 ) remains closed in transitions of f1 (respectively, f2 )
during this revision process.

Given is a program P with its state predicate S and its speciﬁcation spec. Let P ′ be the
synthesized program with invariant S ′ that is weak multitolerant to f1 and f2 . By deﬁnition,
P ′ is masking f1 -tolerant from S ′ for spec if only f1 occurs, and masking f2 -tolerant from
S ′ for spec if only f2 occurs. To this end, Line 1 of Algorithm 3 identiﬁes ms1 , a set of states
from where execution of one or more f1 transitions violates safety. On Line 2, we identify
ms2 that is a set of states from where execution of one or more f2 transitions violates safety.
Next, we compute mt, a set of transitions that reach ms1 ∪ ms2 or those that violates spec.
By calling Algorithm 1 in Line 6 of Algorithm 3, we obtain P1 with invariant S1 , where P1
is FF weak multitolerant to f1 and f2 . In the loop of Lines 7-12, we reconstruct transitions
to ensure that S1 , P1 ∪ f1 maintains spec from T1 , S1 , P1 ∪ f2 maintains spec from
T2 on Line 9. To guarantee that from each state outside S1 there is a path that reaches
a state in S1 and there are no cycles in states outside S1 , we update P1 by calling the
algorithm Ensure Recovery in Lines 10 and 11. Algorithm 4 in Figure 5.4 captures the details
134

of Ensure Recovery. Speciﬁcally, Ensure Recovery is deﬁned in such a way that from each
state outside S1 there is a path that reaches a state in S1 , and there are no cycles in states
outside S1 . As shown in Line 13, if the invariant becomes an empty set after reconstruction,
we cannot ﬁnd a MM weak multitolerant program P ′ . Details are as shown in Algorithm 3
in Figure 5.3.
Theorem 5.4.2. The algorithm Add MM Weakmulti is sound.
Proof. To show the soundness of our algorithm, we need to show that constraints C1, C2
and C3 of the Problem 5.1.1 are satisﬁed.

1. S1 ⊆ S. By the correctness of Add FF Weakmulti, S1 obtained at Line 6 satisﬁes C1.
Since the following steps do not add any state to S1 , C1 is preserved by the ﬁnal
program P1 .
2. (s0 , s1 ) ∈ P1 ∧ s0 ∈ S1 ⇒ (s0 , s1 ) ∈ P. By the correctness of Add FF Weakmulti, P1
obtained at Line 6 of Algorithm 2 satisﬁes C2. Since the remaining steps do not add
any transition to P1 , C2 is preserved by the ﬁnal program P1 .
3. P1 is MM weak fault-tolerant to spec from S1 . Consider a computation c of P1 that
starts from a state in S1 : From Part 1 of this proof, c starts in a state in S, and
from Part 2, c is a computation of P. It follows that c satisﬁes spec. Hence, every
computation of P1 that starts from a state in S1 is in spec, i.e, P1 reﬁnes spec from
S1 . Next, we discuss the following two cases:
(a) Masking f1 -tolerance to spec from S1 . To show this, we need to show the
following three properties.
135

Algorithm 3: Add MM Weakmulti
Input: P:transitions,
f1 , f2 :faults of two classes
that need masking f -tolerance,
S: state predicate, spec: safety speciﬁcation
Ouput: If successful, a fault-tolerant P ′ with invariant S ′
that is weak multitolerant to f1 and f2
ms1 := {∫0 : ∃∫1 , ∫2 , . . . ∫n :
(∀j : 0 ≤ j < n : (∫j , ∫j + 1 ) ∈ f1 )∧
(∫n − 1 , ∫n ) violates spec};
(1)
ms2 := {∫0 : ∃∫1 , ∫2 , . . . ∫n :
(∀j : 0 ≤ j < n : (∫j , ∫j + 1 ) ∈ f2 )∧
(∫n − 1 , ∫n ) violates spec};
(2)
mt := {(∫0 , ∫1 ) : ((∫1 ∈ ms1 ∪ ms2 ) ∨
(∫0 , ∫1 ) violates spec)};
(3)
T1 := S − ms1
(4)
T2 := S − ms2
(5)
P1 , S1 := Add F F W eakmulti(P, f1 , f2 , S, spec)
(6)
′
′
WHILE(T1 == T1 ∧ T2 == T2 )
(7)
′
′
{T1 := T1 , T2 := T2
(8)
P1 := {(s0 , s1 )|(s0 ∈ S1
⇒ (s0 , s1 ) ∈ P1 )
∧(s0 ∈ T1 ⇒ s1 ∈ T1 )
∧(s0 ∈ T2 ⇒ s1 ∈ T2 )} − mt
(9)
(10)
T1 , S1 , P1 := Ensure Recovery (P1 , f1 , T1 , S1 )
T2 , S1 , P1 := Ensure Recovery (P1 , f2 , T2 , S1 )}
(11)
IF (S1 = ∅)
(12)
return P1 , S1
(13)
ELSE
(14)
declare no weak multitolerant program P ′ exists,
(15)
return ∅, ∅
(16)
Figure 5.3: Model revision for adding MM weak multitolerance.

136

Algorithm 4: Ensure Recovery
Input: P:transitions,
f :fault actions, S: state predicate, T : state predicate
Goal: Find T ′ , P ′ and S ′ such that T ′ ⊆ T ,
S ′ ⊆ S, T ′ is closed in P ′ ∪ f ,
every computation of P ′ from T ′ reaches a state in S ′ .
S1 , S2 := S, S;
(1)
WHILE (S1 == S)
(2)
{S1 := S}
(3)
//Rank(s0 ) = length of the shortest computation preﬁx of
// P from s0 to some state in S.
// Rank(s0 ) = ∞ means S is not reachable from s0 .
T := T − {s0 |Rank(s0 ) = ∞}
(4)
T := T − {s0 |∃s1 : (s0 , s1 )
∈ f, s0 ∈ T, s1 ∈ T }
/
(5)
S := S ∧ T
(6)
WHILE(∃s0 : s0 ∈ S :
∀s1 : s1 ∈ S : (s0 , s1 ) ∈ P)
/
(7)
S := S − {s0 }
(8)
P1 := removeCycles(P, S, T )
(9)
//returns P1 such that P1 ⊆ P
//P1 |S == P|S, P1 |(T − S) is acyclic,
//and ∀s0 : s0 ∈ T : S is reachable from s0 in P1
//There are several possible implementations
//and any one of them is acceptable.
//One possible implementation is to rank each state
//based upon the shortest path from that state to a state inside S,
//and then remove these transitions that do not decrease the rank.
Return P1 , S, T
(10)
Figure 5.4: The recovery algorithm.

137

• T1 is closed in P1 ∪ f1 . Closure of T1 in P1 is by construction. Regarding
closure of T1 in f1 , observe that there is no change in the last iteration of the
loop in Lines 7-12 of Algorithm 3. Thus, T1 is closed in f1 .
• P1 ∪ f1 maintains spec from T1 . We let the fault-span T1 to be the set of
states reached in any computation of P1 ∪ f1 that starts from a state in S1 .
Consider a computation preﬁx c of P1 ∪ f1 that starts from a state in T1 .
From the deﬁnition of T1 there exists a computation preﬁx c′ of P1 ∪ f1
such that c is a suﬃx of c′ and c′ starts from a state in S1 . If c′ violates
the safety of spec then there exists a preﬁx of c′ , say s0 , s1 , . . . , sn , such
that s0 , s1 , . . . , sn violates the safety of spec. Let s0 , s1 , . . . , sn be
the smallest such preﬁx, it follows that (sn−1 , sn ) violates the safety of
spec and hence, (sn−1 , sn ) ∈ mt. By construction, P1 does not contain
any transition in mt (See Line 9 of Algorithm 2). Thus (sn−1 , sn ) is a
transition of f1 . If (sn−1 , sn ) is a transition of f1 then sn−1 ∈ ms1 and
(sn−2 , sn−1 ) ∈ mt1 and hence, (sn−2 , sn−1 ) is a transition of f1 . By
induction, if s0 , s1 , . . . , sn violates the safety of spec, s0 ∈ ms1 , which
is a contradiction since s0 ∈ S1 and S1 ∩ ms1 = ∅ (Guaranteed by Line 6
of Algorithm 2 since the following steps don’t add any state in S1 ). Thus,
each preﬁx of c′ maintains spec. Since c is a suﬃx of c′ , each preﬁx of c also
maintains spec. Thus, P1 ∪ f1 maintains spec from T1 .
• Every computation of P that starts from a state in T1 eventually reaches a
state of S1 . The Ensure Recovery algorithm only updates P1 by removing
some transitions from P1 in Steps 4, 5 and 10 of Algorithm 3. By con138

struction, Ensure Recovery removes all the deadlock states in Steps 7 and 8
recursively. Also the function RemoveCycles is deﬁned in such a way that
from each state outside S1 there is a path that reaches a state in S1 , and
there are no cycles in states outside S1 .
(b) Masking f2 -tolerant to spec from S1 . The argument is the similar as Part 3a.

Remark 5.4.1. Note that Add MM Weakmulti is sound but not complete. One reason for
this is that the function removeCycles has several possible implementations and the choice
of transitions removed in removeCycles(Line 10 of Algorithm 3) for ensuring recovery in
the presence of f1 can prevent recovery in the presence of f2 . If one were to consider all
possible choices of removeCycles, then the time complexity would become exponential in the
state space.
Remark 5.4.2. Algorithm Add MM Weakmulti can be extended to design a multitolerant
program that is subject to 3 or more fault classes. Towards this, we specify ms3 for the
third fault class and the corresponding msi for the ith fault class. Then we calculate mt
like Line 3 in Algorithm Add MM Weakmulti to speicify these transitions that lead to state
in

msi (n is the number of fault classes). Moreover, we need to calculate the correi=1,...,n
sponding Ti for the ith fault class. In particular, Ensure Recovery (Lines 10−11 of Algorithm
2) will be repeated for each fault class.

5.4.2

Application of Add MM Weakmulti

Now we extend the example used in Section 5.3.1 to demonstrate how the algorithm 3
facilitates automated synthesis of MM weak multitolerance. We extend the program in
139

Section 5.3.1 to be the program which consists of three sensors.
As discussed in Section 5.3.1, the action of the input program is as follows:
Pmi : out == ⊥ −→ out := sensor0;

In this program, we consider three classes of faults: a transient fault that occurs at
sensor 0, a transient fault that occurs at sensor 1 and a transient fault that occurs at sensor
2. When transient faults corrupt one sensor, the sensor changes its value to the opposite of
that of the input. Thus, the fault actions for this program are as follows:
f1 : true −→ sensor0 := 1 − in;
f2 : true −→ sensor1 := 1 − in;
f3 : true −→ sensor2 := 1 − in;

The safety speciﬁcation requires that the program never reaches a state where the output
of the system is not the same as the real status of the environment. Hence, the safety
speciﬁcation states that the program should never reach a state where the following formula
is true:
specmm = (out == 1 ∧ in == 0) ∨ (in == 1 ∧ out == 0)

The legitimate states of the program are those where the values of the majority of sensors
are equal to the value of the real status in the environment and the output is equal to the
value of the real status in the environment if the output is decided. Hence, the invariant is
as follows:
140

Invmm =
((sensor0 == sensor1 == in) ∨ (sensor0 == sensor2 == in)
∨(sensor1 == sensor2 == in)) ∧ (out == ⊥ ∨ out == in)

We extend our Add MM Weakmulti algorithm to revise the above input-output program
to be a MM weak multitolerant program with input Pmi , f1 , f2 , f3 , Invmm and specmm .
Speciﬁcally, our algorithm works as follows: after executing Add FF Weakmulti in Line 6 of
Algorithm 3, the program actions include:
out == ⊥ ∧ sensor0 == sensor1 −→ out := sensor0;
out == ⊥ ∧ sensor0 == sensor2 −→ out := sensor0;

The following steps of executing Ensure Recovery (Lines 10 and 11 of Algorithm 3) remove
these deadlock states if sensor0 is corrupted and add recovery transition:
out == ⊥ ∧ sensor1 == sensor2 −→ out := sensor1.
Hence, the ﬁnal program is as follows:
out == ⊥ ∧ sensor0 == sensor1 −→ out := sensor0;
out == ⊥ ∧ sensor0 == sensor2 −→ out := sensor0;
out == ⊥ ∧ sensor1 == sensor2 −→ out := sensor1;

141

5.5

Complexity Analysis of FM Weak Multitolerance

In this section, we investigate the synthesis problem for weak multitolerant programs for the
case where program is subject to two classes of faults f1 and f2 for which respectively failsafe
and masking fault-tolerance is required, that is fδ = { f1 , f ailsaf e , f2 , masking } in
Deﬁnition 5.1.1. This synthesis problem is NP-complete! This result is also surprising since
the corresponding problem for strong multitolerance is in P.
Theorem 5.5.1. The problem of synthesizing F M weak multitolerant programs from their
fault-intolerant version is NP-complete.
Proof. Given is a program P, with its invariant S, its speciﬁcation spec, and two classes
of faults f1 and f2 .

Since demonstrating membership to NP is trivial, we only illus-

trate that the synthesis problem identiﬁed in Deﬁnition 5.1.1 is NP-hard when fδ

=

{ f1 , f ailsaf e , f2 , masking }.
We construct the mapping by changing that part of proof for Theorem 5.4.1 as follows:
• Replace fm1 fault transitions with transitions of f2 .
• Replace fm2 fault transitions with transitions of f1 .
Next, we show that the given SAT formula is satisﬁable if and only if there exists a
solution to the F M weak multitolerant synthesis problem.
• (=⇒) By the proof of Theorem 5.4.1, if the given SAT formula is satisﬁable then there
exists a program that is masking fault-tolerant to ff 1 and fm1 . Also, by Deﬁnitions
2.3.2 and 2.3.3, a program P that is masking f1 -tolerant from S for spec is failsafe
f1 -tolerant from S for spec. Hence, if the given SAT formula is satisﬁable then there
is a solution to the corresponding instance of the F M weak multitolerance synthesis.
142

• (⇐=) This proof is identical to the corresponding proof for Theorem 5.4.1, hence we
omit it.

5.6

Complexity Analysis of MN Weak Multitolerance

In this section, we present a synthesis algorithm for weak multitolerant programs for the case
where a program is subject to two classes of faults f1 and f2 for which respectively masking
and nonmasking fault-tolerance is required, that is fδ = { f1 , masking , f2 , nonmasking }
in Deﬁnition 5.1.1. We show that such a MN (Masking-Nonmasking) weak multitolerant
program can be synthesized in polynomial time in the state space. This sound and complete
algorithm also can be easily generalized for the case where fδ includes one class of faults for
which masking fault tolerance is desired and two or more fault classes for which nonmasking
fault tolerance is desired. Note that from the results in Section 5.4, if masking fault tolerance
is desired for two or more classes of faults, then the problem is NP-complete.
Given is a program P, with its invariant S and its speciﬁcation spec. Our objective is
to synthesize a program P ′ , with invariant S ′ that is weak multitolerant to fδ . By definition, P ′ must be masking f1 -tolerant. P ′ must also be nonmasking f2 -tolerant. The
algorithm for MN weak multitolerance utilizes the algorithm Add Masking (from [134]) that
adds masking fault-tolerance to a single class of faults. Add Masking returns the synthesized program P ′ , its invariant S ′ and its fault-span T ′ such that P ′ is masking faulttolerant to S ′ and T ′ is the fault-span used to prove this in Deﬁnition 2.3.2. The algorithm Add MN Weakmulti only relies on the correctness (i.e., soundness and completeness)
of Add Masking. It does not rely on the actual implementation of Add Masking. (For reader’s
143

Algorithm 5: Add MN Weakmulti
Input: P:transitions,
fδ1 : { f1 , masking , f2 , nonmasking }
S: state predicate, spec: safety speciﬁcation
Ouput: If successful, a fault-tolerant P ′ with invariant S ′
that is weak multitolerant to f1 and f2
(1)
P1 , S ′ , T1 := Add Masking(P, f1 , S, spec);
′
if (S = {})
declare no weak multitolerant program P ′ exists,
(2)
return ∅, ∅;
(3)
′
P := P1 |T1 ∪ {(∫0 , ∫1 ) : (∫0 , ∫1 )
∈ P, ∫0 ∈ T1 ∧ ∫1 ∈ T1 };
/
(4)
RETURN P ′ , S ′
(5)
Figure 5.5: Model revision for adding MN weak multitolerance.
convenience, the algorithm from [134] is included in the Appendix. However, we note that
the proof of Add MN Weakmulti only relies on the correctness of Add Masking and not on
its details.) Thus, Add MN Weakmulti ﬁrst invokes Add Masking on Line 1 with parameters
(P, f1 , S, spec). As shown in Line 2, if the invariant becomes an empty set after reconstruction in Line 1, we cannot ﬁnd an MN weak multitolerant program P ′ . If the invariant is
not empty, we re-compute the set of program transitions in Line 3. Algorithm 5 in Figure
5.5 describes the details.
Algorithm 5 is the same as that for adding strong multitolerant program [71].
Theorem 5.6.1. The algorithm Add MN Weakmulti is sound and complete.
Proof. By correctness of Add Masking, P1 satisﬁes the constraints of Deﬁnition 5.1.1 for the
case where fδ = { f1 , masking }. Since Step 3 does not add or remove any state of S or
a transition from P|T , these constraints are preserved by the ﬁnal program P ′ . Hence, to
complete the proof of this theorem, next, we show that P ′ is nonmasking f2 -tolerant. By
deﬁnition of masking fault-tolerance, every computation of P1 that starts in a state in T1
reaches a state in S ′ . If f2 perturbs the program to a state outside T1 then the recovery
144

transitions added in Step 3 will recover the program to a state in T1 from where it can utilize
the recovery paths inside P1 to reach S ′ . Thus, P ′ is nonmasking f2 -tolerant.
Our algorithm declares that an MN weak multitolerant program does not exist only
when Add Masking does not ﬁnd a masking f1 -tolerant program. Hence, completeness of
Algorithm 5 follows from the completeness of Add Masking.

5.7

Complexity Analysis of NN Weak Multitolerance

The algorithm Add NN Weakmulti for NN (Nonmasking-Nonmasking) weak multitolerant
problem is identical to Algorithm 5, except that instead of invoking Add Masking on Line 1
of Algorithm 5, we call Add Nonmasking(from [134]).
Theorem 5.7.1. The algorithm Add NN Weakmulti is sound and complete.
Proof. Since the proof is similar to the proof of algorithm 5, we omit it.

5.8

Comparison of Feasibility of Strong Multitolerance
and Weak Multitolerance

In this section, we compare the relation between strong multitolerance [71] and weak multitolerance.
Speciﬁcally, we point out that if a program is strong multitolerant then it is also weak
multitolerant, although the reverse is not necessarily true. We also identify circumstances
where solvability for adding weak and strong multitolerance diﬀers. In particular, we show
that for high atomicity programs, (i.e., programs that can read all the variables and write all
145

the variables in one atomic step), if adding MN (respectively, NN or FN) weak multitolerance
to a program is feasible then adding MN (respectively, NN or FN) strong multitolerance is
also feasible. However, the same result does not apply for MM, MF or FF multitolerance.
Deﬁnition 5.8.1. Let fδ = { fi , li | 0 < i ≤ n, li ∈ {f ailsaf e, nonmasking, masking}}
where n ≥ 0. Program P is strong multitolerant to fault set fδ from S for spec iﬀ the
following conditions hold:
1. (In the absence of faults) P |=S spec;
2. P is masking fm -tolerant from S for spec;
3. P is nonmasking fn -tolerant from S for spec;
4. P is failsafe ff -tolerant from S for spec,
where
• fm =

{fi | fi , masking ∈ fδ },

• fn =

{fi | fi , nonmasking ∈ fδ ∨ fi , masking ∈ fδ },

• ff =

{fi | fi , f ailsaf e ∈ fδ ∨ fi , masking ∈ fδ }.

Observation 5.8.1. If a program P is strong multitolerance to fδ from S for spec, then P
is weak multitolerance to fδ from S for spec.
Observation 5.8.2. It is possible that P is weak multitolerance to fδ from S for spec
however P is not strong multitolerance to fδ from S for spec.
Observation 5.8.3. Program P is weak multitolerant to fault set fδ from S for spec where
fδ = { f1 , nonmasking , f2 , f ailsaf e } iﬀ P is strong multitolerant to fault set fδ from S
for spec .
146

Similar to Deﬁnition 5.1.1, we deﬁne the problem of strong multitolerance synthesis as
follows:
Deﬁnition 5.8.2. The Strong Multitolerance Synthesis Problem.
Given P, S, spec and fδ : Identify P ′ and S ′ such that
• (C1) S ′ ⊆ S,
• (C2) (s0 , s1 ) ∈ P ′ ∧ s0 ∈ S ′ ⇒ (s0 , s1 ) ∈ P, and
• (C3) P ′ is strong multitolerant to fδ from S ′ for spec.
We state the corresponding decision problem as follows:
Deﬁnition 5.8.3. The Strong Multitolerance Decision Problem.
Given P, S, spec and fδ : Does there exist a program P ′ , with its invariant S ′ that
satisﬁes the requirements of Deﬁnition 5.8.2?
Now we consider the following question: if the weak multitolerance decision problem for
the given program P is aﬃrmative, then is the strong multitolerance decision problem for
P aﬃrmative?
To demonstrate the answer to the above question, we discuss the feasibility of stepwise
design of weak multitolerant programs and compare it with the feasibility of stepwise design
of strong multitolerant programs in the following combinations.

5.8.1

Feasibility Comparison of FF Strong/Weak Multitolerance

In this section, we show that there are instances where adding FF weak multitolerance is
feasible although adding FF strong multitolerance is not feasible. We can show this with a
147

Figure 5.6: A case of FF Weak multitolerance (not Strong).

simple example illustrated in Figure 5.6. In this example, the input is as follows. The state
space of the input program is {s0 , s1 , s2 }, input program consists of only one transition
(s0 , s0 ) and its invariant contains only one state s0 . The transition (s1 , s2 ) violates safety.
The class of fault f1 includes only one transition (s0 , s1 ) and fault f2 includes only one
transition (s1 , s2 ).
Clearly, if faults f1 and f2 occur in the same computation then safety can be violated
from state s0 , the only state in the invariant. Hence, adding FF strong multitolerance is
not possible in this example, i.e., the answer to the decision problem 5.8.3 is false. However, adding FF weak multitolerance is feasible. In fact, the program P itself is FF weak
multitolerant.

5.8.2

Feasibility Comparison of MM Strong/Weak Multitolerance
and MF Strong/Weak Multitolerance.

In this section, we show that there are instances where adding MM (respectively, MF) weak
multitolerance is feasible although adding MM (respectively, MF) strong multitolerance is
not feasible. To illustrate this, we use the input obtained by mapping the SAT formula as
discussed in Section 5.4. As shown in Section 5.4, if we begin with a SAT formula that
is satisﬁable, then the answer to the decision problem for adding MM weak multitolerance
148

(Problem 5.1.2) is aﬃrmative. Next, we show that for this input, the answer to the decision
problem for adding MM strong multitolerance (Problem 5.8.3) is always false. To show this
observe that, in Figure 5.2, any recovery path to the invariant must go through either gi or
hi from some i (1 ≤ i ≤ n where n is the number of propositional variables in the instance of
the SAT problem in Section 5.4). If f1 and f2 occur in the same computation, safety will be
violated when either fault transition (gi , ai ) or (hi , bi ) is executed. Moreover, the agreement
program in Section 5.2.4 is another example of cases where adding strong multitolerance is
impossible while weak multitolerance can be designed.
We note that using a similar argument, we can show that there are instances where
adding MF weak multitolerance is feasible although adding MF strong multitolerance is not
feasible.

5.8.3

Feasibility Comparison of MN Strong/Weak Multitolerance
and NN Strong/Weak Multitolerance.

Since the algorithm that is used to synthesize the MN/NN weak multitolerance is the same as
that of MN/NN strong multitolerance, the synthesis problem of MN/NN weak multitolerance
and MN/NN strong multitolerance have the same feasibility property.

5.9

Discussion

In this section, we discuss issues related to the way we model faults, the practical signiﬁcance
of the proposed work in this chapter and some limitations of our approach.
Our work models the impact of faults on programs as a set of transitions (i.e., a nondeterministic ﬁnite-state machine) that perturbs the program state. Designers can identify
149

the classes of faults dependent upon the domain of application and the requirements of the
system users. To generate a fault-class, ﬁrst we identify the faults that may perturb the
program at hand. Fault forecasting methods [142] can be useful to achieve this objective.
Then, we formally characterize each of these faults as state perturbations. Finally, we group
the faults into fault classes based on the corresponding level of tolerance required to each
fault-class. The desired level of tolerance is based on the user requirements and feasibility
of providing that level of tolerance under system constraints/resources. As demonstrated in
this dissertation, using this method of fault modeling, one can represent Byzantine, crash,
message loss, input corruption, node leave and transient faults. Moreover, previous work
[15, 71, 163, 164, 186] uses this model to capture other classes of faults such as stuck-at,
omission, disk corruption and sensor failures. However, our fault model cannot capture any
types of faults that cannot be represented as a ﬁnite-state machine (e.g., impact of external
disturbances on aircrafts, eﬀect of wing damage on ﬂight control systems, mechanical systems
involving several masses, springs and dampers).

Adding weak multitolerance enables a method for component-based design of multitolerant programs. Speciﬁcally, consider an existing program P that provides weak multitolerance
to the fault classes f1 , · · · , fn−1 . If designers detect a new fault-class fn after the design
and implementation of P, then it is desirable to have a revised version of P, denoted Pc ,
that provides weak multitolerance to f1 , · · · , fn−1 and fn . To design Pc , developers have
two options: redesign a new multitolerant program from scratch or simply design a component C and compose it with P such that the resulting composition preserves fault tolerance
to faults fi , for 1 ≤ i ≤ n − 1, and enables a speciﬁc level of fault tolerance when fn
occurs. In fact, the component C has to ensure that the computations of P ∪ fn meet the
150

requirements of the desired level of fault tolerance; i.e., no guarantees are provided for the
computations of the composed program Pc in the presence of faults f1 , · · · , fn−1 . While in
our previous work [71] we present stepwise algorithms for the design of strongly multitolerant
programs, there are several cases where strong multitolerance cannot be added. Moreover,
component-based design of weak multitolerance is easier since no guarantees are provided
for the computations in which multiple faults occur.

The proposed approach in this chapter has several practical and methodological signiﬁcance regarding the development of concurrent software. First, we present a paradigm shift
with respect to the traditional design and veriﬁcation methods, where developers design
fault-tolerant programs ﬁrst and verify their correctness after the fact. Automated addition
of multitolerance is especially beneﬁcial since it is often diﬃcult to anticipate all classes
of faults in the early stages of design due to the complex and dynamic nature of today’s
systems. Thus, the proposed approach enables a systematic method for automated addition
of fault tolerance to existing programs. Second, automated addition of weak multitolerance
exploits computational redundancy before resorting to resource redundancy. In particular,
the algorithms presented in this chapter enable the design of several intermediate levels of
multitolerance that developers can consider in their risk/cost analysis so using resource redundancy becomes a last-resort option. Third, the notion of weak multitolerance provides
an impossibility test for designers. That is, if a weakly multitolerant version of an existing
program does not exist, then strong multitolerance cannot be added to that program too.
Fourth, our algorithms can be integrated in model checkers to facilitate the detection and
correction of conﬂicts between several levels of fault tolerance. Last but not least, while our
focus in this chapter is on high atomicity programs, the proposed method can be used for the
151

design of highly resilient network protocols where processing nodes might have read/write
restriction with respect to the variables of other nodes. Thus, this work can signiﬁcantly
improve the way network protocols are designed.

The approach presented in this chapter has some constraints in terms of the input to the
synthesis algorithms and tool development. First, thus far we have investigated the problem
of adding multitolerance for ﬁnite-state programs; i.e., the problem of adding multitolerance to inﬁnite-state programs is still open. Second, during the addition of multitolerance
our algorithms preserve only the properties that can be captured in the linear topological
characterization of speciﬁcations by Alpern and Schneider [6]. For instance, if the properties of the intolerant program are speciﬁed in the Computation Tree Logic [76], then we
do not guarantee that they will be preserved in the absence of faults. Third, the input
program should be maximal. That is, from any state, the program should have the maximum number of non-deterministic outgoing transitions. The maximality of the intolerant
programs increases the chances of success in adding multiple levels of fault tolerance. For
example, consider the program synthesized in Section 5.3.1. While the guard of the action
out == ⊥ ∧ sensor0 == sensor1 = in −→ out = sensor0 is never enabled, we included it in
the multitolerant program since such transitions may be useful for adding new levels of fault
tolerance if new classes of faults are detected. Fourth, our model of programs is an abstract
model in that we do not add multitolerance to C/C++/Java programs. Nonetheless, we can
exploit the existing model extraction techniques [50,111,227] that are used in model checking
where a ﬁnite model is generated from a C/C++/Java program and then multitolerance is
added to the extracted model. Fifth, we plan to reuse the tools that we have developed for
the addition of a single level of fault tolerance to concurrent programs [72] for the automated
152

design of multitolerant programs. While the time/space complexity of synthesis is a bottleneck for tool development, we have developed distributed [68] and symbolic [37] techniques
that increase the scalability of algorithms for the addition of fault tolerance signiﬁcantly
(e.g., for programs with 2100 reachable states).
Finally, in cases where the algorithms for adding weak multitolerance declare failure,
the design of weak multitolerance becomes impossible under the constraints/resources of the
program at hand. One solution may be to add more redundancy. It is possible to revise
the algorithms in this chapter, so additional redundancy could be introduced automatically.
However, we believe that addition of redundancy should be handled manually, since it requires resources and an automation algorithm cannot determine whether these resources are
reasonable and available. Another possible solution may be to change the expected level of
tolerance. For example, if a MM weak multitolerance is unfeasible, then system may provide
MN weak multitolerance. Again, the choice of this depends upon whether the reduced level
of tolerance is acceptable to system users. As stated in Section 6.1, this involves a tradeoﬀ
between the cost and level of fault tolerance. Automation algorithms, like the ones in our
work, allow designers to identify a fault-tolerant program with the given choices in terms of
redundancy and level of tolerance.

5.10

The Tool RM 2: Model Revision for Adding Multitolerance

In this section, we describe RM 2 , a tool for Model Revision for adding Multitolerance.
RM 2 is an extension of the tool SYCRAFT [37], which is a tool of automatically reparing
153

program for adding fault-tolerance. It augments SYCRAFT to address the problem of
automatically revision for adding multitolerance. In particular, RM 2 considers two types
of multitolerance, including strong multitolerance [137] and weak multitolerance proposed
in this chapter. Speciﬁcally, given a fault-intolerant program, a set of fault actions and
speciﬁcations, the tool generates a fault-tolerant program via a symbolic implementation of
the respective algorithms.
RM 2 is written in C++ and the algorithms are implemented based on GLU/CUDD
package [213], which provides functions to manipulate Binary Decision Diagrams (BDDs).
Our tool uses Flex [146] to read the given input ﬁles, and employs Bison [56] to parse the
grammar. Flex is a tool for generating scanners: programs which recognized lexical patterns
in text. Bison is a general-purpose parser generator that converts a grammar description for
an LALR(1) context-free grammar into a C program to parse that grammar.

5.10.1

Input Program Language

Similar to SYCRAFT, every input program to RM 2 consists of eight modules, including
declarations for program name, constraint, variables, processes, faults, invariant, safety speciﬁcation and prohibited transiton predicates. A typical RM 2 input program structure is as
follows:

program

Program name;

const
...
var
...
154

process

the 1st process name

. . . process actions . . .
process

the 2nd process name

. . . process actions . . .
...
endprocess
ﬀault

the 1st fault name

. . . fault actions . . .
endﬀault
sfault

the 2nd fault name

. . . fault actions . . .
endsfault
invariant
...
speciﬁcation
...
prohibited
...

Technically, the input program is deﬁned as follows.
prog ::=

progdecl
f f aultdecl

constdecl

vardecl

procdecl

sf aultdecl

invariantdecl

specdecl

prohibiteddecl
155

Module progdecl declares the program name and is deﬁned as follows.
progdecl ::=

program identif ier

An identiﬁer can be any combination of alphanumeric characters, starting with an alphabet.
And, case matters.
Module constdecl declares constants.The grammar is deﬁned as follows.
constdecl ::=
constdecl ::=

const constlist | empty
concreteconst | constlist concreteconst

concreteconst ::= int identif ier ” = ” guard ”; ”|
boolean identif ier ” = ” guard ”; ”
empty :=
guard can be any combination of quantiﬁed Boolean and arithmetic expressions. An
example of constant declaration is as follows.
const
int N = 2;
int B = 2;

Module vardecl declares variables, which are of two types: integer (nonnegative) and
boolean. Each interger is declared along with its domain, where the lower bound must be
zero. The grammar is deﬁned as follows.
156

vardecl ::=

var varlist

constdecl ::=

concretevar | varlist concretevar

concretevar ::=

booldecl | intdecl

booldecl ::=

boolean identif ier ”; ”|
boolean identif ier ”.” range ”; ”

intdecl ::=

int identif ier ” : domain” range ”; ”|
int identif ier ”.” range ” : domain” range ”; ”

range ::=

rangedelim ”..” rangedelim

rangdelim ::=

guard

An example of constant declaration is as follows.
var
boolean bg;
boolean b.0..N;
int i: domain 0..5;
int j.0..N: domain 0..2;

A program consists of a set of processes. Module processdecl declares process, which
includes: (1) a set of actions, (2) a fault section which describes fault actions, (3) a prohibited
section which deﬁnes a set of actions that is not allowed to execute, (4) a set of readable
variables and (5) a set of writeable variables. The synatx of both program actions and fault
actions are of form guard → statement. The grammar is deﬁned as follows.
procdecl ::=

procstruct | procdecl procstruct
157

procstruct ::=

process identif ier procrange actions f aultdecl
prohibited rwrestrict endprocess

procrange :=

”:” range | empty

In the above deﬁnition, rwrestrict deﬁnes the set of variables that are allowed to read and
the set of variables that are allowed to write.
Module f aultdecl deﬁnes faults that the process is subject to. As illustrated in Chapter
3, fault transitions are modeled using guarded commands. Thus, fault transitions have the
same syntax with the program actions. The grammar is deﬁned as follows.
f aultdecl ::=

f f aultdecl | sf aultdecl | f f aultdecl sf aultdecl | empty

f f aultdecl ::=

ﬀaults actions endﬀault| empty

sf aultdecl ::=

sfaults actions endsfault| empty

Module invariantdecl deﬁnes invariant predicate (see Deﬁnition 2.2.4) of the given
program. Module specdecl deﬁnes safety speciﬁcation (see Deﬁnition 2.2.1). And, module
prohibiteddecl deﬁnes transitions that are not allowed to execute. The given input program
may not have prohibited transitions. Speciﬁcally, rules for these modules are deﬁned as
follows.
invariantdecl ::=

invariant

guard ”; ”

specdecl ::=

speciﬁcation

prohibiteddecl ::=

prohibited

guard ”; ”| empty
guard ”; ”| empty

158

5.11

Functionality and Output Program

RM 2 implements algorithms and heuristics of model revision for adding weak multitolerance
proposed in this chapter, as well as algorithms and heuristics for adding strong multitolerance
in [137].
The output of RM 2 is a fault-tolerant program in terms of unchanged actions, revised
actions and recovery actions, as deﬁned in Section 2.4.

5.11.1

Example 1: Two-Sensors Program

The details of this example is in Section 5.3.1. The input ﬁle is shown in Figure 5.7. After
running RM 2 with paramater wﬀ, where w denotes weak and ﬀ denotes failsafe-failsafe, the
output is a fault-tolerant version shown in Figure 5.8.

5.11.2

Example 2: Byzantine Agreement

We now use the Byzantine Agreement problem to describe the functionality of RM 2 . Figure
5.9 speciﬁes program actions. Figure 5.10 speciﬁes fault actions. After running RM 2 with
paramater smn, where s denotes weak and mn denotes masking-nonmasking, the output is
a fault-tolerant version shown in Figure 5.11.

5.12

Summary

In this chapter, we addressed the problem of automatic revision of multitolerant program.
The problem of model revision for adding multitolerance is motivated by the observation
that a program is often subject to multiple classes of faults and the level of tolerance pro159

program SensorFF;
const N = 1;
var
int inp.0..N: domain 0..1;
int out: domain 0..2;
process prog
(out == 2) ⇢ out := inp.0;
[]
(out == 2) ⇢ out := inp.1;
read: out, inp.0..N;
write: out;
endprocess
ﬀault Corruption
true ⇢ (inp.0 := 1) [] (inp.0 := 0);
endﬀault
sfault Corruption1
true ⇢ (inp.1 := 1) [] (inp.1 := 0);
endsfault
invariant
((out == 2) & (forall p, q in 0..N:: (inp.p == inp.q)))
(exists a, b in 0..N: (a != b):: ((out == inp.a) & (inp.a == inp.b)));
speciﬁcation
(exists a, b in 0..N: (a != b):: ((inp.(a)’ == inp.(b)’)
& (inp.(a)’ != out’) & (out’ != 2)));
prohibited
(out != 2) & (out != out’);

Figure 5.7: Input ﬁle of two-sensors program.

160

UNCHANGED ACTIONS:
————————————————————————————–
————————————————————————————–
REVISED ACTIONS:
————————————————————————————–
1- (inp.0 == 1) & (inp.1 == 1) & (out == 2) ⇢ (out := 1)
2- (inp.0 == 0) & (inp.1 == 0) & (out == 2) ⇢ (out := 0)
————————————————————————————–
NEW RECOVERY ACTIONS:
————————————————————————————–

Figure 5.8: Output of two-sensors Program.

vided to them is often diﬀerent. The problem of adding weak multitolerance considers the
special case where the faults are independent, i.e., occurrence of faults from multiple classes
are unlikely to happen simultaneously and, hence, it suﬃces to ensure that the program
provides the required tolerance to each fault-class. By contrast, the strong multitolerance
considered in [71] focuses on scenarios where faults from several fault classes can happen
simultaneously. For this reason, we illustrated that there are several instances where adding
weak multitolerance is feasible although adding strong multitolerance is not feasible.
We investigated the complexity of adding weak multitolerance, To this end, we considered
ﬁve possible combinations MM, FM, FF, MN and NN. In each combination, the ﬁrst letter
indicates the fault-tolerance level for the ﬁrst class of faults, denoted f1 , and the second letter
indicates the fault-tolerance level for the second class of faults f2 . We found a surprising
result that if masking fault-tolerance is desired for f1 and masking (or failsafe) fault-tolerance
is desired for f2 , then adding weak multitolerance is NP-hard (in program state space). This
result is counterintuitive since the corresponding problem for adding strong multitolerance
161

program Byzantine Agreement;

const N = 2;
var
boolean bg;
boolean b.0..N;
boolean f.0..N;
int dg: domain 0..1;
int d.0..N: domain 0..2;

process j: 0..N
{non-Byzantine non-general process copies value from the general }
((d.j == 2) & !f.j & !b.j) ⇢ d.j := dg;
[]
non-Byzantine non-general ﬁnalizes decision
((d.j != 2) & !f.j & !b.j) ⇢ f.j := true;
ﬀault Byzantine NonGeneral
(!bg) & (forall p in 0..N:: (!b.p)) ⇢ b.j := true;
[]
(b.j) ⇢ (d.j := 1) [] (d.j := 0) [] (f.j := false) [] (f.j := true);
endﬀault
sfault Byzantine NonGeneralAnother
true ⇢ (d.j := 1) [] (d.j := 0);
endsfault
prohibited !b.j & !b.(j)’ & f.j & ((d.j != d.(j)’) !f.(j)’);
read: d.0..N, dg, f.j, b.j;
write: d.j, f.j;
endprocess

Figure 5.9: Program actions of the Byzantine agreement program.

162

ﬀault Byzantine General
!bg & (forall p in 0..N:: (!b.p)) ⇢ bg := true;
[]
bg ⇢ (dg := 1) [] (dg := 0);
endﬀault
sfault Byzantine GeneralAnother
true ⇢ (dg := 1) [] (dg := 0);
endsfault
invariant
(!bg & (forall p, q in 0..N:(p != q):: ((!b.p) (!b.q))) &
(forall r in 0..N:: (!b.r ⇒ ((d.r == 2) — (d.r == dg)))) &
(forall s in 0..N:: ((!b.s & f.s) ⇒ (d.s != 2))))
(bg & (forall t in 0..N:: (!b.t))
& (forall a, b in 0..N:(a != b)::((d.a == d.b)
& (d.a != 2))));
speciﬁcation
(exists p, q in 0..N: (p != q):: (!b.(p)’
& !b.(q)’ & (d.(p)’ != 2) & (d.(q)’ != 2)
& (d.(p)’ != d.(q)’) & f.(p)’ & f.(q)’))
(exists r in 0..N:: (!bg’ & !b.(r)’
& f.(r)’ & (d.(r)’ != 2) & (d.(r)’ != dg’)));

Figure 5.10: Fault actions of the Byzantine agreement program.

163

UNCHANGED ACTIONS:
————————————————————————————–
1- (d0 == 2) & !(f0 == 1) & !(b0 == 1) ⇢ (d0 := dg)
————————————————————————————–
REVISED ACTIONS:
————————————————————————————–
2- (b0==0) &(d0==0)& (d1==1) & (f0==0) ⇢ (f0 := 1)
3- (b0==0) &(d0==0)& (d2==0) & (f0==0) ⇢ (f0 := 1)
4- (b0==0) &(d0==0)& (d1==0) & (f0==0) ⇢ (f0 := 1)
5- (b0==0) &(d0==0)& (d2==1) & (f0==0) ⇢ (f0 := 1)
————————————————————————————–
NEW RECOVERY ACTIONS:
————————————————————————————–
6- (b0==0) &(d0==0)& (d1==1) & (d2==1) & (f0==0) ⇢ (d0 := 1)
7- (b0==0) &(d0==0)& (d2==0) & (d2==0) &(f0==0) ⇢ (d0 := 0)
8- (b0==0) &(d0==0)& (d1==0) & (d2==1) &(f0==0) ⇢ (d0 := 1), (f0 := 1)
9- (b0==0) &(d0==0)& (d2==1) & (d2==0) &(f0==0) ⇢ (d0 := 0),(f0 := 1)

Figure 5.11: Output of two-sensors program.

can be solved in polynomial time. We also presented a sound heuristic for designing MM
weakly multitolerant programs. For other combinations FF, MN and NN, we illustrated that
the problem of synthesizing weak multitolerance is in P. To demonstrate this, we presented
a sound and complete algorithm for each combination.
We also investigated the relation between strong multitolerance and weak multitolerance.
Speciﬁcally, we argue that if a program is strongly multitolerant then it is also weakly multitolerant, although the reverse is not necessarily true. Moreover, we identify circumstances
where solvability of adding weak and strong multitolerance diﬀers. We show that (1) there
are situations where adding FF weak multitolerance is feasible although adding FF strong
multitolerance is not feasible; (2) there are instances where adding MF weak multitolerance
is feasible although adding MF strong multitolerance is not feasible, and (3) the synthesis
164

problem of MN/NN weak multitolerance and MN/NN strong multitolerance have the same
feasibility property. Besides, we implemented the algorithms for automatically repairing
programs for adding multitolerance in the tool RM 2 . Our tool augments the exsiting tools
for the problem of model revision for adding multitolerance.

165

Chapter 6
Automatic Revision of UML State
Diagrams

6.1

Introduction

The utility of formal methods in the development of high assurance systems has gained
widespread use in some segments of industry. In general, there are two main approaches
to utilizing formal methods in providing assurance: correct-by-veriﬁcation and correct-byconstruction. The ﬁrst approach, correct-by veriﬁcation, is the most commonly used approach. In this approach, one begins with an existing model (or program) and a set of
properties (speciﬁcation), and veriﬁes that the given program meets the given properties of
interest. An embodiment of this approach is model-checking [26,42,110,143]. Model checking
has been widely studied in the literature and proved eﬀective in identifying bugs in system
design. However, a pitfall of this approach is that if the manually designed model does not
satisfy the requirements then it is often unclear how one can proceed further. Hence, for
166

scenarios where an existing model needs to be revised to deal with the new environment,
the new identiﬁed faults, or the new requirements, one needs to manually develop the new
model if one needs to obtain assurance via modeling checking.
The second approach, correct-by-construction, utilizes the speciﬁcation of the desired
system and constructs a model that is correct. Examples of this approach include [10, 16,
22, 77, 139, 169]. These works diﬀer in terms of the expressiveness of the speciﬁcations they
permit and in terms of their complexity. This approach has proven eﬀective in constructing
a model that meets the requirements of speciﬁcation. However, a pitfall of this approach is
the loss of reuse (of the original model) and a potential for signiﬁcant increase in complexity.
To obtain the beneﬁts of these two approaches while minimizing their pitfalls, one can
focus on an intermediate approach, model revision. Model revision deals with the problem
where one is given a model/program and a property. And, the goal is to revise the model
to meet the requirement of the given property. Applications of model revision include scenarios where model checking concludes that the model violates the requirements speciﬁed
by the given property. Other applications include scenarios where an existing model needs
to be revised due to changes in requirements and/or changes in the environment. For this
reason, model revision has been studied in contexts where an existing model needs to be revised to add new fault-tolerance properties, safety properties, liveness properties and timing
constraints [35].
Since model revision generates a model through correct-by-construction, it provides assurance comparable to correct-by-construction approaches. Also, since it begins with an
existing model, it has the potential to reuse the existing model during the revision process.
Moreover, there are several instances where complexity of model revision is comparable to
167

that of model checking [35].

To permit wider utilization of model revision in practice, it is desirable to reduce the
learning curve faced by the designer in applying it. In particular, it is especially valuable if
the details of the model revision are hidden from the designer so that the designers continue
to work with a framework that they are familiar with.

One challenge of exploring such an approach of applying the model revision on the program design is that the current model-based design is often modeled in a non-computational
way, such as UML [199]. UML is a well-known modeling language utilized by industry, with
focus on system architecture as a means to organize computation, communication and constraints. Of UML diagram sets, state diagram is especially helpful when designers discuss
the logic architecture and workﬂow of the whole system with the need of independence from
a particular programming language. Since the UML state diagram is able to illustrate the
high-level overview of the whole system, it is widely used to model program design. With
this motivation, we propose a framework, namely MR4UM, for applying model revision on
program design modeled in the UML state diagram. To overcome the challenge of applying model revision on UML state diagram, MR4UM proposes an automatic transformation
mechanism from UML state diagram to the corresponding model in the underlying computational model (UCM). Subsequently, MR4UM applies model revision on the program modeled
in the UCM and generates a revised program modeled in the UCM. Finally, MR4UM converts the revised program modeled in UCM into the corresponding program design modeled
in UML state diagram.
168

6.2

Motivating Scenario

In this section, we present a motivating scenario from automotive industry. An adaptive
cruise control program (ACC) in automotive system is designed to control the distance between the vehicle and the front vehicle (called leader car) automatically. Figure 6.1 describes
the logic design of the ACC program.
As shown in Figure 6.1, when ACC program is on after initialization, the system enters
one of the three modes: active, ACC active or inactive. In the active mode, the sensor keeps
checking whether a leader car exists within a predeﬁned safe distance and the ACC program
keeps reading the sensor result. In the ACC active mode, the ACC program is working
to control the distance between the leader car and the current car because a leader car is
detected within the predeﬁned safe distance. In the inactive mode, the driver is pressing
the brakes and the adaptive cruise control relinquishes control to driver for manual control
without considering the eﬀect of cruise. By switching among these three modes, the ACC
system targets to ensure that the leader car is at a desired distance away and the relative
speed of the current car is zero with respect to the leader car. Moreover, if the leader car
does not exist then the system resumes the previous speed chosen by the driver.
Initially, when no leader car is detected by the sensor, the program stays in active mode.
If a leader car is detected within the predeﬁned safe distance, the ACC program enters into
ACC active mode. ACC program stays in ACC active mode until the leader car goes away or
the drive takes the brake. If the leader car goes away from the detectable distance, the ACC
system goes back to the active mode. Under ACC active mode or active mode, the ACC
system enters into inactive mode when driver taps the brake. When driver stops tapping the
brake and presses resume button, the ACC system enters into active mode if no leader car is
169

Figure 6.1: Logic design of the ACC program.
detected and ACC active mode if a leader car is detected. (For simplicity, we do not model
the resume action. We simply assume that releasing the break is equivalent to resume.)

6.2.1

Need for Model Revision for Tolerating Sensor Failure

While the ACC program in Figure 6.1 works correctly in the absence of faults, it may result
in undesired behavior if some faults aﬀect the sensor. Speciﬁcally, a sensor failure may cause
two problems: false positive and false negative. A false positive sensor may cause the sensor
to detect a non-existing leader vehicle causing the system to change the state from active
to ACC active. This would potentially cause the car to slow down unnecessarily to prevent
collision with a ﬁctitious car.
A more serious error can result in a false negative sensor that fails to detect a leader
170

vehicle. In this scenario, the car would stay in active mode; thereby potentially cause a
collision with the leader car.
For the above reasons, the model in Figure 6.1 needs to be revised to deal with such
false alarms (false positive & false negative). To tolerate the false alarm (false positive and
false negative) caused by the faulty sensor, one typical fault tolerance policy is to provide
redundancy. Thus, if the redundancy policy is chosen to tolerate the sensor failure, the
problem of revising program design to resist false alarm is to modify the previous system
design to utilize the sensor redundancy. After revision, the new system design should get
correct information about whether the leader car exists, even in the presence of the false
alarm of one sensor.
The goal of MR4UM is to facilitate such a revision with minimal overhead to the designer.
Clearly, during such revision process, the designer would need to specify the original (faultintolerant) model. Moreover, this model should be in the format they utilize already. This
does not introduce a new overhead since the problem of model revision is applicable in
scenarios where the designer already has a model. The designer also needs to specify the
speciﬁc nature of faults. Towards this end, MR4UM provides a mechanism to describe
commonly encountered faults. Subsequently, the designer utilizes MR4UM to generate the
corresponding fault-tolerant model for the ACC program.
Thus, MR4UM is proposed to start with the (fault-intolerant version of) UML state diagram. Then, MR4UM converts the fault-intolerant UML state diagram into the corresponding underlying computational model (UCM) automatically. This conversion is annotated to
facilitate generation of revised UML model in the last step. Additionally, MR4UM facilitates the description of faults to minimize the designer overhead. Next, MR4UM applies a
171

Figure 6.2: A case to illustrate modeling program in UML state diagram.
Model Revision algorithm on the program in UCM to add fault-tolerance. Finally, MR4UM
converts the fault-tolerant program in UCM into the corresponding UML state diagram with
the use of annotations from the ﬁrst steps.

6.3

An Overview of UML State Diagram

UML is a standardized general-purpose modeling language in the ﬁeld of object-oriented
software engineering. Speciﬁcally, UML state diagrams enable us to visualize the program
design. Moreover, UML state diagrams enable us to capture any form of fault-tolerance that
can be expressed in a state machine-based formalism [70]. In this section, we introduces the
background knowledges of UML state diagram involved in our framework.
A UML state diagram is a directed graph. The nodes in this graph denote states, and
the edges in this graph represent transitions. The graph could represent potential execution
scenario. Speciﬁcally, the modeling of individual states and transitions include:

1. State.

The state is represented as rounded rectangle. The initial state (if any) is

denoted by a ﬁlled circle. The ﬁnal state (if any) is denoted as a hollow circle containing
172

a smaller ﬁlled circle. The states may also be hierarchical in nature. In particular, a
hierarchical state can be decomposed into several states. Currently, our framework
only considers the states at the lowest level. However, it can be easily extended to
utilize the hierarchical model and obtain the corresponding states in lower level.
2. Transition. The transition between states is represented with an arrow. Each transition is associated with a triggering event followed by the list of executed actions.
A special transition, denoted as the initial transition originates from the solid circle
(initial state). It speciﬁes the default state when the model ﬁrst begins.
As an illustration, the UML state diagram in Figure 6.2 consists of ﬁve states: initial state,
s1, s2, s3 and s4, and six transitions, as follows:
1. The transition from initial state to s1;
2. The transition from s1 to s2 (the triggering condition is E1 and action1 happens concurrently with this transition);
3. The transition from s2 to s3 (the triggering condition is E2 and action2 happens concurrently with this transition);
4. The transition from s2 to s4 (the triggering condition is E3 and action3 happens concurrently with this transition);
5. The transition from s3 to s4 (the triggering condition is E4 and action4 happens concurrently with this transition);
6. The transition from s4 to s2 (the triggering condition is E5 and action5 happens concurrently with this transition);
173

Remark 6.3.1. The UML state diagram is capable of specifying ﬁnite state machines that
are important for event driven programming. And, the UML state diagram has the potential
to cut down on the number of execution paths through the code and simplify switching between
diﬀerent modes of execution [200].

6.4

Framework Description

The workﬂow of MR4UM consists of four steps, including: 1) step A, converting the UML
state diagram into the underlying computational model (UCM); 2) step B, identifying the
eﬀect of faults; 3) step C, utilizing model revision to add fault tolerance to program and 4)
step D, converting the revised program in UCM into UML state diagram while utilizing the
annotations generated in Step A.

6.4.1

Step A: Translating from UML State Diagram to UCM.

In order to utilize the underlying model revision machine, we translate the given UML state
diagram into the corresponding UCM model. This step utilizes syntactic transformation of
the UML state diagram. In particular, for every syntactic feature of the UML state diagram,
we convert it into the corresponding UCM. This step is automated with the help of following
rules:
• Rule 1: Translation of states in UML state model.

For all the states in the

UML state diagram, we introduce variable STATE with integer domain [0, n−1] where
n is the number of states in the UML. All the states in the UML state diagram are
numbered from 0 to n-1. For each state, it is mapped to a concrete value assignment of
variable STATE. If UML model consists of hierarchical states then this rule is applied
174

to the states at the lowest level. For example, for state 0 in the UML state diagram,
it is mapping into STATE==0 .
• Rule 2: Translation of trigger conditions.

For each trigger condition c men-

tioned in the UML state diagram, we introduce one variable Xc with domain {0, 1}.
Xc = 0 denotes the negative of this trigger condition is satisﬁed. Xc = 1 denotes this
trigger condition is satisﬁed. Note that the state space is now deﬁned in terms of the
STATE variable declared in the previous rule and the declaration of variables from this
step.
• Rule 3: Translation of transitions.

For each transition in the UML state dia-

gram, we introduce one corresponding program transition P . The guard of P is the
conjunction of the corresponding STATE assignment of source state and the corresponding variables of each trigger conditions. The action changes the assignment of
STATE according to the target state of the original transition in the UML.

Observe that the application of these rules will facilitate the translation of the UML
model into the underlying computational model. Since this model is obtained by the above
rules, it is straightforward to observe that it has the same behavior as the original UML
model.

6.4.2

Step B: Generating Fault Actions, Speciﬁcation and Invariants from Parameters speciﬁed by Designer

After Step A, we have program actions modeled in UCM. To revise the program design to
satisfy the new speciﬁcation, we need (1) fault actions modeled in UCM, (2) speciﬁcation,
175

that is, requirements in the presence of faults, and (3) states where program should recover
after faults occurs. Moreover, these parameters are closer in spirit to the underlying computational model. For this reason, asking the designer to generate these in UCM is not desired.
Unfortunately, they cannot be derived automatically either. For this reason, our framework
facilitates modeling of typical faults that one may encounter in the revision process. Next,
we describe how we obtain these inputs in our framework.
1. Fault Actions Modeled in UCM.

In our framework, fault actions are automati-

cally generated from parameters which are speciﬁed by designer from GUI. From GUI,
designer needs to specify the following parameters:
(a) What type of faults? Currently, there are three types of faults modeled in our
framework: (1) Byzantine, (2) transient and (3) crash (failstop). The Byzantine
fault allows the fault to change the aﬀected component (variables of the component) in an arbitrary manner. Moreover, this fault can perturb the program
several times. A transient fault perturbs the component (variables of the component) once (respectively, ﬁnitely many times where the bnound is known upfront.
And, crash disables certain variables from being updated and, hence, captures the
notion that a component (containing those variables) has crashed. The default
setting of our framework is transient.
(b) Eﬀect of faults on program.

Designer needs to specify the variables the fault

aﬀects after specifying types of faults that may occur during the execution: In
particular,
i. Byzantine. For this type of fault, designer needs to specify which variable(s)
may be corrupted by the Byzantine component and the possible value(s). For
176

example, in the motivating scenario (see Section 6.2), the variable representing the leader is aﬀected by faults. And, this fault can perturb the program to
an arbitrary state. Hence, the default action for this fault is that the variable
can be corrupted to any value in its domain. However, if the likely fault is
only a false positive, then, the fault action could perturb leader value to 1.
ii. Transient. For this type of fault, designer needs to specify which variable is
perturbed to the random value. The default for this fault is that the variable
can be corrupted to any value in its domain. The diﬀerence between the
transient and Byzantine fault is that the former perturbs the program once
whereas the latter could perturb the program a ﬁnite number of times.
iii. Crash or Failstop. For this type of fault, designer needs to specify which
variables are prevented from update due to the fault. In our framework, we
use the crash faults to denote the fault that is not detectable. By contrast,
failstop is detectable.
(c) Number of occurrences of faults.

Designer also needs to specify the occurrences

of the speciﬁed faults. In case of Byzantine/crash faults, the number denotes the
number of Byzantine/crashed components. In turn, this determines the required
level of redundancy. Regarding transient faults, the number denotes the occurrences of transient faults that may occur during the computation. The default
setting value is 1.
2. Speciﬁcation, that is, Requirements in the Presence of Faults. In our framework, speciﬁcation is automatically generated from parameters which are speciﬁed by
designer from GUI. Designer needs to specify each state with variables and corre177

sponding values. The union of the speciﬁed states from GUI are used to generate the
speciﬁcation automatically.
3. States Where Program should Recover after Fault Occurs. The states where
program should recover after faults occurs are generated automatically from initial
states (by performing reachability analysis) speciﬁed in UML state diagram.

6.4.3

Step C: Model Revision for Adding Fault-tolerance

In this step, we utilize the program generated in Step A and the faults, speciﬁcation and
invariant generated in Step B to obtain a fault-tolerant program in UCM. Since UCM was
chosen due to its compatibility with the synthesis tool [36], we utilize it to obtain the faulttolerant program.
Next, we review the model revision algorithm. This algorithm begins with four inputs, p
(original program in UCM). The details of algorithm are shown in Figure 6.3. This algorithm
consists of ﬁve steps, as follows:
1. Initialization (Lines 1-3).

In this step, we identify state and transition predicates

from where execution of faults alone may violate safety speciﬁcation. Speciﬁcally, if
(s1 , s2 ) is a fault transition that violates safety then the program should never reach
s1 . This is due to the fact that if the program reaches s1 then execution of fault action
can violate safety. Furthermore, if (s0 , s1 ) is also a fault transition then s0 should
also not be reached. Otherwise, execution of two fault transitions would violate safety.
Continuing this process (with backward reachability in fault transitions), we obtain the
set ms such that the fault-tolerant program should not reach a state in ms. For the
same reason, the program should not include a transition that reaches ms. Likewise,
178

it should not include transitions that violate safety. Hence, mt denotes the set of these
transitions that the program should not include.

2. Identiﬁcation of Fault-span (Lines 9-11).

In this step, we identify the fault-

span, that is, the reachable states by the program in the presence of faults starting
from the program invariant.

3. Identifying and removing unsafe transition (Line 12-13).

In this step, we

identify and remove transitions in mt. It turns out that in several scenarios, the
transitions of the fault-tolerant program need to be executed with partial information.
Consider the example in Section 6.2, the program cannot read the variable that denotes
whether the leader car actually exists. However, such a variable is needed during
modeling so we can verify that the transitions of the adaptive cruise control system are
consistent with each other. Hence, any time a transition is removed/added, we need
to add the corresponding Group of transitions; speciﬁcally, this group is obtained by
changing the values of variables that the program cannot read.
Unfortunately, this restriction of adding and removing groups causes the complexity
of model revision. Hence, we utilize the following heuristics.
Heuristic H1 : P ′ may include (s0 , s1 ), s0 ∈ ms.
Reasoning behind H1 . H1 is based on the premise that the algorithm will ensure that
the program never reaches a state in ms. Hence, even if we include such transitions
(e.g., because they are grouped with otherwise useful transitions), it will not cause any
problems. Moreover, such inclusion is helpful to increase the success of model revision
when (s0 , s1 ) is grouped with some other transitions that is desirable in the P ′ .
179

Heuristic H2 : P ′ may include (s0 , s1 ) where (s0 , s1 ) ∈ mt and s0 is not reachable by
transitions in P ′ starting from I ′ in the presence of faults.
Reasoning behind H2 . H2 is based on the premise that if the current version of P does
not reach s0 then it is likely to be true in the ﬁnal program as well. Hence, including
transition (s0 , s1 ) is expected to be acceptable. If at some later point, state s0 is
reachable then this transition may be removed. The fault-span computed in this step
is used to determine whether state s0 is reachable in the presence of faults.
4. Resolving deadlock states (Line 15-18). To ensure that no new ﬁnite computations are introduced to the input fault-intolerant program, we resolve deadlock states
in this step by either adding recovery path or eliminating states. First, we attempt to
add recovery to the invariant. This recovery could be achieved in a single step or in
multiple steps. Only if the recovery is not feasible then we remove transitions so that
the deadlock state is not reached. In this step, we use the following heuristic.
Heuristic H3 .

Given a deadlock state s, s ∈ I ′ , P ′ either includes a recovery action
/

(s, sI ), sI ∈ I ′ , or makes s unreachable from I ′ without eliminating any invariant
states.
Reasoning behind H3 .

H3 is based on the principle that P ′ would not eliminate any

states and/or transitions unless absolutely required to do so. This is due to the fact
that if we remove several states from the invariant then it may be impossible to satisfy
speciﬁcation in the absence of faults. Hence, invariant states are removed only as a
last resort.

5. Re-computing the invariant (Line 20). In this step, we recomputed the program
180

invariant due to identifying oﬀending states during state elimination.
The algorithm keeps repeating steps until the three ﬁxpoints in Line 14, 19 and 20 are
reached. The algorithm terminates when no progress is possible in all the steps.
Algorithm 6: Model Revision for Adding Fault-tolerance
Input: P:transitions,f : fault transitions
spec: safety speciﬁcation , IP : invariant predicate
Ouput: If successful, a fault-tolerant P ′ with invariant S ′
that is fault-tolerant to f
ms:={s0 : ∃s1 , s2 , . . . sn : (∀j : 0 ≤ jn : (sj , sj + 1 ∈ f )
∧(sn − 1 , sn ) violates spec};
mt:={(s0 , s1 ) : ((s1 ∈ ms) ∨(s0 , s1 ) violates spec)};
I1 ,fte:= IP - ms, false ;
REPEAT
I2 := I1 ;
REPEAT
S1 , P2 := I1 , P1 ;
REPEAT
S2 := S1 ;
S1 := FWReachStates(I1 , P1 ∨ f );
S1 := S1 - fte;
mt:= mt ∧S1 ;
P1 := P1 - Group(P1 ∧ mt;
UNTIL S1 = S2 ;
ds := {s0 |s0 ∈ S1 ∪ (∀s1 : s1 ∈ S : (s0 , s1 ) ∈ P1 };
/
P1 := P1 ∨ AddRecovery (ds, I1 , S1 , mt);
ds := {s0 |s0 ∈ S1 ∪ (∀s1 : s1 ∈ S : (s0 , s1 ) ∈ P1 };
/
P1 , fte:= Eliminate(ds, P1 , I1 , S1 , f, f alse, f alse);
UNTIL P1 = P2 ;
P1 , I1 := ConstructInvariant(P1 , I1 , f te);
UNTIL I1 = I2 ;
IP = I1 , P ′ = P1 ;
RETURN IP ′ = P ′ ;

(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
(20)
(21)
(22)
(23)

Figure 6.3: Model revision for adding fault-tolerance.

Thus, at the end of this step, we have a fault-tolerant program in UCM that is obtained
by the corresponding input from Steps A and B. Moreover, as mentioned earlier, the output
in this step consists of: (1) original transitions that are preserved as is, (2) original transitions
181

that are strengthened and (3) recovery transitions.

6.4.4

Step D: Translating the Revised Program in UCM to UML
state diagram.

Till now, we have obtained the revised program modeled in UCM, including (1) original
program actions, (2) revised program actions and (3) recovery program actions. In step D,
the fault-tolerant UML state diagram is generated as follows:
1. First, we utilize the revised program actions and recovery program actions to identify
the changed transitions in the original UML state diagram.
2. Second, we re-annotate these transitions in the UML state diagram by the guard
conditions of these revised program actions and recovery program actions.
Thus, after the completion of all four steps, we obtain the fault-tolerant UML state
diagram.

6.5

Case Study 1: The Adaptive cruise control system

In this section, we present the stepwise application of MR4UM on the case of the adaptive
cruise control (ACC) system introduced in Section 6.2. In particular, we begin with the
fault-intolerant UML model for this case study and apply our framework to generate a faulttolerant UML model that satisﬁes Problem 2.4.1. This case study is organized as follows:
First, in Section 6.5.1, we describe UML state diagram for the fault-intolerant ACC program.
In Section 6.5.2, we describe how MR4UM transforms this UML state diagram into program
actions modeled in UCM. In Section 6.5.3, we describe the user inputs for generating the fault
182

model, speciﬁcation and invariant. Subsequently, in Section 6.5.4, we describe how MR4UM
utilizes model revision algorithm to add fault tolerance to the input program. Finally, in
Section 6.5.5, we show how MR4UM transforms the fault-tolerant program obtained by
model revision into the corresponding fault-tolerant UML state diagram.

6.5.1

Fault-intolerant UML model for ACC

The model of the ACC system design in the UML state diagram includes ﬁve states, namely
active, ACC active, inactive, initial and oﬀ. The active state captures the status of ACC
system in “active” mode. The ACC active state captures the status of the ACC system in
“ACC active” mode. The inactive state captures the status of the ACC system in “inactive”
mode. The initial state captures the status of the ACC system in the initializing process.
The oﬀ state captures the status that the ACC system is turned oﬀ.
The state diagram in Figure 6.4 visualizes model design of the ACC system. For better
understanding, Figure 6.5 labels out formal expression of the corresponding annotation. As
shown in Figure 6.4, when the ACC system is turned on, the system will enter into the active
state after initialization if there is no leader car detected. The system will enter into the
ACC active state if the leader car exists according to the information from sensor system.
In other words, existence (or nonexistence) of leader car is the trigger condition to change
system state between active and ACC active. When the brake is applied, irrespective whether
it were in active or ACC active, the ACC system enters into inactive state. When the brake
is released, the system will change from inactive state into active state or ACC active state
depending upon the existence of leader car. The whole ACC system will continue to stay in
one of these three states until the system is turned oﬀ.
183

Figure 6.4: ACC system modeled in UML state diagram.

Figure 6.5: Annotation in formal expression.

As one can observe in the absence of faults, this program works correctly. Anytime a
leader car exists, the system enters into ACC active to ensure that the current car maintains
a safe constant distance to the leader car. And, when the leader moves away, the system
enters into active to ensure that the original speed is resumed.
184

6.5.2

Application of Step A: Generating UCM of the ACC System

The UML model for the ACC system is as shown in Figure 6.5. Based on the transformation
rules, the corresponding UCM needs four variables, namely state, on, brake and leader. The
details of these variables are as follows:
1. state. Based on the number of state in the UML state diagram in Figure 6.5, the
domain of state is {0, 1, 2, 3, 4}. This variable models the ﬁve states of the ACC system.
state = 0 denotes the initial status of the system when it is turned on. state = 1
denotes the system is in active status. state = 2 denotes the system is in ACC active
status. state = 3 denotes the system is in inactive status. And, state = 4 denotes the
status that the system is turned oﬀ.
2. on. The domain of variable on is {0, 1}. It is used to denote whether the ACC system
is turned on. When the ACC system is turned on, the variable on is assigned with 1,
otherwise 0. on = 0 models the trigger condition that causes the ACC system enters
into the stop status.
3. leader. The domain of variable leader is {0, 1}. It is used to model whether a leader
car is detected by the sensor system. leader = 1 denotes that a leader car is detected.
leader = 0 denotes that no leader car is detected.
4. brake. The domain of variable brake is {0, 1}. It is used to denote whether the brake
is applied by the driver. brake = 1 models the event that the brake is applied during
the execution of ACC system. brake = 0 models the event that the brake is released
during the execution of ACC system.
Program actions of the ACC system are generated as follows:
185

1. state = 0 & on = 1 & leader = 0 −→ state := 1;
2. state = 0 & on = 1 & leader = 1 −→ state := 2;
3. state = 1 & leader = 1 −→ state := 2;
4. state = 2 & leader = 0 −→ state := 1;
5. state = 1 & brake = 1 −→ state := 3;
6. state = 2 & brake = 1 −→ state := 3;
7. state = 3 & brake = 0 & leader = 0 −→ state := 1;
8. state = 3 & brake = 0 & leader = 1 −→ state := 2;
9. state = 1 & on = 0 −→ state := 4;
10. state = 2 & on = 0 −→ state := 4;
11. state = 3 & on = 0 −→ state := 4;
In the above program actions, Action 1 models the transition from initial state (state = 0)
to active state (state = 1). The triggering condition of this transition includes: 1) no leader
car exists within the predeﬁned safety distance and 2) the ACC system is on. This condition
is modeled as on = 1&leader = 0. Action 2 models the transition from initial state to
ACC active(state = 2). The triggering condition is that a leader car is detected within the
predeﬁned safety distance when the system is on, that is, on = 1&leader = 1. Action 3
models the transition from active (state = 1) to ACC active (state = 2). The triggering
condition for this action is a leader car is detected within the predeﬁne safety distance.
Action 4 models a reverse transition of action 3, that is, from ACC active (state = 2) to
186

active (state = 1). The triggering condition for this action is no leader car is detected
within the predeﬁned safety distance. Actions 5 and 6 models the transition from active (or
ACC active) to inactive. The triggering condition for both actions is that driver is pressing
brakes, that is, brake = 1. Actions 7 and 8 model the transitions from state inactive to
active (or ACC active). The triggering condition for these two actions is brakes are released
and no leader car is detected within the predeﬁned safety distance. Actions 9 − 11 model the
transitions from state active, ACC active or inactive to state oﬀ(state = 4). The triggering
condition for these three actions is that the system is turned oﬀ.

6.5.3

Application of Step B: Generating Remaining Inputs for
Model Revision

In this framework, the faults cause the sensor to provide an incorrect value. This can be
modeled with Byzantine faults. Moreover, the number of occurrences of this fault is at most
1. And, the fault aﬀects the leader variable from Figure 6.5.
Since the Byzantine fault aﬀects at most one leader variable, we need a redundancy
of three, i.e., we need variables, leader1, leader2 and leader3. Observe that since the
redundancy is added for the leader variable, in the absence of faults, all leader values are
equal. Hence, leader1 can be perturbed from such a state. Moreover, since at most one
fault can occur, leader1 cannot be perturbed further. If two occurrences of faults were
permitted, this would need to be changed so that leader1 could be perturbed even if some
other (and only 1) leader variable were corrupted. Observe that by performing this analysis,
it is possible to generate the guard that identiﬁes when leader1 (respectively, leader2 and
leader3) are corrupted. Based on this the fault actions can be modeled as follows:
187

1. leader1 == leader2 == leader3 −→ leader1 := 0[]leader1 := 1;
2. leader1 == leader2 == leader3 −→ leader2 := 0[]leader2 := 1;
3. leader1 == leader2 == leader3 −→ leader3 := 0[]leader3 := 1;
where [] denotes the non-deterministic execution of statement.
We use an auxiliary variable car to denote whether there is a car in front of the current
car. The value of the variable car is only included for modeling purpose. If the value of the
sensors are not corrupted by fault, the value will be equal to the variable car. Based on the
requirement of the ACC, transitions between Active and ACC active must take the status
of car into account. Moreover, this has to be done without utilizing the variable car in the
revised program. Thus, the set of states the program should not reach are as follows:
((car == 1)&(state′ ! = state)&(on == 1)&(brake == 0)&(state′ ! = 2))|
((car == 0)&(state′ ! = state)&(on == 1)&(brake == 0)&(state′ ! = 1));

The invariant, that is, states where program should recover after fault occurs, is as follows.
Note that the above predicate is generated automatically from the initial state of the UML
model.
(((car == 1)&(on == 1)&(brake == 0)&(state == 2)
(((leader1 == 1)&(leader2 == 1)&(leader3 == 0))|;
((leader1 == 1)&(leader3 == 1)&(leader2 == 0))|;
((leader2 == 1)&(leader3 == 1)&(leader1 == 0))|;
((leader2 == 1)&(leader3 == 1)&(leader1 == 1))));
188

(((car == 0)&(on == 1)&(brake == 0)&(state == 1)
(((leader1 == 0)&(leader2 == 0)&(leader3 == 1))|;
((leader1 == 0)&(leader3 == 0)&(leader2 == 1))|;
((leader2 == 0)&(leader3 == 0)&(leader1 == 1))|;
((leader2 == 0)&(leader3 == 0)&(leader1 == 0))));

6.5.4

Application of Step C: Generation of Fault-Tolerant UCM

In Step C, we utilize the inputs from Sections 6.5.2 and 6.5.3. Since this step utilizes the tool
SYCRAFT [36], we only provide the output of the synthesized program as follows: (Note
that this output is only intended for use in Step D and not meant for designer to analyze it.)
To :
state = 1 & on = 0 −→ state := 4;
state = 2 & on = 0 −→ state := 4;
state = 3 & on = 0 −→ state := 4;
Ts :
state = 0 & on = 1 & leader1 = 0 & leader2 = 0 −→ state := 1;
state = 0 & on = 1 & leader1 = 0 & leader3 = 0 −→ state := 1;
state = 0 & on = 1 & leader2 = 0 & leader3 = 0 −→ state := 1;
state = 0 & on = 1 & leader1 = 1 & leader2 = 1 −→ state := 2;
state = 0 & on = 1 & leader1 = 1 & leader3 = 1 −→ state := 2;
state = 0 & on = 1 & leader2 = 1 & leader3 = 1 −→ state := 2;
state = 1 & leader1 = 1 & leader2 = 1 −→ state := 2;
189

state = 1 & leader1 = 1 & leader3 = 1 −→ state := 2;
state = 1 & leader2 = 1 & leader3 = 1 −→ state := 2;
state = 2 & leader1 = 0 & leader2 = 0 −→ state := 1;
state = 2 & leader1 = 0 & leader3 = 0 −→ state := 1;
state = 2 & leader2 = 0 & leader3 = 0 −→ state := 1;
state = 3 & brake = 0 & leader1 = 0 & leader2 = 0 −→ state := 1;
state = 3 & brake = 0 & leader1 = 0 & leader3 = 0 −→ state := 1;
state = 3 & brake = 0 & leader2 = 0 & leader3 = 0 −→ state := 1;
state = 3 & brake = 0 & leader1 = 1 & leader2 = 1 −→ state := 2;
state = 3 & brake = 0 & leader1 = 1 & leader3 = 1 −→ state := 2;
state = 3 & brake = 0 & leader2 = 1 & leader3 = 1 −→ state := 2;

The action set To denotes the original actions which are from the original program and
unchanged in the fault-tolerant program. The actions in Ts utilize the redundancy of sensors
to tolerate the false alarm (false positive and false negative) caused by an unreliable sensor.
Hence the system can get correct information about whether the leader car exists, even in
the presence of the false alarm of one sensor.

6.5.5

Application of Step D: Generating Fault-tolerant UML model
for ACC System

In this step, we utilize the fault-tolerant UCM into the corresponding UML state diagram.
Observe that some parts of the UML state diagram remain unchanged. For those, we utilize
the corresponding part from the UML state diagram. As an example, this results in that
190

three transitions in the UML state diagram remain unchanged in the fault-tolerant UML
state diagram, that is, the transition that is from state 1 to state 4, the transition that is
from state 2 to state 4 and the transition that is from state 3 to state 4.
Moreover, for revised actions, we strengthen the conditions under which the actions
can be executed. For example, the triggering condition of transition from state 0 to state
1 is revised from on = 1&leader = 0 to on = 1&((leader1 = 0&leader2 = 0)|(leader1 =
3&leader2 = 0)|(leader1 = 0&leader3 = 0)). Thus, the UML state diagram of fault-tolerant
ACC system is shown in Figure 6.6.

Figure 6.6: The revised ACC program in UML state diagram.

191

6.6

Case Study 2: The Altitude Switch Controller

In this section, we show the stepwise application of MR4UM on a simpliﬁed version of
an altitude switch (ASW) program from the aircraft altitude controller system. This case
is adapted from [31]. The ASW program monitors a set of input variables coming from
two analog altitude sensors and a digital altitude sensor. And, it generates an output and
activates an actuator when the altitude is less than a pre-determined threshold.
In particular, we begin with UML state diagram of the ASW program that is faultintolerant, and, we apply MR4UM to generate UML state diagram of the fault-tolerant ASW
program that satisﬁes Problem 2.4.1. Speciﬁcally, in Section 6.6.1, we describe the faultintolerant UML model for the ASW program. In Section 6.6.2, we describe how MR4UM
transforms UML state diagram of the ASW program into program actions modeled in UCM.
In Section 6.6.3, we describe the user inputs for generating the fault model, speciﬁcation and
invariant. Next, Section 6.6.4 describes how MR4UM utilizes model revision algorithm to
synthesize fault tolerance to the input program. Finally, Section 6.6.5 show how MR4UM
transforms the fault-tolerant program obtained by model revision into the corresponding
fault-tolerant UML state diagram.

6.6.1

Fault-intolerant UML model for ASW

The UML state diagram of fault-intolerant ASW program includes three state, namely,
initialization, awaitActuator and standby. The initialization state captures the status when
the ASW program is initializing. The awaitActuator state captures the status when the ASW
program is waiting for the actuator to power on. The standby state captures the status when
the ASW program is in standby mode.
192

Figure 6.7: UML state diagram of the ASW program.

The state diagram in Figure 6.7 visualizes model design of ASW program. Figure 6.8
provides formal expression of the corresponding annotation in Figure 6.7. As shown in Figure
6.7, the ASW program will enter into the standby state after initialization from initialization
state, and, the ﬂag variable init is set to be 1 to denote the initialization process is done. The
ASW program will reset into the initialization state from standby state if the reset process is
triggered, and, the ﬂag variable reset is assigned 1 to denote the reset process is done. When
the actuator is powered oﬀ and actuator power-on is allowed, the ASW program will change
state from standby to awaitActuator, and, the ﬂag variable altBelow will be assigned 1 to
denote that the altitude is below a speciﬁc threshold. The ASW program will change from
awaitActuator state to standby state if the actuator is powered on, and, the ﬂag variable
actuatorStatus will be assigned 1 to denote the actuator is powered on. The ASW program
will change from awaitActuator state to initialization state if the program is reset, and, the
ﬂag variable reset will be assigned 1 to denote that the program is reset. This program works
correctly in the absence of faults.
193

Figure 6.8: the ASW program with formal annotation.

6.6.2

Application of Step A: Generating UCM of the ASW Program

The UML state diagram of the ASW program is as shown in Figure 6.8. Based on the
transformation rules, the corresponding program actions in UCM needs seven variables,
namely, state, init,reset, inhibit, altBelow, actuatorStatus and altFail. These variables are
deﬁned as follows:
1. state. The domain of variable state is {0, 1, 2, 3}. It is used to denote the four states of
the ASW program. state = 0 denotes the initial status of the system when it is turned
on. state = 1 denotes the system is in Await-Actuator status. state = 2 denotes the
system is in standby status. Plus, state = 3 denotes the system is in faulty status.
2. init. The domain of variable init is {0, 1}. It is used to denote whether the altitude
controller program is initialized. When the program is initialized, the variable init is
assigned 1, otherwise 0.
3. reset. The domain of variable reset is {0, 1}. It is used to model whether the program
is being reset. When the program is reset, the variable reset is assigned 1, otherwise 0.
194

4. inhibit. The domain of variable inhibit is {0, 1}. It is used to model whether the
actuator power-on is inhibited. inhibit = 1 denotes that the actuator power-on is
inhibited. And, inhibit = 0 denotes the actuator power-on is allowed.
5. altBelow. The domain of variable brake is {0, 1}. It is used to denote whether the
altitude is less than a pre-determined threshold. altBelow = 1 models the event that
the altitude is below a speciﬁc threshold. altBelow = 0 models the event that the
altitude is not below the threshold.
6. actuatorStatus. The domain of variable actuatorStatus is {0, 1}. It is used to model
whether the actuator is powered on. actuatorStatus = 1 denotes that the actuator is
powered on. And, inhibit = 0 denotes the actuator is not powered on.
7. altFail. The domain of variable brake is {0, 1}. altFail is equal to 1 when analog and
digital altitude meters are failed, otherwise 0.

Hence, the program actions of ASW program in UCM are as follows:

1. state = 0 & init = 1 −→ state := 2; init := 0;
2. state = 2 & reset = 0 −→ state := 0; reset := 1;
3. state = 2 & altBelow = 0 & inhibit = 0 & actuatorStatus = 0 −→ state :=
1; altBelow := 1;
4. state = 1 & actuatorStatus = 0 −→ state := 2; actuatorStatus := 1;
5. state = 1 & reset = 0 −→ state := 0; reset := 1;
195

In the above program actions, action 1 models the program transition from initialization
mode (state = 0) to standby mode (state = 2). This transition triggers the reassignment
of the variable init, that is init = 0 → init = 1 (which denotes the program is done with
initialization process). Action 2 models the transition from standby mode to initialization
mode (state = 0). This transition triggers the reassignment of the variable reset, that is,
reset = 0 → reset = 1 (which denotes the program is done with the resetting process).
Action 3 denotes the transition from standby mode (state = 2) to await-Actuator mode
(state = 1). The triggering condition of this action is the actuator power-on is not inhibited
and the actuator is not powered on. This action triggers the reassignment of the variable
altBelow, that is altBelow = 0 → altBelow = 1 (which denotes that the altitude is below
a speciﬁc threshold). Action 4 denotes that transition from await-Actuator (state = 1)
mode to standby mode (state = 2). This action triggers the reassignment of the variable
actuatorStatus, that is actuatorStatus = 0 → actuatorStatus = 1 (which denotes that the
actuator is powered on). Action 5 denotes the transition from await-Actuator (state = 1) to
initialization mode (state = 0). This action triggers the reassignment of the variable reset,
that is reset = 0 → reset = 1 (which denotes the system is reset).

6.6.3

Application of Step B: Generating Remaining Inputs for
Model Revision

The targeted fault-tolerant program is required to tolerate such a faulty status: the altitude
sensors incur malfunction. This type of fault is recognized as transient fault. To model
this fault action in UCM, the designer needs to specify which variable is perturbed and the
possible value of the corrupted variable. In this case, the designer speciﬁes the fault may
196

corrupt variable state into 3, that is, state = 3 denotes the faulty status of the program.
Besides, the designer also needs to specify the triggering condition of the faults. In this case,
the designer speciﬁes three triggering conditions for three diﬀerent transitions that corrupt
the program into faulty status.

• initFailed = 1. This condition represent the situation where the program stays in the
initialization mode for more than 0.6 second.
• altFailOver = 1. This condition represents the situation where the condition altF ail =
1 remains true more than 2 seconds.
• awaitOver=1. This condition represents the situation where the program stays in the
await-Actuator mode for more than 2 seconds.

Hence, MR4UM generates fault actions as following:
1. initF ailed == 1 −→ initF ailed := 0, state := 3;
2. altF ailOver == 1 −→ altF ailOver := 0, state := 3;
3. awaitOver == 1 −→ awaitOver := 0, state := 3;
In this case, the designer requires the safety speciﬁcation, as follows:
• If the altitude sensor are failed, the program does not transfer from standby mode to
await-Actuator mode;
• The program can only go to the initialization mode from the faulty mode;
• The program can recover from the faulty mode if the program is not reset.
197

Hence, the designer speciﬁes the speciﬁcation from GUI of the framework as follows:
((altF ails == 1)&(state == 2)&(state′ == 0))|
((state == 3)&(state == 2)&(state′ == 1))|
((state == 3)&(reset == 1));

The invariant of the program consists of the states where the program is not in the faulty
states, i.e., state! = 3. Hence, the invariant of this program is as follows:
state! = 3

6.6.4

Application of Step C: Generation of Fault-Tolerant UCM

In Step C, MR4UM utilizes the inputs from Section 6.6.2 and 6.6.3. After Step C applies the
symbolic model revision on the input program, the MR4UM generates the revised program
modeled in UCM. Note that this output is only intended for use in Step D and not meant
for designer to analyze it. The obtained fault-tolerant program is as follows:
To :
1). state = 0 & init = 1 −→ state := 2; init := 0;
2). state = 2 & reset = 0 −→ state := 0; reset := 1;
3). state = 1 & actuatorStatus = 0 −→ state := 2; actuatorStatus := 1;
4). state = 1 & reset = 0 −→ state := 0; reset := 1;
Ts :
5). state = 2 & altBelow = 0 & inhibit = 0 & actuatorStatus = 0 & altF ail = 0
198

Figure 6.9: The revised ASW program in UML state diagram.

−→ state := 1; altBelow := 1;
Tr :
6). state = 3 & reset = 0 −→ state := 0; reset := 1;

Note that Action 5 has been added a new constraint altF ail == 0 which denotes that
the program is allowed to change its state to the await-Actuator mode only when the input
sensors are not corrupted. Action 6 is newly added to the program which denotes that the
program recovers from faulty mode (state = 3) to the initialization mode (state = 0). The
other actions remains the same as the actions of the input program.

6.6.5

Application of Step D: Generating Fault-tolerant UML model
for ASW Program

In this step, MR4UM converts the fault-tolerant program in UCM into the corresponding
model in UML state diagram. The UML state diagram of fault-tolerant ASW program is as
shown in Figure 6.9.
199

6.7

Discussion and Lessons Learnt

In this section, we discuss several questions that are raised by our work as well as lessons
learnt from the case study.
One question is the scalability of the approach of automatic revising the existing program
design proposed in our work. Our framework beneﬁts from the fact that the underlying synthesis engine utilizes OBDDs to mitigate the state explosion problem as well as a heuristic
based algorithm [36] to mitigate the complexity (NP-complete) of model revision. Specifically, this approach has been used to permit model revision of programs with state space
exceeding 10100 . Hence, we expect the framework to be able to handle moderate sized
problems.
Another question is about the choice of UML as the front end for our framework. We
chose UML because it is one of the commonly used platforms to specify requirements. And,
although there is an existing work on formalization of UML models, the problem of model
revision has not been addressed in this context. Our approach is also feasible for revising
program design modeling in other approaches (e.g. AADL [80]) by modifying the mapping
mechanism between the model of program design and underlying computational model. It
has also been demonstrated in revising SCR speciﬁcations [107].
One of the diﬃculties in developing this framework lies in the fact that the revised faulttolerant model in UCM is BDD based. Although converting the UCM model into UML state
diagram involves some challenges, they can be overcome by understanding (1) the part of
the UML model that will remain intact in the ﬁnal model, (2) the part of the UML model
where the structure of the original model will remain intact in the ﬁnal model although
some of the details (e.g., conditions on the arrows) will change, and (3) the part of the UML
200

Figure 6.10: The stepwise procedure of MR4UM.
model that is completely new and added for dealing with recovery from faults. Since BDD
based approaches permit us to check conditions (1) and (2) eﬀectively with negligible cost,
obtaining the revised UML state diagram is feasible. One of the future work in this area is
to optimize the third part that identiﬁes the actions that provide recovery.

6.8

Summary

Our work focuses on reducing the learning curve required for application of formal methods,
speciﬁcally model revision, by keeping the formal methods under-the-hood to a large extent.
We chose to apply model revision for UML models since it is one of the popular modeling
techniques. Speciﬁcally, in this chapter, we proposed a framework, namely MR4UM, which
allows designers to apply model revision to existing UML models for adding fault-tolerance.
The stepwise-procedure of MR4UM is shown in Figure 6.10. Since the model revision process
of MR4UM is based on BDDs, it has the potential to deal with large state space, e.g., the
underlying synthesis engine has been used to revise models with state space exceeding 10100 .
201

We illustrated the stepwise procedure of MR4UM with two case studies. One is the
adaptive cruise controller system (ACC) from automotive systems. The other one is the
altitude switch (ASW) used in the aircraft altitude controller.

202

Chapter 7
Related Work
This chapter is dedicated to illustrate the related work on this dissertation. In particular,
Section 7.0.1 presents the related work on automatic veriﬁcation of stabilizing programs. In
Section 7.0.2, we present the related work on automatic revision for multitolerant programs.
Section 7.0.3 introduces the related work on model repair of UML state diagrams for adding
fault-tolerance.

7.0.1

Automatic Veriﬁcation of Stabilizing Programs

Formal veriﬁcation of stabilizing programs has been studied mainly in two directions, including mechanism theorem proving and model checking. Mechanism theorem proving based
approaches [132, 188, 189] have been demonstrated as a very powerful veriﬁcation technique,
especially for these inﬁnite systems. Unfortunately, this approach usually requires considerable experience in logic reasoning.
Model checking is one of the most successful approaches to veriﬁcation. However, using
model checking for veriﬁcation purpose suﬀers from the ‘state explosion’ problem. To address
203

this issue, researchers have proposed the technique of symbolic model checking, using OBDDs
to obtain a compact representation of programs. Also, Tsuchiya et al [225] have proposed
an approach to verify stabilizing programs based on symbolic model checking. Although this
work has demonstrated the feasibility of using symbolic model checking to verify stabilizing
programs, it only works for programs consisting of a small number of processes. To solve this
bottleneck, we proposed an approach to verify stabilizing programs, that is using eﬀectiveness
of fairness while using symbolic model checking.
With the advancement of model checking techniques, the SAT/SMT based model checking has emerged as an viable alternative to symbolic model checking. In particular, symbolic
model checking requires one to identify the order of program variables when constructing
OBDD representations. An incorrect ordering of variables can increase the veriﬁcation time.
That motivates our research on investigating the eﬀectiveness of SMT-based model checking
in verifying stabilizing programs.

7.0.2

Automatic Revision for Multitolerant Programs

Automated program revision is studied from diﬀerent perspectives. One approach (e.g., [23])
focuses on synthesizing fault-tolerant programs from their speciﬁcation in a temporal logic
(e.g., Linear Temporal Logic(LTL) [76], etc.). The word synthesis is also used in the context
(e.g., [53] [101] [158] [114]) of transforming an abstract (such as UML) program into a
concrete (such as C++) program while ensuring that the location of concrete program in
memory, its data ﬂow etc. meet the constraints of an embedded system. By contrast,
our approach focuses on transformation of one abstract program into another that meets
additional properties of interest. Our approach will advance the applicability of this existing
204

work by allowing designers to add properties of interest in the abstract model and then using
existing work to generate concrete program. Hence our approach is desirable when one needs
to extend an existing system by adding fault-tolerance.
Our work is closely related to the work on controller synthesis [20] [21] [38] [66] and game
theory [51] [79] [123]. In control-theoretic approaches, the supervisory control of real-time
systems has been studied under the assumption that the existing program (called a plant)
and/or the given speciﬁcation is deterministic. Moreover, in both game theory and controller
synthesis, since highly expressive speciﬁcations are often considered, the complexity of the
proposed synthesis methods is very high. For example, the synthesis problems presented
in [20] [21] [51] [79] are EXPTIME-complete. Furthermore, deciding the existence of a
controller in [38] [66] is 2EXPTIME-complete. In addition, these approaches do not address
some of the crucial concerns of fault-tolerance (e.g., providing recovery in the presence of
faults) that are considered in the our work. In addition, the high complexity of these methods
is a serious barrier to actually synthesize moderate sized programs. By contrast, our approach
focus only on speciﬁcations needed to express properties of interest. Hence, the complexity
of our algorithm is substantially lower.
The algorithms in [134] [136] have addressed the problem of adding fault-tolerance to only
one class of fault. Also the algorithm in [71] is targeted toward the synthesis of programs
that simultaneously tolerate multiple classes of faults whereas we address the synthesis of
programs that provide the appropriate level of fault-tolerance and not to provide any tolerance if faults from one class occurs while the program is ‘recovering’ from faults from another
class. The ‘weak’ multitolerance way in this dissertation is necessary especially when it is
impossible to guarantee any level of tolerance in a computation where faults from two classes
205

occur.

7.0.3

Model Revision of UML State Diagrams for Adding Faulttolerance

Previous work in [125,130,156] addresses the problem of formalization of UML state diagram.
Speciﬁcally, these approaches deﬁne operational semantics of the UML state diagram and
then utilize it for simulation, veriﬁcation and/or code generation. The ﬁrst step in our
framework is inspired by these approaches. However, unlike the previous work, in our work
the translations of UML model needs to be annotated so that we can subsequently obtain a
revised UML model after adding fault-tolerance. Another important diﬀerence between our
work and these works is that our work focuses on the problem of model revision whereas
they focus on the problem of model checking. Thus, our work is complementary to previous
work in that our framework can be applied in scenarios where the given UML model fails to
satisfy the given property.
Our work is orthogonal to related work (e.g., [54, 102, 115, 159, 171]) that focuses on
transforming an abstract UML model into a concrete (such as C++) program while ensuring
that the location of concrete program in memory, its data ﬂow etc. meet the constraints
of the underlying system. In particular, our work focuses on revising the given model into
another UML model that satisﬁes the fault-tolerance property. Thus, our work will advance
the applicability of this existing work by allowing designers to add properties of interest in
the abstract model and then using existing work to generate concrete program.
Approaches in [69,70] develop corrector pattern for specifying nonmasking fault-tolerance
and failsafe fault-tolerance respectively. These proposed analysis methods are validated in
206

terms of UML diagrams. While these works simplify and modularize fault-tolerance concerns
and facilitate to analyze the functional and fault-tolerance concerns and their mutual impact,
application of a synthesis tool in automatically adding fault-tolerance is an on-going direction
of these works. Our work utilizes the synthesis tool [37] to automate the revision process for
adding fault-tolerance.
The work in [2] proposes an approach of automating and formalizing the translation from
high level design models, speciﬁcally, Software Cost Reduction (SCR) [31], to a format that
can be used by the automated revision/synthesis tools of example. SCR is a set of formal
methods for constructing and verifying requirements speciﬁcation document. By contrast,
our work focuses on issues of automatic revision of UML state diagram for adding faulttolerance.

207

Chapter 8
Conclusion and Future Work
This dissertation focused on the problem of automatic veriﬁcation and revision of multitolerant programs. In particular, our goal was to build a model-based framework that helps
to build multitolerant programs in an automatic way. To this end, we have developed new
theory, algorithms as well as tools to advance the state-of-the-art research. First, we have
investigated the eﬀectiveness of our fault modeling approach, that is, using transition systems to model faults. Then, leveraging model repair techniques, we proposed approaches
to eﬃciently verify the fault-tolerance properties (stabilization) provided by programs. The
proposed approaches mitigate the fundamental challenge of ‘state explosion’, the key problem
that limits the scalability of previous approaches to verify stabilization. We also investigated
the complexity of automatic program revision for adding multitolerance. And, we have developed novel model repair algorithms for revising programs to add multitolerance. We have
implemented these algorithms as a tool. We also have demonstrated the applicability of
model repair techniques in a lightweight framework, where the given UML state diagram is
automatically revised to add fault-tolerance.
208

In what follows, we summarizes the contributions of this dissertation in Section 8.1.
Then, in Section 8.2, we discuss possible directions of future work.

8.1

Contributions

The main results of this dissertation are as follows.

1. We investigated the eﬀectiveness of transition systems to model faults. Starting from a
taxonomy of faults that is based on practitioner’s point of view [24], we showed that: (i)
faults from 20 categories of all 31 categories can be modeled using transition systems.
These faults include Byzantine actions in a networked system, physical deterioration
of brake in a braking control system and so on. And, (ii) faults from 11 categories
cannot (or should not be) represented using transition systems. These include faults
such as buﬀer overﬂow, hardware errata and so on. We also investigated the feasibility
and practicability of using transition systems to model faults. We showed that (i) the
modeling of faults from 18 categories as transition systems is practical and feasible
and (ii) the modeling of faults from 2 categories as transition systems is not practical
although feasible. Besides, We identiﬁed the relative completeness of the proposed
approach with recent literature.
2. We identiﬁed one bottleneck involved in existing automatic veriﬁcation of self-stabilization.
In particular, we investigated the issues of fairness and its eﬀects on the performance
of verifying self-stabilizing programs. We showed that if self-stabilization is possible
with unfair computation, then the cost of stabilization veriﬁcation is substantially reduced. In particular, we showed that the time for veriﬁcation with unfair computation
209

is approximately 0.001%-0.1% of that with weakly fair computation.
3. For the cases where weak fairness is essential for the correctness of stabilization, we
proposed two approaches to improve the scalability of verifying stabilizing programs,
including decomposition and weak stabilization. We showed that using decomposition
would result in a substantial increase for the scalability of verifying stabilization (e.g.,
from 104 states to 10138 states fro Huang’s mutual exclusion algorithm). We also
showed that veriﬁcation of weak stabilization is substantially more scalable. This
result validates the suggestion in [92] that weak stabilization is easier to verify than
stabilization.
4. We proposed a constraint-based approach to analyze stabilizing programs without introducing fairness scheduler in the program model, and without assuming any order
information of variables. The key insight is to reduce the problem of stabilization veriﬁcation into a well-studied problem, named constraint solving, which can be solved by
many existing highly optimized solutions.
5. We investigated the complexity issues in automatic revision of programs for adding
(weak) multitolerance.
6. We presented a counterintuitive result where we showed that the problem of automatic
addition of Masking-Masking (respectively, Masking-Failsafe) weak multitolerance is
NP-complete. This result is especially surprising given that the corresponding problem
can be solved in polynomial time for strong multitolerance.
7. We presented a sound and complete algorithm for automatic program revision to add
Failsafe-Failsafe (respectively, Masking-Nonmasking) weak multitolerance, Also, we
210

presented a polynomial time heuristic for designing Masking-masking weak multitolerant program.
8. We implemented the proposed algorithms for automatic program revision to add multitolerance as a tool.
9. We compared our proposed multitolerance approach, named weak multitolerance with
previous techniques, named strong multitolerance. We also identiﬁed circumstances
where solvability for adding weak and strong multitolerance diﬀers.
10. We presented a lightweight framework for automatic revision of UML state diagrams
for adding fault-tolerance.

8.2

Future Work

This section presents some of the problems that we believe are both important and interesting.
Scalability of Static Analysis and Model Revision Techniques. Emerging hardwares and
novel software architectures pose new challenges to the analysis and/or repair techniques
for system. There is a great need for developing novel and eﬀective veriﬁcation and revision techniques to analyzing parallel programs and large data space-based software systems.
One such open problem is to exploit the advantages of highly parallel data architectures to
improve the performance of existing DPLL based static analysis techniques.
Analyzing and Improving the Reliability of Software Systems in the Presence of Faults
such as Conﬁgure Errors or Hardware Faults. Recent research on system failure shows that
human mistakes (e.g., wrong conﬁgurations) and hardware faults, rather than software bugs,
211

are becoming the dominating causes for system’s down time. We are currently conducting a
comprehensive study to understand these problems in real world. Leveraging the techniques
of program analysis, data mining and software engineering, we plan to develop novel conﬁguration patterns/suggestions for a software application to eliminate or alleviate the negative
eﬀects of human mistakes, especially misconﬁguration. Also, based on the understanding
of real world data, we expect to identify the taxonomy of hardware faults and investigate
corresponding tolerance solutions.

212

BIBLIOGRAPHY

213

BIBLIOGRAPHY
[1] F. Abujarad and S. Kulkarni. Constraint based automated synthesis of nonmasking
and stabilizing fault-tolerance. In Proceedings of the 2009 28th IEEE International
symposium on Reliable Distributed Systems, pages 119–128, 2009.
[2] F. Abujarad and S. S. Kulkarni. Automated addition of fault-tolerance to scr toolset:
A case study. In The Seventh International Workshop on Assurance in Distributed
Systems and Networks (ADSN), in ICDCSW ’08: Proceedings of the 2008 The 28th
International Conference on Distributed Computing Systems Workshops, pages 539–
544, 2008.
[3] M. K. Aguilera, K. Keeton, A. Merchant, K.-K. Muniswamy-Reddy, and M. Uysal.
Improving recoverability in multi-tier storage systems. In Proceedings of the 37th
Annual IEEE/IFIP International Conference on Dependable Systems and Networks,
DSN ’07, 2007.
[4] I. E. Akkus and A. Goel. Data recovery for web applications. In Proceedings of the 40th
Annual IEEE/IFIP International Conference on Dependable Systems and Networks,
DSN ’10, 2010.
[5] S. Al-Kiswany, M. Ripeanu, S. S. Vazhkudai, and A. Gharaibeh. stdchk: A checkpoint
storage system for desktop grid computing. In 28th IEEE International Conference on
Distributed Computing Systems (ICDCS 2008), 2008.
[6] B. Alpern and F. B. Schneider. Deﬁning liveness. Information Processing Letters,
21:181–185, 1985.
[7] B. Alpern and F. B. Schneider. Proving boolean combinations of deterministic properties. Proceedings of the Second Symposium on Logic in Computer Science, pages
131–137, 1987.
[8] R. Alur, C. Courcoubetis, T. Henzinger, and P.-H. Ho. Hybrid automata: An algorithmic approach to the speciﬁcation and veriﬁcation of hybrid systems. In Hybrid
Systems, pages 209–229, 1993.
[9] R. Alur and D. Dill. A theory of timed automata. Theoretical Computer Science
126:183-235, 1994 (preliminary versions appeared in Proc. 17th ICALP, LNCS 443,
1990, and Real Time: Theory in Practice, 1991.
214

[10] R. Alur, T. Feder, and T. Henzinger. The beneﬁts of relaxing punctuality. Journal of
the ACM, 43(1):116–146, 1996.
[11] K. H. amd Shouhuai Xu. Protecting cryptographic keys from memory disclosure attacks. In Proceedings of the 37th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks, DSN ’07, 2007.
[12] Y. Amir, B. A. Coan, J. Kirsch, and J. Lane. Byzantine replication under attack. In
Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’08, 2008.
[13] H. Ando, R. Kan, Y. Tosaka, K. Takahisa, and K. Hatanaka. Validation of hardware
error recovery mechanisms for the sparc64 v microprocessor. In Proceedings of the 38th
Annual IEEE/IFIP International Conference on Dependable Systems and Networks,
DSN ’08, 2008.
[14] B. Archive, E. Anderson, X. Li, A. Merchant, M. A. Shah, K. Smathers, J. Tucek,
M. Uysal, and J. J. Wylie. Eﬃcient eventual consistency in pahoehoe–an erasurecoded key. In Proceedings of the 40th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks, DSN ’10, 2010.
[15] A. Arora. A foundation of fault-tolerant computing. PhD thesis, The University of
Texas at Austin, 1992.
[16] A. Arora, P. C. Attie, and E. A. Emerson. Synthesis of fault-tolerant concurrent
programs. Proceedings of the 17th ACM Symposium on Principles of Distributed Computing (PODC), 1998.
[17] A. Arora and M. Gouda. Closure and convergence: A foundation of fault-tolerant
computing. IEEE Transactions on Software Engineering, 19(11):1015–1027, Nov 1993.
[18] A. Arora and M. Gouda. Distributed reset.
43(9):1026–1038, Sep. 1994.

IEEE Transactions on Computers,

[19] A. Arora and S. S. Kulkarni. Designing masking fault-tolerance via nonmasking faulttolerance. IEEE Transactions on Software Engineering, 24(6):435–450, Nov 1998.
[20] E. Asarin and O. Maler. As soon as possible: Time optimal control for timed automata.
In In Proc. of the 2nd International Workshop on Hybrid Systems: Computation and
Control, pages 19–30, London, UK, 1999. Springer-Verlag.
215

[21] E. Asarin, O. Maler, A. Pnueli, and J. Sifakis. Controller synthesis for timed automata. In In IFAC symposium on System Structure and Controller, pages 469–474,
Nantes,France, 1998. Elsevier.
[22] P. Attie and A. Emerson. Synthesis of concurrent programs for an atomic read/write
model of computation. ACM TOPLAS (a preliminary version of this paper appeared
in PODC96), 23(2), March 2001.
[23] P. C. Attie, A. Arora, and E. A. Emerson. Synthesis of fault-tolerant concurrent
programs. ACM Transactions on Programming Languages and Systems, 26(1):125–
185, 2004.
[24] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy
of dependable and secure computing. IEEE Trans. Dependable Secur. Comput., 2004.
[25] L. N. Bairavasundaram, M. Rungta, N. Agrawal, A. C. Arpaci-Dusseau, R. H. ArpaciDusseau, and M. M. Swift. Analyzing the eﬀects of disk-pointer corruption. In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems
and Networks, DSN ’08, 2008.
[26] T. Ball, B. Cook, V. Levin, and S. K. Rajamani. Slam and static driver veriﬁer:
Technology transfer of formal methods inside microsoft. In In: IFM. (2004, pages
1–20. Springer, 2004.
[27] N. Banu, T. Izumi, and K. Wada. Doubly-expedited one-step byzantine consensus. In
Proceedings of the 40th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’10, 2010.
[28] P. Bergheaud, D. Subhraveti, and M. Vertesi. Fault tolerance in multiprocessor systems via application cloning. In 27th IEEE International Conference on Distributed
Computing Systems (ICDCS 2007), 2007.
[29] M. Bertier, A.-M. Kermarrec, and G. Tan. Message-eﬃcient byzantine fault-tolerant
broadcast in a multi-hop wireless sensor network. In Proceedings of the 30th IEEE
International Conference on Distributed Computing Systems, ICDCS ’10, 2010.
[30] A. N. Bessani, V. V. Cogo, M. Correia, P. Costa, M. Pasin, F. Silva, L. Arantes,
O. Marin, P. Sens, and J. Sopena. Making hadoop mapreduce byzantine fault-tolerant.
In Proceedings of the 40th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’10, 2010.
216

[31] R. Bharadwaj and C. Heitmeyer. Developing high assurance avionics systems with
the scr requirements method. In Proceedings of the 19th Digital Avionics Systems
Conference, pages 1D1/1 – 1D1/8, Philadelphia, PA, 2000.
[32] M. Bhide, K. Ramamritham, and M. Agrawal. Eﬃcient execution of continuous incoherency bounded queries over multi-source streaming data. In Proceeding ICDCS ’07
Proceedings of the 27th International Conference on Distributed Computing Systems,
pages 681–688, 2007.
[33] A. Biere, A. Cimatti, E. Clarke, and Y. Zhu. Symbolic model checking without bdds. In
In Proc. Of the Workshop on Tools and Algorithms for the Construction and Analysis
of Systems, 1999.
[34] W. J. Bolosky, J. R. Douceur, and J. Howell. The farsite project: a retrospective. In
in ACM SIGOPS Operating Systems Review, 2007.
[35] B. Bonakdarpour. Automated Revision of Distributed and Real-Time Programs. PhD
thesis, Michigan State University, 2008.
[36] B. Bonakdarpour and S. Kulkarni. Exploiting symbolic techniques in automated synthesis of distributed programs. In Proceedings of In IEEE International Conference on
Distributed Computing Systems(ICDCS), ICDCS ’07, pages 3–10, Toronto, Canada,
2007.
[37] B. Bonakdarpour and S. Kulkarni. Sycraft: A tool for automated synthesis of faulttolerant distributed programs. In Proceedings of International Conference on Concurrency Theory (CONCUR), pages 167–171, Toronto, Canada, 2008.
[38] P. Bouyer, D. D’Souza, P. Madhusudan, and A. Petit. Timed control with partial
observability. Lecture Notes in Computer Science, 2725:180–192, 2003.
[39] A. Brito, C. Fetzer, and P. Felber. Minimizing latency in fault-tolerant distributed
stream processing systems. In Proceedings of the 29th IEEE International Conference
on Distributed Computing Systems, ICDCS ’09, 2009.
[40] J. C. Brustoloni and D. Kyle. Updates and asynchronous communication in trusted
computing systems. In 28th IEEE International Conference on Distributed Computing
Systems (ICDCS 2008), 2008.
[41] R. E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE
Transactions on Computers, 35(8), 1986.
217

[42] J. R. Burch, E. M. Clarke, K. L. McMillan, D. L. Dill, and L. J. Hwang. Symbolic
model checking: 1020 states and beyond. Information and Computation, 98(2):142–
170, 1992.
[43] L. E. Buzato, G. M. D. Vieira, and W. Zwaenepoel. Dynamic content web applications:
Crash, failover, and recovery analysis. In Proceedings of the 39th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’09, 2009.
[44] C. Cachin, I. Keidar, and A. Shraer. Fail-aware untrusted storage. In Proceedings
of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks, DSN ’09, 2009.
[45] L. Chen and J. Leneutre. Selﬁshness, not always a nightmare: Modeling selﬁsh mac
behaviors in wireless mobile ad hoc networks. In ICDCS ’07 Proceedings of the 27th
International Conference on Distributed Computing Systems, 2007.
[46] Y. Chen, O. Gnawali, M. Kazandjieva, P. Levis, and J. Regehr. Surviving sensor
network software faults. In SOSP’09, New York, NY, USA, 2009. ACM.
[47] V. Claesson, H. Lonn, and N. Suri. An eﬃcient tdma start-up and restart synchronization approach for distributed embedded systems. IEEE Transactions on Parallel
and Distributed Systems, 15(8):725–739, 2004.
[48] C. Constantinescu, I. Parulkar, R. Harper, and S. Michalak. Silent data corruption myth or reality? In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’08, 2008.
[49] R. Curtmola, O. Khan, R. C. Burns, and G. Ateniese. Mr-pdp: Multiple-replica provable data possession. In 28th IEEE International Conference on Distributed Computing
Systems (ICDCS 2008), 2008.
[50] D. Dams, W. Hesse, and G. J. Holzmann. Abstracting C with abC. In 14th International Conference on Computer Aided Veriﬁcation (CAV), pages 515–520, London,
UK, 2002. Springer-Verlag.
[51] L. de Alfaro, M. Faella, T. A. Henzinger, R. Majumdar, and M. Stoelinga. The
element of surprise in timed games. In Internatinal Conference on Concurrency Theory
(CONCUR), Marseille, France, 2003. Springer.
[52] L. M. de Moura and N. Bjorner. Z3: An eﬃcient smt solver. In TACAS 2008, 2008.
218

[53] D. de Niz and R. Rajkumar. Glue code generation: Closing the loophole in model-based
development. In 10th IEEE Real-Time and Embedded Technology and Applications
Symposium (RTAS 2004). Workshop on Model-Driven Embedded Systems, Toronto,
Canada, 2004. IEEE Computer Society.
[54] D. de Niz and R. Rajkumar. Glue code generation: Closing the loophole in model-based
development. 2nd RTAS Workshop on Model-Driven Embedded Systems, 2004.
[55] C. Delporte-Gallet, H. Fauconnier, and A. Tielmann. Fault-tolerant consensus in unknown and anonymous networks. In Proceedings of the 29th IEEE International Conference on Distributed Computing Systems, ICDCS ’09, 2009.
[56] A. Demaille, J. E. Denny, and P. Eggert. Bison - gnu parser generator. Technical
report, http://www.gnu.org/software/bison/, 2012.
[57] M. Demirbas, A. Arora, V. Mittal, and V. Kulathumani. A fault local self-stabilizing
clustering service for wireless ad hoc networks. IEEE Transactions on Parallel and
Distributed Systems, 17, 2006.
[58] S. Devismes, S. Tixeuil, and M. Yamashita. Weak vs. self vs. probabilistic stabilization. In ICDCS ’08: Proceedings of the 2008 The 28th International Conference on
Distributed Computing Systems, pages 681–688, 2008.
[59] E. Dijkstra. Self stabilizing systems in spite of distributed control. Communications
of the ACM, 17(11):643 – 644, Nov. 1974.
[60] E. Dijkstra. A Discipline of Programming. New Jersey, USA, 1990.
[61] E. Dijkstra and C. Scholten. Termination detection for diﬀusing computations. Information Processing Letters, 11:1–4, 1980.
[62] X. Ding and H. Jin. Eﬃcient and progressive algorithms for distributed skyline queries
over uncertain data. In Proceedings of the 30th IEEE International Conference on
Distributed Computing Systems, ICDCS ’10, 2010.
[63] S. Dolev. Self-stabilizing routing and related protocols. Journal of Parallel and Distributed Computing, 42(2), 1997.
[64] S. Dolev. Self-Stabilization. MIT Press, 2000.
219

[65] Q. Dong and D. Liu. Adaptive jamming-resistant broadcast systems with partial channel sharing. In Proceedings of the 30th IEEE International Conference on Distributed
Computing Systems, ICDCS ’10, 2010.
[66] D. D’Souza and P. Madhusudan. Timed control synthesis for external speciﬁcations.
In STACS ’02: Proceedings of the 19th Annual Symposium on Theoretical Aspects of
Computer Science, pages 571–582, London, UK, UK, 2002. Springer-Verlag.
[67] B. Dutertre and L. D. Moura. The yices smt solver. Technical report, Computer
Science Laboratory, SRI International, 2006.
[68] A. Ebnenasir. Diconic addition of failsafe fault-tolerance. In Proceedings of the 22nd
IEEE/ACM International Conference on Automated Software Engineering (ASE),
pages 44–53, 2007.
[69] A. Ebnenasir and B. H. C. Cheng. A pattern-based approach for modeling and analyzing error recovery. In Workshops on Software Achitectures for Dependable systems
(WADS), pages 115–141, 2006.
[70] A. Ebnenasir and B. H. C. Cheng. Pattern-based modeling and analysis of failsafe faulttolerance in uml. In Proceedings of the 10th IEEE High Assurance Systems Engineering
Symposium, HASE ’07, pages 275–282, Washington, DC, USA, 2007. IEEE Computer
Society.
[71] A. Ebnenasir and S. Kulkarni. Feasibility of stepwise design of multitolerant programs.
ACM Transactions on Software Engineering and Methodology, 21, 2011. In Press.
[72] A. Ebnenasir, S. S. Kulkarni, and A. Arora. FTSyn: A framework for automatic
synthesis of fault-tolerance. International Journal on Software Tools for Technology
Transfer, 10(5):455–471, 2008.
[73] J. G. Elerath. A simple equation for estimating reliability of an n+1 redundant array of
independent disks (raid). In Proceedings of the 39th Annual IEEE/IFIP International
Conference on Dependable Systems and Networks, DSN ’09, 2009.
[74] J. G. Elerath and M. Pecht. Enhanced reliability modeling of raid storage systems. In
Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’07, 2007.
220

[75] M. Elhaddad, H. Iqbal, T. Znati, and R. Melhem. Scheduling to minimize the worstcase loss rate. In 27th IEEE International Conference on Distributed Computing Systems (ICDCS 2007), 2007.
[76] E. Emerson. Handbook of Theoretical Computer Science: Chapter 16, Temporal and
Modal Logic. Elsevier Science Publishers B. V., 1990.
[77] E. Emerson and E. Clarke. Using branching time temporal logic to synthesize synchronization skeletons. Science of Computer Programming, 2(3):241–266, 1982.
[78] P. D. Ezhilchelvan, F. V. Brasileiro, and N. A. Speirs. A timeout-based message
ordering protocol for a lightweight software implementation of tmr systems. IEEE
Trans. Parallel Distrib. Syst., 15(1):53–65, 2004.
[79] M. Faella, S. Torre, and A. Murano. Dense real-time games. In LICS ’02: Proceedings
of the 17th Annual IEEE Symposium on Logic in Computer Science, pages 167–176,
Copenhagen, Denmark, 2002. IEEE Computer Society.
[80] P. Feiler, B. Lewis, and S. Vestal. The sae architecture analysis and design language
(aadl) standard: A basis for model-based architecture-driven embedded systems engineering. In In Proceedings of the RTAS Workshop on Model-driven Embedded Systems,
pages 1–10, 2009.
[81] J. Fonseca and M. Vieira. Mapping software faults with web security vulnerabilities.
In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’08, 2008.
[82] F. Freitas, E. Marques, R. Rodrigues, C. Ribeiro, P. Ferreira, and L. Rodrigues. Verme:
Worm containment in overlay networks. In Proceedings of the 39th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’09, 2009.
[83] V. K. Garg and V. Ogale. Fusible data structures for fault-tolerance. In 27th IEEE
International Conference on Distributed Computing Systems (ICDCS 2007), 2007.
[84] M. Garofalakis, R. Rastogi, and K. Sabnani. Streaming algorithms for robust, realtime detection of ddos attacks. In Proceeding ICDCS ’07 Proceedings of the 27th
International Conference on Distributed Computing Systems, pages 681–688, 2007.
[85] F. Gartner. Automating the addition of fault-tolerance: Beyond fusion-closed speciﬁcations. Personal Communication.
221

[86] N. J. George, C. R. Elks, B. W. Johnson, and J. Lach. Transient fault models and
avf estimation revisited. In Proceedings of the 40th Annual IEEE/IFIP International
Conference on Dependable Systems and Networks, DSN ’10, 2010.
[87] S. Ghosh. Binary self-stabilization in distributed systems. Information Processing
Letter, 40(3), 1991.
[88] S. Ghosh. Distributed Systems: An Algorithmic Approach. CRC Press, 2006.
[89] S. Ghosh and a. Gupta. An exercise in fault-containment: Self-stabilizing leader election. Information Processing Letters, 1996.
[90] S. Ghosh, A. Gupta, T. Herman, and S. Pemmaraju. Fault-containing self-stabilizing
distributed protocols. Distributed Computing, 20, 2007.
[91] J. Gong, X. Zhong, and C.-Z. Xu. Energy and timing constrained system reward
maximization on wireless networks. In Proceedings of the 28th International Conference
on Distributed Computing Systems, ICDCS ’08, 2008.
[92] M. G. Gouda. The theory of weak stabilization. In Proceedings of the 5th International
Workshop on Self-Stabilizing Systems, pages 114–123, 2001.
[93] M. G. Gouda and N. J. Multari. Stabilizing communication protocols. IEEE Trans.
Computers, 40(4):448–458, April 1991.
[94] J. Gracia-Moran, D. Gil-Tomas, L. J. Saiz-Adalid, J. C. Baraza, and P. J. Gil-Vicente.
Experimental validation of a fault tolerant microcomputer system against intermittent
faults. In Proceedings of the 40th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks, DSN ’10, 2010.
[95] D. Graham, S. R. Per Strid, and F. Rodriguez. A low-tech solution to avoid the severe
impact of transient errors on the ip interconnect. In Proceedings of the 39th Annual
IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’09,
2009.
[96] J. Gray. Functionality, availability, agility, manageability, scalability – the new priorities of application design. In Proc. Int’l Workshop High Performance Trans. Systems,
2001.
[97] F. Greve and S. Tixeuil. Tixeuil: Knowledge connectivity vs. synchrony requirements
for fault-tolerant agreement in unknown networks. In Proceedings of the 37th Annual
222

IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’07,
2007.
[98] M. Grottke, A. P. Nikora, and K. S. Trivedi. An empirical investigation of fault
types in space mission system software. In Proceedings of the 40th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’10, 2010.
[99] W. Gu, Z. Yang, C. Que, D. Xuan, and W. Jia. On security vulnerabilities of null
data frames in ieee 802.11 based wlans. In 28th IEEE International Conference on
Distributed Computing Systems (ICDCS 2008), 2008.
[100] Y. Gu and T. He. Bounding communication delay in energy harvesting sensor networks.
In Proceedings of the 30th IEEE International Conference on Distributed Computing
Systems, ICDCS ’10, 2010.
[101] Z. Gu and K. G. Shin. Synthesis of real-time implementations from component-based
software models. In RTSS ’05: Proceedings of the 26th IEEE International Real-Time
Systems Symposium, pages 167–176, Washington, DC, USA, 2005. IEEE Computer
Society.
[102] Z. Gu, S. Wang, and K. G. Shin. Synthesis of real-time implementation from uml-rt
models. 2nd RTAS Workshop on Model-Driven Embedded Systems, 2004.
[103] R. Guerraoui, D. Kostic, R. Levy, and V. Quema. A high throughput atomic storage
algorithm. In 27th IEEE International Conference on Distributed Computing Systems
(ICDCS 2007), 2007.
[104] W. He, X. Liu, L. Zheng, and H. Yang. Reliability calculus: A theoretical framework
to analyze communication reliability. In Proceedings of the 30th IEEE International
Conference on Distributed Computing Systems, ICDCS ’10, 2010.
[105] P. E. Heegaard and K. S. Trivedi. Survivability quantiﬁcation of communication services. In Proceedings of the 38th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks, DSN ’08, 2008.
[106] J. Heinzmann and A. Zelinsky. A safe-control paradigm for human–robot interaction.
J. Intell. Robotics Syst., 25(4):295–310, 1999.
[107] C. Heitmeyer, M. Archer, R. Bharadwaj, and R. Jeﬀords. Tools for constructing
requirements speciﬁcations: The scr toolset at the age of ten. In International Journal
of Computer Systems Science and Engineering, pages 19–35, 2005.
223

[108] J. Hendricks, S. Sinnamohideen, G. R. Ganger, and M. K. Reiter. Zzyzx: Scalable fault
tolerance through byzantine locking. In Proceedings of the 40th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’10, 2010.
[109] J. N. Herder, H. Bos, B. Gras, P. Homburg, and A. S. Tanenbaum. Fault isolation for
device drivers. In Proceedings of the 39th Annual IEEE/IFIP International Conference
on Dependable Systems and Networks, DSN ’09, 2009.
[110] G. Holzmann. The spin model checker. IEEE Transactions on Software Engineering,
23(5):279–295, 1997.
[111] G. Holzmann. Logic veriﬁcation of ansi-c code with spin. In The Sixth SPIN Workshop,
pages 131–147, 2000.
[112] M. Honda, J. Nakazawa, Y. Nishida, M. Kozuka, and H. Tokuda. A connectivitydriven retransmission scheme based on transport layer readdressing. In 28th IEEE
International Conference on Distributed Computing Systems (ICDCS 2008), 2008.
[113] J. H. Hopeman. Uniform deterministic self-stabilizing ring-orientation on odd-length
rings. Workshop on Distributed Algorithms, 1994.
[114] P.-A. Hsiung and S.-W. Lin. Automatic synthesis and veriﬁcation of real-time embedded software for mobile and ubiquitous systems. Computer Languages, Systems and
Structures, 34(4):153–169, 2008.
[115] P.-A. Hsiung and S.-W. Lin. Automatic synthesis and veriﬁcation of real-time embedded software for mobile and ubiquitous systems. Comput. Lang. Syst. Struct.,
34(4):153–169, 2008.
[116] S.-T. Huang. Leader election in uniform rings. ACM Transactions on Programming
Languages and Systems, 15(6):435–450, Jul 1993.
[117] M. Hutle and A. Schiper. Communication predicates: A high-level abstraction for coping with transient and dynamic faults. In Proceedings of the 37th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’07, 2007.
[118] S. Y. J. Duato and L. Ni. Interconnection network. IEEE Computer Society Press,
1997.
224

[119] M. L. James, A. A. Shapiro, P. L. Springer, and H. P. Zima. Adaptive fault tolerance for
scalable cluster computing in space. Int. J. High Perform. Comput. Appl., 23(3):227–
241, 2009.
[120] M. Jayaram and G. Varghese. Crash failures can drive protocols to arbitrary states.
PODC ’96: Proceedings of the ﬁfteenth annual ACM symposium on Principles of distributed computing, 1996.
[121] K. Jayaraman, W. Du, B. Rajagopalan, and S. J. Chapin. Escudo: A ﬁne-grained
protection model for web browsers. In Proceedings of the 30th IEEE International
Conference on Distributed Computing Systems, ICDCS ’10, 2010.
[122] H. Jin. Checkpointing orchestration for performance improvement. In Proceedings
of the 40th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks, DSN ’10, 2010.
[123] B. Jobstmann, A. Griesmayer, and R. Bloem. Program repair as a game. In Computer
Aided Veriﬁcation (CAV), pages 226–238, Scotland, UK, 2005. Springer.
[124] T. Johnson, S. Mitra, and K. Manamcheri. Safe and stabilizing distributed cellular
ﬂows. In Proceedings of the 30th IEEE International Conference on Distributed Computing Systems, ICDCS ’10, 2010.
[125] T. A. J. Jori Dubrovin. Symbolic model checking of hierarchical uml state machines. In
ACSD: 8th International Conference on Application of Concurrency to System Design,
pages 108 – 117, 2008.
[126] G. Jung, M. A. Hiltunen, K. R. Joshi, R. D. Schlichting, and C. Pu. Mistral: Dynamically managing power, performance, and adaptation cost in cloud infrastructures.
In Proceedings of the 30th IEEE International Conference on Distributed Computing
Systems, ICDCS ’10, 2010.
[127] H. Kannan, M. Dalton, and C. Kozyrakis. Decoupling dynamic information ﬂow tracking with a dedicated coprocessor. In Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’09, 2009.
[128] L. Keller, P. Upadhyaya, and G. Candea. Conferr: A tool for assessing resilience to human conﬁguration errors. In Proceedings of the 38th Annual IEEE/IFIP International
Conference on Dependable Systems and Networks, DSN ’08, 2008.
225

[129] K. S. Killourhy and R. A. Maxion. Comparing anomaly-detection algorithms for
keystroke dynamics. In Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’09, 2009.
[130] A. Knapp and S. Merz. Model checking and code generation for uml state machines
and collaborations. In In Dominik Haneberg, Gerhard Schellhorn, and Wolfgang Reif,
editors, Proc. 5th Wsh. Tools for System Design and Veriﬁcation, pages 59–64, 2002.
[131] K. Kourai and S. Chiba. A fast rejuvenation technique for server consolidation with
virtual machines. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’07, 2007.
[132] S. Kulkarni, J. M. Rushby, and S. Natarajan. A case-study in component-based mechanical veriﬁcation of fault-tolerant programs. In Workshop on Self-stabilizing System,
pages 33–40, 1999.
[133] S. S. Kulkarni. Component-based design of fault-tolerance. PhD thesis, Ohio State
University, 1999.
[134] S. S. Kulkarni and A. Arora. Automating the addition of fault-tolerance. In Proceedings
of the 6th International Symposium of Formal Techniques in Real-Time and FaultTolerant Systems, page 82, Pune, India, 2000. Springer.
[135] S. S. Kulkarni, A. Arora, and A. Ebnenasir. Adding Fault-Tolerance to State MachineBased Designs, volume 19 of Series on Software Engineering and Knowledge Engineering, pages 62–90. Springer Verlag, 2007.
[136] S. S. Kulkarni and A. Ebnenasir. The complexity of adding failsafe fault-tolerance. In
Proceedings of the 22nd International Conference on Distributed Computing Systems,
page 337, Vienna, Austria., 2002. IEEE Computer Society.
[137] S. S. Kulkarni and A. Ebnenasir. Automated synthesis of multitolerance. The International Conference on Dependable Systems and Networks, 2004.
[138] S. S. Kulkarni, C. B. J. Oleszkiewicz, and A. Robinson. Alternators in read/write
atomicity. Information Processing Letters, 2005.
[139] O. Kupferman and M. Vardi. Synthesizing distributed systems. In Proc. 16th IEEE
Symp. on Logic in Computer Science, July 2001.
[140] S. Kutten and D. Peleg. Fault-local mending. Journal of Algorithms, 30(1).
226

[141] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM
Transactions on Programming Languages and Systems, 4:382–401, July 1982.
[142] J.-C. Laprie and B. Randell. Basic concepts and taxonomy of dependable and secure
computing. IEEE transactions on dependable and secure computing, 1(1):11–33, 2004.
Avizienis, Algirdas and Landwehr, Carl.
[143] K. Larsen, P.Pattersson, and W. Yi. UPPAAL in a nutshell. International Journal on
Software Tools for Technology Transfer, 1(1-2):134–152, 1997.
[144] S. B. Lee and V. D. Gligor. Floc : Dependable link access for legitimate traﬃc in ﬂooding attacks. In Proceedings of the 30th IEEE International Conference on Distributed
Computing Systems, ICDCS ’10, 2010.
[145] N. Leveson. Completeness in formal speciﬁcation language design for process-control
systems. In IN: PROCEEDINGS OF THE THIRD WORKSHOP ON FORMAL
METHODS IN SOFTWARE PRACTICE, pages 75–87, Portland, Oregon, 2000. ACM.
[146] J. R. Levine, T. Mason, and D. Brown. A freely available version of lex is ﬂex. Technical
report, http://ﬂex.sourceforge.net/, 1992.
[147] P. Levis, S. Madden, J. Polastre, R. Szewczyk, K. Whitehouse, A. Woo, D. Gay, J. Hill,
M. Welsh, E. Brewer, et al. TinyOS: An operating system for sensor networks. Ambient
Intelligence, pages 115–148, 2005.
[148] D. Li, Y. Zhu, L. Cui, and L. M. Ni. Hotness-aware sensor networks. In 28th IEEE
International Conference on Distributed Computing Systems (ICDCS 2008), 2008.
[149] D. Li, Y. Zhu, L. Cui, and L. M. Ni. Hotness-aware sensor networks. In Proceedings
of the 28th International Conference on Distributed Computing Systems, ICDCS ’08,
2008.
[150] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and Y. Zhou. Tracebased microarchitecture-level diagnosis of permanent hardware faults. In Proceedings
of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks, DSN ’08, 2008.
[151] Q. Li and D. Rus. Global clock synchronization in sensor networks. IEEE Trans.
Comput., 55(2):214–226, 2006.
227

[152] W. Li, E. Chan, D. Chen, and S. Lu. Maintaining probabilistic consistency for frequently oﬄine devices in mobile ad hoc networks. In Proceedings of the 29th IEEE
International Conference on Distributed Computing Systems, ICDCS ’09, 2009.
[153] Y. Li, C. Ai, and Y. W. Wiwek P. Deshmukh. Data estimation in sensor networks
using physical and statistical methodologies. In 28th IEEE International Conference
on Distributed Computing Systems (ICDCS 2008), 2008.
[154] Y. Li and Z. Lan. A fast restart mechanism for checkpoint/recovery protocols in
networked environments. In Proceedings of the 38th Annual IEEE/IFIP International
Conference on Dependable Systems and Networks, DSN ’08, 2008.
[155] Q. Lian, Z. Zhang, M. Yang, B. Y. Zhao, Y. Dai, and X. Li. An empirical study of
collusion behavior in the maze p2p ﬁle-sharing system. In ICDCS ’07 Proceedings of
the 27th International Conference on Distributed Computing Systems, 2007.
[156] J. Lilius and I. P. Paltor. Formalising uml state machines for model checking. In
UML’99 Proceedings of the 2nd international conference on The uniﬁed modeling language: beyond the standard, pages 430–444, 1999.
[157] M. Y. Lim, F. L. R. III, T. K. Bletsch, and V. W. Freeh. Padd: Power aware domain
distribution. In Proceedings of the 29th IEEE International Conference on Distributed
Computing Systems, ICDCS ’09, 2009.
[158] S. Lin, C. Tseng, T. Lee, and J. Fu. Vertaf: An application framework for the design
and veriﬁcation of embedded real-time software. IEEE Transactions on Software Engineering, 30(10):656–674, 2004. Member-Hsiung, Pao-Ann and Member-See, Win-Bin.
[159] S.-W. Lin, S.-W. Lin, C.-H. Tseng, T.-Y. Lee, and J.-M. Fu. Vertaf: An application
framework for the design and veriﬁcation of embedded real-time software. IEEE Trans.
Softw. Eng., 30(10):656–674, 2004. Member-Pao-Ann Hsiung and Member-Win-Bin
See.
[160] Y. Lin, B. Li, and B. Liang. Diﬀerentiated data persistence with priority random
linear codes. In 27th IEEE International Conference on Distributed Computing Systems
(ICDCS 2007), 2007.
[161] D. Liu. Resilient cluster formation for sensor networks. In ICDCS ’07 Proceedings of
the 27th International Conference on Distributed Computing Systems, 2007.
228

[162] D. Liu. Protecting neighbor discovery against node compromises in sensor networks.
In Proceedings of the 29th IEEE International Conference on Distributed Computing
Systems, ICDCS ’09, 2009.
[163] Z. Liu and M. Joseph. Transformation of programs for fault-tolerance. Formal Aspects
of Computing, 4(5):442–469, 1992.
[164] Z. Liu and M. Joseph. Speciﬁcation and veriﬁcation of fault-tolerance, timing, and
scheduling. ACM Transactions on Programming Languages and Systems (TOPLAS),
21(1):46–89, 1999.
[165] L. Lu, P. Sarkar, D. Subhraveti, S. Sarkar, M. Seaman, R. Jain, and A. Bashir. Carp:
Handling silent data errors and site failures in an integrated program and storage
replication mechanism. In Proceedings of the 29th IEEE International Conference on
Distributed Computing Systems, ICDCS ’09, 2009.
[166] M. Lubaszewski and B. Courtois. A reliable fail-safe system. IEEE Transactions on
Computers, 47(2):236–241, 1998.
[167] J. Luo, C. Huang, and L. Xu. Decoding star code for tolerating simultaneous disk
failure and silent errors. In Proceedings of the 40th Annual IEEE/IFIP International
Conference on Dependable Systems and Networks, DSN ’10, 2010.
[168] T. Ma, J. Hillston, and S. Anderson. On the quality of service of crash-recovery failure
detectors. In Proceedings of the 37th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks, DSN ’07, 2007.
[169] O. Maler, D. Nickovic, and A. Pnueli. From MITL to timed automata. In Formal
Modeling and Analysis of Timed Systems (FORMATS), pages 274–289, 2006.
[170] P. K. Manna, S. Ranka, and S. Chen. Analysis of maximum executable length for
detecting text-based malware. In 28th IEEE International Conference on Distributed
Computing Systems (ICDCS 2008), 2008.
[171] A. F. Martinez and K. Kuchcinski. Graph matching constraints for synthesis with
complex components. In DSD ’07: Proceedings of the 10th Euromicro Conference on
Digital System Design Architectures, Methods and Tools, pages 288–295, Washington,
DC, USA, 2007. IEEE Computer Society.
229

[172] M. Marwah, S. Mishra, and C. Fetzer. Enhanced server fault-tolerance for improved
user experience. In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’08, 2008.
[173] K. McMillan. The smv system for smv version 2.5.4. Technical report, Carnegie Mellon
University, 2000.
[174] K. L. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 1993.
[175] A. Meixner and D. J. Sorin. Detouring: Translating software to circumvent hard
faults in simple cores. In Proceedings of the 38th Annual IEEE/IFIP International
Conference on Dependable Systems and Networks, DSN ’08, 2008.
[176] J. W. Mickens and B. D. Noble. Concilium: Collaborative diagnosis of broken overlay
routes. In Proceedings of the 37th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks, DSN ’07, 2007.
[177] H. Moniz, N. F. Neves, and M. Correia. Turquois: Byzantine consensus in wireless ad
hoc networks. In Proceedings of the 40th Annual IEEE/IFIP International Conference
on Dependable Systems and Networks, DSN ’10, 2010.
[178] R. Morales and I. Gupta. Avmon: Optimal and scalable discovery of consistent availability monitoring overlays for distributed systems. In ICDCS ’07 Proceedings of the
27th International Conference on Distributed Computing Systems, 2007.
[179] M. Muthuprasanna and G. Manimaran. Distributed divide-and-conquer techniques for
eﬀective ddos attack defenses. In 28th IEEE International Conference on Distributed
Computing Systems (ICDCS 2008), 2008.
[180] A. R. Ningfang Mi, E. Smirni, and E. Riedel. Enhancing data availability in disk
drives through background activities. In Proceedings of the 38th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’08, 2008.
[181] O. of Inspector Genera. Audit report: Advance automation system. Technical Report
Report Av-1998-113, USA Department of Transportation, April 1998.
[182] S. L. Pallemulle, H. D. Thorvaldsson, and K. J. Goldman. Byzantine fault-tolerant
web services for n-tier and service oriented architectures. In 28th IEEE International
Conference on Distributed Computing Systems (ICDCS 2008), 2008.
230

[183] J. C. Park and J. R. Crandall. Empirical study of a national-scale distributed intrusion
detection system: Backbone-level ﬁltering of html responses in china. In Proceedings
of the 30th IEEE International Conference on Distributed Computing Systems, ICDCS
’10, 2010.
[184] U. Paul, S. R. Das, and R. Maheshwari. Detecting selﬁsh carrier-sense behavior in
wiﬁ networks by passive monitoring. In Proceedings of the 40th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’10, 2010.
[185] R. Perdisci, M. Antonakakis, X. Luo, and W. Lee. Wsec dns: Protecting recursive
dns resolvers from poisoning attacks. In Proceedings of the 39th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’09, 2009.
[186] L. Pike, J. Maddalon, P. S. Miner, and A. Geser. Abstractions for fault-tolerant
distributed system veriﬁcation. In 17th International Conference Theorem Proving in
Higher Order Logics (TPHOLs), pages 257–270, 2004.
[187] T. Pionteck and W. Brockmann. A concept of a trust management architecture to
increase the robustness of nano age devices. In Dependable Systems and Networks
Workshops (DSN-W), 2010 International Conference on, 2010.
[188] I. Prasetya. Mechanically veriﬁed self-stabilizing hierarchical algorithms. In Proceedings of the Third International Workshop on Tools and Algorithms for Construction
and Analysis of Systems, pages 399–415, 1997.
[189] S. Qadeer and N. Shankar. Verifying a self-stabilizing mutual exclusion algorithm. In
IFIP International Conference on Programming Concepts and Methods (PROCOMET’
98), 1998.
[190] A. Rahmati, L. Zhong, M. A. Hiltunen, and R. Jana. Reliability techniques for rﬁdbased object tracking applications. In Proceedings of the 37th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’07, 2007.
[191] P. Ramanathan, K. G. Shin, and R. W. Butler. Fault-tolerant clock synchronization
in distributed systems. Computer, 23(10):33–42, 1990.
[192] D. Ramsbrock, R. Berthier, and M. Cukier. Proﬁling attacker behavior following ssh
compromises. In Proceedings of the 37th Annual IEEE/IFIP International Conference
on Dependable Systems and Networks, DSN ’07, 2007.
231

[193] L. Rashid, K. Pattabiraman, and S. Gopalakrishnan. Towards understanding the effects of intermittent hardware faults on programs. In Dependable Systems and Networks
Workshops (DSN-W), 2010 International Conference on, 2010.
[194] K. Raymond. A tree-based algorithm for distributed mutual exclusion. ACM Transactions on Computer Systems (TOCS), 7:61–77, 1989.
[195] V. K. Reddy and E. Rotenberg. Coverage of a microarchitecture-level fault check
regimen in a superscalar processor. In Proceedings of the 38th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’08, 2008.
[196] R. Riley, X. Jiang, and D. Xu. An architectural approach to preventing code injection
attacks. In Proceedings of the 37th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks, DSN ’07, 2007.
[197] E. Rozier, W. Belluomini, V. Deenadhayalan, J. Hafner, K. K. Rao, and P. Zhou.
Evaluating the impact of undetected disk errors in raid systems. In Proceedings of
the 39th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks, DSN ’09, 2009.
[198] H. Ruess, N. Shankar, and M. K. Srivas. Modular veriﬁcation of srt division. Form.
Methods Syst, 14, 1999.
[199] J. Rumbaugh, I. Jacobson, and B. G. The Uniﬁed Modeling Language Reference Manual. Pearson Higher Education, 2004.
[200] J. Rumbaugh, I. Jacobson, and B. G. Practical UML Statecharts in C/C++, Second
Edition: Event-Driven Programming for Embedded Systems. Newnes, 2008.
[201] O. Rutti, Z. Milosevic, and A. Schiper. Generic construction of consensus algorithms
for benign and byzantine faults. In Proceedings of the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’10, 2010.
[202] T. Sakata, T. Hirotsu, H. Yamada, and T. Kataoka. A cost-eﬀective dependable
microcontroller architecture with instruction-level rollback for soft error recovery. In
Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’07, 2007.
[203] N. Salatge and J.-C. Fabre. Fault tolerance connectors for unreliable web services. In
Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’07, 2007.
232

[204] H. Schi¨berg, R. Merz, and C. Sengul. A failsafe architecture for mesh testbeds with
o
real users. In MobiHoc S3 ’09: Proceedings of the 2009 MobiHoc S3 workshop on
MobiHoc S3, pages 29–32, New York, NY, USA, 2009. ACM.
[205] N. Schiper and S. Toueg. A robust and lightweight stable leader election service for
dynamic systems. In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’08, 2008.
[206] R. D. Schlichting and F. B. Schneider. Fail-stop processors: an approach to designing
fault-tolerant computing systems. ACM Trans. Comput. Syst, 1983.
[207] C. Sengul and R. Kravets. Heuristic approaches to energy-eﬃcient network design
problem. In ICDCS ’07 Proceedings of the 27th International Conference on Distributed
Computing Systems, 2007.
[208] M. Seraﬁni, P. Bokor, D. Dobre, M. Majuntke, and N. Suri. Scrooge: Reducing the
costs of fast byzantine replication in presence of unresponsive replicas. In Proceedings
of the 40th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks, DSN ’10, 2010.
[209] M. Seraﬁni, N. Suri, J. Vinter, A. Ademaj, W. Brandstatter, F. Tagliabo, and J. Koch.
A tunable add-on diagnostic protocol for time-triggered systems. In Proceedings of
the 37th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks, DSN ’07, 2007.
[210] Y. Shi and G. Lee. Augmenting branch predictor to secure program execution. In
Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’07, 2007.
[211] A. Shye, T. M. andVijay Janapa Reddi, J. Blomstedt, and D. A. Connors. Using
process-level redundancy to exploit multiple cores for transient fault tolerance. In
Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’07, 2007.
[212] R. Sion. Strong worm. In 28th IEEE International Conference on Distributed Computing Systems (ICDCS 2008), 2008.
[213] F. Somenzi. Cudd: Cu decision diagram package. Technical report, University of
Colorado at Boulder, 2012.
233

[214] P. Sommer and R. Wattenhofer. Gradient clock synchronization in wireless sensor
networks. In IPSN ’09: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks, pages 37–48, Washington, DC, USA, 2009. IEEE
Computer Society.
[215] H. J. Song and A. A. Chien. Feedback-based synchronization in system area networks
for cluster computing. IEEE Trans. Parallel Distrib. Syst., 16(10):908–920, 2005.
[216] W. Sootkaneung and K. K. Saluja. Gate input reconﬁguration for combating soft errors
in combinational circuits. In Dependable Systems and Networks Workshops (DSN-W),
2010 International Conference on, 2010.
[217] D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. Safetynet: improving
the availability of shared memory multiprocessors with global checkpoint/recovery.
In ISCA ’02: Proceedings of the 29th annual international symposium on Computer
architecture, 2002.
[218] P. Sousa, A. N. Bessani, W. S. Dantas, F. Souto, M. Correia, and N. F. Neves.
Intrusion-tolerant self-healing devices for critical infrastructure protection. In Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems
and Networks, DSN ’09, 2009.
[219] A. C. Squicciarini, A. Trombetta, and E. Bertino. Robust and secure interactions
in open distributed systems through recovery of trust negotiations. In ICDCS ’07
Proceedings of the 27th International Conference on Distributed Computing Systems,
2007.
[220] C. T. and T. S. Unreliable failure detectors for reliable distributed systems. Journal
of the ACM, 1996.
[221] G. Tan, S. A. Jarvis, and A.-M. Kermarrec. Connectivity-guaranteed and obstacleadaptive deployment schemes for mobile sensor networks. In 28th IEEE International
Conference on Distributed Computing Systems (ICDCS 2008), 2008.
[222] C. Temple. Avoiding the babbling-idiot failure in a time-triggered communication
system. In FTCS ’98: Proceedings of the The Twenty-Eighth Annual International
Symposium on Fault-Tolerant Computing, page 218, Washington, DC, USA, 1998.
IEEE Computer Society.
[223] M. T. Thai, Y. Xuan, I. Shin, and T. Znati. On detection of malicious users using group
testing techniques. In 28th IEEE International Conference on Distributed Computing
Systems (ICDCS 2008), 2008.
234

[224] B. Tong, Z. Li, G. Wang, and W. Zhang. How wireless power charging technology
aﬀects sensor network deployment and routing. In Proceedings of the 30th IEEE International Conference on Distributed Computing Systems, ICDCS ’10, 2010.
[225] T. Tsuchiya, S. Nagano, R. B. Paidi, and T. Kikuno. Symbolic model checking for
self-stabilizing algorithms. IEEE Trans. Parallel Distrib. Syst, 12:81–95, 2001.
[226] N. Tuck, B. Calder, and G. Varghese. Hardware and binary modiﬁcation support for
code pointer protection from buﬀer overﬂow. In MICRO 37: Proceedings of the 37th
annual IEEE/ACM International Symposium on Microarchitecture, 2004.
[227] W. Visser, K. Havelund, G. P. Brat, S. Park, and F. Lerda. Model checking programs.
Journal of Automated Software Engineering, 10(2):203–232, 2003.
[228] S. Wan, Q. Cao, C. Xie, B. Eckart, and X. He. Code-m: A non-mds erasure code
scheme to support fast recovery from up to two-disk failures in storage systems. In
Proceedings of the 40th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’10, 2010.
[229] W. Wang and C. Amza. On optimal concurrency control for optimistic replication.
In Proceedings of the 29th IEEE International Conference on Distributed Computing
Systems, ICDCS ’09, 2009.
[230] W. Wang, D. Pu, and A. Wyglinski. Detecting sybil nodes in wireless networks with
physical layer network coding. In Proceedings of the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’10, 2010.
[231] Y. Wang, H. Wu, F. Li, and N.-F. Tzeng. Protocol design and optimization for
delay/fault-tolerant mobile sensor. In 27th IEEE International Conference on Distributed Computing Systems (ICDCS 2007), 2007.
[232] Y. Wang and J. Wu. A nonblocking approach for reaching an agreement on request
total orders. In 28th IEEE International Conference on Distributed Computing Systems
(ICDCS 2008), 2008.
[233] S. Weiss, P. Urso, and P. Molli. Logoot: A scalable optimistic replication algorithm for
collaborative editing on p2p networks. In Proceedings of the 29th IEEE International
Conference on Distributed Computing Systems, ICDCS ’09, 2009.
235

[234] J. Widder, G. Gridling, B. Weiss, and J.-P. Blanquart. Synchronous consensus with
mortal byzantines. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’07, 2007.
[235] S.-H. Wu, C.-M. Chen, and M.-S. Chen. An asymmetric quorum-base power saving
protocol for clustered ad hoc networks. In ICDCS ’07 Proceedings of the 27th International Conference on Distributed Computing Systems, 2007.
[236] S.-H. Wu, M.-S. Chen, and C.-M. Chen. Fully adaptive power saving protocols for ad
hoc networks using the hyper quorum system. In Proceedings of the 27th International
Conference on Distributed Computing Systems, ICDCS ’08, 2008.
[237] W. Xiao and Q. Yang. Can we really recover data if storage subsystem fails? In 28th
IEEE International Conference on Distributed Computing Systems (ICDCS 2008),
2008.
[238] T. Xie and A. Sharma. Collaboration-oriented data recovery for mobile disk arrays.
In Proceedings of the 29th IEEE International Conference on Distributed Computing
Systems, ICDCS ’09, 2009.
[239] K. Xing, F. Liu, X. Cheng, and D. H.-C. Du. Real-time detection of clone attacks
in wireless sensor networks. In 28th IEEE International Conference on Distributed
Computing Systems (ICDCS 2008), 2008.
[240] Y. Yamauchi, T. Masuzawa, and D. Bein. Preserving the fault-containment of ring
protocols executed on trees. British Computer Journal, 52(4), July 2009.
[241] G. Yan, L. Cuellar, S. Eidenbenz, and N. W. Hengartner. Blue-watchdog: Detecting bluetooth worm propagation in public areas. In Proceedings of the 39th Annual
IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’09,
2009.
[242] G. Yan and S. Eidenbenz. Modeling propagation dynamics of bluetooth worms. In
ICDCS ’07 Proceedings of the 27th International Conference on Distributed Computing
Systems, 2007.
[243] X. Yang, P. Wang, H. Fu, Y. Du, Z. Wang, and J. Jia. Compiler-assisted applicationlevel checkpointing for mpi programs. In 28th IEEE International Conference on Distributed Computing Systems (ICDCS 2008), 2008.
236

[244] Q. Ye and L. Cheng. Dtp: Double-pairwise time protocol for disruption tolerant
networks. In 28th IEEE International Conference on Distributed Computing Systems
(ICDCS 2008), 2008.
[245] M. Young, A. Kate, I. Goldberg, and M. Karsten. Practical robust communication in
dhts tolerating a byzantine adversary. In Proceedings of the 30th IEEE International
Conference on Distributed Computing Systems, ICDCS ’10, 2010.
[246] S. Yu and Y. Zhang. R-sentry: Providing continuous sensor services against random
node failures. In Proceedings of the 37th Annual IEEE/IFIP International Conference
on Dependable Systems and Networks, DSN ’07, 2007.
[247] W. Yu, W. Hongyi, L. Feng, and T. Nian-Feng. Protocol design and optimization for
delay/fault-tolerant mobile sensor networks. In Proceedings of the 27th International
Conference on Distributed Computing Systems, ICDCS ’07, 2007.
[248] W. Yu, N. Zhang, X. Fu, R. Bettati, and W. Zhao. On localization attacks to internet threat monitors: An information-theoretic framework. In Proceedings of the 38th
Annual IEEE/IFIP International Conference on Dependable Systems and Networks,
DSN ’08, 2008.
[249] Y. Yue, L. Tian, H. Jiang, F. Wang, D. Feng, Q. Zhang, and P. Zeng. Rolo: A rotated
logging storage architecture for enterprise data centers. In Proceedings of the 30th
IEEE International Conference on Distributed Computing Systems, ICDCS ’10, 2010.
[250] D. Zhang and D. Liu. Dataguard: Dynamic data attestation in wireless sensor networks. In Proceedings of the 40th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks, DSN ’10, 2010.
[251] L. Zhang and Y. Guan. Detecting click fraud in pay-per-click streams of online advertising networks. In 28th IEEE International Conference on Distributed Computing
Systems (ICDCS 2008), 2008.
[252] Y. Zhang, Z. M. Mao, and J. Wang. A ﬁrewall for routers: Protecting against routing
misbehavior. In Proceedings of the 37th Annual IEEE/IFIP International Conference
on Dependable Systems and Networks, DSN ’07, 2007.
[253] Y. Zhang, Z. Zhang, Z. M. Mao, and Y. C. Hu. Hc-bgp: A light-weight and ﬂexible
scheme for securing preﬁx ownership. In Proceedings of the 39th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, DSN ’09, 2009.
237

[254] Z. Zhang, W. Wu, and S. Shekhar. Optimal placements in ring network for data replicas
in distributed database with majorityvoting protocol. In 28th IEEE International
Conference on Distributed Computing Systems (ICDCS 2008), 2008.
[255] Y. Zhao, S. Vemuri, J. Chen, Y. Chen, H. Zhou, and Z. Fu. Exception triggered dos
attacks on wireless networks. In Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’09, 2009.
[256] Y. Zhou, X. Chen, M. R. Lyu, and J. Liu. Sentomist: Unveiling transient sensor
network bugs via symptom mining. In Proceedings of the 30th IEEE International
Conference on Distributed Computing Systems, ICDCS ’10, 2010.
[257] Q. Zhu and C. Yuan. A reinforcement learning approach to automatic error recovery.
In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’07, 2007.
[258] Y. Zhuang, L. Chen, Xiaoyang, X. S. Wang, and J. Lian. A weighted moving averagebased approach for cleaning sensor data. In 27th IEEE International Conference on
Distributed Computing Systems (ICDCS 2007), 2007.
[259] P. Zielinski. Automatic veriﬁcation and discovery of byzantine consensus protocols. In
Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, DSN ’07, 2007.

238