LIBRARY
Michigan State University

This is to certify that the dissertation entitled

Towards Automated Model Revision For Fault-Tolerant Systems

presented by

FUAD ABUJARAD

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science

Major Professor's Signature

Date

MSU is an Affirmative Action/Equal Opportunity Employer

TOWARDS AUTOMATED MODEL REVISION FOR FAULT-TOLERANT SYSTEMS

By

FUAD ABUJARAD

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2010

ABSTRACT

TOWARDS AUTOMATED MODEL REVISION FOR FAULT-TOLERANT SYSTEMS

By

FUAD ABUJARAD

Automated model revision of distributed programs is one of the emerging and important approaches for achieving and maintaining program correctness. In this approach, an existing model is automatically revised to satisfy new properties. Such model revision is required when an existing model/program is subject to a newly identified fault, a new requirement, or a new environment.
Thus, model revision is especially beneficial in the development of systems that need high assurance. To apply model revision in practice, we need to develop tools that are user friendly, comprehensive, and efficient. However, due to their limitations, the current model revision tools and techniques are not widely used in the development of practical systems. More specifically, some of the limitations are that they suffer from a high learning curve, they incur high time and space complexity, they require many details to be specified that could otherwise be discovered automatically, and they do not cover different types of revision.

Taking these limitations into consideration, in this dissertation we derive theories, develop algorithms, and build tools to advance the state of the art of automated model revision. Our approach comprises four main elements. First, we reduce the learning curve of automated model revision techniques by utilizing existing design tools to perform the revision under the hood. Second, to permit the designer to describe the model to be synthesized efficiently and to minimize the user input, we develop algorithms and tools to automate the generation of the legitimate states of the original model, thereby reducing the designer's burden. Third, to utilize the available computing resources and to complete the revision efficiently, we exploit both symmetry and parallelism to speed up the automated revision and to overcome its bottlenecks. Fourth, to make the revision more comprehensive and to cover additional types of model revision, such as nonmasking and stabilizing fault-tolerance, we develop algorithms and tools that allow the addition of these new types of fault-tolerance. To validate our approach and illustrate its feasibility, we apply it to several case studies.

© Copyright by FUAD ABUJARAD 2010

I dedicate this dissertation to my wonderful family.
Particularly, to my parents, who believed in diligence, science, and the pursuit of academic excellence. To my beloved wife, Samah, who has been patient and supportive through these many years of research, and to our lovely kids, Haya, Khaled, and Amir, who are the joy of our lives.

ACKNOWLEDGMENTS

I am extremely grateful to all who helped me complete my Ph.D. program. First and foremost was the unconditional support of my wife, Samah. Her support, encouragement, quiet patience, and unwavering love were undeniably the bedrock upon which the past eleven years of my life have been built. Her tolerance of my changing moods is a testament in itself of her unyielding devotion and love. I would like to thank our three children, Haya, Khaled, and Amir, who made this all possible. My family made tremendous sacrifices so that I could spend time on my doctoral education. They encouraged and pushed me to continue in my pursuit.

I would like to gratefully and sincerely thank Dr. Sandeep S. Kulkarni for his guidance, understanding, and patience during my graduate studies at Michigan State University. His mentorship was paramount in providing a well-rounded experience consistent with my long-term career goals. He encouraged me to grow not only as an experimentalist but also as an independent thinker. For everything you have done for me, Dr. Kulkarni, I thank you.

I would like to thank the Department of Computer Science and Engineering at MSU, especially the members of my doctoral committee, for their input, valuable discussions, and availability. In particular, I would like to thank Dr. Laura Dillon and Dr. Betty H. C. Cheng, as well as Dr. Jonathan Hall from the Department of Mathematics. This dissertation would not have been nearly as complete without your help.
Additionally, I am very grateful for the friendship of all the members of the SENS lab research group, especially Ali Ebnenasir, Mahesh Arumugam, Borzoo Bonakdarpour, and Jingshu Chen, with whom I worked closely and co-authored some of my papers during my Ph.D. program.

Finally, and most importantly, I would like to acknowledge my parents, Suleiman and Hamamah, for their unconditional love and for their faith in me. It was under their watchful eye that I gained so much self-esteem and an ability to tackle challenges. Also, I would like to thank my brothers and sisters for their continuous support and unending encouragement.

TABLE OF CONTENTS

LIST OF TABLES ........................................ xi
LIST OF FIGURES ....................................... xiv
1 Introduction 1
1.0.1 Motivations and Goals ........................ 3
1.0.2 Thesis ................................ 4
1.0.3 Contributions ............................. 5
1.0.4 Outline ................................ 8
2 Preliminaries 9
2.1 Models and Programs ............................. 9
2.2 Modeling Distributed Programs ....................... 12
2.2.1 Write Restrictions .......................... 13
2.2.2 Read Restrictions .......................... 13
2.2.3 Example (Group) ........................... 13
2.2.4 The Group Algorithm ........................ 14
2.3 Specification ................................. 16
2.4 Faults ..................................... 18
2.5 Fault-Tolerance ................................ 19
2.6 Example: (Data Dissemination Protocol in Sensor Networks) ........ 20
3 Under-The-Hood Revision 24
3.1 Introduction to SCR .............................. 24
3.1.1 SCR Formal Method ......................... 25
3.1.2 Automated Model Revision to Add Fault-Tolerance ......... 29
3.2 Integration of SCR toolset and SYCRAFT .................. 30
3.2.1 Transforming SCR specifications into SYCRAFT input ....... 30
3.2.2 Translation from SCR Syntax to SYCRAFT Syntax ......... 32
3.2.3 Modeling of faults .......................... 32
3.2.4 Adding fault-tolerance to SCR specifications ............ 33
3.3 Case Studies .................................. 33
3.3.1 Case Study 1: Altitude Switch Controller .............. 34
3.3.2 Case Study 2: Cruise Control System ................ 37
3.4 Summary ................................... 39
4 Expediting the Automated Revision Using Parallelization and Symmetry 40
4.1 Introduction .................................. 41
4.2 Issues in Automated Model Revision ..................... 43
4.2.1 Input for Byzantine Agreement Problem ............... 43
4.2.2 The Need for Modeling Read/Write Restrictions .......... 45
4.2.3 The Need for Deadlock Resolution .................. 46
4.3 Approach 1: Parallelizing Group Computation ................ 48
4.3.1 Design Choices ............................ 49
4.3.2 Parallel Group Algorithm Description ................ 50
4.3.3 Experimental Results ......................... 54
4.3.4 Group Time Analysis ........................ 59
4.4 Approach 2: Alternative (Conventional) Approach .............. 60
4.4.1 Design Choices ............................ 61
4.4.2 Algorithm Sketch ........................... 62
4.4.3 Experimental Results ........................ 66
4.5 Using Symmetry to Expedite the Automated Revision ............ 69
4.5.1 Symmetry ............................... 69
4.5.2 Experimental Results ......................... 71
4.6 Summary ................................... 77
5 Nonmasking and Stabilizing Fault-Tolerance 80
5.1 Introduction .................................. 81
5.2 Programs and Specifications ......................... 85
5.3 Synthesis Algorithm of the Nonmasking and Stabilizing Fault-Tolerance . . 86
5.3.1 Constraint Satisfier .......................... 87
5.3.2 Algorithm Illustration ........................ 90
5.4 Expediting the Constraints Satisfaction .................... 91
5.4.1 Design Choices for Parallelism .................... 91
5.4.2 Partitioning the Constraints Satisfaction ............... 93
5.5 Case Studies ................................. 96
5.5.1 Case Study 1: Stabilizing Mutual Exclusion Program ........ 96
5.5.2 Case Study 2: Data Dissemination in Sensor Networks ....... 103
5.5.3 Case Study 3: Stabilizing Diffusing Computation .......... 106
5.6 Choosing Ordering Among Constraints ................... 111
5.7 Reducing the Complexity with Hierarchical Structure ............ 117
5.8 Summary ................................... 119
6 Legitimate States Automated Discovery 121
6.1 Introduction .................................. 122
6.2 The "Weakest Legitimate State Predicate Generator (stpGenerator)" Algorithm .................................... 124
6.2.1 Weakest Legitimate State Predicate Generator ............ 125
6.2.2 Safety Checker ............................ 125
6.2.3 Liveness Checker ........................... 126
6.3 Application of stpGenerator in Automated Model Revision ........ 131
6.3.1 Case Study 1: Byzantine agreement program ............ 131
6.3.2 Case Study 2: Token Ring ...................... 135
6.3.3 Case Study 3: Mutual Exclusion ................... 136
6.3.4 Case Study 4: Diffusing Computation ................ 138
6.4 Summary ................................... 139
7 Automated Model Revision Without Explicit Legitimate States 141
7.1 Introduction .................................. 142
7.2 Problem Statement .............................. 144
7.3 Relative Completeness (Q. 1) ......................... 146
7.4 Complexity Analysis (Q. 2) .......................... 148
7.4.1 Complexity Comparison for Partial Revision ............ 148
7.4.2 Complexity Comparison for Total Revision ............. 153
7.4.3 Heuristic for Polynomial Time Solution for Partial Revision .... 155
7.4.4 Algorithm for Model Revision Without Explicit Legitimate States . 156
7.4.5 Summary of Complexity Results ................... 159
7.5 Relative Computation Cost (Q. 3) ...................... 161
7.6 Summary ................................... 162
8 Related Work 163
8.1 Model Checking ................................ 164
8.2 Controller Synthesis and Game Theory ................... 167
8.3 Model Revision and Automated Program Synthesis ............. 168
8.4 Parallelization and Symmetry ......................... 170
8.5 Nonmasking and Stabilizing Fault-Tolerance ................. 172
8.6 Legitimate States Discovery ......................... 173
9 Conclusion and Future Work 175
9.1 Contributions ................................. 175
9.2 Future Research Directions .......................... 182
BIBLIOGRAPHY ........................................................... 187

LIST OF TABLES

3.1 Monitored Variables of the altitude switch controller system (ASW) ..... 28
3.2 Mode transition table for the mode class mcStatus. ............. 28
3.3 Condition table for cWakeUpDOI. ...................... 29
3.4 mRoom Mode Table .............................. 30
3.5 Translation rules ............................... 32
3.6 The mcStatus mode table translated. ..................... 35
3.7 The SYCRAFT fault section. ......................... 36
3.8 The fault-tolerant mcStatus mode table. ................... 36
3.9 Fault-tolerant mode class mcStatus. ..................... 37
3.10 Fault intolerant mode class mcCruise. .................... 38
3.11 The SYCRAFT fault section. ......................... 38
3.12 Fault-tolerant mode class mcCruise. ..................... 39
4.1 Deadlock scenario 1 (The underlined values indicate which variable is being changed by the program action/fault. For reasons of space, the true and false values are replaced by 1 and 0, respectively, for the variables b and f.) .................................... 47
4.2 Deadlock scenario 2 (The underlined values indicate which variable is being changed by the program action/fault.
For reasons of space the true and false values are replaced by 1 and 0 respectively for the variables b and f.) .................................... 48 Group computation time for Byzantine Agreement. PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio. .............................. 60 Group computation time for the Agreement problem in the presence of fail-stop and Byzantine faults. PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio. ....... 60 xi 4.5 4.6 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 5.12 Group computation time for token ring. PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio ....................................... 61 The time required for the revision to add fault-tolerance for several numbers of non-general processes of BA in sequential and by partitioning deadlock states using parallelism. PR: Number of processes. RS: Size of reachable state space. DRT(s): Deadlock resolution time in seconds. TST(s): Total revision time in seconds. ........................... 68 Stabilizing Mutual Exclusion, linear topology ................ 99 Stabilizing Mutual Exclusion, binary tree topology. ............. 100 Stabilizing Mutual Exclusion using Constraints partitioning. Cnst t(s): Total time spent in constraints satisfaction in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB ........... 101 Stabilizing Mutual Exclusion using Group threading. Grp t(s): Total time spent in Group computation in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB. ................ 102 Nonmasking with linear topology data dissemination program. ....... 106 Data Dissemination program using Constraints partitioning. Grp t(s): Total time spent in Group computation in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB.
............ 107 Data Dissemination program using Group threading. Grp t(s): Total time spent in Group computation in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB. ................ 108 Stabilizing Diffusing Computation, linear topology. ............. 110 Stabilizing Diffusing Computation, binary tree topology. .......... 110 Stabilizing Diffusing Computation program using Group threading. Grp t(s): Total time spent in Group computation in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB ......... 112 Stabilizing Diffusing Computation using Constraints partitioning. Cnst t(s): Total time spent in constraints satisfaction in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB ......... 113 Stabilizing Mutual Exclusion with linear topology using random constraints satisfaction. ............................. 115 xii 5.13 Stabilizing Diffusing Computation with linear topology using random constraints satisfaction ............................... 116 6.1 The time required to generate the weakest legitimate state predicate (Byzantine Agreement) ............................. 134 6.2 The time required to generate the weakest legitimate state predicate (token ring) ....................................... 136 6.3 The time required to generate the weakest legitimate state predicate (Mutual Exclusion). ................................ 138 7.1 The complexity of different types of automated revision (NP-C = NP-Complete). .................................. 160 7.2 The time comparison for the Byzantine Agreement program. ........ 162 xiii 3.1 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10 5.1 5.2 5.3 LIST OF FIGURES The transformation cycle between SCR toolset and SYCRAFT. ....... 34 The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of non-general processes of BA in sequential and parallel algorithms. ...........................
55 The time required for the revision to add fault-tolerance for several numbers of non-general processes of BA in sequential and parallel algorithms. . . . . 56 The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of token ring processes in sequential and parallel algorithms. .............................. 57 The time required for the revision to add fault-tolerance for several numbers of token ring processes in sequential and parallel algorithms. ....... 58 Inconsistencies raised by concurrency. .................... 67 The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of BA non-general processes in sequential and symmetrical algorithms. ......................... 72 The time required for the revision to add fault-tolerance for several numbers of BA non-general processes in sequential and symmetrical algorithms. . . 73 The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of token ring processes in sequential and symmetrical algorithms. ........................... 74 The time required for the revision to add fault-tolerance for several numbers of token ring processes in sequential and symmetrical algorithms. ..... 75 The time required for the revision to add fault-tolerance for several numbers of BA non-general processes using both symmetry and parallelism. . . . . 76 Constraints ordering and transitions selections. ............... 90 The holder tree ................................ 98 Complexity and hierarchy for linear topology ................ 117 xiv 5.4 7.1 7.2 7.3 8.1 Complexity and hierarchy for the binary tree topology ........... 119 Model Revision with Explicit Legitimate States ................ 142 Model Revision without Explicit Legitimate States. ............. 143 Mapping of (x1 ∨ x2) ∧ (¬x1 ∨ ¬x2) into corresponding program transitions.
The transitions in bold show the revised program where x1 = true and x2 = false. ..................................... 150 Model Checking and Automated Model Revision ............... 167 xv Chapter 1 Introduction The rapid growth of computer systems is increasing our reliance on them more than ever. Therefore, the burden of ensuring the correctness of reliable hardware and software systems is significantly growing. Model checking is one of the commonly used techniques to provide such assurance, especially for finite state concurrent systems [48,49,64]. Given a model of a system, a model checker verifies whether this model meets a given property. If the model does not satisfy that property, the model checker (typically) gives a counter-example. Then, the model needs to be modified to satisfy the desired property. Consequently, such modification will require another cycle of verification. Based on this observation, in this dissertation, we focus on model revision [26,29,61,96] where an existing model is revised so that it satisfies a given property of interest. Model revision is required in several contexts. For example, it is required to revise an existing model to fix a counter-example, i.e., a bug. It is also required if the original specification was incomplete and the model has to be revised to meet the missing specification. Furthermore, it is required to respond to faults introduced by a change in the environment. When a program is deployed in a new environment, it may be subject to faults that were not considered in the original design. Moreover, even if the faults were known in the initial design, to provide separation of concerns, it is desirable to allow the designer to focus on the functionality aspect and add fault-tolerance subsequently. In either case, it is desired that we revise the program to add fault-tolerance. One requirement for such revision is that the existing program requirements continue to be satisfied [101].
Also, in the above contexts, it is more practical to reuse the existing program in the construction of the revised one [25]. Performing such revisions manually has the potential to incur a huge cost as well as introduce new errors. Therefore, automating such revisions is desirable for reducing cost and guaranteeing assurance. One approach to gain assurance in such program revision is by automated model revision (also known as automated incremental synthesis) [27,30,31,55,59,101,103], which guarantees that the revised program is correct-by-construction. The automated model revision to add fault-tolerance takes a fault-intolerant program, program specifications, and faults as input and generates a fault-tolerant program as output. More specifically, it reuses the original program (which is fault-intolerant) in synthesizing its fault-tolerant version [101]. Moreover, since the synthesized program is correct-by-construction, there is no need to reverify its correctness. The automated model revision (or, incremental synthesis) of fault-tolerant programs is highly desirable, as it allows the designer to focus on the normal system behavior in the absence of faults and leaves the fault-tolerance aspect to the automated techniques. Initially, Kulkarni and Arora [101,102] presented an algorithm for synthesizing fault-tolerant programs. The input to their algorithm is a fault-intolerant program that satisfies its specification in the absence of faults but provides no guarantees in the presence of faults. The output of their algorithm is a fault-tolerant program that continues to satisfy its specifications in the absence of faults and provides the desired level of fault-tolerance to tolerate the given faults. Later, in [59] Ebnenasir and Kulkarni presented an enumerative (explicit-state) implementation of the revision algorithm.
This was a significant step, since it enabled them to verify the concepts of the revision and demonstrate the applicability of the automated revision algorithms [59]. However, similar to other enumerative implementations, it was subject to the state explosion problem and was only suitable for revising small programs. Recently, Bonakdarpour and Kulkarni presented a symbolic implementation of the revision algorithm [27,30]. In this implementation, the components of the revision algorithm are constructed using Boolean formulae represented by Bryant's Ordered Binary Decision Diagrams [33]. This was the first time that moderate to large sized programs (a state space of 10^50 and beyond) have been synthesized. The symbolic implementation enabled them to identify bottlenecks in the automated revision. These bottlenecks included deadlock resolution, computation of reachable states in the presence of faults, and addition of recovery paths. 1.0.1 Motivations and Goals In practice, applying the automated model/program revision in real life applications is difficult due to the following factors: 1. The use of the existing tools for automated model revision has a high learning curve. The designer is required to learn different aspects of modeling distributed programs, program specification, faults, and fault-tolerance [27,30,59]. To alleviate this difficulty, we focus on moving the task of adding fault-tolerance under-the-hood. In this manner, we make automated revision more accessible [3]. 2. Current model revision tools require the designer to specify the fault-intolerant model, the model specifications, the model legitimate states, and the faults [27,30,31,59,101,103]. Of those, identifying the set of legitimate states is the most demanding task. The designer needs to specify the legitimate states of the model and describe them in a logical formula.
Although specifying the model, the specifications, and the faults is a must, it is an open question as to whether the explicit specification of the legitimate states is necessary. To alleviate this difficulty, we focus on designing an algorithm that provides automatic generation of the legitimate states from the model actions and specifications [6]. 3. Current model revision tools [27,30,59] focus on the addition of masking fault-tolerance, where both safety and liveness are preserved during recovery. However, they do not address other types of fault-tolerance, including nonmasking and stabilizing fault-tolerance. In nonmasking fault-tolerance, safety can be violated during recovery and the program should tolerate temporary perturbation. In stabilizing fault-tolerance, the program recovers to its legitimate states from any arbitrary state [57]. To broaden the domain of problems that can be resolved by automated model revision, we develop algorithms for the automated addition of nonmasking and stabilizing fault-tolerance [4]. 4. The current model revision tools utilize multiple heuristics to reduce the complexity of the revision [27,30,59,101]. However, to improve the efficiency further, we need to utilize advantages from model checking [48,49,64,93]. Hence, we develop techniques that concentrate on reducing the complexity of the revision using symmetry and/or parallelism. We show that these approaches provide a significant speedup separately as well as together [5].
1.0.2 Thesis Thesis Statement: Automated model revision can be made more usable, comprehensive, and efficient through the use of four key elements: the use of existing design tools as a front end to the automated model revision tools, the introduction of new revision algorithms that handle different classes of fault-tolerance, the use of the original model specification and actions to automatically discover other inputs to the revision algorithm, and the utilization of symmetry and parallelism. To validate this thesis statement, we have derived theories, developed algorithms, and built tools to advance the automated model revision through a usable, comprehensive, and efficient toolset. First, to reduce the automated revision learning curve, we utilized existing design tools (i.e., the SCR toolset) such that the automated revision is done under-the-hood [3]. Second, to revise a broader range of programs, we developed algorithms and tools to add new types of fault-tolerance [4,5]. Next, we reduced the revision parameters by automating the discovery of the program legitimate states, thereby reducing the burden on the designer [6]. Finally, to overcome the automated revision bottlenecks and reduce its time complexity, we utilized both symmetry and parallelism to speed up the revision [2,5]. 1.0.3 Contributions Our contributions can be grouped into four major categories: Under-the-hood Revision It is desirable that the designer utilizes the automated model revision tools with minimal prerequisite knowledge of the details of the automated revision techniques. We focus on performing the automated revision under-the-hood. Therefore, we utilize existing design tools, such as the SCR toolset [21,84,87], in the automated revision. The SCR toolset is a set of tools used to formally construct and verify the requirements specification document. It is widely used in constructing many mission critical systems.
Our approach is to combine the SCR toolset with the tool SYCRAFT that automates the model revision. This approach is desirable, as it allows one to perform functions of the automated model revision without the need to know its details. Of course, it would be necessary to convert (1) the SCR specification into a format that can be used with SYCRAFT and (2) the revised fault-tolerant program into the corresponding SCR specification. Based on the above discussion, we combine the SCR toolset with the automated model revision tool SYCRAFT [27,30]. More specifically, we let the designer specify the program requirements through the SCR toolset interface and we handle the aspects of the automated revision of fault-tolerance using SYCRAFT. Legitimate States Generator One of the requirements of the model revision algorithm is identifying the set of the legitimate states of the program being synthesized. This set represents the states from where the execution of the actions of the model is correct. One approach for providing fault-tolerance is to ensure that after the occurrence of faults, the revised program eventually recovers to the legitimate states of the original model. Since the original model met its original specification from these legitimate states, we can ascertain that eventually a revised model reaches states from where subsequent computation is correct. One of the problems in providing recovery to the legitimate states, however, is that these legitimate states are not always easy to determine. Existing model revision approaches (e.g., SYCRAFT [27,30]) have required the designer to specify these legitimate states explicitly. It is straightforward to observe that if these legitimate states could be derived automatically, then it would reduce the burden put on the designer, thereby making it easier to apply these techniques in revision of existing programs. We focus on identifying the largest set of states from where the existing model is correct.
Nonmasking and Stabilizing Fault-Tolerance To provide comprehensive tools for the automated model revision, we focus our attention on automated addition of nonmasking and stabilizing fault-tolerance to fault-intolerant programs. Intuitively, a nonmasking fault-tolerant program ensures that if the program is perturbed by faults to an illegitimate state, then it would eventually recover to its legitimate states. However, safety may be violated during recovery [101]. The current model revision tools [30,59] support the design of masking fault-tolerance only. However, there are several reasons that make the design of nonmasking fault-tolerance attractive. For one, the design of masking fault-tolerant programs, where both safety and liveness are preserved during recovery, is often expensive or impossible, even though the design of nonmasking fault-tolerance is easy [15]. Also, the design of nonmasking fault-tolerance can assist and simplify the design of masking fault-tolerance [105]. A special case of nonmasking fault-tolerance is stabilizing fault-tolerance [54,56], where, starting from an arbitrary state, the program is guaranteed to reach a legitimate state. Stabilizing systems are especially useful in handling unexpected transient faults. Moreover, this property is often critical in long-lived applications where faults are difficult to predict. We present an algorithm for adding nonmasking fault-tolerance to an existing program by performing three steps [4]. The first step is to identify the set of legitimate states of the fault-intolerant program. This set defines the constraints that should be true in the legitimate states. The second step is to identify a set of convergence actions that recover the program from illegitimate states to a legitimate state. This can be done by finding actions that satisfy one or more constraints. The last step consists of ensuring that the convergence actions do not interfere with each other.
In other words, the collective effect of all recovery actions should, eventually, lead the program to legitimate states. Expediting the Automated Revision To reduce the time complexity of the automated model revision, we first need to identify bottleneck(s) where symmetry and parallelism features can provide the maximum impact. Based on the analysis of the experimental results from Bonakdarpour and Kulkarni [30], the performance of the revision suffers from two major complexity obstacles, namely generation of the fault-span and resolution of deadlock states. To effectively target those bottlenecks, we present two approaches for utilizing the multi-core architecture in reducing the time required to complete the automated revision. The first approach is based on the distributed nature of the program being revised. In particular, when a new transition is added (respectively, removed), since the process executing it has only a partial view of the program variables, we need to add (respectively, remove) a group of transitions based on the variables that cannot be read by the process. The second approach is based on partitioning deadlock states among multiple threads. We show that this provides a small performance benefit. Based on the analysis of these results, we argue that the simple approach that parallelizes the group computation is likely to provide maximum benefit in the context of deadlock resolution in the automated revision of distributed programs. To further expedite the automated model revision, we use symmetry to speed up the revision algorithm. 1.0.4 Outline The remainder of this thesis is organized as follows. Chapter 2 describes the preliminaries and presents the elements of the automated incremental model revision. In Chapter 3, we present our approach to minimize the prerequisite knowledge of the details of the automated revision techniques and provide practical approaches to perform the automated revision under-the-hood.
In Chapter 4, we show how we utilize parallelism and symmetry to expedite the automated model revision. Subsequently, to revise a broader range of programs, in Chapter 5 we present our approach for the automated addition of nonmasking and stabilizing fault-tolerance to fault-intolerant programs. In Chapter 6, we show how we can reduce the designer burden by automatically discovering the legitimate states of the model being revised. Later, in Chapter 7, we analyze the effect of performing the automated model revision without explicitly specifying the legitimate states. We present the related work and literature review in Chapter 8. Finally, we present a summary of our contributions and future research directions in Chapter 9. Chapter 2 Preliminaries In this chapter, we formally present the elements of our automated model revision framework. Mainly, we define the notion of models, programs, specifications, faults, and fault-tolerance. The notion of distributed programs is adapted from Kulkarni and Arora [101]. Definitions of faults and fault-tolerance are based on the ones given by Arora and Gouda [12], Kulkarni [100], and Bonakdarpour [25]. At the end of this chapter, we illustrate the basic constructs of this framework using a real-world example, an application in sensor networks. 2.1 Models and Programs In this section, we present the formal definition of models and programs. A model is described by an abstract program. Intuitively, a program, p, is described using a finite set of variables Vp = {v0, v1, ..., vn}, n ≥ 0, and a finite set of program actions Ap = {a0, a1, ..., am}, m ≥ 0. Each variable, vi ∈ Vp, is associated with a finite domain of values, Di. Let ai ∈ Ap be an action; then ai is defined as follows: ai :: gi → sti; where gi is a Boolean formula involving program variables and sti is a deterministic terminating statement that updates a subset of program variables.
Before we give a formal definition of programs based on this intuition, we define the notion of state space and state predicate. Definition 2.1.1 (state) A state, s, of program p is identified by assigning each variable in Vp a value from its respective domain, Di. ∎ Definition 2.1.2 (state space) The state space, Sp, of p is the set of all possible states of p. ∎ Definition 2.1.3 (state predicate) A state predicate of p is a Boolean expression defined over the program variables Vp. Thus, a state predicate C of p identifies the subset SC ⊆ Sp, where C is true in a state s iff s ∈ SC. ∎ Note that a state predicate corresponds to a set of states where the Boolean value of the corresponding predicate is true. Thus, the intersection of two state predicates corresponds to the conjunction of the corresponding functions. Likewise, disjunction corresponds to union, and so on. Hence, we use these Boolean operators for constructing different state predicates. For example, let C1 and C2 be state predicates that identify the state space subsets SC1 and SC2; then C1 ∧ C2 (respectively C1 ∨ C2) corresponds to SC1 ∩ SC2 (respectively SC1 ∪ SC2). Definition 2.1.4 (transition predicate) Intuitively, a program action consists of one or more transitions. Let (ai :: gi → sti;) be an action of the program. Then, the corresponding transitions included in this action are αi, where αi = {(s0, s1) | gi is true in s0 and s1 is obtained by executing sti from s0}. ∎ Hence, a transition predicate corresponding to an action is a subset of Sp × Sp. A single transition t is specified by the tuple (s0, s1), where s0, s1 ∈ Sp, s0 is the before state, and s1 is the after state. Given a program that is defined in terms of Vp and Ap, we can now identify an equivalent representation in terms of its state space and transitions. In particular, based on Vp and Ap, we can compute Sp, the state space of p, and αi for each action of p. Based on the above, we formally define the program as follows.
Definition 2.1.5 (program) The program p is defined as the tuple (Sp, ⟨α1, α2, α3, ..., αt⟩) where αi ⊆ Sp × Sp. ∎ In many instances, we do not need the details of the individual actions of p. For these cases, we utilize the program transitions δp. For the program p = (Sp, ⟨α1, α2, α3, ..., αt⟩), the transitions of p are δp = (α1 ∪ α2 ∪ α3 ∪ ... ∪ αt). Whenever it is clear from the context, we use p and its transitions δp interchangeably. Definition 2.1.6 (closed) Let SC be a state predicate; then SC is closed in a program p iff (∀(s0, s1) : (s0, s1) ∈ δp : (s0 ∈ SC ⇒ s1 ∈ SC)). ∎ Definition 2.1.7 (enabled) The action ai is enabled in a state sj iff the guard gi is true in the state sj. ∎ Definition 2.1.8 (unfair computation) A sequence of states, σ = (s0, s1, ...), is an unfair computation of p iff 1. ∀j : 0 < j < length(σ) : (sj−1, sj) is obtained by executing a program action, say (ai :: gi → sti). That is, gi is true in sj−1 and sj is obtained by executing sti, and 2. if σ is finite and terminates in state sl, then all the guards of the program actions are false in sl. ∎ Computations can also be fair. Intuitively, a fair computation allows a fair resolution of non-determinism. Next, we define weak and strong fair computation. Definition 2.1.9 (weak-fair computation) σ = (s0, s1, ...) is a weak-fair computation of p iff: 1. σ is an unfair computation of p, and 2. if any action, say ai, of p is enabled in all states sj, sj+1, sj+2, ..., then ∃k : k ≥ j : sk+1 is obtained by executing sti in state sk. ∎ In a weak-fair computation, if some guard, say gi, eventually becomes continuously enabled, then the corresponding action is guaranteed to execute infinitely often. Definition 2.1.10 (strong-fair computation) σ = (s0, s1, ...) is a strong-fair computation iff: 1. σ is an unfair computation of p, and 2.
if there exists an action ai :: gi → sti of p such that gi is true in infinitely many states s of σ, then the transitions (s, s′), where s′ is obtained by executing sti in state s, are included infinitely often in σ. ∎ In a strong-fair computation, if some guard, say gi, is enabled infinitely often, then the corresponding action must execute infinitely often. Note that, in this dissertation, we refer to weak-fair computation as a fair computation. Also, our definition of weak-fair computation is equivalent to weak fairness from [1,9,65]. 2.2 Modeling Distributed Programs Since we focus on the design of distributed programs, we specify the transitions of the program in terms of a set of processes, where every process can read and write a subset of the program variables. The transitions of a process are obtained by considering how that process updates the program variables. The transitions of the program are the union of the transitions of its processes. Definition 2.2.1 (process) A process Pj is specified by the tuple (δj, Rj, Wj) where δj is a transition predicate in Sp and δp = δ1 ∪ δ2 ∪ ... ∪ δx, Rj is the set of variables that the process Pj is allowed to read, and Wj is the set of variables that the process Pj is allowed to write, where Wj ⊆ Rj ⊆ Vp (i.e., we assume that the program must first read a variable to be able to write it). ∎ Notation. Let va(s0) denote the value of variable va in the state s0. A process in a distributed program has a partial view of the program variables, which introduces write/read restrictions. Therefore, when a new program transition is added/removed, we need to add/remove a group of transitions based on the variables that cannot be read/written by that process. The write/read restrictions of the process are defined as follows. 2.2.1 Write Restrictions Let Pj = (δj, Rj, Wj) be a process; then the only variables that Pj can write are variables in Wj.
If Pj can only write the subset of variables Wj and the value of a variable outside Wj is changed in the transition (s0, s1), then that transition cannot be used in synthesizing the transitions of Pj. In other words, being able to write only the subset Wj is equivalent to providing a set of transitions writej(Wj) that Pj cannot use in the revision algorithm. Clearly, the transition predicate writej(Wj) is defined as follows. writej(Wj) = {(s0, s1) : (∃va :: va ∈ (Vp − Wj) : va(s0) ≠ va(s1))}. 2.2.2 Read Restrictions Let Pj = (δj, Rj, Wj); the only variables that Pj can read are variables belonging to Rj. Let t = (s0, s1) be a transition in δj; then we define groupj(t) as the group of transitions associated with t. Such a group includes transitions of the form (s2, s3) where s0 and s2 (respectively s1 and s3) are indistinguishable for Pj. By indistinguishable, we mean that they differ only in terms of the variables that Pj cannot read. Thus, we formally define groupj(t) as follows: groupj(t) = {(s2, s3) | (∧_{v ∉ Rj} (v(s0) = v(s1) ∧ v(s2) = v(s3))) ∧ (∧_{v ∈ Rj} (v(s0) = v(s2) ∧ v(s1) = v(s3)))}. 2.2.3 Example (Group) Let p be a program specified using the set of processes P = {P1(= (δ1, R1, W1)), P2(= (δ2, R2, W2))}, the set of variables V = {v1, v2}, and the domains Dv1 = {0, 1} and Dv2 = {0, 1}. Also, let R1 = {v1} (respectively R2 = {v2}) and W1 = {v1} (respectively W2 = {v2}) (i.e., each process can only read and write its own variable). Now, consider the transition from the state (v1 = 0, v2 = 0) to the state (v1 = 1, v2 = 0). If this transition is to be included in δ1, then it is necessary to include the transition from the state (v1 = 0, v2 = 1) to the state (v1 = 1, v2 = 1). Clearly, this should be the case since P1 is not allowed to read the variable v2; therefore we have to consider the case where v2 = 0 as well as the case where v2 = 1. The automated model revision algorithm adds/removes program transitions to complete the revision.
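The group construction of Section 2.2.2 can be sketched directly. The following is an illustrative sketch (not part of the dissertation's toolset, which operates on MDDs): states are dicts over the program variables, and groupj(t) is enumerated by varying the variables the process cannot read while keeping them unchanged across the step. It assumes, per the remark later in this chapter, that t itself does not violate the process's write restrictions.

```python
from itertools import product

def group(t, readable, domains):
    """group_j(t): all transitions (s2, s3) indistinguishable from t = (s0, s1)
    for a process that reads only `readable`. Readable variables copy t;
    unreadable variables range over their domains but are unchanged by the step.
    Assumes t does not change any unreadable variable (write restrictions hold)."""
    s0, s1 = t
    unreadable = [v for v in sorted(domains) if v not in readable]
    result = []
    for vals in product(*(domains[v] for v in unreadable)):
        hidden = dict(zip(unreadable, vals))
        s2 = {**{v: s0[v] for v in readable}, **hidden}
        s3 = {**{v: s1[v] for v in readable}, **hidden}
        result.append((s2, s3))
    return result

domains = {"v1": (0, 1), "v2": (0, 1)}
# As in Example 2.2.3: P1 reads/writes only v1; t flips v1 from 0 to 1 with v2 = 0.
t = ({"v1": 0, "v2": 0}, {"v1": 1, "v2": 0})
g = group(t, readable={"v1"}, domains=domains)
# The group also contains the transition with v2 = 1, as the example requires.
assert ({"v1": 0, "v2": 1}, {"v1": 1, "v2": 1}) in g and len(g) == 2
```

The sketch enumerates the hidden variables explicitly; the tool avoids this enumeration by representing the same set symbolically.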
Therefore, whenever a transition is added or removed, the revision algorithm must add or remove the corresponding group.

2.2.4 The Group Algorithm

The group algorithm (cf. Algorithm 1) takes a transition set, trans, as an input and computes the transition group, transg, as an output. Specifically, it creates an array, tPred[], with a number of elements equal to the number of processes, such that tPred[i] holds the part of the group transitions associated with the process i (Line 1). Now, based on W_i (i.e., the set of variables the process i is allowed to write), the group algorithm uses the function AllowWrite_i(W_i) to find the set of all transitions which process i is permitted to execute. Then, it uses this set to find which of the transitions in trans process i is responsible for (Line 3). Later, it uses tPred[i] and R_i in the function FindGroup to account for all variables that process i cannot read and compute the transitions that cannot be distinguished by i (Line 4). Once the steps in Lines 3 and 4 are completed for all processes, the algorithm collects the transitions of the group in transg (Lines 7-9) and returns.

Algorithm 1 Group
Input: transition set trans.
Output: transition group transg.
1:  MDD* tPred := MDD[numberOfProcesses];
2:  for i := 0 to numberOfProcesses do
3:      tPred[i] := trans ∧ AllowWrite_i(W_i);
4:      tPred[i] := FindGroup(tPred[i], R_i);
5:  end for
6:  MDD transg := false;
7:  for i := 0 to numberOfProcesses do
8:      transg := transg ∨ tPred[i];
9:  end for
10: return transg;

Observe that for the transition t, group_j(t) can be executed by process P_j while respecting its read/write restrictions. Let tr_j be a set of transitions. Now, based on the notion of read/write restrictions, tr_j can be included in δ_j iff there exist transitions t1, t2, ..., tl such that tr_j = group_j(t1) ∪ group_j(t2) ∪ ... ∪ group_j(tl). Furthermore, let p be a program whose transitions are specified with the processes P1, P2, ..., Px. Also, let trp denote a set of transitions. Then, trp can be included as transitions of p iff there exists a set of transitions tr_1, tr_2, ..., tr_x such that trp = ∪_{j=1}^{x} tr_j and tr_j can be included as transitions of process P_j.

The way we use this group operation is as follows: when we compute a set of transitions, say tr, that we need to either add or remove, we ensure that tr can be implemented using the read/write restrictions of the synthesized program. Hence, often, we cannot add/remove tr as is. Instead, we need to revise tr so that it respects the read/write restrictions of the program being revised. One operation we utilize for this is called Group, where Group_max(tr) returns a superset, say tr_large, that can be included as transitions of the synthesized program. The intuition of the Group_max operation is as follows (cf. Algorithm 1): given a set of transitions, say tr, we use a loop that traverses through all the processes. While traversing process P_j, it computes the subset of transitions, say tr_j, in tr such that each transition in tr_j satisfies the write restrictions of process P_j. Then, for each transition in tr_j, it applies the group operation described above to compute the other transitions that must be included. (Note that with the use of BDDs and MDDs (i.e., Binary and Multi-Valued Decision Diagrams [125]), we do not have to actually evaluate each transition in tr_j separately to compute the corresponding group.) Finally, it takes the union of all transitions obtained thus to compute Group_max(tr).

Another operation we utilize is Group_min. This operation returns a subset, say tr_small, such that tr_small can be included as transitions of the revised program.
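The Group_max loop just described can be sketched in explicit-state form. This is an assumption-laden illustration: the thesis computes these sets symbolically with MDDs, whereas plain Python sets of transitions stand in here, and all helper names are ours.

```python
from itertools import product

# States are value tuples over VARS, in order; domains are illustrative.
VARS = ("v1", "v2")
DOM = {"v1": (0, 1), "v2": (0, 1)}

def respects_write(t, w):
    s0, s1 = t  # variables outside w must be left unchanged
    return all(s0[i] == s1[i] for i, v in enumerate(VARS) if v not in w)

def group(t, r):
    s0, s1 = t
    if any(s0[i] != s1[i] for i, v in enumerate(VARS) if v not in r):
        return set()  # t changes a variable the process cannot read
    out = set()
    for s2 in product(*(DOM[v] for v in VARS)):
        if any(s2[i] != s0[i] for i, v in enumerate(VARS) if v in r):
            continue  # s2 must agree with s0 on every readable variable
        # s3 takes its readable values from s1 and the rest from s2.
        s3 = tuple(s1[i] if v in r else s2[i] for i, v in enumerate(VARS))
        out.add((s2, s3))
    return out

def group_max(tr, processes):
    # Lines 2-9 of Algorithm 1: per process, keep the transitions it may
    # execute, close them under the group operation, then take the union.
    result = set()
    for w, r in processes:
        for t in (t for t in tr if respects_write(t, w)):
            result |= group(t, r)
    return result

# P1 reads/writes only v1; P2 reads/writes only v2.
procs = [({"v1"}, {"v1"}), ({"v2"}, {"v2"})]
tr = {((0, 0), (1, 0))}  # v1: 0 -> 1 while v2 stays 0
closed = group_max(tr, procs)
```

Running this on the example of Section 2.2.3 adds the transition ((0, 1), (1, 1)), exactly the companion transition the example says must be included.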
The operation Group_min is implemented in a similar fashion to that of Group_max by traversing through all processes.

Remark. Since Group_max is the operation that is used most frequently in our algorithms, for simplicity of presentation, we drop the subscript and call it Group.

Remark. Note that group_j(t) is defined only if t does not violate the write restrictions of process P_j. However, for brevity, we do not specify this whenever it is clear from the context.

The tasks involved in computing one such group depend on the number of processes and the number of variables in the program. As can be seen from the formula above, to compute this group the algorithm (cf. Algorithm 1) needs to go through all the processes in the program, and for each process it has to go through all the variables.

2.3 Specification

Following Alpern and Schneider [7], it can be shown that any specification can be partitioned into some "safety" specification and some "liveness" specification. Intuitively, the safety specification indicates that nothing bad should happen. And, a liveness specification requires that something good must eventually happen. Formally,

Definition 2.3.1 (safety) The safety specification, Sf_p, for program p is specified in terms of bad states, SPEC_bs, and bad transitions, SPEC_bt. A sequence (s0, s1, ...) (denoted by σ) satisfies the safety specification of p iff the following two conditions are satisfied:
1. ∀j : 0 ≤ j < len(σ) : s_j ∉ SPEC_bs, and
2. ∀j : 0 < j < len(σ) : (s_{j−1}, s_j) ∉ SPEC_bt. ∎

Definition 2.3.2 (liveness) The liveness specification, Lv_p, of program p is specified in terms of one or more leads-to properties of the form F ↝ T. A sequence σ = (s0, s1, ...) satisfies F ↝ T iff ∀j : (F is true in s_j ⇒ ∃k : j ≤ k < len(σ) : T is true in s_k). We assume that F ∩ T = {}. If not, we can replace the property by ((F − T) ↝ T). ∎

Remark. Observe that if p satisfies F ↝ T, then it cannot contain computations that start from F
and reach a deadlock/termination state without reaching a state in T. Likewise, it cannot contain computations that start from F and reach a cycle without reaching T.

Definition 2.3.3 (specification) A specification, say spec, is a tuple ⟨Sf_p, Lv_p⟩, where Sf_p is a safety specification and Lv_p is a liveness specification. A sequence σ satisfies spec iff it satisfies Sf_p and Lv_p. ∎

Based on the above definition, for simplicity, given a specification, say spec, defined as ⟨Sf_p, Lv_p⟩, we say that spec is an intersection of Sf_p and Lv_p.

Given a program p and its specification, say spec, p may not satisfy spec from an arbitrary state. Rather, it satisfies spec only from its legitimate states (also known as the invariant). We use the term legitimate state predicate I to denote the set of legitimate states of p. In particular, we say that a program p satisfies spec from I iff the following two conditions are satisfied: 1. I is closed in p, and 2. every computation of p that starts from a state in I satisfies spec. A program p satisfies the (safety, liveness, or a combination of both) specification from the legitimate states, I, iff every computation of p that starts from a state in I satisfies that specification.

Definition 2.3.4 (legitimate state predicate) Let I be a state predicate such that p satisfies spec from I; then we say that I is the legitimate state predicate of p for spec. Note that a program may have multiple legitimate state predicates. ∎

2.4 Faults

The faults that may perturb a program are systematically represented by transitions. Based on the classification of faults from Avizienis et al. [18], this representation suffices for physical faults, process faults, message faults, and improper initialization. It is not intended for program errors (e.g., buffer overflow). However, if such an error exhibits a behavior, such as a component crash, it can be modeled using this approach. Thus, a fault for p(= (S_p, δ_p)) is a subset of S_p × S_p.
We use 'p[]f' to mean 'p in the presence of f'. The transitions of p[]f are obtained by taking the union of the transitions of p and the transitions of f. Just as we defined computations of a program in Section 2.1, we define the notion of program computations in the presence of faults. In particular, a sequence of states, σ = (s0, s1, ...), is a computation of p[]f (i.e., a computation of p(= (S_p, δ_p)) in the presence of f) iff the following three conditions are satisfied:
1. ∀j : 0 < j < len(σ) : (s_{j−1}, s_j) ∈ (δ_p ∪ f), and
2. if (s0, s1, ...) is finite and terminates in state s_l, then there does not exist a state s such that (s_l, s) ∈ δ_p, and
3. if σ is infinite, then ∃n : ∀j > n : (s_{j−1}, s_j) ∈ δ_p.

Thus, if σ is a computation of p in the presence of f, then in each step of σ, either a transition of p occurs or a transition of f occurs. Additionally, σ is finite only if it reaches a state from where the program has no outgoing transition. And, if σ is infinite, then σ has a suffix where only program transitions execute. We note that the last requirement can be relaxed to require that σ has a sufficiently long subsequence where only program transitions execute. However, to avoid details such as the length of the subsequence, we require that σ has a suffix where only program transitions execute.

We use f-span (fault-span) to identify the set of states from where the program satisfies its fault-tolerance requirement.

Definition 2.4.1 (f-span) Let T be a state predicate; then T is an f-span of p from I iff I ⇒ T and (∀(s0, s1) : (s0, s1) ∈ p[]f : (s0 ∈ T ⇒ s1 ∈ T)). ∎

Thus, at each state where I of p is true, the T of p from I is also true. Also, T, like I, is closed in p. Moreover, if any action in f is executed in a state where T is true, the resulting state is also one where T is true.
It follows that for all computations of p that start at states where I is true, T is a boundary in the state space of p, up to which (but not beyond which) the state of p may be perturbed by the occurrence of the actions in f.

2.5 Fault-Tolerance

In this section, we present a formal definition of three classical levels of fault-tolerance, namely, failsafe, masking, and nonmasking fault-tolerance.

Fault-Tolerance. In the absence of faults, a program, p, satisfies its specification and remains in its legitimate states. In the presence of faults, it may be perturbed to a state outside its legitimate states. By definition, when the program is perturbed by faults, its state will be one in the corresponding f-span. From such a state, it is desired that p does not result in a failure, i.e., p does not violate its safety specification. Furthermore, p recovers to its legitimate states, from where p subsequently satisfies both its safety and liveness specification. Based on this intuition, we now define what it means for a program to be (masking) fault-tolerant.

Let Sf_p and Lv_p be the safety and liveness specifications for program p. We say that p is masking fault-tolerant to Sf_p and Lv_p from I iff the following two conditions hold:
1. p satisfies Sf_p and Lv_p from I.
2. ∃T ::
   (a) T is an f-span of p from I.
   (b) p[]f satisfies Sf_p from T.
   (c) Every computation of p[]f that starts from a state in T has a state in I.

While masking fault-tolerance is ideal, for reasons of cost and feasibility, a weaker level of fault-tolerance is often required. Two commonly considered weaker levels of fault-tolerance are failsafe and nonmasking. In particular, we say that p is failsafe fault-tolerant [72] if the conditions 1, 2a, and 2b are satisfied in the above definition. And, we say that p is nonmasking fault-tolerant [71] if the conditions 1, 2a, and 2c are satisfied in the above definition.
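The closure conditions of Definition 2.4.1 can be checked directly on an explicit transition system. The sketch below is our own toy encoding (states 0 and 1, sets of transition pairs), not part of the thesis tooling:

```python
def is_fspan(T, I, prog, faults):
    """Definition 2.4.1: I => T, and T is closed under p[]f."""
    if not I <= T:  # I must imply T
        return False
    # Every program or fault transition starting inside T must stay in T.
    return all(s1 in T for (s0, s1) in (prog | faults) if s0 in T)

I = {0}                    # legitimate states
T = {0, 1}                 # candidate fault-span
prog = {(0, 0), (1, 0)}    # (1, 0) is a recovery transition back to I
faults = {(0, 1)}          # the fault perturbs 0 to 1
ok = is_fspan(T, I, prog, faults)
```

Here T = {0, 1} is an f-span, while I itself is not, because the fault transition (0, 1) leaves it.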
2.6 Example: (Data Dissemination Protocol in Sensor Networks)

In this example, we show how we model distributed programs and illustrate some of the definitions from the previous sections. We use the program Infuse, a time division multiple access (TDMA) based reliable data dissemination protocol in sensor networks [104]. In this example, a base station initiates a computation in which data are to be sent to all sensors in the network. The data message is split into fixed-size packets. Each packet is given a sequence number. The base station starts transmitting the packets to its neighbor(s) in specified time slots, in the order of the packet sequence numbers. Subsequently, when the neighbor(s) receive a message, they, in turn, retransmit it to their neighbors, and so on. The computation ends when all sensors in the network receive all the messages.

This protocol does not require explicit acknowledgments to be sent back from the receiver to the sender. For example, when a sensor sends a message to one of its neighbors, it waits before sending the next message until it knows that the receiver did receive the message. In other words, it gets its acknowledgment by listening to the messages the neighboring sensors are currently transmitting. It only advances to the next message if it knows that all its neighbors have attempted to transmit the last message it had sent.

To concisely describe the transitions of the program, we use Dijkstra's guarded command [53] notation:

guard → statement;

where guard is a Boolean expression over program variables, and the statement describes how program variables are updated (the statement always terminates). A guarded command of the form g → st corresponds to transitions of the form {(s0, s1) | g evaluates to true in s0 and s1 is obtained by executing st from s0}.

The Program. In this example, we arrange the processes in a linear topology. The base station has N packets to send to M processes.
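The correspondence between a guarded command and its transition set can be sketched in a few lines of Python (our own explicit-state encoding; abbreviating the state to the pair (s.0, r.1) and bounding the domains at 3 are illustrative assumptions):

```python
def transitions_of(guard, stmt, states):
    # {(s0, s1) | guard is true in s0 and s1 is obtained by executing stmt}
    return [(s, stmt(s)) for s in states if guard(s)]

# The base station's action: transmit the next packet once the neighbor
# has received the previous one (guard s.0 = r.1, statement s.0 := s.0 + 1).
states = [(a, b) for a in range(3) for b in range(3)]
in1 = transitions_of(lambda st: st[0] == st[1],      # guard: s.0 = r.1
                     lambda st: (st[0] + 1, st[1]),  # stmt: s.0 := s.0 + 1
                     states)
```

Only the three states with s.0 = r.1 enable the action, so the command denotes exactly three transitions here.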
The fault-intolerant program transmits the packets in a simple pipeline. For this, each process keeps track of the messages (received/sent) using two variables, r.j and s.j, where r.j is the highest message sequence number received by process j and s.j is the sequence number of the message currently being transmitted by process j. Process j increments r.j every time it receives a new message. It also sets s.j to be the sequence number of the message it is transmitting. The base station transmits a packet if its neighbor has received the previous packet (action IN1). A process j, j > 0, receives a packet from its predecessor if its successor had received the previous packet (actions IN2 and IN3). Thus, the actions of the fault-intolerant program are as follows:

Action for base station:
(IN1) (s.0 = r.1) → s.0 := s.0 + 1;

Action for process j ∈ {1..M−1}:
(IN2) ((r.j ≤ r.(j−1)) ∧ (s.j = r.(j+1)) ∧ (s.(j−1) = r.j + 1)) → r.j, s.j := r.j + 1, s.j + 1;

Action for process M:
(IN3) ((r.M ≤ r.(M−1)) ∧ (s.(M−1) = s.M + 1)) → r.M, s.M := r.M + 1, s.M + 1;

Faults. The faults we consider are such that when a fault occurs, a message is lost. To model such faults for the base station, we add action (F1), where the base station increments s.0 even though its successor has not received the previous packet. To model such an action for other processes, we add action (F2), where a process advances s.j even though the successor has not yet received the previous packet.

(F1) true → s.0 := s.0 + 1;
(F2) ((r.j ≤ r.(j−1)) ∧ (s.(j−1) = s.(j+1))) → r.j, s.j := r.j + 1, s.j + 1;

The Set of Legitimate States. The constraints that define the legitimate states in the case of the data dissemination program are as follows. The first constraint states that initially the base station has all the packets (C1). A process cannot receive a packet if its predecessor has not received it (C2), and cannot transmit a packet that it does not have (C3). A process transmits a packet that is expected by its successor (C4 and C5).
(C1) (s.0 = N)
(C2) (∀j : 0 ≤ j < M : r.(j+1) ≤ r.j)
(C3) (∀j : 1 ≤ j ≤ M : s.j ≤ r.j)
(C4) (∀j : 0 ≤ j < M : s.j ≤ s.(j+1) + 1)
(C5) (∀j : 0 ≤ j < M : r.j ≤ s.(j+1) + 1)

The fault-tolerant program obtained by the revision recovers to these legitimate states using the following recovery actions:

(R1) (r.j > s.(j+1) + 1) ∧ (s.j > s.(j+1) + 1) ∧ (r.j + 1 = s.(j−1)) → r.j := s.(j−1), s.j := s.(j+1) + 1;
(R2) (r.j > s.(j+1) + 1) ∧ (s.j > s.(j+1) + 1) → s.j := s.(j+1) + 1;

Chapter 3

Under-The-Hood Revision

In this chapter, we present our contributions on performing the automated model revision while minimizing the effort and the expertise needed to perform such revision. We show how the designer can continue to utilize existing design tools while the revision is done under-the-hood. This makes automated revision more usable, as well as makes it available across different design tools. Specifically, we focus on integrating the automated revision with the SCR toolset. Part of the reason behind our choice of the SCR toolset is that SCR descriptions are precise, unambiguous, and consistent. Also, many industrial firms use the SCR toolset to develop mission-critical systems.

This chapter is organized as follows. In Section 3.1, we briefly describe the SCR formal method and provide highlights of the SCR toolset. In Section 3.2, we present our approach for transforming the SCR specification into input for SYCRAFT. Then, in Section 3.3, we illustrate our approach using two case studies: an Altitude Switch Controller and an Automobile Cruise Controller. Finally, we summarize the chapter in Section 3.4.

3.1 Introduction to SCR

The Software Cost Reduction (SCR) formal method [22,83,84] is a set of techniques for constructing and evaluating requirements documents for high assurance systems. SCR uses tables to describe system behaviors and properties, as these tables provide a precise description of the model and capture the mathematical representation of systems. But these tables consume a considerable amount of time and resources to verify. Therefore, techniques and tools have been developed to provide a comprehensive framework that automates the validation and verification of the SCR tables.
Hence, the SCR toolset [22,83–87] was created to serve this purpose. In this section, we describe the SCR formal method and show how the SCR toolset is used in the design and verification of event-driven systems.

3.1.1 SCR Formal Method

SCR is a set of formal methods for constructing and verifying requirements specification documents. The U.S. Naval Research Laboratory (NRL) developed SCR in the late 1970s. Since then, it has been used in constructing many mission-critical systems. SCR was used to design and model the A-7 aircraft and to document its requirements. It was also used in the design of the requirements specification of the Operational Flight Program (OFP) for the A-6 aircraft [114], the Bell telephone [91], submarine communication systems, nuclear plants [88], and many other systems.

The SCR formal method specifies system requirements using tabular notation. Tables provide a precise and compact way to describe requirements, making it possible for the user to automatically model and analyze those requirements to identify errors. SCR uses tables to describe both the system and its environment [85,86]. The environmental quantities whose values change the system behavior are described using monitored variables. The environmental quantities whose values are changed by the system are represented by controlled variables.

To relate the variables of the system and represent constraints on those variables, the state machine model of SCR is based on the "Four-Variable Model" that was initially introduced by Parnas [120]. This model describes the desired functionality of an embedded system in terms of four relations, as follows.

• NAT: the set of relations that describe the way in which the values of the variables (monitored or controlled) are restricted by the laws of the environment, whether these laws are imposed by previously deployed systems or by physical laws.
• REQ: the set of relations that defines the way in which the system changes the values of the controlled variables based on changes in the values of the monitored variables.

• IN: the set of relations that maps the values of the monitored quantities to the values of the input variables.

• OUT: the relation that maps the value of the output variables to the controlled quantities.

The IN and OUT relations describe the behavior of the input and output devices in some level of isolation. Thus, the IN and OUT relations give the requirements specification the freedom of specifying the observed system behavior without going into further details.

Four more constructs are also used in SCR. These are modes, terms, conditions, and events. A mode class is a state machine whose states are called modes. Changes from one mode to another are triggered by events. A term is a representation of a group of input variables, mode classes, or other terms in one single name. A condition is a predicate defined on a single system state. Finally, an event is a predicate defined on two system states and is triggered by a change in a system entity.

The following state machine formally represents a typical SCR system: Σ = (S, S0, E^m, T), where S is the state space, S0 ⊆ S is the initial state set, E^m represents the set of monitored events (changes in the values of the monitored variables), and T is the function that identifies the transitions of the system based on monitored events (i.e., T maps e ∈ E^m and the current state s ∈ S to the next state s′ ∈ S) [83].

In SCR, systems are represented in the ideal state and with no time representation. The model defines the system as a before state, in terms of the system entities with guards as conditions, and an after state. The system transitions from the before state to the after state by transitions triggered by a change in an input variable.
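The quadruple Σ = (S, S0, E^m, T) can be rendered as a toy Python sketch. The monitored variable mPressureHigh and controlled variable cAlarm below are hypothetical, chosen only to illustrate the before-state/after-state shape of T:

```python
# Hypothetical one-mode system: raise an alarm while pressure is high.
S0 = {"mPressureHigh": False, "cAlarm": False}  # initial state

def T(event, s):
    """Map a monitored event and the current state to the next state."""
    nxt = dict(s)
    if event == "@T(mPressureHigh)" and not s["mPressureHigh"]:
        nxt["mPressureHigh"], nxt["cAlarm"] = True, True
    elif event == "@F(mPressureHigh)" and s["mPressureHigh"]:
        nxt["mPressureHigh"], nxt["cAlarm"] = False, False
    return nxt  # events that do not apply leave the state unchanged

s1 = T("@T(mPressureHigh)", S0)
```

Note that an event whose triggering condition does not actually change (e.g., @F(mPressureHigh) in the initial state) produces no transition, matching the two-state reading of events above.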
These transitions are part of a transformation, T, which is defined by a set of functions that are represented by the SCR tables.

The SCR toolset [22,83–87] is a set of tools for constructing and validating requirements specifications based on the SCR formal method. It is composed of a specification editor, a user interface for creating and editing the specification in a tabular way; a dependency graph browser, which uses a directed graph representation to show the dependencies among variables; and a simulator, which uses a symbolic variable representation to test whether the desired system behavior is satisfied. The SCR toolset also includes different kinds of checkers: a consistency checker, a model checker, and a property checker. This set of tools helps system designers to check and analyze the specifications and to automatically detect errors and missed cases.

To illustrate these concepts, consider the altitude switch controller system (ASW) [21], which is responsible for turning on a device of interest (DOI) when the aircraft altitude is below 2,000 feet. The ASW will be disabled if it receives an Inhibit signal. A Reset signal will reset the system to its initial state. The ASW has three altitude meters: two are digital and one is analog. It also has a fault indicator that is switched on if the DOI does not turn on in two seconds, if the system fails to initialize, or if all three altitude meters do not work for more than two seconds.

The SCR specification for the ASW system is constructed with five monitored variables, as shown in Table 3.1, one controlled variable, and a mode class. The Boolean variable mAltBelow is true when the aircraft descends below 2,000 feet. The mDOIStatus variable is on when the DOI is powered on. The mInitializing variable indicates whether the system is being initialized. The mInhibit variable indicates whether the system can turn on the DOI or not. The mReset variable monitors the reset request. The controlled variable cWakeupDOI will be initialized to false.
It will be set to true to wake up the DOI.

Name          | Type    | Init. Value | Description
mAltBelow     | Boolean | true        | true iff alt. below threshold
mDOIStatus    | enum    | off         | on if DOI powered on; else off
mInitializing | Boolean | true        | true iff system initializing
mInhibit      | Boolean | false       | true iff DOI power on inhibited
mReset        | Boolean | false       | true iff Reset button is pushed

Table 3.1: Monitored variables of the altitude switch controller system (ASW).

Table 3.2 describes the mode class mcStatus. Each transition in the mode table describes the system transition from one mode to another as a result of a change in one or more monitored variables. There are three modes for the mode class mcStatus: init, standby, and awaitDOIon. For example, the first row of Table 3.2 states that the ASW transitions from the init mode to standby if it is not initializing.

Old Mode   | Event                                                | New Mode
init       | @F(mInitializing)                                    | standby
standby    | @T(mReset)                                           | init
standby    | @T(mAltBelow) WHEN NOT mInhibit AND mDOIStatus = off | awaitDOIon
awaitDOIon | @T(mDOIStatus = on)                                  | standby
awaitDOIon | @T(mReset)                                           | init

Table 3.2: Mode transition table for the mode class mcStatus.

Table 3.3 contains the description of the condition table for the controlled variable cWakeupDOI. The value of the controlled variable cWakeupDOI depends mainly on the current value of the mode class mcStatus. If the value of mcStatus is awaitDOIon, then the DOI can be powered on. If the value of mcStatus is init or standby, the DOI will be turned off.

Mode          | cWakeupDOI
init, standby | false
awaitDOIon    | true

Table 3.3: Condition table for cWakeupDOI.

There are two major advantages of the SCR toolset. First, all the tools interface with each other automatically. Hence, they behave as a single application [83]. Second, the toolset has been adopted by industry and was used in the development of many real-world applications [83]. Moreover, the toolset stores the specifications in an ASCII text file from which other systems can have access to those specifications.
More specifically, we use this file as an interface channel to communicate with the tool SYCRAFT.

3.1.2 Automated Model Revision to Add Fault-Tolerance

Programs are subject to faults that may not be preventable. A program may function correctly in the absence of faults. However, it may not provide the desired functionality in the presence of faults. The automated model revision to add fault-tolerance is the process of transforming a fault-intolerant program into a fault-tolerant one. This transformation guarantees that the program continues to satisfy its specification in the presence of faults. SYCRAFT, described briefly next, is a framework for automating such revisions [27,30]. In SYCRAFT, programs (input and output) and faults are represented using guarded commands. SYCRAFT takes both the program and the faults as input and generates the fault-tolerant version of the program as output. To add fault-tolerance, SYCRAFT first identifies states from where faults alone can violate the safety specification. It removes such states and the transitions that reach them. Then, it adds recovery transitions to ensure that after the occurrence of faults, the program recovers to its legitimate states.

3.2 Integration of SCR toolset and SYCRAFT

In this section, we first describe how we translate the SCR program into an input for SYCRAFT. Then, we describe the modeling of faults and subsequently give an outline of our tool for adding the automated model revision to the SCR toolset. Our approach allows one to perform separation of concerns, where the fault-tolerance aspect is relegated only to the tool that performs the automated addition of fault-tolerance.

3.2.1 Transforming SCR specifications into SYCRAFT input

The integration of SCR and SYCRAFT mainly focuses on the mode table, since the mode table captures the system behavior in response to different inputs. Hence, the mode table is the most relevant in terms of the effect of the faults on system behavior.
The integration focuses on translating the mode table so that it can be used as an input to SYCRAFT, and then translating the SYCRAFT output so as to generate the mode table of the fault-tolerant SCR specification.

We illustrate the mode table in SCR using the simple example mRoom (cf. Table 3.4). As the name suggests, this table describes the different modes of mRoom and shows how they change in response to system events. mRoom has two modes, Dark and Light, and one monitored variable, mSwitchOn. This system switches the room from the Dark mode to the Light mode if the event @T(mSwitchOn) occurs, i.e., if the monitored variable mSwitchOn changes its value from false to true.

Old Mode | Event         | New Mode
Dark     | @T(mSwitchOn) | Light
Light    | @F(mSwitchOn) | Dark

Table 3.4: mRoom mode table.

To add fault-tolerance to the SCR specification, we need to convert the SCR tables into guarded commands. In particular, we need to translate modes, conditions, terms, and events. Next, we describe how we translate the SCR events into guarded commands for SYCRAFT. Events in SCR occur at the time when the value of their condition is switched from false to true, or vice versa, in a single transition. It is not only the current state of the monitored variable that initiates the transition; rather, it is the combination of both the current and the old states. The notation used to represent events is as follows:

(@T(c) WHEN d) ≡ (¬c ∧ c′ ∧ d)

where c represents the condition value in the before state and c′ represents the condition value in the after state [83]. For example, consider the SCR mode table entry in the mRoom mode class:

FROM "Dark" EVENT "@T(mSwitchOn)" TO "Light"

In the "before" state, the mode value of mRoom is Dark and the condition mSwitchOn is false. And, in the "after" state, the mode value mRoom = Light and the condition mSwitchOn = true.
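The two-state reading of events can be sketched directly in Python. The helper names below are ours, not the SCR toolset's API, and conditions are modeled as predicates over a state dictionary:

```python
def at_T(cond, before, after, when=lambda s: True):
    """@T(cond) WHEN d  ==  (not cond) and cond' and d"""
    return (not cond(before)) and cond(after) and when(before)

def at_F(cond, before, after, when=lambda s: True):
    """@F(cond) WHEN d  ==  cond and (not cond') and d"""
    return cond(before) and (not cond(after)) and when(before)

# The mRoom table entry: @T(mSwitchOn) fires on this state pair.
before = {"mRoom": "Dark", "mSwitchOn": False}
after  = {"mRoom": "Light", "mSwitchOn": True}
fires = at_T(lambda s: s["mSwitchOn"], before, after)
```

Note that @T does not fire when the condition was already true, which is exactly why the guard of the translated command below must test the before-state value.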
In SYCRAFT, transitions (guarded commands) are represented in the following format:

(g → st)

The guard, g, is a predicate whose value must be true in the before state in order for the statement, st, to execute. The guarded command translation for the mRoom table entry would be:

((mRoom = Dark) ∧ (mSwitchOn = false)) → mRoom := Light; mSwitchOn := true;

Likewise, we need to convert states, terms, and modes into the corresponding input for SYCRAFT. In particular, each mode is translated into corresponding states that a program could reach. Conditions are translated into guards that determine when actions can be executed.

3.2.2 Translation from SCR Syntax to SYCRAFT Syntax

In this translation, we preserve the model abstraction as well as compactness to avoid the state explosion problem. The goal of this translation is to translate the SCR table syntax into an action language that SYCRAFT can process. The translation rules are based on the fact that the transition relation in the SCR tables is identified using a condition on the current state and another condition on the next state. For example, the current state in SCR is defined using the "FROM mode" with a condition, and the next state is identified by the "TO mode". In the SYCRAFT syntax, we translate the "FROM mode" into "mcMode == mode" and the "TO mode" into "→ mcMode := mode". Table 3.5 shows some of the translation rules.

SCR Syntax          | SYCRAFT Syntax
MODETRANS "mcMode"; | process "mcMode";
FROM                | (
"Source Mode"       | (mcMode = "Source Mode") &&
EVENT               | )(
@F(cond1)           | !cond1
@T(cond1)           | cond1
WHEN                | &&
TO                  | ) →
"Target Mode"       | mcMode := "Target Mode";

Table 3.5: Translation rules.

3.2.3 Modeling of faults

Faults in SYCRAFT are also modeled using guarded commands that change program variables. To effectively model faults for designers, we can model them using tables similar to the way the SCR specification is specified. Note that this would require changes to the SCR toolset.
However, the change is minimal in that it would require adding an extra table for faults rather than putting all program/fault actions together, as was done in [22]. Note that with this change, we do not expect the designer's task to become more complex, since faults are specified using a method similar to describing programs. For simplicity, currently, we let faults be directly represented using guarded commands so that modification to the SCR toolset is not necessary. Likewise, it would be necessary for the designer to specify requirements in the presence of faults. These specifications are also similar to those used in SCR for requirements in the absence of faults.

3.2.4 Adding fault-tolerance to SCR specifications

The scenario of adding fault-tolerance to the SCR specifications is described in Figure 3.1. The cycle begins at step 1 by creating the requirements specification using the SCR toolset. The specification in SCR format is exported from the SCR toolset in step 2. In step 3, the middle layer imports the SCR specification, and the first translation phase generates an output file for use in the addition of fault-tolerance by SYCRAFT. This file is imported in step 4 into SYCRAFT, which generates a fault-tolerant version of the program in step 5. In step 6, the middle layer imports the SYCRAFT output and, in step 7, translates it back to the SCR specification. Finally, in step 8, the file is imported back into the SCR toolset so that it can be visualized using the SCR toolset. Thus, the translation layer shown in Figure 3.1 allows the automated revision to add fault-tolerance where the addition is done under-the-hood, meaning that it allows users of the SCR tools to add fault-tolerance to specifications without knowing the details of SYCRAFT or the theory on which SYCRAFT is based.
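The core of the middle layer's step-3 translation can be sketched as a string-to-string rewrite of one mode-table row. The Python function below is our own illustration of the rules in Table 3.5, not the tool's actual code, and it handles only single-condition @T/@F events:

```python
def translate_row(mode_class, src, event, dst):
    """Turn one SCR mode-table row into a SYCRAFT-style guarded command."""
    kind, cond = event                          # e.g. ("@F", "mInitializing")
    old = "true" if kind == "@F" else "false"   # condition in the before state
    new = "false" if kind == "@F" else "true"   # condition in the after state
    return (f"(({mode_class} = {src}) && ({cond} = {old})) -> "
            f"{mode_class} := {dst}; {cond} := {new};")

# The first row of the ASW mode table (Table 3.2):
row = translate_row("mcStatus", "init", ("@F", "mInitializing"), "standby")
```

Applied to the first row of Table 3.2, this reproduces the shape of the translated entries shown later for the ASW case study.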
3.3 Case Studies

To illustrate the integration of SCR and SYCRAFT, we present two case studies: the control system for an aircraft altitude switch (ASW) [22] and the automobile cruise control system (CCS) [95]. For both systems, we briefly describe the concept and demonstrate how our eight-step method from Section 3.2.4 works on these examples to translate the fault-intolerant SCR specification into the corresponding fault-tolerant specification.

Figure 3.1: The transformation cycle between the SCR toolset and SYCRAFT.

3.3.1 Case Study 1: Altitude Switch Controller

In Section 3.1.1, we described the ASW system and illustrated how it is modeled using the SCR formal method. In this section, we show how to transform the SCR specification of the ASW into guarded commands. Then, we use SYCRAFT to revise the specification of the ASW to add fault-tolerance. Later, we show how to transform the ASW specification from guarded commands back into SCR to import it into the SCR toolset.

Step 1. As shown in Figure 3.1 at step 1, we extract the mode table of the ASW system from the SCR specification. The mcStatus mode table of the ASW system is illustrated in Table 3.2. It describes the mode class mcStatus, which represents a function between the monitored variables and the current value of mcStatus. The mcStatus class has one of the following three modes: standby, init, or awaitDOIon.

Steps 2 & 3. In step 2, we import the SCR specification into the middle layer. This layer generates the input in guarded command format in step 3. The result of the translation layer
The result of the translation layer is as shown in Table 3.6:

((mcStatus = init) /\ (mInitializing = true)) —> mcStatus := standby; mInitializing := false;
((mcStatus = standby) /\ (mReset = false)) —> mcStatus := init; mReset := true;
((mcStatus = standby) /\ (mAltBelow = false) /\ !mInhibit /\ (mDOIStatus = off)) —> mcStatus := awaitDOIon; mAltBelow := true;
((mcStatus = awaitDOIon) /\ ((mDOIStatus = on) = false)) —> mcStatus := standby; mDOIStatus := true;
((mcStatus = awaitDOIon) /\ (mReset = false)) —> mcStatus := init; mReset := true;

Table 3.6: The translated mcStatus mode table.

For example, the second entry in Table 3.6 shows that in order for this action to execute, the old value (i.e., the "before" state) of mcStatus should be equal to standby, and mReset should be equal to false. The two statements on the right-hand side represent the "after" state: the values of both mcStatus and mReset are changed.

We consider three hardware malfunctions that may alter the operation of the fault-intolerant ASW controller [22]: an altimeter fault, an initialization fault, and a DOI fault. All three faults are time-out faults, i.e., they require the system to stay in a given state for a specified amount of time. But since SYCRAFT does not yet include the notion of time, we abstract these faults as on/off flags. We added a new mode, called fault, to the mcStatus class to indicate the presence of faults in the system. Table 3.7 shows how these faults are represented in the input file to SYCRAFT. Note that the fault transitions described below can easily be described using SCR tables. Therefore, the designer can specify the faults using the SCR toolset interface, with which they are familiar.

Step 4.
In step 4, we use the translated SCR specification and the three faults described in Table 3.7 as input to SYCRAFT so that SYCRAFT can add fault-tolerance to the ASW specification, making it tolerate the failure of the altimeter, the initialization, or the DOI.

(mcStatus = init) /\ (Init_Duration_Fault = true) —> Init_Duration_Fault := false; mcStatus := fault;
(mcStatus = standby) /\ (Alt_Duration_Fault = true) —> Alt_Duration_Fault := false; mcStatus := fault;
(mcStatus = awaitDOIon) /\ (AwaitDOI_Duration_Fault = true) —> AwaitDOI_Duration_Fault := false; mcStatus := fault;

Table 3.7: The SYCRAFT fault section.

((mcStatus = init) /\ (mInitializing = true)) —> mcStatus := standby; mInitializing := false;
((mcStatus = standby) /\ (mReset = false)) —> mcStatus := init; mReset := true;
((mcStatus = standby) /\ (mAltBelow = false) /\ !mInhibit /\ (mDOIStatus = off) /\ (mAltFail = false)) —> mcStatus := awaitDOIon; mAltBelow := true;
((mcStatus = awaitDOIon) /\ ((mDOIStatus = on) = false)) —> mcStatus := standby; mDOIStatus := true;
((mcStatus = awaitDOIon) /\ (mReset = false)) —> mcStatus := init; mReset := true;
((mcStatus = fault) /\ (mReset = false)) —> mcStatus := standby; mReset := true;

Table 3.8: The fault-tolerant mcStatus mode table.

Step 5. The result of step 5 is shown in Table 3.8. SYCRAFT added the tolerance in two places. First, the condition (mAltFail = false) was added to the guard of the third transition to prevent mcStatus from activating the device when mAltFail is true. Second, the last transition in Table 3.8 was added to provide recovery from the fault state to one of the system's legitimate states.

Steps 6 & 7. We import the SYCRAFT specifications into the translation layer at step 6 to translate them into fault-tolerant SCR specifications. Table 3.9 is the result of applying the translation to the mcStatus output of SYCRAFT.
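The tables above pair each SCR row with a guarded command by splitting the triggering event into a "before" predicate and an "after" assignment: @T(v) WHEN c holds when v is false and c is true in the old state, and v is true in the new state. The following sketch of that correspondence for a simple conditioned event is our own simplification; the function name and textual output format are assumptions, not the actual translator:

```cpp
#include <string>

// Translate one mode-table row with event "@T(var) WHEN when" into the
// textual guarded command used in the SYCRAFT input: the guard tests the
// "before" state (var = false, plus the WHEN condition, in the old mode),
// and the effect produces the "after" state (new mode, var = true).
std::string toGuardedCommand(const std::string& oldMode,
                             const std::string& var,
                             const std::string& when,
                             const std::string& newMode) {
    std::string guard = "(mcStatus = " + oldMode + ") /\\ ((" + var + ") = false)";
    if (!when.empty()) guard += " /\\ " + when;
    return guard + " --> mcStatus := " + newMode + "; " + var + " := true;";
}
```

A row with a WHEN clause would pass its condition as the third argument, e.g. `toGuardedCommand("standby", "mAltBelow", "!mInhibit /\\ (mDOIStatus = off)", "awaitDOIon")`; @F events are handled symmetrically by swapping the false/true roles.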
Old Mode | Event | New Mode
init | @F(mInitializing) | standby
standby | @T(mReset) | init
standby | @T(mAltBelow) WHEN NOT mInhibit AND mDOIStatus = off AND NOT mAltFail | awaitDOIon
awaitDOIon | @T(mDOIStatus = on) | standby
awaitDOIon | @T(mReset) | init
fault | @T(mReset) | standby
init | @T(Init_Duration_Fault) | fault
standby | @T(Alt_Duration_Fault) | fault
awaitDOIon | @T(AwaitDOI_Duration_Fault) | fault

Table 3.9: Fault-tolerant mode class mcStatus.

3.3.2 Case Study 2: Cruise Control System

The cruise control system (CCS) [95] manages the cruising speed of an automobile by controlling the throttle position. It depends on several monitored variables, namely, mIgnOn, mEngRunning, mSpeed, mLever, and mBrake. The system uses these monitored variables to control the automobile's speed. The cruise mode is engaged by setting mLever to "const", provided that other conditions, such as the engine running and the ignition being on, are met. The CCS can maintain, decrease, or increase the automobile's speed depending on the current speed. Below, we show how the fault-tolerant CCS is produced using the tool described in Figure 3.1.

The mcCruise mode table is shown in Table 3.10. This table specifies the values that the mcCruise class can take. We imported the mode table of Table 3.10 into the middle layer, which generated a specification in SYCRAFT format. We consider a system malfunction that may alter the operation of the fault-intolerant CCS. The fault occurs when the status of the cruise becomes unknown.
Table 3.11 shows how this fault is represented in the input file to SYCRAFT.

Old Mode | Event | New Mode
Off | @T(mIgnOn) | Inactive
Inactive | @F(mIgnOn) | Off
Inactive | @T(mLever = const) WHEN mIgnOn AND mEngRunning AND NOT mBrake | Cruise
Cruise | @F(mIgnOn) | Off
Cruise | @F(mEngRunning) | Inactive
Cruise | @T(mBrake) OR @T(mLever = off) | Override
Override | @F(mIgnOn) | Off
Override | @F(mEngRunning) | Inactive
Override | @T(mLever = resume) WHEN mIgnOn AND mEngRunning AND NOT mBrake OR @T(mLever = const) WHEN mIgnOn AND mEngRunning AND NOT mBrake | Cruise

Table 3.10: Fault-intolerant mode class mcCruise.

((mcCruise = Override) \/ (mcCruise = Cruise) \/ (mcCruise = Inactive) \/ (mcCruise = Off)) /\ (CruiseFault = true) —> mcCruise := Unknown; CruiseFault := false;

Table 3.11: The SYCRAFT fault section.

We input the faults and the fault-intolerant CCS to SYCRAFT in order to add fault-tolerance to the CCS system, namely recovery from the unknown state to one of the CCS safe states. SYCRAFT added two actions that recover from the unknown state to one of the system's valid states, depending on the value of the mIgnOn monitored variable. The fault-tolerant specification is shown in Table 3.12.

Old Mode | Event | New Mode
Off | @T(mIgnOn) | Inactive
Inactive | @F(mIgnOn) | Off
Inactive | @T(mLever = const) WHEN mIgnOn AND mEngRunning AND NOT mBrake | Cruise
Cruise | @F(mIgnOn) | Off
Cruise | @F(mEngRunning) | Inactive
Cruise | @T(mBrake) OR @T(mLever = off) | Override
Override | @F(mIgnOn) | Off
Override | @F(mEngRunning) | Inactive
Override | @T(mLever = resume) WHEN mIgnOn AND mEngRunning AND NOT mBrake OR @T(mLever = const) WHEN mIgnOn AND mEngRunning AND NOT mBrake | Cruise
Unknown | @T(mIgnOn) | Off
Unknown | @F(mIgnOn) | Inactive
Override, Cruise, Off, Inactive | @T(CruiseFault) | Unknown

Table 3.12: Fault-tolerant mode class mcCruise.

3.4 Summary

In this chapter, we presented the techniques we developed to make automated model revision easier to use. Our goal is to make model revision accessible to a wide range of system designers.
Specifically, we utilized existing design tools (e.g., the SCR toolset) as the front end of our approach and performed all aspects of the automated model revision behind the scenes. To achieve this coupling, we developed a middle layer that translates SCR specifications into SYCRAFT specifications and SYCRAFT specifications back into SCR. With this middle layer, we enabled designers to perform the tasks of automated model revision under the hood.

Chapter 4

Expediting the Automated Revision Using Parallelization and Symmetry

To make automated model revision more applicable in practice, we need to develop approaches for enhancing its performance. Specifically, we need to be able to revise programs with moderate to large state spaces in a reasonable amount of time. Our goal in this chapter is to utilize both the properties of the programs being revised and the available infrastructure (e.g., multi-core architectures) to expedite the revision. Hence, we focus on using symmetry, inside the program being revised, and parallelism, obtained from multiple cores, to speed up the revision algorithm.

The rest of this chapter is organized as follows. We explain the bottlenecks of the automated model revision and illustrate the issues involved in the revision problem in the context of Byzantine agreement in Section 4.2. We analyze the effect of the distributed nature of the program being revised on the complexity of the revision in Section 4.2.2. We present our algorithms in Section 4.3. We analyze the results in Section 4.3.3 and argue that our multi-core algorithm is likely to benefit further from additional cores. We evaluate a different parallelism approach in Section 4.4. In Section 4.5, we present our approach for expediting the revision of fault-tolerant programs with the use of symmetry. Finally, we summarize in Section 4.6.
4.1 Introduction

Given the current trend in processor design, where the number of transistors keeps growing as directed by Moore's law but clock speed remains relatively flat, it is expected that multi-core computing will be the key to utilizing such computers most effectively. As argued in [90], programs/protocols from distributed computing are expected to be especially beneficial in exploiting such multi-core computers. One of the difficulties in adding fault-tolerance using automated techniques, however, is its time complexity. Our focus is to evaluate the effectiveness of different approaches that utilize multi-core computing to reduce the time complexity of deadlock resolution during the revision to add fault-tolerance to distributed programs.

To evaluate the effectiveness of multi-core computing, we first need to identify the bottleneck(s) where multi-core features can provide the maximum impact. To identify these bottlenecks, in [30], Bonakdarpour and Kulkarni developed a symbolic (BDD-based) algorithm for adding fault-tolerance to distributed programs with state spaces larger than 10^30. Based on the analysis of the experimental results from [30], they observed that, depending upon the structure of the given distributed intolerant program, the performance of the revision suffers from two major complexity obstacles: (1) generation of the fault-span, the set of states reachable in the presence of faults, and (2) resolution of deadlock states, from which the program has no outgoing transitions. To resolve a deadlock state, we either need to provide recovery actions that allow the program to continue its execution or eliminate the deadlock state by preventing the program execution from reaching it. Of these, generation of the fault-span is closely similar to program verification and, hence, techniques for efficient verification are directly applicable to it.
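The fault-span mentioned above is a reachability fixpoint: starting from the legitimate states, repeatedly add every state reachable through either a program or a fault transition. A minimal explicit-state sketch of this computation follows; the algorithm of [30] does the same thing symbolically on BDDs, and all names here are our own:

```cpp
#include <map>
#include <set>
#include <vector>

// Explicit transition relation: source state -> target state.
using Transitions = std::multimap<int, int>;

// Compute the fault-span: the set of states reachable from 'init' by any
// interleaving of program and fault transitions (a plain BFS fixpoint).
std::set<int> faultSpan(const std::set<int>& init,
                        const Transitions& program,
                        const Transitions& faults) {
    std::set<int> span(init);
    std::vector<int> frontier(init.begin(), init.end());
    while (!frontier.empty()) {
        int s = frontier.back();
        frontier.pop_back();
        for (const Transitions* t : {&program, &faults}) {
            auto range = t->equal_range(s);
            for (auto it = range.first; it != range.second; ++it)
                if (span.insert(it->second).second)   // newly reached state
                    frontier.push_back(it->second);
        }
    }
    return span;
}
```

Because this is an ordinary reachability computation, standard symbolic image-computation optimizations from verification carry over to it directly, which is why the chapter concentrates on the other bottleneck, deadlock resolution.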
In this chapter, we focus on expediting the resolution of deadlock states with the use of parallelism and symmetry.

In the context of dependable systems, the revised fault-tolerant program should meet its liveness requirements even in the presence of faults. Therefore, no deadlock states are permitted in the fault-tolerant program, since the existence of such states can violate the liveness requirements. A program may reach a deadlock state because faults perturb the program to a new state that was not considered in the fault-intolerant program, or because some program actions are removed (e.g., because they violate safety in the presence of faults).

We present two approaches for parallelization. The first approach parallelizes the group computation. It is based on the distributed nature of the program being revised. In particular, when a new transition is added/removed, since the process executing it has only a partial view of the program variables, we need to add/remove a group of transitions based on the variables that cannot be read by the process. The second approach is based on partitioning deadlock states among multiple threads; each thread resolves the deadlock states that have been assigned to it. We show that this provides only a small performance benefit. Based on the analysis of these results, we argue that the simple approach that parallelizes the group computation is likely to provide the maximum benefit in the context of deadlock resolution for the revision of distributed programs.

To understand the use of symmetry, we observe that, often, multiple processes in a distributed program are symmetric in nature, i.e., their actions are similar (except for the renaming of variables). Thus, if we find recovery transitions for one process, then we can utilize symmetry to identify other recovery transitions that should also be included for other processes in the system.
Likewise, if some transitions of a process violate safety in the presence of faults, then we can identify similar transitions of other processes that would also violate safety. If the cost of identifying these similar transitions with the knowledge of symmetry among processes is less than the cost of identifying them explicitly, then the use of symmetry will reduce the overall time required for revision.

We also present an algorithm that utilizes symmetry to expedite the revision. We show that our algorithm significantly improves performance over previous implementations. For example, in the case of Byzantine agreement (BA) [107] with 25 processes, the time for revision with a sequential algorithm was 1,632s. With symmetry alone, the revision time was reduced to 188s (8.7 times better). With parallelism (8 threads), the revision time was reduced to 467s (3.5 times better). When we combined symmetry and parallelism, the total revision time was reduced to 107s (more than 15.2 times better).

4.2 Issues in Automated Model Revision

In this section, we use the example of Byzantine agreement [107] (denoted BA) to describe the issues in automated revision to add fault-tolerance. Towards this end, in Section 4.2.1, we describe the inputs used for revising the Byzantine agreement problem. Subsequently, in Section 4.2.2, we identify the need for explicit modeling of the read/write restrictions imposed by the nature of the distributed program. Finally, in Section 4.2.3, we describe how deadlock states get created while revising the program to add fault-tolerance and illustrate our approach for managing them.

4.2.1 Input for the Byzantine Agreement Problem

The Byzantine agreement problem (BA) consists of a general, say g, and three (or more) non-general processes, say j, k, and l. The agreement problem requires that a process copy the decision chosen by the general (0 or 1) and finalize (output) the decision (subject to some constraints).
Thus, each process of BA maintains a decision d; for the general, the decision can be either 0 or 1, and for the non-general processes, the decision can be 0, 1, or ⊥, where the value ⊥ denotes that the corresponding process has not yet received the decision from the general. Each non-general process also maintains a Boolean variable f that denotes whether that process has finalized its decision. For each process, a Boolean variable b shows whether or not the process is Byzantine; the read/write restrictions (described in Section 4.2.2) ensure that a process cannot determine whether other processes are Byzantine. Thus, a state of the program is obtained by assigning each variable, listed below, a value from its domain, and the state space of the program is the set of all possible states.

V = {d.g} (the general's decision): {0, 1}
  ∪ {d.j, d.k, d.l} (the non-generals' decisions): {0, 1, ⊥}
  ∪ {f.j, f.k, f.l} (finalized?): {false, true}
  ∪ {b.g, b.j, b.k, b.l} (Byzantine?): {false, true}

Fault-intolerant program. To concisely describe the transitions of the (fault-intolerant) version of BA, we use guarded commands of the form g —> st. Recall from Chapter 1 that g is a predicate involving the above program variables and st updates the above program variables. The command g —> st corresponds to the set of transitions {(s0, s1) : g is true in s0 and s1 is obtained by executing st in state s0}. Thus, the transitions of a non-general process, say j, are specified by the following two actions:

BA_intol_j ::
  BA1_j :: (d.j = ⊥) /\ (f.j = false) /\ (b.j = false) —> d.j := d.g
  BA2_j :: (d.j ≠ ⊥) /\ (f.j = false) /\ (b.j = false) —> f.j := true

We include similar transitions for k and l as well. Note that the general does not need explicit actions; the action by which the general sends the decision to j is modeled by BA1_j.

Specification. The safety specification of BA requires validity and agreement.
Validity requires that if the general is non-Byzantine, then the final decision of a non-Byzantine non-general must be the same as that of the general. Additionally, agreement requires that the final decisions of any two non-Byzantine non-generals must be equal. Finally, once a non-Byzantine process finalizes (outputs) its decision, it cannot change it.

Faults. A fault transition can cause a process to become Byzantine if no other process is initially Byzantine. Also, a fault can change the d and f values of a Byzantine process. The fault transitions that affect a process, say j, of BA are as follows (we include similar actions for k, l, and g):

F1 :: ¬b.g /\ ¬b.j /\ ¬b.k /\ ¬b.l —> b.j := true
F2 :: b.j —> d.j, f.j := 0|1, false|true

where d.j := 0|1 means that d.j could be assigned either 0 or 1. In the case of the general process, the second action does not change the value of any f-variable.

Goal of automated addition of fault-tolerance. The goal of the automated revision is to start from the intolerant program (BA_intol_j) and, given the set of faults (F1 & F2), automatically generate the fault-tolerant program (BA_tolerant_j) given below:

BA_tolerant_j ::
  BA1_j :: (d.j = ⊥) /\ (f.j = false) /\ (b.j = false) —> d.j := d.g
  BA2_j :: (d.j ≠ ⊥) /\ (f.j = false) /\ (d.l ≠ ⊥ \/ d.k ≠ ⊥) —> f.j := true
  BA3_j :: (d.l = 0) /\ (d.k = 0) /\ (d.j = 1) /\ (f.j = 0) —> d.j, f.j := 0, 0|1
  BA4_j :: (d.l = 1) /\ (d.k = 1) /\ (d.j = 0) /\ (f.j = 0) —> d.j, f.j := 1, 0|1

In the above program, the first action remains the same. The second action is restricted to execute only in states where another process has the same d value. Actions 3 and 4 correct the process decision.

4.2.2 The Need for Modeling Read/Write Restrictions

Since the program being revised is distributed in nature, each process can only read a subset of the program variables. It is important to articulate these restrictions precisely to ensure that the revised program is realizable under the constraints of the underlying distributed system for which it is designed.
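The closure induced by such read restrictions can be sketched explicitly: adding one transition for a process forces adding every variant of it over the values of the variables that process cannot read. The snippet below is our own explicit-state illustration (the actual tool performs this computation symbolically on BDDs); the `State`, `Transition`, and `group` names are assumptions:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using State = std::map<std::string, int>;            // variable -> value (Boolean here)
using Transition = std::pair<State, State>;          // (source state, target state)

// Given one transition of a process and the Boolean variables that process
// cannot read, return the closure: every transition obtained by fixing each
// unreadable variable to the same arbitrary value in both the source and the
// target state (the process can neither read nor write such a variable).
std::vector<Transition> group(const Transition& t,
                              const std::vector<std::string>& unreadable) {
    std::vector<Transition> result{t};
    for (const auto& v : unreadable) {
        std::vector<Transition> next;
        for (const auto& tr : result) {
            for (int val : {0, 1}) {
                State s0 = tr.first, s1 = tr.second;
                s0[v] = val;
                s1[v] = val;                         // v is left unchanged
                next.push_back({s0, s1});
            }
        }
        result = next;
    }
    return result;
}
```

For Byzantine agreement, a recovery transition of process j would have to be closed under the five Boolean variables j cannot read (b.g, b.k, b.l, f.k, f.l), multiplying one transition into 2^5 = 32 siblings; this blow-up is what makes the group computation a natural target for parallelization later in the chapter.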
For example, in the context of the Byzantine agreement example from Section 4.2.1, non-general process j is not supposed to know whether other processes are Byzantine. It follows that process j cannot include an action of the form 'if b.k is true then change d.j to 0'. To permit such modeling, we need to specify read/write restrictions for a given process. For the Byzantine agreement example, process j is allowed to read R_j = {b.j, d.j, f.j, d.k, d.l, d.g} and it is allowed to write W_j = {d.j, f.j}. Observe that this modeling prevents j from knowing whether other processes are Byzantine. With such read/write restrictions, if process j were to include an action of the form 'if b.k is true then change d.j to 0', then it must also include a transition of the form 'if b.k is false then change d.j to 0'. In general, if transition (s0, s1) is to be included as a transition of process j, then we must also include a corresponding equivalence class of transitions (called a group of transitions) that differ only in terms of the variables that j cannot read. For further discussion of the group operation, please refer to Section 2.2.

4.2.3 The Need for Deadlock Resolution

During revision, we analyze the effect of faults on the given fault-intolerant program and identify a fault-tolerant program that meets the constraints of Problem 2.1. This involves the addition of new transitions as well as the removal of existing transitions. In this section, we utilize the Byzantine agreement problem to illustrate how deadlocks get created during the execution of the revision algorithm and identify two approaches for resolving them.

• Deadlock scenario 1 and the use of recovery actions. One legitimate state, say s0 (cf. Table 4.1), of the Byzantine agreement program is a state where all processes are non-Byzantine, d.g is 0, and the decision of all non-generals is ⊥. Thus, in this state, the general has chosen the value 0 and no non-general has received any value. From this state, process j (respectively k) can copy the general's decision by executing the program action BA1_j (respectively BA1_k), as in s1 (respectively s2) of Table 4.1. The general can become Byzantine and change its value from 0 to 1 arbitrarily, as in s3. Therefore, a non-general can receive either 0 or 1 from the general. Clearly, starting from s3, in the presence of faults (F1 & F2), the program (BA_intol) can reach a state, say s5, where d.g = d.l = 1 and d.j = d.k = 0. From such a state, transitions of the fault-intolerant program violate safety if they allow j (or k) and l
From this state, process j (respectively k) can copy the general decision by executing the program action BAl ,- (respectively BAlk) as in s1 (respectively s2) from Table 4.1. The general can become Byzantine and change its value from 0 to 1 arbitrarily as in s3. Therefore, a non-general can receive either 0 or 1 from the general. Clearly, starting from S3, in the presence of faults (F l & F2), the program (BA,-,,,0,) can reach a state, say s5, where d.g = d.l = 1, and d.j = d.k = 0. From such a state, transitions of the fault-intolerant program violate safety if they allow j (or k) and l 46 to finalize their decision. If we remove these safety violating transitions then there are no other transitions from state S5. In other words, during revision, we encounter that state S5 is a deadlock state. One can resolve this deadlock state by simply adding a recovery transition that changes d! to 0. (Note that based on the discussion of Section 4.2 .2, adding such recovery transition requires us to add the corresponding group of transitions. It is straightforward to observe that none of the transitions in this group violate safety.) Action/ State Fault b.g b.j b.k b.l d.g d j d.k d.l f j f.k f.l So — O 0 O O 0 I I J. 0 O 0 S] BA 1 j 0 0 0 0 0 Q J_ .l. O O 0 S2 BA 1 k 0 0 O O O O Q _L O 0 0 S3 F 1 1 O 0 0 O 0 O _L 0 0 0 S4 F2 1 O O 0 _1_ 0 O J. O O 0 S5 BA 1 1 1 O O 0 1 0 0 l 0 O O Table 4.1: Deadlock scenario 1 (The underlined values indicates which variable is being changed by the program action/fault. For reasons of space the true and false values are replaced by l and 0 respectively for the variables b and f.) o Deadlock scenario 2 and need for elimination. Again, consider the execution of the program (BAimOl) in the presence of faults (F 1 & F2). Starting from state so in the previous scenario the program can also reach a state, say so (c.f. 
Table 4.2), where d.g = d.l = l,d.j = d.k = 0, and f.j : 1; state so differs from S5 in the previous scenario in terms of the value of f.l. Unlike S5 in the previous scenario, since I has finalized its decision, we cannot resolve S6 by adding safe recovery. Since safe recovery from so cannot be added, the only choice for designing a fault-tolerant program is to ensure that state S6 is never reached in the fault-tolerant program. This can be achieved by removing transitions that reach S6. However, removal of such transitions can create more deadlock states that have to be eliminated. Thus, the 47 deadlock algorithm needs to be recursive in nature. Action/ State Fault b.g b.j b.k b.l d.g d j d.k d.l f j f.k f.l so - O 0 0 0 0 I J. I 0 0 0 S1 BA] j 0 0 O O 0 Q I I 0 O 0 82 BA 1 k 0 O O O O O Q J. 0 0 0 S3 3A2]- O O O O 0 0 O _L 1 0 0 S4 F 1 1 O 0 O O O O 1 1 O 0 S5 F2 1 0 0 0 _1_ O 0 J. 1 O 0 s6 BA 1 1 1 O O O 1 O 0 1 l O O Table 4.2: Deadlock scenario 2 (The underlined values indicates which variable is being changed by the program action/fault. For reasons of space the true and false values are replaced by 1 and 0 respectively for the variables b and f.) To maximize the success of the revision algorithm, our approach to handle deadlock states is as follows: Whenever possible, we add recovery transition(s) from the deadlock ’states to a legitimate state. However, if no recovery transition(s) can be added from the deadlock states, we try to eliminate (i.e. make it unreachable) the deadlock states by pre- venting the program from reaching the deadlock states. In other words, we try to eliminate deadlock states only if adding recovery from them fails. 4.3 Approach 1: Parallelizing Group Computation In this section, we present our approach for parallelizing the group computation to expedite the revision to add fault-tolerance. First, in Section 4.3.1, we identify the different design choices we considered and then present our algorithm. 
In Section 4.3.2, we describe our approach for parallelizing the group computation. Subsequently, in Section 4.3.3, we provide experimental results in the context of the Byzantine agreement example from Section 4.2.1 and the token ring [14]. Finally, in Section 4.3.4, we analyze the experimental results to evaluate the effectiveness of parallelization for group computation.

4.3.1 Design Choices

The structure of the group computation permits an efficient way to parallelize it. In particular, whenever some recovery transitions are added for dealing with a deadlock state, or some states are removed to ensure that a deadlock state is not reached, we can utilize multiple threads in a master-slave fashion to expedite the group computation. During the analysis of how to utilize multiple cores effectively, we made the following observations/design choices.

• Multiple BDD packages vs. a reentrant BDD package. We chose to utilize different instances of the BDD package for each thread. Thus, at the time of group computation, each thread obtains a copy of the BDD corresponding to the program transitions and other BDDs from the master thread. In part, this was motivated by the fact that existing parallel BDD implementations have shown limited speedup. Also, we argue that the increased space complexity of this approach is acceptable in the context of revision, since the time complexity of the revision algorithm is high (compared with model checking) and we always run out of time before we run out of space.

• Synchronization overhead. Although simple to parallelize, the group computation itself is fine-grained, i.e., the time to compute a group of the recovery transitions that are to be added to the program is small (100-500 ms). Hence, the overhead of using multiple threads needs to be small. With this motivation, our algorithm creates the required set of threads up front and utilizes mutexes to synchronize between them.
This provided a significant benefit over creating and destroying threads for each group operation.

• Load balancing. Load balancing among several threads is desirable so that all threads take approximately the same amount of time to perform their task. To perform a group computation for the recovery transitions being added, we need to evaluate the effect of the read/write restrictions imposed by each process. A static way to parallelize this is to let each thread compute the set of transitions caused by the read/write restrictions of a (given) subset of processes. A dynamic way is to consider the set of processes for which a group computation is to be performed as a shared pool of tasks and allow each thread to pick one task after it finishes the previous one. We find that, given the small duration of each group computation, static partitioning of the group computation works better than dynamic partitioning, since the overhead of dynamic partitioning is high.

4.3.2 Parallel Group Algorithm Description

To better illustrate the parallel group algorithm, we first describe its sequential version. The sequential group algorithm (cf. Algorithm 1) takes a transition set, trans, as input and computes the transition group, transg, as output. Recall from Section 2.2 that the tasks involved in computing the group depend on the number of processes and the number of variables in the program. The sequential group algorithm (cf. Algorithm 1) needs to go through all the processes in the program and, for each process, through all the variables. The revision algorithm must compute the group associated with any set of transitions added to or removed from the program transitions. Based on this discussion and the design choices above, we now describe the parallel group algorithm.

Algorithm sketch. Given a transition set trans, the goal of this algorithm is to compute the group of transitions associated with trans.
The sequential algorithm performs many computations for each process, one after another. In the parallel algorithm, however, we split the group computation over the available number of threads. In particular, rather than having one thread find the group for all the processes, we let each thread compute the group for a subset of the processes. Since the tasks assigned to each thread require a very small amount of processor time, there is considerable overhead associated with thread creation/destruction every time the group is computed. Therefore, we let the master thread create the worker threads at the initialization stage of the revision algorithm. The worker threads stay idle until the master thread needs to compute the group for a set of transitions. The master thread activates/deactivates the worker threads through a set of mutexes. When all worker threads are done, the master thread collects the results of all worker threads into one group.

The parallel group algorithm consists of three parts: the initialization of the worker threads, the assignment of tasks to worker threads, and the computation of a group by the worker threads.

Initialization. In the initialization phase, the master thread creates all required threads by calling the algorithm InitiateThreads (cf. Algorithm 2). These threads stay idle until a group computation is required and terminate when the revision algorithm ends. Following the design choice for load balancing, the algorithm distributes the work among the available threads statically (Lines 3-4). Then it creates all the required worker threads (Line 7).

Algorithm 2 InitiateThreads
Input: noOfProcesses, noOfThreads.
1: for i := 0 to noOfThreads − 1 do
2:   BDDMgr[i] := Clone(masterBDDManager);
3:   startP[i] := ⌊(i × noOfProcesses) / noOfThreads⌋;
4:   endP[i] := ⌊((i + 1) × noOfProcesses) / noOfThreads⌋ − 1;
5: end for
6: for thID := 0 to noOfThreads − 1 do
7:   SpawnThread → GroupWorkerThread(thID);
8: end for
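Lines 3-4 of Algorithm 2 assign process ranges with the usual floor arithmetic. A small sketch of that static partition follows (the function name is ours):

```cpp
#include <utility>
#include <vector>

// Static partition of 'noOfProcesses' process indices over 'noOfThreads'
// threads, mirroring lines 3-4 of Algorithm 2: thread i handles the range
// [ floor(i*n/t), floor((i+1)*n/t) - 1 ]. Adjacent ranges tile 0..n-1 and
// their sizes differ by at most one, so the per-thread load is balanced.
std::vector<std::pair<int, int>> partition(int noOfProcesses, int noOfThreads) {
    std::vector<std::pair<int, int>> ranges;
    for (int i = 0; i < noOfThreads; ++i) {
        int startP = (i * noOfProcesses) / noOfThreads;
        int endP = ((i + 1) * noOfProcesses) / noOfThreads - 1;
        ranges.push_back({startP, endP});
    }
    return ranges;
}
```

For example, 10 processes over 4 threads yield the ranges (0,1), (2,4), (5,6), (7,9): slice sizes 2, 3, 2, 3, which is as even as an integer split can be.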
Tasks for the worker thread. Initially, the algorithm WorkerThread (cf. Algorithm 3) locks the mutexes Start and Stop (Lines 1-2). It then waits until the master thread unlocks the Start mutex (Line 5). At this point, the worker starts computing the part of the group associated with this thread. This section of WorkerThread (Lines 7-15) is similar to the Group() function in the sequential revision algorithm, except that rather than finding the group for all the processes, the WorkerThread algorithm finds the group for a subset of the processes (Line 8). When the computation is completed, the worker thread notifies the master thread by unlocking the mutex Stop (Line 17).

Algorithm 3 WorkerThread
Input: thID.
// Initial lock of the mutexes
1: mutex_lock(thData[thID].mutexStart);
2: mutex_lock(thData[thID].mutexStop);
3: while true do
4:   // Wait for the signal from the master thread
5:   mutex_lock(thData[thID].mutexStart);
6:   gtr[thID] := false;
7:   BDD* tPred := BDD[endP[thID] − startP[thID] + 1];
8:   for i := 0 to endP[thID] − startP[thID] do
9:     tPred[i] := thData[thID].trans ∧ allowedWrite[i + startP[thID]].Transfer(BDDMgr[thID]);
10:    tPred[i] := FindGroup(tPred[i], i, thID);
11:  end for
12:  thData[thID].result := false;
13:  for i := 0 to endP[thID] − startP[thID] do
14:    thData[thID].result := thData[thID].result ∨ tPred[i];
15:  end for
16:  // Notify the master thread that this thread is done
17:  mutex_unlock(thData[thID].mutexStop);
18: end while

Tasks for the master thread. Given a transition set trans, the master thread copies trans to each instance of the BDD package used by the worker threads (cf. Algorithm 4, Lines 3-5). Then it assigns a subset of the group computation to the worker threads (Lines 6-8) and unlocks them. After the worker threads complete, the master thread collects the results and returns the group associated with the input trans.

Algorithm 4 MasterThread
Input: transition set thisTr.
Output: transition group gAll.
1: trans := thisTr;
2: gAll := false;
3: for i := 0 to noOfThreads - 1 do
4:   thData[i].trans := trans.Transfer(BDDMgr[i]);
5: end for
// Signal all idle threads to start computing the group
6: for i := 0 to noOfThreads - 1 do
7:   mutex_unlock(thData[i].mutexStart);
8: end for
// Waiting for all threads to finish computing the group
9: for i := 0 to noOfThreads - 1 do
10:  mutex_lock(thData[i].mutexStop);
11: end for
// Merging the results from all threads
12: for i := 0 to noOfThreads - 1 do
13:  gAll := gAll ∨ thData[i].result;
14: end for
15: return gAll;

4.3.3 Experimental Results

In this section, we describe the respective experimental results in the context of the Byzantine agreement (described in Section 4.2.1) and the token ring [14]. In both case studies, we find that parallelizing the group computation improves the execution time substantially. Throughout this section, all experiments are run on a Sun Fire V40z with 4 dual-core Opteron processors and 16 GB RAM. The OBDD representation of the Boolean formulae has been done using the C++ interface to the CUDD package developed at the University of Colorado [125]. Throughout this section, we refer to the original implementation of the revision algorithm (without parallelism) as the sequential implementation. We use X threads to refer to the parallel algorithm that utilizes X threads.

We would like to note that the difference in revision time between the sequential implementation in this experiment and the one in [30] is due to other unrelated improvements on the sequential implementation itself. The sequential and parallel implementations differ only in terms of the modification described in Section 4.3.2. We note that our algorithm is deterministic and the testbed is dedicated; hence, the only non-deterministic factor in the time for revision is the synchronization among threads.
Based on our experience with the revision, this factor has a negligible impact and, hence, multiple runs on the same data essentially reproduce the same results.

In Figures 4.1 and 4.2, we show the results of using the sequential approach versus the parallel approach (with multiple threads) to perform the revision. All the tests show that we gain a significant speedup. For example, in the case of 45 non-general processes and 8 threads, we gain a speedup of 6.1. We can clearly see that the parallel 16-thread version is faster than the corresponding 8-thread version. This is surprising, given that there are only 8 cores available. However, upon closer observation, we find that the group computation that is parallelized using threads is fine-grained. Thus, when the master thread uses multiple slave threads for performing the group computation, the slave threads complete quickly and therefore cannot utilize the available resources to the full extent. Hence, creating more threads (than available processors) can improve the performance further.

Figure 4.1: The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of non-general processes of BA in sequential and parallel algorithms.

Figure 4.2: The time required for the revision to add fault-tolerance for several numbers of non-general processes of BA in sequential and parallel algorithms.

In Figures 4.3 and 4.4, we present the results of our experiments in parallelizing the deadlock resolution of the token ring problem. After the number of processes exceeds a threshold, the execution time increases substantially. This phenomenon also occurs in the case of the parallelized implementation, although it appears for larger programs. However, this effect is not as strong. Note that the spike in speedup at 80 processes is caused by page-fault behavior: the performance of the sequential algorithm is affected, while the performance of the parallel algorithm is still not affected.

Figure 4.3: The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of token ring processes in sequential and parallel algorithms.

Figure 4.4: The time required for the revision to add fault-tolerance for several numbers of token ring processes in sequential and parallel algorithms.

4.3.4 Group Time Analysis

To understand the speedup gain provided by our algorithm in Section 4.3.2, we evaluated the experimental results closely. As an example, consider the case of 32 BA processes. For the sequential implementation, the total revision time is 59.7 minutes, of which 55 are used for group computation. Hence, the ideal completion time with 4 cores is 18.45 minutes (55/4 + 4.7). By comparison, the actual time taken in our experiment was 19.1 minutes.
Thus, the speedup using this approach is close to the ideal speedup. In this section, we focus on the effectiveness of the parallelization of the group computation by considering the time taken for it in the sequential and parallel implementations. Towards this end, we analyze the group computation time for the sequential and parallel implementations in the context of three examples: Byzantine agreement, agreement in the presence of failstop and Byzantine faults, and token ring [14]. The results for these examples are included in Tables 4.3-4.5.

In some cases, the speedup ratio is less than the number of threads. This is caused by the fact that each group computation takes a very small amount of time and incurs an overhead for thread synchronization. Moreover, as mentioned in Section 4.2.3, due to the overhead of load balancing, we allocate the tasks of each thread statically. Thus, the load of different threads can be slightly uneven. We also observe that the speedup ratio increases with the number of processes in the program being revised. This implies that the parallel algorithm will scale to larger problem instances.

An interesting as well as surprising observation is that when the state space is large enough, the speedup ratio is more than the number of threads. This behavior is caused by the fact that with parallelization each thread is working on smaller BDDs during the group computation. To understand this behavior, we conducted experiments where we created the threads to perform the group computation and forced them to execute sequentially by adding extra synchronization. We found that such a pseudo-sequential run took less time than that used by a purely sequential run.

             Sequential   2-threads         4-threads         8-threads
PR   RS      GT(s)        GT(s)    SR       GT(s)    SR       GT(s)    SR
15   10^11      50           29    1.72        17    2.94        11    4.55
24   10^17     652          346    1.88       185    3.52       122    5.34
32   10^22    3347         1532    2.18       848    3.95       490    6.83
48   10^33   33454        14421    2.32      7271    4.60      3837    8.72

Table 4.3: Group computation time for Byzantine Agreement.
PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio.

             Sequential   2-threads         4-threads         8-threads
PR   RS      GT(s)        GT(s)    SR       GT(s)    SR       GT(s)    SR
10   10^10      53           24    2.21        23    2.30        30    1.77
15   10^15     624          319    1.96       175    3.57       174    3.59
20   10^20    4473         2644    1.69      1275    3.51      1128    3.97
25   10^25   26154        11739    2.23      6527    4.01      5692    4.59

Table 4.4: Group computation time for the Agreement problem in the presence of failstop and Byzantine faults. PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio.

             Sequential   2-threads         4-threads         8-threads
PR   RS      GT(s)        GT(s)    SR       GT(s)    SR       GT(s)    SR
30   10^14    0.32         0.15    2.12      0.10    3.34      0.12    2.75
40   10^19    0.84         0.36    2.34      0.22    3.84      0.23    3.59
50   10^23    1.82         0.68    2.68      0.39    4.66      0.42    4.37
60   10^28    3.22         1.22    2.63      0.67    4.80      0.64    5.01
70   10^33    5.36         1.91    2.80      1.06    5.05      0.86    6.23
80   10^38    7.77         2.94    2.64      1.53    5.09      1.23    6.30

Table 4.5: Group computation time for token ring. PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio.

4.4 Approach 2: Alternative (Conventional) Approach

A traditional approach for parallelization in the context of resolving deadlock states, say ds, would be to partition the deadlock states among multiple threads and allow each thread to handle the partition assigned to it. Next, in Section 4.4.1, we discuss some of the design choices we considered for this approach. We give a brief description of our algorithm in Section 4.4.2. Subsequently, we describe experimental results in Section 4.4.3 and analyze them to argue that for such an approach to work in revising distributed programs, the group computation must itself be parallelized.

4.4.1 Design Choices

To maximize the benefits from parallelism, we consider two factors when partitioning the deadlock states among the available threads. First, the deadlock states should be distributed evenly among the threads.
Second, the partitions should minimize the overlap between worker threads. More specifically, states considered by one thread should not be considered by another thread. Therefore, we partition the deadlock states based on the values of the program variables. We use the size of the BDDs and the number of minterms to split the deadlock states as evenly as possible. Regarding the second factor, we chose to add limited synchronization among worker threads to reduce the overlap in the states explored by different threads.

For example, we can partition ds using the partition predicates prt_i, 1 ≤ i ≤ n, such that ∨_{i=1..n} (prt_i ∧ ds) = ds, where n is the number of threads. Thus, if two threads are available during the revision of the Byzantine agreement program, then we can let prt1 = (d.j = 0) and prt2 = (d.j ≠ 0). After partitioning, each thread would work independently as long as it does not affect the states visited by other threads. As discussed in Section 4.2.3, to resolve a deadlock state, each thread explores a part of the state space using backward reachability. Clearly, when the states visited by two threads overlap, we have two options: (1) perform synchronization
(We also used some heuristic-based synchronization where we maintained a set of visited states that each thread checked before performing backward state exploration. This provided a small performance benefit and is included in the results below.) 4.4.2 Algorithm Sketch In this section, we focus on the descriptions of the parallel aspect of our deadlock resolution algorithm. For more details on the sequential algorithm for deadlock resolution please refer to [101]. The goal of our algorithm (c.f. Algorithm 5) is to resolve the deadlock states by adding safe recovery. However, if for some deadlock states safe recovery is not possible, the al- gorithm eliminates such states (i .e. makes them unreachable). To efficiently utilize the available worker threads, the master thread partitions the set of deadlock states among available threads as described in Section 4.4.1 and provides each thread with its own parti- tion. Subsequently, the master thread activates the worker threads to add safe recovery (c.f. Algorithm 6). Once activated, in adding safe recovery mode, each worker thread works as follows. It constructs the recovery transitions that originate from the deadlock states and leads to the legitimate states of the program in a finite number of steps. Of course, the algorithm does not include any transition that reaches a state from where the safety of the program can be violated. Once all worker threads are done computing the recovery transi- 62 tions, the master thread merges the recovery transitions, returned by all threads, and adds them to the program transitions. Algorithm 5 ResolveDeadlockStates Input: program p, faults f, legitimate state predicate I , fault span T, pro- hibited transitions mt, and partition predicates prtl ..prtn, where n is the number of worker threads. Output: program p’ and the predicate fte of states failed to eliminate. 9." 
1: ds := T ∧ ¬g(p);
// Resolving deadlock states by adding safe recovery
2: for i := 1 to n do
3:   rt_i := SpawnThread -> AddRecovery(ds ∧ prt_i, I, mt);
4: end for
// Merging results from worker threads
5: p := p ∨ (∨_{i=1..n} rt_i);
6: vds, fte := false, false;
7: ds := T ∧ ¬g(p);
// Eliminating deadlock states from where safe recovery is not possible
8: for i := 1 to n do
9:   rp_i, vds_i, fte_i := SpawnThread -> Eliminate(ds ∧ prt_i, p, I, f, T, vds, fte);
10: end for
// Merging results from worker threads
11: p' := Group(∧_{i=1..n} rp_i);
12: fte, vds := ∨_{i=1..n} fte_i, ∨_{i=1..n} vds_i;
// Handling inconsistencies
13: nds := ((T ∧ ¬I) ∧ ¬g(p')) ∧ ¬((T ∧ ¬I) ∧ ¬g(p));
14: p' := p' ∨ Group(p ∧ nds);
15: p' := p' ∨ Group(g(p) ∧ (fte)');
16: return p', fte;

At this point, the master thread computes the remaining deadlock states. This set identifies the deadlock states from which safe recovery is not possible. As mentioned earlier in Section 4.2.3, those states have to be eliminated (i.e., made unreachable by program transitions). Once again the master thread partitions the deadlock states and provides each worker thread with one such partition. Subsequently, it activates the worker threads. Once activated, in the eliminating mode (cf. Algorithm 7), the worker threads remove all program transitions that terminate at the deadlock states, thereby making them unreachable. However, if the removal of some of those transitions introduces new deadlock states, then the algorithm puts back such transitions and recursively eliminates the newly introduced deadlock states.

Thread 6 AddSafeRecovery
Input: deadlock states ds, legitimate state predicate I, and transition predicate mt.
Output: recovery transition predicate rec.
1: lyr, rec := I, false;
2: repeat
3:   rt := Group(ds ∧ (lyr)');
4:   rt := rt ∧ ¬Group(rt ∧ mt);
5:   rec := rec ∨ rt;
6:   lyr := g(ds ∧ rt);
7: until (lyr = false);
8: return rec;

When threads explore states concurrently, some inconsistencies may be created.
Next, we give a brief overview of the inconsistencies that may occur due to concurrent state exploration by different threads and identify how we can resolve them. Towards this end, let s1 and s2 be two states that are considered for deadlock elimination, and let (s0, s1) and (s0, s2) be two program transitions for some s0. To eliminate s1 and s2, a sequential elimination algorithm removes the transitions (s0, s1) and (s0, s2), which makes s0 a new deadlock state (cf. Figure 4.5.a). This in turn requires that state s0 itself be made unreachable. If s0 is unreachable, then including the transitions (s0, s1) and (s0, s2) in the revised program is harmless. In fact, it is desirable, since including these transitions also causes other transitions in the corresponding group to be included as well. And, these grouped transitions might be useful in providing recovery from other states. Hence, the sequential algorithm puts back (s0, s1) and (s0, s2) (and the corresponding group) and starts eliminating the state s0. However, the concurrent execution of worker threads may create some inconsistencies. We describe some of these inconsistencies and our approach to resolve them next.

Thread 7 Eliminate
Input: deadlock states ds, program p, legitimate state predicate I, fault transitions f, fault span T, visited deadlock states vds, predicate fte of states failed to eliminate.
Output: revised program transition predicate p, visited deadlock states vds, predicate fte of states failed to eliminate.
1: wait(mutex);
2: ds := ds ∧ ¬vds;
3: vds := vds ∨ ds;
4: signal(mutex);
5: if (ds = false) then
6:   return p;
7: end if
8: old := p;
9: tmp := (T ∧ ¬I) ∧ p ∧ (ds)';
10: p := p ∧ ¬Group(tmp);
11: fs := g(T ∧ ¬I ∧ f ∧ (ds)');
12: p, vds, fte := Eliminate(fs, p, I, f, T, vds, fte);
13: nds := g(T ∧ ¬I ∧ Group(tmp) ∧ ¬g(p));
14: p := p ∨ (Group(tmp) ∧ nds);
15: nds := nds ∧ g(tmp);
    // (X)'' = {(s1, true) | (s0, s1) ∈ X}
16: fte := fte ∨ ¬(old ∧ ¬p ∧ T ∧ (ds)')'';
17: p, vds, fte := Eliminate(nds ∧ ¬I, p, I, f, T, vds, fte);
18: return p, vds, fte;

Case 1. States s1 and s2 are in different partitions. Therefore, th1 eliminates s1, which in turn removes the transition (s0, s1), and th2 eliminates s2, which removes the transition (s0, s2) (cf. Figure 4.5.b). Since each thread works on its own copy, neither thread tries to eliminate s0, as they do not identify s0 as a deadlock state. Subsequently, when the master thread merges the results returned by th1 and th2, s0 becomes a new deadlock state that has to be eliminated, while the group predicates of the transitions (s0, s1) and (s0, s2) have been removed unnecessarily. In order to resolve this case, we restore all outgoing transitions that start from s0 and mark s0 as a state that has to be eliminated in subsequent iterations.

Case 2. To eliminate deadlock states, the elimination algorithm performs backward exploration starting from the deadlock states. Thus, two or more threads may consider the same state for elimination. For example, if th1 considers s1 for elimination and th2 considers both s1 and s2 (cf. Figure 4.5.b), then th1 removes (s0, s1) and th2 removes (s0, s1) and (s0, s2). Now, when the master thread joins the results of the two threads, the transition (s0, s1) is removed. However, as shown in Case 1, the removal of (s0, s1) is not really necessary. In fact, we would like to keep this transition in the program for the reasons mentioned above. To handle this inconsistency, we collect such transitions and add them back to the program transitions.

4.4.3 Experimental Results

We also implemented this approach for parallelization. The results for the problem of Byzantine agreement are shown in Table 4.6. From these results, we noticed that the improvement in the performance was small.
To analyze these results, we studied the effect of this approach in more detail. For the case where we utilize two threads, this approach partitions the deadlock states, say ds, into two parts, ds1 and ds2. Thread 1 begins with ds1 and performs backward exploration to determine how states in ds1 can be made unreachable. In each such backward exploration, if it chooses to remove some transition, then it has to perform a group computation to remove the corresponding group. Although this thread is working with a smaller set of deadlock states, the time required for group computation is only slightly less than in the sequential implementation, where only one thread was working with the entire set of deadlock states ds. Moreover, the time required for such group computation is very high (more than 80%) compared to the overall time required for eliminating the deadlock states. This implies that, especially for the case where we are revising a program with a large number of processes and where the available threads are relatively few, parallelization of the group computation is going to provide the maximum benefit.

Figure 4.5: Inconsistencies raised by concurrency. (Legend: a state; an eliminated state; a state to be considered for elimination.)
             Sequential             Parallel elimination with 2 threads
PR   RS      DRT(s)    TST(s)       DRT(s)    TST(s)
10   10^7         7         9            8         9
15   10^12       78        85           78        87
20   10^14      406       442          374       417
25   10^18    1,503     1,632        1,394     1,503
30   10^21    4,302     4,606        3,274     3,518
35   10^25   11,088    11,821       10,995    11,608
40   10^28   27,115    28,628       21,997    23,101
45   10^32   45,850    48,283       39,645    41,548

Table 4.6: The time required for the revision to add fault-tolerance for several numbers of non-general processes of BA, sequentially and by partitioning deadlock states using parallelism. PR: Number of processes. RS: Size of reachable state space. DRT(s): Deadlock resolution time in seconds. TST(s): Total revision time in seconds.

4.5 Using Symmetry to Expedite the Automated Revision

In this section, we present our approach for expediting the revision with the use of symmetry, using the input from Section 4.2.1. We utilize this approach in the task of resolving deadlock states that are encountered during the revision process. Therefore, using the example BA from Section 4.2.1, we describe how symmetry can help in resolving them. Then we discuss our algorithms for resolving deadlock states by utilizing symmetry to expedite the two aspects of deadlock resolution: adding recovery and eliminating deadlock states.

4.5.1 Symmetry

To describe the use of symmetry, consider the first scenario described in Section 4.2.3. In this scenario, we resolved the state s1 by adding a recovery transition. Due to the symmetry of the non-generals, one can observe that we can also add other recovery transitions. For example, if we consider the state d.g = d.j = d.l = 0, d.k = 1, and f.k = 0, we can add the recovery transition by which d.k changes to 0. With this observation, if we identify recovery action(s) to be added for one process, we can add the similar actions that correspond to other processes. Therefore, to add recovery, our algorithm does the following: whenever we find recovery transition(s), we identify other recovery transitions based on symmetry.
Then, we add all these recovery transitions to the program being revised (cf. Algorithm 8).

We also apply symmetry for deadlock-state elimination. To eliminate a set of deadlock states, we find a set of transitions which, if removed from one process, will prevent that process from reaching the deadlock states. Then, we use this set of transitions to remove similar transitions from other processes. Therefore, to eliminate deadlock states by removing program transitions, our algorithm does the following: whenever we find a set of transition(s) whose removal from one process prevents the program from reaching a deadlock state, we use symmetry to identify similar transitions for the other processes, and we remove these transitions from the program transitions (cf. Algorithm 9).

Algorithm 8 Add_Symmetrical_Recovery
Input: deadlock states ds, legitimate state predicate I, and the set mt of unacceptable transitions, including those in spec_b.
Output: recovery transitions predicate rec.
1: rec := ds ∧ (I)';
   // (I)' is the set of states to which recovery can be added to ensure recovery to legitimate states
2: rec := Group(rec);
   // Select program transitions of process i while ensuring read/write restrictions
3: rec := rec ∧ ¬Group(rec ∧ mt);
   // Remove transitions that violate safety while ensuring distribution restrictions
   // Find similar transitions for other processes
4: for i := 1 to numberOfProcesses do
5:   rec := rec ∨ SwapVariables(rec, i);
     // Generate BDDs for other processes by swapping variables based on symmetry
6: end for
7: return rec;

Algorithm 9 Group_Symmetry
Input: a set of transitions trans.
Output: a group of transitions grp.
1: grp := FindGroup(trans, read/write restrictions on i);
   // Find the group related to the transitions of process i while ensuring the read/write restrictions
   // Find similar transitions for other processes
2: for i := 1 to numberOfProcesses do
3:   grp := grp ∨ SwapVariables(grp, i);
4: end for
5: return grp;

4.5.2 Experimental Results

In Section 4.5.1, we described the use of symmetry to resolve deadlock states in the automated revision. Below, we describe and analyze the respective experimental results. In particular, we describe the results in the context of two classical examples in the literature of distributed computing, namely, the Byzantine agreement (described in Section 4.2.1) and the token ring [14]. In both case studies, we find that symmetry and parallelism improve the execution time substantially.

Symmetry

In this section, we present our experimental results in using symmetry for the resolution of deadlock states in the automated revision. Figure 4.6 shows the time spent in deadlock resolution, and Figure 4.7 shows the total revision time for different numbers of processes in the Byzantine agreement problem. From these figures, we observe that the use of symmetry provides a remarkable improvement in the performance. More importantly, one can notice that the speedup ratio (gained using the symmetrical approach) grows with the increase in the number of processes. In particular, as shown in Figure 4.7, the speedup ratio in the case of 10 non-general processes is 4.5. However, in the case of 45 non-general processes, the speedup ratio is 19. This behavior is both expected and highly valuable. Since symmetry uses the transitions of one process to identify the transitions of another process, it is expected that as the number of symmetric processes increases, so does the effectiveness of symmetry. Moreover, since the speedup is proportional to the number of (symmetric) processes, we argue that symmetry would be highly valuable in handling the state space explosion with an increased number of processes.
Figure 4.6: The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of BA non-general processes in sequential and symmetrical algorithms.

Figure 4.7: The time required for the revision to add fault-tolerance for several numbers of BA non-general processes in sequential and symmetrical algorithms.

To explain this remarkable improvement, we note that far more time is spent resolving deadlock states for each process independently than by resolving deadlock states for a single process and using symmetry to resolve deadlock states for the rest of the processes. Consequently, symmetry is expected to give better speedup ratios when the number of symmetrical processes is large.

In Figures 4.8 and 4.9, we present the results of our experiments on the token ring problem. We observe that symmetry substantially reduces the time for deadlock resolution. In fact, symmetry was able to keep this time almost constant, i.e., independent of the problem size. One can notice a spike in the required revision time of the sequential algorithm for the token ring after we hit the threshold of 90 processes. This behavior was also observed in [30] and is caused by the fact that, at this state space, we are utilizing all the available memory, causing performance to degrade due to page faults.
‘ uni—T.” 1 I l I "_"F'"'”" *1 10 20 30 40 50 60 70 80 90 100 150 rocesses +Sequential Symmetry U) 35’ ..E. 5.. P Figure 4.8: The tttttime required to resolve deadlock states in the revision to add fault- tolerance for several numbers of token ring processes in sequential and symmetrical algo- rithms. Symmetry and Parallelism In this section, we present our experimental results of using parallelism in computing the symmetry. The results of parallelizing the symmetry computation with various implemen- tations in the automated symbolic revision are presented in Figure 4.10. We have achieved the shortest revision time when we use parallelism to compute the symmetry. For example, in the case of the Byzantine agreement with 45 non-general processes using 16 threads, we achieve a speedup ratio of 1.8 times that of the symmetry alone. Since in case of the token ring, symmetry alone reduces the time of computing recovery transitions to a negligible amount, the results for this case are omitted. 74 700 ~ — —~ 600 22 -5- 500 7’ 400 — v - ~-ww~— 300 ~— ——~_ 200 100 +— ~n o -+——- .- r—-—- " 10 20 30 40 50 60 70 8O 90 100 150 rocesses ‘v'f a .E [— P +Sequential Symmetry Figure 4.9: The time required for the revision to add fault-tolerance for several numbers of token ring processes in sequential and symmetrical algorithms. 75 3000 T -—— ~—~——-—--- r————~— - ~~——-~— 2- _-__~_ .__241 +1 Thraed 2 Threads +4 Threads *8 Threads "*“ l6 Threads Figure 4.10: The time required for the revision to add fault-tolerance for several numbers of BA non-general processes using both symmetry and parallelism. 76 4.6 Summary In this chapter, we focused on the techniques that can efficiently complete the automated model revision in a reasonable amount of time. Specifically, we used techniques that ex- ploit symmetry and parallelism to expedite the automated model revision and to overcome its bottlenecks. 
For parallelism, our approach was based on parallelization with multiple threads on a multi-core architecture. We found that the performance improvement with the simple parallelization of the group computation is significantly more efficient than tradi- tional approaches that partition the deadlock states among available threads. With group computation parallelism we achieved significant benefit that is close to the ideal. In the case of symmetry, we used the fact that multiple processes in a distributed program are symmetric in nature. We used this characteristic to efficiently expedite the automated re- vision. Since, the cost of identifying the transition of a given model with the knowledge of symmetry among processes is less than the cost of identifying these transitions explic- itly, the use of symmetry reduces the overall time required for the revision. Moreover, the speedup increases as the number of symmetric processes increases. Lessons Learned. The results show that a traditional approach of partitioning dead- lock states provides a small improvement. However, it helped identify an alternative ap- proach for parallelization that is based on the distribution constraints imposed on the pro- gram being revised. While parallelization reduces the time spent in eliminating deadlock states, it may also lead to some inconsistencies that have to be resolved. The time for resolving such inconsistencies is one of the bottlenecks in parallelization, as this inconsis- tency is resolved sequentially. We note that the synchronization on visited states was also added, in part, to reduce inconsistencies among threads by requiring them to coordinate with each other. The performance improvement with the parallelizing of the group computation is sig- nificant. In fact, for most cases, the performance was close to the ideal speedup. 
What this suggests is that for the task of deadlock resolution, a simple approach based on parallelizing the group computation (as opposed to a reentrant BDD package or partitioning of the deadlock states, etc.) provides the biggest performance benefit. Moreover, the group computation itself occurs in every aspect of the revision where new transitions have to be added for recovery, or existing transitions have to be removed to prevent safety violations or to break cycles that prevent recovery to the set of legitimate states of the model/program. Therefore, the approach of parallelizing the group computation will be effective in the automated model revision of distributed programs.

Impact. Automated model revision has been widely believed to be significantly more complex than automated verification. When we evaluate the complexity of automated revision to add fault-tolerance, we find that it fundamentally includes two parts: (1) analyzing the existing program and (2) revising it to ensure that it meets the fault-tolerance properties. We showed that the complexity of the second part can be significantly remedied by the use of parallelization in a simple and scalable fashion. Moreover, if we evaluate the typical inexpensive technology that is currently in use or is likely to be available in the near future, it is expected to consist of 2-16 core computers. The first approach used in this chapter is expected to be the most suitable one for utilizing these multi-core computers to the fullest extent. Also, since the group computation is caused by the distribution constraints of the program being revised, it is guaranteed to be required even with other techniques for expediting automated revision. For example, parallelizing the group computation can be used in conjunction with the approach that utilizes symmetry among the processes being revised.
Hence, even if a large number of cores were available, this approach would be valuable together with other techniques that utilize those additional cores.

Memory Usage. Both of our approaches, symmetry and parallelism, require the use of more memory. For instance, the revision of the BA with 2 threads requires almost twice the amount of memory needed by the sequential algorithm for the same number of non-general processes. However, unlike model checking, in automated model revision we always run out of time before we run out of memory; hence, we argue that the extra memory usage is acceptable given the remarkable reductions we achieve in total revision time.

Chapter 5

Nonmasking and Stabilizing Fault-Tolerance

Achieving practical automated model revision requires us to derive theories and develop algorithms that broaden the domain of problems we can resolve by automated model revision. Towards this end, in this chapter, we focus on the constraint-based automated addition of nonmasking and stabilizing fault-tolerance to hierarchical programs. We specify the legitimate states of the program in terms of constraints that should be satisfied in those states. To deal with faults that may violate these constraints, we add recovery actions while ensuring interference freedom among the recovery actions added for satisfying different constraints. Since the constraint-based approach is well known to be applicable in the manual design of nonmasking fault-tolerance, we expect our approach to have a significant benefit in the automation of fault-tolerant programs. We illustrate our algorithms with three case studies: stabilizing mutual exclusion, stabilizing diffusing computation, and a data dissemination problem in sensor networks. With experimental results, we show that the complexity of revision is reasonable and that it can be reduced using the structure of the hierarchical systems.
To our knowledge, this is the first instance where automated revision has been successfully used in revising programs that are correct under fairness assumptions. Moreover, in two of the case studies considered in this chapter, the structure of the recovery paths is too complex to permit existing heuristic-based approaches for adding recovery.

To expedite the revision, we concentrate on reducing its time complexity using parallelism. We apply these techniques in the context of constraint satisfaction. We consider two approaches to speed up the revision algorithm: first, the use of the multiple constraints that have to be satisfied during revision; second, the use of the distributed nature of the programs being revised. We show that our approaches provide significant reductions in the revision time.

The rest of the chapter is organized as follows. In Section 5.2, we define the problem statement for the automated addition of nonmasking and stabilizing fault-tolerance. We describe the algorithms for the automated addition of nonmasking and stabilizing fault-tolerance in Section 5.3. We present our multi-core algorithms in Section 5.4 and experimental results in Section 5.5. In Section 5.6, we study the ordering in which the constraints should be satisfied. We show how we can use the hierarchical structure to reduce the complexity of our algorithm in Section 5.7. Finally, we summarize the chapter in Section 5.8.

5.1 Introduction

In this chapter, we focus on the automated addition of nonmasking and stabilizing fault-tolerance to fault-intolerant programs. Intuitively, a nonmasking fault-tolerant program ensures that if it is perturbed by faults to an illegitimate state, then it will eventually recover to its legitimate states. However, safety may be violated during recovery. Therefore, nonmasking fault-tolerance is useful for tolerating a temporary perturbation of the program state.
After recovery is completed, a nonmasking fault-tolerant program satisfies both safety and liveness in the subsequent computation. Nonmasking and stabilizing fault-tolerance is an ideal solution for adding fault-tolerance to programs that organize network nodes in a specified topology or a predefined logical structure [13].

There are several reasons that make the design of nonmasking fault-tolerance attractive. For one, the design of masking fault-tolerant programs, where both safety and liveness are preserved during recovery, is often expensive or impossible, even though the design of nonmasking fault-tolerance is easy [15]. Also, the design of nonmasking fault-tolerance can assist and simplify the design of masking fault-tolerance [105]. Moreover, in several applications nonmasking fault-tolerance is more desirable than solutions that provide fail-safe fault-tolerance (where in the presence of faults the program reaches "safe" states from where it does not satisfy liveness requirements). This is especially true for networking-related applications such as routing and tree maintenance.

A special case of nonmasking fault-tolerance is stabilization [54,56], where, starting from an arbitrary state, the program is guaranteed to reach a legitimate state. Stabilizing systems are especially useful in handling unexpected transient faults. Moreover, this property is often critical in long-lived applications where faults are difficult to predict. Furthermore, it is recognized that verifying stabilizing systems is especially hard [76]. Hence, techniques for automated revision are expected to be useful for designing stabilizing systems.

Techniques for adding nonmasking and stabilizing fault-tolerance to distributed programs can be classified into two categories.
The first category includes approaches based on distributed reset [13], where the program utilizes techniques such as distributed snapshot [38] and resets the system to a legitimate state if the current state is found to be illegitimate. Approaches from this category suffer from several drawbacks. In particular, they require the designer to know the set of all legitimate states. The cost of detecting the global state can be high. Additionally, this approach is heavy-handed, since it requires a reset of the entire system even if the fault may be localized.

The second category includes approaches based on constraint satisfaction, where we identify constraints that should be satisfied in the legitimate states. Typically, the constraints are local (e.g., involving one node, or a node and its neighbors); therefore, detecting their violation is easy. Since the constraints are local, the recovery actions to fix them are also local.

There are several issues that complicate the design of nonmasking and stabilizing fault-tolerance [10]. One such issue is the complexity of designing and analyzing the recovery actions needed to ensure that the program recovers to legitimate states. Another issue is that to verify the correctness of a nonmasking fault-tolerant program, one needs to consider all possible concurrent executions of the original program, recovery actions, and fault actions. Yet another issue is that most nonmasking algorithms assume that faults can keep happening (although they will eventually stop for a long enough time to permit recovery) even during recovery, thereby complicating the recovery to legitimate states.

Adding nonmasking and stabilizing fault-tolerance to an existing program is achieved by performing three steps. The first step is to identify the set of legitimate states of the fault-intolerant program. This set defines the constraints that should be true in the legitimate states.
The second step is to identify a set of convergence actions that recover the program from illegitimate states to legitimate states. This can be done by finding actions that satisfy one or more constraints. The last step consists of ensuring that the convergence actions do not interfere with each other. In other words, the collective effect of all recovery actions should eventually lead the program to legitimate states. In this chapter, we automate the last two steps by identifying the actions necessary to ensure that the constraints are satisfied and that the recovery actions do not interfere with each other. The automation of the first step is discussed in detail in Chapter 6.

However, this approach suffers from one important drawback: local actions taken to fix one constraint may violate other constraints. Consequently, these constraints need to be ordered. Furthermore, we need to ensure that satisfying one constraint does not violate constraints earlier in the order. Since verifying that recovery actions for satisfying one constraint do not affect other constraints is a demanding task, automated techniques that ensure correctness by construction are highly desirable. In the correct-by-construction approach, a program is automatically revised such that the output program preserves the original program specification and, in addition, satisfies new properties. However, algorithms for designing programs that are correct by construction suffer from high complexity; hence, techniques to expedite them need to be developed. Since the time complexity of the automation algorithms can be high, we also evaluate parallelization techniques to expedite the addition of nonmasking and stabilizing fault-tolerance.

In this chapter, we present an automated model revision algorithm for constraint-based synthesis of nonmasking and stabilizing fault-tolerant programs. We illustrate our algorithm with three case studies.
We note that the structure of the recovery actions in the first two case studies is too complex to permit previous approaches to achieve revision of the corresponding fault-tolerant programs [30]. We also show that the structure of the hierarchical system can be effectively used to generalize programs with a small number of processes while preserving the correct-by-construction property of the revised program. Also, we present a multi-core algorithm to synthesize distributed nonmasking and stabilizing fault-tolerant programs by partitioning the satisfaction of the constraints among the available threads. To further expedite the revision, we also present a multi-core algorithm that utilizes the distributed nature of the programs being revised by parallelizing the group computation.

To our knowledge, this is the first instance where programs that require fairness assumptions have been revised with automated techniques. Particularly, in our first case study, it is straightforward to observe that stabilizing fault-tolerance cannot be added without some fairness among all processes. Thus, previous algorithms (e.g., [30]) will declare failure in adding fault-tolerance.

5.2 Programs and Specifications

In this section, we define the problem statement for adding nonmasking and stabilizing fault-tolerance. Note that the problem statements defined in this section are instances of the original definition of fault-tolerance from Section 2.5. Those definitions are based on the ones given by Arora and Gouda [12]. Also, we use the definitions of distributed programs, fairness, legitimate states, faults, and fault-span from Chapter 2.

The goal of an algorithm that adds nonmasking fault-tolerance is to begin with a fault-intolerant program p, its legitimate state predicate I, and faults f, and to derive a nonmasking fault-tolerant program, say p', such that in the presence of faults, p' eventually converges to I.
Furthermore, computations of p' that begin in I must be the same as those of p. Based on this discussion, we define the problem of adding nonmasking fault-tolerance as follows:

Problem statement 4.1 Given p, I, and f, identify p' such that:
- Transitions within the legitimate states remain unchanged:
  s0 ∈ I ⇒ (∀s1 :: (s0,s1) ∈ p ⟺ (s0,s1) ∈ p')
- There exists a state predicate T (fault-span) such that:
  - I ⊆ T,
  - (s0,s1) ∈ (p' ∨ f) ∧ (s0 ∈ T) ⇒ s1 ∈ T,
  - s0 ∈ T ∧ (s0,s1,...) is a computation of p' ⇒ (∃j : j ≥ 0 : sj ∈ I).

Stabilizing fault-tolerance is a special instance of this problem statement with the requirement that T = Sp, i.e., the fault-span equals the set of all states. Based on this discussion, we define the problem of adding stabilizing fault-tolerance as follows:

Problem statement 4.2 Given p, I, and f, identify p' such that:
- Transitions within the legitimate states remain unchanged:
  s0 ∈ I ⇒ (∀s1 :: (s0,s1) ∈ p ⟺ (s0,s1) ∈ p')
- All program transitions eventually converge to the set of legitimate states:
  s0 ∈ Sp ∧ (s0,s1,...) is a computation of p' ⇒ (∃j : j ≥ 0 : sj ∈ I)

Note that since each constraint is preserved by the original program p, the closure property of the stabilizing program p' follows from the first constraint of the problem statement. Thus, it is not explicitly specified above.

5.3 Synthesis Algorithm for Nonmasking and Stabilizing Fault-Tolerance

Our approach for adding nonmasking and stabilizing fault-tolerance to fault-intolerant programs is based on [13]. The goal of nonmasking and stabilizing fault-tolerance is to ensure that after faults occur, the program eventually reaches one of the legitimate states in I. We focus on the instance of the problem where I = C1 ∧ C2 ∧ ... ∧ Cm, and Ci, 1 ≤ i ≤ m, is a constraint on the variables of the program. Faults perturb the program to a state in (¬I). Hence, in the presence of f, one or more of the constraints from C1, C2, ..., Cm are violated.
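The view of I as a conjunction of local constraints can be sketched in a few lines. The following fragment is purely illustrative: the two-process token state and the constraints C1 and C2 below are invented for this example, not drawn from the case studies.

```python
# Illustrative only: a made-up two-process state and two made-up constraints.
def make_invariant(constraints):
    # I = C1 ∧ C2 ∧ ... ∧ Cm: a state is legitimate iff every constraint holds
    return lambda state: all(c(state) for c in constraints)

C1 = lambda s: s["t0"] + s["t1"] == 1   # exactly one token in the system
C2 = lambda s: s["turn"] in (0, 1)      # turn variable stays in its domain
I = make_invariant([C1, C2])

good = {"t0": 1, "t1": 0, "turn": 0}
bad  = {"t0": 1, "t1": 1, "turn": 0}    # a fault duplicated the token
print(I(good))                           # True
print(I(bad))                            # False: recovery must re-establish C1
```

A fault that perturbs the state outside I thus violates at least one identifiable local constraint, which is what the synthesis algorithm exploits.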
The goal of our algorithm is to automatically synthesize the recovery actions such that when faults stop occurring, the constructed recovery actions, in conjunction with the original program actions, will eventually converge the program to a state where I holds.

5.3.1 Constraint Satisfier

Our algorithm for adding nonmasking and stabilizing fault-tolerance is shown in Algorithm 10. The input to the algorithm is the constraint array C, the fault-span T, and the program p. In this algorithm, the constraints from the constraint array are satisfied one after another. The algorithm starts by computing the legitimate state predicate as the intersection of all constraints in the constraint array (Line 3).

Then, the algorithm computes the recovery transitions to satisfy C[i]. Let Tr denote the transitions that begin in the fault-span, in a state where C[i] is false, and end in a state where C[i] is true. Unfortunately, we cannot add Tr as is, since Tr may not be implementable under the read/write constraints on processes due to the distributed nature of the program. The algorithm adds a subset of Tr, say Tr1, such that Tr1 can be implemented using the read/write restrictions of one or more processes. We denote this by the function Group_min (see Line 6)¹. This ensures that the only transitions added are those that start from a state where C[i] is false and reach a state where C[i] is true. These transitions are denoted by temp on Line 6.

Subsequently, the algorithm removes transitions from temp that violate the closure of the fault-span T. Thus, it computes a subset of transitions, say Tr_fspan, in temp that begin in a state in T and reach a state in ¬T. Again, we need to ensure that the removed transitions are consistent with the read/write restrictions of processes. The algorithm achieves this by applying the function Group_max to Tr_fspan; this computes a superset of Tr_fspan such that one or more processes can execute it. Subsequently, it removes this superset from temp (Line 7).

¹(X ∧ (Y)') refers to the transitions that start in a state in X and reach a state in Y.
This ensures that all transitions that violate the closure of T are removed. Therefore, it removes the groups of transitions that violate T (respectively, I) (Lines 7-8).

The algorithm needs to ensure that none of the transitions used to satisfy the constraint, say C[i], violates the pre-satisfied constraints C[0] to C[i−1]. Hence, it lets V include the transitions that originate from a state where C[i−1] is true and end in a state where C[i−1] is false, as well as similar transitions for the constraints C[0] to C[i−2] (Line 11). The transitions in V are used to ensure that recovery transitions do not violate other pre-satisfied constraints. The algorithm ensures that none of the transitions in temp interfere with earlier constraints. Therefore, it removes the transitions in V from temp if any are found (Line 9). At this point, the algorithm collects all recovery transitions in rec (Line 10). Steps 4-12 are repeated until all the recovery actions that satisfy all the constraints in the array C are found. Finally, the algorithm returns the recovery actions of the program p.

Algorithm 10 ConstraintSatisfier
Input: constraint array C, fault-span T, and program transitions p.
Output: recovery transitions rec.
1: temp, V := false, false;
2: m := SizeOf(C) − 1;  // m is the number of constraints
3: I := ∧_{i=0}^{m} C[i];  // Compute I (invariant) as the intersection of all constraints
4: for i := 0 to m do
5:   // temp are the transitions that start in a state in T − C[i] and reach C[i]
6:   temp := Group_min((T − C[i]) ∧ (C[i])');
     // ensure that no recovery transitions violate T
7:   temp := temp − Group_max(temp ∧ (T ∧ (¬T)'));
     // ensure that no recovery transitions violate I
8:   temp := temp − Group_max(temp ∧ (I ∧ (¬I)'));
9:   temp := temp − V;
     // Combine current recovery transitions with the new recovery transitions.
10:  rec := rec ∨ temp;
     // Compute V, the set of the transitions that violate the constraints
11:  V := V ∨ Group_max(C[i] ∧ (¬C[i])');
12: end for
     // return the recovery transitions
13: return rec;

Theorem 5.3.1:
- Given are a fault-intolerant program p, constraints C1, C2, ..., Cm, and faults f.
- Let I = C1 ∧ C2 ∧ ... ∧ Cm.
- Let T = the set of states reached in the execution of p ∨ f starting from any state in I.
- Let rec = ConstraintSatisfier(C, T, p).
If ∀s0 : s0 ∈ T − I : (∃s1 : s1 ∈ T : (s0,s1) ∈ rec), then p' (= p ∨ rec) solves the constraints in Problem statement 4.1.

Proof. To prove Theorem 5.3.1, we show that p' (= p ∨ rec) solves the constraints of Problem statement 4.1.
- By the construction of the transitions in rec, it is straightforward to see that rec does not introduce any new transitions in I. Therefore, the transitions within the legitimate states remain unchanged.
- By the construction of T, it is clear that I ⊆ T, since T includes all the states in I as well as the states reachable from I by (p ∨ f).
- From Line 7 in the algorithm ConstraintSatisfier, the transitions in rec do not include any transition that violates T.
- Since rec does not include any of the transitions from V (Lines 9 and 11), none of the transitions in rec violate pre-satisfied constraints. Therefore, there will be no cycles among the recovery transitions themselves. Hence, the constraint (s0 ∈ T ∧ (s0,s1,...) is a computation of p' ⇒ (∃j : j ≥ 0 : sj ∈ I)) is satisfied. ∎

Figure 5.1: Constraints ordering and transitions selections.

5.3.2 Algorithm Illustration

To illustrate the algorithm ConstraintSatisfier, consider the system described in Figure 5.1. In this system, we have three ordered constraints C1, C2, and C3, and I = C1 ∧ C2 ∧ C3. Since C1 is the first to be satisfied, we construct all possible recovery actions that start from any state in T − C1 and reach a state in C1 ∧ T. We proceed to satisfy C2 in the same manner.
However, after constructing the recovery actions that satisfy C2, we need to exclude actions that violate the constraint C1. In particular, we exclude actions like rec1 (c.f. Figure 5.1), since it starts from a state, s0, where C1 is true and ends in a state, s1, where C1 is false. On the other hand, we keep transitions like rec2 and rec3. We continue to construct the recovery actions that establish C3, provided that they preserve T, C1, and C2.

5.4 Expediting the Constraints Satisfaction

In Section 5.3, we described the sequential (i.e., single-thread) approach for synthesizing nonmasking and stabilizing fault-tolerant distributed programs from fault-intolerant versions. In this section, we explain our design choices and present our approaches for expediting the revision with multi-core computing (i.e., multiple threads).

5.4.1 Design Choices for Parallelism

Reviewing Algorithm 10, we can see that there are two main bottlenecks that lower the performance of this algorithm. The first is the main loop (Lines 4-12), where the number of iterations is determined by the number of constraints. The second is the Group operation in Lines 6, 7, 8, and 11. The group operation is based on the nature of distributed programs, where the addition of a transition for one process requires us to add additional transitions that are computed based on what the process cannot read/write.

Choices for constraint satisfaction. One way to partition the computation of recovery transitions is to split the recovery computation among multiple threads by allowing them to work on satisfying separate constraints. However, Algorithm 10 uses the computation of V, the transitions that violate preceding constraints (Line 11). Clearly, one possibility is to compute all possible values taken by V up front and utilize them appropriately for computing valid recovery transitions.
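To make the role of V concrete, here is a set-based sketch of the sequential loop of Algorithm 10. Explicit Python sets of states and transition pairs stand in for MDDs, the Group computation for read/write restrictions is omitted for brevity, and the tiny four-state example at the end is invented for illustration.

```python
# Sketch of Algorithm 10 with explicit sets; Group_min/Group_max are omitted.
def constraint_satisfier(C, T, I):
    """C: list of constraint state-sets; T: fault-span; I: invariant (all state-sets)."""
    rec, V = set(), set()
    for Ci in C:
        # candidate recovery: start in T outside Ci, end inside Ci (and inside T),
        # so closure of T is preserved by construction
        temp = {(s0, s1) for s0 in T - Ci for s1 in Ci & T}
        # drop transitions that would violate closure of the invariant I
        temp -= {(s0, s1) for (s0, s1) in temp if s0 in I and s1 not in I}
        # drop transitions that violate constraints satisfied earlier (the set V)
        temp -= V
        rec |= temp
        # record transitions leaving Ci: later constraints must avoid them
        V |= {(s0, s1) for s0 in Ci for s1 in T - Ci}
    return rec

# Hypothetical example: states 0..3, invariant {0}, constraints C1 = {0,1}, C2 = {0}
T = {0, 1, 2, 3}
rec = constraint_satisfier([{0, 1}, {0}], T, {0})
print(all(s1 in {0, 1} for (_, s1) in rec))   # True: every recovery step moves toward I
```

Since V depends only on the constraints (not on the recovery transitions computed so far), each V entry can indeed be computed before the main loop runs, which is what makes the per-constraint parallelization above possible.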
Computing the possible values taken by V also requires a loop of SizeOf(C) iterations, which can be parallelized using standard techniques from parallel computing.

After the computation of V, we can partition the iterations (Lines 4-12 in Algorithm 10) among several threads. We considered several approaches for this. One approach we considered was dynamic partitioning. In this approach, a pool of uncompleted iterations is maintained. Each thread picks an iteration from this pool, computes the recovery transitions for that iteration, then picks another iteration from the pool, and so on. We found that this dynamic partitioning approach, however, resulted in a high overhead, thereby reducing the speedup. Hence, we considered static partitioning, where each thread was given fixed iterations. Even here, we tried different options. One option was to partition the iterations in an alternating manner (e.g., thread 1 gets iterations 0, 2, 4, ... and thread 2 gets iterations 1, 3, 5, ...). It was expected that this would keep the sizes of the MDDs used in each thread evenly balanced. However, we found that this approach and the contiguous partitioning approach, where thread 1 got iterations 0, 1, ..., (SizeOf(C)/2) − 1 and thread 2 got iterations SizeOf(C)/2, ..., SizeOf(C) − 1, had almost identical performance in the case studies. We have used the latter in our experiments. However, we believe that the choice of partitioning could play a role in other case studies.

Choices for utilizing the distributed nature. When the recovery algorithm adds new transitions (or removes transitions that violate earlier constraints), we have to add the corresponding group of transitions based on the distributed nature of the program. Moreover, with the symbolic approach, we add (or remove) a set of transitions at a time. This set may include transitions that could be executed by several processes.
Therefore, for a given set of transitions that are added, we need to consider the read/write restrictions of each of these processes to determine the group for that set of transitions. We can utilize this feature to parallelize the group computation itself by having each thread compute the group corresponding to a subset of processes.

Again, as with the parallelization over constraints, we considered several approaches. It turned out that even for this approach, the overhead of dynamic partitioning was more than its benefit. Thus, we utilized static approaches. Since the several approaches considered for partitioning resulted in a similar speedup, we utilize the simple approach where each thread obtains a subset of processes and computes the corresponding group for those processes.

Finally, in group parallelization, the actual computation involved in each group operation is small. Hence, we found that the overhead of creating and terminating threads for each group computation was very high. For this reason, we created the threads up front and used mutexes to determine when they would be active.

Choices for parallelizing the MDD (Multi-valued Decision Diagram) library. Since we are using MDD-based symbolic revision [28], the constraints are characterized by Boolean formulae involving the variables in the program being revised. The MDD library [125] is not designed to be reentrant and assumes that at most one MDD package is active at any given time. Multiple threads cannot operate on the same MDD package simultaneously. Also, different threads cannot access different MDD packages simultaneously. We considered two approaches to solve this problem: (1) utilize a reentrant version of the MDD package, or (2) utilize multiple independent MDD packages. Since a reentrant MDD package is not available, we followed the second approach. We modified the MDD library so that multiple instances could be used simultaneously.
We also added a Transfer function to move an MDD object from one MDD package to another. Hence, during the parallel algorithms, a master thread spawns several worker threads, each running on a different core/processor in parallel with an instance of its own MDD package. The instance of the MDD package assigned to each worker thread is initialized using MDDs (e.g., the program transitions MDD) transferred from the MDD package of the master thread.

5.4.2 Partitioning the Constraints Satisfaction

Based on the design choices from Section 5.4.1, we present a multi-core algorithm that partitions the satisfaction of the constraints among the available cores/processors.

Algorithm sketch. Intuitively, our algorithm works as follows. During constraint satisfaction, a master thread spawns several worker threads, each running on a different core/processor. Each worker thread runs on its own MDD package concurrently with the other threads. The instance of the MDD package assigned to each worker thread is initialized using MDDs transferred from the MDD package of the master thread. Among these MDDs are the array of constraints to be satisfied, the program transitions, the array of constraint-violating transitions, and the legitimate state predicate. The master thread partitions the constraints and provides each worker thread with one such partition. Subsequently, the worker threads start resolving their assigned sets of constraints in parallel by adding the required recovery actions. Upon completion, the master thread merges the results returned by the worker threads.

Algorithm 11 ParallelConstraintsSatisfaction [Master Thread]
Input: constraint array C, program transitions p, fault-span T, and number of threads n.
Output: recovery transitions recAll.
1: recAll := false;
2: I := ∧_{i=0}^{m} C[i];
   // Notation: C[i] ∧ (¬C[i])' refers to transitions that start in C[i] and end in ¬C[i]
3: for i := 1 to n − 1 do
4:   SpawnThread → ComputeViolate(i);
5: end for
6: for i := 1 to SizeOf(C) − 1 do
7:   V[i] := V[i−1] ∨ V[i];
8: end for
9: for i := 0 to n − 1 do
10:  Cp[i] := Split(i, C);
11:  Vp[i] := Split(i, V);
12: end for
13: for i := 1 to n − 1 do
14:  rec[i] := SpawnThread → PConstraintSatisfier(Cp[i], p, T, Vp[i], I);
15: end for
16: ThreadJoin(0..n − 1);
17: recAll := ∨_{i=0}^{n−1} rec[i];  // Merge the results from all threads
18: return recAll;

Parallel Constraints Satisfaction. Our algorithm for satisfying the constraints in parallel is shown in Algorithm 11. This algorithm begins with the array of constraints to be satisfied C, the fault-intolerant program p, the fault-span T, and the number of worker threads to be spawned n. The goal of this algorithm is to discover the set of recovery transitions recAll such that all the constraints in C are satisfied in a way that enables the fault-tolerant program to recover to its legitimate states. Initially, the algorithm computes the legitimate state predicate I as the intersection of all constraints (Line 2). Next, the algorithm constructs the array V such that V[i] includes the transitions that start from a state where C[i] is true and end in a state where C[i] is false, as well as the similar transitions for the constraints C[j], where 0 ≤ j ≤ i−1 (Lines 3-8). An efficient way to do this computation is to let the master thread use the worker threads such that each worker thread computes its share of the V elements, where V[i] initially contains the transitions that start in C[i] and end in ¬C[i]. Once all threads are done, the master thread updates the array V such that V[i] = V[i−1] ∨ V[i]. In other words, V[i] then contains all transitions that violate any of the constraints C[0] to C[i].
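The master thread's bookkeeping, the prefix-merge of the per-constraint violation sets and a static split of work among threads, can be sketched as follows. Transition sets stand in for MDDs, and the names prefix_merge and split are our own; only the Split/V naming loosely follows the pseudocode.

```python
# Sketch of the master-thread bookkeeping in ParallelConstraintsSatisfaction.
def prefix_merge(V):
    """After this, V[i] holds all transitions violating any of C[0]..C[i]."""
    for i in range(1, len(V)):
        V[i] = V[i - 1] | V[i]
    return V

def split(n, items):
    """Static contiguous partition of items among n worker threads."""
    k = (len(items) + n - 1) // n          # ceiling division: chunk size
    return [items[i * k:(i + 1) * k] for i in range(n)]

V = prefix_merge([{("a", "b")}, {("c", "d")}, set()])
print(V[2] == {("a", "b"), ("c", "d")})    # True: V[2] covers C[0]..C[2]
print(split(2, [0, 1, 2, 3, 4]))           # [[0, 1, 2], [3, 4]]
```

The prefix-merge is inherently sequential in this simple form, but each initial V[i] can be computed independently beforehand, which is exactly the work farmed out to ComputeViolate in the pseudocode.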
After constructing the array V, the algorithm proceeds to evenly distribute the elements of the arrays C and V among the worker threads (Lines 9-12). Specifically, Cp[i] is the array of constraints assigned to thread i, and Vp[i] is the array of the corresponding constraint-violating transitions. Note that the availability of the array Vp enables each worker thread to work independently without interfering with the other threads. To compute the respective recovery transitions, each worker thread (Lines 13-15) calls the algorithm PConstraintSatisfier, which is similar to Algorithm 10 except that, in addition to Cp and p, it also takes Vp and I as input. Once all worker threads complete their jobs (Line 16), the master thread collects all the recovery transitions returned by the worker threads in recAll (Lines 17-18) and returns the overall recovery transitions.

5.5 Case Studies

In Section 5.3, we presented our approach for constraint-based automated addition of nonmasking and stabilizing fault-tolerance. In Section 5.4, we presented different approaches to exploit parallelism. In Subsections 5.5.1-5.5.3, we describe and analyze three case studies, namely stabilizing mutual exclusion [124], stabilization of a data dissemination problem in sensor networks [104], and stabilizing diffusing computation [13]. Of these, the first and the third case studies are classic problems from distributed computing and illustrate the feasibility of algorithms that add stabilizing fault-tolerance. In the second case study, we demonstrate the applicability of our approach to a real-world problem, particularly in the field of sensor networks. In all of these case studies, we find that our approach for constraint-based automated addition of nonmasking and stabilizing fault-tolerance was successful in synthesizing the fault-tolerant programs. Furthermore, we find that parallelism significantly reduces the total revision time.
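The spawn/join structure used by ParallelConstraintsSatisfaction can be sketched with ordinary threads; the per-slice work function below is a stand-in for PConstraintSatisfier (in the real algorithm each worker operates on its own MDD package), and its body is invented for illustration.

```python
# Sketch of the master/worker structure; the worker body is a placeholder.
import threading

def p_constraint_satisfier(constraint_slice, out, idx):
    # stand-in for PConstraintSatisfier: pretend each constraint in the slice
    # yields one recovery transition
    out[idx] = {("recover", c) for c in constraint_slice}

def parallel_satisfy(slices):
    results = [set() for _ in slices]
    threads = [threading.Thread(target=p_constraint_satisfier, args=(s, results, i))
               for i, s in enumerate(slices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                         # ThreadJoin(0..n-1)
    return set().union(*results)         # recAll := disjunction of rec[i]

rec_all = parallel_satisfy([["C0", "C1"], ["C2"]])
print(sorted(rec_all))   # [('recover', 'C0'), ('recover', 'C1'), ('recover', 'C2')]
```

Because each worker writes only to its own slot of the results list and reads only its own slice, no locking is needed between workers, mirroring the independence that the precomputed Vp array provides in the real algorithm.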
Throughout this section, all experiments are run on a Sun X4275 with 4 quad-core Intel Xeon E5520 processors (2.27 GHz, with 8 MB cache each) and 24 GB RAM. The MDD representation of the Boolean formulae has been done using a modified version of the MDD/BDD Glu 2.1 package [125] developed at the University of Colorado.

5.5.1 Case Study 1: Stabilizing Mutual Exclusion Program

Mutual exclusion is one of the fundamental problems in distributed/concurrent programs. One of the classical solutions to this problem is the token-based solution due to Raymond [124]. In this solution, the processes form a directed rooted tree, a holder tree, in which there is a unique token held at the tree root. If a process wants to access the critical section, it must first acquire the token. Our goal in this case study is to add stabilization to the fault-intolerant program in [15]. When faults occur and perturb the holder tree, the new program will stabilize and reconstruct a correct holder tree within a finite number of steps under a weak fairness assumption.

Fault-Intolerant Program. In Raymond's algorithm, the processes are organized in a logical tree, denoted as the parent tree. The holder tree is superimposed on top of the parent tree such that the root of the holder tree is the process that has the token. For example, Figure 5.2.a represents the undirected parent tree and Figure 5.2.b shows the holder tree when c has the token. In the fault-intolerant program, each process j has a variable h.j. If h.j = j, then j has the token. Otherwise, h.j contains the process number of one of j's neighbors. The holder variable forms a directed path from any process in the tree to the process currently holding the token. In this program, a process can send the token to one of its neighbors. For example, Figure 5.2.c shows the case where process c sends the token to e.
In particular, if j and k are adjacent (in the parent tree), then the action by which k sends the token to j is as follows:

A1 :: (h.k = k) ∧ (j ∈ Adj.k) ∧ (h.j = k) → h.k, h.j := j, j;

Constraints. Recall from Section 5.2 that we define the legitimate states to be a set of constraints on the program state space. In this case study, this set is the conjunction of the constraints S1, S2, and S3, described next. Moreover, each of these constraints is specified for each process separately. Therefore, if n is the number of processes, then we have 3n constraints to satisfy. Constraint S1 requires that j's holder can either be j's parent, j itself, or one of j's children. S2 requires that the holder tree conforms to the parent tree. Finally, S3 requires that there are no cycles in the holder relation. Thus, predicates S1, S2, and S3 are as follows:

(S1) ∀j : (h.j = P.j) ∨ (h.j = j) ∨ (∃k : (P.k = j) ∧ (h.j = k))
(S2) ∀j : (P.j ≠ j) ⇒ (h.j = P.j) ∨ (h.(P.j) = j)
(S3) ∀j : (P.j ≠ j) ⇒ ¬((h.j = P.j) ∧ (h.(P.j) = j))

Figure 5.2: The holder tree: (a) the undirected parent tree; (b) the holder tree when c has the token; (c) c passes the token to e.

Faults. Since we focus on stabilizing fault-tolerance, we consider faults that perturb the holder relation of all processes to an arbitrary value. Thus the fault action is as follows:

(F1) true → {h.j := any arbitrary value from its domain};

Fault-Tolerant Program. To add stabilizing fault-tolerance to the above program, we used the revision algorithm as follows. The fault-intolerant program for each process is specified by action A1; the faults are specified by the fault action F1; and the constraints are from S1, S2, and S3. We specified these constraints in the following order: first, we specified constraints S1 for the root, then its children, then its grandchildren, and so on. Subsequently, we specified constraint S2 likewise. Finally, we specified constraint S3 in the reverse order.
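The constraints are easy to evaluate on an explicit state. The sketch below is a hand-written check, not the MDD encoding used by the tool: it assumes an illustrative three-process chain a–b–c with parent pointers P, holder pointers h, and the token-passing action A1 as the `send_token` helper.

```python
P = {"a": "a", "b": "a", "c": "b"}            # parent tree; root a is its own parent
children = {"a": {"b"}, "b": {"c"}, "c": set()}
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

def s1(h):
    # S1: j's holder is j's parent, j itself, or one of j's children
    return all(h[j] == P[j] or h[j] == j or h[j] in children[j] for j in h)

def s2(h):
    # S2: the holder tree conforms to the parent tree
    return all(j == P[j] or h[j] == P[j] or h[P[j]] == j for j in h)

def s3(h):
    # S3: no two adjacent processes hold each other (no cycles in the holder relation)
    return all(j == P[j] or not (h[j] == P[j] and h[P[j]] == j) for j in h)

def legitimate(h):
    return s1(h) and s2(h) and s3(h)

def send_token(h, k, j):
    # Action A1: k holds the token and passes it to its neighbour j
    assert h[k] == k and j in adj[k] and h[j] == k
    h[k], h[j] = j, j
```

With the token at c (h: a→b, b→c, c→c) the state is legitimate, and passing the token to b preserves S1–S3, whereas a fault that makes a and b point at each other violates S3.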
The recovery actions computed by the revision algorithm are as follows:

R1 :: ¬((h.j = P.j) ∨ (h.j = j) ∨ (∃k : (P.k = j) ∧ (h.j = k)))
        → h.j := j | h.j := P.j | h.j := {child of j};
R2 :: ¬((P.j ≠ j) ⇒ (h.j = P.j) ∨ (h.(P.j) = j))
        → h.j := P.j | h.(P.j) := j;
R3 :: ¬((P.j ≠ j) ⇒ ¬((h.j = P.j) ∧ (h.(P.j) = j)))
        → h.j := j | h.(P.j) := P.j | h.(P.j) := {child of P.j};

Analysis of experimental results. Table 5.1 shows the results of synthesizing the Stabilizing Mutual Exclusion program with various numbers of processes organized in a linear topology. It shows the time needed, in seconds, to add recovery, validate the recovery transitions (against pre-satisfied constraints), and the total revision time in terms of the number of processes being revised. Table 5.2 shows the result of a similar case study where the processes are arranged in a binary tree topology.

    No. of      Constraint satisfaction (s)    Total
    Processes   Recovery      Validation       time (s)
    30          19            21               40
    40          78            74               153
    50          217           238              457
    60          505           509              1020
    70          1110          1103             2238

Table 5.1: Stabilizing Mutual Exclusion, linear topology.

Table 5.2 illustrates that, given the same state space, the complexity is higher in the tree topology than in the linear topology. This is due to the following reason: the constraints of a process compare its variables with those of its neighbors. To model this effectively, the process variables and the variables of its neighbors need to be close to each other in the MDD variable ordering. This can be achieved easily on a linear topology. However, for a tree topology, this is not possible for all the processes. Hence, computing recovery transitions for those cases is more expensive.

    No. of      Constraint satisfaction (s)    Total
    Processes   Recovery      Validation       time (s)
    7           < 1           < 1              < 1
    15          2             < 1              < 3
    17          3             < 1              < 4
    21          3             5                10
    31          30            19               49

Table 5.2: Stabilizing Mutual Exclusion, binary tree topology.

Table 5.3 shows the results of using parallelism during constraint satisfaction in synthesizing the Stabilizing Mutual Exclusion program.
The table illustrates the results for various numbers of processes organized in a linear topology using different numbers of processors/cores. It shows the time needed, in seconds, to satisfy the constraints, and the total revision time. It also shows the amount of memory in megabytes. As we can see from this table, using parallelism has substantially reduced the time needed for the revision. As a concrete example, observe that the time required to synthesize a stabilizing mutual exclusion program with 50 processes dropped from 457 seconds, using the sequential algorithm, to 374 seconds when two cores were used, and to 178 seconds when four cores were used. Table 5.4 shows the results of exploiting the distributed nature of the program being revised (i.e., group parallelism) in synthesizing the Stabilizing Mutual Exclusion program. It shows the time needed, in seconds, to compute the group, and the total revision time. It also shows the amount of memory in megabytes needed by our algorithm. We can clearly see the feasibility of adding stabilizing fault-tolerance using automated revision. Both time and space complexity are reasonable and proportional to the reachable state space. Furthermore, as specified in Section 5.7, the complexity for a larger number of processes can be reduced by utilizing the hierarchical structure.

Table 5.3: Stabilizing Mutual Exclusion using constraint partitioning (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).
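For the 50-process run quoted above, the speedup and parallel efficiency work out as follows (a simple back-of-the-envelope computation on the reported timings):

```python
seq = 457                      # total revision time (s), sequential algorithm
times = {2: 374, 4: 178}       # total revision time (s) with 2 and 4 cores

speedup = {cores: seq / t for cores, t in times.items()}
efficiency = {cores: s / cores for cores, s in speedup.items()}
# Four cores give roughly a 2.6x speedup, i.e., about 64% parallel efficiency.
```

The two-core run is far from the ideal 2x, which is consistent with the master thread's sequential prefix-merge of V being on the critical path.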
Table 5.4: Stabilizing Mutual Exclusion using group parallelism (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).

5.5.2 Case Study 2: Data Dissemination in Sensor Networks

In this problem, a base station initiates a computation in which a block of data is to be sent to all sensors in the network. The data message is split into fixed-size packets. Each packet is given a sequence number. The base station starts transmitting the packets to its neighbor(s) in specified time slots, in the order of the packet sequence numbers. Subsequently, when the neighbor(s) receive a message, they, in turn, retransmit it to their neighbors, and so on. The computation ends when all sensors in the network receive all the messages. Our goal in this case study is to synthesize a nonmasking fault-tolerant version of the data dissemination program that can tolerate a finite number of lost packets. The revised program is the same as Infuse [104], which was designed manually.

Fault-Intolerant Program. In this case study, we arrange the processes in a linear topology. The base station has N packets to send to M processes. (We note that a similar revision is possible for any other fixed topology.) The fault-intolerant program transmits the packets in a simple pipeline. For this, each process keeps track of the messages (received/sent) using two variables u.j and l.j, where u.j is the highest message sequence number received by process j, and l.j is the sequence number of the message currently being transmitted by process j. Process j increments u.j every time it receives a new message.
It also sets l.j to be the sequence number of the message it is transmitting. The base station transmits a packet if its neighbor has received the previous packet (action IN1). A process j, j > 0, receives a packet from its predecessor if its successor had received the previous packet (actions IN2 and IN3). Thus, the actions of the fault-intolerant program are as follows:

Action for the base station:
(IN1) (l.0 = u.1) → l.0 := l.0 + 1;

Action for process j ∈ {1..M−1}:
(IN2) (u.j ≤ u.(j+1)) ∧ (u.j ≤ u.(j−1)) ∧ (l.(j−1) = u.j + 1)
        → u.j, l.j := u.j + 1, l.j + 1;

Action for process M (the last process):
(IN3) (u.M ≤ u.(M−1)) ∧ (l.(M−1) = u.M + 1) → u.M, l.M := u.M + 1, l.M + 1;

Faults. In this section, we consider faults that lose a message. To model such faults for the base station, we add action (F1), where the base station increments l.0 even though its successor has not received the previous packet. To model such an action for the other processes, we add action (F2), where a process advances l.j even though its successor has not yet received the previous packet.

(F1) true → l.0 := l.0 + 1;
(F2) (u.j ≤ u.(j−1)) ∧ (l.(j−1) = u.j + 1) → u.j, l.j := u.j + 1, l.j + 1;

Constraints. The constraints that define the legitimate states in the case of the data dissemination program are as follows. The first constraint states that initially the base station has all the packets (S1). A process cannot receive a packet if its predecessor has not received it (S2), and cannot transmit a packet that it does not have (S3). A process transmits a packet that is expected by its successor (S4 and S5).

(S1) u.0 = N
(S2) ∀j : u.j ≤ u.(j−1)
(S3) ∀j : l.j ≤ u.j
(S4) ∀j : l.j ≤ u.(j+1) + 1
(S5) ∀j : l.j ≥ u.(j+1)

Fault-Tolerant Program. The recovery actions computed by the revision algorithm are as follows:

(R1) (l.j > u.(j+1) + 1) ∧ (u.j + 1 = l.(j−1))
        → u.j := l.(j−1), l.j := u.(j+1) + 1;
(R2) (u.j > u.(j+1) + 1) ∧ (l.j > u.(j+1) + 1) → l.j := u.(j+1) + 1;

Table 5.5 shows the results of synthesizing the data dissemination protocol with various numbers of processes.
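A fault-free run of the pipeline can be simulated explicitly. This is an illustrative sketch following the reconstruction of IN1–IN3 above; the bound l[0] < N on IN1 is an added assumption so the run terminates once every packet has been sent.

```python
def step(u, l, N):
    M = len(u) - 1
    # IN1: the base station sends the next packet once process 1 has caught up
    if l[0] == u[1] and l[0] < N:
        l[0] += 1
        return True
    # IN2: process j receives from j-1 once its successor has the previous packet
    for j in range(1, M):
        if u[j] <= u[j + 1] and u[j] <= u[j - 1] and l[j - 1] == u[j] + 1:
            u[j], l[j] = u[j] + 1, l[j] + 1
            return True
    # IN3: the last process receives whenever its predecessor transmits
    if u[M] <= u[M - 1] and l[M - 1] == u[M] + 1:
        u[M], l[M] = u[M] + 1, l[M] + 1
        return True
    return False

def disseminate(N, M):
    u = [N] + [0] * M          # u[j]: highest sequence number received by j
    l = [0] * (M + 1)          # l[j]: sequence number j is transmitting
    while step(u, l, N):       # run until no action is enabled
        pass
    return u
```

In the absence of faults the run ends with every process holding all N packets, which is exactly the termination condition of the dissemination computation.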
One can notice that most of the total revision time was spent on adding recovery, while a smaller amount of time was spent in validating the recovery transitions. The main reason for this behavior is that the structure of the fault-span in this case study is simpler: if a message is lost on one link, then until it is recovered, that message cannot be sent again (it is possibly lost on subsequent links).

    No. of      Reachable   Memory   Constraint satisfaction (s)    Total
    Processes   states      (MB)     Recovery      Validation       time (s)
    50          10^25       11       4             2                6
    100         10^59       12       32            14               48
    150         10^70       15       153           47               207
    200         10^93       16       452           162              633

Table 5.5: Nonmasking data dissemination program, linear topology.

Table 5.6 shows the results of synthesizing the data dissemination protocol with various numbers of processes by partitioning the constraints among the available threads. Note that, in the case of the data dissemination problem, there were only 5 constraints to satisfy. Hence, when the revision is launched with 8 threads, we are only utilizing 5 of them. As can be seen from Table 5.6, if the number of constraints is not large enough, then the speedup gained from partitioning the constraints is limited. Table 5.7 shows the results of synthesizing the data dissemination protocol with various numbers of processes by exploiting the distributed nature of this program.

5.5.3 Case Study 3: Stabilizing Diffusing Computation

In distributed systems, a diffusing computation is used to inquire about (e.g., termination detection) or establish (e.g., distributed reset) a system global state. We consider a diffusing computation on a system where processes are arranged in a logical tree. The root initiates a diffusing computation and propagates it to its children, and the children forward it to their children, and so on until it reaches all processes. Once the computation reaches a leaf, it marks the leaf as completed and reflects back to the parent.
When all children of a process are marked completed, that process marks itself completed and reflects the computation to its parent. The diffusing computation ends when it marks the root as completed.

Table 5.6: Data dissemination program using constraint partitioning (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).

Table 5.7: Data dissemination program using group parallelism (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).

Fault-Intolerant Program. The fault-intolerant program in this case study is the diffusing computation program from [13]. Each process j has two Boolean variables c.j (color) and sn.j (session number) and an integer variable P.j (the parent of j). A new diffusing computation can start if the root is colored green (c.root = green) and the session number of the root is the same as its children's. To start a new diffusing computation, the root sets c.root = red and flips sn.root. When a green process finds that its parent is red, it copies its parent's color and session number. Moreover, if a process has no children or all its children switched colors from red to green, the process then switches its color to green. The program for the diffusing computation consists of three actions.
The first action starts the diffusing computation at the root (DC1). The second action propagates the diffusing computation to the children (DC2). The third action completes the diffusing computation when all the children complete the computation (DC3). The program actions are described below:

DC1 :: (c.root = green) → c.root := red, sn.root := ¬sn.root;
DC2 :: (c.j = green) ∧ (c.(P.j) = red) ∧ (sn.j ≠ sn.(P.j)) → c.j, sn.j := c.(P.j), sn.(P.j);
DC3 :: (c.j = red) ∧ (∀k : P.k = j ⇒ (c.k = green ∧ sn.j = sn.k)) → c.j := green;

Constraints. The first disjunct of (S1) states that j's parent has participated in a diffusing computation while j has not participated yet. The second disjunct of (S1) states that j and its parent are participating in a computation, or they both have completed a computation.

(S1) ∀j : (c.j = green ∧ c.(P.j) = red) ∨ (c.j = c.(P.j) ∧ sn.j = sn.(P.j))

Faults. We now consider faults that change the values of c.j and sn.j to an arbitrary value. The fault actions are as follows:

(F1) true → c.j := red | green;
(F2) true → sn.j := true | false;

Fault-Tolerant Program. To construct the nonmasking fault-tolerant version of the fault-intolerant Diffusing Computation program, we used our algorithm with the program actions (DC1–DC3), the constraint (S1), and the fault actions (F1, F2) as input. The revised program has the actions (DC1–DC3) in addition to the following recovery actions:

(R1) (c.j = red) ∧ (sn.j ≠ sn.(P.j)) → c.j := green, sn.j := sn.(P.j);
(R2) (c.(P.j) = green) ∧ (c.j = red) → c.j := green;
(R3) (c.(P.j) = c.j) ∧ (sn.j ≠ sn.(P.j)) → sn.j := sn.(P.j);
(R4) (c.(P.j) = red) ∧ (c.j = red) ∧ (sn.j ≠ sn.(P.j)) → c.j := green;

    No. of      Constraint satisfaction (s)    Total
    Processes   Recovery      Validation       time (s)
    50          1             3                4
    100         12            19               32
    150         57            53               113
    200         151           124              282

Table 5.8: Stabilizing Diffusing Computation, linear topology.
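A complete round of the three actions can be traced on a small instance to confirm that S1 holds after every step. This is an illustrative hand simulation on an assumed three-process chain with parent pointers P and the guards as reconstructed above:

```python
G, R = "green", "red"
P = [0, 0, 1]                      # parents on a 3-process chain; root 0
children = {0: [1], 1: [2], 2: []}

def s1(c, sn):
    # Constraint S1, checked for every non-root process j
    return all((c[j] == G and c[P[j]] == R) or
               (c[j] == c[P[j]] and sn[j] == sn[P[j]])
               for j in range(len(c)) if P[j] != j)

def dc1(c, sn):                    # DC1: the root starts a new computation
    assert c[0] == G
    c[0], sn[0] = R, 1 - sn[0]

def dc2(c, sn, j):                 # DC2: j copies its red parent's colour/session
    assert c[j] == G and c[P[j]] == R and sn[j] != sn[P[j]]
    c[j], sn[j] = c[P[j]], sn[P[j]]

def dc3(c, sn, j):                 # DC3: j completes once all children completed
    assert c[j] == R and all(c[k] == G and sn[k] == sn[j] for k in children[j])
    c[j] = G
```

Starting from all-green with session 0, the sequence DC1, DC2(1), DC2(2), DC3(2), DC3(1), DC3(0) keeps S1 true throughout and returns the system to all-green with the session flipped.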
    No. of      Constraint satisfaction (s)    Total
    Processes   Recovery      Validation       time (s)
    15          < 1           < 1              < 1
    17          1             1                2
    21          1             3                4
    23          2             4                6

Table 5.9: Stabilizing Diffusing Computation, binary tree topology.

Table 5.8 shows the results of synthesizing a stabilizing diffusing computation program with various numbers of processes organized in a linear topology. Table 5.9 shows the result where the processes are arranged in a binary tree. Table 5.10 shows the results of synthesizing the diffusing computation program with various numbers of processes by exploiting the distributed nature of this program. Table 5.11 shows the results of synthesizing the diffusing computation program with various numbers of processes by partitioning the constraints among the available threads.

Memory Usage. Notice that the amount of memory needed during revision is proportional to the number of threads being used. It is approximately the amount of memory used by the sequential algorithm multiplied by the number of cores being used. Clearly, this is expected, since for every thread used we create a new MDD package. We argue that using extra memory to gain a speedup is acceptable, since in automated revision time complexity is a far more serious barrier than space complexity.

5.6 Choosing an Ordering Among Constraints

To apply Theorem 3.1, we need to identify an order among the constraints. In our case studies, we attempted several orderings and most were successful in synthesizing the nonmasking and stabilizing fault-tolerant program. Hence, choosing the "right" order does not appear to be very crucial. Also, [13] identifies several heuristics that can assist in identifying the right order among constraints. One possible approach is to consider different combinations as part of the revision algorithm. With such an approach, O(n^2) combinations suffice for most examples.
Table 5.10: Stabilizing Diffusing Computation using group parallelism (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).

Table 5.11: Stabilizing Diffusing Computation using constraint partitioning (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).

In particular, to identify an ordering, we can utilize an algorithm similar to insertion sort as follows: first consider only constraints C1 and C2 and attempt both orderings between them. If both orderings fail, then adding nonmasking fault-tolerance cannot be achieved using the constraint-based approach that uses constraints C1 and C2. If both succeed, then we can choose any order. Without loss of generality, let the order be C1 and C2. Then, we consider constraint C3 in conjunction with C1 and C2. There are three possible positions in which to insert C3 without affecting the order between C1 and C2. We can evaluate all three options and then consider C4, and so on. It follows that the number of such runs will be O(n^2). In all the case studies in this chapter, as well as for several other algorithms in the literature, the above approach would succeed in identifying the right order of constraints. It follows that one does not need to consider all possible (n!) orderings among the constraints.
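The insertion-sort-style search described above can be written down directly. Here `try_revision` is a hypothetical oracle: one call corresponds to running the constraint-satisfaction algorithm with the candidate ordering and reporting success or failure.

```python
def find_order(constraints, try_revision):
    """Identify a workable constraint ordering with O(n^2) oracle calls."""
    order = []
    for c in constraints:
        for pos in range(len(order) + 1):
            candidate = order[:pos] + [c] + order[pos:]
            if try_revision(candidate):
                order = candidate     # keep the first position that works
                break
        else:
            return None               # no position works: the approach fails
    return order
```

For example, with an oracle that only accepts orderings placing constraint "a" before constraint "b", the search finds a valid permutation after a handful of calls instead of trying all n! orderings.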
Another approach is to allow the revision algorithm to choose a random ordering for satisfying the constraints. If the revision algorithm fails to find a solution using a given constraint ordering, then it chooses a different random order. The revision algorithm keeps trying different random orderings for the constraints until it finds a solution or it exhausts all possible combinations. We implemented this approach. We found that, depending on the program being revised, the time required to complete the revision may vary significantly. More specifically, in the case of the Stabilizing Mutual Exclusion from Section 5.5.1, the order of the constraints is almost always irrelevant, and the revision algorithm found a solution using any order it tried. Table 5.12 shows the results of 10 experiments. In each experiment, the revision algorithm randomly chose an order for the constraints and tried to synthesize using that order. In all cases the revision completed successfully for any order and from the first try. The time needed to complete the revision was almost identical to that of the case where the constraints were manually ordered (cf. Table 5.1). However, this was not always the case. For example, Table 5.13 shows the results of synthesizing the Stabilizing Diffusing Computation from Section 5.5.3. In this case, the order in which the revision algorithm satisfies the constraints is significant. More specifically, the revision algorithm has to try different orderings (on average 3-4 times) before it successfully synthesizes the stabilizing fault-tolerant program. Moreover, the time required to complete the revision, in this case, was much higher than when the constraints were manually ordered (cf. Table 5.8).

Table 5.12: Stabilizing Mutual Exclusion with linear topology using random constraint satisfaction.
Table 5.13: Stabilizing Diffusing Computation with linear topology using random constraint satisfaction.

5.7 Reducing the Complexity with Hierarchical Structure

Based on the case studies, we can observe that as the number of nodes in the hierarchy increases, the time complexity can increase substantially. For example, in the first case study, when we increased the height of the binary tree from 3 to 4 (i.e., from 7 to 15 processes), the revision time increased from 5 to 72 seconds. This is expected, since the state space increases from 10^5 to 10^16 states. Thus, a natural question in this area is whether the structure of the hierarchical system can assist in reducing the complexity. We show that the answer to this question is affirmative. For simplicity, we illustrate this in the context of the linear topology and the binary tree topology.

Linear topology. Consider the case where the system is as shown in Figure 5.3.a. Let the constraints used during revision be ∀j :: Cj, where the quantification is over the set of all processes in the system. Let Cj be a constraint that depends on the variables of process j, j−1 (if it exists), and j+1 (if it exists). Furthermore, assume that the constraints for intermediate processes are identical except for the renaming of variables. Let the order of predicates added for the system in Figure 5.3.a be CA, CB, CD.
Furthermore, let the added recovery actions be recA, recB, recD.

Figure 5.3: Complexity and hierarchy for the linear topology.

Theorem 5.7.1 If (recA ∨ recB ∨ recD) form the recovery actions for the program in Figure 5.3.a, then (recA ∨ recB ∨ rec'C ∨ rec'D) form the recovery actions for the program in Figure 5.3.b, where rec'C is obtained by replacing B by C and (then) replacing A by B in recB, and rec'D is obtained by replacing B by C in recD. ∎

Proof. Based on the order of constraints and the rules used in constructing recovery actions, constraints CA and CB will be satisfied even for the network in Figure 5.3.b. Since recovery actions do not execute after the corresponding constraint is satisfied, eventually the recovery actions in rec'C and rec'D (and the fault-intolerant program) will execute. Since CD only depends on the variables of D and its predecessor, and they correct a predicate involving D and its predecessor, if actions in rec'D execute then they will correct CD. Moreover, if actions in rec'D execute then they terminate (after satisfying CD). Hence, given the fairness assumption, actions in rec'C will execute. Observe that rec'C is obtained from recB by replacing B by C and A by B. Furthermore, based on the definitions of the constraints, CC is obtained from CB by replacing B by C and A by B. Thus, rec'C will correct CC. Note that rec'C can violate CD. However, it will be corrected again by rec'D. ∎

Binary tree topology. Consider the case where the system is as shown in Figure 5.4.a. Let the constraints used during revision be ∀j :: Cj, where the quantification is over the set of all processes in the system. Let Cj be a constraint that depends on the variables of process j, j's parent (if it exists), and j's children (if they exist). Furthermore, assume that the constraints for intermediate processes (respectively, the leaves) are identical except for the renaming of variables.
Let the order of predicates added for the system in Figure 5.4.a be CA, CB, CC, CD, CE, CF, and CG. Furthermore, let the added recovery actions be recA, recB, recC, recD, recE, recF, and recG.

Theorem 5.7.2 If (recA ∨ recB ∨ recC ∨ recD ∨ recE ∨ recF ∨ recG) form the recovery actions for the program in Figure 5.4.a, then (recA ∨ recB ∨ rec'C ∨ rec'D ∨ rec'E ∨ rec'F ∨ rec'G ∨ recH ∨ recI ∨ recJ ∨ recK ∨ recL ∨ recM ∨ recN ∨ recO) form the recovery actions for the program in Figure 5.4.b, where:

1. recB is used to generate rec'D by: (a) replacing D by H and E by I, (b) replacing B by D, and (then) (c) replacing A by B,

2. recH is obtained by replacing D by H and (then) replacing B by D in recD,

3. recI is obtained by replacing D by I and (then) replacing B by D in recD;

rec'E, rec'F, rec'G, recJ, recK, recL, recM, recN, and recO are generated by using steps similar to steps 1-3. ∎

Proof. The proof of Theorem 5.7.2 is similar to that of Theorem 5.7.1. ∎

Figure 5.4: Complexity and hierarchy for the binary tree topology.

While the above result is straightforward and widely understood, it is especially useful for managing the complexity of hierarchical systems. While results of this form have been presented in the literature, the preconditions that must be satisfied to apply them are often difficult to evaluate during automated revision. However, the conditions of the above theorems are easy to evaluate, and they can reduce the complexity of synthesizing systems with a larger number of nodes. Clearly, constructing and verifying the recovery actions that satisfy the conditions of Theorems 5.7.1 and 5.7.2 is syntactical and requires a minimal amount of time to complete.

5.8 Summary

In this chapter, we focused on making the automated model revision more comprehensive and covering more levels of fault-tolerance.
In particular, we derived theories, developed algorithms, and built tools to automate the addition of nonmasking and stabilizing fault-tolerance. Our algorithm ensures that it adds recovery actions that enable the program to recover to its legitimate states from any arbitrary state. This algorithm is based on describing the legitimate states using a set of constraints. Then, it finds recovery actions that satisfy each constraint. Finally, it makes sure that the recovery actions do not interfere with each other and work collectively to reach the legitimate states. Also, we used multi-core technology to parallelize our algorithm to substantially reduce the revision time. We illustrated our approach with three case studies. Furthermore, we demonstrated that automated revision in these case studies was feasible and achieved in a reasonable time.

Chapter 6

Legitimate States Automated Discovery

Existing algorithms for automated model revision require that the designers identify the legitimate states of the original model. Experience suggests that, of the inputs required for model revision, identifying such legitimate states is the most difficult and creates a burden on the use of these methods. To reduce this burden, we develop an algorithm wLspGenerator (i.e., weakest legitimate state predicate generator) for identifying the largest set of states from where the program satisfies its specification. Furthermore, we show how this algorithm can be integrated with existing algorithms for the addition of fault-tolerance. With an example, we show that a straightforward approach of using reachability analysis from initial states to compute legitimate states is not relatively complete. The rest of the chapter is organized as follows: In Section 6.2, we present our algorithm, wLspGenerator, for computing the weakest legitimate state predicate for the given program.
In Section 6.3, we demonstrate the application of this algorithm with four case studies to show that it computes the largest set of legitimate states required for model revision. Finally, we present a summary in Section 6.4.

6.1 Introduction

In automated model revision to add fault-tolerance, it is required that, after the occurrence of faults, the revised program eventually recovers to the legitimate states of the original program. Since the original program met its original specification from these states, we can ascertain that eventually the revised program reaches states from where subsequent computations are correct. One of the problems in providing recovery to legitimate states, however, is that these legitimate states are not always easy to determine. Current approaches for automated model revision, i.e., for revising an existing model to add fault-tolerance, include [27, 30, 101, 111] as well as the approaches presented in Chapters 3-5. These approaches describe the model as an abstract program. They require the designer to specify (1) the existing abstract program that is correct in the absence of faults, (2) the program specification, (3) the faults that have to be tolerated, and (4) the program's legitimate states, from where the existing program satisfies its specification. Of these four inputs, the first three are easy to identify and are unavoidable. For example, one is expected to utilize model revision only if there is an existing model that fails to satisfy a required property. Thus, if model revision is applied in the context of newly identified faults, the original model and the faults are already available. Likewise, the specification identifies what the model was supposed to do. Clearly, requiring it is unavoidable. Identifying the legitimate states from where the fault-intolerant program satisfies its specification is, however, a difficult task.
Our experience in this context shows that while identifying the other three arguments is often straightforward, identifying precise legitimate states requires significant effort. It is straightforward to observe that if these legitimate states could be derived automatically, then it would reduce the burden put on the designer, thereby making it easier to apply these techniques in the revision of existing programs.

One approach for identifying legitimate states is to use initial states as legitimate states. While identifying these initial states is typically easy for the designer, this approach is very limiting. A variation of this approach is to define the legitimate states to be those states that are reachable from the initial states. While less limiting, this approach fails to identify states from where the existing program is correct, although such states are not reached in fault-free execution. While the knowledge of these states is irrelevant for fault-free execution, it is potentially useful in adding fault-tolerance. In particular, if faults perturb the program to one of these states, no recovery may be needed. Furthermore, recovery could be added to these states so that subsequent computation is correct.

In this chapter, we focus on automated model revision where we begin with the specification of the original program and discover the legitimate states automatically. In particular, we focus on identifying the largest set of legitimate states from where the original fault-intolerant program satisfies its specification. Subsequently, we utilize this set of legitimate states in obtaining the fault-tolerant program that is correct by construction. (If we view a set of states as a predicate that is true only in those states, then this corresponds to the weakest state predicate.) Of course, an enumerative approach, where we consider each state as a potential initial state, is impractical.
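The difference between reachability-based legitimate states and the weakest legitimate state predicate can be made concrete with a toy example (entirely hypothetical; the states and transitions below are ours, not from any case study):

```python
# Toy illustration: states are integers 0..5.
# Transitions: 0->1->2->0 is the fault-free cycle; 3->4->3 is a second
# cycle from which the specification also holds, but which is never
# reached from the initial state 0. State 5 violates safety.
trans = {0: 1, 1: 2, 2: 0, 3: 4, 4: 3, 5: 5}
init = {0}
bad = {5}

def reachable(start, trans):
    """States reachable from the given start states."""
    seen = set(start)
    frontier = list(start)
    while frontier:
        t = trans[frontier.pop()]
        if t not in seen:
            seen.add(t)
            frontier.append(t)
    return seen

# Weakest legitimate predicate: every state whose computation never
# reaches a bad state -- here, everything except state 5.
legit = {s for s in trans if bad.isdisjoint(reachable({s}, trans))}

print(reachable(init, trans))  # {0, 1, 2}
print(legit)                   # {0, 1, 2, 3, 4}
```

States 3 and 4 are legitimate but unreachable: the reachability approach would force recovery away from them, whereas the weakest predicate keeps them available as recovery targets.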
Our goal in this chapter is to identify efficient techniques for identifying the largest set of legitimate states for a given program.

Our algorithm for computing the largest set of legitimate states takes two inputs: the program (specified in terms of its transitions) and its specification. The program specification consists of: (1) a safety specification, which is specified in terms of (bad) states that the program should not reach and (bad) transitions that the program should not execute, and (2) zero or more liveness specifications of the form F leads-to T (written as F ↝ T), which state that if the program ever reaches a state where F is true, then in its subsequent computation it reaches a state where T is true.

In this chapter, we present the algorithm stpGenerator for identifying the set of legitimate states with respect to the given program and specification. We show that our algorithm for finding the largest set of legitimate states is sound. With a BDD-based implementation, we show that our algorithm manages the state explosion problem. We illustrate our algorithm in the context of four case studies: the Byzantine agreement program [108], the token ring program [30], the Stabilizing Tree Based Mutual Exclusion problem based on the fault-intolerant version by Raymond [124], and the Stabilizing Diffusing Computation [13]. The sets of legitimate states computed in these examples are identical to those in Chapters (3-5) and in [30, 102]. In particular, the set of legitimate states computed in this chapter for mutual exclusion is used in [15] for adding nonmasking fault-tolerance. It follows that by combining our algorithm with that in [102] for adding fault-tolerance, it would be possible to perform the revision to add fault-tolerance without requiring the designer to specify the legitimate states explicitly.
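The inputs to stpGenerator (the program's transitions, plus a specification made of bad states, bad transitions, and leads-to properties) might be encoded as follows. This is an illustrative explicit-state encoding; the names are ours, and the actual implementation represents each component as a BDD rather than a set.

```python
from dataclasses import dataclass

# In this sketch a state is just an index into a finite state space.

@dataclass
class Spec:
    """Specification as consumed by the generator (illustrative encoding)."""
    bad_states: set       # SPEC_bs: states the program must never reach
    bad_transitions: set  # SPEC_bt: (s0, s1) pairs it must never execute
    leads_to: list        # [(F, T), ...] pairs of state sets, each read as F ~> T

# A program is a set of transitions; its specification is a Spec instance.
program = {(0, 1), (1, 2), (2, 0)}
spec = Spec(bad_states={3},
            bad_transitions={(1, 3)},
            leads_to=[({0}, {2})])  # from state 0, eventually reach state 2
```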
6.2 The “Weakest Legitimate State Predicate Generator (stpGenerator)” Algorithm

In this section, we present our algorithm to automatically generate the largest set of legitimate states using the program transitions and its specification. The goal of our algorithm is to generate the largest set of legitimate states (i.e., the weakest legitimate state predicate) from where the program satisfies its safety and liveness specification. Our algorithm consists of three main parts: the legitimate states generator, the safety checker, and the liveness checker. We describe each of the three algorithms in Subsections 6.2.1-6.2.3. We use a symbolic representation in terms of Boolean formulas since we implemented this algorithm using Ordered Binary Decision Diagrams (OBDDs) [34].

Algorithm sketch. Intuitively, our algorithm consists of two main steps. The first step is to generate the initial set of legitimate states from the program transitions and safety specifications. In this step, we identify the initial set of legitimate states, say I, to be all the states in the state space excluding the set of bad states, SPEC_bs (the states that should not be reached). Then we proceed to ensure that I does not include any state from where the program can violate safety. The second step is to ensure that I satisfies the liveness properties. To verify a specific liveness property, say X ↝ Y, the algorithm needs to ensure that all program execution paths from all states in X reach Y. Furthermore, all such paths should be cycle-free; if cycles exist, then all states in X that lead to these cycles are removed from I.

We now describe our algorithm in detail. First, we describe the algorithm stpGenerator, which computes the largest set of legitimate states, say I, that satisfies the program specification. Then, we proceed to describe the algorithm SafetyChecker, which computes the set of states from where the program does not violate the safety property.
Finally, we describe the algorithm LivenessChecker, which removes from the set of legitimate states, I, any state that may violate the liveness of the program.

6.2.1 Weakest Legitimate State Predicate Generator

The input to stpGenerator consists of the program transitions, SPEC_bs (the states that should not be reached), SPEC_bt (the transitions that should not be executed), and the liveness properties. The algorithm returns the largest set of legitimate states from where the program satisfies its specification. First, it initializes the legitimate states Iw to be the whole state space (Line 1). Then, the algorithm computes the largest set of legitimate states by calling the function SafetyChecker (Line 4). At this point, Iw includes the set of states from where the program satisfies the given safety specification. Next, the algorithm enforces the liveness properties one after another by calling the function LivenessChecker, which removes states that violate the given liveness property (Lines 5-7). Removal of states due to liveness properties may require re-computation of Iw. Hence, this computation is in a loop and terminates when a fixpoint is reached.

6.2.2 Safety Checker

The input of the SafetyChecker algorithm consists of the initial set of legitimate states, the program transitions, SPEC_bs, and SPEC_bt. The output is the computed largest set of legitimate states, Isf, for the given safety specification.

Algorithm 12 WeakestLegitimateStatePredicateGenerator (stpGenerator)
Input: program transitions p, SPEC_bs (states that should not be reached), SPEC_bt (transitions that should not be executed), F[] and T[] state predicates describing leads-to properties.
Output: weakest legitimate state predicate Iw.
    // Initially Iw equals Sp, the program state space.
 1: Iw := Sp
 2: repeat
 3:   tmp := Iw
 4:   Iw := SafetyChecker(Iw, p, SPEC_bs, SPEC_bt);
      // check the i-th liveness property
 5:   for i := 0 to NoOfLivenessProperties do
 6:     Iw := LivenessChecker(Iw, p, F[i], T[i]);
 7:   end for
 8: until tmp = Iw
    // return the largest set of legitimate states.
 9: return Iw;

First, the algorithm initializes the set of legitimate states Isf to be Itmp excluding the states in SPEC_bs (Line 1). Then, the algorithm starts a fixpoint computation that removes undesired states from Isf. If Isf contains a state s0 such that the program can execute a transition (s0, s1) that violates safety, then s0 cannot be in Isf; hence, we remove s0 from Isf (Line 4). Note that a state is removed from Isf only if the given program violates safety from that state. If Isf contains a state s0, p contains a transition (s0, s1), and s1 has been removed from Isf, then s0 must also be removed from Isf (Line 5). This process continues until a fixpoint is reached. At this point, the algorithm exits the loop and returns the desired set of legitimate states Isf.

Algorithm 13 SafetyChecker
Input: initial legitimate states Itmp, program transitions p, SPEC_bs (states that should not be reached), SPEC_bt (transitions that should not be executed).
Output: weakest legitimate state predicate Isf.
    // Sp is the state space of p
 1: Isf := Itmp - SPEC_bs;
 2: repeat
 3:   tmpI := Isf;
 4:   Isf := Isf - {s0 : (s0, s1) ∈ p ∩ SPEC_bt};
 5:   Isf := Isf - {s0 : (s0, s1) ∈ p ∧ s0 ∈ Isf ∧ s1 ∉ Isf};
 6: until tmpI = Isf
    // return the set of states from where the program satisfies safety properties.
 7: return Isf;

6.2.3 Liveness Checker

The input of the LivenessChecker algorithm consists of the initial set of legitimate states Itmp, the program transitions p, and the state predicates F and T, where F ↝ T is a given leads-to property. The output is the largest set of states that is a subset of Itmp from where the given program satisfies F ↝ T.
First, the algorithm creates a program tmpP by adding a self-loop to every deadlock state, i.e., every state s0 from which the program p has no outgoing transitions and s0 ∉ T (Line 1). All computations of tmpP are thus infinite or terminate in a state in T. Next, we remove all transitions of tmpP that reach T (Line 2). If p satisfies F ↝ T, then it follows that tmpP cannot include any infinite computation that includes a state in F. Hence, the algorithm iteratively removes deadlock states of tmpP (Lines 5-7). If some states in F still remain, then there are infinite computations of tmpP that begin in a state in F but never reach a state in T. We remove such states from Itmp and iteratively recompute Itmp.

Algorithm 14 LivenessChecker
Input: initial legitimate states Itmp, program transitions p, F and T state predicates describing a leads-to property.
Output: weakest legitimate state predicate Itmp.
    // ASSUMPTION: F ∩ T = {}. If not, change F to (F - T).
    // let ds(p) = {s0 : ∀s1 :: (s0, s1) ∉ p} be the set of deadlock states.
    // add a self-loop to the states in ds(p).
 1: tmpP := p ∪ {(s0, s0) : s0 ∉ T ∧ s0 ∈ ds(p)};
 2: tmpP := {(s0, s1) : (s0, s1) ∈ tmpP ∧ s1 ∉ T};
 3: repeat
 4:   invF := Itmp;
 5:   while (invF ∩ ds(tmpP)) ≠ {} do
 6:     invF := invF - ds(tmpP);
 7:   end while
 8:   if F ∩ invF ≠ {} then
 9:     Itmp := Itmp - (F ∩ invF);
10:   end if
11: until F ∩ invF = {}
    // return the set of states from where the program satisfies the liveness property.
12: return Itmp;

Extension. In some cases, the program actions are partitioned into system actions and environment actions. It is expected that the environment actions will eventually stop (for a long enough time) so that the system actions can make progress (and satisfy the liveness property). In such cases, we can apply the above algorithm as follows: the program actions used in SafetyChecker will consist of both the system actions and the environment actions.
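Algorithms 12-14 can be prototyped over explicit sets of states and transitions. This is a minimal sketch with names of our choosing; the actual implementation is symbolic, over OBDDs, and we read ds(tmpP) on Lines 5-7 relative to the surviving set invF, which is what makes the inner loop a fixpoint.

```python
def safety_checker(i_tmp, p, spec_bs, spec_bt):
    """Algorithm 13 sketch: largest subset of i_tmp from where safety holds."""
    i_sf = i_tmp - spec_bs                                   # Line 1
    while True:
        old = set(i_sf)
        # Line 4: drop states that can execute a bad transition
        i_sf -= {s0 for (s0, s1) in p if (s0, s1) in spec_bt}
        # Line 5: drop states with a transition leaving the candidate set
        i_sf -= {s0 for (s0, s1) in p if s0 in i_sf and s1 not in i_sf}
        if old == i_sf:
            return i_sf

def liveness_checker(i_tmp, p, f_pred, t_pred):
    """Algorithm 14 sketch: drop states from where F ~> T can fail."""
    f_pred = f_pred - t_pred                   # assumption: F and T disjoint
    srcs = {s0 for (s0, _) in p}
    # Line 1: self-loop deadlock states outside T; Line 2: cut edges into T
    tmp_p = set(p) | {(s, s) for s in i_tmp if s not in srcs and s not in t_pred}
    tmp_p = {(s0, s1) for (s0, s1) in tmp_p if s1 not in t_pred}
    while True:
        inv_f = set(i_tmp)
        while True:  # Lines 5-7: prune states with no successor left in inv_f
            dead = {s for s in inv_f
                    if not any(s0 == s and s1 in inv_f for (s0, s1) in tmp_p)}
            if not dead:
                break
            inv_f -= dead
        # inv_f now holds states with an infinite tmp_p computation avoiding T
        if not (f_pred & inv_f):               # until-condition (Line 11)
            return i_tmp
        i_tmp = i_tmp - (f_pred & inv_f)       # Line 9

def stp_generator(states, p, spec_bs, spec_bt, leads_to):
    """Algorithm 12 sketch: weakest legitimate state predicate."""
    i_w = set(states)
    while True:
        tmp = set(i_w)
        i_w = safety_checker(i_w, p, spec_bs, spec_bt)
        for f_pred, t_pred in leads_to:
            i_w = liveness_checker(i_w, p, f_pred, t_pred)
        if tmp == i_w:
            return i_w

# Example: cycle 0->1->2->0 satisfies {0} ~> {2}; state 3 leads into bad state 4.
p = {(0, 1), (1, 2), (2, 0), (3, 4), (4, 4)}
print(stp_generator({0, 1, 2, 3, 4}, p, {4}, set(), [({0}, {2})]))  # {0, 1, 2}
```

In the example, states 3 and 4 are removed by the safety pass (state 4 is bad, and state 3 can only reach it), after which the liveness pass confirms that every computation from {0} reaches {2}.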
The program actions used for LivenessChecker will consist of only the system actions.

Theorem 6.2.1 The algorithm stpGenerator is sound (i.e., the generated set of legitimate states is the largest set of legitimate states).

Proof. The proof consists of two parts: (1) if a state, say s0, is not included in the output of stpGenerator, then the program does not satisfy its specification from s0, and (2) if a state, say s0, is included in the output of stpGenerator, then the program satisfies its specification from s0.

We prove the first part by considering all parts of the code where some state is removed from the output.

• Line 1 of SafetyChecker: Clearly, states in SPEC_bs cannot be included in the final set of legitimate states.

• Line 4 of SafetyChecker: If (s0, s1) is a transition of the program that violates safety, then there is a computation of the program that starts from s0 and violates the specification.

• Line 5 of SafetyChecker: If s1 is a state already removed from the final set of legitimate states, i.e., there is a program computation that starts from s1 and violates the specification, and (s0, s1) is a program transition, then there exists a computation that starts from s0 and violates the specification.

• Line 9 of LivenessChecker: Observe that in tmpP, transitions that reach T are removed. Now, the loop on Lines 5-7 removes all deadlock states from invF. If any state, say s0, in F is not removed, then there are infinite computations of tmpP that start from s0; for instance, this happens if a cycle is reachable from s0. By construction, such a computation cannot reach T. Thus, if a state s0 is removed on Line 9 of LivenessChecker, then there is a computation from s0 that violates the specification.

We use proof by contradiction for the second part. Suppose s0 is included in the output of stpGenerator and there is a computation, say (s0, s1, ...), that violates the specification from s0.
We consider two cases depending upon whether this computation violates the safety specification or the liveness specification.

• Safety specification. Consider the first point where a safety violation is detected, i.e., either a state, say sj, in SPEC_bs is reached or a transition, say (sj-1, sj), in SPEC_bt is executed.

  - Case 1: sj ∈ SPEC_bs. By Line 1 of SafetyChecker, j ≠ 0. Also, by Line 5 of SafetyChecker, sj-1 would be removed from the final set of legitimate states. Likewise, sj-2 would be removed, and so on. Thus, s0 cannot be in the output of stpGenerator. This is a contradiction.

  - Case 2: (sj-1, sj) ∈ SPEC_bt. By the same argument as in Case 1, we can show that s0 cannot be in the output of stpGenerator. This is a contradiction.

• Liveness specification. If this computation does not satisfy the liveness specification, then it has a suffix where F is true in some state, say sj, but T is false in all states. Now, we define a computation σ that starts from sj. If the computation (s0, s1, ...) is infinite, then σ is the suffix that starts from sj. If the computation (s0, s1, ...) is not infinite, i.e., it ends in a state, say sl, where p has no outgoing transitions, then σ is obtained by concatenating the suffix starting from sj with an infinite stuttering of state sl. By construction, σ is also a computation of tmpP (Line 2 of LivenessChecker). Thus, sj is removed from the output of stpGenerator. Again, by an argument similar to the case of the safety specification, we can conclude that s0 cannot be in the output of stpGenerator. This is a contradiction.
∎

6.3 Application of stpGenerator in Automated Model Revision

In this section, we describe and analyze our approach for generating the legitimate states of the four case studies: the Byzantine agreement program [108], the token ring program [30], the Stabilizing Tree Based Mutual Exclusion problem based on the fault-intolerant version by Raymond [124], and the Stabilizing Diffusing Computation [13]. We chose these classical examples from the literature of distributed computing to illustrate the feasibility and applicability of our algorithm in generating the weakest legitimate state predicate. Furthermore, these case studies illustrate that the overhead of computing the legitimate states using stpGenerator is very small compared to the overall time required for the addition of fault-tolerance. Thus, reducing the burden on the designer in terms of requiring the explicit legitimate states increases the complexity by a very small factor.

Throughout this section, all case studies are run on a MacBook Pro with a 2.6 GHz Intel Core 2 Duo processor and 4 GB RAM. The OBDD representation of the Boolean formulas has been done using the C++ interface to the CUDD package developed at the University of Colorado [125].

6.3.1 Case Study 1: Byzantine agreement program

We illustrate our algorithm in the context of the Byzantine agreement program from Section 4.3.3. We start by specifying the fault-intolerant program. Then, we provide the program specification. Finally, we describe the weakest legitimate state predicate generated by our algorithm.

Program. The Byzantine agreement program consists of a “general” and three or more non-general processes. Each process copies the decision of the general and finalizes (outputs) that decision. Recall from Section 4.3.3 that the actions of the Byzantine agreement program are as shown below. The only difference is in the third and fourth actions, which allow a Byzantine process to change its decision and finalized status.
The last two actions are environment actions.

1 :: (d.j = ⊥) ∧ (f.j = false) → d.j := d.g;
2 :: (d.j ≠ ⊥) ∧ (f.j = false) → f.j := true;
3 :: (b.j) → d.j := 1|0, f.j := false|true;
4 :: (b.g) → d.g := 1|0;

where j ∈ {1..n} and n is the number of non-general processes.

Specification. The safety specification of the Byzantine agreement requires validity and agreement:

• Validity requires that if the general is non-Byzantine, then the final decision of a non-Byzantine process must be the same as that of the general. Thus, validity(j) is defined as follows.

  validity(j) = ((¬b.j ∧ ¬b.g ∧ f.j) ⇒ (d.j = d.g))

• Agreement means that the final decisions of any two non-Byzantine processes must be equal. Thus, agreement(j, k) is defined as follows.

  agreement(j, k) = ((¬b.j ∧ ¬b.k ∧ f.j ∧ f.k) ⇒ (d.j = d.k))

• The final decision of a process must be either 0 or 1. Thus, final(j) is defined as follows.

  final(j) = f.j ⇒ (d.j = 0 ∨ d.j = 1)

We formally identify the safety specification of the Byzantine agreement by the following set of bad states:

  SPEC_BA_bs = (∃j, k ∈ {1..n} :: ¬(validity(j) ∧ agreement(j, k) ∧ final(j)))

Observe that SPEC_BA_bs can be easily derived from the specification of the Byzantine agreement problem.

The liveness specification of the Byzantine agreement requires that eventually every non-Byzantine process finalizes a decision. The requirement that process j eventually finalizes a decision can be specified as follows:

  ¬b.j ↝ (f.j)

Application of our algorithm. The weakest predicate computed (for 3 non-general processes) is as follows. If the general is non-Byzantine, then it is necessary that d.j, where j is also non-Byzantine, be either d.g or ⊥. Furthermore, a non-Byzantine process cannot finalize its decision if its decision equals ⊥. Now, we consider the set of states where the general is Byzantine. In this case, the general can change its decision arbitrarily.
Also, the predicate includes states where the general is Byzantine while the other processes are non-Byzantine and have the same decision value, different from ⊥. Thus, the generated weakest legitimate state predicate is as follows:

  I_BA = ( ¬b.g ∧ (∀p ∈ {1..n} :: ((¬b.p ∧ f.p) ⇒ d.p ≠ ⊥) ∧ (¬b.p ⇒ (d.p = ⊥ ∨ d.p = d.g))) )
       ∨ ( b.g ∧ (∀j, k ∈ {1..n} : j ≠ k :: (d.j = d.k) ∧ (d.j ≠ ⊥)) )

Observe that I_BA cannot be easily derived from the specification of the Byzantine agreement problem. More specifically, the states where the general is Byzantine are not reachable from the initial states of the program.

We used the exact same predicate in the case study from Section 4.3.3 to add fault-tolerance to Byzantine faults. (In [30], where we reported the results for the addition of fault-tolerance with symbolic techniques, the set of legitimate states used was a conjunction of the above predicate and a formula stating that at most one process is Byzantine. However, this extra formula does not affect the revised program or the time complexity.)

The amount of time required for computing this set of legitimate states for different numbers of processes is as shown in Table 6.1. We would like to note that the set of legitimate states computed in these case studies is the same as that used in the addition of fault-tolerance.

  No. of Processes | Reachable States | Legitimate States Generation Time (sec)
  10               | 10^9             | 0.57
  20               | 10^15            | 1.34
  30               | 10^22            | 4.38
  40               | 10^30            | 9.25
  50               | 10^36            | 26.34
  100              | 10^71            | 267.30

Table 6.1: The time required to generate the weakest legitimate state predicate (Byzantine Agreement).

We note that the time required to compute the set of legitimate states is very small compared with the total time needed to complete the revision. For example, synthesizing a fault-tolerant Byzantine agreement program with 40 processes takes more than 9,000 seconds, as shown in Section 4.3.3. By contrast, the time to compute the legitimate states is only 9.25 seconds.
Thus, the overhead of synthesizing from the specification without explicit legitimate states is negligible.

We use this case study to illustrate that computing the set of legitimate states as those reachable from the initial states is not relatively complete. In particular, for the Byzantine agreement example, the initial state is one where all processes are non-Byzantine and the decision of every non-general process is equal to ⊥. Clearly, all processes are non-Byzantine in all states reached by the program from these initial states. It follows that recovery to these reachable states is not always feasible in the presence of faults. Hence, these reachable states are insufficient to obtain the fault-tolerant program. By contrast, the weakest legitimate state predicate can be utilized to find the fault-tolerant program.

6.3.2 Case Study 2: Token Ring

In this section, we illustrate our algorithm in the context of the token ring program. First, we specify the fault-intolerant program. Then, we provide its specification. Finally, we identify the largest set of legitimate states generated by the algorithm from Section 6.2.

Program. The token ring program consists of n processes organized in a ring. A token is circulated among the processes in a fixed direction. When a process gets the token, it can access the critical section. Each process j, where j ∈ {0..n}, has a variable x.j with the domain {0, 1, ⊥}, where ⊥ denotes that the process is in an illegitimate state. Process 0 has the token iff x.0 is equal to x.n, and a process j, where 1 ≤ j ≤ n, has the token iff x.j ≠ x.(j-1). The actions of the token ring program are as follows:

1 :: x.j ≠ x.(j-1) → x.j := x.(j-1);
2 :: x.0 = x.n → x.0 := x.n +2 1;

where +2 denotes modulo-2 addition.

Specification. The safety specification of the token ring requires that the value of x at any process is either 0 or 1 and that no two processes have a token simultaneously.
Thus, the safety specification of the token ring program can be identified using the following set of bad states (i.e., states that should not be reached by normal program execution):

  SPEC_TR_bs = (∃j, k : j ≠ k ∧ j, k ∈ {1..n} :: ((x.(j-1) ≠ x.j) ∧ (x.(k-1) ≠ x.k)))
             ∨ (∃j : j ∈ {1..n} :: ((x.(j-1) ≠ x.j) ∧ (x.0 = x.n)))
             ∨ (∃j : j ∈ {0..n} :: (x.j = ⊥))

The liveness specification of the token ring requires that eventually every process gets the token. The requirement that process 0 eventually gets the token can be specified as:

  true ↝ (x.0 = x.n)

Application of our algorithm. After applying our algorithm with the above inputs, the generated largest set of legitimate states can be represented using the following regular expression:

  (x.0, x.1, x.2, ..., x.n) ∈ (0^l 1^(n+1-l)) ∪ (1^l 0^(n+1-l)), where 0 ≤ l ≤ n+1.

Thus, the above predicate states that the sequence (x.0, x.1, x.2, ..., x.n) is a sequence of zeros followed by ones, or of ones followed by zeros. The value of l+1 in the above sequence identifies the process with the token.

We note that this is the exact same set of legitimate states used in Section 4.3.3 for adding fault-tolerance to the fault where up to n processes are detectably corrupted. Furthermore, the time for computing this set of legitimate states for different values of n is as shown in Table 6.2. As we can see, it is very small.

  No. of Processes | Reachable States | Legitimate States Generation Time (sec)
  10               | 10^4             | 0.1
  20               | 10^9             | 0.2
  30               | 10^14            | 0.3
  40               | 10^19            | 0.4
  50               | 10^23            | 0.6
  100              | 10^47            | 0.19

Table 6.2: The time required to generate the weakest legitimate state predicate (token ring).

6.3.3 Case Study 3: Mutual Exclusion

In this section, we illustrate our algorithm in the context of Raymond's tree-based mutual exclusion program. Our goal in this case study is to automatically generate the weakest legitimate state predicate for the program in [15]. We start by specifying the fault-intolerant program. Then, we provide the program specification.
Finally, we identify the weakest legitimate state predicate generated by our algorithm.

Program. Recall that the action by which process k sends the token to process j is as follows:

1 :: (h.k = k ∧ j ∈ Adj.k) ∧ (h.j = k) → h.k := j, h.j := j;

where Adj.k denotes the neighbors of k.

Specification. Since the goal of Raymond's mutual exclusion algorithm is to maintain a tree rooted at the token, it requires that the holder of any process be one of its tree neighbors. It also requires that there be no cycles in the holder relation. We formally describe the safety specification by the following predicate:

  SPEC_ME_bs = (∃j ∈ {0..n} :: ((h.j ≠ j) ∧ (h.j ≠ P.j) ∧ (h.j ≠ ch.j)))
             ∨ (∃j, k ∈ {0..n} : j ≠ k :: ((h.j = k) ∧ (h.k = j)))
             ∨ (∃j, k ∈ {0..n} : j ≠ k :: ((h.j = j) ∧ (h.k = k)))

where ch.j denotes one of the children of j.

Application of our algorithm. The generated weakest legitimate state predicate of the mutual exclusion program computed by our algorithm is as follows. The legitimate state predicate requires that j's holder be either j's parent, j itself, or one of j's children. It also requires that the holder tree conforms to the parent tree and that there are no cycles in the holder relation.

  I_ME = (∀j ∈ {0..n} :: (h.j = P.j) ∨ (h.j = j) ∨ (∃k :: (P.k = j) ∧ (h.j = k)))
       ∧ (∀j ∈ {0..n} :: (P.j ≠ j) ⇒ ((h.j = P.j) ∨ (h.(P.j) = j)))
       ∧ (∀j ∈ {0..n} :: (P.j ≠ j) ⇒ ¬((h.j = P.j) ∧ (h.(P.j) = j)))

where P.j denotes the parent of j.

Recall that I_ME is equivalent to the conjunction of the constraints (S1, S2, and S3) used in deriving the nonmasking fault-tolerant version of the mutual exclusion program. The amount of time required for computing this set of legitimate states for different numbers of processes is as shown in Table 6.3.

  No. of Processes | Reachable States | Legitimate States Generation Time (sec)
  10               | 10^9             | 0.01
  20               | 10^26            | 0.1
  30               | 10^44            | 0.2
  40               | 10^64            | 0.5
  50               | 10^84            | 0.9
  100              | 10^200           | 0.43

Table 6.3: The time required to generate the weakest legitimate state predicate (Mutual Exclusion).
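Under the reading of I_ME above, the predicate can be sanity-checked by brute force on a small instance. On a four-process chain (an illustrative configuration of ours, not one of the measured instances), the states satisfying I_ME should be exactly those in which a single process holds the token, with all holder pointers oriented toward it:

```python
from itertools import product

# Four processes on a chain 0-1-2-3; P[j] is j's parent (process 0 is the root).
P = [0, 0, 1, 2]
n = len(P)
children = [[k for k in range(n) if P[k] == j and k != j] for j in range(n)]

def legit(h):
    """Check the three conjuncts of I_ME for holder assignment h."""
    for j in range(n):
        # conjunct 1: h.j is j's parent, j itself, or one of j's children
        if h[j] not in {P[j], j} | set(children[j]):
            return False
        if P[j] != j:
            up, down = (h[j] == P[j]), (h[P[j]] == j)
            # conjuncts 2 and 3: exactly one of "j holds toward its parent"
            # and "j's parent holds toward j"
            if up == down:
                return False
    return True

states = [h for h in product(range(n), repeat=n) if legit(h)]
holders = sorted(j for h in states for j in range(n) if h[j] == j)
print(len(states), holders)  # 4 [0, 1, 2, 3]
```

Each of the four legitimate states corresponds to one token position, which matches the intent of the predicate: the holder pointers form a tree directed toward a unique token holder.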
6.3.4 Case Study 4: Diffusing Computation

In this case study, we consider a diffusing computation on a system where processes are arranged in a logical tree. The root initiates a diffusing computation and propagates it to its children; the children forward it to their children, and so on, until it reaches all processes. Once the computation reaches a leaf, it marks the leaf as completed and reflects back to the parent. When all children of a process are marked completed, that process marks itself completed and reflects the computation to its parent. The diffusing computation ends when it marks the root as completed.

Program. The fault-intolerant program in this case study is the diffusing computation program from [13]. Each process j has two Boolean variables c.j (color) and sn.j (session number) and an integer variable P.j (the parent of j). A new diffusing computation can start if the root is colored green (c.root = green) and the session number of the root is the same as that of its children. To start a new diffusing computation, the root sets c.root = red and flips sn.root. When a green process finds that its parent is red, it copies its parent's color and session number. Moreover, if a process has no children, or all its children have switched colors from red to green, the process switches its color to green.

The program for the diffusing computation consists of three actions. The first action starts the diffusing computation at the root (1). The second action propagates the diffusing computation to the children (2). The third action completes the diffusing computation when all the children complete the computation (3). The program actions are described below:

1 :: (c.root = green) → c.root := red, sn.root := ¬sn.root;
2 :: (c.j = green) ∧ (c.(P.j) = red) ∧ (sn.j ≠ sn.(P.j)) → c.j, sn.j := c.(P.j), sn.(P.j);
3 :: (c.j = red) ∧ (∀k : P.k = j ⇒ (c.k = green ∧ sn.j = sn.k)) → c.j := green;

Specification.
The safety specification for the diffusing computation program requires that all processes have the same color and the same session number. We formally define the safety specification by the following predicate:

  SPEC_DC_bs = (∃j, k ∈ {0..n} : j ≠ k :: (sn.j ≠ sn.k) ∨ (c.j ≠ c.k))

Application of our algorithm. The generated weakest legitimate state predicate of the diffusing computation is as follows. The set of legitimate states requires that each process's color and session number be consistent with those of its parent:

  I_DF = (∀j :: (c.j = green ∧ c.(P.j) = red) ∨ (c.j = c.(P.j) ∧ sn.j = sn.(P.j)))

6.4 Summary

In this chapter, we provided techniques that permit the designer to efficiently describe the model to be revised. Specifically, we derived theories, developed algorithms, and built tools to automate the discovery of the legitimate states of the model. Our techniques relieve the designer from performing unnecessary steps, thereby simplifying the application of automated model revision.

Our algorithm uses the program actions and specification to automatically generate the weakest legitimate state predicate. First, it initializes the weakest legitimate state predicate to be the set of states from where the given program does not violate the safety specification. Second, it ensures that the generated weakest legitimate state predicate satisfies the liveness properties by removing any state that violates them.

Also, we considered four case studies. We used our algorithm to automatically discover the set of legitimate states for each case. In each of these examples, the generated set of legitimate states was the same as the one specified explicitly in the automated addition of fault-tolerance, and the time to generate the legitimate states was very small compared with that for performing the corresponding model revision.
Chapter 7

Automated Model Revision Without Explicit Legitimate States

In Chapter 6, we introduced our algorithm for the automated discovery of the legitimate states. We also showed how such automation reduces the burden put on the designer, making it easier to apply these techniques in the revision of existing programs. However, one question that we need to answer concerns the completeness of this approach: if it is possible to perform model revision with explicit legitimate states, is it also possible to do so without the explicit identification of the legitimate states?

In this chapter, we consider the problem of automated model revision without explicit legitimate states. We show that this formulation is relatively complete, i.e., if it is possible to perform model revision with explicit legitimate states, then it is possible to do so without the explicit identification of the legitimate states.

We also identify instances where the complexity class of model revision without explicit legitimate states is the same as that with explicit legitimate states. In turn, this identifies heuristics for performing model revision without explicit legitimate states. Finally, we show that with these heuristics, the increased cost for model revision without explicit legitimate states is small.

The rest of this chapter is organized as follows: In Section 7.1, we present an alternative approach for performing model revision. In Section 7.2, we state the automated model revision problem statement. In Sections 7.3, 7.4, and 7.5, we answer three questions related to the completeness, complexity, and cost of our approach. Finally, we summarize the chapter in Section 7.6.

7.1 Introduction

In this chapter, we focus on the problem of model revision where the legitimate states are computed using automation techniques, in particular, where the algorithm stpGenerator from Chapter 6 is used to generate the set of legitimate states.
Recall from Chapter 6 that the current approaches for automated model revision describe the model as an abstract program. They require the designer to specify (1) the existing abstract program that is correct in the absence of faults, (2) the program specification, (3) the faults that have to be tolerated, and (4) the program legitimate states, from where the existing program satisfies its specification (cf. Figure 7.1). We call this the problem of model revision with explicit legitimate states.

Figure 7.1: Model Revision with Explicit Legitimate States. [Diagram: the original model, specifications, faults, and legitimate states are inputs to automated model revision, which produces the revised model.]

We focus on the problem of model revision where the input only consists of the fault-intolerant program, the faults, and the specification, i.e., it does not include the legitimate states. We call this the problem of model revision without explicit legitimate states (cf. Figure 7.2).

Figure 7.2: Model Revision without Explicit Legitimate States. [Diagram: the same inputs to automated model revision, minus the legitimate states.]

There are several important questions that have to be addressed for such a new formulation.

Q.1 Is the new formulation relatively complete? (I.e., if it is possible to perform model revision using the problem formulation in Figure 7.1, is it guaranteed that it can be solved using the formulation in Figure 7.2?) An affirmative answer to this question will indicate that the reduction of the designers' burden does not affect the solvability of the corresponding problem.

Q.2 Are the complexities of both formulations in the same class? (By same class, we mean polynomial-time reducibility, where complexity is computed in the size of the state space.) An affirmative answer to this question will indicate that the reduction in the designers' burden does not significantly affect the complexity.
Q.3 Is the increased time cost, if any, small compared to the overall cost of program revision? While Question 2 focuses on qualitative complexity, assuming that the answer to it is affirmative, Question 3 addresses the quantitative change in complexity.

In this chapter, we show that the answer to Q.1 is affirmative (cf. Theorem 7.3.1). Furthermore, we show that the answer to Q.2 is partially affirmative. Specifically, we identify two versions of the revision problem: partial revision and total revision. We show that the answer is affirmative for total revision (cf. Theorem 7.4.3). We point out that the answer is negative for partial revision; in other words, for partial revision, the complexity of solving the problem in Figure 7.2 can be larger (cf. Section 7.4.5). Even though the answer to Q.2 is negative for partial revision, we show that there is a subclass of this problem where the complexity of the approach in Figure 7.2 is the same as that in Figure 7.1. In particular, we show that for all instances where the answer to the problem in Figure 7.1 is affirmative, it is possible to solve the corresponding problem in Figure 7.2 in the same complexity class. However, it is possible that the answer to the problem in Figure 7.1 is negative, i.e., the corresponding algorithm declares failure to generate the fault-tolerant program, although the answer to the corresponding problem in Figure 7.2 is affirmative. For these cases, the complexity of solving the problem in Figure 7.2 can be high. Regarding Q.3, we show that for instances where the answer to the question in Figure 7.1 is affirmative, the extra computation cost of solving the problem using the approach in Figure 7.2 is small.

7.2 Problem Statement

In this section, we formally define the problem of model revision with and without explicit legitimate states.

Model Revision with Explicit Legitimate States (Approach in Figure 7.1).
Recall that in Section 2.5 we defined what it means for a program to be (masking) fault-tolerant. Using a similar definition, we now formally specify the problem of deriving a fault-tolerant program from a fault-intolerant program p with explicit legitimate states I, safety specification Sf_p, and liveness specification Lv_p. The goal of the model revision is to modify p to p' by only adding fault-tolerance, i.e., without adding new behaviors in the absence of faults. Since the correctness of p is known only from its legitimate states, I, it is required that the legitimate states of p', say I', cannot include any states that are not in I. Additionally, inside the legitimate states, p' cannot include transitions that were not transitions of p. Also, by Assumption 1.1, p' cannot include new terminating states that were not terminating states of p. Finally, p' must be fault-tolerant. Thus, the problem statement (from [101]) for the case where the legitimate states are specified explicitly is as follows.

Problem Statement 7.1 Revision for Fault-Tolerance with Explicit Legitimate States.
Given p, I, Sf_p, Lv_p, and f such that p satisfies Sf_p and Lv_p from I
Identify p' and I' such that (Respectively, does there exist p' and I' such that)
A1: I' ⇒ I.
A2: s0 ∈ I' ⇒ ∀s1 : s1 ∈ I' : ((s0,s1) ∈ p' ⇒ (s0,s1) ∈ p).
A3: p' is f-tolerant to Sf_p and Lv_p from I'.

Note that this definition can be instantiated for each level of fault-tolerance (i.e., masking, failsafe, and nonmasking). Also, the above problem statement can be used as a revision problem or a decision problem (with the comments inside parentheses). We call the above problem the problem of 'partial revision' because the transitions of p' that begin in I' are a subset of the transitions of p that begin in I'. An alternative formulation is that of total revision, where the transitions of p' that begin in I' are equal to the transitions of p that begin in I'.
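The decision versions of constraints A1 and A2 can be sketched as simple set computations. The encoding below — programs as sets of transition pairs and state predicates as sets of states — is an illustrative assumption, not the internal representation used by the tools in the thesis.

```python
# Hedged sketch of constraints A1 and A2 from Problem Statement 7.1.

def satisfies_A1(I_prime, I):
    # A1: I' => I, i.e., every state of I' is a state of I.
    return I_prime <= I

def satisfies_A2(p, p_prime, I_prime):
    # A2: inside I', p' may only use transitions that p already had.
    return all((s0, s1) in p
               for (s0, s1) in p_prime
               if s0 in I_prime and s1 in I_prime)

p       = {(0, 1), (1, 2), (2, 0)}
p_prime = {(0, 1), (1, 2)}          # a candidate revision (removal only)
I, I_pr = {0, 1, 2}, {0, 1, 2}
assert satisfies_A1(I_pr, I) and satisfies_A2(p, p_prime, I_pr)

p_bad = {(0, 1), (1, 0)}            # (1,0) is a new transition inside I'
assert not satisfies_A2(p, p_bad, I_pr)
```

The distinction between partial and total revision is visible here: under partial revision the quantified implication in A2 is one-directional (a subset check), while total revision strengthens it to a biconditional (set equality inside I').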
In other words, the problem of total revision is identical to Problem Statement 7.1 except that A2 is changed to A2', described next:

A2': s0 ∈ I' ⇒ ∀s1 : s1 ∈ I' : ((s0,s1) ∈ p' ⇔ (s0,s1) ∈ p).

Model Revision without Explicit Legitimate States (Approach in Figure 7.2). Now, we formally define the new problem of model revision without explicit legitimate states. The goal in this problem is to find a fault-tolerant program, say p_r. It is also required that there is some set of legitimate states for p_r, say I, such that p_r does not introduce new behaviors in I. Thus, the problem statement for partial revision for the case where the legitimate states are not specified explicitly is as follows.

Problem Statement 7.2 Revision for Fault-Tolerance without Explicit Legitimate States.
Given p, Sf_p, Lv_p, and f
Identify p_r such that (Respectively, does there exist p_r such that)
(∃I ::
B1: s0 ∈ I ⇒ ∀s1 : s1 ∈ I : ((s0,s1) ∈ p_r ⇒ (s0,s1) ∈ p).
B2: p_r is f-tolerant to Sf_p and Lv_p from I. )

Just like Problem Statement 7.1, the problem of total revision is obtained from Problem Statement 7.2 by replacing B1 with B1', described next:

B1': s0 ∈ I ⇒ ∀s1 : s1 ∈ I : ((s0,s1) ∈ p_r ⇔ (s0,s1) ∈ p).

Existing algorithms for model revision [27, 30, 101, 111] are based on Problem Statement 7.1. Also, the tool SYCRAFT [27] utilizes Problem Statement 7.1 for the addition of fault-tolerance. However, as stated in Section 7.1, this requires the users of SYCRAFT to identify the legitimate states explicitly. Our goal is to evaluate the effect of simplifying the task of the designers by permitting them to omit explicit identification of legitimate states.

7.3 Relative Completeness (Q.1)

In this section, we show that if the problem of model revision can be solved with explicit legitimate states (Problem Statement 7.1), then it can also be solved without explicit legitimate states (Problem Statement 7.2).
Since each problem statement can be instantiated with partial or total revision, this requires us to consider four combinations. We prove this result in Theorem 7.3.1.

Theorem 7.3.1 If
- the answer to the decision problem 7.1 is affirmative with input p (fault-intolerant program), Sf_p (safety specification), Lv_p (liveness specification), f (faults), and I (legitimate states),
then
- the answer to the decision problem 7.2 is affirmative with input p (fault-intolerant program), Sf_p (safety specification), Lv_p (liveness specification), and f (faults).

Proof. Intuitively, a slightly revised version of the program that satisfies Problem 7.1 can be used to show that Problem 7.2 can be solved. Specifically, let the transitions of p_r be {(s0,s1) | (s0 ∈ I' ∧ s1 ∈ I' ∧ (s0,s1) ∈ p) ∨ (s0 ∉ I' ∧ (s0,s1) ∈ p')}.

Formally, since the answer to the decision problem 7.1 is affirmative, there exist a program p' and a predicate I' that satisfy the constraints of Problem Statement 7.1. To show that the answer to the decision problem 7.2 is affirmative, we need to find p_r such that the constraints of Problem Statement 7.2 are satisfied. We let the transitions of p_r be {(s0,s1) | (s0 ∈ I' ∧ s1 ∈ I' ∧ (s0,s1) ∈ p) ∨ (s0 ∉ I' ∧ (s0,s1) ∈ p')}. Next, we show that p_r satisfies the constraints of Problem Statement 7.2. Towards this end, we instantiate I to be I' and show that constraints B1 and B2 are satisfied.

- Constraint B1: By construction of the transitions of p_r, this constraint is satisfied both for the case where we consider partial revision and for the case where we consider total revision.
- Constraint B2: By construction, I' is closed in p_r. Also, since I' ⇒ I and p satisfies Sf_p and Lv_p from I, it is straightforward to observe that p_r satisfies Sf_p and Lv_p from I'. Also, transitions of p_r that begin outside I' are identical to those of p'. The second constraint "(∃T :: ...)" from the definition of fault-tolerance is also satisfied. Thus, p_r is f-tolerant to Sf_p and Lv_p from I'. □
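The construction of p_r used in this proof can be written directly. The sketch below uses the same set-based encoding as before (programs as transition sets), which is an illustrative assumption.

```python
# Hedged sketch of the construction in the proof of Theorem 7.3.1:
# given a witness (p', I') for Problem 7.1, build p_r for Problem 7.2 as
#   p_r = { (s0,s1) | (s0 in I' and s1 in I' and (s0,s1) in p)
#                     or (s0 not in I' and (s0,s1) in p') }.

def build_pr(p, p_prime, I_prime):
    return {(s0, s1)
            for (s0, s1) in p | p_prime
            if (s0 in I_prime and s1 in I_prime and (s0, s1) in p)
            or (s0 not in I_prime and (s0, s1) in p_prime)}

p       = {(0, 1), (1, 0), (2, 0)}       # fault-intolerant program
p_prime = {(0, 1), (1, 0), (2, 1)}       # revised program from Problem 7.1
I_prime = {0, 1}                          # its legitimate states
pr = build_pr(p, p_prime, I_prime)
# Inside I', p_r keeps exactly p's transitions; outside I', it follows p'.
assert pr == {(0, 1), (1, 0), (2, 1)}
```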
Implication of Theorem 7.3.1 for Q.1: From Theorem 7.3.1, it follows that the answer to Q.1 from the Introduction is affirmative for both partial and total revision. Hence, the new formulation (cf. Figure 7.2) is relatively complete.

7.4 Complexity Analysis (Q.2)

In this section, we focus on the second question and compare the complexity class of Problem 7.1 with that of Problem 7.2. In particular, in Section 7.4.1, we show that the complexity of model revision can increase substantially for partial revision if legitimate states are not specified explicitly. Then, in Section 7.4.2, we show that for total revision, Problem 7.2 can be reduced to Problem 7.1 in polynomial time. In Section 7.4.3, we give a heuristic-based approach for partial revision. Furthermore, we show that the heuristic is guaranteed to work when the answer to the corresponding problem in Figure 7.1 is affirmative. In Section 7.4.4, we show how one can obtain an algorithm for model revision without explicit legitimate states by utilizing an algorithm that requires explicit legitimate states. Finally, we mention other complexity results in Section 7.4.5.

7.4.1 Complexity Comparison for Partial Revision

In this section, we show that solving Problem 7.2 for partial revision is NP-complete. Since the complexity of the revision Problem 7.1 is in P [101], it follows that the complexity of partial revision increases substantially when the legitimate states are not specified explicitly. We show this by a reduction from the well-known 3-SAT problem. The 3-SAT instance is specified as follows:

3-SAT Instance. Let x_1, x_2, ..., x_n be propositional variables. Given is a Boolean formula y = y_1 ∧ y_2 ∧ ... ∧ y_M, where each y_j (1 ≤ j ≤ M) is a disjunction of exactly three literals. Does there exist an assignment of truth values to x_1, x_2, ..., x_n such that y is satisfiable?

Since the membership of Problem 7.2 in NP is straightforward, we focus on showing that it is NP-hard.
Hence, we first present the mapping from the 3-SAT instance to the problem of partial revision without explicit legitimate states. Then, we show that the given 3-SAT instance is satisfiable iff the answer to the corresponding instance of partial revision is affirmative.

Mapping 3-SAT to Partial Revision without Explicit Legitimate States

We now present the mapping of an instance of the 3-SAT problem to an instance of the partial revision problem without explicit legitimate states. Recall that this instance consists of the program (specified in terms of its state space and transitions), the safety and liveness specifications, and the faults. We begin with identifying the input program. Then, we identify the faults, and finally we identify the safety and liveness specifications.

The state space of the input program. Corresponding to each variable x_i of the given 3-SAT instance, we introduce eight states P_i, Q_i, R_i, T_i, a_i, b_i, c_i, and d_i, where 1 ≤ i ≤ n (cf. Figure 7.3). For each disjunction y_j, we introduce states Z_j and e_j, where 1 ≤ j ≤ M, in the state space. Thus, the state space of the input program is S_p = {P_i, Q_i, R_i, T_i, a_i, b_i, c_i, d_i | 1 ≤ i ≤ n} ∪ {Z_j, e_j | 1 ≤ j ≤ M}.

Transitions of the input program. Corresponding to each variable x_i, we include the following transitions in the program: (P_i, a_i), (a_i, c_i), (c_i, b_i), (b_i, Q_i), (R_i, b_i), (b_i, d_i), (d_i, a_i), (a_i, T_i), (Q_i, e_j), and (T_i, e_j), where 1 ≤ j ≤ M. Moreover, corresponding to each disjunction y_j, we include the following transitions:

Figure 7.3: Mapping of (x_1 ∨ x_2) ∧ (¬x_1 ∨ ¬x_2) into corresponding program transitions. The transitions in bold show the revised program where x_1 = true and x_2 = false.

- (Z_j, e_j),
- if x_i is a literal in y_j, then we include the transition (e_j, P_i), and
- if ¬x_i is a literal in y_j, then we include the transition (e_j, R_i).

Fault transitions. The fault transitions are f = {(T_i, Z_j), (Q_i, Z_j) | 1 ≤ i ≤ n, 1 ≤ j ≤ M}.

Safety specification Sf_p.
All transitions except those in p ∪ f violate safety.

Liveness specification Lv_p. The liveness specification is P_i ↝ c_i, c_i ↝ Q_i, R_i ↝ d_i, and d_i ↝ T_i, where 1 ≤ i ≤ n.

Reduction from the 3-SAT Problem.

Theorem 7.4.1 The given instance of the 3-SAT problem is satisfiable iff the corresponding instance of the partial revision problem has an affirmative answer for masking fault-tolerance.

Proof. First we prove the ⇒ part, then we prove the ⇐ part.

- ⇒: If the given instance of the 3-SAT problem is satisfiable, then we construct the transitions of the revised program by including the following transitions:
  - (Z_j, e_j), 1 ≤ j ≤ M,
  - if y_j contains x_i and x_i is assigned the truth value true, then (e_j, P_i),
  - if y_j contains ¬x_i and x_i is assigned the truth value false, then (e_j, R_i),
  - if x_i is assigned the truth value true, then (P_i, a_i), (a_i, c_i), (c_i, b_i), (b_i, Q_i), and (Q_i, e_j), 1 ≤ i ≤ n,
  - if x_i is assigned the truth value false, then (R_i, b_i), (b_i, d_i), (d_i, a_i), (a_i, T_i), and (T_i, e_j), 1 ≤ i ≤ n.

  The predicate I', used to show that this program satisfies SPEC, includes all reachable states except {Z_j | 1 ≤ j ≤ M}. It is straightforward to show that the constraints B1 and B2 are satisfied.

- ⇐: The legitimate state predicate of the revised program contains at least one state. Our first step is to show that for some i, Q_i or T_i is included in the legitimate state predicate of the revised program. To show this, we observe that if Z_j, 1 ≤ j ≤ M, is included in the legitimate state predicate for some j, then the corresponding state e_j must also be included in the legitimate state predicate. Hence, the revised program must include at least one transition that begins in e_j. It follows that either P_i or R_i, 1 ≤ i ≤ n, must also be included in the legitimate state predicate. If P_i (respectively, R_i) is included in the legitimate state predicate, then c_i and Q_i (respectively, d_i and T_i) must also be included so that Lv_p is satisfied.
Also, if a_i (respectively, b_i) is included in the legitimate state predicate, then T_i or Q_i must also be included in the legitimate state predicate. From the above discussion, it follows that for some i, Q_i or T_i is included in the legitimate state predicate of the revised program. Now, based on the definition of the faults, all states in {Z_j | 1 ≤ j ≤ M} are reachable in the presence of faults. Hence, the transition (Z_j, e_j) must be included for 1 ≤ j ≤ M in the revised program. Furthermore, some transition originating from e_j must also be included. Transitions from e_j correspond to literals in the disjunction y_j. If a transition of the form (e_j, P_i) is included, then we set x_i to true. If a transition of the form (e_j, R_i) is included, then we set x_i to false. Observe that if P_i is reachable in the revised program, then it must also include (P_i, a_i), (a_i, c_i), (c_i, b_i), and (b_i, Q_i) so that Lv_p is satisfied. And, if R_i is reachable in the revised program, then it must also include (R_i, b_i), (b_i, d_i), (d_i, a_i), and (a_i, T_i). However, if all these transitions are included, then Lv_p will not be satisfied. Therefore, for any i, the revised program cannot reach both P_i and R_i. This implies that the truth value assigned to x_i by any disjunction is the same. Moreover, based on the construction of the instance of the partial revision problem, the truth assignments to the literals cause each clause to be satisfied, i.e., the assignment of truth values to the literals causes the given 3-SAT formula to be satisfiable. □

From the above theorem, it follows that the problem of partial revision without explicit legitimate states is NP-hard. Moreover, in [101], it is shown that the problem of partial revision can be solved in polynomial time if legitimate states are specified explicitly. Thus, it follows that the complexity of partial revision increases substantially when explicit legitimate states are not available.
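The gadget construction used in this reduction can be sketched programmatically. The sketch below is a reconstruction for illustration: clauses are tuples of signed integers (+i for x_i, -i for ¬x_i), states are strings like "P1" and "e2", and all names are assumptions rather than the thesis's encoding.

```python
# Hedged sketch of the 3-SAT gadget: emit states, program transitions,
# and fault transitions for variables 1..n and clauses y_1..y_M.

def build_instance(n, clauses):
    states = {f"{t}{i}" for i in range(1, n + 1) for t in "PQRTabcd"}
    states |= {f"Z{j}" for j in range(1, len(clauses) + 1)}
    states |= {f"e{j}" for j in range(1, len(clauses) + 1)}

    prog, faults = set(), set()
    for i in range(1, n + 1):                   # per-variable gadget
        prog |= {(f"P{i}", f"a{i}"), (f"a{i}", f"c{i}"),
                 (f"c{i}", f"b{i}"), (f"b{i}", f"Q{i}"),
                 (f"R{i}", f"b{i}"), (f"b{i}", f"d{i}"),
                 (f"d{i}", f"a{i}"), (f"a{i}", f"T{i}")}
    for j, clause in enumerate(clauses, start=1):
        for i in range(1, n + 1):
            prog |= {(f"Q{i}", f"e{j}"), (f"T{i}", f"e{j}")}
            faults |= {(f"T{i}", f"Z{j}"), (f"Q{i}", f"Z{j}")}
        prog.add((f"Z{j}", f"e{j}"))
        for lit in clause:                      # e_j branches per literal
            prog.add((f"e{j}", f"P{lit}" if lit > 0 else f"R{-lit}"))
    return states, prog, faults

states, prog, faults = build_instance(2, [(1, 2), (-1, -2)])
assert ("e1", "P1") in prog and ("e2", "R2") in prog
assert ("T1", "Z2") in faults and len(states) == 20
```

Running the revision on such an instance must pick, for each e_j, a literal whose variable gadget can be kept live — exactly the choice that makes the problem NP-hard.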
Intuition behind the increased complexity of partial revision. We analyze the NP-completeness proof to determine why the complexity of partial revision increased substantially. Towards this end, we carefully look at the instance of partial revision generated from the SAT formula. Observe that the fault-intolerant program does not satisfy Lv_p from P_i or R_i, as the program can be stuck in the loop (a_i, c_i), (c_i, b_i), (b_i, d_i), (d_i, a_i). However, removal of some transitions allows P_i (or R_i) to be included as a legitimate state. The increased complexity of partial revision is caused by the need to remove the "right" transitions so that the additional states can be included in the set of legitimate states. Choosing these "right" transitions increases the complexity substantially.

7.4.2 Complexity Comparison for Total Revision

Although the complexity of partial revision increases substantially when legitimate states are not available explicitly, we find that the complexity of total revision effectively remains unchanged. We note that this is the first instance where a complexity difference between partial and total revision has been identified. To show this result, we show that in the context of total revision, Problem 7.2 is polynomial-time reducible to Problem 7.1.

Since the results in this section require the notion of the weakest legitimate state predicate, we define it next. Recall that we use the term legitimate state predicate and the corresponding set of legitimate states interchangeably. Hence, the weakest legitimate state predicate corresponds to the largest set of legitimate states.

Definition. I_w = wlsp(p, Sf_p, Lv_p) is the weakest legitimate state predicate of p for SPEC (= (Sf_p, Lv_p)) iff:
1: p satisfies SPEC from I_w, and
2: ∀I :: (p satisfies SPEC from I) ⇒ (I ⇒ I_w).

Recall from Chapter 6 that we identified the algorithm stpGenerator(p, Sf_p, Lv_p), which computes the weakest legitimate state predicate in polynomial time in the state space of p.
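A greatly simplified, safety-only version of what such a generator computes can be sketched as a greatest-fixpoint computation: repeatedly discard any state that has a bad outgoing transition or a transition leaving the current candidate predicate. This sketch is an assumption-laden illustration — it omits the liveness (leads-to) handling of the full Chapter 6 algorithm and uses explicit sets rather than BDDs.

```python
# Hedged sketch: weakest predicate I_w that is closed under p and from
# which no p-transition is a bad transition (safety only).

def weakest_safe_predicate(states, p, bad_transitions):
    Iw = set(states)
    changed = True
    while changed:
        changed = False
        for s0 in list(Iw):
            # drop s0 if some p-transition from s0 is bad or leaves Iw
            if any((s0, s1) in bad_transitions or s1 not in Iw
                   for (t0, s1) in p if t0 == s0):
                Iw.discard(s0)
                changed = True
    return Iw

# State 1 has a bad transition; state 0 can only reach 1; state 2 is safe.
assert weakest_safe_predicate({0, 1, 2},
                              {(0, 1), (1, 2), (2, 2)},
                              {(1, 2)}) == {2}
```

Because removals only ever shrink the candidate set, the loop terminates after at most |S_p| passes, consistent with the polynomial-time claim above.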
Theorem 7.4.2 If the answer to the decision problem 7.2 (with total revision) is affirmative (i.e., ∃ p_r that satisfies the constraints of Problem 7.2) with input p, Sf_p, Lv_p, and f, then the answer to the decision problem 7.1 (with total or partial revision) is affirmative (i.e., ∃ p' and I' that satisfy the constraints of Problem 7.1) with input p, Sf_p, Lv_p, f, and wlsp(p, Sf_p, Lv_p).

Proof. Intuitively, the program p_r obtained by solving Problem Statement 7.2 can be used to show that Problem 7.1 is satisfied. Specifically, let I_2 be the predicate used to show that p_r satisfies the constraints of Problem 7.2. Then, let p' = p_r and I' = I_2.

Formally, since the answer to the decision problem 7.2 is affirmative, there exists a program p_r that satisfies the constraints of Problem Statement 7.2 (with total revision). Let I_2 denote the predicate used to show that constraints B1 and B2 are satisfied. Let I_w = wlsp(p, Sf_p, Lv_p). To show that the answer to the decision problem 7.1 is affirmative, we need to find p' and I' such that the constraints of Problem Statement 7.1 are satisfied. We let p' = p_r and I' = I_2. Based on constraint B2, p_r satisfies Sf_p and Lv_p from I_2. Also, from constraint B1 (for total revision), p satisfies Sf_p and Lv_p from I_2. Now, we show that constraints A1, A2, and A3 are satisfied.

- Constraint A1: By the definition of the weakest legitimate state predicate, I_2 ⇒ I_w. Thus, constraint A1 is satisfied.
- Constraint A2: Based on constraint B1, constraint A2 is satisfied for both total and partial revision.
- Constraint A3: Based on constraint B2, p_r is fault-tolerant to Sf_p and Lv_p from I_2. Thus, constraint A3 is satisfied. □

Remark: Note that if the phrase 'with total revision' shown in bold in Theorem 7.4.2 is replaced by 'with partial revision', then the corresponding theorem is not valid.

Theorem 7.4.3 For total revision, the revision problem 7.2 is polynomial-time reducible to the revision problem 7.1.

Proof.
Given an instance, say X, of the decision problem 7.2 that consists of p, Sf_p, Lv_p, and f, the corresponding instance, say Y, for the decision problem 7.1 is p, Sf_p, Lv_p, f, and wlsp(p, Sf_p, Lv_p). From Theorems 7.3.1 and 7.4.2, it follows that the answer to X is affirmative iff the answer to Y is affirmative. □

7.4.3 Heuristic for Polynomial Time Solution for Partial Revision

Theorem 7.4.2 utilizes the weakest legitimate state predicate to solve the problem of total revision without explicit legitimate states. In this section, we show that a similar approach can be utilized to develop a heuristic for solving the problem of partial revision in polynomial time. Moreover, if there is an affirmative answer to the revision problem with explicit legitimate states, then this heuristic is guaranteed to find a revised program that satisfies the constraints of Problem 7.2. Towards this end, we present Theorem 7.4.4.

Theorem 7.4.4 For partial revision, the revision problem 7.2 consisting of (p, Sf_p, Lv_p, f) is polynomial-time reducible to the revision problem 7.1, provided there exists a legitimate state predicate I such that the answer to the decision problem 7.1 for the instance (p, I, Sf_p, Lv_p, f) is affirmative.

Proof. Clearly, if an instance of Problem 7.1 has an affirmative answer, then from Theorem 7.3.1, the corresponding instance of Problem 7.2 has an affirmative answer. Similar to the proof of Theorem 7.4.3, we map the instance of Problem 7.2 to an instance of Problem 7.1 where we use the weakest legitimate state predicate. Now, from Theorem 7.3.1, it follows that the answer to this revised instance of Problem 7.1 is also affirmative. □

What the above theorem shows is that even for partial revision, if it were possible to obtain a fault-tolerant program with explicit legitimate states, then it is possible to do so in the same complexity class without explicit legitimate states.
However, there may be instances where the answer to the decision problem 7.1 is negative while the answer to the corresponding decision problem 7.2 is affirmative. For these instances, for partial revision, the complexity can be high.

7.4.4 Algorithm for Model Revision Without Explicit Legitimate States

In this section, we utilize the results in Section 7.4.2 to obtain an algorithm for model revision without explicit legitimate states. In particular, we present the algorithm Add_fs_fr_spec, which adds failsafe fault-tolerance (where safety is satisfied in the presence of faults although liveness may not be) to high atomicity programs (where a program transition can read any number of variables as well as write any number of variables in one atomic step). This algorithm is obtained by combining the algorithm stpGenerator from Chapter 6, which computes the weakest legitimate state predicate, with the algorithm Add_failsafe from [101].

Given the program transitions p, the fault transitions f, and the program specification (Sf_p, Lv_p), the goal of this algorithm is to compute the failsafe fault-tolerant program p_r that satisfies the constraints of Problem Statement 7.2 (with total revision). It first identifies the weakest legitimate state predicate I_w. If p has any state in I_w with no outgoing transitions, we add self-loops at those states. These self-loops help us distinguish between a state where p has no outgoing transitions and a state that becomes a deadlock state because we removed some transitions of p. Then it identifies ms as the states that violate safety or the states from where the execution of one or more fault transitions violates safety (Lines 4-7). Then, the algorithm finds the transitions, mt, of p that reach states in ms as well as transitions of p that violate the safety specification SPEC_bt (Line 8).
If there exist states in I_w such that the execution of one or more fault actions from those states violates the safety specification, then it recalculates I_w by removing those states, producing I'_w (Lines 9-13). In this recalculation, it ensures that all computations of p−mt within I'_w are infinite. In other words, the final value of I'_w is the largest subset of I_w−ms such that all computations of p−mt, when restricted to that subset, are infinite. At this point, if I'_w is empty, the algorithm declares that no failsafe fault-tolerant program can be found. Otherwise, the algorithm removes mt from p to compute p_r, where no program transitions violate the program specification (Line 18). Now, it ensures that all the transitions of p_r that start in a state in I'_w also end in a state in I'_w; if not, such transitions are removed from p_r (Line 19).

Algorithm 15 Add_fs_fr_spec: Addition of Failsafe Fault-Tolerance
Input: program transitions p, fault transitions f, safety specification Sf_p (consisting of SPEC_bs and SPEC_bt), liveness specification Lv_p (consisting of multiple T ↝ T' properties)
Output: failsafe fault-tolerant program p_r.

1:  I_w := stpGenerator(p, Sf_p, Lv_p); // find the legitimate states I_w
2:  Self_loops := {(s0, s0) | s0 ∈ I_w ∧ ∀s1 :: (s0, s1) ∉ p};
3:  p := p ∪ Self_loops;
4:  repeat
5:    ms' := ms;
6:    ms := ms ∪ {s0 :: ∃s1 : (s0, s1) ∈ f ∧ (((s0, s1) ∈ SPEC_bt) ∨ (s1 ∈ ms))};
7:  until (ms = ms')
8:  mt := {(s0, s1) :: (((s0, s1) ∈ SPEC_bt) ∨ (s1 ∈ ms))};
    // compute the largest subset of I_w from where all computations of p are infinite
9:  I'_w := I_w − ms;
10: repeat
11:   I_tmp := I'_w;
12:   I'_w := I'_w − {s0 :: s0 ∈ I'_w : (∀s1 :: s1 ∈ I'_w : (s0, s1) ∉ (p − mt))};
13: until (I'_w = I_tmp)
14: if (I'_w = {}) then
15:   print "No failsafe f-tolerant program p_r exists";
16:   return {};
17: else
18:   p_r := p − mt;
19:   p_r := p_r − {(s0, s1) | s0 ∈ I'_w ∧ s1 ∉ I'_w};
20: end if
21: return p_r − Self_loops;

Remark: Note that since this section focuses on failsafe fault-tolerant programs, there is no recovery requirement for the program in the presence of faults.
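Under explicit-state assumptions, Algorithm 15 can be sketched in executable form. The sketch below is an illustration, not the BDD-based tool: p, f, and SPEC_bt are sets of transition pairs, and the weakest legitimate state predicate I_w is passed in (standing in for the Line 1 call to the Chapter 6 generator) rather than recomputed.

```python
# Hedged explicit-state sketch of Algorithm 15 (Add_fs_fr_spec).

def add_fs_fr_spec(p, f, spec_bt, Iw):
    # Lines 2-3: add self-loops at states of Iw with no outgoing transition.
    self_loops = {(s, s) for s in Iw
                  if not any(s0 == s for (s0, s1) in p)}
    p = p | self_loops

    # Lines 4-7: ms = states from where faults alone can violate safety.
    ms, ms_old = set(), None
    while ms != ms_old:
        ms_old = set(ms)
        ms |= {s0 for (s0, s1) in f
               if (s0, s1) in spec_bt or s1 in ms}

    # Line 8: mt = program transitions that violate safety or reach ms.
    mt = {(s0, s1) for (s0, s1) in p
          if (s0, s1) in spec_bt or s1 in ms}

    # Lines 9-13: largest subset of Iw - ms where p - mt has no deadlock.
    Iw2 = Iw - ms
    while True:
        alive = {s0 for s0 in Iw2
                 if any(s1 in Iw2 for (t0, s1) in p - mt if t0 == s0)}
        if alive == Iw2:
            break
        Iw2 = alive

    # Lines 14-21: fail if empty; else drop mt and transitions leaving Iw2.
    if not Iw2:
        return None
    pr = (p - mt) - {(s0, s1) for (s0, s1) in p
                     if s0 in Iw2 and s1 not in Iw2}
    return pr - self_loops

# A fault from state 1 violates safety, so (0,1) must be cut:
assert add_fs_fr_spec({(0, 0), (0, 1), (1, 0)},
                      {(1, 2)}, {(1, 2)}, {0, 1}) == {(0, 0), (1, 0)}
```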
However, for other levels of fault-tolerance, e.g., nonmasking and masking, where the program needs to satisfy its liveness properties as well, we would need an additional requirement stating that faults eventually stop for a long enough time to ensure that the liveness properties can be met.

Theorem 7.4.5 Algorithm Add_fs_fr_spec is sound, i.e., the output p_r of Add_fs_fr_spec satisfies the constraints of Problem Statement 7.2.

Proof. Let I from Problem Statement 7.2 be instantiated with the value of I'_w at the end of Add_fs_fr_spec. Now, the first constraint of Problem Statement 7.2 is satisfied by construction. Moreover, the satisfaction of the first constraint implies the correctness of p_r in the absence of faults. Regarding the behavior in the presence of faults, we can observe that, by construction, the program does not reach a state in SPEC_bs or execute a transition in SPEC_bt. Moreover, the construction of ms implies that the program does not reach states from where faults can violate the safety specification. Thus, the revised program is failsafe fault-tolerant. □

Theorem 7.4.6 Algorithm Add_fs_fr_spec is complete, i.e., if it declares failure, then there does not exist a fault-tolerant program that satisfies the constraints in Problem Statement 7.2.

Proof. Suppose that a program, say p'', satisfies the constraints of Problem Statement 7.2. Let I'' be the predicate used in demonstrating that p'' satisfies the constraints of Problem Statement 7.2. Now, we show that at any time during the execution of Add_fs_fr_spec, it must be the case that I'' ⊆ I'_w. In particular, on Line 1, this follows from the correctness of the algorithm that computes the weakest legitimate state predicate. On Line 9, this follows from the fact that no state in ms can be a legitimate state, as faults alone can violate safety from those states. Likewise, since I'' cannot have deadlock states, I'' ⊆ I'_w remains true on Line 12. Since the algorithm declares failure when I'_w
= {}, it follows that I'' = {}, which is a contradiction. □

Theorem 7.4.7 The algorithm Add_fs_fr_spec is in P.

Proof. Let us consider the complexity of each statement in Add_fs_fr_spec. (1) From Chapter 6, the complexity of computing the weakest legitimate state predicate is in P. (2-3) The complexity of statements 2 and 3 is clearly in P. (4-7) Calculating ms is in P, as we can use the following algorithm: for each fault transition (s0, s1) such that (s0, s1) violates the safety of SPEC, include s0 in ms. Then, in each iteration, check if there exists a fault transition (s0, s1) such that s0 ∉ ms and s1 ∈ ms; if such a transition exists, add s0 to ms. Since the size of ms increases by at least one in each iteration, the number of iterations is polynomial in the state space S_p. (8) Calculating mt is in P, as we need to check each transition only once. (9) This statement is in P. (10-13) The loop can execute at most |S_p| times. (14-21) The complexity of these statements is clearly polynomial. □

The above result shows that, in the context of failsafe fault-tolerance, when we reduce the designer's burden by not requiring them to identify the legitimate states explicitly, there is no significant penalty in terms of the complexity class of the problem involved or in terms of the soundness and completeness properties of the corresponding algorithms.

7.4.5 Summary of Complexity Results

In Section 7.4.4, we showed that the problem of total model revision for failsafe fault-tolerance is in P. In this section, we list the complexity for other levels of fault-tolerance for both total and partial revision.

Recall from Section 7.4.1 that, for partial revision, the problem of adding failsafe and masking fault-tolerance is NP-complete. For distributed programs, it is shown in [101] that revising the program for adding failsafe and masking fault-tolerance is NP-complete when the set of legitimate states is specified explicitly.
A variation of that proof also works for model revision without explicit legitimate states. Revising the program for adding nonmasking fault-tolerance is in NP; however, it is not known whether it is NP-complete or whether it is in P.

For high atomicity programs, i.e., where a program can read and write all its variables atomically, it is possible to perform total revision in P. To show this, we note that the algorithm Add_fs_fr_spec first identifies the weakest legitimate state predicate. Then it utilizes the set of legitimate states in Add_failsafe (from [101]), which requires that the legitimate states be explicitly specified. Likewise, we can utilize the algorithms Add_nonmasking and Add_masking (from [101]) to obtain the corresponding algorithms for total revision for adding nonmasking and masking fault-tolerance.

In summary, the results for the complexity comparison are as shown in Table 7.1.

                            Revision Without Explicit   Revision With Explicit
                            Legitimate States           Legitimate States
                            Partial      Total          Partial      Total
High        failsafe        ?            P‡             P*           P*
Atomicity   nonmasking      ?            P‡             P*           P*
            masking         NP-C†        P‡             P*           P*
Distributed failsafe        NP-C∆        NP-C∆          NP-C*        NP-C*
Programs    nonmasking      ?            ?              ?            ?
            masking         NP-C∆        NP-C∆          NP-C*        NP-C*

Table 7.1: The complexity of different types of automated revision (NP-C = NP-complete).

Results marked † follow from the NP-completeness results of Section 7.4.1. Results marked ‡ follow from Sections 7.4.2, 7.4.3, and 7.4.4. Results marked ∆ are stated without proof. Results marked ? indicate that the complexity of the corresponding problem is open. Finally, results marked * are from [101].

7.5 Relative Computation Cost (Q.3)

As mentioned in Section 7.1, the increased cost of model revision in the absence of explicit legitimate states needs to be studied in two parts: the complexity class and the relative increase in the execution time. We considered the former in Section 7.4. In this section, we consider the latter.
As we can see from Section 7.4.4, if the legitimate states are not specified explicitly, the increased cost of model revision is essentially that of computing stp(p, Sfp, va). Hence, we analyze the complexity of computing stp(p, Sfp, va) in the context of a case study. We choose the classic example from the literature, namely, Byzantine Agreement [107]. We explain this case study in detail and show the time required to generate the weakest legitimate state predicate for different numbers of processes. This case study illustrates that the increased cost when explicit legitimate states are unavailable is very small compared to the overall time required for the addition of fault-tolerance. In particular, we show that reducing the burden on the designer, by not requiring the explicit legitimate states, increases the computation cost by approximately 1%.

Throughout this section, the experiments were run on a MacBook Pro with a 2.6 GHz Intel Core 2 Duo processor and 4 GB RAM. The OBDD representation of Boolean formulas uses the C++ interface to the CUDD package developed at the University of Colorado [125].

The amount of time required for computing this set of legitimate states for different numbers of processes is shown in Table 7.2. We would like to note that the set of legitimate states computed in these case studies is the same as that used in the addition of fault-tolerance.

We use this case study to illustrate that computing the set of legitimate states to be those that are reachable from initial states is not relatively complete. In particular, for the Byzantine agreement example, the initial state is one where all processes are non-Byzantine and the decision of all non-general processes is equal to ⊥. Clearly, all processes are non-Byzantine in all states reached by the program from these initial states. It follows that recovery to these reachable states is not always feasible in the presence of faults. Hence, these reachable states are insufficient to obtain the fault-tolerant program. By contrast, the weakest legitimate state predicate can be utilized to find the fault-tolerant program.

  No. of      Reachable    Leg. States               Total Revision
  Processes   States       Generation Time (Sec)     Time (Sec)
  10          10^9           0.57                    6
  20          10^15          1.34                    199
  30          10^22          4.38                    1836
  40          10^30          9.25                    9366
  50          10^36         26.34                    > 10000
  100         10^71        267.30                    > 10000

Table 7.2: The time comparison for the Byzantine Agreement program.

7.6 Summary

We devoted this chapter to studying the problem of automated model revision without explicit legitimate states. In particular, we compared performing the revision when the legitimate states are explicitly specified with that when they are not. We considered three different aspects in our comparison: relative completeness, qualitative complexity class comparison, and quantitative change in the time for model revision. We illustrated that our approach for model revision without explicit legitimate states is relatively complete. This is important, since it implies that the reduction in the human effort required for model revision does not reduce the class of problems that can be solved. Additionally, we found some surprising and counterintuitive results. Specifically, for total revision, we found that the complexity class remains unchanged. However, for partial revision, the complexity class changes substantially. Finally, we found that the quantitative change in the time for model revision without explicit legitimate states is negligible.

Chapter 8

Related Work

During the past three decades, the automation of software verification tools has evolved significantly. Currently, verification tools are widely used in several applications. In particular, they are used in the verification of high assurance and mission critical systems, where the consequences of any failure can be catastrophic.
Formal verification of distributed and concurrent programs focuses on the use of mathematical logic and formal methods to verify the correctness of the properties of a specific program. Initially, the focus was on developing techniques to verify full functional correctness. However, most of the tools developed for this purpose were incapable of handling complex systems. This limitation encouraged many researchers to focus on verifying the properties that matter most. In most verification techniques, the system and the desired properties are described via a logical model. The verification algorithm answers yes or no to the question of whether the model satisfies the desired property.

Unlike automated verification techniques, the goal of automated model revision is to automatically revise an existing model to generate a new model that is correct-by-construction. Such a revised model will preserve the existing model's properties as well as satisfy new properties. The basic form of the problem of automated model revision focuses on modifying an existing model, say M, into a new model, say M'. It is required that M' satisfies the new property of interest. Additionally, M' continues to satisfy the existing properties of M using the same transitions that M used.

In this chapter, we briefly review some of the automated verification techniques and discuss their relation to our approach. Currently, there is a wide range of tools available for verifying the correctness of distributed programs. Those tools are based on different techniques, which makes them useful for different types of applications. We believe that no single approach is suitable for the verification of all types of distributed programs. However, some approaches may be more appropriate for some applications than others.
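The basic form described above, M' must satisfy the new property while using only transitions of M, can be illustrated with a toy sketch over explicit transition sets. The interface (`transitions`, `violates_new_property`) is hypothetical; real revision algorithms work symbolically and must additionally handle deadlocks and recovery.

```python
def revise(transitions, violates_new_property):
    """Toy sketch of the basic revision step: M' keeps only those
    transitions of M that do not violate the new (safety) property.
    Since M' is a subset of M's transitions, M' cannot exhibit any
    behavior that M did not already have."""
    revised = {t for t in transitions if not violates_new_property(t)}
    assert revised <= transitions  # M' uses only transitions of M
    return revised
```

For instance, removing the offending transition (1, 2) from a three-state cycle leaves only the original transitions (0, 1) and (2, 0). Note that such pruning may create deadlock states (here, state 1 loses its only outgoing transition), which is exactly why full revision algorithms must perform deadlock resolution afterwards.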
Out of the wide range of available techniques for automated verification of finite state distributed and concurrent programs, we focus on those that are closely related to our approach.

8.1 Model Checking

Model checking is a technique for verifying the correctness of finite state programs. The idea is based on exploring the state space of the program, described using temporal logic, in an efficient manner. In model checking, the program is represented as a Kripke structure, say M, and a formula, say f, represents one of the program's properties. The model checker determines whether M is a model for f, i.e., whether the formula holds.

One of the advantages of the model checking technique is that it provides a push-button approach. Model checking is effective in verifying whether the system meets the desired properties. Furthermore, if the model does not satisfy the property of interest, then the model checker typically provides a counterexample and the corresponding execution trace. Moreover, it supports partial verification, e.g., it does not require the complete specification of the program being verified. Due to this push-button approach, model checking techniques have become very popular for detecting errors in the early stages of design. They have also helped in transferring the formal verification of correctness from research to practice. Next, we briefly review the evolution of model checking tools and techniques.

As early as the 1970s, Tadao Murata and Kurt Jensen started working on the verification of Petri nets; however, no actual verification tools were created prior to 1981 [42]. The initial work on state exploration started in 1980, when Bochmann presented a method for verifying communication protocols [24]. Later, Holzmann presented a technique for automatic protocol verification.
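The state-exploration idea underlying model checking, exhaustively visiting the reachable states of M and checking the property in each, can be sketched for the simplest case of an invariant property. This is an explicit-state sketch with hypothetical names (`initial_states`, `successors`, `invariant`); production model checkers use symbolic representations and richer temporal logics.

```python
from collections import deque

def check_invariant(initial_states, successors, invariant):
    """Explicit-state reachability check. Returns (True, None) if the
    invariant holds in every reachable state; otherwise (False, trace),
    where trace is a counterexample path from an initial state to the
    violating state, as a model checker would report."""
    parent = {s: None for s in initial_states}
    queue = deque(initial_states)
    while queue:
        s = queue.popleft()
        if not invariant(s):
            trace = []            # reconstruct the execution trace
            while s is not None:
                trace.append(s)
                s = parent[s]
            return False, list(reversed(trace))
        for t in successors(s):
            if t not in parent:   # visit each state once
                parent[t] = s
                queue.append(t)
    return True, None
```

On a counter that steps modulo 5, the invariant "state < 5" holds, while "state < 3" fails with the counterexample trace [0, 1, 2, 3], mirroring the push-button yes/counterexample behavior described above.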
Burstall [37], Kröger [99], and Pnueli used temporal logic to describe program behavior, and the proof of correctness was done manually. In their early work on concurrency, E. M. Clarke et al. [45,50,110] focused on fixed point theory and abstract interpretation. They emphasized the connection between Branching Time Logic and Mu-Calculus [62]. Clarke also presented how program text is used to extract the invariant of a given program. In 1980, Emerson and Clarke [62] developed a technique based on branching time logic. Later they adopted the more elegant presentation of temporal logic given in [46]. In a milestone step in the evolution of verification techniques, Clarke, Emerson, and Sistla [43,48] presented the EMC model checker. This was the first model checker that could handle fairness constraints. Although the EMC model checker could only check models with state spaces of size no more than 10^5, it was able to detect errors in several systems. In [63], Emerson and Halpern presented the framework CTL* for investigating the expressive power of temporal logic. Their framework was a combination of branching-time and linear-time operators.

The most significant improvement in model checking came in the early 1990s. To this end, symbolic model checking and partial order reduction were used in building model checkers. McMillan used a symbolic representation based on ordered binary decision diagrams (OBDDs) to develop SMV [112]. The compact representation of the state space and the transition graph made it possible to verify sophisticated programs with very large state spaces [36,112]. Since then, the SMV model checker has been used in verifying several systems. In 2000, a new version of SMV was released [41].

The second important improvement in model checking techniques is the exploitation of partial order reduction of the state space [74]. The basic idea of partial order reduction is as follows.
If two events are independent, then the system will reach the same global state regardless of which event executes first. This way, less space is needed to represent the system, which in turn reduces the effect of the state explosion problem.

Since the early 1990s, many techniques have been developed to extend the capabilities of model checking tools. These techniques include abstraction [80], where the data values of the system, usually a reactive system, are mapped to a smaller set of abstract values; compositional reasoning [8,51,79], where the behavior of a system composed of many similar processes can be represented by a few processes; symmetry reduction [44,117], where the model checker exploits the symmetrical characteristics of the program to obtain a smaller model; and induction and parameterized verification [106,131], where the behavior of the system is represented in a way that can be used for an arbitrary number of processes. The development of more effective methods for program verification has continued over the past few years. It has also resulted in more innovative techniques that handle specific problems in more customized settings.

One application of our approach is to be complementary to existing approaches [36,41,93] for verifying program correctness in the early stages of system design. In particular, the techniques in [98,119,130] aim to identify unacceptable system behavior in order to find the root causes that make the system behave incorrectly. However, these approaches do not address what to do when new faults or bugs are identified. Generally, it is left to the designer to address this with some guidance or with trial and error. Moreover, manual revision has the potential to introduce new errors. Our approach focuses on automating such model revisions.
Therefore, once the model checker identifies an instance where the model does not satisfy the property of interest, we can use automated model revision techniques to automatically revise the existing model (cf. Figure 8.1). The revised model will continue to satisfy the original properties as well as the new property. Such automated revision is highly desirable, since it enables system designers to automatically and incrementally add properties to their models. Some of the advantages of this approach are that the revised model is correct by construction and there is no need to re-verify it. Also, the original model's properties are preserved. Furthermore, there is a potential for this approach to require less time and space, since it does not require the revision of the entire model specification.

Figure 8.1: Model Checking and Automated Model Revision.

In another context, we have adopted, in the development of our model revision tools, many techniques that were used to advance the development of better model checking tools. For example, one way to reduce the complexity further is to integrate advances from model checking, as incremental synthesis involves several tasks that are also considered in model checking. We considered two approaches from model checking: (1) the use of symmetry and (2) parallelizing the algorithm across multiple processors/cores.

8.2 Controller Synthesis and Game Theory

Our work is closely related to the work on controller synthesis (e.g., [16,17,32]) and game theory (e.g., [70]). In this work, supervisory control of real-time systems has been studied under the assumption that the existing program (called a plant) and/or the given specification is deterministic. In particular, Jobstmann, Griesmayer, and Bloem [96] used an approach based on concepts from game theory.
They presented the problem of program repair as a Büchi game played between two players. They modeled the program and its environment as the two players. More specifically, the program takes a move in response to a move taken by the environment.

Our formulation of automated model revision is similar to that used by Ramadge and Wonham [122] in the discrete controller synthesis problem. In both approaches, the goal is to restrict the program actions to the desired behaviors. These techniques require highly expressive specifications. Hence, the complexity is also high (EXPTIME-complete or higher). In addition, these approaches do not address some of the crucial concerns of fault-tolerance (e.g., providing recovery in the presence of faults) that are considered in our work.

8.3 Model Revision and Automated Program Synthesis

In this section, we review the history and the evolution of automated model revision techniques [101]. We also show how the work in this dissertation relates to previous work in this regard.

Automated program synthesis and revision have been studied from various perspectives. Inspired by the seminal work by Emerson and Clarke [64], Arora, Attie, and Emerson [11] proposed an algorithm for synthesizing fault-tolerant programs from CTL specifications. Their method, however, does not address the issue of adding fault-tolerance to existing programs. Kulkarni and Arora presented automated algorithms for the addition of fault-tolerance to centralized programs as well as distributed programs. Their approach depends on the existence of an original program that is correct in the absence of faults, i.e., the existing program satisfies its specification as long as no faults occur. Their goal is to modify (i.e., revise) the existing program and generate a modified (i.e., revised) version of the program such that the revised program is fault-tolerant and does not introduce any new behavior in the absence of faults [101]. The authors also analyzed the complexity of adding fault-tolerance in different settings. We used some of their results in the table in Section 7.4.5. For instance, they proved that the problem of automated addition of masking fault-tolerance is NP-complete.

Kulkarni, Arora, and Chippada [102] developed a polynomial time algorithm for automated synthesis of fault-tolerant distributed programs. Since this problem was proven to be NP-hard in [101], the authors presented an algorithm that relies on heuristics to reduce the complexity. Moreover, they demonstrated that the algorithm suffices to synthesize an agreement program that tolerates a Byzantine fault.

In their effort to automate the synthesis of fault-tolerant programs, Ebnenasir and Kulkarni developed a framework called the Fault-Tolerance Synthesizer (FTSyn) [60]. The FTSyn framework implemented most of the heuristics that had been proposed to synthesize fault-tolerant programs. The main reasons for developing FTSyn were to validate the theoretical results as well as to provide developers with an interactive tool for automated synthesis. The authors used FTSyn to synthesize several fault-tolerant distributed programs. For instance, they used FTSyn to synthesize an altitude switch that controls the altitude of an aircraft. The input to FTSyn is an abstract program consisting of a set of processes described in a guarded command language, and the output is a masking fault-tolerant program, also in guarded commands. The authors used FTSyn to demonstrate the applicability of their approach and also to show that, with automation, it can be applied to cases with different types of faults. However, similar to other enumerative implementations, FTSyn was subject to the state explosion problem and was only suitable for synthesizing small programs.

Recently, Bonakdarpour and Kulkarni presented a symbolic implementation of the synthesis algorithm [27,30]. In their tool (SYCRAFT), the components of the synthesis algorithm are constructed using Boolean formulae represented by Bryant's Ordered
However, similar to other enumerative implementations, FT Syn was subject to the state explosion problems and was only suitable for synthesizing small programs. Recently, Bonakdarpour and Kulkarni presented a symbolic-based implementation for the synthesis algorithm [27,30]. In their tool (SYCRAFT), the components of the syn- thesis algorithm are constructed using Boolean formulae represented by Bryants Ordered 169 Binary Decision Diagrams [33]. This was the first time where moderate to large sized pro- grams (a state space of 1050 and beyond) have been synthesized. Although, both FTSyn and SYCRAFT implement similar synthesis heuristics from [102], there are several differ- ence between them. For instance, the symbolic representation made SYCRAFT capable of handling programs with larger state space. Moreover, the grammar of the input lan- guage of SYCRAFT has more constructs which can assist the designer in describing the abstract program. Also, one of the characteristics of SYCRAFT is that it describes the out- put in an optimized representation. Using SYCRAFT, authors also have identified several bottlenecks that can slow down the synthesis. In particular, they identified the following bottlenecks: the deadlock resolution, computation of recovery action, computation of the fault-span and the cycle resolution. In this dissertation, we focused on two major complex- ity obstacles in deadlock resolution, namely computation of the recovery actions and the deadlock elimination. We used parallelism and symmetry to overcome these bottlenecks. Our work in this dissertation is closely related to the tool SYCRAFT. In particular, we have implemented most of the techniques we presented in this dissertation and added them to SYCRAFT. 8.4 Parallelization and Symmetry In the model checking community, various techniques have been proposed to implement the symbolic state space generation and exploration using parallel computing. 
Some of those approaches targeted the state explosion problem by focusing on data parallelism, distributing the computation among a group of workstations, e.g., networks of workstations (NOWs) [77,78,92,115,126]. Their goal was mainly to provide more memory resources to handle the expanding state space. Obviously, speed was not the issue here, and time complexity was not the target. Others focused on enhancing time-efficiency by using parallelism. For this group, the goal was to use the ever-expanding parallel infrastructure of multi-core PCs and multi-processor platforms to expedite model checking. Most notable was the work on parallelizing the Saturation algorithm [39]. Unfortunately, symbolic state exploration has proven to be notoriously resistant to parallelization.

In [66,67,69], the authors propose solutions and analyze different approaches to the parallelization of saturation-based state space generation in model checking. In particular, in [67], the authors show that in order to gain speedups in saturation-based parallel symbolic verification, one has to pay a penalty in memory usage of up to 10 times that of the sequential algorithm. Other efforts range from simple approaches that essentially implement BDDs as two-tiered hash tables [115,127] to sophisticated approaches relying on slicing BDDs [78] and techniques for work stealing [77]. However, the resulting implementations show only limited speedups. Ezekiel, Lüttgen, and Siminiceanu [68] argue that a heavily optimized symbolic algorithm such as Saturation may be more efficient than a parallel version of the same algorithm.

Ebnenasir presented a divide-and-conquer method [58] for synthesizing failsafe fault-tolerant distributed programs. In failsafe fault-tolerance, the program is not required to maintain any liveness requirements when faults occur. Therefore, resolving deadlock states in the fault-span is not needed.
In this dissertation, we focused on two major complexity obstacles in deadlock resolution, namely the computation of the recovery actions and the deadlock elimination. We used parallelism and symmetry to reduce the time complexity. Our work utilizes parallelization of the group computation as well as symmetry for expediting automated model revision. Unlike other parallelization algorithms for symbolic representations of models, we were able to achieve speedups of up to multiple orders of magnitude. By focusing on parallelizing the group operation, we were able to harness the benefits of the multi-core infrastructure.

8.5 Nonmasking and Stabilizing Fault-Tolerance

Automated program synthesis has been studied from different perspectives. One approach (e.g., [11]) focuses on synthesizing fault-tolerant programs from their specification in a temporal logic (e.g., CTL, LTL, etc.). Our approach for adding nonmasking and stabilizing fault-tolerance is based on satisfying constraints that should be true in legitimate states.

In masking fault-tolerance, when faults occur, the program cannot violate the safety property during recovery. Therefore, this approach will not be able to synthesize nonmasking fault-tolerant programs, where safety can be violated during recovery. Furthermore, while our algorithm accounts for weak-fairness among program actions and allows recovery actions to be added under this assumption, the heuristic-based approach does not account for fairness assumptions.

Katz and Perry [97] proposed an algorithm to extend an arbitrary asynchronous distributed message-passing system into a self-stabilizing system. They also gave a formal definition of the self-stabilizing extension of a non-stabilizing program, and they defined the set of properties that must be maintained by the new extension. Their algorithm superimposes a control program on the original non-stabilizing program.
The control program repeatedly takes a global snapshot and then checks whether the snapshot indicates an illegal state. If an illegal state is found, the control program resets the memory of each process to a legal default state.

Arora, Gouda, and Varghese [13] proposed a manual approach to designing nonmasking fault-tolerant programs. In this approach, a program is intended to satisfy a set of constraints during normal operation (i.e., no faults). Program actions are categorized into "closure" actions and "convergence" actions. When faults occur and violate one or more of the program constraints, convergence actions are responsible for correcting program behavior and reestablishing those constraints. This method, however, does not address the issue of automated addition of nonmasking fault-tolerance to existing fault-intolerant programs.

Our approach for adding nonmasking fault-tolerance and self-stabilization is based on satisfying constraints that should be true in legitimate states. An orthogonal approach is to utilize primitives such as distributed reset [97], where one detects whether the system is in a consistent state and resets it to a legitimate state if needed. Examples of these approaches include [97,128]. Our approach can be utilized to design the distributed reset protocol itself.

The verification of self-stabilizing properties has been studied by several researchers. One method to verify the correctness of self-stabilizing algorithms is mechanical theorem proving. In [121], Qadeer and Shankar used PVS [118] to verify the correctness of Dijkstra's algorithm. Another approach to verifying self-stabilizing algorithms uses model checking. In [129], Tsuchiya et al. applied CTL symbolic model checking techniques to verify several distributed algorithms against self-stabilization properties. They used SMV [113] to overcome the state explosion problem. They showed that the state space can be efficiently reduced using OBDDs.
However, they concluded that their approach is applicable only when the number of processes is modest.

8.6 Legitimate States Discovery

Several techniques have been developed to verify program correctness [35,36,47,89,93,113]. For most of these methods, the program is translated into a logical formula that describes the program's behavior and properties. Then, tools are used to verify the correctness of the program. For many of these tools, identifying the program's legitimate states (i.e., legal or invariant states) is an essential step. Several approaches have been proposed to improve the automatic generation of the legitimate states [19,20,23,109,116]. These methods can be broadly classified as either top-down or bottom-up approaches. The top-down approach starts with the weakest possible invariant and uses the program specification to strengthen that invariant. The bottom-up approach performs forward propagation of the program actions to derive the invariant. Our algorithm is a top-down approach, since it starts by initializing the largest set of legitimate states to be the whole state space and later removes states that violate the predefined safety and liveness specifications.

Rustan, Leino, and Barnett [19,109] presented methods for forming an efficient weakest precondition to enhance the performance of verification tools like ESC/Java and ESC/Modula-3. Their goal is to simplify the presentation of the weakest precondition to avoid redundancy and to avoid exponential growth of the condition size. Our definition of the largest set of legitimate states is equivalent to their definition of the weakest conservative preconditions, in which the execution of a program statement does not go wrong and terminates. However, in their work they address the problem of redundancy in describing such conditions, while we focus on the automatic generation of such conditions from the program specification.
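The top-down computation described above can be sketched as a pruning fixpoint over an explicit state space. This is a simplified illustration with hypothetical names; as one approximation of liveness, it requires every legitimate state to have a legal successor that is itself legitimate (ruling out deadlocks), whereas the dissertation's algorithm works on symbolic predicates against the full safety and liveness specification.

```python
def weakest_legitimate(states, transitions, safe_state, safe_transition):
    """Top-down sketch: start from the full state space, keep only
    safety-satisfying states, then repeatedly remove states that have
    no legal outgoing transition remaining inside the set (deadlocks),
    until a fixpoint is reached. The result is the largest such set."""
    inv = {s for s in states if safe_state(s)}
    legal = {(s, t) for (s, t) in transitions if safe_transition((s, t))}
    while True:
        pruned = {s for s in inv
                  if any(t in inv for (u, t) in legal if u == s)}
        if pruned == inv:
            return inv
        inv = pruned
```

For example, over states {0, 1, 2, 3} with transitions 0↔1 and 2→3, where state 3 violates safety, state 3 is removed first, then state 2 (its only successor is gone), leaving the largest closed set {0, 1}. Starting from the whole state space guarantees the result is the weakest (largest) predicate, matching the top-down characterization above.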
Jeffords and Heitmeyer [94,116] described an algorithm to automate the generation of the invariant. Their technique derives the invariant from propositional formulas obtained from SCR tables. Their algorithm is intended for detecting errors at early stages of program design. By contrast, our algorithm is intended to discover the largest set of legitimate states of programs assumed to be correct, for the purpose of adding fault-tolerance to such programs.

The accurate and complete identification of the legitimate states is an essential step that enables designers to apply the algorithms and tools for automated model revision of fault-tolerant programs from fault-intolerant programs [27,30,101,111]. Unlike traditional approaches, which require the explicit specification of the legitimate states, our approach does not; instead, it generates the largest set of legitimate states from the program transitions and specification. Therefore, it significantly improves and simplifies the process of automated addition of fault-tolerance. Furthermore, our approach is relatively complete when compared to traditional approaches. Moreover, it does not introduce any significant cost.

Chapter 9

Conclusion and Future Work

In this dissertation, we focused on the problem of automated model revision. We derived theories, developed algorithms, and built tools to make model revision more comprehensive, efficient, and designer-friendly. In particular, we reduced the automated model revision learning curve by utilizing existing design tools. We also developed algorithms and tools to apply model revision in adding new types of fault-tolerance properties and to automate the generation of the legitimate states of the original model. Finally, we utilized both symmetry and parallelism to speed up the automated revision and to overcome its bottlenecks, reducing its time complexity.
In this chapter, we present a summary of our contributions. In Section 9.1, we summarize the contributions of this dissertation. Then, in Section 9.2, we list some future research directions.

9.1 Contributions

This dissertation makes four main contributions:

1. Reducing the Learning Curve of the Automated Model Revision: To reduce the learning curve of automated model revision, we focused on utilizing existing design tools. We combined the automated model revision tool SYCRAFT with the SCR toolset. To achieve successful coupling, we developed a middle layer that translates the SCR specification into SYCRAFT input as well as SYCRAFT output back to SCR. Thus, our approach gives designers the ability to perform the tasks of model revision under-the-hood (i.e., while working within the SCR toolset). In this way, they do not need to know all the details required to perform automated model revision.

We expect that the ability to add fault-tolerance under-the-hood is especially useful, as it allows designers to continue to use the design tools they were already using. This reduces the learning curve of the model revision techniques. In the context of SCR, this is especially useful, since the SCR toolset has already been adopted by industry and is used in the development of many real world applications. Furthermore, the SCR toolset integrates several tools for consistency checking, verification, etc. Since a synthesized fault-tolerant SCR specification can be viewed and modified using the SCR toolset, one can analyze the revised fault-tolerant SCR specification for various other properties.

With case studies, we showed that, for our approach to be effective, certain changes need to be made to the SCR interface. In particular, we demonstrated that the SCR toolset would have to be modified to include the description of faults. However, we showed that the changes required for describing faults in the SCR toolset are straightforward.
In particular, the faults themselves could be represented using tables. We also demonstrated that the designer needs to specify the requirements that should be met in the presence of faults. Once again, this is similar to how other requirements (not related to fault-tolerance) are specified in the SCR toolset. These changes to the SCR toolset are reasonable in that they essentially require the designer to specify what the faults are and what the requirements for fault-tolerance in the presence of faults are.

Additionally, automated revision with SYCRAFT also provides the possibility of detecting errors in the requirements themselves. In particular, one can identify errors caused by a missing requirement on how recovery can be added. Since SYCRAFT tries to provide maximum non-determinism in the revised program, if a requirement is missing, then there is a high potential that it will be detected. Therefore, this approach provides the ability to reduce cost, since it detects errors and missing specifications early in the design stage.

2. Automating the Discovery of the Legitimate States: To further reduce the effort required by the designer in automated model revision, we focused on generating one of the inputs, the legitimate states, automatically. In particular, the inputs to the model revision algorithms include: (1) the existing model, (2) the specification of the model, (3) the faults, and (4) the legitimate states of the original model. Clearly, specifying the existing model is unavoidable. Moreover, identifying it is easy, as model revision is expected to be used in contexts where designers already have an existing model. The specification is also already available to the designer when model revision is used in contexts where the existing model fails to satisfy the desired specification. Likewise, the new property that is to be added to the existing model is also easy to identify. In the context of fault-tolerance, this requires the designers to identify the faults that need to be tolerated.

Based on our experience, the hardest input to identify is the set of legitimate states from where the original model satisfies its specification. In part, this is because identifying these legitimate states explicitly is often not required during the evaluation of the original model. Hence, we focused on the problem of automated model revision of an existing model without the use of explicit legitimate states. Moreover, as shown by the example in Section 5.5, typical algorithms for computing legitimate states based on initial states do not work in the context of automated model revision.
In the context of fault-tolerance, this requires the designers to identify the faults that need to be tolerated. Based on our experience, the hardest input to identify is the set of legitimate states from which the original model satisfies its specification. In part, this is because identifying these legitimate states explicitly is often not required during the evaluation of the original model. Hence, we focused on the problem of automated model revision of an existing model without the use of explicit legitimate states. Moreover, as shown by the example in Section 5.5, typical algorithms for computing legitimate states based on initial states do not work in the context of automated model revision.

We presented an algorithm for automated discovery of the weakest legitimate state predicate of the given program. Our algorithm uses the program actions and specification to automatically generate the weakest legitimate state predicate.

To evaluate this algorithm, we compared automated model revision when the legitimate states are explicitly specified with automated model revision when they are not. We considered three questions in this context: (1) relative completeness, (2) qualitative complexity-class comparison, and (3) quantitative change in the time for model revision. We showed that our approach for model revision without explicit legitimate states is relatively complete, i.e., if model revision can be solved with explicit legitimate states, then it can also be solved without them. This is important because it implies that the reduction in the human effort required for model revision does not reduce the class of problems that can be solved.

Regarding the second question, we found some surprising and counterintuitive results. Specifically, for total revision, we found that the complexity class remains unchanged. However, for partial revision, the complexity class changes substantially.
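In explicit-state form, the idea behind the weakest-legitimate-state algorithm summarized above can be sketched as a greatest-fixpoint pruning loop: start from the states allowed by the safety specification and repeatedly discard states that have no successor inside the candidate set. This is a simplification; the actual algorithm operates symbolically on BDDs and accounts for the full specification.

```python
def weakest_legitimate_predicate(states, transitions, safe):
    """Greatest-fixpoint sketch: the weakest legitimate state predicate is
    the largest set of safe states in which every state still has a
    successor inside the set (i.e., no deadlocks within it)."""
    legit = {s for s in states if safe(s)}
    changed = True
    while changed:
        changed = False
        for s in list(legit):
            # drop states whose every outgoing transition leaves the set
            if not any(dst in legit for (src, dst) in transitions if src == s):
                legit.discard(s)
                changed = True
    return legit
```

For example, with states {0, 1, 2, 3}, transitions {(0, 1), (1, 0), (2, 3), (3, 3)}, and state 3 unsafe, the loop first removes 3 and then 2 (whose only successor was 3), leaving {0, 1}.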
In particular, we showed that problems that can be solved in P when legitimate states are available explicitly become NP-complete when explicit legitimate states are unavailable. This result is especially surprising since it is the first instance where the complexity levels for total and partial revision have been found to differ. Even though the general problem of partial revision becomes NP-complete without explicit legitimate states, we identified a subset of these problems that can be solved in P. Specifically, this subset includes all instances where model revision is possible when legitimate states are specified explicitly.

Regarding the third question, we showed that the extra computation cost incurred by reducing the human effort for specifying the legitimate states is negligible. Towards this end, we considered four case studies: Byzantine agreement, mutual exclusion, token ring, and diffusing computation. In each of these examples, the generated set of legitimate states was the same as the one specified explicitly in the automated addition of fault-tolerance. Moreover, the time to generate the legitimate states was negligible (less than 1%) compared with the time for performing the corresponding addition of fault-tolerance. We have also integrated automated revision without explicit legitimate states into the tool SYCRAFT. We note that this result can be extended to other model revision problems where one adds safety properties, liveness properties, or timing constraints.

3. Exploiting Parallelism and Symmetry to Expedite the Automated Model Revision: Another contribution of this dissertation is directed towards making automated model revision more efficient. Specifically, we worked on improving the performance of automated model revision when synthesizing fault-tolerant programs from their fault-intolerant versions.
Towards this end, we developed techniques that utilize (1) multi-core processors and (2) the symmetry among the processes of the program being revised to expedite the automated model revision.

In the case of parallelism, we focused on one of the main complexity barriers, the resolution of deadlock states, in automated model revision to add fault-tolerance to distributed programs. Our approach was based on parallelization with multiple threads on a multi-core architecture. We considered parallelization in two scenarios: (1) adding recovery transitions, and (2) eliminating deadlock states. Our approach provides each thread its own copy of the shared variables. Although this has the potential to increase memory usage, automated model revision problems generally have a higher time complexity than the corresponding verification problems. Hence, we expect the automated model revision algorithm to run out of time before it runs out of memory, and the increased space complexity is therefore unlikely to be the bottleneck during revision.

Initially, we showed that the approach of partitioning deadlock states provides a small improvement. The approach based on parallelizing the group computation (which is caused by the distribution constraints of the program being synthesized) provides a significant benefit that is close to the ideal, i.e., a speedup equal to the number of threads used. Additionally, we demonstrated that there is potential for superlinear speedup, because partitioning the group computation reduces the size of the corresponding BDDs. Since the configuration used to evaluate performance was an 8-core machine (4 dual-core processors), we considered cases with up to 16 threads. We found that as the number of threads increases, the revision time decreases. In fact, because the parallelism is fine-grained, using more threads than available cores can improve performance slightly.
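The thread-per-partition scheme described above can be sketched as follows. The recovery rule and data layout are simplified placeholders: the real algorithm adds recovery transitions symbolically and must also respect safety and distribution constraints.

```python
from concurrent.futures import ThreadPoolExecutor

def add_recovery_parallel(deadlock_states, legitimate_states, n_threads=4):
    """Partition the deadlock states among threads; each thread works on
    its own slice with its own local data, proposing recovery transitions
    from deadlock states into the legitimate-state set."""
    ordered = sorted(deadlock_states)
    chunks = [ordered[i::n_threads] for i in range(n_threads)]
    target = min(legitimate_states)  # toy recovery target

    def recover(chunk):
        # each thread returns its own result list; no shared mutable state
        return [(s, target) for s in chunk]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(recover, chunks)
    return sorted(t for part in parts for t in part)
```

Giving each thread its own data, as in the dissertation's approach, trades extra memory for the absence of synchronization on shared structures.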
This demonstrates that we have not yet reached the bottleneck involved in parallelization. Furthermore, there is potential for further reduction in revision time if the level of parallelism is increased (e.g., if more processors are available). Although the parallelism is fine-grained, we showed that the overhead of parallel computation is small.

In the case of symmetry, we showed that symmetry provides a substantial benefit in reducing the time involved in the revision. More specifically, we observed that multiple processes in a distributed program are symmetric in nature, i.e., their actions are similar except for the renaming of variables. Thus, if our algorithm finds recovery transitions for one process, it utilizes symmetry to identify the corresponding recovery transitions that should also be included for the other processes in the system. Likewise, if some transitions of a process violate safety in the presence of faults, it identifies the similar transitions of other processes that would also violate safety. Since the cost of identifying these similar transitions using knowledge of the symmetry among processes is less than the cost of identifying them explicitly, the use of symmetry reduces the overall time required for the revision. Moreover, the speedup increases as the number of symmetric processes increases.

4. Automating the Model Revision to Add Nonmasking and Stabilizing Fault-Tolerance: Tools for automated model revision need to be comprehensive and include techniques to automate the addition of different levels of fault-tolerance. In this dissertation, we therefore also focused on automated revision to add nonmasking and stabilizing fault-tolerance to hierarchical distributed systems. In particular, we considered systems where the legitimate states are specified in terms of constraints that hold in those states.
The goal of adding nonmasking and stabilizing fault-tolerance was to ensure that if these constraints are violated by faults, then the program eventually reaches a state where all the constraints are satisfied, and its subsequent behavior is correct.

Our approach was to utilize an order among the constraints. With this order, we ensured that correction actions that correct constraint C_i do not violate any of the previous constraints C_0, C_1, ..., C_{i-1}, although they may violate constraints C_j with j > i. In our case studies from Chapter 5, we considered different possible orderings, and in most cases we were able to synthesize a nonmasking fault-tolerant program. Therefore, identifying an order among these predicates does not appear to be a critical concern. Moreover, the number of orderings that need to be considered for a group of n constraints is at most O(n^2). Finally, we find that this approach is especially suited for synthesizing stabilizing programs, since it eliminates one of the bottlenecks of the automated revision (evaluating the fault-span).

We also focused on improving the revision that produces nonmasking and stabilizing fault-tolerant programs from their fault-intolerant versions. We showed that the use of multi-core technology to parallelize the revision algorithm reduces the revision time substantially. We parallelized constraint satisfaction by (1) partitioning the constraints and (2) utilizing the nature of distributed programs, and we showed that this parallelism provides a substantial benefit in reducing the time needed for the revision. We illustrated our approach with three case studies: stabilizing mutual exclusion, stabilizing diffusing computation, and a data dissemination problem for sensor networks. The complexity analysis demonstrated that automated model revision in these case studies was feasible and was achieved in a reasonable time, with a speedup in all case studies.
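The ordered-correction discipline described above can be sketched as a simple loop: repeatedly find the lowest-indexed violated constraint and apply its correction action. The state encoding and correction actions below are hypothetical; the guarantee that correcting C_i preserves C_0, ..., C_{i-1} comes from choosing a suitable order, not from the loop itself.

```python
def stabilize(state, constraints, corrections, max_steps=1000):
    """Apply corrections in constraint order: always fix the lowest-indexed
    violated constraint C_i first. By construction of the ordering, its
    correction may only disturb constraints C_j with j > i."""
    for _ in range(max_steps):
        i = next((k for k, c in enumerate(constraints) if not c(state)), None)
        if i is None:
            return state  # all constraints hold: a legitimate state
        state = corrections[i](state)
    raise RuntimeError("failed to stabilize within max_steps")

# toy example: state (x, y) with C_0: x == 0 and C_1: y == x
cs = [lambda s: s[0] == 0, lambda s: s[1] == s[0]]
fix = [lambda s: (0, s[1]), lambda s: (s[0], s[0])]
```

Starting from (5, 3), the loop first corrects C_0 (giving (0, 3)), then C_1 (giving (0, 0)), at which point every constraint holds.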
Furthermore, since our work builds on the constraint-based (manual) design of nonmasking and stabilizing fault-tolerance from [13], which has been found useful in deriving several protocols manually (e.g., [73, 75, 128]), we expect it to be highly valuable for automatically designing various stabilizing and nonmasking programs. We also showed that the hierarchical nature of the underlying system can be effectively utilized to reduce the complexity of synthesizing programs with a larger number of processes, while maintaining the correct-by-construction property of programs designed by automated model revision.

This work also advances the state of the art of automated model revision in yet another way. To our knowledge, this is the first instance where automated model revision to add fault-tolerance is achieved with fairness constraints. Without fairness constraints, a stabilizing mutual exclusion algorithm based on [124] is impossible. Moreover, the structure of the recovery actions in the first two case studies is too complex to be handled by previous heuristic-based approaches [30].

9.2 Future Research Directions

During our work on automated model revision, we identified several possible directions for future work. Some of these are listed below.

In Chapter 3, we identified the requirements for completing the revision under-the-hood, and we developed a middle layer that translates the SCR specification into the SYCRAFT specification. One future research direction in this context is to develop an enhanced version of the middle layer; in particular, a more generic middle layer capable of handling several types of specifications other than SCR. We believe that many activities of automated model revision are not user-centric and do not require direct involvement of the user. Furthermore, many software solutions require modification of some of the software's properties at several stages of the software life cycle.
Moreover, in many cases such software modification must be completed in an expedited fashion. These requirements make the ability to perform automated model revision under-the-hood appealing for many design tools. Hence, one future research direction in this context is investigating the possibility of integrating automated model revision into other design tools, such as Simulink [52] and Rational Rhapsody [81, 82]. The enhanced middle layer would also include a complete description of the input and output fields, allowing other developers and researchers to link their design tools with SYCRAFT.

Time complexity is one of the important factors in successful automated model revision. One future research direction in this context is to combine other advances from program verification. We expect that combining these advances with characteristics of distributed systems (e.g., forward reachability analysis, hierarchical behavior, and the types of expected faults) would be extremely beneficial; specifically, it would make the automated revision of practical distributed programs to add new properties more feasible.

In Chapter 4, we listed some of the factors that contribute to the time complexity of automated model revision. Of these, the deadlock resolution problem is a unique bottleneck that does not exist in other verification methods. However, we recognize that there are other bottlenecks (e.g., forward reachability analysis) that are shared with other verification techniques. Hence, one piece of future work in this context is to incorporate techniques such as partitioning [35], clustering [123], and saturation-based reachability analysis [39, 40] into automated model revision tools. We expect these techniques to improve the computation of many constructs in our tool.

In Chapter 4, we also identified the importance of the group computations in automated model revision.
In particular, we found that the revision time is often dominated by computing such groups. Also, since the group computation is caused by the distribution constraints of the program being synthesized, as discussed in Chapters 4 and 5, it is guaranteed to be required even with other techniques for expediting automated model revision. One piece of future work is to combine the group parallelism with the techniques that partition the deadlock states among the available threads. In particular, as discussed in Chapter 4, the parallelism that partitions the deadlock states is coarse-grained; however, it can permit threads to perform inconsistent actions that need to be resolved later. Thus, it provides a tradeoff between the overhead of synchronization among threads and the cost of resolving such inconsistencies. Hence, even when a large number of cores is available, this approach would be valuable together with other techniques that utilize those additional cores. Another future research direction is to explore other approaches to expedite the group computation; for example, it could be used in conjunction with the approach that utilizes symmetry among the processes being synthesized.

Another possible piece of future work is developing more efficient algorithms for computing the groups. Due to the distributed nature of the programs being revised, the group associated with a given transition is most likely computed several times. Such repeated computation is not really necessary: the group associated with a given transition is fixed and does not change during the revision. Therefore, one approach for reducing the time required for computing the groups is as follows. In the initialization stage of the revision algorithm, we compute the groups associated with all the transitions of the program and store them in an efficient data structure.
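This precompute-and-store scheme amounts to straightforward memoization, as sketched below. Here `compute_group` stands in for the expensive, BDD-based group computation, and the toy grouping rule (flipping a hypothetical "unreadable" low-order bit of each state) is purely illustrative.

```python
def precompute_groups(transitions, compute_group):
    """Initialization stage: since the group of a transition is fixed by
    the distribution constraints, compute it once per transition and
    store it for O(1) lookup during the revision."""
    return {t: compute_group(t) for t in transitions}

calls = {"n": 0}

def compute_group(t):
    calls["n"] += 1  # count how often the expensive step really runs
    pre, post = t
    return frozenset({t, (pre ^ 1, post ^ 1)})  # toy grouping rule

groups = precompute_groups([(0, 1), (2, 3)], compute_group)
for _ in range(1000):      # during revision: repeated lookups...
    g = groups[(0, 1)]     # ...retrieve from the store, never recompute
```

After initialization, a thousand lookups still trigger only two invocations of the expensive computation, which is exactly the tradeoff (time for memory) discussed above.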
Later, during the revision, whenever the group associated with a given transition is required, it is retrieved from this store. We expect this approach to significantly reduce the time complexity of the revision. However, it may require more memory, at which point some tradeoffs will need to be made to select the appropriate choices. We also expect that integrating our implementation with a SAT or SMT (satisfiability modulo theories) solver would be beneficial. In SMT solvers, one can use richer types, such as abstract data types, integers, and reals, in formulae that involve arithmetic and quantifiers.

In our automated model revision tools, we used BDDs to efficiently represent the model being revised. However, the level of efficiency depends on the order in which we choose to list the variables of the model. Traditionally, such ordering is done manually, based on heuristics, to minimize the space required to describe the model. Such a manual approach is sufficient for other program verification approaches (e.g., model checking), since in verification the model itself does not change, and the initial order chosen for the variables therefore stays valid. Unlike verification, in model revision the model is modified: transitions can be removed if they violate safety, and transitions might be added to achieve recovery. Consequently, the initial order of the variables may need to change during the revision. One interesting piece of future work is to look for solutions where the order of the variables is dynamic and changes during the revision.

Distributed programs often consist of processes with similar structure. In Chapter 4, we developed some simple yet effective techniques that utilize symmetry to expedite the revision, and we demonstrated that the use of symmetry can dramatically lower the time required for automated model revision.
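As a concrete illustration of how symmetry is exploited, the sketch below takes one recovery transition found for process 0 and derives the analogous transition for every other process by rotating the tuple of per-process local states. This assumes fully symmetric processes whose actions differ only by index renaming, as in a ring; the actual algorithm performs the renaming symbolically on BDDs.

```python
def symmetric_copies(transition, n_procs):
    """Given a transition (pre, post) found for process 0, derive the
    corresponding transition for each process k by rotating the tuple
    of per-process local states by k positions."""
    pre, post = transition

    def rotate(state, k):
        return tuple(state[(i - k) % n_procs] for i in range(n_procs))

    return [(rotate(pre, k), rotate(post, k)) for k in range(n_procs)]
```

For example, a recovery transition that resets process 0's flag, ((1, 0, 0), (0, 0, 0)), yields the analogous transitions ((0, 1, 0), (0, 0, 0)) and ((0, 0, 1), (0, 0, 0)) for processes 1 and 2, without re-running the search that discovered the original transition.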
However, one limitation of our technique is that it requires the designer to identify the symmetry patterns in the program. Future work in this area will involve searching for techniques that allow automated discovery of such symmetry patterns. An interesting problem would be to exploit the symmetry in distributed programs by automatically identifying symmetric processes and actions.

In Chapter 5, we demonstrated how the hierarchical structure of the processes can be used to reduce the complexity of automated model revision. In particular, we showed how we can revise a small model and use the results to revise larger models. One piece of future work in this context is to incorporate techniques that can automatically identify the network topology of the model being revised and use it to complete the revision efficiently.

In the automated model revision to add nonmasking fault-tolerance, we used a set of constraints to describe the legitimate states of the model being revised. The order in which we choose to satisfy these constraints is very important: choosing a wrong order may make it impossible to find a correct nonmasking fault-tolerant model. We briefly presented a heuristic that considers all possible combinations for ordering the constraints. Another piece of future work in this context is to investigate other heuristics that take into consideration the relation between the constraints themselves. For example, if the set of states identified by a constraint, say C_1, is included in the set of states identified by another constraint, say C_2, then we may need to satisfy C_1 before satisfying C_2.

BIBLIOGRAPHY

[1] M. Abadi and L. Lamport. Conjoining specifications. ACM Transactions on Programming Languages and Systems (TOPLAS), 17(3):507-535, 1995.

[2] F. Abujarad, B. Bonakdarpour, and S. Kulkarni. Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs. In PDMC 2009, 2009.

[3] F. Abujarad and S.
Kulkarni. Automated Addition of Fault-Tolerance to SCR Toolset: A Case Study. In Distributed Computing Systems Workshops, 2008. ICDCS '08. 28th International Conference on, pages 539-544, 2008.

[4] F. Abujarad and S. Kulkarni. Constraint Based Automated Synthesis of Nonmasking and Stabilizing Fault-Tolerance. In Reliable Distributed Systems, 2009. SRDS '09. 28th IEEE International Symposium on, Niagara Falls, New York, USA, Sep 27-30, 2009, Proceedings, pages 119-128, 2009.

[5] F. Abujarad and S. Kulkarni. Multicore Constraint-Based Automated Stabilization. In Stabilization, Safety, and Security of Distributed Systems: 11th International Symposium, SSS 2009, Lyon, France, November 3-6, 2009, Proceedings, page 47. Springer, 2009.

[6] F. Abujarad and S. Kulkarni. Weakest Invariant Generation for Automated Addition of Fault-Tolerance. Electronic Notes in Theoretical Computer Science, 258(2):3-15, 2009. Available as Technical Report MSU-CSE-09-29 at http://www.cse.msu.edu/cgi-user/web/tech/reports?Year=2009.

[7] B. Alpern and F. B. Schneider. Defining liveness. Information Processing Letters, 21:181-185, 1985.

[8] R. Alur, P. Madhusudan, and W. Nam. Symbolic compositional verification by learning assumptions. In Computer Aided Verification, pages 548-562. Springer, 2005.

[9] B. Aminof, T. Ball, and O. Kupferman. Reasoning about systems with transition fairness. Proc. LPAR, LNCS 3452, pages 194-208, 2004.

[10] A. Arora. Efficient reconfiguration of trees: A case study in methodical design of nonmasking fault-tolerant programs. In Science of Computer Programming. Springer, 1996.

[11] A. Arora, P. C. Attie, and E. A. Emerson. Synthesis of fault-tolerant concurrent programs. In Principles of Distributed Computing (PODC), pages 173-182, 1998.

[12] A. Arora and M. G. Gouda. Closure and convergence: A foundation of fault-tolerant computing. IEEE Transactions on Software Engineering, 19(11):1015-1027, 1993.

[13] A. Arora, M. G. Gouda, and G. Varghese.
Constraint satisfaction as a basis for designing nonmasking fault-tolerant systems. Journal of High Speed Networks, 5(3):293-306, 1996.

[14] A. Arora and S. S. Kulkarni. Component based design of multitolerant systems. IEEE Transactions on Software Engineering, 24(1):63-78, 1998.

[15] A. Arora and S. S. Kulkarni. Designing masking fault-tolerance via nonmasking fault-tolerance. IEEE Transactions on Software Engineering, pages 435-450, June 1998.

[16] E. Asarin and O. Maler. As soon as possible: Time optimal control for timed automata. In Hybrid Systems: Computation and Control (HSCC), pages 19-30, 1999.

[17] E. Asarin, O. Maler, A. Pnueli, and J. Sifakis. Controller synthesis for timed automata. In IFAC Symposium on System Structure and Control, pages 469-474, 1998.

[18] A. Avižienis, J. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, pages 11-33, 2004.

[19] M. Barnett and K. Leino. Weakest-precondition of unstructured programs. In Proceedings of the 6th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, pages 82-87. ACM, New York, NY, USA, 2005.

[20] S. Bensalem, Y. Lakhnech, and H. Saidi. Powerful techniques for the automatic generation of invariants. In Proc. 8th Int. Conf. on Computer-Aided Verification, Lect. Notes in Comput. Sci. Springer, 1996.

[21] R. Bharadwaj and C. Heitmeyer. Developing high assurance avionics systems with the SCR requirements method. In Digital Avionics Systems Conference, 2000.

[22] R. Bharadwaj and C. Heitmeyer. Developing high assurance avionics systems with the SCR requirements method. In Digital Avionics Systems Conferences, 2000. Proceedings. DASC. The 19th, volume 1, 2000.

[23] N. Bjorner, A. Browne, and Z. Manna. Automatic generation of invariants and intermediate assertions. Theoretical Computer Science, 173(1):49-87, 1997.

[24] G. V. Bochmann.
Hardware specification with temporal logic: An example. IEEE Trans. Comput., 31(3):223-231, 1982.

[25] B. Bonakdarpour. Automated Revision of Distributed and Real-Time Programs. PhD thesis, Michigan State University, 2008.

[26] B. Bonakdarpour, A. Ebnenasir, and S. Kulkarni. Complexity results in revising UNITY programs. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 4(1):5, 2009.

[27] B. Bonakdarpour and S. Kulkarni. SYCRAFT: A Tool for Synthesizing Distributed Fault-Tolerant Programs. In Proceedings of the 19th International Conference on Concurrency Theory, August, pages 19-22. Springer, 2008.

[28] B. Bonakdarpour and S. S. Kulkarni. SYCRAFT: SYmboliC synthesizeR and Adder of Fault-Tolerance. Available at http://www.cse.msu.edu/~borzoo/sycraft.

[29] B. Bonakdarpour and S. S. Kulkarni. Automated incremental synthesis of timed automata. In International Workshop on Formal Methods for Industrial Critical Systems (FMICS), LNCS 4346, pages 261-276, 2006.

[30] B. Bonakdarpour and S. S. Kulkarni. Exploiting symbolic techniques in automated synthesis of distributed programs with large state space. In IEEE International Conference on Distributed Computing Systems (ICDCS), pages 3-10, 2007.

[31] B. Bonakdarpour, S. S. Kulkarni, and F. Abujarad. Distributed synthesis of fault-tolerance. In International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), 2006. Full version available as Technical Report MSU-CSE-06-27, Computer Science and Engineering Department, Michigan State University, East Lansing, Michigan.

[32] P. Bouyer, D. D'Souza, P. Madhusudan, and A. Petit. Timed control with partial observability. In Computer Aided Verification (CAV), pages 180-192, 2003.

[33] R. Bryant. Graph-Based Algorithms for Boolean Function Manipulation. IEEE Transactions on Computers, 35(8):677-691, 1986.

[34] R. E. Bryant. Graph-based algorithms for boolean function manipulation.
IEEE Transactions on Computers, 35(8):677-691, 1986.

[35] J. Burch, E. Clarke, and D. Long. Symbolic model checking with partitioned transition relations. In International Conference on Very Large Scale Integration, pages 49-58, 1991.

[36] J. R. Burch, E. M. Clarke, K. L. McMillan, D. L. Dill, and L. J. Hwang. Symbolic model checking: 10^20 states and beyond. Information and Computation, 98(2):142-170, 1992.

[37] R. Burstall. Program proving as hand simulation with a little induction. Information Processing, 74(308-312):448, 1974.

[38] K. M. Chandy and J. Misra. Parallel program design: a foundation. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1988.

[39] G. Ciardo, G. Lüttgen, and R. Siminiceanu. Saturation: An efficient iteration strategy for symbolic state-space generation. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 328-342, 2001.

[40] G. Ciardo and A. J. Yu. Saturation-based symbolic reachability analysis using conjunctive and disjunctive partitioning. In Correct Hardware Design and Verification Methods (CHARME), pages 146-161, 2005.

[41] A. Cimatti, E. Clarke, F. Giunchiglia, and M. Roveri. NuSMV: A new symbolic model checker. Int. J. Softw. Tools Technol. Transf., 2(4):410-425, 2000.

[42] E. Clarke. The birth of model checking. 25 Years of Model Checking, pages 1-26, 2008.

[43] E. Clarke, E. Emerson, and A. Sistla. Automatic verification of finite-state concurrent systems using temporal logic specifications. ACM Transactions on Programming Languages and Systems (TOPLAS), 8(2):263, 1986.

[44] E. Clarke, R. Enders, T. Filkorn, and S. Jha. Exploiting symmetry in temporal logic model checking. Formal Methods in System Design, 9(1):77-104, 1996.

[45] E. Clarke and L. Liu. Approximate algorithms for optimization of busy waiting in parallel programs (preliminary report). 20th Annual Symposium on Foundations of Computer Science, pages 255-266, 1979.

[46] E. M. Clarke and E. A. Emerson.
Design and synthesis of synchronization skeletons using branching-time temporal logic. In Logic of Programs, pages 52-71, 1981.

[47] E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite state concurrent system using temporal logic specifications: a practical approach. In POPL '83: Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 117-126, New York, NY, USA, 1983. ACM.

[48] E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite-state concurrent systems using temporal logic specifications. ACM Transactions on Programming Languages and Systems (TOPLAS), 8(2):244-263, 1986.

[49] E. M. Clarke, O. Grumberg, and D. A. Peled. Model checking. Springer, 1999.

[50] E. Clarke Jr. Synthesis of resource invariants for concurrent programs. ACM Transactions on Programming Languages and Systems (TOPLAS), 2(3):358, 1980.

[51] J. M. Cobleigh, D. Giannakopoulou, and C. S. Pasareanu. Learning assumptions for compositional verification. In TACAS '03: Proceedings of the 9th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 331-346, Berlin, Heidelberg, 2003. Springer-Verlag.

[52] J. Dabney and T. Harman. Mastering Simulink. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1997.

[53] E. Dijkstra. A discipline of programming. Prentice-Hall, Englewood Cliffs, NJ, 1976.

[54] E. W. Dijkstra. Self-stabilizing systems in spite of distributed control. Communications of the ACM, 17(11), 1974.

[55] R. Dimitrova and B. Finkbeiner. Synthesis of Fault-Tolerant Distributed Systems. In Automated Technology for Verification and Analysis: 7th International Symposium, ATVA 2009, Macao, China, October 14-16, 2009, Proceedings, page 321. Springer, 2009.

[56] S. Dolev. Self-Stabilization. MIT Press, 2000.

[57] S. Dolev, A. Israeli, and S. Moran. Self-stabilization of dynamic systems assuming only read/write atomicity.
Distributed Computing, 7:3-16, 1993.

[58] A. Ebnenasir. DiConic addition of failsafe fault-tolerance. In Automated Software Engineering (ASE), pages 44-53, 2007.

[59] A. Ebnenasir, S. Kulkarni, and A. Arora. FTSyn: A framework for automatic synthesis of fault-tolerance. International Journal on Software Tools for Technology Transfer (STTT), 10(5):455-471, 2008.

[60] A. Ebnenasir, S. S. Kulkarni, and A. Arora. FTSyn: a framework for automatic synthesis of fault-tolerance. Int. J. Softw. Tools Technol. Transf., 10(5):455-471, 2008.

[61] A. Ebnenasir, S. S. Kulkarni, and B. Bonakdarpour. Revising UNITY programs: Possibilities and limitations. In International Conference on Principles of Distributed Systems (OPODIS), LNCS 3974, pages 275-290, 2005.

[62] E. Emerson and E. Clarke. Characterizing Correctness Properties of Parallel Programs Using Fixpoints. In Proceedings of the 7th Colloquium on Automata, Languages and Programming, page 181. Springer-Verlag, 1980.

[63] E. Emerson and J. Y. Halpern. "Sometimes" and "not never" revisited: On branching versus linear time temporal logic. J. Assoc. Comput. Mach., 33:151-178, 1986.

[64] E. A. Emerson and E. M. Clarke. Using branching time temporal logic to synthesize synchronization skeletons. Science of Computer Programming, 2(3):241-266, 1982.

[65] E. A. Emerson and C. L. Lei. Temporal model checking under generalized fairness constraints. In Proc. 18th Hawaii International Conference on System Sciences, pages 277-288, 1985.

[66] J. Ezekiel and G. Lüttgen. Measuring and evaluating parallel state-space exploration algorithms. In International Workshop on Parallel and Distributed Methods in Verification (PDMC), 2007.

[67] J. Ezekiel, G. Lüttgen, and G. Ciardo. Parallelising symbolic state-space generators. In Computer Aided Verification (CAV), pages 268-280, 2007.

[68] J. Ezekiel, G. Lüttgen, and R. Siminiceanu. Can Saturation be Parallelised? Formal Methods: Applications and Technology, pages 331-346.
[69] J. Ezekiel, G. Lüttgen, and R. Siminiceanu. Can Saturation be parallelised? On the parallelisation of a symbolic state-space generator. In International Workshop on Parallel and Distributed Methods of Verification (PDMC), pages 331-346, 2006.

[70] M. Faella, S. La Torre, and A. Murano. Dense real-time games. In Logic in Computer Science (LICS), pages 167-176, 2002.

[71] F. Gärtner. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys (CSUR), 31(1):1-26, 1999.

[72] F. Gärtner and A. Jhumka. Automating the addition of fail-safe fault-tolerance: Beyond fusion-closed specifications. Lecture Notes in Computer Science, pages 183-198, 2004.

[73] F. Gärtner and H. Pagnia. Self-stabilizing load distribution for replicated servers on a per-access basis. In Proceedings of the 19th IEEE International Conference on Distributed Computing Systems Workshop on Self-Stabilizing Systems, pages 102-109, 1999.

[74] P. Godefroid. Using partial orders to improve automatic verification methods. In CAV '90: Proceedings of the 2nd International Workshop on Computer Aided Verification, pages 176-185, London, UK, 1991. Springer-Verlag.

[75] M. Gouda. Multiphase stabilization. IEEE Transactions on Software Engineering, pages 201-208, 2002.

[76] M. G. Gouda. The triumph and tribulation of system stabilization. In Proceedings of the 9th International Workshop on Distributed Algorithms, pages 1-18. Springer-Verlag, London, UK, 1995.

[77] O. Grumberg, T. Heyman, N. Ifergan, and A. Schuster. Achieving speedups in distributed symbolic reachability analysis through asynchronous computation. In Correct Hardware Design and Verification Methods (CHARME), pages 129-145, 2005.

[78] O. Grumberg, T. Heyman, and A. Schuster. A work-efficient distributed algorithm for reachability analysis. Formal Methods in System Design (FMSD), 29(2):157-175, 2006.

[79] O. Grumberg and D. Long. Model checking and modular verification.
In CONCUR '91, pages 250–265. Springer, 1991.
[80] O. Grumberg and D. E. Long. Model checking and modular verification. ACM Trans. Program. Lang. Syst., 16(3):843–871, 1994.
[81] D. Harel and H. Kugler. The Rhapsody semantics of statecharts. Lecture Notes in Computer Science, pages 325–354, 2004.
[82] D. Harel and H. Kugler. The Rhapsody semantics of statecharts. Lecture Notes in Computer Science, pages 325–354, 2004.
[83] M. Heimdahl and N. Leveson. Completeness and consistency in hierarchical state-based requirements. IEEE Transactions on Software Engineering, 22(6):363–377, 1996.
[84] C. Heitmeyer, M. Archer, R. Bharadwaj, and R. Jeffords. Tools for constructing requirements specifications: The SCR toolset at the age of ten. International Journal of Computer Systems Science and Engineering, 20(1):19–35, 2005.
[85] C. Heitmeyer and R. Jeffords. Applying a formal requirements method to three NASA systems: Lessons learned. In 2007 IEEE Aerospace Conference, pages 1–10, 2007.
[86] C. Heitmeyer, J. Kirby, and B. Labaw. Tools for formal specification, verification, and validation of requirements. In Computer Assurance (COMPASS '97): Proceedings of the 12th Annual Conference, pages 35–47, 1997.
[87] C. Heitmeyer, J. Kirby, B. Labaw, R. Bharadwaj, et al. SCR*: A toolset for specifying and analyzing software requirements, 1998.
[88] C. Heitmeyer and J. McLean. Abstract requirements specification: A new approach and its application. IEEE Transactions on Software Engineering, pages 580–589, 1983.
[89] T. A. Henzinger, X. Nicollin, J. Sifakis, and S. Yovine. Symbolic model checking for real-time systems. Information and Computation, 111(2):193–244, 1994.
[90] M. Herlihy. The future of distributed computing: Renaissance or reformation? In Twenty-Seventh Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC 2008), 2008.
[91] S. Hester, D. Parnas, and D. Utter.
Using documentation as a software design medium. Bell System Tech. J., 60(8):1941–1977, 1981.
[92] T. Heyman, D. Geist, O. Grumberg, and A. Schuster. Achieving scalability in parallel reachability analysis of very large circuits. In Computer-Aided Verification (CAV), pages 20–35, 2000.
[93] G. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 1997.
[94] R. Jeffords and C. Heitmeyer. An algorithm for strengthening state invariants generated from requirements specifications. In Proceedings of the Fifth IEEE International Symposium on Requirements Engineering (RE '01). IEEE Computer Society, Washington, DC, USA, 2001.
[95] R. Jeffords and C. Heitmeyer. A strategy for efficiently verifying requirements. ACM SIGSOFT Software Engineering Notes, 28(5):28–37, 2003.
[96] B. Jobstmann, A. Griesmayer, and R. Bloem. Program repair as a game. In Computer Aided Verification (CAV), pages 226–238, 2005.
[97] S. Katz and K. Perry. Self-stabilizing extensions for message passing systems. Distributed Computing, 7:17–26, 1993.
[98] T. Kletz. Hazop and Hazan: Identifying and Assessing Process Industry Hazards. Institution of Chemical Engineers, 1999.
[99] F. Kröger. LAR: A logic of algorithmic reasoning. Acta Inf., 8:243–266, 1977.
[100] S. S. Kulkarni. Component-based design of fault-tolerance. PhD thesis, Ohio State University, 1999.
[101] S. S. Kulkarni and A. Arora. Automating the addition of fault-tolerance. In Formal Techniques in Real-Time and Fault-Tolerant Systems (FTRTFT), pages 82–93, 2000.
[102] S. S. Kulkarni, A. Arora, and A. Chippada. Polynomial time synthesis of Byzantine agreement. In Symposium on Reliable Distributed Systems (SRDS), pages 130–140, 2001.
[103] S. S. Kulkarni, A. Arora, and A. Ebnenasir. Software Engineering and Fault-Tolerance, chapter Adding Fault-Tolerance to State Machine-Based Designs. World Scientific Publishing Co. Pte. Ltd., 2007.
[104] S. S. Kulkarni and M. Arumugam.
Infuse: A TDMA based data dissemination protocol for sensor networks. International Journal of Distributed Sensor Networks, 2(1):55–78, 2006.
[105] S. S. Kulkarni and A. Ebnenasir. Enhancing the fault-tolerance of nonmasking programs. International Conference on Distributed Computing Systems, 2003.
[106] R. Kurshan and K. McMillan. A structural induction theorem for processes. In Proceedings of the Eighth Annual ACM Symposium on Principles of Distributed Computing, pages 239–247. ACM, 1989.
[107] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982.
[108] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 1982.
[109] K. Leino. Efficient weakest preconditions. Information Processing Letters, 93(6):281–288, 2005.
[110] L. Liu and E. Clarke. Optimization of busy waiting in conditional critical regions. 13th Hawaii International Conference on System Sciences, 1980.
[111] H. Mantel and F. C. Gärtner. A case study in the mechanical verification of fault-tolerance. Technical Report TUD-BS-1999-08, Department of Computer Science, Darmstadt University of Technology, 1999.
[112] K. L. McMillan. Symbolic model checking: an approach to the state explosion problem. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1992.
[113] K. L. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 1993.
[114] S. Meyer and S. White. Software requirements methodology and tool study for A6-E technology transfer. Technical report, Grumman Aerospace Corp., Bethpage, NY, 1983.
[115] K. Milvang-Jensen and A. J. Hu. BDDNOW: A parallel BDD package. In Formal Methods in Computer Aided Design (FMCAD), pages 501–507, 1998.
[116] J. Nimmer and M. Ernst. Automatic generation of program specifications. ACM SIGSOFT Software Engineering Notes, 27(4):229–239, 2002.
[117] C. Norris Ip and D. Dill.
Better verification through symmetry. Formal Methods in System Design, 9(1):41–75, 1996.
[118] S. Owre, J. M. Rushby, and N. Shankar. PVS: A prototype verification system. In D. Kapur, editor, 11th International Conference on Automated Deduction (CADE), volume 607 of Lecture Notes in Artificial Intelligence, pages 748–752, Saratoga, NY, June 1992. Springer-Verlag.
[119] P. Palady. Failure Modes and Effects Analysis. PT Publications Inc., 1995.
[120] D. Parnas and J. Madey. Functional documents for computer systems. Science of Computer Programming, 25(1):41–61, 1995.
[121] S. Qadeer and N. Shankar. Verifying a self-stabilizing mutual exclusion algorithm. In D. Gries and W.-P. de Roever, editors, IFIP International Conference on Programming Concepts and Methods (PROCOMET '98), pages 424–443, Shelter Island, NY, June 1998. Chapman & Hall.
[122] P. Ramadge and W. Wonham. The control of discrete event systems. Proceedings of the IEEE, 77(1):81–98, 1989.
[123] R. Ranjan, A. Aziz, R. Brayton, B. Plessier, and C. Pixley. Efficient BDD algorithms for FSM synthesis and verification. In IEEE/ACM International Workshop on Logic Synthesis, 1995.
[124] K. Raymond. A tree based algorithm for mutual exclusion. ACM Transactions on Computer Systems, 7:61–77, 1989.
[125] F. Somenzi. CUDD: Colorado University Decision Diagram Package. http://vlsi.colorado.edu/~fabio/CUDD/cuddIntro.html.
[126] T. Stornetta and F. Brewer. Implementation of an efficient parallel BDD package. In Proceedings of the 33rd Annual Design Automation Conference, pages 641–644. ACM, 1996.
[127] T. Stornetta and F. Brewer. Implementation of an efficient parallel BDD package. In Design Automation Conference (DAC), pages 641–644, 1996.
[128] O. Theel and F. Gärtner. An exercise in proving convergence through transfer functions. In Proc. 4th Workshop on Self-Stabilizing Systems, Austin, Texas, pages 41–47, 1999.
[129] T. Tsuchiya, S. Nagano, R. B. Paidi, and T. Kikuno.
Symbolic model checking for self-stabilizing algorithms. IEEE Trans. Parallel Distrib. Syst., 12(1):81–95, 2001.
[130] W. Vesely. Fault Tree Handbook. US Nuclear Regulatory Commission Report NUREG-0492, Washington, DC, 1981.
[131] P. Wolper and V. Lovinfosse. Verifying properties of large sets of processes with network invariants. In Proceedings of the International Workshop on Automatic Verification Methods for Finite State Systems, pages 68–80, London, UK, 1990. Springer-Verlag.