LIBRARY
Michigan State University

This is to certify that the dissertation entitled

Towards Automated Model Revision For Fault-Tolerant Systems

presented by

FUAD ABUJARAD

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science

Major Professor's Signature

Date

MSU is an Affirmative Action/Equal Opportunity Employer

TOWARDS AUTOMATED MODEL REVISION FOR FAULT-TOLERANT SYSTEMS

By

FUAD ABUJARAD

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2010

ABSTRACT

TOWARDS AUTOMATED MODEL REVISION FOR FAULT-TOLERANT SYSTEMS

By

FUAD ABUJARAD

Automated model revision of distributed programs is one of the emerging and important approaches for achieving and maintaining program correctness. In this approach, an existing model is automatically revised to satisfy new properties. Such model revision is required when an existing model/program is subject to a newly identified fault, a new requirement, or a new environment.
Thus, model revision is especially beneficial in the development of systems that need high assurance. To apply model revision in practice, we need to develop tools that are user friendly, comprehensive, and efficient. However, due to their limitations, the current model revision tools and techniques are not widely used in the development of practical systems. More specifically, some of the limitations are that they suffer from a high learning curve, they incur high time and space complexity, they require many details to be specified that could otherwise be discovered automatically, and they do not cover different types of revision.

Taking these limitations into consideration, in this dissertation we derive theories, develop algorithms, and build tools to advance the state of the art of automated model revision. Our approach comprises four main elements. First, we reduce the learning curve of automated model revision techniques by utilizing existing design tools to perform the revision under the hood. Second, to permit the designer to describe the model to be synthesized efficiently and to minimize the user input, we develop algorithms and tools to automate the generation of the legitimate states of the original model, thereby reducing the designer's burden. Third, to utilize the available computing resources and to complete the revision efficiently, we exploit both symmetry and parallelism to speed up the automated revision and to overcome its bottlenecks. Fourth, to make the revision more comprehensive and to cover additional types of model revision, such as nonmasking and stabilizing fault-tolerance, we develop algorithms and tools that allow the addition of these new types of fault-tolerance. To validate our approach and illustrate its feasibility, we apply it to several case studies.

© Copyright by FUAD ABUJARAD 2010

I dedicate this dissertation to my wonderful family.
Particularly, to my parents, who believed in diligence, science, and the pursuit of academic excellence. To my beloved wife, Samah, who has been patient and supportive through these many years of research, and to our lovely kids, Haya, Khaled, and Amir, who are the joy of our lives.

ACKNOWLEDGMENTS

I am extremely grateful to all who helped me complete my Ph.D. program. First and foremost was the unconditional support of my wife, Samah. Her support, encouragement, quiet patience, and unwavering love were undeniably the bedrock upon which the past eleven years of my life have been built. Her tolerance of my changing moods is a testament in itself of her unyielding devotion and love. I would like to thank our three children, Haya, Khaled, and Amir, who made this all possible. My family made tremendous sacrifices so that I could spend time on my doctoral education. They encouraged and pushed me to continue in my pursuit.

I would like to gratefully and sincerely thank Dr. Sandeep S. Kulkarni for his guidance, understanding, and patience during my graduate studies at Michigan State University. His mentorship was paramount in providing a well-rounded experience consistent with my long-term career goals. He encouraged me to grow not only as an experimentalist but also as an independent thinker. For everything you have done for me, Dr. Kulkarni, I thank you.

I would like to thank the Department of Computer Science and Engineering at MSU, especially the members of my doctoral committee, for their input, valuable discussions, and availability. In particular, I would like to thank Dr. Laura Dillon and Dr. Betty H. C. Cheng, as well as Dr. Jonathan Hall from the Department of Mathematics. This dissertation would not have been nearly as complete without your help.
Additionally, I am very grateful for the friendship of all the members of the SENS lab research group, especially Ali Ebnenasir, Mahesh Arumugam, Borzoo Bonakdarpour, and Jingshu Chen, with whom I worked closely and co-authored some of my papers during my Ph.D. program.

Finally, and most importantly, I would like to acknowledge my parents, Suleiman and Hamamah, for their unconditional love and for their faith in me. It was under their watchful eye that I gained so much self-esteem and an ability to tackle challenges. Also, I would like to thank my brothers and sisters for their continuous support and unending encouragement.

TABLE OF CONTENTS

LIST OF TABLES ........................................ xi
LIST OF FIGURES ....................................... xiv
1 Introduction 1
1.0.1 Motivations and Goals ........................ 3
1.0.2 Thesis ................................ 4
1.0.3 Contributions ............................. 5
1.0.4 Outline ................................ 8
2 Preliminaries 9
2.1 Models and Programs ............................. 9
2.2 Modeling Distributed Programs ....................... 12
2.2.1 Write Restrictions .......................... 13
2.2.2 Read Restrictions .......................... 13
2.2.3 Example (Group) ........................... 13
2.2.4 The Group Algorithm ........................ 14
2.3 Specification ................................. 16
2.4 Faults ..................................... 18
2.5 Fault-Tolerance ................................ 19
2.6 Example: (Data Dissemination Protocol in Sensor Networks) ........ 20
3 Under-The-Hood Revision 24
3.1 Introduction to SCR .............................. 24
3.1.1 SCR Formal Method ......................... 25
3.1.2 Automated Model Revision to Add Fault-Tolerance ......... 29
3.2 Integration of SCR toolset and SYCRAFT .................. 30
3.2.1 Transforming SCR specifications into SYCRAFT input ....... 30
3.2.2 Translation from SCR Syntax to SYCRAFT Syntax ......... 32
3.2.3 Modeling of faults .......................... 32
3.2.4 Adding fault-tolerance to SCR specifications ............ 33
3.3 Case Studies .................................. 33
3.3.1 Case Study 1: Altitude Switch Controller .............. 34
3.3.2 Case Study 2: Cruise Control System ................ 37
3.4 Summary ................................... 39
4 Expediting the Automated Revision Using Parallelization and Symmetry 40
4.1 Introduction .................................. 41
4.2 Issues in Automated Model Revision ..................... 43
4.2.1 Input for Byzantine Agreement Problem ............... 43
4.2.2 The Need for Modeling Read/Write Restrictions .......... 45
4.2.3 The Need for Deadlock Resolution .................. 46
4.3 Approach 1: Parallelizing Group Computation ................ 48
4.3.1 Design Choices ............................ 49
4.3.2 Parallel Group Algorithm Description ................ 50
4.3.3 Experimental Results ......................... 54
4.3.4 Group Time Analysis ........................ 59
4.4 Approach 2: Alternative (Conventional) Approach .............. 60
4.4.1 Design Choices ............................ 61
4.4.2 Algorithm Sketch ........................... 62
4.4.3 Experimental Results ........................ 66
4.5 Using Symmetry to Expedite the Automated Revision ............ 69
4.5.1 Symmetry ............................... 69
4.5.2 Experimental Results ......................... 71
4.6 Summary ................................... 77
5 Nonmasking and Stabilizing Fault-Tolerance 80
5.1 Introduction .................................. 81
5.2 Programs and Specifications ......................... 85
5.3 Synthesis Algorithm of the Nonmasking and Stabilizing Fault-Tolerance . . 86
5.3.1 Constraint Satisfier .......................... 87
5.3.2 Algorithm Illustration ........................ 90
5.4 Expediting the Constraints Satisfaction .................... 91
5.4.1 Design Choices for Parallelism .................... 91
5.4.2 Partitioning the Constraints Satisfaction ............... 93
5.5 Case Studies ................................. 96
5.5.1 Case Study 1: Stabilizing Mutual Exclusion Program ........ 96
5.5.2 Case Study 2: Data Dissemination in Sensor Networks ....... 103
5.5.3 Case Study 3: Stabilizing Diffusing Computation .......... 106
5.6 Choosing Ordering Among Constraints ................... 111
5.7 Reducing the Complexity with Hierarchical Structure ............ 117
5.8 Summary ................................... 119
6 Legitimate States Automated Discovery 121
6.1 Introduction .................................. 122
6.2 The "Weakest Legitimate State Predicate Generator (stpGenerator)" Algorithm .................................... 124
6.2.1 Weakest Legitimate State Predicate Generator ............ 125
6.2.2 Safety Checker ............................ 125
6.2.3 Liveness Checker ........................... 126
6.3 Application of stpGenerator in Automated Model Revision ........ 131
6.3.1 Case Study 1: Byzantine agreement program ............ 131
6.3.2 Case Study 2: Token Ring ...................... 135
6.3.3 Case Study 3: Mutual Exclusion ................... 136
6.3.4 Case Study 4: Diffusing Computation ................ 138
6.4 Summary ................................... 139
7 Automated Model Revision Without Explicit Legitimate States 141
7.1 Introduction .................................. 142
7.2 Problem Statement .............................. 144
7.3 Relative Completeness (Q. 1) ......................... 146
7.4 Complexity Analysis (Q. 2) .......................... 148
7.4.1 Complexity Comparison for Partial Revision ............ 148
7.4.2 Complexity Comparison for Total Revision ............. 153
7.4.3 Heuristic for Polynomial Time Solution for Partial Revision .... 155
7.4.4 Algorithm for Model Revision Without Explicit Legitimate States . 156
7.4.5 Summary of Complexity Results ................... 159
7.5 Relative Computation Cost (Q. 3) ...................... 161
7.6 Summary ................................... 162
8 Related Work 163
8.1 Model Checking ................................ 164
8.2 Controller Synthesis and Game Theory ................... 167
8.3 Model Revision and Automated Program Synthesis ............. 168
8.4 Parallelization and Symmetry ......................... 170
8.5 Nonmasking and Stabilizing Fault-Tolerance ................. 172
8.6 Legitimate States Discovery ......................... 173
9 Conclusion and Future Work 175
9.1 Contributions ................................. 175
9.2 Future Research Directions .......................... 182
BIBLIOGRAPHY ........................................................... 187

LIST OF TABLES

3.1 Monitored Variables of the altitude switch controller system (ASW) ..... 28
3.2 Mode transition table for the mode class mcStatus. ............. 28
3.3 Condition table for cWakeUpDOI. ...................... 29
3.4 mRoom Mode Table .............................. 30
3.5 Translation rules ............................... 32
3.6 The mcStatus mode table translated. ..................... 35
3.7 The SYCRAFT fault section. ......................... 36
3.8 The fault-tolerant mcStatus mode table. ................... 36
3.9 Fault-tolerant mode class mcStatus. ..................... 37
3.10 Fault intolerant mode class mcCruise. .................... 38
3.11 The SYCRAFT fault section. ......................... 38
3.12 Fault-tolerant mode class mcCruise. ..................... 39
4.1 Deadlock scenario 1 (The underlined values indicate which variable is being changed by the program action/fault. For reasons of space, the true and false values are replaced by 1 and 0, respectively, for the variables b and f.) .................................... 47
4.2 Deadlock scenario 2 (The underlined values indicate which variable is being changed by the program action/fault.
For reasons of space the true and false values are replaced by 1 and 0 respectively for the variables b and f.) .................................... 48 Group computation time for Byzantine Agreement. PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio. .............................. 60 Group computation time for the Agreement problem in the presence of fail-stop and Byzantine faults. PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio. ....... 60 xi 4.5 4.6 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 5.12 Group computation time for token ring. PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio ....................................... 61 The time required for the revision to add fault-tolerance for several numbers of non-general processes of BA in sequential and by partitioning deadlock states using parallelism. PR: Number of processes. RS: Size of reachable state space. DRT(s): Deadlock resolution time in seconds. TST(s): Total revision time in seconds. ........................... 68 Stabilizing Mutual Exclusion, linear topology ................ 99 Stabilizing Mutual Exclusion, binary tree topology. ............. 100 Stabilizing Mutual Exclusion using Constraints partitioning. Cnst t(s): Total time spent in constraints satisfaction in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB ........... 101 Stabilizing Mutual Exclusion using Group threading. Grp t(s): Total time spent in Group computation in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB. ................ 102 Nonmasking with linear topology data dissemination program. ....... 106 Data Dissemination program using Constraints partitioning. Grp t(s): Total time spent in Group computation in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB.
............ 107 Data Dissemination program using Group threading. Grp t(s): Total time spent in Group computation in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB. ................ 108 Stabilizing Diffusing Computation, linear topology. ............. 110 Stabilizing Diffusing Computation, binary tree topology. .......... 110 Stabilizing Diffusing Computation program using Group threading. Grp t(s): Total time spent in Group computation in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB ......... 112 Stabilizing Diffusing Computation using Constraints partitioning. Cnst t(s): Total time spent in constraints satisfaction in seconds. Syn t(s): Total revision time in seconds. Mem (MB): Memory usage in MB ......... 113 Stabilizing Mutual Exclusion with linear topology using random constraints satisfaction. ............................. 115 xii 5.13 Stabilizing Diffusing Computation with linear topology using random constraints satisfaction ............................... 116 6.1 The time required to generate the weakest legitimate state predicate (Byzantine Agreement) ............................. 134 6.2 The time required to generate the weakest legitimate state predicate (token ring) ....................................... 136 6.3 The time required to generate the weakest legitimate state predicate (Mutual Exclusion). ................................ 138 7.1 The complexity of different types of automated revision (NP-C = NP-Complete). .................................. 160 7.2 The time comparison for the Byzantine Agreement program. ........ 162 xiii 3.1 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10 5.1 5.2 5.3 LIST OF FIGURES The transformation cycle between SCR toolset and SYCRAFT. ....... 34 The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of non-general processes of BA in sequential and parallel algorithms. ...........................
55 The time required for the revision to add fault-tolerance for several numbers of non-general processes of BA in sequential and parallel algorithms. . . . . 56 The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of token ring processes in sequential and parallel algorithms. .............................. 57 The time required for the revision to add fault-tolerance for several numbers of token ring processes in sequential and parallel algorithms. ....... 58 Inconsistencies raised by concurrency. .................... 67 The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of BA non-general processes in sequential and symmetrical algorithms. ......................... 72 The time required for the revision to add fault-tolerance for several numbers of BA non-general processes in sequential and symmetrical algorithms. . . 73 The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of token ring processes in sequential and symmetrical algorithms. ........................... 74 The time required for the revision to add fault-tolerance for several numbers of token ring processes in sequential and symmetrical algorithms. ..... 75 The time required for the revision to add fault-tolerance for several numbers of BA non-general processes using both symmetry and parallelism. . . . . 76 Constraints ordering and transitions selections. ............... 90 The holder tree ................................ 98 Complexity and hierarchy for linear topology ................ 117 xiv 5.4 7.1 7.2 7.3 8.1 Complexity and hierarchy for the binary tree topology ........... 119 Model Revision with Explicit Legitimate States ................ 142 Model Revision without Explicit Legitimate States. ............. 143 Mapping of (x1 ∨ x2) ∧ (¬x1 ∨ ¬x2) into corresponding program transitions.
The transitions in bold show the revised program where x1 = true and x2 = false. ..................................... 150 Model Checking and Automated Model Revision ............... 167 xv Chapter 1 Introduction The rapid growth of computer systems is increasing our reliance on them more than ever. Therefore, the burden of ensuring the correctness of reliable hardware and software systems is significantly growing. Model checking is one of the commonly used techniques to provide such assurance, especially for finite state concurrent systems [48,49,64]. Given a model of a system, a model checker verifies whether this model meets a given property. If the model does not satisfy that property, the model checker (typically) gives a counter-example. Then, the model needs to be modified to satisfy the desired property. Consequently, such modification will require another cycle of verification. Based on this observation, in this dissertation, we focus on model revision [26,29,61,96] where an existing model is revised so that it satisfies a given property of interest. Model revision is required in several contexts. For example, it is required to revise an existing model to fix a counter-example, i.e., a bug. It is also required if the original specification was incomplete and the model has to be revised to meet the missing specification. Furthermore, it is required to respond to faults introduced by a change in the environment. When a program is deployed in a new environment, it may be subject to faults that were not considered in the original design. Moreover, even if the faults were known in the initial design, to provide separation of concerns, it is desirable to allow the designer to focus on the functionality aspect and add fault-tolerance subsequently. In either case, it is desired that we revise the program to add fault-tolerance. One requirement for such revision is that the existing program requirements continue to be satisfied [101].
Also, in the above contexts, it is more practical to reuse the existing program in the construction of the revised one [25]. Performing such revisions manually has the potential to incur a huge cost as well as introduce new errors. Therefore, automating such revisions is desirable for reducing cost and guaranteeing assurance. One approach to gain assurance in such program revision is by automated model revision (also known as automated incremental synthesis) [27,30,31,55,59,101,103], which guarantees that the revised program is correct-by-construction. The automated model revision to add fault-tolerance takes a fault-intolerant program, program specifications, and faults as input and generates a fault-tolerant program as output. More specifically, it reuses the original program (which is fault-intolerant) in synthesizing its fault-tolerant version [101]. Moreover, since the synthesized program is correct-by-construction, there is no need to reverify its correctness. The automated model revision (or, incremental synthesis) of fault-tolerant programs is highly desirable, as it allows the designer to focus on the normal system behavior in the absence of faults and leaves the fault-tolerance aspect to the automated techniques. Initially, Kulkarni and Arora [101,102] presented an algorithm for synthesizing fault-tolerant programs. The input to their algorithm is a fault-intolerant program that satisfies its specification in the absence of faults but provides no guarantees in the presence of faults. The output of their algorithm is a fault-tolerant program that continues to satisfy its specifications in the absence of faults and provides the desired level of fault-tolerance to tolerate the given faults. Later, in [59] Ebnenasir and Kulkarni presented an enumerative (explicit-state) implementation of the revision algorithm.
This was a significant step, since it enabled them to verify the concepts of the revision and demonstrate the applicability of the automated revision algorithms [59]. However, similar to other enumerative implementations, it was subject to the state explosion problem and was only suitable for revising small programs. Recently, Bonakdarpour and Kulkarni presented a symbolic implementation of the revision algorithm [27,30]. In this implementation, the components of the revision algorithm are constructed using Boolean formulae represented by Bryant's Ordered Binary Decision Diagrams [33]. This was the first time that moderate to large sized programs (a state space of 10^50 and beyond) have been synthesized. The symbolic implementation enabled them to identify bottlenecks in the automated revision. These bottlenecks included deadlock resolution, computation of reachable states in the presence of faults, and addition of recovery paths. 1.0.1 Motivations and Goals In practice, applying the automated model/program revision in real life applications is difficult due to the following factors: 1. The use of the existing tools for automated model revision has a high learning curve. The designer is required to learn different aspects of modeling distributed programs, program specification, faults, and fault-tolerance [27,30,59]. To alleviate this difficulty, we focus on moving the task of adding fault-tolerance under-the-hood. In this manner, we make automated revision more accessible [3]. 2. Current model revision tools require the designer to specify the fault-intolerant model, the model specifications, the model legitimate states, and the faults [27,30,31,59,101,103]. Of those, identifying the set of legitimate states is the most demanding task. The designer needs to specify the legitimate states of the model and describe them in a logical formula.
Although specifying the model, the specifications, and the faults is a must, it is an open question as to whether the explicit specification of the legitimate states is necessary. To alleviate this difficulty, we focus on designing an algorithm that provides automatic generation of the legitimate states from the model actions and specifications [6]. 3. Current model revision tools [27,30,59] focus on the addition of masking fault-tolerance, where both safety and liveness are preserved during recovery. However, they do not address other types of fault-tolerance, including nonmasking and stabilizing fault-tolerance. In nonmasking fault-tolerance, safety can be violated during recovery and the program should tolerate temporary perturbation. In stabilizing fault-tolerance, the program recovers to its legitimate states from any arbitrary state [57]. To broaden the domain of problems that can be resolved by automated model revision, we develop algorithms for the automated addition of nonmasking and stabilizing fault-tolerance [4]. 4. The current model revision tools utilize multiple heuristics to reduce the complexity of the revision [27,30,59,101]. However, to improve the efficiency further, we need to utilize advantages from model checking [48,49,64,93]. Hence, we develop techniques that concentrate on reducing the complexity of the revision using symmetry and/or parallelism. We show that these approaches provide a significant speedup separately as well as together [5].
1.0.2 Thesis Thesis Statement: Automated model revision can be made more usable, comprehensive, and efficient through the use of four key elements: the use of existing design tools as a front end to the automated model revision tools, the introduction of new revision algorithms that handle different classes of fault-tolerance, the use of the original model specification and actions to automatically discover other inputs to the revision algorithm, and the utilization of symmetry and parallelism. To validate this thesis statement, we have derived theories, developed algorithms, and built tools to advance the automated model revision through a usable, comprehensive, and efficient toolset. First, to reduce the automated revision learning curve, we utilized existing design tools (i.e., the SCR toolset) such that the automated revision is done under-the-hood [3]. Second, to revise a broader range of programs, we developed algorithms and tools to add new types of fault-tolerance [4,5]. Next, we reduced the revision parameters by automating the discovery of the program legitimate states, thereby reducing the burden on the designer [6]. Finally, to overcome the automated revision bottlenecks and reduce its time complexity, we utilized both symmetry and parallelism to speed up the revision [2,5]. 1.0.3 Contributions Our contributions can be grouped into four major categories: Under-the-hood Revision It is desirable that the designer utilizes the automated model revision tools with minimal prerequisite knowledge of the details of the automated revision techniques. We focus on performing the automated revision under-the-hood. Therefore, we utilize existing design tools, such as the SCR toolset [21,84,87], in the automated revision. The SCR toolset is a set of tools used to formally construct and verify the requirements specification document. It is widely used in constructing many mission critical systems.
Our approach is to combine the SCR toolset with the tool SYCRAFT that automates the model revision. This approach is desirable, as it allows one to perform functions of the automated model revision without the need to know its details. Of course, it would be necessary to convert (1) the SCR specification into a format that can be used with SYCRAFT and (2) the revised fault-tolerant program into the corresponding SCR specification. Based on the above discussion, we combine the SCR toolset with the automated model revision tool SYCRAFT [27,30]. More specifically, we let the designer specify the program requirements through the SCR toolset interface and we handle the aspects of the automated revision of fault-tolerance using SYCRAFT. Legitimate States Generator One of the requirements of the model revision algorithm is identifying the set of the legitimate states of the program being synthesized. This set represents the states from where the execution of the actions of the model is correct. One approach for providing fault-tolerance is to ensure that after the occurrence of faults, the revised program eventually recovers to the legitimate states of the original model. Since the original model met its original specification from these legitimate states, we can ascertain that eventually a revised model reaches states from where subsequent computation is correct. One of the problems in providing recovery to the legitimate states, however, is that these legitimate states are not always easy to determine. Existing model revision approaches (e.g., SYCRAFT [27,30]) have required the designer to specify these legitimate states explicitly. It is straightforward to observe that if these legitimate states could be derived automatically, then it would reduce the burden put on the designer, thereby making it easier to apply these techniques in revision of existing programs. We focus on identifying the largest set of states from where the existing model is correct.
Nonmasking and Stabilizing Fault-Tolerance To provide comprehensive tools for the automated model revision, we focus our attention on automated addition of nonmasking and stabilizing fault-tolerance to fault-intolerant programs. Intuitively, a nonmasking fault-tolerant program ensures that if the program is perturbed by faults to an illegitimate state, then it would eventually recover to its legitimate states. However, safety may be violated during recovery [101]. The current model revision tools [30,59] support the design of masking fault-tolerance only. However, there are several reasons that make the design of nonmasking fault-tolerance attractive. For one, the design of masking fault-tolerant programs, where both safety and liveness are preserved during recovery, is often expensive or impossible, even though the design of nonmasking fault-tolerance is easy [15]. Also, the design of nonmasking fault-tolerance can assist and simplify the design of masking fault-tolerance [105]. A special case of nonmasking fault-tolerance is stabilizing fault-tolerance [54,56], where, starting from an arbitrary state, the program is guaranteed to reach a legitimate state. Stabilizing systems are especially useful in handling unexpected transient faults. Moreover, this property is often critical in long-lived applications where faults are difficult to predict. We present an algorithm for adding nonmasking fault-tolerance to an existing program by performing three steps [4]. The first step is to identify the set of legitimate states of the fault-intolerant program. This set defines the constraints that should be true in the legitimate states. The second step is to identify a set of convergence actions that recover the program from illegitimate states to a legitimate state. This can be done by finding actions that satisfy one or more constraints. The last step consists of ensuring that the convergence actions do not interfere with each other.
In other words, the collective effect of all recovery actions should, eventually, lead the program to legitimate states. Expediting the Automated Revision To reduce the time complexity of the automated model revision, we first need to identify bottleneck(s) where symmetry and parallelism features can provide the maximum impact. Based on the analysis of the experimental results from Bonakdarpour and Kulkarni [30], the performance of the revision suffers from two major complexity obstacles, namely generation of the fault-span and resolution of deadlock states. To effectively target those bottlenecks, we present two approaches for utilizing the multi-core architecture in reducing the time required to complete the automated revision. The first approach is based on the distributed nature of the program being revised. In particular, when a new transition is added (respectively, removed), since the process executing it has only a partial view of the program variables, we need to add (respectively, remove) a group of transitions based on the variables that cannot be read by the process. The second approach is based on partitioning deadlock states among multiple threads. We show that this provides a small performance benefit. Based on the analysis of these results, we argue that the simple approach that parallelizes the group computation is likely to provide maximum benefit in the context of deadlock resolution in the automated revision of distributed programs. To further expedite the automated model revision, we use symmetry to speed up the revision algorithm. 1.0.4 Outline The remainder of this thesis is organized as follows. Chapter 2 describes the preliminaries and presents the elements of the automated incremental model revision. In Chapter 3, we present our approach to minimize the prerequisite knowledge of the details of the automated revision techniques and provide practical approaches to perform the automated revision under-the-hood.
In Chapter 4, we show how we utilize parallelism and symmetry to expedite the automated model revision. Subsequently, to revise a broader range of programs, in Chapter 5 we present our approach for the automated addition of nonmasking and stabilizing fault-tolerance to fault-intolerant programs. In Chapter 6, we show how we can reduce the designer burden by automatically discovering the legitimate states of the model being revised. Later, in Chapter 7, we analyze the effect of performing the automated model revision without explicitly specifying the legitimate states. We present the related work and literature review in Chapter 8. Finally, we present a summary of our contributions and future research directions in Chapter 9. Chapter 2 Preliminaries In this chapter, we formally present the elements of our automated model revision framework. Mainly, we define the notion of models, programs, specifications, faults, and fault-tolerance. The notion of distributed programs is adapted from Kulkarni and Arora [101]. Definitions of faults and fault-tolerance are based on the ones given by Arora and Gouda [12], Kulkarni [100], and Bonakdarpour [25]. At the end of this chapter, we illustrate the basic constructs of this framework using a real-world example, an application in sensor networks. 2.1 Models and Programs In this section, we present the formal definition of models and programs. A model is described by an abstract program. Intuitively, a program, p, is described using a finite set of variables Vp = {v0, v1, ..., vn}, n ≥ 0, and a finite set of program actions Ap = {a0, a1, ..., am}, m ≥ 0. Each variable, vi ∈ Vp, is associated with a finite domain of values, Di. Let ai ∈ Ap be an action; then ai is defined as follows: ai :: gi → sti; where gi is a Boolean formula involving program variables and sti is a deterministic terminating statement that updates a subset of program variables.
Before we give a formal definition of programs based on this intuition, we define the notion of state space and state predicate. Definition 2.1.1 (state) A state, s, of program p is identified by assigning each variable in Vp a value from its respective domain, Di. ∎ Definition 2.1.2 (state space) The state space, Sp, of p is the set of all possible states of p. ∎ Definition 2.1.3 (state predicate) A state predicate of p is a Boolean expression defined over the program variables Vp. Thus, a state predicate C of p identifies the subset SC ⊆ Sp, where C is true in a state s iff s ∈ SC. ∎ Note that a state predicate corresponds to a set of states where the Boolean value of the corresponding predicate is true. Thus, the intersection of two state predicates corresponds to the conjunction of the corresponding functions. Likewise, disjunction corresponds to union, and so on. Hence, we use these Boolean operators for constructing different state predicates. For example, let C1 and C2 be state predicates that identify the state space subsets SC1 and SC2; then C1 ∧ C2 (respectively C1 ∨ C2) corresponds to SC1 ∩ SC2 (respectively SC1 ∪ SC2). Definition 2.1.4 (transition predicate) Intuitively, a program action consists of one or more transitions. Let (ai :: gi → sti;) be an action of the program. Then, the corresponding transitions included in this action are αi, where αi = {(s0, s1) | gi is true in s0 and s1 is obtained by executing sti from s0}. ∎ Hence, a transition predicate corresponding to an action is a subset of Sp × Sp. A single transition t is specified by the tuple (s0, s1), where s0, s1 ∈ Sp, s0 is the before state, and s1 is the after state. Given a program that is defined in terms of Vp and Ap, we can now identify an equivalent representation in terms of its state space and transitions. In particular, based on Vp and Ap, we can compute Sp, the state space of p, and αi for each action of p. Based on the above, we formally define the program as follows.
Definition 2.1.5 (program) The program p is defined as the tuple (Sp, ⟨α1, α2, α3, ..., αt⟩) where αi ⊆ Sp × Sp. ∎ In many instances, we do not need the details of the individual actions of p. For these cases, we utilize the program transitions δp. For the program p = (Sp, ⟨α1, α2, α3, ..., αt⟩), the transitions of p are δp = (α1 ∪ α2 ∪ α3 ∪ ... ∪ αt). Whenever it is clear from the context, we use p and its transitions δp interchangeably. Definition 2.1.6 (closed) Let SC be a state predicate; then SC is closed in a program p iff (∀(s0, s1) : (s0, s1) ∈ δp : (s0 ∈ SC ⇒ s1 ∈ SC)). ∎ Definition 2.1.7 (enabled) The action ai is enabled in a state sj iff the guard gi is true in the state sj. ∎ Definition 2.1.8 (unfair computation) A sequence of states, σ = (s0, s1, ...), is an unfair computation of p iff 1. ∀j : 0 < j < length(σ) : (sj−1, sj) is obtained by executing a program action, say (ai :: gi → sti). That is, gi is true in sj−1 and sj is obtained by executing sti, and 2. if σ is finite and terminates in state sl, then all the guards of the program actions are false in sl. ∎ Computations can also be fair. Intuitively, a fair computation allows a fair resolution of non-determinism. Next, we define weak and strong fair computation. Definition 2.1.9 (weak-fair computation) σ = (s0, s1, ...) is a weak-fair computation of p iff: 1. σ is an unfair computation of p, and 2. if any action, say ai, of p is enabled in all states sj, sj+1, sj+2, ..., then ∃k : k ≥ j : sk+1 is obtained by executing sti in state sk. ∎ In a weak-fair computation, if some guard, say gi, eventually becomes continuously enabled, then the corresponding action is guaranteed to execute infinitely often. Definition 2.1.10 (strong-fair computation) σ = (s0, s1, ...) is a strong-fair computation iff: 1. σ is an unfair computation of p, and 2.
if there exists an action ai :: gi → sti of p such that gi is true in infinitely many states s of σ, then the transitions (s, s′), where s′ is obtained by executing sti in state s, are included infinitely often in σ. ∎ In a strong-fair computation, if some guard, say gi, is enabled infinitely often, then the corresponding action must execute infinitely often. Note that, in this dissertation, we refer to weak-fair computation as a fair computation. Also, our definition of weak-fair computation is equivalent to weak fairness from [1,9,65]. 2.2 Modeling Distributed Programs Since we focus on the design of distributed programs, we specify the transitions of the program in terms of a set of processes, where every process can read and write a subset of the program variables. The transitions of a process are obtained by considering how that process updates the program variables. The transitions of the program are the union of the transitions of its processes. Definition 2.2.1 (process) A process Pj is specified by the tuple (δj, Rj, Wj) where δj is a transition predicate in Sp and δp = δ1 ∪ δ2 ∪ ... ∪ δx, Rj is the set of variables that the process Pj is allowed to read, and Wj is the set of variables that the process Pj is allowed to write, where Wj ⊆ Rj ⊆ Vp (i.e., we assume that the program must first read a variable to be able to write it). ∎ Notation. Let va(s0) denote the value of variable va in the state s0. A process in a distributed program has a partial view of the program variables, which introduces write/read restrictions. Therefore, when a new program transition is added/removed, we need to add/remove a group of transitions based on the variables that cannot be read/written by that process. The write/read restrictions of the process are defined as follows. 2.2.1 Write Restrictions Let Pj = (δj, Rj, Wj) be a process; then the only variables that Pj can write are variables in Wj.
If Pj can only write the subset of variables Wj and the value of a variable outside Wj is changed in the transition (s0, s1), then that transition cannot be used in synthesizing the transitions of Pj. In other words, being able to write only the subset Wj is equivalent to providing a set of transitions writej(Wj) that Pj cannot use in the revision algorithm. Clearly, the transition predicate writej(Wj) is defined as follows. writej(Wj) = {(s0, s1) : (∃va :: va ∈ (Vp − Wj) : va(s0) ≠ va(s1))}. 2.2.2 Read Restrictions Let Pj = (δj, Rj, Wj); the only variables that Pj can read are variables belonging to Rj. Let t = (s0, s1) be a transition in δj; then we define groupj(t) as the group of transitions associated with t. Such a group includes transitions of the form (s2, s3) where s0 and s2 (respectively s1 and s3) are indistinguishable for Pj. By indistinguishable, we mean that they differ only in terms of the variables that Pj cannot read. Thus, we formally define groupj(t) as follows: groupj(t) = {(s2, s3) | (∧_{v ∉ Rj} (v(s0) = v(s1) ∧ v(s2) = v(s3))) ∧ (∧_{v ∈ Rj} (v(s0) = v(s2) ∧ v(s1) = v(s3)))}. 2.2.3 Example (Group) Let p be a program specified using the set of processes P = {P1(= (δ1, R1, W1)), P2(= (δ2, R2, W2))}, the set of variables V = {v1, v2}, and the domains Dv1 = {0, 1} and Dv2 = {0, 1}. Also, let R1 = {v1} (respectively R2 = {v2}) and W1 = {v1} (respectively W2 = {v2}) (i.e., each process can only read and write its own variable). Now, consider the transition from the state (v1 = 0, v2 = 0) to the state (v1 = 1, v2 = 0). If this transition is to be included in δ1, then it is necessary to include the transition from the state (v1 = 0, v2 = 1) to the state (v1 = 1, v2 = 1). Clearly, this should be the case since P1 is not allowed to read the variable v2; therefore we have to consider the case where v2 = 0 as well as the case where v2 = 1. The automated model revision algorithm adds/removes program transitions to complete the revision.
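The group construction of Section 2.2.2 can be sketched directly. The following is an illustrative sketch (not part of the dissertation's toolset, which operates on MDDs): states are dicts over the program variables, and groupj(t) is enumerated by varying the variables the process cannot read while keeping them unchanged across the step. It assumes, per the remark later in this chapter, that t itself does not violate the process's write restrictions.

```python
from itertools import product

def group(t, readable, domains):
    """group_j(t): all transitions (s2, s3) indistinguishable from t = (s0, s1)
    for a process that reads only `readable`. Readable variables copy t;
    unreadable variables range over their domains but are unchanged by the step.
    Assumes t does not change any unreadable variable (write restrictions hold)."""
    s0, s1 = t
    unreadable = [v for v in sorted(domains) if v not in readable]
    result = []
    for vals in product(*(domains[v] for v in unreadable)):
        hidden = dict(zip(unreadable, vals))
        s2 = {**{v: s0[v] for v in readable}, **hidden}
        s3 = {**{v: s1[v] for v in readable}, **hidden}
        result.append((s2, s3))
    return result

domains = {"v1": (0, 1), "v2": (0, 1)}
# As in Example 2.2.3: P1 reads/writes only v1; t flips v1 from 0 to 1 with v2 = 0.
t = ({"v1": 0, "v2": 0}, {"v1": 1, "v2": 0})
g = group(t, readable={"v1"}, domains=domains)
# The group also contains the transition with v2 = 1, as the example requires.
assert ({"v1": 0, "v2": 1}, {"v1": 1, "v2": 1}) in g and len(g) == 2
```

The sketch enumerates the hidden variables explicitly; the tool avoids this enumeration by representing the same set symbolically.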
Therefore, whenever a transition is added or removed, the revision algorithm must add or remove the corresponding group.

2.2.4 The Group Algorithm

The group algorithm (cf. Algorithm 1) takes a transition set, trans, as an input and computes the transition group, transg, as an output. Specifically, it creates an array, tPred[], with a number of elements equal to the number of processes, such that tPred[i] holds the part of the group transitions associated with the process i (Line 1). Now, based on W_i (i.e., the set of variables the process i is allowed to write), the group algorithm uses the function AllowWrite_i(W_i) to find the set of all transitions which process i is permitted to execute. Then, it uses this set to find which of the transitions in trans process i is responsible for (Line 3). Later, it uses tPred[i] and R_i in the function FindGroup to account for all variables that process i cannot read and compute the transitions that cannot be distinguished by i (Line 4). Once the steps in Lines 3 and 4 are completed for all processes, the algorithm collects the transitions of the group in transg (Lines 7-9) and returns.

Algorithm 1 Group
Input: transition set trans.
Output: transition group transg.
1:  MDD* tPred := MDD[numberOfProcesses];
2:  for i := 0 to numberOfProcesses do
3:      tPred[i] := trans ∧ AllowWrite_i(W_i);
4:      tPred[i] := FindGroup(tPred[i], R_i);
5:  end for
6:  MDD transg := false;
7:  for i := 0 to numberOfProcesses do
8:      transg := transg ∨ tPred[i];
9:  end for
10: return transg;

Observe that for the transition t, group_j(t) can be executed by process P_j while respecting its read/write restrictions. Let tr_j be a set of transitions. Now, based on the notion of read/write restrictions, tr_j can be included in δ_j iff there exist transitions t1, t2, ..., tl such that tr_j = group_j(t1) ∪ group_j(t2) ∪ ... ∪ group_j(tl). Furthermore, let p be a program whose transitions are specified with the processes P1, P2, ..., Px. Also, let trp denote a set of transitions. Then, trp can be included as transitions of p iff there exists a set of transitions tr_1, tr_2, ..., tr_x such that trp = ∪_{j=1}^{x} tr_j and tr_j can be included as transitions of process P_j.

The way we use this group operation is as follows: when we compute a set of transitions, say tr, that we need to either add or remove, we ensure that tr can be implemented using the read/write restrictions of the synthesized program. Hence, often, we cannot add/remove tr as is. Instead, we need to revise tr so that it respects the read/write restrictions of the program being revised. One operation we utilize for this is called Group, where Group_max(tr) returns a superset, say tr_large, that can be included as transitions of the synthesized program. The intuition of the Group_max operation is as follows (cf. Algorithm 1): given a set of transitions, say tr, we use a loop that traverses through all the processes. While traversing process P_j, it computes the subset of transitions, say tr_j, in tr such that each transition in tr_j satisfies the write restrictions of process P_j. Then, for each transition in tr_j, it applies the group operation described above to compute the other transitions that must be included. (Note that with the use of BDDs and MDDs (i.e., Binary and Multi-Valued Decision Diagrams [125]), we do not have to actually evaluate each transition in tr_j separately to compute the corresponding group.) Finally, it takes the union of all transitions obtained thus to compute Group_max(tr).

Another operation we utilize is Group_min. This operation returns a subset, say tr_small, such that tr_small can be included as transitions of the revised program.
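The Group_max loop just described can be sketched in explicit-state form. This is an assumption-laden illustration: the thesis computes these sets symbolically with MDDs, whereas plain Python sets of transitions stand in here, and all helper names are ours.

```python
from itertools import product

# States are value tuples over VARS, in order; domains are illustrative.
VARS = ("v1", "v2")
DOM = {"v1": (0, 1), "v2": (0, 1)}

def respects_write(t, w):
    s0, s1 = t  # variables outside w must be left unchanged
    return all(s0[i] == s1[i] for i, v in enumerate(VARS) if v not in w)

def group(t, r):
    s0, s1 = t
    if any(s0[i] != s1[i] for i, v in enumerate(VARS) if v not in r):
        return set()  # t changes a variable the process cannot read
    out = set()
    for s2 in product(*(DOM[v] for v in VARS)):
        if any(s2[i] != s0[i] for i, v in enumerate(VARS) if v in r):
            continue  # s2 must agree with s0 on every readable variable
        # s3 takes its readable values from s1 and the rest from s2.
        s3 = tuple(s1[i] if v in r else s2[i] for i, v in enumerate(VARS))
        out.add((s2, s3))
    return out

def group_max(tr, processes):
    # Lines 2-9 of Algorithm 1: per process, keep the transitions it may
    # execute, close them under the group operation, then take the union.
    result = set()
    for w, r in processes:
        for t in (t for t in tr if respects_write(t, w)):
            result |= group(t, r)
    return result

# P1 reads/writes only v1; P2 reads/writes only v2.
procs = [({"v1"}, {"v1"}), ({"v2"}, {"v2"})]
tr = {((0, 0), (1, 0))}  # v1: 0 -> 1 while v2 stays 0
closed = group_max(tr, procs)
```

Running this on the example of Section 2.2.3 adds the transition ((0, 1), (1, 1)), exactly the companion transition the example says must be included.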
The operation Group_min is implemented in a similar fashion to that of Group_max by traversing through all processes.

Remark. Since Group_max is the operation that is used most frequently in our algorithms, for simplicity of presentation, we drop the subscript and call it Group.

Remark. Note that group_j(t) is defined only if t does not violate the write restrictions of process P_j. However, for brevity, we do not specify this whenever it is clear from the context.

The tasks involved in computing one such group depend on the number of processes and the number of variables in the program. As can be seen from the formula above, to compute this group the algorithm (cf. Algorithm 1) needs to go through all the processes in the program, and for each process it has to go through all the variables.

2.3 Specification

Following Alpern and Schneider [7], it can be shown that any specification can be partitioned into some "safety" specification and some "liveness" specification. Intuitively, the safety specification indicates that nothing bad should happen. And, a liveness specification requires that something good must eventually happen. Formally,

Definition 2.3.1 (safety) The safety specification, Sf_p, for program p is specified in terms of bad states, SPEC_bs, and bad transitions, SPEC_bt. A sequence (s0, s1, ...) (denoted by σ) satisfies the safety specification of p iff the following two conditions are satisfied:
1. ∀j : 0 ≤ j < len(σ) : s_j ∉ SPEC_bs, and
2. ∀j : 0 < j < len(σ) : (s_{j−1}, s_j) ∉ SPEC_bt. ∎

Definition 2.3.2 (liveness) The liveness specification, Lv_p, of program p is specified in terms of one or more leads-to properties of the form F ↝ T. A sequence σ = (s0, s1, ...) satisfies F ↝ T iff ∀j : (F is true in s_j ⇒ ∃k : j ≤ k < len(σ) : T is true in s_k). We assume that F ∩ T = {}. If not, we can replace the property by ((F − T) ↝ T). ∎

Remark. Observe that if p satisfies F ↝ T, then it cannot contain computations that start from F
and reach a deadlock/termination state without reaching a state in T. Likewise, it cannot contain computations that start from F and reach a cycle without reaching T.

Definition 2.3.3 (specification) A specification, say spec, is a tuple ⟨Sf_p, Lv_p⟩, where Sf_p is a safety specification and Lv_p is a liveness specification. A sequence σ satisfies spec iff it satisfies Sf_p and Lv_p. ∎

Based on the above definition, for simplicity, given a specification, say spec, defined as ⟨Sf_p, Lv_p⟩, we say that spec is an intersection of Sf_p and Lv_p.

Given a program p and its specification, say spec, p may not satisfy spec from an arbitrary state. Rather, it satisfies spec only from its legitimate states (also known as the invariant). We use the term legitimate state predicate I to denote the set of legitimate states of p. In particular, we say that a program p satisfies spec from I iff the following two conditions are satisfied: 1. I is closed in p, and 2. every computation of p that starts from a state in I satisfies spec. A program p satisfies the (safety, liveness, or a combination of both) specification from the legitimate states, I, iff every computation of p that starts from a state in I satisfies that specification.

Definition 2.3.4 (legitimate state predicate) Let I be a state predicate such that p satisfies spec from I; then we say that I is the legitimate state predicate of p for spec. Note that a program may have multiple legitimate state predicates. ∎

2.4 Faults

The faults that may perturb a program are systematically represented by transitions. Based on the classification of faults from Avizienis et al. [18], this representation suffices for physical faults, process faults, message faults, and improper initialization. It is not intended for program errors (e.g., buffer overflow). However, if such an error exhibits a behavior, such as a component crash, it can be modeled using this approach. Thus, a fault for p(= (S_p, δ_p)) is a subset of S_p × S_p.
We use 'p[]f' to mean 'p in the presence of f'. The transitions of p[]f are obtained by taking the union of the transitions of p and the transitions of f. Just as we defined computations of a program in Section 2.1, we define the notion of program computations in the presence of faults. In particular, a sequence of states, σ = (s0, s1, ...), is a computation of p[]f (i.e., a computation of p(= (S_p, δ_p)) in the presence of f) iff the following three conditions are satisfied:
1. ∀j : 0 < j < len(σ) : (s_{j−1}, s_j) ∈ (δ_p ∪ f), and
2. if (s0, s1, ...) is finite and terminates in state s_l, then there does not exist a state s such that (s_l, s) ∈ δ_p, and
3. if σ is infinite, then ∃n : ∀j > n : (s_{j−1}, s_j) ∈ δ_p.

Thus, if σ is a computation of p in the presence of f, then in each step of σ, either a transition of p occurs or a transition of f occurs. Additionally, σ is finite only if it reaches a state from where the program has no outgoing transition. And, if σ is infinite, then σ has a suffix where only program transitions execute. We note that the last requirement can be relaxed to require that σ has a sufficiently long subsequence where only program transitions execute. However, to avoid details such as the length of the subsequence, we require that σ has a suffix where only program transitions execute.

We use f-span (fault-span) to identify the set of states from where the program satisfies its fault-tolerance requirement.

Definition 2.4.1 (f-span) Let T be a state predicate; then T is an f-span of p from I iff I ⇒ T and (∀(s0, s1) : (s0, s1) ∈ p[]f : (s0 ∈ T ⇒ s1 ∈ T)). ∎

Thus, at each state where I of p is true, the T of p from I is also true. Also, T, like I, is closed in p. Moreover, if any action in f is executed in a state where T is true, the resulting state is also one where T is true.
It follows that for all computations of p that start at states where I is true, T is a boundary in the state space of p, up to which (but not beyond which) the state of p may be perturbed by the occurrence of the actions in f.

2.5 Fault-Tolerance

In this section, we present a formal definition of three classical levels of fault-tolerance, namely, failsafe, masking, and nonmasking fault-tolerance.

Fault-Tolerance. In the absence of faults, a program, p, satisfies its specification and remains in its legitimate states. In the presence of faults, it may be perturbed to a state outside its legitimate states. By definition, when the program is perturbed by faults, its state will be one in the corresponding f-span. From such a state, it is desired that p does not result in a failure, i.e., p does not violate its safety specification. Furthermore, p recovers to its legitimate states, from where p subsequently satisfies both its safety and liveness specification. Based on this intuition, we now define what it means for a program to be (masking) fault-tolerant.

Let Sf_p and Lv_p be the safety and liveness specifications for program p. We say that p is masking fault-tolerant to Sf_p and Lv_p from I iff the following two conditions hold:
1. p satisfies Sf_p and Lv_p from I.
2. ∃T ::
   (a) T is an f-span of p from I.
   (b) p[]f satisfies Sf_p from T.
   (c) Every computation of p[]f that starts from a state in T has a state in I.

While masking fault-tolerance is ideal, for reasons of cost and feasibility, a weaker level of fault-tolerance is often required. Two commonly considered weaker levels of fault-tolerance are failsafe and nonmasking. In particular, we say that p is failsafe fault-tolerant [72] if the conditions 1, 2a, and 2b are satisfied in the above definition. And, we say that p is nonmasking fault-tolerant [71] if the conditions 1, 2a, and 2c are satisfied in the above definition.
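The closure conditions of Definition 2.4.1 can be checked directly on an explicit transition system. The sketch below is our own toy encoding (states 0 and 1, sets of transition pairs), not part of the thesis tooling:

```python
def is_fspan(T, I, prog, faults):
    """Definition 2.4.1: I => T, and T is closed under p[]f."""
    if not I <= T:  # I must imply T
        return False
    # Every program or fault transition starting inside T must stay in T.
    return all(s1 in T for (s0, s1) in (prog | faults) if s0 in T)

I = {0}                    # legitimate states
T = {0, 1}                 # candidate fault-span
prog = {(0, 0), (1, 0)}    # (1, 0) is a recovery transition back to I
faults = {(0, 1)}          # the fault perturbs 0 to 1
ok = is_fspan(T, I, prog, faults)
```

Here T = {0, 1} is an f-span, while I itself is not, because the fault transition (0, 1) leaves it.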
2.6 Example: (Data Dissemination Protocol in Sensor Networks)

In this example, we show how we model distributed programs and illustrate some of the definitions from the previous sections. We use the program Infuse, a time division multiple access (TDMA) based reliable data dissemination protocol in sensor networks [104]. In this example, a base station initiates a computation in which data are to be sent to all sensors in the network. The data message is split into fixed-size packets. Each packet is given a sequence number. The base station starts transmitting the packets to its neighbor(s) in specified time slots, in the order of the packet sequence numbers. Subsequently, when the neighbor(s) receive a message, they, in turn, retransmit it to their neighbors, and so on. The computation ends when all sensors in the network receive all the messages.

This protocol does not require explicit acknowledgments to be sent back from the receiver to the sender. For example, when a sensor sends a message to one of its neighbors, it waits before sending the next message until it knows that the receiver did receive the message. In other words, it gets its acknowledgment by listening to the messages the neighboring sensors are currently transmitting. It only advances to the next message if it knows that all its neighbors have attempted to transmit the last message it had sent.

To concisely describe the transitions of the program, we use Dijkstra's guarded command [53] notation:

guard → statement;

where guard is a Boolean expression over program variables, and the statement describes how program variables are updated (the statement always terminates). A guarded command of the form g → st corresponds to transitions of the form {(s0, s1) | g evaluates to true in s0 and s1 is obtained by executing st from s0}.

The Program. In this example, we arrange the processes in a linear topology. The base station has N packets to send to M processes.
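The correspondence between a guarded command and its transition set can be sketched in a few lines of Python (our own explicit-state encoding; abbreviating the state to the pair (s.0, r.1) and bounding the domains at 3 are illustrative assumptions):

```python
def transitions_of(guard, stmt, states):
    # {(s0, s1) | guard is true in s0 and s1 is obtained by executing stmt}
    return [(s, stmt(s)) for s in states if guard(s)]

# The base station's action: transmit the next packet once the neighbor
# has received the previous one (guard s.0 = r.1, statement s.0 := s.0 + 1).
states = [(a, b) for a in range(3) for b in range(3)]
in1 = transitions_of(lambda st: st[0] == st[1],      # guard: s.0 = r.1
                     lambda st: (st[0] + 1, st[1]),  # stmt: s.0 := s.0 + 1
                     states)
```

Only the three states with s.0 = r.1 enable the action, so the command denotes exactly three transitions here.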
The fault-intolerant program transmits the packets in a simple pipeline. For this, each process keeps track of the messages (received/sent) using two variables, r.j and s.j, where r.j is the highest message sequence number received by process j and s.j is the sequence number of the message currently being transmitted by process j. Process j increments r.j every time it receives a new message. It also sets s.j to be the sequence number of the message it is transmitting. The base station transmits a packet if its neighbor has received the previous packet (action IN1). A process j, j > 0, receives a packet from its predecessor if its successor had received the previous packet (actions IN2 and IN3). Thus, the actions of the fault-intolerant program are as follows:

Action for base station:
(IN1) (s.0 = r.1) → s.0 := s.0 + 1;

Action for process j ∈ {1..M−1}:
(IN2) ((r.j ≤ r.(j−1)) ∧ (s.j = r.(j+1)) ∧ (s.(j−1) = r.j + 1)) → r.j, s.j := r.j + 1, s.j + 1;

Action for process M:
(IN3) ((r.M ≤ r.(M−1)) ∧ (s.(M−1) = s.M + 1)) → r.M, s.M := r.M + 1, s.M + 1;

Faults. The faults we consider are such that when a fault occurs, a message is lost. To model such faults for the base station, we add action (F1), where the base station increments s.0 even though its successor has not received the previous packet. To model such an action for other processes, we add action (F2), where a process advances s.j even though the successor has not yet received the previous packet.

(F1) true → s.0 := s.0 + 1;
(F2) ((r.j ≤ r.(j−1)) ∧ (s.(j−1) = s.(j+1))) → r.j, s.j := r.j + 1, s.j + 1;

The Set of Legitimate States. The constraints that define the legitimate states in the case of the data dissemination program are as follows. The first constraint states that initially the base station has all the packets (C1). A process cannot receive a packet if its predecessor has not received it (C2), and cannot transmit a packet that it does not have (C3). A process transmits a packet that is expected by its successor (C4 and C5).
(C1) (s.0 = N)
(C2) (∀j : 0 ≤ j < M : r.(j+1) ≤ r.j)
(C3) (∀j : 1 ≤ j ≤ M : s.j ≤ r.j)
(C4) (∀j : 0 ≤ j < M : s.j ≤ s.(j+1) + 1)
(C5) (∀j : 0 ≤ j < M : r.j ≤ s.(j+1) + 1)

The fault-tolerant program obtained by the revision recovers to these legitimate states using the following recovery actions:

(R1) (r.j > s.(j+1) + 1) ∧ (s.j > s.(j+1) + 1) ∧ (r.j + 1 = s.(j−1)) → r.j := s.(j−1), s.j := s.(j+1) + 1;
(R2) (r.j > s.(j+1) + 1) ∧ (s.j > s.(j+1) + 1) → s.j := s.(j+1) + 1;

Chapter 3

Under-The-Hood Revision

In this chapter, we present our contributions on performing the automated model revision while minimizing the effort and the expertise needed to perform such revision. We show how the designer can continue to utilize existing design tools while the revision is done under-the-hood. This makes automated revision more usable, as well as makes it available across different design tools. Specifically, we focus on integrating the automated revision with the SCR toolset. Part of the reason behind our choice of the SCR toolset is that SCR descriptions are precise, unambiguous, and consistent. Also, many industrial firms use the SCR toolset to develop mission-critical systems.

This chapter is organized as follows. In Section 3.1, we briefly describe the SCR formal method and provide highlights of the SCR toolset. In Section 3.2, we present our approach for transforming the SCR specification into input for SYCRAFT. Then, in Section 3.3, we illustrate our approach using two case studies: an Altitude Switch Controller and an Automobile Cruise Controller. Finally, we summarize the chapter in Section 3.4.

3.1 Introduction to SCR

The Software Cost Reduction (SCR) formal method [22,83,84] is a set of techniques for constructing and evaluating requirements documents for high assurance systems. SCR uses tables to describe system behaviors and properties, as these tables provide a precise description of the model and capture the mathematical representation of systems. But these tables consume a considerable amount of time and resources to verify. Therefore, techniques and tools have been developed to provide a comprehensive framework that automates the validation and verification of the SCR tables.
Hence, the SCR toolset [22,83–87] was created to serve this purpose. In this section, we describe the SCR formal method and show how the SCR toolset is used in the design and verification of event-driven systems.

3.1.1 SCR Formal Method

SCR is a set of formal methods for constructing and verifying requirements specification documents. The U.S. Naval Research Laboratory (NRL) developed SCR in the late 1970s. Since then, it has been used in constructing many mission-critical systems. SCR was used to design and model the A-7 aircraft and to document its requirements. It was also used in the design of the requirements specification of the Operational Flight Program (OFP) for the A-6 aircraft [114], the Bell telephone [91], submarine communication systems, nuclear plants [88], and many other systems.

The SCR formal method specifies system requirements using tabular notation. Tables provide a precise and compact way to describe requirements, making it possible for the user to automatically model and analyze those requirements to identify errors. SCR uses tables to describe both the system and its environment [85,86]. The environmental quantities whose values change the system behavior are described using monitored variables. The environmental quantities whose values are changed by the system are represented by controlled variables.

To relate the variables of the system and represent constraints on those variables, the state machine model of SCR is based on the "Four-Variable Model" that was initially introduced by Parnas [120]. This model describes the desired functionality of an embedded system in terms of four relations, as follows.

• NAT: the set of relations that describe the way in which the values of the variables (monitored or controlled) are restricted by the laws of the environment, whether these laws are imposed by previously deployed systems or by physical laws.
• REQ: the set of relations that defines the way in which the system changes the values of the controlled variables based on changes in the values of the monitored variables.

• IN: the set of relations that maps the values of the monitored quantities to the values of the input variables.

• OUT: the relation that maps the value of the output variables to the controlled quantities.

The IN and OUT relations describe the behavior of the input and output devices in some level of isolation. Thus, the IN and OUT relations give the requirements specification the freedom of specifying the observed system behavior without going into further details.

Four more constructs are also used in SCR. These are modes, terms, conditions, and events. A mode class is a state machine whose states are called modes. Changes from one mode to another are triggered by events. A term is a representation of a group of input variables, mode classes, or other terms in one single name. A condition is a predicate defined on a single system state. Finally, an event is a predicate defined on two system states and is triggered by a change in a system entity.

The following state machine formally represents a typical SCR system: Σ = (S, S0, E^m, T), where S is the state space, S0 ⊆ S is the initial state set, E^m represents the set of monitored events (changes in the values of the monitored variables), and T is the function that identifies the transitions of the system based on monitored events (i.e., T maps e ∈ E^m and the current state s ∈ S to the next state s′ ∈ S) [83].

In SCR, systems are represented in the ideal state and with no time representation. The model defines the system as a before state, in terms of the system entities with guards as conditions, and an after state. The system transitions from the before state to the after state by transitions triggered by a change in an input variable.
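The quadruple Σ = (S, S0, E^m, T) can be rendered as a toy Python sketch. The monitored variable mPressureHigh and controlled variable cAlarm below are hypothetical, chosen only to illustrate the before-state/after-state shape of T:

```python
# Hypothetical one-mode system: raise an alarm while pressure is high.
S0 = {"mPressureHigh": False, "cAlarm": False}  # initial state

def T(event, s):
    """Map a monitored event and the current state to the next state."""
    nxt = dict(s)
    if event == "@T(mPressureHigh)" and not s["mPressureHigh"]:
        nxt["mPressureHigh"], nxt["cAlarm"] = True, True
    elif event == "@F(mPressureHigh)" and s["mPressureHigh"]:
        nxt["mPressureHigh"], nxt["cAlarm"] = False, False
    return nxt  # events that do not apply leave the state unchanged

s1 = T("@T(mPressureHigh)", S0)
```

Note that an event whose triggering condition does not actually change (e.g., @F(mPressureHigh) in the initial state) produces no transition, matching the two-state reading of events above.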
These transitions are part of a transformation, T, which is defined by a set of functions that are represented by the SCR tables.

The SCR toolset [22,83–87] is a set of tools for constructing and validating requirements specifications based on the SCR formal method. It is composed of a specification editor, a user interface for creating and editing the specification in a tabular way; a dependency graph browser, which uses a directed graph representation to show the dependencies among variables; and a simulator, which uses a symbolic variable representation to test whether the desired system behavior is satisfied. The SCR toolset also includes different kinds of checkers: a consistency checker, a model checker, and a property checker. This set of tools helps system designers to check and analyze the specifications and to automatically detect errors and missed cases.

To illustrate these concepts, consider the altitude switch controller system (ASW) [21], which is responsible for turning on a device of interest (DOI) when the aircraft altitude is below 2,000 feet. The ASW will be disabled if it receives an Inhibit signal. A Reset signal will reset the system to its initial state. The ASW has three altitude meters: two are digital and one is analog. It also has a fault indicator that is switched on if the DOI does not turn on in two seconds, if the system fails to initialize, or if all three altitude meters do not work for more than two seconds.

The SCR specification for the ASW system is constructed with five monitored variables, as shown in Table 3.1, one controlled variable, and a mode class. The Boolean variable mAltBelow is true when the aircraft descends below 2,000 feet. The mDOIStatus variable is on when the DOI is powered on. The mInitializing variable indicates whether the system is being initialized. The mInhibit variable indicates whether the system can turn on the DOI or not. The mReset variable monitors the reset request. The controlled variable cWakeupDOI will be initialized to false.
It will be set to true to wake up the DOI.

Name          | Type    | Init. Value | Description
mAltBelow     | Boolean | true        | true iff alt. below threshold
mDOIStatus    | enum    | off         | on if DOI powered on; else off
mInitializing | Boolean | true        | true iff system initializing
mInhibit      | Boolean | false       | true iff DOI power on inhibited
mReset        | Boolean | false       | true iff Reset button is pushed

Table 3.1: Monitored variables of the altitude switch controller system (ASW).

Table 3.2 describes the mode class mcStatus. Each transition in the mode table describes the system transition from one mode to another as a result of a change in one or more monitored variables. There are three modes for the mode class mcStatus: init, standby, and awaitDOIon. For example, the first row of Table 3.2 states that the ASW transitions from the init mode to standby if it is not initializing.

Old Mode   | Event                                                | New Mode
init       | @F(mInitializing)                                    | standby
standby    | @T(mReset)                                           | init
standby    | @T(mAltBelow) WHEN NOT mInhibit AND mDOIStatus = off | awaitDOIon
awaitDOIon | @T(mDOIStatus = on)                                  | standby
awaitDOIon | @T(mReset)                                           | init

Table 3.2: Mode transition table for the mode class mcStatus.

Table 3.3 contains the description of the condition table for the controlled variable cWakeupDOI. The value of the controlled variable cWakeupDOI depends mainly on the current value of the mode class mcStatus. If the value of mcStatus is awaitDOIon, then the DOI can be powered on. If the value of mcStatus is init or standby, the DOI will be turned off.

Mode          | cWakeupDOI
init, standby | false
awaitDOIon    | true

Table 3.3: Condition table for cWakeupDOI.

There are two major advantages of the SCR toolset. First, all the tools interface with each other automatically. Hence, they behave as a single application [83]. Second, the toolset has been adopted by industry and was used in the development of many real-world applications [83]. Moreover, the toolset stores the specifications in an ASCII text file from which other systems can have access to those specifications.
More specifically, we use this file as an interface channel to communicate with the tool SYCRAFT.

3.1.2 Automated Model Revision to Add Fault-Tolerance

Programs are subject to faults that may not be preventable. A program may function correctly in the absence of faults. However, it may not provide the desired functionality in the presence of faults. The automated model revision to add fault-tolerance is the process of transforming a fault-intolerant program into a fault-tolerant one. This transformation guarantees that the program continues to satisfy its specification in the presence of faults. SYCRAFT, described briefly next, is a framework for automating such revisions [27,30]. In SYCRAFT, programs (input and output) and faults are represented using guarded commands. SYCRAFT takes both the program and the faults as input and generates the fault-tolerant version of the program as output. To add fault-tolerance, SYCRAFT first identifies states from where faults alone can violate the safety specification. It removes such states and the transitions that reach them. Then, it adds recovery transitions to ensure that after the occurrence of faults, the program recovers to its legitimate states.

3.2 Integration of SCR toolset and SYCRAFT

In this section, we first describe how we translate the SCR program into an input for SYCRAFT. Then, we describe the modeling of faults and subsequently give an outline of our tool for adding the automated model revision to the SCR toolset. Our approach allows one to perform separation of concerns, where the fault-tolerance aspect is relegated only to the tool that performs the automated addition of fault-tolerance.

3.2.1 Transforming SCR specifications into SYCRAFT input

The integration of SCR and SYCRAFT mainly focuses on the mode table, since the mode table captures the system behavior in response to different inputs. Hence, the mode table is the most relevant in terms of the effect of the faults on system behavior.
The integration focuses on translating the mode table so that it can be used as an input to SYCRAFT, and then translating the SYCRAFT output so as to generate the mode table of the fault-tolerant SCR specification.

We illustrate the mode table in SCR using the simple example mRoom (cf. Table 3.4). As the name suggests, this table describes the different modes of mRoom and shows how they change in response to system events. mRoom has two modes, Dark and Light, and one monitored variable, mSwitchOn. This system switches the room from the Dark mode to the Light mode if the event @T(mSwitchOn) occurs, i.e., if the monitored variable mSwitchOn changes its value from false to true.

Old Mode | Event         | New Mode
Dark     | @T(mSwitchOn) | Light
Light    | @F(mSwitchOn) | Dark

Table 3.4: mRoom mode table.

To add fault-tolerance to the SCR specification, we need to convert the SCR tables into guarded commands. In particular, we need to translate modes, conditions, terms, and events. Next, we describe how we translate the SCR events into guarded commands for SYCRAFT. Events in SCR occur at the time when the value of their condition is switched from false to true, or vice versa, in a single transition. It is not only the current state of the monitored variable that initiates the transition; rather, it is the combination of both the current and the old states. The notation used to represent events is as follows:

(@T(c) WHEN d) ≡ (¬c ∧ c′ ∧ d)

where c represents the condition value in the before state and c′ represents the condition value in the after state [83]. For example, consider the SCR mode table entry in the mRoom mode class:

FROM "Dark" EVENT "@T(mSwitchOn)" TO "Light"

In the "before" state, the mode value of mRoom is Dark and the condition mSwitchOn is false. And, in the "after" state, the mode value mRoom = Light and the condition mSwitchOn = true.
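The two-state reading of events can be sketched directly in Python. The helper names below are ours, not the SCR toolset's API, and conditions are modeled as predicates over a state dictionary:

```python
def at_T(cond, before, after, when=lambda s: True):
    """@T(cond) WHEN d  ==  (not cond) and cond' and d"""
    return (not cond(before)) and cond(after) and when(before)

def at_F(cond, before, after, when=lambda s: True):
    """@F(cond) WHEN d  ==  cond and (not cond') and d"""
    return cond(before) and (not cond(after)) and when(before)

# The mRoom table entry: @T(mSwitchOn) fires on this state pair.
before = {"mRoom": "Dark", "mSwitchOn": False}
after  = {"mRoom": "Light", "mSwitchOn": True}
fires = at_T(lambda s: s["mSwitchOn"], before, after)
```

Note that @T does not fire when the condition was already true, which is exactly why the guard of the translated command below must test the before-state value.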
In SYCRAFT, transitions (guarded commands) are represented in the following format:

(g → st)

The guard, g, is a predicate whose value must be true in the before state in order for the statement, st, to execute. The guarded command translation for the mRoom table entry would be:

((mRoom = Dark) ∧ (mSwitchOn = false)) → mRoom := Light; mSwitchOn := true;

Likewise, we need to convert states, terms, and modes into the corresponding input for SYCRAFT. In particular, each mode is translated into corresponding states that a program could reach. Conditions are translated into guards that determine when actions can be executed.

3.2.2 Translation from SCR Syntax to SYCRAFT Syntax

In this translation, we preserve the model abstraction as well as compactness to avoid the state explosion problem. The goal of this translation is to translate the SCR table syntax into an action language that SYCRAFT can process. The translation rules are based on the fact that the transition relation in the SCR tables is identified using a condition on the current state and another condition on the next state. For example, the current state in SCR is defined using the "FROM mode" with a condition, and the next state is identified by the "TO mode". In the SYCRAFT syntax, we translate the "FROM mode" into "mcMode == mode" and the "TO mode" into "→ mcMode := mode". Table 3.5 shows some of the translation rules.

SCR Syntax          | SYCRAFT Syntax
MODETRANS "mcMode"; | process "mcMode";
FROM                | (
"Source Mode"       | (mcMode = "Source Mode") &&
EVENT               | )(
@F(cond1)           | !cond1
@T(cond1)           | cond1
WHEN                | &&
TO                  | ) →
"Target Mode"       | mcMode := "Target Mode";

Table 3.5: Translation rules.

3.2.3 Modeling of faults

Faults in SYCRAFT are also modeled using guarded commands that change program variables. To effectively model faults for designers, we can model them using tables similar to the way the SCR specification is specified. Note that this would require changes to the SCR toolset.
However, the change is minimal in that it would require adding an extra table for faults rather than putting all program/fault actions together, as was done in [22]. Note that with this change, we do not expect the designer's task to become more complex, since faults are specified using a method similar to describing programs. For simplicity, currently, we let faults be directly represented using guarded commands so that modification to the SCR toolset is not necessary. Likewise, it would be necessary for the designer to specify requirements in the presence of faults. These specifications are also similar to those used in SCR for requirements in the absence of faults.

3.2.4 Adding fault-tolerance to SCR specifications

The scenario of adding fault-tolerance to the SCR specifications is described in Figure 3.1. The cycle begins at step 1 by creating the requirements specification using the SCR toolset. The specification in SCR format is exported from the SCR toolset in step 2. In step 3, the middle layer imports the SCR specification, and the first translation phase generates an output file for use in the addition of fault-tolerance by SYCRAFT. This file is imported in step 4 into SYCRAFT, which generates a fault-tolerant version of the program in step 5. In step 6, the middle layer imports the SYCRAFT output and, in step 7, translates it back to the SCR specification. Finally, in step 8, the file is imported back into the SCR toolset so that it can be visualized using the SCR toolset. Thus, the translation layer shown in Figure 3.1 allows the automated revision to add fault-tolerance where the addition is done under-the-hood, meaning that it allows users of the SCR tools to add fault-tolerance to specifications without knowing the details of SYCRAFT or the theory on which SYCRAFT is based.
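The core of the middle layer's step-3 translation can be sketched as a string-to-string rewrite of one mode-table row. The Python function below is our own illustration of the rules in Table 3.5, not the tool's actual code, and it handles only single-condition @T/@F events:

```python
def translate_row(mode_class, src, event, dst):
    """Turn one SCR mode-table row into a SYCRAFT-style guarded command."""
    kind, cond = event                          # e.g. ("@F", "mInitializing")
    old = "true" if kind == "@F" else "false"   # condition in the before state
    new = "false" if kind == "@F" else "true"   # condition in the after state
    return (f"(({mode_class} = {src}) && ({cond} = {old})) -> "
            f"{mode_class} := {dst}; {cond} := {new};")

# The first row of the ASW mode table (Table 3.2):
row = translate_row("mcStatus", "init", ("@F", "mInitializing"), "standby")
```

Applied to the first row of Table 3.2, this reproduces the shape of the translated entries shown later for the ASW case study.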
3.3 Case Studies

To illustrate the integration of SCR and SYCRAFT, we present two case studies: the control system for an aircraft altitude switch (ASW) [22] and the automobile cruise control system (CCS) [95]. For both systems, we briefly describe the concept and demonstrate how our eight-step method from Section 3.2.4 works on these examples to translate the fault-intolerant SCR specification into the corresponding fault-tolerant specification.

Figure 3.1: The transformation cycle between the SCR toolset and SYCRAFT.

3.3.1 Case Study 1: Altitude Switch Controller

In Section 3.1.1, we described the ASW system and illustrated how it is modeled using the SCR formal method. In this section, we show how to transform the SCR specification of the ASW into guarded commands. Then, we use SYCRAFT to revise the specification of the ASW to add fault-tolerance. Later, we show how to transform the ASW specification from guarded commands back into SCR to import it into the SCR toolset.

Step 1. As shown in Figure 3.1 at step 1, we extract the mode table of the ASW system from the SCR specification. The mcStatus mode table of the ASW system is illustrated in Table 3.2. It describes the mode class mcStatus, which represents a function between the monitored variables and the current value of mcStatus. The mcStatus class has one of the following three modes: standby, init, or awaitDOIon.

Steps 2 & 3. In step 2, we import the SCR specification into the middle layer. This layer generates the input in guarded command format in step 3. The result of the translation layer
The result of the translation layer is as shown in Table 3.6:

((mcStatus = init) /\ (mInitializing = true)) —> mcStatus := standby; mInitializing := false;
((mcStatus = standby) /\ (mReset = false)) —> mcStatus := init; mReset := true;
((mcStatus = standby) /\ (mAltBelow = false) /\ !mInhibit /\ (mDOIStatus = off)) —> mcStatus := awaitDOIon; mAltBelow := true;
((mcStatus = awaitDOIon) /\ ((mDOIStatus = on) = false)) —> mcStatus := standby; mDOIStatus := true;
((mcStatus = awaitDOIon) /\ (mReset = false)) —> mcStatus := init; mReset := true;

Table 3.6: The translated mcStatus mode table.

For example, the second entry in Table 3.6 shows that in order for this action to execute, the old value (i.e., the "before" state) of mcStatus should be equal to standby, and mReset should be equal to false. The two statements on the right-hand side represent the "after" state: the values of both mcStatus and mReset are changed.

We consider three hardware malfunctions that may alter the operation of the fault-intolerant ASW controller [22]: an altimeter fault, an initialization fault, and a DOI fault. All three faults are time-out faults, i.e., they require the system to stay in a given state for a specified amount of time. But since SYCRAFT does not yet include the notion of time, we abstract these faults as on/off flags. We added a new mode, called fault, to the mcStatus class to indicate the presence of faults in the system. Table 3.7 shows how these faults are represented in the input file to SYCRAFT. Note that the fault transitions described below can easily be described using SCR tables. Therefore, the designer can specify the faults using the SCR toolset interface, with which they are familiar.

Step 4.
In step 4, we use the translated SCR specification and the three faults described in Table 3.7 as input to SYCRAFT so that SYCRAFT can add fault-tolerance to the ASW specification, making it tolerate the failure of the altimeter, the initialization, or the DOI.

(mcStatus = init) /\ (Init_Duration_Fault = true) —> Init_Duration_Fault := false; mcStatus := fault;
(mcStatus = standby) /\ (Alt_Duration_Fault = true) —> Alt_Duration_Fault := false; mcStatus := fault;
(mcStatus = awaitDOIon) /\ (AwaitDOI_Duration_Fault = true) —> AwaitDOI_Duration_Fault := false; mcStatus := fault;

Table 3.7: The SYCRAFT fault section.

((mcStatus = init) /\ (mInitializing = true)) —> mcStatus := standby; mInitializing := false;
((mcStatus = standby) /\ (mReset = false)) —> mcStatus := init; mReset := true;
((mcStatus = standby) /\ (mAltBelow = false) /\ !mInhibit /\ (mDOIStatus = off) /\ (mAltFail = false)) —> mcStatus := awaitDOIon; mAltBelow := true;
((mcStatus = awaitDOIon) /\ ((mDOIStatus = on) = false)) —> mcStatus := standby; mDOIStatus := true;
((mcStatus = awaitDOIon) /\ (mReset = false)) —> mcStatus := init; mReset := true;
((mcStatus = fault) /\ (mReset = false)) —> mcStatus := standby; mReset := true;

Table 3.8: The fault-tolerant mcStatus mode table.

Step 5. The result of step 5 is shown in Table 3.8. SYCRAFT added the tolerance in two places. First, the condition (mAltFail = false) was added to the guard of the third transition to prevent mcStatus from activating the device when mAltFail is true. Second, the last transition in Table 3.8 was added to provide recovery from the fault state to one of the system's legitimate states.

Steps 6 & 7. We import the SYCRAFT specifications into the translation layer at step 6 to translate them into fault-tolerant SCR specifications. Table 3.9 is the result of applying the translation to the mcStatus output of SYCRAFT.
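The tables above pair each SCR row with a guarded command by splitting the triggering event into a "before" predicate and an "after" assignment: @T(v) WHEN c holds when v is false and c is true in the old state, and v is true in the new state. The following sketch of that correspondence for a simple conditioned event is our own simplification; the function name and textual output format are assumptions, not the actual translator:

```cpp
#include <string>

// Translate one mode-table row with event "@T(var) WHEN when" into the
// textual guarded command used in the SYCRAFT input: the guard tests the
// "before" state (var = false, plus the WHEN condition, in the old mode),
// and the effect produces the "after" state (new mode, var = true).
std::string toGuardedCommand(const std::string& oldMode,
                             const std::string& var,
                             const std::string& when,
                             const std::string& newMode) {
    std::string guard = "(mcStatus = " + oldMode + ") /\\ ((" + var + ") = false)";
    if (!when.empty()) guard += " /\\ " + when;
    return guard + " --> mcStatus := " + newMode + "; " + var + " := true;";
}
```

A row with a WHEN clause would pass its condition as the third argument, e.g. `toGuardedCommand("standby", "mAltBelow", "!mInhibit /\\ (mDOIStatus = off)", "awaitDOIon")`; @F events are handled symmetrically by swapping the false/true roles.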
Old Mode | Event | New Mode
init | @F(mInitializing) | standby
standby | @T(mReset) | init
standby | @T(mAltBelow) WHEN NOT mInhibit AND mDOIStatus = off AND NOT mAltFail | awaitDOIon
awaitDOIon | @T(mDOIStatus = on) | standby
awaitDOIon | @T(mReset) | init
fault | @T(mReset) | standby
init | @T(Init_Duration_Fault) | fault
standby | @T(Alt_Duration_Fault) | fault
awaitDOIon | @T(AwaitDOI_Duration_Fault) | fault

Table 3.9: Fault-tolerant mode class mcStatus.

3.3.2 Case Study 2: Cruise Control System

The cruise control system (CCS) [95] manages the cruising speed of an automobile by controlling the throttle position. It depends on several monitored variables, namely, mIgnOn, mEngRunning, mSpeed, mLever, and mBrake. The system uses these monitored variables to control the automobile's speed. The cruise mode is engaged by setting mLever to "const", provided that other conditions, such as the engine running and the ignition being on, are met. The CCS can maintain, decrease, or increase the automobile's speed depending on the current speed. Below, we show how the fault-tolerant CCS is produced using the tool described in Figure 3.1.

The mcCruise mode table is shown in Table 3.10. This table specifies the values that the mcCruise class can take. We imported the mode table of Table 3.10 into the middle layer, which generated a specification in SYCRAFT format. We consider a system malfunction that may alter the operation of the fault-intolerant CCS. The fault occurs when the status of the cruise becomes unknown.
Table 3.11 shows how this fault is represented in the input file to SYCRAFT.

Old Mode | Event | New Mode
Off | @T(mIgnOn) | Inactive
Inactive | @F(mIgnOn) | Off
Inactive | @T(mLever = const) WHEN mIgnOn AND mEngRunning AND NOT mBrake | Cruise
Cruise | @F(mIgnOn) | Off
Cruise | @F(mEngRunning) | Inactive
Cruise | @T(mBrake) OR @T(mLever = off) | Override
Override | @F(mIgnOn) | Off
Override | @F(mEngRunning) | Inactive
Override | @T(mLever = resume) WHEN mIgnOn AND mEngRunning AND NOT mBrake OR @T(mLever = const) WHEN mIgnOn AND mEngRunning AND NOT mBrake | Cruise

Table 3.10: Fault-intolerant mode class mcCruise.

((mcCruise = Override) \/ (mcCruise = Cruise) \/ (mcCruise = Inactive) \/ (mcCruise = Off)) /\ (CruiseFault = true) —> mcCruise := Unknown; CruiseFault := false;

Table 3.11: The SYCRAFT fault section.

We input the faults and the fault-intolerant CCS to SYCRAFT in order to add fault-tolerance to the CCS system, namely recovery from the unknown state to one of the CCS safe states. SYCRAFT added two actions that recover from the unknown state to one of the system's valid states, depending on the value of the mIgnOn monitored variable. The fault-tolerant specification is shown in Table 3.12.

Old Mode | Event | New Mode
Off | @T(mIgnOn) | Inactive
Inactive | @F(mIgnOn) | Off
Inactive | @T(mLever = const) WHEN mIgnOn AND mEngRunning AND NOT mBrake | Cruise
Cruise | @F(mIgnOn) | Off
Cruise | @F(mEngRunning) | Inactive
Cruise | @T(mBrake) OR @T(mLever = off) | Override
Override | @F(mIgnOn) | Off
Override | @F(mEngRunning) | Inactive
Override | @T(mLever = resume) WHEN mIgnOn AND mEngRunning AND NOT mBrake OR @T(mLever = const) WHEN mIgnOn AND mEngRunning AND NOT mBrake | Cruise
Unknown | @T(mIgnOn) | Off
Unknown | @F(mIgnOn) | Inactive
Override, Cruise, Off, Inactive | @T(CruiseFault) | Unknown

Table 3.12: Fault-tolerant mode class mcCruise.

3.4 Summary

In this chapter, we presented the techniques we developed to make automated model revision easier to use. Our goal is to make model revision accessible to a wide range of system designers.
Specifically, we utilized existing design tools (e.g., the SCR toolset) as the front end of our approach and performed all aspects of the automated model revision behind the scenes. To achieve this coupling, we developed a middle layer that translates SCR specifications into SYCRAFT specifications and SYCRAFT specifications back into SCR. With this middle layer, we enabled designers to perform the tasks of automated model revision under the hood.

Chapter 4

Expediting the Automated Revision Using Parallelization and Symmetry

To make automated model revision more applicable in practice, we need to develop approaches for enhancing its performance. Specifically, we need to be able to revise programs with moderate to large state spaces in a reasonable amount of time. Our goal in this chapter is to utilize both the properties of the programs being revised and the available infrastructure (e.g., multi-core architectures) to expedite the revision. Hence, we focus on using symmetry, inside the program being revised, and parallelism, obtained from multiple cores, to speed up the revision algorithm.

The rest of this chapter is organized as follows. We explain the bottlenecks of the automated model revision and illustrate the issues involved in the revision problem in the context of Byzantine agreement in Section 4.2. We analyze the effect of the distributed nature of the program being revised on the complexity of the revision in Section 4.2.2. We present our algorithms in Section 4.3. We analyze the results in Section 4.3.3 and argue that our multi-core algorithm is likely to benefit further from additional cores. We evaluate a different parallelism approach in Section 4.4. In Section 4.5, we present our approach for expediting the revision of fault-tolerant programs with the use of symmetry. Finally, we summarize in Section 4.6.
4.1 Introduction

Given the current trend in processor design, where the number of transistors keeps growing as directed by Moore's law but clock speed remains relatively flat, it is expected that multi-core computing will be the key to utilizing such computers most effectively. As argued in [90], programs/protocols from distributed computing are expected to be especially beneficial in exploiting such multi-core computers. One of the difficulties in adding fault-tolerance using automated techniques, however, is its time complexity. Our focus is to evaluate the effectiveness of different approaches that utilize multi-core computing to reduce the time complexity of deadlock resolution during the revision to add fault-tolerance to distributed programs.

To evaluate the effectiveness of multi-core computing, we first need to identify the bottleneck(s) where multi-core features can provide the maximum impact. To identify these bottlenecks, in [30], Bonakdarpour and Kulkarni developed a symbolic (BDD-based) algorithm for adding fault-tolerance to distributed programs with state spaces larger than 10^30. Based on the analysis of the experimental results from [30], they observed that, depending upon the structure of the given distributed intolerant program, the performance of the revision suffers from two major complexity obstacles: (1) generation of the fault-span, the set of states reachable in the presence of faults, and (2) resolution of deadlock states, from which the program has no outgoing transitions. To resolve a deadlock state, we either need to provide recovery actions that allow the program to continue its execution or eliminate the deadlock state by preventing the program execution from reaching it. Of these, generation of the fault-span is closely similar to program verification and, hence, techniques for efficient verification are directly applicable to it.
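The fault-span mentioned above is a reachability fixpoint: starting from the legitimate states, repeatedly add every state reachable through either a program or a fault transition. A minimal explicit-state sketch of this computation follows; the algorithm of [30] does the same thing symbolically on BDDs, and all names here are our own:

```cpp
#include <map>
#include <set>
#include <vector>

// Explicit transition relation: source state -> target state.
using Transitions = std::multimap<int, int>;

// Compute the fault-span: the set of states reachable from 'init' by any
// interleaving of program and fault transitions (a plain BFS fixpoint).
std::set<int> faultSpan(const std::set<int>& init,
                        const Transitions& program,
                        const Transitions& faults) {
    std::set<int> span(init);
    std::vector<int> frontier(init.begin(), init.end());
    while (!frontier.empty()) {
        int s = frontier.back();
        frontier.pop_back();
        for (const Transitions* t : {&program, &faults}) {
            auto range = t->equal_range(s);
            for (auto it = range.first; it != range.second; ++it)
                if (span.insert(it->second).second)   // newly reached state
                    frontier.push_back(it->second);
        }
    }
    return span;
}
```

Because this is an ordinary reachability computation, standard symbolic image-computation optimizations from verification carry over to it directly, which is why the chapter concentrates on the other bottleneck, deadlock resolution.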
In this chapter, we focus on expediting the resolution of deadlock states with the use of parallelism and symmetry.

In the context of dependable systems, the revised fault-tolerant program should meet its liveness requirements even in the presence of faults. Therefore, no deadlock states are permitted in the fault-tolerant program, since the existence of such states can violate the liveness requirements. A program may reach a deadlock state because faults perturb the program to a new state that was not considered in the fault-intolerant program, or because some program actions are removed (e.g., because they violate safety in the presence of faults).

We present two approaches for parallelization. The first approach parallelizes the group computation. It is based on the distributed nature of the program being revised. In particular, when a new transition is added/removed, since the process executing it has only a partial view of the program variables, we need to add/remove a group of transitions based on the variables that cannot be read by the process. The second approach is based on partitioning deadlock states among multiple threads; each thread resolves the deadlock states that have been assigned to it. We show that this provides only a small performance benefit. Based on the analysis of these results, we argue that the simple approach that parallelizes the group computation is likely to provide the maximum benefit in the context of deadlock resolution for the revision of distributed programs.

To understand the use of symmetry, we observe that, often, multiple processes in a distributed program are symmetric in nature, i.e., their actions are similar (except for the renaming of variables). Thus, if we find recovery transitions for one process, then we can utilize symmetry to identify other recovery transitions that should also be included for other processes in the system.
Likewise, if some transitions of a process violate safety in the presence of faults, then we can identify similar transitions of other processes that would also violate safety. If the cost of identifying these similar transitions with the knowledge of symmetry among processes is less than the cost of identifying them explicitly, then the use of symmetry will reduce the overall time required for revision.

We also present an algorithm that utilizes symmetry to expedite the revision. We show that our algorithm significantly improves performance over previous implementations. For example, in the case of Byzantine agreement (BA) [107] with 25 processes, the time for revision with a sequential algorithm was 1,632s. With symmetry alone, the revision time was reduced to 188s (8.7 times better). With parallelism (8 threads), the revision time was reduced to 467s (3.5 times better). When we combined symmetry and parallelism, the total revision time was reduced to 107s (more than 15.2 times better).

4.2 Issues in Automated Model Revision

In this section, we use the example of Byzantine agreement [107] (denoted BA) to describe the issues in automated revision to add fault-tolerance. Towards this end, in Section 4.2.1, we describe the inputs used for revising the Byzantine agreement problem. Subsequently, in Section 4.2.2, we identify the need for explicit modeling of the read/write restrictions imposed by the nature of the distributed program. Finally, in Section 4.2.3, we describe how deadlock states get created while revising the program to add fault-tolerance and illustrate our approach for managing them.

4.2.1 Input for the Byzantine Agreement Problem

The Byzantine agreement problem (BA) consists of a general, say g, and three (or more) non-general processes, say j, k, and l. The agreement problem requires that a process copy the decision chosen by the general (0 or 1) and finalize (output) the decision (subject to some constraints).
Thus, each process of BA maintains a decision d; for the general, the decision can be either 0 or 1, and for the non-general processes, the decision can be 0, 1, or ⊥, where the value ⊥ denotes that the corresponding process has not yet received the decision from the general. Each non-general process also maintains a Boolean variable f that denotes whether that process has finalized its decision. For each process, a Boolean variable b shows whether or not the process is Byzantine; the read/write restrictions (described in Section 4.2.2) ensure that a process cannot determine whether other processes are Byzantine. Thus, a state of the program is obtained by assigning each variable, listed below, a value from its domain, and the state space of the program is the set of all possible states.

V = {d.g} (the general's decision): {0, 1}
  ∪ {d.j, d.k, d.l} (the non-generals' decisions): {0, 1, ⊥}
  ∪ {f.j, f.k, f.l} (finalized?): {false, true}
  ∪ {b.g, b.j, b.k, b.l} (Byzantine?): {false, true}

Fault-intolerant program. To concisely describe the transitions of the (fault-intolerant) version of BA, we use guarded commands of the form g —> st. Recall from Chapter 1 that g is a predicate involving the above program variables and st updates the above program variables. The command g —> st corresponds to the set of transitions {(s0, s1) : g is true in s0 and s1 is obtained by executing st in state s0}. Thus, the transitions of a non-general process, say j, are specified by the following two actions:

BA_intol_j ::
  BA1_j :: (d.j = ⊥) /\ (f.j = false) /\ (b.j = false) —> d.j := d.g
  BA2_j :: (d.j ≠ ⊥) /\ (f.j = false) /\ (b.j = false) —> f.j := true

We include similar transitions for k and l as well. Note that the general does not need explicit actions; the action by which the general sends the decision to j is modeled by BA1_j.

Specification. The safety specification of BA requires validity and agreement.
Validity requires that if the general is non-Byzantine, then the final decision of a non-Byzantine non-general must be the same as that of the general. Additionally, agreement requires that the final decisions of any two non-Byzantine non-generals must be equal. Finally, once a non-Byzantine process finalizes (outputs) its decision, it cannot change it.

Faults. A fault transition can cause a process to become Byzantine if no other process is initially Byzantine. Also, a fault can change the d and f values of a Byzantine process. The fault transitions that affect a process, say j, of BA are as follows (we include similar actions for k, l, and g):

F1 :: ¬b.g /\ ¬b.j /\ ¬b.k /\ ¬b.l —> b.j := true
F2 :: b.j —> d.j, f.j := 0|1, false|true

where d.j := 0|1 means that d.j could be assigned either 0 or 1. In the case of the general process, the second action does not change the value of any f-variable.

Goal of automated addition of fault-tolerance. The goal of the automated revision is to start from the intolerant program (BA_intol_j) and, given the set of faults (F1 & F2), automatically generate the fault-tolerant program (BA_tolerant_j) given below:

BA_tolerant_j ::
  BA1_j :: (d.j = ⊥) /\ (f.j = false) /\ (b.j = false) —> d.j := d.g
  BA2_j :: (d.j ≠ ⊥) /\ (f.j = false) /\ (d.l ≠ ⊥ \/ d.k ≠ ⊥) —> f.j := true
  BA3_j :: (d.l = 0) /\ (d.k = 0) /\ (d.j = 1) /\ (f.j = 0) —> d.j, f.j := 0, 0|1
  BA4_j :: (d.l = 1) /\ (d.k = 1) /\ (d.j = 0) /\ (f.j = 0) —> d.j, f.j := 1, 0|1

In the above program, the first action remains the same. The second action is restricted to execute only in states where another process has the same d value. Actions 3 and 4 correct the process decision.

4.2.2 The Need for Modeling Read/Write Restrictions

Since the program being revised is distributed in nature, each process can only read a subset of the program variables. It is important to articulate these restrictions precisely to ensure that the revised program is realizable under the constraints of the underlying distributed system for which it is designed.
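The closure induced by such read restrictions can be sketched explicitly: adding one transition for a process forces adding every variant of it over the values of the variables that process cannot read. The snippet below is our own explicit-state illustration (the actual tool performs this computation symbolically on BDDs); the `State`, `Transition`, and `group` names are assumptions:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using State = std::map<std::string, int>;            // variable -> value (Boolean here)
using Transition = std::pair<State, State>;          // (source state, target state)

// Given one transition of a process and the Boolean variables that process
// cannot read, return the closure: every transition obtained by fixing each
// unreadable variable to the same arbitrary value in both the source and the
// target state (the process can neither read nor write such a variable).
std::vector<Transition> group(const Transition& t,
                              const std::vector<std::string>& unreadable) {
    std::vector<Transition> result{t};
    for (const auto& v : unreadable) {
        std::vector<Transition> next;
        for (const auto& tr : result) {
            for (int val : {0, 1}) {
                State s0 = tr.first, s1 = tr.second;
                s0[v] = val;
                s1[v] = val;                         // v is left unchanged
                next.push_back({s0, s1});
            }
        }
        result = next;
    }
    return result;
}
```

For Byzantine agreement, a recovery transition of process j would have to be closed under the five Boolean variables j cannot read (b.g, b.k, b.l, f.k, f.l), multiplying one transition into 2^5 = 32 siblings; this blow-up is what makes the group computation a natural target for parallelization later in the chapter.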
For example, in the context of the Byzantine agreement example from Section 4.2.1, non-general process j is not supposed to know whether other processes are Byzantine. It follows that process j cannot include an action of the form 'if b.k is true then change d.j to 0'. To permit such modeling, we need to specify read/write restrictions for a given process. For the Byzantine agreement example, process j is allowed to read R_j = {b.j, d.j, f.j, d.k, d.l, d.g} and it is allowed to write W_j = {d.j, f.j}. Observe that this modeling prevents j from knowing whether other processes are Byzantine. With such read/write restrictions, if process j were to include an action of the form 'if b.k is true then change d.j to 0', then it must also include a transition of the form 'if b.k is false then change d.j to 0'. In general, if transition (s0, s1) is to be included as a transition of process j, then we must also include a corresponding equivalence class of transitions (called a group of transitions) that differ only in terms of the variables that j cannot read. For further discussion of the group operation, please refer to Section 2.2.

4.2.3 The Need for Deadlock Resolution

During revision, we analyze the effect of faults on the given fault-intolerant program and identify a fault-tolerant program that meets the constraints of Problem 2.1. This involves the addition of new transitions as well as the removal of existing transitions. In this section, we utilize the Byzantine agreement problem to illustrate how deadlocks get created during the execution of the revision algorithm and identify two approaches for resolving them.

• Deadlock scenario 1 and the use of recovery actions. One legitimate state, say s0 (cf. Table 4.1), of the Byzantine agreement program is a state where all processes are non-Byzantine, d.g is 0, and the decision of all non-generals is ⊥. Thus, in this state, the general has chosen the value 0 and no non-general has received any value. From this state, process j (respectively k) can copy the general's decision by executing the program action BA1_j (respectively BA1_k), as in s1 (respectively s2) of Table 4.1. The general can become Byzantine and change its value from 0 to 1 arbitrarily, as in s3. Therefore, a non-general can receive either 0 or 1 from the general. Clearly, starting from s3, in the presence of faults (F1 & F2), the program (BA_intol) can reach a state, say s5, where d.g = d.l = 1 and d.j = d.k = 0. From such a state, transitions of the fault-intolerant program violate safety if they allow j (or k) and l
From this state, process j (respectively k) can copy the general decision by executing the program action BAl ,- (respectively BAlk) as in s1 (respectively s2) from Table 4.1. The general can become Byzantine and change its value from 0 to 1 arbitrarily as in s3. Therefore, a non-general can receive either 0 or 1 from the general. Clearly, starting from S3, in the presence of faults (F l & F2), the program (BA,-,,,0,) can reach a state, say s5, where d.g = d.l = 1, and d.j = d.k = 0. From such a state, transitions of the fault-intolerant program violate safety if they allow j (or k) and l 46 to finalize their decision. If we remove these safety violating transitions then there are no other transitions from state S5. In other words, during revision, we encounter that state S5 is a deadlock state. One can resolve this deadlock state by simply adding a recovery transition that changes d! to 0. (Note that based on the discussion of Section 4.2 .2, adding such recovery transition requires us to add the corresponding group of transitions. It is straightforward to observe that none of the transitions in this group violate safety.) Action/ State Fault b.g b.j b.k b.l d.g d j d.k d.l f j f.k f.l So — O 0 O O 0 I I J. 0 O 0 S] BA 1 j 0 0 0 0 0 Q J_ .l. O O 0 S2 BA 1 k 0 0 O O O O Q _L O 0 0 S3 F 1 1 O 0 0 O 0 O _L 0 0 0 S4 F2 1 O O 0 _1_ 0 O J. O O 0 S5 BA 1 1 1 O O 0 1 0 0 l 0 O O Table 4.1: Deadlock scenario 1 (The underlined values indicates which variable is being changed by the program action/fault. For reasons of space the true and false values are replaced by l and 0 respectively for the variables b and f.) o Deadlock scenario 2 and need for elimination. Again, consider the execution of the program (BAimOl) in the presence of faults (F 1 & F2). Starting from state so in the previous scenario the program can also reach a state, say so (c.f. 
Table 4.2), where d.g = d.l = l,d.j = d.k = 0, and f.j : 1; state so differs from S5 in the previous scenario in terms of the value of f.l. Unlike S5 in the previous scenario, since I has finalized its decision, we cannot resolve S6 by adding safe recovery. Since safe recovery from so cannot be added, the only choice for designing a fault-tolerant program is to ensure that state S6 is never reached in the fault-tolerant program. This can be achieved by removing transitions that reach S6. However, removal of such transitions can create more deadlock states that have to be eliminated. Thus, the 47 deadlock algorithm needs to be recursive in nature. Action/ State Fault b.g b.j b.k b.l d.g d j d.k d.l f j f.k f.l so - O 0 0 0 0 I J. I 0 0 0 S1 BA] j 0 0 O O 0 Q I I 0 O 0 82 BA 1 k 0 O O O O O Q J. 0 0 0 S3 3A2]- O O O O 0 0 O _L 1 0 0 S4 F 1 1 O 0 O O O O 1 1 O 0 S5 F2 1 0 0 0 _1_ O 0 J. 1 O 0 s6 BA 1 1 1 O O O 1 O 0 1 l O O Table 4.2: Deadlock scenario 2 (The underlined values indicates which variable is being changed by the program action/fault. For reasons of space the true and false values are replaced by 1 and 0 respectively for the variables b and f.) To maximize the success of the revision algorithm, our approach to handle deadlock states is as follows: Whenever possible, we add recovery transition(s) from the deadlock ’states to a legitimate state. However, if no recovery transition(s) can be added from the deadlock states, we try to eliminate (i.e. make it unreachable) the deadlock states by pre- venting the program from reaching the deadlock states. In other words, we try to eliminate deadlock states only if adding recovery from them fails. 4.3 Approach 1: Parallelizing Group Computation In this section, we present our approach for parallelizing the group computation to expedite the revision to add fault-tolerance. First, in Section 4.3.1, we identify the different design choices we considered and then present our algorithm. 
In Section 4.3.2, we describe our approach for parallelizing the group computation. Subsequently, in Section 4.3.3, we provide experimental results in the context of the Byzantine agreement example from Section 4.2.1 and the token ring [14]. Finally, in Section 4.3.4, we analyze the experimental results to evaluate the effectiveness of parallelization for group computation.

4.3.1 Design Choices

The structure of the group computation permits an efficient way to parallelize it. In particular, whenever some recovery transitions are added for dealing with a deadlock state, or some states are removed to ensure that a deadlock state is not reached, we can utilize multiple threads in a master-slave fashion to expedite the group computation. During the analysis of how to utilize multiple cores effectively, we made the following observations/design choices.

• Multiple BDD packages vs. a reentrant BDD package. We chose to utilize different instances of the BDD package for each thread. Thus, at the time of group computation, each thread obtains a copy of the BDD corresponding to the program transitions and other BDDs from the master thread. In part, this was motivated by the fact that existing parallel BDD implementations have shown limited speedup. Also, we argue that the increased space complexity of this approach is acceptable in the context of revision, since the time complexity of the revision algorithm is high (compared with model checking) and we always run out of time before we run out of space.

• Synchronization overhead. Although simple to parallelize, the group computation itself is fine-grained, i.e., the time to compute a group of the recovery transitions that are to be added to the program is small (100-500 ms). Hence, the overhead of using multiple threads needs to be small. With this motivation, our algorithm creates the required set of threads up front and utilizes mutexes to synchronize between them.
This provided a significant benefit over creating and destroying threads for each group operation.

• Load balancing. Load balancing among several threads is desirable so that all threads take approximately the same amount of time to perform their task. To perform a group computation for the recovery transitions being added, we need to evaluate the effect of the read/write restrictions imposed by each process. A static way to parallelize this is to let each thread compute the set of transitions caused by the read/write restrictions of a (given) subset of processes. A dynamic way is to consider the set of processes for which a group computation is to be performed as a shared pool of tasks and allow each thread to pick one task after it finishes the previous one. We find that, given the small duration of each group computation, static partitioning of the group computation works better than dynamic partitioning, since the overhead of dynamic partitioning is high.

4.3.2 Parallel Group Algorithm Description

To better illustrate the parallel group algorithm, we first describe its sequential version. The sequential group algorithm (cf. Algorithm 1) takes a transition set, trans, as input and computes the transition group, transg, as output. Recall from Section 2.2 that the tasks involved in computing the group depend on the number of processes and the number of variables in the program. The sequential group algorithm (cf. Algorithm 1) needs to go through all the processes in the program and, for each process, through all the variables. The revision algorithm must compute the group associated with any set of transitions added to or removed from the program transitions. Based on this discussion and the design choices above, we now describe the parallel group algorithm.

Algorithm sketch. Given a transition set trans, the goal of this algorithm is to compute the group of transitions associated with trans.
The sequential algorithm performs many computations for each process, one after another. In the parallel algorithm, however, we split the group computation over the available number of threads. In particular, rather than having one thread find the group for all the processes, we let each thread compute the group for a subset of the processes. Since the tasks assigned to each thread require a very small amount of processor time, there is considerable overhead associated with thread creation/destruction every time the group is computed. Therefore, we let the master thread create the worker threads at the initialization stage of the revision algorithm. The worker threads stay idle until the master thread needs to compute the group for a set of transitions. The master thread activates/deactivates the worker threads through a set of mutexes. When all worker threads are done, the master thread collects the results of all worker threads into one group.

The parallel group algorithm consists of three parts: the initialization of the worker threads, the assignment of tasks to worker threads, and the computation of a group by the worker threads.

Initialization. In the initialization phase, the master thread creates all required threads by calling the algorithm InitiateThreads (cf. Algorithm 2). These threads stay idle until a group computation is required and terminate when the revision algorithm ends. Following the design choice for load balancing, the algorithm distributes the work among the available threads statically (Lines 3-4). Then it creates all the required worker threads (Line 7).

Algorithm 2 InitiateThreads
Input: noOfProcesses, noOfThreads.
1: for i := 0 to noOfThreads − 1 do
2:   BDDMgr[i] := Clone(masterBDDManager);
3:   startP[i] := ⌊(i × noOfProcesses) / noOfThreads⌋;
4:   endP[i] := ⌊((i + 1) × noOfProcesses) / noOfThreads⌋ − 1;
5: end for
6: for thID := 0 to noOfThreads − 1 do
7:   SpawnThread → GroupWorkerThread(thID);
8: end for
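Lines 3-4 of Algorithm 2 assign process ranges with the usual floor arithmetic. A small sketch of that static partition follows (the function name is ours):

```cpp
#include <utility>
#include <vector>

// Static partition of 'noOfProcesses' process indices over 'noOfThreads'
// threads, mirroring lines 3-4 of Algorithm 2: thread i handles the range
// [ floor(i*n/t), floor((i+1)*n/t) - 1 ]. Adjacent ranges tile 0..n-1 and
// their sizes differ by at most one, so the per-thread load is balanced.
std::vector<std::pair<int, int>> partition(int noOfProcesses, int noOfThreads) {
    std::vector<std::pair<int, int>> ranges;
    for (int i = 0; i < noOfThreads; ++i) {
        int startP = (i * noOfProcesses) / noOfThreads;
        int endP = ((i + 1) * noOfProcesses) / noOfThreads - 1;
        ranges.push_back({startP, endP});
    }
    return ranges;
}
```

For example, 10 processes over 4 threads yield the ranges (0,1), (2,4), (5,6), (7,9): slice sizes 2, 3, 2, 3, which is as even as an integer split can be.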
Tasks for the worker thread. Initially, the algorithm WorkerThread (cf. Algorithm 3) locks the mutexes Start and Stop (Lines 1-2). It then waits until the master thread unlocks the Start mutex (Line 5). At this point, the worker starts computing the part of the group associated with this thread. This section of WorkerThread (Lines 7-15) is similar to the Group() function in the sequential revision algorithm, except that rather than finding the group for all the processes, the WorkerThread algorithm finds the group for a subset of the processes (Line 8). When the computation is completed, the worker thread notifies the master thread by unlocking the mutex Stop (Line 17).

Algorithm 3 WorkerThread
Input: thID.
// Initial lock of the mutexes
1: mutex_lock(thData[thID].mutexStart);
2: mutex_lock(thData[thID].mutexStop);
3: while true do
4:   // Wait for the signal from the master thread
5:   mutex_lock(thData[thID].mutexStart);
6:   gtr[thID] := false;
7:   BDD* tPred := BDD[endP[thID] − startP[thID] + 1];
8:   for i := 0 to endP[thID] − startP[thID] do
9:     tPred[i] := thData[thID].trans ∧ allowedWrite[i + startP[thID]].Transfer(BDDMgr[thID]);
10:    tPred[i] := FindGroup(tPred[i], i, thID);
11:  end for
12:  thData[thID].result := false;
13:  for i := 0 to endP[thID] − startP[thID] do
14:    thData[thID].result := thData[thID].result ∨ tPred[i];
15:  end for
16:  // Notify the master thread that this thread is done
17:  mutex_unlock(thData[thID].mutexStop);
18: end while

Tasks for the master thread. Given a transition set trans, the master thread copies trans to each instance of the BDD package used by the worker threads (cf. Algorithm 4, Lines 3-5). Then it assigns a subset of the group computation to the worker threads (Lines 6-8) and unlocks them. After the worker threads complete, the master thread collects the results and returns the group associated with the input trans.

Algorithm 4 MasterThread
Input: transition set thisTr.
Output: transition group gAll.
1: trans := thisTr;
2: gAll := false;
3: for i := 0 to noOfThreads - 1 do
4:   thData[i].trans := trans.Transfer(BDDMgr[i]);
5: end for
// Signal all idle threads to start computing the group
6: for i := 0 to noOfThreads - 1 do
7:   mutex_unlock(thData[i].mutexStart);
8: end for
// Waiting for all threads to finish computing the group
9: for i := 0 to noOfThreads - 1 do
10:  mutex_lock(thData[i].mutexStop);
11: end for
// Merging the results from all threads
12: for i := 0 to noOfThreads - 1 do
13:  gAll := gAll ∨ thData[i].result;
14: end for
15: return gAll;

4.3.3 Experimental Results

In this section, we describe the respective experimental results in the context of the Byzantine agreement (described in Section 4.2.1) and the token ring [14]. In both case studies, we find that parallelizing the group computation improves the execution time substantially. Throughout this section, all experiments are run on a Sun Fire V40z with 4 dual-core Opteron processors and 16 GB RAM. The OBDD representation of the Boolean formulae has been done using the C++ interface to the CUDD package developed at the University of Colorado [125]. Throughout this section, we refer to the original implementation of the revision algorithm (without parallelism) as the sequential implementation. We use X threads to refer to the parallel algorithm that utilizes X threads.

We would like to note that the difference in revision time between the sequential implementation in this experiment and the one in [30] is due to other unrelated improvements on the sequential implementation itself. The sequential and parallel implementations differ only in terms of the modification described in Section 4.3.2. We note that our algorithm is deterministic and the testbed is dedicated; hence, the only non-deterministic factor in the time for revision is the synchronization among threads.
Based on our experience with the revision, this factor has a negligible impact and, hence, multiple runs on the same data essentially reproduce the same results.

In Figures 4.1 and 4.2, we show the results of using the sequential approach versus the parallel approach (with multiple threads) to perform the revision. All the tests show that we gain a significant speedup. For example, in the case of 45 non-general processes and 8 threads, we gain a speedup of 6.1. We can clearly see that the parallel 16-thread version is faster than the corresponding 8-thread version. This is surprising, given that there are only 8 cores available. However, upon closer observation, we find that the group computation that is parallelized using threads is fine-grained. Thus, when the master thread uses multiple slave threads for performing the group computation, the slave threads complete quickly and therefore cannot utilize the available resources to the full extent. Hence, creating more threads (than available processors) can improve the performance further.

Figure 4.1: The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of non-general processes of BA in sequential and parallel algorithms.

Figure 4.2: The time required for the revision to add fault-tolerance for several numbers of non-general processes of BA in sequential and parallel algorithms.

In Figures 4.3 and 4.4, we present the results of our experiments in parallelizing the deadlock resolution of the token ring problem. After the number of processes exceeds a threshold, the execution time increases substantially. This phenomenon also occurs in the case of the parallelized implementation, although it appears for larger programs. However, this effect is not as strong. Note that the spike in speedup at 80 processes is caused by page-fault behavior: the performance of the sequential algorithm is affected, while the performance of the parallel algorithm is still not affected.

Figure 4.3: The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of token ring processes in sequential and parallel algorithms.

Figure 4.4: The time required for the revision to add fault-tolerance for several numbers of token ring processes in sequential and parallel algorithms.

4.3.4 Group Time Analysis

To understand the speedup gain provided by our algorithm in Section 4.3.2, we evaluated the experimental results closely. As an example, consider the case of 32 BA processes. For the sequential implementation, the total revision time is 59.7 minutes, of which 55 are used for group computation. Hence, the ideal completion time with 4 cores is 18.45 minutes (55/4 + 4.7). By comparison, the actual time taken in our experiment was 19.1 minutes.
Thus, the speedup using this approach is close to the ideal speedup. In this section, we focus on the effectiveness of the parallelization of the group computation by considering the time taken for it in the sequential and parallel implementations. Towards this end, we analyze the group computation time for the sequential and parallel implementations in the context of three examples: Byzantine agreement, agreement in the presence of failstop and Byzantine faults, and token ring [14]. The results for these examples are included in Tables 4.3-4.5.

In some cases, the speedup ratio is less than the number of threads. This is caused by the fact that each group computation takes a very small amount of time and incurs an overhead for thread synchronization. Moreover, as mentioned in Section 4.2.3, due to the overhead of load balancing, we allocate the tasks of each thread statically. Thus, the load of different threads can be slightly uneven. We also observe that the speedup ratio increases with the number of processes in the program being revised. This implies that the parallel algorithm will scale to larger problem instances.

An interesting as well as surprising observation is that when the state space is large enough, the speedup ratio is more than the number of threads. This behavior is caused by the fact that with parallelization each thread is working on smaller BDDs during the group computation. To understand this behavior, we conducted experiments where we created the threads to perform the group computation and forced them to execute sequentially by adding extra synchronization. We found that such a pseudo-sequential run took less time than that used by a purely sequential run.

             Sequential   2-threads         4-threads         8-threads
PR   RS      GT(s)        GT(s)    SR       GT(s)    SR       GT(s)    SR
15   10^11      50           29    1.72        17    2.94        11    4.55
24   10^17     652          346    1.88       185    3.52       122    5.34
32   10^22    3347         1532    2.18       848    3.95       490    6.83
48   10^33   33454        14421    2.32      7271    4.60      3837    8.72

Table 4.3: Group computation time for Byzantine Agreement.
PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio.

             Sequential   2-threads         4-threads         8-threads
PR   RS      GT(s)        GT(s)    SR       GT(s)    SR       GT(s)    SR
10   10^10      53           24    2.21        23    2.30        30    1.77
15   10^15     624          319    1.96       175    3.57       174    3.59
20   10^20    4473         2644    1.69      1275    3.51      1128    3.97
25   10^25   26154        11739    2.23      6527    4.01      5692    4.59

Table 4.4: Group computation time for the Agreement problem in the presence of failstop and Byzantine faults. PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio.

             Sequential   2-threads         4-threads         8-threads
PR   RS      GT(s)        GT(s)    SR       GT(s)    SR       GT(s)    SR
30   10^14    0.32         0.15    2.12      0.10    3.34      0.12    2.75
40   10^19    0.84         0.36    2.34      0.22    3.84      0.23    3.59
50   10^23    1.82         0.68    2.68      0.39    4.66      0.42    4.37
60   10^28    3.22         1.22    2.63      0.67    4.80      0.64    5.01
70   10^33    5.36         1.91    2.80      1.06    5.05      0.86    6.23
80   10^38    7.77         2.94    2.64      1.53    5.09      1.23    6.30

Table 4.5: Group computation time for token ring. PR: Number of processes. RS: Size of reachable state space. GT(s): Group time in seconds. SR: Speedup ratio.

4.4 Approach 2: Alternative (Conventional) Approach

A traditional approach for parallelization in the context of resolving deadlock states, say ds, would be to partition the deadlock states among multiple threads and allow each thread to handle the partition assigned to it. Next, in Section 4.4.1, we discuss some of the design choices we considered for this approach. We give a brief description of our algorithm in Section 4.4.2. Subsequently, we describe experimental results in Section 4.4.3 and analyze them to argue that for such an approach to work in revising distributed programs, the group computation must itself be parallelized.

4.4.1 Design Choices

To maximize the benefits from parallelism, we consider two factors when partitioning the deadlock states among the available threads. First, the deadlock states should be distributed evenly among the threads.
Second, the partitions should minimize the overlap between worker threads. More specifically, states considered by one thread should not be considered by another thread. Therefore, we partition the deadlock states based on the values of the program variables. We use the size of the BDDs and the number of minterms to split the deadlock states as evenly as possible. Regarding the second factor, we chose to add limited synchronization among worker threads to reduce the overlap in the states explored by different threads.

For example, we can partition ds using the partition predicates prt_i, 1 ≤ i ≤ n, such that ∨_{i=1..n} (prt_i ∧ ds) = ds, where n is the number of threads. Thus, if two threads are available during the revision of the Byzantine agreement program, then we can let prt1 = (d.j = 0) and prt2 = (d.j ≠ 0). After partitioning, each thread would work independently as long as it does not affect the states visited by other threads. As discussed in Section 4.2.3, to resolve a deadlock state, each thread explores a part of the state space using backward reachability. Clearly, when the states visited by two threads overlap, we have two options: (1) perform synchronization
(We also used some heuristic-based synchronization where we maintained a set of visited states that each thread checked before performing backward state exploration. This provided a small performance benefit and is included in the results below.) 4.4.2 Algorithm Sketch In this section, we focus on the descriptions of the parallel aspect of our deadlock resolution algorithm. For more details on the sequential algorithm for deadlock resolution please refer to [101]. The goal of our algorithm (c.f. Algorithm 5) is to resolve the deadlock states by adding safe recovery. However, if for some deadlock states safe recovery is not possible, the al- gorithm eliminates such states (i .e. makes them unreachable). To efficiently utilize the available worker threads, the master thread partitions the set of deadlock states among available threads as described in Section 4.4.1 and provides each thread with its own parti- tion. Subsequently, the master thread activates the worker threads to add safe recovery (c.f. Algorithm 6). Once activated, in adding safe recovery mode, each worker thread works as follows. It constructs the recovery transitions that originate from the deadlock states and leads to the legitimate states of the program in a finite number of steps. Of course, the algorithm does not include any transition that reaches a state from where the safety of the program can be violated. Once all worker threads are done computing the recovery transi- 62 tions, the master thread merges the recovery transitions, returned by all threads, and adds them to the program transitions. Algorithm 5 ResolveDeadlockStates Input: program p, faults f, legitimate state predicate I , fault span T, pro- hibited transitions mt, and partition predicates prtl ..prtn, where n is the number of worker threads. Output: program p’ and the predicate fte of states failed to eliminate. 9." 
1: ds := T ∧ ¬g(p);
// Resolving deadlock states by adding safe recovery
2: for i := 1 to n do
3:   rt_i := SpawnThread -> AddRecovery(ds ∧ prt_i, I, mt);
4: end for
// Merging results from worker threads
5: p := p ∨ (∨_{i=1..n} rt_i);
6: vds, fte := false, false;
7: ds := T ∧ ¬g(p);
// Eliminating deadlock states from where safe recovery is not possible
8: for i := 1 to n do
9:   rp_i, vds_i, fte_i := SpawnThread -> Eliminate(ds ∧ prt_i, p, I, f, T, vds, fte);
10: end for
// Merging results from worker threads
11: p' := Group(∧_{i=1..n} rp_i);
12: fte, vds := ∨_{i=1..n} fte_i, ∨_{i=1..n} vds_i;
// Handling inconsistencies
13: nds := ((T ∧ ¬I) ∧ ¬g(p')) ∧ ¬((T ∧ ¬I) ∧ ¬g(p));
14: p' := p' ∨ Group(p ∧ nds);
15: p' := p' ∨ Group(g(p) ∧ (fte)');
16: return p', fte;

At this point, the master thread computes the remaining deadlock states. This set identifies the deadlock states from which safe recovery is not possible. As mentioned earlier in Section 4.2.3, those states have to be eliminated (i.e., made unreachable by program transitions). Once again the master thread partitions the deadlock states and provides each worker thread with one such partition. Subsequently, it activates the worker threads. Once activated, in the eliminating mode (cf. Algorithm 7), the worker threads remove all program transitions that terminate at the deadlock states, thereby making them unreachable. However, if the removal of some of those transitions introduces new deadlock states, then the algorithm puts back such transitions and recursively eliminates the newly introduced deadlock states.

Thread 6 AddSafeRecovery
Input: deadlock states ds, legitimate state predicate I, and transition predicate mt.
Output: recovery transition predicate rec.
1: lyr, rec := I, false;
2: repeat
3:   rt := Group(ds ∧ (lyr)');
4:   rt := rt ∧ ¬Group(rt ∧ mt);
5:   rec := rec ∨ rt;
6:   lyr := g(ds ∧ rt);
7: until (lyr = false);
8: return rec;

When threads explore states concurrently, some inconsistencies may be created.
Next, we give a brief overview of the inconsistencies that may occur due to concurrent state exploration by different threads and identify how we can resolve them. Towards this end, let s1 and s2 be two states that are considered for deadlock elimination, and let (s0, s1) and (s0, s2) be two program transitions for some s0. To eliminate s1 and s2, a sequential elimination algorithm removes the transitions (s0, s1) and (s0, s2), which makes s0 a new deadlock state (cf. Figure 4.5.a). This in turn requires that state s0 itself be made unreachable. If s0 is unreachable, then including the transitions (s0, s1) and (s0, s2) in the revised program is harmless. In fact, it is desirable, since including these transitions also causes other transitions in the corresponding group to be included as well. And, these grouped transitions might be useful in providing recovery from other states. Hence, the sequential algorithm puts back (s0, s1) and (s0, s2) (and the corresponding group) and starts eliminating the state s0. However, the concurrent execution of worker threads may create some inconsistencies. We describe some of these inconsistencies and our approach to resolve them next.

Thread 7 Eliminate
Input: deadlock states ds, program p, legitimate state predicate I, fault transitions f, fault span T, visited deadlock states vds, predicate fte of states failed to eliminate.
Output: revised program transition predicate p, visited deadlock states vds, predicate fte of states failed to eliminate.
1: wait(mutex);
2: ds := ds ∧ ¬vds;
3: vds := vds ∨ ds;
4: signal(mutex);
5: if (ds = false) then
6:   return p;
7: end if
8: old := p;
9: tmp := (T ∧ ¬I) ∧ p ∧ (ds)';
10: p := p ∧ ¬Group(tmp);
11: fs := g(T ∧ ¬I ∧ f ∧ (ds)');
12: p, vds, fte := Eliminate(fs, p, I, f, T, vds, fte);
13: nds := g(T ∧ ¬I ∧ Group(tmp) ∧ ¬g(p));
14: p := p ∨ (Group(tmp) ∧ nds);
15: nds := nds ∧ g(tmp);
    // (X)'' = {(s1, true) | (s0, s1) ∈ X}
16: fte := fte ∨ ¬(old ∧ ¬p ∧ T ∧ (ds)')'';
17: p, vds, fte := Eliminate(nds ∧ ¬I, p, I, f, T, vds, fte);
18: return p, vds, fte;

Case 1. States s1 and s2 are in different partitions. Therefore, th1 eliminates s1, which in turn removes the transition (s0, s1), and th2 eliminates s2, which removes the transition (s0, s2) (cf. Figure 4.5.b). Since each thread works on its own copy, neither thread tries to eliminate s0, as they do not identify s0 as a deadlock state. Subsequently, when the master thread merges the results returned by th1 and th2, s0 becomes a new deadlock state that has to be eliminated, while the group predicates of the transitions (s0, s1) and (s0, s2) have been removed unnecessarily. In order to resolve this case, we restore all outgoing transitions that start from s0 and mark s0 as a state that has to be eliminated in subsequent iterations.

Case 2. To eliminate deadlock states, the elimination algorithm performs backward exploration starting from the deadlock states. Thus, two or more threads may consider the same state for elimination. For example, if th1 considers s1 for elimination and th2 considers both s1 and s2 (cf. Figure 4.5.b), then th1 removes (s0, s1) and th2 removes (s0, s1) and (s0, s2). Now, when the master thread joins the results of the two threads, the transition (s0, s1) is removed. However, as shown in Case 1, the removal of (s0, s1) is not really necessary. In fact, we would like to keep this transition in the program for the reasons mentioned above. To handle this inconsistency, we collect such transitions and add them back to the program transitions.

4.4.3 Experimental Results

We also implemented this approach for parallelization. The results for the problem of Byzantine agreement are shown in Table 4.6. From these results, we noticed that the improvement in the performance was small.
To analyze these results, we studied the effect of this approach in more detail. For the case where we utilize two threads, this approach partitions the deadlock states, say ds, into two parts, ds1 and ds2. Thread 1 begins with ds1 and performs backward exploration to determine how states in ds1 can be made unreachable. In each such backward exploration, if it chooses to remove some transition, then it has to perform a group computation to remove the corresponding group. Although this thread is working with a smaller set of deadlock states, the time required for group computation is only slightly less than in the sequential implementation, where only one thread was working with the entire set of deadlock states ds. Moreover, the time required for such group computation is very high (more than 80%) compared to the overall time required for eliminating the deadlock states. This implies that, especially for the case where we are revising a program with a large number of processes and where the available threads are relatively few, parallelization of the group computation is going to provide the maximum benefit.

Figure 4.5: Inconsistencies raised by concurrency. (Legend: a state; an eliminated state; a state to be considered for elimination.)
             Sequential             Parallel elimination with 2 threads
PR   RS      DRT(s)    TST(s)       DRT(s)    TST(s)
10   10^7         7         9            8         9
15   10^12       78        85           78        87
20   10^14      406       442          374       417
25   10^18    1,503     1,632        1,394     1,503
30   10^21    4,302     4,606        3,274     3,518
35   10^25   11,088    11,821       10,995    11,608
40   10^28   27,115    28,628       21,997    23,101
45   10^32   45,850    48,283       39,645    41,548

Table 4.6: The time required for the revision to add fault-tolerance for several numbers of non-general processes of BA, sequentially and by partitioning deadlock states using parallelism. PR: Number of processes. RS: Size of reachable state space. DRT(s): Deadlock resolution time in seconds. TST(s): Total revision time in seconds.

4.5 Using Symmetry to Expedite the Automated Revision

In this section, we present our approach for expediting the revision with the use of symmetry, using the input from Section 4.2.1. We utilize this approach in the task of resolving deadlock states that are encountered during the revision process. Therefore, using the example BA from Section 4.2.1, we describe how symmetry can help in resolving them. Then we discuss our algorithms for resolving deadlock states by utilizing symmetry to expedite the two aspects of deadlock resolution: adding recovery and eliminating deadlock states.

4.5.1 Symmetry

To describe the use of symmetry, consider the first scenario described in Section 4.2.3. In this scenario, we resolved the state s1 by adding a recovery transition. Due to the symmetry of the non-generals, one can observe that we can also add other recovery transitions. For example, if we consider the state d.g = d.j = d.l = 0, d.k = 1, and f.k = 0, we can add the recovery transition by which d.k changes to 0. With this observation, if we identify recovery action(s) to be added for one process, we can add the similar actions that correspond to other processes. Therefore, to add recovery, our algorithm does the following: whenever we find recovery transition(s), we identify other recovery transitions based on symmetry.
Then, we add all these recovery transitions to the program being revised (cf. Algorithm 8).

We also apply symmetry for deadlock-state elimination. To eliminate a set of deadlock states, we find a set of transitions which, if removed from one process, will prevent that process from reaching the deadlock states. Then, we use this set of transitions to remove similar transitions from other processes. Therefore, to eliminate deadlock states by removing program transitions, our algorithm does the following: whenever we find a set of transition(s) whose removal from one process prevents the program from reaching a deadlock state, we use symmetry to identify similar transitions for the other processes, and we remove these transitions from the program transitions (cf. Algorithm 9).

Algorithm 8 Add_Symmetrical_Recovery
Input: deadlock states ds, legitimate state predicate I, and the set mt of unacceptable transitions, including those in spec_b.
Output: recovery transitions predicate rec.
1: rec := ds ∧ (I)';
   // (I)' is the set of states to which recovery can be added to ensure recovery to legitimate states
2: rec := Group(rec);
   // Select program transitions of process i while ensuring read/write restrictions
3: rec := rec ∧ ¬Group(rec ∧ mt);
   // Remove transitions that violate safety while ensuring distribution restrictions
   // Find similar transitions for other processes
4: for i := 1 to numberOfProcesses do
5:   rec := rec ∨ SwapVariables(rec, i);
     // Generate BDDs for other processes by swapping variables based on symmetry
6: end for
7: return rec;

Algorithm 9 Group_Symmetry
Input: a set of transitions trans.
Output: a group of transitions grp.
1: grp := FindGroup(trans, read/write restrictions on i);
   // Find the group related to the transitions of process i while ensuring the read/write restrictions
   // Find similar transitions for other processes
2: for i := 1 to numberOfProcesses do
3:   grp := grp ∨ SwapVariables(grp, i);
4: end for
5: return grp;

4.5.2 Experimental Results

In Section 4.5.1, we described the use of symmetry to resolve deadlock states in the automated revision. Below, we describe and analyze the respective experimental results. In particular, we describe the results in the context of two classical examples in the literature of distributed computing, namely, the Byzantine agreement (described in Section 4.2.1) and the token ring [14]. In both case studies, we find that symmetry and parallelism improve the execution time substantially.

Symmetry

In this section, we present our experimental results in using symmetry for the resolution of deadlock states in the automated revision. Figure 4.6 shows the time spent in deadlock resolution, and Figure 4.7 shows the total revision time for different numbers of processes in the Byzantine agreement problem. From these figures, we observe that the use of symmetry provides a remarkable improvement in the performance. More importantly, one can notice that the speedup ratio (gained using the symmetrical approach) grows with the increase in the number of processes. In particular, as shown in Figure 4.7, the speedup ratio in the case of 10 non-general processes is 4.5. However, in the case of 45 non-general processes, the speedup ratio is 19. This behavior is both expected and highly valuable. Since symmetry uses the transitions of one process to identify the transitions of another process, it is expected that as the number of symmetric processes increases, so does the effectiveness of symmetry. Moreover, since the speedup is proportional to the number of (symmetric) processes, we argue that symmetry would be highly valuable in handling the state space explosion with an increased number of processes.
Figure 4.6: The time required to resolve deadlock states in the revision to add fault-tolerance for several numbers of BA non-general processes in sequential and symmetrical algorithms.

Figure 4.7: The time required for the revision to add fault-tolerance for several numbers of BA non-general processes in sequential and symmetrical algorithms.

To explain this remarkable improvement, we note that far more time is spent resolving deadlock states for each process independently than by resolving deadlock states for a single process and using symmetry to resolve deadlock states for the rest of the processes. Consequently, symmetry is expected to give better speedup ratios when the number of symmetrical processes is large.

In Figures 4.8 and 4.9, we present the results of our experiments on the token ring problem. We observe that symmetry substantially reduces the time for deadlock resolution. In fact, symmetry was able to keep this time almost constant, i.e., independent of the problem size. One can notice a spike in the required revision time of the sequential algorithm for the token ring after we hit the threshold of 90 processes. This behavior was also observed in [30] and is caused by the fact that, at this state space, we are utilizing all the available memory, causing performance to degrade due to page faults.
‘ uni—T.” 1 I l I "_"F'"'”" *1 10 20 30 40 50 60 70 80 90 100 150 rocesses +Sequential Symmetry U) 35’ ..E. 5.. P Figure 4.8: The tttttime required to resolve deadlock states in the revision to add fault- tolerance for several numbers of token ring processes in sequential and symmetrical algo- rithms. Symmetry and Parallelism In this section, we present our experimental results of using parallelism in computing the symmetry. The results of parallelizing the symmetry computation with various implemen- tations in the automated symbolic revision are presented in Figure 4.10. We have achieved the shortest revision time when we use parallelism to compute the symmetry. For example, in the case of the Byzantine agreement with 45 non-general processes using 16 threads, we achieve a speedup ratio of 1.8 times that of the symmetry alone. Since in case of the token ring, symmetry alone reduces the time of computing recovery transitions to a negligible amount, the results for this case are omitted. 74 700 ~ — —~ 600 22 -5- 500 7’ 400 — v - ~-ww~— 300 ~— ——~_ 200 100 +— ~n o -+——- .- r—-—- " 10 20 30 40 50 60 70 8O 90 100 150 rocesses ‘v'f a .E [— P +Sequential Symmetry Figure 4.9: The time required for the revision to add fault-tolerance for several numbers of token ring processes in sequential and symmetrical algorithms. 75 3000 T -—— ~—~——-—--- r————~— - ~~——-~— 2- _-__~_ .__241 +1 Thraed 2 Threads +4 Threads *8 Threads "*“ l6 Threads Figure 4.10: The time required for the revision to add fault-tolerance for several numbers of BA non-general processes using both symmetry and parallelism. 76 4.6 Summary In this chapter, we focused on the techniques that can efficiently complete the automated model revision in a reasonable amount of time. Specifically, we used techniques that ex- ploit symmetry and parallelism to expedite the automated model revision and to overcome its bottlenecks. 
For parallelism, our approach was based on parallelization with multiple threads on a multi-core architecture. We found that the performance improvement with the simple parallelization of the group computation is significantly more efficient than tradi- tional approaches that partition the deadlock states among available threads. With group computation parallelism we achieved significant benefit that is close to the ideal. In the case of symmetry, we used the fact that multiple processes in a distributed program are symmetric in nature. We used this characteristic to efficiently expedite the automated re- vision. Since, the cost of identifying the transition of a given model with the knowledge of symmetry among processes is less than the cost of identifying these transitions explic- itly, the use of symmetry reduces the overall time required for the revision. Moreover, the speedup increases as the number of symmetric processes increases. Lessons Learned. The results show that a traditional approach of partitioning dead- lock states provides a small improvement. However, it helped identify an alternative ap- proach for parallelization that is based on the distribution constraints imposed on the pro- gram being revised. While parallelization reduces the time spent in eliminating deadlock states, it may also lead to some inconsistencies that have to be resolved. The time for resolving such inconsistencies is one of the bottlenecks in parallelization, as this inconsis- tency is resolved sequentially. We note that the synchronization on visited states was also added, in part, to reduce inconsistencies among threads by requiring them to coordinate with each other. The performance improvement with the parallelizing of the group computation is sig- nificant. In fact, for most cases, the performance was close to the ideal speedup. 
What this suggests is that for the task of deadlock resolution, a simple approach based on parallelizing the group computation (as opposed to a reentrant BDD package or partitioning of the deadlock states, etc.) provides the biggest performance benefit. Moreover, the group computation itself occurs in every aspect of the revision where new transitions have to be added for recovery, or existing transitions have to be removed to prevent safety violations or to break cycles that prevent recovery to the set of legitimate states of the model/program. Therefore, the approach of parallelizing the group computation will be effective in the automated model revision of distributed programs.

Impact. Automated model revision has been widely believed to be significantly more complex than automated verification. When we evaluate the complexity of automated revision to add fault-tolerance, we find that it fundamentally includes two parts: (1) analyzing the existing program and (2) revising it to ensure that it meets the fault-tolerance properties. We showed that the complexity of the second part can be significantly remedied by the use of parallelization in a simple and scalable fashion. Moreover, if we evaluate the typical inexpensive technology that is currently in use or is likely to be available in the near future, it is expected to consist of 2-16 core computers. The first approach used in this chapter is expected to be the most suitable one for utilizing these multi-core computers to the fullest extent. Also, since the group computation is caused by the distribution constraints of the program being revised, it is guaranteed to be required even with other techniques for expediting automated revision. For example, parallelizing the group computation can be used in conjunction with the approach that utilizes symmetry among the processes being revised.
Hence, even if a large number of cores were available, this approach would be valuable together with other techniques that utilize those additional cores.

Memory Usage. Both of our approaches, symmetry and parallelism, require the use of more memory. For instance, the revision of the BA with 2 threads requires almost twice the amount of memory needed by the sequential algorithm for the same number of non-general processes. However, unlike model checking, in automated model revision we always run out of time before we run out of memory; hence, we argue that the extra memory usage is acceptable given the remarkable reductions we achieve in total revision time.

Chapter 5

Nonmasking and Stabilizing Fault-Tolerance

Achieving practical automated model revision requires us to derive theories and develop algorithms that broaden the domain of problems we can resolve by automated model revision. Towards this end, in this chapter, we focus on the constraint-based automated addition of nonmasking and stabilizing fault-tolerance to hierarchical programs. We specify the legitimate states of the program in terms of constraints that should be satisfied in those states. To deal with faults that may violate these constraints, we add recovery actions while ensuring interference freedom among the recovery actions added for satisfying different constraints. Since the constraint-based approach is well known to be applicable in the manual design of nonmasking fault-tolerance, we expect our approach to have a significant benefit in the automation of fault-tolerant programs. We illustrate our algorithms with three case studies: stabilizing mutual exclusion, stabilizing diffusing computation, and a data dissemination problem in sensor networks. With experimental results, we show that the complexity of revision is reasonable and that it can be reduced using the structure of the hierarchical systems.
To our knowledge, this is the first instance where automated revision has been successfully used in revising programs that are correct under fairness assumptions. Moreover, in two of the case studies considered in this chapter, the structure of the recovery paths is too complex to permit existing heuristic-based approaches for adding recovery.

To expedite the revision, we concentrate on reducing its time complexity using parallelism. We apply these techniques in the context of constraint satisfaction. We consider two approaches to speed up the revision algorithm: first, the use of the multiple constraints that have to be satisfied during revision; second, the use of the distributed nature of the programs being revised. We show that our approaches provide significant reductions in the revision time.

The rest of the chapter is organized as follows. In Section 5.2, we define the problem statement for the automated addition of nonmasking and stabilizing fault-tolerance. We describe the algorithms for the automated addition of nonmasking and stabilizing fault-tolerance in Section 5.3. We present our multi-core algorithms in Section 5.4 and experimental results in Section 5.5. In Section 5.6, we study the ordering in which the constraints should be satisfied. We show how we can use the hierarchical structure to reduce the complexity of our algorithm in Section 5.7. Finally, we summarize the chapter in Section 5.8.

5.1 Introduction

In this chapter, we focus on the automated addition of nonmasking and stabilizing fault-tolerance to fault-intolerant programs. Intuitively, a nonmasking fault-tolerant program ensures that if it is perturbed by faults to an illegitimate state, then it will eventually recover to its legitimate states. However, safety may be violated during recovery. Therefore, nonmasking fault-tolerance is useful for tolerating a temporary perturbation of the program state.
After recovery is completed, a nonmasking fault-tolerant program satisfies both safety and liveness in the subsequent computation. Nonmasking and stabilizing fault-tolerance is an ideal solution for adding fault-tolerance to programs that organize network nodes in a specified topology or a predefined logical structure [13].

There are several reasons that make the design of nonmasking fault-tolerance attractive. For one, the design of masking fault-tolerant programs, where both safety and liveness are preserved during recovery, is often expensive or impossible, even though the design of nonmasking fault-tolerance is easy [15]. Also, the design of nonmasking fault-tolerance can assist and simplify the design of masking fault-tolerance [105]. Moreover, in several applications nonmasking fault-tolerance is more desirable than solutions that provide fail-safe fault-tolerance (where in the presence of faults the program reaches "safe" states from where it does not satisfy liveness requirements). This is especially true for networking-related applications such as routing and tree maintenance.

A special case of nonmasking fault-tolerance is stabilization [54,56], where, starting from an arbitrary state, the program is guaranteed to reach a legitimate state. Stabilizing systems are especially useful in handling unexpected transient faults. Moreover, this property is often critical in long-lived applications where faults are difficult to predict. Furthermore, it is recognized that verifying stabilizing systems is especially hard [76]. Hence, techniques for automated revision are expected to be useful for designing stabilizing systems.

Techniques for adding nonmasking and stabilizing fault-tolerance to distributed programs can be classified into two categories.
The first category includes approaches based on distributed reset [13], where the program utilizes techniques such as distributed snapshot [38] and resets the system to a legitimate state if the current state is found to be illegitimate. Approaches from this category suffer from several drawbacks. In particular, they require the designer to know the set of all legitimate states. The cost of detecting the global state can be high. Additionally, this approach is heavy-handed, since it requires a reset of the entire system even if the fault may be localized.

The second category includes approaches based on constraint satisfaction, where we identify constraints that should be satisfied in the legitimate states. Typically, the constraints are local (e.g., involving one node, or a node and its neighbors); therefore, detecting their violation is easy. Since the constraints are local, the recovery actions to fix them are also local.

There are several issues that complicate the design of nonmasking and stabilizing fault-tolerance [10]. One such issue is the complexity of designing and analyzing the recovery actions needed to ensure that the program recovers to legitimate states. Another issue is that to verify the correctness of a nonmasking fault-tolerant program, one needs to consider all possible concurrent executions of the original program, recovery actions, and fault actions. Yet another issue is that most nonmasking algorithms assume that faults can keep happening (although they will eventually stop for a long enough time to permit recovery) even during recovery, thereby complicating the recovery to legitimate states.

Adding nonmasking and stabilizing fault-tolerance to an existing program is achieved by performing three steps. The first step is to identify the set of legitimate states of the fault-intolerant program. This set defines the constraints that should be true in the legitimate states.
The second step is to identify a set of convergence actions that recover the program from illegitimate states to legitimate states. This can be done by finding actions that satisfy one or more constraints. The last step consists of ensuring that the convergence actions do not interfere with each other. In other words, the collective effect of all recovery actions should eventually lead the program to legitimate states. In this chapter, we automate the last two steps by identifying the actions necessary to ensure that the constraints are satisfied and that the recovery actions do not interfere with each other. The automation of the first step is discussed in detail in Chapter 6.

However, this approach suffers from one important drawback: local actions taken to fix one constraint may violate other constraints. Consequently, these constraints need to be ordered. Furthermore, we need to ensure that satisfying one constraint does not violate constraints earlier in the order. Since verifying that recovery actions for satisfying one constraint do not affect other constraints is a demanding task, automated techniques that ensure correctness by construction are highly desirable. In the correct-by-construction approach, a program is automatically revised such that the output program preserves the original program specification and, in addition, satisfies new properties. However, algorithms for designing programs that are correct by construction suffer from high complexity; hence, techniques to expedite them need to be developed. Since the time complexity of the automation algorithms can be high, we also evaluate parallelization techniques to expedite the addition of nonmasking and stabilizing fault-tolerance.

In this chapter, we present an automated model revision algorithm for constraint-based synthesis of nonmasking and stabilizing fault-tolerant programs. We illustrate our algorithm with three case studies.
We note that the structure of the recovery actions in the first two case studies is too complex to permit previous approaches to achieve revision of the corresponding fault-tolerant programs [30]. We also show that the structure of the hierarchical system can be effectively used to generalize programs with a small number of processes while preserving the correct-by-construction property of the revised program. Also, we present a multi-core algorithm to synthesize distributed nonmasking and stabilizing fault-tolerant programs by partitioning the satisfaction of the constraints among the available threads. To further expedite the revision, we also present a multi-core algorithm that utilizes the distributed nature of the programs being revised by parallelizing the group computation.

To our knowledge, this is the first instance where programs that require fairness assumptions have been revised with automated techniques. Particularly, in our first case study, it is straightforward to observe that stabilizing fault-tolerance cannot be added without some fairness among all processes. Thus, previous algorithms (e.g., [30]) will declare failure in adding fault-tolerance.

5.2 Programs and Specifications

In this section, we define the problem statement for adding nonmasking and stabilizing fault-tolerance. Note that the problem statements defined in this section are instances of the original definition of fault-tolerance from Section 2.5. Those definitions are based on the ones given by Arora and Gouda [12]. Also, we use the definitions of distributed programs, fairness, legitimate states, faults, and fault-span from Chapter 2.

The goal of an algorithm that adds nonmasking fault-tolerance is to begin with a fault-intolerant program p, its legitimate state predicate I, and faults f, and to derive a nonmasking fault-tolerant program, say p', such that in the presence of faults, p' eventually converges to I.
Furthermore, computations of p' that begin in I must be the same as those of p. Based on this discussion, we define the problem of adding nonmasking fault-tolerance as follows:

Problem statement 4.1 Given p, I, and f, identify p' such that:
- Transitions within the legitimate states remain unchanged:
  s0 ∈ I ⇒ (∀s1 :: (s0,s1) ∈ p ⟺ (s0,s1) ∈ p')
- There exists a state predicate T (fault-span) such that:
  - I ⊆ T,
  - (s0,s1) ∈ (p' ∨ f) ∧ (s0 ∈ T) ⇒ s1 ∈ T,
  - s0 ∈ T ∧ (s0,s1,...) is a computation of p' ⇒ (∃j : j ≥ 0 : sj ∈ I).

Stabilizing fault-tolerance is a special instance of this problem statement with the requirement that T = Sp, i.e., the fault-span equals the set of all states. Based on this discussion, we define the problem of adding stabilizing fault-tolerance as follows:

Problem statement 4.2 Given p, I, and f, identify p' such that:
- Transitions within the legitimate states remain unchanged:
  s0 ∈ I ⇒ (∀s1 :: (s0,s1) ∈ p ⟺ (s0,s1) ∈ p')
- All program transitions eventually converge to the set of legitimate states:
  s0 ∈ Sp ∧ (s0,s1,...) is a computation of p' ⇒ (∃j : j ≥ 0 : sj ∈ I)

Note that since each constraint is preserved by the original program p, the closure property of the stabilizing program p' follows from the first constraint of the problem statement. Thus, it is not explicitly specified above.

5.3 Synthesis Algorithm for Nonmasking and Stabilizing Fault-Tolerance

Our approach for adding nonmasking and stabilizing fault-tolerance to fault-intolerant programs is based on [13]. The goal of nonmasking and stabilizing fault-tolerance is to ensure that after faults occur, the program eventually reaches one of the legitimate states in I. We focus on the instance of the problem where I = C1 ∧ C2 ∧ ... ∧ Cm, and Ci, 1 ≤ i ≤ m, is a constraint on the variables of the program. Faults perturb the program to a state in (¬I). Hence, in the presence of f, one or more of the constraints from C1, C2, ..., Cm are violated.
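The view of I as a conjunction of local constraints can be sketched in a few lines. The following fragment is purely illustrative: the two-process token state and the constraints C1 and C2 below are invented for this example, not drawn from the case studies.

```python
# Illustrative only: a made-up two-process state and two made-up constraints.
def make_invariant(constraints):
    # I = C1 ∧ C2 ∧ ... ∧ Cm: a state is legitimate iff every constraint holds
    return lambda state: all(c(state) for c in constraints)

C1 = lambda s: s["t0"] + s["t1"] == 1   # exactly one token in the system
C2 = lambda s: s["turn"] in (0, 1)      # turn variable stays in its domain
I = make_invariant([C1, C2])

good = {"t0": 1, "t1": 0, "turn": 0}
bad  = {"t0": 1, "t1": 1, "turn": 0}    # a fault duplicated the token
print(I(good))                           # True
print(I(bad))                            # False: recovery must re-establish C1
```

A fault that perturbs the state outside I thus violates at least one identifiable local constraint, which is what the synthesis algorithm exploits.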
The goal of our algorithm is to automatically synthesize the recovery actions such that when faults stop occurring, the constructed recovery actions, in conjunction with the original program actions, will eventually converge the program to a state where I holds.

5.3.1 Constraint Satisfier

Our algorithm for adding nonmasking and stabilizing fault-tolerance is shown in Algorithm 10. The input to the algorithm is the constraint array C, the fault-span T, and the program p. In this algorithm, the constraints from the constraint array are satisfied one after another. The algorithm starts by computing the legitimate state predicate as the intersection of all constraints in the constraint array (Line 3).

Then, the algorithm computes the recovery transitions to satisfy C[i]. Let Tr denote the transitions that begin in the fault-span, in a state where C[i] is false, and end in a state where C[i] is true. Unfortunately, we cannot add Tr as is, since Tr may not be implementable under the read/write constraints on processes due to the distributed nature of the program. The algorithm adds a subset of Tr, say Tr1, such that Tr1 can be implemented using the read/write restrictions of one or more processes. We denote this by the function Group_min (see Line 6)¹. This ensures that the only transitions added are those that start from a state where C[i] is false and reach a state where C[i] is true. These transitions are denoted by temp on Line 6.

Subsequently, the algorithm removes transitions from temp that violate the closure of the fault-span T. Thus, it computes a subset of transitions, say Tr_fspan, in temp that begin in a state in T and reach a state in ¬T. Again, we need to ensure that the removed transitions are consistent with the read/write restrictions of processes. The algorithm achieves this by applying the function Group_max to Tr_fspan; this computes a superset of Tr_fspan such that one or more processes can execute it. Subsequently, it removes this superset from temp (Line 7).

¹(X ∧ (Y)') refers to the transitions that start in a state in X and reach a state in Y.
This ensures that all transitions that violate the closure of T are removed. Therefore, it removes the groups of transitions that violate T (respectively, I) (Lines 7-8).

The algorithm needs to ensure that none of the transitions used to satisfy the constraint, say C[i], violates the pre-satisfied constraints C[0] to C[i−1]. Hence, it lets V include the transitions that originate from a state where C[i−1] is true and end in a state where C[i−1] is false, as well as similar transitions for the constraints C[0] to C[i−2] (Line 11). The transitions in V are used to ensure that recovery transitions do not violate other pre-satisfied constraints. The algorithm ensures that none of the transitions in temp interfere with earlier constraints. Therefore, it removes the transitions in V from temp if any are found (Line 9). At this point, the algorithm collects all recovery transitions in rec (Line 10). Steps 4-12 are repeated until all the recovery actions that satisfy all the constraints in the array C are found. Finally, the algorithm returns the recovery actions of the program p.

Algorithm 10 ConstraintSatisfier
Input: constraint array C, fault-span T, and program transitions p.
Output: recovery transitions rec.
1: temp, V := false, false;
2: m := SizeOf(C) − 1;  // m is the number of constraints
3: I := ∧_{i=0}^{m} C[i];  // Compute I (invariant) as the intersection of all constraints
4: for i := 0 to m do
5:   // temp are the transitions that start in a state in T − C[i] and reach C[i]
6:   temp := Group_min((T − C[i]) ∧ (C[i])');
     // ensure that no recovery transitions violate T
7:   temp := temp − Group_max(temp ∧ (T ∧ (¬T)'));
     // ensure that no recovery transitions violate I
8:   temp := temp − Group_max(temp ∧ (I ∧ (¬I)'));
9:   temp := temp − V;
     // Combine current recovery transitions with the new recovery transitions.
10:  rec := rec ∨ temp;
     // Compute V, the set of the transitions that violate the constraints
11:  V := V ∨ Group_max(C[i] ∧ (¬C[i])');
12: end for
     // return the recovery transitions
13: return rec;

Theorem 5.3.1:
- Given are a fault-intolerant program p, constraints C1, C2, ..., Cm, and faults f.
- Let I = C1 ∧ C2 ∧ ... ∧ Cm.
- Let T = the set of states reached in the execution of p ∨ f starting from any state in I.
- Let rec = ConstraintSatisfier(C, T, p).
If ∀s0 : s0 ∈ T − I : (∃s1 : s1 ∈ T : (s0,s1) ∈ rec), then p' (= p ∨ rec) solves the constraints in Problem statement 4.1.

Proof. To prove Theorem 5.3.1, we show that p' (= p ∨ rec) solves the constraints of Problem statement 4.1.
- By the construction of the transitions in rec, it is straightforward to see that rec does not introduce any new transitions in I. Therefore, the transitions within the legitimate states remain unchanged.
- By the construction of T, it is clear that I ⊆ T, since T includes all the states in I as well as the states reachable from I by (p ∨ f).
- From Line 7 in the algorithm ConstraintSatisfier, the transitions in rec do not include any transition that violates T.
- Since rec does not include any of the transitions from V (Lines 9 and 11), none of the transitions in rec violate pre-satisfied constraints. Therefore, there will be no cycles among the recovery transitions themselves. Hence, the constraint (s0 ∈ T ∧ (s0,s1,...) is a computation of p' ⇒ (∃j : j ≥ 0 : sj ∈ I)) is satisfied. ∎

Figure 5.1: Constraints ordering and transitions selections.

5.3.2 Algorithm Illustration

To illustrate the algorithm ConstraintSatisfier, consider the system described in Figure 5.1. In this system, we have three ordered constraints C1, C2, and C3, and I = C1 ∧ C2 ∧ C3. Since C1 is the first to be satisfied, we construct all possible recovery actions that start from any state in T − C1 and reach a state in C1 ∧ T. We proceed to satisfy C2 in the same manner.
However, after constructing the recovery actions that satisfy C2, we need to exclude actions that violate the constraint C1. In particular, we exclude actions like rec1 (c.f. Figure 5.1), since it starts from a state, s0, where C1 is true and ends in a state, s1, where C1 is false. On the other hand, we keep transitions like rec2 and rec3. We continue to construct the recovery actions that establish C3, provided that they preserve T, C1, and C2.

5.4 Expediting the Constraints Satisfaction

In Section 5.3, we described the sequential (i.e., single-thread) approach for synthesizing nonmasking and stabilizing fault-tolerant distributed programs from fault-intolerant versions. In this section, we explain our design choices and present our approaches for expediting the revision with multi-core computing (i.e., multiple threads).

5.4.1 Design Choices for Parallelism

Reviewing Algorithm 10, we can see that there are two main bottlenecks that lower the performance of this algorithm. The first is the main loop (Lines 4-12), where the number of iterations is determined by the number of constraints. The second is the Group operation in Lines 6, 7, 8, and 11. The group operation is based on the nature of distributed programs, where the addition of a transition for one process requires us to add additional transitions that are computed based on what the process cannot read/write.

Choices for constraint satisfaction. One way to partition the computation of recovery transitions is to split the recovery computation among multiple threads by allowing them to work on satisfying separate constraints. However, Algorithm 10 uses the computation of V, the transitions that violate preceding constraints (Line 11). Clearly, one possibility is to compute all possible values taken by V up front and utilize them appropriately for computing valid recovery transitions.
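To make the role of V concrete, here is a set-based sketch of the sequential loop of Algorithm 10. Explicit Python sets of states and transition pairs stand in for MDDs, the Group computation for read/write restrictions is omitted for brevity, and the tiny four-state example at the end is invented for illustration.

```python
# Sketch of Algorithm 10 with explicit sets; Group_min/Group_max are omitted.
def constraint_satisfier(C, T, I):
    """C: list of constraint state-sets; T: fault-span; I: invariant (all state-sets)."""
    rec, V = set(), set()
    for Ci in C:
        # candidate recovery: start in T outside Ci, end inside Ci (and inside T),
        # so closure of T is preserved by construction
        temp = {(s0, s1) for s0 in T - Ci for s1 in Ci & T}
        # drop transitions that would violate closure of the invariant I
        temp -= {(s0, s1) for (s0, s1) in temp if s0 in I and s1 not in I}
        # drop transitions that violate constraints satisfied earlier (the set V)
        temp -= V
        rec |= temp
        # record transitions leaving Ci: later constraints must avoid them
        V |= {(s0, s1) for s0 in Ci for s1 in T - Ci}
    return rec

# Hypothetical example: states 0..3, invariant {0}, constraints C1 = {0,1}, C2 = {0}
T = {0, 1, 2, 3}
rec = constraint_satisfier([{0, 1}, {0}], T, {0})
print(all(s1 in {0, 1} for (_, s1) in rec))   # True: every recovery step moves toward I
```

Since V depends only on the constraints (not on the recovery transitions computed so far), each V entry can indeed be computed before the main loop runs, which is what makes the per-constraint parallelization above possible.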
Computing the possible values taken by V also requires a loop of SizeOf(C) iterations, which can be parallelized using standard techniques from parallel computing.

After the computation of V, we can partition the iterations (Lines 4-12 in Algorithm 10) among several threads. We considered several approaches for this. One approach we considered was dynamic partitioning. In this approach, a pool of uncompleted iterations is maintained. Each thread picks an iteration from this pool, computes the recovery transitions for that iteration, then picks another iteration from the pool, and so on. We found that this dynamic partitioning approach, however, resulted in a high overhead, thereby reducing the speedup. Hence, we considered static partitioning, where each thread was given fixed iterations. Even here, we tried different options. One option was to partition the iterations in an alternating manner (e.g., thread 1 gets iterations 0, 2, 4, ... and thread 2 gets iterations 1, 3, 5, ...). It was expected that this would keep the sizes of the MDDs used in each thread evenly balanced. However, we found that this approach and the contiguous partitioning approach, where thread 1 got iterations 0, 1, ..., (SizeOf(C)/2) − 1 and thread 2 got iterations SizeOf(C)/2, ..., SizeOf(C) − 1, had almost identical performance in the case studies. We have used the latter in our experiments. However, we believe that the choice of partitioning could play a role in other case studies.

Choices for utilizing the distributed nature. When the recovery algorithm adds new transitions (or removes transitions that violate earlier constraints), we have to add the corresponding group of transitions based on the distributed nature of the program. Moreover, with the symbolic approach, we add (or remove) a set of transitions at a time. This set may include transitions that could be executed by several processes.
Therefore, for a given set of transitions that are added, we need to consider the read/write restrictions of each of these processes to determine the group for that set of transitions. We can utilize this feature to parallelize the group computation itself by having each thread compute the group corresponding to a subset of processes.

Again, as with the parallelization over constraints, we considered several approaches. It turned out that even for this approach, the overhead of dynamic partitioning was more than its benefit. Thus, we utilized static approaches. Since the several approaches considered for partitioning resulted in a similar speedup, we utilize the simple approach where each thread obtains a subset of processes and computes the corresponding group for those processes.

Finally, in group parallelization, the actual computation involved in each group operation is small. Hence, we found that the overhead of creating and terminating threads for each group computation was very high. For this reason, we created the threads up front and used mutexes to determine when they would be active.

Choices for parallelizing the MDD (Multi-valued Decision Diagram) library. Since we are using MDD-based symbolic revision [28], the constraints are characterized by Boolean formulae involving the variables in the program being revised. The MDD library [125] is not designed to be reentrant and assumes that at most one MDD package is active at any given time. Multiple threads cannot operate on the same MDD package simultaneously. Also, different threads cannot access different MDD packages simultaneously. We considered two approaches to solve this problem: (1) utilize a reentrant version of the MDD package, or (2) utilize multiple independent MDD packages. Since a reentrant MDD package is not available, we followed the second approach. We modified the MDD library so that multiple instances could be used simultaneously.
We also added a Transfer function to move an MDD object from one MDD package to another. Hence, during the parallel algorithms, a master thread spawns several worker threads, each running on a different core/processor in parallel with an instance of its own MDD package. The instance of the MDD package assigned to each worker thread is initialized using MDDs (e.g., the program transitions MDD) transferred from the MDD package of the master thread.

5.4.2 Partitioning the Constraints Satisfaction

Based on the design choices from Section 5.4.1, we present a multi-core algorithm that partitions the satisfaction of the constraints among the available cores/processors.

Algorithm sketch. Intuitively, our algorithm works as follows. During constraint satisfaction, a master thread spawns several worker threads, each running on a different core/processor. Each worker thread runs on its own MDD package concurrently with the other threads. The instance of the MDD package assigned to each worker thread is initialized using MDDs transferred from the MDD package of the master thread. Among these MDDs are the array of constraints to be satisfied, the program transitions, the array of constraint-violating transitions, and the legitimate state predicate. The master thread partitions the constraints and provides each worker thread with one such partition. Subsequently, the worker threads start resolving their assigned sets of constraints in parallel by adding the required recovery actions. Upon completion, the master thread merges the results returned by the worker threads.

Algorithm 11 ParallelConstraintsSatisfaction [Master Thread]
Input: constraint array C, program transitions p, fault-span T, and number of threads n.
Output: recovery transitions recAll.
1: recAll := false;
2: I := ∧_{i=0}^{m} C[i];
   // Notation: C[i] ∧ (¬C[i])' refers to transitions that start in C[i] and end in ¬C[i]
3: for i := 1 to n − 1 do
4:   SpawnThread → ComputeViolate(i);
5: end for
6: for i := 1 to SizeOf(C) − 1 do
7:   V[i] := V[i−1] ∨ V[i];
8: end for
9: for i := 0 to n − 1 do
10:  Cp[i] := Split(i, C);
11:  Vp[i] := Split(i, V);
12: end for
13: for i := 1 to n − 1 do
14:  rec[i] := SpawnThread → PConstraintSatisfier(Cp[i], p, T, Vp[i], I);
15: end for
16: ThreadJoin(0..n − 1);
17: recAll := ∨_{i=0}^{n−1} rec[i];  // Merge the results from all threads
18: return recAll;

Parallel Constraints Satisfaction. Our algorithm for satisfying the constraints in parallel is shown in Algorithm 11. This algorithm begins with the array of constraints to be satisfied C, the fault-intolerant program p, the fault-span T, and the number of worker threads to be spawned n. The goal of this algorithm is to discover the set of recovery transitions recAll such that all the constraints in C are satisfied in a way that enables the fault-tolerant program to recover to its legitimate states. Initially, the algorithm computes the legitimate state predicate I as the intersection of all constraints (Line 2). Next, the algorithm constructs the array V such that V[i] includes the transitions that start from a state where C[i] is true and end in a state where C[i] is false, as well as the similar transitions for the constraints C[j], where 0 ≤ j ≤ i−1 (Lines 3-8). An efficient way to do this computation is to let the master thread use the worker threads such that each worker thread computes its share of the V elements, where V[i] initially contains the transitions that start in C[i] and end in ¬C[i]. Once all threads are done, the master thread updates the array V such that V[i] = V[i−1] ∨ V[i]. In other words, V[i] then contains all transitions that violate any of the constraints C[0] to C[i].
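The master thread's bookkeeping, the prefix-merge of the per-constraint violation sets and a static split of work among threads, can be sketched as follows. Transition sets stand in for MDDs, and the names prefix_merge and split are our own; only the Split/V naming loosely follows the pseudocode.

```python
# Sketch of the master-thread bookkeeping in ParallelConstraintsSatisfaction.
def prefix_merge(V):
    """After this, V[i] holds all transitions violating any of C[0]..C[i]."""
    for i in range(1, len(V)):
        V[i] = V[i - 1] | V[i]
    return V

def split(n, items):
    """Static contiguous partition of items among n worker threads."""
    k = (len(items) + n - 1) // n          # ceiling division: chunk size
    return [items[i * k:(i + 1) * k] for i in range(n)]

V = prefix_merge([{("a", "b")}, {("c", "d")}, set()])
print(V[2] == {("a", "b"), ("c", "d")})    # True: V[2] covers C[0]..C[2]
print(split(2, [0, 1, 2, 3, 4]))           # [[0, 1, 2], [3, 4]]
```

The prefix-merge is inherently sequential in this simple form, but each initial V[i] can be computed independently beforehand, which is exactly the work farmed out to ComputeViolate in the pseudocode.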
After constructing the array V, the algorithm proceeds to evenly distribute the elements of the arrays C and V among the worker threads (Lines 9-12). Specifically, Cp[i] is the array of constraints assigned to thread i, and Vp[i] is the array of the corresponding constraint-violating transitions. Note that the availability of the array Vp enables each worker thread to work independently without interfering with the other threads. To compute the respective recovery transitions, each worker thread (Lines 13-15) calls the algorithm PConstraintSatisfier, which is similar to Algorithm 10 except that, in addition to Cp and p, it also takes Vp and I as input. Once all worker threads complete their jobs (Line 16), the master thread collects all the recovery transitions returned by the worker threads in recAll (Lines 17-18) and returns the overall recovery transitions.

5.5 Case Studies

In Section 5.3, we presented our approach for constraint-based automated addition of nonmasking and stabilizing fault-tolerance. In Section 5.4, we presented different approaches to exploit parallelism. In Subsections 5.5.1-5.5.3, we describe and analyze three case studies, namely stabilizing mutual exclusion [124], stabilization of a data dissemination problem in sensor networks [104], and stabilizing diffusing computation [13]. Of these, the first and the third case studies are classic problems from distributed computing and illustrate the feasibility of algorithms that add stabilizing fault-tolerance. In the second case study, we demonstrate the applicability of our approach to a real-world problem, particularly in the field of sensor networks. In all of these case studies, we find that our approach for constraint-based automated addition of nonmasking and stabilizing fault-tolerance was successful in synthesizing the fault-tolerant programs. Furthermore, we find that parallelism significantly reduces the total revision time.
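The spawn/join structure used by ParallelConstraintsSatisfaction can be sketched with ordinary threads; the per-slice work function below is a stand-in for PConstraintSatisfier (in the real algorithm each worker operates on its own MDD package), and its body is invented for illustration.

```python
# Sketch of the master/worker structure; the worker body is a placeholder.
import threading

def p_constraint_satisfier(constraint_slice, out, idx):
    # stand-in for PConstraintSatisfier: pretend each constraint in the slice
    # yields one recovery transition
    out[idx] = {("recover", c) for c in constraint_slice}

def parallel_satisfy(slices):
    results = [set() for _ in slices]
    threads = [threading.Thread(target=p_constraint_satisfier, args=(s, results, i))
               for i, s in enumerate(slices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                         # ThreadJoin(0..n-1)
    return set().union(*results)         # recAll := disjunction of rec[i]

rec_all = parallel_satisfy([["C0", "C1"], ["C2"]])
print(sorted(rec_all))   # [('recover', 'C0'), ('recover', 'C1'), ('recover', 'C2')]
```

Because each worker writes only to its own slot of the results list and reads only its own slice, no locking is needed between workers, mirroring the independence that the precomputed Vp array provides in the real algorithm.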
Throughout this section, all experiments are run on a Sun X4275 with 4 quad-core Intel Xeon E5520 processors (2.27 GHz, with 8 MB cache each) and 24 GB RAM. The MDD representation of the Boolean formulae has been done using a modified version of the MDD/BDD Glu 2.1 package [125] developed at the University of Colorado.

5.5.1 Case Study 1: Stabilizing Mutual Exclusion Program

Mutual exclusion is one of the fundamental problems in distributed/concurrent programs. One of the classical solutions to this problem is the token-based solution due to Raymond [124]. In this solution, the processes form a directed rooted tree, a holder tree, in which there is a unique token held at the tree root. If a process wants to access the critical section, it must first acquire the token. Our goal in this case study is to add stabilization to the fault-intolerant program in [15]. When faults occur and perturb the holder tree, the new program will stabilize and reconstruct a correct holder tree within a finite number of steps under a weak fairness assumption.

Fault-Intolerant Program. In Raymond's algorithm, the processes are organized in a logical tree, denoted as the parent tree. The holder tree is superimposed on top of the parent tree such that the root of the holder tree is the process that has the token. For example, Figure 5.2.a represents the undirected parent tree and Figure 5.2.b shows the holder tree when c has the token. In the fault-intolerant program, each process j has a variable h.j. If h.j = j, then j has the token. Otherwise, h.j contains the process number of one of j's neighbors. The holder variable forms a directed path from any process in the tree to the process currently holding the token. In this program, a process can send the token to one of its neighbors. For example, Figure 5.2.c shows the case where process c sends the token to e.
In particular, if j and k are adjacent (in the parent tree), then the action by which k sends the token to j is as follows:

A1 :: (h.k = k) ∧ (j ∈ Adj.k) ∧ (h.j = k) → h.k, h.j := j, j;

Constraints. Recall from Section 5.2 that we define the legitimate states to be a set of constraints on the program state space. In this case study, this set is the conjunction of the constraints S1, S2, and S3, described next. Moreover, each of these constraints is specified for each process separately. Therefore, if n is the number of processes, then we have 3n constraints to satisfy. Constraint S1 requires that j's holder can either be j's parent, j itself, or one of j's children. S2 requires that the holder tree conforms to the parent tree. Finally, S3 requires that there are no cycles in the holder relation. Thus, predicates S1, S2, and S3 are as follows:

(S1) ∀j : (h.j = P.j) ∨ (h.j = j) ∨ (∃k : (P.k = j) ∧ (h.j = k))
(S2) ∀j : (P.j ≠ j) ⇒ (h.j = P.j) ∨ (h.(P.j) = j)
(S3) ∀j : (P.j ≠ j) ⇒ ¬((h.j = P.j) ∧ (h.(P.j) = j))

Figure 5.2: The holder tree: (a) the undirected parent tree; (b) the holder tree when c has the token; (c) c passes the token to e.

Faults. Since we focus on stabilizing fault-tolerance, we consider faults that perturb the holder relation of all processes to an arbitrary value. Thus the fault action is as follows:

(F1) true → {h.j := any arbitrary value from its domain};

Fault-Tolerant Program. To add stabilizing fault-tolerance to the above program, we used the revision algorithm as follows. The fault-intolerant program for each process is specified by action A1; the faults are specified by the fault action F1; and the constraints are from S1, S2, and S3. We specified these constraints in the following order: first, we specified constraints S1 for the root, then its children, then its grandchildren, and so on. Subsequently, we specified constraint S2 likewise. Finally, we specified constraint S3 in the reverse order.
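The constraints are easy to evaluate on an explicit state. The sketch below is a hand-written check, not the MDD encoding used by the tool: it assumes an illustrative three-process chain a–b–c with parent pointers P, holder pointers h, and the token-passing action A1 as the `send_token` helper.

```python
P = {"a": "a", "b": "a", "c": "b"}            # parent tree; root a is its own parent
children = {"a": {"b"}, "b": {"c"}, "c": set()}
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

def s1(h):
    # S1: j's holder is j's parent, j itself, or one of j's children
    return all(h[j] == P[j] or h[j] == j or h[j] in children[j] for j in h)

def s2(h):
    # S2: the holder tree conforms to the parent tree
    return all(j == P[j] or h[j] == P[j] or h[P[j]] == j for j in h)

def s3(h):
    # S3: no two adjacent processes hold each other (no cycles in the holder relation)
    return all(j == P[j] or not (h[j] == P[j] and h[P[j]] == j) for j in h)

def legitimate(h):
    return s1(h) and s2(h) and s3(h)

def send_token(h, k, j):
    # Action A1: k holds the token and passes it to its neighbour j
    assert h[k] == k and j in adj[k] and h[j] == k
    h[k], h[j] = j, j
```

With the token at c (h: a→b, b→c, c→c) the state is legitimate, and passing the token to b preserves S1–S3, whereas a fault that makes a and b point at each other violates S3.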
The recovery actions computed by the revision algorithm are as follows:

R1 :: ¬((h.j = P.j) ∨ (h.j = j) ∨ (∃k : (P.k = j) ∧ (h.j = k)))
        → h.j := j | h.j := P.j | h.j := {child of j};
R2 :: ¬((P.j ≠ j) ⇒ (h.j = P.j) ∨ (h.(P.j) = j))
        → h.j := P.j | h.(P.j) := j;
R3 :: ¬((P.j ≠ j) ⇒ ¬((h.j = P.j) ∧ (h.(P.j) = j)))
        → h.j := j | h.(P.j) := P.j | h.(P.j) := {child of P.j};

Analysis of experimental results. Table 5.1 shows the results of synthesizing the Stabilizing Mutual Exclusion program with various numbers of processes organized in a linear topology. It shows the time needed, in seconds, to add recovery, validate the recovery transitions (against pre-satisfied constraints), and the total revision time in terms of the number of processes being revised. Table 5.2 shows the result of a similar case study where the processes are arranged in a binary tree topology.

    No. of      Constraint satisfaction (s)    Total
    Processes   Recovery      Validation       time (s)
    30          19            21               40
    40          78            74               153
    50          217           238              457
    60          505           509              1020
    70          1110          1103             2238

Table 5.1: Stabilizing Mutual Exclusion, linear topology.

Table 5.2 illustrates that, given the same state space, the complexity is higher in the tree topology than in the linear topology. This is due to the following reason: the constraints of a process compare its variables with those of its neighbors. To model this effectively, the process variables and the variables of its neighbors need to be close to each other in the MDD variable ordering. This can be achieved easily on a linear topology. However, for a tree topology, this is not possible for all the processes. Hence, computing recovery transitions for those cases is more expensive.

    No. of      Constraint satisfaction (s)    Total
    Processes   Recovery      Validation       time (s)
    7           < 1           < 1              < 1
    15          2             < 1              < 3
    17          3             < 1              < 4
    21          3             5                10
    31          30            19               49

Table 5.2: Stabilizing Mutual Exclusion, binary tree topology.

Table 5.3 shows the results of using parallelism during constraint satisfaction in synthesizing the Stabilizing Mutual Exclusion program.
The table illustrates the results for various numbers of processes organized in a linear topology using different numbers of processors/cores. It shows the time needed, in seconds, to satisfy the constraints, and the total revision time. It also shows the amount of memory in megabytes. As we can see from this table, using parallelism has substantially reduced the time needed for the revision. As a concrete example, observe that the time required to synthesize a stabilizing mutual exclusion program with 50 processes dropped from 457 seconds, using the sequential algorithm, to 374 seconds when two cores were used, and to 178 seconds when four cores were used. Table 5.4 shows the results of exploiting the distributed nature of the program being revised (i.e., group parallelism) in synthesizing the Stabilizing Mutual Exclusion program. It shows the time needed, in seconds, to compute the group, and the total revision time. It also shows the amount of memory in megabytes needed by our algorithm. We can clearly see the feasibility of adding stabilizing fault-tolerance using automated revision. Both time and space complexity are reasonable and proportional to the reachable state space. Furthermore, as specified in Section 5.7, the complexity for a larger number of processes can be reduced by utilizing the hierarchical structure.

Table 5.3: Stabilizing Mutual Exclusion using constraint partitioning (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).
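For the 50-process run quoted above, the speedup and parallel efficiency work out as follows (a simple back-of-the-envelope computation on the reported timings):

```python
seq = 457                      # total revision time (s), sequential algorithm
times = {2: 374, 4: 178}       # total revision time (s) with 2 and 4 cores

speedup = {cores: seq / t for cores, t in times.items()}
efficiency = {cores: s / cores for cores, s in speedup.items()}
# Four cores give roughly a 2.6x speedup, i.e., about 64% parallel efficiency.
```

The two-core run is far from the ideal 2x, which is consistent with the master thread's sequential prefix-merge of V being on the critical path.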
Table 5.4: Stabilizing Mutual Exclusion using group parallelism (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).

5.5.2 Case Study 2: Data Dissemination in Sensor Networks

In this problem, a base station initiates a computation in which a block of data is to be sent to all sensors in the network. The data message is split into fixed-size packets. Each packet is given a sequence number. The base station starts transmitting the packets to its neighbor(s) in specified time slots, in the order of the packet sequence numbers. Subsequently, when the neighbor(s) receive a message, they, in turn, retransmit it to their neighbors, and so on. The computation ends when all sensors in the network receive all the messages. Our goal in this case study is to synthesize a nonmasking fault-tolerant version of the data dissemination program that can tolerate a finite number of lost packets. The revised program is the same as Infuse [104], which was designed manually.

Fault-Intolerant Program. In this case study, we arrange the processes in a linear topology. The base station has N packets to send to M processes. (We note that a similar revision is possible for any other fixed topology.) The fault-intolerant program transmits the packets in a simple pipeline. For this, each process keeps track of the messages (received/sent) using two variables u.j and l.j, where u.j is the highest message sequence number received by process j, and l.j is the sequence number of the message currently being transmitted by process j. Process j increments u.j every time it receives a new message.
It also sets l.j to be the sequence number of the message it is transmitting. The base station transmits a packet if its neighbor has received the previous packet (action IN1). A process j, j > 0, receives a packet from its predecessor if its successor had received the previous packet (actions IN2 and IN3). Thus, the actions of the fault-intolerant program are as follows:

Action for the base station:
(IN1) (l.0 = u.1) → l.0 := l.0 + 1;

Action for process j ∈ {1..M−1}:
(IN2) (u.j ≤ u.(j+1)) ∧ (u.j ≤ u.(j−1)) ∧ (l.(j−1) = u.j + 1)
        → u.j, l.j := u.j + 1, l.j + 1;

Action for process M (the last process):
(IN3) (u.M ≤ u.(M−1)) ∧ (l.(M−1) = u.M + 1) → u.M, l.M := u.M + 1, l.M + 1;

Faults. In this section, we consider faults that lose a message. To model such faults for the base station, we add action (F1), where the base station increments l.0 even though its successor has not received the previous packet. To model such an action for the other processes, we add action (F2), where a process advances l.j even though its successor has not yet received the previous packet.

(F1) true → l.0 := l.0 + 1;
(F2) (u.j ≤ u.(j−1)) ∧ (l.(j−1) = u.j + 1) → u.j, l.j := u.j + 1, l.j + 1;

Constraints. The constraints that define the legitimate states in the case of the data dissemination program are as follows. The first constraint states that initially the base station has all the packets (S1). A process cannot receive a packet if its predecessor has not received it (S2), and cannot transmit a packet that it does not have (S3). A process transmits a packet that is expected by its successor (S4 and S5).

(S1) u.0 = N
(S2) ∀j : u.j ≤ u.(j−1)
(S3) ∀j : l.j ≤ u.j
(S4) ∀j : l.j ≤ u.(j+1) + 1
(S5) ∀j : l.j ≥ u.(j+1)

Fault-Tolerant Program. The recovery actions computed by the revision algorithm are as follows:

(R1) (l.j > u.(j+1) + 1) ∧ (u.j + 1 = l.(j−1))
        → u.j := l.(j−1), l.j := u.(j+1) + 1;
(R2) (u.j > u.(j+1) + 1) ∧ (l.j > u.(j+1) + 1) → l.j := u.(j+1) + 1;

Table 5.5 shows the results of synthesizing the data dissemination protocol with various numbers of processes.
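A fault-free run of the pipeline can be simulated explicitly. This is an illustrative sketch following the reconstruction of IN1–IN3 above; the bound l[0] < N on IN1 is an added assumption so the run terminates once every packet has been sent.

```python
def step(u, l, N):
    M = len(u) - 1
    # IN1: the base station sends the next packet once process 1 has caught up
    if l[0] == u[1] and l[0] < N:
        l[0] += 1
        return True
    # IN2: process j receives from j-1 once its successor has the previous packet
    for j in range(1, M):
        if u[j] <= u[j + 1] and u[j] <= u[j - 1] and l[j - 1] == u[j] + 1:
            u[j], l[j] = u[j] + 1, l[j] + 1
            return True
    # IN3: the last process receives whenever its predecessor transmits
    if u[M] <= u[M - 1] and l[M - 1] == u[M] + 1:
        u[M], l[M] = u[M] + 1, l[M] + 1
        return True
    return False

def disseminate(N, M):
    u = [N] + [0] * M          # u[j]: highest sequence number received by j
    l = [0] * (M + 1)          # l[j]: sequence number j is transmitting
    while step(u, l, N):       # run until no action is enabled
        pass
    return u
```

In the absence of faults the run ends with every process holding all N packets, which is exactly the termination condition of the dissemination computation.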
One can notice that most of the total revision time was spent on adding recovery, while a smaller amount of time was spent in validating the recovery transitions. The main reason for this behavior is that the structure of the fault-span in this case study is simpler: if a message is lost on one link, then until it is recovered, that message cannot be sent again (it is possibly lost on subsequent links).

    No. of      Reachable   Memory   Constraint satisfaction (s)    Total
    Processes   states      (MB)     Recovery      Validation       time (s)
    50          10^25       11       4             2                6
    100         10^59       12       32            14               48
    150         10^70       15       153           47               207
    200         10^93       16       452           162              633

Table 5.5: Nonmasking data dissemination program, linear topology.

Table 5.6 shows the results of synthesizing the data dissemination protocol with various numbers of processes by partitioning the constraints among the available threads. Note that, in the case of the data dissemination problem, there were only 5 constraints to satisfy. Hence, when the revision is launched with 8 threads, we are only utilizing 5 of them. As can be seen from Table 5.6, if the number of constraints is not large enough, then the speedup gained from partitioning the constraints is limited. Table 5.7 shows the results of synthesizing the data dissemination protocol with various numbers of processes by exploiting the distributed nature of this program.

5.5.3 Case Study 3: Stabilizing Diffusing Computation

In distributed systems, a diffusing computation is used to inquire about (e.g., termination detection) or establish (e.g., distributed reset) a system global state. We consider a diffusing computation on a system where processes are arranged in a logical tree. The root initiates a diffusing computation and propagates it to its children, and the children forward it to their children, and so on until it reaches all processes. Once the computation reaches a leaf, it marks the leaf as completed and reflects back to the parent.
When all children of a process are marked completed, that process marks itself completed and reflects the computation to its parent. The diffusing computation ends when it marks the root as completed.

Table 5.6: Data dissemination program using constraint partitioning (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).

Table 5.7: Data dissemination program using group parallelism (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).

Fault-Intolerant Program. The fault-intolerant program in this case study is the diffusing computation program from [13]. Each process j has two Boolean variables c.j (color) and sn.j (session number) and an integer variable P.j (the parent of j). A new diffusing computation can start if the root is colored green (c.root = green) and the session number of the root is the same as its children's. To start a new diffusing computation, the root sets c.root = red and flips sn.root. When a green process finds that its parent is red, it copies its parent's color and session number. Moreover, if a process has no children or all its children switched colors from red to green, the process then switches its color to green. The program for the diffusing computation consists of three actions.
The first action starts the diffusing computation at the root (DC1). The second action propagates the diffusing computation to the children (DC2). The third action completes the diffusing computation when all the children complete the computation (DC3). The program actions are described below:

DC1 :: (c.root = green) → c.root := red, sn.root := ¬sn.root;
DC2 :: (c.j = green) ∧ (c.(P.j) = red) ∧ (sn.j ≠ sn.(P.j)) → c.j, sn.j := c.(P.j), sn.(P.j);
DC3 :: (c.j = red) ∧ (∀k : P.k = j ⇒ (c.k = green ∧ sn.j = sn.k)) → c.j := green;

Constraints. The first disjunct of (S1) states that j's parent has participated in a diffusing computation while j has not participated yet. The second disjunct of (S1) states that j and its parent are participating in a computation, or they both have completed a computation.

(S1) ∀j : (c.j = green ∧ c.(P.j) = red) ∨ (c.j = c.(P.j) ∧ sn.j = sn.(P.j))

Faults. We now consider faults that change the values of c.j and sn.j to an arbitrary value. The fault actions are as follows:

(F1) true → c.j := red | green;
(F2) true → sn.j := true | false;

Fault-Tolerant Program. To construct the nonmasking fault-tolerant version of the fault-intolerant Diffusing Computation program, we used our algorithm with the program actions (DC1–DC3), the constraint (S1), and the fault actions (F1, F2) as input. The revised program has the actions (DC1–DC3) in addition to the following recovery actions:

(R1) (c.j = red) ∧ (sn.j ≠ sn.(P.j)) → c.j := green, sn.j := sn.(P.j);
(R2) (c.(P.j) = green) ∧ (c.j = red) → c.j := green;
(R3) (c.(P.j) = c.j) ∧ (sn.j ≠ sn.(P.j)) → sn.j := sn.(P.j);
(R4) (c.(P.j) = red) ∧ (c.j = red) ∧ (sn.j ≠ sn.(P.j)) → c.j := green;

    No. of      Constraint satisfaction (s)    Total
    Processes   Recovery      Validation       time (s)
    50          1             3                4
    100         12            19               32
    150         57            53               113
    200         151           124              282

Table 5.8: Stabilizing Diffusing Computation, linear topology.
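A complete round of the three actions can be traced on a small instance to confirm that S1 holds after every step. This is an illustrative hand simulation on an assumed three-process chain with parent pointers P and the guards as reconstructed above:

```python
G, R = "green", "red"
P = [0, 0, 1]                      # parents on a 3-process chain; root 0
children = {0: [1], 1: [2], 2: []}

def s1(c, sn):
    # Constraint S1, checked for every non-root process j
    return all((c[j] == G and c[P[j]] == R) or
               (c[j] == c[P[j]] and sn[j] == sn[P[j]])
               for j in range(len(c)) if P[j] != j)

def dc1(c, sn):                    # DC1: the root starts a new computation
    assert c[0] == G
    c[0], sn[0] = R, 1 - sn[0]

def dc2(c, sn, j):                 # DC2: j copies its red parent's colour/session
    assert c[j] == G and c[P[j]] == R and sn[j] != sn[P[j]]
    c[j], sn[j] = c[P[j]], sn[P[j]]

def dc3(c, sn, j):                 # DC3: j completes once all children completed
    assert c[j] == R and all(c[k] == G and sn[k] == sn[j] for k in children[j])
    c[j] = G
```

Starting from all-green with session 0, the sequence DC1, DC2(1), DC2(2), DC3(2), DC3(1), DC3(0) keeps S1 true throughout and returns the system to all-green with the session flipped.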
    No. of      Constraint satisfaction (s)    Total
    Processes   Recovery      Validation       time (s)
    15          < 1           < 1              < 1
    17          1             1                2
    21          1             3                4
    23          2             4                6

Table 5.9: Stabilizing Diffusing Computation, binary tree topology.

Table 5.8 shows the results of synthesizing a stabilizing diffusing computation program with various numbers of processes organized in a linear topology. Table 5.9 shows the result where the processes are arranged in a binary tree. Table 5.10 shows the results of synthesizing the diffusing computation program with various numbers of processes by exploiting the distributed nature of this program. Table 5.11 shows the results of synthesizing the diffusing computation program with various numbers of processes by partitioning the constraints among the available threads.

Memory Usage. Notice that the amount of memory needed during revision is proportional to the number of threads being used. It is approximately the amount of memory used by the sequential algorithm multiplied by the number of cores being used. Clearly, this is expected, since for every thread used we create a new MDD package. We argue that using extra memory to gain a speedup is acceptable, since in automated revision time complexity is a far more serious barrier than space complexity.

5.6 Choosing an Ordering Among Constraints

To apply Theorem 3.1, we need to identify an order among the constraints. In our case studies, we attempted several orderings and most were successful in synthesizing the nonmasking and stabilizing fault-tolerant program. Hence, choosing the "right" order does not appear to be very crucial. Also, [13] identifies several heuristics that can assist in identifying the right order among constraints. One possible approach is to consider different combinations as part of the revision algorithm. With such an approach, O(n^2) combinations suffice for most examples.
Table 5.10: Stabilizing Diffusing Computation using group parallelism (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).

Table 5.11: Stabilizing Diffusing Computation using constraint partitioning (constraint-satisfaction time in seconds, total revision time in seconds, and memory usage in MB, for 2, 4, and 8 threads).

In particular, to identify an ordering, we can utilize an algorithm similar to insertion sort as follows: first consider only constraints C1 and C2 and attempt both orderings between them. If both orderings fail, then adding nonmasking fault-tolerance cannot be achieved using the constraint-based approach that uses constraints C1 and C2. If both succeed, then we can choose any order. Without loss of generality, let the order be C1 and C2. Then, we consider constraint C3 in conjunction with C1 and C2. There are three possible positions in which to insert C3 without affecting the order between C1 and C2. We can evaluate all three options and then consider C4, and so on. It follows that the number of such runs will be O(n^2). In all the case studies in this chapter, as well as for several other algorithms in the literature, the above approach would succeed in identifying the right order of constraints. It follows that one does not need to consider all possible (n!) orderings among the constraints.
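The insertion-sort-style search described above can be written down directly. Here `try_revision` is a hypothetical oracle: one call corresponds to running the constraint-satisfaction algorithm with the candidate ordering and reporting success or failure.

```python
def find_order(constraints, try_revision):
    """Identify a workable constraint ordering with O(n^2) oracle calls."""
    order = []
    for c in constraints:
        for pos in range(len(order) + 1):
            candidate = order[:pos] + [c] + order[pos:]
            if try_revision(candidate):
                order = candidate     # keep the first position that works
                break
        else:
            return None               # no position works: the approach fails
    return order
```

For example, with an oracle that only accepts orderings placing constraint "a" before constraint "b", the search finds a valid permutation after a handful of calls instead of trying all n! orderings.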
Another approach is to allow the revision algorithm to choose a random ordering for satisfying the constraints. If the revision algorithm fails to find a solution using a given constraint ordering, then it chooses a different random order. The revision algorithm keeps trying different random orderings for the constraints until it finds a solution or it exhausts all possible combinations. We implemented this approach. We found that, depending on the program being revised, the time required to complete the revision may vary significantly. More specifically, in the case of the Stabilizing Mutual Exclusion from Section 5.5.1, the order of the constraints is almost always irrelevant, and the revision algorithm found a solution using any order it tried. Table 5.12 shows the results of 10 experiments. In each experiment, the revision algorithm randomly chose an order for the constraints and tried to synthesize using that order. In all cases the revision completed successfully for any order and from the first try. The time needed to complete the revision was almost identical to that of the case where the constraints were manually ordered (cf. Table 5.1). However, this was not always the case. For example, Table 5.13 shows the results of synthesizing the Stabilizing Diffusing Computation from Section 5.5.3. In this case, the order in which the revision algorithm satisfies the constraints is significant. More specifically, the revision algorithm has to try different orderings (on average 3-4 times) before it successfully synthesizes the stabilizing fault-tolerant program. Moreover, the time required to complete the revision, in this case, was much higher than when the constraints were manually ordered (cf. Table 5.8).

Table 5.12: Stabilizing Mutual Exclusion with linear topology using random constraint satisfaction.
Table 5.13: Stabilizing Diffusing Computation with linear topology using random constraint satisfaction.

5.7 Reducing the Complexity with Hierarchical Structure

Based on the case studies, we can observe that as the number of nodes in the hierarchy increases, the time complexity can increase substantially. For example, in the first case study, when we increased the height of the binary tree from 3 to 4 (i.e., from 7 to 15 processes), the revision time increased from 5 to 72 seconds. This is expected, since the state space increases from 10^5 to 10^16 states. Thus, a natural question in this area is whether the structure of the hierarchical system can assist in reducing the complexity. We show that the answer to this question is affirmative. For simplicity, we illustrate this in the context of the linear topology and the binary tree topology.

Linear topology. Consider the case where the system is as shown in Figure 5.3.a. Let the constraints used during revision be ∀j :: Cj, where the quantification is over the set of all processes in the system. Let Cj be a constraint that depends on the variables of process j, j−1 (if it exists), and j+1 (if it exists). Furthermore, assume that the constraints for intermediate processes are identical except for the renaming of variables. Let the order of predicates added for the system in Figure 5.3.a be CA, CB, CD.
Furthermore, let the added recovery actions be recA, recB, recD.

Figure 5.3: Complexity and hierarchy for the linear topology.

Theorem 5.7.1 If (recA ∨ recB ∨ recD) form the recovery actions for the program in Figure 5.3.a, then (recA ∨ recB ∨ rec'C ∨ rec'D) form the recovery actions for the program in Figure 5.3.b, where rec'C is obtained by replacing B by C and (then) replacing A by B in recB, and rec'D is obtained by replacing B by C in recD. ∎

Proof. Based on the order of constraints and the rules used in constructing recovery actions, constraints CA and CB will be satisfied even for the network in Figure 5.3.b. Since recovery actions do not execute after the corresponding constraint is satisfied, eventually the recovery actions in rec'C and rec'D (and the fault-intolerant program) will execute. Since CD only depends on the variables of D and its predecessor, and they correct a predicate involving D and its predecessor, if actions in rec'D execute then they will correct CD. Moreover, if actions in rec'D execute then they terminate (after satisfying CD). Hence, given the fairness assumption, actions in rec'C will execute. Observe that rec'C is obtained from recB by replacing B by C and A by B. Furthermore, based on the definitions of the constraints, CC is obtained from CB by replacing B by C and A by B. Thus, rec'C will correct CC. Note that rec'C can violate CD. However, it will be corrected again by rec'D. ∎

Binary tree topology. Consider the case where the system is as shown in Figure 5.4.a. Let the constraints used during revision be ∀j :: Cj, where the quantification is over the set of all processes in the system. Let Cj be a constraint that depends on the variables of process j, j's parent (if it exists), and j's children (if they exist). Furthermore, assume that the constraints for intermediate processes (respectively, the leaves) are identical except for the renaming of variables.
Let the order of predicates added for the system in Figure 5.4.a be CA, CB, CC, CD, CE, CF, and CG. Furthermore, let the added recovery actions be recA, recB, recC, recD, recE, recF, and recG.

Theorem 5.7.2 If (recA ∨ recB ∨ recC ∨ recD ∨ recE ∨ recF ∨ recG) form the recovery actions for the program in Figure 5.4.a, then (recA ∨ recB ∨ rec'C ∨ rec'D ∨ rec'E ∨ rec'F ∨ rec'G ∨ recH ∨ recI ∨ recJ ∨ recK ∨ recL ∨ recM ∨ recN ∨ recO) form the recovery actions for the program in Figure 5.4.b, where:

1. recB is used to generate rec'D by: (a) replacing D by H and E by I, (b) replacing B by D, and (then) (c) replacing A by B,

2. recH is obtained by replacing D by H and (then) replacing B by D in recD,

3. recI is obtained by replacing D by I and (then) replacing B by D in recD;

rec'E, rec'F, rec'G, recJ, recK, recL, recM, recN, and recO are generated by using steps similar to steps 1-3. ∎

Proof. The proof of Theorem 5.7.2 is similar to that of Theorem 5.7.1. ∎

Figure 5.4: Complexity and hierarchy for the binary tree topology.

While the above result is straightforward and widely understood, it is especially useful for managing the complexity of hierarchical systems. While results of this form have been presented in the literature, the preconditions that must be satisfied to apply them are often difficult to evaluate during automated revision. However, the conditions of the above theorems are easy to evaluate, and they can reduce the complexity of synthesizing systems with a larger number of nodes. Clearly, constructing and verifying the recovery actions that satisfy the conditions of Theorems 5.7.1 and 5.7.2 is syntactical and requires a minimal amount of time to complete.

5.8 Summary

In this chapter, we focused on making the automated model revision more comprehensive and covering more levels of fault-tolerance.
In particular, we derived theories, developed algorithms, and built tools to automate the addition of nonmasking and stabilizing fault-tolerance. Our algorithm ensures that it adds recovery actions that enable the program to recover to its legitimate states from any arbitrary state. This algorithm is based on describing the legitimate states using a set of constraints. Then, it finds recovery actions that satisfy each constraint. Finally, it makes sure that the recovery actions do not interfere with each other and work collectively to reach the legitimate states. Also, we used multi-core technology to parallelize our algorithm to substantially reduce the revision time. We illustrated our approach with three case studies. Furthermore, we demonstrated that automated revision in these case studies was feasible and achieved in a reasonable time.

Chapter 6

Legitimate States Automated Discovery

Existing algorithms for automated model revision require that the designers identify the legitimate states of the original model. Experience suggests that, of the inputs required for model revision, identifying such legitimate states is the most difficult and creates a burden on the use of these methods. To reduce this burden, we develop an algorithm wLspGenerator (i.e., weakest legitimate state predicate generator) for identifying the largest set of states from where the program satisfies its specification. Furthermore, we show how this algorithm can be integrated with existing algorithms for the addition of fault-tolerance. With an example, we show that a straightforward approach of using reachability analysis from initial states to compute legitimate states is not relatively complete. The rest of the chapter is organized as follows: In Section 6.2, we present our algorithm, wLspGenerator, for computing the weakest legitimate state predicate for the given program.
In Section 6.3, we demonstrate the application of this algorithm with four case studies to show that it computes the largest set of legitimate states required for model revision. Finally, we present a summary in Section 6.4.

6.1 Introduction

In automated model revision to add fault-tolerance, it is required that, after the occurrence of faults, the revised program eventually recovers to the legitimate states of the original program. Since the original program met its original specification from these states, we can ascertain that eventually the revised program reaches states from where subsequent computations are correct. One of the problems in providing recovery to legitimate states, however, is that these legitimate states are not always easy to determine. Current approaches for automated model revision, i.e., for revising an existing model to add fault-tolerance, include [27, 30, 101, 111] as well as the approaches presented in Chapters 3-5. These approaches describe the model as an abstract program. They require the designer to specify (1) the existing abstract program that is correct in the absence of faults, (2) the program specification, (3) the faults that have to be tolerated, and (4) the program's legitimate states, from where the existing program satisfies its specification. Of these four inputs, the first three are easy to identify and are unavoidable. For example, one is expected to utilize model revision only if there is an existing model that fails to satisfy a required property. Thus, if model revision is applied in the context of newly identified faults, the original model and the faults are already available. Likewise, the specification identifies what the model was supposed to do. Clearly, requiring it is unavoidable. Identifying the legitimate states from where the fault-intolerant program satisfies its specification is, however, a difficult task.
Our experience in this context shows that while identifying the other three arguments is often straightforward, identifying precise legitimate states requires significant effort. It is straightforward to observe that if these legitimate states could be derived automatically, then it would reduce the burden put on the designer, thereby making it easier to apply these techniques in the revision of existing programs.

One approach for identifying legitimate states is to use initial states as legitimate states. While identifying these initial states is typically easy for the designer, this approach is very limiting. A variation of this approach is to define the legitimate states to be those states that are reachable from the initial states. While less limiting, this approach fails to identify states from where the existing program is correct, although such states are not reached in fault-free execution. While the knowledge of these states is irrelevant for fault-free execution, it is potentially useful in adding fault-tolerance. In particular, if faults perturb the program to one of these states, no recovery may be needed. Furthermore, recovery could be added to these states so that subsequent computation is correct.

In this chapter, we focus on automated model revision where we begin with the specification of the original program and discover the legitimate states automatically. In particular, we focus on identifying the largest set of legitimate states from where the original fault-intolerant program satisfies its specification. Subsequently, we utilize this set of legitimate states in obtaining the fault-tolerant program that is correct by construction. (If we view a set of states as a predicate that is true only in those states, then this corresponds to the weakest state predicate.) Of course, an enumerative approach, where we consider each state as a potential initial state, is impractical.
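The difference between reachability-based legitimate states and the weakest legitimate state predicate can be made concrete with a toy example (entirely hypothetical; the states and transitions below are ours, not from any case study):

```python
# Toy illustration: states are integers 0..5.
# Transitions: 0->1->2->0 is the fault-free cycle; 3->4->3 is a second
# cycle from which the specification also holds, but which is never
# reached from the initial state 0. State 5 violates safety.
trans = {0: 1, 1: 2, 2: 0, 3: 4, 4: 3, 5: 5}
init = {0}
bad = {5}

def reachable(start, trans):
    """States reachable from the given start states."""
    seen = set(start)
    frontier = list(start)
    while frontier:
        t = trans[frontier.pop()]
        if t not in seen:
            seen.add(t)
            frontier.append(t)
    return seen

# Weakest legitimate predicate: every state whose computation never
# reaches a bad state -- here, everything except state 5.
legit = {s for s in trans if bad.isdisjoint(reachable({s}, trans))}

print(reachable(init, trans))  # {0, 1, 2}
print(legit)                   # {0, 1, 2, 3, 4}
```

States 3 and 4 are legitimate but unreachable: the reachability approach would force recovery away from them, whereas the weakest predicate keeps them available as recovery targets.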
Our goal in this chapter is to identify efficient techniques for identifying the largest set of legitimate states for a given program.

Our algorithm for computing the largest set of legitimate states takes two inputs: the program (specified in terms of its transitions) and its specification. The program specification consists of: (1) a safety specification, which is specified in terms of (bad) states that the program should not reach and (bad) transitions that the program should not execute, and (2) zero or more liveness specifications of the form F leads-to T (written as F ↝ T), which state that if the program ever reaches a state where F is true, then in its subsequent computation it reaches a state where T is true.

In this chapter, we present the algorithm stpGenerator for identifying the set of legitimate states with respect to the given program and specification. We show that our algorithm for finding the largest set of legitimate states is sound. With a BDD-based implementation, we show that our algorithm manages the state explosion problem. We illustrate our algorithm in the context of four case studies: the Byzantine agreement program [108], the token ring program [30], the Stabilizing Tree Based Mutual Exclusion problem based on the fault-intolerant version by Raymond [124], and the Stabilizing Diffusing Computation [13]. The sets of legitimate states computed in these examples are identical to those in Chapters (3-5) and in [30, 102]. In particular, the set of legitimate states computed in this chapter for mutual exclusion is used in [15] for adding nonmasking fault-tolerance. It follows that by combining our algorithm with that in [102] for adding fault-tolerance, it would be possible to perform the revision to add fault-tolerance without requiring the designer to specify the legitimate states explicitly.
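The inputs to stpGenerator (the program's transitions, plus a specification made of bad states, bad transitions, and leads-to properties) might be encoded as follows. This is an illustrative explicit-state encoding; the names are ours, and the actual implementation represents each component as a BDD rather than a set.

```python
from dataclasses import dataclass

# In this sketch a state is just an index into a finite state space.

@dataclass
class Spec:
    """Specification as consumed by the generator (illustrative encoding)."""
    bad_states: set       # SPEC_bs: states the program must never reach
    bad_transitions: set  # SPEC_bt: (s0, s1) pairs it must never execute
    leads_to: list        # [(F, T), ...] pairs of state sets, each read as F ~> T

# A program is a set of transitions; its specification is a Spec instance.
program = {(0, 1), (1, 2), (2, 0)}
spec = Spec(bad_states={3},
            bad_transitions={(1, 3)},
            leads_to=[({0}, {2})])  # from state 0, eventually reach state 2
```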
6.2 The “Weakest Legitimate State Predicate Generator (stpGenerator)” Algorithm

In this section, we present our algorithm to automatically generate the largest set of legitimate states using the program transitions and its specification. The goal of our algorithm is to generate the largest set of legitimate states (i.e., the weakest legitimate state predicate) from where the program satisfies its safety and liveness specification. Our algorithm consists of three main parts: the legitimate states generator, the safety checker, and the liveness checker. We describe each of the three algorithms in Subsections 6.2.1-6.2.3. We use a symbolic representation in terms of Boolean formulas since we implemented this algorithm using Ordered Binary Decision Diagrams (OBDDs) [34].

Algorithm sketch. Intuitively, our algorithm consists of two main steps. The first step is to generate the initial set of legitimate states from the program transitions and safety specifications. In this step, we identify the initial set of legitimate states, say I, to be all the states in the state space excluding the set of bad states, SPEC_bs (the states that should not be reached). Then we proceed to ensure that I does not include any state from where the program can violate safety. The second step is to ensure that I satisfies the liveness properties. To verify a specific liveness property, say X ↝ Y, the algorithm needs to ensure that all program execution paths from all states in X reach Y. Furthermore, all such paths should be cycle-free; if cycles exist, then all states in X that lead to these cycles are removed from I.

We now describe our algorithm in detail. First, we describe the algorithm stpGenerator, which computes the largest set of legitimate states, say I, that satisfies the program specification. Then, we proceed to describe the algorithm SafetyChecker, which computes the set of states from where the program does not violate the safety property.
Finally, we describe the algorithm LivenessChecker, which removes from the set of legitimate states, I, any state that may violate the liveness of the program.

6.2.1 Weakest Legitimate State Predicate Generator

The input to stpGenerator consists of the program transitions, SPEC_bs (the states that should not be reached), SPEC_bt (the transitions that should not be executed), and the liveness properties. The algorithm returns the largest set of legitimate states from where the program satisfies its specification. First, it initializes the legitimate states Iw to be the whole state space (Line 1). Then, the algorithm computes the largest set of legitimate states by calling the function SafetyChecker (Line 4). At this point, Iw includes the set of states from where the program satisfies the given safety specification. Next, the algorithm enforces the liveness properties one after another by calling the function LivenessChecker, which removes states that violate the given liveness property (Lines 5-7). Removal of states due to liveness properties may require re-computation of Iw. Hence, this computation is in a loop and terminates when a fixpoint is reached.

6.2.2 Safety Checker

The input of the SafetyChecker algorithm consists of the initial set of legitimate states, the program transitions, SPEC_bs, and SPEC_bt. The output is the computed largest set of legitimate states, Isf, for the given safety specification.

Algorithm 12 WeakestLegitimateStatePredicateGenerator (stpGenerator)
Input: program transitions p, SPEC_bs (states that should not be reached), SPEC_bt (transitions that should not be executed), F[] and T[] state predicates describing leads-to properties.
Output: weakest legitimate state predicate Iw.
    // Initially Iw equals Sp, the program state space.
 1: Iw := Sp
 2: repeat
 3:   tmp := Iw
 4:   Iw := SafetyChecker(Iw, p, SPEC_bs, SPEC_bt);
      // check the i-th liveness property
 5:   for i := 0 to NoOfLivenessProperties do
 6:     Iw := LivenessChecker(Iw, p, F[i], T[i]);
 7:   end for
 8: until tmp = Iw
    // return the largest set of legitimate states.
 9: return Iw;

First, the algorithm initializes the set of legitimate states Isf to be Itmp excluding the states in SPEC_bs (Line 1). Then, the algorithm starts a fixpoint computation that removes undesired states from Isf. If Isf contains a state s0 such that the program can execute a transition (s0, s1) that violates safety, then s0 cannot be in Isf; hence, we remove s0 from Isf (Line 4). Note that a state is removed from Isf only if the given program violates safety from that state. If Isf contains a state s0, p contains a transition (s0, s1), and s1 has been removed from Isf, then s0 must also be removed from Isf (Line 5). This process continues until a fixpoint is reached. At this point, the algorithm exits the loop and returns the desired set of legitimate states Isf.

Algorithm 13 SafetyChecker
Input: initial legitimate states Itmp, program transitions p, SPEC_bs (states that should not be reached), SPEC_bt (transitions that should not be executed).
Output: weakest legitimate state predicate Isf.
    // Sp is the state space of p
 1: Isf := Itmp - SPEC_bs;
 2: repeat
 3:   tmpI := Isf;
 4:   Isf := Isf - {s0 : (s0, s1) ∈ p ∩ SPEC_bt};
 5:   Isf := Isf - {s0 : (s0, s1) ∈ p ∧ s0 ∈ Isf ∧ s1 ∉ Isf};
 6: until tmpI = Isf
    // return the set of states from where the program satisfies safety properties.
 7: return Isf;

6.2.3 Liveness Checker

The input of the LivenessChecker algorithm consists of the initial set of legitimate states Itmp, the program transitions p, and the state predicates F and T, where F ↝ T is a given leads-to property. The output is the largest set of states that is a subset of Itmp from where the given program satisfies F ↝ T.
First, the algorithm creates a program tmpP by adding a self-loop to every deadlock state, i.e., every state s0 from which the program p has no outgoing transitions and s0 ∉ T (Line 1). All computations of tmpP are thus infinite or terminate in a state in T. Next, we remove all transitions of tmpP that reach T (Line 2). If p satisfies F ↝ T, then it follows that tmpP cannot include any infinite computation that includes a state in F. Hence, the algorithm iteratively removes deadlock states of tmpP (Lines 5-7). If some states in F still remain, then there are infinite computations of tmpP that begin in a state in F but never reach a state in T. We remove such states from Itmp and iteratively recompute Itmp.

Algorithm 14 LivenessChecker
Input: initial legitimate states Itmp, program transitions p, F and T state predicates describing a leads-to property.
Output: weakest legitimate state predicate Itmp.
    // ASSUMPTION: F ∩ T = {}. If not, change F to (F - T).
    // let ds(p) = {s0 : ∀s1 :: (s0, s1) ∉ p} be the set of deadlock states.
    // add a self-loop to the states in ds(p).
 1: tmpP := p ∪ {(s0, s0) : s0 ∉ T ∧ s0 ∈ ds(p)};
 2: tmpP := {(s0, s1) : (s0, s1) ∈ tmpP ∧ s1 ∉ T};
 3: repeat
 4:   invF := Itmp;
 5:   while (invF ∩ ds(tmpP)) ≠ {} do
 6:     invF := invF - ds(tmpP);
 7:   end while
 8:   if F ∩ invF ≠ {} then
 9:     Itmp := Itmp - (F ∩ invF);
10:   end if
11: until F ∩ invF = {}
    // return the set of states from where the program satisfies the liveness property.
12: return Itmp;

Extension. In some cases, the program actions are partitioned into system actions and environment actions. It is expected that the environment actions will eventually stop (for a long enough time) so that the system actions can make progress (and satisfy the liveness property). In such cases, we can apply the above algorithm as follows: the program actions used in SafetyChecker will consist of both the system actions and the environment actions.
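Algorithms 12-14 can be prototyped over explicit sets of states and transitions. This is a minimal sketch with names of our choosing; the actual implementation is symbolic, over OBDDs, and we read ds(tmpP) on Lines 5-7 relative to the surviving set invF, which is what makes the inner loop a fixpoint.

```python
def safety_checker(i_tmp, p, spec_bs, spec_bt):
    """Algorithm 13 sketch: largest subset of i_tmp from where safety holds."""
    i_sf = i_tmp - spec_bs                                   # Line 1
    while True:
        old = set(i_sf)
        # Line 4: drop states that can execute a bad transition
        i_sf -= {s0 for (s0, s1) in p if (s0, s1) in spec_bt}
        # Line 5: drop states with a transition leaving the candidate set
        i_sf -= {s0 for (s0, s1) in p if s0 in i_sf and s1 not in i_sf}
        if old == i_sf:
            return i_sf

def liveness_checker(i_tmp, p, f_pred, t_pred):
    """Algorithm 14 sketch: drop states from where F ~> T can fail."""
    f_pred = f_pred - t_pred                   # assumption: F and T disjoint
    srcs = {s0 for (s0, _) in p}
    # Line 1: self-loop deadlock states outside T; Line 2: cut edges into T
    tmp_p = set(p) | {(s, s) for s in i_tmp if s not in srcs and s not in t_pred}
    tmp_p = {(s0, s1) for (s0, s1) in tmp_p if s1 not in t_pred}
    while True:
        inv_f = set(i_tmp)
        while True:  # Lines 5-7: prune states with no successor left in inv_f
            dead = {s for s in inv_f
                    if not any(s0 == s and s1 in inv_f for (s0, s1) in tmp_p)}
            if not dead:
                break
            inv_f -= dead
        # inv_f now holds states with an infinite tmp_p computation avoiding T
        if not (f_pred & inv_f):               # until-condition (Line 11)
            return i_tmp
        i_tmp = i_tmp - (f_pred & inv_f)       # Line 9

def stp_generator(states, p, spec_bs, spec_bt, leads_to):
    """Algorithm 12 sketch: weakest legitimate state predicate."""
    i_w = set(states)
    while True:
        tmp = set(i_w)
        i_w = safety_checker(i_w, p, spec_bs, spec_bt)
        for f_pred, t_pred in leads_to:
            i_w = liveness_checker(i_w, p, f_pred, t_pred)
        if tmp == i_w:
            return i_w

# Example: cycle 0->1->2->0 satisfies {0} ~> {2}; state 3 leads into bad state 4.
p = {(0, 1), (1, 2), (2, 0), (3, 4), (4, 4)}
print(stp_generator({0, 1, 2, 3, 4}, p, {4}, set(), [({0}, {2})]))  # {0, 1, 2}
```

In the example, states 3 and 4 are removed by the safety pass (state 4 is bad, and state 3 can only reach it), after which the liveness pass confirms that every computation from {0} reaches {2}.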
The program actions used for LivenessChecker will consist of only the system actions.

Theorem 6.2.1 The algorithm stpGenerator is sound (i.e., the generated set of legitimate states is the largest set of legitimate states).

Proof. The proof consists of two parts: (1) if a state, say s0, is not included in the output of stpGenerator, then the program does not satisfy its specification from s0, and (2) if a state, say s0, is included in the output of stpGenerator, then the program satisfies its specification from s0.

We prove the first part by considering all parts of the code where some state is removed from the output.

• Line 1 of SafetyChecker: Clearly, states in SPEC_bs cannot be included in the final set of legitimate states.

• Line 4 of SafetyChecker: If (s0, s1) is a transition of the program that violates safety, then there is a computation of the program that starts from s0 and violates the specification.

• Line 5 of SafetyChecker: If s1 is a state already removed from the final set of legitimate states, i.e., there is a program computation that starts from s1 and violates the specification, and (s0, s1) is a program transition, then there exists a computation that starts from s0 and violates the specification.

• Line 9 of LivenessChecker: Observe that in tmpP, transitions that reach T are removed. Now, the loop on Lines 5-7 removes all deadlock states from invF. If any state, say s0, in F is not removed, then there are infinite computations of tmpP that start from s0; for instance, this happens if a cycle is reachable from s0. By construction, such a computation cannot reach T. Thus, if a state s0 is removed on Line 9 of LivenessChecker, then there is a computation from s0 that violates the specification.

We use proof by contradiction for the second part. Suppose s0 is included in the output of stpGenerator and there is a computation, say (s0, s1, ...), that violates the specification from s0.
We consider two cases depending upon whether this computation violates the safety specification or the liveness specification.

• Safety specification. Consider the first point where a safety violation is detected, i.e., either a state, say sj, in SPEC_bs is reached or a transition, say (sj-1, sj), in SPEC_bt is executed.

  - Case 1: sj ∈ SPEC_bs. By Line 1 of SafetyChecker, j ≠ 0. Also, by Line 5 of SafetyChecker, sj-1 would be removed from the final set of legitimate states. Likewise, sj-2 would be removed, and so on. Thus, s0 cannot be in the output of stpGenerator. This is a contradiction.

  - Case 2: (sj-1, sj) ∈ SPEC_bt. By the same argument as in Case 1, we can show that s0 cannot be in the output of stpGenerator. This is a contradiction.

• Liveness specification. If this computation does not satisfy the liveness specification, then it has a suffix where F is true in some state, say sj, but T is false in all states. Now, we define a computation σ that starts from sj. If the computation (s0, s1, ...) is infinite, then σ is the suffix that starts from sj. If the computation (s0, s1, ...) is not infinite, i.e., it ends in a state, say sl, where p has no outgoing transitions, then σ is obtained by concatenating the suffix starting from sj with an infinite stuttering of state sl. By construction, σ is also a computation of tmpP (Line 2 of LivenessChecker). Thus, sj is removed from the output of stpGenerator. Again, by an argument similar to the case of the safety specification, we can conclude that s0 cannot be in the output of stpGenerator. This is a contradiction.
∎

6.3 Application of stpGenerator in Automated Model Revision

In this section, we describe and analyze our approach for generating the legitimate states of the four case studies: the Byzantine agreement program [108], the token ring program [30], the Stabilizing Tree Based Mutual Exclusion problem based on the fault-intolerant version by Raymond [124], and the Stabilizing Diffusing Computation [13]. We chose these classical examples from the literature of distributed computing to illustrate the feasibility and applicability of our algorithm in generating the weakest legitimate state predicate. Furthermore, these case studies illustrate that the overhead of computing the legitimate states using stpGenerator is very small compared to the overall time required for the addition of fault-tolerance. Thus, reducing the burden on the designer in terms of requiring the explicit legitimate states increases the complexity by a very small factor.

Throughout this section, all case studies are run on a MacBook Pro with a 2.6 GHz Intel Core 2 Duo processor and 4 GB RAM. The OBDD representation of the Boolean formulas has been done using the C++ interface to the CUDD package developed at the University of Colorado [125].

6.3.1 Case Study 1: Byzantine agreement program

We illustrate our algorithm in the context of the Byzantine agreement program from Section 4.3.3. We start by specifying the fault-intolerant program. Then, we provide the program specification. Finally, we describe the weakest legitimate state predicate generated by our algorithm.

Program. The Byzantine agreement program consists of a “general” and three or more non-general processes. Each process copies the decision of the general and finalizes (outputs) that decision. Recall from Section 4.3.3 that the actions of the Byzantine agreement program are as shown below. The only difference is in the third and fourth actions, which allow a Byzantine process to change its decision and finalized status.
The last two actions are environment actions.

1 :: (d.j = ⊥) ∧ (f.j = false) → d.j := d.g;
2 :: (d.j ≠ ⊥) ∧ (f.j = false) → f.j := true;
3 :: (b.j) → d.j := 1|0, f.j := false|true;
4 :: (b.g) → d.g := 1|0;

where j ∈ {1..n} and n is the number of non-general processes.

Specification. The safety specification of the Byzantine agreement requires validity and agreement:

• Validity requires that if the general is non-Byzantine, then the final decision of a non-Byzantine process must be the same as that of the general. Thus, validity(j) is defined as follows.

  validity(j) = ((¬b.j ∧ ¬b.g ∧ f.j) ⇒ (d.j = d.g))

• Agreement means that the final decisions of any two non-Byzantine processes must be equal. Thus, agreement(j, k) is defined as follows.

  agreement(j, k) = ((¬b.j ∧ ¬b.k ∧ f.j ∧ f.k) ⇒ (d.j = d.k))

• The final decision of a process must be either 0 or 1. Thus, final(j) is defined as follows.

  final(j) = f.j ⇒ (d.j = 0 ∨ d.j = 1)

We formally identify the safety specification of the Byzantine agreement by the following set of bad states:

  SPEC_BA_bs = (∃j, k ∈ {1..n} :: ¬(validity(j) ∧ agreement(j, k) ∧ final(j)))

Observe that SPEC_BA_bs can be easily derived from the specification of the Byzantine agreement problem.

The liveness specification of the Byzantine agreement requires that eventually every non-Byzantine process finalizes a decision. The requirement that process j eventually finalizes a decision can be specified as follows:

  ¬b.j ↝ (f.j)

Application of our algorithm. The weakest predicate computed (for 3 non-general processes) is as follows. If the general is non-Byzantine, then it is necessary that d.j, where j is also non-Byzantine, be either d.g or ⊥. Furthermore, a non-Byzantine process cannot finalize its decision if its decision equals ⊥. Now, we consider the set of states where the general is Byzantine. In this case, the general can change its decision arbitrarily.
Also, the predicate includes states where the general is Byzantine while the other processes are non-Byzantine and have the same decision value, different from ⊥. Thus, the generated weakest legitimate state predicate is as follows:

  I_BA = ( ¬b.g ∧ (∀p ∈ {1..n} :: ((¬b.p ∧ f.p) ⇒ d.p ≠ ⊥) ∧ (¬b.p ⇒ (d.p = ⊥ ∨ d.p = d.g))) )
       ∨ ( b.g ∧ (∀j, k ∈ {1..n} : j ≠ k :: (d.j = d.k) ∧ (d.j ≠ ⊥)) )

Observe that I_BA cannot be easily derived from the specification of the Byzantine agreement problem. More specifically, the states where the general is Byzantine are not reachable from the initial states of the program.

We used the exact same predicate in the case study from Section 4.3.3 to add fault-tolerance to Byzantine faults. (In [30], where we reported the results for the addition of fault-tolerance with symbolic techniques, the set of legitimate states used was a conjunction of the above predicate and a formula stating that at most one process is Byzantine. However, this extra formula does not affect the revised program or the time complexity.)

The amount of time required for computing this set of legitimate states for different numbers of processes is as shown in Table 6.1. We would like to note that the set of legitimate states computed in these case studies is the same as that used in the addition of fault-tolerance.

  No. of Processes | Reachable States | Legitimate States Generation Time (sec)
  10               | 10^9             | 0.57
  20               | 10^15            | 1.34
  30               | 10^22            | 4.38
  40               | 10^30            | 9.25
  50               | 10^36            | 26.34
  100              | 10^71            | 267.30

Table 6.1: The time required to generate the weakest legitimate state predicate (Byzantine Agreement).

We note that the time required to compute the set of legitimate states is very small compared with the total time needed to complete the revision. For example, synthesizing a fault-tolerant Byzantine agreement program with 40 processes takes more than 9,000 seconds, as shown in Section 4.3.3. By contrast, the time to compute the legitimate states is only 9.25 seconds.
Thus, the overhead of synthesizing from the specification without explicit legitimate states is negligible.

We use this case study to illustrate that computing the set of legitimate states as those reachable from the initial states is not relatively complete. In particular, for the Byzantine agreement example, the initial state is one where all processes are non-Byzantine and the decision of every non-general process is equal to ⊥. Clearly, all processes are non-Byzantine in all states reached by the program from these initial states. It follows that recovery to these reachable states is not always feasible in the presence of faults. Hence, these reachable states are insufficient to obtain the fault-tolerant program. By contrast, the weakest legitimate state predicate can be utilized to find the fault-tolerant program.

6.3.2 Case Study 2: Token Ring

In this section, we illustrate our algorithm in the context of the token ring program. First, we specify the fault-intolerant program. Then, we provide its specification. Finally, we identify the largest set of legitimate states generated by the algorithm from Section 6.2.

Program. The token ring program consists of n processes organized in a ring. A token is circulated among the processes in a fixed direction. When a process gets the token, it can access the critical section. Each process j, where j ∈ {0..n}, has a variable x.j with the domain {0, 1, ⊥}, where ⊥ denotes that the process is in an illegitimate state. Process 0 has the token iff x.0 is equal to x.n, and a process j, where 1 ≤ j ≤ n, has the token iff x.j ≠ x.(j-1). The actions of the token ring program are as follows:

1 :: x.j ≠ x.(j-1) → x.j := x.(j-1);
2 :: x.0 = x.n → x.0 := x.n +2 1;

where +2 denotes modulo-2 addition.

Specification. The safety specification of the token ring requires that the value of x at any process is either 0 or 1 and that no two processes have a token simultaneously.
Thus, the safety specification of the token ring program can be identified using the following set of bad states (i.e., states that should not be reached by normal program execution):

  SPEC_TR_bs = (∃j, k : j ≠ k ∧ j, k ∈ {1..n} :: ((x.(j-1) ≠ x.j) ∧ (x.(k-1) ≠ x.k)))
             ∨ (∃j : j ∈ {1..n} :: ((x.(j-1) ≠ x.j) ∧ (x.0 = x.n)))
             ∨ (∃j : j ∈ {0..n} :: (x.j = ⊥))

The liveness specification of the token ring requires that eventually every process gets the token. The requirement that process 0 eventually gets the token can be specified as:

  true ↝ (x.0 = x.n)

Application of our algorithm. After applying our algorithm with the above inputs, the generated largest set of legitimate states can be represented using the following regular expression:

  (x.0, x.1, x.2, ..., x.n) ∈ (0^l 1^(n+1-l)) ∪ (1^l 0^(n+1-l)), where 0 ≤ l ≤ n+1.

Thus, the above predicate states that the sequence (x.0, x.1, x.2, ..., x.n) is a sequence of zeros followed by ones, or of ones followed by zeros. The value of l+1 in the above sequence identifies the process with the token.

We note that this is the exact same set of legitimate states used in Section 4.3.3 for adding fault-tolerance to the fault where up to n processes are detectably corrupted. Furthermore, the time for computing this set of legitimate states for different values of n is as shown in Table 6.2. As we can see, it is very small.

  No. of Processes | Reachable States | Legitimate States Generation Time (sec)
  10               | 10^4             | 0.1
  20               | 10^9             | 0.2
  30               | 10^14            | 0.3
  40               | 10^19            | 0.4
  50               | 10^23            | 0.6
  100              | 10^47            | 0.19

Table 6.2: The time required to generate the weakest legitimate state predicate (token ring).

6.3.3 Case Study 3: Mutual Exclusion

In this section, we illustrate our algorithm in the context of Raymond's tree-based mutual exclusion program. Our goal in this case study is to automatically generate the weakest legitimate state predicate for the program in [15]. We start by specifying the fault-intolerant program. Then, we provide the program specification.
Finally, we identify the weakest legitimate state predicate generated by our algorithm.

Program. Recall that the action by which process k sends the token to process j is as follows:

1 :: (h.k = k ∧ j ∈ Adj.k) ∧ (h.j = k) → h.k := j, h.j := j;

where Adj.k denotes the neighbors of k.

Specification. Since the goal of Raymond's mutual exclusion algorithm is to maintain a tree rooted at the token, it requires that the holder of any process be one of its tree neighbors. It also requires that there be no cycles in the holder relation. We formally describe the safety specification by the following predicate:

  SPEC_ME_bs = (∃j ∈ {0..n} :: ((h.j ≠ j) ∧ (h.j ≠ P.j) ∧ (h.j ≠ ch.j)))
             ∨ (∃j, k ∈ {0..n} : j ≠ k :: ((h.j = k) ∧ (h.k = j)))
             ∨ (∃j, k ∈ {0..n} : j ≠ k :: ((h.j = j) ∧ (h.k = k)))

where ch.j denotes one of the children of j.

Application of our algorithm. The generated weakest legitimate state predicate of the mutual exclusion program computed by our algorithm is as follows. The legitimate state predicate requires that j's holder be either j's parent, j itself, or one of j's children. It also requires that the holder tree conforms to the parent tree and that there are no cycles in the holder relation.

  I_ME = (∀j ∈ {0..n} :: (h.j = P.j) ∨ (h.j = j) ∨ (∃k :: (P.k = j) ∧ (h.j = k)))
       ∧ (∀j ∈ {0..n} :: (P.j ≠ j) ⇒ ((h.j = P.j) ∨ (h.(P.j) = j)))
       ∧ (∀j ∈ {0..n} :: (P.j ≠ j) ⇒ ¬((h.j = P.j) ∧ (h.(P.j) = j)))

where P.j denotes the parent of j.

Recall that I_ME is equivalent to the conjunction of the constraints (S1, S2, and S3) used in deriving the nonmasking fault-tolerant version of the mutual exclusion program. The amount of time required for computing this set of legitimate states for different numbers of processes is as shown in Table 6.3.

  No. of Processes | Reachable States | Legitimate States Generation Time (sec)
  10               | 10^9             | 0.01
  20               | 10^26            | 0.1
  30               | 10^44            | 0.2
  40               | 10^64            | 0.5
  50               | 10^84            | 0.9
  100              | 10^200           | 0.43

Table 6.3: The time required to generate the weakest legitimate state predicate (Mutual Exclusion).
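Under the reading of I_ME above, the predicate can be sanity-checked by brute force on a small instance. On a four-process chain (an illustrative configuration of ours, not one of the measured instances), the states satisfying I_ME should be exactly those in which a single process holds the token, with all holder pointers oriented toward it:

```python
from itertools import product

# Four processes on a chain 0-1-2-3; P[j] is j's parent (process 0 is the root).
P = [0, 0, 1, 2]
n = len(P)
children = [[k for k in range(n) if P[k] == j and k != j] for j in range(n)]

def legit(h):
    """Check the three conjuncts of I_ME for holder assignment h."""
    for j in range(n):
        # conjunct 1: h.j is j's parent, j itself, or one of j's children
        if h[j] not in {P[j], j} | set(children[j]):
            return False
        if P[j] != j:
            up, down = (h[j] == P[j]), (h[P[j]] == j)
            # conjuncts 2 and 3: exactly one of "j holds toward its parent"
            # and "j's parent holds toward j"
            if up == down:
                return False
    return True

states = [h for h in product(range(n), repeat=n) if legit(h)]
holders = sorted(j for h in states for j in range(n) if h[j] == j)
print(len(states), holders)  # 4 [0, 1, 2, 3]
```

Each of the four legitimate states corresponds to one token position, which matches the intent of the predicate: the holder pointers form a tree directed toward a unique token holder.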
6.3.4 Case Study 4: Diffusing Computation

In this case study, we consider a diffusing computation on a system where processes are arranged in a logical tree. The root initiates a diffusing computation and propagates it to its children; the children forward it to their children, and so on, until it reaches all processes. Once the computation reaches a leaf, it marks the leaf as completed and reflects back to the parent. When all children of a process are marked completed, that process marks itself completed and reflects the computation to its parent. The diffusing computation ends when it marks the root as completed.

Program. The fault-intolerant program in this case study is the diffusing computation program from [13]. Each process j has two Boolean variables c.j (color) and sn.j (session number) and an integer variable P.j (the parent of j). A new diffusing computation can start if the root is colored green (c.root = green) and the session number of the root is the same as that of its children. To start a new diffusing computation, the root sets c.root = red and flips sn.root. When a green process finds that its parent is red, it copies its parent's color and session number. Moreover, if a process has no children, or all its children have switched colors from red to green, the process switches its color to green.

The program for the diffusing computation consists of three actions. The first action starts the diffusing computation at the root (1). The second action propagates the diffusing computation to the children (2). The third action completes the diffusing computation when all the children complete the computation (3). The program actions are described below:

1 :: (c.root = green) → c.root := red, sn.root := ¬sn.root;
2 :: (c.j = green) ∧ (c.(P.j) = red) ∧ (sn.j ≠ sn.(P.j)) → c.j, sn.j := c.(P.j), sn.(P.j);
3 :: (c.j = red) ∧ (∀k : P.k = j ⇒ (c.k = green ∧ sn.j = sn.k)) → c.j := green;

Specification.
The safety specification for the diffusing computation program requires that all processes have the same color and the same session number. We formally define the safety specification by the following predicate:

  SPEC_DC_bs = (∃j, k ∈ {0..n} : j ≠ k :: (sn.j ≠ sn.k) ∨ (c.j ≠ c.k))

Application of our algorithm. The generated weakest legitimate state predicate of the diffusing computation is as follows. The set of legitimate states requires that each process's color and session number be consistent with those of its parent:

  I_DF = (∀j :: (c.j = green ∧ c.(P.j) = red) ∨ (c.j = c.(P.j) ∧ sn.j = sn.(P.j)))

6.4 Summary

In this chapter, we provided techniques that permit the designer to efficiently describe the model to be revised. Specifically, we derived theories, developed algorithms, and built tools to automate the discovery of the legitimate states of the model. Our techniques relieve the designer from performing unnecessary steps, thereby simplifying the application of automated model revision.

Our algorithm uses the program actions and specification to automatically generate the weakest legitimate state predicate. First, it initializes the weakest legitimate state predicate to be the set of states from where the given program does not violate the safety specification. Second, it ensures that the generated weakest legitimate state predicate satisfies the liveness properties by removing any state that violates them.

Also, we considered four case studies. We used our algorithm to automatically discover the set of legitimate states for each case. In each of these examples, the generated set of legitimate states was the same as the one specified explicitly in the automated addition of fault-tolerance, and the time to generate the legitimate states was very small compared with that for performing the corresponding model revision.
Chapter 7

Automated Model Revision Without Explicit Legitimate States

In Chapter 6, we introduced our algorithm for the automated discovery of the legitimate states. We also showed how such automation reduces the burden put on the designer, making it easier to apply these techniques in the revision of existing programs. However, one question that we need to answer concerns the completeness of this approach: if it is possible to perform model revision with explicit legitimate states, is it also possible to do so without the explicit identification of the legitimate states?

In this chapter, we consider the problem of automated model revision without explicit legitimate states. We show that this formulation is relatively complete, i.e., if it is possible to perform model revision with explicit legitimate states, then it is possible to do so without the explicit identification of the legitimate states.

We also identify instances where the complexity class of model revision without explicit legitimate states is the same as that with explicit legitimate states. In turn, this identifies heuristics for performing model revision without explicit legitimate states. Finally, we show that with these heuristics, the increased cost for model revision without explicit legitimate states is small.

The rest of this chapter is organized as follows: In Section 7.1, we present an alternative approach for performing model revision. In Section 7.2, we state the automated model revision problem statement. In Sections 7.3, 7.4, and 7.5, we answer three questions related to the completeness, complexity, and cost of our approach. Finally, we summarize the chapter in Section 7.6.

7.1 Introduction

In this chapter, we focus on the problem of model revision where the legitimate states are computed using automation techniques, in particular, where the algorithm stpGenerator from Chapter 6 is used to generate the set of legitimate states.
Recall from Chapter 6 that the current approaches for automated model revision describe the model as an abstract program. They require the designer to specify (1) the existing abstract program that is correct in the absence of faults, (2) the program specification, (3) the faults that have to be tolerated, and (4) the program legitimate states, from where the existing program satisfies its specification (cf. Figure 7.1). We call this the problem of model revision with explicit legitimate states.

Figure 7.1: Model Revision with Explicit Legitimate States. [Diagram: the original model, specifications, faults, and legitimate states are inputs to automated model revision, which produces the revised model.]

We focus on the problem of model revision where the input only consists of the fault-intolerant program, the faults, and the specification, i.e., it does not include the legitimate states. We call this the problem of model revision without explicit legitimate states (cf. Figure 7.2).

Figure 7.2: Model Revision without Explicit Legitimate States. [Diagram: the same inputs to automated model revision, minus the legitimate states.]

There are several important questions that have to be addressed for such a new formulation.

Q.1 Is the new formulation relatively complete? (I.e., if it is possible to perform model revision using the problem formulation in Figure 7.1, is it guaranteed that it can be solved using the formulation in Figure 7.2?) An affirmative answer to this question will indicate that the reduction of the designers' burden does not affect the solvability of the corresponding problem.

Q.2 Are the complexities of both formulations in the same class? (By same class, we mean polynomial-time reducibility, where complexity is computed in the size of the state space.) An affirmative answer to this question will indicate that the reduction in the designers' burden does not significantly affect the complexity.
Q.3 Is the increased time cost, if any, small compared to the overall cost of program revision? While Question 2 focuses on qualitative complexity, assuming that the answer to it is affirmative, Question 3 addresses the quantitative change in complexity.

In this chapter, we show that the answer to Q.1 is affirmative (cf. Theorem 7.3.1). Furthermore, we show that the answer to Q.2 is partially affirmative. Specifically, we identify two versions of the revision problem: partial revision and total revision. We show that the answer is affirmative for total revision (cf. Theorem 7.4.3). We point out that the answer is negative for partial revision; in other words, for partial revision, the complexity of solving the problem in Figure 7.2 can be larger (cf. Section 7.4.5). Even though the answer to Q.2 is negative for partial revision, we show that there is a subclass of this problem where the complexity of the approach in Figure 7.2 is the same as that in Figure 7.1. In particular, we show that for all instances where the answer to the problem in Figure 7.1 is affirmative, it is possible to solve the corresponding problem in Figure 7.2 in the same complexity class. However, it is possible that the answer to the problem in Figure 7.1 is negative, i.e., the corresponding algorithm declares failure to generate the fault-tolerant program, although the answer to the corresponding problem in Figure 7.2 is affirmative. For these cases, the complexity of solving the problem in Figure 7.2 can be high. Regarding Q.3, we show that for instances where the answer to the question in Figure 7.1 is affirmative, the extra computation cost of solving the problem using the approach in Figure 7.2 is small.

7.2 Problem Statement

In this section, we formally define the problem of model revision with and without explicit legitimate states.

Model Revision with Explicit Legitimate States (Approach in Figure 7.1).
Recall that in Section 2.5 we defined what it means for a program to be (masking) fault-tolerant. Using a similar definition, we now formally specify the problem of deriving a fault-tolerant program from a fault-intolerant program p with explicit legitimate states I, safety specification Sf_p, and liveness specification Lv_p. The goal of the model revision is to modify p to p' by only adding fault-tolerance, i.e., without adding new behaviors in the absence of faults. Since the correctness of p is known only from its legitimate states, I, it is required that the legitimate states of p', say I', cannot include any states that are not in I. Additionally, inside the legitimate states, p' cannot include transitions that were not transitions of p. Also, by Assumption 1.1, p' cannot include new terminating states that were not terminating states of p. Finally, p' must be fault-tolerant. Thus, the problem statement (from [101]) for the case where the legitimate states are specified explicitly is as follows.

Problem Statement 7.1 Revision for Fault-Tolerance with Explicit Legitimate States.
Given p, I, Sf_p, Lv_p, and f such that p satisfies Sf_p and Lv_p from I
Identify p' and I' such that (Respectively, does there exist p' and I' such that)
A1: I' ⇒ I.
A2: s0 ∈ I' ⇒ ∀s1 : s1 ∈ I' : ((s0,s1) ∈ p' ⇒ (s0,s1) ∈ p).
A3: p' is f-tolerant to Sf_p and Lv_p from I'.

Note that this definition can be instantiated for each level of fault-tolerance (i.e., masking, failsafe, and nonmasking). Also, the above problem statement can be used as a revision problem or a decision problem (with the comments inside parentheses). We call the above problem the problem of 'partial revision' because the transitions of p' that begin in I' are a subset of the transitions of p that begin in I'. An alternative formulation is that of total revision, where the transitions of p' that begin in I' are equal to the transitions of p that begin in I'.
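The decision versions of constraints A1 and A2 can be sketched as simple set computations. The encoding below — programs as sets of transition pairs and state predicates as sets of states — is an illustrative assumption, not the internal representation used by the tools in the thesis.

```python
# Hedged sketch of constraints A1 and A2 from Problem Statement 7.1.

def satisfies_A1(I_prime, I):
    # A1: I' => I, i.e., every state of I' is a state of I.
    return I_prime <= I

def satisfies_A2(p, p_prime, I_prime):
    # A2: inside I', p' may only use transitions that p already had.
    return all((s0, s1) in p
               for (s0, s1) in p_prime
               if s0 in I_prime and s1 in I_prime)

p       = {(0, 1), (1, 2), (2, 0)}
p_prime = {(0, 1), (1, 2)}          # a candidate revision (removal only)
I, I_pr = {0, 1, 2}, {0, 1, 2}
assert satisfies_A1(I_pr, I) and satisfies_A2(p, p_prime, I_pr)

p_bad = {(0, 1), (1, 0)}            # (1,0) is a new transition inside I'
assert not satisfies_A2(p, p_bad, I_pr)
```

The distinction between partial and total revision is visible here: under partial revision the quantified implication in A2 is one-directional (a subset check), while total revision strengthens it to a biconditional (set equality inside I').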
In other words, the problem of total revision is identical to Problem Statement 7.1 except that A2 is changed to A2', described next:

A2': s0 ∈ I' ⇒ ∀s1 : s1 ∈ I' : ((s0,s1) ∈ p' ⇔ (s0,s1) ∈ p).

Model Revision without Explicit Legitimate States (Approach in Figure 7.2). Now, we formally define the new problem of model revision without explicit legitimate states. The goal in this problem is to find a fault-tolerant program, say p_r. It is also required that there is some set of legitimate states for p_r, say I, such that p_r does not introduce new behaviors in I. Thus, the problem statement for partial revision for the case where the legitimate states are not specified explicitly is as follows.

Problem Statement 7.2 Revision for Fault-Tolerance without Explicit Legitimate States.
Given p, Sf_p, Lv_p, and f
Identify p_r such that (Respectively, does there exist p_r such that)
(∃I ::
B1: s0 ∈ I ⇒ ∀s1 : s1 ∈ I : ((s0,s1) ∈ p_r ⇒ (s0,s1) ∈ p).
B2: p_r is f-tolerant to Sf_p and Lv_p from I. )

Just like Problem Statement 7.1, the problem of total revision is obtained from Problem Statement 7.2 by replacing B1 with B1', described next:

B1': s0 ∈ I ⇒ ∀s1 : s1 ∈ I : ((s0,s1) ∈ p_r ⇔ (s0,s1) ∈ p).

Existing algorithms for model revision [27, 30, 101, 111] are based on Problem Statement 7.1. Also, the tool SYCRAFT [27] utilizes Problem Statement 7.1 for the addition of fault-tolerance. However, as stated in Section 7.1, this requires the users of SYCRAFT to identify the legitimate states explicitly. Our goal is to evaluate the effect of simplifying the task of the designers by permitting them to omit explicit identification of legitimate states.

7.3 Relative Completeness (Q.1)

In this section, we show that if the problem of model revision can be solved with explicit legitimate states (Problem Statement 7.1), then it can also be solved without explicit legitimate states (Problem Statement 7.2).
Since each problem statement can be instantiated with partial or total revision, this requires us to consider four combinations. We prove this result in Theorem 7.3.1.

Theorem 7.3.1 If
- the answer to the decision problem 7.1 is affirmative with input p (fault-intolerant program), Sf_p (safety specification), Lv_p (liveness specification), f (faults), and I (legitimate states),
then
- the answer to the decision problem 7.2 is affirmative with input p (fault-intolerant program), Sf_p (safety specification), Lv_p (liveness specification), and f (faults).

Proof. Intuitively, a slightly revised version of the program that satisfies Problem 7.1 can be used to show that Problem 7.2 can be solved. Specifically, let the transitions of p_r be {(s0,s1) | (s0 ∈ I' ∧ s1 ∈ I' ∧ (s0,s1) ∈ p) ∨ (s0 ∉ I' ∧ (s0,s1) ∈ p')}.

Formally, since the answer to the decision problem 7.1 is affirmative, there exist a program p' and a predicate I' that satisfy the constraints of Problem Statement 7.1. To show that the answer to the decision problem 7.2 is affirmative, we need to find p_r such that the constraints of Problem Statement 7.2 are satisfied. We let the transitions of p_r be {(s0,s1) | (s0 ∈ I' ∧ s1 ∈ I' ∧ (s0,s1) ∈ p) ∨ (s0 ∉ I' ∧ (s0,s1) ∈ p')}. Next, we show that p_r satisfies the constraints of Problem Statement 7.2. Towards this end, we instantiate I to be I' and show that constraints B1 and B2 are satisfied.

- Constraint B1: By construction of the transitions of p_r, this constraint is satisfied both for the case where we consider partial revision and for the case where we consider total revision.
- Constraint B2: By construction, I' is closed in p_r. Also, since I' ⇒ I and p satisfies Sf_p and Lv_p from I, it is straightforward to observe that p_r satisfies Sf_p and Lv_p from I'. Also, transitions of p_r that begin outside I' are identical to those of p'. The second constraint "(∃T :: ...)" from the definition of fault-tolerance is also satisfied. Thus, p_r is f-tolerant to Sf_p and Lv_p from I'. □
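The construction of p_r used in this proof can be written directly. The sketch below uses the same set-based encoding as before (programs as transition sets), which is an illustrative assumption.

```python
# Hedged sketch of the construction in the proof of Theorem 7.3.1:
# given a witness (p', I') for Problem 7.1, build p_r for Problem 7.2 as
#   p_r = { (s0,s1) | (s0 in I' and s1 in I' and (s0,s1) in p)
#                     or (s0 not in I' and (s0,s1) in p') }.

def build_pr(p, p_prime, I_prime):
    return {(s0, s1)
            for (s0, s1) in p | p_prime
            if (s0 in I_prime and s1 in I_prime and (s0, s1) in p)
            or (s0 not in I_prime and (s0, s1) in p_prime)}

p       = {(0, 1), (1, 0), (2, 0)}       # fault-intolerant program
p_prime = {(0, 1), (1, 0), (2, 1)}       # revised program from Problem 7.1
I_prime = {0, 1}                          # its legitimate states
pr = build_pr(p, p_prime, I_prime)
# Inside I', p_r keeps exactly p's transitions; outside I', it follows p'.
assert pr == {(0, 1), (1, 0), (2, 1)}
```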
Implication of Theorem 7.3.1 for Q.1: From Theorem 7.3.1, it follows that the answer to Q.1 from the Introduction is affirmative for both partial and total revision. Hence, the new formulation (cf. Figure 7.2) is relatively complete.

7.4 Complexity Analysis (Q.2)

In this section, we focus on the second question and compare the complexity class of Problem 7.1 with that of Problem 7.2. In particular, in Section 7.4.1, we show that the complexity of model revision can increase substantially for partial revision if legitimate states are not specified explicitly. Then, in Section 7.4.2, we show that for total revision, Problem 7.2 can be reduced to Problem 7.1 in polynomial time. In Section 7.4.3, we give a heuristic-based approach for partial revision. Furthermore, we show that the heuristic is guaranteed to work when the answer to the corresponding problem in Figure 7.1 is affirmative. In Section 7.4.4, we show how one can obtain an algorithm for model revision without explicit legitimate states by utilizing an algorithm that requires explicit legitimate states. Finally, we mention other complexity results in Section 7.4.5.

7.4.1 Complexity Comparison for Partial Revision

In this section, we show that solving Problem 7.2 for partial revision is NP-complete. Since the complexity of the revision Problem 7.1 is in P [101], it follows that the complexity of partial revision increases substantially when the legitimate states are not specified explicitly. We show this by a reduction from the well-known 3-SAT problem. The 3-SAT instance is specified as follows:

3-SAT Instance. Let x_1, x_2, ..., x_n be propositional variables. Given is a Boolean formula y = y_1 ∧ y_2 ∧ ... ∧ y_M, where each y_j (1 ≤ j ≤ M) is a disjunction of exactly three literals. Does there exist an assignment of truth values to x_1, x_2, ..., x_n such that y is satisfiable?

Since the membership of Problem 7.2 in NP is straightforward, we focus on showing that it is NP-hard.
Hence, we first present the mapping from the 3-SAT instance to the problem of partial revision without explicit legitimate states. Then, we show that the given 3-SAT instance is satisfiable iff the answer to the corresponding instance of partial revision is affirmative.

Mapping 3-SAT to Partial Revision without Explicit Legitimate States

We now present the mapping of an instance of the 3-SAT problem to an instance of the partial revision problem without explicit legitimate states. Recall that this instance consists of the program (specified in terms of its state space and transitions), the safety and liveness specifications, and the faults. We begin with identifying the input program. Then, we identify the faults, and finally we identify the safety and liveness specifications.

The state space of the input program. Corresponding to each variable x_i of the given 3-SAT instance, we introduce eight states P_i, Q_i, R_i, T_i, a_i, b_i, c_i, and d_i, where 1 ≤ i ≤ n (cf. Figure 7.3). For each disjunction y_j, we introduce states Z_j and e_j, where 1 ≤ j ≤ M, in the state space. Thus, the state space of the input program is S_p = {P_i, Q_i, R_i, T_i, a_i, b_i, c_i, d_i | 1 ≤ i ≤ n} ∪ {Z_j, e_j | 1 ≤ j ≤ M}.

Transitions of the input program. Corresponding to each variable x_i, we include the following transitions in the program: (P_i, a_i), (a_i, c_i), (c_i, b_i), (b_i, Q_i), (R_i, b_i), (b_i, d_i), (d_i, a_i), (a_i, T_i), (Q_i, e_j), and (T_i, e_j), where 1 ≤ j ≤ M. Moreover, corresponding to each disjunction y_j, we include the following transitions:

Figure 7.3: Mapping of (x_1 ∨ x_2) ∧ (¬x_1 ∨ ¬x_2) into corresponding program transitions. The transitions in bold show the revised program where x_1 = true and x_2 = false.

- (Z_j, e_j),
- if x_i is a literal in y_j, then we include the transition (e_j, P_i), and
- if ¬x_i is a literal in y_j, then we include the transition (e_j, R_i).

Fault transitions. The fault transitions are f = {(T_i, Z_j), (Q_i, Z_j) | 1 ≤ i ≤ n, 1 ≤ j ≤ M}.

Safety specification Sf_p.
All transitions except those in p ∪ f violate safety.

Liveness specification Lv_p. The liveness specification is P_i ↝ c_i, c_i ↝ Q_i, R_i ↝ d_i, and d_i ↝ T_i, where 1 ≤ i ≤ n.

Reduction from the 3-SAT Problem.

Theorem 7.4.1 The given instance of the 3-SAT problem is satisfiable iff the corresponding instance of the partial revision problem has an affirmative answer for masking fault-tolerance.

Proof. First we prove the ⇒ part, then we prove the ⇐ part.

- ⇒: If the given instance of the 3-SAT problem is satisfiable, then we construct the transitions of the revised program by including the following transitions:
  - (Z_j, e_j), 1 ≤ j ≤ M,
  - if y_j contains x_i and x_i is assigned the truth value true, then (e_j, P_i),
  - if y_j contains ¬x_i and x_i is assigned the truth value false, then (e_j, R_i),
  - if x_i is assigned the truth value true, then (P_i, a_i), (a_i, c_i), (c_i, b_i), (b_i, Q_i), and (Q_i, e_j), 1 ≤ i ≤ n,
  - if x_i is assigned the truth value false, then (R_i, b_i), (b_i, d_i), (d_i, a_i), (a_i, T_i), and (T_i, e_j), 1 ≤ i ≤ n.

  The predicate I', used to show that this program satisfies SPEC, includes all reachable states except {Z_j | 1 ≤ j ≤ M}. It is straightforward to show that the constraints B1 and B2 are satisfied.

- ⇐: The legitimate state predicate of the revised program contains at least one state. Our first step is to show that for some i, Q_i or T_i is included in the legitimate state predicate of the revised program. To show this, we observe that if Z_j, 1 ≤ j ≤ M, is included in the legitimate state predicate for some j, then the corresponding state e_j must also be included in the legitimate state predicate. Hence, the revised program must include at least one transition that begins in e_j. It follows that either P_i or R_i, 1 ≤ i ≤ n, must also be included in the legitimate state predicate. If P_i (respectively, R_i) is included in the legitimate state predicate, then c_i and Q_i (respectively, d_i and T_i) must also be included so that Lv_p is satisfied.
Also, if a_i (respectively, b_i) is included in the legitimate state predicate, then T_i or Q_i must also be included in the legitimate state predicate. From the above discussion, it follows that for some i, Q_i or T_i is included in the legitimate state predicate of the revised program. Now, based on the definition of the faults, all states in {Z_j | 1 ≤ j ≤ M} are reachable in the presence of faults. Hence, the transition (Z_j, e_j) must be included for 1 ≤ j ≤ M in the revised program. Furthermore, some transition originating from e_j must also be included. Transitions from e_j correspond to literals in the disjunction y_j. If a transition of the form (e_j, P_i) is included, then we set x_i to true. If a transition of the form (e_j, R_i) is included, then we set x_i to false. Observe that if P_i is reachable in the revised program, then it must also include (P_i, a_i), (a_i, c_i), (c_i, b_i), and (b_i, Q_i) so that Lv_p is satisfied. And, if R_i is reachable in the revised program, then it must also include (R_i, b_i), (b_i, d_i), (d_i, a_i), and (a_i, T_i). However, if all these transitions are included, then Lv_p will not be satisfied. Therefore, for any i, the revised program cannot reach both P_i and R_i. This implies that the truth value assigned to x_i by any disjunction is the same. Moreover, based on the construction of the instance of the partial revision problem, the truth assignments to the literals cause each clause to be satisfied, i.e., the assignment of truth values to the literals causes the given 3-SAT formula to be satisfiable. □

From the above theorem, it follows that the problem of partial revision without explicit legitimate states is NP-hard. Moreover, in [101], it is shown that the problem of partial revision can be solved in polynomial time if legitimate states are specified explicitly. Thus, it follows that the complexity of partial revision increases substantially when explicit legitimate states are not available.
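The gadget construction used in this reduction can be sketched programmatically. The sketch below is a reconstruction for illustration: clauses are tuples of signed integers (+i for x_i, -i for ¬x_i), states are strings like "P1" and "e2", and all names are assumptions rather than the thesis's encoding.

```python
# Hedged sketch of the 3-SAT gadget: emit states, program transitions,
# and fault transitions for variables 1..n and clauses y_1..y_M.

def build_instance(n, clauses):
    states = {f"{t}{i}" for i in range(1, n + 1) for t in "PQRTabcd"}
    states |= {f"Z{j}" for j in range(1, len(clauses) + 1)}
    states |= {f"e{j}" for j in range(1, len(clauses) + 1)}

    prog, faults = set(), set()
    for i in range(1, n + 1):                   # per-variable gadget
        prog |= {(f"P{i}", f"a{i}"), (f"a{i}", f"c{i}"),
                 (f"c{i}", f"b{i}"), (f"b{i}", f"Q{i}"),
                 (f"R{i}", f"b{i}"), (f"b{i}", f"d{i}"),
                 (f"d{i}", f"a{i}"), (f"a{i}", f"T{i}")}
    for j, clause in enumerate(clauses, start=1):
        for i in range(1, n + 1):
            prog |= {(f"Q{i}", f"e{j}"), (f"T{i}", f"e{j}")}
            faults |= {(f"T{i}", f"Z{j}"), (f"Q{i}", f"Z{j}")}
        prog.add((f"Z{j}", f"e{j}"))
        for lit in clause:                      # e_j branches per literal
            prog.add((f"e{j}", f"P{lit}" if lit > 0 else f"R{-lit}"))
    return states, prog, faults

states, prog, faults = build_instance(2, [(1, 2), (-1, -2)])
assert ("e1", "P1") in prog and ("e2", "R2") in prog
assert ("T1", "Z2") in faults and len(states) == 20
```

Running the revision on such an instance must pick, for each e_j, a literal whose variable gadget can be kept live — exactly the choice that makes the problem NP-hard.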
Intuition behind the increased complexity of partial revision. We analyze the NP-completeness proof to determine why the complexity of partial revision increased substantially. Towards this end, we carefully look at the instance of partial revision generated from the SAT formula. Observe that the fault-intolerant program does not satisfy Lv_p from P_i or R_i, as the program can be stuck in the loop (a_i, c_i), (c_i, b_i), (b_i, d_i), (d_i, a_i). However, removal of some transitions allows P_i (or R_i) to be included as a legitimate state. The increased complexity of partial revision is caused by the need to remove the "right" transitions so that the additional states can be included in the set of legitimate states. Choosing these "right" transitions increases the complexity substantially.

7.4.2 Complexity Comparison for Total Revision

Although the complexity of partial revision increases substantially when legitimate states are not available explicitly, we find that the complexity of total revision effectively remains unchanged. We note that this is the first instance where a complexity difference between partial and total revision has been identified. To show this result, we show that in the context of total revision, Problem 7.2 is polynomial-time reducible to Problem 7.1.

Since the results in this section require the notion of the weakest legitimate state predicate, we define it next. Recall that we use the term legitimate state predicate and the corresponding set of legitimate states interchangeably. Hence, the weakest legitimate state predicate corresponds to the largest set of legitimate states.

Definition. I_w = wlsp(p, Sf_p, Lv_p) is the weakest legitimate state predicate of p for SPEC (= (Sf_p, Lv_p)) iff:
1: p satisfies SPEC from I_w, and
2: ∀I :: (p satisfies SPEC from I) ⇒ (I ⇒ I_w).

Recall from Chapter 6 that we identified the algorithm stpGenerator(p, Sf_p, Lv_p), which computes the weakest legitimate state predicate in polynomial time in the state space of p.
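A greatly simplified, safety-only version of what such a generator computes can be sketched as a greatest-fixpoint computation: repeatedly discard any state that has a bad outgoing transition or a transition leaving the current candidate predicate. This sketch is an assumption-laden illustration — it omits the liveness (leads-to) handling of the full Chapter 6 algorithm and uses explicit sets rather than BDDs.

```python
# Hedged sketch: weakest predicate I_w that is closed under p and from
# which no p-transition is a bad transition (safety only).

def weakest_safe_predicate(states, p, bad_transitions):
    Iw = set(states)
    changed = True
    while changed:
        changed = False
        for s0 in list(Iw):
            # drop s0 if some p-transition from s0 is bad or leaves Iw
            if any((s0, s1) in bad_transitions or s1 not in Iw
                   for (t0, s1) in p if t0 == s0):
                Iw.discard(s0)
                changed = True
    return Iw

# State 1 has a bad transition; state 0 can only reach 1; state 2 is safe.
assert weakest_safe_predicate({0, 1, 2},
                              {(0, 1), (1, 2), (2, 2)},
                              {(1, 2)}) == {2}
```

Because removals only ever shrink the candidate set, the loop terminates after at most |S_p| passes, consistent with the polynomial-time claim above.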
Theorem 7.4.2 If the answer to the decision problem 7.2 (with total revision) is affirmative (i.e., ∃ p_r that satisfies the constraints of Problem 7.2) with input p, Sf_p, Lv_p, and f, then the answer to the decision problem 7.1 (with total or partial revision) is affirmative (i.e., ∃ p' and I' that satisfy the constraints of Problem 7.1) with input p, Sf_p, Lv_p, f, and wlsp(p, Sf_p, Lv_p).

Proof. Intuitively, the program p_r obtained by solving Problem Statement 7.2 can be used to show that Problem 7.1 is satisfied. Specifically, let I_2 be the predicate used to show that p_r satisfies the constraints of Problem 7.2. Then, let p' = p_r and I' = I_2.

Formally, since the answer to the decision problem 7.2 is affirmative, there exists a program p_r that satisfies the constraints of Problem Statement 7.2 (with total revision). Let I_2 denote the predicate used to show that constraints B1 and B2 are satisfied. Let I_w = wlsp(p, Sf_p, Lv_p). To show that the answer to the decision problem 7.1 is affirmative, we need to find p' and I' such that the constraints of Problem Statement 7.1 are satisfied. We let p' = p_r and I' = I_2. Based on constraint B2, p_r satisfies Sf_p and Lv_p from I_2. Also, from constraint B1 (for total revision), p satisfies Sf_p and Lv_p from I_2. Now, we show that constraints A1, A2, and A3 are satisfied.

- Constraint A1: By the definition of the weakest legitimate state predicate, I_2 ⇒ I_w. Thus, constraint A1 is satisfied.
- Constraint A2: Based on constraint B1, constraint A2 is satisfied for both total and partial revision.
- Constraint A3: Based on constraint B2, p_r is fault-tolerant to Sf_p and Lv_p from I_2. Thus, constraint A3 is satisfied. □

Remark: Note that if the phrase 'with total revision' shown in bold in Theorem 7.4.2 is replaced by 'with partial revision', then the corresponding theorem is not valid.

Theorem 7.4.3 For total revision, the revision problem 7.2 is polynomial-time reducible to the revision problem 7.1.

Proof.
Given an instance, say X, of the decision problem 7.2 that consists of p, Sf_p, Lv_p, and f, the corresponding instance, say Y, for the decision problem 7.1 is p, Sf_p, Lv_p, f, and wlsp(p, Sf_p, Lv_p). From Theorems 7.3.1 and 7.4.2, it follows that the answer to X is affirmative iff the answer to Y is affirmative. □

7.4.3 Heuristic for Polynomial Time Solution for Partial Revision

Theorem 7.4.2 utilizes the weakest legitimate state predicate to solve the problem of total revision without explicit legitimate states. In this section, we show that a similar approach can be utilized to develop a heuristic for solving the problem of partial revision in polynomial time. Moreover, if there is an affirmative answer to the revision problem with explicit legitimate states, then this heuristic is guaranteed to find a revised program that satisfies the constraints of Problem 7.2. Towards this end, we present Theorem 7.4.4.

Theorem 7.4.4 For partial revision, the revision problem 7.2 consisting of (p, Sf_p, Lv_p, f) is polynomial-time reducible to the revision problem 7.1, provided there exists a legitimate state predicate I such that the answer to the decision problem 7.1 for the instance (p, I, Sf_p, Lv_p, f) is affirmative.

Proof. Clearly, if an instance of Problem 7.1 has an affirmative answer, then from Theorem 7.3.1, the corresponding instance of Problem 7.2 has an affirmative answer. Similar to the proof of Theorem 7.4.3, we map the instance of Problem 7.2 to an instance of Problem 7.1 where we use the weakest legitimate state predicate. Now, from Theorem 7.3.1, it follows that the answer to this revised instance of Problem 7.1 is also affirmative. □

What the above theorem shows is that even for partial revision, if it were possible to obtain a fault-tolerant program with explicit legitimate states, then it is possible to do so in the same complexity class without explicit legitimate states.
However, there may be instances where the answer to the decision problem 7.1 is negative while the answer to the corresponding decision problem 7.2 is affirmative. For these instances, for partial revision, the complexity can be high.

7.4.4 Algorithm for Model Revision Without Explicit Legitimate States

In this section, we utilize the results in Section 7.4.2 to obtain an algorithm for model revision without explicit legitimate states. In particular, we present the algorithm Add_fs_fr_spec, which adds failsafe fault-tolerance (where safety is satisfied in the presence of faults although liveness may not be) to high atomicity programs (where a program transition can read any number of variables as well as write any number of variables in one atomic step). This algorithm is obtained by combining the algorithm stpGenerator from Chapter 6, which computes the weakest legitimate state predicate, with the algorithm Add_failsafe from [101].

Given the program transitions p, the fault transitions f, and the program specification (Sf_p, Lv_p), the goal of this algorithm is to compute the failsafe fault-tolerant program p_r that satisfies the constraints of Problem Statement 7.2 (with total revision). It first identifies the weakest legitimate state predicate I_w. If p has any state in I_w with no outgoing transitions, we add self-loops at those states. These self-loops help us distinguish between a state where p has no outgoing transitions and a state that becomes a deadlock state because we removed some transitions of p. Then it identifies ms as the states that violate safety or the states from where the execution of one or more fault transitions violates safety (Lines 4-7). Then, the algorithm finds the transitions, mt, of p that reach states in ms as well as transitions of p that violate the safety specification SPEC_bt (Line 8).
If there exist states in I_w such that the execution of one or more fault actions from those states violates the safety specification, then it recalculates I_w by removing those states, producing I'_w (Lines 9-13). In this recalculation, it ensures that all computations of p−mt within I'_w are infinite. In other words, the final value of I'_w is the largest subset of I_w−ms such that all computations of p−mt, when restricted to that subset, are infinite. At this point, if I'_w is empty, the algorithm declares that no failsafe fault-tolerant program can be found. Otherwise, the algorithm removes mt from p to compute p_r, where no program transitions violate the program specification (Line 18). Now, it ensures that all the transitions of p_r that start in a state in I'_w also end in a state in I'_w; if not, such transitions are removed from p_r (Line 19).

Algorithm 15 Add_fs_fr_spec: Addition of Failsafe Fault-Tolerance
Input: program transitions p, fault transitions f, safety specification Sf_p (consisting of SPEC_bs and SPEC_bt), liveness specification Lv_p (consisting of multiple T ↝ T' properties)
Output: failsafe fault-tolerant program p_r.

1:  I_w := stpGenerator(p, Sf_p, Lv_p); // find the legitimate states I_w
2:  Self_loops := {(s0, s0) | s0 ∈ I_w ∧ ∀s1 :: (s0, s1) ∉ p};
3:  p := p ∪ Self_loops;
4:  repeat
5:    ms' := ms;
6:    ms := ms ∪ {s0 :: ∃s1 : (s0, s1) ∈ f ∧ (((s0, s1) ∈ SPEC_bt) ∨ (s1 ∈ ms))};
7:  until (ms = ms')
8:  mt := {(s0, s1) :: (((s0, s1) ∈ SPEC_bt) ∨ (s1 ∈ ms))};
    // compute the largest subset of I_w from where all computations of p are infinite
9:  I'_w := I_w − ms;
10: repeat
11:   I_tmp := I'_w;
12:   I'_w := I'_w − {s0 :: s0 ∈ I'_w : (∀s1 :: s1 ∈ I'_w : (s0, s1) ∉ (p − mt))};
13: until (I'_w = I_tmp)
14: if (I'_w = {}) then
15:   print "No failsafe f-tolerant program p_r exists";
16:   return {};
17: else
18:   p_r := p − mt;
19:   p_r := p_r − {(s0, s1) | s0 ∈ I'_w ∧ s1 ∉ I'_w};
20: end if
21: return p_r − Self_loops;

Remark: Note that since this section focuses on failsafe fault-tolerant programs, there is no recovery requirement for the program in the presence of faults.
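Under explicit-state assumptions, Algorithm 15 can be sketched in executable form. The sketch below is an illustration, not the BDD-based tool: p, f, and SPEC_bt are sets of transition pairs, and the weakest legitimate state predicate I_w is passed in (standing in for the Line 1 call to the Chapter 6 generator) rather than recomputed.

```python
# Hedged explicit-state sketch of Algorithm 15 (Add_fs_fr_spec).

def add_fs_fr_spec(p, f, spec_bt, Iw):
    # Lines 2-3: add self-loops at states of Iw with no outgoing transition.
    self_loops = {(s, s) for s in Iw
                  if not any(s0 == s for (s0, s1) in p)}
    p = p | self_loops

    # Lines 4-7: ms = states from where faults alone can violate safety.
    ms, ms_old = set(), None
    while ms != ms_old:
        ms_old = set(ms)
        ms |= {s0 for (s0, s1) in f
               if (s0, s1) in spec_bt or s1 in ms}

    # Line 8: mt = program transitions that violate safety or reach ms.
    mt = {(s0, s1) for (s0, s1) in p
          if (s0, s1) in spec_bt or s1 in ms}

    # Lines 9-13: largest subset of Iw - ms where p - mt has no deadlock.
    Iw2 = Iw - ms
    while True:
        alive = {s0 for s0 in Iw2
                 if any(s1 in Iw2 for (t0, s1) in p - mt if t0 == s0)}
        if alive == Iw2:
            break
        Iw2 = alive

    # Lines 14-21: fail if empty; else drop mt and transitions leaving Iw2.
    if not Iw2:
        return None
    pr = (p - mt) - {(s0, s1) for (s0, s1) in p
                     if s0 in Iw2 and s1 not in Iw2}
    return pr - self_loops

# A fault from state 1 violates safety, so (0,1) must be cut:
assert add_fs_fr_spec({(0, 0), (0, 1), (1, 0)},
                      {(1, 2)}, {(1, 2)}, {0, 1}) == {(0, 0), (1, 0)}
```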
However, for other levels of fault-tolerance, e.g., nonmasking and masking, where the program needs to satisfy its liveness properties as well, we would need an additional requirement stating that faults eventually stop for a long enough time to ensure that the liveness properties can be met.

Theorem 7.4.5 Algorithm Add_fs_fr_spec is sound, i.e., the output p_r of Add_fs_fr_spec satisfies the constraints of Problem Statement 7.2.

Proof. Let I from Problem Statement 7.2 be instantiated with the value of I'_w at the end of Add_fs_fr_spec. Now, the first constraint of Problem Statement 7.2 is satisfied by construction. Moreover, the satisfaction of the first constraint implies the correctness of p_r in the absence of faults. Regarding the behavior in the presence of faults, we can observe that, by construction, the program does not reach a state in SPEC_bs or execute a transition in SPEC_bt. Moreover, the construction of ms implies that the program does not reach states from where faults can violate the safety specification. Thus, the revised program is failsafe fault-tolerant. □

Theorem 7.4.6 Algorithm Add_fs_fr_spec is complete, i.e., if it declares failure, then there does not exist a fault-tolerant program that satisfies the constraints in Problem Statement 7.2.

Proof. Suppose that a program, say p'', satisfies the constraints of Problem Statement 7.2. Let I'' be the predicate used in demonstrating that p'' satisfies the constraints of Problem Statement 7.2. Now, we show that at any time during the execution of Add_fs_fr_spec, it must be the case that I'' ⊆ I'_w. In particular, on Line 1, this follows from the correctness of the algorithm that computes the weakest legitimate state predicate. On Line 9, this follows from the fact that no state in ms can be a legitimate state, as faults alone can violate safety from those states. Likewise, since I'' cannot have deadlock states, I'' ⊆ I'_w remains true on Line 12. Since the algorithm declares failure when I'_w
= {}, it follows that I'' = {}, which is a contradiction. □

Theorem 7.4.7 The algorithm Add_fs_fr_spec is in P.

Proof. Let us consider the complexity of each statement in Add_fs_fr_spec. (1) From Chapter 6, the complexity of computing the weakest legitimate state predicate is in P. (2-3) The complexity of statements 2 and 3 is clearly in P. (4-7) Calculating ms is in P, as we can use the following algorithm: for each fault transition (s0, s1) such that (s0, s1) violates the safety of SPEC, include s0 in ms. Then, in each iteration, check if there exists a fault transition (s0, s1) such that s0 ∉ ms and s1 ∈ ms; if such a transition exists, add s0 to ms. Since the size of ms increases by at least one in each iteration, the number of iterations is polynomial in the state space S_p. (8) Calculating mt is in P, as we need to check each transition only once. (9) This statement is in P. (10-13) The loop can execute at most |S_p| times. (14-21) The complexity of these statements is clearly polynomial. □

The above result shows that, in the context of failsafe fault-tolerance, when we reduce the designer's burden by not requiring them to identify the legitimate states explicitly, there is no significant penalty in terms of the complexity class of the problem involved or in terms of the soundness and completeness properties of the corresponding algorithms.

7.4.5 Summary of Complexity Results

In Section 7.4.4, we showed that the problem of total model revision for failsafe fault-tolerance is in P. In this section, we list the complexity for other levels of fault-tolerance for both total and partial revision.

Recall from Section 7.4.1 that, for partial revision, the problem of adding failsafe and masking fault-tolerance is NP-complete. For distributed programs, it is shown in [101] that revising the program for adding failsafe and masking fault-tolerance is NP-complete when the set of legitimate states is specified explicitly.
A variation of that proof also works for model revision without explicit legitimate states. Revising the program for adding nonmasking fault-tolerance is in NP; however, it is not known whether it is NP-complete or whether it is in P.

For high atomicity programs, i.e., where a program can read and write all its variables atomically, it is possible to perform total revision in P. To show this, we note that the algorithm Add_fs_fr_spec first identifies the weakest legitimate state predicate. Then it utilizes the set of legitimate states in Add_failsafe (from [101]), which requires that the legitimate states be explicitly specified. Likewise, we can utilize the algorithms Add_nonmasking and Add_masking (from [101]) to obtain the corresponding algorithms for total revision for adding nonmasking and masking fault-tolerance.

In summary, the results for the complexity comparison are as shown in Table 7.1.

                            Revision Without Explicit   Revision With Explicit
                            Legitimate States           Legitimate States
                            Partial      Total          Partial      Total
High        failsafe        ?            P‡             P*           P*
Atomicity   nonmasking      ?            P‡             P*           P*
            masking         NP-C†        P‡             P*           P*
Distributed failsafe        NP-C∆        NP-C∆          NP-C*        NP-C*
Programs    nonmasking      ?            ?              ?            ?
            masking         NP-C∆        NP-C∆          NP-C*        NP-C*

Table 7.1: The complexity of different types of automated revision (NP-C = NP-complete).

Results marked † follow from the NP-completeness results of Section 7.4.1. Results marked ‡ follow from Sections 7.4.2, 7.4.3, and 7.4.4. Results marked ∆ are stated without proof. Results marked ? indicate that the complexity of the corresponding problem is open. Finally, results marked * are from [101].

7.5 Relative Computation Cost (Q.3)

As mentioned in Section 7.1, the increased cost of model revision in the absence of explicit legitimate states needs to be studied in two parts: the complexity class and the relative increase in the execution time. We considered the former in Section 7.4. In this section, we consider the latter.
As we can see from Section 7.4.4, if the legitimate states are not specified explicitly, the increased cost of model revision is essentially that of computing stp(p, Sfp, va). Hence, we analyze the complexity of computing stp(p, Sfp, va) in the context of a case study. We choose the classic example from the literature, namely, Byzantine Agreement [107]. We explain this case study in detail and show the time required to generate the weakest legitimate state predicate for different numbers of processes. This case study illustrates that the increased cost when explicit legitimate states are unavailable is very small compared to the overall time required for the addition of fault-tolerance. In particular, we show that reducing the burden on the designer, by not requiring the explicit legitimate states, increases the computation cost by approximately 1%.

Throughout this section, the experiments were run on a MacBook Pro with a 2.6 GHz Intel Core 2 Duo processor and 4 GB RAM. The OBDD representation of Boolean formulas uses the C++ interface to the CUDD package developed at the University of Colorado [125].

The amount of time required for computing this set of legitimate states for different numbers of processes is shown in Table 7.2. We would like to note that the set of legitimate states computed in these case studies is the same as that used in the addition of fault-tolerance.

We use this case study to illustrate that computing the set of legitimate states to be those that are reachable from initial states is not relatively complete. In particular, for the Byzantine agreement example, the initial state is one where all processes are non-Byzantine and the decision of all non-general processes is equal to ⊥. Clearly, all processes are non-Byzantine in all states reached by the program from these initial states. It follows that recovery to these reachable states is not always feasible in the presence of faults. Hence, these reachable states are insufficient to obtain the fault-tolerant program. By contrast, the weakest legitimate state predicate can be utilized to find the fault-tolerant program.

  No. of      Reachable    Leg. States               Total Revision
  Processes   States       Generation Time (Sec)     Time (Sec)
  10          10^9           0.57                    6
  20          10^15          1.34                    199
  30          10^22          4.38                    1836
  40          10^30          9.25                    9366
  50          10^36         26.34                    > 10000
  100         10^71        267.30                    > 10000

Table 7.2: The time comparison for the Byzantine Agreement program.

7.6 Summary

We devoted this chapter to studying the problem of automated model revision without explicit legitimate states. In particular, we compared performing the revision when the legitimate states are explicitly specified with that when they are not. We considered three different aspects in our comparison: relative completeness, qualitative complexity class comparison, and quantitative change in the time for model revision. We illustrated that our approach for model revision without explicit legitimate states is relatively complete. This is important, since it implies that the reduction in the human effort required for model revision does not reduce the class of problems that can be solved. Additionally, we found some surprising and counterintuitive results. Specifically, for total revision, we found that the complexity class remains unchanged. However, for partial revision, the complexity class changes substantially. Finally, we found that the quantitative change in the time for model revision without explicit legitimate states is negligible.

Chapter 8

Related Work

During the past three decades, the automation of software verification tools has evolved significantly. Currently, verification tools are widely used in several applications. In particular, they are used in the verification of high assurance and mission critical systems, where the consequences of any failure can be catastrophic.
Formal verification of distributed and concurrent programs focuses on the use of mathematical logic and formal methods to verify the correctness of the properties of a specific program. Initially, the focus was on developing techniques to verify full functional correctness. However, most of the tools developed for this purpose were incapable of handling complex systems. This limitation encouraged many researchers to focus on verifying the properties that matter most. In most verification techniques, the system and the desired properties are described via a logical model. The verification algorithm answers yes or no to the question of whether the model satisfies the desired property.

Unlike automated verification techniques, the goal of automated model revision is to automatically revise an existing model to generate a new model that is correct-by-construction. Such a revised model will preserve the existing model's properties as well as satisfy new properties. The basic form of the problem of automated model revision focuses on modifying an existing model, say M, into a new model, say M'. It is required that M' satisfies the new property of interest. Additionally, M' continues to satisfy the existing properties of M using the same transitions that M used.

In this chapter, we briefly review some of the automated verification techniques and discuss their relation to our approach. Currently, there is a wide range of tools available for verifying the correctness of distributed programs. Those tools are based on different techniques, which makes them useful for different types of applications. We believe that no single approach is suitable for the verification of all types of distributed programs. However, some approaches may be more appropriate for some applications than others.
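The basic form described above, M' must satisfy the new property while using only transitions of M, can be illustrated with a toy sketch over explicit transition sets. The interface (`transitions`, `violates_new_property`) is hypothetical; real revision algorithms work symbolically and must additionally handle deadlocks and recovery.

```python
def revise(transitions, violates_new_property):
    """Toy sketch of the basic revision step: M' keeps only those
    transitions of M that do not violate the new (safety) property.
    Since M' is a subset of M's transitions, M' cannot exhibit any
    behavior that M did not already have."""
    revised = {t for t in transitions if not violates_new_property(t)}
    assert revised <= transitions  # M' uses only transitions of M
    return revised
```

For instance, removing the offending transition (1, 2) from a three-state cycle leaves only the original transitions (0, 1) and (2, 0). Note that such pruning may create deadlock states (here, state 1 loses its only outgoing transition), which is exactly why full revision algorithms must perform deadlock resolution afterwards.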
Out of the wide range of available techniques for automated verification of finite state distributed and concurrent programs, we focus on those that are closely related to our approach.

8.1 Model Checking

Model checking is a technique for verifying the correctness of finite state programs. The idea is based on exploring the state space of the program, described using temporal logic, in an efficient manner. In model checking, the program is represented as a Kripke structure, say M, and a formula, say f, represents one of the program's properties. The model checker determines whether M is a model for f, i.e., whether the formula holds.

One of the advantages of the model checking technique is that it provides a push-button approach. Model checking is effective in verifying whether the system meets the desired properties. Furthermore, if the model does not satisfy the property of interest, then the model checker typically provides a counterexample and the corresponding execution trace. Moreover, it supports partial verification, e.g., it does not require the complete specification of the program being verified. Due to this push-button approach, model checking techniques have become very popular for detecting errors in the early stages of design. They have also helped in transferring the formal verification of correctness from research to practice. Next, we briefly review the evolution of model checking tools and techniques.

As early as the 1970s, Tadao Murata and Kurt Jensen started working on the verification of Petri nets; however, no actual verification tools were created prior to 1981 [42]. The initial work on state exploration started in 1980, when Bochmann presented a method for verifying communication protocols [24]. Later, Holzmann presented a technique for automatic protocol verification.
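The state-exploration idea underlying model checking, exhaustively visiting the reachable states of M and checking the property in each, can be sketched for the simplest case of an invariant property. This is an explicit-state sketch with hypothetical names (`initial_states`, `successors`, `invariant`); production model checkers use symbolic representations and richer temporal logics.

```python
from collections import deque

def check_invariant(initial_states, successors, invariant):
    """Explicit-state reachability check. Returns (True, None) if the
    invariant holds in every reachable state; otherwise (False, trace),
    where trace is a counterexample path from an initial state to the
    violating state, as a model checker would report."""
    parent = {s: None for s in initial_states}
    queue = deque(initial_states)
    while queue:
        s = queue.popleft()
        if not invariant(s):
            trace = []            # reconstruct the execution trace
            while s is not None:
                trace.append(s)
                s = parent[s]
            return False, list(reversed(trace))
        for t in successors(s):
            if t not in parent:   # visit each state once
                parent[t] = s
                queue.append(t)
    return True, None
```

On a counter that steps modulo 5, the invariant "state < 5" holds, while "state < 3" fails with the counterexample trace [0, 1, 2, 3], mirroring the push-button yes/counterexample behavior described above.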
Burstall [37], Kröger [99], and Pnueli used temporal logic to describe program behavior, and the proof of correctness was done manually. In their early work on concurrency, E. M. Clarke et al. [45,50,110] focused on fixed point theory and abstract interpretation. They emphasized the connection between Branching Time Logic and Mu-Calculus [62]. Clarke also presented how program text is used to extract the invariant of a given program. In 1980, Emerson and Clarke [62] developed a technique based on branching time logic. Later they adopted the more elegant presentation of temporal logic given in [46]. In a milestone step in the evolution of verification techniques, Clarke, Emerson, and Sistla [43,48] presented the EMC model checker. This was the first model checker that could handle fairness constraints. Although the EMC model checker could only check models with state spaces of size no more than 10^5, it was able to detect errors in several systems. In [63], Emerson and Halpern presented the framework CTL* for investigating the expressive power of temporal logic. Their framework was a combination of branching-time and linear-time operators.

The most significant improvement in model checking came in the early 1990s. To this end, symbolic model checking and partial order reduction were used in building model checkers. McMillan used a symbolic representation based on ordered binary decision diagrams (OBDDs) to develop SMV [112]. The compact representation of the state space and the transition graph made it possible to verify sophisticated programs with very large state spaces [36,112]. Since then, the SMV model checker has been used in verifying several systems. In 2000, a new version of SMV was released [41].

The second important improvement in model checking techniques is the exploitation of partial order reduction of the state space [74]. The basic idea of partial order reduction is as follows.
If two events are independent, then the system will reach the same global state regardless of which event executes first. This way, less space is needed to represent the system, which in turn reduces the effect of the state explosion problem.

Since the early 1990s, many techniques have been developed to extend the capabilities of model checking tools. These techniques include abstraction [80], where the data values of the system, usually a reactive system, are mapped to a smaller set of abstract values; compositional reasoning [8,51,79], where the behavior of a system composed of many similar processes can be represented by a few processes; symmetry reduction [44,117], where the model checker exploits the symmetrical characteristics of the program to obtain a smaller model; and induction and parameterized verification [106,131], where the behavior of the system is represented in a way that can be used for an arbitrary number of processes. The development of more effective methods for program verification has continued over the past few years. It has also resulted in more innovative techniques that handle specific problems in more customized settings.

One application of our approach is to be complementary to existing approaches [36,41,93] for verifying program correctness in the early stages of system design. In particular, the techniques in [98,119,130] aim to identify unacceptable system behavior in order to find the root causes that make the system behave incorrectly. However, these approaches do not address what to do when new faults or bugs are identified. Generally, it is left to the designer to address this with some guidance or with trial and error. Moreover, manual revision has the potential to introduce new errors. Our approach focuses on automating such model revisions.
Therefore, once the model checker identifies an instance where the model does not satisfy the property of interest, we can use automated model revision techniques to automatically revise the existing model (cf. Figure 8.1). The revised model will continue to satisfy the original properties as well as the new property. Such automated revision is highly desirable, since it enables system designers to automatically and incrementally add properties to their models. Some of the advantages of this approach are that the revised model is correct by construction and there is no need to re-verify it. Also, the original model's properties are preserved. Furthermore, there is a potential for this approach to require less time and space, since it does not require the revision of the entire model specification.

Figure 8.1: Model Checking and Automated Model Revision.

In another context, we have adopted, in the development of our model revision tools, many techniques that were used to advance the development of better model checking tools. For example, one way to reduce the complexity further is to integrate advances from model checking, as incremental synthesis involves several tasks that are also considered in model checking. We considered two approaches from model checking: (1) the use of symmetry and (2) parallelizing the algorithm across multiple processors/cores.

8.2 Controller Synthesis and Game Theory

Our work is closely related to the work on controller synthesis (e.g., [16,17,32]) and game theory (e.g., [70]). In this work, supervisory control of real-time systems has been studied under the assumption that the existing program (called a plant) and/or the given specification is deterministic. In particular, Jobstmann, Griesmayer, and Bloem [96] used an approach based on concepts from game theory.
They presented the problem of program repair as a Büchi game played between two players. They modeled the program and its environment as the two players. More specifically, the program takes a move in response to a move taken by the environment.

Our formulation of automated model revision is similar to that used by Ramadge and Wonham [122] in the discrete controller synthesis problem. In both approaches, the goal is to restrict the program actions to the desired behaviors. These techniques require highly expressive specifications. Hence, the complexity is also high (EXPTIME-complete or higher). In addition, these approaches do not address some of the crucial concerns of fault-tolerance (e.g., providing recovery in the presence of faults) that are considered in our work.

8.3 Model Revision and Automated Program Synthesis

In this section, we review the history and the evolution of automated model revision techniques [101]. We also show how the work in this dissertation relates to previous work in this regard.

Automated program synthesis and revision have been studied from various perspectives. Inspired by the seminal work by Emerson and Clarke [64], Arora, Attie, and Emerson [11] proposed an algorithm for synthesizing fault-tolerant programs from CTL specifications. Their method, however, does not address the issue of adding fault-tolerance to existing programs. Kulkarni and Arora presented automated algorithms for the addition of fault-tolerance to centralized programs as well as distributed programs. Their approach depends on the existence of an original program that is correct in the absence of faults, i.e., the existing program satisfies its specification as long as no faults occur. Their goal is to modify (i.e., revise) the existing program and generate a modified (i.e., revised) version of the program such that the revised program is fault-tolerant and does not introduce any new behavior in the absence of faults [101]. The authors also analyzed the complexity of adding fault-tolerance in different settings. We used some of their results in the table in Section 7.4.5. For instance, they proved that the problem of automated addition of masking fault-tolerance is NP-complete.

Kulkarni, Arora, and Chippada [102] developed a polynomial time algorithm for automated synthesis of fault-tolerant distributed programs. Since this problem was proven to be NP-hard in [101], the authors presented an algorithm that relies on heuristics to reduce the complexity. Moreover, they demonstrated that the algorithm suffices to synthesize an agreement program that tolerates a Byzantine fault.

In their effort to automate the synthesis of fault-tolerant programs, Ebnenasir and Kulkarni developed a framework called the Fault-Tolerance Synthesizer (FTSyn) [60]. The FTSyn framework implemented most of the heuristics that had been proposed to synthesize fault-tolerant programs. The main reasons for developing FTSyn were to validate the theoretical results as well as to provide developers with an interactive tool for automated synthesis. The authors used FTSyn to synthesize several fault-tolerant distributed programs. For instance, they used FTSyn to synthesize an altitude switch that controls the altitude of an aircraft. The input to FTSyn is an abstract program consisting of a set of processes described in a guarded command language, and the output is a masking fault-tolerant program, also in guarded commands. The authors used FTSyn to demonstrate the applicability of their approach and also to show that, with automation, it can be applied to cases with different types of faults. However, similar to other enumerative implementations, FTSyn was subject to the state explosion problem and was only suitable for synthesizing small programs.

Recently, Bonakdarpour and Kulkarni presented a symbolic implementation of the synthesis algorithm [27,30]. In their tool (SYCRAFT), the components of the synthesis algorithm are constructed using Boolean formulae represented by Bryant's Ordered
However, similar to other enumerative implementations, FT Syn was subject to the state explosion problems and was only suitable for synthesizing small programs. Recently, Bonakdarpour and Kulkarni presented a symbolic-based implementation for the synthesis algorithm [27,30]. In their tool (SYCRAFT), the components of the syn- thesis algorithm are constructed using Boolean formulae represented by Bryants Ordered 169 Binary Decision Diagrams [33]. This was the first time where moderate to large sized pro- grams (a state space of 1050 and beyond) have been synthesized. Although, both FTSyn and SYCRAFT implement similar synthesis heuristics from [102], there are several differ- ence between them. For instance, the symbolic representation made SYCRAFT capable of handling programs with larger state space. Moreover, the grammar of the input lan- guage of SYCRAFT has more constructs which can assist the designer in describing the abstract program. Also, one of the characteristics of SYCRAFT is that it describes the out- put in an optimized representation. Using SYCRAFT, authors also have identified several bottlenecks that can slow down the synthesis. In particular, they identified the following bottlenecks: the deadlock resolution, computation of recovery action, computation of the fault-span and the cycle resolution. In this dissertation, we focused on two major complex- ity obstacles in deadlock resolution, namely computation of the recovery actions and the deadlock elimination. We used parallelism and symmetry to overcome these bottlenecks. Our work in this dissertation is closely related to the tool SYCRAFT. In particular, we have implemented most of the techniques we presented in this dissertation and added them to SYCRAFT. 8.4 Parallelization and Symmetry In the model checking community, various techniques have been proposed to implement the symbolic state space generation and exploration using parallel computing. 
Some of those approaches targeted the state explosion problem by focusing on data parallelism, distributing the computation among a group of workstations, e.g., networks of workstations (NOWs) [77,78,92,115,126]. Their goal was mainly to provide more memory resources to handle the expanding state space. Obviously, speed was not the issue here, and time complexity was not the target. Others focused on enhancing time-efficiency by using parallelism. For this group, the goal was to use the ever-expanding parallel infrastructure of multi-core PCs and multi-processor platforms to expedite model checking. Most notable was the work on parallelizing the Saturation algorithm [39]. Unfortunately, symbolic state exploration has proven to be notoriously resistant to parallelization.

In [66,67,69], the authors propose solutions and analyze different approaches to the parallelization of saturation-based state space generation in model checking. In particular, in [67], the authors show that in order to gain speedups in saturation-based parallel symbolic verification, one has to pay a penalty in memory usage of up to 10 times that of the sequential algorithm. Other efforts range from simple approaches that essentially implement BDDs as two-tiered hash tables [115,127] to sophisticated approaches relying on slicing BDDs [78] and techniques for work stealing [77]. However, the resulting implementations show only limited speedups. Ezekiel, Lüttgen, and Siminiceanu [68] argue that a heavily optimized symbolic algorithm such as Saturation may be more efficient than a parallel version of the same algorithm.

Ebnenasir presented a divide-and-conquer method [58] for synthesizing failsafe fault-tolerant distributed programs. In failsafe fault-tolerance, the program is not required to maintain any liveness requirements when faults occur. Therefore, resolving deadlock states in the fault-span is not needed.
In this dissertation, we focused on two major complexity obstacles in deadlock resolution, namely the computation of the recovery actions and the deadlock elimination. We used parallelism and symmetry to reduce the time complexity. Our work utilizes parallelization of the group computation as well as symmetry for expediting automated model revision. Unlike other parallelization algorithms for symbolic representations of models, we were able to achieve speedups of up to multiple orders of magnitude. By focusing on parallelizing the group operation, we were able to harness the benefits of the multi-core infrastructure.

8.5 Nonmasking and Stabilizing Fault-Tolerance

Automated program synthesis has been studied from different perspectives. One approach (e.g., [11]) focuses on synthesizing fault-tolerant programs from their specification in a temporal logic (e.g., CTL, LTL, etc.). Our approach for adding nonmasking and stabilizing fault-tolerance is based on satisfying constraints that should be true in legitimate states.

In masking fault-tolerance, when faults occur, the program cannot violate the safety property during recovery. Therefore, this approach will not be able to synthesize nonmasking fault-tolerant programs, where safety can be violated during recovery. Furthermore, while our algorithm accounts for weak-fairness among program actions and allows recovery actions to be added under this assumption, the heuristic-based approach does not account for fairness assumptions.

Katz and Perry [97] proposed an algorithm to extend an arbitrary asynchronous distributed message-passing system into a self-stabilizing system. They also gave a formal definition of the self-stabilizing extension of a non-stabilizing program, and they defined the set of properties that must be maintained by the new extension. Their algorithm superimposes a control program on the original non-stabilizing program.
The control program repeatedly takes a global snapshot and then checks whether the snapshot indicates an illegal state. If an illegal state is found, the control program resets the memory of each process to a legal default state.

Arora, Gouda, and Varghese [13] proposed a manual approach to designing nonmasking fault-tolerant programs. In this approach, a program is intended to satisfy a set of constraints during normal operation (i.e., no faults). Program actions are categorized into "closure" actions and "convergence" actions. When faults occur and violate one or more of the program constraints, convergence actions are responsible for correcting program behavior and reestablishing those constraints. This method, however, does not address the issue of automated addition of nonmasking fault-tolerance to existing fault-intolerant programs.

Our approach for adding nonmasking fault-tolerance and self-stabilization is based on satisfying constraints that should be true in legitimate states. An orthogonal approach is to utilize primitives such as distributed reset [97], where one detects whether the system is in a consistent state and resets it to a legitimate state if needed. Examples of these approaches include [97,128]. Our approach can be utilized to design the distributed reset protocol itself.

The verification of self-stabilizing properties has been studied by several researchers. One method to verify the correctness of self-stabilizing algorithms is mechanical theorem proving. In [121], Qadeer and Shankar used PVS [118] to verify the correctness of Dijkstra's algorithm. Another approach to verifying self-stabilizing algorithms uses model checking. In [129], Tsuchiya et al. applied CTL symbolic model checking techniques to verify several distributed algorithms against self-stabilization properties. They used SMV [113] to overcome the state explosion problem. They showed that the state space can be efficiently reduced using OBDDs.
However, they concluded that their approach is applicable only when the number of processes is modest.

8.6 Legitimate States Discovery

Several techniques have been developed to verify program correctness [35,36,47,89,93,113]. For most of these methods, the program is translated into a logical formula that describes the program's behavior and properties. Then, tools are used to verify the correctness of the program. For many of these tools, identifying the program's legitimate states (i.e., legal or invariant states) is an essential step. Several approaches have been proposed to improve the automatic generation of the legitimate states [19,20,23,109,116]. These methods can be broadly classified as either top-down or bottom-up approaches. The top-down approach starts with the weakest possible invariant and uses the program specification to strengthen that invariant. The bottom-up approach performs forward propagation of the program actions to derive the invariant. Our algorithm is a top-down approach, since it starts by initializing the largest set of legitimate states to be the whole state space and later removes states that violate the predefined safety and liveness specifications.

Rustan, Leino, and Barnett [19,109] presented methods for forming an efficient weakest precondition to enhance the performance of verification tools like ESC/Java and ESC/Modula-3. Their goal is to simplify the presentation of the weakest precondition to avoid redundancy and to avoid exponential growth of the condition size. Our definition of the largest set of legitimate states is equivalent to their definition of the weakest conservative preconditions, in which the execution of a program statement does not go wrong and terminates. However, in their work they address the problem of redundancy in describing such conditions, while we focus on the automatic generation of such conditions from the program specification.
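The top-down computation described above can be sketched as a pruning fixpoint over an explicit state space. This is a simplified illustration with hypothetical names; as one approximation of liveness, it requires every legitimate state to have a legal successor that is itself legitimate (ruling out deadlocks), whereas the dissertation's algorithm works on symbolic predicates against the full safety and liveness specification.

```python
def weakest_legitimate(states, transitions, safe_state, safe_transition):
    """Top-down sketch: start from the full state space, keep only
    safety-satisfying states, then repeatedly remove states that have
    no legal outgoing transition remaining inside the set (deadlocks),
    until a fixpoint is reached. The result is the largest such set."""
    inv = {s for s in states if safe_state(s)}
    legal = {(s, t) for (s, t) in transitions if safe_transition((s, t))}
    while True:
        pruned = {s for s in inv
                  if any(t in inv for (u, t) in legal if u == s)}
        if pruned == inv:
            return inv
        inv = pruned
```

For example, over states {0, 1, 2, 3} with transitions 0↔1 and 2→3, where state 3 violates safety, state 3 is removed first, then state 2 (its only successor is gone), leaving the largest closed set {0, 1}. Starting from the whole state space guarantees the result is the weakest (largest) predicate, matching the top-down characterization above.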
Jeffords and Heitmeyer [94,116] described an algorithm to automate the generation of the invariant. Their technique derives the invariant from propositional formulas obtained from SCR tables. Their algorithm is intended for detecting errors at early stages of program design. By contrast, our algorithm is intended to discover the largest set of legitimate states of programs assumed to be correct, for the purpose of adding fault-tolerance to such programs.

The accurate and complete identification of the legitimate states is an essential step that enables designers to apply the algorithms and tools for automated model revision of fault-tolerant programs from fault-intolerant programs [27,30,101,111]. Unlike traditional approaches, which require the explicit specification of the legitimate states, our approach does not; instead, it generates the largest set of legitimate states from the program transitions and specification. Therefore, it significantly improves and simplifies the process of automated addition of fault-tolerance. Furthermore, our approach is relatively complete when compared to traditional approaches. Moreover, it does not introduce any significant cost.

Chapter 9

Conclusion and Future Work

In this dissertation, we focused on the problem of automated model revision. We derived theories, developed algorithms, and built tools to make model revision more comprehensive, efficient, and designer-friendly. In particular, we reduced the automated model revision learning curve by utilizing existing design tools. We also developed algorithms and tools to apply model revision in adding new types of fault-tolerance properties and to automate the generation of the legitimate states of the original model. Finally, we utilized both symmetry and parallelism to speed up the automated revision and to overcome its bottlenecks, reducing its time complexity.
In this chapter, we present a summary of our contributions. In Section 9.1, we summarize the contributions of this dissertation. Then, in Section 9.2, we list some future research directions.

9.1 Contributions

This dissertation makes four main contributions:

1. Reducing the Learning Curve of the Automated Model Revision: To reduce the learning curve of automated model revision, we focused on utilizing existing design tools. We combined the automated model revision tool SYCRAFT with the SCR toolset. To achieve successful coupling, we developed a middle layer that translates the SCR specification into SYCRAFT input as well as SYCRAFT output back to SCR. Thus, our approach gives designers the ability to perform the tasks of model revision under-the-hood (i.e., while working within the SCR toolset). In this way, they do not need to know all the details required to perform automated model revision.

We expect that the ability to add fault-tolerance under-the-hood is especially useful, as it allows designers to continue to use the design tools they were already using. This reduces the learning curve of the model revision techniques. In the context of SCR, this is especially useful, since the SCR toolset has already been adopted by industry and is used in the development of many real world applications. Furthermore, the SCR toolset integrates several tools for consistency checking, verification, etc. Since a synthesized fault-tolerant SCR specification can be viewed and modified using the SCR toolset, one can analyze the revised fault-tolerant SCR specification for various other properties.

With case studies, we showed that, for our approach to be effective, certain changes need to be made to the SCR interface. In particular, we demonstrated that the SCR toolset would have to be modified to include the description of faults. However, we showed that the changes required for describing faults in the SCR toolset are straightforward.
In particular, the faults themselves could be represented using tables. We also demonstrated that the designer needs to specify the requirements that should be met in the presence of faults. Once again, this is similar to how other requirements (not related to fault-tolerance) are specified in the SCR toolset. These changes to the SCR toolset are reasonable in that they essentially require the designer to specify what the faults are and what the requirements for fault-tolerance in the presence of faults are.

Additionally, automated revision with SYCRAFT also provides the possibility of detecting errors in the requirements themselves. In particular, one can identify errors caused by a missing requirement on how recovery can be added. Since SYCRAFT tries to provide maximum non-determinism in the revised program, if a requirement is missing, then there is a high potential that it will be detected. Therefore, this approach provides the ability to reduce cost, since it detects errors and missing specifications early in the design stage.

2. Automating the Discovery of the Legitimate States: To further reduce the effort required by the designer in automated model revision, we focused on generating one of the inputs, the legitimate states, automatically. In particular, the inputs to the model revision algorithms include: (1) the existing model, (2) the specification of the model, (3) the faults, and (4) the legitimate states of the original model. Clearly, specifying the existing model is unavoidable. Moreover, identifying it is easy, as model revision is expected to be used in contexts where designers already have an existing model. The specification is also already available to the designer when model revision is used in contexts where the existing model fails to satisfy the desired specification. Likewise, the new property that is to be added to the existing model is also easy to identify. In the context of fault-tolerance, this requires the designers to identify the faults that need to be tolerated.

Based on our experience, the hardest input to identify is the set of legitimate states from where the original model satisfies its specification. In part, this is because identifying these legitimate states explicitly is often not required during the evaluation of the original model. Hence, we focused on the problem of automated model revision of an existing model without the use of explicit legitimate states. Moreover, as shown by the example in Section 5.5, typical algorithms for computing legitimate states based on initial states do not work in the context of automated model revision.
In the context of fault-tolerance, this requires the designers to identify the faults that need to be tolerated. Based on our experience, the hardest input to identify is the set of legitimate states from which the original model satisfies its specification. In part, this is because identifying these legitimate states explicitly is often not required during the evaluation of the original model. Hence, we focused on the problem of automated model revision of an existing model without the use of explicit legitimate states. Moreover, as shown by the example in Section 5.5, typical algorithms for computing legitimate states based on initial states do not work in the context of automated model revision.

We presented an algorithm for automated discovery of the weakest legitimate state predicate of the given program. Our algorithm uses the program actions and specification to automatically generate the weakest legitimate state predicate.

To evaluate this algorithm, we compared automated model revision when the legitimate states are explicitly specified with automated model revision when they are not. We considered three questions in this context: (1) relative completeness, (2) qualitative complexity-class comparison, and (3) quantitative change in the time for model revision. We showed that our approach for model revision without explicit legitimate states is relatively complete, i.e., if model revision can be solved with explicit legitimate states, then it can also be solved without them. This is important because it implies that the reduction in the human effort required for model revision does not reduce the class of problems that can be solved.

Regarding the second question, we found some surprising and counterintuitive results. Specifically, for total revision, we found that the complexity class remains unchanged. However, for partial revision, the complexity class changes substantially.
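In explicit-state form, the idea behind the weakest-legitimate-state algorithm summarized above can be sketched as a greatest-fixpoint pruning loop: start from the states allowed by the safety specification and repeatedly discard states that have no successor inside the candidate set. This is a simplification; the actual algorithm operates symbolically on BDDs and accounts for the full specification.

```python
def weakest_legitimate_predicate(states, transitions, safe):
    """Greatest-fixpoint sketch: the weakest legitimate state predicate is
    the largest set of safe states in which every state still has a
    successor inside the set (i.e., no deadlocks within it)."""
    legit = {s for s in states if safe(s)}
    changed = True
    while changed:
        changed = False
        for s in list(legit):
            # drop states whose every outgoing transition leaves the set
            if not any(dst in legit for (src, dst) in transitions if src == s):
                legit.discard(s)
                changed = True
    return legit
```

For example, with states {0, 1, 2, 3}, transitions {(0, 1), (1, 0), (2, 3), (3, 3)}, and state 3 unsafe, the loop first removes 3 and then 2 (whose only successor was 3), leaving {0, 1}.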
In particular, we showed that problems that can be solved in P when legitimate states are available explicitly become NP-complete when explicit legitimate states are unavailable. This result is especially surprising since it is the first instance where the complexity levels for total and partial revision have been found to differ. Even though the general problem of partial revision becomes NP-complete without explicit legitimate states, we identified a subset of these problems that can be solved in P. Specifically, this subset includes all instances where model revision is possible when legitimate states are specified explicitly.

Regarding the third question, we showed that the extra computation cost incurred by reducing the human effort for specifying the legitimate states is negligible. Towards this end, we considered four case studies: Byzantine agreement, mutual exclusion, token ring, and diffusing computation. In each of these examples, the generated set of legitimate states was the same as the one specified explicitly in the automated addition of fault-tolerance. Moreover, the time to generate the legitimate states was negligible (less than 1%) compared with the time for performing the corresponding addition of fault-tolerance. We have also integrated automated revision without explicit legitimate states into the tool SYCRAFT. We note that this result can be extended to other model revision problems where one adds safety properties, liveness properties, or timing constraints.

3. Exploiting Parallelism and Symmetry to Expedite the Automated Model Revision: Another contribution of this dissertation is directed towards making automated model revision more efficient. Specifically, we worked on improving the performance of automated model revision when synthesizing fault-tolerant programs from their fault-intolerant versions.
Towards this end, we developed techniques that utilize (1) multi-core processors and (2) the symmetry among the processes of the program being revised to expedite the automated model revision.

In the case of parallelism, we focused on one of the main complexity barriers, the resolution of deadlock states, in automated model revision to add fault-tolerance to distributed programs. Our approach was based on parallelization with multiple threads on a multi-core architecture. We considered parallelization in two scenarios: (1) adding recovery transitions, and (2) eliminating deadlock states. Our approach provides each thread its own copy of the shared variables. Although this has the potential to increase memory usage, automated model revision problems generally have a higher time complexity than the corresponding verification problems. Hence, we expect the automated model revision algorithm to run out of time before it runs out of memory, and the increased space complexity is therefore unlikely to be the bottleneck during revision.

Initially, we showed that the approach of partitioning deadlock states provides a small improvement. The approach based on parallelizing the group computation (which is caused by the distribution constraints of the program being synthesized) provides a significant benefit that is close to the ideal, i.e., a speedup equal to the number of threads used. Additionally, we demonstrated that there is potential for superlinear speedup, because partitioning the group computation reduces the size of the corresponding BDDs. Since the configuration used to evaluate performance was an 8-core machine (4 dual-core processors), we considered cases with up to 16 threads. We found that as the number of threads increases, the revision time decreases. In fact, because the parallelism is fine-grained, using more threads than available cores can improve performance slightly.
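The thread-per-partition scheme described above can be sketched as follows. The recovery rule and data layout are simplified placeholders: the real algorithm adds recovery transitions symbolically and must also respect safety and distribution constraints.

```python
from concurrent.futures import ThreadPoolExecutor

def add_recovery_parallel(deadlock_states, legitimate_states, n_threads=4):
    """Partition the deadlock states among threads; each thread works on
    its own slice with its own local data, proposing recovery transitions
    from deadlock states into the legitimate-state set."""
    ordered = sorted(deadlock_states)
    chunks = [ordered[i::n_threads] for i in range(n_threads)]
    target = min(legitimate_states)  # toy recovery target

    def recover(chunk):
        # each thread returns its own result list; no shared mutable state
        return [(s, target) for s in chunk]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(recover, chunks)
    return sorted(t for part in parts for t in part)
```

Giving each thread its own data, as in the dissertation's approach, trades extra memory for the absence of synchronization on shared structures.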
This demonstrates that we have not yet reached the bottleneck involved in parallelization. Furthermore, there is potential for further reduction in revision time if the level of parallelism is increased (e.g., if more processors are available). Although the parallelism is fine-grained, we showed that the overhead of parallel computation is small.

In the case of symmetry, we showed that symmetry provides a substantial benefit in reducing the time involved in the revision. More specifically, we observed that multiple processes in a distributed program are symmetric in nature, i.e., their actions are similar except for the renaming of variables. Thus, if our algorithm finds recovery transitions for one process, it utilizes symmetry to identify the corresponding recovery transitions that should also be included for the other processes in the system. Likewise, if some transitions of a process violate safety in the presence of faults, it identifies the similar transitions of other processes that would also violate safety. Since the cost of identifying these similar transitions using knowledge of the symmetry among processes is less than the cost of identifying them explicitly, the use of symmetry reduces the overall time required for the revision. Moreover, the speedup increases as the number of symmetric processes increases.

4. Automating the Model Revision to Add Nonmasking and Stabilizing Fault-Tolerance: Tools for automated model revision need to be comprehensive and include techniques to automate the addition of different levels of fault-tolerance. In this dissertation, we therefore also focused on automated revision to add nonmasking and stabilizing fault-tolerance to hierarchical distributed systems. In particular, we considered systems where the legitimate states are specified in terms of constraints that hold in those states.
The goal of adding nonmasking and stabilizing fault-tolerance was to ensure that if these constraints are violated by faults, then the program eventually reaches a state where all the constraints are satisfied, and its subsequent behavior is correct.

Our approach was to utilize an order among the constraints. With this order, we ensured that correction actions that correct constraint C_i do not violate any of the previous constraints C_0, C_1, ..., C_{i-1}, although they may violate constraints C_j with j > i. In our case studies from Chapter 5, we considered different possible orderings, and in most cases we were able to synthesize a nonmasking fault-tolerant program. Therefore, identifying an order among these predicates does not appear to be a critical concern. Moreover, the number of orderings that need to be considered for a group of n constraints is at most O(n^2). Finally, we find that this approach is especially suited for synthesizing stabilizing programs, since it eliminates one of the bottlenecks of the automated revision (evaluating the fault-span).

We also focused on improving the revision that produces nonmasking and stabilizing fault-tolerant programs from their fault-intolerant versions. We showed that the use of multi-core technology to parallelize the revision algorithm reduces the revision time substantially. We parallelized constraint satisfaction by (1) partitioning the constraints and (2) utilizing the nature of distributed programs, and we showed that this parallelism provides a substantial benefit in reducing the time needed for the revision. We illustrated our approach with three case studies: stabilizing mutual exclusion, stabilizing diffusing computation, and a data dissemination problem for sensor networks. The complexity analysis demonstrated that automated model revision in these case studies was feasible and was achieved in a reasonable time, with a speedup in all case studies.
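The ordered-correction discipline described above can be sketched as a simple loop: repeatedly find the lowest-indexed violated constraint and apply its correction action. The state encoding and correction actions below are hypothetical; the guarantee that correcting C_i preserves C_0, ..., C_{i-1} comes from choosing a suitable order, not from the loop itself.

```python
def stabilize(state, constraints, corrections, max_steps=1000):
    """Apply corrections in constraint order: always fix the lowest-indexed
    violated constraint C_i first. By construction of the ordering, its
    correction may only disturb constraints C_j with j > i."""
    for _ in range(max_steps):
        i = next((k for k, c in enumerate(constraints) if not c(state)), None)
        if i is None:
            return state  # all constraints hold: a legitimate state
        state = corrections[i](state)
    raise RuntimeError("failed to stabilize within max_steps")

# toy example: state (x, y) with C_0: x == 0 and C_1: y == x
cs = [lambda s: s[0] == 0, lambda s: s[1] == s[0]]
fix = [lambda s: (0, s[1]), lambda s: (s[0], s[0])]
```

Starting from (5, 3), the loop first corrects C_0 (giving (0, 3)), then C_1 (giving (0, 0)), at which point every constraint holds.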
Furthermore, since our work builds on the constraint-based (manual) design of nonmasking and stabilizing fault-tolerance from [13], which has been found useful in deriving several protocols manually (e.g., [73, 75, 128]), we expect it to be highly valuable for automatically designing various stabilizing and nonmasking programs. We also showed that the hierarchical nature of the underlying system can be effectively utilized to reduce the complexity of synthesizing programs with a larger number of processes, while maintaining the correct-by-construction property of programs designed by automated model revision.

This work also advances the state of the art of automated model revision in yet another way. To our knowledge, this is the first instance where automated model revision to add fault-tolerance is achieved with fairness constraints. Without fairness constraints, a stabilizing mutual exclusion algorithm based on [124] is impossible. Moreover, the structure of the recovery actions in the first two case studies is too complex to be handled by previous heuristic-based approaches [30].

9.2 Future Research Directions

During our work on automated model revision, we identified several possible directions for future work. Some of these are listed below.

In Chapter 3, we identified the requirements for completing the revision under-the-hood, and we developed a middle layer that translates the SCR specification into the SYCRAFT specification. One future research direction in this context is to develop an enhanced version of the middle layer; in particular, a more generic middle layer capable of handling several types of specifications other than SCR. We believe that many activities of automated model revision are not user-centric and do not require direct involvement of the user. Furthermore, many software solutions require modification of some of the software's properties at several stages of the software life cycle.
Moreover, in many cases such software modification must be completed in an expedited fashion. These requirements make the ability to perform automated model revision under-the-hood appealing for many design tools. Hence, one future research direction in this context is investigating the possibility of integrating automated model revision into other design tools, such as Simulink [52] and Rational Rhapsody [81, 82]. The enhanced middle layer would also include a complete description of the input and output fields, allowing other developers and researchers to link their design tools with SYCRAFT.

Time complexity is one of the important factors in successful automated model revision. One future research direction in this context is to combine other advances from program verification. We expect that combining these advances with characteristics of distributed systems (e.g., forward reachability analysis, hierarchical behavior, and the types of expected faults) would be extremely beneficial; specifically, it would make the automated revision of practical distributed programs to add new properties more feasible.

In Chapter 4, we listed some of the factors that contribute to the time complexity of automated model revision. Of these, the deadlock resolution problem is a unique bottleneck that does not exist in other verification methods. However, we recognize that there are other bottlenecks (e.g., forward reachability analysis) that are shared with other verification techniques. Hence, one piece of future work in this context is to incorporate techniques such as partitioning [35], clustering [123], and saturation-based reachability analysis [39, 40] into automated model revision tools. We expect these techniques to improve the computation of many constructs in our tool.

In Chapter 4, we also identified the importance of the group computations in automated model revision.
In particular, we found that the revision time is often dominated by computing such groups. Also, since the group computation is caused by the distribution constraints of the program being synthesized, as discussed in Chapters 4 and 5, it is guaranteed to be required even with other techniques for expediting automated model revision. One piece of future work is to combine the group parallelism with the techniques that partition the deadlock states among the available threads. In particular, as discussed in Chapter 4, the parallelism that partitions the deadlock states is coarse-grained; however, it can permit threads to perform inconsistent actions that need to be resolved later. Thus, it provides a tradeoff between the overhead of synchronization among threads and the cost of resolving such inconsistencies. Hence, even when a large number of cores is available, this approach would be valuable together with other techniques that utilize those additional cores. Another future research direction is to explore other approaches to expedite the group computation; for example, it could be used in conjunction with the approach that utilizes symmetry among the processes being synthesized.

Another possible piece of future work is developing more efficient algorithms for computing the groups. Due to the distributed nature of the programs being revised, the group associated with a given transition is most likely computed several times. Such repeated computation is not really necessary: the group associated with a given transition is fixed and does not change during the revision. Therefore, one approach for reducing the time required for computing the groups is as follows. In the initialization stage of the revision algorithm, we compute the groups associated with all the transitions of the program and store them in an efficient data structure.
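This precompute-and-store scheme amounts to straightforward memoization, as sketched below. Here `compute_group` stands in for the expensive, BDD-based group computation, and the toy grouping rule (flipping a hypothetical "unreadable" low-order bit of each state) is purely illustrative.

```python
def precompute_groups(transitions, compute_group):
    """Initialization stage: since the group of a transition is fixed by
    the distribution constraints, compute it once per transition and
    store it for O(1) lookup during the revision."""
    return {t: compute_group(t) for t in transitions}

calls = {"n": 0}

def compute_group(t):
    calls["n"] += 1  # count how often the expensive step really runs
    pre, post = t
    return frozenset({t, (pre ^ 1, post ^ 1)})  # toy grouping rule

groups = precompute_groups([(0, 1), (2, 3)], compute_group)
for _ in range(1000):      # during revision: repeated lookups...
    g = groups[(0, 1)]     # ...retrieve from the store, never recompute
```

After initialization, a thousand lookups still trigger only two invocations of the expensive computation, which is exactly the tradeoff (time for memory) discussed above.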
Later, during the revision, whenever the group associated with a given transition is required, it is retrieved from this store. We expect this approach to significantly reduce the time complexity of the revision. However, it may require more memory, at which point some tradeoffs will need to be made to select the appropriate choices. We also expect that integrating our implementation with a SAT or SMT (satisfiability modulo theories) solver would be beneficial. In SMT solvers, one can use richer types, such as abstract data types, integers, and reals, in formulae that involve arithmetic and quantifiers.

In our automated model revision tools, we used BDDs to efficiently represent the model being revised. However, the level of efficiency depends on the order in which we choose to list the variables of the model. Traditionally, such ordering is done manually, based on heuristics, to minimize the space required to describe the model. Such a manual approach is sufficient for other program verification approaches (e.g., model checking), since in verification the model itself does not change, and the initial order chosen for the variables therefore stays valid. Unlike verification, in model revision the model is modified: transitions can be removed if they violate safety, and transitions might be added to achieve recovery. Consequently, the initial order of the variables may need to change during the revision. One interesting piece of future work is to look for solutions where the order of the variables is dynamic and changes during the revision.

Distributed programs often consist of processes with similar structure. In Chapter 4, we developed some simple yet effective techniques that utilize symmetry to expedite the revision, and we demonstrated that the use of symmetry can dramatically lower the time required for automated model revision.
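As a concrete illustration of how symmetry is exploited, the sketch below takes one recovery transition found for process 0 and derives the analogous transition for every other process by rotating the tuple of per-process local states. This assumes fully symmetric processes whose actions differ only by index renaming, as in a ring; the actual algorithm performs the renaming symbolically on BDDs.

```python
def symmetric_copies(transition, n_procs):
    """Given a transition (pre, post) found for process 0, derive the
    corresponding transition for each process k by rotating the tuple
    of per-process local states by k positions."""
    pre, post = transition

    def rotate(state, k):
        return tuple(state[(i - k) % n_procs] for i in range(n_procs))

    return [(rotate(pre, k), rotate(post, k)) for k in range(n_procs)]
```

For example, a recovery transition that resets process 0's flag, ((1, 0, 0), (0, 0, 0)), yields the analogous transitions ((0, 1, 0), (0, 0, 0)) and ((0, 0, 1), (0, 0, 0)) for processes 1 and 2, without re-running the search that discovered the original transition.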
However, one limitation of our technique is that it requires the designer to identify the symmetry patterns in the program. Future work in this area will involve searching for techniques that allow automated discovery of such symmetry patterns. An interesting problem would be to exploit the symmetry in distributed programs by automatically identifying symmetric processes and actions.

In Chapter 5, we demonstrated how the hierarchical structure of the processes can be used to reduce the complexity of automated model revision. In particular, we showed how we can revise a small model and use the results to revise larger models. One piece of future work in this context is to incorporate techniques that can automatically identify the network topology of the model being revised and use it to complete the revision efficiently.

In the automated model revision to add nonmasking fault-tolerance, we used a set of constraints to describe the legitimate states of the model being revised. The order in which we choose to satisfy these constraints is very important: choosing a wrong order may make it impossible to find a correct nonmasking fault-tolerant model. We briefly presented a heuristic that considers all possible combinations for ordering the constraints. Another piece of future work in this context is to investigate other heuristics that take into consideration the relation between the constraints themselves. For example, if the set of states identified by a constraint, say C_1, is included in the set of states identified by another constraint, say C_2, then we may need to satisfy C_1 before satisfying C_2.

BIBLIOGRAPHY

[1] M. Abadi and L. Lamport. Conjoining specifications. ACM Transactions on Programming Languages and Systems (TOPLAS), 17(3):507-535, 1995.

[2] F. Abujarad, B. Bonakdarpour, and S. Kulkarni. Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs. In PDMC 2009, 2009.

[3] F. Abujarad and S.
Kulkarni. Automated Addition of Fault-Tolerance to SCR Toolset: A Case Study. In Distributed Computing Systems Workshops, 2008. ICDCS '08. 28th International Conference on, pages 539-544, 2008.

[4] F. Abujarad and S. Kulkarni. Constraint Based Automated Synthesis of Nonmasking and Stabilizing Fault-Tolerance. In Reliable Distributed Systems, 2009. SRDS '09. 28th IEEE International Symposium on, Niagara Falls, New York, USA, Sep 27-30, 2009, Proceedings, pages 119-128, 2009.

[5] F. Abujarad and S. Kulkarni. Multicore Constraint-Based Automated Stabilization. In Stabilization, Safety, and Security of Distributed Systems: 11th International Symposium, SSS 2009, Lyon, France, November 3-6, 2009, Proceedings, page 47. Springer, 2009.

[6] F. Abujarad and S. Kulkarni. Weakest Invariant Generation for Automated Addition of Fault-Tolerance. Electronic Notes in Theoretical Computer Science, 258(2):3-15, 2009. Available as Technical Report MSU-CSE-09-29 at http://www.cse.msu.edu/cgi-user/web/tech/reports?Year=2009.

[7] B. Alpern and F. B. Schneider. Defining liveness. Information Processing Letters, 21:181-185, 1985.

[8] R. Alur, P. Madhusudan, and W. Nam. Symbolic compositional verification by learning assumptions. In Computer Aided Verification, pages 548-562. Springer, 2005.

[9] B. Aminof, T. Ball, and O. Kupferman. Reasoning about systems with transition fairness. Proc. LPAR, LNCS 3452, pages 194-208, 2004.

[10] A. Arora. Efficient reconfiguration of trees: A case study in methodical design of nonmasking fault-tolerant programs. In Science of Computer Programming. Springer, 1996.

[11] A. Arora, P. C. Attie, and E. A. Emerson. Synthesis of fault-tolerant concurrent programs. In Principles of Distributed Computing (PODC), pages 173-182, 1998.

[12] A. Arora and M. G. Gouda. Closure and convergence: A foundation of fault-tolerant computing. IEEE Transactions on Software Engineering, 19(11):1015-1027, 1993.

[13] A. Arora, M. G. Gouda, and G. Varghese.
Constraint satisfaction as a basis for designing nonmasking fault-tolerant systems. Journal of High Speed Networks, 5(3):293-306, 1996.

[14] A. Arora and S. S. Kulkarni. Component based design of multitolerant systems. IEEE Transactions on Software Engineering, 24(1):63-78, 1998.

[15] A. Arora and S. S. Kulkarni. Designing masking fault-tolerance via nonmasking fault-tolerance. IEEE Transactions on Software Engineering, pages 435-450, June 1998.

[16] E. Asarin and O. Maler. As soon as possible: Time optimal control for timed automata. In Hybrid Systems: Computation and Control (HSCC), pages 19-30, 1999.

[17] E. Asarin, O. Maler, A. Pnueli, and J. Sifakis. Controller synthesis for timed automata. In IFAC Symposium on System Structure and Control, pages 469-474, 1998.

[18] A. Avižienis, J. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, pages 11-33, 2004.

[19] M. Barnett and K. Leino. Weakest-precondition of unstructured programs. In Proceedings of the 6th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, pages 82-87. ACM, New York, NY, USA, 2005.

[20] S. Bensalem, Y. Lakhnech, and H. Saidi. Powerful techniques for the automatic generation of invariants. In Proc. 8th Int. Conf. on Computer-Aided Verification, Lect. Notes in Comput. Sci. Springer, 1996.

[21] R. Bharadwaj and C. Heitmeyer. Developing high assurance avionics systems with the SCR requirements method. In Digital Avionics Systems Conference, 2000.

[22] R. Bharadwaj and C. Heitmeyer. Developing high assurance avionics systems with the SCR requirements method. In Digital Avionics Systems Conferences, 2000. Proceedings. DASC. The 19th, volume 1, 2000.

[23] N. Bjorner, A. Browne, and Z. Manna. Automatic generation of invariants and intermediate assertions. Theoretical Computer Science, 173(1):49-87, 1997.

[24] G. V. Bochmann.
Hardware specification with temporal logic: An example. IEEE Trans. Comput., 31(3):223-231, 1982.

[25] B. Bonakdarpour. Automated Revision of Distributed and Real-Time Programs. PhD thesis, Michigan State University, 2008.

[26] B. Bonakdarpour, A. Ebnenasir, and S. Kulkarni. Complexity results in revising UNITY programs. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 4(1):5, 2009.

[27] B. Bonakdarpour and S. Kulkarni. SYCRAFT: A Tool for Synthesizing Distributed Fault-Tolerant Programs. In Proceedings of the 19th International Conference on Concurrency Theory, August, pages 19-22. Springer, 2008.

[28] B. Bonakdarpour and S. S. Kulkarni. SYCRAFT: SYmboliC synthesizeR and Adder of Fault-Tolerance. Available at http://www.cse.msu.edu/~borzoo/sycraft.

[29] B. Bonakdarpour and S. S. Kulkarni. Automated incremental synthesis of timed automata. In International Workshop on Formal Methods for Industrial Critical Systems (FMICS), LNCS 4346, pages 261-276, 2006.

[30] B. Bonakdarpour and S. S. Kulkarni. Exploiting symbolic techniques in automated synthesis of distributed programs with large state space. In IEEE International Conference on Distributed Computing Systems (ICDCS), pages 3-10, 2007.

[31] B. Bonakdarpour, S. S. Kulkarni, and F. Abujarad. Distributed synthesis of fault-tolerance. In International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), 2006. Full version available as Technical Report MSU-CSE-06-27, Computer Science and Engineering Department, Michigan State University, East Lansing, Michigan.

[32] P. Bouyer, D. D'Souza, P. Madhusudan, and A. Petit. Timed control with partial observability. In Computer Aided Verification (CAV), pages 180-192, 2003.

[33] R. Bryant. Graph-Based Algorithms for Boolean Function Manipulation. IEEE Transactions on Computers, 35(8):677-691, 1986.

[34] R. E. Bryant. Graph-based algorithms for boolean function manipulation.
IEEE Transactions on Computers, 35(8):677-691, 1986.

[35] J. Burch, E. Clarke, and D. Long. Symbolic model checking with partitioned transition relations. In International Conference on Very Large Scale Integration, pages 49-58, 1991.

[36] J. R. Burch, E. M. Clarke, K. L. McMillan, D. L. Dill, and L. J. Hwang. Symbolic model checking: 10^20 states and beyond. Information and Computation, 98(2):142-170, 1992.

[37] R. Burstall. Program proving as hand simulation with a little induction. Information Processing, 74(308-312):448, 1974.

[38] K. M. Chandy and J. Misra. Parallel program design: a foundation. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1988.

[39] G. Ciardo, G. Lüttgen, and R. Siminiceanu. Saturation: An efficient iteration strategy for symbolic state-space generation. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 328-342, 2001.

[40] G. Ciardo and A. J. Yu. Saturation-based symbolic reachability analysis using conjunctive and disjunctive partitioning. In Correct Hardware Design and Verification Methods (CHARME), pages 146-161, 2005.

[41] A. Cimatti, E. Clarke, F. Giunchiglia, and M. Roveri. NuSMV: A new symbolic model checker. Int. J. Softw. Tools Technol. Transf., 2(4):410-425, 2000.

[42] E. Clarke. The birth of model checking. 25 Years of Model Checking, pages 1-26, 2008.

[43] E. Clarke, E. Emerson, and A. Sistla. Automatic verification of finite-state concurrent systems using temporal logic specifications. ACM Transactions on Programming Languages and Systems (TOPLAS), 8(2):263, 1986.

[44] E. Clarke, R. Enders, T. Filkorn, and S. Jha. Exploiting symmetry in temporal logic model checking. Formal Methods in System Design, 9(1):77-104, 1996.

[45] E. Clarke and L. Liu. Approximate algorithms for optimization of busy waiting in parallel programs (preliminary report). 20th Annual Symposium on Foundations of Computer Science, pages 255-266, 1979.

[46] E. M. Clarke and E. A. Emerson.
Design and synthesis of synchronization skeletons using branching-time temporal logic. In Logic of Programs, pages 52-71, 1981.

[47] E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite state concurrent system using temporal logic specifications: a practical approach. In POPL '83: Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 117-126, New York, NY, USA, 1983. ACM.

[48] E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite-state concurrent systems using temporal logic specifications. ACM Transactions on Programming Languages and Systems (TOPLAS), 8(2):244-263, 1986.

[49] E. M. Clarke, O. Grumberg, and D. A. Peled. Model checking. Springer, 1999.

[50] E. Clarke Jr. Synthesis of resource invariants for concurrent programs. ACM Transactions on Programming Languages and Systems (TOPLAS), 2(3):358, 1980.

[51] J. M. Cobleigh, D. Giannakopoulou, and C. S. Pasareanu. Learning assumptions for compositional verification. In TACAS '03: Proceedings of the 9th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 331-346, Berlin, Heidelberg, 2003. Springer-Verlag.

[52] J. Dabney and T. Harman. Mastering Simulink. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1997.

[53] E. Dijkstra. A discipline of programming. Prentice-Hall, Englewood Cliffs, NJ, 1976.

[54] E. W. Dijkstra. Self-stabilizing systems in spite of distributed control. Communications of the ACM, 17(11), 1974.

[55] R. Dimitrova and B. Finkbeiner. Synthesis of Fault-Tolerant Distributed Systems. In Automated Technology for Verification and Analysis: 7th International Symposium, ATVA 2009, Macao, China, October 14-16, 2009, Proceedings, page 321. Springer, 2009.

[56] S. Dolev. Self-Stabilization. MIT Press, 2000.

[57] S. Dolev, A. Israeli, and S. Moran. Self-stabilization of dynamic systems assuming only read/write atomicity.
Distributed Computing, 7:3-16, 1993.

[58] A. Ebnenasir. DiConic addition of failsafe fault-tolerance. In Automated Software Engineering (ASE), pages 44-53, 2007.

[59] A. Ebnenasir, S. Kulkarni, and A. Arora. FTSyn: A framework for automatic synthesis of fault-tolerance. International Journal on Software Tools for Technology Transfer (STTT), 10(5):455-471, 2008.

[60] A. Ebnenasir, S. S. Kulkarni, and A. Arora. FTSyn: a framework for automatic synthesis of fault-tolerance. Int. J. Softw. Tools Technol. Transf., 10(5):455-471, 2008.

[61] A. Ebnenasir, S. S. Kulkarni, and B. Bonakdarpour. Revising UNITY programs: Possibilities and limitations. In International Conference on Principles of Distributed Systems (OPODIS), LNCS 3974, pages 275-290, 2005.

[62] E. Emerson and E. Clarke. Characterizing Correctness Properties of Parallel Programs Using Fixpoints. In Proceedings of the 7th Colloquium on Automata, Languages and Programming, page 181. Springer-Verlag, 1980.

[63] E. Emerson and J. Y. Halpern. "Sometimes" and "not never" revisited: On branching versus linear time temporal logic. J. Assoc. Comput. Mach., 33:151-178, 1986.

[64] E. A. Emerson and E. M. Clarke. Using branching time temporal logic to synthesize synchronization skeletons. Science of Computer Programming, 2(3):241-266, 1982.

[65] E. A. Emerson and C. L. Lei. Temporal model checking under generalized fairness constraints. In Proc. 18th Hawaii International Conference on System Sciences, pages 277-288, 1985.

[66] J. Ezekiel and G. Lüttgen. Measuring and evaluating parallel state-space exploration algorithms. In International Workshop on Parallel and Distributed Methods in Verification (PDMC), 2007.

[67] J. Ezekiel, G. Lüttgen, and G. Ciardo. Parallelising symbolic state-space generators. In Computer Aided Verification (CAV), pages 268-280, 2007.

[68] J. Ezekiel, G. Lüttgen, and R. Siminiceanu. Can Saturation be Parallelised? Formal Methods: Applications and Technology, pages 331-346.
[69] J. Ezekiel, G. Lüttgen, and R. Siminiceanu. Can Saturation be parallelised? On the parallelisation of a symbolic state-space generator. In International Workshop on Parallel and Distributed Methods of Verification (PDMC), pages 331-346, 2006.

[70] M. Faella, S. La Torre, and A. Murano. Dense real-time games. In Logic in Computer Science (LICS), pages 167-176, 2002.

[71] F. Gärtner. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys (CSUR), 31(1):1-26, 1999.

[72] F. Gärtner and A. Jhumka. Automating the addition of fail-safe fault-tolerance: Beyond fusion-closed specifications. Lecture Notes in Computer Science, pages 183-198, 2004.

[73] F. Gärtner and H. Pagnia. Self-stabilizing load distribution for replicated servers on a per-access basis. In Proceedings of the 19th IEEE International Conference on Distributed Computing Systems Workshop on Self-Stabilizing Systems, pages 102-109, 1999.

[74] P. Godefroid. Using partial orders to improve automatic verification methods. In CAV '90: Proceedings of the 2nd International Workshop on Computer Aided Verification, pages 176-185, London, UK, 1991. Springer-Verlag.

[75] M. Gouda. Multiphase stabilization. IEEE Transactions on Software Engineering, pages 201-208, 2002.

[76] M. G. Gouda. The triumph and tribulation of system stabilization. In Proceedings of the 9th International Workshop on Distributed Algorithms, pages 1-18. Springer-Verlag, London, UK, 1995.

[77] O. Grumberg, T. Heyman, N. Ifergan, and A. Schuster. Achieving speedups in distributed symbolic reachability analysis through asynchronous computation. In Correct Hardware Design and Verification Methods (CHARME), pages 129-145, 2005.

[78] O. Grumberg, T. Heyman, and A. Schuster. A work-efficient distributed algorithm for reachability analysis. Formal Methods in System Design (FMSD), 29(2):157-175, 2006.

[79] O. Grumberg and D. Long. Model checking and modular verification.
In CONCUR '91, pages 250–265. Springer, 1991.
[80] O. Grumberg and D. E. Long. Model checking and modular verification. ACM Trans. Program. Lang. Syst., 16(3):843–871, 1994.
[81] D. Harel and H. Kugler. The Rhapsody semantics of statecharts. Lecture Notes in Computer Science, pages 325–354, 2004.
[82] D. Harel and H. Kugler. The Rhapsody semantics of statecharts. Lecture Notes in Computer Science, pages 325–354, 2004.
[83] M. Heimdahl and N. Leveson. Completeness and consistency in hierarchical state-based requirements. IEEE Transactions on Software Engineering, 22(6):363–377, 1996.
[84] C. Heitmeyer, M. Archer, R. Bharadwaj, and R. Jeffords. Tools for constructing requirements specifications: The SCR toolset at the age of ten. International Journal of Computer Systems Science and Engineering, 20(1):19–35, 2005.
[85] C. Heitmeyer and R. Jeffords. Applying a formal requirements method to three NASA systems: Lessons learned. In 2007 IEEE Aerospace Conference, pages 1–10, 2007.
[86] C. Heitmeyer, J. Kirby, and B. Labaw. Tools for formal specification, verification, and validation of requirements. In Computer Assurance (COMPASS '97): Proceedings of the 12th Annual Conference, pages 35–47, 1997.
[87] C. Heitmeyer, J. Kirby, B. Labaw, R. Bharadwaj, et al. SCR*: A toolset for specifying and analyzing software requirements, 1998.
[88] C. Heitmeyer and J. McLean. Abstract requirements specification: A new approach and its application. IEEE Transactions on Software Engineering, pages 580–589, 1983.
[89] T. A. Henzinger, X. Nicollin, J. Sifakis, and S. Yovine. Symbolic model checking for real-time systems. Information and Computation, 111(2):193–244, 1994.
[90] M. Herlihy. The future of distributed computing: Renaissance or reformation? In Twenty-Seventh Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC 2008), 2008.
[91] S. Hester, D. Parnas, and D. Utter.
Using documentation as a software design medium. Bell System Tech. J., 60(8):1941–1977, 1981.
[92] T. Heyman, D. Geist, O. Grumberg, and A. Schuster. Achieving scalability in parallel reachability analysis of very large circuits. In Computer-Aided Verification (CAV), pages 20–35, 2000.
[93] G. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 1997.
[94] R. Jeffords and C. Heitmeyer. An algorithm for strengthening state invariants generated from requirements specifications. In Proceedings of the Fifth IEEE International Symposium on Requirements Engineering (RE '01). IEEE Computer Society, Washington, DC, USA, 2001.
[95] R. Jeffords and C. Heitmeyer. A strategy for efficiently verifying requirements. ACM SIGSOFT Software Engineering Notes, 28(5):28–37, 2003.
[96] B. Jobstmann, A. Griesmayer, and R. Bloem. Program repair as a game. In Computer Aided Verification (CAV), pages 226–238, 2005.
[97] S. Katz and K. Perry. Self-stabilizing extensions for message passing systems. Distributed Computing, 7:17–26, 1993.
[98] T. Kletz. Hazop and Hazan: Identifying and Assessing Process Industry Hazards. Institution of Chemical Engineers, 1999.
[99] F. Kröger. LAR: A logic of algorithmic reasoning. Acta Inf., 8:243–266, 1977.
[100] S. S. Kulkarni. Component-based design of fault-tolerance. PhD thesis, Ohio State University, 1999.
[101] S. S. Kulkarni and A. Arora. Automating the addition of fault-tolerance. In Formal Techniques in Real-Time and Fault-Tolerant Systems (FTRTFT), pages 82–93, 2000.
[102] S. S. Kulkarni, A. Arora, and A. Chippada. Polynomial time synthesis of Byzantine agreement. In Symposium on Reliable Distributed Systems (SRDS), pages 130–140, 2001.
[103] S. S. Kulkarni, A. Arora, and A. Ebnenasir. Software Engineering and Fault-Tolerance, chapter Adding Fault-Tolerance to State Machine-Based Designs. World Scientific Publishing Co. Pte. Ltd., 2007.
[104] S. S. Kulkarni and M. Arumugam.
Infuse: A TDMA based data dissemination protocol for sensor networks. International Journal of Distributed Sensor Networks, 2(1):55–78, 2006.
[105] S. S. Kulkarni and A. Ebnenasir. Enhancing the fault-tolerance of nonmasking programs. International Conference on Distributed Computing Systems, 2003.
[106] R. Kurshan and K. McMillan. A structural induction theorem for processes. In Proceedings of the Eighth Annual ACM Symposium on Principles of Distributed Computing, pages 239–247. ACM, 1989.
[107] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982.
[108] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 1982.
[109] K. Leino. Efficient weakest preconditions. Information Processing Letters, 93(6):281–288, 2005.
[110] L. Liu and E. Clarke. Optimization of busy waiting in conditional critical regions. 13th Hawaii International Conference on System Sciences, 1980.
[111] H. Mantel and F. C. Gärtner. A case study in the mechanical verification of fault-tolerance. Technical Report TUD-BS-1999-08, Department of Computer Science, Darmstadt University of Technology, 1999.
[112] K. L. McMillan. Symbolic model checking: an approach to the state explosion problem. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1992.
[113] K. L. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 1993.
[114] S. Meyer and S. White. Software requirements methodology and tool study for A6-E technology transfer. Technical report, Grumman Aerospace Corp., Bethpage, NY, 1983.
[115] K. Milvang-Jensen and A. J. Hu. BDDNOW: A parallel BDD package. In Formal Methods in Computer Aided Design (FMCAD), pages 501–507, 1998.
[116] J. Nimmer and M. Ernst. Automatic generation of program specifications. ACM SIGSOFT Software Engineering Notes, 27(4):229–239, 2002.
[117] C. Norris Ip and D. Dill.
Better verification through symmetry. Formal Methods in System Design, 9(1):41–75, 1996.
[118] S. Owre, J. M. Rushby, and N. Shankar. PVS: A prototype verification system. In D. Kapur, editor, 11th International Conference on Automated Deduction (CADE), volume 607 of Lecture Notes in Artificial Intelligence, pages 748–752, Saratoga, NY, June 1992. Springer-Verlag.
[119] P. Palady. Failure Modes and Effects Analysis. PT Publications Inc., 1995.
[120] D. Parnas and J. Madey. Functional documents for computer systems. Science of Computer Programming, 25(1):41–61, 1995.
[121] S. Qadeer and N. Shankar. Verifying a self-stabilizing mutual exclusion algorithm. In D. Gries and W.-P. de Roever, editors, IFIP International Conference on Programming Concepts and Methods (PROCOMET '98), pages 424–443, Shelter Island, NY, June 1998. Chapman & Hall.
[122] P. Ramadge and W. Wonham. The control of discrete event systems. Proceedings of the IEEE, 77(1):81–98, 1989.
[123] R. Ranjan, A. Aziz, R. Brayton, B. Plessier, and C. Pixley. Efficient BDD algorithms for FSM synthesis and verification. In IEEE/ACM International Workshop on Logic Synthesis, 1995.
[124] K. Raymond. A tree based algorithm for mutual exclusion. ACM Transactions on Computer Systems, 7:61–77, 1989.
[125] F. Somenzi. CUDD: Colorado University Decision Diagram Package. http://vlsi.colorado.edu/~fabio/CUDD/cuddIntro.html.
[126] T. Stornetta and F. Brewer. Implementation of an efficient parallel BDD package. In Proceedings of the 33rd Annual Design Automation Conference, pages 641–644. ACM, 1996.
[127] T. Stornetta and F. Brewer. Implementation of an efficient parallel BDD package. In Design Automation Conference (DAC), pages 641–644, 1996.
[128] O. Theel and F. Gärtner. An exercise in proving convergence through transfer functions. In Proc. 4th Workshop on Self-Stabilizing Systems, Austin, Texas, pages 41–47, 1999.
[129] T. Tsuchiya, S. Nagano, R. B. Paidi, and T. Kikuno.
Symbolic model checking for self-stabilizing algorithms. IEEE Trans. Parallel Distrib. Syst., 12(1):81–95, 2001.
[130] W. Vesely. Fault Tree Handbook. US Nuclear Regulatory Commission Report NUREG-0492, Washington, DC, 1981.
[131] P. Wolper and V. Lovinfosse. Verifying properties of large sets of processes with network invariants. In Proceedings of the International Workshop on Automatic Verification Methods for Finite State Systems, pages 68–80, London, UK, 1990. Springer-Verlag.