Michigan State University

This is to certify that the dissertation entitled "Reliable Parallel Processing: The Application Oriented Paradigm," presented by Bruce Malcolm McMillin, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

RELIABLE PARALLEL PROCESSING
THE APPLICATION ORIENTED PARADIGM

A Dissertation
By
Bruce Malcolm McMillin

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Department of Computer Science, 1988

ABSTRACT

RELIABLE PARALLEL PROCESSING
THE APPLICATION ORIENTED PARADIGM

By Bruce Malcolm McMillin

Fault-tolerance is of paramount importance in large distributed multiprocessor systems. The key to providing reliability for this necessarily applications oriented environment lies in low cost and easily programmable solutions. This motivates consideration of a new class of fault tolerance techniques - the Application Oriented Paradigm. This dissertation provides both the basis for construction of the requisite application oriented constraint predicates and the necessary tools for a reliable, consistent distributed diagnosis in the Byzantine failure environment. The approach taken here is novel in the sense that a Constraint Predicate is formulated from a basis set of design metrics which embody desirable facets of the solution theory. The predicate is then an application oriented abstraction of the necessary fault tolerance. Reliable, consistent distributed diagnosis appears as part of the environment to the programmer. Both Byzantine Agreement and syndrome testing are treated as diagnostic techniques. The result of this dichotomization of fault tolerance is a provable and usable environment for Reliable Parallel Processing.

© Copyright by Bruce Malcolm McMillin, 1988

To Patricia

ACKNOWLEDGEMENTS

I wish to thank the members of my Ph.D. committee for their guidance and support during my Ph.D. program. I would particularly like to express my appreciation to A. Wojcik for his leadership of the Department, to A. Esfahanian for lending an ear, and mostly to L. Ni for providing a nearly optimal environment for his Ph.D. students. I also wish to thank both Dr. Wojcik and Dr. Fox for their careful reading of this dissertation and Dr. Hughes for his insightful comments.

To my parents, Kenneth and Phyllis McMillin, I express my appreciation for the early encouragement they gave that enabled me to pursue this endeavor.

I would like to acknowledge all the people who touched my life while I was at Michigan State. Too numerous to mention all, in particular I wish to acknowledge my good friends Ray and Atsuko Klein, Pat Flynn, Eric Wu, Yew-Huey Liu, Chung-Ta King, Youran Lan, and Taieb Znati.

To my wife Patty, for whom I hold the greatest love, no further words need be said.

This work was supported in part by the DARPA ACMP project, in part by the State of Michigan RE/ED project, in part by a Michigan State University Graduate Office Fellowship, and in part by the GTE Graduate Research Fellowship Program.
Table of Contents

List of Tables
List of Figures

Chapter 1: Introduction
1.1 Modern Fault Tolerance
1.2 Background On Reliable Systems
1.3 Fault Tolerance Techniques
1.4 Resource Failure Models
1.5 Distributed Memory Multiprocessors
1.6 The Application Oriented Fault-Tolerance Paradigm
1.7 Executable Assertions
1.8 Direction Of This Thesis
1.9 Thesis Outline

Chapter 2: Motivation And Problem Statement
2.1 Hardware Fault-Tolerance
2.2 Software Fault-Tolerance
2.3 Problem Statement
2.4 Chapter Summary

Chapter 3: A New View Of The Fault Detection Problem
3.1 The Distributed Fault Detection Model
3.2 Programmer/System Issues
3.3 Chapter Summary

Chapter 4: Constraint Predicates - A Programmer Level View
4.1 Definition
4.2 Basis Metrics
4.3 Chapter Summary

Chapter 5: Distributed Diagnostic Basis
5.1 Syndrome Testing And Diagnostic Impossibilities
5.2 Byzantine Generals Problem
5.3 Diagnostic Comments
5.4 Chapter Summary

Chapter 6: Performance Evaluation
6.1 Expected Error Coverage
6.2 Run Time Overhead Estimation
6.3 Chapter Summary

Chapter 7: Matrix Problem And Iterative Techniques
7.1 Matrix Problem Specification
7.2 Natural Problem/Solution Constraints
7.3 Error Coverage Modeling For Matrix Iterative Analysis
7.4 Run Time Performance
7.5 Comments And Limitations

Chapter 8: Relaxation Labeling
8.1 Parallelized Weighted Relaxation Labeling
8.2 Constraint Predicate
8.3 Reliable Parallel Algorithm For Relaxation Labeling
8.4 Chapter Summary

Chapter 9: Parallel Bitonic Sorting
9.1 Bitonic Sorting
9.2 Faulty Behavior
9.3 Reliable Bitonic Sorting Algorithm Development
9.4 Error Coverage And Resilience
9.5 Time And Space Complexity
9.6 Chapter Summary

Chapter 10: Directions For Future Research
10.1 Summary of Major Contributions
10.2 Configuration Control
10.3 Automated Constraint Predicate Generation
10.4 Application Applicability

References

List of Tables

Table 1.1. Fault Models
Table 3.1. Constraint Predicate Size Classification
Table 6.1. DMMP Assumptions
Table 6.2. Run Time Comparisons
Table 7.1. Predicate Subclass Membership

List of Figures

Figure 1.1. Alternative paths to achieve reliability.
Figure 1.2. Fault Model Relationships.
Figure 1.3. Distributed-Memory Multiprocessor System Architecture.
Figure 1.4. The Reliable Parallel Processing Model.
Figure 2.1. Expected run time for a serial-reliability parallel algorithm.
Figure 3.1. Splitting the Detection Problem.
Figure 4.1. Abstraction of the Constraint Predicate.
Figure 5.1. Fault Models.
Figure 5.2. Syndrome Undiagnosable by Yang and Masson Algorithm.
Figure 5.3. Byzantine Agreement (P1 is faulty).
Figure 5.4. Algorithm Agree.
Figure 5.5. Example of Algorithm Agree for n = 4, t = 1.
Figure 6.1. Expected run time comparison.
Figure 7.1. Relaxation Skeleton.
Figure 7.2. Sample Matrix for Au = v.
Figure 7.3. Feasibility Event and Conditional Distribution of Errors.
Figure 7.4. Error Coverage by Solution Step Count.
Figure 7.5. Error Coverage by Solution Step Count, Theorem 7-7 Applied.
Figure 7.6. Mapping for q = 25.
Figure 8.1. Algorithm RL_NR.
Figure 8.2. Algorithm projection-operator.
Figure 8.3. Non-maximizing movements.
Figure 8.4. Movement outside of feasibility cone.
Figure 8.5. Zigzagging - Convergence but no Finiteness.
Figure 8.6. Reliable relaxation labeling algorithm RL_R.
Figure 9.1. Compare-Exchange Step.
Figure 9.2. Algorithm S_NR.
Figure 9.4. Algorithm S_R.
Figure 9.5a. Φ_P - Progress Component.
Figure 9.5b. Φ_F - Feasibility Component.
Figure 9.5c. Φ_C - Consistency Component.
Figure 9.6. Example of S_R for n = 3.
Figure 9.7. Sorting Time Comparisons.
Figure 9.8. Projected Sorting Time Comparisons - Large Systems.

Chapter 1
Introduction

Massively parallel distributed-memory multiprocessors (DMMPs) have gained much attention recently due to their unique scalability in providing massive computing power.
Various hypercube multiprocessors, such as the Ncube, iPSC, and FPS T-series, are notable examples of commercially available DMMPs [ShFi88, Hwan87]. Design of parallel algorithms for DMMPs has been a major research issue in the application domain. However, the design of reliable parallel algorithms has received little attention. As the size of DMMPs grows into thousands of processors, and the corresponding scale of attempted application problems grows even faster, solving the problem of providing fault-tolerance is of paramount importance to the greater use of parallel processing.

Traditional fault-tolerance techniques are mainly geared to providing an ultra-reliable environment [HoSL78, Wens78, Lala86]. The ultra-reliable environment is necessary when loss of property or life is the result of a failure. DMMPs often operate under less stringent reliability requirements. However, a failure, while not catastrophic, may increase application turnaround time to unacceptable levels (see Chapter 2). The number of components of a large-scale DMMP makes it particularly susceptible to failure. To provide the appropriate level of reliability, it is necessary to re-examine the notion of fault-tolerance and how to provide it.

Reliability should be treated as part of the operating environment, much as the compilers, libraries, and operating system appear to the programmer. Ideally, reliability is entirely transparent to the programmer. However, the cost of techniques to furnish this transparency can be prohibitive, or the techniques themselves infeasible. Some of the existing techniques to provide the necessary reliability include the use of better components (Fault Avoidance), stopping the system when a fault occurs (Fault Detection), running in the presence of faults (Masking Redundancy), or removing failed components from the system (Reconfiguration). For DMMPs, which are constructed from "off the shelf" building blocks, the last two techniques are most applicable. As we shall see, these two techniques require communication and agreement among the components of the DMMP. This agreement is commonly obtained by Voting.

1.1 MODERN FAULT TOLERANCE

Early work in computer system fault tolerance was concerned with failures of individual components. Much of the work was thus concerned with gate-level issues such as "stuck-at faults" [KrTo81] and other low-level issues in processor design. This was primarily owing to the single CPU employed and its expensive hardware components. In modern systems, and particularly in massively parallel DMMP design, failed components are considered at the granularity of the processor, the memory, and the bus [Wens78]. This means that when a failure occurs in a component, we wish to isolate its effects on the remaining portion of the system. We are not concerned with how the component failed, simply that it did so. When a component fails, we expect to have enough other similar components to take its place. These components, in the interest of low cost, are typically off-the-shelf processor, disk, and memory designs. The desired reliability may not be built into these components. Thus we build the desired level of redundancy through system level algorithms to coordinate these components. Finally, as systems become more complex, low-level analysis of fault-tolerance and low-level redundancy design are not feasible.
1.2 BACKGROUND ON RELIABLE SYSTEMS

Early work in the area of computer system reliability was focused in the space program and in the telephone switching environment. The OAO satellite computer [Kueh69] used discrete component redundancy. This technique became outmoded, as with integrated circuits the probability of multiple transistors being destroyed is about the same as for a single transistor [Renn84]. The initial Bell ESS design (for telephone switching purposes) uses two-way redundancy [Toyw78]. Two computers perform the application in parallel. If one computer failed, the independent results would disagree and the entire ESS would halt. The advantage of this approach is that an incorrect output is never produced. A more modern example of computer fault-tolerance is in the Apollo guidance system. This system used triple modular redundancy (TMR), in which three times the necessary resources are employed. Three processors run the same program and vote to mask errors in any single processor [AnMa67]. The JPL STAR is a computer designed for deep space exploration where human maintenance is impossible. It has a large number of functional units with spares which can be switched in automatically to replace a failing component [Aviz71]. These designs form a basis for more modern fault-tolerant architectures.

1.2.1 Fault-Tolerant Multi-Processor (FTMP)

The Fault-Tolerant Multi-Processor (FTMP) [HoSL78] is an extension of early spacecraft designs. The basis for the design of the FTMP is to mask all errors from the applications software. This is advantageous since no software rollbacks are required to fix errors. This masking is accomplished through a TMR design extension known as Parallel Hybrid Redundancy. A pool consisting of multiple triads of modules such as processors, memory, and busses comprises the system. Each triad has a voting element which votes on the data presented to it in triplicate. A module may be retired, regrouped into another triad, or assigned to perform maintenance diagnostic testing. A module is retired when it has been found to have failed, at which time a spare is switched in its place. If no spares are available, the triad is retired. Since the hardware resources are allocated from a pool, the application never feels the effects of these reconfiguration activities.

Voting at the hardware level is done bit by bit in the tightly synchronized approach adopted. Synchronization is achieved through keeping the clocks of each autonomous processor in absolute synchrony. This is accomplished by the use of TMR communication between the clocks at the hardware level. The voting elements detect failures in a 2-of-3 vote. Another controller, called a bus guardian, isolates or reconfigures around the failing module. If a bus is faulty, the bus guardians are directed to switch to another bus. If a processor is found to be faulty, it is isolated from the bus by the low level logic of the bus guardian.

Voting is not the only means employed in the FTMP for fault-detection. Modules may be assigned in a maintenance mode to exercise the system to attempt to find failed components and to remove them as quickly as possible. This testing checks each processor against multiple busses to attempt to narrow down the actual location of the fault, either processor or bus.

The advantage of this tight synchronization, other than the ability to mask faults at the hardware level, is that no buffering, synchronization primitives, or completion intervals are required.
This implies a speed improvement over a machine which employs loose synchrony. There are, however, disadvantages to this method of low level voting. A voter or bus guardian failure can cause a catastrophic hardware failure that will probably defy any method of software correction. The tight synchronization also has limitations that will be discussed in a following section.

1.2.2 Software Implemented Fault-Tolerance (SIFT)

A contrasting approach to the FTMP, using loose synchrony, is the Software Implemented Fault-Tolerant computer (SIFT) [Wens78]. The SIFT was designed to serve in a similar fault tolerant capacity as the FTMP. The SIFT uses multiple busses and modules; however, the modules (processors) are off-the-shelf components. The SIFT concept is currently employed in state of the art systems such as the space shuttle [Skla76].

The loose synchrony of the SIFT allows it to run as a loosely coupled multiprocessor. Non-critical applications can take advantage of this multiprocessor speedup. For critical applications, the SIFT can form into a redundant configuration using TMR redundancy. Fault isolation is obtained by physical isolation of failed hardware components. Fault masking is enhanced through the use of multiple busses over which multiple copies of the data are transmitted. Voting takes place as a function of the software. Critical tasks in the SIFT are required to be iterative tasks such that voting is done on the state of data before each iteration. This guarantees that each processor is working on the same loop. If an error is discovered in the vote on this state of the system, the software attempts to locate and remove the faulty component. This is done by changing the bus of the presumed faulty processor. If the fault follows the move, the processor is retired. If the fault disappears, the bus is retired. Thus the SIFT can configure around the fault. This detection, however, is not complete. Each processor makes a report of failed components to the runtime supervisor. Depending on the type of fault, it may not be decidable which processor or bus is actually failing. However, the system can continue to run in a fault masking mode.

The clocks in the SIFT are not synchronous. However, for reasons that will be made clear in Chapter 5, the clocks must be kept in partial synchrony. With respect to the hardware, each processor has an independent clock. These clocks are resynchronized occasionally by an algorithm known as a clock skew reset. This algorithm is essentially a voting algorithm which requires at least four clocks to mask one fault [PeSL80]. Low level hardware correction is not employed in the modules of the SIFT. Through modeling work done by the system designers, only a slight increase in reliability would be gained by adding this hardware support over what is achieved through the voting.

The loose synchrony and high level error correction of the SIFT, as will be seen shortly, can provide a basis for the construction of reliable distributed systems.

1.2.3 Space Shuttle Computer Control System

The space shuttle computer control system is of an ultra-reliable design, as are the FTMP and SIFT. The important difference is that the space shuttle computers are constrained to be off-the-shelf processing systems, specifically IBM AP-101 CPUs [Skla76]. Thus the space shuttle design is consistent with the more modern notion of fault-tolerant systems outlined at the beginning of this chapter.
The hardware configuration of the space shuttle system consists of five computers. Four of the computers are programmed identically to perform flight critical functions such as guidance, navigation, and control. The outputs of the four computers are voted in the control actuators, with the inputs to the calculations coming from multiple input sensors. Faults are not simply masked. The flight crew must be notified of a failed computer. Furthermore, the faults are isolated; that is, one computer failure cannot cause another computer to erroneously identify itself as faulty. This allows the crew to re-allocate resources such that the system may continue to operate in the presence of up to 3 failed computers. In the case where only two computers are left operational, fault identification is not possible; however, fault detection is possible in the sense of the Bell ESS systems. A fault occurrence can be detected but not located.

The fifth computer is programmed to the same set of specifications as the four primary computers, but is implemented by a different contractor [Renn84]. This allows detection of software design and coding errors, thus implementing a degree of software error detection. In the case of a disagreement, the correct version of the software must be determined by the crew, and that version then taken as correct.

The space shuttle provides both hardware and software fault tolerance, continued operation in the presence of hardware failures, and detection of implementation problems in the software. As we shall see in Chapter 2, these two facets of fault-tolerance also form the basis for the application-oriented fault-tolerance paradigm described in this thesis.

1.3 FAULT TOLERANCE TECHNIQUES

The previous introductory examples have shown the origins of fault-tolerant computing design. A more rigorous treatment of these techniques is presented in this section. System reliability can be achieved through the alternative event paths shown in Figure 1.1 [Siew82, KuRe86, YaHS84].

Figure 1.1. Alternative paths to achieve reliability.

1.3.1 Fault Detection

Reliable fault detection informs a trusted observer or observers when a fault occurs. In a DMMP there may be no centralized observer, or use of a centralized controller may be prohibitive in terms of run time penalty. However, for hardware fault-tolerance, trusted observers may simply be a failed component's peers. If the number of faulty components in the system is a bounded design parameter, then sufficient non-faulty components are always available for reliable detection. In the distributed environment, during problem execution, it is not known by all components which components are faulty and which are reliable, leading to a possibly inconsistent diagnosis.

The problem of reliable hardware fault detection is twofold: the achievement of a consistent diagnosis, and the achievement of a correct diagnosis through appropriate testing. Two possible techniques to achieve this consistent diagnosis are presented in Chapter 3. Software fault detection, assuming a reliable hardware execution environment, need not be concerned with consensus on the location of a failed component. A single test/processor, upon detection of a failed component, can reliably signal the error instance. In both the hardware and software failure environments, generation of an appropriate test is the common task.

1.3.2 Fault Masking

Reliable fault masking uses non-faulty elements to compensate for the effects of faulty elements. N-Modular Redundancy (NMR), a more general case of the TMR employed in the Apollo system, is a common technique to mask failures. The idea is to take a majority vote on a calculation replicated N times. This, of course, is expensive in terms of either hardware or run-time cost. A hardware solution requires N times the number of processors as well as a reliable voter. Software solutions require each processor to run N copies of surrounding computations and then vote on the result. This slows down the computation by at least a factor of N. However, as described earlier, NMR is required in ultra-reliable systems such as the Apollo spacecraft control, the FTMP and SIFT aircraft control systems, and the FTP reactor control system [Siew82, Hopk77, HoSL78, Renn84, Wens78, Lala86].
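For concreteness, the majority vote at the heart of NMR can be sketched in a few lines of C. The code below is an illustration only - the replica values and the vote interface are invented for the example, not drawn from any of the systems described above.

    #include <stdio.h>

    /* N-way majority vote over replicated results.  Returns 1 and
     * writes the majority value when a strict majority exists,
     * 0 otherwise (a tie is reported as a failed vote). */
    int nmr_vote(const int results[], int n, int *winner)
    {
        for (int i = 0; i < n; i++) {
            int count = 0;
            for (int j = 0; j < n; j++)
                if (results[j] == results[i])
                    count++;
            if (2 * count > n) {        /* strict majority masks faults */
                *winner = results[i];
                return 1;
            }
        }
        return 0;
    }

    int main(void)
    {
        int r[3] = {42, 42, 7};         /* TMR: one faulty replica */
        int v;
        if (nmr_vote(r, 3, &v))
            printf("majority result: %d\n", v);   /* prints 42 */
        else
            printf("vote failed: no majority\n");
        return 0;
    }

With N = 3 (TMR), any single faulty replica is outvoted; requiring a strict majority ensures that an even split is reported as a vote failure rather than silently accepted.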
1.3.3 Reconfiguration

Reliable reconfiguration removes the faulty component from the system, either logically or physically. Centralized reconfiguration control is the easiest to implement. However, this technique may be unacceptable in a distributed system because it introduces a single point of failure, namely at the controller. Decentralized reconfiguration control is more appealing but harder to implement. [Garo82] proposed the idea of a local coordinator to oversee the reconfiguration process. The local coordinator must be elected in a reliable election. This can be done via a voting process. The local coordinator then attempts to map in a spare component or to assign the job from the failed component to another functioning component.

When a hardware component fails, it may have suffered only a transient failure and can be reinstated into the system if the operation, when retried, succeeds. If the component is found to have suffered a permanent failure, then it can be replaced by an identical copy. Software failures are completely different. Since software design/coding errors are repeatable, once an error is detected, replacing the copy with another identical copy is futile, since the same error will result. Thus the erring algorithm must be replaced by an alternate algorithm. Unlike the hardware failure environment, for the software environment it may be desirable to return to the primary algorithm after the failed problem instance has been completed by the alternate algorithm.

1.3.4 Recovery

Reliable recovery involves rolling back the computation to a suitable checkpoint and restarting from that point. A local coordinator may assume this function and initiate the rollback. The checkpoints may be held within the reliable components or held by the coordinator. Once the system is recovered, it ignores any messages from a failed hardware component.

1.4 RESOURCE FAILURE MODELS

The previous four types of fault-tolerance techniques are all a function of the particular fault model under consideration. Resource failures can be classified (ranging from most to least restrictive) by the classification given in Table 1.1. The relationship between the models is shown in Figure 1.2.

1) Fail-Stop: A failing resource detects its own failure and stops. A Fail-Stop resource never generates spurious or incorrect messages. The resources connected to a stopped Fail-Stop resource immediately know that it has stopped. This is the easiest form of fault to detect, but is probably the hardest for the resource to implement.

2) Hard: The resource, when tested by a non-faulty testing unit, always responds with its correct status, faulty or non-faulty.
This is the fault model assumption made in the PMC [PrMC67] model of fault diagnosis, which will be discussed further under syndrome testing.

3) Intermittent: A resource, when tested by a non-faulty testing unit, may or may not respond correctly, without restriction.

4) Byzantine: Byzantine resources encompass the universal class of faults. Any failure to meet specifications is a Byzantine fault. A Byzantine fault may generate incorrect or even malicious results to a test query or during algorithm participation.

Table 1.1. Fault Models

Figure 1.2. Fault Model Relationships.

1.5 DISTRIBUTED MEMORY MULTIPROCESSORS (DMMPs)

A DMMP (Figure 1.3) is a system of N autonomous MIMD (Multiple Instruction Multiple Data stream) processors, each with its own local memory. These processors communicate with each other via messages over an interconnect. This message passing takes non-negligible communication time. Examples of commercially available DMMPs were presented in the introduction to this chapter. These systems currently are on the order of thousands of processors. SIMD (Single Instruction Multiple Data stream) architectures such as the Connection Machine [Hill85] have 65 thousand processors. It is not hard to make the transition to DMMPs of at least this scale.

The Intel iPSC and Ncube computers make a distinction between host processors and node processors. Both have an additional host processor which is used for program downloading/uploading and data transfer.

Figure 1.3. Distributed-Memory Multiprocessor System (DMMP) Architecture: a host/I/O processor connected to node 0 through node N over an interprocessor connection.

Note that while the host has a connection to each node, the number of nodes precludes the host processor from becoming involved in the nodes' calculations. Design of a support environment requires consideration of the system characteristics. The primary issues of concern here are the locality of information and the nature of distributed control.

1.5.1 Locality of Information

The lack of shared data structures, or more precisely the inability of a particular testing processor to examine another's internal memory, reduces testing capabilities. A test can only make a decision based on information from the local internal state of the computation and the messages it receives from other processors (peers) in the computation. While privacy of information can be a drawback, it also enhances the fault-tolerance of a DMMP. In a faulty DMMP, corruption of the environment by a particular processor is limited to the messages it sends. Contrast this with the shared memory model of multiprocessing, in which a processor can undetectably corrupt a computation at any time through disruption of the shared store. Thus correctness assertions need only consider the message exchange. This considerably enhances the tractability of the assertion development problem.

To categorize the amount of information available in the distributed environment, the terms "White Box" and "Black Box" are employed. These terms are used in the software engineering environment to dichotomize the two approaches to coverage test generation. Here, in the white box environment, the participants in the distributed environment have perfect information. The complete internal state of a particular processor is known to other processors. This is consistent with the shared memory model of multiprocessing but is inconsistent, due to infeasibility, with the DMMP model.
The black box environment is characterized by the true DMMP. Participants in the distributed environment have, as their only information, local state information and received messages.

1.5.2 Control

The large size of a DMMP dictates that operation under a centralized control scheme is infeasible due to the performance bottleneck in the central control unit. Furthermore, from a reliability standpoint, a centralized controller introduces the undesirable single point of failure. Thus, fostered by the locality of information and the size of DMMPs, individual processors function together under distributed control. The locality of information, coupled with non-negligible communication delays between the processors, results in a lack of state consistency for any particular snapshot of the system. Since processors in DMMPs are of the MIMD type, there may be no synchronization, or at best partial synchronization, between the units. Any algorithm to provide fault tolerance must be able to function in the presence of imperfect information, temporarily inconsistent information, and under decentralized control.

1.6 THE APPLICATION ORIENTED FAULT-TOLERANCE PARADIGM

Of the three problems of reliable fault detection, reconfiguration, and recovery, the fault detection problem is conceptually the hardest. Key to the understanding of the problem is the notion of a fault and the ability of a test to find its occurrence. In off-line system testing, some test pattern is applied to the components of the system. The expected behavior is known, and any deviation from this is flagged as a fault. Systems designed for this type of test fall under the study of Design for Testability [McCl85]. The internal circuitry is designed such that individual components can be tested and failed ones located through application of test patterns. Traditionally this kind of analysis has concentrated on low level gate issues.

The problems with off-line testing for faults are twofold. A component that exhibits transient failures may pass an instance of a test. Component exercising attempts to catch these errors by repeatedly testing components during periods of system idle time; however, this off-line testing may not catch transient errors that occur during the actual operation of the system. The second problem is that a test may not be complete. A complex system may have a sufficiently large number of states that generation of a complete test may be infeasible or impossible.

Huang and Abraham in [HuAb84, Huan83] proposed the idea of Algorithm Based Fault Tolerance, in which checksum encodings are embedded in the data. This is of dubious importance as a fault detection technique, as most computing systems already contain some sort of checksumming and error correction at the bit level. One paper [HuAb84], however, proposed the idea of using properties of the solution as acceptance tests. This method is deemed elegant by the following observation:

Since we test the intermediate results for correctness with respect to the algorithm, the end solution is correct if the intermediate results are correct. If processor errors occur that do not affect the solution, then they are not errors.

For application oriented reliability, this method is clearly superior to component exercising or other off-line tests. The test set consists of only what is necessary to ensure correctness for the current application.
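As an illustration of such a solution-property acceptance test, consider checking a claimed solution x of a linear system Ax = b against the defining property itself, rather than re-running the solver. The C sketch below is hypothetical - the 3x3 system, the tolerance EPS, and the function name accept_solution are invented for the example:

    #include <math.h>
    #include <stdio.h>

    #define N   3
    #define EPS 1e-9

    /* Accept a claimed solution x of Ax = b only if every row's
     * residual is within tolerance; the test uses a property of the
     * solution, not a re-execution of the solver. */
    int accept_solution(const double A[N][N], const double x[N],
                        const double b[N])
    {
        for (int i = 0; i < N; i++) {
            double r = -b[i];
            for (int j = 0; j < N; j++)
                r += A[i][j] * x[j];        /* residual of row i */
            if (fabs(r) > EPS)
                return 0;                   /* reject: property violated */
        }
        return 1;                           /* accept */
    }

    int main(void)
    {
        double A[N][N] = {{2,0,0},{0,3,0},{0,0,4}};
        double b[N]    = {2,6,12};
        double x[N]    = {1,2,3};               /* correct solution  */
        printf("%d\n", accept_solution(A, x, b)); /* prints 1        */
        x[1] = 2.5;                               /* corrupt a value */
        printf("%d\n", accept_solution(A, x, b)); /* prints 0        */
        return 0;
    }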
The work in application oriented fault tolerance of this type comprises only a small portion of Huang's thesis [Huan83], and the detection method proposed is incomplete. However, the concept is excellent.

1.7 EXECUTABLE ASSERTIONS

The Assertion is the basic unit of program verification [Somm82], software fault-tolerance [Rand75], and the application oriented fault-tolerance paradigm described in this thesis. Development of these assertions is straightforward for the sequential program execution environment (see [YaCh75, Andr79, Somm82, HuAb87] for examples). Assertions are primarily generated from the coding and design process. An assertion is a statement of the form:

if not ASSERTION then ERROR;

where ASSERTION is a Boolean invariant on the program state. Assertions integrated into the program code are called Executable Assertions. If an assertion fails due to some abnormality, whether hardware or software, the ERROR condition is raised and appropriate action taken. [Stuc77] discusses the use of executable assertions for software fault tolerance, but there is nothing to limit their use in hardware testing. Assertions are typically outgrowths of the design process, as in the Recovery Block Model of Randell [Rand75]. Design generated assertions are desirable from the aspect that they may often be generated automatically from the design language [HuAb87].

In the sequential programming environment, each result of a statement is deterministically a function of the current program state and the statement executed. Procedures and function calls may be handled in one of two ways. Either the function may be considered as a "meta-statement" which is a function of the calling program's state, or the statements internal to the called function may be expanded inline to the calling program. In any case, assertions are made using the state of a single program and its statements. Careful choice of these assertions, particularly those involving loop invariance, leads to a high degree of confidence in both the program's termination and correctness.

1.8 DIRECTION OF THIS THESIS

Large scale multiprocessors are a product of this decade. Little if any consideration has been given to the system aspect of reliability. As these systems come into widespread usage, and as their size and resulting complexity continue to grow, reliability will continue to be a major issue and a viable research topic. This research represents the first such attempt at specifying a reliable parallel processing environment for large scale multiprocessors. This environment is characterized by the Reliable Parallel Processing Model of Figure 1.4.

Figure 1.4. The Reliable Parallel Processing Model (application layers atop configuration layers atop system layers).

The applications programmer need only specify an abstraction of the necessary reliability information at the upper levels. The system environment consists of tools at the lower levels of the model that minimize the burden of reliability programming on the applications programmer. The configuration layers enable the reliable parallel processing model to be employed in various system topologies.

1.9 Thesis Outline

This introduction has provided a historical basis of reliability and addressed some of the issues that are considered in the following chapters. Chapter 2 gives a motivation for the study of new techniques for fault-tolerance.
It describes a probabilistic model of expected run time in a large system in the presence of failures as a motivation to provide hardware fault-tolerance. The need for software fault tolerance, in the form of executable assertions for the parallel environment, is perhaps even stronger motivation for this study than hardware fault-tolerance.

Chapter 3 discusses the fault detection problem, temporally the first problem encountered and perhaps the most difficult to solve.

Chapter 4 presents a formal classification of the basis metrics which can be applied to the problem at hand to generate the necessary executable assertions, which compose a structure called a constraint predicate. The usage of the constraint predicate is detailed both as a hardware fault-detection tool and as a software verification/fault-detection tool.

Chapter 5 presents the underlying distributed diagnostic basis and its ideal virtual machine interface to the constraint predicate. Techniques considered are Vector Byzantine Agreement, a new form of Byzantine Agreement, and Distributed Syndrome Testing.

Chapter 6 presents a set of tools and techniques for analysis of both the coverage and run time performance of a generated constraint predicate.

Chapters 7 through 9 present three examples of constraint predicate development. The first is a common numerical technique for solving very large systems of equations on a parallel processor. The second is a reliable parallelized form of a non-numerical algorithm for computer vision. The third is a reliable sorting procedure. These chapters also deal with programming experiences and implementation details of the algorithms.

Chapter 10 presents directions for future research. These include the outline of a reconfiguration/mapping control module which isolates the application level from topological changes. Directions for localized reliable distributed reconfiguration and recovery are presented.

Chapter 2
Motivation and Problem Statement

The application oriented fault-tolerance paradigm is a reliability method applicable both to tolerance of hardware faults and software faults. The motivational forces behind consideration of each are different.

2.1 HARDWARE FAULT-TOLERANCE

A system is said to be hardware fault-tolerant if, under a bounded number of faulty components, the system continues to satisfy its operational requirements. This bound is obtained through examination of the reliability of the components of the system. Periodic testing of the system will require additional time to perform the various bounds checking and message interchanges associated with the test. For small problems, this overhead may not be necessary. If the system has a high reliability and the problem incurs no more than a short execution time, the probability of an error occurring during the execution may be very small. For small problems, it may be best to "take your chances" and run the job with no fault-tolerance. However, long run times with large numbers of processors may not be able to finish before an error occurs. Indeed, the operational requirements of the GF11 project at IBM [Agar88] specify the execution of algorithms that can run as long as a year on a 500+ processor machine! The expected time between errors is called the Mean Time To Failure (MTTF). If the problem run time is long with respect to the MTTF, then reliability techniques are indicated.
If the problem run time is short with respect to the MTTF (as in the "small" problem above), reliability techniques may not need to be used. Furthermore, when an error occurs, the problem must be restarted from the beginning. If the MTTF is short, the job may never complete. Reliability modeling can provide a basis for quantification of these concepts. We must make some assumptions about the type of problem and solution being attempted.

Model Assumptions:
- Serial Reliability: nearest neighbor iterative problem - all processors must function.
- i.i.d. Exponential: processor failures are independent, identically exponentially distributed with parameter μ.

2.1.1 Failure Model

As noted in the assumptions, processor MTTFs are given by the random variable X and are i.i.d. exponential with parameter μ. The probability density function of X is given by:

f(t | μ) = (1/μ) e^(−t/μ),  t > 0   (2.1)

The conditional probability that an individual processor fails at some time later than T is given by:

1 − F(T | μ) = ∫_T^∞ f(t | μ) dt = e^(−T/μ)   (2.2)

The system MTTF is given by the random variables X_1, X_2, ..., X_N, where N is the total number of processors involved in the problem solution. Since we have serial reliability, the system fails when one component fails. This is given by the random variable Y:

Y = min{X_1, X_2, ..., X_N}   (2.3)

Let the conditional cumulative distribution function of Y be G(t | μ). Since the X_i's are independent,

1 − G(t | μ) = ∏_{i=1}^{N} (1 − F(t | μ)) = e^(−Nt/μ)   (2.4)

Thus 1 − G(t | μ) is an exponential with parameter μ/N. For example, with N = 1000 processors each having an MTTF of one year, the system MTTF μ/N is under nine hours.

2.1.2 Expected Run Time

The expected run time of a non-fault-tolerant system is constructed as follows. If the system does not fail before time T, then the job runs in time T. If the system fails before T, then the job must be restarted, and the expected run time is the time spent before the failure plus another expected run time (since the exponential distribution has the memorylessness property). The expected run time E(R) is given by

E(R) = T · Pr[Y > T] + (E[Y | Y ≤ T] + E(R)) · Pr[Y ≤ T]   (2.5)

…

Theorem 4-1: Let a sorting algorithm S take an input list I = (I_0, ..., I_{N−1}) and produce an output list O = (O_0, ..., O_{N−1}). (1) If O is not a permutation of I, or (2) if O_j > O_{j+1} for some j, then the result O produced by S is incorrect.

Proof: For part (1), if either there is no j such that O_i = I_j for some i, or there is no i such that I_i = O_j for some j, then the output list O is not a permutation of I, thus violating Definition 4-2. For part (2), the output list must appear in a non-decreasing manner, as an ascending sort is being performed. □

From Theorem 4-1, the constraint predicate Φ_PS follows immediately.

Predicate Φ_PS (I, O, N) returns bool;
    for i := 0 to N−1
        if O_i ≠ I_j for every j then ERROR;
    for i := 0 to N−1
        if I_i ≠ O_j for every j then ERROR;
    for j := 0 to N−2
        if O_j > O_{j+1} then ERROR;
end.

Φ_PS is correct since it flags any unsorted output list as an error, and it is complete since any output list that is not a permutation of the input list constitutes an error. However, this predicate can only be applied at the termination of the sorting calculation. If an error occurs early in the calculation, this predicate cannot find it. Thus the fault latency is long. We shall revisit sorting predicates in Chapter 9.

Condition (3) is necessary to prevent unnecessary loss of computing resources. An individual component of a constraint predicate might perform an incomplete subtest - that is, the subtest allows faulty behavior to pass. It will never, however, perform an incorrect test in which a correct result is flagged as faulty. At some stage of testing, the subtests will either progress sufficiently to actually flag the error, or the actual error will be caught by another test that comprises the constraint predicate. The intersection of all such tests, each of which is correct, composes a correct constraint predicate.
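Returning to the sorting predicate Φ_PS, a direct C rendering makes it concrete. The function name phi_ps and the integer keys are illustrative choices, and the O(N^2) membership scans deliberately follow the pseudocode rather than an optimized permutation check:

    #include <stdio.h>

    /* Phi_PS: O must occur in I, I must occur in O, and O must be
     * non-decreasing.  Returns 1 if the output passes, 0 on ERROR. */
    int phi_ps(const int I[], const int O[], int N)
    {
        for (int i = 0; i < N; i++) {       /* every O_i occurs in I */
            int found = 0;
            for (int j = 0; j < N; j++)
                if (O[i] == I[j]) { found = 1; break; }
            if (!found) return 0;           /* ERROR */
        }
        for (int i = 0; i < N; i++) {       /* every I_i occurs in O */
            int found = 0;
            for (int j = 0; j < N; j++)
                if (I[i] == O[j]) { found = 1; break; }
            if (!found) return 0;           /* ERROR */
        }
        for (int j = 0; j + 1 < N; j++)     /* ascending order       */
            if (O[j] > O[j + 1]) return 0;  /* ERROR */
        return 1;
    }

    int main(void)
    {
        int I[] = {3, 1, 2}, good[] = {1, 2, 3}, bad[] = {1, 3, 2};
        printf("%d %d\n", phi_ps(I, good, 3), phi_ps(I, bad, 3)); /* 1 0 */
        return 0;
    }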
Two issues are important in constraint predicate development for the distributed parallel environment. Granularity, as mentioned in Chapter 2, is the first. A more general issue, common to executable assertions, is coverage. We shall treat the granularity issue now and delay the issue of coverage until Chapter 6.

Problem mapping to the DMMP environment is usually done in the Data Parallel [HiSt86] style. Each processor executes the same portion of an algorithm on different data. For maximum parallelism to be obtained from this type of mapping, the problem must contain a natural structure that allows the parallelism to be achieved. Furthermore, the identifiable points of parallelism will naturally fall at message exchange boundaries for data parallel algorithms. Thus the constraint predicate is more closely allied with the problem than with any specific solution. This relationship is shown in Figure 4.1, which depicts the construction of the constraint predicate.

4.2 BASIS METRICS

The following is proposed as a basis for constraint predicate generation through property extraction. In a solution, each testable intermediate result should satisfy one or more of the three predicate subclasses of progress (Φ_P), feasibility (Φ_F), and consistency (Φ_C).

4.2.1 Progress (Φ_P)

Each testable step must both arrive within a bounded time and belong to the expected step of the solution:

Predicate Φ_P (max_step, timeout, k) returns bool;
    for i := 1 to max_step do
        select
            read data from P_j;
            if data not from the k'th step then ERROR;
        or
            delay timeout; ERROR;
        end;
        process locally;
    end.

Note that if a processor continues to send messages after max_step, this is not considered an error, for this behavior does not affect the final result of the calculation.
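This step-count form of Φ_P can be sketched in C as follows. The message structure and the timed-receive callback are hypothetical stand-ins for whatever receive primitive a particular DMMP runtime provides:

    #include <stdio.h>

    /* A received message carries the step it belongs to.  The receive
     * function is supplied by the caller and returns 0 on success or
     * -1 on timeout; both are illustrative, not a real runtime API. */
    typedef struct { int step; double value; } Msg;
    typedef int (*RecvFn)(Msg *out, int timeout_ms);

    /* Phi_P: every peer message must belong to the expected step and
     * arrive before the timeout expires. */
    int phi_p_progress(RecvFn recv_with_timeout, int max_step, int timeout_ms)
    {
        Msg m;
        for (int k = 1; k <= max_step; k++) {
            if (recv_with_timeout(&m, timeout_ms) != 0)
                return 0;               /* ERROR: peer made no progress   */
            if (m.step != k)
                return 0;               /* ERROR: message from wrong step */
            /* ... process m.value locally ... */
        }
        return 1;                       /* all steps arrived in order     */
    }

    /* Trivial stand-in receiver so the sketch is self-contained: it
     * delivers steps 1..max_step in order and never times out. */
    static int fake_recv(Msg *out, int timeout_ms)
    {
        static int next = 1;
        (void)timeout_ms;
        out->step = next++;
        out->value = 0.0;
        return 0;
    }

    int main(void)
    {
        printf("%d\n", phi_p_progress(fake_recv, 5, 100)); /* prints 1 */
        return 0;
    }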
Locality and obtainability cooperate to form an IPT test for monotonic convergence. Consider as a special case of the convergence envelope localized monotonic reduc- tion of error. This is typically seen in relaxation solutions such as PDE relaxation. 1' A discrete convergence function is employed in Chapter 9 when parallel sorting is con- sidered. 43 Without loss of generality, assume that the sequence of local errors 85") is monotone decreasing. elk”) 5 85k) u§k+1)—u,- S uf")—u,- (4.2) It then follows immediately that the sequence of ug’s are also monotonically decreasing. A suitable constraint predicate (D): for this relation is: Predicate tbp (uSk)'u§"+1), real, i,k) returns bool; It 043"“) 2 uf”) ERROR; end. Convergence envelopes, while ensuring a guarantee of nonegative progress, cannot check for sufficient progress. By this we mean that a solution satisfying the above predi- cate may proceed arbitrarily slowly. Consider bounds on the convergence rate ‘Yh’iln and 79:2,. such that each step satisfies (1‘) (k) “8 II It) Again this is a PPT class predicate. A corresponding local rate bound may be obtained and implemented similarly to that of (4.1). Progress alone is not sufficient to guarantee solution correctness. The final vector, u“), particularly when the bounds y are loose, may differ significantly from the actual solution u. To restrict completely arbitrary behavior further, we consider feasibility as the next constraint predicate basis. 4.2.2 Feasibility ((1);) Each testable result must remain within the defined solution space of the problem. Formally, consider a solution space H), and any intermediate result “(1‘) . Then for any step k, “(1‘) 8 H k The feasibility constraints are often immediately apparent from the nature of the problem studied. Problems in physics and engineering such as equilibrium problems, eigenvalue problems, and to a lesser extent propagation problems contain feasibility con- straints in the form of boundary conditions. Indeed these boundary conditions are exactly the class of natural problem constraints. Equilibrium problems can be described as jury problems in which the entire solution is obtained by a jury which must satisfy all the boundary constraints and all internal requirements of the problem. Propagation prob- lems can be considered as marching problems in which the solution marches from the initial state guided and modified by the side boundary conditions. In these type of prob- lems, the boundary conditions are of course known a priori and do not vary as the solu- tion progresses (H), =H is stationary), and thus can easily be used as feasibility con- straints. Branch and bound tree searching (for example see [HoSa84]) is a method which searches a state space (tree) for a minimum cost goal state. This search is guided by an accumulated and estimated cost function. Let c(")(x) be the estimate of the cost function at step k and c(x) be the actual cost of reaching the goal state from node x. If we enforce that emu) S c (x) for ke {l,2,...}, then ca‘)(x) provides a lower bound in H), on the cost of any solution obtainable fiom node x. Let L be an upper bound on the cost of a minimum cost solution (initially L may be co). If at some step k, a possible goal node y is discovered, then all nodes x with c (x) 2 C(k)(x) > co) may be killed, and L is updated as L = 00). Thus branch and bound provides a way of dynamically narrowing the bounds on the feasible solutions Hk. 
For some problems, the feasibility constraints may be so loose as to be virtually of no use. This is precisely the case for the "area" predicate OpA. This is an example of a feasibility predicate for which [1,, = {x l x 2 0) for ke (1,2,...) i.e. the set of positive 45 reals. The difficulty lies not in the idea of using a constraint predicate. This is the best predicate that can be obtained from the problem statement. The problem simply contains insufficient natural constraints from which to generate a good predicate. 4.2.3 Consistency (OC) Many intermediate calculations contain additional properties that are indirectly obtainable from the problem’s natural constraints. These are defined as Consistency Con- ditions. In a white box testing environment, all facets of each testable step can be checked. As noted previously, in the DMMP environment, due to the locality of infor- mation, black box testing must be utilized. A consistency predicate may be applied only to the received information and locally known information. This may appear to severely restrict the functionality and usefulness of this type of test, but in reality a consistency test is powerful. [Ay0287] developed an entire constraint predicate using only con- sistency conditions. Consistency can compensate for limitations in the progress and feasibility bases. Consider a problem in which (4.3) has a globally specified bound 7 but no locally specifiable 7;. Furthermore, assume that it is too expensive to implement a PPT predi- cate. Knowledge that a processor has about its local state and the values that it sent to other processors in previous steps can provide bounds on the range of acceptable values for the current testable step. Assume that the solution proceeds as in (4.2). Let each new value of uf") be calcu- lated as a linear function f of the neighboring values ujk‘ll, (ER with coefficient a). From the perspective of a processor P,- calculating u,-, a candidate intermediate result u)", leR must satisfy the following property iii") 5 ask-ll—a,(u$"-2)-u$"—1)), 1,: = 1,2,...,N, ke {2,3,4,...} (4.4) To prove this as a theorem it is necessary to consider two successive iterations uf") and ujk‘l). Subtracting the two yields an expression which is a function of known and 46 unknown values (from processor Pi’s point of view). If we let W“) represent the unknown values at iteration k, the expression becomes uikl-ujk’1)= W(k)—u(k‘1)+a)(ufk'1)—uf"’2)), l,i = 1,2,...,N, ke {2,3,4,...} Since f is a monotonic linear function and ui") S uik'l) where l,i = 1,2,...,N, ke {1,2,...], then WU‘“1)2W("). This provides a lower bound on the amount of movement that processor P,- can expect which results in the inequality given in (4.4). The lower bound is solely a function of the values that P,- sent P, in the previous iterations and Pl’s previous iteration. Furthermore, this is an IPT predicate which makes it attractive fi'om a computational aspect. 4.3 CHAPTER SUMMARY A systematic way of generating executable assertions for the DMMP envrionment is essential to the application oriented fault-tolerance paradigm. The three predicate subc- lasses of progress, feasibility, and consistency formulate the constraint predicate. Error coverage analysis techniques for the resulting constraint predicate are presented in Chapter 6. Natural problem constraints are a necessary condition for successful con— straint predicate generation. It is not known if natural constraints are a sufficient condi- tion. This is discussed in Chapter 10. 
Chapter 5
Distributed Diagnostic Basis

In this chapter, the two diagnostic basis algorithms mentioned in Chapter 3 are explored in greater detail. We present syndrome testing and show that in the Byzantine environment it produces an incomplete diagnosis. As an alternative, Byzantine Agreement is described, and an implementation called Vector Byzantine Agreement, which is necessary for efficiency in the DMMP, is presented, proven correct, and given running time bounds.

5.1 SYNDROME TESTING AND DIAGNOSTIC IMPOSSIBILITIES

In the mid to late sixties, fault diagnosis research had no real formalism to rely upon. In a landmark paper [PrMC67], the authors outlined a method of system level diagnosis called syndrome testing. Peer testing using this method has received concentrated research treatment over the ensuing two decades since its inception and thus is worthwhile to consider as a candidate distributed diagnostic basis.

Syndrome testing of a system $S$ is modeled as a testing digraph $G(V,A)$, where each processor in $S$ is represented by a vertex, and each arc $a_{i,j} \in A$ denotes that processor $i$ tests processor $j$, for $i,j \in V$. An arc $a_{i,j} \in A$ is given weight 0 (1) if the testing processor $i \in V$ finds the tested processor $j \in V$ fault-free (faulty). If $a_{i,j} = 0$ (1), then it is said that $i$ has a 0-link (1-link) to $j$. The collection of all such test weights for a particular system $S$ is called the syndrome $W$ formed by the test graph $G$ of the system $S$. The system $S$ is diagnosed through the examination of the syndrome $W$ to locate a unique set of faulty processors $F$. To facilitate the discussion of this testing model, the following sets are defined. For each vertex $i \in V$, let

  $\Gamma(i) = \{j \mid a_{i,j} \in A\}$
  $\Gamma^{-1}(i) = \{j \mid a_{j,i} \in A\}$

Discussion of fault diagnosis through testing must consider a fault model of deviant processor behavior. In the PMC model [PrMC67], the following behavior is assumed. If processor $i$ is fault-free and tests processor $j$, a fault-free (faulty) processor, then $a_{i,j} = 0$ (1). If processor $i$ is faulty, then the test outcome is unreliable. This is summarized in Figure 5.1(a). The term symmetric invalidation is used to describe the PMC model.

The characterization of diagnosable graphs $G$ given in [HaAm74] is a function of the maximum number of tolerable faults and the connection assignment. If a system $S$ obeys this characterization, then it is said to be $t$-fault diagnosable: the system can withstand up to $t$ faults and diagnosis can still identify all the faulty units in the system for all syndromes that can result from testing under the PMC model. Performance of the diagnosis is described by the same terms correct and complete from Chapter 3. [DaMa84] presents an algorithm which can diagnose a system $S$ in $O(n^{2.5})$ time. The diagnosis provided by this algorithm is both correct and complete, as the fault set $F$ is exactly the set of faulty processors.
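The testing model can be made concrete by simulation. The following Python sketch generates the syndrome a given fault set would produce under the PMC model, with unreliable outcomes drawn at random; the ring test assignment and the random model of unreliability are illustrative assumptions.

    import random

    def pmc_syndrome(arcs, faulty):
        """Generate one syndrome W for the test digraph given by `arcs` under
        the PMC model: a fault-free tester reports the testee's true status;
        a faulty tester's outcome is unreliable (modeled here as random)."""
        W = {}
        for (i, j) in arcs:
            if i in faulty:
                W[(i, j)] = random.randint(0, 1)    # unreliable outcome
            else:
                W[(i, j)] = 1 if j in faulty else 0
        return W

    # Example: five processors testing in a ring, processor 3 faulty.
    n = 5
    arcs = [(i, (i + 1) % n) for i in range(n)]
    print(pmc_syndrome(arcs, faulty={3}))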
The PMC model can be unrealistic since it requires a faulty processor to always identify itself to a non-faulty testing processor. This may not occur if the test itself is "incomplete," that is, a faulty processor can pass the test but still be faulty. The model is also unrealistic in that it requires that a faulty processor exhibit only permanently faulty behavior. The CMOS technology involved requires consideration of intermittently faulty behavior. To review the distinction between these two cases presented in Chapter 1, once a permanently faulty unit fails, it suffers a hard failure: it never recovers. An intermittently faulty unit may function incorrectly and then correctly at a later time, with the possibility for future failure/recovery cycles.

Figure 5.1. Fault Models: (a) the PMC fault model; (b) the intermittent fault model.

Fault diagnosis under the intermittent fault model has received much study [MaMa78, YaMa86b]. For intermittently faulty behavior, if $i$ is a fault-free processor testing a faulty processor $j$, the outcome is unreliable. All other conditions of the PMC model remain true. This is depicted in Figure 5.1(b). This model is also characterized by symmetric invalidation.

Let $|V_1|$ be the cardinality of set $V_1$.

Lemma 5-1 [MaMa78]: A system $S$ is $t_i$-diagnosable if and only if for any set of processors $V_1 \subseteq V$ where $0 < |V_1| \le t_i$, … , $i \in V_1$.

If a system $S$ subject to intermittent faults is characterized as $t_i$-fault diagnosable, then a correct diagnosis resulting in identification of a unique fault set $F$ may be obtained provided that at most $t_i$ processors fail. [YaMa86a] presented an algorithm for $t_i$-fault diagnosable systems which yields a unique fault set $F$. However, under certain characterizable syndromes, the algorithm may not achieve a complete diagnosis. Thus, based on the syndrome that occurs, faulty units may escape detection. This work provides a characterization of syndromes for which their algorithm produces an incomplete diagnosis. To describe this characterization, we adopt their notation. For each $i \in V$, the 0-ancestors of $i$ correspond to the set

  $A_0(i) = \{j \mid a_{j,i} = 0\}$

Theorem 5-1 [YaMa86a]: A faulty processor $j$ is detectable by Yang and Masson's algorithm if and only if $|A_0(j) \cup \{j\}| \le t_i$, where $t_i$ is the maximum tolerable number of intermittent faults.

Theorem 5-1, however, is not a full characterization of $t_i$-diagnosability. Consider the testing digraph $G$ and syndrome shown in Figure 5.2 for $t_i = 1$. The following discussion shows that processor 2 is the only faulty processor. If processor 2 is not faulty, then assume processor 0 is faulty, as indicated by the testing link $a_{2,0} = 1$. But then either processor 1 or processor 2 must also be faulty since a 1-link exists between them. Since processor 2 is fault-free, then processor 1 must be faulty, and the 1-link from processor 1 to processor 2 ($a_{1,2} = 1$) is an erroneous test result reported by processor 1. This, however, is a contradiction since now the fault set $F$ has cardinality 2 ($|F| = |\{0,1\}| = 2 > t_i$) and exceeds the bound on the number of faults. If we assume processor 1 is faulty, a similar analysis reaches the same contradiction: the syndrome contains too many faulty processors. Thus, the only possible diagnosis is that $F = \{2\}$. It is easily verified that the testing digraph $G$ satisfies Lemma 5-1 for $t_i = 1$ and is thus $t_i$-diagnosable. However, $A_0(2) = \{0\}$, which when applied in Theorem 5-1 yields $|A_0(2) \cup \{2\}| = 2 > t_i$. Thus Yang and Masson's algorithm cannot detect the faulty processor in this syndrome even though it is diagnosable.

Figure 5.2. Syndrome undiagnosable by the Yang and Masson algorithm: a $t_i$-diagnosable system with testing digraph $G$; processor 2 is faulty, $t_i = 1$.

The reason for this should be clear. The Yang and Masson characterization of diagnosable syndromes only considers detection of faulty behavior through implication of the suspected faulty processor. The analysis just performed used the additional information of faulty implication by the suspected faulty processor.
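The case analysis above can be mechanized by enumerating every fault set consistent with the syndrome under the intermittent model. A Python sketch follows; the three-processor test cycle and the 0-link $a_{0,1}$ are assumptions read off the discussion, since the figure itself is not reproduced.

    from itertools import combinations

    # Assumed syndrome of Figure 5.2: a test cycle 0 -> 1 -> 2 -> 0.
    tests = {(0, 1): 0, (1, 2): 1, (2, 0): 1}
    n, t_i = 3, 1

    def consistent(fault_set):
        """Symmetric invalidation, intermittent model: a test by a fault-free
        processor on a fault-free processor must read 0; any test involving
        a faulty tester or testee is unreliable and may read anything."""
        return all(not (i not in fault_set and j not in fault_set and w == 1)
                   for (i, j), w in tests.items())

    candidates = [set(f) for r in range(t_i + 1)
                  for f in combinations(range(n), r) if consistent(set(f))]
    print(candidates)    # [{2}]: the unique diagnosis argued in the text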
This prompts the question of whether a better diagnosis algorithm exists. Optimally, a diagnosis algorithm that can perform a complete diagnosis is desired. We now show that an incomplete diagnosis is the best that can be achieved, and that a complete diagnosis for all syndromes is impossible.

Definition 5-1: A Non-masking Diagnosis is one in which the result of a test $a_{i,j}$ from processor $i$ to processor $j$ cannot be discarded by the diagnosis algorithm unless it is known that processor $i$ is faulty.

Definition 5-2: A Deterministic Diagnosis is a diagnosis in which no individual probabilities are assigned to either the failure of the processors or to the results of the tests.

A non-masking diagnosis cannot, as in [GuRa86], find a processor $i'$ faulty based on some minimal cardinality proper subset of $\Gamma^{-1}(i')$. To do so establishes a diagnosis based on repetitive testing. Deterministic diagnosis forces symmetric invalidation of two processors $i'$ and $j'$ which share an incident 1-link. The diagnosis algorithms given by [YaMa86a, DaMa84] are both deterministic non-masking diagnosis algorithms.

Lemma 5-2: Given any testing digraph $G(V,A)$, there exists one syndrome for which any non-masking diagnosis is incomplete, for even the single intermittent fault case.

Proof: Let $G(V,A)$ be a digraph with weighted arcs $a_{i,j}$ as described in Section 5.1. Let $i'$ be a faulty vertex with $a_{i,i'} = 0$ for $i \in \Gamma^{-1}(i')$ and $a_{i',j} = 0$ for $j \in \Gamma(i')$. Then it is impossible to diagnose the system completely since $i'$ cannot be assigned to the set of faulty units $F$. □

Thus a faulty processor may pass all tests and report pass for processors it tests in the trivial case. The syndrome described by Lemma 5-2 is called the trivial syndrome. However, even if the faulty processor manifests itself by assigning 1-links to some incident edges, it can still escape detection, as the following theorem shows.

Theorem 5-2: Given any testing digraph $G(V,A)$ with $|A| \ge 1$, there exists at least one syndrome, exclusive of the trivial syndrome, for which any deterministic, non-masking diagnosis is incomplete, for even the single intermittent fault case.

Proof: Let $G(V,A)$ be a testing digraph with weighted arcs $a_{i,j}$. Without loss of generality, let $i'$ and $j'$ be given such that $j' \in \Gamma(i')$ and $a_{i',j'} = 1$. Furthermore, let every other arc incident on $i'$ or $j'$ have weight 0. Note that one or both of the sets $\Gamma^{-1}(i')$ and $\Gamma(j')$ may be the empty set. Since the diagnosis is deterministic, symmetric invalidation states that there is no conclusive information to favor one processor as faulty over the other. Thus it is impossible to diagnose the system, as either processor $i'$ or processor $j'$ may be in the set of faulty units $F$. □

Further lessening of the restrictions on the intermittent fault model leads to the most general class of faults, the Byzantine fault model. In this model, a faulty processor no longer exhibits random behavior when tested, as in the intermittent faulty mode, but assumes a malicious personality. By this malicious behavior, a faulty processor will perform the most disruptive function at the most critical time. This is clearly an attractive fault model since any algorithm tolerant of Byzantine faults is tolerant of all faults. In the Byzantine case, the syndrome testing fault model is exactly the same as the intermittent fault model (in terms of testing arc weights). However, the philosophy of the assignment is different. In the syndrome diagnosis, the faulty units will elude detection.
In light of Lemma 5-2, a Byzantine processor can always completely elude detection by passing all its tests through reporting of the trivial syndrome. Furthermore, it will confuse the diagnosis, as in Theorem 5-2, such that an incomplete diagnosis is obtained.

Two main points are stressed in this discussion. The first is that the set of $t_i$-diagnosable syndromes given by [YaMa86a] is a proper subset of the true set of $t_i$-diagnosable syndromes. More importantly, we have shown that there can never be a diagnosis algorithm for $t_i$-diagnosable systems which is always complete. This is particularly important when the Byzantine class of faults is considered. Unlike algorithms which always function correctly in the presence of Byzantine faults, such as Byzantine Agreement, no algorithm can perform Byzantine diagnosis completely. Numerous algorithms have been proposed that function in the presence of Byzantine faults and purport diagnosis of Byzantine faults [GuRa86, ShRa87]. In reality, these algorithms are tolerant of intermittent failures, as they require repeated testing to uncover faulty behavior. A true Byzantine fault will never manifest itself in a manner diagnosable by these algorithms.

5.2 BYZANTINE GENERALS PROBLEM

The Byzantine Generals problem [LaSP82] is a rephrasal of [PeSL80], giving the problem a more colorful name and description. In the Byzantine Generals scenario, the Byzantine army is encamped outside of an enemy city. Each division has a general (resource), some of whom are traitorous. The solution requires that the loyal generals decide on the same plan of action ("attack" or "retreat") and that a small number of traitorous generals cannot force adoption of a bad plan (a bad plan being that some generals attack while others retreat). Note that for all the loyal generals to decide upon the same plan of action, all the loyal generals must obtain the same information. This is done by the exchange of messages. A traitorous general may send different values to different generals. In order to prevent a small number of traitors forcing the adoption of a bad plan, the value sent by a loyal general must be used by every loyal general as the correct value. Furthermore, all loyal generals must use the same value from a traitorous general. An example adapted from [PeSL80] for the four general, one traitor, case is given in Figure 5.3. For the complete algorithm, the reader is encouraged to consult one of the two above references. We now substitute processors for generals and use multiple values instead of the single valued "attack" or "retreat."

Figure 5.3. Byzantine Agreement ($P_1$ is faulty) (from [LaSP82]).

In Figure 5.3 there are four processors ($P_1$-$P_4$), one of which is faulty ($P_1$). It is not known, however, by $P_2$, $P_3$, or $P_4$ that $P_1$ is faulty. Nor may $P_1$ be aware of its own faulty condition. If the faulty processor $P_1$ broadcasts over point to point links, it is free to send different values to different recipients of its broadcast. Say it sends the value "a" to $P_2$, "b" to $P_3$, and "c" to $P_4$. Each processor, with its own local view of the system, has no idea that the other processors have received different values for the same broadcast message. By exchanging values, it is possible for each processor to relay the value it received from $P_1$ to each of the other processors. Thus if each one of the receivers relays its version of what $P_1$ sent to the other receivers, each receiver will have three versions of $P_1$'s value.
From the point of view of one processor, say $P_2$, it will receive one version (the value "a") directly from $P_1$, the value "b" from $P_3$ as $P_1$'s value, and the value "c" from $P_4$ as $P_1$'s value. Since all receivers have received different values in this case, a strategy might be to pick the lowest ASCII coded value, say "a." Thus $P_2$ will use the value "a" as $P_1$'s value. Since both $P_3$ and $P_4$ perform the same computation as $P_2$, and since $P_2$, $P_3$, and $P_4$ are non-faulty, they all come to the same decision on $P_1$'s value, namely "a." An algorithm which allows the non-faulty processors to come to the same agreement on a value is said to solve the Byzantine Generals Problem.

A round is a set of message exchanges. The above example used two rounds of information interchange. [DoSt82] shows that for $n$ generals with $t$ faulty, $t+1$ rounds of interchange are needed to reach agreement among $n \ge 3t+1$ generals. If either fewer than $3t+1$ generals or fewer than $t+1$ rounds are employed, no reliable consensus can be reached. This solution is known to be $t$-resilient, that is, it can withstand $t$ faults. This is the Unauthenticated Byzantine Generals solution. It requires three assumptions.

1) Every message that is sent is delivered correctly.
2) The receiver of a message knows who sent it.
3) The absence of a message can be detected.

The Authenticated Byzantine Generals solution requires, in addition to assumptions 1-3 above, the following assumption.

4) Each loyal general can send unforgeable signed messages.

For the authenticated solution, a three general solution exists, and the protocol is known to be $n$-resilient, that is, it can withstand $n-2$ faults where $n$ is the number of generals in the system ($n$ must be $> 2$ or the problem is vacuous).

5.2.1 Synchronic Algorithm Constraints

In voting, we are concerned with whether the algorithm is synchronous, asynchronous, or partially synchronous, characterized by bounds $\Delta$ and $\Psi$. $\Delta$ is the bound on the time to send a message from one processor to another, and $\Psi$ is the maximum clock drift between any two processors.

5.2.1.1 Synchronous Algorithms

If $\Delta$ and $\Psi$ are known a priori, then the system is said to be synchronous. The Byzantine generals problem can be solved for $t$-resiliency or $n$-resiliency in the following cases:

1) Synchronous processors, synchronous communication
2) Synchronous processors, synchronous message order
3) Broadcast transmission and synchronous message order

Synchronous algorithms, such as used in the FTMP [HoSL78], do not generally fit the DMMP model. However, specific applications may indeed fit this model.

5.2.1.2 Asynchronous Algorithms

In an asynchronous system, neither $\Delta$ nor $\Psi$ exist. It is impossible to find consensus in an asynchronous environment for even one faulty Fail-Stop processor [FiLP85]. There exists a window of vulnerability in which a faulty processor can indefinitely delay agreement. Deterministic asynchronous Byzantine Agreement can be achieved only if processors do not fail during the protocol execution. However, other, more realistic possibilities also permit asynchronous agreement.

[Perr85], building on the work of [Rabi83], proposed a randomized Byzantine Generals algorithm which reaches agreement with probability $1-2^{-R}$ in $R$ rounds using the concept of "Shared Secrets" [Sham79], in which Digital Signature Authentication is employed and a trusted dealer gives out public key encrypted data. Shared secrets are used to circumvent the impossibility result given in [FiLP85].
The reconstruction of the secret prevents a process from indefinitely waiting for messages from faulty processors. This algorithm will achieve agreement with probability 1 as $R \to \infty$. The expected number of rounds is four; two rounds are needed to reach agreement and two rounds are needed to verify the proof.

This algorithm works on the following principle. A station wishing to enter into agreement enters into a loop bounded by $R$. This loop consists of procedures Poll, Lottery, and Decision. Poll sends its encrypted message to all stations. It then reads all values coming into it from this round. Collecting $n-t$ values, it decides the plurality vote and the number of stations voting this way. The station then enters the Lottery phase. The Lottery asks for the secret message from all processes for this round. The secret is then decoded when $t$ different messages have arrived. The final phase is the Decision. Each station has its own idea of what the message should be. Based on the randomly generated secret just received, the message is either accepted or rejected. This procedure continues until acceptance is met or $R$ is reached. Notice that since we have the completely asynchronous case, messages may arrive out of order, be delayed indefinitely, or exhibit other Byzantine manifestations. Thus each round of the lottery should get its messages from some asynchronous process which collects messages for all rounds regardless of the current round. [Rabi83] used signed messages in his algorithm, but [Perr85] showed that this was unnecessary, in agreement with the theoretical result given in the Byzantine Generals problem statement. In both cases, the resulting algorithms tolerate $t$ faulty processors where $n > 6t$.

The method of shared secrets is limited. A trusted dealer must exist to reliably predistribute the shared secret values before each process begins. Additionally, it must be assumed, although not explicitly said in either paper, that a process does not know the secret if and only if it sends a bad message.

An alternative to algorithms employing shared secrets is presented by [DLPS86], in which approximate agreement can be reached in an asynchronous environment. This algorithm handles Byzantine Agreement with successive approximations converging to some $\varepsilon \ge 0$. At first it is unclear how one could use such an approximate algorithm. However, it becomes useful when one considers the problem of synchronizing clocks or of agreeing on real valued input from data collection sensors. The algorithm functions in a number of rounds indirectly bounded by the requested $\varepsilon$. At each round, $n-t$ values are requested from other processors. A selection algorithm chooses the smallest $2t$ elements and then takes the average. When the average is less than $\varepsilon$, the algorithm terminates. This algorithm tolerates $t$ faulty processors where $n > 5t$. This procedure assumes that all processes terminate, but not necessarily at the same time. It also tolerates a greater proportion of faulty processes than $5t$ because it handles processes which exhibit transient Byzantine behavior; that is, they fail and then recover to a correct mode of operation.
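The round structure of approximate agreement can be sketched as follows. This is an illustrative Python simulation only: it uses a simple trimmed mean as the selection function, models Byzantine inputs as random values injected per receiver, and converges quickly against such a random (rather than worst-case) adversary; the actual selection operator of [DLPS86] differs and guarantees convergence against a worst-case adversary.

    import random

    def approx_round(correct_vals, t):
        """One round as seen by each correct process: it collects the correct
        processes' values plus up to t adversarial values, discards the t
        lowest and t highest, and averages what remains, so all estimates
        stay inside the range spanned by the correct values."""
        new_vals = []
        lo, hi = min(correct_vals), max(correct_vals)
        for _ in correct_vals:
            byz = [random.uniform(lo - 1.0, hi + 1.0) for _ in range(t)]
            seen = sorted(correct_vals + byz)
            trimmed = seen[t:len(seen) - t]
            new_vals.append(sum(trimmed) / len(trimmed))
        return new_vals

    vals, t, eps, rounds = [0.0, 0.3, 0.9, 1.0, 1.1], 1, 1e-3, 0
    while max(vals) - min(vals) >= eps:
        vals, rounds = approx_round(vals, t), rounds + 1
    print(rounds, vals)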
5.2.1.3 Partially Synchronous Case

A partially synchronous system is one in which both $\Delta$ and $\Psi$ hold eventually but are not known a priori. Agreement in the partially synchronous environment is solvable without authenticated signatures [DwLS88]. The algorithm tolerates $t$ faults where $n > 3t$. The algorithm locks and unlocks various supposed values of agreement for each round. If a locked value is later learned to be not an agreed upon value, it is unlocked and the algorithm continues. The algorithm is guaranteed to terminate; however, the number of rounds is polynomial in $N$, $\Psi$, and $\Delta$, which for an actual implementation is expensive.

5.2.2 Masking vs. Detection

The Byzantine Generals problem has been the subject of study as a solution to the problem of finding a consensus among a number of processors in a faulty environment. The weak consensus problem is solved if, when the local value of all non-faulty processors is $v$ and no processors fail, all processors reach the same agreement, namely $v$. The strong consensus problem adds that all non-faulty processors reach agreement on $v$ in the presence of up to a predetermined number of faults, denoted by $t$.

While Byzantine Generals solution algorithms mask faults, they alone have a drawback in the solution of the fault-detection problem. In the introductory example it was clear that processor $P_1$ was faulty, since all receivers relayed different values. If the sender is non-faulty, however, and a receiver is faulty and thus relays the wrong values, processors may erroneously flag $P_1$ as faulty when, in reality, some other processor is faulty. The Byzantine Generals solution cannot implicitly locate the faulty processor (Byzantine detection is covered in Section 5.2.4). It can only guarantee that a consistent state of the system can be produced for each processor. However, an acceptance test run on this view by each non-faulty processor will produce the same set of faulty processors as a test result.

5.2.3 Vector Byzantine Agreement

Byzantine Agreement (even in the best case of synchronous behavior) is expensive. The problem with each of the agreement algorithms presented above is that to achieve the necessary consensus for application oriented fault-detection among $n$ processors, $n$ individual agreements are necessary. Vector Byzantine Agreement collapses these into a single agreement by trading off communication complexity for time and space complexity. Since message initiation is the most expensive part of the communication, lowering the number of agreements lowers the number of messages.

The algorithm presented achieves Vector Byzantine Agreement under the assumptions of synchronous processors and communications and a reliable, completely connected (within the agreement group) communication network. These are not unreasonable assumptions. If communication network errors occur, they may be charged to one of the processors in the agreement. The completely connected communication requirement only extends to the members of the agreement group, which is typically of small size and is seen to be easy to construct for the Poker environment [Snyd84]. Furthermore, the algorithm extends to the less than completely connected case using the result of [Dole82]. The concept of Vector Byzantine Agreement is not limited to synchronous processors. Any of the partially synchronous [DwLS88] or asynchronous [DLPS86, Perr85] algorithms presented above could be adapted to the form of Vector Byzantine Agreement to function as a distributed diagnostic basis.

A constraint predicate $\Phi$ may be applied by each non-faulty processor to its local copy of the agreed upon vector. The predicates will be constructed in such a way that each non-faulty processor will obtain the same decision as every other non-faulty processor.
5.2.3.1 Agree

The algorithm Agree described in this section achieves Vector Byzantine Agreement. We introduce some notation at this point to formally describe Agree.

$n$  In the agreement, there are $n$ processors. The $i$th processor is denoted $P_i$ for $i = 1,\ldots,n$.

$t$  Among the $n$ processors, the maximum number of tolerated faults is $t$.

$v_i$  Each processor that enters into the agreement begins with a local value that it wishes to broadcast to the other processors in the agreement. For $P_i$, this value is denoted by $v_i$. The value $v_i$ is treated as correct by $P_i$. Note that $v_i$ may or may not be equal to $v_j$ if $i \ne j$.

$V^i$  At the successful termination of the agreement protocol, each processor has obtained a vector of values. Each component of this vector corresponds to the agreed upon local value of a processor in the agreement. For processor $P_i$, $V^i$ denotes $P_i$'s vector of values. Each component of this vector is the value $V^i(j)$ where $j = 1,\ldots,n$. $V^i(j)$ denotes the value that $P_i$ uses as $P_j$'s local value $v_j$. It is the case that $V^i(i) \equiv v_i$ if $P_i$ is non-faulty.

Agree departs from typical Byzantine agreement algorithms [LaSP82, Dole82], which deal with single valued agreement, and instead reaches a consensus on the entire vector of local values in the same set of rounds. Thus at termination, each non-faulty processor will return its vector $V^i$, which achieves the Vector Interactive Consistency Conditions.

Vector Interactive Consistency Conditions:

VIC1: Any two non-faulty processors $P_i$ and $P_j$ obtain vectors $V^i$ and $V^j$ such that $V^i(k) = V^j(k)$, $k = 1,\ldots,n$.

VIC2: If a processor $P_k$ is non-faulty with local value $v_k$, then for any non-faulty processor $P_i$, $V^i(k) = v_k$.

The above two conditions may seem the same; however, they are not. Condition (VIC1) states that any two non-faulty processors obtain the same copies of the local values of all the processors in the agreement. However, these values may or may not be equal to the local value $v_i$ that a particular processor $P_i$ actually holds. Condition (VIC2) states that all non-faulty processors agree on the same value as a particular processor $P_k$'s local value $v_k$ if that particular processor is non-faulty.

The other protocols mentioned at the beginning of this section require $n(t+1)$ rounds to reach agreement, as each value must be agreed upon individually before the next round can start. Our protocol achieves agreement in $t+1$ rounds. The protocol recursively broadcasts and then negotiates on the broadcast values. Each negotiation requires a message exchange, and thus each processor recursively enters into agreement on the exchanged messages.

The algorithm Agree shown in Figure 5.4 achieves Vector Interactive Consistency (reference Theorem 5-3, which follows). Each processor running Agree will send its local value to all other processors. Then Agree is called recursively to reach agreement on these values. The algorithm tolerates a maximum number of faults, $t$. It will be shown in Theorem 5-3 that agreement can be achieved if $n > 3t$. Initially, processor $P_i$, $1 \le i \le n$, invokes the algorithm Agree with Agree(0,$v_i$). Let $Q$ be the set of processors in the agreement excluding the calling processor.

Agree is called recursively. The first time, each $P_i$'s $v_i$ is broadcast to all other processors. Subsequent calls reach agreement on the received broadcast values. Thus, Agree(0,$T_0$) performs the initial broadcast of each $v_i$. Agree(1,$T_1$) calls for agreement on the values received in Agree(0,$T_0$) by invoking Agree(2,$T_2$).
Let T", be an m-dimensional square array in which each dimension is of length n. In each call to Agree(m,T,,,) each processor sends its own Tm and assembles n-l m- dimensional square arrays from the other n—l processors to form an (m +1)-dimensional Tm“. Each T ,,, from Pk is indexed in the (m +1)-dimcnsional Tm“ (k). Similarly, each m~ dimensional Tm is indexed in the (m +2)-dimensional sqaure array Tm+2 by Tm+2(i,k) where j indexes a T,,,+1 and k indexes the T," within that T "+1. In the proofs, processor Pk’s T,,I is referenced by T5. Where no ambiguity exists, the superscript on T is dropped. Let R represent the indices of elements in the m-dimensional square array Tm: i 1, i 2. ...,i,,,, where 1Si an, for lSj Sm. Thus an individual element of Tm is referenced by Tm(R). Similarly define Tm+1(k,R) to be an element in the k’th m-dimensional square array T", and Tm+2(j,k,R) to be an element in the k’th Tm of the j’th (m +1)-dimensional Tm“. The function majority returns the majority value of its arguments (the individual elements of each array). If no majority exists, it returns the lowest ASCII coded value. Note that this is a purely arbitrary choice; however, it will be made consistently by all non-faulty processors. If no value is received for a particular element, majority chooses an arbitrary (but consistent) value to use. Notice that at the initial call, the To is a scalar. At termination of Agree(0,T0), a vector T1 is returned satisfying conditions WC] and VIC2. Algorithm Agree(m,T,,,); allocate storage T1,,,+1; l) T 1m+1(id)<—T,,,; /* id is the calling processor id */ 2) for all PkeQ, send Tm to Pg; 3) for all PkeQ, receive an m—dimensional square array from Pk and store in T 1m+1 (k); if t>m then allocate storage T2", +2; 5) for each keQ for eachR such that ig at i). for 1 Sg,h S m and g at h for each jeQ,j¢ig andj at k for 1 fig Sm T1m+1(k,R)<—- majority (T 1m+1(k,R),T2,,,+2(i,k,R)); return(Tl,,,+1); Figure 5.4. Algorithm Agree 5.2.3.2 Example of Algorithm Agree A short example demonstrating the use of Agree in presented below. Figure 5.5 contains an example of the algorithm with n=4 and t=1. Processors are numbered P 1-P4 with P3 as the faulty processor. Initially, each P,- invokes algorithm Agree(O,v,-) (viaTfi). At Step 1, each processor sends To to the other three (3 = n-l) processors. Each processor then receives three values in Step 2. These values, along with the local processor’s vi, compose the square array T 11, a one dimensional square array. This is shown as a 4 element vector in Figure 5.5. Row indices index the received local value from each processor. Since P3 is faulty and exhibits Byzantine behavior, it may send any values it wants or may send no value at all. Agree(1,T11) is recursively called in order to exchange messages on each receiver’s view of the initial broadcast. Agree, in Step 2, sends the square array (T1) to the three other processors. In Step 3 it receives the three other l-dimensional square arrays. These 65 three l-dimensional square arrays along with the square array T1 it received from Agree in the previous recursion are used to create the 2-dimensional square array T 12. This is shown in Figure 5.5 as a two dimensional array. The row indices index the received T11 for each P;. The "?"’s for P3’s T1 indicate "don’t care" values since P3 is faulty and nothing can be assumed about its internal values. Since now t=m=1, Agree(1,T1) returns T 11 to Agree(0,T0). 
In Agree(0,$T_0$), each processor applies its majority function to the returned 2-dimensional square array $T2_2$ and to its $T1_1$ for each of the other processors. Processor $P_1$ obtains the majority values in the following way:

  $T1_1(2) \leftarrow$ majority($T1_1(2)$, $T2_2(3,2)$, $T2_2(4,2)$) = majority(b,b,b)
  $T1_1(3) \leftarrow$ majority($T1_1(3)$, $T2_2(2,3)$, $T2_2(4,3)$) = majority(a,b,c)
  $T1_1(4) \leftarrow$ majority($T1_1(4)$, $T2_2(2,4)$, $T2_2(3,4)$) = majority(d,d,c)

Thus $P_1$ returns [a,b,a,d] as its vector. Note that in all cases, $T1_1 \equiv T2_2(k)$ in every non-faulty processor $P_k$'s computation. Thus the majority could have been taken over $T2_2$. The majorities above decide the final square array $T1_1 = V^i$ for the non-faulty $P_i$, $i = 1,2,4$. Each processor runs majority 3 times, once to decide on each of the other processors' values. Notice that $P_i$ does not decide upon its own value $T_0^i$. This is known by $P_i$ to be correct.

Processor $P_3$ has attempted to thwart a consensus by sending different values to different processors. However, each non-faulty processor runs majority on the same set of values. Here we have chosen to choose the "lowest" value (ASCII collating sequence) for the majority if no majority value can be found. Thus each non-faulty processor uses "a" as the value for $P_3$. Vectored interactive consistency is seen to be achieved, as each non-faulty processor $i$ uses the vector of values $V^i$ = [a,b,a,d].

Each processor uses the communications medium to send six messages in two rounds. The total space requirement for this case is $O(n^2)$. In general, the algorithm Agree requires $O(n^{t+1})$ space and computation complexity and a communication complexity of $(n-1)(t+1)$ messages in $t+1$ rounds of information exchange.

Figure 5.5. Example of Algorithm Agree for $n = 4$, $t = 1$.

5.2.3.3 Proof of Correctness

It was stated that Agree achieves VIC1 and VIC2 for $n > 3t$. We must first prove that our algorithm satisfies vectored interactive consistency condition (VIC2) for a weaker condition of fault tolerance. This will be necessary when we consider the case of deciding on a vector when the transmitter of that vector is reliable and its recipients contain all the faults. This condition violates our intention of achieving VIC1 for $n \le 3(t-m)$ for a given recursion $m$.

Lemma 5-3: Agree satisfies condition VIC2 for $n > 3t-m$, where $t \ge 0$ is the maximum number of faulty processors and $m$ is the index of recursion.

Proof: We proceed by induction on $t-m$. For each correct sender in Step 2, all correct receivers receive each correct sender's value. If $(t-m) = 0$, Agree fulfills VIC2 because the number of faults is zero and every receiver uses the sender's value.
Assume that Agree works for $(t-m)-1$; we wish to show it is correct for $(t-m)$, $t > m$.

1) In Agree($m$,$T_m$), each correct receiver receives $n-1$ square arrays $T_m$ into $T1_{m+1}$ at Step 3, from the $n-1$ calls to send at Step 2 done by the $n-1$ other participating processors.

2) Each correct receiver then applies Agree($m+1$,$T1_{m+1}$). The number of processors considered in Step 5 (the number of individual elements of the square array $T1_{m+1}$) in the agreement is reduced by 1. Thus Agree($m+1$) runs with values from $n-1$ processors (1 directly obtained from the sender and $n-2$ relayed by other receivers, for each component of the vector $T1_{m+1}$).

3) Since Agree($m$,$T_m$) is called with $n$ square arrays $T_{m-1}$, and by hypothesis $n > 3t-m$, then $n-1 > 3t-(m+1)$. Thus, by the inductive hypothesis and by (2), Agree($m+1$,$T1_{m+1}$) achieves VIC2 with $n-1 > 3t-(m+1)$. Thus each non-faulty processor receives the same $T2_{m+2}(k)$ for each non-faulty $P_k$.

4) To achieve a majority vote on $n-1$ components of $T2_{m+2}(j,k,R)$ such that the majority is governed by values reported from non-faulty processors, it must be true that $n-1-t > t$, or $n-1 > 2t$. Since there are at most $t$ faulty processors and $t > m$,

  $n-1 > 3t-m-1$
  $n-1 > 3t-(t-1)-1 \ge 2t$.

Each non-faulty processor has enough reported values from non-faulty processors in $T1_{m+1}$ equal to $T_m^k(R)$ to compute the same majority value for any non-faulty $P_k$, namely $T_m^k$. Thus, for any non-faulty processor, $T1_{m+1}(k) = T_m^k$ for each non-faulty $P_k$, and VIC2 is satisfied. □

Theorem 5-3: Agree($m$,$T_m$) achieves vectored interactive consistency VIC1 and VIC2 for $n > 3(t-m)$.

Proof: We proceed by induction on $(t-m)$. If $(t-m) = 0$, since there are no faulty processors under consideration, each $P_i$ receives, in Step 3, the sender's local value $T_m^j$ for any $P_j$ (sent in Step 2). Each $P_i$ then returns the same vector $T1_{m+1}$.

Assume that Agree($m+1$,$T_{m+1}$) achieves VIC1 and VIC2. We want to show that this is true for Agree($m$,$T_m$), $t > m$.

Case 1  The sender in Step 2 is non-faulty. By Lemma 5-3, each non-faulty $P_i \in Q$ receives $T_m^k = T1_{m+1}(k)$ for each non-faulty $P_k$. By taking $m = 0$ in Lemma 5-3, we have $n > 3t$. Since the sender is non-faulty, VIC1 follows from VIC2.

Case 2  The sender in Step 2 is faulty. Thus each processor receives possibly a different value, the wrong value, or no value at all. Each correct receiver executes Agree($m+1$,$T1_{m+1}$). Since one of the faulty processors is known to be the sender, only $(t-m)-1$ faulty processors are among the receivers. Thus, since

  $n > 3(t-m)$
  $n-1 > 3(t-(m+1))$,

we apply the induction hypothesis to determine that each Agree($m+1$,$T1_{m+1}$) achieves vectored interactive consistency VIC1 and VIC2. Thus, for any two correct processors, for each $k$ and for all $j$, the $T2_{m+2}(j,k)$ are identical. Thus each correct processor obtains the same $T1_{m+1}(k)$ in Step 5. □

We now give the time and space complexity for algorithm Agree.

Theorem 5-4: The algorithm Agree with $n$ processors and at most $t$ faults requires space $O(n^{t+1})$.

Proof: The algorithm is recursively called with $m$, $0 \le m \le t$, as the index of recursion. For $m = 0$, Agree requires $1 + n^1 + n^2 + \cdots$ storage locations. For $1 \le m$ …

The case $Q > N$ is treated in Section 7.4. Selection of points for computation is controlled by the iteration variable $k$, as in the following predicate.

Definition 7-1:

  $\Lambda_{i,k}$ = 1 if $P_i$ is active during iteration $k$; 0 otherwise.
The most common of these consistent multicolor orderings is the red/black or checkerboard multicolor ordering [Smit65] (implying, among other things, that the number of nonzero $a_{i,j}$'s per row $i$ is no more than five). For this specific case, $\Lambda_{i,k}$ becomes:

Definition 7-2:

  $\Lambda_{i,k}$ = 1 if $(i \mathbin{\mathrm{div}} \sqrt{Q} + i \bmod \sqrt{Q})$ is even and $k$ is even, for $i \in \{1,\ldots,N\}$, $k \in \{0,1,2,\ldots\}$;
  $\Lambda_{i,k}$ = 1 if $(i \mathbin{\mathrm{div}} \sqrt{Q} + i \bmod \sqrt{Q})$ is odd and $k$ is odd, for $i \in \{1,\ldots,N\}$, $k \in \{0,1,2,\ldots\}$;
  $\Lambda_{i,k}$ = 0 otherwise,

where div is the integer division operator and mod is the remainder operator for integer division.

The $u_i$'s are assigned to points of a $\sqrt{Q} \times \sqrt{Q}$ coordinate indexed grid such that $u_j$'s with nonzero $a_{i,j}$'s are within distance one of $u_i$. Each $u_i$ is updated on odd/even values of $k$ according to $\Lambda_{i,k}$. An odd/even iteration cycle is sometimes called two half-iterations. The remainder of this chapter considers (without loss of generality) consistent multicolor orderings of length two. The pointwise SSOR method is given by the following equation for fixed $\omega$.

  $u_i^{(k)} = (1-\omega)u_i^{(k-2)} - \omega \frac{1}{a_{i,i}} \left[ \sum_{j \ne i} a_{i,j} u_j^{(k-1)} - v_i \right]$, if $\Lambda_{i,k} = 1$   (7.1)

The following theorem characterizes matrices which can be solved by (7.1).

Theorem 7-1: [Varg62] Let $A$ be a strictly or irreducibly diagonally dominant Hermitian $Q \times Q$ complex matrix. Then the pointwise SSOR method with $0 < \omega \le 1$ converges for any initial assignment vector.

  loop
    $u_i^{(k+1)} \leftarrow (1-\omega)u_i^{(k-1)} - \omega \frac{1}{a_{i,i}} \left[ \sum_{j \in \{\mathrm{North,South,East,West}\}} a_{i,j} u_j^{(k)} - v_i \right]$;
    $k \leftarrow k+1$;
  pool

  Figure 7.1. Relaxation Skeleton

7.2 NATURAL PROBLEM/SOLUTION CONSTRAINTS

Constraint predicate development proceeds as outlined in Chapter 4. First, the mathematical properties of the problem and its solution must be espoused. Theorem 7-1 allows the choice of the initial assignment vector $u^{(0)}$ to be arbitrary. The next theorem shows that certain choices can facilitate constraint predicate development.

Theorem 7-2: Let $A$ be a matrix satisfying Theorem 7-1 with positive diagonal and non-positive off-diagonal entries. Then the solution given by (7.1) will converge monotonically to the final solution $u^{(K)}$ for the following choices of $u^{(0)} = (u_i^{(0)})$, for $\omega = 1$.

1) Let $u_{\min}$ be the smallest value such that $v_i - u_{\min} \sum_{j=1}^{Q} a_{i,j} \le 0$ for all $i \in \{1,2,\ldots,Q\}$. If $u_i^{(0)} = u_{\min}$ for all $i \in \{1,2,\ldots,Q\}$, then $u_i^{(k)} \ge u_i^{(k+2)}$ for all $k \in \{0,1,\ldots\}$.

2) Let $u_{\max}$ be the largest value such that $v_i - u_{\max} \sum_{j=1}^{Q} a_{i,j} \ge 0$ for all $i \in \{1,2,\ldots,Q\}$. If $u_i^{(0)} = u_{\max}$ for all $i \in \{1,2,\ldots,Q\}$, then $u_i^{(k+2)} \ge u_i^{(k)}$ for all $k \in \{0,1,\ldots\}$.

Proof: For case 1), by induction on $k$, the iteration count. Basis, $k = 0$:

  $u_i^{(1)} = \frac{1}{a_{i,i}} \left[ v_i - \sum_{j \ne i} a_{i,j} u_j^{(0)} \right]$   (7.2)

and since

  $v_i - \sum_{j \ne i} a_{i,j} u_j^{(0)} - a_{i,i} u_i^{(0)} \le 0$   (7.3)

Since $a_{i,i} > 0$, dividing (7.3) by $a_{i,i}$, simplifying, and substituting into (7.2) yields

  $u_i^{(1)} = \frac{1}{a_{i,i}} \left[ v_i - \sum_{j \ne i} a_{i,j} u_j^{(0)} \right] \le u_i^{(0)}$   (7.4)

Continuing to the completion of the half iteration pair,

  $u_i^{(2)} = \frac{1}{a_{i,i}} \left[ v_i - \sum_{j \ne i} a_{i,j} u_j^{(1)} \right]$

By (7.4), since $a_{i,j} \le 0$ for $i \ne j$, $i,j \in \{1,2,\ldots,Q\}$, and by (7.3), $u_i^{(2)} \le u_i^{(0)}$.

Now assume $u_i^{(l)} \le u_i^{(l-2)}$ for $l = 2,3,\ldots,k-1,k$. Then for $l = k+1$,

  $u_i^{(k+1)} = \frac{1}{a_{i,i}} \left[ v_i - \sum_{j \ne i} a_{i,j} u_j^{(k)} \right]$

Since $a_{i,j} \le 0$, $i \ne j$,

  $\le \frac{1}{a_{i,i}} \left[ v_i - \sum_{j \ne i} a_{i,j} u_j^{(k-2)} \right] \le u_i^{(k-1)}$

The proof for case (2) is symmetric. □
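The iteration (7.1), with the monotone starting assignment of Theorem 7-2(1), can be exercised directly. The Python sketch below runs red/black sweeps over a small grid and asserts the monotone decrease that the theorem guarantees; the 5-point Laplacian stencil, the grid size, and $\omega = 1$ are illustrative assumptions.

    def red_black_ssor(v, s, omega=1.0, sweeps=50):
        """Red/black relaxation on an s x s 5-point Laplacian grid
        (a_ii = 4, neighbor coefficients -1), per (7.1).  Asserts the
        monotone decrease of Theorem 7-2(1) after each half-iteration pair."""
        def nbrs(i):
            r, c = divmod(i, s)
            return [r2 * s + c2 for r2, c2 in
                    ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= r2 < s and 0 <= c2 < s]
        # u_min: smallest constant start with v_i - u_min * rowsum_i <= 0
        rowsums = [4 - len(nbrs(i)) for i in range(s * s)]
        u_min = max(v[i] / rowsums[i] for i in range(s * s) if rowsums[i] > 0)
        u = [u_min] * (s * s)
        for _ in range(sweeps):
            prev = u[:]
            for color in (0, 1):                  # two half-iterations
                for i in range(s * s):
                    if (i // s + i % s) % 2 == color:
                        sigma = -sum(u[j] for j in nbrs(i))
                        u[i] = (1 - omega) * u[i] - omega * (sigma - v[i]) / 4.0
            assert all(u[i] <= prev[i] + 1e-12 for i in range(s * s))
        return u

    print(red_black_ssor(v=[-1.0] * 16, s=4)[:4])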
Thus uf") S “(k—2) S - ~ - S ufo) for k even Its") Sufk’z) s --- sup) fork odd _<. 1150) by (7.4) 88 Altemately assume that uf") v;-a,;,-umax j¢i Q 201', jumax -<- Vi-ai,tu max jaei Substituting Q Q k—l Eat/“i ) > Barium jset jest - . . - - (k—l) (k-I) . - Smce aw S 0 for 1,16 {l,2,...,Q }, u I < umax for at least one u] , a contradiction. 2) The case for the initial choice of um, is symmetric. 89 Theorem 74: Consider a sequence of ufk)’s satisfying the conditions of Theorem 7-2. From the perspective of a processor P; calculating u;, a candidate intermediate result uj") with a nonzero coefficient au, Lie {1,2, ...,Q} must satisfy the following properties Ifufo) =umin, ie {1,2, . - . ,Q}, then ulk+2’5ul")+-ul" ”hm-Etallulkm—al.ul" 1’) The case when 1150): um, ie {1,2, - . . ,Q} is symmetric. 90 Theorem 7-5: A candidate intermediate result ujk) with nonzero coefficient a“, from the perspective of processor calculating uf"), Lie {l,2,...,Q }, must satisfy the following properties. Ifufo) =umin, ie {1,2, - ° ° ,Q}, then 1 ask”) 2(1—(0)uj")-(0— Z az’jumafialfl-uEHD—v 01.1 j¢l,j¢i Similarly if ufo) = um, ie {1,2, ~ - ° ,Q}, then 115"”) S (1—(0)u§")—t0—1— k+l Z al,jumin+al,iuI )-V 01.1 j¢l.j¢i Proof: Immediate from the conditions of Theorem 7-2 and the proof of Theorem 7-4 CI. Theorems 7-4 and 7-5 guarantee a minimal and maximal amount of progress of each individual component of the solution as a function of the current state of the solu- tion. The stopping point is selected to be the iteration K at which lu§K+2)—ufK)l uSK)-—[ Zai,jqu+l)-VJ 0) i,i j¢i The case when u§0)=umax, ie {1,2, . ° ° ,Q} is symmetric [:1 Each of the theorems developed is a member of a particular subclass of constraint predi- cates. This membership is summarized in Table 7.1. Subclass Theorems Size (DP 7-2 IPT (I); 7-3 IPT (DC 7-4,7-5 IPT 7-6 GPT Table 7.1. Predicate Subclass Membership. 7.3 ERROR COVERAGE MODELING FOR MATRIX ITERATIVE ANALYSIS As in the development of the constraint predicate, the development of the modeling of each individual feature is treated separately. Feasibility Modeling of the error coverage of the feasibility constraints for the problem con- sidered in this chapter is straightforward. The feasibility events F ,-,i =1,2 are defined by 92 i F 5") : Event that uj") 2 u max for j and k under test and 5173") : Event that uf") S umin for the i and k under test Since these constraints are stationary Pr[jF,-] = Pr[jF§")] for all je{1,2,...,Q}, ke {0,1,2,...}, and i=1,2. The cumulative distribution function of the appropriate model yields the coverage proba- bilities for all steps of the problem. The corresponding probabilities are given by Pr[jF1] = Pr[u§“)2um,,] Pr[I'F2] = Pr[u§")Sumin] and Pr(iF] = Pr[um,,su§")sum,n] Progress Progress is modeled by considering the global convergence properties of the solu- tion and extrapolating this behavior back to each individual case. [Ames77] gives a bound on the rate of error reduction in a matrix iterative solution. The first step in this analysis is to determine the spectral radius pg of the iteration matrix G. It is possible to determine a close approximation to pG by examining the ratio of the norm of the error vectors between iterations. However, pg can also be calculated directly from G, while in general is not feasible due to the size of G, for analytical pur- poses is sufficient. For the red-black ordering used in this chapter, G cannot be expressed in a convenient matrix form. 
Progress

Progress is modeled by considering the global convergence properties of the solution and extrapolating this behavior back to each individual case. [Ames77] gives a bound on the rate of error reduction in a matrix iterative solution. The first step in this analysis is to determine the spectral radius $\rho_G$ of the iteration matrix $G$. It is possible to determine a close approximation to $\rho_G$ by examining the ratio of the norms of the error vectors between iterations. However, $\rho_G$ can also be calculated directly from $G$; while in general this is not feasible due to the size of $G$, for analytical purposes it is sufficient. For the red-black ordering used in this chapter, $G$ cannot be expressed in a convenient matrix form. Instead, we use the $G$ from the Gauss-Seidel consistent ordering with Successive Over-Relaxation applied. The matrix $A$ is decomposed into $A = B + D + C$, with $B$ lower triangular, $C$ upper triangular, and $D$ diagonal $Q \times Q$ matrices. Thus $G$ becomes

  $G = (D + \omega B)^{-1} \left[ (1-\omega)D - \omega C \right]$

If the complex eigenvalues of $G$ are given by $\lambda_i$ for $i \in \{1,\ldots,Q\}$, then

  $\rho_G = \max_{i \in \{1,\ldots,Q\}} |\lambda_i|$

Let the error at step $k$ of the computation be denoted by the vector $\varepsilon^{(k)} = |u^{(k)} - u|$ for $k \in \{0,1,2,\ldots\}$. For a stationary linear iteration matrix $G$ of the form considered in Theorem 7-1,

  $\|\varepsilon^{(k)}\| = \|G^k \varepsilon^{(0)}\|$

where $\|\varepsilon^{(k)}\|$ is the spectral norm of $\varepsilon^{(k)}$. If $\rho_G < 1$, for $k$ sufficiently large,

  $\|G^k\| = [\rho_G]^k$

where $\|G\|$ is the spectral norm of $G$. Combining these results yields [Ames77]

  $\|\varepsilon^{(k)}\| \le \|G\|^k \, \|\varepsilon^{(0)}\|$

Thus, for large $k$, the ratio $\|\varepsilon^{(k+1)}\| / \|\varepsilon^{(k)}\|$ averages to $\rho_G$. Therefore, on the average, the error decreases by a factor of $\rho_G$ at each step in the iteration. For modeling purposes, the final result is known. Assuming that each individual point behaves according to global convergence and using Theorem 7-2, treatment of an arbitrary point $i'$ yields the following relation. If $u_{i'}^{(0)} = u_{\min}$,

  $\rho_G^k \left( u_{\min} - u_{i'} \right) \ge u_{i'}^{(k)} - u_{i'}$,  $k \in \{0,1,2,\ldots\}$   (7.5)

where $u_{i'}$ is the correct final result of the calculation. A similar result holds if $u_{i'}^{(0)} = u_{\max}$. The events $P_i$ that define progress are:

  If $u_{i'}^{(0)} = u_{\min}$, $_{i'}P_1^{(k)}$ is the event $u_{i'}^{(k)} \le \rho_G^k(u_{\min} - u_{i'}) + u_{i'}$, $k \in \{0,1,2,\ldots\}$
  If $u_{i'}^{(0)} = u_{\max}$, $_{i'}P_2^{(k)}$ is the event $u_{i'}^{(k)} \ge u_{i'} - \rho_G^k(u_{i'} - u_{\max})$, $k \in \{0,1,2,\ldots\}$

Consistency

The consistency components of Theorems 7-4 and 7-5, like the progress component, can either be determined from a trace or by analytic means. For the analytic model, we apply Theorem 7-4 with the assumption that each $u_i$ proceeds toward the solution $u$ at the same convergence rate. Again, this is not the case in the actual solution, but it does effectively display the aggregate behavior of the solution. While Theorems 7-4 and 7-5 are directly implementable as a constraint predicate, to use them in the model requires additional simplification. The ratio $a_{l,i}/a_{l,l}$ is aggregated by replacing it with an average ratio $\bar{R}$, where $n$ is the average number of nonzero $a_{l,j}$'s per row. Using (7.5) as an equality and substituting into Theorem 7-4 yields the events $_{i'}C_1^{(k,k+2)}$ and $_{i'}C_2^{(k,k+2)}$. If $u_{i'}^{(0)} = u_{\min}$, $_{i'}C_1^{(k,k+2)}$ is the event that $u_{i'}^{(k+2)}$ satisfies the bound of Theorem 7-4 with each unknown term replaced by its convergence-model estimate $\rho_G^k(u_{\min} - u_{i'}) + u_{i'}$ and each coefficient ratio replaced by $\bar{R}$; the event $_{i'}C_2^{(k,k+2)}$ for $u_{i'}^{(0)} = u_{\max}$ is symmetric.
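The modeled progress event $_{i'}P_1^{(k)}$ is directly checkable once $\rho_G$ is estimated. The sketch below estimates $\rho_G$ from the error-norm ratios, as suggested above, and then evaluates the progress envelope; the iterate history and the slack tolerance are illustrative assumptions.

    def estimate_rho(errors):
        """Estimate rho_G as the average of successive error-norm ratios
        ||e^(k+1)|| / ||e^(k)|| over the tail of a run."""
        ratios = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
        tail = ratios[len(ratios) // 2:]        # ignore the transient
        return sum(tail) / len(tail)

    def progress_event(u_k, k, u_final, u_min, rho, slack=0.0):
        """Event P1^(k): u_i'^(k) <= rho^k (u_min - u_i') + u_i'."""
        return u_k <= rho ** k * (u_min - u_final) + u_final + slack

    # Illustrative run: errors decaying geometrically with rho = 0.8.
    errors = [1.0 * 0.8 ** k for k in range(12)]
    rho = estimate_rho(errors)
    print(round(rho, 3),
          progress_event(u_k=-0.45, k=6, u_final=-0.5, u_min=-0.2, rho=rho))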
The test in Step 2 precludes P; changing its value from an erroneous to correct state during the broadcast phase. If P; were allowed to change it’s value, it could send a value that agreed with Theorem 7 -6 thus making Q? = 0 erroneously. If P,» does not change it’s value, then Theorem 7-6 is used to diagnose 9;. Note that Theorem 7-6 does not explicitly specify which member of Q,» is faulty. Since the maximum number of faults per fault group is 1, no two processors can cooperate to fool the test of Theorem 7 -6. Furthermore, in the example under consideration, since the minimum cardinality fault group is 3, the non-faulty processors will always outvote the faulty processors. A simple majority vote is taken to determine the status of the fault group. C] Definition 7-4: Extended Fault Region K ,- = UIQj'QjflPii‘g} The extended fault region of a point i’ is simply the set of local fault groups of all neighboring points j such that u ,- uses ur in it’s calculation. Let K? = I(P,-IP,- is faulty and P,-eK,-}I and let k,- = I[jlo,-eK,-}I Theorem 7-7: For each processor Pi, P,- is faulty iff Q? =1 for all QjeK, and K? s ki-2. Proof: Necessity: Assume P,- is nonfaulty. If 52? = 0 then we are done. If not, then there exists some Pie Q,- and P is (21- such that P j is faulty. Since at most ki—2 processors may be 100 faulty and at most one is allowed per fault group (by Lemma 7-1), and since there are ki-2—1 faulty processors remaining to be distributed over the ki—2 remaining fault groups, some (2? = 0. Sufficiency: Assume that some (2,0 = 0. However, by Definition 7-4, each Die K,- con- tains P j. Thus P ,- is nonfaulty. Secondly assume that K? > ki-2. Then it is possible to have [Cg—2 faulty processors none of which are in Q,- and one faulty processor in 9,- which is not Pi. Thus P,- is nonfaulty. C] With the test provided by Theorem 7-7, we can obtain the error coverage shown in Figure 7.5. 7.4 RUN TIME PERFORMANCE Using the techniques of Chapter 6 for this problem will yield a similar performance curve to that of Figure 6.1 (p. 80). However, in matrix iterative solutions, many points will be assigned to a single processor. All of the constraint predicates developed in this chapter are applicable, but they only need to be run at those points which fall at the boun- dary of the data partitioning. Let the total number of points on the grid Q be greater than the number of processors N. Without loss of generality assume that N divides Q into a perfect square. Then Q/N = q is the number of points per processor. Continuing the assumption of coordinate grid indexed points, Figure 7.6 depicts the mapping and error checking for q = 25. The grids marked by dashed lines indicate points that are not checked by the constraint predicate as they are internal to that processor. The computational complexity calculation proceeds along the lines of Chapter 6. The unreliable algorithm, for a sub-grid of 4 elements per processor with an average of n -1 neighbors, is easily seen to have a computational complexity per iteration given by, CpNR= nq+2$L. Each constraint predicate contributes the following run time overhead, 101 % HIWIEI C 0.75 a o v e 0.50 — r a g 0.25 — e I 0 K Iteration —-) Figure 7.5. Error Coverage by Solution Step Count. Theorem 7-7 Applied. (DP 1 (DP "V67 (DC l+n\/Z The run time of the reliable algorithm under a maximum of t faults per fault group is given by CPR = 2+2nxl§ +nq+2(t+1)SL-i-(n \IE )’ +1 The ratio of the computation times of the reliable algorithm to the unreliable algorithm is: I ! 
“—" I 1 H L! __.-i--'e-._ ' ' __...-i--'u- _ ' ' _,-J--L_._ P1_1,,- _—.-.‘--i-_ ___g__;_]__ ___;__;_, P141.) : I H 1 : H 1 : j I l H [J H [T I l 1 i Mh—1--r-dh—_ I I I Phi-1 Figure 7.6. Mapping for q =25. CPR 1+2m/Z1'+nq+2(t+1)sL+(m/(7)'+1] CpNR _ "(1+25L J Let r represent the computation to communication ratio. Then r = nq/ZSL. Neglecting low order terms, for t=1 (the solution is locally tolerant of one fault per fault group 9,), C121? _ ZFSL-HISL-I-ZrSL CpNR — 2SL(r+1) 5r+2 r+1 (7.6) By setting r=0, (7.6) is easily seen to be a more general form of (6.9). Letting r=1 sets computation time equal to communication time resulting in, 103 CR: n+2 CNR. (7.7) Since n is usually very small (5 or 7), the reliable algorithm is at most 3.5 to 4.5 times more expensive than the unreliable algorithm and indeed much smaller than this since the value for q to achieve even an r=l is prohibitively large. Moreover, this analysis is overly pessimistic as it treats logical comparisons with the same weight as floating point multiplication/division. 7.5 COMMENTS AND LIMITATIONS In SSOR, the optimal relaxation factor coop, is chosen to minimize the spectral radius of the iteration matrix G. However, setting to = (00,”, does not preserve the mono- tonicity of the ug’s as was suggested in [HuAb84]. The expectation of monotonic behavior is unreasonable since SOR is specifically designed to "over-relax" some ui’s to increase the convergence rate. Values of to in the range 1 < (l) S (00,” can be found experimentally that do preserve monotonicity; however, no way is known to calculate these values even if the spectral radius is known a priori. This limits the feasibility of using constraint predicates to iteration matrices that have the spectral radius pg < 1. It is not known at this time whether a convergence envelope can be found to bound the ug’s. This is an area for further study. Error coverage can detect a significant number of errors with a fault-latency of 1. However, some errors can result in a pathological state of the system (values of “(1‘) for some k in which convergence to the final correct result takes longer than the time required to restart the solution from the initial guess). Enhanced error checking capabil- ity can come only from an increased understanding of the local convergence properties of the solution. Such work is being performed under the guise of "Local Relaxation Methods" [KuLM87, BoVe82]. Local convergence properties, as shown in this chapter, can be embodied as IPT predicates. 104 A final comment on the constraint predicate implementation concerns Theorem 7-7. One of the goals of application oriented testing is to migrate the system dependent issue to the diagnostic basis. Theorem 7-7 is an Ad Hoc approach, at best, to fault diagnosis. Optimally, control of a GPT test such as this should be migrated to the diagnostic basis. It is not understood at this point in time how to do this in a general way while still preserving the application oriented view of fault—tolerance. Clearly this is an area for further study. Chapter 8 Relaxation Labeling We now focus our attention on a non-numerical problem, Relaxation Labeling. To describe the problem of relaxation labeling, we quote from [HuZu83]: Relaxation labeling processes are a class of mechanisms that were originally developed to deal with ambiguity and noise in vision systems. The general framework, however, has for broader potential applications and implications. 
  The structure of relaxation labeling is motivated by two basic concerns: 1) the decomposition of a complex computation into a network of simple "myopic," or local, computations; and 2) the requisite use of context in resolving ambiguities...

Fundamentally, relaxation labeling assigns labels to objects in an image. Various compatibilities and incompatibilities exist between the objects. For example, consider an image which is to be labeled as a human body. If two objects are adjacent in the image and one is the object "head," then it is very likely that an adjacent object is "shoulder." Likewise, it is highly improbable that the adjacent object has the label "foot." This competition and cooperation can work together in relaxing the labelings on the objects such that eventually an unambiguous labeling is found.

Parallelization of relaxation labeling for implementation on a multiprocessor system is advantageous for large problems, that is, problems that would otherwise take a long time on a conventional SISD computer. A large problem is characterized by a large number of objects. Indeed, some applications of relaxation labeling can include objects the size of pixels in a 512x512 image, or $2^{18}$ objects [Stoc87]. Relaxation labeling lends itself easily to implementation as a data parallel algorithm [HiSt86]. The algorithm requires only local computations and data exchanges. Thus good speedup may be obtained by parallelization. As in the matrix iterative case, the properties of the intermediate results of the computation are checked for faulty/nonfaulty behavior.

8.1 PARALLELIZED WEIGHTED RELAXATION LABELING

The weighted relaxation labeling algorithm using the variational inequality method given by [HuZu83] yields a straightforward parallelization. The notation and algorithms are presented here for completeness. The basic idea is that, given an initial feasible solution, we attempt to maximize an objective function by taking small steps in the tangent direction which maximizes the directional derivative of the objective function. When the directional derivative becomes negative or zero, the objective function is at a local maximum and the procedure stops. $N$ is the number of objects and $m$ is the number of possible labels. Labels are denoted by $\lambda$ and $\lambda'$. An objective function is defined as:

  $A_i(\bar{p}) = \sum_{\lambda=1}^{m} p_i(\lambda) s_i(\lambda)$,  $1 \le i \le N$   (8.1)

where the support function $s_i(\lambda)$ is:

  $s_i(\lambda) = \sum_{j=1}^{N} \sum_{\lambda'=1}^{m} r_{i,j}(\lambda,\lambda') \, p_j(\lambda')$   (8.2)

The compatibility matrix $R_{N \times N} = [r_{i,j}(\lambda,\lambda')]$ gives the relative compatibility of label $\lambda$ at object $i$ with label $\lambda'$ at object $j$. The weight vector $\bar{p}_i$ with components $p_i(\lambda)$ is the relative weighting for label $\lambda$ at object $i$ and is constrained by:

  $\sum_{\lambda=1}^{m} p_i(\lambda) = 1$,  $0 \le p_i(\lambda) \le 1$,  $1 \le \lambda \le m$   (8.3)

The weight vector $\bar{p}_i$ is constrained to be a consistent solution for each object $i$, providing:

  $\sum_{\lambda=1}^{m} p_i(\lambda) s_i(\lambda) \ge \sum_{\lambda=1}^{m} v_i(\lambda) s_i(\lambda)$ for all labelings $\bar{v}$ satisfying (8.3)   (8.4)
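Equations (8.1)-(8.4) translate directly into code. A short Python sketch of the support computation and the consistency test follows; the toy compatibility tensor and weight vectors are illustrative assumptions.

    def support(i, r, p):
        """s_i(lambda) of (8.2): compatibility-weighted evidence from every
        object j for each label lambda at object i."""
        m = len(p[i])
        return [sum(r[i][j][lam][lam2] * p[j][lam2]
                    for j in range(len(p)) for lam2 in range(m))
                for lam in range(m)]

    def consistent(i, r, p, vertices):
        """(8.4): p_i must support itself at least as well as every competing
        labeling; checking the unit vectors e_lambda suffices, since the left
        side of (8.4) is linear in the test labeling."""
        s = support(i, r, p)
        own = sum(p[i][lam] * s[lam] for lam in range(len(s)))
        return all(own + 1e-12 >= sum(v[lam] * s[lam] for lam in range(len(s)))
                   for v in vertices)

    # Two objects, two labels; r encodes a preference for matching labels.
    r = [[[[1.0 if l == l2 else 0.0 for l2 in (0, 1)] for l in (0, 1)]
          for _ in (0, 1)] for _ in (0, 1)]
    p = [[1.0, 0.0], [1.0, 0.0]]
    vertices = [[1.0, 0.0], [0.0, 1.0]]
    print(support(0, r, p), consistent(0, r, p, vertices))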
In the following algorithm, the index i indicates the local object (processor), and the index j ranges over the neighbors. For example, s_j^k indicates the k'th iteration value of the support vector held by processor P_j. Each processor P_i, 1 ≤ i ≤ N, executes the following algorithm. It is assumed here that the mapping is ideal; that is, each processor has a direct connection to its neighbors. This assumption is not critical, and it simplifies the complexity analysis.

    Initialize:
    1)  start with a consistent initial labeling assignment p^0; k ← 0
    loop
    2a) send p_i^k to all n neighbors P_j
    2b) receive p_j^k from all n neighbors P_j
        /* nearest neighbor updating */
    2c) s_i^k(λ) = Σ_j Σ_λ' r_ij(λ,λ') p_j^k(λ')
        /* find a feasible gradient direction */
    3)  u^k = projection-operator(p^k, s^k)
    4)  if (u^k = 0) break;
    5)  p^{k+1} ← p^k + h·u^k    /* where h is some small positive value */
                                 /* that keeps p^{k+1} consistent        */
    6)  k ← k + 1
    pool

    Figure 8.1. Algorithm RLNR, relaxation labeling with no added reliability

8.1.1.1 Projection Operator Computation

Computation of the updating direction vector is presented in [MoHZ83]. It is formulated as a linear optimization problem subject to quadratic constraints. We repeat the formal problem specification here. Let IR^m be m-dimensional real space and let IK be the convex set defined by:

$$\mathrm{IK} = \left\{\vec{p} \in \mathrm{IR}^m \;\middle|\; \sum_{\lambda=1}^{m} p(\lambda) = 1,\; p(\lambda) \ge 0,\; 1 \le \lambda \le m\right\}$$

For any vector $\vec{p} \in \mathrm{IK}$, the tangent set $T_{\vec{p}}$ is:

$$T_{\vec{p}} = \left\{\vec{u} \in \mathrm{IR}^m \;\middle|\; \sum_{\lambda=1}^{m} u(\lambda) = 0,\; u(\lambda) \ge 0 \text{ whenever } p(\lambda) = 0\right\}$$

The set of feasible directions to move at point $\vec{p}$ is:

$$F_{\vec{p}} = T_{\vec{p}} \cap \left\{\vec{u} \in \mathrm{IR}^m \;\middle|\; \|\vec{u}\| \le 1\right\}$$

The subproblem to be solved by the projection operator algorithm is: given a current feasible weighting vector $\vec{p}$ and a current arbitrary direction $\vec{s} \in \mathrm{IR}^m$, find $\vec{u} \in F_{\vec{p}}$ such that $\vec{s}\cdot\vec{u} \ge \vec{s}\cdot\vec{v}$ for all $\vec{v} \in F_{\vec{p}}$.

    projection-operator(p, s)
    1)  D ← {λ | p(λ) = 0}
    2)  S_0 ← {}; k ← 0
    loop
    3)  t ← (Σ_{λ∉S_k} s(λ)) / (m − |S_k|)
    4)  S_{k+1} ← {λ ∈ D | s(λ) < t}
    5)  if (S_{k+1} = S_k) break;
    6)  k ← k + 1
    pool
    7)  u(λ) ← 0 if λ ∈ S_k; s(λ) − t otherwise
    8)  u ← 0 if u = 0; u/‖u‖ otherwise

    Figure 8.2. Algorithm projection-operator

Note here that the vector $\vec{u}$ is normalized only locally. Since it is stated in the original paper that normalization over the entire problem is not a requirement for convergence, the local normalization makes the parallel algorithm possible.
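For concreteness, here is a C rendering of projection-operator as reconstructed in Figure 8.2. This is a sketch under that reading, not the dissertation's code; the fixed-point loop over the index set S and the trailing local normalization follow the gradient projection method of [MoHZ83].

    #include <math.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* p, s: current weights and supports (length m); u: output direction.
     * An all-zero u signals a local maximum.
     */
    void projection_operator(size_t m, const double *p,
                             const double *s, double *u)
    {
        bool S[m];                       /* S_k membership, C99 VLA */
        for (size_t l = 0; l < m; l++) S[l] = false;

        double t;
        for (;;) {                       /* iterate until S_{k+1} = S_k */
            size_t live = 0;
            t = 0.0;
            for (size_t l = 0; l < m; l++)
                if (!S[l]) { t += s[l]; live++; }
            t /= (double)live;           /* live >= 1: labels with p > 0
                                            can never enter S            */

            bool changed = false;        /* S_{k+1} = {λ ∈ D | s(λ) < t} */
            for (size_t l = 0; l < m; l++) {
                bool in = (p[l] == 0.0) && (s[l] < t);
                if (in != S[l]) { S[l] = in; changed = true; }
            }
            if (!changed) break;         /* t only rises, so S grows
                                            monotonically and terminates */
        }

        double norm = 0.0;               /* u(λ) = s(λ) − t off S, else 0 */
        for (size_t l = 0; l < m; l++) {
            u[l] = S[l] ? 0.0 : s[l] - t;
            norm += u[l] * u[l];
        }
        norm = sqrt(norm);
        if (norm > 0.0)                  /* step 8: local normalization */
            for (size_t l = 0; l < m; l++)
                u[l] /= norm;
    }

A Byzantine neighbor can still lie about its weights, but the operator's output is checkable: any correct execution must satisfy the conditions proven in Lemma 8-2 below.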
8.1.2 PERFORMANCE EVALUATION

Speedup can be defined as the ratio of the sequential complexity to the parallel complexity. For each processor P_i, the complexity of the projection-operator routine is O(m^2). All other steps for one iteration are O(m), with the exception of the calculation of the support vector, which is O(nm). Thus the complexity of a parallel iteration is O(nm + m^2 + 2S_L). If a single processor works on all the calculations, then an iteration is O(N(nm + m^2)). Thus the speedup obtained by parallelization is the ratio:

$$O\!\left(\frac{N(m^2+nm)}{m^2+nm+2S_L}\right)$$

If N is large, the speedup is significant.

8.2 CONSTRAINT PREDICATE

The mathematical theory surrounding the application must be mature to facilitate development of an appropriate testing predicate. Such is the case with the relaxation labeling procedure given by [HuZu83], which is essentially a constrained local maximization problem with the direction of updating solved by the method of feasible directions [Zout76]. We prove a series of lemmas which correspond to the components of the constraint predicate.

Lemma 8-1: At each iteration, the following feasibility conditions Φ_F hold:

$$\sum_{\lambda=1}^{m} p_i(\lambda) = 1 \quad\text{and}\quad 0 \le p_i(\lambda) \le 1, \qquad 1 \le \lambda \le m$$

Proof: Since Step 5 of algorithm RLNR in Figure 8.1 chooses h such that $\vec{p}^{\,k+1}$ remains consistent, each $\vec{p}^{\,k+1}$ must satisfy the consistency conditions. Since the consistency constraints form a compact convex set over IR^m, all intermediate solutions must remain within a convex cone in m-space [Zout76]. This "funnels" the solution to the local attractor. □

Lemma 8-2: At each iteration, the following additional feasibility conditions Φ_F are maintained:

a) $\sum_{\lambda=1}^{m} u_i(\lambda) = 0$

b) $(\vec{s} - \vec{u})\cdot\vec{u} = 0$

Proof:
a) Since $\vec{u}$ lies in $T_{\vec{p}}$, $\sum_{\lambda=1}^{m} u_i(\lambda) = 0$ for all feasible $\vec{p}$.
b) Step 3 of projection-operator performs an orthogonal projection of $\vec{s}$ onto the feasibility space $F_{\vec{p}}$. Thus $(\vec{s}-\vec{u})\cdot\vec{u} = 0$. □

Lemma 8-3: Each iteration is subject to the following progress conditions Φ_P:

a) If $\vec{u} \in F_{\vec{p}}$ and $\|\vec{u}\| = 1$, then $\vec{u}$ maximizes the gradient direction change.

b) If $\vec{u}$ is a correct result of projection-operator and $\vec{u} \ne 0$, then $\vec{s}^{\,k+1}\cdot\vec{p}^{\,k+1} > \vec{s}^{\,k}\cdot\vec{p}^{\,k}$.

c) Convergence occurs in a finite time within a suitable neighborhood of the solution under conditions of strict consistency.

Proof:
a) Immediate from the problem definition of projection-operator.
b) If $\vec{s}^{\,k+1}\cdot\vec{p}^{\,k+1} < \vec{s}^{\,k}\cdot\vec{p}^{\,k}$, then for some $\vec{v} \in F_{\vec{p}}$, $\vec{s}\cdot\vec{v} > \vec{s}\cdot\vec{u}$; thus $\vec{u}$ was not a correct result from projection-operator. If $\vec{s}^{\,k+1}\cdot\vec{p}^{\,k+1} = \vec{s}^{\,k}\cdot\vec{p}^{\,k}$, then by Step 5 of RLNR, since h > 0, $\vec{u} = 0$.
c) Immediate from Theorem 9.1 of [HuZu83]. □
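Lemmas 8-1 through 8-3 translate directly into executable assertions. The following C sketch is illustrative only (the function names, the tolerance EPS, and the packaging are assumptions, not the dissertation's code): it checks the Φ_F conditions on a received weight vector and direction, and the Φ_P progress condition on successive objective values.

    #include <math.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define EPS 1e-9   /* assumed floating point tolerance */

    /* Phi_F, Lemma 8-1: the weights form a consistent labeling. */
    bool phi_f_weights(size_t m, const double *p)
    {
        double sum = 0.0;
        for (size_t l = 0; l < m; l++) {
            if (p[l] < -EPS || p[l] > 1.0 + EPS) return false;
            sum += p[l];
        }
        return fabs(sum - 1.0) < EPS;
    }

    /* Phi_F, Lemma 8-2: the direction u sums to zero and is an
     * orthogonal projection of s, i.e., (s - u) . u = 0. */
    bool phi_f_direction(size_t m, const double *s, const double *u)
    {
        double sum = 0.0, dot = 0.0;
        for (size_t l = 0; l < m; l++) {
            sum += u[l];
            dot += (s[l] - u[l]) * u[l];
        }
        return fabs(sum) < EPS && fabs(dot) < EPS;
    }

    /* Phi_P, Lemma 8-3b: the objective s.p must strictly increase
     * whenever the reported direction was nonzero. */
    bool phi_p_progress(double obj_prev, double obj_next, bool u_was_zero)
    {
        return u_was_zero || obj_next > obj_prev;
    }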
8.2.1 Predicate Generation and Proof

The correctness and completeness of the predicate is proven in the following theorem. The metric for correctness requires that if the algorithm RLNR, under no faults, produces a correct solution, then the reliable algorithm RLR, under a locally bounded number of faults t, also produces the same correct solution within some predetermined error tolerance ε_0. It will be shown that the constraints given in Lemmas 8-1 through 8-3 are sufficient for predicate correctness if one constraint is added. This constraint has nothing to do with the problem theory. It is possible for a Byzantine processor to arrange the vectors $\vec{u}$ and $\vec{s}$ so as to erroneously signal early stopping. Thus, in the case of early stopping, the last calculation must be replicated. If the replication also produces a stopping point, then that solution is indeed the correct solution. The method of proof is by considering all possible movements of the vector $\vec{p}$ during the iterations of the labeling algorithm.

Theorem 8-1: A predicate Φ that embodies the features Φ_P and Φ_F proven in Lemmas 8-1 through 8-3, plus the early stopping test, will constrain Byzantine behavior such that if the unreliable solution computed by RLNR is correct, the reliable solution with predicate Φ is also correct within the error tolerance ε_0.

Proof: Consider a point $\vec{p}^{\,k}$ that satisfies Lemma 8-1. A movement by $\vec{u}$ may be to any point $\vec{p}^{\,k+1}$ in m-space. Consider a movement by processor P_j (object j) that does not change $\vec{s}_j^{\,k}$, such that $\vec{p}_j^{\,k+1}$ is moved towards the local maximum, and $\vec{p}_j$ has not already been found to be a solution. There are two cases:

1) Lemma 8-3b is violated (Figure 8.3). If $\vec{u}_j^{\,k+1} \ne 0$, then processor P_j has made a "nonconvergence error." If $\vec{u}_j^{\,k+1} = 0$, then we must invoke the early stopping test. The members of the local fault group "re-run" the last iteration based on the values $\vec{p}_j^{\,k}$ and $\vec{s}_j^{\,k}$. If the replicated calculation is the same as the original $\vec{p}_j^{\,k+1}$, $\vec{u}_j^{\,k+1}$, then $\vec{p}_j^{\,k+1}$ is a solution and processor P_j stops.

2) $\vec{p}_j$ satisfies Lemma 8-3 but does not satisfy Lemma 8-1. Thus $\vec{p}_j$ has moved "outside" of the solution space (Figure 8.4) and violated the consistency conditions.

[Figure 8.3. Non-maximizing movements (infeasible region, feasibility cone, solution point).]

[Figure 8.4. Movement outside of feasibility cone (infeasible region, feasibility cone, solution point).]

If $\vec{p}^{\,k+1}$, $\vec{u}^{\,k+1}$ leads to a correct movement of $\vec{p}_j^{\,k}$ but does not make the maximal movement available, then Lemma 8-2 and Lemma 8-3a are violated. Finally, if Lemmas 8-1 and 8-2 are satisfied, the conditions of Lemmas 8-3a and 8-3b are satisfied, and condition 8-3c is violated, then the Byzantine resource is attempting to delay convergence through zigzagging (Figure 8.5). However, zigzagging may occur as a result of errors even if the algorithm employs an anti-zigzagging procedure. Even though the solution is making progress by virtue of condition (a), for all practical purposes it could be making progress as small as a 1-bit change in $\vec{p}_j$. Clearly this would cause the solution to run a very long time. However, we know that convergence occurs within a finite amount of time, so the following test is utilized: if

$$\vec{s}_j^{\,k+1}\cdot\vec{p}_j^{\,k+1} - \vec{s}_j^{\,k}\cdot\vec{p}_j^{\,k} < \varepsilon^{k+1}$$

then a convergence error is flagged.

[...]

The components Φ_P, Φ_F, and Φ_C, corresponding to the progress, feasibility, and consistency predicate subclasses, are discussed in the following paragraphs.

    /* S_R, executed by node, 0 ≤ node ≤ N−1. Local starting value in a. */
    Procedure S_R
        LBS[1] := a; lmask := 2^i;
        for i := 0 to n−1 do
            ...

Φ_P ensures that algorithm termination will occur and that, at each testable step of the solution, the state of the solution advances toward the goal or final solution of the problem. If this progress is not made, then faulty behavior may indefinitely postpone the solution - any solution - even an incorrect solution. For iterative convergent problems, the progress component involves reduction of error. For the bitonic sorting problem, the number of steps is known a priori to all participants in the algorithm. Thus any early termination is considered an error. The testable step for Φ_P is at the bottom of the outer loop i in algorithm S_NR. By Lemma 9-2 this must be a bitonic sequence of length 2^{i+1}. The progress test Φ_P is given in Figure 9.5a. The sequence BS_i is a bitonic sequence of length 2^{i+1} in SC^i_node.

    Procedure Φ_P(BS_i)
        for k := SC_S^i to SC_S^i + 2^i − 1
            if (ascending)
                if (BS[k+1] < BS[k]) then ERROR;
            else
                if (BS[k+1] > BS[k]) then ERROR;
        if (i ≠ n)
            for k := SC_S^i + 2^i to SC_E^i − 1
                if (ascending)
                    if (BS[k+1] > BS[k]) then ERROR;
                else
                    if (BS[k+1] < BS[k]) then ERROR;

    Figure 9.5a. Algorithm Φ_P, the progress test

[...]

    Procedure bit_compare(i, j)
        /* x>>y is a bitwise right shift of x by y bits */
        limit := min(SC_E^{i,j}, N)
        mask := vect_mask(i, j, source); omask := mask;
        mask := mask >> SC_S^{i+1,j}; lmask := lmask >> SC_S^{i,j};
        for k := SC_S^{i,j} to limit {
            if (mask&01 && !(lmask&01))
                LBS[k] := lbuf[k];
            else if (mask&01 && lmask&01)
                if (LBS[k] ≠ lbuf[k]) ERROR;
            mask := mask >> 1; lmask := lmask >> 1;
        }
        return(omask);

    Procedure vect_mask(i, j, node)
        d := 2^j;
        if (j = i)
            if (node mod (d << 1) < d)
                ...

[...]

Taken together, Φ_P, Φ_F, and Φ_C detect all errors from the Byzantine fault class committed by one faulty processor. Thus the reliable bitonic sort algorithm is fail-stop, using components which may fail in Byzantine ways. The result of the calculation is either completely correct, or the entire system halts with an error condition. We are guaranteed that under a single processor failure we will never receive an incorrect sorting result. Furthermore, unless an error occurs, the host is never involved in the calculation. This is ideal from a performance standpoint since, as mentioned earlier, the host can become a bottleneck.
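As an illustration of how Φ_P's monotonicity checks look in executable form, here is a hedged C sketch. The array bounds, the stage length 2^{i+1}, and the direction flag are assumptions drawn from the reconstruction of Figure 9.5a above, not the dissertation's code.

    #include <stdbool.h>
    #include <stddef.h>

    /* Progress check for one node's view of stage i: bs[0 .. len-1],
     * with len = 2^(i+1), must rise over its first half and fall over
     * its second half (mirrored when ascending is false).  The pair
     * straddling the peak is deliberately unconstrained.  Returns
     * false where Figure 9.5a would raise ERROR.
     */
    bool phi_p_bitonic(const int *bs, size_t len, bool ascending)
    {
        size_t half = len / 2;

        for (size_t k = 0; k + 1 < half; k++)       /* first half  */
            if (ascending ? bs[k + 1] < bs[k] : bs[k + 1] > bs[k])
                return false;
        for (size_t k = half; k + 1 < len; k++)     /* second half */
            if (ascending ? bs[k + 1] > bs[k] : bs[k + 1] < bs[k])
                return false;
        return true;
    }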
Clearly, there will be a performance penalty to pay for the increased reliability of algorithm S_R over the unreliable S_NR. One may question how much overhead is introduced, and whether it might be better to simply send all the data to the host, let the host sort the data, and return the final result to the node processors. Another possibility is to send all the data to the host, sort the data in the node processors, and send the results to the host for verification. As we show in the next section, the performance of both these possibilities becomes unreasonable for even moderately large problems.

9.5 TIME AND SPACE COMPLEXITY

As mentioned in Section 9.2, algorithm S_NR has a time complexity of O(log₂²N). Sequential sorting, on the other hand, has a lower bound of O(N log₂N). The communication complexity (since the data must be transferred from the nodes to the host) is O(N) for sequential sorting. The communication complexity for S_NR is O(log₂²N). Thus it is expected that the fault tolerant algorithm S_R will have time/communication complexity between these two.

Lemma 9-6: Algorithm vect_mask(i,j) has time complexity O(2^{i−j}).

Proof: Let the running time of vect_mask(i,j) be T_VM(i,j). If i = j, then four logical bit operations are performed and T_VM(i,i) = O(2^0) = O(1). If i > j, then two executions of vect_mask(i,j+1) are performed. Thus we have the linear recurrence

$$T_{VM}(i,j) = 2\,T_{VM}(i,j+1)$$

Solving this yields T_VM(i,j) = 2^{i−j} T_VM(i,i) = O(2^{i−j}). □

Φ_C(i,j) has complexity O(2^{j+1} + 2^{i−j}). Since each iteration contains time for two Φ_C executions (because the computations cannot be overlapped), we have

$$2\sum_{j=0}^{i}\left(2^{j+1} + 2^{i-j}\right) = 2^{i+3} + 2^{i+2} - 6 = O(2^{i+3})$$

By Lemma 9-7, bit_compare has complexity 2^i, and updating LBS takes 2^i; so for a single iteration of i we have a run time of O(2^{i+3} + 2^i + 2^i). Summing over all i from 0 to n−1 and adding in the final verification step, we have

$$\sum_{i=0}^{n-1}\left(2^{i+3} + 2^i + 2^i\right) + 2^{n+2} = 2^{n+3} + 2^{n+2} + 2^{n+1} - 10 = O(2^{n+3})$$

Since n = log₂N, the run time of S_R is O(2^{log₂N+3}) = O(N). The communication complexity of the main loop, as in S_NR, is O(log₂²N). The final verification stage adds log₂N communication, so the order of S_R remains unchanged from S_NR.

For comparison purposes, a sequential "sorting" algorithm was constructed for the host. "Sort" is quoted since we implement this "sort" as a single if statement executed N log₂N times, to achieve the theoretical minimum. O(N) communication is required to send/receive the sorted data. Additionally, a sequential verification was constructed. In this algorithm, the initial data is sent to the host, sorted by the node processors, and the sorted data also sent to the host. The host then implements Theorem 4-1 to verify the results. This takes O(N) communication complexity. It also takes O(N log₂N) computational complexity, since the matching of the ordered and unordered lists becomes equivalent to finding a permutation in the sense of Definition 4-1. Thus, for the following discussion, the best sequential algorithm is O(N log₂N).

The algorithms S_NR, S_R, and a sequential sort were implemented on 32 nodes of a 64-node Ncube DMMP [HMSC86] to sort 32-bit integers into ascending order (implementation constraints forced consideration of the smaller subcube). Timings were obtained for all three algorithms for problem sizes of 4, 8, 16, and 32 nodes. This is shown in Figure 9.7.

[Figure 9.7. Sorting time comparisons: theoretical and observed run times for S_R, S_NR, and the host sort at problem sizes 4 through 32.]
The execution (observed) results are inconclusive, since the cube we have available is very small. Asymptotically, we certainly expect S_R to be more efficient than the host sort, but the constant multiplier in the run time order dominates for these small problem sizes. Measurement of the running time for each component of the two algorithms yields the following table (measured in clock ticks).

    Algorithm     Communication Time     Computation Time
    S_R           8 log₂²N + 250N        0.72(16N)
    Sequential    14N                    0.45 N log₂N

This behavior is plotted as (Theoretical) in Figure 9.7. Comparison with the observed values indicates this approximation is close to the actual run time for smaller cube sizes. In the projected run times of Figure 9.8, S_R rapidly becomes more efficient for the size of cubes that we are concerned with in a real DMMP application. The portion of this plot covered by Figure 9.7 is highlighted in the lower left corner of Figure 9.8. These projected run times indicate that the penalty paid for fault tolerance in parallel sorting is less than the cost of sequential sorting in the host.

[Figure 9.8. Projected sorting time comparisons for large systems: projected run times for S_R and the host sort at problem sizes 32 through 1024.]

9.6 CHAPTER SUMMARY

This chapter has presented a fault-tolerant parallel sorting algorithm developed using the application oriented fault-tolerance paradigm. The algorithm is tolerant of one faulty processor node, or of faults in a single processor's incident communication links. The addition of reliability to the sorting algorithm results in a performance penalty. While the experiments on the small cube available were unable to demonstrate that fault tolerant sorting is more efficient than simply sorting in the host, asymptotically the developed fault tolerant algorithm is less costly than host sorting. Experiments on a larger cube are necessary to verify this behavior. Additionally, the performance of the three algorithms under a bitonic sort/merge is interesting to study. In this algorithm, each node holds not one element but a small, though non-negligible, number. The sequences are sorted locally and then merged in parallel to form bitonic sequences. This bitonic sort/merge procedure is an easy extension of S_NR. To extend S_R to this sort/merge case is more difficult. Simply growing the present algorithm will undoubtedly result in unacceptable run time and space requirements. To reduce the run time complexity, the computation/communication ratio will shift towards more communication of shorter messages.

This chapter has demonstrated that the application oriented fault-tolerance paradigm is applicable to problems of a non-iterative nature. Again, all that is necessary for successful algorithm development is a sufficient set of natural problem constraints; in this case, the bitonic nature of the intermediate results.

Chapter 10
Summary and Directions for Future Research

This chapter summarizes the major contributions of this thesis and outlines directions for future work.

10.1 SUMMARY OF MAJOR CONTRIBUTIONS

This work has been motivated by a need for both hardware and software reliability in large scale DMMP systems. From a hardware perspective, the thousands of components in the DMMP will result in low overall system reliability [McNi88a]. Thus, system-level techniques must be employed to achieve reliable system operation.
However, traditional system level techniques such as N-Modular Redundancy are too expensive to implement. The problem of providing reliable software is not as easy to quantify, but its need is easy to see. The detection of (and recovery from) design/coding faults is at the heart of software fault-tolerance. However, applying software fault-tolerance techniques to a DMMP requires a different approach [McNi88b]. Both of these problems may be solved by the unified method proposed by this research, the application oriented fault-tolerance paradigm.

The application oriented fault-tolerance paradigm is a completely novel way to provide system level reliability. It is necessarily an application based software approach to fault-tolerance. Thus, most of the "system-oriented" issues must be isolated from the applications programmer. These system oriented routines include reliable broadcast and detection support, and form the lower layers of the reliable parallel processing model. The application and its interface to reliability form the higher layers of this model. The level of coupling between these two layers is low. The applications programmer then must specify only an abstraction of fault-tolerance that is based on the application at hand.

A systematic way of generating executable assertions for the DMMP environment is essential to the application oriented fault-tolerance paradigm. The three predicate subclasses of progress, feasibility, and consistency provide a systematic technique to formulate the constraint predicate. The constraint predicate is embedded in the actual program for both hardware and software fault-detection. Thus, an application with an embedded constraint predicate constrains processor and software behavior to remain within predefined limits. This is an effective paradigm for many types of parallel computing applications, as demonstrated by its application to the "bread and butter" algorithms of parallel computation, namely, matrix iterative analysis [McNi87a, McNi88b], computer vision [McNi88a], and parallel sorting [McNi88c]. The only known restriction on applicability of the constraint predicate paradigm is that the applications problem must contain sufficient natural constraints.

Since the components of the constraint predicate are executed by a number of reliable processors, the diagnosis is guaranteed to locate the faulty component. The constraint predicate relies upon the distributed diagnostic basis to communicate its test results. There are clearly two different possible approaches to a distributed diagnostic basis. The tight coupling with the application oriented fault-tolerance paradigm favors the use of a masking basis such as Byzantine Agreement and rejects the use of syndrome testing. The algorithm Vector Byzantine Agreement [McNi87b] functions as an efficient diagnostic basis for the DMMP environment under Byzantine fault conditions. Syndrome testing can never be complete in this manner, only correct [McEN88].

There will be a run time penalty to be paid for the use of reliability. There is a trade-off, based on expected run time, between using reliability and running an unreliable algorithm. For very short problems, it is better to run with no reliability than to incur the overhead of a reliable parallel algorithm. For larger problems there is a crossover point at which it is more effective to run with the reliability penalty and be guaranteed a correct solution within a bounded run time.
Both asymptotically and practically, since the reliable solutions cost only a linear factor of the fault free run time, as opposed to an exponentially growing run time without reliability, the reliability methods developed in this research provide the necessary cost effective solution to the problem of reliability in parallel computation.

Not only is run time an important metric in judging the effectiveness of any reliability scheme, but so is the expected coverage of errors. As shown in the error coverage section of matrix iterative analysis, seventy-five percent of all errors can be detected immediately, and total coverage can be achieved at algorithm termination. Similar results exist for bitonic sorting; the reliable algorithm never allows an incorrect result to be produced.

In summary, this work has accomplished its objectives. It is clear that this area will continue to receive much study. The following sections outline the directions this work should take.

10.2 Configuration Control

This thesis has presented a new paradigm to provide both hardware and software reliability in a faulty large scale DMMP, and has concentrated mainly on the fault-detection aspect of the problem. However, to provide a reliable system, the problem of continuing execution in the presence of failed components must be addressed, as well as the problem of providing reliable Remote Procedure Calls (RPC) in the unreliable environment.

The RPC scheme proposed by [BiNe84] has a drawback in the unreliable environment. From the application level, RPCs are treated as simple function calls. If the called process fails, the procedure call simply blocks forever. To alleviate this problem, timeouts are employed. The necessary bounds for timeouts in the DMMP environment are (1) a bound on the maximum processor clock drift Ψ, and (2) a bound on the maximum communication delay Δ. The problem with specification of these values is that they are configuration dependent. The application programmer level should not be concerned with even the existence of these values. Moreover, these values are primarily used by the distributed diagnostic basis. Thus, in the reliable parallel processing model, the Configuration Control section of Figure 1.4 (p. 17) is responsible for computing these values. A sketch of such a timeout-guarded call appears at the end of this section.

Dynamic redundancy refers to the ability of a system to continue to function in the presence of failed components. Masking of errors is limited as a fault-tolerance technique. Consider again IBM's GF11 computer [Ager88], whose operational design parameters dictate single problem run times on the order of a year in length. If only fault-masking is used as a fault-tolerance technique, eventually enough elements will fail to invalidate fault masking. Furthermore, masking of errors has the added difficulty that the reporting and knowledge of fault occurrences is also masked [Hopk77]. Thus, instead of masking faults, we wish to remove the faulty component from the system and reconfigure around it. Additionally, as a maintenance activity, the faulty component should be repaired or replaced. In [YaHa84], this recovery consists of three phases: fault diagnosis, system reconfiguration, and operational recovery.
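To illustrate, the following C sketch shows one way a timeout-guarded RPC might be layered on the Ψ and Δ bounds. The wrapper, the bound T = 2Δ + Ψ, and all names here are illustrative assumptions, not part of the reliable parallel processing model's specification.

    #include <stdbool.h>

    /* Illustrative only: a call guarded by a timeout derived from the
     * configuration-dependent bounds.  delta bounds one-way message
     * delay; psi bounds clock drift between caller and callee.  A
     * round trip then takes at most 2*delta as read on a clock that
     * may drift by psi, so a reply later than the deadline implies a
     * failed callee.
     */
    typedef struct { double delta, psi; } config_bounds_t;

    /* Assumed primitives supplied by the message layer. */
    extern void   rpc_send(int callee, const void *req);
    extern bool   rpc_try_receive(int callee, void *reply);
    extern double local_clock(void);

    bool rpc_call(const config_bounds_t *cb, int callee,
                  const void *req, void *reply)
    {
        double deadline = local_clock() + 2.0 * cb->delta + cb->psi;

        rpc_send(callee, req);
        while (local_clock() < deadline)
            if (rpc_try_receive(callee, reply))
                return true;            /* reply arrived in time */
        return false;                   /* timeout: report to the
                                           diagnostic basis       */
    }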
10.2.1 Reconfiguration

Centralized reconfiguration control is the easiest to implement. However, this technique is unacceptable in a distributed system because it introduces a single point of failure. We instead turn to decentralized reconfiguration control.

There are two approaches to reconfiguration. The fault group may attempt to replenish its ranks with a spare. This is advantageous, as the spare can pick up the duties of the failed processor and the algorithm execution can continue with no performance degradation. The drawback to this approach is that standby spares must be made available, and that some hardware mechanism must exist, as in the MPP [Batc80], to map the spare component into the system topology. The alternative to this approach is to make use of the applications oriented environment. In the applications environment, since we have the concept of a local fault group, it is possible that reconfiguration can be done in a local, distributed manner. The reconfiguration involves not replacing the hardware component, but remapping portions of the application to fault-free components. This task, in the reliable parallel processing model, is handled by the configuration control layer.

The function of the configuration control layer is also to recompute the Δ and Ψ timeout values. Even in the latest DMMP systems, the second generation hypercubes, the hop delay is still non-negligible. Remapping portions of the application will necessitate propagation of the new timeout values to the remaining non-faulty processors. This is to be hidden as much as possible from the applications layer.

It is not hard to see that the results of this research can be applied to more general usage of parallel computers that have failed components. For example, current techniques for running with a faulty hypercube system involve not executing around the failed component, but instead finding a smaller, completely functional subcube. However, this problem is not easily solvable and indeed is polynomially complete [DuHa88]. The reconfiguration control proposed here does not attempt to find a completely functional topology, but rather works within the existing topology containing failed components.

10.2.2 Recovery

Coupled with the problem of reconfiguration is operational recovery: continued execution in the presence of a changed topology due to failed and logically removed hardware components. Recovery involves restarting the calculation after reconfiguration has occurred. Current techniques require a reliable backing store. However, in the applications environment, the information obtained by members of the fault group of a failed processor, as required by the constraint predicate, may also be sufficient information to restart the calculation from the point of failure.

10.3 AUTOMATED CONSTRAINT PREDICATE GENERATION

The application oriented reliability paradigm currently consists of a set of basis metrics for constraint predicate extraction. These are applied at the specification phase of the software life cycle by the programmer. Ideally, the salient reliability features composing the constraint predicate would be extracted from the applications problem in an automated fashion. While this research has made significant progress in generation techniques for a constraint predicate, the application oriented reliability paradigm is still in its infancy. Two primary reasons exist for this. The first is that, while human identification of the required constraint predicate features is tractable, it still takes an in-depth knowledge of the application to extract the features. The second problem is that specification of the constraint predicate and its insertion into the code is a manual task. Both of these problems, however, can be addressed by Computer Aided Software Engineering (CASE) tools.
The problem of application knowledge is not as severe as it might first seem. While expert knowledge is required of a particular application, this knowledge transfers easily between similar instances of a problem. Thus it is quite feasible to work with the idea of problem classes, i.e., the class of relaxation problems, the class of sorting problems, etc. A CASE tool can consist of an expert system with various experts that can be employed based on the class of problem under consideration. Insertion of the constraint predicate can occur as a result of the same expert system.

The major remaining problem is to generate a uniform representation of the requisite abstraction of reliability information necessary for constraint predicate generation. Given a uniform representation, the tool of Automated Reasoning can be applied to generate the constraint predicate. However, automated reasoning is still very "class-based," i.e., it works for well-defined applications. Additionally, if the representation of the reliability abstraction is similar, save for a syntactic transformation, to the constraint predicate specification, then nothing is gained through automation. Thus, in the best case, while automated reasoning can provide help in specification of the constraint predicate, it is suspected that identification of the abstraction points will remain a human activity.

10.4 APPLICATION APPLICABILITY

Parallel iterative relaxation methods form a large component of all parallel algorithms. The local nature and superior natural constraints of these algorithms yield good constraint predicates. For other algorithms, the transition is not as clear. Bitonic sorting, for instance, has no such "nice" local properties. Sequential assertions, such as the sorting assertion of Randall [Rand75], cannot be implemented in the DMMP environment, since no single processor has a complete view of the data space. However, through use of the constraint predicate paradigm, a reliable parallel algorithm was created [McNi88c].

It is not clear at this time exactly which problems lend themselves to application oriented reliability, other than the general guideline of those with suitable natural constraints. A better classification scheme is necessary. This will continue to be a research area in the field. This is clearly an interdisciplinary study, which ideally is conducted in concert with practitioners of each application field.

REFERENCES

[AdBC82] Adrion, W., Branstad, M. and Cherniavsky, J., "Validation, Verification, and Testing of Computer Software," Computing Surveys, Vol. 14, No. 2, June 1982, pp. 159-192.

[Ager88] Agerwala, Tilak, Invited Talk, Michigan State University, April 1988.

[Ames77] Ames, William, Numerical Methods for Partial Differential Equations, Academic Press, New York, 1977.

[AnMa67] Anderson, J. and Macri, F., "Multiple Redundancy Applications in a Computer," Proceedings of the 13th International Symposium on Fault-Tolerant Computing, Washington DC, January 1967, pp. 553-562.

[Aviz71] Avizienis, A., et al., "The STAR Computer: An Investigation Into the Theory and Practice of Fault-Tolerant Computing," IEEE Transactions on Computers, Vol. C-20, November 1971, pp. 1312-1321.

[AyOz87] Aykanat, C. and Ozguner, F., "A Concurrent Error Detecting Conjugate Gradient Algorithm on a Hypercube Multiprocessor," 17th Symposium on Fault Tolerant Computing, Pittsburgh, PA, July 1987, pp. 204-209.
[Batc68] Batcher, K., "Sorting Networks and Their Applications," Proc. of the 1968 Spring Joint Computer Conference, Vol. 32, AFIPS Press, Reston, VA, pp. 307-314.

[BiNe84] Birrell, A. and Nelson, B., "Implementing Remote Procedure Calls," ACM Transactions on Computer Systems, Vol. 2, No. 1, February 1984, pp. 63-75.

[BoVe82] Botta, E. and Veldman, A., "On Local Relaxation Methods and Their Application to Convection-Diffusion Equations," Journal of Computational Physics, Vol. 48, 1982, pp. 127-149.

[DaMa84] Dahbura, A.T. and Masson, G.M., "An O(n^2.5) fault identification algorithm for diagnosable systems," IEEE Trans. Comput., Vol. C-33, July 1984, pp. 486-492.

[DLPS86] Dolev, D., Lynch, N., Pinter, S., Stark, E. and Weihl, W., "Reaching Approximate Agreement in the Presence of Faults," Journal of the ACM, Vol. 33, No. 3, July 1986, pp. 499-516.

[Dole82] Dolev, D., "The Byzantine Generals Strike Again," Journal of Algorithms, Vol. 3, 1982, pp. 14-30.

[DoSt82] Dolev, D. and Strong, H., "Polynomial algorithms for multiple processor agreement," IBM Research, San Jose, 1982.

[DuHa88] Dutt, S. and Hayes, J., "On Allocating Subcubes in a Hypercube Multiprocessor," Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Pasadena, CA, January 1988, to appear.

[DwLS88] Dwork, C., Lynch, N. and Stockmeyer, L., "Consensus in the Presence of Partial Synchrony," Journal of the ACM, Vol. 35, No. 2, April 1988, pp. 288-323.

[FiLP85] Fischer, M., Lynch, N. and Paterson, M., "Impossibility of Distributed Consensus with One Faulty Process," Journal of the ACM, Vol. 32, No. 2, April 1985, pp. 374-382.

[Garc82] Garcia-Molina, H., "Elections in a Distributed Computing System," IEEE Transactions on Computers, Vol. C-31, No. 1, January 1982, pp. 48-59.

[Gend87] Gendreau, T., A Unified Environment for Distributed Computing, Ph.D. Thesis, Department of Computer Science, Michigan State University, January 1987.

[GrRe86] Grunwald, D.C. and Reed, D.A., "Benchmarking hypercube hardware and software," Technical Report UIUCDCS-R-86-1303, Department of Computer Science, University of Illinois at Urbana-Champaign, 1986.

[GuRa86] Gupta, R. and Ramakrishnan, I., "System Level Diagnosis in Malicious Environments," Technical Report, SUNY Stony Brook, Dept. of Computer Science, October 1986.

[HaAm74] Hakimi, S. and Amin, A., "Characterization of connection assignment of diagnosable systems," IEEE Trans. Comput., Vol. C-23, January 1974, pp. 86-88.

[Hill85] Hillis, D., The Connection Machine, MIT Press, Cambridge, MA, 1985.

[HiSt86] Hillis, D. and Steele, G., "Data Parallel Algorithms," CACM, Vol. 29, No. 12, December 1986, pp. 1170-1183.

[HMSC86] Hayes, J., Mudge, T., Stout, Q., Colley, S. and Palmer, J., "Architecture of a Hypercube Supercomputer," Proceedings of the 1986 International Conference on Parallel Processing, August 1986, pp. 653-660.

[Hopk77] Hopkins, A.L. Jr., "Design Foundation for Survivable Integrated On-Board Computation and Control," Proceedings of the Joint Automatic Control Conference, 1977, pp. 232-237.

[HoSa84] Horowitz, E. and Sahni, S., Fundamentals of Computer Algorithms, Computer Science Press, Rockville, MD, 1984.

[HoSL78] Hopkins, A., Smith, B. and Lala, J., "FTMP - A Highly Reliable Fault-Tolerant Multiprocessor for Aircraft," Proceedings of the IEEE, Vol. 66, No. 10, October 1978, pp. 1221-1239.
[HuAb84] Huang, K. and Abraham, J., "Fault-Tolerant Algorithms and Their Application to Solving Laplace Equations," Proceedings of the 1984 International Conference on Parallel Processing, August 1984, pp. 117-122.

[HuAb87] Hua, K. and Abraham, J., "Design and Evaluation of Executable Assertions for Concurrent Error Detection," Proceedings of the 11th International COMPSAC, Tokyo, Japan, October 1987, pp. 324-330.

[Huan83] Huang, K., Fault-Tolerant Algorithms for Multiple Processor Systems, Ph.D. Thesis, Coordinated Science Laboratory, University of Illinois, November 1983.

[HuZu83] Hummel, R. and Zucker, S., "On the Foundations of Relaxation Labeling Processes," PAMI, Vol. PAMI-5, No. 3, May 1983, pp. 267-287.

[Hwan87] Hwang, K., "Advanced parallel processing with supercomputer architectures," Proc. of the IEEE, October 1987, pp. 1348-1378.

[KrTo81] Kraft, G. and Toy, W., Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.

[Kueh69] Kuehn, R., "Computer Redundancy: Design, Performance, and Future," IEEE Transactions on Reliability, Vol. R-18, No. 1, February 1969, pp. 3-11.

[KuLM87] Kuo, C., Levy, B. and Musicus, B., "A Local Relaxation Method for Solving Elliptic PDEs on Mesh-Connected Arrays," SIAM Journal of Scientific and Statistical Computing, Vol. 8, No. 4, July 1987, pp. 550-573.

[KuRe81] Kuhl, J. and Reddy, S., "Fault-Diagnosis in Fully Distributed Systems," Proc. 11th Fault Tolerant Computing Symposium, 1981, pp. 100-105.

[KuRe86] Kuhl, J. and Reddy, S., "Fault Tolerance Considerations in Large, Multiple Processor Systems," IEEE Computer, March 1986, pp. 56-67.

[Lala86] Lala, J., "A Byzantine Resilient Fault Tolerant Computer for Nuclear Power Plant Applications," 16th Symposium on Fault Tolerant Computing Systems, Vienna, Austria, July 1986, pp. 338-343.

[LaSP82] Lamport, L., Shostak, R. and Pease, M., "The Byzantine Generals Problem," ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.

[MaMa78] Mallela, S. and Masson, G., "Diagnosable Systems for Intermittent Faults," IEEE Trans. Comp., Vol. C-27, No. 6, June 1978, pp. 560-566.

[McCl85] McCluskey, E., "Built-in Self-test Techniques," IEEE Design and Test, April 1985, pp. 21-28.

[McNi87a] McMillin, B. and Ni, L., "Reliable Parallel Elliptic PDE Solution," presented at the 3rd SIAM Conf. on Parallel Processing for Scientific Computing, Los Angeles, CA, December 1987.

[McNi87b] McMillin, B. and Ni, L., "Byzantine Fault-Tolerance through Application Oriented Specification," Proceedings of the 11th International COMPSAC, Tokyo, Japan, October 1987, pp. 347-353.

[McNi88a] McMillin, B. and Ni, L., "A Reliable Parallel Algorithm for Relaxation Labeling," Proceedings of the 1988 International Conference on Parallel Processing for Computer Vision and Display, Leeds, U.K., January 1988, to appear.

[McNi88b] McMillin, B. and Ni, L., "Executable Assertion Development for the Distributed Parallel Environment," Proceedings of the 12th International COMPSAC, Chicago, IL, October 1988, to appear.

[McNi88c] McMillin, B. and Ni, L., "Reliable Parallel Sorting Through the Application Oriented Fault-Tolerance Paradigm," submitted for publication.

[McEN88] McMillin, B., Esfahanian, A. and Ni, L., "Limitations of System Level Diagnosis by Fault Model Considerations," Proceedings of the 3rd International Conference on Supercomputing, Boston, MA, May 1988, to appear.

[MoHZ83] Mohammed, J., Hummel, R. and Zucker, S., "A Gradient Projection Algorithm for Relaxation Methods," PAMI, Vol. PAMI-5, No. 3, May 1983, pp. 330-332.
[Perr85] Perry, K., "Randomized Byzantine Agreement," IEEE Transactions on Software Engineering, Vol. SE-11, No. 6, June 1985, pp. 539-546.

[PeSL80] Pease, M., Shostak, R. and Lamport, L., "Reaching Agreement in the Presence of Faults," Journal of the ACM, Vol. 27, No. 2, April 1980, pp. 228-234.

[PrMC67] Preparata, F., Metze, G. and Chien, R., "On the connection assignment problem of diagnosable systems," IEEE Trans. Electron. Comput., Vol. EC-16, December 1967, pp. 848-854.

[Quin87] Quinn, M., Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, New York, 1987.

[Rabi83] Rabin, M., "Randomized Byzantine Generals," Proceedings of the 24th Symposium on the Foundations of Computer Science, Tucson, AZ, November 1983, pp. 403-409.

[Rand75] Randall, B., "System Structure for Software Fault Tolerance," IEEE Transactions on Software Engineering, Vol. SE-1, No. 2, June 1975, pp. 220-232.

[Renn84] Rennels, D., "Fault-Tolerant Computing - Concepts and Examples," IEEE Transactions on Computers, Vol. C-33, No. 12, December 1984, pp. 1116-1129.

[Sham79] Shamir, A., "How to share a secret," CACM, Vol. 22, 1979, pp. 612-613.

[ShFi88] Shih, Y. and Fier, J., "Hypercube Systems and Key Applications," Parallel Processing for Supercomputing and Artificial Intelligence, K. Hwang and D. DeGroot (Eds.), McGraw-Hill, New York, 1988.

[ShRa87] Shin, K. and Ramanathan, P., "Diagnosis of Processors with Byzantine Faults in a Distributed Computing System," 17th Symposium on Fault Tolerant Computing, Pittsburgh, PA, July 1987, pp. 55-60.

[Skla76] Sklaroff, J., "Redundancy Management Techniques for Space Shuttle Computers," IBM Journal of Research and Development, Vol. 20, January 1976, pp. 20-28.

[Siew82] Siewiorek, D., The Theory and Practice of Reliable System Design, Digital Press, Bedford, MA, 1982.

[Smit65] Smith, G., Numerical Solution of Partial Differential Equations, Oxford University Press, New York, p. 150.

[Snyd82] Snyder, L., "Parallel Programming and the Poker Programming Environment," IEEE Computer, Vol. 17, No. 7, July 1984, pp. 27-36.

[Somm82] Sommerville, I., Software Engineering, Addison-Wesley, London, 1982.

[Stoc87] Stockman, G.C., Private Communication, 1987.

[Stro86] Strong, R., "Problems in Maintaining Agreement," Fifth Symposium on Reliability in Distributed Software and Database Systems, IEEE Computer Society, Los Angeles, CA, January 1986, pp. 20-27.

[Stuc77] Stucki, L., "New directions in automated tools for improving software quality," Current Trends in Programming Methodology, Vol. 2, R.T. Yeh (Ed.), Prentice-Hall, Englewood Cliffs, NJ, 1977, pp. 80-111.

[ToyW78] Toy, Wing, "Fault-Tolerant Design of Local ESS Processors," Proceedings of the IEEE, Vol. 66, October 1978, pp. 1126-1145.

[Varg62] Varga, R., Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1962.

[Wens78] Wensley, J., et al., "SIFT: The design and analysis of a fault-tolerant computer for aircraft control," Proceedings of the IEEE, Vol. 66, October 1978, pp. 1240-1255.

[YaHa84] Yanney, R. and Hayes, J., "Distributed Recovery in Fault Tolerant Multiprocessor Networks," 4th International Conference on Distributed Computing Systems, IEEE, 1984, pp. 514-525.

[YaMa86a] Yang, C. and Masson, G., "A Fault Identification Algorithm for t_i-Diagnosable Systems," IEEE Trans. Comp., Vol. C-35, No. 6, June 1986, pp. 503-510.

[YaMa86b] Yang, C. and Masson, G., "A Strategy for Fault Diagnosis in Asynchronous Distributed Systems with Soft Failures," Proc. Intl. Conf. on Computer Design: VLSI in Computers, Port Chester, NY, 1986, pp. 366-369.
[Zout76] Zoutendijk, G., Mathematical Programming Methods, North-Holland, Amsterdam, 1976.