Michigan State University

This is to certify that the dissertation entitled "Reliable Parallel Processing: The Application Oriented Paradigm," presented by Bruce Malcolm McMillin, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

RELIABLE PARALLEL PROCESSING
THE APPLICATION ORIENTED PARADIGM

A Dissertation
By
Bruce Malcolm McMillin

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Department of Computer Science, 1988

ABSTRACT

RELIABLE PARALLEL PROCESSING
THE APPLICATION ORIENTED PARADIGM

By Bruce Malcolm McMillin

Fault-tolerance is of paramount importance in large distributed multiprocessor systems. The key to providing reliability for this necessarily applications oriented environment lies in low cost and easily programmable solutions. This motivates consideration of a new class of fault tolerance techniques - the Application Oriented Paradigm. This dissertation provides both the basis for construction of the requisite application oriented constraint predicates and the necessary tools for a reliable, consistent distributed diagnosis in the Byzantine failure environment. The approach taken here is novel in the sense that a Constraint Predicate is formulated from a basis set of design metrics which embody desirable facets of the solution theory. The predicate is then an application oriented abstraction of the necessary fault tolerance. Reliable, consistent distributed diagnosis appears as part of the environment to the programmer. Both Byzantine Agreement and syndrome testing are treated as diagnostic techniques. The result of this dichotomization of fault tolerance is a provable and usable environment for Reliable Parallel Processing.

© Copyright by Bruce Malcolm McMillin, 1988

To Patricia

ACKNOWLEDGEMENTS

I wish to thank the members of my Ph.D. committee for their guidance and support during my Ph.D. program. I would particularly like to express my appreciation to A. Wojcik for his leadership of the Department, to A. Esfahanian for lending an ear, and mostly to L. Ni for providing a nearly optimal environment for his Ph.D. students. I also wish to thank both Dr. Wojcik and Dr. Fox for their careful reading of this dissertation and Dr. Hughes for his insightful comments.

To my parents, Kenneth and Phyllis McMillin, I express my appreciation for the early encouragement they gave that enabled me to pursue this endeavor.

I would like to acknowledge all the people who touched my life while I was at Michigan State. Too numerous to mention all, in particular I wish to acknowledge my good friends Ray and Atsuko Klein, Pat Flynn, Eric Wu, Yew-Huey Liu, Chung-Ta King, Youran Lan, and Taieb Znati.

To my wife Patty, for whom I hold the greatest love, no further words need be said.

This work was supported in part by the DARPA ACMP project, in part by the State of Michigan RE/ED project, in part by a Michigan State University Graduate Office Fellowship, and in part by the GTE Graduate Research Fellowship Program.
Table of Contents

List of Tables
List of Figures

Chapter 1: Introduction
1.1 Modern Fault Tolerance
1.2 Background On Reliable Systems
1.3 Fault Tolerance Techniques
1.4 Resource Failure Models
1.5 Distributed Memory Multiprocessors
1.6 The Application Oriented Fault-Tolerance Paradigm
1.7 Executable Assertions
1.8 Direction Of This Thesis
1.9 Thesis Outline

Chapter 2: Motivation And Problem Statement
2.1 Hardware Fault-Tolerance
2.2 Software Fault-Tolerance
2.3 Problem Statement
2.4 Chapter Summary

Chapter 3: A New View Of The Fault Detection Problem
3.1 The Distributed Fault Detection Model
3.2 Programmer/System Issues
3.3 Chapter Summary

Chapter 4: Constraint Predicates - A Programmer Level View
4.1 Definition
4.2 Basis Metrics
4.3 Chapter Summary

Chapter 5: Distributed Diagnostic Basis
5.1 Syndrome Testing And Diagnostic Impossibilities
5.2 Byzantine Generals Problem
5.3 Diagnostic Comments
5.4 Chapter Summary

Chapter 6: Performance Evaluation
6.1 Expected Error Coverage
6.2 Run Time Overhead Estimation
6.3 Chapter Summary

Chapter 7: Matrix Problem And Iterative Techniques
7.1 Matrix Problem Specification
7.2 Natural Problem/Solution Constraints
7.3 Error Coverage Modeling For Matrix Iterative Analysis
7.4 Run Time Performance
7.5 Comments And Limitations

Chapter 8: Relaxation Labeling
8.1 Parallelized Weighted Relaxation Labeling
8.2 Constraint Predicate
8.3 Reliable Parallel Algorithm For Relaxation Labeling
8.4 Chapter Summary

Chapter 9: Parallel Bitonic Sorting
9.1 Bitonic Sorting
9.2 Faulty Behavior
9.3 Reliable Bitonic Sorting Algorithm Development
9.4 Error Coverage And Resilience
9.5 Time And Space Complexity
9.6 Chapter Summary

Chapter 10: Directions For Future Research
10.1 Summary of Major Contributions
10.2 Configuration Control
10.3 Automated Constraint Predicate Generation
10.4 Application Applicability

References

List of Tables

Table 1.1. Fault Models
Table 3.1. Constraint Predicate Size Classification
Table 6.1. DMMP Assumptions
Table 6.2. Run Time Comparisons
Table 7.1. Predicate Subclass Membership

List of Figures

Figure 1.1. Alternative paths to achieve reliability.
Figure 1.2. Fault Model Relationships.
Figure 1.3. Distributed-Memory Multiprocessor System Architecture.
Figure 1.4. The Reliable Parallel Processing Model.
Figure 2.1. Expected run time for a serial-reliability parallel algorithm.
Figure 3.1. Splitting the Detection Problem.
Figure 4.1. Abstraction of the Constraint Predicate.
Figure 5.1. Fault Models.
Figure 5.2. Syndrome Undiagnosable by Yang and Masson Algorithm.
Figure 5.3. Byzantine Agreement (P1 is faulty).
Figure 5.4. Algorithm Agree.
Figure 5.5. Example of Algorithm Agree for n = 4, t = 1.
Figure 6.1. Expected run time comparison.
Figure 7.1. Relaxation Skeleton.
Figure 7.2. Sample Matrix for Au = v.
Figure 7.3. Feasibility Event and Conditional Distribution of Errors.
Figure 7.4. Error Coverage by Solution Step Count.
Figure 7.5. Error Coverage by Solution Step Count, Theorem 7-7 Applied.
Figure 7.6. Mapping for q = 25.
Figure 8.1. Algorithm RL_NR.
Figure 8.2. Algorithm projection-operator.
Figure 8.3. Non-maximizing movements.
Figure 8.4. Movement outside of feasibility cone.
Figure 8.5. Zigzagging - Convergence but no Finiteness.
Figure 8.6. Reliable relaxation labeling algorithm RL_R.
Figure 9.1. Compare-Exchange Step.
Figure 9.2. Algorithm S_NR.
Figure 9.4. Algorithm S_R.
Figure 9.5a. Φ_P - Progress Component.
Figure 9.5b. Φ_F - Feasibility Component.
Figure 9.5c. Φ_C - Consistency Component.
Figure 9.6. Example of S_R for n = 3.
Figure 9.7. Sorting Time Comparisons.
Figure 9.8. Projected Sorting Time Comparisons - Large Systems.

Chapter 1
Introduction

Massively parallel distributed-memory multiprocessors (DMMPs) have gained much attention recently due to their unique scalability in providing massive computing power.
Various hypercube multiprocessors, such as the Ncube, iPSC, and FPS T-series, are notable examples of commercially available DMMPs [ShFi88, Hwan87]. Design of parallel algorithms for DMMPs has been a major research issue in the application domain. However, the design of reliable parallel algorithms has received little attention. As the size of DMMPs grows into thousands of processors, and the corresponding scale of attempted application problems grows even faster, solving the problem of providing fault-tolerance is of paramount importance to the greater use of parallel processing.

Traditional fault-tolerance techniques are mainly geared to providing an ultra-reliable environment [HoSL78, Wens78, Lala86]. The ultra-reliable environment is necessary when loss of property or life is the result of a failure. DMMPs often operate under less stringent reliability requirements. However, a failure, while not catastrophic, may increase application turnaround time to unacceptable levels (see Chapter 2). The number of components of a large-scale DMMP makes it particularly susceptible to failure. To provide the appropriate level of reliability, it is necessary to re-examine the notion of fault-tolerance and how to provide it.

Reliability should be treated as part of the operating environment, much as the compilers, libraries, and operating system appear to the programmer. Ideally, reliability is entirely transparent to the programmer. However, the cost of techniques to furnish this transparency can be prohibitive, or the techniques themselves infeasible. Some of the existing techniques to provide the necessary reliability include the use of better components (Fault Avoidance), stopping the system when a fault occurs (Fault Detection), running in the presence of faults (Masking Redundancy), or removing failed components from the system (Reconfiguration). For DMMPs, which are constructed from "off the shelf" building blocks, the last two techniques are most applicable. As we shall see, these two techniques require communication and agreement among the components of the DMMP. This agreement is commonly obtained by Voting.

1.1 MODERN FAULT TOLERANCE

Early work in computer system fault tolerance was concerned with failures of individual components. Much of the work was thus concerned with gate-level issues such as "stuck-at faults" [KrTo81] and other low-level issues in processor design. This was primarily owing to the single CPU employed and its expensive hardware components. In modern systems, and particularly in massively parallel DMMP design, failed components are considered at the granularity of the processor, the memory, and the bus [Wens78]. This means that when a failure occurs in a component, we wish to isolate its effects on the remaining portion of the system. We are not concerned with how the component failed, simply that it did so. When a component fails, we expect to have enough other similar components to take its place. These components, in the interest of low cost, are typically off-the-shelf processor, disk, and memory designs. The desired reliability may not be built into these components. Thus we build the desired level of redundancy through system level algorithms to coordinate these components. Finally, as systems become more complex, low-level analysis of fault-tolerance and low-level redundancy design are not feasible.
1.2 BACKGROUND ON RELIABLE SYSTEMS

Early work in the area of computer system reliability was focused in the space program and in the telephone switching environment. The OAO satellite computer [Kueh69] used discrete component redundancy. This technique became outmoded, as with integrated circuits the probability of multiple transistors being destroyed is about the same as for a single transistor [Renn84]. The initial Bell ESS design (for telephone switching purposes) uses two-way redundancy [Toyw78]. Two computers perform the application in parallel. If one computer failed, the independent results would disagree and the entire ESS would halt. The advantage of this approach is that an incorrect output is never produced. A more modern example of computer fault-tolerance is in the Apollo guidance system. This system used triple modular redundancy (TMR), in which three times the necessary resources are employed. Three processors run the same program and vote to mask errors in any single processor [AnMa67]. The JPL STAR is a computer designed for deep space exploration where human maintenance is impossible. It has a large number of functional units with spares which can be switched in automatically to replace a failing component [Aviz71]. These designs form a basis for more modern fault-tolerant architectures.

1.2.1 Fault-Tolerant Multi-Processor (FTMP)

The Fault-Tolerant Multi-Processor (FTMP) [HoSL78] is an extension of early spacecraft designs. The basis for the design of the FTMP is to mask all errors from the applications software. This is advantageous since no software rollbacks are required to fix errors. This masking is accomplished through a TMR design extension known as Parallel Hybrid Redundancy. A pool consisting of multiple triads of modules such as processors, memory, and busses comprises the system. Each triad has a voting element which votes on the data presented to it in triplicate. A module may be retired, regrouped into another triad, or assigned to perform maintenance diagnostic testing. A module is retired when it has been found to have failed, at which time a spare is switched in its place. If no spares are available, the triad is retired. Since the hardware resources are allocated from a pool, the application never feels the effects of these reconfiguration activities.

Voting at the hardware level is done bit by bit in the tightly synchronized approach adopted. Synchronization is achieved through keeping the clocks of each autonomous processor in absolute synchrony. This is accomplished by the use of TMR communication between the clocks at the hardware level. The voting elements detect failures in a 2-of-3 vote. Another controller, called a bus guardian, isolates or reconfigures around the failing module. If a bus is faulty, the bus guardians are directed to switch to another bus. If a processor is found to be faulty, it is isolated from the bus by the low level logic of the bus guardian.

Voting is not the only means employed in the FTMP for fault-detection. Modules may be assigned in a maintenance mode to exercise the system to attempt to find failed components and to remove them as quickly as possible. This testing checks each processor against multiple busses to attempt to narrow down the actual location of the fault, either processor or bus.

The advantage of this tight synchronization, other than the ability to mask faults at the hardware level, is that no buffering, synchronization primitives, or completion intervals are required.
This implies a speed improvement over a machine which employs loose synchrony. There are, however, disadvantages to this method of low level voting. A voter or bus guardian failure can cause a catastrophic hardware failure that will probably defy any method of software correction. The tight synchronization also has limitations that will be discussed in a following section.

1.2.2 Software Implemented Fault-Tolerance (SIFT)

A contrasting approach to the FTMP, using loose synchrony, is the Software Implemented Fault-Tolerant computer (SIFT) [Wens78]. The SIFT was designed to serve in a similar fault tolerant capacity as the FTMP. The SIFT uses multiple busses and modules; however, the modules (processors) are off-the-shelf components. The SIFT concept is currently employed in state of the art systems such as the space shuttle [Skla76].

The loose synchrony of the SIFT allows it to run as a loosely coupled multiprocessor. Non-critical applications can take advantage of this multiprocessor speedup. For critical applications, the SIFT can form into a redundant configuration using TMR redundancy. Fault isolation is obtained by physical isolation of failed hardware components. Fault masking is enhanced through the use of multiple busses over which multiple copies of the data are transmitted. Voting takes place as a function of the software. Critical tasks in the SIFT are required to be iterative tasks such that voting is done on the state of data before each iteration. This guarantees that each processor is working on the same loop. If an error is discovered in the vote on this state of the system, the software attempts to locate and remove the faulty component. This is done by changing the bus of the presumed faulty processor. If the fault follows the move, the processor is retired. If the fault disappears, the bus is retired. Thus the SIFT can configure around the fault. This detection, however, is not complete. Each processor makes a report of failed components to the runtime supervisor. Depending on the type of fault, it may not be decidable which processor or bus is actually failing. However, the system can continue to run in a fault masking mode.

The clocks in the SIFT are not synchronous. However, for reasons that will be made clear in Chapter 5, the clocks must be kept in partial synchrony. With respect to the hardware, each processor has an independent clock. These clocks are resynchronized occasionally by an algorithm known as a clock skew reset. This algorithm is essentially a voting algorithm which requires at least four clocks to mask one fault [PeSL80]. Low level hardware correction is not employed in the modules of the SIFT. Through modeling work done by the system designers, only a slight increase in reliability would be gained by adding this hardware support over what is achieved through the voting.

The loose synchrony and high level error correction of the SIFT, as will be seen shortly, can provide a basis for the construction of reliable distributed systems.

1.2.3 Space Shuttle Computer Control System

The space shuttle computer control system is of an ultra-reliable design, as are the FTMP and SIFT. The important difference is that the space shuttle computers are constrained to be off-the-shelf processing systems, specifically IBM AP-101 CPUs [Skla76]. Thus the space shuttle design is consistent with the more modern notion of fault-tolerant systems outlined at the beginning of this chapter.
The hardware configuration of the space shuttle system consists of five computers. Four of the computers are programmed identically to perform flight critical functions such as guidance, navigation, and control. The outputs of the four computers are voted in the control actuators, with the inputs to the calculations coming from multiple input sensors. Faults are not simply masked. The flight crew must be notified of a failed computer. Furthermore, the faults are isolated; that is, one computer failure cannot cause another computer to erroneously identify itself as faulty. This allows the crew to re-allocate resources such that the system may continue to operate in the presence of up to 3 failed computers. In the case where only two computers are left operational, fault identification is not possible; however, fault detection is possible in the sense of the Bell ESS systems. A fault occurrence can be detected but not located.

The fifth computer is programmed to the same set of specifications as the four primary computers, but is implemented by a different contractor [Renn84]. This allows detection of software design and coding errors, thus implementing a degree of software error detection. In the case of a disagreement, the correct version of the software must be determined by the crew, and that version then taken as correct.

The space shuttle provides both hardware and software fault tolerance, continued operation in the presence of hardware failures, and detection of implementation problems in the software. As we shall see in Chapter 2, these two facets of fault-tolerance also form the basis for the application-oriented fault-tolerance paradigm described in this thesis.

1.3 FAULT TOLERANCE TECHNIQUES

The previous introductory examples have shown the origins of fault-tolerant computing design. A more rigorous treatment of these techniques is presented in this section. System reliability can be achieved through the alternative event paths shown in Figure 1.1 [Siew82, KuRe86, YaHS84].

Figure 1.1. Alternative paths to achieve reliability.

1.3.1 Fault Detection

Reliable fault detection informs a trusted observer or observers when a fault occurs. In a DMMP there may be no centralized observer, or use of a centralized controller may be prohibitive in terms of run time penalty. However, for hardware fault-tolerance, trusted observers may simply be a failed component's peers. If the number of faulty components in the system is a bounded design parameter, then sufficient non-faulty components are always available for reliable detection. In the distributed environment, during problem execution, it is not known by all components which components are faulty and which are reliable, leading to a possibly inconsistent diagnosis.

The problem of reliable hardware fault detection is twofold: the achievement of a consistent diagnosis, and the achievement of a correct diagnosis through appropriate testing. Two possible techniques to achieve this consistent diagnosis are presented in Chapter 3. Software fault detection, assuming a reliable hardware execution environment, need not be concerned with consensus on the location of a failed component. A single test/processor, upon detection of a failed component, can reliably signal the error instance. In both the hardware and software failure environments, generation of an appropriate test is the common task.

1.3.2 Fault Masking

Reliable fault masking uses non-faulty elements to compensate for the effects of faulty elements. N-Modular Redundancy (NMR), a more general case of the TMR employed in the Apollo system, is a common technique to mask failures. The idea is to take a majority vote on a calculation replicated N times. This, of course, is expensive in terms of either hardware or run-time cost. A hardware solution requires N times the number of processors as well as a reliable voter. Software solutions require each processor to run N copies of surrounding computations and then vote on the result. This slows down the computation by at least a factor of N. However, as described earlier, NMR is required in ultra-reliable systems such as the Apollo spacecraft control, the FTMP and SIFT aircraft control systems, and the FTP reactor control system [Siew82, Hopk77, HoSL78, Renn84, Wens78, Lala86].
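For concreteness, the majority vote at the heart of NMR can be sketched in a few lines of C. The code below is an illustration only - the replica values and the vote interface are invented for the example, not drawn from any of the systems described above.

    #include <stdio.h>

    /* N-way majority vote over replicated results.  Returns 1 and
     * writes the majority value when a strict majority exists,
     * 0 otherwise (a tie is reported as a failed vote). */
    int nmr_vote(const int results[], int n, int *winner)
    {
        for (int i = 0; i < n; i++) {
            int count = 0;
            for (int j = 0; j < n; j++)
                if (results[j] == results[i])
                    count++;
            if (2 * count > n) {        /* strict majority masks faults */
                *winner = results[i];
                return 1;
            }
        }
        return 0;
    }

    int main(void)
    {
        int r[3] = {42, 42, 7};         /* TMR: one faulty replica */
        int v;
        if (nmr_vote(r, 3, &v))
            printf("majority result: %d\n", v);   /* prints 42 */
        else
            printf("vote failed: no majority\n");
        return 0;
    }

With N = 3 (TMR), any single faulty replica is outvoted; requiring a strict majority ensures that an even split is reported as a vote failure rather than silently accepted.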
1.3.3 Reconfiguration

Reliable reconfiguration removes the faulty component from the system, either logically or physically. Centralized reconfiguration control is the easiest to implement. However, this technique may be unacceptable in a distributed system because it introduces a single point of failure, namely at the controller. Decentralized reconfiguration control is more appealing but harder to implement. [Garo82] proposed the idea of a local coordinator to oversee the reconfiguration process. The local coordinator must be elected in a reliable election. This can be done via a voting process. The local coordinator then attempts to map in a spare component or to assign the job from the failed component to another functioning component.

When a hardware component fails, it may have suffered only a transient failure and can be reinstated into the system if the operation, when retried, succeeds. If the component is found to have suffered a permanent failure, then it can be replaced by an identical copy. Software failures are completely different. Since software design/coding errors are repeatable, once an error is detected, replacing the copy with another identical copy is futile, since the same error will result. Thus the erring algorithm must be replaced by an alternate algorithm. Unlike the hardware failure environment, for the software environment it may be desirable to return to the primary algorithm after the failed problem instance has been completed by the alternate algorithm.

1.3.4 Recovery

Reliable recovery involves rolling back the computation to a suitable checkpoint and restarting from that point. A local coordinator may assume this function and initiate the rollback. The checkpoints may be held within the reliable components or held by the coordinator. Once the system is recovered, it ignores any messages from a failed hardware component.

1.4 RESOURCE FAILURE MODELS

The previous four types of fault-tolerance techniques are all a function of the particular fault model under consideration. Resource failures can be classified (ranging from most to least restrictive) by the classification given in Table 1.1. The relationship between the models is shown in Figure 1.2.

1) Fail-Stop: A failing resource detects its own failure and stops. A Fail-Stop resource never generates spurious or incorrect messages. The resources connected to a stopped Fail-Stop resource immediately know that it has stopped. This is the easiest form of fault to detect, but is probably the hardest for the resource to implement.

2) Hard: The resource, when tested by a non-faulty testing unit, always responds with its correct status, faulty or non-faulty.
This is the fault model assumption made in the PMC [PrMC67] model of fault diagnosis, which will be discussed further under syndrome testing.

3) Intermittent: A resource, when tested by a non-faulty testing unit, may or may not respond correctly, without restriction.

4) Byzantine: Byzantine resources encompass the universal class of faults. Any failure to meet specifications is a Byzantine fault. A Byzantine fault may generate incorrect or even malicious results to a test query or during algorithm participation.

Table 1.1. Fault Models

Figure 1.2. Fault Model Relationships.

1.5 DISTRIBUTED MEMORY MULTIPROCESSORS (DMMPs)

A DMMP (Figure 1.3) is a system of N autonomous MIMD (Multiple Instruction Multiple Data stream) processors, each with its own local memory. These processors communicate with each other via messages over an interconnect. This message passing takes non-negligible communication time. Examples of commercially available DMMPs were presented in the introduction to this chapter. These systems currently are on the order of thousands of processors. SIMD (Single Instruction Multiple Data stream) architectures such as the Connection Machine [Hill85] have 65 thousand processors. It is not hard to make the transition to DMMPs of at least this scale.

The Intel iPSC and Ncube computers make a distinction between host processors and node processors. Both have an additional host processor which is used for program downloading/uploading and data transfer.

Figure 1.3. Distributed-Memory Multiprocessor System (DMMP) Architecture: a host/I/O processor connected to node 0 through node N over an interprocessor connection.

Note that while the host has a connection to each node, the number of nodes precludes the host processor from becoming involved in the nodes' calculations. Design of a support environment requires consideration of the system characteristics. The primary issues of concern here are the locality of information and the nature of distributed control.

1.5.1 Locality of Information

The lack of shared data structures, or more precisely the inability of a particular testing processor to examine another's internal memory, reduces testing capabilities. A test can only make a decision based on information from the local internal state of the computation and the messages it receives from other processors (peers) in the computation. While privacy of information can be a drawback, it also enhances the fault-tolerance of a DMMP. In a faulty DMMP, corruption of the environment by a particular processor is limited to the messages it sends. Contrast this with the shared memory model of multiprocessing, in which a processor can undetectably corrupt a computation at any time through disruption of the shared store. Thus correctness assertions need only consider the message exchange. This considerably enhances the tractability of the assertion development problem.

To categorize the amount of information available in the distributed environment, the terms "White Box" and "Black Box" are employed. These terms are used in the software engineering environment to dichotomize the two approaches to coverage test generation. Here, in the white box environment, the participants in the distributed environment have perfect information. The complete internal state of a particular processor is known to other processors. This is consistent with the shared memory model of multiprocessing but is inconsistent, due to infeasibility, with the DMMP model.
The black box environment is characterized by the true DMMP. Participants in the distributed environment have, as their only information, local state information and received messages.

1.5.2 Control

The large size of a DMMP dictates that operation under a centralized control scheme is infeasible due to the performance bottleneck in the central control unit. Furthermore, from a reliability standpoint, a centralized controller introduces the undesirable single point of failure. Thus, fostered by the locality of information and the size of DMMPs, individual processors function together under distributed control. The locality of information, coupled with non-negligible communication delays between the processors, results in a lack of state consistency for any particular snapshot of the system. Since processors in DMMPs are of the MIMD type, there may be no synchronization, or at best partial synchronization, between the units. Any algorithm to provide fault tolerance must be able to function in the presence of imperfect information, temporarily inconsistent information, and under decentralized control.

1.6 THE APPLICATION ORIENTED FAULT-TOLERANCE PARADIGM

Of the three problems of reliable fault detection, reconfiguration, and recovery, the fault detection problem is conceptually the hardest. Key to the understanding of the problem is the notion of a fault and the ability of a test to find its occurrence. In off-line system testing, some test pattern is applied to the components of the system. The expected behavior is known, and any deviation from this is flagged as a fault. Systems designed for this type of test fall under the study of Design for Testability [McCl85]. The internal circuitry is designed such that individual components can be tested and failed ones located through application of test patterns. Traditionally this kind of analysis has concentrated on low level gate issues.

The problems with off-line testing for faults are twofold. A component that exhibits transient failures may pass an instance of a test. Component exercising attempts to catch these errors by repeatedly testing components during periods of system idle time; however, this off-line testing may not catch transient errors that occur during the actual operation of the system. The second problem is that a test may not be complete. A complex system may have a sufficiently large number of states that generation of a complete test may be infeasible or impossible.

Huang and Abraham in [HuAb84, Huan83] proposed the idea of Algorithm Based Fault Tolerance, in which checksum encodings are embedded in the data. This is of dubious importance as a fault detection technique, as most computing systems already contain some sort of checksumming and error correction at the bit level. One paper [HuAb84], however, proposed the idea of using properties of the solution as acceptance tests. This method is deemed elegant by the following observation:

Since we test the intermediate results for correctness with respect to the algorithm, the end solution is correct if the intermediate results are correct. If processor errors occur that do not affect the solution, then they are not errors.

For application oriented reliability, this method is clearly superior to component exercising or other off-line tests. The test set consists of only what is necessary to ensure correctness for the current application.
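As an illustration of such a solution-property acceptance test, consider checking a claimed solution x of a linear system Ax = b against the defining property itself, rather than re-running the solver. The C sketch below is hypothetical - the 3x3 system, the tolerance EPS, and the function name accept_solution are invented for the example:

    #include <math.h>
    #include <stdio.h>

    #define N   3
    #define EPS 1e-9

    /* Accept a claimed solution x of Ax = b only if every row's
     * residual is within tolerance; the test uses a property of the
     * solution, not a re-execution of the solver. */
    int accept_solution(const double A[N][N], const double x[N],
                        const double b[N])
    {
        for (int i = 0; i < N; i++) {
            double r = -b[i];
            for (int j = 0; j < N; j++)
                r += A[i][j] * x[j];        /* residual of row i */
            if (fabs(r) > EPS)
                return 0;                   /* reject: property violated */
        }
        return 1;                           /* accept */
    }

    int main(void)
    {
        double A[N][N] = {{2,0,0},{0,3,0},{0,0,4}};
        double b[N]    = {2,6,12};
        double x[N]    = {1,2,3};               /* correct solution  */
        printf("%d\n", accept_solution(A, x, b)); /* prints 1        */
        x[1] = 2.5;                               /* corrupt a value */
        printf("%d\n", accept_solution(A, x, b)); /* prints 0        */
        return 0;
    }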
The work in application oriented fault tolerance of this type comprises only a small portion of Huang's thesis [Huan83], and the detection method proposed is incomplete. However, the concept is excellent.

1.7 EXECUTABLE ASSERTIONS

The Assertion is the basic unit of program verification [Somm82], software fault-tolerance [Rand75], and the application oriented fault-tolerance paradigm described in this thesis. Development of these assertions is straightforward for the sequential program execution environment (see [YaCh75, Andr79, Somm82, HuAb87] for examples). Assertions are primarily generated from the coding and design process. An assertion is a statement of the form:

if not ASSERTION then ERROR;

where ASSERTION is a Boolean invariant on the program state. Assertions integrated into the program code are called Executable Assertions. If an assertion fails due to some abnormality, whether hardware or software, the ERROR condition is raised and appropriate action taken. [Stuc77] discusses the use of executable assertions for software fault tolerance, but there is nothing to limit their use in hardware testing. Assertions are typically outgrowths of the design process, as in the Recovery Block Model of Randell [Rand75]. Design generated assertions are desirable from the aspect that they may often be generated automatically from the design language [HuAb87].

In the sequential programming environment, each result of a statement is deterministically a function of the current program state and the statement executed. Procedures and function calls may be handled in one of two ways. Either the function may be considered as a "meta-statement" which is a function of the calling program's state, or the statements internal to the called function may be expanded inline to the calling program. In any case, assertions are made using the state of a single program and its statements. Careful choice of these assertions, particularly those involving loop invariance, leads to a high degree of confidence in both the program's termination and correctness.

1.8 DIRECTION OF THIS THESIS

Large scale multiprocessors are a product of this decade. Little if any consideration has been given to the system aspect of reliability. As these systems come into widespread usage, and as their size and resulting complexity continue to grow, reliability will continue to be a major issue and a viable research topic. This research represents the first such attempt at specifying a reliable parallel processing environment for large scale multiprocessors. This environment is characterized by the Reliable Parallel Processing Model of Figure 1.4.

Figure 1.4. The Reliable Parallel Processing Model (application layers atop configuration layers atop system layers).

The applications programmer need only specify an abstraction of the necessary reliability information at the upper levels. The system environment consists of tools at the lower levels of the model that minimize the burden of reliability programming on the applications programmer. The configuration layers enable the reliable parallel processing model to be employed in various system topologies.

1.9 Thesis Outline

This introduction has provided a historical basis of reliability and addressed some of the issues that are considered in the following chapters. Chapter 2 gives a motivation for the study of new techniques for fault-tolerance.
It describes a probabilistic model of expected run time in a large system in the presence of failures as a motivation to provide hardware fault-tolerance. The need for software fault tolerance, in the form of executable assertions for the parallel environment, is perhaps even stronger motivation for this study than hardware fault-tolerance.

Chapter 3 discusses the fault detection problem, temporally the first problem encountered and perhaps the most difficult to solve.

Chapter 4 presents a formal classification of the basis metrics which can be applied to the problem at hand to generate the necessary executable assertions, which compose a structure called a constraint predicate. The usage of the constraint predicate is detailed both as a hardware fault-detection tool and as a software verification/fault-detection tool.

Chapter 5 presents the underlying distributed diagnostic basis and its ideal virtual machine interface to the constraint predicate. Techniques considered are Vector Byzantine Agreement, a new form of Byzantine Agreement, and Distributed Syndrome Testing.

Chapter 6 presents a set of tools and techniques for analysis of both the coverage and run time performance of a generated constraint predicate.

Chapters 7 through 9 present three examples of constraint predicate development. The first is a common numerical technique for solving very large systems of equations on a parallel processor. The second is a reliable parallelized form of a non-numerical algorithm for computer vision. The third is a reliable sorting procedure. These chapters also deal with programming experiences and implementation details of the algorithms.

Chapter 10 presents directions for future research. These include the outline of a reconfiguration/mapping control module which isolates the application level from topological changes. Directions for localized reliable distributed reconfiguration and recovery are presented.

Chapter 2
Motivation and Problem Statement

The application oriented fault-tolerance paradigm is a reliability method applicable both to tolerance of hardware faults and software faults. The motivational forces behind consideration of each are different.

2.1 HARDWARE FAULT-TOLERANCE

A system is said to be hardware fault-tolerant if, under a bounded number of faulty components, the system continues to satisfy its operational requirements. This bound is obtained through examination of the reliability of the components of the system. Periodic testing of the system will require additional time to perform the various bounds checking and message interchanges associated with the test. For small problems, this overhead may not be necessary. If the system has a high reliability and the problem incurs no more than a short execution time, the probability of an error occurring during the execution may be very small. For small problems, it may be best to "take your chances" and run the job with no fault-tolerance. However, long run times with large numbers of processors may not be able to finish before an error occurs. Indeed, the operational requirements of the GF11 project at IBM [Agar88] specify the execution of algorithms that can run as long as a year on a 500+ processor machine! The expected time between errors is called the Mean Time To Failure (MTTF). If the problem run time is long with respect to the MTTF, then reliability techniques are indicated.
If the problem run time is short with respect to the MTTF (as in the "small" problem above), reliability techniques may not need to be used. Furthermore, when an error occurs, the problem must be restarted from the beginning. If the MTTF is short, the job may never complete. Reliability modeling can provide a basis for quantification of these concepts. We must make some assumptions about the type of problem and solution being attempted.

Model Assumptions:
- Serial Reliability: nearest neighbor iterative problem - all processors must function.
- i.i.d. Exponential: processor failures are independent, identically exponentially distributed with parameter μ.

2.1.1 Failure Model

As noted in the assumptions, processor MTTFs are given by the random variable X and are i.i.d. exponential with parameter μ. The probability density function of X is given by:

f(t | μ) = (1/μ) e^(−t/μ),  t > 0   (2.1)

The conditional probability that an individual processor fails at some time later than T is given by:

1 − F(T | μ) = ∫_T^∞ f(t | μ) dt = e^(−T/μ)   (2.2)

The system MTTF is given by the random variables X_1, X_2, ..., X_N, where N is the total number of processors involved in the problem solution. Since we have serial reliability, the system fails when one component fails. This is given by the random variable Y:

Y = min{X_1, X_2, ..., X_N}   (2.3)

Let the conditional cumulative distribution function of Y be G(t | μ). Since the X_i's are independent,

1 − G(t | μ) = ∏_{i=1}^{N} (1 − F(t | μ)) = e^(−Nt/μ)   (2.4)

Thus 1 − G(t | μ) is an exponential with parameter μ/N. For example, with N = 1000 processors each having an MTTF of one year, the system MTTF μ/N is under nine hours.

2.1.2 Expected Run Time

The expected run time of a non-fault-tolerant system is constructed as follows. If the system does not fail before time T, then the job runs in time T. If the system fails before T, then the job must be restarted, and the expected run time is the time spent before the failure plus another expected run time (since the exponential distribution has the memorylessness property). The expected run time E(R) is given by

E(R) = T · Pr[Y > T] + (E[Y | Y ≤ T] + E(R)) · Pr[Y ≤ T]   (2.5)

…

Theorem 4-1: Let a sorting algorithm S take an input list I = (I_0, ..., I_{N−1}) and produce an output list O = (O_0, ..., O_{N−1}). (1) If O is not a permutation of I, or (2) if O_j > O_{j+1} for some j, then the result O produced by S is incorrect.

Proof: For part (1), if either there is no j such that O_i = I_j for some i, or there is no i such that I_i = O_j for some j, then the output list O is not a permutation of I, thus violating Definition 4-2. For part (2), the output list must appear in a non-decreasing manner, as an ascending sort is being performed. □

From Theorem 4-1, the constraint predicate Φ_PS follows immediately.

Predicate Φ_PS (I, O, N) returns bool;
    for i := 0 to N−1
        if O_i ≠ I_j for every j then ERROR;
    for i := 0 to N−1
        if I_i ≠ O_j for every j then ERROR;
    for j := 0 to N−2
        if O_j > O_{j+1} then ERROR;
end.

Φ_PS is correct since it flags any unsorted output list as an error, and it is complete since any output list that is not a permutation of the input list constitutes an error. However, this predicate can only be applied at the termination of the sorting calculation. If an error occurs early in the calculation, this predicate cannot find it. Thus the fault latency is long. We shall revisit sorting predicates in Chapter 9.

Condition (3) is necessary to prevent unnecessary loss of computing resources. An individual component of a constraint predicate might perform an incomplete subtest - that is, the subtest allows faulty behavior to pass. It will never, however, perform an incorrect test in which a correct result is flagged as faulty. At some stage of testing, the subtests will either progress sufficiently to actually flag the error, or the actual error will be caught by another test that comprises the constraint predicate. The intersection of all such tests, each of which is correct, composes a correct constraint predicate.
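Returning to the sorting predicate Φ_PS, a direct C rendering makes it concrete. The function name phi_ps and the integer keys are illustrative choices, and the O(N^2) membership scans deliberately follow the pseudocode rather than an optimized permutation check:

    #include <stdio.h>

    /* Phi_PS: O must occur in I, I must occur in O, and O must be
     * non-decreasing.  Returns 1 if the output passes, 0 on ERROR. */
    int phi_ps(const int I[], const int O[], int N)
    {
        for (int i = 0; i < N; i++) {       /* every O_i occurs in I */
            int found = 0;
            for (int j = 0; j < N; j++)
                if (O[i] == I[j]) { found = 1; break; }
            if (!found) return 0;           /* ERROR */
        }
        for (int i = 0; i < N; i++) {       /* every I_i occurs in O */
            int found = 0;
            for (int j = 0; j < N; j++)
                if (I[i] == O[j]) { found = 1; break; }
            if (!found) return 0;           /* ERROR */
        }
        for (int j = 0; j + 1 < N; j++)     /* ascending order       */
            if (O[j] > O[j + 1]) return 0;  /* ERROR */
        return 1;
    }

    int main(void)
    {
        int I[] = {3, 1, 2}, good[] = {1, 2, 3}, bad[] = {1, 3, 2};
        printf("%d %d\n", phi_ps(I, good, 3), phi_ps(I, bad, 3)); /* 1 0 */
        return 0;
    }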
Two issues are important in constraint predicate development for the distributed parallel environment. Granularity, as mentioned in Chapter 2, is the first. A more general issue, common to executable assertions, is coverage. We shall treat the granularity issue now and delay the issue of coverage until Chapter 6.

Problem mapping to the DMMP environment is usually done in the Data Parallel [HiSt86] style. Each processor executes the same portion of an algorithm on different data. For maximum parallelism to be obtained from this type of mapping, the problem must contain a natural structure that allows the parallelism to be achieved. Furthermore, the identifiable points of parallelism will naturally fall at message exchange boundaries for data parallel algorithms. Thus the constraint predicate is more closely allied with the problem than with any specific solution. This relationship is shown in Figure 4.1, which depicts the construction of the constraint predicate.

4.2 BASIS METRICS

The following is proposed as a basis for constraint predicate generation through property extraction. In a solution, each testable intermediate result should satisfy one or more of the three predicate subclasses of progress (Φ_P), feasibility (Φ_F), and consistency (Φ_C).

4.2.1 Progress (Φ_P)

Each testable step must both arrive within a bounded time and belong to the expected step of the solution:

Predicate Φ_P (max_step, timeout, k) returns bool;
    for i := 1 to max_step do
        select
            read data from P_j;
            if data not from the k'th step then ERROR;
        or
            delay timeout; ERROR;
        end;
        process locally;
    end.

Note that if a processor continues to send messages after max_step, this is not considered an error, for this behavior does not affect the final result of the calculation.
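This step-count form of Φ_P can be sketched in C as follows. The message structure and the timed-receive callback are hypothetical stand-ins for whatever receive primitive a particular DMMP runtime provides:

    #include <stdio.h>

    /* A received message carries the step it belongs to.  The receive
     * function is supplied by the caller and returns 0 on success or
     * -1 on timeout; both are illustrative, not a real runtime API. */
    typedef struct { int step; double value; } Msg;
    typedef int (*RecvFn)(Msg *out, int timeout_ms);

    /* Phi_P: every peer message must belong to the expected step and
     * arrive before the timeout expires. */
    int phi_p_progress(RecvFn recv_with_timeout, int max_step, int timeout_ms)
    {
        Msg m;
        for (int k = 1; k <= max_step; k++) {
            if (recv_with_timeout(&m, timeout_ms) != 0)
                return 0;               /* ERROR: peer made no progress   */
            if (m.step != k)
                return 0;               /* ERROR: message from wrong step */
            /* ... process m.value locally ... */
        }
        return 1;                       /* all steps arrived in order     */
    }

    /* Trivial stand-in receiver so the sketch is self-contained: it
     * delivers steps 1..max_step in order and never times out. */
    static int fake_recv(Msg *out, int timeout_ms)
    {
        static int next = 1;
        (void)timeout_ms;
        out->step = next++;
        out->value = 0.0;
        return 0;
    }

    int main(void)
    {
        printf("%d\n", phi_p_progress(fake_recv, 5, 100)); /* prints 1 */
        return 0;
    }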
Locality and obtainability cooperate to form an IPT test for monotonic convergence. Consider as a special case of the convergence envelope localized monotonic reduc- tion of error. This is typically seen in relaxation solutions such as PDE relaxation. 1' A discrete convergence function is employed in Chapter 9 when parallel sorting is con- sidered. 43 Without loss of generality, assume that the sequence of local errors 85") is monotone decreasing. elk”) 5 85k) u§k+1)—u,- S uf")—u,- (4.2) It then follows immediately that the sequence of ug’s are also monotonically decreasing. A suitable constraint predicate (D): for this relation is: Predicate tbp (uSk)'u§"+1), real, i,k) returns bool; It 043"“) 2 uf”) ERROR; end. Convergence envelopes, while ensuring a guarantee of nonegative progress, cannot check for sufficient progress. By this we mean that a solution satisfying the above predi- cate may proceed arbitrarily slowly. Consider bounds on the convergence rate ‘Yh’iln and 79:2,. such that each step satisfies (1‘) (k) “8 II It) Again this is a PPT class predicate. A corresponding local rate bound may be obtained and implemented similarly to that of (4.1). Progress alone is not sufficient to guarantee solution correctness. The final vector, u“), particularly when the bounds y are loose, may differ significantly from the actual solution u. To restrict completely arbitrary behavior further, we consider feasibility as the next constraint predicate basis. 4.2.2 Feasibility ((1);) Each testable result must remain within the defined solution space of the problem. Formally, consider a solution space H), and any intermediate result “(1‘) . Then for any step k, “(1‘) 8 H k The feasibility constraints are often immediately apparent from the nature of the problem studied. Problems in physics and engineering such as equilibrium problems, eigenvalue problems, and to a lesser extent propagation problems contain feasibility con- straints in the form of boundary conditions. Indeed these boundary conditions are exactly the class of natural problem constraints. Equilibrium problems can be described as jury problems in which the entire solution is obtained by a jury which must satisfy all the boundary constraints and all internal requirements of the problem. Propagation prob- lems can be considered as marching problems in which the solution marches from the initial state guided and modified by the side boundary conditions. In these type of prob- lems, the boundary conditions are of course known a priori and do not vary as the solu- tion progresses (H), =H is stationary), and thus can easily be used as feasibility con- straints. Branch and bound tree searching (for example see [HoSa84]) is a method which searches a state space (tree) for a minimum cost goal state. This search is guided by an accumulated and estimated cost function. Let c(")(x) be the estimate of the cost function at step k and c(x) be the actual cost of reaching the goal state from node x. If we enforce that emu) S c (x) for ke {l,2,...}, then ca‘)(x) provides a lower bound in H), on the cost of any solution obtainable fiom node x. Let L be an upper bound on the cost of a minimum cost solution (initially L may be co). If at some step k, a possible goal node y is discovered, then all nodes x with c (x) 2 C(k)(x) > co) may be killed, and L is updated as L = 00). Thus branch and bound provides a way of dynamically narrowing the bounds on the feasible solutions Hk. 
For some problems, the feasibility constraints may be so loose as to be virtually of no use. This is precisely the case for the "area" predicate OpA. This is an example of a feasibility predicate for which [1,, = {x l x 2 0) for ke (1,2,...) i.e. the set of positive 45 reals. The difficulty lies not in the idea of using a constraint predicate. This is the best predicate that can be obtained from the problem statement. The problem simply contains insufficient natural constraints from which to generate a good predicate. 4.2.3 Consistency (OC) Many intermediate calculations contain additional properties that are indirectly obtainable from the problem’s natural constraints. These are defined as Consistency Con- ditions. In a white box testing environment, all facets of each testable step can be checked. As noted previously, in the DMMP environment, due to the locality of infor- mation, black box testing must be utilized. A consistency predicate may be applied only to the received information and locally known information. This may appear to severely restrict the functionality and usefulness of this type of test, but in reality a consistency test is powerful. [Ay0287] developed an entire constraint predicate using only con- sistency conditions. Consistency can compensate for limitations in the progress and feasibility bases. Consider a problem in which (4.3) has a globally specified bound 7 but no locally specifiable 7;. Furthermore, assume that it is too expensive to implement a PPT predi- cate. Knowledge that a processor has about its local state and the values that it sent to other processors in previous steps can provide bounds on the range of acceptable values for the current testable step. Assume that the solution proceeds as in (4.2). Let each new value of uf") be calcu- lated as a linear function f of the neighboring values ujk‘ll, (ER with coefficient a). From the perspective of a processor P,- calculating u,-, a candidate intermediate result u)", leR must satisfy the following property iii") 5 ask-ll—a,(u$"-2)-u$"—1)), 1,: = 1,2,...,N, ke {2,3,4,...} (4.4) To prove this as a theorem it is necessary to consider two successive iterations uf") and ujk‘l). Subtracting the two yields an expression which is a function of known and 46 unknown values (from processor Pi’s point of view). If we let W“) represent the unknown values at iteration k, the expression becomes uikl-ujk’1)= W(k)—u(k‘1)+a)(ufk'1)—uf"’2)), l,i = 1,2,...,N, ke {2,3,4,...} Since f is a monotonic linear function and ui") S uik'l) where l,i = 1,2,...,N, ke {1,2,...], then WU‘“1)2W("). This provides a lower bound on the amount of movement that processor P,- can expect which results in the inequality given in (4.4). The lower bound is solely a function of the values that P,- sent P, in the previous iterations and Pl’s previous iteration. Furthermore, this is an IPT predicate which makes it attractive fi'om a computational aspect. 4.3 CHAPTER SUMMARY A systematic way of generating executable assertions for the DMMP envrionment is essential to the application oriented fault-tolerance paradigm. The three predicate subc- lasses of progress, feasibility, and consistency formulate the constraint predicate. Error coverage analysis techniques for the resulting constraint predicate are presented in Chapter 6. Natural problem constraints are a necessary condition for successful con— straint predicate generation. It is not known if natural constraints are a sufficient condi- tion. This is discussed in Chapter 10. 
Chapter 5
Distributed Diagnostic Basis

In this chapter, the two diagnostic basis algorithms mentioned in Chapter 3 are explored in greater detail. We present syndrome testing and show that in the Byzantine environment it produces an incomplete diagnosis. As an alternative, Byzantine Agreement is described, and an implementation called Vector Byzantine Agreement, which is necessary for efficiency in the DMMP, is presented, proven correct, and given running time bounds.

5.1 SYNDROME TESTING AND DIAGNOSTIC IMPOSSIBILITIES

In the mid to late sixties, fault diagnosis research had no real formalism to rely upon. In a landmark paper [PrMC67], the authors outlined a method of system level diagnosis called syndrome testing. Peer testing using this method has received concentrated research treatment over the ensuing two decades since its inception and thus is worthwhile to consider as a candidate distributed diagnostic basis.

Syndrome testing of a system $S$ is modeled as a testing digraph $G(V,A)$, where each processor in $S$ is represented by a vertex, and each arc $a_{i,j} \in A$ denotes that processor $i$ tests processor $j$, for $i,j \in V$. An arc $a_{i,j} \in A$ is given weight 0 (1) if the testing processor $i \in V$ finds the tested processor $j \in V$ fault-free (faulty). If $a_{i,j} = 0$ (1), then it is said that $i$ has a 0-link (1-link) to $j$. The collection of all such test weights for a particular system $S$ is called the syndrome $W$ formed by the test graph $G$ of the system $S$. The system $S$ is diagnosed through the examination of the syndrome $W$ to locate a unique set of faulty processors $F$. To facilitate the discussion of this testing model, the following sets are defined. For each vertex $i \in V$, let

  $\Gamma(i) = \{j \mid a_{i,j} \in A\}$
  $\Gamma^{-1}(i) = \{j \mid a_{j,i} \in A\}$

Discussion of fault diagnosis through testing must consider a fault model of deviant processor behavior. In the PMC model [PrMC67], the following behavior is assumed. If processor $i$ is fault-free and tests processor $j$, a fault-free (faulty) processor, then $a_{i,j} = 0$ (1). If processor $i$ is faulty, then the test outcome is unreliable. This is summarized in Figure 5.1(a). The term symmetric invalidation is used to describe the PMC model.

The characterization of diagnosable graphs $G$ given in [HaAm74] is a function of the maximum number of tolerable faults and the connection assignment. If a system $S$ obeys this characterization, then it is said to be $t$-fault diagnosable: the system can withstand up to $t$ faults and diagnosis can still identify all the faulty units in the system for all syndromes that can result from testing under the PMC model. Performance of the diagnosis is described by the same terms correct and complete from Chapter 3. [DaMa84] presents an algorithm which can diagnose a system $S$ in $O(n^{2.5})$ time. The diagnosis provided by this algorithm is both correct and complete, as the fault set $F$ is exactly the set of faulty processors.
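The testing model can be made concrete by simulation. The following Python sketch generates the syndrome a given fault set would produce under the PMC model, with unreliable outcomes drawn at random; the ring test assignment and the random model of unreliability are illustrative assumptions.

    import random

    def pmc_syndrome(arcs, faulty):
        """Generate one syndrome W for the test digraph given by `arcs` under
        the PMC model: a fault-free tester reports the testee's true status;
        a faulty tester's outcome is unreliable (modeled here as random)."""
        W = {}
        for (i, j) in arcs:
            if i in faulty:
                W[(i, j)] = random.randint(0, 1)    # unreliable outcome
            else:
                W[(i, j)] = 1 if j in faulty else 0
        return W

    # Example: five processors testing in a ring, processor 3 faulty.
    n = 5
    arcs = [(i, (i + 1) % n) for i in range(n)]
    print(pmc_syndrome(arcs, faulty={3}))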
The PMC model can be unrealistic since it requires a faulty processor to always identify itself to a non-faulty testing processor. This may not occur if the test itself is "incomplete," that is, a faulty processor can pass the test but still be faulty. The model is also unrealistic in that it requires that a faulty processor exhibit only permanently faulty behavior. The CMOS technology involved requires consideration of intermittently faulty behavior. To review the distinction between these two cases presented in Chapter 1, once a permanently faulty unit fails, it suffers a hard failure: it never recovers. An intermittently faulty unit may function incorrectly and then correctly at a later time, with the possibility for future failure/recovery cycles.

Figure 5.1. Fault Models: (a) the PMC fault model; (b) the intermittent fault model.

Fault diagnosis under the intermittent fault model has received much study [MaMa78, YaMa86b]. For intermittently faulty behavior, if $i$ is a fault-free processor testing a faulty processor $j$, the outcome is unreliable. All other conditions of the PMC model remain true. This is depicted in Figure 5.1(b). This model is also characterized by symmetric invalidation.

Let $|V_1|$ be the cardinality of set $V_1$.

Lemma 5-1 [MaMa78]: A system $S$ is $t_i$-diagnosable if and only if for any set of processors $V_1 \subseteq V$ where $0 < |V_1| \le t_i$, … , $i \in V_1$.

If a system $S$ subject to intermittent faults is characterized as $t_i$-fault diagnosable, then a correct diagnosis resulting in identification of a unique fault set $F$ may be obtained provided that at most $t_i$ processors fail. [YaMa86a] presented an algorithm for $t_i$-fault diagnosable systems which yields a unique fault set $F$. However, under certain characterizable syndromes, the algorithm may not achieve a complete diagnosis. Thus, based on the syndrome that occurs, faulty units may escape detection. This work provides a characterization of syndromes for which their algorithm produces an incomplete diagnosis. To describe this characterization, we adopt their notation. For each $i \in V$, the 0-ancestors of $i$ correspond to the set

  $A_0(i) = \{j \mid a_{j,i} = 0\}$

Theorem 5-1 [YaMa86a]: A faulty processor $j$ is detectable by Yang and Masson's algorithm if and only if $|A_0(j) \cup \{j\}| \le t_i$, where $t_i$ is the maximum tolerable number of intermittent faults.

Theorem 5-1, however, is not a full characterization of $t_i$-diagnosability. Consider the testing digraph $G$ and syndrome shown in Figure 5.2 for $t_i = 1$. The following discussion shows that processor 2 is the only faulty processor. If processor 2 is not faulty, then assume processor 0 is faulty, as indicated by the testing link $a_{2,0} = 1$. But then either processor 1 or processor 2 must also be faulty since a 1-link exists between them. Since processor 2 is fault-free, then processor 1 must be faulty, and the 1-link from processor 1 to processor 2 ($a_{1,2} = 1$) is an erroneous test result reported by processor 1. This, however, is a contradiction since now the fault set $F$ has cardinality 2 ($|F| = |\{0,1\}| = 2 > t_i$) and exceeds the bound on the number of faults. If we assume processor 1 is faulty, a similar analysis reaches the same contradiction: the syndrome contains too many faulty processors. Thus, the only possible diagnosis is that $F = \{2\}$. It is easily verified that the testing digraph $G$ satisfies Lemma 5-1 for $t_i = 1$ and is thus $t_i$-diagnosable. However, $A_0(2) = \{0\}$, which when applied in Theorem 5-1 yields $|A_0(2) \cup \{2\}| = 2 > t_i$. Thus Yang and Masson's algorithm cannot detect the faulty processor in this syndrome even though it is diagnosable.

Figure 5.2. Syndrome undiagnosable by the Yang and Masson algorithm: a $t_i$-diagnosable system with testing digraph $G$; processor 2 is faulty, $t_i = 1$.

The reason for this should be clear. The Yang and Masson characterization of diagnosable syndromes only considers detection of faulty behavior through implication of the suspected faulty processor. The analysis just performed used the additional information of faulty implication by the suspected faulty processor.
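The case analysis above can be mechanized by enumerating every fault set consistent with the syndrome under the intermittent model. A Python sketch follows; the three-processor test cycle and the 0-link $a_{0,1}$ are assumptions read off the discussion, since the figure itself is not reproduced.

    from itertools import combinations

    # Assumed syndrome of Figure 5.2: a test cycle 0 -> 1 -> 2 -> 0.
    tests = {(0, 1): 0, (1, 2): 1, (2, 0): 1}
    n, t_i = 3, 1

    def consistent(fault_set):
        """Symmetric invalidation, intermittent model: a test by a fault-free
        processor on a fault-free processor must read 0; any test involving
        a faulty tester or testee is unreliable and may read anything."""
        return all(not (i not in fault_set and j not in fault_set and w == 1)
                   for (i, j), w in tests.items())

    candidates = [set(f) for r in range(t_i + 1)
                  for f in combinations(range(n), r) if consistent(set(f))]
    print(candidates)    # [{2}]: the unique diagnosis argued in the text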
This prompts the question of whether a better diagnosis algorithm exists. Optimally, a diagnosis algorithm that can perform a complete diagnosis is desired. We now show that an incomplete diagnosis is the best that can be achieved, and that a complete diagnosis for all syndromes is impossible.

Definition 5-1: A Non-masking Diagnosis is one in which the result of a test $a_{i,j}$ from processor $i$ to processor $j$ cannot be discarded by the diagnosis algorithm unless it is known that processor $i$ is faulty.

Definition 5-2: A Deterministic Diagnosis is a diagnosis in which no individual probabilities are assigned to either the failure of the processors or to the results of the tests.

A non-masking diagnosis cannot, as in [GuRa86], find a processor $i'$ faulty based on some minimal cardinality proper subset of $\Gamma^{-1}(i')$. To do so establishes a diagnosis based on repetitive testing. Deterministic diagnosis forces symmetric invalidation of two processors $i'$ and $j'$ which share an incident 1-link. The diagnosis algorithms given by [YaMa86a, DaMa84] are both deterministic non-masking diagnosis algorithms.

Lemma 5-2: Given any testing digraph $G(V,A)$, there exists one syndrome for which any non-masking diagnosis is incomplete, for even the single intermittent fault case.

Proof: Let $G(V,A)$ be a digraph with weighted arcs $a_{i,j}$ as described in Section 5.1. Let $i'$ be a faulty vertex with $a_{i,i'} = 0$ for $i \in \Gamma^{-1}(i')$ and $a_{i',j} = 0$ for $j \in \Gamma(i')$. Then it is impossible to diagnose the system completely since $i'$ cannot be assigned to the set of faulty units $F$. □

Thus a faulty processor may pass all tests and report pass for processors it tests in the trivial case. The syndrome described by Lemma 5-2 is called the trivial syndrome. However, even if the faulty processor manifests itself by assigning 1-links to some incident edges, it can still escape detection, as the following theorem shows.

Theorem 5-2: Given any testing digraph $G(V,A)$ with $|A| \ge 1$, there exists at least one syndrome, exclusive of the trivial syndrome, for which any deterministic, non-masking diagnosis is incomplete, for even the single intermittent fault case.

Proof: Let $G(V,A)$ be a testing digraph with weighted arcs $a_{i,j}$. Without loss of generality, let $i'$ and $j'$ be given such that $j' \in \Gamma(i')$ and $a_{i',j'} = 1$. Furthermore, let every other arc incident on $i'$ or $j'$ have weight 0. Note that one or both of the sets $\Gamma^{-1}(i')$ and $\Gamma(j')$ may be the empty set. Since the diagnosis is deterministic, symmetric invalidation states that there is no conclusive information to favor one processor as faulty over the other. Thus it is impossible to diagnose the system, as either processor $i'$ or processor $j'$ may be in the set of faulty units $F$. □

Further lessening of the restrictions on the intermittent fault model leads to the most general class of faults, the Byzantine fault model. In this model, a faulty processor no longer exhibits random behavior when tested, as in the intermittent faulty mode, but assumes a malicious personality. By this malicious behavior, a faulty processor will perform the most disruptive function at the most critical time. This is clearly an attractive fault model since any algorithm tolerant of Byzantine faults is tolerant of all faults. In the Byzantine case, the syndrome testing fault model is exactly the same as the intermittent fault model (in terms of testing arc weights). However, the philosophy of the assignment is different. In the syndrome diagnosis, the faulty units will elude detection.
In light of Lemma 5-2, a Byzantine processor can always completely elude detection by passing all its tests through reporting of the trivial syndrome. Furthermore, it will confuse the diagnosis, as in Theorem 5-2, such that an incomplete diagnosis is obtained.

Two main points are stressed in this discussion. The first is that the set of $t_i$-diagnosable syndromes given by [YaMa86a] is a proper subset of the true set of $t_i$-diagnosable syndromes. More importantly, we have shown that there can never be a diagnosis algorithm for $t_i$-diagnosable systems which is always complete. This is particularly important when the Byzantine class of faults is considered. Unlike algorithms which always function correctly in the presence of Byzantine faults, such as Byzantine Agreement, no algorithm can perform Byzantine diagnosis completely. Numerous algorithms have been proposed that function in the presence of Byzantine faults and purport diagnosis of Byzantine faults [GuRa86, ShRa87]. In reality, these algorithms are tolerant of intermittent failures, as they require repeated testing to uncover faulty behavior. A true Byzantine fault will never manifest itself in a manner diagnosable by these algorithms.

5.2 BYZANTINE GENERALS PROBLEM

The Byzantine Generals problem [LaSP82] is a rephrasal of [PeSL80], giving the problem a more colorful name and description. In the Byzantine Generals scenario, the Byzantine army is encamped outside of an enemy city. Each division has a general (resource), some of whom are traitorous. The solution requires that the loyal generals decide on the same plan of action ("attack" or "retreat") and that a small number of traitorous generals cannot force adoption of a bad plan (a bad plan being that some generals attack while others retreat). Note that for all the loyal generals to decide upon the same plan of action, all the loyal generals must obtain the same information. This is done by the exchange of messages. A traitorous general may send different values to different generals. In order to prevent a small number of traitors forcing the adoption of a bad plan, the value sent by a loyal general must be used by every loyal general as the correct value. Furthermore, all loyal generals must use the same value from a traitorous general. An example adapted from [PeSL80] for the four general, one traitor, case is given in Figure 5.3. For the complete algorithm, the reader is encouraged to consult one of the two above references. We now substitute processors for generals and use multiple values instead of the single valued "attack" or "retreat."

Figure 5.3. Byzantine Agreement ($P_1$ is faulty) (from [LaSP82]).

In Figure 5.3 there are four processors ($P_1$-$P_4$), one of which is faulty ($P_1$). It is not known, however, by $P_2$, $P_3$, or $P_4$ that $P_1$ is faulty. Nor may $P_1$ be aware of its own faulty condition. If the faulty processor $P_1$ broadcasts over point to point links, it is free to send different values to different recipients of its broadcast. Say it sends the value "a" to $P_2$, "b" to $P_3$, and "c" to $P_4$. Each processor, with its own local view of the system, has no idea that the other processors have received different values for the same broadcast message. By exchanging values, it is possible for each processor to relay the value it received from $P_1$ to each of the other processors. Thus if each one of the receivers relays its version of what $P_1$ sent to the other receivers, each receiver will have three versions of $P_1$'s value.
From the point of view of one processor, say $P_2$, it will receive one version (the value "a") directly from $P_1$, the value "b" from $P_3$ as $P_1$'s value, and the value "c" from $P_4$ as $P_1$'s value. Since all receivers have received different values in this case, a strategy might be to pick the lowest ASCII coded value, say "a." Thus $P_2$ will use the value "a" as $P_1$'s value. Since both $P_3$ and $P_4$ perform the same computation as $P_2$, and since $P_2$, $P_3$, and $P_4$ are non-faulty, they all come to the same decision on $P_1$'s value, namely "a." An algorithm which allows the non-faulty processors to come to the same agreement on a value is said to solve the Byzantine Generals Problem.

A round is a set of message exchanges. The above example used two rounds of information interchange. [DoSt82] shows that for $n$ generals with $t$ faulty, $t+1$ rounds of interchange are needed to reach agreement among $n \ge 3t+1$ generals. If either fewer than $3t+1$ generals or fewer than $t+1$ rounds are employed, no reliable consensus can be reached. This solution is known to be $t$-resilient, that is, it can withstand $t$ faults. This is the Unauthenticated Byzantine Generals solution. It requires three assumptions.

1) Every message that is sent is delivered correctly.
2) The receiver of a message knows who sent it.
3) The absence of a message can be detected.

The Authenticated Byzantine Generals solution requires, in addition to assumptions 1-3 above, the following assumption.

4) Each loyal general can send unforgeable signed messages.

For the authenticated solution, a three general solution exists, and the protocol is known to be $n$-resilient, that is, it can withstand $n-2$ faults where $n$ is the number of generals in the system ($n$ must be $> 2$ or the problem is vacuous).

5.2.1 Synchronic Algorithm Constraints

In voting, we are concerned with whether the algorithm is synchronous, asynchronous, or partially synchronous, characterized by bounds $\Delta$ and $\Psi$. $\Delta$ is the bound on the time to send a message from one processor to another, and $\Psi$ is the maximum clock drift between any two processors.

5.2.1.1 Synchronous Algorithms

If $\Delta$ and $\Psi$ are known a priori, then the system is said to be synchronous. The Byzantine generals problem can be solved for $t$-resiliency or $n$-resiliency in the following cases:

1) Synchronous processors, synchronous communication
2) Synchronous processors, synchronous message order
3) Broadcast transmission and synchronous message order

Synchronous algorithms, such as used in the FTMP [HoSL78], do not generally fit the DMMP model. However, specific applications may indeed fit this model.

5.2.1.2 Asynchronous Algorithms

In an asynchronous system, neither $\Delta$ nor $\Psi$ exist. It is impossible to find consensus in an asynchronous environment for even one faulty Fail-Stop processor [FiLP85]. There exists a window of vulnerability in which a faulty processor can indefinitely delay agreement. Deterministic asynchronous Byzantine Agreement can be achieved only if processors do not fail during the protocol execution. However, other, more realistic possibilities also permit asynchronous agreement.

[Perr85], building on the work of [Rabi83], proposed a randomized Byzantine Generals algorithm which reaches agreement with probability $1-2^{-R}$ in $R$ rounds using the concept of "Shared Secrets" [Sham79], in which Digital Signature Authentication is employed and a trusted dealer gives out public key encrypted data. Shared secrets are used to circumvent the impossibility result given in [FiLP85].
The reconstruction of the secret prevents a process from indefinitely waiting for messages from faulty processors. This algorithm will achieve agreement with probability 1 as $R \to \infty$. The expected number of rounds is four; two rounds are needed to reach agreement and two rounds are needed to verify the proof.

This algorithm works on the following principle. A station wishing to enter into agreement enters into a loop bounded by $R$. This loop consists of procedures Poll, Lottery, and Decision. Poll sends its encrypted message to all stations. It then reads all values coming into it from this round. Collecting $n-t$ values, it decides the plurality vote and the number of stations voting this way. The station then enters the Lottery phase. The Lottery asks for the secret message from all processes for this round. The secret is then decoded when $t$ different messages have arrived. The final phase is the Decision. Each station has its own idea of what the message should be. Based on the randomly generated secret just received, the message is either accepted or rejected. This procedure continues until acceptance is met or $R$ is reached. Notice that since we have the completely asynchronous case, messages may arrive out of order, be delayed indefinitely, or exhibit other Byzantine manifestations. Thus each round of the lottery should get its messages from some asynchronous process which collects messages for all rounds regardless of the current round. [Rabi83] used signed messages in his algorithm, but [Perr85] showed that this was unnecessary, in agreement with the theoretical result given in the Byzantine Generals problem statement. In both cases, the resulting algorithms tolerate $t$ faulty processors where $n > 6t$.

The method of shared secrets is limited. A trusted dealer must exist to reliably predistribute the shared secret values before each process begins. Additionally, it must be assumed, although not explicitly said in either paper, that a process does not know the secret if and only if it sends a bad message.

An alternative to algorithms employing shared secrets is presented by [DLPS86], in which approximate agreement can be reached in an asynchronous environment. This algorithm handles Byzantine Agreement with successive approximations converging to some $\varepsilon \ge 0$. At first it is unclear how one could use such an approximate algorithm. However, it becomes useful when one considers the problem of synchronizing clocks or of agreeing on real valued input from data collection sensors. The algorithm functions in a number of rounds indirectly bounded by the requested $\varepsilon$. At each round, $n-t$ values are requested from other processors. A selection algorithm chooses the smallest $2t$ elements and then takes the average. When the average is less than $\varepsilon$, the algorithm terminates. This algorithm tolerates $t$ faulty processors where $n > 5t$. This procedure assumes that all processes terminate, but not necessarily at the same time. It also tolerates a greater proportion of faulty processes than $5t$ because it handles processes which exhibit transient Byzantine behavior; that is, they fail and then recover to a correct mode of operation.
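The round structure of approximate agreement can be sketched as follows. This is an illustrative Python simulation only: it uses a simple trimmed mean as the selection function, models Byzantine inputs as random values injected per receiver, and converges quickly against such a random (rather than worst-case) adversary; the actual selection operator of [DLPS86] differs and guarantees convergence against a worst-case adversary.

    import random

    def approx_round(correct_vals, t):
        """One round as seen by each correct process: it collects the correct
        processes' values plus up to t adversarial values, discards the t
        lowest and t highest, and averages what remains, so all estimates
        stay inside the range spanned by the correct values."""
        new_vals = []
        lo, hi = min(correct_vals), max(correct_vals)
        for _ in correct_vals:
            byz = [random.uniform(lo - 1.0, hi + 1.0) for _ in range(t)]
            seen = sorted(correct_vals + byz)
            trimmed = seen[t:len(seen) - t]
            new_vals.append(sum(trimmed) / len(trimmed))
        return new_vals

    vals, t, eps, rounds = [0.0, 0.3, 0.9, 1.0, 1.1], 1, 1e-3, 0
    while max(vals) - min(vals) >= eps:
        vals, rounds = approx_round(vals, t), rounds + 1
    print(rounds, vals)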
5.2.1.3 Partially Synchronous Case

A partially synchronous system is one in which both $\Delta$ and $\Psi$ hold eventually but are not known a priori. Agreement in the partially synchronous environment is solvable without authenticated signatures [DwLS88]. The algorithm tolerates $t$ faults where $n > 3t$. The algorithm locks and unlocks various supposed values of agreement for each round. If a locked value is later learned to be not an agreed upon value, it is unlocked and the algorithm continues. The algorithm is guaranteed to terminate; however, the number of rounds is polynomial in $N$, $\Psi$, and $\Delta$, which for an actual implementation is expensive.

5.2.2 Masking vs. Detection

The Byzantine Generals problem has been the subject of study as a solution to the problem of finding a consensus among a number of processors in a faulty environment. The weak consensus problem is solved if, when the local value of all non-faulty processors is $v$ and no processors fail, all processors reach the same agreement, namely $v$. The strong consensus problem adds that all non-faulty processors reach agreement on $v$ in the presence of up to a predetermined number of faults, denoted by $t$.

While Byzantine Generals solution algorithms mask faults, they alone have a drawback in the solution of the fault-detection problem. In the introductory example it was clear that processor $P_1$ was faulty, since all receivers relayed different values. If the sender is non-faulty, however, and a receiver is faulty and thus relays the wrong values, processors may erroneously flag $P_1$ as faulty when, in reality, some other processor is faulty. The Byzantine Generals solution cannot implicitly locate the faulty processor (Byzantine detection is covered in Section 5.2.4). It can only guarantee that a consistent state of the system can be produced for each processor. However, an acceptance test run on this view by each non-faulty processor will produce the same set of faulty processors as a test result.

5.2.3 Vector Byzantine Agreement

Byzantine Agreement (even in the best case of synchronous behavior) is expensive. The problem with each of the agreement algorithms presented above is that to achieve the necessary consensus for application oriented fault-detection among $n$ processors, $n$ individual agreements are necessary. Vector Byzantine Agreement collapses these into a single agreement by trading off communication complexity for time and space complexity. Since message initiation is the most expensive part of the communication, lowering the number of agreements lowers the number of messages.

The algorithm presented achieves Vector Byzantine Agreement under the assumptions of synchronous processors and communications and a reliable, completely connected (within the agreement group) communication network. These are not unreasonable assumptions. If communication network errors occur, they may be charged to one of the processors in the agreement. The completely connected communication requirement only extends to the members of the agreement group, which is typically of small size and is seen to be easy to construct for the Poker environment [Snyd84]. Furthermore, the algorithm extends to the less than completely connected case using the result of [Dole82]. The concept of Vector Byzantine Agreement is not limited to synchronous processors. Any of the partially synchronous [DwLS88] or asynchronous [DLPS86, Perr85] algorithms presented above could be adapted to the form of Vector Byzantine Agreement to function as a distributed diagnostic basis.

A constraint predicate $\Phi$ may be applied by each non-faulty processor to its local copy of the agreed upon vector. The predicates will be constructed in such a way that each non-faulty processor will obtain the same decision as every other non-faulty processor.
5.2.3.1 Agree

The algorithm Agree described in this section achieves Vector Byzantine Agreement. We introduce some notation at this point to formally describe Agree.

$n$  In the agreement, there are $n$ processors. The $i$th processor is denoted $P_i$ for $i = 1,\ldots,n$.

$t$  Among the $n$ processors, the maximum number of tolerated faults is $t$.

$v_i$  Each processor that enters into the agreement begins with a local value that it wishes to broadcast to the other processors in the agreement. For $P_i$, this value is denoted by $v_i$. The value $v_i$ is treated as correct by $P_i$. Note that $v_i$ may or may not be equal to $v_j$ if $i \ne j$.

$V^i$  At the successful termination of the agreement protocol, each processor has obtained a vector of values. Each component of this vector corresponds to the agreed upon local value of a processor in the agreement. For processor $P_i$, $V^i$ denotes $P_i$'s vector of values. Each component of this vector is the value $V^i(j)$ where $j = 1,\ldots,n$. $V^i(j)$ denotes the value that $P_i$ uses as $P_j$'s local value $v_j$. It is the case that $V^i(i) \equiv v_i$ if $P_i$ is non-faulty.

Agree departs from typical Byzantine agreement algorithms [LaSP82, Dole82], which deal with single valued agreement, and instead reaches a consensus on the entire vector of local values in the same set of rounds. Thus at termination, each non-faulty processor will return its vector $V^i$, which achieves the Vector Interactive Consistency Conditions.

Vector Interactive Consistency Conditions:

VIC1: Any two non-faulty processors $P_i$ and $P_j$ obtain vectors $V^i$ and $V^j$ such that $V^i(k) = V^j(k)$, $k = 1,\ldots,n$.

VIC2: If a processor $P_k$ is non-faulty with local value $v_k$, then for any non-faulty processor $P_i$, $V^i(k) = v_k$.

The above two conditions may seem the same; however, they are not. Condition (VIC1) states that any two non-faulty processors obtain the same copies of the local values of all the processors in the agreement. However, these values may or may not be equal to the local value $v_i$ that a particular processor $P_i$ actually holds. Condition (VIC2) states that all non-faulty processors agree on the same value as a particular processor $P_k$'s local value $v_k$ if that particular processor is non-faulty.

The other protocols mentioned at the beginning of this section require $n(t+1)$ rounds to reach agreement, as each value must be agreed upon individually before the next round can start. Our protocol achieves agreement in $t+1$ rounds. The protocol recursively broadcasts and then negotiates on the broadcast values. Each negotiation requires a message exchange, and thus each processor recursively enters into agreement on the exchanged messages.

The algorithm Agree shown in Figure 5.4 achieves Vector Interactive Consistency (reference Theorem 5-3, which follows). Each processor running Agree will send its local value to all other processors. Then Agree is called recursively to reach agreement on these values. The algorithm tolerates a maximum number of faults, $t$. It will be shown in Theorem 5-3 that agreement can be achieved if $n > 3t$. Initially, processor $P_i$, $1 \le i \le n$, invokes the algorithm Agree with Agree(0,$v_i$). Let $Q$ be the set of processors in the agreement excluding the calling processor.

Agree is called recursively. The first time, each $P_i$'s $v_i$ is broadcast to all other processors. Subsequent calls reach agreement on the received broadcast values. Thus, Agree(0,$T_0$) performs the initial broadcast of each $v_i$. Agree(1,$T_1$) calls for agreement on the values received in Agree(0,$T_0$) by invoking Agree(2,$T_2$).
Let T", be an m-dimensional square array in which each dimension is of length n. In each call to Agree(m,T,,,) each processor sends its own Tm and assembles n-l m- dimensional square arrays from the other n—l processors to form an (m +1)-dimensional Tm“. Each T ,,, from Pk is indexed in the (m +1)-dimcnsional Tm“ (k). Similarly, each m~ dimensional Tm is indexed in the (m +2)-dimensional sqaure array Tm+2 by Tm+2(i,k) where j indexes a T,,,+1 and k indexes the T," within that T "+1. In the proofs, processor Pk’s T,,I is referenced by T5. Where no ambiguity exists, the superscript on T is dropped. Let R represent the indices of elements in the m-dimensional square array Tm: i 1, i 2. ...,i,,,, where 1Si an, for lSj Sm. Thus an individual element of Tm is referenced by Tm(R). Similarly define Tm+1(k,R) to be an element in the k’th m-dimensional square array T", and Tm+2(j,k,R) to be an element in the k’th Tm of the j’th (m +1)-dimensional Tm“. The function majority returns the majority value of its arguments (the individual elements of each array). If no majority exists, it returns the lowest ASCII coded value. Note that this is a purely arbitrary choice; however, it will be made consistently by all non-faulty processors. If no value is received for a particular element, majority chooses an arbitrary (but consistent) value to use. Notice that at the initial call, the To is a scalar. At termination of Agree(0,T0), a vector T1 is returned satisfying conditions WC] and VIC2. Algorithm Agree(m,T,,,); allocate storage T1,,,+1; l) T 1m+1(id)<—T,,,; /* id is the calling processor id */ 2) for all PkeQ, send Tm to Pg; 3) for all PkeQ, receive an m—dimensional square array from Pk and store in T 1m+1 (k); if t>m then allocate storage T2", +2; 5) for each keQ for eachR such that ig at i). for 1 Sg,h S m and g at h for each jeQ,j¢ig andj at k for 1 fig Sm T1m+1(k,R)<—- majority (T 1m+1(k,R),T2,,,+2(i,k,R)); return(Tl,,,+1); Figure 5.4. Algorithm Agree 5.2.3.2 Example of Algorithm Agree A short example demonstrating the use of Agree in presented below. Figure 5.5 contains an example of the algorithm with n=4 and t=1. Processors are numbered P 1-P4 with P3 as the faulty processor. Initially, each P,- invokes algorithm Agree(O,v,-) (viaTfi). At Step 1, each processor sends To to the other three (3 = n-l) processors. Each processor then receives three values in Step 2. These values, along with the local processor’s vi, compose the square array T 11, a one dimensional square array. This is shown as a 4 element vector in Figure 5.5. Row indices index the received local value from each processor. Since P3 is faulty and exhibits Byzantine behavior, it may send any values it wants or may send no value at all. Agree(1,T11) is recursively called in order to exchange messages on each receiver’s view of the initial broadcast. Agree, in Step 2, sends the square array (T1) to the three other processors. In Step 3 it receives the three other l-dimensional square arrays. These 65 three l-dimensional square arrays along with the square array T1 it received from Agree in the previous recursion are used to create the 2-dimensional square array T 12. This is shown in Figure 5.5 as a two dimensional array. The row indices index the received T11 for each P;. The "?"’s for P3’s T1 indicate "don’t care" values since P3 is faulty and nothing can be assumed about its internal values. Since now t=m=1, Agree(1,T1) returns T 11 to Agree(0,T0). 
In Agree(0,$T_0$), each processor applies its majority function to the returned 2-dimensional square array $T2_2$ and to its $T1_1$ for each of the other processors. Processor $P_1$ obtains the majority values in the following way:

  $T1_1(2) \leftarrow$ majority($T1_1(2)$, $T2_2(3,2)$, $T2_2(4,2)$) = majority(b,b,b)
  $T1_1(3) \leftarrow$ majority($T1_1(3)$, $T2_2(2,3)$, $T2_2(4,3)$) = majority(a,b,c)
  $T1_1(4) \leftarrow$ majority($T1_1(4)$, $T2_2(2,4)$, $T2_2(3,4)$) = majority(d,d,c)

Thus $P_1$ returns [a,b,a,d] as its vector. Note that in all cases, $T1_1 \equiv T2_2(k)$ in every non-faulty processor $P_k$'s computation. Thus the majority could have been taken over $T2_2$. The majorities above decide the final square array $T1_1 = V^i$ for the non-faulty $P_i$, $i = 1,2,4$. Each processor runs majority 3 times, once to decide on each of the other processors' values. Notice that $P_i$ does not decide upon its own value $T_0^i$. This is known by $P_i$ to be correct.

Processor $P_3$ has attempted to thwart a consensus by sending different values to different processors. However, each non-faulty processor runs majority on the same set of values. Here we have chosen to choose the "lowest" value (ASCII collating sequence) for the majority if no majority value can be found. Thus each non-faulty processor uses "a" as the value for $P_3$. Vectored interactive consistency is seen to be achieved, as each non-faulty processor $i$ uses the vector of values $V^i$ = [a,b,a,d].

Each processor uses the communications medium to send six messages in two rounds. The total space requirement for this case is $O(n^2)$. In general, the algorithm Agree requires $O(n^{t+1})$ space and computation complexity and a communication complexity of $(n-1)(t+1)$ messages in $t+1$ rounds of information exchange.

Figure 5.5. Example of Algorithm Agree for $n = 4$, $t = 1$.

5.2.3.3 Proof of Correctness

It was stated that Agree achieves VIC1 and VIC2 for $n > 3t$. We must first prove that our algorithm satisfies vectored interactive consistency condition (VIC2) for a weaker condition of fault tolerance. This will be necessary when we consider the case of deciding on a vector when the transmitter of that vector is reliable and its recipients contain all the faults. This condition violates our intention of achieving VIC1 for $n \le 3(t-m)$ for a given recursion $m$.

Lemma 5-3: Agree satisfies condition VIC2 for $n > 3t-m$, where $t \ge 0$ is the maximum number of faulty processors and $m$ is the index of recursion.

Proof: We proceed by induction on $t-m$. For each correct sender in Step 2, all correct receivers receive each correct sender's value. If $(t-m) = 0$, Agree fulfills VIC2 because the number of faults is zero and every receiver uses the sender's value.
Assume that Agree works for $(t-m)-1$; we wish to show it is correct for $(t-m)$, $t > m$.

1) In Agree($m$,$T_m$), each correct receiver receives $n-1$ square arrays $T_m$ into $T1_{m+1}$ at Step 3, from the $n-1$ calls to send at Step 2 done by the $n-1$ other participating processors.

2) Each correct receiver then applies Agree($m+1$,$T1_{m+1}$). The number of processors considered in Step 5 (the number of individual elements of the square array $T1_{m+1}$) in the agreement is reduced by 1. Thus Agree($m+1$) runs with values from $n-1$ processors (1 directly obtained from the sender and $n-2$ relayed by other receivers, for each component of the vector $T1_{m+1}$).

3) Since Agree($m$,$T_m$) is called with $n$ square arrays $T_{m-1}$, and by hypothesis $n > 3t-m$, then $n-1 > 3t-(m+1)$. Thus, by the inductive hypothesis and by (2), Agree($m+1$,$T1_{m+1}$) achieves VIC2 with $n-1 > 3t-(m+1)$. Thus each non-faulty processor receives the same $T2_{m+2}(k)$ for each non-faulty $P_k$.

4) To achieve a majority vote on $n-1$ components of $T2_{m+2}(j,k,R)$ such that the majority is governed by values reported from non-faulty processors, it must be true that $n-1-t > t$, or $n-1 > 2t$. Since there are at most $t$ faulty processors and $t > m$,

  $n-1 > 3t-m-1$
  $n-1 > 3t-(t-1)-1 \ge 2t$.

Each non-faulty processor has enough reported values from non-faulty processors in $T1_{m+1}$ equal to $T_m^k(R)$ to compute the same majority value for any non-faulty $P_k$, namely $T_m^k$. Thus, for any non-faulty processor, $T1_{m+1}(k) = T_m^k$ for each non-faulty $P_k$, and VIC2 is satisfied. □

Theorem 5-3: Agree($m$,$T_m$) achieves vectored interactive consistency VIC1 and VIC2 for $n > 3(t-m)$.

Proof: We proceed by induction on $(t-m)$. If $(t-m) = 0$, since there are no faulty processors under consideration, each $P_i$ receives, in Step 3, the sender's local value $T_m^j$ for any $P_j$ (sent in Step 2). Each $P_i$ then returns the same vector $T1_{m+1}$.

Assume that Agree($m+1$,$T_{m+1}$) achieves VIC1 and VIC2. We want to show that this is true for Agree($m$,$T_m$), $t > m$.

Case 1  The sender in Step 2 is non-faulty. By Lemma 5-3, each non-faulty $P_i \in Q$ receives $T_m^k = T1_{m+1}(k)$ for each non-faulty $P_k$. By taking $m = 0$ in Lemma 5-3, we have $n > 3t$. Since the sender is non-faulty, VIC1 follows from VIC2.

Case 2  The sender in Step 2 is faulty. Thus each processor receives possibly a different value, the wrong value, or no value at all. Each correct receiver executes Agree($m+1$,$T1_{m+1}$). Since one of the faulty processors is known to be the sender, only $(t-m)-1$ faulty processors are among the receivers. Thus, since

  $n > 3(t-m)$
  $n-1 > 3(t-(m+1))$,

we apply the induction hypothesis to determine that each Agree($m+1$,$T1_{m+1}$) achieves vectored interactive consistency VIC1 and VIC2. Thus, for any two correct processors, for each $k$ and for all $j$, the $T2_{m+2}(j,k)$ are identical. Thus each correct processor obtains the same $T1_{m+1}(k)$ in Step 5. □

We now give the time and space complexity for algorithm Agree.

Theorem 5-4: The algorithm Agree with $n$ processors and at most $t$ faults requires space $O(n^{t+1})$.

Proof: The algorithm is recursively called with $m$, $0 \le m \le t$, as the index of recursion. For $m = 0$, Agree requires $1 + n^1 + n^2 + \cdots$ storage locations. For $1 \le m$ …

The case $Q > N$ is treated in Section 7.4. Selection of points for computation is controlled by the iteration variable $k$, as in the following predicate.

Definition 7-1:

  $\Lambda_{i,k}$ = 1 if $P_i$ is active during iteration $k$; 0 otherwise.
The most common of these consistent multicolor orderings is the red/black or checkerboard multicolor ordering [Smit65] (implying, among other things, that the number of nonzero $a_{i,j}$'s per row $i$ is no more than five). For this specific case, $\Lambda_{i,k}$ becomes:

Definition 7-2:

  $\Lambda_{i,k}$ = 1 if $(i \mathbin{\mathrm{div}} \sqrt{Q} + i \bmod \sqrt{Q})$ is even and $k$ is even, for $i \in \{1,\ldots,N\}$, $k \in \{0,1,2,\ldots\}$;
  $\Lambda_{i,k}$ = 1 if $(i \mathbin{\mathrm{div}} \sqrt{Q} + i \bmod \sqrt{Q})$ is odd and $k$ is odd, for $i \in \{1,\ldots,N\}$, $k \in \{0,1,2,\ldots\}$;
  $\Lambda_{i,k}$ = 0 otherwise,

where div is the integer division operator and mod is the remainder operator for integer division.

The $u_i$'s are assigned to points of a $\sqrt{Q} \times \sqrt{Q}$ coordinate indexed grid such that $u_j$'s with nonzero $a_{i,j}$'s are within distance one of $u_i$. Each $u_i$ is updated on odd/even values of $k$ according to $\Lambda_{i,k}$. An odd/even iteration cycle is sometimes called two half-iterations. The remainder of this chapter considers (without loss of generality) consistent multicolor orderings of length two. The pointwise SSOR method is given by the following equation for fixed $\omega$.

  $u_i^{(k)} = (1-\omega)u_i^{(k-2)} - \omega \frac{1}{a_{i,i}} \left[ \sum_{j \ne i} a_{i,j} u_j^{(k-1)} - v_i \right]$, if $\Lambda_{i,k} = 1$   (7.1)

The following theorem characterizes matrices which can be solved by (7.1).

Theorem 7-1: [Varg62] Let $A$ be a strictly or irreducibly diagonally dominant Hermitian $Q \times Q$ complex matrix. Then the pointwise SSOR method with $0 < \omega \le 1$ converges for any initial assignment vector.

  loop
    $u_i^{(k+1)} \leftarrow (1-\omega)u_i^{(k-1)} - \omega \frac{1}{a_{i,i}} \left[ \sum_{j \in \{\mathrm{North,South,East,West}\}} a_{i,j} u_j^{(k)} - v_i \right]$;
    $k \leftarrow k+1$;
  pool

  Figure 7.1. Relaxation Skeleton

7.2 NATURAL PROBLEM/SOLUTION CONSTRAINTS

Constraint predicate development proceeds as outlined in Chapter 4. First, the mathematical properties of the problem and its solution must be espoused. Theorem 7-1 allows the choice of the initial assignment vector $u^{(0)}$ to be arbitrary. The next theorem shows that certain choices can facilitate constraint predicate development.

Theorem 7-2: Let $A$ be a matrix satisfying Theorem 7-1 with positive diagonal and non-positive off-diagonal entries. Then the solution given by (7.1) will converge monotonically to the final solution $u^{(K)}$ for the following choices of $u^{(0)} = (u_i^{(0)})$, for $\omega = 1$.

1) Let $u_{\min}$ be the smallest value such that $v_i - u_{\min} \sum_{j=1}^{Q} a_{i,j} \le 0$ for all $i \in \{1,2,\ldots,Q\}$. If $u_i^{(0)} = u_{\min}$ for all $i \in \{1,2,\ldots,Q\}$, then $u_i^{(k)} \ge u_i^{(k+2)}$ for all $k \in \{0,1,\ldots\}$.

2) Let $u_{\max}$ be the largest value such that $v_i - u_{\max} \sum_{j=1}^{Q} a_{i,j} \ge 0$ for all $i \in \{1,2,\ldots,Q\}$. If $u_i^{(0)} = u_{\max}$ for all $i \in \{1,2,\ldots,Q\}$, then $u_i^{(k+2)} \ge u_i^{(k)}$ for all $k \in \{0,1,\ldots\}$.

Proof: For case 1), by induction on $k$, the iteration count. Basis, $k = 0$:

  $u_i^{(1)} = \frac{1}{a_{i,i}} \left[ v_i - \sum_{j \ne i} a_{i,j} u_j^{(0)} \right]$   (7.2)

and since

  $v_i - \sum_{j \ne i} a_{i,j} u_j^{(0)} - a_{i,i} u_i^{(0)} \le 0$   (7.3)

Since $a_{i,i} > 0$, dividing (7.3) by $a_{i,i}$, simplifying, and substituting into (7.2) yields

  $u_i^{(1)} = \frac{1}{a_{i,i}} \left[ v_i - \sum_{j \ne i} a_{i,j} u_j^{(0)} \right] \le u_i^{(0)}$   (7.4)

Continuing to the completion of the half iteration pair,

  $u_i^{(2)} = \frac{1}{a_{i,i}} \left[ v_i - \sum_{j \ne i} a_{i,j} u_j^{(1)} \right]$

By (7.4), since $a_{i,j} \le 0$ for $i \ne j$, $i,j \in \{1,2,\ldots,Q\}$, and by (7.3), $u_i^{(2)} \le u_i^{(0)}$.

Now assume $u_i^{(l)} \le u_i^{(l-2)}$ for $l = 2,3,\ldots,k-1,k$. Then for $l = k+1$,

  $u_i^{(k+1)} = \frac{1}{a_{i,i}} \left[ v_i - \sum_{j \ne i} a_{i,j} u_j^{(k)} \right]$

Since $a_{i,j} \le 0$, $i \ne j$,

  $\le \frac{1}{a_{i,i}} \left[ v_i - \sum_{j \ne i} a_{i,j} u_j^{(k-2)} \right] \le u_i^{(k-1)}$

The proof for case (2) is symmetric. □
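The iteration (7.1), with the monotone starting assignment of Theorem 7-2(1), can be exercised directly. The Python sketch below runs red/black sweeps over a small grid and asserts the monotone decrease that the theorem guarantees; the 5-point Laplacian stencil, the grid size, and $\omega = 1$ are illustrative assumptions.

    def red_black_ssor(v, s, omega=1.0, sweeps=50):
        """Red/black relaxation on an s x s 5-point Laplacian grid
        (a_ii = 4, neighbor coefficients -1), per (7.1).  Asserts the
        monotone decrease of Theorem 7-2(1) after each half-iteration pair."""
        def nbrs(i):
            r, c = divmod(i, s)
            return [r2 * s + c2 for r2, c2 in
                    ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= r2 < s and 0 <= c2 < s]
        # u_min: smallest constant start with v_i - u_min * rowsum_i <= 0
        rowsums = [4 - len(nbrs(i)) for i in range(s * s)]
        u_min = max(v[i] / rowsums[i] for i in range(s * s) if rowsums[i] > 0)
        u = [u_min] * (s * s)
        for _ in range(sweeps):
            prev = u[:]
            for color in (0, 1):                  # two half-iterations
                for i in range(s * s):
                    if (i // s + i % s) % 2 == color:
                        sigma = -sum(u[j] for j in nbrs(i))
                        u[i] = (1 - omega) * u[i] - omega * (sigma - v[i]) / 4.0
            assert all(u[i] <= prev[i] + 1e-12 for i in range(s * s))
        return u

    print(red_black_ssor(v=[-1.0] * 16, s=4)[:4])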
Thus uf") S “(k—2) S - ~ - S ufo) for k even Its") Sufk’z) s --- sup) fork odd _<. 1150) by (7.4) 88 Altemately assume that uf") v;-a,;,-umax j¢i Q 201', jumax -<- Vi-ai,tu max jaei Substituting Q Q k—l Eat/“i ) > Barium jset jest - . . - - (k—l) (k-I) . - Smce aw S 0 for 1,16 {l,2,...,Q }, u I < umax for at least one u] , a contradiction. 2) The case for the initial choice of um, is symmetric. 89 Theorem 74: Consider a sequence of ufk)’s satisfying the conditions of Theorem 7-2. From the perspective of a processor P; calculating u;, a candidate intermediate result uj") with a nonzero coefficient au, Lie {1,2, ...,Q} must satisfy the following properties Ifufo) =umin, ie {1,2, . - . ,Q}, then ulk+2’5ul")+-ul" ”hm-Etallulkm—al.ul" 1’) The case when 1150): um, ie {1,2, - . . ,Q} is symmetric. 90 Theorem 7-5: A candidate intermediate result ujk) with nonzero coefficient a“, from the perspective of processor calculating uf"), Lie {l,2,...,Q }, must satisfy the following properties. Ifufo) =umin, ie {1,2, - ° ° ,Q}, then 1 ask”) 2(1—(0)uj")-(0— Z az’jumafialfl-uEHD—v 01.1 j¢l,j¢i Similarly if ufo) = um, ie {1,2, ~ - ° ,Q}, then 115"”) S (1—(0)u§")—t0—1— k+l Z al,jumin+al,iuI )-V 01.1 j¢l.j¢i Proof: Immediate from the conditions of Theorem 7-2 and the proof of Theorem 7-4 CI. Theorems 7-4 and 7-5 guarantee a minimal and maximal amount of progress of each individual component of the solution as a function of the current state of the solu- tion. The stopping point is selected to be the iteration K at which lu§K+2)—ufK)l uSK)-—[ Zai,jqu+l)-VJ 0) i,i j¢i The case when u§0)=umax, ie {1,2, . ° ° ,Q} is symmetric [:1 Each of the theorems developed is a member of a particular subclass of constraint predi- cates. This membership is summarized in Table 7.1. Subclass Theorems Size (DP 7-2 IPT (I); 7-3 IPT (DC 7-4,7-5 IPT 7-6 GPT Table 7.1. Predicate Subclass Membership. 7.3 ERROR COVERAGE MODELING FOR MATRIX ITERATIVE ANALYSIS As in the development of the constraint predicate, the development of the modeling of each individual feature is treated separately. Feasibility Modeling of the error coverage of the feasibility constraints for the problem con- sidered in this chapter is straightforward. The feasibility events F ,-,i =1,2 are defined by 92 i F 5") : Event that uj") 2 u max for j and k under test and 5173") : Event that uf") S umin for the i and k under test Since these constraints are stationary Pr[jF,-] = Pr[jF§")] for all je{1,2,...,Q}, ke {0,1,2,...}, and i=1,2. The cumulative distribution function of the appropriate model yields the coverage proba- bilities for all steps of the problem. The corresponding probabilities are given by Pr[jF1] = Pr[u§“)2um,,] Pr[I'F2] = Pr[u§")Sumin] and Pr(iF] = Pr[um,,su§")sum,n] Progress Progress is modeled by considering the global convergence properties of the solu- tion and extrapolating this behavior back to each individual case. [Ames77] gives a bound on the rate of error reduction in a matrix iterative solution. The first step in this analysis is to determine the spectral radius pg of the iteration matrix G. It is possible to determine a close approximation to pG by examining the ratio of the norm of the error vectors between iterations. However, pg can also be calculated directly from G, while in general is not feasible due to the size of G, for analytical pur- poses is sufficient. For the red-black ordering used in this chapter, G cannot be expressed in a convenient matrix form. 
Progress

Progress is modeled by considering the global convergence properties of the solution and extrapolating this behavior back to each individual case. [Ames77] gives a bound on the rate of error reduction in a matrix iterative solution. The first step in this analysis is to determine the spectral radius $\rho_G$ of the iteration matrix $G$. It is possible to determine a close approximation to $\rho_G$ by examining the ratio of the norms of the error vectors between iterations. However, $\rho_G$ can also be calculated directly from $G$; while in general this is not feasible due to the size of $G$, for analytical purposes it is sufficient. For the red-black ordering used in this chapter, $G$ cannot be expressed in a convenient matrix form. Instead, we use the $G$ from the Gauss-Seidel consistent ordering with Successive Over-Relaxation applied. The matrix $A$ is decomposed into $A = B + D + C$, with $B$ lower triangular, $C$ upper triangular, and $D$ diagonal $Q \times Q$ matrices. Thus $G$ becomes

  $G = (D + \omega B)^{-1} \left[ (1-\omega)D - \omega C \right]$

If the complex eigenvalues of $G$ are given by $\lambda_i$ for $i \in \{1,\ldots,Q\}$, then

  $\rho_G = \max_{i \in \{1,\ldots,Q\}} |\lambda_i|$

Let the error at step $k$ of the computation be denoted by the vector $\varepsilon^{(k)} = |u^{(k)} - u|$ for $k \in \{0,1,2,\ldots\}$. For a stationary linear iteration matrix $G$ of the form considered in Theorem 7-1,

  $\|\varepsilon^{(k)}\| = \|G^k \varepsilon^{(0)}\|$

where $\|\varepsilon^{(k)}\|$ is the spectral norm of $\varepsilon^{(k)}$. If $\rho_G < 1$, for $k$ sufficiently large,

  $\|G^k\| = [\rho_G]^k$

where $\|G\|$ is the spectral norm of $G$. Combining these results yields [Ames77]

  $\|\varepsilon^{(k)}\| \le \|G\|^k \, \|\varepsilon^{(0)}\|$

Thus, for large $k$, the ratio $\|\varepsilon^{(k+1)}\| / \|\varepsilon^{(k)}\|$ averages to $\rho_G$. Therefore, on the average, the error decreases by a factor of $\rho_G$ at each step in the iteration. For modeling purposes, the final result is known. Assuming that each individual point behaves according to global convergence and using Theorem 7-2, treatment of an arbitrary point $i'$ yields the following relation. If $u_{i'}^{(0)} = u_{\min}$,

  $\rho_G^k \left( u_{\min} - u_{i'} \right) \ge u_{i'}^{(k)} - u_{i'}$,  $k \in \{0,1,2,\ldots\}$   (7.5)

where $u_{i'}$ is the correct final result of the calculation. A similar result holds if $u_{i'}^{(0)} = u_{\max}$. The events $P_i$ that define progress are:

  If $u_{i'}^{(0)} = u_{\min}$, $_{i'}P_1^{(k)}$ is the event $u_{i'}^{(k)} \le \rho_G^k(u_{\min} - u_{i'}) + u_{i'}$, $k \in \{0,1,2,\ldots\}$
  If $u_{i'}^{(0)} = u_{\max}$, $_{i'}P_2^{(k)}$ is the event $u_{i'}^{(k)} \ge u_{i'} - \rho_G^k(u_{i'} - u_{\max})$, $k \in \{0,1,2,\ldots\}$

Consistency

The consistency components of Theorems 7-4 and 7-5, like the progress component, can either be determined from a trace or by analytic means. For the analytic model, we apply Theorem 7-4 with the assumption that each $u_i$ proceeds toward the solution $u$ at the same convergence rate. Again, this is not the case in the actual solution, but it does effectively display the aggregate behavior of the solution. While Theorems 7-4 and 7-5 are directly implementable as a constraint predicate, to use them in the model requires additional simplification. The ratio $a_{l,i}/a_{l,l}$ is aggregated by replacing it with an average ratio $\bar{R}$, where $n$ is the average number of nonzero $a_{l,j}$'s per row. Using (7.5) as an equality and substituting into Theorem 7-4 yields the events $_{i'}C_1^{(k,k+2)}$ and $_{i'}C_2^{(k,k+2)}$. If $u_{i'}^{(0)} = u_{\min}$, $_{i'}C_1^{(k,k+2)}$ is the event that $u_{i'}^{(k+2)}$ satisfies the bound of Theorem 7-4 with each unknown term replaced by its convergence-model estimate $\rho_G^k(u_{\min} - u_{i'}) + u_{i'}$ and each coefficient ratio replaced by $\bar{R}$; the event $_{i'}C_2^{(k,k+2)}$ for $u_{i'}^{(0)} = u_{\max}$ is symmetric.
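The modeled progress event $_{i'}P_1^{(k)}$ is directly checkable once $\rho_G$ is estimated. The sketch below estimates $\rho_G$ from the error-norm ratios, as suggested above, and then evaluates the progress envelope; the iterate history and the slack tolerance are illustrative assumptions.

    def estimate_rho(errors):
        """Estimate rho_G as the average of successive error-norm ratios
        ||e^(k+1)|| / ||e^(k)|| over the tail of a run."""
        ratios = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
        tail = ratios[len(ratios) // 2:]        # ignore the transient
        return sum(tail) / len(tail)

    def progress_event(u_k, k, u_final, u_min, rho, slack=0.0):
        """Event P1^(k): u_i'^(k) <= rho^k (u_min - u_i') + u_i'."""
        return u_k <= rho ** k * (u_min - u_final) + u_final + slack

    # Illustrative run: errors decaying geometrically with rho = 0.8.
    errors = [1.0 * 0.8 ** k for k in range(12)]
    rho = estimate_rho(errors)
    print(round(rho, 3),
          progress_event(u_k=-0.45, k=6, u_final=-0.5, u_min=-0.2, rho=rho))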
The test in Step 2 precludes P; changing its value from an erroneous to correct state during the broadcast phase. If P; were allowed to change it’s value, it could send a value that agreed with Theorem 7 -6 thus making Q? = 0 erroneously. If P,» does not change it’s value, then Theorem 7-6 is used to diagnose 9;. Note that Theorem 7-6 does not explicitly specify which member of Q,» is faulty. Since the maximum number of faults per fault group is 1, no two processors can cooperate to fool the test of Theorem 7 -6. Furthermore, in the example under consideration, since the minimum cardinality fault group is 3, the non-faulty processors will always outvote the faulty processors. A simple majority vote is taken to determine the status of the fault group. C] Definition 7-4: Extended Fault Region K ,- = UIQj'QjflPii‘g} The extended fault region of a point i’ is simply the set of local fault groups of all neighboring points j such that u ,- uses ur in it’s calculation. Let K? = I(P,-IP,- is faulty and P,-eK,-}I and let k,- = I[jlo,-eK,-}I Theorem 7-7: For each processor Pi, P,- is faulty iff Q? =1 for all QjeK, and K? s ki-2. Proof: Necessity: Assume P,- is nonfaulty. If 52? = 0 then we are done. If not, then there exists some Pie Q,- and P is (21- such that P j is faulty. Since at most ki—2 processors may be 100 faulty and at most one is allowed per fault group (by Lemma 7-1), and since there are ki-2—1 faulty processors remaining to be distributed over the ki—2 remaining fault groups, some (2? = 0. Sufficiency: Assume that some (2,0 = 0. However, by Definition 7-4, each Die K,- con- tains P j. Thus P ,- is nonfaulty. Secondly assume that K? > ki-2. Then it is possible to have [Cg—2 faulty processors none of which are in Q,- and one faulty processor in 9,- which is not Pi. Thus P,- is nonfaulty. C] With the test provided by Theorem 7-7, we can obtain the error coverage shown in Figure 7.5. 7.4 RUN TIME PERFORMANCE Using the techniques of Chapter 6 for this problem will yield a similar performance curve to that of Figure 6.1 (p. 80). However, in matrix iterative solutions, many points will be assigned to a single processor. All of the constraint predicates developed in this chapter are applicable, but they only need to be run at those points which fall at the boun- dary of the data partitioning. Let the total number of points on the grid Q be greater than the number of processors N. Without loss of generality assume that N divides Q into a perfect square. Then Q/N = q is the number of points per processor. Continuing the assumption of coordinate grid indexed points, Figure 7.6 depicts the mapping and error checking for q = 25. The grids marked by dashed lines indicate points that are not checked by the constraint predicate as they are internal to that processor. The computational complexity calculation proceeds along the lines of Chapter 6. The unreliable algorithm, for a sub-grid of 4 elements per processor with an average of n -1 neighbors, is easily seen to have a computational complexity per iteration given by, CpNR= nq+2$L. Each constraint predicate contributes the following run time overhead, 101 % HIWIEI C 0.75 a o v e 0.50 — r a g 0.25 — e I 0 K Iteration —-) Figure 7.5. Error Coverage by Solution Step Count. Theorem 7-7 Applied. (DP 1 (DP "V67 (DC l+n\/Z The run time of the reliable algorithm under a maximum of t faults per fault group is given by CPR = 2+2nxl§ +nq+2(t+1)SL-i-(n \IE )’ +1 The ratio of the computation times of the reliable algorithm to the unreliable algorithm is: I ! 
“—" I 1 H L! __.-i--'e-._ ' ' __...-i--'u- _ ' ' _,-J--L_._ P1_1,,- _—.-.‘--i-_ ___g__;_]__ ___;__;_, P141.) : I H 1 : H 1 : j I l H [J H [T I l 1 i Mh—1--r-dh—_ I I I Phi-1 Figure 7.6. Mapping for q =25. CPR 1+2m/Z1'+nq+2(t+1)sL+(m/(7)'+1] CpNR _ "(1+25L J Let r represent the computation to communication ratio. Then r = nq/ZSL. Neglecting low order terms, for t=1 (the solution is locally tolerant of one fault per fault group 9,), C121? _ ZFSL-HISL-I-ZrSL CpNR — 2SL(r+1) 5r+2 r+1 (7.6) By setting r=0, (7.6) is easily seen to be a more general form of (6.9). Letting r=1 sets computation time equal to communication time resulting in, 103 CR: n+2 CNR. (7.7) Since n is usually very small (5 or 7), the reliable algorithm is at most 3.5 to 4.5 times more expensive than the unreliable algorithm and indeed much smaller than this since the value for q to achieve even an r=l is prohibitively large. Moreover, this analysis is overly pessimistic as it treats logical comparisons with the same weight as floating point multiplication/division. 7.5 COMMENTS AND LIMITATIONS In SSOR, the optimal relaxation factor coop, is chosen to minimize the spectral radius of the iteration matrix G. However, setting to = (00,”, does not preserve the mono- tonicity of the ug’s as was suggested in [HuAb84]. The expectation of monotonic behavior is unreasonable since SOR is specifically designed to "over-relax" some ui’s to increase the convergence rate. Values of to in the range 1 < (l) S (00,” can be found experimentally that do preserve monotonicity; however, no way is known to calculate these values even if the spectral radius is known a priori. This limits the feasibility of using constraint predicates to iteration matrices that have the spectral radius pg < 1. It is not known at this time whether a convergence envelope can be found to bound the ug’s. This is an area for further study. Error coverage can detect a significant number of errors with a fault-latency of 1. However, some errors can result in a pathological state of the system (values of “(1‘) for some k in which convergence to the final correct result takes longer than the time required to restart the solution from the initial guess). Enhanced error checking capabil- ity can come only from an increased understanding of the local convergence properties of the solution. Such work is being performed under the guise of "Local Relaxation Methods" [KuLM87, BoVe82]. Local convergence properties, as shown in this chapter, can be embodied as IPT predicates. 104 A final comment on the constraint predicate implementation concerns Theorem 7-7. One of the goals of application oriented testing is to migrate the system dependent issue to the diagnostic basis. Theorem 7-7 is an Ad Hoc approach, at best, to fault diagnosis. Optimally, control of a GPT test such as this should be migrated to the diagnostic basis. It is not understood at this point in time how to do this in a general way while still preserving the application oriented view of fault—tolerance. Clearly this is an area for further study. Chapter 8 Relaxation Labeling We now focus our attention on a non-numerical problem, Relaxation Labeling. To describe the problem of relaxation labeling, we quote from [HuZu83]: Relaxation labeling processes are a class of mechanisms that were originally developed to deal with ambiguity and noise in vision systems. The general framework, however, has for broader potential applications and implications. 
  The structure of relaxation labeling is motivated by two basic concerns: 1) the decomposition of a complex computation into a network of simple "myopic," or local, computations; and 2) the requisite use of context in resolving ambiguities...

Fundamentally, relaxation labeling assigns labels to objects in an image. Various compatibilities and incompatibilities exist between the objects. For example, consider an image which is to be labeled as a human body. If two objects are adjacent in the image and one is the object "head," then it is very likely that an adjacent object is "shoulder." Likewise, it is highly improbable that the adjacent object has the label "foot." This competition and cooperation can work together in relaxing the labelings on the objects such that eventually an unambiguous labeling is found.

Parallelization of relaxation labeling for implementation on a multiprocessor system is advantageous for large problems, that is, problems that would otherwise take a long time on a conventional SISD computer. A large problem is characterized by a large number of objects. Indeed, some applications of relaxation labeling can include objects the size of pixels in a 512x512 image, or $2^{18}$ objects [Stoc87]. Relaxation labeling lends itself easily to implementation as a data parallel algorithm [HiSt86]. The algorithm requires only local computations and data exchanges. Thus good speedup may be obtained by parallelization. As in the matrix iterative case, the properties of the intermediate results of the computation are checked for faulty/nonfaulty behavior.

8.1 PARALLELIZED WEIGHTED RELAXATION LABELING

The weighted relaxation labeling algorithm using the variational inequality method given by [HuZu83] yields a straightforward parallelization. The notation and algorithms are presented here for completeness. The basic idea is that, given an initial feasible solution, we attempt to maximize an objective function by taking small steps in the tangent direction which maximizes the directional derivative of the objective function. When the directional derivative becomes negative or zero, the objective function is at a local maximum and the procedure stops. $N$ is the number of objects and $m$ is the number of possible labels. Labels are denoted by $\lambda$ and $\lambda'$. An objective function is defined as:

  $A_i(\bar{p}) = \sum_{\lambda=1}^{m} p_i(\lambda) s_i(\lambda)$,  $1 \le i \le N$   (8.1)

where the support function $s_i(\lambda)$ is:

  $s_i(\lambda) = \sum_{j=1}^{N} \sum_{\lambda'=1}^{m} r_{i,j}(\lambda,\lambda') \, p_j(\lambda')$   (8.2)

The compatibility matrix $R_{N \times N} = [r_{i,j}(\lambda,\lambda')]$ gives the relative compatibility of label $\lambda$ at object $i$ with label $\lambda'$ at object $j$. The weight vector $\bar{p}_i$ with components $p_i(\lambda)$ is the relative weighting for label $\lambda$ at object $i$ and is constrained by:

  $\sum_{\lambda=1}^{m} p_i(\lambda) = 1$,  $0 \le p_i(\lambda) \le 1$,  $1 \le \lambda \le m$   (8.3)

The weight vector $\bar{p}_i$ is constrained to be a consistent solution for each object $i$, providing:

  $\sum_{\lambda=1}^{m} p_i(\lambda) s_i(\lambda) \ge \sum_{\lambda=1}^{m} v_i(\lambda) s_i(\lambda)$ for all labelings $\bar{v}$ satisfying (8.3)   (8.4)
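Equations (8.1)-(8.4) translate directly into code. A short Python sketch of the support computation and the consistency test follows; the toy compatibility tensor and weight vectors are illustrative assumptions.

    def support(i, r, p):
        """s_i(lambda) of (8.2): compatibility-weighted evidence from every
        object j for each label lambda at object i."""
        m = len(p[i])
        return [sum(r[i][j][lam][lam2] * p[j][lam2]
                    for j in range(len(p)) for lam2 in range(m))
                for lam in range(m)]

    def consistent(i, r, p, vertices):
        """(8.4): p_i must support itself at least as well as every competing
        labeling; checking the unit vectors e_lambda suffices, since the left
        side of (8.4) is linear in the test labeling."""
        s = support(i, r, p)
        own = sum(p[i][lam] * s[lam] for lam in range(len(s)))
        return all(own + 1e-12 >= sum(v[lam] * s[lam] for lam in range(len(s)))
                   for v in vertices)

    # Two objects, two labels; r encodes a preference for matching labels.
    r = [[[[1.0 if l == l2 else 0.0 for l2 in (0, 1)] for l in (0, 1)]
          for _ in (0, 1)] for _ in (0, 1)]
    p = [[1.0, 0.0], [1.0, 0.0]]
    vertices = [[1.0, 0.0], [0.0, 1.0]]
    print(support(0, r, p), consistent(0, r, p, vertices))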
In the following algorithm, the index i indicates the local object (processor), and the index j ranges over the neighbors. For example, s_j^k indicates the k'th iteration value of the support vector held by processor P_j. Each processor P_i, 1 ≤ i ≤ N, executes the following algorithm. It is assumed here that the mapping is ideal; that is, each processor has a direct connection to its neighbors. This assumption is not critical, and it simplifies the complexity analysis.

    Initialize:
    1)  start with a consistent initial labeling assignment p^0; k ← 0
    loop
    2a) send p_i^k to all n neighbors P_j
    2b) receive p_j^k from all n neighbors P_j
        /* nearest neighbor updating */
    2c) s_i^k(λ) = Σ_j Σ_λ' r_ij(λ,λ') p_j^k(λ')
        /* find a feasible gradient direction */
    3)  u^k = projection-operator(p^k, s^k)
    4)  if (u^k = 0) break;
    5)  p^{k+1} ← p^k + h·u^k    /* where h is some small positive value */
                                 /* that keeps p^{k+1} consistent        */
    6)  k ← k + 1
    pool

    Figure 8.1. Algorithm RLNR, relaxation labeling with no added reliability

8.1.1.1 Projection Operator Computation

Computation of the updating direction vector is presented in [MoHZ83]. It is formulated as a linear optimization problem subject to quadratic constraints. We repeat the formal problem specification here. Let IR^m be m-dimensional real space and let IK be the convex set defined by:

$$\mathrm{IK} = \left\{\vec{p} \in \mathrm{IR}^m \;\middle|\; \sum_{\lambda=1}^{m} p(\lambda) = 1,\; p(\lambda) \ge 0,\; 1 \le \lambda \le m\right\}$$

For any vector $\vec{p} \in \mathrm{IK}$, the tangent set $T_{\vec{p}}$ is:

$$T_{\vec{p}} = \left\{\vec{u} \in \mathrm{IR}^m \;\middle|\; \sum_{\lambda=1}^{m} u(\lambda) = 0,\; u(\lambda) \ge 0 \text{ whenever } p(\lambda) = 0\right\}$$

The set of feasible directions to move at point $\vec{p}$ is:

$$F_{\vec{p}} = T_{\vec{p}} \cap \left\{\vec{u} \in \mathrm{IR}^m \;\middle|\; \|\vec{u}\| \le 1\right\}$$

The subproblem to be solved by the projection operator algorithm is: given a current feasible weighting vector $\vec{p}$ and a current arbitrary direction $\vec{s} \in \mathrm{IR}^m$, find $\vec{u} \in F_{\vec{p}}$ such that $\vec{s}\cdot\vec{u} \ge \vec{s}\cdot\vec{v}$ for all $\vec{v} \in F_{\vec{p}}$.

    projection-operator(p, s)
    1)  D ← {λ | p(λ) = 0}
    2)  S_0 ← {}; k ← 0
    loop
    3)  t ← (Σ_{λ∉S_k} s(λ)) / (m − |S_k|)
    4)  S_{k+1} ← {λ ∈ D | s(λ) < t}
    5)  if (S_{k+1} = S_k) break;
    6)  k ← k + 1
    pool
    7)  u(λ) ← 0 if λ ∈ S_k; s(λ) − t otherwise
    8)  u ← 0 if u = 0; u/‖u‖ otherwise

    Figure 8.2. Algorithm projection-operator

Note here that the vector $\vec{u}$ is normalized only locally. Since it is stated in the original paper that normalization over the entire problem is not a requirement for convergence, the local normalization makes the parallel algorithm possible.
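For concreteness, here is a C rendering of projection-operator as reconstructed in Figure 8.2. This is a sketch under that reading, not the dissertation's code; the fixed-point loop over the index set S and the trailing local normalization follow the gradient projection method of [MoHZ83].

    #include <math.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* p, s: current weights and supports (length m); u: output direction.
     * An all-zero u signals a local maximum.
     */
    void projection_operator(size_t m, const double *p,
                             const double *s, double *u)
    {
        bool S[m];                       /* S_k membership, C99 VLA */
        for (size_t l = 0; l < m; l++) S[l] = false;

        double t;
        for (;;) {                       /* iterate until S_{k+1} = S_k */
            size_t live = 0;
            t = 0.0;
            for (size_t l = 0; l < m; l++)
                if (!S[l]) { t += s[l]; live++; }
            t /= (double)live;           /* live >= 1: labels with p > 0
                                            can never enter S            */

            bool changed = false;        /* S_{k+1} = {λ ∈ D | s(λ) < t} */
            for (size_t l = 0; l < m; l++) {
                bool in = (p[l] == 0.0) && (s[l] < t);
                if (in != S[l]) { S[l] = in; changed = true; }
            }
            if (!changed) break;         /* t only rises, so S grows
                                            monotonically and terminates */
        }

        double norm = 0.0;               /* u(λ) = s(λ) − t off S, else 0 */
        for (size_t l = 0; l < m; l++) {
            u[l] = S[l] ? 0.0 : s[l] - t;
            norm += u[l] * u[l];
        }
        norm = sqrt(norm);
        if (norm > 0.0)                  /* step 8: local normalization */
            for (size_t l = 0; l < m; l++)
                u[l] /= norm;
    }

A Byzantine neighbor can still lie about its weights, but the operator's output is checkable: any correct execution must satisfy the conditions proven in Lemma 8-2 below.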
8.1.2 PERFORMANCE EVALUATION

Speedup can be defined as the ratio of the sequential complexity to the parallel complexity. For each processor P_i, the complexity of the projection-operator routine is O(m^2). All other steps for one iteration are O(m), with the exception of the calculation of the support vector, which is O(nm). Thus the complexity of a parallel iteration is O(nm + m^2 + 2S_L). If a single processor works on all the calculations, then an iteration is O(N(nm + m^2)). Thus the speedup obtained by parallelization is the ratio:

$$O\!\left(\frac{N(m^2+nm)}{m^2+nm+2S_L}\right)$$

If N is large, the speedup is significant.

8.2 CONSTRAINT PREDICATE

The mathematical theory surrounding the application must be mature to facilitate development of an appropriate testing predicate. Such is the case with the relaxation labeling procedure given by [HuZu83], which is essentially a constrained local maximization problem with the direction of updating solved by the method of feasible directions [Zout76]. We prove a series of lemmas which correspond to the components of the constraint predicate.

Lemma 8-1: At each iteration, the following feasibility conditions Φ_F hold:

$$\sum_{\lambda=1}^{m} p_i(\lambda) = 1 \quad\text{and}\quad 0 \le p_i(\lambda) \le 1, \qquad 1 \le \lambda \le m$$

Proof: Since Step 5 of algorithm RLNR in Figure 8.1 chooses h such that $\vec{p}^{\,k+1}$ remains consistent, each $\vec{p}^{\,k+1}$ must satisfy the consistency conditions. Since the consistency constraints form a compact convex set over IR^m, all intermediate solutions must remain within a convex cone in m-space [Zout76]. This "funnels" the solution to the local attractor. □

Lemma 8-2: At each iteration, the following additional feasibility conditions Φ_F are maintained:

a) $\sum_{\lambda=1}^{m} u_i(\lambda) = 0$

b) $(\vec{s} - \vec{u})\cdot\vec{u} = 0$

Proof:
a) Since $\vec{u}$ lies in $T_{\vec{p}}$, $\sum_{\lambda=1}^{m} u_i(\lambda) = 0$ for all feasible $\vec{p}$.
b) Step 3 of projection-operator performs an orthogonal projection of $\vec{s}$ onto the feasibility space $F_{\vec{p}}$. Thus $(\vec{s}-\vec{u})\cdot\vec{u} = 0$. □

Lemma 8-3: Each iteration is subject to the following progress conditions Φ_P:

a) If $\vec{u} \in F_{\vec{p}}$ and $\|\vec{u}\| = 1$, then $\vec{u}$ maximizes the gradient direction change.

b) If $\vec{u}$ is a correct result of projection-operator and $\vec{u} \ne 0$, then $\vec{s}^{\,k+1}\cdot\vec{p}^{\,k+1} > \vec{s}^{\,k}\cdot\vec{p}^{\,k}$.

c) Convergence occurs in a finite time within a suitable neighborhood of the solution under conditions of strict consistency.

Proof:
a) Immediate from the problem definition of projection-operator.
b) If $\vec{s}^{\,k+1}\cdot\vec{p}^{\,k+1} < \vec{s}^{\,k}\cdot\vec{p}^{\,k}$, then for some $\vec{v} \in F_{\vec{p}}$, $\vec{s}\cdot\vec{v} > \vec{s}\cdot\vec{u}$; thus $\vec{u}$ was not a correct result from projection-operator. If $\vec{s}^{\,k+1}\cdot\vec{p}^{\,k+1} = \vec{s}^{\,k}\cdot\vec{p}^{\,k}$, then by Step 5 of RLNR, since h > 0, $\vec{u} = 0$.
c) Immediate from Theorem 9.1 of [HuZu83]. □
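Lemmas 8-1 through 8-3 translate directly into executable assertions. The following C sketch is illustrative only (the function names, the tolerance EPS, and the packaging are assumptions, not the dissertation's code): it checks the Φ_F conditions on a received weight vector and direction, and the Φ_P progress condition on successive objective values.

    #include <math.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define EPS 1e-9   /* assumed floating point tolerance */

    /* Phi_F, Lemma 8-1: the weights form a consistent labeling. */
    bool phi_f_weights(size_t m, const double *p)
    {
        double sum = 0.0;
        for (size_t l = 0; l < m; l++) {
            if (p[l] < -EPS || p[l] > 1.0 + EPS) return false;
            sum += p[l];
        }
        return fabs(sum - 1.0) < EPS;
    }

    /* Phi_F, Lemma 8-2: the direction u sums to zero and is an
     * orthogonal projection of s, i.e., (s - u) . u = 0. */
    bool phi_f_direction(size_t m, const double *s, const double *u)
    {
        double sum = 0.0, dot = 0.0;
        for (size_t l = 0; l < m; l++) {
            sum += u[l];
            dot += (s[l] - u[l]) * u[l];
        }
        return fabs(sum) < EPS && fabs(dot) < EPS;
    }

    /* Phi_P, Lemma 8-3b: the objective s.p must strictly increase
     * whenever the reported direction was nonzero. */
    bool phi_p_progress(double obj_prev, double obj_next, bool u_was_zero)
    {
        return u_was_zero || obj_next > obj_prev;
    }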
8.2.1 Predicate Generation and Proof

The correctness and completeness of the predicate is proven in the following theorem. The metric for correctness requires that if the algorithm RLNR, under no faults, produces a correct solution, then the reliable algorithm RLR, under a locally bounded number of faults t, also produces the same correct solution within some predetermined error tolerance ε_0. It will be shown that the constraints given in Lemmas 8-1 through 8-3 are sufficient for predicate correctness if one constraint is added. This constraint has nothing to do with the problem theory. It is possible for a Byzantine processor to arrange the vectors $\vec{u}$ and $\vec{s}$ so as to erroneously signal early stopping. Thus, in the case of early stopping, the last calculation must be replicated. If the replication also produces a stopping point, then that solution is indeed the correct solution. The method of proof is by considering all possible movements of the vector $\vec{p}$ during the iterations of the labeling algorithm.

Theorem 8-1: A predicate Φ that embodies the features Φ_P and Φ_F proven in Lemmas 8-1 through 8-3, plus the early stopping test, will constrain Byzantine behavior such that if the unreliable solution computed by RLNR is correct, the reliable solution with predicate Φ is also correct within the error tolerance ε_0.

Proof: Consider a point $\vec{p}^{\,k}$ that satisfies Lemma 8-1. A movement by $\vec{u}$ may be to any point $\vec{p}^{\,k+1}$ in m-space. Consider a movement by processor P_j (object j) that does not change $\vec{s}_j^{\,k}$, such that $\vec{p}_j^{\,k+1}$ is moved towards the local maximum, and $\vec{p}_j$ has not already been found to be a solution. There are two cases:

1) Lemma 8-3b is violated (Figure 8.3). If $\vec{u}_j^{\,k+1} \ne 0$, then processor P_j has made a "nonconvergence error." If $\vec{u}_j^{\,k+1} = 0$, then we must invoke the early stopping test. The members of the local fault group "re-run" the last iteration based on the values $\vec{p}_j^{\,k}$ and $\vec{s}_j^{\,k}$. If the replicated calculation is the same as the original $\vec{p}_j^{\,k+1}$, $\vec{u}_j^{\,k+1}$, then $\vec{p}_j^{\,k+1}$ is a solution and processor P_j stops.

2) $\vec{p}_j$ satisfies Lemma 8-3 but does not satisfy Lemma 8-1. Thus $\vec{p}_j$ has moved "outside" of the solution space (Figure 8.4) and violated the consistency conditions.

[Figure 8.3. Non-maximizing movements (infeasible region, feasibility cone, solution point).]

[Figure 8.4. Movement outside of feasibility cone (infeasible region, feasibility cone, solution point).]

If $\vec{p}^{\,k+1}$, $\vec{u}^{\,k+1}$ leads to a correct movement of $\vec{p}_j^{\,k}$ but does not make the maximal movement available, then Lemma 8-2 and Lemma 8-3a are violated. Finally, if Lemmas 8-1 and 8-2 are satisfied, the conditions of Lemmas 8-3a and 8-3b are satisfied, and condition 8-3c is violated, then the Byzantine resource is attempting to delay convergence through zigzagging (Figure 8.5). However, zigzagging may occur as a result of errors even if the algorithm employs an anti-zigzagging procedure. Even though the solution is making progress by virtue of condition (a), for all practical purposes it could be making progress as small as a 1-bit change in $\vec{p}_j$. Clearly this would cause the solution to run a very long time. However, we know that convergence occurs within a finite amount of time, so the following test is utilized: if

$$\vec{s}_j^{\,k+1}\cdot\vec{p}_j^{\,k+1} - \vec{s}_j^{\,k}\cdot\vec{p}_j^{\,k} < \varepsilon^{k+1}$$

then a convergence error is flagged.

[...]

The components Φ_P, Φ_F, and Φ_C, corresponding to the progress, feasibility, and consistency predicate subclasses, are discussed in the following paragraphs.

    /* S_R, executed by node, 0 ≤ node ≤ N−1. Local starting value in a. */
    Procedure S_R
        LBS[1] := a; lmask := 2^i;
        for i := 0 to n−1 do
            ...

Φ_P ensures that algorithm termination will occur and that, at each testable step of the solution, the state of the solution advances toward the goal or final solution of the problem. If this progress is not made, then faulty behavior may indefinitely postpone the solution - any solution - even an incorrect solution. For iterative convergent problems, the progress component involves reduction of error. For the bitonic sorting problem, the number of steps is known a priori to all participants in the algorithm. Thus any early termination is considered an error. The testable step for Φ_P is at the bottom of the outer loop i in algorithm S_NR. By Lemma 9-2 this must be a bitonic sequence of length 2^{i+1}. The progress test Φ_P is given in Figure 9.5a. The sequence BS_i is a bitonic sequence of length 2^{i+1} in SC^i_node.

    Procedure Φ_P(BS_i)
        for k := SC_S^i to SC_S^i + 2^i − 1
            if (ascending)
                if (BS[k+1] < BS[k]) then ERROR;
            else
                if (BS[k+1] > BS[k]) then ERROR;
        if (i ≠ n)
            for k := SC_S^i + 2^i to SC_E^i − 1
                if (ascending)
                    if (BS[k+1] > BS[k]) then ERROR;
                else
                    if (BS[k+1] < BS[k]) then ERROR;

    Figure 9.5a. Algorithm Φ_P, the progress test

[...]

    Procedure bit_compare(i, j)
        /* x>>y is a bitwise right shift of x by y bits */
        limit := min(SC_E^{i,j}, N)
        mask := vect_mask(i, j, source); omask := mask;
        mask := mask >> SC_S^{i+1,j}; lmask := lmask >> SC_S^{i,j};
        for k := SC_S^{i,j} to limit {
            if (mask&01 && !(lmask&01))
                LBS[k] := lbuf[k];
            else if (mask&01 && lmask&01)
                if (LBS[k] ≠ lbuf[k]) ERROR;
            mask := mask >> 1; lmask := lmask >> 1;
        }
        return(omask);

    Procedure vect_mask(i, j, node)
        d := 2^j;
        if (j = i)
            if (node mod (d << 1) < d)
                ...

[...]

Taken together, Φ_P, Φ_F, and Φ_C detect all errors from the Byzantine fault class committed by one faulty processor. Thus the reliable bitonic sort algorithm is fail-stop, using components which may fail in Byzantine ways. The result of the calculation is either completely correct, or the entire system halts with an error condition. We are guaranteed that under a single processor failure we will never receive an incorrect sorting result. Furthermore, unless an error occurs, the host is never involved in the calculation. This is ideal from a performance standpoint since, as mentioned earlier, the host can become a bottleneck.
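As an illustration of how Φ_P's monotonicity checks look in executable form, here is a hedged C sketch. The array bounds, the stage length 2^{i+1}, and the direction flag are assumptions drawn from the reconstruction of Figure 9.5a above, not the dissertation's code.

    #include <stdbool.h>
    #include <stddef.h>

    /* Progress check for one node's view of stage i: bs[0 .. len-1],
     * with len = 2^(i+1), must rise over its first half and fall over
     * its second half (mirrored when ascending is false).  The pair
     * straddling the peak is deliberately unconstrained.  Returns
     * false where Figure 9.5a would raise ERROR.
     */
    bool phi_p_bitonic(const int *bs, size_t len, bool ascending)
    {
        size_t half = len / 2;

        for (size_t k = 0; k + 1 < half; k++)       /* first half  */
            if (ascending ? bs[k + 1] < bs[k] : bs[k + 1] > bs[k])
                return false;
        for (size_t k = half; k + 1 < len; k++)     /* second half */
            if (ascending ? bs[k + 1] > bs[k] : bs[k + 1] < bs[k])
                return false;
        return true;
    }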
Clearly, there will be a performance penalty to pay for the increased reliability of algorithm S_R over the unreliable S_NR. One may question how much overhead is introduced, and whether it might be better to simply send all the data to the host, let the host sort the data, and return the final result to the node processors. Another possibility is to send all the data to the host, sort the data in the node processors, and send the results to the host for verification. As we show in the next section, the performance of both these possibilities becomes unreasonable for even moderately large problems.

9.5 TIME AND SPACE COMPLEXITY

As mentioned in Section 9.2, algorithm S_NR has a time complexity of O(log₂²N). Sequential sorting, on the other hand, has a lower bound of O(N log₂N). The communication complexity (since the data must be transferred from the nodes to the host) is O(N) for sequential sorting. The communication complexity for S_NR is O(log₂²N). Thus it is expected that the fault tolerant algorithm S_R will have time/communication complexity between these two.

Lemma 9-6: Algorithm vect_mask(i,j) has time complexity O(2^{i−j}).

Proof: Let the running time of vect_mask(i,j) be T_VM(i,j). If i = j, then four logical bit operations are performed and T_VM(i,i) = O(2^0) = O(1). If i > j, then two executions of vect_mask(i,j+1) are performed. Thus we have the linear recurrence

$$T_{VM}(i,j) = 2\,T_{VM}(i,j+1)$$

Solving this yields T_VM(i,j) = 2^{i−j} T_VM(i,i) = O(2^{i−j}). □

Φ_C(i,j) has complexity O(2^{j+1} + 2^{i−j}). Since each iteration contains time for two Φ_C executions (because the computations cannot be overlapped), we have

$$2\sum_{j=0}^{i}\left(2^{j+1} + 2^{i-j}\right) = 2^{i+3} + 2^{i+2} - 6 = O(2^{i+3})$$

By Lemma 9-7, bit_compare has complexity 2^i, and updating LBS takes 2^i; so for a single iteration of i we have a run time of O(2^{i+3} + 2^i + 2^i). Summing over all i from 0 to n−1 and adding in the final verification step, we have

$$\sum_{i=0}^{n-1}\left(2^{i+3} + 2^i + 2^i\right) + 2^{n+2} = 2^{n+3} + 2^{n+2} + 2^{n+1} - 10 = O(2^{n+3})$$

Since n = log₂N, the run time of S_R is O(2^{log₂N+3}) = O(N). The communication complexity of the main loop, as in S_NR, is O(log₂²N). The final verification stage adds log₂N communication, so the order of S_R remains unchanged from S_NR.

For comparison purposes, a sequential "sorting" algorithm was constructed for the host. "Sort" is quoted since we implement this "sort" as a single if statement executed N log₂N times, to achieve the theoretical minimum. O(N) communication is required to send/receive the sorted data. Additionally, a sequential verification was constructed. In this algorithm, the initial data is sent to the host, sorted by the node processors, and the sorted data also sent to the host. The host then implements Theorem 4-1 to verify the results. This takes O(N) communication complexity. It also takes O(N log₂N) computational complexity, since the matching of the ordered and unordered lists becomes equivalent to finding a permutation in the sense of Definition 4-1. Thus, for the following discussion, the best sequential algorithm is O(N log₂N).

The algorithms S_NR, S_R, and a sequential sort were implemented on 32 nodes of a 64-node Ncube DMMP [HMSC86] to sort 32-bit integers into ascending order (implementation constraints forced consideration of the smaller subcube). Timings were obtained for all three algorithms for problem sizes of 4, 8, 16, and 32 nodes. This is shown in Figure 9.7.

[Figure 9.7. Sorting time comparisons: theoretical and observed run times for S_R, S_NR, and the host sort at problem sizes 4 through 32.]
The execution (observed) results are inconclusive, since the cube we have available is very small. Asymptotically, we certainly expect S_R to be more efficient than the host sort, but the constant multiplier in the run time order dominates for these small problem sizes. Measurement of the running time for each component of the two algorithms yields the following table (measured in clock ticks).

    Algorithm     Communication Time     Computation Time
    S_R           8 log₂²N + 250N        0.72(16N)
    Sequential    14N                    0.45 N log₂N

This behavior is plotted as (Theoretical) in Figure 9.7. Comparison with the observed values indicates this approximation is close to the actual run time for smaller cube sizes. In the projected run times of Figure 9.8, S_R rapidly becomes more efficient for the size of cubes that we are concerned with in a real DMMP application. The portion of this plot covered by Figure 9.7 is highlighted in the lower left corner of Figure 9.8. These projected run times indicate that the penalty paid for fault tolerance in parallel sorting is less than the cost of sequential sorting in the host.

[Figure 9.8. Projected sorting time comparisons for large systems: projected run times for S_R and the host sort at problem sizes 32 through 1024.]

9.6 CHAPTER SUMMARY

This chapter has presented a fault-tolerant parallel sorting algorithm developed using the application oriented fault-tolerance paradigm. The algorithm is tolerant of one faulty processor node, or of faults in a single processor's incident communication links. The addition of reliability to the sorting algorithm results in a performance penalty. While the experiments on the small cube available were unable to demonstrate that fault tolerant sorting is more efficient than simply sorting in the host, asymptotically the developed fault tolerant algorithm is less costly than host sorting. Experiments on a larger cube are necessary to verify this behavior. Additionally, the performance of the three algorithms under a bitonic sort/merge is interesting to study. In this algorithm, each node holds not one element but a small, though non-negligible, number. The sequences are sorted locally and then merged in parallel to form bitonic sequences. This bitonic sort/merge procedure is an easy extension of S_NR. To extend S_R to this sort/merge case is more difficult. Simply growing the present algorithm will undoubtedly result in unacceptable run time and space requirements. To reduce the run time complexity, the computation/communication ratio will shift towards more communication of shorter messages.

This chapter has demonstrated that the application oriented fault-tolerance paradigm is applicable to problems of a non-iterative nature. Again, all that is necessary for successful algorithm development is a sufficient set of natural problem constraints; in this case, the bitonic nature of the intermediate results.

Chapter 10
Summary and Directions for Future Research

This chapter summarizes the major contributions of this thesis and outlines directions for future work.

10.1 SUMMARY OF MAJOR CONTRIBUTIONS

This work has been motivated by a need for both hardware and software reliability in large scale DMMP systems. From a hardware perspective, the thousands of components in the DMMP will result in low overall system reliability [McNi88a]. Thus, system-level techniques must be employed to achieve reliable system operation.
However, traditional system level techniques such as N-Modular Redundancy are too expensive to implement. The problem of providing reliable software is not as easy to quantify, but its need is easy to see. The detection of (and recovery from) design/coding faults is at the heart of software fault-tolerance. However, applying software fault-tolerance techniques to a DMMP requires a different approach [McNi88b]. Both of these problems may be solved by the unified method proposed by this research, the application oriented fault-tolerance paradigm.

The application oriented fault-tolerance paradigm is a completely novel way to provide system level reliability. It is necessarily an application based software approach to fault-tolerance. Thus, most of the "system-oriented" issues must be isolated from the applications programmer. These system oriented routines include reliable broadcast and detection support, and form the lower layers of the reliable parallel processing model. The application and its interface to reliability form the higher layers of this model. The level of coupling between these two layers is low. The applications programmer then must specify only an abstraction of fault-tolerance that is based on the application at hand.

A systematic way of generating executable assertions for the DMMP environment is essential to the application oriented fault-tolerance paradigm. The three predicate subclasses of progress, feasibility, and consistency provide a systematic technique to formulate the constraint predicate. The constraint predicate is embedded in the actual program for both hardware and software fault-detection. Thus, an application with an embedded constraint predicate constrains processor and software behavior to remain within predefined limits. This is an effective paradigm for many types of parallel computing applications, as demonstrated by its application to the "bread and butter" algorithms of parallel computation, namely, matrix iterative analysis [McNi87a, McNi88b], computer vision [McNi88a], and parallel sorting [McNi88c]. The only known restriction on applicability of the constraint predicate paradigm is that the applications problem must contain sufficient natural constraints.

Since the components of the constraint predicate are executed by a number of reliable processors, the diagnosis is guaranteed to locate the faulty component. The constraint predicate relies upon the distributed diagnostic basis to communicate its test results. There are clearly two different possible approaches to a distributed diagnostic basis. The tight coupling with the application oriented fault-tolerance paradigm favors the use of a masking basis such as Byzantine Agreement and rejects the use of syndrome testing. The algorithm Vector Byzantine Agreement [McNi87b] functions as an efficient diagnostic basis for the DMMP environment under Byzantine fault conditions. Syndrome testing can never be complete in this manner, only correct [McEN88].

There will be a run time penalty to be paid for the use of reliability. There is a trade-off, based on expected run time, between using reliability and running an unreliable algorithm. For very short problems, it is better to run with no reliability than to incur the overhead of a reliable parallel algorithm. For larger problems there is a crossover point at which it is more effective to run with the reliability penalty and be guaranteed a correct solution within a bounded run time.
Both asymptotically and practically, since the reliable solutions cost only a linear factor of the fault free run time, as opposed to an exponentially growing run time without reliability, the reliability methods developed in this research provide the necessary cost effective solution to the problem of reliability in parallel computation.

Not only is run time an important metric in judging the effectiveness of any reliability scheme, but so is the expected coverage of errors. As shown in the error coverage section of matrix iterative analysis, seventy-five percent of all errors can be detected immediately, and total coverage can be achieved at algorithm termination. Similar results exist for bitonic sorting; the reliable algorithm never allows an incorrect result to be produced.

In summary, this work has accomplished its objectives. It is clear that this area will continue to receive much study. The following sections outline the directions this work should take.

10.2 Configuration Control

This thesis has presented a new paradigm to provide both hardware and software reliability in a faulty large scale DMMP, and has concentrated mainly on the fault-detection aspect of the problem. However, to provide a reliable system, the problem of continuing execution in the presence of failed components must be addressed, as well as the problem of providing reliable Remote Procedure Calls (RPC) in the unreliable environment.

The RPC scheme proposed by [BiNe84] has a drawback in the unreliable environment. From the application level, RPCs are treated as simple function calls. If the called process fails, the procedure call simply blocks forever. To alleviate this problem, timeouts are employed. The necessary bounds for timeouts in the DMMP environment are (1) a bound on the maximum processor clock drift Ψ, and (2) a bound on the maximum communication delay Δ. The problem with specification of these values is that they are configuration dependent. The application programmer level should not be concerned with even the existence of these values. Moreover, these values are primarily used by the distributed diagnostic basis. Thus, in the reliable parallel processing model, the Configuration Control section of Figure 1.4 (p. 17) is responsible for computing these values. A sketch of such a timeout-guarded call appears at the end of this section.

Dynamic redundancy refers to the ability of a system to continue to function in the presence of failed components. Masking of errors is limited as a fault-tolerance technique. Consider again IBM's GF11 computer [Ager88], whose operational design parameters dictate single problem run times on the order of a year in length. If only fault-masking is used as a fault-tolerance technique, eventually enough elements will fail to invalidate fault masking. Furthermore, masking of errors has the added difficulty that the reporting and knowledge of fault occurrences is also masked [Hopk77]. Thus, instead of masking faults, we wish to remove the faulty component from the system and reconfigure around it. Additionally, as a maintenance activity, the faulty component should be repaired or replaced. In [YaHa84], this recovery consists of three phases: fault diagnosis, system reconfiguration, and operational recovery.
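To illustrate, the following C sketch shows one way a timeout-guarded RPC might be layered on the Ψ and Δ bounds. The wrapper, the bound T = 2Δ + Ψ, and all names here are illustrative assumptions, not part of the reliable parallel processing model's specification.

    #include <stdbool.h>

    /* Illustrative only: a call guarded by a timeout derived from the
     * configuration-dependent bounds.  delta bounds one-way message
     * delay; psi bounds clock drift between caller and callee.  A
     * round trip then takes at most 2*delta as read on a clock that
     * may drift by psi, so a reply later than the deadline implies a
     * failed callee.
     */
    typedef struct { double delta, psi; } config_bounds_t;

    /* Assumed primitives supplied by the message layer. */
    extern void   rpc_send(int callee, const void *req);
    extern bool   rpc_try_receive(int callee, void *reply);
    extern double local_clock(void);

    bool rpc_call(const config_bounds_t *cb, int callee,
                  const void *req, void *reply)
    {
        double deadline = local_clock() + 2.0 * cb->delta + cb->psi;

        rpc_send(callee, req);
        while (local_clock() < deadline)
            if (rpc_try_receive(callee, reply))
                return true;            /* reply arrived in time */
        return false;                   /* timeout: report to the
                                           diagnostic basis       */
    }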
10.2.1 Reconfiguration

Centralized reconfiguration control is the easiest to implement. However, this technique is unacceptable in a distributed system because it introduces a single point of failure. We instead turn to decentralized reconfiguration control.

There are two approaches to reconfiguration. The fault group may attempt to replenish its ranks with a spare. This is advantageous, as the spare can pick up the duties of the failed processor and the algorithm execution can continue with no performance degradation. The drawback to this approach is that standby spares must be made available, and that some hardware mechanism must exist, as in the MPP [Batc80], to map the spare component into the system topology. The alternative to this approach is to make use of the applications oriented environment. In the applications environment, since we have the concept of a local fault group, it is possible that reconfiguration can be done in a local, distributed manner. The reconfiguration involves not replacing the hardware component, but remapping portions of the application to fault-free components. This task, in the reliable parallel processing model, is handled by the configuration control layer.

The function of the configuration control layer is also to recompute the Δ and Ψ timeout values. Even in the latest DMMP systems, the second generation hypercubes, the hop delay is still non-negligible. Remapping portions of the application will necessitate propagation of the new timeout values to the remaining non-faulty processors. This is to be hidden as much as possible from the applications layer.

It is not hard to see that the results of this research can be applied to more general usage of parallel computers that have failed components. For example, current techniques for running with a faulty hypercube system involve not executing around the failed component, but instead finding a smaller, completely functional subcube. However, this problem is not easily solvable and indeed is polynomially complete [DuHa88]. The reconfiguration control proposed here does not attempt to find a completely functional topology, but rather works within the existing topology containing failed components.

10.2.2 Recovery

Coupled with the problem of reconfiguration is operational recovery: continued execution in the presence of a changed topology due to failed and logically removed hardware components. Recovery involves restarting the calculation after reconfiguration has occurred. Current techniques require a reliable backing store. However, in the applications environment, the information obtained by members of the fault group of a failed processor, as required by the constraint predicate, may also be sufficient information to restart the calculation from the point of failure.

10.3 AUTOMATED CONSTRAINT PREDICATE GENERATION

The application oriented reliability paradigm currently consists of a set of basis metrics for constraint predicate extraction. These are applied at the specification phase of the software life cycle by the programmer. Ideally, the salient reliability features composing the constraint predicate would be extracted from the applications problem in an automated fashion. While this research has made significant progress in generation techniques for a constraint predicate, the application oriented reliability paradigm is still in its infancy. Two primary reasons exist for this. The first is that, while human identification of the required constraint predicate features is tractable, it still takes an in-depth knowledge of the application to extract the features. The second problem is that specification of the constraint predicate and its insertion into the code is a manual task. Both of these problems, however, can be addressed by Computer Aided Software Engineering (CASE) tools.
The problem of application knowledge is not as severe as it might first seem. While expert knowledge is required of a particular application, this knowledge transfers easily between similar instances of a problem. Thus it is quite feasible to work with the idea of problem classes, i.e., the class of relaxation problems, the class of sorting problems, etc. A CASE tool can consist of an expert system with various experts that can be employed based on the class of problem under consideration. Insertion of the constraint predicate can occur as a result of the same expert system.

The major remaining problem is to generate a uniform representation of the requisite abstraction of reliability information necessary for constraint predicate generation. Given a uniform representation, the tool of Automated Reasoning can be applied to generate the constraint predicate. However, automated reasoning is still very "class-based," i.e., it works for well-defined applications. Additionally, if the representation of the reliability abstraction is similar, save for a syntactic transformation, to the constraint predicate specification, then nothing is gained through automation. Thus, in the best case, while automated reasoning can provide help in specification of the constraint predicate, it is suspected that identification of the abstraction points will remain a human activity.

10.4 APPLICATION APPLICABILITY

Parallel iterative relaxation methods form a large component of all parallel algorithms. The local nature and superior natural constraints of these algorithms yield good constraint predicates. For other algorithms, the transition is not as clear. Bitonic sorting, for instance, has no such "nice" local properties. Sequential assertions, such as the sorting assertion of Randall [Rand75], cannot be implemented in the DMMP environment, since no single processor has a complete view of the data space. However, through use of the constraint predicate paradigm, a reliable parallel algorithm was created [McNi88c].

It is not clear at this time exactly which problems lend themselves to application oriented reliability, other than the general guideline of those with suitable natural constraints. A better classification scheme is necessary. This will continue to be a research area in the field. This is clearly an interdisciplinary study, which ideally is conducted in concert with practitioners of each application field.

REFERENCES

[AdBC82] Adrion, W., Branstad, M. and Cherniavsky, J., "Validation, Verification, and Testing of Computer Software," Computing Surveys, Vol. 14, No. 2, June 1982, pp. 159-192.

[Ager88] Agerwala, Tilak, Invited Talk, Michigan State University, April 1988.

[Ames77] Ames, William, Numerical Methods for Partial Differential Equations, Academic Press, New York, 1977.

[AnMa67] Anderson, J. and Macri, F., "Multiple Redundancy Applications in a Computer," Proceedings of the 13th International Symposium on Fault-Tolerant Computing, Washington DC, January 1967, pp. 553-562.

[Aviz71] Avizienis, A., et al., "The STAR Computer: An Investigation Into the Theory and Practice of Fault-Tolerant Computing," IEEE Transactions on Computers, Vol. C-20, November 1971, pp. 1312-1321.

[AyOz87] Aykanat, C. and Ozguner, F., "A Concurrent Error Detecting Conjugate Gradient Algorithm on a Hypercube Multiprocessor," 17th Symposium on Fault Tolerant Computing, Pittsburgh, PA, July 1987, pp. 204-209.
[Batc68] Batcher, K., "Sorting Networks and Their Applications," Proc. of the 1968 Spring Joint Computer Conference, Vol. 32, AFIPS Press, Reston, VA, pp. 307-314.

[BiNe84] Birrell, A. and Nelson, B., "Implementing Remote Procedure Calls," ACM Transactions on Computer Systems, Vol. 2, No. 1, February 1984, pp. 63-75.

[BoVe82] Botta, E. and Veldman, A., "On Local Relaxation Methods and Their Application to Convection-Diffusion Equations," Journal of Computational Physics, Vol. 48, 1982, pp. 127-149.

[DaMa84] Dahbura, A.T. and Masson, G.M., "An O(n^2.5) fault identification algorithm for diagnosable systems," IEEE Trans. Comput., Vol. C-33, July 1984, pp. 486-492.

[DLPS86] Dolev, D., Lynch, N., Pinter, S., Stark, E. and Weihl, W., "Reaching Approximate Agreement in the Presence of Faults," Journal of the ACM, Vol. 33, No. 3, July 1986, pp. 499-516.

[Dole82] Dolev, D., "The Byzantine Generals Strike Again," Journal of Algorithms, Vol. 3, 1982, pp. 14-30.

[DoSt82] Dolev, D. and Strong, H., "Polynomial algorithms for multiple processor agreement," IBM Research, San Jose, 1982.

[DuHa88] Dutt, S. and Hayes, J., "On Allocating Subcubes in a Hypercube Multiprocessor," Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Pasadena, CA, January 1988, to appear.

[DwLS88] Dwork, C., Lynch, N. and Stockmeyer, L., "Consensus in the Presence of Partial Synchrony," Journal of the ACM, Vol. 35, No. 2, April 1988, pp. 288-323.

[FiLP85] Fischer, M., Lynch, N. and Paterson, M., "Impossibility of Distributed Consensus with One Faulty Process," Journal of the ACM, Vol. 32, No. 2, April 1985, pp. 374-382.

[Garc82] Garcia-Molina, H., "Elections in a Distributed Computing System," IEEE Transactions on Computers, Vol. C-31, No. 1, January 1982, pp. 48-59.

[Gend87] Gendreau, T., A Unified Environment for Distributed Computing, Ph.D. Thesis, Department of Computer Science, Michigan State University, January 1987.

[GrRe86] Grunwald, D.C. and Reed, D.A., "Benchmarking hypercube hardware and software," Technical Report UIUCDCS-R-86-1303, Department of Computer Science, University of Illinois at Urbana-Champaign, 1986.

[GuRa86] Gupta, R. and Ramakrishnan, I., "System Level Diagnosis in Malicious Environments," Technical Report, SUNY Stony Brook, Dept. of Computer Science, October 1986.

[HaAm74] Hakimi, S. and Amin, A., "Characterization of connection assignment of diagnosable systems," IEEE Trans. Comput., Vol. C-23, January 1974, pp. 86-88.

[Hill85] Hillis, D., The Connection Machine, MIT Press, Cambridge, MA, 1985.

[HiSt86] Hillis, D. and Steele, G., "Data Parallel Algorithms," CACM, Vol. 29, No. 12, December 1986, pp. 1170-1183.

[HMSC86] Hayes, J., Mudge, T., Stout, Q., Colley, S. and Palmer, J., "Architecture of a Hypercube Supercomputer," Proceedings of the 1986 International Conference on Parallel Processing, August 1986, pp. 653-660.

[Hopk77] Hopkins, A.L. Jr., "Design Foundation for Survivable Integrated On-Board Computation and Control," Proceedings of the Joint Automatic Control Conference, 1977, pp. 232-237.

[HoSa84] Horowitz, E. and Sahni, S., Fundamentals of Computer Algorithms, Computer Science Press, Rockville, MD, 1984.

[HoSL78] Hopkins, A., Smith, B. and Lala, J., "FTMP - A Highly Reliable Fault-Tolerant Multiprocessor for Aircraft," Proceedings of the IEEE, Vol. 66, No. 10, October 1978, pp. 1221-1239.
[HuAb84] Huang, K. and Abraham, J., "Fault-Tolerant Algorithms and Their Application to Solving Laplace Equations," Proceedings of the 1984 International Conference on Parallel Processing, August 1984, pp. 117-122.

[HuAb87] Hua, K. and Abraham, J., "Design and Evaluation of Executable Assertions for Concurrent Error Detection," Proceedings of the 11th International COMPSAC, Tokyo, Japan, October 1987, pp. 324-330.

[Huan83] Huang, K., Fault-Tolerant Algorithms for Multiple Processor Systems, Ph.D. Thesis, Coordinated Science Laboratory, University of Illinois, November 1983.

[HuZu83] Hummel, R. and Zucker, S., "On the Foundations of Relaxation Labeling Processes," PAMI, Vol. PAMI-5, No. 3, May 1983, pp. 267-287.

[Hwan87] Hwang, K., "Advanced parallel processing with supercomputer architectures," Proc. of the IEEE, October 1987, pp. 1348-1378.

[KrTo81] Kraft, G. and Toy, W., Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.

[Kueh69] Kuehn, R., "Computer Redundancy: Design, Performance, and Future," IEEE Transactions on Reliability, Vol. R-18, No. 1, February 1969, pp. 3-11.

[KuLM87] Kuo, C., Levy, B. and Musicus, B., "A Local Relaxation Method for Solving Elliptic PDEs on Mesh-Connected Arrays," SIAM Journal of Scientific and Statistical Computing, Vol. 8, No. 4, July 1987, pp. 550-573.

[KuRe81] Kuhl, J. and Reddy, S., "Fault-Diagnosis in Fully Distributed Systems," Proc. 11th Fault Tolerant Computing Symposium, 1981, pp. 100-105.

[KuRe86] Kuhl, J. and Reddy, S., "Fault Tolerance Considerations in Large, Multiple Processor Systems," IEEE Computer, March 1986, pp. 56-67.

[Lala86] Lala, J., "A Byzantine Resilient Fault Tolerant Computer for Nuclear Power Plant Applications," 16th Symposium on Fault Tolerant Computing Systems, Vienna, Austria, July 1986, pp. 338-343.

[LaSP82] Lamport, L., Shostak, R. and Pease, M., "The Byzantine Generals Problem," ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.

[MaMa78] Mallela, S. and Masson, G., "Diagnosable Systems for Intermittent Faults," IEEE Trans. Comp., Vol. C-27, No. 6, June 1978, pp. 560-566.

[McCl85] McCluskey, E., "Built-in Self-test Techniques," IEEE Design and Test, April 1985, pp. 21-28.

[McNi87a] McMillin, B. and Ni, L., "Reliable Parallel Elliptic PDE Solution," presented at the 3rd SIAM Conf. on Parallel Processing for Scientific Computing, Los Angeles, CA, December 1987.

[McNi87b] McMillin, B. and Ni, L., "Byzantine Fault-Tolerance through Application Oriented Specification," Proceedings of the 11th International COMPSAC, Tokyo, Japan, October 1987, pp. 347-353.

[McNi88a] McMillin, B. and Ni, L., "A Reliable Parallel Algorithm for Relaxation Labeling," Proceedings of the 1988 International Conference on Parallel Processing for Computer Vision and Display, Leeds, U.K., January 1988, to appear.

[McNi88b] McMillin, B. and Ni, L., "Executable Assertion Development for the Distributed Parallel Environment," Proceedings of the 12th International COMPSAC, Chicago, IL, October 1988, to appear.

[McNi88c] McMillin, B. and Ni, L., "Reliable Parallel Sorting Through the Application Oriented Fault-Tolerance Paradigm," submitted for publication.

[McEN88] McMillin, B., Esfahanian, A. and Ni, L., "Limitations of System Level Diagnosis by Fault Model Considerations," Proceedings of the 3rd International Conference on Supercomputing, Boston, MA, May 1988, to appear.

[MoHZ83] Mohammed, J., Hummel, R. and Zucker, S., "A Gradient Projection Algorithm for Relaxation Methods," PAMI, Vol. PAMI-5, No. 3, May 1983, pp. 330-332.
[Perr85] Perry, K., "Randomized Byzantine Agreement," IEEE Transactions on Software Engineering, Vol. SE-11, No. 6, June 1985, pp. 539-546.

[PeSL80] Pease, M., Shostak, R. and Lamport, L., "Reaching Agreement in the Presence of Faults," Journal of the ACM, Vol. 27, No. 2, April 1980, pp. 228-234.

[PrMC67] Preparata, F., Metze, G. and Chien, R., "On the connection assignment problem of diagnosable systems," IEEE Trans. Electron. Comput., Vol. EC-16, December 1967, pp. 848-854.

[Quin87] Quinn, M., Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, New York, 1987.

[Rabi83] Rabin, M., "Randomized Byzantine Generals," Proceedings of the 24th Symposium on the Foundations of Computer Science, Tucson, AZ, November 1983, pp. 403-409.

[Rand75] Randall, B., "System Structure for Software Fault Tolerance," IEEE Transactions on Software Engineering, Vol. SE-1, No. 2, June 1975, pp. 220-232.

[Renn84] Rennels, D., "Fault-Tolerant Computing - Concepts and Examples," IEEE Transactions on Computers, Vol. C-33, No. 12, December 1984, pp. 1116-1129.

[Sham79] Shamir, A., "How to share a secret," CACM, Vol. 22, 1979, pp. 612-613.

[ShFi88] Shih, Y. and Fier, J., "Hypercube Systems and Key Applications," Parallel Processing for Supercomputing and Artificial Intelligence, K. Hwang and D. DeGroot (Eds.), McGraw-Hill, New York, 1988.

[ShRa87] Shin, K. and Ramanathan, P., "Diagnosis of Processors with Byzantine Faults in a Distributed Computing System," 17th Symposium on Fault Tolerant Computing, Pittsburgh, PA, July 1987, pp. 55-60.

[Skla76] Sklaroff, J., "Redundancy Management Techniques for Space Shuttle Computers," IBM Journal of Research and Development, Vol. 20, January 1976, pp. 20-28.

[Siew82] Siewiorek, D., The Theory and Practice of Reliable System Design, Digital Press, Bedford, MA, 1982.

[Smit65] Smith, G., Numerical Solution of Partial Differential Equations, Oxford University Press, New York, p. 150.

[Snyd82] Snyder, L., "Parallel Programming and the Poker Programming Environment," IEEE Computer, Vol. 17, No. 7, July 1984, pp. 27-36.

[Somm82] Sommerville, I., Software Engineering, Addison-Wesley, London, 1982.

[Stoc87] Stockman, G.C., Private Communication, 1987.

[Stro86] Strong, R., "Problems in Maintaining Agreement," Fifth Symposium on Reliability in Distributed Software and Database Systems, IEEE Computer Society, Los Angeles, CA, January 1986, pp. 20-27.

[Stuc77] Stucki, L., "New directions in automated tools for improving software quality," Current Trends in Programming Methodology, Vol. 2, R.T. Yeh (Ed.), Prentice-Hall, Englewood Cliffs, NJ, 1977, pp. 80-111.

[ToyW78] Toy, Wing, "Fault-Tolerant Design of Local ESS Processors," Proceedings of the IEEE, Vol. 66, October 1978, pp. 1126-1145.

[Varg62] Varga, R., Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1962.

[Wens78] Wensley, J., et al., "SIFT: The design and analysis of a fault-tolerant computer for aircraft control," Proceedings of the IEEE, Vol. 66, October 1978, pp. 1240-1255.

[YaHa84] Yanney, R. and Hayes, J., "Distributed Recovery in Fault Tolerant Multiprocessor Networks," 4th International Conference on Distributed Computing Systems, IEEE, 1984, pp. 514-525.

[YaMa86a] Yang, C. and Masson, G., "A Fault Identification Algorithm for t_i-Diagnosable Systems," IEEE Trans. Comp., Vol. C-35, No. 6, June 1986, pp. 503-510.

[YaMa86b] Yang, C. and Masson, G., "A Strategy for Fault Diagnosis in Asynchronous Distributed Systems with Soft Failures," Proc. Intl. Conf. on Computer Design: VLSI in Computers, Port Chester, NY, 1986, pp. 366-369.
[Zout76] Zoutendijk, G., Mathematical Programming Methods, North-Holland, Amsterdam, 1976.