3' '3 "'33 33 ' ' I‘3II3'3IIII'II'I I3I 33333 E '. .. | : II
"I' "3333' I" II.3.,I "'3. 3'3" I 3
.3'33I 3333. 333333 3' 3333333 3"333' 'II 3333'3'3
IIII3333II33 I 3 II3 33333 3III

I 3333'" 33 IIII'333 III II"'I3
3'
I

.IIIIIIIII
I3" .13333333'333‘ 3.3333331I'I.' 3'33;
. 333" I

II I33333" I3 '3 '333'33'I.

33'I I3
IIII

    

         
   
 

33 II'3""'

“.7

'33 '3

__.

33'3IIII ' "

333 '

33 3 I'II "'3"'
III3

II. .III
III 3:33 '
333'
3

3 3' 33

 

       
       
      
 
   
    

   

 

 

 
 
 

     
  
 

 

 

 

 

  

 

    

 

 

 

  
  

   
 

     

 

       

 
       
  
 

      

 

  
  

 

I33 3 3 .3"
III I 3.3 3.33
3333 33 III III . I "'3"'I
' 3"" 3. 3333.333 333 I"I'I'3333II'.."333.III 333333....
3I3 3! I3 ‘33 3' I33 33 I I33!” .33 ”.3,
3333333333'I3333III333-I3333‘.333333 3333 33333.3 33 333. """"'3""3 I3333 73' I... 1 3.3 .
":"II'I'33I'. 3.3133333333.33 '43 .333 3'3 3' I' III I333 333‘ 3333 ' .3. 313.3 .I 33' I'LII‘I'III
II "'3"""33""33"""'3'"3"" 33333.3333333 .II II III II .. I'I'I
I ." 33333333 I 33:“; 333" 3'333 3"33'3'3' "3333 3'33. I 333III 3333.333 .33.;
"33!I 333333333333; "'3'333 33I'3 I333 333333333 "33333333 "33 33333333333333.3333: 3 II 33.3 II. 3. 3333333
7.. ”III '33. 33.3IIII3 3' .III 333333 33 III 333 333.3333 "'33g3'3'3"""""""""""3"3"'§" .I.III..I333.33.133.33
33II 33'3'3 33IIII. ‘IIIIII3III 'I'III'I" 3'I I33 31“ I‘I3I3I33'..I'.'.."" 3“" II“ '
u. I3'3 III 33.3,. .I'I33I 3333.33"33 I I3'I3I3.'33' 3.3 33.. .3333 3333. I3
I33 3333III3I3III33 I3 33 3333! 33.. 3.333 3333.33..I 333.3 ...3.I 33333. 3. LWIII33.3.I_.III_
.3 I I. I ‘.. I3 III. III I 3 IIIIIIIIII III I I ..,.
.3... II IIIIIII I33", I... 3.3"I ""333 3 III III IIIII3 I.
‘3IIIIII I33 133 3. III III I. "'3'“.
'3'3433"3"'3"'3I'333 3 3 3 'I 333:3 33I3 3.333333%. I'333I'I 333333 3'"L3"‘3""3'3JT33"'3"3'3333'3"3'3" I-I'rt
'33::33. 3I' '.I 13 3333”,. 3333I3I.3 .3333 I. I-33333II33 I‘ .333“ '333'3333'3313'33'33333'33"33333331’ "3".3
'II33.I.I..III'I IIIIII III .II 3.3. I'33333"33I 3333..'I,.‘I33'I33II'333".3 . III.
33333 ...I I... I... I... II 3'3'3'3 I3 I3333.33.33 "I.33333.333.33.1333333 . ’
"""I " "II III-I. "I'"'333' '- ' "' ' ' '3.“ ”"I "I.'II"'3I "3 ".33 .I. ".
3""3'333'3333'3'1333' 33"3'3'33'3"'3333'33""'33"3 33. '"3' '3'0"I"'3"'33" "" "333. "3339 '3 ' "333333333333333 333333 III 3;.
.. 3I' " 3I'I" I... . 33 _:3',I'I‘ r :I" 1'
I3" II 'II 'II III"" 313."."33 33' 333 "3333333" 'II'I'I'IIII 3353‘"
IIIIII I...IIIII II. _ III II

 

 

 

 

LIBRARY
Michigan State
University

 

 

 

This is to certify that the

dissertation entitled
A Concurvevﬁ’ (Ewe? Miﬁciton An$ Qarfet-an
Aiaociﬂxm For FauPn—Voknkvﬁ \ILSI

Ar\\*\\vv\z\*\.c_ AWOA‘ rFYGCES’SO‘V'S.
presentedby

QAMQKDM SHOKODH: ~\<ﬂ&1vm\$

has been accepted towards fulﬁllment
of the requirements for

Ph-B- degreein SietiﬁM\Y\/13M(O('Mj

Megs/3Z4 “Magi
22/419 /%WQ7W\

Major professor

Date_////3,/g~5

MS U i: an Afﬁrmative Action/Equal Opportunity Institution 0-12771

 

 

MSU

LIBRARIES
-:—.

RETURNING MATERIALS:

 

 

 

Piace in book drop to
remove this checkout from
your record. FINES wil]
be charged if book is
returned after the date
stamped beiow.

 

 

 

A CONCURRENT ERROR DETECTION AND CORRECTION ALGORITHM FOR

FAULT-TOLERANT VLSI ARITRHETIC ARRAY PROCESSORS

BY

KAMRAN SHOKOOHI—KAYVAN

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering
and System Science

1985

ABSIRACT

A (DNCURREIT ERROR DETECTION AND museum ALGtRITRN FOR

FAULT-mL-ANT VLSI ARITRIBTIC ARRAY PROCESSCRS

By

Kamran Shokoohi-Kayvan

A concurrent error detection and correction algorithm for errors
caused by permanent. intermittent and transient faults in arithmetic
parallel-pipeline array processors. is described. The fault model.
applicable to VLSI implementations. assumes the occurrence of faults
with unknown origin and frequency. Recovery from faults is achieved
through minimal processing element (PB) redundancy in the array
architecture. facilitated by spatial bypassing of the correct operands
previous to the occurrence of a fault to fault-free PE's and
recomputing during the following clock cycles. The overhead hardware
and timing are determined. It is shown that this concurrent error
detection and correction technique uses less additional hardware than
3880 (REcomputing with Shifted Operands). offers marginal timing
improvement over the R880 technique. and adds the concurrent error

correction capability which is not present in RESO. Furthermore. the

additional hardware and control for error detection and recovery is
local and modular. hence making this technique very attractive for
VLSI implementations. This algorithm is based on a general fault
model which assumes the occurrence of permanent. intermittent and
transient fault of unknown origin and frequency. Transient and
intermittent faults with a duration longer than one computational
clock cycle are classified as "permanent". The effects of these
faults are detected as undesired changes in the logical values at the
outputs of the PE's or other circuit elements. such as latches. The
error detecting algorithm is concurrent and thus may detect and

correct errors caused by transient faults affecting the PE's randomly.

TONY

PARENTS

AND

GRANDPARENTS

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my advisers. Dr.
Erik D. Goodman and Dr. lichael A. Shanblatt. for their continous
moral and professional support and guidance throughout my doctoral
pregram. I also would like to thank Dr. D. Reinhard and Dr. B.

Drachman for their professional advise and review of this work.

I shall always be grateful to my parents for their continous
love. support and guidance through the years. My appreciation is
extended to my brother Dr. Farhad k. Shokoohi. his family and my
sister Dr. frough k. Shokoohi for their unending encouragement.

understanding and guidance throughout my academic programs.

TABLE OF CONTENTS

Chapter Page
1. INTRODUCTION 1
2. BACKGROUND 5
2.1 Algorithm Philosophy 6
2.2 Systolic Arrays 7
2.3 Fault-Tolerance 12
2.3.1 Fault Classes 13
2.3.2 The Concept of Fault-Tolerance 15
2.3.3 Fault-Tolerance In VLSI 18
2.3.4 Fault-Tolerant Systolic Arrays 20

AN ERROR DETECTION AND CORRECTION ALGORITHN FOR EAULT-TOLERANT

LINEAR ONE-DIMENSIONAL SYSTOLIC ARRAYS 23
3.1 The Error Detection Algorithm, 23
3.2 Error Recovery 41
3.2.1 Control Algorithm For The Redundant PE 43
3.3 Overhead Hardware and Timing Requirements ' 46
3.3.1 Fault Coverage 50
3.4 Linear Systolic Array for DPT 51
3.4.1 Linear Fault-Tolerant Systolic Array for DFT 55
3.4.2 Error Recovery 57

AN ERROR DETECTION AND CORRECTION ALGORITHM FOR FAULT-TOLERANT

TWO-DIMENSIONAL ARRAY PROCESSORS 64
4.1 The Modification of The Error Detection Algorithm 66
4.2 Error Recovery 71
4.3 Overhead Hardware and Timing of The Two-Dimensional

Design 80
4.4 A Fault-Tolerant Pipeliued Array for Matrix

Multiplication 82

4.4.1 The Fault-Tolerant VLSI Array for Matrix

Multiplication 87

iv

Chapter

5.

6.

FUNCTIONAL TRANSITIONS IN ADJACENT PE’S

5.1 Functional Transitions In VLSI Structures
5.2 Transitional PE’s

5.3 Error Recovery

5.4 Overhead Hardware and Timing Requirements
5.5 Two-Dimensional Functional Transitions

SUMMARY AND CONCLUSIONS

BIBLIOGRAPHY

Page

95

96
98
104
111
113

118

122

Figure

2.2.1

2.2.2

3.1.1

3.1.2

3.1.3

3.1.4

3.3.1

3.3.2

3.4.1

3.4.2

3.4.3

4.1

4.1.1

4.1.2

4.3.1

4. 4. 1

LIST OF FIGURES

Different interconnections for systolic arrays.

The structure of a computing array for a system of arbitrary

dimensions and 8-3.
Basic structure of a linear pipeline processor.

The structure of a fault-tolerant one-dimensional linear
pipeline processor.

Basic PE for the fault-tolerant pipeline processor of
Figure 3.1.2.

Basic redundant PE for the fault-tolerant processor of
Figure 3.1.2.

The overhead hardware requirement OHR(k,m) for the
one-dimensional fault-tolerant array processor.

The efficiency and throughput of the fault-tolerant
pipeline of Figure 3.1.2.

Basic PE for linear systolic array used in DFT computation.

The structure of the linear systolic array for DFT
computation.

The structure of the fault-tolerant systolic array for DFT
computation, n-4.

The structure of an array processor connected in mesh
without the boundary wraparound connections.

The structure of the first row of the array of Figure
4.1 with added switching and comparator elements.

Basic PE structure for the fault-tolerant array of
Figure 4.1.1.

The overhead hardware requirement OER(k,m) for the
two-dimensional fault-tolerant array processor.

A‘VLSI array for pipelined multiplication of two 2x2 dense
matrices.

vi

Page

10

24

27

28

29

48

49

53

54

56

65

67

68

81

83

Figure Page

4.4.2 A VLSI array for pipelined multiplication of two 3x3 dense
matrices. 85

4.4.3 The fault-tolerant VLSI array for pipelined multiplication
of two 2x2 dense matrices. 88

4.4.4 The VLSI fault-tolerant array for pipelined multiplication
of two 3x3 dense matrices. 94

5.2.1 The structure of the fault-tolerant one-dimensional array
processor with one functional transitions along the
pipeline. 100
5.2.2 Basic structure of the transitional PE3,m/d. 101
5.3.1 A proposed architecture for the one-dimensional fault-tolerant

pipeline processor of arbitrarily chosen k+L+1 stages and two
functional transitions. 112

vii

Table

2.2.1

3.1.1

3.1.2

3.1.3

3.1.4

3.1.5

LIST OF TABLES

The port-coefficient time table of the array of
Figure 2.2.2 for 8-3.

The space-time table of the normal linear oue-dimensioanl
pipeline processor of Figure 3.1.1.

The space-time table of the fault-tolerant pipeline.
Error-free case.

The space-time table of the fault-tolerant pipeline
with a transien fault in PE.2 during the second
clock cycle.

The space-time table of the fault-tolerant pipeline
with a transientfault in either PE.3 or PE.4 during
the sixth clock cycle.

The space-time table of the fault-tolerant pipeline
with a multiple transient fault in either PE.2. PE.3

or PE.4 during the sixth clock cycle.

3.1.6

3.1.7

3.1.8

3.1.9

3.4.1

3.4.2

The space-time table of the fault-tolerant pipeline
with a permanent fault in PE.2.

The space-time table of the fault-tolerant pipeline
with a permanent fault in PE.4.

The space-time table for the parity check and output
validity with a transient fault in PE.2 during the
second clock cycle.

The space-time table for the parity check and output
validity with a permanent fault in PE.4.

The space-time table of the normal systolic array
for DFT computation.

The space-time table of the fault-tolerant systolic
array of Figure 3.4.3 for DFT computation, error-free
case.

viii

Page

11

25

33

34

35

36

37

38

39

4O

58

59

Table

3.4.3

3.4.4

3.4.5

3.4.6

4.1.1

4.2.1

4.2.2

4.2.3

4.2.4

4.2.5

4.2.6

4.4.1

The space-time table of

the fault-tolerant systolic

array of Figure 3.4.3 for the DFT computation, error-free

C886.

The space-time table of
of Figure 3.4.3 for the
fault in either PE.1 or
cycle.

The space-time table of
of Figure 3.4.3 for DFT
fault in either PE.2 or
clock cycle.

The space-time table of
of Figure 3.4.3 for DFT
fault in PE.2.

The space-time table of
Figure 4.1.1.

The space-time table of

the fault-tolerant systolic array
DFT computation with a transient
PE.2 during the second clock

the fault-tolerant systolic array
computation with a transient
PE.3 during the seventh

the fault-tolerant systolic array
computation with a permanent

the fault-tolerant array of

the fault-tolerant array of

Figure 4.1.1 with a transient fault in PE1.2 during

the second clock cycle.

The space-time table of

the fault-tolerant array of

Figure 4.1.1 with a transient fault in PE1.4 during

the fifth clock cycle.

The space-time table of

the fault-tolerant array of

Figure 4.1.1 with a multiple transient fault in PE1.1,

PE1.3, PE1.4 during the

The space-time table of

seventh clock cycle.

the fault-tolerant array of

Figure 4.1.1 with a permanent fault in PE1.2.

The space-time table of

the array of Figure 4.1.1,

illustrating the register transfer operation of

constituent PE’s in the

The space-time table of

error-free case.

the array of Figure 4.1.1,

illustrating the register transfer operations of
constituent PE’s with a transient fault in PE1.2.

The space-time table for matrix multiplication using

the array of Figure 4.4.

1.

ix

Page

60

61

62

63

69

72

73

74

75

76

77

84

Table

4.4.2

A. 4.3

4.4.4

4.4.5

4.4.6

4.4.7

5.2.1

5.2.2

5.3.1

5.3.2

5.3.3

5.3.4

5.3.5

5.5.1

The space-time table for matrix multiplication using
the array of Figure 4.4.2.

The space-time table of the fault-tolerant VLSI array
of Figure 4.4.3.

The space-time table of the fault-tolerant VLSI
array of Figure 4.4.3 with a transient fault in
MAC.5 during the third clock cycle.

The space-time table of the fault-tolerant VLSI array
of Figure 4.4.3 with a transient fault in MAC.6 during
the fourth clock cycle.

The space-time table of the fault-tolerant VLSI
array of Figure 4.4.3 with a permanent fault in
MAc.5.

The space-time table of the fault-tolerant VLSI
array of Figure 4.4.3 with a multiple transient
fault in MAC.4, MAC.6 during the fourth clock cycle.

The space-time table for the unmodified array pipeline.
Note the functional transition and output inconsistency
in PE3,m.and PE4,d.

The space-time table of the fault-tolerant pipeline
of Figure 5.2.1. The error-free case.

The space-time table of the fault-tolerant pipeline
of Figure 5.2.1 with a transient error in PEZ,m during
the second clock cycle.

The space-time table of the fault-tolerant pipeline
of Figure 5.2.1 with a transient fault in either PE2,m
or PE3gm/d during the third clock cycle.

The space-time table for the fault-tolerant pipeline
of Figure 5.2.1 with a transient fault in either
PE3,m/d or PE4,d/m during the fourth clock cycle.

The space-time table for the fault-tolerant pipeline
of Figure 5.2.1 with a permanent fault in PE2,m.

The space-time table of the fault-tolerant pipeline
of Figure 5.2.1 with a permanent fault in PE3,m/d.

The space-time table of the fault-tolerant
two-dimensional array procesor for the error-free case.

Page

86

89

90

91

92

93

99

103

106

107

108

109

110

114

Table Page

5.5.2 The space-time table of the fault-tolerant
two-dimensional array processor with a transient
fault in either PE1.2,m or PE1.3,m/d. 115

5.5.3 The space-time table of the fault-tolerant

two-dimensional array processor with a permanent
fault in PE1.4,d/m. 116

xi

CHAPTER 1

INTRODUCTION

In recent years much attention has been focused on designing
highly dedicated computing structures such as systolic arrays which
use both pipeline and parallel algorithms. These structures are used
in many applications including signal processing. filtering and many

matrix analysis routines.

Most often used as a peripheral to a host computer. these
dedicated computing hardware devices will increase problem throughput

and often relieve the computational burden of general purpose

computers. The architecture of these structures is the result of a
direct mapping of the algorithms onto simple and modular computing
blocks. and the interconnections between blocks are determined by the
computational data flow paths. Because of the modularity of the
computational blocks and locality of the communication paths. these
computing structures are prime candidates for VLSI (Very Large Scale

Integration) implementation.

VLSI implementation will enable the cost-effective design of such
structures with smaller size. lower cost. and higher speed. Because
of the increasing availability of CAD/CAI (Computer-Aided
Design/Computer-Aided Manufacturing) systems. VLSI implementation can
be realised more accurately with lower design cost and faster
turn-around time. Simple or complex functional blocks can be designed
and verified interactively on a CAD system and stored as library
cells. For a given application. library cells can be placed and
routed by the CAD system and the final design can be checked with
respect to the designer's check rules. Finally. mask data generated

by the CAD system are sent to a silicon foundry for chip fabrication.

Problems such as efficient usage of chip real-estate. chip
testability. yield and reliability may arise from such
implementations. Improved chip fabrication techniques will result in
higher chip yield and reliability. Chip reliability depends strongly
on constituent building blocks reliability. As chip complexity

increases. these constituent functional block have a high probability

of containing or developing faults before or during operation despite
extensive application of fault prevention. These faults can be
classified as permanent. intermittent and transient. Since.
next-neighbor communication among constituent functional cells is
essential in VLSI implementations of dedicated computing structures.
an error caused by any fault within each block will propagate
throughout the structure and will result sometimes in an undetectable
erroneous output result. Global error checking and correction. due to
the large number of PE's and the large amount of processed data is
time consuming. and even if an error is detected. the error correction
requires the recomputation of the whole task. Furthermore. global
error detection most of the time will not locate the source of errors.
Therefore. any error detection and correction technique must be local
(i.e. at the PE level) and concurrent so that it may detect and
correct errors caused by all kinds of faults. Therefore. in order to
design highly reliable computing structures. fault-tolerance must be

incorporated into their design.

The goal of this work is to incorporate fault-tolerance and
dynamic reconfigurability into existing VLSI-implementable systolic
array architectures. This is achieved by devising a concurrent error
detection and correction algorithm. capable of detecting and
correcting errors caused by all classifications of faults. This
algorithm utilizes the inherent PE redundancy of the array
architecture in order to minimize the overhead hardware and timing

required by the fault-tolerant design. The algorithm's overhead

hardware and timing requirements are determined. Finally. this
approach is applied to classes of systolic arrays and VLSI

architectures used in signal processing and matrix multiplication.

Chapter 2 of this dissertation presents the background
information on the systolic arrays and the notion of fault-tolerant
computing. A new error detection and correction algorithm for linear
one-dimensional systolic arrays is presented in Chapter 3. In Chapter
4. the error detection and correction algorithm is extended to
two-dimensional square array architectures. Chapter 5 deals with
solving problems arising from functional transitions in neighboring
processing elements for fault-tolerant one-and two-dimensional
architectures. Chapter 6 contains the summary and conclusions of this

research.

CHAPTER 2

BACKGROUND

Dedicated computing structures such as systolic arrays can be
used in applications such as signal processing. filtering. and various
matrix analysis routines [1-10]. The architecture of such a structure
is the direct mapping of the algorithms onto relatively simple and
modular blocks. Intercounections are determined by the algorithm's
computational flow paths. Direct mapping of the algorithm onto
relatively simple and modular cells. the locality of intercell
communication paths. and the increasing availability of CAD tools.

have made the VLSI implementation of such structues very attractive.

2.1 Algorithm Philosophy

Based on the layout methodology described by Mead and Conway
[21.22]. algorithms suitable for VLSI implementations must have the

following features:

A) The algorithm should be implemented using only a few different

types of simple cells.

B) The data control flow of the algorithm should be simple and
regular. thus allowing cells to be connected by a network of local and
regular interconnections. Long and irregular interconnections must be

minimized.

C) The algorithm should use extensive pipelining and
multiprocessing. In this way a large number of cells can be active at
one time. so that the computational speed can keep pace with the data

rate.

The algorithms that have these properties have the following

advantages:

1) A new design is generated with only a few different cells.

since most of the cells are replications.

2) Regular interconnections imply that the design may be made

6

modular and extensive. Thus. a large chip design can be made by

combining the design of small tested cells.

3) Through the use of pipelining and parallelism in the design.

the computational performance (throughput) can be increased.

2.2 Systolic Arrays

Among the several approaches to parallel computation which
satisfy the VLSI algorithm philosophy of Section 2-1. the systolic
array concept is particularly interesting. As proposed by lung [1]. a
systolic array is a special-purpose redundant parallel computing
structure. made out of a few relatively simple cell types which are
regularly and locally interconnected. Computed data circulate through
these cells in a regular fashion and are processed at each cell. New
computed results of each cell are then sent to neighboring cells.
Constituent processing cells or elements (PE's) are arranged in a
parallel-pipeline fashion to form a one-or two-dimensional structural
configuration. Data is "pumped" synchronously between levels of PE's.
After each ”systolic best" (time step). part of an overall computation
is performed and results are transferred to the neighboring cells.
Simple dynamic storage circuits. such as latches. are placed between
rows of PE's to provide data synchronization. Figure 2.2.1
illustrates some common interconnection schemes used in systolic

arrays.

(b)

Figure 2.2.1.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

-—l .—
1 sI ‘ I .
.. i—q .4 t.
f’ .__I:_. RI
r
ﬂ i—- h—
*r* l 1*

 

Different interconnections for systolic arrays:
(a) One-dimensional systolic array.

(b) Degenerate two-dimensional systolic array,
(c) Two-dimensional square systolic array,

(d) Two-dimensional hexagonal systolic array.

A systolic array structure used for matrix triangulation is shown
in Figure 2.2.2 [2]. Any augmented matrix {Alb} can be triangulated
into an upper triangular form {u:g}. where A has been permuted ;
pgiggi to minimum band form. The matrix A is of bandwidth 8 defined
as max1.1 {li-jl: ‘ij A 0]. In Figure 2.1 PE's shown in circles.
separated by dark lines (latches). are made out of two types. PE's
labeled as MAC (Multiply and Add Cell) perform the multiplication and
addition of three incoming operands. w-xy+z. and transfer x-x and y'y.
PE's labeled as DC (divide cell). perform the division of two incoming
operands. g-e/f. and transfer f-f. Elements of [Aih] are "pumped”
into the array through input lines. labeled as 11-11 and ID- Blg-ontg
of {g:g) are pumped out through output lines. Ol-Q‘ and OD- Lgtchgg,
controlled by a two-phase non-overlapping clock. provide data
synchronization and regular data circulation. Isolated MAC's at the
bottom of the structure operate upon the b vector elements. which

subsequently will be resolved into 9 vector elements.

In general. 2B+2 input and 8+1 output ports are required. Each
port is n bits wide. where n is the width of the coefficient matrix
elements. It can be shown that B(B+l) MAC cells and 8 DC cells or
O(B') PE's are required [2]. Also 2N+2B. where N is the arbitrary
matrix dimension. or O(N) (assuming E<< N ) time steps are required to
complete the triangulation. This-is an improvement. by a factor of
O(N). over the time required for the same Operation on a serial or
attached array processing computer. Thble 2.2.1 illustrates the

timing steps for the operation of the array of Figure 2.2.2.

10

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

e
f
0 ‘ DC DC DC
.J.._.
9
w
02 MAC MAC MAC x
0 MAC MAC MAC
.2_.l
‘ MAC MAC MAC
0
3......
l 1'
I1 I2 I3 I4 I5 I6
0D
l'“"""""“"“""”"‘"‘.
i ' D section
I :1
: MAC MAC MAC .
3 i
l‘ I
I- ........................ t---|
ID

Figure 2.2.2. The structure of a computing array for a system of
arbitrary dimension and B=3.

ll

Table 2.2.1. The Port-Coefficient time table of the array of Figure

2.l for B=3.

I1 12 13 I4 15 16 17 ID 01 02 03 04 00
t1 a11 a21 a31 a41
t2 1
t3 a12 a22 a32 a42 a52 b1
t4 ‘ u11
t5 a13 a23 a33 a43 a53 a63 b2 u12
t6 1 ”22 u13
t7 a14 a24 a34 a44 a54 a64 a74 b3 “23 u14
t8 ‘ u33 u24 d1
t9 a25 a3s a45 a55 a65 a75 a85 b4 “34 “25
t1o ‘ ”44 “35 d2
t11
t12 1

t2N+6 dN

12

2.3 Fault-Tolerance

Fault-tolerant computing is the correct execution of a specified
algorithm in the presence of defects [19.27.28]. Defects can be
categorized as design defects or fabrication defects. Dedicated
computing structures. such as systolic arrays. are less likely to
suffer from design defects. This is due to the fact that the
circuitry involved in such designing consists of highly modular and
relatively simple functional subcircuits or cells. Functional
correctness of the subcircuit design is verified through simulation at
all of the implementation levels (register transfer level. gate level.
transistor level. and mask level). Therefore. the defects for all
practical purposes may be redefined as flaws introduced during various
phases of the fabrication process (diffusion. ion implantation. oxide
growth.etc.) or faults introduced later when the system is being used
(noise. transistor burn-out. etc.). Because of the irreversible
nature of the fabrication steps. the end result will determine the

degree of functionality of the whole computing structure.

Structural complexity and the limited number of Input/Output
(IIO) pins allow only for external testing [46]. Even in the presence
of a minor flaw. such testing will indicate a complete system failure
and thus reduce overall chip yield. One way to overcome fabrication
defects is to devise flawless fabrication techniques. However. before
such techniques are develOped. fault-tolerant design is essential to

improve yield and system reliability.

13

2.3.1 Fault Classes

Undesired events in a computing environment can be categorized in

a sequence of cause-effect relationships [29] as follow:

Physical ( Component Failure )

Logical ( Faults )

Computational ( Errors )

External ( Crash )

Chip failure caused by physical defects must be avoided by improved
fabrication techniques. However. physical defects resulting in faults
(logical faults) [57.58] can be masked by incorporating
faultrtolerance into the system design. In order to be able to
understand fault classes and to devise methods for fault recovery. one
must model different types of predictable faults. Furthermore. due to
the technology dependency and causal complexity of the faults.
modeling fault classes for all possible causes and technologies is a
difficult task. Therefore one must concentrate on modeling all
possible and predictable faults for a given technology. MOS (metal

oxide semiconductor) technology is commonly used in today's chip

14

manufacturing. In MOS technolOgy a large number of physical failures
introduced during the fabrication phases can be covered by two classes

of defects [30].

1. Defects that cause the transistor to be open permanently.
This class of faults is caused by missing or open contacts on the
source. gate. and drain of the transistor; a break in the metal.
polysilicon or diffusion line that connects the terminal node of the
transistor; or a threshold voltage shift which makes the device

permanently open.

2. Defects that cause the transistor to conduct permanently
(permanently closed). This class of fault is caused by channel
punch-through which connects source and drain of the transistor; the
oxide breakdown resulting in the transistor gate being shorted to
drain (n-type) or source (p-type); or shifts in the threshold voltage

causing the transistor to conduct pemmanently.

Major defects such as mask misalignment or dust particles on the
surface of the wafer will damage a large area of the chip and might
result in a complete chip failure. A statistical study of fault
occurrence with respect to the defect size (diameter) is given in
[51.64]. Defects that will change the circuit configuration
(bridging) or do not result in the node being permanently stuck at
some logic value (noise) can also be encountered. The above

categorization of fault classes is a rather specialized one. However

15

it will set a good base for fault modeling in a VLSI environment and
can be applied to more general classes of faults which might be caused
by similar physical flaws or defects in other fabrication

technologies.

2.3.2 The Concept of Fault-Tolerance

Complex systems have a high probability of containing or
develOping faults before or during Operation despite extensive
application of fault prevention [27.54.55.56]. Fault prevention has
two aspects. The first aspect of fault prevention is fault avoidance.
These techniques are used to avoid introducing faults into the system
by improving design methodologies and quality control. The second
aspect is fault removal. At the design level. these techniques are
used to find and remove faults by testing and validation [31-36].
Based on definitions given in [27]. three major constituents of fault

tolerance can be identified as follows:

1. Error detection

In order to tolerate a fault. its effects must be first detected
[37.38]. For complex systems. such as VLSI computing structures. a
fault might not be directly detected because of the limited number of
IIO pins. However. any fault will generate errors somewhere in the
system. Thus. the usual starting point for fault tolerance is the

detection of an erroneous state.

16

2 . Damage as se ssment

When an error is detected. much more of the system state may be
suspect than that initially discovered to be erroneous. Other errors
masked by one error might exist in the system. Thus before any
attempt is made to deal with a detected error. it may be necessary to
assess the extent to which the system state has been damaged. Damage
assessment must be done locally. Any additional error or damage
detecting hardware must be totally self checking (TSC) so that faulty
checkers do not introduce additional or undetectable errors to an

already faulty system [31.33.53].

3. Error recovery

Following error detection and damage assessment. techniques for
error recovery must be utilized. These techniques. such as dynamic
system self-reconfiguration. will aim to transform the current
erroneous system into a well defined and error-free state from which
normal system operation with. ideally. no system degradation can
continue [18.20.39.52]. 'ithout such a transformation the system will

fail to operate correctly.

It is important to note that a fault-tolerant system design must
provide tolerance to a range of predicted component faults.
Unanticipated effects of implementation faults may not be modeled into

a fault-tolerant system. In any case. the latter undesirably

17

increases the complexity of such a system. Such problems are best
addressed by the fault prevention aspects of the design in order to

narrow unanticipated fault classes into predictable fault models.

18

2.3.3 Fault-Tolerance in VLSI

The rapid pragress in VLSI technologies and the ability to
package millions of devices on a single chip. raise the probability of
design and fabrication faults. The shrinking dimensions of the
circuits. seen by line widths dropping from 10 microns in the early
1970's to 1 micron in today's protptypes. makes the circuits
susceptible to transient physical disturbances.such as Alpha particle
impacts. that were previously insignificant. The growing logic
complexity of a single circuit package. which was 1000 gates in the
early 1970's and close to 1 million today. makes the prefabrication

and postfabrication detection of all errors less probable.

One basic approach to VLSI fault-tolerant design is "Design
Diversity". In such an approach. multiple designs are carried out
independently starting from a verifiable. formal specification of the
functions to be performed. The specification needs to be expressed in
a high level language that allows formal verification and minimizes
the probability of biasing the independent designs toward the same
errors. All of the multiple designs can contain errors. however
during concurrent operation the errors will be detected if their
symptoms are not identical and errors can be corrected. if a majority
of good results can be found [40]. Hardware design diversity can only
be achieved through component redundancy. VLSI implementation has led
to cost breakthroughs that bring the hardware cost of multiply

redundant systems within the range of many applications. Redundant

19

fault-tolerant VLSI architectures in which subsystems have been added
topologically. are used extensively in VLSI memory chips [41.42.50].
In such architectures. design diversity is not desirable. It would
add enormously to the chip real-estate and increases I/O complexity
and therefore degrades chip performance. Most architectures that
benefit from a high degree of modularity and cell simplicity can not
be built cost-effectively using a high level of design diversity.
Chip real-estate in VLSI implementation is expensive and can not be
sacrificed in order to achieve design diversity (e.g. processor
arrays replicated using diverse design etc.). It is important to note
that including fault tolerance in a given design would certainly add
to chip real-estate consumption. However. the additional hardware. if
kept within a reasonable range. if capable of covering predicted

faults. will result in an error-free and reliable design.

Another group of fault-tolerant techniques. especially used for
tolerance to transient errors. can be divided into error masking and
concurrent error detection. Some of techniques associated with error
masking are Triple Modular Redundancy and Quadded Logic. These
techniques require at least a factor of 2 or 3 of additional hardware
redundancy to tolerate single module failures [43]. The concurrent
error detection schemes. which are used to signal errors but not mask
them. include: Totally Self Checking (TSC) circuits. Alternating
Logic (AL). and REcomputing with Shifted Operands (RESO). According
to [43]. TSC techniques require between 73 and 94 percent hardware

redundancy for single bit and unidirectional errors. AL has the worst

20

figures with 85 percent hardware and 100 percent time redundancy.
RESO requires almost 100 percent time redundancy for recomputation of
results to detect errors. Note that the hardware redundancy is
defined as the ratio of the hardware required by the fault-tolerant
technique to the hardware of the original system without
fault-tolerance. Likewise the time redundancy ratio is defined as the
ratio of the time overhead required by the fault-tolerant technique to

the time complexity of the original system without fault-tolerance.
2.3.4 Faul t-Tolerant Systolic Arrays

As covered in Section 2-2. systolic arrays are dedicated
computing structures that satisfy the Mead and Conway design
philosophy and therefore can be impleeented in VLSI. However the
limits on IIO pins [46]. the high computational speed. and the lack of
on-chip data mass-storage. make the structure defenseless against
errors. Faults only become apparent when the whole structure fails to
output the correct results. Because of the amount of data involved.
it is very costly to verify the correctness of the computed output.
In fact. verifications of such caliber cause questions to be asked on
the whole purpose of needing computational peripherals. Therefore.

future design of such arrays must take fault-tolerance into account.

As discussed earlier. fault-tolerance capability will increase
the chip real-estate consumption. It will require component

redundancy and additional checking and control hardware. However.

21

before fault prevention techniques are improved. such a tradeoff is

essential for an optimum and reliable implementation.

In conclusion. designing systolic arrays with fault-tolerance and
self-reconfigurability properties. which use the inherent redundancy
of the array architecture to overcome errors caused by predictable
permanent or transient faults. is highly desirable. Introduction of
additional hardware must not significantly reduce the system
performance or implementation feasibility. Furthermore. fault
detection must be local (i.e. at the PE level). This requires all
additional elements to be totally self checking. Comparison
techniques. which might indicate a nonexisting fault due to round off
errors or functional transitions in neighboring processing elements.
must be avoided; and finally. additional control and/or
interconnection lines must be kept local. The fault model must be
general. In other words. the error detection and correction algorithm
must be capable of detecting and correcting errors caused by
permanent. intermittent and/or transient faults. Furthermore. it has
been shown that a general fault model in which the origin or the
frequency of fault-occurrences are assumed unknown. is most suitable
for fault modeling in a VLSI environment. In arithmetic array
processors implemented on a VLSI chip. these faults affect a small
part of the chip. most likely a processing element. The duration of
these faults is sometimes unknown (ex. transient faults caused by
power supply fluctuations or Alpha particle impact). Transient faults

will not damage the hardware and their occurrences are only detected

22

by checking the output state of the processing elements. This implies

that the error checking and recovery algorithms must be concurrent.

CHAPTER 3

An Error Detection and Correction Algorithm
Fer Fault-Tblerant Linear One-Dimensional

Systolic Arrays

3.1 ‘Dne Error Detection Algorithm

Figure 3.1.1 illustrates the structure of an array of processing
elements connected in a pipelined fashion. All PE's perform the same
operation on incoming Operands. The Operands are loaded from the left
and processed by each PE. Ideally. a linear pipeline with k stages
will process n Operands in Tk-k+(n-1) clock periods. The space-time
table Of the pipeline Of Figure 3.1.1. for arbitrarily chosen k?‘ and

n-5. is shown in Table 3.1.1.

23

24

a

CLOCK

 

H
y

 

 

 

 

 

 

 

 

 

 

 

INPUT OPERANDS
->-é—

OUTPUT OPERANDS

FE 1—6—>

INPUT REGISTER
OUTPUT REGISTER

 

 

 

 

 

Figure 3.1.1. Basic structure of a linear pipeline processor.

25

Table 3.l.l. The space-time table of the normal linear
one-dimensional pipeline processor of Figure 3.l.l.

Input/Time PE.1 PE.2 PE.3 PE.4 PE.5 PE.6
to A o o o o 0
t1 B A1 0 o o 0
t2 c 131 A2 0 o 0
t3 D c1 32 A3 0 o
1% E 01 c2 B3 A4 0
g x 151 02 c3 B4 A5

Output/Time PE.1 PE.2 PE.3 PE.4 PE.5 PE.6
to o o o o o 0
t1 A1 0 o o o 0
t2 B1 A2 0 o o 0
t3 c1 32 A3 0 o 0
F4 01 c2 B3 A4 0 0
t5 E] 02 c3 B4 A5 0
t6 X E2 D3 C4 B5 A6
t7 x x E3 04 c5 36
t8 x x x E4 05 c6
t9 x x x x 55 D6
t x x x x x E

10 5

26

Figure 3.1.2 illustrates the modified array. PE's are designed
to process or bypass the input operands. The bypass mode is used for
the concurrent error detection and recovery algorithm that will be
discussed in the remainder of this section. Latches are present but
not shown in Figure 3.1.2 and comparators are added between adjacent
PE's. The operands are loaded from the left and each PE performs the
same operation on incoming operands. Every PE contains an input data
register (IDR). an arithmetic unit (AU) and an output data
register(ODR). The input Operands are loaded into the input data
register. processed by the arithmetic unit and stored in the output

data register.

During the first clock period the operand A is loaded from the
left to the PE labeled as PE.1 and results in the processed output
labeled '8 A1: During the second clock period A1 is fed into PE.2 and
in the normal pipeline Operation illustrated in Table 3.1.1 a new
Operand is loaded into PE.1. However in this scheme A1 1. fgd bgck
into PE.1 and processed simultaneously by both PE.1 and PE.2. Since
all PB's are unifunctional (i.e. perform the same operation on their
input operands) the outputs of PE.1 and PE.2 Operating on the same
input must be equal. The outputs of PE.1 and PE.2 are compared by the
comparator C1. If the outputs of PE.1 and PE.2 are equal the
comparator output is "0" and the normal Operation of the pipeline
continues with each new Operands loaded into PE.1 twice. However if
the outputs of PE.1 and PE.2 are not equal. Cl will flag an error by

setting its output to "1".

27

 

 

 

 

 

d1 1*

PE.1 PE.2 PE.3 PE.4 RPE.5

 

 

 

rev-u (TSC) snail - «en-4‘ +- 1.4-.

I
I
, _
: fTsc J .--- ---5
..-, TEST ’ «1 «» w
1 . , . 1 .

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

R
—-'l' —4-—p
._.;32IJ . c1 -~—J-—-c2 e—L—-c3 .—» 11bit

......,--.1 ‘ 1 ‘ ‘ Operands

Input I
Operands l 2 l
TWO-PHASE
CLOCK
Parity bit in _—.» ——.. Parity bit out
(PI) PE.i (P0)
Input operands +., 1.“ ’5 .,.. Outputooperands
i i

 

 

 

 

I—hl

1 Ci ._.. From 01.
e.—

— 1

 

 

 

CLOCK

Figure 3.1.2. The structure of the fault-tolerant one-dimensional
linear pipeline processor.

28

 

Input Parity bit
(PI)

Input Operands

1‘1 -—-e—.

 

P E .
1

Arithmetic Unit (AU)

Input Data Register (IDR)
Output Data Register (ODR) ‘

Output Parity bit
(P0)

Output Operands

 

 

Begin
if EF=O then
IDR ‘— 1‘1
AU :-- IDR
ODR-— AU
PO ‘-- PT

Oi ‘— ODR

elseif EF=l then
01' ‘- IDR
AU -— IDR
ODR *— AU
PO -‘- PI
0i -—- ODR

end if

end;

Error Flag
(EF)

Figure 3.1.3. Basic PE for the fault-tolerant pipeline processor
of Figure 3.1.2.

29

 

Input Parity Bit R P E ‘ 5 Output Parity
0
PI —- —’ Bit

Arithmetic Unit ( AU ) Data Valid(DV)

Input Data Register(IDR)

Input Operands Output Operands
dbl-OI +D
IR Output Data Register (ODR) O

I

Error Flag
(EF)

R

 

 

 

Begin

if EF=O then-
if PI=O then
IDR --IR
ODR --—IDR
DV — PI
oR -— ODR

if PI=l then
AU ~— IDR
ODR ~— AU
DV ~—- 'P‘I'
oR ~— ODR

elseif EF=l then
I AU -— IDR
ODR .— AU
DV "-PI
OR e—ODR
end if
end;

Figure 3.1.4. Basic Redundant PE for the Fault-tolerant
processor of Figure 3.l.2

30

At this point. assume PE.1 is fault-free. The case when PE.1
could be faulty will be discussed later in this chapter. A mismatch
in thO outputs A: of PE.1 and PE.2 will indicate that an error caused
by some unknown fault in PE.2. assuming fault-free PE.1. has occurred.
Note that the fault model assumes an unknown origin for faults and
errors which can be caused by permanent. intermittent or transient
faults and is thus suitable for fault modeling in a VLSI environment.
The only assumption is that faults will cause the output of the PE's

to be erroneous.

An error flag by C1 will cause PE.2 to maintain its present input
Operand for an additional clock period and bypass this Operand to
PE.3. During the third clock period a new Operand B is loaded into
P3-1 and A1. stored in the input register of PE.2 and bypassed to PE.3
at the end of the second clock period. is recomputed. In case of an
error detection the Operands are delayed by one PE along the array
(i.e. 42 computed by PE.2 and PE.3 instead of PE.1 and PE.2). In
order to recover an error in PE.k (i.e. last PE in the array). an

additional PE is needed and will be discussed in the next section.

The comparators of Figure 3.1.2 are relatively simple and
modular. Various designs of totally self checking comparators have
been proposed in the literature [44.45]. However. fault-free
comparators are assumed here. In fact. in the normal pipeline
Operation. a false 1 error in the output of a comparator will only

cause its corresponding PE to bypass and recompute its incoming

31

Operands for one clock cycle and therefore will not introduce any

error in the data flow of the pipeline.

For algorithms which require a bidirectional computational flow.
a symmetrical row of comparators and interconnects can be added to the
top of the pipeline structure. be error detection and bypassing
technique can be used in the same manner as described previously. The
comparator row on the bottu of the structure compares the computation
results of the PE's with the data flow from the left to right and the
comparator row on top of the structure compares the computation
results of PE's with the data flow from right to left. In case an
error is detected in either direction the corresponding PE's are set

to bypass their input operands and recompute for one clock period.

For the one-directional pipeline we assumed the first PE to be
error-free. The first PE labeled as PE.1 can be designed to be
Totally Self Checking (TSC). Various designs of such PE's are
proposed in the literature [36.46]. The T‘SC-PE.1 can be checked at
the start-up of the execution. This might determine the presence of a
permanent fault in PE.1. In case PE.1 is permanently faulty. it will
be bypassed for the entire duration of the pipeline Operation. For
bidirectional data flow. both the rightmost and the leftmost R's are
designed to be TSC. This only increases the overhead hardware
requir-ents by the extra TSC hardware required for these two PE's.
In VLSI implementations. a large number of PE's(R>lOOO) are built on a

single chip. Therefore the extra hardware required for the leftmost

32

and rightmost PE's of the array is minimal compared to the additional
hardware required for the error detection and correction algorithm.
In the one-directional pipeline Operation. another approach would be
to load the first input Operand sets (i.e. A.B.C.etc.) into both PE.1
and PE.2 at the same time and compare the results. In order to
maintain the correct number of operations on each operands. an
additional PE is required. Also the comparaison timing along the

pipeline must be modified accordingly.

Thbles 3.1.2 through 3.1u9 illustrate different cases of fault
detection. recovery and output result validity for an array of k-4 PE

processing n-5 Operands.

33

Table 3.1.2. The space-time table of the fault-tolerant pipeline

error-free case-

PE.2 PE.3 PE.4

PE.1

Input/Time

3434343434

OOOAABBCCDDEE

2323232323

OOAABBCCDDEEX

11211212112112

0AABBCCDDEEXX

AABBCCDDEEXXX

2
1
.T.

0
0 I 2 3 4 PO 6 7 8 9 1|
.7. t t t t t t t t t t

1
1
t

PE.2 PE.3 PE.4 PE.5

PE.1

Output/time

SET TO BYPASS

4545454545
OOOAABBCCDDEE

3434343434
OOAABBCCDDEEX

2323232323
0AABBCCDDEEXX

121212112112
AABBCCDDEEXXX

34

Table 3.1.3. The space-time table of the fault-tolerant pipeline
with a transient fault in PE.2 during the second clock

cycle.

Output/Time PE.l PE.2 PE.3 PE.4 PE.5
t1 A1 0 o o 0
t2 ifafftw.) _____ ‘imput‘: 0
t3 B1 in? ..... 64 0
t4 32 32 A3 A3 0
t5 c1 B3 B3 A4 x
t6 c2 c2 B4 B4 A4
t7 01 c3 c3 35 34
t8 02 02 c4 c4 x
t9 E1 D3 D3 c5 c4
tlo £2 £2 04 04 x
t11 x E3 E3 05 D4
t12 x x E4 E4 x
t13 x x x E5 54
t14 x x x x x

. f : Compare results

----_--‘

35

Table 3.1.4. The space-time table of the fault-tolerant pipeline
with a transient fault in either PE.3 or PE.4 during

the 6th clock cycle.

Output/Time PE.1 PE.2 PE.3 PE.4 PE.5
t1 A1 0 O 0 0
t2 A2 A2 0 0 0
t3 B1 A3 A3 0 0
t4 B2 B2 A4 A4 0
t5 C1 B3 B3 A5 A4
t6 (:2 C2 [8. - erreanj X r compute
t7 01 c3 c3 B4 A4»;
t8 D2 02 C4 C4 X
t9 E1 03 D3 C5 C4
th E2 E2 04 D4 X
tll X E3 E3 05 D4
t12 X X E4 E4 X
t13 X X X E5 E4
t14 X X X X X

36

Table 3.1.5. The space-time table of the fault-tolerant pipeline
with a multiple transient fault in either PE.2, PE.3

or PE.4 during the 6th clock cycle.

Output/Time PE.l PE.2 PE.3 PE.4 PE.5
t1 A1 0 o o 0
t2 A2 A2 0 o 0
t3 B1 A3 A3 0 0
t4 32 32 A4 A4 0
t5 c1 B3 B3 A5 A4
ts {C2225 .152”: :.B's'_'f‘:‘i°"'31‘: X
., D] :c;:::::'sé.s B4\~.W
t8 D2 D2 C3 C3 X recovered
t9 E1 D3 03 c4 x
tio E2 E2 ”4 D4 C4
tH x E3 E3 05 D4
t12 x x E4 E4 x
t13 x x x 55 E4
t14 x x x x x

X: Don't care

1
n .: Compare results

37

Table 3.1.6. The space-time table for the fault-tolerant pipeline
with a permanent fault in PE.2.

Output/Time PE.1 PE.2 PE.3 PE.4 PE 5
t1 A1 0 o o 0
t2 in? ..... Ag: 0 o o
.3 B] lg:::::‘5a'i 0 0
t4 Eh;"""@ﬁ' A2 A2 0
t5 C1 55---- 35: A3
t6 E£---‘-E:E ...... 5; B2 A4
t7 “1 592----fai 33 X
ta {02:13 .2 .2 B4
.9 :1 __9;,::::::°:23 cs x
Ho 62--.“.f21 02 02 c4
tn " 5252 “3 "
t12 x x 52 52 D4
t13 x x x E3 x
t14 x x x x E4

X: Don't care

- I - - . C ' ~. .-

_3: All comparisons on PE.2 indicate error

38

Table 3.l.7. The space-time table of the fault-tolerant pipeline
with a permanent fault in PE.4.

Output/Time PE.l PE.2 PE.3 PE.4 PE.5
t1 A1 0 o o 0
£2 A2 A2 0 o 0
t3 B1 A3 A3 0 0
t4 32 32 lag ..... n5} 0
t5 c1 B3 B3 A4 A4
'6 '2 '2 ['21. 1'... 1 1'9} "
t7 01 c3 c3 34 B4

r -------- 1
t8 ”2 ”2 :C4 C4: X
t9 El 03 03 c4 C4
tm 52 E2 gﬁg'm'Bgi x
tll X E3 E3 “4 D4
H2 x x fi----fé x
t13 x x x E4 E4
t14 x x x x x

X: Don't care

1 : All comparisons on PE.4 indicate error

39

Table 3.l.8 The space-time table for the parity check and output
validity with a transient fault in PE.2 during the
second clock cycle.

Parity out/Time PE.1 PE.2 PE.3 PE.4 PE.5 OUTPUT
VALID?
1:.l l 0 0 O 0 YES
t2 0 0 0 0 0 YES
t3 1 0 0 0 0 YES
t 0 0 l l 0 YES
t5 1 l l 0 1 N0
t6 0 0 O O 0 YES
t7 1 l l l 0 YES
t8 0 O 0 0 1 NO
t9 1 l l l 0 YES
t10 0 0 0 0 1 N0
t]] X l l l 0 YES
t12 X X 0 0 1 NO
t13 X X X l 0 YES
t X X X X 1 N0

40

Table 3.1.9. The space-time table for the parity check and output
validity with a permanent fault in PE.4.

Parity out/Time PE.1 PE.2 PE.3 PE.4 PE.5 OUTPUT

VALID?

t] 1 o o o 0 YES
t2 0 o o o 0

t3 1 1 1 o o "
t4 0 o o o o "
t5 1 1 1 o 0

t6 0 o o o 1 N0
t7 1 1 1 o 0 YES
t8 0 o o o 1 N0
t9 1 1 1 o 0 YES
tlo o o o o 1 N0
t]] x 1 1 o 0 YES
t12 x x o o 1 NO
t13 x x x o 0 YES
t x x x x 1 NO

_a
.b

4l

3.2 Error Recovery

If en error is detected in the ith PE (i>1). P81 is set to
recompute its input operands for the next clock cycle. During this
cycle the input cpersnds 0f 931 previous to the erroneous conputstion
ere transferred to the input of PBi+1. During the following clock
cycle the error is recovered by P81 end PEi+l by recomputing the
bypassed dsts fro- the previous step. This inplies thst for the
design of s k stege feult-tolerent pipeline cspsble of detecting end
correcting errors in sny of k stsges. h+l PE's ere needed. The
sdditionsl PE hypesses its input operends unless there is en error in
the h‘h stsge or e pernsnent feult in eny of the PE's elong the
pipeline structure. The following register trensfer level elgorithl.
elso reproduced in Figure 3.1.3. delonstretes the error recovery steps
in the ith PE (1(i<k) of the fsult-tolersnt one-dinensionsl pipeline

structur 0.

Begin
if EFUO then
IDR <-- Ii
AB <-- IDR
ODR <-- AD
P0 <-— 171'

oi <.... 0m

elseif EFhl then

42

ODR (-- IDR

AD <-- IDR

ODR <-- AD

P0 (~— PI

01 <-- ODR
end if

end;

In the error-free operetion of the pipeline. the finel result is
eveileble every other clock cycle. However if en error occurs the
output result cen not be obteined es regulsrly es before. One wey of
overcoming this problem is to edd en odd or even bit with eech PB. In
other words. PE.1. PE.3. PE.5 etc.. being odd. will heve e ”0" bit end
end PE.2. PE.4 etc.. being even will heve e "1" bit. The output
control will check for the perity of the finsl dete end decides on the
velidity of the finsl dete. For example in e k-4 stege pipeline. four
processes must be performed on eech input opersnd. Sterting et PE.1.
the perity bit is set to "0". PE.2 will reset the perity bit if it
performs en erithmetic operetion on its input cperend. If PE.2
bypesses the input. the perity bit rueins st "0". This process
prOpegetes elong the errey eech time PE's bypess or cperste on their
incoming cperends. The finsl result must contein en even perity bit
(for the cese ks4). If not it will be discerded by the output buffer.
Since the perity bit end the dete ere prOpegeted et the seme time. the

perity bit will not slow down the computetion timing of the errey

43

pipeline.

In e k stege unifunctionel one-directionel pipeline (k is en
erbitrery integer). the output result of the hﬁh stege is velid only
if en input operend losded into the first P8 of the pipeline undergoes
k processes. If. beceuse of e feult. eny input operend is processed
less then k times. the output result of the kth PE is not velid end
must be processed by edditionel redundent PE's. The number of the
spetiel redundency is determined by the cepebility of the
feult-tolerent design to recover from permenent feults. In other
words. in order to meintein the correct number of processes on eech
operend. permenently feulty PE's must be repleced by redundent
feult-free PE's. For eremple the redundency of one. essumed
throughout this chepter. will ensble the feult-tolerent pipeline to
recover from one permenent feult. In the error-free operetion. the
redundent PB must only bypess its input operends. However in cese e
permenent feult is detected in say of the PE's. the redundent PB must
process its incoming operends so thst the feulty PE is repleced by the
redundent PB. Therefore s control elgorithm must be developed for the

consistent operetion of the redundent PB.

3.2.1 Control Algorithm for The Bedundent PE

In this section. e control elgorithm for the redundent PE is

developed. This elgorithm is besed on ell possible ceses of

feult-occurrence in e k stege pipeline. It is essumed thet the

44

structure is cepeble of recovering from et most one permenent feult.

Cese l : Permenent Feult

In this cese. one of the PE's in the pipeline structure is
permenently feulty. As illustreted in Thble 3.1.6 end 3.1.7. the
error detection end correction elgorithm indicetes e permenent feult.
Since one of the PE's is feulty. the input operends undergo krl
processes. Therefore the redundent PE must process its incoming
operends. As described. in the error-free pipeline operetion the
output perity bit of the k“ a; .111 be odd if k is en odd number end
even if k is even. This output perity bit of eech PE is reset eech
time PE's perform en operetion on their input operends. Beceuse of e
permenent feult in P31 (1<i<-k). the output perity bit of the kﬁh PE
will not metch the evenness or oddness of k. In other words. for the
cese of kF4. if e permenent feult occurs in say of the PE's (i.e.
PE.1-PE.3). the output perity bit of Hit .111 not b. even. Therefore.
the control elgorithm of the redundent PB (PEk+1) .ngt check {at the
output perity 0‘ th' PEk° If the perity does not metch k. P8k+1 must

process its incoming operends.
Cese 2: Trensient end Intermittent Feults
If e trensient or intermittent feult occurs in 231 (1(1(-k-1), g.

illustreted in Teble 3.1.3. the output result of PEk i; velid. In

other words. the control elgorithm for ”1+1 will set this PE to

45

bypess its incoming cperends. However. if en error occurs in PEk,
PEk+l must process its incoming operends. The control elgorithm for
the redundent PEh+1' elso reproduced in Figure 3.1.4. cen be written

es follow:

Begin
if EF-O then
if PI-O then
IDR <-- 18
ODR <-- IDR
UV <-- PI
one-om

elseif PI-l then

AD <-- IDR

ODR <-- AD

nv (mi?

03 (- 008
end if

elseif EFhl then

AD <-- IDR

ODR <-— AD

DV (-- PI

03 <-—- ODR
end if

end;

46

3.3 Overheed Berdwere end Timing Requirements

As discussed. in order to design e k stege feult-tolerent
erithmetic pipeline processor using the concurrent error detection
technique. h+l PE's ere required. Additionelly. k-l comperetors end
corresponding interconnection lines ere required. If it is essumed
thet eech comperetor end its interconnection lines occupy (l/m)th the
chip eree consumed by e single PE. denoted es APB, the overhggd
herdwere requirements (088) es s function of k end m cen be formuleted

epprorimetely es:

OﬂB(k.m)' [(k+1)+((kr1)/m)]/k. (3.3.1)

The verieble m in Equetion 3.3.1 cen be chosen so thst it includes the
ertrs herdwere required for the two TSC PE end the multiplerors used
to reloed the dete into the rightmost or the leftmost PE's. A plot of
the overheed herdwere requirements es s function of k end m is shown

in Figure 3.3.1.

The overheed timing is given by the edditionel time required to
reloed eech input operend into the first stege of the pipeline.
Therefore the generel timing of the pipeline must be modified. The

number of clock periods (Tk) required to process n operends on e k

47

stege pipeline is in the usuel cese given by:

n'k+ n-1, (3e302)

T30 number 0‘ clock 9011048 (kat) required by the feult-tolerent

pipeline to process n operends is given by:

A plot of the overheed efficiency end throughput of the feult-tolerent
pipeline es s function of n end 1 is shown in Figure 3.3.2. It is
evident thet the throughput is never below thet of the seme pipeline
designed with the 3880 technique. It must be noted thet 3880 is
cepeble of only detecting errors. Error recovery using 330 might
require lsrger overheed herdwere end might not meintein the VLSI
ettributes such es locelity of control end modulerity of the error

recovery herdwere.

Lommmooca xenon ucmcopoulupzmm
Pm:o_m=mswu-mco on» not Ae.xva:o acmemcwacmn «rezone; camsco>o ugh ._.m.m mczmwm

 

48

OF op op or or op op
or 0— m. «p . m _ N; _—
J - d - _ _ _
----unnw----.----nuuu---------unuw---------unuu--:------nnuwmwwwnlll..1 1o._
:-:----,/-
.I// a”
I. 5
Jay. I
’u
z
(w
7..
I.
3
O 1:
ooopuv. I..111..| .,
,
oopux .............

 

25.11113... Eizzo

49

.N.F.m .mvm to mcvpmaea
pcmcmpou-u~=mp on» ma paqsmzonzp can aucmrovymu ash .N.m.m we:m_m

OOOH OOH Dd

 

x\: _ .. _ . _

|'|‘l|||l|I'le-‘l"ll"|l‘I‘Il'-"||l'l‘ll0.

1

ml- ” J 10 L
I

a» \ «cu 3296.2.— mﬁ—SE .3253. um:

np\ r.» pansmsoczh acrpmara ucmcmpou-upsmm up:
Am.wav mmmnum to ransaz ax

Amxmwuv mucmnoao mo napsnz u c
N: 11:.H N3\—3 u

 

mucmwurcmm mcFFoawa napsmma

PC NC \PC

zucm_uwmcm oc_9mawa ucmgmpopuppzmm

50

3.3.1 Feult Coverege

The feult-tolerent pipeline structure of Figure 3.1.2 is cepeble
of detecting end correcting ell trensient feults occurring during the
entire operetion of the pipeline end up to k+R permenent feults. where
I is the number of redundent PE's sdded to e k stege pipeline
structure. In generel for e k'+ R stege pipeline up to (k-l).Thft,
where kat-h+l+2n-l. trensient feults cen be detected end corrected.
Iultiple trensient feults occurring in k previous steges of the
pipeline cen be detected end corrected in only two clock cycles (i.e.

one cycle for detection end one cycle for recomputetion).

In e VLSI environment. e lerge number of PE's cen be implemented
on e single chip (k>10') end very lerge number of operends cen be
executed by erithmetic pipeline processors (n)10‘). Therefore it cen
be seen thet for such en epplicetion. the degree of trensient feult
coverege offered by this concurrent technique is tremendous (i.e.

0(1o’n.

As en epplicetion exemple. in the next section the error
detection end correction elgorithm is epplied to e systolic errey used

in Discrete Fourier Trensform (DFT) computetions.

51

3.4 Lineer Systolic Arrsy for DFT

An n-point Discrete Fourier Trsnsform (DFT) problem [6 l is
d°fimd .‘ £01.10": 'iVGn (‘0s.11eees.n-1)e 60-pin. (YOeYleeeeeYn_1)

defined es:

-1
Ii. 2’ ‘j'ij (3.4.1)
3'0

where w is en n-th root of unity. The n-point DFT problem cen be

viewed es thet of evelueting the polynomiel

‘n-l‘n-l + ‘an‘n-l + ... + le+eo (3.4.2)

et 1' 1. w. w2. .... 'n-l.

Bquetion (3.4.2) cen be rewritten es:
("'(“n-l‘ + ‘n—Z)‘ + en_3)r + ... + e1): + so (3.4.3)

The recurrence formule cen be es follows:

Let 2§°’- an,1 (i=0 to n-l) (3.4.4)

52

git). Y9-1)“ + .n_1_k (x=1 to ne1) (3.4.5)

’1‘ Yin'l) (3.4.6)

The besic PE is illustreted in Figure 3.4.1. A full errey consist of

n-l besic PB. Figure 3.4 .2 illustretes the structure of the lineer

systolic errey for DFT computetion.

According to Bquetion (3.4.1). the velues for 11'. for the cese

of ad ere es follow:

3

You-2 .j'Oj. no + el + e2 + e3 (3.4.7)
1'0
3

11.2 gjmj' no + '1' + gzwz + m3w3 (3.4.8)
.i'0
3

12.2 gjwzj- mo + glqz + ‘2'4 + g3'6 (3.4.9)
1'0

3
Y3!- 21-31.31. so + elw3 + £2w6 + e3} (3.4.10)

in

in

Figure 3.4.l.

53

 

out

 

 

 

out' in

I
..<

Yout:- inxin + a

Basic PE for linear systolic array used in
DFT computation

54

 

 

 

 

 

 

 

 

 

 

 

 

 

I -——- ——H 1——o- .. —. L———-.- 0
1 l
an-2 a"-3 a0
12 . 1——.° -—- —’ O
Wn-l s we 1 '—'. I]
an-l’ ’ an-l’ a'n-1 “’12

Figure 3.4.2. The structure of the linear systolic array for

DFT computation

55

3.4.1 Linesr Feult-Tblerent Systolic Arrey for DFT

In this section the error detection end correction elgorithm
described eerlier will be epplied to the lineer systolic errey for DFT
computetion. Thble 3.4.1 illustretes the spsce-time teble of the
normel errey. Figure 3.4.3 illustretes the structure of the
feult-tolerent errey for n94 with sdded comperetor elements. This
errey consists of n-l-3 besic PE's. Bech PE conteins en input end en
output register for temporery dete storege during eech clock period.
Esch PE is losded with two coefficients. The coefficient s: end s1.
e1 end so. end so ere stored in PE.1 through PE.3 respectively. In
other words. PE.1 conteins the coefficients ‘n-2 end sn.3 while PE.2
conteins 'n-3 end sn_z end so on. As illustreted in teble 3.4.2. in
the error-free cese. during the first clock.cycle PE.1 multiplies its

0 end edds the result to ‘n-Z end stores the result

inputs ‘n-l end w
in its output register. During the second clock cycle this result is
fed beck into PE.1 through the Yin input (refer to Fig. 3.4.1). It
18 multiplied it by xin input end the result is sdded to sn,3
coefficient. During this clock cycle PE.2 multiplies the output of
PE.1 by its xin input ('0.1) end edds it to its 'nr3 coefficient.
Since the sums operetion hes been performed by both PE.1 end PE.2
during the second clock. cycle on the seme input operends. PE.1 end
PE.2 will result in the seme output which will be competed by the
comperetor CI for mismetch. During the third clock cycle s new set of

operends ere losded into PE.1; PE.2 operetes on its input operends

using ‘n—4 ceifficient end PE.3 performs the seme operetion using its

56

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

( #0612?
I ‘PE.l PE.2 PE.3 RPE vahd
__. —«- 0
l l
a a a a
2 l 0 0
I2 1"" a] a0 0 o F" 02
I
l
g l
| ..... ---- -'
L--]
R 4M u x j -—J—7 F—j’
7f ‘3 ------ —-~ c c
Input-ﬁberands f] f2
1:--Two--Phase __3:
CLOCK
R: Reload Input Operands
Basic structure for the ith PE:
Input Parity __d_ Output Parity
PI i PO
Xin ”‘T' * T'"’ Xi
PEi out
i . 1
Yin —- (1%],4) —-- Yout
Error Flag

EF

*: The control algorithm is given in Section 3.2.

Figure 3.4.3.

DFT computation,n=4.

The structure of the fault-tolerant systolic array for

57

tn_4 ceifficient end the computed results of PE.2 sud PE.3 ere

compsred for mismetch.

3.4.2 Error Recovery

In cese en error is detected in either P81 or PEi+l during clock
01010 t.. the input operends prior to the erroneous computetion sre
bypeseed to P8i+l end PB1+2 end recomputed during the clock cycle

t Thbles 3.4.3 through 3.4.6 illustrste different chses of

e+l°
feult-occurrence in the errey. The output velidity is checked by the
edditionel perity bit sssocisted with eech PB end is described in

Section 3.1.

58

Table 3.4.]. The space-time table of the normal systolic array
for DFT computation

Input/Time
t] Xin:
Y. :
in
t
2 Xin
Yin:
t
3 Xin
Yin:
t
4 Xin
Yin:
t .
5 Xin'
Yin:
t6 Xin:
Yin:
Output/Time
t4 Yout:
t5 Yout:
t5 Yout:

t7 Yout:

PE.2

PE.3

a3+a2*a1+ao

_ 3 2
Yl- a3w +a2w +a1w+a0

6 4 2
a3w +a2w +a1w +a0

9 6 3
a3w +a2w +a1w +a0

PE.3

2
a3w +a2w +a]

w3

6 3
a3w +a2w +a1

59

Table 3.4.2. The space-time table of the fault-tolerant systolic
array of Figure 3.4.3 for DFT computation.
Error-free case

Input/Time PE.l PE.2 PE.3 PE.4
t1 Xin: l 0 0 O
in: a3 0 0 0
t2 l l 0 0
a3+a2 a3+a2+a1 0 0
t3 w l l 0
a3 a3w+a2 a3+a2+a1 0
t4 w w l l
a3w+a2 a3w+a2 a3+az+31+a0 a3+a2+31+a0
t w2 w w l
5 a a w2+a w+a a w2+a w+a X
3 3 2 l 3 2 1
t6 w2 w2 w w
2 2 3 2 3 2
a3w +a2 a3w +a2 a3w++a2w a3w +a2w +a1w+ao
aIWTao
t W3 w2 w2 w
7 a a w4+a w2+a a w4+a w2+a X
3 3 2 l 3 2 l
t8 W3 W3 w2 w2
a3w3+a2 a3w3+a2 a3w6+a2w4+a1w2 a3w6+a2w4+a1w2
+ +
a0 30
t9 X w3 W3 w2
6 3 6 3

a3w +a2w +a1 a3w +a2w +a1 X

60

Table 3.4.3. The space-time table of the fault-tolerant systolic
array of Figure 3.4.3 for the DFT computation.
Error-free case

Output/Time PE.1 PE.2 PE.3 PE.4
t1 a3+a2 0 0 0
r- """ “l .......
t2 ifafazfit1-133631: 0 0
t3 a3w+a2 .agrazfa1+aq:a§fa2+a1+a0, 0
r--2-----r-- Pcul-o'I ...... _-
t4 Laid :aZrI-+a1-._a3-vlja2w+a]; X .21 a3+a2+a1+a0-Y0
2 .‘"3'"2"'""3"" . .
t5 a3w +a2 .a3w +a2w E a3w +a2w . x
: + . + :
Lf‘lTQ- -:- ftw’fa- ..1
"‘E"‘2‘“:”‘4‘"2"l 3 2 =
t6 Ea3w +a2w +a 3 a3w +a2w +a15 X a3w +a2w +a1w+a0 Y]
----3-.. 1"bm'"1'7"-'3'-‘$
t7 a3w +a2 :a3w +a2w : a3w +a2w'| X
' + l + 3
: .aJ'izf ‘10- ...E...“‘. 333*qu
.—--------1-------—1 6 4 2
. + + a =Y
ts lag“: fa".f31-:-a.3~3‘3+32.w.3-a.u X .3. .2. a‘" * ° 2
P“- " ""‘I""' "‘1
t9 X Ea3w9+a3w6 : a3w9+a2w6: X
. .. : + :
i 3 ' 3 :
:_"i‘1'.’ 1‘10. .. E 311330";
9 6 3 =
t10 X X X a3w +a2w +a2w +a0 Y3

X : Don't care

; : g : compare output results
L----J----J

61

Table 3.4.4. The space-time table of the fault-tolerant systolic

array of Figure 3.4.3 for DFT computation with a
transient fault in either PE.l or PE.2 during the
second clock cycle.

Output/Time PE.l
t1 L
t2
t3 M
t4 Ml
t5 N
t N1
t7 0
t8 01
t9 X
t10 X

X = Don't care

L = a3+a2

L1= a3+a2+a1
L2= a3+a2+a1+a0
M = a3w+a2

M1= a3w2+a2w+a1
M2 a3w3+a2w2+a1w+a0

_ 2
4 2
1 a3w +a2w +3]
6 4 2
63W +62%! +a1w +a0

I
El-qp¥-_-

PE.2 PE.3 PE.4
¢ ¢ ¢
E15 9’ 9’

{ii ......... fﬁ ¢

L - - _ resgmule- _ J
M1 L2 L1
M2 M2 L2
N1 M3 M2
N2 N2 x
01 N3 N2
02 02 x
X 03 O2

0 = a3w3+a2

- 6 3
0]— a3w +a2w +a1

_ 9 6 3
02- a3w +a2w +a1w +a0

M3, N3, 03 NOT VALID

62

Table 3.4.5. The space-time table of the fault—tolerant systolic
array of Figure 3.4.3 for DFT computation with a
transient fault in either PE.2 or PE.3 during the

7th clock cycle.

Output/Time PE.l PE.2 PE.3 PE.4
t] L ¢ ¢ (6
t2 L1 L1 p 95
t3 M L2 L2 6
t4 M1 M1 L3 L2
t5 N M2 M2 X
t6 N1 N1 M3 M2
‘7 ° [’12. - 3'19". - .."3...‘ x
is 01 01 [4.232236% N2
t9 X 02 02 X
t10 X X 03 02

X : Don't care

L, L1, L2

M, M1, M2

N, N1, N2 refer to Table 3.4.4.
0, 01, 02

L3, M3, 03 NOT VALID

63

Table 3.4.6. The space-time table of the fault-tolerant systolic
array of Figure 3.4.3 for DFT computation with a
permanent fault in PE.2

Output/Time PE.l PE.2 PE.3 PE.4
t] L (6 ¢ ¢
t2 1E1: Ii‘ii‘iilﬁ 4 9’
1:3 M {Ll féii‘ﬁﬂte L15 ‘3

.......... ,dr--------u.
t4 vM] -error Ml; L1 recompute L1
ts N FM} 1923335“ 7h“; L2
t6 :rlJ:::.er.ro.r.-t:l§f :: ...... M .1 recompute M.l
t7 0 {"1 252%? "11' M2.
t8 :0; ';‘e?-F6r‘:"oj --------- N .1 recompute N1
.9 pissing? 2;: ~2
t10 X X ........ dquecompute 01
t1] X X X 02

X : Don't care

L, L1, L2

”1’”2
N N refer to Table 3.4.4.

1’N2

. E : compare output results

~------—

mum”

A Concurrent Error Detection and Correction Algorith-
For Paul t-‘l‘olerant Two-Dinensional

Array Processors

Figure 4.1 illustrates an array processor with N-k’ identical
unifunctional E's interconnected by a mesh network without boundary
wraparound connections. PE's of each row are connected in a pipeline
fashion. Data are loaded serially into the left most P8 of each row
and fro- the top' of the structure into each PE. Both the horizontal
and vertical data flow through each PE will result in a new computed
output at each PE. As long as the vertical data flow pattern is not
altered or delayed each row can be treated separately because of the

parallelisn of the design.

64

 

 

VERTICAL

 

65

INPUT OPERANDS

i i

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

PE .’ PE $7 PE 9 PE s
1.1 . l;2__ 1.3 1.4
y A; it A}
3 PE PE 715 FPJ'E—T
u: 2.1 2.2 2 3 2.4
3
r Ay Jr W
'—
é; __JL_1 ...JL_1.
—~ , PE 5' PE 1; PE PE
.4 3.1 3.2 3 3 3.4
<
p.-
E J; I, ‘V AV
ﬁg ' __J ___lL_.1
:: . PE ‘r PE 1' PE % PE { i
4.1 4.2 4 3 4.4
Figure 4.1. The structure of an array processor connected

in mesh without the boundary wraparound connections

66

Therefore the error detection and correction algorithm can be applied

simultaneously to each row of the array processor.

4.1 The lodificstion of The Error Detection Algorithm

Since new sets of data are loaded into each PE both horizontally
and vertically. the error detection scheme of Chapter 3 can not be
applied directly to the two-dimensional structure. Iodifications must
be introduced to take into account the variation of the computed
output of each PE for the same clock cycle. In other words. the
output 0‘ P31 and PBi+1. operating on the same horizontal operands.

are different because their vertical input operands are different.

Consider the first row of the two-dimensional structure with 4
PE's. The cese kr4 is considered for presentation simplicity. Let
the row input (horizontal) data set be A1,,,,,A-, Sinilggly
31.....BI. C1.....C.. 01.....0. and 81.....8. are the vertical
operands loaded into PE1.l-P81.4. respectively. The space-time table

for k-4 is illustrated in Table 4.1.1.

Consider the multiplication operation for presentation
simplicity. During the first clock cycle A1 and 31 gr. logded into
PB1.1 and result in A131 and B1. 81 is transferred down to PEZ.1 and
A131 is loaded into the horizontal input of PEl.2. During the second
clock cycle A131 is reloaded into the horizontal input of PEl.l. The

inputs A2 and 82 are therefore delayed for one clock cycle. Instead,

 

 

67

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

SWITCH CELL
v E R T I C A L I N P u T o P E R A N o s .
1 ' '
: I l I. «’0’:
e iv 7 , '7
l I 2
#1_ T1" J SWITCH
s s s s
1 2 4
+71. ..... “17151
- ..., 2 ‘* 0 Two-PHASE CLOCK
i
_ __. __gr"-""E.
1 P E ‘$* P E ** P E P E **‘* R P E
I
,. 1.1 1.2 A, 1.3 .. 1.4 1.5
r ‘ ‘T“ l T.
R , . (15c) T'
q ' f ‘ r
L-t----1 . r .
’I’ I 4 1 ‘F i a I
l I ' l ' ' e
MUX : . * .
HORIZONTAL 1 , . ' I 1
INPUT L.....- C1 2 .2 ' 3 1 .
OPERANDS v . ' z .
l
Two-PHASE -..:- - ..- -'- — -- 1‘-,- - - -- . :
CLOCK 1 1 ' 2 1
v i i i v
1 o N E x 1 R o w

R: Reload Input Operands

Vertical Input Operands

Input Parity Bit

From Vertical Input
Register of PEl.(i-l)

Horizontal Input+.

Operands

 

4

+-

 

P E
1.1
(i#1,5)

 

 

Error Flag J .5

EF

Figure 4.1.1.

0
Vertical
to next

Output Parity Bit

To Vertical Input Register
of PEl.(i+l)
Horizontal Output
Operands

Output
row

The structure of the first row of the array of Figure

4.1 with added switching and comparator elements.

68

Vertical Input

 

 

 

 

 

Operands
Input Parity Bit ...”. P E l . i ....Output Parity Bit
(P1) (P0)
( i # l . 5 )
Vertical Transfer Bit. Vertical Transfer Bit
VTB ‘ , ""' VTB
and REGISTERS. and
Vertical Output Transfer HIR Vertical Output Transfer
(V07) . VIRl (VOT) .
from PEl.(i-l) VIR2 to PEl.(1+l)
HOR
Horizontal Input ——.» VOR .4..Horizontal Output
Operands Operands‘
Arithmetic Unit:
AU
.'
Error Flag :
(EF) o

Vertical Output
Operands

HIR: Horizontal Input Register
VIRl: Vertical Input Register 1
VIRZ: Vertical Input Register 2
HOR: Horizontal Output Register
VOR: Vertical Output Register

Basic PE structure for the fault-tolerant array of

Figure 4.1.2.
Figure 4.l.l.

69

Table 4.l.l. The space-time table of the fault-tolerant array
of Figure 4.l.l
Input/Time PE l.l PE l.2 PE l.3 PE l.4
t1 Al’Bl O O 0
t2 A1314:1 A181,C1 O 0
t3 A2,B2 A1B]C],D1 A181C],D1 0
t4 A282,C2 A282,C2 AlBlClDl’El A]B]C]D],E1
t5 A383 AZBZCZ’DZ A282C2,D2 AlBlClDlEl
t6 A383,C3 A383,C3 A282C202,E2 AZBZCZDZ’EZ
t7 A484 A3B3C3,D3 A3B3C3,D3 AZBZCZDZEZ
t8 A484,C4 A4B&,C4 A3B3C3D3,E3 A3B3C303,E3
t9 X A4B4C4,D4 A4B4C4,D4 A383C3D3E3
t10 X X A4B4C4D4,E4 A4B4C4D4,E4
tH X X x A4B4C4D4E4
Output/Time PE l.l PE l.2 PE l.3 PE l.4
t2 AlBl O O O
t {EEC-""1156? o o
3 ._ 1-1.1 ...... 1. 3-1..
t4 A2'32 153391.51: - :Aj I$16151] 0
ts 17152152” f i 7.423;?" 3 [6339203333 E1303
.6 x
27 1.22.23.1111139'233 11.21.292.22: 1122222225
t8 A434 5523393.“; - - - 53333303.: X
1:11 x x x- A a C o E
4 4 4 4 4

70

ﬂ" vertical input Cl is loaded into the vertical input of $1.1.
This requires switching elements in order to load either B's or C's
into $1.1. (It is shown in Table 4.1.1 that these switching elements
can be controlled by a two-phase clock). Since A131 “4 c1 are 10.4.4
into $1.1. during the second clock cycle the output of $1.1 will be
qnlcl. The output of $1.2 will also be MBicl- These two outputs
can be compared at the end of the second clock period and mismatch
errors can be detected. Note that regardless of the arithmetic
operation performed by each $ the error detection will still be
valid. since any operation by adjacent unifunctional $'s on the same

input operands must result in the same outputs.

Darius the third clock cycle. A; and a; are loaded into 1131.1.
‘131‘31 is loaded into $1.2 and $1.3. 01 is loaded. via the switch
element. into $1.2. Both $1.2 and $1.3 will result in outputs
labeled as 5310101. These two outputs are compared for a possible
mismatch. Figure 4.1.1 illustrates the structure of the first row of
the array processor of Figure 4.1 with additional switching and
comparator elements. Table 4.1.1 shows the space-time table of the

structure of Figure 4.1.1.

In general the algorithm uses the inherent redundancy of the
pipeline structure by reloading the first horizontal operands back
into the leftmost $ and. during each clock cycle. comparing the
computed output of adjacent $‘s along the pipeline operating on the

same input Operands. From Table 4.1.1 it is noted that the ith

71

vertical input operands are switched every other clock cycle into
their adjacent (i-l)th PE. It is also noticed that because of the
vertical flow of the operands only one row of switching elements at
the top of the structure is needed. Similarly the same steps may be

applied to the other rows in the array.

4 .2 Error Recovery

In the error-free array operation. the horizontal operands are
loaded into the horizontal input register of each PE. Similarly the
vertical operands are loaded into the vertical input registers and
depending on the operation (i.e. processing or bypassing vertical
operands) the processed results are stored in the horizontal and
vertical output registers. Every PE contains two registers labeled as
V131 and V132 for the vertical input data. The vertical operands are
first loaded into V132 and then moved to V131. The arithmetic unit
uses the contents of V131 for arithmetic operations. If an error is
detected in P31 during the clock cycle t.. the contents of V131 of P31
is saved. In other words. the contents of V132 will not be moved into
V131. At the end 0f t‘. the contents of V132 is transferred to the
next row in the array. This guarantees the consistency of the
vertical data flow in the array structure. At the same time. the
contents 0f V131 0f F81 is transferred to V131 of FEi+1. The error is
recovered by 951 and PEi+1 during the clock cycle t‘+1. Because of
the ocurrence of an error in P31 during t‘. for horizontml data

consistency. PBi+1 transfers the contents of its V132 to V131 of

Table 4.2.1.

Output/Time

72

The space-time table of the fault-tolerant array
of Figure 4.l.l with a transient fault in PE 1.2
during the second clock cycle.

PE l.l PE l.2 PE l.3 PE l.4 PE 1.5

A13] compare ...error
1' ...... T T - Tl
L§I§JSJ_ __61Elfli recompute and compare
[- ..... - - - - 1
A333 AZBZCZDZ AZBZCZDZ ATBTCTDTET X
A33353 A333C3 A232320252 A232C20252 A13190151
X X X X A B C D E

4 4 4 4 4

Table 4.2.2.

Output/Time

73

The space-time table of the fault-tolerant array
of Figure 4.l.l with a transient fault in PE l.4
during the 5th clock cycle.
PE l.l PE l.2 PE l.3 PE l.4 PE l.5
O
O
O
. . O
: """ eFFGr """ j
A B C A B C :A B C D E A B C D E1 O
2 2 2 2 2 2 “LJ-lj-l1:121:14:l’réEo'm'put'e-"I
A383 AZBZCZDZ AZBZCZDZ LAIBJOJOJE1--AJBJOIO1E11
X
AZBZCZDZEZ
X
A434910454

Table 4.2.3.

Output/Time

74

The space-time table of the fault-tolerant array
of Figure 4.l.l with a multiple transient fault
in PE 1.1, PE 1.2, PE 1.3, PE 1.4 during the 7th
clock cycle.
PEl.l PEl.2 PEl.3 PEl.4 PEl.5
error error
A333 AZBZCZDZ AZBZCZDZ recompute
r ' "' "' ' ‘ ‘ "" ‘1 r """"""""
.L 11.313353“ 13.31%; LA g13.2%sz2 A 282C} Z/Pii
.......... 2 F- ---- -----
A434 LA3B§C§ - :A§B3C3; LA§§?€2?ZF§_.AEP?€ZRZEZ:
A4B4C4 A484C4 A3B3C303 A3B3C3D3
X A4B4C4D4 A4B4C4D4 A333C3D3E3 A3B3C3D3E3
X X A4B4C4D4E4 A4B4C4O4E4 X
X X X A484C4D4E4 A4B4C4D4E4
recompute
and

compare

Table 4.2.4.

Output/Time

75

The space-time table of the fault-tolerant array

of Figure 4.l.l with a permanent fault in PE l.2.

PE1.1 PEl.2
ff} ........ ,
L“. 5131. - E13151).
A232 §A1B1C1
923252.... : :2}. 2‘...
A333 IAszcz
1A.; 8:32; :13 525:
A484 EA3B3C3
933451 ifs-571.62 i:
x :A4B4C4
x x
x x
x x

X : Don't care

PEl.3

“13151:
..--J

I
Aszczt
..... J

PEl.4

A131 1

A131C101

C

A232C2
A232C202

A333%
333°3

A D

3
A43491

A434C404

. : All comparaisons indicate error

PEl.5

A B C 01

1 1 1
A131C1D1E1

A232C202

AZBZCZDZE
A

333C303
A333C3D3E3

A4B4C4D4

2

A434C4°4E4

76

ll... Illlliﬁullllliilll.llllilliill-II-

mm.. I . ems .N . aha .u . em; .2 um .

. .
. .
"1H” <Hmi <meuxnx <Hmd <meuzax <Hx_ <Hmwuznx <Hmd <Hm~ H ocaecﬁx "
Fl ' .l | .l l l l I I ' r I l: l | I l l l |.' C l I. I.’ ' l- - 'J ‘U I. '2 I l- - ' ‘Itll I -l 'l—l-‘ "ﬂldaml*
_ n . _ . _
-I 1.pITni..-111-.-u-1F 1111111111 11-.Ju--1 111.. 1

has . _ _ --J--. ...... _

_w---.--1-4%m1:1-w-1-mii-1..-!- . _ ¢
n has or mm H d n I . u - 1---- ...-U u .

1 _
_ d...dn. _ >dw.n. _

_
. . . N .
---- ---!i--n-------. ....... - ....... ..----- -----11---_
_
.DN mm ow “>.mnn. or am “>.m.n. on ms ” n "
1191111111-1-1- -1-Ito...ll-111-141l1lliillloic 11 111-lilo 11.1 1-1 11.. 4.l.l-11-1..-.
. >Nmm “ >dwnnd i h >.msndos " u a u
.- iiiiiiiiiiii l. lllllll ollll Iiiullillilulllllllllllineal-illilluil.
“pmwm am an h>~wm am on ">dmnndc. m. ow ">1 I has n . m _
Ilillolillill llllll 111ll|1l|l-i liillllllli Ililllillullllil-..
n ......... --_--- ------ .-ﬂ--..1--1 1- -..--..---- --1--..J--..--
.>w we no .>~m~nm am an .>~ Nam ON mN .>im.nd and d mw _ ”
r-w4%11--1.ffl.wym a -.f-nu-m-m.i- u-ur.IZ.W1m 11--II.-1--w-i.-i"
.
rim-Au- --..I”---.~.~--w----r----N--m----M--..-i-n-_-w-------{-.--MCI--
. u . >~ N New mm on ">wanwow- mm 1 u

(A)
--4--'--‘---

qmcdm n.~.m. qzm mcmnm-anam "maﬁa om Ham wreak ow «macaw a.d.d. madcmaxmanzc arm smodmﬁmx
nxmzmmmx oumwmnso: om nonunincmaa pm.m 3: «ya mxxox-«xmm nmmm.

77

''''''''''''''' III||IIIIIIIIIIIIIIIIIIIII'I’IIIIIIIIqIIIIIII—

_ a m n . i . a m . . N . e m d . a . e m a . s . m e m i . an Ass. _
.12» <Hxn <Hz~ .d :2» <2». <meu In» <2». <Hmmu In» <Hms <Hx~ q. ocqecq . _
lllllllllllllllll IIIIIII.IIIII|I..II..1IIIIIIIIIIIIIIIIII-III..III-I-
. . . . . .

_>d as or n n . .

. u . - .

- >.—w._ . . . _ .—

amu.m n.~.m.

 

'I-‘I"-'-"J .

 

 

 

 

 

 

 

--—’q_--_-_----—_-——

 

 

 

. N
.
.

. A

an . .
. .
. .

..-.-..wwammw . >.ain a. . >.w.n.cd . . s u
. . VIII. . .

. >~m~n~ cm on . >~m~n~ on mm H >dwindod md m» . _

. . . . .

. pmmNnNON ” >~w~n~c~ ” >dmdoiodmi ” . m _

..I llllll -II'I. .................... . IIIIIIII IIIIII . .

. . _ .

H . >Nm~n~o~ "N cw . >~mmnmo~ mm d H psa.nio.md . .

. . . .

H . >Nm~nmcmmm . >~m~nmc~m~ u n m a

. . r --..u-.. 1 -..-.- --..- --.-I ..... . 111111111 ..-- -1. _

_ u " >~w~nmo~m~ a mu " >Nmmn~cmmm _ n

u . n . " >~m~n~c~mN m . V "

alllll'l-‘llll|llsllt'-al-l' -

I
I
I
I
I
1
l
l
I
1
I
I
I
l
I
I
I
I
I
I
b
1
I
I
I
I
l
1

43m mumnm-amam nmcdm oﬁ azm mxsmk ow «Accra a.d.d. Addcmnxmadzn ﬂaw wmcnmamx
«emammmx ocmxmunozm om nozmﬁwncmzﬂ pm.m riﬂ: m nemamhmzn «main is vmd.~.

78

93112. This horizontal data transfer nong adjacent $'s propagates
through each row of $'s in the array. and guarantees the correct
execution of the error detection and correction algorithm. New
vertical operands are processed or bypassed to the next row of $'s in
the array. This guarantees the correct transfer of vertical operands
to all k-i rows of the array if an error is recovered by the i“ row.
The register transfer level control algorithm to: the 1“ $ (um)

can be written as follow:

Begin
if EF-O then

if VTB-0 then
313 <-- BI
V132 <—- VI
V131 {—- V132
All <--- 313.an
P0 <—- iﬁ
303 <-- AD
V03 <-- V131

el seif VTB-1 then
313 <-- 31
V132 (-- VI
V131 <-- VOT
All <-- 313.le
P0 <-- F1

303 (“All

79

V03.VOT <--- V131
end if
elseif EF=l then
313 (-- 313
V132 <-- VI
V131 (- V131
VTB <-- 1
All <--- 313.V131
PO <-- Pl
303 <-- AD
V03 (-- V131
end if

end;

The space-time table of the first row of the array of Figure 4.1.1 for
this error detection technique is illustrated in Table 4.1.1. Tables
4.2.2 through 4.2.4 illustrate the space-time table of the

fault-tolerant design recovering from different cases of faults.

For the twordimensional array. each row can be treated as a
separate pipeline as long as the vertical data consistency is
guaranteed by the error detection algorithm. Since the algorithm is
based on comparing sets of input and output operands. it can be
extended to all structural configurations including diagonal data flow
in a two-dimensional array. In that case. PE's along the diagonal

connections can be treated as pipelines. The diagonal output results

80

of adjacent PE's can be compared for mismatch. Error recovery can be
achieved by adding redundancy along the diagonal PE's and can be

executed in a similar manner as before.

4.3 Overhead hardware and timing of the two-dimensional design

The two-dimensional fault-tolerant array processor uses (h+l)’
33's to recover from faults in 3’ stages. Also (k-l)’ comparators and
(3) switch elements are required. For the array of Figure 4.1.1 if
the area required by each comparator. switching element and their
corresponding interconnection lines is (llm)th the area of a single PE

(APE) the overhead hardware requirement as a function of k and m can

be formulated as:

033(k.m)' IIh+1)’.APE+(((k-l)‘.ApE)/m)+((k.APB)/m)l/k’.A?E

= Iml)‘ + ((k-i)‘+k)/n)]/k’. (4.3.1)

A plot of 033(k.m) is given in Figure 4.3.1. Due to the parallelism
of the design. the overhead timing is determined by the overhead
timing of a single row of the array which is the same as the

one-dimensional pipeline structure described in Chapter 3.0.

As an application example. in the next section. the
two-dimensional error detection and correction algorithm is applied to

a VLSI structure used for the multiplication of two dense matrices.

8l

commmooua mango acmnmpou-upamm

_mcowmcmswu-ozu esp no» Ae.xvm:o acmemnwacmn neurone; ummgcm>o on» .P.m.¢ mczm_m

mﬂp ed, mop _ pd—

 

_
_ _ m . _ _

 

 

ooopux

873. .............

 

82

4.4 A Fault-Tblerant VLSI Pipelined Array for Matrix lultiplication

In this section the two-dimensional error detection and
correction algorithm described in the previous sections is applied to
a VLSI pipelined array used for the multiplication of two dense

matrices [48].

Figure 4.4.1 illustrates the structure of this array for the
multiplication of two 2x2 dense matrices. The elements of matrices A
and 3 are fed from the lower and upper input lines in a pipelined
fashion. one skewed row or one skewed column at a time. Some dummy

zero inputs are interspaced with the matrix elements.

"’ " T '7
‘11 ‘12 F511 I>12 K=11 °12
A x B I X =
'21 ‘22 1,21 b22 °21 °22
— a - I- "I (4.4 .1)

 

 

 

 

 

 

where the product coefficients :

2
cij' 2g:ik.bkj for all i and j (4.4.2)

In general. to multiply two nxn matrices requires n(2n-1) lultiply and
Add Cells (IAC). Figure 4.4.2 illustrates the structure of the VLSI
array used for the multiplication of two 3x3 dense matrices. Thbles
4.4.1 and 4.4.2 show the space-time computation steps required for the

arrays of Figure 4.4.1 and 4.4.2 respectively.

83

 

U930.
II II II
0'00

Figure 4.4.l. A VLSI array for pipelined multiplication
of two 2 x 2 dense matrices.

84

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 4.4.l. The space-time table for matrix multiplication using
the array of Figure 4.4.1.

Input/Time I1 12 I; 1;
t1 an 0 b1] 0
t2 0 a12 0 b2]
t3 a2] 0 b2} 0
t4 0 622 0 b22
t5 0 O O O

Output/Time O1 O2 O3 O4 O5 06
t1 0 O O O O 0
t2 0 allbll O O O O

a 6
t3 anb12 O a2lbll 0 ll ll 0
a12b21
a b a b
t4 0 amb12 O ll+l2 O 21 ll
a b
alzb22 22 21
t5 0 o o o a21b12 0
+
a22b22
t6 0 O O O O O

 

 

 

 

 

85

o
o g 833
bl3 23
o b
o b 032
b 22
12 o b
o 31
b b21 0
I 1] I 0 g 0
I1 I2 I3
01
o
o o c13 o o
02
O O c23 0 C12 0

C33 0 C22 0 C11

9-
9.?

 

I'lill'l‘%‘ o c o c o
32 21
I I 05
O O 0 c3] 0 0
I1 I2 I3
an 2 O
o 12 0
a21 0 a13 C
33- 322 3
23
0 a32 O
O 0 a33

Figure 4.4.2. A VLSI array for pipelined multiplication
of two 3 x 3 dense matrices.

86

Table 4.4.2. The space-time table for matrix multiplication using
the array of Figure 4.4.2.

 

 

 

 

 

 

 

 

 

Input/Time I1 12 13 I; I; 1;
t1 an o 0 b1] 0 0
t2 0 a12 o o b21 0
t3 a21 0 a13 ”12 0 ”31
t4 0 a22 o o b22 0
t5 a31 0 a23 ”13 0 ”32
t6 0 a32 O 0 D23 0

Output/time O1 O2 O3 O4 05
t6 0 0 cn o 0
t7 0 C12 0 c2] 0
t8 c13 0 c22 0 C31
t9 0 C23 0 c32 O
tlo o 0 C33 0 o
tn 0 o o o o

 

 

 

87

4.4.1. The Fault-Tblerant VLSI Array for latrix Multiplication

Figure 4.4.3 illustrates the structure of a fault-tolerant VLSI
array for the multiplication of two 2x2 dense matrices. Thble 4.4.3
illustrates the space-time input and output pattern for the structure
of Figure 4.4.3. According to the error detection and correction
algorithm described in the previous sections. the elements ‘12- .22,
b21. 322 of the matrices A and 3 are loaded via switch elements 8 and
S' placed at the bottom and the top of the structure into the inputs
11 and 1'1. The computed output of the PE's (in this case lAC's) of
the first column are reloaded into these PE's and the subsequent
results of adjacent PE's in the same row are compared for mismatch.
In case of error the error recovery is achieved according to the
algorithm by bypassing the correct operands prior to the erroneous
computation to the subsequent PE's along each row and results are
recomputed during the following clock.cycles. The data consistency.
the correct timing and the final output validity are guaranteed by the
algorithm. Tables 4.4.4 through 4.4.7 illustrate different cases of
fault occurrence and error recovery for the structure of Figure 4.4.3.
An architecture for a fault-tolerant VLSI array used for the
multiplication of two 3x3 dense matrices is shown in Figure 4.4.4.
The error detection and recovery is similar to the array of Figure

4.4.3.

88

 

 

 

   

I
Q I . ;\ ’
L if- -3:!--.. -‘ Redundant
I O I ' PE's
1 H.-2--.---I
bottom "II, {‘ ' j b a
switch: ix: : : S :
cell L_I_‘i'_: ‘f-f"
al11 0 -c d '
parity parity
0 312 in OUt
recom ute d=c+a.b
32] 0 p a=a
0 322 b a b=b
{OI-It
11-9 “@09- I2
Il=12-out=0

111‘12-out=l

Figure 4.4.3. The fault-tolerant VLSI array for pipelined
multiplication of two 2x2 dense matrices.

89

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 4.4.3. The space-time table of the fault-tolerant VLSI array
of Figure 4.4.3.
Input/Time I1 12 I1 12
t1 aH O bll 0
t2 a12 ”12 ”21 ”21
t3 ”21 ° ”12 0
t4 ”22 ”22 ”22 ”22
t5 0 o o o
Output/Time O1 O2 O3 O4 O5 06
t] o o o o o 0
t2 0 anbn o o o o
r"-"1 r" -'-1
t3 ”11”12 I”11”11| ”21”11 0 :”11”11I 0
I +-, 1 : +b :
6.12.2.1.-- - - - -- - - - - "-112.21;
---. . , ..... , ,
t4 u”11”12: ”21”12 ”21511 ;”11”12l X ”21511
' +b I +b . +b : +b
!”12 22: ”22 21 L” 2 22- ”22 21
--'!_": ____________ '_T"" ‘
---” I'_-iil
t5 0 1”21”122 0 X |”22”22' X
: + I E + I
L”22.”2.2_:--------------1_”21.”12:
t6 0 o o o x o

 

 

X= Don‘t care

 

 

90

Table 4.4.4. The space-time table of the fault-tolerant VLSI array

of Figure 4.4.3 with a transient fault in MAC.5 during
the third clock cycle.

I
I
I
l
I
I
I
I
I
J

. I"

Output/TTme OI O2 O3 O4 O5 O6 : O7 08 09:
t1 0 o o o o o : o o o 1
I I
I I
t2 0 R1 0 o o o . o o o .
r--1 r-Tﬂ ' 1
t R, I‘ 1 R 0 1C ' 0 ' 0 0 0 1
3 ‘ till- -3- max-“€13 : r__1 -
t C 9 R C C . . C . R C R '
4 1.. 4 21 12 ,_ 1_1,___2_1_r€9C Gamay; 3 1
t5 0 ”22 0 X ”22 X : ”12 ”11 ”21:
I
t6 0 o o o x o 1 x C22 x ,
I
t7 0 o o o o o 1 o x o -
I. _ - _ __-_ ___l

OUTPUT

STAGE

X : Don't care

R1= ”11”11

R2‘ ”11”12
R3‘ ”21”11
4

”21”12
”11‘ ”11”11+”12”21

”12‘ ”11”12+”12”22
”21 ”21”11+”22” 21

b +a b

”22‘ ”21 12 22 22

9l

Table 4.4.5. The space-time table of the fault-tolerant VLSI array
of Figure 4.4.3 with a transient fault in MAC.6 during

the 4th clock cycle.

Output/Time O1 O2 03 O4 O5 O6 : O7 O8 09 I
11 O O O O O O 1 O O O 1
| I
I
t2 0 R1 0 O O O : O O O ’
I

t C R IC-- C x :C-l 'R C R '
4 12 4 ~21..:__12.e-.9-.-;21;: 1 -- .3
t O C O x C C “,1 C x C 1
5 22 22 13-1-11- 13169011203” 31-”
t6 0 O o O x o 1 x C22 '
t7 0 O O O O O : O x O :
A ......... 4

OUTPUT

STAGE

X : Don't care

R , R , R , R
1 2 3 4 } refer to Table 4.4.4.

”11’ ”12’ ”21’ ”22

92

Table 4.4.6. The space-time table of the fault-tolerant VLSI array
of Figure 4.4.3 with a permanent fault in MAC.5.

I
t1 0 o o o o o I o o o .
| I
t2 0 R1 0 o o o 1 o o o ;
--. I

t R {6"l R 0 5C . o : o o o '
3 2 3-4-‘3. - .3 -93"?! -2111] 3 .---. 1
t4 C12 R4 c21 C12 L 11; C213 Rl '6 : R3 ;
r--r Tel-7 _x. .recgmputext." C .
t5 0 [3ng ° erﬁor I__ g2; . 12 21 .
t 0 0 " 6""b' 'T “'F 0 l x ff'" X '
5 , 3.3-3. .. - 199933.99? 32-23 :
t7 0 o o o o o g o x o .
-. .......... J

OUTPUT

STAGE

X : Don't care

R13 R23 R33 R4

} refer to Table 4.4.4.
113 C123 C213 C22

C

93

Table 4.4.7. The space-time table of the fault-tolerant VLSI array
of Figure 4.4.3 with a multiple transient fault in
MAC.4, MAC.6 during the 4th clock cycle.

Output/Time 01 02 03 04 0

| I
5 I 8 9 I
| I
t1 0 o o o o o : o o o g
| I
t2 0 R1 0 o o o . o o o .
I 1
t3 R2 Cn R3 0 cn o : o o o ,
r--. r-ﬁ "“' r'-'l '
t4 :Cizg R4 {9%}: [FT : ° Lgfjju Rl C11 R3 .
-~~- ---= -=:1: ....... n u
t5 0 [C22 0/7 9-132.-.ng :9 .:-L'E X 323-1:
t6 0 1' o ’,/6 o x 70 g )5 , C22 x I
t7 0 i /’ o o o I o,” : o x o 3
T o
v3
error rec°mpute STAGE

X : Don‘t care

R],R2, R3, R

113 C123

4 refer to Table 4.4.4.
c22

C C

21’

94

 

 

 

Top
Switch Cell MATRIX B
(III? 3":11 Fir“. + ,I’Out
I 'I | ' I-
; LI: : : S] : . $2 : 11-9 ¥‘E)<-912
- -- I. - J I. - J
I1 I II -‘"""‘""g I :I *0Ut=0
I I I I->out=l
Ol 06 3 011 ’i“ 0r1 1 2
0 . ' MAC L...
' ‘ \ l6 *"
. I f I
I I: TI
0 ’U‘ o
/ \
0 MAC ‘2. MAC I432
17 I"’
v at ’
. Ii , T’f
I ‘vk
\ r3
0 MAC ‘ ,
‘8 1::-
I']’
L . I
‘ o I'vL
0 l4,’ MAC ‘. r4 I»! x
‘9 if: ‘r-
.1 , *3 3"
(”I ,t-.-.’
1.1 I I
‘ 0 V I
0 l5 MAC I ”_rS Redundant
20,5-.. PPS
a j“'
l ‘ ‘
....... -J
r.I_I_' . I b a
I ‘ I I
l :\\l I
L: If} c d
arit .
f33m P i" Y pgztty
om
- MATRIX A recompute
Switch Cell b a d=c+a.b
a=a
b=b

Figure 4.4.4. The VLSI fault-tolerant array for pipelined
multiplication of two 3x3 dense matrices.

CHAPTER 5

Functional Transitions In Adjacent PE's

In Chapters 3 and 4. a concurrent error detection and correction
algorith- for parallel-pipeline arith-etic array processors was
described. The inherent processing elenent (PB) redundancy of the
array architecture was used to devise the error detection and
correction algorithn. This ache-e is based on reloading the operand
sets into the left-oat PE of the pipeline structure. By causing a
data retardation the output results of adjacent PE's along the
pipeline Operating on the sale input operands can be conpared and the

occurrence of faults in those PE's can be detected. Error recovery is

95

96

achived by bypassing the correct operands prior to the erroneous
computation to the subsequent PE's for recomputation. A.major problem
arises when adjacent PE's perform different arithmetic operation on
the sane input operand sets. This case often used in architectures
such as systolic arrays for matrix analysis routines will disrupt such
error detection and recovery algorithms. In this chapter. the
fault-tolerant array structure is modified to overcome problems
arising from functional transitions in adjacent PE's. The error
detection and recovery is achieved without degradation in a similar
manner as described earlier. Finally. the overhead hardware and

timing requirement is determined.

5.1 Functional Transitions In VLSI Structures

lost systolic array architectures used in applications such as
matrix analysis routines use groups of unifunctional arithmetic PB's
defined as IAC (Inltiply Add Cell) and DC (Divide Cell) to achieve
fast computational capabilities. Depending on the requirements of the
implemented algorithms. IAC's and DC's are arranged in groups and
interconnected locally. The number of each type of cell per group is
determined by the number of computations required by a specific
algorithm on each input Operand. Because of the large amount of data
processed by high-speed systolic arrays and the near future VLSI
implementations of these structures it is essential to design
fault-tolerant architectures. The inherent PB redundancy of systolic

arrays can be used to devise fault-tolerant techniques based on

97

concurrent error detection and recovery algorithms. It is highly
desirable that such techniques not tremendously degrade the array
throughput in an error-free case. Furthermore if an error is detected
the structure must be able to recover using minimal overhead hardware

and timing.

A concurrent error detection and correction algorithm based on
reloading the first input operand set into the leftmost PB of the
pipeline and therefore causing data retardation in the pipeline. then
comparing the output results of adjacent PE's for mismatch errors. was
described in Chapters 3 and 4. In this scheme error recovery is
achieved by bypassing the correct input Operands prior to the
erroneous computation to the subsequent PE's in the pipelined array

for recomputation.

A major problem arises when adjacent PE's in a pipeline perform
different operation on the same input operand. A nonexisting mismatch
error is detected when the output results of adjacent PE's are
compared. This results in undesired reconfiguration of the input data
and therefore will disrupt the execution of the error detection
algorithm. Furthermore. this inconsistency remains each time the
output results of the PE's in question are compared. As described
earlier. systolic arrays use groups of functionally different PE's.
IAC's and DC's are arranged in groups to implement the computation
requirements of given algorithm. Since local communication is

essential among PE's. the output of the last MAC in the group is

98

directly connected to the input of the first DC. This results in an
inconsistency in the output result of these two PE's when Operating on

the same input Operands.

5 .2 Transitional PE' s

In order to overcome this problem which is caused by the
functional transition between two adjacent PE's. in a one-dimensional
linear pipeline structure the following architectural modification
will permit the correct execution of the error detection and
correction algorithm. The one-dimensional pipeline is considered

first for presentation simplicity.

Consider the linear one-dimensional pipeline Of Figure 5.2.1.
P83.m and PE4.d (ie. PBi.j. isnumber of processes. jsfunction
performed) are defined as transitional PE's. Multiplication and
division Operations are chosen for presentation purpose. however any
different arithmetic operations would work as well. In general any
time the data stream is subject to a functional transition by the
adjacent PE's those PE's are defined as transitional PE's. As
illustrated in Thble 5.2.1. it is clear that the error detection
algorithm fails at P83.m and PE4.d. Let PE3.m and PE4.d be
multifunctional. In other words both PE's can be selected to perform
either multiplication or division on their incoming operands. In the
error-free pipeline Operation P23.m will multiply during one clock

cycle and divides during the next clock cycle. PE4.d performs a

Table 5.2.1.

Output/Time

99

The space-time table for the unmodified array pipeline.
Note the functional transition and the output

inconsistency in PE3,m and PE4,d.

PEl,m

Al,m

A2,m
l,m
B2,m

Cl,m

C2,m

PE2,m

A2,m
3,m

2,m

3,m

2,m

PE3,m PE4,d

A3,m error
r .........

A A
:.5@---Jﬁqj
B3,m A5,d

" """"" I' error
133.3431 - - PM;

: Compare results for mismatch

100

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I
, I
I
PEl,m PE2,m PE3,m I PE4,g_ RPE5,d
3' I m
I"" (TSC) v v - F +9
I
' ‘ l ‘ .
‘ TSC —..-....J
.__, FS I '““' ““
I TEST 0 JCLK A a
R M U X I
I .
.4 a. . c c l I;
Input 1 2 : 3
Operands
l —-J 2 —J I l-J
I ' l
~-Two-Phase Clock- -- :-1
I
Plane of
Functional
Transition
Comparator Cell ‘3:to-l Mux.
Output Output
, , I
Input Input " Reload
l 2
1 I
Clock Input Input
1 2
Clock=l: 1=12-'-0utput=0 R=O : Output=I1
II“2”‘)“tp“t=1 R=l : Output=12

Figure 5.2.l. The structure of the fault-tolerant one-dimensional
array processor with one functional transition along

the pipeline.

lOI

 

Input Parity Bit P E 3 , m_ Output Parity Bit
PI ___n.. d ___. PO
Arithmetic Units AU1
AU2
Input Operands Input DatSRRegister Output Operands
I. —9-. —+—.- O.

'l 'l

 

 

Output Data Register
(ODR)

 

1 Error Flag
Function EF
Select
FS
Begin
if EF=O then elseif EF=l then
if FS=O then 01. .— IDR
IDR .__ I1. AU1 ._ IDR
AU1 ‘— IDR ODR .— AU1
ODR ~— AU1 PO «n— PI
P0 *— PI 01 +— ODR
O1. ‘— ODR end if
if FS=l then
IDRo— I]. “d;
AU2 ~— IDR
ODR +— AU2
P0 «r— PI
CI.i .— ODR

AU1 : Used for Multiplication

AU Used for Division

2 I

Figure 5.2.2. Basic structure of the transitional PE3,g_.

102

divison operation during the entire duration of the pipeline and is
used for error recovery discussed later. Since the input Operand
stream is retarded for one clock cycle by the algorithm through
reloading the first input operand into the leftmost m. from Table
5.2.2. the Operand A Operated on by P82.m during the third clock
cycle. results in the output labeled as A3,. (1... Air! : i-number
of processes on A. j-function performed on A). A3“ output of Wm
and P83.m can be compared for mismatch errors during the third clock
period. During the fourth clock cycle A3“ 1g loaded into both P33,-
and MA. However during this cycle P33.m is selected to perform a
“V1303 operation on A3“. Since RAJ will also perform a division
09.11““! on A3'. the output labeled as “Md of P83.m and mm must
be equal and can be compared for mismatch. During the fifth clock
cycle P83.m is selected to perform multiplication operation. From
Table 5.2.2 it can be noted that P83.m must alternate its function

every other clock cycle.

103

Table 5.2.2. The Space-time table for the fault-tolerant pipeline
of Figure 5.2.l. The error-free case.
Output/Time PEl,m PE2,m PE3,m. PE4,d_ PE5,d
d m
tl Al,m
t2 A2,m A2,m
t3 Bl,m A3,m A3,m
t4 B2,m B2,m A4,d A4,d
t5 Cl,m B3,m B3,m A5,d A4,d
t6 c2,m c2,m B4,d B4,d X
t7 Dl,m c3,m c3,m 35,d B4,d
t8 D2,m D2,m c4.d c4,d x
t9 x D3,m D3,m Cs,d C4,d
t10 X X D4,d D4,d X
tn X X X 05,d D4,d
t12 X X X X X

PE3,m and PE4,g_are Transitional PE's (i.e. can be selected
m
to perform either multiplication or division operations).

X: Don't care

104

5.3 Error Recovery

In case of error. the correct input operands are loaded into
subsequent PE's during the next clock cycle and recomputed. If an
error is detected in PE's which are two PE's away from the
transitional PE's the error will be recovered. However if an error is
detected one PE away from the transitional PB. the transitional PE
must be set to maintain its actual function for an additional clock
cycle so that the error can be recovered. Table 5.3.1 illustrates the
space-time diagram of the fault-tolerant pipeline of Figure 5.2.1 with
a transient error in P82.m during the second clock cycle. During the
third clock cycle A2,. is recomputed by both P82.m and P83.m and the
results are compared for mismatch. Since this error has delayed the
data computation timing by one clock.cycle. during the fourth clock
cycle PB3.m is selected to maintain its previous function
(multiplication) and for comparison consistency PE4.d is selected to
perform a multiplication operation. Both transitional PE's are reset
to resume normal Operation at the begining of the fifth clock.cycle.
Table 5.3.2 illustrates the operation steps for a transient error in
P82.m or PE3,m during the third clock cycle. Thble 5.3.3 illustrates
fault recovery for a transient fault in PE3,m or PE4.d during the
fourth clock cycle. The control algorithm for the transitional P83.m.

also reproduced in Figure 5.2.2. can be written as follows:

105

Begin
if EF=O then
if FS=O then
119 IDR
IDRO ADI
ADIO ODR
Phi P0
ODR? Oi
elseif FS=l then

119 ID!
IDI9 A02
A029 om
PI? P0
ODRO Oi

end if

elseif EF-l then
IDB-) 01
IDIé A01
A019 one
PI9 PO
0039 Oi

end if

end;

This algorithm is applied to PE4.d. by switching the functions of the

arithmetic units ADI and A02 (Anl performs division. A02 performs

106

Table 5.3.l. The Space-time table for the fault-tolerant pipeline
of Figure 5.2.l with a transient error in PE2,m during
the second clock cycle.

Output/Time PEl,m PE2,m PE3,m_ PE4,d_ PE5,d
d m

A

l l,m
t2 {Ag 31"..." K2:m.i ‘- error
I: B FA: -“- - - ‘A’ - recompute
3 l,m [_g,m- _ __22m3 reset to normal
t4 BZ,m 32,m A3,m A3,q"., -‘. operation
t5 Cl ,m B3,m 83m" 'A4,d X
tG c2,m C2,m B4,d B4,d A4,d
t7 Dl,m C3,m C3,m B5,d . B4,d
t8 D2,m D2,m c4,d c4,d X
t9 X D3,m D3,m C5,d C4,d
tTo X X D4,d D4,d X
tH x x x 05,d o4,d
t12 x . x x x x

PE3,m_and PE4,g_ are Transitional PE's (i.e. can be selected
m
to perform either multiplication or division).

I'""'"" 'ﬁ

: i : Compare results for mismatch

- -- -----d

Table 5.3.2.

Output/Time

107

The space-time table for the fault-tolerant pipeline

of Figure 5.2.l with a transient fault in either

PE2,m or PE3,m_during the third clock cycles.
d

PEl,m

l,m
2,m
l,m
2,m
l,m

2,m

cnnmw>>

l,m

DZ,m

PE2,m PE3,m_ PE4,g_ PE5,d
d m

..gim ------- .grror recompute

LAM-.. 13.31.!

starﬁsh

B3,m B3,m A4,d X

C2,m B4,d B4.d A4:d
3,m C3,m B5.d B43d
Dz,m C4,d C4,d X
D3,m D3,m c5:d C43d
X D4,d D4,d X

x X 05,d D4,d
x X X X

108

Table 5.3.3. The Space-time table for the fault-tolerant pipeline
of Figure 5.2.1 with a transient fault in either

PE3,m_or PE4,g_during the fourth clock cycle.
d m

Output/Time PEl,m PE2.m PE3,m_ PE4,g_ PE5,d

d m
‘1 Al,m
t2 A2,m A2,m
t3 Bl,m A3,m A3,m error
t4 82,m 32,m {Aili- - - 11.1.11 I recompute
3s
t6 C2,m C2,m B4,d 84,d. X
t7 Dl,m c3,m c3,m BS,d 34,d
t8 D2,m D2,m Ca.d C4,d X
t9 X D3,m D3,m C5,d C4,d
tTo X X D4,d D4,d X
tll X X X DS,d D4,d
t.2 x x x x x

X: Don't care

109

Table 5.3.4. The space-time table for the fault-tolerant pipeline
of Figure 5.2.l with a permanent fault in PE2,m.

Output/Time PEl ,m PE2,m PE3,m PE4,g RPE5,d
d m

tl flan _______

t2 LZ’E'---,A.2.’.m_L-_, _,

t3 531.1"- - “2,133-- "$2.";

‘4 £32.31- - .. 2.31.3- _ _ 2.9 “2...

t5 C1,. - _ iBZJm“ .B-Z,m_ir A3,m X
t5 $2.31“- - _ 2_,mj - - 8.2:!“ BZ,m A4,d
1‘7 .01.": - 3.3.1" :95} 83’3" X
t8 IDZJ'." - - -0? 391 ...c.2 1'“ C23“ 8439
t9 ' L92..r9_---02.»3!3.ﬁi C393" x
tIo ' 62m D2,m c4,d
tll D3,m X
tlz ' D4,d

X: Don't care

I ' : All comparisons indicate error

110

Table 5.3.5. The Space-time table of the fault-tolerant pipeline of
Figure 5.2.l with a permanent fault in PE3,m.
Output/Time PEl ,m PE2,m PE3:_I;1 PE4,d RPE5,d
m
:1 Al,m
t2 A2,m A2,m
I3 31 .m Em: I . 53E
t4 B2,m B2,m 93...; - . Tie-:1:
r "" "*" ' T ””” ‘
‘3 CI ,m :33» - -,.B3.,n:.=_ -_ 53.43.. ‘X
tG E2,m _cgznl _ _ 133,51 __ _B_3.m_3 A4,d
t7 Dl,m Ec3.m C3,m? B3,m x
t8 DZ,m -62.“; - -331"--. - 3,}; B4,.1
39 5:3a‘i:;‘3.ai_is"3‘aT x
1:10 ' ; 3,m 3,m.5 C4,d
tTI "nu-5311;. x
t12 D4,d

lll

multplication). A proposed architecture for a pipeline processor with
multiple functional transitions is illustrated in Figure 5.3.1. The
above algorithm can be applied to this structure without any

modifications.

5.4 Overhead Hardware and Timing Requirements

From Thbles 5.2.2 through 5.3.3. it can be seen that for fault
recovery in a k stage linear one-dimensional pipeline h+l PE's are
required. As an upper bound for the area FEQUirements. the
transitional PB's are counted as two PE's because of their
multifunctional design. In VLSI implementation however. the area
required by one PB containing two arithmetic units and the
corresponding switching elements for function selection. is less than
the area required by two unifunctional PE's. The exact area required
by such a PE is very hard to determine and is strongly design
dependent. therefore as an upper bound. an overall Of 1+3 PB's are
required if there is one functional transition in the PE's along the
pipeline (i.e. two groups of functionaly different PE's. for example
a group Of IAC's followed by a group of DC's). Also k-l comparators
are needed. If the chip area required to implement a comperetor and
its respective interconnection lines is (l/m)th the area consumed by a
single PE denoted as APE the overhead hardware requirement for the
fault-tolerant one-dimensional pipeline can be formulated

approximately as follow:

2
1
1

 

 

pm
..a

l

awn

 

 

 

 

 

 

 

 

 

 

I'"""

--J

m E

.lIcsIL

Hausa
oumxmsam

mﬁocwm m.w.d.

 

 

 

 

 

 

 

 

 

 

 

].
emllirllem em
m.a w.s. x+m.

til Q " IIQI.

”ﬂ- ..

.
—
“we.
.
owl. n HLA
.— e ”Id—I.
-..
.
.
.
_

 

 

 

 

 

 

L

 

 

 

adage om
meannmoamm
asmamnadoz
ozm

 

 

 

  

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

em . men It
x+r. " x+r+~
III!
e . e ...
a u i
u m
mm " JIM.
” ,
n . ._ a re:
x+rn . +
. .#T x r
.
u
.
v.m=m oﬁ
meanwnosmd
qwmzmmnho:
42c

> uxocomma mensnwmnnzxm «or use ozmuadamamnozmd emcdniaodmxman unumddam
excommmow om mxcdnxmwddk n:Omm: x+r+. mamcmm man are ﬁcananoam. nxmzmmnnoam.

ll3

033(k.l)’[(k+3)05pg+((k-1).APB)/l]/k.ApE (5.4.1)

Since the transitional PB's are selected to perform one operation at a
time on their input Operands(i.e. either multiplication or division).
the overhead timing requir-ents Of this modified faul t-tolerant
pipeline ruains the same as the one described in Chapters 3 and 4 and

is given in Equation 3.3.3.

5.5 Two-Dimensional Functional Transitions

once more. due to the parallelism in the design of dedicated
computing structures. as long as the data flow consistency (both
horizontal and vertical) is guaranteed by the algorithm. each row in a
square array architecture for example. can be treated as a separate
one-dimensional pipeline. Since the algorithm described in the
previous section does not delay or alter the data flow in the
pipeline. the fault-tolerant two-dimensional design of Figure 4.l.l
with transitional PB's replacing regular PE's each time there is a
plane of functional transition. can be used without any further
modifications. Tables 5.5.1 through 5.5.3 illustrate the error
detection and recovery steps for the cases of transient and permanent
fault-occurrence in one row of a two-dimensional array processor. The
upper bound for the overhead hardware requiruents can be formulated
by counting each transitional PB as two unifunctional PE's. For
example. in a square array having one functional transition interface

in each row. Equation 4.3.1 can be written as:

ll4

vmd.d.a

> m

d d

> w n

d a d

> m

N N

>~w~nm

> w

w w

> m n

w u w

> w

e
>Aw

a

n

s e

x” con.” nmxm w

qmcdm m.m.d.

vmd.m.a

> w n

d d .

n o

> . a

w

d .

>~m~n~
>Nm~n~om

> w n

w u w

are“
9

euros

> w

u w

> 0

ea.
>

as

vmd.w.a

> w n o

d . d d

>dwdndcd\md

>~m~o~c~

>~wmn~c~\mm

> w n o

u u w w

>wwunucw\mw

> o

swans A

>
x

ﬁrm <m1nnnm_ ocmwmaa Am cannk

sweneoe\ms

emd.e.m

>dwdndc.\md
>dwdndod\md *
>~wmnmo~\mm
>~wmn~cw\m~ s
>wwunuow\mw
>wmunuow\mw *

>

eweneos\me
>emensce\ms .

zvmd.m.a

33nd}:d
x

>Nmmnmc~\mm
x

>wwunuow\mw
x

43m mumnmuanam ﬁmcdm om «3m emcdﬁuaodmwmza ﬂroiamamzmnozm. mwwmk uxonmmmox

ﬂow azm exaowiwxmm nmmm.

llS

brewer. emn.d.a ems N.a em_.w.e. em..e.m. ems.m.a

Adam a a
A
. pawn
A
N 35: 139.5. - - - - \RBA.
ﬁ . I I I IIIII J
w >~m~ wrmmgmyw~.- -..-)mm‘mywmw amnoaucﬁm
AA pmmmnm >~m~n~ ">.wanncn pswnn.mn.
----------..-------.

A

m >wwu >~m~n~o~ pmamnmow >.mdn.c.\m. x
A
A

A >Awe >uwwnuow >uwwnuow >~m~n~c~\m~ x
A

m >Awene >Awene puwwnwou\mu >uwwnwcu\mw >~m~n~o~\m~
A x
A x >Amnnece >Aweoeon >uwwnwow\mw
A

no x x >Amenscs\me >Awensce\ms puwwnwou\mw
A

as x x x psmenece\me x
A.N x x x x p.

smensoe\me

.A noaumwm ccauca sewedAm

amcdm m.m.m. 43m mumnm-anam Amcdm om "so wmcda-aodmwmzu Aro-anam:mno=md mwxmk excommmox
in”: m armsmnmaa Amch A: mdﬁrmw vmd.~.a ox vmd.w.mu

ll6

ocach\qAam umA.A.a

AA >AwA
AN >Awdnd
AA >mw~
AA >~wmn~
am >www
Am >wwwow
aw .

«m .

no .
"do .

x” con." nmwm
ﬁ.-..- -..- 4

emd.N.a em..A.a AAA.A.A

m a
3A}
3A}: 3A}:
ANANAN wavy WP“ ..-.. EA To? 5..
>NANANAN >NANANA N >AAAA AAANAA
AAAAAA _>N.AN.A mom\.mN.-- - -mN-N.A.N.Aw.\mN._.
>AAAAAAA >A.AA.AmAAH - ->NANA.NA.N\AN

. .>A.A IAmAm.AA.\m. - - - wA.A MAWAWNWA."

>wwwnw ow\mu_

_ _ . >AA nosnmxAAoam Azamnmam msxox

amcﬂm m.m.u.

2AA: m umwamzmaA Amcdn A: vmd.n.mu

a

vmd.m.a

3.30%.;d
x

>Nwwnmom\mm
x

>wwwnwcw\mu
x

>Awnnaoa\mb
x

43m Acmnm-aAam Amudm AA azm Amcdn-no.mxmsn Aro-aAam:AAo:mA mxxmk uxonmmmox

ll7

onn(k.a)=[(k+3)‘+(r+(k-1)‘)/-]/k’.

The overhead timing requirement remains the

one-dimensional design and is given by Equation 3.3.3.

(5.5.1)

as the

CHAPTER 6

SUIlAIY AND CDNCLUSIONS

In this research work. the inherent PB redundancy of the
parallel-pipeline arrsy architecture is used to deveIOp a new
concurrent error detection and correction algorithm capable of
detecting and correcting errors caused by permanent. intermittent and
transient faults Of unknown origin and frequency of occurrence. These
faults are assumed to affect small parts of an arithmetic processor
chip. especially the arithmetic units before or during Operation
despite extensive application of fault prevention. The inherent PE

redundancy of the parallel-pipeline architectures will enable the

H8

ll9

algorithm to use minimum overhead hardware and timing in order to
detect and correct errors. In other words. the errorneous computed
results of constituent PE's in the array are recomputed by subsequent
PE's in the same pipeline during the following clock periods.
Furthermore. the concurrency of the error detection and correction
will enable the fault-tolerant structure to detect and correct
transient errors with very short durations. These errors affect the
output results of the FB's and they can not be detected and corrected

using nonconcurrent techniques.

'ith the rapid progress in the area of VLSI. dedicated high-speed
computing structures are implemented on a single chip. Such
implementations require modular PB designs with local next-neighbor
communication paths. As chip complexity and operational speeds
increase. more fault-occurrence is expected. Because of the limited
IIO pins. large amount of processed data and the high Operational
speeds. undetectable errors can propagate throughout the whole
computing structure causing the system to crash. Off-line or start-up
testing techniques. while capable of detecting permanent faults. will
not detect transient faults. Therefore. locality and concurrency in
the error detection and correction techniques is essential in VLSI

implementations.

The major contribution Of ﬂhis work has been the ability to
efficiently incorporate fault-tolerance into special classes of VLSI

computing structures and thus to improve reliability and yield Of such

l20

designs which have been more susceptable to faults as chip complexity
and Operational speeds have increased. Furthermore. the addition of
fault-tolerance tO the existing computing structures has not
tremendously degraded the system performance and has increased the

system reliablity from almost none to a very promising level.

As a bottom-up approach. in Chapter 3. a new error detection and
correction algorithm for one-dimensional systolic arrays. based on
reloading each input operand set into the pipeline and concurrently
comparing the output results of adjacent PE's was described. The
overhead hardware and timing requirements were determined and it was
shown that by adding only one redundancy in a k stage pipeline
Operating on n operands. an order Of k.n transient and intermittent
faults with duration less than one computational clock cycle can be
detected and corrected. Furthermore. the overhead timing requirements
for the algorithm never exceed the timing requirement of the same
pipeline implemented using the B880 technique. capable of only
detecting and not correcting errors. with approximately one third the
overhead hardware requirements. As an application example. the
algorithm was applied to a one-dimensional systolic array used in

Discrete Fourier Transform (DFT) computations.

In Chapter 4. the algorithm was extended and applied to
two-dimensional array architectures. It was shown that as long as the
vertical and horizontal data flow were not delayed or altered. each

row of the array can be treated as a one-dimensional pipeline. The

lZl

overhead hardware and timing were determined. As an application
example. the algorithm was applied to VLSI array used in pipelined

matrix multiplications.

In Chapter 5. the problems arising from functional transitions in
VLSI array architectures among adjacent PE's were approached by
extending the error detection and correction algorithm and determining

the overhead hardware and timing.

In conclusion. the new error detection and correction algorithm
presented in this research enables highly reliable designs Of
fault-tolerant computing structures which benefit extensively from
parallel-pipeline architectures. Fnrthermore. the algorithm preserves
VLSI attributes. such as locality of communication among constituent
PE's. modularity Of the design and simplicity of the control structure
and does not require additional pinouts. This algorithm. due to its
general assumptions. can be applied to systems which contain

processors on a single chip or mainframe computers interconnected

BIBLOGRAPHY

1- Runs. H. 1'- . " The W 2f Emile; mum". Advances in
Computers, Vol.19, Academic press, 1980, pp. 65-112.

2- Huang. K. and Cheng. Y.H.. " V L§I._ggpggigg §§guctgggs a ﬁg;__g;xigg

magical: 1:12:11 mum" . Proc- 1980 Int’ 1 Conf. on
Parallel Processing, Aug. 1980. pp. 217-230.

3- Moldovan. 0.1.332an .91 mm for __VLSI W
.Agggyg".Proc. of the 1288. Vol. 71. No. 1. Jan.1983, pp.113-120.

4- Leung. Y.Y.J. and Shanblatt. H.A.,T§jnglg§ipg .3 VL§I .ﬁypgglig
Arm for Lead 112! WWW". Proc. 1983 Conf. on

Modeling and Simulation, Univ. of Pittsburg, Apr. 1983. pp.
1349-1353.

5- Shanblatt. HAA. and Leung. Y.Y.J. fIhg.§1gnig§.gj 1L§I _ﬁg; _ngd

.Zlg! .annutatigg .ﬁphggggngpg". IEEE Trans. on Power Apparatus and
Systems. Vol. PAS-103. July 1984. pp. 1714-1719.

6- Zhana. 0-H” Yun. DJJ-J'W Mir Anus for
lgigggggg ‘ggugig; .Iggnpﬁggg", 11th Annual Int’l Symp. on Computer
Architecture. June 1984. pp. 215-222.

7- Quinton. P-."An.tmtis§1ntheaisnf§n.talisArrm from _Ilnifm
.gggggggg; _gggggiggg", 11th Annual Int’l Symp. on Computer
Architecture, June 1984, pp. 208-214.

8- Runs. 8-Y-. " .Qu Was 11th W1 W Arm
.Exgggggggg", IEEE proceedings. July 1984, pp. 867-884.

9- Runs. H-T- £11.35; Desire Alarm £91 __.§_VL I m".Prom Caltech
Conf. on VLSI, Cal. Inst. of Tech.. Pasadena. Jan. 1979. pp.

65-90.

Ill-Evans. K-. Chens.Y.H-."£artiti.qnc.d mus Algorithms in: __§.VL I
.Aziggngtig m ". IEBE Trans. on Computers. Vol. c-31, Dec.
1982. pp.1215-1224.

122

123

ll-Fortes. J.A.B. and Raghavendra, C.S.,"Dyg§gic§111 Rgcgnfiguggblg
Fault-Tolggant Array Pgoggssgr", FTCS-14, June 1984, pp.386-392.

12-Chung, F.R.K.,"DIOGENES: ‘A Hethodglggy .jg; Igggiggigg
Fault-Igleggng VLSI ngggssg; Arrays". FTCS-13, June 1983, pp. 26-32.

13-Koren, 1.,".A Rgcggﬁiguggblg gag Fgglg-qugggnt VLSI Multipggcggsgg
", Proc. 8th Int’l Symp. on Computer Architecture. Hay 1981,
pp. 425-431.

14-Head. C. and Conway, L., ﬁngggggggigg .59 VLSI
_§ygggg§".Addison-Wesley Pub. 60.. Reading. Mass., 1980.

15-Head. H. and Head. 0-. "_9_tC I and We 2f ___..VLSI 31mins
.§£;gg;g;gg", IEHE J. of solid State Circuits. Vol.SC-14. April 1979.
pp. 455-462.

16-Anderson. ’13. and Lee. P-A-. 'Ianlt Tolerant: 1512139123!

.gggpggglg". Proc. 1383 International Symposium of Fault-Tolerant
Computing. June 1982, pp. 29-33.

17-Sedmak, 3.8. and Liebergot. H.L. fggglggzglgggggg _gf .3 .Qgpgggl
__rml’u s £22223: W .121 MMMW".EEE

Trans. Computers, Vol. C-29, NO. 6, June 1980, pp. 492-500.

18-Avizienis. A0: 'mmmummsm whoselm the
MafW". PICS-12. June 1982. pp- 6-13-

19-Praud. 3.. '32ch .er __§IVL 3.9.9.; W Thins the
Interrelationships .9! Earl: 9mm. 11321.4 and Lest Quality".
FTCS-12, June 1982, pp. 400-405.

20-Niraj. K-Jo. Abraham. J-A-. 'Tesell! Self-£22m __HOS 211211;;
_gggg; Rgglistic Physiggl Failures", Proc. IEEB International

Congerence on Computer Design: VLSI in Computers. Oct. 1984. pp.
665-670.

124

21-Sami, H.G., Stefanelli, R., f§glﬁ;1g§§igg.égggy Stugctgggs", Proc.
IEEE International Conference on Computer Design: VLSI in Computers,

Oct. 1984, pp. 677-682.

22-Wang, S.L., and Avizienis, A., ﬁzhg Dgsigg 9; Totally Self Checking
ginning hiss wmm". Proc. IEEE International
Symposium of Fault-Tolerant Computing, June 1979, pp. 173-180.

23-Crouzet. Y., and Landrault, C., fyggigg .9; .ﬁglﬁzgggggigg HOS-L81
mm:- Annliminn .29 3 MP -B' W". Proc- IEEE
International Symposium of Fault-Tolerant Computing, June 1979, pp.

24-Halbert, M.P., and Bose, 8.H., fgggigg .Apgggggg _gg; .5 VLSI
.ggljzghgggigg HIL-§ID:17§0§,gigggpgggggggg", Proc. IEEE International

Conference of Fault- Tolerant Computing, June 1984, pp. 254-259.

25-Pate1. LE. and Fans. L-V-. 'Ecnsunsnt Em: 13m in
igglgjply .39; .Qixigg .Agrgyg", 1833. Trans. Computers, Vol. C-32,
April 1983, pp. 417-422.

26-Chiang, K.W. and Vranesic, 2.6., f9;,§ggl§,§g§gg§igg.ig CMOS,ngig
Networks", Proc. 20th Design Automation Conference. June 1983, pp.

50-56.

27-Reddy, 8.11. Agraval, V.D., Jain, S.K., "_A 91;; mg], )1ng], jg;
___§CHO 9W1 Lexis ginning sits Annlisasion .ts 2.12.1;
.Qgggggigg", Proc. 21st Design Automation Conference, June 1984.

28-He1as.C-H- . "2m Manning .19 a 233.3 ﬂushing lama". IEEE

Trans. on Communications, Jan. 1983, pp. 155-157.

29-Avizieniz. A's 'Mixnllixersimlnemums 2: in: Eighties".
PTCS -12, June 1982, pp. 44-45.

30-Kanai. 13.. '33ng .9: 21.52211 _LSLSS In 11L h
.§§g!igg.gggggﬁiggggtigg", IEEE Trans. Computers, Vol. 0-30, No. 10,
Oct. 1981, pp. 811-812.

125

ill-Ryan. P-H- . "___999La 91999.9 19919 19199999 999991 1999999999". Proc-
IEEE International Symposium on Fault-TTolerant Computing, June 1983,

32-Kuang, 1.8.1., and Abraham, J.A., 919! Cost Schemes .ﬁg;
__9__19_9L___99P u -'1' mm in 2199911.! 9999991999 LLi b.9939? c a 1:93.919". Proc-
Ieee FTCS-12, June 1982, pp. 330-337.

33-Pranklin, H.A., Wann, D.F., Thomas, V.J., 52in .ngﬁsatigg .199

29999919923 .95 __I.§_V I 19999999993199 999999.199". IEEE Trans. on
Computers, Vol. C-31, Nov. 1982, pp. 1109-1116.

34-Matick. LL. "M999 9f 1199911 .9919 Win:
.gglighiligy.ig Vi;§ugl.§ggg;igg", FTCS-12, June 1982, pp. 223-227.

35-Williams, T.“., and Brown, N.C., ...L’SIESSLQXSJ n _a 223211193 3f
.ggglg ,gggggggg", IEEE Trans. on Computers, Vol. C-29, 1981, pp.
987-988.

36-1-9119. 8- and Patel. ML. 339: 9999999199 .19 Wis

.9291§£1221.ﬂging.Iig§,§g§gg§;ggy", Proc. IEEE International Symposium
on Fault-Tolerant Computing, June 1983, pp. 298-305.

37-Nicolaidis, H., Janesh, 1., Courtois, 3., ﬁﬁgzgggly ,Qggg ,Qigjgin;
_Qggggggg", Proc. IEEE International Conference on Fault-Tolerant
Computing, June 1984, pp. 16-21.

38-Torki. K-E- . and Radke. C-E- . 'Dmlin 3.9991 9.99 .9919? 99999999 .99:
.gultighip ", Proc. 20th Design Automation Conference, 1983,
pp. 201-2060

39-Agrawal, V. Seth, 8., and Agraval, P., fLﬁl ,ggggug; .Qgglity ,gng
.ggglg .Qggggggg", Proc. 18th Design Automation Conference, 1981, pp.
196-203.

40-91“, DJ!” and Radke. C-E- ‘19993999999999999999999999199999
99999991 99 .9119 99911.91 9951 11.19.19 9f MW'SProc- IEEE

International Conference on Fault-Tolerant Computing, June 1984, pp.
56-59.

126

41-Chandramou1i, R. 299 Testing §§ugk-02en Faults", Proc. IEEE.
International Symposium on Fault Tolerant Computing, June 1983, pp.

258-265.

 

42-Faulkner, T.L., Bartlett, C.W., Small, M., "Hardware _ngig Design

Fault; .;A Classificatigg_ggg 89mg Measurements", FTCS-12, June 1982,
pp. 377-380.

43-Seth, S.C., Agrawal, V.D., fggggiggiggl Design V i i i ",
FTCS-12, June 1982, pp. 393-399.

“Pushes. J-L-A-. Hccvskey. E-J-. "1399.199 99 19991.1! 9.91::99999999
99999999999 91.99 99 99.19999! 999929: 9; 4999.1 a". PICS 13th

International symposium on Fault-Tolerant Computing, pp.l69-172, June
1983.

45-Tmir. Y's Sequin. 0-. "2995.99 .999 9991999399 99 “Lat-C in
Cgmpggatggs ,Lgplggggggg .1153 .!Q§ ,§L§;§", IEEE Transactions on
Computers, Vol.c-33, No.6, pp.493-506, June 1984.

46-Furuya, K., Akita, Y., Tohma, Y.,.ngig .Qggigg .2; P -T

91919999 99999 99 .999 W999 99.999991" PICS-13 pp- 306-313
June 1983.