3' '3 "'33 33 ' ' I‘3II3'3IIII'II'I I3I 33333 E '. .. | : II "I' "3333' I" II.3.,I "'3. 3'3" I 3 .3'33I 3333. 333333 3' 3333333 3"333' 'II 3333'3'3 IIII3333II33 I 3 II3 33333 3III I 3333'" 33 IIII'333 III II"'I3 3' I .IIIIIIIII I3" .13333333'333‘ 3.3333331I'I.' 3'33; . 333" I II I33333" I3 '3 '333'33'I. 33'I I3 IIII 33 II'3""' “.7 '33 '3 __. 33'3IIII ' " 333 ' 33 3 I'II "'3"' III3 II. .III III 3:33 ' 333' 3 3 3' 33 I33 3 3 .3" III I 3.3 3.33 3333 33 III III . I "'3"'I ' 3"" 3. 3333.333 333 I"I'I'3333II'.."333.III 333333.... 3I3 3! I3 ‘33 3' I33 33 I I33!” .33 ”.3, 3333333333'I3333III333-I3333‘.333333 3333 33333.3 33 333. """"'3""3 I3333 73' I... 1 3.3 . ":"II'I'33I'. 3.3133333333.33 '43 .333 3'3 3' I' III I333 333‘ 3333 ' .3. 313.3 .I 33' I'LII‘I'III II "'3"""33""33"""'3'"3"" 33333.3333333 .II II III II .. I'I'I I ." 33333333 I 33:“; 333" 3'333 3"33'3'3' "3333 3'33. I 333III 3333.333 .33.; "33!I 333333333333; "'3'333 33I'3 I333 333333333 "33333333 "33 33333333333333.3333: 3 II 33.3 II. 3. 3333333 7.. ”III '33. 33.3IIII3 3' .III 333333 33 III 333 333.3333 "'33g3'3'3"""""""""""3"3"'§" .I.III..I333.33.133.33 33II 33'3'3 33IIII. ‘IIIIII3III 'I'III'I" 3'I I33 31“ I‘I3I3I33'..I'.'.."" 3“" II“ ' u. I3'3 III 33.3,. .I'I33I 3333.33"33 I I3'I3I3.'33' 3.3 33.. .3333 3333. I3 I33 3333III3I3III33 I3 33 3333! 33.. 3.333 3333.33..I 333.3 ...3.I 33333. 3. LWIII33.3.I_.III_ .3 I I. I ‘.. I3 III. III I 3 IIIIIIIIII III I I ..,. .3... II IIIIIII I33", I... 3.3"I ""333 3 III III IIIII3 I. ‘3IIIIII I33 133 3. III III I. "'3'“. '3'3433"3"'3"'3I'333 3 3 3 'I 333:3 33I3 3.333333%. I'333I'I 333333 3'"L3"‘3""3'3JT33"'3"3'3333'3"3'3" I-I'rt '33::33. 3I' '.I 13 3333”,. 3333I3I.3 .3333 I. I-33333II33 I‘ .333“ '333'3333'3313'33'33333'33"33333331’ "3".3 'II33.I.I..III'I IIIIII III .II 3.3. I'33333"33I 3333..'I,.‘I33'I33II'333".3 . III. 33333 ...I I... I... I... II 3'3'3'3 I3 I3333.33.33 "I.33333.333.33.1333333 . ’ """I " "II III-I. "I'"'333' '- ' "' ' ' '3.“ ”"I "I.'II"'3I "3 ".33 .I. ". 3""3'333'3333'3'1333' 33"3'3'33'3"'3333'33""'33"3 33. '"3' '3'0"I"'3"'33" "" "333. "3339 '3 ' "333333333333333 333333 III 3;. .. 3I' " 3I'I" I... . 33 _:3',I'I‘ r :I" 1' I3" II 'II 'II III"" 313."."33 33' 333 "3333333" 'II'I'I'IIII 3353‘" IIIIII I...IIIII II. _ III II LIBRARY Michigan State University This is to certify that the dissertation entitled A Concurvevfi’ (Ewe? Mificiton An$ Qarfet-an Aiaociflxm For FauPn—Voknkvfi \ILSI Ar\\*\\vv\z\*\.c_ AWOA‘ rFYGCES’SO‘V'S. presentedby QAMQKDM SHOKODH: ~\<fl&1vm\$ has been accepted towards fulfillment of the requirements for Ph-B- degreein SietifiM\Y\/13M(O('Mj Megs/3Z4 “Magi 22/419 /%WQ7W\ Major professor Date_////3,/g~5 MS U i: an Affirmative Action/Equal Opportunity Institution 0-12771 MSU LIBRARIES -:—. RETURNING MATERIALS: Piace in book drop to remove this checkout from your record. FINES wil] be charged if book is returned after the date stamped beiow. A CONCURRENT ERROR DETECTION AND CORRECTION ALGORITHM FOR FAULT-TOLERANT VLSI ARITRHETIC ARRAY PROCESSORS BY KAMRAN SHOKOOHI—KAYVAN A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Electrical Engineering and System Science 1985 ABSIRACT A (DNCURREIT ERROR DETECTION AND museum ALGtRITRN FOR FAULT-mL-ANT VLSI ARITRIBTIC ARRAY PROCESSCRS By Kamran Shokoohi-Kayvan A concurrent error detection and correction algorithm for errors caused by permanent. intermittent and transient faults in arithmetic parallel-pipeline array processors. 
is described. The fault model. applicable to VLSI implementations. assumes the occurrence of faults with unknown origin and frequency. Recovery from faults is achieved through minimal processing element (PB) redundancy in the array architecture. facilitated by spatial bypassing of the correct operands previous to the occurrence of a fault to fault-free PE's and recomputing during the following clock cycles. The overhead hardware and timing are determined. It is shown that this concurrent error detection and correction technique uses less additional hardware than 3880 (REcomputing with Shifted Operands). offers marginal timing improvement over the R880 technique. and adds the concurrent error correction capability which is not present in RESO. Furthermore. the additional hardware and control for error detection and recovery is local and modular. hence making this technique very attractive for VLSI implementations. This algorithm is based on a general fault model which assumes the occurrence of permanent. intermittent and transient fault of unknown origin and frequency. Transient and intermittent faults with a duration longer than one computational clock cycle are classified as "permanent". The effects of these faults are detected as undesired changes in the logical values at the outputs of the PE's or other circuit elements. such as latches. The error detecting algorithm is concurrent and thus may detect and correct errors caused by transient faults affecting the PE's randomly. TONY PARENTS AND GRANDPARENTS ACKNOWLEDGEMENTS I would like to express my deepest gratitude to my advisers. Dr. Erik D. Goodman and Dr. lichael A. Shanblatt. for their continous moral and professional support and guidance throughout my doctoral pregram. I also would like to thank Dr. D. Reinhard and Dr. B. Drachman for their professional advise and review of this work. I shall always be grateful to my parents for their continous love. support and guidance through the years. My appreciation is extended to my brother Dr. Farhad k. Shokoohi. his family and my sister Dr. frough k. Shokoohi for their unending encouragement. understanding and guidance throughout my academic programs. TABLE OF CONTENTS Chapter Page 1. INTRODUCTION 1 2. BACKGROUND 5 2.1 Algorithm Philosophy 6 2.2 Systolic Arrays 7 2.3 Fault-Tolerance 12 2.3.1 Fault Classes 13 2.3.2 The Concept of Fault-Tolerance 15 2.3.3 Fault-Tolerance In VLSI 18 2.3.4 Fault-Tolerant Systolic Arrays 20 AN ERROR DETECTION AND CORRECTION ALGORITHN FOR EAULT-TOLERANT LINEAR ONE-DIMENSIONAL SYSTOLIC ARRAYS 23 3.1 The Error Detection Algorithm, 23 3.2 Error Recovery 41 3.2.1 Control Algorithm For The Redundant PE 43 3.3 Overhead Hardware and Timing Requirements ' 46 3.3.1 Fault Coverage 50 3.4 Linear Systolic Array for DPT 51 3.4.1 Linear Fault-Tolerant Systolic Array for DFT 55 3.4.2 Error Recovery 57 AN ERROR DETECTION AND CORRECTION ALGORITHM FOR FAULT-TOLERANT TWO-DIMENSIONAL ARRAY PROCESSORS 64 4.1 The Modification of The Error Detection Algorithm 66 4.2 Error Recovery 71 4.3 Overhead Hardware and Timing of The Two-Dimensional Design 80 4.4 A Fault-Tolerant Pipeliued Array for Matrix Multiplication 82 4.4.1 The Fault-Tolerant VLSI Array for Matrix Multiplication 87 iv Chapter 5. 6. 
FUNCTIONAL TRANSITIONS IN ADJACENT PE’S 5.1 Functional Transitions In VLSI Structures 5.2 Transitional PE’s 5.3 Error Recovery 5.4 Overhead Hardware and Timing Requirements 5.5 Two-Dimensional Functional Transitions SUMMARY AND CONCLUSIONS BIBLIOGRAPHY Page 95 96 98 104 111 113 118 122 Figure 2.2.1 2.2.2 3.1.1 3.1.2 3.1.3 3.1.4 3.3.1 3.3.2 3.4.1 3.4.2 3.4.3 4.1 4.1.1 4.1.2 4.3.1 4. 4. 1 LIST OF FIGURES Different interconnections for systolic arrays. The structure of a computing array for a system of arbitrary dimensions and 8-3. Basic structure of a linear pipeline processor. The structure of a fault-tolerant one-dimensional linear pipeline processor. Basic PE for the fault-tolerant pipeline processor of Figure 3.1.2. Basic redundant PE for the fault-tolerant processor of Figure 3.1.2. The overhead hardware requirement OHR(k,m) for the one-dimensional fault-tolerant array processor. The efficiency and throughput of the fault-tolerant pipeline of Figure 3.1.2. Basic PE for linear systolic array used in DFT computation. The structure of the linear systolic array for DFT computation. The structure of the fault-tolerant systolic array for DFT computation, n-4. The structure of an array processor connected in mesh without the boundary wraparound connections. The structure of the first row of the array of Figure 4.1 with added switching and comparator elements. Basic PE structure for the fault-tolerant array of Figure 4.1.1. The overhead hardware requirement OER(k,m) for the two-dimensional fault-tolerant array processor. A‘VLSI array for pipelined multiplication of two 2x2 dense matrices. vi Page 10 24 27 28 29 48 49 53 54 56 65 67 68 81 83 Figure Page 4.4.2 A VLSI array for pipelined multiplication of two 3x3 dense matrices. 85 4.4.3 The fault-tolerant VLSI array for pipelined multiplication of two 2x2 dense matrices. 88 4.4.4 The VLSI fault-tolerant array for pipelined multiplication of two 3x3 dense matrices. 94 5.2.1 The structure of the fault-tolerant one-dimensional array processor with one functional transitions along the pipeline. 100 5.2.2 Basic structure of the transitional PE3,m/d. 101 5.3.1 A proposed architecture for the one-dimensional fault-tolerant pipeline processor of arbitrarily chosen k+L+1 stages and two functional transitions. 112 vii Table 2.2.1 3.1.1 3.1.2 3.1.3 3.1.4 3.1.5 LIST OF TABLES The port-coefficient time table of the array of Figure 2.2.2 for 8-3. The space-time table of the normal linear oue-dimensioanl pipeline processor of Figure 3.1.1. The space-time table of the fault-tolerant pipeline. Error-free case. The space-time table of the fault-tolerant pipeline with a transien fault in PE.2 during the second clock cycle. The space-time table of the fault-tolerant pipeline with a transientfault in either PE.3 or PE.4 during the sixth clock cycle. The space-time table of the fault-tolerant pipeline with a multiple transient fault in either PE.2. PE.3 or PE.4 during the sixth clock cycle. 3.1.6 3.1.7 3.1.8 3.1.9 3.4.1 3.4.2 The space-time table of the fault-tolerant pipeline with a permanent fault in PE.2. The space-time table of the fault-tolerant pipeline with a permanent fault in PE.4. The space-time table for the parity check and output validity with a transient fault in PE.2 during the second clock cycle. The space-time table for the parity check and output validity with a permanent fault in PE.4. The space-time table of the normal systolic array for DFT computation. 
The space-time table of the fault-tolerant systolic array of Figure 3.4.3 for DFT computation, error-free case. viii Page 11 25 33 34 35 36 37 38 39 4O 58 59 Table 3.4.3 3.4.4 3.4.5 3.4.6 4.1.1 4.2.1 4.2.2 4.2.3 4.2.4 4.2.5 4.2.6 4.4.1 The space-time table of the fault-tolerant systolic array of Figure 3.4.3 for the DFT computation, error-free C886. The space-time table of of Figure 3.4.3 for the fault in either PE.1 or cycle. The space-time table of of Figure 3.4.3 for DFT fault in either PE.2 or clock cycle. The space-time table of of Figure 3.4.3 for DFT fault in PE.2. The space-time table of Figure 4.1.1. The space-time table of the fault-tolerant systolic array DFT computation with a transient PE.2 during the second clock the fault-tolerant systolic array computation with a transient PE.3 during the seventh the fault-tolerant systolic array computation with a permanent the fault-tolerant array of the fault-tolerant array of Figure 4.1.1 with a transient fault in PE1.2 during the second clock cycle. The space-time table of the fault-tolerant array of Figure 4.1.1 with a transient fault in PE1.4 during the fifth clock cycle. The space-time table of the fault-tolerant array of Figure 4.1.1 with a multiple transient fault in PE1.1, PE1.3, PE1.4 during the The space-time table of seventh clock cycle. the fault-tolerant array of Figure 4.1.1 with a permanent fault in PE1.2. The space-time table of the array of Figure 4.1.1, illustrating the register transfer operation of constituent PE’s in the The space-time table of error-free case. the array of Figure 4.1.1, illustrating the register transfer operations of constituent PE’s with a transient fault in PE1.2. The space-time table for matrix multiplication using the array of Figure 4.4. 1. ix Page 60 61 62 63 69 72 73 74 75 76 77 84 Table 4.4.2 A. 4.3 4.4.4 4.4.5 4.4.6 4.4.7 5.2.1 5.2.2 5.3.1 5.3.2 5.3.3 5.3.4 5.3.5 5.5.1 The space-time table for matrix multiplication using the array of Figure 4.4.2. The space-time table of the fault-tolerant VLSI array of Figure 4.4.3. The space-time table of the fault-tolerant VLSI array of Figure 4.4.3 with a transient fault in MAC.5 during the third clock cycle. The space-time table of the fault-tolerant VLSI array of Figure 4.4.3 with a transient fault in MAC.6 during the fourth clock cycle. The space-time table of the fault-tolerant VLSI array of Figure 4.4.3 with a permanent fault in MAc.5. The space-time table of the fault-tolerant VLSI array of Figure 4.4.3 with a multiple transient fault in MAC.4, MAC.6 during the fourth clock cycle. The space-time table for the unmodified array pipeline. Note the functional transition and output inconsistency in PE3,m.and PE4,d. The space-time table of the fault-tolerant pipeline of Figure 5.2.1. The error-free case. The space-time table of the fault-tolerant pipeline of Figure 5.2.1 with a transient error in PEZ,m during the second clock cycle. The space-time table of the fault-tolerant pipeline of Figure 5.2.1 with a transient fault in either PE2,m or PE3gm/d during the third clock cycle. The space-time table for the fault-tolerant pipeline of Figure 5.2.1 with a transient fault in either PE3,m/d or PE4,d/m during the fourth clock cycle. The space-time table for the fault-tolerant pipeline of Figure 5.2.1 with a permanent fault in PE2,m. The space-time table of the fault-tolerant pipeline of Figure 5.2.1 with a permanent fault in PE3,m/d. The space-time table of the fault-tolerant two-dimensional array procesor for the error-free case. 
Page 86 89 90 91 92 93 99 103 106 107 108 109 110 114 Table Page 5.5.2 The space-time table of the fault-tolerant two-dimensional array processor with a transient fault in either PE1.2,m or PE1.3,m/d. 115 5.5.3 The space-time table of the fault-tolerant two-dimensional array processor with a permanent fault in PE1.4,d/m. 116 xi CHAPTER 1 INTRODUCTION In recent years much attention has been focused on designing highly dedicated computing structures such as systolic arrays which use both pipeline and parallel algorithms. These structures are used in many applications including signal processing. filtering and many matrix analysis routines. Most often used as a peripheral to a host computer. these dedicated computing hardware devices will increase problem throughput and often relieve the computational burden of general purpose computers. The architecture of these structures is the result of a direct mapping of the algorithms onto simple and modular computing blocks. and the interconnections between blocks are determined by the computational data flow paths. Because of the modularity of the computational blocks and locality of the communication paths. these computing structures are prime candidates for VLSI (Very Large Scale Integration) implementation. VLSI implementation will enable the cost-effective design of such structures with smaller size. lower cost. and higher speed. Because of the increasing availability of CAD/CAI (Computer-Aided Design/Computer-Aided Manufacturing) systems. VLSI implementation can be realised more accurately with lower design cost and faster turn-around time. Simple or complex functional blocks can be designed and verified interactively on a CAD system and stored as library cells. For a given application. library cells can be placed and routed by the CAD system and the final design can be checked with respect to the designer's check rules. Finally. mask data generated by the CAD system are sent to a silicon foundry for chip fabrication. Problems such as efficient usage of chip real-estate. chip testability. yield and reliability may arise from such implementations. Improved chip fabrication techniques will result in higher chip yield and reliability. Chip reliability depends strongly on constituent building blocks reliability. As chip complexity increases. these constituent functional block have a high probability of containing or developing faults before or during operation despite extensive application of fault prevention. These faults can be classified as permanent. intermittent and transient. Since. next-neighbor communication among constituent functional cells is essential in VLSI implementations of dedicated computing structures. an error caused by any fault within each block will propagate throughout the structure and will result sometimes in an undetectable erroneous output result. Global error checking and correction. due to the large number of PE's and the large amount of processed data is time consuming. and even if an error is detected. the error correction requires the recomputation of the whole task. Furthermore. global error detection most of the time will not locate the source of errors. Therefore. any error detection and correction technique must be local (i.e. at the PE level) and concurrent so that it may detect and correct errors caused by all kinds of faults. Therefore. in order to design highly reliable computing structures. fault-tolerance must be incorporated into their design. 
The goal of this work is to incorporate fault-tolerance and dynamic reconfigurability into existing VLSI-implementable systolic array architectures. This is achieved by devising a concurrent error detection and correction algorithm. capable of detecting and correcting errors caused by all classifications of faults. This algorithm utilizes the inherent PE redundancy of the array architecture in order to minimize the overhead hardware and timing required by the fault-tolerant design. The algorithm's overhead hardware and timing requirements are determined. Finally. this approach is applied to classes of systolic arrays and VLSI architectures used in signal processing and matrix multiplication. Chapter 2 of this dissertation presents the background information on the systolic arrays and the notion of fault-tolerant computing. A new error detection and correction algorithm for linear one-dimensional systolic arrays is presented in Chapter 3. In Chapter 4. the error detection and correction algorithm is extended to two-dimensional square array architectures. Chapter 5 deals with solving problems arising from functional transitions in neighboring processing elements for fault-tolerant one-and two-dimensional architectures. Chapter 6 contains the summary and conclusions of this research. CHAPTER 2 BACKGROUND Dedicated computing structures such as systolic arrays can be used in applications such as signal processing. filtering. and various matrix analysis routines [1-10]. The architecture of such a structure is the direct mapping of the algorithms onto relatively simple and modular blocks. Intercounections are determined by the algorithm's computational flow paths. Direct mapping of the algorithm onto relatively simple and modular cells. the locality of intercell communication paths. and the increasing availability of CAD tools. have made the VLSI implementation of such structues very attractive. 2.1 Algorithm Philosophy Based on the layout methodology described by Mead and Conway [21.22]. algorithms suitable for VLSI implementations must have the following features: A) The algorithm should be implemented using only a few different types of simple cells. B) The data control flow of the algorithm should be simple and regular. thus allowing cells to be connected by a network of local and regular interconnections. Long and irregular interconnections must be minimized. C) The algorithm should use extensive pipelining and multiprocessing. In this way a large number of cells can be active at one time. so that the computational speed can keep pace with the data rate. The algorithms that have these properties have the following advantages: 1) A new design is generated with only a few different cells. since most of the cells are replications. 2) Regular interconnections imply that the design may be made 6 modular and extensive. Thus. a large chip design can be made by combining the design of small tested cells. 3) Through the use of pipelining and parallelism in the design. the computational performance (throughput) can be increased. 2.2 Systolic Arrays Among the several approaches to parallel computation which satisfy the VLSI algorithm philosophy of Section 2-1. the systolic array concept is particularly interesting. As proposed by lung [1]. a systolic array is a special-purpose redundant parallel computing structure. made out of a few relatively simple cell types which are regularly and locally interconnected. 
Computed data circulate through these cells in a regular fashion and are processed at each cell. New computed results of each cell are then sent to neighboring cells. Constituent processing cells or elements (PE's) are arranged in a parallel-pipeline fashion to form a one-or two-dimensional structural configuration. Data is "pumped" synchronously between levels of PE's. After each ”systolic best" (time step). part of an overall computation is performed and results are transferred to the neighboring cells. Simple dynamic storage circuits. such as latches. are placed between rows of PE's to provide data synchronization. Figure 2.2.1 illustrates some common interconnection schemes used in systolic arrays. (b) Figure 2.2.1. -—l .— 1 sI ‘ I . .. i—q .4 t. f’ .__I:_. RI r fl i—- h— *r* l 1* Different interconnections for systolic arrays: (a) One-dimensional systolic array. (b) Degenerate two-dimensional systolic array, (c) Two-dimensional square systolic array, (d) Two-dimensional hexagonal systolic array. A systolic array structure used for matrix triangulation is shown in Figure 2.2.2 [2]. Any augmented matrix {Alb} can be triangulated into an upper triangular form {u:g}. where A has been permuted ; pgiggi to minimum band form. The matrix A is of bandwidth 8 defined as max1.1 {li-jl: ‘ij A 0]. In Figure 2.1 PE's shown in circles. separated by dark lines (latches). are made out of two types. PE's labeled as MAC (Multiply and Add Cell) perform the multiplication and addition of three incoming operands. w-xy+z. and transfer x-x and y'y. PE's labeled as DC (divide cell). perform the division of two incoming operands. g-e/f. and transfer f-f. Elements of [Aih] are "pumped” into the array through input lines. labeled as 11-11 and ID- Blg-ontg of {g:g) are pumped out through output lines. Ol-Q‘ and OD- Lgtchgg, controlled by a two-phase non-overlapping clock. provide data synchronization and regular data circulation. Isolated MAC's at the bottom of the structure operate upon the b vector elements. which subsequently will be resolved into 9 vector elements. In general. 2B+2 input and 8+1 output ports are required. Each port is n bits wide. where n is the width of the coefficient matrix elements. It can be shown that B(B+l) MAC cells and 8 DC cells or O(B') PE's are required [2]. Also 2N+2B. where N is the arbitrary matrix dimension. or O(N) (assuming E<< N ) time steps are required to complete the triangulation. This-is an improvement. by a factor of O(N). over the time required for the same Operation on a serial or attached array processing computer. Thble 2.2.1 illustrates the timing steps for the operation of the array of Figure 2.2.2. 10 e f 0 ‘ DC DC DC .J.._. 9 w 02 MAC MAC MAC x 0 MAC MAC MAC .2_.l ‘ MAC MAC MAC 0 3...... l 1' I1 I2 I3 I4 I5 I6 0D l'“"""""“"“""”"‘"‘. i ' D section I :1 : MAC MAC MAC . 3 i l‘ I I- ........................ t---| ID Figure 2.2.2. The structure of a computing array for a system of arbitrary dimension and B=3. ll Table 2.2.1. The Port-Coefficient time table of the array of Figure 2.l for B=3. I1 12 13 I4 15 16 17 ID 01 02 03 04 00 t1 a11 a21 a31 a41 t2 1 t3 a12 a22 a32 a42 a52 b1 t4 ‘ u11 t5 a13 a23 a33 a43 a53 a63 b2 u12 t6 1 ”22 u13 t7 a14 a24 a34 a44 a54 a64 a74 b3 “23 u14 t8 ‘ u33 u24 d1 t9 a25 a3s a45 a55 a65 a75 a85 b4 “34 “25 t1o ‘ ”44 “35 d2 t11 t12 1 t2N+6 dN 12 2.3 Fault-Tolerance Fault-tolerant computing is the correct execution of a specified algorithm in the presence of defects [19.27.28]. 
Defects can be categorized as design defects or fabrication defects. Dedicated computing structures. such as systolic arrays. are less likely to suffer from design defects. This is due to the fact that the circuitry involved in such designing consists of highly modular and relatively simple functional subcircuits or cells. Functional correctness of the subcircuit design is verified through simulation at all of the implementation levels (register transfer level. gate level. transistor level. and mask level). Therefore. the defects for all practical purposes may be redefined as flaws introduced during various phases of the fabrication process (diffusion. ion implantation. oxide growth.etc.) or faults introduced later when the system is being used (noise. transistor burn-out. etc.). Because of the irreversible nature of the fabrication steps. the end result will determine the degree of functionality of the whole computing structure. Structural complexity and the limited number of Input/Output (IIO) pins allow only for external testing [46]. Even in the presence of a minor flaw. such testing will indicate a complete system failure and thus reduce overall chip yield. One way to overcome fabrication defects is to devise flawless fabrication techniques. However. before such techniques are develOped. fault-tolerant design is essential to improve yield and system reliability. 13 2.3.1 Fault Classes Undesired events in a computing environment can be categorized in a sequence of cause-effect relationships [29] as follow: Physical ( Component Failure ) Logical ( Faults ) Computational ( Errors ) External ( Crash ) Chip failure caused by physical defects must be avoided by improved fabrication techniques. However. physical defects resulting in faults (logical faults) [57.58] can be masked by incorporating faultrtolerance into the system design. In order to be able to understand fault classes and to devise methods for fault recovery. one must model different types of predictable faults. Furthermore. due to the technology dependency and causal complexity of the faults. modeling fault classes for all possible causes and technologies is a difficult task. Therefore one must concentrate on modeling all possible and predictable faults for a given technology. MOS (metal oxide semiconductor) technology is commonly used in today's chip 14 manufacturing. In MOS technolOgy a large number of physical failures introduced during the fabrication phases can be covered by two classes of defects [30]. 1. Defects that cause the transistor to be open permanently. This class of faults is caused by missing or open contacts on the source. gate. and drain of the transistor; a break in the metal. polysilicon or diffusion line that connects the terminal node of the transistor; or a threshold voltage shift which makes the device permanently open. 2. Defects that cause the transistor to conduct permanently (permanently closed). This class of fault is caused by channel punch-through which connects source and drain of the transistor; the oxide breakdown resulting in the transistor gate being shorted to drain (n-type) or source (p-type); or shifts in the threshold voltage causing the transistor to conduct pemmanently. Major defects such as mask misalignment or dust particles on the surface of the wafer will damage a large area of the chip and might result in a complete chip failure. A statistical study of fault occurrence with respect to the defect size (diameter) is given in [51.64]. 
Defects that will change the circuit configuration (bridging) or do not result in the node being permanently stuck at some logic value (noise) can also be encountered. The above categorization of fault classes is a rather specialized one. However 15 it will set a good base for fault modeling in a VLSI environment and can be applied to more general classes of faults which might be caused by similar physical flaws or defects in other fabrication technologies. 2.3.2 The Concept of Fault-Tolerance Complex systems have a high probability of containing or develOping faults before or during Operation despite extensive application of fault prevention [27.54.55.56]. Fault prevention has two aspects. The first aspect of fault prevention is fault avoidance. These techniques are used to avoid introducing faults into the system by improving design methodologies and quality control. The second aspect is fault removal. At the design level. these techniques are used to find and remove faults by testing and validation [31-36]. Based on definitions given in [27]. three major constituents of fault tolerance can be identified as follows: 1. Error detection In order to tolerate a fault. its effects must be first detected [37.38]. For complex systems. such as VLSI computing structures. a fault might not be directly detected because of the limited number of IIO pins. However. any fault will generate errors somewhere in the system. Thus. the usual starting point for fault tolerance is the detection of an erroneous state. 16 2 . Damage as se ssment When an error is detected. much more of the system state may be suspect than that initially discovered to be erroneous. Other errors masked by one error might exist in the system. Thus before any attempt is made to deal with a detected error. it may be necessary to assess the extent to which the system state has been damaged. Damage assessment must be done locally. Any additional error or damage detecting hardware must be totally self checking (TSC) so that faulty checkers do not introduce additional or undetectable errors to an already faulty system [31.33.53]. 3. Error recovery Following error detection and damage assessment. techniques for error recovery must be utilized. These techniques. such as dynamic system self-reconfiguration. will aim to transform the current erroneous system into a well defined and error-free state from which normal system operation with. ideally. no system degradation can continue [18.20.39.52]. 'ithout such a transformation the system will fail to operate correctly. It is important to note that a fault-tolerant system design must provide tolerance to a range of predicted component faults. Unanticipated effects of implementation faults may not be modeled into a fault-tolerant system. In any case. the latter undesirably 17 increases the complexity of such a system. Such problems are best addressed by the fault prevention aspects of the design in order to narrow unanticipated fault classes into predictable fault models. 18 2.3.3 Fault-Tolerance in VLSI The rapid pragress in VLSI technologies and the ability to package millions of devices on a single chip. raise the probability of design and fabrication faults. The shrinking dimensions of the circuits. seen by line widths dropping from 10 microns in the early 1970's to 1 micron in today's protptypes. makes the circuits susceptible to transient physical disturbances.such as Alpha particle impacts. that were previously insignificant. The growing logic complexity of a single circuit package. 
which was 1000 gates in the early 1970's and close to 1 million today. makes the prefabrication and postfabrication detection of all errors less probable. One basic approach to VLSI fault-tolerant design is "Design Diversity". In such an approach. multiple designs are carried out independently starting from a verifiable. formal specification of the functions to be performed. The specification needs to be expressed in a high level language that allows formal verification and minimizes the probability of biasing the independent designs toward the same errors. All of the multiple designs can contain errors. however during concurrent operation the errors will be detected if their symptoms are not identical and errors can be corrected. if a majority of good results can be found [40]. Hardware design diversity can only be achieved through component redundancy. VLSI implementation has led to cost breakthroughs that bring the hardware cost of multiply redundant systems within the range of many applications. Redundant 19 fault-tolerant VLSI architectures in which subsystems have been added topologically. are used extensively in VLSI memory chips [41.42.50]. In such architectures. design diversity is not desirable. It would add enormously to the chip real-estate and increases I/O complexity and therefore degrades chip performance. Most architectures that benefit from a high degree of modularity and cell simplicity can not be built cost-effectively using a high level of design diversity. Chip real-estate in VLSI implementation is expensive and can not be sacrificed in order to achieve design diversity (e.g. processor arrays replicated using diverse design etc.). It is important to note that including fault tolerance in a given design would certainly add to chip real-estate consumption. However. the additional hardware. if kept within a reasonable range. if capable of covering predicted faults. will result in an error-free and reliable design. Another group of fault-tolerant techniques. especially used for tolerance to transient errors. can be divided into error masking and concurrent error detection. Some of techniques associated with error masking are Triple Modular Redundancy and Quadded Logic. These techniques require at least a factor of 2 or 3 of additional hardware redundancy to tolerate single module failures [43]. The concurrent error detection schemes. which are used to signal errors but not mask them. include: Totally Self Checking (TSC) circuits. Alternating Logic (AL). and REcomputing with Shifted Operands (RESO). According to [43]. TSC techniques require between 73 and 94 percent hardware redundancy for single bit and unidirectional errors. AL has the worst 20 figures with 85 percent hardware and 100 percent time redundancy. RESO requires almost 100 percent time redundancy for recomputation of results to detect errors. Note that the hardware redundancy is defined as the ratio of the hardware required by the fault-tolerant technique to the hardware of the original system without fault-tolerance. Likewise the time redundancy ratio is defined as the ratio of the time overhead required by the fault-tolerant technique to the time complexity of the original system without fault-tolerance. 2.3.4 Faul t-Tolerant Systolic Arrays As covered in Section 2-2. systolic arrays are dedicated computing structures that satisfy the Mead and Conway design philosophy and therefore can be impleeented in VLSI. However the limits on IIO pins [46]. the high computational speed. 
and the lack of on-chip data mass-storage. make the structure defenseless against errors. Faults only become apparent when the whole structure fails to output the correct results. Because of the amount of data involved. it is very costly to verify the correctness of the computed output. In fact. verifications of such caliber cause questions to be asked on the whole purpose of needing computational peripherals. Therefore. future design of such arrays must take fault-tolerance into account. As discussed earlier. fault-tolerance capability will increase the chip real-estate consumption. It will require component redundancy and additional checking and control hardware. However. 21 before fault prevention techniques are improved. such a tradeoff is essential for an optimum and reliable implementation. In conclusion. designing systolic arrays with fault-tolerance and self-reconfigurability properties. which use the inherent redundancy of the array architecture to overcome errors caused by predictable permanent or transient faults. is highly desirable. Introduction of additional hardware must not significantly reduce the system performance or implementation feasibility. Furthermore. fault detection must be local (i.e. at the PE level). This requires all additional elements to be totally self checking. Comparison techniques. which might indicate a nonexisting fault due to round off errors or functional transitions in neighboring processing elements. must be avoided; and finally. additional control and/or interconnection lines must be kept local. The fault model must be general. In other words. the error detection and correction algorithm must be capable of detecting and correcting errors caused by permanent. intermittent and/or transient faults. Furthermore. it has been shown that a general fault model in which the origin or the frequency of fault-occurrences are assumed unknown. is most suitable for fault modeling in a VLSI environment. In arithmetic array processors implemented on a VLSI chip. these faults affect a small part of the chip. most likely a processing element. The duration of these faults is sometimes unknown (ex. transient faults caused by power supply fluctuations or Alpha particle impact). Transient faults will not damage the hardware and their occurrences are only detected 22 by checking the output state of the processing elements. This implies that the error checking and recovery algorithms must be concurrent. CHAPTER 3 An Error Detection and Correction Algorithm Fer Fault-Tblerant Linear One-Dimensional Systolic Arrays 3.1 ‘Dne Error Detection Algorithm Figure 3.1.1 illustrates the structure of an array of processing elements connected in a pipelined fashion. All PE's perform the same operation on incoming Operands. The Operands are loaded from the left and processed by each PE. Ideally. a linear pipeline with k stages will process n Operands in Tk-k+(n-1) clock periods. The space-time table Of the pipeline Of Figure 3.1.1. for arbitrarily chosen k?‘ and n-5. is shown in Table 3.1.1. 23 24 a CLOCK H y INPUT OPERANDS ->-é— OUTPUT OPERANDS FE 1—6—> INPUT REGISTER OUTPUT REGISTER Figure 3.1.1. Basic structure of a linear pipeline processor. 25 Table 3.l.l. The space-time table of the normal linear one-dimensional pipeline processor of Figure 3.l.l. 
Input/Time   PE.1  PE.2  PE.3  PE.4  PE.5  PE.6
t0           A     0     0     0     0     0
t1           B     A1    0     0     0     0
t2           C     B1    A2    0     0     0
t3           D     C1    B2    A3    0     0
t4           E     D1    C2    B3    A4    0
t5           X     E1    D2    C3    B4    A5

Output/Time  PE.1  PE.2  PE.3  PE.4  PE.5  PE.6
t0           0     0     0     0     0     0
t1           A1    0     0     0     0     0
t2           B1    A2    0     0     0     0
t3           C1    B2    A3    0     0     0
t4           D1    C2    B3    A4    0     0
t5           E1    D2    C3    B4    A5    0
t6           X     E2    D3    C4    B5    A6
t7           X     X     E3    D4    C5    B6
t8           X     X     X     E4    D5    C6
t9           X     X     X     X     E5    D6
t10          X     X     X     X     X     E6

Figure 3.1.2 illustrates the modified array. PE's are designed to process or bypass the input operands. The bypass mode is used for the concurrent error detection and recovery algorithm that will be discussed in the remainder of this section. Latches are present but not shown in Figure 3.1.2, and comparators are added between adjacent PE's. The operands are loaded from the left and each PE performs the same operation on incoming operands. Every PE contains an input data register (IDR), an arithmetic unit (AU) and an output data register (ODR). The input operands are loaded into the input data register, processed by the arithmetic unit and stored in the output data register. During the first clock period the operand A is loaded from the left into the PE labeled PE.1 and results in the processed output labeled A1. During the second clock period A1 is fed into PE.2; in the normal pipeline operation illustrated in Table 3.1.1 a new operand would be loaded into PE.1, but in this scheme A1 is fed back into PE.1 and processed simultaneously by both PE.1 and PE.2. Since all PE's are unifunctional (i.e., they perform the same operation on their input operands), the outputs of PE.1 and PE.2 operating on the same input must be equal. The outputs of PE.1 and PE.2 are compared by the comparator C1. If the outputs of PE.1 and PE.2 are equal, the comparator output is "0" and the normal operation of the pipeline continues with each new operand loaded into PE.1 twice. However, if the outputs of PE.1 and PE.2 are not equal, C1 will flag an error by setting its output to "1".

[Figure 3.1.2. The structure of the fault-tolerant one-dimensional linear pipeline processor: PE.1 (TSC) through PE.4 plus the redundant RPE.5, comparators C1-C3 between adjacent PE's, a two-phase clock, and a parity bit (PI/PO) carried alongside the operands.]

Figure 3.1.3 shows the basic PE of the fault-tolerant pipeline. Each PE.i receives the input operands Ii and the input parity bit PI, and produces the output operands Oi and the output parity bit PO; EF is the error flag from the associated comparator.

Begin
  if EF = 0 then
      IDR <-- Ii
      AU  <-- IDR
      ODR <-- AU
      PO  <-- PI'     (complement of PI)
      Oi  <-- ODR
  elseif EF = 1 then
      Oi  <-- IDR     (bypass the stored operand to the next stage)
      AU  <-- IDR
      ODR <-- AU
      PO  <-- PI
      Oi  <-- ODR
  end if
end;

Figure 3.1.3. Basic PE for the fault-tolerant pipeline processor of Figure 3.1.2.

Figure 3.1.4 shows the redundant PE (RPE.5). It receives the input operands IR and the input parity bit PI, and produces the output operands OR and the data-valid bit DV.

Begin
  if EF = 0 then
      if PI = 0 then
          IDR <-- IR
          ODR <-- IDR     (bypass, no computation)
          DV  <-- PI
          OR  <-- ODR
      elseif PI = 1 then
          AU  <-- IDR
          ODR <-- AU
          DV  <-- PI'
          OR  <-- ODR
      end if
  elseif EF = 1 then
      AU  <-- IDR
      ODR <-- AU
      DV  <-- PI'
      OR  <-- ODR
  end if
end;

Figure 3.1.4. Basic redundant PE for the fault-tolerant processor of Figure 3.1.2.

At this point, assume PE.1 is fault-free. The case when PE.1 could be faulty will be discussed later in this chapter.
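The register transfer behavior of Figure 3.1.3 can also be expressed as a short behavioral model. The sketch below is an illustration of one possible reading of that figure, not the dissertation's implementation; the class name, the use of a Python callable for the arithmetic unit, and the exact parity convention are assumptions made for the example.

    class BasicPE:
        """Behavioral sketch of the basic PE of Figure 3.1.3 (illustrative only)."""

        def __init__(self, op):
            self.op = op        # the single arithmetic function every PE performs
            self.idr = None     # input data register (IDR)
            self.odr = None     # output data register (ODR)

        def clock(self, operand_in, parity_in, error_flag):
            """One clock period; returns (bypass_out, result_out, parity_out)."""
            if not error_flag:
                # EF = 0: latch the new operand, compute, complement the parity bit.
                self.idr = operand_in                    # IDR <- Ii
                self.odr = self.op(self.idr)             # AU <- IDR, ODR <- AU
                return None, self.odr, 1 - parity_in     # Oi <- ODR, PO <- PI'
            # EF = 1: hold the previously latched operand, forward it on the
            # bypass path so the next stage can recompute it too, recompute it
            # locally, and pass the parity bit through unchanged.
            bypass = self.idr                            # Oi <- IDR (bypass path)
            self.odr = self.op(self.idr)                 # AU <- IDR, ODR <- AU
            return bypass, self.odr, parity_in           # Oi <- ODR, PO <- PI

    # Example of use with an arbitrary unifunctional operation:
    pe = BasicPE(op=lambda x: x + 1)
    _, a1, p1 = pe.clock(operand_in=5, parity_in=1, error_flag=False)   # a1 = 6, p1 = 0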
A mismatch in the outputs A2 of PE.1 and PE.2 will indicate that an error, caused by some unknown fault in PE.2 (assuming PE.1 is fault-free), has occurred. Note that the fault model assumes an unknown origin for faults and errors, which can be caused by permanent, intermittent or transient faults, and is thus suitable for fault modeling in a VLSI environment. The only assumption is that faults will cause the output of the PE's to be erroneous.

An error flag by C1 will cause PE.2 to maintain its present input operand for an additional clock period and bypass this operand to PE.3. During the third clock period a new operand B is loaded into PE.1, and A1, stored in the input register of PE.2 and bypassed to PE.3 at the end of the second clock period, is recomputed. In case of an error detection, the operands are delayed by one PE along the array (i.e., A2 is computed by PE.2 and PE.3 instead of PE.1 and PE.2). In order to recover an error in PE.k (i.e., the last PE in the array), an additional PE is needed; this is discussed in the next section.

The comparators of Figure 3.1.2 are relatively simple and modular. Various designs of totally self-checking comparators have been proposed in the literature [44,45]. However, fault-free comparators are assumed here. In fact, in the normal pipeline operation a false error indication ("1") at the output of a comparator will only cause its corresponding PE to bypass and recompute its incoming operands for one clock cycle, and therefore will not introduce any error in the data flow of the pipeline.

For algorithms which require a bidirectional computational flow, a symmetrical row of comparators and interconnects can be added to the top of the pipeline structure. The error detection and bypassing technique can be used in the same manner as described previously. The comparator row on the bottom of the structure compares the computation results of the PE's with the data flowing from left to right, and the comparator row on top of the structure compares the computation results of the PE's with the data flowing from right to left. In case an error is detected in either direction, the corresponding PE's are set to bypass their input operands and recompute for one clock period.

For the one-directional pipeline we assumed the first PE to be error-free. The first PE, labeled PE.1, can be designed to be Totally Self Checking (TSC). Various designs of such PE's are proposed in the literature [36,46]. The TSC PE.1 can be checked at the start-up of the execution. This might determine the presence of a permanent fault in PE.1. In case PE.1 is permanently faulty, it will be bypassed for the entire duration of the pipeline operation. For bidirectional data flow, both the rightmost and the leftmost PE's are designed to be TSC. This only increases the overhead hardware requirements by the extra TSC hardware required for these two PE's. In VLSI implementations a large number of PE's (k > 1000) are built on a single chip, so the extra hardware required for the leftmost and rightmost PE's of the array is minimal compared to the additional hardware required for the error detection and correction algorithm.

In the one-directional pipeline operation, another approach would be to load the first input operand sets (i.e., A, B, C, etc.) into both PE.1 and PE.2 at the same time and compare the results. In order to maintain the correct number of operations on each operand, an additional PE is required. Also, the comparison timing along the pipeline must be modified accordingly.

Tables 3.1.2 through 3.1.9 illustrate different cases of fault detection, recovery and output result validity for an array of k = 4 PE's processing n = 5 operands.
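Before turning to those tables, the detection-and-recovery sequence just described can be written out as a small illustrative sketch. It is not the dissertation's implementation and does not reproduce the full space-time schedules of Tables 3.1.2 through 3.1.9; the operation, the operand value, and the single-bit error model are assumptions made for the example.

    # One detection/recovery event from the scheme above, written out explicitly.
    op = lambda x: (3 * x + 1) & 0xFFFF    # any unifunctional PE operation (assumed)

    # Cycle t: PE.i recomputes A1 on its feedback path while PE.(i+1) computes the
    # same A1 arriving on the forward path; comparator C.i checks the two results.
    a1 = op(12345)                  # A1, produced by PE.i on the previous cycle
    out_i  = op(a1)                 # PE.i, fed back                  -> A2
    out_i1 = op(a1) ^ 0x0001        # PE.(i+1), hit by a transient fault
    error_flag = out_i != out_i1    # comparator C.i sets its output to "1"

    # Cycle t+1: PE.(i+1) holds A1, bypasses it to PE.(i+2), and both PE's
    # recompute it, so A2 is produced (and re-checked) one stage later than in
    # the fault-free schedule -- the one-PE delay visible in the tables below.
    if error_flag:
        a2_from_i1 = op(a1)         # PE.(i+1): the transient fault has passed
        a2_from_i2 = op(a1)         # PE.(i+2): receives A1 on the bypass path
        assert a2_from_i1 == a2_from_i2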
Thbles 3.1.2 through 3.1u9 illustrate different cases of fault detection. recovery and output result validity for an array of k-4 PE processing n-5 Operands. 33 Table 3.1.2. The space-time table of the fault-tolerant pipeline error-free case- PE.2 PE.3 PE.4 PE.1 Input/Time 3434343434 OOOAABBCCDDEE 2323232323 OOAABBCCDDEEX 11211212112112 0AABBCCDDEEXX AABBCCDDEEXXX 2 1 .T. 0 0 I 2 3 4 PO 6 7 8 9 1| .7. t t t t t t t t t t 1 1 t PE.2 PE.3 PE.4 PE.5 PE.1 Output/time SET TO BYPASS 4545454545 OOOAABBCCDDEE 3434343434 OOAABBCCDDEEX 2323232323 0AABBCCDDEEXX 121212112112 AABBCCDDEEXXX 34 Table 3.1.3. The space-time table of the fault-tolerant pipeline with a transient fault in PE.2 during the second clock cycle. Output/Time PE.l PE.2 PE.3 PE.4 PE.5 t1 A1 0 o o 0 t2 ifafftw.) _____ ‘imput‘: 0 t3 B1 in? ..... 64 0 t4 32 32 A3 A3 0 t5 c1 B3 B3 A4 x t6 c2 c2 B4 B4 A4 t7 01 c3 c3 35 34 t8 02 02 c4 c4 x t9 E1 D3 D3 c5 c4 tlo £2 £2 04 04 x t11 x E3 E3 05 D4 t12 x x E4 E4 x t13 x x x E5 54 t14 x x x x x . f : Compare results ----_--‘ 35 Table 3.1.4. The space-time table of the fault-tolerant pipeline with a transient fault in either PE.3 or PE.4 during the 6th clock cycle. Output/Time PE.1 PE.2 PE.3 PE.4 PE.5 t1 A1 0 O 0 0 t2 A2 A2 0 0 0 t3 B1 A3 A3 0 0 t4 B2 B2 A4 A4 0 t5 C1 B3 B3 A5 A4 t6 (:2 C2 [8. - erreanj X r compute t7 01 c3 c3 B4 A4»; t8 D2 02 C4 C4 X t9 E1 03 D3 C5 C4 th E2 E2 04 D4 X tll X E3 E3 05 D4 t12 X X E4 E4 X t13 X X X E5 E4 t14 X X X X X 36 Table 3.1.5. The space-time table of the fault-tolerant pipeline with a multiple transient fault in either PE.2, PE.3 or PE.4 during the 6th clock cycle. Output/Time PE.l PE.2 PE.3 PE.4 PE.5 t1 A1 0 o o 0 t2 A2 A2 0 o 0 t3 B1 A3 A3 0 0 t4 32 32 A4 A4 0 t5 c1 B3 B3 A5 A4 ts {C2225 .152”: :.B's'_'f‘:‘i°"'31‘: X ., D] :c;:::::'sé.s B4\~.W t8 D2 D2 C3 C3 X recovered t9 E1 D3 03 c4 x tio E2 E2 ”4 D4 C4 tH x E3 E3 05 D4 t12 x x E4 E4 x t13 x x x 55 E4 t14 x x x x x X: Don't care 1 n .: Compare results 37 Table 3.1.6. The space-time table for the fault-tolerant pipeline with a permanent fault in PE.2. Output/Time PE.1 PE.2 PE.3 PE.4 PE 5 t1 A1 0 o o 0 t2 in? ..... Ag: 0 o o .3 B] lg:::::‘5a'i 0 0 t4 Eh;"""@fi' A2 A2 0 t5 C1 55---- 35: A3 t6 E£---‘-E:E ...... 5; B2 A4 t7 “1 592----fai 33 X ta {02:13 .2 .2 B4 .9 :1 __9;,::::::°:23 cs x Ho 62--.“.f21 02 02 c4 tn " 5252 “3 " t12 x x 52 52 D4 t13 x x x E3 x t14 x x x x E4 X: Don't care - I - - . C ' ~. .- _3: All comparisons on PE.2 indicate error 38 Table 3.l.7. The space-time table of the fault-tolerant pipeline with a permanent fault in PE.4. Output/Time PE.l PE.2 PE.3 PE.4 PE.5 t1 A1 0 o o 0 £2 A2 A2 0 o 0 t3 B1 A3 A3 0 0 t4 32 32 lag ..... n5} 0 t5 c1 B3 B3 A4 A4 '6 '2 '2 ['21. 1'... 1 1'9} " t7 01 c3 c3 34 B4 r -------- 1 t8 ”2 ”2 :C4 C4: X t9 El 03 03 c4 C4 tm 52 E2 gfig'm'Bgi x tll X E3 E3 “4 D4 H2 x x fi----fé x t13 x x x E4 E4 t14 x x x x x X: Don't care 1 : All comparisons on PE.4 indicate error 39 Table 3.l.8 The space-time table for the parity check and output validity with a transient fault in PE.2 during the second clock cycle. Parity out/Time PE.1 PE.2 PE.3 PE.4 PE.5 OUTPUT VALID? 1:.l l 0 0 O 0 YES t2 0 0 0 0 0 YES t3 1 0 0 0 0 YES t 0 0 l l 0 YES t5 1 l l 0 1 N0 t6 0 0 O O 0 YES t7 1 l l l 0 YES t8 0 O 0 0 1 NO t9 1 l l l 0 YES t10 0 0 0 0 1 N0 t]] X l l l 0 YES t12 X X 0 0 1 NO t13 X X X l 0 YES t X X X X 1 N0 40 Table 3.1.9. The space-time table for the parity check and output validity with a permanent fault in PE.4. Parity out/Time PE.1 PE.2 PE.3 PE.4 PE.5 OUTPUT VALID? 
t1     1   0   0   0   0     YES
t2     0   0   0   0   0     YES
t3     1   1   1   0   0     YES
t4     0   0   0   0   0     YES
t5     1   1   1   0   0     YES
t6     0   0   0   0   1     NO
t7     1   1   1   0   0     YES
t8     0   0   0   0   1     NO
t9     1   1   1   0   0     YES
t10    0   0   0   0   1     NO
t11    X   1   1   0   0     YES
t12    X   X   0   0   1     NO
t13    X   X   X   0   0     YES
t14    X   X   X   X   1     NO

3.2 Error Recovery

If an error is detected in the ith PE (i > 1), PEi is set to recompute its input operands for the next clock cycle. During this cycle the input operands of PEi previous to the erroneous computation are transferred to the input of PEi+1. During the following clock cycle the error is recovered by PEi and PEi+1, which recompute the bypassed data from the previous step. This implies that for the design of a k-stage fault-tolerant pipeline capable of detecting and correcting errors in any of its k stages, k+1 PE's are needed. The additional PE bypasses its input operands unless there is an error in the kth stage or a permanent fault in any of the PE's along the pipeline structure. The register transfer level algorithm reproduced in Figure 3.1.3 demonstrates the error recovery steps in the ith PE (1 < i < k).

[Figure 3.3.1. The overhead hardware requirement OHR(k,m) for the one-dimensional fault-tolerant array processor.]

[Figure 3.3.2. The efficiency and throughput of the fault-tolerant pipeline of Figure 3.1.2.]

3.3.1 Fault Coverage

The fault-tolerant pipeline structure of Figure 3.1.2 is capable of detecting and correcting all transient faults occurring during the entire operation of the pipeline, and up to R permanent faults, where R is the number of redundant PE's added to a k-stage pipeline structure. In general, for a (k+R)-stage pipeline, up to (k-1)*T_fft transient faults, where T_fft = k+1+2n-1, can be detected and corrected. Multiple transient faults occurring in the k previous stages of the pipeline can be detected and corrected in only two clock cycles (i.e., one cycle for detection and one cycle for recomputation). In a VLSI environment a large number of PE's can be implemented on a single chip (k > 1000) and a very large number of operands can be executed by arithmetic pipeline processors. Therefore it can be seen that for such an application, the degree of transient fault coverage offered by this concurrent technique is tremendous. As an application example, in the next section the error detection and correction algorithm is applied to a systolic array used in Discrete Fourier Transform (DFT) computations.

3.4 Linear Systolic Array for DFT

An n-point Discrete Fourier Transform (DFT) problem [6] is defined as follows: given (a_0, a_1, ..., a_{n-1}), compute (Y_0, Y_1, ..., Y_{n-1}) defined as

    Y_i = sum_{j=0}^{n-1} a_j w^{ij}                                   (3.4.1)

where w is an n-th root of unity. The n-point DFT problem can be viewed as that of evaluating the polynomial

    a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + ... + a_1 x + a_0              (3.4.2)

at x = 1, w, w^2, ..., w^{n-1}. Equation (3.4.2) can be rewritten as

    (...((a_{n-1} x + a_{n-2}) x + a_{n-3}) x + ... + a_1) x + a_0     (3.4.3)

The recurrence formula can be written as follows. Let

    y_i^{(0)} = a_{n-1}                         (i = 0 to n-1)         (3.4.4)

    y_i^{(k)} = y_i^{(k-1)} * w^i + a_{n-1-k}   (k = 1 to n-1)         (3.4.5)

    Y_i = y_i^{(n-1)}                                                  (3.4.6)
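The recurrence (3.4.4)-(3.4.6) is Horner's rule applied at each evaluation point w^i. The short sketch below is an illustration only, not part of the dissertation's design; the particular choice w = e^(2*pi*j/n) and the direct-summation cross-check are assumptions added for the example (the text only requires w to be an n-th root of unity).

    import cmath

    def dft_by_recurrence(a):
        """Evaluate the n-point DFT of (a_0, ..., a_{n-1}) with the recurrence
        (3.4.4)-(3.4.6): Horner evaluation of the polynomial at x = w^i."""
        n = len(a)
        w = cmath.exp(2j * cmath.pi / n)       # an n-th root of unity (assumed convention)
        Y = []
        for i in range(n):
            x = w ** i
            y = a[n - 1]                       # y_i^(0) = a_{n-1}
            for k in range(1, n):
                y = y * x + a[n - 1 - k]       # y_i^(k) = y_i^(k-1) * x + a_{n-1-k}
            Y.append(y)                        # Y_i = y_i^(n-1)
        return Y

    # Cross-check against the direct summation of Equation (3.4.1).
    a = [1, 2, 3, 4]
    direct = [sum(aj * cmath.exp(2j * cmath.pi * i * j / len(a))
                  for j, aj in enumerate(a)) for i in range(len(a))]
    assert all(abs(p - q) < 1e-9 for p, q in zip(dft_by_recurrence(a), direct))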
A full errey consist of n-l besic PB. Figure 3.4 .2 illustretes the structure of the lineer systolic errey for DFT computetion. According to Bquetion (3.4.1). the velues for 11'. for the cese of ad ere es follow: 3 You-2 .j'Oj. no + el + e2 + e3 (3.4.7) 1'0 3 11.2 gjmj' no + '1' + gzwz + m3w3 (3.4.8) .i'0 3 12.2 gjwzj- mo + glqz + ‘2'4 + g3'6 (3.4.9) 1'0 3 Y3!- 21-31.31. so + elw3 + £2w6 + e3} (3.4.10) in in Figure 3.4.l. 53 out out' in I ..< Yout:- inxin + a Basic PE for linear systolic array used in DFT computation 54 I -——- ——H 1——o- .. —. L———-.- 0 1 l an-2 a"-3 a0 12 . 1——.° -—- —’ O Wn-l s we 1 '—'. I] an-l’ ’ an-l’ a'n-1 “’12 Figure 3.4.2. The structure of the linear systolic array for DFT computation 55 3.4.1 Linesr Feult-Tblerent Systolic Arrey for DFT In this section the error detection end correction elgorithm described eerlier will be epplied to the lineer systolic errey for DFT computetion. Thble 3.4.1 illustretes the spsce-time teble of the normel errey. Figure 3.4.3 illustretes the structure of the feult-tolerent errey for n94 with sdded comperetor elements. This errey consists of n-l-3 besic PE's. Bech PE conteins en input end en output register for temporery dete storege during eech clock period. Esch PE is losded with two coefficients. The coefficient s: end s1. e1 end so. end so ere stored in PE.1 through PE.3 respectively. In other words. PE.1 conteins the coefficients ‘n-2 end sn.3 while PE.2 conteins 'n-3 end sn_z end so on. As illustreted in teble 3.4.2. in the error-free cese. during the first clock.cycle PE.1 multiplies its 0 end edds the result to ‘n-Z end stores the result inputs ‘n-l end w in its output register. During the second clock cycle this result is fed beck into PE.1 through the Yin input (refer to Fig. 3.4.1). It 18 multiplied it by xin input end the result is sdded to sn,3 coefficient. During this clock cycle PE.2 multiplies the output of PE.1 by its xin input ('0.1) end edds it to its 'nr3 coefficient. Since the sums operetion hes been performed by both PE.1 end PE.2 during the second clock. cycle on the seme input operends. PE.1 end PE.2 will result in the seme output which will be competed by the comperetor CI for mismetch. During the third clock cycle s new set of operends ere losded into PE.1; PE.2 operetes on its input operends using ‘n—4 ceifficient end PE.3 performs the seme operetion using its 56 ( #0612? I ‘PE.l PE.2 PE.3 RPE vahd __. —«- 0 l l a a a a 2 l 0 0 I2 1"" a] a0 0 o F" 02 I l g l | ..... ---- -' L--] R 4M u x j -—J—7 F—j’ 7f ‘3 ------ —-~ c c Input-fiberands f] f2 1:--Two--Phase __3: CLOCK R: Reload Input Operands Basic structure for the ith PE: Input Parity __d_ Output Parity PI i PO Xin ”‘T' * T'"’ Xi PEi out i . 1 Yin —- (1%],4) —-- Yout Error Flag EF *: The control algorithm is given in Section 3.2. Figure 3.4.3. DFT computation,n=4. The structure of the fault-tolerant systolic array for 57 tn_4 ceifficient end the computed results of PE.2 sud PE.3 ere compsred for mismetch. 3.4.2 Error Recovery In cese en error is detected in either P81 or PEi+l during clock 01010 t.. the input operends prior to the erroneous computetion sre bypeseed to P8i+l end PB1+2 end recomputed during the clock cycle t Thbles 3.4.3 through 3.4.6 illustrste different chses of e+l° feult-occurrence in the errey. The output velidity is checked by the edditionel perity bit sssocisted with eech PB end is described in Section 3.1. 58 Table 3.4.]. The space-time table of the normal systolic array for DFT computation Input/Time t] Xin: Y. 
: in t 2 Xin Yin: t 3 Xin Yin: t 4 Xin Yin: t . 5 Xin' Yin: t6 Xin: Yin: Output/Time t4 Yout: t5 Yout: t5 Yout: t7 Yout: PE.2 PE.3 a3+a2*a1+ao _ 3 2 Yl- a3w +a2w +a1w+a0 6 4 2 a3w +a2w +a1w +a0 9 6 3 a3w +a2w +a1w +a0 PE.3 2 a3w +a2w +a] w3 6 3 a3w +a2w +a1 59 Table 3.4.2. The space-time table of the fault-tolerant systolic array of Figure 3.4.3 for DFT computation. Error-free case Input/Time PE.l PE.2 PE.3 PE.4 t1 Xin: l 0 0 O in: a3 0 0 0 t2 l l 0 0 a3+a2 a3+a2+a1 0 0 t3 w l l 0 a3 a3w+a2 a3+a2+a1 0 t4 w w l l a3w+a2 a3w+a2 a3+az+31+a0 a3+a2+31+a0 t w2 w w l 5 a a w2+a w+a a w2+a w+a X 3 3 2 l 3 2 1 t6 w2 w2 w w 2 2 3 2 3 2 a3w +a2 a3w +a2 a3w++a2w a3w +a2w +a1w+ao aIWTao t W3 w2 w2 w 7 a a w4+a w2+a a w4+a w2+a X 3 3 2 l 3 2 l t8 W3 W3 w2 w2 a3w3+a2 a3w3+a2 a3w6+a2w4+a1w2 a3w6+a2w4+a1w2 + + a0 30 t9 X w3 W3 w2 6 3 6 3 a3w +a2w +a1 a3w +a2w +a1 X 60 Table 3.4.3. The space-time table of the fault-tolerant systolic array of Figure 3.4.3 for the DFT computation. Error-free case Output/Time PE.1 PE.2 PE.3 PE.4 t1 a3+a2 0 0 0 r- """ “l ....... t2 ifafazfit1-133631: 0 0 t3 a3w+a2 .agrazfa1+aq:a§fa2+a1+a0, 0 r--2-----r-- Pcul-o'I ...... _- t4 Laid :aZrI-+a1-._a3-vlja2w+a]; X .21 a3+a2+a1+a0-Y0 2 .‘"3'"2"'""3"" . . t5 a3w +a2 .a3w +a2w E a3w +a2w . x : + . + : Lf‘lTQ- -:- ftw’fa- ..1 "‘E"‘2‘“:”‘4‘"2"l 3 2 = t6 Ea3w +a2w +a 3 a3w +a2w +a15 X a3w +a2w +a1w+a0 Y] ----3-.. 1"bm'"1'7"-'3'-‘$ t7 a3w +a2 :a3w +a2w : a3w +a2w'| X ' + l + 3 : .aJ'izf ‘10- ...E...“‘. 333*qu .—--------1-------—1 6 4 2 . + + a =Y ts lag“: fa".f31-:-a.3~3‘3+32.w.3-a.u X .3. .2. a‘" * ° 2 P“- " ""‘I""' "‘1 t9 X Ea3w9+a3w6 : a3w9+a2w6: X . .. : + : i 3 ' 3 : :_"i‘1'.’ 1‘10. .. E 311330"; 9 6 3 = t10 X X X a3w +a2w +a2w +a0 Y3 X : Don't care ; : g : compare output results L----J----J 61 Table 3.4.4. The space-time table of the fault-tolerant systolic array of Figure 3.4.3 for DFT computation with a transient fault in either PE.l or PE.2 during the second clock cycle. Output/Time PE.l t1 L t2 t3 M t4 Ml t5 N t N1 t7 0 t8 01 t9 X t10 X X = Don't care L = a3+a2 L1= a3+a2+a1 L2= a3+a2+a1+a0 M = a3w+a2 M1= a3w2+a2w+a1 M2 a3w3+a2w2+a1w+a0 _ 2 4 2 1 a3w +a2w +3] 6 4 2 63W +62%! +a1w +a0 I El-qp¥-_- PE.2 PE.3 PE.4 ¢ ¢ ¢ E15 9’ 9’ {ii ......... ffi ¢ L - - _ resgmule- _ J M1 L2 L1 M2 M2 L2 N1 M3 M2 N2 N2 x 01 N3 N2 02 02 x X 03 O2 0 = a3w3+a2 - 6 3 0]— a3w +a2w +a1 _ 9 6 3 02- a3w +a2w +a1w +a0 M3, N3, 03 NOT VALID 62 Table 3.4.5. The space-time table of the fault—tolerant systolic array of Figure 3.4.3 for DFT computation with a transient fault in either PE.2 or PE.3 during the 7th clock cycle. Output/Time PE.l PE.2 PE.3 PE.4 t] L ¢ ¢ (6 t2 L1 L1 p 95 t3 M L2 L2 6 t4 M1 M1 L3 L2 t5 N M2 M2 X t6 N1 N1 M3 M2 ‘7 ° [’12. - 3'19". - .."3...‘ x is 01 01 [4.232236% N2 t9 X 02 02 X t10 X X 03 02 X : Don't care L, L1, L2 M, M1, M2 N, N1, N2 refer to Table 3.4.4. 0, 01, 02 L3, M3, 03 NOT VALID 63 Table 3.4.6. The space-time table of the fault-tolerant systolic array of Figure 3.4.3 for DFT computation with a permanent fault in PE.2 Output/Time PE.l PE.2 PE.3 PE.4 t] L (6 ¢ ¢ t2 1E1: Ii‘ii‘iilfi 4 9’ 1:3 M {Ll féii‘fiflte L15 ‘3 .......... ,dr--------u. t4 vM] -error Ml; L1 recompute L1 ts N FM} 1923335“ 7h“; L2 t6 :rlJ:::.er.ro.r.-t:l§f :: ...... M .1 recompute M.l t7 0 {"1 252%? "11' M2. t8 :0; ';‘e?-F6r‘:"oj --------- N .1 recompute N1 .9 pissing? 2;: ~2 t10 X X ........ dquecompute 01 t1] X X X 02 X : Don't care L, L1, L2 ”1’”2 N N refer to Table 3.4.4. 1’N2 . 
CHAPTER 4

A Concurrent Error Detection and Correction Algorithm For Fault-Tolerant Two-Dimensional Array Processors

Figure 4.1 illustrates an array processor with N = k^2 identical unifunctional PE's interconnected by a mesh network without boundary wraparound connections. The PE's of each row are connected in a pipeline fashion. Data are loaded serially into the leftmost PE of each row and from the top of the structure into each PE. Both the horizontal and the vertical data flowing through each PE produce a new computed output at each PE. As long as the vertical data flow pattern is not altered or delayed, each row can be treated separately because of the parallelism of the design.

Figure 4.1. The structure of an array processor connected in a mesh without the boundary wraparound connections (PE 1.1 through PE 4.4, with the vertical input operands entering from the top).

Therefore the error detection and correction algorithm can be applied simultaneously to each row of the array processor.

4.1 The Modification of The Error Detection Algorithm

Since new sets of data are loaded into each PE both horizontally and vertically, the error detection scheme of Chapter 3 cannot be applied directly to the two-dimensional structure. Modifications must be introduced to take into account the variation of the computed output of each PE during the same clock cycle. In other words, the outputs of PE_i and PE_{i+1}, operating on the same horizontal operands, are different because their vertical input operands are different.

Consider the first row of the two-dimensional structure with 4 PE's; the case k=4 is used for presentation simplicity. Let the row input (horizontal) data set be A_1,...,A_k. Similarly, B_1,...,B_k, C_1,...,C_k, D_1,...,D_k and E_1,...,E_k are the vertical operands loaded into PE 1.1 through PE 1.4, respectively. The space-time table for k=4 is illustrated in Table 4.1.1. Consider the multiplication operation for presentation simplicity. During the first clock cycle A_1 and B_1 are loaded into PE 1.1 and result in A_1B_1 and B_1; B_1 is transferred down to PE 2.1 and A_1B_1 is loaded into the horizontal input of PE 1.2. During the second clock cycle A_1B_1 is reloaded into the horizontal input of PE 1.1. The inputs A_2 and B_2 are therefore delayed for one clock cycle.

Figure 4.1.1. The structure of the first row of the array of Figure 4.1 with added switching and comparator elements: switch cells S1 through S4 for the vertical input operands, PE 1.1 through PE 1.4 plus the redundant PE 1.5, comparators C1 through C3 between adjacent PE's, a multiplexer and reload control R for the horizontal input operands, and a two-phase clock.
Figure 4.1.2. Basic PE structure for the fault-tolerant array of Figure 4.1.1 (i ≠ 1, 5). Inputs: the vertical input operands, the input parity bit PI, the vertical transfer bit VTB and vertical output transfer VOT from PE 1.(i-1), and the horizontal input operands. Outputs: the vertical output operands, the output parity bit PO, VTB and VOT to PE 1.(i+1), the horizontal output operands, and the error flag EF. Registers: HIR (horizontal input register), VIR1 and VIR2 (vertical input registers 1 and 2), HOR (horizontal output register) and VOR (vertical output register); AU denotes the arithmetic unit.

Table 4.1.1. The space-time table of the fault-tolerant array of Figure 4.1.1: the input operands of PE 1.1 through PE 1.4 during clock cycles t_1 through t_11, followed by their outputs. Each partial product A_1B_1, A_1B_1C_1, ..., A_4B_4C_4D_4E_4 is produced by two adjacent PE's in the same cycle and compared (X denotes don't-care).

Instead, the vertical input C_1 is loaded into the vertical input of PE 1.1. This requires switching elements in order to load either the B's or the C's into PE 1.1 (Table 4.1.1 shows that these switching elements can be controlled by a two-phase clock). Since A_1B_1 and C_1 are loaded into PE 1.1, during the second clock cycle the output of PE 1.1 will be A_1B_1C_1. The output of PE 1.2 will also be A_1B_1C_1. These two outputs can be compared at the end of the second clock period, and mismatch errors can be detected. Note that regardless of the arithmetic operation performed by each PE the error detection remains valid, since any operation by adjacent unifunctional PE's on the same input operands must result in the same outputs.

During the third clock cycle, A_2 and B_2 are loaded into PE 1.1, A_1B_1C_1 is loaded into PE 1.2 and PE 1.3, and D_1 is loaded, via the switch element, into PE 1.2. Both PE 1.2 and PE 1.3 produce outputs labeled A_1B_1C_1D_1, and these two outputs are compared for a possible mismatch. Figure 4.1.1 illustrates the structure of the first row of the array processor of Figure 4.1 with additional switching and comparator elements. Table 4.1.1 shows the space-time table of the structure of Figure 4.1.1.

In general, the algorithm uses the inherent redundancy of the pipeline structure by reloading the first horizontal operands back into the leftmost PE and, during each clock cycle, comparing the computed outputs of adjacent PE's along the pipeline that operate on the same input operands. From Table 4.1.1 it is noted that the ith vertical input operands are switched every other clock cycle into the adjacent (i-1)th PE. It is also noticed that, because of the vertical flow of the operands, only one row of switching elements at the top of the structure is needed. Similarly, the same steps may be applied to the other rows of the array.
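The row-level invariant that the comparators of Figure 4.1.1 check can be exercised in software. The following Python sketch is illustrative only and is not part of the dissertation: the reload timing, switch cells and parity bits are collapsed, the operation is taken to be multiplication as in the text, and the operand values are arbitrary.

def process_row(A, verticals, compare=None):
    """A: horizontal inputs A_1..A_k; verticals: the column operand lists
    (B, C, D, E) of the same length.  Each partial product is evaluated
    twice, standing in for PE_i and PE_{i+1} operating on the same pair of
    operands one cycle apart, and the two results are compared."""
    outputs = []
    for i, a in enumerate(A):
        h = a
        for stage, column in enumerate(verticals):
            v1 = h * column[i]              # PE at this stage
            v2 = h * column[i]              # adjacent PE, same operands
            if compare is not None and v1 != v2:
                compare(i, stage, v1, v2)   # comparator flags a mismatch
            h = v1
        outputs.append(h)
    return outputs

A = [1.0, 2.0, 3.0, 4.0]
B = [5.0, 6.0, 7.0, 8.0]; C = [1.5, 2.5, 3.5, 4.5]
D = [2.0, 2.0, 2.0, 2.0]; E = [0.5, 0.5, 0.5, 0.5]
print(process_row(A, [B, C, D, E]))         # A_iB_iC_iD_iE_i, as in Table 4.1.1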
4.2 Error Recovery

In the error-free array operation, the horizontal operands are loaded into the horizontal input register of each PE. Similarly, the vertical operands are loaded into the vertical input registers and, depending on the operation (i.e. processing or bypassing vertical operands), the processed results are stored in the horizontal and vertical output registers. Every PE contains two registers, labeled VIR1 and VIR2, for the vertical input data. The vertical operands are first loaded into VIR2 and then moved to VIR1; the arithmetic unit uses the contents of VIR1 for arithmetic operations.

If an error is detected in PE_i during clock cycle t_e, the contents of VIR1 of PE_i is saved; in other words, the contents of VIR2 is not moved into VIR1. At the end of t_e, the contents of VIR2 is transferred to the next row in the array. This guarantees the consistency of the vertical data flow in the array structure. At the same time, the contents of VIR1 of PE_i is transferred to VIR1 of PE_{i+1}. The error is recovered by PE_i and PE_{i+1} during clock cycle t_{e+1}. Because of the occurrence of an error in PE_i during t_e, for horizontal data consistency PE_{i+1} transfers the contents of its VIR2 to VIR1 of PE_{i+2}.

Table 4.2.1. The space-time table of the fault-tolerant array of Figure 4.1.1 with a transient fault in PE 1.2 during the second clock cycle; boxed entries mark the recompute-and-compare step (X denotes don't-care).

Table 4.2.2. The space-time table of the fault-tolerant array of Figure 4.1.1 with a transient fault in PE 1.4 during the 5th clock cycle (X denotes don't-care).

Table 4.2.3. The space-time table of the fault-tolerant array of Figure 4.1.1 with a multiple transient fault in PE 1.1, PE 1.2, PE 1.3 and PE 1.4 during the 7th clock cycle (X denotes don't-care).

Table 4.2.4. The space-time table of the fault-tolerant array of Figure 4.1.1 with a permanent fault in PE 1.2; all comparisons involving PE 1.2 indicate error (X denotes don't-care).
Table 4.2.5. The space-time table of the first row of the array of Figure 4.1.1 illustrating the register transfer operations of the constituent PE's in the error-free case.

Table 4.2.6. The space-time table of the array of Figure 4.1.1 illustrating the register transfer operations of the constituent PE's with a transient fault in PE 1.2.

This horizontal data transfer among adjacent PE's propagates through each row of PE's in the array and guarantees the correct execution of the error detection and correction algorithm. New vertical operands are processed or bypassed to the next row of PE's in the array. This guarantees the correct transfer of vertical operands to all k-i rows of the array if an error is recovered by the ith row. The register transfer level control algorithm for the ith PE (i ≠ 1) can be written as follows:

Begin
  if EF = 0 then
    if VTB = 0 then
      HIR  <-- HI
      VIR2 <-- VI
      VIR1 <-- VIR2
      AU   <-- HIR, VIR1
      PO   <-- PI
      HOR  <-- AU
      VOR  <-- VIR1
    elseif VTB = 1 then
      HIR  <-- HI
      VIR2 <-- VI
      VIR1 <-- VOT
      AU   <-- HIR, VIR1
      PO   <-- PI
      HOR  <-- AU
      VOR, VOT <-- VIR1
    end if
  elseif EF = 1 then
    HIR  <-- HIR
    VIR2 <-- VI
    VIR1 <-- VIR1
    VTB  <-- 1
    AU   <-- HIR, VIR1
    PO   <-- PI
    HOR  <-- AU
    VOR  <-- VIR1
  end if
end;

The space-time table of the first row of the array of Figure 4.1.1 for this error detection technique is illustrated in Table 4.1.1. Tables 4.2.2 through 4.2.4 illustrate the space-time tables of the fault-tolerant design recovering from different cases of faults. For the two-dimensional array, each row can be treated as a separate pipeline as long as the vertical data consistency is guaranteed by the error detection algorithm. Since the algorithm is based on comparing sets of input and output operands, it can be extended to all structural configurations, including diagonal data flow in a two-dimensional array. In that case, PE's along the diagonal connections can be treated as pipelines, the diagonal output results of adjacent PE's can be compared for mismatch, and error recovery can be achieved by adding redundancy along the diagonal PE's and executed in a similar manner as before.
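The register transfer listing of Section 4.2 can be exercised in software. The following Python sketch is illustrative only and is not part of the dissertation: register and signal names follow the basic PE of Figure 4.1.2, the arithmetic operation is taken to be multiplication, and treating VOT as the operand handed over by the neighbouring PE (and resetting VTB once it has been consumed) are assumptions not spelled out in the listing.

class RowPE:
    def __init__(self):
        self.HIR = self.VIR1 = self.VIR2 = 0.0   # input registers
        self.HOR = self.VOR = self.VOT = 0.0     # output registers / transfer value
        self.PO = 0                              # output parity bit
        self.VTB = 0                             # vertical transfer bit

    def clock(self, HI, VI, PI, EF, VOT_in=0.0):
        """One clock period of the ith PE (i != 1)."""
        if EF == 0:
            self.HIR, self.VIR2 = HI, VI
            if self.VTB == 0:
                self.VIR1 = self.VIR2
                self.VOR = self.VIR1
            else:                                # VTB == 1: pick up the saved operand
                self.VIR1 = VOT_in
                self.VOR = self.VOT = self.VIR1
                self.VTB = 0                     # assumed reset; not shown in the listing
        else:                                    # EF == 1: hold HIR and VIR1, recompute later
            self.VIR2 = VI                       # vertical flow stays consistent
            self.VTB = 1
            self.VOR = self.VIR1
        au = self.HIR * self.VIR1                # AU <-- HIR, VIR1
        self.PO = PI                             # PO <-- PI
        self.HOR = au                            # HOR <-- AU
        return self.HOR, self.VOR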
4.3 Overhead Hardware and Timing of The Two-Dimensional Design

The two-dimensional fault-tolerant array processor uses (k+1)^2 PE's to recover from faults in k^2 stages. Also (k-1)^2 comparators and k switch elements are required. For the array of Figure 4.1.1, if the area required by each comparator, switching element and their corresponding interconnection lines is (1/m)th the area of a single PE (A_PE), the overhead hardware requirement as a function of k and m can be formulated as:

OHH(k,m) = [(k+1)^2·A_PE + ((k-1)^2·A_PE)/m + (k·A_PE)/m] / (k^2·A_PE)
         = [(k+1)^2 + ((k-1)^2 + k)/m] / k^2.          (4.3.1)

A plot of OHH(k,m) is given in Figure 4.3.1. Due to the parallelism of the design, the overhead timing is determined by the overhead timing of a single row of the array, which is the same as that of the one-dimensional pipeline structure described in Chapter 3. As an application example, in the next section the two-dimensional error detection and correction algorithm is applied to a VLSI structure used for the multiplication of two dense matrices.

Figure 4.3.1. The overhead hardware requirement OHH(k,m) for the two-dimensional fault-tolerant array processor.

4.4 A Fault-Tolerant VLSI Pipelined Array for Matrix Multiplication

In this section the two-dimensional error detection and correction algorithm described in the previous sections is applied to a VLSI pipelined array used for the multiplication of two dense matrices [48]. Figure 4.4.1 illustrates the structure of this array for the multiplication of two 2x2 dense matrices. The elements of matrices A and B are fed from the lower and upper input lines in a pipelined fashion, one skewed row or one skewed column at a time; some dummy zero inputs are interspaced with the matrix elements.

A x B = |a11 a12| |b11 b12| = |c11 c12|          (4.4.1)
        |a21 a22| |b21 b22|   |c21 c22|

where the product coefficients are

c_ij = Σ_k a_ik·b_kj   for all i and j.          (4.4.2)

In general, multiplying two nxn matrices requires n(2n-1) Multiply and Add Cells (MAC). Figure 4.4.2 illustrates the structure of the VLSI array used for the multiplication of two 3x3 dense matrices. Tables 4.4.1 and 4.4.2 show the space-time computation steps required for the arrays of Figures 4.4.1 and 4.4.2 respectively.

Figure 4.4.1. A VLSI array for pipelined multiplication of two 2x2 dense matrices.

Table 4.4.1. The space-time table for matrix multiplication using the array of Figure 4.4.1. The skewed inputs a11, a12, a21, a22 on I1/I2 and b11, b21, b12, b22 on I1'/I2' enter during t1 through t4, interspaced with zeros, and the products c11 = a11b11+a12b21, c12 = a11b12+a12b22, c21 = a21b11+a22b21 and c22 = a21b12+a22b22 emerge on the output lines O1 through O6 during t3 through t5.

Figure 4.4.2. A VLSI array for pipelined multiplication of two 3x3 dense matrices.

Table 4.4.2. The space-time table for matrix multiplication using the array of Figure 4.4.2. The skewed elements of A on I1-I3 and of B on I1'-I3' enter during t1 through t6, and the coefficients c11 through c33 emerge on the output lines O1-O5 during t6 through t10.

4.4.1 The Fault-Tolerant VLSI Array for Matrix Multiplication

Figure 4.4.3 illustrates the structure of a fault-tolerant VLSI array for the multiplication of two 2x2 dense matrices. Table 4.4.3 illustrates the space-time input and output pattern for the structure of Figure 4.4.3.
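The Multiply and Add Cell quoted with Figure 4.4.3 computes d = c + a·b and passes a and b through unchanged; chaining that cell over k reproduces Equation (4.4.2) in software. The following Python sketch is illustrative only and is not part of the dissertation: the skewed data flow and the dummy zero inputs of the pipelined array are omitted.

def mac(c, a, b):
    return c + a * b                      # MAC cell: d = c + a.b, a and b pass through

def matmul_via_mac(A, B):
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A)
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0.0
            for k in range(m):            # one MAC per term of c_ij = sum_k a_ik*b_kj
                acc = mac(acc, A[i][k], B[k][j])
            C[i][j] = acc
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_via_mac(A, B))               # [[19.0, 22.0], [43.0, 50.0]]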
According to the error detection and correction algorithm described in the previous sections, the elements a12, a22, b21 and b22 of the matrices A and B are loaded via switch elements S and S', placed at the bottom and the top of the structure, into the inputs I1 and I1'. The computed outputs of the PE's (in this case MAC's) of the first column are reloaded into these PE's, and the subsequent results of adjacent PE's in the same row are compared for mismatch. In case of error, recovery is achieved according to the algorithm by bypassing the correct operands prior to the erroneous computation to the subsequent PE's along each row, and the results are recomputed during the following clock cycles. The data consistency, the correct timing and the final output validity are guaranteed by the algorithm. Tables 4.4.4 through 4.4.7 illustrate different cases of fault occurrence and error recovery for the structure of Figure 4.4.3.

An architecture for a fault-tolerant VLSI array used for the multiplication of two 3x3 dense matrices is shown in Figure 4.4.4. The error detection and recovery is similar to that of the array of Figure 4.4.3.

Figure 4.4.3. The fault-tolerant VLSI array for pipelined multiplication of two 2x2 dense matrices. Redundant PE's and a bottom switch cell S are added to the basic array; each MAC cell receives a, b and c together with an input parity bit and a recompute control, produces d = c + a·b and an output parity bit, and passes a and b through unchanged (a = a, b = b).

Table 4.4.3. The space-time table of the fault-tolerant VLSI array of Figure 4.4.3. Inputs (I1, I2, I1', I2') per cycle: t1: a11, 0, b11, 0; t2: a12, a12, b21, b21; t3: a21, 0, b12, 0; t4: a22, a22, b22, b22; t5: 0, 0, 0, 0. Boxed output pairs of adjacent MAC's are compared for mismatch; X denotes don't-care.

Table 4.4.4. The space-time table of the fault-tolerant VLSI array of Figure 4.4.3 with a transient fault in MAC.5 during the third clock cycle; the erroneous result is recomputed during the following clock cycle and outputs O7-O9 form the output stage. Notation: X = don't care; R1 = a11b11, R2 = a11b12, R3 = a21b11, R4 = a21b12; C11 = a11b11+a12b21, C12 = a11b12+a12b22, C21 = a21b11+a22b21, C22 = a21b12+a22b22.
o : o o o ' 3 2 3-4-‘3. - .3 -93"?! -2111] 3 .---. 1 t4 C12 R4 c21 C12 L 11; C213 Rl '6 : R3 ; r--r Tel-7 _x. .recgmputext." C . t5 0 [3ng ° erfior I__ g2; . 12 21 . t 0 0 " 6""b' 'T “'F 0 l x ff'" X ' 5 , 3.3-3. .. - 199933.99? 32-23 : t7 0 o o o o o g o x o . -. .......... J OUTPUT STAGE X : Don't care R13 R23 R33 R4 } refer to Table 4.4.4. 113 C123 C213 C22 C 93 Table 4.4.7. The space-time table of the fault-tolerant VLSI array of Figure 4.4.3 with a multiple transient fault in MAC.4, MAC.6 during the 4th clock cycle. Output/Time 01 02 03 04 0 | I 5 I 8 9 I | I t1 0 o o o o o : o o o g | I t2 0 R1 0 o o o . o o o . I 1 t3 R2 Cn R3 0 cn o : o o o , r--. r-fi "“' r'-'l ' t4 :Cizg R4 {9%}: [FT : ° Lgfjju Rl C11 R3 . -~~- ---= -=:1: ....... n u t5 0 [C22 0/7 9-132.-.ng :9 .:-L'E X 323-1: t6 0 1' o ’,/6 o x 70 g )5 , C22 x I t7 0 i /’ o o o I o,” : o x o 3 T o v3 error rec°mpute STAGE X : Don‘t care R],R2, R3, R 113 C123 4 refer to Table 4.4.4. c22 C C 21’ 94 Top Switch Cell MATRIX B (III? 3":11 Fir“. + ,I’Out I 'I | ' I- ; LI: : : S] : . $2 : 11-9 ¥‘E)<-912 - -- I. - J I. - J I1 I II -‘"""‘""g I :I *0Ut=0 I I I I->out=l Ol 06 3 011 ’i“ 0r1 1 2 0 . ' MAC L... ' ‘ \ l6 *" . I f I I I: TI 0 ’U‘ o / \ 0 MAC ‘2. MAC I432 17 I"’ v at ’ . Ii , T’f I ‘vk \ r3 0 MAC ‘ , ‘8 1::- I']’ L . I ‘ o I'vL 0 l4,’ MAC ‘. r4 I»! x ‘9 if: ‘r- .1 , *3 3" (”I ,t-.-.’ 1.1 I I ‘ 0 V I 0 l5 MAC I ”_rS Redundant 20,5-.. PPS a j“' l ‘ ‘ ....... -J r.I_I_' . I b a I ‘ I I l :\\l I L: If} c d arit . f33m P i" Y pgztty om - MATRIX A recompute Switch Cell b a d=c+a.b a=a b=b Figure 4.4.4. The VLSI fault-tolerant array for pipelined multiplication of two 3x3 dense matrices. CHAPTER 5 Functional Transitions In Adjacent PE's In Chapters 3 and 4. a concurrent error detection and correction algorith- for parallel-pipeline arith-etic array processors was described. The inherent processing elenent (PB) redundancy of the array architecture was used to devise the error detection and correction algorithn. This ache-e is based on reloading the operand sets into the left-oat PE of the pipeline structure. By causing a data retardation the output results of adjacent PE's along the pipeline Operating on the sale input operands can be conpared and the occurrence of faults in those PE's can be detected. Error recovery is 95 96 achived by bypassing the correct operands prior to the erroneous computation to the subsequent PE's for recomputation. A.major problem arises when adjacent PE's perform different arithmetic operation on the sane input operand sets. This case often used in architectures such as systolic arrays for matrix analysis routines will disrupt such error detection and recovery algorithms. In this chapter. the fault-tolerant array structure is modified to overcome problems arising from functional transitions in adjacent PE's. The error detection and recovery is achieved without degradation in a similar manner as described earlier. Finally. the overhead hardware and timing requirement is determined. 5.1 Functional Transitions In VLSI Structures lost systolic array architectures used in applications such as matrix analysis routines use groups of unifunctional arithmetic PB's defined as IAC (Inltiply Add Cell) and DC (Divide Cell) to achieve fast computational capabilities. Depending on the requirements of the implemented algorithms. IAC's and DC's are arranged in groups and interconnected locally. 
The number of each type of cell per group is determined by the number of computations required by a specific algorithm on each input operand. Because of the large amount of data processed by high-speed systolic arrays and the near-future VLSI implementations of these structures, it is essential to design fault-tolerant architectures. The inherent PE redundancy of systolic arrays can be used to devise fault-tolerant techniques based on concurrent error detection and recovery algorithms. It is highly desirable that such techniques not severely degrade the array throughput in the error-free case. Furthermore, if an error is detected, the structure must be able to recover using minimal overhead hardware and timing.

A concurrent error detection and correction algorithm based on reloading the first input operand set into the leftmost PE of the pipeline, thereby causing data retardation in the pipeline, and then comparing the output results of adjacent PE's for mismatch errors, was described in Chapters 3 and 4. In this scheme error recovery is achieved by bypassing the correct input operands prior to the erroneous computation to the subsequent PE's in the pipelined array for recomputation. A major problem arises when adjacent PE's in a pipeline perform different operations on the same input operand: a nonexistent mismatch error is detected when the output results of adjacent PE's are compared. This results in an undesired reconfiguration of the input data and therefore disrupts the execution of the error detection algorithm. Furthermore, this inconsistency recurs each time the output results of the PE's in question are compared. As described earlier, systolic arrays use groups of functionally different PE's: MAC's and DC's are arranged in groups to implement the computation requirements of a given algorithm. Since local communication is essential among PE's, the output of the last MAC in a group is directly connected to the input of the first DC. This results in an inconsistency in the output results of these two PE's when operating on the same input operands.

5.2 Transitional PE's

In order to overcome this problem, which is caused by the functional transition between two adjacent PE's, the following architectural modification to a one-dimensional linear pipeline structure permits the correct execution of the error detection and correction algorithm. The one-dimensional pipeline is considered first for presentation simplicity. Consider the linear one-dimensional pipeline of Figure 5.2.1. PE3,m and PE4,d (i.e. PEi,j, where i is the number of processes and j the function performed) are defined as transitional PE's. Multiplication and division operations are chosen for presentation purposes; however, any different arithmetic operations would work as well. In general, any time the data stream is subject to a functional transition by adjacent PE's, those PE's are defined as transitional PE's. As illustrated in Table 5.2.1, it is clear that the error detection algorithm fails at PE3,m and PE4,d.

Let PE3,m and PE4,d be multifunctional; in other words, both PE's can be selected to perform either multiplication or division on their incoming operands. In the error-free pipeline operation, PE3,m multiplies during one clock cycle and divides during the next clock cycle, while PE4,d performs a division operation during the entire duration of the pipeline and is used for error recovery, as discussed later.
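A small software illustration of the problem and of the transitional-PE remedy follows; it is not part of the dissertation. The cell functions used here (multiply by x, divide by x) stand in for whatever the MAC and DC groups actually compute, and the operand values are arbitrary.

def mac_fn(y, x):                 # function of the multiplier group
    return y * x

def dc_fn(y, x):                  # function of the divider group
    return y / x

y, x = 6.0, 2.0
# Unmodified pipeline: the last MAC and the first DC see the same operand
# but compute different things, so a comparator between them reports a
# mismatch although neither PE is faulty (Table 5.2.1).
print(mac_fn(y, x) != dc_fn(y, x))          # True: a false "error"

# Transitional PE: alternate the function every other clock cycle, as
# PE3,m does in Table 5.2.2, so each comparison is made against a
# neighbour that performed the same function on the same operand.
def transitional_pe(y, x, cycle):
    return mac_fn(y, x) if cycle % 2 == 0 else dc_fn(y, x)

for cycle, y in enumerate([3.0, 3.0, 5.0, 5.0]):
    twin = mac_fn(y, x) if cycle % 2 == 0 else dc_fn(y, x)   # PE2,m or PE4,d
    assert transitional_pe(y, x, cycle) == twin              # comparison now valid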
Table 5.2.1. The space-time table for the unmodified array pipeline, noting the functional transition and the output inconsistency in PE3,m and PE4,d: the outputs of PE3,m and PE4,d on the same operands differ, so the comparison of their results indicates a nonexistent error.

Figure 5.2.1. The structure of the fault-tolerant one-dimensional array processor with one functional transition along the pipeline: PE1,m, PE2,m, the transitional PE3,m and PE4,d, and the redundant RPE5,d, with comparator cells between adjacent PE's (output 0 when the two inputs agree, 1 on a mismatch), a 3-to-1 multiplexer with reload control R for the input operands, a two-phase clock, the function select FS, and the plane of functional transition between PE3,m and PE4,d.

Figure 5.2.2. Basic structure of the transitional PE3,m: an input data register (IDR), two arithmetic units AU1 (used for multiplication) and AU2 (used for division), an output data register (ODR), input and output parity bits PI and PO, an error flag EF and a function select FS. Its control algorithm is reproduced in Section 5.3.

Since the input operand stream is retarded for one clock cycle by the algorithm, through reloading the first input operand into the leftmost PE, the operand A operated on by PE2,m during the third clock cycle results in the output labeled A3,m (i.e. Ai,j: i = number of processes on A, j = function performed on A); see Table 5.2.2. The A3,m outputs of PE2,m and PE3,m can be compared for mismatch errors during the third clock period. During the fourth clock cycle A3,m is loaded into both PE3,m and PE4,d; however, during this cycle PE3,m is selected to perform a division operation on A3,m. Since PE4,d will also perform a division operation on A3,m, the outputs labeled A4,d of PE3,m and PE4,d must be equal and can be compared for mismatch. During the fifth clock cycle PE3,m is selected to perform a multiplication operation. From Table 5.2.2 it can be noted that PE3,m must alternate its function every other clock cycle.

Table 5.2.2. The space-time table for the fault-tolerant pipeline of Figure 5.2.1; the error-free case.

Output/Time  PE1,m   PE2,m   PE3,m/d  PE4,d/m  PE5,d
t1           A1,m
t2           A2,m    A2,m
t3           B1,m    A3,m    A3,m
t4           B2,m    B2,m    A4,d     A4,d
t5           C1,m    B3,m    B3,m     A5,d     A4,d
t6           C2,m    C2,m    B4,d     B4,d     X
t7           D1,m    C3,m    C3,m     B5,d     B4,d
t8           D2,m    D2,m    C4,d     C4,d     X
t9           X       D3,m    D3,m     C5,d     C4,d
t10          X       X       D4,d     D4,d     X
t11          X       X       X        D5,d     D4,d
t12          X       X       X        X        X

PE3,m and PE4,d are transitional PE's (i.e. they can be selected to perform either multiplication or division operations). X: don't care.

5.3 Error Recovery

In case of error, the correct input operands are loaded into the subsequent PE's during the next clock cycle and recomputed. If an error is detected in PE's which are two PE's away from the transitional PE's, the error is recovered as before. However, if an error is detected one PE away from the transitional PE, the transitional PE must be set to maintain its actual function for an additional clock cycle so that the error can be recovered. Table 5.3.1 illustrates the space-time diagram of the fault-tolerant pipeline of Figure 5.2.1 with a transient error in PE2,m during the second clock cycle.
During the third clock cycle A2,m is recomputed by both PE2,m and PE3,m and the results are compared for mismatch. Since this error has delayed the data computation timing by one clock cycle, during the fourth clock cycle PE3,m is selected to maintain its previous function (multiplication), and for comparison consistency PE4,d is selected to perform a multiplication operation. Both transitional PE's are reset to resume normal operation at the beginning of the fifth clock cycle. Table 5.3.2 illustrates the operation steps for a transient error in PE2,m or PE3,m during the third clock cycle. Table 5.3.3 illustrates fault recovery for a transient fault in PE3,m or PE4,d during the fourth clock cycle. The control algorithm for the transitional PE3,m, also reproduced in Figure 5.2.2, can be written as follows:

Begin
  if EF = 0 then
    if FS = 0 then
      IDR <-- Ii
      AU1 <-- IDR
      ODR <-- AU1
      PO  <-- PI
      Oi  <-- ODR
    elseif FS = 1 then
      IDR <-- Ii
      AU2 <-- IDR
      ODR <-- AU2
      PO  <-- PI
      Oi  <-- ODR
    end if
  elseif EF = 1 then
    IDR <-- IDR
    AU1 <-- IDR
    ODR <-- AU1
    PO  <-- PI
    Oi  <-- ODR
  end if
end;

This algorithm is applied to PE4,d by switching the functions of the arithmetic units AU1 and AU2 (AU1 performs division, AU2 performs multiplication).

Table 5.3.1. The space-time table for the fault-tolerant pipeline of Figure 5.2.1 with a transient error in PE2,m during the second clock cycle; the erroneous A2,m is recomputed and compared during the following cycle, after which the transitional PE's are reset to normal operation (X = don't care).

Table 5.3.2. The space-time table for the fault-tolerant pipeline of Figure 5.2.1 with a transient fault in either PE2,m or PE3,m during the third clock cycle (X = don't care).

Table 5.3.3. The space-time table for the fault-tolerant pipeline of Figure 5.2.1 with a transient fault in either PE3,m or PE4,d during the fourth clock cycle (X = don't care).

Table 5.3.4. The space-time table for the fault-tolerant pipeline of Figure 5.2.1 with a permanent fault in PE2,m; all comparisons involving PE2,m indicate error (X = don't care).
Table 5.3.5. The space-time table of the fault-tolerant pipeline of Figure 5.2.1 with a permanent fault in PE3,m; all comparisons involving PE3,m indicate error (X = don't care).

A proposed architecture for a pipeline processor with multiple functional transitions is illustrated in Figure 5.3.1. The above algorithm can be applied to this structure without any modifications.

Figure 5.3.1. A proposed architecture for the one-dimensional fault-tolerant pipeline processor with two planes of functional transition along the pipeline.

5.4 Overhead Hardware and Timing Requirements

From Tables 5.2.2 through 5.3.3 it can be seen that, for fault recovery in a k-stage linear one-dimensional pipeline, k+1 PE's are required. As an upper bound for the area requirements, the transitional PE's are counted as two PE's because of their multifunctional design. In a VLSI implementation, however, the area required by one PE containing two arithmetic units and the corresponding switching elements for function selection is less than the area required by two unifunctional PE's. The exact area required by such a PE is very hard to determine and is strongly design dependent; therefore, as an upper bound, an overall of k+3 PE's are required if there is one functional transition among the PE's along the pipeline (i.e. two groups of functionally different PE's, for example a group of MAC's followed by a group of DC's). Also k-1 comparators are needed. If the chip area required to implement a comparator and its respective interconnection lines is (1/m)th the area consumed by a single PE, denoted A_PE, the overhead hardware requirement for the fault-tolerant one-dimensional pipeline can be formulated approximately as follows:

OHH(k,m) = [(k+3)·A_PE + ((k-1)·A_PE)/m] / (k·A_PE).          (5.4.1)

Since the transitional PE's are selected to perform one operation at a time on their input operands (i.e. either multiplication or division), the overhead timing requirement of this modified fault-tolerant pipeline remains the same as the one described in Chapters 3 and 4 and is given in Equation 3.3.3.

5.5 Two-Dimensional Functional Transitions

Once more, due to the parallelism in the design of dedicated computing structures, as long as the data flow consistency (both horizontal and vertical) is guaranteed by the algorithm, each row in a square array architecture, for example, can be treated as a separate one-dimensional pipeline. Since the algorithm described in the previous section does not delay or alter the data flow in the pipeline, the fault-tolerant two-dimensional design of Figure 4.1.1, with transitional PE's replacing regular PE's each time there is a plane of functional transition, can be used without any further modifications. Tables 5.5.1 through 5.5.3 illustrate the error detection and recovery steps for the cases of transient and permanent fault occurrence in one row of a two-dimensional array processor. The upper bound for the overhead hardware requirement can be formulated by counting each transitional PE as two unifunctional PE's. For example,
in a square array having one functional transition interface in each row, Equation 4.3.1 can be written as:

OHH(k,m) = [(k+3)^2 + (k + (k-1)^2)/m] / k^2.          (5.5.1)

The overhead timing requirement remains the same as that of the one-dimensional design and is given by Equation 3.3.3.

Table 5.5.1. The space-time table of the fault-tolerant two-dimensional array processor for the error-free case.

Table 5.5.2. The space-time table of the fault-tolerant two-dimensional array processor with a transient fault in either PE1.2,m or PE1.3,m.

Table 5.5.3. The space-time table of the fault-tolerant two-dimensional array processor with a permanent fault in PE1.2,m.

CHAPTER 6

SUMMARY AND CONCLUSIONS

In this research work, the inherent PE redundancy of the parallel-pipeline array architecture is used to develop a new concurrent error detection and correction algorithm capable of detecting and correcting errors caused by permanent, intermittent and transient faults of unknown origin and frequency of occurrence. These faults are assumed to affect small parts of an arithmetic processor chip, especially the arithmetic units, before or during operation, despite extensive application of fault prevention. The inherent PE redundancy of the parallel-pipeline architectures enables the algorithm to use minimum overhead hardware and timing in order to detect and correct errors. In other words, the erroneous computed results of constituent PE's in the array are recomputed by subsequent PE's in the same pipeline during the following clock periods. Furthermore, the concurrency of the error detection and correction enables the fault-tolerant structure to detect and correct transient errors with very short durations; these errors affect the output results of the PE's and cannot be detected and corrected using nonconcurrent techniques.

With the rapid progress in the area of VLSI, dedicated high-speed computing structures are implemented on a single chip. Such implementations require modular PE designs with local next-neighbor communication paths.
As chip complexity and operational speeds increase, more fault occurrences are expected. Because of the limited I/O pins, the large amount of processed data and the high operational speeds, undetectable errors can propagate throughout the whole computing structure, causing the system to crash. Off-line or start-up testing techniques, while capable of detecting permanent faults, will not detect transient faults. Therefore, locality and concurrency in the error detection and correction techniques are essential in VLSI implementations.

The major contribution of this work has been the ability to efficiently incorporate fault-tolerance into special classes of VLSI computing structures and thus to improve the reliability and yield of such designs, which have become more susceptible to faults as chip complexity and operational speeds have increased. Furthermore, the addition of fault-tolerance to the existing computing structures has not severely degraded the system performance and has increased the system reliability from almost none to a very promising level.

As a bottom-up approach, in Chapter 3 a new error detection and correction algorithm for one-dimensional systolic arrays, based on reloading each input operand set into the pipeline and concurrently comparing the output results of adjacent PE's, was described. The overhead hardware and timing requirements were determined, and it was shown that by adding only one redundant PE in a k-stage pipeline operating on n operands, on the order of k·n transient and intermittent faults with duration less than one computational clock cycle can be detected and corrected. Furthermore, the overhead timing requirements for the algorithm never exceed the timing requirement of the same pipeline implemented using the RESO technique, which is capable of only detecting and not correcting errors, with approximately one third the overhead hardware requirement. As an application example, the algorithm was applied to a one-dimensional systolic array used in Discrete Fourier Transform (DFT) computations.

In Chapter 4, the algorithm was extended and applied to two-dimensional array architectures. It was shown that, as long as the vertical and horizontal data flow are not delayed or altered, each row of the array can be treated as a one-dimensional pipeline. The overhead hardware and timing were determined. As an application example, the algorithm was applied to a VLSI array used in pipelined matrix multiplications.

In Chapter 5, the problems arising from functional transitions among adjacent PE's in VLSI array architectures were approached by extending the error detection and correction algorithm and determining the overhead hardware and timing.

In conclusion, the new error detection and correction algorithm presented in this research enables highly reliable designs of fault-tolerant computing structures which benefit extensively from parallel-pipeline architectures. Furthermore, the algorithm preserves VLSI attributes, such as locality of communication among constituent PE's, modularity of the design and simplicity of the control structure, and it does not require additional pinouts. Due to its general assumptions, this algorithm can be applied to systems which contain processors on a single chip or interconnected mainframe computers.